arXiv:1705.07830v1 [cs.CL] 22 May 2017

Ask the Right Questions: Active Question Reformulation with Reinforcement Learning

Christian Buck, Jannis Bulian, Massimiliano Ciaramita, Andrea Gesmundo, Neil Houlsby, Wojciech Gajewski, Wei Wang
Google

We propose an active question answering agent that learns to reformulate questions and combine evidence to improve question answering. The agent sits between the user and a black-box question-answering system and learns to optimally probe the system with natural language reformulations of the initial question and to aggregate the evidence to return the best possible answer. The system is trained end-to-end to maximize answer quality using policy gradient. We evaluate on SearchQA, a dataset of complex questions extracted from Jeopardy!. Our agent improves F1 by 11% absolute over a state-of-the-art base model that uses the original question/answer pairs.

1 Introduction

Web and social media have become the primary sources for information needs. Users’ expectations and information seeking activities co-evolve with the increasing sophistication of these resources. Beyond navigation, document retrieval, and simple factual question answering, users seek direct answers to complex and compositional questions. Such search sessions may require multiple iterations, critical assessment and synthesis [Marchionini, 2006].

The productivity of natural language yields a myriad of ways to formulate a question. In the face of complex information needs, humans overcome uncertainty by reformulating questions, issuing multiple searches, and aggregating responses. Inspired by the human ability to produce and ask the right questions, we present an agent that learns to carry out this process for the user. The agent sits between the user and a backend QA system that we refer to as the ‘environment’.
The agent aims to maximize the chance of getting the correct answer by reformulating and reissuing a user’s question to the environment. By asking many question variants, and aggregating the returned evidence, the agent comes up with the single best answer. The internals of the environment are not available to the agent, so it must learn to probe a black box optimally using only natural language.

This approach resembles active learning [Settles, 2010]. In active learning, the algorithm chooses which data to send to an environment for labeling, aiming to collect the most valuable labels. Similarly, our agent chooses how to query an environment, aiming to maximize the chance of revealing the correct answer. Since the system actively chooses which questions to ask, we call our approach Active Question Answering (AQA). AQA departs from the usual regime of active learning in the following ways.

1. AQA must choose the optimal input in the space of natural language questions, as opposed to the usual active learning scenario of selecting real-valued feature vectors. The former, being discrete and highly structured, is challenging to optimize.
2. AQA optimizes the input to send to the environment with respect to expected end-to-end performance. Active learning usually optimizes a pre-specified acquisition criterion.¹
3. AQA performs active reformulation conditioned on an original input. Active learning seeks to acquire generally useful labels at training time. Here, we must ensure that the questions sent to, and answers received from, the environment are useful with respect to the original query.

The key component of our solution is text-based sequence-to-sequence reinforcement learning, which optimizes the reformulator’s performance with respect to the QA environment. The second component of AQA combines the evidence gathered from interacting with the environment using a convolutional neural network.
We evaluate on a dataset of complex questions taken from Jeopardy!, the SearchQA dataset [Dunn et al., 2017]. These questions are hard to answer by design because they use obfuscated and convoluted language, e.g., “Travel doesn’t seem to be an issue for this sorcerer & onetime surgeon; astral projection & teleportation are no prob” (answer: Doctor Strange). Thus SearchQA tests the ability of AQA to reformulate questions such that the QA system has the best chance of returning the correct answer. AQA outperforms a deep network built for QA, BiDAF [Seo et al., 2017a], which has produced state-of-the-art results on multiple tasks, by 11% absolute F1, a 32% relative F1 improvement. We conclude by proposing AQA as a general framework for stateful, iterative, and interactive information seeking tasks.

¹ Bayes optimal active learning also tries to optimize end-to-end performance; however, these techniques are often computationally prohibitive in large-scale settings [Roy and McCallum, 2001].

[Figure 1 diagram; labels: AQA, q, Reformulate [Seq2Seq], {q'}, Q&A, {a'}, Aggregate [CNN], a.]
Figure 1: AQA’s agent-environment framework. In the downward pass the agent re-interprets the question and hands it on to the QA system. In the upward pass, responses are evaluated to produce the final answer.

2 Active Question Answering

In this section we detail the components of Active Question Answering (cf. Figure 1).

2.1 The Agent-environment framework

Rather than send a user’s question to a QA system passively, an AQA system actively reformulates the question multiple times, and sends these variants to an environment, in our case the QA system. The environment then returns one or more responses, from which the final answer is selected. The agent must learn to communicate optimally with the environment to maximize the probability of receiving the correct answer.
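The interaction pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `active_qa`, `toy_reformulate` and `toy_environment` are hypothetical names, the environment is assumed to return an (answer, score) pair, and picking the highest-scoring answer stands in for whatever selection rule the agent actually learns.

```python
from typing import Callable, List, Tuple

Reformulator = Callable[[str], List[str]]         # question -> list of rewrites
Environment = Callable[[str], Tuple[str, float]]  # rewrite -> (answer, score)


def active_qa(question: str, reformulate: Reformulator, environment: Environment) -> str:
    """Send several rewrites to a black-box QA system and keep one answer."""
    # The agent only sees strings going in and (answer, score) pairs coming out.
    candidates = [environment(rewrite) for rewrite in reformulate(question)]
    # Placeholder selection rule (an assumption): keep the highest-scoring answer.
    best_answer, _ = max(candidates, key=lambda pair: pair[1])
    return best_answer


# Toy stand-ins, for illustration only.
def toy_reformulate(question: str) -> List[str]:
    return [question, question.lower(), question.rstrip("?") + " name?"]


def toy_environment(rewrite: str) -> Tuple[str, float]:
    table = {"who is the sorcerer surgeon?": ("Doctor Strange", 0.9)}
    return table.get(rewrite, ("unknown", 0.1))


print(active_qa("Who is the sorcerer surgeon?", toy_reformulate, toy_environment))
```

In this toy run only the lower-cased rewrite matches the environment's lookup table, so the loop returns an answer that the original phrasing would have missed.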
The environment can, in principle, include multiple information sources, possibly providing feedback in different modes (images, structured data from knowledge bases, free text, etc.). Here, the information source consists of a pre-trained question answering system that accepts natural language question strings as input and returns strings as answers.

2.2 Active Question Answering Agent

The reformulator is a sequence-to-sequence model, popularized in the task of Neural Machine Translation (NMT) [Sutskever et al., 2014, Bahdanau et al., 2014].² This model’s architecture follows a design using a multi-layer bidirectional LSTM, with attention-based decoding. The major departure from the standard machine translation setting is that our model reformulates utterances in the same language. Unlike in MT, there is little high-quality training data available for monolingual paraphrasing. Training highly-parametrized neural networks effectively relies on an abundance of data; thus, our setting presents an additional challenge. We address this first by exploiting end-to-end signals produced through the interaction with the QA environment, and second by pre-training the model. Pre-training is a common strategy in deep learning, for which we develop a procedure suited to this setting.

² We build on top of the public implementation of Britz et al. [2017] (https://github.com/google/seq2seq/).

In the downward pass the reformulator, the box on the left in Figure 1, maps the original question into one or many alternative questions that will probe the environment for candidate answers. The reformulator is trained end-to-end, using an answer quality objective, for which we use the standard F1 metric. Due to the non-differentiable nature of this sequence-level loss, the model is trained using reinforcement learning (cf. Section 3).

The aggregator, the box on the right of Figure 1, is a Convolutional Neural Network (CNN).
This model’s task is to evaluate the candidate answers returned by the environment and select the one to return. Here, we assume that there is a single best answer, as is the case in our evaluation setting; returning multiple answers is a straightforward extension of the model. The CNN is trained with supervised learning.

Finally, we require an environment to interact with. For this we use a competitive neural question answering model, BiDirectional Attention Flow (BiDAF) [Seo et al., 2017a].³ BiDAF is an extractive QA system. It takes as input a question and a document and returns as answer a contiguous span from the document. The model contains a bidirectional attention mechanism to score document snippets with respect to the question, implemented with multi-layer LSTMs and other components. The environment is opaque to our agent, which has no access to its internals (parameters, activations, gradients, etc.). AQA may only send questions to it and receive answers. This scenario enables us to design a general framework that permits the use of any backend environment. However, it means that feedback on the quality of the question reformulations is noisy and indirect, thus presenting a challenge for training.

³ https://allenai.github.io/bi-att-flow/

3 Training

To train AQA we use a combination of reinforcement and supervised learning. We also present a strategy to overcome the data paucity issue in monolingual paraphrasing.

3.1 QA system

We use BiDAF as the black box QA system [Seo et al., 2017a]. We updated BiDAF to TensorFlow 1.1 and trained a model on the SearchQA data. We use Adam with a learning rate of 0.001 as optimizer, and stop after 4500 steps. In the SearchQA task several snippets returned by Google form the overall context (50 on average). We join snippets to form the context in which BiDAF searches for answer candidate spans. Training on all snippets was slow, so we limited the context to the top 10 ranked Google snippets.
This corresponds to finding the answer on the first page of Google search results. The results are only mildly affected by this: for 10% of the questions there is no answer in this shorter context. These datapoints are all counted as losses.

3.2 Training the Initial Rewrite Model

We pre-train the question reformulation model by building a paraphrasing NMT model, i.e., a model that can translate English to English. Unfortunately, very little parallel monolingual data (segment, paraphrase) is available to train such a model directly. Instead, we first produce a multi-language translation system that can translate between several languages [Johnson et al., 2016]. This allows us to use bilingual corpora as are commonly used for machine translation models. Multi-language training simply requires adding two special tokens to the data indicating the source and target languages⁴ and the use of a common vocabulary. The architecture of the translation model remains unchanged.

As Johnson et al. [2016] show, this model can be used for zero-shot translation, i.e., to translate between language pairs for which it has seen no training examples; thus we can use this model for zero-shot paraphrasing. In the same work the authors note that zero-shot performance is generally worse than bridging, an approach that uses the model twice to first translate into a pivot language and then to the target language, but that this performance gap can be closed by running a few training steps for the desired language pair. Thus, we first train on multilingual data, then on a small corpus of noisy monolingual data.

As multilingual training data we use the United Nations Parallel Corpus v1.0 [Ziemski et al., 2016]. This dataset contains 11.4M sentences which are fully aligned across the six UN languages: Arabic, English, Spanish, French, Russian, and Chinese.
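As a rough illustration of the language-token scheme, the sketch below enumerates the ordered language pairs available from six fully aligned languages (6 × 5 = 30) and prefixes a source sentence with source and target tokens. The token format follows the Spanish-English example in footnote 4; the two-letter language codes, the helper names `directed_pairs` and `tag_source`, and the use of an English-to-English tag pair to request a paraphrase are assumptions made for this sketch.

```python
from itertools import permutations

# Two-letter codes for the six UN languages; the codes themselves are an
# assumption made for this sketch.
UN_LANGUAGES = ["ar", "en", "es", "fr", "ru", "zh"]


def directed_pairs(languages):
    """Return every ordered (source, target) pair of distinct languages."""
    return list(permutations(languages, 2))


def tag_source(source_text, source_lang, target_lang):
    """Prefix the source side with <from_xx> <to_yy> language tokens."""
    return f"<from_{source_lang}> <to_{target_lang}> {source_text}"


pairs = directed_pairs(UN_LANGUAGES)
print(len(pairs))  # 30 ordered pairs from six languages

# A Spanish-to-English training instance.
print(tag_source("¿Dónde está la biblioteca?", "es", "en"))

# Zero-shot paraphrasing request: same language on both sides (assumed usage).
print(tag_source("What year did Super Bowl 50 take place?", "en", "en"))
```

Because the language tokens are ordinary vocabulary items, the translation architecture itself does not need to change; only the data preparation does.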
Using all bilingual pairings we produce a multilingual training corpus of 30 language pairs, yielding 340M training examples in total, which we use to train a zero-shot neural MT system [Johnson et al., 2016]. We tokenize our data using 16k sentence pieces.⁵ Following standard settings [Britz et al., 2017] we use a bidirectional LSTM as encoder and a 4-layer LSTM with attention [Bahdanau et al., 2014] as decoder. The model converges after seeing about 400M instances.⁶

Used for zero-shot English-to-English paraphrasing, the model trained as described above has poor quality as a source of systematically-related question reformulations. For example, for a question like “What month, day and year did Super Bowl 50 take place?” the top translation is “What month and year goes back to the morning and year?”. To improve quality, after having trained the multilingual model, we resume training on a monolingual dataset which is two orders of magnitude smaller than the U.N. corpus. The monolingual data is extracted from the Paralex database of question paraphrases [Fader et al., 2013].⁷ Unfortunately, this data contains many noisy pairs. As a simple heuristic to filter these out we remove all pairs where the Jaccard coefficient between the sets of source and target terms is ≤ 0.5. Additionally, since the number of paraphrases for each question can vary significantly, for each question we keep at most 4 different question paraphrases. The total number of monolingual training pairs is ≈ 1.5M. This model has visibly better quality than the zero-shot one; e.g., for the example question above it generates “What year did superbowl take place?”.

⁴ For example, we prefix <from_es> <to_en> to the source side of a Spanish-English training instance.
⁵ https://github.com/google/sentencepiece
⁶ We run three million steps on batches of size 128 using the Adam optimizer with learning rate 0.001.
⁷ http://knowitall.cs.washington.edu/paralex/

We also tried training solely on the monolingual data.
The quality of this model seems somewhere in between the two approaches above. E.g., it produces “What month did super blooming 50 take place?” for the same question. This pattern of results is consistent with the evidence from Johnson et al. [2016].

3.3 Policy Gradient Training

Our goal is to return the best possible answer $a$, maximizing a reward $R(a)$. Typically, $R$ is the token-level F1 score on the answer string, although any choice can be made. The answer is an unknown function of the query reformulation sent to the environment, $a = f(q)$. The reformulator produces a question by sequentially generating tokens $w$. The reformulation process defined by our model is the policy $\pi_\theta(q \mid q_0)$, where $\theta$ is the set of model parameters. A query is generated according to

$$q \sim \pi_\theta(q \mid q_0) = \prod_{t=1}^{T} p(w_t \mid w_1, \ldots, w_{t-1}, q_0),$$

where $q_0$ is the original input question. Under the policy the goal is to maximize the expected reward of the answer, $\mathbb{E}_{\pi_\theta}[R(f(q))]$.

We train using policy gradient [Sutton and Barto, 1998], optimizing the reward directly with respect to the policy. The expected reward cannot be computed in closed form, but Monte Carlo sampling from the policy yields an unbiased estimator,

$$\mathbb{E}_{\pi_\theta}[R(f(q))] \approx \frac{1}{N} \sum_{i=1}^{N} R(f(q_i)), \qquad q_i \sim \pi_\theta(q_i \mid q_0). \tag{1}$$

To compute gradients for training we use the REINFORCE method [Williams and Peng, 1991],

$$\nabla_\theta \mathbb{E}_{\pi_\theta}[R(f(q))] = \mathbb{E}_{\pi_\theta}\left[R(f(q)) \, \nabla_\theta \log \pi_\theta(q \mid q_0)\right] \approx \frac{1}{N} \sum_{i=1}^{N} R(f(q_i)) \, \nabla_\theta \log \pi_\theta(q_i \mid q_0), \qquad q_i \sim \pi_\theta(q_i \mid q_0). \tag{2}$$

The REINFORCE estimator is often found to have high variance, leading to unstable training. We address this using a baseline [Williams, 1992], $B(q_0) = \mathbb{E}_{\pi_\theta(q \mid q_0)}[R(f(q))]$. This expectation is estimated by sampling from the policy given $q_0$. We often observed collapse onto a sub-optimal deterministic policy. To address this we include entropy regularization on the policy. The final objective is

$$\mathbb{E}_{\pi_\theta}[R(f(q)) - B(q_0)] + w_{\mathrm{reg}} \, H[\pi_\theta(q \mid q_0)], \tag{3}$$

where $w_{\mathrm{reg}}$ is the regularization weight.
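To make the estimator concrete, here is a minimal, self-contained sketch of one REINFORCE update with a sampled baseline and an entropy bonus. It is not the sequence-to-sequence setup used in the paper: as a simplifying assumption, the policy is a softmax over three fixed candidate reformulations, the rewards are made-up constants standing in for the F1 returned by the QA environment, and the names `reinforce_step`, `toy_rewards` and the step size `lr` are hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert unnormalized scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy H[pi] in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def reinforce_step(logits, reward_fn, n_samples=20, lr=0.5, w_reg=0.001, rng=random):
    """One gradient-ascent step on E[R - B] + w_reg * H for a categorical policy."""
    probs = softmax(logits)
    actions = range(len(logits))
    samples = rng.choices(actions, weights=probs, k=n_samples)
    rewards = [reward_fn(i) for i in samples]
    baseline = sum(rewards) / n_samples  # Monte Carlo estimate of B(q0)
    policy_entropy = entropy(probs)

    grad = [0.0] * len(logits)
    for i, reward in zip(samples, rewards):
        for k in actions:
            # d log pi(i) / d logit_k = 1[k == i] - pi(k) for a softmax policy.
            score = (1.0 if k == i else 0.0) - probs[k]
            grad[k] += (reward - baseline) * score / n_samples
    for k in actions:
        # dH / d logit_k = -pi(k) * (log pi(k) + H) for a softmax policy.
        grad[k] += w_reg * -probs[k] * (math.log(probs[k]) + policy_entropy)
    return [x + lr * g for x, g in zip(logits, grad)]


# Toy setup: three candidate reformulations with fixed, made-up F1 rewards.
toy_rewards = [0.1, 0.8, 0.3]
rng = random.Random(0)
logits = [0.0, 0.0, 0.0]
for _ in range(300):
    logits = reinforce_step(logits, lambda i: toy_rewards[i], rng=rng)
print([round(p, 3) for p in softmax(logits)])
```

Subtracting the baseline leaves the expected gradient direction unchanged while reducing its variance, and the entropy term pushes probability mass back toward less likely reformulations, which is the mechanism the text relies on to avoid collapse onto a deterministic policy.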
3.4 RL experiments

For the RL experiments we switch the optimizer from Adam to SGD with a low learning rate. During training we produce greedy translations, i.e., without beam search, on a dev set for every checkpoint and monitor their average reward in the QA environment. We found that a high learning rate can quickly deteriorate model performance. To avoid degeneration we also lower the entropy regularization weight $w_{\mathrm{reg}}$ to 0.001. Another option to improve training stability is to alternate between supervised and reinforcement learning. Since supervised data is not available for the question reformulation task we cannot interpolate the gradients as Wu et al. [2016] did. We train the model for an additional 100k steps using the RL updates. In contrast to our initial training, which we ran on GPUs, this training phase is dominated by the latency of the QA system, and we run inference and updates on CPU.

3.5 Supervised Aggregator

Using beam search or sampling, our reformulation system can produce many rewrites of a single question. We issue each rewrite to the QA system, yielding a set of (query, rewrite, answer) tuples from which we need to pick the best instance. We train another neural network to pick the best answer from the candidates. While this is a ranking problem, we phrase it as binary classification, distinguishing between above- and below-average performance. For every instance, we compute the F1 score of its answer. An instance counts as positive if the rewrite produces an answer with an above-average F1 score, compared to all other rewrites of the same question. We ignore questions where all rewrites yield equally good or bad answers, which is a sizeable part of our training data: around 50% of the questions yield the same answer score no matter the rewrite.

As classifier we did a quick evaluation of feed-forward networks, LSTMs and CNNs.
As inputs are triples of variable-length sequences, the latter two allow us to incorporate the tokens directly without need for feature engineering. As performance is similar for both CNN- and LSTM-based models, we chose CNNs for computational efficiency. In particular, we use pre-trained 100-dimensional embeddings [Pennington et al., 2014] for the tokens of query, rewrite, and answer. For each, we add a 1-d CNN of width 3 and output dimension 100, followed by max-pooling. The three vectors are then concatenated and passed through a feed-forward network which produces the binary output.

4 Experiments

4.1 QA Data

We evaluate on SearchQA [Dunn et al., 2017], a recently-released dataset built starting from a set of Jeopardy! clues. Clues are obfuscated queries such as “This ’Father of Our Country’ didn’t really chop down a cherry tree”. SearchQA associates each clue with the correct answer (e.g., “George Washington”), and a list of snippets from Google’s top search results. The dataset contains more than 140k question answer pairs, and 6.9M snippets. The data comes with pre-defined train, validation and test partitions. For both tasks we train our system on the train data, perform model selection and tuning on the development data and report results on both development and test partitions. The training, validation and test sets consist of 99,820, 13,393 and 27,248 examples, respectively.

                     Validation             Test
System               Exact Match    F1      Exact Match    F1
ASR                  –              24.2    –              22.8
BiDAF                31.7           37.9    28.6           34.6
AQA TH               32.0           38.2    30.6           36.8
AQA WV               33.6           40.5    33.3           39.3
AQA HAC              34.0           40.1    32.7           38.9
AQA TC               35.5           42.0    33.8           40.2
AQA CNN              40.5           47.4    38.7           45.6
Human Performance    –              –       43.9           –
Oracle ranking       48.3           56.0    46.6           54.3

Table 1: Results table for the experiments on SearchQA.

4.2 Results

Table 1 summarizes the experimental results of the AQA systems, including baselines and benchmarks. The metrics reported are exact match (EM) and F1 between the predicted answer and the gold answer.
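For readers unfamiliar with the two metrics, the following sketch shows one common way to compute exact match and token-level F1 between a predicted and a gold answer string. The normalization (lower-casing and whitespace tokenization) is an assumption made for illustration; the evaluation script actually used for SearchQA may normalize differently, and the function names are hypothetical.

```python
from collections import Counter


def normalize(text):
    """Lower-case and split on whitespace (assumed normalization)."""
    return text.lower().split()


def exact_match(prediction, gold):
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("George Washington", "george washington"))
print(token_f1("driverless cars", "self-driving cars"))
```

F1 gives partial credit when a predicted span overlaps the gold answer without matching it exactly, which is why F1 is never lower than exact match for the same predictions.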
The numbers we report are on the full validation and test datasets (referred to as n-gram in [Dunn et al., 2017]). This includes questions with a unigram answer, e.g., “His marvel.com bio says he’s "highly proficient in hand-to-hand combat, swordsmanship and hammer throwing"” (answer: Thor), and also questions with longer answers such as “Uber debuted this kind of car in September 2016, & passengers didn’t have to make small talk or argue about the music” (answer: self-driving cars).

SearchQA provides a baseline that ranks unigrams from the search snippets using their TF-IDF score. It is not given in the table because it is not provided for all questions but only for those where the answer consists of a single word. Dunn et al. [2017] also report the results of a deep learning system trained specifically for the SearchQA task. This is a modified pointer network, called attention sum reader (ASR). Overall, it seems that SearchQA is a harder task than other recent ones such as SQuAD [Rajpurkar et al., 2016] and CNN/DailyMail [Hermann et al., 2015], both for systems and humans. BiDAF’s performance degrades by 40 absolute F1 points on SearchQA compared to SQuAD and CNN/DailyMail. On the other hand, BiDAF is still a competitive QA system on SearchQA, as it improves over the ASR network by 13.7 absolute F1 points on validation.

We evaluate several variants of AQA systems. For each query $q$ in the evaluation set we sample reformulations $q_i$, $i = 1, \ldots, N$, using the sequence model trained with policy gradient; $N = 20$ in these experiments.

AQA TH. First, we just use $q_1$, the top hypothesis (TH) generated by the sequence model. This already gives an improvement over using the original question for BiDAF of 0.3 F1 points on validation and 2.2 F1 points on test.

AQA WV. We implement a simple heuristic weighted voting scheme that outputs the answer with the maximum vote.
Let $a_i$ be the answer returned by BiDAF for query $q_i$, with associated score $F(a_i)$; then $F_{WV}(a) = \sum_{i : a_i = a} F(a_i)$. This model (AQA WV) scores 40.5 F1 on validation and 39.3 on test, improving over choosing the top hypothesis.

AQA HAC. Instead of using the sum as in the voting scheme, we also tried using the average score returned by the BiDAF model for an answer. With 40.1 F1 on validation and 38.9 F1 on test, this performs slightly below the weighted voting scheme.

AQA TC. The most successful heuristic is to use the answer with the highest confidence score returned by the BiDAF model. This gives an F1 score of 42.0 on validation and 40.2 on test.

One can improve on top of these heuristics by training a supervised scoring function. We considered several neural network variants for this purpose, and evaluated using the CNN design described in Section 3.5. With F1 47.4 on validation and 45.6 on test, this function improves over the heuristics by a wide margin, and with an Exact Match score of 38.7 on test it closes more than half of the gap between BiDAF and human performance on the task.

We also report scores for an Oracle system that always picks the answer maximizing the respective score. The table also reports human performance, based on a sample from the test set. Human performance is considerably higher when restricted to unigram answers (64.85%): the task is already demanding for humans on unigrams and becomes significantly harder on n-gram answers.

5 Related work

Bilingual corpora and machine translation have long been used to generate paraphrases [Madnani and Dorr, 2010]. These approaches generally involve pivoting through a second language.
While early work [Bannard and Callison-Burch, 2005] focused on extracting words and phrases $e$, $e'$ from a phrase table such that $e$ is translated to a foreign phrase $f$ and $f$ translates back to $e'$, this approach has been extended to extract full sentences [Ganitkevitch et al., 2011, 2013] using a parsing-based MT system [Li et al., 2009]. Recent work by Mallinson et al. uses neural translation models and multiple pivots. In contrast, our approach does not use pivoting and is, to our knowledge, the first direct neural paraphrasing system.

Reinforcement learning is gaining traction in natural language understanding across a broad spectrum of problems. For example, Narasimhan et al. [2015] use reinforcement learning to learn control policies for multi-user dungeon (MUD) games where the state of the game is summarized by a textual description, while Li et al. [2017] use reinforcement learning for dialogue generation. Policy gradient methods [Williams, 1992, Sutton et al., 1999] of different kinds have been investigated recently for machine translation and other sequence-to-sequence problems. They are beneficial in alleviating the limitations inherent in the word-level optimization of the cross-entropy loss, allowing sequence-level reward metrics, like BLEU. They also help with the so-called exposure bias: the fact that the model never sees its own mistakes at training time [Ranzato et al., 2015, Shen et al., 2016]. Policy gradient has also been used to define sequence-level reward functions based on language models and reconstruction errors to bootstrap MT with fewer resources [Xia et al., 2016]. Bahdanau et al. [2016] use an actor-critic approach for machine translation. Our proposal is closely related to this body of work as we use NMT models in the agent, which we train with policy gradient methods, using the end-to-end question answering quality metrics as sequence-level rewards.

In question answering, Liang et al.
[2017] use policy gradient for learning a semantic parser to query a knowledge base, while Seo et al. [2017b] propose a variant of RNNs, QRN (query reduction networks), that transforms an original query to answer questions that involve multi-hop common sense reasoning.

Nogueira and Cho [2016] investigate an approach to question answering that is related to ours. Their goal is to identify a document containing an answer to a query by following links on a document graph. Evaluating on a set of questions from the game “Jeopardy!”, they start from a question and learn to walk the Wikipedia graph with an RNN until they reach the predicted article/answer.⁸ In a recent follow-up, Nogueira and Cho [2017] use an approach inspired by relevance feedback in combination with reinforcement learning. They propose to reformulate a query using terms from the documents retrieved for a question by a search system. With respect to this work, ours differs in two ways: we generate full reformulations of the original queries via monolingual sequence-to-sequence modeling, and we use a full state-of-the-art question answering system as the environment, rather than a traditional document retrieval system.

One popular technique to boost a QA system is to use ensemble methods. This is achieved simply by randomizing the predictive algorithm using several replicas of the base system, trained independently. AQA is conceptually different. An ensemble keeps the input fixed, and samples predictions by perturbing the model’s parameters. AQA, instead, aims at perturbing the input, while keeping the model fixed.

We see this line of research as naturally related also to the problem of fact checking. E.g., Wu et al. [2017] propose to perturb structured database queries for detecting fake news. An important difference with respect to this work is that, in the AQA framework, what is perturbed is not a structured query expressed in a formal language but surface natural language forms, directly.
As an example, while searching for an answer to a question like “Did Michelangelo paint the Sistine chapel?”, one could also ask “When did Michelangelo paint the Sistine chapel?”, knowing that a response containing a date is likely to be evidence for an affirmative answer. Similarly, other correlated information that can be collected by asking semantically, or pragmatically, related questions might provide evidence that can help model support for candidate answers.

⁸ Most Jeopardy! questions coincide with entities with an associated Wikipedia article.

[Figure 2 diagram; node and edge labels: A, C, QR, QA, $q_0$, $s_t$, $s_{t+1}$, $u_t$, $r_t$, $r_{t+1}$, $q_{t+1}$.]
Figure 2: The generalized agent-environment framework for active question answering.

6 Conclusion

We proposed a novel framework for question answering, inspired by interactive, iterative information seeking processes. We call it active question answering (AQA), as it aims at improving answering by systematically perturbing input questions. We investigated a first system of this kind that has three components: a question reformulator, a black box QA system, and a candidate answer scorer. The reformulator is a sequence-to-sequence model, while the scorer is a convolutional neural network. The components are trained using a combination of reinforcement learning and supervised learning. Experimental results show that the approach is highly effective. We improve a sophisticated Deep QA system by 11% absolute F1, 32% relative F1, on a difficult dataset of long, semantically complex questions.

We plan to further investigate the problem as an end-to-end reinforcement learning task. Figure 2 depicts the generalized AQA agent-environment framework. One can see AQA as a game. The game starts with a question $q_0$. Then, the agent (A) starts interacting with the environment (QA) over discrete time steps, in episodes of arbitrary length. The environment makes available a number of information sources; e.g., a question answering system, a search engine, etc.
The environment is noisy and the output quality is not always reliable or interpretable. The agent decision-making component (C)⁹ assesses the answer at time $t$, contained in the state $s_t$, or any of the previously collected ones. The agent can either decide that it is satisfied with one (A), and stop, or ask another question by using the next rewrite $q_{t+1}$ suggested by the reformulator. Notice that the reformulator and the control component can interact. The process repeats for a number of steps. Each answer returned is assigned a score. The objective of the game is to maximize the total score over a set of questions.

⁹ E.g., a neural network similar to the one used here for scoring.

The generalized framework differs from the system presented here in a few important ways. The decision making process, including answer scoring and action type prediction (answer, reformulate), is jointly trained with the reformulator. This can take the form of a critic or Q-function. Additionally, the value function can pass internal states to the reformulator, to make reformulations stateful. Also, in this formulation multi-step episodes can be performed. Finally, the generalized framework will allow one to experiment with all recent advances in deep reinforcement learning.

References

D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio. An actor-critic algorithm for sequence prediction. https://arxiv.org/abs/1607.07086, 2016.
C. Bannard and C. Callison-Burch. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 597–604. Association for Computational Linguistics, 2005.
D. Britz, A. Goldie, M.-T. Luong, and Q. Le. Massive exploration of neural machine translation architectures. https://arxiv.org/pdf/1703.03906.pdf, 2017.
M. Dunn, L.
Sagun, M. Higgins, U. Guney, V. Cirik, and K. Cho. SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine. https://arxiv.org/abs/1704.05179, 2017.
A. Fader, L. Zettlemoyer, and O. Etzioni. Paraphrase-Driven Learning for Open Question Answering. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, 2013.
J. Ganitkevitch, C. Callison-Burch, C. Napoles, and B. Van Durme. Learning sentential paraphrases from bilingual parallel corpora for text-to-text generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1168–1179. Association for Computational Linguistics, 2011.
J. Ganitkevitch, B. Van Durme, and C. Callison-Burch. PPDB: The paraphrase database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 758–764, 2013.
K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. Teaching machines to read and comprehend. In Proceedings of the 28th International Conference on Neural Information Processing Systems, pages 1693–1701, 2015.
M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, M. Hughes, and J. Dean. Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. https://arxiv.org/abs/1611.04558, 2016.
J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky. Deep reinforcement learning for dialogue generation. https://arxiv.org/pdf/1606.01541.pdf, 2017.
Z. Li, C. Callison-Burch, C. Dyer, J. Ganitkevitch, S. Khudanpur, L. Schwartz, W. N. Thornton, J. Weese, and O. F. Zaidan. Joshua: An open source toolkit for parsing-based machine translation. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 135–139. Association for Computational Linguistics, 2009.
C. Liang, J. Berant, Q. Le, K. D. Forbus, and N. Lao.
Neural symbolic machines: Learning semantic parsers on freebase with weak supervision. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017. N. Madnani and B. J. Dorr. Generating phrasal and sentential paraphrases: A survey of data-driven methods. Computational Linguistics, 36(3):341–387, 2010. 13 J. Mallinson, R. Sennrich, and M. Lapata. Paraphrasing revisited with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. G. Marchionini. Exploratory search: From finding to understanding. Commun. ACM, 49(4): 41–46, 2006. K. Narasimhan, T. Kulkarni, and R. Barzilay. Language understanding for text-based games using deep reinforcement learning. https://arxiv.org/abs/1506.08941, 2015. R. Nogueira and K. Cho. End-to-end goal-driven web navigation. In Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016. R. Nogueira and K. Cho. Task-oriented query reformulation with reinforcement learning. https://arxiv.org/abs/1704.04572v1, 2017. J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014. P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, 2016. M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. https://arxiv.org/abs/1511.06732, 2015. N. Roy and A. McCallum. Toward optimal active learning through monte carlo estimation of error reduction. ICML, Williamstown, pages 441–448, 2001. M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi. Bidirectional Attention Flow for Machine Comprehension. In Proceedings of ICLR, 2017a. M. Seo, S. Min, A. 
Farhadi, and H. Hajishirzi. Query-reduction networks for question answering. In Proceedings of ICLR 2017, ICLR 2017, 2017b. B. Settles. Active learning literature survey. University of Wisconsin, Madison, 52(55-66):11, 2010. S. Shen, Y. Cheng, Z. He, W. He, H. Wu, M. Sun, and Y. Liu. Minimum risk training for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1683–1692, 2016. I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems, pages 3104–3112, 2014. R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998. ISBN 0262193981. R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS’99, pages 1057–1063, 1999. 14 R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn., 8(3-4):229–256, 1992. R. J. Williams and J. Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991. Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016. Y. Wu, P. K. Agarwal, C. Li, J. Yang, and C. Yu. Computational fact checking through query perturbations. ACM Trans. Database Syst., 42(1):4:1–4:41, 2017. Y. Xia, D. He, T. Qin, L. Wang, N. Yu, T.-Y. Liu, and W.-Y. Ma. Dual learning for machine translation. In Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016. M. Ziemski, M. Junczys-Dowmunt, and B. 
Poliquen. The united nations parallel corpus v1.0. In Proceedings of Language Resources and Evaluation (LREC), 2016. 15