An Open Domain Topic Prediction Model for Answer Selection

Zhao Yan1, Nan Duan2, Ming Zhou2, Zhoujun Li1 and Jianshe Zhou3

1 Beihang University, No.37 Xueyuan Road, Beijing 100191, China
  {yanzhao,lizj}@buaa.edu.cn
2 Microsoft Research Asia, Beijing, China
  {nanduan,mingzhou}@microsoft.com
3 BAICIT, Capital Normal University
  [email protected]
Abstract. We present an open domain topic prediction model for the answer selection task. Different from previous unsupervised topic modeling methods, we automatically extract high-quality, large-scale ⟨sentence, topic⟩ pairs from Wikipedia as labeled data, and train an open domain topic prediction model based on a convolutional neural network, which can predict the most probable topics for any given input sentence. To verify the usefulness of the proposed approach, we add the topic prediction model into an end-to-end open domain question answering system and evaluate it on the answer selection task; improvements are obtained on both the WikiQA and QASent datasets.
Keywords: Answer selection · Question answering · Topic prediction
1 Introduction
Answer selection (AS) is a subtask of the open-domain question answering (QA) problem that has received considerable attention in the past few years [4, 12, 16, 17]. The goal of this task is to select the answer sentences of a given question from a set of pre-selected answer candidates. Previous solutions to this task either use the surface form overlap between questions and answer sentences, or try to learn, from labeled data, the matching between different surface forms with similar meanings in questions and answer sentences. The former cannot handle questions that share no content word with their answer sentences, while the latter depend heavily on limited training data and lack scalability to unseen open domain questions.
Motivated by the issues mentioned above, we propose an open domain topic prediction model for the answer selection task, which aims to measure the relevance between questions and answer sentences in a topic-level semantic space. The contributions of this paper are two-fold:
– We show how to automatically acquire high-quality and large-scale ⟨sentence, topic⟩ pairs from Wikipedia as labeled data for topic prediction model training (the dataset is available at https://github.com/helloQA/WikiTopicPredictionData);
– We show how to train the topic prediction model based on a Convolutional Neural Network (CNN), and how to use it to enhance an existing answer selection baseline with state-of-the-art results on the WikiQA and QASent datasets.
2 Topic Prediction Model
Formally, given an input sentence S (either a question or an answer sentence), the goal of topic prediction is to predict the N most probable topics that S is talking about, T = [⟨t_1, p_1⟩, ..., ⟨t_N, p_N⟩], where t_i denotes the i-th predicted topic and p_i its corresponding prediction probability.
In this section, we first illustrate how to acquire training data from Wikipedia, and then describe how to train the topic prediction model based on a CNN.
2.1 Training Data Acquisition
To ensure the usefulness of our proposed topic prediction model, the training data should cover most common topics. Instead of asking human annotators to label training data, which is expensive and time-consuming, we propose a simple but effective method to extract large-scale, open domain labeled data automatically from Wikipedia.
Each Wikipedia page is an entity-centered document that follows the format illustrated in Figure 1, where Entity denotes the central entity that the current Wikipedia page is talking about, Topic denotes an entry in the content table (the outline of the current page), and Topic Sentence denotes a sentence that introduces the given topic of the current entity.

Fig. 1. The typical format of a Wikipedia page.
Based on this format, we extract ⟨sentence, topic⟩ pairs by the following three steps:
Topic selection. We first extract all content tables from the Wikipedia dump (https://dumps.wikimedia.org/). Entries in content tables are considered as topic candidates. As we can see in Figure 1, topics in content tables are organized in a tree-like structure. In this work, we only consider topics in the first level, as topics in deeper levels are more specific to individual entities. As the total number of topic candidates (741,060 first-level topics) is very large, we filter the topic candidates based on the following two rules: (1) each topic should be covered by at least 20 Wikipedia pages; (2) each topic should not be contained in a list of general keywords, such as "abstract", "reference" and "others", as such keywords are too general to be useful in downstream tasks. After applying these two filtering rules, 7,218 topics are left.
⟨Sentence, Topic⟩ pair extraction. For each topic, we then extract its corresponding topic sentences from Wikipedia pages. In Wikipedia, each topic in the content table is introduced by one or more paragraphs. In this work, we only select the first sentence of the first paragraph of a given topic as its topic sentence, as we find that the first sentence is most likely to introduce the topic. More than 4.3M ⟨Sentence, Topic⟩ pairs are extracted in this step.
⟨Sentence, Topic⟩ pair balance. The numbers of ⟨Sentence, Topic⟩ pairs for different topics are extremely unbalanced, which would introduce a strong bias into the topic prediction model training. In this step, we sample 300 ⟨Sentence, Topic⟩ pairs for each topic and discard topics that contain fewer than 300 pairs. Consequently, 1,287 topics and 384,965 ⟨Sentence, Topic⟩ pairs are left in the final dataset (a further 1,135 sentences are removed by heuristic rules, such as a minimum length constraint, whose details we omit here).
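To make the three extraction steps concrete, the following Python sketch runs the pipeline over pre-parsed Wikipedia pages. The input structure `pages`, the helper `first_sentence` and the exact keyword list are hypothetical illustrations, not part of the released dataset or of any tooling described in this paper.

```python
import random
from collections import defaultdict

GENERAL_KEYWORDS = {"abstract", "reference", "references", "others"}  # rule (2): overly general entries
MIN_PAGES_PER_TOPIC = 20   # rule (1): a topic must appear on at least 20 pages
PAIRS_PER_TOPIC = 300      # balance step: sample 300 pairs per topic

def first_sentence(paragraph):
    """Hypothetical helper: return the first sentence of a paragraph."""
    return paragraph.split(". ")[0].strip() + "."

def extract_pairs(pages):
    """pages: list of dicts like
       {"entity": str, "sections": [{"title": str, "level": int, "paragraphs": [str, ...]}]}"""
    # Step 1: topic selection from first-level content-table entries.
    topic_pages = defaultdict(set)
    for page in pages:
        for sec in page["sections"]:
            if sec["level"] == 1:
                topic_pages[sec["title"].lower()].add(page["entity"])
    topics = {t for t, ps in topic_pages.items()
              if len(ps) >= MIN_PAGES_PER_TOPIC and t not in GENERAL_KEYWORDS}

    # Step 2: pair extraction, using the first sentence of the first paragraph of each topic section.
    pairs = defaultdict(list)
    for page in pages:
        for sec in page["sections"]:
            title = sec["title"].lower()
            if sec["level"] == 1 and title in topics and sec["paragraphs"]:
                pairs[title].append(first_sentence(sec["paragraphs"][0]))

    # Step 3: balancing, keeping topics with >= 300 pairs and sampling 300 pairs each.
    return {t: random.sample(sents, PAIRS_PER_TOPIC)
            for t, sents in pairs.items() if len(sents) >= PAIRS_PER_TOPIC}
```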
2.2 Model Training
A convolutional neural network based model, which resembles [5] and [11], is used as the topic prediction model. It is trained by (1) encoding each sentence and its corresponding topic into two topic-level semantic representations, and (2) ensuring that the representation of each sentence is closest to the representation of its correct topic.
Fig. 2. Illustration of the architecture for the topic prediction model.
– Input Layer. As the English vocabulary is very large, it is expensive to learn model parameters at the word level. We therefore obtain the representation of the t-th word-n-gram in a sentence S by concatenating the tri-letter vectors of its words: l_t = [w_{t-d}^T, ..., w_t^T, ..., w_{t+d}^T]^T, where w_t denotes the t-th word representation, and n = 2d + 1 denotes the contextual window size, which is set to 3.
– Convolution Layer. The convolution layer performs sliding window-based feature extraction to project the vector representation l_t of each word-n-gram to a contextual feature vector h_t:

  h_t = tanh(W_c · l_t)

  where W_c is the convolution matrix and tanh(x) = (1 - e^{-2x}) / (1 + e^{-2x}) is the activation function.
– Pooling Layer. The pooling layer aggregates the local features extracted by the convolution layer to form a sentence-level global feature vector with a fixed size, independent of the length of the input utterance. Max pooling is used to force the network to retain the most useful local features: l_p = [v_1, ..., v_K], where

  v_i = max_{t=1,...,T} h_t(i)
– Semantic Layer. For the global representation l_p of S, one more non-linear transformation is applied:

  y(S) = tanh(W_s · l_p)

  where W_s is the semantic projection matrix and y(S) is the final topic-level semantic vector representation of S.
We first compute semantic vectors for the sentence S and all topic candidates using the CNN. We then compute a cosine similarity score between S and each topic candidate t_i. The posterior probability of a topic given a sentence is computed from this cosine similarity through a softmax function, and the model is trained by maximizing the likelihood of the correct topic given the training ⟨sentence, topic⟩ pairs, using mini-batch stochastic gradient descent (SGD).
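The following PyTorch-style sketch illustrates the architecture and the softmax-over-cosine training objective described above. The framework choice and the hyperparameters (hidden_dim, semantic_dim, the smoothing factor gamma) are our assumptions for illustration, not settings reported in this paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopicCNN(nn.Module):
    """Minimal sketch of the sentence/topic encoder: convolution over word-n-gram
    tri-letter vectors (window size n = 2d + 1 = 3), max pooling, semantic projection."""
    def __init__(self, triletter_dim, hidden_dim=1000, semantic_dim=300, window=3):
        super().__init__()
        self.conv = nn.Conv1d(triletter_dim, hidden_dim, kernel_size=window, padding=1)
        self.proj = nn.Linear(hidden_dim, semantic_dim)

    def forward(self, x):
        # x: (batch, seq_len, triletter_dim) -- tri-letter hashed word vectors
        h = torch.tanh(self.conv(x.transpose(1, 2)))   # convolution layer: h_t = tanh(W_c · l_t)
        lp = h.max(dim=2).values                       # max pooling over positions
        return torch.tanh(self.proj(lp))               # semantic layer: y(S) = tanh(W_s · l_p)

def training_step(model, sentences, topic_texts, gold_idx, gamma=10.0):
    """One mini-batch step: softmax over cosine similarities between each sentence
    and all candidate topics, maximizing the likelihood of the gold topic."""
    y_s = F.normalize(model(sentences), dim=-1)        # (batch, semantic_dim)
    y_t = F.normalize(model(topic_texts), dim=-1)      # (num_topics, semantic_dim)
    cos = y_s @ y_t.t()                                # cosine similarity matrix
    return F.cross_entropy(gamma * cos, gold_idx)      # negative log-likelihood of gold topics
```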
3 Topic Prediction Model for Answer Selection
Given a question Q and an answer sentence A, we design two types of features that use our proposed topic prediction model for the answer selection task.
3.1 Implicit Topic Matching Feature
The first feature is called the Implicit Topic Matching (ITM) feature. It computes the relevance between Q and A directly as:

  ITM(Q, A) = Cosine(y(Q), y(A))

where y(Q) and y(A) are the topic-level vector representations of Q and A respectively, generated by the CNN model described in Section 2.2.
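As a small illustration, the ITM feature reduces to a single cosine between two encoder outputs; the sketch below reuses the hypothetical TopicCNN encoder from the sketch in Section 2.2.

```python
import torch.nn.functional as F

def itm_feature(model, question, answer):
    """ITM(Q, A) = cosine(y(Q), y(A)), using the topic-level CNN encoder."""
    y_q, y_a = model(question), model(answer)
    return F.cosine_similarity(y_q, y_a, dim=-1)
```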
3.2 Explicit Topic Matching Feature
The second feature is called the Explicit Topic Matching (ETM) feature. To compute this feature for each ⟨Q, A⟩ pair, we first predict the M most probable topics for Q and A respectively (M is set to 10 empirically in this work). The prediction score of a topic t_i^S given a sentence S (either a question or an answer sentence) is computed as:

  Score(S, t_i^S) = Cosine(y(S), y(t_i^S))

where y(S) and y(t_i^S) denote the topic-level embedding vectors of S and t_i^S respectively, generated by the model described in Section 2.2.
Then, we concatenate the predicted topics of Q and A respectively, generating two topic strings:

  Str_Q = t_1^Q t_2^Q ... t_M^Q
  Str_A = t_1^A t_2^A ... t_M^A
Based on these topic strings, we define three methods, ETM_MT(Q, A), ETM_WE(Q, A) and ETM_PP(Q, A), to measure the explicit topic-level semantic similarity between Q and A (a short code sketch of the three variants follows this list):
– ETM_MT(Q, A) is a word-based translation model score:

  ETM_MT(Q, A) = (1 / |Str_Q|) · Σ_{i=1}^{|Str_Q|} ln( (1 / |Str_A|) · Σ_{j=1}^{|Str_A|} p(w_{A_j} | w_{Q_i}) )

  where p(w_{A_j} | w_{Q_i}) denotes the probability that the i-th word w_{Q_i} in Str_Q is translated into the j-th word w_{A_j} in Str_A. In this paper, we run GIZA++ [9] to train word alignments based on 10M ⟨question, related question⟩ pairs crawled from the community QA website Yahoo Answers (https://answers.yahoo.com/).
– ETM_WE(Q, A) is a word embedding-based feature that measures the topic semantic similarity between Q and A:

  ETM_WE(Q, A) = (1 / |Str_Q|) · Σ_{i=1}^{|Str_Q|} ln( (1 / |Str_A|) · Σ_{j=1}^{|Str_A|} cosine(v_{w_{Q_i}}, v_{w_{A_j}}) )

  where v_{w_{Q_i}} (or v_{w_{A_j}}) denotes the word embedding learnt for the word w_{Q_i} (or w_{A_j}). We generate word embeddings using word2vec [8] based on sentences from GoogleNews (the embedding file is available at https://code.google.com/archive/p/word2vec/). The dimension of each word embedding vector is set to 300.
– ETM_PP(Q, A) is defined as a paraphrase-based feature. Given a phrase table PT extracted from a bilingual corpus (we use 0.5M Chinese-English bilingual sentences for phrase table extraction, i.e., LDC2003E07, LDC2003E14, LDC2005T06, LDC2005T10, LDC2005E83, LDC2006E26, LDC2006E34, LDC2006E85 and LDC2006E92), we follow [1] to extract a paraphrase table PP of the following form:

  PP = {⟨s_i, s_j, score(s_j; s_i)⟩}

  where s_i and s_j denote two phrases in the source language, and score(s_j; s_i) denotes a confidence score that s_i can be paraphrased to s_j, computed based on the phrase table PT as:

  score(s_j; s_i) = Σ_t p(t | s_i) · p(s_j | t)
  The underlying idea of this approach is that two source phrases that are aligned to the same target phrase tend to be paraphrases of each other.
  Based on PP, we define a paraphrase-based feature to measure the semantic similarity between Q and A:

  ETM_PP(Q, A) = (1 / N) · Σ_{n=1}^{N} ( Σ_{j=1}^{M} Count_PP(t_A^j, Str_Q) / M )

  where t_A^j denotes the j-th topic phrase of A and N denotes the maximum n-gram order (here 3). Count_PP(t_A^j, Str_Q) is computed based on the following rules:
  • If t_A^j ∈ Str_Q, then Count_PP(t_A^j, Str_Q) = 1;
  • Else, if ⟨t_A^j, s, score(s; t_A^j)⟩ ∈ PP and t_A^j's paraphrase s occurs in Str_Q, then Count_PP(t_A^j, Str_Q) = score(s; t_A^j);
  • Else, Count_PP(t_A^j, Str_Q) = 0.
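The sketch below illustrates one plausible implementation of the three ETM scores. The translation table p_trans, the embedding dictionary emb and the paraphrase table pp are assumed to be precomputed (e.g., with GIZA++, word2vec and the pivoting method above); the epsilon guards on the logarithms and the handling of n-gram orders in etm_pp are our own reading of the formulas, not details specified in the paper.

```python
import math
import numpy as np

EPS = 1e-6  # assumed guard to keep the logarithms finite

def etm_mt(str_q, str_a, p_trans):
    """Word-based translation score; p_trans[(w_q, w_a)] ~ p(w_a | w_q) from GIZA++."""
    total = 0.0
    for w_q in str_q:
        inner = sum(p_trans.get((w_q, w_a), 0.0) for w_a in str_a) / len(str_a)
        total += math.log(inner + EPS)
    return total / len(str_q)

def etm_we(str_q, str_a, emb):
    """Word-embedding score; emb maps a word to its (300-dim) word2vec vector.
    Words missing from emb are assumed to have been filtered out beforehand."""
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + EPS))
    total = 0.0
    for w_q in str_q:
        inner = sum(cos(emb[w_q], emb[w_a]) for w_a in str_a) / len(str_a)
        total += math.log(max(inner, EPS))
    return total / len(str_q)

def count_pp(phrase, str_q_text, pp):
    """Count_PP rules: exact match -> 1; paraphrase match -> its confidence; else 0."""
    if phrase in str_q_text:
        return 1.0
    for para, score in pp.get(phrase, []):   # pp: phrase -> [(paraphrase, score), ...]
        if para in str_q_text:
            return score
    return 0.0

def etm_pp(str_a, str_q_text, pp, max_order=3):
    """Paraphrase score; n-grams of A's topic string up to order N are matched against Str_Q
    (one plausible reading of the formula above). str_a is the token list of Str_A."""
    total = 0.0
    for n in range(1, max_order + 1):
        ngrams = [" ".join(str_a[i:i + n]) for i in range(len(str_a) - n + 1)]
        if ngrams:
            total += sum(count_pp(g, str_q_text, pp) for g in ngrams) / len(ngrams)
    return total / max_order
```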
There are other ways to compute the similarity between two sentences as well, such as bi-LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) models. Due to space limitations, we leave these experiments to future work.
4 Related Work
The lexical gap between semantically related sentences is one of the biggest challenges in modeling the relationship between question-answer pairs. Some works use machine translation approaches to relate words or phrases in the discrete space [6, 7]. [2] proposed a topic modeling method under the assumption that Q-A pairs should share similar topic distributions. Other works consider the relevance between questions and answers at the syntactic level instead of the word level, by matching parse trees [4, 12, 15]. Representation learning with neural networks has also become popular, learning sentence-level representations for the answer selection task [3, 10, 17].
Uniquely, our method proposes a simple but effective way to extract labeled data for supervised topic prediction model training, and demonstrates the effectiveness of such semantic-level features in the QA task.
5 Experiment

5.1 Evaluation on Topic Prediction
We first evaluate our proposed model in a topic prediction task.
Based on the training data extraction method described in Section 2.1, we collect the 1,287 most frequent topics and 384,965 ⟨sentence, topic⟩ pairs from 5,028,106 English Wikipedia pages. We further divide all ⟨sentence, topic⟩ pairs into two sets: 346,480 pairs as the training set and 38,485 pairs as the testing set.
We train the topic prediction model on the training set, and use it to predict the most probable topics for each sentence in the testing set. Accuracy@N (ACC@N) is used as the evaluation metric. We use Naive Bayes (NB) and TF-IDF as baselines, where
  NB(t_k, S) = log P(t_k) + Σ_{i=1}^{|S|} log P(w_i | t_k)

  TFIDF(t_k, S) = Σ_{i=1}^{|S|} TF(w_i, t_k) · log(IDF(w_i, T))
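A compact sketch of the two baseline scorers follows, assuming precomputed topic priors, per-topic word probabilities and TF/IDF statistics; the smoothing constants are assumptions for illustration.

```python
import math

def nb_score(topic, sentence_words, prior, word_prob):
    """NB(t_k, S) = log P(t_k) + sum_i log P(w_i | t_k); unseen words get a small floor."""
    return math.log(prior[topic]) + sum(math.log(word_prob[topic].get(w, 1e-9))
                                        for w in sentence_words)

def tfidf_score(topic, sentence_words, tf, idf):
    """TFIDF(t_k, S) = sum_i TF(w_i, t_k) * log(IDF(w_i, T)); unseen words contribute 0."""
    return sum(tf[topic].get(w, 0) * math.log(idf.get(w, 1.0)) for w in sentence_words)
```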
Model      ACC@1   ACC@3   ACC@5   ACC@10
NB         10.3%   18.1%   22.7%   30.6%
TFIDF      31.2%   49.0%   56.9%   67.3%
Our Model  33.9%   51.2%   58.7%   68.8%

Table 1. Evaluation on topic prediction task.
Evaluation results are shown in Table 1, where our topic prediction model achieves better results than both baselines.
Sentence                                      Top 10 Predicted Topics
Question:                                     measurement, features, usage, uses, comparison,
  How much is 1 tablespoon of water           water, size, physical characteristics,
                                              characteristics, outline
Answer candidate 1 (Correct):                 size, features, statistics, ports, infrastructure,
  This tablespoon has a capacity of           elements, specifications, measurement, sources,
  about 15ml.                                 facilities
Answer candidate 2 (Incorrect):               occurrence, nomenclature, branches, general
  It is abbreviated as t, tb, tbs, tbsp,      information, name, terminology, list, grammar,
  tblsp, or tblspn.                           concept, availability

Table 2. Example of topic prediction.
The meanings of some topics are very similar, such as "usage" and "uses" in Table 2, which makes achieving a high ACC@1 difficult. The model, with 58.7% ACC@5 and 68.8% ACC@10, is nonetheless acceptable and can provide enough semantic information to enhance the answer selection task.
5.2 Experiment on Answer Selection
For the answer selection task, we select two benchmarks, WikiQA [14] (http://aka.ms/WikiQA) and QASent [13], as the evaluation datasets.
Dataset The same triple format ⟨Q_i, A_{i,j}, label⟩ is used to represent the data in both the WikiQA and QASent datasets. Given a question Q_i, each candidate answer sentence A_{i,1}, ..., A_{i,k} is labeled as 1 or 0, where 1 denotes that the current sentence is a correct answer sentence and 0 denotes the opposite.
The WikiQA dataset is constructed from natural language questions and Wikipedia documents. The questions are collected from question-like Bing search queries with clicks to Wikipedia. The WikiQA dataset contains 2,118 questions and 20,360 question-answer (Q-A) pairs in the training set, 296 questions and 1,130 Q-A pairs in the development set, and 633 questions and 2,352 Q-A pairs in the testing set.
The QASent dataset selects questions from the TREC 8-13 QA tracks and chooses sentences that share one or more non-stopwords with the questions. The QASent dataset contains 94 questions and 5,919 Q-A pairs in the training set, 65 questions and 1,117 Q-A pairs in the development set, and 68 questions and 1,442 Q-A pairs in the testing set.
Metrics The performance of answer selection is evaluated by Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR). In WikiQA, only about one-third of the questions contain at least one correct answer among their candidate sentences. Similar to previous work, questions without correct answers in the candidate sentences are not taken into account.
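For reference, MAP and MRR over ranked candidate lists can be computed as in the sketch below; questions without a correct candidate are skipped, as stated above.

```python
def map_mrr(ranked_labels_per_question):
    """ranked_labels_per_question: list of label lists (1/0), each sorted by model score."""
    aps, rrs = [], []
    for labels in ranked_labels_per_question:
        if not any(labels):
            continue                       # skip questions without a correct answer
        hits, precisions = 0, []
        for rank, label in enumerate(labels, start=1):
            if label == 1:
                hits += 1
                precisions.append(hits / rank)
        aps.append(sum(precisions) / hits)         # average precision for this question
        rrs.append(1.0 / (labels.index(1) + 1))    # reciprocal rank of the first correct answer
    return sum(aps) / len(aps), sum(rrs) / len(rrs)
```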
Results and Discussion We re-implement bi-CNN [14] as our baseline, which uses a bi-gram CNN model combined with a word overlap feature that counts the number of non-stopwords in the question that also occur in the answer. LDA denotes the cosine similarity between the topic distributions of a Q-A pair; the topic distributions are estimated by LightLDA [18] on sentences from English Wikipedia, with the topic number set to 1,000. In our experiments, features are combined by a logistic regression function.
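As an illustration of this feature combination step, a logistic regression over per-pair feature vectors might look like the following sketch (scikit-learn with toy data; the exact feature ordering, regularization and solver settings are assumptions).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy feature matrix: one row per Q-A pair with
# [bi-CNN score, word-overlap count, ITM, ETM_MT, ETM_WE, ETM_PP].
X_train = np.array([[0.8, 3, 0.7, -1.2, -0.9, 0.4],
                    [0.2, 0, 0.1, -4.5, -3.8, 0.0],
                    [0.6, 2, 0.5, -2.0, -1.5, 0.3],
                    [0.1, 0, 0.2, -5.0, -4.1, 0.1]])
y_train = np.array([1, 0, 1, 0])            # 1 = correct answer sentence

ranker = LogisticRegression(C=1.0, max_iter=1000)
ranker.fit(X_train, y_train)

# Candidates of each question are then ranked by the predicted probability of label 1.
scores = ranker.predict_proba(X_train)[:, 1]
```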
We first run evaluation on all questions in WikiQA. Table 3 shows that the individual topic matching features achieve results comparable to the baselines. This is because only 20.3% of the answer sentences share no content word with their corresponding questions in the WikiQA dataset; therefore, the word overlap feature plays an important role in all settings.
To illustrate the effectiveness of our model, we then run evaluation on hard questions, each of which shares no content word with its corresponding answer sentences. Table 4 shows that using the two topic-level semantic features (ITM and ETM) brings significant and the largest improvements, compared to the bi-CNN_cnt and bi-CNN_cnt+LDA settings.
                 Model                  MAP      MRR      ACC@1
Baselines        bi-CNN_cnt             65.27%   66.61%   51.35%
                 bi-CNN_cnt+LDA         65.74%   67.71%   52.24%
Single Features  ITM                    47.24%   45.57%   26.58%
                 ETM_MT                 53.42%   53.73%   33.74%
                 ETM_WE                 49.28%   48.68%   28.40%
                 ETM_PP                 53.28%   53.54%   33.33%
Merged Methods   bi-CNN_cnt+ITM         65.72%   67.73%   51.90%
                 bi-CNN_cnt+ETM_MT      66.81%   68.27%   52.32%
                 bi-CNN_cnt+ETM_WE      65.92%   67.17%   51.64%
                 bi-CNN_cnt+ETM_PP      66.51%   68.29%   52.74%
                 bi-CNN_cnt+ALL         67.50%   69.19%   54.43%

Table 3. Evaluation result on all questions of WikiQA.
                 Model                  MAP      MRR      ACC@1
Baselines        bi-CNN_cnt             42.40%   43.02%   23.08%
                 bi-CNN_cnt+LDA         42.45%   43.22%   23.15%
Merged Methods   bi-CNN_cnt+ITM         42.34%   42.96%   23.08%
                 bi-CNN_cnt+ETM_MT      46.48%   47.87%   29.77%
                 bi-CNN_cnt+ETM_WE      42.47%   43.09%   24.26%
                 bi-CNN_cnt+ETM_PP      42.35%   42.92%   22.84%
                 bi-CNN_cnt+ALL         47.36%   48.10%   30.77%

Table 4. Evaluation result on hard questions of WikiQA.
The results in Table 5 show that our model also achieves better results on the QASent dataset, even though QASent is independent of the Wikipedia articles on which both WikiQA and our topic prediction model are based.
                 Model                  MAP      MRR      ACC@1
Baselines        bi-CNN_cnt             69.52%   76.33%   61.76%
                 bi-CNN_cnt+LDA         69.75%   76.85%   63.23%
Merged Methods   bi-CNN_cnt+ITM         70.12%   78.62%   64.03%
                 bi-CNN_cnt+ETM_MT      70.62%   78.62%   65.63%
                 bi-CNN_cnt+ETM_WE      69.64%   76.55%   62.79%
                 bi-CNN_cnt+ETM_PP      70.36%   77.87%   64.71%
                 bi-CNN_cnt+ALL         71.09%   78.97%   66.17%

Table 5. Evaluation result on QASent.
Compared with implicit topic modeling methods such as LDA and ITM, our ETM method has good interpretability: it can explain why a question-answer pair has a good (or bad) semantic relationship at the topic level.
6 Conclusion
This paper introduces a method to obtain large-scale open domain labeled data, and proposes a topic prediction model for the answer selection task. The results show that our topic prediction model achieves promising results on the WikiQA dataset, especially on hard questions. We also release our ⟨sentence, topic⟩ labeled pairs to support researchers in using them in more NLP tasks, such as answer selection.
Acknowledgments
This work was supported by Beijing Advanced Innovation Center for Imaging
Technology (No.BAICIT-2016001), the National Natural Science Foundation of
China (Grant Nos. 61370126, 61672081), the National High Technology Research
and Development Program of China (No.2015AA016004), the Fund of the State
Key Laboratory of Software Development Environment (No.SKLSDE-2015ZX16).
References
1. Bannard, C., Callison-Burch, C.: Paraphrasing with bilingual parallel corpora. In:
Proceedings of Annual Meeting of the Association for Computational Linguistics
(ACL). pp. 597–604 (2005)
2. Duan, H., Cao, Y., Lin, C.Y., Yu, Y.: Searching questions by identifying question
topic and question focus. In: Proceedings of Annual Meeting of the Association for
Computational Linguistics (ACL). pp. 156–164 (2008)
3. Feng, M., Xiang, B., Glass, M.R., Wang, L., Zhou, B.: Applying deep learning to
answer selection: A study and an open task. Automatic Speech Recognition and
Understanding Workshop (ASRU) (2015)
4. Heilman, M., Smith, N.A.: Tree edit models for recognizing textual entailments,
paraphrases, and answers to questions. In: Proceedings of Annual Conference of
the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies (NAACL-HLT). pp. 1011–1019 (2010)
5. Huang, P.S., He, X., Gao, J., Deng, L., Acero, A., Heck, L.: Learning deep structured semantic models for web search using clickthrough data. In: Proceedings of
the Conference on Information and Knowledge Management (CIKM). pp. 2333–
2338 (2013)
6. Jeon, J., Croft, W.B., Lee, J.H.: Finding similar questions in large question and
answer archives. In: Proceedings of the International Conference on Information
and Knowledge Management (CIKM). pp. 84–90 (2005)
7. Li, S., Manandhar, S.: Improving question recommendation by exploiting information need. In: Proceedings of Annual Meeting of the Association for Computational
Linguistics (ACL). pp. 1425–1434 (2011)
8. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in neural
information processing systems (NIPS). pp. 3111–3119 (2013)
9. Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Computational Linguistics 29(1), 19–51 (2003)
10. Severyn, A., Moschitti, A.: Learning to rank short text pairs with convolutional
deep neural networks. In: Proceedings of ACM SIGIR Conference on Research and
Development in Information Retrieval. pp. 373–382 (2015)
11. Shen, Y., He, X., Gao, J., Deng, L., Mesnil, G.: A latent semantic model with
convolutional-pooling structure for information retrieval. In: Proceedings of the
Conference on Information and Knowledge Management (CIKM). pp. 101–110
(2014)
12. Wang, M., Manning, C.D.: Probabilistic tree-edit models with structured latent
variables for textual entailment and question answering. In: Proceedings of the
International Conference on Computational Linguistics (COLING). pp. 1164–1172
(2010)
13. Wang, M., Smith, N.A., Mitamura, T.: What is the Jeopardy model? A quasi-synchronous grammar for QA. In: Proceedings of the Conference on Empirical
Methods in Natural Language Processing (EMNLP). vol. 7, pp. 22–32 (2007)
14. Yang, Y., Yih, W.t., Meek, C.: WikiQA: A challenge dataset for open-domain question answering. In: Proceedings of the Conference on Empirical Methods in Natural
Language Processing (EMNLP). pp. 2013–2018 (2015)
15. Yao, X., Van Durme, B., Callison-Burch, C., Clark, P.: Answer extraction as sequence tagging with tree edit distance. In: Proceedings of Annual Conference of
the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies (NAACL-HLT). pp. 858–867 (2013)
16. Yih, W.t., Chang, M.W., Meek, C., Pastusiak, A.: Question answering using enhanced lexical semantic models. In: Proceedings of Annual Meeting of the Association for Computational Linguistics (ACL). pp. 1744–1753 (2013)
17. Yu, L., Hermann, K.M., Blunsom, P., Pulman, S.: Deep learning for answer
sentence selection. NIPS Deep Learning and Representation Learning Workshop
(2014)
18. Yuan, J., Gao, F., Ho, Q., Dai, W., Wei, J., Zheng, X., Xing, E.P., Liu, T.Y., Ma,
W.Y.: LightLDA: Big topic models on modest computer clusters. In: Proceedings of
the Annual International Conference on World Wide Web. pp. 1351–1361 (2015)