Open Question Answering Over Curated and Extracted Knowledge Bases
Anthony Fader, Luke Zettlemoyer, Oren Etzioni
Presented by Gareth Dwyer
Outline
❏ Introduction
❏ Knowledge Bases or OIE?
❏ Overview of OQA
❏ Subtasks
  ❏ Paraphrasing
  ❏ Parsing
  ❏ Rewriting
  ❏ Executing
❏ Deriving answers
  ❏ Perceptron
  ❏ Learning
  ❏ Scoring
❏ Experiment
❏ Conclusion
(Interrupt | ask a question | tell a joke*) at any time
* There aren’t enough in this presentation
Structured KBs vs OIE
❏ Structured or unstructured sources?
  ❏ Freebase -> well structured and easy to parse, but has limited information
  ❏ The Web -> more information, but noisy and difficult to parse [OIE]
❏ We can use both - the first time this has been tried [2014]
  ❏ Abstract over structured/unstructured knowledge bases with a custom query language
❏ But unstructured data is more difficult
  ❏ What if our knowledge base contains the answer but we can't find it?
    ❏ "What are the signs of 'flu?" -> (Symptom-of, Chills, 'Flu)
  ❏ But this can also be useful: questions are asked in NL
Create subtasks
❏ Solving problems is fun
❏ We had one problem. Let's breed more.
❏ Paraphrasing, Parsing, Rewriting, Executing
Overview of OQA
❏ We don't find an answer, we "derive" one (toy sketch below)
  ❏ Paraphrase = NL to NL (WikiAnswers)
  ❏ Parse = NL -> Q (hand-written templates)
  ❏ Rewrite = Q -> Q (machine learning)
  ❏ Execute = look up in a KB
❏ Spoilers: the search space is pretty huge
  ❏ Beam search
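To make the pipeline concrete, here is a minimal Python sketch of a derivation as a chain of (operation, state) steps. The Derivation class, the toy operators, and the example answer are illustrative assumptions, not the authors' implementation.

from dataclasses import dataclass, field
from typing import Callable, List, Tuple

State = str  # an NL question, a query, or an answer string

@dataclass
class Derivation:
    question: str                                   # the input question s0
    steps: List[Tuple[str, State]] = field(default_factory=list)

    @property
    def state(self) -> State:
        # the current state is the result of the last operation (or the question itself)
        return self.steps[-1][1] if self.steps else self.question

    def extend(self, op_name: str, op: Callable[[State], State]) -> "Derivation":
        new = Derivation(self.question, list(self.steps))
        new.steps.append((op_name, op(self.state)))
        return new

# Toy stand-in operators (the real ones are mined/learned, as described later)
paraphrase = lambda q: "what is the scientific name for the common daisy"
parse      = lambda q: "(?x, scientific name, common daisy)"
execute    = lambda q: "bellis perennis"

d = Derivation("what is the latin name for the common daisy")
for name, op in [("paraphrase", paraphrase), ("parse", parse), ("execute", execute)]:
    d = d.extend(name, op)
print(d.steps)   # [('paraphrase', ...), ('parse', ...), ('execute', 'bellis perennis')]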
Four resources
❏ Freebase (structured - last week)
❏ Open IE ("family of techniques")
❏ Probase (only "is-a" relations)
❏ NELL (only 300 relation phrases - high precision)
Outline
❏ Introduction
❏ Knowledge Bases or OIE?
❏ Overview of OQA
❏ Subtasks
  ❏ Paraphrasing
  ❏ Parsing
  ❏ Rewriting
  ❏ Executing
❏ Deriving answers
  ❏ Perceptron
  ❏ Learning
  ❏ Scoring
❏ Experiment
❏ Conclusion
Paraphrasing, saying the same thing in a different way
Restating with different words; AKA, an additional manner of expression; or equivalently, in other words; or more simply, there are:
❏ Many ways to ask a single question
❏ Idea: we may be able to answer an equivalent question more easily
❏ Example:
  ❏ What is the latin name for __?
  ❏ What is ___'s scientific name?
❏ We can find the answer for any paraphrase of Q and use it as the answer for Q
WikiAnswers
❏ Huge clustered, user-created QA corpus: 23 million clusters, avg. 25 questions / cluster
❏ Mine paraphrase templates from this (a sketch follows this slide)
  ❏ e.g. t = "Why do we use computers?", t' = "What did computers replace?"
❏ If the argument co-occurrence for (t, t') is >= 5, use it as a paraphrase template pair
❏ Only consider candidate templates that appear in more than 10 clusters
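A hedged sketch of the mining step above, assuming each cluster is a list of (question, argument) pairs that users grouped together; the two thresholds come from the slide, everything else (function names, data layout) is an assumption.

from collections import defaultdict
from itertools import combinations

def templatize(question: str, argument: str) -> str:
    """Replace the argument string with a slot marker to form a template."""
    return question.lower().replace(argument.lower(), "_ARG_")

def mine_paraphrase_templates(clusters, min_cooccur=5, min_clusters=10):
    """clusters: iterable of clusters, each a list of (question, argument) pairs
    tagged by WikiAnswers users as asking the same thing."""
    pair_counts = defaultdict(int)        # (t, t') -> argument co-occurrence count
    template_clusters = defaultdict(set)  # t -> ids of clusters containing t

    for cid, cluster in enumerate(clusters):
        args_by_template = defaultdict(set)
        for question, arg in cluster:
            t = templatize(question, arg)
            args_by_template[t].add(arg.lower())
            template_clusters[t].add(cid)
        # templates in the same cluster that share an argument co-occur
        for t1, t2 in combinations(sorted(args_by_template), 2):
            pair_counts[(t1, t2)] += len(args_by_template[t1] & args_by_template[t2])

    return {
        pair for pair, n in pair_counts.items()
        if n >= min_cooccur
        and len(template_clusters[pair[0]]) > min_clusters
        and len(template_clusters[pair[1]]) > min_clusters
    }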
Paraphrasing - noisy
❏ The paraphrase corpus is noisy
  ❏ Even with restrictions (template in > 10 clusters, argument co-occurrence >= 5), the data is unreliable
❏ Add features (PMI sketch below)
  ❏ PMI (Pointwise Mutual Information) between templates: do the templates co-occur in the WikiAnswers corpus?
  ❏ Language model score
  ❏ POS tags: the matched argument a, and the tags left and right of a in q
❏ Used for learning / scoring - more later
❏ We'll need features for the next subtasks too
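The PMI feature can be computed directly from corpus counts; a small sketch, assuming counts are taken over WikiAnswers clusters (the paper's exact estimation may differ).

import math

def template_pmi(count_t: int, count_t2: int, count_joint: int, total: int) -> float:
    """Pointwise mutual information between templates t and t':
    log p(t, t') / (p(t) * p(t')), with probabilities estimated from counts."""
    if count_joint == 0:
        return float("-inf")   # never co-occur: strongly penalise this pair
    return math.log((count_joint / total) / ((count_t / total) * (count_t2 / total)))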
Parsing
❏ Use POS tags and regexes to define 10 high-precision templates (illustrative example below)
  ❏ Who, What, Where, When
  ❏ No How, Why
❏ Templates translate NL into a custom query language
❏ The query language can extract triples from Freebase and the Open IE KBs
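An illustrative hand-written template in the spirit of the slide, using only a regex (the real templates also use POS tags); the pattern, the query format, and the function name are assumptions, not the paper's actual templates.

import re

# "What/Who is the R of A?"  ->  query (?x, R, A)
WHAT_IS_THE_R_OF_A = re.compile(
    r"^(what|who)\s+(is|are|was|were)\s+the\s+(?P<rel>.+?)\s+of\s+(?P<arg>.+?)\??$",
    re.IGNORECASE,
)

def parse(question: str):
    m = WHAT_IS_THE_R_OF_A.match(question.strip())
    if m:
        return ("?x", m.group("rel"), m.group("arg"))
    return None

print(parse("What is the capital of France?"))   # ('?x', 'capital', 'France')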
Rewriting
❏ Similar idea to paraphrasing, but modifying queries
  ❏ Queries still contain (smaller) NL components, so rewrite them
  ❏ Change the relation
  ❏ Change the order of the arguments
❏ Rewrite rules (mining sketch below)
  ❏ Rules mined from the KBs directly (for paraphrases it was WikiAnswers)
  ❏ Keep a pair (r, r') if the two relations share > 10 argument pairs
  ❏ 74 million (r, r') pairs
❏ Features
  ❏ PMI(r, r')
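A hedged sketch of mining (r, r') rewrite rules from KB triples, mirroring the "> 10 shared argument pairs" rule above; the data layout and names are assumptions, and a real implementation would index by argument pair rather than compare all relation pairs.

from collections import defaultdict
from itertools import combinations

def mine_rewrite_rules(triples, min_shared=10):
    """triples: iterable of (arg1, relation, arg2)."""
    args_by_rel = defaultdict(set)
    for a1, r, a2 in triples:
        args_by_rel[r].add((a1.lower(), a2.lower()))

    rules = set()
    for r1, r2 in combinations(sorted(args_by_rel), 2):
        shared = len(args_by_rel[r1] & args_by_rel[r2])   # shared (arg1, arg2) pairs
        if shared > min_shared:
            rules.add((r1, r2))
            rules.add((r2, r1))    # rewrites can be applied in either direction
    return rules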
Executing
❏ Extract triples from all KBs
❏ Query -> answer + evidence
❏ Lucene is used for indexing all the triples (a toy stand-in index is sketched below)
❏ Query -> SQL?
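The real system indexes roughly a billion triples with Lucene; as a stand-in, here is a toy in-memory index that turns a one-variable query into (answer, evidence) pairs. Everything here is illustrative.

from collections import defaultdict

class TripleStore:
    def __init__(self, triples):
        # triples: (arg1, relation, arg2, source), e.g. from Freebase or Open IE
        self.by_rel_arg2 = defaultdict(list)
        self.by_rel_arg1 = defaultdict(list)
        for a1, r, a2, src in triples:
            self.by_rel_arg2[(r.lower(), a2.lower())].append((a1, r, a2, src))
            self.by_rel_arg1[(r.lower(), a1.lower())].append((a1, r, a2, src))

    def execute(self, query):
        """query: a triple with '?x' in the first or last position.
        Returns (answer, evidence-triple) pairs."""
        a1, r, a2 = query
        if a1 == "?x":
            return [(t[0], t) for t in self.by_rel_arg2[(r.lower(), a2.lower())]]
        if a2 == "?x":
            return [(t[2], t) for t in self.by_rel_arg1[(r.lower(), a1.lower())]]
        return []

store = TripleStore([("Paris", "capital", "France", "freebase")])
print(store.execute(("?x", "capital", "France")))
# [('Paris', ('Paris', 'capital', 'France', 'freebase'))]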
Features
❏ Keyword similarity
  ❏ Query -> input question
  ❏ Query -> evidence
❏ Source (Freebase, Open IE)
❏ Word shape (sketch below)
  ❏ Aaaa 1111 for dates
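A sketch of the word-shape feature hinted at above: map characters to a canonical shape so that, e.g., "June 1999" becomes "Aaaa 1111", a useful cue that an answer looks like a date. The function name is an assumption.

def word_shape(text: str) -> str:
    """Map uppercase letters to 'A', lowercase to 'a', digits to '1'."""
    shape = []
    for ch in text:
        if ch.isupper():
            shape.append("A")
        elif ch.islower():
            shape.append("a")
        elif ch.isdigit():
            shape.append("1")
        else:
            shape.append(ch)
    return "".join(shape)

print(word_shape("June 1999"))   # Aaaa 1111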
Outline
❏ Introduction
❏ Knowledge Bases or OIE?
❏ Overview of OQA
❏ Subtasks
  ❏ Paraphrasing
  ❏ Parsing
  ❏ Rewriting
  ❏ Executing
❏ Deriving answers
  ❏ Perceptron
  ❏ Learning
  ❏ Scoring
❏ Experiment
❏ Conclusion
Now for the fun part
❏ Now that we have subtasks, we can get from the input question to possible answers
❏ Binary classification -> correct : incorrect
❏ Machine learning?
❏ Need: training data + features
  ❏ But we need the right answers
  ❏ Paraphrasing: PMI, language model, POS tags
  ❏ Rewriting: PMI
  ❏ Executing: keyword similarity, source
❏ Assume: we can draw a straight line between correct and incorrect answers (i.e. they are linearly separable)
❏ Assume: our features are also relevant for partial derivations
  ❏ Question -> paraphrase -> parse can have a score
  ❏ A full derivation consists of 2, 3, or 4 operations
Perceptron
❏ A neural network with no hidden layers
  ❏ Matrix of inputs
  ❏ Matrix of weights
  ❏ Some function
❏ Latent-variable, structured: the derivations themselves are 'unseen', so we get more out of a small initial set of question/answer pairs
❏ Iteratively adjust weights based on the training set
Machine learning
1) Get data
2) Label some of it
3) ???
4) Profit
Learning weights using a perceptron
Latent Variable Perceptron
Getting data
q = “How can you tell if you have the flu?”,
A = {“chills”, “fever”, “aches”}
Learning
Incrementally modify the weight matrix using the training set (questions + answers).
BeamSearch (DeriveAnswers) to get candidate derivations.
If the best answer isn't correct (not in Ai), take the best-scoring correct answer and update the weights: increment the features of the correct answer and decrement the features of the incorrect one (sketch below).
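A hedged sketch of the update just described, in the style of a latent-variable structured perceptron; derive_answers() and features() are stand-ins for the beam search and feature functions from earlier slides, and the candidate derivations are assumed to expose an .answer attribute.

from collections import defaultdict

def train(training_set, derive_answers, features, epochs=3):
    """training_set: list of (question, answer_set) pairs, e.g.
    ("How can you tell if you have the flu?", {"chills", "fever", "aches"})."""
    weights = defaultdict(float)

    def score(derivation):
        return sum(weights[f] * v for f, v in features(derivation).items())

    for _ in range(epochs):
        for question, answers in training_set:
            candidates = derive_answers(question, score)   # beam search over derivations
            if not candidates:
                continue
            predicted = max(candidates, key=score)
            if predicted.answer in answers:
                continue                                   # best answer already correct
            correct = [d for d in candidates if d.answer in answers]
            if not correct:
                continue                                   # nothing to update towards
            target = max(correct, key=score)               # best-scoring correct answer
            for f, v in features(target).items():          # increment correct features
                weights[f] += v
            for f, v in features(predicted).items():       # decrement incorrect features
                weights[f] -= v
    return weights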
Putting it together
❏ Each subtask is an operation
  ❏ Paraphrase -> Parse -> Rewrite -> Execute
❏ Each operation has features with numerical scores
❏ Learn weights for the features
❏ Use the scoring function to find the best derivation
  ❏ Noisy data means many incorrect derivations are possible
❏ The score of a derivation comes from a feature function and weights
  ❏ Sum over all k operations/states
  ❏ Always pass the input question s0 (one way to write this is shown below)
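One way to write the scoring function these bullets describe (the notation here is an assumption, not necessarily the paper's exact formulation): for a derivation d with operations o_1, ..., o_K and states s_1, ..., s_K, starting from the input question s_0,

\mathrm{score}(d) = \theta \cdot \phi(d) = \sum_{k=1}^{K} \theta \cdot \phi(s_0, s_{k-1}, o_k, s_k)

where \phi is the feature function (PMI, language-model score, keyword similarity, source, word shape, ...) and \theta is the weight vector learned by the perceptron.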
Finding the best derivation
Searching
❏ We can make derivations
❏ We can score derivations
❏ Can we just try everything and take the best one?
  ❏ 10 parse templates, 5 million paraphrase operators, 74 million rewrite operators, 1 billion triples
❏ Beam search (for both learning and answering) - see the sketch below
  ❏ Best-first, breadth-first search with time and space limits
  ❏ The score is the heuristic for ordering nodes
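A minimal beam-search sketch matching these bullets: best-first, breadth-first expansion of partial derivations, with the beam width and depth cap playing the role of the space and time limits. expand() and score() stand in for the operator set and the learned scoring function; is_complete is an assumed attribute of derivation objects.

import heapq

def beam_search(initial, expand, score, beam_width=100, max_depth=4):
    """initial: a derivation holding only the input question.
    Returns completed derivations, best first."""
    beam = [initial]
    finished = []
    for _ in range(max_depth):
        candidates = []
        for derivation in beam:
            for nxt in expand(derivation):          # apply paraphrase/parse/rewrite/execute
                (finished if nxt.is_complete else candidates).append(nxt)
        if not candidates:
            break
        beam = heapq.nlargest(beam_width, candidates, key=score)   # keep the best partials
    return sorted(finished, key=score, reverse=True)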
Outline
❏ Introduction
❏ Knowledge Bases or OIE?
❏ Overview of OQA
❏ Subtasks
  ❏ Paraphrasing
  ❏ Parsing
  ❏ Rewriting
  ❏ Executing
❏ Deriving answers
  ❏ Perceptron
  ❏ Learning
  ❏ Scoring
❏ Experiment
❏ Conclusion
So does all this actually work?
Three questions
❏ How does OQA compare with Paralex and Sempre?
  ❏ Overview of these systems later
❏ How do the different sources affect performance?
❏ How do the different components affect performance?
Three test sets
❏ WebQuestions
  ❏ All answerable through Freebase
❏ TREC
  ❏ Answerable from a small document set
❏ WikiAnswers
  ❏ Randomly sampled
Results
How generalisable is this?
❏ 3 sets of weights vs. a single set of weights
❏ WebQuestions
  ❏ Answerable from Freebase
❏ TREC and WikiAnswers
  ❏ More data is better
❏ Note: the x-axis isn't consistent
  ❏ WebQuestions is easiest
❏ Blue: trained on questions from all domains
❏ Grey: trained specifically for that domain
❏ Confidence threshold varied to achieve higher recall
Paralex comparison
Paralex
Same authors
Similar idea, but uses a single step for paraphrasing, parsing, and rewriting (question -> query).
Only lexicalised features.
Also designed to work with noisy data.
Pretrained.
Mixes up where/when questions: gives a date for "where".
Paralex vs OQA: precision/recall based on the confidence threshold for correct answers.
Sempre comparison
Sempre
Lexicalised features, e.g. "See-in" = "tourist attraction" from Freebase.
Better performance for WebQuestions, but the performance is specific to this set.
Pretrained on WebQuestions.
Performance of the pretrained models was better than when trained on TREC/WikiAnswers (!)
"Requires significant lexical overlap between train and test sets"
Do we need all these operators?
Paraphrasing
Improvement for WebQuestions and WikiAnswers. TREC is simpler, and parsing operations are sufficient.
Rewrites
Usually worsen performance, but can improve specific questions. The majority of rewrites resulted in low-confidence derivations, but some high-confidence derivations were found through rewrites.
Weight learning
Improvement across all three sets
Do we need all these sources?
Sources
NELL is least helpful: largely a subset of the other KBs.
Open IE is very useful (shows we're not just relying on Freebase).
Open IE is even useful for WebQuestions (in theory, all answerable using only Freebase).
Outline
❏ Introduction
❏ Knowledge Bases or OIE?
❏ Overview of OQA
❏ Subtasks
  ❏ Paraphrasing
  ❏ Parsing
  ❏ Rewriting
  ❏ Executing
❏ Deriving answers
  ❏ Perceptron
  ❏ Learning
  ❏ Scoring
❏ Experiment
❏ Conclusion
Discussion of results
❏ Generalizability vs "power"
  ❏ OQA looks at general features, but doesn't take full advantage of training data
  ❏ Better results for a specific data set mean less generalizability
❏ "What is the time zone in SA?"
  ❏ Saw nearly identical questions in training
  ❏ Still can't answer it in test
❏ Rewrite not a write-off
  ❏ Some high-confidence answers are correct
❏ Noisy data
  ❏ Still makes some basic mistakes
  ❏ < 8% recall for Wikidata
  ❏ Natural language is hard...
Conclusion
❏ Up to twice the precision and recall of the previous state-of-the-art
❏ Generalizable
❏ Curated and extracted knowledge can work together
Questions, Suggestions, Answers, Money?
Discussion points
❏ Are the assumptions sound?
  ❏ Correct / incorrect answers are linearly separable
❏ Could a multi-layer NN improve on the results of the perceptron?
❏ More features / more data?
  ❏ Bootstrap data from the results of existing systems?
❏ Does anyone actually know how to use WikiData?