Query Auto-Completion using an Abstract Language Model
BACHELOR THESIS BY NATALIE PRANGE, 02.11.2016
Introduction to query auto-completion
• Query auto-completion (QAC): suggesting completions for a query prefix
entered by a user
• Objective:
• Reduce the user's effort to enter a query
• Prevent spelling mistakes
• Assist in formulating a query
→ A QAC algorithm must suggest the desired query at a high rank after a minimal number of keystrokes
A common solution
• Suggest the most popular queries from a query log that match the given prefix
• Problems with this approach:
  • Recent and sufficiently large query logs are hard to obtain
  • Queries that are asked for the first time are not suggested
A language-model-based solution
• The focus of this work is on whole questions
→ Possible solution: use a language model
• Language model = probability distribution learned over sequences of words
• Can be used to predict the word most likely to follow a given sequence
• Typical problem: data sparsity
This approach
• Use an abstract language model: specific entities are replaced by abstract types
  • E.g.: "Who played Gandalf in The Lord of the Rings?"
    → "Who played [fictional character] in [film]?"
• When the language model predicts a type, entities are inserted again
• A prominence score and word vector similarity are used to rank suggestions
[Figure: Basic pipeline of the auto-completion algorithm]
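A toy sketch of this abstraction step (the entity-to-type mapping below is a made-up stand-in, not the thesis's entity recognition):

```python
# Hypothetical entity -> type mapping; a real system would use a knowledge base.
ENTITY_TYPES = {"Gandalf": "[fictional character]",
                "The Lord of the Rings": "[film]"}

def abstract_question(question):
    """Replace known entity mentions with their abstract type."""
    for entity, entity_type in ENTITY_TYPES.items():
        question = question.replace(entity, entity_type)
    return question

print(abstract_question("Who played Gandalf in The Lord of the Rings ?"))
# -> "Who played [fictional character] in [film] ?"
```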
Building the abstract language model
• Choosing a type for each entity:
  • Out of the list of types of an entity, choose the most general but still meaningful type
  • E.g.: Albert Einstein: [Person, Astronomer, Diet Follower, Topic, …] → Person
• Choose a type according to a hand-picked list of preferred types, as sketched below
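A minimal sketch of the type selection, assuming a hypothetical hand-picked preference list (names and ordering are illustrative only):

```python
# Hand-picked list of preferred types, ordered from most to least preferred
# (illustrative; not the thesis's actual list).
PREFERRED_TYPES = ["Person", "Film", "Fictional Character", "Location", "Organization"]

def choose_type(entity_types):
    """Pick the most general but still meaningful type for an entity."""
    for preferred in PREFERRED_TYPES:
        if preferred in entity_types:
            return preferred
    # Fall back to the first listed type if no preferred type matches.
    return entity_types[0] if entity_types else None

# Example: Albert Einstein -> "Person"
print(choose_type(["Person", "Astronomer", "Diet Follower", "Topic"]))
```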
Building the abstract language model
• The training set consists of questions in which recognized entities are replaced by
their type
• An n-gram language model is learned on these questions
• N-gram model:
  • Estimate the probability of a word given its (n−1) predecessors:
  • $P(w_m \mid w_1, \ldots, w_{m-1}) \approx P(w_m \mid w_{m-(n-1)}, \ldots, w_{m-1}) = \dfrac{\mathrm{count}(w_{m-(n-1)}, \ldots, w_{m-1}, w_m)}{\mathrm{count}(w_{m-(n-1)}, \ldots, w_{m-1})}$
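A small Python sketch of this maximum-likelihood n-gram estimate (toy corpus, no smoothing):

```python
from collections import Counter

def ngram_counts(sentences, n):
    """Count n-grams and their (n-1)-gram contexts over tokenized sentences."""
    ngrams, contexts = Counter(), Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    return ngrams, contexts

def probability(ngrams, contexts, context, word):
    """P(word | context) = count(context, word) / count(context)."""
    denom = contexts[tuple(context)]
    return ngrams[tuple(context) + (word,)] / denom if denom else 0.0

# Abstracted training questions (entities already replaced by their types).
corpus = [["who", "played", "[fictional character]", "in", "[film]", "?"],
          ["who", "played", "[person]", "in", "[film]", "?"]]
ngrams, contexts = ngram_counts(corpus, 3)
print(probability(ngrams, contexts, ["who", "played"], "[fictional character]"))  # 0.5
```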
Building the Word2vec model
• Word2vec uses a neural network to learn vector representations of words
• The more common context two words share, the higher the cosine similarity of their
word vectors
→ Can be used to compute semantic similarity between words
• E.g.: vector(Berlin) – vector(Germany) + vector(France) ≈ vector(Paris)
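For illustration, a Word2vec model could be trained with gensim roughly as follows (gensim 4.x API; the thesis's actual training setup and parameters are not reproduced here):

```python
from gensim.models import Word2Vec

# Tiny placeholder corpus of tokenized questions, not the thesis's training data.
sentences = [["who", "played", "gandalf", "in", "the", "lord", "of", "the", "rings"],
             ["who", "directed", "the", "lord", "of", "the", "rings"]]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

# Cosine similarity between two word vectors.
print(model.wv.similarity("gandalf", "rings"))

# With a large corpus, analogies such as
#   vector(Berlin) - vector(Germany) + vector(France) ≈ vector(Paris)
# can be queried via model.wv.most_similar(positive=[...], negative=[...]).
```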
Predicting possible next words
• Normally:
  • The last (n−1) complete words form the n-gram context
  • The last incomplete word is the prefix of the next word
• Here: a predicted type can correspond to multiple words typed by the user
  • E.g.: "Who played [Fictional Character|Iron Man] in the first A"
  • → Which words are part of an entity name and which are normal words?
• Get predictions for all possible prefixes and their corresponding n-gram contexts, as sketched below
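A hedged sketch of how such (context, prefix) splits could be enumerated; the function name and context handling are assumptions, and in the thesis the entities already recognized in the context would additionally be replaced by their types:

```python
def candidate_splits(typed_words, n):
    """Yield (context, prefix) pairs: the prefix may cover the last
    1..len(typed_words) words, the context is the preceding (n-1)-gram."""
    for k in range(1, len(typed_words) + 1):
        prefix = " ".join(typed_words[-k:])           # possible word/entity prefix
        context = tuple(typed_words[:-k])[-(n - 1):]  # remaining n-gram context
        yield context, prefix

# "Who played Iron Man in the first A" -> try "a", "first a", "the first a", ...
typed = ["who", "played", "iron", "man", "in", "the", "first", "a"]
for context, prefix in candidate_splits(typed, 3):
    print(context, "|", prefix)
```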
Inserting entities for types
• Insert entities for every type predicted by the n-gram model
• Entities must match both the predicted type and the given prefix
• Prefix trees are used for retrieval of entities
[Figure: Prefix tree built from the words to, too, tool, tour, see, fee]
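A minimal prefix-tree (trie) sketch for retrieving stored names by prefix; the node layout is illustrative, and a real implementation would additionally filter by the predicted type:

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # next character -> child node
        self.is_entity = False  # True if a full name ends at this node

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, name):
        node = self.root
        for ch in name:
            node = node.children.setdefault(ch, TrieNode())
        node.is_entity = True

    def complete(self, prefix):
        """Return all stored names that start with the given prefix."""
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results, stack = [], [(node, prefix)]
        while stack:
            cur, name = stack.pop()
            if cur.is_entity:
                results.append(name)
            for ch, child in cur.children.items():
                stack.append((child, name + ch))
        return results

trie = Trie()
for word in ["to", "too", "tool", "tour", "see", "fee"]:
    trie.insert(word)
print(trie.complete("to"))  # ['to', 'too', 'tool', 'tour'] (order may vary)
```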
Rating and ranking
• 1st scenario: the question prefix does not contain any entity
  • Use a prominence score to rate entities
  • Normalize the score
  • $s_{\mathrm{final}} = p_{n\text{-gram}} \cdot (s_{\mathrm{norm}})^{0.3}$
• 2nd scenario: the question prefix contains at least one entity
  • Compute the word vector similarity between the contained and the suggested entity
  • Fill in the word vector similarity for $s_{\mathrm{norm}}$
• Normal words are assigned a fixed score in both scenarios (see the sketch below)
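A sketch of both rating cases; the score names follow the slides, while the fixed word score and how it enters the formula are placeholders:

```python
TYPE_EXPONENT = 0.3      # from the slides: s_final = p_ngram * (s_norm)^0.3
FIXED_WORD_SCORE = 0.5   # hypothetical fixed score for normal words

def score_entity(p_ngram, s_norm):
    """1st scenario: s_norm is the normalized prominence score.
    2nd scenario: s_norm is the word-vector similarity to the entity
    already contained in the question prefix."""
    return p_ngram * (s_norm ** TYPE_EXPONENT)

def score_word(p_ngram):
    """Normal words receive a fixed score in place of s_norm (placeholder)."""
    return p_ngram * (FIXED_WORD_SCORE ** TYPE_EXPONENT)

print(score_entity(0.12, 0.8))  # example entity score
```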
Filling up the completion suggestions
• Use words that were not predicted by the n-gram model
• Use the prominence score and word count for rating the fill-up words
• Always append completely typed entities to the completion suggestions
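A heavily hedged sketch of the fill-up step; how the prominence score and word count are actually combined is not specified here, so a simple product is used as a stand-in:

```python
def fill_up(suggestions, prefix, vocabulary, typed_entities=()):
    """suggestions: ranked n-gram suggestions; vocabulary maps each word to a
    (prominence_score, word_count) pair (both hypothetical here)."""
    # Words matching the prefix that the n-gram model did not predict.
    extra = [(w, prom * count) for w, (prom, count) in vocabulary.items()
             if w.startswith(prefix) and w not in suggestions]
    extra.sort(key=lambda pair: pair[1], reverse=True)
    filled = suggestions + [w for w, _ in extra]
    # Completely typed entities are always appended to the suggestion list.
    filled += [e for e in typed_entities if e not in filled]
    return filled

vocab = {"who": (0.9, 120), "what": (0.8, 100), "where": (0.7, 80)}
print(fill_up(["what"], "wh", vocab))  # ['what', 'who', 'where']
```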
Evaluation: Metrics
• User Interaction:
  • (total keystrokes + total selections) / (total number of characters in the question)
• Mean Reciprocal Rank (MRR), as sketched after this list:
  • $q_c$: matching completion suggestion, $S$: the completion suggestions
  • $RR(q_c, S) = \frac{1}{\mathrm{rank}(q_c, S)}$
  • RR is computed after typing the first letter of a word
  • MRR is the mean of the RRs of every word / entity name in every question
• Percentage of unidentified entities
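A small sketch of the MRR computation as defined above:

```python
def reciprocal_rank(matching_completion, suggestions):
    """RR(q_c, S) = 1 / rank(q_c, S); 0 if q_c is not suggested at all."""
    try:
        return 1.0 / (suggestions.index(matching_completion) + 1)
    except ValueError:
        return 0.0

def mean_reciprocal_rank(cases):
    """cases: (matching_completion, suggestions) pairs, one per word /
    entity name, collected after typing its first letter."""
    return sum(reciprocal_rank(q_c, s) for q_c, s in cases) / len(cases)

cases = [("who", ["who", "what"]),             # rank 1 -> RR = 1.0
         ("gandalf", ["gollum", "gandalf"])]   # rank 2 -> RR = 0.5
print(mean_reciprocal_rank(cases))             # 0.75
```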
Evaluation: Tested algorithm versions
• Baseline:
  • Without filling up completion suggestions
  • Without appending complete words
• 2nd Version: Without appending complete words
• 3rd Version: Only prominence score for rating (no word vectors)
• 4th Version: Complete algorithm as described
Evaluation: Results
[Results figure: completion suggestions using the Word2vec model]
[Results figure: completion suggestions using only an entity prominence score]
Future work
• Integrate proper entity recognition
  • E.g.: USA → United States of America
  • Robustness against spelling mistakes
  • Multiple-word suggestions