QUERY EXPANSION
Ben Montgomery
AGENDA
  Introduction
    What is relevance feedback?
    What is query expansion?
  Papers
    Query Expansion Methods
    Query Recommendation
  Demo
  Discussion
RELEVANCE FEEDBACK
  Actively incorporating users into the search process can improve results
  1. User enters a short query
  2. System returns initial result set
  3. User marks some documents as relevant or not relevant
  4. System tries to compute a better representation of the query
  5. System returns a revised result set
Cite: Chapter 9 of Intro. To IR
ROCCHIO ALGORITHM (1971)
  Treat a query (q) and documents as vectors
  Maximize q’s similarity to relevant set
  Minimize q’s similarity to non-relevant set
Cite: Chapter 9 of Intro. To IR
ROCCHIO ALGORITHM
  In terms of clusters, the optimal q is the difference between the centroids of the relevant and non-relevant documents
Cite: Chapter 9 of Intro. To IR
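A minimal sketch of the Rocchio update in Python/numpy, assuming the query and documents are tf-idf term-weight vectors; the alpha, beta, gamma defaults are common textbook values, not taken from these slides:

import numpy as np

def rocchio_update(query_vec, relevant_docs, nonrelevant_docs,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio reformulation: move the query toward the centroid of the
    relevant documents and away from the centroid of the non-relevant ones."""
    rel = np.mean(relevant_docs, axis=0) if len(relevant_docs) else np.zeros_like(query_vec)
    nonrel = np.mean(nonrelevant_docs, axis=0) if len(nonrelevant_docs) else np.zeros_like(query_vec)
    new_q = alpha * query_vec + beta * rel - gamma * nonrel
    # Negative term weights are usually clipped to zero before re-querying
    return np.maximum(new_q, 0.0)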
POP QUIZ!
  You are using a naïve search engine with RF:
    Query = “colour”
    Query = “mountain lion”
  What’s the problem here?
POP QUIZ!
  Mountain Lion / Cougar / Puma
  But… my “relevant” results only refer specifically to mountain lions!
RELEVANCE FEEDBACK WEAKNESSES
  Misspellings
  Cross-language IR
  Vocabulary mismatch
  Assumes that the user is able to make a query which is similar to their result set
  Requires user participation
Cite: Chapter 9 of Intro. To IR
PSEUDO RF
  Automates relevance feedback
  Assumes the top k ranked results to be relevant
  Has been shown to improve search results
Cite: Chapter 9 of Intro. To IR
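A sketch of pseudo RF built on the rocchio_update sketch above; search and doc_vector are hypothetical helpers standing in for whatever retrieval engine is used:

def pseudo_relevance_feedback(query_vec, search, doc_vector, k=10):
    """Pseudo RF: assume the top-k hits are relevant and re-run the query."""
    initial_hits = search(query_vec)                    # ranked list of doc ids
    pseudo_relevant = [doc_vector(d) for d in initial_hits[:k]]
    # No non-relevant set is available without real user feedback, so gamma = 0
    expanded = rocchio_update(query_vec, pseudo_relevant, [], gamma=0.0)
    return search(expanded)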
WHAT IS QUERY EXPANSION?
  Word mismatch makes IR more difficult
  Query expansion supplements a base query with more words in an attempt to improve search results
  Can be manual or automatic
  “mountain lion” → “mountain lion cougar puma”
CAVEAT
  When could query expansion be detrimental?
CAVEAT
  We search using a different naïve search engine
    Intending to find more big cat pictures
    Query = “jaguar”
  “jaguar” → “jaguar dealership xj highway”
  Instead of getting mixed results, query expansion made all of the results car-related
  Query drift in pseudo RF
TYPES OF QUERY EXPANSION/RF
  Global methods
    Work independently of the query and result set
    Find word relationships in the corpus
    Use an external source (spelling, thesaurus, etc.)
  Local methods
    Analyze the documents returned by the base query
QUERY EXPANSION USING LOCAL AND GLOBAL DOCUMENT ANALYSIS
Xu and Croft (1996)
Note: not the 2000 journal version
OVERVIEW
  Compare the effectiveness of using both local and global feedback
  Show that local is generally better than global
  Propose a better QE method which combines local and global feedback (local context analysis)
PROBLEMS
  Short queries
  Vocabulary mismatch
GLOBAL ANALYSIS
  “The global context of a concept can be used to determine similarities between concepts”
    E.g. concept = word, context = co-occurrence
  In Phrasefinder
    Concept = noun group (sans stopwords)
    Context = 1 to 3 sentences
  Google
GLOBAL ANALYSIS
  Build an INQUERY database using the context and concept data
  INQUERY takes a query and returns a list of ranked phrases
  Some of the top ranked phrases are weighted and added to the query
  30 phrases added to each query
GLOBAL ANALYSIS
  Advantages of Phrasefinder
    Average performance improves
    Provides a neat resource
  Disadvantages
    Computationally expensive (space and time)
    Can make a query worse
LOCAL ANALYSIS – LOCAL FEEDBACK
  50
most frequent terms and phrases (nonstopword pairs) are extracted from the top
ranked documents
  Added
to the query and weighted using the
Rocchio formula α:β:γ = 1 : 1 : 0
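A rough sketch of that extraction step, assuming the top-ranked documents are available as token lists; Counter-based counting stands in for the paper's actual term and phrase extraction:

from collections import Counter

def local_feedback_terms(top_docs_tokens, stopwords, n_terms=50):
    """Pick the most frequent non-stopword terms from the top-ranked documents."""
    counts = Counter(
        tok for doc in top_docs_tokens for tok in doc if tok not in stopwords
    )
    return [term for term, _ in counts.most_common(n_terms)]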
LOCAL ANALYSIS – LOCAL FEEDBACK
  Advantages
    Efficiency
    Not a lot of overhead
  Disadvantages
    Slower at runtime
    May not work well if the base query returns poor results
LOCAL CONTEXT ANALYSIS
  Combines local and global feedback
  Retrieves top ranked passages of 300 words each
  Concepts in the top n passages are ranked using a variant of tf-idf weights
    Rewards co-occurrence and penalizes commonness
  Top m ranked concepts are added to the query using a weighted average function (see the sketch below)
    Qaux = w1*c1 + … + wn*cn
    New = 1.0*Old + 2.0*Qaux
    wi = 1 - 0.9*i/70
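A minimal sketch of that re-weighting in Python, assuming the query and each concept are represented as term-weight dicts and that the concept rank i starts at 1; the constants (1.0, 2.0, wi = 1 - 0.9*i/70) are taken from the slide:

def lca_expand(old_query, ranked_concepts):
    """Combine the base query with the top-ranked LCA concepts.

    old_query       : dict term -> weight for the base query
    ranked_concepts : list of {term: weight} dicts, best concept first
    """
    q_aux = {}
    for i, concept in enumerate(ranked_concepts, start=1):
        w_i = 1.0 - 0.9 * i / 70.0            # per-rank concept weight
        for term, weight in concept.items():
            q_aux[term] = q_aux.get(term, 0.0) + w_i * weight

    new_query = dict(old_query)                # New = 1.0*Old + 2.0*Qaux
    for term, weight in q_aux.items():
        new_query[term] = new_query.get(term, 0.0) + 2.0 * weight
    return new_query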
LOCAL CONTEXT ANALYSIS
  Advantages
    Computationally feasible O(CollectionSize)
    Proximity constraints
    Doesn’t filter out frequent concepts
  Disadvantages
    Slower than Phrasefinder
EXPERIMENTS
  Performed on TREC3, TREC4, and WEST
LOCAL CONTEXT ANALYSIS
  Higher 11-point average precision on all corpora
  With downweighting on WEST:
COMPARISON TO GLOBAL
  Local context analysis was clearly superior to Phrasefinder
  LCA seems to find better concepts
COMPARISON TO LOCAL
  Local feedback is closer to LCA in TREC3 and TREC4 than Phrasefinder
  Does not appear to work on WEST
  Sensitive to the number of documents
COMPARISON
EXPANSION RESULTS - PHRASEFINDER
  Query: What are the different techniques used to create self-induced hypnosis
EXPANSION RESULTS – LOCAL FEEDBACK
  Query: What are the different techniques used to create self-induced hypnosis
EXPANSION RESULTS - LCA
  Query: What are the different techniques used to create self-induced hypnosis
CONCLUSION
  LCA > Local Feedback > Global Analysis
QUERY EXPANSION USING RANDOM WALK MODELS
Collins-Thompson, Callan (2005)
OVERVIEW
  Use Markov chains to model word relationships and identify potential query expansion terms
  Use their model for query expansion and evaluate its effectiveness
  Their hypothesis is that combining term dependencies from different sources will lead to better query expansion methods
OVERVIEW
  Example: if ‘bank’ and ‘merger’ are relevant to a query, it is possible to infer that ‘negotiations’ may also be relevant
    Bank → agreement (C) → negotiate (C) → negotiations (M)
    Merger → talks (C) → negotiations (S)
  C = co-occurrence
  S = synonymy
  M = morphology
LINK TYPES
 
  Synonyms from Extended WordNet (SYN)
  Stemming using the Krovetz stemmer (STEM)
  General word association from the South Florida Word Association database (ASSOC)
  Co-occurrence in a large general Web corpus based on 700k Wikipedia articles (CWEB)
  Co-occurrence in the top retrieved documents (CTOP)
  Background smoothing used to link one word to all other words with a small probability (SM)
MARKOV CHAIN FRAMEWORK
  The authors used a Markov chain so that they could infer the properties of a target word, where target words are potential expansion terms
  Stationary distribution created by a random walk starting at ‘Parkinson’s Disease’
MULTI-STAGE MARKOV CHAIN MODEL
  W = {wi} -- the set of words in the vocabulary
  The relationship between wi and wj is modeled as a combination of directional links
  Link function λm(wi, wj) represents a specific link
    E.g. synonymy, stem, co-occurrence
MULTI-STAGE MARKOV CHAIN MODEL
  Start by generating a document of length N, given an author U
  0: Choose an initial word w0 with probability P(w0, U) [include base case]
  1: Having chosen wi-1
    Output wi-1 with probability 1 - α
    Else, with probability α, sample a new word based on links
MULTI-STAGE MARKOV CHAIN MODEL
  Switching to the query context, we have a query q consisting of words {qi}
  For each link type λm, a transition matrix C(q, m) is defined
  There are three stages: early, middle, and final
MULTI-STAGE MARKOV CHAIN MODEL
  Next, each stage is combined into a single transition matrix in k steps
  Γ(i) defines how a time step i maps to a stage s
  The probability that a chain reaches qi after k steps starting at w is defined so that longer chains are penalized
MULTI-STAGE MARKOV CHAIN MODEL
  Finally, the overall probability of generating a query term is:
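As a rough sketch only (not the authors' implementation), candidate expansion terms can be scored by a k-step random walk over a weighted mixture of the link-type matrices above; the mixture weights and per-stage link choices here are illustrative assumptions:

import numpy as np

def walk_scores(link_matrices, stage_mixtures, start_dist, k=4):
    """Distribution over vocabulary words after a k-step walk from the query words.

    link_matrices  : dict name -> row-stochastic |V| x |V| matrix (SYN, STEM, ...)
    stage_mixtures : list of k dicts mapping link name -> mixing weight per step
    start_dist     : length-|V| probability vector concentrated on the query words
    """
    dist = start_dist
    for step in range(k):
        mix = stage_mixtures[step]
        # Blend the link types active at this stage into one transition matrix
        combined = sum(w * link_matrices[name] for name, w in mix.items())
        combined = combined / combined.sum(axis=1, keepdims=True)  # keep rows stochastic
        dist = dist @ combined
    return dist  # high-probability words are candidate expansion terms

The cap of k = 4 steps matches the limit mentioned on the efficiency slide.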
QUERY EXPANSION MODEL
  They used Indri to index and search the corpus
    Part of the Lemur Toolkit
    Lemur is a toolkit designed to facilitate research in language modeling and information retrieval (IR), where IR is broadly interpreted to include such technologies as ad hoc and distributed retrieval, with structured queries, cross-language IR, summarization, filtering, and categorization.
  Has anyone used this?
BASELINE EXPANSION ALGORITHM
  Uses Ponte (1998) to choose expansion candidates
  Lafferty (1992) used to find phrases
  Calculates term weights using Lavrenko’s relevance model
  Candidates sorted by descending o(w); the top k are chosen
BASELINE EXPANSION ALGORITHM
  Each expansion term is then given a weight (sketch below)
  p(q|D) = document’s score in retrieval
    Dirichlet smoothing, mu = 1000
  Query and expansion terms have equal weight
  Has been shown to be competitive
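A hedged sketch of this kind of relevance-model term weighting, not the authors' exact code: Dirichlet-smoothed p(w|D) with mu = 1000, combined with each top document's retrieval score p(q|D); the data layout is an assumption:

def dirichlet_pwd(tf_wd, doc_len, p_w_collection, mu=1000.0):
    """Dirichlet-smoothed p(w|D), with mu = 1000 as on the slide."""
    return (tf_wd + mu * p_w_collection) / (doc_len + mu)

def relevance_model_weights(candidates, top_docs, p_q_given_d):
    """Weight each candidate w by sum over top docs of p(w|D) * p(q|D).

    candidates  : dict candidate term -> collection probability p(w|C)
    top_docs    : list of (term_count_dict, doc_length) for the top documents
    p_q_given_d : list of retrieval scores p(q|D), aligned with top_docs
    """
    weights = {}
    for w, p_w_coll in candidates.items():
        weights[w] = sum(
            dirichlet_pwd(counts.get(w, 0), dlen, p_w_coll) * pqd
            for (counts, dlen), pqd in zip(top_docs, p_q_given_d)
        )
    return weights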
BASELINE EXPANSION ALGORITHM
  Query weight example
ASPECT-BASED EXPANSION
  Hypothesis: “a desirable property of good expansion terms is that they somehow reflect one, and preferably more, aspects of the query”
  Each aspect is a word or phrase taken from the query (bag of words)
  The goal is to calculate the probability p(A|v)
    A = set of all aspects
    v = potential expansion term
ASPECT-BASED EXPANSION
  p(A|v) is calculated using the Markov chain model
  p(tj|v) is the probability that a random walk from v lands on tj
  Take the log of the equation
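A small illustrative sketch, assuming (as the log-probability example on the next slide suggests) that the per-aspect probabilities are combined by summing their logs; p_t_given_v is a hypothetical lookup of the random-walk probability p(tj|v):

import math

def aspect_score(aspects, v, p_t_given_v, floor=1e-10):
    """Score candidate v by summing log p(tj|v) over the query aspects tj."""
    return sum(math.log(max(p_t_given_v(t, v), floor)) for t in aspects)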
ASPECT-BASED EXPANSION
  Example of log-probabilities
ASPECT-BASED EXPANSION
  The top k candidates are chosen according to n()
  The candidate’s weight is a scaled version of n
ASPECT-BASED EXPANSION
  Baseline uses #weight, whereas this method uses #wand
  Helps prevent errors due to incomplete aspect coverage
EVALUATION
  Compared retrieval results with the baseline
  Compared different versions of the random walk
  Compared robustness
EVALUATION – COLLECTIONS USED
  AP89, topics 1-50
  TREC8 ad-hoc collection
  TREC 2001 wt10g, topics 501-550
  No stemming, why?
RESULTS
Every run was comparable to the baseline, with no statistically significant differences
RESULTS
  They suggest the Markov model is better at protecting against query drift (based on the MAP of the A run)
EFFICIENCY
  A lot of big sparse transition matrices
  Vocabulary size was approximately 300k
  They limited walks to a maximum of 4 steps
  How long would something like this take?
CONCLUSION
  The Markov model is comparable to other query expansion models (on the tests the authors ran)
  Authors believe there is a lot of room for improvement
QUERY RECOMMENDATION USING QUERY LOGS IN SEARCH ENGINES
Baeza-Yates, Hurtado, Mendoza
QUERY RECOMMENDATION
  When a user submits a query to a search engine, they receive a list of alternative queries
  Used to resolve
    Differences in vocabulary
    Ambiguity
    Lack of knowledge
  Allows the user to narrow in on a topic
QUERY RECOMMENDATION
  Example
OVERVIEW
  Authors find related queries using clustering
  Clustering is based on documents clicked on by users searching a query
  Results are ranked by comparison to the base query and the support (how attractive the results of a query are to users)
  Data comes from the TodoCL search engine
DISCOVERING RELATED QUERIES
  Use only queries that appear in the query log
  Create sessions of the form (query, (clickedURL)*)
  Recommendation steps
    Cluster sessions
    Find the cluster an input query belongs to
    Compute a ranking for every other query in that cluster
    Return query recommendations
QUERY EXPANSION
  Popularity of a query != support
    Some queries may be of little use when recommended
  They measure support as the fraction of times a recommended query has been clicked over how many times it has been recommended
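A tiny sketch of the support measure as described above (clicks on a recommendation divided by times it was shown); the log format is an assumption:

def support(query, recommendation_log):
    """recommendation_log: list of (recommended_query, was_clicked) pairs."""
    shown = [clicked for q, clicked in recommendation_log if q == query]
    return sum(shown) / len(shown) if shown else 0.0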
QUERY CLUSTERING
  Represent candidate recommendations as term-weight vectors
  Clustered using a k-means algorithm in the CLUTO package
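A rough sketch of the clustering step using scikit-learn's KMeans as a stand-in for the CLUTO package named on the slide; representing each query by the text of its clicked documents is a simplifying assumption:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_query_sessions(sessions, n_clusters=20):
    """sessions: list of (query, clicked_docs_text) pairs from the query log."""
    queries = [q for q, _ in sessions]
    docs_text = [text for _, text in sessions]            # text of clicked documents
    vectors = TfidfVectorizer().fit_transform(docs_text)  # term-weight vectors
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)
    return dict(zip(queries, labels))                     # query -> cluster id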
RESULTS
  Ranking of recommendations for “Chile rentals”
CONCLUSION
  Authors presented a method for query recommendation based on clustering query logs
  They are looking to improve on a number of levels
DEMO
  Digger.com