Toward a Session-Based Search Engine
Smitha Sriram, Xuehua Shen, ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign
U.S.A.
Motivation
• Information retrieval is inherently an interactive process
– A user’s information need is unlikely to be fully satisfied
by a single query execution
– A user often needs to interact with the system several times
through query reformulation and document browsing
– Thus, in general, a query exists within a search session
• A search session provides rich contextual information for
a query that can be exploited (e.g., previous queries and
clickthrough data)
• Such contextual information is mostly ignored by existing
search engines
• We aim to develop a session-based search engine that
exploits such contextual information to improve retrieval
Traditional vs. Session-based Retrieval
Traditional (“1-query”) retrieval
– Query = “IR applications”
– “IR” can mean either “information retrieval” or “infrared”
– Results: D1 (infrared), D2 (infrared), D3 (retrieval),
D4 (infrared), D5 (retrieval)
Session-based retrieval
– Query = “IR applications”; previous query = “retrieval systems”, …
– Frequency in viewed docs: infrared: 0, retrieval: 5, …
– Results: D3 (retrieval), D5 (retrieval)
– Uses more contextual information
– Gives more accurate results
Research Issues
• What is an appropriate architecture for supporting
session-based retrieval?
– How to manage session information?
• How can we detect session boundaries?
• What contextual information should we exploit?
• How can we exploit such contextual information to
improve document ranking?
• How can we display search results in the context
of a session?
A Client-Server Architecture
for Session-based IR
[Figure: client-server architecture diagram. Server side: Search
Engine, Docs. Client side: Personalized Agent, Session Manager,
User model, Search context, User Local Collection. Labeled flows:
query; Top-N results, shown as a ranked list (1.--, 2.--, 3.--, …)]
Advantages of Server-Side Processing
• Persistent user profiles (imagine if a user often
uses different machines)
• Have access to global user information
– Can exploit information about all users to identify
common access patterns
– Can exploit information about similar users to help
improve performance for any individual user
• Have access to all the documents
– Can perform more powerful statistical analysis
(e.g., to identify most frequently accessed docs)
– Can improve document representation over time
Advantages of a Client-Side Agent
• Can capture more information about the user, and thus
model the user more accurately
– Can exploit the complete interaction history (e.g.,
easily capture click-through information)
– Can exploit a user’s other activities (e.g., searching
immediately after reading an email)
– Can detect session boundaries more accurately
• More scalable (“distributed personalization”)
• Alleviates the privacy problem of personalization
Session Boundary Detection
• Detection is generally easier if done on the client side
– More information about the user can be exploited
– E.g., knowing that “logout” and “login” happened between
two queries
• The server side has access to query co-occurrence patterns,
which can help judge query coherence
• Possible clues for session boundary detection
– Time interval between queries
– Query coherence (based on word relatedness and/or query
log analysis)
– Activities in between two queries
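The clues listed above can be combined in a simple rule. The following is a minimal sketch and not the detector used in ACES: it stands in plain word overlap for "query coherence", and the names `QueryEvent`, `word_overlap`, `is_session_boundary` and all thresholds are hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class QueryEvent:
    text: str
    timestamp: float  # seconds; any monotonically increasing clock works


def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of the two queries' word sets (a crude coherence proxy)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def is_session_boundary(prev, curr, logged_out_between=False,
                        short_gap_s=300.0, long_gap_s=1800.0, min_overlap=0.2):
    """Return True if `curr` likely starts a new session (assumed thresholds)."""
    # Clue: activities in between two queries (e.g., logout then login).
    if logged_out_between:
        return True
    # Clue: time interval between queries.
    gap = curr.timestamp - prev.timestamp
    if gap > long_gap_s:
        return True
    if gap <= short_gap_s:
        # Quick reformulations stay in the session even with no shared words.
        return False
    # Clue: query coherence, consulted only for intermediate gaps.
    return word_overlap(prev.text, curr.text) < min_overlap
```

Word overlap misses related queries that share no terms (such as “retrieval systems” and “IR applications”), which is why the sketch trusts a short time gap before looking at coherence; a measure based on word relatedness or query logs would replace `word_overlap`.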
Useful Session Context Information
• Previous queries in the same session
• Documents viewed and not viewed so far in the
current session
• Other user activities during the same time as the
current session
• Context information collected in a similar session
by the current user or other users
• ……
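One way to hold the first two kinds of context listed above is a small per-session record. This is only an illustrative sketch; the class `SessionContext`, its field names, and the `record_query` method are hypothetical and do not come from ACES.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SessionContext:
    previous_queries: List[str] = field(default_factory=list)
    viewed_docs: List[str] = field(default_factory=list)   # clicked so far
    skipped_docs: List[str] = field(default_factory=list)  # shown, not clicked

    def record_query(self, query: str, shown: List[str], clicked: List[str]) -> None:
        """Store one query with the documents shown and the ones clicked."""
        self.previous_queries.append(query)
        clicked_set = set(clicked)
        for doc_id in shown:
            if doc_id in clicked_set:
                if doc_id not in self.viewed_docs:
                    self.viewed_docs.append(doc_id)
                # A document viewed now no longer counts as "not viewed".
                if doc_id in self.skipped_docs:
                    self.skipped_docs.remove(doc_id)
            elif doc_id not in self.viewed_docs and doc_id not in self.skipped_docs:
                self.skipped_docs.append(doc_id)


ctx = SessionContext()
ctx.record_query("retrieval systems", ["D1", "D2", "D3"], ["D3"])
ctx.record_query("IR applications", ["D1", "D5"], ["D1"])
```

Other user activities and context from similar sessions would need additional fields; they are left out because the slides do not say how they are represented.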
Session-based Retrieval Models
• Framework: The risk minimization retrieval framework
[Lafferty & Zhai 01, Zhai 02] can be naturally extended to support
session-based retrieval
• One possible model (KL-divergence model)
– Retrieval = estimating a query model + estimating a doc
model + computing their KL-divergence
– Session context information (and any other potentially
useful information) can be used to estimate a better
(session-based) query model
$$\hat{\theta}_D = \arg\max_{\theta} \, p(\theta \mid \mathrm{Doc})$$
$$\hat{\theta}_Q = \arg\max_{\theta} \, p(\theta \mid \mathrm{Query}, \mathrm{User}, \mathrm{CurrentSessionContext})$$
Refinement of this model leads to specific retrieval formulas
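A minimal sketch of KL-divergence ranking with unigram models follows. It assumes whitespace-tokenized documents and Jelinek-Mercer smoothing of the document model with a collection model; the smoothing method, the parameter `lam`, and the function names are assumptions for illustration, not details given in the slides.

```python
import math
from collections import Counter


def collection_model(docs):
    """Unigram model of the whole (non-empty) collection, used for smoothing."""
    counts = Counter(w for tokens in docs.values() for w in tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def kl_score(query_model, doc_tokens, background, lam=0.5, floor=1e-9):
    """Sum over w of p(w|theta_Q) * log p(w|theta_D).

    This equals -KL(theta_Q || theta_D) up to the query-model entropy,
    which is the same for every document, so the ranking is unchanged.
    """
    counts = Counter(doc_tokens)
    n = len(doc_tokens)
    score = 0.0
    for w, p_q in query_model.items():
        p_ml = counts[w] / n if n else 0.0
        p_bg = max(background.get(w, 0.0), floor)  # avoid log(0) for unseen words
        score += p_q * math.log((1 - lam) * p_ml + lam * p_bg)
    return score


def rank(query_model, docs, lam=0.5):
    background = collection_model(docs)
    return sorted(docs, key=lambda d: kl_score(query_model, docs[d], background, lam),
                  reverse=True)


docs = {
    "D1": "infrared sensor applications".split(),
    "D3": "information retrieval applications".split(),
}
one_query = {"ir": 0.5, "applications": 0.5}
session_based = {"ir": 0.25, "applications": 0.25, "retrieval": 0.25, "systems": 0.25}
```

With the toy data, `one_query` cannot separate the two documents, while `session_based` (which mixes in the previous query “retrieval systems”) ranks D3 first. Only the query model changed; the scoring function is the same in both cases.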
Session-based Result Presentation
• Retrieval results can be displayed in the context of the
current session
– Previous search results in the session can be exploited to
show which document has been consistently moving up in
ranking as the user is reformulating the query
– All the queries in the session can be combined and analyzed
to generate a subtopic space for the user’s information need,
and documents can be organized and displayed in this
space
• Session-based result presentation can
– Help a user digest the search results more effectively and
more efficiently
– Help a user to quickly focus on the important concept/topic
dimensions
– Help a user to figure out how to better formulate a query
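The first idea above, showing which documents keep moving up as the query is reformulated, can be sketched by tracking each document's rank across the session. The function names and the strict "rank improves at every query" criterion are assumptions made for this example.

```python
def rank_history(session_results):
    """Map doc id -> list of 1-based ranks, one entry per query (None if absent).

    `session_results` is a list of ranked doc-id lists, in query order.
    """
    history = {}
    for q_idx, ranking in enumerate(session_results):
        for position, doc_id in enumerate(ranking, start=1):
            ranks = history.setdefault(doc_id, [None] * len(session_results))
            ranks[q_idx] = position
    return history


def consistently_rising(history):
    """Docs retrieved by every query whose rank strictly improves each time."""
    rising = []
    for doc_id, ranks in history.items():
        if len(ranks) < 2 or None in ranks:
            continue
        if all(later < earlier for earlier, later in zip(ranks, ranks[1:])):
            rising.append(doc_id)
    return rising


results_per_query = [
    ["D1", "D2", "D3", "D5"],
    ["D2", "D3", "D1", "D5"],
    ["D3", "D2", "D5", "D1"],
]
history = rank_history(results_per_query)
```

Here `history["D3"]` records ranks 3, 2, 1, so D3 is the only document flagged by `consistently_rising`. A display could show these rank lists next to each result.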
ACES: A Contextual Engine for Search
• Architecture: server-side session management
• Session-boundary detection: probabilistic
measure of query similarity
• Session-based ranking: use the KL-div retrieval
model and estimate a query model based on
– Original query
– Displayed title and summary of viewed documents in
the same session
– Previous queries in the same search session
• Session-based result display: show ranks of
each doc w.r.t. all the previous queries
ACES System Architecture
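[Figure: ACES system architecture diagram. Web Browser, Internet,
Web/Application Server; exchanged items: Query, Clickthrough Data,
Search Result, Document Text. Server components: Search Engine
(Text DB) and Profile Capture (User Profile, RDBMS), fed by Query
and Clickthrough Data]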
Details of the Ranking Algorithm
• Query model updating using past queries q1, q2,…, qk
$$p(w \mid q') = p(w \mid q_1, q_2, \ldots, q_k) = \frac{1}{k} \sum_{i=1}^{k} \frac{c(w, q_i)}{|q_i| \, (1+\alpha)^{k-i}}$$
• Further query model updating using the displayed title
and summary of the viewed documents s1, s2,…, sk
$$p(w \mid q'') = \beta \, p(w \mid q') + \frac{1-\beta}{k-1} \sum_{i=1}^{k-1} \frac{c(w, s_i)}{|s_i| \, (1+\alpha)^{k-i}}$$
α is a decay factor to emphasize the most recent context
β is a parameter to control the influence of the clickthrough data
Currently all parameters are set in an ad hoc way
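The two updating steps can be sketched in code under one reading of the slide's formulas. The sketch assumes that an item from step i (out of k queries) is weighted by 1/(1+α)^(k−i), that α is the decay factor and β the interpolation weight on the query-based model, and that texts are pre-tokenized; the function names are hypothetical.

```python
from collections import Counter


def decayed_average(texts, alpha, k):
    """Average relative frequency of each word over `texts`, where the i-th
    text (1-based) is discounted by 1 / (1 + alpha) ** (k - i)."""
    model = Counter()
    for i, tokens in enumerate(texts, start=1):
        if not tokens:
            continue
        weight = 1.0 / ((1.0 + alpha) ** (k - i))
        for w, c in Counter(tokens).items():
            model[w] += weight * c / len(tokens) / len(texts)
    return dict(model)


def update_query_model(queries, summaries, alpha, beta):
    """q'': interpolate the query-history model q' with the model of the
    titles/summaries of viewed documents (clickthrough data)."""
    k = len(queries)
    q_prime = decayed_average(queries, alpha, k)
    if not summaries:
        return q_prime
    s_model = decayed_average(summaries, alpha, k)
    words = set(q_prime) | set(s_model)
    return {w: beta * q_prime.get(w, 0.0) + (1 - beta) * s_model.get(w, 0.0)
            for w in words}


queries = [["retrieval", "systems"], ["ir", "applications"]]
summaries = [["information", "retrieval", "evaluation", "retrieval"]]
model = update_query_model(queries, summaries, alpha=1.0, beta=0.5)
```

In this toy session “retrieval” receives the largest weight even though it is absent from the current query, because it occurs both in the earlier query and in the viewed summary. Under this reading the weights sum to 1 only when α = 0, so a real implementation would likely renormalize; the slides do not say whether ACES does.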
Demo: Exploiting Previous Queries in ACES
• TREC AP data + Topics 1-150 + judgments
• Allows us to compare traditional search and
contextual search
ACES is still far away from
a full-fledged session-based search engine…
Much further research needs to be done…
Architecture of Personalized System
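[Figure: architecture diagram of the personalized system. Server
side: Search Engine, Docs. Client side: Personalized Agent, Session
Manager, User model, Search context, Profile Collection. Labeled
flows: query; Top-N results, shown as a ranked list (1.---, 2.---,
3.--, …)]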
[Figure: generative model diagram. Query side: U, Model Selection,
θQ, Query generation, q. Document side: S, Model Selection, θD,
Document generation, d. Also labeled: C]
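[Figure: system architecture diagram. Web Browser, Internet,
Web/Application Server; exchanged items: Query, Clickthrough Data,
Search Result, Document Text. Server components: Search Engine
(AP Text DB) and Context Capturer (User Profile, RDBMS), fed by
Query and Clickthrough Data]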