Optimizing search engines using clickthrough data

National Technical University of Athens
School of Electrical and Computer Engineering
Division of Computer Science and Technology
Problem

- Optimization of web search result ranking
- Ranking algorithms are mainly based on similarity:
  - Similarity between query keywords and page text keywords
  - Similarity between pages (PageRank)
- No consideration for the user's personal preferences

[Figure: example ranking, with digital-library results ranked high and vacation results ranked low]
Problem

- E.g.: the query "delos" can refer to:
  - Digital libraries
  - Archaeology
  - Vacation
- Room for improvement by incorporating user behaviour data: user feedback
- Use previous implicit search information to enhance result ranking

[Figure: after incorporating feedback, digital-library results ranked high and vacation results ranked low]
Types of user feedback

- Explicit feedback
  - The user explicitly judges the relevance of results with respect to the query
  - Direct evaluation of results
  - Costly in time and resources
  - Costly for the user → limited effectiveness
- Implicit feedback
  - Extracted from log files
  - Large amount of information
  - Real user behaviour (not expert judgement)
  - Indirect evaluation of results through click behaviour
Implicit feedback (Categories)

- Clicked results
  - Absolute relevance: clicked result → relevant
    - Risky: depends on the quality of user behaviour
    - Percentage of result clicks for a query
    - Frequently clicked groups of results
    - Links followed from the result page
  - Relative relevance: clicked result → more relevant than non-clicked results
    - More reliable
- Time
  - Between clicks (e.g., fast clicks → maybe bad results)
  - Spent on a result page (e.g., a lot of time → relevant page)
  - Until the first click or scroll (e.g., a long delay → maybe confusing results)
Implicit feedback (Categories)

- Query chains: sequences of reformulated queries to improve the results of an initial search
  - Detection:
    - Query similarity
    - Result set similarity
    - Time
  - Connection between results of different queries:
    - Enhancement of a bad query with another from the query chain
    - Comparison of result relevance between different query results
- Scrolling
  - Time (quality of results)
  - Scrolling behaviour (quality of results)
  - Percentage of page (results viewed)
- Other features
  - Save, print, copy etc. of a result page → maybe relevant
  - Exit type (e.g., closing the window → poor results)
Joachims' approach

- Clickthrough data
  - Relative relevance
  - Supported by a user behaviour study
- Method
  - Training of SVM functions
  - Training input: inequalities over query result rankings
  - Trained function: a weight vector for the examined features
  - The trained vector assigns weights to the examined features
- Experiments
  - Comparison of the method with existing search engines
Clickthrough data

- Form
  - Triplets (q, r, c) = (query, ranked results, links clicked)
  - ...recorded in the log file of a proxy server
- Relative relevance (for a specific query)
  - dk is more relevant than dl if:
    - dk was clicked, and
    - dl was not clicked, and
    - dk had a lower (worse) initial ranking than dl
  - Example (results d1..d5, clicks on d1, d3 and d5; see the sketch below):
    - d5 more relevant than d4
    - d5 more relevant than d2
    - d3 more relevant than d2
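This criterion is mechanical enough to state in code. Below is a minimal sketch in Python, assuming results are stored as an ordered list of document ids and clicks as a collection of ids; the function name and data shapes are illustrative, not from the paper.

```python
def preference_pairs(ranked_results, clicked):
    """Yield (more_relevant, less_relevant) pairs: a clicked document is
    preferred over every non-clicked document ranked above it."""
    clicked = set(clicked)
    for i, d_k in enumerate(ranked_results):
        if d_k not in clicked:
            continue
        # every non-clicked document presented above d_k was "skipped"
        for d_l in ranked_results[:i]:
            if d_l not in clicked:
                yield (d_k, d_l)

# The slide's example: results d1..d5, clicks on d1, d3 and d5
print(list(preference_pairs(["d1", "d2", "d3", "d4", "d5"],
                            ["d1", "d3", "d5"])))
# [('d3', 'd2'), ('d5', 'd2'), ('d5', 'd4')]
```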
Clickthrough data (Studies)

- Why not absolute relevance?
  - User behaviour is influenced by the initial ranking
- Study: rank and viewership
  - Percentage of queries in which the user viewed the search result presented at a particular rank
  - Conclusion: most of the time, users view only the first few results
- Study: rank and clicks
  - Percentage of queries in which the user clicked the result presented at a particular rank, in both normal and swapped conditions
  - Conclusion: users tend to click on higher-ranked results irrespective of content
Method (System training input)

- Data extracted from the log
  - Relevance inequalities: dk <_rq dl
    - Meaning: dk is more relevant than dl for query q
    - Example: d5 more relevant than d4; d5 more relevant than d2; d3 more relevant than d2
  - Construct relevance inequalities for every query q and every result pair (dk, dl) of q
- For each link di (a result of query q):
  - Construct a feature vector Φ(q, di)
Method (System training input)

- Feature vector:
  - Describes the quality of the match between a document di and a query q
  - Φ(q, di) = [rank_X, top1_X, top10_X, …] (an illustrative construction follows below)
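The slide leaves the feature list open-ended. Purely as an illustration, here is one way such a vector could be assembled in a meta-search setting, where each underlying engine X contributes a rank_X, top1_X and top10_X feature; the engine names and the 1/rank scaling are my assumptions, not taken from the paper.

```python
def phi(d, engine_rankings):
    """Build an illustrative Phi(q, d) from per-engine result lists.
    engine_rankings: dict mapping engine name -> ordered list of doc ids."""
    features = []
    for engine in sorted(engine_rankings):
        ranking = engine_rankings[engine]
        rank = ranking.index(d) + 1 if d in ranking else None
        features.append(1.0 / rank if rank else 0.0)          # rank_X
        features.append(1.0 if rank == 1 else 0.0)            # top1_X
        features.append(1.0 if rank and rank <= 10 else 0.0)  # top10_X
    return features

print(phi("d1", {"google": ["d2", "d1", "d3"], "msnsearch": ["d1", "d3"]}))
# [0.5, 0.0, 1.0, 1.0, 1.0, 1.0]
```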
Method (System training input)

- Weight vector: w = [w1, w2, …, wn]
  - Assigns a weight to each of the features in Φ(q, di)
  - S_di = w · Φ(q, di) assigns a score to document di for query q
  - w is unknown → it must be trained by solving the system of relative relevance inequalities derived from the clickthrough data
Method (System training)

- Initial input transformation:
  - dk <_rm dl ⇒ S_dk > S_dl ⇒
  - w · Φ(qm, dk) > w · Φ(qm, dl) ⇒
  - [w1, w2, …, wn] · Φ(qm, dk) > [w1, w2, …, wn] · Φ(qm, dl)
  - E.g., "d5 more relevant than d4" becomes w · Φ(qm, d5) > w · Φ(qm, d4)
  - A system of relevance inequalities for every query q and every result pair (dk, dl) of q
- Object of training:
  - Knowing, for every clicked link di, its feature vector,
  - find an optimal weight vector w that satisfies most of the inequalities above (optimal solution; formalized below)
  - With w known, we can calculate a score for every link, whether clicked or not, and hence a new ranking
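The system of strict inequalities is in general infeasible, so Joachims relaxes it into an SVM optimization problem with slack variables ξ and a margin; up to notation, it reads:

```latex
\begin{aligned}
\min_{\mathbf{w},\; \xi_{klq} \ge 0} \quad
  & \tfrac{1}{2}\, \mathbf{w} \cdot \mathbf{w} \;+\; C \sum_{k,l,q} \xi_{klq} \\
\text{subject to} \quad
  & \mathbf{w} \cdot \Phi(q_m, d_k) \;\ge\; \mathbf{w} \cdot \Phi(q_m, d_l) + 1 - \xi_{klq}
\end{aligned}
```

with one constraint per derived preference "dk more relevant than dl for query qm". Minimizing w · w maximizes the margin δ used in the geometric intuition that follows, and C trades margin size against training error.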
Method (System training)

- Intuitively
  - Example: 2-dimensional feature vector
    - x-axis: similarity between query terms and link text
    - y-axis: time spent on the link page
  - Looking for the optimal weight vector w
    - If the training input is r1 < r2 < r3 < r4 → optimal solution w1: text similarity matters more for the ranking
    - If the training input is r2 < r3 < r1 < r4 → optimal solution w2: time matters more for the ranking
  - w is optimized by maximizing the margin δ

[Figure: links 1-4 plotted in the 2-D feature space, with the two candidate weight vectors w1 and w2]
Experiments

- Based on a meta-search engine
  - Combination of results from different search engines
  - "Fair" presentation of results; for a meta-search engine combining 2 engines (sketched below):
    - the top z results of the meta-search contain
    - x results from engine A and
    - y results from engine B, such that
    - x + y = z and |x - y| <= 1
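A minimal sketch of this fairness rule, assuming two ordered result lists: it alternates between the engines, crediting each engine for the results it contributes and showing a document returned by both engines only once. This is a simplified version of the balanced combination described on the slide, not the paper's exact algorithm.

```python
def interleave(a, b, z):
    """Merge rankings a and b so that every prefix contains an (almost)
    equal number of results credited to each engine."""
    combined, seen = [], set()
    i = j = 0    # next candidate position in a and b
    na = nb = 0  # results credited to each engine so far
    while len(combined) < z and (i < len(a) or j < len(b)):
        from_a = (na <= nb and i < len(a)) or j >= len(b)
        if from_a:
            d, i, na = a[i], i + 1, na + 1
        else:
            d, j, nb = b[j], j + 1, nb + 1
        if d not in seen:  # a document both engines return is shown once
            seen.add(d)
            combined.append(d)
    return combined

print(interleave(["d1", "d2", "d3"], ["d2", "d4", "d5"], 4))
# ['d1', 'd2', 'd4', 'd3']
```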
Experiments

- Meta-search engine
- Assumptions
  - Users click more often on more relevant links
  - Preference for one of the search engines is not influenced by the ranking

[Figure: the meta-search engine]
Experiments (Results)

- Conditions
  - 20 users (university students)
  - 260 training queries collected
  - 20 days of training
  - 10 days of evaluation
- Comparison with
  - Google
  - MSNSearch
  - Toprank (a meta-search over Google, MSNSearch, Altavista, Excite, Hotbot)

[Table: comparison results against each baseline engine]
Experiments (Results)

- Weight vector

[Table: learned feature weights]
Open issues

- Trade-off between the amount of training data and its homogeneity
- Clustering algorithms to find homogeneous groups of users
- Adaptation to the properties of a particular document collection
- Incremental online learning/feedback algorithms
- Protection from spamming
Joachims' approach II

- Clickthrough data
  - Addition of new relative relevance criteria
  - Addition of query chains
- Method
  - Modified feature vector
    - Rank features
    - Term/document features
  - Constraints on the SVM optimization problem
    - To avoid trivial/incorrect solutions
Query chains

- Sequences of reformulated queries
  - In every query of the sequence, the user is searching for the same thing
- Poor results for the first query
  - Too many/few terms
  - Not representative terms
    - q1: "special collections" → q2: "rare books"
  - Incorrect spelling
    - q1: "Lexis Nexus" → q2: "Lexis Nexis"
- Execution of a new query
  - Addition/deletion of query terms to get better results
- Ways of detection
  - Term similarity between queries
  - Percentage of common results
  - Time between queries
Query chain detection method

- 1st strategy
  - Data recorded from the log for 1285 queries:
    - query, date, IP address, results returned, number of clicks on the results, uniquely assigned session id
  - Manual grouping of the queries into query chains
  - Division of the data set into a training and a testing set
  - Training
    - A feature vector for every pair of queries
    - Training of a weight vector indicating which features are most important for a query pair to form a query chain
  - Testing
    - Recognition of query chains in the testing set
    - Comparison with the manual grouping
    - 94.3% accuracy
- 2nd strategy
  - Assumption: two queries form a query chain if:
    - they come from the same IP, and
    - the time between the 2 queries is < 30 min
  - 91.6% accuracy
- Adoption of the (simpler) 2nd strategy (sketched below)
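A minimal sketch of the adopted 2nd strategy under the stated assumptions; the log record layout (ip, timestamp, query) is hypothetical.

```python
from datetime import datetime, timedelta

def group_into_chains(log, gap=timedelta(minutes=30)):
    """log: (ip, timestamp, query) tuples sorted by timestamp.
    Returns the detected query chains as lists of queries."""
    open_chain, last_seen, finished = {}, {}, []
    for ip, ts, query in log:
        if ip in open_chain and ts - last_seen[ip] <= gap:
            open_chain[ip].append(query)         # same IP, < 30 min: extend
        else:
            if ip in open_chain:
                finished.append(open_chain[ip])  # close the stale chain
            open_chain[ip] = [query]
        last_seen[ip] = ts
    finished.extend(open_chain.values())
    return finished

log = [("1.2.3.4", datetime(2007, 1, 1, 10, 0), "lexis nexus"),
       ("1.2.3.4", datetime(2007, 1, 1, 10, 2), "lexis nexis"),
       ("1.2.3.4", datetime(2007, 1, 1, 11, 0), "rare books")]
print(group_into_chains(log))
# [['lexis nexus', 'lexis nexis'], ['rare books']]
```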
Clickthrough data (studies)

- Conclusion: most of the time, users view the first 2 results
- Conclusion: most of the time, users view the next result after the last one they clicked

[Figures: rank vs. viewership plots supporting the two conclusions]
Relevance criteria (query)

- Click >q Skip Above
  - dk is more relevant than dl if:
    - dk was clicked, and
    - dl was not clicked, and
    - dk had a lower initial ranking than dl
  - Example: d5 more relevant than d4; d5 more relevant than d2; d3 more relevant than d2
- Click First >q No-Click Second
  - d1 is more relevant than d2 if:
    - d1 was clicked, and
    - d2 was not clicked
  - Example: d1 more relevant than d2
Relevance criteria (query chain)

- Click >q1 Skip Above
  - For the earlier query q' of the chain: dk is more relevant than dl if, in a later query, dk was clicked, dl was not clicked, and dk had a lower initial ranking than dl
  - Example (for query 1, q'): d8 more relevant than d7; d8 more relevant than d6
- Click First >q1 No-Click Second
  - d1 is more relevant than d2 if d1 was clicked and d2 was not clicked
  - Example (for query 1, q'): d5 more relevant than d6

[Figure: the result lists of query 1 and query 2 in the chain]
Relevance criteria (query chain II)

- Click >q1 Skip Earlier Query
  - A click in a later query of the chain is preferred over results skipped in the earlier query q' (sketched below)
  - Example (for query 1, q'): d6 more relevant than d1; d6 more relevant than d2; d6 more relevant than d4
- Click First >q1 Top 2 Earlier Query
  - The first click in a later query is preferred over the top 2 results of the earlier query q'
  - Example (for query 1, q'): d6 more relevant than d1; d6 more relevant than d2

[Figure: the result lists of query 1 and query 2 in the chain]
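The "Click > Skip Earlier Query" criterion can be sketched the same way as the within-query one, except that the earlier query's non-clicked results are compared against clicks from the later query. Which earlier results count as "skipped" is simplified here to "presented but not clicked", an assumption on my part.

```python
def earlier_query_pairs(earlier_results, earlier_clicks, later_clicks):
    """For each document clicked in the later query, prefer it over every
    result of the earlier query q' that was presented but not clicked."""
    skipped = [d for d in earlier_results if d not in set(earlier_clicks)]
    return [(d_new, d_old) for d_new in later_clicks for d_old in skipped
            if d_new != d_old]

# Matching the slide's example: q' returned d1..d4 with a click on d3,
# and in the follow-up query the user clicked d6
print(earlier_query_pairs(["d1", "d2", "d3", "d4"], ["d3"], ["d6"]))
# [('d6', 'd1'), ('d6', 'd2'), ('d6', 'd4')]
```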
Relevance criteria (Experiment)

- Sequences of reformulated queries
  - Searches by 16 subjects
- Relevance inequalities produced by the previous criteria
- Comparison with explicit relevance judgements on the query chains
Method (SVM training)

- The feature vector consists of
  - Rank features φrank_fi(d, q)
  - Term/document features φterms(d, q)
- Rank features
  - One φrank_fi(d, q) for every retrieval function fi examined
  - Every φrank_fi(d, q) consists of 28 features (sketched below)
    - One for each of the ranks 1, 2, 3, …, 10, 15, 20, …, 100 in the initial ranking
    - Set to 1 if the rank of d in the initial ranking is ≤ 1, 2, 3, …, 10, 15, 20, …, 100 respectively, else 0
  - Allows us to learn weights for, and combine, different retrieval functions
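The 28 thresholds are 1 through 10 plus 15, 20, …, 100 (10 + 18 = 28 features), which a direct sketch makes concrete:

```python
# Thresholds for the 28 rank features: 1..10 and 15, 20, ..., 100
THRESHOLDS = list(range(1, 11)) + list(range(15, 101, 5))
assert len(THRESHOLDS) == 28

def rank_features(rank):
    """rank: 1-based position of document d in retrieval function fi's
    initial ranking. Feature j fires when rank <= THRESHOLDS[j]."""
    return [1.0 if rank <= t else 0.0 for t in THRESHOLDS]

print(rank_features(3)[:10])
# [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```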
Method (SVM training)

- Term/document features
  - φterms(d, q) consists of N*M features
    - N: number of terms
    - M: number of documents
  - Feature (i, j) is set to 1 if di = d and tj ∈ q, else 0
  - Trains the weight vector to learn associations between terms and documents
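Since only the query's terms paired with the single document d ever fire, the N*M features are naturally stored sparsely. A sketch, where the sparse dict representation is my implementation choice rather than the paper's:

```python
def term_doc_features(d, query_terms, vocabulary, documents):
    """Sparse map of the fired (term, document) features: feature (tj, di)
    is 1 exactly when di = d and tj occurs in the query."""
    return {(t, d): 1.0 for t in query_terms
            if t in vocabulary and d in documents}

vocab = {"delos", "library", "vacation"}
docs = {"d1", "d2"}
print(term_doc_features("d1", ["delos", "library"], vocab, docs))
# {('delos', 'd1'): 1.0, ('library', 'd1'): 1.0}
```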
Method (Constraints)

- Problem
  - Most relevance criteria suggest that a lower-ranked link (in the initial ranking) is better than a higher-ranked one
  - Trivial solution: reverse the initial ranking by assigning negative values to the weight vector
- Solution (written out below)
  - Require each w · φrank_fi(d, q) ≥ a minimum positive value
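Written out, the slide's fix adds one constraint per rank-feature block to the ranking SVM problem above; ε denotes the chosen minimum positive value, whose exact magnitude the slide does not give:

```latex
\mathbf{w} \cdot \phi^{\mathrm{rank}}_{f_i}(d, q) \;\ge\; \epsilon \;>\; 0
\qquad \text{for every retrieval function } f_i \text{ and every pair } (d, q)
```

Because every learned rank score must stay positive, the trivial solution of inverting the initial ranking through negative weights becomes infeasible.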
Experiments

- Based on a meta-search engine
  - Initial retrieval function from Nutch
- Training
  - 9949 queries
  - 7429 clicks
  - Pqc: a total of 120134 relative relevance inequalities
  - Pnc: a set like Pqc, but generated without the use of query chains
- Results (rel0 = the initial ranking)

[Figure: ranking performance of the trained functions against rel0]
Experiments

- Results
  - Trained weights for the term/document features

[Table: learned term/document weights]
Open issues

- Tolerance of noise and malicious clicks
- Different weights in each query of a query chain
- Personalized ranking functions
- Improvement of performance
Related work (Fox et al. 2005)

- Evaluation of implicit measures
- Use of explicit ratings of user satisfaction
- Method
  - Bayesian modeling and decision trees
  - Gene analysis
- Results: importance of the combination of
  - Clickthrough
  - Time
  - Exit type

[Table: implicit feedback features used]
Related work (Agichtein, Brill, Dumais 2006)

- Incorporation of implicit feedback features directly into the initial ranking algorithm
  - BM25F
  - RankNet

[Table: implicit feedback features used]
Related work

- Query/link clustering based on features extracted from implicit feedback
- Scores based on the interrelations between queries and web pages
Possible extensions

- Utilization of time feedback
- Other types of feedback
  - Percentage of result clicks for a query
  - Links followed from the result page
  - Scrolling
  - Exit type
- Query chains
  - Improvement of the detection method
- Link association rules
  - For frequently clicked groups of results
- Combination with absolute-relevance clickthrough feedback
  - As a relative relevance criterion
  - As a feature
- Query/link clustering
- Constant retraining of the ranking functions
References

- Joachims, Radlinski. "Search Engines that Learn from Implicit Feedback."
- Joachims. "Optimizing Search Engines Using Clickthrough Data."
- Radlinski, Joachims. "Query Chains: Learning to Rank from Implicit Feedback."
- Fox et al. "Evaluating Implicit Measures to Improve Web Search."
- Kelly, Teevan. "Implicit Feedback for Inferring User Preference: A Bibliography."
- Agichtein, Brill, Dumais. "Improving Web Search Ranking by Incorporating User Behavior."
- Xue, Zeng, Chen, Yu, Ma, Xi, Fan. "Optimizing Web Search Using Web Click-through Data."