Personalizing Web Search using
Long Term Browsing History
Nicolaas Matthijs, Cambridge
Filip Radlinski, Microsoft
In Proceedings of WSDM 2011
Motivating example
Query: "pia workshop" — the relevant result (screenshot omitted)
Outline
 Approaches to personalization
 The proposed personalization strategy
 Evaluation metrics
 Results
 Conclusions and future work
Approaches to Personalization
 Observed user interactions
 Short-term interests
 Sriram et al. [24] and [6]: session data is too sparse to personalize effectively
 Longer-term interests
 [23, 16]: model users by classifying previously visited Web pages to promote URLs
 Joachims [11]: uses click-through data to learn a ranking function
 PClink [7] and Teevan et al. [28]: build a rich user profile
 Other related approaches: [20, 25, 26]
 Representing the user
 Teevan et al. [28]: rich keyword-based representations, but no use of Web page characteristics
 Commercial personalization systems
 Google
 Yahoo!
Personalization Strategy
User profile generation workflow (figure):
 Data extraction — from the browsing history, collect visited URLs with number of visits, plus previous searches and click-through data
 Extracted terms — title unigrams, metadata description unigrams, metadata keywords, full-text unigrams, extracted terms, and noun phrases
 Filtering — WordNet dictionary filtering, Google N-Gram filtering, or no filtering
 Weighting — TF, TF×IDF, or BM25 weighting
 Output — user profile terms and weights
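The workflow above can be sketched in a few lines of Python. This is a hypothetical illustration, not the authors' AlterEgo code: the `dictionary` argument stands in for the WordNet / Google N-Gram filtering step, and only the TF weighting variant is shown.

```python
import re
from collections import Counter

def build_profile(pages, dictionary):
    """Sketch of the profile-generation workflow: extract unigrams
    from each visited page's title, metadata and full text, filter
    them against a dictionary, and weight by term frequency (TF)."""
    counts = Counter()
    for page in pages:
        text = " ".join([page.get("title", ""),
                         page.get("meta_description", ""),
                         page.get("full_text", "")])
        for term in re.findall(r"[a-z]+", text.lower()):
            if term in dictionary:          # filtering step
                counts[term] += 1
    total = sum(counts.values())
    # TF weighting: relative frequency of each surviving term
    return {t: c / total for t, c in counts.items()}

profile = build_profile(
    [{"title": "dog walking", "full_text": "dog cat dog"}],
    dictionary={"dog", "cat", "walking"})
# profile["dog"] == 3/5 == 0.6
```

The other weighting schemes (TF×IDF, BM25) would replace only the final dictionary comprehension.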
Personalized Search
Browsing history is captured by a Firefox add-on, AlterEgo
Example term counts from a user's history (figure): dog 1, cat 10, india 2, mit 4, search 93, amherst 12, vegas 1
Personalized Search
Data extraction
From the same history (dog 1, cat 10, india 2, mit 4, search 93, amherst 12, vegas 1), user profile terms are extracted from the visited pages (figure: word cloud with terms such as forest, hiking, walking, baby, monkey, csail, mit, infant, banana, research, robot, web, search, retrieval, ir, hunt)
Personalized Search
Term weighting
Each extracted profile term is assigned a weight (figure: terms such as web, search, retrieval, ir, hunt with weights such as 6.0, 2.7, 1.6, 1.3, 0.2)
Term Weighting
 TF: term frequency
 wTF(ti) = (occurrences of ti in the history) / (total terms), e.g. wTF(cow) = 2/100 = 0.02
 TF-IDF: TF scaled by inverse document frequency
 wTF-IDF(ti) = log(N / DF(ti)) × wTF(ti), e.g. log(10³/107) × 2/100 = 0.08
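The two weighting schemes translate directly to code. This sketch is illustrative: the slide does not fix the logarithm base, so the natural log is an assumption here, and the function names are not from the paper.

```python
import math

def w_tf(term_count, total_terms):
    """TF weight: relative frequency of the term in the profile."""
    return term_count / total_terms

def w_tf_idf(term_count, total_terms, n_docs, doc_freq):
    """TF-IDF weight: TF scaled by log(N / DF).  The log base is an
    assumption; the slide leaves it unspecified."""
    return math.log(n_docs / doc_freq) * w_tf(term_count, total_terms)

# Slide example: "cow" occurs 2 times among 100 profile terms
tf_cow = w_tf(2, 100)  # 0.02
```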
Term Weighting
 Personalized BM25:
 wpBM25(ti) = log [ (rti + 0.5)(N − nti + 0.5) / ((nti + 0.5)(R − rti + 0.5)) ]
 N: documents in the corpus ("World"); nti: corpus documents containing ti
 R: documents in the user's browsing history; rti: history documents containing ti
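The pBM25 weight is a single expression; a minimal sketch with illustrative parameter names:

```python
import math

def w_pbm25(r_i, n_i, R, N):
    """Personalized BM25 term weight.
    N   - documents in the corpus (the "World")
    n_i - corpus documents containing term t_i
    R   - documents in the user's browsing history
    r_i - history documents containing t_i"""
    return math.log((r_i + 0.5) * (N - n_i + 0.5)
                    / ((n_i + 0.5) * (R - r_i + 0.5)))
```

A term that is common in the user's history but rare in the corpus gets a large positive weight; a term the user rarely visits relative to its corpus frequency gets a negative one.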
Re-ranking
 Use the user profile to re-rank top results returned
by a search engine
 Candidate document vs. snippets
 Snippets are more effective. Teevan et al. [28]
 Allow straightforward personalization implementation
 Matching
 For each term occurs both in snippet and user profile, its
weight will be added to the snippet’s score
 Unique matching
Scoring
methods
 Counts each unique term once
 Language model
 Language model for user profile, weights for terms are used
as frequency counts
 PClink Dou et al. [7]
11
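The matching and unique-matching schemes can be sketched as follows (the language-model variant is omitted; function names are illustrative, not from the paper):

```python
def score_snippet(snippet_terms, profile, unique=True):
    """Matching: sum the profile weights of terms appearing in the
    snippet; unique matching counts each distinct term once."""
    terms = set(snippet_terms) if unique else snippet_terms
    return sum(profile.get(t, 0.0) for t in terms)

def rerank(snippets, profile):
    """Re-rank result snippets by descending personalized score
    (ties keep the original engine order, since sort is stable)."""
    return sorted(snippets, key=lambda s: score_snippet(s, profile),
                  reverse=True)
```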
Evaluation Metrics
 Relevance judgements
 NDCG@10 = (1/Z) Σᵢ₌₁¹⁰ (2^relᵢ − 1) / log₂(1 + i)
 Side-by-side
 Show two alternative rankings side by side and ask users to vote for the best
 Clickthrough-based
 Examine the query and click logs of a large search engine
 Interleaved
 New metric for personalized search
 Combine the results of two rankings (alternating between them, omitting duplicates)
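The NDCG@10 formula written out, with the normalizer Z computed from the ideal ordering of the same relevance grades:

```python
import math

def ndcg_at_10(rels):
    """NDCG@10 = (1/Z) * sum_{i=1..10} (2**rel_i - 1) / log2(1 + i),
    where Z is the DCG of the ideal (sorted) ordering of the same
    relevance grades, so a perfect ranking scores 1.0."""
    def dcg(grades):
        return sum((2 ** r - 1) / math.log2(1 + i)
                   for i, r in enumerate(grades[:10], start=1))
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0
```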
Offline Evaluation
 6 participants, 2 months of browsing history
 Judge relevance of top 50 pages returned by
Google for 12 queries
 25 general queries (16 from TREC 2009 Web
search track), each participant will judge 6
 Most recent 40 search queries, judge 5
 Each participant took about 2.5 hours to complete
13
Offline Evaluation
Personalization strategies (Rel: relative weighting)

| Strategy   | Full text | Title | Meta keywords | Meta descr. | Extracted terms | Noun phrases | Term weights | Snippet scoring | Google rank | URLs visited |
|------------|-----------|-------|---------------|-------------|-----------------|--------------|--------------|-----------------|-------------|--------------|
| MaxNDCG    | -         | Rel   | Rel           | -           | -               | Rel          | TF-IDF       | LM              | 1/log       | v=10         |
| MaxQuer    | -         | -     | -             | -           | Rel             | Rel          | TF           | LM              | 1/log       | v=10         |
| MaxNoRank  | -         | -     | Rel           | -           | -               | -            | TF           | LM              | -           | v=10         |
| MaxBestPar | -         | Rel   | Rel           | -           | Rel             | -            | pBM25        | LM              | 1/log       | v=10         |

 MaxNDCG: yields the highest average NDCG
 MaxQuer: improves the most queries
 MaxNoRank: the highest-NDCG method that does not take the original Google ranking into account
 MaxBestPar: obtained by greedily selecting each parameter sequentially
Offline Evaluation
Offline evaluation performance

| Method             | Average NDCG  | +/=/- Queries |
|--------------------|---------------|---------------|
| Google             | 0.502 ± 0.067 | -             |
| Teevan et al. [28] | 0.518 ± 0.062 | 44/0/28       |
| PClink             | 0.533 ± 0.057 | 13/58/1       |
| MaxNDCG            | 0.573 ± 0.042 | 48/1/23       |
| MaxQuer            | 0.567 ± 0.045 | 52/2/18       |
| MaxNoRank          | 0.520 ± 0.060 | 13/52/7       |
| MaxBestPar         | 0.566 ± 0.044 | 45/5/22       |

 MaxNDCG and MaxQuer are both significantly better than Google
 Interestingly, MaxNoRank is significantly better than Google and Teevan et al. (perhaps due to overfitting on the small offline data set)
 PClink improves the fewest queries, but beats Teevan et al. on average NDCG
Offline Evaluation
Distribution of relevance at rank for the Google and MaxNDCG rankings (figure)
 3,600 relevance judgements collected: 9% Very Relevant, 32% Relevant, 58% Non-Relevant
 Google: places many Very Relevant results in the top 5
 MaxNDCG: adds more Very Relevant results to the top 5, and succeeds in adding Very Relevant results between ranks 5 and 10
Online Evaluation
 Large-scale interleaved evaluation, with users performing their day-to-day real searches
 The first 50 results were requested from Google; personalization strategies were picked at random
 The Team-Draft interleaving algorithm [18] was used to produce a combined ranking
 41 users issued 7,997 queries (6,033 query impressions); 6,534 queries and 5,335 query impressions received a click
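Team-Draft interleaving can be sketched as below. This follows the common formulation of the algorithm in [18], not necessarily the authors' exact implementation: the two rankings alternately pick their highest-ranked result not yet in the combined list, with a coin flip breaking ties in pick order, and clicks are later credited to whichever team contributed the clicked result.

```python
import random

def team_draft_interleave(rank_a, rank_b, rng=None):
    """Team-Draft interleaving sketch: returns the combined ranking
    and, for each position, the team ("A" or "B") that contributed
    the result, used to credit clicks as votes."""
    rng = rng or random.Random(0)   # fixed seed for a reproducible demo
    sources = {"A": rank_a, "B": rank_b}
    combined, team, seen = [], [], set()
    count = {"A": 0, "B": 0}
    while any(d not in seen for d in rank_a + rank_b):
        # the team with fewer picks goes next; a coin flip breaks ties
        if count["A"] != count["B"]:
            side = "A" if count["A"] < count["B"] else "B"
        else:
            side = "A" if rng.random() < 0.5 else "B"
        doc = next((d for d in sources[side] if d not in seen), None)
        if doc is None:  # this ranking is exhausted; take from the other
            side = "B" if side == "A" else "A"
            doc = next((d for d in sources[side] if d not in seen), None)
        combined.append(doc)
        team.append(side)
        seen.add(doc)
        count[side] += 1
    return combined, team

combined, team = team_draft_interleave([1, 2, 3], [3, 1, 2])
```

Duplicates are omitted because each document enters the combined list at most once, matching the description on the metrics slide.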
Online Evaluation
Results of the online interleaving test

| Method     | Queries | Google Vote | Re-ranked Vote |
|------------|---------|-------------|----------------|
| MaxNDCG    | 2090    | 624 (39.5%) | 955 (60.5%)    |
| MaxQuer    | 2273    | 812 (47.3%) | 905 (52.7%)    |
| MaxBestPar | 2171    | 734 (44.8%) | 906 (55.2%)    |

Queries impacted by personalization

| Method     | Unchanged    | Improved    | Deteriorated |
|------------|--------------|-------------|--------------|
| MaxNDCG    | 1419 (67.9%) | 500 (23.9%) | 171 (8.2%)   |
| MaxQuer    | 1639 (72.1%) | 423 (18.6%) | 211 (9.3%)   |
| MaxBestPar | 1485 (68.4%) | 467 (21.5%) | 219 (10.1%)  |
Online Evaluation
Figures: rank differences for deteriorated (light) and improved (dark) queries for MaxNDCG; degree of personalization per rank
 For the large majority of deteriorated queries, the clicked results lose only 1 rank
 The majority of clicked results that improved a query gain 1 rank
 The gains from personalization are on average more than double the losses
 MaxNDCG is the most effective personalization method
Conclusions
 First large-scale personalized-search study with online evaluation
 The proposed personalization techniques significantly outperform both default Google and the best previously published approaches
 Key to modeling users: exploit the characteristics and structure of Web pages
 A long-term, rich user profile is beneficial
Future Exploration
 Parameter extension
 Learning parameter weights
 Using other fields (e.g., headings in HTML) and learning their weights
 Incorporating temporal information
 How much browsing history is needed?
 Should the weights of older terms decay?
 How can page-visit duration be used?
 Making use of more personal data
 Using the extracted profiles for other purposes
Thank you!