
Information Filtering
SI650: Information Retrieval
Winter 2010
School of Information
University of Michigan
Many slides are from Prof. ChengXiang Zhai’s lecture
2010 © University of Michigan
Lecture Plan
• Filtering vs. Retrieval
• Content-based filtering (adaptive filtering)
• Collaborative filtering (recommender systems)
Short vs. Long Term Info Need
• Short-term information need (Ad hoc retrieval)
– “Temporary need”, e.g., info about used cars
– Information source is relatively static
– User “pulls” information
– Application examples: library search, Web search
• Long-term information need (Filtering)
– “Stable need”, e.g., new data mining algorithms
– Information source is dynamic
– System “pushes” information to the user
– Application examples: news filters, recommender systems
Examples of Information Filtering
• News filtering
• Email filtering
• Movie/book recommenders
• Literature recommenders
• And many others …
Content-based Filtering vs.
Collaborative Filtering
• Basic filtering question: Will user U like item X?
• Two different ways of answering it
– Look at what U likes → characterize X → content-based filtering
– Look at who likes X → characterize U → collaborative filtering
• Can be combined
Content-based filtering is also called
1. “Adaptive Information Filtering” in TREC
2. “Selective Dissemination of Information” (SDI) in Library & Information Science
Collaborative filtering is also called
“Recommender Systems”
Why Bother?
• Why do we need recommendation?
• Why should Amazon provide recommendations?
• Why does Netflix spend millions to improve its recommendation algorithm?
• Why can’t we rely on a search engine for all of our recommendations?
The Long Tail
We need a way to discover, among tens or hundreds of thousands of items, the ones that match our interests and tastes.
Chris Anderson, ‘The Long Tail’, Wired, Issue 12.10 - October 2004
Part I: Adaptive Filtering
Adaptive Information Filtering
• Stable & long term interest, dynamic info. source
• System must make a delivery decision
immediately as a document “arrives”
[Figure: a stream of arriving documents (…) flows through the filtering system, which matches each one against “my interest” and delivers only the matching documents.]
AIF vs. Retrieval & Categorization
• Like retrieval over a dynamic stream of docs, but
(global) ranking is impossible
• Like online binary categorization, but with no initial
training data and with limited feedback
• Typically evaluated with a utility function
– Each delivered doc gets a utility value
– A good doc gets a positive value
– A bad doc gets a negative value
– E.g., Utility = 3·#good − 2·#bad (linear utility)
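As a concrete illustration of the linear utility above, here is a minimal sketch in Python (the 3/−2 weights are just the example from this slide):

```python
def linear_utility(delivered, credit=3, penalty=2):
    """Linear utility of a set of delivered documents.

    `delivered` is a list of booleans: True if a delivered doc was
    relevant (good), False otherwise. Utility = credit*#good - penalty*#bad.
    """
    good = sum(1 for rel in delivered if rel)
    bad = len(delivered) - good
    return credit * good - penalty * bad

# 4 relevant and 2 non-relevant docs delivered: 3*4 - 2*2 = 8
print(linear_utility([True, True, False, True, False, True]))  # 8
```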
A Typical AIF System
[Figure: a typical AIF system. Documents from the doc source are scored by a binary classifier against the user interest profile, which is initialized from the user’s profile text. Accepted docs are delivered to the user; the user’s feedback (evaluated with the utility function), together with the accumulated docs, drives a learning component that updates the profile and the classifier.]
Three Basic Problems in AIF
• Making the filtering decision (binary classifier)
– Doc text + profile text → yes/no
• Initialization
– Initialize the filter based on only the profile text or very
few examples
• Learning from
– Limited relevance judgments (only on “yes” docs)
– Accumulated documents
• All trying to maximize the utility
Major Approaches to AIF
• “Extended” retrieval systems
– “Reuse” retrieval techniques to score documents
– Use a score threshold for the filtering decision
– Learn to improve scoring with traditional feedback
– New approaches to threshold setting and learning
• “Modified” categorization systems
– Adapt to binary, unbalanced categorization
– New approaches to initialization
– Train with “censored” training examples
Difference between Ranking, Classification,
and Recommendation
• What is the query?
– Ranking: no query
– Classification: labeled data (a.k.a. training data)
– Recommendation: user + items
• What is the target?
– Ranking: order of objects
– Classification: labels of objects
– Recommendation: Score? Label? Order?
A General Vector-Space Approach
[Figure: the general vector-space approach. Each arriving doc vector is scored against the profile vector; thresholding turns the score into a yes/no decision. Delivered docs are judged (utility evaluation), and the feedback information drives two learners: vector learning, which updates the profile vector, and threshold learning, which updates the threshold.]
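A minimal sketch of this loop, assuming cosine scoring, a fixed starting threshold, and a Rocchio-style positive update of the profile vector; `feedback` is a hypothetical callback that returns the user’s judgment on delivered docs only.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def filter_stream(doc_vectors, profile, feedback, threshold=0.3, alpha=0.5):
    """Score each arriving doc against the profile; deliver if the score
    clears the threshold, then do a Rocchio-style update from feedback.

    feedback(i) -> True/False is available only for delivered docs (censored data).
    """
    delivered = []
    for i, d in enumerate(doc_vectors):
        if cosine(profile, d) >= threshold:
            delivered.append(i)
            if feedback(i):                     # positive judgment
                profile = profile + alpha * d   # move the profile toward the doc
            # threshold learning would also plug in here (see the next slides)
    return delivered, profile
```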
Difficulties in Threshold Learning
[Figure: delivered docs with scores 36.5 (R), 33.4 (N), 32.1 (R) lie above the threshold θ = 30.0; docs scoring 29.9, 27.3, … fall below it, are never delivered, and so remain unjudged (?).]
• Censored data
• Little/no labeled data
• Scoring bias due to vector learning
• Exploration vs. exploitation
Threshold Setting
in Extended Retrieval Systems
• Utility-independent approaches (generally not
working well, not covered in this lecture)
• Indirect (linear) utility optimization
– Logistic regression (score → prob. of relevance)
• Direct utility optimization
– Empirical utility optimization
– Expected utility optimization given score distributions
• All try to learn the optimal threshold
Logistic Regression
(Robertson & Walker 00)
• General idea: convert the score of D into p(R|D):
  $\log O(R|D) = \alpha + \beta\, s(D)$
• Fit the model using feedback data
• Linear utility is optimized with a fixed prob. cutoff
• But:
– Possibly incorrect parametric assumptions
– No positive examples initially
– Censored data and limited positive feedback
– Doesn’t address the issue of exploration
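A minimal sketch of the score-to-probability idea, assuming scikit-learn’s LogisticRegression and toy feedback data; with the 3/−2 linear utility, delivering is worthwhile when 3p − 2(1 − p) > 0, i.e., p > 0.4.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy feedback data: retrieval scores of judged (delivered) docs and relevance labels.
scores = np.array([36.5, 33.4, 32.1, 30.2, 29.5, 28.0]).reshape(-1, 1)
labels = np.array([1, 0, 1, 1, 0, 0])

# Fit log O(R|D) = alpha + beta * s(D), i.e., logistic regression on the score.
model = LogisticRegression().fit(scores, labels)

def deliver(score, p_cutoff=0.4):
    """With utility = 3*#good - 2*#bad, deliver when 3p - 2(1-p) > 0, i.e., p > 0.4."""
    p = model.predict_proba([[score]])[0, 1]
    return p > p_cutoff
```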
Direct Utility Optimization
• Given
– A utility function $U(C_{R^+}, C_{R^-}, C_{N^+}, C_{N^-})$
– Training data $D = \{\langle s_i, \{R, N, ?\}\rangle\}$
• Formulate utility as a function of the threshold and the training data: $U = F(\theta, D)$
• Choose the threshold by optimizing $F(\theta, D)$, i.e.,
  $\theta^* = \arg\max_{\theta} F(\theta, D)$
Empirical Utility Optimization
• Basic idea
– Compute the utility on the training data for each
candidate threshold (score of a training doc)
– Choose the threshold that gives the maximum utility
• Difficulty: Biased training sample!
– We can only get an upper bound for the true optimal
threshold.
• Solutions:
– Heuristic adjustment (lowering) of the threshold
– This leads to “beta-gamma threshold learning”
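A minimal sketch of the basic idea above: try each training-doc score as a candidate threshold and keep the one that maximizes the empirical (linear) utility on the judged docs.

```python
def best_threshold(judged, credit=3, penalty=2):
    """judged: list of (score, is_relevant) pairs for delivered (judged) docs.

    Returns the candidate threshold (one of the training scores) with the
    highest empirical linear utility. The result is biased upward because
    low-scoring docs were never delivered (censored training sample).
    """
    best_t, best_u = None, float("-inf")
    for t, _ in judged:
        u = sum(credit if rel else -penalty
                for s, rel in judged if s >= t)
        if u > best_u:
            best_t, best_u = t, u
    return best_t

# Beta-gamma threshold learning then heuristically lowers best_t.
```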
Score Distribution Approaches
(Arampatzis & van Hameren 01; Zhang & Callan 01)
• Assume generative model of scores p(s|R),
p(s|N)
• Estimate the model with training data
• Find the threshold by optimizing the expected
utility under the estimated model
• Specific methods differ in the way of defining
and estimating the scoring distributions
Gaussian-Exponential Distributions
• $p(s|R) \sim N(\mu, \sigma^2)$,   $p(s - s_0|N) \sim E(\lambda)$
  $p(s|R) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(s-\mu)^2}{2\sigma^2}}$
  $p(s - s_0|N) = \lambda\, e^{-\lambda (s - s_0)}$
(From Zhang & Callan 2001)
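A minimal sketch under these assumptions: fit the Gaussian to scores of relevant docs and the exponential to non-relevant scores above s0, then pick the threshold that maximizes expected linear utility on a score grid (the relevant/non-relevant prior is an extra assumption the real method must also estimate).

```python
import numpy as np

def fit_score_model(rel_scores, nonrel_scores, s0):
    """ML-style fits: p(s|R) ~ N(mu, sigma^2), p(s - s0|N) ~ Exp(lam)."""
    mu, sigma = np.mean(rel_scores), np.std(rel_scores)
    lam = 1.0 / np.mean(np.asarray(nonrel_scores) - s0)
    return mu, sigma, lam

def expected_utility_threshold(mu, sigma, lam, s0, prior_R, credit=3, penalty=2):
    """Threshold maximizing expected linear utility under the fitted model."""
    grid = np.linspace(s0, mu + 4 * sigma, 2000)
    p_R = prior_R * np.exp(-(grid - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    p_N = (1 - prior_R) * lam * np.exp(-lam * (grid - s0))
    gain = credit * p_R - penalty * p_N      # expected utility density at each score
    cum = np.cumsum(gain[::-1])[::-1]        # utility of delivering all docs scoring >= grid[i]
    return grid[int(np.argmax(cum))]
```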
Score Distribution Approaches (cont.)
• Pros
– Principled approach
– Arbitrary utility
– Empirically effective
• Cons
– May be sensitive to the scoring function
– Exploration not addressed
Part II: Collaborative Filtering
What is Collaborative Filtering (CF)?
• Making filtering decisions for an individual user
based on the judgments of other users
• Inferring an individual’s interests/preferences from those of other, similar users
• General idea
– Given a user u, find similar users {u1, …, um}
– Predict u’s preferences based on the preferences of
u1, …, um
CF: Assumptions
• Users with a common interest will have similar
preferences
• Users with similar preferences probably share the
same interest
• Examples
– “interest is IR” → “favor SIGIR papers”
– “favor SIGIR papers” → “interest is IR”
• A sufficiently large number of user preferences is available
CF: Intuitions
• User similarity (Paul Resnick vs. Rahul Sami)
– If Paul liked the paper, Rahul will like the paper
– ? If Paul liked the movie, Rahul will like the movie
– Suppose Paul and Rahul viewed similar movies in the
past six months …
• Item similarity
– Since 90% of those who liked Star Wars also liked
Independence Day, and you liked Star Wars
– You may also like Independence Day
The content of items “didn’t matter”!
Rating-based vs. Preference-based
• Rating-based: User’s preferences are
encoded using numerical ratings on items
– Complete ordering
– Absolute values can be meaningful
– But, values must be normalized to combine
• Preference-based: User’s preferences are
represented by partial ordering of items
(Learning to Rank!)
– Partial ordering
– Easier to exploit implicit preferences
A Formal Framework for Rating
[Figure: an m × n rating matrix. Rows are users U = {u_1, …, u_m}, columns are objects O = {o_1, …, o_n}. Some cells hold known ratings X_ij = f(u_i, o_j), e.g., 3, 1.5, 2, 1; most are unknown (?). f: U × O → R is an unknown function.]
The task:
• Assume f values are known for some (u, o) pairs
• Predict f values for the other (u, o) pairs
• Essentially function approximation, like other learning problems
Where are the intuitions?
• Similar users have similar preferences
– If u ≈ u’, then for all o’s, f(u,o) ≈ f(u’,o)
• Similar objects have similar user preferences
– If o ≈ o’, then for all u’s, f(u,o) ≈ f(u,o’)
• In general, f is “locally constant”
– If u ≈ u’ and o ≈ o’, then f(u,o) ≈ f(u’,o’)
– “Local smoothness” makes it possible to predict
unknown values by interpolation or extrapolation
• What does “local” mean?
Two Groups of Approaches
• Memory-based approaches
– f(u, o) = g(u)(o) ≈ g(u’)(o) if u ≈ u’   (g = preference function)
– Find “neighbors” of u and combine the g(u’)(o)’s
• Model-based approaches (not covered)
– Assume structures/model: object cluster, user
cluster, f’ defined on clusters
– f(u, o) = f’(c_u, c_o)
– Estimation & Probabilistic inference
Memory-based Approaches
(Resnick et al. 94)
• General ideas:
– x_ij: rating of object j by user i
– n_i: average rating of all objects by user i
– Normalized ratings: v_ij = x_ij − n_i
– Memory-based prediction:
  $\hat{v}_{aj} = k \sum_{i=1}^{m} w(a,i)\, v_{ij}$,   $k = 1 \big/ \sum_{i=1}^{m} w(a,i)$,   $\hat{x}_{aj} = \hat{v}_{aj} + n_a$
• Specific approaches differ in w(a, i) -- the
distance/similarity between user a and i
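A minimal sketch of the prediction formula above, with the user–user weight function `w` passed in (e.g., one of the similarity measures on the next slide). The dict-of-dicts rating format is an assumption for illustration, and the denominator uses |w| as a common variant of the slide’s k = 1/Σ w(a,i).

```python
def predict(ratings, w, a, j):
    """Predict user a's rating of object j from other users' normalized ratings.

    ratings[i][j] is user i's rating of object j (absent if unrated);
    w(a, i) is the similarity weight between users a and i.
    """
    n = {i: sum(r.values()) / len(r) for i, r in ratings.items()}  # n_i: user means
    num, den = 0.0, 0.0
    for i, r in ratings.items():
        if i != a and j in r:
            num += w(a, i) * (r[j] - n[i])   # weighted normalized rating v_ij
            den += abs(w(a, i))              # slide uses sum of w(a,i); |w| handles negatives
    v_aj = num / den if den else 0.0
    return v_aj + n[a]                       # x_aj = v_aj + n_a
```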
User Similarity Measures
• Pearson correlation coefficient (sum over commonly rated items):
  $w_p(a,i) = \frac{\sum_j (x_{aj} - n_a)(x_{ij} - n_i)}{\sqrt{\sum_j (x_{aj} - n_a)^2 \sum_j (x_{ij} - n_i)^2}}$
• Cosine measure:
  $w_c(a,i) = \frac{\sum_{j=1}^{n} x_{aj}\, x_{ij}}{\sqrt{\sum_{j=1}^{n} x_{aj}^2}\, \sqrt{\sum_{j=1}^{n} x_{ij}^2}}$
• Many other possibilities!
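A minimal sketch of the two measures for the dict-of-dicts rating format used above; these plug directly into `w` in the earlier prediction sketch.

```python
import math

def pearson(ra, ri):
    """Pearson correlation over commonly rated items (ra, ri: dicts item -> rating)."""
    common = set(ra) & set(ri)
    if not common:
        return 0.0
    na = sum(ra.values()) / len(ra)   # user means over all rated items
    ni = sum(ri.values()) / len(ri)
    num = sum((ra[j] - na) * (ri[j] - ni) for j in common)
    den = math.sqrt(sum((ra[j] - na) ** 2 for j in common) *
                    sum((ri[j] - ni) ** 2 for j in common))
    return num / den if den else 0.0

def cosine_sim(ra, ri):
    """Cosine similarity, with unrated items implicitly counted as 0."""
    num = sum(ra[j] * ri[j] for j in set(ra) & set(ri))
    den = (math.sqrt(sum(v * v for v in ra.values())) *
           math.sqrt(sum(v * v for v in ri.values())))
    return num / den if den else 0.0
```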
Improving User Similarity Measures
(Breese et al. 98)
• Dealing with missing values: default ratings
• Inverse User Frequency (IUF): similar to IDF
• Case Amplification: use w(a, i)^p, e.g., p = 2.5
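A minimal sketch of the last two heuristics in simplified form (the exact weighting schemes in Breese et al. differ in detail).

```python
import math

def iuf(num_users, num_raters_of_item):
    """Inverse User Frequency: downweight items that almost every user has rated."""
    return math.log(num_users / num_raters_of_item)

def amplify(w, p=2.5):
    """Case amplification: emphasize weights close to 1, keeping the sign."""
    return math.copysign(abs(w) ** p, w)
```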
A Network Interpretation
• If your neighbors love it,
you are likely to love it
• Closer neighbors make
a larger impact
• Measure closeness by the items you and your neighbor both liked/disliked
Network-based Approaches
• No clear notion of “distance”
now, but the “scores” are
estimated based on some
propagation process.
• Score propagation based
on random walk
• Two approaches:
– Starting from the user, how
likely/how long can I reach the
object?
– Starting from the object, how
likely/how long can I hit the
user?
Random Walk and Hitting Time
[Figure: a random walk starting at vertex i moves to vertex j with probability 0.7 and to vertex k with probability 0.3; A is a target set of vertices.]
• Hitting time
– T^A: the first time the random walk is at a vertex in A
• Mean hitting time
– h_i^A: the expectation of T^A given that the walk starts from vertex i
Computing Hitting Time
Example (from the figure): h_i^A = 0.7·h_j^A + 0.3·h_k^A + 1, with h = 0 on A.
• T^A: the first time the random walk is at a vertex in A: $T^A = \min\{t : X_t \in A,\ t \ge 0\}$
• h_i^A: the expectation of T^A given that the walk starts from vertex i
• $h_i^A = \begin{cases} \sum_{j \in V} p(i \to j)\, h_j^A + 1, & i \notin A \\ 0, & i \in A \end{cases}$
• This fixed-point system can be solved by iterative computation.
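A minimal sketch of the iterative computation: fix h to 0 on A and repeatedly apply the update above everywhere else. The dict-of-dicts transition format is an assumption for illustration; if A is unreachable from a vertex, its value simply grows until the iteration cap.

```python
def hitting_times(P, A, iters=100):
    """Mean hitting time to the target set A for every vertex.

    P[i] is a dict {j: p(i -> j)} of outgoing transition probabilities.
    """
    h = {i: 0.0 for i in P}
    for _ in range(iters):
        new_h = {}
        for i, out in P.items():
            if i in A:
                new_h[i] = 0.0   # already in A: hitting time 0
            else:
                new_h[i] = 1.0 + sum(p * h.get(j, 0.0) for j, p in out.items())
        h = new_h
    return h
```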
Bipartite Graph and Hitting Time
[Figure: a weighted bipartite graph with vertex sets V1 (e.g., queries) and V2 (e.g., URLs), edge weights such as w(i, j) = 3, and a target set A of queries.]
• Bipartite graph:
– Edges run only between V1 and V2; no edges inside V1 or V2
– Edges are weighted
– E.g., V1 = queries, V2 = URLs
• Transition probabilities: $p(i \to j) = \frac{w(i,j)}{d_i}$, where $d_i = \sum_{j \in V_2} w(i,j)$
• Expected proximity of query i to the query set A: the hitting time of i → A, h_i^A
• Convert to a directed graph; one group can even be collapsed
Example: Query Suggestion
[Screenshot: a search box containing the query “Welcome to the hotel california”]
Suggestions:
• hotel california
• eagles hotel california
• hotel california band
• hotel california by the eagles
• hotel california song
• lyrics of hotel california
• listen hotel california eagle
Generating Query Suggestion using Click
Graph
(Mei et al. 08)
[Figure: a query–URL click graph. Queries (“aa”, “mexiana”, “american airline”) connect to URLs (www.aa.com, www.theaa.com/travelwatch/planner_main.jsp, en.wikipedia.org/wiki/Mexicana) with edge weights freq(q, url), e.g., 300 and 15; A marks the target query.]
• Construct a (kNN) subgraph from the query log data (of a predefined number of queries/URLs)
• Compute the transition probabilities p(i → j)
• Compute the hitting time h_i^A
• Rank candidate queries using h_i^A
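A minimal end-to-end sketch on a toy click graph, reusing the `hitting_times` sketch from the earlier slide; the query/URL names follow the figure, but the click counts and extra edges are made up for illustration.

```python
# Toy click counts w(query, url); the numbers are illustrative only.
clicks = {
    ("aa", "www.aa.com"): 300,
    ("aa", "www.theaa.com/travelwatch/planner_main.jsp"): 15,
    ("american airline", "www.aa.com"): 120,
    ("mexicana", "en.wikipedia.org/wiki/Mexicana"): 40,
}

# Build a directed graph over queries and URLs with p(i -> j) = w(i, j) / d_i.
P = {}
for (q, u), w in clicks.items():
    P.setdefault(q, {})[u] = w
    P.setdefault(u, {})[q] = w
for i, out in P.items():
    d = sum(out.values())
    P[i] = {j: w / d for j, w in out.items()}

# Rank candidate queries by hitting time to the target query set A = {"aa"}.
h = hitting_times(P, A={"aa"})
candidates = ["american airline", "mexicana"]
print(sorted(candidates, key=lambda q: h[q]))  # closer queries come first
```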
Result: Query Suggestion
Query = friends
[Table: suggestions for “friends” from the hitting-time method compared with Google and Yahoo, including: wikipedia friends, friendship, friends poem, friendster, friends episode guide, friends scripts, how to make friends, true friends, friends tv show wikipedia, secret friends, friends home page, friends reunited, friends warner bros, hide friends, the friends series, hi 5 friends, friends official site, find friends, friends (1994), poems for friends, friends quotes.]
Summary
• Filtering is “easy”
– The user’s expectation is low
– Any recommendation is better than none
– Making it practically useful
• Filtering is “hard”
– Must make a binary decision, though ranking is
also possible
– Data sparseness
– “Cold start”
Example: Netflix
References (Adaptive Filtering)
• General papers on TREC filtering evaluation
– D. Hull, The TREC-7 Filtering Track: Description and Analysis, TREC-7 Proceedings.
– D. Hull and S. Robertson, The TREC-8 Filtering Track Final Report, TREC-8 Proceedings.
– S. Robertson and D. Hull, The TREC-9 Filtering Track Final Report, TREC-9 Proceedings.
– S. Robertson and I. Soboroff, The TREC 2001 Filtering Track Final Report, TREC-10 Proceedings.
• Papers on specific adaptive filtering methods
– Stephen Robertson and Stephen Walker (2000), Threshold Setting in Adaptive Filtering, Journal of Documentation, 56:312-331, 2000.
– ChengXiang Zhai, Peter Jansen, and David A. Evans, Exploration of a heuristic approach to threshold learning in adaptive filtering, SIGIR 2000 (poster presentation).
– Avi Arampatzis and Andre van Hameren, The Score-Distributional Threshold Optimization for Adaptive Binary Classification Tasks, SIGIR 2001.
– Yi Zhang and Jamie Callan, Maximum Likelihood Estimation for Filtering Threshold, SIGIR 2001.
– T. Ault and Y. Yang, kNN, Rocchio and Metrics for Information Filtering at TREC-10, TREC-10 Proceedings, 2002.
References (Collaborative Filtering)
• Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., and Riedl, J., “GroupLens: An Open Architecture for Collaborative Filtering of Netnews,” CSCW, pp. 175-186.
• P. Resnick and H. R. Varian, Recommender Systems, CACM, 1997.
• Cohen, W.W., Schapire, R.E., and Singer, Y. (1999), “Learning to Order Things,” Journal of AI Research, Volume 10, pages 243-270.
• Freund, Y., Iyer, R., Schapire, R.E., and Singer, Y. (1999), An efficient boosting algorithm for combining preferences, Machine Learning Journal, 1999.
• Breese, J. S., Heckerman, D., and Kadie, C. (1998), Empirical Analysis of Predictive Algorithms for Collaborative Filtering, in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pp. 43-52.
• Alexandrin Popescul and Lyle H. Ungar, Probabilistic Models for Unified Collaborative and Content-Based Recommendation in Sparse-Data Environments, UAI 2001.
• N. Good, J.B. Schafer, J. Konstan, A. Borchers, B. Sarwar, J. Herlocker, and J. Riedl, “Combining Collaborative Filtering with Personal Agents for Better Recommendations,” Proceedings of AAAI-99, pp. 439-446, 1999.
• Qiaozhu Mei, Dengyong Zhou, Kenneth Church, Query Suggestion Using Hitting Time, CIKM'08, pages 469-478.