
Information Filtering
SI650: Information Retrieval
Winter 2010
School of Information
University of Michigan
Many slides are from Prof. ChengXiang Zhai’s lecture
2010 © University of Michigan
Lecture Plan
• Filtering vs. Retrieval
• Content-based filtering (adaptive filtering)
• Collaborative filtering (recommender systems)
Short vs. Long Term Info Need
• Short-term information need (Ad hoc retrieval)
– “Temporary need”, e.g., info about used cars
– Information source is relatively static
– User “pulls” information
– Application examples: library search, Web search
• Long-term information need (Filtering)
– “Stable need”, e.g., new data mining algorithms
– Information source is dynamic
– System “pushes” information to the user
– Application examples: news filters, recommender systems
Examples of Information Filtering
• News filtering
• Email filtering
• Movie/book recommenders
• Literature recommenders
• And many others …
Content-based Filtering vs.
Collaborative Filtering
• Basic filtering question: Will user U like item X?
• Two different ways of answering it
– Look at what U likes → characterize X → content-based filtering
– Look at who likes X → characterize U → collaborative filtering
• Can be combined
Content-based filtering is also called
1. “Adaptive Information Filtering” in TREC
2. “Selective Dissemination of Information” (SDI) in Library & Information Science
Collaborative filtering is also called
“Recommender Systems”
Why Bother?
• Why do we need recommendation?
• Why should Amazon provide recommendations?
• Why does Netflix spend millions to improve its recommendation algorithm?
• Why can’t we rely on a search engine for all of our recommendations?
The Long Tail
We need a way to discover, among tens or hundreds of thousands of items, the ones that match our interests and tastes.
Chris Anderson, ‘The Long Tail’, Wired, Issue 12.10 - October 2004
Part I: Adaptive Filtering
Adaptive Information Filtering
• Stable & long term interest, dynamic info. source
• System must make a delivery decision
immediately as a document “arrives”
[Figure: a stream of arriving documents (…) flows through the filtering system, which matches each one against “my interest” and delivers only the matching documents.]
AIF vs. Retrieval & Categorization
• Like retrieval over a dynamic stream of docs, but
(global) ranking is impossible
• Like online binary categorization, but with no initial
training data and with limited feedback
• Typically evaluated with a utility function
– Each delivered doc gets a utility value
– A good doc gets a positive value
– A bad doc gets a negative value
– E.g., Utility = 3·#good − 2·#bad (linear utility)
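As a concrete illustration of the linear utility above, here is a minimal sketch in Python (the 3/−2 weights are just the example from this slide):

```python
def linear_utility(delivered, credit=3, penalty=2):
    """Linear utility of a set of delivered documents.

    `delivered` is a list of booleans: True if a delivered doc was
    relevant (good), False otherwise. Utility = credit*#good - penalty*#bad.
    """
    good = sum(1 for rel in delivered if rel)
    bad = len(delivered) - good
    return credit * good - penalty * bad

# 4 relevant and 2 non-relevant docs delivered: 3*4 - 2*2 = 8
print(linear_utility([True, True, False, True, False, True]))  # 8
```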
A Typical AIF System
[Figure: a typical AIF system. Documents from the doc source are scored by a binary classifier against the user interest profile, which is initialized from the user’s profile text. Accepted docs are delivered to the user; the user’s feedback (evaluated with the utility function), together with the accumulated docs, drives a learning component that updates the profile and the classifier.]
Three Basic Problems in AIF
• Making the filtering decision (binary classifier)
– Doc text + profile text → yes/no
• Initialization
– Initialize the filter based on only the profile text or very
few examples
• Learning from
– Limited relevance judgments (only on “yes” docs)
– Accumulated documents
• All trying to maximize the utility
Major Approaches to AIF
• “Extended” retrieval systems
– “Reuse” retrieval techniques to score documents
– Use a score threshold for the filtering decision
– Learn to improve scoring with traditional feedback
– New approaches to threshold setting and learning
• “Modified” categorization systems
– Adapt to binary, unbalanced categorization
– New approaches to initialization
– Train with “censored” training examples
Difference between Ranking, Classification,
and Recommendation
• What is the query?
– Ranking: no query
– Classification: labeled data (a.k.a. training data)
– Recommendation: user + items
• What is the target?
– Ranking: order of objects
– Classification: labels of objects
– Recommendation: Score? Label? Order?
A General Vector-Space Approach
[Figure: the general vector-space approach. Each arriving doc vector is scored against the profile vector; thresholding turns the score into a yes/no decision. Delivered docs are judged (utility evaluation), and the feedback information drives two learners: vector learning, which updates the profile vector, and threshold learning, which updates the threshold.]
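A minimal sketch of this loop, assuming cosine scoring, a fixed starting threshold, and a Rocchio-style positive update of the profile vector; `feedback` is a hypothetical callback that returns the user’s judgment on delivered docs only.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def filter_stream(doc_vectors, profile, feedback, threshold=0.3, alpha=0.5):
    """Score each arriving doc against the profile; deliver if the score
    clears the threshold, then do a Rocchio-style update from feedback.

    feedback(i) -> True/False is available only for delivered docs (censored data).
    """
    delivered = []
    for i, d in enumerate(doc_vectors):
        if cosine(profile, d) >= threshold:
            delivered.append(i)
            if feedback(i):                     # positive judgment
                profile = profile + alpha * d   # move the profile toward the doc
            # threshold learning would also plug in here (see the next slides)
    return delivered, profile
```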
Difficulties in Threshold Learning
[Figure: delivered docs with scores 36.5 (R), 33.4 (N), 32.1 (R) lie above the threshold θ = 30.0; docs scoring 29.9, 27.3, … fall below it, are never delivered, and so remain unjudged (?).]
• Censored data
• Little/no labeled data
• Scoring bias due to vector learning
• Exploration vs. exploitation
Threshold Setting
in Extended Retrieval Systems
• Utility-independent approaches (generally not
working well, not covered in this lecture)
• Indirect (linear) utility optimization
– Logistic regression (score → prob. of relevance)
• Direct utility optimization
– Empirical utility optimization
– Expected utility optimization given score distributions
• All try to learn the optimal threshold
Logistic Regression
(Robertson & Walker 00)
• General idea: convert the score of D into p(R|D):
  $\log O(R|D) = \alpha + \beta\, s(D)$
• Fit the model using feedback data
• Linear utility is optimized with a fixed prob. cutoff
• But:
– Possibly incorrect parametric assumptions
– No positive examples initially
– Censored data and limited positive feedback
– Doesn’t address the issue of exploration
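A minimal sketch of the score-to-probability idea, assuming scikit-learn’s LogisticRegression and toy feedback data; with the 3/−2 linear utility, delivering is worthwhile when 3p − 2(1 − p) > 0, i.e., p > 0.4.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy feedback data: retrieval scores of judged (delivered) docs and relevance labels.
scores = np.array([36.5, 33.4, 32.1, 30.2, 29.5, 28.0]).reshape(-1, 1)
labels = np.array([1, 0, 1, 1, 0, 0])

# Fit log O(R|D) = alpha + beta * s(D), i.e., logistic regression on the score.
model = LogisticRegression().fit(scores, labels)

def deliver(score, p_cutoff=0.4):
    """With utility = 3*#good - 2*#bad, deliver when 3p - 2(1-p) > 0, i.e., p > 0.4."""
    p = model.predict_proba([[score]])[0, 1]
    return p > p_cutoff
```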
Direct Utility Optimization
• Given
– A utility function $U(C_{R^+}, C_{R^-}, C_{N^+}, C_{N^-})$
– Training data $D = \{\langle s_i, \{R, N, ?\}\rangle\}$
• Formulate utility as a function of the threshold and the training data: $U = F(\theta, D)$
• Choose the threshold by optimizing $F(\theta, D)$, i.e.,
  $\theta^* = \arg\max_{\theta} F(\theta, D)$
Empirical Utility Optimization
• Basic idea
– Compute the utility on the training data for each
candidate threshold (score of a training doc)
– Choose the threshold that gives the maximum utility
• Difficulty: Biased training sample!
– We can only get an upper bound for the true optimal
threshold.
• Solutions:
– Heuristic adjustment (lowering) of the threshold
– This leads to “beta-gamma threshold learning”
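A minimal sketch of the basic idea above: try each training-doc score as a candidate threshold and keep the one that maximizes the empirical (linear) utility on the judged docs.

```python
def best_threshold(judged, credit=3, penalty=2):
    """judged: list of (score, is_relevant) pairs for delivered (judged) docs.

    Returns the candidate threshold (one of the training scores) with the
    highest empirical linear utility. The result is biased upward because
    low-scoring docs were never delivered (censored training sample).
    """
    best_t, best_u = None, float("-inf")
    for t, _ in judged:
        u = sum(credit if rel else -penalty
                for s, rel in judged if s >= t)
        if u > best_u:
            best_t, best_u = t, u
    return best_t

# Beta-gamma threshold learning then heuristically lowers best_t.
```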
Score Distribution Approaches
(Arampatzis & van Hameren 01; Zhang & Callan 01)
• Assume generative model of scores p(s|R),
p(s|N)
• Estimate the model with training data
• Find the threshold by optimizing the expected
utility under the estimated model
• Specific methods differ in the way of defining
and estimating the scoring distributions
Gaussian-Exponential Distributions
• $p(s|R) \sim N(\mu, \sigma^2)$,   $p(s - s_0|N) \sim E(\lambda)$
  $p(s|R) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(s-\mu)^2}{2\sigma^2}}$
  $p(s - s_0|N) = \lambda\, e^{-\lambda (s - s_0)}$
(From Zhang & Callan 2001)
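A minimal sketch under these assumptions: fit the Gaussian to scores of relevant docs and the exponential to non-relevant scores above s0, then pick the threshold that maximizes expected linear utility on a score grid (the relevant/non-relevant prior is an extra assumption the real method must also estimate).

```python
import numpy as np

def fit_score_model(rel_scores, nonrel_scores, s0):
    """ML-style fits: p(s|R) ~ N(mu, sigma^2), p(s - s0|N) ~ Exp(lam)."""
    mu, sigma = np.mean(rel_scores), np.std(rel_scores)
    lam = 1.0 / np.mean(np.asarray(nonrel_scores) - s0)
    return mu, sigma, lam

def expected_utility_threshold(mu, sigma, lam, s0, prior_R, credit=3, penalty=2):
    """Threshold maximizing expected linear utility under the fitted model."""
    grid = np.linspace(s0, mu + 4 * sigma, 2000)
    p_R = prior_R * np.exp(-(grid - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    p_N = (1 - prior_R) * lam * np.exp(-lam * (grid - s0))
    gain = credit * p_R - penalty * p_N      # expected utility density at each score
    cum = np.cumsum(gain[::-1])[::-1]        # utility of delivering all docs scoring >= grid[i]
    return grid[int(np.argmax(cum))]
```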
Score Distribution Approaches (cont.)
• Pros
– Principled approach
– Arbitrary utility
– Empirically effective
• Cons
– May be sensitive to the scoring function
– Exploration not addressed
Part II: Collaborative Filtering
What is Collaborative Filtering (CF)?
• Making filtering decisions for an individual user
based on the judgments of other users
• Inferring an individual’s interests/preferences from those of other, similar users
• General idea
– Given a user u, find similar users {u1, …, um}
– Predict u’s preferences based on the preferences of
u1, …, um
CF: Assumptions
• Users with a common interest will have similar
preferences
• Users with similar preferences probably share the
same interest
• Examples
– “interest is IR” → “favor SIGIR papers”
– “favor SIGIR papers” → “interest is IR”
• A sufficiently large number of user preferences is available
CF: Intuitions
• User similarity (Paul Resnick vs. Rahul Sami)
– If Paul liked the paper, Rahul will like the paper
– ? If Paul liked the movie, Rahul will like the movie
– Suppose Paul and Rahul viewed similar movies in the
past six months …
• Item similarity
– Since 90% of those who liked Star Wars also liked
Independence Day, and you liked Star Wars
– You may also like Independence Day
The content of items “didn’t matter”!
Rating-based vs. Preference-based
• Rating-based: User’s preferences are
encoded using numerical ratings on items
– Complete ordering
– Absolute values can be meaningful
– But, values must be normalized to combine
• Preference-based: User’s preferences are
represented by partial ordering of items
(Learning to Rank!)
– Partial ordering
– Easier to exploit implicit preferences
A Formal Framework for Rating
[Figure: an m × n rating matrix. Rows are users U = {u_1, …, u_m}, columns are objects O = {o_1, …, o_n}. Some cells hold known ratings X_ij = f(u_i, o_j), e.g., 3, 1.5, 2, 1; most are unknown (?). f: U × O → R is an unknown function.]
The task:
• Assume f values are known for some (u, o) pairs
• Predict f values for the other (u, o) pairs
• Essentially function approximation, like other learning problems
Where are the intuitions?
• Similar users have similar preferences
– If u ≈ u’, then for all o’s, f(u,o) ≈ f(u’,o)
• Similar objects have similar user preferences
– If o ≈ o’, then for all u’s, f(u,o) ≈ f(u,o’)
• In general, f is “locally constant”
– If u ≈ u’ and o ≈ o’, then f(u,o) ≈ f(u’,o’)
– “Local smoothness” makes it possible to predict
unknown values by interpolation or extrapolation
• What does “local” mean?
Two Groups of Approaches
• Memory-based approaches
– f(u, o) = g(u)(o) ≈ g(u’)(o) if u ≈ u’   (g = preference function)
– Find “neighbors” of u and combine the g(u’)(o)’s
• Model-based approaches (not covered)
– Assume structures/model: object cluster, user
cluster, f’ defined on clusters
– f(u, o) = f’(c_u, c_o)
– Estimation & Probabilistic inference
Memory-based Approaches
(Resnick et al. 94)
• General ideas:
– x_ij: rating of object j by user i
– n_i: average rating of all objects by user i
– Normalized ratings: v_ij = x_ij − n_i
– Memory-based prediction:
  $\hat{v}_{aj} = k \sum_{i=1}^{m} w(a,i)\, v_{ij}$,   $k = 1 \big/ \sum_{i=1}^{m} w(a,i)$,   $\hat{x}_{aj} = \hat{v}_{aj} + n_a$
• Specific approaches differ in w(a, i) -- the
distance/similarity between user a and i
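A minimal sketch of the prediction formula above, with the user–user weight function `w` passed in (e.g., one of the similarity measures on the next slide). The dict-of-dicts rating format is an assumption for illustration, and the denominator uses |w| as a common variant of the slide’s k = 1/Σ w(a,i).

```python
def predict(ratings, w, a, j):
    """Predict user a's rating of object j from other users' normalized ratings.

    ratings[i][j] is user i's rating of object j (absent if unrated);
    w(a, i) is the similarity weight between users a and i.
    """
    n = {i: sum(r.values()) / len(r) for i, r in ratings.items()}  # n_i: user means
    num, den = 0.0, 0.0
    for i, r in ratings.items():
        if i != a and j in r:
            num += w(a, i) * (r[j] - n[i])   # weighted normalized rating v_ij
            den += abs(w(a, i))              # slide uses sum of w(a,i); |w| handles negatives
    v_aj = num / den if den else 0.0
    return v_aj + n[a]                       # x_aj = v_aj + n_a
```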
User Similarity Measures
• Pearson correlation coefficient (sum over commonly rated items):
  $w_p(a,i) = \frac{\sum_j (x_{aj} - n_a)(x_{ij} - n_i)}{\sqrt{\sum_j (x_{aj} - n_a)^2 \sum_j (x_{ij} - n_i)^2}}$
• Cosine measure:
  $w_c(a,i) = \frac{\sum_{j=1}^{n} x_{aj}\, x_{ij}}{\sqrt{\sum_{j=1}^{n} x_{aj}^2}\, \sqrt{\sum_{j=1}^{n} x_{ij}^2}}$
• Many other possibilities!
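A minimal sketch of the two measures for the dict-of-dicts rating format used above; these plug directly into `w` in the earlier prediction sketch.

```python
import math

def pearson(ra, ri):
    """Pearson correlation over commonly rated items (ra, ri: dicts item -> rating)."""
    common = set(ra) & set(ri)
    if not common:
        return 0.0
    na = sum(ra.values()) / len(ra)   # user means over all rated items
    ni = sum(ri.values()) / len(ri)
    num = sum((ra[j] - na) * (ri[j] - ni) for j in common)
    den = math.sqrt(sum((ra[j] - na) ** 2 for j in common) *
                    sum((ri[j] - ni) ** 2 for j in common))
    return num / den if den else 0.0

def cosine_sim(ra, ri):
    """Cosine similarity, with unrated items implicitly counted as 0."""
    num = sum(ra[j] * ri[j] for j in set(ra) & set(ri))
    den = (math.sqrt(sum(v * v for v in ra.values())) *
           math.sqrt(sum(v * v for v in ri.values())))
    return num / den if den else 0.0
```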
Improving User Similarity Measures
(Breese et al. 98)
• Dealing with missing values: default ratings
• Inverse User Frequency (IUF): similar to IDF
• Case Amplification: use w(a, i)^p, e.g., p = 2.5
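A minimal sketch of the last two heuristics in simplified form (the exact weighting schemes in Breese et al. differ in detail).

```python
import math

def iuf(num_users, num_raters_of_item):
    """Inverse User Frequency: downweight items that almost every user has rated."""
    return math.log(num_users / num_raters_of_item)

def amplify(w, p=2.5):
    """Case amplification: emphasize weights close to 1, keeping the sign."""
    return math.copysign(abs(w) ** p, w)
```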
A Network Interpretation
• If your neighbors love it,
you are likely to love it
• Closer neighbors make
a larger impact
• Measure closeness by the items you and your neighbor both liked/disliked
Network-based Approaches
• No clear notion of “distance”
now, but the “scores” are
estimated based on some
propagation process.
• Score propagation based
on random walk
• Two approaches:
– Starting from the user, how
likely/how long can I reach the
object?
– Starting from the object, how
likely/how long can I hit the
user?
Random Walk and Hitting Time
[Figure: a random walk starting at vertex i moves to vertex j with probability 0.7 and to vertex k with probability 0.3; A is a target set of vertices.]
• Hitting time
– T^A: the first time the random walk is at a vertex in A
• Mean hitting time
– h_i^A: the expectation of T^A given that the walk starts from vertex i
Computing Hitting Time
Example (from the figure): h_i^A = 0.7·h_j^A + 0.3·h_k^A + 1, with h = 0 on A.
• T^A: the first time the random walk is at a vertex in A: $T^A = \min\{t : X_t \in A,\ t \ge 0\}$
• h_i^A: the expectation of T^A given that the walk starts from vertex i
• $h_i^A = \begin{cases} \sum_{j \in V} p(i \to j)\, h_j^A + 1, & i \notin A \\ 0, & i \in A \end{cases}$
• This fixed-point system can be solved by iterative computation.
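A minimal sketch of the iterative computation: fix h to 0 on A and repeatedly apply the update above everywhere else. The dict-of-dicts transition format is an assumption for illustration; if A is unreachable from a vertex, its value simply grows until the iteration cap.

```python
def hitting_times(P, A, iters=100):
    """Mean hitting time to the target set A for every vertex.

    P[i] is a dict {j: p(i -> j)} of outgoing transition probabilities.
    """
    h = {i: 0.0 for i in P}
    for _ in range(iters):
        new_h = {}
        for i, out in P.items():
            if i in A:
                new_h[i] = 0.0   # already in A: hitting time 0
            else:
                new_h[i] = 1.0 + sum(p * h.get(j, 0.0) for j, p in out.items())
        h = new_h
    return h
```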
Bipartite Graph and Hitting Time
[Figure: a weighted bipartite graph with vertex sets V1 (e.g., queries) and V2 (e.g., URLs), edge weights such as w(i, j) = 3, and a target set A of queries.]
• Bipartite graph:
– Edges run only between V1 and V2; no edges inside V1 or V2
– Edges are weighted
– E.g., V1 = queries, V2 = URLs
• Transition probabilities: $p(i \to j) = \frac{w(i,j)}{d_i}$, where $d_i = \sum_{j \in V_2} w(i,j)$
• Expected proximity of query i to the query set A: the hitting time of i → A, h_i^A
• Convert to a directed graph; one group can even be collapsed
Example: Query Suggestion
[Screenshot: a search box containing the query “Welcome to the hotel california”]
Suggestions:
• hotel california
• eagles hotel california
• hotel california band
• hotel california by the eagles
• hotel california song
• lyrics of hotel california
• listen hotel california eagle
Generating Query Suggestion using Click
Graph
(Mei et al. 08)
[Figure: a query–URL click graph. Queries (“aa”, “mexiana”, “american airline”) connect to URLs (www.aa.com, www.theaa.com/travelwatch/planner_main.jsp, en.wikipedia.org/wiki/Mexicana) with edge weights freq(q, url), e.g., 300 and 15; A marks the target query.]
• Construct a (kNN) subgraph from the query log data (of a predefined number of queries/URLs)
• Compute the transition probabilities p(i → j)
• Compute the hitting time h_i^A
• Rank candidate queries using h_i^A
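A minimal end-to-end sketch on a toy click graph, reusing the `hitting_times` sketch from the earlier slide; the query/URL names follow the figure, but the click counts and extra edges are made up for illustration.

```python
# Toy click counts w(query, url); the numbers are illustrative only.
clicks = {
    ("aa", "www.aa.com"): 300,
    ("aa", "www.theaa.com/travelwatch/planner_main.jsp"): 15,
    ("american airline", "www.aa.com"): 120,
    ("mexicana", "en.wikipedia.org/wiki/Mexicana"): 40,
}

# Build a directed graph over queries and URLs with p(i -> j) = w(i, j) / d_i.
P = {}
for (q, u), w in clicks.items():
    P.setdefault(q, {})[u] = w
    P.setdefault(u, {})[q] = w
for i, out in P.items():
    d = sum(out.values())
    P[i] = {j: w / d for j, w in out.items()}

# Rank candidate queries by hitting time to the target query set A = {"aa"}.
h = hitting_times(P, A={"aa"})
candidates = ["american airline", "mexicana"]
print(sorted(candidates, key=lambda q: h[q]))  # closer queries come first
```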
Result: Query Suggestion
Query = friends
[Table: suggestions for “friends” from the hitting-time method compared with Google and Yahoo, including: wikipedia friends, friendship, friends poem, friendster, friends episode guide, friends scripts, how to make friends, true friends, friends tv show wikipedia, secret friends, friends home page, friends reunited, friends warner bros, hide friends, the friends series, hi 5 friends, friends official site, find friends, friends (1994), poems for friends, friends quotes.]
Summary
• Filtering is “easy”
– The user’s expectation is low
– Any recommendation is better than none
– Making it practically useful
• Filtering is “hard”
– Must make a binary decision, though ranking is
also possible
– Data sparseness
– “Cold start”
Example: Netflix
References (Adaptive Filtering)
• General papers on TREC filtering evaluation
– D. Hull, The TREC-7 Filtering Track: Description and Analysis, TREC-7 Proceedings.
– D. Hull and S. Robertson, The TREC-8 Filtering Track Final Report, TREC-8 Proceedings.
– S. Robertson and D. Hull, The TREC-9 Filtering Track Final Report, TREC-9 Proceedings.
– S. Robertson and I. Soboroff, The TREC 2001 Filtering Track Final Report, TREC-10 Proceedings.
• Papers on specific adaptive filtering methods
– Stephen Robertson and Stephen Walker (2000), Threshold Setting in Adaptive Filtering, Journal of Documentation, 56:312-331, 2000.
– ChengXiang Zhai, Peter Jansen, and David A. Evans, Exploration of a heuristic approach to threshold learning in adaptive filtering, SIGIR 2000 (poster presentation).
– Avi Arampatzis and Andre van Hameren, The Score-Distributional Threshold Optimization for Adaptive Binary Classification Tasks, SIGIR 2001.
– Yi Zhang and Jamie Callan, Maximum Likelihood Estimation for Filtering Threshold, SIGIR 2001.
– T. Ault and Y. Yang, kNN, Rocchio and Metrics for Information Filtering at TREC-10, TREC-10 Proceedings, 2002.
References (Collaborative Filtering)
• Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., and Riedl, J., “GroupLens: An Open Architecture for Collaborative Filtering of Netnews,” CSCW, pp. 175-186.
• P. Resnick and H. R. Varian, Recommender Systems, CACM, 1997.
• Cohen, W.W., Schapire, R.E., and Singer, Y. (1999), “Learning to Order Things,” Journal of AI Research, Volume 10, pages 243-270.
• Freund, Y., Iyer, R., Schapire, R.E., and Singer, Y. (1999), An efficient boosting algorithm for combining preferences, Machine Learning Journal, 1999.
• Breese, J. S., Heckerman, D., and Kadie, C. (1998), Empirical Analysis of Predictive Algorithms for Collaborative Filtering, in Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence, pp. 43-52.
• Alexandrin Popescul and Lyle H. Ungar, Probabilistic Models for Unified Collaborative and Content-Based Recommendation in Sparse-Data Environments, UAI 2001.
• N. Good, J.B. Schafer, J. Konstan, A. Borchers, B. Sarwar, J. Herlocker, and J. Riedl, “Combining Collaborative Filtering with Personal Agents for Better Recommendations,” Proceedings of AAAI-99, pp. 439-446, 1999.
• Qiaozhu Mei, Dengyong Zhou, Kenneth Church, Query Suggestion Using Hitting Time, CIKM'08, pages 469-478.