Modeling User Interactions in Web Search and Social Media
Eugene Agichtein
Intelligent Information Access Lab, Emory University
http://ir.mathcs.emory.edu/

• Research areas:
  – Information retrieval & extraction, text mining, and information integration
  – User behavior modeling, social networks and interactions, social media
• People: Walter Askew (Emory '09), Qi Guo (2nd-year Ph.D.), Yandong Liu (2nd-year Ph.D.), Alvin Grissom (2nd-year M.S.), Ryan Kelly (Emory '10), Abulimiti Aji (1st-year Ph.D.)
  – And colleagues at Yahoo! Research, Microsoft Research, Emory Libraries, Psychology, Emory School of Medicine, Neuroscience, and Georgia Tech College of Computing.
• Support

User Interactions: The 3rd Dimension of the Web
• The amount of interaction data exceeds web content and structure
  – Published content: 4 GB/day; social media: 10 GB/day
  – Page views: 100 GB/day [Andrew Tomkins, Yahoo! Search, 2007]

Finding Information Online
• Search + Browse
  – Orienteering
• Ask
  – Offline
  – Online

Talk Outline
• Web search interactions
  – Click modeling
  – Browsing
• Social media
  – Content quality
  – User satisfaction
  – Ranking and filtering

Interpreting User Interactions
• Clickthrough and subsequent browsing behavior of individual users is influenced by many factors:
  – Relevance of a result to a query
  – Visual appearance and layout
  – Result presentation order
  – Context, history, etc.
• General idea:
  – Aggregate interactions across all users and queries
  – Compute "expected" behavior for any query/page
  – Recover the relevance signal for a given query

Case Study: Clickthrough
[Figure: relative clickthrough frequency by result position, aggregated over all queries in the sample]
Clickthrough(query q, document d, result position p) = expected(p) + relevance(q, d)

Clickthrough for Queries with Known Position of Top Relevant Result
• Higher clickthrough at the top non-relevant document than at the top relevant document
[Figure: relative clickthrough by result position for all queries, and for queries whose known relevant result is at position 1 (PTR=1) and position 3 (PTR=3)]

Model Deviation from "Expected" Behavior
• Relevance component: deviation from "expected" behavior:
  Relevance(q, d) = observed - expected(p)
[Figure: click frequency deviation by result position for PTR=1 and PTR=3; e.g., deviation of 0.144 at position 1 for PTR=1 and 0.063 at position 3 for PTR=3]

Predicting Result Preferences
• Task: predict pairwise preferences
  – A user will prefer Result A > Result B
• Models for preference prediction:
  – Current search engine ranking
  – Clickthrough
  – Full user behavior model

Predicting Result Preferences (Granka et al., SIGIR 2005)
• SA+N: "Skip Above" and "Skip Next"
  – Adapted from Joachims et al. [SIGIR '05]
  – Motivated by gaze tracking
• Example: user clicks on results 2 and 4
  – Skip Above: 4 > (1, 3), 2 > 1
  – Skip Next: 4 > 5, 2 > 3

Our Extension: Use Click Distribution
• CD: a distributional model that extends SA+N
  – A clickthrough is counted only if its observed frequency exceeds the expected frequency by more than ε
[Figure: clickthrough frequency deviation by result position for an example query]
• Example: the click on result 2 is likely "by chance"
  – Generates 4 > (1, 2, 3, 5), but not 2 > (1, 3)

Results: Click Deviation vs. Skip Above+Next
[Figure: precision-recall comparison of the CD and SA+N preference-prediction strategies]
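Note (illustrative, not from the talk): a minimal Python sketch of the deviation model and the CD preference rule described above. It estimates the expected click-through rate per position from aggregate logs, computes the observed-minus-expected deviation per query/URL pair, and emits skip-above/skip-next preferences only for clicks whose deviation exceeds ε. The log format, function names, and the single-position simplification are assumptions made for this sketch.

```python
from collections import defaultdict

def expected_click_rate(logs):
    """Aggregate clicks over all queries to estimate the background
    click-through rate at each result position ("expected" behavior)."""
    clicks, views = defaultdict(int), defaultdict(int)
    for query, url, pos, clicked in logs:
        views[pos] += 1
        clicks[pos] += 1 if clicked else 0
    return {pos: clicks[pos] / views[pos] for pos in views}

def relevance_deviation(logs, expected):
    """Relevance(q, d) ~ observed click rate - expected(position).
    Assumes each (query, url) pair is shown at a single position."""
    obs_clicks, obs_views, positions = defaultdict(int), defaultdict(int), {}
    for query, url, pos, clicked in logs:
        key = (query, url)
        obs_views[key] += 1
        obs_clicks[key] += 1 if clicked else 0
        positions[key] = pos
    return {key: obs_clicks[key] / obs_views[key] - expected[positions[key]]
            for key in obs_views}

def cd_preferences(query, results, deviation, epsilon=0.1):
    """Click Distribution (CD) rule: a result generates preferences over the
    results above it and the next one only if its click-frequency deviation
    exceeds epsilon, i.e. the clicks are unlikely to be "by chance"."""
    prefs = []
    for i, url in enumerate(results):
        if deviation.get((query, url), 0.0) <= epsilon:
            continue                        # click indistinguishable from chance
        for j, other in enumerate(results):
            if j < i or j == i + 1:         # skip-above and skip-next candidates
                prefs.append((url, other))  # url preferred over other
    return prefs
```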
Problem: users click based on result summaries ("captions"/"snippets")

Effect of Caption Features on Clickthrough Inversions (C. Clarke, E. Agichtein, S. Dumais, R. White, SIGIR 2007)

Clickthrough Inversions
• Relevance is not the dominant factor!

Snippet Features Studied
[Figures: the studied snippet features, their relative importance, and important words in snippets; details in the SIGIR 2007 paper]

Summary
• Clickthrough inversions are a powerful tool for assessing the influence of caption features.
• Relatively simple caption features can significantly influence user behavior.
• Accounting for summary bias can help predict relevance from clickthrough more accurately.

Idea: Go Beyond Clickthrough/Download Counts

Presentation
  ResultPosition       Position of the URL in the current ranking
  QueryTitleOverlap    Fraction of query terms in the result title
Clickthrough
  DeliberationTime     Seconds between the query and the first click
  ClickFrequency       Fraction of all clicks landing on the page
  ClickDeviation       Deviation from the expected click frequency
Browsing
  DwellTime            Result page dwell time
  DwellTimeDeviation   Deviation from the expected dwell time for the query

User Behavior Model
• Full set of interaction features
  – Presentation, clickthrough, browsing
• Train the model with explicit judgments
  – Input: behavior feature vectors for each query-page pair in the rated results
  – Use RankNet (Burges et al., ICML 2005) to discover model weights
  – Output: a neural net that can assign a "relevance" score to a behavior feature vector

RankNet for User Behavior
• RankNet: a general, scalable, robust neural-net training algorithm and implementation
• Optimized for ranking: predicts an ordering of items, not a score for each
• Trains on pairs, where the first item is to be ranked higher than or equal to the second
  – Extremely efficient
  – Uses a cross-entropy cost (probabilistic model)
  – Uses gradient descent to set the weights
  – Restarts to escape local minima

RankNet [Burges et al. 2005]
• For query results 1 and 2, present the pair of feature vectors and labels, with label(1) > label(2)
• Both vectors are passed through the net; the error is a function of both outputs (we want output1 > output2)
• Update the feature weights with modified back-propagation
  – Cost function: f(o1 - o2); details in the Burges et al. paper

Predicting with RankNet
• At prediction time, present an individual feature vector and read off its score
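Note (illustrative): a minimal sketch of the pairwise training idea behind RankNet, assuming a linear scoring function in place of the full neural net for brevity. For each preference pair, the cross-entropy cost of the modeled probability that the preferred result outranks the other is minimized by gradient descent; at prediction time each vector is scored on its own. Names and the toy data are illustrative, not from Burges et al.

```python
import numpy as np

def ranknet_update(w, x_preferred, x_other, lr=0.01):
    """One stochastic-gradient step of a RankNet-style pairwise objective.
    Scores come from a linear model here (the original uses a small neural
    net); the cost is the cross entropy of P(preferred item ranked higher)."""
    o1, o2 = w @ x_preferred, w @ x_other          # model outputs for the pair
    p = 1.0 / (1.0 + np.exp(-(o1 - o2)))           # P(item 1 outranks item 2)
    cost = -np.log(p)                              # cross-entropy cost, target = 1
    grad = -(1.0 - p) * (x_preferred - x_other)    # d(cost)/dw
    return w - lr * grad, cost

def score(w, x):
    """Prediction: score a single feature vector; results are sorted by score."""
    return w @ x

# Minimal usage example with made-up behavior feature vectors.
rng = np.random.default_rng(0)
w = np.zeros(3)
for _ in range(100):
    x_rel, x_nonrel = rng.normal(1, 1, 3), rng.normal(0, 1, 3)
    w, _ = ranknet_update(w, x_rel, x_nonrel)
```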
Example Results: Predicting User Preferences
[Figure: precision-recall for the Baseline, SA+N, CD, and UserBehavior preference predictors]
• Baseline < SA+N < CD << UserBehavior
• Rich user behavior features result in a dramatic improvement

How to Use Behavior Models for Ranking?
• Use interactions from previous instances of the query
  – General-purpose (not personalized)
  – Only applicable to queries with past user interactions
• Models:
  – Rerank, clickthrough only: reorder results by number of clicks
  – Rerank, predicted preferences (all user behavior features): reorder results by predicted preferences
  – Integrate directly into the ranker: incorporate user interactions as features for the ranker

Enhance Ranker Features with User Behavior Features
• For a given query:
  – Merge the original feature set with user behavior features, when available
  – User behavior features are computed from previous interactions with the same query
• Train RankNet [Burges et al., ICML '05] on the enhanced feature set

Feature Merging: Details
Query: SIGIR (fake results with fake feature values)

Result URL             BM25   PageRank   …   Clicks   DwellTime   …
sigir2007.org          2.4    0.5        …   ?        ?           …
sigir2006.org          1.4    1.1        …   150      145.2       …
acm.org/sigs/sigir/    1.2    2          …   60       23.5        …

• Value scaling: binning vs. log-linear vs. linear (e.g., μ=0, σ=1)
• Missing values: use 0? (What does 0 mean for normalized features with μ=0?)
• Runtime: significant plumbing problems

Evaluation Metrics
• Precision at K: fraction of relevant results in the top K
• NDCG at K: normalized discounted cumulative gain; top-ranked results matter most
  – N_q = M_q * Σ_{j=1..K} (2^{r(j)} - 1) / log(1 + j), where r(j) is the rating of the result at position j and M_q normalizes so that a perfect ordering scores 1
• MAP: mean average precision
  – Average precision for each query: the mean of the precision-at-K values computed after each relevant document is retrieved
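Note (illustrative): a small sketch of the two metrics as defined above; r(j) is the graded rating at rank j, and the normalizer M_q is computed from the ideal (descending-rating) ordering. The log base does not affect the NDCG ratio. Function names and the example inputs are made up.

```python
import math

def dcg_at_k(ratings, k):
    """Discounted cumulative gain: sum of (2^r - 1) / log(1 + j) over the
    top k, with positions j counted from 1 as on the slide."""
    return sum((2 ** r - 1) / math.log(1 + j)
               for j, r in enumerate(ratings[:k], start=1))

def ndcg_at_k(ratings_in_ranked_order, k):
    """NDCG@K: DCG of the system ranking divided by the DCG of the ideal
    ordering, so a perfect ranking scores 1."""
    ideal = dcg_at_k(sorted(ratings_in_ranked_order, reverse=True), k)
    return dcg_at_k(ratings_in_ranked_order, k) / ideal if ideal > 0 else 0.0

def average_precision(relevance_in_ranked_order):
    """Average precision for one query: mean of precision@K taken at each
    rank K where a relevant document appears. MAP is the mean over queries."""
    hits, precisions = 0, []
    for k, rel in enumerate(relevance_in_ranked_order, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Example: graded ratings of a ranked list, and binary relevance for AP.
print(ndcg_at_k([2, 0, 1, 0, 3], k=5))
print(average_precision([1, 0, 1, 0, 0]))
```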
Content + User Behavior: NDCG
[Figure: NDCG at K (K = 1..10) for BM25, Rerank-CT, Rerank-All, and BM25+All]
• BM25 < Rerank-CT < Rerank-All < BM25+All

Full Search Engine + User Behavior: NDCG, MAP
[Figure: NDCG at K (K = 1..10) for RN, Rerank-All, and RN+All]

Method      MAP     Gain
RN          0.270
RN+All      0.321   0.052 (19.13%)
BM25        0.236
BM25+All    0.292   0.056 (23.71%)

User Behavior Complements Content and Web Topology
[Figure: precision at K (K = 1, 3, 5, 10) for RN, RN+All, BM25, and BM25+All]

Method                       P@1     Gain
RN (content + links)         0.632
RN+All (user behavior)       0.693   0.061 (10%)
BM25                         0.525
BM25+All                     0.687   0.162 (31%)

Which Queries Benefit Most
[Figure: average gain and query frequency by original ranking quality]
• Most gains are for queries with poor original ranking

Result Summary
• Incorporating user behavior into web search ranking dramatically improves relevance
• Providing rich user interaction features to the ranker is the most effective strategy
• Large improvements are shown for up to 50% of test queries

User Generated Content

Some Goals of Mining Social Media
• Find high-quality content
• Find relevant and high-quality content
• Use millions of interactions to:
  – Understand complex information needs
  – Model subjective information seeking
  – Understand cultural dynamics

[Example Yahoo! Answers question: http://answers.yahoo.com/question/index;_ylt=3?qid=20071008115118AAh1HdO]

Lifecycle of a Question in CQA
• The asker chooses a category, composes the question, and opens it
• Other users examine the question, contribute answers, and vote answers up or down
• The asker closes the question, chooses the best answer, and gives ratings
• If the asker never finds a satisfactory answer, the question is closed by the system and the best answer is chosen by voters

[Screenshots: the Yahoo! Answers community interface]

Editorial Quality != User Popularity != Usefulness

Are Editor/Judge Labels "Meaningful"?
• Information seeking process: the user wants to find useful information about a topic with incomplete knowledge
  – N. Belkin: "Anomalous States of Knowledge"
• We want to model directly whether the user found satisfactory information
• A specific (amenable) case: CQA

Yahoo! Answers: The Good News
• Active community of millions of users in many countries and languages
• Has accumulated a great number of questions and answers
• Effective for subjective information needs
  – A great forum for socialization/chat
• (Can be) invaluable for hard-to-find information not available on the web

Yahoo! Answers: The Bad News
• Users may have to wait a long time to get a satisfactory answer
[Figure: time to close a question (hours) for sample categories: 1. 2006 FIFA World Cup, 2. Optical, 3. Poetry, 4. Football (American), 5. Scottish Football (Soccer), 6. Medicine, 7. Winter Sports, 8. Special Education, 9. General Health Care, 10. Outdoor Recreation]
• Users may never obtain a satisfying answer

Asker Satisfaction Problem
• Given a question submitted by an asker in CQA, predict whether the user will be satisfied with the answers contributed by the community
  – "Satisfied" is defined as: the asker personally closed the question AND selected the best answer AND rated the best answer at least 3 "stars"
  – Otherwise, the asker is "unsatisfied"

Approach: Machine Learning over Content and Usage Features
• Theme: holistic integration of content analysis and usage analysis
• Method: supervised (and later partially supervised) machine learning over features
• Tools:
  – Weka (ML library): SVM, boosting, decision trees, Naive Bayes, …
  – Part-of-speech taggers, chunkers
  – Corpora (Wikipedia, the web, queries, …)

Satisfaction Prediction Features
• Feature groups: question features, answer features, textual features, asker history, answerer history, category features
• Approach: classification algorithms from machine learning (support vector machines, decision trees, boosting, Naive Bayes) predict "asker is satisfied" vs. "asker is not satisfied"

Prediction Algorithms
• Heuristic: number of answers
• Baseline: simply predicts the majority class (satisfied)
• ASP_SVM: our system with the SVM classifier
• ASP_C4.5: with the C4.5 classifier
• ASP_RandomForest: with the RandomForest classifier
• ASP_Boosting: with the AdaBoost algorithm combining weak learners
• ASP_NaiveBayes: with the Naive Bayes classifier

Evaluation Metrics
• Precision: the fraction of predicted-satisfied asker information needs that were indeed rated satisfactory by the asker
• Recall: the fraction of all rated-satisfied questions that were correctly identified by the system
• F1: the harmonic mean of precision and recall, computed as 2 * (precision * recall) / (precision + recall)
• Accuracy: the overall fraction of instances classified into the correct class

Datasets
• Crawled from Yahoo! Answers in early 2008 (thanks to Yahoo! for support)

Questions    Answers      Askers     Categories   % Satisfied
216,170      1,963,615    158,515    100          50.7%

• Data is available at http://ir.mathcs.emory.edu/shared

Dataset Statistics

Category                  #Q     #A      #A per Q   Satisfied   Avg asker rating   Time to close
2006 FIFA World Cup(TM)   1194   35659   29.86      55.4%       2.63               47 minutes
Mental Health             151    1159    7.68       70.9%       4.30               1 day and 13 hours
Mathematics               651    2329    3.58       44.5%       4.48               33 minutes
Diet & Fitness            450    2436    5.41       68.4%       4.30               1.5 days

• Asker satisfaction varies significantly across categories
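Note (illustrative): a rough sketch of the classification setup, using scikit-learn's decision tree as a stand-in for the Weka C4.5 classifier named above. The load_questions() loader and the feature columns are hypothetical placeholders for the question, asker-history, and category feature groups; the synthetic data is only there to make the sketch runnable.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support

def load_questions():
    """Placeholder: each row is one closed question; columns stand in for a few
    question / asker-history / category features, e.g. [prev_asker_rating,
    prev_questions, category_avg_rating, posting_hour]; label 1 = satisfied."""
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 4))
    y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)
    return X, y

X, y = load_questions()
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Decision tree as a rough stand-in for the C4.5 classifier used in the talk.
clf = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_test, clf.predict(X_test), average="binary")
print(f"P={prec:.2f} R={rec:.2f} F1={f1:.2f}")
```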
#Q, #A, Time to Close → Asker Satisfaction
[Figure: time to close, broken down by asker satisfaction]

Satisfaction Prediction: Human Performance
• Ground truth: the asker's rating
• A random sample of 130 questions
• Annotated by researchers to calibrate asker satisfaction
  – Agreement: 0.82
  – F1: 0.45

Satisfaction Prediction: Human Performance (cont'd): Amazon Mechanical Turk
• A service provided by Amazon: workers submit responses to a Human Intelligence Task (HIT) for $0.01-$0.10 each
• Thousands of items can usually be labeled in hours
• Methodology:
  – Used the same 130 questions
  – For each question, listed the best answer as well as four other answers ordered by votes
  – Five independent raters per question
  – Agreement: 0.9; F1: 0.61
  – Best accuracy was achieved when at least 4 out of 5 raters predicted the asker to be "satisfied" (otherwise, labeled "unsatisfied")

Comparison of Human and Automatic Performance (F1)

Classifier           With Text   Without Text   Selected Features
ASP_SVM              0.69        0.72           0.62
ASP_C4.5             0.75        0.76           0.77
ASP_RandomForest     0.70        0.74           0.68
ASP_Boosting         0.67        0.67           0.67
ASP_NB               0.61        0.65           0.58
Best human           0.61
Baseline (naive)     0.66

• C4.5 is the most effective classifier for this task
• Human F1 performance is lower than the naive baseline!

Features by Information Gain (Satisfied class)
• 0.14219  Q: Asker's previous rating
• 0.13965  Q: Average past rating by the asker
• 0.10237  UH: Member since (interval)
• 0.04878  UH: Average # answers received for past questions
• 0.04878  UH: Previous questions resolved for the asker
• 0.04381  CA: Average asker rating for the category
• 0.04306  UH: Total number of answers received
• 0.03274  CA: Average voter rating
• 0.03159  Q: Question posting time
• 0.02840  CA: Average # answers per question

"Offline" vs. "Online" Prediction
• Offline prediction:
  – All features (question, answer, asker & category)
  – F1: 0.77
• Online prediction:
  – No answer features; only asker history and question features (stars, # comments, sum of votes, …)
  – F1: 0.74

Feature Ablation

Features                       Precision   Recall   F1
Selected features              0.80        0.73     0.77
No question-answer features    0.76        0.74     0.75
No answerer features           0.76        0.75     0.75
No category features           0.75        0.76     0.75
No asker features              0.72        0.69     0.71
No question features           0.68        0.72     0.70

• Asker and question features are the most important
• Answer quality, answerer expertise, and category characteristics may not be as important: caring or supportive answers might sometimes be preferred

Satisfaction Varies by Asker Experience
• Group together questions from askers with the same number of previous questions
• Prediction accuracy increases dramatically, reaching an F1 of 0.9 for askers with >= 5 previous questions

Personalized Prediction of Asker Satisfaction
• The same information != the same usefulness for different users!
• A personalized classifier achieves surprisingly good accuracy (even with just 1 previous question!)
• The simple strategy of grouping users by the number of previous questions is more effective than other methods for users with a moderate amount of history
• For users with >= 20 questions, textual features become more significant

Some Results
Some Personalized Models
[Figures: example results and example personalized models]
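Note (illustrative): a sketch of the grouping strategy from the personalization slides, under the assumption that askers are bucketed by their number of previous questions and one classifier is trained per bucket. The slides group askers by exact counts; the coarse buckets, helper names, and fallback rule here are illustrative simplifications.

```python
from collections import defaultdict
from sklearn.tree import DecisionTreeClassifier

def history_bucket(num_prev_questions):
    """Group askers by amount of history: new askers, some history, and
    heavy users (>= 20 questions, where textual features start to matter)."""
    if num_prev_questions == 0:
        return "new"
    if num_prev_questions < 20:
        return "some_history"
    return "heavy_user"

def train_per_group(examples):
    """examples: iterable of (num_prev_questions, feature_vector, satisfied).
    Returns one classifier per history bucket."""
    grouped = defaultdict(lambda: ([], []))
    for prev_q, features, satisfied in examples:
        X, y = grouped[history_bucket(prev_q)]
        X.append(features)
        y.append(satisfied)
    # Skip buckets where only one class was observed.
    return {bucket: DecisionTreeClassifier(max_depth=5).fit(X, y)
            for bucket, (X, y) in grouped.items() if len(set(y)) > 1}

def predict(models, num_prev_questions, features, default=1):
    """Fall back to the majority class (satisfied) when no model exists
    for the asker's history bucket."""
    model = models.get(history_bucket(num_prev_questions))
    return model.predict([features])[0] if model is not None else default
```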
Summary
• Asker satisfaction is predictable
  – Can achieve higher-than-human accuracy by exploiting interaction history
• The user's experience is important
• General (one-size-fits-all) model: about 2,000 training questions are enough
• Personalized satisfaction prediction:
  – Helps once there is sufficient data (>= 1 previous interaction; text patterns become observable with >= 20 previous interactions)

Other Tasks in Progress
• Subjectivity and sentiment analysis
  – B. Li, Y. Liu, and E. Agichtein, "CoCQA: Co-Training over Questions and Answers with an Application to Predicting Question Subjectivity Orientation," in Proc. of EMNLP 2008
• Discourse analysis
• Cross-cultural comparisons
• CQA vs. web search comparison

Summary
• User-generated content:
  – Growing
  – Important: impact on mainstream media, scholarly publishing, …
  – Can provide insight into information seeking and social processes
  – "Training" data for IR, machine learning, NLP, …
  – Requires rethinking quality, impact, and usefulness

Thank you!

Question-Answer Features
• Q: length, posting time, …
• Q: terms
• QA: length, KL divergence
• Q: votes

User Features
• U: member since
• U: total points
• U: # questions
• U: # answers

Category Features
• CA: average time to close a question
• CA: average # answers per question
• CA: average asker rating
• CA: average voter rating
• CA: average # questions per hour
• CA: average # answers per hour

Dataset statistics (additional category):

Category          #Q    #A    #A per Q   Satisfied   Avg asker rating   Time to close
General Health    134   737   5.46       70.4%       4.49               1 day and 13 hours
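Note (illustrative): for the "QA: KL divergence" feature listed above, one possible formulation (the exact smoothing and direction used in the system are not given in the deck) is the KL divergence between smoothed unigram distributions of the question text and the answer text, as in the sketch below.

```python
import math
from collections import Counter

def kl_divergence(question_text, answer_text):
    """KL(question || answer) over smoothed unigram term distributions; a rough
    measure of how far the answer's vocabulary drifts from the question's.
    Add-one smoothing over the joint vocabulary is an assumption here."""
    q_counts = Counter(question_text.lower().split())
    a_counts = Counter(answer_text.lower().split())
    vocab = set(q_counts) | set(a_counts)
    q_total = sum(q_counts.values()) + len(vocab)
    a_total = sum(a_counts.values()) + len(vocab)
    kl = 0.0
    for term in vocab:
        p = (q_counts[term] + 1) / q_total      # smoothed question probability
        q = (a_counts[term] + 1) / a_total      # smoothed answer probability
        kl += p * math.log(p / q)
    return kl

print(kl_divergence("how do I fix a flat bicycle tire",
                    "remove the wheel, patch the tube, and re-inflate it"))
```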