Asker Satisfaction - Emory's Math Department

Modeling User Interactions
in Web Search and Social Media
Eugene Agichtein
Intelligent Information Access Lab
Emory University
Intelligent Information Access Lab
http://ir.mathcs.emory.edu/
• Research areas:
– Information retrieval & extraction, text mining, and information integration
– User behavior modeling, social networks and interactions, social media
• People
– Walter Askew (Emory '09), Qi Guo (2nd year Ph.D.), Yandong Liu (2nd year Ph.D.)
– Alvin Grissom (2nd year M.S.), Ryan Kelly (Emory '10), Abulimiti Aji (1st year Ph.D.)
And colleagues at Yahoo! Research, Microsoft Research, Emory Libraries, Psychology,
Emory School of Medicine, Neuroscience, and Georgia Tech College of Computing.
• Support
User Interactions:
The 3rd Dimension of the Web
• Amount exceeds web content and structure
– Published content: 4 GB/day; social media: 10 GB/day
– Page views: 100 GB/day
[Andrew Tomkins, Yahoo! Search, 2007]
Finding Information Online
• Search + Browse
– Orienteering
• Ask
– Offline → Online
Talk Outline
• Web Search Interactions
– Click modeling
– Browsing
• Social media
– Content quality
– User satisfaction
– Ranking and Filtering
Interpreting User Interactions
• Clickthrough and subsequent browsing behavior of individual users is influenced by many factors:
– Relevance of a result to a query
– Visual appearance and layout
– Result presentation order
– Context, history, etc.
• General idea:
– Aggregate interactions across all users and queries
– Compute “expected” behavior for any query/page
– Recover relevance signal for a given query
Case Study: Clickthrough
[Figure: relative click frequency by result position (1–10), aggregated over all queries in the sample]
Clickthrough(query q, document d, result position p) = expected(p) + relevance(q, d)
Clickthrough for Queries with Known Position of Top Relevant Result
[Figure: relative click frequency by result position for all queries vs. queries with the top relevant result at position 1 (PTR=1) and position 3 (PTR=3); clickthrough is higher at a top non-relevant document than at a top relevant document]
Relative clickthrough for queries with known relevant
results in position 1 and 3 respectively
Model Deviation from “Expected” Behavior
• Relevance component: deviation from “expected”:
Relevance(q, d) = observed - expected(p)
[Figure: click frequency deviation by result position for PTR=1 and PTR=3 queries; e.g., deviation of +0.144 at position 1 for PTR=1 and +0.063 at position 3 for PTR=3, with small negative deviations at the other positions]
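To make the deviation idea concrete, here is a minimal Python sketch: estimate the "expected" click rate per position from the whole log, then score a result for a query by how much its observed click frequency exceeds that expectation. The log format, function names, and toy data are illustrative assumptions, not the actual system.

```python
from collections import Counter, defaultdict

def expected_click_rate(click_log):
    """Aggregate clicks across all queries to estimate the position prior
    p(click | position), i.e., the 'expected' behavior on the slides."""
    position_clicks = Counter(pos for _, _, pos in click_log)
    total = sum(position_clicks.values())
    return {pos: count / total for pos, count in position_clicks.items()}

def relevance_deviation(click_log, query):
    """Relevance(q, d) = observed click frequency - expected(position).
    Simplification: each result is assumed to occupy one position for the query."""
    expected = expected_click_rate(click_log)
    q_clicks = [(url, pos) for q, url, pos in click_log if q == query]
    if not q_clicks:
        return {}
    per_url = defaultdict(list)
    for url, pos in q_clicks:
        per_url[url].append(pos)
    n = len(q_clicks)
    return {url: len(positions) / n - expected.get(positions[0], 0.0)
            for url, positions in per_url.items()}

# toy usage: (query, url, position) click records
log = [("sigir", "a.org", 1), ("sigir", "b.org", 3), ("sigir", "b.org", 3),
       ("acl", "c.org", 1), ("acl", "c.org", 1)]
print(relevance_deviation(log, "sigir"))
```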
Predicting Result Preferences
• Task: predict pairwise preferences
– A user will prefer Result A > Result B
• Models for preference prediction
– Current search engine ranking
– Clickthrough
– Full user behavior model
Predicting Result Preferences: Granka et al., SIGIR 2005
• SA+N: "Skip Above" and "Skip Next"
– Adapted from Joachims et al. [SIGIR '05]
– Motivated by gaze tracking
• Example (results ranked 1–8)
– Click on results 2 and 4
– Skip Above: 4 > (1, 3), 2 > 1
– Skip Next: 4 > 5, 2 > 3
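Below is a small Python sketch of how the Skip Above and Skip Next heuristics can be turned into preference pairs from one query's clicks; the function name and the 1-based position convention are my own, and the example reproduces the clicks-on-2-and-4 case from the slide.

```python
def skip_above_next(clicked_positions, num_results=10):
    """Extract pairwise preferences with the Skip Above + Skip Next heuristics
    (after Joachims et al., SIGIR 2005). Positions are 1-based ranks.
    Returns (preferred, over) pairs of result positions."""
    clicked = sorted(set(clicked_positions))
    prefs = []
    for c in clicked:
        # Skip Above: a clicked result is preferred over skipped results above it
        for above in range(1, c):
            if above not in clicked:
                prefs.append((c, above))
        # Skip Next: a clicked result is preferred over the unclicked result below it
        nxt = c + 1
        if nxt <= num_results and nxt not in clicked:
            prefs.append((c, nxt))
    return prefs

# slide example: clicks on results 2 and 4
print(skip_above_next([2, 4]))
# -> [(2, 1), (2, 3), (4, 1), (4, 3), (4, 5)]
```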
Our Extension: Use Click Distribution
• CD: distributional model, extends SA+N
– A clickthrough is considered only if its frequency exceeds the expected frequency by more than ε
[Figure: clickthrough frequency deviation by result position (1–8) for an example query with clicks on positions 2 and 4]
• Click on result 2 is likely "by chance"
• 4 > (1, 2, 3, 5), but not 2 > (1, 3)
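A hedged sketch of the CD idea in the same style: the pair-extraction step runs only from clicks whose observed frequency exceeds the expected frequency at that position by more than ε, and untrusted clicks are treated as skips. The thresholding details and the toy numbers are illustrative, not the exact published model.

```python
def cd_preferences(click_freq, expected, epsilon=0.1, num_results=10):
    """Click Distribution (CD) sketch: only clicks with
    observed - expected > epsilon at their position generate preferences."""
    trusted = sorted(p for p, f in click_freq.items()
                     if f - expected.get(p, 0.0) > epsilon)
    prefs = []
    for c in trusted:
        for above in range(1, c):          # skip-above over non-trusted results
            if above not in trusted:
                prefs.append((c, above))
        nxt = c + 1                        # skip-next
        if nxt <= num_results and nxt not in trusted:
            prefs.append((c, nxt))
    return prefs

# slide example: clicks on 2 and 4, but only the click on 4 deviates enough
freq = {2: 0.05, 4: 0.30}
exp = {1: 0.4, 2: 0.2, 3: 0.1, 4: 0.08, 5: 0.05}
print(cd_preferences(freq, exp, epsilon=0.1))
# -> [(4, 1), (4, 2), (4, 3), (4, 5)]
```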
Results: Click Deviation vs. Skip Above+Next
[Figure: precision vs. recall (precision roughly 0.6–0.78, recall up to 0.2) for the CD and SA+N strategies]
Problem: Users click based on result summaries ("captions"/"snippets")
Effect of Caption Features on Clickthrough Inversions:
C. Clarke, E. Agichtein, S. Dumais, R. White, SIGIR 2007
Clickthrough Inversions
Relevance is Not the Dominant Factor!
Snippet Features Studied
Feature Importance
Important Words in Snippet
Summary
• Clickthrough inversions are a powerful tool for assessing the influence of caption features.
• Relatively simple caption features can significantly influence user behavior.
• Accounting for summary bias can help predict relevance from clickthrough more accurately.
Idea: go beyond clickthrough/download counts
Group        | Feature            | Description
Presentation | ResultPosition     | Position of the URL in the current ranking
Presentation | QueryTitleOverlap  | Fraction of query terms in the result title
Clickthrough | DeliberationTime   | Seconds between the query and the first click
Clickthrough | ClickFrequency     | Fraction of all clicks landing on the page
Clickthrough | ClickDeviation     | Deviation from the expected click frequency
Browsing     | DwellTime          | Result page dwell time
Browsing     | DwellTimeDeviation | Deviation from the expected dwell time for the query
User Behavior Model
• Full set of interaction features
– Presentation, clickthrough, browsing
• Train the model with explicit judgments
– Input: behavior feature vectors for each query-page pair
in rated results
– Use RankNet (Burges et al., ICML 2005) to discover model weights
– Output: a neural net that can assign a “relevance” score to
a behavior feature vector
RankNet for User Behavior
• RankNet: general, scalable, robust Neural Net
training algorithms and implementation
• Optimized for ranking – predicting an ordering of
items, not scores for each
• Trains on pairs (where first point is to be ranked
higher or equal to second)
– Extremely efficient
– Uses a cross-entropy cost (probabilistic model)
– Uses gradient descent to set weights
– Restarts to escape local minima
RankNet [Burges et al. 2005]
• For query results 1 and 2, present a pair of feature vectors with labels, label(1) > label(2)
– Feature Vector 1 → NN output 1
– Feature Vector 2 → NN output 2
– The error is a function of both outputs (we desire output 1 > output 2)
• Update feature weights:
– Cost function: f(o1 - o2); details in the Burges et al. paper
– Modified back-propagation
Predicting with RankNet
• Present an individual feature vector and get a score (NN output)
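For readers who want to see the pairwise cost concretely, here is a minimal NumPy sketch of a RankNet-style cross-entropy cost and its gradient, using a linear scorer in place of the neural net described in Burges et al. (2005); restarts, mini-batching, and the actual network architecture are omitted, so treat it as an illustration of the pairwise training idea only.

```python
import numpy as np

def ranknet_pair_cost(score_a, score_b):
    """Pairwise cross-entropy cost when item A should rank above item B:
    C = log(1 + exp(-(s_a - s_b))), i.e., target probability 1 for 'A before B'."""
    return np.log1p(np.exp(-(score_a - score_b)))

def ranknet_pair_grad(x_a, x_b, w):
    """Gradient of the pairwise cost w.r.t. the weights of a linear scorer
    s(x) = w . x (a stand-in for the neural net on the slides)."""
    diff = np.dot(w, x_a) - np.dot(w, x_b)
    lambda_ab = -1.0 / (1.0 + np.exp(diff))   # dC/d(diff) = -sigmoid(-diff)
    return lambda_ab * (x_a - x_b)

def train(pairs, dim, lr=0.1, epochs=20):
    """Toy gradient-descent loop over preference pairs (x_a preferred over x_b)."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x_a, x_b in pairs:
            w -= lr * ranknet_pair_grad(x_a, x_b, w)
    return w

def score(x, w):
    """Prediction: present an individual feature vector and get a score."""
    return float(np.dot(w, x))

# toy usage: the first vector in each pair should be ranked higher
pairs = [(np.array([1.0, 0.2]), np.array([0.3, 0.9]))]
w = train(pairs, dim=2)
print(score(np.array([1.0, 0.2]), w) > score(np.array([0.3, 0.9]), w))  # True
```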
Example Results: Predicting User Preferences
[Figure: precision vs. recall (precision roughly 0.6–0.8, recall up to 0.4) for Baseline, SA+N, CD, and the full UserBehavior model]
• Baseline < SA+N < CD << UserBehavior
• Rich user behavior features result in a dramatic improvement
How to Use Behavior Models for Ranking?
• Use interactions from previous instances of query
– General-purpose (not personalized)
– Only for the queries with past user interactions
• Models:
– Rerank, clickthrough only:
reorder results by number of clicks (see the sketch after this list)
– Rerank, predicted preferences (all user behavior features):
reorder results by predicted preferences
– Integrate directly into ranker:
incorporate user interactions as features for the ranker
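As a concrete illustration of the first and simplest model (rerank by clickthrough only), here is a short Python sketch; the data layout and function name are assumptions made only for this example.

```python
def rerank_by_clicks(results, click_counts):
    """'Rerank, clickthrough only': reorder the original result list by the
    number of past clicks for this query, breaking ties by original rank.
    results: list of (original_rank, url); click_counts: url -> past clicks."""
    return sorted(results,
                  key=lambda item: (-click_counts.get(item[1], 0), item[0]))

# toy usage
results = [(1, "a.org"), (2, "b.org"), (3, "c.org")]
print(rerank_by_clicks(results, {"b.org": 40, "c.org": 15}))
# -> [(2, 'b.org'), (3, 'c.org'), (1, 'a.org')]
```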
Enhance Ranker Features with User Behavior
Features
• For a given query
– Merge original feature set with user behavior
features when available
– User behavior features computed from previous
interactions with same query
• Train RankNet [Burges et al., ICML’05] on
the enhanced feature set
Feature Merging: Details
Query: SIGIR (fake results with fake feature values)

Result URL          | BM25 | PageRank | … | Clicks | DwellTime | …
sigir2007.org       | 2.4  | 0.5      | … | ?      | ?         | …
sigir2006.org       | 1.4  | 1.1      | … | 150    | 145.2     | …
acm.org/sigs/sigir/ | 1.2  | 2.0      | … | 60     | 23.5      | …
• Value scaling:
– Binning vs. log-linear vs. linear (e.g., μ=0, σ=1)
• Missing values:
– Fill with 0? (What does 0 mean for features normalized to μ=0?)
• Runtime: significant plumbing problems
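A rough Python sketch of the merge step for the toy table above: content features are kept for every result, behavior features are appended when past interactions exist, and missing values are filled with a configurable constant. Column names mirror the fake table, and the zscale helper illustrates the linear (μ=0, σ=1) scaling option; none of this is the production pipeline.

```python
import math

CONTENT = ["BM25", "PageRank"]
BEHAVIOR = ["Clicks", "DwellTime"]

def zscale(values):
    """Linear scaling of one feature column to mean 0, std 1."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
    return [(v - mean) / std for v in values]

def merge_features(content_rows, behavior_rows, missing=0.0):
    """For each URL, append behavior features to the content features.
    URLs with no past interactions for this query get the 'missing' fill value
    (the slide asks what 0 means once features are normalized to mean 0)."""
    merged = {}
    for url, feats in content_rows.items():
        beh = behavior_rows.get(url, {})
        merged[url] = ([feats[f] for f in CONTENT] +
                       [beh.get(f, missing) for f in BEHAVIOR])
    return merged

content = {"sigir2007.org": {"BM25": 2.4, "PageRank": 0.5},
           "sigir2006.org": {"BM25": 1.4, "PageRank": 1.1},
           "acm.org/sigs/sigir/": {"BM25": 1.2, "PageRank": 2.0}}
behavior = {"sigir2006.org": {"Clicks": 150, "DwellTime": 145.2},
            "acm.org/sigs/sigir/": {"Clicks": 60, "DwellTime": 23.5}}

merged = merge_features(content, behavior)
clicks_scaled = zscale([row[2] for row in merged.values()])  # scale the Clicks column
print(merged)
print(clicks_scaled)
```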
Evaluation Metrics
• Precision at K: fraction of relevant in top K
• NDCG at K: norm. discounted cumulative gain
– Top-ranked results most important
N_q = M_q \sum_{j=1}^{K} \frac{2^{r(j)} - 1}{\log(1 + j)}
• MAP: mean average precision
– Average precision for each query: the mean of the precision@K values computed at each rank where a relevant document is retrieved; MAP averages these over all queries
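To ground the formulas, here is a small Python sketch of NDCG@K as written on the slide and of per-query average precision (MAP is then the mean of AP over queries). The normalizer M_q is passed in rather than derived from the ideal ranking, and the log base follows the formula as shown; both are simplifying assumptions.

```python
import math

def ndcg_at_k(relevances, k, norm):
    """NDCG@K: N_q = M_q * sum_{j=1..K} (2^r(j) - 1) / log(1 + j).
    relevances: graded labels r(j) in ranked order; norm: the per-query
    normalizer M_q (e.g., 1 / ideal DCG), supplied by the caller."""
    dcg = sum((2 ** r - 1) / math.log(1 + j)
              for j, r in enumerate(relevances[:k], start=1))
    return norm * dcg

def average_precision(relevant_flags):
    """Average precision for one query: mean of precision@K taken at each
    rank where a relevant document was retrieved."""
    hits, precisions = 0, []
    for k, rel in enumerate(relevant_flags, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0

print(average_precision([1, 0, 1, 0, 0]))  # -> (1/1 + 2/3) / 2 ≈ 0.833
```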
Content, User Behavior: NDCG
[Figure: NDCG@K for K = 1..10 (roughly 0.5–0.68) for BM25, Rerank-CT, Rerank-All, and BM25+All]
BM25 < Rerank-CT < Rerank-All < BM25+All
Full Search Engine, User Behavior: NDCG, MAP
[Figure: NDCG@K for K = 1..10 (roughly 0.56–0.74) for RN, Rerank-All, and RN+All]

Method   | MAP   | Gain
RN       | 0.270 |
RN+ALL   | 0.321 | 0.052 (19.13%)
BM25     | 0.236 |
BM25+ALL | 0.292 | 0.056 (23.71%)
User Behavior Complements Content and Web Topology
[Figure: Precision@K for K = 1, 3, 5, 10 (roughly 0.45–0.7) for RN, RN+All, BM25, and BM25+All]

Method                   | P@1   | Gain
RN (Content + Links)     | 0.632 |
RN + All (User Behavior) | 0.693 | 0.061 (10%)
BM25                     | 0.525 |
BM25+All                 | 0.687 | 0.162 (31%)
Which Queries Benefit Most
[Figure: average gain (roughly -0.4 to 0.2) and query frequency (0–350) across queries binned by original ranking quality (0.1–0.6)]
Most gains are for queries with poor ranking
Result Summary
• Incorporating user behavior into web search
ranking dramatically improves relevance
• Providing rich user interaction features to
ranker is the most effective strategy
• Large improvement shown for up to 50% of
test queries
User Generated Content
Some goals of mining social media
• Find relevant, high-quality content
• Use millions of interactions to
– Understand complex information needs
– Model subjective information seeking
– Understand cultural dynamics
[Screenshot: example Yahoo! Answers question — http://answers.yahoo.com/question/index;_ylt=3?qid=20071008115118AAh1HdO]
Lifecycle of a Question in CQA
• Choose a category → compose the question → open the question
• Community users examine the open question and contribute answers; other users rate the answers (+/-)
• Did the asker find the answer?
– Yes: the asker closes the question, chooses the best answer, and gives ratings
– No: the question is closed by the system and the best answer is chosen by voters
[Screenshots: Yahoo! Answers community features]
Editorial Quality != User Popularity !=
Usefulness
Are editor/judge labels “meaningful”?
• Information seeking process: users want to find useful information about a topic, starting from incomplete knowledge
• N. Belkin: "Anomalous States of Knowledge"
• Want to model directly whether the user found satisfactory information
• Specific (amenable) case: CQA
Yahoo! Answers: The Good News
• Active community of millions of users in many
countries and languages
• Accumulated a great number of questions and
answers
• Effective for subjective information needs
– Great forum for socialization/chat
• (Can be) invaluable for hard-to-find
information not available on web
Yahoo! Answers: The Bad News
• May have to wait a long time to get a satisfactory
answer
[Figure: time to close a question (hours, roughly 0–40) for ten sample question categories]
1. 2006 FIFA World Cup
2. Optical
3. Poetry
4. Football (American)
5. Scottish Football (Soccer)
6. Medicine
7. Winter Sports
8. Special Education
9. General Health Care
10. Outdoor Recreation
• May never obtain a satisfying answer
Asker Satisfaction Problem
• Given a question submitted by an asker in CQA,
predict whether the user will be satisfied with the
answers contributed by the community.
– Where “Satisfied” is defined as:
• The asker personally has closed the question AND
• Selected the best answer AND
• Provided a rating of at least 3 “stars” for the best
answer
– Otherwise, the asker is "Unsatisfied"
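The label definition above is mechanical enough to state as code. A minimal Python sketch follows; the dict keys are illustrative stand-ins for whatever the crawl schema actually records.

```python
def is_satisfied(question):
    """Label rule from the slide: the asker is 'satisfied' iff they personally
    closed the question, selected the best answer, and rated it >= 3 stars.
    'question' is an illustrative dict, not the actual data schema."""
    return (question.get("closed_by") == "asker"
            and question.get("best_answer_chosen_by") == "asker"
            and question.get("asker_rating", 0) >= 3)

print(is_satisfied({"closed_by": "asker",
                    "best_answer_chosen_by": "asker",
                    "asker_rating": 4}))      # True
print(is_satisfied({"closed_by": "system"}))  # False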
Approach: Machine Learning over Content and
Usage Features
• Theme: holistic integration of content analysis
and usage analysis
• Method: supervised (and later partially supervised) machine learning over features
• Tools:
– Weka (ML library): SVM, boosting, decision trees, naïve Bayes, …
– Part-of-speech taggers, chunkers
– Corpora (Wikipedia, web, queries, …)
Satisfaction Prediction Features
[Diagram: question features, asker history, answer features, answerer history, category features, and textual features feed a classifier (Support Vector Machines, decision tree, boosting, naïve Bayes) that predicts "asker is satisfied" vs. "asker is not satisfied"]
• Approach: classification algorithms from machine learning
Prediction Algorithms
• Heuristic: number of answers
• Baseline: simply predicts the majority class (satisfied)
• ASP_SVM: our system with the SVM classifier
• ASP_C4.5: with the C4.5 classifier
• ASP_RandomForest: with the RandomForest classifier
• ASP_Boosting: with the AdaBoost algorithm combining weak learners
• ASP_NaiveBayes: with the Naive Bayes classifier
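The experiments in the talk used Weka; as a rough scikit-learn analogue, the sketch below cross-validates the same family of classifiers on a stand-in feature matrix. The random data and the "C4.5-like" decision tree are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# X: one row of satisfaction-prediction features per question, y: 1 = satisfied.
# Random data stands in for the real crawl.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = rng.integers(0, 2, size=500)

classifiers = {
    "ASP_SVM": SVC(),
    "ASP_C4.5-like": DecisionTreeClassifier(),   # sklearn trees approximate C4.5
    "ASP_RandomForest": RandomForestClassifier(),
    "ASP_Boosting": AdaBoostClassifier(),
    "ASP_NaiveBayes": GaussianNB(),
}
for name, clf in classifiers.items():
    f1 = cross_val_score(clf, X, y, cv=5, scoring="f1").mean()
    print(f"{name}: F1 = {f1:.2f}")
```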
Evaluation metrics
• Precision
– The fraction of the predicted satisfied asker information needs
that were indeed rated satisfactory by the asker.
• Recall
– The fraction of all rated satisfied questions that were correctly
identified by the system.
• F1
– The harmonic mean of the Precision and Recall measures,
– Computed as 2 * (precision * recall) / (precision + recall)
• Accuracy
– The overall fraction of instances classified correctly into the
proper class.
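These four metrics reduce to a few counts; a small Python sketch follows, treating "satisfied" as the positive class (an assumption consistent with the definitions above).

```python
def precision_recall_f1_accuracy(y_true, y_pred, positive="satisfied"):
    """The four metrics as defined on the slide; F1 is the harmonic mean of
    precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, correct / len(y_true)

print(precision_recall_f1_accuracy(
    ["satisfied", "satisfied", "unsatisfied", "unsatisfied"],
    ["satisfied", "unsatisfied", "unsatisfied", "satisfied"]))
```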
Datasets
Crawled from Yahoo! Answers in early 2008 (thanks, Yahoo!, for the support)

Questions | Answers   | Askers  | Categories | % Satisfied
216,170   | 1,963,615 | 158,515 | 100        | 50.7%

Data is available at http://ir.mathcs.emory.edu/shared
Dataset Statistics

Category                 | #Q   | #A    | #A per Q | Satisfied | Avg asker rating | Time to close by asker
2006 FIFA World Cup (TM) | 1194 | 35659 | 29.86    | 55.4%     | 2.63             | 47 minutes
Mental Health            | 151  | 1159  | 7.68     | 70.9%     | 4.30             | 1 day and 13 hours
Mathematics              | 651  | 2329  | 3.58     | 44.5%     | 4.48             | 33 minutes
Diet & Fitness           | 450  | 2436  | 5.41     | 68.4%     | 4.30             | 1.5 days

Asker satisfaction varies significantly across different categories.

Can #Q, #A, time to close, … predict asker satisfaction?
Satisfaction Prediction: Human Perf
• Truth: asker’s rating
• A random sample of 130 questions
• Annotated by researchers to calibrate against asker satisfaction
– Agreement: 0.82
– F1: 0.45
Satisfaction Prediction: Human Perf (Cont’d):
Amazon Mechanical Turk
• A service provided by Amazon: workers submit responses to a Human Intelligence Task (HIT) for $0.01–0.10 per HIT
• Can usually get thousands of items labeled in hours
Satisfaction Prediction: Human Perf (Cont’d):
Amazon Mechanical Turk
• Methodology
– Used the same 130 questions
– For each question, list the best answer, as well as four other answers ordered by votes
– Five independent raters for each question.
– Agreement: 0.9 F1: 0.61.
– Best accuracy achieved when at least 4 out of 5
raters predicted asker to be ‘satisfied’ (otherwise,
labeled as “unsatisfied”).
Comparison of Human and Automatic Prediction (F1 measure)

Classifier       | With Text | Without Text | Selected Features
ASP_SVM          | 0.69      | 0.72         | 0.62
ASP_C4.5         | 0.75      | 0.76         | 0.77
ASP_RandomForest | 0.70      | 0.74         | 0.68
ASP_Boosting     | 0.67      | 0.67         | 0.67
ASP_NB           | 0.61      | 0.65         | 0.58
Best Human Perf  | 0.61      |              |
Baseline (naïve) | 0.66      |              |

• C4.5 is the most effective classifier for this task
• Human F1 performance is lower than the naïve baseline!
Features by Information Gain (Satisfied class)
• 0.14219  Q: Asker's previous rating
• 0.13965  Q: Average past rating by asker
• 0.10237  UH: Member since (interval)
• 0.04878  UH: Average # answers for past questions
• 0.04878  UH: Previous question resolved for the asker
• 0.04381  CA: Average asker rating for the category
• 0.04306  UH: Total number of answers received
• 0.03274  CA: Average voter rating
• 0.03159  Q: Question posting time
• 0.02840  CA: Average # answers per question
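For reference, a minimal Python sketch of the information-gain computation behind a ranking like this, assuming features have already been discretized; the toy feature values and labels are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(feature_values, labels):
    """IG(label; feature) = H(label) - sum_v p(v) * H(label | feature = v).
    Assumes the feature is categorical or already binned; the slide's features
    mix numeric and categorical values, and binning is left out of this sketch."""
    total = len(labels)
    by_value = {}
    for v, y in zip(feature_values, labels):
        by_value.setdefault(v, []).append(y)
    conditional = sum((len(ys) / total) * entropy(ys) for ys in by_value.values())
    return entropy(labels) - conditional

# toy example: a binary feature vs. the satisfied label
print(information_gain(["has_history", "has_history", "new", "new"],
                       ["sat", "sat", "unsat", "sat"]))
```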
“Offline” vs. “Online” Prediction
• Offline prediction:
– All features (question, answer, asker, and category)
– F1: 0.77
• Online prediction:
– NO answer features
– Only asker history and question features (stars, # comments, sum of votes, …)
– F1: 0.74
Feature Ablation

Features                    | Precision | Recall | F1
Selected features           | 0.80      | 0.73   | 0.77
No question-answer features | 0.76      | 0.74   | 0.75
No answerer features        | 0.76      | 0.75   | 0.75
No category features        | 0.75      | 0.76   | 0.75
No asker features           | 0.72      | 0.69   | 0.71
No question features        | 0.68      | 0.72   | 0.70

• Asker and question features are the most important.
• Answer quality, answerer expertise, and category characteristics may not be as important: caring or supportive answers might sometimes be preferred.
Satisfaction: Varying by Asker Experience
• Group together questions from askers with the same number of previous questions
• Prediction accuracy increases dramatically, reaching an F1 of 0.9 for askers with >= 5 previous questions
Personalized Prediction of Asker Satisfaction with Information
• The same information != the same usefulness for different users!
• A personalized classifier achieves surprisingly good accuracy (even with just 1 previous question!)
• The simple strategy of grouping users by the number of previous questions is more effective than other methods for users with a moderate amount of history
• For users with >= 20 questions, textual features are more significant
Some Results
Some Personalized Models
Summary
• Asker satisfaction is predictable
– Can achieve higher than human accuracy by exploiting
interaction history
• User’s experience is important
• General model: one size fits all
– 2,000 questions are enough to train the model
• Personalized satisfaction prediction:
– Helps with sufficient data (>= 1 prev interactions, can
observe text patterns with >=20 prev. interactions)
Other tasks in progress
• Subjectivity, sentiment analysis
– B. Li, Y. Liu, and E. Agichtein, CoCQA: Co-Training
Over Questions and Answers with an Application
to Predicting Question Subjectivity Orientation, in
Proc. of EMNLP 2008
• Discourse analysis
• Cross-cultural comparisons
• CQA vs. web search comparison
Summary
• User-generated Content
– Growing
– Important: impact on mainstream media, scholarly publishing, …
– Can provide insight into information seeking and
social processes
– “Training” data for IR, machine learning, NLP, ….
– Need to re-think quality, impact, usefulness
Thank you!
• Research areas:
– Information retrieval & extraction, text mining, and information integration
– User behavior modeling, social networks and interactions, social media
• People
– Walter Askew (Emory '09), Qi Guo (2nd year Ph.D.), Yandong Liu (2nd year Ph.D.)
– Alvin Grissom (2nd year M.S.), Ryan Kelly (Emory '10), Abulimiti Aji (1st year Ph.D.)
And colleagues at Yahoo! Research, Microsoft Research, Emory Libraries, Psychology,
Emory School of Medicine, Neuroscience, and Georgia Tech College of Computing.
• Support
Question-Answer Features
• Q: length, posting time, …
• Q: terms
• Q: votes
• QA: length, KL divergence
User Features
• U: Member since
• U: Total points
• U: # Questions
• U: # Answers
Category Features
• CA: Average time to close a question
• CA: Average # answers per question
• CA: Average asker rating
• CA: Average voter rating
• CA: Average # questions per hour
• CA: Average # answers per hour
Category       | #Q  | #A  | #A per Q | Satisfied | Avg asker rating | Time to close by asker
General Health | 134 | 737 | 5.46     | 70.4%     | 4.49             | 1 day and 13 hours