Exploring Cost-Effective Approaches to Human Evaluation

A Framework for Human Evaluation of
Search Engine Relevance
Kamal Ali
Chi Chao Chang
Yun-fang Juan
Outline
• The Problem
• Framework of Test Types
• Results
  • Expt. 1: Set-level versus Item-level
  • Expt. 2: Perceived Relevance versus Landing-Page Relevance
  • Expt. 3: Editorial Judges versus Panelist Judges
• Related Work
• Contributions
The Problem
• Classical IR: Cranfield technique
• Search Engine Experience
  • Results + UI + Speed + Advertising, spell suggestions, ...
• Search Engine Relevance
  • Users see summaries (abstracts), not documents
  • Heterogeneity of results: images, video, music, ...
  • User population more diverse
  • Information needs more diverse (commercial, entertainment)
• Human Evaluation costs are high
  • Holistic alternative: judge set of results
Framework of Test Types
• Dimension 1: Query Category
  • Images, News, Product Research, Navigational, …
• Dimension 2: Modality
  • Implicit (click behavior) versus Explicit (judgments)
• Dimension 3: Judge Class
  • Live users versus Panelists versus Domain experts
• Dimension 4: Granularity
  • Set-level versus Item-level
• Dimension 5: Depth
  • Perceived relevance versus Landing-Page (“Actual”) relevance
• Dimension 6: Relativity
  • Absolute judgments versus Relative judgments
Explored in this work
Dimension 1: Query Category
• Random (mixed), Adult, How-to, Map, Weather, Biographical, Tourism, Sports, …
Dimension 2: Modality
• Implicit/Behavioral
  • Advantage: direct user behavior
  • Disadvantage: click behavior etc. is ambiguous
• Explicit
  • Advantage: can give reason for behavior
  • Disadvantage: subject to self-reflection
Explored in this work
Dimension 3: Judge Class
• Live users
  • Advantages: end user; representative?
  • Disadvantage: participation
• Panelists
  • Advantages: participation; highly incented
  • Disadvantage: quality
• Domain Experts
  • Advantages: available; high quality
  • Disadvantage: biased class?
Dimension 4: Granularity
• Set-level
  • Advantages: rewards diversity; penalizes duplicates
  • Disadvantages: credit assignment; not recomposable
• Item-level
  • Advantage: re-combinable
  • Disadvantage: user sees a set, not a single item
Explored in this work
Dimension 5: Depth
• Perceived Relevance
  • Advantage: user sees this first
  • Disadvantage: perceived relevance may be high while actual relevance is low
• Landing-Page Relevance
  • Advantage: user’s final goal
Dimension 6: Relativity
• Absolute Relevance
  • Advantage: usable by search engine engineers to optimize
  • Disadvantage: harder for judges
• Relative Relevance
  • Advantage: easier for judges to judge
  • Disadvantage: can’t incorporate a 3rd engine easily, i.e. not re-combinable
Experimental Methodology
• Random set of queries taken from Yahoo! Web logs
• Per-set:
  • Judge sees 10 items from a search engine
  • Gives one judgment for entire set
  • Order is as supplied by search engine
• Per-item:
  • Judge sees one item (document) at a time
  • Gives one judgment per item
  • Order is scrambled
• Side-by-side:
  • Judge sees 2 sets; sides are scrambled
  • Within each set, order is preserved
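As a concrete illustration of the three presentation modes, here is a minimal Python sketch of how item ordering and side assignment might be set up; the function names and data shapes are assumptions, not the actual test harness.

```python
import random

# Minimal sketch of how the three presentation modes could be assembled from
# one engine's top-10 results. Function names and data shapes are illustrative
# assumptions, not the actual test harness.

def per_set(results):
    """Set-level test: judge sees all 10 items together, in engine order."""
    return list(results)

def per_item(results, rng=random):
    """Item-level test: judge sees one item at a time, in scrambled order."""
    items = list(results)
    rng.shuffle(items)
    return items

def side_by_side(results_a, results_b, rng=random):
    """Side-by-side test: which engine appears on which side is scrambled,
    but the order within each set is preserved."""
    sides = [("engine1", list(results_a)), ("engine2", list(results_b))]
    rng.shuffle(sides)
    return sides

if __name__ == "__main__":
    top10 = [f"doc{i}" for i in range(1, 11)]
    print(per_item(top10)[:3])
    print([name for name, _ in side_by_side(top10, top10)])
```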
Expt. 1: Effect of Granularity:
Set-level versus Item-level
Methodology:
• Domain/expert editors
• Self-selection of queries
• Value 1: Per-set judgments
  • Given on a 3-point scale (1 = best, 3 = worst), producing a discrete random variable
• Value 2: Item-level judgments
  • 10 item-level judgments are rolled up to a single number using a DCG roll-up function (a sketch follows below)
  • DCG values are discretized into 3 bins/levels, producing the 2nd discrete random variable
• Look at the resulting 3 × 3 contingency matrix
• Compute the correlation between these variables
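To make the roll-up concrete, here is a minimal Python sketch of a DCG-style roll-up of 10 item-level judgments followed by discretization into 3 levels; the gain scale, log base, and bin boundaries below are assumptions, since the slides do not specify them.

```python
import math

# Minimal sketch of the roll-up described above: 10 item-level judgments are
# combined with a DCG-style function and then discretized into 3 bins. The
# gain values, log base, and bin boundaries used in the study are not given
# on the slide, so the ones below are assumptions for illustration.

def dcg(gains):
    """Position-discounted sum of per-item gains (rank 1 is undiscounted)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def discretize(value, boundaries):
    """Map a continuous DCG value to a discrete level 1..len(boundaries)+1
    (1 = best), given descending bin boundaries."""
    for level, b in enumerate(boundaries, start=1):
        if value >= b:
            return level
    return len(boundaries) + 1

# Example: 10 item-level judgments on an assumed 0/1/2 gain scale.
item_judgments = [2, 2, 1, 0, 1, 2, 0, 0, 1, 0]
score = dcg(item_judgments)
level = discretize(score, boundaries=[6.0, 3.0])  # assumed cut points
print(score, level)
```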
Expt. 1: Effect of Granularity:
Set-level versus Item-level: Images
• Domain 1: Search at Image site or Image tab
• 299 image queries; 2-3 judges per query; 6856 judgments
  • 198 queries in common between the set-level and item-level tests
• 20 images (in a 4 × 5 grid) shown per query

             SET=1       SET=2      SET=3     Marginal
DCG=1          130          18          1     149 (75%)
DCG=2           16          13          7      36 (18%)
DCG=3            3           5          5      13 (7%)
Marginal   149 (75%)    36 (18%)   13 (7%)    198;  r = 0.54
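One way to recompute the reported correlation from this contingency table is to expand each cell count into that many (DCG level, SET level) pairs and take Spearman's rho over the pairs; the sketch below does this with scipy. Tie handling may make the result differ slightly from the slide's 0.54.

```python
from scipy.stats import spearmanr

# Expand the 3 x 3 contingency counts into paired observations and compute
# Spearman's rho between the two discretized variables.
counts = {  # counts[(dcg_level, set_level)] from the table above
    (1, 1): 130, (1, 2): 18, (1, 3): 1,
    (2, 1): 16,  (2, 2): 13, (2, 3): 7,
    (3, 1): 3,   (3, 2): 5,  (3, 3): 5,
}

dcg_levels, set_levels = [], []
for (d, s), n in counts.items():
    dcg_levels += [d] * n
    set_levels += [s] * n

rho, p_value = spearmanr(dcg_levels, set_levels)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.1e})")
```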
Expt. 1: Effect of Granularity:
Set-level versus Item-level: Images
• Interpretation of Results
  • Spearman correlation is a middling 0.54
  • Look at outlier queries
  • “Hollow Man”: high set-level score, low item-level score
    • Most items were irrelevant, explaining the low item-level DCG; recall that judges saw images one at a time in scrambled order
    • Set-level: the eye can quickly (in parallel) spot a relevant image, and is less sensitive to irrelevant images
    • The ranking function was poor, leading to a low DCG score; unusual, since set-level judging normally picks out poor ranking
Expt. 2: Effect of Depth:
Perceived Relevance vs. Landing Page
• Fix granularity at Item level
• Perceived Relevance:
  • Title, abstract (summary) and URL shown (T.A.U.)
  • Judgment 1: Relevance of title
  • Judgment 2: Relevance of abstract
• Click on URL to reach Landing Page
  • Judgment 3: Relevance of Landing Page
Expt. 2: Effect of Depth:
Perceived vs. “Actual”: Advertisements
• Domain 1: Search at News site
• Created a compound random variable for Perceived Relevance by AND’ing Title Relevance and Abstract Relevance
• Correlated it with Landing-Page (“Actual”) Relevance (computation sketched below)
• Higher correlation: News Title/Abstract carefully constructed

                landingPg=NR   landingPg=R    Marginal
perceived=NR         511             62       573 (20%)
perceived=R           22           2283       2305 (80%)
Marginal         533 (19%)     2345 (81%)     2878;  r = 0.91
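The slide does not say which correlation measure produced r = 0.91, but for two binary variables Pearson's r reduces to the phi coefficient, which can be read directly off the 2 × 2 counts above; the sketch below yields a value of about 0.91 from those counts.

```python
import math

# Phi coefficient for the 2 x 2 table above (equivalent to Pearson's r for
# binary variables). Cell labels follow the table: NR = not relevant, R = relevant.
a, b = 511, 62      # perceived=NR: landing NR, landing R
c, d = 22, 2283     # perceived=R:  landing NR, landing R

phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(f"phi = {phi:.2f}")  # ~0.91, consistent with the slide
```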
Expt. 3: Effect of Judge Class:
Editors versus Panelists
Methodology:
• 1000 randomly selected queries (frequency-biased sampling)
• 40 judges; a few hundred panelists
• Panelists
  • Recruitment: long-standing panel
  • Reward: gift certificate
  • Remove panelists that completed the test too quickly or missed sentinel questions (a filtering sketch follows after this slide)
• Test configuration in the framework: Query=*, Modality=Explicit, Granularity=Set-level, Depth=Mixed, Relativity=Relative
Questions:
1. Does judge class affect the overall conclusion on which search engine is better?
2. Are there particular types of queries for which significant differences exist?
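As an illustration of the panelist-cleaning step above, here is a small Python sketch that drops sessions completed too quickly or with missed sentinel questions; the time threshold, field names, and sentinel format are assumptions, not details from the study.

```python
# Hypothetical panelist-cleaning step: drop panelists who finished too quickly
# or missed planted sentinel (quality-check) questions. Threshold and record
# layout are assumptions for illustration only.

MIN_SECONDS = 120  # assumed minimum plausible completion time

def keep_panelist(session):
    """session: dict with 'duration_sec' and 'sentinel_answers'
    (list of (given, expected) pairs for the planted check questions)."""
    if session["duration_sec"] < MIN_SECONDS:
        return False  # completed the test too quickly
    if any(given != expected for given, expected in session["sentinel_answers"]):
        return False  # missed a sentinel question
    return True

sessions = [
    {"duration_sec": 640, "sentinel_answers": [("eng1", "eng1")]},
    {"duration_sec": 45,  "sentinel_answers": [("eng1", "eng1")]},     # too fast
    {"duration_sec": 900, "sentinel_answers": [("neutral", "eng2")]},  # failed check
]
clean = [s for s in sessions if keep_panelist(s)]
print(len(clean))  # 1
```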
Expt. 3: Effect of Judge Class:
Editors versus Panelists
• Column percentages, p(P | E):
  • p(P=eng1) = .28, but p(P=eng1 | E=eng1) = .35
  • Lift = .35 / .28 = 1.25, a modest 25% lift

               Row marginal   E=engine1   E=neutral   E=engine2
P = engine1        28%           35%         26%         16%
P = neutral        47%           47%         49%         48%
P = engine2        25%           17%         25%         35%
Expt. 3: Effect of Judge Class:
Editors versus Panelists
• Editor marginal distribution: (.33, .33, .33)
• Panelists are less likely to discern a difference: (.25, .50, .25)
• Given that P favors Eng1, E is more likely to favor Eng1 than if P favors Eng2

Row percentages, p(E | P):
                E=engine1   E=neutral   E=engine2
Col. marginal      37%          32%         30%
P = engine1        49%          32%         19%
P = neutral        36%          34%         30%
P = engine2        25%          33%         42%
Expt. 3: Effect of Judge Class:
Correlation, Association
• Linear model: r² = 0.03
• 3 × 3 categorical model: φ = 0.29
• Test for non-zero association: χ² = 16.3, significant at the 99.5% level
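The slides report the categorical association without showing the computation; below is a sketch of how χ² and an effect size such as Cramér's V are typically obtained from a 3 × 3 editor-vs-panelist count table. The slides give only percentages, so the counts here are made-up placeholders that roughly follow those percentages; the study's actual counts would be needed to reproduce χ² = 16.3 and φ = 0.29.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 3 x 3 counts (rows: P = engine1/neutral/engine2,
# cols: E = engine1/neutral/engine2), roughly following the percentage
# tables on the earlier slides. Placeholders only, not the study's data.
counts = np.array([
    [17, 10,  6],
    [23, 21, 20],
    [ 8, 11, 17],
])

chi2, p_value, dof, _ = chi2_contingency(counts)
n = counts.sum()
cramers_v = np.sqrt(chi2 / (n * (min(counts.shape) - 1)))
print(f"chi2 = {chi2:.1f}, p = {p_value:.3f}, Cramér's V = {cramers_v:.2f}")
```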
Expt. 3: Effect of Judge Class:
Qualitative Analysis
• Editors’ top feedback:
  1. Ranking not as good (precision)
  2. Both equally good (precision)
  3. Relevant sites missing (recall)
  4. Perfect site missing (recall)
• Panelists’ top feedback:
  1. Both equally good (precision)
  2. Too general (precision)
  3. Both equally bad (precision)
  4. Ranking not as good (precision)
• Panelists need to see the other search engine’s results to penalize poor recall
Related Work
• Mizzaro
  • Framework with 3 key dimensions of evaluation:
    • Information needs (expression level)
    • Information resources (T.A.U., documents, ...)
    • Information context
  • High level of disagreement among judges
• Amento et al.
  • Correlation analysis
  • Expert judges
  • Automated metrics: in-degree, PageRank, page size, ...
Contributions
• Framework of test types
• Set-level to item-level correlation:
  • Middling: the two granularities measure different aspects of relevance
  • Set-level measures aspects missed by per-item judging: poor ranking, duplicates, missed senses
• Perceived relevance to “Actual” relevance:
  • Higher correlation, maybe because the domain was News
• Editorial judges versus Panelists:
  • Panelists sit on the fence more
  • Panelists focus more on precision; they need to see another engine to judge recall
  • Panelist methodology and reward structure are crucial