A Framework for Human Evaluation of Search Engine Relevance
Kamal Ali, Chi Chao Chang, Yun-fang Juan

Outline
• The Problem
• Framework of Test Types
• Results
  - Expt. 1: Set-level versus Item-level
  - Expt. 2: Perceived Relevance versus Landing-Page Relevance
  - Expt. 3: Editorial Judges versus Panelist Judges
• Related Work
• Contributions

The Problem
• Classical IR: the Cranfield technique
• Search engine experience
  - Results + UI + speed + advertising, spell suggestions, ...
• Search engine relevance
  - Users see summaries (abstracts), not documents
  - Heterogeneity of results: images, video, music, ...
  - User population is more diverse
  - Information needs are more diverse (commercial, entertainment)
• Human evaluation costs are high
  - Holistic alternative: judge the set of results

Framework of Test Types
• Dimension 1: Query Category
  - Images, News, Product Research, Navigational, ...
• Dimension 2: Modality
  - Implicit (click behavior) versus Explicit (judgments)
• Dimension 3: Judge Class
  - Live users versus Panelists versus Domain Experts
• Dimension 4: Granularity
  - Set-level versus Item-level
• Dimension 5: Depth
  - Perceived relevance versus Landing-Page ("Actual") relevance
• Dimension 6: Relativity
  - Absolute judgments versus Relative judgments
(Explored in this work)

Dimension 1: Query Category
• Random (mixed), Adult, How-to, Map, Weather, Biographical, Tourism, Sports, ...

Dimension 2: Modality
• Implicit / Behavioral
  - Advantage: direct user behavior
  - Disadvantage: click behavior, etc., is ambiguous
• Explicit
  - Advantage: judge can give a reason for the behavior
  - Disadvantage: subject to self-reflection
(Explored in this work)

Dimension 3: Judge Class
• Live users
  - Advantage: the end user
  - Disadvantage: participation
• Panelists
  - Advantages: representative? participation highly incented
  - Disadvantage: quality
• Domain Experts
  - Advantages: available; high quality
  - Disadvantage: a biased class?

Dimension 4: Granularity
• Set-level
  - Advantages: rewards diversity; penalizes duplicates
  - Disadvantages: credit assignment; not re-composable
• Item-level
  - Advantage: re-combinable
  - Disadvantage: the user sees a set, not a single item
(Explored in this work)

Dimension 5: Depth
• Perceived relevance
  - Advantage: the user sees this first
  - Disadvantage: perceived relevance may be high while actual relevance is low
• Landing-page relevance
  - Advantage: the user's final goal

Dimension 6: Relativity
• Absolute relevance
  - Advantage: usable by search engine engineers to optimize
  - Disadvantage: harder for judges
• Relative relevance
  - Advantage: easier for judges to judge
  - Disadvantage: can't easily incorporate a 3rd engine (not re-combinable)

Experimental Methodology
• Random set of queries taken from Yahoo! Web logs
• Per-set:
  - Judge sees 10 items from a search engine
  - Gives one judgment for the entire set
  - Order is as supplied by the search engine
• Per-item:
  - Judge sees one item (document) at a time
  - Gives one judgment per item
  - Order is scrambled
• Side-by-side:
  - Judge sees 2 sets; sides are scrambled
  - Within each set, order is preserved

Expt. 1: Effect of Granularity: Set-level versus Item-level
• Methodology: domain/expert editors; self-selection of queries
• Value 1: per-set judgments
  - Given on a 3-point scale (1 = best, 3 = worst), producing a discrete random variable
• Value 2: item-level judgments
  - 10 item-level judgments are rolled up to a single number (using a DCG roll-up function)
  - DCG values are discretized into 3 bins/levels, producing the 2nd discrete random variable
• Look at the resulting 3×3 contingency matrix
• Compute the correlation between these two variables (a sketch of the roll-up follows)
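The slides do not give the exact gain, discount, or bin boundaries used in the roll-up, so the following is only a minimal sketch of the procedure: it assumes the standard log2-discounted DCG, gains derived from the 3-point judgment scale, and illustrative cutoffs. The function names and cutoff values are placeholders, not the study's actual parameters.

```python
import math

def dcg_rollup(item_grades, worst=3):
    """Roll up per-item judgments (1 = best, 3 = worst) into one DCG score.

    Assumptions (not specified in the slides): gain = worst - grade, so the
    best grade earns the largest gain, and the standard log2 rank discount.
    """
    dcg = 0.0
    for rank, grade in enumerate(item_grades, start=1):
        gain = worst - grade                 # grade 1 -> gain 2, grade 3 -> gain 0
        dcg += gain / math.log2(rank + 1)
    return dcg

def discretize(dcg, cutoffs=(6.0, 3.0)):
    """Map a DCG value onto the 3-level scale used in the contingency matrix.

    The cutoffs here are illustrative placeholders, not the study's bins.
    """
    if dcg >= cutoffs[0]:
        return 1        # best bin
    if dcg >= cutoffs[1]:
        return 2
    return 3            # worst bin

# Example: ten item-level judgments for one query, in the engine's rank order.
grades = [1, 1, 2, 1, 3, 2, 1, 3, 3, 2]
print(discretize(dcg_rollup(grades)))
```

The per-set judgment and this discretized DCG level are the two discrete variables cross-tabulated in the 3×3 matrix shown next.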
Expt. 1: Effect of Granularity: Set-level versus Item-level: Images
• Domain 1: search at the Image site or Image tab
• 299 image queries; 2-3 judges per query; 6,856 judgments
• 198 queries in common between the set-level and item-level tests
• 20 images (in a 4×5 matrix) shown per query

            SET=1        SET=2       SET=3      Marginal
DCG=1         130           18           1      149 (75%)
DCG=2          16           13           7       36 (18%)
DCG=3           3            5           5       13 (7%)
Marginal    149 (75%)    36 (18%)    13 (7%)    198; r = 0.54

Expt. 1: Effect of Granularity: Set-level versus Item-level: Images
Interpretation of results
• Spearman correlation is a middling 0.54
• Look at outlier queries
  - "Hollow Man": high set-level score, low item-level score
  - Most items were irrelevant, explaining the low item-level DCG; recall that judges saw the images one at a time, in scrambled order
  - Set-level: the eye can quickly (in parallel) pick out a relevant image, so the judgment is less sensitive to irrelevant images
  - Set-level: the ranking function was poor, leading to a low DCG score; unusual, since normally it is the set-level judgment that picks out poor ranking

Expt. 2: Effect of Depth: Perceived Relevance versus Landing-Page Relevance
• Fix granularity at the item level
• Perceived relevance:
  - Title, abstract (summary), and URL shown (T.A.U.)
  - Judgment 1: relevance of the title
  - Judgment 2: relevance of the abstract
• Click on the URL to reach the landing page
  - Judgment 3: relevance of the landing page

Expt. 2: Effect of Depth: Perceived versus "Actual" Relevance: Advertisements
• Domain 1: search at the News site
• Created a compound random variable for perceived relevance: AND'd title relevance and abstract relevance
• Correlated it with landing-page ("actual") relevance
• Higher correlation: News titles and abstracts are carefully constructed

                 landingPg=NR   landingPg=R    Marginal
perceived=NR          511            62        573 (20%)
perceived=R            22          2283       2305 (80%)
Marginal         533 (19%)     2345 (81%)     2878; r = 0.91
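The slides do not say which correlation coefficient produced the reported r = 0.91. As a check, note that when both variables are binary (relevant / not relevant), the Pearson (and Spearman) correlation reduces to the phi coefficient of the 2×2 table; the sketch below assumes that reading and reproduces the reported value from the published counts.

```python
import math

# 2x2 table from Expt. 2: rows = perceived relevance (NR, R),
# columns = landing-page relevance (NR, R).
a, b = 511, 62      # perceived=NR: landing NR / landing R
c, d = 22, 2283     # perceived=R:  landing NR / landing R

# Phi coefficient of the 2x2 table (equal to Pearson correlation for binary data).
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(round(phi, 2))    # -> 0.91, matching the reported correlation
```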
Expt. 3: Effect of Judge Class: Editors versus Panelists
• Methodology:
  - 1,000 randomly selected queries (frequency-biased sampling)
  - 40 editorial judges, a few hundred panelists
• Panelists:
  - Recruitment: a long-standing panel
  - Reward: gift certificate
  - Remove panelists who completed the test too quickly or missed sentinel questions
• Test type: Query category = * (all), Modality = Explicit, Granularity = Set-level, Depth = Mixed, Relativity = Relative
• Questions:
  1. Does judge class affect the overall conclusion about which search engine is better?
  2. Are there particular types of queries for which significant differences exist?

Expt. 3: Effect of Judge Class: Editors versus Panelists
• Column percentages, p(P | E): p(P=eng1) = .28 overall, but p(P=eng1 | E=eng1) = .35
• Lift = .35 / .28 = 1.25, a modest 25% lift

              Row marginal   E=engine1   E=neutral   E=engine2
P=engine1         28%           35%         26%         16%
P=neutral         47%           47%         49%         48%
P=engine2         25%           17%         25%         35%

Expt. 3: Effect of Judge Class: Editors versus Panelists
• Editor marginal distribution is roughly uniform: (.33, .33, .33)
• Panelists are less likely to discern a difference: (.25, .50, .25)
• Given that P favors Engine 1, E is more likely to favor Engine 1 than when P favors Engine 2

               E=engine1   E=neutral   E=engine2
Col. marginal     37%         32%         30%
P=engine1         49%         32%         19%
P=neutral         36%         34%         30%
P=engine2         25%         33%         42%

Expt. 3: Effect of Judge Class: Correlation, Association
• Linear model: r² = 0.03
• 3×3 categorical model: φ = 0.29
• Test for non-zero association: χ² = 16.3, significant at the 99.5% level
• (A sketch of the lift and χ² check appears at the end of this document.)

Expt. 3: Effect of Judge Class: Qualitative Analysis
• Editors' top feedback:
  1. Ranking not as good (precision)
  2. Both equally good (precision)
  3. Relevant sites missing (recall)
  4. Perfect site missing (recall)
• Panelists' top feedback:
  1. Both equally good (precision)
  2. Too general (precision)
  3. Both equally bad (precision)
  4. Ranking not as good (precision)
• Panelists need to see the other search engine's results to penalize poor recall

Related Work
• Mizzaro
  - Framework with 3 key dimensions of evaluation: information needs (expression level), information resources (T.A.U., documents, ...), information context
  - Found a high level of disagreement among judges
• Amento et al.
  - Correlation analysis
  - Expert judges
  - Automated metrics: in-degree, PageRank, page size, ...

Contributions
• A framework of test types for human relevance evaluation
• Set-level to item-level correlation:
  - Middling: the two granularities measure different aspects of relevance
  - Set-level judgments capture aspects missed by per-item judgments: poor ranking, duplicates, missed senses
• Perceived relevance to "actual" (landing-page) relevance:
  - Higher correlation, perhaps because the domain was News
• Editorial judges versus panelists:
  - Panelists sit on the fence more
  - Panelists focus more on precision and need the other search engine's results to judge recall
  - Panelist methodology and reward structure are crucial
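The raw counts behind the Expt. 3 association test are not given in the slides, so the sketch below only reproduces the lift from the published conditional percentages and checks the reported χ² = 16.3 against the relevant critical value. It assumes the test was run on the full 3×3 editor-by-panelist cross-tab (4 degrees of freedom), which is consistent with, but not stated in, the slides.

```python
from scipy.stats import chi2

# Lift of panelists agreeing with editors, from the published column percentages:
# p(P = engine1) = 0.28 overall, but p(P = engine1 | E = engine1) = 0.35.
lift = 0.35 / 0.28
print(round(lift, 2))                    # -> 1.25, the reported modest 25% lift

# Significance check for the reported test of non-zero association.
# Assuming a chi-squared test on the full 3x3 cross-tab: df = (3 - 1) * (3 - 1) = 4.
chi2_stat = 16.3                          # value reported in the slides
critical_995 = chi2.ppf(0.995, df=4)      # ~14.86
print(chi2_stat > critical_995)           # -> True: significant at the 99.5% level
```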