Consensus Relevance with Topic and Worker Conditional Models

Paul N. Bennett, Microsoft Research
Joint with
Ece Kamar, Microsoft Research
Gabriella Kazai, Microsoft Research Cambridge
Motivation for Consensus Task
• Recover the actual relevance of a topic-document pair
based on noisy predictions from multiple labelers.
• Obtain a more reliable signal from the crowd and/or
benefit from scale (expert quality from inexperienced
assessors).
• A variety of approaches has been proposed in the literature and in
competition:
– Supervised: Classification models.
– Semi-supervised: EM-style algorithms.
– Unsupervised: majority vote.
Common Axes of Generalization
Two axes of generalization: whether a topic was observed in training,
and whether a document's relevance was observed in training.
• Topic observed, relevance not observed: compute consensus for "new
documents" on known topics.
• Topic not observed, relevance observed (on other topics): compute
consensus on new topics for documents with known relevance on other
topics.
• Neither observed: use rules or observed worker accuracies on other
topics/documents to compute consensus on new topics and documents.
Note the hidden axis of observed workers.
Our Approach
• Supervised
– Given gold-truth judgments on a topic set and worker responses, learn
a consensus model that generalizes to new documents on the same topic set.
– Must be able to generalize to new workers.
• Want a well-founded probabilistic method
– Need to handle major sources of worker error.
• Worker skill/accuracy.
• Topic difficulty.
– Needs to handle correlation in labels.
• Correlation expected because of underlying label.
• Note: will use “assessor” for ground truth labeler and “worker” for
noisy labelers.
Basic Model
The probability of relevance should depend on the
document, worker response vector, and topic:
P(R_{i,j} \mid t_i, d_j, w_1, \ldots, w_n),  where each w_k is elicited for the pair (i, j),

and where t_i \in T is a particular topic, d_j \in D is a particular
document, R_{i,j} \in \{0,1\} is the event that d_j is relevant to topic
t_i, and w_k \in \{0,1\} is the response of worker k on the (i, j)-th
pair. Abbreviate the elicited response vector as w_{i:j}.
Exchangeability-Related Assumptions
• Given two identical sets of voting history, we
assume two workers have the same response
distribution.
• Whether or not a worker’s opinion is elicited is
not informative.
• The ordering of responses/elicitation is not
informative.
Relevance Conditional Independence
• Assume conditional independence of worker response
given document relevance.
– Implies workers have comparable accuracies across tasks.
• Assume a single, topic-independent prior on relevance:
P(r_{i,j} \mid t_i, d_j, w_{i:j}) \propto P(r_{i,j}) \prod_{w \in w_{i:j}} P(w \mid r_{i,j})
• Referred to as naïve Bayes. The prior is the probability of relevance
across all topics; each likelihood term is the probability of a random
worker's response given relevance (across all topics).
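A minimal sketch of this consensus rule in Python; the data layout,
names, and the Laplace smoothing constant are illustrative assumptions,
not details given in the slides:

```python
from collections import Counter

def fit_naive_bayes(gold, responses, alpha=1.0):
    """Estimate P(r) and P(w | r) from gold labels and worker votes.

    gold: dict mapping (topic, doc) -> true relevance in {0, 1}.
    responses: dict mapping (topic, doc) -> list of worker votes in {0, 1}.
    alpha: Laplace smoothing constant (an assumption; smoothing is not
    specified in the slides).
    """
    prior = Counter()                     # counts of relevant/non-relevant pairs
    votes = {0: Counter(), 1: Counter()}  # vote counts conditioned on relevance
    for pair, r in gold.items():
        prior[r] += 1
        for w in responses.get(pair, []):
            votes[r][w] += 1
    n = sum(prior.values())
    p_r = {r: (prior[r] + alpha) / (n + 2 * alpha) for r in (0, 1)}
    p_w_given_r = {r: {w: (votes[r][w] + alpha) /
                          (sum(votes[r].values()) + 2 * alpha)
                       for w in (0, 1)}
                   for r in (0, 1)}
    return p_r, p_w_given_r

def consensus(worker_votes, p_r, p_w_given_r):
    """P(relevant | votes) under relevance-conditional independence."""
    score = {r: p_r[r] for r in (0, 1)}
    for w in worker_votes:
        for r in (0, 1):
            score[r] *= p_w_given_r[r][w]
    return score[1] / (score[0] + score[1])
```

For example, consensus([1, 1, 0], p_r, p_w_given_r) returns the posterior
probability of relevance given two positive votes and one negative.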
Topic and Relevance
Conditional Independence
• Assume response conditionally independent
given topic and relevance.
– Implies workers have comparable accuracy within
a topic, but varying across topics.
• Assume topic dependent prior on relevance.
P(r_{i,j} \mid t_i, d_j, w_{i:j}) \propto P(r_{i,j} \mid t_i) \prod_{w \in w_{i:j}} P(w \mid r_{i,j}, t_i)
• Referred to as nB Topic. The prior is the probability of relevance for
this topic; each likelihood term is the probability of a random worker's
response given relevance for this topic.
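The nB Topic variant changes only the conditioning: the prior and vote
distributions are estimated per topic. A sketch reusing fit_naive_bayes
and consensus from above (the per-topic grouping is an illustrative
assumption; smoothing matters more here since each topic contributes far
less data):

```python
def fit_nb_topic(gold, responses, alpha=1.0):
    """Fit one naive Bayes model per topic: P(r | t) and P(w | r, t)."""
    per_topic = {}
    for (topic, doc), r in gold.items():
        g, rs = per_topic.setdefault(topic, ({}, {}))
        g[(topic, doc)] = r
        rs[(topic, doc)] = responses.get((topic, doc), [])
    return {t: fit_naive_bayes(g, rs, alpha)
            for t, (g, rs) in per_topic.items()}
```

Scoring a new pair then looks up the model for its topic, e.g.
p_r_t, p_w_t = models[topic] followed by consensus(votes, p_r_t, p_w_t).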
Worker and Relevance
Conditional Independence
• Each worker has a particular skill/accuracy in
making relevance judgments.
• This can be estimated by aggregating a history of
accuracy ℎ𝑘 across all tasks.
• Responses are independent conditional on
historical accuracy and relevance.
P(r_{i,j} \mid t_i, d_j, w_{i:j}) \propto P(r_{i,j}) \prod_{w_k \in w_{i:j}} P(w_k \mid r_{i,j}, h_k)
• Referred to as nB Worker. The prior is the probability of relevance
across all topics; each likelihood term is the probability of this
worker's response given relevance (across all topics).
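A sketch of the nB Worker variant. The slides only say the history h_k
aggregates accuracy across all tasks; here it is modeled as per-class
vote counts per worker (a two-sided generalization, which is an
assumption), and workers unseen in training fall back to an
uninformative factor so the model generalizes to new workers:

```python
from collections import Counter

def fit_nb_worker(gold, responses, alpha=1.0):
    """Estimate a global P(r) and per-worker P(w_k | r, h_k).

    responses: dict mapping (topic, doc) -> list of (worker_id, vote) pairs.
    """
    prior = Counter()
    hist = {}  # worker_id -> {r: Counter of votes given relevance r}
    for pair, r in gold.items():
        prior[r] += 1
        for k, w in responses.get(pair, []):
            hist.setdefault(k, {0: Counter(), 1: Counter()})[r][w] += 1
    n = sum(prior.values())
    p_r = {r: (prior[r] + alpha) / (n + 2 * alpha) for r in (0, 1)}
    p_w = {k: {r: {w: (c[r][w] + alpha) / (sum(c[r].values()) + 2 * alpha)
                   for w in (0, 1)}
               for r in (0, 1)}
           for k, c in hist.items()}
    return p_r, p_w

def consensus_worker(worker_votes, p_r, p_w):
    """worker_votes: list of (worker_id, vote); unseen workers contribute 0.5."""
    score = {r: p_r[r] for r in (0, 1)}
    for k, w in worker_votes:
        for r in (0, 1):
            score[r] *= p_w[k][r][w] if k in p_w else 0.5
    return score[1] / (score[0] + score[1])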
Evaluation
• Which Label
– Gold: evaluate using expert assessor’s label as truth.
– Consensus: evaluate using consensus of participants’ responses
as truth.
– Other Participant: evaluate using a particular participant’s
responses as truth.
• Methodology
– Used the development validation set as a test set to decide which
method to submit.
– Split the development training set 80/20 into train/validation by
topic-docID pair (i.e., for a given topic, all responses for a docID
were completely in or out of the validation set); a sketch of this
grouped split follows.
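A minimal sketch of that grouped split, assuming responses are keyed by
(topic, docID); the function name and fixed seed are illustrative:

```python
import random

def split_by_pair(response_keys, frac=0.8, seed=0):
    """80/20 split on (topic, docID) keys so that all responses for a
    given pair land entirely in train or entirely in validation."""
    keys = sorted(set(response_keys))
    random.Random(seed).shuffle(keys)
    cut = int(frac * len(keys))
    return set(keys[:cut]), set(keys[cut:])
```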
Development Set
Model          TruePos  TrueNeg  FalsePos  FalseNeg  Accuracy  DefaultAcc  Prec    Recall   Specificity
Majority Vote  101      8        17        19        75.2%     82.8%       85.6%   84.2%    32.0%
naive Bayes    120      0        25        0         82.8%     82.8%       82.8%   100.0%   0.0%
nB Topic       115      7        18        5         84.1%     82.8%       86.5%   95.8%    28.0%
nB Worker      117      1        24        3         81.4%     82.8%       83.0%   97.5%    4.0%
• The skew and scarcity of the development set made model selection
challenging.
• Chose nB Topic since it was the only method that outperformed the
baseline (predicting the most common class).
Results
Team           Soft Acc  Accuracy  Recall  Precision  Specificity  Log Loss  RMSE   Soft Acc Rank  Acc Rank
MSRC           69.3%     64.0%     79.0%   66.2%      59.6%        610.28    44.9%  3              6
uogTr          36.7%     44.1%     13.6%   25.3%      59.8%        931.74    58.8%  10             10
LingPipe       67.6%     66.2%     76.2%   65.0%      59.0%        975.88    49.7%  5              4
GeAnn          60.7%     57.7%     88.4%   56.9%      33.0%        1150.45   51.3%  7              8
UWaterlooMDS   69.4%     67.4%     80.2%   66.0%      58.6%        1435.79   50.1%  2              3
uc3m           69.9%     69.9%     75.4%   67.9%      64.4%        2772.38   54.9%  1              1
BUPT-WILDCAT   68.5%     68.5%     78.6%   65.4%      58.4%        2901.33   56.1%  4              2
TUD_DMIR       66.2%     66.2%     76.4%   63.5%      56.0%        3113.16   58.1%  6              5
UTaustin       60.4%     60.4%     90.8%   56.5%      30.0%        3647.36   62.9%  8              7
qirdcsuog      52.9%     52.9%     82.4%   51.8%      23.4%        4338.12   68.6%  9              9
• Methods that report probabilities did better on the probability measures
in almost all cases and almost always improved when applying a
decision-theoretic threshold.
• The outlier's log loss, contrasted with its accuracy after conversion to
hard decisions, implies its probabilities are poorly calibrated with
respect to the decision threshold but likely good overall; sketches of
the probability measures follow.
• Our method was best on the probability measures and near the top in
general.
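For reference, minimal sketches of the two probability measures above,
assuming the log loss column is an unnormalized sum over topic-document
pairs (the track's exact base, clipping, and soft-accuracy definition
are not given here):

```python
import math

def total_log_loss(truths, probs):
    """Summed binary log loss over pairs; truths in {0, 1}, probs in (0, 1)."""
    return -sum(math.log(p) if r else math.log(1.0 - p)
                for r, p in zip(truths, probs))

def rmse(truths, probs):
    """Root mean squared error between 0/1 truths and predicted probabilities."""
    return math.sqrt(sum((r - p) ** 2
                         for r, p in zip(truths, probs)) / len(probs))
```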
Conclusions
• A simple topic- and relevance-conditional independence model
produces
– Best performance on probability measures on gold set.
– Nearly best performance on accuracy.
• Topic-level effects explain the majority of the variability in
judgments (on this data and over the set of submissions).
• Future:
– Worker-relevance on test set
– Worker-topic-relevance conditional independence model
– Method performance versus best/median individual worker
(sufficient data to evaluate?)
Thoughts for Future
Crowdsourcing Tracks
• Is consensus independent of elicitation?
– Can consensus be studied independently of the design for
worker response collection?
– Probably okay if development and test sets are collected
with the same methodology.
• Collection-design factors likely worth analyzing:
– Number of gold-standard judgments in the "training set" on a topic
– Number of labels per worker
– Number of labels per item
– Number of worker responses on observed items
– Stability of the topic-conditional prior on relevance
Questions?