Optimistic Knowledge Gradient Policy for
Optimal Budget Allocation in Crowdsourcing
Xi Chen
Mentor: Denny Zhou
In collaboration with: Qihang Lin
Machine Learning Department
Carnegie Mellon University
Crowdsourcing Services
Building predictive models requires collecting reliable labels for training.
Crowdsourcing: outsourcing labeling tasks to a group of workers, who are usually not experts on these tasks.
Challenge: budget allocation in crowdsourcing
Labels from the crowd are very noisy, so repeated labeling is needed: aggregate the labels to infer the truth; more labels lead to higher confidence.
No free lunch: each label costs a certain amount of money!
Different workers have different reliability.
Goal: given a fixed budget, sequentially allocate it across item-worker pairs so that the overall labeling accuracy is maximized.
Key idea: estimate items' difficulty and workers' reliability and incorporate these estimates into the sequential allocation process.
Simplest setting: binary labeling by homogeneous workers.
Binary Labeling by Homogeneous Workers
𝐾 items: 𝑖 ∈ {1, … , 𝐾}
Soft label 𝜃𝑖 = Pr(𝑍𝑖 = 1) ∈ [0,1]: unknown
Easy items: 𝜃𝑖 → 0 or 𝜃𝑖 → 1. Difficult items: 𝜃𝑖 → 0.5
Positive set: 𝐻∗ = {𝑖: 𝜃𝑖 ≥ 0.5}
Example: identify whether each individual is an adult or not
Binary Labeling by Homogeneous Workers
Homogeneous workers: each label for item 𝑖 is drawn from Bernoulli(𝜃𝑖).
Coin-tossing analogy [CrowdSynth: Kamar et al. 12]:
There are 𝐾 different biased coins, one per item, with unknown head probabilities 𝜃𝑖.
Positive/head set: the coins with 𝜃𝑖 ≥ 0.5.
Labeling procedure: acquiring a label for item 𝑖 = tossing coin 𝑖.
Total budget 𝑇: the number of labels that we can acquire.
Challenge: how to dynamically allocate the budget over the 𝐾 items so that the overall accuracy is maximized?
Binary Labeling by Homogeneous Workers
Dynamic budget allocation:
Step 1: dynamically acquire labels from the crowd for different items.
Step 2: when the budget 𝑇 is exhausted, infer the positive set 𝐻𝑇 from the collected labels
(homogeneous workers: majority vote; heterogeneous workers: the inference also involves the workers' reliability).
Goal: maximize the expected accuracy, i.e., the expected fraction of items on which 𝐻𝑇 agrees with 𝐻∗ (a small simulation of this setup follows below).
Theoretical tools: Bayesian Markov Decision Processes, Bayesian statistical decision theory, Bayesian sequential optimization, Bayesian reinforcement learning.
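To make the setup concrete, here is a minimal simulation sketch (illustrative code, not from the talk; all variable names are mine) of the homogeneous-worker model with the simplest baseline: uniform allocation followed by majority vote.

```python
import numpy as np

rng = np.random.default_rng(0)

K, T = 50, 500                      # items and total label budget
theta = rng.beta(1.0, 1.0, size=K)  # unknown soft labels theta_i
H_star = theta >= 0.5               # true positive set

# Uniform allocation: spend the budget evenly (round-robin) over the items.
pos = np.zeros(K)                   # counts of +1 labels per item
neg = np.zeros(K)                   # counts of -1 labels per item
for t in range(T):
    i = t % K                                   # next item under uniform allocation
    label = rng.random() < theta[i]             # Bernoulli(theta_i) label
    pos[i] += label
    neg[i] += 1 - label

H_T = pos >= neg                    # majority-vote inference of the positive set
accuracy = np.mean(H_T == H_star)   # fraction of items classified correctly
print(f"accuracy of uniform allocation + majority vote: {accuracy:.3f}")
```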
Roadmap
Bayesian Markov Decision Process &
Optimal Allocation Policy via Dynamic Programming
Approximate Policy: Optimistic Knowledge Gradient
Modeling Workers’ Reliability
Extension: Incorporating Feature Information
Extension: Multi-Class Labeling
Bayesian Markov Decision Process
Beta prior: 𝜃𝑖 ∼ Beta(𝑎𝑖, 𝑏𝑖), the conjugate prior of the Bernoulli distribution.
𝑎𝑖: (pseudo-)count of 1s; 𝑏𝑖: (pseudo-)count of -1s.
At each stage:
Current state: the Beta parameters (𝑎𝑖, 𝑏𝑖) of all 𝐾 items.
Decision rule: a mapping from the current state to the next item to label; the decision rule is Markovian (it depends only on the current state).
Taking the observation: acquire a label 𝑦𝑖𝑡 ∈ {−1, 1} for the chosen item 𝑖𝑡.
Allocation policy: the sequence of decision rules, one per stage 𝑡 = 0, … , 𝑇 − 1.
Bayesian Markov Decision Process
State: the collection of Beta parameters of all items. Action: the item 𝑖𝑡 to label at stage 𝑡.
State transition: if 𝑦𝑖𝑡 = 1, then 𝑎𝑖𝑡 increases by 1; if 𝑦𝑖𝑡 = −1, then 𝑏𝑖𝑡 increases by 1.
Transition probability: Pr(𝑦𝑖𝑡 = 1) = 𝑎𝑖𝑡 / (𝑎𝑖𝑡 + 𝑏𝑖𝑡), the posterior mean of 𝜃𝑖𝑡.
Sample path: the sequence of states, actions, and observed labels; the filtration is generated by the observations collected up to stage 𝑡 (a small sketch of the update follows below).
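A minimal sketch (illustrative, not the authors' code) of the state and its transition: the state of an item is its pair of Beta counts, the probability that the next label is +1 equals the posterior mean, and the observed label increments one of the two counts.

```python
def transition_prob_positive(a, b):
    """Pr(next label = +1 | state) = posterior mean of theta under Beta(a, b)."""
    return a / (a + b)

def transition(a, b, label):
    """State transition: a +1 label increments a, a -1 label increments b."""
    return (a + 1, b) if label == 1 else (a, b + 1)

# Example: start from the uniform prior Beta(1, 1) and observe labels +1, +1, -1.
state = (1.0, 1.0)
for y in (1, 1, -1):
    print(f"state {state}, Pr(+1) = {transition_prob_positive(*state):.3f}")
    state = transition(*state, y)
print("final state:", state)
```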
Final Inference on Positive Set
When the budget 𝑇 is exhausted, make an inference about the positive set 𝐻𝑇 based on the collected labels.
Bayesian decision rule: 𝐻𝑇 = {𝑖 : Pr(𝜃𝑖 ≥ 0.5 | 𝑎𝑖𝑇, 𝑏𝑖𝑇) ≥ 0.5}.
𝑎𝑖𝑇: count of observed 1s plus the prior count 𝑎𝑖0; 𝑏𝑖𝑇: count of observed -1s plus the prior count 𝑏𝑖0.
With a symmetric prior (𝑎𝑖0 = 𝑏𝑖0), the rule reduces to majority vote: 𝐻𝑇 = {𝑖 : 𝑎𝑖𝑇 ≥ 𝑏𝑖𝑇} (a short sketch follows below).
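A small sketch of the Bayesian decision rule, assuming scipy is available; it also illustrates the majority-vote equivalence under a symmetric prior.

```python
from scipy.stats import beta

def in_positive_set(a_T, b_T):
    """Bayesian decision rule: include item i iff Pr(theta_i >= 0.5 | a_T, b_T) >= 0.5."""
    return beta.sf(0.5, a_T, b_T) >= 0.5

# With a symmetric prior (a0 = b0), the decision agrees with majority vote a_T >= b_T.
for (a_T, b_T) in [(5, 3), (3, 5), (2, 6)]:
    print(a_T, b_T, in_positive_set(a_T, b_T), a_T >= b_T)
```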
Expected Accuracy Maximization
Value function: the expected final accuracy as a function of the initial state, optimized over allocation policies.
Optimization: Markov Decision Process and Dynamic Programming.
Stage-wise reward: none in this formulation; the final accuracy is the only reward, received when the budget is exhausted.
Stage-wise Reward
Value function: rewritten via a telescoping expansion [Ng et al. 99; Xie et al. 11], so that the final accuracy equals a constant plus a sum of stage-wise rewards, one per acquired label.
Expected stage-wise reward of labeling item 𝑖 in state (𝑎𝑖, 𝑏𝑖):
R(𝑎𝑖, 𝑏𝑖) = 𝑎𝑖/(𝑎𝑖+𝑏𝑖) · R1(𝑎𝑖, 𝑏𝑖) + 𝑏𝑖/(𝑎𝑖+𝑏𝑖) · R2(𝑎𝑖, 𝑏𝑖),
where R1 and R2 are the changes in the probability of classifying item 𝑖 correctly when the next label is 1 or -1, respectively (see the sketch below).
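The sketch below (assuming scipy; the helper names h, R1, R2 are mine, chosen to match the reconstruction above) computes the quantities behind the stage-wise reward: h(a, b) is the probability of classifying an item correctly under its current posterior, and R1, R2 are the changes in h if the next label turns out to be +1 or -1.

```python
from scipy.stats import beta

def h(a, b):
    """Probability of a correct decision for an item in state Beta(a, b):
    max(Pr(theta >= 0.5), Pr(theta < 0.5))."""
    p = beta.sf(0.5, a, b)          # Pr(theta >= 0.5 | a, b)
    return max(p, 1.0 - p)

def R1(a, b):
    """Change in h if the next label is +1."""
    return h(a + 1, b) - h(a, b)

def R2(a, b):
    """Change in h if the next label is -1."""
    return h(a, b + 1) - h(a, b)

def expected_reward(a, b):
    """Expected stage-wise reward of labeling this item once."""
    p1 = a / (a + b)                # Pr(next label = +1)
    return p1 * R1(a, b) + (1 - p1) * R2(a, b)

print(expected_reward(1, 1), expected_reward(10, 1), expected_reward(5, 5))
```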
Markov Decision Process
State: the Beta parameters of all items.
Stage-wise reward: R(𝑎𝑖, 𝑏𝑖) as defined above.
Value function: the expected total stage-wise reward, equal (up to a constant) to the expected final accuracy.
Together these define a finite-horizon Markov Decision Process.
Optimal Policy via Dynamic Programming
Finite-horizon Markov Decision Process: can be solved exactly by Dynamic Programming (a.k.a. backward induction).
Curse of dimensionality: the number of joint states grows exponentially with the number of items 𝐾 and the budget 𝑇.
Approximate policies are needed (a brute-force sketch for a tiny instance follows below).
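To illustrate both backward induction and why it does not scale, here is a brute-force dynamic program over the joint state for a tiny instance (a sketch under the stage-wise reward reconstructed above; feasible only for very small 𝐾 and 𝑇).

```python
from functools import lru_cache
from scipy.stats import beta

def h(a, b):
    """Probability of a correct decision for an item in state Beta(a, b)."""
    p = beta.sf(0.5, a, b)               # Pr(theta >= 0.5 | a, b)
    return max(p, 1.0 - p)

@lru_cache(maxsize=None)
def V(state, t):
    """Optimal expected future reward with t labels left; state = ((a_1, b_1), ...)."""
    if t == 0:
        return 0.0
    best = float("-inf")
    for i, (a, b) in enumerate(state):
        p1 = a / (a + b)                  # predictive Pr(next label = +1)
        up = state[:i] + ((a + 1, b),) + state[i + 1:]
        down = state[:i] + ((a, b + 1),) + state[i + 1:]
        value = (p1 * (h(a + 1, b) - h(a, b) + V(up, t - 1)) +
                 (1 - p1) * (h(a, b + 1) - h(a, b) + V(down, t - 1)))
        best = max(best, value)
    return best

# Tiny instance: K = 2 items, uniform priors, budget T = 4.
# The number of reachable joint states explodes as K and T grow.
print(V(((1, 1), (1, 1)), 4))
```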
Roadmap
Bayesian Markov Decision Process &
Optimal Allocation Policy via Dynamic Programming
Approximate Policy: Optimistic Knowledge Gradient
Modeling Workers’ Reliability
Extension: Incorporating Feature Information
Extension: Multi-Class Labeling
Approximate Policies
Uniform sampling: spread the budget evenly over the items, ignoring the labels collected so far.
Finite-horizon Gittins index rule [J. C. Gittins, 79]:
Decomposes the joint state space into a per-item state space of size 𝑂(𝑇²).
Infinite-horizon, discounted reward: the Gittins index rule is optimal.
Finite-horizon, non-discounted reward: it is a suboptimal policy.
The index can be computed exactly or approximately [Nino-Mora, 11], with different time and space costs.
Knowledge Gradient
Knowledge Gradient (KG) [Powell, 07]: label the item with the largest expected stage-wise reward.
Myopic/single-step look-ahead policy: optimal if only one label remains to be acquired.
Deterministic KG: break ties by choosing the smallest index.
Randomized KG: break ties at random.
Optimistic Knowledge Gradient
Optimistic Knowledge Gradient: label the item with the largest optimistic reward R+(𝑎, 𝑏) = max(R1(𝑎, 𝑏), R2(𝑎, 𝑏)); this policy is consistent.
Proof sketch:
R+(𝑎, 𝑏) > 0 for every state, and R+(𝑎, 𝑏) → 0 as 𝑎 + 𝑏 → ∞;
hence as 𝑇 → ∞, each item is labeled infinitely many times;
by the strong law of large numbers, 𝐻𝑇 = 𝐻∗ a.s.
Pessimistic Knowledge Gradient: labeling the item with the largest pessimistic reward min(R1, R2) is an inconsistent policy (see the sketch below).
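A compact sketch (illustrative, self-contained code with my own function names) of the three selection rules discussed here: KG picks the largest expected reward, OKG the largest optimistic reward max(R1, R2), and the pessimistic variant the largest min(R1, R2).

```python
import numpy as np
from scipy.stats import beta

def h(a, b):
    p = beta.sf(0.5, a, b)            # Pr(theta >= 0.5 | a, b)
    return max(p, 1.0 - p)

def rewards(a, b):
    r1 = h(a + 1, b) - h(a, b)        # reward if the next label is +1
    r2 = h(a, b + 1) - h(a, b)        # reward if the next label is -1
    return r1, r2

def select(states, rule="okg"):
    """states: list of (a_i, b_i); returns the index of the item to label next."""
    scores = []
    for a, b in states:
        r1, r2 = rewards(a, b)
        p1 = a / (a + b)
        if rule == "kg":              # Knowledge Gradient: expected reward
            scores.append(p1 * r1 + (1 - p1) * r2)
        elif rule == "okg":           # Optimistic KG: best-case reward
            scores.append(max(r1, r2))
        else:                         # Pessimistic KG: worst-case reward (inconsistent)
            scores.append(min(r1, r2))
    return int(np.argmax(scores))     # ties broken by the smallest index

states = [(1, 1), (6, 2), (10, 10)]
print(select(states, "kg"), select(states, "okg"), select(states, "pkg"))
```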
Conditional Value-at-Risk
Conditional Value-at-Risk (CVaR) [Rockafellar and Uryasev, 02]:
Value-at-Risk VaR𝛼(𝑅): the 𝛼-upper quantile of the reward 𝑅.
CVaR𝛼(𝑅): the expected reward conditional on exceeding VaR𝛼(𝑅).
The CVaR criterion spans a spectrum from the Knowledge Gradient (expected reward, 𝛼 = 1) to the Optimistic Knowledge Gradient (maximum reward, 𝛼 → 0).
Selecting by CVaR gives a consistent policy for any 𝛼 < 1 (a small sketch of the two-point case follows below).
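For the two-valued stage-wise reward used here (R1 with probability p1, R2 otherwise), CVaR has a simple closed form in the discrete case. The sketch below is my own rendering of that formula, not the talk's; it shows how 𝛼 interpolates between the expected reward (KG, 𝛼 = 1) and the maximum reward (OKG, 𝛼 → 0).

```python
def cvar_two_point(r1, r2, p1, alpha):
    """CVaR_alpha (average reward over the upper alpha-tail) of a reward that
    equals r1 with probability p1 and r2 with probability 1 - p1."""
    (r_hi, p_hi), (r_lo, _) = sorted([(r1, p1), (r2, 1 - p1)], reverse=True)
    if alpha <= p_hi:                 # the whole upper tail is the better outcome
        return r_hi
    return (p_hi * r_hi + (alpha - p_hi) * r_lo) / alpha

r1, r2, p1 = 0.10, -0.02, 0.6
print(cvar_two_point(r1, r2, p1, alpha=1.0))    # = expected reward (KG criterion)
print(cvar_two_point(r1, r2, p1, alpha=0.01))   # -> maximum reward (OKG criterion)
print(cvar_two_point(r1, r2, p1, alpha=0.8))    # in between
```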
Experiments
Simulated Data
𝐾 = 50, 𝜃𝑖 ∼ Beta(1,1)
𝑇 = 2𝐾, 3𝐾, … , 10𝐾
Recognizing Textual Entailment Data
(Snow et al., EMNLP'08)
𝐾 = 800, 𝜃𝑖 ∼ Beta(1,1)
𝑇 = 2𝐾, 3𝐾, … , 10𝐾
Roadmap
Bayesian Markov Decision Process &
Optimal Allocation Policy via Dynamic Programming
Approximate Policy: Optimistic Knowledge Gradient
Modeling Workers’ Reliability
Extension: Incorporating Feature Information
Extension: Multi-Class Labeling
Heterogeneous Workers: modeling reliability
Labeling matrix: 𝑍𝑖𝑗 ∈ {−1, 1}, the label worker 𝑗 gives to item 𝑖.
Heterogeneous workers: model each worker's reliability to facilitate estimation of the true labels, and assign more items to reliable workers.
𝑁 items, 1 ≤ 𝑖 ≤ 𝑁: 𝜃𝑖 = Pr(𝑍𝑖 = 1), 𝜃𝑖 ∼ Beta(𝑎𝑖0, 𝑏𝑖0).
𝑀 workers, 1 ≤ 𝑗 ≤ 𝑀 [Dawid and Skene, 79]: reliability 𝜌𝑗 = Pr(𝑍𝑖𝑗 = 𝑍𝑖 | 𝑍𝑖), 𝜌𝑗 ∼ Beta(𝑐𝑗0, 𝑑𝑗0).
Action space: (𝑖, 𝑗) ∈ {1, … , 𝑁} × {1, … , 𝑀}.
Likelihood (law of total probability): Pr(𝑍𝑖𝑗 = 1 | 𝜃𝑖, 𝜌𝑗) = 𝜃𝑖𝜌𝑗 + (1 − 𝜃𝑖)(1 − 𝜌𝑗).
The homogeneous worker model corresponds to the special case of perfectly reliable workers (𝜌𝑗 = 1); a small sketch follows below.
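A minimal sketch (illustrative names) of the likelihood above: with independent Beta posteriors for the item's soft label and the worker's reliability, the marginal probability that worker 𝑗 reports +1 on item 𝑖 follows from the law of total probability and needs only the two posterior means.

```python
def prob_positive_label(a_i, b_i, c_j, d_j):
    """Pr(Z_ij = +1) when theta_i ~ Beta(a_i, b_i) and rho_j ~ Beta(c_j, d_j),
    using Pr(Z_ij = 1 | theta, rho) = theta*rho + (1 - theta)*(1 - rho) and
    the independence of theta_i and rho_j."""
    theta = a_i / (a_i + b_i)          # posterior mean of the soft label
    rho = c_j / (c_j + d_j)            # posterior mean of the reliability
    return theta * rho + (1 - theta) * (1 - rho)

# A reliable worker on a likely-positive item vs. an unreliable one.
print(prob_positive_label(8, 2, 9, 1))   # high
print(prob_positive_label(8, 2, 1, 9))   # low: this worker tends to flip labels
```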
Variational Approximation and Moment Matching
Prior (product of Betas): 𝜃𝑖 ∼ Beta(𝑎𝑖, 𝑏𝑖) and 𝜌𝑗 ∼ Beta(𝑐𝑗, 𝑑𝑗), independently.
Likelihood: Pr(𝑍𝑖𝑗 = 1 | 𝜃𝑖, 𝜌𝑗) = 𝜃𝑖𝜌𝑗 + (1 − 𝜃𝑖)(1 − 𝜌𝑗).
Posterior: after observing a label, the joint posterior of (𝜃𝑖, 𝜌𝑗) is no longer a product of Beta distributions.
Variational approximation: approximate the posterior by the product of its marginal distributions.
Moment matching: approximate each marginal by a Beta distribution with the same mean and variance (see the sketch below).
Two-coin model: the same machinery applies when each worker has separate reliabilities for positive and negative items.
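A small sketch of the moment-matching step (my own helper, not the authors' code): a distribution on [0,1] with mean m and variance v < m(1-m) is matched by Beta(a, b) with a = m·s and b = (1-m)·s, where s = m(1-m)/v - 1.

```python
import numpy as np

def beta_from_moments(mean, var):
    """Return (a, b) of the Beta distribution with the given mean and variance.
    Requires 0 < mean < 1 and 0 < var < mean * (1 - mean)."""
    s = mean * (1 - mean) / var - 1.0      # effective "sample size" a + b
    return mean * s, (1 - mean) * s

# Sanity check: recover the parameters from the moments of a Beta sample.
rng = np.random.default_rng(0)
samples = rng.beta(3.0, 7.0, size=200_000)
a, b = beta_from_moments(samples.mean(), samples.var())
print(a, b)                                # close to (3, 7)
```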
Optimistic Knowledge Gradient
For each item-worker pair, compute the reward of getting label 1 and the reward of getting label -1 under the approximate posterior, and select the pair with the larger of the two (the optimistic reward).
Experiments
Simulated Data
𝐾 = 50, 𝜃𝑖 ∼ Beta(1,1)
𝑀 = 10, 𝜌𝑗 ∼ Beta(4,1)
𝑇 = 2𝐾, 3𝐾, … , 10𝐾
Recognizing Textual Entailment Data
(Snow et al., EMNLP'08)
𝐾 = 800, 𝜃𝑖 ∼ Beta(1,1), 𝑀 = 164, 𝜌𝑗 ∼ Beta(4,1)
𝑇 = 2𝐾, 3𝐾, … , 10𝐾
Results compare the homogeneous (perfect) worker model against the heterogeneous worker model; on the RTE data, the proposed policy reaches the best accuracy (92.25%) with only 40% of the budget.
Roadmap
Bayesian Markov Decision Process &
Optimal Allocation Policy via Dynamic Programming
Approximate Policy: Optimistic Knowledge Gradient
Modeling Workers’ Reliability
Extension: Incorporating Feature Information
Extension: Multi-Class Labeling
Incorporate Feature Information
Each item 𝑖 has a feature vector x𝑖 ∈ R𝑝, and its soft label is modeled as 𝜃𝑖 = 𝜎(w⊤x𝑖).
Prior: a Gaussian prior on the weight vector w.
Posterior: non-Gaussian because of the logistic likelihood.
Laplace approximation: maintain a Gaussian approximate posterior N(𝜇, Σ).
Updated mean: the MAP estimate of w given the new label.
Updated covariance: each new label adds a rank-1 term to the precision matrix, so Σ is updated in O(𝑝²) via the Sherman-Morrison formula (see the sketch below).
Bottleneck: refreshing the approximation at every stage; a variational Bayesian logistic regression update [Jaakkola & Jordan, 00] is an alternative.
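A sketch of the rank-1 covariance update (illustrative; the exact curvature constant depends on the Laplace or variational approximation used): a new label for an item with feature x adds a term c·x·xᵀ to the precision matrix, and the Sherman-Morrison formula updates the covariance in O(p²) without re-inverting.

```python
import numpy as np

def sherman_morrison_update(Sigma, x, c):
    """Covariance after adding c * x x^T to the precision:
    (Sigma^{-1} + c x x^T)^{-1} = Sigma - c (Sigma x)(Sigma x)^T / (1 + c x^T Sigma x)."""
    Sx = Sigma @ x
    return Sigma - c * np.outer(Sx, Sx) / (1.0 + c * x @ Sx)

rng = np.random.default_rng(0)
p = 10
Sigma = 0.1 * np.eye(p)                      # current covariance of w
x = rng.normal(size=p)                       # feature vector of the labeled item
mu_x = 0.3                                   # sigma(mu^T x), an assumed value
c = mu_x * (1 - mu_x)                        # curvature of the logistic log-likelihood

fast = sherman_morrison_update(Sigma, x, c)  # O(p^2)
slow = np.linalg.inv(np.linalg.inv(Sigma) + c * np.outer(x, x))  # O(p^3) check
print(np.allclose(fast, slow))               # True
```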
Incorporate Feature Information
Simulated Data
𝐾 = 50, feature dimension 𝑝 = 10,
𝒘 ∼ 𝑁(0, 0.1·𝑰), 𝒙 ∼ 𝑁(0, Σ) with Σ𝑖𝑗 = 0.3^|𝑖−𝑗|
Roadmap
Bayesian Markov Decision Process &
Optimal Allocation Policy via Dynamic Programming
Approximate Policy: Optimistic Knowledge Gradient
Modeling Workers’ Reliability
Extension: Incorporating Feature Information
Extension: Multi-Class Labeling
Multi-Class Labeling
Classes: 𝑐 = 1, … , 𝐶.
Binary case: 𝜃𝑖 ∈ [0,1] is the underlying probability of being positive, with a Beta prior, and labels 𝑦𝑖𝑡 ∈ {−1, 1} ∼ Bernoulli(𝜃𝑖).
Multi-class case: 𝜽𝑖 = (𝜃𝑖1, … , 𝜃𝑖𝐶) with Σ𝑐 𝜃𝑖𝑐 = 1, where 𝜃𝑖𝑐 is the underlying probability of belonging to class 𝑐; 𝜽𝑖 has a Dirichlet prior, and labels 𝑦𝑖𝑡 ∈ {1, 2, … , 𝐶} ∼ Categorical(𝜽𝑖) (see the sketch below).
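A minimal sketch of the multi-class bookkeeping (illustrative): the Dirichlet prior is conjugate to the categorical likelihood, so a label of class c simply increments the c-th pseudo-count; the final decision here just picks the class with the largest posterior count.

```python
import numpy as np

C = 4
alpha = np.ones(C)                 # Dirichlet(1, ..., 1) prior for one item

# Observed labels for this item (classes are 1..C).
for y in [2, 2, 3, 2, 1]:
    alpha[y - 1] += 1              # conjugate update: increment the class count

posterior_mean = alpha / alpha.sum()
predicted_class = int(np.argmax(alpha)) + 1
print(posterior_mean, predicted_class)     # class 2 wins
```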
Real Experiment
Stanford Image Data (4 classes of dogs)
(Zhou et al., NIPS'12; labeled via Amazon MTurk)
𝐾 = 807, 𝜽𝑖 ∼ Dirichlet(1, … , 1)
𝑇 = 2𝐾, 3𝐾, … , 10𝐾
Bing Search Relevance Data (5 ratings)
𝐾 = 2653, 𝜽𝑖 ∼ Dirichlet(1, … , 1)
𝑇 = 2𝐾, 3𝐾, … , 6𝐾
Conclusions
A general MDP framework for budget allocation in crowdsourcing.
Optimistic Knowledge Gradient policy: approximate dynamic programming.
Future work:
Reducing computational cost (e.g., in the feature and multi-class settings).
Budget allocation in other crowdsourcing settings (e.g., rating).
Making the framework more practical: batch assignment (assign a set of items to a worker at each stage).
Applying the algorithms to real platforms in Bing.
Acknowledgement
Great summer in Redmond: May 1st ~ Oct 12th.
Thanks to the CLUES Group, the Machine Learning Department, the Theory Group, and the fellow interns.