
T-61.6010 Autumn 2014: Algorithms and applications of human computation P (3 cr)
Bonus Project (+ 2 cr)
Does a better consensus algorithm improve results when labeling data for machine learning algorithms?
The problem of aggregating multiple labels from noisy workers into a single “correct solution”
has attracted a lot of attention in the context of crowdsourcing. As we have seen during the
seminar, a number of consensus algorithms have been proposed for this problem.
The purpose of this project is to investigate under what circumstances such algorithms can
beat majority voting when the task is to collect labeled data for supervised machine
learning. In particular, we want to compare the performance of the crowd with that of an expert.
Formally the problem is as follows: We want to learn a classifier m that predicts a label y'' for some feature vector x ∈ R^n. We consider two training data sets:
D_expert = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)},
D_crowd = {(x_1, y_1'), (x_2, y_2'), ..., (x_N, y_N')}.
Above y_i is the label provided by the expert for example i, and y_i' is an aggregated label that we have computed from the crowdsourced labels of example i. That is, y_i' = f(l_i1, l_i2, ..., l_iM_i), where f is some consensus algorithm, and l_ij is the j-th label obtained for task i from the crowd. Note that some of the more sophisticated consensus algorithms can use expert labels to train themselves.
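For concreteness, here is a minimal sketch (in Python with NumPy, one possible tool among many; the function name is just illustrative) of the baseline consensus f = majority voting for binary labels:

    import numpy as np

    def majority_vote(labels):
        """Baseline consensus f: the most common binary label among l_i1, ..., l_iM_i.

        `labels` holds the crowd labels (0 or 1) for a single example;
        ties are broken towards 1, which is an arbitrary choice.
        """
        labels = np.asarray(labels)
        return int(labels.mean() >= 0.5)

    # Three workers label example i as 1, 0, 1, so the aggregated label y_i' is 1:
    print(majority_vote([1, 0, 1]))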
Given D_expert and D_crowd we learn the classifiers m_expert and m_crowd using any suitable machine learning algorithm, e.g. a decision tree or an SVM. We can train different variants of m_crowd depending on the consensus algorithm f. Then we evaluate the classifiers using standard techniques (e.g. cross-validation or a simple train/test split), but always consider the expert labels as the ground truth. That is, when evaluating m_crowd(x_i) = y_i'', we compute the evaluation metric against the expert-provided label y_i, as these are assumed to be the real “ground truth”.
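To make this pipeline concrete, here is a minimal sketch in Python with scikit-learn (again only one possible tool); the data set choice and the randomly flipped placeholder crowd labels are assumptions purely for illustration, and in the actual project y_crowd would come from a consensus algorithm f applied to crowd labels:

    import numpy as np
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import f1_score

    # The data set's ground-truth labels play the role of the expert labels y_1, ..., y_N.
    X, y_expert = load_breast_cancer(return_X_y=True)

    # Placeholder for the aggregated crowd labels y_i': 20% of the expert labels
    # flipped at random, standing in for the output of some consensus algorithm f.
    rng = np.random.default_rng(0)
    flip = rng.random(len(y_expert)) < 0.2
    y_crowd = np.where(flip, 1 - y_expert, y_expert)

    # Same split for both classifiers so they are trained on the same examples.
    idx_train, idx_test = train_test_split(np.arange(len(y_expert)),
                                           test_size=0.3, random_state=0)

    m_expert = DecisionTreeClassifier(random_state=0).fit(X[idx_train], y_expert[idx_train])
    m_crowd = DecisionTreeClassifier(random_state=0).fit(X[idx_train], y_crowd[idx_train])

    # Both classifiers are scored against the EXPERT labels on the held-out data.
    for name, m in [("m_expert", m_expert), ("m_crowd", m_crowd)]:
        print(name, "F1 vs. expert labels:", f1_score(y_expert[idx_test], m.predict(X[idx_test])))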
The question is: Does it make sense to use some other consensus algorithm than majority voting?
Your task is as follows:
1. Obtain at least three suitable data sets, e.g. from the UCI machine learning repository (http://archive.ics.uci.edu/ml/). These come with features as well as ground truth labels, which we assume are the expert's output. You can also use other data sets. Anything is okay as long as it has some feature vectors and the ground truth labels. (If you happen to have such data ready from earlier courses, it is okay to use these.) To keep things simple, use only binary labels.
2. Devise some noise model that takes the ground truth labels and turns these into simulated labels from the crowd. That is, for every example i, use some simple randomisation to generate the labels l_i1, ..., l_iM_i. You can do this e.g. by assuming that every worker has some skill and every example is of some difficulty; the label from worker j for example i is then obtained by combining the skill and the difficulty, possibly with some randomness. Feel free to consider other models; this is up to you, as long as the model makes sense. You might also want to take a look at some of the papers that we have been discussing to get ideas. (A minimal sketch of one possible model is given after this list.)
3. Given the training data sets, train the expert classifier m_expert and at least two crowd classifiers: one where f = “majority voting”, and another where f = “some other algorithm”. Use at least one machine learning algorithm of your choice, e.g. a decision tree or an SVM. (See also the training sketch after this list.)
4. Evaluate the classifiers e.g. by using accuracy, precision, recall or the F1 measure. Use either cross-validation or a train/test split. Remember to always use the expert labels as the
ground truth! Please vary some parameter of your noise model. E.g. it should be fairly clear that majority voting fails when there are a lot of adversarial workers who always give the incorrect label. (The sketch after this list varies the fraction of such adversarial workers.)
5. Try to take the labeling costs into account. E.g. you can assume that obtaining one label from the expert costs C_expert euros, and obtaining a single label from the crowd costs C_crowd euros. You can vary the number of workers to see if spending more money helps in getting better labels. Be precise: don’t forget to include every label that you use at any part of the process when computing the cost. (A small cost bookkeeping example is given after this list.)
6. Prepare a deck of slides that present your results. The slides should be self-contained,
i.e., understandable without a presentation. Figures and tables are nice. Be clear and concise. Only report what you did, and what the results are. If you use a complex noise model,
be sure to describe that in sufficient detail.
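As referenced in step 2 above, here is a minimal sketch of one possible noise model (again Python; the skill and difficulty ranges and the adversarial-worker mechanism are assumptions, not a prescription):

    import numpy as np

    rng = np.random.default_rng(0)

    def simulate_crowd(y_expert, n_workers=5, adversarial_frac=0.0):
        """Turn expert labels into simulated crowd labels l_i1, ..., l_iM (M = n_workers).

        Worker j has a skill s_j drawn from [0.6, 0.95]; example i has a difficulty
        d_i drawn from [0.0, 0.3]. Worker j labels example i correctly with
        probability s_j - d_i (clipped); adversarial workers always flip the label.
        """
        y_expert = np.asarray(y_expert)
        n = len(y_expert)
        skill = rng.uniform(0.6, 0.95, size=n_workers)
        adversarial = rng.random(n_workers) < adversarial_frac
        difficulty = rng.uniform(0.0, 0.3, size=n)

        labels = np.empty((n, n_workers), dtype=int)
        for j in range(n_workers):
            if adversarial[j]:
                labels[:, j] = 1 - y_expert   # always gives the incorrect label
                continue
            p_correct = np.clip(skill[j] - difficulty, 0.05, 0.95)
            correct = rng.random(n) < p_correct
            labels[:, j] = np.where(correct, y_expert, 1 - y_expert)
        return labels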
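For steps 3 and 4, the training and evaluation pattern of the earlier sketch can be repeated while the fraction of adversarial workers is varied; majority voting is shown here, and another consensus algorithm f would simply replace majority_vote in the aggregation line (X, y_expert, idx_train, idx_test, majority_vote and simulate_crowd are assumed to come from the earlier sketches):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import f1_score

    for adv in [0.0, 0.2, 0.4, 0.6]:
        crowd_labels = simulate_crowd(y_expert, n_workers=5, adversarial_frac=adv)
        y_crowd = np.array([majority_vote(row) for row in crowd_labels])

        m_crowd = DecisionTreeClassifier(random_state=0).fit(X[idx_train], y_crowd[idx_train])
        f1 = f1_score(y_expert[idx_test], m_crowd.predict(X[idx_test]))
        print(f"adversarial_frac={adv:.1f}  F1 of m_crowd vs. expert labels: {f1:.3f}")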
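And for step 5, a small cost bookkeeping example; the euro amounts are made up, and the only point is that every label used anywhere in the pipeline is counted:

    def total_cost(n_expert_labels, n_crowd_labels, c_expert=1.00, c_crowd=0.05):
        """Total labeling cost in euros, counting every label used at any stage,
        including any expert labels that a consensus algorithm uses to train itself."""
        return n_expert_labels * c_expert + n_crowd_labels * c_crowd

    # 1000 examples: 5 crowd labels each vs. one expert label each (made-up prices):
    print(total_cost(0, 5 * 1000))   # crowd only: 250.0 euros
    print(total_cost(1000, 0))       # expert only: 1000.0 euros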
You do not have to implement any of the methods yourself. In particular, the machine learning algorithm is best taken from a library. Use e.g. Matlab, R or Weka, or any other system that you are familiar with. Also, it is not necessary to implement the consensus methods yourself. A number of Java implementations can be found at SQUARE (http://ir.ischool.utexas.edu/square/). They also provide links to other implementations. You can also try designing a consensus algorithm yourself. (If you do, please describe it thoroughly.)
You can do the project either alone or in a group of two.
This is a fairly open-ended task, and it is up to you to figure out the details and the precise
question you are studying. (This is a postgraduate-level course, after all.) As this is only 2 credits, I suggest keeping the project fairly simple. If some detail is missing from these
instructions, feel free to make your own decision regarding that.
Please send me an email by December 10th (to [email protected]) if you plan to do
this project. Mention in the email at least:
1. Members of the project team.
2. What data sets you will use and where you will get them from.
3. What consensus algorithm(s) you want to try out.
4. What machine learning algorithm you are considering.
5. An initial estimate of when you plan to submit the report.
There is no deadline once you have registered. You will get credits for the course as soon as you
submit the report and the other course requirements are satisfied. However, I encourage you to submit the report as soon as possible, ideally in January at the latest, before the 3rd period begins.