Crowdsourcing
Markus Rokicki
L3S Research Center
09.05.2017
Human Computation
Early example: the “Automaton Chess Player”, better known as the “Mechanical Turk” [1], a supposed chess-playing machine that was in fact operated by a human hidden inside.
Crowdsourcing facilitates human computation through the web.
[1] Source: https://en.wikipedia.org/wiki/The_Turk
Example: Product Categorization
Figure: Product categorization task on Amazon Mechanical Turk [2]

[2] https://www.mturk.com
Example: Stereotype vs. Gender Appropriate [3]

[3] Tolga Bolukbasi et al. “Man is to computer programmer as woman is to homemaker? Debiasing word embeddings”. In: Advances in Neural Information Processing Systems. 2016, pp. 4349–4357.
Example: Damage Assessment during Disasters [4]

[4] Luke Barrington et al. “Crowdsourcing earthquake damage assessment using remote sensing imagery”. In: Annals of Geophysics 54.6 (2012).
Crowdsourcing definition
Crowdsourcing, according to a literature survey [5]:
- Participative online activity
- An individual, institution, non-profit, or company proposes the voluntary undertaking of a task
- In a flexible open call to a group of individuals
  - of varying knowledge, heterogeneity, and number
- Users receive satisfaction of a given type of need
  - economic, social recognition, self-esteem, learning, . . .
- Always entails mutual benefit

[5] Enrique Estelles-Arolas and Fernando Gonzalez-Ladron-de Guevara. “Towards an integrated crowdsourcing definition”. In: Journal of Information Science 38.2 (2012), pp. 189–200.
Crowdsourcing Platforms
Some crowdsourcing platforms:
- Amazon Mechanical Turk [6]
  - Paid “microtasks”
- CrowdFlower [7]
  - Paid “microtasks”
- Topcoder [8]
  - Programming competitions
- Threadless [9]
  - Propose and vote on t-shirt designs
- Kickstarter [10]
  - Fund products

[6] https://www.mturk.com
[7] https://www.crowdflower.com
[8] https://www.topcoder.com
[9] https://www.threadless.com
[10] https://www.kickstarter.com
Paid Microtask Crowdsourcing
Paid crowdsourcing scheme [11].

[11] Figure source: http://dx.doi.org/10.1155/2014/135641
Task design
Depending on the application:
- (How) should you break up the task?
- More or fewer answer options? Free-form answers?
- How to present the problem?

E.g.: Highlighting keywords in search results [12]:

[12] Omar Alonso and Ricardo Baeza-Yates. “Design and implementation of relevance assessments using crowdsourcing”. In: European Conference on Information Retrieval. Springer. 2011, pp. 153–164.
Task design issues: Cognitive Biases [15]
Anchoring effects
- “Humans start with a first approximation (anchor) and then make adjustments to that number based on additional information.” [13]
- Observed in crowdsourcing experiments [14]
  - Group A, Q1: More or less than 65 African countries in the UN?
  - Group B, Q1: More or less than 12 African countries in the UN?
  - Q2: How many countries in Africa?
    - Group A mean: 42.6
    - Group B mean: 18.5
Also: order effects and many more.

[13] Daniel Kahneman and Amos Tversky. “Subjective probability: A judgment of representativeness”. In: Cognitive Psychology 3.3 (1972), pp. 430–454.
[14] Gabriele Paolacci, Jesse Chandler, and Panagiotis G. Ipeirotis. “Running experiments on Amazon Mechanical Turk”. In: Judgment and Decision Making 5.5 (2010).
[15] Adapted from https://www.slideshare.net/ipeirotis/managing-crowdsourced-human-computation
Quality: Worker Access and Qualification
Crowdsourcing platforms offer means to restrict access to tasks based on:
- Trust levels based on overall past accuracy
- Geography
- Language skills
  - Certified based on language tests
  - Classified by the platform based on geolocation, browser data, and user history [16]
- Qualification for specific kinds of tasks gained through qualification/training tasks

[16] https://www.crowdflower.com/crowdflower-now-offering-twelve-language-skill-groups
Qualification Tasks
Qualification tests also provide feedback about the judgments expected from workers and thus influence quality [17].

Figure: Accuracy depending on class imbalance in training questions (classes: ‘Matching’, ‘Not Matching’).

[17] John Le et al. “Ensuring quality in crowdsourced search relevance evaluation: The effects of training question distribution”. In: SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation. 2010, pp. 21–26.
Measuring Worker Accuracy: Gold Standard
We would like to know how reliable the answers to a task are:
- Requesters may want to reject the work of unreliable workers
  - In particular that of ‘spammers’ who do not try to solve the task

Gold standard / honey-pot tasks:
- Add tasks for which the ground truth is already known at random positions
- Compare worker input with the ground truth (see the sketch below)
- Feedback on correctness is given to the workers
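Aside (not from the original slides): a minimal sketch of how per-worker accuracy on gold tasks could be computed; the data layout, field names, and the 0.6 cut-off are illustrative assumptions.

```python
from collections import defaultdict

def gold_accuracy(answers, gold):
    """answers: iterable of (worker_id, task_id, label); gold: dict task_id -> true label.
    Returns per-worker accuracy measured on gold (honey-pot) tasks only."""
    correct, total = defaultdict(int), defaultdict(int)
    for worker, task, label in answers:
        if task in gold:                          # only honey-pot tasks have a known truth
            total[worker] += 1
            correct[worker] += int(label == gold[task])
    return {w: correct[w] / total[w] for w in total}

# Example: workers below an (assumed) 60% gold accuracy might be reviewed or rejected.
acc = gold_accuracy(
    answers=[("w1", "t1", "cat"), ("w1", "t2", "dog"), ("w2", "t1", "dog")],
    gold={"t1": "cat", "t2": "dog"},
)
suspects = [w for w, a in acc.items() if a < 0.6]
```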
Using Gold Standard [18]
Challenges:
- Need enough gold standard tasks, or limit the number of tasks per worker
- Workers who recognize gold standard tasks will tend to ‘spam’
- Data composition:
  - Class balance
  - Educational examples addressing likely errors
- Need to be indistinguishable from regular tasks, but also need to be unambiguous

Standard workflow:
- Iterate ground truth creation:
  - Start with a small, hand-crafted ground truth
  - Use annotations to create additional gold standard data

[18] David Oleson et al. “Programmatic Gold: Targeted and Scalable Quality Assurance in Crowdsourcing”. In: Human Computation 11.11 (2011).
Redundancy: Wisdom of Crowds
So far: ensuring individual annotation quality. However, even best-effort human annotations are not perfect most of the time.

Source: https://www.domo.com/learn/the-wisdom-of-crowds
Redundant Annotations and Aggregation of Results
Redundancy:
- Each task is annotated by multiple workers
- Quality can be estimated based on inter-annotator agreement (e.g. Fleiss’ kappa)
  - If a certain level of agreement is not reached: obtain more annotations
- Aggregate answers by majority vote (in the categorical case; see the sketch below)
  - Improves accuracy if workers are better than random
- Introduces additional cost
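Aside (not from the original slides): a minimal majority-vote aggregation sketch; the data layout is assumed, and the agreement fraction reported here is only a crude stand-in for proper measures such as Fleiss’ kappa.

```python
from collections import Counter

def majority_vote(annotations):
    """annotations: dict task_id -> list of categorical labels from different workers.
    Returns dict task_id -> (winning label, fraction of workers agreeing with it)."""
    result = {}
    for task, labels in annotations.items():
        label, count = Counter(labels).most_common(1)[0]   # ties are broken arbitrarily
        result[task] = (label, count / len(labels))
    return result

# Example with redundancy 3; low agreement could trigger additional annotations.
votes = {"t1": ["cat", "cat", "dog"], "t2": ["dog", "dog", "dog"]}
aggregated = majority_vote(votes)   # e.g. {'t1': ('cat', 2/3), 't2': ('dog', 1.0)}
```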
Majority Vote Accuracy [20]
Majority vote accuracy depending on redundancy and individual accuracy, assuming equal accuracy p for all workers [19].

[19] Figure source: https://www.slideshare.net/ipeirotis/managing-crowdsourced-human-computation
[20] Ludmila I. Kuncheva et al. “Limits on the majority vote accuracy in classifier fusion”. In: Pattern Analysis & Applications 6.1 (2003), pp. 22–31.
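Aside (not from the original slides): under the usual simplifying assumptions (independent workers, binary labels, an odd number of annotators n, equal accuracy p), the majority-vote accuracy can be reproduced as a binomial tail sum in a few lines of Python.

```python
from math import comb

def majority_accuracy(p, n):
    """P(majority of n independent workers is correct), binary labels, odd n,
    each worker correct with probability p."""
    assert n % 2 == 1, "odd n avoids ties"
    k_min = n // 2 + 1                                     # smallest winning majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

# e.g. p = 0.7: one worker -> 0.70, three workers -> ~0.78, five workers -> ~0.84
print(majority_accuracy(0.7, 1), majority_accuracy(0.7, 3), majority_accuracy(0.7, 5))
```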
Estimating Worker Reliability [21]
In reality, worker accuracy varies. Solution: estimate worker reliability and take it into account when computing the labels.

Sketch of the approach:
- Treat each worker as a classifier characterized by
  - the probability of correctly classifying each class
- Estimate iteratively using expectation maximization (a compact sketch follows below):
  - E-step: estimate the hidden true labels of the data given the currently estimated worker reliability
  - M-step: estimate worker reliability given the currently estimated labels

[21] Alexander Philip Dawid and Allan M. Skene. “Maximum likelihood estimation of observer error-rates using the EM algorithm”. In: Applied Statistics (1979), pp. 20–28.
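Aside (not from the original slides): a compact sketch of Dawid & Skene-style EM with per-worker confusion matrices; the smoothing constant, the fixed iteration count, and the initialization from vote proportions are my own illustrative choices.

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50):
    """labels: list of (worker, task, label) with integer class labels.
    Returns (posterior over true labels per task, per-worker confusion matrices)."""
    workers = sorted({w for w, _, _ in labels})
    tasks = sorted({t for _, t, _ in labels})
    w_idx = {w: i for i, w in enumerate(workers)}
    t_idx = {t: i for i, t in enumerate(tasks)}
    W, T, K = len(workers), len(tasks), n_classes

    # Initialization: posterior over true labels from per-task vote proportions.
    post = np.zeros((T, K))
    for w, t, l in labels:
        post[t_idx[t], l] += 1
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and confusion matrices pi[worker, true class, observed label].
        prior = post.sum(axis=0) / T
        pi = np.full((W, K, K), 1e-6)                      # small smoothing avoids log(0)
        for w, t, l in labels:
            pi[w_idx[w], :, l] += post[t_idx[t]]
        pi /= pi.sum(axis=2, keepdims=True)

        # E-step: posterior over hidden true labels given priors and confusion matrices.
        log_post = np.tile(np.log(prior), (T, 1))
        for w, t, l in labels:
            log_post[t_idx[t]] += np.log(pi[w_idx[w], :, l])
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

    return post, pi

# Toy example: worker 2 disagrees on task "b"; EM down-weights unreliable workers.
obs = [(0, "a", 1), (1, "a", 1), (2, "a", 1),
       (0, "b", 0), (1, "b", 0), (2, "b", 1)]
posterior, confusion = dawid_skene(obs, n_classes=2)
```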
Incentives: Influence of Payment (1) [22]
Setting: Reorder a list of images taken from traffic cameras chronologically.
611 participants sorted 36,000 image sets of varying size, for varying payments.

[22] Winter Mason and Duncan J. Watts. “Financial incentives and the performance of crowds”. In: ACM SIGKDD Explorations Newsletter 11.2 (2010), pp. 100–108.
Findings
Figure: Accuracy (left) and number of completed tasks (right) in relation to payment.
Influence of Payment (2) [23]
Setting: Find planets orbiting distant stars.

[23] Andrew Mao et al. “Volunteering versus work for pay: Incentives and tradeoffs in crowdsourcing”. In: First AAAI Conference on Human Computation and Crowdsourcing. 2013.
Findings
Experiments with 356 workers and 17,000 annotated light curves. Simulated transits were added to real light curves (with noisy brightness values).
Figure: Accuracy of results depending on task difficulty.
Additional Incentives: Competitions and Teamwork [24]

Figure: Non-linear distribution of rewards over team ranks (ranks 1–10).

Rewards:
- Non-linear distribution among teams
- Individual share proportional to contribution (a toy payout sketch follows below)

Communication:
- Team chats

[24] Markus Rokicki, Sergej Zerr, and Stefan Siersdorfer. “Groupsourcing: Team competition designs for crowdsourcing”. In: Proceedings of the 24th International Conference on World Wide Web. ACM. 2015, pp. 906–915.
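Aside (not from the original slides): a toy sketch of such a payout scheme; the geometric decay over ranks, the budget, and all names are my own assumptions, since the cited paper compares several reward designs.

```python
def team_payouts(team_scores, contributions, budget=100.0, decay=0.5):
    """team_scores: dict team -> score; contributions: dict team -> dict worker -> score.
    Ranks teams by score, gives rank r a budget share proportional to decay**r
    (assumed non-linear schedule), and splits each team's reward by contribution."""
    ranked = sorted(team_scores, key=team_scores.get, reverse=True)
    weights = [decay ** r for r in range(len(ranked))]
    total_w = sum(weights)
    payouts = {}
    for rank, team in enumerate(ranked):
        team_reward = budget * weights[rank] / total_w
        team_total = sum(contributions[team].values())
        for worker, c in contributions[team].items():
            payouts[worker] = team_reward * c / team_total
    return payouts

# Example: the winning team earns the larger share, split by individual contribution.
pay = team_payouts(
    team_scores={"A": 40, "B": 25},
    contributions={"A": {"w1": 30, "w2": 10}, "B": {"w3": 25}},
)
```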
No Payment: Games With A Purpose [25]
Figure: The ESP Game
In the ESP Game, two randomly paired players label the same image independently and earn points when their labels match; the agreed labels become image annotations as a by-product.

[25] Luis von Ahn and Laura Dabbish. “Labeling images with a computer game”. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM. 2004, pp. 319–326.
Mechanical Turk Workers [26]
Surveys of 500–1,000 workers on Mechanical Turk.

[26] Joel Ross et al. “Who are the crowdworkers?: Shifting demographics in Mechanical Turk”. In: CHI ’10 Extended Abstracts on Human Factors in Computing Systems. ACM. 2010, pp. 2863–2872.
Selected papers
- Dynamics and task types on MTurk:
  Djellel Eddine Difallah et al. “The dynamics of micro-task crowdsourcing: The case of Amazon MTurk”. In: Proceedings of the 24th International Conference on World Wide Web. ACM. 2015, pp. 238–247.
- Postprocessing results for quality:
  Vikas C. Raykar et al. “Learning from crowds”. In: Journal of Machine Learning Research 11.Apr (2010), pp. 1297–1322.
- Influence of compensation and payment on quality:
  Gabriella Kazai. “In search of quality in crowdsourcing for search engine evaluation”. In: European Conference on Information Retrieval. Springer. 2011, pp. 165–176.
- Collaborative workflows:
  Michael S. Bernstein et al. “Soylent: A word processor with a crowd inside”. In: Proceedings of the 23rd Annual ACM Symposium on User Interface Software and Technology. 2010, pp. 313–322.