Crowdsourcing using Mechanical Turk: Quality Management and Scalability
Panos Ipeirotis – New York University
Jan 02, 2016

“A Computer Scientist in a Business School”
http://behind-the-enemy-lines.blogspot.com/
Email: [email protected]
Panos Ipeirotis – Introduction
New York University, Stern School of Business
Amazon Mechanical Turk: Paid Crowdsourcing

Example: Build an “Adult Web Site” Classifier
We need a large number of hand-labeled sites. Get people to look at sites and classify them as:
- G (general audience)
- PG (parental guidance)
- R (restricted)
- X (porn)

Cost/Speed Statistics
- Undergrad intern: 200 websites/hr, cost: $15/hr
- MTurk: 2,500 websites/hr, cost: $12/hr
Bad news: Spammers!
Worker ATAMRO447HWJQ labeled X (porn) sites as G (general audience).
Improve Data Quality through Repeated Labeling
Get multiple, redundant labels from multiple workers and pick the correct label by majority vote.
- Probability of correctness increases with the number of workers
- Probability of correctness increases with the quality of workers
- 1 worker: 70% correct
- 11 workers: 93% correct
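Under a simple model where each worker labels independently and is correct with probability p, the majority-vote accuracy above can be checked with a binomial calculation. This is only a sketch of the model behind the slide numbers; real workers are neither independent nor identically skilled.

```python
# Probability that a majority vote of n workers is correct, assuming
# each worker independently labels correctly with probability p.
from math import comb

def majority_correct(p: float, n: int) -> float:
    """P(more than half of n workers give the correct label); n odd, so no ties."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

print(f"1 worker  @ 70%: {majority_correct(0.70, 1):.1%}")
print(f"11 workers @ 70%: {majority_correct(0.70, 11):.1%}")
```

For 11 workers at 70% individual accuracy this yields roughly 92%, in line with the ~93% figure on the slide.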
Single-Vote Statistics
- MTurk: 2,500 websites/hr, cost: $12/hr
- Undergrad: 200 websites/hr, cost: $15/hr

11-Vote Statistics
- MTurk: 227 websites/hr, cost: $12/hr
- Undergrad: 200 websites/hr, cost: $15/hr
But Majority Voting is Expensive
Using redundant votes, we can infer worker quality.
Look at our spammer friend ATAMRO447HWJQ together with the other 9 workers: our “friend” mainly marked sites as G. Obviously a spammer.
We can compute error rates for each worker. Error rates for ATAMRO447HWJQ:
- P[X → X] = 9.847%   P[X → G] = 90.153%
- P[G → X] = 0.053%   P[G → G] = 99.947%
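One simple way to get such per-worker error rates is to score each worker against the majority vote of the *other* workers on the same items (the full treatment uses an EM-style estimation, as in the demo linked later). The sketch below uses made-up toy data and hypothetical worker ids:

```python
# Sketch: estimate a worker's error rates by comparing their labels to
# the majority vote of the other workers on the same items.
from collections import Counter, defaultdict

# labels[item] = {worker_id: label}; toy data, hypothetical worker ids.
labels = {
    "site1": {"w1": "X", "w2": "X", "spammer": "G"},
    "site2": {"w1": "G", "w2": "G", "spammer": "G"},
    "site3": {"w1": "X", "w2": "X", "spammer": "G"},
}

def error_rates(worker):
    counts = defaultdict(Counter)  # counts[true_label][reported_label]
    for item, votes in labels.items():
        others = [l for w, l in votes.items() if w != worker]
        true = Counter(others).most_common(1)[0][0]  # majority of the others
        counts[true][votes[worker]] += 1
    return {t: {g: n / sum(c.values()) for g, n in c.items()}
            for t, c in counts.items()}

print(error_rates("spammer"))  # P[X -> G] = 1.0 on this toy data
```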
Rejecting Spammers and Benefits
Random answers would give an error rate of 50%; the average error rate for ATAMRO447HWJQ is 45.2%:
- P[X → X] = 9.847%   P[X → G] = 90.153%
- P[G → X] = 0.053%   P[G → G] = 99.947%
Action: REJECT and BLOCK.

Results:
- Over time you block all spammers
- Spammers learn to avoid your HITs
- You can decrease redundancy, as the quality of the remaining workers is higher

After rejecting spammers, quality goes up:
- Spam keeps quality down; without spam, workers are of higher quality
- Less redundancy is needed for the same quality
- Same quality of results for lower cost
With spam:
- 1 worker: 70% correct
- 11 workers: 93% correct

Without spam:
- 1 worker: 80% correct
- 5 workers: 94% correct
Correcting Biases
Classifying sites as G, PG, R, X, workers are sometimes careful but biased. Worker ATLJIK76YH1TF classifies G → P and P → R, and her average error rate is too high. Is she a spammer?

Error Rates for the CEO of AdSafe:
P[G → G] = 20.0%   P[G → P] = 80.0%    P[G → R] = 0.0%     P[G → X] = 0.0%
P[P → G] = 0.0%    P[P → P] = 0.0%     P[P → R] = 100.0%   P[P → X] = 0.0%
P[R → G] = 0.0%    P[R → P] = 0.0%     P[R → R] = 100.0%   P[R → X] = 0.0%
P[X → G] = 0.0%    P[X → P] = 0.0%     P[X → R] = 0.0%     P[X → X] = 100.0%
Correcting Biases
For ATLJIK76YH1TF, we simply need to “reverse the errors” (technical details omitted) and separate error from bias. Her true error rate is ~9%.

Error Rates for Worker ATLJIK76YH1TF:
P[G → G] = 20.0%   P[G → P] = 80.0%    P[G → R] = 0.0%     P[G → X] = 0.0%
P[P → G] = 0.0%    P[P → P] = 0.0%     P[P → R] = 100.0%   P[P → X] = 0.0%
P[R → G] = 0.0%    P[R → P] = 0.0%     P[R → R] = 100.0%   P[R → X] = 0.0%
P[X → G] = 0.0%    P[X → P] = 0.0%     P[X → R] = 0.0%     P[X → X] = 100.0%
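One way to sketch the “reverse the errors” idea is Bayesian inversion: given a worker's confusion matrix P[true → reported] and a prior over classes, Bayes' rule recovers a posterior over the true label from the reported one. The uniform prior below is a made-up assumption for illustration, not part of the talk's method.

```python
# Sketch of "reversing the errors": invert a worker's confusion matrix
# with Bayes' rule to get P(true label | reported label).
CLASSES = ["G", "P", "R", "X"]
prior = {c: 0.25 for c in CLASSES}  # assumed uniform prior (illustrative)

# ATLJIK76YH1TF's confusion matrix from the slide: biased, not random.
confusion = {
    "G": {"G": 0.20, "P": 0.80, "R": 0.00, "X": 0.00},
    "P": {"G": 0.00, "P": 0.00, "R": 1.00, "X": 0.00},
    "R": {"G": 0.00, "P": 0.00, "R": 1.00, "X": 0.00},
    "X": {"G": 0.00, "P": 0.00, "R": 0.00, "X": 1.00},
}

def posterior(reported: str) -> dict:
    """P(true = t | worker reported `reported`), by Bayes' rule."""
    joint = {t: prior[t] * confusion[t][reported] for t in CLASSES}
    z = sum(joint.values())
    return {t: p / z for t, p in joint.items()}

# For this worker a reported "P" can only have come from a true "G",
# so the biased vote still carries full information.
print(posterior("P"))
```

A reported “R”, by contrast, is ambiguous between true “P” and true “R”, which is exactly where the residual ~9% error comes from.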
Too Much Theory?
Demo and open-source implementation available at http://qmturk.appspot.com
Input:
- Labels from Mechanical Turk
- Cost of incorrect labelings (e.g., X → G costlier than G → X)
Output:
- Corrected labels
- Worker error rates
- Ranking of workers according to their quality
Beta version, more improvements to come! Suggestions and collaborations welcome!
Scaling Crowdsourcing: Use Machine Learning
Human labor is expensive, even when paying cents, so we need to scale crowdsourcing.
Basic idea: build a machine learning model from the existing crowdsourced answers and use it instead of humans.

[Diagram: data from existing crowdsourced answers trains an automatic model (through machine learning); a new case goes to the model, which produces an automatic answer.]
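The pipeline in the diagram can be sketched end to end: the crowd's labels become training data, and a model then answers new cases automatically. The toy bag-of-words Naive Bayes below, with made-up examples, stands in for whatever real model and features one would actually use.

```python
# Minimal sketch: train a classifier on crowdsourced labels, then
# answer new cases automatically. Toy Naive Bayes over words.
from collections import Counter, defaultdict
from math import log

train = [  # (page text, crowd-assigned label) -- made-up examples
    ("family cooking recipes", "G"),
    ("kids games and puzzles", "G"),
    ("explicit adult content", "X"),
    ("adult xxx videos", "X"),
]

word_counts = defaultdict(Counter)  # per-label word frequencies
label_counts = Counter()
for text, label in train:
    label_counts[label] += 1
    word_counts[label].update(text.split())

def predict(text: str) -> str:
    vocab = {w for c in word_counts.values() for w in c}
    def score(label):  # log P(label) + sum of smoothed log P(word | label)
        total = sum(word_counts[label].values())
        return log(label_counts[label]) + sum(
            log((word_counts[label][w] + 1) / (total + len(vocab)))
            for w in text.split())
    return max(label_counts, key=score)

print(predict("adult videos"))  # -> "X"
```

A real deployment would use a proper learner and feature pipeline; the point is only that the (quality-corrected) crowd labels are the training data.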
Tradeoffs for Automatic Models: Effect of Noise
- Get more data → improve model accuracy
- Improve data quality → improve classification
Example case: porn or not?

[Figure: classifier accuracy (40–100%) vs. number of training examples (1–300, Mushroom dataset), one curve each for data quality 50%, 60%, 80%, and 100%.]
Scaling Crowdsourcing: Iterative Training
Use the machine when confident, humans otherwise. Retrain with the new human input → improve the model → reduce the need for humans.

[Diagram: a new case goes to the automatic model (through machine learning); if confident, it produces an automatic answer; if not confident, humans answer, and their answers join the crowdsourced training data.]
Scaling Crowdsourcing: Iterative Training, with Noise
Use the machine when confident, humans otherwise, and ask as many humans as necessary to ensure quality.

[Diagram: a new case goes to the automatic model (through machine learning); if confident for quality, it produces an automatic answer; if not confident for quality, humans answer, and their answers join the crowdsourced training data.]
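The confident-vs-human routing loop in the two slides above can be sketched as below. `model_confidence` and `ask_workers` are hypothetical stand-ins for the trained model and the MTurk round trip; the threshold is an illustrative choice.

```python
# Sketch of the routing loop: the model answers when its confidence
# clears a threshold; otherwise workers answer and their label is
# queued as fresh training data for the next retraining round.
TRAINING_DATA = []

def model_confidence(case):
    # Stand-in: a real system would return the classifier's
    # probability for its top label on this case.
    return (0.95, "G") if "recipes" in case else (0.55, "X")

def ask_workers(case, min_votes=3):
    # Stand-in for posting a HIT and aggregating redundant votes
    # (more votes when higher quality is required).
    return "X"

def answer(case, threshold=0.9):
    conf, label = model_confidence(case)
    if conf >= threshold:
        return label                           # confident: automatic answer
    human_label = ask_workers(case)            # not confident: ask humans
    TRAINING_DATA.append((case, human_label))  # retrain with this later
    return human_label

print(answer("cooking recipes"))  # handled by the model
print(answer("borderline site"))  # routed to workers
```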
Thank you! Questions?

“A Computer Scientist in a Business School”
http://behind-the-enemy-lines.blogspot.com/
Email: [email protected]