Crowdsourced Data Processing: Industry and Academic Perspectives
Adam Marcus and Aditya Parameswaran
A Tutorial in Three Parts
Part 0: A (Super Short) Survey of Parts 1 and 2, plus Background (Me)
Part 0.1: Background + Survey of Part 1
Part 0.2: Survey of Part 2
Part 1: A Survey of Crowd-Powered Data Processing in Academia (Me)
Part 2: A Survey of Crowd-Powered Data Processing in Industry (Adam)
3
Part 0.1 (background and survey of Part 1)
4
What is crowdsourcing?
• Our definition [Von Ahn]: Crowdsourcing is a paradigm that utilizes human processing power to solve problems that computers cannot yet solve.
• E.g., processing and understanding images, videos, and text (80% or more of all data, per a five-year-old IBM study).
5
Why is it important?
We’re on the cusp of an AI revolution [NYT, July ’16]:
– “a transformation many believe will have a payoff on the scale of … personal computing … or the internet”
AI requires large volumes of training data.
Our best hope of understanding images, videos, and text comes from humans.
6
How does one deploy crowdsourcing?
• Our focus: paid crowdsourcing
– Other ways: volunteer, gaming
– “paid” is broad: $$, pigs on your farm, MBs, bitcoin, …
• A typical paid platform:
– Requesters put jobs up, assign rewards
– Workers pick up and work on these jobs, get rewards
7
Our Focus: Data, Data, Data
How do we get crowds to process large volumes of data efficiently and effectively?
– Design of algorithms
– Design of systems
We call this “crowdsourced data processing”.
This is the primary concern of industry users.
8
Context: Other Work
Crowdsourced data processing depends on many other fields … (but they are not the focus of this tutorial).
9
Humans = Data Processors
Our abstraction: humans are data processors
• compare two items
• rate an item
• evaluate a predicate on an item
The human operator set is not fully known or understood!
10
Boolean Question
11
K-Ary Question
But: unlike computer processors, human operators trade off three dimensions:
• Cost: How much am I willing to spend?
• Latency: How long can I wait?
• Quality: What is my desired quality?
So algorithm development has to be done “ab initio”.
12
… Humans cost money, take time, and make mistakes
13
Illustration of Challenges: Sorting
Sort n animals on “dangerousness”
• Option 1: give them all to one human worker; could take very long, likely error-prone.
• Option 2: apply a sorting algorithm, with pairwise comparisons done by humans instead of automatically.
14
Illustration of Challenges: Sorting
• Option 2, but:
– Workers may make mistakes! So how do you know if you can trust a worker response?
– Cycles may form
– Should we get more worker answers for the same pair or for different pairs?
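One common response to untrustworthy answers is redundancy: ask several workers the same pair and take a majority vote. A minimal sketch under simple assumptions (a simulated worker who errs with a fixed probability; `noisy_compare` and `majority_compare` are hypothetical helpers, not from any particular paper):

```python
import random
from collections import Counter

def noisy_compare(a, b, error_prob=0.1):
    """Simulate one worker judging whether a is less 'dangerous' than b
    (here the ground truth is just numeric order). With probability
    error_prob the worker answers incorrectly."""
    truth = a < b
    return truth if random.random() > error_prob else not truth

def majority_compare(a, b, votes=5, error_prob=0.1):
    """Ask an odd number of workers about the same pair; majority vote wins."""
    answers = Counter(noisy_compare(a, b, error_prob) for _ in range(votes))
    return answers[True] > answers[False]
```

With 5 votes and a 10% per-worker error rate, the majority is wrong only when 3 or more of the 5 workers err, so redundancy trades extra cost for higher quality.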
Also: Interfaces
• Comparison: “more dangerous?”
• Rating: “how dangerous?”
15
16
Overall: Challenges
• Which questions do I ask of humans?
• Do I ask sequentially or in parallel?
• How much redundancy in questions?
• How do I combine answers?
• When do I stop?
17
In the longer part of this talk …
• A recipe for crowdsourced algorithm design
– What you need to take into account
– Plus a couple of examples
18
Next Part: Systems
• Wouldn’t it be nice if you could just “say” what you wanted gathered or processed, and have the system do it for you?
– Akin to database systems
– Database systems have a query language: SQL
• Here are some examples
Crowdsourced Data Processing Systems: get/process data

Query: Find the capitals of five Spanish-speaking countries

The system asks the crowd (gathering more data, then processing/filtering):
• Give me a Spanish-speaking country
• What language do they speak in country X?
• What is the capital of country X?
• Give me a valid <Country, Capital, Language> combination

Country | Capital  | Language
Peru    | Lima     | Spanish
Peru    | Lima     | Quechua
Brazil  | Brasília | Portuguese
…       | …        | …
19
Crowdsourced Data Processing Systems: one specific issue, inconsistencies

Query: Find the capitals of five Spanish-speaking countries
• What if some humans say Brazil is Spanish-speaking and others say Portuguese?
• What if some humans answer “Chile” and others “Chili”?

Country | Capital  | Language
Peru    | Lima     | Spanish
Peru    | Lima     | Quechua
Brazil  | Brasília | Portuguese
…       | …        | …
20
21
What are the challenges?
• What is the query language for expressing stuff like this?
• How is it optimized?
• How does it mesh with existing data?
• How does it deal with the latency of the crowd, etc.?
More on how different systems solve these challenges later on.
22
A Tutorial in Three Parts
Part 0: A (Super Short) Survey of Parts 1 and 2, plus Background (Me)
Part 0.1: Background + Survey of Part 1
Part 0.2: Survey of Part 2
Part 1: A Survey of Crowd-Powered Data Processing in Academia (Me)
Part 2: A Survey of Crowd-Powered Data Processing in Industry (Adam)
23
Part 0.2 (survey of Part 2)
24
The Industry Perspective
Circa 2013:
– HCOMP becomes a real conference: crowdsourcing is now an academic discipline
– Industry folks at HCOMP claiming:
• “Crowdsourcing is still a dark art…”
• “We use crowdsourcing at scale… but...”
• “Academics are not solving real problems…”
Problem: no one had really chronicled the use of crowdsourcing in industry.
25
What happened?
Adam and I spoke to 13 large-scale users of crowds + 4 marketplace vendors to identify:
– scale, use cases, status quo
– challenges, pain points
We tried to bridge the gap between industry and academia.
Crowdsourced Data Management: Industry and Academic Perspectives, Foundations and Trends in Databases series, 2015
26
Qualitative Study: Who did we talk to?
Examples: a team issuing a large # of categorization tasks/week; a team doing data extraction from images; a company’s go-to team for crowdsourcing.
27
Shocker I: Internal Platforms
Five of the largest companies we spoke to primarily use their own “internal” or “in-house” platforms:
– Workers typically hired via an outsourcing firm
– Working 9-to-5 on this company’s tasks
– May be due to:
• Fine-grained tracking, hiring, leaderboards
• Data of a sensitive nature
• Economies of scale
What we’re seeing is a drop in the bucket.
28
Shocker II: Scale
Most companies use crowdsourcing at scale:
• One reported 50+ employees just to manage their internal marketplace
• Another issues 0.5M tasks/week
• Another has an internal crowdsourcing user mailing list with hundreds of employees
Most large firms spend millions to tens of millions of dollars per year, and a comparable amount administering internal marketplaces.
29
Shocker II: Scale (Continued)
Why the scale?
– AI eating the world: where there’s a model, there’s a need for training data
– Moving target: need for fresh training data as the problem constantly evolves
– More data beats better models: models trained on more data are more general, less overfit, …
30
Shocker III: Academic work is not used (yet)!
• Quality assurance: almost all use majority vote; <50% use anything fancier.
– <25% use active learning!
• Workflows: most workflows are single-step
– “In my experience, if you need multiple steps of crowdsourcing, it’s almost always more productive to go back and do a bit more automation upfront.”
• Frameworks: no use of crowdsourced data processing systems, APIs, or frameworks
31
Other Findings
• Design is super hard
– Many iterations to get to the “right” task
– Some actively use A/B testing between task types
• Top-3 benefits of crowds:
– flexible scaling, low cost, enabled previously difficult tasks
– “It’s easier to justify money for crowds than another employee”
32
Other Findings: Use Cases
1. Categorization
2. Content Moderation
3. Entity Resolution
4. Relevance
5. Data Cleaning
6. Data Extraction
7. Text Generation
33
Major Takeaways
Shockers:
I. Understudied paradigm: “internal” marketplaces
II. Crowdsourcing @ scale: need to shout from the rooftops!
III. Academic stuff isn’t used much (yet)
Other Takeaways:
I. Academia is working on the (~) right problems!
II. Crowds admit flexibility in companies w/o politics
III. Design is super challenging!
34
What else?
• Sizes of teams, scale, throughput
• Recruiting, retention
• Use cases
• Quality assurance
• Task design and decomposition
• Prior approaches, benefits of crowdsourcing
• Incentivization
Lots of good stuff coming up in Adam’s Part 2!
35
A Tutorial in Three Parts
Part 0: A (Super Short) Survey of Parts 1 and 2, plus Background (Me)
Part 0.1: Background + Survey of Part 1
Part 0.2: Survey of Part 2
Part 1: A Survey of Crowd-Powered Data Management in Academia (Me)
Part 2: A Survey of Crowd-Powered Data Management in Industry (Adam)
36
Part 1
37
Data Processing Algorithms
Humans are data processors. How do we design algorithms using human operators?
• Cost: How much am I willing to spend?
• Latency: How long can I wait?
• Quality: What is my desired quality?
Stack diagram: Algorithms (basic ops: compare, filter) sit on top of Plumbing (interfaces, incentives, trust/reputation, spam, pricing), over Marketplace #1 … Marketplace #n.
38
39
Data Processing Algorithms
• Sorting, Max, Top-K
• Filtering, Rating, Finding
• Entity Resolution, Clustering, Joins
• Categorization
• Gathering, Extracting
• Counting
• …
50-odd papers in this space! @ VLDB, SIGMOD, ICDE, …
40
Algorithm Flow diagram: the algorithm sends items to the crowdsourcing marketplace and receives current answers back. Preparation involves explicit choices (I. Unit Operations, II. Cost Model, III. Objectives) and assumptions (A. Error Model, B. Latency Model).
41
Algorithm Design Recipe
• Explicit choices:
– Unit Operations
– Cost Model
– Objectives
• Assumptions:
– Error Model
– Latency Model
Illustrations:
• My paper “CrowdScreen: Algorithms for Filtering…”, SIGMOD ’12 (AKA Filtering): given a dataset of images, find those that don’t show inappropriate content.
• Adam’s paper “Human-Powered Sorts and Joins”, VLDB ’11 (AKA Sorting): given a dataset of animal images, sort them in increasing “dangerousness”.
42
Explicit Choice: Unit Operations
What sorts of input can we get from human workers?
• Simple vs. complex:
– Simpler operations are easier to analyze, easier to “aggregate” and assign correctness to.
– Complex operations help us get more fine-grained, open-ended data.
• Number of types:
– One type is simpler to analyze and aggregate than two.
Most work ends up picking a small number of simple operations.
Filtering: filter an item. Sorting: compare two items, or rate an item.
43
Explicit Choice: Cost Model
How do we set the reward for each unit operation?
Cost can depend on:
• Type of operation
• Type of item
• Number of items
Typical rule of thumb: time the operation, pay using minimum wage.
Simple assumption: same cost for each operation.
Filtering: c(filter an item) = constant. Sorting: c(compare two items) = c(rate an item).
44
Explicit Choice: Objectives
What do we optimize for?
We care about cost, latency, quality.
• Bound one (or two), optimize the others
Typically: bound on cost, maximize quality. Sometimes: bound on quality, minimize cost.
Filtering: bound on quality, minimize cost. Sorting: bound on cost, maximize quality.
45
Assumption: Error Model
How do we model human accuracies?
All models are wrong, but can still be useful.
• Simplest model: no errors!
– Similar: ask a fixed # of workers, then assume no error
– Same error probability per worker (Filtering)
• Each worker has a fixed error probability
• Each worker has an error probability dependent on the item
• No assumptions about error: just get something that works well (Sorting)
Opt for what can be analyzed; simple is good. This is a bit of an “art” and may require iterations.
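Under the fixed-per-worker-error model, the value of asking redundant workers can be computed in closed form: the majority of n workers is correct exactly when more than half of them answer correctly. A minimal sketch (the function name is ours):

```python
from math import comb

def majority_accuracy(n, beta):
    """Probability that the majority of n (odd) workers is correct, when
    each worker errs independently with fixed probability beta."""
    return sum(comb(n, k) * (1 - beta) ** k * beta ** (n - k)
               for k in range(n // 2 + 1, n + 1))

# With beta = 0.2, accuracy rises with redundancy:
# majority_accuracy(1, 0.2) = 0.8, majority_accuracy(5, 0.2) ≈ 0.942
```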
46
Placing it all together: Filtering
• Goal: filter a set of items on some property X; i.e., find all items that satisfy X
• Operation: ask a person “does this item satisfy the filter or not?”
• Cost model: all operations cost the same
• Objective: accuracy across all items is fixed (alpha, e.g., 95%); minimize cost
• Error model: people make mistakes with a fixed probability (beta, e.g., 5%)
Dataset of items → Boolean predicate (“Does this image show an animal?”) → filtered dataset

Our Visualization of Strategies: a grid counting Yes answers on one axis and No answers on the other (1–5 each); each grid point is marked decide PASS, continue, or decide FAIL. The strategy is a Markov Decision Process over this grid.
47
Evaluating Strategies
On the same grid (x = # Yes answers, y = # No answers; points marked decide PASS, continue, or decide FAIL):
Pr[reach (4, 2)] = Pr[reach (4, 1) & get a No] + Pr[reach (3, 2) & get a Yes]
Cost = Σ over terminal points (x, y) of (x + y) · Pr[reach (x, y)]
Error = Σ Pr[reach a FAIL point ∧ item satisfies the filter] + Σ Pr[reach a PASS point ∧ item does not]
48
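Given a concrete strategy on this grid, the reach recurrence lets us evaluate it by dynamic programming. A sketch under the fixed-error model (this is an illustration of strategy evaluation, not the paper's optimization algorithm); `p_yes` is the assumed probability that a worker answers Yes for the item in question:

```python
def evaluate_strategy(decision, m, p_yes):
    """Evaluate a filtering strategy on the (x Yes, y No) grid.
    decision maps (x, y) -> 'PASS' | 'FAIL' | 'CONT'; any state with
    x + y == m is forced to stop (defaulting to 'FAIL').
    Returns (expected_cost, prob_pass)."""
    reach = {(0, 0): 1.0}
    cost = 0.0
    prob_pass = 0.0
    for total in range(m + 1):        # process states in order of questions asked
        for x in range(total + 1):
            y = total - x
            p = reach.get((x, y), 0.0)
            if p == 0.0:
                continue
            d = decision.get((x, y), 'CONT')
            if total == m and d == 'CONT':   # budget exhausted: must decide
                d = 'FAIL'
            if d == 'PASS':
                prob_pass += p
            elif d == 'CONT':                # ask one more worker
                cost += p                    # one more question from this state
                reach[(x + 1, y)] = reach.get((x + 1, y), 0.0) + p * p_yes
                reach[(x, y + 1)] = reach.get((x, y + 1), 0.0) + p * (1 - p_yes)
    return cost, prob_pass
```

For example, majority-of-3 with early stopping is the strategy {(2, 0): 'PASS', (2, 1): 'PASS', (0, 2): 'FAIL', (1, 2): 'FAIL'} with m = 3; the paper's LP-based methods search over such strategies far more efficiently than brute-force enumeration.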
Naïve Approach
For each grid point, assign PASS, FAIL, or continue.
For all strategies:
• Evaluate cost & error
• Return the best
O(3^g) strategies, where g = grid size = O(m²)
This is obviously bad. The paper has probabilistic methods that identify optimal strategies (via linear programming).
49
50
Placing it all together: Sorting
• Goal: sort a set of items on some property X
• Operation: ask a person “is A better than B on property X?”, or “rate A on property X”
• Cost model: all operations cost the same
• Objective: total cost is fixed; maximize accuracy
• Error model: more ad hoc; no fixed assumption
Dataset of items → sort on predicate (“sort animals on dangerousness”) → sorted dataset
Placing it all together: Sorting
• Completely comparison-based
– Accuracy = 1 (completely accurate)
– O(#items²) operations
• Completely rating-based
– Accuracy ≈ 0.8 (accurate)
– O(#items) operations
51
Placing it all together: Sorting (hybrid)
• First, gather a bunch of ratings
• Order based on average ratings
• Then use comparisons, in one of three flavors:
– Random: pick S items, compare
– Confidence-based: pick the most confusing “window”, compare that first, repeat
– Sliding-window: for all windows, compare the best
52
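The hybrid idea can be sketched as follows (hypothetical helper names, shown with the sliding-window flavor): order by average crowd rating first, then spend the comparison budget refining one window at a time.

```python
def hybrid_sort(items, avg_rating, compare, window=3):
    """Order items by average crowd rating, then locally refine with
    pairwise comparisons over each sliding window of size `window`.
    `compare(a, b)` returns True if a should precede b (a crowd answer)."""
    ordered = sorted(items, key=avg_rating)          # cheap: rating tasks
    for start in range(len(ordered) - window + 1):   # costly: targeted comparisons
        chunk = ordered[start:start + window]
        # insertion sort within the window, using crowd comparisons
        for i in range(1, len(chunk)):
            j = i
            while j > 0 and compare(chunk[j], chunk[j - 1]):
                chunk[j], chunk[j - 1] = chunk[j - 1], chunk[j]
                j -= 1
        ordered[start:start + window] = chunk
    return ordered
```

Ratings get close to the right order with O(n) tasks; the comparisons then fix local mistakes without paying the full O(n²) cost.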
Chart: accuracy (0.8 to 1.0) vs. # tasks (0 to 80), comparing Compare and Rate.
53
Chart: accuracy (0.8 to 1.0) vs. # tasks (0 to 80), comparing Hybrid, Compare, and Rate.
54
Stack diagram: Systems sit on top of Algorithms (basic ops: compare, filter; complex ops: sort, cluster, clean; also: get data, verify), which sit on top of Plumbing (interfaces, incentives, trust/reputation, spam, pricing), over Marketplace #1 … Marketplace #n.
55
56
Data Processing Systems
Declarative crowdsourcing systems: Qurk (MIT), Deco (Stanford/UCSC), CrowdDB (Berkeley)
Treat crowds as just another “access method” for the database:
• Fetch data from disk, the web, …, the crowd
• Not just process data, but also gather data
57
There Are Other Systems… (in increasing declarativity)
• Toolkits (analogous to programming APIs)
– TurKit, AutoMan
– Crowds = “API calls”
– Little to no optimization
• Imperative systems (analogous to Pig or MapReduce)
– Jabberwocky, CrowdForge
– Crowds = “data processing units”
– Programmer-dictated flow, limited optimization within the units
• Declarative systems (analogous to relational databases)
– Deco, Qurk, CrowdDB
– Crowds = “data processing units”
– Programmer specifies the goal; optimized across the spectrum
Why is Declarative Good?
• Does away with repeated code and redundancy
• No manual optimization needed
• Less cumbersome to specify
58
59
What does one need? (Simple version)
1. A mechanism to “store”/“represent” data
2. A mechanism to “get” more data
3. A mechanism to “fix” existing data
4. A “query” language
Two prototypical systems:
• Deco: an end-to-end redesign
• Qurk: a small modification to existing databases
Deco: Declarative Crowdsourcing (a DBMS, driven by a user or application)

1) Representation scheme: user view Countries(name, lang, capital), stored as raw tables A (name), D1 (name, language), D2 (name, capital), combined via outerjoins.
2) “Get” more data: fetch rules, e.g., ∅ → name; name → language; name → capital; capital, lang → name.
3) “Fix” data: resolution rules, e.g., lang: dedup(), capital: majority(). Example: raw capital answers (Peru, Lima), (France, Nice), (France, Paris), (France, Paris) resolve to (Peru, Lima), (France, Paris); raw language answers (Peru, Spanish), (Peru, Spanish), (France, French) resolve to (Peru, Spanish), (France, French); the user view then contains (Peru, Spanish, Lima) and (France, French, Paris).
4) Declarative queries, e.g.:
select name from Countries where language = ‘Spanish’ atleast 5
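A toy sketch of how resolution rules reconcile raw crowd answers into the user view (illustrative only; these are not Deco's actual interfaces):

```python
from collections import Counter

def majority(values):
    """Resolution rule: keep the single most common answer (e.g., for capital)."""
    return [Counter(values).most_common(1)[0][0]]

def dedup(values):
    """Resolution rule: keep all distinct answers (e.g., for language)."""
    return sorted(set(values))

# Raw crowd answers for France (fetch rules would populate these tables)
raw_capital = ['Paris', 'Nice', 'Paris']
raw_language = ['French', 'French']

user_view = {'name': 'France',
             'capital': majority(raw_capital),
             'language': dedup(raw_language)}
```

The query layer only ever sees the resolved user view; conflicting raw answers stay in the raw tables.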
61
Qurk
• A regular old database
• Human processing/gathering as UDFs
– User-defined functions
– Commonly used by relational databases to capture operations outside relational algebra
– Typically external API calls
62
Qurk filter example
photos(id PRIMARY KEY, picture IMAGE)
Query:
SELECT * FROM photos WHERE isSmiling(photos.picture);   (isSmiling is a UDF)
1) Representation scheme: UDFs are “pre-declared”
2) “Get” more data: UDFs translate into one or more fixed task types
3) “Fix” data: UDFs internally handle quality assurance
4) “Query” language: SQL + UDFs
63
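In spirit, such a UDF hides the crowd behind an ordinary function call. A hypothetical sketch (not Qurk's implementation), where `ask_crowd` stands in for posting a task to the marketplace:

```python
def is_smiling(picture, ask_crowd, votes=3):
    """Hypothetical Qurk-style UDF: post the same yes/no task to several
    workers and aggregate by majority vote. ask_crowd(prompt, item)
    stands in for a marketplace call and returns True/False."""
    answers = [ask_crowd("Is this person smiling?", picture) for _ in range(votes)]
    return sum(answers) > votes / 2

# The database evaluates SELECT * FROM photos WHERE isSmiling(photos.picture)
# by calling the UDF once per row, exactly as it would any external function.
photos = ['p1.jpg', 'p2.jpg']
smiling = [p for p in photos if is_smiling(p, lambda q, item: item == 'p1.jpg')]
```

Quality assurance (redundancy, vote aggregation) lives inside the UDF, so the SQL layer never sees the crowd.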
64
A Tutorial in Three Parts
Part 0: A (Super Short) Survey of Parts 1 and 2, plus Background (Me)
Part 0.1: Background + Survey of Part 1
Part 0.2: Survey of Part 2
Part 1: A Survey of Crowd-Powered Data Management in Academia (Me)
Part 2: A Survey of Crowd-Powered Data Management in Industry (Adam) UP NEXT
65
A little about me
Assistant prof at Illinois since 2014. Thesis work on crowdsourced data processing. Now working on human-in-the-loop data analytics (HILDA).
Twitter: @adityagp
Homepage: http://data-people.cs.illinois.edu
Projects, in increasing sophistication of analysis (understand, visualize, manipulate, collaborate):
http://populace-org.github.io
http://orpheus-db.github.io
http://zenvisage.github.io
http://dataspread.github.io
66