Crowdsourced Data Processing: Industry and Academic Perspectives
Adam Marcus and Aditya Parameswaran


Apr 12, 2017

Transcript
Page 1: Crowdsourced Data Processing: Industry and Academic Perspectives


Crowdsourced Data Processing: Industry and Academic Perspectives

Adam Marcus and Aditya Parameswaran

Page 2: Crowdsourced Data Processing: Industry and Academic Perspectives

A Tutorial in Three Parts

Part 0: A (Super Short) Survey of Parts 1 and 2, plus Background (Me)
– Part 0.1: Background + Survey of Part 1
– Part 0.2: Survey of Part 2

Part 1: A Survey of Crowd-Powered Data Processing in Academia (Me)

Part 2: A Survey of Crowd-Powered Data Processing in Industry (Adam)

Page 3: Crowdsourced Data Processing: Industry and Academic Perspectives

Part 0.1 (Background and Survey of Part 1)

Page 4: Crowdsourced Data Processing: Industry and Academic Perspectives

What is crowdsourcing?

• Our definition [Von Ahn]: Crowdsourcing is a paradigm that utilizes human processing power to solve problems that computers cannot yet solve.

• e.g., processing and understanding images, videos, and text (80% or more of all data, per a five-year-old IBM study)

Page 5: Crowdsourced Data Processing: Industry and Academic Perspectives

Why is it important?

We're on the cusp of an AI revolution [NYT, July '16]:
– "a transformation many believe will have a payoff on the scale of … personal computing … or the internet"

AI requires large volumes of training data. Our best hope of understanding images, videos, and text comes from humans.

Page 6: Crowdsourced Data Processing: Industry and Academic Perspectives

How does one deploy crowdsourcing?

• Our focus: paid crowdsourcing
– Other ways: volunteer, gaming
– "paid" is broad: $$, pigs on your farm, MBs, bitcoin, …

• A typical paid platform:
– Requesters put jobs up, assign rewards
– Workers pick up and work on these jobs, get rewards

Page 7: Crowdsourced Data Processing: Industry and Academic Perspectives

Our Focus: Data, Data, Data

How do we get crowds to process large volumes of data efficiently and effectively?
– Design of algorithms
– Design of systems

We call this "crowdsourced data processing". This is the primary concern of industry users.

Page 8: Crowdsourced Data Processing: Industry and Academic Perspectives

Context: Other Work

Crowdsourced data processing depends on many other fields… (but they are not the focus of this tutorial)

Page 9: Crowdsourced Data Processing: Industry and Academic Perspectives

Humans = Data Processors

Our abstraction: humans are data processors
• compare two items
• rate an item
• evaluate a predicate on an item

The human operator set is not fully known or understood!

Page 10: Crowdsourced Data Processing: Industry and Academic Perspectives


Boolean Question

Page 11: Crowdsourced Data Processing: Industry and Academic Perspectives


K-Ary Question

Page 12: Crowdsourced Data Processing: Industry and Academic Perspectives

But: Unlike Computer Processors

… humans cost money, take time, and make mistakes:
• Cost: How much am I willing to spend?
• Latency: How long can I wait?
• Quality: What is my desired quality?

So, algorithm development has to be done "ab initio".

Page 13: Crowdsourced Data Processing: Industry and Academic Perspectives

Illustration of Challenges: Sorting

Sort n animals on "dangerousness"
• Option 1: give it all to one human worker – could take very long, likely error prone.
• Option 2: apply a sorting algorithm, with pairwise comparisons being done by humans instead of automatically.
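Option 2 amounts to plugging a crowd question in as the comparator of an ordinary sorting algorithm. A minimal sketch, where a simulated (error-free) worker stands in for a real comparison task – the `ask` function and the animal scores are illustrative assumptions, not from the tutorial:

```python
from functools import cmp_to_key

def crowd_compare(a, b, ask):
    """Compare two items by asking a (simulated) human worker.

    ask(a, b) returns True if the worker judges a less dangerous than b;
    this stub stands in for posting a real pairwise-comparison task.
    """
    if a == b:
        return 0
    return -1 if ask(a, b) else 1

# Simulated worker: answers from hidden "true" dangerousness scores.
true_score = {"rabbit": 1, "dog": 3, "tiger": 9}
ask = lambda a, b: true_score[a] < true_score[b]

animals = ["tiger", "rabbit", "dog"]
ranked = sorted(animals, key=cmp_to_key(lambda a, b: crowd_compare(a, b, ask)))
# ranked == ["rabbit", "dog", "tiger"] when the simulated worker is error-free
```

With real workers each `ask` call costs money and time, and (as the next slide shows) the answers may be wrong – which is exactly why this naive substitution is only a starting point.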

Page 14: Crowdsourced Data Processing: Industry and Academic Perspectives

Illustration of Challenges: Sorting

• Option 2, but:
– Workers may make mistakes! So how do you know if you can trust a worker response?
– Cycles may form (A < B, B < C, yet C < A)
– Should we get more worker answers for the same pair or for different pairs?

Page 15: Crowdsourced Data Processing: Industry and Academic Perspectives

Also: Interfaces

• Comparison: which is more dangerous?
• Rating: how dangerous is this one?

Page 16: Crowdsourced Data Processing: Industry and Academic Perspectives

Overall: Challenges

• Which questions do I ask of humans?
• Do I ask sequentially or in parallel?
• How much redundancy in questions?
• How do I combine answers?
• When do I stop?

Page 17: Crowdsourced Data Processing: Industry and Academic Perspectives

In the longer part of this talk…

• A recipe for crowdsourced algorithm design
– What you need to take into account
– Plus a couple of examples

Page 18: Crowdsourced Data Processing: Industry and Academic Perspectives

Next Part: Systems

• Wouldn't it be nice if you could just "say" what you wanted gathered or processed, and have the system do it for you?
– Akin to database systems
– Database systems have a query language: SQL

• Here are some examples

Page 19: Crowdsourced Data Processing: Industry and Academic Perspectives

Crowdsourced Data Processing Systems

User request: "Find the capitals of five Spanish-speaking countries"

The system decomposes this into crowd tasks, both gathering more data and processing (filtering) it:
• Give me a Spanish-speaking country
• What language do they speak in country X?
• What is the capital of country X?
• Give me a valid <Country, Capital, Language> combination

Country  Capital   Language
Peru     Lima      Spanish
Peru     Lima      Quechua
Brazil   Brasilia  Portuguese
…        …         …

Page 20: Crowdsourced Data Processing: Industry and Academic Perspectives

Crowdsourced Data Processing Systems: one specific issue – inconsistencies

"Find the capitals of five Spanish-speaking countries"
• What if some humans say Brazil is Spanish-speaking and others say Portuguese?
• What if some humans answer "Chile" and others "Chili"?

Country  Capital   Language
Peru     Lima      Spanish
Peru     Lima      Quechua
Brazil   Brasilia  Portuguese
…        …         …

Page 21: Crowdsourced Data Processing: Industry and Academic Perspectives

What are the challenges?

• What is the query language for expressing requests like this?
• How is it optimized?
• How does it mesh with existing data?
• How does it deal with the latency of the crowd, etc.?

More on how different systems solve these challenges later on.

Page 22: Crowdsourced Data Processing: Industry and Academic Perspectives

A Tutorial in Three Parts

Part 0: A (Super Short) Survey of Parts 1 and 2, plus Background (Me)
– Part 0.1: Background + Survey of Part 1
– Part 0.2: Survey of Part 2

Part 1: A Survey of Crowd-Powered Data Processing in Academia (Me)

Part 2: A Survey of Crowd-Powered Data Processing in Industry (Adam)

Page 23: Crowdsourced Data Processing: Industry and Academic Perspectives

Part 0.2 (Survey of Part 2)

Page 24: Crowdsourced Data Processing: Industry and Academic Perspectives

The Industry Perspective

Circa 2013:
– HCOMP becomes a real conference; crowdsourcing is now an academic discipline
– Industry folks at HCOMP claiming:
• "Crowdsourcing is still a dark art…"
• "We use crowdsourcing at scale… but…"
• "Academics are not solving real problems…"

Problem: no one had really chronicled the use of crowdsourcing in industry.

Page 25: Crowdsourced Data Processing: Industry and Academic Perspectives

What happened?

Adam and I spoke to 13 large-scale users of crowds + 4 marketplace vendors to identify:
– scale, use cases, status quo
– challenges, pain points

We tried to bridge the gap between industry and academia.

Crowdsourced Data Management: Industry and Academic Perspectives, Foundations and Trends in Databases series, 2015

Page 26: Crowdsourced Data Processing: Industry and Academic Perspectives

26

Qualitative Study: Who did we talk to?

Team issued a large # of categorization

tasks/week.

Data extraction

from images

Go-to team for

crowdsourcing

Page 27: Crowdsourced Data Processing: Industry and Academic Perspectives

Shocker I: Internal Platforms

Five of the largest companies we spoke to primarily use their own "internal" or "in-house" platforms:
– Workers typically hired via an outsourcing firm
– Working 9–5 on this company's tasks
– May be due to:
• Fine-grained tracking, hiring, leaderboards
• Data of a sensitive nature
• Economies of scale

What we're seeing is a drop in the bucket.

Page 28: Crowdsourced Data Processing: Industry and Academic Perspectives

Shocker II: Scale

Most companies use crowdsourcing at scale:
• One reported 50+ employees just to manage their internal marketplace
• Another issues 0.5M tasks/week
• Another has an internal crowdsourcing user mailing list with hundreds of employees

Most large firms spend millions to tens of millions of dollars per year, and a comparable amount administering internal marketplaces.

Page 29: Crowdsourced Data Processing: Industry and Academic Perspectives

Shocker II: Scale (Continued)

Why the scale?
– AI eating the world: where there's a model, there's a need for training data
– Moving target: need for fresh training data as the problem constantly evolves
– More data beats better models: models trained are more general, less overfit, …

Page 30: Crowdsourced Data Processing: Industry and Academic Perspectives

Shocker III: Academic work is not used (yet)!

• Quality assurance: almost all use majority vote; <50% use anything fancier
– <25% use active learning!
• Workflows: most workflows are single step
– "In my experience, if you need multiple steps of crowdsourcing, it's almost always more productive to go back and do a bit more automation upfront."
• Frameworks: no use of crowdsourced data processing systems, APIs, or frameworks
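The majority vote that dominates industry quality assurance is as simple as it sounds; a minimal sketch:

```python
from collections import Counter

def majority_vote(answers):
    """Aggregate redundant worker answers for one item.

    Returns the most common label; ties are broken arbitrarily
    by Counter's internal ordering.
    """
    return Counter(answers).most_common(1)[0][0]

label = majority_vote(["cat", "cat", "dog"])
# label == "cat"
```

The "fancy stuff" the slide alludes to (e.g., weighting workers by estimated accuracy) replaces this one-liner with an inference procedure, which is precisely the adoption gap being described.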

Page 31: Crowdsourced Data Processing: Industry and Academic Perspectives

Other Findings

• Design is super hard
– Many iterations to get to the "right" task
– Some actively use A/B testing between task types

• Top-3 benefits of crowds:
– flexible scaling, low cost, enabling previously difficult tasks
– "It's easier to justify money for crowds than another employee"

Page 32: Crowdsourced Data Processing: Industry and Academic Perspectives

Other Findings: Use Cases

1. Categorization
2. Content Moderation
3. Entity Resolution
4. Relevance
5. Data Cleaning
6. Data Extraction
7. Text Generation

Page 33: Crowdsourced Data Processing: Industry and Academic Perspectives

Major Takeaways

Shockers:
I. Understudied paradigm: "internal" marketplaces
II. At scale – need to shout from the rooftops!
III. Academic stuff isn't used much (yet)

Other Takeaways:
I. Academia is working on the (approximately) right problems!
II. Crowds admit flexibility in companies w/o politics
III. Design is super challenging!

Page 34: Crowdsourced Data Processing: Industry and Academic Perspectives

What else?

• Sizes of teams, scale, throughput
• Recruiting, retention
• Use cases
• Quality assurance
• Task design and decomposition
• Prior approaches, benefits of crowdsourcing
• Incentivization

Lots of good stuff coming up in Adam’s Part 2!

Page 35: Crowdsourced Data Processing: Industry and Academic Perspectives

A Tutorial in Three Parts

Part 0: A (Super Short) Survey of Parts 1 and 2, plus Background (Me)
– Part 0.1: Background + Survey of Part 1
– Part 0.2: Survey of Part 2

Part 1: A Survey of Crowd-Powered Data Management in Academia (Me)

Part 2: A Survey of Crowd-Powered Data Management in Industry (Adam)

Page 36: Crowdsourced Data Processing: Industry and Academic Perspectives


Part 1

Page 37: Crowdsourced Data Processing: Industry and Academic Perspectives

Data Processing Algorithms

Humans are data processors. How do we design algorithms using human operators?
• Cost: How much am I willing to spend?
• Latency: How long can I wait?
• Quality: What is my desired quality?

Page 38: Crowdsourced Data Processing: Industry and Academic Perspectives

Crowdsourced Data Processing

[Architecture diagram: Algorithms built from basic ops (compare, filter) sit on top of Plumbing (interfaces, incentives, trust/reputation, spam, pricing), which talks to Marketplace #1 … Marketplace #n]

Page 39: Crowdsourced Data Processing: Industry and Academic Perspectives

Data Processing Algorithms

• Sorting, Max, Top-K
• Filtering, Rating, Finding
• Entity Resolution, Clustering, Joins
• Categorization
• Gathering, Extracting
• Counting
• …

50-odd papers in this space! @ VLDB, SIGMOD, ICDE, …

Page 40: Crowdsourced Data Processing: Industry and Academic Perspectives

Algorithm Flow

[Diagram: during Preparation the designer fixes I. Unit Operations, II. Cost Model, and III. Objectives, under assumptions A. Error Model and B. Latency Model; the algorithm then sends tasks on items to the crowdsourcing marketplace and updates its current answers]

Page 41: Crowdsourced Data Processing: Industry and Academic Perspectives

Algorithm Design Recipe

• Explicit choices:
– Unit Operations
– Cost Model
– Objectives
• Assumptions:
– Error Model
– Latency Model

Illustrations:
• My paper "CrowdScreen: Algorithms for Filtering…", SIGMOD '12, AKA Filtering: given a dataset of images, find those that don't show inappropriate content
• Adam's paper "Crowd-Powered Sorts and Joins", VLDB '11, AKA Sorting: given a dataset of animal images, sort them in increasing "dangerousness"

Page 42: Crowdsourced Data Processing: Industry and Academic Perspectives

Explicit Choice: Unit Operations

What sorts of input can we get from human workers?
• Simple vs. complex:
– Simpler = easier to analyze, easier to "aggregate" and assign correctness to
– Complex = helps us get more fine-grained, open-ended data
• Number of types:
– One type is simpler to analyze and aggregate than two

Most work ends up picking a small number of simple operations.
Filtering: filter an item. Sorting: compare two items, or rate an item.

Page 43: Crowdsourced Data Processing: Industry and Academic Perspectives

Explicit Choice: Cost Model

How do we set the reward for each unit operation? Cost can depend on:
• Type of operation
• Type of item
• Number of items

Typical rule of thumb: time the operation, pay using minimum wage.
Simple assumption: same cost for each operation.
Filtering: c(filter an item) = constant. Sorting: c(compare two items) = c(rate an item).
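The rule of thumb above can be made concrete. A minimal sketch; the $7.25/hr default is an assumption (the US federal minimum wage), not a figure from the slides:

```python
def task_reward(seconds_per_task, hourly_wage=7.25):
    """Rule-of-thumb reward: time the operation, pay (at least) minimum wage.

    hourly_wage defaults to 7.25, the US federal minimum wage
    (an illustrative assumption). Returns dollars, rounded to 4 places.
    """
    return round(seconds_per_task / 3600 * hourly_wage, 4)

# A 30-second comparison task at $7.25/hr:
task_reward(30)  # ≈ $0.0604
```

Under the "same cost for each operation" simplification, one such number is computed once and used as the constant c for every unit operation.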

Page 44: Crowdsourced Data Processing: Industry and Academic Perspectives

Explicit Choice: Objectives

What do we optimize for? We care about cost, latency, and quality.
• Bound one (or two), optimize the others
• Typically: bound on cost, maximize quality
• Sometimes: bound on quality, minimize cost

Filtering: bound on quality, minimize cost. Sorting: bound on cost, maximize quality.

Page 45: Crowdsourced Data Processing: Industry and Academic Perspectives

Assumption: Error Model

How do we model human accuracies? All models are wrong, but can still be useful.
• Simplest model: no errors!
– Similar: ask a fixed # of workers, then no error
– Same error probability per worker (Filtering)
• Each worker has a fixed error probability
• Each worker has an error probability dependent on the item
• No assumptions about error – just get something that works well (Sorting)

Opt for what can be analyzed – simple is good. This is a bit of an "art" – may require iterations.
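Even the simplest non-trivial model – each answer independently wrong with a fixed probability – already lets us reason about redundancy. A sketch under that assumption (independent workers, odd group size), computing the chance that a majority vote is still wrong:

```python
from math import comb

def majority_error(p, n):
    """Probability that a majority of n (odd) workers is wrong, under
    the simplest error model: each answer is independently wrong with
    fixed probability p. A majority is wrong iff more than n//2
    individual answers are wrong (binomial tail).
    """
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# One worker errs 20% of the time; five workers voting err far less:
majority_error(0.2, 5)  # ≈ 0.058
```

This is why asking a fixed number of workers and majority-voting can be treated as "then no error" for analysis: the residual error shrinks rapidly with redundancy, at a linear increase in cost.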

Page 46: Crowdsourced Data Processing: Industry and Academic Perspectives

Placing it all together: Filtering

• Goal: filter a set of items on some property X; i.e., find all items that satisfy X
• Operation: ask a person "does this item satisfy the filter or not?" (e.g., "Does this image show an animal?")
• Cost model: all operations cost the same
• Objective: accuracy across all items is fixed (alpha, e.g., 95%), minimize cost
• Error model: people make mistakes with a fixed probability (beta, e.g., 5%)

Dataset of items → Boolean predicate → filtered dataset

Page 47: Crowdsourced Data Processing: Industry and Academic Perspectives

Our Visualization of Strategies

[Grid figure: one axis counts Yes answers, the other counts No answers (1–5 each); every grid point is marked "decide PASS", "decide FAIL", or "continue" – a Markov Decision Process]

Page 48: Crowdsourced Data Processing: Industry and Academic Perspectives

Evaluating Strategies

[Same grid: points marked decide PASS / continue / decide FAIL, over # Yes and # No answers]

Pr[reach (4,2)] = Pr[reach (4,1) ∧ get a No] + Pr[reach (3,2) ∧ get a Yes]

Cost = Σ_(x,y) (x + y) · Pr[reach (x,y)]
Error = Σ Pr[reach a FAIL point ∧ answer is 1] + Σ Pr[reach a PASS point ∧ answer is 0]
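The reach-probability recurrence and the cost formula can be evaluated mechanically for any given strategy. A sketch, assuming a single per-item probability `p_yes` of a Yes answer; the `decide` function and the three-worker-majority example are illustrative, not from the paper:

```python
def evaluate_strategy(decide, p_yes, max_q):
    """Evaluate a filtering strategy on the (#Yes, #No) answer grid.

    decide(y, n) -> "pass", "fail", or "continue".
    p_yes: probability that a single worker answers Yes.
    Returns (expected_cost, prob_pass): expected number of questions
    asked, and the probability the item is ultimately passed.
    """
    reach = {(0, 0): 1.0}
    cost = prob_pass = 0.0
    for total in range(max_q + 1):          # process grid level by level
        for y in range(total + 1):
            n = total - y
            p = reach.get((y, n), 0.0)
            if p == 0.0:
                continue
            d = decide(y, n)
            if d == "continue" and total < max_q:
                # Recurrence from the slide: a Yes moves to (y+1, n),
                # a No moves to (y, n+1).
                reach[(y + 1, n)] = reach.get((y + 1, n), 0.0) + p * p_yes
                reach[(y, n + 1)] = reach.get((y, n + 1), 0.0) + p * (1 - p_yes)
            else:                            # termination point: accumulate cost
                cost += (y + n) * p
                if d == "pass":
                    prob_pass += p

    return cost, prob_pass

# Strategy: stop at 2 Yeses (pass) or 2 Nos (fail) - best-of-three.
decide = lambda y, n: "pass" if y == 2 else ("fail" if n == 2 else "continue")
cost, prob_pass = evaluate_strategy(decide, 0.9, 3)
# With p_yes = 0.9: expected cost 2.18 questions, passes with prob 0.972
```

Enumerating all strategies this way is exactly the naive approach on the next slide; the paper's contribution is finding the optimal strategy without that enumeration.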

Page 49: Crowdsourced Data Processing: Industry and Academic Perspectives

Naïve Approach

For each grid point, assign "decide PASS", "decide FAIL", or "continue". Then, for all strategies:
• Evaluate cost & error
• Return the best

O(3^g) strategies, where g = O(m²) grid points. This is obviously bad. The paper has probabilistic methods that identify optimal strategies (via linear programming).

Page 50: Crowdsourced Data Processing: Industry and Academic Perspectives

Placing it all together: Sorting

• Goal: sort a set of items on some property X (e.g., sort animals on dangerousness)
• Operation: ask a person "is A better than B on property X?", or "rate A on property X"
• Cost model: all operations cost the same
• Objective: total cost is fixed, maximize accuracy
• Error model: more ad hoc; no fixed assumption

Dataset of items → sort on predicate → sorted dataset

Page 51: Crowdsourced Data Processing: Industry and Academic Perspectives

Placing it all together: Sorting

• Completely comparison-based
– Accuracy = 1 (completely accurate)
– O(#items²) tasks
• Completely rating-based
– Accuracy ≈ 0.8
– O(#items) tasks

Page 52: Crowdsourced Data Processing: Industry and Academic Perspectives

Placing it all together: Sorting

Hybrid approach:
• First, gather a bunch of ratings
• Order based on average ratings
• Then, use comparisons, in one of three flavors:
– Random: pick S items, compare
– Confidence-based: pick the most confusing "window", compare that first, repeat
– Sliding-window: for all windows, compare the best
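A much-simplified sketch of the hybrid idea: order by (noisy) ratings, then spend comparison tasks on one sliding-window pass of width 2. The `rate` and `compare` callables stand in for crowd tasks, and the animal data is an illustrative assumption:

```python
def hybrid_sort(items, rate, compare):
    """Hybrid sort sketch: cheap ratings give an initial order,
    then a single pass of adjacent pairwise comparisons (sliding
    window of width 2) repairs local rating errors.
    """
    # Step 1: initial order from (averaged) ratings - O(#items) tasks.
    ordered = sorted(items, key=rate)
    # Step 2: one comparison per adjacent pair - O(#items) more tasks.
    for i in range(len(ordered) - 1):
        if compare(ordered[i + 1], ordered[i]):  # right neighbor ranks lower
            ordered[i], ordered[i + 1] = ordered[i + 1], ordered[i]
    return ordered

# Simulated crowd: ratings are noisy (dog is over-rated), comparisons exact.
true_rank = {"rabbit": 1, "dog": 3, "wolf": 5, "tiger": 9}
noisy_rating = {"rabbit": 1, "dog": 6, "wolf": 5, "tiger": 9}
ordered = hybrid_sort(list(true_rank), noisy_rating.get,
                      lambda a, b: true_rank[a] < true_rank[b])
# ordered == ["rabbit", "dog", "wolf", "tiger"]: one comparison fixes the error
```

The paper's flavors differ in where the comparison budget goes (random pairs, most-confused windows, or all windows); the trade-off, shown in the plots that follow, is accuracy close to comparison-based sorting at a rating-like task count.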

Page 53: Crowdsourced Data Processing: Industry and Academic Perspectives

[Plot: Accuracy (0.80–1.0) vs. # Tasks (0–80) for the Compare and Rate strategies]

Page 54: Crowdsourced Data Processing: Industry and Academic Perspectives

[Plot: Accuracy (0.80–1.0) vs. # Tasks (0–80) for the Hybrid, Compare, and Rate strategies]

Page 55: Crowdsourced Data Processing: Industry and Academic Perspectives

Crowdsourced Data Processing

[Architecture diagram, extended: Systems sit on top of Algorithms – complex ops (sort, cluster, clean; get data, verify) built from basic ops (compare, filter) – on top of Plumbing (interfaces, incentives, trust/reputation, spam, pricing), over Marketplace #1 … Marketplace #n]

Page 56: Crowdsourced Data Processing: Industry and Academic Perspectives

Data Processing Systems

Declarative crowdsourcing systems: Qurk (MIT), Deco (Stanford/UCSC), CrowdDB (Berkeley)

Treat crowds as just another "access method" for the database:
• Fetch data from disk, the web, …, the crowd
• Not just process data, but also gather data

Page 57: Crowdsourced Data Processing: Industry and Academic Perspectives

There Are Other Systems…

In order of increasing declarativity:
• Toolkits – TurKit, AutoMan
– Crowds = "API calls"
– Little to no optimization
– Analogous to programming APIs
• Imperative systems – Jabberwocky, CrowdForge
– Crowds = "data processing units"
– Programmer-dictated flow, limited optimization within the units
– Analogous to Pig or MapReduce
• Declarative systems – Deco, Qurk, CrowdDB
– Crowds = "data processing units"
– Programmer specifies the goal, optimized across the spectrum
– Analogous to relational databases

Page 58: Crowdsourced Data Processing: Industry and Academic Perspectives

Why is Declarative Good?

• Eliminates repeated code and redundancy
• No manual optimization needed
• Less cumbersome to specify

Page 59: Crowdsourced Data Processing: Industry and Academic Perspectives

What does one need? (Simple Version)

1. A mechanism to "store"/"represent" data
2. A mechanism to "get" more data
3. A mechanism to "fix" existing data
4. A "query" language

Two prototypical systems:
• Deco: an end-to-end redesign
• Qurk: a small modification to existing databases

Page 60: Crowdsourced Data Processing: Industry and Academic Perspectives

[Deco data model illustration: the user sees a single view Countries(name, language, capital); under the covers it is stored as raw tables – an anchor table A(name) with rows Peru and France, and dependent tables D1(name, language) with Peru/Spanish (twice) and France/French, and D2(name, capital) with Peru/Lima, France/Nice, and France/Paris (twice). Fetch rules (e.g., φ → name, name → language, name → capital) gather new rows from the crowd; resolution rules reconcile duplicates and conflicts within each table (e.g., D2's conflicting France rows resolve to Paris); an outer join (⋈) of the resolved tables then reconstructs the user view: (Peru, Spanish, Lima), (France, French, Paris)]

Page 61: Crowdsourced Data Processing: Industry and Academic Perspectives

Deco: Declarative Crowdsourcing

The user or application issues queries against the DBMS as usual:
1) Representation scheme: Countries(name, lang, capital)
2) "Get" more data: fetch rules, e.g., name → capital; capital, lang → name
3) "Fix" data: resolution rules, e.g., lang: dedup(); capital: majority()
4) Declarative queries, e.g.:
select name from Countries where language = 'Spanish' atleast 5

Page 62: Crowdsourced Data Processing: Industry and Academic Perspectives

Qurk

• A regular old database
• Human processing/gathering as UDFs
– User-defined functions
– Commonly also used by relational databases to capture operations outside relational algebra
– Typically external API calls

Page 63: Crowdsourced Data Processing: Industry and Academic Perspectives

Qurk filter: inappropriate content

photos(id PRIMARY KEY, picture IMAGE)

Query =
SELECT * FROM photos
WHERE isSmiling(photos.picture);   -- isSmiling is the UDF

1) Representation scheme: UDFs are "pre-declared"
2) "Get" more data: UDFs translate into one or more fixed task types
3) "Fix" data: UDFs internally handle quality assurance
4) "Query": SQL + UDFs

Page 64: Crowdsourced Data Processing: Industry and Academic Perspectives

A Tutorial in Three Parts

Part 0: A (Super Short) Survey of Parts 1 and 2, plus Background (Me)
– Part 0.1: Background + Survey of Part 1
– Part 0.2: Survey of Part 2

Part 1: A Survey of Crowd-Powered Data Management in Academia (Me)

Part 2: A Survey of Crowd-Powered Data Management in Industry (Adam) [UP NEXT]

Page 65: Crowdsourced Data Processing: Industry and Academic Perspectives

A little about me

Assistant prof at Illinois since 2014. Thesis work on crowdsourced data processing. Now working on human-in-the-loop data analytics (HILDA), spanning increasing sophistication of analysis: Understand, Visualize, Manipulate, Collaborate.

Twitter: @adityagp
Homepage: http://data-people.cs.illinois.edu

Projects:
http://populace-org.github.io
http://orpheus-db.github.io
http://zenvisage.github.io
http://dataspread.github.io

Page 66: Crowdsourced Data Processing: Industry and Academic Perspectives
