Integrating Crowd & Cloud Resources for Big Data
Michael Franklin
Middleware 2012, Montreal, December 6, 2012
UC Berkeley, NSF Expeditions in Computing
An overview of issues and early work on combining human computation and scalable computing to tackle big data analytics problems. Includes a survey of relevant projects underway at the UC Berkeley AMPLab.
Transcript
Page 1: Middeware2012 crowd

Integrating Crowd & Cloud Resources for Big Data

Michael Franklin

Middleware 2012, Montreal, December 6, 2012

UC BERKELEY

Expeditions in Computing

Page 2: Middeware2012 crowd

CROWDSOURCING WHAT IS IT?

Page 3: Middeware2012 crowd

Citizen Science

NASA “Clickworkers” 2000

Page 4: Middeware2012 crowd

Citizen Journalism/Participatory Sensing


Page 5: Middeware2012 crowd

Communities & Expertise

Page 6: Middeware2012 crowd

Data Collection & Curation, e.g., Freebase

Page 7: Middeware2012 crowd

An Academic View

From Quinn & Bederson, “Human Computation: A Survey and Taxonomy of a Growing Field”, CHI 2011.

Page 8: Middeware2012 crowd

How Industry Looks At It

Page 9: Middeware2012 crowd

Useful Taxonomies

• Doan, Halevy, Ramakrishnan (Crowdsourcing), CACM 4/11
  – nature of collaboration (implicit vs. explicit)
  – architecture (standalone vs. piggybacked)
  – must recruit users/workers? (yes or no)
  – what do users/workers do?

• Bederson & Quinn (Human Computation), CHI ’11
  – Motivation (Pay, Altruism, Enjoyment, Reputation)
  – Quality Control (many mechanisms)
  – Aggregation (how are results combined?)
  – Human Skill (visual recognition, language, …)
  – …

Page 10: Middeware2012 crowd

Types of Tasks

Task granularity and examples:

Complex Tasks
• Build a website
• Develop a software system
• Overthrow a government?

Simple Projects
• Design a logo and visual identity
• Write a term paper

Macro Tasks
• Write a restaurant review
• Test a new website feature
• Identify a galaxy

Micro Tasks
• Label an image
• Verify an address
• Simple entity resolution

Inspired by the report: “Paid Crowdsourcing”, Smartsheet.com, 9/15/2009

Page 11: Middeware2012 crowd

MICRO-TASK MARKETPLACES

Page 12: Middeware2012 crowd

Amazon Mechanical Turk (AMT)

Page 13: Middeware2012 crowd

Microtasking – Virtualized Humans

• Current leader: Amazon Mechanical Turk
• Requestors place Human Intelligence Tasks (HITs)
  – set price per “assignment” (usually cents)
  – specify # of replicas (assignments), expiration, …
  – User Interface (for workers)
  – API-based: “createHit()”, “getAssignments()”, “approveAssignments()”, “forceExpire()”
• Requestors approve jobs and payment
• Workers (a.k.a. “turkers”) choose jobs, do them, get paid
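To make the HIT lifecycle above concrete, here is a minimal sketch of the requester side. The AMTClient wrapper, its argument names, and the simulated answers are hypothetical placeholders standing in for the marketplace API, not Amazon's actual SDK.

```python
# Hypothetical sketch of the requester-side HIT lifecycle described above.
# AMTClient and its argument names are illustrative stand-ins, not the real AMT SDK.

class AMTClient:
    """Stand-in wrapper exposing the calls named on the slide."""
    def createHit(self, question, reward_cents, assignments, lifetime_secs):
        return "HIT-123"                                   # dummy HIT id
    def getAssignments(self, hit_id):
        return [{"assignment_id": f"A{i}", "answer": "a dog"} for i in range(3)]
    def approveAssignments(self, assignment_ids):
        pass                                               # would trigger payment
    def forceExpire(self, hit_id):
        pass                                               # would retire the HIT early

def label_image(client, image_url):
    # 1. Requester posts a HIT: a question, a per-assignment price, a replica count.
    hit_id = client.createHit(
        question=f"What is shown in the image at {image_url}?",
        reward_cents=2,        # micro-task pricing: a few cents per assignment
        assignments=3,         # replicas, so answers can be cross-checked later
        lifetime_secs=3600,    # let the HIT expire if nobody picks it up
    )
    # 2. Workers choose the HIT and submit answers; the requester collects them.
    answers = client.getAssignments(hit_id)
    # 3. Requester reviews and approves, which is what pays the workers.
    client.approveAssignments([a["assignment_id"] for a in answers])
    return [a["answer"] for a in answers]

print(label_image(AMTClient(), "http://example.com/photo.jpg"))
```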

Page 14: Middeware2012 crowd

AMT Worker Interface

Page 15: Middeware2012 crowd
Page 16: Middeware2012 crowd
Page 17: Middeware2012 crowd

Microtask Aggregators

Page 18: Middeware2012 crowd

Crowdsourcing for Data Management

• Relational
  – data cleaning
  – data entry
  – information extraction
  – schema matching
  – entity resolution
  – data spaces
  – building structured KBs
  – sorting
  – top-k
  – ...

• Beyond relational
  – graph search
  – classification
  – transcription
  – mobile image search
  – social media analysis
  – question answering
  – NLP
  – text summarization
  – sentiment analysis
  – semantic wikis
  – ...

Page 19: Middeware2012 crowd

TOWARDS HYBRID CROWD/CLOUD COMPUTING

Page 20: Middeware2012 crowd

Not Exactly Crowdsourcing, but…

“The hope is that, in not too many years, human brains and computing machines will be coupled together very tightly, and that the resulting partnership will think as no human brain has ever thought and process data in a way not approached by the information-handling machines we know today.”

– J.C.R. Licklider, “Man-Computer Symbiosis”, 1960

Page 21: Middeware2012 crowd

AMP: Integrating Diverse Resources

Algorithms: Machine Learning and Analytics
People: Crowdsourcing & Human Computation
Machines: Cloud Computing

Page 22: Middeware2012 crowd

The Berkeley AMPLab

• Goal: Data analytics stack integrating A, M & P
• BDAS: Released as BSD/Apache Open Source
• 6-year duration: 2011-2017
• 8 CS Faculty
• Directors: Franklin (DB), Jordan (ML), Stoica (Sys)
• Industrial Support & Collaboration
• NSF Expedition and DARPA XData

Page 23: Middeware2012 crowd

People in AMP

• Long-term Goal: Make people an integrated part of the system!
  – Leverage human activity
  – Leverage human intelligence

• Current AMP People Projects
  – Carat: Collaborative Energy Debugging
  – CrowdDB: “The World’s Dumbest Database System”
  – CrowdER: Hybrid computation for Entity Resolution
  – CrowdQ: Hybrid Unstructured Query Answering

(Diagram: people exchange data & activity and questions & answers with Machines + Algorithms.)

Page 24: Middeware2012 crowd

Carat: Leveraging Human Activity


~500,000 downloads to date

A. J. Oliner, et al. Collaborative Energy Debugging for Mobile Devices. Workshop on Hot Topics in System Dependability (HotDep), 2012.

Page 25: Middeware2012 crowd

Carat: How it works


Collaborative Detection of Energy Bugs
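The slides do not spell out Carat's statistics, but the collaborative idea can be sketched as follows: many devices report per-app battery drain rates, and an app is flagged on a device when its drain there is far above the community's rates for the same app. The data layout and the 2-sigma rule below are illustrative assumptions, not Carat's actual method.

```python
from collections import defaultdict
from statistics import mean, stdev

# Toy sketch of collaborative energy-bug detection: compare one device's
# per-app battery drain rate against the community's rates for the same app.

def flag_energy_hogs(samples, my_device):
    """samples: list of (device_id, app, drain_pct_per_hour)."""
    by_app = defaultdict(list)
    for device, app, rate in samples:
        by_app[app].append((device, rate))

    flagged = []
    for app, readings in by_app.items():
        community = [r for d, r in readings if d != my_device]
        mine = [r for d, r in readings if d == my_device]
        if len(community) < 2 or not mine:
            continue  # not enough community data to compare against
        mu, sigma = mean(community), stdev(community)
        if mean(mine) > mu + 2 * sigma:   # anomalously high drain on this device
            flagged.append((app, mean(mine), mu))
    return flagged

samples = [
    ("dev1", "maps", 12.0), ("dev2", "maps", 11.0), ("dev3", "maps", 13.0),
    ("me",   "maps", 35.0),                      # suspiciously high on my device
    ("dev1", "mail",  4.0), ("dev2", "mail",  5.0), ("me", "mail", 4.5),
]
print(flag_energy_hogs(samples, "me"))   # -> [('maps', 35.0, 12.0)]
```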

Page 26: Middeware2012 crowd

Leveraging Human Intelligence


First Attempt: CrowdDB

See also:

Qurk – MIT

Deco – Stanford

CrowdDB: Answering Queries with Crowdsourcing, SIGMOD 2011
Query Processing with the VLDB Crowd, VLDB 2011

Page 27: Middeware2012 crowd

DB-hard Queries


SELECT Market_Cap
FROM Companies
WHERE Company_Name = “IBM”

Number of Rows: 0

Problem: Entity Resolution

Company_Name | Address | Market Cap
Google | Googleplex, Mtn. View CA | $210Bn
Intl. Business Machines | Armonk, NY | $200Bn
Microsoft | Redmond, WA | $250Bn

Page 28: Middeware2012 crowd

DB-hard Queries


SELECT Market_Cap
FROM Companies
WHERE Company_Name = “Apple”

Number of Rows: 0

Problem: Closed-World Assumption

Company_Name | Address | Market Cap
Google | Googleplex, Mtn. View CA | $210Bn
Intl. Business Machines | Armonk, NY | $200Bn
Microsoft | Redmond, WA | $250Bn

Page 29: Middeware2012 crowd


SELECT Image
FROM Pictures
WHERE Image contains “Good Looking Dog”

Number of Rows: 0

Problem: Subjective Comparison

DB-hard Queries

Page 30: Middeware2012 crowd

Leveraging Human Intelligence


First Attempt: CrowdDB

Where to use the crowd:
• Cleaning and disambiguation
• Find missing data
• Make subjective comparisons

CrowdDB: Answering Queries with Crowdsourcing, SIGMOD 2011
Query Processing with the VLDB Crowd, VLDB 2011

Page 31: Middeware2012 crowd

CrowdDB - Worker Interface


Page 32: Middeware2012 crowd

Mobile Platform


Page 33: Middeware2012 crowd

CrowdSQL

DDL Extensions:

Crowdsourced columns:
CREATE TABLE company (
  name STRING PRIMARY KEY,
  hq_address CROWD STRING);

Crowdsourced tables:
CREATE CROWD TABLE department (
  university STRING,
  department STRING,
  phone_no STRING)
PRIMARY KEY (university, department);

DML Extensions:

CrowdEqual:
SELECT * FROM companies WHERE Name ~= “Big Blue”

CROWDORDER operator (currently a UDF):
SELECT p FROM picture
WHERE subject = "Golden Gate Bridge"
ORDER BY CROWDORDER(p, "Which pic shows better %subject");
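As a rough illustration of what happens under the covers, a CrowdDB-style executor can scan for missing values of a CROWD column and turn each one into a micro-task. The sketch below does only that; post_hit is a hypothetical stand-in for the marketplace API, and the real system's query plan operators, batching, and quality control are omitted.

```python
# Rough sketch: how an executor might crowdsource missing values of a CROWD
# column (here, company.hq_address). post_hit() is a hypothetical stand-in for
# the micro-task marketplace API; CrowdDB's real operators do much more.

def post_hit(question: str) -> str:
    print("HIT posted:", question)
    return "simulated worker answer"

def crowdfill(rows, crowd_column):
    """For each row whose crowd_column is NULL, ask the crowd and fill it in."""
    for row in rows:
        if row.get(crowd_column) is None:
            question = f"What is the {crowd_column} of '{row['name']}'?"
            row[crowd_column] = post_hit(question)
    return rows

companies = [
    {"name": "IBM",    "hq_address": "Armonk, NY"},
    {"name": "Google", "hq_address": None},   # missing -> goes to the crowd
]
print(crowdfill(companies, "hq_address"))
```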

Page 34: Middeware2012 crowd

CrowdDB Query: Picture ordering


Query:
SELECT p FROM picture
WHERE subject = "Golden Gate Bridge"
ORDER BY CROWDORDER(p, "Which pic shows better %subject");

Data size: 30 subject areas, with 8 pictures each
Batching: 4 orderings per HIT
Replication: 3 assignments per HIT
Price: 1 cent per HIT

(Results compared: turker votes, turker ranking, expert ranking.)

Page 35: Middeware2012 crowd

User Interface vs. Quality

(Figure: error rates for three worker interfaces, "Department first", "Professor first", and a de-normalized probe: roughly 10% for two of the designs and roughly 80% for the third.)

Page 36: Middeware2012 crowd

Turker Affinity and Errors

(Figure: error rates by turker rank.)

Page 37: Middeware2012 crowd

A Bigger Underlying Issue

Closed-World Open-World


Page 38: Middeware2012 crowd

What Does This Query Mean?

SELECT COUNT(*) FROM IceCreamFlavors


Trushkowsky et al. Crowdsourcing Enumeration Queries, ICDE 2013 (to appear)

Page 39: Middeware2012 crowd

Estimating Completeness

US States using Mechanical Turk

SELECT COUNT(*) FROM US States

(Plot: average number of distinct US states collected vs. # responses.)

Species estimation techniques perform well on average:
• Uniform under-predicts slightly, coefficient of variation = 0.5
• Decent estimate after 100 HITs

Page 40: Middeware2012 crowd

Estimating Completeness

Ice Cream Flavors

SELECT COUNT(*) FROM IceCreamFlavors

• Estimators don’t converge
• Very highly skewed (CV = 5.8)
• Detect that # of HITs is insufficient (beginning of curve)

A few short lists of ice cream flavors (e.g., “alumni swirl, apple cobbler crunch, arboretum breeze, …” from the Penn State Creamery).

Page 41: Middeware2012 crowd

Pay-as-you-go

• “I don’t believe it is usually possible to estimate the number of species... but only an appropriate lower bound for that number. This is because there is nearly always a good chance that there are a very large number of extremely rare species” – Good, 1953

• So instead, can ask: “What’s the benefit of m additional HITs?”

Ice Cream after 1500 HITs:

m     Actual   Shen    Spline
10    1        1.79    1.62
50    7        8.91    8.22
200   39       35.4    32.9
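The slides don't say exactly which estimators were evaluated, but a minimal, generic sketch of this style of estimation is below: the Chao1 formula estimates the number of unseen items from singleton and doubleton counts, and the Shen et al. (2003) formula predicts how many new items m additional answers should uncover. Treat it as an illustration, not the paper's implementation.

```python
from collections import Counter

# Generic species-estimation sketch (Chao1 unseen-item estimate plus the
# Shen et al. 2003 prediction for m further samples). Illustrative only.

def chao1_unseen(counts):
    """Estimated number of unseen items from singleton/doubleton frequencies."""
    f1 = sum(1 for c in counts if c == 1)   # items seen exactly once
    f2 = sum(1 for c in counts if c == 2)   # items seen exactly twice
    if f2 == 0:
        return f1 * (f1 - 1) / 2.0          # bias-corrected variant when f2 = 0
    return f1 * f1 / (2.0 * f2)

def expected_new_items(counts, m):
    """Expected number of new distinct items in m additional answers."""
    n = sum(counts)                          # total answers collected so far
    f1 = sum(1 for c in counts if c == 1)
    f0 = chao1_unseen(counts)
    if f0 == 0:
        return 0.0
    return f0 * (1.0 - (1.0 - f1 / (n * f0 + f1)) ** m)

answers = ["vanilla", "chocolate", "vanilla", "mint", "strawberry", "mint", "mango"]
counts = list(Counter(answers).values())
print("estimated total distinct flavors:", len(counts) + chao1_unseen(counts))
print("expected new flavors in 50 more answers:", expected_new_items(counts, 50))
```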

Page 42: Middeware2012 crowd

CrowdER - Entity Resolution


Page 43: Middeware2012 crowd

Hybrid Entity Resolution

Threshold = 0.2
# Pairs = 8,315
# HITs = 508
Cost = $38.1
Time = 4.5 h
Time (QT) = 20 h

J. Wang et al. CrowdER: Crowdsourcing Entity Resolution, PVLDB 2012
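The numbers above come from a hybrid pipeline: a machine-computed similarity measure prunes the candidate pairs, and only pairs above the threshold are sent to the crowd. Below is a minimal sketch of that split under illustrative assumptions (Jaccard similarity over word tokens, an ask_crowd stub); CrowdER's HIT generation, batching, and clustering are omitted.

```python
from itertools import combinations

# Sketch of hybrid entity resolution: machines prune with a cheap similarity
# measure, the crowd only judges pairs above the threshold (0.2 on the slide).

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def ask_crowd(question: str) -> bool:
    print("HIT:", question)
    return True  # stand-in for a majority vote over several assignments

def hybrid_er(records, threshold=0.2):
    matches = []
    for r1, r2 in combinations(records, 2):
        if jaccard(r1, r2) >= threshold:          # machine filter
            if ask_crowd(f"Do '{r1}' and '{r2}' refer to the same product?"):
                matches.append((r1, r2))          # crowd-confirmed match
    return matches

products = ["iPad Two 16GB White", "Apple iPad 2 16GB White", "Kindle Fire HD"]
print(hybrid_er(products))
```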

Page 44: Middeware2012 crowd

CrowdQ – Query Generation

Demartini et al. CrowdQ: Crowdsourced Query Understanding, CIDR 2013 (to appear)

• Help find answers to unstructured queries
  – Approach: Generate a structured query via templates
• Machines do parsing and ontology lookup
• People do the rest: verification, entity extraction, etc.
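As one way to picture that division of labor, the sketch below machine-generates a structured query template from a keyword query and emits the verification questions a crowd might answer. The parse, the template syntax, and the HIT wording are all hypothetical, not CrowdQ's actual pipeline.

```python
# Hypothetical sketch of CrowdQ-style query generation: machines propose a
# structured template from a keyword query, people verify its pieces.

def machine_parse(keyword_query: str):
    # Pretend parser + ontology lookup: recognize one entity and two relations.
    # (A real system would use NLP parsing and a knowledge-base lookup here.)
    if keyword_query == "birthplace of the director of Avatar":
        return {"entity": "Avatar", "relation1": "directed_by", "relation2": "born_in"}
    raise NotImplementedError

def build_template(parse):
    # Structured query template with variables for the unknowns.
    return (f"SELECT ?place WHERE {{ '{parse['entity']}' {parse['relation1']} ?x . "
            f"?x {parse['relation2']} ?place }}")

def verification_hits(parse):
    # Questions the crowd would answer to confirm the machine's guesses.
    return [
        f"Is '{parse['entity']}' the title of a movie?",
        f"Does '{parse['relation1']}' describe the link between a movie and its director?",
    ]

parse = machine_parse("birthplace of the director of Avatar")
print(build_template(parse))
for q in verification_hits(parse):
    print("HIT:", q)
```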

Page 45: Middeware2012 crowd

SO, WHERE DOES MIDDLEWARE FIT IN?

Page 46: Middeware2012 crowd

Generic Architecture

(Diagram: an application running on top of a hybrid crowd/cloud platform.)

Middleware is the software that resides between applications and the underlying architecture. The goal of middleware is to facilitate the development of applications by providing higher-level abstractions for better programmability, performance, scalability, security, and a variety of essential features.

– Middleware 2012 web page

Page 47: Middeware2012 crowd

The Challenge

Some issues:
• Incentives
• Latency & Prediction
• Failure Modes
• Work Conditions
• Interface
• Task Structuring
• Task Routing
• …

Page 48: Middeware2012 crowd

Can you incentivize workers?

http://waxy.org/2008/11/the_faces_of_mechanical_turk/

Page 49: Middeware2012 crowd

Incentives


Page 50: Middeware2012 crowd

Can you trust the crowd?

“The Elephant population in Africa has tripled over the past six months.”[1]

Wikiality: Reality as decided on by majority rule.[2]

On Wikipedia, “any user can change any entry, and if enough users agree with them, it becomes true.”

[1] http://en.wikipedia.org/wiki/Cultural_impact_of_The_Colbert_Report
[2] http://www.urbandictionary.com/define.php?term=wikiality

Page 51: Middeware2012 crowd

Answer Quality Approaches

• Some General Techniques
  – Approval Rate / Demographic Restrictions
  – Qualification Test
  – Gold Sets / Honey Pots
  – Redundancy and Voting
  – Statistical Measures and Bias Reduction
  – Verification / Review

• Query-Specific Techniques
• Worker Relationship Management
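Two of the general techniques above, gold sets and redundancy with voting, are easy to make concrete. The sketch below drops workers who fail known-answer ("gold") questions and then takes a majority vote per task; the data layout and the 0.75 accuracy cutoff are illustrative assumptions.

```python
from collections import Counter, defaultdict

# Sketch of two quality-control techniques from the list above:
# gold sets (drop workers who miss known answers) and redundancy + majority vote.

def trusted_workers(responses, gold, min_accuracy=0.75):
    """responses: list of (worker, task, answer); gold: {task: correct answer}."""
    hits, total = defaultdict(int), defaultdict(int)
    for worker, task, answer in responses:
        if task in gold:
            total[worker] += 1
            hits[worker] += (answer == gold[task])
    return {w for w in total if hits[w] / total[w] >= min_accuracy}

def majority_vote(responses, workers):
    votes = defaultdict(list)
    for worker, task, answer in responses:
        if worker in workers:
            votes[task].append(answer)
    return {task: Counter(ans).most_common(1)[0][0] for task, ans in votes.items()}

responses = [
    ("w1", "gold1", "cat"), ("w2", "gold1", "dog"), ("w3", "gold1", "cat"),
    ("w1", "t7", "yes"),    ("w2", "t7", "no"),     ("w3", "t7", "yes"),
]
gold = {"gold1": "cat"}
ok = trusted_workers(responses, gold)          # w2 fails the gold question
print(majority_vote(responses, ok))            # -> {'gold1': 'cat', 't7': 'yes'}
```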

Page 52: Middeware2012 crowd

Can you organize the crowd?

Soylent, a prototype...

Find → Fix → Verify:

• Find: “Identify at least one area that can be shortened without changing the meaning of the paragraph.”
  (Independent agreement to identify patches)
• Fix: “Edit the highlighted section to shorten its length without changing the meaning of the paragraph.”
• Verify: “Choose at least one rewrite that has style errors, and at least one rewrite that changes the meaning of the sentence.”
  (Randomize order of suggestions)

[Bernstein et al: Soylent: A Word Processor with a Crowd Inside. UIST, 2010]
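A bare-bones sketch of that control flow is below: independent Find answers are kept only where workers agree, several Fix rewrites are collected per patch, and Verify votes filter them. ask_workers is a made-up stub and the agreement rule is simplified relative to the Soylent paper.

```python
from collections import Counter

# Bare-bones Find-Fix-Verify sketch (after Bernstein et al., Soylent, UIST 2010).
# ask_workers() is a stand-in for posting HITs; the agreement rule is simplified.

def ask_workers(prompt, n):
    print(f"[{n} HITs] {prompt}")
    return SIMULATED.pop(0)

def find_fix_verify(paragraph):
    # Find: workers independently mark a span to shorten; keep spans with agreement.
    spans = ask_workers("Identify at least one area that can be shortened.", n=5)
    patches = [s for s, votes in Counter(spans).items() if votes >= 2]

    results = {}
    for patch in patches:
        # Fix: several workers rewrite the agreed-upon patch.
        rewrites = ask_workers(f"Edit '{patch}' to shorten it.", n=3)
        # Verify: other workers flag rewrites with errors; keep a surviving one.
        rejected = ask_workers("Flag rewrites with style errors or changed meaning.", n=3)
        good = [r for r in rewrites if r not in rejected]
        results[patch] = good[0] if good else patch
    return results

SIMULATED = [
    ["in order to", "in order to", "at this point in time"],   # Find answers
    ["to", "so as to", "in order that"],                        # Fix rewrites
    ["in order that"],                                          # Verify rejections
]
print(find_fix_verify("We met in order to discuss the plan."))
```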

Page 53: Middeware2012 crowd

Can You Predict the Crowd?

(Figure: worker behavior patterns, “streakers” and “list walking”.)

Page 54: Middeware2012 crowd

Can you build a low-latency crowd?


from: M S Bernstein, J Brandt, R C Miller, D R Karger, “Crowds in Two Seconds: Enabling Realtime Crowdsourced Applications”, UIST 2011.

Page 55: Middeware2012 crowd

Can you help the crowd?

Page 56: Middeware2012 crowd

For More Information

Crowdsourcing Tutorials:
• P. Ipeirotis, Managing Crowdsourced Human Computation, WWW ’11, March 2011.
• O. Alonso, M. Lease, Crowdsourcing for Information Retrieval: Principles, Methods, and Applications, SIGIR, July 2011.
• A. Doan, M. Franklin, D. Kossmann, T. Kraska, Crowdsourcing Applications and Platforms: A Data Management Perspective, VLDB 2011.

AMPLab: amplab.cs.berkeley.edu
• Papers
• Project Descriptions and Pages
• News updates and Blogs