Query Optimization over Crowdsourced Data
Hyunjung Park, Jennifer Widom
Stanford University

Jul 09, 2015

Hyunjung Park

Presented at VLDB 2013.
Transcript
Page 1: Query Optimization over Crowdsourced Data

Query Optimization over Crowdsourced Data

Hyunjung Park, Jennifer Widom
Stanford University

Page 2: Query Optimization over Crowdsourced Data

Deco: Declarative Crowdsourcing

Questions posed to the crowd:
•  Give me a Spanish-speaking country.
•  Give me a country.
•  What language do they speak in country X?
•  What is the capital of country X?

8/27/2013 Hyunjung Park 2

“Find the capitals of eight Spanish-speaking countries”

[Figure: Deco System (DBMS) with the table below]

country    language    capital
Italy      Italian     Rome
Spain      Spanish     Madrid
…          …           …

Page 3: Query Optimization over Crowdsourced Data

Deco Query Optimization

•  Crowd incurs monetary cost
•  Some query plans are much cheaper than others
•  Cost estimation is complicated by:
–  Previously collected data
–  Unknown database state
–  Inconsistency of human answers

Page 4: Query Optimization over Crowdsourced Data

Outline

•  Motivating example
•  Deco data model and queries
•  Cost and cardinality estimation
•  Experimental results

Everything implemented in full prototype

Page 5: Query Optimization over Crowdsourced Data

Motivating Example: Plan 1

“Find the capitals of eight Spanish-speaking countries”

[Figure: Plan 1 flowchart. “Give me a country.” → unseen? (T/F) → “What language do they speak in country X?” → Spanish? (T/F) → “What is the capital of country X?”; the loop is labeled 8x.]

Page 6: Query Optimization over Crowdsourced Data

Motivating Example: Plan 2

“Find the capitals of eight Spanish-speaking countries”

[Figure: Plan 2 flowchart. “Give me a Spanish-speaking country.” → unseen? (T/F) → “What language do they speak in country X?” → Spanish? (T/F) → “What is the capital of country X?”; the loop is labeled 8x.]
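The two plans differ only in the first question asked. A minimal sketch of that loop follows, assuming the flow suggested by the diagrams (ask for a country, skip it if already seen, check the language, then ask for the capital). The `FACTS` table is a hypothetical stand-in for crowd workers, each question is answered once, and majority voting is left out for brevity.

```python
from collections import Counter

# Hypothetical stand-in for the crowd: country -> (language, capital).
FACTS = {
    "Italy": ("Italian", "Rome"),
    "Spain": ("Spanish", "Madrid"),
    "France": ("French", "Paris"),
    "Mexico": ("Spanish", "Mexico City"),
    "Japan": ("Japanese", "Tokyo"),
    "Peru": ("Spanish", "Lima"),
    "Germany": ("German", "Berlin"),
    "Chile": ("Spanish", "Santiago"),
    "Cuba": ("Spanish", "Havana"),
    "Colombia": ("Spanish", "Bogotá"),
    "Argentina": ("Spanish", "Buenos Aires"),
    "Uruguay": ("Spanish", "Montevideo"),
}


def run_plan(ask_country, wanted=8, language="Spanish"):
    """Loop until `wanted` results exist; count the questions of each kind."""
    asked = Counter()
    seen, results = set(), {}
    while len(results) < wanted:
        asked["country"] += 1
        country = ask_country()
        if country in seen:            # "unseen" check fails: ask again
            continue
        seen.add(country)
        asked["language"] += 1
        if FACTS[country][0] != language:   # "Spanish" check fails
            continue
        asked["capital"] += 1
        results[country] = FACTS[country][1]
    return results, asked


def plan1():
    # Plan 1 starts from "Give me a country."
    any_country = iter(FACTS)
    return run_plan(lambda: next(any_country))


def plan2():
    # Plan 2 starts from "Give me a Spanish-speaking country."
    spanish_only = iter(c for c, (lang, _) in FACTS.items() if lang == "Spanish")
    return run_plan(lambda: next(spanish_only))


if __name__ == "__main__":
    for name, plan in (("Plan 1", plan1), ("Plan 2", plan2)):
        results, asked = plan()
        print(name, dict(asked), "total questions:", sum(asked.values()))
```

With this toy data both plans return the same eight countries, but Plan 1 asks more questions because it also pays for countries that fail the language check.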

Page 7: Query Optimization over Crowdsourced Data

Preview of Experimental Results

[Figure: bar chart of actual costs ($) spent on Mechanical Turk for Plan 1 and Plan 2, y-axis from 0 to 15, broken down by question: “Give me a country.”, “Give me a Spanish-speaking country.”, “What language do they speak in country X?”, “What is the capital of country X?”]

Page 8: Query Optimization over Crowdsourced Data

Outline

•  Motivating example
•  Deco data model and queries
•  Cost and cardinality estimation
•  Experimental results

Page 9: Query Optimization over Crowdsourced Data

Deco: Data Model (1/2)

•  Conceptual Relation: visible to end-users
   Country (country, language, capital)
•  Resolution Rules: cleanse raw data using UDFs
–  country: dupElim
–  language: majority(3)
–  capital: majority(3)
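To make the resolution rules concrete, here is a minimal sketch of what the two UDFs could look like. The exact semantics are assumptions, not taken from the slides: `dup_elim` is assumed to drop repeated values ignoring case and surrounding whitespace, and `majority` is assumed to return a value once it holds a strict majority among at most `n` raw answers.

```python
from collections import Counter


def dup_elim(raw_values):
    """Keep the first occurrence of each value, preserving order.

    Assumption: values are compared case-insensitively after stripping
    surrounding whitespace.
    """
    seen = set()
    resolved = []
    for value in raw_values:
        key = value.strip().lower()
        if key not in seen:
            seen.add(key)
            resolved.append(value.strip())
    return resolved


def majority(raw_values, n=3):
    """Return the value with a strict majority among the first n answers.

    Returns None while no value has reached the majority threshold,
    which signals that more answers are needed.
    """
    needed = n // 2 + 1
    counts = Counter(raw_values[:n])
    if not counts:
        return None
    value, count = counts.most_common(1)[0]
    return value if count >= needed else None


if __name__ == "__main__":
    print(dup_elim(["Spain", "spain ", "Peru"]))
    print(majority(["Madrid", "Barcelona", "Madrid"]))
```

Under this reading, two agreeing answers are enough to resolve a value and a third answer is needed only when the first two disagree.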

Page 10: Query Optimization over Crowdsourced Data

Deco: Data Model (2/2)

•  Fetch Rules: “access methods” for the crowd
–  language => country: “Give me a {language}-speaking country.”
–  Ø => country: “Give me a country.”
–  country => language: “What language do they speak in {country}?”
–  country => capital: “What is the capital of {country}?”

Prices shown on the slide: [$0.05] [$0.01] [$0.02] [$0.03]
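One way to picture fetch rules is as question templates keyed by the attributes they need and the attribute they produce. The sketch below is illustrative only: the `FetchRule` class and the `applicable` helper are hypothetical names, not part of Deco, and prices are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FetchRule:
    given: tuple    # attributes handed to the worker; () stands for Ø
    wanted: str     # attribute the worker is asked to supply
    template: str   # question text with {attribute} placeholders

    def question(self, **known):
        """Fill the template with the known attribute values."""
        missing = [name for name in self.given if name not in known]
        if missing:
            raise ValueError(f"missing values for: {missing}")
        return self.template.format(**known)


FETCH_RULES = [
    FetchRule(("language",), "country", "Give me a {language}-speaking country."),
    FetchRule((), "country", "Give me a country."),
    FetchRule(("country",), "language", "What language do they speak in {country}?"),
    FetchRule(("country",), "capital", "What is the capital of {country}?"),
]


def applicable(rules, known, wanted):
    """Rules that can produce `wanted` from the attributes already known."""
    return [r for r in rules if r.wanted == wanted and set(r.given) <= set(known)]


if __name__ == "__main__":
    known = {"language": "Spanish"}
    for rule in applicable(FETCH_RULES, known, "country"):
        print(rule.question(**known))
```

When the language is already known, two rules can supply a country, which is the choice that separates Plan 1 from Plan 2.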

Page 11: Query Optimization over Crowdsourced Data

Deco: Queries

•  Deco query: SQL query over conceptual relations
   SELECT country, capital
   FROM Country
   WHERE language='Spanish'
   MINTUPLES 8
•  Query processor: access the crowd as needed to produce query result while:
1.  Minimizing monetary cost (query optimizer)
2.  Reducing latency (query execution engine)

Page 12: Query Optimization over Crowdsourced Data

Query Optimization

•  Find the best query plan in terms of estimated monetary cost
•  As in a traditional query optimizer:
1.  Cost and cardinality estimation
2.  Search space
3.  Plan enumeration algorithm

Page 13: Query Optimization over Crowdsourced Data

Cost Estimation

•  Total monetary cost = ∑_{Fetch F} (F.price × F.cardinality)
–  Existing data is “free”
•  Definition of Cardinality in Deco
–  Total number of expected output tuples from operator until query execution terminates
•  Cardinality estimation
–  Final database state needs to be estimated simultaneously
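The cost formula is a plain sum over the Fetch operators of a plan. A minimal sketch is shown below. The function name and the example numbers are made up for illustration, and each cardinality is taken to count only tuples newly fetched from the crowd, since existing data is free.

```python
from fractions import Fraction


def total_monetary_cost(fetch_operators):
    """Sum price x cardinality over the Fetch operators of a plan.

    fetch_operators: iterable of (price, cardinality) pairs, where price is
    a decimal string in dollars and cardinality is the estimated number of
    tuples the operator fetches from the crowd.
    """
    return sum(Fraction(price) * cardinality for price, cardinality in fetch_operators)


if __name__ == "__main__":
    # Illustrative numbers only: 40 fetches at $0.05 and 10 fetches at $0.02.
    cost = total_monetary_cost([("0.05", 40), ("0.02", 10)])
    print(f"${float(cost):.2f}")
```

Exact fractions are used instead of floats so that small per-question prices add up without rounding noise.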

Page 14: Query Optimization over Crowdsourced Data

Cardinality Estimation: Setting

•  $0.05 for all fetch rules

•  No existing data

•  Selectivity factors
–  language='Spanish': 0.1
–  dupElim: 0.8
–  majority(3): 0.4 (=1/2.5)

Page 15: Query Optimization over Crowdsourced Data

Cardinality Estimation: Plan 1

SELECT country, capital
FROM Country
WHERE language='Spanish'
MINTUPLES 8

[Figure: Plan 1 operator tree, operators numbered 1 to 14: MinTuples[8], Project[co,ca], DLOJoin[co] (twice), Resolve[dupeli], Resolve[maj3] (twice), Filter[la='Spanish'], Scan[CtryA], Fetch[Ø→co], Scan[CtryD1], Fetch[co→la], Scan[CtryD2], Fetch[co→ca]]

Estimated fetch cardinalities:
–  Ø => country: 100
–  country => language: 200
–  country => capital: 20

Cost estimation: $0.05×(100+200+20) = $16.00
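The Plan 1 cardinalities can be reproduced by working backward from MINTUPLES 8 with the selectivity factors from the setting slide. The sketch below is one reading of that arithmetic, not the paper's estimation algorithm, and the function and constant names are hypothetical.

```python
from fractions import Fraction

PRICE = Fraction("0.05")           # price of every fetch rule in this setting
SEL_SPANISH = Fraction("0.1")      # selectivity of language='Spanish'
SEL_DUPELIM = Fraction("0.8")      # distinct countries per fetched country
SEL_MAJORITY3 = Fraction("0.4")    # resolved values per raw answer (1/2.5)


def plan1_estimate(min_tuples=8):
    """Work backward from the required output to each Fetch cardinality."""
    # Each of the 8 result tuples needs one resolved capital.
    capital_answers = min_tuples / SEL_MAJORITY3
    # Only 1 in 10 countries passes the filter, so 80 countries need a language.
    countries_needed = min_tuples / SEL_SPANISH
    language_answers = countries_needed / SEL_MAJORITY3
    # dupElim keeps 80% of fetched countries, so more must be fetched.
    country_answers = countries_needed / SEL_DUPELIM
    cardinalities = {
        "Ø => country": country_answers,
        "country => language": language_answers,
        "country => capital": capital_answers,
    }
    return cardinalities, PRICE * sum(cardinalities.values())


if __name__ == "__main__":
    cards, cost = plan1_estimate()
    for rule, card in cards.items():
        print(f"{rule}: {int(card)}")
    print(f"estimated cost: ${float(cost):.2f}")
```

Most of the cost comes from resolving the language of countries that the filter later discards.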

Page 16: Query Optimization over Crowdsourced Data

Cardinality Estimation: Plan 2

SELECT country, capital
FROM Country
WHERE language='Spanish'
MINTUPLES 8

[Figure: Plan 2 operator tree, operators numbered 1 to 14 (8a in place of 8): MinTuples[8], Project[co,ca], DLOJoin[co] (twice), Resolve[dupeli], Resolve[maj3] (twice), Filter[la='Spanish'], Scan[CtryA], Fetch[la→co], Scan[CtryD1], Fetch[co→la], Scan[CtryD2], Fetch[co→ca]]

Estimated fetch cardinalities:
–  language => country: 10
–  country => language: 20
–  country => capital: 20

Cost estimation: $0.05×(10+20+20) = $2.50
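The Plan 2 numbers follow from the same backward reasoning, with one difference. This sketch assumes that every country returned by the language => country fetch passes the language='Spanish' filter, which is what makes the slide's cardinalities of 10, 20 and 20 come out. Names are hypothetical, and the Plan 1 figure of $16.00 is copied from the previous slide for comparison.

```python
from fractions import Fraction

PRICE = Fraction("0.05")           # price of every fetch rule in this setting
SEL_DUPELIM = Fraction("0.8")      # distinct countries per fetched country
SEL_MAJORITY3 = Fraction("0.4")    # resolved values per raw answer (1/2.5)
PLAN1_COST = Fraction("16.00")     # Plan 1 estimate, as stated on its slide


def plan2_estimate(min_tuples=8):
    """Estimate Fetch cardinalities when countries are fetched by language."""
    # Assumption: fetched countries are Spanish-speaking, so the filter
    # discards none and only 8 distinct countries are needed.
    country_answers = min_tuples / SEL_DUPELIM
    language_answers = min_tuples / SEL_MAJORITY3
    capital_answers = min_tuples / SEL_MAJORITY3
    cardinalities = {
        "language => country": country_answers,
        "country => language": language_answers,
        "country => capital": capital_answers,
    }
    return cardinalities, PRICE * sum(cardinalities.values())


if __name__ == "__main__":
    cards, cost = plan2_estimate()
    for rule, card in cards.items():
        print(f"{rule}: {int(card)}")
    print(f"estimated cost: ${float(cost):.2f}")
    print(f"Plan 1 / Plan 2 cost ratio: {float(PLAN1_COST / cost):.1f}")
```

Under these estimates Plan 1 costs 6.4 times as much as Plan 2.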

Page 17: Query Optimization over Crowdsourced Data

Experimental Results

[Figure: two bar charts of actual cost ($), broken down by fetch rule (Ø => country, language => country, country => language, country => capital). Plan 1: y-axis from 0 to 15. Plan 2: y-axis from 0 to 3.]

Page 18: Query Optimization over Crowdsourced Data

Experimental Results

[Figure: two bar charts comparing actual and estimated cost ($), broken down by fetch rule (Ø => country, language => country, country => language, country => capital). Plan 1: y-axis from 0 to 15. Plan 2: y-axis from 0 to 3.]

Page 19: Query Optimization over Crowdsourced Data

Related Work

•  Declarative approach for crowdsourcing
–  Arnold, CrowdDB, CrowdSearcher, Jabberwocky, Qurk, ...
•  Crowd-powered algorithms/operations
–  Filter, sort, join, max, entity resolution, …
•  Also:
–  Traditional query optimization
–  Heterogeneous or federated database systems

Page 20: Query Optimization over Crowdsourced Data

Summary

•  Cost estimation in Deco
–  Distinguish between existing data vs. new data
–  Estimate cardinality and final database state simultaneously
•  In the paper:
–  Full description of cost estimation and plan enumeration algorithms
–  More experimental results

Page 21: Query Optimization over Crowdsourced Data

Thank you!