Speeding Up Data Science: From a Data Management Perspective JIANNAN WANG SIMON FRASER UNIVERSITY Database Group @ UBC July 24, 2017

SpeedingUpDataScience - cs.sfu.cajnwang/ppt/SFU-DSL-UBC.pdf · Keyword-Search Query # $ # % # & # ’ Enumerate each record in Dand then generate a query to cover it Limitation Cover

Jun 05, 2018


Speeding Up Data Science: From a Data Management Perspective

JIANNAN WANG
SIMON FRASER UNIVERSITY

Database Group @ UBC
July 24, 2017

Our Lab’s Mission

Speeding Up Data Science

Computer Science vs. Data Science

What | When | Who | Goal
Computer Science | 1950- | Software Engineer | Write software to make computers work

Plan → Design → Develop → Test → Deploy → Maintain

What | When | Who | Goal
Data Science | 2010- | Data Scientist | Extract insights from data to answer questions

Collect → Clean → Integrate → Analyze → Visualize → Communicate

Lab Members

Collect → Clean → Integrate → Analyze → Visualize → Communicate

Today’s Talk

Collect → Clean → Integrate → Analyze → Visualize → Communicate

DeepER

AQP++

Where is the bottleneck?

Data scientists spend 60% of their time on cleaning and organizing data.

(Source: Cloudera)

DeepER’s Key Idea

Leveraging the Deep Web to Speed Up Data Cleaning and Data Enrichment

Deep Web
Hidden Database

Invaluable External Resource
◦ Big: Consisting of a substantial number of entities
◦ Rich: Having rich information about each entity
◦ High-quality: Being trustworthy and up-to-date

A real-world example

Customer Location Data

User ID | Location | Zip Code | Frequency
U12345 | Lotus of Siam | 891004 | 20 visits

1. Data Enrichment

User ID | Location | Zip Code | Frequency | Category
U12345 | Lotus of Siam | 891004 | 20 visits | Thai, Wine Bars

2. Data Cleaning

User ID | Location | Zip Code | Frequency
U12345 | Lotus of Siam | 89104 | 20 visits

Entity Resolution

Local Database (𝑫)
Restaurant
Thai Noodle House
Thai Pot
Thai House
BBQ Noodle House

Hidden Database (𝑯)
Restaurant
Thai Noodle House
Thai Pot
Thai House
BBQ Noodle House
Monta Ramen
Steak House
Yard House
Ramen Bar
Ramen House

Deep Entity Resolution

Local Database (𝑫)
Restaurant
Thai Noodle House
Thai Pot
Thai House
BBQ Noodle House

Keyword-Search Interface

Hidden Database (𝑯)
Restaurant
Thai Noodle House
Thai Pot
Thai House
BBQ Noodle House
Monta Ramen
Steak House
Yard House
Ramen Bar
Ramen House

Keyword Search
1. Conjunctive Query
2. Top-k Constraint
3. Deterministic Query Processing

Local and Hidden DBs
1. D has no duplicate record
2. H has no duplicate record
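
The three keyword-search assumptions can be illustrated with a toy interface. This is a minimal sketch, not the implementation from the talk: the class name `HiddenDatabase`, its `search` method, the value k = 2, and the use of list order as a stand-in for the unknown ranking function are all assumptions made for illustration.

```python
from typing import List


class HiddenDatabase:
    """Toy stand-in for a hidden database behind a keyword-search interface."""

    def __init__(self, records: List[str], k: int) -> None:
        # Assumption: list order plays the role of the unknown ranking function.
        self.records = list(records)
        self.k = k

    def search(self, query: str) -> List[str]:
        keywords = query.lower().split()
        # 1. Conjunctive query: a record matches only if it contains every keyword.
        matches = [r for r in self.records
                   if all(w in r.lower().split() for w in keywords)]
        # 2. Top-k constraint: at most k matches are returned.
        # 3. Deterministic processing: no randomness, so repeated calls agree.
        return matches[: self.k]


H = HiddenDatabase(
    ["Thai Noodle House", "Thai Pot", "Thai House", "BBQ Noodle House",
     "Monta Ramen", "Steak House", "Yard House", "Ramen Bar", "Ramen House"],
    k=2,
)
noodle_house = H.search("Noodle House")  # two records match, both are returned
house = H.search("House")                # six records match, only two are returned
```

With k = 2, the query “House” is truncated by the top-k constraint while “Noodle House” is not, which is why the constraint matters when choosing queries.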

New Challenges

Limited Query Budget
◦ Yelp API is restricted to 25,000 free requests per day
◦ Google Maps API only allows 2,500 free requests per day

Top-k Constraint
◦ Return top-k results based on an unknown ranking function

NaiveCrawl

Enumerate each record in D and then generate a query to cover it

Restaurant
Thai Noodle House
Thai Pot
Thai House
BBQ Noodle House

Keyword Queries
𝑞_1 = “Thai Noodle House”
𝑞_2 = “Thai Pot”
𝑞_3 = “Thai House”
𝑞_4 = “BBQ Noodle House”

Limitation
◦ Cover one record at a time

FullCrawl

1. Try to crawl the entire hidden database (𝐻_crawled)
2. Perform entity resolution between 𝐷 and 𝐻_crawled

Limitation
◦ Not aware of the existence of a local database

SmartCrawl
Insight 1. Query Sharing

Cover multiple records at a time

Restaurant
Thai Noodle House
Thai Pot
Thai House
BBQ Noodle House

Keyword Queries
𝑞_1 = “Thai Noodle House”
𝑞_2 = “Thai Pot”
𝑞_3 = “Thai House”
𝑞_4 = “BBQ Noodle House”
𝑞_5 = “Noodle House”
𝑞_6 = “House”
𝑞_7 = “Thai”
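
Query sharing can be checked directly on the four local records. The following is a minimal sketch that counts which records in D each of the seven queries covers under conjunctive keyword matching; the helper names `covers` and `covered_records` are hypothetical, and the top-k constraint and the hidden database are deliberately ignored here.

```python
from typing import Dict, List

LOCAL_DB = ["Thai Noodle House", "Thai Pot", "Thai House", "BBQ Noodle House"]

QUERIES = ["Thai Noodle House", "Thai Pot", "Thai House", "BBQ Noodle House",
           "Noodle House", "House", "Thai"]


def covers(query: str, record: str) -> bool:
    """Conjunctive match: every query keyword appears in the record."""
    return set(query.lower().split()) <= set(record.lower().split())


def covered_records(query: str, records: List[str]) -> List[str]:
    """Return the records covered by the query, in their original order."""
    return [r for r in records if covers(query, r)]


coverage: Dict[str, List[str]] = {q: covered_records(q, LOCAL_DB) for q in QUERIES}
for q, recs in coverage.items():
    print(f"{q!r} covers {len(recs)} record(s): {recs}")
```

The shorter queries “House” and “Thai” each cover three local records, whereas a full-name query such as “Thai Pot” covers one.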

SmartCrawl
Insight 2. Local-database-aware crawling

Hidden Database (𝑯)
Restaurant
Thai Noodle House
Thai Pot
Thai House
BBQ Noodle House
Monta Ramen
Steak House
Yard House
Ramen Bar
Ramen House

𝑞_5 = “Noodle House”

𝑞_6 = “House”

Better

Local Database (𝑫)
Restaurant
Thai Noodle House
Thai Pot
Thai House
BBQ Noodle House

SmartCrawl Framework

1. Generate a query pool 𝑄

2. Select at most 𝑏 queries from 𝑄 such that |𝐻_crawled ∩ 𝐷| is maximized

3. Perform entity resolution between 𝐻_crawled and 𝐷

Query Pool Generation

Basic Idea
◦ Only need to consider the queries in D

Restaurant
Thai Noodle House
Thai Pot
Thai House
BBQ Noodle House

Keyword Queries
𝑞_1 = “Thai Noodle House”
𝑞_2 = “Thai Pot”
𝑞_3 = “Thai House”
𝑞_4 = “BBQ Noodle House”
𝑞_5 = “Noodle House”
𝑞_6 = “House”
𝑞_7 = “Thai”

𝑞 = “Sushi” (matches no record in D, so it need not be considered)
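
One way to read "only consider the queries in D" is to build candidate queries purely from keywords that occur in local records. The sketch below enumerates every keyword combination of every record; this enumeration strategy and the function name `query_pool` are assumptions for illustration, and the talk's actual pool (seven queries in the example) may be generated or pruned differently.

```python
from itertools import combinations
from typing import List, Set


def query_pool(records: List[str]) -> Set[str]:
    """Candidate queries: all non-empty keyword combinations of each local record."""
    pool: Set[str] = set()
    for record in records:
        words = record.split()
        for size in range(1, len(words) + 1):
            # combinations() keeps the keywords in their original order.
            for combo in combinations(words, size):
                pool.add(" ".join(combo))
    return pool


LOCAL_DB = ["Thai Noodle House", "Thai Pot", "Thai House", "BBQ Noodle House"]
pool = query_pool(LOCAL_DB)
print(sorted(pool))
```

A query such as “Sushi” never enters the pool because no local record contains that keyword, while all seven example queries do.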

SmartCrawl Framework

1. Generate a query pool 𝑄

2. Select at most 𝑏 queries from 𝑄 such that |𝐻_crawled ∩ 𝐷| is maximized

3. Perform entity resolution between 𝐻_crawled and 𝐷

Query Selection

NP-Hard Problem
◦ Can be proved by a reduction from the maximum coverage problem

Greedy Algorithm
◦ Suffers from a chicken-and-egg problem
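
The standard greedy heuristic for maximum coverage gives a feel for the selection step. This is a minimal sketch under a strong assumption: the set of local records each query would cover is passed in as a known input. Presumably the chicken-and-egg problem is that this benefit is only known after a query has been issued, so in practice the sets would have to be estimated. The function name `greedy_select` and the example cover sets are illustrative.

```python
from typing import Dict, List, Set, Tuple


def greedy_select(cover_sets: Dict[str, Set[str]],
                  budget: int) -> Tuple[List[str], Set[str]]:
    """Pick up to `budget` queries, each time taking the largest marginal gain."""
    selected: List[str] = []
    covered: Set[str] = set()
    remaining = dict(cover_sets)
    for _ in range(budget):
        if not remaining:
            break
        # Marginal gain = local records not yet covered by earlier picks.
        best = max(remaining, key=lambda q: len(remaining[q] - covered))
        if not (remaining[best] - covered):
            break  # no remaining query adds new coverage
        selected.append(best)
        covered |= remaining.pop(best)
    return selected, covered


# Assumed cover sets over the local database D (top-k truncation ignored).
cover_sets = {
    "House": {"Thai Noodle House", "Thai House", "BBQ Noodle House"},
    "Thai": {"Thai Noodle House", "Thai Pot", "Thai House"},
    "Noodle House": {"Thai Noodle House", "BBQ Noodle House"},
    "Thai Pot": {"Thai Pot"},
}
selected, covered = greedy_select(cover_sets, budget=2)
print(selected, len(covered))
```

With a budget of two queries, the greedy choice covers all four local records, compared with two records for two full-name queries.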

Sampling and Estimation

Deep Web Sampling [Zhang et al. SIGMOD 2011]
◦ 𝐻_s is a random sample of 𝐻
◦ 𝜃 is the sampling ratio

Two classes of queries
◦ Solid Query
◦ Overflowing Query

IF |𝑞(𝐻)| ≤ 𝑘 THEN
  𝑞 is a solid query
ELSE
  𝑞 is an overflowing query
END

|𝑞(𝐻)| is estimated as |𝑞(𝐻_s)| / 𝜃
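
The classification rule can be written as a few lines of code. This is a minimal sketch assuming the result size |𝑞(𝐻)| is replaced by its sample-based estimate, the number of sample matches divided by the sampling ratio 𝜃; the function names are hypothetical.

```python
def estimate_result_size(sample_matches: int, theta: float) -> float:
    """Scale the number of matches in the sample up by the sampling ratio."""
    return sample_matches / theta


def classify(sample_matches: int, theta: float, k: int) -> str:
    """Solid if the estimated result size fits within the top-k limit."""
    if estimate_result_size(sample_matches, theta) <= k:
        return "solid"
    return "overflowing"


# Illustrative values: theta = 0.2% and k = 100.
no_match = classify(sample_matches=0, theta=0.002, k=100)
one_match = classify(sample_matches=1, theta=0.002, k=100)
print(no_match, one_match)
```

With such a small sampling ratio, a single match in the sample already scales to an estimated 500 results, which exceeds k = 100.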

Solid Query

How to estimate |𝑞(𝐷) ∩ 𝑞(𝐻)|?

Key Observation: |𝑞(𝐷) − 𝑞(𝐻)| is small

Biased Estimator: |𝑞(𝐷)|

Unbiased Estimator: |𝑞(𝐷) ∩ 𝑞(𝐻_s)| / 𝜃
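
Both solid-query estimators are one-liners over sets of matching records. In this minimal sketch, `q_D` is the set of local records matching the query and `q_Hs` is the set of sampled hidden records matching it; these names, the example records, and the value 𝜃 = 0.5 are assumptions chosen only to keep the arithmetic easy to follow. The biased estimator relies on the key observation: if few records of 𝑞(𝐷) fall outside 𝑞(𝐻), then |𝑞(𝐷)| is close to the size of the intersection.

```python
from typing import Set


def biased_estimate(q_D: Set[str]) -> int:
    """Assume almost every local match is also in the hidden database."""
    return len(q_D)


def unbiased_estimate(q_D: Set[str], q_Hs: Set[str], theta: float) -> float:
    """Count local matches seen in the sample, then scale by 1 / theta."""
    return len(q_D & q_Hs) / theta


q_D = {"Thai Noodle House", "BBQ Noodle House"}
q_Hs = {"Thai Noodle House"}  # matches found in the sample of H
theta = 0.5                   # illustrative sampling ratio

print(biased_estimate(q_D), unbiased_estimate(q_D, q_Hs, theta))
```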

Overflowing Query

How to estimate |𝑞(𝐷) ∩ 𝑞(𝐻)_𝑘|?

Basic Idea

How to estimate (𝑘 / |𝑞(𝐻)|) × |𝑞(𝐷) ∩ 𝑞(𝐻)|?

An unknown ranking function (High → Low)

How many black balls in 5 draws? (4 / 10) × 5 = 2
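
The ball analogy can be verified by brute force. This is a minimal sketch that assumes the top-k results behave like a uniformly random draw of k records from the matching records, which is an assumption about the unknown ranking function and not something the slides state. Here 10 balls stand for |𝑞(𝐻)|, 4 black balls for |𝑞(𝐷) ∩ 𝑞(𝐻)|, and 5 draws for k; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def overflowing_estimate(k: int, q_H_size: int, overlap_size: int) -> float:
    """Expected overlap within the top-k: (k / |q(H)|) * |q(D) ∩ q(H)|."""
    return k / q_H_size * overlap_size


def expected_black(total: int, black: int, draws: int) -> Fraction:
    """Exact expectation, averaging over every possible set of `draws` balls."""
    is_black = [1] * black + [0] * (total - black)
    subsets = list(combinations(range(total), draws))
    total_black = sum(sum(is_black[i] for i in subset) for subset in subsets)
    return Fraction(total_black, len(subsets))


estimate = overflowing_estimate(k=5, q_H_size=10, overlap_size=4)
exact = expected_black(total=10, black=4, draws=5)
print(estimate, exact)
```

The closed-form estimate and the exhaustive average over all 252 possible draws agree.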

A Summary of Estimators

Other Contributions

1. Theoretical Analysis

2. Efficient Implementations

3. Inadequate Sample Size

4. Fuzzy Matching

Experimental Settings

Simulation
◦ Hidden Database: DBLP
◦ Local Database: Database Researchers’ publications

Real-world
◦ Hidden Database: Yelp
◦ Local Database: 3000 restaurants in AZ
◦ Ground-Truth: Manually Labeled

DBLP

|𝐷| = 10,000, |𝐻| = 100,000, 𝑘 = 100, 𝜃 = 0.2%

1. SmartCrawl performed very well with a small sampling ratio

2. SmartCrawl outperformed straightforward solutions

Yelp

|𝐷| = 3,000, |𝐻| ≈ 250,000, 𝑘 = 50, 𝜃 = 0.2%

1. SmartCrawl outperformed straightforward solutions

2. SmartCrawl was more robust to the fuzzy-matching situation than NaiveCrawl

DeepER Conclusion

We are the first to study the DeepER problem

SmartCrawl outperforms NaiveCrawl and FullCrawl by a factor of 2-7×

SmartCrawl is more robust to the fuzzy-matching situation than NaiveCrawl

Today’s Talk

Collect → Clean → Integrate → Analyze → Visualize → Communicate

DeepER

AQP++

Interactive Analytics

Two Separate Ideas

Approximate Query Processing (AQP)
◦ Trade answer quality for interactive response time

Aggregate Precomputation (AggPre)
◦ Trade preprocessing cost for interactive response time
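
The two trade-offs can be contrasted on a range SUM query. This is a minimal sketch, not the system from the talk: uniform sampling stands in for AQP and a prefix-sum array stands in for AggPre, and all function names and data values are illustrative assumptions.

```python
import random
from itertools import accumulate
from typing import List, Tuple


def draw_sample(data: List[int], rate: float, seed: int = 0) -> List[Tuple[int, int]]:
    """AQP preprocessing: keep each (position, value) pair with probability `rate`."""
    rng = random.Random(seed)
    return [(i, v) for i, v in enumerate(data) if rng.random() < rate]


def aqp_range_sum(sample: List[Tuple[int, int]], lo: int, hi: int, rate: float) -> float:
    """AQP: approximate SUM(data[lo:hi]) from the sample, scaled by 1 / rate."""
    return sum(v for i, v in sample if lo <= i < hi) / rate


def build_prefix_sums(data: List[int]) -> List[int]:
    """AggPre preprocessing: prefix[i] holds SUM(data[:i])."""
    return [0] + list(accumulate(data))


def aggpre_range_sum(prefix: List[int], lo: int, hi: int) -> int:
    """AggPre: exact SUM(data[lo:hi]) from two precomputed values."""
    return prefix[hi] - prefix[lo]


data = list(range(1, 101))
prefix = build_prefix_sums(data)
exact = aggpre_range_sum(prefix, 10, 60)
approx = aqp_range_sum(draw_sample(data, rate=0.2), 10, 60, rate=0.2)
print(exact, approx)
```

The sampled answer touches only a fraction of the data but carries error, while the precomputed answer is exact at the cost of building and storing the prefix sums up front.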

AQP++: Connecting AQP with AggPre

[Figure: AQP, AggPre, and AQP++ compared along three axes: Response Time, Preprocessing Cost, and Query Error]

Experimental Result

TPCD-Skew (10GB, z = 2, 0.3% sample)

On-going Projects

Students | Stages | Projects
Pei & Yongjun | Clean & Integrate | DeepER: Deep Entity Resolution
Jinglin Peng | Analyze & Visualize | AQP++: Connecting AQP with AggPre
Mathew & Mohamad | Clean & Analyze | Data Cleaning Advisor for ML
Changbo & Ruochen | Collect & Clean | Live Video Highlight Detection using Crowdsourced User Comments
Nathan Yan | Clean | Data Cleaning with Statistical Constraints
Young Woo | Analyze | ML Explanation and Debugging

Two Take-away Messages

Data scientists waste a lot of time on data processing

Database researchers play a central role in speeding up data science

Collect → Clean → Integrate → Analyze → Visualize → Communicate