Scalable Approximate Query Processing Florin Rusu
Feb 23, 2016

Page 1: Scalable Approximate Query Processing

Scalable Approximate Query Processing

Florin Rusu

Page 2: Scalable Approximate Query Processing

Data Explosion

• Data storage advancements
  – Price / capacity ($70 / 1 TB)
• Human generated
  – Web 2.0 & social networking
    • User data
  – Communication
    • Network & web logs (eBay – 50 TB / day)
    • Call Detail Records (CDRs)
• Scientific experiments
  – LHC (Large Hadron Collider)
  – SKA (Square Kilometer Array) – 1 EB (10^18 bytes) / day
  – Sensor networks

04/19/2010

Page 3: Scalable Approximate Query Processing

Large-Scale Data Analytics

• Traditional DB (OLTP)
  – Multi-user transaction processing
  – Optimized for specific workloads (views, indexes, …)
• Analytic processing (OLAP)
  – Data cubes
    • Aggregate at different hierarchical levels
    • Pre-defined aggregates, not flexible
  – Shared-nothing architectures (MPP)
    • Startups: Netezza, Greenplum, AsterData, Vertica, …
    • Parallel databases on clusters of computers
    • Storage layer (row store, column store, hybrid)
    • Compression

Page 4: Scalable Approximate Query Processing

Interactive Data Analysis & Exploration

• Ad-hoc queries
• Compute statistical aggregates over all data
• Example: web log analysis
  – Documents (URL, Content)
  – UserVisits (IP, URL, Date, Duration)
  – “How much time did users spend searching for cars during the period May – July 2009?”

SELECT SUM(UV.Duration)
FROM Documents D, UserVisits UV
WHERE D.URL = UV.DocURL AND
      D.Content contains ‘car’ AND
      UV.Date between [05-01-09, 07-31-09]

Page 5: Scalable Approximate Query Processing

Roadmap

• Database query execution
• System design & implementation
  – DataBaseOnline (DBO)
• Approximation methods (theoretical analysis & practical implementation)
  – Sampling
  – Sketches
  – Sketches over samples

Page 6: Scalable Approximate Query Processing

Query Execution

Documents (D)
URL  Content
A    car
B    car
C    car
D    phone
E    car
F    car
G    car
H    PC
I    car
J    car

UserVisits (UV)
IP  URL  Date      Duration
1   A    05-30-09  45
1   B    06-01-09  60
1   J    06-01-09  30
1   D    05-15-09  90
1   I    04-28-09  35
2   A    04-30-09  60
2   F    06-15-09  15
2   G    06-13-09  10
2   E    06-01-09  20
2   E    07-10-09  35
3   C    04-28-09  25
3   B    05-23-09  25
3   J    05-29-09  35
3   I    06-13-09  25
3   D    06-09-09  40
4   C    07-30-09  50
4   H    05-14-09  75
4   H    08-02-09  65
4   G    07-23-09  90
4   F    06-16-09  5

[Figure: query plan with a selection σ over UV and a selection σ over D feeding the join, and the aggregate Σ on top]

• Selection push-down
• Sort-Merge Join
• Aggregate

SELECT SUM(UV.Duration)
FROM Documents D, UserVisits UV
WHERE D.URL = UV.DocURL AND
      D.Content contains ‘car’ AND
      UV.Date between [05-01-09, 07-31-09]

Page 7: Scalable Approximate Query Processing

Selection

[Figure: Documents and UserVisits tables, query plan and query, as on Page 6]

• Storage manager
• One thread for each table scan
• Project out unused columns

Page 8: Scalable Approximate Query Processing

Selection

• Tuples are pipelined into the join

Output of σ over D (query plan and query as on Page 6)
URL: A, B, C, E, F, G, I, J

Output of σ over UV
URL  Duration
A    45
B    60
J    30
D    90
F    15
G    10
E    20
E    35
B    25
J    35
I    25
D    40
C    50
H    75
G    90
F    5

Page 9: Scalable Approximate Query Processing

Sort-Merge Join – Sort Phase

• Sort tuples on join attribute
• Write sorted runs to disk
• Buffer space: UV(8)

D, sorted on URL: A, B, C, E, F, G, I, J

UV, first 8 tuples as they arrive: (A, 45), (B, 60), (J, 30), (D, 90), (F, 15), (G, 10), (E, 20), (E, 35)
UV Run 1 (sorted): (A, 45), (B, 60), (D, 90), (E, 20), (E, 35), (F, 15), (G, 10), (J, 30)

UV, next 8 tuples as they arrive: (B, 25), (J, 35), (I, 25), (D, 40), (C, 50), (H, 75), (G, 90), (F, 5)
UV Run 2 (sorted): (B, 25), (C, 50), (D, 40), (F, 5), (G, 90), (H, 75), (I, 25), (J, 35)

Page 10: Scalable Approximate Query Processing

Sort-Merge Join – Merge Phase

[Animation frame; query plan and query as on Page 6]

UV Run 1, remaining: (D, 90), (E, 20), (E, 35), (F, 15), (G, 10), (J, 30)
UV Run 2, remaining: (C, 50), (D, 40), (F, 5), (G, 90), (H, 75), (I, 25), (J, 35)
D run, remaining: B, C, E, F, G, I, J
UV tuples currently being merged: (B, 25), (B, 60)
Already joined: UV tuple (A, 45) with D tuple A; Duration 45 is passed to the aggregate

Page 11: Scalable Approximate Query Processing

Sort-Merge Join – Merge Phase

[Animation frame; query plan and query as on Page 6]

UV Run 1, remaining: (F, 15), (G, 10), (J, 30)
UV Run 2, remaining: (G, 90), (H, 75), (I, 25), (J, 35)
D run, remaining: F, G, I, J
UV tuples currently being merged: (E, 20), (E, 35), (F, 5)
D tuple currently being merged: E
UV tuples without a matching D tuple: (D, 40), (D, 90)
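The sort and merge phases walked through on the preceding slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not DBO's implementation: the runs live in memory rather than on disk, and the helper names `make_runs` and `merge_join_sum` are hypothetical. The input tuples are the ones that leave the two selections on Page 8.

```python
import heapq

# Tuples leaving the two selections (values copied from the slides).
doc_urls = ["A", "B", "C", "E", "F", "G", "I", "J"]
visits = [("A", 45), ("B", 60), ("J", 30), ("D", 90), ("F", 15), ("G", 10),
          ("E", 20), ("E", 35), ("B", 25), ("J", 35), ("I", 25), ("D", 40),
          ("C", 50), ("H", 75), ("G", 90), ("F", 5)]


def make_runs(tuples, buffer_size):
    """Sort phase: sort each buffer-sized chunk into one run (kept in memory here)."""
    return [sorted(tuples[k:k + buffer_size])
            for k in range(0, len(tuples), buffer_size)]


def merge_join_sum(sorted_urls, sorted_visits):
    """Merge phase: advance the Documents side until it catches up with the
    current UserVisits key, and add the duration whenever the keys match.
    Assumes URLs are unique on the Documents side."""
    total = 0
    urls = iter(sorted_urls)
    current = next(urls, None)
    for url, duration in sorted_visits:
        while current is not None and current < url:
            current = next(urls, None)
        if current == url:
            total += duration
    return total


uv_runs = make_runs(visits, buffer_size=8)          # two runs of 8 tuples
total = merge_join_sum(sorted(doc_urls), heapq.merge(*uv_runs))
print(total)
```

`heapq.merge` lazily merges the already-sorted runs, which mirrors reading the runs back in key order; the sum is updated as each joined tuple is produced.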

Page 12: Scalable Approximate Query Processing

Aggregation

• Update the sum as tuples are produced

[Animation frame; query plan and query as on Page 6: the running sum starts at 0 and becomes 45 when the first join result (Duration 45) arrives]

Page 13: Scalable Approximate Query Processing

Final Result

Durations produced by the join: 45, 25, 60, 50, 20, 35, 15, 5, 10, 90, 25, 30, 35

SUM(Duration) = 445
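The final result can be cross-checked by running the example query against the two example tables. The sketch below uses Python's built-in `sqlite3` module and makes a few assumptions that are not in the slides: the slide's `contains ‘car’` predicate is written as `LIKE '%car%'`, dates are stored as ISO strings so that `BETWEEN` compares correctly, and the UserVisits columns are named `DocURL` and `VisitDate`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Documents (URL TEXT PRIMARY KEY, Content TEXT)")
conn.execute("CREATE TABLE UserVisits "
             "(IP INTEGER, DocURL TEXT, VisitDate TEXT, Duration INTEGER)")

documents = [("A", "car"), ("B", "car"), ("C", "car"), ("D", "phone"),
             ("E", "car"), ("F", "car"), ("G", "car"), ("H", "PC"),
             ("I", "car"), ("J", "car")]
user_visits = [
    (1, "A", "2009-05-30", 45), (1, "B", "2009-06-01", 60),
    (1, "J", "2009-06-01", 30), (1, "D", "2009-05-15", 90),
    (1, "I", "2009-04-28", 35), (2, "A", "2009-04-30", 60),
    (2, "F", "2009-06-15", 15), (2, "G", "2009-06-13", 10),
    (2, "E", "2009-06-01", 20), (2, "E", "2009-07-10", 35),
    (3, "C", "2009-04-28", 25), (3, "B", "2009-05-23", 25),
    (3, "J", "2009-05-29", 35), (3, "I", "2009-06-13", 25),
    (3, "D", "2009-06-09", 40), (4, "C", "2009-07-30", 50),
    (4, "H", "2009-05-14", 75), (4, "H", "2009-08-02", 65),
    (4, "G", "2009-07-23", 90), (4, "F", "2009-06-16", 5),
]
conn.executemany("INSERT INTO Documents VALUES (?, ?)", documents)
conn.executemany("INSERT INTO UserVisits VALUES (?, ?, ?, ?)", user_visits)

# Same query shape as on the slides, in SQLite syntax.
(total,) = conn.execute("""
    SELECT SUM(UV.Duration)
    FROM Documents D, UserVisits UV
    WHERE D.URL = UV.DocURL
      AND D.Content LIKE '%car%'
      AND UV.VisitDate BETWEEN '2009-05-01' AND '2009-07-31'
""").fetchone()
print(total)  # 445, matching the slide
```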

Page 14: Scalable Approximate Query Processing

Roadmap

• Database query execution
• System design & implementation
  – DataBaseOnline (DBO)
• Approximation methods (theoretical analysis & practical implementation)
  – Sampling
  – Sketches
  – Sketches over samples

Page 15: Scalable Approximate Query Processing

What is the problem?

• TPC-H benchmark results (price / performance)
  – 10 TB scale
    • 928 hard disks (90 TB total storage capacity)
    • 16 × quad-core processors
    • 512 GB RAM
    • $1.5 million
  – Load time: 55 hours
  – Q1: linear scan over one table with aggregates on top
    • 1 query: 19 minutes
    • 9 queries: 3 hours (linear scaling)

Page 16: Scalable Approximate Query Processing

Approximate Query Processing

[Figure: query result versus time, with curves labeled “Traditional query processing”, “Result estimate”, and “Confidence bounds”]

SELECT SUM f(r1 • r2 • … • rn)
FROM R1 as r1, R2 as r2, …, Rn as rn

Page 17: Scalable Approximate Query Processing

DBO System Architecture [Rusu et al. 2008]

[Figure: DBO architecture with steps numbered 1–7. Components: DB Engine (query plan with σ over UV, σ over D, and Σ), Levelwise Step Controller, In-Memory Join (UV' ⋈ D'), Estimation Module. Outputs: Query Result; Approximate answer (Result, Confidence bounds).]

Page 18: Scalable Approximate Query Processing

Roadmap

• Database query execution
• System design & implementation
  – DataBaseOnline (DBO)
• Approximation methods (theoretical analysis & practical implementation)
  – Sampling
  – Sketches
  – Sketches over samples

Page 19: Scalable Approximate Query Processing

Sampling [Dobra, Jermaine, Rusu & Xu 2009]

[Figure: Documents and UserVisits tables, query plan and query, as on Page 6]

• Control, coordinate & schedule data flow between operators
• Embed randomness in each operator

Page 20: Scalable Approximate Query Processing

Sampling – Selection

• Data in random order
• Assign random timestamp to tuples
• Controller schedules data flow between operators

Documents (X = rejected by the selection)
URL  Content  Timestamp
J    car      68
F    car      220
C    car      312
D    phone    X
A    car      389
B    car      447
G    car      515
H    PC       X
I    car      695
E    car      799

UserVisits (X = rejected by the selection)
IP  URL  Date      Duration  Timestamp
1   A    05-30-09  45        70
1   B    06-01-09  60        140
1   J    06-01-09  30        185
1   D    05-15-09  90        252
1   I    04-28-09  35        X
2   A    04-30-09  60        X
2   F    06-15-09  15        358
2   G    06-13-09  10        409
2   E    06-01-09  20        476
2   E    07-10-09  35        495
3   C    04-28-09  25        X
3   B    05-23-09  25        722
3   J    05-29-09  35        739
3   I    06-13-09  25        745
3   D    06-09-09  40        791
4   C    07-30-09  50        798
4   H    05-14-09  75        837
4   H    08-02-09  65        X
4   G    07-23-09  90        953
4   F    06-16-09  5         973

[Animation frame; query plan and query as on Page 6]
Output of σ over D, first four tuples: (J, 68), (F, 220), (C, 312), (A, 389)
Output of σ over D after J moves to the In-Memory Join: (F, 220), (C, 312), (A, 389), (B, 447)
Output of σ over UV: (A, 45, 70), (B, 60, 140), (J, 30, 185), (D, 90, 252)
In-Memory Join, D side: J

Page 21: Scalable Approximate Query Processing

Sampling – Selection

[Animation frame; tables, timestamps, bullets and query as on Page 20]
Output of σ over D: (F, 220), (C, 312), (A, 389), (B, 447)
Output of σ over UV: (B, 60, 140), (J, 30, 185), (D, 90, 252), (F, 15, 358)
In-Memory Join, UV side: (A, 45)
In-Memory Join, D side: J

Page 22: Scalable Approximate Query Processing

Sampling – Selection

[Animation frame; tables, timestamps, bullets and query as on Page 20]
Output of σ over D: (F, 220), (C, 312), (A, 389), (B, 447)
Output of σ over UV: (D, 90, 252), (F, 15, 358), (G, 10, 409), (E, 20, 476)
In-Memory Join, UV side: (J, 30)
In-Memory Join, D side: J
Join result: (J, 30)

Page 23: Scalable Approximate Query Processing

Sampling – Selection

[Animation frame; tables, timestamps, bullets and query as on Page 20]
Output of σ over D: (G, 515), (I, 695), (E, 799)
Output of σ over UV: (B, 25, 722), (J, 35, 739), (I, 25, 745), (D, 40, 791)
In-Memory Join, UV side: (J, 30), (F, 15)
In-Memory Join, D side: J, F, C, A, B

50% input: estimate 360; bounds [-328, 1048] with 95% probability

Page 24: Scalable Approximate Query Processing

Sampling – Selection

[Animation frame; tables, timestamps, bullets and query as on Page 20]
Output of σ over D: (E, 799)
Output of σ over UV: (I, 25, 745), (D, 40, 791), (C, 50, 798), (H, 75, 837)
In-Memory Join, UV side: (J, 30), (F, 15), (B, 25), (J, 35)
In-Memory Join, D side: J, F, C, A, B, G, I

The In-Memory Join capacity (10 tuples) is exceeded. Eliminate tuples such that the variance is minimized.

Page 25: Scalable Approximate Query Processing

Sampling – Selection

[Animation frame; tables, timestamps, bullets and query as on Page 20]
Output of σ over D: (E, 799)
Output of σ over UV: (I, 25, 745), (D, 40, 791), (C, 50, 798), (H, 75, 837)
In-Memory Join, UV side: (J, 30), (B, 25), (J, 35)
In-Memory Join, D side: J, A, B, G

74% input: estimate 258; bounds [-293, 808] with 95% probability

Page 26: Scalable Approximate Query Processing

Sampling – Selection

[Animation frame; tables, timestamps, bullets and query as on Page 20]
Output of σ over D and σ over UV: empty (all input consumed)
In-Memory Join, UV side: (J, 30), (B, 25), (J, 35), (G, 90)
In-Memory Join, D side: J, A, B, G, E

All input: estimate 448; bounds [3, 892] with 95% probability

Page 27: Scalable Approximate Query Processing

Sampling Estimation – Intermediate Levels

• Query result estimator & variance estimator computed from result tuples found by In-Memory Join
• Confidence bounds derived with Central Limit Theorem
• Solve optimization problem to keep bounds stable when tuples are deleted from In-Memory Join

[Equations: the estimator X, a scaled sum of f(t_1 • t_2 • … • t_n) over tuples t_1 ∈ R_1, …, t_n ∈ R_n selected by an indicator on their timestamps TS(t_i), followed by its expectation E[X]]

\[ \mathrm{Var}[X] = E[X^2] - E[X]^2 \]
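To make the role of the Central Limit Theorem concrete, here is a deliberately simplified sketch. It is not the DBO estimator from the slides: it assumes a single stream read in random order, where each UserVisits tuple contributes its Duration if it satisfies the query and 0 otherwise (values derived from the example tables), and the function name `sum_estimate` is hypothetical. The bounds use the sample variance with a finite population correction, so they collapse to the exact answer once all input has been read.

```python
import math
import random


def sum_estimate(sample, population_size, z=1.96):
    """Estimate a population SUM from a uniform sample without replacement.

    Returns (estimate, low, high); z = 1.96 gives approximate 95% bounds
    under the normal approximation."""
    n = len(sample)
    mean = sum(sample) / n
    variance = (sum((x - mean) ** 2 for x in sample) / (n - 1)) if n > 1 else 0.0
    estimate = population_size * mean
    fpc = 1.0 - n / population_size          # finite population correction
    half_width = z * population_size * math.sqrt(variance / n * fpc)
    return estimate, estimate - half_width, estimate + half_width


# Contribution of each of the 20 UserVisits tuples to the query result
# (Duration if the tuple passes both selections and joins, else 0).
contributions = [45, 60, 30, 0, 0, 0, 15, 10, 20, 35,
                 0, 25, 35, 25, 0, 50, 0, 0, 90, 5]

rng = random.Random(0)                       # arbitrary fixed seed
order = contributions[:]
rng.shuffle(order)                           # "data in random order"

for n in (5, 10, 20):
    est, low, high = sum_estimate(order[:n], len(order))
    print(f"{n:2d} tuples read: estimate {est:.0f}, bounds [{low:.0f}, {high:.0f}]")
```

The bounds on the slides are wider and behave differently because the real estimator covers a join of two sampled relations and tuples evicted from the In-Memory Join.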

Page 28: Scalable Approximate Query Processing

Sampling – Join (Sort)

• Sort tuples on random function of join attribute

[Query plan and query as on Page 6]

Random function of the join attribute (URL)
URL    A    B    C    D   E    F   G   H    I    J
Value  227  987  489  43  739  67  51  150  342  888

D tuples as they arrive: (J, 888), (F, 67), (C, 489), (A, 227), (B, 987), (G, 51), (I, 342), (E, 739)
D Run 1 (sorted): (F, 67), (A, 227), (C, 489), (J, 888)
D Run 2 (sorted): (G, 51), (I, 342), (E, 739), (B, 987)

UV tuples as they arrive (URL, Duration, value):
(A, 45, 227), (B, 60, 987), (J, 30, 888), (D, 90, 43), (F, 15, 67), (G, 10, 51), (E, 20, 739), (E, 35, 739),
(B, 25, 987), (J, 35, 888), (I, 25, 342), (D, 40, 43), (C, 50, 489), (H, 75, 150), (G, 90, 51), (F, 5, 67)
UV Run 1 (sorted): (D, 90, 43), (G, 10, 51), (F, 15, 67), (A, 45, 227), (E, 20, 739), (E, 35, 739), (J, 30, 888), (B, 60, 987)
UV Run 2 (sorted): (D, 40, 43), (G, 90, 51), (F, 5, 67), (H, 75, 150), (I, 25, 342), (C, 50, 489), (J, 35, 888), (B, 25, 987)

Page 29: Scalable Approximate Query Processing

Sampling – Join (Merge)

[Animation frame; query plan and query as on Page 6]

UV Run 1, remaining: (G, 10, 51), (F, 15, 67), (A, 45, 227), (E, 20, 739), (E, 35, 739), (J, 30, 888), (B, 60, 987)
UV Run 2, remaining: (G, 90, 51), (F, 5, 67), (H, 75, 150), (I, 25, 342), (C, 50, 489), (J, 35, 888), (B, 25, 987)
D Run 1: (F, 67), (A, 227), (C, 489), (J, 888)
D Run 2: (G, 51), (I, 342), (E, 739), (B, 987)
Tuples being merged: UV (G, 10, 51), (G, 90, 51) with D (G, 51)
In-Memory Join: Durations 10 and 90 at value 51
Running sum: 0 before, 100 at value 51

Page 30: Scalable Approximate Query Processing

Sampling – Join (Merge)

[Animation frame; query plan and query as on Page 6]

UV Run 1, remaining: (E, 20, 739), (E, 35, 739), (J, 30, 888), (B, 60, 987)
UV Run 2, remaining: (C, 50, 489), (J, 35, 888), (B, 25, 987)
D Run 1, remaining: (C, 489), (J, 888)
D Run 2, remaining: (E, 739), (B, 987)
Tuples being merged: UV (C, 50, 489), (E, 20, 739), (E, 35, 739) with D (C, 489), (E, 739)
In-Memory Join: Duration 50 at value 489
Running sum: 240 at value 489

50% input: estimate 468; bounds [194, 741] with 95% probability

Page 31: Scalable Approximate Query Processing

Sampling – Join (Merge)

[Animation frame; query plan and query as on Page 6]

UV Run 1, remaining: (B, 60, 987)
UV Run 2, remaining: (B, 25, 987)
D Run 1, remaining: empty
D Run 2, remaining: (B, 987)
Tuples being merged: UV (B, 25, 987), (B, 60, 987) with D (B, 987)
In-Memory Join: Durations 25 and 60 at value 987
Running sum: 445 at value 987

Page 32: Scalable Approximate Query Processing

Sampling Estimation – Upper Level

• Bernoulli sampling with probability given by domain fraction seen so far
• Consolidate tuples generated by same join key
• Solve optimization problem to minimize variance across levels
  – Keep confidence bounds stable
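The effect of sorting on a random function of the join attribute can be illustrated with the values from the slides: once both inputs are ordered by that function, every prefix of the merge contains all tuples of a random subset of join keys. The sketch below is a simplification under stated assumptions. It consolidates tuples per join key in a dictionary instead of merging sorted runs, the helper name `running_sums` is hypothetical, and it reports only the partial sums, not the scaled estimate or the confidence bounds shown on the slides.

```python
# Value of the random function for each URL, copied from the slides.
RANDOM_VALUE = {"A": 227, "B": 987, "C": 489, "D": 43, "E": 739,
                "F": 67, "G": 51, "H": 150, "I": 342, "J": 888}

doc_urls = ["J", "F", "C", "A", "B", "G", "I", "E"]      # output of σ over D
visits = [("A", 45), ("B", 60), ("J", 30), ("D", 90), ("F", 15), ("G", 10),
          ("E", 20), ("E", 35), ("B", 25), ("J", 35), ("I", 25), ("D", 40),
          ("C", 50), ("H", 75), ("G", 90), ("F", 5)]     # output of σ over UV


def running_sums(doc_urls, visits, random_value):
    """Return [(value, partial_sum)] as join keys are processed in the
    order given by the random function."""
    docs = set(doc_urls)
    per_key = {}
    for url, duration in visits:
        if url in docs:                                  # key joins with D
            per_key[url] = per_key.get(url, 0) + duration
    result, total = [], 0
    for url in sorted(per_key, key=random_value.get):
        total += per_key[url]                            # consolidate one join key
        result.append((random_value[url], total))
    return result


progress = running_sums(doc_urls, visits, RANDOM_VALUE)
for value, partial in progress:
    print(f"keys with value <= {value}: partial sum {partial}")
```

The partial sums at values 51, 489 and 987 are 100, 240 and 445, the same running sums that appear on the three merge slides.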

Page 33: Scalable Approximate Query Processing

Contributions

• Design & implement DBO, first online analytical processing engine
  – Provide estimates & confidence bounds throughout entire query execution
  – SELECT-PROJECT-JOIN (SPJ) & GROUP BY queries over any number of relations
• Design & analyze fastest convergent estimation method for online aggregation
  – Statistics & optimization techniques

Page 34: Scalable Approximate Query Processing

Roadmap

• Database query execution
• System design & implementation
  – DataBaseOnline (DBO)
• Approximation methods (theoretical analysis & practical implementation)
  – Sampling
  – Sketches
  – Sketches over samples

Page 35: Scalable Approximate Query Processing

Sketches

[Figure: Documents and UserVisits tables, query plan and query, as on Page 6]

• Build sketches on join attribute while data is read from disk
• Use attributes in aggregate

Page 36: Scalable Approximate Query Processing

Sketches

[Animation frame; Documents and UserVisits tables and query as on Page 6]

Sign function of sketch S1
URL   A  B  C  D  E  F  G  H  I  J
Sign  +  -  -  -  -  +  +  +  -  -

Bucket function of sketch S1
URL     A  B  C  D  E  F  G  H  I  J
Bucket  1  2  3  1  1  2  2  3  3  3

Documents tuple read: URL = A (bucket 1, sign +)

Sketch S1 over Documents
Bucket   1  2  3
Before   0  0  0
After    1  0  0

Update rule: the bucket selected for the tuple is incremented by the tuple's sign times 1

Page 37: Scalable Approximate Query Processing

Sketches

[Animation frame; tables, sign function, bucket function and query as on Page 36]

Sketch S1 over Documents so far
Bucket  1  2  3
S1      1  0  0

UserVisits tuple read: (URL = A, Duration = 45) (bucket 1, sign +)

Sketch S1 over UserVisits
Bucket   1   2  3
Before   0   0  0
After    45  0  0

Update rule: the bucket selected for the tuple is incremented by the tuple's sign times its Duration

Page 38: Scalable Approximate Query Processing

Sketches

[Animation frame; tables, sign function, bucket function and query as on Page 36]

Sketch S1 over Documents, all input read
Bucket  1  2  3
S1      0  1  -3

Sketch S1 over UserVisits, all input read
Bucket  1     2   3
S1      -140  35  -65

Estimate from S1: 230

Page 39: Scalable Approximate Query Processing

Sketches

[Animation frame; Documents and UserVisits tables and query as on Page 6]

Sign functions
URL  A  B  C  D  E  F  G  H  I  J
S1   +  -  -  -  -  +  +  +  -  -
S2   +  -  +  -  +  -  +  -  +  -
S3   -  -  -  +  +  -  +  +  -  +

Bucket functions
URL  A  B  C  D  E  F  G  H  I  J
S1   1  2  3  1  1  2  2  3  3  3
S2   3  3  2  1  2  1  2  1  3  2
S3   1  1  2  1  3  1  3  2  3  2

Sketches over Documents
Bucket  1   2  3
S1      0   1  -3
S2      -1  2  1
S3      -3  0  1

Sketches over UserVisits
Bucket  1     2    3
S1      -140  35   -65
S2      -225  140  -15
S3      -20   90   130

Estimates: S1 = 230, S2 = 490, S3 = 190

Result: 230; bounds [-416, 876] with 95% probability

Page 40: Scalable Approximate Query Processing

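Sketches Estimation

• Two random processes
  – Bucket selection
  – Sign
• Sketch update
• Estimator
• Confidence bounds
  – Multiple independent sketches
  – Chebyshev & Chernoff inequalities (worst-case)
  – Median Central Limit Theorem, Student-t distribution (statistics)

Bucket selection and sign:
\[ h : D \to H, \qquad \xi : D \to \{-1, +1\} \]

Sketch update:
\[ \mathrm{Sk}(R_i)[h(t_i.\mathrm{join})] = \mathrm{Sk}(R_i)[h(t_i.\mathrm{join})] + \xi(t_i.\mathrm{join}) \, f_i(t_i), \quad t_i \in R_i, \; i \in \{1, 2\} \]

Estimator:
\[ E\Big[ \sum_{h \in H} \mathrm{Sk}(R_1)[h] \cdot \mathrm{Sk}(R_2)[h] \Big] = \sum_{t_1 \in R_1} \sum_{t_2 \in R_2} f(t_1, t_2) \]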
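The sketch update and estimator can be replayed on the example data. The code below is a minimal sketch under stated assumptions: the sign and bucket functions are hard-coded from the tables on Page 39 rather than produced by a pseudo-random generator, the helper names `build_sketch` and `estimate` are hypothetical, and the median is used to combine the three sketches because it reproduces the value reported on the slide.

```python
from statistics import median

KEYS = "ABCDEFGHIJ"
# One sign string and one bucket string per sketch (S1, S2, S3), from the slides.
SIGNS = ["+----+++--", "+-+-+-+-+-", "---++-++-+"]
BUCKETS = ["1231122333", "3321212132", "1121313232"]

# Tuples that pass the selections: each Document counts 1,
# each UserVisits tuple contributes its Duration.
docs = [(url, 1) for url in "ABCEFGIJ"]
visits = [("A", 45), ("B", 60), ("J", 30), ("D", 90), ("F", 15), ("G", 10),
          ("E", 20), ("E", 35), ("B", 25), ("J", 35), ("I", 25), ("D", 40),
          ("C", 50), ("H", 75), ("G", 90), ("F", 5)]


def build_sketch(row, items, num_buckets=3):
    """Sketch update: add sign(key) * value to the bucket chosen for the key."""
    sketch = [0] * num_buckets
    for key, value in items:
        k = KEYS.index(key)
        sign = 1 if SIGNS[row][k] == "+" else -1
        sketch[int(BUCKETS[row][k]) - 1] += sign * value
    return sketch


def estimate(row):
    """Estimator: sum over buckets of the product of the two sketches."""
    return sum(a * b for a, b in zip(build_sketch(row, docs),
                                     build_sketch(row, visits)))


estimates = [estimate(row) for row in range(3)]
print(estimates)           # [230, 490, 190]
print(median(estimates))   # 230
```

Each tuple touches exactly one counter per sketch, which is why the update cost stays small; the exact answer for this query is 445, so the individual estimates vary widely on such a tiny example.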

Page 41: Scalable Approximate Query Processing

Pseudo-Random Number Generators [Rusu & Dobra 2006, 2007b]

• Detailed comparison of generating schemes
  – Abstract algebra (orthogonal arrays, vector spaces, prime & extension fields)
    • Degree of independence as function of seed size
    • Fast range-summable
  – Empirical evaluation
    • Generating time is few processor cycles
• Identify EH3 as generator for sketches
  – Lowest possible degree of independence
  – 7.3 ns to generate single number

Page 42: Scalable Approximate Query Processing

Statistical Analysis [Rusu & Dobra 2007a, 2008]

• Detailed comparison of sketch estimators
  – Same accuracy (worst-case analysis)
  – Statistical analysis
    • Distribution (probability density function)
    • Higher frequency moments (kurtosis)
    • Confidence bounds
  – Empirical evaluation
    • Data skew, correlation, memory usage, update time
• Identify Fast-AGMS as most reliable scheme
  – Accurate over entire range of data
  – Small memory footprint, fast update time

Page 43: Scalable Approximate Query Processing

Roadmap

• Database query execution
• System design & implementation
  – DataBaseOnline (DBO)
• Approximation methods (theoretical analysis & practical implementation)
  – Sampling
  – Sketches
  – Sketches over samples

Page 44: Scalable Approximate Query Processing

Sketches over Samples [Rusu & Dobra 2009]

• Data is random on disk
• Build sketches on join attribute while data is read from disk
• Use attributes in aggregate
• Provide estimates at any point

[Figure: query plan and query as on Page 6; UserVisits table as on Page 6]

Documents, in random order on disk
URL  Content
J    car
F    car
C    car
D    phone
A    car
B    car
G    car
H    PC
I    car
E    car

Page 45: Scalable Approximate Query Processing

Sketches over Samples

[Animation frame; query and UserVisits table as on Page 6, Documents in random order as on Page 44, sign and bucket functions as on Page 39]

Sketches over Documents, 50% of input read
Bucket  1   2  3
S1      1   1  -2
S2      -1  0  1
S3      -2  0  0

Sketches over UserVisits, 50% of input read
Bucket  1     2    3
S1      -100  -35  -30
S2      -105  35   -15
S3      -30   30   65

Estimates: S1 = -300, S2 = 360, S3 = 240

50% input: estimate 100; bounds [-2382, 2582] with 95% probability

Page 46: Scalable Approximate Query Processing

Sketches over Samples – Estimation

• Define estimator over two completely different random processes & analyze statistically
  – Sampling – random partition, tuple domain
  – Sketches – random projection, frequency domain
  – Consider correlation between multiple sketches that share same sample
  – Moment generating functions
• Generic analysis independent of sampling process
  – Bernoulli sampling
  – Sampling without replacement
  – Sampling with replacement
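The numbers on the “50% input” slide can be reproduced by sketching only the first half of each table and scaling the sketch product by the inverse of the two sampling fractions. The code below is an illustrative sketch, not the authors' implementation: the scaling rule and the use of the mean to combine the three sketches are assumptions chosen because they match the values on the slide, the helper names are hypothetical, and dates are shortened to "MM-DD" strings since all visits fall in 2009.

```python
from statistics import mean

KEYS = "ABCDEFGHIJ"
SIGNS = ["+----+++--", "+-+-+-+-+-", "---++-++-+"]      # S1, S2, S3 (from the slides)
BUCKETS = ["1231122333", "3321212132", "1121313232"]

# Tables in the on-disk (random) order used on the slides.
documents = [("J", "car"), ("F", "car"), ("C", "car"), ("D", "phone"), ("A", "car"),
             ("B", "car"), ("G", "car"), ("H", "PC"), ("I", "car"), ("E", "car")]
user_visits = [("A", "05-30", 45), ("B", "06-01", 60), ("J", "06-01", 30),
               ("D", "05-15", 90), ("I", "04-28", 35), ("A", "04-30", 60),
               ("F", "06-15", 15), ("G", "06-13", 10), ("E", "06-01", 20),
               ("E", "07-10", 35), ("C", "04-28", 25), ("B", "05-23", 25),
               ("J", "05-29", 35), ("I", "06-13", 25), ("D", "06-09", 40),
               ("C", "07-30", 50), ("H", "05-14", 75), ("H", "08-02", 65),
               ("G", "07-23", 90), ("F", "06-16", 5)]


def build_sketch(row, items, num_buckets=3):
    """Add sign(key) * value to the bucket chosen for each key."""
    sketch = [0] * num_buckets
    for key, value in items:
        k = KEYS.index(key)
        sign = 1 if SIGNS[row][k] == "+" else -1
        sketch[int(BUCKETS[row][k]) - 1] += sign * value
    return sketch


def estimates_after(fraction):
    """Sketch the first `fraction` of each table and scale up the result."""
    d_seen = documents[:int(len(documents) * fraction)]
    uv_seen = user_visits[:int(len(user_visits) * fraction)]
    d_items = [(url, 1) for url, content in d_seen if "car" in content]
    uv_items = [(url, dur) for url, date, dur in uv_seen
                if "05-01" <= date <= "07-31"]
    # Assumed scaling: inverse of the product of the two sampling fractions.
    scale = (len(documents) / len(d_seen)) * (len(user_visits) / len(uv_seen))
    return [scale * sum(a * b for a, b in zip(build_sketch(r, d_items),
                                              build_sketch(r, uv_items)))
            for r in range(3)]


half = estimates_after(0.5)
print(half, mean(half))        # [-300.0, 360.0, 240.0] 100.0
print(estimates_after(1.0))    # [230.0, 490.0, 190.0]
```

Because the data is stored in random order, the prefix read so far acts as a sample, so an estimate is available at any point of the scan.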

Page 47: Scalable Approximate Query Processing

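Sketches over Samples – Analysis

\[ E[X] = C \sum_{i \in D} E[f'_i] \, E[g'_i] \]

\[ X = C \sum_{i \in D} \sum_{j \in D} \xi_i \, \xi_j \, f'_i \, g'_j \]

[Equation: Var[X], expressed through the moments E[f'_i], E[f'^2_i], E[f'_i f'_j] and the corresponding moments of g']

Var[sketch over samples] = Var[samples] + Var[sketch] + Var[interaction]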

Page 48: Scalable Approximate Query Processing

Conclusions

• Data explosion
  – Cheap, high-capacity storage
  – Current processing technology is too expensive for performance it provides
• Framework for online analytical processing
  – DBO system architecture
    • Embed randomization into data processing
    • Provide estimates and bounds at any time
  – Approximation methods
    • Sampling – most flexible
    • Sketches – single pass
    • Sketches over samples – fastest

Page 49: Scalable Approximate Query Processing

Future Work

• Short term
  – Define & design query optimization for DBO
  – Extend DBO to other types of queries and with other approximation techniques (end-biased samples, histograms, …)
  – Generalize sketches to multiple relations
  – Find optimal amount of data to sketch
  – Fully integrate sketches into DBO system
• Medium term
  – Develop data aggregation & approximation techniques for other types of architectures
    • Multicore processors, GPUs
    • Distributed processing (Map-Reduce, Hadoop, …)
• Long term
  – Design & build scalable analytic processing system
    • Aggregation & approximation

Page 50: Scalable Approximate Query Processing

Publications

• A. Dobra, C. Jermaine, F. Rusu, F. Xu – Turbo-Charging Estimate Convergence in DBO. In VLDB 2009.
• F. Rusu and A. Dobra – Sketching Sampled Data Streams. In ICDE 2009.
• F. Rusu et al. – The DBO Database System. In SIGMOD 2008 (demo).
• F. Rusu and A. Dobra – Sketches for Size of Join Estimation. In TODS, vol. 33, no. 3, 2008.
• F. Rusu and A. Dobra – Pseudo-Random Number Generation for Sketch-Based Estimations. In TODS, vol. 32, no. 2, 2007.
• F. Rusu and A. Dobra – Statistical Analysis of Sketch Estimators. In SIGMOD 2007.
• F. Rusu and A. Dobra – Fast Range-Summable Random Variables for Efficient Aggregate Estimation. In SIGMOD 2006.