
Primitives for Workload Summarization and Implications for SQL

Dec 31, 2015



Prasanna Ganesan*
Stanford University

Surajit Chaudhuri, Vivek Narasayya
Microsoft Research

*Work done at Microsoft Research


Motivation

• Workload: a set of SQL statements
• Many tasks exploit workload information
  – DB administration, index tuning, statistics building, approximate query processing
• DBMS profilers produce large workloads (plus additional info)
• Most tasks need small workloads
• Goal: Summarization. Find a “representative” subset of a given, large workload
  – Sometimes a weighted subset


Why Not Random Sampling?

• One size does not fit all
  – Different definitions of “representative subset”
  – Random sampling may lose valuable info
• Ignores additional info associated with statements
• Shown to work poorly, e.g., for index selection [chaudhuri02]
  – May oversample queries on some tables, while ignoring less frequent queries on other tables


Our Solution

1. Treat input as a relation
  • Each SQL statement (+associated info) is a tuple
2. Extend SQL with new language primitives
  • Allow declarative specification of desired subset
  • Usable on arbitrary relations, not just workloads
3. Implement extensions inside query engine
  • Why? Primitives appear widely applicable
  • Other implementation options available


The Architecture

[Figure: the workload table and the query below are fed to the Execution Engine, which produces a Summary for the Application.]

Query ID | SQL String           | FROM Tables | …… | Estimated Cost | Execution Cost
Q1       | SELECT * FROM R1, R2 | {R1, R2}    |    | 2.5            | 3.03
Q2       | …                    | …           |    | …              | …
..       | …                    | …           | .. | …              | …

SELECT *, DOMSUM(Count)
FROM WkldTbl
DOMINATE WITH PARTITIONING BY FromTables, JoinConds, WhereCols
  (SLAVE.GroupByCols ⊆ MASTER.GroupByCols) AND
  (SLAVE.OrderByCols PREFIX MASTER.OrderByCols)
REPRESENT WITH PARTITIONING BY FromTables, JoinConds, WhereCols
  MAXIMIZING SUM(DOM_Count)
  GLOBAL CONSTRAINT Count(*) ≤ 200
  LOCAL CONSTRAINT Count(*) ≥ int(200*LOCAL.Count(*)/GLOBAL.Count(*))


Outline

• New Primitives for Summarization (Subsetting)
  – Dominance
  – Representation
• Implementing Summarization Primitives in SQL
• Experiments


Dominance

• Idea: Filter and aggregate using a partial order on tuples
• Specify condition for one tuple to dominate another
  – Transitive condition
  – Encapsulates application knowledge
• Output: Keep throwing away tuples that are dominated
  – Retain aggregate info about dominated tuples


A Graphical Representation

Vendor  | Quality | Price
Buono   | 75      | 25
Cattivo | 50      | 50
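The dominance idea can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names `dominance_filter` and `better` are hypothetical, and the rule that a vendor dominates another when it has at least the quality at no higher price (and is strictly better on one of the two) is an assumption made for the example. Each surviving tuple carries a count of the tuples credited to it, itself included.

```python
def dominance_filter(rows, dominates, weight=lambda r: 1):
    """Naive O(n^2) dominance: drop dominated tuples, keep aggregate info.

    `dominates(m, s)` must be a strict, transitive condition (assumption).
    """
    # A tuple survives if no other tuple dominates it.
    survivors = [r for r in rows
                 if not any(dominates(m, r) for m in rows if m is not r)]
    # Credit every tuple's weight to itself (if it survives) or to the
    # first surviving tuple that dominates it.
    totals = [0] * len(survivors)
    for r in rows:
        for i, m in enumerate(survivors):
            if m is r or dominates(m, r):
                totals[i] += weight(r)
                break
    return list(zip(survivors, totals))


def better(m, s):
    # Assumed rule: higher-or-equal quality, lower-or-equal price,
    # and strictly better on at least one of the two.
    return (m["Quality"] >= s["Quality"] and m["Price"] <= s["Price"]
            and (m["Quality"] > s["Quality"] or m["Price"] < s["Price"]))


vendors = [
    {"Vendor": "Buono", "Quality": 75, "Price": 25},
    {"Vendor": "Cattivo", "Quality": 50, "Price": 50},
]
result = dominance_filter(vendors, better)
for row, dom_count in result:
    print(row["Vendor"], dom_count)
```

When a tuple is dominated by several survivors, this sketch credits it to the first one found; other attribution policies are equally plausible.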


Applying Dominance to Workloads

• Example: Index Selection
  – An index useful for Q1 is likely to be useful for Q2

Q1: SELECT ... FROM R GROUP BY A, B, C
Q2: SELECT … FROM R GROUP BY A, B
Q1 dominates Q2.

Dominance condition:
MASTER.FromTables = SLAVE.FromTables AND
MASTER.GroupByCols ⊇ SLAVE.GroupByCols AND
MASTER.OrderByCols PREFIX SLAVE.OrderByCols
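As a rough illustration, the dominance condition for workload statements might be written as a Python predicate. The dictionary layout and the helper `is_prefix` are hypothetical, and the reading of PREFIX assumed here is that the slave's ORDER BY list is a prefix of the master's.

```python
def is_prefix(short, long):
    """True if the sequence `short` is a prefix of the sequence `long`."""
    return list(long[:len(short)]) == list(short)


def dominates(master, slave):
    # Same tables, slave's GROUP BY columns contained in the master's,
    # and (assumed direction) slave's ORDER BY list a prefix of the master's.
    return (master["FromTables"] == slave["FromTables"]
            and set(slave["GroupByCols"]) <= set(master["GroupByCols"])
            and is_prefix(slave["OrderByCols"], master["OrderByCols"]))


q1 = {"FromTables": frozenset({"R"}), "GroupByCols": ["A", "B", "C"], "OrderByCols": []}
q2 = {"FromTables": frozenset({"R"}), "GroupByCols": ["A", "B"], "OrderByCols": []}

print(dominates(q1, q2))  # Q1 as master, Q2 as slave
print(dominates(q2, q1))  # the reverse direction
```

As written, the predicate is reflexive (a statement dominates itself), so a filter built on it has to skip self-comparisons and decide how to treat identical statements.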


Outline

• New Primitives for Summarization (Subsetting)
  – Dominance
  – Representation
• Implementing Summarization Primitives in SQL
• Experiments


Representation

• Dominance only gets us so far
  – Need a “lossier” way to select a subset
• Idea: Pick a subset that solves a Linear Program
  – Optimize some criterion
  – Satisfy lots of constraints
  – Support concept of partitioning


Details

• Partition tuples by a set of attributes
• Criterion: Maximize/Minimize Aggregate
  – E.g., Minimize Count(*)
• Global Constraints
  – E.g., Sum(B) in chosen subset > 60% Sum(B) in input
• Local Constraints: apply to each partition
  – E.g., Sum(B) in chosen subset > 40% Sum(B) in that partition

[Figure: an input relation with columns A, B, C and rows such as (1, 10, ..), (2, 5, ..), (3, 7, ..) is partitioned by A into three partitions, one each for A = 1, A = 2, and A = 3.]
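To make the two kinds of constraints concrete, the following sketch checks whether a candidate subset meets the example global (60%) and local (40%) conditions on Sum(B). The function name `satisfies` and the sample rows are hypothetical; only the values (1, 10), (2, 5), and (3, 7) echo the figure.

```python
from collections import defaultdict


def satisfies(rows, chosen, global_frac=0.6, local_frac=0.4):
    """rows: list of (A, B) pairs; chosen: set of row indices.

    Global: Sum(B) over chosen > global_frac * Sum(B) over the input.
    Local:  the same test with local_frac inside every partition of A.
    """
    total = sum(b for _, b in rows)
    picked = sum(rows[i][1] for i in chosen)
    if not picked > global_frac * total:
        return False
    part_total = defaultdict(float)
    part_picked = defaultdict(float)
    for i, (a, b) in enumerate(rows):
        part_total[a] += b
        if i in chosen:
            part_picked[a] += b
    return all(part_picked[a] > local_frac * part_total[a] for a in part_total)


rows = [(1, 10), (1, 4), (2, 5), (2, 5), (3, 7), (3, 1)]
print(satisfies(rows, {0, 2, 4}))     # one tuple from each partition
print(satisfies(rows, {0, 4}))        # too little of Sum(B) overall
print(satisfies(rows, {0, 1, 4, 5}))  # enough overall, but partition A = 2 is empty
```

The third call shows why local constraints matter: a subset can cover most of the input in aggregate while ignoring an entire partition.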


An Index Selection Example

• Partition by tables, join conditions, and attributes in the WHERE clause
• Criterion: Maximize Sum(ExecutionCost)
  – Need best “coverage”
• Global Constraint: Count(*) ≤ 200
• Local Constraint: Proportionate representation
  – A partition with 20% of input should have 20% of output
  – Count(*) ≥ int(200*LOCAL.Count(*)/GLOBAL.Count(*))
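The local constraint is simple arithmetic, sketched here in Python. The function name `proportional_quotas` and the partition sizes are hypothetical; integer floor division stands in for `int(...)`, which gives the same result for non-negative counts.

```python
def proportional_quotas(partition_sizes, budget=200):
    """Lower bound on output tuples per partition:
    int(budget * LOCAL.Count(*) / GLOBAL.Count(*))."""
    total = sum(partition_sizes.values())
    return {p: budget * n // total for p, n in partition_sizes.items()}


sizes = {"P1": 600, "P2": 300, "P3": 100}
quotas = proportional_quotas(sizes)
print(quotas)

# With a small budget, rounding down can leave a partition with quota 0.
small = proportional_quotas({"P1": 5, "P2": 4, "P3": 1}, budget=3)
print(small)
```

Because every quota is rounded down, the quotas sum to at most the budget, so the local lower bounds never conflict with the global upper bound.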


Putting it all together

1. Apply dominance criterion (as earlier).

2. Apply representation (as earlier, but maximize SUM(DOM_Count) ).

3. Weight each tuple by the number of tuples it dominates.

SELECT SqlString, DOMSUM(Count)
FROM WkldTbl
DOMINATE WITH PARTITIONING BY FromTables, JoinConds, WhereCols
  (SLAVE.GroupByCols ⊆ MASTER.GroupByCols) AND
  (SLAVE.OrderByCols PREFIX MASTER.OrderByCols)
REPRESENT WITH PARTITIONING BY FromTables, JoinConds, WhereCols
  MAXIMIZING SUM(DOM_Count)
  GLOBAL CONSTRAINT Count(*) ≤ 200
  LOCAL CONSTRAINT Count(*) ≥ int(200*LOCAL.Count(*)/GLOBAL.Count(*))


Outline

• New Primitives for Summarization (Subsetting)
  – Dominance
  – Representation
• Implementing Summarization Primitives in SQL
• Experiments


Implementing Summarization Primitives in SQL

• Assume set and sequence support in SQL
  – The mills of the standards bodies…
• Partitioning useful for both primitives
  – Hashing, sort-based, index-based…
• Implementing Dominance
  – Naïve O(n²) algorithm
  – Techniques from group-wise processing
  – Leverage Skyline optimizations
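A minimal sketch of how hash partitioning can cut down the naive pairwise work: tuples are grouped by the partitioning key first, and the quadratic comparison runs only inside each group. The function name, row layout, and proper-subset condition are all assumptions for illustration, not the engine's actual operator.

```python
from collections import defaultdict


def partitioned_dominance(rows, partition_key, dominates):
    """Hash-partition the rows, then run the naive pairwise test per partition."""
    groups = defaultdict(list)
    for r in rows:
        groups[partition_key(r)].append(r)
    survivors, comparisons = [], 0
    for group in groups.values():
        for s in group:
            dominated = False
            for m in group:
                if m is s:
                    continue
                comparisons += 1
                if dominates(m, s):
                    dominated = True
                    break
            if not dominated:
                survivors.append(s)
    return survivors, comparisons


rows = [
    {"FromTables": "R", "GroupByCols": {"A", "B", "C"}},
    {"FromTables": "R", "GroupByCols": {"A", "B"}},
    {"FromTables": "S", "GroupByCols": {"A"}},
    {"FromTables": "S", "GroupByCols": {"B"}},
]
# Assumed condition: the slave's GROUP BY columns are a proper subset of the master's.
survivors, comparisons = partitioned_dominance(
    rows,
    partition_key=lambda r: r["FromTables"],
    dominates=lambda m, s: s["GroupByCols"] < m["GroupByCols"],
)
print(len(survivors), comparisons)
```

On these four rows the partitioned version performs 4 comparisons, whereas comparing every ordered pair without partitioning would take up to 12.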


Representation

• Implementing directly is LP-hard
• Many queries are much simpler
  – Fall into one of two special cases
• Other queries are handled by a simple heuristic
  – User-guided search
• Implement as multiple operators


User-Guided Search

• Scan tuples in a specific order
  – User-specified, or heuristically chosen
• Will always minimize/maximize Count(*)
  – Use ordering to transform other objectives
  – Slightly different algorithms for the two cases
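One plausible reading of the minimization case, sketched in Python: scan the tuples in the given order, add a tuple only if it helps a constraint that is still violated, and stop once every constraint holds. The slides do not spell out the algorithm, so the function `guided_min_count`, the restriction to constraints of the form SUM(attr) ≥ threshold, and the sample data are all assumptions. The result is a heuristic answer, not a guaranteed minimum.

```python
def guided_min_count(rows, order_key, constraints):
    """Greedy scan for a small output satisfying SUM(attr) >= threshold constraints.

    constraints: list of (attr, threshold) pairs (assumed form).
    Returns (output rows, whether all constraints ended up satisfied).
    """
    sums = {attr: 0 for attr, _ in constraints}
    output = []
    for r in sorted(rows, key=order_key):
        violated = [a for a, t in constraints if sums[a] < t]
        if not violated:
            break  # every constraint is satisfied; stop scanning
        if any(r[a] > 0 for a in violated):
            output.append(r)
            for a in sums:
                sums[a] += r[a]
    ok = all(sums[a] >= t for a, t in constraints)
    return output, ok


rows = [
    {"id": "A", "X": 5, "Y": 0},
    {"id": "B", "X": 4, "Y": 0},
    {"id": "C", "X": 0, "Y": 3},
    {"id": "D", "X": 3, "Y": 0},
    {"id": "E", "X": 0, "Y": 2},
    {"id": "F", "X": 1, "Y": 1},
]
# Hypothetical user-chosen order: largest X + Y first.
output, ok = guided_min_count(
    rows,
    order_key=lambda r: -(r["X"] + r["Y"]),
    constraints=[("X", 8), ("Y", 4)],
)
print([r["id"] for r in output], ok)
```

In this run, tuple D is skipped because the constraint on X is already satisfied when D is reached, and the scan stops before F.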


A Minimization Example

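[Figure: minimization example over tuples A, B, C, D, E, F, with constraints marked Satisfied or Violated and the selected tuples shown as Output.]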


Two Special Cases

• Maximize SUM(Attr)
  – All constraints are on Count(*)
  – Use partitioning and sort-order access
• Minimize Count(*)
  – Single constraint: Again easily solved
  – More special cases also solvable
  – Multiple constraints: Approximation algorithm
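The first special case lends itself to a short greedy sketch: sort each partition by the attribute, take the top tuples needed for the local lower bounds, then spend what is left of the global budget on the best remaining tuples. This is an illustration under stated assumptions (non-negative attribute values, quotas that sum to at most the budget); the function name and data are hypothetical.

```python
import heapq
from collections import defaultdict


def max_sum_under_counts(rows, part_key, attr, budget, quotas):
    """Maximize SUM(attr) subject to Count(*) <= budget overall and
    Count(*) >= quotas[p] in each partition p.

    Assumes attr values are non-negative and sum(quotas) <= budget.
    """
    groups = defaultdict(list)
    for r in rows:
        groups[part_key(r)].append(r)
    chosen, rest = [], []
    for p, group in groups.items():
        group.sort(key=lambda r: r[attr], reverse=True)  # sort-order access
        k = quotas.get(p, 0)
        chosen.extend(group[:k])   # satisfy the local lower bound
        rest.extend(group[k:])
    remaining = budget - len(chosen)
    if remaining > 0:
        # Fill the rest of the budget with the globally best leftovers.
        chosen.extend(heapq.nlargest(remaining, rest, key=lambda r: r[attr]))
    return chosen


rows = [
    {"part": "P1", "cost": 9}, {"part": "P1", "cost": 8}, {"part": "P1", "cost": 1},
    {"part": "P2", "cost": 5}, {"part": "P2", "cost": 2},
]
chosen = max_sum_under_counts(rows, lambda r: r["part"], "cost",
                              budget=3, quotas={"P1": 1, "P2": 1})
print(sorted(r["cost"] for r in chosen))
```

Here the quotas force one tuple from each partition (costs 9 and 5), and the one remaining slot goes to the best leftover tuple (cost 8).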


Experiments

• Evaluate utility for index selection
• Compare to sophisticated Workload Compression [chaudhuri02]
  – Clusters using a complex distance function
• Simple query as described earlier
  – Constrained to output same number of statements as Workload Compression
  – Orders of magnitude faster
• TPC-H 1GB database
  – Multiple synthetic workloads introduced in [chaudhuri02]


Experiments (Contd.)

[Figure: evaluation pipeline. Workload → Compress → Tuning Wizard → Evaluate → Total Estimated Cost.]


Comparing Estimated Costs

[Figure: bar chart of Estimated Cost (y-axis, 0 to 90000) for the workloads SPJ, SPJ-GB, SPJ-GBOB, and SingleTable, comparing Wkld Compression with Proportionate (Syntactic).]


Conclusion

• Our contributions
  – Summarization can be expressed declaratively
  – Introduction of new operators for summarization
  – Discussion of SQL implementation
• The Future
  – An automatic monitoring and tuning infrastructure?
  – More workload-sensitive tasks?