Online Analytical Processing (OLAP) An Overview Kian Win Ong, Nicola Onose Mar 3 rd 2006.

Post on 19-Jan-2016

217 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

Transcript

Online Analytical Processing (OLAP)

An Overview

Kian Win Ong, Nicola OnoseMar 3rd 2006

Overview• Motivation

• Multi-Dimensional Data Model

• Research Areas

• Optimizations– Materializing multiple aggregates simultaneously– Materialization strategy

Motivation• Aggregation, summarization and exploration• Of historical data• To help management make informed decisions

Different Goal• Aggregation, summarization and exploration• Of historical data• To help management make informed decisions

Product Branch Time Price

Coke (0.5 gallon) Convoy Street 2006-03-01 09:00:01 $1.00

Pepsi (0.5 gallon) UTC 2006-03-01 09:00:01 $1.03

Coke (1 gallon) UTC 2006-03-01 09:00:02 $1.50

Altoids Costa Verde 2006-03-01 09:01:33 $0.30

...

• Find the total sales for each product and month• Find the percentage change in the total monthly

sales for each product

Different Requirements

OLTP OLAP

Tasks Day to day operation High level decision support

Size of database Gigabytes Terabytes

Time span Recent, up-to-date Spanning over months / years

Size of working set Tens of records, accessed through primary keys

Consolidated data from multiple databases

Workload Structured / repetitive Ad-hoc, exploratory queries

Performance Transaction throughput Query latency

• OLTP – On-Line Transaction Processing• OLAP – On-Line Analytical Processing

Overview• Motivation

• Multi-Dimensional Data Model

• Research Areas

• Optimizations– Materializing multiple aggregates simultaneously– Materialization strategy

Query Language Extensions

• In the real world, data is stored in RDBs.

Query Language Extensions

• In the real world, data is stored in RDBs.

• How to express N-dimensional problems using 2D tables?

Query Language Extensions

• In the real world, data is stored in RDBs.

• How to express N-dimensional problems using 2D tables?

• Can we combine OLAP and SQL queries?

• Jim Gray et al:Data Cube: A Relational Aggregation Operator1997

Query Language Extensions

1.histograms

Problems with GROUP BY

SELECT sales, prod_name, population FROM sales_history GROUP BY Population(City, State) as population

Query Language Extensions

1.histograms

2.rollup/drilldown

Problems with GROUP BY

ProductCategory

Product Name

Month Sales Sales by Cat.,

by Name

Salesby Cat.

Drinks Coke Feb 30.3

Mar 93.9 124.2

Heineken Feb 34.8

Mar 123.8 158.6 282.8

non relational representation

Query Language Extensions

1.histograms

2.rollup/drilldown

Problems with GROUP BY

Product Category

Product Name

Month Sales Sales by Cat.,

by Name

Salesby Cat.

Drinks Coke Feb 30.3 124.2 282.8

Drinks Coke Mar 93.9 124.2 282.8

Drinks Heineken Feb 34.8 158.6 282.8

Drinks Heineken Mar 123.8 158.6 282.8

relational, but the rollup is huge

Query Language Extensions

1.histograms

2.rollup/drilldown

3.cross tabulations

Problems with GROUP BY

2-D aggregation is more compact and more natural:

Drinks Feb Mar Total

Coke 30.3 93.9 124.2

Heineken 34.8 123.8 158.6

Total 65.1 217.7 282.8

Query Language ExtensionsReducing the number of attributes

Product Category

Product Name

Month Sales

Drinks Coke Feb 30.3

Drinks Coke Mar 93.9

Drinks Coke ALL 124.2

Drinks Heineken Feb 34.8

Drinks Heineken Mar 123.8

Drinks Heineken ALL 158.6

Drinks ALL ALL 282.8

Drinks ALL Feb 65.1

Drinks ALL Mar 217.7

Query Language Extensions

• introduce a new value: “ALL”

Reducing the number of attributes

“ALL” = the set over which we aggregate

Drinks Feb Mar Total (ALL)

Coke 30.3 93.9 124.2

Heineken 34.8 123.8 158.6

Total (ALL) 65.1 217.7 282.8

Query Language Extensions

• GROUP BY (1D)

General approach

Sales by Product Name

Feb Mar

Coke 30.3 93.9

Heineken 34.8 123.8

SUM 65.1 217.7

Query Language Extensions

• GROUP BY (1D)

• Cross Tab (2D)

General approach

Drinks Feb Mar ALL

Coke 30.3 93.9 124.2

Heineken 34.8 123.8 158.6

ALL 65.1 217.7 282.8

Product Category

Product Name

Month Sales

Drinks Coke Feb 30.3

Drinks Coke Mar 93.9

Drinks Coke ALL 124.2

Drinks Heineken Feb 34.8

Drinks Heineken Mar 123.8

Drinks Heineken ALL 158.6

Drinks ALL Feb 65.1

Drinks ALL Mar 217.7

Drinks ALL ALL 282.8

the corresponding relation:

Query Language Extensions

• GROUP BY (1D)

• Cross Tab (2D)

• Cube (3D)

General approachProduct

CategoryProduct Name

Month Sales

Drinks Coke Feb 30.3

Drinks Coke Mar 93.9

Drinks Coke ALL 124.2

… … … …

Snacks Doritos Feb 123.8

Snacks Doritos Mar 158.6

Snacks Doritos ALL 65.1

… … … …

ALL ALL ALL 964.0

By cat.andmonth

By cat. andname (does it make sense?)

By monthand name

Query Language Extensions

• GROUP BY (1D)

• Cross Tab (2D)

• Cube (3D)

• Any hypercube can be represented as a relation!

General approach

Query Language Extensions

• a CUBE relation, with aggregation function f(.)(x1, x2, …, xn-1, xn, f() )

……………………………

(x1, xn-1, …, xn, ALL, f() )

……………………………

(x1, x2, …, ALL, xn, f() )

……………………………

• after ROLLUP , reduce to a linear # of tuples(x1, x2, …, xn-1, xn, f() )

…………………………………

(x1, xn-1, …, xn, ALL, f() )

…………………………………

(x1, x2, …, ALL, ALL, f() )

…………………………………

(ALL, ALL, …, ALL, ALL, f() )

General approach

Query Language ExtensionsThe new operators: CUBE, ROLLUP

SELECT prod_category, prod_name, month, SUM(sales) AS sales FROM sales_history GROUP BY CUBE prod_category, prod_name, month

Product Category

Product Name

Month Sales

Drinks Coke Feb 30.3

Drinks Coke Mar 93.9

Drinks Coke ALL 124.2

… … … …

Drinks ALL Feb 99.8

… … … …

ALL ALL ALL 964.0

Idea:Group by the CUBE list.Union the aggregates.Introduce the ALL values.

Query Language ExtensionsThe new operators: CUBE, ROLLUP

SELECT prod_category, month, day, state, prod_name, SUM(sales) AS sales FROM sales_history GROUP BY prod_category ROLLUP month, day CUBE city, state

Product Category

Month Day State Product Name

Sales

Drinks Feb 26 CA Coke 12.3

Feb 26 CA Heineken 5.4

… … … … …

Feb 26 CA ALL 30.4

Feb 26 ALL Coke …

… … … …

Snacks Feb 26 CA Doritos 12.0

… … … …

Overview• Motivation

• Multi-Dimensional Data Model

• Research Areas

• Optimizations– Materializing multiple aggregates simultaneously– Materialization strategy

Research Areas• SQL language extensions

• Server architecture

• Parallel processing

• Index structures

• Materialized views

Overview• Motivation

• Multi-Dimensional Data Model

• Research Areas

• Optimizations– Materializing multiple aggregates simultaneously– Materialization strategy

Simultaneous

AggregatesMulti-Dimensional

• Optimization to calculate multiple aggregates simultaneously

• Useful for materialization of aggregate views

• Y. Zhao, P. Deshpande, J. NaughtonAn Array-Based Algorithm for Simultaneous Multidimensional AggregatesSIGMOD 1997

Multiple Aggregates

Month /Product

Feb Mar Total

Altoids 36 131 167

Coke 37 138 175

Doritos 21 136 157

Heineken 44 110 154

Pepsi 31 122 153

Pringles 37 126 164

Total 206 764 970

Product City Month Sales

Coke San Diego Feb 06 12

Pepsi Los Angeles Feb 06 13

Doritos San Diego Mar 06 72

Altoids San Diego Mar 06 65

...

Aggregate on…

Multiple Aggregates

City /Product

San Diego Los Angeles Total

Altoids 90 77 167

Coke 89 86 175

Doritos 74 83 157

Heineken 74 80 154

Pepsi 68 85 153

Pringles 73 90 164

Total 469 501 970

Month /City

Feb Mar Total

Los Angeles 112 358 469

San Diego 95 407 501

Total 206 764 970

Month /Product

Feb Mar Total

Altoids 36 131 167

Coke 37 138 175

Doritos 21 136 157

Heineken 44 110 154

Pepsi 31 122 153

Pringles 37 126 164

Total 206 764 970

Product City Month Sales

Coke San Diego Feb 06 12

Pepsi Los Angeles Feb 06 13

Doritos San Diego Mar 06 72

Altoids San Diego Mar 06 65

...

Aggregate on…

Multiple Aggregates

Product City Month Sales

Coke San Diego Feb 06 12

Pepsi Los Angeles Feb 06 13

Doritos San Diego Mar 06 72

Altoids San Diego Mar 06 65

...

1. Sales by Product / City2. Sales by Product / Month3. Sales by Month / City4. Sales by Product5. Sales by City6. Sales by Month7. Sales (Total)

Is it possible to• make a single pass over the transactional table?• calculate multiple aggregates simultaneously?

Aggregate on…

Chunking

1 2 3 4

5 6 7 8

9 10 11 12

13 14 15 16

64

20

36

42Dimension B

Dimension A

Dimension C

1

Product City Month Sales

Coke San Diego Feb 06 12

12

Array Chunk

Product

City

Month

Partition transactional data into array chunks

Naïve Algorithm

1 2 3 4

5 6 7 8

9 10 11 12

13 14 15 16

64

20

36

42

Dimension A

Dimension C

Pivot on ABaggregate on all C

Dimension A

Dimension B

Naïve Algorithm

1 2 3 4

5 6 7 8

9 10 11 12

13 14 15 16

64

20

36

42

Dimension A

Dimension C

Pivot on ABaggregate on all C

Pivot on ACaggregate on all B

Pivot on BCaggregate on all A

Dimension B

Single Pass Algorithm

1 2 3 4

5 6 7 8

9 10 11 12

13 14 15 16

64

20

36

42

Dimension A

Dimension C

B

1 2 3 4

1 2 3 4

1 2 3 4

AB

AC

BC

Make a single pass over data

Single Pass Algorithm

1 2 3 4

5 6 7 8

9 10 11 12

13 14 15 16

64

20

36

42

Dimension A

Dimension C

B

13

9 10 11 12

5 6 7 8

1 2 3 4

1 5 9 13

2 6 10 3 7 11 4 5 12

13

9 10 11 12

5 6 7 8

1 2 3 4

AB

AC

BC

Simultaneously maintain multiple aggregates

Single Pass Algorithm

1 2 3 4

5 6 7 8

9 10 11 12

13 14 15 16

64

20

36

42

Dimension A

Dimension C

B

13

9 10 11 12

5 6 7 8

1 2 3 4

1 5 9 13

2 6 10 3 7 11 4 5 12

13

9 10 11 12

5 6 7 8

1 2 3 4

AB

AC

BC

Write out completed aggregates

Single Pass Algorithm

1 2 3 4

5 6 7 8

9 10 11 12

13 14 15 16

64

20

36

42

Dimension A

Dimension C

B

13

9 10 11 12

5 6 7 8

1 2 3 4

1 5 9 13

2 6 10 3 7 11 4 5 12

13

AB

AC

BC

Only allocate memory that is necessary

Single Pass Algorithm13

9 10 11 12

5 6 7 8

1 2 3 4

1 5 9 13

2 6 10 3 7 11 4 5 12

13

AB

AC

BC

Array Chunk

ABC4 x 4 x 4

AB16 x 4 x 4

AC4 x 4 x 4

BC4 x 4

A4 x 4

B4

C4

all1

Minimum memory spanning tree

Multi Pass Algorithm

ABCD

ABC ABD ACD BCD

AB AC BC AD BD CD

A B C D

all

Recursively aggregate

Overview• Motivation

• Multi-Dimensional Data Model

• Research Areas

• Optimizations– Materializing multiple aggregates simultaneously– Materialization strategy

Implementing Data Cubes

• Biggest problem for data warehouses: the size

• Space / time trade-off:accelerate queries by materializing the cube

Implementing Data Cubes

• Biggest problem for data warehouses: the size

• Space / time trade-off:accelerate queries by materializing the cube

• The size of the relations gets even bigger!

Implementing Data Cubes

• Biggest problem for data warehouses: the size

• Space / time trade-off:accelerate queries by materializing the cube

• The size of the relations gets even bigger!

• M(ultidimensional)OLAP: good query performance, but bad scalability

• R(elational)OLAP: very scalable; query performance improved by materializing (partial) results

Implementing Data Cubes

• V. Harinarayan, A. Rajaraman, J.D. Ullman:Implementing Data Cubes EfficientlySIGMOD 1996

Presents a materialization strategy for the cells of the cube.

Implementing Data Cubes

Time Id

City Id

Product Id

Sales

Day

Month

WeekCity Id

City

State

Product Id

Name

Category

Week

Month

Year Year

Category Id

Category Name

Implementing Data Cubes

• casted as particular case of the rewriting using views problem

• what cells to materialize what SQL views to materialize

Implementing Data Cubes

• casted as particular case of the rewriting using views problem

• what cells to materialize what SQL views to materialize

p = productt = timec = city

• simple idea: Q1 depends on Q2 (Q1≤Q2) if Q1 can be fully answered using the results of Q2

ptc

pt tc pc

t p c

none

Implementing Data Cubes

• but cube dimensions are usually hierarchical

product_name

product_category

none

day

week month

year

none

city

state

none

X X

• direct-product latticep = productt = timec = city

ptc

pt tc pc

ptspwc

pyc

pmc

ps

pcatt

… ……

Implementing Data Cubes

• Def. cost of answering Q = # of rows in the table of ancestor(Q)

• It can be estimated w/o materializing the views

• Assume that all queries are identical to some view in the lattice

Implementing Data Cubes

• For a set S and a view vB(v,S) = ∑w≤v, (w not in S) max{cost(w)-cost(v), 0}

• Greedy algorithm for selecting k views to materialize from the lattice:

1. S := {top view}

2. For i=1 to k, add v to S s.t. B(v,S) is maximized

• The greedy algorithm is an (e-1)/e ≈ 0.63 approx. of the optimum.

Discussion

• Questions from the audience…

top related