Database Systems
15-445/15-645
Fall 2018
Andy Pavlo
Computer Science, Carnegie Mellon Univ.
Lecture #24: Distributed OLAP Databases
CMU 15-445/645 (Fall 2018)
UPCOMING DATABASE EVENTS
Swarm64 Tech Talk
→ Thursday November 29th @ 12pm
→ GHC 8102 ← Different Location!

VoltDB Research Talk
→ Monday December 3rd @ 4:30pm
→ GHC 8102
OLTP VS. OLAP
On-line Transaction Processing (OLTP):
→ Short-lived read/write txns.
→ Small footprint.
→ Repetitive operations.

On-line Analytical Processing (OLAP):
→ Long-running, read-only queries.
→ Complex joins.
→ Exploratory queries.
BIFURCATED ENVIRONMENT
[Diagram: OLTP Databases → Extract, Transform, Load → OLAP Database]
DECISION SUPPORT SYSTEMS
Applications that serve the management, operations, and planning levels of an organization to help people make decisions about future issues and problems by analyzing historical data.
Star Schema vs. Snowflake Schema
STAR SCHEMA

SALES_FACT: PRODUCT_FK, TIME_FK, LOCATION_FK, CUSTOMER_FK, PRICE, QUANTITY
PRODUCT_DIM: CATEGORY_NAME, CATEGORY_DESC, PRODUCT_CODE, PRODUCT_NAME, PRODUCT_DESC
LOCATION_DIM: COUNTRY, STATE_CODE, STATE_NAME, ZIP_CODE, CITY
CUSTOMER_DIM: ID, FIRST_NAME, LAST_NAME, EMAIL, ZIP_CODE
TIME_DIM: YEAR, DAY_OF_YEAR, MONTH_NUM, MONTH_NAME, DAY_OF_MONTH
SNOWFLAKE SCHEMA

SALES_FACT: PRODUCT_FK, TIME_FK, LOCATION_FK, CUSTOMER_FK, PRICE, QUANTITY
PRODUCT_DIM: CATEGORY_FK, PRODUCT_CODE, PRODUCT_NAME, PRODUCT_DESC
CAT_LOOKUP: CATEGORY_ID, CATEGORY_NAME, CATEGORY_DESC
LOCATION_DIM: COUNTRY, STATE_FK, ZIP_CODE, CITY
STATE_LOOKUP: STATE_ID, STATE_CODE, STATE_NAME
CUSTOMER_DIM: ID, FIRST_NAME, LAST_NAME, EMAIL, ZIP_CODE
TIME_DIM: YEAR, DAY_OF_YEAR, MONTH_FK, DAY_OF_MONTH
MONTH_LOOKUP: MONTH_NUM, MONTH_NAME, MONTH_SEASON
STAR VS. SNOWFLAKE SCHEMA

Issue #1: Normalization
→ Snowflake schemas take up less storage space.
→ Denormalized data models may incur integrity and consistency violations.

Issue #2: Query Complexity
→ Snowflake schemas require more joins to get the data needed for a query.
→ Queries on star schemas will (usually) be faster.
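The query-complexity trade-off can be seen with a minimal sqlite3 sketch (the table names and values below are hypothetical, not the course's dataset): the star schema stores the category name directly in the dimension table, while the snowflake schema needs an extra join through a lookup table to recover it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Star schema: category attributes denormalized into the dimension table.
cur.execute("CREATE TABLE product_dim_star (id INTEGER, name TEXT, category_name TEXT)")
# Snowflake schema: category attributes normalized into a lookup table.
cur.execute("CREATE TABLE product_dim_snow (id INTEGER, name TEXT, category_fk INTEGER)")
cur.execute("CREATE TABLE cat_lookup (id INTEGER, category_name TEXT)")
cur.execute("CREATE TABLE sales_fact (product_fk INTEGER, price REAL)")

cur.execute("INSERT INTO cat_lookup VALUES (1, 'Beverages')")
cur.execute("INSERT INTO product_dim_star VALUES (10, 'Coffee', 'Beverages')")
cur.execute("INSERT INTO product_dim_snow VALUES (10, 'Coffee', 1)")
cur.execute("INSERT INTO sales_fact VALUES (10, 3.50)")

# Star: one join from the fact table gets the category name.
star = cur.execute("""
    SELECT p.category_name, SUM(f.price)
      FROM sales_fact f JOIN product_dim_star p ON f.product_fk = p.id
     GROUP BY p.category_name""").fetchall()

# Snowflake: an extra join through the lookup table for the same answer.
snow = cur.execute("""
    SELECT c.category_name, SUM(f.price)
      FROM sales_fact f
      JOIN product_dim_snow p ON f.product_fk = p.id
      JOIN cat_lookup c ON p.category_fk = c.id
     GROUP BY c.category_name""").fetchall()
```

Both queries return the same rows; the snowflake version simply pays for one more join.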
PROBLEM SETUP

[Diagram: An Application Server issues the query against four partitions P1-P4.]

SELECT * FROM R JOIN S ON R.id = S.id
TODAY'S AGENDA
Execution Models
Query Planning
Distributed Join Algorithms
Cloud Systems
PUSH VS. PULL
Approach #1: Push Query to Data
→ Send the query (or a portion of it) to the node that contains the data.
→ Perform as much filtering and processing as possible where the data resides before transmitting over the network.

Approach #2: Pull Data to Query
→ Bring the data to the node that is executing the query that needs it for processing.
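A toy cost model (all sizes below are made-up assumptions, not measured numbers) shows why pushing the query usually wins when predicates are selective: only the filtered result crosses the network, instead of every page of the partition.

```python
TUPLE_SIZE = 100           # bytes per tuple (assumed)
PARTITION_TUPLES = 10_000  # tuples stored at the remote node (assumed)
MATCHING_TUPLES = 50       # tuples that satisfy the query's predicate (assumed)

def push_query_to_data() -> int:
    # Remote node evaluates the predicate locally; only results move.
    return MATCHING_TUPLES * TUPLE_SIZE

def pull_data_to_query() -> int:
    # Executing node fetches the whole partition, then filters it.
    return PARTITION_TUPLES * TUPLE_SIZE

push_bytes = push_query_to_data()   # 5,000 bytes over the network
pull_bytes = pull_data_to_query()   # 1,000,000 bytes over the network
```

With an unselective predicate the gap shrinks, which is why the choice depends on the query.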
PUSH QUERY TO DATA
[Diagram: The Application Server sends the query to both nodes. Node 2 (P2 → ID:101-200) computes R ⨝ S for IDs [101,200] locally and sends it to Node 1 (P1 → ID:1-100), which returns the final result R ⨝ S.]

SELECT * FROM R JOIN S ON R.id = S.id
PULL DATA TO QUERY

[Diagram: Node 1 (P1 → ID:1-100) pulls pages (Page ABC, Page XYZ) from Node 2's storage (P2 → ID:101-200), computes R ⨝ S for IDs [101,200] locally, and returns the final result R ⨝ S.]

SELECT * FROM R JOIN S ON R.id = S.id
FAULT TOLERANCE
Traditional distributed OLAP DBMSs were designed to assume that nodes will not fail during query execution.
→ If a node fails during query execution, then the whole query fails.

The DBMS could take a snapshot of the intermediate results for a query during execution to allow it to recover after a crash.
QUERY PLANNING

All the optimizations that we talked about before are still applicable in a distributed environment.
→ Predicate Pushdown
→ Early Projections
→ Optimal Join Orderings

But now the DBMS must also consider the location of data at each partition when optimizing.
QUERY PLAN FRAGMENTS

Approach #1: Physical Operators
→ Generate a single query plan and then break it up into partition-specific fragments.
→ Most systems implement this approach.

Approach #2: SQL
→ Rewrite the original query into partition-specific queries.
→ Allows for local optimization at each node.
→ MemSQL is the only system that I know of that does this.
QUERY PLAN FRAGMENTS

Original query:
SELECT * FROM R JOIN S ON R.id = S.id

Partition-specific fragments:
Id:1-100   → SELECT * FROM R JOIN S ON R.id = S.id WHERE R.id BETWEEN 1 AND 100
Id:101-200 → SELECT * FROM R JOIN S ON R.id = S.id WHERE R.id BETWEEN 101 AND 200
Id:201-300 → SELECT * FROM R JOIN S ON R.id = S.id WHERE R.id BETWEEN 201 AND 300
Union the output of each join together to produce the final result.
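The SQL-rewriting approach above can be sketched as a small helper that appends each partition's key range to the original query. The function name and the fixed `R.id` column are illustrative assumptions, not how any particular system implements it.

```python
def make_fragments(sql: str, ranges: list[tuple[int, int]]) -> list[str]:
    """Rewrite one query into partition-specific queries, one per key range."""
    return [f"{sql} WHERE R.id BETWEEN {lo} AND {hi}" for lo, hi in ranges]

# One fragment per partition; their outputs are unioned for the final result.
frags = make_fragments(
    "SELECT * FROM R JOIN S ON R.id = S.id",
    [(1, 100), (101, 200), (201, 300)],
)
```

Each fragment can then be optimized and executed locally at its partition's node.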
OBSERVATION
The efficiency of a distributed join depends on the target tables' partitioning schemes.

One approach is to put entire tables on a single node and then perform the join.
→ You lose the parallelism of a distributed DBMS.
→ Costly data transfer over the network.
DISTRIBUTED JOIN ALGORITHMS
To join tables R and S, the DBMS needs to get the proper tuples on the same node.
Once there, it then executes the same join algorithms that we discussed earlier in the semester.
SCENARIO #1

One table is replicated at every node. Each node joins its local data and then sends its results to a coordinating node.

Node 1: R (Id:1-100) + S (replicated) → P1: R⨝S
Node 2: R (Id:101-200) + S (replicated) → P2: R⨝S
A coordinating node combines P1 and P2 into the final R⨝S.

SELECT * FROM R JOIN S ON R.id = S.id
SCENARIO #2

Tables are partitioned on the join attribute. Each node performs the join on local data and then sends its results to a node for coalescing.

Node 1: R (Id:1-100) + S (Id:1-100) → P1: R⨝S
Node 2: R (Id:101-200) + S (Id:101-200) → P2: R⨝S

SELECT * FROM R JOIN S ON R.id = S.id
SCENARIO #3

Both tables are partitioned on different keys. If one of the tables is small, then the DBMS broadcasts that table to all nodes.

Node 1: R (Id:1-100) + S (Val:1-50)  ← receives a full copy of S → P1: R⨝S
Node 2: R (Id:101-200) + S (Val:51-100) ← receives a full copy of S → P2: R⨝S

SELECT * FROM R JOIN S ON R.id = S.id
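A minimal in-memory sketch of this broadcast join (data values and function names are made up for illustration): each node builds a hash table on its full copy of the small table S, probes it with the local R partition, and the per-node results are unioned.

```python
def local_hash_join(r_part: list[dict], s_full: list[dict]) -> list[tuple]:
    # Build a hash table on the broadcast (small) table, probe with local R.
    table: dict[int, list[dict]] = {}
    for s in s_full:
        table.setdefault(s["id"], []).append(s)
    return [(r, s) for r in r_part for s in table.get(r["id"], [])]

# R is partitioned on id across two nodes; S is partitioned on a different
# key, so the DBMS ships a full copy of S to every node.
r_parts = [[{"id": 1}, {"id": 2}], [{"id": 101}]]
s_full = [{"id": 2, "val": 7}, {"id": 101, "val": 9}]

# Union of the per-node join outputs.
result = [pair for part in r_parts for pair in local_hash_join(part, s_full)]
```

Broadcasting costs one copy of S per node, which is cheap only when S is small.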
SCENARIO #4

Both tables are not partitioned on the join key. The DBMS copies the tables by reshuffling them across nodes.

Before: Node 1 holds R (Name:A-M) + S (Val:1-50); Node 2 holds R (Name:N-Z) + S (Val:51-100)
After shuffling on Id: Node 1 holds R (Id:1-100) + S (Id:1-100) → P1: R⨝S; Node 2 holds R (Id:101-200) + S (Id:101-200) → P2: R⨝S
The partial results are then combined into the final R⨝S.

SELECT * FROM R JOIN S ON R.id = S.id
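The reshuffle can be sketched as routing every tuple to a node by hashing its join key, after which matching tuples are guaranteed to be co-located and each node joins locally. The data and the `shuffle` helper are illustrative assumptions, not a real system's implementation.

```python
def shuffle(tuples: list[dict], key: str, n_nodes: int) -> list[list[dict]]:
    """Route each tuple to the node chosen by hashing its join key."""
    nodes: list[list[dict]] = [[] for _ in range(n_nodes)]
    for t in tuples:
        nodes[hash(t[key]) % n_nodes].append(t)
    return nodes

# Neither table is partitioned on id, so both must be reshuffled on it.
r = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Zed"}]
s = [{"id": 2, "val": 10}, {"id": 3, "val": 20}]

r_nodes, s_nodes = shuffle(r, "id", 2), shuffle(s, "id", 2)

# After the shuffle, each node joins its local R and S fragments.
joined = [(rt, st)
          for rp, sp in zip(r_nodes, s_nodes)
          for rt in rp for st in sp if rt["id"] == st["id"]]
```

This is the most expensive scenario: both tables cross the network, not just one.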
RELATIONAL ALGEBRA: SEMI-JOIN

Like a natural join, except that the attributes that are not used to compute the join are restricted (omitted from the output).

Syntax: (R ⋉ S)

R(a_id, b_id, xxx):     S(a_id, b_id, yyy):
a1  101  X1             a3  103  Y1
a2  102  X2             a4  104  Y2
a3  103  X3             a5  105  Y3

(R ⋉ S):
a3  103

Distributed DBMSs use semi-joins to minimize the amount of data sent during joins. This is the same as a projection pushdown.
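The slide's example can be replayed as a short sketch: instead of shipping full S tuples to R's node, ship only S's join-key columns, then keep the R rows whose keys appear in that set.

```python
# Tuples from the slide's example: R(a_id, b_id, xxx) and S(a_id, b_id, yyy).
R = [("a1", 101, "X1"), ("a2", 102, "X2"), ("a3", 103, "X3")]
S = [("a3", 103, "Y1"), ("a4", 104, "Y2"), ("a5", 105, "Y3")]

# Step 1: project S down to its join attributes (cheap to transmit).
s_keys = {(a, b) for (a, b, _) in S}

# Step 2: at R's node, keep only the join attributes of R tuples with a match.
r_semi = [(a, b) for (a, b, _) in R if (a, b) in s_keys]  # (R ⋉ S)
```

Only the small `r_semi` result then needs to travel for the full join, which is the point of the technique.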
CLOUD SYSTEMS
Vendors provide database-as-a-service (DBaaS) offerings that are managed DBMS environments.
Newer systems are starting to blur the lines between shared-nothing and shared-disk.
CLOUD SYSTEMS

Approach #1: Managed DBMSs
→ No significant modification to the DBMS to make it "aware" that it is running in a cloud environment.
→ Examples: Most vendors

Approach #2: Cloud-Native DBMS
→ The system is designed explicitly to run in a cloud environment.
→ Usually based on a shared-disk architecture.
→ Examples: Snowflake, Google BigQuery, Amazon Redshift, Microsoft SQL Azure
UNIVERSAL FORMATS
Traditional DBMSs store data in proprietary binary file formats that are incompatible.
One can use text formats (XML/JSON/CSV) to share data across different systems.
There are now standardized file formats.
UNIVERSAL FORMATS
Apache Parquet
→ Compressed columnar storage from Cloudera/Twitter.

Apache ORC
→ Compressed columnar storage from Apache Hive.

HDF5
→ Multi-dimensional arrays for scientific workloads.

Apache Arrow
→ In-memory compressed columnar storage from Pandas/Dremio.
CONCLUSION
Again, efficient distributed OLAP systems are difficult to implement.
More data, more problems…
NEXT CLASS
VoltDB Guest Speaker