Data modelingzone geoffrey-clark-v2

Nov 01, 2014

Geoffrey Clark

I presented these slides at Data Modeling Zone Europe on September 24 2013.
Transcript
Page 1: Data modelingzone geoffrey-clark-v2

Physical Database Design for MPP and Columnar Databases

Geoffrey Clark, Principal at Lucidata, Inc.

September 2013

Copyright Lucidata, 2013

Page 2: Data modelingzone geoffrey-clark-v2

Conceptual, Logical, Physical

• Conceptual links to Business Strategy.
  – This is now becoming more quantitative
• Logical maps to the Business Semantics.
  – Con-way example
• Physical maps to your Data Stores
  – These will be more varied and heterogeneous in the future, due to specialization.

Page 3: Data modelingzone geoffrey-clark-v2

HBR Business Strategy

The New Dynamics of Competition, Michael D. Ryall, Harvard Business Review, June 2013

Michael Porter’s Five Forces has dominated strategic and competitive analysis since 1979. This analysis has largely been conceptual in nature.

Quantitative analysis on structured data in context is changing the nature of business culture, and improving business decisions.

This drives the demand for data modeling and management.

Page 4: Data modelingzone geoffrey-clark-v2

Design and Evolution

• Hierarchies
  – 14th Century Europe and the Financial Revolution
  – Aggregations & Allocations
• Cards, Tapes – physical analog media
• Computer Science
  – Moore’s Law
    • Processor Speed Improvements
    • Memory Improvements
    • Media Improvements – Punch Cards, Tape, Disk, Memory
• Design for Context & the Future
  – Character encoding - Internationalization
  – Calendars – Gregorian, Fiscal, Lunar, ... Y2K?
• Files and Fields
  – Separation of Data and Metadata
  – Modern versions -> XML, JSON
• Joins!
  – Data Sets – Super types, Sub types
  – Associations describe Networks!

Page 5: Data modelingzone geoffrey-clark-v2

Technology’s Improvement Pace

Page 6: Data modelingzone geoffrey-clark-v2

... and Demand Forecast

Page 7: Data modelingzone geoffrey-clark-v2

Separation of Church and State

• Operational uses
  – Capture the data, hand-entered <- validation
  – A Data Flow, such as Order to Cash cycle
  – Con-way example of PRO(-gressive) numbers
• Analytical uses
  – Desire for reports, Reporting crashes the Operational cycle, Cash flow problem.
  – Banished from OLTP, go make an ODS

Page 8: Data modelingzone geoffrey-clark-v2

The Star Schema

The purpose of business computers is to sort data. A graphical representation of sorted data is called a ‘Star Schema’.
– Michael Silves, Principal at Datamorphosis

• The right design at the right time, becomes default doctrine for DW
  – Early RDBMS (Relational Data Base Management Systems)
    • Low memory, slow disks, slow CPU
    • Big Demand, with questions that spanned the datasets
    • Performance issues over large datasets
  – Interview Business people to get questions
    • Pre-process the data, based on business questions
  – Separation into Dimensions and Facts/Metrics
    • Link to Business Semantics
    • OLAP (On-Line Analytical Processing)
    • Educate Users on Aggregation and Allocation
    • Conformed Dimensions across Departments to give an Enterprise-wide view of the data.
• But as technology changes, problems emerge
  – Ad-hoc questions require redesign & rework
  – With business hierarchies when one concept is both a fact & dimension, e.g. Shipment
  – Fact tables become difficult to distribute for MPP ... e.g. Teradata prefers a normalized DW
• Example – transportation networks
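
The separation into Dimensions and Facts/Metrics can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names (dim_customer, fact_shipment), the columns, and the sample rows are hypothetical and do not come from the slides.

```python
import sqlite3


def revenue_by_region():
    """Build a tiny, hypothetical star schema in memory and aggregate a fact by a dimension attribute."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Dimension: descriptive attributes the business uses to slice data.
        CREATE TABLE dim_customer (
            customer_key  INTEGER PRIMARY KEY,
            customer_name TEXT,
            region        TEXT
        );
        -- Fact: one row per event, holding keys to dimensions plus metrics.
        CREATE TABLE fact_shipment (
            shipment_id  INTEGER PRIMARY KEY,
            customer_key INTEGER REFERENCES dim_customer(customer_key),
            revenue      REAL
        );
        """
    )
    conn.executemany(
        "INSERT INTO dim_customer VALUES (?, ?, ?)",
        [(1, "Acme", "West"), (2, "Globex", "East"), (3, "Initech", "West")],
    )
    conn.executemany(
        "INSERT INTO fact_shipment VALUES (?, ?, ?)",
        [(100, 1, 250.0), (101, 2, 400.0), (102, 3, 150.0), (103, 1, 100.0)],
    )
    # A typical star query: join the fact to a dimension, group by an attribute.
    rows = conn.execute(
        """
        SELECT d.region, SUM(f.revenue)
        FROM fact_shipment AS f
        JOIN dim_customer AS d ON d.customer_key = f.customer_key
        GROUP BY d.region
        ORDER BY d.region
        """
    ).fetchall()
    conn.close()
    return rows
```

The query shape is fixed by the questions the schema was designed for. An ad-hoc question that needs a different grain or a new relationship usually means reworking the tables, which is the problem the last bullets raise.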

Page 9: Data modelingzone geoffrey-clark-v2

Example – Multi-Modal Freight

• Shipments are agreements between a Carrier and a Shipper to move goods between two places.

• Shipments can be split into “ProFreight” (which is assigned a cost via activity-based costing).

• Shipments/ProFreight are composed of Freight handling units.

• Freight can be “re-tendered” to another carrier, in which case it is linked to the original and the new Shipment.

• Freight moves between places on one or many “VFCs” or Containers.

• Containers are moved between places on Trips.
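
The relationships listed above can be sketched as plain Python dataclasses. This is a rough sketch under stated assumptions: every class, field, and function name is hypothetical, ProFreight and its activity-based cost are left out for brevity, and re-tendering is modeled simply by letting the same freight unit appear on two shipments.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Trip:
    origin: str
    destination: str


@dataclass
class Container:
    container_id: str
    trips: List[Trip] = field(default_factory=list)


@dataclass
class FreightUnit:
    unit_id: str
    containers: List[Container] = field(default_factory=list)


@dataclass
class Shipment:
    shipment_id: str
    carrier: str
    shipper: str
    origin: str
    destination: str
    freight: List[FreightUnit] = field(default_factory=list)
    retendered_from: Optional["Shipment"] = None


def retender(original: Shipment, new_carrier: str, units: List[FreightUnit]) -> Shipment:
    """Create a new shipment for another carrier; the units stay linked to both shipments."""
    return Shipment(
        shipment_id=original.shipment_id + "-R",  # hypothetical naming rule
        carrier=new_carrier,
        shipper=original.shipper,
        origin=original.origin,
        destination=original.destination,
        freight=list(units),
        retendered_from=original,
    )


def shipments_for(unit: FreightUnit, shipments: List[Shipment]) -> List[str]:
    """List the ids of every shipment that carries this exact freight unit."""
    return [s.shipment_id for s in shipments if any(u is unit for u in s.freight)]


def legs(unit: FreightUnit) -> List[Tuple[str, str]]:
    """Return the (origin, destination) legs a unit travels via its containers' trips."""
    return [(t.origin, t.destination) for c in unit.containers for t in c.trips]
```

Because one freight unit can belong to several shipments and ride in several containers, the model is a network of many-to-many links rather than a single hierarchy.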

Page 10: Data modelingzone geoffrey-clark-v2

Kimball on Transportation, 3NF

Page 11: Data modelingzone geoffrey-clark-v2

Kimball on Transportation, Star

Page 12: Data modelingzone geoffrey-clark-v2

Table Level DW diagram

Page 13: Data modelingzone geoffrey-clark-v2

Dim Modeling Dogma

• “Our carefully normalized data model cannot be translated into a star schema...”
  – Dimensional modeling is necessary in order to generate correct queries
  – Any (normalized) data model can be transformed into a dimensional model...
  – ... and there exists an algorithm to do it

Page 14: Data modelingzone geoffrey-clark-v2

Dim Modeling Example

Page 15: Data modelingzone geoffrey-clark-v2

Star option considered

Page 16: Data modelingzone geoffrey-clark-v2

Bridge table (remember, we tried this)

We tried this with hesmith. When selecting a main hierarchy has too much of a downside, and you don’t have a weight factor …

Page 17: Data modelingzone geoffrey-clark-v2

Multi-fact option considered

Page 18: Data modelingzone geoffrey-clark-v2

Oracle’s Algorithmic approach

Page 19: Data modelingzone geoffrey-clark-v2

Basic DW diagram

Page 20: Data modelingzone geoffrey-clark-v2

Build Dimensional Model in BI

Page 21: Data modelingzone geoffrey-clark-v2

Freight moves through Networks

Page 22: Data modelingzone geoffrey-clark-v2

Information Factory & MPP

• Normalized Base
  – Integrate data once
    • Source -> Normalized -> Denormalized -> OK
    • Source -> Denormalized? -> Un-normalized -> ?
  – Detect problems and fix them once!
• Does not preclude Data Marts
• Massive Parallel Processing
  – Data distribution
    • Optimizations – Broadcast, Co-location, Re-distribution
    • Scalability, the quest for 1:1
    • Normalized data - reduced IO, better match for
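
Data distribution and co-location can be illustrated with a small sketch. It assumes a hash-based placement rule (CRC32 of the key modulo the node count), which is a stand-in chosen for this example and not any particular vendor's algorithm; the function names are likewise hypothetical.

```python
import zlib
from collections import defaultdict


def node_for(key, num_nodes):
    """Deterministic placement: the same key value always lands on the same node."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def distribute(rows, key_column, num_nodes):
    """Spread rows (dicts) across nodes according to the hash of one column."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for(row[key_column], num_nodes)].append(row)
    return nodes


def colocated_join(left_nodes, right_nodes, key_column, num_nodes):
    """Join node by node without moving any rows.

    This is only valid when both inputs were distributed on key_column with
    the same num_nodes, so matching rows are guaranteed to share a node.
    """
    joined = []
    for n in range(num_nodes):
        right_by_key = defaultdict(list)
        for r in right_nodes.get(n, []):
            right_by_key[r[key_column]].append(r)
        for l in left_nodes.get(n, []):
            for r in right_by_key.get(l[key_column], []):
                joined.append({**l, **r})
    return joined
```

If the two tables were distributed on different columns, the same join would first need one side re-distributed on the join key, or a small table broadcast to every node. Those are the other two optimizations named on the slide.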

Page 23: Data modelingzone geoffrey-clark-v2

Bob Conway’s Rapid Methodology

Page 24: Data modelingzone geoffrey-clark-v2

Core Model with many Roles

Transaction Tables

Reference Tables

Page 25: Data modelingzone geoffrey-clark-v2

Power of Conformed Dimensions

Page 26: Data modelingzone geoffrey-clark-v2

Example Data Model & Hierarchy

Page 27: Data modelingzone geoffrey-clark-v2

Data Flow and Usage

Page 28: Data modelingzone geoffrey-clark-v2

Cubes and In-memory BI

• Multi-Dimensional OLAP (MOLAP)
  – Drag-and-Drop OLAP environment, analysts become capable of self-service.
  – Dealt with Ragged Hierarchies, common in Financial data such as General Ledger (GL)
  – Limited by memory size
  – Pressure for more dimensionality floods cube size, build times from relational sources exceed load windows ...
• Relational OLAP (ROLAP)
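
The MOLAP point about dimensionality flooding cube size comes down to simple arithmetic: the number of cells in a dense cube is the product of the member counts of its dimensions. The sketch below uses made-up cardinalities and gives an upper bound only, since real engines store sparse cubes more compactly.

```python
from math import prod


def dense_cube_cells(cardinalities):
    """Upper bound on cube cells: one cell per combination of dimension members."""
    return prod(cardinalities)


# Hypothetical dimensions: 365 days, 1,000 customers, 50 terminals.
three_dims = dense_cube_cells([365, 1000, 50])
# Adding one more dimension with 200 members multiplies the bound by 200.
four_dims = dense_cube_cells([365, 1000, 50, 200])
```

Each added dimension multiplies the bound rather than adding to it, so memory limits and build times are reached quickly.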

Page 29: Data modelingzone geoffrey-clark-v2

But a network this size choked it

Page 30: Data modelingzone geoffrey-clark-v2

Columnar vs Row-wise

• Physically store data by Column vs Row
  – Rather like Fifth Normal Form.
  – If Semantically Organized, then Rapid Response to user’s ad-hoc aggregation requests.
  – Prefers batch loading, always loads once per column, even if loading one row.
• Continues to Appear and Operate as a normal Row-wise cousin.
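
The difference between the two layouts can be sketched in a few lines of Python. The code assumes every row has the same keys and uses in-memory lists as a stand-in for on-disk column files; the function names are hypothetical.

```python
def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def append_row(columns, row):
    """Append one row; returns how many column structures had to be written."""
    for name in columns:
        columns[name].append(row[name])
    return len(columns)


def total(columns, name):
    """An aggregate reads only the single column it needs."""
    return sum(columns[name])
```

Here `total` touches one list no matter how wide the table is, while `append_row` writes to every column even for a single row. That is the trade-off behind the preference for batch loading.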

Page 31: Data modelingzone geoffrey-clark-v2

Columnar IO example

Compression becomes much more effective

Reading a Column is like reading a Row
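
One reason compression works better on columns is that a column holds values of one type that often repeat, especially when sorted. Run-length encoding is used below purely as an illustration; the slide does not name a specific compression scheme, and the sample values are invented.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse each run of equal adjacent values into a (value, count) pair."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


# A sorted, low-cardinality column compresses to a handful of pairs.
region_column = ["East", "East", "East", "West", "West"]
encoded = rle_encode(region_column)
```

In a row-wise layout the same region values are interleaved with other fields of each row, so long runs like these rarely form.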

Page 32: Data modelingzone geoffrey-clark-v2

Design Pattern for Log Data

Data Stewards for Master Data

Data Stewards for Metadata

Architects integrate data and metadata

Architects organize data for analysis with physical in mind

Architects identify levels for analysis, and distribution

Columnar

MPP

Page 33: Data modelingzone geoffrey-clark-v2

Importance of Reference Data

Page 34: Data modelingzone geoffrey-clark-v2

Infobright’s Database Landscape 2011

Page 35: Data modelingzone geoffrey-clark-v2

Analytic Database Comparison

• Actian ParAccel
• IBM Netezza
• HP Vertica
• Greenplum
• Teradata
• Sybase IQ

Page 36: Data modelingzone geoffrey-clark-v2

Gartner’s Magic Quadrant

Page 37: Data modelingzone geoffrey-clark-v2

Hadoop (Cloudera & Hortonworks)

“Although it’s true that Hadoop can be valuable as an analytic silo, most organizations will prefer to get the most business value out of Hadoop by integrating it with—or into—their BI, DW, DI, and analytics technology stacks.” – Philip Russom TDWI http://tdwi.org/webcasts/2013/04/integrating-hadoop-into-business-intelligence-and-data-warehousing.aspx

Page 38: Data modelingzone geoffrey-clark-v2

Hadoop for Analytics?

Analytics performs best on Structured Data, for good reasons.

Maintain MPP strengths in the solution through Architecture.

Page 39: Data modelingzone geoffrey-clark-v2

Message from Hortonworks (Hadoop)

“Although it’s true that Hadoop can be valuable as an analytic silo, most organizations will prefer to get the most business value out of Hadoop by integrating it with—or into—their BI, DW, DI, and analytics technology stacks.” – Philip Russom TDWI http://tdwi.org/webcasts/2013/04/integrating-hadoop-into-business-intelligence-and-data-warehousing.aspx

Page 40: Data modelingzone geoffrey-clark-v2

Hadoop as ETL

Page 41: Data modelingzone geoffrey-clark-v2

Data Flow Reference Architecture

Page 42: Data modelingzone geoffrey-clark-v2

Message from Neo4J NoSQL

Page 43: Data modelingzone geoffrey-clark-v2

Message from MongoDB (NoSQL)

http://www.slideshare.net/fullscreen/mongodb/schema-design-by-example/1

Page 44: Data modelingzone geoffrey-clark-v2

Message from Couchbase (NoSQL)

http://www.couchbase.com/why-nosql/nosql-database