Physical Database Design for MPP and Columnar Databases
Geoffrey Clark, Principal at Lucidata, Inc.
September 2013
Copyright, Lucidata, 2013
Nov 01, 2014
Conceptual, Logical, Physical
• Conceptual links to Business Strategy.
  – This is now becoming more quantitative
• Logical maps to the Business Semantics.
  – Con-way example
• Physical maps to your Data Stores
  – These will be more varied and heterogeneous in the future, due to specialization.
HBR Business Strategy
The New Dynamics of Competition, Michael D. Ryall, Harvard Business Review, June 2013
Michael Porter’s Five Forces has dominated strategic and competitive analysis since 1979. This analysis has largely been conceptual in nature.
Quantitative analysis on structured data in context is changing the nature of business culture, and improving business decisions.
This drives the demand for data modeling and management.
Design and Evolution
• Hierarchies
  – 14th Century Europe and the Financial Revolution
  – Aggregations & Allocations
• Cards, Tapes – physical analog media
• Computer Science
  – Moore’s Law
    • Processor Speed Improvements
    • Memory Improvements
    • Media Improvements – Punch Cards, Tape, Disk, Memory
• Design for Context & the Future
  – Character encoding - Internationalization
  – Calendars – Gregorian, Fiscal, Lunar, ... Y2K?
• Files and Fields
  – Separation of Data and Metadata
  – Modern versions -> XML, JSON
• Joins!
  – Data Sets – Super types, Sub types
  – Associations describe Networks!
Technology’s Improvement Pace
... and Demand Forecast
Separation of Church and State
• Operational uses
  – Capture the data, hand-entered <- validation
  – A Data Flow, such as Order to Cash cycle
  – Con-way example of PRO(-gressive) numbers
• Analytical uses
  – Desire for reports, Reporting crashes the Operational cycle, Cash flow problem.
  – Banished from OLTP, go make an ODS
The Star Schema
The purpose of business computers is to sort data. A graphical representation of sorted data is called a ‘Star Schema’.
– Michael Silves, Principal at Datamorphosis
• The right design at the right time, becomes default doctrine for DW
  – Early RDBMS (Relational Data Base Management Systems)
    • Low memory, slow disks, slow CPU
    • Big Demand, with questions that spanned the datasets
    • Performance issues over large datasets
  – Interview Business people to get questions
    • Pre-process the data, based on business questions
  – Separation into Dimensions and Facts/Metrics
    • Link to Business Semantics
    • OLAP (On-Line Analytical Processing)
    • Educate Users on Aggregation and Allocation
    • Conformed Dimensions across Departments to give an Enterprise-wide view of the data.
• But as technology changes, problems emerge
  – Ad-hoc questions require redesign & rework
  – With business hierarchies when one concept is both a fact & dimension, e.g. Shipment
  – Fact tables become difficult to distribute for MPP ... e.g. Teradata prefers a normalized DW
• Example – transportation networks
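The separation into Dimensions and Facts/Metrics can be illustrated with a minimal sketch. The table names, columns, and sample rows below are hypothetical and are not taken from the presentation; the point is only that a fact table holds the measures, and each question is answered by joining to dimension tables and aggregating.

```python
import sqlite3

# Hypothetical star schema: one fact table keyed to two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, fiscal_quarter TEXT);
CREATE TABLE fact_shipment (
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    revenue REAL
);
INSERT INTO dim_customer VALUES (1, 'West'), (2, 'East');
INSERT INTO dim_date VALUES (10, '2013-Q1'), (11, '2013-Q2');
INSERT INTO fact_shipment VALUES
    (1, 10, 100.0), (1, 11, 150.0), (2, 10, 200.0), (2, 10, 50.0);
""")

# Aggregate the fact measure along two dimension attributes.
rows = conn.execute("""
    SELECT c.region, d.fiscal_quarter, SUM(f.revenue)
    FROM fact_shipment AS f
    JOIN dim_customer AS c ON c.customer_id = f.customer_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY c.region, d.fiscal_quarter
    ORDER BY c.region, d.fiscal_quarter
""").fetchall()

for region, quarter, revenue in rows:
    print(region, quarter, revenue)
```

A question that was not anticipated when the dimensions were chosen (for example, revenue by container) would require a new dimension or a redesigned fact table, which is the ad-hoc rework problem noted in the list above.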
Example – Multi-Modal Freight
• Shipments are agreements between a Carrier and a Shipper to move goods between two places.
• Shipments can be split into “ProFreight” (which is assigned a cost via activity-based costing).
• Shipments/ProFreight are composed of Freight handling units.
• Freight can be “re-tendered” to another carrier, in which case it is linked to the original and the new Shipment.
• Freight moves between places on one or many “VFCs” or Containers.
• Containers are moved between places on Trips.
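The relationships listed above can be sketched as a small set of record types. This is an illustrative reading of the bullets, not the presenter's actual model: every class name, field name, and sample value is an assumption introduced for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Shipment:
    """Agreement between a Carrier and a Shipper to move goods between two places."""
    shipment_id: str
    carrier: str
    shipper: str
    origin: str
    destination: str


@dataclass
class ProFreight:
    """A split of a Shipment that is assigned a cost (activity-based costing)."""
    pro_number: str
    shipment: Shipment
    cost: float


@dataclass
class Trip:
    """Movement of a Container between two places."""
    trip_id: str
    origin: str
    destination: str


@dataclass
class Container:
    container_id: str
    trips: List[Trip] = field(default_factory=list)


@dataclass
class FreightUnit:
    """A freight handling unit; it keeps a link to its original Shipment if re-tendered."""
    unit_id: str
    shipment: Shipment
    original_shipment: Optional[Shipment] = None
    containers: List[Container] = field(default_factory=list)

    def retender(self, new_shipment: Shipment) -> None:
        # Remember the first shipment only once, then point at the new one.
        if self.original_shipment is None:
            self.original_shipment = self.shipment
        self.shipment = new_shipment

    def legs(self) -> List[Tuple[str, str]]:
        # Every (origin, destination) pair the unit travels, container by container.
        return [(t.origin, t.destination) for c in self.containers for t in c.trips]


first = Shipment("S1", "Carrier A", "Shipper X", "Portland", "Chicago")
second = Shipment("S2", "Carrier B", "Shipper X", "Portland", "Chicago")
pro = ProFreight("PRO-1", first, cost=42.0)
unit = FreightUnit("F1", first)
unit.containers.append(
    Container("C1", [Trip("T1", "Portland", "Boise"), Trip("T2", "Boise", "Chicago")])
)
unit.retender(second)
print(unit.original_shipment.shipment_id, unit.shipment.shipment_id, unit.legs())
```

Note that a freight unit reaches its places only through containers and trips, so the path is a network traversal rather than a single foreign key.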
Kimball on Transportation, 3NF
Kimball on Transportation, Star
Table Level DW diagram
Dim Modeling Dogma
• “Our carefully normalized data model cannot be translated into a star schema...”
  – Dimensional modeling is necessary in order to generate correct queries
  – Any (normalized) data model can be transformed into a dimensional model...
  – ... and there exists an algorithm to do it
Dim Modeling Example
Star option considered
Bridge table (remember, we tried this)
We tried this with hesmith. Selecting a main hierarchy has too much of a downside, and you don’t have a weight factor …
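To see why a missing weight factor matters, here is a minimal sketch of a bridge table between freight and shipments. The identifiers, costs, and weights are invented for illustration. Each bridge row carries a weight, and the weights for one freight unit sum to 1; without them, a many-to-many join counts the same cost more than once.

```python
# Hypothetical fact measure: cost per freight unit.
freight_cost = {"F1": 100.0, "F2": 60.0}

# Hypothetical bridge rows: (freight_id, shipment_id, weight).
# F1 belongs to two shipments, so its cost must be allocated between them.
bridge = [
    ("F1", "S1", 0.75),
    ("F1", "S2", 0.25),
    ("F2", "S2", 1.0),
]


def cost_by_shipment(use_weights):
    """Roll freight cost up to shipments through the bridge table."""
    totals = {}
    for freight_id, shipment_id, weight in bridge:
        factor = weight if use_weights else 1.0
        totals[shipment_id] = totals.get(shipment_id, 0.0) + freight_cost[freight_id] * factor
    return totals


weighted = cost_by_shipment(use_weights=True)
unweighted = cost_by_shipment(use_weights=False)

print(weighted, sum(weighted.values()))      # total matches the fact table
print(unweighted, sum(unweighted.values()))  # total is overstated
```

With weights the grand total equals the sum of the fact rows; without them, F1's cost appears under both shipments.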
Multi-fact option considered
Oracle’s Algorithmic approach
Basic DW diagram
Build Dimensional Model in BI
Freight moves through Networks
Information Factory & MPP
• Normalized Base
  – Integrate data once
    • Source -> Normalized -> Denormalized -> OK
    • Source -> Denormalized? -> Un-normalized -> ?
  – Detect problems and fix them once!
• Does not preclude Data Marts
• Massively Parallel Processing
  – Data distribution
    • Optimizations – Broadcast, Co-location, Re-distribution
• Scalability, the quest for 1:1
• Normalized data - reduced IO, better match for
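The data distribution options named on this slide (Co-location and Re-distribution in particular) can be sketched with a toy hash distribution. This is a simplified illustration, not how any specific MPP product works: the node count, the CRC32 hash, and the table contents are all assumptions made for the example.

```python
import zlib
from collections import defaultdict

NODES = 4  # hypothetical cluster size


def node_for(key):
    # Deterministic hash, so the same key value always lands on the same node.
    return zlib.crc32(str(key).encode()) % NODES


def distribute(rows, key_index):
    """Assign each row to a node by hashing the chosen distribution column."""
    slices = defaultdict(list)
    for row in rows:
        slices[node_for(row[key_index])].append(row)
    return slices


# Hypothetical tables: shipments(shipment_id, shipper), freight(freight_id, shipment_id, weight_kg).
shipments = [(sid, f"shipper-{sid % 3}") for sid in range(1, 21)]
freight = [(fid, fid % 20 + 1, 10.0 * fid) for fid in range(1, 61)]

# Both tables are distributed on shipment_id, so matching rows are co-located.
ship_slices = distribute(shipments, 0)
freight_slices = distribute(freight, 1)


def colocated_join():
    """Join freight to shipments node by node, with no data movement between nodes."""
    joined = []
    for node in range(NODES):
        local_shipments = {s[0]: s for s in ship_slices.get(node, [])}
        for f in freight_slices.get(node, []):
            # A KeyError here would mean the rows were not co-located.
            joined.append((f[0], local_shipments[f[1]][1]))
    return joined


def rows_to_redistribute(rows, current_key_index, join_key_index):
    """Count rows that would have to move if the table is distributed on a different column than the join key."""
    return sum(
        1 for r in rows if node_for(r[current_key_index]) != node_for(r[join_key_index])
    )


print(len(colocated_join()))
print(rows_to_redistribute(freight, 1, 1))  # distributed on the join key: nothing moves
print(rows_to_redistribute(freight, 0, 1))  # distributed on freight_id: rows must move
```

Broadcast is the third option: instead of moving the large table, a full copy of a small table is sent to every node so each node can join locally.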
Bob Conway’s Rapid Methodology
Core Model with many Roles
Transaction Tables
Reference Tables
Power of Conformed Dimensions
Example Data Model & Hierarchy
Data Flow and Usage
Cubes and In-memory BI
• Multi-Dimensional OLAP (MOLAP)
  – Drag-and-Drop OLAP environment, analysts become capable of self-service.
  – Dealt with Ragged Hierarchies, common in Financial data such as General Ledger (GL)
  – Limited by memory size
  – Pressure for more dimensionality floods cube size, build times from relational sources exceed load windows ...
• Relational OLAP (ROLAP)
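The MOLAP point about dimensionality flooding cube size is simple arithmetic: the number of potential cells in a dense cube is the product of the dimension cardinalities. The cardinalities below are made-up figures chosen only to show the growth.

```python
from math import prod


def cube_cells(cardinalities):
    """Upper bound on cells in a dense cube: one cell per combination of members."""
    return prod(cardinalities)


# Hypothetical dimension sizes.
base = {"date": 365, "customer": 1000, "product": 500}
cells_before = cube_cells(base.values())

# Adding one more dimension with 200 members multiplies the cube by 200.
cells_after = cube_cells(list(base.values()) + [200])

print(f"{cells_before:,}")
print(f"{cells_after:,}")
```

Real cubes are sparse and compressed, so actual storage is lower, but the multiplicative growth is what pushes cubes past memory limits and load windows.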
But a network this size choked it
Columnar vs Row-wise
• Physically store data by Column vs Row
  – Rather like Fifth Normal Form.
  – If Semantically Organized, then Rapid Response to user’s ad-hoc aggregation requests.
  – Prefers batch loading, always loads once per column, even if loading one row.
• Continues to Appear and Operate as a normal Row-wise cousin.
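The trade-off described above can be sketched by holding the same table in both layouts. The column names and values are hypothetical, and "values touched" is only a stand-in for IO: an aggregate over one column reads far less in the columnar layout, while inserting a single row has to touch every column.

```python
# Row-wise layout: each record is stored together.
rows = [
    {"shipment_id": 1, "carrier": "A", "weight_kg": 120.0},
    {"shipment_id": 2, "carrier": "A", "weight_kg": 80.5},
    {"shipment_id": 3, "carrier": "B", "weight_kg": 200.0},
    {"shipment_id": 4, "carrier": "B", "weight_kg": 99.5},
]

# Columnar layout: each column is stored together.
columns = {name: [r[name] for r in rows] for name in rows[0]}


def total_weight_row_store(table):
    """Sum one column from the row layout; every value of every row is read."""
    total, values_touched = 0.0, 0
    for record in table:
        values_touched += len(record)
        total += record["weight_kg"]
    return total, values_touched


def total_weight_column_store(table):
    """Sum one column from the columnar layout; only that column is read."""
    column = table["weight_kg"]
    return sum(column), len(column)


def append_row_column_store(table, record):
    """Insert one row into the columnar layout; returns how many columns were written."""
    for name, value in record.items():
        table[name].append(value)
    return len(record)


row_total, row_touched = total_weight_row_store(rows)
col_total, col_touched = total_weight_column_store(columns)
print(row_total, row_touched)
print(col_total, col_touched)

writes = append_row_column_store(
    columns, {"shipment_id": 5, "carrier": "C", "weight_kg": 10.0}
)
print(writes)  # one write per column for a single row
```

This is why columnar stores favor batch loads: the per-column write cost is amortized over many rows.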
Columnar IO example
Compression becomes much more effective
Reading a Column is like reading a Row
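One reason compression becomes more effective in a column store is that values of the same type and meaning sit next to each other, so repeats form long runs. The sketch below uses run-length encoding as a simple stand-in; the slide does not say which compression scheme is meant, and the sample data is invented.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


# A sorted, low-cardinality column compresses into a handful of runs.
carrier_column = ["A"] * 5 + ["B"] * 3 + ["C"] * 4
encoded = run_length_encode(carrier_column)
print(encoded)

# The same values interleaved with unique ids, as in a row-wise stream,
# never repeat consecutively, so every run has length 1.
row_stream = []
for shipment_id, carrier in enumerate(carrier_column, start=1):
    row_stream.extend([shipment_id, carrier])
row_encoded = run_length_encode(row_stream)
print(len(encoded), len(row_encoded))
```

Twelve column values shrink to three runs, while the interleaved row stream gains nothing from the same encoding.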
Design Pattern for Log Data
Data Stewards for Master Data
Data Stewards for Metadata
Architects integrate data and metadata
Architects organize data for analysis with physical in mind
Architects identify levels for analysis, and distribution
Columnar
MPP
Importance of Reference Data
Infobright’s Database Landscape 2011
Analytic Database Comparison
Actian ParAccel
IBM Netezza
HP Vertica
Greenplum
Teradata
Sybase IQ
Gartner’s Magic Quadrant
Hadoop (Cloudera & Hortonworks)
“Although it’s true that Hadoop can be valuable as an analytic silo, most organizations will prefer to get the most business value out of Hadoop by integrating it with—or into—their BI, DW, DI, and analytics technology stacks.” – Philip Russom TDWI http://tdwi.org/webcasts/2013/04/integrating-hadoop-into-business-intelligence-and-data-warehousing.aspx
Hadoop for Analytics?
Analytics performs best on Structured Data, for good reasons.
Maintain MPP strengths in the solution through Architecture.
Message from Hortonworks (Hadoop)
Hadoop as ETL
Data Flow Reference Architecture
Message from Neo4J NoSQL
Message from MongoDB (NoSQL)
http://www.slideshare.net/fullscreen/mongodb/schema-design-by-example/1
Message from Couchbase (NoSQL)
http://www.couchbase.com/why-nosql/nosql-database