Data Warehouse Design and Best Practices

Jun 26, 2015

Data & Analytics

Ivo Andreev

A data warehouse is a database designed for query and analysis rather than for transaction processing. An appropriate design leads to a scalable, balanced and flexible architecture that can meet both present and long-term needs. This session compares the main data warehouse architectures and covers best practices for the logical and physical design that support staging, loading and querying.
Transcript
Page 1: Data Warehouse Design and Best Practices

Data Warehouse Design

Best Practices

Page 2: Data Warehouse Design and Best Practices

About me

• Project Manager @
• 12 years professional experience
• .NET Web Development MCPD
• SQL Server 2012 (MCSA)
• Business Interests
  • Web Development, SOA, Integration
  • Security & Performance Optimization
  • Horizon2020, Open BIM, GIS, Mapping
• Contact me
  • [email protected]
  • www.linkedin.com/in/ivelin
  • www.slideshare.net/ivoandreev

Page 3: Data Warehouse Design and Best Practices

About me

• Senior Developer @
• .NET Web Development MCPD
• Business Interests
  • Web Development, WCF, Integration
  • SQL Server - Query Optimization and Tuning
  • Data Warehousing
• Contact me
  • [email protected]
  • www.linkedin.com/in/georgimishev

Page 4: Data Warehouse Design and Best Practices

Sponsors

Page 5: Data Warehouse Design and Best Practices

Agenda

• Why Data Warehouse
• Main DW Architectures
• Dimensional Modeling
• Patterns & Practices
• DW Maintenance
• ETL Process
• SSIS Demo

Page 6: Data Warehouse Design and Best Practices

Lots of Data Everywhere

• Can’t find data?
  • Data scattered over the network
• Can’t get data?
  • Need an expert to get the data
• Can’t understand data?
  • Data poorly documented
• Can’t use data found?
  • Data needs to be transformed

Page 7: Data Warehouse Design and Best Practices

Data Warehouse?

• Integrated
  • Heterogeneous sources
  • Data cleansing and conversion ($, €, 元)
• Focus on subject
  • e.g. Customer, Sale, Product
• Time variant
  • Timestamp every key
  • Historical data (10+ years)

Def: A central repository where data is organized, cleansed and stored in a standardized format.

Page 8: Data Warehouse Design and Best Practices

Different Problems - Different Solutions

Aspect        | OLTP Database             | Data Warehouse
Users         | Customer                  | Knowledge worker
Design        | Normalized, data integrity | Denormalized
Function      | Daily operation           | Decision making
Data          | Current, detailed         | Historical, aggregated
Usage         | Real time                 | Ad-hoc
Access        | Short R/W transactions    | Complex R/O queries
Data accessed | Comparatively lower       | Large amounts
# Records     | x100                      | x1’000’000
# Users       | x1’000                    | x10
DB Size       | x10 GB                    | x100 GB-TB

Page 9: Data Warehouse Design and Best Practices

Different DW Architectures

Page 10: Data Warehouse Design and Best Practices

B. Inmon Model

Top-Down Approach

• Warehouse (3NF)
• Data Mart & OLAP (MD)

http://sqlschoolgr.files.wordpress.com/2012/03/clip_image003_thumb.png?w=640&h=368

Page 11: Data Warehouse Design and Best Practices

R. Kimball Model

Bottom-Up Approach

• Data Marts (3NF or MD)
• Warehouse & OLAP (MD)

http://sqlschoolgr.files.wordpress.com/2012/03/clip_image005_thumb.png?w=640&h=369

Page 12: Data Warehouse Design and Best Practices

Data Vault (by Dan Linstedt)

• Hubs
  • List of unique business keys
• Links
  • Unique relationships between keys
• Satellites
  • Hub and Link details and history
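The three table types can be sketched with Python's built-in sqlite3 module. This is a minimal illustration only: the table names, column names and sample values below are hypothetical and not taken from the slides, and a production Data Vault would normally add load metadata such as a record source.

```python
import sqlite3

# All table/column names and sample values are hypothetical, for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hubs: one row per unique business key
CREATE TABLE hub_customer (customer_id INTEGER PRIMARY KEY,
                           business_key TEXT NOT NULL UNIQUE);
CREATE TABLE hub_product  (product_id INTEGER PRIMARY KEY,
                           business_key TEXT NOT NULL UNIQUE);
-- Link: one row per unique relationship between hub keys
CREATE TABLE link_purchase (
    purchase_id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES hub_customer(customer_id),
    product_id  INTEGER NOT NULL REFERENCES hub_product(product_id),
    UNIQUE (customer_id, product_id));
-- Satellite: descriptive details of a hub, one row per change (history)
CREATE TABLE sat_customer (
    customer_id INTEGER NOT NULL REFERENCES hub_customer(customer_id),
    load_date   TEXT NOT NULL,
    city        TEXT,
    PRIMARY KEY (customer_id, load_date));
""")
conn.execute("INSERT INTO hub_customer (business_key) VALUES ('C-001')")
conn.execute("INSERT INTO hub_product (business_key) VALUES ('P-042')")
conn.execute("INSERT INTO link_purchase (customer_id, product_id) VALUES (1, 1)")
# The customer's details changed once, so the satellite holds two rows.
conn.executemany("INSERT INTO sat_customer VALUES (?, ?, ?)",
                 [(1, "2015-01-01", "Sofia"), (1, "2015-06-01", "Plovdiv")])


def current_city(conn, business_key):
    """Return the most recent satellite value for a hub business key."""
    row = conn.execute("""
        SELECT s.city
        FROM hub_customer h
        JOIN sat_customer s ON s.customer_id = h.customer_id
        WHERE h.business_key = ?
        ORDER BY s.load_date DESC
        LIMIT 1""", (business_key,)).fetchone()
    return row[0] if row else None
```

Even this small lookup needs a join between hub and satellite, which hints at the "many joins" cost usually attributed to the approach.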

Page 13: Data Warehouse Design and Best Practices

It does not matter which camp you belong to…

as long as you understand why!

Page 14: Data Warehouse Design and Best Practices

Making Your Choice

• Data Vault (backend data warehouse)
  + Multiple sources; Full history; Incremental build
  - Up-front work; Long-term payoff; Many joins
• Inmon (3NF)
  + Structured
  + Easy to maintain
  + Easier data mining
  - Time-consuming to build
• Kimball (MD)
  + Start small, scale big
  + Faster ROI
  + Analytical tools
  - Low reusability

Page 15: Data Warehouse Design and Best Practices

Dimensional modeling as de-facto standard

Page 16: Data Warehouse Design and Best Practices

Dimensions

Def: The object of BI interest

• Keys
  • Surrogate key
  • Business key
• Hierarchical attributes
  • Analysis and Drill Down
• Member properties
  • Presentation labels
• Auditing information (not for end users)

Page 17: Data Warehouse Design and Best Practices

Slowly Changing Dimensions

Def: Scheme for recording changes over time

• Type 1 - Overwrite
• Type 2 - Multiple Records
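The difference between the two types can be sketched in plain Python. The row layout here (surrogate_key, business_key, valid_from, valid_to) is an assumption made for illustration; the slides do not prescribe column names, and effective-date columns are only one common way to mark the current version.

```python
from datetime import date


def apply_type1(rows, business_key, new_attrs):
    """Type 1: overwrite the attributes in place; no history is kept."""
    for row in rows:
        if row["business_key"] == business_key:
            row.update(new_attrs)


def apply_type2(rows, business_key, new_attrs, change_date):
    """Type 2: close the current row and append a new version of it."""
    # Raises StopIteration if the business key has no open (current) row.
    current = next(r for r in rows
                   if r["business_key"] == business_key and r["valid_to"] is None)
    new_row = {**current, **new_attrs,
               "surrogate_key": max(r["surrogate_key"] for r in rows) + 1,
               "valid_from": change_date, "valid_to": None}
    current["valid_to"] = change_date
    rows.append(new_row)
    return new_row


# Hypothetical sample data
type1_dim = [{"surrogate_key": 1, "business_key": "C-001", "city": "Sofia"}]
apply_type1(type1_dim, "C-001", {"city": "Plovdiv"})   # old value is lost

type2_dim = [{"surrogate_key": 1, "business_key": "C-001", "city": "Sofia",
              "valid_from": date(2015, 1, 1), "valid_to": None}]
apply_type2(type2_dim, "C-001", {"city": "Plovdiv"}, date(2015, 6, 1))
```

After the Type 2 change the dimension holds two rows for the same business key, each with its own surrogate key, so facts loaded before the change still point at the old version.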

Page 18: Data Warehouse Design and Best Practices

Facts

Def: Measurement of a business process

• Keys
  • FK from all dimension tables (in the star)
  • PK - Composite (usually) or Surrogate
• Measures
  • Numeric columns that are of interest to the business
  • Additive, Non-additive, Semi-additive
• Factless facts
• Auditing information (optional)
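A short Python sketch of the additive versus semi-additive distinction, using a hypothetical daily account snapshot (the tuple layout and the numbers are made up for illustration). Deposits can be summed over every dimension, while a balance can be summed across accounts but not across time.

```python
# Hypothetical snapshot facts: (date_key, account, balance, deposits)
facts = [
    (20150601, "A", 100, 100),
    (20150601, "B", 50, 50),
    (20150602, "A", 130, 30),
    (20150602, "B", 50, 0),
]


def total_deposits(rows):
    """Additive measure: a plain sum over all dimensions is meaningful."""
    return sum(r[3] for r in rows)


def closing_balance(rows):
    """Semi-additive measure: sum across accounts, but only for the
    latest date instead of summing over time."""
    last_day = max(r[0] for r in rows)
    return sum(r[2] for r in rows if r[0] == last_day)


def naive_balance_sum(rows):
    """What goes wrong when a semi-additive measure is summed over time."""
    return sum(r[2] for r in rows)
```

With this data `total_deposits(facts)` and `closing_balance(facts)` both give 180, while `naive_balance_sum(facts)` gives 330, a figure with no business meaning.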

Page 19: Data Warehouse Design and Best Practices

Practices and Design Patterns

Page 20: Data Warehouse Design and Best Practices

Data Warehouse Pitfalls

• Admit it is not as it seems to be
  • You need education
• Find what is of business value
  • Rather than focus on performance
• Spend a lot of time in Extract-Transform-Load
  • Homogenize data from different sources
  • Find (and resolve) problems in source systems

Page 21: Data Warehouse Design and Best Practices

Prepare your Sources

• Data integrity
  • Avoid redundancy
• Data quality
  • Master data source
  • Data validation
• Auditing
  • CreatedDate / CreatedBy
  • ChangedDate / ChangedBy
• Nightly jobs

Page 22: Data Warehouse Design and Best Practices

Dimension Design

• Business key with non-clustered index
  • Include date (if dimension has history)
• Surrogate key
  • The smallest possible integer type
  • Clustered index
• FK constraints
  • Do not enforce (WITH NOCHECK)
  • Document the relation
  • Faster load
• Data validation
  • Task for the source system

Page 23: Data Warehouse Design and Best Practices

Conformed Dimensions

Def: Having the same meaning and content when referenced from multiple fact tables

• Date Dimension
  • Best candidate for partitioning
• Granularity
  • Do not store every hour when reporting daily
• Avoid surrogate keys
  • Saves lookups and joins
  • Integer representing date (yyyyMMdd, days after 1/1/1900)
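Both integer encodings mentioned above can be derived directly from the date, so the load needs no lookup against the date dimension. A minimal Python sketch, with function names chosen for illustration:

```python
from datetime import date

EPOCH = date(1900, 1, 1)


def yyyymmdd_key(d):
    """Encode a date as a human-readable integer, e.g. 2015-06-26 -> 20150626."""
    return d.year * 10000 + d.month * 100 + d.day


def key_to_date(key):
    """Decode a yyyyMMdd integer key back into a date."""
    return date(key // 10000, key // 100 % 100, key % 100)


def days_since_epoch_key(d):
    """Alternative encoding: number of days after 1/1/1900."""
    return (d - EPOCH).days
```

The yyyyMMdd form is readable in ad-hoc queries and sorts chronologically; the day-count form is more compact and makes date arithmetic a simple subtraction.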

Page 24: Data Warehouse Design and Best Practices

Pre-join Hierarchies

• Recursive relationships
• Fast drill and report
• Pre-computed aggregations

Hierarchy Bridge

• For each dimension row
  • 1 association with self
  • 1 row for each subordinate
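A hierarchy bridge of this shape can be pre-computed from a parent-child (recursive) relationship. The Python sketch below is one possible way to do it; the function name, the (ancestor, descendant, depth) row layout and the sample hierarchy are assumptions for illustration, and the input is assumed to contain no cycles.

```python
def build_hierarchy_bridge(parent_of):
    """Flatten a parent-child mapping into bridge rows.

    parent_of maps each member to its parent (None for the top member).
    Returns (ancestor, descendant, depth) rows: one self row per member
    (depth 0) plus one row for every subordinate at any level.
    """
    rows = []
    for member in parent_of:
        ancestor, depth = member, 0
        while ancestor is not None:
            rows.append((ancestor, member, depth))
            ancestor = parent_of[ancestor]
            depth += 1
    return rows


# Hypothetical three-level hierarchy
parent_of = {"CEO": None, "VP": "CEO", "Dev": "VP"}
bridge = build_hierarchy_bridge(parent_of)

# Everyone under "VP", including "VP" itself, without any recursion at query time
under_vp = sorted(d for a, d, _ in bridge if a == "VP")
```

Joining a fact table to the bridge on the descendant column and filtering on the ancestor gives the rolled-up total for a whole subtree with a single, non-recursive join.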

Page 25: Data Warehouse Design and Best Practices

Determine the Facts

The center of a Star schema

• Identify subject areas
• Identify key business events
• Identify dimensions
• Start from OLTP logical model
• Identify historical requirements
• Identify attributes

Page 26: Data Warehouse Design and Best Practices

The Grain

Def: The level of detail of a fact table

• What is the business objective?
  • Fine grain - behaviour and frequency analysis
  • Coarse grain - overall and trend analysis
• Aggregates
  • DO NOT summarize prematurely
  • DO NOT mix detail and summary
  • DO use “summary tables”
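The aggregate guidance can be illustrated in Python: the fact data stays at its finest grain, and a separate summary structure is derived from it instead of mixing summary rows into the detail. The row layout and sample values are hypothetical.

```python
from collections import defaultdict

# Detail facts at the finest grain: (date_key, product, quantity)
detail = [
    (20150601, "P1", 2),
    (20150601, "P1", 3),
    (20150601, "P2", 1),
    (20150602, "P1", 4),
]


def build_daily_summary(detail_rows):
    """Derive a separate daily-per-product summary table from the detail.

    The detail rows are left untouched, so finer-grained questions can
    still be answered later.
    """
    summary = defaultdict(int)
    for date_key, product, quantity in detail_rows:
        summary[(date_key, product)] += quantity
    return dict(summary)


daily_summary = build_daily_summary(detail)
```

Because the summary is always rebuilt from the detail, it can be dropped and recreated at a different grain without losing information.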

Page 27: Data Warehouse Design and Best Practices

C-3PO is fluent in over six million forms of communication.

What about your customers?

Page 28: Data Warehouse Design and Best Practices

Multinational DW

• What parts need translation?
• Where to store various language versions?
• How to support future languages?
• Dimensions
  • Add language attribute
    • Include text data in the dimension
    • Problem 1: The dimension key?
      • Replicate PK for every language
        Fact.DimId = Dim.Id AND Dim.Lang = [Lang]
    • Problem 2: Storage = [Dim] x [Lang]
  • Sub-dimension with language attributes

TxtId | Attr1 | Attr2 | LangId
1     | large | Yes   | En
2     | small | No    | En
1     | stor  | Ja    | No
2     | liten | Nei   | No
3     | …     | …     | …
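The language-filtered join can be tried out with Python's built-in sqlite3 module. The sub-dimension rows are the ones from the table above; the table names, the fact table and its amounts are hypothetical additions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Sub-dimension: the same TxtId is repeated once per language
CREATE TABLE dim_text (txt_id INTEGER, attr1 TEXT, attr2 TEXT, lang_id TEXT,
                       PRIMARY KEY (txt_id, lang_id));
-- Hypothetical fact table referencing the language-neutral TxtId
CREATE TABLE fact_sale (sale_id INTEGER PRIMARY KEY, txt_id INTEGER, amount REAL);
""")
conn.executemany("INSERT INTO dim_text VALUES (?, ?, ?, ?)", [
    (1, "large", "Yes", "En"),
    (2, "small", "No", "En"),
    (1, "stor", "Ja", "No"),
    (2, "liten", "Nei", "No"),
])
conn.executemany("INSERT INTO fact_sale VALUES (?, ?, ?)",
                 [(1, 1, 10.0), (2, 2, 5.0), (3, 1, 7.5)])


def sales_by_label(conn, lang):
    """Total amount per translated label; the language is a join condition."""
    return conn.execute("""
        SELECT d.attr1, SUM(f.amount)
        FROM fact_sale f
        JOIN dim_text d ON f.txt_id = d.txt_id AND d.lang_id = ?
        GROUP BY d.attr1
        ORDER BY d.attr1""", (lang,)).fetchall()
```

The fact rows are stored once; only the label changes with the requested language, and supporting a new language means adding rows to the sub-dimension rather than changing the schema.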

Page 29: Data Warehouse Design and Best Practices

Data warehouse maintenance

Page 30: Data Warehouse Design and Best Practices

How Large is “Large”

Is big really big?

Page 31: Data Warehouse Design and Best Practices

Partitioning

• Why
  • Faster index maintenance
  • Faster load
  • Faster queries
• When
  • Tables 10GB+
• How
  • Do not partition dimension tables
  • Partition by date (most analyses are time-based)
  • Eliminate partitions (WHERE [PartitionKey]=…)
  • Avoid split and merge of existing partitions
    • Can cause inefficient log generation
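Partition elimination can be pictured with a toy in-memory model in Python. This is a conceptual sketch only, not how SQL Server implements partitioning; the class name and the monthly partitioning scheme are assumptions. Rows are routed to a partition by a key derived from the date, and a query that filters on that key touches a single partition.

```python
from collections import defaultdict


class MonthlyPartitionedFacts:
    """Toy fact table partitioned by month (partition key = yyyyMM)."""

    def __init__(self):
        self.partitions = defaultdict(list)  # yyyyMM -> list of (date_key, amount)

    def insert(self, date_key, amount):
        # date_key is a yyyyMMdd integer; dropping the day gives the partition key
        self.partitions[date_key // 100].append((date_key, amount))

    def total_for_month(self, yyyymm):
        """Return (total amount, rows scanned) reading only one partition."""
        rows = self.partitions.get(yyyymm, [])
        return sum(amount for _, amount in rows), len(rows)


# Hypothetical sample data
table = MonthlyPartitionedFacts()
table.insert(20150626, 10.0)
table.insert(20150630, 5.0)
table.insert(20150701, 7.0)
june_total, june_rows_scanned = table.total_for_month(201506)
```

Only the two June rows are scanned for the June total; the July partition is never read, which is the effect a WHERE clause on the partition key aims for.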

Page 32: Data Warehouse Design and Best Practices

Columnstore Index

• Non-clustered in SQL Server 2012
• Clustered in SQL Server 2014
• Pros
  • Better data compression
  • High performance on table scan
• Clustered CSI Limitations
  • No other indexes allowed
  • Little advantage on seek operations
  • No XML, computed column or replication

Page 33: Data Warehouse Design and Best Practices

Extract-Transform-Load

• Extract data from OLTP
• Data transformations
• Data loads
• DW maintenance

Page 34: Data Warehouse Design and Best Practices

Efficient Load Process

• Use simple recovery model during data load
• Staging
  • Avoid indexing
  • Populate in parallel
• Maintain DW
  • Disable indexes on load
  • Rebuild manually after load
  • Automatic statistics updates slow down SQL Server

Page 35: Data Warehouse Design and Best Practices

To SSIS, or not to SSIS ?

Pros

• Little to no coding
• Extensive support of various data sources
• Parallel execution of migration tasks
• Better organization of the ETL process

Cons

• Another way of thinking
• Hidden options
• A T-SQL developer would do it much faster
• Auto-generated flows need optimization
• Sometimes it simply does not work (e.g. Sort by GUID)

Page 36: Data Warehouse Design and Best Practices
Page 37: Data Warehouse Design and Best Practices

Takeaways

• Books
  • The Data Warehouse Toolkit (3rd ed), Ralph Kimball
  • Implementing DW with Microsoft SQL Server 2012
  • Data Warehousing Fundamentals, Paulraj Ponniah
• Articles
  • Best Practices in Data Warehouse (Hanover Research Council)
  • http://www.kimballgroup.com/category/design-tips/
  • http://sqlmag.com/business-intelligence
• Resources
  • http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/
  • http://www.databaseanswers.org/data_models/index.htm
