Data Warehouse Design and Best Practices

About me

• Project Manager @
• 12 years of professional experience
• .NET Web Development MCPD
• SQL Server 2012 (MCSA)
• Business Interests
  • Web Development, SOA, Integration
  • Security & Performance Optimization
  • Horizon2020, Open BIM, GIS, Mapping
• Contact me
  • www.linkedin.com/in/ivelin
  • www.slideshare.net/ivoandreev

About me

• Senior Developer @
• .NET Web Development MCPD
• Business Interests
  • Web Development, WCF, Integration
  • SQL Server – Query Optimization and Tuning
  • Data Warehousing
• Contact me
  • www.linkedin.com/in/georgimishev

Agenda

• Why Data Warehouse
• Main DW Architectures
• Dimensional Modeling
• Patterns & Practices
• DW Maintenance
• ETL Process
• SSIS Demo

Lots of Data Everywhere

• Can’t find data?
  • Data is scattered over the network
• Can’t get data?
  • An expert is needed to get the data
• Can’t understand data?
  • Data is poorly documented
• Can’t use the data found?
  • Data needs to be transformed

Data Warehouse?

• Integrated
  • Heterogeneous sources
  • Data cleansing and conversion ($, €, 元)
• Focused on a subject
  • e.g. Customer, Sale, Product
• Time variant
  • Timestamp on every key
  • Historical data (10+ years)

Def: A central repository where data are organized, cleansed and stored in a standardized format.

Different Problems - Different Solutions

                OLTP Database               Data Warehouse
Users           Customer                    Knowledge worker
Design          Normalized, data integrity  Denormalized
Function        Daily operation             Decision making
Data            Current, detailed           Historical, aggregated
Usage           Real time                   Ad hoc
Access          Short R/W transactions      Complex R/O queries
Data accessed   Comparatively lower         Large amounts
# Records       x100                        x1,000,000
# Users         x1,000                      x10
DB size         x10 GB                      x100 GB to TB

Different DW Architectures

B. Inmon Model

Top-Down Approach

• Warehouse (3NF)
• Data Mart & OLAP (MD)

http://sqlschoolgr.files.wordpress.com/2012/03/clip_image003_thumb.png?w=640&h=368

R. Kimball Model

Bottom-Up Approach

• Data Marts (3NF or MD)
• Warehouse & OLAP (MD)

http://sqlschoolgr.files.wordpress.com/2012/03/clip_image005_thumb.png?w=640&h=369

Data Vault (by Dan Linstedt)

• Hubs
  • List of unique business keys
• Links
  • Unique relationships between keys
• Satellites
  • Hub and Link details and history

It does not matter which camp you belong to,

as long as you understand why!

Making Your Choice

• Data Vault

Backend Data Warehouse

+ Multiple sources; Full history; Incremental build

- Up-front work; Long-term payoff; Many joins

• Inmon (3NF)

+ Structured

+ Easy to maintain

+ Easier data mining

- Time-consuming to build

• Kimball (MD)

+ Start small, scale big

+ Faster ROI

+ Analytical tools

- Low reusability

Dimensional modeling as de-facto standard

Dimensions

Def: The object of BI interest

• Keys
  • Surrogate key
  • Business key
• Hierarchical attributes
  • Analysis and drill down
• Member properties
  • Presentation labels
• Auditing information (not for end users)

Slowly Changing Dimensions

Def: Scheme for recording changes over time

• Type 1 - Overwrite
• Type 2 - Multiple Records
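
The two types above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `CustomerRow` structure, the `valid_from` / `valid_to` columns and the sample customer are assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CustomerRow:
    surrogate_key: int                # key generated in the warehouse
    business_key: str                 # key coming from the source system
    city: str                         # the attribute whose changes we track
    valid_from: date
    valid_to: Optional[date] = None   # None marks the current version


def apply_type1(rows: List[CustomerRow], business_key: str, city: str) -> None:
    """Type 1: overwrite the attribute in place, so history is lost."""
    for row in rows:
        if row.business_key == business_key:
            row.city = city


def apply_type2(rows: List[CustomerRow], business_key: str, city: str,
                changed_on: date) -> None:
    """Type 2: close the current record and append a new one."""
    for row in rows:
        if row.business_key == business_key and row.valid_to is None:
            if row.city == city:
                return                # nothing changed, keep the current record
            row.valid_to = changed_on
    next_key = max((r.surrogate_key for r in rows), default=0) + 1
    rows.append(CustomerRow(next_key, business_key, city, changed_on))


# Hypothetical customer who moves from Sofia to Plovdiv
dim = [CustomerRow(1, "C-100", "Sofia", date(2012, 1, 1))]
apply_type2(dim, "C-100", "Plovdiv", date(2014, 6, 1))
```

After the call the dimension holds two records for the same business key: facts loaded before the change keep pointing at surrogate key 1, later facts at surrogate key 2.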

Facts

Def: Measurement of a business process

• Keys
  • FK from all dimension tables (in the star)
  • PK - Composite (usually) or Surrogate
• Measures
  • Numeric columns that are of interest to the business
  • Additive, Non-additive, Semi-additive
• Factless facts
• Auditing information (optional)
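
A small sketch of why additivity matters, assuming a hypothetical account balance snapshot: a balance is a typical semi-additive measure, because it can be summed across accounts but not across time.

```python
from collections import defaultdict

# (date, account, balance) snapshot rows; the data is made up for illustration
balances = [
    ("2014-01-01", "A", 100.0),
    ("2014-01-01", "B", 50.0),
    ("2014-01-02", "A", 120.0),
    ("2014-01-02", "B", 30.0),
]


def total_per_day(rows):
    """Summing across the account dimension is valid for a balance."""
    totals = defaultdict(float)
    for day, _account, balance in rows:
        totals[day] += balance
    return dict(totals)


def closing_total(rows):
    """Across time, take the last snapshot instead of summing the days."""
    per_day = total_per_day(rows)
    return per_day[max(per_day)]   # ISO date strings sort chronologically


per_day = total_per_day(balances)
closing = closing_total(balances)
naive_sum = sum(balance for _day, _account, balance in balances)
```

Here `naive_sum` adds the same money twice (once per day), which is why semi-additive measures need a different aggregation along the time dimension.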

Practices and Design Patterns

Data Warehouse Pitfalls

• Admit it is not as it seems to be
  • You need education
• Find what is of business value
  • Rather than focusing on performance
• Spend a lot of time in Extract-Transform-Load
  • Homogenize data from different sources
  • Find (and resolve) problems in source systems

Prepare your Sources

• Data integrity
  • Avoid redundancy
• Data quality
  • Master data source
  • Data validation
• Auditing
  • CreatedDate / CreatedBy
  • ChangedDate / ChangedBy
• Nightly jobs

Dimension Design

• Business key with non-clustered index
  • Include date (if the dimension has history)
• Surrogate key
  • The smallest possible integer
  • Clustered index
• FK constraints
  • Do not enforce (WITH NOCHECK)
  • Document the relation
  • Faster load
• Data validation
  • Task for the source system
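
The key layout above can be tried out with a small sketch. It uses SQLite through Python's `sqlite3` module as a stand-in for SQL Server, so clustered indexes and `WITH NOCHECK` are not modeled literally: the integer primary key plays the role of the surrogate key, and the fact table simply declares no foreign key. Table and column names (`DimProduct`, `FactSale`, `ProductCode`) are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimProduct (
    ProductKey  INTEGER PRIMARY KEY,   -- surrogate key, assigned by the warehouse
    ProductCode TEXT NOT NULL,         -- business key from the source system
    ValidFrom   TEXT NOT NULL,         -- date, because the dimension has history
    Name        TEXT
);
-- Secondary index on business key + date, used for lookups during load
CREATE INDEX IX_DimProduct_Business ON DimProduct (ProductCode, ValidFrom);

CREATE TABLE FactSale (
    ProductKey INTEGER NOT NULL,       -- relation is documented, not enforced
    DateKey    INTEGER NOT NULL,
    Amount     REAL NOT NULL
);
""")
conn.executemany(
    "INSERT INTO DimProduct (ProductCode, ValidFrom, Name) VALUES (?, ?, ?)",
    [("P-1", "2014-01-01", "Widget"), ("P-2", "2014-01-01", "Gadget")],
)


def lookup_key(code):
    """Translate a business key into the latest surrogate key, or None."""
    row = conn.execute(
        "SELECT ProductKey FROM DimProduct WHERE ProductCode = ? "
        "ORDER BY ValidFrom DESC LIMIT 1",
        (code,),
    ).fetchone()
    return row[0] if row else None


# During the fact load the business key is replaced by the surrogate key
conn.execute("INSERT INTO FactSale VALUES (?, ?, ?)",
             (lookup_key("P-2"), 20140601, 9.5))
```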

Conformed Dimensions

Def: Having the same meaning and content when referred to from multiple fact tables

• Date Dimension
  • Best candidate for partitioning
• Granularity
  • Do not store every hour when reporting is daily
• Avoid surrogate keys (for the date dimension)
  • Saves lookups and joins
  • Use an integer representing the date (yyyyMMdd or days since 1/1/1900)
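
Both integer encodings of the date key can be computed directly from the date, with no lookup against the dimension. A minimal sketch (the function names are made up for the example):

```python
from datetime import date, timedelta

EPOCH = date(1900, 1, 1)


def to_yyyymmdd(d: date) -> int:
    """Encode a date as a readable integer key, e.g. 20140601."""
    return d.year * 10000 + d.month * 100 + d.day


def from_yyyymmdd(key: int) -> date:
    """Decode a yyyyMMdd integer key back into a date."""
    return date(key // 10000, key // 100 % 100, key % 100)


def to_day_number(d: date) -> int:
    """Encode a date as the number of days since 1/1/1900."""
    return (d - EPOCH).days


def from_day_number(n: int) -> date:
    """Decode a day number back into a date."""
    return EPOCH + timedelta(days=n)
```

The yyyyMMdd form stays readable in ad hoc queries and sorts chronologically; the day-number form is more compact and makes date arithmetic a plain subtraction.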

Pre-join Hierarchies

• Recursive relationships
• Fast drill and report
• Pre-computed aggregations

Hierarchy Bridge

• For each dimension row
  • 1 association with itself
  • 1 row for each subordinate
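
A bridge table of this shape can be derived from a parent-child dimension by walking up from every member. The sketch below is one possible way to do it, assuming the hierarchy has no cycles; the `org` sample and the extra `depth` column are illustrative additions.

```python
from typing import Dict, List, Optional, Tuple


def build_bridge(parent_of: Dict[str, Optional[str]]) -> List[Tuple[str, str, int]]:
    """Return (ancestor, descendant, depth) rows for a parent-child hierarchy.

    Every member gets one row associating it with itself (depth 0) and one
    row under each of its ancestors, so each ancestor ends up with one row
    per subordinate at any level.
    """
    rows = []
    for member in parent_of:
        ancestor, depth = member, 0
        while ancestor is not None:
            rows.append((ancestor, member, depth))
            ancestor = parent_of[ancestor]
            depth += 1
    return sorted(rows)


# Hypothetical organization: member -> direct parent (None for the root)
org = {"CEO": None, "VP": "CEO", "Dev": "VP", "QA": "VP"}
bridge = build_bridge(org)
```

Joining a fact table to the bridge on the descendant column and filtering on the ancestor returns the whole subtree in one join, without a recursive query.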

Determine the Facts

The center of a Star schema

• Identify subject areas
• Identify key business events
• Identify dimensions
• Start from the OLTP logical model
• Identify historical requirements
• Identify attributes

The Grain

Def: The level of detail of a fact table

• What is the business objective?
  • Fine grain - behaviour and frequency analysis
  • Coarse grain - overall and trend analysis
• Aggregates
  • DO NOT summarize prematurely
  • DO NOT mix detail and summary
  • DO use “summary tables”
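
The advice on aggregates can be illustrated with a small sketch: the detail fact keeps its fine grain, and a separate summary table is derived from it at a coarser grain. SQLite (via Python's `sqlite3`) stands in for the warehouse here, and the table names and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Detail fact: one row per sale (fine grain)
CREATE TABLE FactSale (DateKey INTEGER, ProductKey INTEGER, Amount REAL);
INSERT INTO FactSale VALUES
    (20140601, 1, 10.0), (20140601, 1, 5.0),
    (20140601, 2, 7.0),  (20140715, 1, 3.0);

-- Summary table: one row per month and product, kept apart from the detail
CREATE TABLE FactSaleMonthly AS
SELECT DateKey / 100 AS MonthKey,      -- yyyyMMdd -> yyyyMM (integer division)
       ProductKey,
       SUM(Amount)   AS Amount,
       COUNT(*)      AS SaleCount
FROM FactSale
GROUP BY DateKey / 100, ProductKey;
""")

monthly = conn.execute(
    "SELECT MonthKey, ProductKey, Amount, SaleCount "
    "FROM FactSaleMonthly ORDER BY MonthKey, ProductKey"
).fetchall()
```

Because the summary is derived from the detail rows, it can be rebuilt or extended later; summarizing before loading would make the finer questions unanswerable.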

C-3PO is fluent in over six million forms of communication.

What about your customers?

Multinational DW

• What parts need translation?
• Where to store the various language versions?
• How to support future languages?
• Dimensions
  • Add a language attribute
  • Include text data in the dimension
• Problem 1: The dimension key?
  • Replicate the PK for every language
    Fact.DimId = Dim.Id AND Dim.Lang = [Lang]
• Problem 2: Storage = [Dim] x [Lang]
  • Sub-dimension with language attributes

TxtId  Attr1  Attr2  LangId
1      large  Yes    En
2      small  No     En
1      stor   Ja     No
2      liten  Nei    No
3      …      …      …
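
The language sub-dimension above can be queried by adding the language to the join condition. The following sketch loads the four complete rows of the table into SQLite through Python's `sqlite3`; the table names `DimSizeText` and `FactSale` and the fact rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Text attributes replicated per language; the key is (TxtId, LangId)
CREATE TABLE DimSizeText (
    TxtId  INTEGER,
    LangId TEXT,
    Attr1  TEXT,
    Attr2  TEXT,
    PRIMARY KEY (TxtId, LangId)
);
INSERT INTO DimSizeText VALUES
    (1, 'En', 'large', 'Yes'), (2, 'En', 'small', 'No'),
    (1, 'No', 'stor',  'Ja'),  (2, 'No', 'liten', 'Nei');

CREATE TABLE FactSale (TxtId INTEGER, Amount REAL);
INSERT INTO FactSale VALUES (1, 10.0), (2, 4.0), (1, 6.0);
""")


def sales_by_label(lang):
    """Total sales per text label, shown in the requested language."""
    return conn.execute(
        "SELECT t.Attr1, SUM(f.Amount) FROM FactSale f "
        "JOIN DimSizeText t ON f.TxtId = t.TxtId AND t.LangId = ? "
        "GROUP BY t.Attr1 ORDER BY t.Attr1",
        (lang,),
    ).fetchall()
```

The fact rows are stored once and are language-neutral; only the label lookup depends on the language, so supporting a new language means adding rows to the sub-dimension.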

Data warehouse maintenance

How Large is “Large”

Is big really big?

Partitioning

• Why
  • Faster index maintenance
  • Faster load
  • Faster queries
• When
  • Tables 10GB+
• How
  • Do not partition dimension tables
  • Partition by date (most analyses are time-based)
  • Eliminate partitions (WHERE [PartitionKey]=…)
  • Avoid split and merge of existing partitions
    • Can cause inefficient log generation
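
Partition elimination can be pictured with a toy model in plain Python. This is only a conceptual sketch of the idea, not how SQL Server implements partitioning: rows are bucketed by month, and a query that filters on the partition key touches only the matching bucket.

```python
from collections import defaultdict


class PartitionedFact:
    """Toy fact table partitioned by month (yyyyMM) of an integer date key."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def insert(self, date_key, amount):
        # date_key is yyyyMMdd; integer division by 100 gives the month
        self.partitions[date_key // 100].append((date_key, amount))

    def total(self, month=None):
        """Sum the amounts and report how many partitions were scanned."""
        if month is None:
            scanned = list(self.partitions)          # no filter: scan everything
        else:
            scanned = [month] if month in self.partitions else []
        total = sum(amount for m in scanned for _key, amount in self.partitions[m])
        return total, len(scanned)


fact = PartitionedFact()
for date_key, amount in [(20140601, 10.0), (20140615, 5.0), (20140701, 3.0)]:
    fact.insert(date_key, amount)
```

Calling `fact.total(201406)` reads one partition while `fact.total()` reads them all, which is the effect a filter on the partition key is meant to achieve.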

Columnstore Index

• Non-clustered in SQL Server 2012
• Clustered in SQL Server 2014
• Pros
  • Better data compression
  • High performance on table scans
• Clustered CSI Limitations
  • No other indexes allowed
  • Little advantage on seek operations
  • No XML, computed columns or replication

Extract-Transform-Load

• Extract data from OLTP
• Data transformations
• Data loads
• DW maintenance

Efficient Load Process

• Use the simple recovery model during data load
• Staging
  • Avoid indexing
  • Populate in parallel
• Maintain DW
  • Disable indexes on load
  • Rebuild manually after load
  • Automatic statistics updates slow down SQL Server
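
The load pattern above (no index during the load, rebuild afterwards) can be sketched as follows. SQLite via Python's `sqlite3` is used as a stand-in, and since SQLite cannot disable an index, the sketch drops and recreates it instead; the table and index names are assumptions.

```python
import sqlite3


def bulk_load(conn, rows):
    """Drop the index, load all rows in one transaction, then rebuild it."""
    conn.execute("DROP INDEX IF EXISTS IX_FactSale_DateKey")
    with conn:  # a single commit for the whole batch
        conn.executemany("INSERT INTO FactSale VALUES (?, ?)", rows)
    conn.execute("CREATE INDEX IX_FactSale_DateKey ON FactSale (DateKey)")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FactSale (DateKey INTEGER, Amount REAL)")
conn.execute("CREATE INDEX IX_FactSale_DateKey ON FactSale (DateKey)")

# Hypothetical batch of 1000 sales spread over 28 days
bulk_load(conn, [(20140601 + i % 28, float(i)) for i in range(1000)])
```

Building the index once over the loaded data avoids maintaining it row by row during the insert, which is the reason for disabling indexes on load.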

To SSIS, or not to SSIS ?

Pros
• Little to no coding
• Extensive support for various data sources
• Parallel execution of migration tasks
• Better organization of the ETL process

Cons
• Another way of thinking
• Hidden options
• A T-SQL developer would do it much faster
• Auto-generated flows need optimization
• Sometimes it simply does not work (e.g. Sort by GUID)

Takeaways

• Books
  • The Data Warehouse Toolkit (3rd ed.), Ralph Kimball
  • Implementing DW with Microsoft SQL Server 2012
  • Data Warehousing Fundamentals, Paulraj Ponniah
• Articles
  • Best Practices in Data Warehouse (Hanover Research Council)
  • http://www.kimballgroup.com/category/design-tips/
  • http://sqlmag.com/business-intelligence
• Resources
  • http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/
  • http://www.databaseanswers.org/data_models/index.htm
