Data Warehouse Design and Best Practices

Post on 26-Jun-2015

Category:

Data & Analytics

DESCRIPTION

A data warehouse is a database designed for query and analysis rather than for transaction processing. An appropriate design leads to a scalable, balanced and flexible architecture that is capable of meeting both present and long-term needs. This session compares the main data warehouse architectures and covers best practices for the logical and physical design that supports staging, load and querying.

Transcript

Data Warehouse Design

Best Practices

About me

• Project Manager @
• 12 years professional experience
• .NET Web Development MCPD
• SQL Server 2012 (MCSA)
• Business Interests
  • Web Development, SOA, Integration
  • Security & Performance Optimization
  • Horizon2020, Open BIM, GIS, Mapping
• Contact me
  • ivelin.andreev@icb.bg
  • www.linkedin.com/in/ivelin
  • www.slideshare.net/ivoandreev

About me

• Senior Developer @
• .NET Web Development MCPD
• Business Interests
  • Web Development, WCF, Integration
  • SQL Server – Query Optimization and Tuning
  • Data Warehousing
• Contact me
  • georgi.mishev@icb.bg
  • www.linkedin.com/in/georgimishev

Agenda

• Why Data Warehouse
• Main DW Architectures
• Dimensional Modeling
• Patterns & Practices
• DW Maintenance
• ETL Process
• SSIS Demo

Lots of Data Everywhere

• Can’t find data?
  • Data scattered over the network
• Can’t get data?
  • Need an expert to get the data
• Can’t understand data?
  • Data poorly documented
• Can’t use data found?
  • Data needs to be transformed

Data Warehouse?

• Integrated
  • Heterogeneous sources
  • Data cleansing and conversion ($, €, 元)
• Focus on subject
  • e.g. Customer, Sale, Product
• Time variant
  • Timestamp every key
  • Historical data (10+ years)

Def: A central repository where data are organized, cleansed and stored in a standardized format.

Different Problems - Different Solutions

                OLTP Database               Data Warehouse
Users           Customer                    Knowledge worker
Design          Normalized, data integrity  Denormalized
Function        Daily operation             Decision making
Data            Current, detailed           Historical, aggregated
Usage           Real time                   Ad hoc
Access          Short R/W transactions      Complex R/O queries
Data accessed   Comparatively lower         Large amounts
# Records       x100                        x1,000,000
# Users         x1,000                      x10
DB size         x10 GB                      x100 GB to TB

Different DW Architectures

B. Inmon Model

Top-Down Approach

• Warehouse (3NF)
• Data Mart & OLAP (MD)

http://sqlschoolgr.files.wordpress.com/2012/03/clip_image003_thumb.png?w=640&h=368

R. Kimball Model

Bottom-Up Approach

• Data Marts (3NF or MD)
• Warehouse & OLAP (MD)

http://sqlschoolgr.files.wordpress.com/2012/03/clip_image005_thumb.png?w=640&h=369

Data Vault (by Dan Linstedt)

• Hubs
  • List of unique business keys
• Links
  • Unique relationships between keys
• Satellites
  • Hub and Link details and history

It is irrelevant which camp you belong to… as long as you understand why!

Making Your Choice

• Data Vault

Backend Data Warehouse

+ Multiple sources; Full history; Incremental build

- Up-front work; Long-term payoff; Many joins

• Inmon (3NF)

+ Structured

+ Easy to maintain

+ Easier data mining

- Time-consuming to build

• Kimball (MD)

+ Start small, scale big

+ Faster ROI

+ Analytical tools

- Low reusability

Dimensional modeling as de-facto standard

Dimensions

Def: The object of BI interest

• Keys
  • Surrogate key
  • Business key
• Hierarchical attributes
  • Analysis and Drill Down
• Member properties
  • Presentation labels
• Auditing information (not for end users)

Slowly Changing Dimensions

Def: Scheme for recording changes over time

• Type 1 - Overwrite
• Type 2 - Multiple Records
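The two change-tracking types can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the presenters' implementation: the table dim_customer, its columns (valid_from, valid_to, is_current) and the helper names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical customer dimension with a surrogate key and a business key.
conn.execute("""
    CREATE TABLE dim_customer (
        customer_sk INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        customer_bk TEXT NOT NULL,                      -- business key
        city        TEXT NOT NULL,
        valid_from  TEXT NOT NULL,
        valid_to    TEXT,                               -- NULL marks the current row
        is_current  INTEGER NOT NULL DEFAULT 1
    )
""")

def scd_type1(conn, customer_bk, city):
    """Type 1: overwrite the attribute in place; the old value is lost."""
    conn.execute(
        "UPDATE dim_customer SET city = ? "
        "WHERE customer_bk = ? AND is_current = 1",
        (city, customer_bk),
    )

def scd_type2(conn, customer_bk, city, change_date):
    """Type 2: close the current row and insert a new version of it."""
    conn.execute(
        "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
        "WHERE customer_bk = ? AND is_current = 1",
        (change_date, customer_bk),
    )
    conn.execute(
        "INSERT INTO dim_customer (customer_bk, city, valid_from) "
        "VALUES (?, ?, ?)",
        (customer_bk, city, change_date),
    )

conn.execute(
    "INSERT INTO dim_customer (customer_bk, city, valid_from) "
    "VALUES ('C001', 'Sofia', '2014-01-01')"
)
scd_type1(conn, "C001", "Plovdiv")              # still one row, value replaced
scd_type2(conn, "C001", "Varna", "2015-06-01")  # second row, history kept

rows = conn.execute(
    "SELECT customer_sk, city, valid_from, valid_to, is_current "
    "FROM dim_customer ORDER BY customer_sk"
).fetchall()
```

With Type 2, facts loaded before the change keep pointing at the old surrogate key, so historical reports still show the attribute value that was valid at the time.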

Facts

Def: Measurement of a business process

• Keys
  • FK from all dimensional tables (in the star)
  • PK - Composite (usually) or Surrogate
• Measures
  • Numeric columns that are of interest to the business
  • Additive, Non-additive, Semi-additive
• Factless facts
• Auditing information (optional)
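The difference between additive and semi-additive measures is easiest to see on a small example. The sketch below uses sqlite3 with a hypothetical fact_balance table: deposit can be summed across every dimension, while balance can be summed across accounts but not across dates. Table and column names are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_balance ("
    "date_key INTEGER, account_id INTEGER, deposit INTEGER, balance INTEGER)"
)
conn.executemany(
    "INSERT INTO fact_balance VALUES (?, ?, ?, ?)",
    [
        (20150601, 1, 100, 1100),
        (20150602, 1, 50, 1150),
        (20150601, 2, 200, 700),
        (20150602, 2, 0, 700),
    ],
)

# Additive measure: summing over all rows gives a meaningful total.
total_deposits = conn.execute(
    "SELECT SUM(deposit) FROM fact_balance"
).fetchone()[0]

# Semi-additive measure: sum across accounts, but only for a single date
# (here the latest one), never across the time dimension.
closing_balance = conn.execute(
    "SELECT SUM(balance) FROM fact_balance "
    "WHERE date_key = (SELECT MAX(date_key) FROM fact_balance)"
).fetchone()[0]

# Summing the balance over every date double counts the same money.
naive_balance_sum = conn.execute(
    "SELECT SUM(balance) FROM fact_balance"
).fetchone()[0]
```

Non-additive measures such as ratios or unit prices cannot be summed along any dimension and are usually recomputed from their additive components.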

Practices and Design Patterns

Data Warehouse Pitfalls

• Admit it is not as it seems to be
  • You need education
• Find what is of business value
  • Rather than focus on performance
• Spend a lot of time in Extract-Transform-Load
  • Homogenize data from different sources
  • Find (and resolve) problems in source systems

Prepare your Sources

• Data integrity
  • Avoid redundancy
• Data quality
  • Master data source
  • Data validation
• Auditing
  • CreatedDate / CreatedBy
  • ChangedDate / ChangedBy
• Nightly jobs

Dimension Design

• Business key with non-clustered index
  • Include date (if dimension has history)
• Surrogate key
  • The smallest possible integer
  • Clustered index
• FK constraints
  • Do not enforce (WITH NOCHECK)
  • Document the relation
  • Faster load
• Data validation
  • Task for the source system

Conformed Dimensions

Def: Having the same meaning and content when referenced from multiple fact tables

• Date Dimension
  • Partitioning best candidate
• Granularity
  • Do not store every hour when reporting daily
• Avoid surrogate keys
  • Saves lookups and joins
  • Integer representing date (yyyyMMdd, days after 1/1/1900)
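Both integer encodings mentioned for the date key can be derived directly from the date, which is why no lookup against the dimension is needed during load. A minimal sketch, with function names chosen for this example:

```python
from datetime import date, timedelta

EPOCH = date(1900, 1, 1)

def date_key_yyyymmdd(d: date) -> int:
    """Encode a date as a readable integer such as 20150626."""
    return d.year * 10000 + d.month * 100 + d.day

def date_from_yyyymmdd(key: int) -> date:
    """Decode a yyyyMMdd integer key back into a date."""
    return date(key // 10000, key // 100 % 100, key % 100)

def date_key_days(d: date) -> int:
    """Encode a date as the number of days after 1900-01-01."""
    return (d - EPOCH).days

def date_from_days(key: int) -> date:
    """Decode a days-after-1900-01-01 key back into a date."""
    return EPOCH + timedelta(days=key)

posted = date(2015, 6, 26)
readable_key = date_key_yyyymmdd(posted)
compact_key = date_key_days(posted)
```

The yyyyMMdd form is human readable and sorts chronologically; the day-count form is more compact and makes date arithmetic a plain integer subtraction.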

Pre-join Hierarchies

• Recursive relationships
• Fast drill and report
• Pre-computed aggregations

Hierarchy Bridge

• For each dimension row
  • 1 association with self
  • 1 row for each subordinate
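The bridge rows described above can be generated from a parent-child dimension by walking up from every member to the root. The sketch below assumes an acyclic hierarchy given as a child-to-parent mapping; the function name and the (ancestor, descendant, depth) row layout are illustrative choices, not taken from the slides.

```python
def build_hierarchy_bridge(parent_of):
    """Return sorted (ancestor, descendant, depth) rows.

    parent_of maps each member id to its parent id (None for the root).
    Every member gets one row pointing to itself (depth 0) and every
    ancestor gets one row per subordinate, at any level below it.
    """
    bridge = []
    for member in parent_of:
        ancestor, depth = member, 0
        while ancestor is not None:
            bridge.append((ancestor, member, depth))
            ancestor = parent_of.get(ancestor)
            depth += 1
    return sorted(bridge)

# Hypothetical organisation: 1 is the root, 2 and 3 report to 1, 4 reports to 2.
parent_of = {1: None, 2: 1, 3: 1, 4: 2}
bridge = build_hierarchy_bridge(parent_of)

# Roll up a measure for member 2 and everything below it without recursion.
sales = {1: 10, 2: 20, 3: 30, 4: 40}
total_under_2 = sum(sales[d] for a, d, _ in bridge if a == 2)
```

Once the bridge is stored as a table, a report on any node becomes a single join on the ancestor column instead of a recursive query.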

Determine the Facts

The center of a Star schema

• Identify subject areas
• Identify key business events
• Identify dimensions
  • Start from OLTP logical model
• Identify historical requirements
• Identify attributes

The Grain

Def: The level of detail of a fact table

• What is the business objective?
  • Fine grain - behaviour and frequency analysis
  • Coarse grain - overall and trend analysis
• Aggregates
  • DO NOT summarize prematurely
  • DO NOT mix detail and summary
  • DO use “summary tables”
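One way to follow the three rules on aggregates is to keep the fact table at its finest grain and derive a separate summary table from it. The sqlite3 sketch below assumes a hypothetical fact_sales table keyed by a yyyyMMdd integer and builds a monthly summary named agg_sales_month; both names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Detail fact table: one row per sale line (fine grain).
conn.execute(
    "CREATE TABLE fact_sales ("
    "date_key INTEGER, product_id INTEGER, quantity INTEGER, amount INTEGER)"
)
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
    [
        (20150601, 1, 2, 20),
        (20150601, 1, 1, 10),
        (20150601, 2, 5, 100),
        (20150702, 1, 3, 30),
    ],
)

# Summary table at month/product grain, stored separately from the detail
# rows. Integer division of yyyyMMdd by 100 yields a yyyyMM month key.
conn.execute("""
    CREATE TABLE agg_sales_month AS
    SELECT date_key / 100 AS month_key,
           product_id,
           SUM(quantity)  AS quantity,
           SUM(amount)    AS amount
    FROM fact_sales
    GROUP BY date_key / 100, product_id
""")

monthly = conn.execute(
    "SELECT month_key, product_id, quantity, amount "
    "FROM agg_sales_month ORDER BY month_key, product_id"
).fetchall()
```

Because the detail rows are untouched, the summary can be dropped and rebuilt at a different grain whenever reporting needs change.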

C3-PO is fluent in 6M forms of communication.

What about your customers?

Multinational DW

• What parts need translation?
• Where to store various language versions?
• How to support future languages?

• Dimensions
  • Add language attribute
  • Include text data in the dimension
  • Problem 1: The dimension key?
    • Replicate PK for every language
      Fact.DimId = Dim.Id AND Dim.Lang = [Lang]
  • Problem 2: Storage = [Dim] x [Lang]
• Sub-dimension with language attributes

TxtId  Attr1  Attr2  LangId
1      large  Yes    En
2      small  No     En
1      stor   Ja     No
2      liten  Nei    No
3      …      …      …
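The sub-dimension approach can be sketched in sqlite3 using the four sample rows from the table above. The table names dim_product and dim_product_text, the product ids, and the helper product_labels are assumptions for this example; only the TxtId, Attr1, Attr2 and LangId values come from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Language-neutral dimension: each member points at a text id.
conn.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, txt_id INTEGER)")
# Sub-dimension: one row per text id and language.
conn.execute(
    "CREATE TABLE dim_product_text ("
    "txt_id INTEGER, attr1 TEXT, attr2 TEXT, lang_id TEXT, "
    "PRIMARY KEY (txt_id, lang_id))"
)
conn.executemany("INSERT INTO dim_product VALUES (?, ?)", [(10, 1), (11, 2), (12, 1)])
conn.executemany(
    "INSERT INTO dim_product_text VALUES (?, ?, ?, ?)",
    [
        (1, "large", "Yes", "En"),
        (2, "small", "No", "En"),
        (1, "stor", "Ja", "No"),
        (2, "liten", "Nei", "No"),
    ],
)

def product_labels(conn, lang_id):
    """Resolve the text attributes of every product for one language."""
    return conn.execute(
        "SELECT p.product_id, t.attr1, t.attr2 "
        "FROM dim_product p "
        "JOIN dim_product_text t "
        "  ON t.txt_id = p.txt_id AND t.lang_id = ? "
        "ORDER BY p.product_id",
        (lang_id,),
    ).fetchall()

english = product_labels(conn, "En")
norwegian = product_labels(conn, "No")
```

Texts are stored once per distinct value and language rather than once per dimension row and language, and adding a future language only means inserting new rows into the sub-dimension.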

Data warehouse maintenance

How Large is “Large”

Is big really big?

Partitioning

• Why
  • Faster index maintenance
  • Faster load
  • Faster queries
• When
  • Tables 10GB+
• How
  • Do not partition dimension tables
  • Partition by date (most analyses are time-based)
  • Eliminate partitions (WHERE [PartitionKey]=…)
  • Avoid split and merge of existing partitions
    • Can cause inefficient log generation

Columnstore Index

• Non-clustered in SQL 2012
• Clustered in SQL 2014
• Pros
  • Better data compression
  • High performance on table scan
• Clustered CSI Limitations
  • No other indexes allowed
  • Little advantage on seek operations
  • No XML, computed column or replication

Extract-Transform-Load

• Extract data from OLTP
• Data transformations
• Data loads
• DW maintenance

Efficient Load Process

• Use simple recovery model during data load
• Staging
  • Avoid indexing
  • Populate in parallel
• Maintain DW
  • Disable indexes on load
  • Rebuild manually after load
  • Automatic statistics updates slow down SQL Server
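The load pattern above (no index maintenance during the load, one rebuild and one statistics refresh afterwards) can be illustrated with sqlite3. This is only an analogy: SQLite has no way to disable an index, so the sketch drops and recreates it, whereas in SQL Server the index would be disabled and then rebuilt. The table fact_sales, the index ix_fact_sales_date and the function bulk_load are hypothetical names.

```python
import sqlite3

def bulk_load(conn, rows):
    """Load rows without index maintenance, then rebuild once."""
    # 1. Remove the index so inserts do not have to maintain it.
    conn.execute("DROP INDEX IF EXISTS ix_fact_sales_date")
    # 2. Load everything inside a single transaction.
    with conn:
        conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
    # 3. Rebuild the index once over the full data set.
    conn.execute("CREATE INDEX ix_fact_sales_date ON fact_sales (date_key)")
    # 4. Refresh optimizer statistics explicitly, after the load.
    conn.execute("ANALYZE")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact_sales (date_key INTEGER, product_id INTEGER, amount INTEGER)"
)
conn.execute("CREATE INDEX ix_fact_sales_date ON fact_sales (date_key)")

rows = [(20150601 + i % 30, i % 5, i) for i in range(1000)]
bulk_load(conn, rows)

loaded = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
indexes = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index'"
    )
]
```

Rebuilding once over the complete table is generally cheaper than updating the index row by row, and it leaves the index unfragmented.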

To SSIS, or not to SSIS ?

Pros

• Little to no coding
• Extensive support of various data sources
• Parallel execution of migration tasks
• Better organization of the ETL process

Cons

• Another way of thinking
• Hidden options
• A T-SQL developer would do it much faster
• Auto-generated flows need optimization
• Sometimes simply does not work (e.g. Sort by GUID)

Takeaways

• Books
  • The Data Warehouse Toolkit (3rd ed), Ralph Kimball
  • Implementing DW with Microsoft SQL Server 2012
  • Data Warehousing Fundamentals, Paulraj Ponniah
• Articles
  • Best Practices in Data Warehouse (Hanover Research Council)
  • http://www.kimballgroup.com/category/design-tips/
  • http://sqlmag.com/business-intelligence
• Resources
  • http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/
  • http://www.databaseanswers.org/data_models/index.htm
