Data Warehouse Design Best Practices
Jun 26, 2015
About me
• Project Manager @
• 12 years professional experience
• .NET Web Development MCPD
• SQL Server 2012 (MCSA)
• Business Interests
  • Web Development, SOA, Integration
  • Security & Performance Optimization
  • Horizon2020, Open BIM, GIS, Mapping
• Contact me
  • www.linkedin.com/in/ivelin
  • www.slideshare.net/ivoandreev
About me
• Senior Developer @
• .NET Web Development MCPD
• Business Interests
  • Web Development, WCF, Integration
  • SQL Server – Query Optimization and Tuning
  • Data Warehousing
• Contact me
  • [email protected]
  • www.linkedin.com/in/georgimishev
Sponsors
Agenda
• Why Data Warehouse
• Main DW Architectures
• Dimensional Modeling
• Patterns & Practices
• DW Maintenance
• ETL Process
• SSIS Demo
Lots of Data Everywhere
• Can’t find data?
  • Data scattered over the network
• Can’t get data?
  • Need an expert to get the data
• Can’t understand data?
  • Data poorly documented
• Can’t use data found?
  • Data needs to be transformed
Data Warehouse?
• Integrated
  • Heterogeneous sources
  • Data cleansing and conversion ($, €, 元)
• Focus on subject
  • e.g. Customer, Sale, Product
• Time variant
  • Timestamp every key
  • Historical data (10+ years)
Def: Central repository where data is organized, cleansed and stored in a standardized format.
Different Problems - Different Solutions
                OLTP Database               Data Warehouse
Users           Customer                    Knowledge worker
Design          Normalized, Data Integrity  Denormalized
Function        Daily operation             Decision making
Data            Current, Detailed           Historical, Aggregated
Usage           Real time                   Ad-hoc
Access          Short R/W transactions      Complex R/O queries
Data accessed   Comparatively lower         Large amounts
# Records       x100                        x1’000’000
# Users         x1’000                      x10
DB Size         x10 GB                      x100 GB-TB
Different DW Architectures
B. Inmon Model
Top-Down Approach
• Warehouse (3NF)
• Data Mart & OLAP (MD)
http://sqlschoolgr.files.wordpress.com/2012/03/clip_image003_thumb.png?w=640&h=368
R. Kimball Model
Bottom-Up Approach
• Data Marts (3NF or MD)
• Warehouse & OLAP (MD)
http://sqlschoolgr.files.wordpress.com/2012/03/clip_image005_thumb.png?w=640&h=369
Data Vault (by Dan Linstedt)
• Hubs
  • List of unique business keys
• Links
  • Unique relationships between keys
• Satellites
  • Hub and Link details and history
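The three Data Vault building blocks can be sketched as tables. This is a simplified illustration only: it uses Python's built-in sqlite3 as a stand-in database, and every table and column name (hub_customer, link_purchase, sat_customer, load_date, record_source) is an assumption made for the example, not a schema from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hub: the list of unique business keys
    CREATE TABLE hub_customer (
        customer_id   INTEGER PRIMARY KEY,
        customer_bk   TEXT NOT NULL UNIQUE,
        load_date     TEXT NOT NULL,
        record_source TEXT NOT NULL
    );
    CREATE TABLE hub_product (
        product_id    INTEGER PRIMARY KEY,
        product_bk    TEXT NOT NULL UNIQUE,
        load_date     TEXT NOT NULL,
        record_source TEXT NOT NULL
    );
    -- Link: a unique relationship between hub keys
    CREATE TABLE link_purchase (
        purchase_id   INTEGER PRIMARY KEY,
        customer_id   INTEGER NOT NULL REFERENCES hub_customer (customer_id),
        product_id    INTEGER NOT NULL REFERENCES hub_product (product_id),
        load_date     TEXT NOT NULL,
        UNIQUE (customer_id, product_id)
    );
    -- Satellite: descriptive details of a hub, one row per change
    CREATE TABLE sat_customer (
        customer_id   INTEGER NOT NULL REFERENCES hub_customer (customer_id),
        load_date     TEXT NOT NULL,
        city          TEXT,
        PRIMARY KEY (customer_id, load_date)
    );

    INSERT INTO hub_customer  VALUES (1, 'C001', '2015-01-01', 'CRM');
    INSERT INTO hub_product   VALUES (1, 'P100', '2015-01-01', 'ERP');
    INSERT INTO link_purchase VALUES (1, 1, 1, '2015-01-02');
    INSERT INTO sat_customer  VALUES (1, '2015-01-01', 'Sofia');
    INSERT INTO sat_customer  VALUES (1, '2015-06-26', 'Varna');
""")

# Full history of customer C001: every satellite row is kept.
history = conn.execute("""
    SELECT h.customer_bk, s.load_date, s.city
    FROM hub_customer h
    JOIN sat_customer s ON s.customer_id = h.customer_id
    ORDER BY s.load_date
""").fetchall()
```

Even this small history query needs a join per hub and satellite, which hints at the "many joins" cost of the approach.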
It is irrelevant which camp you belong to…
as long as you understand why!
Making Your Choice
• Data Vault
Backend Data Warehouse
+ Multiple sources; Full history; Incremental build
- Up-front work; Long-term payoff; Many joins
• Inmon (3NF)
+ Structured
+ Easy to maintain
+ Easier data mining
- Time-consuming to build
• Kimball (MD)
+ Start small, scale big
+ Faster ROI
+ Analytical tools
- Low reusability
Dimensional modeling as de-facto standard
Dimensions
Def: The object of BI interest
• Keys
  • Surrogate key
  • Business key
• Hierarchical attributes
  • Analysis and Drill Down
• Member properties
  • Presentation labels
• Auditing information (not for end users)
Slowly Changing Dimensions
Def: Scheme for recording changes over time
• Type 1 - Overwrite
• Type 2 - Multiple Records
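A minimal sketch of the two types, using Python's built-in sqlite3 in place of SQL Server. The dim_customer table, its columns (valid_from, valid_to, is_current) and both helper functions are hypothetical names chosen for the illustration; real implementations vary in how they flag the current row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_sk INTEGER PRIMARY KEY,   -- surrogate key
        customer_bk TEXT NOT NULL,         -- business key from the source
        city        TEXT NOT NULL,
        valid_from  TEXT NOT NULL,
        valid_to    TEXT,                  -- NULL marks the current row
        is_current  INTEGER NOT NULL
    )
""")
conn.execute(
    "INSERT INTO dim_customer (customer_bk, city, valid_from, valid_to, is_current) "
    "VALUES ('C001', 'Sofia', '2014-01-01', NULL, 1)"
)

def scd_type1(conn, business_key, new_city):
    """Type 1: overwrite the attribute in place; history is lost."""
    conn.execute(
        "UPDATE dim_customer SET city = ? WHERE customer_bk = ? AND is_current = 1",
        (new_city, business_key),
    )

def scd_type2(conn, business_key, new_city, change_date):
    """Type 2: close the current row and add a new one; history is kept."""
    conn.execute(
        "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
        "WHERE customer_bk = ? AND is_current = 1",
        (change_date, business_key),
    )
    conn.execute(
        "INSERT INTO dim_customer (customer_bk, city, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (business_key, new_city, change_date),
    )

scd_type1(conn, "C001", "Plovdiv")              # still one row, city overwritten
scd_type2(conn, "C001", "Varna", "2015-06-26")  # now two rows for C001

rows = conn.execute(
    "SELECT city, valid_from, valid_to, is_current "
    "FROM dim_customer ORDER BY customer_sk"
).fetchall()
```

After both calls the table holds a closed row for Plovdiv and a current row for Varna. The original value Sofia is gone because the Type 1 change overwrote it.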
Facts
Def: Measurement of a business process
• Keys
  • FK from all dimension tables (in the star)
  • PK - Composite (usually) or Surrogate
• Measures
  • Numeric columns that are of interest to the business
  • Additive, Non-additive, Semi-additive
• Factless facts
• Auditing information (optional)
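The measure types are easiest to see on a semi-additive example: an account balance can be summed across accounts, but not across days. The snapshot rows and function names below are invented for the illustration.

```python
from collections import defaultdict

# (date_key, account, balance): a daily snapshot fact, hypothetical data
snapshots = [
    (20150625, "A", 100), (20150625, "B", 50),
    (20150626, "A", 120), (20150626, "B", 40),
]

def total_balance_per_day(rows):
    """Summing across the account dimension is valid for a balance."""
    totals = defaultdict(int)
    for date_key, _account, balance in rows:
        totals[date_key] += balance
    return dict(totals)

def closing_balance(rows):
    """Across time, take the latest snapshot instead of summing the days."""
    per_day = total_balance_per_day(rows)
    return per_day[max(per_day)]

# Summing every row (310) would mix days and has no business meaning.
naive_sum = sum(balance for _date_key, _account, balance in snapshots)
```

An additive measure such as sales amount could safely use the naive sum. A non-additive one such as a ratio has to be recomputed from its components at each level.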
Practices and Design Patterns
Data Warehouse Pitfalls
• Admit it is not as it seems to be
  • You need education
• Find what is of business value
  • Rather than focus on performance
• Spend a lot of time in Extract-Transform-Load
  • Homogenize data from different sources
  • Find (and resolve) problems in source systems
Prepare your Sources
• Data integrity
  • Avoid redundancy
• Data quality
  • Master data source
  • Data validation
• Auditing
  • CreatedDate / CreatedBy
  • ChangedDate / ChangedBy
• Nightly jobs
Dimension Design
• Business key with non-clustered index
  • Include date (if dimension has history)
• Surrogate key
  • The smallest possible integer
  • Clustered index
• FK constraints
  • Do not enforce (WITH NOCHECK)
  • Document the relation
  • Faster load
• Data validation
  • Task for the Source system
Conformed Dimensions
Def: Having the same meaning and content when referred to from multiple fact tables
• Date Dimension
  • Partitioning best candidate
  • Granularity
    • Do not store every hour when reporting daily
  • Avoid surrogate keys
    • Saves lookup and joins
    • Integer representing date (yyyyMMdd, or days after 1/1/1900)
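Both integer encodings of the date key can be computed directly from the date, which is why no lookup against the date dimension is needed at load time. The function names below are illustrative.

```python
from datetime import date

def date_key_yyyymmdd(d: date) -> int:
    """Readable key, e.g. 26 June 2015 becomes 20150626."""
    return d.year * 10000 + d.month * 100 + d.day

def date_key_days_since_1900(d: date) -> int:
    """Compact key: number of days after 1/1/1900 (that day itself is 0)."""
    return (d - date(1900, 1, 1)).days

def date_from_yyyymmdd(key: int) -> date:
    """Inverse of date_key_yyyymmdd, handy when reading fact rows back."""
    return date(key // 10000, key // 100 % 100, key % 100)
```

The yyyyMMdd form stays human-readable in ad-hoc queries. The day-count form makes date arithmetic a plain subtraction.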
Pre-join Hierarchies
• Recursive relationships
• Fast drill and report
• Pre-computed aggregations
Hierarchy Bridge
• For each dimension row
  • 1 association with self
  • 1 row for each subordinate
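The bridge rows can be derived from an ordinary parent-child relationship. The sketch below assumes the hierarchy is given as a child-to-parent mapping with no cycles. The depth column and the sample organization are additions for the illustration.

```python
def build_hierarchy_bridge(parent_of):
    """Return (ancestor, descendant, depth) rows for a parent-child hierarchy.

    Every member gets one row pointing at itself (depth 0) and one row under
    each of its superiors, so each member ends up with one row per subordinate.
    """
    bridge = []
    for member in parent_of:
        ancestor, depth = member, 0
        while ancestor is not None:          # walk up to the root
            bridge.append((ancestor, member, depth))
            ancestor = parent_of[ancestor]
            depth += 1
    return sorted(bridge)

# Hypothetical organization: child -> parent (None marks the root)
parent_of = {"CEO": None, "VP Sales": "CEO", "VP IT": "CEO", "Developer": "VP IT"}
bridge = build_hierarchy_bridge(parent_of)

# Everything under "VP IT", including itself, without recursion at query time
under_vp_it = [desc for anc, desc, _depth in bridge if anc == "VP IT"]
```

Joining facts to the bridge on the descendant column and filtering on the ancestor rolls up a whole subtree with a single join.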
Determine the Facts
The center of a Star schema
• Identify subject areas
• Identify key business events
• Identify dimensions
• Start from OLTP logical model
• Identify historical requirements
• Identify attributes
The Grain
Def: The level of detail of a fact table
• What is the business objective?
  • Fine grain - behaviour and frequency analysis
  • Coarse grain - overall and trend analysis
• Aggregates
  • DO NOT summarize prematurely
  • DO NOT mix detail and summary
  • DO use “summary tables”
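The aggregate rules can be sketched by keeping the detail fact at its finest grain and deriving a separate summary table from it. The example uses Python's sqlite3 as a stand-in database. The names fact_sales and agg_sales_daily and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Fine grain: one row per order line, nothing pre-summarized
    CREATE TABLE fact_sales (
        date_key    INTEGER NOT NULL,   -- yyyyMMdd
        product_key INTEGER NOT NULL,
        quantity    INTEGER NOT NULL,
        amount      REAL    NOT NULL
    );
    INSERT INTO fact_sales VALUES
        (20150625, 1, 2, 20.0),
        (20150625, 1, 1, 10.0),
        (20150626, 1, 3, 30.0),
        (20150626, 2, 1, 99.0);

    -- Coarse grain lives in its own summary table, never mixed into fact_sales
    CREATE TABLE agg_sales_daily AS
    SELECT date_key, product_key,
           SUM(quantity) AS quantity,
           SUM(amount)   AS amount
    FROM fact_sales
    GROUP BY date_key, product_key;
""")

daily = conn.execute(
    "SELECT date_key, product_key, quantity, amount "
    "FROM agg_sales_daily ORDER BY date_key, product_key"
).fetchall()
```

Because the detail rows are preserved, the summary can be rebuilt at any other grain later. A prematurely summarized fact table could not be expanded back.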
C-3PO is fluent in 6M forms of communication.
What about your customers?
Multinational DW
• What parts need translation?
• Where to store various language versions?
• How to support future languages?
• Dimensions
  • Add language attribute
  • Include text data in the dimension
• Problem 1: The dimension key?
  • Replicate PK for every language
    Fact.DimId = Dim.Id AND Dim.Lang=[Lang]
• Problem 2: Storage = [Dim] x [Lang]
  • Sub-dimension with language attributes
TxtId  Attr1  Attr2  LangId
1      large  Yes    En
2      small  No     En
1      stor   Ja     No
2      liten  Nei    No
3      …      …      …
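The sub-dimension approach can be sketched with the sample rows from the table above. The dimension keeps one row per member and points to a text sub-dimension keyed by TxtId and LangId. The table names, the fact rows and the sales_by_size helper are assumptions for the illustration, with sqlite3 standing in for the warehouse database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        id     INTEGER PRIMARY KEY,
        txt_id INTEGER NOT NULL              -- points to the translated texts
    );
    CREATE TABLE dim_product_text (          -- sub-dimension: one row per language
        txt_id  INTEGER NOT NULL,
        attr1   TEXT,
        attr2   TEXT,
        lang_id TEXT NOT NULL,
        PRIMARY KEY (txt_id, lang_id)
    );
    CREATE TABLE fact_sales (dim_id INTEGER NOT NULL, amount REAL NOT NULL);

    INSERT INTO dim_product VALUES (10, 1), (11, 2);
    INSERT INTO dim_product_text VALUES
        (1, 'large', 'Yes', 'En'), (2, 'small', 'No',  'En'),
        (1, 'stor',  'Ja',  'No'), (2, 'liten', 'Nei', 'No');
    INSERT INTO fact_sales VALUES (10, 5.0), (10, 7.0), (11, 1.0);
""")

def sales_by_size(conn, lang):
    """Total sales per Attr1 label, shown in the requested language."""
    return conn.execute("""
        SELECT t.attr1, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_product d      ON f.dim_id = d.id
        JOIN dim_product_text t ON t.txt_id = d.txt_id AND t.lang_id = ?
        GROUP BY t.attr1
        ORDER BY t.attr1
    """, (lang,)).fetchall()

english = sales_by_size(conn, "En")
norwegian = sales_by_size(conn, "No")
```

The fact rows and the dimension key are stored once. Adding a language only adds rows to the text sub-dimension.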
Data warehouse maintenance
How Large is “Large”
Is big really big?
Partitioning
• Why
  • Faster index maintenance
  • Faster load
  • Faster queries
• When
  • Tables 10GB+
• How
  • Do not partition dimension tables
  • Partition by date (most analyses are time-based)
  • Eliminate partitions (WHERE [PartitionKey]=…)
  • Avoid split and merge of existing partitions
    • Can cause inefficient log generation
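Partition elimination can be illustrated with a toy model: given the boundary values of a date-based partitioning scheme, a filter on the partition key tells the engine which partitions it can skip. The quarterly boundaries and function names are hypothetical. The numbering mimics a "range right" scheme, where each boundary value starts a new partition, and it is not SQL Server code.

```python
from bisect import bisect_right

# Hypothetical quarterly boundaries on a yyyyMMdd partition key.
# Four boundaries give five partitions, numbered from 1.
boundaries = [20150101, 20150401, 20150701, 20151001]

def partition_number(date_key):
    """Partition that holds date_key; a boundary value belongs to the right side."""
    return bisect_right(boundaries, date_key) + 1

def partitions_to_scan(from_key, to_key):
    """Partitions a query filtered on [from_key, to_key] has to touch."""
    return list(range(partition_number(from_key), partition_number(to_key) + 1))

single_day = partitions_to_scan(20150626, 20150626)
spring_to_summer = partitions_to_scan(20150301, 20150715)
```

A query with no predicate on the partition key gives the engine nothing to prune and touches all five partitions. That is why the partition key should appear in the WHERE clause.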
Columnstore Index
• Non-clustered in SQL 2012
• Clustered in SQL 2014
• Pros
  • Better data compression
  • High performance on table scan
• Clustered CSI Limitations
  • No other indexes allowed
  • Little advantage on seek operations
  • No XML, computed column or replication
Extract-Transform-Load
• Extract data from OLTP
• Data transformations
• Data loads
• DW maintenance
Efficient Load Process
• Use simple recovery model during data load
• Staging
  • Avoid indexing
  • Populate in parallel
• Maintain DW
  • Disable indexes on load
  • Rebuild manually after load
  • Automatic stats updates slow down SQL Server
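The load sequence above can be sketched as: fill an unindexed staging table, drop the index on the target, bulk-copy, then rebuild the index once. The sketch uses sqlite3, which has no equivalent of disabling an index, so DROP INDEX followed by CREATE INDEX stands in for disable and rebuild. All table and index names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Staging table: deliberately no indexes
    CREATE TABLE stg_sales  (date_key INTEGER, product_key INTEGER, amount REAL);
    CREATE TABLE fact_sales (date_key INTEGER, product_key INTEGER, amount REAL);
    CREATE INDEX ix_fact_sales_date ON fact_sales (date_key);
""")

def load_fact_sales(conn, rows):
    """Stage the rows, then load the fact table with its index out of the way."""
    with conn:  # commit on success, roll back on error
        conn.execute("DELETE FROM stg_sales")
        conn.executemany("INSERT INTO stg_sales VALUES (?, ?, ?)", rows)
        # Stand-in for disabling the index during the load
        conn.execute("DROP INDEX IF EXISTS ix_fact_sales_date")
        conn.execute(
            "INSERT INTO fact_sales "
            "SELECT date_key, product_key, amount FROM stg_sales"
        )
        # Stand-in for the manual rebuild after the load
        conn.execute("CREATE INDEX ix_fact_sales_date ON fact_sales (date_key)")

load_fact_sales(conn, [(20150625, 1, 10.0), (20150626, 1, 30.0), (20150626, 2, 99.0)])
loaded = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
```

Building the index once over the finished table avoids maintaining it row by row during the insert.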
To SSIS, or not to SSIS?
Pros
• Minimal to no coding
• Extensive support of various data sources
• Parallel execution of migration tasks
• Better organization of the ETL process
Cons
• Another way of thinking
• Hidden options
• A T-SQL developer would do it much faster
• Auto-generated flows need optimization
• Sometimes simply does not work (e.g. Sort by GUID)
Takeaways
• Books
  • The Data Warehouse Toolkit (3rd ed), Ralph Kimball
  • Implementing DW with Microsoft SQL Server 2012
  • Data Warehousing Fundamentals, Paulraj Ponniah
• Articles
  • Best Practices in Data Warehouse (Hanover Research Council)
  • http://www.kimballgroup.com/category/design-tips/
  • http://sqlmag.com/business-intelligence
• Resources
  • http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/
  • http://www.databaseanswers.org/data_models/index.htm