cohesion institute December 5, 2005 Speaker: R. Michael Pickering Speaker: R. Michael Pickering President, Cohesion Systems Consulting President, Cohesion Systems Consulting Inc. Inc. Data Warehouse Data Warehouse Architecture Best Architecture Best Practices Practices
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
cohesion institute
December 5, 2005
Speaker: R. Michael PickeringSpeaker: R. Michael PickeringPresident, Cohesion Systems Consulting Inc.President, Cohesion Systems Consulting Inc.
Data Warehouse Data Warehouse Architecture Best PracticesArchitecture Best Practices
April 9, 2023 DW Architecture Best Practices 2
cohesion institute
Agenda
Introductions Business Intelligence Background Architecture Best Practices Questions & Answers
cohesion institute
Introductions
Data Warehouse Data Warehouse Architecture Best PracticesArchitecture Best Practices
April 9, 2023 DW Architecture Best Practices 4
cohesion institute
Presenter Biography R. Michael Pickering
President and Chief Architect, Cohesion Systems Consulting Inc.
Since inception in 2000, clients have included Enbridge, CIBC, The Bank of New York, Loyalty Management Group, Canada Post Borderfree, Katz Group
April 9, 2023 DW Architecture Best Practices 6
cohesion institute
Audience Survey By a show of hands, please indicate your
experience with: normalization dimensional modeling operational data store data consolidation Extract Transform Load (ETL) metadata architecture DW appliances
cohesion institute
Business Intelligence Background
Data Warehouse Data Warehouse Architecture Best PracticesArchitecture Best Practices
April 9, 2023 DW Architecture Best Practices 8
cohesion institute
What is Business Intelligence?
A Data Warehouse is usually one component of an overall business intelligence solution
IT people may be tempted to think in terms of products and technologies
BUT...
April 9, 2023 DW Architecture Best Practices 9
cohesion institute
Overarching Goal
The overarching goal of business intelligence is to provide the information necessary to MANAGE a business
This means providing information in support of management decision making, which is why BI is also called “Decision Support”
April 9, 2023 DW Architecture Best Practices 10
cohesion institute
BI is about “Data Abstraction” wisdom knowledge information data
audience for a data warehouse typically considers higher slices of data abstraction pyramid
lowest level of pyramid is too detailed & unwieldy
Stages Stages (4)(4)
April 9, 2023 DW Architecture Best Practices 11
cohesion institute
It’s Not Technology
Business Intelligence is about delivering business value
provide tangible benefit by answering important questions that can help the business to achieve its strategic focus
Improving profitability Who are our five most profitable clients? What are our least profitable products?
Reducing cost Who are our lowest cost suppliers? Which materials incur highest spoilage costs?
Improving customer satisfaction What factors may lead to lost customers?
April 9, 2023 DW Architecture Best Practices 12
cohesion institute
Business of BI
In some cases, legislation such as Sarbanes-Oxley or Basel II makes some kind of BI fundamental to doing business
Many leading companies use BI to achieve competitive advantage
E.g. Walmart, Dell, Amazon.com, Kraft, American Express, etc…
April 9, 2023 DW Architecture Best Practices 13
cohesion institute
Data Warehouse Architecture
architecture is about delivering an elegant solution that meets the solution requirements this means really understanding the
problem DW architecture is part art, part
science
April 9, 2023 DW Architecture Best Practices 14
cohesion institute
Good Architecture
‘It’s not easy to describe a good design, but I’ll know it when I see it’
April 9, 2023 DW Architecture Best Practices 15
cohesion institute
BI Architecture Requirements must recognize change as a
constant take incremental development
approach existing applications must
continue to work need to allow more data and
new types of data to be added
April 9, 2023 DW Architecture Best Practices 16
cohesion institute
End User Acceptance
understandability understandability is in the eyes
of the beholder want to hide the complexity try to make it:
intuitive, obvious visible, memorable
April 9, 2023 DW Architecture Best Practices 17
cohesion institute
End User Acceptance
performance don’t want to interrupt
the thinking process provide one click,
instantaneous access warehouse must be
available, “production” system
cohesion institute
Architecture Best Practices
Data Warehouse Data Warehouse Architecture Best PracticesArchitecture Best Practices
April 9, 2023 DW Architecture Best Practices 19
cohesion institute
High Level Architecture remember the different “worlds”
on-line transaction processing (OLTP) business intelligence systems (BIS)
users are different data content is different data structures are different architecture & methodology must be
different
April 9, 2023 DW Architecture Best Practices 20
cohesion institute
Two Different Worlds
On-Line Transaction Processing Entity Relational Data Model
created in 1960’s to address performance issues with relational database implementations
normalized to most efficiently get data in divides the data into many discrete entities many relationships between these entities this approach was documented by C.J. Date in
An Introduction to Database Systems
April 9, 2023 DW Architecture Best Practices 21
cohesion institute
Two Different Worlds
Business Intelligence Systems Dimensional Data Model
also called star schema designed to easily get information out fewer relationships than ERD, the only
table with multiple joins connecting to other tables is the central table
developed in 1960’s by data service providers, formalized by Ralph Kimball in The Data Warehouse Toolkit
April 9, 2023 DW Architecture Best Practices 22
cohesion institute
Entity Relation Disadvantages
all tables look the same people can’t visualize/remember
diagrams software can’t navigate as schema
becomes too complex business processes mixed together many artificial keys created
April 9, 2023 DW Architecture Best Practices 23
cohesion institute
Dimensional Model Advantages
simplicity humans can navigate and remember software can navigate
deterministically business process explicitly separated
(Data Mart) not so many keys (keys = # of
attendant tables)
April 9, 2023 DW Architecture Best Practices 24
cohesion institute
Best Practice #1
Use a data model that is optimized for information retrieval dimensional model denormalized hybrid approach
April 9, 2023 DW Architecture Best Practices 25
cohesion institute
Data Acquisition Processes Extract Transform Load (ETL)
the process of unloading or copying data from the source systems, transforming it into the format and data model required in the BI environment, and loading it to the DW
also, a software development tool for building ETL processes (an ETL tool)
many production DWs use COBOL or other general-purpose programming languages to implement ETL
April 9, 2023 DW Architecture Best Practices 26
cohesion institute
Data Quality Assurance
data cleansing the process of validating and enriching
the data as it is published to the DW also, a software development tool for
building data cleansing processes (a data cleansing tool)
many production DWs have only very rudimentary data quality assurance processes
April 9, 2023 DW Architecture Best Practices 27
cohesion institute
Data Acquisition & Cleansing
getting data loaded efficiently and correctly is critical to the success of your DW implementation of data acquisition &
cleansing processes represents from 50 to 80% of effort on typical DW projects
inaccurate data content can be ‘the kiss of death’ for user acceptance
April 9, 2023 DW Architecture Best Practices 28
cohesion institute
Best Practice #2
Carefully design the data acquisition and cleansing processes for your DW Ensure the data is processed
efficiently and accurately Consider acquiring ETL and Data
Cleansing tools Use them well!
April 9, 2023 DW Architecture Best Practices 29
cohesion institute
Data Model
Already discussed the benefits of a dimensional model
No matter whether dimensional modeling or any other design approach is used, the data model must be documented
April 9, 2023 DW Architecture Best Practices 30
cohesion institute
Documenting the Data Model
The best practice is to use some kind of data modeling tool
CA ERwin Sybase PowerDesigner Oracle Designer IBM Rational Rose Etc.
Different tools support different modeling notations, but they are more or less equivalent anyway
Most tools allow sharing of their metadata with an ETL tool
April 9, 2023 DW Architecture Best Practices 31
cohesion institute
Data Model Standards data model standards appropriate for the
environment and tools chosen in your data warehouse should be adopted
considerations should be given to data access tool(s) and integration with overall enterprise standards
standards must be documented and enforced within the DW team
someone must ‘own’ the data model to ensure a quality data model, all changes
should be reviewed thru some formal process
April 9, 2023 DW Architecture Best Practices 32
cohesion institute
Data Model Metadata
Business definitions should be recorded for every field (unless they are technical fields only)
Domain of data should be recorded Sample values should be included As more metadata is populated into the
modeling tool it becomes increasingly important to be able to share this data across ETL and Data Access tools
April 9, 2023 DW Architecture Best Practices 33
cohesion institute
Metadata Architecture
The strategy for sharing data model and other metadata should be formalized and documented
Metadata management tools should be considered & the overall metadata architecture should be carefully planned
April 9, 2023 DW Architecture Best Practices 34
cohesion institute
Best Practice #3
Design a metadata architecture that allows sharing of metadata between components of your DW consider metadata standards such as
OMG’s Common Warehouse Metamodel (CWM)
April 9, 2023 DW Architecture Best Practices 35
cohesion institute
Alternative Architecture Approaches
Bill Inmon: “Corporate Information Factory”
Hub and Spoke philosophy “JBOC” – just a bunch of cubes Let it evolve naturally
April 9, 2023 DW Architecture Best Practices 36
cohesion institute
What We Want(Architectural Principal) In most cases, business and IT
agree that the data warehouse should provide a ‘single version of the truth’
Any approach that can result in disparate data marts or cubes is undesireable
This is known as data silos or…
April 9, 2023 DW Architecture Best Practices 37
cohesion institute
Enterprise DW Architecture how to design an enterprise data
warehouse and ensure a ‘single version of the truth’?
according to Kimball: start with an overall data architecture
phase use “Data Warehouse Bus” design to
integrate multiple data marts use incremental approach by building one
data mart at a time
April 9, 2023 DW Architecture Best Practices 38
cohesion institute
Data Warehouse Bus Architecture
named for the bus in a computer standard interface that allows you to plug
in cdrom, disk drive, etc. these peripherals work together smoothly
provides framework for data marts to fit together
allows separate data marts to be implemented by different groups, even at different times
April 9, 2023 DW Architecture Best Practices 39
cohesion institute
Data Mart Definition
data mart is a complete subset of the overall data warehouse a single business process OR a group of related business processes
think of a data mart as a collection of related fact tables sharing conformed dimensions, aka a ‘fact constellation’
April 9, 2023 DW Architecture Best Practices 40
cohesion institute
Designing The DW Bus determine which dimensions will be
shared across multiple data marts conform the shared dimensions
produce a master suite of shared dimensions
determine which facts will be shared across data marts
conform the facts standardize the definitions of facts
April 9, 2023 DW Architecture Best Practices 41
cohesion institute
Dimension Granularity
conformed dimensions will usually be granular makes it easy to integrate with various
base level fact tables easy to extend fact table by adding
new facts no need to drop or reload fact tables,
and no keys have to be changed
April 9, 2023 DW Architecture Best Practices 42
cohesion institute
Conforming Dimensions
by adhering to standards, the separate data marts can be plugged together e.g. customer, product, time
they can even share data usefully, for example in a drill across report
ensures reports or queries from different data marts share the same context
April 9, 2023 DW Architecture Best Practices 43
cohesion institute
Conforming Dimensions (cont’d)
accomplish this by adding any dimension attribute(s) needed in any data mart(s) to the standard dimension definition
attributes not needed everywhere can always be ignored
typically harder to determine how to load conformed dimensions than to design them initially
need a single integrated ETL process what is the SOR for each attribute? how do we deal with attributes for which there is
more than one possible SOR?
April 9, 2023 DW Architecture Best Practices 44
cohesion institute
Conforming Facts
in an enterprise, some metrics may not have the same generally accepted definition across all business units
conforming facts is generally a bigger design challenge than conforming dimensions why?
April 9, 2023 DW Architecture Best Practices 45
cohesion institute
Conforming Facts - Benefits ensures the constituent data marts can as
clearly as possible represent fact data expressed on the same basis using consistent definitions
ensures reports or queries from different data marts share consistent content
success of an Enterprise DW hinges on successfully conformed facts
any perceived inconsistencies in fact definitions across data marts will generally be considered to be a DW bug or data problem by users
if users don’t have full confidence in data quality they may stop using the DW
April 9, 2023 DW Architecture Best Practices 46
cohesion institute
Data Consolidation
a current trend in BI/DW is ‘data consolidation’
from a software vendor perspective, it is tempting to simplify this: ‘we can keep all the tables for all your
disparate applications in one physical database’
April 9, 2023 DW Architecture Best Practices 47
cohesion institute
Data Integration
To truly achieve ‘a single version of the truth’, must do more than simply consolidating application databases
Must integrate data models and establish common terms of reference
April 9, 2023 DW Architecture Best Practices 48
cohesion institute
Best Practice #4
Take an approach that consolidates data into ‘a single version of the truth’ Data Warehouse Bus
conformed dimensions & facts OR?
April 9, 2023 DW Architecture Best Practices 49
cohesion institute
Operational Data Store (ODS) a single point of integration for disparate
operational systems contains integrated data at the most
detailed level (transactional) may be loaded in ‘near real time’ or
periodically can be used for centralized operational
reporting
April 9, 2023 DW Architecture Best Practices 50
cohesion institute
Role of an ODS in DW Architecture
In the case where an ODS is a necessary component of the overall DW, it should be carefully integrated into the overall architecture
Can also be used for: Staging area Master/reference data management Etc…
April 9, 2023 DW Architecture Best Practices 51
cohesion institute
ODS Data Model
Not clear if any design approach for an ODS data model has emerged as a best practice normalized dimensional denormalized/hybrid any suggestions?
April 9, 2023 DW Architecture Best Practices 52
cohesion institute
Best Practice #5 Consider implementing an ODS only
when information retrieval requirements are near the bottom of the data abstraction pyramid and/or when there are multiple operational sources that need to be accessed Must ensure that the data model is
integrated, not just consolidated May consider 3NF data model Avoid at all costs a ‘data dumping ground’
April 9, 2023 DW Architecture Best Practices 53
cohesion institute
Capacity Planning DW workloads are typically very
demanding, especially for I/O capacity Successful implementations tend to grow
very quickly, both in number of users and data volume
Rules of thumb do exist for sizing the hardware platform to provide adequate initial performance typically based on estimated ‘raw’ data size
of proposed database e.g. 100-150 Gb per modern CPU
April 9, 2023 DW Architecture Best Practices 54
cohesion institute
SMP Server Scale Up Scaling performance within a single SMP
server is referred to as ‘scale up’ Database benchmarks suggest Windows
scalability is near that of Linux IBM claims near-linear scalability for Linux
(on commodity hardware) up to about 4 processors Probably not cost effective to scale up Linux
much beyond 4 processors IBM claims near-linear scalability for AIX
on POWER5 up to about 8 processors
April 9, 2023 DW Architecture Best Practices 55
cohesion institute
Scale Out
There is an increasing trend in IT to ‘scale out’ processing capacity by deploying many small, commodity servers rather than a single large SMP system
This strategy tends to work well for relatively simple applications such as network or web servers
For very complex workloads such as a data warehouse, this strategy is much more difficult to effectively implement
Especially so for the database server itself
April 9, 2023 DW Architecture Best Practices 56
cohesion institute
Scale Up vs. Scale Out
To obtain the total number of processors required for the estimated DW workload, must plan either to scale up or scale out
Both options are viable but, all other things being equal, scaling up is less disruptive to end users and requires less work to implement scaling up can offer lower hardware
investment, if practical however, network bandwidth or latency
issues can limit effectiveness of parallelism
April 9, 2023 DW Architecture Best Practices 57
cohesion institute
Best Practice #6
Create a capacity plan for your BI application & monitor it carefully
Consider future additional performance demands Establish standard performance benchmark
queries and regularly run them Implement capacity monitoring tools Build scalability into your architecture May need to allow for scaling both up and
out!
April 9, 2023 DW Architecture Best Practices 58
cohesion institute
Open Source Affordability
Another emerging trend in IT generally is to utilize Open Source software running on commodity hardware
this is expected to offer lower total cost of ownership certainly, GNU/Linux and other Open Source initiatives
do provide very good functionality and quality for minimal cost
This trend also applies to BI & DW: most traditional rdbms’s are now supported on Linux however, open source rdbms’s lag behind on
providing good performance for DW queries
April 9, 2023 DW Architecture Best Practices 59
cohesion institute
DW Appliances
DW appliances, consisting of packaged solutions providing all required software and hardware, are beginning to offer very promising price/performance
production experience is limited so far, so this is not yet a ‘best practice’
cohesion institute
Q & A
Data Warehouse Data Warehouse Architecture Best PracticesArchitecture Best Practices