Transcript
Page 1: DW 101

1

Data Warehousing

Denise Jeffries

[email protected] [email protected]

205.747.3301

Page 2: DW 101

2

Databases vs Data Warehousing

Often mistaken for each other – vastly different

A database supports data storage & retrieval for an application or specific purpose

Don't bog it down with informational reporting; it is operational in nature (would your app still work after your next acquisition, when you have grown your customer base, have more users, and produce even more reports?)

A data warehouse is used for informational purposes

To facilitate business reporting and analysis

It is not operational

Page 3: DW 101

3

Definition of a Data Warehouse

Data warehouse — a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management's decisions

…data warehouses are granular. They contain the bedrock data that forms the single source for all Decision Support System/Executive Information System processing. With a data warehouse there is reconcilability of information when there are differences of opinion. The atomic data found in the warehouse can be shaped in many ways, satisfying both known requirements and standing ready to satisfy unknown requirements.

http://www.itquestionbank.com/types-of-data-warehouse.html

Page 4: DW 101

4

DW Project Components

Business Requirements

Physical (hw/sw) environment setup

Data Modeling

ETL

OLAP or ROLAP cube design

Report Development

Query Optimization

Data Quality Assurance

Promote to Production

Maintenance

Enhancement

Page 5: DW 101

5

Key milestones in the early years of data warehousing:

1960 – General Mills & Dartmouth College research project coins the terms DIMENSIONS & FACTS

1967 – Edward Yourdon, "Real-Time Systems Design"

1970 – ACNielsen and IRI provide dimensional data marts for retail sales

1979 – Tom DeMarco, "Structured Analysis and Design"

1988 – Barry Devlin and Paul Murphy publish "An architecture for a business and information system" in the IBM Systems Journal, coining the term "business data warehouse"

1991 – Bill Inmon publishes the book "Building the Data Warehouse"

1995 – The Data Warehousing Institute (TDWI) is founded (for profit)

1996 – Ralph Kimball publishes "The Data Warehouse Toolkit"

2000 – Wayne Eckerson, "Data Quality and the Bottom Line" report from TDWI

2004 – IBM states their main competitors are Oracle and Teradata

Page 6: DW 101

6

The beginnings

Commercial viability occurred with a drop in disk storage prices.

Then came the BI vendors

The ETL vendors

The Data modelers

And the database vendors fought….

Page 7: DW 101

7

History of Data Warehousing

Data warehouses became a distinct type of computer database during the late 1980s and early 1990s. They were developed to meet a growing demand for management information and analysis that could not be met by operational systems:

the extra processing load of reporting reduced the response time of the operational systems

the development of reports in operational systems requires writing specific SQL queries, which put a heavy load on the system

Separate computer databases began to be built that were specifically designed to support management information and analysis purposes.

Data warehouses were able to bring in data from a range of different data sources: mainframe computers, minicomputers, personal computers, and office automation software such as spreadsheets.

Data warehouses integrate this information in a single place.

User-friendly reporting tools and freedom from operational impacts have led to the growth of data warehousing systems.

http://www.dedupe.com/history.php

Page 8: DW 101

8

History of Data Warehousing

As technology improved (lower cost for more performance) and user requirements increased (faster data load cycle times and more features), data warehouses have evolved through several fundamental stages:

Offline Operational Databases - Data warehouses in this initial stage are developed by simply copying the database of an operational system to an off-line server where the processing load of reporting does not impact on the operational system's performance.

Offline Data Warehouse - Data warehouses in this stage of evolution are updated on a regular time cycle (usually daily, weekly or monthly) from the operational systems and the data is stored in an integrated reporting-oriented data structure.

The real next generation warehousing – not really being done:

Real Time Data Warehouse - Data warehouses at this stage are updated on a transaction or event basis, every time an operational system performs a transaction (e.g. an order or a delivery or a booking etc.)

Integrated Data Warehouse - Data warehouses at this stage are used to generate activity or transactions that are passed back into the operational systems for use in the daily activity of the organization.

http://www.dedupe.com/history.php

Page 9: DW 101

9

Data Warehouse Architecture

The term data warehouse architecture describes the overall structure of the system.

Historical terms include decision support systems (DSS) and management information systems (MIS)

Newer terms include business intelligence competency center (BICC)

The data warehouse architecture describes the overall system components: infrastructure, data and processes.

The infrastructure (technology stack) perspective determines the hardware and software products needed to implement the components of the system.

The data perspective typically diagrams the source and target data structures and aids the user in understanding what data assets are available and how they are related.

The process perspective is primarily concerned with communicating the process and flow of data from the originating source system through the process of loading the data warehouse, and often the process that client products use to access and extract data from the warehouse.

Architecture facilitates the structure, function and interrelationships of each component.

Page 10: DW 101

10

Advantages to DW

Enables end-user access to a wide variety of data

Increased data consistency

Additional documentation of the data (published data models, data dictionaries)

Lower overall computing costs and increased productivity

An area to combine related data from separate sources

A flexible, easy-to-change computing infrastructure to support data changes in application systems and business structures/hierarchies

Empowering end-users to perform ad-hoc queries and reports without impacting the performance of the operational systems

An enabler of commercial business applications, most notably customer relationship management (CRM) i.e. through feed-back loops.

Page 11: DW 101

11

Data Integration

Data integration is the practice of combining data from diverse sources to give the user a unified view of that data.

This important problem emerges in a variety of situations both commercial (when two similar companies need to merge their databases) and scientific (combining research results from different bioinformatics repositories).

Data integration appears with increasing frequency as the volume of data, and the need to share it, explodes.

It has been the focus of extensive theoretical work and numerous open problems remain to be solved.

In practice, data integration is frequently called Enterprise Information Integration.

Page 12: DW 101

12

Data Warehousing Toolsets

Data modeling

Diagrams: ERD, etc.

Data Dictionary

ETL Tools

Database of Choice: Oracle, SQL Server, DB2, Teradata, Netezza, …

SQL and its tools

Data Validation

Bug trackers/issue trackers

Testing

Page 13: DW 101

13

Types of Data Warehouses

Not data marts

Operational Data Store (ODS)

Data warehouse (enterprise data warehouse – EDW)

Exploration data warehouse

Decision Support System (aka Management Information System – MIS)

Page 14: DW 101

14

Brief Description of Terms

Operational Systems are the internal and external core systems that support the day-to-day business operations. They are accessed through application program interfaces (APIs) and are the source of data for the data warehouse and operational data store. (Encompasses all operational systems including ERP, relational and legacy.)

Data Acquisition is the set of processes that capture, integrate, transform, cleanse, reengineer and load source data into the data warehouse and operational data store. Data reengineering is the process of investigating, standardizing and providing clean consolidated data.

The Data Warehouse is a subject-oriented, integrated, time-variant, non-volatile collection of data used to support the strategic decision-making process for the enterprise. It is the central point of data integration for business intelligence and is the source of data for the data marts, delivering a common view of enterprise data.

Primary Storage Management consists of the processes that manage data within and across the data warehouse and operational data store. It includes processes for backup and recovery, partitioning, summarization, aggregation, and archival and retrieval of data to and from alternative storage.

Alternative Storage is the set of devices used to cost-effectively store data warehouse and exploration warehouse data that is needed but not frequently accessed. These devices are less expensive than disks and still provide adequate performance when the data is needed.

Data Delivery is the set of processes that enable end users and their supporting IS group to build and manage views of the data warehouse within their data marts. It involves a three-step process consisting of filtering, formatting and delivering data from the data warehouse to the data marts.

The Data Mart is customized and/or summarized data derived from the data warehouse and tailored to support the specific analytical requirements of a business unit or function. It utilizes a common enterprise view of strategic data and provides business units more flexibility, control and responsibility. The data mart may or may not be on the same server or location as the data warehouse.

Page 15: DW 101

15

Description of terms, cont'd

The Operational Data Store (ODS) is a subject-oriented, integrated, current, volatile collection of data used to support the tactical decision-making process for the enterprise. It is the central point of data integration for business management, delivering a common view of enterprise data.

Meta Data Management is the process for managing information needed to promote data legibility, use and administration. Contents are described in terms of data about data, activity and knowledge.

The Exploration Warehouse is a DSS architectural structure whose purpose is to provide a safe haven for exploratory and ad hoc processing. An exploration warehouse utilizes data compression to provide fast response times with the ability to access the entire database.

The Data Mining Warehouse is an environment created so analysts may test their hypotheses, assertions and assumptions developed in the exploration warehouse. Specialized data mining tools containing intelligent agents are used to perform these tasks.

Activities are the events captured by the enterprise legacy and/or ERP systems as well as external transactions such as Internet interactions.

Statistical Applications are set up to perform complex, difficult statistical analyses such as exception, means, average and pattern analyses. The data warehouse is the source of data for these analyses. These applications analyze massive amounts of detailed data and require a reasonably performing environment.

Analytic Applications are pre-designed, ready-to-install decision support applications. They generally require some customization to fit the specific requirements of the enterprise. The source of data is the data warehouse. Examples of these applications are risk analysis, database marketing (CRM) analyses, vertical industry "data marts in a box," etc.

External Data is any data outside the normal data collected through an enterprise's internal applications. There can be any number of sources of external data such as demographic, credit, competitor and financial information. Generally, external data is purchased by the enterprise from a vendor of such information.

Page 16: DW 101

16

Bill Inmon

Recognized as the founder of the data warehouse (wrote the first book, offered the first conference with Arnie Barnett, wrote the first column in a magazine {IBM Journal}, offered the first classes)

Created the accepted definition of what a DW is (a subject-oriented, nonvolatile, integrated, time-variant collection of data in support of management's decisions)

Approach is top-down

1991 – founded Prism Solutions and took it public; 1995 – founded Pine Cone Systems, later renamed Ambeo; 1999 – created the Corporate Information Factory website to educate professionals

Page 17: DW 101

17

http://www.inmoncif.com/library/cif/

Page 18: DW 101

18

Ralph Kimball

One of the original architects of data warehousing.

A DW must be understandable and FAST

Developed Dimensional Modeling (the Kimball method), the standard in decision support

Bottom-up approach

1986 – founded Red Brick Systems (used indexes for performance gains), acquired by Informix in 1998 and now owned by IBM

Co-inventor of the Xerox Star workstation (the first commercial product to use mice, icons and windows)

Page 19: DW 101

19

Data Models

Provide definition and format of data

Represent information areas of interest, or Subject Areas

Modeling methodologies:

Bottom-up model design: start with existing structures

Top-down model design: created fresh (by SMEs) as a reference point/template

Page 20: DW 101

20

Data Normalization – what is it?

Normalization is a relational database modeling process where the relations or tables are progressively decomposed into smaller relations to a point where all attributes in a relation are very tightly coupled with the primary key of the relation. Most data modelers try to achieve the "Third Normal Form" with all of the relations before they de-normalize for performance, ease of query or other reasons.

First Normal Form: A relation is said to be in First Normal Form if it describes a single entity and it contains no arrays or repeating attributes. For example, an order table or relation with multiple line items would not be in First Normal Form because it would have repeating sets of attributes for each line item. The relational theory would call for separate tables for order and line items.

Second Normal Form: A relation is said to be in Second Normal Form if in addition to the First Normal Form properties, all attributes are fully dependent on the primary key for the relation.

Third Normal Form: A relation is in Third Normal Form if, in addition to Second Normal Form, all non-key attributes are completely independent of each other.

http://www.sserve.com/ftp/dwintro.doc
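A minimal sketch of that First Normal Form example in SQL (hypothetical table and column names): the repeating line-item attributes move into their own relation, keyed back to the order.

CREATE TABLE ORDERS (
    ORDER_ID    INT PRIMARY KEY,   -- one row per order
    ORDER_DT    DATE,
    CUSTOMER_ID INT
);

CREATE TABLE ORDER_LINE_ITEM (
    ORDER_ID   INT,                -- foreign key back to ORDERS
    LINE_NUM   INT,                -- each line item is its own row, not a repeating column set
    PRODUCT_ID INT,
    QUANTITY   INT,
    PRIMARY KEY (ORDER_ID, LINE_NUM)
);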

Page 21: DW 101

21

Entity Relationship Diagrams – example, 3rd normal form

[ERD figure: a loan portfolio model in third normal form. The central LOAN and LOAN_FACT tables carry the loan attributes and as-of-date measures; surrounding lookup and subject tables include ARM_INDEX, DELQ_STAGE, RATE_CHANGE_PERIOD, MATURITY_PERIOD, COLLATERAL, DOC_TYPE, LOAN_CONVERSION, LOAN_DESIGNATION, MODIFICATION_LOANS, PIF_LOANS, PRODUCT, PURPOSE, RATE_CODE, REO_LOANS, RISK_RATINGS, SHORT_SALE_LOANS, STATE, SUPER_REPLINE and SUPER_REPLINE_FACT, each joined by primary/foreign keys.]

Page 22: DW 101

22

Star Schema (facts and dimensions)

The facts that the data warehouse helps analyze are classified along different dimensions:

The FACT table houses the main data and includes a large amount of aggregated data (e.g., price, units sold)

DIMENSION tables off the FACT include attributes that describe the FACT

Star schemas provide simplicity for users

Page 23: DW 101

23

Star Schema example (Sales db)
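Since the diagram itself is not reproduced in this transcript, here is a minimal DDL sketch of what such a sales star schema might look like (hypothetical names, chosen to match the query on the next slide):

CREATE TABLE Dim_Product (
    Product_PK       INT PRIMARY KEY,
    Brand            VARCHAR(50),    -- descriptive attributes of the fact
    Product_Category VARCHAR(50)
);

-- Dim_Date and Dim_Store would follow the same pattern.

CREATE TABLE Fact_Sales (
    Date_FK    INT,                  -- foreign keys out to each dimension
    Store_FK   INT,
    Product_FK INT,
    Units_Sold INT,                  -- the measures being analyzed
    Price      DECIMAL(10,2)
);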

Page 24: DW 101

24

SQL to select from Star Schema

SELECT Brand, Country, SUM([Units Sold])
FROM Fact.Sales
JOIN Dim.Date ON Date_FK = Date_PK
JOIN Dim.Store ON Store_FK = Store_PK
JOIN Dim.Product ON Product_FK = Product_PK
WHERE [Year] = 2010
  AND [Product Category] = 'TV'
GROUP BY Brand, Country

Page 25: DW 101

25

SnowFlake Schema

A central FACT connected to multiple DIMENSIONS, which are NORMALIZED into related tables

Snowflaking affects DIMs and never the FACT

Used in data warehouses and data marts when storage efficiency is more important than speed/ease of data selection

Needed for many BI OLAP tools

Stores less data

Page 26: DW 101

26

Snowflake Schema example (Sales db)

Page 27: DW 101

27

SQL to select from SnowFlake

SELECT B.Brand, G.Country, SUM(F.Units_Sold)
FROM Fact_Sales F (NOLOCK)
INNER JOIN Dim_Date D (NOLOCK) ON F.Date_Id = D.Id
INNER JOIN Dim_Store S (NOLOCK) ON F.Store_Id = S.Id
INNER JOIN Dim_Geography G (NOLOCK) ON S.Geography_Id = G.Id
INNER JOIN Dim_Product P (NOLOCK) ON F.Product_Id = P.Id
INNER JOIN Dim_Product_Category C (NOLOCK) ON P.Product_Category_Id = C.Id
INNER JOIN Dim_Brand B (NOLOCK) ON P.Brand_Id = B.Id
WHERE D.Year = 2010 AND C.Product_Category = 'TV'
GROUP BY B.Brand, G.Country

Page 28: DW 101

28

Comparison of SQL Star vs SnowFlake

Star schema:

SELECT Brand, Country, SUM([Units Sold])
FROM Fact.Sales
JOIN Dim.Date ON Date_FK = Date_PK
JOIN Dim.Store ON Store_FK = Store_PK
JOIN Dim.Product ON Product_FK = Product_PK
WHERE [Year] = 2010
  AND [Product Category] = 'TV'
GROUP BY Brand, Country

Snowflake schema:

SELECT B.Brand, G.Country, SUM(F.Units_Sold)
FROM Fact_Sales F (NOLOCK)
INNER JOIN Dim_Date D (NOLOCK) ON F.Date_Id = D.Id
INNER JOIN Dim_Store S (NOLOCK) ON F.Store_Id = S.Id
INNER JOIN Dim_Geography G (NOLOCK) ON S.Geography_Id = G.Id
INNER JOIN Dim_Product P (NOLOCK) ON F.Product_Id = P.Id
INNER JOIN Dim_Product_Category C (NOLOCK) ON P.Product_Category_Id = C.Id
INNER JOIN Dim_Brand B (NOLOCK) ON P.Brand_Id = B.Id
WHERE D.Year = 2010 AND C.Product_Category = 'TV'
GROUP BY B.Brand, G.Country

Page 29: DW 101

29

Basic EDW Data Model Design

Party

Account

Product & Service

Event

Each represents a subject area in the model, with third normal form tables to accommodate the data and its relationships with hierarchy

Page 30: DW 101

30

Account, Customer & Address Relationships

[Diagram: Account, Party and Address entities joined through Account Contact, Account Party link and Party Address link associative tables.]

Account Information loaded from ALL Source Systems

ETL process builds the relationship between Accounts and Customers (Party) based on the relationship file from the CUSTOMER CRM SYSTEM

Page 31: DW 101

31

Architecture for an EDW or other large Data Warehouse

How do you get from where you are to implementing an actual system?

Start with defining your requirements

Then modeling

Budget $$$

Hire staff

Engage partners

DO IT YOURSELF, DO NOT RELY ON THE EXPERTS – staff-augment and hire the talent internally

Page 32: DW 101

32

BREAK --- Modeling exercise

1. ACCOUNT TEAM
2. CUSTOMER/PARTY TEAM
3. PRODUCT TEAM
4. EVENT (TRANSACTION) TEAM
5. CRM TEAM (WHO WANT INFORMATION ABOUT CUSTOMERS TO BE ABLE TO MARKET TO THEM AND TO PRODUCE NEW PRODUCTS CUSTOMERS WANT)
6. EXECUTIVE TEAM (WHO WANT INFORMATION ABOUT HOW THE BUSINESS IS DOING – WHAT IS SELLING, WHAT IS NOT, WHAT IS PROFITABLE…)

DIVIDE INTO THE ABOVE 6 TEAMS

ASK EACH TEAM TO BRAINSTORM SEPARATELY WHAT INFORMATION THEY NEED

ASK 1 LEADER FROM EACH TEAM TO PRESENT THEIR DATA NEEDS: IN A DIAGRAM FOR TEAMS 1–4 AND A LIST/SPREADSHEET FOR TEAMS 5 AND 6

DISCUSS HOW THE NEEDS OF THE CRM AND EXECUTIVE TEAMS CAN BE MET FROM THE DIAGRAMMED DATA

Page 33: DW 101

33

Some Interesting Info:

http://www.itquestionbank.com/an-introduction-to-data-warehousing.html

http://www.ralphkimball.com/

http://www.inmoncif.com/home/

Page 34: DW 101

34

SECTION 2

Data Warehouse Architecture

Page 35: DW 101

35

Where we are at / Next steps

Have identified data needs

Designed a model to fit those needs

Now we need to identify how we will set up the architecture:

Physical hardware

Software

People – who do we need on the project team

Processes

Page 36: DW 101

36

State of many mature companies

Page 37: DW 101

37

[Diagram: EDW process state. Source systems (CPS, MANTAS, CRDB, MKTG, FIN, SALES, among others) feed a cleanse/pre-process step (data cleansing, data profiling, sync & sort) and a staging area that loads the EDW, with data marts (DM) and BI downstream and Metadata | Data Governance | Data Management spanning all layers.]

Page 38: DW 101

38

Information Factory Concept

Page 39: DW 101

39

Moore’s Law (yes, it applies here too)

Sharply increasing power of computer hardware

With the increase in power comes a decrease in price (the capacity of a microprocessor will double every 18 months); this also holds true for other computer components

Desktop power is increasing, as are server power requirements (where GO GREEN comes from)

Page 40: DW 101

40

Explosion in innovation

BI software can now be deployed on an intranet vs hard-to-maintain thick client apps

Thick clients are still used by developers

Web server, application server, database server: allows offloading of processing to the correct tier

More power for everyone

Page 41: DW 101

41

Change in Business

The global economy changed the needs of organizations worldwide

Global markets and mergers and acquisitions all increase data needs

More tech-savvy end users (who demand more data, more tools…)

More information-demanding executives – facilitates sponsorship of the DW

Page 42: DW 101

42

DW Evolving

Care should be taken, e.g. with vendor claims

Size is not a factor

Operational vs informational: operational is pre-defined, informational is more ad hoc in nature

Performance

Volatile vs non-volatile data: a DW saves data for longer periods than transactional/operational systems (trending analysis, where I was vs where I am…)

Real-time DW vs point in time

Page 43: DW 101

43

DW needs to be extendable, align with business structure

[Figure 4. Extensible data warehouse: subject areas such as Orders and Product feed the Enterprise Data Warehouse, with slots reserved for future sources.]

• Set up a framework for the Enterprise data warehouse

• Start with a few of the most valuable source applications

• Add additional applications as the business case can be made

http://www.sserve.com/ftp/dwintro.doc

Page 44: DW 101

44

Data Marts and OLAP

[Diagram: Enterprise Data Solution. Source systems feed the Enterprise Data Warehouse and a Master/Reference Data Store (maintained through a Master Data Management application); downstream data marts and OLAP serve Reporting, Data Mining, OLAP Analysis, Dashboards and Scorecards.]

Page 45: DW 101

45

EDW - Objective

Source Files → CDC/SyncSort Process → Staging Source Data (Delta) → ETL Load 1 → EDW Target → ETL Load 2 → Data marts

Follow the process methodology to achieve these architectural aspects: Meta Data, Security, Scalability, Reliability and Supportability

Page 46: DW 101

46

EDW – Data Model Design

Party

Account

Product & Service

Event

Each represents a subject area we have in the model, with third normal form tables to accommodate the data and its relationships with hierarchy

Page 47: DW 101

47

Account, Customer & Address Relationships

[Diagram: Account, Party and Address entities joined through Account Contact, Account Party link and Party Address link associative tables.]

Account Information loaded from ALL Source Systems

Customer Information loaded

The EDW ETL process builds the relationship between Accounts and Customers based on the relationship file from RM

Page 48: DW 101

48

Single definition of a data element

The DW brings in the data from multiple sources and conforms it so that it can be viewed together

Multiple systems have individual customers/addresses, but the warehouse gives a single view of the customer and all the systems they are in

Helping move from product-centric systems to customer-centric systems

Page 49: DW 101

49

Business view of data

A DW is only successful if it provides the view the business needs of its data

"A data warehouse is a structured extensible environment designed for the analysis of non-volatile data, logically and physically transformed from multiple source applications to align with business structure, updated and maintained for a long time period, expressed in simple business terms, and summarized for quick analysis."

– Vivek R. Gupta, Senior Consultant, [email protected], System Services Corporation, Chicago, Illinois, http://www.system-services.com

Page 50: DW 101

50

Example of conforming data for business view:

Figure 8. Physical transformation of application data: detailed and summarized data from Operational Systems A and B are transformed into the Data Warehouse.

• Uniform business terms

• Single physical definition of an attribute

• Consistent use of entity attributes

• Default and missing values

Example transformations:

cust, cust_id, borrower >> customer ID

"1" >> "M"

"2" >> "F"

Missing >> "……"

http://www.sserve.com/ftp/dwintro.doc
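A minimal sketch of such a conforming transformation in SQL (hypothetical source table and column names):

SELECT
    cust_id AS customer_id,    -- uniform business term across sources
    CASE gender_cd
        WHEN '1' THEN 'M'      -- conform coded values to a single standard
        WHEN '2' THEN 'F'
        ELSE 'U'               -- assumed default for missing values
    END AS gender_cd
FROM operational_system_a;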

Page 51: DW 101

51

Business use of DW

The business should use a data mart created off the data warehouse

Business users want to use their existing tools/methods (replicated queries, Excel, extracts to Access) against the DW and validate the data between the existing systems and the DW

Over time the LoB gains confidence in the DW, then begins to explore new possibilities of data use and tool use

Page 52: DW 101

52

EDW – Process Flow

[Diagram: EDW process flow. Mainframe source systems (M/F) deliver source files to a landing zone (source data layer); the CDC/SyncSort process on the SyncSort server extracts the delta; Informatica ETL loads source-to-stage into the staging schema and stage-to-EDW on the Oracle DB server (EDW); a further ETL (EDW-DM) loads the data mart (FDM) on the Oracle DB server (DM).]

Page 53: DW 101

53

EDW ETL Design

Source to Stage Mapping (For AFS)

Stage to EDW Mapping (for AFS)

EDW to FDM Mapping (for FACT)

Page 54: DW 101

54

ETL Tools are prolific

Ab Initio
DMExpress 6.5 – Syncsort
Oracle Warehouse Builder (OWB) 11gR1 – Oracle
Data Integrator & Data Services XI 3.0 – SAP Business Objects
IBM Information Server (DataStage) 8.1 – IBM
SAS Data Integration Studio 4.2 – SAS Institute
PowerCenter 9.0 – Informatica
Elixir Repertoire 7.2.2 – Elixir
Data Migrator 7.6 – Information Builders
SQL Server Integration Services 10 – Microsoft
Talend Open Studio & Integration Suite 4.0 – Talend
DataFlow Manager 6.5 – Pitney Bowes Business Insight
Data Integrator 9.2 – Pervasive
Open Text Integration Center 7.1 – Open Text
Transformation Manager 5.2.2 – ETL Solutions Ltd.
Data Manager/Decision Stream 8.2 – IBM (Cognos)
CloverETL 2.9.2 – Javlin
ETL4ALL 4.2 – IKAN
DB2 Warehouse Edition 9.1 – IBM
Pentaho Data Integration 3.0 – Pentaho
Adeptia Integration Suite 5.1 – Adeptia
Expressor
SeeBeyond ETL Integrator – Sun

Page 55: DW 101

55

Commonly used toolsets:

Commercial ETL tools:

IBM InfoSphere DataStage
Informatica PowerCenter
Oracle Warehouse Builder (OWB)
Oracle Data Integrator (ODI)
SAS ETL Studio
Business Objects Data Integrator (BODI)
Microsoft SQL Server Integration Services (SSIS)
Ab Initio

Freeware, open source ETL tools:

Pentaho Data Integration (Kettle)
Talend Integrator Suite
CloverETL
Jasper ETL

Page 56: DW 101

56

ETL Extract, Transform, Load

Created to improve and facilitate data warehousing

EXTRACT: data brought in from external sources

TRANSFORM: data fit to standards

LOAD: converted data loaded into the target DW

Steps: initiate, build reference data, extract from sources, validate, transform, load into staging tables, audit reports, publish, archive/cleanup (a validation sketch follows below)
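As one illustration of the validate step (a sketch only, with hypothetical staging, reference and audit table names), rows whose codes fail the reference-data lookup can be logged before the load:

INSERT INTO etl_audit (audit_dt, loan_id, error_desc)
SELECT CURRENT_TIMESTAMP, s.loan_id, 'Unknown product code'
FROM stg_loan s
LEFT JOIN ref_product r ON s.product_cd = r.product_cd
WHERE r.product_cd IS NULL;    -- only rows that fail the reference lookup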

Page 57: DW 101

57

Reconciliation Overview (EDW-data mart)

[Diagram: reconciliation points across the EDW and data mart load processes. Source files flow through CDC into staging source data (delta), then via ETL into the EDW and on into the FDM/FDBR marts; balance files and TM1 (GL) views support the validations.]

Checkpoint 1: validate measures between balance files and source data files

Checkpoint 2: validate measures between stage tables and source data files

Checkpoint 3: validate measures between balance files and EDW tables

Checkpoint 4: validate measures between EDW tables and mart tables

Also validate measures between EDW tables and the General Ledger (TM1 views)
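Checkpoint 4, for example, can reduce to comparing a measure summed in both layers. A minimal sketch, reusing LOAN_FACT/CURR_UPB from the earlier ERD and a hypothetical mart table name:

SELECT e.as_of_dt,
       e.edw_upb,
       m.mart_upb,
       e.edw_upb - m.mart_upb AS variance       -- nonzero variances need investigation
FROM (SELECT as_of_dt, SUM(curr_upb) AS edw_upb
      FROM loan_fact GROUP BY as_of_dt) e
JOIN (SELECT as_of_dt, SUM(curr_upb) AS mart_upb
      FROM fdm_loan_fact GROUP BY as_of_dt) m
  ON e.as_of_dt = m.as_of_dt
WHERE e.edw_upb <> m.mart_upb;                  -- report only the mismatches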

Page 58: DW 101

58

EDW Data Flow

[Diagram: EDW data flow. Source files from the mainframe (MVSPROD) and other sources arrive via ETL/FTP on a shared mount point; SyncSort (edwss1) runs CDC from the file landing zone (/input) to CDC output (/output), with archival to /ssiarc; Informatica (edwetl1) runs the ETL into Oracle (edwora3/4 – EDW and INF schemas; edwora1/2 – FDM), and output/reports land in /edw_output. ASM deployed.]

File Transfer
· All source data files should originate from the server/host
· File transfer should be set up as a job via CA-Scheduler

File Maintenance
· Both input and output mount points are shared between the SyncSort and Informatica servers
· The files should land on the input mount point, with a job to archive them (if needed)

Security
· FTP login accounts are/will be enabled only from certain hosts/servers
· Separate folders under Input/Output/Archive need to be maintained for isolation

Capacity
· Need a capacity estimate for any new feeds into the Landing Zone

DB Schema
· Various schemas represent different logical groups of data/process in the EDW
· EDW_OWNER – all EDW objects
· EDW_STG – staging/temp objects for ETL loads
· ETL_CNTL – ETL process control objects
· FDM_OWNER – data mart related objects
· INF – Informatica repository
· MM_OWNER – ModelMart repository

Page 59: DW 101

59

EDW – Security Scheme

Database users/schemas:

EDW_OWNER – all EDW tables/objects

EDW_STG – all staging/temp tables and objects, for ETL and batch jobs

EDW_JOBS – ETL login account with read/write access to both staging and EDW tables

EDW_USER – end-user / reporting login account with read-only access to all EDW tables

Unix users:

App Admin – application-server admin accounts (info – Informatica, oracle – Oracle DB, ssort – SyncSort)

App Execution – logins to execute the jobs, shell and other (info_work/etl_jobs – Informatica, ss_jobs – SyncSort)

Developer – individual developer access

App Support – production support login for the support group; has read-only access to all applications and logs and can only change the parameters (info_sup – Informatica support, sort_sup – SyncSort support)

Page 60: DW 101

60

EDW – Infrastructure: Development Environment / Production Environment

[Network diagram: production has Oracle database servers ORA1–ORA4 on the EDW-PRD DB VLAN plus ETL (Informatica) and SyncSort servers on the EDW-PRD APP VLAN; development mirrors this with ORA1–ORA2, ETL and SyncSort/maintenance hosts on the EDW-DEV VLANs, along with VIO and NIM servers and NAS storage. The corporate network side shows the mainframe, backup, AD, mail, OEM, RSK/NetIQ monitoring, and Unix/WS admin and DBA/developer workstations. Firewall rules enumerate the allowed TCP/UDP ports per path, e.g. FTP (20, 21), SSH (22), SMTP (25), NTP (123), LDAP (389), DNS (53, 953), HTTP/HTTPS (80, 443, 8443), the Oracle listener (1521), OEM (1159, 4889, 7200, 3872, 1830), plus NFS, NIM, NetIQ and TSM traffic.]

Page 61: DW 101

61

EDW – Development (SDLC): Initiate → Plan → Design → Build → Test → Implement

Initiate/Plan: initial scope, source data analysis, initial estimate, final scope, work plan

Design: architecture & design, source/target mapping, report/data requirements, data modeling, test/integration/deployment plan, transition and support plan

Build: finalize source/target mapping, build ETL and other processes, unit test cases and results, defect fixes and support

Test: functional test cases and results, integration test cases and results, data validation, load/performance testing, auto processing (CA-Scheduler), UAT and sign-off

Implement: data creation/setup, production migration, user training, support documentation and training

Page 62: DW 101

62

EDW Development Project Cycle (New Source to EDW)

Phases: Initiation (requirement specs) → System Design → Build → Test → Implement

Major tasks and deliverables: initial scope, source data analysis and initial estimate; final scope, work plan, architecture & design, source/target mapping, report/data requirements, data modeling, test/integration/deployment plan, transition and support plan; finalized source/target mapping, ETL and other processes built, unit test cases and results, defect fixes and support; functional test cases and results, operations procedure testing and review, data validation, load/performance testing, integration testing (CA-Scheduler), UAT and sign-off; initial data creation/setup, production migration, user training, support documentation and training.

Reviews and approvals along the way: project sponsor approval, management approval, peer reviews, IT approvals, peer/lead review of performance, capacity and guidelines for the project, and IT/business distribution.

Groups involved in the various phases of the project: Business Users, Project Management, IT Planning & Systems, Data Analyst Team, EDW Development Team, Operations & Support.

Page 63: DW 101

63

EDW – Support – Escalation Procedure

Issues arrive via CA-Scheduler mail notifications, user calls, help desk alerts and NetIQ alerts.

Ops performs preliminary checks, then routes the issue to EDW Support:

ETL-related issue? → Informatica Admin, escalating to Informatica Support

Database-related issue? → Oracle Admin, escalating to Oracle Support

Unix Admin, escalating to IBM Support

Storage Admin, escalating to EMC Support

Hardware Admin, escalating to IBM Support

Page 64: DW 101

64

EDW – Support Process

Input: an issue (operational error), with the SLA, process management guidelines, and known DB, application server and OS issues feeding the process.

Issue management process:

Record & classify the issue; open a Remedy ticket

RFC required? If yes → Change Management

Investigation and diagnosis

Resolution found? If yes – RFC required? (if so → Change Management) – then provide the solution and continue the process

If no – provision a temporary solution to continue the process, record the nature of the issue and update the user group (& BA)

Output: update/close the Remedy ticket

Page 65: DW 101

65

EDW - Roadmap

[Roadmap diagram: multiple source systems feed the EDW (Accounts and Customers), later joined by transaction sources; downstream sit a Financial Mart (mart1), Risk Mart (mart2) and Customer Analytics (mart3), plus Master Data Management, all resting on a management architecture of metadata, data security and systems management.]

Page 66: DW 101

66

Architecture Exercise (1 of 2)

Identify needs in the following categories:

Physical hardware: CPU, memory, disk

For disk – how much? Use your model and calculate the size, for the database & tools

Will data be behind a firewall?

Software: database, ETL, BI tools, application and web server

Page 67: DW 101

67

Architecture Exercise (2 of 2)

Break into the former 6 teams

Ask each team to consider what they will need to build a DW: hardware, software, people, processes, support and operations

Allow 30 minutes to brainstorm, then discuss as a class

A volunteer team presents what they came up with

List needs on the board; each successive team adds what they have to the list

Group discussion on what they have uncovered

Page 68: DW 101

68

SECTION 3

What is Data Quality?

I can't tell you what's important, but your users can.

Look for the fields that can identify potential problems with the data.

What is Master Data Management (MDM)?

Page 69: DW 101

69

Data Quality

Data doesn't stay the same (sometimes it does)

Considerations:

What happens to the warehouse when the data changes

When needs change

Page 70: DW 101

70

Roadmap to DQ

Data profiling

Establishing metrics/measures

Design and implement the rules

Deploy the plan

Review errors/exceptions

Monitor the results

Page 71: DW 101

71

Data Profiling

What's in the data

Analyze the columns in the tables:

Provides metadata

Allows for good specifications for programmers

Reduces project risk (as the data is now known)

How many rows, number of distinct values in a column, how many nulls, data type identification

Shows the data pattern (see the sketch below)
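A minimal sketch of such a column profile in SQL (hypothetical table and column):

SELECT
    COUNT(*)                                          AS row_count,
    COUNT(DISTINCT state_cd)                          AS distinct_values,
    SUM(CASE WHEN state_cd IS NULL THEN 1 ELSE 0 END) AS null_count
FROM customer;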

Page 72: DW 101

72

Data Profiling Example

Page 73: DW 101

73

Data Quality is measured as the degree of superiority, or excellence, of the various data that we use to create information products.

“Reason #1 for the failure of CRM projects: Data is ignored. Enterprise must have a detailed understanding of the quality of their data. How to clean it up, how to keep it clean, where to source it, and what 3rd-party data is required. Action item: Have a data quality strategy. Devote ½ of the total timeline of the CRM project to data elements.” - Gartner

Page 74: DW 101

74

Data Quality Tools (Gartner Magic Quadrant)

Page 75: DW 101

75

Dimensions of Quality

Informatica.com

Page 76: DW 101

76

Data Quality Measures

Definition

Accuracy

Completeness

Coverage

Timeliness

Validity

Page 77: DW 101

77

Definition

Conformance: The degree to which data values are consistent with their agreed upon definitions.

A detailed definition must first exist before this can be measured.

Information quality begins with a comprehensive understanding of the data inventory. The information about the data is as important as the data itself.

A Data Dictionary must exist! An organized, authoritative collection of attributes is equivalent to the old "Card Catalog" in a library, or the "Parts and List Description" section of an inventory system. It must contain all the known usage rules and an acceptable list of values. All known caveats and anomalies must be described.

Page 78: DW 101

78

Accuracy

The degree to which a piece of data is correct and believable. The value can be compared to the original source for correctness, but it can still be unbelievable. Conformed values can be compared to lists of reference values.

Zip code 35244 is correct and believable.

Zip code 3524B is incorrect and unbelievable.

Zip code 35290 is incorrect but believable (it looks right, but does not exist).

AL is a correct and believable state code (compared to the list of valid state codes).

A1 is an incorrect and unbelievable state code (compared to the list of valid state codes).

AA is an incorrect but believable state code (compared to the list of valid state codes).
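The reference-list comparison can be automated. A minimal sketch (hypothetical table names):

SELECT c.customer_id, c.state_cd
FROM customer c
LEFT JOIN valid_state_code v ON c.state_cd = v.state_cd
WHERE v.state_cd IS NULL;    -- flags values like 'A1' or 'AA' that are not on the reference list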

Page 79: DW 101

79

Completeness

The degree to which all information expected is received. This is measured in two ways:

Do we have all the records that were sent to us?

Counts from the provider can be compared against counts of data received.

Did the provider send us all the records that they have or just some of them?

This is difficult to measure without auditing and trending the source.

How would we know that the provider had a ‘glitch’ in their system and records were missing from our feed?

Page 80: DW 101

80

Measures of Completeness

The following questions can be answered from counts (see the sketch below):

How many records per batch by provider?

How do this batch's counts compare to the previous month's average?

How do this batch's counts compare to the same time period last year?

How do this batch's counts compare to a 12-month average?
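A minimal sketch of the 12-month-average comparison (hypothetical batch_log table; DATEADD is SQL Server syntax):

SELECT b.provider_id,
       b.record_count AS current_batch_count,
       (SELECT AVG(h.record_count)
        FROM batch_log h
        WHERE h.provider_id = b.provider_id
          AND h.batch_dt >= DATEADD(month, -12, b.batch_dt)
          AND h.batch_dt <  b.batch_dt) AS twelve_month_avg   -- trailing 12 months
FROM batch_log b
WHERE b.batch_dt = '2010-06-01';    -- the batch under review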

Page 81: DW 101

81

Coverage

The degree to which all fields are populated with data. Columns of data can be measured for the % of missing values and compared to the expected % missing.

E.g., Sale Type Code is expected to be populated 100% by all sources for Sales documents.
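A minimal sketch of such a coverage measure (hypothetical names); COUNT(column) skips NULLs, so the ratio is the populated percentage:

SELECT 100.0 * COUNT(sale_type_cd) / COUNT(*) AS pct_populated
FROM sales_document;    -- expected: 100 for Sale Type Code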

Page 82: DW 101

82

Timeliness

The degree to which provider files are received, processed and made available for assembly into data marts. Expected receipt times are compared to actual receipt times.

Late or missing files are flagged and reported on.

Proactive alerts trigger communication with the provider contact.

Proactive communication can alert the assembly processes.

Excessive lag times can be reported to providers in order to request earlier delivery.

Page 83: DW 101

83

Validity

The degree to which the relationships between different data are valid.

Zip code 48108 is accurate. State code AL is accurate. But zip code 48108 is invalid for the state of AL.
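A minimal sketch of a cross-field validity check (hypothetical reference table):

SELECT a.customer_id, a.zip_code, a.state_cd
FROM address a
LEFT JOIN zip_state_ref r
  ON a.zip_code = r.zip_code
 AND a.state_cd = r.state_cd
WHERE r.zip_code IS NULL;    -- zip/state combinations that do not appear together in the reference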

Page 84: DW 101

84

Data Quality Measures

How do you know if your data is of high quality?

Agree upon the measures that are important to the organization and consistently report them out.

Use the data measures to communicate and inform.

Page 85: DW 101

85

Measurement

Informatica.com

Page 86: DW 101

86

Exercise: Changing the Data Warehouse (1 of 2)

So, you need to add a new source

Or, you need to receive additional data from an existing source

Could be that data quality is an issue

Could be that the business rules weren't defined adequately

Page 87: DW 101

87

Brainstorming Group Exercise (2 of 2)

The data changed due to DQ measures – what do we have to do in the DW?

What has to change

Estimate the change

Implement the change

How do we make sure it doesn't happen again? What DQ measure can help?

happen again? What DQ measure can help?

Page 88: DW 101

88

MDM – Master Data Management

The newest 'buzzword'

Page 89: DW 101

89

Exercise:

What processes need to be put in place for MDM?

Who needs to be involved?

Who owns it?

Page 90: DW 101

90

SECTION 4

BI Tools

BICC

Jobs

Certifications

Page 91: DW 101

91

SECTION 4

What is business intelligence

What are BI tools

What is a business intelligence competency center (BICC)

What jobs are available

Certifications

Page 92: DW 101

92

BI Tools

Page 93: DW 101

93

BICC

Page 94: DW 101

94

Jobs in Data Warehousing

Page 95: DW 101

95

Certifications in DW

Page 96: DW 101

96

References

Data Management and Integration Topic, Gartner. http://www.gartner.com/it/products/research/asset_137953_2395.jsp
Articles: "Key Issues for Implementing an Enterprise-wide Data Quality Improvement Project," 2008; "Key Issues for Enterprise Information Management Initiatives," 2008; "Key Issues for Establishing Information Governance Policies, Processes and Organization," 2008

"Data Quality Management: The Most Critical Initiative You Can Implement," J. G. Geiger. http://www2.sas.com/proceedings/sugi29/098-29.pdf

Information Management, "How to Measure and Monitor the Quality of Master Data." http://www.information-management.com/issues/2007_58/master_data_management_mdm_quality-10015358-1.html

Data Management Assn of Michigan Bits & Bytes, "Critical Data Quality Controls," D. Jeffries, Fall 2006. http://dama-michigan.org/2%20Newsletter.pdf