Introduction to Data Warehousing
Pasquale LOPS
Gestione della Conoscenza d’Impresa
A.A. 2003-2004
Introduction
Data warehousing and decision support have given rise to a new class of databases.
Design strategies for OLAP databases differ significantly from OLTP systems.
Today’s decision support systems must deliver multidimensional analysis capabilities.
Terminology – What is a Data Warehouse?
A database
- Typically read-only
- Data stored in relational or multidimensional format
- Multidimensional database often populated from a relational database
Populated from existing source systems
- Secondary source of data
- Populated from existing internal or external data sources
- It is possible to build a DSS on top of operational systems
Used for reporting purposes
- Not transaction-based
- Used primarily for reporting purposes
- Must be designed for analysis purposes
- OLAP, not OLTP
Terminology – Decision Support and Multidimensional Analysis
Decision Support Systems (DSS)
- Facilitate business analysis
- Support business decision makers by providing various types of analysis: trend, comparison and ad hoc reporting
Multidimensional Analysis
- Allows analysis by business dimension
- Equates to ‘Flexible Reporting’
- Allows for drill down, drill up and iterative data analysis
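As a rough illustration of drill down and drill up, the sketch below totals a few made-up sales rows at two levels of detail. The rows, the dimension names (region, store, month) and the summarize helper are assumptions for illustration only; they are not part of the original material.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, store, month, amount).
SALES = [
    ("East", "Boston", "1996-01", 120.0),
    ("East", "Boston", "1996-02", 80.0),
    ("East", "Philly", "1996-01", 50.0),
    ("West", "San Fran", "1996-01", 200.0),
]
DIMENSIONS = ("region", "store", "month")

def summarize(rows, by):
    """Total the amount for each combination of the dimensions named in `by`."""
    positions = [DIMENSIONS.index(name) for name in by]
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[i] for i in positions)
        totals[key] += row[3]
    return dict(totals)

by_region = summarize(SALES, ["region"])           # drilled up: one total per region
by_store = summarize(SALES, ["region", "store"])   # drilled down one level: per store
```

Drilling down is simply the same summary with one more dimension in the grouping; drilling up removes one.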
Terminology – OLTP and OLAP
OLTP: On-Line Transaction Processing
- Supports a specific application
- Maintains integrity of data
OLAP: On-Line Analytical Processing
- Supports business analysis
Points of Difference
- Orientation or alignment of data
- Integration
- History (time horizon of data)
- Data access and manipulation
- Usage patterns
OLTP vs. OLAP – Orientation or Alignment of Data
OLTP: Organized around Applications
- Different systems hold different types of data.
- Data is inherently organized by application.
- Different information in a different system.
OLAP: Organized for Business Dimension
- All types of data are integrated into one system.
- Data is organized by defined dimensions of the business.
- Information from different systems is stored in a single database.
OLTP vs. OLAP – Integration
OLTP: Typically Not Integrated
- Different key structures
- Different naming conventions
- Different file formats
- Different hardware platforms
OLAP: Must Be Integrated
- Standard key structures
- Standard naming conventions
- Standard file format
- One warehouse server
  – Logical server
OLTP vs. OLAP – History
OLTP: Recent or Current Data
- 60-90 days
- Current values only
- No time key
- No time series analysis
- Primary source
OLAP: Historical Data
- 2 or more years
- Historical snapshots of OLTP data
- Time key
- Time series analysis
- Secondary source
  – Until data is purged or lost from OLTP (after 60-90 days)
OLTP vs. OLAP – Data Access and Manipulation
OLTP: Transactions
- Inserts, Updates, Deletes, Selects
- Small amount of data involved in each transaction
- Highly ‘indexable’
- RDBMS focus
  – Locking
  – Concurrency
  – Logical Unit of Work
OLAP: Bulk Processes
- Selects only
- Large amount of data involved in each process
- Not always ‘indexable’
- RDBMS focus
  – Parallel Loader, Query
  – Star Join
  – Bit-mapped Indexes
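To make the "Star Join" item concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (fact_sales, lookup_store, lookup_month), their columns and the sample rows are hypothetical; the slides do not prescribe a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical star schema: one fact table and two lookup (dimension) tables.
    CREATE TABLE lookup_store (store_id INTEGER PRIMARY KEY, store_desc TEXT, region_id INTEGER);
    CREATE TABLE lookup_month (month_id INTEGER PRIMARY KEY, month_desc TEXT);
    CREATE TABLE fact_sales (store_id INTEGER, month_id INTEGER, sales REAL);

    INSERT INTO lookup_store VALUES (21, 'Boston', 1), (13, 'San Fran', 2);
    INSERT INTO lookup_month VALUES (199601, 'Jan 1996'), (199602, 'Feb 1996');
    INSERT INTO fact_sales VALUES (21, 199601, 100), (21, 199602, 150), (13, 199601, 300);
""")

# Star join: the fact table is joined to each lookup table on its key,
# then the measure is summed by the descriptive attributes.
star_join_rows = conn.execute("""
    SELECT s.store_desc, m.month_desc, SUM(f.sales)
    FROM fact_sales AS f
    JOIN lookup_store AS s ON s.store_id = f.store_id
    JOIN lookup_month AS m ON m.month_id = f.month_id
    GROUP BY s.store_desc, m.month_desc
    ORDER BY s.store_desc, m.month_desc
""").fetchall()
```

The query only selects, which matches the OLAP access pattern described above; all writes happen in the batch load.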
OLTP vs. OLAP – Usage Patterns
OLTP: Fairly Consistent
- Maintains a constant system utilization pattern
OLAP: Spiked or Uneven
- Long periods of light use punctuated by spikes in usage
[Figure: System Resource Utilization Graphs]
OLTP vs. OLAP – Summary
              OLTP                       OLAP
Alignment:    Aligned by Application     Aligned by Dimension
Integration:  Typically Not Integrated   Must Be Integrated
History:      Recent or Current Data     Historical Data
Data Access:  Transactions               Bulk Processes
Usage:        Fairly Consistent          Spiked or Uneven
Warehouse Headaches: Batch, Maintenance, Tuning
Intro to ERM and ERD
Terms and Concepts
ENTITIES
ERM Terminology
ERM - Entity Relationship Model (design)
ERD - Entity Relationship Diagram (graphical)
Entity - things of interest to the business, represented by boxes and implemented as tables
Attributes - things to know about an entity, implemented as columns in tables
Relationships - how entities relate, represented by lines on ERD and implemented as foreign keys
Entity Paradigms
Rounded corners - ERD
Square corners - Relational
Naming
- Should be singular in nature
- Consistency, communication, compatibility
RELATIONSHIPS
Relationships and Business Rules
[Diagram: RX Transaction and RX entities joined by a relationship line]
Relationship - Line and “crow’s foot” represent a foreign key relationship between RX Transaction and RX
Cardinality - crow’s foot means “one or more”, absence means “one”
Relationships and Business Rules
[Diagram: RX "allows" RX Transaction; RX Transaction "is allowed by" RX]
Optionality
- Solid bar means that the relationship MUST exist
- Circle means that the relationship MAY exist
- Use words near entities with optionality symbols to complete sentences for definition
  – RX may allow RX Transactions
  – RX Transactions must be allowed by an RX
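One way to carry these two business rules into tables is sketched below with Python's sqlite3 module. The table and column names (rx, rx_transaction, rx_id) are assumptions chosen for illustration: a NOT NULL foreign key expresses "must be allowed by an RX", while nothing forces an RX to have transactions ("may allow").

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE rx (rx_id INTEGER PRIMARY KEY);
    CREATE TABLE rx_transaction (
        rx_transaction_id INTEGER PRIMARY KEY,
        -- NOT NULL + REFERENCES: a transaction MUST be allowed by an RX
        rx_id INTEGER NOT NULL REFERENCES rx (rx_id)
    );
""")

conn.execute("INSERT INTO rx (rx_id) VALUES (1)")          # an RX MAY have no transactions
conn.execute("INSERT INTO rx_transaction VALUES (10, 1)")  # allowed by RX 1

def insert_orphan_transaction():
    """Try to store a transaction whose RX does not exist; return True if rejected."""
    try:
        conn.execute("INSERT INTO rx_transaction VALUES (11, 999)")
    except sqlite3.IntegrityError:
        return True
    return False
```

Calling insert_orphan_transaction() shows the mandatory side of the relationship being enforced by the database rather than by application code.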
Relationships
[Diagram: Department "is managed by" Vice President; Vice President "manages" Department]
One-to-One relationship:
- Each department must be managed by one VP.
- Each VP may manage one department.
Relationships
[Diagram: Management Team "contains" Vice President; Vice President "is contained in" Management Team]
One-to-Many relationship:
- Each management team must contain many VPs.
- Each VP may be contained in one management team.
Relationships
[Diagram: Employee "holds" Degree; Degree "is held by" Employee]
Many-to-Many relationship:
- Each Employee must hold one or more degrees.
- Each Degree may be held by one or more employees.
ATTRIBUTES
Attributes - Terminology
Attributes are the information we wish to keep about a particular entity
Designers must consider and understand unique characteristics and requirements of all three previous components
Ideally, a project team should pick the best-of-class tools for storing and accessing data.
In reality, all three pieces should be selected with regard to the others, to ensure that each component will complement the others.
Source Systems
One or more operational systems will be the source(s) of the data stored in the data warehouse.
[Diagram: Source Systems A, B and C; User Groups A, B and C]
Source Systems (cont’d)
Source systems are typically not integrated.
Have unique key structures and unique naming conventions
Possess overlapping data
Source systems hold current value data.
Source systems will indirectly define the scope of a warehouse.
Only data found in source systems can be included in data warehouse; no “new” data can be created.
Each operational system will have unique characteristics (levels of detail or granularity of data, types of data or metrics available)
The Warehouse Trade-Off Triangle
[Diagram: triangle labeled Query Performance, User Requirements and Data Warehouse Maintenance; Schema]
The ETL Process
ETL = Extraction, Transformation and Loading
Batch Process – Overview
[Diagram: Source System Server (Source System -> Extract Program -> Extract File) -> File Transfer -> Warehouse Server (Landing Space, Load File -> DWH RDBMS)]
Batch Process – Extracts
[Diagram: Source System -> Extract Program -> Extract File, on the Source System Server]
Extracts are programs that generate data files.
Perform data transformations, data cleaning.
Perform key conversions.
Reformat data to the standards of the warehouse.
Must produce data in a file format suitable for loading into the data warehouse (delimiters, capitalization, etc.).
May build aggregation tables.
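The duties listed above can be sketched as a small extract routine. Everything specific here is an assumption for illustration: the KEY_MAP conversion table, the field names (store, city, sales), the pipe delimiter and the upper-case standard are hypothetical choices, not rules from the original material.

```python
import csv
import io

# Hypothetical mapping from source-system keys to warehouse keys.
KEY_MAP = {"BOS-01": 21, "SF-07": 13}

def extract(source_rows):
    """Clean source rows, convert keys, and return a pipe-delimited load file as text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="|", lineterminator="\n")
    for row in source_rows:
        store_key = row["store"].strip().upper()
        if store_key not in KEY_MAP:          # data cleaning: skip unknown stores
            continue
        writer.writerow([
            KEY_MAP[store_key],               # key conversion
            row["city"].strip().upper(),      # capitalization standard
            f"{float(row['sales']):.2f}",     # uniform numeric format
        ])
    return out.getvalue()

load_file = extract([
    {"store": " bos-01 ", "city": "Boston", "sales": "120.5"},
    {"store": "XX-99", "city": "Nowhere", "sales": "1"},
    {"store": "sf-07", "city": "san fran", "sales": "300"},
])
```

A production extract would normally log or quarantine rejected rows rather than silently skipping them; the skip here only keeps the sketch short.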
Batch Process – Extracts
Basic Types of Extracts
1) Fact tables
Must provide load files for the following tables:
– Base tables
– Historical tables
– Aggregate tables
2) Lookup tables
Must provide data to populate the following tables:
– Lookup tables
– Relationship tables
Batch Process – Extracts (cont’d)
Static Extraction for the first loading of the DWH
Incremental Extraction for the update of the DWH
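The difference between the two extraction modes can be sketched as follows. The approach shown for the incremental case, filtering on a last-changed date, is one common technique and an assumption here; the slides do not say how changes are detected. The row layout and function names are likewise hypothetical.

```python
# Hypothetical source rows, each carrying the date it was last changed (ISO format).
SOURCE_ROWS = [
    {"id": 1, "changed": "1996-01-15", "sales": 100},
    {"id": 2, "changed": "1996-02-03", "sales": 250},
    {"id": 3, "changed": "1996-02-20", "sales": 75},
]

def static_extract(rows):
    """First load: take every row in the source system."""
    return list(rows)

def incremental_extract(rows, last_extract_date):
    """Later loads: take only rows changed after the previous extract."""
    # ISO dates compare correctly as plain strings.
    return [row for row in rows if row["changed"] > last_extract_date]

initial = static_extract(SOURCE_ROWS)
update = incremental_extract(SOURCE_ROWS, "1996-01-31")
```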
Batch Process – File Transfers
[Diagram: Extract File on the Source System Server -> File Transfer -> Landing Space and Load File on the Warehouse Server]
Batch Process – File Transfers (cont’d)
File Transfer: Process of moving data files to data warehouse server.
After Extracts, generated files must be moved from source systems to data warehouse server.
Design Considerations
Transfer method and network impact
- Usually transferred via FTP
- Data volumes are usually large; therefore, the impact on the network and the transfer rate should be tested and understood
Landing space
- Must have enough disk space on the data warehouse server to temporarily store extract files before they are loaded into the warehouse
Batch Process – File Transfers (cont’d)
Design Considerations
Scheduling routines
- If there is not enough landing space, scheduling routines must be designed.
- Routines must coordinate file transfers and database loads, transferring a new data file only after an existing file has been loaded and is no longer needed.
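A scheduling routine of this kind might look roughly like the simulation below. It is a sketch under stated assumptions: file names, sizes and the capacity figure are invented, files are handled strictly in order, and "transfer" and "load" are only recorded as steps rather than performed.

```python
from collections import deque

def schedule(file_sizes, landing_capacity):
    """Return the ordered list of transfer/load steps for the given extract files."""
    waiting = deque(file_sizes.items())   # files still on the source system
    landed = deque()                      # files sitting in the landing space
    used = 0
    steps = []
    while waiting or landed:
        name, size = waiting[0] if waiting else (None, 0)
        if waiting and used + size <= landing_capacity:
            waiting.popleft()
            landed.append((name, size))
            used += size
            steps.append(("transfer", name))
        elif landed:
            done, done_size = landed.popleft()   # load the oldest file, then free its space
            used -= done_size
            steps.append(("load", done))
        else:
            raise ValueError(f"{name} is larger than the landing space")
    return steps

steps = schedule({"lookup.dat": 2, "fact.dat": 8, "agg.dat": 5}, landing_capacity=10)
```

With a capacity of 10, the third file is transferred only after the first two have been loaded and their space released.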
Batch Process – Data Loads
[Diagram: Load File in the Landing Space on the Warehouse Server -> DWH RDBMS]
Batch Process – Data Loads (cont’d)
Data Load: data loaded from extract file into database.
Post-load processes
Aggregation routines
Basic Types of Load Procedures
Append new records to existing table.
Drop table and reload updated data file.
Update existing records.
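The three load procedures can be sketched against an in-memory SQLite table. The lookup_store table and the sample rows are hypothetical, and the "drop and reload" variant below empties the table with DELETE rather than issuing DROP TABLE, which is a simplification.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lookup_store (store_id INTEGER PRIMARY KEY, store_desc TEXT)")

def append(rows):
    """Append new records to the existing table."""
    conn.executemany("INSERT INTO lookup_store VALUES (?, ?)", rows)

def drop_and_reload(rows):
    """Empty the table and reload it from a complete, updated data file."""
    conn.execute("DELETE FROM lookup_store")
    conn.executemany("INSERT INTO lookup_store VALUES (?, ?)", rows)

def update(rows):
    """Update the description of records that already exist."""
    conn.executemany(
        "UPDATE lookup_store SET store_desc = ? WHERE store_id = ?",
        [(desc, store_id) for store_id, desc in rows],
    )

def contents():
    """Return all rows, ordered by key, for inspection."""
    return conn.execute("SELECT * FROM lookup_store ORDER BY store_id").fetchall()

append([(13, "San Fran"), (21, "Boston")])
append([(24, "Dallas")])
update([(13, "San Francisco")])
after_update = contents()
drop_and_reload([(21, "Boston"), (27, "Philly")])
after_reload = contents()
```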
Batch Process – Data Loads (cont’d)
Post Load Processes
Must update table indexes after data loads.
For statistics-based optimizers, must update table and index statistics after loads.
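In SQLite, used here only as a stand-in for the warehouse RDBMS, the two post-load steps correspond to the REINDEX and ANALYZE statements. The table and index names are hypothetical. SQLite keeps indexes current on every insert, so REINDEX is shown purely to mark where an index rebuild would sit in the job; other products use different commands.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE fact_sales (store_id INTEGER, month_id INTEGER, sales REAL);
    CREATE INDEX idx_fact_store ON fact_sales (store_id);
    INSERT INTO fact_sales VALUES (21, 199601, 100), (21, 199602, 150), (13, 199601, 300);
""")

def post_load(connection):
    """Rebuild indexes, then refresh the statistics used by the query optimizer."""
    connection.execute("REINDEX")
    connection.execute("ANALYZE")

post_load(conn)
# ANALYZE stores its results in the sqlite_stat1 table.
stats = conn.execute("SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
```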
Aggregation Routines
Must run aggregation routines if aggregate data preparation is performed in the data warehouse database.
Batch Jobs – Overview
Basic Refresh Jobs: necessary to update tables in warehouse with current information
- Lookup Table
- Fact Table
- Aggregate Table
Maintenance Jobs: necessary to maintain tables in warehouse
- Updating fact data
- Re-organizing data
Basic Refresh Jobs – Lookup Tables
Purpose
Apply changes in existing “organizational” systems to lookup data in data warehouse.
Changes include addition of new items or changes to descriptive information.
No changes to attribute keys or attribute relationships.
Basic Refresh Jobs – Lookup Tables (cont’d)
Basic Methods
No refresh
- Extract is run once to populate DWH
- Often used in pilot or prototype systems
Drop and reload
- Existing table is dropped or emptied
- Extract is re-run to capture current information
- Table is loaded with new extract
Append to existing table
- Extract is re-run to capture current information
- New extract and “old” or “master” lookup file are compared.
- New “Delta” file is generated.
- Delta is applied to master lookup file and lookup table in warehouse.
  – Delta file may be loaded into warehouse directly, OR
  – Delta may be applied to master lookup file, after which the Drop and Reload method is used to load the master lookup file.
- Sophisticated batch routines, normally used in production
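The compare step of the append method can be sketched with plain dictionaries standing in for the master lookup file and the new extract. The function names and sample entries are hypothetical; in line with the stated purpose, the delta holds only new items and changed descriptions, never changed keys.

```python
def build_delta(master, new_extract):
    """Compare the master lookup with a new extract; return new or changed entries."""
    return {
        key: desc
        for key, desc in new_extract.items()
        if master.get(key) != desc
    }

def apply_delta(master, delta):
    """Return a copy of the master lookup with the delta applied."""
    updated = dict(master)
    updated.update(delta)
    return updated

master_lookup = {13: "San Fran", 21: "Boston"}
extract_2_96 = {13: "San Francisco", 21: "Boston", 24: "Dallas"}
delta = build_delta(master_lookup, extract_2_96)
master_lookup = apply_delta(master_lookup, delta)
```

Only the delta needs to travel to the warehouse, which is what makes this method attractive for large lookup tables.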
Basic Refresh Jobs – Lookup Tables (cont’d)
[Diagram: Org Source System -> Extract Program -> Extract File 1/96 and Extract File 2/96 -> Lookup Table in the DWH RDBMS]
Basic Refresh Jobs – Lookup Tables (cont’d)
[Diagram: Org Source System -> Extract Program -> Extract File 2/96; Compare Program with Master Lookup -> Delta File -> Lookup Table in the DWH RDBMS]
Basic Refresh Jobs – Fact Tables
Purpose
Refresh or update fact data in DWH with the new data from source systems.
Basic methods
Bulk or historical insert
- Extract is run to capture all data existing in source systems
- Data is bulk-loaded into data warehouse fact tables
- Simple batch routine
- Used to “start” warehouse or provide initial data sets
- Often doesn’t perform any cleansing or integration
Drop and reload
Append to existing table
Basic Refresh Jobs – Fact Tables (cont’d)
[Diagram: Fact Source System -> Extract Program -> Extract File for 9/95 thru 1/96 -> Fact Table in the DWH RDBMS]
Basic Refresh Jobs – Fact Tables (cont’d)
Basic methods
Drop and reload
- Historical or Bulk extract is re-run to capture all available data
- Existing warehouse table is emptied or truncated
- File is inserted into empty fact table
- Simple batch routine
- Used in prototypes or pilots
- Not feasible for large data sets
Append to existing table
- Extract is re-run to capture current information
- New extract is added to “end” of existing fact table
- Most common method used in production systems
Basic Refresh Jobs – Fact Tables (cont’d)
[Diagram: Fact Source System -> Extract Program -> Extract File 2/96 -> appended (Append 2/96) to the Fact Table holding 9/95 - 1/96 in the DWH RDBMS]
Basic Refresh Jobs – Aggregate Tables
Purpose
Refresh or aggregate tables
Basic methods
Aggregate in warehouse RDBMS
- Produce atomic extract
- Transfer and load atomic extract into atomic fact table
- Produce aggregate values using SQL accessing atomic fact table
- Insert aggregate values into aggregate fact table
Aggregate in batch (on source systems or warehouse server)
- Produce atomic extract
- Transfer and load atomic extract into atomic fact table
- Produce aggregate extract from atomic extract
- Transfer and load aggregate extract into aggregate fact table
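The first method, aggregating inside the warehouse RDBMS, might be sketched as a single INSERT ... SELECT. The table names (atomic_fact, aggregate_fact), the day-to-month rollup and the sample rows are assumptions for illustration, with SQLite standing in for the warehouse database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE atomic_fact (store_id INTEGER, day TEXT, sales REAL);
    CREATE TABLE aggregate_fact (store_id INTEGER, month TEXT, sales REAL);
    INSERT INTO atomic_fact VALUES
        (21, '1996-02-01', 100), (21, '1996-02-02', 50), (13, '1996-02-01', 300);
""")

# Aggregate routine: roll daily rows up to store/month and insert the result.
conn.execute("""
    INSERT INTO aggregate_fact (store_id, month, sales)
    SELECT store_id, substr(day, 1, 7), SUM(sales)
    FROM atomic_fact
    GROUP BY store_id, substr(day, 1, 7)
""")
monthly = conn.execute(
    "SELECT store_id, month, sales FROM aggregate_fact ORDER BY store_id"
).fetchall()
```

The batch alternative would compute the same totals from the atomic extract file before transfer, trading warehouse CPU time for an extra file to move and load.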
Basic Refresh Jobs – Aggregate Tables (cont’d)
[Diagram: Source System Server (Source System -> Extract Program -> Atomic Extract; Aggregate Program -> Aggregate Extract); Warehouse RDBMS (Atomic Fact Table, Aggregate Fact Table, Aggregate SQL Routines)]
Maintenance Jobs – Updating Fact Data
Purpose
As changes are made to source system data, they should be reflected in the data warehouse.
Basic Methods
Ignore changes.
Wait until audited data is available.
Drop and reload day’s extract.
Capture and apply changes.
Transfer changes.
Maintenance Jobs – Updating Fact Data (cont’d)
[Diagram: weekly calendar, Sun through Sat, marking Sun Pre-Audit Data, Wed Pre-Audit Data and Sun Post-Audit Data]
Scenario
Audit Process produces clean data set 3 days after initial set is posted.
Maintenance Jobs – Updating Fact Data (cont’d)
[Diagram: Fact Source System, Extract Program (Sunday), Sun Pre Data, Wed Pre Data, Sun Post Data, Compare Program, Delta File, Fact Table in the DWH RDBMS]
Maintenance Jobs – Re-Organizing Data
[Diagram: Region and Store entities; Lookup Region (Region_id, Region_desc) and Lookup Store (Store_id, Store_desc, Region_id)]
Relationship between Region and Store changes
Must update foreign key in Store Lookup
Store_id   Store_desc   Region_id
13         San Fran     2
21         Boston       1
24         Dallas       2 -> 1
27         Philly       1
35         DC           1
57         Las Vegas    2
Dallas is moved to the East Region
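The foreign key update for the Dallas example could be sketched as below, again with SQLite standing in for the warehouse RDBMS. The table names, the region descriptions and the id values (East = 1, West = 2) are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lookup_region (region_id INTEGER PRIMARY KEY, region_desc TEXT);
    CREATE TABLE lookup_store (
        store_id INTEGER PRIMARY KEY,
        store_desc TEXT,
        region_id INTEGER REFERENCES lookup_region (region_id)
    );
    -- Region descriptions and row values are assumed for illustration.
    INSERT INTO lookup_region VALUES (1, 'East'), (2, 'West');
    INSERT INTO lookup_store VALUES (21, 'Boston', 1), (24, 'Dallas', 2), (13, 'San Fran', 2);
""")

def move_store(store_desc, new_region_desc):
    """Point a store's foreign key at a different region."""
    conn.execute(
        """
        UPDATE lookup_store
        SET region_id = (SELECT region_id FROM lookup_region WHERE region_desc = ?)
        WHERE store_desc = ?
        """,
        (new_region_desc, store_desc),
    )

move_store("Dallas", "East")
dallas_region = conn.execute(
    "SELECT region_id FROM lookup_store WHERE store_desc = 'Dallas'"
).fetchone()[0]
```

Because only the lookup row changes, fact rows keyed by Store_id need no update; queries that group by region pick up the new assignment automatically.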
Maintenance Jobs – Re-Organizing Data (cont’d)