Introduction to Data Warehousing and Business Intelligence Slides adapted from Torben Bach Pedersen
Introduction to Data Warehousing and Business Intelligence
Slides adapted from Torben Bach Pedersen
Aalborg University 2008 - DWML course 2
Course Structure
• Business intelligence� Extract knowledge from large amounts of data collected in a
modern enterprise� Data warehousing, machine learning
• Purpose� Acquire theoretical background in lectures and literature studies� Obtain practical experience on (industrial) tools in a mini-project
Data warehousing: construction of a database with only data analysis purpose
Machine learning: find patterns automatically in databases
Business Intelligence (BI)
Aalborg University 2008 - DWML course 3
Contact Information
• Data warehousing� Teacher: Ken (Man Lung YIU)� Office: 3.2.48 Email: [email protected]
• Machine learning � Teacher: Thomas D. Nielsen� Office: 2.2.03 Email: [email protected]
• Course homepage: http://www.cs.aau.dk/~tdn/Teaching/DWML08/
� Lecture slides, mini-project, ……
Aalborg University 2008 - DWML course 4
Literature for Data Warehousing
• No textbook• Books (selected pages available in the class)
� The Data Warehouse Lifecycle Toolkit, Kimball et. al., Wiley 1998
� Fundamentals of Data Warehousing, Jarke et. al., Springer Verlag 2003
• Additional references/articles:� To be posted at course homepage
Aalborg University 2008 - DWML course 5
Mini-Project and Exam
• Mini-project� Performed in groups of ~3 persons� Documented in report of 20 pages� Firm Deadline: April 20
u The homepage also shows the soft deadline of each task
• Exam (information from last year)� Individual oral exam, for 20 minutes
u 8 minutes of DW questions u 8 minutes of ML questions
� Mini-project report as the basis for discussion� Exam also covers theoretical background in lectures and
literature� More details at the end of the course
Aalborg University 2008 - DWML course 6
Overview
• Why Business Intelligence?• Data analysis problems• Data Warehouse (DW) introduction• DW Topics
� Multidimensional modeling� ETL� Performance optimization
Aalborg University 2008 - DWML course 7
What is Business Intelligence (BI)?
• BI is different from Artificial Intelligence (AI) � AI systems make decisions for the users� BI systems help the users make the right decisions,
based on available data
• Combination of technologies� Data Warehousing (DW)� On-Line Analytical Processing (OLAP)� Data Mining (DM)� ……
Aalborg University 2008 - DWML course 8
Why BI Important?
• Worldwide BI revenue in 2005 = US$ 5.7 billion� 10% growth each year
• The Web makes BI more necessary� Customers do not appear “physically” in the store� Customers can change to other stores more easily
• Thus:� Know your customers using data and BI!� Utilize Web logs, analyze customer behavior in more detail
than before (e.g., what was not bought?)� Combine web data with traditional customer data
Aalborg University 2008 - DWML course 9
Data Analysis Problems
• The same data found in many different systems� Example: customer data across different departments� The same concept is defined differently
• Heterogeneous sources� Relational DBMS, On-Line Transaction Processing (OLTP)� Unstructured data in files (e.g., MS Excel) and documents
(e.g., MS Word)
Aalborg University 2008 - DWML course 10
Data Analysis Problems (cont’)
• Data is suited for operational systems� Accounting, billing, etc.� Do not support analysis across business functions
• Data quality is bad� Missing data, imprecise data, different use of systems
• Data are “volatile”� Data deleted in operational systems (6 months)� Data change over time – no historical information
Aalborg University 2008 - DWML course 11
Data Warehousing
• Solution: new analysis environment (DW) where data are� Subject oriented (versus function oriented)� Integrated (logically and physically)� Time variant (data can always be related to time) � Stable (data not deleted, several versions)� Supporting management decisions (different organization)
• A good DW is a prerequisite for successful BI
Aalborg University 2008 - DWML course 12
DW: Purpose and Definition
• DW is a store of information organized in a unified data model
• Data collected from a number of different sources� Finance, billing, web logs, personnel, …
• Purpose of a data warehouse (DW): support decision making
• Easy to perform advanced analysis� Ad-hoc analysis and reports
u We will cover this soon ……
� Data mining: discovery of hidden patterns and trends
Aalborg University 2008 - DWML course 13
DW Architecture – Data as Materialized Views
DB
DB
DB
DB
DB
Appl.
Appl.
Appl.
Trans. DW
DM
DM
DM
OLAP
Visua-lization
Appl.
Appl.
Data mining
(Local) Data Marts
(Global) DataWarehouse
Existing databasesand systems (OLTP) New databases
and systems (OLAP)
Analogy: (data) suppliers ↔ warehouse ↔ (data) consumers
Aalborg University 2008 - DWML course 14
Function- vs. Subject Orientation
DB
DB
DB
DB
DB
Appl.
Appl.
Appl.
Trans. DW
DM
DM
DM
D-Appl.
D-Appl.
Appl.
Appl.
D-Appl.
Function-orientedsystems
Selected subjects
All subjects,integrated
Subject-orientedsystems
Aalborg University 2008 - DWML course 15
Central DW Architecture
• All data in one, central DW• All client queries directly on the
central DW• Pros
� Simplicity� Easy to manage
• Cons� Bad performance due to no
redundancy/ workload distribution
Central
DW
SourceSource
Clients
Aalborg University 2008 - DWML course 16
Federated DW Architecture
• Data stored in separate data marts, aimed at special departments
• Logical DW (i.e., virtual)• Data marts contain detail data• Pros
� Performance due to distribution
• Cons� More complex
Logical
DW
SourceSource
Clients
Finance
martMrktng
mart
Distr.
mart
Aalborg University 2008 - DWML course 17
Tiered Architecture
• Central DW is materialized• Data is distributed to data marts in
one or more tiers
• Only aggregated data in cube tiers• Data is aggregated/reduced as it
moves through tiers• Pros
� Best performance due to redundancy and distribution
• Cons� Most complex� Hard to manage
2000 2001
Aalborg
Copenhagen
Milk
Bread
123 127
57 45
56 67
211
2000 2001
Aalborg
Copenhagen
Milk
Bread
123 127
57 45
56 67
211
2000 2001
Aalborg
Copenhagen
Milk
Bread
123 127
57 45
56 67
211 Central
DW
2000 2001
Aalborg
Copenhagen
Milk
Bread
123 127
57 45
56 67
211
2000 2001
Aalborg
Copenhagen
Milk
Bread
123 127
57 45
56 67
211
Aalborg University 2008 - DWML course 18
Queries Hard/Infeasible for OLTP
• Business analysis� In the past five years, which product is the most profitable?� Which public holiday we have the largest sales? � Which week we have the largest sales?� Does the sales of dairy products increase over time?
• Difficult to represent these queries by using SQL � 3rd query: extract the “week” value using a function
u But the user has to learn many transformation functions …� 4th query: use a “special” table to store IDs of all dairy products,
in advanceu We have many other product types as well …
• The need of multidimensional modeling
Aalborg University 2008 - DWML course 19
Multidimensional Modeling
• Example: sales of supermarkets• Facts and measures
� Each sales record is a fact, and its sales value is a measure
• Dimensions� Each sales record is associated with its values of
Product, Store, Time
� Correlated attributes grouped into the same dimension � easier for analysis tasks
5.751997Maj25ÅrhusÅrhusTrøjborgBeverageBeerTop
SalesYearMonthDayCountyCityStoreCategoryTypeProduct
Product Store Time
Aalborg University 2008 - DWML course 20
Multidimensional Modeling• How do we model the Time dimension?
� A tree structure, with multiple levels� Attributes, e.g., holiday, event
• Advantage of this model?� Easy for query (more about this later)
• Disadvantage?� Data redundancy (controlled redundancy is acceptable)
Day
Week
Month
Year
T
…Yes20081122
…………………
…No20081111
…workday
yearmonthweekdaytid
Aalborg University 2008 - DWML course 21
OLTP vs. OLAP
not necessarynecessaryTransactional recovery
not unifiedSQLQuery language
query operationsupdate operationsOptimized for
business analysisoperational needsTarget
large, historical datasmall, operational dataData
denormalized/multidimensional
normalizedModel
largesmallQueries
infrequent and batchfrequent and smallUpdates
OLAPOLTP
Aalborg University 2008 - DWML course 22
Quick Review: Normalized Database
05-02-20086.00BeverageBeer3302
02-02-20084.00CerealRice3301
02-02-20086.00BeverageBeer3301
07-02-20085.00CerealWheat3303
DatePriceCategoryProductCustomer ID
• Normalized database avoids� Redundant data� Modification anomalies
• How to get the original table? (join them)• No redundancy in OLTP, controlled redundancy in OLAP
05-02-20080133302
02-02-20080523301
02-02-20080133301
07-02-20080673303
DateProductIDCustomer ID
4.00CerealRice052
6.00BeverageBeer013
5.00CerealWheat067
PriceCategoryProductProductID
Aalborg University 2008 - DWML course 23
OLAP Data Cube
• Data cube� Useful data analysis tool in DW� Generalized GROUP BY queries� Aggregate facts based on chosen
dimensionsu Product, store, time dimensionsu Sales measure of sale facts
• Why data cube?� Good for visualization (i.e., text
results hard to understand)� Multidimensional, intuitive� Support interactive OLAP
operations
• How is it different from a spreadsheet?
Aalborg University 2008 - DWML course 24
On-Line Analytical Processing (OLAP)
• On-Line Analytical Processing� Interactive analysis� Explorative discovery� Fast response times required
• OLAP operations/queries� Aggregation, e.g., SUM� Starting level, (Year, City)� Roll Up: Less detail� Drill Down: More detail� Slice/Dice: Selection, Year=2000
102
250
All Time
9 1011 15
Aalborg University 2008 - DWML course 25
Advanced Multidimensional Modeling
• Changing dimensions� Some dimensions are not static. They can change over
time. E.g., u Time dimensionu A new store opens, or an existing store closesu The price of a product changes
� How do we handle these changes?
• Large-scale dimensional modeling� How do we coordinate the dimensions in different data
cubes and data marts?
++++Profit
++Costs
+++Sales
SupplierProductCustomerTime
Dimensions
Data marts
Bus architecture
Aalborg University 2008 - DWML course 26
Top-down vs. Bottom-up
DB
DB
DB
DB
DB
Appl.
Appl.
Appl.
Trans. DW
DM
DM
DM
D-Appl.
D-Appl.
Appl.
Appl.
D-Appl.
Top-down:1. Design of DW2. Design of DMs
Bottom-up:
1. Design of DMs2. Maybe integration
of DMs in DW3. Maybe no DW
In-between:1. Design of DW for
DM12. Design of DM2 and
integration with DW3. Design of DM3 and
integration with DW4. ...
Aalborg University 2008 - DWML course 27
Extract, Transform, Load (ETL)
• “Getting multidimensional data into the DW”• Problems
� Data from different sources� Data with different formats� Handling of missing data and erroneous data
• ETL� Extract� Transformations / cleansing� Load
• The most time-consuming process in DW development� 80% of development time spent on ETL
Aalborg University 2008 - DWML course 28
Data’s Way To DW• Extraction
� Extract from many heterogeneous systems
• Staging area� Large, sequential bulk operations � flat files best?
• Cleansing� Data checked for missing parts and erroneous values� Default values provided and out-of-range values marked
• Transformation� Data transformed to decision-oriented format� Data from several sources merged, optimize for querying
• Aggregation?� Are individual business transactions needed in the DW?
• Loading into DW� Large bulk loads rather than SQL INSERTs (Why?)� Fast indexing (and pre-aggregation) required
Aalborg University 2008 - DWML course 29
Performance Optimization
• Performance optimization� Fine tune performance for important queries� Aggregates, indexing, other optimizations (environment,
partitioning)
• Using aggregates� How can aggregates improve performance?
• Choosing aggregates� Which aggregates should we materialize?
Aalborg University 2008 - DWML course 30
Materialization Example
• Imagine 1 billion sales rows, 1000 products, 100 locations
• CREATE VIEW TotalSales (pid,locid,total) ASSELECT s.pid,s.locid,SUM(s.sales) FROM Sales s GROUP BY s.pid,s.locid
• The materialized view has 100,000 rows• Rewrite the query to use the view
� SELECT p.category,SUM(s.sales) FROM Products p, Sales s WHERE p.pid=s.pid GROUP BY p.category
u can be rewritten to� SELECT p.category,SUM(t.total) FROM Products p,
TotalSales t WHERE p.pid=t.pid GROUP BY p.category
u Query becomes 10,000 times faster!
40323
…………
1
1
locid
2
1011
201
salespidtid
Sales
4032
………
1
locid
301
salespid
VIEW TotalSales
1 billion rows
100,000 rows
Aalborg University 2008 - DWML course 31
Common DW Issues• Metadata management
� Need to understand data = metadata needed� Greater need in OLAP than in OLTP as “raw” data is used� Need to know about:
u Data definitions, dataflow, transformations, versions, usage, security
• DW project management� DW projects are large and different from ordinary SW projects
u 12-36 months and US$ 1+ million per projectu Data marts are smaller and “safer” (bottom up approach)
� Reasons for failureu Lack of proper design methodologiesu High HW+SW costu Deployment problems (lack of training)u Organizational change is hard… (new processes, data ownership,..)u Ethical issues (security, privacy,…)
Aalborg University 2008 - DWML course 32
Topics not to be Covered
• Privacy/security of data during ETL� Encryption may not work� During extraction/transformation, may need to know original
values in order to check whether ETL performs correctly
• Data Visualization (VIS)
• Decision Analysis (What-if)
• Customer Relationship Management (CRM)
Aalborg University 2008 - DWML course 33
Summary
• Why Business Intelligence?• Data analysis problems• Data Warehouse (DW) introduction• DW Topics
� Multidimensional modeling � ETL� Performance optimization
• BI provide many advantages to your organization� A good DW is a prerequisite for BI
Aalborg University 2008 - DWML course 34
DW Software
• DW part of the mini-project• DW software
� Obtain from MSDNAA, and installu MS SQL Server 2005 RDBMSu MS Analysis Services, Integration Services,
Reporting Services
� Checking after installationu Open “Component Services” and check whether all
four services above have been startedu Open “SQL Server Management Studio” and see
whether you can connect to “Database Engine”
� Read the mini-project webpage for installation details
Aalborg University 2008 - DWML course 35
Demonstration Session
• To get you familiar with DW software, we have a demonstration session after the next few lectures
• Details of demonstration� Location: 3.2.48� Time: just after each lecture (not this one)� Three demonstration slots (30 minutes each):
u 14.30-15.00u 15.00-15.30u 15.30-16.00
� At most 5 students can fit in the same slot (not necessarily from the same group)
� Signup for demonstration slots today!
Aalborg University 2008 - DWML course 36
Mini-Project Group Formation
• Form groups ~3 persons• Discuss with your classmates NOW• Groups to be formed today, before
we leave this room!