Introduction to Data Warehousing and Business Intelligence Slides adapted from Torben Bach Pedersen

Source: people.cs.aau.dk/~tdn/Teaching/DWML08/Slides/DW1_BIintro.pdf · 2008-02-02




Aalborg University 2008 - DWML course 2

Course Structure

• Business intelligence
  - Extract knowledge from large amounts of data collected in a modern enterprise
  - Data warehousing, machine learning
• Purpose
  - Acquire theoretical background in lectures and literature studies
  - Obtain practical experience with (industrial) tools in a mini-project

Business Intelligence (BI) combines:
  - Data warehousing: construction of a database used solely for data analysis
  - Machine learning: finding patterns automatically in databases


Contact Information

• Data warehousing
  - Teacher: Ken (Man Lung YIU)
  - Office: 3.2.48, Email: [email protected]
• Machine learning
  - Teacher: Thomas D. Nielsen
  - Office: 2.2.03, Email: [email protected]
• Course homepage: http://www.cs.aau.dk/~tdn/Teaching/DWML08/
  - Lecture slides, mini-project, ……


Literature for Data Warehousing

• No textbook
• Books (selected pages available in the class)
  - The Data Warehouse Lifecycle Toolkit, Kimball et al., Wiley 1998
  - Fundamentals of Data Warehousing, Jarke et al., Springer Verlag 2003
• Additional references/articles:
  - To be posted at course homepage


Mini-Project and Exam

• Mini-project
  - Performed in groups of ~3 persons
  - Documented in a report of 20 pages
  - Firm deadline: April 20
      - The homepage also shows the soft deadline of each task
• Exam (information from last year)
  - Individual oral exam, for 20 minutes
      - 8 minutes of DW questions
      - 8 minutes of ML questions
  - Mini-project report as the basis for discussion
  - Exam also covers theoretical background in lectures and literature
  - More details at the end of the course


Overview

• Why Business Intelligence?
• Data analysis problems
• Data Warehouse (DW) introduction
• DW Topics
  - Multidimensional modeling
  - ETL
  - Performance optimization


What is Business Intelligence (BI)?

• BI is different from Artificial Intelligence (AI)
  - AI systems make decisions for the users
  - BI systems help the users make the right decisions, based on available data
• Combination of technologies
  - Data Warehousing (DW)
  - On-Line Analytical Processing (OLAP)
  - Data Mining (DM)
  - ……


Why Is BI Important?

• Worldwide BI revenue in 2005 = US$ 5.7 billion
  - 10% growth each year
• The Web makes BI more necessary
  - Customers do not appear “physically” in the store
  - Customers can change to other stores more easily
• Thus:
  - Know your customers using data and BI!
  - Utilize Web logs, analyze customer behavior in more detail than before (e.g., what was not bought?)
  - Combine web data with traditional customer data


Data Analysis Problems

• The same data found in many different systems
  - Example: customer data across different departments
  - The same concept is defined differently
• Heterogeneous sources
  - Relational DBMS, On-Line Transaction Processing (OLTP)
  - Unstructured data in files (e.g., MS Excel) and documents (e.g., MS Word)


Data Analysis Problems (cont'd)

• Data is suited for operational systems
  - Accounting, billing, etc.
  - Do not support analysis across business functions
• Data quality is bad
  - Missing data, imprecise data, different use of systems
• Data are “volatile”
  - Data deleted in operational systems (e.g., after 6 months)
  - Data change over time, so no historical information is kept


Data Warehousing

• Solution: new analysis environment (DW) where data are
  - Subject oriented (versus function oriented)
  - Integrated (logically and physically)
  - Time variant (data can always be related to time)
  - Stable (data not deleted, several versions)
  - Supporting management decisions (different organization)
• A good DW is a prerequisite for successful BI


DW: Purpose and Definition

• DW is a store of information organized in a unified data model
• Data collected from a number of different sources
  - Finance, billing, web logs, personnel, …
• Purpose of a data warehouse (DW): support decision making
• Easy to perform advanced analysis
  - Ad-hoc analysis and reports
      - We will cover this soon ……
  - Data mining: discovery of hidden patterns and trends


DW Architecture – Data as Materialized Views

[Figure: existing databases and systems (OLTP: DBs and Appl.) → Trans. → (global) Data Warehouse (DW) → (local) Data Marts (DM) → new databases and systems (OLAP: OLAP, visualization, data mining, Appl.)]

Analogy: (data) suppliers ↔ warehouse ↔ (data) consumers


Function- vs. Subject Orientation

[Figure: function-oriented systems (DBs and Appl.) → Trans. → DW (all subjects, integrated) → DMs (selected subjects) → D-Appl.; the DW and the DMs are the subject-oriented systems]


Central DW Architecture

• All data in one, central DW
• All client queries directly on the central DW
• Pros
  - Simplicity
  - Easy to manage
• Cons
  - Bad performance due to no redundancy/workload distribution

[Figure: Sources → Central DW → Clients]


Federated DW Architecture

• Data stored in separate data marts, aimed at special departments
• Logical DW (i.e., virtual)
• Data marts contain detail data
• Pros
  - Performance due to distribution
• Cons
  - More complex

[Figure: Sources → Finance mart, Mrktng mart, Distr. mart (together the Logical DW) → Clients]


Tiered Architecture

• Central DW is materialized
• Data is distributed to data marts in one or more tiers
• Only aggregated data in cube tiers
• Data is aggregated/reduced as it moves through tiers
• Pros
  - Best performance due to redundancy and distribution
• Cons
  - Most complex
  - Hard to manage

[Figure: Central DW feeding tiers of data cubes; each cube has the axes year (2000, 2001), city (Aalborg, Copenhagen) and product (Milk, Bread)]


Queries Hard/Infeasible for OLTP

• Business analysis
  - In the past five years, which product was the most profitable?
  - On which public holiday do we have the largest sales?
  - In which week do we have the largest sales?
  - Do the sales of dairy products increase over time?
• Difficult to express these queries in SQL
  - 3rd query: extract the “week” value using a function
      - But the user has to learn many transformation functions …
  - 4th query: use a “special” table to store the IDs of all dairy products, in advance
      - We have many other product types as well …
• Hence the need for multidimensional modeling
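The difficulty with the 3rd and 4th queries can be made concrete with a small sketch. It uses Python's sqlite3 module; the table names (sales, dairy_products), the columns, and the rows are invented for illustration and are not from the course material. The week is derived with SQLite's strftime('%W') function (week of year, Monday as first day), and the dairy products come from a hand-maintained helper table.

```python
import sqlite3

# Hypothetical OLTP-style table; names and rows are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("2008-01-02", "Milk", 10.0),
        ("2008-01-03", "Bread", 5.0),
        ("2008-01-09", "Milk", 30.0),
        ("2008-01-10", "Cheese", 20.0),
    ],
)

# 3rd query: the week must be computed from the date with a function call.
best_week = conn.execute(
    """
    SELECT strftime('%W', sale_date) AS week, SUM(amount) AS total
    FROM sales
    GROUP BY week
    ORDER BY total DESC
    LIMIT 1
    """
).fetchone()

# 4th query: a "special" table listing the dairy products has to exist in advance.
conn.execute("CREATE TABLE dairy_products (product TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO dairy_products VALUES (?)", [("Milk",), ("Cheese",)])
dairy_by_week = conn.execute(
    """
    SELECT strftime('%W', s.sale_date) AS week, SUM(s.amount) AS total
    FROM sales s
    JOIN dairy_products d ON s.product = d.product
    GROUP BY week
    ORDER BY week
    """
).fetchall()

print(best_week)
print(dairy_by_week)
```

Every further grouping (holiday, quarter, product type) would need another function or another helper table, which is the motivation for storing such attributes explicitly in dimensions.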


Multidimensional Modeling

• Example: sales of supermarkets
• Facts and measures
  - Each sales record is a fact, and its sales value is a measure
• Dimensions
  - Each sales record is associated with its values of Product, Store, Time
  - Correlated attributes grouped into the same dimension → easier for analysis tasks

Product | Type | Category | Store    | City  | County | Day | Month | Year | Sales
Top     | Beer | Beverage | Trøjborg | Århus | Århus  | 25  | May   | 1997 | 5.75

Dimensions: Product (Product, Type, Category), Store (Store, City, County), Time (Day, Month, Year)
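One possible way to store the example above is a fact table that references one table per dimension. The sketch below uses Python's sqlite3 module; the table and key names (product_dim, store_dim, time_dim, sales_fact, pid, sid, tid) are assumptions chosen for illustration, and only the single sample row from the slide is loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One table per dimension; correlated attributes live together.
    CREATE TABLE product_dim (pid INTEGER PRIMARY KEY, product TEXT, type TEXT, category TEXT);
    CREATE TABLE store_dim   (sid INTEGER PRIMARY KEY, store TEXT, city TEXT, county TEXT);
    CREATE TABLE time_dim    (tid INTEGER PRIMARY KEY, day INTEGER, month TEXT, year INTEGER);

    -- The fact table holds the measure plus one key per dimension.
    CREATE TABLE sales_fact (
        pid   INTEGER REFERENCES product_dim(pid),
        sid   INTEGER REFERENCES store_dim(sid),
        tid   INTEGER REFERENCES time_dim(tid),
        sales REAL
    );

    -- The sample record (month 'May' is the slide's Danish 'Maj').
    INSERT INTO product_dim VALUES (1, 'Top', 'Beer', 'Beverage');
    INSERT INTO store_dim   VALUES (1, 'Trøjborg', 'Århus', 'Århus');
    INSERT INTO time_dim    VALUES (1, 25, 'May', 1997);
    INSERT INTO sales_fact  VALUES (1, 1, 1, 5.75);
    """
)

# Analysis picks one attribute (level) from each dimension and aggregates the measure.
sales_by_category_city_year = conn.execute(
    """
    SELECT p.category, s.city, t.year, SUM(f.sales)
    FROM sales_fact f
    JOIN product_dim p ON f.pid = p.pid
    JOIN store_dim   s ON f.sid = s.sid
    JOIN time_dim    t ON f.tid = t.tid
    GROUP BY p.category, s.city, t.year
    """
).fetchall()

print(sales_by_category_city_year)
```

Changing the level of analysis (for example Type instead of Category) only changes the selected column, not the structure of the query.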


Multidimensional Modeling

• How do we model the Time dimension?
  - A tree structure, with multiple levels
  - Attributes, e.g., holiday, event
• Advantage of this model?
  - Easy for query (more about this later)
• Disadvantage?
  - Data redundancy (controlled redundancy is acceptable)

[Figure: Time dimension hierarchy with the levels Day, Week, Month, Year and the top level T]

tid | day | week | month | year | workday | …
1   | 1   | 1    | 1     | 2008 | No      | …
…   | …   | …    | …     | …    | …       | …
2   | 2   | 1    | 1     | 2008 | Yes     | …
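A Time dimension table like the one above is usually generated rather than typed in. The following is a minimal sketch in Python; the function name build_time_dimension, the use of ISO week numbers, and the rule "workday = Monday to Friday and not a listed holiday" are assumptions made for illustration.

```python
from datetime import date, timedelta


def build_time_dimension(start, days, holidays=frozenset()):
    """Return one row per day with the attributes of the Time dimension."""
    rows = []
    for offset in range(days):
        d = start + timedelta(days=offset)
        # isocalendar() yields (ISO year, ISO week number, ISO weekday 1..7).
        _, iso_week, iso_weekday = d.isocalendar()
        is_workday = iso_weekday <= 5 and d not in holidays
        rows.append(
            {
                "tid": offset + 1,  # surrogate key
                "day": d.day,
                "week": iso_week,
                "month": d.month,
                "year": d.year,
                "workday": "Yes" if is_workday else "No",
            }
        )
    return rows


# First week of 2008, with New Year's Day marked as a holiday.
time_dim = build_time_dimension(date(2008, 1, 1), 7, holidays={date(2008, 1, 1)})
for row in time_dim:
    print(row)
```

The redundancy mentioned above is visible here: year, month, and week are repeated on every day row, which is what makes grouping by any level a simple column reference. ISO weeks can span two calendar years, which is one reason Week is typically modeled as a separate branch from Month and Year.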


OLTP vs. OLAP

                       | OLTP                    | OLAP
Updates                | frequent and small      | infrequent and batch
Queries                | small                   | large
Model                  | normalized              | denormalized/multidimensional
Data                   | small, operational data | large, historical data
Target                 | operational needs       | business analysis
Optimized for          | update operations       | query operations
Query language         | SQL                     | not unified
Transactional recovery | necessary               | not necessary


Quick Review: Normalized Database

Customer ID | Product | Category | Price | Date
3303        | Wheat   | Cereal   | 5.00  | 07-02-2008
3301        | Beer    | Beverage | 6.00  | 02-02-2008
3301        | Rice    | Cereal   | 4.00  | 02-02-2008
3302        | Beer    | Beverage | 6.00  | 05-02-2008

• Normalized database avoids
  - Redundant data
  - Modification anomalies
• How to get the original table? (join them)
• No redundancy in OLTP, controlled redundancy in OLAP

Customer ID | ProductID | Date
3303        | 067       | 07-02-2008
3301        | 013       | 02-02-2008
3301        | 052       | 02-02-2008
3302        | 013       | 05-02-2008

ProductID | Product | Category | Price
067       | Wheat   | Cereal   | 5.00
013       | Beer    | Beverage | 6.00
052       | Rice    | Cereal   | 4.00
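The "join them" step can be sketched as follows, using Python's sqlite3 module and the rows from the slide. The table and column names (purchase, product, customer_id, product_id, purchase_date) are assumptions, since the slide does not name the tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Normalized form: product details are stored once per product.
    CREATE TABLE product (
        product_id TEXT PRIMARY KEY, product TEXT, category TEXT, price REAL
    );
    CREATE TABLE purchase (
        customer_id INTEGER,
        product_id TEXT REFERENCES product(product_id),
        purchase_date TEXT
    );
    INSERT INTO product VALUES
        ('013', 'Beer', 'Beverage', 6.00),
        ('052', 'Rice', 'Cereal', 4.00),
        ('067', 'Wheat', 'Cereal', 5.00);
    INSERT INTO purchase VALUES
        (3303, '067', '07-02-2008'),
        (3301, '013', '02-02-2008'),
        (3301, '052', '02-02-2008'),
        (3302, '013', '05-02-2008');
    """
)

# Joining on the product key reconstructs the original wide table.
original_table = conn.execute(
    """
    SELECT pu.customer_id, pr.product, pr.category, pr.price, pu.purchase_date
    FROM purchase pu
    JOIN product pr ON pu.product_id = pr.product_id
    ORDER BY pu.customer_id, pr.product
    """
).fetchall()

for row in original_table:
    print(row)
```

In the normalized form a price change for Beer touches one row; in the wide table it would touch every Beer purchase, which is the modification anomaly that normalization avoids.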


OLAP Data Cube

• Data cube
  - Useful data analysis tool in DW
  - Generalized GROUP BY queries
  - Aggregate facts based on chosen dimensions
      - Product, store, time dimensions
      - Sales measure of sale facts
• Why data cube?
  - Good for visualization (text results are hard to understand)
  - Multidimensional, intuitive
  - Support interactive OLAP operations
• How is it different from a spreadsheet?
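One way to read "generalized GROUP BY" is that a cube holds the result of a GROUP BY for every subset of the chosen dimensions. The sketch below computes exactly that in plain Python; the function name data_cube and the sample facts are invented for illustration, and real OLAP engines compute and store cubes far more efficiently.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sale facts: three dimension values plus the sales measure.
facts = [
    {"product": "Milk", "store": "Aalborg", "year": 2000, "sales": 10},
    {"product": "Milk", "store": "Aalborg", "year": 2001, "sales": 20},
    {"product": "Bread", "store": "Aalborg", "year": 2000, "sales": 5},
    {"product": "Milk", "store": "Copenhagen", "year": 2000, "sales": 40},
]
DIMENSIONS = ("product", "store", "year")


def data_cube(facts, dimensions, measure):
    """SUM the measure for every subset of the dimensions (2^n group-bys)."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for group in combinations(dimensions, size):
            totals = defaultdict(int)
            for fact in facts:
                key = tuple(fact[d] for d in group)
                totals[key] += fact[measure]
            cube[group] = dict(totals)
    return cube


cube = data_cube(facts, DIMENSIONS, "sales")
print(cube[()])                  # grand total
print(cube[("product",)])        # total per product
print(cube[("store", "year")])   # total per store and year
```

With three dimensions the cube contains eight group-bys, from the grand total up to the full detail; a spreadsheet typically shows only one such two-dimensional view at a time.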


On-Line Analytical Processing (OLAP)

• On-Line Analytical Processing
  - Interactive analysis
  - Explorative discovery
  - Fast response times required
• OLAP operations/queries
  - Aggregation, e.g., SUM
  - Starting level, (Year, City)
  - Roll Up: Less detail
  - Drill Down: More detail
  - Slice/Dice: Selection, Year=2000

[Figure: example data cube with cell values and an "All Time" roll-up level]
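The operations listed above can be sketched on a tiny in-memory fact list. This is only an illustration in plain Python: the helper aggregate and the sample rows are assumptions, and an OLAP server would answer such queries from pre-computed aggregates rather than by scanning the facts.

```python
from collections import defaultdict

# Hypothetical facts at the most detailed level (Year, City, Product).
facts = [
    {"year": 2000, "city": "Aalborg", "product": "Milk", "sales": 10},
    {"year": 2000, "city": "Aalborg", "product": "Bread", "sales": 5},
    {"year": 2000, "city": "Copenhagen", "product": "Milk", "sales": 40},
    {"year": 2001, "city": "Aalborg", "product": "Milk", "sales": 20},
]


def aggregate(rows, levels):
    """SUM(sales) grouped by the given levels."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[level] for level in levels)] += row["sales"]
    return dict(totals)


# Starting level: (Year, City).
start = aggregate(facts, ("year", "city"))

# Roll up: drop City, leaving less detail.
rolled_up = aggregate(facts, ("year",))

# Drill down: add Product, giving more detail.
drilled_down = aggregate(facts, ("year", "city", "product"))

# Slice: select Year = 2000, then look at the remaining City level.
slice_2000 = aggregate([f for f in facts if f["year"] == 2000], ("city",))

print(start)
print(rolled_up)
print(drilled_down)
print(slice_2000)
```

Roll-up and drill-down only change the list of grouping levels, while a slice adds a selection before aggregating.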


Advanced Multidimensional Modeling

• Changing dimensions
  - Some dimensions are not static. They can change over time. E.g.,
      - Time dimension
      - A new store opens, or an existing store closes
      - The price of a product changes
  - How do we handle these changes?
• Large-scale dimensional modeling
  - How do we coordinate the dimensions in different data cubes and data marts?

[Figure: bus architecture matrix marking which of the dimensions Time, Customer, Product, Supplier are used by each of the data marts Sales, Costs, Profit]


Top-down vs. Bottom-up

[Figure: DBs and Appl. → Trans. → DW → DMs → D-Appl.]

• Top-down:
  1. Design of DW
  2. Design of DMs
• Bottom-up:
  1. Design of DMs
  2. Maybe integration of DMs in DW
  3. Maybe no DW
• In-between:
  1. Design of DW for DM1
  2. Design of DM2 and integration with DW
  3. Design of DM3 and integration with DW
  4. ...


Extract, Transform, Load (ETL)

• “Getting multidimensional data into the DW”
• Problems
  - Data from different sources
  - Data with different formats
  - Handling of missing data and erroneous data
• ETL
  - Extract
  - Transformations / cleansing
  - Load
• The most time-consuming process in DW development
  - 80% of development time spent on ETL


Data’s Way To DW

• Extraction
  - Extract from many heterogeneous systems
• Staging area
  - Large, sequential bulk operations → flat files best?
• Cleansing
  - Data checked for missing parts and erroneous values
  - Default values provided and out-of-range values marked
• Transformation
  - Data transformed to decision-oriented format
  - Data from several sources merged, optimize for querying
• Aggregation?
  - Are individual business transactions needed in the DW?
• Loading into DW
  - Large bulk loads rather than SQL INSERTs (Why?)
  - Fast indexing (and pre-aggregation) required
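The cleansing and loading steps above can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the flat-file layout, the default value "UNKNOWN", the accepted price range, and the table name sales. SQLite's executemany inside a single transaction stands in for a real bulk loader.

```python
import csv
import io
import sqlite3

# Hypothetical staging-area flat file: one missing product, one negative price.
RAW = """customer_id,product,price
3301,Beer,6.00
3302,,4.00
3303,Wheat,-5.00
"""

DEFAULT_PRODUCT = "UNKNOWN"   # default value for missing product names
PRICE_RANGE = (0.0, 1000.0)   # prices outside this range are marked, not dropped


def cleanse(record):
    """Fill in defaults and flag out-of-range values for one staged record."""
    product = record["product"].strip() or DEFAULT_PRODUCT
    price = float(record["price"])
    out_of_range = 0 if PRICE_RANGE[0] <= price <= PRICE_RANGE[1] else 1
    return (int(record["customer_id"]), product, price, out_of_range)


def etl(raw_text, conn):
    """Extract rows from the flat file, cleanse them, and load them in one batch."""
    rows = [cleanse(r) for r in csv.DictReader(io.StringIO(raw_text))]
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(customer_id INTEGER, product TEXT, price REAL, out_of_range INTEGER)"
    )
    with conn:  # one transaction for the whole batch
        conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    return rows


conn = sqlite3.connect(":memory:")
loaded = etl(RAW, conn)
flagged = conn.execute("SELECT COUNT(*) FROM sales WHERE out_of_range = 1").fetchone()[0]
print(loaded)
print(flagged)
```

Loading the whole batch in one transaction avoids paying per-statement and per-commit overhead for each row, which hints at why bulk loads are preferred over individual SQL INSERTs.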


Performance Optimization

• Performance optimization
  - Fine tune performance for important queries
  - Aggregates, indexing, other optimizations (environment, partitioning)
• Using aggregates
  - How can aggregates improve performance?
• Choosing aggregates
  - Which aggregates should we materialize?


Materialization Example

• Imagine 1 billion sales rows, 1000 products, 100 locations
• CREATE VIEW TotalSales (pid, locid, total) AS
    SELECT s.pid, s.locid, SUM(s.sales)
    FROM Sales s
    GROUP BY s.pid, s.locid
• The materialized view has 100,000 rows (1000 products × 100 locations)
• Rewrite the query to use the view
  - SELECT p.category, SUM(s.sales)
    FROM Products p, Sales s
    WHERE p.pid = s.pid
    GROUP BY p.category
      - can be rewritten to
  - SELECT p.category, SUM(t.total)
    FROM Products p, TotalSales t
    WHERE p.pid = t.pid
    GROUP BY p.category
      - Query becomes 10,000 times faster! (1 billion rows / 100,000 rows)

Sales (1 billion rows)
tid | pid | locid | sales
1   | 1   | 1     | 10
2   | 1   | 1     | 20
…   | …   | …     | …
3   | 2   | 3     | 40

VIEW TotalSales (100,000 rows)
pid | locid | sales
1   | 1     | 30
…   | …     | …
2   | 3     | 40
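The rewrite above can be checked on a toy instance. The sketch uses Python's sqlite3 module with the three sample Sales rows; the product categories are invented for illustration. SQLite has no materialized views, so the aggregate is stored with CREATE TABLE ... AS SELECT, which is an assumption standing in for whatever materialization mechanism the DW product offers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Products (pid INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE Sales (tid INTEGER, pid INTEGER, locid INTEGER, sales INTEGER);
    INSERT INTO Products VALUES (1, 'Beverage'), (2, 'Cereal');
    INSERT INTO Sales VALUES (1, 1, 1, 10), (2, 1, 1, 20), (3, 2, 3, 40);

    -- Stored aggregate standing in for the materialized view.
    CREATE TABLE TotalSales AS
        SELECT s.pid AS pid, s.locid AS locid, SUM(s.sales) AS total
        FROM Sales s
        GROUP BY s.pid, s.locid;
    """
)

# Original query: scans every row of the detailed Sales table.
original = conn.execute(
    """
    SELECT p.category, SUM(s.sales)
    FROM Products p, Sales s
    WHERE p.pid = s.pid
    GROUP BY p.category
    ORDER BY p.category
    """
).fetchall()

# Rewritten query: scans the much smaller pre-aggregated table instead.
rewritten = conn.execute(
    """
    SELECT p.category, SUM(t.total)
    FROM Products p, TotalSales t
    WHERE p.pid = t.pid
    GROUP BY p.category
    ORDER BY p.category
    """
).fetchall()

aggregate_rows = conn.execute("SELECT COUNT(*) FROM TotalSales").fetchone()[0]
print(original, rewritten, aggregate_rows)
```

Both queries return the same totals because SUM can be computed from partial sums. The speed-up quoted on the slide is an estimate from the row counts (1 billion versus 100,000 rows to scan), not something this toy example measures.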


Common DW Issues

• Metadata management
  - Need to understand data = metadata needed
  - Greater need in OLAP than in OLTP as “raw” data is used
  - Need to know about:
      - Data definitions, dataflow, transformations, versions, usage, security
• DW project management
  - DW projects are large and different from ordinary SW projects
      - 12-36 months and US$ 1+ million per project
      - Data marts are smaller and “safer” (bottom up approach)
  - Reasons for failure
      - Lack of proper design methodologies
      - High HW+SW cost
      - Deployment problems (lack of training)
      - Organizational change is hard… (new processes, data ownership,..)
      - Ethical issues (security, privacy,…)


Topics not to be Covered

• Privacy/security of data during ETL
  - Encryption may not work
  - During extraction/transformation, may need to know original values in order to check whether ETL performs correctly
• Data Visualization (VIS)
• Decision Analysis (What-if)
• Customer Relationship Management (CRM)


Summary

• Why Business Intelligence?
• Data analysis problems
• Data Warehouse (DW) introduction
• DW Topics
  - Multidimensional modeling
  - ETL
  - Performance optimization
• BI provides many advantages to your organization
  - A good DW is a prerequisite for BI


DW Software

• DW part of the mini-project
• DW software
  - Obtain from MSDNAA, and install
      - MS SQL Server 2005 RDBMS
      - MS Analysis Services, Integration Services, Reporting Services
  - Checking after installation
      - Open “Component Services” and check whether all four services above have been started
      - Open “SQL Server Management Studio” and see whether you can connect to “Database Engine”
  - Read the mini-project webpage for installation details


Demonstration Session

• To get you familiar with the DW software, we have a demonstration session after each of the next few lectures
• Details of demonstration
  - Location: 3.2.48
  - Time: just after each lecture (not this one)
  - Three demonstration slots (30 minutes each):
      - 14.30-15.00
      - 15.00-15.30
      - 15.30-16.00
  - At most 5 students can fit in the same slot (not necessarily from the same group)
  - Sign up for demonstration slots today!


Mini-Project Group Formation

• Form groups of ~3 persons
• Discuss with your classmates NOW
• Groups to be formed today, before we leave this room!