DATA WAREHOUSING AND DATA MINING M.Mageshwari,Lecturer M.S.P.V.L Polytechnic College
Page 1: Datawarehousing

DATA WAREHOUSING AND DATA MINING

M.Mageshwari,Lecturer

M.S.P.V.L Polytechnic College

Page 2: Datawarehousing

Course Overview

The course: what and how

0. Introduction
I. Data Warehousing
II. Decision Support and OLAP
III. Data Mining
IV. Looking Ahead

Demos and Labs

Page 3: Datawarehousing

3

0. Introduction

Data Warehousing, OLAP and data mining: what and why (now)?

Relation to OLTP

A case study

demos, labs

Page 4: Datawarehousing

A producer wants to know...

Which are our lowest/highest margin customers?

Who are my customers and what products are they buying?

Which customers are most likely to go to the competition?

What impact will new products/services have on revenue and margins?

What product promotions have the biggest impact on revenue?

What is the most effective distribution channel?

Page 5: Datawarehousing

Data, Data everywhere, yet ...

I can't find the data I need
  data is scattered over the network
  many versions, subtle differences

I can't get the data I need
  need an expert to get the data

I can't understand the data I found
  available data poorly documented

I can't use the data I found
  results are unexpected
  data needs to be transformed from one form to another

Page 6: Datawarehousing

6

What is a Data Warehouse?

A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.

Page 7: Datawarehousing

7

What are the users saying...

Data should be integrated across the enterprise

Summary data has a real value to the organization

Historical data holds the key to understanding data over time

What-if capabilities are required

Page 8: Datawarehousing

8

What is Data Warehousing?

A process of transforming data into information and making it available to users in a timely enough manner to make a difference

Data -> Information

Page 9: Datawarehousing

9

Evolution

60's: Batch reports
  hard to find and analyze information
  inflexible and expensive, reprogram every new request

70's: Terminal-based DSS (Decision Support Systems) and EIS (Executive Information Systems)
  still inflexible, not integrated with desktop tools

Page 10: Datawarehousing

10

Data Warehouse Structure

base customer (1985-87): custid, from date, to date, name, phone, dob

base customer (1988-90): custid, from date, to date, name, credit rating, employer

customer activity (1986-89): monthly summary

customer activity detail (1987-89): custid, activity date, amount, clerk id, order no

customer activity detail (1990-91): custid, activity date, amount, line item no, order no

Time is part of the key of each table
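The idea that time is part of the key can be sketched with Python's standard sqlite3 module. This is only an illustration: the table name base_customer, the reduced column list, the sample rows and the helper phone_as_of are assumptions, loosely modelled on the base customer table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical warehouse table: the key is (custid, from_date), not custid
# alone, so several dated versions of the same customer can coexist.
conn.execute("""
    CREATE TABLE base_customer (
        custid    INTEGER,
        from_date TEXT,
        to_date   TEXT,
        name      TEXT,
        phone     TEXT,
        PRIMARY KEY (custid, from_date)
    )
""")
conn.executemany(
    "INSERT INTO base_customer VALUES (?, ?, ?, ?, ?)",
    [
        (1, "1985-01-01", "1986-06-30", "A. Kumar", "555-0100"),
        (1, "1986-07-01", "1987-12-31", "A. Kumar", "555-0199"),
    ],
)

def phone_as_of(custid, day):
    """Return the phone number that was valid for custid on the given ISO date."""
    row = conn.execute(
        "SELECT phone FROM base_customer "
        "WHERE custid = ? AND from_date <= ? AND to_date >= ?",
        (custid, day, day),
    ).fetchone()
    return row[0] if row else None
```

Because both versions of customer 1 are kept, a query can ask what was true on any past date instead of seeing only the latest value.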

Page 11: Datawarehousing

Definition of DSS

A decision support system (DSS) is defined as a system that helps decision makers at various levels to make decisions.

This system uses data, analytical models and user-friendly software for decision making.

11

Page 12: Datawarehousing

Definition of EIS

An executive information system (EIS) is defined as a system that helps high-level executives to make policy decisions.

This system uses higher-level data, analytical models and user-friendly software for decision making.

12

Page 13: Datawarehousing

Evolution

80's: Desktop data access and analysis tools
  query tools, spreadsheets, GUIs
  easier to use, but only access operational databases

90's: Data warehousing with integrated OLAP (online analytical processing) engines and tools

13

Page 14: Datawarehousing

14

Data Warehousing -- It is a process

A technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were not previously possible

A decision support database maintained separately from the organization’s operational database

Page 15: Datawarehousing

15

Characteristics of Data Warehouse

A data warehouse is a subject-oriented

integrated

time-varying

non-volatile

collection of data that is used primarily in organizational decision making.

Page 16: Datawarehousing

Subject-oriented

A data warehouse is organized around the major subjects of the organization, such as customer, supplier, product and sales.

A data warehouse provides a simple and concise view of a particular subject by excluding data that are not useful to the decision support process.

16

Page 17: Datawarehousing

Integrated:

A data warehouse is constructed by integrating multiple sources of data such as relational databases, flat files and on-line transaction records.

Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attributes, etc.

17

Page 18: Datawarehousing

Time Variant

Data warehouse maintains records of both historical and current data.

So it can provide information in a historical perspective

18

Page 19: Datawarehousing

Non Volatile

Once data is loaded into the data warehouse, it is not modified in place by day-to-day operations; the stored data is only added to and queried.

19

Page 20: Datawarehousing

20

Explorers, Farmers and Tourists

Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data

Farmers: Harvest information from known access paths

Tourists: Browse information harvested by farmers

Page 21: Datawarehousing

21

Application-Orientation vs. Subject-Orientation

Application-Orientation (Operational Database): Loans, Credit Card, Trust, Savings

Subject-Orientation (Data Warehouse): Customer, Vendor, Product, Activity

Page 22: Datawarehousing

Functioning of Data warehousing

[Diagram: Data Source -> Cleaning, Transformation -> Data Warehouse; New Update]

Page 23: Datawarehousing

Collection of data

Data warehousing collects data from various data sources such as relational databases, flat files and on-line records.

The collected data are stored in a database inside the warehouse.

The type of data collection used depends on the architecture of the warehouse.

23

Page 24: Datawarehousing

Integration

Each data source uses a different schema.

The data warehouse gets data from different sources with different schemas and converts them into a common integrated schema.

24

Page 25: Datawarehousing

25

Star Schema

A single fact table and for each dimension one dimension table

Does not capture hierarchies directly

Dimension tables: time, prod, cust, city

Fact table: date, custno, prodno, cityname, ...
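A minimal sketch of a star schema, using Python's standard sqlite3 module. The column names beyond custno, prodno and cityname, the sample rows, and the amount measure are assumptions made for illustration; the time dimension is left out to keep the sketch short, and the date column is named sale_date here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prod (prodno INTEGER PRIMARY KEY, prodname TEXT);
    CREATE TABLE cust (custno INTEGER PRIMARY KEY, custname TEXT);
    CREATE TABLE city (cityname TEXT PRIMARY KEY, region TEXT);
    CREATE TABLE fact (sale_date TEXT, custno INTEGER, prodno INTEGER,
                       cityname TEXT, amount REAL);
    INSERT INTO prod VALUES (1, 'Juice'), (2, 'Cola');
    INSERT INTO cust VALUES (10, 'Asha'), (11, 'Ravi');
    INSERT INTO city VALUES ('Chennai', 'South'), ('Delhi', 'North');
    INSERT INTO fact VALUES
        ('2024-01-05', 10, 1, 'Chennai', 20.0),
        ('2024-01-06', 11, 1, 'Delhi',   30.0),
        ('2024-01-06', 10, 2, 'Chennai', 15.0);
""")

# One join per dimension: the fact table sits at the centre of the star.
sales_by_product = conn.execute("""
    SELECT p.prodname, SUM(f.amount)
    FROM fact f JOIN prod p ON p.prodno = f.prodno
    GROUP BY p.prodname
    ORDER BY p.prodname
""").fetchall()
```

Note that region is simply a column of the city table here, which is what "does not capture hierarchies directly" means in practice.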

Page 26: Datawarehousing

26

Snowflake schema

Represent dimensional hierarchy directly by normalizing tables.

Easy to maintain and saves storage

Dimension tables: time, prod, cust, city (city refers to a separate region table)

Fact table: date, custno, prodno, cityname, ...
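A minimal sketch of the snowflake variant, again with Python's standard sqlite3 module. The regionid key, the sample rows and the amount measure are assumptions; only the city dimension is normalized here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Snowflaked dimension: city refers to a separate region table.
    CREATE TABLE region (regionid INTEGER PRIMARY KEY, regionname TEXT);
    CREATE TABLE city (cityname TEXT PRIMARY KEY,
                       regionid INTEGER REFERENCES region(regionid));
    CREATE TABLE fact (sale_date TEXT, custno INTEGER, prodno INTEGER,
                       cityname TEXT REFERENCES city(cityname), amount REAL);
    INSERT INTO region VALUES (1, 'South'), (2, 'North');
    INSERT INTO city VALUES ('Chennai', 1), ('Madurai', 1), ('Delhi', 2);
    INSERT INTO fact VALUES
        ('2024-01-05', 10, 1, 'Chennai', 20.0),
        ('2024-01-06', 11, 1, 'Madurai', 5.0),
        ('2024-01-06', 10, 2, 'Delhi',   15.0);
""")

# Rolling up to region now needs two joins (fact -> city -> region).
sales_by_region = conn.execute("""
    SELECT r.regionname, SUM(f.amount)
    FROM fact f
    JOIN city c   ON c.cityname = f.cityname
    JOIN region r ON r.regionid = c.regionid
    GROUP BY r.regionname
    ORDER BY r.regionname
""").fetchall()
```

Each region name is stored once, which is where the storage saving comes from; the price is an extra join per hierarchy level.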

Page 27: Datawarehousing

Data transformation and cleaning

The task of correcting and preparing the data is called data cleaning.

When a data source delivers data into the warehouse database, the data should first be corrected.

27

Page 28: Datawarehousing

Update of data

Updates to tables at the data sources must be sent to the data warehouse.

If the tables in the data warehouse are the same as those at the sources, updating is easy.

28

Page 29: Datawarehousing

Summarizing data

The raw data generated by a transaction may be too large to store online.

Therefore, we can use summary of transactions for easy querying.
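Summarizing can be sketched in a few lines of Python. The detail rows and the function name monthly_summary are hypothetical; the point is only that many transaction rows collapse into one row per customer and month.

```python
from collections import defaultdict

# Hypothetical detail rows: (custid, activity_date, amount)
detail = [
    (1, "1987-01-03", 40.0),
    (1, "1987-01-17", 10.0),
    (1, "1987-02-02", 25.0),
    (2, "1987-01-09", 5.0),
]

def monthly_summary(rows):
    """Collapse detail rows into one total per (custid, 'YYYY-MM')."""
    totals = defaultdict(float)
    for custid, activity_date, amount in rows:
        totals[(custid, activity_date[:7])] += amount
    return dict(totals)

summary = monthly_summary(detail)
```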

29

Page 30: Datawarehousing

30

Data Warehouse for Decision Support & OLAP

Putting information technology to help the knowledge worker make faster and better decisions

Which of my customers are most likely to go to the competition?

What product promotions have the biggest impact on revenue?

How did the share price of software companies correlate with profits over the last 10 years?

Page 31: Datawarehousing

31

Decision Support

Used to manage and control business

Data is historical or point-in-time

Optimized for inquiry rather than update

Use of the system is loosely defined and can be ad-hoc

Used by managers and end-users to understand the business and make judgments

Page 32: Datawarehousing

OLAP(Online analytical processing)

A data warehouse stores data, but OLAP transforms the data warehouse data into specific, meaningful information.

OLAP therefore provides a user-friendly environment for interactive data analysis.

32

Page 33: Datawarehousing

OLAP

[Diagram: User -> Request -> Front End Tool -> SQL -> OLAP Server -> Data Warehouse; the Result set flows back to the front end tool and the Result to the user]

Page 34: Datawarehousing

OLAP operations on multidimensional data

Roll-up (group)

Drill down (less aggregation, more detail)

Slice and Dice (piece)

Pivot (rotate)
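These operations can be sketched on a tiny in-memory cube in plain Python. The cube contents and the function names (roll_up, slice_, dice, pivot) are hypothetical; real OLAP servers implement the same ideas with far more machinery.

```python
from collections import defaultdict

# Hypothetical cube cells: (product, city, month) -> sales
cube = {
    ("Juice", "NY", 1): 10, ("Juice", "LA", 1): 47,
    ("Cola",  "NY", 1): 30, ("Cola",  "NY", 2): 12,
}

def roll_up(cells, keep):
    """Aggregate away every dimension whose index is not in `keep`."""
    out = defaultdict(int)
    for key, value in cells.items():
        out[tuple(key[i] for i in keep)] += value
    return dict(out)

def slice_(cells, dim, value):
    """Fix one dimension to a single value."""
    return {k: v for k, v in cells.items() if k[dim] == value}

def dice(cells, allowed):
    """Keep cells whose coordinates fall in the allowed sets, per dimension."""
    return {k: v for k, v in cells.items()
            if all(k[d] in vals for d, vals in allowed.items())}

def pivot(cells, order):
    """Reorder the dimensions (rotate the axes)."""
    return {tuple(k[i] for i in order): v for k, v in cells.items()}

by_product = roll_up(cube, keep=(0,))
ny_only = slice_(cube, dim=1, value="NY")
```

Drill down is the reverse of roll-up: going from by_product back to the detailed cells in cube, which is only possible because the detail was kept.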

34

Page 35: Datawarehousing

TYPES OF OLAP

MOLAP(MULTIDIMENSIONAL OLAP)

ROLAP (RELATIONAL OLAP)

35

Page 36: Datawarehousing

Multi-dimensional Data

“Hey…I sold $100M worth of goods”

Dimensions: Product, Region, Time

Hierarchical summarization paths
  Product: Industry > Category > Product
  Region: Country > Region > City > Office
  Time: Year > Quarter > Month or Week > Day

[Figure: data cube with axes Product (Toothpaste, Juice, Cola, Milk, Cream, Soap), Region (W, S, N) and Month (1 to 7)]

Page 37: Datawarehousing

37

Data Warehouse Architecture

[Diagram: sources (Relational Databases, Legacy Data, Purchased Data, ERP Systems) -> Extraction, Cleansing -> Optimized Loader -> Data Warehouse Engine, with Metadata Repository -> Analyze, Query]

Page 38: Datawarehousing

Architecture of data warehousing

[Diagram components: External data, Data Acquisition, Data Manager, Warehouse data, Data Dictionary, Information Directory, Middleware, Design, Management, Data Access]

Page 39: Datawarehousing

Architecture of

39

Page 40: Datawarehousing

40

Design Component

The data warehouse designer designs the database of the data warehouse, and the warehouse administrator manages the data warehouse.

The designer and administrator use the design component to design and store data.

Page 41: Datawarehousing

Types of design

Bottom-up design
  Business value can be returned as quickly as the first data marts can be created.

Top-down design
  Atomic data, that is, data at the lowest level of detail, are stored in the data warehouse.

Hybrid design

41

Page 42: Datawarehousing

Hybrid design

Hybrid methodologies have evolved to take advantage of the fast turn-around time of bottom-up design and the enterprise-wide data consistency of top-down design.

42

Page 43: Datawarehousing

Data Manager Component

The database in the data warehouse uses the data manager component for managing and accessing the data stored in the data warehouse.

RDBMS

MDBMS

43

Page 44: Datawarehousing

Management Component

Administering data acquisition operations

Managing backup copies of the data

Recovering lost data

Providing security to the data stored in the data warehouse

Authorizing access to the data stored in the data warehouse

44

Page 45: Datawarehousing

Data Acquisition Component

This component acquires data from various sources by using the data acquisition applications

The data acquisition applications are based on rules that are defined by the data warehouse developers.

45

Page 46: Datawarehousing

The operations performed during data cleanup

Restructuring the records and fields of the database tables

Removing irrelevant and redundant data

Obtaining and adding missing data

Verifying the integrity and consistency of the data
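The cleanup operations listed above can be sketched as one small Python function. The record layout, the known_city lookup and the uniqueness rule are assumptions chosen to keep the illustration short.

```python
# Hypothetical raw records from a source system.
raw = [
    {"custid": 1, "name": " Asha ", "city": "Chennai", "fax": "n/a"},
    {"custid": 1, "name": "Asha",   "city": "Chennai", "fax": "n/a"},  # duplicate
    {"custid": 2, "name": "Ravi",   "city": None,      "fax": "n/a"},
]
known_city = {2: "Delhi"}  # assumed lookup used to fill in missing values

def clean(records):
    seen, out = set(), []
    for rec in records:
        # Restructure: keep only the relevant fields and normalise whitespace.
        row = {"custid": rec["custid"], "name": rec["name"].strip(),
               "city": rec["city"]}
        # Obtain and add missing data where a source for it exists.
        if row["city"] is None:
            row["city"] = known_city.get(row["custid"])
        # Remove redundant data: skip rows already seen.
        key = (row["custid"], row["name"], row["city"])
        if key in seen:
            continue
        seen.add(key)
        out.append(row)
    # Verify integrity: custid must be unique after cleaning.
    ids = [r["custid"] for r in out]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate custid after cleaning")
    return out

cleaned = clean(raw)
```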

46

Page 47: Datawarehousing

The operations performed on the data for enhancement are

Decoding and translating the values in fields

Summarizing data

Calculating the derived values

47

Page 48: Datawarehousing

Information directory Component

This component helps end users to know the details of the data stored in the data warehouse.

This is done with the help of data about the data, called metadata.

Technical data

Business data

48

Page 49: Datawarehousing

Middleware Component

This component connects to the local databases.

An analytical server is used to analyze multidimensional data.

Intelligent data warehousing middleware is used to control access to the warehouse database.

49

Page 50: Datawarehousing

Data mart

A data mart is a database that contains the data needed by a small group of users for their own department's needs.

Dependent data mart

Independent data mart

50

Page 51: Datawarehousing

Difference between data warehouse and data mart

Data warehouse: Suitable to support an entire corporate environment.
Data mart: Useful for small organizations with very few departments.

Data warehouse: If you listen to some vendors, you may be left thinking that building data warehouses is a waste of time.
Data mart: Data mart vendors who tell you this are looking out for their own best interests.

Data warehouse: Supports the entire information requirement of an organization.
Data mart: Supports the information requirement of a department in an organization.

Data warehouse: Has a large data model, wider implementation, a large amount of data and a larger number of users.
Data mart: Has a small data model, shorter implementation, less data and fewer users.

51

Page 52: Datawarehousing

Advantages of data mart

Since each department has its own data mart, the departments can summarize, sort, select, structure, etc. their own department's data. It will not be confused with the data of any other department.

The department can do whatever DSS processing it wants.

The processing cost and storage are less than those of the data warehouse.

The department can select software for its data mart that is powerful enough to fit its needs.

52

Page 53: Datawarehousing

Data warehousing life cycle

53

Design

Enhance prototype

Operate

deploy

Page 54: Datawarehousing

Data Modeling (Multi-dimensional Database)

“Hey…I sold $100M worth of goods”

Dimensions: Product, Region, Period

Hierarchical summarization paths
  Product: Industry > Category > Product
  Region: Country > Region > City > Office
  Period: Year > Quarter > Month or Week > Day

[Figure: data cube with axes Product (Toothpaste, Juice, Cola, Milk, Cream, Soap), Region (W, S, N) and Month (1 to 7)]

Page 55: Datawarehousing

Building of data warehouse

The builder must forecast the usage of the warehouse by the users.

The design should support accessing data with any meaningful values of the attributes.

To build a good data warehouse, the data acquisition process must follow the steps given below:

Extract the data from multiple heterogeneous sources.

Format the data for consistency within the warehouse.

Clean the data to ensure validity.

Convert the data from the relational, object-oriented or hierarchical model to a multidimensional model.

Load the data into the warehouse.

Good monitoring tools are necessary to recover from incorrect loads.

Page 56: Datawarehousing

Data warehouse and views

A data warehouse is permanent storage of data in multidimensional tables.

Views are created temporarily, when needed, from data warehouse data.

They are used for decision support systems.

56

Page 57: Datawarehousing

Difference between data warehouse and views

Data warehouse: Permanent storage of data.
Views: Created from warehouse data when needed, and not permanent.

Data warehouse: Multidimensional.
Views: Relational.

Data warehouse: Can be indexed to maximize performance.
Views: Cannot be indexed.

Data warehouse: Provides specific support for a functionality.
Views: Cannot give specific support for a functionality.

Data warehouse: Provides a large amount of data.
Views: Created by extracting a minimal amount of data from the data warehouse.

57

Page 58: Datawarehousing

Data warehouse Future

New techniques must be introduced in data cleaning, indexing and partitioning.

The manual operations involved in data acquisition, management, data quality and performance maximization must be automated.

Proper business rules must be developed and incorporated in the warehouse creation and maintenance process.

58

Page 59: Datawarehousing

Data Mining

Data mining is sorting through data to identify patterns and establish relationships.

59

Page 60: Datawarehousing

60

Data Mining (cont.)

Page 61: Datawarehousing

61

Data Mining works with Warehouse Data

Data Warehousing provides the Enterprise with a memory

Data Mining provides the Enterprise with intelligence

Page 62: Datawarehousing

62

“The key in business is to know something that nobody else knows.”

— Aristotle Onassis

“To understand is to perceive patterns.” — Sir Isaiah Berlin


Data Mining Motivation

Page 63: Datawarehousing

63

Application Areas

Industry: Application

Finance: Credit Card Analysis

Insurance: Claims, Fraud Analysis

Telecommunication: Call record analysis

Consumer goods: Promotion analysis

Data Service providers: Value added data

Utilities: Power usage analysis

Page 64: Datawarehousing

64

Data Mining in Use

The US Government uses Data Mining to track fraud

A Supermarket becomes an information broker

Basketball teams use it to track game strategy

Cross Selling

Warranty Claims Routing

Holding on to Good Customers

Weeding out Bad Customers

Page 65: Datawarehousing

65

What is data mining technology

The process of extracting or finding hidden knowledge from a large database is called data mining.

Ex: Age 21 (data) -> we can understand that the person is a major, i.e. an adult (information)

Page 66: Datawarehousing

Data Mining Technology

[Diagram: Databases, Flat Files -> Cleaning and Integration -> Data Warehouse -> Selection and Transformation -> Data Mining -> Patterns -> Knowledge]

Page 67: Datawarehousing

The various steps

Data cleaning: noise and inconsistent data are removed

Data integration: data from multiple sources are combined

Data selection: relevant data are retrieved from the database for analysis

67

Page 68: Datawarehousing

Data transformation: the selected data are prepared for mining, for example by performing aggregation operations

Data mining: intelligent methods are applied to extract data patterns

Pattern evaluation: the needed patterns are identified

Knowledge presentation: the mined knowledge is presented to the user

68

Page 69: Datawarehousing

Loading the Warehouse

Cleaning the data before it is loaded

Page 70: Datawarehousing

70

Data Integration Across Sources

Sources: Trust, Credit card, Savings, Loans

Same data, different name

Different data, same name

Data found here, nowhere else

Different keys, same data

Page 71: Datawarehousing

71

Data Transformation Example

encoding
  appl A - m, f
  appl B - 1, 0
  appl C - x, y
  appl D - male, female

unit
  appl A - pipeline - cm
  appl B - pipeline - in
  appl C - pipeline - feet
  appl D - pipeline - yds

field
  appl A - balance
  appl B - bal
  appl C - currbal
  appl D - balcurr

All four are converted to one form in the Data Warehouse
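The three kinds of transformation in this example can be sketched as lookup tables in Python. The target conventions (a field called balance, lengths in cm, gender as 'm'/'f'), the reading of 1/0 and x/y as male/female, and the function name to_warehouse are all assumptions made for illustration.

```python
# Per-application rules: each source names the balance field differently,
# measures pipeline length in a different unit, and encodes gender differently.
FIELD = {"A": "balance", "B": "bal", "C": "currbal", "D": "balcurr"}
TO_CM = {"A": 1.0, "B": 2.54, "C": 30.48, "D": 91.44}  # cm, in, feet, yds
GENDER = {
    "A": {"m": "m", "f": "f"},
    "B": {1: "m", 0: "f"},          # assumed meaning of 1 and 0
    "C": {"x": "m", "y": "f"},      # assumed meaning of x and y
    "D": {"male": "m", "female": "f"},
}

def to_warehouse(app, record):
    """Convert one source record into the common warehouse format."""
    return {
        "balance": record[FIELD[app]],
        "pipeline_cm": record["pipeline"] * TO_CM[app],
        "gender": GENDER[app][record["gender"]],
    }

row = to_warehouse("B", {"bal": 250.0, "pipeline": 10, "gender": 1})
```

Keeping the rules in tables rather than in code branches means a new source application only needs three new table entries.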

Page 72: Datawarehousing

Structuring/Modeling Issues

Page 73: Datawarehousing

Data Warehouse vs. Data Marts

Page 74: Datawarehousing

74

From the Data Warehouse to Data Marts

Data Warehouse: Organizationally Structured

Departmentally Structured

Individually Structured

Less / More: History, Normalized, Detailed

Data / Information

Page 75: Datawarehousing

75

Data Warehouse and Data Marts

Data Mart: OLAP, lightly summarized, departmentally structured

Data Warehouse Data: organizationally structured, atomic, detailed

Page 76: Datawarehousing

76

Characteristics of the Departmental Data Mart

OLAP

Small

Flexible

Customized by Department

Source is departmentally structured data warehouse

Page 77: Datawarehousing

77

Techniques for Creating Departmental Data Mart

OLAP

Subset

Summarized

Superset

Indexed

Arrayed

Sales, Mktg., Finance

Page 78: Datawarehousing

78

Data Mart Centric

Data Marts

Data Sources

Data Warehouse

Page 79: Datawarehousing

79

True Warehouse

Data Marts

Data Sources

Data Warehouse

Page 80: Datawarehousing

II. On-Line Analytical Processing (OLAP)

Making Decision Support Possible

Page 81: Datawarehousing

81

What Is OLAP?

Online Analytical Processing - coined by E. F. Codd in a 1994 paper contracted by Arbor Software

Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System

OLAP = Multidimensional Database

MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express)

ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent)

Page 82: Datawarehousing

82

The OLAP Market

Rapid growth in the enterprise market
  1995: $700 Million
  1997: $2.1 Billion

Significant consolidation activity among major DBMS vendors
  10/94: Sybase acquires ExpressWay
  7/95: Oracle acquires Express
  11/95: Informix acquires Metacube
  1/97: Arbor partners up with IBM
  10/96: Microsoft acquires Panorama

Result: OLAP shifted from small vertical niche to mainstream DBMS category

Page 83: Datawarehousing

83

Strengths of OLAP

It is a powerful visualization paradigm

It provides fast, interactive response times

It is good for analyzing time series

It can be useful to find some clusters and outliers

Many vendors offer OLAP tools

Page 84: Datawarehousing

84

OLAP Is FASMI

Fast

Analysis

Shared

Multidimensional

Information

Page 85: Datawarehousing

85

Data Cube Lattice

Cube lattice
  ABC
  AB AC BC
  A B C
  none

Can materialize some groupbys, compute others on demand

Question: which groupbys to materialize?

Question: what indices to create?

Question: how to organize data (chunks, etc.)?
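The lattice for three dimensions can be generated with Python's itertools. The function names cube_lattice and answerable_from are hypothetical, and the second function assumes a measure such as SUM that can be re-aggregated from a finer group-by.

```python
from itertools import combinations

def cube_lattice(dims):
    """List every group-by of the cube, from the full set down to 'none'."""
    levels = []
    for size in range(len(dims), -1, -1):
        levels.append(["".join(c) for c in combinations(dims, size)])
    return levels

def answerable_from(query, materialized):
    """Materialized group-bys that contain every dimension of the query."""
    return [m for m in materialized if set(query) <= set(m)]

lattice = cube_lattice("ABC")
candidates = answerable_from("A", ["AB", "BC"])
```

Here the empty string stands for the "none" node (the grand total). With only AB and BC materialized, a query grouped by A can be computed from AB but not from BC, which is the trade-off behind the first question on the slide.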

Page 86: Datawarehousing

86

Visualizing Neighbors is simpler

[Figure: the same sales data shown as a Store (1 to 8) by Month (Apr to Mar) grid, and as a flat table with columns Month, Store, Sales]

Page 87: Datawarehousing

87

A Visual Operation: Pivot (Rotate)

[Figure: data cube with axes Product (Juice, Cola, Milk, Cream), Region (NY, LA, SF) and Date (3/1, 3/2, 3/3, 3/4), with sample cell values 10, 47, 30, 12, rotated to show Month, Region and Product on different axes]

Page 88: Datawarehousing

88

“Slicing and Dicing”

[Figure: data cube with axes Product (Household, Telecomm, Video, Audio), Sales Channel (Retail, Direct, Special) and Regions (India, Far East, Europe), with the Telecomm slice highlighted]

Page 89: Datawarehousing

89

Roll-up and Drill Down

Hierarchy: Sales Channel, Region, Country, State, Location Address, Sales Representative

Roll Up: towards a higher level of aggregation

Drill-Down: towards low-level details

Page 90: Datawarehousing

90

Nature of OLAP Analysis

Aggregation -- (total sales, percent-to-total)

Comparison -- Budget vs. Expenses

Ranking -- Top 10, quartile analysis

Access to detailed and aggregate data

Complex criteria specification

Visualization
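The first three kinds of analysis can be sketched in plain Python. The sales figures and the function names percent_to_total, top_n and variance are hypothetical.

```python
# Hypothetical sales totals per product.
sales = {"Juice": 57.0, "Cola": 42.0, "Milk": 1.0}

def percent_to_total(values):
    """Aggregation: each member's share of the grand total, in percent."""
    total = sum(values.values())
    return {k: 100.0 * v / total for k, v in values.items()}

def top_n(values, n):
    """Ranking: the n largest members, best first."""
    return sorted(values.items(), key=lambda kv: kv[1], reverse=True)[:n]

def variance(budget, actual):
    """Comparison: actual minus budget for each member."""
    return {k: actual[k] - budget[k] for k in budget}

shares = percent_to_total(sales)
leaders = top_n(sales, 2)
```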

Page 91: Datawarehousing

91

Organizationally Structured Data

Different Departments look at the same detailed data in different ways. Without the detailed, organizationally structured data as a foundation, there is no reconcilability of data

marketing

manufacturing

sales

finance

Page 92: Datawarehousing

92

Multidimensional Spreadsheets

Analysts need spreadsheets that support
  pivot tables (cross-tabs)
  drill-down and roll-up
  slice and dice
  sort
  selections
  derived attributes

Popular in retail domain

Page 93: Datawarehousing

© Prentice Hall 93

OLAP Operations

Single Cell Multiple Cells Slice Dice

Roll Up

Drill Down

Page 94: Datawarehousing

94

Relational OLAP: 3 Tier DSS

Data Warehouse ROLAP Engine Decision Support Client

Database Layer Application Logic Layer Presentation Layer

Store atomic data in industry standard RDBMS.

Generate SQL execution plans in the ROLAP engine to obtain OLAP functionality.

Obtain multi-dimensional reports from the DSS Client.
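The role of the middle tier can be sketched as a function that turns a requested set of dimensions into SQL and runs it on a relational store. This is only an illustration using Python's standard sqlite3 module: the sales table, its columns and the functions rollup_sql and report are assumptions, not the design of any particular ROLAP product.

```python
import sqlite3

# Database layer: atomic data in a relational table (sample rows are made up).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (product TEXT, region TEXT, month INTEGER, amount REAL);
    INSERT INTO sales VALUES
        ('Juice', 'N', 1, 10.0), ('Juice', 'S', 1, 47.0),
        ('Cola',  'N', 1, 30.0), ('Cola',  'N', 2, 12.0);
""")

ALLOWED = {"product", "region", "month"}  # whitelist of dimension columns

def rollup_sql(dims):
    """Application logic layer: turn a list of dimensions into a SQL statement."""
    if not dims or not set(dims) <= ALLOWED:
        raise ValueError("unknown or empty dimension list")
    cols = ", ".join(dims)
    return (f"SELECT {cols}, SUM(amount) FROM sales "
            f"GROUP BY {cols} ORDER BY {cols}")

def report(dims):
    """Run the generated SQL and return the rows to the presentation layer."""
    return conn.execute(rollup_sql(dims)).fetchall()

by_region = report(["region"])
```

The whitelist matters because column names cannot be passed as bound parameters; checking them against a fixed set keeps the generated SQL safe.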

Page 95: Datawarehousing

95

MD-OLAP: 2 Tier DSS

MDDB Engine MDDB Engine Decision Support Client

Database Layer Application Logic Layer Presentation Layer

Store atomic data in a proprietary data structure (MDDB), pre-calculate as many outcomes as possible, obtain OLAP functionality via proprietary algorithms running against this data.

Obtain multi-dimensional reports from the DSS Client.

Page 96: Datawarehousing

MSPVL Polytechnic College, Pavoorchatram

96