Top Banner
1 Data warehousing Data warehousing & & Data mining Data mining Presented by Siva Krishna
31

1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

Dec 31, 2015

Download

Documents

Samantha Daniel
Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

11

Data warehousing Data warehousing & &

Data miningData mining

Presented by

Siva Krishna

Nilesh Kumar

Page 2: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

22

Evolution of Data Warehousing

60’s: Batch reports hard to find and analyze information inflexible and expensive, reprogram every new request

70’s: Terminal-based DSS and EIS (executive information systems still inflexible, not integrated with desktop tool

80’s: Desktop data access and analysis tools query tools, spreadsheets, GUI’s easier to use, but only access operational databases

90’s: Data warehousing with integrated OLAP engines and tools

Page 3: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

33

What is a Data Warehouse?What is a Data Warehouse?

A single, complete and A single, complete and consistent store of data consistent store of data obtained from a variety of obtained from a variety of different sources made different sources made available to end users in a available to end users in a what they can understand and what they can understand and use in a business context.use in a business context.

Page 4: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

44

Data Warehouse ArchitectureData Warehouse Architecture

Data Warehouse Engine

Optimized Loader

ExtractionCleansing

AnalyzeQuery

Metadata Repository

RelationalDatabases

LegacyData

Purchased Data

ERPSystems

Page 5: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

55

Components of the WarehouseComponents of the Warehouse

Data Extraction and LoadingData Extraction and Loading The Warehouse The Warehouse Analyze and Query -- OLAP ToolsAnalyze and Query -- OLAP Tools Metadata Metadata Data Mining toolsData Mining tools

Page 6: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

66

Loading the WarehouseLoading the Warehouse

Cleaning the data before Cleaning the data before it is loadedit is loaded

Page 7: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

77

Data Integrity ProblemsData Integrity Problems Same person, different spellingsSame person, different spellings

Agarwal, Agrawal, Aggarwal etc...Agarwal, Agrawal, Aggarwal etc... Multiple ways to denote company nameMultiple ways to denote company name

Persistent Systems, PSPL, Persistent Pvt. LTD.Persistent Systems, PSPL, Persistent Pvt. LTD. Use of different namesUse of different names

mumbai, bombaymumbai, bombay Different account numbers generated by different Different account numbers generated by different

applications for the same customerapplications for the same customer

Page 8: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

88

Data Transformation TermsData Transformation Terms

ExtractingExtracting ConditioningConditioning HouseholdingHouseholding EnrichmentEnrichment ScoringScoring

Page 9: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

99

ExtractingExtracting Capture of data from operational source in “as is” Capture of data from operational source in “as is”

statusstatus Sources for data generally in legacy mainframes in Sources for data generally in legacy mainframes in

VSAM, IMS, IDMS, DB2; more data today in VSAM, IMS, IDMS, DB2; more data today in relational databases on Unixrelational databases on Unix

ConditioningConditioning The conversion of data types from the source to the The conversion of data types from the source to the

target data store (warehouse) -- always a relational target data store (warehouse) -- always a relational databasedatabase

Page 10: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

1010

HouseholdingHouseholding Identifying all members of a household (living at Identifying all members of a household (living at

the same address)the same address) Ensures only one mail is sent to a householdEnsures only one mail is sent to a household Can result in substantial savings: 1 lakh catalogues Can result in substantial savings: 1 lakh catalogues

at Rs. 50 each costs Rs. 50 lakhs. A 2% savings at Rs. 50 each costs Rs. 50 lakhs. A 2% savings would save Rs. 1 lakh.would save Rs. 1 lakh.

EnrichmentEnrichment Bring data from external sources to Bring data from external sources to

augment/enrich operational data. Data augment/enrich operational data. Data sources include Dunn and Bradstreet, A. sources include Dunn and Bradstreet, A. C. Nielsen, CMIE, IMRA etc...C. Nielsen, CMIE, IMRA etc...

Page 11: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

1111

Scoring computation of a probability of an event. e.g...., chance that a customer will defect to AT&T from MCI, chance that a customer is likely to buy a new product

Page 12: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

1212

Data Warehouse StructureData Warehouse Structure

Subject Orientation -- customer, product, Subject Orientation -- customer, product, policy, account etc... A subject may be policy, account etc... A subject may be implemented as a set of related tables. E.g., implemented as a set of related tables. E.g., customer may be five tablescustomer may be five tables

Page 13: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

1313

Data Warehouse StructureData Warehouse Structure base customer (1985-87)base customer (1985-87)

custid, from date, to date, name, phone, dobcustid, from date, to date, name, phone, dob base customer (1988-90)base customer (1988-90)

custid, from date, to date, name, credit rating, employercustid, from date, to date, name, credit rating, employer customer activity (1986-89) -- monthly summarycustomer activity (1986-89) -- monthly summary customer activity detail (1987-89)customer activity detail (1987-89)

custid, activity date, amount, clerk id, order nocustid, activity date, amount, clerk id, order no customer activity detail (1990-91)customer activity detail (1990-91)

custid, activity date, amount, line item no, order nocustid, activity date, amount, line item no, order no

Page 14: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

1414

Schema DesignSchema Design Database organizationDatabase organization

must look like businessmust look like business must be recognizable by business usermust be recognizable by business user approachable by business userapproachable by business user Must be Must be simplesimple

Schema TypesSchema Types Star SchemaStar Schema Fact Constellation SchemaFact Constellation Schema Snowflake schemaSnowflake schema

Page 15: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

1515

Dimension TablesDimension Tables

Dimension tablesDimension tables Define business in terms already familiar to usersDefine business in terms already familiar to users Wide rows with lots of descriptive textWide rows with lots of descriptive text Small tables (about a million rows) Small tables (about a million rows) Joined to fact table by a foreign keyJoined to fact table by a foreign key heavily indexedheavily indexed typical dimensionstypical dimensions

time periods, geographic region (markets, cities), time periods, geographic region (markets, cities), products, customers, salesperson, etc.products, customers, salesperson, etc.

Page 16: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

1616

Fact TableFact Table Central tableCentral table

mostly raw numeric itemsmostly raw numeric items narrow rows, a few columns at mostnarrow rows, a few columns at most large number of rows (millions to a billion)large number of rows (millions to a billion) Access via dimensionsAccess via dimensions

Page 17: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

1717

Star SchemaStar Schema A single fact table and for each dimension A single fact table and for each dimension

one dimension tableone dimension table Does not capture hierarchies directlyDoes not capture hierarchies directly

T ime

prod

cust

city

fact

date, custno, prodno, cityname, ...

Page 18: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

1818

Snowflake schemaSnowflake schema Represent dimensional hierarchy directly by Represent dimensional hierarchy directly by

normalizing tables. normalizing tables. Easy to maintain and saves storageEasy to maintain and saves storage

T ime

prod

cust

city

fact

date, custno, prodno, cityname, ...

region

Page 19: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

1919

Fact ConstellationFact Constellation Fact ConstellationFact Constellation

Multiple fact tables that share many dimension Multiple fact tables that share many dimension tablestables

Booking and Checkout may share many Booking and Checkout may share many dimension tables in the hotel industrydimension tables in the hotel industry

Hotels

Travel Agents

Promotion

Room Type

Customer

Booking

Checkout

Page 20: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

2020

On-Line Analytical On-Line Analytical Processing (OLAP)Processing (OLAP)

Making Decision Making Decision Support PossibleSupport Possible

Page 21: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

2121

What Is OLAP?What Is OLAP? Online Analytical Processing - coined by Online Analytical Processing - coined by

EF Codd in 1994 paper contracted by EF Codd in 1994 paper contracted by Arbor Software*Arbor Software*

Generally synonymous with earlier terms such as Generally synonymous with earlier terms such as Decisions Support, Business Intelligence, Executive Decisions Support, Business Intelligence, Executive Information SystemInformation System

OLAP = Multidimensional DatabaseOLAP = Multidimensional Database MOLAP: Multidimensional OLAP (Arbor Essbase, MOLAP: Multidimensional OLAP (Arbor Essbase,

Oracle Express)Oracle Express) ROLAP: Relational OLAP (Informix MetaCube, ROLAP: Relational OLAP (Informix MetaCube,

Microstrategy DSS Agent)Microstrategy DSS Agent)

Page 22: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

2222

Data Warehouse for Decision Data Warehouse for Decision Support & OLAPSupport & OLAP

Putting Information technology to help the Putting Information technology to help the knowledge worker make faster and better knowledge worker make faster and better decisionsdecisions Which of my customers are most likely to go to Which of my customers are most likely to go to

the competition?the competition? What product promotions have the biggest impact What product promotions have the biggest impact

on revenue?on revenue? How did the share price of software companies How did the share price of software companies

correlate with profits over last 10 years?correlate with profits over last 10 years?

Page 23: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

2323

Decision SupportDecision Support

Used to manage and control businessUsed to manage and control business Data is historical or point-in-timeData is historical or point-in-time Optimized for inquiry rather than updateOptimized for inquiry rather than update Use of the system is loosely defined and Use of the system is loosely defined and

can be ad-hoccan be ad-hoc Used by managers and end-users to Used by managers and end-users to

understand the business and make understand the business and make judgementsjudgements

Page 24: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

2424

Data MiningData Mining

The non trivial extraction of implicit, The non trivial extraction of implicit, previously unknown, and potentially useful previously unknown, and potentially useful information from data information from data

In simple wordsIn simple words Searching for new knowledgeSearching for new knowledge

Page 25: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

2525

Data Mining GoalsData Mining Goals

PredictionPrediction IdentificationIdentification ClassificationClassification OptimizationOptimization

Page 26: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

2626

Data mining techniquesData mining techniques clustering clustering data summarization data summarization learning classification rules learning classification rules finding dependency net works finding dependency net works analyzing changesanalyzing changes detecting anomalies detecting anomalies

Page 27: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

2727

Data Mining in UseData Mining in Use The US Government uses Data Mining to track The US Government uses Data Mining to track

fraudfraud A Supermarket becomes an information brokerA Supermarket becomes an information broker Basketball teams use it to track game strategyBasketball teams use it to track game strategy Cross SellingCross Selling Warranty Claims RoutingWarranty Claims Routing Holding on to Good CustomersHolding on to Good Customers Weeding out Bad CustomersWeeding out Bad Customers

Page 28: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

2828

Data Warehouse PitfallsData Warehouse Pitfalls You are going to spend much time extracting, cleaning, You are going to spend much time extracting, cleaning,

and loading dataand loading data Despite best efforts at project management, data Despite best efforts at project management, data

warehousing project scope will increasewarehousing project scope will increase You are going to find problems with systems feeding You are going to find problems with systems feeding

the data warehousethe data warehouse You will find the need to store data not being captured You will find the need to store data not being captured

by any existing systemby any existing system You will need to validate data not being validated by You will need to validate data not being validated by

transaction processing systemstransaction processing systems

Page 29: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

2929

Data Warehouse PitfallsData Warehouse Pitfalls Some transaction processing systems feeding the Some transaction processing systems feeding the

warehousing system will not contain detailwarehousing system will not contain detail Many warehouse end users will be trained and never or Many warehouse end users will be trained and never or

seldom apply their trainingseldom apply their training After end users receive query and report tools, requests After end users receive query and report tools, requests

for IS written reports may increasefor IS written reports may increase Your warehouse users will develop conflicting business Your warehouse users will develop conflicting business

rulesrules Large scale data warehousing can become an exercise in Large scale data warehousing can become an exercise in

data homogenizingdata homogenizing

Page 30: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

3030

ApplicationsApplications

Medicine - drug side effects, hospital cost analysis, genetic sequence analysis, prediction etc.

Finance - stock market prediction, credit assessment, fraud detection etc.Marketing/sales - product analysis, buying patterns, sales prediction, target mailing, identifying `unusual behavior' etc.

Knowledge Acquisition

Scientific Discovery - superconductivity research, etc.

Engineering - automotive diagnostic expert systems, fault detection etc.

Page 31: 1 Data warehousing & Data mining Presented by Siva Krishna Nilesh Kumar.

3131

Thanking you

Nilesh and Siva Krishna…