March 21, 2022 ICS426: Introduction 1 DATA WAREHOUSING AND DATA MINING

Dec 22, 2015

Page 1: July 13, 2015ICS426: Introduction1 DATA WAREHOUSING AND DATA MINING.

April 19, 2023 ICS426: Introduction 1

DATA WAREHOUSING AND

DATA MINING


Course Overview

Introduction

Data Preprocessing

DW and OLAP

Data Mining


Motivation

Data flood

Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories

There is a tremendous increase in the amount of data recorded and stored on digital media

We are producing over two exabytes (2 x 10^18 bytes) of data per year

Storage capacity, for a fixed price, appears to be doubling approximately every 9 months

Data stored in the world’s databases doubles every 20 months

Other growth rate estimates are even higher
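As a rough illustration of what these doubling times imply: growth over t months with a doubling period of T months is a factor of 2^(t/T). The helper name `growth_factor` is invented for this sketch, and the 9- and 20-month periods are simply the estimates quoted on the slide, not measurements.

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Return the multiplicative growth after `months`, given a doubling period."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    # Compare the two quoted estimates over an (arbitrarily chosen) 5-year horizon.
    for label, period in [("storage per fixed price", 9), ("stored data", 20)]:
        print(f"{label}: x{growth_factor(60, period):.1f} over 5 years")
```

Under these assumed rates, affordable storage grows much faster than the stored data itself, which is one reason collecting data is rarely the bottleneck.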


Data, Data everywhere - yet ...

I can’t find the data I need: data is scattered over the network; many versions, subtle differences

I can’t get the data I need: need an expert to get the data

I can’t understand the data I found: available data poorly documented

I can’t use the data I found: results are unexpected; data needs to be transformed from one form to another


Motivation

Very little data will ever be looked at by a human.

We are drowning in data, but starving for knowledge!

“The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all.”

Knowledge Discovery is NEEDED to make sense and use of data.

Solution: Data warehousing and data mining

Data warehousing and On-Line Analytical Processing (OLAP)

Data mining: extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases


Knowledge Discovery (KDD) Process


KDD Process: Several Key Steps

Learning the application domain

relevant prior knowledge and goals of application

Creating a target data set: data selection

Data cleaning and preprocessing: (may take 60% of effort!)

Data reduction and transformation

Data mining

summarization, classification, regression, association, clustering

Pattern evaluation and knowledge presentation

Use of discovered knowledge
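The steps above can be sketched as a small chain of functions. This is only a toy illustration: the records, the field names, and the choice of min-max scaling and per-band averaging are all assumptions made for the example, not part of the course material.

```python
from statistics import mean

# Hypothetical raw records: (customer_id, age, monthly_spend); None marks a missing value.
RAW = [
    (1, 34, 120.0),
    (2, None, 80.0),
    (3, 51, None),
    (4, 29, 300.0),
    (5, 42, 95.0),
]


def select(records):
    """Data selection: keep only the fields relevant to the goal."""
    return [(age, spend) for _, age, spend in records]


def clean(rows):
    """Data cleaning: drop rows with missing values."""
    return [r for r in rows if None not in r]


def transform(rows):
    """Data transformation: rescale spend to the 0..1 range (min-max)."""
    spends = [s for _, s in rows]
    lo, hi = min(spends), max(spends)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return [(age, (s - lo) / span) for age, s in rows]


def mine(rows):
    """Data mining (summarization): average scaled spend per age band."""
    bands = {}
    for age, s in rows:
        bands.setdefault("under 40" if age < 40 else "40 and over", []).append(s)
    return {band: mean(vals) for band, vals in bands.items()}


def kdd(records):
    """Run selection, cleaning, transformation, and mining in order."""
    return mine(transform(clean(select(records))))


if __name__ == "__main__":
    print(kdd(RAW))
```

Pattern evaluation and use of the discovered knowledge are left to the analyst here; in practice those steps usually feed back into earlier ones.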


What is a Data Warehouse?

A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.

[Barry Devlin]


What are the users saying...

Data should be integrated across the enterprise

Summary data has a real value to the organization

Historical data holds the key to understanding data over time

What-if capabilities are required


What is Data Warehousing?

A process of transforming data into information and making it available to users in a timely enough manner to make a difference

[Forrester Research, April 1996]

Data -> Information


Evolution

60’s: Batch reports; hard to find and analyze information; inflexible and expensive, reprogram every new request

70’s: Terminal-based DSS and EIS (executive information systems); still inflexible, not integrated with desktop tools

80’s: Desktop data access and analysis tools; query tools, spreadsheets, GUIs; easier to use, but only access operational databases

90’s: Data warehousing with integrated OLAP engines and tools

2000’s: Stream data management and mining; data mining and its applications; Web technology (XML, data integration) and global information systems


Very Large Data Bases

Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes

Petabytes -- 10^15 bytes: Geographic Information Systems

Exabytes -- 10^18 bytes: National Medical Records

Zettabytes -- 10^21 bytes: Weather images

Yottabytes -- 10^24 bytes: Intelligence Agency Videos


Data Warehousing -- It is a process

Technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were not previously possible

A decision support database maintained separately from the organization’s operational database


Data Warehouse

A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.

-- Bill Inmon, Building the Data Warehouse 1996


Data Warehouse Architecture

[Figure components: data sources (relational databases, legacy data, purchased data, ERP systems); extraction and cleansing; optimized loader; data warehouse engine; metadata repository; analyze and query]


Data Warehouse for Decision Support & OLAP

Putting information technology to work to help the knowledge worker make faster and better decisions

Which of my customers are most likely to go to the competition?

What product promotions have the biggest impact on revenue?

How did the share price of software companies correlate with profits over the last 10 years?
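Questions like these are typically answered by aggregating facts along one or more dimensions. The following is a minimal sketch of such a roll-up in plain Python; the `SALES` rows, the dimension names, and the `roll_up` helper are all hypothetical, and summed revenue per promotion is only a crude stand-in for promotion "impact".

```python
from collections import defaultdict

# Hypothetical fact rows: (year, product, promotion, revenue)
SALES = [
    (2021, "laptop", "coupon", 1200.0),
    (2021, "laptop", "none", 900.0),
    (2021, "phone", "coupon", 700.0),
    (2022, "laptop", "bundle", 1500.0),
    (2022, "phone", "coupon", 800.0),
    (2022, "phone", "none", 400.0),
]

# Position of each dimension inside a fact row.
DIMENSIONS = {"year": 0, "product": 1, "promotion": 2}


def roll_up(facts, dims):
    """Sum revenue grouped by the named dimensions (similar to SQL GROUP BY)."""
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[DIMENSIONS[d]] for d in dims)
        totals[key] += row[3]
    return dict(totals)


if __name__ == "__main__":
    print(roll_up(SALES, ["promotion"]))        # one total per promotion
    print(roll_up(SALES, ["year", "product"]))  # finer-grained view
```

Passing fewer dimensions gives a coarser summary, and passing more gives a more detailed one; an OLAP engine precomputes and indexes such aggregates so that they can be explored interactively.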


Decision Support

Used to manage and control business

Data is historical or point-in-time

Optimized for inquiry rather than update

Use of the system is loosely defined and can be ad-hoc

Used by managers and end-users to understand the business and make judgements


Data Mining works with Warehouse Data

Data Warehousing provides the Enterprise with a memory

Data Mining provides the Enterprise with intelligence


Why Data Mining

Credit ratings/targeted marketing: Given a database of 100,000 names, which persons are the least likely to default on their credit cards? Identify likely responders to sales promotions

Fraud detection: Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?

Customer relationship management: Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?

Data Mining helps extract such information


Why DM: A producer wants to know...

Which are our lowest/highest margin customers?

Who are my customers and what products are they buying?

Which customers are most likely to go to the competition?

What impact will new products/services have on revenue and margins?

What product promotions have the biggest impact on revenue?

What is the most effective distribution channel?


What is Data Mining?

Data mining: a misnomer?

Alternative names

Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc

Many Definitions

Non-trivial extraction of implicit, previously unknown and potentially useful information from huge amounts of data

Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns


Data Mining: Confluence of Multiple Disciplines

20x20 grid: 2^400 ~ 10^120 patterns
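The size of this pattern space can be checked directly with Python's arbitrary-precision integers. The sketch assumes the count comes from a 20x20 grid in which each of the 400 cells is either on or off, which is an interpretation of the slide rather than something it states.

```python
import math

cells = 20 * 20                   # a 20x20 grid has 400 cells
patterns = 2 ** cells             # assumption: each cell is binary (on/off)
exponent = cells * math.log10(2)  # log10(2^400) = 400 * log10(2)

print(f"2^{cells} has {len(str(patterns))} decimal digits")
print(f"2^{cells} is about 10^{exponent:.1f}")
```

Exhaustively enumerating that many candidate patterns is out of the question, which is why mining methods have to search the space selectively.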


Some basic operations

Predictive: regression, classification, collaborative filtering

Descriptive: clustering / similarity matching, association rules and variants, deviation detection
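To make one of the descriptive operations concrete, the sketch below computes the two standard measures of an association rule X -> Y: support (the fraction of transactions containing both X and Y) and confidence (support of X and Y together divided by support of X). The basket data and function names are hypothetical.

```python
# Hypothetical market-basket transactions, one set of items per purchase.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) over the transactions."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # rule never applies; report zero rather than divide by zero
    return support(set(antecedent) | set(consequent), transactions) / base


if __name__ == "__main__":
    print(support({"diapers", "beer"}, TRANSACTIONS))
    print(confidence({"diapers"}, {"beer"}, TRANSACTIONS))
```

A full association-rule miner such as Apriori would search for all itemsets above a minimum support; this sketch only evaluates a rule that is already given.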


Applications …

Banking: loan/credit card approval

predict good customers based on old customers

Customer relationship management:

identify those who are likely to leave for a competitor.

Targeted marketing:

identify likely responders to promotions

Fraud detection: telecommunications, financial transactions

from an online stream of events, identify the fraudulent ones

Manufacturing and production:

automatically adjust knobs when process parameters change
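For the streaming fraud-detection case, one simple stand-in is deviation detection against a running baseline. The sketch below keeps an online mean and standard deviation (Welford's method) and flags values more than a chosen number of standard deviations away. The threshold, the warm-up length, and the sample amounts are assumptions; real fraud detection uses far richer features than a single amount.

```python
import math


class RunningStats:
    """Welford's online mean/variance, updated one value at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        """Sample standard deviation (0.0 until two values have been seen)."""
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def flag_outliers(amounts, threshold=3.0, warmup=5):
    """Yield (index, amount) for values far from the running mean."""
    stats = RunningStats()
    for i, x in enumerate(amounts):
        std = stats.std()
        if stats.n >= warmup and std > 0 and abs(x - stats.mean) > threshold * std:
            yield i, x
        else:
            stats.update(x)  # only unflagged values update the baseline


if __name__ == "__main__":
    stream = [20, 22, 19, 21, 20, 23, 500, 21]  # hypothetical transaction amounts
    print(list(flag_outliers(stream)))
```

Flagged values are kept out of the baseline so that a single extreme event does not mask later ones; whether that is the right choice depends on the application.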


… Applications

Medicine: disease outcome, effectiveness of treatments

analyze patient disease history: find relationships between diseases

Molecular/Pharmaceutical: identify new drugs

Scientific data analysis:

identify new galaxies by searching for sub clusters

Web site/store design and promotion:

find the affinity of visitors to pages and modify the layout accordingly


The course

[Figure: course roadmap showing DS, DP, DW, OLAP and DM (association, classification, clustering), labeled with unit numbers (2) through (7)]

DS = Data source

DW = Data warehouse

DM = Data mining

DP = Data processing


END