
2010.10.28- SLIDE 1

IS 257 Fall 2010

Data Warehouses, Decision Support and Data Mining

University of California, Berkeley
School of Information
IS 257: Database Management



Lecture Outline

Review
  Data Warehouses (based on lecture notes from Joachim Hammer, University of Florida, and Joe Hellerstein and Mike Stonebraker of UCB)
Applications for Data Warehouses
  Decision Support Systems (DSS)
  OLAP (ROLAP, MOLAP)
  Data Mining

Thanks again to lecture notes from Joachim Hammer of the University of Florida



Problem: Heterogeneous Information Sources

Heterogeneities are everywhere
  Different interfaces
  Different data representations
  Duplicate and inconsistent information

[Figure: information sources: Personal Databases, Digital Libraries, Scientific Databases, World Wide Web]

Slide credit: J. Hammer



Problem: Data Management in Large Enterprises

Vertical fragmentation of informational systems (vertical stove pipes)
Result of application (user)-driven development of operational systems

[Figure: vertical stove pipes for Sales, Administration, Finance, Manufacturing, ...; labels include Sales Planning, Stock Mngmt, Suppliers, Debt Mngmt, Num. Control, Inventory]




Goal: Unified Access to Data

Integration System

Collects and combines information

Provides integrated view, uniform user interface

Supports sharing

[Figure: sources feeding the integration system: World Wide Web, Digital Libraries, Scientific Databases, Personal Databases]




The Traditional Research Approach

Query-driven (lazy, on-demand)

[Figure: Clients, an Integration System with Metadata, and one Wrapper per Source]




The Warehousing Approach

Information integrated in advance
Stored in WH for direct querying and analysis

[Figure: Clients, the Data Warehouse, an Integration System with Metadata, and one Extractor/Monitor per Source]




What is a Data Warehouse?

A Data Warehouse is a
  subject-oriented,
  integrated,
  time-variant,
  non-volatile
collection of data used in support of management decision making processes.

-- Inmon & Hackathorn, 1994: viz. Hoffer, Chap 11



A Data Warehouse is...

Stored collection of diverse data
  A solution to data integration problem
  Single repository of information
Subject-oriented
  Organized by subject, not by application
  Used for analysis, data mining, etc.
Optimized differently from transaction-oriented db
User interface aimed at executive decision makers and analysts



Contd.

Large volume of data (GB, TB)
Non-volatile
  Historical
  Time attributes are important
Updates infrequent
May be append-only
Examples
  All transactions ever at WalMart
  Complete client histories at insurance firm
  Stockbroker financial information and portfolios




Data Warehousing Architecture



Ingest

[Figure: sources (File, External, DB), one Extractor/Monitor per source, an Integration System with Metadata, the Data Warehouse, and Clients]



Today

Applications for Data Warehouses
  Decision Support Systems (DSS)
  OLAP (ROLAP, MOLAP)
  Data Mining

Thanks again to slides and lecture notes from Joachim Hammer of the University of Florida, and also to Laura Squier of SPSS, Gregory Piatetsky-Shapiro of KDNuggets and to the CRISP web site

Source: Gregory Piatetsky-Shapiro



Trends leading to Data Flood

More data is generated:
  Bank, telecom, other business transactions ...
  Scientific data: astronomy, biology, etc.
  Web, text, and e-commerce
More data is captured:
  Storage technology faster and cheaper
  DBMS capable of handling bigger DB




Examples

Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session
  storage and analysis a big problem
Walmart reported to have 500 Terabyte DB
AT&T handles billions of calls per day
  data cannot be stored -- analysis is done on the fly




Growth Trends

Moore's law
  Computer speed doubles every 18 months
Storage law
  Total storage doubles every 9 months
Consequence
  Very little data will ever be looked at by a human
Knowledge Discovery is NEEDED to make sense and use of data.
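
The two doubling periods on this slide compound very differently. A rough sketch of the arithmetic, assuming idealized exponential growth; the function name and the 3-year horizon are illustrative choices, not from the lecture:

```python
def growth_factor(years: float, doubling_months: float) -> float:
    """Multiplicative growth after `years`, given a doubling period in months."""
    return 2 ** (years * 12 / doubling_months)

# Doubling periods are the ones quoted on the slide; the horizon is arbitrary.
YEARS = 3
speed = growth_factor(YEARS, 18)    # computer speed (Moore's law)
storage = growth_factor(YEARS, 9)   # total storage (storage law)

print(f"After {YEARS} years: speed x{speed:.0f}, storage x{storage:.0f}")
print(f"Storage outgrows speed by a factor of {storage / speed:.0f}")
```

Under these assumptions the gap itself doubles every 18 months, which is the sense in which less and less of the stored data can be examined directly.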




Knowledge Discovery in Data (KDD)

Knowledge Discovery in Data is the non-trivial process of identifying
  valid
  novel
  potentially useful
  and ultimately understandable
patterns in data.

from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy, (Chapter 1), AAAI/MIT Press 1996




Related Fields

[Figure: Data Mining and Knowledge Discovery, with related fields Statistics, Machine Learning, Databases, and Visualization]




Knowledge Discovery Process

[Figure: stages labeled Raw Data, Integration, Data Warehouse, Target Data, Transformed Data, Patterns and Rules, Interpretation & Evaluation, Knowledge, Understanding]




What is Decision Support?

Technology that will help managers and planners make decisions regarding the organization and its operations based on data in the Data Warehouse.
  What was the sales volume over the last two years for each product by state and city?
  What effects will a 5% price discount have on our future income for product X?
An increasingly common term is KDD
  Knowledge Discovery in Databases



Conventional Query Tools

Ad-hoc queries and reports using conventional database tools
  E.g. Access queries.
Typical database designs include fixed sets of reports and queries to support them
  The end-user is often not given the ability to do ad-hoc queries



OLAP

Online Analytical Processing
  Intended to provide multidimensional views of the data
  I.e., the Data Cube
  The PivotTables in MS Excel are examples of OLAP tools



Data Cube



Operations on Data Cubes

Slicing the cube
  Extracts a 2d table from the multidimensional data cube
  Example
Drill-Down
  Analyzing a given set of data at a finer level of detail
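
The two operations can be sketched on a toy cube. This is only an illustration: the rows, dimension names, and function names below are hypothetical, and a real OLAP tool would work against stored aggregates rather than a Python list.

```python
from collections import defaultdict

# Hypothetical fact rows: (product, state, city, year, units sold).
SALES = [
    ("widget", "CA", "Berkeley", 2009, 10),
    ("widget", "CA", "Oakland", 2009, 5),
    ("widget", "OR", "Portland", 2009, 7),
    ("gadget", "CA", "Berkeley", 2009, 3),
    ("widget", "CA", "Berkeley", 2010, 12),
]

def slice_cube(rows, year):
    """Slice: fix the year dimension, leaving a 2-d product x state table."""
    table = defaultdict(int)
    for product, state, _city, row_year, units in rows:
        if row_year == year:
            table[(product, state)] += units
    return dict(table)

def drill_down(rows, year, product, state):
    """Drill-down: break one cell of the slice into its city-level detail."""
    detail = defaultdict(int)
    for row_product, row_state, city, row_year, units in rows:
        if (row_product, row_state, row_year) == (product, state, year):
            detail[city] += units
    return dict(detail)

print(slice_cube(SALES, 2009))
print(drill_down(SALES, 2009, "widget", "CA"))
```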



Star Schema

Typical design for the derived layer of a Data Warehouse or Mart for Decision Support
  Particularly suited to ad-hoc queries
  Dimensional data separate from fact or event data
Fact tables contain factual or quantitative data about the business
Dimension tables hold data about the subjects of the business
Typically there is one Fact table with multiple dimension tables



Star Schema for multidimensional data

Fact Table: OrderNo, Salespersonid, Customerno, ProdNo, Datekey, Cityname, Quantity, TotalPrice

Dimension tables
  Order: OrderNo, OrderDate
  Salesperson: SalespersonID, SalespersonName, City, Quota
  City: CityName, State, Country
  Date: DateKey, Day, Month, Year
  Product: ProdNo, ProdName, Category, Description
  Customer: CustomerName, CustomerAddress, City
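
A decision-support question such as "sales volume for each product by state and city" becomes a join from the fact table out to its dimensions. The following is a minimal sketch using Python's built-in sqlite3 module with a trimmed-down version of the schema; the sample rows are invented, only three of the dimensions are included, and the Date dimension is renamed DateDim here purely as a precaution.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE City    (CityName TEXT PRIMARY KEY, State TEXT, Country TEXT);
CREATE TABLE Product (ProdNo INTEGER PRIMARY KEY, ProdName TEXT, Category TEXT);
CREATE TABLE DateDim (DateKey INTEGER PRIMARY KEY, Day INTEGER, Month INTEGER, Year INTEGER);
CREATE TABLE Fact    (OrderNo INTEGER, ProdNo INTEGER, DateKey INTEGER,
                      CityName TEXT, Quantity INTEGER, TotalPrice REAL);
-- Hypothetical sample data
INSERT INTO City VALUES ('Berkeley', 'CA', 'USA'), ('Portland', 'OR', 'USA');
INSERT INTO Product VALUES (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware');
INSERT INTO DateDim VALUES (20090115, 15, 1, 2009), (20100220, 20, 2, 2010);
INSERT INTO Fact VALUES
  (100, 1, 20090115, 'Berkeley', 10, 50.0),
  (101, 1, 20100220, 'Berkeley', 4, 20.0),
  (102, 2, 20100220, 'Portland', 2, 30.0);
""")

# Aggregate the facts, grouping by attributes pulled from the dimension tables.
rows = conn.execute("""
SELECT p.ProdName, c.State, c.CityName, d.Year,
       SUM(f.Quantity), SUM(f.TotalPrice)
FROM Fact f
JOIN Product p ON p.ProdNo = f.ProdNo
JOIN City c    ON c.CityName = f.CityName
JOIN DateDim d ON d.DateKey = f.DateKey
WHERE d.Year IN (2009, 2010)
GROUP BY p.ProdName, c.State, c.CityName, d.Year
ORDER BY p.ProdName, c.State, c.CityName, d.Year
""").fetchall()

for row in rows:
    print(row)
```

Each additional dimension adds one more join to the same fact table, which is what gives the schema its star shape.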



Data Mining

Data mining is knowledge discovery rather than question answering
  May have no pre-formulated questions
Derived from
  Traditional Statistics
  Artificial intelligence
  Computer graphics (visualization)



Goals of Data Mining

Explanatory
  Explain some observed event or situation
  Why have the sales of SUVs increased in California but not in Oregon?
Confirmatory
  To confirm a hypothesis
  Whether 2-income families are more likely to buy family medical coverage
Exploratory
  To analyze data for new or unexpected relationships
  What spending patterns seem to indicate credit card fraud?



Data Mining Applications

Profiling Populations

Analysis of business trends

Target marketing

Usage Analysis

Campaign effectiveness

Product affinity

Customer Retention and Churn

Profitability Analysis

Customer Value Analysis

Up-Selling



Data + Text Mining Process

Source: Languistics, via Google Images



How Can We Do Data Mining?

By Utilizing the CRISP-DM Methodology

a standard process

existing data

software technologies

situational expertise

Source: Laura Squier



Why Should There be a Standard Process?

Framework for recording experience
  Allows projects to be replicated
Aid to project planning and management
Comfort factor for new adopters
  Demonstrates maturity of Data Mining
  Reduces dependency on stars

The data mining process must be reliable and repeatable by people with little data mining background.




Process Standardization

CRISP-DM: CRoss Industry Standard Process for Data Mining
Initiative launched Sept. 1996
  SPSS/ISL, NCR, Daimler-Benz, OHRA
Funding from European commission
Over 200 members of the CRISP-DM SIG worldwide
  DM Vendors - SPSS, NCR, IBM, SAS, SGI, Data Distilleries, Syllogic, Magnify, ..
  System Suppliers / consultants - Cap Gemini, ICL Retail, Deloitte & Touche,
  End Users - BT, ABB, Lloyds Bank, AirTouch, Experian, ...



CRISP-DM

Non-proprietary

Application/Industry neutral

Tool neutral

Focus on business issues

As well as technical analysis

Framework for guidance

Experience base

Templates for Analysis



The CRISP-DM Process Model



Why CRISP-DM?

The data mining process must be reliable and repeatable by people with little data mining skills
CRISP-DM provides a uniform framework for
  guidelines
  experience documentation
CRISP-DM is flexible to account for differences
  Different business/agency problems
  Different data



Phases and Tasks

Each task is followed by its outputs.

Business Understanding
  Determine Business Objectives: Background; Business Objectives; Business Success Criteria
  Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
  Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
  Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques
Data Understanding
  Collect Initial Data: Initial Data Collection Report
  Describe Data: Data Description Report
  Explore Data: Data Exploration Report
  Verify Data Quality: Data Quality Report
Data Preparation
  Phase outputs: Data Set; Data Set Description
  Select Data: Rationale for Inclusion / Exclusion
  Clean Data: Data Cleaning Report
  Construct Data: Derived Attributes; Generated Records
  Integrate Data: Merged Data
  Format Data: Reformatted Data
Modeling
  Select Modeling Technique: Modeling Technique; Modeling Assumptions
  Generate Test Design: Test Design
  Build Model: Parameter Settings; Models; Model Description
  Assess Model: Model Assessment; Revised Parameter Settings
Evaluation
  Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
  Review Process: Review of Process
  Determine Next Steps: List of Possible Actions; Decision
Deployment
  Plan Deployment: Deployment Plan
  Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
  Produce Final Report: Final Report; Final Presentation
  Review Project: Experience Documentation



Phases in CRISP

Business Understanding

  This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition, and a preliminary plan designed to achieve the objectives.

Data Understanding

  The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.

Data Preparation

  The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.

Modeling

  In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed.

Evaluation

  At this stage in the project you have built a model (or models) that appears to have high quality, from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.

Deployment

  Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process. In many cases it will be the customer, not the data analyst, who will carry out the deployment steps. However, even if the analyst will not carry out the deployment effort it is important for the customer to understand up front what actions will need to be carried out in order to actually make use of the created models.


Phases in the DM Process: CRISP-DM



Phases in the DM Process (1 & 2)

Business Understanding:
  Statement of Business Objective
  Statement of Data Mining objective
  Statement of Success Criteria
Data Understanding
  Explore the data and verify the quality
  Find outliers



Phases in the DM Process (3)

Data preparation:
  Usually takes over 90% of our time
  Collection
  Assessment
  Consolidation and Cleaning
    table links, aggregation level, missing values, etc.
  Data selection
    active role in ignoring non-contributory data?
    outliers?
    Use of samples
    visualization tools
  Transformations - create new variables


Phases in the DM Process (4)


Model building
  Selection of the modeling techniques is based upon the data mining objective
  Modeling is an iterative process - different for supervised and unsupervised learning
  May model for either description or prediction



Types of Models

Prediction Models for Predicting and Classifying
  Regression algorithms (predict numeric outcome): neural networks, rule induction, CART (OLS regression, GLM)
  Classification algorithms (predict symbolic outcome): CHAID, C5.0 (discriminant analysis, logistic regression)
Descriptive Models for Grouping and Finding Associations
  Clustering/Grouping algorithms: K-means, Kohonen
  Association algorithms: apriori, GRI



Data Mining Algorithms

Market Basket Analysis
Memory-based reasoning
Cluster detection
Link analysis
Decision trees and rule induction algorithms
Neural Networks
Genetic algorithms


Market Basket Analysis

A type of clustering used to predict purchase patterns.
Identify the products likely to be purchased in conjunction with other products
  E.g., the famous (and apocryphal) story that men who buy diapers on Friday nights also buy beer.
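
One common way to quantify "purchased in conjunction" is with support and confidence for item pairs. These two measures are not named on the slide, so treat the following as an illustrative sketch over invented baskets rather than the lecture's method.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions; each is the set of items in one basket.
BASKETS = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def pair_stats(baskets):
    """Return {(a, b): (support, confidence of a -> b)} for co-occurring pairs."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    stats = {}
    for (a, b), count in pair_counts.items():
        # Support is symmetric; confidence depends on the rule direction.
        stats[(a, b)] = (count / n, count / item_counts[a])
        stats[(b, a)] = (count / n, count / item_counts[b])
    return stats

stats = pair_stats(BASKETS)
support, confidence = stats[("diapers", "beer")]
print(f"diapers -> beer: support={support:.2f}, confidence={confidence:.2f}")
```

Here support is the share of all baskets containing both items, and confidence is the share of diaper baskets that also contain beer.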


Memory-based reasoning

Use known instances of a model to make predictions about unknown instances.
Could be used for sales forecasting or fraud detection by working from known cases to predict new cases
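
Working "from known cases to predict new cases" can be sketched as a nearest-neighbor vote. The cases, features, and labels below are invented for illustration, and the features are left unscaled for brevity; a practical version would normalize them first.

```python
import math
from collections import Counter

# Hypothetical known cases: (amount in dollars, hour of day) -> label.
KNOWN_CASES = [
    ((20.0, 14), "ok"),
    ((35.0, 10), "ok"),
    ((15.0, 18), "ok"),
    ((900.0, 3), "fraud"),
    ((750.0, 2), "fraud"),
]

def predict(case, known_cases, k=3):
    """Label a new case by majority vote among its k closest known cases."""
    nearest = sorted(known_cases, key=lambda kc: math.dist(case, kc[0]))[:k]
    votes = Counter(label for _features, label in nearest)
    return votes.most_common(1)[0][0]

print(predict((800.0, 4), KNOWN_CASES))
print(predict((25.0, 12), KNOWN_CASES))
```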


Cluster detection

Finds data records that are similar to each other.
K-nearest neighbors (where K is the number of nearest similar records consulted, with nearness measured by a mathematical distance) is one example of a similarity-based algorithm; K-means is a typical clustering algorithm
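
The "Types of Models" slide lists K-means among the clustering algorithms. A minimal sketch of K-means follows (not K-nearest neighbors); the sample points and the choice of starting centroids are assumptions made for the example.

```python
import math

def k_means(points, centroids, iterations=10):
    """Plain K-means: assign each point to its nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Hypothetical 2-d records forming two visible groups.
POINTS = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = k_means(POINTS, centroids=[POINTS[0], POINTS[3]])
print(centroids)
print(clusters)
```

Here K is the number of clusters requested (two), fixed by how many starting centroids are passed in.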


Kohonen Network

Description
  unsupervised
  seeks to describe dataset in terms of natural clusters of cases



Link analysis

Follows relationships between records to discover patterns
Link analysis can provide the basis for various affinity marketing programs
Similar to Markov transition analysis methods where probabilities are calculated for each observed transition.
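
Calculating a probability for each observed transition can be sketched as simple counting. The sequences below are hypothetical, and the estimate is a plain first-order frequency with no smoothing.

```python
from collections import Counter, defaultdict

# Hypothetical observed sequences of linked records (e.g. pages visited).
SEQUENCES = [
    ["home", "search", "product", "cart"],
    ["home", "product", "cart"],
    ["home", "search", "home"],
]

def transition_probabilities(sequences):
    """Estimate P(next | current) from counts of each observed transition."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    return {
        current: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for current, followers in counts.items()
    }

probs = transition_probabilities(SEQUENCES)
print(probs["home"])
print(probs["search"])
```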


Decision trees and rule induction algorithms

Pulls rules out of a mass of data using classification and regression trees (CART) or Chi-Square automatic interaction detectors (CHAID)
These algorithms produce explicit rules, which make understanding the results simpler


Rule Induction

Description
  Produces decision trees:
    income < $40K
      job > 5 yrs then good risk
      job < 5 yrs then bad risk
    income > $40K
      high debt then bad risk
      low debt then good risk
  Or Rule Sets:
    Rule #1 for good risk:
      if income > $40K
      if low debt
    Rule #2 for good risk:
      if income < $40K
      if job > 5 years
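
The example tree above maps directly onto nested conditionals. In this sketch the thresholds come from the slide, while the function name and the handling of the exact boundary values (income of exactly $40K, exactly 5 years in a job) are assumptions, since the slide does not specify them.

```python
def credit_risk(income: float, years_in_job: float, high_debt: bool) -> str:
    """Apply the slide's example decision tree to one applicant."""
    if income < 40_000:
        # Lower-income branch: job tenure decides.
        return "good risk" if years_in_job > 5 else "bad risk"
    # Higher-income branch: debt level decides.
    return "bad risk" if high_debt else "good risk"

print(credit_risk(30_000, 7, high_debt=True))
print(credit_risk(55_000, 1, high_debt=True))
```

The explicit branches are what make the output of rule induction easy to read and audit.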

[Figure: decision tree for Credit ranking (1=default). Each node lists Bad % (n), Good % (n), and Total (% of all cases) n]

Credit ranking (1=default): Bad 52.01 (168), Good 47.99 (155), Total (100.00) 323
  Split on Paid Weekly/Monthly: P-value=0.0000, Chi-square=179.6665, df=1
  Weekly pay: Bad 86.67 (143), Good 13.33 (22), Total (51.08) 165
    Split on Age Categorical
    Young (< 25); Middle (25-35): Bad 90.51 (143), Good 9.49 (15), Total (48.92) 158
    Old (> 35): Bad 0.00 (0), Good 100.00 (7), Total (2.17) 7
  Monthly salary: Bad 15.82 (25), Good 84.18 (133), Total (48.92) 158
    Split on Age Categorical
    Young (< 25): Bad 48.98 (24), Good 51.02 (25), Total (15.17) 49
      Split on Social Class
      Management; Clerical: Bad 0.00 (0), Good 100.00 (8), Total (2.48) 8
      Professional: Bad 58.54 (24), Good 41.46 (17), Total (12.69) 41
    Middle (25-35); Old (> 35): Bad 0.92 (1), Good 99.08 (108), Total (33.75) 109


Rule Induction

Description
  Intuitive output
  Handles all forms of numeric data, as well as non-numeric (symbolic) data
C5 Algorithm a special case of rule induction
  Target variable must be symbolic



Apriori

Description
  Seeks association rules in dataset
  Market basket analysis
  Sequence discovery
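
The core of Apriori is a level-wise search for frequent itemsets, from which association rules are then derived. The sketch below covers only that frequent-itemset step, on invented baskets with an arbitrary support threshold; it favors clarity over efficiency.

```python
from itertools import combinations

def apriori(baskets, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(baskets)

    def support(items):
        return sum(items <= basket for basket in baskets) / n

    # Level 1: frequent single items.
    current = {frozenset([item]) for basket in baskets for item in basket}
    current = {s for s in current if support(s) >= min_support}
    frequent = {}
    while current:
        frequent.update({s: support(s) for s in current})
        size = len(next(iter(current))) + 1
        # Join step: merge frequent sets into candidates one item larger.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Prune step: every subset one item smaller must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, size - 1))
            and support(c) >= min_support
        }
    return frequent

# Hypothetical transactions.
BASKETS = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]
frequent = apriori(BASKETS, min_support=0.4)
for itemset, value in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), value)
```

The prune step is the key idea: an itemset cannot be frequent unless all of its smaller subsets are, so most candidates are discarded without counting them.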



Neural Networks

Attempt to model neurons in the brain
Learn from a training set and then can be used to detect patterns inherent in that training set
Neural nets are effective when the data is shapeless and lacking any apparent patterns

May be hard to understand results


Neural Network

[Figure: network with an input layer, a hidden layer, and an output]



Neural Networks

Description
  Difficult interpretation
  Tends to overfit the data
  Extensive amount of training time
  A lot of data preparation
  Works with all data types



Genetic algorithms

Imitate natural selection processes to evolve models using
  Selection
  Crossover
  Mutation
Each new generation inherits traits from the previous ones until only the most predictive survive.
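
The three operators can be sketched on a toy problem. In this illustration bit strings stand in for models and the count of 1 bits stands in for predictive quality; the population size, mutation rate, and seed are arbitrary assumptions.

```python
import random

def evolve(fitness, length=12, population_size=20, generations=30, seed=0):
    """Evolve bit strings toward higher fitness; return the best one found."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(population_size)]
    for _ in range(generations):
        # Selection: the fitter half survives as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]
        children = []
        while len(parents) + len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            # Crossover: splice the two parents at a random cut point.
            cut = rng.randrange(1, length)
            child = mother[:cut] + father[cut:]
            # Mutation: occasionally flip one bit.
            if rng.random() < 0.2:
                i = rng.randrange(length)
                child[i] = 1 - child[i]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

best = evolve(fitness=sum)  # fitness = number of 1 bits
print(best, sum(best))
```

Because the fittest individuals are always carried over as parents, the best fitness in the population never decreases from one generation to the next.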





Model Evaluation
  Evaluation of model: how well it performed on test data
  Methods and criteria depend on model type:
    e.g., coincidence matrix with classification models, mean error rate with regression models
  Interpretation of model: important or not, easy or hard depends on algorithm
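
Both kinds of check can be sketched in a few lines. The labels and values below are made up, and mean absolute error is used here as one plausible reading of "mean error rate"; the lecture does not say which error measure it has in mind.

```python
from collections import Counter

def coincidence_matrix(actual, predicted):
    """Count (actual, predicted) label pairs for a classification model."""
    return Counter(zip(actual, predicted))

def mean_absolute_error(actual, predicted):
    """Average absolute difference for a regression model."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical held-out test labels and model outputs.
matrix = coincidence_matrix(
    ["good", "good", "bad", "bad", "bad"],
    ["good", "bad", "bad", "bad", "good"],
)
accuracy = sum(n for (a, p), n in matrix.items() if a == p) / sum(matrix.values())
print(dict(matrix), accuracy)

mae = mean_absolute_error([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(mae)
```

The off-diagonal cells of the matrix show which kinds of mistake the model makes, which a single accuracy figure hides.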






Deployment
  Determine how the results need to be utilized
    Who needs to use them?
    How often do they need to be used?
  Deploy Data Mining results by:
    Scoring a database
    Utilizing results as business rules
    Interactive scoring on-line



Specific Data Mining Applications:



What data mining has done for...

The US Internal Revenue Service needed to improve customer service and...

Scheduled its workforce to provide faster, more accurate answers to questions.






The US Drug Enforcement Agency needed to be more effective in their drug busts and

analyzed suspects' cell phone usage to focus investigations.






HSBC needed to cross-sell more effectively by identifying profiles that would be interested in higher yielding investments and...

Reduced direct mail costs by 30% while garnering 95% of the campaign's revenue.



Analytic technology can be effective

Combining multiple models and link analysis can reduce false positives
Today there are millions of false positives with manual analysis
Data Mining is just one additional tool to help analysts
Analytic Technology has the potential to reduce the current high rate of false positives



Data Mining with Privacy

Data Mining looks for patterns, not people!
Technical solutions can limit privacy invasion
  Replacing sensitive personal data with anon. ID
  Give randomized outputs
  Multi-party computation over distributed data

Bayardo & Srikant, Technological Solutions for Protecting Privacy, IEEE Computer, Sep 2003
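
Two of the technical measures listed above can be sketched briefly. This is an illustration under assumptions, not the method from the cited paper: the key, identifiers, and function names are hypothetical, the anonymous ID is a keyed hash, and the randomized output follows the classic randomized-response idea.

```python
import hashlib
import hmac
import random

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key, withheld from analysts

def anonymous_id(personal_id: str) -> str:
    """Replace a sensitive identifier with a keyed, hard-to-reverse token."""
    digest = hmac.new(SECRET_KEY, personal_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

def randomized_answer(truth: bool, rng: random.Random, p_truth: float = 0.75) -> bool:
    """Report the true value with probability p_truth, otherwise a coin flip."""
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5

# Hypothetical records: (customer identifier, purchase amount).
records = [("alice@example.com", 250.0), ("bob@example.com", 80.0),
           ("alice@example.com", 40.0)]
mined = [(anonymous_id(who), amount) for who, amount in records]
print(mined)
```

The same person always maps to the same token, so patterns across records survive while the identity itself is not exposed to the analyst.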


The Hype Curve for Data Mining and Knowledge Discovery

[Figure: Expectations and Performance plotted over time (1990, 1998, 2000, 2002), with stages labeled rising expectations, over-inflated expectations, disappointment, and growing acceptance and mainstreaming]

Source: Gregory Piatetsky-Shapiro
