8/3/2019 Experian Lecture17_257
1/66
2010.10.28- SLIDE 1
IS 257 Fall 2010
Data Warehouses, Decision Support and Data Mining
University of California, Berkeley
School of Information
IS 257: Database Management
Lecture Outline
Review
  Data Warehouses (based on lecture notes from Joachim Hammer, University of Florida, and Joe Hellerstein and Mike Stonebraker of UCB)
Applications for Data Warehouses
  Decision Support Systems (DSS)
  OLAP (ROLAP, MOLAP)
  Data Mining
Thanks again to lecture notes from Joachim Hammer of the University of Florida
Problem: Heterogeneous Information Sources
Heterogeneities are everywhere
  Different interfaces
  Different data representations
  Duplicate and inconsistent information
[Figure: heterogeneous sources: Personal Databases, Digital Libraries, Scientific Databases, World Wide Web]
Slide credit: J. Hammer
Problem: Data Management in Large Enterprises
Vertical fragmentation of informational systems (vertical stove pipes)
Result of application (user)-driven development of operational systems
[Figure: stovepipe systems by department (Sales Administration, Finance, Manufacturing, ...), with component applications such as Sales Planning, Stock Mngmt, Suppliers, Debt Mngmt, Num. Control, Inventory]
Slide credit: J. Hammer
Goal: Unified Access to Data
Integration System
Collects and combines information
Provides integrated view, uniform user interface
Supports sharing
[Figure: integration system over the World Wide Web, Digital Libraries, Scientific Databases, and Personal Databases]
Slide credit: J. Hammer
The Traditional Research Approach
[Figure: Clients query an Integration System (with Metadata), which reaches each Source through its own Wrapper]
Query-driven (lazy, on-demand)
Slide credit: J. Hammer
The Warehousing Approach
Information integrated in advance
Stored in WH for direct querying and analysis
[Figure: each Source has an Extractor/Monitor feeding the Integration System (with Metadata), which loads the Data Warehouse queried by Clients]
Slide credit: J. Hammer
What is a Data Warehouse?
A Data Warehouse is a
  subject-oriented,
  integrated,
  time-variant,
  non-volatile
collection of data used in support of management decision making processes.
-- Inmon & Hackathorn, 1994: viz. Hoffer, Chap 11
A Data Warehouse is...
Stored collection of diverse data
  A solution to data integration problem
  Single repository of information
Subject-oriented
  Organized by subject, not by application
  Used for analysis, data mining, etc.
Optimized differently from transaction-oriented db
User interface aimed at executive decision makers and analysts
A Data Warehouse is... (cont'd)
Large volume of data (GB, TB)
Non-volatile
  Historical
  Time attributes are important
Updates infrequent
May be append-only
Examples
  All transactions ever at WalMart
  Complete client histories at insurance firm
  Stockbroker financial information and portfolios
Slide credit: J. Hammer
Data Warehousing Architecture
Ingest
[Figure: Sources (File, External, DB) each have an Extractor/Monitor feeding the Integration System (with Metadata), which loads the Data Warehouse queried by Clients]
Today
Applications for Data Warehouses
  Decision Support Systems (DSS)
  OLAP (ROLAP, MOLAP)
  Data Mining
Thanks again to slides and lecture notes from Joachim Hammer of the University of Florida, and also to Laura Squier of SPSS, Gregory Piatetsky-Shapiro of KDNuggets and to the CRISP web site
Source: Gregory Piatetsky-Shapiro
Trends leading to Data Flood
More data is generated:
  Bank, telecom, other business transactions ...
  Scientific Data: astronomy, biology, etc
  Web, text, and e-commerce
More data is captured:
  Storage technology faster and cheaper
  DBMS capable of handling bigger DB
Source: Gregory Piatetsky-Shapiro
Examples
Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session
  storage and analysis a big problem
Walmart reported to have 500 Terabyte DB
AT&T handles billions of calls per day
  data cannot be stored -- analysis is done on the fly
Source: Gregory Piatetsky-Shapiro
Growth Trends
Moore's law
  Computer speed doubles every 18 months
Storage law
  Total storage doubles every 9 months
Consequence
  Very little data will ever be looked at by a human
Knowledge Discovery is NEEDED to make sense and use of data.
Source: Gregory Piatetsky-Shapiro
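The gap between the two doubling rates compounds quickly. A minimal arithmetic sketch, taking the 18-month and 9-month doubling periods quoted on this slide at face value (the function name `growth_factor` and the three-year horizon are illustrative assumptions):

```python
def growth_factor(months: float, doubling_period_months: float) -> float:
    """Multiplicative growth after `months`, given a fixed doubling period."""
    return 2 ** (months / doubling_period_months)

# Three years (36 months) under the doubling periods quoted on the slide.
speed = growth_factor(36, 18)    # computer speed
storage = growth_factor(36, 9)   # total storage
print(speed, storage, storage / speed)
```

Under these assumptions storage grows four times faster than processing speed over three years, which is the slide's point: the share of data anyone can examine by hand keeps shrinking.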
Knowledge Discovery in Data (KDD)
Knowledge Discovery in Data is the non-trivial process of identifying
  valid
  novel
  potentially useful
  and ultimately understandable
patterns in data.
from Advances in Knowledge Discovery and Data Mining, Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy (Chapter 1), AAAI/MIT Press 1996
Source: Gregory Piatetsky-Shapiro
Related Fields
Statistics
Machine Learning
Databases
Visualization
Data Mining and Knowledge Discovery
Source: Gregory Piatetsky-Shapiro
Knowledge Discovery Process
[Figure: flow diagram with stages labeled Raw Data, Integration, Data Warehouse, Target Data, Transformed Data, Patterns and Rules, Interpretation & Evaluation, Knowledge, Understanding]
Source: Gregory Piatetsky-Shapiro
What is Decision Support?
Technology that will help managers and planners make decisions regarding the organization and its operations based on data in the Data Warehouse.
  What were the last two years of sales volume for each product by state and city?
  What effects will a 5% price discount have on our future income for product X?
An increasingly common term is KDD: Knowledge Discovery in Databases
Conventional Query Tools
Ad-hoc queries and reports using conventional database tools
  E.g. Access queries.
Typical database designs include fixed sets of reports and queries to support them
  The end-user is often not given the ability to do ad-hoc queries
OLAP
Online Analytical Processing
  Intended to provide multidimensional views of the data
  I.e., the Data Cube
  The PivotTables in MS Excel are examples of OLAP tools
Data Cube
Operations on Data Cubes
Slicing the cube
  Extracts a 2d table from the multidimensional data cube
  Example
Drill-Down
  Analyzing a given set of data at a finer level of detail
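The two cube operations can be sketched with a plain dictionary standing in for the cube. This is only an illustration: the dimensions (product, city, quarter, month), the sample figures, and the function names are all hypothetical, not taken from the lecture.

```python
from collections import defaultdict

# Hypothetical cube cells: (product, city, quarter, month) -> units sold.
cells = {
    ("Widget", "Berkeley", "Q1", "Jan"): 10,
    ("Widget", "Berkeley", "Q1", "Feb"): 12,
    ("Widget", "Oakland", "Q1", "Jan"): 7,
    ("Gadget", "Berkeley", "Q1", "Jan"): 5,
    ("Gadget", "Oakland", "Q2", "Apr"): 9,
}

def slice_by_quarter(cube, quarter):
    """Slice: fix one dimension value and return a 2-d product x city table."""
    table = defaultdict(int)
    for (product, city, q, _month), units in cube.items():
        if q == quarter:
            table[(product, city)] += units
    return dict(table)

def drill_down(cube, product, city, quarter):
    """Drill-down: break one quarterly cell into its monthly detail."""
    return {
        month: units
        for (p, c, q, month), units in cube.items()
        if (p, c, q) == (product, city, quarter)
    }

q1 = slice_by_quarter(cells, "Q1")
detail = drill_down(cells, "Widget", "Berkeley", "Q1")
print(q1)
print(detail)
```

A real OLAP engine would precompute or index these aggregates; the sketch only shows what each operation returns.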
Star Schema
Typical design for the derived layer of a Data Warehouse or Mart for Decision Support
  Particularly suited to ad-hoc queries
  Dimensional data separate from fact or event data
Fact tables contain factual or quantitative data about the business
Dimension tables hold data about the subjects of the business
Typically there is one Fact table with multiple dimension tables
Star Schema for multidimensional data
Fact Table: OrderNo, Salespersonid, Customerno, ProdNo, Datekey, Cityname, Quantity, TotalPrice
Dimension tables:
  Order: OrderNo, OrderDate
  Salesperson: SalespersonID, SalespersonName, City, Quota
  City: CityName, State, Country
  Date: DateKey, Day, Month, Year
  Product: ProdNo, ProdName, Category, Description
  Customer: CustomerName, CustomerAddress, City
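A typical star-schema query joins the fact table to a few dimension tables and aggregates the measures. The sketch below uses Python's built-in `sqlite3` module with a cut-down version of the schema on this slide (only the Product and City dimensions); the column types and sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (ProdNo INTEGER PRIMARY KEY, ProdName TEXT, Category TEXT);
CREATE TABLE city (CityName TEXT PRIMARY KEY, State TEXT, Country TEXT);
CREATE TABLE fact (
    OrderNo INTEGER,
    ProdNo INTEGER REFERENCES product(ProdNo),
    CityName TEXT REFERENCES city(CityName),
    Quantity INTEGER,
    TotalPrice REAL
);
-- Hypothetical sample rows.
INSERT INTO product VALUES (1, 'Widget', 'Hardware'), (2, 'Manual', 'Books');
INSERT INTO city VALUES ('Berkeley', 'CA', 'USA'), ('Portland', 'OR', 'USA');
INSERT INTO fact VALUES
    (100, 1, 'Berkeley', 3, 30.0),
    (101, 1, 'Portland', 2, 20.0),
    (102, 2, 'Berkeley', 1, 15.0),
    (103, 1, 'Berkeley', 4, 40.0);
""")

# Aggregate the fact measures by attributes that live in the dimension tables.
rows = conn.execute("""
    SELECT p.Category, c.State, SUM(f.Quantity), SUM(f.TotalPrice)
    FROM fact AS f
    JOIN product AS p ON p.ProdNo = f.ProdNo
    JOIN city AS c ON c.CityName = f.CityName
    GROUP BY p.Category, c.State
    ORDER BY p.Category, c.State
""").fetchall()
for row in rows:
    print(row)
```

Because every dimension joins directly to the fact table, ad-hoc questions of this kind need only one join per dimension involved.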
Data Mining
Data mining is knowledge discovery rather than question answering
  May have no pre-formulated questions
Derived from
  Traditional Statistics
  Artificial intelligence
  Computer graphics (visualization)
Goals of Data Mining
Explanatory
  Explain some observed event or situation
    Why have the sales of SUVs increased in California but not in Oregon?
Confirmatory
  To confirm a hypothesis
    Whether 2-income families are more likely to buy family medical coverage
Exploratory
  To analyze data for new or unexpected relationships
    What spending patterns seem to indicate credit card fraud?
Data Mining Applications
Profiling Populations
Analysis of business trends
Target marketing
Usage Analysis
Campaign effectiveness
Product affinity
Customer Retention and Churn
Profitability Analysis
Customer Value Analysis
Up-Selling
Data + Text Mining Process
Source: Languistics
via Google Images
How Can We Do Data Mining?
By Utilizing the CRISP-DM Methodology
a standard process
existing data
software technologies
situational expertise
Source: Laura Squier
Why Should There be a Standard Process?
Framework for recording experience
  Allows projects to be replicated
Aid to project planning and management
Comfort factor for new adopters
  Demonstrates maturity of Data Mining
  Reduces dependency on stars
The data mining process must be reliable and repeatable by people with little data mining background.
Source: Laura Squier
Process Standardization
CRISP-DM: CRoss Industry Standard Process for Data Mining
Initiative launched Sept. 1996
SPSS/ISL, NCR, Daimler-Benz, OHRA
Funding from European commission
Over 200 members of the CRISP-DM SIG worldwide
  DM Vendors - SPSS, NCR, IBM, SAS, SGI, Data Distilleries, Syllogic, Magnify, ..
  System Suppliers / consultants - Cap Gemini, ICL Retail, Deloitte & Touche,
  End Users - BT, ABB, Lloyds Bank, AirTouch, Experian, ...
Source: Laura Squier
CRISP-DM
Non-proprietary
Application/Industry neutral
Tool neutral
Focus on business issues
As well as technical analysis
Framework for guidance
Experience base
Templates for Analysis
Source: Laura Squier
The CRISP-DM Process Model
Source: Laura Squier
Why CRISP-DM?
The data mining process must be reliable and repeatable by people with little data mining skill
CRISP-DM provides a uniform framework for
  guidelines
  experience documentation
CRISP-DM is flexible to account for differences
  Different business/agency problems
  Different data
Source: Laura Squier
Phases and Tasks
(each task is followed by its outputs)
Business Understanding
  Determine Business Objectives: Background; Business Objectives; Business Success Criteria
  Situation Assessment: Inventory of Resources; Requirements, Assumptions, and Constraints; Risks and Contingencies; Terminology; Costs and Benefits
  Determine Data Mining Goal: Data Mining Goals; Data Mining Success Criteria
  Produce Project Plan: Project Plan; Initial Assessment of Tools and Techniques
Data Understanding
  Collect Initial Data: Initial Data Collection Report
  Describe Data: Data Description Report
  Explore Data: Data Exploration Report
  Verify Data Quality: Data Quality Report
Data Preparation
  (phase outputs): Data Set; Data Set Description
  Select Data: Rationale for Inclusion / Exclusion
  Clean Data: Data Cleaning Report
  Construct Data: Derived Attributes; Generated Records
  Integrate Data: Merged Data
  Format Data: Reformatted Data
Modeling
  Select Modeling Technique: Modeling Technique; Modeling Assumptions
  Generate Test Design: Test Design
  Build Model: Parameter Settings; Models; Model Description
  Assess Model: Model Assessment; Revised Parameter Settings
Evaluation
  Evaluate Results: Assessment of Data Mining Results w.r.t. Business Success Criteria; Approved Models
  Review Process: Review of Process
  Determine Next Steps: List of Possible Actions; Decision
Deployment
  Plan Deployment: Deployment Plan
  Plan Monitoring and Maintenance: Monitoring and Maintenance Plan
  Produce Final Report: Final Report; Final Presentation
  Review Project: Experience Documentation
Source: Laura Squier
Phases in CRISP
Business Understanding
This initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition, and a preliminary plan designed to achieve the objectives.
Data Understanding
The data understanding phase starts with an initial data collection and proceeds with activities in order to get familiar with the data, to identify data quality problems, to discover first insights into the data, or to detect interesting subsets to form hypotheses for hidden information.
Data Preparation
The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for modeling tools.
Modeling
In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed.
Evaluation
At this stage in the project you have built a model (or models) that appears to have high quality, from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.
Deployment
Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained will need to be organized and presented in a way that the customer can use it. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process. In many cases it will be the customer, not the data analyst, who will carry out the deployment steps. However, even if the analyst will not carry out the deployment effort it is important for the customer to understand up front what actions will need to be carried out in order to actually make use of the created models.
Phases in the DM Process: CRISP-DM
Source: Laura Squier
Phases in the DM Process (1 & 2)
Business Understanding:
  Statement of Business Objective
  Statement of Data Mining objective
  Statement of Success Criteria
Data Understanding
  Explore the data and verify the quality
  Find outliers
Source: Laura Squier
Phases in the DM Process (3)
Data preparation:
  Usually takes over 90% of our time
  Collection
  Assessment
  Consolidation and Cleaning
    table links, aggregation level, missing values, etc
  Data selection
    active role in ignoring non-contributory data?
    outliers?
    Use of samples
    visualization tools
  Transformations - create new variables
Source: Laura Squier
Phases in the DM Process (4)
Model building
  Selection of the modeling techniques is based upon the data mining objective
  Modeling is an iterative process - different for supervised and unsupervised learning
  May model for either description or prediction
Source: Laura Squier
Types of Models
Prediction Models for Predicting and Classifying
  Regression algorithms (predict numeric outcome): neural networks, rule induction, CART (OLS regression, GLM)
  Classification algorithms (predict symbolic outcome): CHAID, C5.0 (discriminant analysis, logistic regression)
Descriptive Models for Grouping and Finding Associations
  Clustering/Grouping algorithms: K-means, Kohonen
  Association algorithms: apriori, GRI
Source: Laura Squier
Data Mining Algorithms
Market Basket Analysis
Memory-based reasoning
Cluster detection
Link analysis
Decision trees and rule induction algorithms
Neural Networks
Genetic algorithms
Market Basket Analysis
A type of association analysis (sometimes loosely described as clustering) used to predict purchase patterns.
Identify the products likely to be purchased in conjunction with other products
  E.g., the famous (and apocryphal) story that men who buy diapers on Friday nights also buy beer.
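The strength of a "bought together" pattern is commonly measured with support and confidence. The sketch below computes both over a handful of made-up baskets; the data and the helper names `support` and `confidence` are illustrative assumptions, not part of the lecture.

```python
# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) from co-occurrence counts."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

s = support({"diapers", "beer"}, baskets)
c = confidence({"diapers"}, {"beer"}, baskets)
print(f"support={s:.2f} confidence={c:.2f}")
```

Here two of the five baskets contain both items, and two of the three diaper baskets also contain beer.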
Memory-based reasoning
Use known instances of a model to make predictions about unknown instances.
Could be used for sales forecasting or fraud detection by working from known cases to predict new cases
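One common form of memory-based reasoning is nearest-neighbor prediction: a new case takes the majority label of the most similar stored cases. A minimal sketch, assuming made-up transaction records described by (amount, hour of day) and an unscaled Euclidean distance:

```python
import math
from collections import Counter

# Hypothetical known cases: (amount, hour_of_day) -> label.
known = [
    ((20.0, 14), "ok"),
    ((35.0, 10), "ok"),
    ((25.0, 16), "ok"),
    ((900.0, 3), "fraud"),
    ((750.0, 2), "fraud"),
]

def predict(case, memory, k=3):
    """Label a new case by majority vote among its k closest known cases."""
    nearest = sorted(memory, key=lambda rec: math.dist(rec[0], case))[:k]
    votes = Counter(label for _features, label in nearest)
    return votes.most_common(1)[0][0]

print(predict((800.0, 4), known))
print(predict((30.0, 12), known))
```

In practice the features would be scaled first so that one attribute (here, the amount) does not dominate the distance.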
Cluster detection
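Finds data records that are similar to each other.
K-means (where K is the number of clusters, and similarity is measured by the mathematical distance between records) is an example of one clustering algorithm
K-nearest neighbors (where K is the number of closest records consulted) is the related technique used in memory-based reasoning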
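A minimal K-means sketch shows how distance drives cluster detection: each point joins its nearest centroid, and centroids are then recomputed as cluster means. The points, the starting centroids, and the fixed iteration count are assumptions chosen to keep the example deterministic.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute means."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

Production implementations choose starting centroids more carefully and stop when assignments no longer change.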
Kohonen Network
Description
  unsupervised
  seeks to describe dataset in terms of natural clusters of cases
Source: Laura Squier
Link analysis
Follows relationships between records to discover patterns
Link analysis can provide the basis for various affinity marketing programs
Similar to Markov transition analysis methods where probabilities are calculated for each observed transition.
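The Markov-style calculation mentioned above amounts to counting observed transitions and normalizing per source state. A small sketch over hypothetical browsing sequences (the session data and the function name are assumptions for illustration):

```python
from collections import Counter, defaultdict

# Hypothetical sequences of linked records, e.g. pages visited in order.
sessions = [
    ["home", "laptops", "cart"],
    ["home", "phones", "cart"],
    ["home", "laptops", "laptops", "cart"],
    ["home", "laptops"],
]

def transition_probabilities(sequences):
    """Estimate P(next | current) from the observed consecutive pairs."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for src, dst in zip(seq, seq[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }

probs = transition_probabilities(sessions)
print(probs)
```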
Decision trees and rule induction algorithms
Pull rules out of a mass of data using classification and regression trees (CART) or Chi-Square automatic interaction detectors (CHAID)
These algorithms produce explicit rules, which make understanding the results simpler
Rule Induction
Description
Produces decision trees:
  income < $40K
    job > 5 yrs then good risk
    job < 5 yrs then bad risk
  income > $40K
    high debt then bad risk
    low debt then good risk
Or Rule Sets:
  Rule #1 for good risk:
    if income > $40K
    if low debt
  Rule #2 for good risk:
    if income < $40K
    if job > 5 years
[Figure: decision tree for Credit ranking (1=default); each node lists % and n per category, and its share of all 323 cases]
Root: Bad 52.01% (168), Good 47.99% (155), Total 323 (100.00%)
Split on Paid Weekly/Monthly (P-value=0.0000, Chi-square=179.6665, df=1)
  Weekly pay: Bad 86.67% (143), Good 13.33% (22), Total 165 (51.08%)
    Split on Age Categorical (P-value=0.0000, Chi-square=30.1113, df=1)
      Young (< 25); Middle (25-35): Bad 90.51% (143), Good 9.49% (15), Total 158 (48.92%)
      Old (> 35): Bad 0.00% (0), Good 100.00% (7), Total 7 (2.17%)
  Monthly salary: Bad 15.82% (25), Good 84.18% (133), Total 158 (48.92%)
    Split on Age Categorical (P-value=0.0000, Chi-square=58.7255, df=1)
      Young (< 25): Bad 48.98% (24), Good 51.02% (25), Total 49 (15.17%)
        Split on Social Class (P-value=0.0016, Chi-square=12.0388, df=1)
          Management; Clerical: Bad 0.00% (0), Good 100.00% (8), Total 8 (2.48%)
          Professional: Bad 58.54% (24), Good 41.46% (17), Total 41 (12.69%)
      Middle (25-35); Old (> 35): Bad 0.92% (1), Good 99.08% (108), Total 109 (33.75%)
Source: Laura Squier
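The good-risk rule set on this slide translates almost directly into code. A minimal sketch, with two assumptions the slide does not settle: anything not matched by a rule is treated as bad risk, and the boundary cases (income of exactly $40K, exactly 5 years on the job) fall through to bad risk. The function and parameter names are illustrative.

```python
def good_risk(income: float, low_debt: bool, years_on_job: float) -> bool:
    """Apply the two 'good risk' rules; unmatched cases are assumed to be bad risk."""
    rule_1 = income > 40_000 and low_debt          # Rule #1 for good risk
    rule_2 = income < 40_000 and years_on_job > 5  # Rule #2 for good risk
    return rule_1 or rule_2

print(good_risk(50_000, low_debt=True, years_on_job=1))
print(good_risk(30_000, low_debt=False, years_on_job=2))
```

This readability is the main appeal of rule induction: the model can be inspected and challenged line by line.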
Rule Induction
Description
  Intuitive output
  Handles all forms of numeric data, as well as non-numeric (symbolic) data
C5 Algorithm a special case of rule induction
  Target variable must be symbolic
Source: Laura Squier
Apriori
Description
Seeks association rules in dataset
Market basket analysis
Sequence discovery
Source: Laura Squier
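The core of Apriori is a level-wise search: an itemset can only be frequent if all of its subsets are frequent, so candidates of size k are built from frequent itemsets of size k-1. The sketch below is a simplified version that finds frequent itemsets only (rule generation is omitted); the baskets and the `min_support` count are made-up values.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: count} for all itemsets appearing in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]
freq = apriori(baskets, min_support=2)
for itemset, count in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```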
Neural Networks
Attempt to model neurons in the brain
Learn from a training set and then can be used to detect patterns inherent in that training set
Neural nets are effective when the data is shapeless and lacking any apparent patterns
May be hard to understand results
Neural Network
Output
Hidden layer
Input layer
Source: Laura Squier
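The three layers in the diagram can be illustrated with a forward pass through a tiny network. This is a sketch only: the sigmoid activation, the 2-2-1 layout, and the untrained weights are all assumptions, and no learning step is shown.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """One fully connected layer: weighted sum per neuron, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, output_w, output_b)

# Hypothetical, untrained weights: 2 inputs -> 2 hidden neurons -> 1 output.
out = forward(
    [0.0, 0.0],
    hidden_w=[[0.5, -0.5], [0.3, 0.8]], hidden_b=[0.0, 0.0],
    output_w=[[1.0, -1.0]], output_b=[0.0],
)
print(out)
```

Training would adjust the weights to reduce prediction error on the training set, which is where the long training times come from.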
Neural Networks
Description
Difficult interpretation
Tends to overfit the data
Extensive amount of training time
A lot of data preparation
Works with all data types
Source: Laura Squier
Genetic algorithms
Imitate natural selection processes to evolve models using
  Selection
  Crossover
  Mutation
Each new generation inherits traits from the previous ones until only the most predictive survive.
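The three operators can be sketched on a toy problem: evolving a bit string toward all ones, with the number of ones standing in for predictive quality. The fitness function, population size, truncation selection, and mutation rate are illustrative assumptions rather than a recommended configuration.

```python
import random

def evolve(fitness, genome_length, population_size=20, generations=40,
           mutation_rate=0.05, seed=0):
    """Toy genetic algorithm over bit-string genomes."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(population_size)]
    for _ in range(generations):
        # Selection: keep the fitter half as parents (truncation selection).
        population.sort(key=fitness, reverse=True)
        parents = population[: population_size // 2]
        children = []
        while len(parents) + len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            # Crossover: splice the two parents at a random cut point.
            cut = rng.randrange(1, genome_length)
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)

best = evolve(fitness=sum, genome_length=12)
print(best, sum(best))
```

Because the parents survive unchanged into the next generation, the best fitness in the population never decreases.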
Phases in the DM Process (5)
Model Evaluation
  Evaluation of model: how well it performed on test data
  Methods and criteria depend on model type:
    e.g., coincidence matrix with classification models, mean error rate with regression models
  Interpretation of model: important or not, easy or hard depends on algorithm
Source: Laura Squier
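Both kinds of evaluation can be sketched in a few lines. The coincidence (confusion) matrix counts actual versus predicted labels on test data; for regression, mean absolute error is used here as one possible reading of "mean error rate". All test values are made up.

```python
from collections import Counter

def coincidence_matrix(actual, predicted):
    """Count (actual, predicted) pairs; matching pairs are correct predictions."""
    return Counter(zip(actual, predicted))

def accuracy(actual, predicted):
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical held-out test data for a classifier.
actual = ["good", "good", "bad", "bad", "good"]
predicted = ["good", "bad", "bad", "bad", "good"]
matrix = coincidence_matrix(actual, predicted)
acc = accuracy(actual, predicted)

# Hypothetical held-out test data for a regression model.
mae = mean_absolute_error([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(dict(matrix), acc, mae)
```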
Phases in the DM Process (6)
Deployment
  Determine how the results need to be utilized
    Who needs to use them?
    How often do they need to be used?
  Deploy Data Mining results by:
    Scoring a database
    Utilizing results as business rules
    Interactive scoring on-line
Source: Laura Squier
Specific Data Mining Applications:
Source: Laura Squier
What data mining has done for...
The US Internal Revenue Service needed to improve customer service and...
Scheduled its workforce to provide faster, more accurate answers to questions.
Source: Laura Squier
What data mining has done for...
The US Drug Enforcement Agency needed to be more effective in their drug busts and...
Analyzed suspects' cell phone usage to focus investigations.
Source: Laura Squier
What data mining has done for...
HSBC needed to cross-sell more effectively by identifying profiles that would be interested in higher yielding investments and...
Reduced direct mail costs by 30% while garnering 95% of the campaign's revenue.
Source: Laura Squier
Analytic technology can be effective
Combining multiple models and link analysis can reduce false positives
Today there are millions of false positives with manual analysis
Data Mining is just one additional tool to help analysts
Analytic Technology has the potential to reduce the current high rate of false positives
Source: Gregory Piatetsky-Shapiro
Data Mining with Privacy
Data Mining looks for patterns, not people!
Technical solutions can limit privacy invasion
  Replacing sensitive personal data with anon. ID
  Give randomized outputs
  Multi-party computation - distributed data
Bayardo & Srikant, Technological Solutions for Protecting Privacy, IEEE Computer, Sep 2003
Source: Gregory Piatetsky-Shapiro
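The first technique listed, replacing sensitive identifiers with anonymous IDs, can be sketched with a keyed hash from Python's standard library. This shows one possible approach, not the method described by Bayardo & Srikant; the key, the records, and the 16-character truncation are assumptions for illustration.

```python
import hashlib
import hmac

def anonymize(customer_id: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash: records still link, the ID is hidden."""
    digest = hmac.new(secret_key, customer_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

# Hypothetical key and records; a real key would be stored separately and securely.
key = b"hypothetical-secret-key"
records = [
    ("alice@example.com", 120.0),
    ("bob@example.com", 75.5),
    ("alice@example.com", 30.0),
]
anon_records = [(anonymize(cid, key), amount) for cid, amount in records]
print(anon_records)
```

The same person always maps to the same anonymous ID, so patterns across records survive while the identity does not appear in the mined data.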
The Hype Curve for Data Mining and Knowledge Discovery
[Figure: Expectations and Performance curves over 1990, 1998, 2000, 2002, with phases labeled rising expectations, over-inflated expectations, disappointment, growing acceptance and mainstreaming]
Source: Gregory Piatetsky-Shapiro