Successful Data Mining in Practice Richard D. De Veaux Williams College May 19, 2009 [email protected]
Transcript
Page 1: Successful Data Mining

Successful Data Mining in Practice

Richard D. De Veaux, Williams College, May 19, 2009, [email protected]

Page 2: Successful Data Mining

JMP Visual Data Mining

Where is Massachusetts?

Page 3: Successful Data Mining

Where in Massachusetts?

Page 4: Successful Data Mining

Williams College

Page 5: Successful Data Mining

Williams College

Page 6: Successful Data Mining

Reason for Data Mining

Data = $$

Page 7: Successful Data Mining

Data Mining Is…

“the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.” --- Fayyad

“finding interesting structure (patterns, statistical models, relationships) in data bases.” --- Fayyad, Chaudhuri and Bradley

“a knowledge discovery process of extracting previously unknown, actionable information from very large data bases” --- Zornes

“a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions.” --- Edelstein

Page 8: Successful Data Mining

What is Data Mining?

Page 9: Successful Data Mining

Data Mining Models – a partial list

Traditional statistical models
  Linear regression, logistic regression, splines, smoothers, etc.
  Vendors are adding these to DM software

Clustering and Visualization

Neural networks

Decision trees

Naïve Bayes

K Nearest Neighbor Methods

K-means

Combining Models – Bagging and Boosting

Page 10: Successful Data Mining

What makes Data Mining Different?

Massive amounts of data
  Number of rows (cases)
  Number of columns (variables)

Low signal to noise
  Many irrelevant variables
  Subtle relationships
  Variation

UPS
  16 TB – U.S. Library of Congress
  Mostly tracking

Facebook
  200-300 TB of photos every month

Google
  1 PB every 72 minutes

Page 11: Successful Data Mining

Why Is Data Mining Taking Off Now?

Because we can
  Computer power
  The price of digital storage is near zero

Data warehouses already built

Companies want return on data investment

Page 12: Successful Data Mining

Users are Also Different

Users
  Domain experts, not statisticians
  Have too much data
  Want automatic methods
  Want useful information without spending all their time doing statistical analysis

Page 13: Successful Data Mining

Data Mining Myths

Find answers to unasked questions

Continuously monitor your data base for interesting patterns

Eliminate the need to understand your business

Eliminate the need to collect good data

Eliminate the need to have good data analysis skills

Page 14: Successful Data Mining

Examples of Data Mining Applications -- Customer Relationship Management

Transactional Data
  Customer retention
  Upselling opportunities
  Customer optimization across different areas

Marketing Experiments
  Often, new hypotheses are generated by data mining a planned experiment.
  Segmentation

Page 15: Successful Data Mining

Manufacturing Applications

Product reliability and quality control

Process control
  What can I do to improve batch yields?

Warranty analysis
  Product problems
  Service assessment
  Adverse experiences – link to production

Page 16: Successful Data Mining

Medical Applications

Medical procedure effectiveness
  Who are good candidates for surgery?

Physician effectiveness
  Which tests are ineffective?

Which physicians are likely to over-prescribe treatments?

What combinations of tests are most effective?

Page 17: Successful Data Mining

E-commerce

Automatic web page design

Recommendations for new purchases

Cross selling

Social Network Marketing

Page 18: Successful Data Mining

Pharmaceutical Applications

High throughput screening
  Predict actions in assays
  Predict results in animals or humans

Rational drug design
  Relating chemical structure with chemical properties
  Inverse regression to predict chemical structure from desired properties

DNA SNPs (“snips”)

Genomics
  Associate genes with diseases
  Find relationships between genotype and drug response (e.g., dosage requirements, adverse effects)
  Find individuals most susceptible to placebo effect

Page 19: Successful Data Mining

Pharmaceutical Applications

Combine clinical trial results with extensive medical/demographic information

Non-traditional uses of clinical trial data warehouse to explore:
  Prediction of adverse experiences – combining more than one trial
  Who is likely to be non-compliant or drop out?
  What are alternative (i.e., non-approved) uses supported by the data?

Page 20: Successful Data Mining

Fraud and Terrorist Detection

Identify false:
  Medical insurance claims
  Accident insurance claims

Which stock trades are based on insider information?

Whose cell phone numbers have been stolen?

Which credit card transactions are from stolen cards?

Which documents are “interesting”?

When are changes in networks signs of potential illegal activity?

Page 21: Successful Data Mining

Lesson 1: Learn to Make Friends

PVA (Paralyzed Veterans of America) is a philanthropic organization, sanctioned by the US Govt to represent the disabled veterans

They send out 4 million “free gifts” every 6 weeks and hope for donations

Data were used for the KDD 1998 cup
  200,000 donors (100,000 training, 100,000 test)
  481 demographic variables
    Past giving, income, age, etc.
  Recent campaign (only for training set)
    Did they give? (Target B)
    How much did they give? (Target D)

To optimize profit, who should receive the current solicitation?

What is the most cost effective strategy?

Page 22: Successful Data Mining

What’s “Hard”? -- Example

Page 23: Successful Data Mining

T-Code

Page 24: Successful Data Mining

More Tcode

Page 25: Successful Data Mining

Transformation?

Page 26: Successful Data Mining

Categories?

Page 27: Successful Data Mining

What does it mean?

T-Code  Title
0       _
1       MR.
1001    MESSRS.
1002    MR. & MRS.
2       MRS.
2002    MESDAMES
3       MISS
3003    MISSES
4       DR.
4002    DR. & MRS.
4004    DOCTORS
5       MADAME
6       SERGEANT
9       RABBI
10      PROFESSOR
10002   PROFESSOR & MRS.
10010   PROFESSORS
11      ADMIRAL
11002   ADMIRAL & MRS.
12      GENERAL
12002   GENERAL & MRS.
13      COLONEL
13002   COLONEL & MRS.
14      CAPTAIN
14002   CAPTAIN & MRS.
15      COMMANDER
15002   COMMANDER & MRS.
16      DEAN
17      JUDGE
17002   JUDGE & MRS.
18      MAJOR
18002   MAJOR & MRS.
19      SENATOR
20      GOVERNOR
21002   SERGEANT & MRS.
22002   COLNEL & MRS.
24      LIEUTENANT
26      MONSIGNOR
27      REVEREND
28      MS.
28028   MSS.
29      BISHOP
31      AMBASSADOR
31002   AMBASSADOR & MRS
33      CANTOR
36      BROTHER
37      SIR
38      COMMODORE
40      FATHER
42      SISTER
43      PRESIDENT
44      MASTER
46      MOTHER
47      CHAPLAIN
48      CORPORAL
50      ELDER
56      MAYOR
59002   LIEUTENANT & MRS.
62      LORD
63      CARDINAL
64      FRIEND
65      FRIENDS
68      ARCHDEACON
69      CANON
70      BISHOP
72002   REVEREND & MRS.
73      PASTOR
75      ARCHBISHOP
85      SPECIALIST
87      PRIVATE
89      SEAMAN
90      AIRMAN
91      JUSTICE
92      MR. JUSTICE
100     M.
103     MLLE.
104     CHANCELLOR
106     REPRESENTATIVE
107     SECRETARY
108     LT. GOVERNOR
109     LIC.
111     SA.
114     DA.
116     SR.
117     SRA.
118     SRTA.
120     YOUR MAJESTY
122     HIS HIGHNESS
123     HER HIGHNESS
124     COUNT
125     LADY
126     PRINCE
127     PRINCESS
128     CHIEF
129     BARON
130     SHEIK
131     PRINCE AND PRINCESS
132     YOUR IMPERIAL MAJEST
135     M. ET MME.
210     PROF.

Page 28: Successful Data Mining

Relational Data Bases

Data are stored in tables

Items

ItemID  ItemName   price
C56621  top hat    34.95
T35691  cane        4.99
RS5292  red shoes  22.95

Shoppers

Person ID  person name  ZIPCODE  item bought
135366     Lyle         19103    T35691
135366     Lyle         19103    C56621
259835     Dick         01267    RS5292
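
A modeling tool usually wants one flat table, so the two tables above have to be joined on the item key first. The following is a minimal sketch of that join in plain Python; the dictionary field names (`PersonID`, `item_bought`, and so on) and the helper functions are illustrative assumptions, not part of the original slides.

```python
# Hypothetical in-memory copies of the Items and Shoppers tables.
items = [
    {"ItemID": "C56621", "ItemName": "top hat", "price": 34.95},
    {"ItemID": "T35691", "ItemName": "cane", "price": 4.99},
    {"ItemID": "RS5292", "ItemName": "red shoes", "price": 22.95},
]
shoppers = [
    {"PersonID": "135366", "name": "Lyle", "zipcode": "19103", "item_bought": "T35691"},
    {"PersonID": "135366", "name": "Lyle", "zipcode": "19103", "item_bought": "C56621"},
    {"PersonID": "259835", "name": "Dick", "zipcode": "01267", "item_bought": "RS5292"},
]


def join_purchases(shoppers, items):
    """Attach item name and price to each purchase via the ItemID key."""
    items_by_id = {item["ItemID"]: item for item in items}
    joined = []
    for row in shoppers:
        item = items_by_id[row["item_bought"]]
        joined.append({**row, "ItemName": item["ItemName"], "price": item["price"]})
    return joined


def total_spent(joined):
    """Sum the price of all purchases per person name."""
    totals = {}
    for row in joined:
        totals[row["name"]] = totals.get(row["name"], 0.0) + row["price"]
    return totals


flat_file = join_purchases(shoppers, items)
spend = total_spent(flat_file)
print(spend)
```

Note that ZIP codes are kept as strings so that the leading zero in 01267 survives.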

Page 29: Successful Data Mining

Metadata

The data survey describes the data set contents and characteristics
  Table name
  Description
  Primary key/foreign key relationships
  Collection information: how, where, conditions
  Timeframe: daily, weekly, monthly
  Cosynchronous: every Monday or Tuesday

Page 30: Successful Data Mining

Data Preparation

Build data mining database

Explore data

Prepare data for modeling

60% to 95% of the time is spent preparing the data


Page 31: Successful Data Mining

Data Challenges

Data definitions
  Types of variables

Data consolidation
  Combine data from different sources
  NASA Mars lander

Data heterogeneity
  Homonyms
  Synonyms

Data quality

Page 32: Successful Data Mining

Data Quality

Page 33: Successful Data Mining

Missing Values

Random missing values
  Delete row?
    Paralyzed Veterans
  Substitute value
    Imputation
    Multiple Imputation
    JMP 8 (!)

Systematic missing data
  Now what?
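
The simplest form of "substitute value" is single mean imputation. The sketch below is only an illustration with made-up ages, not the method used in the talk or in JMP; it also returns a missing-value flag, since under systematic missingness the fact that a value is missing may itself carry information.

```python
from statistics import mean


def impute_mean(values):
    """Replace None with the mean of the observed values.

    Returns the imputed list and a parallel list of missing flags.
    """
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return imputed, was_missing


ages = [62, None, 45, 71, None, 58]  # made-up example values
imputed_ages, age_missing = impute_mean(ages)
print(imputed_ages)
print(age_missing)
```

Filling every gap with the same value understates the variability of the column, which is one motivation for multiple imputation.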

Page 34: Successful Data Mining

Missing Values -- Systematic

Credit Card Bank finds that “Income” field is missing

Wharton Ph.D. Student questionnaire on survey attitudes

Bowdoin college applicants have mean SAT verbal score above 750

Clinical Trial of Depression Medication – what does missing mean?

Page 35: Successful Data Mining

Results for PVA Data Set

If the entire list (100,000 donors) is mailed, net donation is $10,500

Using data mining techniques, this was increased by 41.37%

Page 36: Successful Data Mining

KDD CUP 98 Results

Page 37: Successful Data Mining

KDD CUP 98 Results 2

Page 38: Successful Data Mining

Students in Data Mining Class

Student #1 $15,024
Student #2 $14,695
Student #3 $14,345

Page 39: Successful Data Mining

Data Mining and OLAP

On-line analytical processing (OLAP): users deductively analyze data to verify hypotheses

Descriptive, not predictive

Data mining: software uses data to inductively find patterns – models!

Predictive or descriptive

Associations?
  Most associated variables in the census
  Most associated variables in a supermarket
  Association Rules

Page 40: Successful Data Mining

Why Models?

Beer and Diapers
  “In the convenience stores we looked at, on Friday nights, purchases of beer and purchases of diapers are highly associated”
  Conclusions?
  Actions?

Page 41: Successful Data Mining

Beer and Diapers

Picture from Tandem™ ad

Page 42: Successful Data Mining

Models

Models are:
  Powerful summaries for understanding
  Used for exploration and prediction

Of course, models are not reality

George Box
  “All models are wrong, but some are useful”
  “Statisticians, like artists, have the bad habit of falling in love with their models”.

Page 43: Successful Data Mining

Twyman’s Law and Corollaries

“If it looks interesting, it must be wrong”

De Veaux’s Corollary 1 to Twyman’s Law
  “If it’s perfect, it’s wrong”

De Veaux’s Corollary 2 to Twyman’s Law
  “If it isn’t wrong, you probably knew it already”

Page 44: Successful Data Mining

Lesson 2 – An Example of Twyman’s Law

Ingot cracking
  3935 30,000 lb. ingots
  Up to 25% cracking rate
  $30,000 per recast
  90 potential explanatory variables
    Water composition (reduced)
    Metal composition
    Process variables
    Other environmental variables

Page 45: Successful Data Mining

Data Processing

Five months to consolidate process data

Three months to analyze and reduce dimension of water data

Eight months after starting projects, statisticians received flat file:

  960 ingots (rows)
  149 variables

Page 46: Successful Data Mining

Decision Trees – Mortgage Defaults

[Figure: decision tree with Yes/No splits on Household Income > $40000, Debt > $10000, and On Job > 5 Yr; leaf default rates 0.05, 0.01, 0.06, 0.11]

Page 47: Successful Data Mining

Decision Tree -- Titanic

[Figure: classification tree for Titanic survival with splits on sex (M/F), age (Child/Adult), and class (1st, 2nd, 3rd, Crew); leaf survival rates 46%, 93%, 27%, 100%, 33%, 23%, 14%]

Page 48: Successful Data Mining

Cook County Hospital -- “ER”

The 3 “Urgent” Risk Factors:

1. Is the reported pain unstable angina?

2. Is there fluid in patient’s lungs?

3. Is the patient’s systolic BP < 100?

The ECG Tests:

• MI: myocardial infarction (heart attack)

• Ischemia – Heart muscle not getting enough blood

Page 49: Successful Data Mining

Confusion Matrix

Doctors in ER

                         No Heart Attack  Actual Heart Attack
Predict No Heart Attack  0.25             0.11
Predict Heart Attack     0.75             0.89

Tree Algorithm (Goldman)

                         No Heart Attack  Actual Heart Attack
Predict No Heart Attack  0.92             0.04
Predict Heart Attack     0.08             0.96

Page 50: Successful Data Mining

Regression Tree

[Figure: regression tree with splits Price<9446.5, Weight<2280, Disp.<134, Weight<3637.5, Price<11522, Reliability:abde, HP<154; leaf means 34.00, 30.17, 26.22, 24.17, 21.67, 20.40, 22.57, 18.60]

Page 51: Successful Data Mining

Ingots – First Tree

Node                              Count  Mean    Std Dev
All Rows                          3935   0.2356  0.4244
Alloy (6045,7348,8234,2345,3234)  3005   0.159   0.3661
Alloy (5434,5894,2439)            930    0.4817  0.4999

We know that – some alloys are hard to make. That’s why we gave you the data in the first place.

Page 52: Successful Data Mining

Second Tree

Node      Count  Mean    Std Dev
All Rows  3935   0.2356  0.4244
MG<3.9    3055   0.1673  0.3723
MG>=3.9   880    0.4727  0.4999

What do you think is in those alloys?

Page 53: Successful Data Mining

One More Time

Looks like Chrome

OH!

Did that solve it? No, but
  Experimental design
  Enabled us to focus on important variables

Oh, that’s funny!

-Isaac Asimov

Page 54: Successful Data Mining

What did we learn?

Data mining gave clues for generating hypotheses

Followed up with DOE

DOE led to substantial process improvement

Page 55: Successful Data Mining

Herb’s Tree – Twyman’s Law again

Node         Count  G^2        Prob (Level 0)  Prob (Level 1)
All Rows     94649  37928.436  0.9494          0.0506
TARGET_D>=1  4792   0          0.0000          1.0000
TARGET_D<1   89857  0          1.0000          0.0000
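
A G^2 of exactly 0 in both child nodes means the split is perfect, which is the warning sign: TARGET_D (how much they gave) is just another form of the response. As a rough check, the sketch below recomputes the node statistics from the counts shown, assuming G^2 is the usual likelihood-ratio quantity 2 * sum(n_i * ln(n / n_i)); that formula is an assumption about how the software defines G^2, although it does reproduce the root value on the slide.

```python
import math


def g_squared(counts):
    """Likelihood-ratio G^2 of a node: 2 * sum(n_i * ln(n / n_i)) over non-empty levels."""
    n = sum(counts)
    return 2.0 * sum(c * math.log(n / c) for c in counts if c > 0)


root = g_squared([89857, 4792])   # All Rows: level 0, level 1
left = g_squared([0, 4792])       # TARGET_D >= 1: everyone gave
right = g_squared([89857, 0])     # TARGET_D < 1: nobody gave

print(round(root, 1), left, right)
```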

Page 56: Successful Data Mining

Doing it Right – Knowledge Discovery

Define business problem

Build data mining database

Explore data

Prepare data for modeling

Build model

Evaluate model

Deploy model and results

Note: This process model borrows from CRISP-DM: CRoss Industry Standard Process for Data Mining

Page 57: Successful Data Mining

Successful Data Mining

The keys to success:
  Formulating the problem
  Using the right data
  Flexibility in modeling
  Acting on results

Success depends more on the way you mine the data than on the specific tool

Page 58: Successful Data Mining

Types of Models

Descriptions

Classification (categorical or discrete values)

Regression (continuous values)

Time series (continuous values)

Clustering

Association

Page 59: Successful Data Mining

Model Building

Model building
  Train
  Test

Evaluate

Page 60: Successful Data Mining

Overfitting in Regression

Classical overfitting:
  Fit 6th order polynomial to 6 data points

[Figure: polynomial fit through 6 data points; horizontal axis -1 to 6, vertical axis 0.0 to 3.5]
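
To see why this is overfitting, note that a polynomial with as many coefficients as data points can pass through every point exactly. The sketch below uses a degree-5 interpolating polynomial (six coefficients for six points) on made-up data; the data values and the function name are illustrative assumptions, not the data behind the slide's figure.

```python
def lagrange_predict(xs, ys, x):
    """Evaluate at x the unique polynomial of degree len(xs) - 1 through the points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 1.4, 2.6, 2.0, 3.1]  # made-up noisy, roughly increasing data

# Training error is zero: the curve reproduces every training point.
train_errors = [abs(lagrange_predict(xs, ys, x) - y) for x, y in zip(xs, ys)]

# One step outside the data, the prediction is far from anything observed.
extrapolated = lagrange_predict(xs, ys, 6.0)

print(max(train_errors), extrapolated)
```

A perfect fit on the training points says nothing about new points, which is why the later slides report R-squared separately for train and test.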

Page 61: Successful Data Mining

Overfitting

Fitting non-explanatory variables to data

Overfitting is the result of
  Including too many predictor variables
  Lack of regularizing the model
    Neural net run too long
    Decision tree too deep

Page 62: Successful Data Mining

Avoiding Overfitting

Avoiding overfitting is a balancing act – Occam’s Razor
  Fit fewer variables rather than more
  Have a reason for including a variable (other than it is in the database)
  Regularize (don’t overtrain)
  Know your field.

All models should be as simple as possible but no simpler than necessary

Albert Einstein

Page 63: Successful Data Mining

“Toy” Problem

[Figure: ten scatterplot panels of train2$y (axis 5 to 25) against each predictor train2[, i] (axis 0.0 to 1.0)]

Page 64: Successful Data Mining

Tree Model

[Figure: large regression tree with root split x4<0.512146 and further splits on x1, x2, x3, x4, x5 and x8; leaf predictions range from 4.400 to 25.320]

R-squared 82.3% Train 67.2% Test

Page 65: Successful Data Mining

Predictions for Example

R-squared 82.3% Train 67.2% Test

[Figure: scatterplot of y against y Predictor]

Page 66: Successful Data Mining

Tree Advantages

Model explains its reasoning -- builds rules

Build model quickly

Handles non-numeric data

No problems with missing data
  Missing data as a new value
  Surrogate splits

Works fine with many dimensions

Page 67: Successful Data Mining

What’s Wrong With Trees?

Outputs are step functions – big errors near boundaries

Greedy algorithms for splitting – small changes change model

Uses less data after every split

Model has high order interactions -- all splits are dependent on previous splits

Often non-interpretable

Page 68: Successful Data Mining

Linear Regression

Term       Estimate  Std Error  t Ratio  Prob>|t|
Intercept  -0.900    0.482      -1.860   0.063
x1         4.658     0.292      15.950   <.0001
x2         4.685     0.294      15.920   <.0001
x3         -0.040    0.291      -0.140   0.892
x4         9.806     0.298      32.940   <.0001
x5         5.361     0.281      19.090   <.0001
x6         0.369     0.284      1.300    0.194
x7         0.001     0.291      0.000    0.998
x8         -0.110    0.295      -0.370   0.714
x9         0.467     0.301      1.550    0.122
x10        -0.200    0.289      -0.710   0.479

R-squared: 73.5% Train 69.4% Test

Page 69: Successful Data Mining

Stepwise Regression

Term       Estimate  Std Error  t Ratio  Prob>|t|
Intercept  -0.625    0.309      -2.019   0.0439
x1         4.619     0.289      15.998   <.0001
x2         4.665     0.292      15.984   <.0001
x4         9.824     0.296      33.176   <.0001
x5         5.366     0.28       19.145   <.0001

R-squared 73.4% on Train 69.8% Test

Page 70: Successful Data Mining

Stepwise 2nd Order Model

Term                       Estimate  Std Error  t Ratio  Prob>|t|
Intercept                  -2.026    0.264      -7.68    <.0001
x1                         4.311     0.184      23.47    <.0001
x2                         4.808     0.185      26.04    <.0001
x3                         -0.506    0.181      -2.79    0.0054
x4                         10        0.186      53.79    <.0001
x5                         5.212     0.176      29.67    <.0001
x8                         -0.181    0.186      -0.97    0.3301
x9                         0.427     0.188      2.28     0.0232
(x1-0.51811)*(x1-0.51811)  -0.932    0.711      -1.31    0.1905
(x2-0.48354)*(x1-0.51811)  8.972     0.634      14.14    <.0001
(x3-0.48517)*(x1-0.51811)  -1.367    0.65       -2.1     0.0358
(x3-0.48517)*(x2-0.48354)  -0.8      0.639      -1.25    0.2111
(x3-0.48517)*(x3-0.48517)  20.515    0.69       29.71    <.0001
(x4-0.49647)*(x1-0.51811)  1.014     0.651      1.56     0.1197
(x4-0.49647)*(x2-0.48354)  -1.159    0.65       -1.78    0.075
(x5-0.50509)*(x2-0.48354)  -0.794    0.62       -1.28    0.2008
(x5-0.50509)*(x3-0.48517)  1.105     0.619      1.78     0.0748
(x5-0.50509)*(x4-0.49647)  0.127     0.635      0.2      0.8414
(x8-0.52029)*(x5-0.50509)  1.065     0.63       1.69     0.0914

R-squared 89.9% Train 88.8% Test

Page 71: Successful Data Mining

Next Steps

Higher order terms?

When to stop?

Transformations?

Too simple: underfitting – bias

Too complex: inconsistent predictions, overfitting – high variance

Selecting models is Occam’s razor
  Keep goals of interpretation vs. prediction in mind

Page 72: Successful Data Mining

Logistic Regression

What happens if we use linear regression on 1-0 (yes/no) data?

[Figure: 1-0 response plotted against Income (20000 to 80000); vertical axis 0.0 to 1.0]

Page 73: Successful Data Mining

Logistic Regression II

Points on the line can be interpreted as probability, but don’t stay within [0,1]

Use a sigmoidal function instead of linear function to fit the data

$$f(I) = \frac{1}{1 + e^{-I}}$$

[Figure: sigmoid curve; horizontal axis -10 to 10, vertical axis 0 to 1.2]
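
A small sketch of the difference: a straight line in income can leave [0,1], while the same linear score passed through the sigmoid cannot. The intercepts and slopes below are made-up numbers chosen for illustration, not fitted values from the talk.

```python
import math


def logistic(i):
    """Sigmoid f(I) = 1 / (1 + exp(-I)), written to avoid overflow for large |I|."""
    if i >= 0:
        return 1.0 / (1.0 + math.exp(-i))
    e = math.exp(i)
    return e / (1.0 + e)


def linear_probability(income, intercept, slope):
    """Straight-line 'probability': nothing keeps it inside [0, 1]."""
    return intercept + slope * income


def logistic_probability(income, intercept, slope):
    """Same linear score, squashed into [0, 1] by the sigmoid."""
    return logistic(intercept + slope * income)


for income in (20000, 50000, 80000):
    lin = linear_probability(income, intercept=-0.3, slope=0.00002)
    log = logistic_probability(income, intercept=-4.0, slope=0.0001)
    print(income, round(lin, 3), round(log, 3))
```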

Page 74: Successful Data Mining

Logistic Regression III

[Figure: logistic fit of Accept (0.0 to 1.0) against Income (20000 to 80000)]

Page 75: Successful Data Mining

Neural Nets

Don’t resemble the brain
  Are a statistical model
  Closest relative is projection pursuit regression

Page 76: Successful Data Mining

A Single Neuron

[Figure: a single neuron with inputs x1 to x5 (weights 0.3, 0.7, -0.2, 0.4, -0.5) and constant input x0 (weight 0.8) feeding Input (z1), with Output h(z1)]

z1 = 0.8 + .3x1 + .7x2 - .2x3 + .4x4 - .5x5
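
The neuron above is just a weighted sum followed by a squashing function. The sketch below uses the weights and bias from the slide; the slide does not say which function h is, so the use of tanh here is an assumption.

```python
import math

WEIGHTS = [0.3, 0.7, -0.2, 0.4, -0.5]  # weights on x1..x5 from the slide
BIAS = 0.8                             # weight on the constant input x0


def neuron_input(x):
    """z1 = 0.8 + .3x1 + .7x2 - .2x3 + .4x4 - .5x5"""
    return BIAS + sum(w * xi for w, xi in zip(WEIGHTS, x))


def neuron_output(x, h=math.tanh):
    """h(z1); tanh is an assumed choice of activation."""
    return h(neuron_input(x))


z1 = neuron_input([1.0, 1.0, 1.0, 1.0, 1.0])
print(z1, neuron_output([1.0, 1.0, 1.0, 1.0, 1.0]))
```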

Page 77: Successful Data Mining

Single Node

Input to outer layer from “hidden node”:

$$I = z_l = \sum_j w_{1jl}\, x_j + \theta_l$$

Output:

$$\hat{y}_k = h(z_l)$$

Page 78: Successful Data Mining

Layered Architecture

[Figure: network with an input layer (x1, x2), a hidden layer (z1, z2, z3), and an output layer (y)]

Page 79: Successful Data Mining

Neural Networks

Create lots of features – hidden nodes

$$z_l = \sum_j w_{1jl}\, x_j + \theta_l$$

Use them in an additive model:

$$\hat{y}_k = w_{21}\, h(z_1) + w_{22}\, h(z_2) + \dots + \theta_k$$

Page 80: Successful Data Mining

Put It Together

$$\hat{y}_k = \tilde{h}\left(\sum_l w_{2kl}\, h\left(\sum_j w_{1jl}\, x_j + \theta_l\right) + \theta_k\right)$$

The resulting model is just a flexible non-linear regression of the response on a set of predictor variables.
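
Written as code, the whole network is a few lines of arithmetic. The following is a minimal sketch of a forward pass for one hidden layer and a single output; the function name, argument layout, tanh hidden activation, and identity output function are all assumptions made for illustration.

```python
import math


def forward(x, w1, theta1, w2, theta2, h=math.tanh, h_out=lambda v: v):
    """One-hidden-layer network with a single output.

    w1[l] holds the input weights of hidden node l and theta1[l] its bias;
    w2[l] is the output weight on hidden node l and theta2 the output bias.
    """
    hidden = []
    for weights_l, theta_l in zip(w1, theta1):
        z_l = sum(w * xj for w, xj in zip(weights_l, x)) + theta_l
        hidden.append(h(z_l))
    return h_out(sum(w * a for w, a in zip(w2, hidden)) + theta2)


# Two inputs and three hidden nodes, as in the layered architecture slide.
y_hat = forward(
    x=[0.5, -1.0],
    w1=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    theta1=[0.0, 0.0, 0.0],
    w2=[1.0, 1.0, 1.0],
    theta2=0.5,
)
print(y_hat)
```

Every weight and bias is a free parameter, which is where both the flexibility and the risk of fitting noise come from.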

Page 81: Successful Data Mining

Predictions for Example

R-squared 89.5% Train 87.7% Test

Page 82: Successful Data Mining

What Does This Get Us?

Enormous flexibility

Ability to fit anything
  Including noise

Interpretation?

Page 83: Successful Data Mining

Neural Net Pro

Advantages
  Handles continuous or discrete values
  Complex interactions
  In general, highly accurate for fitting due to flexibility of model
  Can incorporate known relationships
    So called grey box models
    See De Veaux et al, Environmetrics 1999

Page 84: Successful Data Mining

Neural Net Con

Disadvantages
  Model is not descriptive (black box)
  Difficult, complex architectures
  Slow model building
  Categorical data explosion
  Sensitive to input variable selection

Page 85: Successful Data Mining

K-Nearest Neighbors (KNN)

To predict y for an x:
  Find the k most similar x's
  Average their y's

Find k by cross validation

No training (estimation) required

Works embarrassingly well
  Friedman, KDDM 1996
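
The recipe above fits in a few lines. The sketch below assumes Euclidean distance as the similarity measure and leave-one-out error as the form of cross validation; both are illustrative choices, and the data are made up.

```python
import math


def knn_predict(train_x, train_y, x, k):
    """Average the y's of the k training points closest to x (Euclidean distance)."""
    by_distance = sorted(
        zip(train_x, train_y),
        key=lambda pair: math.dist(pair[0], x),
    )
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / k


def loo_error(train_x, train_y, k):
    """Leave-one-out mean squared error for a given k."""
    total = 0.0
    for i in range(len(train_x)):
        rest_x = train_x[:i] + train_x[i + 1:]
        rest_y = train_y[:i] + train_y[i + 1:]
        pred = knn_predict(rest_x, rest_y, train_x[i], k)
        total += (pred - train_y[i]) ** 2
    return total / len(train_x)


def choose_k(train_x, train_y, candidates):
    """Pick the candidate k with the smallest leave-one-out error."""
    return min(candidates, key=lambda k: loo_error(train_x, train_y, k))


train_x = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]  # made-up data
train_y = [0.0, 1.0, 2.0, 3.0, 10.0]

prediction = knn_predict(train_x, train_y, (1.4,), k=2)
best_k = choose_k(train_x, train_y, [1, 2, 3])
print(prediction, best_k)
```

There is no fitting step: all the work happens at prediction time, which is what "no training required" means here.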

Page 86: Successful Data Mining

Collaborative Filtering

Goal: predict what movies people will like

Data: list of movies each person has watched
  Lyle: André, Starwars
  Ellen: André, Starwars, Destin
  Fred: Starwars, Batman
  Dean: Starwars, Batman, Rambo
  Jason: Destin d’Amélie Poulin, Caché

Page 87: Successful Data Mining

Data Base

Data can be represented as a sparse matrix

Karen likes André. What else might she like?

CDNow doubled e-mail responses

Starwars  Rambo  Batman  My Dinner w/André  Destin d'Amélie  Caché

Lyle y y
Ellen y y y
Fred y y
Dean y y y
Jason y y y

Karen ? ? ? ? ? ?
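
One simple way to answer the Karen question is a nearest-neighbor style recommender: weight each person by how much their list overlaps with hers, then score the movies she has not seen. The sketch below uses the viewing lists from the collaborative filtering slide; treating Jason's "Destin d’Amélie Poulin" as the same film as Ellen's "Destin", the Jaccard similarity, and the scoring rule are all assumptions for illustration, not the method CDNow used.

```python
watched = {
    "Lyle": {"André", "Starwars"},
    "Ellen": {"André", "Starwars", "Destin"},
    "Fred": {"Starwars", "Batman"},
    "Dean": {"Starwars", "Batman", "Rambo"},
    "Jason": {"Destin", "Caché"},
}


def jaccard(a, b):
    """Overlap between two sets of movies: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(liked, watched):
    """Score each unseen movie by the summed similarity of the people who watched it."""
    scores = {}
    for person, movies in watched.items():
        sim = jaccard(liked, movies)
        for movie in movies - liked:
            scores[movie] = scores.get(movie, 0.0) + sim
    # Highest score first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


karen = recommend({"André"}, watched)
for movie, score in karen:
    print(movie, round(score, 3))
```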

Page 88: Successful Data Mining

Lesson 3: Know When to Hold ‘em

Breast cancer data from mammograms

Error rates by trained radiologists are near 25% for both false positives and false negatives

Newer equipment is prohibitively expensive for the developing world

Early detection of breast cancer is crucial

Cumulative type I error over a decade is near 100% leading to needless biopsies

Page 89: Successful Data Mining

The Data

1618 mammograms showing clustered microcalcifications

Biostatistics Dept Institut Curie

Variables
  Response: Malignant or not
  Predictors: Age, Tissue Type (light/dense), Size (mm), Number of microcalc, Number of suspicious clusters, Shape of microcalc (1-5), Polyshape? (y/n), Shape of cluster (1,2,3), Retro (cluster near nipple?), Deep? (y/n)

Page 90: Successful Data Mining

Tree model

Page 91: Successful Data Mining

Combining Models

In the 1950s forecasters found that combining forecasting models worked better on average than any single forecast model

Reduces variance by averaging

Can reduce bias if collection is broader than single model

Page 92: Successful Data Mining

Bagging and Boosting

Bagging (Bootstrap Aggregation)
  Bootstrap a data set repeatedly
  Take many versions of same model (e.g. tree)
    Random Forest Variation
  Form a committee of models
  Take majority rule of predictions

Boosting
  Create repeated samples of weighted data
  Weights based on misclassification
  Combine by majority rule, or linear combination of predictions
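
The bagging steps above can be sketched with the smallest possible tree, a one-split "stump" on a single predictor. Everything here (the stump learner, the toy data, the committee size of 25) is an illustrative assumption rather than the setup used for the mammogram results.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """One-split tree on a single feature: choose the threshold with fewest training errors."""
    best = None
    for t in sorted(set(xs)):
        for left_label in (0, 1):
            preds = [left_label if x < t else 1 - left_label for x in xs]
            errors = sum(p != y for p, y in zip(preds, ys))
            if best is None or errors < best[0]:
                best = (errors, t, left_label)
    _, threshold, left_label = best
    return threshold, left_label


def predict_stump(stump, x):
    threshold, left_label = stump
    return left_label if x < threshold else 1 - left_label


def bag(xs, ys, n_models, seed=0):
    """Fit one stump per bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    committee = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        committee.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return committee


def predict_bagged(committee, x):
    """Majority rule over the committee's predictions."""
    votes = Counter(predict_stump(stump, x) for stump in committee)
    return votes.most_common(1)[0][0]


xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]  # made-up data
ys = [0, 0, 0, 0, 1, 1, 1, 1]
committee = bag(xs, ys, n_models=25)
print(predict_bagged(committee, 0.05), predict_bagged(committee, 0.95))
```

An odd committee size avoids tied votes for a two-class response. Boosting differs in that each new model is fit to reweighted data that emphasizes earlier mistakes, which this sketch does not attempt.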

Page 93: Successful Data Mining

Results

                False Positives  False Negatives
Simple Tree     32.20%           33.70%
Neural Network  25.50%           31.70%
Boosted Trees   24.90%           32.50%
Bagged Trees    19.30%           28.80%
Radiologists    22.40%           35.80%

• Split data into train and test (62.5% - 37.5%)

• Repeat random splits 1000 times
  For each iteration, count false positives and false negatives on the 600 test set cases

Page 94: Successful Data Mining

How Do We Really Start?

Life is not so kind
  Categorical variables
  Missing data
  500 variables, not 10

481 variables – where to start?

Page 95: Successful Data Mining

Where to Start

Three rules of data analysis
  Draw a picture
  Draw a picture
  Draw a picture

Ok, but how? There are 90 histogram/bar charts and 4005 scatterplots to look at (or at least 90 if you look only at y vs. X)
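
The 4005 is simply the number of pairs among 90 variables. A quick check, with the same count for the 481-variable PVA data added as an extra illustration:

```python
from math import comb

n_variables = 90
histograms = n_variables              # one histogram or bar chart per variable
scatterplots = comb(n_variables, 2)   # one scatterplot per pair: 90 * 89 / 2

pva_scatterplots = comb(481, 2)       # same count for 481 variables

print(histograms, scatterplots, pva_scatterplots)
```

The number of pairwise plots grows with the square of the number of variables, so looking at everything stops being practical very quickly.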

Page 96: Successful Data Mining

Exploratory Data Models

Use a tree to find a smaller subset of variables to investigate

Explore this set graphically
  Start the modeling process over

Build model
  Compare model on small subset with full predictive model

Page 97: Successful Data Mining

More Realistic

250 predictors
  200 Continuous
  50 Categorical

10,000 rows

Why is this still easy?
  No missing values
  Relatively high signal/noise

Page 98: Successful Data Mining

Start With a Simple Model

Tree?

[Figure: regression tree with root split x4<0.477873 and further splits on x1, x2, x4 and x5; leaf predictions range from -2.560 to 12.200]

Page 99: Successful Data Mining

Brushing

Page 100: Successful Data Mining

Lesson 4: Know when to Fold ’em

Liability for churches

Some Predictors
Net Premium Value
Property Value
Coastal (yes/no)
Inner100 (a.k.a., highly-urban) (yes/no)
High property value Neighborhood (yes/no)
Indicator Class

1 (Church/House of worship)
2 (Sexual Misconduct – Church)
3 (Add’l Sex. Misc. Covg Purchased)
4 (Not-for-profit daycare centers)
5 (Dwellings – One family (Lessor’s risk))
6 (Bldg or Premises – Office – Not for profit)
7 (Corporal Punishment – each faculty member)
8 (Vacant land – not for profit)
9 (Private, not for profit, elementary, Kindergarten and Jr. High Schools)
10 (Stores – no food or drink – not for profit)
11 (Bldg or Premises – Maintained by insured (lessor’s risk) – not for profit)
12 (Sexual misconduct – diocese)

Page 101: Successful Data Mining

Fast Fail

Not every modeling effort is a success
A model search can save lots of queries

Data took 8 months to get ready

Analyst spent 2 months exploring it

Tree models, stepwise regression (and a neural network running for several hours) found no out of sample predictive ability
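
A common way to check for out-of-sample predictive ability is to compare a model's held-out error with that of simply predicting the training mean. The slides do not say which measure was used in this study; the sketch below uses an out-of-sample R-squared as one assumed option, and the function name and numbers are made up for illustration.

```python
def out_of_sample_r2(y_train, y_test, predictions):
    """1 - (model SSE / baseline SSE) on held-out data.

    The baseline predicts the training mean for every test case. Values near
    or below zero mean the model does no better than the baseline.
    Assumes the test responses are not all equal to the training mean.
    """
    baseline = sum(y_train) / len(y_train)
    sse_model = sum((y - p) ** 2 for y, p in zip(y_test, predictions))
    sse_baseline = sum((y - baseline) ** 2 for y in y_test)
    return 1 - sse_model / sse_baseline

y_train = [1, 2, 3]
y_test = [1, 3]
print(out_of_sample_r2(y_train, y_test, [1, 3]))  # perfect predictions
print(out_of_sample_r2(y_train, y_test, [2, 2]))  # same as predicting the mean
print(out_of_sample_r2(y_train, y_test, [3, 1]))  # worse than the mean
```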

Page 102: Successful Data Mining

Lesson 5: Machines are Smart – You are Smarter

Why do statisticians like interpretability?

Black boxes are not interpretable, but there may be important information

Page 103: Successful Data Mining

Case Study – Warranty Data

A new backpack inkjet printer is showing higher than expected warranty claims

What are the important variables?
What’s going on?

A neural network shows that Zip code is the most important predictor

Page 104: Successful Data Mining

Zip Code?

Page 105: Successful Data Mining

Data Mining – DOE Synergy

Data Mining is exploratory

Efforts can go on simultaneously

Learning cycle oscillates naturally between the two

Page 106: Successful Data Mining

What Did We Learn?

Toy problem
Functional form of model

PVA data
Useful predictor – increased sales 40%

Depression Study
Identified critical intervention point at 2 weeks

Ingots
Gave clues as to where to look
Experimental design followed

Churches
When to quit

Printers
When to experiment – what factors

Page 107: Successful Data Mining

Challenges for data mining

Not algorithms

Overfitting

Finding an interpretable model that fits reasonably well

Page 108: Successful Data Mining

Recap – Success in Data Mining

Problem formulation

Data preparation
Data definitions
Data cleaning
Feature creation, transformations

EDM – exploratory modeling
Reduce dimensions

Page 109: Successful Data Mining

Success in Data Mining II

Don’t forget Graphics

Second phase modeling

Testing, validation, implementation

Constant re-evaluation of models

Page 110: Successful Data Mining

Which Method(s) to Use?

No method is best

Which methods work best when?

Which method to use? YES!

Page 111: Successful Data Mining

For More Information

Two Crows
http://www.twocrows.com

KDNuggets
http://www.kdnuggets.com

[email protected]

M. Berry and G. Linoff, Data Mining Techniques, Wiley, 1997

Dorian Pyle, Data Preparation for Data Mining, Morgan Kaufmann, 1999

Hand, D.J., Mannila, H. and Smyth, P., Principles of Data Mining, MIT Press 2001

Tan, P.N., Steinbach, M. and Kumar, V., Introduction to Data Mining, Addison-Wesley, 2006

Hastie, Tibshirani and Friedman, The Elements of Statistical Learning, 2nd edition, Springer