Transcript

Successful Data Mining in Practice

Richard D. De Veaux, Williams College, May 19, 2009, deveaux@williams.edu

JMP Visual Data Mining

Where is Massachusetts?

Where in Massachusetts?

Williams College

Williams College

Reason for Data Mining

Data = $$

Data Mining Is…

“the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.” --- Fayyad

“finding interesting structure (patterns, statistical models, relationships) in data bases.” --- Fayyad, Chaudhuri and Bradley

“a knowledge discovery process of extracting previously unknown, actionable information from very large data bases.” --- Zornes

“a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions.” --- Edelstein

What is Data Mining?

Data Mining Models – a partial list

Traditional statistical models:
  Linear regression, logistic regression, splines, smoothers etc.
  Vendors are adding these to DM software

Clustering and Visualization

Neural networks

Decision trees

Naïve Bayes

K Nearest Neighbor Methods

K-means

Combining Models – Bagging and Boosting

What makes Data Mining Different?

Massive amounts of data:
  Number of rows (cases)
  Number of columns (variables)

Low signal to noise:
  Many irrelevant variables
  Subtle relationships
  Variation

UPS:
  16 TB – U.S. Library of Congress
  Mostly tracking

Facebook:
  200-300 TB of photos every month

Google:
  1 PB every 72 minutes

Why Is Data Mining Taking Off Now?

Because we can:
  Computer power
  The price of digital storage is near zero

Data warehouses already built

Companies want return on data investment

Users are Also Different

Users:
  Domain experts, not statisticians
  Have too much data
  Want automatic methods
  Want useful information without spending all their time doing statistical analysis

Data Mining Myths

Find answers to unasked questions

Continuously monitor your data base for interesting patterns

Eliminate the need to understand your business

Eliminate the need to collect good data

Eliminate the need to have good data analysis skills

Examples of Data Mining Applications -- Customer Relationship Management

Transactional data:
  Customer retention
  Upselling opportunities
  Customer optimization across different areas

Marketing experiments:
  Often, new hypotheses are generated by data mining a planned experiment.
  Segmentation

Manufacturing Applications

Product reliability and quality control

Process control:
  What can I do to improve batch yields?

Warranty analysis:
  Product problems
  Service assessment
  Adverse experiences – link to production

Medical Applications

Medical procedure effectiveness:
  Who are good candidates for surgery?

Physician effectiveness:
  Which tests are ineffective?

Which physicians are likely to over-prescribe treatments?

What combinations of tests are most effective?

E-commerce

Automatic web page design

Recommendations for new purchases

Cross selling

Social Network Marketing

Pharmaceutical Applications

High throughput screening:
  Predict actions in assays
  Predict results in animals or humans

Rational drug design:
  Relating chemical structure with chemical properties
  Inverse regression to predict chemical properties from desired structure

DNA SNPs (“snips”)

Genomics:
  Associate genes with diseases
  Find relationships between genotype and drug response (e.g., dosage requirements, adverse effects)
  Find individuals most susceptible to placebo effect

Pharmaceutical Applications

Combine clinical trial results with extensive medical/demographic information

Non-traditional uses of clinical trial data warehouse to explore:
  Prediction of adverse experiences – combining more than one trial
  Who is likely to be non-compliant or drop out?
  What are alternative (i.e., non-approved) uses supported by the data?

Fraud and Terrorist Detection

Identify false:
  Medical insurance claims
  Accident insurance claims

Which stock trades are based on insider information?

Whose cell phone numbers have been stolen?

Which credit card transactions are from stolen cards?

Which documents are “interesting”?

When are changes in networks signs of potential illegal activity?

Lesson 1: Learn to Make Friends

PVA is a philanthropic organization, sanctioned by the US Govt to represent disabled veterans

They send out 4 million “free gifts” every 6 weeks and hope for donations

Data were used for the KDD Cup 1998:
  200,000 donors (100,000 training, 100,000 test)
  481 demographic variables: past giving, income, age, etc.

Recent campaign (only for training set):
  Did they give? (Target B)
  How much did they give? (Target D)

To optimize profit, who should receive the current solicitation?

What is the most cost effective strategy?

What’s “Hard”? -- Example

T-Code

More Tcode

Transformation?

Categories?

What does it mean?

T-Code Title             T-Code Title             T-Code Title              T-Code Title
0      _                 16     DEAN              48     CORPORAL           109    LIC.
1      MR.               17     JUDGE             50     ELDER              111    SA.
1001   MESSRS.           17002  JUDGE & MRS.      56     MAYOR              114    DA.
1002   MR. & MRS.        18     MAJOR             59002  LIEUTENANT & MRS.  116    SR.
2      MRS.              18002  MAJOR & MRS.      62     LORD               117    SRA.
2002   MESDAMES          19     SENATOR           63     CARDINAL           118    SRTA.
3      MISS              20     GOVERNOR          64     FRIEND             120    YOUR MAJESTY
3003   MISSES            21002  SERGEANT & MRS.   65     FRIENDS            122    HIS HIGHNESS
4      DR.               22002  COLNEL & MRS.     68     ARCHDEACON         123    HER HIGHNESS
4002   DR. & MRS.        24     LIEUTENANT        69     CANON              124    COUNT
4004   DOCTORS           26     MONSIGNOR         70     BISHOP             125    LADY
5      MADAME            27     REVEREND          72002  REVEREND & MRS.    126    PRINCE
6      SERGEANT          28     MS.               73     PASTOR             127    PRINCESS
9      RABBI             28028  MSS.              75     ARCHBISHOP         128    CHIEF
10     PROFESSOR         29     BISHOP            85     SPECIALIST         129    BARON
10002  PROFESSOR & MRS.  31     AMBASSADOR        87     PRIVATE            130    SHEIK
10010  PROFESSORS        31002  AMBASSADOR & MRS  89     SEAMAN             131    PRINCE AND PRINCESS
11     ADMIRAL           33     CANTOR            90     AIRMAN             132    YOUR IMPERIAL MAJEST
11002  ADMIRAL & MRS.    36     BROTHER           91     JUSTICE            135    M. ET MME.
12     GENERAL           37     SIR               92     MR. JUSTICE        210    PROF.
12002  GENERAL & MRS.    38     COMMODORE         100    M.
13     COLONEL           40     FATHER            103    MLLE.
13002  COLONEL & MRS.    42     SISTER            104    CHANCELLOR
14     CAPTAIN           43     PRESIDENT         106    REPRESENTATIVE
14002  CAPTAIN & MRS.    44     MASTER            107    SECRETARY
15     COMMANDER         46     MOTHER            108    LT. GOVERNOR
15002  COMMANDER & MRS.  47     CHAPLAIN

Relational Data Bases

Data are stored in tables

Items

ItemID ItemName price

C56621 top hat 34.95

T35691 cane 4.99

RS5292 red shoes 22.95

Shoppers

Person ID person name ZIPCODE item bought

135366 Lyle 19103 T35691

135366 Lyle 19103 C56621

259835 Dick 01267 RS5292
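The two tables above can be linked through the item code. The following is a minimal sketch, assuming Python's built-in sqlite3 module and hypothetical table and column names (items, shoppers, item_bought); it is not part of the original talk. ZIP codes are stored as text so that the leading zero in 01267 survives.

```python
import sqlite3

# In-memory database holding the two example tables from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (item_id TEXT PRIMARY KEY, item_name TEXT, price REAL);
CREATE TABLE shoppers (
    person_id TEXT, person_name TEXT, zipcode TEXT,
    item_bought TEXT REFERENCES items(item_id)
);
""")
conn.executemany("INSERT INTO items VALUES (?, ?, ?)", [
    ("C56621", "top hat", 34.95),
    ("T35691", "cane", 4.99),
    ("RS5292", "red shoes", 22.95),
])
conn.executemany("INSERT INTO shoppers VALUES (?, ?, ?, ?)", [
    ("135366", "Lyle", "19103", "T35691"),
    ("135366", "Lyle", "19103", "C56621"),
    ("259835", "Dick", "01267", "RS5292"),
])

# Join shoppers to items on the foreign key and total each person's spending.
rows = conn.execute("""
SELECT s.person_name, ROUND(SUM(i.price), 2)
FROM shoppers AS s JOIN items AS i ON s.item_bought = i.item_id
GROUP BY s.person_id, s.person_name
ORDER BY s.person_name
""").fetchall()
spend = dict(rows)
print(spend)
```

A flat modeling file is usually built from joins like this one, which is one reason data preparation takes so long.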

Metadata

The data survey describes the data set contents and characteristics:
  Table name
  Description
  Primary key/foreign key relationships
  Collection information: how, where, conditions
  Timeframe: daily, weekly, monthly
  Cosynchronous: every Monday or Tuesday

Data Preparation

Build data mining database

Explore data

Prepare data for modeling

60% to 95% of the time is spent preparing the data

Data Challenges

Data definitions:
  Types of variables

Data consolidation:
  Combine data from different sources
  NASA Mars lander

Data heterogeneity:
  Homonyms
  Synonyms

Data quality

Data Quality

Missing Values

Random missing values:
  Delete row?
    Paralyzed Veterans
  Substitute value
  Imputation
  Multiple imputation
  JMP 8 (!)

Systematic missing data:
  Now what?

Missing Values -- Systematic

Credit card bank finds that “Income” field is missing

Wharton Ph.D. student questionnaire on survey attitudes

Bowdoin College applicants have mean SAT verbal score above 750

Clinical trial of depression medication – what does missing mean?

Results for PVA Data Set

If the entire list (100,000 donors) is mailed, net donation is $10,500

Using data mining techniques, this was increased 41.37%

KDD CUP 98 Results

KDD CUP 98 Results 2

Students in Data Mining Class

Student #1  $15,024
Student #2  $14,695
Student #3  $14,345

Data Mining and OLAP

On-line analytical processing (OLAP): users deductively analyze data to verify hypotheses

Descriptive, not predictive

Data mining: software uses data to inductively find patterns – models!

Predictive or descriptive

Associations?
  Most associated variables in the census
  Most associated variables in a supermarket
  Association rules

Why Models?

Beer and Diapers

“In the convenience stores we looked at, on Friday nights, purchases of beer and purchases of diapers are highly associated”

Conclusions?

Actions?
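How strongly two purchases are associated is usually summarized by support, confidence, and lift. The sketch below uses ten made-up baskets (not data from the talk) and hypothetical helper names to show the arithmetic.

```python
# Ten invented market baskets; each basket is the set of items bought together.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "diapers", "milk"},
    {"beer", "chips"},
    {"beer"},
    {"diapers", "milk"},
    {"milk", "bread"},
    {"bread"},
    {"chips", "bread"},
    {"milk"},
]

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) from the baskets."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to the baseline rate of the consequent (1 = no association)."""
    return confidence(antecedent, consequent) / support(consequent)

s_both = support({"beer", "diapers"})
conf_rule = confidence({"diapers"}, {"beer"})
lift_rule = lift({"diapers"}, {"beer"})
print(f"support={s_both:.2f} confidence={conf_rule:.2f} lift={lift_rule:.2f}")
```

A lift above 1 only says the items co-occur more often than chance; it does not say what to conclude or do, which is the point of the slide.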

Beer and Diapers

Picture from Tandem™ ad

Models

Models are:
  Powerful summaries for understanding
  Used for exploration and prediction

Of course, models are not reality

George Box:
  “All models are wrong, but some are useful”
  “Statisticians, like artists, have the bad habit of falling in love with their models.”

Twyman’s Law and Corollaries

“If it looks interesting, it must be wrong”

De Veaux’s Corollary 1 to Twyman’s Law: “If it’s perfect, it’s wrong”

De Veaux’s Corollary 2 to Twyman’s Law: “If it isn’t wrong, you probably knew it already”

Lesson 2 – An Example of Twyman’s Law

Ingot cracking:
  3935 30,000 lb. ingots
  Up to 25% cracking rate
  $30,000 per recast
  90 potential explanatory variables:
    Water composition (reduced)
    Metal composition
    Process variables
    Other environmental variables

Data Processing

Five months to consolidate process data

Three months to analyze and reduce dimension of water data

Eight months after starting projects, statisticians received flat file:

960 ingots (rows)
149 variables

Decision Trees – Mortgage Defaults

[Figure: decision tree with Yes/No splits on Household Income > $40000, Debt > $10000, and On Job > 5 Yr; leaf default rates 0.05, 0.01, 0.06, 0.11]

Decision Tree -- Titanic

[Figure: Titanic survival tree with splits on sex (M/F), age (Child/Adult), and class (1st, 2nd, 3rd, Crew); leaf survival rates 14%, 23%, 27%, 33%, 46%, 93%, 100%]

Cook County Hospital -- “ER”

The 3 “Urgent” Risk Factors:

1. Is the reported pain unstable angina?

2. Is there fluid in patient’s lungs?

3. Is the patient’s systolic BP < 100?

The ECG Tests:

• MI: myocardial infarction (heart attack)

• Ischemia – heart muscle not getting enough blood

Confusion Matrix

Doctors in ER
                          No Heart Attack   Actual Heart Attack
Predict No Heart Attack        0.25               0.11
Predict Heart Attack           0.75               0.89

Tree Algorithm (Goldman)
                          No Heart Attack   Actual Heart Attack
Predict No Heart Attack        0.92               0.04
Predict Heart Attack           0.08               0.96

Regression Tree

[Figure: regression tree with splits Price<9446.5, Weight<2280, Disp.<134, Weight<3637.5, Price<11522, Reliability:abde, HP<154; leaf means 34.00, 30.17, 26.22, 24.17, 21.67, 20.40, 22.57, 18.60]

Ingots – First Tree

Node                               Count   Mean     Std Dev
All Rows                           3935    0.2356   0.4244
Alloy (6045,7348,8234,2345,3234)   3005    0.159    0.3661
Alloy (5434,5894,2439)             930     0.4817   0.4999

We know that – some alloys are hard to make. That’s why we gave you the data in the first place.

Second Tree

Node       Count   Mean     Std Dev
All Rows   3935    0.2356   0.4244
MG<3.9     3055    0.1673   0.3723
MG>=3.9    880     0.4727   0.4999

What do you think is in those alloys?

One More Time

Looks like Chrome

OH!

Did that solve it? No, but

Experimental design:
  Enabled us to focus on important variables

“Oh, that’s funny!” - Isaac Asimov

What did we learn?

Data mining gave clues for generating hypotheses

Followed up with DOE

DOE led to substantial process improvement

Herb’s Tree – Twyman’s Law again

Node          Count   G^2         Prob(Level 0)   Prob(Level 1)
All Rows      94649   37928.436   0.9494          0.0506
TARGET_D>=1   4792    0           0.0000          1.0000
TARGET_D<1    89857   0           1.0000          0.0000

Doing it Right – Knowledge Discovery

Define business problem

Build data mining database

Explore data

Prepare data for modeling

Build model

Evaluate model

Deploy model and results

Note: This process model borrows from CRISP-DM: CRoss Industry Standard Process for Data Mining

Successful Data Mining

The keys to success:
  Formulating the problem
  Using the right data
  Flexibility in modeling
  Acting on results

Success depends more on the way you mine the data than on the specific tool

Types of Models

Descriptions

Classification (categorical or discrete values)

Regression (continuous values)

Time series (continuous values)

Clustering

Association

Model Building

Model building:
  Train
  Test

Evaluate
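The train, test, evaluate cycle can be sketched in a few lines. This is an illustrative example only: the data are simulated, the model is a simple least-squares line, and the helper names (train_test_split, fit_line, r_squared) are hypothetical rather than taken from any package used in the talk.

```python
import random

def train_test_split(rows, test_fraction, rng):
    """Shuffle the rows and hold out a fraction of them for testing."""
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(rows):
    """Ordinary least squares for y = intercept + slope * x."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    xbar, ybar = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - xbar) * (y - ybar) for x, y in rows)
             / sum((x - xbar) ** 2 for x in xs))
    return ybar - slope * xbar, slope

def r_squared(actual, predicted):
    """1 - SS_residual / SS_total."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

rng = random.Random(0)  # fixed seed so the run is repeatable
rows = [(x / 10, 2 + 3 * (x / 10) + rng.gauss(0, 0.5)) for x in range(50)]

train, test = train_test_split(rows, 0.3, rng)       # train on 70%, hold out 30%
intercept, slope = fit_line(train)                    # fit on training rows only
r2_train = r_squared([y for _, y in train], [intercept + slope * x for x, _ in train])
r2_test = r_squared([y for _, y in test], [intercept + slope * x for x, _ in test])
print(f"R-squared: {r2_train:.3f} train, {r2_test:.3f} test")
```

The test R-squared is the honest one, because the held-out rows played no part in the fit.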

Overfitting in Regression

Classical overfitting:
  Fit 6th order polynomial to 6 data points

[Figure: six data points with the fitted polynomial; x axis from -1 to 6, y axis from 0.0 to 3.5]
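A small numerical sketch of the same effect, using six made-up points (not the ones plotted on the slide). A degree-5 polynomial is already enough to pass through six points exactly, so the sketch uses the Lagrange interpolant and compares it with a straight line one step beyond the data.

```python
def lagrange(xs, ys):
    """Return the unique polynomial of degree len(xs)-1 through the points."""
    def p(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return p

def least_squares_line(xs, ys):
    """Return the ordinary least-squares straight line as a function."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
             / sum((x - xbar) ** 2 for x in xs))
    return lambda x: ybar + slope * (x - xbar)

xs = [0, 1, 2, 3, 4, 5]
ys = [1.0, 2.7, 1.8, 3.4, 2.9, 3.1]   # invented: a mild upward trend plus noise

poly = lagrange(xs, ys)
line = least_squares_line(xs, ys)

# The polynomial reproduces every training point, so its training error is zero.
max_train_error = max(abs(poly(x) - y) for x, y in zip(xs, ys))
print("max training error of polynomial:", max_train_error)

# One step outside the data the two models disagree wildly.
print("polynomial at x=6:", round(poly(6), 2))
print("line at x=6:      ", round(line(6), 2))
```

A perfect fit to the training data says nothing about how the model behaves on new cases.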

Overfitting

Fitting non-explanatory variables to data

Overfitting is the result of:
  Including too many predictor variables
  Lack of regularizing the model:
    Neural net run too long
    Decision tree too deep

Avoiding Overfitting

Avoiding overfitting is a balancing act – Occam’s Razor:
  Fit fewer variables rather than more
  Have a reason for including a variable (other than it is in the database)
  Regularize (don’t overtrain)
  Know your field.

“All models should be as simple as possible but no simpler than necessary” - Albert Einstein

“Toy” Problem

[Figure: ten scatterplot panels of train2$y (about 5 to 25) against each predictor train2[, i] (0.0 to 1.0)]

Tree Model

[Figure: regression tree for the toy problem; root split x4<0.512146, with further splits on x1, x2, x3, x4, x5, and one on x8; leaf means range from 4.400 to 25.320]

R-squared: 82.3% Train, 67.2% Test

Predictions for Example

R-squared: 82.3% Train, 67.2% Test

[Figure: y versus y Predictor]

Tree Advantages

Model explains its reasoning -- builds rules

Build model quickly

Handles non-numeric data

No problems with missing data:
  Missing data as a new value
  Surrogate splits

Works fine with many dimensions

What’s Wrong With Trees?

Outputs are step functions – big errors near boundaries

Greedy algorithms for splitting – small changes change model

Uses less data after every split

Model has high order interactions -- all splits are dependent on previous splits

Often non-interpretable

Linear Regression

Term        Estimate   Std Error   t Ratio   Prob>|t|
Intercept   -0.900     0.482       -1.860    0.063
x1          4.658      0.292       15.950    <.0001
x2          4.685      0.294       15.920    <.0001
x3          -0.040     0.291       -0.140    0.892
x4          9.806      0.298       32.940    <.0001
x5          5.361      0.281       19.090    <.0001
x6          0.369      0.284       1.300     0.194
x7          0.001      0.291       0.000     0.998
x8          -0.110     0.295       -0.370    0.714
x9          0.467      0.301       1.550     0.122
x10         -0.200     0.289       -0.710    0.479

R-squared: 73.5% Train, 69.4% Test

Stepwise Regression

Term        Estimate   Std Error   t Ratio   Prob>|t|
Intercept   -0.625     0.309       -2.019    0.0439
x1          4.619      0.289       15.998    <.0001
x2          4.665      0.292       15.984    <.0001
x4          9.824      0.296       33.176    <.0001
x5          5.366      0.28        19.145    <.0001

R-squared: 73.4% Train, 69.8% Test

Stepwise 2nd Order Model

Term                        Estimate   Std Error   t Ratio   Prob>|t|
Intercept                   -2.026     0.264       -7.68     <.0001
x1                          4.311      0.184       23.47     <.0001
x2                          4.808      0.185       26.04     <.0001
x3                          -0.506     0.181       -2.79     0.0054
x4                          10         0.186       53.79     <.0001
x5                          5.212      0.176       29.67     <.0001
x8                          -0.181     0.186       -0.97     0.3301
x9                          0.427      0.188       2.28      0.0232
(x1-0.51811)*(x1-0.51811)   -0.932     0.711       -1.31     0.1905
(x2-0.48354)*(x1-0.51811)   8.972      0.634       14.14     <.0001
(x3-0.48517)*(x1-0.51811)   -1.367     0.65        -2.1      0.0358
(x3-0.48517)*(x2-0.48354)   -0.8       0.639       -1.25     0.2111
(x3-0.48517)*(x3-0.48517)   20.515     0.69        29.71     <.0001
(x4-0.49647)*(x1-0.51811)   1.014      0.651       1.56      0.1197
(x4-0.49647)*(x2-0.48354)   -1.159     0.65        -1.78     0.075
(x5-0.50509)*(x2-0.48354)   -0.794     0.62        -1.28     0.2008
(x5-0.50509)*(x3-0.48517)   1.105      0.619       1.78      0.0748
(x5-0.50509)*(x4-0.49647)   0.127      0.635       0.2       0.8414
(x8-0.52029)*(x5-0.50509)   1.065      0.63        1.69      0.0914

R-squared: 89.9% Train, 88.8% Test

Next Steps

Higher order terms?

When to stop?

Transformations?

Too simple: underfitting – bias

Too complex: inconsistent predictions, overfitting – high variance

Selecting models is Occam’s razor

Keep goals of interpretation vs. prediction in mind

Logistic Regression

What happens if we use linear regression on 1-0 (yes/no) data?

[Figure: 0/1 response (0.0 to 1.0) plotted against Income (20000 to 80000)]

Logistic Regression II

Points on the line can be interpreted as probability, but don’t stay within [0,1]

Use a sigmoidal function instead of linear function to fit the data

$$f(I) = \frac{1}{1 + e^{-I}}$$

[Figure: the sigmoid f(I) for I from -10 to 10; vertical axis from 0 to 1.2]
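A quick check of why the sigmoid is preferred: however large or small the linear score I becomes, the output stays strictly between 0 and 1. The function name below is illustrative.

```python
import math

def logistic(i):
    """Sigmoid f(I) = 1 / (1 + exp(-I)); maps any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-i))

# Same range of scores as the horizontal axis of the plot.
for score in (-10, -6, -2, 0, 2, 6, 10):
    print(f"I = {score:>3}  f(I) = {logistic(score):.4f}")
```

A straight line fitted to the same scores would leave the [0, 1] interval at both ends.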

Logistic Regression III

[Figure: Accept (0.0 to 1.0) plotted against Income (20000 to 80000)]

Neural Nets

Don’t resemble the brain

Are a statistical model:
  Closest relative is projection pursuit regression

A Single Neuron

[Figure: a single neuron with inputs x1 to x5 (weights 0.3, 0.7, -0.2, 0.4, -0.5) and constant input x0 (weight 0.8), producing Input (z1) and Output h(z1)]

z1 = 0.8 + 0.3x1 + 0.7x2 - 0.2x3 + 0.4x4 - 0.5x5
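The neuron on this slide is a weighted sum followed by a squashing function. The sketch below uses the weights shown on the slide; the choice of the logistic function for h is an assumption, since the slide does not say which h is used.

```python
import math

def logistic(z):
    """Assumed activation h; the slide leaves h unspecified."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, weights, bias, h=logistic):
    """Return (z, h(z)) where z = bias + sum of weight * input."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return z, h(z)

weights = [0.3, 0.7, -0.2, 0.4, -0.5]   # weights for x1..x5 from the slide
bias = 0.8                               # weight on the constant input x0

# With every input equal to 1, z1 = 0.8 + 0.3 + 0.7 - 0.2 + 0.4 - 0.5 = 1.5.
z1, output = neuron([1.0] * 5, weights, bias)
print("z1 =", round(z1, 6), " h(z1) =", round(output, 4))
```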

Single Node

Input to outer layer from “hidden node”:

$$I = z_l = \sum_j w_{1jl}\, x_j + \theta_l$$

Output:

$$\hat{y} = h(z_l)$$

Layered Architecture

[Figure: network diagram with inputs x1 and x2 in the input layer, nodes z1, z2, z3 in the hidden layer, and y in the output layer]

Neural Networks

Create lots of features – hidden nodes:

$$z_l = \sum_j w_{1jl}\, x_j + \theta_l$$

Use them in an additive model:

$$\hat{y} = w_{21}\, h(z_1) + w_{22}\, h(z_2) + \dots + \theta$$

Put It Together

$$\hat{y}_k = \tilde{h}\Big(\sum_l w_{2kl}\, h\Big(\sum_j w_{1jl}\, x_j + \theta_l\Big) + \theta_k\Big)$$

The resulting model is just a flexible non-linear regression of the response on a set of predictor variables.
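Written out as code, the whole network is two nested weighted sums. This is a minimal sketch for one output and one hidden layer; the logistic hidden activation and the identity outer function are assumptions, and all weights are made-up values, not fitted ones.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, theta1, w2, theta2, h=logistic, h_outer=lambda v: v):
    """One-hidden-layer network: h_outer(sum_l w2[l] * h(z_l) + theta2).

    w1[l] holds the input weights of hidden node l and theta1[l] its bias.
    """
    hidden = [h(sum(w * xj for w, xj in zip(w1_l, x)) + t)
              for w1_l, t in zip(w1, theta1)]
    return h_outer(sum(w * hl for w, hl in zip(w2, hidden)) + theta2)

# Two inputs, three hidden nodes, one output, matching the layered diagram.
w1 = [[0.5, -0.3], [0.1, 0.8], [-0.7, 0.2]]   # illustrative values only
theta1 = [0.0, -0.5, 0.3]
w2 = [1.2, -0.4, 0.9]
theta2 = 0.1

y_hat = forward([1.0, 2.0], w1, theta1, w2, theta2)
print("prediction:", round(y_hat, 4))
```

Fitting the model means choosing the weights to minimize prediction error, which is exactly a non-linear regression problem.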

Predictions for Example

R-squared: 89.5% Train, 87.7% Test

What Does This Get Us?

Enormous flexibility

Ability to fit anything:
  Including noise

Interpretation?

Neural Net Pro

Advantages:
  Handles continuous or discrete values
  Complex interactions
  In general, highly accurate for fitting due to flexibility of model
  Can incorporate known relationships:
    So-called grey box models
    See De Veaux et al, Environmetrics 1999

Neural Net Con

Disadvantages:
  Model is not descriptive (black box)
  Difficult, complex architectures
  Slow model building
  Categorical data explosion
  Sensitive to input variable selection

K-Nearest Neighbors (KNN)

To predict y for an x:
  Find the k most similar x's
  Average their y's

Find k by cross validation

No training (estimation) required

Works embarrassingly well:
  Friedman, KDDM 1996
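The recipe above translates almost line for line into code. The sketch below uses Euclidean distance and leave-one-out cross validation to pick k; both are assumptions (the slide does not specify a distance or a cross-validation scheme), and the data are invented.

```python
import math

def knn_predict(train_x, train_y, query, k):
    """Average the y values of the k training points closest to query."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k

def loo_cv_error(xs, ys, k):
    """Leave-one-out mean squared error for a given k."""
    total = 0.0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        total += (knn_predict(rest_x, rest_y, xs[i], k) - ys[i]) ** 2
    return total / len(xs)

# Invented one-dimensional data, roughly y = x.
xs = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
ys = [0.1, 0.9, 2.1, 2.9, 4.2, 5.0]

prediction = knn_predict(xs, ys, (2.4,), k=2)   # neighbors are x=2 and x=3
best_k = min((1, 2, 3), key=lambda k: loo_cv_error(xs, ys, k))
print("prediction at 2.4:", prediction, " k chosen by cross validation:", best_k)
```

There is no fitting step: all the work happens at prediction time, when the neighbors are looked up.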

Collaborative Filtering

Goal: predict what movies people will like

Data: list of movies each person has watched
  Lyle: André, Starwars
  Ellen: André, Starwars, Destin
  Fred: Starwars, Batman
  Dean: Starwars, Batman, Rambo
  Jason: Destin d’Amélie Poulain, Caché

Data Base

Data can be represented as a sparse matrix

Karen likes André. What else might she like?

CDNow doubled e-mail responses

Columns: Starwars, Rambo, Batman, My Dinner w/André, Destin d'Amélie, Caché

Lyle y y
Ellen y y y
Fred y y
Dean y y y
Jason y y y

Karen ? ? ? ? ? ?
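One simple way to answer the question for Karen is to count what people who share her tastes have also watched. The sketch below is one of many possible scoring rules, not the method used by any particular vendor, and it uses the watch lists given on the previous slide with shortened titles.

```python
from collections import Counter

# Watch lists from the example, with titles abbreviated as on the slide.
watched = {
    "Lyle": {"André", "Starwars"},
    "Ellen": {"André", "Starwars", "Destin"},
    "Fred": {"Starwars", "Batman"},
    "Dean": {"Starwars", "Batman", "Rambo"},
    "Jason": {"Destin", "Caché"},
}

def recommend(liked, watched):
    """Score unseen movies by how much their viewers overlap with liked."""
    scores = Counter()
    for movies in watched.values():
        overlap = len(liked & movies)
        if overlap == 0:
            continue                      # this person shares nothing with the user
        for movie in movies - liked:
            scores[movie] += overlap      # weight by the size of the shared taste
    return scores.most_common()

karen = recommend({"André"}, watched)
print(karen)
```

With these lists, the two people who watched André both watched Starwars, so it ranks first for Karen.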

Lesson 3: Know When to Hold ‘em

Breast cancer data from mammograms

Error rates by trained radiologists are near 25% for both false positives and false negatives

Newer equipment is prohibitively expensive for the developing world

Early detection of breast cancer is crucial

Cumulative type I error over a decade is near 100% leading to needless biopsies

The Data

1618 mammograms showing clustered microcalcifications

Biostatistics Dept, Institut Curie

Variables:
  Response: Malignant or not
  Predictors: Age, Tissue Type (light/dense), Size (mm), Number of microcalc, Number of suspicious clusters, Shape of microcalc (1-5), Polyshape? (y/n), Shape of cluster (1,2,3), Retro (cluster near nipple?), Deep? (y/n)

Tree model

Combining Models

In the 1950s forecasters found that combining forecasting models worked better on average than any single forecast model

Reduces variance by averaging

Can reduce bias if collection is broader than single model

Bagging and Boosting

Bagging (Bootstrap Aggregation):
  Bootstrap a data set repeatedly
  Take many versions of same model (e.g. tree)
    Random Forest variation
  Form a committee of models
  Take majority rule of predictions

Boosting:
  Create repeated samples of weighted data
  Weights based on misclassification
  Combine by majority rule, or linear combination of predictions
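The bagging steps can be sketched with the simplest possible tree, a one-split "stump". Everything here is illustrative: the data are invented, the stump learner is a stand-in for a full tree, and the function names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Best single-threshold classifier: (threshold, left_label, right_label)."""
    majority = Counter(ys).most_common(1)[0][0]
    best = (float("-inf"), majority, majority)        # fallback: constant prediction
    best_errors = sum(y != majority for y in ys)
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x < t else right) != y for x, y in zip(xs, ys))
            if errors < best_errors:
                best, best_errors = (t, left, right), errors
    return best

def predict_stump(stump, x):
    t, left, right = stump
    return left if x < t else right

def bagged_stumps(xs, ys, n_models, rng):
    """Fit one stump per bootstrap sample (rows drawn with replacement)."""
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def predict_bagged(models, x):
    """Committee prediction by majority vote."""
    votes = Counter(predict_stump(m, x) for m in models)
    return votes.most_common(1)[0][0]

xs = list(range(10))
ys = [0] * 5 + [1] * 5                    # invented labels: class 1 when x >= 5
models = bagged_stumps(xs, ys, n_models=25, rng=random.Random(1))
print([predict_bagged(models, x) for x in xs])
```

Each bootstrap sample gives a slightly different split point, and the vote averages that instability away. Boosting differs in that later models are fit to reweighted data that emphasizes earlier mistakes.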

Results

                  False Positives   False Negatives
Simple Tree       32.20%            33.70%
Neural Network    25.50%            31.70%
Boosted Trees     24.90%            32.50%
Bagged Trees      19.30%            28.80%
Radiologists      22.40%            35.80%

• Split data into train and test (62.5% - 37.5%)

• Repeat random splits 1000 times

• For each iteration, count false positives and false negatives on the 600 test set cases

How Do We Really Start?

Life is not so kind:
  Categorical variables
  Missing data
  500 variables, not 10

481 variables – where to start?

Where to Start

Three rules of data analysis:
  Draw a picture
  Draw a picture
  Draw a picture

Ok, but how? There are 90 histogram/bar charts and 4005 scatterplots to look at (or at least 90 if you look only at y vs. X)
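The count of scatterplots is just the number of ways to choose two variables out of 90, which is easy to confirm. The variable names below are illustrative.

```python
import math

n_variables = 90
n_pairs = math.comb(n_variables, 2)   # 90 * 89 / 2 distinct pairs of variables
print(n_pairs)
```

With the 481 variables of the PVA data the same formula gives far more pairs than anyone could inspect by eye, which is why a model is needed to narrow the search first.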

Exploratory Data Models

Use a tree to find a smaller subset of variables to investigate

Explore this set graphically:
  Start the modeling process over

Build model:
  Compare model on small subset with full predictive model

More Realistic

250 predictors:
  200 continuous
  50 categorical

10,000 rows

Why is this still easy?
  No missing values
  Relatively high signal/noise

Start With a Simple Model

Tree?

[Figure: regression tree with root split x4<0.477873 and further splits on x1, x2, x4, and x5; leaf means range from -2.560 to 12.200]

Brushing

Lesson 4: Know When to Fold ‘em

Liability for churches

Some predictors:
  Net Premium Value
  Property Value
  Coastal (yes/no)
  Inner100 (a.k.a. highly urban) (yes/no)
  High property value neighborhood (yes/no)
  Indicator Class:
    1 (Church/House of worship)
    2 (Sexual Misconduct – Church)
    3 (Add’l Sex. Misc. Covg Purchased)
    4 (Not-for-profit daycare centers)
    5 (Dwellings – One family (Lessor’s risk))
    6 (Bldg or Premises – Office – Not for profit)
    7 (Corporal Punishment – each faculty member)
    8 (Vacant land – not for profit)
    9 (Private, not for profit, elementary, Kindergarten and Jr. High Schools)
    10 (Stores – no food or drink – not for profit)
    11 (Bldg or Premises – Maintained by insured (lessor’s risk) – not for profit)
    12 (Sexual misconduct – diocese)

Fast Fail

Not every modeling effort is a success

A model search can save lots of queries

Data took 8 months to get ready

Analyst spent 2 months exploring it

Tree models, stepwise regression (and a neural network running for several hours) found no out of sample predictive ability

Lesson 5: Machines are Smart – You are Smarter

Why do statisticians like interpretability?

Black boxes are not interpretable, but there may be important information

Case Study – Warranty Data

A new backpack inkjet printer is showing higher than expected warranty claims

What are the important variables?

What’s going on?

A neural network shows that Zip code is the most important predictor

Zip Code?

Data Mining – DOE Synergy

Data Mining is exploratory

Efforts can go on simultaneously

Learning cycle oscillates naturally between the two

What Did We Learn?

Toy problem:
  Functional form of model

PVA data:
  Useful predictor – increased sales 40%

Depression study:
  Identified critical intervention point at 2 weeks

Ingots:
  Gave clues as to where to look
  Experimental design followed

Churches:
  When to quit

Printers:
  When to experiment – what factors

Challenges for data mining

Not algorithms

Overfitting

Finding an interpretable model that fits reasonably well

Recap – Success in Data Mining

Problem formulation

Data preparation:
  Data definitions
  Data cleaning
  Feature creation, transformations

EDM – exploratory modeling:
  Reduce dimensions

Success in Data Mining II

Don’t forget Graphics

Second phase modeling

Testing, validation, implementation

Constant re-evaluation of models

Which Method(s) to Use?

No method is best

Which methods work best when?

Which method to use? YES!

For More Information

Two Crows: http://www.twocrows.com

KDNuggets: http://www.kdnuggets.com

deveaux@williams.edu

M. Berry and G. Linoff, Data Mining Techniques, Wiley, 1997

Dorian Pyle, Data Preparation for Data Mining, Morgan Kaufmann, 1999

Hand, D.J., Mannila, H. and Smyth, P., Principles of Data Mining, MIT Press, 2001

Tan, P.N., Steinbach, M. and Kumar, V., Introduction to Data Mining, Addison-Wesley, 2006

Hastie, Tibshirani and Friedman, The Elements of Statistical Learning, 2nd edition, Springer
