Successful Data Mining

Successful Data Mining in Practice

Richard D. De Veaux
Williams College
May 19, 2009
deveaux@williams.edu

JMP Visual Data Mining

Where is Massachusetts?

Where in Massachusetts?

Williams College

Reason for Data Mining

Data = $$

Data Mining Is…

“the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.” --- Fayyad

“finding interesting structure (patterns, statistical models, relationships) in data bases.” --- Fayyad, Chaudhuri and Bradley

“a knowledge discovery process of extracting previously unknown, actionable information from very large data bases” --- Zornes

“a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions.” --- Edelstein

What is Data Mining?

Data Mining Models – a partial list

Traditional statistical models
    Linear regression, logistic regression, splines, smoothers, etc.
    Vendors are adding these to DM software

Clustering and Visualization

Neural networks

Decision trees

Naïve Bayes

K Nearest Neighbor Methods

K-means

Combining Models – Bagging and Boosting

What makes Data Mining Different?

Massive amounts of data
    Number of rows (cases)
    Number of columns (variables)

Low signal to noise
    Many irrelevant variables
    Subtle relationships
    Variation

UPS
    16 TB – U.S. Library of Congress
    Mostly tracking

Facebook
    200-300 TB of photos every month

Google
    1 PB every 72 minutes

Why Is Data Mining Taking Off Now?

Because we can
    Computer power
    The price of digital storage is near zero

Data warehouses already built

Companies want return on data investment

Users are Also Different

Users
    Domain experts, not statisticians
    Have too much data
    Want automatic methods
    Want useful information without spending all their time doing statistical analysis

Data Mining Myths

Find answers to unasked questions

Continuously monitor your data base for interesting patterns

Eliminate the need to understand your business

Eliminate the need to collect good data

Eliminate the need to have good data analysis skills

Examples of Data Mining Applications -- Customer Relationship Management

Transactional Data
    Customer retention
    Upselling opportunities
    Customer optimization across different areas

Marketing Experiments
    Often, new hypotheses are generated by data mining a planned experiment.
    Segmentation

Manufacturing Applications

Product reliability and quality control

Process control
    What can I do to improve batch yields?

Warranty analysis
    Product problems
    Service assessment
    Adverse experiences – link to production

Medical Applications

Medical procedure effectiveness
    Who are good candidates for surgery?

Physician effectiveness
    Which tests are ineffective?
    Which physicians are likely to over-prescribe treatments?
    What combinations of tests are most effective?

E-commerce

Automatic web page design

Recommendations for new purchases

Cross selling

Social Network Marketing

Pharmaceutical Applications

High throughput screening
    Predict actions in assays
    Predict results in animals or humans

Rational drug design
    Relating chemical structure with chemical properties
    Inverse regression to predict chemical properties from desired structure

DNA snips

Genomics
    Associate genes with diseases
    Find relationships between genotype and drug response (e.g., dosage requirements, adverse effects)
    Find individuals most susceptible to placebo effect

Pharmaceutical Applications

Combine clinical trial results with extensive medical/demographic information

Non-traditional uses of clinical trial data warehouse to explore:
    Prediction of adverse experiences – combining more than one trial
    Who is likely to be non-compliant or drop out?
    What are alternative (i.e., non-approved) uses supported by the data?

Fraud and Terrorist Detection

Identify false:
    Medical insurance claims
    Accident insurance claims

Which stock trades are based on insider information?

Whose cell phone numbers have been stolen?

Which credit card transactions are from stolen cards?

Which documents are “interesting”?

When are changes in networks signs of potential illegal activity?

Lesson 1: Learn to Make Friends

PVA (Paralyzed Veterans of America) is a philanthropic organization, sanctioned by the US Govt to represent disabled veterans

They send out 4 million “free gifts” every 6 weeks and hope for donations

Data were used for the KDD Cup 1998
    200,000 donors (100,000 training, 100,000 test)
    481 demographic variables
        Past giving, income, age, etc.
    Recent campaign (only for training set)
        Did they give? (Target B)
        How much did they give? (Target D)

To optimize profit, who should receive the current solicitation?

What is the most cost effective strategy?

What’s “Hard”? -- Example

T-Code

More Tcode

Transformation?

Categories?

What does it mean?

T-Code  Title               T-Code  Title               T-Code  Title               T-Code  Title
0       _                   16      DEAN                48      CORPORAL            109     LIC.
1       MR.                 17      JUDGE               50      ELDER               111     SA.
1001    MESSRS.             17002   JUDGE & MRS.        56      MAYOR               114     DA.
1002    MR. & MRS.          18      MAJOR               59002   LIEUTENANT & MRS.   116     SR.
2       MRS.                18002   MAJOR & MRS.        62      LORD                117     SRA.
2002    MESDAMES            19      SENATOR             63      CARDINAL            118     SRTA.
3       MISS                20      GOVERNOR            64      FRIEND              120     YOUR MAJESTY
3003    MISSES              21002   SERGEANT & MRS.     65      FRIENDS             122     HIS HIGHNESS
4       DR.                 22002   COLNEL & MRS.       68      ARCHDEACON          123     HER HIGHNESS
4002    DR. & MRS.          24      LIEUTENANT          69      CANON               124     COUNT
4004    DOCTORS             26      MONSIGNOR           70      BISHOP              125     LADY
5       MADAME              27      REVEREND            72002   REVEREND & MRS.     126     PRINCE
6       SERGEANT            28      MS.                 73      PASTOR              127     PRINCESS
9       RABBI               28028   MSS.                75      ARCHBISHOP          128     CHIEF
10      PROFESSOR           29      BISHOP              85      SPECIALIST          129     BARON
10002   PROFESSOR & MRS.    31      AMBASSADOR          87      PRIVATE             130     SHEIK
10010   PROFESSORS          31002   AMBASSADOR & MRS    89      SEAMAN              131     PRINCE AND PRINCESS
11      ADMIRAL             33      CANTOR              90      AIRMAN              132     YOUR IMPERIAL MAJEST
11002   ADMIRAL & MRS.      36      BROTHER             91      JUSTICE             135     M. ET MME.
12      GENERAL             37      SIR                 92      MR. JUSTICE         210     PROF.
12002   GENERAL & MRS.      38      COMMODORE           100     M.
13      COLONEL             40      FATHER              103     MLLE.
13002   COLONEL & MRS.      42      SISTER              104     CHANCELLOR
14      CAPTAIN             43      PRESIDENT           106     REPRESENTATIVE
14002   CAPTAIN & MRS.      44      MASTER              107     SECRETARY
15      COMMANDER           46      MOTHER              108     LT. GOVERNOR
15002   COMMANDER & MRS.    47      CHAPLAIN

Relational Data Bases

Data are stored in tables

ItemID ItemName price

C56621 top hat 34.95

T35691 cane 4.99

RS5292 red shoes 22.95

Shoppers

Person ID person name ZIPCODE item bought

135366 Lyle 19103 T35691

135366 Lyle 19103 C56621

259835 Dick 01267 RS5292
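The two tables above are linked by the item identifier: "item bought" in the Shoppers table refers to "ItemID" in the item table. A minimal sketch of that link, using Python's built-in sqlite3 module, is shown below. The table names (`items`, `shoppers`) and column names are assumptions chosen for illustration; the slide names only the second table.

```python
import sqlite3

def shoppers_with_items():
    """Load the two slide tables into an in-memory database and join them."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE items (item_id TEXT PRIMARY KEY, item_name TEXT, price REAL);
        CREATE TABLE shoppers (person_id TEXT, person_name TEXT, zipcode TEXT,
                               item_bought TEXT REFERENCES items(item_id));
    """)
    conn.executemany("INSERT INTO items VALUES (?, ?, ?)", [
        ("C56621", "top hat", 34.95),
        ("T35691", "cane", 4.99),
        ("RS5292", "red shoes", 22.95),
    ])
    # ZIP codes are stored as text so the leading zero in 01267 survives.
    conn.executemany("INSERT INTO shoppers VALUES (?, ?, ?, ?)", [
        ("135366", "Lyle", "19103", "T35691"),
        ("135366", "Lyle", "19103", "C56621"),
        ("259835", "Dick", "01267", "RS5292"),
    ])
    rows = conn.execute("""
        SELECT s.person_name, i.item_name, i.price
        FROM shoppers AS s
        JOIN items AS i ON s.item_bought = i.item_id
        ORDER BY s.person_name, i.item_name
    """).fetchall()
    conn.close()
    return rows

JOINED = shoppers_with_items()
```

Building a flat modeling file from a warehouse is largely a sequence of joins like this one, which is part of why data preparation takes so long.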

Metadata

The data survey describes the data set contents and characteristics
    Table name
    Description
    Primary key/foreign key relationships
    Collection information: how, where, conditions
    Timeframe: daily, weekly, monthly
    Cosynchronous: every Monday or Tuesday

Data Preparation

Build data mining database

Explore data

Prepare data for modeling

60% to 95% of the time is spent preparing the data

Data Challenges

Data definitions
    Types of variables

Data consolidation
    Combine data from different sources
    NASA Mars lander

Data heterogeneity
    Homonyms
    Synonyms

Data quality

Data Quality

Missing Values

Random missing values
    Delete row?
        Paralyzed Veterans
    Substitute value
        Imputation
        Multiple Imputation
        JMP 8 (!)

Systematic missing data
    Now what?
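The two options for randomly missing values, deleting the row or substituting a value, can be sketched as follows. The donor records and the `income` field are hypothetical, and simple mean substitution stands in for the more careful imputation methods the slide mentions; this is an illustration, not the procedure used on the PVA data.

```python
from statistics import mean

# Hypothetical records; None marks a missing income.
DONORS = [
    {"id": 1, "income": 40000.0},
    {"id": 2, "income": None},
    {"id": 3, "income": 60000.0},
    {"id": 4, "income": None},
]

def drop_missing(rows, field):
    """Option 1: delete every row where the field is missing."""
    return [r for r in rows if r[field] is not None]

def impute_mean(rows, field):
    """Option 2: substitute the mean of the observed values.

    A flag column records which values were filled in, so a later model
    can still see that the value was missing.
    """
    fill = mean(r[field] for r in rows if r[field] is not None)
    imputed = []
    for r in rows:
        new_row = dict(r)
        new_row[field + "_was_missing"] = r[field] is None
        if r[field] is None:
            new_row[field] = fill
        imputed.append(new_row)
    return imputed

KEPT = drop_missing(DONORS, "income")
FILLED = impute_mean(DONORS, "income")
```

Deleting rows here discards half the data. Neither option is safe when values are missing systematically, because the fact of being missing then carries information; the flag column is one hedge against that.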

Missing Values -- Systematic

Credit Card Bank finds that “Income” field is missing

Wharton Ph.D. Student questionnaire on survey attitudes

Bowdoin college applicants have mean SAT verbal score above 750

Clinical Trial of Depression Medication – what does missing mean?

Results for PVA Data Set

If the entire list (100,000 donors) is mailed, the net donation is $10,500

Using data mining techniques, this was increased by 41.37%

KDD CUP 98 Results

KDD CUP 98 Results 2

Students in Data Mining Class

Student #1 $15,024
Student #2 $14,695
Student #3 $14,345

Data Mining and OLAP

On-line analytical processing (OLAP): users deductively analyze data to verify hypotheses

Descriptive, not predictive

Data mining: software uses data to inductively find patterns – models!

Predictive or descriptive

Associations?
    Most associated variables in the census
    Most associated variables in a supermarket
    Association Rules

Why Models?

Beer and Diapers

“In the convenience stores we looked at, on Friday nights, purchases of beer and purchases of diapers are highly associated”

Conclusions?

Actions?
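"Highly associated" is usually quantified with support, confidence, and lift. The sketch below computes all three for the rule beer → diapers on eight made-up baskets; the baskets are hypothetical and are not the convenience-store data behind the quote.

```python
# Hypothetical market baskets, one set of items per transaction.
BASKETS = [
    {"beer", "diapers"},
    {"beer", "diapers", "chips"},
    {"beer"},
    {"diapers", "milk"},
    {"milk", "bread"},
    {"beer", "diapers"},
    {"bread"},
    {"chips", "milk"},
]

def support(baskets, items):
    """Fraction of baskets that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in baskets) / len(baskets)

def confidence(baskets, lhs, rhs):
    """Estimated P(rhs in basket | lhs in basket)."""
    return support(baskets, set(lhs) | set(rhs)) / support(baskets, lhs)

def lift(baskets, lhs, rhs):
    """Confidence relative to the base rate of rhs; 1.0 means no association."""
    return confidence(baskets, lhs, rhs) / support(baskets, rhs)

RULE_SUPPORT = support(BASKETS, {"beer", "diapers"})
RULE_CONFIDENCE = confidence(BASKETS, {"beer"}, {"diapers"})
RULE_LIFT = lift(BASKETS, {"beer"}, {"diapers"})
```

A lift above 1 says the items co-occur more often than chance would suggest. It says nothing about why, or about what action to take, which is the point of the two questions on the slide.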

Beer and Diapers

Picture from Tandem™ ad

Models

Models are:
    Powerful summaries for understanding
    Used for exploration and prediction

Of course, models are not reality

George Box
    “All models are wrong, but some are useful”
    “Statisticians, like artists, have the bad habit of falling in love with their models”.

Twyman’s Law and Corollaries

“If it looks interesting, it must be wrong”

De Veaux’s Corollary 1 to Twyman’s Law
    “If it’s perfect, it’s wrong”

De Veaux’s Corollary 2 to Twyman’s Law
    “If it isn’t wrong, you probably knew it already”

Lesson 2 – An Example of Twyman’s Law

Ingot cracking
    3935 30,000 lb. ingots
    Up to 25% cracking rate
    $30,000 per recast
    90 potential explanatory variables
        Water composition (reduced)
        Metal composition
        Process variables
        Other environmental variables

Data Processing

Five months to consolidate process data

Three months to analyze and reduce dimension of water data

Eight months after starting projects, statisticians received flat file:

    960 ingots (rows)
    149 variables

Decision Trees – Mortgage Defaults

[Figure: decision tree with Yes/No splits on Household Income > $40000, Debt > $10000, and On Job > 5 Yr; leaf values 0.05, 0.01, 0.06, 0.11]

Decision Tree -- Titanic

[Figure: decision tree with splits on class (3 vs. 1, 2, Crew; 1 or 2; 1st vs. Crew; 1 or Crew vs. 2 or 3) and Child vs. Adult; leaf values 46%, 93%, 27%, 100%, 33%, 23%]

Cook County Hospital -- “ER”

The 3 “Urgent” Risk Factors:

1. Is the reported Pain unstable angina?

2. Is there fluid in patient’s lungs?

3. Is the patient’s systolic BP < 100?

The ECG Tests:

•MI: myocardial infarction (heart attack)

•Ischemia – Heart muscle not getting enough blood

Confusion Matrix

Doctors in ER

                           No Heart Attack   Actual Heart Attack
Predict No Heart Attack    0.25              0.11
Predict Heart Attack       0.75              0.89

Tree Algorithm (Goldman)

                           No Heart Attack   Actual Heart Attack
Predict No Heart Attack    0.92              0.04
Predict Heart Attack       0.08              0.96

Regression Tree

[Figure: regression tree with splits Price<9446.5, Weight<2280, Disp.<134, Weight<3637.5, Price<11522, Reliability:abde, HP<154; leaf values 34.00, 30.17, 26.22, 21.67, 20.40]

Ingots – First Tree

Node                               Count   Mean     Std Dev
All Rows                           3935    0.2356   0.4244
Alloy (6045,7348,8234,2345,3234)   3005    0.159    0.3661
Alloy (5434,5894,2439)             930     0.4817   0.4999

We know that – some alloys are hard to make. That’s why we gave you the data in the first place.

Second Tree

Node       Count   Mean     Std Dev
All Rows   3935    0.2356   0.4244
MG<3.9     3055    0.1673   0.3723
MG>=3.9    880     0.4727   0.4999

What do you think is in those alloys?

One More Time

Looks like Chrome

OH!

Did that solve it? No, but

Experimental design
    Enabled us to focus on important variables

Oh, that’s funny!

- Isaac Asimov

What did we learn?

Data mining gave clues for generating hypotheses

Followed up with DOE

DOE led to substantial process improvement

Herb’s Tree – Twyman’s Law again

All Rows: Count 94649, G^2 37928.436, Level 0.9494 / 0.0506

TARGET_D>=1: Count 4792, Level 0.0000 / 1.0000

TARGET_D<1: Count 89857, Level 1.0000 / 0.0000

Doing it Right – Knowledge Discovery

Define business problem

Build data mining database

Explore data

Prepare data for modeling

Build model

Evaluate model

Deploy model and results

Note: This process model borrows from CRISP-DM: CRoss Industry Standard Process for Data Mining

Successful Data Mining

The keys to success:
    Formulating the problem
    Using the right data
    Flexibility in modeling
    Acting on results

Success depends more on the way you mine the data than on the specific tool

Types of Models

Descriptions

Classification (categorical or discrete values)

Regression (continuous values)

Time series (continuous values)

Clustering

Association

Model Building

Model building
    Train
    Test

Evaluate

Overfitting in Regression

Classical overfitting:
    Fit 6th order polynomial to 6 data points

[Figure: polynomial fitted through 6 data points; horizontal axis from -1 to 6]
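The classical overfitting example can be reproduced in a few lines. The sketch below passes a polynomial exactly through six points (a degree-5 polynomial, which has six coefficients, is enough to do this) and compares it with a straight-line fit when predicting one step beyond the data. The six (x, y) values are made up: a line with slope 1 plus small fixed perturbations.

```python
def lagrange_predict(xs, ys, x):
    """Evaluate the unique polynomial through all (xs, ys) points at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

# Hypothetical data: roughly y = x with small perturbations.
XS = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
YS = [0.3, 0.6, 2.5, 2.7, 4.4, 4.5]

INTERCEPT, SLOPE = fit_line(XS, YS)
POLY_TRAIN_ERRORS = [abs(lagrange_predict(XS, YS, x) - y) for x, y in zip(XS, YS)]
POLY_AT_6 = lagrange_predict(XS, YS, 6.0)   # the interpolating polynomial at a new x
LINE_AT_6 = INTERCEPT + SLOPE * 6.0         # the straight line at the same new x
```

The polynomial has zero training error yet predicts about -19.2 at x = 6, while the line, which does not fit the training points exactly, predicts about 5.8. A perfect fit on the training data is not evidence of a good model.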

Overfitting

Fitting non-explanatory variables to data

Overfitting is the result of
    Including too many predictor variables
    Failing to regularize the model
        Neural net run too long
        Decision tree too deep

Avoiding Overfitting

Avoiding overfitting is a balancing act – Occam’s Razor
    Fit fewer variables rather than more
    Have a reason for including a variable (other than that it is in the database)
    Regularize (don’t overtrain)
    Know your field

All models should be as simple as possible but no simpler than necessary

Albert Einstein

“Toy” Problem

[Figure: ten plot panels, each with horizontal axis train2[, i] running from 0.0 to 1.0]

Tree Model

[Figure: regression tree with root split x4<0.512146 and further splits on x1, x2, x3, x4, x5, and x8; leaf predictions range from 4.400 to 25.320]

R-squared: 82.3% Train, 67.2% Test

Predictions for Example

[Figure: y plotted against Predictor]

R-squared: 82.3% Train, 67.2% Test

Tree Advantages

Model explains its reasoning -- builds rules

Build model quickly

Handles non-numeric data

No problems with missing data
    Missing data as a new value
    Surrogate splits

Works fine with many dimensions

What’s Wrong With Trees?

Outputs are step functions – big errors near boundaries

Greedy algorithms for splitting – small changes change model

Uses less data after every split

Model has high order interactions -- all splits are dependent on previous splits

Often non-interpretable

Linear Regression

Term        Estimate   Std Error   t Ratio   Prob>|t|
Intercept   -0.900     0.482       -1.860    0.063
x1           4.658     0.292       15.950    <.0001
x2           4.685     0.294       15.920    <.0001
x3          -0.040     0.291       -0.140    0.892
x4           9.806     0.298       32.940    <.0001
x5           5.361     0.281       19.090    <.0001
x6           0.369     0.284        1.300    0.194
x7           0.001     0.291        0.000    0.998
x8          -0.110     0.295       -0.370    0.714
x9           0.467     0.301        1.550    0.122
x10         -0.200     0.289       -0.710    0.479

R-squared: 73.5% Train, 69.4% Test

Stepwise Regression

Term        Estimate   Std Error   t Ratio   Prob>|t|
Intercept   -0.625     0.309       -2.019    0.0439
x1           4.619     0.289       15.998    <.0001
x2           4.665     0.292       15.984    <.0001
x4           9.824     0.296       33.176    <.0001
x5           5.366     0.28        19.145    <.0001

R-squared: 73.4% Train, 69.8% Test

Stepwise 2nd Order Model

Term                         Estimate   Std Error   t Ratio   Prob>|t|
Intercept                    -2.026     0.264       -7.68     <.0001
x1                            4.311     0.184       23.47     <.0001
x2                            4.808     0.185       26.04     <.0001
x3                           -0.506     0.181       -2.79     0.0054
x4                           10         0.186       53.79     <.0001
x5                            5.212     0.176       29.67     <.0001
x8                           -0.181     0.186       -0.97     0.3301
x9                            0.427     0.188        2.28     0.0232
(x1-0.51811)*(x1-0.51811)    -0.932     0.711       -1.31     0.1905
(x2-0.48354)*(x1-0.51811)     8.972     0.634       14.14     <.0001
(x3-0.48517)*(x1-0.51811)    -1.367     0.65        -2.1      0.0358
(x3-0.48517)*(x2-0.48354)    -0.8       0.639       -1.25     0.2111
(x3-0.48517)*(x3-0.48517)    20.515     0.69        29.71     <.0001
(x4-0.49647)*(x1-0.51811)     1.014     0.651        1.56     0.1197
(x4-0.49647)*(x2-0.48354)    -1.159     0.65        -1.78     0.075
(x5-0.50509)*(x2-0.48354)    -0.794     0.62        -1.28     0.2008
(x5-0.50509)*(x3-0.48517)     1.105     0.619        1.78     0.0748
(x5-0.50509)*(x4-0.49647)     0.127     0.635        0.2      0.8414
(x8-0.52029)*(x5-0.50509)     1.065     0.63         1.69     0.0914

R-squared: 89.9% Train, 88.8% Test

Next Steps

Higher order terms?

When to stop?

Transformations?

Too simple: underfitting – bias

Too complex: inconsistent predictions, overfitting – high variance

Selecting models is Occam’s razor
    Keep goals of interpretation vs. prediction in mind

Logistic Regression

What happens if we use linear regression on 1-0 (yes/no) data?

[Figure: 1-0 response plotted against Income, from 20000 to 80000]

Logistic Regression II

Points on the line can be interpreted as probability, but don’t stay within [0,1]

Use a sigmoidal function instead of a linear function to fit the data

$$f(I) = \frac{1}{1 + e^{-I}}$$

[Figure: sigmoid curve f(I) for I from -10 to 10; vertical axis ticks 0 to 0.8]
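The difference between the linear and the sigmoidal fit can be checked numerically. In the sketch below, `logistic` is the standard logistic function f(I) = 1/(1 + e^(-I)), and `linear_probability` is a straight line in income with made-up coefficients (assumptions for illustration, not fitted to any data from the talk).

```python
import math

def logistic(i):
    """Logistic (sigmoid) function, written to avoid overflow for large |i|."""
    if i >= 0:
        return 1.0 / (1.0 + math.exp(-i))
    e = math.exp(i)
    return e / (1.0 + e)

def linear_probability(income, intercept=-0.5, slope=0.00002):
    """A straight-line 'probability' in income; coefficients are hypothetical."""
    return intercept + slope * income

def logistic_probability(income, intercept=-0.5, slope=0.00002):
    """The same linear score, passed through the logistic function."""
    return logistic(linear_probability(income, intercept, slope))

LINEAR_LOW = linear_probability(20000)    # falls below 0
LINEAR_HIGH = linear_probability(80000)   # rises above 1
LOGISTIC_LOW = logistic_probability(20000)
LOGISTIC_HIGH = logistic_probability(80000)
```

The straight line leaves the unit interval at both ends of the income range, while the logistic version stays strictly between 0 and 1 and remains increasing in income.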

Logistic Regression III

[Figure: logistic fit plotted against Income, from 20000 to 80000]

Neural Nets

Don’t resemble the brain
    Are a statistical model
    Closest relative is projection pursuit regression

[Figure: a node taking Input (z1) and producing Output; values 0.4 and -0.5 shown on the diagram]

z1 = 0.8 + .3x1 + .7x2 - .2x3 + .4x4 - .5x5
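The node input z1 above is a weighted sum of the predictors plus a constant. The sketch below computes it with the slide's coefficients and then applies an activation function. The logistic activation and the all-ones input vector are assumptions made for illustration; the slide does not say which activation is used.

```python
import math

# Coefficients taken from the slide: z1 = 0.8 + .3x1 + .7x2 - .2x3 + .4x4 - .5x5
BIAS = 0.8
WEIGHTS = [0.3, 0.7, -0.2, 0.4, -0.5]

def neuron_input(x):
    """Weighted sum of the five inputs plus the constant term."""
    return BIAS + sum(w * xi for w, xi in zip(WEIGHTS, x))

def neuron_output(x):
    """Node output, assuming a logistic activation function."""
    return 1.0 / (1.0 + math.exp(-neuron_input(x)))

Z1 = neuron_input([1.0, 1.0, 1.0, 1.0, 1.0])     # 0.8 + 0.3 + 0.7 - 0.2 + 0.4 - 0.5
OUT = neuron_output([1.0, 1.0, 1.0, 1.0, 1.0])   # logistic of Z1
```

With every input set to 1, the node input is 1.5 and the logistic output is roughly 0.82.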

A Single Neuron

Single Node

Input to outer layer from “hidden node”:

$$I = z_l = \sum_j w_{1jl} x_j + \theta_l$$

Output:

$$\hat{y}_k = h(z_l)$$

Layered Architecture

Input layer

Output layer

Hidden layer

Neural Networks

Create lots of features – hidden nodes

$$z_l = \sum_j w_{1jl} x_j + \theta_l$$

Use them in an additive model:

$$\hat{y}_k = w_{21} h(z_1) + w_{22} h(z_2) + \dots + \theta_k$$

Put It Together

$$\hat{y}_k = \tilde{h}\left( \sum_l w_{2l}\, h\left( \sum_j w_{1jl} x_j + \theta_l \right) + \theta_k \right)$$

The resulting model is just a flexible non-linear regression of the response on a set of predictor variables.
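Read as code, the layered model is two nested weighted sums with an activation in between. The sketch below is a forward pass for one hidden layer and a single output. The function name `forward`, the argument layout, the logistic hidden activation, and the identity output activation are all assumptions for illustration; the weights in the example call are arbitrary.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, theta1, w2, theta2, h=logistic, h_out=lambda v: v):
    """One-hidden-layer network with a single output.

    w1[l] holds the weights from the inputs to hidden node l, theta1[l] is
    that node's constant, w2[l] is the weight from hidden node l to the
    output, and theta2 is the output constant.
    """
    hidden = [
        h(sum(w * xj for w, xj in zip(w1[l], x)) + theta1[l])
        for l in range(len(w1))
    ]
    return h_out(sum(w * a for w, a in zip(w2, hidden)) + theta2)

# With all first-layer weights at zero, each hidden node outputs h(0) = 0.5,
# so the prediction is 1.0 * 0.5 + 2.0 * 0.5 + 0.5.
Y_HAT = forward(
    x=[1.0, 2.0],
    w1=[[0.0, 0.0], [0.0, 0.0]],
    theta1=[0.0, 0.0],
    w2=[1.0, 2.0],
    theta2=0.5,
)
```

Fitting the network means choosing the weights and constants to minimize prediction error on the training data, which is why it behaves like a very flexible non-linear regression.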

Predictions for Example

R-squared: 89.5% Train, 87.7% Test

What Does This Get Us?

Enormous flexibility

Ability to fit anything
    Including noise

Interpretation?

Neural Net Pro

Advantages
    Handles continuous or discrete values
    Complex interactions
    In general, highly accurate for fitting due to flexibility of model
    Can incorporate known relationships
        So-called grey box models
        See De Veaux et al, Environmetrics 1999

Neural Net Con

Disadvantages
    Model is not descriptive (black box)
    Difficult, complex architectures
    Slow model building
    Categorical data explosion
    Sensitive to input variable selection

K-Nearest Neighbors (KNN)

To predict y for an x:
    Find the k most similar x's
    Average their y's

Find k by cross validation

No training (estimation) required

Works embarrassingly well
    Friedman, KDDM 1996
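The KNN recipe above fits in a few lines. The sketch below uses Euclidean distance as the similarity measure and leave-one-out cross validation to choose k; both are assumptions, since the slides do not specify the distance or the form of cross validation. The training data are made up.

```python
import math

def knn_predict(train_x, train_y, query, k):
    """Average the responses of the k training points closest to `query`."""
    order = sorted(range(len(train_x)),
                   key=lambda i: math.dist(train_x[i], query))
    nearest = order[:k]
    return sum(train_y[i] for i in nearest) / k

def choose_k(train_x, train_y, candidates):
    """Pick k by leave-one-out cross validation on squared error."""
    def loo_error(k):
        total = 0.0
        for i in range(len(train_x)):
            rest_x = train_x[:i] + train_x[i + 1:]
            rest_y = train_y[:i] + train_y[i + 1:]
            total += (knn_predict(rest_x, rest_y, train_x[i], k) - train_y[i]) ** 2
        return total
    return min(candidates, key=loo_error)

# Hypothetical one-dimensional training data (points are tuples).
TRAIN_X = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]
TRAIN_Y = [0.0, 1.0, 2.0, 3.0, 10.0]

PREDICTION = knn_predict(TRAIN_X, TRAIN_Y, (1.4,), k=2)  # neighbors at x = 1 and x = 2
BEST_K = choose_k(TRAIN_X, TRAIN_Y, [1, 2, 3])
```

There is no fitting step: all the work happens at prediction time, when distances to every training point are computed.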

Collaborative Filtering

Goal: predict what movies people will like

Data: list of movies each person has watched
    Lyle: André, Starwars
    Ellen: André, Starwars, Destin
    Fred: Starwars, Batman
    Dean: Starwars, Batman, Rambo
    Jason: Destin d’Amélie Poulin, Caché

Data Base

Data can be represented as a sparse matrix

Karen likes André. What else might she like?

CDNow doubled e-mail responses

Starwars Rambo Batman My Dinner w/André Destin D'Amilie Caché

Lyle y y
Ellen y y y
Fred y y
Dean y y y
Jason y y y

Karen ? ? ? ? ? ?
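One simple way to fill in Karen's row is a nearest-neighbor vote: weight each person by how much their viewing history overlaps with hers, then score the movies she has not seen. The sketch below uses Jaccard similarity, which is an assumption (the slides do not name a similarity measure). The viewing lists follow the slide's list of who watched what, with titles abbreviated as on the slide and "Destin d'Amélie Poulin" treated as the same film as "Destin".

```python
# Viewing data as listed on the slide (titles abbreviated).
WATCHED = {
    "Lyle": {"André", "Starwars"},
    "Ellen": {"André", "Starwars", "Destin"},
    "Fred": {"Starwars", "Batman"},
    "Dean": {"Starwars", "Batman", "Rambo"},
    "Jason": {"Destin", "Caché"},
}

def jaccard(a, b):
    """Overlap between two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(liked, watched):
    """Score unseen movies by the summed similarity of the people who watched them."""
    scores = {}
    for movies in watched.values():
        similarity = jaccard(liked, movies)
        if similarity == 0.0:
            continue  # people with no overlap contribute nothing
        for movie in movies - liked:
            scores[movie] = scores.get(movie, 0.0) + similarity
    # Highest score first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

KAREN = recommend({"André"}, WATCHED)
```

For Karen, who likes André, only Lyle and Ellen overlap with her, so Starwars ranks first and Destin second. With a matrix this sparse, most people contribute nothing to any given recommendation.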

Lesson 3: Know When to Hold ‘em

Breast cancer data from mammograms

Error rates by trained radiologists are near 25% for both false positives and false negatives

Newer equipment is prohibitively expensive for the developing world

Early detection of breast cancer is crucial

Cumulative type I error over a decade is near 100% leading to needless biopsies

The Data

1618 mammograms showing clustered microcalcifications

Biostatistics Dept Institut Curie

Variables
    Response: Malignant or not
    Predictors: Age, Tissue Type (light/dense), Size (mm), Number of microcalc, Number of suspicious clusters, Shape of microcalc (1-5), Polyshape? (y/n), Shape of cluster (1,2,3), Retro (cluster near nipple?), Deep? (y/n)

Tree model

Combining Models

In the 1950s, forecasters found that combining forecasting models worked better on average than any single forecast model

Reduces variance by averaging

Can reduce bias if collection is broader than single model

Bagging and Boosting

Bagging (Bootstrap Aggregation)
    Bootstrap a data set repeatedly
    Take many versions of same model (e.g. tree)
        Random Forest Variation
    Form a committee of models
    Take majority rule of predictions

Boosting
    Create repeated samples of weighted data
    Weights based on misclassification
    Combine by majority rule, or linear combination of predictions
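The bagging steps above (bootstrap the data, fit the same kind of model to each sample, take a majority vote) can be sketched with one-split trees, often called stumps. The stump learner and the ten-point data set below are illustrative assumptions, not the models or data used on the mammogram study.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """One-split tree: returns (threshold, left_label, right_label)."""
    overall = Counter(ys).most_common(1)[0][0]
    best = (float("inf"), overall, overall)      # fallback: constant prediction
    best_err = sum(y != overall for y in ys)
    uniq = sorted(set(xs))
    for lo, hi in zip(uniq, uniq[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        err = sum(y != left_label for y in left) + sum(y != right_label for y in right)
        if err < best_err:
            best, best_err = (t, left_label, right_label), err
    return best

def predict_stump(stump, x):
    threshold, left_label, right_label = stump
    return left_label if x < threshold else right_label

def bag(xs, ys, n_models, rng):
    """Fit one stump per bootstrap sample (rows drawn with replacement)."""
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def predict_bagged(models, x):
    """Majority rule over the committee of models."""
    votes = Counter(predict_stump(m, x) for m in models)
    return votes.most_common(1)[0][0]

# Hypothetical data: label is 1 when x >= 5.
XS = list(range(10))
YS = [0] * 5 + [1] * 5
MODELS = bag(XS, YS, n_models=25, rng=random.Random(0))
```

Each stump sees a slightly different sample and so picks a slightly different threshold; the vote averages out that instability. Boosting differs in that later models are fitted to reweighted data that emphasizes earlier misclassifications.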

Results

                  False Positives   False Negatives
Simple Tree       32.20%            33.70%
Neural Network    25.50%            31.70%
Boosted Trees     24.90%            32.50%
Bagged Trees      19.30%            28.80%
Radiologists      22.40%            35.80%

• Split data into train and test (62.5% - 37.5%)

• Repeat random splits 1000 times
    For each iteration, count false positives and false negatives on the 600 test set cases

How Do We Really Start?

Life is not so kind
    Categorical variables
    Missing data
    500 variables, not 10

481 variables – where to start?

Where to Start

Three rules of data analysis
    Draw a picture
    Draw a picture
    Draw a picture

Ok, but how? There are 90 histogram/bar charts and 4005 scatterplots to look at (or at least 90 if you look only at y vs. X)
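The plot counts follow from simple combinatorics: one histogram or bar chart per variable, and one scatterplot per unordered pair of variables. The short check below reproduces the 4005 figure for 90 variables and shows how quickly the count grows for the 481 variables of the PVA data; the function name is just for illustration.

```python
from math import comb

def plot_counts(n_variables):
    """Return (number of one-variable plots, number of pairwise scatterplots)."""
    return n_variables, comb(n_variables, 2)

HISTOGRAMS_90, SCATTERPLOTS_90 = plot_counts(90)
HISTOGRAMS_481, SCATTERPLOTS_481 = plot_counts(481)
```

For 90 variables this gives 90 histograms and 90 × 89 / 2 = 4005 scatterplots; for 481 variables the pairwise count is 115,440, which is why some automated screening is needed before drawing pictures.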

Exploratory Data Models

Use a tree to find a smaller subset of variables to investigate

Explore this set graphically
    Start the modeling process over

Build model
    Compare model on small subset with full predictive model

More Realistic

250 predictors
    200 Continuous
    50 Categorical

10,000 rows

Why is this still easy?
    No missing values
    Relatively high signal/noise

Start With a Simple Model

Tree?

[Figure: regression tree with root split x4<0.477873 and further splits on x1, x2, x4, and x5; leaf predictions range from -2.560 to 12.200]

Brushing

Lesson 4: Know when to Fold ‘em

Liability for churches

Some Predictors
    Net Premium Value
    Property Value
    Coastal (yes/no)
    Inner100 (a.k.a., highly-urban) (yes/no)
    High property value Neighborhood (yes/no)
    Indicator Class
        1 (Church/House of worship)
        2 (Sexual Misconduct – Church)
        3 (Add’l Sex. Misc. Covg Purchased)
        4 (Not-for-profit daycare centers)
        5 (Dwellings – One family (Lessor’s risk))
        6 (Bldg or Premises – Office – Not for profit)
        7 (Corporal Punishment – each faculty member)
        8 (Vacant land – not for profit)
        9 (Private, not for profit, elementary, Kindergarten and Jr. High Schools)
        10 (Stores – no food or drink – not for profit)
        11 (Bldg or Premises – Maintained by insured (lessor’s risk) – not for profit)
        12 (Sexual misconduct – diocese)

Fast Fail

Not every modeling effort is a success

A model search can save lots of queries

Data took 8 months to get ready

Analyst spent 2 months exploring it

Tree models, stepwise regression (and a neural network running for several hours) found no out of sample predictive ability

Lesson 5: Machines are Smart – You are Smarter

Why do statisticians like interpretability?

Black boxes are not interpretable, but there may be important information

Case Study – Warranty Data

A new backpack inkjet printer is showing higher than expected warranty claims

What are the important variables?

What’s going on?

A neural network shows that Zip code is the most important predictor

Zip Code?

Data Mining – DOE Synergy

Data Mining is exploratory

Efforts can go on simultaneously

Learning cycle oscillates naturally between the two

What Did We Learn?

Toy problem
    Functional form of model

PVA data
    Useful predictor – increased sales 40%

Depression Study
    Identified critical intervention point at 2 weeks

Ingots
    Gave clues as to where to look
    Experimental design followed

Churches
    When to quit

Printers
    When to experiment – what factors

Challenges for data mining

Not algorithms

Overfitting

Finding an interpretable model that fits reasonably well

Recap – Success in Data Mining

Problem formulation

Data preparation
    Data definitions
    Data cleaning
    Feature creation, transformations

EDM – exploratory modeling
    Reduce dimensions

Success in Data Mining II

Don’t forget Graphics

Second phase modeling

Testing, validation, implementation

Constant re-evaluation of models

Which Method(s) to Use?

No method is best

Which methods work best when?

Which method to use?

YES!

For More Information

Two Crows
http://www.twocrows.com

KDNuggets
http://www.kdnuggets.com

deveaux@williams.edu

M. Berry and G. Linoff, Data Mining Techniques, Wiley, 1997

Dorian Pyle, Data Preparation for Data Mining, Morgan Kaufmann, 1999

Hand, D.J., Mannila, H. and Smyth, P., Principles of Data Mining, MIT Press 2001

Tan, P.N., Steinbach, M. and Kumar, V., Introduction to Data Mining, Addison-Wesley, 2006

Hastie, Tibshirani and Friedman, The Elements of Statistical Learning, 2nd edition, Springer
