Analyze/StripMiner™ Overview

• Analyze/StripMiner™ Overview
• To obtain an idiot’s guide, type “analyze > readme.txt”
• Standard Analyze Scripts
• Predicting on Blind Data
• PLS (Please Listen to Svante Wold)
• LOO, BOO and n-Fold Cross-Validation Error Measures
• Albumin Data Set and Feature Selection
• Bio-Informatics

Analyze/StripMiner™

• Feature Selection
  – Sensitivity Analysis
  – Genetic Algorithms
  – Correlation GA (GAFEAT)
  – Method specific

• Learning Modes
  – Bootstrapping
  – Bagging
  – Boosting
  – Leave-one-out cross-validation

• Data Processing
  – Interface with RECON
  – Different Scaling Modes
  – Outlier detection/data cleansing

• Visualization
  – Correlation Plots
  – 2-D Sensitivity Plots
  – Outlier Visualization Plots
  – Different Scaling Options
  – Cluster Ranking Plots
  – Standard ROC curves
  – Continuous ROC curves

• Modeling
  – ANN (Neural Networks)
  – SVM (Support Vector Machines)
  – PLS (Partial-Least Squares)
  – GA-based regression clustering
  – PCA regression
  – Local Learning
  – Outlier Detection (GAMOL)

• Code Specifics
  – Tight Classic C-code (< 15000 lines)
  – Script-Based Shell Program
  – Runs on all Platforms
  – Ultra Fast
  – Use: TransScan, GE, KODAK, Doppler broadening, Macro-Economics Analysis

Analyze/StripMiner ™ Coding Philosophy

• Standard C code that compiles on all platforms
• WINDOWS™ and Linux platforms
• Supporting visualizations use Java and/or gnuplot
• Flexible GUI with sample problems and demos
• Fastest code possible with efficient memory requirements
• Long history of code use with variety of users for troubleshooting
• Flexible code based on scripts and operators
• Operates on a numeric standard data mining format file

Practical Tips for PCA

• NIPALS algorithm assumes the features are zero centered
• It is standard practice to do a Mahalanobis scaling of the data
• PCA regression does not consider the response data
• The t’s are called the scores
• It is common practice to drop 4 sigma outlier features (if there are many features)

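Mahalanobis scaling of feature value $x_i$ (with $\bar{x}$ the feature mean and $\sigma$ its standard deviation):

$$x_i^{\mathrm{scaled}} = \frac{x_i - \bar{x}}{\sigma}$$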
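The first two tips can be sketched in a few lines of Python: scale every feature to zero mean and unit standard deviation, then run NIPALS to extract scores (the t's) and loadings one component at a time. This is only an illustration and not StripMiner's C code; the function names, the use of the sample standard deviation (n - 1), and the convergence tolerance are assumptions.

```python
import math

def mahalanobis_scale(rows):
    """Center each column to zero mean and scale it to unit standard deviation."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    # Sample standard deviation (n - 1): an assumption, the slides do not specify it.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def nipals_pca(X, n_components, tol=1e-10, max_iter=500):
    """Return (scores, loadings) for zero-centered X using NIPALS with deflation."""
    X = [row[:] for row in X]  # work on a copy, it is deflated below
    n, p = len(X), len(X[0])
    scores, loadings = [], []
    for _ in range(n_components):
        # Start from the column with the largest sum of squares.
        j = max(range(p), key=lambda k: sum(row[k] ** 2 for row in X))
        t = [row[j] for row in X]
        for _ in range(max_iter):
            tt = sum(v * v for v in t)
            # Loading: X^T t / (t^T t), normalised to unit length.
            load = [sum(X[i][k] * t[i] for i in range(n)) / tt for k in range(p)]
            norm = math.sqrt(sum(v * v for v in load))
            load = [v / norm for v in load]
            # Score: t = X * loading
            t_new = [sum(X[i][k] * load[k] for k in range(p)) for i in range(n)]
            diff = sum((a - b) ** 2 for a, b in zip(t, t_new))
            t = t_new
            if diff < tol:
                break
        # Deflate: subtract the component just found.
        for i in range(n):
            for k in range(p):
                X[i][k] -= t[i] * load[k]
        scores.append(t)
        loadings.append(load)
    return scores, loadings

# Six iris-like rows (sepal length, sepal width, petal length, petal width).
data = [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2], [7.0, 3.2, 4.7, 1.4],
        [6.4, 3.2, 4.5, 1.5], [6.3, 3.3, 6.0, 2.5], [5.8, 2.7, 5.1, 1.9]]
scores, loadings = nipals_pca(mahalanobis_scale(data), n_components=2)
```

Note that the response column plays no role here, which is the point of the third tip: PCA regression builds its components from the features alone.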

StripMiner Script Examples

• PCA visualization (pca.bat)
• Pharma-plot (pharma.bat)
• Prediction for iris with PCA (iris.bat)
• Bootstrap prediction for iris (iris_boo.bat)
• Predicting with an external test set example (iris_ext.bat)
• PLS and ROC curve for iris problem (roc.bat)
• Leave-One-Out PLS for HIV (loo_hiv.bat)
• Feature selection for HIV (prune.bat)
• Starplots (star.bat)

File Flow for PCA.bat Script

• num_eg.txt contains the number of PCAs (2-10)
• usually data are first Mahalanobis scaled (option #-3: “PLS scaling”, data only)

num_eg.txt, stats.txt, la_sscala.txt, iris.txt.txt.txt.txt

REM PCA.BAT FOR IRIS DATA (5)
analyze iris.txt 3301

REM ELIMINATE COMMAS
analyze pca.txt.txt 100

REM MAHALANOBIS SCALE ATTRIBUTES ONLY
analyze iris.txt.txt -3

REM DEFINE # PRINCIPAL COMPONENTS (4)
analyze num_eg.txt 105

REM CREATE PCA COMPONENTS
analyze iris.txt.txt.txt 36
pause

REM VISUALIZE PCAs (3)
analyze iris.txt.txt.txt.txt 3350
pause

REM PHARMAPLOT
analyze iris.txt.txt.txt.txt 3308
pause

• num_eg.txt has to contain a 4 for a pharmaplot
• use pharmaplot.m for visualization in MATLAB
• adjust color setting threshold in pharmaplot.m

File Flow for pharma.bat script

num_eg.txt, stats.txt, la_sscala.txt, dmatrix.txt, a.txt

pharmaplot

REM PHARMA.BAT FOR IRIS DATA (5)
analyze iris.txt 3301

REM ELIMINATE COMMAS
analyze pca.txt.txt 100

REM MAHALANOBIS SCALE ATTRIBUTES AND RESPONSE
analyze iris.txt.txt 3

REM DEFINE # PRINCIPAL COMPONENTS (4)
analyze num_eg.txt 105

REM SPLIT DATA IN TRAINING AND TEST SET (100 2)
analyze iris.txt.txt.txt 20
copy cmatrix.txt a.pat
copy dmatrix.txt a.tes
pause

REM DO PHARMAPLOT (assumes dmatrix.txt for test data)
analyze a.pat 1028
pause

• For the random seed in the splitting routine, don’t use 0 (it preserves the order)
• The test set is really only for validation purposes (the answer is known)
• Note: descaling from PLS uses the la_sscala.txt file
• Notice the q2, Q2, and RMSE error measures

File Flow For iris.bat Script: Predicting Class

num_eg.txt, stats.txt, la_sscala.txt, a.txt, cmatrix.txt, dmatrix.txt, resultss.xxx, resultss.ttt, results.xxx, results.ttt

REM IRIS.BAT (PCA REGRESSION)
REM GENERATE IRIS DATA (5)
analyze iris.txt 3301

REM ELIMINATE COMMAS
analyze iris.txt 100

REM MAHALANOBIS SCALE
analyze iris.txt.txt 3

REM GENERATE # PCAs (4)
analyze num_eg.txt 105

REM SPLIT DATA (100 2)
analyze iris.txt.txt.txt 20
copy cmatrix.txt a.pat
copy dmatrix.txt a.tes

REM MAKE PCA REGRESSION MODEL
analyze a.pat 17
analyze a.tes 18
pause

REM VISUALIZE RESULTS
analyze resultss.xxx 4
copy results.ttt results.xxx
analyze resultss.ttt 4
analyze results.ttt 3313
pause

• We use bootstrap cross-validation (e.g., leave 7 out 100 times)
• Use MATLAB script dos_mbotw results.ttt to display results for the test set
• Use MATLAB script dos_mbotw resultss.xxx to display results for the training set
• Notice the q2, Q2, and RMSE error measures

File Flow for iris_boo.bat Script: Bootstrap Validation for Estimating Prediction Confidence

num_eg.txt, stats.txt, la_sscala.txt, a.txt, resultss.xxx, resultss.ttt, results.ttt

REM IRIS_BOO.BAT
REM GENERATE IRIS DATA (5)
analyze iris.txt 3301

REM ELIMINATE COMMAS
analyze iris.txt 100

REM MAHALANOBIS SCALE
analyze iris.txt.txt 3

REM GENERATE # PCAs (4)
analyze num_eg.txt 105

REM SPLIT DATA (100 2)
analyze iris.txt.txt.txt 20
copy cmatrix.txt a.pat
copy dmatrix.txt a.tes

REM MAKE PCA REGRESSION MODEL (7 100 2)
analyze a.pat -33
analyze a.tes 18
analyze a.pat 18
pause

REM VISUALIZE RESULTS
analyze resultss.xxx 4
copy results.ttt results.xxx
analyze resultss.ttt 4
analyze results.ttt 3313
pause
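The bootstrap validation used in iris_boo.bat ("leave 7 out 100 times") can be illustrated with a short Python sketch: repeatedly hold out 7 random samples, fit on the rest, and collect the held-out predictions. The one-feature least-squares line stands in for the PCA regression model, and the function names, the `seed` argument, and the without-replacement sampling of the hold-out set are assumptions, not a description of StripMiner's internals.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def leave_k_out(xs, ys, k=7, repeats=100, seed=2):
    """Hold out k random samples `repeats` times; return (index, y, y_pred) triples."""
    rng = random.Random(seed)
    n = len(xs)
    held_out = []
    for _ in range(repeats):
        test_idx = set(rng.sample(range(n), k))
        train = [i for i in range(n) if i not in test_idx]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in sorted(test_idx):
            held_out.append((i, ys[i], slope * xs[i] + intercept))
    return held_out

# Toy data: 20 samples on a noisy line.
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 + (0.3 if i % 2 else -0.3) for i, x in enumerate(xs)]
predictions = leave_k_out(xs, ys)
```

Because each sample is held out many times, the spread of its predictions across repeats gives a rough picture of prediction confidence.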

Error Measure Criteria

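For the training set we use:
- RMSE: root mean square error for the training set
- r2: squared correlation coefficient for the training set
- R2: PRESS R2

For the validation/test set we use:
- RMSE: root mean square error for the validation set
- q2: 1 – r2 computed on the test set
- Q2: PRESS/SD

With $y_i$ the observed response, $\hat{y}_i$ the predicted response, and bars denoting means:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$

$$r^2 = \frac{\left[\sum_{i=1}^{n_{train}}\left(\hat{y}_i - \bar{\hat{y}}\right)\left(y_i - \bar{y}\right)\right]^2}{\sum_{i=1}^{n_{train}}\left(\hat{y}_i - \bar{\hat{y}}\right)^2 \sum_{i=1}^{n_{train}}\left(y_i - \bar{y}\right)^2}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n_{train}}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n_{train}}\left(y_i - \bar{y}\right)^2}$$

$$q^2 = 1 - r^2_{test}$$

$$Q^2 = \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$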
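A small Python sketch of these error measures, following the definitions on this slide (q2 = 1 – r2 on the test set, Q2 = PRESS/SD, so lower is better for both). The function names are hypothetical, and taking the mean of the observed responses from the same set being scored is an assumption.

```python
import math

def rmse(y, y_hat):
    """Root mean square error."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y, y_hat)) / len(y))

def squared_correlation(y, y_hat):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(y)
    my, mp = sum(y) / n, sum(y_hat) / n
    cov = sum((p - mp) * (t - my) for t, p in zip(y, y_hat))
    return cov ** 2 / (sum((p - mp) ** 2 for p in y_hat) * sum((t - my) ** 2 for t in y))

def press_over_sd(y, y_hat):
    """Sum of squared prediction errors divided by the sum of squared deviations of y."""
    my = sum(y) / len(y)
    return sum((t - p) ** 2 for t, p in zip(y, y_hat)) / sum((t - my) ** 2 for t in y)

def training_metrics(y, y_hat):
    return {"RMSE": rmse(y, y_hat),
            "r2": squared_correlation(y, y_hat),
            "R2": 1.0 - press_over_sd(y, y_hat)}

def validation_metrics(y, y_hat):
    return {"RMSE": rmse(y, y_hat),
            "q2": 1.0 - squared_correlation(y, y_hat),
            "Q2": press_over_sd(y, y_hat)}

# Toy example: four observed values and their predictions.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(training_metrics(observed, predicted))
print(validation_metrics(observed, predicted))
```

q2 only measures how well predictions correlate with the observations, while Q2 also penalizes a systematic offset or wrong slope, which is why the two can differ noticeably on a validation set.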

Script for Scaling with an External Test Set

REM IRIS_EXT.BAT
REM GENERATE IRIS DATA (5)
analyze iris.txt 3301

REM ELIMINATE COMMAS
analyze iris.txt 100

REM SPLIT DATA IN TRAINING AND TEST SET (100 2)
analyze iris.txt.txt 20
copy cmatrix.txt a.pat
copy dmatrix.txt a.tes

REM SCALE TRAINING AND TEST DATA CONSISTENTLY
analyze a.pat 3314159
analyze a.tes 3314159

REM GENERATE # PCAs (4)
analyze num_eg.txt 105

REM MAKE PCA REGRESSION MODEL
analyze a.pat 17
analyze a.tes 18
pause

REM VISUALIZE RESULTS
analyze resultss.xxx 4
copy results.ttt results.xxx
analyze resultss.ttt 4
analyze results.ttt 3313
pause
analyze resultss.ttt 3305
pause

• 3305 scatterplot (Java)
• -3305 scatterplot (gnuplot)
• 3313 errorplot (Java)
• -3313 errorplot (gnuplot)

[Figure: ROC curve (median). x-axis: False Positive (0 to 1); y-axis: True Positive (0 to 1); goodness = 0.9897]

Docking Ligands is a Nonlinear Problem

PLS, K-PLS, SVM, ANN

[Figure: 1 - Q2 versus # Features on Validation Set. x-axis: # Features (0 to 600); y-axis: 1 - Q2 (0.05 to 0.45); gnuplot of 'evolve.txt' using 1:2, Thu Mar 13 15:59:57 2003]

Feature Selection (data strip mining)

Script for ALBUMIN_LOO.BAT: PLS-LOO Validation for Albumin Data

cmatrix.ori, dmatrix.ori, num_eg.txt

stats.txt, la_sscala.txt, a.txt, results.xxx, results.ttt, sel_lbls.txt, bbmatrixx.txt, bbmatrixxx.txt

REM ALBUMIN_LOO.BAT
REM RECOVER DATA FROM ORIGINAL FILES
copy a.pat + a.tes a.txt

REM GENERATE GENERIC LABELS IN SEL_LBLS.TXT
analyze a.txt 116

REM PLS SCALE DATA
analyze a.txt 3
copy a.txt.txt a.txt

REM DROP NON-CHANGING FEATURES
analyze a.txt 5
copy a.txt.txt a.txt
copy sel_lblss.txt sel_lbls.txt

REM DROP 4-sigma OUTLIERS
analyze a.txt 9
copy a.txt.txt a.txt
copy sel_lblss.txt sel_lbls.txt

REM RESPLIT DATA (answer 84 and 0)
analyze a.txt 20
copy cmatrix.txt a.txt
copy dmatrix.txt b.txt

REM LEAVE-ONE OUT PLS VALIDATION
analyze a.pat 25
copy resultss.ttt resultss.xxx
analyze a.tes 18

REM RESCALE AND DISPLAY METRICS
analyze resultss.xxx 4
copy results.ttt results.xxx
analyze resultss.ttt 4

REM DISPLAY SCATTERPLOT
analyze results.xxx 3305
pause
analyze results.ttt 3305
pause

• PLS-LOO stands for leave-one-out PLS cross-validation
• Training set is in cmatrix.ori and external validation set in dmatrix.ori
• External validation set has -999 or 0 in the activity field
• Note that we create generic labels and that there is a test set
• Notice the dropping of non-changing features and 4-sigma outliers
• Notice the acrobatics for displaying metrics (visualize with dos_mbotw)
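The two cleaning steps mentioned above (dropping non-changing features and dropping 4-sigma outlier features) might look roughly like this in Python. It is a sketch under assumptions: a feature is treated as an outlier feature if any of its values lies more than 4 standard deviations from the feature mean, and the function name is hypothetical. The exact rules used by analyze options 5 and 9 are not given in the slides and may differ.

```python
import math

def drop_constant_and_outlier_features(rows, labels, n_sigma=4.0):
    """Return (rows, labels) keeping only features that vary and have no extreme value."""
    n = len(rows)
    keep = []
    for j, col in enumerate(zip(*rows)):
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        if std < 1e-12:
            continue  # non-changing feature: carries no information
        if any(abs(v - mean) > n_sigma * std for v in col):
            continue  # at least one value beyond n_sigma standard deviations
        keep.append(j)
    pruned_rows = [[row[j] for j in keep] for row in rows]
    return pruned_rows, [labels[j] for j in keep]

# Toy data: f1 varies normally, f2 is constant, f3 has one extreme value.
rows = [[float(i), 1.0, 100.0 if i == 0 else 0.0] for i in range(30)]
pruned, kept_labels = drop_constant_and_outlier_features(rows, ["f1", "f2", "f3"])
```

The labels are pruned together with the columns, mirroring how the script copies sel_lblss.txt back to sel_lbls.txt after each step.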

PLS Feature Selection Script For Albumin Data

REM PREPRUNE.BAT: STORE FILES IN CASE…
copy a.txt aa.txt
copy b.txt bb.txt
copy sel_lbls.txt label.txt

REM RUN PLS BOOTSTRAPS BEFORE PRUNING (7, 100, 2)
REM BBMATRIXX.TXT WILL BE USED FOR PRUNING
REM ONLY PRUNE FROM OPTION 33 (FOR THE TIME BEING)
analyze aa.txt 33

• Do several iterative prunings, typically leave 7 out, 100 times
• Use different seeds
• Example numbers of selected features: 400, 300, 200, 150, 120, 100, 80, 60, 50, 45, …

aa.pat, bbmatrixx.txt, sel_lbls.txt

select.txt

sel_lbls.txt, aa.pat, aa.tes, bbmatrixx.txt, bbmatrixxx.txt

REM PRUNE_ALBUMIN.BAT
analyze aa.pat 3318

REM USE SELECT.TXT TO PRUNE OUT FEATURES
analyze aa.pat 10
analyze aa.tes 10
copy aa.pat.txt aa.pat
copy aa.tes.txt aa.tes
copy sel_lblss.txt sel_lbls.txt

REM DO ANOTHER BOOTSTRAP (7 100 2)
analyze aa.pat 33
pause
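One pruning step of this loop can be sketched in Python as follows: average the absolute sensitivity of each feature over the bootstrap runs and keep the most sensitive ones. The data layout (one list of sensitivities per bootstrap), the averaging rule, and the function name are assumptions for illustration; the actual contents of bbmatrixx.txt and select.txt are not described in the slides.

```python
def select_top_features(sensitivities, labels, n_keep):
    """Keep the n_keep features with the largest mean absolute sensitivity.

    sensitivities: list of bootstrap runs, each a list with one value per feature.
    Returns (kept column indices in original order, kept labels).
    """
    n_boot = len(sensitivities)
    mean_sens = [sum(abs(run[j]) for run in sensitivities) / n_boot
                 for j in range(len(labels))]
    ranked = sorted(range(len(labels)), key=lambda j: mean_sens[j], reverse=True)
    keep = sorted(ranked[:n_keep])
    return keep, [labels[j] for j in keep]

# Toy example: two bootstrap runs over three features.
runs = [[0.1, 0.9, 0.5],
        [0.2, 0.8, 0.4]]
kept_idx, kept_labels = select_top_features(runs, ["a", "b", "c"], n_keep=2)
```

In an iterative schedule such as 400, 300, 200, … the sensitivities would be recomputed with a fresh bootstrap after every pruning step, as the script does by rerunning option 33.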

[Figure: Correlation vs # descriptors for HIV. x-axis: Number of Features (0 to 200); y-axis: Correlation (0.56 to 0.76)]

STARPLOT.BAT: Starplot for Selected Features for Albumin

REM GENERATE STARPLOT
REM DO FIRST A DUMMY PRUNING
REM i.e., RUN PRUNE.BAT BUT SELECT ALL
REM FEATURES TO REORDER BY SENSITIVITY

REM DO 30 BOOTSTRAPS (7; 30; 2)
analyze a.pat 33

REM GENERATE STARPLOT (9)
analyze a.pat 3320
analyze starpl_reg.txt 3350
pause

sel_lbls.txt, aa.pat

bbmatrixxx.txt

sel_lbls.txt, starplot.txt

starplot

• First generate bbmatrixxx.txt, which contains all sensitivities for (e.g.) 30 bootstraps, using PLS bootstrap option 33
• Generate starplot.txt from bbmatrixxx.txt using option 3320
• Use the MATLAB routine starplot.m (operates on starplot.txt and sel_lbls.txt)
