Dec 21, 2015
• Analyze/StripMiner™ Overview
• To obtain an idiot’s guide type “analyze > readme.txt”
• Standard Analyze Scripts
• Predicting on Blind Data
• PLS (Please Listen to Svante Wold)
• LOO, BOO and n-Fold Cross-Validation Error Measures
• Albumin Data Set and Feature Selection
• Bio-Informatics
Analyze/StripMiner™
• Feature Selection
  – Sensitivity Analysis
  – Genetic Algorithms
  – Correlation GA (GAFEAT)
  – Method specific
• Learning Modes
  – Bootstrapping
  – Bagging
  – Boosting
  – Leave-one-out cross-validation
• Data Processing
  – Interface with RECON
  – Different Scaling Modes
  – Outlier detection/data cleansing
• Visualization
  – Correlation Plots
  – 2-D Sensitivity Plots
  – Outlier Visualization Plots
  – Different Scaling Options
  – Cluster Ranking Plots
  – Standard ROC curves
  – Continuous ROC curves
• Modeling
  – ANN (Neural Networks)
  – SVM (Support Vector Machines)
  – PLS (Partial-Least Squares)
  – GA-based regression clustering
  – PCA regression
  – Local Learning
  – Outlier Detection (GAMOL)
• Code Specifics
  – Tight Classic C-code (< 15000 lines)
  – Script-Based Shell Program
  – Runs on all Platforms
  – Ultra Fast
  – Use: TransScan – GE - KODAK Doppler broadening Macro-Economics Analysis
Analyze/StripMiner ™ Coding Philosophy
• Standard C code that compiles on all platforms
• WINDOWS™ and Linux platforms
• Supporting visualizations use Java and/or gnuplot
• Flexible GUI with sample problems and demos
• Fastest code possible with efficient memory requirements
• Long history of code use with variety of users for troubleshooting
• Flexible code based on scripts and operators
• Operates on a numeric standard data mining format file
Practical Tips for PCA
• NIPALS algorithm assumes the features are zero centered
• It is standard practice to do a Mahalanobis scaling of the data
• PCA regression does not consider the response data
• The t’s are called the scores
• It is common practice to drop 4 sigma outlier features (if there are many features)
$$x_i^{\text{scaled}} = \frac{x_i - \bar{x}}{\sigma_x}$$
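The scaling rule and the 4 sigma tip above can be sketched in a few lines of Python. This is a minimal illustration, not the Analyze/StripMiner implementation: the function names, the use of the sample standard deviation, and the reading of "4 sigma outlier feature" as "a feature with at least one value more than 4 standard deviations from its mean" are all assumptions.

```python
from statistics import mean, stdev


def scale_column(values):
    """Center one feature to zero mean and scale it to unit standard deviation.

    Assumes the feature is not constant (sigma would be zero).
    """
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]


def drop_outlier_features(columns, limit=4.0):
    """Scale every feature and keep only those with no value beyond `limit` sigma.

    `columns` maps a (hypothetical) feature label to its list of values.
    """
    kept = {}
    for name, values in columns.items():
        scaled = scale_column(values)
        if max(abs(z) for z in scaled) <= limit:
            kept[name] = scaled
    return kept
```

With many samples, a single extreme value is enough to push a feature over the limit, which is why the tip is only recommended when there are plenty of features to spare.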
StripMiner Script Examples
• PCA visualization (pca.bat)
• Pharma-plot (pharma.bat)
• Prediction for iris with PCA (iris.bat)
• Bootstrap prediction for iris (iris_boo.bat)
• Predicting with an external test set example (iris_ext.bat)
• PLS and ROC curve for iris problem (roc.bat)
• Leave-One-Out PLS for HIV (loo_hiv.bat)
• Feature selection for HIV (prune.bat)
• Starplots (star.bat)
File Flow for PCA.bat Script
• num_eg.txt contains the number of PCAs (2-10)
• usually data are first Mahalanobis scaled (option #-3: “PLS scaling”, data only)
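For readers who want to see what extracting a given number of principal components involves, here is a minimal NIPALS sketch in plain Python. It assumes the data are already zero centered, as the PCA tips require. The function name, tolerance, and iteration cap are illustrative choices, and nothing here is taken from the Analyze source code.

```python
def nipals_pca(rows, n_components, tol=1e-12, max_iter=500):
    """Return (scores, loadings) for zero-centered data given as a list of rows."""
    resid = [list(r) for r in rows]          # working copy, deflated per component
    n, m = len(resid), len(resid[0])
    scores, loadings = [], []
    for _ in range(n_components):
        # Start from the column with the largest sum of squares.
        j0 = max(range(m), key=lambda j: sum(resid[i][j] ** 2 for i in range(n)))
        t = [resid[i][j0] for i in range(n)]
        for _ in range(max_iter):
            tt = sum(v * v for v in t)
            # Loading vector: project the residual onto the scores, then normalize.
            p = [sum(resid[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
            norm = sum(v * v for v in p) ** 0.5
            p = [v / norm for v in p]
            # New scores: project the residual onto the loading vector.
            t_new = [sum(resid[i][j] * p[j] for j in range(m)) for i in range(n)]
            change = sum((a - b) ** 2 for a, b in zip(t_new, t))
            t = t_new
            if change <= tol * sum(v * v for v in t):
                break
        # Deflate: remove the component just found from the residual.
        for i in range(n):
            for j in range(m):
                resid[i][j] -= t[i] * p[j]
        scores.append(t)
        loadings.append(p)
    return scores, loadings
```

Each entry of `scores` is one t vector (one value per sample), which matches the terminology "the t's are called the scores".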
num_eg.txt, stats.txt, la_sscala.txt, iris.txt.txt.txt.txt
REM PCA.BAT FOR IRIS DATA (5)
analyze iris.txt 3301
REM ELIMINATE COMMAS
analyze pca.txt.txt 100
REM MAHALANOBIS SCALE ATTRIBUTES ONLY
analyze iris.txt.txt -3
REM DEFINE # PRINCIPAL COMPONENTS (4)
analyze num_eg.txt 105
REM CREATE PCA COMPONENTS
analyze iris.txt.txt.txt 36
pause
REM VISUALIZE PCAs (3)
analyze iris.txt.txt.txt.txt 3350
pause
REM PHARMAPLOT
analyze iris.txt.txt.txt.txt 3308
pause
• num_eg.txt has to contain a 4 for a pharmaplot
• use pharmaplot.m for visualization in MATLAB
• adjust color setting threshold in pharmaplot.m
File Flow for pharma.bat script
num_eg.txt, stats.txt, la_sscala.txt, dmatrix.txt, a.txt
pharmaplot
REM PHARMA.BAT FOR IRIS DATA (5)
analyze iris.txt 3301
REM ELIMINATE COMMAS
analyze pca.txt.txt 100
REM MAHALANOBIS SCALE ATTRIBUTES AND RESPONSE
analyze iris.txt.txt 3
REM DEFINE # PRINCIPAL COMPONENTS (4)
analyze num_eg.txt 105
REM SPLIT DATA IN TRAINING AND TEST SET (100 2)
analyze iris.txt.txt.txt 20
copy cmatrix.txt a.pat
copy dmatrix.txt a.tes
pause
REM DO PHARMAPLOT (assumes dmatrix.txt for test data)
analyze a.pat 1028
pause
• For the random seed in the splitting routine don’t use 0 (it preserves the order)
• The test set is really only for validation purposes (answer is known)
• Note: descaling from PLS uses the la_sscala.txt file
• Notice the q2, Q2, and RMSE error measures
File Flow For iris.bat Script: Predicting Class
num_eg.txt
stats.txt, la_sscala.txt, a.txt, cmatrix.txt, dmatrix.txt, resultss.xxx, resultss.ttt, results.xxx, results.ttt
REM IRIS.BAT (PCA REGRESSION)
REM GENERATE IRIS DATA (5)
analyze iris.txt 3301
REM ELIMINATE COMMAS
analyze iris.txt 100
REM MAHALANOBIS SCALE
analyze iris.txt.txt 3
REM GENERATE # PCAs (4)
analyze num_eg.txt 105
REM SPLIT DATA (100 2)
analyze iris.txt.txt.txt 20
copy cmatrix.txt a.pat
copy dmatrix.txt a.tes
REM MAKE PCA REGRESSION MODEL
analyze a.pat 17
analyze a.tes 18
pause
REM VISUALIZE RESULTS
analyze resultss.xxx 4
copy results.ttt results.xxx
analyze resultss.ttt 4
analyze results.ttt 3313
pause
• We use bootstrap cross-validation (e.g., leave 7 out 100 times)
• Use MATLAB script dos_mbotw results.ttt to display results for the test set
• Use MATLAB script dos_mbotw resultss.xxx to display results for the training set
• Notice the q2, Q2, and RMSE error measures
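The "leave 7 out 100 times" idea can be sketched as follows. This is an illustration under stated assumptions: it reads the procedure as repeatedly holding out 7 randomly chosen samples, fitting on the rest, and predicting the held-out samples. A one-feature least-squares line stands in for the PCA/PLS model, and the function names and the meaning of the seed argument are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept (stand-in model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def bootstrap_validate(xs, ys, leave_out=7, repeats=100, seed=2):
    """Collect (index, actual, predicted) for every held-out sample of every repeat."""
    rng = random.Random(seed)            # a fixed seed makes the splits repeatable
    indices = list(range(len(xs)))
    held_out = []
    for _ in range(repeats):
        test_idx = set(rng.sample(indices, leave_out))
        train_idx = [i for i in indices if i not in test_idx]
        slope, intercept = fit_line([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])
        for i in sorted(test_idx):
            held_out.append((i, ys[i], slope * xs[i] + intercept))
    return held_out
```

Because each sample is held out many times, the spread of its predictions across repeats gives a rough picture of prediction confidence.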
File Flow for iris_boo.bat Script: Bootstrap Validation for Estimating Prediction Confidence
num_eg.txt, stats.txt, la_sscala.txt, a.txt, resultss.xxx, resultss.ttt, results.ttt
REM IRIS_BOO.BAT
REM GENERATE IRIS DATA (5)
analyze iris.txt 3301
REM ELIMINATE COMMAS
analyze iris.txt 100
REM MAHALANOBIS SCALE
analyze iris.txt.txt 3
REM GENERATE # PCAs (4)
analyze num_eg.txt 105
REM SPLIT DATA (100 2)
analyze iris.txt.txt.txt 20
copy cmatrix.txt a.pat
copy dmatrix.txt a.tes
REM MAKE PCA REGRESSION MODEL (7 100 2)
analyze a.pat -33
analyze a.tes 18
analyze a.pat 18
pause
REM VISUALIZE RESULTS
analyze resultss.xxx 4
copy results.ttt results.xxx
analyze resultss.ttt 4
analyze results.ttt 3313
pause
Error Measure Criteria
For the training set we use:
- RMSE: root mean square error for the training set
- r2: squared correlation coefficient for the training set
- R2: PRESS R2
For the validation/test set we use:
- RMSE: root mean square error for the validation set
- q2: 1 - r2, with r2 computed on the test set
- Q2: PRESS/SD
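$$Q^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

$$RMSE = \sqrt{\frac{1}{n} \sum_{i} (\hat{y}_i - y_i)^2}$$

$$r^2 = \frac{\left[ \sum_{i=1}^{n_{train}} (\hat{y}_i - \bar{\hat{y}})(y_i - \bar{y}) \right]^2}{\sum_{i=1}^{n_{train}} (\hat{y}_i - \bar{\hat{y}})^2 \, \sum_{i=1}^{n_{train}} (y_i - \bar{y})^2}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n_{train}} (\hat{y}_i - y_i)^2}{\sum_{i=1}^{n_{train}} (y_i - \bar{y})^2}$$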
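To make the error measures concrete, the sketch below computes them from a list of actual values and a list of predictions. The function names are illustrative, and the formulas follow the definitions in this section: r2 as the squared correlation between predictions and actual values, R2 as one minus the ratio of squared prediction error to squared deviation from the mean, q2 as 1 - r2, and Q2 as that ratio itself.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean square error."""
    return sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def squared_correlation(actual, predicted):
    """Squared Pearson correlation between predictions and actual values."""
    n = len(actual)
    ma, mp = sum(actual) / n, sum(predicted) / n
    cov = sum((p - mp) * (a - ma) for a, p in zip(actual, predicted))
    var_p = sum((p - mp) ** 2 for p in predicted)
    var_a = sum((a - ma) ** 2 for a in actual)
    return cov ** 2 / (var_p * var_a)


def press_over_sd(actual, predicted):
    """Sum of squared prediction errors over sum of squared deviations from the mean."""
    ma = sum(actual) / len(actual)
    press = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    return press / sum((a - ma) ** 2 for a in actual)


def error_measures(actual, predicted):
    """Training-style (r2, R2) and validation-style (q2, Q2) measures in one dict."""
    r2 = squared_correlation(actual, predicted)
    ratio = press_over_sd(actual, predicted)
    return {"RMSE": rmse(actual, predicted),
            "r2": r2, "R2": 1.0 - ratio,
            "q2": 1.0 - r2, "Q2": ratio}
```

For r2 and R2 higher is better, while for q2 and Q2 lower is better, so a perfect validation prediction gives q2 = Q2 = 0.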
Script for Scaling with an External Test Set
REM IRIS_EXT.BAT
REM GENERATE IRIS DATA (5)
analyze iris.txt 3301
REM ELIMINATE COMMAS
analyze iris.txt 100
REM SPLIT DATA IN TRAINING AND TEST SET (100 2)
analyze iris.txt.txt 20
copy cmatrix.txt a.pat
copy dmatrix.txt a.tes
REM SCALE TRAINING AND TEST DATA CONSISTENTLY
analyze a.pat 3314159
analyze a.tes 3314159
REM GENERATE # PCAs (4)
analyze num_eg.txt 105
REM MAKE PCA REGRESSION MODEL
analyze a.pat 17
analyze a.tes 18
pause
REM VISUALIZE RESULTS
analyze resultss.xxx 4
copy results.ttt results.xxx
analyze resultss.ttt 4
analyze results.ttt 3313
pause
analyze resultss.ttt 3305
pause
• 3305 scatterplot (Java)
• -3305 scatterplot (gnuplot)
• 3313 errorplot (Java)
• -3313 errorplot (gnuplot)
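Scaling training and test data "consistently" is commonly understood to mean that the scaling parameters are estimated on the training set only and then reused unchanged on the test set. The sketch below illustrates that reading. It is an assumption about what the script's scaling step does, and the function names are hypothetical.

```python
from statistics import mean, stdev


def fit_scaler(train_rows):
    """Estimate (mean, standard deviation) per column from the training rows only."""
    columns = list(zip(*train_rows))
    return [(mean(col), stdev(col)) for col in columns]


def apply_scaler(rows, params):
    """Scale any rows (training or test) with previously fitted parameters."""
    return [[(v - mu) / sigma for v, (mu, sigma) in zip(row, params)]
            for row in rows]
```

A typical use is `params = fit_scaler(train)` followed by `apply_scaler(train, params)` and `apply_scaler(test, params)`, so that test samples are never used to compute the means and standard deviations.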
[Figure: ROC curve (median). x-axis: False Positive; y-axis: True Positive; goodness = 0.9897]
Docking Ligands is a Nonlinear Problem
PLS, K-PLS, SVM, ANN
[Figure: 1 - Q2 versus # Features on Validation Set. x-axis: # Features; y-axis: 1 - Q2; gnuplot of 'evolve.txt' using 1:2, Thu Mar 13 15:59:57 2003]
Feature Selection (data strip mining)
Script for ALBUMIN_LOO.BAT: PLS-LOO Validation for Albumin Data
cmatrix.ori, dmatrix.ori, num_eg.txt
stats.txt, la_sscala.txt, a.txt, results.xxx, results.ttt, sel_lbls.txt, bbmatrixx.txt, bbmatrixxx.txt
REM ALBUMIN_LOO.BAT
REM RECOVER DATA FROM ORIGINAL FILES
copy a.pat + a.tes a.txt
REM GENERATE GENERIC LABELS IN SEL_LBLS.TXT
analyze a.txt 116
REM PLS SCALE DATA
analyze a.txt 3
copy a.txt.txt a.txt
REM DROP NON-CHANGING FEATURES
analyze a.txt 5
copy a.txt.txt a.txt
copy sel_lblss.txt sel_lbls.txt
REM DROP 4-sigma OUTLIERS
analyze a.txt 9
copy a.txt.txt a.txt
copy sel_lblss.txt sel_lbls.txt
REM RESPLIT DATA (answer 84 and 0)
analyze a.txt 20
copy cmatrix.txt a.txt
copy dmatrix.txt b.txt
REM LEAVE-ONE OUT PLS VALIDATION
analyze a.pat 25
copy resultss.ttt resultss.xxx
analyze a.tes 18
REM RESCALE AND DISPLAY METRICS
analyze resultss.xxx 4
copy results.ttt results.xxx
analyze resultss.ttt 4
REM DISPLAY SCATTERPLOT
analyze results.xxx 3305
pause
analyze results.ttt 3305
pause
• PLS-LOO stands for leave-one-out PLS cross-validation
• Training set is in cmatrix.ori and external validation set in dmatrix.ori
• External validation set has –999 or 0 in the activity field
• Note that we create generic labels and that there is a test set
• Notice the dropping of non-changing features and 4-sigma outliers
• Notice the acrobatics for displaying metrics (visualize with dos_mbotw)
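Leave-one-out cross-validation itself is simple to state in code: each sample is predicted by a model fitted on all the other samples. The sketch below is a minimal illustration in which a one-feature least-squares line stands in for PLS. The function names are hypothetical and this is not the Analyze implementation.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept (stand-in for PLS)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def loo_predictions(xs, ys):
    """Predict every sample from a model trained on the remaining samples."""
    predictions = []
    for k in range(len(xs)):
        train_x = xs[:k] + xs[k + 1:]    # all samples except sample k
        train_y = ys[:k] + ys[k + 1:]
        slope, intercept = fit_line(train_x, train_y)
        predictions.append(slope * xs[k] + intercept)
    return predictions


def press(ys, predictions):
    """Predictive residual sum of squares over the left-out predictions."""
    return sum((p - y) ** 2 for y, p in zip(ys, predictions))
```

The left-out predictions can then be fed into the same error measures used for a validation set.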
PLS Feature Selection Script For Albumin Data
REM PREPRUNE.BAT: STORE FILES IN CASE…
copy a.txt aa.txt
copy b.txt bb.txt
copy sel_lbls.txt label.txt
REM RUN PLS BOOTSTRAPS BEFORE PRUNING (7, 100, 2)
REM BBMATRIXX.TXT WILL BE USED FOR PRUNING
REM ONLY PRUNE FROM OPTION 33 (FOR THE TIME BEING)
analyze aa.txt 33
• Do several iterative prunings, typically leave 7 out 100 times
• Use different seeds
• Example numbers of selected features: 400, 300, 200, 150, 120, 100, 80, 60, 50, 45, …
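One pruning step can be pictured as ranking features by their average sensitivity over the bootstrap runs and keeping the top ones. The sketch below assumes the sensitivities are available as one list per bootstrap run with one value per feature. That layout, the function names, and the "higher sensitivity means more important" rule are illustrative assumptions, not a description of the bbmatrixx.txt format or of the selection logic in Analyze.

```python
def mean_sensitivity(runs):
    """Average each feature's sensitivity over all bootstrap runs."""
    n_runs = len(runs)
    n_features = len(runs[0])
    return [sum(run[j] for run in runs) / n_runs for j in range(n_features)]


def select_features(labels, runs, keep):
    """Return the labels of the `keep` features with the highest mean sensitivity."""
    avg = mean_sensitivity(runs)
    order = sorted(range(len(labels)), key=lambda j: avg[j], reverse=True)
    return [labels[j] for j in order[:keep]]
```

Repeating this with a shrinking `keep` value (for example 400, then 300, then 200) and re-running the bootstraps between steps mirrors the iterative schedule listed above.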
aa.pat, bbmatrixx.txt, sel_lbls.txt
select.txt
sel_lbls.txt, aa.pat, aa.tes, bbmatrixx.txt, bbmatrixxx.txt
REM PRUNE_ALBUMIN.BAT
analyze aa.pat 3318
REM USE SELECT.TXT TO PRUNE OUT FEATURES
analyze aa.pat 10
analyze aa.tes 10
copy aa.pat.txt aa.pat
copy aa.tes.txt aa.tes
copy sel_lblss.txt sel_lbls.txt
REM DO ANOTHER BOOTSTRAP (7 100 2)
analyze aa.pat 33
pause
[Figure: Correlation vs # descriptors for HIV. x-axis: Number of Features; y-axis: Correlation]
STARPLOT.BAT: Starplot for Selected Features for Albumin
REM GENERATE STARPLOT
REM DO FIRST A DUMMY PRUNING
REM i.e., RUN PRUNE.BAT BUT SELECT ALL
REM FEATURES TO REORDER BY SENSITIVITY
REM DO 30 BOOTSTRAPS (7; 30; 2)
analyze a.pat 33
REM GENERATE STARPLOT (9)
analyze a.pat 3320
analyze starpl_reg.txt 3350
pause
sel_lbls.txt, aa.pat
bbmatrixxx.txt
sel_lbls.txt, starplot.txt
starplot
• First generate bbmatrixxx.txt, which contains all sensitivities for (e.g.) 30 bootstraps, using PLS bootstrap option 33
• Generate starplot.txt from bbmatrixxx.txt using option 3320
• Use the MATLAB routine starplot.m (operates on starplot.txt and sel_lbls.txt)