Presented at the Albany Chapter of the ASA, February 25, 2004, Washington DC
Direct Kernel Methods for the Detection of Ischemia from Magnetocardiograms: Support Vector Machines for the Rest of Us
Department of Decision Sciences and Engineering Systems, Rensselaer Polytechnic Institute, Troy, NY 12180
Supported by NSF Grant SBIR Phase I # 0232215 and KDI # IIS-9979860
Magnetocardiography at CardioMag Imaging, Inc.
With Bolek Szymanski and Karsten Sternickel
Left: Filtered and averaged temporal MCG traces for one cardiac cycle in 36 channels (the 6x6 grid). Right upper: Spatial map of the cardiac magnetic field, generated at an instant within the ST interval. Right lower: T3-T4 sub-cycle in one MCG signal trace.
Classical (Linear) Regression Analysis: Predict y from X

$$X_{nm}\,w_m = y_n$$

$$X^T_{mn}\,X_{nm}\,w_m = X^T_{mn}\,y_n$$

$$\hat{w}_m = \left(X^T_{mn}\,X_{nm}\right)^{-1}X^T_{mn}\,y_n$$

The pseudo-inverse $\left(X^T_{mn}\,X_{nm}\right)^{-1}X^T_{mn}$ solves for the weights $\hat{w}_m$.
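Below is a minimal NumPy sketch of the pseudo-inverse solution above; the data matrix and response are synthetic stand-ins for the n-by-m descriptor matrix X and the response vector y.

```python
import numpy as np

# Synthetic stand-ins for the n x m data matrix X and the response y.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                     # n = 20 records, m = 3 descriptors
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)

# Normal equations: w_hat = (X^T X)^{-1} X^T y.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, numerically safer route via the pseudo-inverse itself.
w_pinv = np.linalg.pinv(X) @ y
print(np.allclose(w_hat, w_pinv))                # True
```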
What have we learned so far?

• There is a “learning paradox” because of redundancies in the data
• We resolved this paradox by “regularization”:
 - In the case of PCA we used the eigenvectors of the feature kernel
 - In the case of ridge regression we added a ridge to the data kernel
• So far, prediction models involved only linear algebra → strictly linear
• What is in a kernel?
$$k_{ij} = \vec{x}_i \cdot \vec{x}_j$$

$$K^D_{nn} = X_{nm}\,X^T_{mn} = \begin{pmatrix} k_{11} & k_{12} & \cdots & k_{1n} \\ k_{21} & k_{22} & \cdots & k_{2n} \\ \vdots & & k_{ij} & \vdots \\ k_{n1} & k_{n2} & \cdots & k_{nn} \end{pmatrix}$$

The data kernel contains linear similarity measures (correlations) of data records $\vec{x}_i$ and $\vec{x}_j$.
Kernels
• What is a kernel?
 - The data kernel expresses a similarity measure between data records
 - So far, the kernel contains linear similarity measures → linear kernel
$$K^D_{nn} = \begin{pmatrix} k_{11} & k_{12} & \cdots & k_{1n} \\ k_{21} & k_{22} & \cdots & k_{2n} \\ \vdots & & k_{ij} & \vdots \\ k_{n1} & k_{n2} & \cdots & k_{nn} \end{pmatrix}, \qquad k_{ij} = \vec{x}_i \cdot \vec{x}_j$$
• We actually can make up nonlinear similarity measures as well
So far (linear kernel):

$$k_{ij} = \vec{x}_i \cdot \vec{x}_j, \qquad K^D_{nn} = X_{nm}\,X^T_{mn}$$

Radial Basis Function kernel (nonlinear, based on the distance or difference between records):

$$k_{ij} = e^{-\frac{\left\|\vec{x}_i - \vec{x}_j\right\|^2}{2\sigma^2}}$$
Review: What is in a Kernel?
• A kernel can be considered as a (nonlinear) data transformation - Many different choices for the kernel are possible - The Radial Basis Function (RBF) or Gaussian kernel is an effective nonlinear kernel
• The RBF or Gaussian kernel is a symmetric matrix - Entries reflect nonlinear similarities amongst data descriptions
- As defined by:
$$k_{ij} = e^{-\frac{\left\|\vec{x}_i - \vec{x}_j\right\|^2}{2\sigma^2}}$$

$$K_{nn} = \begin{pmatrix} k_{11} & k_{12} & \cdots & k_{1n} \\ k_{21} & k_{22} & \cdots & k_{2n} \\ \vdots & & k_{ij} & \vdots \\ k_{n1} & k_{n2} & \cdots & k_{nn} \end{pmatrix}$$
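As a concrete illustration, a short NumPy sketch of the symmetric RBF kernel matrix defined above; the bandwidth sigma is an arbitrary choice for the example.

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """Symmetric RBF (Gaussian) kernel: k_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_norms = np.sum(X**2, axis=1)
    # Pairwise squared Euclidean distances between all rows of X.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)          # guard against tiny negatives
    return np.exp(-sq_dists / (2.0 * sigma**2))

X = np.random.default_rng(1).normal(size=(5, 3))
K = rbf_kernel_matrix(X, sigma=2.0)
print(K.shape, np.allclose(K, K.T))               # (5, 5) and symmetric
```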
Direct Kernel Methods for Nonlinear Regression/Classification
• Consider the Kernel as a (nonlinear) data transformation - This is the so-called “kernel trick” (Hilbert, early 1900’s) - The Radial Basis Function (RBF) or Gaussian kernel is an efficient nonlinear kernel
• Linear regression models can be “tricked” into nonlinear models by applying such regression models to kernel-transformed data:
 - PCA → DK-PCA
 - PLS → DK-PLS (Partial Least Squares Support Vector Machines)
 - (Direct) Kernel Ridge Regression → Least Squares Support Vector Machines
 - Direct Kernel Self-Organizing Maps (DK-SOM)
• These methods work in the same space as SVMs:
 - DK models can usually also be derived from an optimization formulation (similar to SVMs)
 - Unlike the original SVMs, DK methods are not sparse (i.e., all data are support vectors)
 - Unlike SVMs, there is no patent on direct kernel methods
 - Performance on hundreds of benchmark problems compares favorably with SVMs
• Classification can be considered as a special case of regression
• Data Pre-processing: Data are usually Mahalanobis scaled first
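“Mahalanobis scaling,” as used on these slides, is here taken to mean column-wise standardization: subtract each descriptor's mean and divide by its standard deviation. A minimal sketch, with the detail that test data must reuse the training statistics:

```python
import numpy as np

def mahalanobis_scale(X_train, X_test):
    """Center each column and divide by its standard deviation (z-scoring).
    Test data are scaled with the *training* mean and standard deviation."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0.0] = 1.0                         # guard against constant columns
    return (X_train - mean) / std, (X_test - mean) / std
```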
Nonlinear PCA in Kernel Space
• Like PCA
• Consider a nonlinear data kernel transformation up front: Data Kernel
• Derive principal components for that kernel (e.g., with NIPALS)
• Examples:
 - Haykin’s Spiral
 - Cherkassky’s nonlinear function model
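A sketch of the DK-PCA recipe above: scale the data, apply the RBF kernel up front, then extract principal components of the kernel. scikit-learn's KernelPCA is used as a convenient stand-in (the slide derives the components with NIPALS instead), and gamma is an arbitrary example value.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)          # Mahalanobis scaling

# RBF kernel transform up front, then PCA on the (centered) kernel.
dk_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1)
T = dk_pca.fit_transform(X)                       # nonlinear principal components
print(T.shape)                                    # (100, 2)
```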
Direct Kernel Partial-Least Squares (K-PLS)
[Diagram: inputs x1, x2, x3 project onto latent variables t1, t2, which predict y.]
• Direct Kernel PLS is PLS with the kernel transform as a preprocessing step
• Consider K-PLS as a “better” nonlinear PLS
• Consider PLS as a “better” PCA
• K-PLS gives almost identical (but more stable) results to SVMs
 - PLS is the method of choice for chemometrics and QSAR drug design
 - hyper-parameters are easy to tune (5 latent variables)
 - unlike SVMs, there is no patent on K-PLS
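A sketch of direct kernel PLS under the assumptions above: the RBF kernel transform as a preprocessing step, followed by ordinary linear PLS with 5 latent variables; the data and gamma value are synthetic stand-ins.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(3)
X_train, X_test = rng.normal(size=(80, 10)), rng.normal(size=(20, 10))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=80)

# Kernel transform as preprocessing: each row becomes the similarities of a
# record to all training records (every training point acts as a support vector).
K_train = rbf_kernel(X_train, X_train, gamma=0.05)
K_test = rbf_kernel(X_test, X_train, gamma=0.05)

# Ordinary linear PLS on the kernel-transformed data = direct kernel PLS.
pls = PLSRegression(n_components=5)               # 5 latent variables
pls.fit(K_train, y_train)
y_pred = pls.predict(K_test)
```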
PCR in Feature Space

[Diagram: a network view of PCR. Inputs x1, ..., xm feed a first layer of Σ nodes whose weights B_mh are the H eigenvectors of X^T X corresponding to the largest eigenvalues; the next layer's weights T^T_hn are the scores (PCAs) for the entire training set; the output layer's weights are the dependent variable y for the entire training data.]

$$\hat{y} = x_m\,B_{mh}\left(T^T_{hn}\,T_{nh}\right)^{-1}T^T_{hn}\,y_n, \qquad \hat{w} = B_{mh}\left(T^T_{hn}\,T_{nh}\right)^{-1}T^T_{hn}\,y_n$$

• The factor $(T^T_{hn}T_{nh})^{-1}$ means that the projections on the eigenvectors will be divided by the corresponding variance (cf. Mahalanobis scaling)
• The output layer gives a weighted similarity score with each data point: a kind of nearest-neighbor weighted prediction score
PCR in Feature Space

[Diagram: inputs x1, ..., xm feed Σ nodes with weights w1, w2, ..., wh (the H eigenvectors of X^T X with the largest eigenvalues); the resulting scores t1, ..., th are combined to predict y.]

• Principal components can be thought of as a data pre-processing step
• Rather than building a model for an m-dimensional input vector x, we now have an h-dimensional t vector:

$$t_{1,h} = x_{1,m}\,B_{mh}$$
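A sketch of PCR as laid out on these two slides: replace X by its h most important principal-component scores T and regress y on T; the data are synthetic and h = 2 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 8))
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=60)
h = 2                                             # number of components kept

# B = eigenvectors of X^T X with the largest eigenvalues; T = X B.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)        # eigh returns ascending order
B = eigvecs[:, ::-1][:, :h]                       # top-h eigenvectors
T = X @ B                                         # scores, n x h

# b_hat = (T^T T)^{-1} T^T y, and the equivalent weights for raw inputs.
b_hat = np.linalg.solve(T.T @ T, T.T @ y)
w_hat = B @ b_hat                                 # w = B (T^T T)^{-1} T^T y
y_hat = X @ w_hat
```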
Predictions on Test Cases with DK-SOM

Use of a direct kernel self-organizing map in testing mode for the detection of patients with ischemia (patient IDs shown in red). The darker hexagons, colored during a separate training phase, represent nodes corresponding to ischemia cases.
Outlier Detection Procedure in Analyze

start
→ One-class SVM on training data (proprietary regularization mechanism)
→ Determine the number of outliers from the elbow plot
→ Eliminate outliers from the training set
→ Run K-PLS for the new training/test data
→ See whether the outliers make sense on pharmaplots; inspect outlier clusters on SOMs
→ end

Output: a list of outlier pattern IDs; outliers are flagged in pharmaplots.
[Plot: target and predicted values vs. sorted sequence number; q2 = 0.818, Q2 = 2.042, RMSE = 1.364]
[Plot: "Outlier Detection Plot (1/C)": response vs. sorted index number, from 'outliers.txt' using 1:3]
Tagging Outliers on Pharmaplot with Analyze Code
[Plot: "Outlier Detection Plot (1/C)": response vs. sorted index number, from 'outliers.txt' using 1:3]
“Elbows” suggest 7-14 outliers
“Elbow” Plot for Specifying # Outliers
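A hedged sketch of an elbow plot in this spirit: the slide plots a proprietary 1/C quantity, which is not public, so sorted one-class SVM decision scores are used here as a stand-in ranking of outlier-likeness.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(size=(70, 4)),
               rng.normal(loc=4.0, size=(8, 4))])  # 8 planted outliers

scores = OneClassSVM(kernel="rbf", nu=0.15).fit(X).decision_function(X)

# Sort the per-pattern scores: a sharp "elbow" at the left of the curve
# suggests how many leading points to treat as outliers.
plt.plot(np.sort(scores), marker=".")
plt.xlabel("sorted index number")
plt.ylabel("response")
plt.title("Elbow plot for choosing # outliers")
plt.show()
```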
One-Class SVM Results for MCG Data
[Plot: target and predicted values vs. sorted sequence number; q2 = 0.818, Q2 = 2.042, RMSE = 1.364]
Outlier/Novelty Detection Methods in Analyze: Hypotheses
• One-class SVMs are commonly cited for outlier detection (e.g., Suykens)
 - used publicly available SVM code (LibSVM)
 - Analyze has user-friendly interface operators for using LibSVM
• Proprietary heuristic tuning for C in SVMs
 - the heuristic tuning method is explained in previous publications
 - heuristic tuning is essential to make outlier detection work properly
• “Elbow” curves for indicating # outliers
• Pharmaplots justify/validate detection from different methods
• Pharmaplots extended to PLS, K-PCA, and K-PLS
One-Class SVM: Brief Theory
• Well-known method for outlier and novelty detection in the SVM literature (e.g., see Suykens)
• LibSVM, a publicly available SVM code for general use, has a one-class SVM option built in (see Chih-Chung Chang and Chih-Jen Lin)
• Analyze has operators to interface with LibSVM
• Theory:
 - One-class SVM ignores the response (assumes all zeros for responses)
 - Maximizes spread and subtracts a regularization term
 - Suykens, p. 203, has the following formulation:

$$\max_{w,e}\; J_p(w,e) = \gamma\,\frac{1}{2}\sum_{k=1}^{N} e_k^2 \;-\; \frac{1}{2}\,w^T w \quad \text{such that}\quad e_k = w^T x_k,\; k = 1,\ldots,N$$

 - $\gamma$ is a regularization parameter; Analyze has a proprietary way to determine $\gamma$
• Application:
 - Analyze combines one-class SVMs with pharmaplots to see whether outliers can be explained and make sense
 - Analyze has elbow curves to assist the user in determining # outliers
 - The combination of one-class SVMs with pharmaplots gave excellent results on several industrial (non-pharmaceutical) data sets
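A sketch of the one-class SVM step using scikit-learn's OneClassSVM, which wraps the same LibSVM code the slide cites; Analyze's proprietary regularization tuning is not public, so nu below is an arbitrary placeholder.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(5)
X_train = rng.normal(size=(80, 6))                # the response is ignored

# nu bounds the fraction of training points flagged as outliers; it stands
# in for the proprietary regularization tuning used in Analyze.
oc_svm = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.1)
oc_svm.fit(X_train)

flags = oc_svm.predict(X_train)                   # -1 = outlier, +1 = inlier
outlier_ids = np.where(flags == -1)[0]            # "list of outlier pattern IDs"
```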
NIPALS ALGORITHM FOR PLS (with just one response variable y)
• Start for a PLS component:

$$\hat{w}^1_m = \frac{X^T_{mn}\,y_n}{y^T_n\,y_n}, \qquad w^1_m = \frac{\hat{w}^1_m}{\sqrt{\hat{w}^{1T}_m\,\hat{w}^1_m}}$$

• Calculate the score t:

$$t^1_n = X_{nm}\,w^1_m$$

• Calculate c’:

$$c'^1 = \frac{t^{1T}_n\,y_n}{t^{1T}_n\,t^1_n}$$

• Calculate the loading p:

$$p^1_m = \frac{X^T_{mn}\,t^1_n}{t^{1T}_n\,t^1_n}$$

• Store t in T, store p in P, store w in W
• Deflate the data matrix and the response variable:

$$X_{nm} \leftarrow X_{nm} - t^1_n\,p^{1T}_m, \qquad y_n \leftarrow y_n - t^1_n\,c'^1$$

• Do the above for h latent variables
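A direct NumPy transcription of the NIPALS steps above for a single response y; the variable names mirror the slide's w, t, c', and p.

```python
import numpy as np

def nipals_pls1(X, y, h):
    """NIPALS PLS with one response: returns scores T, loadings P, weights W, c's."""
    X, y = X.astype(float).copy(), y.astype(float).copy()
    T, P, W, C = [], [], [], []
    for _ in range(h):                            # do for h latent variables
        w = X.T @ y / (y @ y)                     # start for a PLS component
        w = w / np.sqrt(w @ w)                    # normalize w
        t = X @ w                                 # score
        c = (t @ y) / (t @ t)                     # c'
        p = X.T @ t / (t @ t)                     # loading
        T.append(t); P.append(p); W.append(w); C.append(c)
        X = X - np.outer(t, p)                    # deflate the data matrix
        y = y - t * c                             # deflate the response
    return np.array(T).T, np.array(P).T, np.array(W).T, np.array(C)
```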
Outlier/Novelty Detection Methods in Analyze
• Outlier detection methods were extensively tested:
 - on a variety of different UCI data sets
 - models sometimes showed significant improvement after removal of outliers
 - models were rarely worse
 - outliers could be validated on pharmaplots and led to enhanced insight
• The pharmaplots confirm the validity of outlier detection with one-class SVM
• Prediction on the test set for the albumin data improves the model
• A non-pharmaceutical (medical) data set actually shows two data points in the training set that probably were given wrong labels (Appendix A)
[Figure: cardiac cycle trace with the P, Q, R, S, and T waves labeled]
Innovations in Analyze for Outlier Detection
• User-friendly procedure with automated processes
• Interface for one-class SVM from LibSVM
• Automated tuning for regularization parameters
• Elbow plots to determine the number of outliers
• Combination of LibSVM outliers with pharmaplots
 - efficient visualization of outliers
 - facilitates interpretation of outliers
• Extended pharmaplots
 - PCA
 - K-PCA
 - PLS
 - K-PLS
• User-friendly and efficient SOM with outlier identification
• Direct-kernel-based outlier detection as an alternative to LibSVM
Principal Component Analysis (PCA)
$$T_{nh} = X_{nm}\,B_{mh}, \qquad X_{nm} \approx T_{nh}\,B^T_{hm}$$

$$\hat{y}_n = T_{nh}\,\hat{b}_h, \qquad \hat{b}_h = \left(T^T_{hn}\,T_{nh}\right)^{-1}T^T_{hn}\,y_n$$
• We introduce a modest set of the h most important principal components, $T_{nh}$
• Replace the data $X_{nm}$ by the most important principal components $T_{nh}$
• The most important T’s are the ones corresponding to the largest eigenvalues of $X^TX$
• The B’s are the eigenvectors of $X^TX$, ordered from largest to smallest eigenvalue
• In practice, calculation of the B’s and T’s proceeds iteratively with the NIPALS algorithm
• NIPALS: nonlinear iterative partial least squares (Herman Wold)
[Diagram: inputs x1, x2, x3 project onto principal components t1, t2, which predict y.]
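Since the slide notes that the B's and T's are computed iteratively with NIPALS, here is a compact sketch of NIPALS-style PCA; it deflates X after each extracted component (initializing from the first column is a simplification that can fail in degenerate cases).

```python
import numpy as np

def nipals_pca(X, h, n_iter=100):
    """Iterative NIPALS extraction of the top-h principal components."""
    X = X - X.mean(axis=0)                        # center the columns
    T, B = [], []
    for _ in range(h):
        t = X[:, 0].copy()                        # initial guess for the score
        for _ in range(n_iter):
            b = X.T @ t / (t @ t)                 # loading direction
            b = b / np.sqrt(b @ b)                # normalize: eigenvector of X^T X
            t = X @ b                             # score
        T.append(t); B.append(b)
        X = X - np.outer(t, b)                    # deflate
    return np.array(T).T, np.array(B).T
```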
Partial Least Squares (PLS)
• Similar to PCA
• PLS: Partial Least Squares / Projection to Latent Structures / Please Listen to Svante
• The t’s are now called scores or latent variables, and the p’s are the loading vectors
• The loading vectors are no longer orthogonal and are influenced by the y vector
• A special version of NIPALS is also used to build up the t’s
[Diagram: inputs x1, x2, x3 project onto latent variables t1, t2, which predict y.]

$$W^* = W\left(P^T\,W\right)^{-1}$$

$$T_{nh} = X_{nm}\,W^*, \qquad X_{nm} \approx T_{nh}\,P^T_{hm}$$

$$\hat{y}_n = T_{nh}\,\hat{b}_h = X_{nm}\,W\left(P^T\,W\right)^{-1}\hat{b}_h, \qquad \hat{b}_h = \left(T^T_{hn}\,T_{nh}\right)^{-1}T^T_{hn}\,y_n$$
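The W* identity above can be checked against a library implementation; a sketch using scikit-learn's PLSRegression, which exposes the slide's W, P, and W* as x_weights_, x_loadings_, and x_rotations_.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 6))
y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=50)

pls = PLSRegression(n_components=2, scale=False).fit(X, y)
W, P = pls.x_weights_, pls.x_loadings_            # the slide's W and P

# W* = W (P^T W)^{-1} maps (centered) X directly onto the scores T.
W_star = W @ np.linalg.inv(P.T @ W)
print(np.allclose(W_star, pls.x_rotations_))      # True: sklearn stores W* too
T = (X - X.mean(axis=0)) @ W_star                 # reproduces the PLS scores
```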
Kernel PLS (K-PLS)
[Diagram: inputs x1, x2, x3 project onto latent variables t1, t2, which predict y.]
• Invented by Rosipal and Trejo (Journal of Machine Learning Research, 2001)
• Consider K-PLS as a better and nonlinear PLS
• K-PLS gives almost identical results to SVMs for the QSAR data we tried
• K-PLS is a lot faster than SVMs