Presented at the Albany Chapter of the ASA, February 25, 2004, Washington DC
Direct Kernel Methods for the Detection of Ischemia from Magnetocardiograms: Support Vector Machines for the Rest of Us
Department of Decision Sciences and Engineering Systems, Rensselaer Polytechnic Institute, Troy, NY 12180
Supported by NSF Grant SBIR Phase I # 0232215 and KDI # IIS-9979860
Magnetocardiography at CardioMag Imaging, Inc.
With Bolek Szymanski and Karsten Sternickel
Left: Filtered and averaged temporal MCG traces for one cardiac cycle in 36 channels (the 6x6 grid). Right upper: Spatial map of the cardiac magnetic field, generated at an instant within the ST interval. Right lower: T3-T4 sub-cycle in one MCG signal trace.
Classical (Linear) Regression Analysis: Predict y from X

$$X_{nm}\,w_m = y_n$$

$$X^T_{mn}\,X_{nm}\,w_m = X^T_{mn}\,y_n$$

$$\hat{w}_m = \left(X^T_{mn}\,X_{nm}\right)^{-1}X^T_{mn}\,y_n$$

The pseudo-inverse $\left(X^T_{mn}\,X_{nm}\right)^{-1}X^T_{mn}$ solves for the weights $\hat{w}_m$.
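Below is a minimal NumPy sketch of the pseudo-inverse solution above; the data matrix and response are synthetic stand-ins for the n-by-m descriptor matrix X and the response vector y.

```python
import numpy as np

# Synthetic stand-ins for the n x m data matrix X and the response y.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                     # n = 20 records, m = 3 descriptors
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)

# Normal equations: w_hat = (X^T X)^{-1} X^T y.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Equivalent, numerically safer route via the pseudo-inverse itself.
w_pinv = np.linalg.pinv(X) @ y
print(np.allclose(w_hat, w_pinv))                # True
```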
What have we learned so far?

• There is a “learning paradox” because of redundancies in the data
• We resolved this paradox by “regularization”:
 - In the case of PCA we used the eigenvectors of the feature kernel
 - In the case of ridge regression we added a ridge to the data kernel
• So far, prediction models involved only linear algebra → strictly linear
• What is in a kernel?
$$k_{ij} = \vec{x}_i \cdot \vec{x}_j$$

$$K^D_{nn} = X_{nm}\,X^T_{mn} = \begin{pmatrix} k_{11} & k_{12} & \cdots & k_{1n} \\ k_{21} & k_{22} & \cdots & k_{2n} \\ \vdots & & k_{ij} & \vdots \\ k_{n1} & k_{n2} & \cdots & k_{nn} \end{pmatrix}$$

The data kernel contains linear similarity measures (correlations) of data records $\vec{x}_i$ and $\vec{x}_j$.
Kernels
• What is a kernel?
 - The data kernel expresses a similarity measure between data records
 - So far, the kernel contains linear similarity measures → linear kernel
$$K^D_{nn} = \begin{pmatrix} k_{11} & k_{12} & \cdots & k_{1n} \\ k_{21} & k_{22} & \cdots & k_{2n} \\ \vdots & & k_{ij} & \vdots \\ k_{n1} & k_{n2} & \cdots & k_{nn} \end{pmatrix}, \qquad k_{ij} = \vec{x}_i \cdot \vec{x}_j$$
• We actually can make up nonlinear similarity measures as well
So far (linear kernel):

$$k_{ij} = \vec{x}_i \cdot \vec{x}_j, \qquad K^D_{nn} = X_{nm}\,X^T_{mn}$$

Radial Basis Function kernel (nonlinear, based on the distance or difference between records):

$$k_{ij} = e^{-\frac{\left\|\vec{x}_i - \vec{x}_j\right\|^2}{2\sigma^2}}$$
Review: What is in a Kernel?
• A kernel can be considered as a (nonlinear) data transformation - Many different choices for the kernel are possible - The Radial Basis Function (RBF) or Gaussian kernel is an effective nonlinear kernel
• The RBF or Gaussian kernel is a symmetric matrix - Entries reflect nonlinear similarities amongst data descriptions
- As defined by:
$$k_{ij} = e^{-\frac{\left\|\vec{x}_i - \vec{x}_j\right\|^2}{2\sigma^2}}$$

$$K_{nn} = \begin{pmatrix} k_{11} & k_{12} & \cdots & k_{1n} \\ k_{21} & k_{22} & \cdots & k_{2n} \\ \vdots & & k_{ij} & \vdots \\ k_{n1} & k_{n2} & \cdots & k_{nn} \end{pmatrix}$$
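As a concrete illustration, a short NumPy sketch of the symmetric RBF kernel matrix defined above; the bandwidth sigma is an arbitrary choice for the example.

```python
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    """Symmetric RBF (Gaussian) kernel: k_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_norms = np.sum(X**2, axis=1)
    # Pairwise squared Euclidean distances between all rows of X.
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)          # guard against tiny negatives
    return np.exp(-sq_dists / (2.0 * sigma**2))

X = np.random.default_rng(1).normal(size=(5, 3))
K = rbf_kernel_matrix(X, sigma=2.0)
print(K.shape, np.allclose(K, K.T))               # (5, 5) and symmetric
```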
Direct Kernel Methods for Nonlinear Regression/Classification
• Consider the Kernel as a (nonlinear) data transformation - This is the so-called “kernel trick” (Hilbert, early 1900’s) - The Radial Basis Function (RBF) or Gaussian kernel is an efficient nonlinear kernel
• Linear regression models can be “tricked” into nonlinear models by applying such regression models to kernel-transformed data:
 - PCA → DK-PCA
 - PLS → DK-PLS (Partial Least Squares Support Vector Machines)
 - (Direct) Kernel Ridge Regression → Least Squares Support Vector Machines
 - Direct Kernel Self-Organizing Maps (DK-SOM)
• These methods work in the same space as SVMs:
 - DK models can usually also be derived from an optimization formulation (similar to SVMs)
 - Unlike the original SVMs, DK methods are not sparse (i.e., all data are support vectors)
 - Unlike SVMs, there is no patent on direct kernel methods
 - Performance on hundreds of benchmark problems compares favorably with SVMs
• Classification can be considered as a special case of regression
• Data Pre-processing: Data are usually Mahalanobis scaled first
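“Mahalanobis scaling,” as used on these slides, is here taken to mean column-wise standardization: subtract each descriptor's mean and divide by its standard deviation. A minimal sketch, with the detail that test data must reuse the training statistics:

```python
import numpy as np

def mahalanobis_scale(X_train, X_test):
    """Center each column and divide by its standard deviation (z-scoring).
    Test data are scaled with the *training* mean and standard deviation."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0.0] = 1.0                         # guard against constant columns
    return (X_train - mean) / std, (X_test - mean) / std
```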
Nonlinear PCA in Kernel Space
• Like PCA
• Consider a nonlinear data kernel transformation up front: Data Kernel
• Derive principal components for that kernel (e.g., with NIPALS)
• Examples:
 - Haykin’s Spiral
 - Cherkassky’s nonlinear function model
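A sketch of the DK-PCA recipe above: scale the data, apply the RBF kernel up front, then extract principal components of the kernel. scikit-learn's KernelPCA is used as a convenient stand-in (the slide derives the components with NIPALS instead), and gamma is an arbitrary example value.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)          # Mahalanobis scaling

# RBF kernel transform up front, then PCA on the (centered) kernel.
dk_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1)
T = dk_pca.fit_transform(X)                       # nonlinear principal components
print(T.shape)                                    # (100, 2)
```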
Direct Kernel Partial-Least Squares (K-PLS)
[Diagram: inputs x1, x2, x3 project onto latent variables t1, t2, which predict y.]
• Direct Kernel PLS is PLS with the kernel transform as a preprocessing step
• Consider K-PLS as a “better” nonlinear PLS
• Consider PLS as a “better” PCA
• K-PLS gives almost identical (but more stable) results to SVMs
 - PLS is the method of choice for chemometrics and QSAR drug design
 - hyper-parameters are easy to tune (5 latent variables)
 - unlike SVMs, there is no patent on K-PLS
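A sketch of direct kernel PLS under the assumptions above: the RBF kernel transform as a preprocessing step, followed by ordinary linear PLS with 5 latent variables; the data and gamma value are synthetic stand-ins.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(3)
X_train, X_test = rng.normal(size=(80, 10)), rng.normal(size=(20, 10))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=80)

# Kernel transform as preprocessing: each row becomes the similarities of a
# record to all training records (every training point acts as a support vector).
K_train = rbf_kernel(X_train, X_train, gamma=0.05)
K_test = rbf_kernel(X_test, X_train, gamma=0.05)

# Ordinary linear PLS on the kernel-transformed data = direct kernel PLS.
pls = PLSRegression(n_components=5)               # 5 latent variables
pls.fit(K_train, y_train)
y_pred = pls.predict(K_test)
```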
PCR in Feature Space

[Diagram: a network view of PCR. Inputs x1, ..., xm feed a first layer of Σ nodes whose weights B_mh are the H eigenvectors of X^T X corresponding to the largest eigenvalues; the next layer's weights T^T_hn are the scores (PCAs) for the entire training set; the output layer's weights are the dependent variable y for the entire training data.]

$$\hat{y} = x_m\,B_{mh}\left(T^T_{hn}\,T_{nh}\right)^{-1}T^T_{hn}\,y_n, \qquad \hat{w} = B_{mh}\left(T^T_{hn}\,T_{nh}\right)^{-1}T^T_{hn}\,y_n$$

• The factor $(T^T_{hn}T_{nh})^{-1}$ means that the projections on the eigenvectors will be divided by the corresponding variance (cf. Mahalanobis scaling)
• The output layer gives a weighted similarity score with each data point: a kind of nearest-neighbor weighted prediction score
PCR in Feature Space

[Diagram: inputs x1, ..., xm feed Σ nodes with weights w1, w2, ..., wh (the H eigenvectors of X^T X with the largest eigenvalues); the resulting scores t1, ..., th are combined to predict y.]

• Principal components can be thought of as a data pre-processing step
• Rather than building a model for an m-dimensional input vector x, we now have an h-dimensional t vector:

$$t_{1,h} = x_{1,m}\,B_{mh}$$
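A sketch of PCR as laid out on these two slides: replace X by its h most important principal-component scores T and regress y on T; the data are synthetic and h = 2 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 8))
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=60)
h = 2                                             # number of components kept

# B = eigenvectors of X^T X with the largest eigenvalues; T = X B.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)        # eigh returns ascending order
B = eigvecs[:, ::-1][:, :h]                       # top-h eigenvectors
T = X @ B                                         # scores, n x h

# b_hat = (T^T T)^{-1} T^T y, and the equivalent weights for raw inputs.
b_hat = np.linalg.solve(T.T @ T, T.T @ y)
w_hat = B @ b_hat                                 # w = B (T^T T)^{-1} T^T y
y_hat = X @ w_hat
```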
Predictions on Test Cases with DK-SOM

Use of a direct kernel self-organizing map in testing mode for the detection of patients with ischemia (patient IDs shown in red). The darker hexagons, colored during a separate training phase, represent nodes corresponding to ischemia cases.
Outlier Detection Procedure in Analyze

start
→ One-class SVM on training data (proprietary regularization mechanism)
→ Determine the number of outliers from the elbow plot
→ Eliminate outliers from the training set
→ Run K-PLS for the new training/test data
→ See whether the outliers make sense on pharmaplots; inspect outlier clusters on SOMs
→ end

Output: a list of outlier pattern IDs; outliers are flagged in pharmaplots.
[Plot: target and predicted values vs. sorted sequence number; q2 = 0.818, Q2 = 2.042, RMSE = 1.364]
[Plot: "Outlier Detection Plot (1/C)": response vs. sorted index number, from 'outliers.txt' using 1:3]
Tagging Outliers on Pharmaplot with Analyze Code
[Plot: "Outlier Detection Plot (1/C)": response vs. sorted index number, from 'outliers.txt' using 1:3]
“Elbows” suggest 7-14 outliers
“Elbow” Plot for Specifying # Outliers
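A hedged sketch of an elbow plot in this spirit: the slide plots a proprietary 1/C quantity, which is not public, so sorted one-class SVM decision scores are used here as a stand-in ranking of outlier-likeness.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(size=(70, 4)),
               rng.normal(loc=4.0, size=(8, 4))])  # 8 planted outliers

scores = OneClassSVM(kernel="rbf", nu=0.15).fit(X).decision_function(X)

# Sort the per-pattern scores: a sharp "elbow" at the left of the curve
# suggests how many leading points to treat as outliers.
plt.plot(np.sort(scores), marker=".")
plt.xlabel("sorted index number")
plt.ylabel("response")
plt.title("Elbow plot for choosing # outliers")
plt.show()
```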
One-Class SVM Results for MCG Data
[Plot: target and predicted values vs. sorted sequence number; q2 = 0.818, Q2 = 2.042, RMSE = 1.364]
Outlier/Novelty Detection Methods in Analyze: Hypotheses
• One-class SVMs are commonly cited for outlier detection (e.g., Suykens)
 - used publicly available SVM code (LibSVM)
 - Analyze has user-friendly interface operators for using LibSVM
• Proprietary heuristic tuning for C in SVMs
 - the heuristic tuning method is explained in previous publications
 - heuristic tuning is essential to make outlier detection work properly
• “Elbow” curves for indicating # outliers
• Pharmaplots justify/validate detection from different methods
• Pharmaplots extended to PLS, K-PCA, and K-PLS
One-Class SVM: Brief Theory
• Well-known method for outlier and novelty detection in the SVM literature (e.g., see Suykens)
• LibSVM, a publicly available SVM code for general use, has a one-class SVM option built in (see Chih-Chung Chang and Chih-Jen Lin)
• Analyze has operators to interface with LibSVM
• Theory:
 - One-class SVM ignores the response (assumes all zeros for responses)
 - Maximizes spread and subtracts a regularization term
 - Suykens, p. 203, has the following formulation:

$$\max_{w,e}\; J_p(w,e) = \gamma\,\frac{1}{2}\sum_{k=1}^{N} e_k^2 \;-\; \frac{1}{2}\,w^T w \quad \text{such that}\quad e_k = w^T x_k,\; k = 1,\ldots,N$$

 - $\gamma$ is a regularization parameter; Analyze has a proprietary way to determine $\gamma$
• Application:
 - Analyze combines one-class SVMs with pharmaplots to see whether outliers can be explained and make sense
 - Analyze has elbow curves to assist the user in determining # outliers
 - The combination of one-class SVMs with pharmaplots gave excellent results on several industrial (non-pharmaceutical) data sets
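A sketch of the one-class SVM step using scikit-learn's OneClassSVM, which wraps the same LibSVM code the slide cites; Analyze's proprietary regularization tuning is not public, so nu below is an arbitrary placeholder.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(5)
X_train = rng.normal(size=(80, 6))                # the response is ignored

# nu bounds the fraction of training points flagged as outliers; it stands
# in for the proprietary regularization tuning used in Analyze.
oc_svm = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.1)
oc_svm.fit(X_train)

flags = oc_svm.predict(X_train)                   # -1 = outlier, +1 = inlier
outlier_ids = np.where(flags == -1)[0]            # "list of outlier pattern IDs"
```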
NIPALS ALGORITHM FOR PLS (with just one response variable y)
• Start for a PLS component:

$$\hat{w}^1_m = \frac{X^T_{mn}\,y_n}{y^T_n\,y_n}, \qquad w^1_m = \frac{\hat{w}^1_m}{\sqrt{\hat{w}^{1T}_m\,\hat{w}^1_m}}$$

• Calculate the score t:

$$t^1_n = X_{nm}\,w^1_m$$

• Calculate c’:

$$c'^1 = \frac{t^{1T}_n\,y_n}{t^{1T}_n\,t^1_n}$$

• Calculate the loading p:

$$p^1_m = \frac{X^T_{mn}\,t^1_n}{t^{1T}_n\,t^1_n}$$

• Store t in T, store p in P, store w in W
• Deflate the data matrix and the response variable:

$$X_{nm} \leftarrow X_{nm} - t^1_n\,p^{1T}_m, \qquad y_n \leftarrow y_n - t^1_n\,c'^1$$

• Do the above for h latent variables
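A direct NumPy transcription of the NIPALS steps above for a single response y; the variable names mirror the slide's w, t, c', and p.

```python
import numpy as np

def nipals_pls1(X, y, h):
    """NIPALS PLS with one response: returns scores T, loadings P, weights W, c's."""
    X, y = X.astype(float).copy(), y.astype(float).copy()
    T, P, W, C = [], [], [], []
    for _ in range(h):                            # do for h latent variables
        w = X.T @ y / (y @ y)                     # start for a PLS component
        w = w / np.sqrt(w @ w)                    # normalize w
        t = X @ w                                 # score
        c = (t @ y) / (t @ t)                     # c'
        p = X.T @ t / (t @ t)                     # loading
        T.append(t); P.append(p); W.append(w); C.append(c)
        X = X - np.outer(t, p)                    # deflate the data matrix
        y = y - t * c                             # deflate the response
    return np.array(T).T, np.array(P).T, np.array(W).T, np.array(C)
```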
Outlier/Novelty Detection Methods in Analyze
• Outlier detection methods were extensively tested:
 - on a variety of different UCI data sets
 - models sometimes showed significant improvement after removal of outliers
 - models were rarely worse
 - outliers could be validated on pharmaplots and led to enhanced insight
• The pharmaplots confirm the validity of outlier detection with one-class SVM
• Prediction on the test set for the albumin data improves the model
• A non-pharmaceutical (medical) data set actually shows two data points in the training set that probably were given wrong labels (Appendix A)
[Figure: cardiac cycle trace with the P, Q, R, S, and T waves labeled]
Innovations in Analyze for Outlier Detection
• User-friendly procedure with automated processes
• Interface for one-class SVM from LibSVM
• Automated tuning for regularization parameters
• Elbow plots to determine the number of outliers
• Combination of LibSVM outliers with pharmaplots
 - efficient visualization of outliers
 - facilitates interpretation of outliers
• Extended pharmaplots
 - PCA
 - K-PCA
 - PLS
 - K-PLS
• User-friendly and efficient SOM with outlier identification
• Direct-kernel-based outlier detection as an alternative to LibSVM
Principal Component Analysis (PCA)
$$T_{nh} = X_{nm}\,B_{mh}, \qquad X_{nm} \approx T_{nh}\,B^T_{hm}$$

$$\hat{y}_n = T_{nh}\,\hat{b}_h, \qquad \hat{b}_h = \left(T^T_{hn}\,T_{nh}\right)^{-1}T^T_{hn}\,y_n$$
• We introduce a modest set of the h most important principal components, $T_{nh}$
• Replace the data $X_{nm}$ by the most important principal components $T_{nh}$
• The most important T’s are the ones corresponding to the largest eigenvalues of $X^TX$
• The B’s are the eigenvectors of $X^TX$, ordered from largest to smallest eigenvalue
• In practice, calculation of the B’s and T’s proceeds iteratively with the NIPALS algorithm
• NIPALS: nonlinear iterative partial least squares (Herman Wold)
[Diagram: inputs x1, x2, x3 project onto principal components t1, t2, which predict y.]
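Since the slide notes that the B's and T's are computed iteratively with NIPALS, here is a compact sketch of NIPALS-style PCA; it deflates X after each extracted component (initializing from the first column is a simplification that can fail in degenerate cases).

```python
import numpy as np

def nipals_pca(X, h, n_iter=100):
    """Iterative NIPALS extraction of the top-h principal components."""
    X = X - X.mean(axis=0)                        # center the columns
    T, B = [], []
    for _ in range(h):
        t = X[:, 0].copy()                        # initial guess for the score
        for _ in range(n_iter):
            b = X.T @ t / (t @ t)                 # loading direction
            b = b / np.sqrt(b @ b)                # normalize: eigenvector of X^T X
            t = X @ b                             # score
        T.append(t); B.append(b)
        X = X - np.outer(t, b)                    # deflate
    return np.array(T).T, np.array(B).T
```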
Partial Least Squares (PLS)
• Similar to PCA
• PLS: Partial Least Squares / Projection to Latent Structures / Please Listen to Svante
• The t’s are now called scores or latent variables, and the p’s are the loading vectors
• The loading vectors are no longer orthogonal and are influenced by the y vector
• A special version of NIPALS is also used to build up the t’s
[Diagram: inputs x1, x2, x3 project onto latent variables t1, t2, which predict y.]

$$W^* = W\left(P^T\,W\right)^{-1}$$

$$T_{nh} = X_{nm}\,W^*, \qquad X_{nm} \approx T_{nh}\,P^T_{hm}$$

$$\hat{y}_n = T_{nh}\,\hat{b}_h = X_{nm}\,W\left(P^T\,W\right)^{-1}\hat{b}_h, \qquad \hat{b}_h = \left(T^T_{hn}\,T_{nh}\right)^{-1}T^T_{hn}\,y_n$$
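The W* identity above can be checked against a library implementation; a sketch using scikit-learn's PLSRegression, which exposes the slide's W, P, and W* as x_weights_, x_loadings_, and x_rotations_.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(6)
X = rng.normal(size=(50, 6))
y = X @ rng.normal(size=6) + 0.1 * rng.normal(size=50)

pls = PLSRegression(n_components=2, scale=False).fit(X, y)
W, P = pls.x_weights_, pls.x_loadings_            # the slide's W and P

# W* = W (P^T W)^{-1} maps (centered) X directly onto the scores T.
W_star = W @ np.linalg.inv(P.T @ W)
print(np.allclose(W_star, pls.x_rotations_))      # True: sklearn stores W* too
T = (X - X.mean(axis=0)) @ W_star                 # reproduces the PLS scores
```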
Kernel PLS (K-PLS)
[Diagram: inputs x1, x2, x3 project onto latent variables t1, t2, which predict y.]
• Invented by Rosipal and Trejo (Journal of Machine Learning Research, 2001)
• Consider K-PLS as a better and nonlinear PLS
• K-PLS gives almost identical results to SVMs for the QSAR data we tried
• K-PLS is a lot faster than SVMs