SISTA seminar Feb 28, 2002 Preoperative Prediction of Malignancy of Ovarian Tumors Using Least Squares Support Vector Machines C. Lu 1, T. Van Gestel 1,

SISTA seminar Feb 28, 2002

Preoperative Prediction of Malignancy of Ovarian Tumors Using Least Squares Support Vector Machines

C. Lu¹, T. Van Gestel¹, J. A. K. Suykens¹, S. Van Huffel¹, D. Timmerman², I. Vergote²

¹ Department of Electrical Engineering, Katholieke Universiteit Leuven, Leuven, Belgium

² Department of Obstetrics and Gynecology, University Hospitals Leuven, Leuven, Belgium


Overview

- Introduction
- Data Exploration
- LS-SVM and Bayesian evidence framework
  - LS-SVM classifier
  - Bayesian evidence framework
  - Input Selection
  - Sparse Approximation
- Model Building and Model Evaluation
- Conclusions


Introduction - Problem

- Ovarian masses: a common problem in gynecology (1/70 women).
- Ovarian cancer: high mortality rate.
- Early detection of ovarian cancer is difficult.
- Treatment and management of different types of ovarian tumors differ greatly.

- Develop a reliable diagnostic tool to preoperatively discriminate between benign and malignant tumors.
- Assist clinicians in choosing the appropriate treatment.

Techniques for preoperative evaluation:
- Serum tumor marker: CA125 blood test
- Transvaginal ultrasonography
- Color Doppler imaging and blood flow indexing


Introduction - Attempts to automate the diagnosis

- Risk of Malignancy Index (RMI) (Jacobs et al.): $\mathrm{RMI} = \mathrm{score}_{morph} \times \mathrm{score}_{meno} \times \mathrm{CA125}$
- Mathematical models
  - Logistic regression
  - Artificial neural networks
  - Support vector machines
  - Bayesian belief network
  - Hybrid methods
- Least Squares SVM
- Bayesian framework


Introduction - Data

Patient data collected at Univ. Hospitals Leuven, Belgium, 1994-1999

425 records, 25 features. 291 benign tumors, 134 (32%) malignant tumors


Introduction - Development Process

- Exploratory data analysis: data preprocessing, univariate analysis, PCA, factor analysis…
- Input selection
- Model training
- Model evaluation

Performance measures: Receiver operating characteristic (ROC) analysis

Goal: high sensitivity for malignancy at a low false positive rate, while providing a probability of malignancy for each individual patient.

ROC curves
- constructed by plotting the sensitivity versus 1 - specificity (the false positive rate) for varying probability cutoff levels.
- visualize the relationship between the sensitivity and the specificity of a test.

Area under the ROC curve (AUC)
- measures the probability that the classifier gives a higher score to a randomly chosen event (malignant) than to a randomly chosen nonevent (benign).
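As an illustration of the ROC analysis described above, the following minimal sketch computes the ROC points and the AUC by the trapezoidal rule in plain Python. The function names `roc_points` and `auc` and the toy scores are assumptions made for this example, not part of the original study.

```python
def roc_points(scores, labels):
    """Return (false positive rate, sensitivity) pairs, one per cutoff.

    scores: predicted probability of malignancy.
    labels: 1 = malignant (event), 0 = benign (nonevent).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    # Lower the cutoff step by step; a case is called malignant if score >= cutoff.
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy data (hypothetical): a higher score means a more suspicious mass.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(auc(roc_points(scores, labels)))
```

With these toy values, 8 of the 9 malignant/benign pairs are ranked correctly, which is what the trapezoidal area reproduces.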


Data exploration - Univariate analysis

- preprocessing: e.g. CA_125 -> log, color_score {1,2,3,4} -> 3 design variables {0,1}
- descriptive statistics, histograms…

Variable (symbol)                  Benign         Malignant
Demographic
  Age (age)                        45.6 ± 15.2    56.9 ± 14.6
  Postmenopausal (meno)            31.0 %         66.0 %
Serum marker
  CA 125 (log) (l_ca125)           3.0 ± 1.2      5.2 ± 1.5
CDI
  High blood flow (colsc3,4)       19.0 %         77.3 %
Morphologic
  Abdominal fluid (asc)            32.7 %         67.3 %
  Bilateral mass (bilat)           13.3 %         39.0 %
  Unilocular cyst (un)             45.8 %         5.0 %
  Multiloc/solid cyst (mulsol)     10.7 %         36.2 %
  Solid (sol)                      8.3 %          37.6 %
  Smooth wall (smooth)             56.8 %         5.7 %
  Irregular wall (irreg)           33.8 %         73.2 %
  Papillations (pap)               12.5 %         53.2 %

Demographic, serum marker, color Doppler imaging and morphologic variables
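The preprocessing mentioned on this slide (log transform of CA 125 and three 0/1 design variables for the four-level color score) could be sketched as follows. The function name `preprocess`, the use of the natural logarithm, the choice of color score 1 as reference level and the name `colsc2` are assumptions; only `l_ca125`, `colsc3` and `colsc4` appear in the slides.

```python
import math


def preprocess(ca125, color_score):
    """Return (l_ca125, colsc2, colsc3, colsc4) for one patient.

    Assumption: color score 1 is the reference level and maps to all zeros.
    """
    if ca125 <= 0:
        raise ValueError("CA 125 must be positive to take its logarithm")
    if color_score not in (1, 2, 3, 4):
        raise ValueError("color score must be 1, 2, 3 or 4")
    l_ca125 = math.log(ca125)  # natural logarithm (assumed)
    design = tuple(1 if color_score == level else 0 for level in (2, 3, 4))
    return (l_ca125,) + design


print(preprocess(35.0, 3))
```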


Data exploration - Multivariate analysis

- factor analysis
- biplots

Fig. Biplot of Ovarian Tumor data. The observations are plotted as points (o - benign, x - malignant), the variables are plotted as vectors from the origin.

- visualization of the correlation between the variables
- visualization of the relations between the variables and clusters.


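LS-SVM & Bayesian Framework - LS-SVM

Kernel based method: maps the n-dimensional input vector into a higher dimensional feature space where a linear algorithm can be applied.

The learning problem:

Feature space:
$$f(\mathbf{x}) = \sum_{i=1}^{N_F} w_i \varphi_i(\mathbf{x}) + b$$

Mercer's theorem:
$$K(\mathbf{x}, \mathbf{z}) = \langle \varphi(\mathbf{x}), \varphi(\mathbf{z}) \rangle$$

Dual space:
$$f(\mathbf{x}) = \sum_{i=1}^{N} \alpha_i y_i K(\mathbf{x}, \mathbf{x}_i) + b$$

Attractive features: good generalization performance, existence of a unique solution, foundation in statistical learning theory.

Positive definite kernel K(.,.):

RBF kernel:
$$K(\mathbf{x}, \mathbf{z}) = \exp\{-\|\mathbf{x} - \mathbf{z}\|^2 / \sigma^2\}$$

Linear kernel:
$$K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^T \mathbf{z}$$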


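LS-SVM - LS-SVM classifier (Suykens & Vandewalle, 1999)

Given $\{(\mathbf{x}_i, y_i)\}_{i=1,\ldots,N}$, with input data $\mathbf{x}_i \in \mathbb{R}^p$ and the corresponding output data $y_i \in \{-1, 1\}$. The following model is taken:

$$f(\mathbf{x}) = \mathbf{w}^T \varphi(\mathbf{x}) + b$$

where the input data $\mathbf{x} \to \varphi(\mathbf{x})$ are projected to a higher dimensional feature space. One considers the following optimization problem:

$$\min_{\mathbf{w}, b, e} J(\mathbf{w}, e) = \frac{1}{2} \mathbf{w}^T \mathbf{w} + \gamma \frac{1}{2} \sum_{i=1}^{N} e_i^2$$

subject to

$$y_i [\mathbf{w}^T \varphi(\mathbf{x}_i) + b] = 1 - e_i, \quad i = 1, \ldots, N.$$

The Lagrangian is defined as

$$L(\mathbf{w}, b, e; \alpha) = J(\mathbf{w}, e) - \sum_{i=1}^{N} \alpha_i \{ y_i [\mathbf{w}^T \varphi(\mathbf{x}_i) + b] - 1 + e_i \}$$

where $\alpha_i$ are Lagrange multipliers.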


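LS-SVM - LS-SVM classifier (ctd.)

Taking the Kuhn-Tucker conditions for optimality yields a set of linear equations. Eliminating $\mathbf{w}$ and $e$, the solution is obtained from:

$$\begin{bmatrix} 0 & Y^T \\ Y & \Omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ 1_v \end{bmatrix}$$

with $Y = [y_1; \ldots; y_N]$, $1_v = [1; \ldots; 1]$, $\alpha = [\alpha_1; \ldots; \alpha_N]$, and $\Omega_{ij} = y_i y_j \varphi(\mathbf{x}_i)^T \varphi(\mathbf{x}_j) = y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$ for $i, j = 1, \ldots, N$.

The resulting LS-SVM model for classification is

$$f(\mathbf{x}) = \mathrm{sign}\Big[ \sum_{i=1}^{N} \alpha_i y_i K(\mathbf{x}, \mathbf{x}_i) + b \Big]$$

Some parameters need to be tuned:
- Regularization parameter $\gamma$, which determines the tradeoff between minimizing the training errors and minimizing the model complexity.
- Kernel parameters, e.g. $\sigma$ for an RBF kernel.

Popular ways of choosing hyperparameters: cross-validation, or using an upper bound on the generalization error. Our approach: Bayesian method.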
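To make the dual-space solution concrete, here is a minimal NumPy sketch that builds the LS-SVM linear system for an RBF kernel, solves it for b and the support values, and classifies new points. The function names, the toy data and the fixed values of `gamma` and `sigma` are assumptions for illustration; no hyperparameter tuning is done here.

```python
import numpy as np


def rbf_kernel(A, B, sigma):
    """K(x, z) = exp(-||x - z||^2 / sigma^2) for all rows of A and B."""
    sq_dist = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    return np.exp(-np.maximum(sq_dist, 0.0) / sigma ** 2)


def lssvm_train(X, y, gamma, sigma):
    """Solve [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1]."""
    n = X.shape[0]
    omega = np.outer(y, y) * rbf_kernel(X, X, sigma)
    system = np.zeros((n + 1, n + 1))
    system[0, 1:] = y
    system[1:, 0] = y
    system[1:, 1:] = omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    solution = np.linalg.solve(system, rhs)
    return solution[0], solution[1:]  # b, alpha


def lssvm_predict(X_new, X, y, alpha, b, sigma):
    """Return sign(sum_i alpha_i y_i K(x, x_i) + b) for each row of X_new."""
    latent = rbf_kernel(X_new, X, sigma) @ (alpha * y) + b
    return np.where(latent >= 0, 1, -1)


# Toy problem (hypothetical data): two well separated clusters.
X = np.array([[0.0, 0.0], [0.2, 0.1], [2.0, 2.0], [2.1, 1.9]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
b, alpha = lssvm_train(X, y, gamma=10.0, sigma=1.0)
X_new = np.array([[0.1, 0.0], [1.9, 2.0]])
print(lssvm_predict(X_new, X, y, alpha, b, sigma=1.0))
```

Unlike a standard SVM, training needs only one linear solve, at the cost of every training point receiving a nonzero support value.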


Bayesian Evidence Framework (MacKay 1993)

Probability theory and Occam's razor
- Bayesian probability theory provides a unifying framework for data modeling.
- Occam's razor is needed for model comparison.

Each model $H_i$ is assumed to have:
- a vector of parameters $\mathbf{w}$;
- a prior distribution $P(\mathbf{w} \mid H_i)$;
- a set of probability distributions, one for each value of $\mathbf{w}$, defining the predictions $P(D \mid \mathbf{w}, H_i)$ that the model makes about the data.


Bayesian Evidence Framework - Probability theory and Occam's razor

Models $H_i$ are ranked by evaluating the evidence.

(1) Model fitting
Evaluate the most probable parameter values $\mathbf{w}_{MP}$, and summarize the posterior distribution by $\mathbf{w}_{MP}$ and error bars. Evaluating the Hessian $A$ at $\mathbf{w}_{MP}$, the posterior can be locally approximated as a Gaussian with covariance matrix $A^{-1}$.

(2) Model comparison
Assuming equal priors $P(H_i)$ for the alternative models, the models are ranked by the evidence $P(D \mid H_i)$.

Evaluating the evidence: if the posterior is well approximated by a Gaussian, the evidence can be evaluated in closed form around $\mathbf{w}_{MP}$.
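The formulas of this slide did not survive extraction. In the standard notation of MacKay's evidence framework they would read as follows; here $k$ (the number of parameters in $\mathbf{w}$) is a symbol introduced for this sketch, and the exact layout of the original slide is not known.

```latex
% (1) Model fitting: posterior of the parameters of model H_i
P(\mathbf{w} \mid D, H_i) = \frac{P(D \mid \mathbf{w}, H_i)\, P(\mathbf{w} \mid H_i)}{P(D \mid H_i)}

% Hessian evaluated at the most probable parameters
A = -\nabla\nabla \log P(\mathbf{w} \mid D, H_i) \big|_{\mathbf{w}_{MP}}

% (2) Model comparison: posterior of the models
P(H_i \mid D) \propto P(D \mid H_i)\, P(H_i)

% Evidence and its Gaussian approximation
P(D \mid H_i) = \int P(D \mid \mathbf{w}, H_i)\, P(\mathbf{w} \mid H_i)\, d\mathbf{w}
  \approx P(D \mid \mathbf{w}_{MP}, H_i)\, P(\mathbf{w}_{MP} \mid H_i)\, (2\pi)^{k/2} \det{}^{-1/2}(A)
```

The last factor, the prior density times the posterior volume, is the Occam factor that penalizes overly flexible models.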


Bayesian Evidence Framework for LS-SVM

A Bayesian framework for LS-SVM classifiers (Van Gestel and Suykens, 2001)
- Starting from the feature space formulation, analytic expressions are obtained in the dual space on the three levels of Bayesian inference.
- Posterior class probabilities are obtained by marginalizing over the model parameters.

For a classification problem with binary targets $y_i = \pm 1$, the LS-SVM cost function can also be formulated as

$$\min_{\mathbf{w}, b, e} J(\mathbf{w}, e) = \mu E_W + \zeta E_D$$

subject to

$$y_i [\mathbf{w}^T \varphi(\mathbf{x}_i) + b] = 1 - e_i, \quad i = 1, \ldots, N,$$

with regularization term $E_W = \frac{1}{2} \mathbf{w}^T \mathbf{w}$ and sum of squares error $E_D = \frac{1}{2} \sum_{i=1}^{N} e_i^2$, while the amount of regularization is determined by $\gamma = \zeta / \mu$.



Probability interpretation of LS-SVM classifier (Level 1)

Applying Bayes' rule, the first level of inference is obtained.

Assume: data points are independent, and the target has Gaussian noise $e_i$; the noise level is defined as $\sigma^2 = 1/\zeta$.

Assume: separate Gaussian priors for $\mathbf{w}$ and $b$, with $\sigma_w^2 = 1/\mu$ and $\sigma_b \to \infty$ (uniform distribution).

$\mathbf{w}_{MP}$ and $b_{MP}$ are obtained by solving a standard LS-SVM in the dual space. The posterior probability of the model parameters $\mathbf{w}$ and $b$ is given by Bayes' rule applied to this likelihood and prior.
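The equations of this slide were lost in extraction. Under the two assumptions listed above, the first level of inference can be written as below; the conditioning on $\log\mu$ and $\log\zeta$ and the symbol $H$ for the model follow the usual notation of the evidence framework and are assumptions here, not a transcription of the slide.

```latex
% First level of inference: Bayes' rule for the model parameters
p(\mathbf{w}, b \mid D, \log\mu, \log\zeta, H)
  = \frac{p(D \mid \mathbf{w}, b, \log\mu, \log\zeta, H)\; p(\mathbf{w}, b \mid \log\mu, \log\zeta, H)}
         {p(D \mid \log\mu, \log\zeta, H)}

% Gaussian noise (variance 1/\zeta) and Gaussian prior on w (variance 1/\mu),
% flat prior on b, give a posterior whose negative logarithm is the LS-SVM cost
p(\mathbf{w}, b \mid D, \log\mu, \log\zeta, H) \propto \exp(-\mu E_W - \zeta E_D),
\qquad E_W = \tfrac{1}{2} \mathbf{w}^T \mathbf{w}, \quad E_D = \tfrac{1}{2} \sum_{i=1}^{N} e_i^2
```

Maximizing this posterior is therefore the same as minimizing the LS-SVM cost function, which is why $\mathbf{w}_{MP}$ and $b_{MP}$ come from a standard LS-SVM solve.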



Posterior class probability for LS-SVM classifier (Level 1)

- The class probability is obtained from the conditional probabilities of the two classes.
- Marginalizing over $\mathbf{w}$ yields a Gaussian distributed $e_\pm$ with mean $m_{e\pm}$ and variance $\sigma_{e\pm}^2$; both are calculated in the dual space.
- The conditional probability can incorporate the prior class probability or the misclassification cost. In our experiments, the priors are $P(y=+1) = 2/3$ and $P(y=-1) = 1/3$.
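The way the class priors enter the posterior class probability can be sketched with Bayes' rule. The sketch assumes that each class-conditional density is the Gaussian of $e_\pm$ evaluated at zero, which is how the Bayesian LS-SVM literature usually states it; the slide's own formula is not recoverable, and the function names and the numeric inputs are hypothetical.

```python
import math


def gaussian_density(value, mean, variance):
    """Density of a normal distribution N(mean, variance) at `value`."""
    return math.exp(-((value - mean) ** 2) / (2.0 * variance)) / math.sqrt(
        2.0 * math.pi * variance
    )


def posterior_class_probability(m_pos, var_pos, m_neg, var_neg, prior_pos=2.0 / 3.0):
    """P(y = +1 | x) from the two class-conditional Gaussians via Bayes' rule.

    m_pos, var_pos: mean and variance of e_+ for the new input x.
    m_neg, var_neg: the same for e_-.
    Assumption: each density is evaluated at e = 0.
    """
    prior_neg = 1.0 - prior_pos
    like_pos = gaussian_density(0.0, m_pos, var_pos)
    like_neg = gaussian_density(0.0, m_neg, var_neg)
    evidence = prior_pos * like_pos + prior_neg * like_neg
    return prior_pos * like_pos / evidence


# Hypothetical moments for one new patient.
print(posterior_class_probability(m_pos=0.2, var_pos=1.0, m_neg=1.5, var_neg=1.0))
```

When both class-conditional densities are equal, the function returns the prior itself, so the 2/3 versus 1/3 priors shift every borderline case toward the +1 class.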



Inference of Hyperparameters (Level 2)

Applying Bayes' rule, the second level of inference is obtained.

Assume: uniform distribution in $\log\mu$ and $\log\zeta$.

The evidence of level 1 acts as the likelihood on this level.

A practical way to find $\mu_{MP}$ and $\zeta_{MP}$ is to first solve a scalar minimization problem in $\gamma = \zeta/\mu$.

The number of effective parameters $\gamma_{eff}$ is obtained from the eigenvalues of an eigenvalue problem in the dual space.
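The eigenvalue-based count of effective parameters can be sketched as below. The formula used, one plus the sum of $\gamma\lambda_i / (1 + \gamma\lambda_i)$ over the nonzero eigenvalues of the centered kernel matrix, is the form commonly given for Bayesian LS-SVMs and is an assumption here, since the slide's equations were lost; the function name is hypothetical.

```python
import numpy as np


def effective_parameters(K, gamma):
    """Effective number of parameters from the centered kernel matrix.

    Assumed form: gamma_eff = 1 + sum_i gamma * lam_i / (1 + gamma * lam_i),
    where lam_i are the nonzero eigenvalues of M K M and M = I - (1/N) 1 1^T.
    """
    n = K.shape[0]
    M = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    eigenvalues = np.linalg.eigvalsh(M @ K @ M)
    eigenvalues = eigenvalues[eigenvalues > 1e-10]  # drop numerically zero ones
    return 1.0 + float(np.sum(gamma * eigenvalues / (1.0 + gamma * eigenvalues)))


# Toy kernel matrix (hypothetical): identity, i.e. four mutually orthogonal points.
print(effective_parameters(np.eye(4), gamma=1.0))
```

A small $\gamma$ (strong regularization) pushes the count toward 1, the bias term alone, while a large $\gamma$ lets it approach the number of data points.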



Bayesian model comparison (Level 3)

Applying Bayes' rule, the third level of inference is obtained.

Assume: uniform prior distribution over the models $H_i$.

Models are ranked by the evidence $P(D \mid H_i)$.


Bayesian Evidence Framework for LS-SVM - design

Preprocess the data
- Normalize the training data to zero mean and unit variance. The test set is normalized with the same mean and variance as the training set.
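The normalization step can be sketched in NumPy as follows: the statistics are estimated on the training set only and then reused for the test set. The function names `fit_scaler` and `apply_scaler` and the toy matrices are assumptions for illustration.

```python
import numpy as np


def fit_scaler(X_train):
    """Return the per-column mean and standard deviation of the training data."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0.0] = 1.0  # leave constant columns unscaled
    return mean, std


def apply_scaler(X, mean, std):
    """Normalize any data set with statistics estimated on the training set."""
    return (X - mean) / std


# Toy data (hypothetical values).
X_train = np.array([[1.0, 10.0], [3.0, 30.0]])
X_test = np.array([[2.0, 20.0], [4.0, 40.0]])
mean, std = fit_scaler(X_train)
print(apply_scaler(X_train, mean, std))
print(apply_scaler(X_test, mean, std))
```

Reusing the training statistics keeps information about the test set out of the model building.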

Hyperparameter tuning
- Select the model $H_i$ by choosing a kernel type $K_i$ and a kernel parameter, e.g. $\sigma$ in RBF kernels.
- The optimal regularization parameter for model $H_i$ is then estimated on the second level of inference. The corresponding $\mu_{MP}$, $\zeta_{MP}$ and the number of effective parameters $\gamma_{eff}$ can also be estimated.
- Compute the model evidence $P(D \mid H_i)$ at the third level of inference.
- For a kernel $K_i$ with tuning parameters, refine the tuning parameters (e.g. $\sigma$) such that a higher model evidence $P(D \mid H_i)$ is obtained.


Bayesian Evidence Framework for LS-SVM - design

Input selection under the Bayesian evidence framework
- Given a certain type of kernel, perform a forward selection (greedy search).
- Starting from zero variables, the variable that gives the greatest increase in the current model evidence is chosen at each iteration step.
- The selection stops when adding any of the remaining variables can no longer increase the model evidence.

10 variables were selected based on the training set (data from the first 265 treated patients), using an RBF kernel:

l_ca125, pap, sol, colsc3, bilat, meno, asc, shadows, colsc4, irreg
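The greedy forward selection described above can be sketched generically. The `evidence` argument stands in for the model evidence of an LS-SVM trained on a given variable subset; the toy scoring function and its gain values are purely hypothetical and only serve to make the example runnable.

```python
def forward_selection(candidates, evidence):
    """Greedy forward selection driven by a model-evidence score.

    candidates: iterable of variable names.
    evidence: callable taking a list of selected variables and returning
        a score to maximize (a stand-in for the log model evidence).
    """
    selected = []
    remaining = list(candidates)
    best_score = evidence(selected)
    while remaining:
        # Score every one-variable extension of the current subset.
        scored = [(evidence(selected + [v]), v) for v in remaining]
        score, variable = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break  # no remaining variable increases the evidence
        selected.append(variable)
        remaining.remove(variable)
        best_score = score
    return selected


# Toy score (hypothetical): each variable has a gain, each extra input costs 1.
gains = {"l_ca125": 5.0, "pap": 3.0, "sol": 2.0, "age": 0.5}


def toy_evidence(subset):
    return sum(gains[v] for v in subset) - 1.0 * len(subset)


print(forward_selection(gains, toy_evidence))
```

In the toy run, `age` is left out because its gain does not outweigh the cost of one more input, mirroring the stopping rule on the slide.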


Bayesian Evidence Framework for LS-SVM - design

Sparse approximation
- Due to the choice of the 2-norm in the cost function, LS-SVM loses the sparseness of standard SVMs.
- Sparseness can be imposed on LS-SVM by a pruning procedure based upon the support values $\alpha_i = \gamma e_i$.
- We propose to prune the data points which have negative support values. Intuitively, pruning the easy examples focuses the model on the harder cases which lie around the decision boundary.
- Iteratively prune the data points with negative $\alpha_i$; the hyperparameters are retuned several times on the reduced data set using the Bayesian evidence framework.
- Stop when no more support values are negative.
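The pruning loop can be sketched as follows. To stay short, the sketch uses a linear kernel and keeps $\gamma$ fixed, whereas the slides retune the hyperparameters with the evidence framework after pruning; the function names, the toy data and the guard that keeps both classes present are assumptions of this example.

```python
import numpy as np


def lssvm_alpha(X, y, gamma):
    """Support values of a linear-kernel LS-SVM (hyperparameters kept fixed)."""
    n = X.shape[0]
    omega = np.outer(y, y) * (X @ X.T)
    system = np.zeros((n + 1, n + 1))
    system[0, 1:] = y
    system[1:, 0] = y
    system[1:, 1:] = omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    return np.linalg.solve(system, rhs)[1:]


def prune_negative_support(X, y, gamma):
    """Iteratively drop points with negative alpha_i.

    Returns the indices of the kept points and their support values.
    Stops early if pruning would leave fewer than two classes.
    """
    keep = np.arange(X.shape[0])
    while True:
        alpha = lssvm_alpha(X[keep], y[keep], gamma)
        negative = alpha < 0
        if not negative.any() or len(np.unique(y[keep][~negative])) < 2:
            return keep, alpha
        keep = keep[~negative]


# Toy 1-D problem (hypothetical): the outer points are "easy" examples.
X = np.array([[-3.0], [-1.0], [1.0], [3.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
keep, alpha = prune_negative_support(X, y, gamma=10.0)
print(keep, alpha)
```

The two outer points lie well beyond the target margin, get negative support values and are removed, leaving only the points nearest the decision boundary.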


Model Evaluation - Temporal Validation

Training set: data from the first 265 treated patients

Test set: data from the latest 160 treated patients

Fig. ROC curves on the training set and on the test set for LSSVMrbf, LSSVMlin, LR and RMI.

Performance on test set

Model type      AUC      Accuracy (%)   Sensitivity (%)   Specificity (%)
RMI             0.8733   78.13          74.07             80.19
                         76.88          81.48             74.53
LR1             0.9111   80.63          75.96             83.02
                         80.63          77.78             82.08
LS-SVM1 (LIN)   0.9141   81.25          77.78             83.02
                         81.88          83.33             81.13
LS-SVM1 (RBF)   0.9184   83.13          81.48             83.96
                         84.38          85.19             83.96

* Two rows per model: probability cutoff values 0.4 and 0.3
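How accuracy, sensitivity and specificity depend on the probability cutoff can be illustrated with a small sketch. The function name `metrics_at_cutoff` and the toy probabilities are hypothetical; they are not the study's data.

```python
def metrics_at_cutoff(probabilities, labels, cutoff):
    """Accuracy, sensitivity and specificity (in %) at a probability cutoff.

    labels: 1 = malignant, 0 = benign. A case is predicted malignant when
    its probability of malignancy is at least `cutoff`.
    """
    tp = fp = tn = fn = 0
    for p, y in zip(probabilities, labels):
        predicted = 1 if p >= cutoff else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    accuracy = 100.0 * (tp + tn) / (tp + fp + tn + fn)
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    return accuracy, sensitivity, specificity


# Toy predictions (hypothetical).
probabilities = [0.9, 0.45, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
for cutoff in (0.4, 0.3):
    print(cutoff, metrics_at_cutoff(probabilities, labels, cutoff))
```

Lowering the cutoff can only raise the sensitivity and lower the specificity (or leave them unchanged), which matches the pattern in most rows of the table.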


Model Evaluation - Randomized Cross-validation

- Randomly separate the data into a training set (n=265) and a test set (n=160).
- Stratified: #benign : #malignant ≈ 2:1 in each training and test set.
- Repeat 30 times.

Averaged performance over 30 validation runs

Model type      AUC (SD)          Accuracy (%)   Sensitivity (%)   Specificity (%)
RMI             0.8882 (0.0318)   82.6           81.73             83.06
                                  81.1           83.87             79.85
LR1             0.9397 (0.0238)   83.3           89.33             80.55
                                  81.9           91.6              77.55
LS-SVM1 (LIN)   0.9405 (0.0236)   84.3           87.4              82.91
                                  82.8           90.47             79.27
LS-SVM1 (RBF)   0.9424 (0.0232)   84.9           86.53             84.09
                                  83.5           90                80.58

* Two rows per model: probability cutoff values 0.5 and 0.4

Fig. Expected ROC curve on validation.
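The repeated stratified splitting used in this validation scheme could be sketched with the standard library alone. The class sizes come from the data description (291 benign, 134 malignant); the function name, the fixed random seed and the rounding rule are assumptions of this sketch.

```python
import random


def stratified_split(labels, n_train, rng):
    """Split indices into train/test while preserving the class proportions.

    Per-class training counts are rounded, so in general the training set
    size may differ from n_train by a case or two.
    """
    train, test = [], []
    n_total = len(labels)
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        n_cls_train = round(len(members) * n_train / n_total)
        train.extend(members[:n_cls_train])
        test.extend(members[n_cls_train:])
    return train, test


# Class sizes as reported for the data set: 291 benign (0), 134 malignant (1).
labels = [0] * 291 + [1] * 134
rng = random.Random(0)  # fixed seed (assumed) for reproducibility
splits = [stratified_split(labels, 265, rng) for _ in range(30)]
print(len(splits[0][0]), len(splits[0][1]))
```

Each of the 30 splits would then be used to train the models and evaluate them on the held-out cases before averaging the results.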


Conclusions

Summary
- Exploratory data analysis helps to understand the data set.
- Under the Bayesian evidence framework, the model regularization and kernel parameters of the LS-SVM classifier can be chosen in a unified way, without the need to set aside an additional validation set.
- A forward input selection procedure that tries to maximize the model evidence was able to identify a subset of important variables for model building.
- A sparse approximation can further improve the generalization performance of the LS-SVM classifiers.
- LS-SVMs have the potential to give a reliable preoperative prediction of malignancy of ovarian tumors.

Future work
- A larger scale validation is still needed.
- A hybrid methodology, e.g. combining a Bayesian network with the learning of LS-SVM, might be more promising.
