SVM-based techniques for biomarker discovery in proteomic pattern data Elena Marchiori Department of Computer Science Vrije Universiteit Amsterdam.

Jan 14, 2016

Page 1

SVM-based techniques for biomarker discovery in proteomic pattern data

Elena Marchiori

Department of Computer Science

Vrije Universiteit Amsterdam

Page 2

Overview

• Variable selection
• SVM-based techniques
• Application to proteomic pattern data
• Results
• Conclusion

Page 3

Variable Selection

Select a small subset of input variables (for example, genes in gene expression data, m/z values in proteomic pattern data) which are used for building a classifier.

Advantages:
• it is cheaper to measure fewer variables
• the resulting classifier is simpler and potentially faster
• prediction accuracy may improve by discarding irrelevant variables
• identifying relevant variables gives more insight into the nature of the corresponding classification problem (biomarker detection)

Page 4

Support Vector Machines

Advantages:
• maximize the margin between two classes in the feature space characterized by a kernel function
• are robust with respect to high input dimension

Disadvantages:
• difficult to incorporate background knowledge
• sensitive to outliers

Page 5

Binary classification

$f(x) = \mathrm{sign}(w^T x + b)$

$w^T x + b = 0$

$w^T x + b < 0$        $w^T x + b > 0$
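As a minimal sketch of the decision rule on this slide, the snippet below evaluates w^T x + b and takes its sign. The weight vector, bias, and sample points are made-up illustration values, and assigning points that lie exactly on the hyperplane to class +1 is an assumption, since the slide does not say how ties are handled.

```python
def decision_value(w, x, b):
    """Return w^T x + b for weight vector w, input x and bias b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, x, b):
    """f(x) = sign(w^T x + b); points on the hyperplane go to +1 (assumption)."""
    return 1 if decision_value(w, x, b) >= 0 else -1


# Illustrative values only, not taken from the slides.
w = [2.0, -1.0]
b = -1.0

print(classify(w, [2.0, 1.0], b))  # w^T x + b = 2, positive side
print(classify(w, [0.0, 1.0], b))  # w^T x + b = -2, negative side
```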

Page 6

Linear Separators

Page 7

SVM: separable classes

ρ

Support vector

margin

Optimal hyper-plane

Support vectors uniquely characterize optimal hyper-plane

Page 8

SVM and outliers

outlier

Page 9

SVM-RFE

Linear binary classifier decision function:

$$f(x_1, \ldots, x_N) = \sum_{i=1}^{N} w_i x_i + b$$

$|w_i|$ = score of variable $x_i$

Recursive Feature Elimination (SVM-RFE), at each iteration:

1) eliminate threshold% of variables with the lowest scores
2) recompute scores of remaining variables

SVM-RFE based algorithms: run SVM-RFE with different thresholds

• JOIN: select variables occurring more than cutoff times
• ENSEMBLE: consider majority vote of resulting classifiers

Page 10

SVM-RFE

I. Guyon et al., Machine Learning, 46, 389-422, 2002

Page 11

SVM-RFE variant

Input: train set, threshold T, number N of variables to be selected

Output: subset of variables of size N

RFE:
• Train: run linear SVM on train set
• Score: generate a sequence of variables ordered wrt the absolute value of their weight
• Eliminate: remove the T% lowest-ranked variables from the ordered sequence
• Repeat (train, score, eliminate) on train set restricted to remaining variables until only N variables are left
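The train / score / eliminate loop above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the author's implementation: `train_linear_svm` is a hypothetical stand-in that runs subgradient descent on the regularised hinge loss (any linear SVM solver could be substituted), the threshold is treated as a fraction (0.5 means 50%), at least one variable is removed per iteration, and the final step is capped so that exactly N variables remain. The toy data is invented for the example.

```python
def train_linear_svm(X, y, epochs=50, lr=0.1, lam=0.01):
    """Stand-in linear SVM (assumption): subgradient descent on the hinge loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # Shrink weights: gradient step on the regulariser (lam / 2) * ||w||^2.
            w = [wj * (1.0 - lr * lam) for wj in w]
            if margin < 1.0:
                # Hinge loss is active: move towards classifying xi correctly.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def svm_rfe(X, y, threshold, n_select, train=train_linear_svm):
    """Return the indices of the n_select variables kept by RFE."""
    remaining = list(range(len(X[0])))
    while len(remaining) > n_select:
        # Train: linear SVM on the train set restricted to remaining variables.
        X_sub = [[row[j] for j in remaining] for row in X]
        w, _ = train(X_sub, y)
        # Score: order variables by decreasing absolute weight.
        order = sorted(range(len(remaining)), key=lambda k: abs(w[k]), reverse=True)
        # Eliminate: drop the lowest-ranked fraction, but never go below n_select.
        n_drop = max(1, int(threshold * len(remaining)))
        n_drop = min(n_drop, len(remaining) - n_select)
        kept = order[:len(order) - n_drop]
        remaining = sorted(remaining[k] for k in kept)
    return remaining


# Toy data (invented): variable 0 separates the classes, the others are small noise.
X_toy = [
    [2.0, 0.1, -0.2, 0.1],
    [2.0, -0.1, 0.2, 0.0],
    [-2.0, 0.1, 0.2, -0.1],
    [-2.0, -0.1, -0.2, 0.0],
]
y_toy = [1, 1, -1, -1]

print(svm_rfe(X_toy, y_toy, threshold=0.5, n_select=1))
```

Passing the trainer as a parameter keeps the elimination logic independent of the solver, so a library SVM can be dropped in without touching the loop.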

Page 12

JOIN and ENSEMBLE SVM-RFE

Page 13

Case Study: proteomic pattern data

• Petricoin et al. papers
• Commercial analysis software (Proteome Quest): http://www.correlogic.com/
• Data sets available at: http://ncifdaproteomics.com/ppatterns.php

Page 14

Data generation: SELDI-TOF MS

Surface-enhanced laser desorption/ionization time-of-flight mass spectrometry

Method for profiling a population of proteins in a sample according to the size and net charge of individual proteins.

The readout is a spectrum of peaks. The position of a protein in the spectrum corresponds to its “time of flight” because the small proteins fly faster than the heavy ones.

1 Serum on protein binding plate
2 Insert plate in vacuum chamber
3 Irradiate plate with laser
4 This “launches” the proteins / peptides
5 Measure “time of flight” (TOF) of ions, related to the molecular weight of proteins

Page 15

Example of proteomic pattern profile from one blood sample

Time of flight

Abundance

• Heavier peptides move slower -> time of flight corresponds to weight
• Weight corresponds to peptides
• Measurement of relative abundance of detected peptides in serum
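To make the link between weight and time of flight concrete, the sketch below uses the textbook relation for an idealised linear TOF tube: an ion of mass m and charge z accelerated through a voltage U gains kinetic energy zeU, so its flight time over a drift length L is t = L * sqrt(m / (2 z e U)). The slides do not give instrument parameters; the voltage and tube length below are placeholder assumptions, and real instruments add delays and calibration terms that are ignored here.

```python
import math

DALTON_KG = 1.66053906660e-27          # kilograms per dalton
ELEMENTARY_CHARGE_C = 1.602176634e-19  # coulombs


def time_of_flight(mass_da, charge=1, voltage=20000.0, drift_length=1.0):
    """Flight time in seconds in an idealised linear TOF tube.

    voltage (volts) and drift_length (metres) are placeholder assumptions.
    """
    mass_kg = mass_da * DALTON_KG
    kinetic_energy = charge * ELEMENTARY_CHARGE_C * voltage  # z * e * U
    velocity = math.sqrt(2.0 * kinetic_energy / mass_kg)
    return drift_length / velocity


# Heavier ions arrive later; flight time grows with the square root of m/z.
for mass in (1000.0, 4000.0, 16000.0):
    print(f"{mass:8.0f} Da -> {time_of_flight(mass) * 1e6:.2f} microseconds")
```

Under this model a fourfold increase in mass doubles the flight time, which is why each position on the time axis can be mapped to an m/z value.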

Page 16

How to use such data?

Diagnostic tool:
• design a classifier for discriminating healthy from disease samples

Biomarkers identification:
• Variable subset selection (VSS): select a subset of input variables (m/z values) that best discriminate the two classes (potential biomarkers)

Page 17

Commercial Tools

• Proteome Quest (Correlogic): GA + clustering, no pre-selection (Petricoin et al., The Lancet 2002)
• Propeak (3Z Informatics): separability analysis + bootstrap
• Biomarker AMplification Filter BAMF (Eclipse Diagnostics): ?

Page 18

Non-commercial Techniques

• Pre-processing + ranking + kNN (Zhu et al., PNAS 2003)
• Pre-selection + boosted decision trees (Qu et al., Clin. Chem. 2002)
• Filter FS + classifier (Liu et al., Genome Informatics 2002)
• GA + SVM, SVM-RFE ensemble (Jong et al., EvoBIO 2004; Jong et al., CIBCB 2004)
• Many others: any ML method for classification/FS (see, e.g., special issue on FS, JMLR 2003)

Page 19

Goal and Methods

Goal: analyze performance of SVM-based techniques for classification and variable selection with proteomic pattern data

• SVM
• SVM-RFE
• Ensemble SVM-RFE: majority vote of SVM-RFE classifiers obtained from SVM-RFE with different cutoff values
• Join SVM-RFE: SVM trained on N variables that have been selected most often by SVM-RFE with different threshold values
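The two combination schemes can be sketched as small helper functions. This is a hedged illustration: `join_select` follows the rule stated on the SVM-RFE slide (keep variables occurring more than cutoff times across the runs), `ensemble_predict` takes a majority vote over +1/-1 labels, and resolving ties as +1 is an assumption, since the slides do not specify tie handling. The variable indices and votes in the example are invented.

```python
from collections import Counter


def join_select(subsets, cutoff):
    """JOIN: variables selected by more than `cutoff` of the SVM-RFE runs."""
    counts = Counter(v for subset in subsets for v in set(subset))
    return sorted(v for v, c in counts.items() if c > cutoff)


def ensemble_predict(votes):
    """ENSEMBLE: majority vote over +1/-1 labels; ties go to +1 (assumption)."""
    return 1 if sum(votes) >= 0 else -1


# Invented example: variable indices selected by three SVM-RFE runs,
# each run using a different elimination threshold.
runs = [[1, 5, 9], [1, 5, 7], [1, 2, 9]]

print(join_select(runs, cutoff=1))    # variables chosen by at least two runs
print(join_select(runs, cutoff=2))    # variables chosen by all three runs
print(ensemble_predict([1, -1, 1]))   # two of three classifiers vote +1
```

In the JOIN scheme a final SVM would then be trained on the joined variables, while ENSEMBLE keeps every per-threshold classifier and combines their predictions at test time.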

Page 20

DataSets

Two proteomic pattern datasets from prostate and ovarian cancer from NCI/CCR and FDA/CBER Clinical Proteomics Program Databank.

Data sets available at: http://ncifdaproteomics.com/ppatterns.php

Dataset            tot #    cancer    healthy            M/z values
Prostate           322      69        253                15154
Ovarian 4/03/02    215      100       115 (15 benign)    15154

Page 21

Experimental Setup

• 10 random partitions of dataset: T (50%), H (25%), V (25%)
• Algorithms:
  • SVM trained on union of T and H
  • SVM-RFE(threshold) with thresholds = 0.2, 0.3, 0.4, 0.5, 0.6, 0.7. Choose threshold giving best classifier sensitivity on H
  • JOIN(cutoff, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7) with cutoffs = 1, 2, 3, 4, 5. Choose cutoff giving best classifier sensitivity on H
• Performance: average (over 10 V's)
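The protocol above can be sketched with the standard library alone. The helper names are hypothetical, the split is a plain random shuffle (the slides do not say whether partitions were stratified by class), and the fixed seed is only there to make the example repeatable. The sample count 322 is the prostate dataset size from the DataSets slide.

```python
import random


def split_thv(n_samples, rng):
    """Random partition of sample indices into T (50%), H (25%), V (rest)."""
    idx = list(range(n_samples))
    rng.shuffle(idx)
    n_t = n_samples // 2
    n_h = n_samples // 4
    return idx[:n_t], idx[n_t:n_t + n_h], idx[n_t + n_h:]


def sensitivity(y_true, y_pred, positive=1):
    """True positives / (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


def choose_best(candidates, score_on_h):
    """Pick the threshold or cutoff whose classifier scores highest on H."""
    return max(candidates, key=score_on_h)


rng = random.Random(0)  # fixed seed: an assumption, for repeatability only
partitions = [split_thv(322, rng) for _ in range(10)]

for t_idx, h_idx, v_idx in partitions[:2]:
    print(len(t_idx), len(h_idx), len(v_idx))
```

For each partition, `choose_best` would be called with the candidate thresholds (or cutoffs) and a function that trains on T and returns sensitivity on H; the chosen model is then evaluated on V and the ten V results are averaged.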

Page 22

Results Prostate Dataset

Page 23

Results Ovarian Dataset

Page 24

Controversy

Noise, bias, results reliability and reproducibility in serum proteomics:

• Sorace, Zhan, BMC Bioinformatics, 2004
• Petricoin, BMC Bioinformatics, 2004
• Baggerly, Journal of the National Cancer Institute, vol. 97, No. 4, 2005
• Liotta, Journal of the National Cancer Institute, vol. 97, No. 4, 2005
• Ransohoff, Journal of the National Cancer Institute, vol. 97, No. 4, 2005

Page 25

Conclusion

• Many machine learning techniques can be used for potential biomarker detection with pattern proteomic data.
• SVM-based techniques are a potentially effective choice because of the high input dimension of such data.
• Computational analysis of pattern proteomic data has to use a correct methodology that considers biases induced by the selection and classification algorithms and by the data splitting.
• Problems related to reliability and reproducibility of data are inherent to the laboratory technology and are currently being addressed by researchers and practitioners.

Page 26

Acknowledgments

• Connie Jimenez (Biology, VUMC)
• Aad van der Vaart (Statistics, VUA)