Proteomic Mass Spectrometry. Outline: Previous Research, Project Goals, Data and Algorithms, Experimental Results, Conclusions, To Do List.

Dec 20, 2015



  • Slide 1
  • Proteomic Mass Spectrometry
  • Slide 2
  • Outline: Previous Research; Project Goals; Data and Algorithms; Experimental Results; Conclusions; To Do List
  • Slide 3
  • Motivation: MS spectra have high dimensionality, and most ML algorithms cannot handle such high-dimensional data. Dimensionality Reduction (DR): preserve as much information as possible while reducing the dimensionality. Feature Extraction (FE): remove irrelevant and/or redundant features (information).
  • Slide 4
  • Previous research: Usually applies DR, then FE. Does the order matter? DR: down sampling, PCA, wavelets. FE: T-test, Random Forests, manual peak extraction. [conrads03] shows that high-resolution MS spectra produce better classification accuracy, yet most previous research down samples the spectra. CONJECTURE: Down sampling is detrimental to performance.
  • Slide 5
  • Project Goals: Test the down sampling conjecture. Compare FE algorithms (NOTE: optimal FE is NP-hard!). Use a simple but fast classifier to test a number of FE approaches. Test across different data sets. Are there any clearly superior FE algorithms?
  • Slide 6
  • Three Data Sets: Heart/Kidney (100/100): 164,168 features, 2 classes. Ovarian Cancer (91/162): 15,154 features, 2 classes. Prostate Cancer (63/190/26/43): 15,154 features, 4 classes (Normal, Benign, Stage 1 Cancer, Stage 2 Cancer), transformed into Normal/Benign vs. Cancer (Stages 1 and 2).
  • Slide 7
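  • Algorithms: Centroid Classifier. Given class means P and Q and a sample point s, assign s to the class whose mean is nearer under the distance d: C = argmin (d(P,s), d(Q,s)). [Diagram: class means P and Q, sample point s, and the distances d(P,s) and d(Q,s).]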
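The centroid rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the slide leaves the distance d open, so Euclidean distance is assumed here, and the helper names (`class_mean`, `euclidean`, `centroid_classify`) and the toy data are made up for the example.

```python
import math

def class_mean(samples):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(samples)
    return [sum(col) / n for col in zip(*samples)]

def euclidean(a, b):
    """Euclidean distance (an assumed choice for d; the slide does not fix it)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid_classify(s, centroids, d=euclidean):
    """Return the label whose class mean is nearest to sample s."""
    return min(centroids, key=lambda label: d(centroids[label], s))

# Toy two-class training data (hypothetical values).
train = {
    "P": [[1.0, 2.0], [2.0, 1.0], [1.5, 1.5]],
    "Q": [[8.0, 9.0], [9.0, 8.0], [8.5, 8.5]],
}
centroids = {label: class_mean(rows) for label, rows in train.items()}
print(centroid_classify([2.0, 2.0], centroids))  # the sample lies nearer to the mean of P
```

Passing a different function as `d` swaps the metric without touching the classifier, which is convenient when several similarity measures are being compared.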
  • Slide 8
  • Algorithms: T-test: do the means of two distributions differ? KS-test: do the CDFs differ? Composite: (T-test)*(KS-test). IFE: Individual Feature Evaluation using the centroid classifier. DPCA: discriminative principal component analysis.
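As a rough sketch of how the three statistical scores could be computed for a single feature, the code below implements a Welch t statistic and a two-sample Kolmogorov-Smirnov statistic by hand. Several points are assumptions: the slide does not say which t-test variant was used, nor whether the composite multiplies test statistics or p-values; here the composite is taken to be the product of the two statistics. The function names are hypothetical.

```python
import math

def t_statistic(a, b):
    """Absolute Welch t statistic for one feature (assumes nonzero variance)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    return abs(ma - mb) / math.sqrt(va / len(a) + vb / len(b))

def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    def ecdf(sample, x):
        return sum(1 for v in sample if v <= x) / len(sample)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in set(a) | set(b))

def composite(a, b):
    """Assumed reading of (T-test)*(KS-test): product of the two statistics."""
    return t_statistic(a, b) * ks_statistic(a, b)

# Intensities of one feature in two classes (hypothetical values).
class_a = [1.0, 2.0, 3.0, 4.0]
class_b = [5.0, 6.0, 7.0, 8.0]
print(t_statistic(class_a, class_b))
print(ks_statistic(class_a, class_b))
print(composite(class_a, class_b))
```

A larger score means the feature separates the two classes better, so sorting features by score in descending order gives a relevance ranking.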
  • Slide 9
  • Preliminary Experiments: Compare normalization approaches. Compare similarity metrics: cross correlation, (-L1), angular. Across 3 data sets => 27 configurations.
  • Slide 10
  • Preliminary Experiments (cont): No single norm/metric is clearly superior on all data sets. Choosing a suitable normalization and similarity metric gives a 2-5% increase in performance (up to 10% in some cases). L1-norm with the angle similarity metric worked well on the Heart/Kidney and Ovarian Cancer sets (easy sets). L1-norm with the L1 metric was best on the Prostate 2-class problem (hard set).
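The slides name the normalization and similarity metrics without defining them, so the following is only one plausible reading: L1 normalization divides a spectrum by the sum of its absolute intensities, (-L1) is the negated L1 distance, the angular metric is the cosine of the angle between two spectra, and cross correlation is taken to be the zero-lag Pearson correlation. All function names are hypothetical.

```python
import math

def l1_normalize(v):
    """Scale a spectrum so its absolute intensities sum to 1 (assumes a nonzero spectrum)."""
    total = sum(abs(x) for x in v)
    return [x / total for x in v]

def neg_l1(a, b):
    """Negated L1 distance, so that larger values mean more similar."""
    return -sum(abs(x - y) for x, y in zip(a, b))

def angular(a, b):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cross_correlation(a, b):
    """Pearson correlation: the cosine similarity of the mean-centered vectors."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return angular([x - ma for x in a], [y - mb for y in b])

spectrum = l1_normalize([2.0, 6.0, 2.0])   # hypothetical intensities
centroid = l1_normalize([1.0, 3.0, 1.0])
print(neg_l1(spectrum, centroid), angular(spectrum, centroid))
```

Because all three functions return larger values for more similar inputs, a centroid classifier can use any of them by picking the class mean with the highest similarity.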
  • Slide 11
  • Down Sampling
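The slides do not say how the spectra were down sampled. One common approach, assumed here purely for illustration, is to average runs of adjacent intensity values; the function name `down_sample` and the toy spectrum are made up.

```python
def down_sample(spectrum, factor):
    """Average each run of `factor` consecutive intensities into one value.

    A trailing partial run is averaged over the points it contains.
    """
    return [
        sum(spectrum[i:i + factor]) / len(spectrum[i:i + factor])
        for i in range(0, len(spectrum), factor)
    ]

# A narrow peak (9.0) next to a flat region (hypothetical intensities).
spectrum = [0.0, 0.0, 9.0, 0.0, 1.0, 1.0, 1.0, 1.0]
print(down_sample(spectrum, 4))  # [2.25, 1.0]
```

In this toy case the narrow peak is smeared into its bin and ends up only slightly above the flat region, which hints at how down sampling can discard discriminative detail.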
  • Slide 12
  • Statistical Tests: T-test, KS-test, Composite. Each ranks features in terms of relevance. SFS (Sequential Forward Selection): selects ever-increasing feature sets, i.e., {1}; {1,2}; {1,2,3}; {1,2,3,4}.
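The ranking and the nested feature sets described on this slide can be sketched as follows. The relevance scores are invented, and `rank_features` and `nested_feature_sets` are hypothetical helper names; in practice each growing set would be handed to a classifier and scored.

```python
def rank_features(scores):
    """Feature indices sorted from most to least relevant (highest score first)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def nested_feature_sets(ranking):
    """Yield the growing sets [r1], [r1, r2], [r1, r2, r3], ... in rank order."""
    for k in range(1, len(ranking) + 1):
        yield ranking[:k]

scores = [0.2, 3.1, 1.7, 0.9]      # hypothetical per-feature test scores
ranking = rank_features(scores)    # [1, 2, 3, 0]
for subset in nested_feature_sets(ranking):
    print(subset)
```

Only as many candidate sets as there are features are evaluated, which keeps the search cheap compared with examining every possible subset.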
  • Slide 13
  • Heart/Kidney
  • Slide 14
  • Ovarian Cancer
  • Slide 15
  • Prostate Cancer
  • Slide 16
  • Single Feature Classification: Use each feature on its own to classify test samples. Rank features in terms of performance. Apply SFS to the ranking.
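A minimal sketch of single feature classification, assuming a centroid classifier restricted to one feature at a time and plain accuracy as the performance measure. The data layout (lists of `(feature_vector, label)` pairs), the function names, and the toy values are assumptions made for the example.

```python
def single_feature_accuracy(train, test, j):
    """Accuracy of a centroid classifier that sees only feature j."""
    sums, counts = {}, {}
    for x, y in train:
        sums[y] = sums.get(y, 0.0) + x[j]
        counts[y] = counts.get(y, 0) + 1
    means = {y: sums[y] / counts[y] for y in sums}
    correct = sum(
        1 for x, y in test
        if min(means, key=lambda c: abs(means[c] - x[j])) == y
    )
    return correct / len(test)

def rank_by_single_feature(train, test):
    """Return (feature order from best to worst, per-feature accuracies)."""
    n_features = len(train[0][0])
    acc = [single_feature_accuracy(train, test, j) for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: acc[j], reverse=True)
    return order, acc

# Hypothetical data: feature 1 separates the classes, feature 0 does not.
train = [([5.0, 1.0], "A"), ([1.0, 1.2], "A"), ([4.0, 3.0], "B"), ([1.0, 3.2], "B")]
test = [([2.0, 0.9], "A"), ([4.0, 3.1], "B")]
order, acc = rank_by_single_feature(train, test)
print(order, acc)  # [1, 0] [0.0, 1.0]
```

The resulting order can then be expanded into growing feature sets in the same way as a ranking produced by a statistical test.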
  • Slide 17
  • Performance Comparison
  • Slide 18
  • Slide 19
  • Slide 20
  • Summary: For each data set and each FE algorithm, 15,000 3-fold cross-validation experiments were run, for a total of 810,000 FE experiments. DE experiments: ~100,000 experiments. An additional 50,000 experiments using the DPCA classifier did not produce significantly different results from the centroid classifier.
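One 3-fold cross-validation experiment of the kind counted here might be organized as below. The slides do not describe how folds were drawn, so a seeded random partition is assumed; `three_fold_splits`, `cross_validate`, and the `evaluate` callback are hypothetical names.

```python
import random

def three_fold_splits(n_samples, seed=0):
    """Yield (train_idx, test_idx) pairs for one random 3-fold partition."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::3] for i in range(3)]
    for k in range(3):
        test_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, test_idx

def cross_validate(n_samples, evaluate, seed=0):
    """Mean of evaluate(train_idx, test_idx) over the three folds."""
    scores = [evaluate(tr, te) for tr, te in three_fold_splits(n_samples, seed)]
    return sum(scores) / len(scores)

# Dummy evaluation that just reports the test fraction (stands in for a classifier).
print(cross_validate(9, lambda tr, te: len(te) / 9))
```

Repeating the experiment with different seeds gives different partitions, which is one way a large number of runs per configuration could be accumulated.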
  • Slide 21
  • Conclusions: The HK and Ovarian data sets are considerably easier to classify than Prostate Cancer. Feature extraction (in general) significantly improves performance on all data sets. No single technique is superior on all data sets. Best performance: SFS with feature weighting. Smallest feature set: T-test or KS-test. The Composite test is inferior to all others. Down sampling appears to be detrimental. What about other dimensionality reduction techniques, e.g. PCA and wavelets?
  • Slide 22
  • Conclusions (cont): What about FE after down sampling? On the Prostate data, performance appears to drop with respect to the best single feature.
  • Slide 23
  • To Do List: Check PCA, wavelets, and other DR techniques. Use other (better) classifiers. General hypothesis: use a simple, fast classifier together with FE techniques to extract a good feature set, then replace the classifier with a more effective one. Need to verify that other classifiers respond well to the extracted features.