Evaluation of Semi-supervised Learning for Classification of Protein Crystallization Imagery
Madhav Sigdel, İmren Dinç, Semih Dinç, Madhu S. Sigdel, Marc Pusey, PhD, Ramazan Aygün, PhD
Email: [email protected]
Research Lab, Computer Science Department, University of Alabama in Huntsville
IEEE SoutheastCon 2014, Lexington, KY
Outline
- Background
- Motivation
- Semi-supervised Classification
  - Self-Training
  - Yet Another Two Staged Idea (YATSI)
- Overview of Features
- Experimental Results
- Conclusion
Background
Sample protein crystallization trial images:
- Non-crystals
- Likely-leads
- Crystals
[Figure: model of the robotic system used to collect images]
Image Categories
- Non-crystals: images without crystals (clear drop / precipitates)
- Likely-leads: images with micro-crystals or high-intensity regions without clear shapes
- Crystals: images with different shapes of crystals (needles, plates, 3D crystals)
Related Work
- Protein crystallization classification has been addressed with a variety of algorithms, such as:
  - Support vector machines (SVMs)
  - Decision trees
  - Neural networks, etc.
- Combination of multiple classifiers (Saitoh et al. 08)
- Trend to increase the size of training data to improve classification performance:
  - 79,632 images (Po & Laine 08)
  - 165,351 images (Cumba et al. 10)
Motivation
- Expert labeling is very difficult and time-consuming
- Can we build a reliable classification system using limited labeled images?
- Semi-supervised learning
Semi-supervised Classification
- Combines labeled and unlabeled data to improve the learning model
- Examples of semi-supervised classification:
  - Self-training
  - Yet-Another Two Staged Idea (YATSI)
  - Laplacian SVM, transductive SVM, etc.
- Used for applications such as text classification, spam email detection, software fault detection, etc.
Self-Training
Let L be the set of labeled data and U the set of unlabeled data.
Repeat:
- Train a classifier h with training data L
- Classify data in U with h
- Find the subset U' of U with the most confident predictions
- L ← L ∪ U'; U ← U \ U'
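A minimal sketch of this loop, assuming NumPy arrays and scikit-learn; GaussianNB stands in for the Naïve Bayesian classifier used in the experiments (the slides do not state the tooling), and the threshold c mirrors the confidence levels 0.8 to 0.95 evaluated later:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    def self_train(X_l, y_l, X_u, c=0.9, max_iter=10):
        """Grow the labeled set L with confident predictions on U."""
        clf = GaussianNB().fit(X_l, y_l)              # train h on L
        for _ in range(max_iter):
            if len(X_u) == 0:
                break
            proba = clf.predict_proba(X_u)            # classify U with h
            pick = proba.max(axis=1) >= c             # most confident subset U'
            if not pick.any():
                break
            y_new = clf.classes_[proba[pick].argmax(axis=1)]
            X_l = np.vstack([X_l, X_u[pick]])         # L <- L union U'
            y_l = np.concatenate([y_l, y_new])
            X_u = X_u[~pick]                          # U <- U \ U'
            clf = GaussianNB().fit(X_l, y_l)          # retrain on enlarged L
        return clf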
Yet-Another Two Staged Idea (YATSI)
- Uses a supervised classification algorithm and a nearest-neighbor algorithm
- Two stages (sketched below):
  - First stage:
    - Generate a prediction model (M) using the labeled data (L)
    - Find predictions for the unlabeled data (U) using M → pre-labeled data
    - Combine the original labeled data and the pre-labeled data (L + U)
  - Second stage:
    - Apply k-nearest neighbor on (L + U) to determine the actual predictions for the unlabeled instances
Driessens et al. 06
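A minimal sketch of the two stages, under the same scikit-learn assumptions as above; the weighting of pre-labeled instances used in the original YATSI formulation is omitted for brevity:

    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier

    def yatsi(X_l, y_l, X_u, k=10):
        # First stage: prediction model M from the labeled data L
        model = GaussianNB().fit(X_l, y_l)
        y_pre = model.predict(X_u)                 # pre-labels for U
        # Combine original labeled data and pre-labeled data (L + U)
        X_all = np.vstack([X_l, X_u])
        y_all = np.concatenate([y_l, y_pre])
        # Second stage: k-NN over L + U gives the final predictions for U
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_all, y_all)
        return knn.predict(X_u)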
[Illustration: k-nearest neighbor classifier. With K = 1 the query point "?" is labeled O; with K = 3 it is labeled X]
[Illustration: YATSI algorithm. The prediction model trained on the labeled data (O/X) assigns pre-labels to the unlabeled points "?"; k-nearest neighbor over the combined set then determines the final labels]
Overview of Features
- 3 thresholding techniques
- 6 intensity features
- 9 blob features
- 3 × (6 + 9) = 45-dimension feature vector (assembly sketched below)
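The slides give only the shape of the feature vector, so the sketch below shows one hypothetical assembly. The three thresholds (Otsu, mean, 90th percentile) and the particular statistics are stand-ins chosen for illustration, not the techniques used in the paper:

    import numpy as np
    from skimage.filters import threshold_otsu
    from skimage.measure import label, regionprops

    def binarize(image, method):
        if method == "otsu":
            t = threshold_otsu(image)
        elif method == "mean":
            t = image.mean()
        else:                                      # "p90": 90th-percentile cut
            t = np.percentile(image, 90)
        return image > t

    def intensity_features(image, mask):           # 6 intensity statistics
        fg = image[mask] if mask.any() else np.zeros(1)
        return [fg.mean(), fg.std(), fg.min(), fg.max(),
                np.median(fg), mask.mean()]

    def blob_features(mask):                       # 9 blob (shape) statistics
        props = regionprops(label(mask))
        areas = [p.area for p in props] or [0]
        eccs = [p.eccentricity for p in props] or [0]
        return [len(props), np.sum(areas), np.max(areas), np.mean(areas),
                np.std(areas), np.min(areas), np.median(areas),
                np.max(eccs), np.mean(eccs)]

    def extract_features(image):
        feats = []
        for method in ("otsu", "mean", "p90"):     # 3 thresholding techniques
            mask = binarize(image, method)
            feats += intensity_features(image, mask)
            feats += blob_features(mask)
        return np.asarray(feats)                   # 3 * (6 + 9) = 45 values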
Dataset
2,250 images: non-crystals, likely-leads, crystals
2-class problem: 67% non-crystals, 33% likely crystals (crystals + likely-leads)
3-class problem: 67% non-crystals, 18% likely-leads, 15% crystals
Experiments - Self-Training
- 2 supervised classifiers:
  - Naïve Bayesian (NB)
  - Sequential Minimal Optimization (SMO)
- Confidence level (c) for the first prediction: c = 0.8, 0.9, 0.95
- Training sizes: 1%, 2%, 5%, 10%, 20%
Experiments - Self-Training
[Results charts]
Experiments - YATSI
- 5 supervised classifiers:
  - Naïve Bayesian (NB)
  - Sequential Minimal Optimization (SMO)
  - Decision tree (J48)
  - Multilayer perceptron (MLP)
  - Random forest (RF)
- Number of nearest neighbors (K): K = 10, 20, 30
- Training sizes: 1%, 2%, 5%, 10%, 20% (evaluation grid sketched below)
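A sketch of the evaluation grid this setup implies, reusing the yatsi() sketch above; the stratified split and the accuracy metric are assumptions about the protocol:

    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    def run_grid(X, y, fractions=(0.01, 0.02, 0.05, 0.10, 0.20), ks=(10, 20, 30)):
        for frac in fractions:
            # Treat a small stratified fraction as labeled, the rest as unlabeled
            X_l, X_u, y_l, y_u = train_test_split(
                X, y, train_size=frac, stratify=y, random_state=0)
            base = GaussianNB().fit(X_l, y_l)          # supervised baseline
            acc_sup = accuracy_score(y_u, base.predict(X_u))
            for k in ks:
                acc_yatsi = accuracy_score(y_u, yatsi(X_l, y_l, X_u, k=k))
                print(f"train={frac:.0%} K={k}: "
                      f"supervised={acc_sup:.3f} YATSI={acc_yatsi:.3f}")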
Experiments - YATSI
[Results charts]
Supervised vs. YATSI
[Comparison charts]
Best Classifiers Comparison
[Comparison chart]
Conclusion
- Compared the performance of two semi-supervised classification techniques: self-training and YATSI
- The Naïve Bayesian (NB) and SMO classifiers benefited from the self-training and YATSI approaches
- J48, multilayer perceptron (MLP), and random forest (RF) did not show improvement from the semi-supervised approaches
- Random forest provided the best classification performance
Future Work
- Investigate active learning in combination with semi-supervised learning
Acknowledgement
National Institutes of Health grant (GM090453)
THANK YOU
Madhav Sigdel, İmren Dinç, Semih Dinç, Madhu Sigdel, Marc Pusey, PhD, Ramazan Aygün, PhD
Email: [email protected]
Research Lab, Computer Science Department, University of Alabama in Huntsville