ISAC Tutorial - 5/20/06 - Copyright (c) 2006, R.F. Murphy

Basics of Machine Learning for Image or Flow
Robert F. Murphy
Departments of Biological Sciences, Biomedical Engineering, and Machine Learning

Contents
- The multivariate data matrix and its descriptive statistics
- Comparison: Are two samples the same?
  - Parametric methods
  - Non-parametric methods (including tree-based methods)
  - Influence of sample size

Contents
- Classification: Which of a set of known classes should a new sample be assigned to?
  - Linear Discriminant Analysis
  - Classification Trees
  - Neural Networks
  - Support Vector Machines
  - Ensemble Classifiers
  - Bayesian Classifiers

Contents
- Clustering: What classes are present in a sample?
  - Basic clustering methods
  - Methods for determining number of clusters
  - Consensus clustering methods
  - Methods for comparing clusterings
  - Co-clustering
- Graphical models: drawing inference on classes from more than one instance

Multivariate Distance
- Distance is at the heart of machine learning
- High dimensionality
- Based on vector geometry: how close are two data points?

              Feat 1      Feat 2
           (Gene 1 /   (Gene 2 /
            Array 1)    Array 2)
  Cell 1       1           4
  Cell 2       1           3
  ...
Source: murphylab.web.cmu.edu/presentations/20060520ISACTutorial2.pdf
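The vector-geometry view of distance can be made concrete with the tiny data matrix above; a minimal sketch in NumPy (the Euclidean metric is one common choice, not the only one):

```python
# Sketch: Euclidean distance between rows of a multivariate data matrix
# (cells x features), using the slide's two-cell, two-gene example.
import numpy as np

# Rows = cells, columns = features (expression of Gene 1, Gene 2)
data = np.array([[1.0, 4.0],   # Cell 1
                 [1.0, 3.0]])  # Cell 2

dist = np.linalg.norm(data[0] - data[1])  # distance between the two points
print(dist)  # 1.0
```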
Support Vector Machines (SVMs)
We want to label '?' - which linear separator should we use?
[Figure: scatter plot of '+' and '-' training points with an unlabeled '?' point; axes: area vs. brightness]
Slide courtesy of Christos Faloutsos
Support Vector Machines (SVMs)
We want to label '?' - which linear separator should we use?
A: the one with the widest corridor!
[Figure: same scatter plot with the maximum-margin separator and its corridor drawn in; axes: area vs. brightness]
Slide courtesy of Christos Faloutsos
Support Vector Machines (SVMs)
A: the one with the widest corridor! The training points that sit on the edges of the corridor are the 'support vectors'.
[Figure: same scatter plot with the support vectors highlighted on the margin boundaries]
Slide courtesy of Christos Faloutsos
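The widest-corridor idea can be tried directly; a minimal sketch using scikit-learn's SVC with a linear kernel (the toy data and parameter values are illustrative assumptions, not from the tutorial):

```python
# Sketch: fit a maximum-margin linear SVM on toy 2D data.
# The two features stand in for the slide's "area" and "brightness" axes.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],   # '+' class (1)
              [4.0, 4.0], [4.5, 3.5], [5.0, 4.5]])  # '-' class (0)
y = np.array([1, 1, 1, 0, 0, 0])

clf = SVC(kernel="linear", C=1.0)  # linear separator with widest corridor
clf.fit(X, y)

print(clf.support_vectors_)        # the points defining the corridor edges
print(clf.predict([[3.0, 3.0]]))   # label the '?' point
```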
Multiclass Support Vector Machines
- maxwin: train N support vector machines, each of which separates class i from non-i. Choose the predicted class from the machine generating the highest output score.
- pairwise: train all possible binary classifiers, N(N-1)/2 machines in total. Each binary classifier gives a vote to its winning class, and the class with the most votes is selected as the predicted class.
- DAG: put the N(N-1)/2 binary classifiers trained above into a rooted binary DAG. Trace down from the root node, discarding the losing class at each node, until a single class remains for the test point.
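The first two strategies are available as meta-estimators in scikit-learn; a minimal sketch on made-up three-class data (the DAG variant has no standard scikit-learn implementation, so it is omitted here):

```python
# Sketch: "maxwin" (one-vs-rest) and "pairwise" (one-vs-one) multiclass SVMs.
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

rng = np.random.default_rng(0)
# Three well-separated 2D classes, 20 points each (illustrative data)
X = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in (0.0, 3.0, 6.0)])
y = np.repeat([0, 1, 2], 20)

ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)  # maxwin: N machines
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)   # pairwise: N(N-1)/2

print(ovr.predict([[3.1, 2.9]]), ovo.predict([[3.1, 2.9]]))
```

Note that SVC itself already uses the pairwise scheme internally for multiclass problems; the wrappers just make the strategy explicit.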
Evaluating Classifiers
- Divide ~100 images for each class into training set and test set
- Use the training set to determine rules for the classes
- Use the test set to evaluate performance
- Repeat with different divisions into training and test sets
- Evaluate different sets of features chosen as most discriminative by feature selection methods
- Evaluate different classifiers (NN, SVM, MOE)
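The evaluation loop above can be sketched with repeated random splits; the synthetic "image features", the classifier choice, and the split fraction are all illustrative assumptions:

```python
# Sketch: repeated random train/test splits, averaging test accuracy.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Stand-in for image feature vectors: 100 "images" per class, 5 features
X = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(2, 1, (100, 5))])
y = np.repeat([0, 1], 100)

accuracies = []
for seed in range(5):  # repeat with different divisions into train/test
    Xtr, Xte, ytr, yte = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y)
    clf = SVC(kernel="linear").fit(Xtr, ytr)   # train on the training set
    accuracies.append(clf.score(Xte, yte))     # evaluate on the test set

print(np.mean(accuracies))
```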
Many different types:
- Hierarchical clustering
- k-means clustering
- Self-organising maps
- Hill climbing
- Simulated annealing

All have the same three basic tasks:
1. Pattern representation - patterns or features in the data
2. Pattern proximity - a measure of the distance or similarity defined on pairs of patterns
3. Pattern grouping - methods and rules used in grouping the patterns
Hierarchical vs. k-means clustering
- Hierarchical builds a tree sequentially from the closest pair of points (either genes or conditions)
- k-means starts with k randomly chosen seed points, assigns each remaining point to the nearest seed, and repeats this until no point moves
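The k-means loop just described can be written out in a few lines of NumPy; a minimal sketch on made-up 2D data (real implementations add smarter seeding and restarts):

```python
# Sketch of k-means: pick k random seed points, assign every point to its
# nearest seed, recompute seeds as cluster means, repeat until no point moves.
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random seeds
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # distance from every point to every center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no point moved: converged
        labels = new_labels
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

# Two obvious groups of 2D points
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])
labels, centers = kmeans(X, k=2)
print(labels)
```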
Location Proteomics
- Tag many proteins
- We have used CD-tagging (developed by Jonathan Jarvik and Peter Berget): infect a population of cells with a retrovirus carrying a DNA sequence that will "tag" a random gene in each cell
Principles of CD-Tagging (CD = Central Dogma)
[Diagram: genomic DNA (Exon 1 - Intron 1 - Exon 2) plus a CD-cassette yields tagged DNA (Exon 1 - Tag - Exon 2), which is transcribed into tagged mRNA and translated into a tagged protein carrying the tag as an epitope]
Location Proteomics (continued)
- Isolate separate clones, each of which expresses one tagged protein
- Use RT-PCR to identify the tagged gene in each clone
- Collect many live-cell images for each clone using spinning disk confocal microscopy
- Since cells with the same location pattern are often clustered together, considering multiple cells may improve the discrimination of similar location patterns.
- We developed a novel graphical model to describe the relationship between multiple cells in a field.
- The classification of a cell is influenced by the classification results of neighboring cells.
Multiple Cells in an Image
1. Segmentation
2. Feature Extraction
3. Cell Classification
[Figure: a multi-cell field is segmented into single cells, a numeric feature vector is computed for each cell, and each cell is then classified either individually or dependently]
- Majority voting over a homogeneous field: accuracy 98% (Boland and Murphy, 2001)
- Local dependence for a heterogeneous field
[Figure: example heterogeneous field in which most cells show an ER pattern, with occasional Actin and Golgi cells]
Value of Graphical Model
- Graphical models can be used to improve accuracy of classification of heterogeneous images
- Each individual cell is still classified, and minor or unusual cells are not "lost"
- Appropriate for cell array experiments (e.g., RNAi) where heterogeneity is expected
- Appropriate for tissue images
Bayes Decision Theory
x: features; w_j: jth class
Bayes Rule:

  p(w_j | x) = p(x | w_j) p(w_j) / p(x)

  posterior = likelihood x prior / evidence
Bayes Decision Theory - Training

  p(w_j | x) = p(x | w_j) p(w_j) / p(x)

- Train a classifier given training images of each class
- Assign x to the class with the maximum posterior probability
Bayes Decision Theory - Testing

  p(w_j | x) = p(x | w_j) p(w_j) / p(x)

x: features; w_j: jth class
Bayes Decision Theory - Testing
- Normally, the prior distribution is assumed or determined ahead of time (hence "prior"!)
- Our idea: adjust the priors to reflect the neighbors of a cell (iteratively), so that the posterior probability changes to reflect the neighbors
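Bayes rule as used above is easy to evaluate numerically; a minimal sketch with made-up likelihoods and priors for three classes:

```python
# Sketch: compute posteriors p(w_j | x) from class-conditional likelihoods
# p(x | w_j) and priors p(w_j). The numbers are illustrative assumptions.
import numpy as np

likelihood = np.array([0.05, 0.20, 0.01])  # p(x | w_j) for 3 classes
prior = np.array([0.5, 0.3, 0.2])          # p(w_j)

evidence = np.sum(likelihood * prior)       # p(x)
posterior = likelihood * prior / evidence   # p(w_j | x)

print(posterior, posterior.argmax())  # assign x to the max-posterior class
```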
Graphical Cell Model
- Consider multiple cells in a field
- Connect cells if they are close enough (either in physical space or feature space)
[Figure: seven numbered cells in a field, with edges drawn between nearby cells]
Acknowledgments
Thanks to Michael Boland, Meel Velliste, Kai Huang, Xiang Chen, Shann-Ching Chen, Geoffrey Gordon, Jonathan Jarvik, Peter Berget, and Christos Faloutsos for contributions to the slides in this tutorial and/or the research they describe.

The research was supported in part by research grant RPG-95-099-03-MGO from the American Cancer Society, by grant 99-295 from the Rockefeller Brothers Fund Charles E. Culpeper Biomedical Pilot Initiative, by NSF grants BIR-9217091, MCB-8920118, and BIR-9256343, by NIH grants R33 CA83219 and R01 GM068845, by Commonwealth of Pennsylvania Tobacco Settlement Fund research grant 017393, and by graduate fellowships from the Merck Computational Biology and Chemistry Program at Carnegie Mellon University funded by the Merck Company Foundation.
Review Articles
- Y. Hu and R. F. Murphy (2004). Automated Interpretation of Subcellular Patterns from Immunofluorescence Microscopy. J. Immunol. Methods 290:93-105.
- K. Huang and R. F. Murphy (2004). From Quantitative Microscopy to Automated Image Understanding. J. Biomed. Optics 9:893-912.
- R.F. Murphy (2005). Location Proteomics: A Systems Approach to Subcellular Interpretation of Subcellular Patterns in Fluorescence Microscope Images. Cytometry 67A:1-3.
- X. Chen and R.F. Murphy (2006). Automated Interpretation of Protein Subcellular Location Patterns. International Review of Cytology 249:194-227.
- X. Chen, M. Velliste, and R.F. Murphy (2006). Automated Interpretation of Subcellular Patterns in Fluorescence Microscope Images for Location Proteomics. Cytometry, in press.
http://murphylab.web.cmu.edu/publications
First published system for recognizing subcellular location patterns - 2D CHO (5 patterns)
- M. V. Boland, M. K. Markey and R. F. Murphy (1997). Automated Classification of Cellular Protein Localization Patterns Obtained via Fluorescence Microscopy. Proceedings of the 19th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 594-597.
- M. V. Boland, M. K. Markey and R. F. Murphy (1998). Automated Recognition of Patterns Characteristic of Subcellular Structures in Fluorescence Microscopy Images. Cytometry 33:366-375.
2D HeLa pattern classification (10 major patterns)
- R. F. Murphy, M. V. Boland and M. Velliste (2000). Towards a Systematics for Protein Subcellular Location: Quantitative Description of Protein Localization Patterns and Automated Analysis of Fluorescence Microscope Images. Proc Int Conf Intell Syst Mol Biol 8:251-259.
- M. V. Boland and R. F. Murphy (2001). A Neural Network Classifier Capable of Recognizing the Patterns of all Major Subcellular Structures in Fluorescence Microscope Images of HeLa Cells. Bioinformatics 17:1213-1223.
3D HeLa pattern classification (11 major patterns)
- M. Velliste and R.F. Murphy (2002). Automated Determination of Protein Subcellular Locations from 3D Fluorescence Microscope Images. Proceedings of the 2002 IEEE International Symposium on Biomedical Imaging (ISBI 2002), pp. 867-870.
- R.F. Murphy, M. Velliste, and G. Porreca (2003). Robust Numerical Features for Description and Classification of Subcellular Location Patterns in Fluorescence Microscope Images. J. VLSI Sig. Proc. 35:311-321.
- K. Huang, M. Velliste, and R. F. Murphy (2003). Feature Reduction for Improved Recognition of Subcellular Location Patterns in Fluorescence Microscope Images. Proc. SPIE 4962:307-318.
- K. Huang and R.F. Murphy (2004). Boosting Accuracy of Automated Classification of Fluorescence Microscope Images for Location Proteomics. BMC Bioinformatics 5:78.
- X. Chen and R.F. Murphy (2004). Robust Classification of Subcellular Location Patterns in High Resolution 3D Fluorescence Microscope Images. Proceedings of the 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 1632-1635.
Classification of multi-cell images
- K. Huang and R. F. Murphy (2004). Automated Classification of Subcellular Patterns in Multicell Images without Segmentation into Single Cells. Proceedings of the 2004 IEEE International Symposium on Biomedical Imaging (ISBI 2004), pp. 1139-1142.
- S.-C. Chen and R.F. Murphy (2006). A Graphical Model Approach to Automated Classification of Protein Subcellular Location Patterns in Multi-Cell Images. BMC Bioinformatics 7:90.
Subcellular Location Trees - 3D 3T3 CD-tagged images
- X. Chen, M. Velliste, S. Weinstein, J.W. Jarvik and R.F. Murphy (2003). Location Proteomics - Building Subcellular Location Trees from High Resolution 3D Fluorescence Microscope Images of Randomly-Tagged Proteins. Proc. SPIE 4962:298-306.
- X. Chen and R. F. Murphy (2005). Objective Clustering of Proteins Based on Subcellular Location Patterns. Journal of Biomedicine and Biotechnology 2005:87-95.
Subcellular Location Trees - Analysis of Location Mutants
- P. Nair, B.E. Schaub, K. Huang, X. Chen, R.F. Murphy, J.M. Griffith, H.J. Geuze, and J. Rohrer (2005). Characterization of the TGN Exit Signal of the Human Mannose 6-Phosphate Uncovering Enzyme. J. Cell Sci. 118:2949-2956.
PSLID - Protein Subcellular Location Image Database
- K. Huang, J. Lin, J.A. Gajnak, and R.F. Murphy (2002). Image Content-based Retrieval and Automated Interpretation of Fluorescence Microscope Images via the Protein Subcellular Location Image Database. Proceedings of the 2002 IEEE International Symposium on Biomedical Imaging (ISBI 2002), pp. 325-328.
SLIF - Subcellular Location Image Finder
- R. F. Murphy, M. Velliste, J. Yao, and G. Porreca (2001). Searching Online Journals for Fluorescence Microscope Images Depicting Protein Subcellular Location Patterns. Proceedings of the 2nd IEEE International Symposium on Bio-Informatics and Biomedical Engineering (BIBE 2001), pp. 119-128.
- R. F. Murphy, Z. Kou, J. Hua, M. Joffe, and W. W. Cohen (2004). Extracting and Structuring Subcellular Location Information from On-line Journal Articles: The Subcellular Location Image Finder. Proceedings of the IASTED International Conference on Knowledge Sharing and Collaborative Engineering (KSCE 2004), pp. 109-114.