Machine Learning Methods: an overview
Master in Bioinformatica – April 9th, 2010
Paolo Marcatili
University of Rome “Sapienza”, Dept. of Biochemical Sciences “Rossi Fanelli”
Methods
Where is the information: in the data? In the model?
Methods
Work schema:
Choose a learning/validation setting
Prepare the data (training, test, validation sets)
Train (one or more times)
Validate
Use
Love all, trust a few, do wrong to none.
[Figure: scatter plots of patients vs. controls along two measured variables, shown first with 4 patients and 4 controls, then with 2 more samples, then with 10 more.]
Assessment
Goal: prediction of unknown data!
Problems: few data, robustness.
Solutions:
Training, test and validation sets
Leave-one-out
K-fold cross-validation
Assessment
50% Training set: used to tune the model parameters
25% Test set: used to verify that the machine has “learnt”
25% Validation set: final assessment of the results
Unfeasible with few data
Overview
Assessment
Leave-one-out:
for each sample Ai:
Training set: all the samples - {Ai}
Test set: {Ai}
Repeat
Computationally intensive; a good estimate of the mean error, but high variance.
Assessment
K-fold cross-validation:
Divide your data into K subsets S1..SK
Training set: all the samples - Si
Test set: Si
Repeat for each i
A good compromise.
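Both validation schemes can be written in a few lines. A sketch under assumed toy data (two 1D Gaussian classes) with a simple nearest-class-mean rule standing in for the model being validated:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(0, 1, 20), rng.normal(3, 1, 20)])
y = np.array([0] * 20 + [1] * 20)
n = len(y)

def fold_error(train_idx, test_idx):
    # Train a nearest-class-mean classifier on the training fold,
    # return its error rate on the test fold.
    m0 = X[train_idx][y[train_idx] == 0].mean()
    m1 = X[train_idx][y[train_idx] == 1].mean()
    pred = (np.abs(X[test_idx] - m1) < np.abs(X[test_idx] - m0)).astype(int)
    return float((pred != y[test_idx]).mean())

# Leave-one-out: each sample Ai is the test set exactly once.
loo = float(np.mean([fold_error(np.delete(np.arange(n), i), np.array([i]))
                     for i in range(n)]))

# K-fold: K disjoint subsets S_i, each used once as the test set.
K = 5
folds = np.array_split(rng.permutation(n), K)
kfold = float(np.mean([fold_error(np.setdiff1d(np.arange(n), f), f)
                       for f in folds]))
```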
Assessment
Sensitivity: TP / [TP + FN]. Given the disease is present, the likelihood of testing positive.
Specificity: TN / [TN + FP]. Given the disease is not present, the likelihood of testing negative.
Positive predictive value: TP / [TP + FP]. Given a positive test, the likelihood that the disease is present.
The receiver operating characteristic (ROC) is a graphical plot of sensitivity vs. (1 - specificity) for a binary classifier as its discrimination threshold is varied.
The area under the ROC curve (AUROC) is often used as a parameter to compare different classifiers.
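These quantities are direct to compute. The confusion counts and classifier scores below are made up for illustration; the AUROC is computed via its rank interpretation (the probability that a random positive scores above a random negative), which is equivalent to the area under the threshold-sweep curve.

```python
import numpy as np

# Confusion counts for a hypothetical diagnostic test.
TP, FN, TN, FP = 80, 20, 90, 10
sensitivity = TP / (TP + FN)   # P(test positive | disease present)
specificity = TN / (TN + FP)   # P(test negative | disease absent)
ppv = TP / (TP + FP)           # P(disease present | test positive)

# ROC: sweeping the decision threshold over classifier scores gives one
# (1 - specificity, sensitivity) point per threshold.
rng = np.random.default_rng(2)
scores = np.concatenate([rng.normal(0, 1, 500), rng.normal(2, 1, 500)])
labels = np.array([0] * 500 + [1] * 500)
pos, neg = scores[labels == 1], scores[labels == 0]

# AUROC = probability that a random positive sample scores higher
# than a random negative one (Mann-Whitney statistic).
auroc = float((pos[:, None] > neg[None, :]).mean())
```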
Agenda
Overview: Why, How, Datasets, Methods, Assessments
Supervised Learning: SVM, HMM, Decision Trees / Random Forests, Bayesian Networks, Neural Networks
Unsupervised Learning: Clustering, PCA
Caveats: Data Independence, Biases, No free lunch?
Supervised Learning
Basic idea: use the data plus the classification of known samples to find “fingerprints” of the classes in the data.
Example: use microarray data from different conditions; the classes are genes related/unrelated to different cancer types.
Support Vector Machines
Basic idea: plot your data in an N-dimensional space.
Find the best hyperplane that separates the different classes.
Further samples can be classified using the region of the space they belong to.
[Figure: Fail/Pass samples plotted by length and weight. Among the separating hyperplanes, the Optimal Hyperplane (OHP) is the one with maximum margin; the samples lying on the margin are the support vectors. An SVM with a linear decision boundary is the simplest kind of SVM (called an LSVM).]
Support Vector Machines
What if data are not linearly separable?
Allow mismatches: soft margins (add a weight matrix).
Or map the data to a higher-dimensional space, e.g. (weight², length², weight · length), where the classes become separable by a hyperplane; in the original space this corresponds to a hypersurface.
The kernel trick: only the inner product is needed to calculate the dual problem and the decision function, so the mapping never has to be carried out explicitly (kernelization).
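The kernel trick can be checked numerically. A sketch using the degree-2 polynomial kernel and the (weight², length², weight·length) map from the slide (the data and the circular Fail/Pass rule are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

def phi(x):
    # Explicit degree-2 feature map for a 2D sample (weight, length):
    # (weight^2, length^2, sqrt(2) * weight * length).
    w, l = x
    return np.array([w * w, l * l, np.sqrt(2.0) * w * l])

def k(x, z):
    # Homogeneous polynomial kernel of degree 2: evaluated in the
    # original 2D space, yet equal to phi(x) . phi(z) in 3D.
    return float(x @ z) ** 2

x, z = rng.normal(size=2), rng.normal(size=2)
kernel_value = k(x, z)
mapped_value = float(phi(x) @ phi(z))

# A "Fail inside / Pass outside the unit circle" pattern is not linearly
# separable in 2D, but after mapping, w^2 + l^2 > 1 is a hyperplane.
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)
Z = np.array([phi(p) for p in X])
pred = (Z[:, 0] + Z[:, 1] > 1.0).astype(int)
accuracy = float((pred == y).mean())
```

In a real SVM the map phi is never evaluated; only k appears in the dual problem.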
SVM example
Knowledge-based analysis of microarray gene expression data by using support vector machines. Michael P. S. Brown, William Noble Grundy, David Lin, Nello Cristianini, Charles Walsh Sugnet, Terrence S. Furey, Manuel Ares, Jr., and David Haussler.
We introduce a method of functionally classifying genes by using gene expression data from DNA microarray hybridization experiments. The method is based on the theory of support vector machines (SVMs). SVMs are considered a supervised computer learning method because they exploit prior knowledge of gene function to identify unknown genes of similar function from expression data. SVMs avoid several problems associated with unsupervised clustering methods, such as hierarchical clustering and self-organizing maps. SVMs have many mathematical features that make them attractive for gene expression analysis, including their flexibility in choosing a similarity function, sparseness of solution when dealing with large data sets, the ability to handle large feature spaces, and the ability to identify outliers. We test several SVMs that use different similarity metrics, as well as some other supervised learning methods, and find that the SVMs best identify sets of genes with a common function using expression data. Finally, we use SVMs to predict functional roles for uncharacterized yeast ORFs based on their expression data.
To judge overall performance, we define the cost of using the method M as C(M) = fp(M) + 2·fn(M), where fp(M) is the number of false positives for method M, and fn(M) is the number of false negatives for method M. The false negatives are weighted more heavily than the false positives because, for these data, the number of positive examples is small compared with the number of negatives.
Hidden Markov Models
There is a regular and a biased coin.
You don't know which one is being used.
During the game the coins are exchanged with a certain fixed probability.
All you know is the output sequence:
HHTHTHTHTHTTTTHTHHTHHHHHHHHHTHTHTHHTHTHHHHTHTH
Given a set of parameters, what is the probability of the output sequence? Which parameters are most likely to have produced the output? Which coin was being used at a certain point of the sequence?
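The first question (probability of the output given the parameters) is answered by the forward algorithm. A sketch for the two-coin model; the bias, switch probability and starting distribution are assumed numbers, not values from the slides:

```python
# Two hidden states: fair coin (F) and biased coin (B).
states = ["F", "B"]
emit = {"F": {"H": 0.5, "T": 0.5},
        "B": {"H": 0.9, "T": 0.1}}     # assumed bias towards heads
trans = {"F": {"F": 0.9, "B": 0.1},    # assumed switch probability
         "B": {"F": 0.1, "B": 0.9}}
start = {"F": 0.5, "B": 0.5}

def forward(seq):
    """P(output sequence | model), summing over all hidden coin paths."""
    alpha = {s: start[s] * emit[s][seq[0]] for s in states}
    for o in seq[1:]:
        alpha = {s: emit[s][o] * sum(alpha[r] * trans[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())

p = forward("HHTHTHTHTH")
```

The same machinery (Viterbi, Baum-Welch) answers the other two questions: the most likely hidden path and the most likely parameters.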
Decision Trees
Mimic the behavior of an expert.
Pros: easy to interpret; statistical analysis; informative results.
Cons: each split uses a single variable; not optimal; not robust.
Majority rules!
Random Forests
Split the data into several subsets and construct a decision tree for each set.
Each tree expresses a vote; the majority wins.
Much more accurate and robust (bootstrap).
Prediction of protein–protein interactions using random decision forest framework
Xue-Wen Chen * and Mei Liu
Motivation: Protein interactions are of biological interest because they orchestrate a number of cellular processes such as metabolic pathways and immunological recognition. Domains are the building blocks of proteins; therefore, proteins are assumed to interact as a result of their interacting domains. Many domain-based models for protein interaction prediction have been developed, and preliminary results have demonstrated their feasibility. Most of the existing domain-based methods, however, consider only single-domain pairs (one domain from one protein) and assume independence between domain–domain interactions.
Results: In this paper, we introduce a domain-based random forest of decision trees to infer protein interactions. Our proposed method is capable of exploring all possible domain interactions and making predictions based on all the protein domains. Experimental results on Saccharomyces cerevisiae dataset demonstrate that our approach can predict protein–protein interactions with higher sensitivity (79.78%) and specificity (64.38%) compared with that of the maximum likelihood approach. Furthermore, our model can be used to infer interactions not only for single-domain pairs but also for multiple domain pairs.
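The bootstrap-and-vote idea can be sketched with depth-1 trees (decision stumps) as the individual voters; the two-feature toy data, where only column 0 is informative, is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data: column 0 separates the classes, column 1 is pure noise.
X = np.column_stack([
    np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)]),
    rng.normal(0, 1, 100),
])
y = np.array([0] * 50 + [1] * 50)

def fit_stump(Xb, yb):
    """Depth-1 decision tree: best (feature, threshold, direction)."""
    best_acc, best = -1.0, None
    for j in range(Xb.shape[1]):
        for t in Xb[:, j]:
            for sign in (1, -1):
                pred = (Xb[:, j] > t).astype(int)
                if sign == -1:
                    pred = 1 - pred
                acc = float((pred == yb).mean())
                if acc > best_acc:
                    best_acc, best = acc, (j, t, sign)
    return best

def stump_predict(stump, Xs):
    j, t, sign = stump
    pred = (Xs[:, j] > t).astype(int)
    return pred if sign == 1 else 1 - pred

# Forest: each tree is trained on a bootstrap resample of the data;
# at prediction time every tree votes and the majority wins.
forest = [fit_stump(X[idx], y[idx])
          for idx in (rng.integers(0, len(y), len(y)) for _ in range(25))]
votes = np.mean([stump_predict(s, X) for s in forest], axis=0)
accuracy = float(((votes > 0.5).astype(int) == y).mean())
```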
Bayesian Networks
The probabilistic approach is extremely powerful, but it requires a huge amount of information/data for a complete representation.
Not all correlations or cause-effect relationships between variables are significant.
Consider only meaningful links!
Example: I'm at work; neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes the alarm is set off by minor earthquakes. Is there a burglar?
Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls.
Network topology reflects "causal" knowledge:
A burglar can set the alarm off
An earthquake can set the alarm off
The alarm can cause Mary to call
The alarm can cause John to call
Bayes' theorem again!
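Given the network topology, the query can be answered by enumeration: the joint probability factorizes along the graph, and the hidden variables (Earthquake, Alarm) are summed out. The conditional probability tables below are illustrative numbers, not values from the slides:

```python
import itertools

# Conditional probability tables (illustrative numbers).
P_B = {True: 0.001, False: 0.999}                     # P(Burglary)
P_E = {True: 0.002, False: 0.998}                     # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}    # P(Alarm | B, E)
P_J = {True: 0.90, False: 0.05}                       # P(JohnCalls | Alarm)
P_M = {True: 0.70, False: 0.01}                       # P(MaryCalls | Alarm)

def joint(b, e, a, j, m):
    # The network topology gives the factorization of the joint.
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

# P(Burglary | JohnCalls = true, MaryCalls = false), by enumeration.
num = sum(joint(True, e, a, True, False)
          for e, a in itertools.product([True, False], repeat=2))
den = sum(joint(b, e, a, True, False)
          for b, e, a in itertools.product([True, False], repeat=3))
posterior = num / den
```

John calling raises the burglary probability above its prior, but only slightly, because Mary's silence is evidence against the alarm.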
Bayesian Networks
We don't know the joint probability distribution; how can we learn it from the data?
Optimize the likelihood, i.e. the probability that the model generated the data:
Maximum likelihood (simplest)
Maximum posterior
Marginal likelihood (hardest)
We don't know which relationships hold between the variables; how can we learn them from the data?
The number of possible graphs grows super-exponentially, so enumeration is impossible: heuristics, random sampling, Monte Carlo.
Does the independence assumption hold? Is the correlation informative? (BIC, Occam's razor, AIC)
Neural Networks
Neural networks interpolate functions. They have nothing to do with brains.
Parameter settings: avoid overfitting.
Learning --> validation --> usage. No underlying model, but it often works.
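"Neural networks interpolate functions" can be shown directly: a minimal one-hidden-layer network fitted by plain gradient descent to noisy samples of sin(x). All choices (architecture, learning rate, target function) are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

# Noisy samples of the function to interpolate.
X = np.linspace(-np.pi, np.pi, 60)[:, None]
y = np.sin(X) + rng.normal(0, 0.05, X.shape)

H = 12                                   # hidden units
W1 = rng.normal(0, 0.5, (1, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.5, (H, 1)); b2 = np.zeros(1)
lr = 0.1

for _ in range(5000):
    h = np.tanh(X @ W1 + b1)             # forward pass
    out = h @ W2 + b2
    err = out - y                        # dLoss/dout for MSE
    gW2 = h.T @ err / len(X); gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1 - h ** 2)     # backprop through tanh
    gW1 = X.T @ dh / len(X); gb1 = dh.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1       # gradient descent step
    W2 -= lr * gW2; b2 -= lr * gb2

mse = float(((np.tanh(X @ W1 + b1) @ W2 + b2 - y) ** 2).mean())
```

Training longer keeps lowering the training error, which is exactly where overfitting comes from: the held-out validation error is what must be watched.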
Neural Networks
Protein Disorder Prediction: Implications for Structural Proteomics. Rune Linding, Lars Juhl Jensen, Francesca Diella, Peer Bork, Toby J. Gibson, and Robert B. Russell.
Abstract
A great challenge in the proteomics and structural genomics era is to predict protein structure and function, including identification of those proteins that are partially or wholly unstructured. Disordered regions in proteins often contain short linear peptide motifs (e.g., SH3 ligands and targeting signals) that are important for protein function. We present here DisEMBL, a computational tool for prediction of disordered/unstructured regions within a protein sequence. As no clear definition of disorder exists, we have developed parameters based on several alternative definitions and introduced a new one based on the concept of “hot loops,” i.e., coils with high temperature factors. Avoiding potentially disordered segments in protein expression constructs can increase expression, foldability, and stability of the expressed protein. DisEMBL is thus useful for target selection and the design of constructs as needed for many biochemical studies, particularly structural biology and structural genomics projects. The tool is freely available via a web interface (http://dis.embl.de) and can be downloaded for use in large-scale studies.
Agenda: Unsupervised Learning
Unsupervised Learning
If we have no idea of the actual data classification, we can try to guess it.
Clustering
Put together similar objects to define classes.
Hierarchical Clustering
• We start with every data point in a separate cluster
• We keep merging the most similar pairs of data points/clusters until we have one big cluster left
• This is called a bottom-up or agglomerative method
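The agglomerative loop is short enough to write out. A sketch with single linkage (cluster distance = closest pair of points) on assumed toy data, stopping when two clusters remain:

```python
import numpy as np

rng = np.random.default_rng(6)
# Two well-separated groups of 5 points each (indices 0-4 and 5-9).
pts = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(5, 0.3, (5, 2))])

# Bottom-up: start with singleton clusters, repeatedly merge the
# closest pair (single linkage) until two clusters remain.
clusters = [[i] for i in range(len(pts))]
while len(clusters) > 2:
    best = None
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            d = min(float(np.linalg.norm(pts[i] - pts[j]))
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
    _, a, b = best
    merged = clusters[a] + clusters[b]
    clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
    clusters.append(merged)

two_clusters = sorted(sorted(c) for c in clusters)
```

Recording every merge distance instead of stopping early yields the full dendrogram.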
K-means
• Start with K random centers
• Assign each sample to the closest center
• Recompute centers (samples average)
• Repeat until converged
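The four steps map one-to-one onto code (Lloyd's algorithm); the two-blob toy data are an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
# Two blobs of 50 samples each.
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])

K = 2
# Start with K random centers (here: K distinct data points).
centers = X[rng.choice(len(X), K, replace=False)]
for _ in range(100):
    # Assign each sample to the closest center.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recompute each center as the average of its samples.
    new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    # Repeat until converged.
    if np.allclose(new_centers, centers):
        break
    centers = new_centers
```

In practice the algorithm is restarted from several random initializations, since it only finds a local optimum.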
PCA
Multidimensional data (hard to visualize).
Data variability is not equally distributed.
Correlation between variables.
Change the coordinate system to remove correlation; retain only the most variable coordinates.
How: generalized eigenvectors, SVD.
Pro: noise (and information) reduction.
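A sketch of the SVD route on assumed correlated 2D data: center, decompose, read off principal directions and explained variances, keep the top coordinate.

```python
import numpy as np

rng = np.random.default_rng(8)

# Correlated 2D data: most variability lies along one direction.
t = rng.normal(0, 2, 300)
X = np.column_stack([t + rng.normal(0, 0.3, 300),
                     t + rng.normal(0, 0.3, 300)])

Xc = X - X.mean(axis=0)                 # center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rows of Vt are the principal directions; s**2/(n-1) their variances.
explained = s ** 2 / (len(X) - 1)
ratio = explained / explained.sum()

# Change of coordinates: the scores are uncorrelated, sorted by variance.
scores = Xc @ Vt.T
reduced = scores[:, :1]                 # retain only the top coordinate
```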
Agenda: Caveats
Data independence
Training set, test set and validation set must be clearly separated.
E.g., a neural network to infer gene function from sequence:
training set: annotated gene sequences with deposit date before Jan 2007
test set: annotated gene sequences with deposit date after Jan 2007
But the annotation of new sequences is often inferred from that of old sequences!
Biases
Data should be unbiased, i.e. a good sample of our "space".
E.g., a neural network to find disordered regions; training set: solved structures, residues present in SEQRES but not in ATOM.
But solved structures are typically small, globular, cytoplasmic proteins!
Take-home message
Always look at the data: ML methods are extremely error-prone.
Use probability and statistics where possible.
In this order: Model, Data, Validation, Algorithm.
Be careful with biases, redundancy, hidden variables.
Occam's Razor: simpler is better.
Be careful with overfitting and over-parameterization.
Common sense is a powerful tool (but don't abuse it)
References
• Needham CJ, Bradford JR, Bulpitt AJ, Westhead DR (2007) A Primer on Learning in Bayesian Networks for Computational Biology. PLoS Comput Biol 3(8): e129. doi:10.1371/journal.pcbi.0030129
• Tarca AL, Carey VJ, Chen X-w, Romero R, Drăghici S (2007) Machine Learning and Its Applications to Biology. PLoS Comput Biol 3(6): e116. doi:10.1371/journal.pcbi.0030116
• Sean R Eddy (2004) What is a hidden Markov model? Nature Biotechnology 22, 1315 - 1316 (2004) doi:10.1038/nbt1004-1315
Bayes' Theorem (Supplementary)
a) AIDS affects 0.01% of the population.
b) The AIDS test, when performed on infected patients, is correct 99.9% of the time.
c) The AIDS test, when performed on uninfected people, is correct 99.99% of the time.
If a person tests positive, how likely is it that he is infected?
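Plugging the slide's numbers into Bayes' theorem:

```python
# Bayes' theorem applied to the numbers on the slide.
p_inf = 0.0001          # prevalence: 0.01% of the population
sens = 0.999            # P(test positive | infected)
spec = 0.9999           # P(test negative | not infected)

# Total probability of a positive test.
p_pos = sens * p_inf + (1 - spec) * (1 - p_inf)

# Bayes' theorem: P(infected | positive test).
p_inf_given_pos = sens * p_inf / p_pos
```

Despite the very accurate test, the posterior is only about 50%: the rare true positives are matched by false positives from the huge uninfected pool.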