Page 1: Machine Learning


Machine Learning Methods: an overview

Master in Bioinformatica – April 9th, 2010

Paolo Marcatili, University of Rome “Sapienza”, Dept. of Biochemical Sciences “Rossi Fanelli”

[email protected]

Overview

Supervised Learning

Unsupervised Learning

Caveats

Page 2: Machine Learning


Agenda

Overview: Why, How, Datasets, Methods, Assessments

Supervised Learning: SVM, HMM, Decision Trees – RF, Bayesian Networks, Neural Networks

Unsupervised Learning: Clustering, PCA

Caveats: Data Independence, Biases, No free lunch?

Overview

Page 3: Machine Learning


Why

Large amount of data

Large dimensionality

Page 4: Machine Learning


Large amount of data

Large dimensionality

Complex dynamics

Data Noisiness

Why

Page 5: Machine Learning


Large amount of data

Large dimensionality

Complex dynamics

Data Noisiness

Computational efficiency

Because we can

Why

Page 6: Machine Learning


How

Numerical analysis

Graphs

Systems theory

Geometry

Statistics

Probability!!

Overview

Page 7: Machine Learning


How

Numerical analysis

Graphs

Systems theory

Geometry

Statistics

Probability!!

Probability and statistics are fundamental: they provide a solid framework for creating models and acquiring knowledge

Overview

Page 8: Machine Learning


Datasets

Most common data used with ML:

Genomes (genes, promoters, phylogeny, regulation...)

Proteomes (secondary/tertiary structure, disorder, motifs, epitopes...)

Clinical Data (drug evaluation, medical protocols, tool design...)

Interactomics (PPI prediction and filtering, complexes...)

Metabolomics (metabolic pathway identification, flux analysis, essentiality)

Overview

Page 9: Machine Learning


Methods

Machine Learning can

Overview

Page 10: Machine Learning


Methods

Machine Learning can

Predict unknown function values

Overview

Page 11: Machine Learning


Methods

Machine Learning can

Predict unknown function values

Infer classes and assign samples

Overview

Page 13: Machine Learning


Methods

Machine Learning can not

Overview

Page 14: Machine Learning


Methods

Machine Learning can not

Provide knowledge

Overview

Page 16: Machine Learning


Methods

Machine Learning can not

Provide knowledge

Learn

Overview

Page 17: Machine Learning


Methods

Is the information in the data? In the model?

Overview

Page 18: Machine Learning


Methods

Work Schema:

Choose a Learning-Validation Setting

Prepare data (Training, Test, Validation sets)

Train (1 or more times)

Validate

Use

Overview

Page 19: Machine Learning


Love all, trust a few, do wrong to none.

4 patients, 4 controls. [Scatter plot: measured values for the patients vs. the controls]

Page 20: Machine Learning


Love all, trust a few, do wrong to none.

2 more samples. [Scatter plot: patients vs. controls with 2 additional samples]

Page 21: Machine Learning


Love all, trust a few, do wrong to none.

10 more samples. [Scatter plot: patients vs. controls with 10 additional samples]

Page 22: Machine Learning


Assessment

Prediction of unknown data!

Problems: Few data, robustness.

Overview

Page 23: Machine Learning


Assessment

Prediction of unknown data!

Problems: Few data, robustness.

Solutions:

Training, Test and Validation sets

Leave one Out

K-fold Cross Validation

Overview

Page 24: Machine Learning


Assessment

50% Training set: used to tune the model parameters

25% Test set: used to verify that the machine has “learnt”

25% Validation set: final assessment of the results

Unfeasible with few data

Overview

Page 25: Machine Learning


Assessment

Leave-one-out:

for each sample Ai

Training set: all the samples - {Ai}; Test set: {Ai}

Repeat

Computationally intensive; good estimate of the mean error, but high variance

Overview

Page 26: Machine Learning


Assessment

K-fold cross validation:

Divide your data into K subsets S1..K

Training set: all the samples - Si

Test set: Si

Repeat

A good compromise between the two approaches

Overview
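To make the scheme concrete, here is a minimal K-fold cross-validation sketch in Python/NumPy (not from the slides); the classifier object model and the arrays X, y are hypothetical placeholders, and setting K equal to the number of samples gives leave-one-out.

# Minimal K-fold cross-validation sketch (illustrative only).
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for K-fold cross-validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)          # shuffle once
    folds = np.array_split(order, k)            # K roughly equal subsets S1..SK
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Hypothetical usage with some classifier `model` exposing fit/predict:
# errors = []
# for train_idx, test_idx in k_fold_indices(len(X), k=5):
#     model.fit(X[train_idx], y[train_idx])
#     errors.append(np.mean(model.predict(X[test_idx]) != y[test_idx]))
# print("CV error:", np.mean(errors))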

Page 27: Machine Learning


Assessment

Sensitivity: TP / (TP + FN). Given the disease is present, the likelihood of testing positive.

Specificity: TN / (TN + FP). Given the disease is not present, the likelihood of testing negative.

Positive Predictive Value: TP / (TP + FP). Given a positive test, the likelihood that the disease is present.

Overview

Page 28: Machine Learning


Assessment

Sensitivity: TP / (TP + FN). Given the disease is present, the likelihood of testing positive.

Specificity: TN / (TN + FP). Given the disease is not present, the likelihood of testing negative.

Positive Predictive Value: TP / (TP + FP). Given a positive test, the likelihood that the disease is present.

The receiver operating characteristic (ROC) is a graphical plot of sensitivity vs. (1 - specificity) for a binary classifier as its discrimination threshold is varied.

Overview

Page 29: Machine Learning


Assessment

Sensitivity: TP / (TP + FN). Given the disease is present, the likelihood of testing positive.

Specificity: TN / (TN + FP). Given the disease is not present, the likelihood of testing negative.

Positive Predictive Value: TP / (TP + FP). Given a positive test, the likelihood that the disease is present.

The receiver operating characteristic (ROC) is a graphical plot of sensitivity vs. (1 - specificity) for a binary classifier as its discrimination threshold is varied.

The area under the ROC curve (AROC, also called AUC) is often used as a single parameter to compare different classifiers
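As an illustration (not part of the slides), the quantities above can be computed directly from confusion-matrix counts, and a ROC curve obtained by sweeping the threshold of a score-based classifier; all numbers below are invented.

# Sensitivity, specificity, PPV from a confusion matrix, plus a brute-force ROC curve.
import numpy as np

tp, fn, tn, fp = 40, 10, 85, 15
sensitivity = tp / (tp + fn)          # P(test positive | disease)
specificity = tn / (tn + fp)          # P(test negative | no disease)
ppv         = tp / (tp + fp)          # P(disease | test positive)
print(sensitivity, specificity, ppv)

# ROC: sweep the decision threshold of a score-based classifier.
scores = np.array([0.9, 0.8, 0.7, 0.55, 0.5, 0.4, 0.3, 0.2])   # classifier outputs
labels = np.array([1,   1,   0,   1,    0,   0,   1,   0])     # true classes
points = []
for t in np.unique(scores):
    pred = scores >= t
    se = np.mean(pred[labels == 1])          # sensitivity at this threshold
    sp = np.mean(~pred[labels == 0])         # specificity at this threshold
    points.append((1 - sp, se))              # one (x, y) point on the ROC curve
print(sorted(points))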

Page 30: Machine Learning


Agenda: Supervised Learning

Overview: Why, How, Datasets, Methods, Assessments

Supervised Learning: SVM, HMM, Decision Trees – RF, Bayesian Networks, Neural Networks

Unsupervised Learning: Clustering, PCA

Caveats: Data Independence, Biases, No free lunch?

Page 31: Machine Learning


Supervised Learning

Basic idea: use the data plus the classification of known samples to find “fingerprints” of the classes in the data

Page 32: Machine Learning


Supervised Learning

Basic idea: use the data plus the classification of known samples to find “fingerprints” of the classes in the data

Example: use microarray data from different conditions

Classes: genes related/unrelated to different cancer types

Page 34: Machine Learning


Support Vector Machines

Basic idea: plot your data in an N-dimensional space

Find the best hyperplane that separates the different classes

Further samples can be classified using the region of the space they belong to

Page 35: Machine Learning


Support Vector Machines

[Figure: scatter plot of weight vs. length; samples labelled Fail and Pass]

Page 36: Machine Learning


Support Vector Machines

[Figure: the same weight vs. length plot with a separating line and its margin between the Fail and Pass samples]

Page 37: Machine Learning


Support Vector Machines

Optimal Hyperplane (OHP): the simplest kind of SVM (a linear SVM, or LSVM) finds the separating hyperplane with the maximum margin; the samples that lie on the margin are the support vectors.

[Figure: weight vs. length plot showing the maximum-margin hyperplane, its margin, and the support vectors separating Fail from Pass]
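A small sketch of the idea using scikit-learn’s SVC with a linear kernel on invented (weight, length) data; the slides do not prescribe any particular library, so treat this as one possible realisation.

# Toy linear SVM on the weight/length example above (data are made up).
import numpy as np
from sklearn.svm import SVC

# Hypothetical (weight, length) measurements, labelled Fail (0) / Pass (1)
X = np.array([[1.0, 2.0], [1.2, 1.8], [1.5, 2.2], [3.0, 4.0], [3.2, 3.8], [2.9, 4.2]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)      # maximum-margin separating hyperplane
print("support vectors:", clf.support_vectors_)  # the samples that define the margin
print("prediction for a new sample:", clf.predict([[2.0, 3.0]]))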

Page 38: Machine Learning


Support Vector Machines

What if the data are not linearly separable?

[Figure: original data, not linearly separable]

Page 39: Machine Learning


Support Vector Machines

What if the data are not linearly separable?

Allow mismatches: soft margins (add a weight matrix)

[Figure: original data with a soft-margin separating line]

Page 40: Machine Learning


Support Vector Machines

What if the data are not linearly separable? Map the data into a higher-dimensional space of derived features (weight², length², weight·length): in the new space a separating hyperplane exists.

[Figure: original data and the mapped data with a separating hyperplane]

Page 41: Machine Learning


Support Vector Machines

What if the data are not linearly separable? The kernel trick!

Only the inner product is needed to compute the dual problem and the decision function, so the explicit feature map can be replaced by a kernel function (kernelization); the separating hyperplane in the feature space corresponds to a hypersurface in the original space.

[Figure: original data, the mapped feature space (weight², length², weight·length) with its separating hyperplane, and the corresponding hypersurface after kernelization]
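A sketch of the kernel trick on invented ring-shaped data: a linear SVM fails in the original coordinates, succeeds after an explicit quadratic feature map, and an RBF-kernel SVM achieves the same without ever building that map (assumes scikit-learn; purely illustrative).

# Kernel trick illustration on two concentric rings (not linearly separable in 2D).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
r = np.concatenate([rng.uniform(0, 1, 50), rng.uniform(2, 3, 50)])   # inner / outer ring radii
theta = rng.uniform(0, 2 * np.pi, 100)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.concatenate([np.zeros(50), np.ones(50)])

linear = SVC(kernel="linear").fit(X, y)
rbf    = SVC(kernel="rbf").fit(X, y)        # kernel k(x, x') replaces the inner product
print("linear accuracy:", linear.score(X, y))
print("rbf accuracy:   ", rbf.score(X, y))

# Explicit mapping to (x1^2, x2^2, x1*x2) makes the data linearly separable again:
Phi = np.column_stack([X[:, 0] ** 2, X[:, 1] ** 2, X[:, 0] * X[:, 1]])
print("mapped-space accuracy:", SVC(kernel="linear").fit(Phi, y).score(Phi, y))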

Page 42: Machine Learning


SVM example

Knowledge-based analysis of microarray gene expression data by using support vector machines. Michael P. S. Brown, William Noble Grundy, David Lin, Nello Cristianini, Charles Walsh Sugnet, Terrence S. Furey, Manuel Ares, Jr., and David Haussler

We introduce a method of functionally classifying genes by using gene expression data from DNA microarray hybridization experiments. The method is based on the theory of support vector machines (SVMs). SVMs are considered a supervised computer learning method because they exploit prior knowledge of gene function to identify unknown genes of similar function from expression data. SVMs avoid several problems associated with unsupervised clustering methods, such as hierarchical clustering and self-organizing maps. SVMs have many mathematical features that make them attractive for gene expression analysis, including their flexibility in choosing a similarity function, sparseness of solution when dealing with large data sets, the ability to handle large feature spaces, and the ability to identify outliers. We test several SVMs that use different similarity metrics, as well as some other supervised learning methods, and find that the SVMs best identify sets of genes with a common function using expression data. Finally, we use SVMs to predict functional roles for uncharacterized yeast ORFs based on their expression data.

To judge overall performance, we define the cost of using the method M as C(M) = fp(M) + 2·fn(M), where fp(M) is the number of false positives for method M, and fn(M) is the number of false negatives for method M. The false negatives are weighted more heavily than the false positives because, for these data, the number of positive examples is small compared with the number of negatives.

Page 43: Machine Learning


Hidden Markov Models

There is a regular and a biased coin.

You don't know which one is being used.

During the game the coins are exchanged with a certain fixed probability

All you know is the output sequence

Page 44: Machine Learning


Hidden Markov Models

There is a regular and a biased coin.

You don't know which one is being used.

During the game the coins are exchanged with a certain fixed probability

All you know is the output sequence

HHTHTHTHTHTTTTHTHHTHHHHHHHHHTHTHTHHTHTHHHHTHTH

Given a set of parameters, what is the probability of the output sequence? Which parameters are most likely to have produced the output? Which coin was being used at a given point in the sequence?
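A minimal forward-algorithm sketch for the coin model answers the first of these questions; the transition and emission probabilities below are invented placeholders, not values given in the lecture.

# Forward algorithm for the fair/biased-coin HMM (illustrative probabilities).
import numpy as np

start = np.array([0.5, 0.5])                  # P(first coin) for [fair, biased]
trans = np.array([[0.9, 0.1],                 # P(next coin | current coin)
                  [0.1, 0.9]])
emit  = {"H": np.array([0.5, 0.8]),           # P(heads | coin)
         "T": np.array([0.5, 0.2])}           # P(tails | coin)

def forward(seq):
    """Return P(observed sequence) under the model (question 1 above)."""
    alpha = start * emit[seq[0]]
    for symbol in seq[1:]:
        alpha = (alpha @ trans) * emit[symbol]
    return alpha.sum()

print(forward("HHTHTHTHTHTTTTHTHHTHHHHHHHHHTHTHTHHTHTHHHHTHTH"))
# Question 2 (best parameters) is typically answered with Baum-Welch (EM),
# question 3 (which coin when) with Viterbi or posterior decoding.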

Page 45: Machine Learning


Hidden Markov Models

Page 46: Machine Learning


Decision trees

Mimics the behavior of an expert

Page 50: Machine Learning


Decision trees

Pros: easy to interpret, statistical analysis, informative results

Cons: one variable at a time, not optimal, not robust

Majority rules!
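For illustration, a tiny decision tree on invented data (feature names and labels are hypothetical); printing the learned rules shows the interpretability listed among the pros.

# Toy decision tree; the learned rules can be read directly.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# hypothetical features: [gene expression level, has known domain (0/1)]
X = np.array([[0.1, 0], [0.2, 1], [0.8, 1], [0.9, 0], [0.7, 1], [0.3, 0]])
y = np.array([0, 0, 1, 1, 1, 0])              # e.g. related / unrelated to a disease

tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["expression", "has_domain"]))
print(tree.predict([[0.6, 1]]))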

Page 51: Machine Learning


Random Forests

Split the data into several subsets and construct a DT for each set

Each DT casts a vote; the majority wins

Much more accurate and robust (bootstrap)
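A sketch of the voting scheme written out by hand (bootstrap resamples, one tree per resample, majority vote); the data and forest size are arbitrary, and in practice one would simply use sklearn.ensemble.RandomForestClassifier.

# Hand-rolled bagging of decision trees with a majority vote (illustrative).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # toy classification rule

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), len(X))     # bootstrap sample (with replacement)
    trees.append(DecisionTreeClassifier(max_features="sqrt").fit(X[idx], y[idx]))

votes = np.array([t.predict(X) for t in trees])      # one row of votes per tree
majority = (votes.mean(axis=0) > 0.5).astype(int)    # majority rules
print("training accuracy of the forest:", (majority == y).mean())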

Page 52: Machine Learning


Random Forests

Split the data into several subsets and construct a DT for each set

Each DT casts a vote; the majority wins

Much more accurate and robust (bootstrap)

Prediction of protein–protein interactions using random decision forest framework

Xue-Wen Chen and Mei Liu

Motivation: Protein interactions are of biological interest because they orchestrate a number of cellular processes such as metabolic pathways and immunological recognition. Domains are the building blocks of proteins; therefore, proteins are assumed to interact as a result of their interacting domains. Many domain-based models for protein interaction prediction have been developed, and preliminary results have demonstrated their feasibility. Most of the existing domain-based methods, however, consider only single-domain pairs (one domain from one protein) and assume independence between domain–domain interactions.

Results: In this paper, we introduce a domain-based random forest of decision trees to infer protein interactions. Our proposed method is capable of exploring all possible domain interactions and making predictions based on all the protein domains. Experimental results on Saccharomyces cerevisiae dataset demonstrate that our approach can predict protein–protein interactions with higher sensitivity (79.78%) and specificity (64.38%) compared with that of the maximum likelihood approach. Furthermore, our model can be used to infer interactions not only for single-domain pairs but also for multiple domain pairs.

Page 53: Machine Learning


Bayesian Networks

The probabilistic approach is extremely powerful, but it requires a huge amount of information/data for a complete representation

Not all correlations or cause-effect relationships between variables are significant

Page 54: Machine Learning


Bayesian Networks

The probabilistic approach is extremely powerful, but it requires a huge amount of information/data for a complete representation

Not all correlations or cause-effect relationships between variables are significant

Consider only meaningful links!

Page 55: Machine Learning


Bayesian Networks

I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes the alarm is set off by minor earthquakes. Is there a burglar?

Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls. Network topology reflects "causal" knowledge:

A burglar can set the alarm off. An earthquake can set the alarm off. The alarm can cause Mary to call. The alarm can cause John to call.

Page 56: Machine Learning


Bayesian Networks

I'm at work, neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes the alarm is set off by minor earthquakes. Is there a burglar?

Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls. Network topology reflects "causal" knowledge:

A burglar can set the alarm off. An earthquake can set the alarm off. The alarm can cause Mary to call. The alarm can cause John to call.

Bayes Theorem again!
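Inference by enumeration on the burglary network can be written in a few lines; the conditional probabilities below are the usual textbook values (Russell & Norvig) and are only illustrative, since the slides do not give numbers.

# P(Burglary | JohnCalls, not MaryCalls) by enumeration (illustrative CPT values).
from itertools import product

P_B = {True: 0.001, False: 0.999}                     # Burglary
P_E = {True: 0.002, False: 0.998}                     # Earthquake
P_A = {(True, True): 0.95, (True, False): 0.94,       # P(Alarm | B, E)
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}                       # P(JohnCalls | Alarm)
P_M = {True: 0.70, False: 0.01}                       # P(MaryCalls | Alarm)

def joint(b, e, a, j, m):
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

# Sum out Earthquake and Alarm; condition on JohnCalls=True, MaryCalls=False.
num = sum(joint(True, e, a, True, False) for e, a in product([True, False], repeat=2))
den = sum(joint(b, e, a, True, False) for b, e, a in product([True, False], repeat=3))
print(num / den)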

Page 57: Machine Learning


Bayesian Networks

We don't know the joint probability distribution; how can we learn it from the data?

Optimize the likelihood, i.e. the probability that the model generated the data:

Maximum likelihood (simplest), maximum posterior, marginal likelihood (hardest)

We don't know which relationships hold between the variables; how can we learn them from the data?

The number of possible graph structures grows super-exponentially, so enumeration is impossible: heuristics, random sampling, Monte Carlo

Does the independence assumption hold? Is the correlation informative? (BIC, Occam's razor, AIC)

Page 58: Machine Learning


Neural Networks

Neural networks interpolate functions. They have nothing to do with brains

Page 61: Machine Learning


Neural Networks

Parameter settings:

Page 62: Machine Learning


Neural Networks

Parameter settings: avoid overfitting

Page 63: Machine Learning


Neural Networks

Parameter settings: avoid overfitting

Learning --> validation --> usage. No underlying model, but it often works
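To make the "interpolate functions" point concrete, here is a one-hidden-layer network trained by plain gradient descent to fit a sine curve; the architecture, learning rate and data are arbitrary choices, not something prescribed in the lecture.

# Minimal feed-forward network with backpropagation (fits y = sin(x)).
import numpy as np

rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(X)                                  # function to interpolate

W1 = rng.normal(scale=0.5, size=(1, 20)); b1 = np.zeros(20)
W2 = rng.normal(scale=0.5, size=(20, 1)); b2 = np.zeros(1)

lr = 0.05
for _ in range(5000):
    h = np.tanh(X @ W1 + b1)                   # hidden layer
    out = h @ W2 + b2                          # linear output
    err = out - y
    # backpropagation of the squared-error gradient
    dW2 = h.T @ err / len(X);  db2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1 - h ** 2)
    dW1 = X.T @ dh / len(X);   db1 = dh.mean(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2

print("mean squared error:", float((err ** 2).mean()))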

Page 64: Machine Learning


Neural Networks

Protein Disorder Prediction: Implications for Structural Proteomics. Rune Linding, Lars Juhl Jensen, Francesca Diella, Peer Bork, Toby J. Gibson, and Robert B. Russell

Abstract

A great challenge in the proteomics and structural genomics era is to predict protein structure and function, including identification of those proteins that are partially or wholly unstructured. Disordered regions in proteins often contain short linear peptide motifs (e.g., SH3 ligands and targeting signals) that are important for protein function. We present here DisEMBL, a computational tool for prediction of disordered/unstructured regions within a protein sequence. As no clear definition of disorder exists, we have developed parameters based on several alternative definitions and introduced a new one based on the concept of “hot loops,” i.e., coils with high temperature factors. Avoiding potentially disordered segments in protein expression constructs can increase expression, foldability, and stability of the expressed protein. DisEMBL is thus useful for target selection and the design of constructs as needed for many biochemical studies, particularly structural biology and structural genomics projects. The tool is freely available via a web interface (http://dis.embl.de) and can be downloaded for use in large-scale studies.

Page 65: Machine Learning


Agenda: Unsupervised Learning

Overview: Why, How, Datasets, Methods, Assessments

Supervised Learning: SVM, HMM, Decision Trees – RF, Bayesian Networks, Neural Networks

Unsupervised Learning: Clustering, PCA

Caveats: Data Independence, Biases, No free lunch?

Page 66: Machine Learning


Unsupervised Learning

If we have no idea of the actual classification of the data, we can try to guess it

Page 67: Machine Learning


Clustering

Put together similar objects to define classes

Page 69: Machine Learning


Clustering

K-means, hierarchical top-down, hierarchical bottom-up, fuzzy

Put together similar objects to define classes

How?

Page 70: Machine Learning


Clustering

Euclidean, correlation, Spearman rank, Manhattan

Put together similar objects to define classes

Which metric?How?

Page 71: Machine Learning


Clustering

Put together similar objects to define classes

Which metric? Which “shape”?

Compact, concave, outliers, inner radius, cluster separation

How?

Page 72: Machine Learning


Hierarchical Clustering

• We start with every data point in a separate cluster
• We keep merging the most similar pairs of data points/clusters until we have one big cluster left
• This is called a bottom-up or agglomerative method
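A direct, naive implementation of these three bullet points (single linkage, invented points; a real analysis would use an optimised library routine).

# Naive single-linkage agglomerative clustering on toy 2D points.
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(8, 2))
clusters = [[i] for i in range(len(points))]        # every point starts alone

def cluster_distance(a, b):
    """Single linkage: distance between the closest pair of members."""
    return min(np.linalg.norm(points[i] - points[j]) for i in a for j in b)

while len(clusters) > 1:
    # find the most similar pair of clusters ...
    pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    i, j = min(pairs, key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]))
    print("merging", clusters[i], "and", clusters[j])
    # ... and merge them, until one big cluster is left
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]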

Page 80: Machine Learning


K-means

• Start with K random centers

• Assign each sample to the closest center

• Recompute centers (samples average)

• Repeat until converged
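The loop above written out in NumPy as an illustrative sketch; the two Gaussian blobs and the choice K = 2 are arbitrary.

# K-means: assign to nearest center, recompute centers, repeat until converged.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
K = 2

centers = X[rng.choice(len(X), K, replace=False)]      # start with K random centers
for _ in range(100):
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)                           # assign to the closest center
    new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    if np.allclose(new_centers, centers):               # repeat until converged
        break
    centers = new_centers

print(centers)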

Page 108: Machine Learning


PCA

Multidimensional data (hard to visualize)

Data variability is not equally distributed

Page 109: Machine Learning


PCA

Multidimensional data (hard to visualize)

Data variability is not equally distributed

Correlation between variables

Change coordinate system, remove correlation

Retain only the most variable coordinates

How: (generalized eigenvectors, SVD)

Pro: noise (and information) reduction
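A sketch of PCA via the SVD, as mentioned above, on invented correlated data.

# PCA by singular value decomposition: center, decompose, keep top components.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = 2 * X[:, 0] + 0.1 * rng.normal(size=100)    # introduce correlation

Xc = X - X.mean(axis=0)                                # center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s ** 2 / (s ** 2).sum()                    # variance explained per component
print("explained variance ratio:", np.round(explained, 3))

scores = Xc @ Vt[:2].T                                 # keep the 2 most variable coordinates
print("reduced data shape:", scores.shape)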

Page 110: Machine Learning


Agenda: Caveats

Overview: Why, How, Datasets, Methods, Assessments

Supervised Learning: SVM, HMM, Decision Trees – RF, Bayesian Networks, Neural Networks

Unsupervised Learning: Clustering, PCA

Caveats: Data Independence, Biases, No free lunch?

Page 111: Machine Learning


Data independence

Training set, Test set and Validation set must be clearly separated

E.g. neural network to infer gene function from sequence

Training set: annotated gene sequences, deposit date before Jan 2007
Test set: annotated gene sequences, deposit date after Jan 2007

But annotation of new sequences is often inferred from old sequences!

Caveats

Page 112: Machine Learning


Biases

Data should be unbiased, i.e. it should be a good sample of our “space”

E.g. neural network to find disordered regions. Training set: solved structures, residues in SEQRES but not in ATOM

But solved structures are typically small, globular, cytoplasmic proteins

Caveats

Page 113: Machine Learning


Take-home message

Always look at the data. ML methods are extremely error-prone

Use probability and statistics where possible

In this order: Model, Data, Validation, Algorithm

Be careful with biases, redundancy, hidden variables

Occam's Razor: simpler is better

Be careful with overfitting and overparametrizing

Common sense is a powerful tool (but don't abuse it)

Caveats

Page 114: Machine Learning


References

• Needham CJ, Bradford JR, Bulpitt AJ, Westhead DR (2007) A Primer on Learning in Bayesian Networks for Computational Biology. PLoS Comput Biol 3(8): e129. doi:10.1371/journal.pcbi.0030129

• Tarca AL, Carey VJ, Chen X-w, Romero R, Drăghici S (2007) Machine Learning and Its Applications to Biology. PLoS Comput Biol 3(6): e116. doi:10.1371/journal.pcbi.0030116

• Sean R Eddy (2004) What is a hidden Markov model? Nature Biotechnology 22, 1315 - 1316 (2004) doi:10.1038/nbt1004-1315

• http://see.stanford.edu/see/courseinfo.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1

Page 115: Machine Learning


Bayes Theorem (Supplementary)

a) AIDS affects 0.01% of the population.
b) The AIDS test, when performed on infected patients, is correct 99.9% of the time.
c) The AIDS test, when performed on uninfected people, is correct 99.99% of the time.

If a person has a positive test, how likely is it for him to be infected?


Page 116: Machine Learning


Bayes Theorem (Supplementary)

a) AIDS affects 0.01% of the population.
b) The AIDS test, when performed on infected patients, is correct 99.9% of the time.
c) The AIDS test, when performed on uninfected people, is correct 99.99% of the time.

If a person has a positive test, how likely is it for him to be infected?

P(A|T) = P(T|A)·P(A) / [P(T|A)·P(A) + P(T|¬A)·P(¬A)]

P(A|T) = 49.97%
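The same calculation spelled out numerically (a tiny Python sketch using the prevalence and test accuracies stated above).

# Bayes theorem applied to the AIDS-test example.
p_A      = 0.0001        # P(infected) = 0.01%
p_T_A    = 0.999         # P(test positive | infected)
p_T_notA = 1 - 0.9999    # P(test positive | not infected)

p_A_T = p_T_A * p_A / (p_T_A * p_A + p_T_notA * (1 - p_A))
print(p_A_T)             # roughly 0.5: despite the accurate test, about a coin flip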
