Machine Learning Methods: an overview
Master in Bioinformatica – April 9th, 2010
Paolo Marcatili
University of Rome “Sapienza”, Dept. of Biochemical Sciences “Rossi Fanelli”
Methods
Where is the information: in the data? In the model?
Methods
Work schema:
Choose a learning/validation setting
Prepare the data (training, test, validation sets)
Train (one or more times)
Validate
Use
Love all, trust a few, do wrong to none.
[Figure: scatter plots of patients vs. controls along two measured variables, shown first with 4 patients and 4 controls, then with 2 more samples, then with 10 more.]
Assessment
Goal: prediction of unknown data!
Problems: few data, robustness.
Solutions:
Training, test and validation sets
Leave-one-out
K-fold cross-validation
Assessment
50% Training set: used to tune the model parameters
25% Test set: used to verify that the machine has “learnt”
25% Validation set: final assessment of the results
Unfeasible with few data
Overview
Assessment
Leave-one-out:
for each sample Ai:
Training set: all the samples - {Ai}
Test set: {Ai}
Repeat
Computationally intensive; a good estimate of the mean error, but high variance.
Assessment
K-fold cross-validation:
Divide your data into K subsets S1..SK
Training set: all the samples - Si
Test set: Si
Repeat for each i
A good compromise.
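Both validation schemes can be written in a few lines. A sketch under assumed toy data (two 1D Gaussian classes) with a simple nearest-class-mean rule standing in for the model being validated:

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(0, 1, 20), rng.normal(3, 1, 20)])
y = np.array([0] * 20 + [1] * 20)
n = len(y)

def fold_error(train_idx, test_idx):
    # Train a nearest-class-mean classifier on the training fold,
    # return its error rate on the test fold.
    m0 = X[train_idx][y[train_idx] == 0].mean()
    m1 = X[train_idx][y[train_idx] == 1].mean()
    pred = (np.abs(X[test_idx] - m1) < np.abs(X[test_idx] - m0)).astype(int)
    return float((pred != y[test_idx]).mean())

# Leave-one-out: each sample Ai is the test set exactly once.
loo = float(np.mean([fold_error(np.delete(np.arange(n), i), np.array([i]))
                     for i in range(n)]))

# K-fold: K disjoint subsets S_i, each used once as the test set.
K = 5
folds = np.array_split(rng.permutation(n), K)
kfold = float(np.mean([fold_error(np.setdiff1d(np.arange(n), f), f)
                       for f in folds]))
```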
Assessment
Sensitivity: TP / [TP + FN]. Given the disease is present, the likelihood of testing positive.
Specificity: TN / [TN + FP]. Given the disease is not present, the likelihood of testing negative.
Positive predictive value: TP / [TP + FP]. Given a positive test, the likelihood that the disease is present.
The receiver operating characteristic (ROC) is a graphical plot of sensitivity vs. (1 - specificity) for a binary classifier as its discrimination threshold is varied.
The area under the ROC curve (AUROC) is often used as a parameter to compare different classifiers.
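These quantities are direct to compute. The confusion counts and classifier scores below are made up for illustration; the AUROC is computed via its rank interpretation (the probability that a random positive scores above a random negative), which is equivalent to the area under the threshold-sweep curve.

```python
import numpy as np

# Confusion counts for a hypothetical diagnostic test.
TP, FN, TN, FP = 80, 20, 90, 10
sensitivity = TP / (TP + FN)   # P(test positive | disease present)
specificity = TN / (TN + FP)   # P(test negative | disease absent)
ppv = TP / (TP + FP)           # P(disease present | test positive)

# ROC: sweeping the decision threshold over classifier scores gives one
# (1 - specificity, sensitivity) point per threshold.
rng = np.random.default_rng(2)
scores = np.concatenate([rng.normal(0, 1, 500), rng.normal(2, 1, 500)])
labels = np.array([0] * 500 + [1] * 500)
pos, neg = scores[labels == 1], scores[labels == 0]

# AUROC = probability that a random positive sample scores higher
# than a random negative one (Mann-Whitney statistic).
auroc = float((pos[:, None] > neg[None, :]).mean())
```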
Agenda
Overview: Why, How, Datasets, Methods, Assessments
Supervised Learning: SVM, HMM, Decision Trees / Random Forests, Bayesian Networks, Neural Networks
Unsupervised Learning: Clustering, PCA
Caveats: Data Independence, Biases, No free lunch?
Supervised Learning
Basic idea: use the data plus the classification of known samples to find “fingerprints” of the classes in the data.
Example: use microarray data from different conditions; the classes are genes related/unrelated to different cancer types.
Support Vector Machines
Basic idea: plot your data in an N-dimensional space.
Find the best hyperplane that separates the different classes.
Further samples can be classified using the region of the space they belong to.
[Figure: Fail/Pass samples plotted by length and weight. Among the separating hyperplanes, the Optimal Hyperplane (OHP) is the one with maximum margin; the samples lying on the margin are the support vectors. An SVM with a linear decision boundary is the simplest kind of SVM (called an LSVM).]
Support Vector Machines
What if data are not linearly separable?
Allow mismatches: soft margins (add a weight matrix).
Or map the data to a higher-dimensional space, e.g. (weight², length², weight · length), where the classes become separable by a hyperplane; in the original space this corresponds to a hypersurface.
The kernel trick: only the inner product is needed to calculate the dual problem and the decision function, so the mapping never has to be carried out explicitly (kernelization).
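The kernel trick can be checked numerically. A sketch using the degree-2 polynomial kernel and the (weight², length², weight·length) map from the slide (the data and the circular Fail/Pass rule are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

def phi(x):
    # Explicit degree-2 feature map for a 2D sample (weight, length):
    # (weight^2, length^2, sqrt(2) * weight * length).
    w, l = x
    return np.array([w * w, l * l, np.sqrt(2.0) * w * l])

def k(x, z):
    # Homogeneous polynomial kernel of degree 2: evaluated in the
    # original 2D space, yet equal to phi(x) . phi(z) in 3D.
    return float(x @ z) ** 2

x, z = rng.normal(size=2), rng.normal(size=2)
kernel_value = k(x, z)
mapped_value = float(phi(x) @ phi(z))

# A "Fail inside / Pass outside the unit circle" pattern is not linearly
# separable in 2D, but after mapping, w^2 + l^2 > 1 is a hyperplane.
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)
Z = np.array([phi(p) for p in X])
pred = (Z[:, 0] + Z[:, 1] > 1.0).astype(int)
accuracy = float((pred == y).mean())
```

In a real SVM the map phi is never evaluated; only k appears in the dual problem.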
SVM example
Knowledge-based analysis of microarray gene expression data by using support vector machines. Michael P. S. Brown, William Noble Grundy, David Lin, Nello Cristianini, Charles Walsh Sugnet, Terrence S. Furey, Manuel Ares, Jr., and David Haussler.
We introduce a method of functionally classifying genes by using gene expression data from DNA microarray hybridization experiments. The method is based on the theory of support vector machines (SVMs). SVMs are considered a supervised computer learning method because they exploit prior knowledge of gene function to identify unknown genes of similar function from expression data. SVMs avoid several problems associated with unsupervised clustering methods, such as hierarchical clustering and self-organizing maps. SVMs have many mathematical features that make them attractive for gene expression analysis, including their flexibility in choosing a similarity function, sparseness of solution when dealing with large data sets, the ability to handle large feature spaces, and the ability to identify outliers. We test several SVMs that use different similarity metrics, as well as some other supervised learning methods, and find that the SVMs best identify sets of genes with a common function using expression data. Finally, we use SVMs to predict functional roles for uncharacterized yeast ORFs based on their expression data.
To judge overall performance, we define the cost of using the method M as C(M) = fp(M) + 2·fn(M), where fp(M) is the number of false positives for method M, and fn(M) is the number of false negatives for method M. The false negatives are weighted more heavily than the false positives because, for these data, the number of positive examples is small compared with the number of negatives.
Hidden Markov Models
There is a regular and a biased coin.
You don't know which one is being used.
During the game the coins are exchanged with a certain fixed probability.
All you know is the output sequence:
HHTHTHTHTHTTTTHTHHTHHHHHHHHHTHTHTHHTHTHHHHTHTH
Given a set of parameters, what is the probability of the output sequence? Which parameters are most likely to have produced the output? Which coin was being used at a certain point of the sequence?
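The first question (probability of the output given the parameters) is answered by the forward algorithm. A sketch for the two-coin model; the bias, switch probability and starting distribution are assumed numbers, not values from the slides:

```python
# Two hidden states: fair coin (F) and biased coin (B).
states = ["F", "B"]
emit = {"F": {"H": 0.5, "T": 0.5},
        "B": {"H": 0.9, "T": 0.1}}     # assumed bias towards heads
trans = {"F": {"F": 0.9, "B": 0.1},    # assumed switch probability
         "B": {"F": 0.1, "B": 0.9}}
start = {"F": 0.5, "B": 0.5}

def forward(seq):
    """P(output sequence | model), summing over all hidden coin paths."""
    alpha = {s: start[s] * emit[s][seq[0]] for s in states}
    for o in seq[1:]:
        alpha = {s: emit[s][o] * sum(alpha[r] * trans[r][s] for r in states)
                 for s in states}
    return sum(alpha.values())

p = forward("HHTHTHTHTH")
```

The same machinery (Viterbi, Baum-Welch) answers the other two questions: the most likely hidden path and the most likely parameters.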
Decision Trees
Mimic the behavior of an expert.
Pros: easy to interpret; statistical analysis; informative results.
Cons: each split uses a single variable; not optimal; not robust.
Majority rules!
Random Forests
Split the data into several subsets and construct a decision tree for each set.
Each tree expresses a vote; the majority wins.
Much more accurate and robust (bootstrap).
Prediction of protein–protein interactions using random decision forest framework
Xue-Wen Chen * and Mei Liu
Motivation: Protein interactions are of biological interest because they orchestrate a number of cellular processes such as metabolic pathways and immunological recognition. Domains are the building blocks of proteins; therefore, proteins are assumed to interact as a result of their interacting domains. Many domain-based models for protein interaction prediction have been developed, and preliminary results have demonstrated their feasibility. Most of the existing domain-based methods, however, consider only single-domain pairs (one domain from one protein) and assume independence between domain–domain interactions.
Results: In this paper, we introduce a domain-based random forest of decision trees to infer protein interactions. Our proposed method is capable of exploring all possible domain interactions and making predictions based on all the protein domains. Experimental results on Saccharomyces cerevisiae dataset demonstrate that our approach can predict protein–protein interactions with higher sensitivity (79.78%) and specificity (64.38%) compared with that of the maximum likelihood approach. Furthermore, our model can be used to infer interactions not only for single-domain pairs but also for multiple domain pairs.
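The bootstrap-and-vote idea can be sketched with depth-1 trees (decision stumps) as the individual voters; the two-feature toy data, where only column 0 is informative, is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data: column 0 separates the classes, column 1 is pure noise.
X = np.column_stack([
    np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)]),
    rng.normal(0, 1, 100),
])
y = np.array([0] * 50 + [1] * 50)

def fit_stump(Xb, yb):
    """Depth-1 decision tree: best (feature, threshold, direction)."""
    best_acc, best = -1.0, None
    for j in range(Xb.shape[1]):
        for t in Xb[:, j]:
            for sign in (1, -1):
                pred = (Xb[:, j] > t).astype(int)
                if sign == -1:
                    pred = 1 - pred
                acc = float((pred == yb).mean())
                if acc > best_acc:
                    best_acc, best = acc, (j, t, sign)
    return best

def stump_predict(stump, Xs):
    j, t, sign = stump
    pred = (Xs[:, j] > t).astype(int)
    return pred if sign == 1 else 1 - pred

# Forest: each tree is trained on a bootstrap resample of the data;
# at prediction time every tree votes and the majority wins.
forest = [fit_stump(X[idx], y[idx])
          for idx in (rng.integers(0, len(y), len(y)) for _ in range(25))]
votes = np.mean([stump_predict(s, X) for s in forest], axis=0)
accuracy = float(((votes > 0.5).astype(int) == y).mean())
```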
Bayesian Networks
The probabilistic approach is extremely powerful, but it requires a huge amount of information/data for a complete representation.
Not all correlations or cause-effect relationships between variables are significant.
Consider only meaningful links!
Example: I'm at work; neighbor John calls to say my alarm is ringing, but neighbor Mary doesn't call. Sometimes the alarm is set off by minor earthquakes. Is there a burglar?
Variables: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls.
Network topology reflects "causal" knowledge:
A burglar can set the alarm off
An earthquake can set the alarm off
The alarm can cause Mary to call
The alarm can cause John to call
Bayes' theorem again!
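Given the network topology, the query can be answered by enumeration: the joint probability factorizes along the graph, and the hidden variables (Earthquake, Alarm) are summed out. The conditional probability tables below are illustrative numbers, not values from the slides:

```python
import itertools

# Conditional probability tables (illustrative numbers).
P_B = {True: 0.001, False: 0.999}                     # P(Burglary)
P_E = {True: 0.002, False: 0.998}                     # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}    # P(Alarm | B, E)
P_J = {True: 0.90, False: 0.05}                       # P(JohnCalls | Alarm)
P_M = {True: 0.70, False: 0.01}                       # P(MaryCalls | Alarm)

def joint(b, e, a, j, m):
    # The network topology gives the factorization of the joint.
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

# P(Burglary | JohnCalls = true, MaryCalls = false), by enumeration.
num = sum(joint(True, e, a, True, False)
          for e, a in itertools.product([True, False], repeat=2))
den = sum(joint(b, e, a, True, False)
          for b, e, a in itertools.product([True, False], repeat=3))
posterior = num / den
```

John calling raises the burglary probability above its prior, but only slightly, because Mary's silence is evidence against the alarm.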
Bayesian Networks
We don't know the joint probability distribution; how can we learn it from the data?
Optimize the likelihood, i.e. the probability that the model generated the data:
Maximum likelihood (simplest)
Maximum posterior
Marginal likelihood (hardest)
We don't know which relationships hold between the variables; how can we learn them from the data?
The number of possible graphs grows super-exponentially, so enumeration is impossible: heuristics, random sampling, Monte Carlo.
Does the independence assumption hold? Is the correlation informative? (BIC, Occam's razor, AIC)
Neural Networks
Neural networks interpolate functions. They have nothing to do with brains.
Parameter settings: avoid overfitting.
Learning --> validation --> usage. No underlying model, but it often works.
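"Neural networks interpolate functions" can be shown directly: a minimal one-hidden-layer network fitted by plain gradient descent to noisy samples of sin(x). All choices (architecture, learning rate, target function) are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)

# Noisy samples of the function to interpolate.
X = np.linspace(-np.pi, np.pi, 60)[:, None]
y = np.sin(X) + rng.normal(0, 0.05, X.shape)

H = 12                                   # hidden units
W1 = rng.normal(0, 0.5, (1, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.5, (H, 1)); b2 = np.zeros(1)
lr = 0.1

for _ in range(5000):
    h = np.tanh(X @ W1 + b1)             # forward pass
    out = h @ W2 + b2
    err = out - y                        # dLoss/dout for MSE
    gW2 = h.T @ err / len(X); gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1 - h ** 2)     # backprop through tanh
    gW1 = X.T @ dh / len(X); gb1 = dh.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1       # gradient descent step
    W2 -= lr * gW2; b2 -= lr * gb2

mse = float(((np.tanh(X @ W1 + b1) @ W2 + b2 - y) ** 2).mean())
```

Training longer keeps lowering the training error, which is exactly where overfitting comes from: the held-out validation error is what must be watched.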
Neural Networks
Protein Disorder Prediction: Implications for Structural Proteomics. Rune Linding, Lars Juhl Jensen, Francesca Diella, Peer Bork, Toby J. Gibson, and Robert B. Russell.
Abstract
A great challenge in the proteomics and structural genomics era is to predict protein structure and function, including identification of those proteins that are partially or wholly unstructured. Disordered regions in proteins often contain short linear peptide motifs (e.g., SH3 ligands and targeting signals) that are important for protein function. We present here DisEMBL, a computational tool for prediction of disordered/unstructured regions within a protein sequence. As no clear definition of disorder exists, we have developed parameters based on several alternative definitions and introduced a new one based on the concept of “hot loops,” i.e., coils with high temperature factors. Avoiding potentially disordered segments in protein expression constructs can increase expression, foldability, and stability of the expressed protein. DisEMBL is thus useful for target selection and the design of constructs as needed for many biochemical studies, particularly structural biology and structural genomics projects. The tool is freely available via a web interface (http://dis.embl.de) and can be downloaded for use in large-scale studies.
Agenda: Unsupervised Learning
Unsupervised Learning
If we have no idea of the actual data classification, we can try to guess it.
Clustering
Put together similar objects to define classes.
Hierarchical Clustering
• We start with every data point in a separate cluster
• We keep merging the most similar pairs of data points/clusters until we have one big cluster left
• This is called a bottom-up or agglomerative method
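The agglomerative loop is short enough to write out. A sketch with single linkage (cluster distance = closest pair of points) on assumed toy data, stopping when two clusters remain:

```python
import numpy as np

rng = np.random.default_rng(6)
# Two well-separated groups of 5 points each (indices 0-4 and 5-9).
pts = np.vstack([rng.normal(0, 0.3, (5, 2)), rng.normal(5, 0.3, (5, 2))])

# Bottom-up: start with singleton clusters, repeatedly merge the
# closest pair (single linkage) until two clusters remain.
clusters = [[i] for i in range(len(pts))]
while len(clusters) > 2:
    best = None
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            d = min(float(np.linalg.norm(pts[i] - pts[j]))
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
    _, a, b = best
    merged = clusters[a] + clusters[b]
    clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
    clusters.append(merged)

two_clusters = sorted(sorted(c) for c in clusters)
```

Recording every merge distance instead of stopping early yields the full dendrogram.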
K-means
• Start with K random centers
• Assign each sample to the closest center
• Recompute centers (samples average)
• Repeat until converged
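The four steps map one-to-one onto code (Lloyd's algorithm); the two-blob toy data are an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
# Two blobs of 50 samples each.
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])

K = 2
# Start with K random centers (here: K distinct data points).
centers = X[rng.choice(len(X), K, replace=False)]
for _ in range(100):
    # Assign each sample to the closest center.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recompute each center as the average of its samples.
    new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
    # Repeat until converged.
    if np.allclose(new_centers, centers):
        break
    centers = new_centers
```

In practice the algorithm is restarted from several random initializations, since it only finds a local optimum.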
PCA
Multidimensional data (hard to visualize).
Data variability is not equally distributed.
Correlation between variables.
Change the coordinate system to remove correlation; retain only the most variable coordinates.
How: generalized eigenvectors, SVD.
Pro: noise (and information) reduction.
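A sketch of the SVD route on assumed correlated 2D data: center, decompose, read off principal directions and explained variances, keep the top coordinate.

```python
import numpy as np

rng = np.random.default_rng(8)

# Correlated 2D data: most variability lies along one direction.
t = rng.normal(0, 2, 300)
X = np.column_stack([t + rng.normal(0, 0.3, 300),
                     t + rng.normal(0, 0.3, 300)])

Xc = X - X.mean(axis=0)                 # center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Rows of Vt are the principal directions; s**2/(n-1) their variances.
explained = s ** 2 / (len(X) - 1)
ratio = explained / explained.sum()

# Change of coordinates: the scores are uncorrelated, sorted by variance.
scores = Xc @ Vt.T
reduced = scores[:, :1]                 # retain only the top coordinate
```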
Agenda: Caveats
Data independence
Training set, test set and validation set must be clearly separated.
E.g., a neural network to infer gene function from sequence:
training set: annotated gene sequences with deposit date before Jan 2007
test set: annotated gene sequences with deposit date after Jan 2007
But the annotation of new sequences is often inferred from that of old sequences!
Biases
Data should be unbiased, i.e. a good sample of our "space".
E.g., a neural network to find disordered regions; training set: solved structures, residues present in SEQRES but not in ATOM.
But solved structures are typically small, globular, cytoplasmic proteins!
Take-home message
Always look at the data: ML methods are extremely error-prone.
Use probability and statistics where possible.
In this order: Model, Data, Validation, Algorithm.
Be careful with biases, redundancy, hidden variables.
Occam's Razor: simpler is better.
Be careful with overfitting and over-parameterization.
Common sense is a powerful tool (but don't abuse it)
References
• Needham CJ, Bradford JR, Bulpitt AJ, Westhead DR (2007) A Primer on Learning in Bayesian Networks for Computational Biology. PLoS Comput Biol 3(8): e129. doi:10.1371/journal.pcbi.0030129
• Tarca AL, Carey VJ, Chen X-w, Romero R, Drăghici S (2007) Machine Learning and Its Applications to Biology. PLoS Comput Biol 3(6): e116. doi:10.1371/journal.pcbi.0030116
• Sean R Eddy (2004) What is a hidden Markov model? Nature Biotechnology 22, 1315 - 1316 (2004) doi:10.1038/nbt1004-1315
Bayes' Theorem (Supplementary)
a) AIDS affects 0.01% of the population.
b) The AIDS test, when performed on infected patients, is correct 99.9% of the time.
c) The AIDS test, when performed on uninfected people, is correct 99.99% of the time.
If a person tests positive, how likely is it that he is infected?
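Plugging the slide's numbers into Bayes' theorem:

```python
# Bayes' theorem applied to the numbers on the slide.
p_inf = 0.0001          # prevalence: 0.01% of the population
sens = 0.999            # P(test positive | infected)
spec = 0.9999           # P(test negative | not infected)

# Total probability of a positive test.
p_pos = sens * p_inf + (1 - spec) * (1 - p_inf)

# Bayes' theorem: P(infected | positive test).
p_inf_given_pos = sens * p_inf / p_pos
```

Despite the very accurate test, the posterior is only about 50%: the rare true positives are matched by false positives from the huge uninfected pool.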