
Lesson 8: Machine Learning (and the Legionella as a case study) Biological Sequences Analysis, MTA.

Dec 20, 2015

Page 1: Lesson 8: Machine Learning (and the Legionella as a case study)

Biological Sequences Analysis, MTA

Page 2: Introduction to Machine Learning

Page 3: Some cool examples

Page 4: Types of learning

Supervised learning - learns from "labeled" examples of inputs paired with the desired output.

Unsupervised learning - models a set of inputs; labeled examples are not available.

Reinforcement learning - learns from feedback on its actions, obtained by observing the environment (maximizing long-term reward).

Page 5: Clustering

Page 6: Clustering definition

Input: a set of instances.

Output: subsets (called clusters) such that observations in the same cluster are similar.

Is it supervised or not?

What does similar mean?
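One possible way to make "similar" concrete is to pick a distance function and call two instances similar when the distance is small. The sketch below is only an illustration, not something taken from the slides: it assumes numeric feature vectors for the first measure and equal-length sequences for the second, and the function names `euclidean` and `hamming` are made up for this example.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two numeric feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def hamming(s, t):
    """Number of mismatching positions between two equal-length sequences."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(s, t))

# Illustrative inputs (not from the slides)
print(euclidean((0, 0), (3, 4)))   # distance between two 2-D points
print(hamming("ACGT", "ACCT"))     # mismatches between two short sequences
```

Which measure is appropriate depends on the data; the choice of measure changes which instances end up in the same cluster.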

Pages 7-9: K-means clustering

0. Choose number of clusters (k)

1. Initialization: randomly generate k centers

2. Assignment of each point to the nearest cluster center

3. Update location of centers

4. Repeat 2-3 until no further change

K-means - Interactive demo
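The steps above can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions: points are tuples of numbers, distance is squared Euclidean, and the initial centers are k randomly chosen data points (one common variant of "randomly generate k centers"). The names `kmeans` and `sq_dist` and the parameters `seed` and `max_iter` are illustrative, not from the slides.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, seed=0, max_iter=100):
    """Return (centers, assignment) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Step 1: initialization (assumption: pick k distinct data points)
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        # Step 4: stop when no assignment changed
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: move each center to the mean of its assigned points
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment

# Illustrative data: two well-separated groups of three points each
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, k=2)
print(labels)
```

The result depends on the initial centers, so in practice the procedure is often run several times with different seeds.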

Page 10: Other clustering algorithms

Take into account:

homogeneity: similarity of instances inside a cluster.

separation: dissimilarity of instances of different clusters.

Allow "fuzzy clustering": an instance may belong to more than one cluster.

Hierarchical clustering
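Homogeneity and separation can be given rough numeric form. The sketch below is one assumed way to do it, not a definition from the slides: it measures both through average Euclidean distance, so a low within-cluster value means high homogeneity and a high between-cluster value means good separation. The function names are illustrative, and every cluster is assumed to hold at least two points.

```python
import math
from itertools import combinations

def mean_distance(pairs):
    """Average Euclidean distance over an iterable of point pairs."""
    pairs = list(pairs)
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def within_cluster_distance(clusters):
    """Average distance between points that share a cluster (lower = more homogeneous)."""
    return mean_distance(
        pair for cluster in clusters for pair in combinations(cluster, 2)
    )

def between_cluster_distance(clusters):
    """Average distance between points in different clusters (higher = better separated)."""
    return mean_distance(
        (p, q) for a, b in combinations(clusters, 2) for p in a for q in b
    )

# Illustrative clustering of four points into two clusters
clusters = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
print(within_cluster_distance(clusters))
print(between_cluster_distance(clusters))
```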

Page 11: Hierarchical clustering

[Diagram: raw table (rows 1-5, columns C1, C2, C3, C4, C5, C6, ..) -> similarity criterion -> similarity matrix (scores) -> cluster criterion -> hierarchical clustering]

Page 12: Hierarchical clustering

UPGMA (you should already know it…)

Neighbor-joining

[Diagram: raw table (rows 1-5, columns C1, C2, C3, C4, C5, C6, ..) -> similarity criterion -> scores -> cluster criterion; the items A, B, C, D, E are merged step by step: (C,B), then ((C,B),E)]
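The merging shown on this slide can be sketched with a small UPGMA (average-linkage) routine. This is a hedged illustration: the function name `upgma`, the dictionary layout, and all distance values are assumptions made for the example. The distances were chosen so that B and C merge first and E joins them next, echoing the order on the slide.

```python
from itertools import combinations

def upgma(names, dist):
    """Average-linkage clustering.

    names: leaf labels; dist: dict mapping frozenset({a, b}) -> distance.
    Returns the tree as nested tuples.
    """
    sizes = {name: 1 for name in names}   # active clusters and their leaf counts
    d = dict(dist)
    while len(sizes) > 1:
        # pick the closest pair of active clusters
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        na, nb = sizes.pop(a), sizes.pop(b)
        # distance to the new cluster = size-weighted average of the old distances
        for other in sizes:
            d[frozenset((merged, other))] = (
                na * d[frozenset((a, other))] + nb * d[frozenset((b, other))]
            ) / (na + nb)
        sizes[merged] = na + nb
    return next(iter(sizes))

# Made-up distances: every pair is 10 apart unless listed in `close`
names = ["A", "B", "C", "D", "E"]
close = {("B", "C"): 2, ("B", "E"): 4, ("C", "E"): 4, ("A", "D"): 5}
dist = {frozenset(p): 10 for p in combinations(names, 2)}
dist.update({frozenset(p): v for p, v in close.items()})

print(upgma(names, dist))
```

With these made-up values, B and C are joined first, E is attached to that pair, and A and D form the other branch.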

Page 13: Hierarchical clustering

Wait a minute… A tree is clustering?!

Page 14: Classifying

Page 15: What is classification

Input: labeled training set and unlabeled data set.

Learn to classify (assign labels) according to the features of the training set.

Output: labels on the data set.

Example: qualified boy/girlfriend

Pages 16-20: Where to draw the line?!?!

[Figure: scatter plot, X axis (0-6) against Y axis (0-100); classes: Unqualified, Qualified]

[Figure: plot with E-Value on the horizontal axis (0-6) and a vertical axis from 0.95 to 1.1; classes: Effectors, NonEffectors]

Now consider dozens of features…

Page 21: How to classify

KNN (K Nearest Neighbors)

Decision trees

SVM (Support Vector Machine)

Naïve Bayes

Bayesian Networks

NN (Neural Networks)

Many many more…

Page 22: KNN (K Nearest Neighbors)

[Figure: scatter plot, X axis (0-6) against Y axis (0-100)]

Lazy (no pre-processing)

Local

Can deal with complex patterns
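A minimal KNN sketch follows. It is "lazy" in the sense used above: nothing is learned in advance, and all the work happens at query time. The training points, the labels attached to them, the function name `knn_predict`, and the default k=3 are made up for illustration; the raw Euclidean distance used here lets the Y axis dominate because of its larger scale, which a real application would probably address by rescaling.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of ((x, y), label). Return the majority label of the k nearest points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up training data on the same axes as the plot (X in 0-6, Y in 0-100)
train = [
    ((1.0, 20), "Unqualified"), ((1.5, 30), "Unqualified"), ((2.0, 25), "Unqualified"),
    ((4.0, 70), "Qualified"), ((4.5, 80), "Qualified"), ((5.0, 75), "Qualified"),
]
print(knn_predict(train, (4.2, 72)))
print(knn_predict(train, (1.2, 22)))
```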

Page 23: Decision trees

[Figure: scatter plot, X axis (0-6) against Y axis (0-100)]

[Decision tree: one node splits on X (X < 1.7 or X ≥ 1.7), another node splits on Y (Y < 36 or Y ≥ 36); two leaves are marked "?"]

Tree actually means something!

Can deal with complex patterns
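A decision tree of this kind reads directly as nested conditions, which is one reason the tree "means something". The thresholds 1.7 and 36 come from the slide; everything else in the sketch below is an assumption: the Y test is placed under the X ≥ 1.7 branch, and the leaf labels (shown as "?" on the slide) are filled in arbitrarily for illustration.

```python
def classify(x, y):
    """Hand-written two-level decision tree using the thresholds from the slide."""
    if x < 1.7:
        return "Unqualified"   # assumed leaf label
    # x >= 1.7: look at the second feature (assumed placement of the Y test)
    if y < 36:
        return "Unqualified"   # assumed leaf label
    return "Qualified"         # assumed leaf label

for point in [(1.0, 90), (3.0, 20), (3.0, 50)]:
    print(point, classify(*point))
```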

Page 24: SVM (Support Vector Machine)

[Figure: scatter plot, X axis (0-6) against Y axis (0-100)]

Page 25: SVM (Support Vector Machine)

Finds optimal linear separation

Maximizes the margin between the two data sets

Can use a transformation to a higher dimension when the data are not linearly separable.
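The last point can be illustrated without training an actual SVM. In the sketch below, two made-up groups of 1-D points cannot be split by any single threshold, but after the mapping x -> (x, x²) a threshold on the new coordinate separates them. The midpoint between the closest points of the two groups is the threshold with the largest margin in this toy setting. All names and data are assumptions for the example.

```python
def lift(x):
    """Map a 1-D point to 2-D: x -> (x, x**2)."""
    return (x, x * x)

def separable_by_threshold(a, b):
    """True if a single threshold puts all of `a` on one side and all of `b` on the other."""
    return max(a) < min(b) or max(b) < min(a)

inner = [-1.0, -0.5, 0.0, 0.5, 1.0]   # made-up class 1: points near zero
outer = [-3.0, -2.0, 2.0, 3.0]        # made-up class 2: points far from zero

# In the original 1-D space no threshold works
print(separable_by_threshold(inner, outer))          # False

# In the lifted space, the second coordinate (x**2) separates the classes
inner_sq = [lift(x)[1] for x in inner]
outer_sq = [lift(x)[1] for x in outer]
print(separable_by_threshold(inner_sq, outer_sq))    # True

# Threshold halfway between the closest points of the two classes
boundary = (max(inner_sq) + min(outer_sq)) / 2
print(boundary)                                      # 2.5
```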

Page 26: Naïve Bayes

Can easily compute: $P(c_1 \mid X)$ and $P(c_2 \mid X)$

Can do the same for: $P(c_1 \mid Y)$ and $P(c_2 \mid Y)$

(where $c_1$ and $c_2$ denote the two classes, and $c$ either one of them)

$\mathrm{Score}(c) = P(c \mid X, Y)$

$\mathrm{Score}(c) = P(c \mid X) \cdot P(c \mid Y)$

Page 27: Naïve Bayes - graphical representation

[Diagram: nodes X, Y, Z annotated with $P(c \mid X)$, $P(c \mid Y)$, $P(c \mid Z)$, where $c$ denotes the class]

$\mathrm{Score}(c) = P(c \mid X, Y, Z) = P(c \mid X) \cdot P(c \mid Y) \cdot P(c \mid Z)$

What if there are dependencies??
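A small naive Bayes scorer can be sketched as follows. Note the assumption: the sketch uses the common textbook form, score = P(class) · Π P(feature value | class), rather than the product of per-feature class probabilities written on the slide; both rest on treating the features as independent. The data, the feature encoding ("high"/"low"), and the function names are made up, and no smoothing is applied, so an unseen feature value gives a score of zero.

```python
from collections import Counter, defaultdict

def train_nb(rows):
    """rows: list of (features_tuple, label). Returns the counts needed for scoring."""
    class_counts = Counter(label for _, label in rows)
    feature_counts = defaultdict(Counter)   # (feature index, label) -> value counts
    for features, label in rows:
        for i, value in enumerate(features):
            feature_counts[(i, label)][value] += 1
    return class_counts, feature_counts

def nb_score(features, label, class_counts, feature_counts):
    """P(label) times the product of P(feature_i = value | label), from raw frequencies."""
    total = sum(class_counts.values())
    score = class_counts[label] / total
    for i, value in enumerate(features):
        score *= feature_counts[(i, label)][value] / class_counts[label]
    return score

def nb_predict(features, class_counts, feature_counts):
    """Return the label with the highest score."""
    return max(class_counts, key=lambda c: nb_score(features, c, class_counts, feature_counts))

# Made-up training data with two categorical features (X, Y)
rows = [
    (("high", "high"), "Qualified"), (("high", "low"), "Qualified"),
    (("high", "high"), "Qualified"), (("low", "low"), "Unqualified"),
    (("low", "high"), "Unqualified"), (("low", "low"), "Unqualified"),
]
model = train_nb(rows)
print(nb_predict(("high", "high"), *model))
```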

Page 28: Bayesian Network

[Diagram: nodes X, Y, Z annotated with $P(c \mid X, Z)$, $P(c \mid Y)$, $P(X \mid Z)$, where $c$ denotes the class]

$\mathrm{Score}(c) = P(c \mid X, Y, Z) = P(c \mid X, Z) \cdot P(c \mid Y)$

A Bayesian network takes dependencies into account

Page 29: How to choose a classifier (estimate performance)?

Use a labeled test set (in addition to the training set)

Cross validation:

10-fold

Leave-one-out
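The splitting behind k-fold cross validation can be sketched as below; leave-one-out is the special case where k equals the number of examples. This is a minimal illustration with an assumed function name (`k_fold_indices`), and it splits the indices in their given order, whereas in practice the data would usually be shuffled first.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of k folds over n examples."""
    indices = list(range(n))
    # spread any remainder over the first n % k folds
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# 10-fold split of 20 examples: each example is tested exactly once
for train, test in k_fold_indices(20, 10):
    print(test)

# Leave-one-out on 4 examples: k equals n
print([test for _, test in k_fold_indices(4, 4)])
```

Each classifier is trained on the training indices of a fold and evaluated on its test indices; the performance estimates are then averaged over the folds.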

Page 30: Legionella pneumophila case-study

Page 31: How did it all begin?

Page 32: Legionnaires' disease nowadays

Page 33: Legionella pneumophila

Copyright © 2005 Nature Publishing Group. Created by Arkitek from Nature Reviews Microbiology

Page 34: Identifying the effectors

Page 35: The features

Homology to host proteins

Regulatory elements

Genome proximity to other effectors

Secretion signal

Abundance in Metazoa / Bacteria

GC content

Sequence homology

Page 36: The effectors machine

Page 37: The big picture

Features: similarity to known effectors, regulatory elements, similarity to host proteins, G-C content, secretory signals, abundance in Metazoa\Bacteria, genome arrangement

Feature selection

Classification algorithms: NN, SVM, Naïve Bayes, Bayesian Net, Voting

Prior knowledge

Validated effectors

Non-effectors

Trained model

Unclassified genes

Predicted effectors

Predicted non-effectors

Experimental validation

Newly validated effectors
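The slide names "Voting" next to the individual classifiers without saying how the votes are combined. One simple possibility, assumed here purely for illustration, is a majority vote over the labels predicted by each classifier; the function name and the example predictions are made up.

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one predicted label per classifier. Return the most common label."""
    return Counter(predictions).most_common(1)[0][0]

# Made-up predictions for a single gene
votes = {
    "NN": "effector",
    "SVM": "effector",
    "Naive Bayes": "non-effector",
    "Bayesian Net": "effector",
}
print(majority_vote(list(votes.values())))
```

With an even number of classifiers a tie is possible; this sketch then returns whichever tied label was seen first, and a real pipeline would need an explicit tie-breaking rule.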

Page 38: Does it really work??
