Machine Learning in the Life Sciences... with KNIME! Gregory Landrum NIBR Informatics Novartis Institutes for BioMedical Research, Basel

Machine learning in the life sciences with knime

Aug 23, 2014


Science

Greg Landrum

General audiences presentation from the May Knime Meetup in Boston
Transcript
Page 1: Machine learning in the life sciences with knime

Machine Learning in the Life Sciences... with KNIME!

Gregory Landrum NIBR Informatics

Novartis Institutes for BioMedical Research, Basel

Page 2: Machine learning in the life sciences with knime

Cartoon machine learning

Training a model: Training Data → Training → Model

Using a model: New Items → Model → Predictions

Page 3: Machine learning in the life sciences with knime

The data: introducing vocabulary

Labels: Descriptors, End point

Page 4: Machine learning in the life sciences with knime

A typical life-sciences problem

Training a model: Training Data → Training → Model

Using a model: New Items → Model → Predictions

Training Data: literature molecules active for an interesting protein target

New Items: new molecules we are thinking about making

Predictions: prioritized list

Page 5: Machine learning in the life sciences with knime

A problem... Here’s what our input looks like:

All data taken from ChEMBL (https://www.ebi.ac.uk/chembl/)

Good luck training a model with that!

Page 6: Machine learning in the life sciences with knime

One solution: Molecular Fingerprints

§  Idea: apply a kernel to a molecule to generate a bit vector or, less frequently, a count vector.

§  Typical kernels extract features of the molecule, hash them, and use the hash to determine which bits should be set.

§  Typical fingerprint sizes: 1K-4K bits.

...
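The "extract features, hash them, set bits" idea can be sketched in a few lines. This is only a toy: the feature extractor below uses substrings of a SMILES string as a stand-in for a real chemical kernel such as atom pairs, and the names toy_features and hashed_fingerprint are made up for this illustration. A real workflow would use the RDKit fingerprint nodes instead.

```python
import zlib

def toy_features(smiles, max_len=3):
    """Toy feature extractor: every substring of length 1..max_len of the SMILES.

    Assumption for illustration only; a real kernel would extract chemical
    features (for example atom pairs) from the molecular graph.
    """
    return {
        smiles[i:i + n]
        for n in range(1, max_len + 1)
        for i in range(len(smiles) - n + 1)
    }

def hashed_fingerprint(smiles, n_bits=1024):
    """Hash each feature to a bit position and set that bit."""
    bits = [0] * n_bits
    for feature in toy_features(smiles):
        # crc32 is stable across runs, unlike Python's built-in hash() for strings
        position = zlib.crc32(feature.encode("utf-8")) % n_bits
        bits[position] = 1
    return bits

fp_ethanol = hashed_fingerprint("CCO")
fp_propanol = hashed_fingerprint("CCCO")

# Number of bits set in both fingerprints
shared = sum(a & b for a, b in zip(fp_ethanol, fp_propanol))
```

Two different features can hash to the same bit, which is why the number of set bits is at most (not exactly) the number of features.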

Page 7: Machine learning in the life sciences with knime

The toolbox: Knime + the RDKit

§  Open-source RDKit-based nodes for Knime providing cheminformatics functionality

§  Trusted nodes distributed from the Knime community site

§  Work in progress: more nodes being added (new wizard makes it easy)

Page 8: Machine learning in the life sciences with knime

What’s there?

Page 9: Machine learning in the life sciences with knime

Let’s build a model! Step 1, getting the data ready

Detail: we’re using atom-pair fingerprints

100 actives

~83K assumed inactives

Detail: we’re using Histamine H3 actives

Page 10: Machine learning in the life sciences with knime

Let’s build a model! Step 2, training

For this example I use 70% of the data (randomly selected) to train the model

Detail: the model is a depth-limited random forest with 500 trees
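Outside of Knime, the random 70/30 partition could look roughly like the sketch below. It covers only the split, not the random forest itself; the function name random_split, the seed, and the integer row identifiers are assumptions made for the example.

```python
import random

def random_split(items, train_fraction=0.7, seed=42):
    """Shuffle a copy of `items` and cut it into a training and a test part."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical row identifiers standing in for the fingerprint table
rows = list(range(1000))
train_rows, test_rows = random_split(rows)
```

With so few actives, a purely random split can leave very different numbers of actives in the two parts; a stratified split is a common alternative, though the slides do not say one was used.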

Page 11: Machine learning in the life sciences with knime

Let’s build a model! Step 3, testing

Test with the 30% of the data that was not used to build the model

The model is 99.9% accurate. Unfortunately, it says “inactive” almost all the time, which makes sense given how unbalanced the data is (100 actives against ~83K assumed inactives).
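A quick back-of-the-envelope check shows why accuracy is misleading here. The numbers below assume exactly 83,000 inactives (the slides say "~83K") and a test split that keeps the class ratio intact; both are simplifications.

```python
# Assumed round numbers: 100 actives and "~83K" assumed inactives from the slides
n_actives = 100
n_inactives = 83_000
test_fraction = 0.30

# Assumption: the random split keeps the class ratio roughly intact
test_actives = round(n_actives * test_fraction)
test_inactives = round(n_inactives * test_fraction)

# A trivial "model" that always answers "inactive" gets every inactive right
# and every active wrong
accuracy = test_inactives / (test_actives + test_inactives)
recall_on_actives = 0 / test_actives

print(f"accuracy = {accuracy:.1%}, recall on actives = {recall_on_actives:.0%}")
# prints: accuracy = 99.9%, recall on actives = 0%
```

So a model that never finds a single active already scores 99.9% on accuracy alone.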

Page 12: Machine learning in the life sciences with knime

Adjusting the model for highly unbalanced data: is there a signal there?

Test with the 30% of the data that was not used to build the model

There is obviously a strong signal there; we just need to figure out how to use it.
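One way to put a number on "is there a signal" is the area under the ROC curve: the probability that a randomly chosen active gets a higher model score than a randomly chosen inactive. The sketch below computes it by direct pairwise comparison; the scores and labels are invented for illustration.

```python
def roc_auc(scores, labels):
    """Probability that a random active outscores a random inactive (ties count 1/2)."""
    active_scores = [s for s, y in zip(scores, labels) if y == 1]
    inactive_scores = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for a in active_scores:
        for i in inactive_scores:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(active_scores) * len(inactive_scores))

# Hypothetical model scores (e.g. fraction of trees voting "active") and true labels
scores = [0.02, 0.05, 0.30, 0.01, 0.27, 0.03, 0.25, 0.00]
labels = [0,    0,    1,    0,    0,    0,    1,    0]
auc = roc_auc(scores, labels)  # 11 of 12 active/inactive pairs are ranked correctly
```

Note that every score in this toy set is below 0.5, so a default cut-off would call everything inactive even though the ranking is nearly perfect.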

Page 13: Machine learning in the life sciences with knime

Adjusting the model for highly unbalanced data: is there a signal there?

Test with the 30% of the data that was not used to build the model

There is obviously a strong signal there; we just need to figure out how to use it. How about changing the decision boundary?

Find the model score that corresponds to this point on the ROC curve for the training data.
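Looking up the score behind a point on the training ROC curve can be sketched as follows. The slides pick the point by eye from the plot; here it is assumed, purely for illustration, that the point is specified as a target true-positive rate. The function names and the data are hypothetical.

```python
def roc_points(scores, labels):
    """Return (threshold, false_positive_rate, true_positive_rate) tuples,
    one per distinct score, from the strictest threshold to the loosest."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), reverse=True)
    points, tp, fp = [], 0, 0
    for idx, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item in a group of tied scores
        is_last_of_ties = idx + 1 == len(ranked) or ranked[idx + 1][0] != score
        if is_last_of_ties:
            points.append((score, fp / n_neg, tp / n_pos))
    return points

def threshold_for_recall(scores, labels, target_tpr):
    """Highest cut-off ("active if score >= cut-off") that reaches target_tpr."""
    for threshold, fpr, tpr in roc_points(scores, labels):
        if tpr >= target_tpr:
            return threshold, fpr, tpr
    raise ValueError("target_tpr must be at most 1.0")

# Hypothetical training-set scores and labels (1 = active)
train_scores = [0.40, 0.22, 0.18, 0.15, 0.08, 0.05, 0.03, 0.02, 0.01, 0.00]
train_labels = [1,    1,    0,    1,    0,    0,    0,    0,    0,    0]
cutoff, fpr, tpr = threshold_for_recall(train_scores, train_labels, target_tpr=1.0)
```

In this toy data the cut-off comes out at 0.15: all three actives are retrieved at the cost of one of the seven inactives.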

Page 14: Machine learning in the life sciences with knime

Adjusting the model for highly unbalanced data: shifting the decision boundary

Set decision boundary here

Now we’ve got a >99% accurate model that does a good job of retrieving actives without mixing in too many inactives.

Training data ROC
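To see what a shifted boundary buys, it helps to compare confusion counts at the default cut-off of 0.5 and at a lower one. The sketch below uses invented test scores and an arbitrary illustrative cut-off of 0.15; neither comes from the talk.

```python
def evaluate(scores, labels, cutoff):
    """Confusion counts plus accuracy, recall, and precision
    for the rule 'predict active if score >= cutoff'."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_active = score >= cutoff
        if predicted_active and label == 1:
            tp += 1
        elif predicted_active:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "accuracy": (tp + tn) / len(labels),
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }

# Hypothetical test set: 3 actives among 1000 molecules
test_scores = [0.35, 0.20, 0.12] + [0.16, 0.11] + [0.01] * 995
test_labels = [1, 1, 1] + [0, 0] + [0] * 995

default = evaluate(test_scores, test_labels, cutoff=0.5)   # finds no actives
shifted = evaluate(test_scores, test_labels, cutoff=0.15)  # finds 2 of 3 actives
```

In this toy case both settings are more than 99% accurate, but only the shifted boundary retrieves any actives, and it does so with a single false positive.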

Page 15: Machine learning in the life sciences with knime

Wrapping up

§  By adjusting the decision boundary for models built on highly unbalanced data, we were able to build very accurate random forests for predicting biological activity.

§  The same thing works with the Knime “Fingerprint Bayesian” nodes.

§  Acknowledgements:
•  Manuel Schwarze (NIBR)
•  Sereina Riniker (NIBR)
•  Nikolas Fechner (NIBR)
•  Bernd Wiswedel (Knime)
•  Dean Abbott (Abbott Analytics)

Page 16: Machine learning in the life sciences with knime

Advertising

3rd RDKit User Group Meeting 22-24 October 2014

Merck KGaA, Darmstadt, Germany

Talks, “talktorials”, lightning talks, social activities, and a hackathon on the 24th.

Announcement and (free) registration links at www.rdkit.org.

We’re looking for speakers. Please contact [email protected]