Gregory Landrum NIBR IT Novartis Institutes for BioMedical Research, Basel 5 th KNIME Users Group Meeting Zurich, 2 February 2012 KNIME in NIBR: Stories from Industry Basel, Switzerland Basel, Switzerland

KNIME in NIBR Stories from Industry

Oct 01, 2021

Transcript
Page 1: KNIME in NIBR Stories from Industry

Gregory Landrum NIBR IT Novartis Institutes for BioMedical Research, Basel

5th KNIME Users Group Meeting

Zurich, 2 February 2012

KNIME in NIBR: Stories from Industry

Basel, Switzerland


Page 2: KNIME in NIBR Stories from Industry

KNIME in NIBR

§  Infrastructure

§  Node development
   •  Open-source & in-house
   •  Sponsored

§  Examples


Page 3: KNIME in NIBR Stories from Industry

Infrastructure

§  Enterprise servers + cluster integration running in Cambridge, Basel

§  Standard releases for Windows, Linux, Mac

§  Nightly builds for users comfortable on the bleeding (leading) edge


Page 4: KNIME in NIBR Stories from Industry

Node development : open source

§  Chemistry nodes based on the RDKit
   •  open-source cheminformatics toolkit
   •  usable from C++, Python, Java
   •  NIBR scientists/developers actively participate
   •  www.rdkit.org

§  Standard cheminformatics tasks + some nice extras

§  Developed both in-house and together with knime.com


Page 5: KNIME in NIBR Stories from Industry

Node development : in house

§  Connections to internal data sources

§  Wrappers around in-house developed algorithms

§  Connection to our web service framework for cheminformatics services


Page 6: KNIME in NIBR Stories from Industry

Generic CIx service node


Page 7: KNIME in NIBR Stories from Industry

Sponsored node development

§  Modifications to naïve Bayes nodes to support fingerprints

§  Fingerprint naïve Bayes supporting unbalanced datasets

§  Database schema browser

§  Improvements to Python integration

§  Improvements to database connector, readers

§  Ensemble tree classifier (in progress)


Page 8: KNIME in NIBR Stories from Industry

Case studies


Page 9: KNIME in NIBR Stories from Industry

Combining databases


§  Question: what kind of activity might I expect to see for a given compound?

§  Do a similarity search in our database of internal compounds

§  Look up assays where those compounds have been tested
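The second step above, looking up assays for the neighbors returned by a similarity search, can be sketched in plain Python. This is only an illustration under stated assumptions: the compound ids, assay names, similarity values and activity values are made up, and `assays_for_neighbors` is a hypothetical helper, not a KNIME node or an internal NIBR service.

```python
from collections import defaultdict

def assays_for_neighbors(neighbors, assay_results):
    """Group assay results by assay for compounds in the neighbor list.

    neighbors: dict mapping compound id -> similarity to the query
    assay_results: iterable of (compound_id, assay_name, p_activity) rows
    """
    by_assay = defaultdict(list)
    for compound_id, assay, p_activity in assay_results:
        if compound_id in neighbors:
            by_assay[assay].append((compound_id, neighbors[compound_id], p_activity))
    # Within each assay, list the most active neighbor first.
    for rows in by_assay.values():
        rows.sort(key=lambda row: row[2], reverse=True)
    return dict(by_assay)

# Hypothetical output of the similarity search: compound id -> similarity.
neighbors = {"C1": 0.82, "C2": 0.74}

# Hypothetical assay table rows: (compound id, assay, activity value).
assay_results = [
    ("C1", "assay_A", 8.3),
    ("C2", "assay_A", 6.1),
    ("C2", "assay_B", 7.4),
    ("C3", "assay_B", 9.0),  # C3 is not a neighbor, so this row is ignored
]

summary = assays_for_neighbors(neighbors, assay_results)
for assay, rows in sorted(summary.items()):
    print(assay, rows)
```

In a workflow the same join would typically be done by database or joiner nodes; the sketch only shows the shape of the data moving between the two steps.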

Page 10: KNIME in NIBR Stories from Industry

Combining databases

§  More browsing of those results: where are those neighbors most active?

p(Activity) > 8

Page 11: KNIME in NIBR Stories from Industry

Combining databases

§  More browsing of those results: show me the most active neighbors

p(Activity) > 8

Page 12: KNIME in NIBR Stories from Industry

Parallel virtual screening example

§  Goal: find some interesting compounds to be screened for a new project

§  2D similarity searches across two databases:
   •  NIBR powder archive
   •  Catalogs from trusted vendors

§  About 7 million compounds total.

§  Use several different fingerprints

Finton Sirockin (GDC/CADD)

Page 13: KNIME in NIBR Stories from Industry

The basic process


§  Generate fingerprints for database and queries

§  Calculate similarities with the Erlwood Fingerprint Similarity node

§  Sort, filter, standardize

§  Report
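The four steps above can be sketched end to end in plain Python. The sketch rests on loud assumptions: `toy_fingerprint` is a chemically meaningless stand-in (character n-grams of a SMILES string) for a real fingerprint such as those produced by the RDKit nodes, the Tanimoto coefficient is assumed as the similarity measure, and the cutoff and `top_k` values are arbitrary.

```python
def toy_fingerprint(smiles, n=2):
    """Toy stand-in for a real fingerprint: the set of character n-grams."""
    return {smiles[i:i + n] for i in range(len(smiles) - n + 1)}

def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as sets of features."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def screen(query, database, cutoff=0.3, top_k=5):
    """Score every database entry against the query, then filter and sort."""
    query_fp = toy_fingerprint(query)
    scored = [(tanimoto(query_fp, toy_fingerprint(smi)), name)
              for name, smi in database.items()]
    kept = [(score, name) for score, name in scored if score >= cutoff]
    # Highest similarity first; ties are broken alphabetically by name.
    kept.sort(key=lambda item: (-item[0], item[1]))
    return kept[:top_k]

# Hypothetical miniature "database" of name -> SMILES.
database = {
    "ethanol": "CCO",
    "propanol": "CCCO",
    "butylamine": "CCCCN",
    "benzene": "c1ccccc1",
}

report = screen("CCCCO", database)
for score, name in report:
    print(f"{name}\t{score:.2f}")
```

Note that the toy fingerprint cannot tell ethanol from propanol, which is one reason a real screen uses several different, chemically informed fingerprints.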

Page 14: KNIME in NIBR Stories from Industry

Combining the pieces


• Workflow is run for each query

• Fingerprints calculated for each type of search
   -  Takes 600 – 11,000 s
   -  Needs to be calculated only once, even for n queries
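One way to get the "calculate once, reuse for n queries" behavior is to cache the database fingerprints on disk. The following is a minimal sketch of that idea, not the workflow's actual mechanism: `load_or_build`, the pickle file name and the tiny fingerprint dictionary are all hypothetical.

```python
import os
import pickle
import tempfile

def load_or_build(cache_path, build):
    """Return the cached object if present, otherwise build and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    result = build()
    with open(cache_path, "wb") as handle:
        pickle.dump(result, handle)
    return result

build_calls = []  # records how often the expensive step actually runs

def build_database_fingerprints():
    """Hypothetical expensive step; here just a tiny dict of feature sets."""
    build_calls.append(1)
    return {"C1": {1, 2, 3}, "C2": {2, 3, 4}}

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "db_fingerprints.pkl")
    for query in ("query_1", "query_2", "query_3"):
        fingerprints = load_or_build(cache_path, build_database_fingerprints)
        # The per-query similarity search against `fingerprints` would go here.

print(len(build_calls))
```

With this pattern the total cost is roughly one fingerprint calculation plus n searches, instead of n fingerprint calculations plus n searches.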

Page 15: KNIME in NIBR Stories from Industry

Cluster usage reporting

§  Present a dashboard with a comprehensive view of current and historical usage of our HPC cluster infrastructure

§  Three phases of processing:
   •  Input from raw SGE files off of the clusters at each site
   •  Steps A-C: data pre-processing, filtering & date-time object conversion
      -  All logs are gathered into a single file kept in RAM
      -  Use of Java nodes to convert Unix time to KNIME date objects
      -  Bash nodes for awk manipulations, which are faster natively in Linux
   •  Steps D-I: execute concurrently
      -  KNIME Statistics and grouping are heavily used
      -  Step H spawns cluster jobs to gather user usage statistics

§  Present summarized and aggregated data using Spotfire


Mike Derby (NIBR IT) Varun Shivashankar (NIBR IT)
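The pre-processing and grouping steps described on this slide (Unix time conversion, then per-user aggregation) can be sketched in plain Python. The record layout `user:start_epoch:end_epoch:slots` is a simplified, hypothetical format chosen for illustration; it is not the real SGE accounting file layout, and the workflow itself uses Java, Bash and KNIME grouping nodes rather than this code.

```python
from collections import defaultdict
from datetime import datetime, timezone

def parse_record(line):
    """Parse one simplified 'user:start_epoch:end_epoch:slots' record."""
    user, start, end, slots = line.strip().split(":")
    return {
        "user": user,
        # Unix time (seconds since 1970-01-01 UTC) -> timezone-aware datetime
        "start": datetime.fromtimestamp(int(start), tz=timezone.utc),
        "end": datetime.fromtimestamp(int(end), tz=timezone.utc),
        "slots": int(slots),
    }

def usage_by_user_and_day(lines):
    """Sum slot-hours per (user, UTC start date)."""
    totals = defaultdict(float)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        rec = parse_record(line)
        hours = (rec["end"] - rec["start"]).total_seconds() / 3600.0
        key = (rec["user"], rec["start"].date().isoformat())
        totals[key] += hours * rec["slots"]
    return dict(totals)

# Hypothetical records; user names and times are made up.
log_lines = [
    "# user:start_epoch:end_epoch:slots",
    "alice:1328140800:1328148000:4",
    "bob:1328144400:1328146200:1",
    "alice:1328227200:1328230800:2",
]

usage = usage_by_user_and_day(log_lines)
for (user, day), slot_hours in sorted(usage.items()):
    print(user, day, slot_hours)
```

The resulting per-user, per-day table is the kind of small summary file that a dashboard tool can read, in contrast to the multi-gigabyte raw logs.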

Page 16: KNIME in NIBR Stories from Industry

The workflow


•  Usage data input file: original logs, 2 GB – 4 GB in size, × 4 clusters

•  Resulting data file of summarized data: user_usage_DUS.csv (1.9 MB)

Page 17: KNIME in NIBR Stories from Industry

The complexity


Page 18: KNIME in NIBR Stories from Industry

The report: historical data


Page 19: KNIME in NIBR Stories from Industry

The dashboard


Data is written out to a UNC path and read every few minutes by Spotfire Server. The data is generated either from scripts or by KNIME running headless.

Page 20: KNIME in NIBR Stories from Industry

Predicting which target a molecule will hit

§  Goal: build a model to predict which of a set of targets a molecule is most likely to hit

§  Method: use RDKit atom-pair fingerprints and a new KNIME learner that builds ensembles of truncated decision trees (sponsored development with knime.com)

§  Validation data set: active molecules from 50 different ChEMBL assays [1]

[1] Heikamp, K. & Bajorath, J. Large-Scale Similarity Search Profiling of ChEMBL Compound Data Sets. J. Chem. Inf. Model. 51, 1831-1839 (2011).

Page 21: KNIME in NIBR Stories from Industry

Predicting which target a molecule will hit


§  11561 data points, 50 classes

§  50 trees, random descriptor selection

Page 22: KNIME in NIBR Stories from Industry

About that scaling…


Page 23: KNIME in NIBR Stories from Industry

Predicting which target a molecule will hit


§  11561 data points, 50 classes

§  50 trees, random descriptor selection

§  out-of-bag prediction error: 5.8%

§  mean error from cross validation: 4.2%
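To make the terms on this slide concrete, here is a minimal sketch of an ensemble of depth-limited trees with random descriptor selection and an out-of-bag error estimate. It is a generic bagging illustration written for this transcript, not the implementation of the KNIME learner; the binary toy descriptors, the depth of 3, `n_try=2` and the target names are all assumptions.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def build_tree(rows, labels, depth, n_try, rng):
    """Grow a tree on binary descriptors, truncated at `depth` levels.

    Leaves are class labels (strings); internal nodes are tuples
    (descriptor_index, subtree_if_set, subtree_if_unset).
    """
    majority = Counter(labels).most_common(1)[0][0]
    if depth == 0 or len(set(labels)) == 1:
        return majority
    best = None
    n_features = len(rows[0])
    # Random descriptor selection: only n_try descriptors are tried per node.
    for f in rng.sample(range(n_features), min(n_try, n_features)):
        on = [i for i, x in enumerate(rows) if x[f]]
        off = [i for i, x in enumerate(rows) if not x[f]]
        if not on or not off:
            continue  # this descriptor does not split the node
        score = sum(len(part) * gini([labels[i] for i in part]) for part in (on, off))
        if best is None or score < best[0]:
            best = (score, f, on, off)
    if best is None:
        return majority
    _, f, on, off = best
    children = [build_tree([rows[i] for i in part], [labels[i] for i in part],
                           depth - 1, n_try, rng) for part in (on, off)]
    return (f, children[0], children[1])

def predict_tree(tree, x):
    """Walk down one tree until a leaf (a class label) is reached."""
    while isinstance(tree, tuple):
        f, if_set, if_unset = tree
        tree = if_set if x[f] else if_unset
    return tree

def fit_ensemble(rows, labels, n_trees=50, depth=3, n_try=2, seed=0):
    """Train trees on bootstrap samples; estimate error on left-out rows."""
    rng = random.Random(seed)
    n = len(rows)
    trees, oob_votes = [], [Counter() for _ in range(n)]
    for _ in range(n_trees):
        bag = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        tree = build_tree([rows[i] for i in bag], [labels[i] for i in bag],
                          depth, n_try, rng)
        trees.append(tree)
        for i in set(range(n)) - set(bag):  # rows this tree never saw
            oob_votes[i][predict_tree(tree, rows[i])] += 1
    scored = [(v.most_common(1)[0][0], y) for v, y in zip(oob_votes, labels) if v]
    oob_error = (sum(p != y for p, y in scored) / len(scored)) if scored else float("nan")
    return trees, oob_error

def predict(trees, x):
    """Majority vote over all trees in the ensemble."""
    return Counter(predict_tree(t, x) for t in trees).most_common(1)[0][0]

# Toy data: three informative binary descriptors plus one noise descriptor.
rows = [[1, 1, 0, i % 2] for i in range(10)] + [[0, 0, 1, i % 2] for i in range(10)]
labels = ["target_A"] * 10 + ["target_B"] * 10
trees, oob_error = fit_ensemble(rows, labels)
print(len(trees), oob_error, predict(trees, [1, 1, 0, 1]))
```

The out-of-bag estimate comes for free from the bootstrap sampling, which is why it can be reported alongside a separate cross-validation error.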

Page 24: KNIME in NIBR Stories from Industry

Predicting which target a molecule will hit


§  mistakes tend to be in families
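A short sketch of how such an observation can be checked: count the off-diagonal cells of the confusion matrix and measure what share of the errors stays within one target family. The target names, the labels and the rule that the family is the part of the name before the underscore are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count (actual, predicted) label pairs, i.e. confusion matrix cells."""
    return Counter(zip(actual, predicted))

def top_confusions(counts, k=3):
    """Return the k most frequent off-diagonal cells as (count, actual, predicted)."""
    errors = [(n, a, p) for (a, p), n in counts.items() if a != p]
    errors.sort(key=lambda item: (-item[0], item[1], item[2]))
    return errors[:k]

def within_family_share(counts, family_of):
    """Fraction of all misclassifications where both targets share a family."""
    errors = sum(n for (a, p), n in counts.items() if a != p)
    same = sum(n for (a, p), n in counts.items()
               if a != p and family_of(a) == family_of(p))
    return same / errors if errors else 0.0

# Hypothetical labels; the family is assumed to be the prefix before "_".
actual = ["kinase_1", "kinase_1", "kinase_2", "kinase_2",
          "gpcr_1", "gpcr_1", "gpcr_1"]
predicted = ["kinase_1", "kinase_2", "kinase_1", "kinase_1",
             "gpcr_1", "gpcr_1", "kinase_1"]

counts = confusion_counts(actual, predicted)
worst = top_confusions(counts)
share = within_family_share(counts, lambda name: name.split("_")[0])
print(worst, share)
```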

Page 25: KNIME in NIBR Stories from Industry

Drilling into the confusion matrix


Page 26: KNIME in NIBR Stories from Industry

Drilling into the confusion matrix


Page 27: KNIME in NIBR Stories from Industry

Drilling into the confusion matrix


Page 28: KNIME in NIBR Stories from Industry

Drilling into the confusion matrix


Page 29: KNIME in NIBR Stories from Industry

Drilling into the confusion matrix


Page 30: KNIME in NIBR Stories from Industry

Drilling into the confusion matrix


Page 31: KNIME in NIBR Stories from Industry

Acknowledgements

§  NIBR
   •  John Davies (CPC)
   •  Richard Lewis (GDC)
   •  Steve Litster (NIBR IT)
   •  Andy Palmer (NIBR IT)
   •  Patrick Warren (NIBR IT)
   •  Case studies
      -  Finton Sirockin (GDC)
      -  Mike Derby (NIBR IT)
      -  Varun Shivashankar (NIBR IT)
      -  John Davies (CPC)
   •  Node development
      -  Manuel Schwarze (NIBR IT)
      -  Dillip Kumar Mohanty (NIBR IT)
      -  Sudip Ghosh (NIBR IT)
   •  Marc Litherland (NIBR IT)

§  knime.com
   •  Michael Berthold
   •  Bernd Wiswedel
   •  Thorsten Meinl
   •  Peter Ohl

§  Simon Richards (Lilly)


Page 32: KNIME in NIBR Stories from Industry

Teach • Discover • Treat

the power of collaborative efforts

Join the Teach-Discover-Treat initiative: participate in our symposium* and compete on one or more challenges!

*ACS Spring Meeting, March 25th, 1:30pm to 5:00pm, San Diego Convention Center, Room 26A

Goal: Provide high quality computational chemistry tutorials that impact education and drug discovery for neglected diseases

§  Requirements: use freely available software tools; datasets will be provided with a focus on targets for neglected diseases

§  Criteria to judge: quality of the model (statistical measures), clarity of the tutorial (suitable for undergraduate course), innovative application of computational technique(s)

§  Awards: travel awards to cover travel expenses for presenting work at COMP symposium

§  Presentation of Awardees at ACS Spring 2013 meeting (New Orleans)

More information and access to data sets coming in March. Bookmark www.teach-discover-treat.org