Day 4: KNIME Practical
George Papadatos, ChEMBL group, EMBL-EBI
Francis Atkinson, ChEMBL group, EMBL-EBI
Nov 17, 2015
Outline
Introduction to KNIME
Basic components: desktop, nodes, dialogs, workflows
Demo: Compound selection for focused screening
Read chemical data
Calculate properties
Apply drug- and lead- likeness filters
Remove nasty compounds
Pick diverse molecules
Visualize results and plot properties
Exercises 1 & 2 (hands-on)
12/12/2013 Resources for Computational Drug Discovery
Are there KNIME users among us?
What is KNIME?
KNIME = Konstanz Information Miner
Developed at the University of Konstanz in Germany
Desktop version available free of charge (Open Source)
Modular platform for building and executing workflows using predefined components, called nodes
Core functionality available for tasks such as standard data mining, analysis and manipulation
Extra features and functionality available in KNIME through extensions from various groups and vendors
Written in Java based on the Eclipse SDK platform
KNIME resources
Web pages (documentation): www.knime.org | tech.knime.org | tech.knime.org/installation-0
Downloads: knime.org/download-desktop
Community forum: tech.knime.org/forum
Books and white papers: knime.org/node/33079
Myself
What can you do with KNIME?
Data manipulation and analysis: file & database I/O, sorting, filtering, grouping, joining, pivoting
Data mining / machine learning: R, WEKA, KNIME, interactive plotting
Chemoinformatics: conversions, similarity, clustering, (Q)SAR analysis, MMPs, reaction enumeration
Scripting integration: R, Perl, Python, Matlab, Octave, Groovy
Reporting
So much more: bioinformatics, HTS & image analysis, network & text mining, marketing, big data and business analytics
Community contribution nodes
http://tech.knime.org/community
Chemoinformatics: ChEMBL and ChEBI (EBI), CDK (EBI), RDKit (Novartis), Indigo (GGA), ErlWood (Eli Lilly), Enalos (NovaMechanics). SureChEMBL nodes coming soon!
Bioinformatics: HCS (MPI), NGS (Konstanz), image analysis
Text mining: Palladian
Integration: Python, Perl, R, Groovy, Matlab (MPI), PDB web services client (Vernalis)
Installation & updates
Download and unzip KNIME; no further setup required
Additional nodes can be installed after first launch
knime.ini contains arguments and parameters for launch
New software (nodes) from update sites, e.g. http://tech.knime.org/update/community-contributions/release
Workflows and data are stored in a workspace, e.g. /Users/georgep/knime/workspace_mac_new or C:\knime_2.8.2\workspace
Customization in: File > Preferences > KNIME
KNIME Workbench
[Screenshot of the KNIME Workbench. Labelled areas: workflow editor, console, outline, tabs, node description, node repository, workflow projects, favorite nodes, public server. Labelled toolbar buttons: Auto-layout, Execute, Execute all nodes.]
KNIME nodes: Overview
Node = basic processing unit of a KNIME workflow, which performs a particular task
Title
Icon
Input port(s) on the left of the icon
Output port(s) on the right of the icon
Status display (traffic lights): red (not ready), amber (ready), green (executed)
Blue bar during execution (with percentage or flashing)
Sequence number
Right-click menu: configure and execute the node, display the output views, edit the node, and display data for the ports
KNIME nodes: Dialogs
Configuration menus for selected nodes
Explicit column type
Double click to configure
An example completed workflow
Workflows can be imported and exported as .zip files, with or without the underlying data
File > Import KNIME workflow
File > Export KNIME workflow
Any questions so far?
Compound selection for focused screening
1. Read chemical data
2. Calculate phys/chem properties
3. Apply drug- and lead-likeness filters
4. Apply more filters (e.g. remove solubility liabilities)
5. Apply substructural filters (PAINS subset)
6. Pick diverse molecules
The objective
First steps - I
Locate the directory with today's material
Copy and paste it to your desktop
You can take it with you too
Open the presentation file
Import the FocusedScreeningSelection.zip to KNIME
Menu: File > Import workflow to KNIME
First steps - II
Open a new workflow
Right click on the workflow projects area
Part 1: Reading chemical data
SDF Reader
.\data\SMDC_cleaned_nodups.sdf
Inspect the structures
Right click on the node
Molecule to RDKit
Any questions so far?
Part 2: Property-based filtering
Descriptor Calculation
Java Snippet
.\code\Lipinski.txt
Numeric Row Splitter
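The contents of .\code\Lipinski.txt are not reproduced on the slides. As a rough sketch of what a snippet followed by a Numeric Row Splitter might do, the Python below counts rule-of-five violations from precomputed descriptors and splits the rows on that count. The column names (ExactMW, SlogP, NumHBD, NumHBA), the cut-off of at most one violation, and the example values are assumptions for illustration only, not taken from the workflow.

```python
# Commonly quoted rule-of-five limits; assumed here, not read from Lipinski.txt.
RO5_LIMITS = {"ExactMW": 500.0, "SlogP": 5.0, "NumHBD": 5, "NumHBA": 10}


def lipinski_violations(row):
    """Count how many rule-of-five limits a compound exceeds."""
    return sum(1 for name, limit in RO5_LIMITS.items() if row[name] > limit)


def split_rows(rows, max_violations=1):
    """Mimic a numeric row split: return (passes, fails) based on the violation count."""
    passes, fails = [], []
    for row in rows:
        scored = dict(row, Violations=lipinski_violations(row))
        if scored["Violations"] <= max_violations:
            passes.append(scored)
        else:
            fails.append(scored)
    return passes, fails


# Hypothetical descriptor rows (invented values).
compounds = [
    {"ID": "cpd-1", "ExactMW": 310.2, "SlogP": 2.1, "NumHBD": 2, "NumHBA": 5},
    {"ID": "cpd-2", "ExactMW": 712.9, "SlogP": 6.3, "NumHBD": 6, "NumHBA": 12},
    {"ID": "cpd-3", "ExactMW": 520.4, "SlogP": 4.8, "NumHBD": 1, "NumHBA": 7},
]

passes, fails = split_rows(compounds)
print("pass:", [c["ID"] for c in passes])
print("fail:", [c["ID"] for c in fails])
```

Keeping the violation count as its own numeric column means the threshold can be changed in the splitter without touching the snippet.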
Inspect the Lipinski fails
Right click on the node
Java Snippet
.\code\Oprea.txt
Numeric Row Splitter
Inspect the Oprea fails
Right click on the node
Numeric Row Splitter
Inspect the Solubility fails
Right click on the node
Any questions so far?
Part 3: Substructure-based filtering
Molecule to Indigo
File Reader
.\data\PAINS_clean_half.sdf
Query Molecule to Indigo
Inspect the SMARTS rules
Chunk Loop Start
Substructure Matcher
Loop End
Inspect matched structures
Right click on the node
Reference Row Filter
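One way to read the loop in Part 3: pass the PAINS queries through in chunks, collect every compound hit by any chunk, then use the collected hits as the reference table for removing rows. The plain-Python sketch below mirrors only that data flow. Which table the workflow actually chunks is not stated on the slides, so chunking the queries is an assumption. The `matches` argument is a hypothetical stand-in for a real substructure search (done by the Indigo nodes in the workflow); the toy matcher in the example is a substring test on SMILES text, which is not a valid substructure search.

```python
from itertools import islice


def chunks(items, size):
    """Yield successive lists of at most `size` items (the Chunk Loop Start role)."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def flag_matches(compounds, queries, matches, chunk_size=10):
    """Collect the IDs of compounds hit by any query, one chunk of queries at a time."""
    flagged = set()
    for query_chunk in chunks(queries, chunk_size):
        for compound_id, structure in compounds.items():
            if any(matches(structure, query) for query in query_chunk):
                flagged.add(compound_id)  # hits from all chunks are pooled (the Loop End role)
    return flagged


def reference_row_filter(compounds, reference_ids):
    """Keep only the rows whose ID is absent from the reference set (exclude mode)."""
    return {cid: s for cid, s in compounds.items() if cid not in reference_ids}


def toy_matcher(structure, query):
    """Placeholder only: substring test on text, NOT a chemical substructure match."""
    return query in structure


# Invented example data.
compounds = {
    "cpd-1": "CCO",
    "cpd-2": "O=C1C=CC(=O)C=C1",
    "cpd-3": "c1ccccc1N=Nc1ccccc1",
}
queries = ["N=N", "C(=O)C=C", "S(=O)(=O)"]

flagged = flag_matches(compounds, queries, toy_matcher, chunk_size=2)
clean = reference_row_filter(compounds, flagged)
print(sorted(flagged))
print(sorted(clean))
```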
Any questions so far?
Part 4: Diversity picking and plotting
RDKit Fingerprint
Inspect the fingerprints
Right click on the node
RDKit Diversity Picker
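A common strategy for diversity picking is MaxMin: start from a seed compound and repeatedly add the candidate whose nearest already-picked neighbour is furthest away. The sketch below illustrates that idea in plain Python, with fingerprints stored as sets of on-bits and Tanimoto distance as the measure. It is not the RDKit implementation, and whether the node is configured this way is an assumption; the seed choice, the toy fingerprints and the function names are invented for illustration.

```python
def tanimoto_distance(a, b):
    """Return 1 - Tanimoto similarity for two sets of on-bits."""
    union = len(a | b)
    return 0.0 if union == 0 else 1.0 - len(a & b) / union


def maxmin_pick(fingerprints, n_pick, seed_index=0):
    """Greedy MaxMin: each new pick maximises its distance to the closest earlier pick."""
    picked = [seed_index]
    # Distance from every item to its nearest picked item so far.
    nearest = [tanimoto_distance(fp, fingerprints[seed_index]) for fp in fingerprints]
    target = min(n_pick, len(fingerprints))
    while len(picked) < target:
        candidates = [i for i in range(len(fingerprints)) if i not in picked]
        best = max(candidates, key=lambda i: nearest[i])
        picked.append(best)
        for i, fp in enumerate(fingerprints):
            nearest[i] = min(nearest[i], tanimoto_distance(fp, fingerprints[best]))
    return picked


# Toy fingerprints as sets of on-bit positions (invented).
fps = [
    {1, 2, 3, 4},     # 0: seed
    {1, 2, 3, 5},     # 1: close to 0
    {10, 11, 12},     # 2: shares nothing with 0
    {10, 11, 13},     # 3: close to 2
    {1, 2, 10, 20},   # 4: partly overlaps both groups
]

print(maxmin_pick(fps, 3))
```

In this toy set the near-duplicates (1 and 3) are skipped in favour of compounds that are far from everything already selected.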
2D/3D Scatterplot
Inspect the plot
Right click on the node
Any questions so far?
Exercise 1
Read an SD file with drug information from ChEMBL
Inspect the structures and their properties
Select only drugs that were first approved in or after 1990 (First Approval)
Select only drugs that target human (Homo sapiens)
How many drugs remain now?
Save the workflow
Tips
Open a new workflow
Use the SDF Reader node
Use the Numeric Row Splitter node to filter on First Approval >= 1990
Use the Nominal Value Row Filter node to filter on Organism = Homo sapiens
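The two filters in the tips amount to a numeric threshold followed by an exact match on a nominal value. A minimal Python sketch of that logic is given below. The rows are made up (drug names, years and organisms are invented), the field names simply echo the slide, and the helper functions are illustrative stand-ins rather than the KNIME nodes themselves.

```python
def numeric_split(rows, column, lower_bound):
    """Rows with column >= lower_bound go to the first output, the rest to the second."""
    top = [row for row in rows if row[column] >= lower_bound]
    bottom = [row for row in rows if row[column] < lower_bound]
    return top, bottom


def nominal_filter(rows, column, allowed_values):
    """Keep rows whose value in `column` is one of the allowed nominal values."""
    return [row for row in rows if row[column] in allowed_values]


# Invented example rows.
drugs = [
    {"Name": "drug-A", "First Approval": 1985, "Organism": "Homo sapiens"},
    {"Name": "drug-B", "First Approval": 1996, "Organism": "Homo sapiens"},
    {"Name": "drug-C", "First Approval": 2004, "Organism": "Escherichia coli"},
    {"Name": "drug-D", "First Approval": 1990, "Organism": "Homo sapiens"},
]

recent, older = numeric_split(drugs, "First Approval", 1990)
human = nominal_filter(recent, "Organism", {"Homo sapiens"})
print(len(human), [d["Name"] for d in human])
```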
Exercise 2
Continue from your previous workflow
Calculate MW and logP of the drug compounds
Generate a scatter plot of MW and logP
Can you see any compounds with high MW and logP?
Tips
Use the Molecule to RDKit node
Use the RDKit Descriptor Calculation node
Include the SlogP and ExactMW descriptors
Use the 2D/3D Scatterplot node
Any questions? Last chance!
Conclusions
Compound selection for focused screening
Typical scenario
KNIME
Open and free
Data analysis
Chemoinformatics toolkits: Erl Wood, RDKit, Indigo, CDK, etc.
Lots of other functionality
More advanced KNIME on Friday around lunch time
Further reading
Open data and tools
1. Irwin, J. J.; Sterling, T.; Mysinger, M. M.; Bolstad, E. S.; Coleman, R. G., ZINC: A free tool to discover chemistry for biology. Journal of Chemical Information and Modeling 2012 ASAP.
2. Saubern, S.; Guha, R.; Baell, J. B., KNIME workflow to assess PAINS filters in SMARTS format. Comparison of RDKit and Indigo cheminformatics libraries. Molecular Informatics 2011, 30, (10), 847-850.
3. Barnes, M. R.; Harland, L.; Foord, S. M.; Hall, M. D.; Dix, I.; Thomas, S.; Williams-Jones, B. I.; Brouwer, C. R., Lowering industry firewalls: pre-competitive informatics initiatives in drug discovery. Nature Reviews Drug Discovery 2009, 8, (9), 701-708.
4. Berthold, M. R.; Cebron, N.; Dill, F.; Gabriel, T. R.; Kötter, T.; Meinl, T.; Ohl, P.; Sieb, C.; Thiel, K.; Wiswedel, B., KNIME: The Konstanz Information Miner. In Data Analysis, Machine Learning and Applications, Preisach, C.; Burkhardt, H.; Schmidt-Thieme, L.; Decker, R., Eds. Springer: Berlin, 2008; pp 319-326.
5. Tiwari, A.; Sekhar, A. K. T., Workflow based framework for life science informatics. Computational Biology and Chemistry 2007, 31, (5-6), 305-319.
Further reading
High throughput screening
1. Bajorath, J., Integration of virtual and high-throughput screening. Nature Reviews Drug Discovery 2002, 1, (11), 882-894.
2. Harper, G.; Pickett, S. D.; Green, D. V. S., Design of a compound screening collection for use in High Throughput Screening. Combinatorial Chemistry & High Throughput Screening 2004, 7, (1), 63-70.
Lead- and drug-likeness
1. Chuprina, A.; Lukin, O.; Demoiseaux, R.; Buzko, A.; Shivanyuk, A., Drug- and lead-likeness, target class, and molecular diversity analysis of 7.9 million commercially available organic compounds provided by 29 suppliers. Journal of Chemical Information and Modeling 2010, 50, (4), 470-479.
2. Lipinski, C. A., Lead- and drug-like compounds: the rule-of-five revolution. Drug Discovery Today: Technologies 2004, 1, (4), 337-341.
3. Oprea, T. I.; Davis, A. M.; Teague, S. J.; Leeson, P. D., Is there a difference between leads and drugs? A historical perspective. Journal of Chemical Information and Computer Sciences 2001, 41, (5), 1308-1315.
Further reading
Physicochemical properties and drug discovery
1. Brüstle, M.; Beck, B.; Schindler, T.; King, W.; Mitchell, T.; Clark, T., Descriptors, physical properties, and drug-likeness. Journal of Medicinal Chemistry 2002, 45, (16), 3345-3355.
2. Hill, A. P.; Young, R. J., Getting physical in drug discovery: A contemporary perspective on solubility and hydrophobicity. Drug Discovery Today 2010, 15, (15/16), 648-655.
3. Leeson, P. D.; Springthorpe, B., The influence of drug-like concepts on decision-making in medicinal chemistry. Nature Reviews Drug Discovery 2007, 6, (11), 881-890.
Structural alerts in HTS
1. Baell, J. B.; Holloway, G. A., New substructure filters for removal of Pan Assay Interference Compounds (PAINS) from screening libraries and for their exclusion in bioassays. Journal of Medicinal Chemistry 2010, 53, (7), 2719-2740.
2. Rishton, G. M., Reactive compounds and in vitro false positives in HTS. Drug Discovery Today 1997, 2, (9), 382-384.
Further reading
Similarity and diversity
1. Ashton, M.; Barnard, J.; Casset, F.; Charlton, M.; Downs, G.; Gorse, D.; Holliday, J.; Lahana, R.; Willett, P., Identification of diverse database subsets using property-based and fragment-based molecular descriptions. Quantitative Structure-Activity Relationships 2002, 21, (6), 598-604.
2. Bender, A.; Glen, R. C., Molecular similarity: a key technique in molecular informatics. Organic and Biomolecular Chemistry 2004, 2, 3204-3218.
3. Gorse, A.-D., Diversity in medicinal chemistry space. Current Topics in Medicinal Chemistry 2006, 6, (1), 3-18.
4. Maldonado, A.; Doucet, J.; Petitjean, M.; Fan, B.-T., Molecular similarity and diversity in chemoinformatics: From theory to applications. Molecular Diversity 2006, 10, (1), 39-79.
5. Rogers, D.; Hahn, M., Extended-connectivity fingerprints. Journal of Chemical Information and Modeling 2010, 50, (5), 742-754.
6. Schuffenhauer, A.; Brown, N., Chemical diversity and biological activity. Drug Discovery Today: Technologies 2006, 3, (4), 387-395.
7. Willett, P.; Barnard, J. M.; Downs, G. M., Chemical similarity searching. Journal of Chemical Information and Computer Sciences 1998, 38, (6), 983-996.