Day 4: KNIME Practical George Papadatos, ChEMBL group, EMBL-EBI Francis Atkinson, ChEMBL group, EMBL-EBI
George Papadatos - Knime Tutorial

Nov 17, 2015

  • Day 4: KNIME Practical George Papadatos, ChEMBL group, EMBL-EBI

    Francis Atkinson, ChEMBL group, EMBL-EBI

  • Outline

    Introduction to KNIME

    Basic components: Desktop, nodes, dialogs, workflows

    Demo: Compound selection for focused screening

    Read chemical data

    Calculate properties

    Apply drug- and lead-likeness filters

    Remove nasty compounds

    Pick diverse molecules

    Visualize results and plot properties

    Exercises 1 & 2 (hands-on)

    12/12/2013 Resources for Computational Drug Discovery

  • Are there KNIME users among us?

  • What is KNIME?

    KNIME = Konstanz Information Miner

    Developed at the University of Konstanz in Germany

    Desktop version available free of charge (Open Source)

    Modular platform for building and executing workflows using predefined components, called nodes

    Core functionality available for tasks such as standard data mining, analysis and manipulation

    Extra features and functionality available in KNIME through extensions from various groups and vendors

    Written in Java, based on the Eclipse SDK platform

  • KNIME resources

    Web pages (documentation): www.knime.org | tech.knime.org | tech.knime.org/installation-0

    Downloads: knime.org/download-desktop

    Community forum: tech.knime.org/forum

    Books and white papers: knime.org/node/33079

    Myself

    [email protected]

  • What can you do with KNIME?

    Data manipulation and analysis: File & database I/O, sorting, filtering, grouping, joining, pivoting

    Data mining / machine learning: R, WEKA, KNIME, interactive plotting

    Chemoinformatics: Conversions, similarity, clustering, (Q)SAR analysis, MMPs, reaction enumeration

    Scripting integration: R, Perl, Python, Matlab, Octave, Groovy

    Reporting

    So much more: Bioinformatics, HTS & image analysis, network & text mining, marketing, big data and business analytics

  • Community contribution nodes

    http://tech.knime.org/community

    Chemoinformatics: ChEMBL and ChEBI (EBI); SureChEMBL nodes coming soon!

    CDK (EBI), RDKit (Novartis), Indigo (GGA), ErlWood (Eli Lilly), Enalos (NovaMechanics)

    Bioinformatics: HCS (MPI), NGS (Konstanz), Image analysis

    Text mining: Palladian

    Integration: Python, Perl, R, Groovy, Matlab (MPI), PDB web services client (Vernalis)

  • Installation & updates

    Download and unzip KNIME: no further setup required

    Additional nodes can be installed after the first launch

    knime.ini contains arguments & parameters for launch

    New software (nodes) from update sites: http://tech.knime.org/update/community-contributions/release

    Workflows and data are stored in a workspace, e.g. /Users/georgep/knime/workspace_mac_new or C:\knime_2.8.2\workspace

    Customization in: File → Preferences → KNIME

  • KNIME Workbench

    Screenshot of the KNIME Workbench with labelled areas: workflow editor, console, outline, tabs, node description, node repository, workflow projects, favorite nodes, public server

    Toolbar buttons: Auto-layout, Execute, Execute all nodes

  • KNIME nodes: Overview

    Node = basic processing unit of a KNIME workflow, which performs a particular task

    Title

    Icon

    Input port(s) on the left of icon

    Output port(s) on the right of icon

    Status display (traffic lights): Red (not ready), Amber (ready), Green (executed)

    Blue bar during execution (with percentage or flashing)

    Sequence number

    Right-click menu: to configure and execute the node, display the output views, edit the node, and display data for the ports
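    The traffic-light states can be pictured as a small state machine. The sketch below is purely illustrative: the Node class, its methods and the "Row Counter" example are hypothetical names invented here and are not part of any KNIME API.

```python
from enum import Enum


class Status(Enum):
    RED = "not ready"    # node still needs to be configured
    AMBER = "ready"      # configured, can be executed
    GREEN = "executed"   # finished, output is available


class Node:
    """Hypothetical stand-in for a workflow node (not a KNIME class)."""

    def __init__(self, title, task):
        self.title = title
        self.task = task              # function: input rows -> output rows
        self.status = Status.RED
        self.output = None

    def configure(self):
        # Configuring moves the node from red to amber.
        self.status = Status.AMBER

    def execute(self, rows):
        # An unconfigured (red) node cannot be executed.
        if self.status is Status.RED:
            raise RuntimeError(f"{self.title}: configure the node first")
        self.output = self.task(rows)
        self.status = Status.GREEN
        return self.output


node = Node("Row Counter", task=lambda rows: [len(rows)])
node.configure()
result = node.execute(["a", "b", "c"])
print(node.status.value, result)
```

    In this toy model, executing a node that is still red raises an error, which mirrors the idea that a node must be configured before it can run.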

  • KNIME nodes: Dialogs

    Configuration menus for selected nodes

    Double click to configure

    Explicit column type

  • An example completed workflow

    Workflows can be imported and exported as .zip files, with or without the underlying data

    File → Import KNIME workflow

    File → Export KNIME workflow

  • Any questions so far?

  • Compound selection for focused screening

    1. Read chemical data

    2. Calculate phys/chem properties

    3. Apply drug- and lead-likeness filters

    4. Apply more filters (e.g. remove solubility liabilities)

    5. Apply substructural filters (PAINS subset)

    6. Pick diverse molecules

  • The objective

  • First steps - I

    Locate the directory with today's material

    Copy and paste it to your desktop

    You can take it with you too

    Open the presentation file

    Import the FocusedScreeningSelection.zip to KNIME

    Menu: File → Import KNIME workflow

  • First steps - II

    Open a new workflow

    Right click on the workflow projects area

  • Part 1: Reading chemical data

  • SDF Reader

    .\data\SMDC_cleaned_nodups.sdf

  • Inspect the structures

    Right click on the node

  • Molecule to RDKit

  • Any questions so far?

  • Part 2: Property-based filtering

  • Descriptor Calculation

  • Java Snippet

    .\code\Lipinski.txt
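    The contents of Lipinski.txt are not reproduced in this transcript. As a rough illustration of what such a snippet might compute, here is a minimal Python sketch that counts rule-of-five violations using the commonly cited thresholds (MW <= 500, logP <= 5, H-bond donors <= 5, H-bond acceptors <= 10). The function name, the column names and the two example compounds are hypothetical.

```python
def lipinski_violations(mw, logp, hbd, hba):
    """Count rule-of-five violations for one compound.

    Thresholds are the commonly cited ones; the snippet used in the
    tutorial workflow may differ.
    """
    return sum([mw > 500, logp > 5, hbd > 5, hba > 10])


# Hypothetical property rows, standing in for the descriptor table.
compounds = [
    {"id": "cpd-1", "mw": 320.4, "logp": 2.1, "hbd": 2, "hba": 5},
    {"id": "cpd-2", "mw": 612.8, "logp": 6.3, "hbd": 1, "hba": 9},
]

# Append the violation count as a new column, one value per row.
for c in compounds:
    c["lipinski_fails"] = lipinski_violations(c["mw"], c["logp"], c["hbd"], c["hba"])
    print(c["id"], c["lipinski_fails"])
```

    Storing the count as a numeric column is what allows a later node to split the table on it.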

  • Numeric Row Splitter

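    Conceptually, a numeric row splitter sends rows whose value falls inside a range to one output and all other rows to a second output. The sketch below imitates that behaviour in plain Python; the function name, the column name and the cut-off of one violation are assumptions made for illustration, not settings taken from the tutorial workflow.

```python
def split_rows(rows, column, lower=None, upper=None):
    """Split rows into (matched, rest) by an inclusive numeric range."""
    matched, rest = [], []
    for row in rows:
        value = row[column]
        in_range = (lower is None or value >= lower) and (
            upper is None or value <= upper
        )
        (matched if in_range else rest).append(row)
    return matched, rest


# Hypothetical table with a precomputed violation count per compound.
rows = [
    {"id": "cpd-1", "lipinski_fails": 0},
    {"id": "cpd-2", "lipinski_fails": 2},
    {"id": "cpd-3", "lipinski_fails": 1},
]

# Assumed cut-off: keep compounds with at most one violation.
passed, failed = split_rows(rows, "lipinski_fails", upper=1)
print([r["id"] for r in passed], [r["id"] for r in failed])
```

    Keeping both outputs, rather than discarding the rejected rows, is what makes it possible to inspect the fails afterwards.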

  • Inspect the Lipinski fails

    Right click on the node

  • Java Snippet

    .\code\Oprea.txt

  • Numeric Row Splitter

  • Inspect the Oprea fails

    Right click on the node

  • Numeric Row Splitter

  • Inspect the Solubility fails

    Right click on the node

  • Any questions so far?

  • Part 3: Substructure-based filtering

  • Molecule to Indigo

  • File reader

    .\data\PAINS_clean_half.sdf

  • Query Molecule to Indigo

  • Inspect the SMARTS rules

  • Chunk Loop Start

  • Substructure Matcher

  • Loop End
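    The loop formed by Chunk Loop Start, Substructure Matcher and Loop End can be thought of as: cut one table into chunks, run the matcher once per chunk, and collect all results at the end. The transcript does not show which table is chunked or the chunk size, so the sketch below assumes the query patterns are chunked. The matcher here is a toy substring test on strings, not a real substructure search, and all names are hypothetical.

```python
def chunks(rows, size):
    """Yield consecutive slices of `rows` with at most `size` items."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def toy_match(molecule, pattern):
    # Stand-in for a substructure match: plain substring test on strings.
    return pattern in molecule


def match_in_chunks(molecules, patterns, chunk_size):
    collected = []                                # "Loop End": gather all hits
    for chunk in chunks(patterns, chunk_size):    # "Chunk Loop Start"
        for mol in molecules:                     # loop body: the matcher
            for pat in chunk:
                if toy_match(mol, pat):
                    collected.append((mol, pat))
    return collected


# Hypothetical inputs, written as strings for the toy matcher.
molecules = ["CCO", "c1ccccc1N", "CC(=O)O"]
patterns = ["N", "=O", "Cl"]
hits = match_in_chunks(molecules, patterns, chunk_size=2)
print(hits)
```

    Processing the queries chunk by chunk keeps each matcher call small, while the final collection step gives one combined table of matched structures.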

  • Inspect matched structures

    Right click on the node

  • Reference Row Filter
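    A reference row filter keeps or drops rows of one table depending on whether their key appears in a second, reference table. In this demo the natural use is to drop every compound that matched a PAINS pattern. The sketch below shows the idea in plain Python; the function name, the "id" key and the example rows are hypothetical.

```python
def reference_row_filter(rows, reference, key, include=False):
    """Keep rows whose key is (include=True) or is not (include=False)
    present in the reference table."""
    ref_keys = {r[key] for r in reference}
    return [row for row in rows if (row[key] in ref_keys) == include]


# Hypothetical compound table and table of flagged (matched) compounds.
rows = [{"id": "cpd-1"}, {"id": "cpd-2"}, {"id": "cpd-3"}]
flagged = [{"id": "cpd-2"}]

clean = reference_row_filter(rows, flagged, key="id")
print([r["id"] for r in clean])
```

    Building a set of reference keys first means each row is checked with a single lookup, regardless of how many flagged compounds there are.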

  • Any questions so far?

  • Part 4: Diversity picking and plotting

  • RDKit Fingerprint

  • Inspect the fingerprints

    Right click on the node

  • RDKit Diversity Picker

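    One common way to pick diverse molecules from fingerprints is MaxMin selection on Tanimoto distance: repeatedly pick the candidate whose nearest already-picked neighbour is furthest away. Whether the RDKit Diversity Picker node uses exactly this variant is an assumption here. The sketch is plain Python, with fingerprints represented as sets of on-bit indices, and all names and data are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def maxmin_pick(fingerprints, n_picks, first=0):
    """Return indices of up to n_picks diverse fingerprints (MaxMin)."""
    picked = [first]
    # Distance from every item to its nearest picked item so far.
    min_dist = [1.0 - tanimoto(fp, fingerprints[first]) for fp in fingerprints]
    while len(picked) < min(n_picks, len(fingerprints)):
        candidates = [i for i in range(len(fingerprints)) if i not in picked]
        nxt = max(candidates, key=lambda i: min_dist[i])
        picked.append(nxt)
        for i, fp in enumerate(fingerprints):
            dist = 1.0 - tanimoto(fp, fingerprints[nxt])
            min_dist[i] = min(min_dist[i], dist)
    return picked


# Hypothetical fingerprints as sets of on-bit indices.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {10, 11, 12}, {1, 2, 10, 11}]
print(maxmin_pick(fps, 3))
```

    Starting from the first fingerprint, the picker skips the near-duplicate second entry and selects the two that are furthest from what has already been chosen.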

  • 2D/3D Scatterplot

  • Inspect the plot

    Right click on the node

  • Any questions so far?

  • Exercise 1

    Read an SD file with drug information from ChEMBL

    Inspect the structures and their properties

    Select only drugs that were first approved in 1990 or later (First Approval)

    Select only drugs whose target organism is human (Homo sapiens)

    How many drugs remain now?

    Save the workflow

    Tips

    Open a new workflow

    Use the SDF Reader node

    Use the Numeric Row Splitter node to filter on First Approval >= 1990

    Use the Nominal Value Row Filter node to filter on Organism = Homo sapiens

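    For comparison, the two filtering steps of this exercise can be written as two list filters in Python. The field names follow the tips (First Approval, Organism). The four records are made-up placeholders, not ChEMBL data, and the assumption that First Approval is stored as an integer year is mine.

```python
def select_drugs(records, min_year=1990, organism="Homo sapiens"):
    """Apply the two exercise filters in sequence."""
    # Step 1: numeric filter on the approval year (inclusive lower bound).
    recent = [r for r in records if r["First Approval"] >= min_year]
    # Step 2: nominal filter on the target organism.
    return [r for r in recent if r["Organism"] == organism]


# Placeholder records for illustration only.
records = [
    {"Name": "drug A", "First Approval": 1985, "Organism": "Homo sapiens"},
    {"Name": "drug B", "First Approval": 1990, "Organism": "Homo sapiens"},
    {"Name": "drug C", "First Approval": 2003, "Organism": "Homo sapiens"},
    {"Name": "drug D", "First Approval": 1998, "Organism": "Plasmodium falciparum"},
]

selected = select_drugs(records)
print(len(selected), [r["Name"] for r in selected])
```

    The count printed at the end plays the role of "How many drugs remain now?" for this toy table.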

  • Exercise 2

    Continue from your previous workflow

    Calculate MW and logP of the drug compounds

    Generate a scatter plot of MW and logP

    Can you see any compounds with high MW and logP?

    Tips

    Use the Molecule to RDKit node

    Use the RDKit Descriptor Calculator node

    Include the SlogP and ExactMW descriptors

    Use the 2D/3D Scatterplot node

  • Any questions? Last chance!

  • Conclusions

    Compound selection for focused screening

    Typical scenario

    KNIME

    Open and free

    Data analysis

    Chemoinformatics toolkits: Erl Wood, RDKit, Indigo, CDK, etc.

    Lots of other functionality

    More advanced KNIME on Friday around lunch time

  • Further reading

    Open data and tools

    1. Irwin, J. J.; Sterling, T.; Mysinger, M. M.; Bolstad, E. S.; Coleman, R. G., ZINC: A free tool to discover chemistry for biology. Journal of Chemical Information and Modeling 2012 ASAP.

    2. Saubern, S.; Guha, R.; Baell, J. B., KNIME workflow to assess PAINS filters in SMARTS format. Comparison of RDKit and Indigo cheminformatics libraries. Molecular Informatics 2011, 30, (10), 847-850.

    3. Barnes, M. R.; Harland, L.; Foord, S. M.; Hall, M. D.; Dix, I.; Thomas, S.; Williams-Jones, B. I.; Brouwer, C. R., Lowering industry firewalls: pre-competitive informatics initiatives in drug discovery. Nature Reviews Drug Discovery 2009, 8, (9), 701-708.

    4. Berthold, M. R.; Cebron, N.; Dill, F.; Gabriel, T. R.; Kötter, T.; Meinl, T.; Ohl, P.; Sieb, C.; Thiel, K.; Wiswedel, B., KNIME: The Konstanz Information Miner. In Data Analysis, Machine Learning and Applications, Preisach, C.; Burkhardt, H.; Schmidt-Thieme, L.; Decker, R., Eds. Springer: Berlin, 2008; pp 319-326.

    5. Tiwari, A.; Sekhar, A. K. T., Workflow based framework for life science informatics. Computational Biology and Chemistry 2007, 31, (5-6), 305-319.

  • Further reading

    High throughput screening

    1. Bajorath, J., Integration of virtual and high-throughput screening. Nature Reviews Drug Discovery 2002, 1, (11), 882-894.

    2. Harper, G.; Pickett, S. D.; Green, D. V. S., Design of a compound screening collection for use in High Throughput Screening. Combinatorial Chemistry & High Throughput Screening 2004, 7, (1), 63-70.

    Lead- and drug-likeness

    1. Chuprina, A.; Lukin, O.; Demoiseaux, R.; Buzko, A.; Shivanyuk, A., Drug- and lead-likeness, target class, and molecular diversity analysis of 7.9 million commercially available organic compounds provided by 29 suppliers. Journal of Chemical Information and Modeling 2010, 50, (4), 470-479.

    2. Lipinski, C. A., Lead- and drug-like compounds: the rule-of-five revolution. Drug Discovery Today: Technologies 2004, 1, (4), 337-341.

    3. Oprea, T. I.; Davis, A. M.; Teague, S. J.; Leeson, P. D., Is there a difference between leads and drugs? A historical perspective. Journal of Chemical Information and Computer Sciences 2001, 41, (5), 1308-1315.

  • Further reading

    Physicochemical properties and drug discovery

    1. Brüstle, M.; Beck, B.; Schindler, T.; King, W.; Mitchell, T.; Clark, T., Descriptors, physical properties, and drug-likeness. Journal of Medicinal Chemistry 2002, 45, (16), 3345-3355.

    2. Hill, A. P.; Young, R. J., Getting physical in drug discovery: A contemporary perspective on solubility and hydrophobicity. Drug Discovery Today 2010, 15, (15/16), 648-655.

    3. Leeson, P. D.; Springthorpe, B., The influence of drug-like concepts on decision-making in medicinal chemistry. Nature Reviews Drug Discovery 2007, 6, (11), 881-890.

    Structural alerts in HTS

    1. Baell, J. B.; Holloway, G. A., New substructure filters for removal of Pan Assay Interference Compounds (PAINS) from screening libraries and for their exclusion in bioassays. Journal of Medicinal Chemistry 2010, 53, (7), 2719-2740.

    2. Rishton, G. M., Reactive compounds and in vitro false positives in HTS. Drug Discovery Today 1997, 2, (9), 382-384.

  • Further reading

    Similarity and diversity

    1. Ashton, M.; Barnard, J.; Casset, F.; Charlton, M.; Downs, G.; Gorse, D.; Holliday, J.; Lahana, R.; Willett, P., Identification of diverse database subsets using property-based and fragment-based molecular descriptions. Quantitative Structure-Activity Relationships 2002, 21, (6), 598-604.

    2. Bender, A.; Glen, R. C., Molecular similarity: a key technique in molecular informatics. Organic and Biomolecular Chemistry 2004, 2, 3204-3218.

    3. Gorse, A.-D., Diversity in medicinal chemistry space. Current Topics in Medicinal Chemistry 2006, 6, (1), 3-18.

    4. Maldonado, A.; Doucet, J.; Petitjean, M.; Fan, B.-T., Molecular similarity and diversity in chemoinformatics: From theory to applications. Molecular Diversity 2006, 10, (1), 39-79.

    5. Rogers, D.; Hahn, M., Extended-connectivity fingerprints. Journal of Chemical Information and Modeling 2010, 50, (5), 742-754.

    6. Schuffenhauer, A.; Brown, N., Chemical diversity and biological activity. Drug Discovery Today: Technologies 2006, 3, (4), 387-395.

    7. Willett, P.; Barnard, J. M.; Downs, G. M., Chemical similarity searching. Journal of Chemical Information and Computer Sciences 1998, 38, (6), 983-996.
