Dec 18, 2015

Download

Documents

Welcome message from author
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
Page 1: 1 E-Chemistry and Web 2.0 Marlon Pierce mpierce@cs.indiana.edu Community Grids Lab Indiana University.


E-Chemistry and Web 2.0

Marlon Pierce

mpierce@cs.indiana.edu

Community Grids Lab

Indiana University


One Talk, Two Projects

NIH-funded Chemical Informatics and Cyberinfrastructure Collaboratory (CICC) @ IU: Geoffrey Fox, Gary Wiggins, Rajarshi Guha, David Wild, Mookie Baik, Kevin Gilbert, and others

Proposed Microsoft-funded project: E-Chemistry. Carl Lagoze (Cornell), Lee Giles (PSU), Steve Bryant (NIH), Jeremy Frey (Soton), Peter Murray-Rust (Cambridge), Herbert Van de Sompel (Los Alamos), Geoffrey Fox (Indiana), and others


CICC Infrastructure Vision

Chemical Informatics: drug discovery and other academic chemistry, pharmacology, and bioinformatics research will be aided by powerful, modern, open information technology.
NIH PubChem and PubMed provide unprecedented open, free data and information.
We need a corresponding open service architecture (i.e. avoid stove-piped applications).
CICC is set up as distributed cyberinfrastructure in the eScience model:
  Web clients (user interfaces) to distributed databases, results of high-throughput screening instruments, results of computational chemical simulations, and other analyses.
    Composed of clients to open service APIs (mash-ups)
    Aggregated into portals
  Web services manipulate this data and are combined into workflows.
So our main agenda items: create interesting databases and build lots of Web services and clients.


CICC Databases

Most of our databases aim to add value to PubChem or link into PubChem.
  1D (SMILES) and 2D structures
  3D structures (MMFF94)
    Searchable by CID, SMARTS, 3D similarity
  Docked ligands (FRED, Autodock)
    906K drug-like compounds docked into 7 targets
    Will eventually cover ~2000 targets
Philosophy: we have big computers, so let’s calculate everything ahead of time and put the results in a DB.


Building Up the Infrastructure

Our SOA philosophy: use standard Web services.
  Mostly stateless
  Some cluster, HPC work needed, but these populate databases
Services are aggregate-able into different workflows.
  Taverna, Pipeline Pilot, …
You can also build lots of Web clients.
See http://www.chembiogrid.org/wiki/index.php/CICC_Web_Resources for links and details.
Not so far from Web 2.0….


Sample Services

Type | Service | Functionality | Source | License
Database | Docking | Provides access to the results of docking a subset of PubChem into a set of targets. Searchable by 2D structure and docking score | Indiana University | Freely accessible
Database | 3D Structure | Provides access to 3D structures generated for most of PubChem | Indiana University | Freely accessible
Cheminformatics | OSCAR3 | Extracts chemical structures from text | Cambridge University | Freely accessible
Cheminformatics | InChiGoogle | Uses Google to search for an InChI | Cambridge University | Freely accessible
Cheminformatics | CMLRSSServer | Generates a CMLRSS feed from CML data | Cambridge University | Freely accessible
Cheminformatics | OpenBabel | Converts chemical file formats | Cambridge University | Freely accessible


Cheminformatics | ToxTreeServer | Obtains toxicity hazard predictions | Indiana University & European Chemical Bureau | Freely accessible
Cheminformatics | DBUtil | Generates 166-bit MACCS keys | Indiana University & gNova Consulting | Freely accessible
Cheminformatics | Molecular Similarity | Evaluates 2D/3D similarity and evaluates distance moments for 3D similarity calculations | Indiana University & CDK | Freely accessible
Cheminformatics | Molecular Descriptors | Generates various descriptors including TPSA, XLogP, surface areas | Indiana University & CDK | Freely accessible
Cheminformatics | 2D Structure Diagrams | Generates 2D structure diagrams from SMILES | Indiana University & CDK | Freely accessible
Cheminformatics | Druglikeness Methods | Evaluates measures of druglikeness | Indiana University & CDK | Freely accessible
Cheminformatics | Utility Methods | Generates hashed fingerprints, 2D coordinates, etc. | Indiana University & CDK | Freely accessible


Statistics | Sampling Distributions | Samples from several distributions (normal, uniform, Weibull, etc.) | Indiana University | Freely accessible
Statistics | Linear Regression | Builds linear regression models | Indiana University | Freely accessible
Statistics | CNN Regression | Builds neural network regression models | Indiana University | Freely accessible
Statistics | RF Regression | Builds random forest regression models | Indiana University | Freely accessible
Statistics | LDA | Builds linear discriminant analysis models | Indiana University | Freely accessible
Statistics | K-Means | Performs K-means clustering | Indiana University | Freely accessible
Statistics | Feature Selection | Performs feature selection using stepwise regression | Indiana University | Freely accessible
Statistics | XY Plots | Generates 2D scatter plots | Indiana University | Freely accessible
Statistics | Histogram Plots | Generates histograms | Indiana University | Freely accessible


Data Exchange | TabToVOTables | Converts tab-delimited files to VOTables | Indiana University | Freely accessible
Data Exchange | VOTablesToTab | Converts VOTables to tab-delimited files | Indiana University | Freely accessible
Data Exchange | VOTablesToXLS | Converts VOTables to Excel spreadsheets | Indiana University | Freely accessible
Data Exchange | VOTable Retrieve | Retrieves field names and data types from a VOTables document | Indiana University | Freely accessible
Data Exchange | VOTableExtract | Extracts columns from a VOTables document | Indiana University | Freely accessible
Computational Chemistry | Varuna File Format | Handles file formats for QM/MM packages | Indiana University | Freely accessible
Computational Chemistry | Varuna Analysis | Performs analysis of results from Jaguar and ADF | Indiana University | Freely accessible
Computational Chemistry | Varuna Query | Searches the Varuna database | Indiana University | Freely accessible
Computational Chemistry | Varuna Submit | Submits input data for calculation on a local cluster | Indiana University | Freely accessible


Application | Fred | Performs docking | OpenEye Software | Commercial
Application | Filter | Property calculation and filtering | OpenEye Software | Commercial
Application | Omega | Generates 3D conformers | OpenEye Software | Commercial
Application | BCI Fingerprint | Generates 1052 BCI structural keys | Digital Chemistry | Commercial
Application | BCI Clustering | Performs divisive k-means clustering | Digital Chemistry | Commercial
Application | PkCell | Evaluates pharmacokinetic parameters for druglike molecules | Indiana University & University of Michigan | Freely accessible
Application | Scripps MLSCN Toxicity | Gets toxicity predictions for RF models built using MLSCN cell-line data | Indiana University & Scripps, FL | Freely accessible
Application | NTP DTP Anti-cancer activity | Gets anti-cancer activity predictions for the 60 NCI cell lines | Indiana University | Freely accessible
Application | Ames Mutagenicity | Gets mutagenicity predictions | Indiana University | Freely accessible


Web Client Interfaces

Name | Functionality | Type | Links
PubDock | Interface to the docking database | Web | http://www.chembiogrid.org/cheminfo/dock/
Pub3D | Interface to the 3D structure database | Web | http://www.chembiogrid.org/cheminfo/p3d/
Frequent Hitters | Identify compounds that occur in multiple assays, with links to individual assays | Web | http://www.chembiogrid.org/cheminfo/freqhit/fh
MLSCN Toxicity Predictions | Predict whether a compound will be toxic or not | Web and Pipeline Pilot | http://www.chembiogrid.org/cheminfo/rws/scripps
ToxTree | Predict toxicity hazard class | Web | http://cheminfo.informatics.indiana.edu/~rguha/code/java/cdkws/cdkws.html#tox
DTP Anti-Cancer Predictions | Predict whether a compound exhibits anti-cancer activity against the 60 NCI cell lines | Web | http://www.chembiogrid.org/cheminfo/ncidtp/dtp


More Clients…

Ames Mutagenicity Predictions | Predict whether a compound is mutagenic or not in the Ames test | Web | http://www.chembiogrid.org/cheminfo/rws/ames
PkCell | Evaluate pharmacokinetic parameters | Web | http://www.chembiogrid.org/cheminfo/pkcell/
Kemo | Natural language interface to PubChem | Web | http://cheminfo.informatics.indiana.edu:8080/kemo/
RSS Feeds | Generate RSS feeds for various PubChem related queries | Web and RSS feed | http://www.chembiogrid.org/cheminfo/rssint.html
Statistical Model Download | Download statistical models as R binary files | Web | http://www.chembiogrid.org/cheminfo/rws/mlist
Cheminformatics | Miscellaneous functions such as structure diagrams, similarity, etc. | Web | http://cheminfo.informatics.indiana.edu/~rguha/code/java/cdkws/cdkws.html


More Clients…

Varuna | File operations and result analysis | Web | http://129.79.139.29/filecon/Default.aspx and http://129.79.139.29/utilityclient/Default.aspx
VOTables | Plotting data using VOTables as well as using Excel files via VOTables | Web | http://gf1.ucs.indiana.edu:9080/axis/VOTables.html and http://www.chembiogrid.org/cheminfo/rws/xlsvor
PubChemSR | .Net interface to PubChem | Desktop application | http://darwin.informatics.indiana.edu/juhur/Tools/PubChemSR/
rpubchem and rcdk | R packages to interface with the CDK and access PubChem | Desktop application | http://cran.r-project.org/src/contrib/Descriptions/rcdk.html and http://cran.r-project.org/src/contrib/Descriptions/rpubchem.html
Chimera plugin | A plugin to allow Chimera to utilize the PubDock database | Desktop application (requires Chimera) | http://poincare.uits.iupui.edu/~heiland/cicc/code/
PubChem 3D View | A Greasemonkey script that shows 3D structures when viewing PubChem pages | Web (requires Firefox and Greasemonkey) | http://rna.informatics.indiana.edu/hgopalak/3DStructView.user.js


Example: PubDock

Database of approximately 1 million PubChem structures (the most drug-like) docked into proteins taken from the PDB

Available as a web service, so structures can be accessed in your own programs, or using workflow tools like Pipeline Pilot

Several interfaces developed, including one based on Chimera (right) which integrates the database with the PDB to allow browsing of compounds in different targets, or different compounds in the same target

Can be used as a tool to help understand molecular basis of activity in cellular or image based assays


Example: R Statistics applied to PubChem data

By exposing the R statistical package, and the Chemistry Development Kit (CDK) toolkit as web services and integrating them with PubChem, we can quickly and easily perform statistical analysis and virtual screening of PubChem assay data

Predictive models for particular screens are exposed as web services, and can be used either as simple web tools or integrated into other applications

Example uses DTP Tumor Cell Line screens - a predictive model using Random Forests in R makes predictions of probability of activity across multiple cell lines.


Example assay screening workflow: finding cell-protein relationships

A protein implicated in tumor growth with a known ligand is selected (in this case HSP90 taken from the PDB 1Y4 complex).

The screening data from a cellular HTS assay is similarity searched for compounds with 2D structures similar to the ligand.

Similar structures to the ligand can be browsed using client portlets.

Similar structures are filtered for druggability, converted to 3D, and automatically passed to the OpenEye FRED docking program for docking into the target protein.

Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the Jmol applet.

Docking results and activity patterns are fed into R services for building activity models and correlations (least squares regression, random forests, neural nets).


Relevance to Web 2.0

Some Web 2.0 key features:
  REST services
  Use of RSS/Atom feeds
  Client interfaces are “mashups”
  Gadgets, widgets for portals aggregate clients
So…
  We provide RSS as an alternative WS format.
  We have experimented with RSS feeds, using Yahoo Pipes to manipulate multiple feeds.
  CICC Web interfaces can be easily wrapped as universal gadgets in iGoogle, Netvibes.
    Alternative to classic science gateways.


RSS Feeds/REST Services

Provide access to DBs via RSS feeds
Feeds include 2D/3D structures in CML
Viewable in Bioclipse, Jmol as well as Sage etc.
Two feeds currently available:
  SynSearch – get structures based on full or partial chemical names
  DockSearch – get best N structures for a target
Really hampered by size of DB and Postgres performance.
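
As a rough illustration of what a client of such a feed might do, the sketch below pulls CML molecules out of RSS items using only the Python standard library. The feed text, the item layout, and the CML namespace URI used here are assumptions made for the example; the actual SynSearch and DockSearch feeds may be structured differently.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Assumed CML namespace URI; check it against the real feed before relying on it.
CML_NS = "http://www.xml-cml.org/schema"

# Hypothetical feed content, invented purely for illustration.
SAMPLE_FEED = """<rss version="2.0" xmlns:cml="http://www.xml-cml.org/schema">
  <channel>
    <title>DockSearch results (illustrative)</title>
    <item>
      <title>Hit 1</title>
      <cml:molecule id="m1">
        <cml:atomArray>
          <cml:atom id="a1" elementType="C"/>
          <cml:atom id="a2" elementType="O"/>
        </cml:atomArray>
      </cml:molecule>
    </item>
  </channel>
</rss>"""


def molecules_in_feed(feed_xml: str) -> List[Tuple[str, str, List[str]]]:
    """Return (item title, molecule id, element symbols) for each CML molecule."""
    root = ET.fromstring(feed_xml)
    results = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        for mol in item.iter(f"{{{CML_NS}}}molecule"):
            elements = [a.get("elementType") for a in mol.iter(f"{{{CML_NS}}}atom")]
            results.append((title, mol.get("id"), elements))
    return results


print(molecules_in_feed(SAMPLE_FEED))
```

Because the structures travel inside ordinary feed items, any feed reader can list the hits while chemistry-aware clients can also read the embedded molecules.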


Tools and mashups based on web service infrastructure

http://www.chembiogrid.org/projects/proj_tools.html


Mining information from journal articles

Until now SciFinder / CAS only chemistry-aware portal into journal information

We can access full text of journal articles online (with subscription)

ACS does not make full text available … but there are ways round that!

RSC is now marking up with SMILES and GO/Goldbook terms! www.projectprospect.org

Having SMILES or InChI means that we can build a similarity/structure searchable database of papers: e.g. “find me all the papers published since 2000 which contain a structure with >90% similarity to this one”

In the absence of full text, we can at least use the abstract
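
The similarity query quoted above can be sketched in a few lines. This is a minimal sketch that assumes each structure is represented by a fingerprint (a set of "on" bit positions) and that similarity means the Tanimoto coefficient; the slides do not say which measure or fingerprint is used, and the index layout and function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def similar_papers(
    query: Set[int],
    index: Dict[str, Tuple[int, List[Set[int]]]],
    threshold: float = 0.9,
    since: int = 2000,
) -> List[Tuple[str, float]]:
    """Papers from `since` onward containing a structure with similarity > threshold.

    `index` maps a paper identifier to (publication year, structure fingerprints).
    """
    hits = []
    for paper_id, (year, fingerprints) in index.items():
        if year < since:
            continue
        best = max((tanimoto(query, fp) for fp in fingerprints), default=0.0)
        if best > threshold:
            hits.append((paper_id, best))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy index with made-up identifiers and fingerprints.
toy_index = {
    "paper-a": (2005, [set(range(10))]),
    "paper-b": (1998, [set(range(10))]),
    "paper-c": (2006, [{1, 2}]),
}
print(similar_papers(set(range(10)), toy_index))
```

A linear scan like this is only workable for small collections; a production system would push the similarity search into the database.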


Text Mining: OSCAR

A tool for shallow, chemistry-specific natural language parsing of chemical documents (e.g. journal articles).
It identifies (or attempts to identify):
  Chemical names: singular nouns, plurals, verbs etc., also formulae and acronyms.
  Chemical data: spectra, melting/boiling point, yield etc. in experimental sections.
  Other entities: things like N(5)-C(3) and so on.
Part of the larger SciBorg effort
  See http://www.cl.cam.ac.uk/~aac10/escience/sciborg.html
  http://wwmm.ch.cam.ac.uk/wikis/wwmm/index.php/Oscar3


Mash-Up: What published compounds might bind to this protein?

Create a database containing the text of all recent PubMed abstracts (2006-2007 = ~500,000)

Use OSCAR to extract all of the chemical names referred to in the abstracts and convert to SMILES

Convert molecules to 3D and dock into a protein of interest

Visualize top docked molecules in a Google-like interface

DATABASE SERVICE + DOCKING SERVICE


E-Chemistry and Digital Libraries

We can’t wait to get started….


E-Chemistry and Digital Libraries

Key problem with our SOA-based e-Science is information management. Where is the service that I need? What does it do?

We may consider our data-centric services to be digital libraries.

Data is diverse
  Documents
  Not just computational information like structures.

Another point of view: how can I link together publications, results, workflows, etc? That is, I need to manage digital documents.


Digital Libraries

Open Archives Initiative Object Reuse and Exchange Project (OAI-ORE)
  Developing standardized, interoperable, and machine-readable mechanisms to express information about compound information objects on the web.
  Graph-based representations of connected digital objects.
  Objects may be encoded in (for example) RDF or XML.
  Retrievable via repositories with REST service interfaces (cf. Atom Publishing Protocol)
    Obtain, harvest, and register


Challenges for E-Chemistry

Can digital library principles be applied to data as well as documents?
  Can you link your workflow to your conference paper?
Can we engineer a publishing framework and message formats around Web 2.0 principles?
  REST, Atom Publishing Protocol, Atom Syndication Format, JSON, Microformats
Can we do this securely?
  Access control, provenance, identity federation are key problems.


Institution | Project Focus
Cambridge | Retrospective Data Extraction; Searching and Indexing; Data Models/Ontologies; Tools and Applications
Cornell | Data Models; Interoperability Infrastructure; Project Management; Publicity and Outreach
Indiana | Infrastructure Integration; Trust and Provenance; Tools and Applications
LANL | Data Models; Interoperability Infrastructure
PubChem | Chemical Structure Archive; Results of Experimental Biological Activity Testing; Cross References to Biomedical Databases
Penn State | Retrospective Data Extraction; Searching and Indexing; Analysis
Southampton | Prospective & Retrospective Data Provision; Tools and Applications; In-process capture of eChemistry data; Data Linking – in analysis and publication


More Information

Project Web Site: www.chembiogrid.org
Project Wiki: www.chembiogrid.org/wiki
Contact me: mpierce@cs.indiana.edu


CICC Combines Grid Computing with Chemical Informatics

CICC: Chemical Informatics and Cyberinfrastructure Collaboratory

Funded by the National Institutes of Health
www.chembiogrid.org

Indiana University Department of Chemistry, School of Informatics, and Pervasive Technology Laboratories

Science and Cyberinfrastructure

Large Scale Computing Challenges

Chemical informatics is a non-traditional area of high performance computing, but many new, challenging problems may be investigated.

CICC is an NIH funded project to support chemical informatics needs of High Throughput Cancer Screening Centers. The NIH is creating a data deluge of publicly available data on potential new drugs.

CICC supports the NIH mission by combining state of the art chemical informatics techniques with

• World class high performance computing
• National-scale computing resources (TeraGrid)
• Internet-standard web services
• International activities for service orchestration
• Open distributed computing infrastructure for scientists world wide

[Workflow diagram. Box labels: NIH PubMed database; OSCAR text analysis; POV-Ray parallel rendering; initial 3D structure calculation; toxicity filtering; cluster grouping; docking; molecular mechanics calculations; quantum mechanics calculations; IU’s Varuna database; NIH PubChem database]

Chemical informatics text analysis programs can process hundreds of thousands of abstracts of online journal articles to extract chemical signatures of potential drugs.

OSCAR-mined molecular signatures can be clustered, filtered for toxicity, and docked onto larger proteins. These are classic “pleasingly parallel” tasks. Top-ranking docked molecules can be further examined for drug potential.

Big Red (and the TeraGrid) will also enable us to perform time consuming, multi-stepped Quantum Chemistry calculations on all of PubMed. Results go back to public databases that are freely accessible by the scientific community.


MLSCN Post-HTS Biology Decision Support

Percent inhibition or IC50 data is retrieved from HTS

Question: Was this screen successful?

Question: What should the active/inactive cutoffs be?

Question: What can we learn about the target protein or cell line from this screen?

Compounds submitted to PubChem

Workflows encoding distribution analysis of screening results

Workflows encoding plate & control well statistics, distribution analysis, etc.

Workflows encoding statistical comparison of results to similar screens, docking of compounds into proteins to correlate binding with activity, literature search of active compounds, etc.

CHEMINFORMATICS PROCESS GRIDS

Grids can link data analysis (e.g. image processing developed in existing Grids), traditional cheminformatics tools, as well as annotation tools (Semantic Web, del.icio.us) and enhance lead ID and SAR analysis

A Grid of Grids linking collections of services at
  PubChem
  ECCR centers
  MLSCN centers


R Web Services


Why?

Need access to math and stat functionality
  Did not want to recode algorithms
  Wanted latest methods
Needed a distributed approach to computation
  Keep computation on a powerful machine
  Access it from a smaller machine


Why R?

Free, open-source
Many cutting-edge methods available
Flexible programming language
Interfaces with many languages
  Python, Perl, Java, C


The R Server

R can be run as a remote compute server
  Requires the Rserve package
Allows authenticated access over TCP/IP
Connections can maintain state
Client libraries for Java & C


R as a Web Service

On its own the R server is not a web service

We provide Java frontends to specific functionalities

The frontend classes are hosted in a Tomcat web container

Accessible via SOAP

Full Javadocs for all available WS’s


Flowchart


Functionality

Two classes of functionality
General functions
  Allows you to supply data and build a predictive model
  Sample from various distributions
  Obtain scatter plots and histograms
  Model development functions use a Java front-end to encapsulate model-specific information


Functionality

Two classes of functionality
Model deployment
  Allows you to build a model outside of the infrastructure
  Place the final model in the infrastructure
  Becomes available as a web service
  Each model deployed requires its own front-end class
  In general, these classes are identical - could be autogenerated


Available Functionality

Predictive models - OLS, RF, CNN, LDA
Clustering - k-means
Statistical distributions
XY plots and scatter plots
Model deployment for single model types and ensemble model types


Deployed Models

Since deployed models are visible as web services we can build a simple web front end for them

Examples
  NCI anti-cancer predictions
  Ames mutagenicity predictions


Applications

The R WS is not restricted to ‘atomic’ functionality

Can write a whole R program
  Load it on the R compute server
  Provide a Java WS frontend
Examples
  Feature selection
  Automated model generation
  Pharmacokinetic parameter calculation


Data Input/Output

Most modeling applications require data matrices

Depending on client language we can use
  SOAP array of arrays (2D matrices)
  SOAP array (1D vector form of a 2D matrix)
  VOTables
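
The "1D vector form of a 2D matrix" option can be illustrated as follows. This is a minimal sketch that assumes row-major ordering and that the row and column counts are sent alongside the flat array; the slides do not state the ordering or how the dimensions are conveyed, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def flatten(matrix: Sequence[Sequence[float]]) -> Tuple[List[float], int, int]:
    """Return (values in row-major order, number of rows, number of columns)."""
    if not matrix:
        return [], 0, 0
    n_cols = len(matrix[0])
    if any(len(row) != n_cols for row in matrix):
        raise ValueError("all rows must have the same length")
    return [value for row in matrix for value in row], len(matrix), n_cols


def unflatten(values: Sequence[float], n_rows: int, n_cols: int) -> List[List[float]]:
    """Rebuild the 2D matrix from its row-major 1D form."""
    if len(values) != n_rows * n_cols:
        raise ValueError("length of values does not match the given dimensions")
    return [list(values[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]


flat, rows, cols = flatten([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(flat, rows, cols)
print(unflatten(flat, rows, cols))
```

The flat form is convenient for client toolkits that handle nested SOAP arrays poorly, at the cost of both sides having to agree on the ordering.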


Data Input/Output

Some R web services can take a URL to a VOTables document
  Conversion to R or Java matrices is done by a local VOTables Java library

R also has basic support for VOTables directly
  Ignores binary data streams
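
For readers unfamiliar with the format, the sketch below converts tab-delimited text into a simplified VOTable document, in the spirit of a tab-to-VOTable conversion. It is not the CICC implementation: it omits the XML namespace and most metadata a full VOTable would carry, and the type-guessing rule and function names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List


def _guess_datatype(column: List[str]) -> str:
    """Crude guess: 'double' if every value parses as a float, else 'char'."""
    try:
        for value in column:
            float(value)
        return "double"
    except ValueError:
        return "char"


def tab_to_votable(text: str) -> str:
    """Convert tab-delimited text (first line is the header) to VOTable XML."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        raise ValueError("input has no header row")
    header = lines[0].split("\t")
    rows = [line.split("\t") for line in lines[1:]]
    if any(len(row) != len(header) for row in rows):
        raise ValueError("every row must have one value per header column")

    votable = ET.Element("VOTABLE", {"version": "1.1"})
    table = ET.SubElement(ET.SubElement(votable, "RESOURCE"), "TABLE")
    for i, name in enumerate(header):
        attrs = {"name": name, "datatype": _guess_datatype([row[i] for row in rows])}
        if attrs["datatype"] == "char":
            attrs["arraysize"] = "*"  # variable-length string
        ET.SubElement(table, "FIELD", attrs)

    tabledata = ET.SubElement(ET.SubElement(table, "DATA"), "TABLEDATA")
    for row in rows:
        tr = ET.SubElement(tabledata, "TR")
        for value in row:
            ET.SubElement(tr, "TD").text = value
    return ET.tostring(votable, encoding="unicode")


print(tab_to_votable("name\tlogp\naspirin\t1.2\n"))
```

The FIELD elements carry the column names and types, which is what lets a service retrieve field names and data types without reading the data rows.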


Interacting With R WS’s

Traditional WS’s do not maintain state

Predictive models are different
  A model is built at one time
  May be used for prediction at another time
  Need to maintain state

State is maintained by serialization to R binary files on the compute server

Clients deal with model ID’s


Interacting with R WS’s

Protocol
  Send data to model WS
  Get back model ID
  Get various information via model ID
    Fitted values
    Training statistics
    New predictions
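
The protocol above can be mimicked locally to show why a model ID is enough to carry state between otherwise stateless calls. This is a toy stand-in, not the CICC services: the class and method names are hypothetical, a one-variable least-squares fit replaces the R models, and Python pickle files in a directory play the role of the R binary files on the compute server.

```python
import pickle
import tempfile
import uuid
from pathlib import Path
from typing import Dict, List, Sequence


class ModelService:
    """Toy stand-in for a model web service that persists models by ID."""

    def __init__(self, store_dir: str) -> None:
        self.store = Path(store_dir)

    def build_linear_model(self, x: Sequence[float], y: Sequence[float]) -> str:
        """Fit y = intercept + slope * x by least squares; return a model ID."""
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        slope = sxy / sxx
        model = {"slope": slope, "intercept": mean_y - slope * mean_x,
                 "x": list(x), "y": list(y)}
        model_id = uuid.uuid4().hex
        # Serialize the model so later, independent requests can find it.
        with open(self.store / f"{model_id}.pkl", "wb") as fh:
            pickle.dump(model, fh)
        return model_id

    def _load(self, model_id: str) -> Dict:
        with open(self.store / f"{model_id}.pkl", "rb") as fh:
            return pickle.load(fh)

    def predict(self, model_id: str, new_x: Sequence[float]) -> List[float]:
        model = self._load(model_id)
        return [model["intercept"] + model["slope"] * xi for xi in new_x]

    def fitted_values(self, model_id: str) -> List[float]:
        return self.predict(model_id, self._load(model_id)["x"])

    def training_stats(self, model_id: str) -> Dict[str, float]:
        model = self._load(model_id)
        fitted = self.fitted_values(model_id)
        residuals = [yi - fi for yi, fi in zip(model["y"], fitted)]
        return {"n": len(residuals), "rss": sum(r * r for r in residuals)}


with tempfile.TemporaryDirectory() as store:
    model_id = ModelService(store).build_linear_model([0, 1, 2, 3], [1, 3, 5, 7])
    # A fresh service object shares no memory with the one that built the model;
    # the model ID plus the serialized file is all it needs.
    later = ModelService(store)
    print(later.predict(model_id, [4]), later.training_stats(model_id))
```

The client never holds the model itself, only the ID, so the build and predict calls can come from different machines or at different times.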


Cheminformatics at Indiana University School of Informatics

David J. Wild

Associate Director of Chemical Informatics & Assistant Professor

Indiana University School of Informatics, Bloomington

http://djwild.info


Cheminformatics education at Indiana

M.S. in Chemical Informatics
  2 years, 36 semester hours
  Includes a 6-hour capstone / research project
  Opportunity to work in Laboratory Informatics (IUPUI) or closely with Bioinformatics (IUB)
  Currently 9 students enrolled

Ph.D. in Informatics, Cheminformatics Specialty
  90 credit hours, including 30 hours dissertation research. Usually 4 years.
  Research rotations expose students to research in related areas
  Currently 4 students enrolled

Graduate Certificate
  4 courses, all available by Distance Education
    I571 Chemical Information Technology
    I572 Computational Chemistry & Molecular Modeling
    I573 Programming for Science Informatics
    I553 Independent Study in Chemical Informatics
  D.E. students pay in-state fees! (~$800 per class)

See http://cheminfo.informatics.indiana.edu for more information, or a general review of cheminformatics education in Drug Discovery Today 11, 9&10 (May 2006), pp436-439


Distance Education for Cheminformatics

Uses Breeze + teleconference for live sharing of classes: all that is required is a P.C. and a telephone. Optional Polycom videoconferencing.

Lectures are recorded for easy playback through a web browser

Wiki or similar webpage for dissemination of course materials

Also participate in CIC courseshare to give class at University of Michigan

Of 75 students taking our courses since fall 2005, 39 have been D.E. students

See JCIM 2006; 46(2) pp 495 - 502 for more details


Current research in the Wild lab

Integration of cheminformatics tools and data sources
  A web service infrastructure for cheminformatics
  Compound information & aggregation web service and interface (“by the way box”)
  An enhanced chatbot for exploiting chemical information & web services
  Semantically-aware workflow tools for cheminformatics
  Data mining the NIH DTP tumor cell line database
  PubDock: a docking database for PubChem

Aggregating life science information from web and journal documents
  Data mining semantically rich chemistry journal articles
  Document similarity based on chemical structure similarity
  Evaluating semantic markup of chemistry journal articles

Integrating cheminformatics into the chemistry lab
  Integrating cheminformatics with the Second Life virtual world
  Integrating cheminformatics tools with electronic lab notebooks
  Usability of cheminformatics tools


Current research in the Guha lab

Predictive Modeling
  Interpretation, validation, domain applicability
  Generalization to other ‘models’ such as docking, pharmacophore etc.
  Integration of multiple data types
  Addressing imbalanced and noisy datasets

Analysis of Chemical Spaces
  Quantify distributions in spaces
  Investigation of density approaches
  Applications to lead hopping, model domains

Methods to summarize & compare data
  Applications to HTS and smaller lead series type datasets

Network models combining chemical structures and biological systems

Software and infrastructure
  Model exchange and annotation
  Pharmacophore representations, matching
  Toolkit development (CDK)


Cheminformatics web service infrastructure

Database Services
  PostgreSQL + gNova
  PubChem mirror (augmented)
  Pub3D - 3D structures for PubChem
  PubDock - Bound 3D structures
  Compound-indexed journal article DB
  NIH Human Tumor Cell Line
  Local PubChem mirror
  VARUNA quantum chemistry database

Statistics (based on R)
  Regression, LDA
  Neural Nets, Random Forest
  K-means clustering
  Plotting
  T-test and distribution sampling

Cheminformatics services
  Docking (FRED)
  3D structure generation (OMEGA)
  Filtering (FRED, etc)
  OSCAR3
  Fingerprints (BCI, CDK)
  Clustering (BCI)
  Toxicity prediction (ToxTree)
  R-based predictive models
  Similarity calculations (CDK)
  Descriptor calculation (CDK)
  2D structure diagrams (CDK)

Xiao Dong, Kevin E. Gilbert, Rajarshi Guha, Randy Heiland, Jungkee Kim, Marlon E. Pierce, Geoffrey C. Fox and David J. Wild, Web service infrastructure for chemoinformatics, Journal of Chemical Information and Modeling, 2007; 47(4) pp 1303-1307


RSC Project Prospect - what can we do with the information?

www.projectprospect.org

>100 papers marked up with SMILES/InChI (using OSCAR3), plus Gene Ontology and Goldbook Ontology terms

Created similarity searchable PostgreSQL / gNova database with paper DOIs, SMILES, and ontology terms

Web service and simple HTML interfaces for searching … “which papers reference compounds similar to this one in the scope of these ontological terms?”

Applying statistics to look at co-occurrence of compounds, structural features (MACCS keys) and ontological terms in papers
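The search described on this slide ("which papers reference compounds similar to this one in the scope of these ontological terms?") can be sketched in a few lines. This is only an illustration under stated assumptions: the slide's database does the similarity search inside PostgreSQL / gNova, whereas the sketch below is a pure-Python stand-in. The `PaperCompound` record, the `similar_papers` function, the threshold, the DOIs, and the fingerprint bits are all hypothetical. Similarity is the Tanimoto coefficient over sets of "on" fingerprint bits, a common choice for structural keys such as MACCS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PaperCompound:
    """One compound mention in one paper (hypothetical record layout)."""
    doi: str                # paper identifier (made-up values below)
    fingerprint: frozenset  # positions of the "on" bits in the fingerprint
    terms: frozenset        # ontology terms attached to the paper


def tanimoto(a, b):
    """Tanimoto coefficient of two bit sets: |a & b| / |a | b|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similar_papers(query_fp, scope_terms, records, threshold=0.7):
    """Return (doi, best score) pairs for papers that share at least one
    ontology term with scope_terms and mention a compound whose Tanimoto
    similarity to query_fp meets the threshold, best match first."""
    hits = {}
    for rec in records:
        if not (rec.terms & scope_terms):
            continue  # paper is outside the requested ontological scope
        score = tanimoto(query_fp, rec.fingerprint)
        if score >= threshold:
            hits[rec.doi] = max(score, hits.get(rec.doi, 0.0))
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))


# Made-up example data: three compound mentions in three papers.
RECORDS = [
    PaperCompound("10.0000/example.1", frozenset({1, 2, 3, 4}),
                  frozenset({"apoptosis"})),
    PaperCompound("10.0000/example.2", frozenset({1, 2, 3, 9}),
                  frozenset({"apoptosis", "kinase activity"})),
    PaperCompound("10.0000/example.3", frozenset({1, 2, 3, 4}),
                  frozenset({"photosynthesis"})),
]

results = similar_papers(frozenset({1, 2, 3, 4}), frozenset({"apoptosis"}),
                         RECORDS, threshold=0.5)
```

In this toy data the third paper contains an identical compound but is excluded because none of its ontology terms fall in the requested scope.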


By the way… annotation (mock-up!)

By the way…

This compound is very similar to a prescription drug, Tamoxifen.

This compound is referenced in 20 journal articles published in the last 5 years

Similar compounds are associated with the words “toxic” and “death” in 280 web pages

It appears to be covered under 3 patents

It has been shown to be active in 5 screens

Computer models predict it to show some activity against 8 protein targets

Here are some comments on this compound:

David Wild: don’t take any notice of the computational models - they are rubbish


Some useful chemical reactions

Iodoacetate a Iodoacetamide ICH2COO- ICH2CONH2

This may also react, chem favored by alkaline pH

….

Cheminformatics aware simple lab notebook (mock up!)

Free text input can be converted to machine readable form by Electrovaya

Automatic detection of data fields (yield, etc) where possible

Plug-in allows structures to be drawn with the pen and cleaned up

Web service interface provides access to computation and searching. Page is marked up by what is possible

FIND INFO ABOUT THIS REACTION

[Hand-drawn reaction structure diagram]


Automatic workflow generation and natural language queries

Develop service ontology using OWL-S or similar language

Allows service interoperability, replacement and input/output compatibility

We can then use generic reasoning and network analysis tools to find paths from inputs to desired outputs

Natural language can be parsed to inputs and desired outputs

Smart Clients <--> Agents <--> Services

Possible “supercharged life science Google?” - e.g. type in “what compounds might bind to the enclosed protein?”

[Workflow diagram. Services: 2D structure crawler, 2D similarity, 2D -> 3D, 3D search, pharmacophore search, dock. Data passed between services: 2D structures, 3D structures, 3D structures & complexes, 3D protein structure. Ontology annotations: dock = bind; 2D structures are compounds; 3D structures are compounds; result 3D structures are compounds.]
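The idea of finding "paths from inputs to desired outputs" can be illustrated with a small forward-chaining planner. This is a minimal sketch, not the project's implementation: the slide proposes an OWL-S ontology with generic reasoning tools, while the code below replaces that with a plain Python dictionary. The service names come from the diagram, but the input and output types assigned to each service, the "web pages" input type, and the `plan_workflow` function are assumptions made for illustration. Services that map a type to itself (2D similarity, 3D search, pharmacophore search) are left out to keep the example short.

```python
# Hypothetical service table: name -> (required input types, output type).
SERVICES = {
    "2D structure crawler": (("web pages",), "2D structures"),
    "2D -> 3D": (("2D structures",), "3D structures"),
    "dock": (("3D structures", "3D protein structure"),
             "3D structures & complexes"),
}


def plan_workflow(available, goal, services):
    """Return an ordered list of service names that produces `goal` from the
    `available` data types, or None if no chain of services can reach it."""
    have = set(available)
    producer = {}  # data type -> service that first produced it
    changed = True
    # Forward chaining: keep firing services whose inputs are all available.
    while changed and goal not in have:
        changed = False
        for name, (inputs, output) in services.items():
            if output not in have and set(inputs) <= have:
                have.add(output)
                producer[output] = name
                changed = True
    if goal not in have:
        return None

    plan = []

    def need(dtype):
        """Walk back from a data type, scheduling only the services used."""
        name = producer.get(dtype)
        if name is None or name in plan:
            return  # type was given as input, or service already scheduled
        for dep in services[name][0]:
            need(dep)
        plan.append(name)

    need(goal)
    return plan


# "What compounds might bind to the enclosed protein?" parsed, by assumption,
# into available inputs and a desired output type (dock = bind).
plan = plan_workflow({"web pages", "3D protein structure"},
                     "3D structures & complexes", SERVICES)
```

With these assumed signatures the planner chains the crawler, the 2D -> 3D conversion, and docking, and it returns None when a required input such as the crawlable source is missing.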