Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.

Building a Chemical Informatics Grid

Marlon Pierce

Community Grids Laboratory

Indiana University

CICC Project Information

“Chemical Informatics and Cyberinfrastructure Collaboratory” is an NIH and MS-funded research project to combine the CI’s.

Project web site and more information www.chembiogrid.org www.chembiogrid.org/wiki

Team members include Computer Science: Geoffrey Fox (PI), Dennis Gannon, Beth Plale, Marlon

Pierce, Yuqing (Melanie) Wu, Malika Mahoui, Jake Kim Chemical Informatics and Chemistry: Gary Wiggins, Mu-Hyun (Mookie)

Baik, David Wild, Rajarshi Guha, Kevin Gilbert I have stolen slides and content from these fine people.

We collaborate with several groups Peter Murray Rust’s group at University of Cambridge University of Michigan’s MACE group. Chemistry Development Kit (CDK) project DTP NIC at NIH Scripps High Throughput Screening Center

Chemical Informatics and Cyber-infrastructure Building Blocks Chemical Informatics Resources:

Deluge of experimental data > 100,000 compounds screened by 10 publicly funded high throughput

screening centers using various assay techniques (molecular to cellular) Molecular Libraries Screening Center Network

Chemical databases maintained by various groups NIH PubChem, NIH DTP

Chemical informatics and computational chemistry Data clustering, data mining, descriptor calculations, toxicity prediction,

docking, molecular modeling, and quantum chemistry Visualization tools Web resources: journal articles, etc.

A Chemical Informatics Grid will need to integrate these into a common, loosely coupled, open, distributed computing environment.

Our Solution Stack Domain specific Web Services

VOTables, CDK services Grid services, Cyber-

infrastructure for computationally intensive applications. Clustering, quantum chemistry

Workflow and service management We work with Taverna Many solutions: Kepler, BPEL

engines, etc. Portlets and other user

interfaces Rich desktop apps Ubiquitous clients

Portals and Other User Interfaces

Workflow and ServiceManagement

Web and Grid Services

Each level is subject for research and development, as is their integration.

A Library of Chemical Informatics Web Services

All Services Great and Small

Like most Grids, a Chemical Informatics Grid will have the classic styles: Data Grid Services: these provide access to data sources

like PubChem, etc. Execution Grid Services: used for running cluster analysis

programs, molecular modeling codes, etc, on TeraGrid and similar places.

But we also need many additional services Handling format conversions (InChI<->SMILES) Shipping and manipulating tabular data Determining toxicity of compounds Generating batch 2D images

So one of our core activities is “build lots of services”

VOTables: Handling Tabular Data Developed by the Virtual Observatory community for encoding

astronomy data. The VOTable format is an XML representation of the tabular

data (data coming from BCI, NIH DTP databases, and so on). VOTables-compatible tools have been built

We just inherit them. SAVOT and JAVOT JAVA Parser APIs for VOTable allow us

to easily build VOTable-based applications Web Services Spread sheet Plotting applications.

VOPlot and TopCat are two

mrtd1.txt – smiles representation of chemical compounds along with its properties

Votable.xml : xml representation of mrtd1.txt file

VOPlot Application from generated votable.xml file : Graph plotted on Mass (X–axis) and PSA (Y-axis)

More Services: WWMM ServicesServices Descriptions Input Output

InChIGoogle Search an InChI structure through Google

inchiBasic

type

Search result in HTML format

InChIServer Generate InChI version

format

An InChI structure

OpenBabelServer

Transform a chemical format to another using Open Babel

format

inputData

outputData

options

Converted chemical structure string

CMLRSSServer

Generate CMLRSS feed from CML data

mol, title description link, source

Converted CMLRSS feed of CML data

CDK-Based Services

Common Substructure

Calculates the common substructure between two molecules.

CDKsim Takes two SMILES and evaluates the Tanimoto coefficient (ratio of intersection to union of their fingerprints).

CDKdesc Calculates a variety of molecular and atomic descriptors for QSAR modeling

CDKws Fingerprint generation

CDKsdg Creates a jpeg of the compound’s 2D structure

CDKStruct3D Generates 3D coordinates of a molecule from its SMILE

ToxTree Service The Threshold of Toxicological

Concern (TTC) establishes a level of exposure for all chemicals below which there would be no appreciable risk to human health.

ToxTree implements the Cramer Decision Tree approach to estimate TTC.

We have converted this into a service. Uses SMILES as input. Note the GUI must be

separated from the library to be a service

http://ecb.jrc.it/QSAR/home.php?CONTENU=/QSAR/qsar_tools/qsar_tools_toxtree.php

OSCAR3 Service Oscar3 is a tool for shallow, chemistry-specific

natural language parsing of chemical documents (i.e. journal articles).

It identifies (or attempts to identify): Chemical names: singular nouns, plurals, verbs etc., also

formulae and acronyms. Chemical data: Spectra, melting/boiling point, yield etc. in

experimental sections. Other entities: Things like N(5)-C(3) and so on.

Results are exported as an XML file. There is a larger effort, SciBorg, in this area

http://www.cl.cam.ac.uk/~aac10/escience/sciborg.html It also has potentially very interesting Workflows

http://wwmm.ch.cam.ac.uk/wikis/wwmm/index.php/Oscar3

Use Cases and Workflows

Putting data and clustering together in a distributed environment.

A Workflow Scenario: HTS Data Organization and Flagging This workflow demonstrates how screening data can be flagged

and organized for human analysis. The compounds and data values for a particular screen are

retrieved from the NIH DTP database and then are filtered to remove compounds with reactive groups, etc. A tumor cell line is selected. The activity results for all the

compounds in the DTP database in the given range are extracted from the PostgreSQL database

OpenEye FILTER is used to calculate biological and chemical properties of the compounds that are related to their potential effectiveness as drugs

ToxTree is used to flag the potential toxicities of compounds. Divkmeans is used to add a column of cluster numbers. Finally, the results are visualized using VOPlot and the 2D

viewer applet.

HTS data organization & flagging

A tumor cell line is selected. The activity results for all the compounds in the DTP database in the given range are extracted from the PostgreSQL database

The compounds are clustered on chemical structure

similarity, to group similar compounds together

The compounds along with property and cluster information are converted to VOTABLES format and displayed in VOPLOT

OpenEye FILTER is used to calculate biological and chemical properties of the compounds that are related to their potential effectiveness as drugs

Web Services

Example plots of our workflow output using VOPlot and VOTables

Chemical Informatics and the TeraGrid

A Workflow for IU’s Big Red Demo PubMed abstracts

555,007 PubMed abstracts of 2005 – 2006 (part) 1,000 abstracts per node

511 nodes X 1,000 input abstracts used for the demo OSCAR3

Extracts chemical information from text and produces an XML instance highlighting the chemical information

SMILES extraction Extracting SMILES elements from OSCAR’s XML output files Unique SMILES list within a batch

Use this to drive docking and molecular modeling applications.

Bigger Picture for the Workflow

NIHPubMed

Database

OSCARText

Analysis

POV-RayParallel

Rendering

Initial 3DStructure

Calculation

ToxicityFiltering

ClusterGrouping

Docking

MolecularMechanics

Calculations

Quantum Mechanics

Calculations

IU’sVaruna

Database

NIHPubChemDatabase

Big Red Demo

High Throughput Screening (HTS)

Data Organization and Flagging

A Workflow for Big Red Demo

Final HTML pages

VARUNA – Towards a Grid-based Molecular Modeling Environment

Taking the Big Red demo from stunt to science.

Automatic Quantum Mechanical Curation of Structure Data (AutoGeFF) Chemical research logic is often driven by molecular structure

Large-scale, small molecule DB’s (such as PubChem and, through OSCAR, PubMed) have low-resolution structure data

Often key properties are not consistently available: e.g.: Rotation-barriers, Redox Potentials, Polarizabilities, IR

frequencies, reactivity towards nucleophiles QM web-services will provide tools for generating high-resolution

data Produces a new, curated database of QM results

These can then be combined with databases of proteins (PDB, MOAD, PDB-Bind) for docking and other detailed simulation studies.

Prototype-Project: Controlling the TGF pathway

PDB

1IAS1IASInactive TGF

VARUNA

Experimentsin the Zhang

Lab

Active TGFActive TGFWith inhibitorWith inhibitor

PubChem

in-house Molecules in Varuna

Conceptual Conceptual Understanding of Understanding of TGFTGF

InhibitionInhibition

SimulationsAutoGeFFAutoGeFF

Questions:

- What molecular feature controls inhibitor binding?

- How do mutations impact binding?

More Information

Contact me: [email protected] Website and wiki:

www.chembiogrid.org www.chemibiogrid.org/wiki

We have project plus collaborator mailing lists if you really are interested.

Building a Chemical Informatics Grid Marlon Pierce Community Grids Laboratory Indiana University.

Documents

data grid services

chemical format

cdk services grid services

tabular data data

nih dtp chemical informatics

chemical names

file slide

services great