Great promise of navigating the internet using InChIs

Post on 10-May-2015

Category:

Technology

DESCRIPTION

The InChI, the International Chemical Identifier, has been the basis of both indexing and deduplication of the ChemSpider database since the inception of the platform. When the InChI was adopted we envisaged a future in which the identifier would proliferate across journals, databases and the internet in general, providing us with a basis for “structure searching the internet”. This presentation will provide an overview of how the InChI has facilitated the integration of ChemSpider with chemistry on the internet, describe some of the surprising findings that have resulted from this work, and extrapolate the influence of InChIs into the future of a chemically enabled web.

Transcript

Great promise of navigating the internet using InChIs

Antony J Williams, ACS San Diego, March 2012

Openness and Quality Issues

Williams and Ekins, DDT, 16: 747-750 (2011)

Science Translational Medicine 2011

Warning…

This talk is not about Quality…it’s about quantity

Drugbank was here

Data quality is a known issue

We ALL have issues!!!

How to Link it…

And getting out of overwhelm…

So what is Yohimbine?

Of course it is out there…

Drugbox: 3001/5080 with InChIs

Chembox: 5436/7690 with InChIs

Tell me more…

Where can I find the molfile for Yohimbine?

Papers/Patents about Yohimbine?

What are the side effects of Yohimbine?

Where can I order Yohimbine?

What are the physicochemical properties?

Metabolic pathways?

Different synonyms of Yohimbine?

Synthesis of Yohimbine?

Side effects of Yohimbine?

Etc…

How do we build it?

We deal in Molfiles or SDF files – with coordinates

Deposit anything that has an InChI – we support what InChI can handle, good and bad

Standardization based on “InChI standardization”

InChIs aggregate (certain) tautomers

We link out to external sites using their IDs

Downsides of InChI

InChI was a moving target (multiple versions) but overall worked as planned.

Good for small molecules – but no polymers, issues with inorganics, organometallics, imperfect stereochemistry. ChemSpider is “small molecules”

InChI used as the “deduplicator” – FIRST version of a compound into the database becomes THE structure to deduplicate against…
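The "first in wins" deduplication described above can be sketched as follows. This is only an illustration under stated assumptions, not ChemSpider's actual code: the record fields (`inchi`, `structure`, `source`) and the function name `deduplicate_by_inchi` are hypothetical, and the InChI strings are taken as already computed.

```python
def deduplicate_by_inchi(records):
    """Keep the first record seen for each InChI; later deposits with the
    same InChI only add their source to the existing entry."""
    kept = {}  # dicts preserve insertion order, so deposit order is retained
    for rec in records:
        inchi = rec["inchi"]
        if inchi not in kept:
            # The FIRST structure deposited becomes THE structure on file.
            kept[inchi] = {
                "inchi": inchi,
                "structure": rec["structure"],
                "sources": [rec["source"]],
            }
        else:
            # Same InChI: treated as the same compound, structure is not replaced.
            kept[inchi]["sources"].append(rec["source"])
    return list(kept.values())


# Hypothetical deposits: two depositors submit ethanol, one submits methanol.
deposits = [
    {"inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "structure": "molfile from A", "source": "A"},
    {"inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "structure": "molfile from B", "source": "B"},
    {"inchi": "InChI=1S/CH4O/c1-2/h2H,1H3", "structure": "molfile from C", "source": "C"},
]
unique = deduplicate_by_inchi(deposits)
```

The consequence noted on the slide follows directly: if the first deposit carries a poor depiction or layout, every later deposit with the same InChI inherits it.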

Side Effects of InChI Usage

SMILES by comparison…

Side Effects of InChI Usage

Standardization Issues

Depiction based on molfile

Downsides of Overall Approach

Meshing data together based on InChIs worked for simple molecules

2D layout errors inherited or limited by algorithm

Complex molecules that are meant to be the same thing were NOT deduplicated. Compounds differing by one stereocenter, named the same, meant to be the same, are not the same

Yohimbine on ChemSpider… Quality?

Recognizing Compound Dilution

So much chemistry on the web….

And so much dilution – “structural uniqueness” versus “accidental ambiguity”

InChI as an easy skeleton search

Vancomycin – Search the Internet

Vancomycin

Search Molecular SKELETON

Search Full Molecule

Full Skeleton Search
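One way to picture the difference between a skeleton search and a full-molecule search is to compare InChI strings with and without their stereochemical layers. The sketch below is an assumption about how such a skeleton could be derived, not a description of the search used in the talk: the helper name `inchi_skeleton` is hypothetical, and it simply keeps the formula, connection (`/c`) and hydrogen (`/h`) layers of a standard InChI.

```python
def inchi_skeleton(inchi: str) -> str:
    """Return the InChI truncated to its formula, /c and /h layers.

    Everything after those layers (charge, stereo, isotopes, ...) is dropped,
    so stereoisomers of one compound map to the same skeleton string.
    """
    prefix, _, rest = inchi.partition("/")  # prefix is e.g. "InChI=1S"
    layers = rest.split("/")
    kept = [layers[0]]  # the first layer is the molecular formula
    for layer in layers[1:]:
        if layer[:1] in ("c", "h"):
            kept.append(layer)
        else:
            break  # stop at the first layer that is not connectivity or hydrogens
    return prefix + "/" + "/".join(kept)


l_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
d_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

full_match = l_alanine == d_alanine
skeleton_match = inchi_skeleton(l_alanine) == inchi_skeleton(d_alanine)
```

Here the full-molecule comparison fails while the skeleton comparison succeeds, which is why a skeleton search returns more hits, including stereoisomers and records with missing stereochemistry.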

All aggregators suffer dilution!

Many Problems Can be Solved…

Clean up databases – structure validation, structure standardization

Warn about Valency, charge balance, depiction issues, bond types, absent stereo, and another 100 rules (or so…)

Standardize: agree community rules to “Standardize”

Structure Validation

Structure Validation - Fixed

What needs to happen?

If we could validate:

Catch errors in databases (and clean)

Proactively catch errors in publications/patents

Reduce junk in the ether – improve QUALITY!

If we standardized:

Interlinking should improve

Download, Deposit, Reprocess

Substructure    # of Hits   # of Correct Hits   No stereochemistry   Incomplete stereochemistry   Complete but incorrect stereochemistry
Gonane          34          5                   8                    21                           0
Gon-4-ene       55          12                  3                    33                           7
Gon-1,4-diene   60          17                  10                   23                           10
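As a quick check on the hit counts above, the short sketch below confirms that the four outcome columns add up to the total hits in each row and works out the share of correct hits. The column assignment (total, correct, no stereo, incomplete stereo, complete but incorrect stereo) is read from the slide's header order and is an assumption; the variable and function names are illustrative.

```python
# (total hits, correct, no stereo, incomplete stereo, complete but incorrect stereo)
rows = {
    "Gonane": (34, 5, 8, 21, 0),
    "Gon-4-ene": (55, 12, 3, 33, 7),
    "Gon-1,4-diene": (60, 17, 10, 23, 10),
}


def percent_correct(table):
    """Return the percentage of correct hits per substructure, rounded to 0.1."""
    result = {}
    for name, (hits, correct, no_stereo, incomplete, incorrect) in table.items():
        # The four outcome categories should account for every hit.
        if correct + no_stereo + incomplete + incorrect != hits:
            raise ValueError(f"categories do not sum to total hits for {name}")
        result[name] = round(100 * correct / hits, 1)
    return result


shares = percent_correct(rows)
```

Under that reading, fewer than a third of the hits for any of the three substructures are correct.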

Structure-Name Validation

[Slide figure: structure drawings labeled Choladine, Taxol, Chlotrimazole and Cholane]

Standardize

Use the SRS as a guidance document for standardization

Adjust as necessary to our needs

Nitro groups

Salt and Ionic Bonds

Ammonium salts

Millions of structures? Lots of Issues

ChemSpider Standardization

Entire ChemSpider database will be standardized using modified FDA rule set

Original Molfiles will be standardized and all properties (predicted properties, SMILES, InChIs, Names) will all be regenerated

Standardization procedures automatically applied to all future depositions

Identifier Dictionaries

Reciprocal curation processes…share curation with each other.

If a database already has a compound, then use InChIKeys to match the “suggested” validation against the compound.

A series of “added” and “removed” synonyms against InChIKeys for matching.
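A feed of “added” and “removed” synonyms keyed by InChIKey could be applied to a local identifier dictionary roughly as shown below. The feed layout (tuples of InChIKey, action, synonym), the function name `apply_synonym_feed` and the InChIKeys themselves are hypothetical placeholders, since the slides do not specify a format.

```python
def apply_synonym_feed(dictionary, feed):
    """Apply curated synonym changes to a dict of InChIKey -> set of synonyms.

    Entries for InChIKeys that are not in the local dictionary are returned
    unapplied, because there is no compound to match them against.
    """
    unmatched = []
    for inchikey, action, synonym in feed:
        if inchikey not in dictionary:
            unmatched.append((inchikey, action, synonym))
            continue
        if action == "added":
            dictionary[inchikey].add(synonym)
        elif action == "removed":
            dictionary[inchikey].discard(synonym)  # no error if already absent
        else:
            raise ValueError(f"unknown action: {action!r}")
    return unmatched


# Placeholder InChIKeys, not real identifiers.
local = {
    "AAAAAAAAAAAAAA-BBBBBBBBBB-N": {"compound A", "wrong name"},
}
feed = [
    ("AAAAAAAAAAAAAA-BBBBBBBBBB-N", "added", "compound A (approved)"),
    ("AAAAAAAAAAAAAA-BBBBBBBBBB-N", "removed", "wrong name"),
    ("CCCCCCCCCCCCCC-DDDDDDDDDD-N", "added", "compound C"),
]
unmatched = apply_synonym_feed(local, feed)
```

Keying the feed on InChIKeys means two databases can exchange curation decisions without sharing internal record IDs.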

Proof of Concept Data Curation Sharing

Who wants to work with us?

Structure Validation using feed

Look for approved synonyms

Compare feed InChIKey with database InChIKey

If different, flag for inspection
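The three steps above can be sketched as a small comparison routine. This is an illustration under assumptions: the feed and database are modeled as plain dicts mapping an approved synonym to an InChIKey, the function name `flag_mismatches` is hypothetical, and the InChIKeys are made-up placeholders. The extra skeleton flag relies on the first 14-character block of an InChIKey, which is derived from the connectivity part of the InChI.

```python
def flag_mismatches(feed, database):
    """Compare feed InChIKeys with database InChIKeys for shared approved synonyms.

    Returns a list of (synonym, feed_key, database_key, same_skeleton) tuples
    for every synonym whose keys differ, so a curator can inspect them.
    """
    flagged = []
    for synonym, feed_key in feed.items():
        db_key = database.get(synonym)  # step 1: look for the approved synonym
        if db_key is None:
            continue  # synonym not in our database, nothing to compare
        if db_key != feed_key:  # step 2: compare the InChIKeys
            same_skeleton = db_key.split("-")[0] == feed_key.split("-")[0]
            flagged.append((synonym, feed_key, db_key, same_skeleton))  # step 3: flag
    return flagged


# Placeholder keys, not real identifiers.
feed = {
    "compound A": "AAAAAAAAAAAAAA-BBBBBBBBBB-N",
    "compound B": "CCCCCCCCCCCCCC-DDDDDDDDDD-N",
    "compound C": "EEEEEEEEEEEEEE-FFFFFFFFFF-N",
}
database = {
    "compound A": "AAAAAAAAAAAAAA-GGGGGGGGGG-N",  # same skeleton, other layers differ
    "compound B": "CCCCCCCCCCCCCC-DDDDDDDDDD-N",  # identical
}
flagged = flag_mismatches(feed, database)
```

A flagged entry with a matching skeleton is the typical case of a record that differs from the feed only in stereochemistry or another later layer.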

It is so difficult to navigate…

What’s the structure?

Are they in our file?

What’s similar?

What’s the target?

Pharmacology data?

Known Pathways?

Working On Now?

Connections to disease?

Expressed in right cell type?

Competitors?

IP?

Open PHACTS Project

Develop a set of robust standards…

Implement the standards in a semantic integration hub

Deliver services to support drug discovery programs in pharma and public domain

22 partners, 8 pharmaceutical companies, 3 biotechs

36 months project

Guiding principle is open access, open usage, open source: key to standards adoption

Chemistry in Open PHACTS

Selected data slices of ChemSpider carrying pharmacological links into the “linked data cache”

ChemSpiderIDs and InChIs/InChIKeys will be in Open PHACTS and available for linking

A structure ID standard to enable further linking across the semantic web of science

Internet Data

ChemSpider and InChI

Commercial Software, Pre-competitive Data

Open Science, Open Data, Publishers, Educators

Open Databases, Chemical Vendors

Small organic molecules, Undefined materials, Organometallics, Nanomaterials, Polymers, Minerals, Particle bound, Links to Biologicals

The great promise should be obvious

InChIs are here to stay

They will evolve, they will encompass, we will adopt and adapt

Public and private databases will federate & build a linked environment of validated data!

Data validation and standardization is needed

Open Data will continue to proliferate

InChIs are in the “Semantic Web” already

If InChI never existed or went away…

ChemSpider would never have been built

Database linking would suffer dramatically

The web would not be “structure searchable”

Cheminformatics tools would likely not be linking to public domain databases in the same way

And we would not have the pleasure of today…

Acknowledgments

The inspiration of the InChI Masters – Steve H., Steve S., Alan, Dmitrii, Igor

IUPAC, NIST, all adopters, supporters, challengers and users

The InChI Trust and its supporters for funding continued development

Al Gore – enabling us to search InChIs on the web

Steve Heller

Thank you

Email: williamsa@rsc.org

Twitter: ChemConnector

Personal Blog: www.chemconnector.com

SLIDES: www.slideshare.net/AntonyWilliams
