Great promise of navigating the internet using InChIs Antony J Williams ACS San Diego March 2012

May 10, 2015


The InChI, the International Chemical Identifier, has been the basis of both indexing and deduplication of the ChemSpider database since the inception of the platform. When the InChI was adopted we envisaged a future in which the identifier would proliferate across journals, databases and the internet in general, providing us with a basis for “structure searching the internet”. This presentation will provide an overview of how the InChI has facilitated the integration of ChemSpider with chemistry on the internet, describe some of the surprising findings that have resulted from this work, and extrapolate the influence of InChIs into the future for a chemically enabled web.
Transcript
Page 1: Great promise of navigating the internet using in chis

Great promise of navigating the internet using InChIs

Antony J Williams

ACS San Diego March 2012

Page 2: Great promise of navigating the internet using in chis

Openness and Quality Issues

Williams and Ekins, DDT, 16: 747-750 (2011)

Science Translational Medicine 2011

Page 3: Great promise of navigating the internet using in chis

Warning…

This talk is not about Quality…it’s about quantity

Page 4: Great promise of navigating the internet using in chis

Warning…

This talk is not about Quality…it’s about quantity

Drugbank was here

Page 5: Great promise of navigating the internet using in chis

Data quality is a known issue

Page 6: Great promise of navigating the internet using in chis

We ALL have issues!!!

Page 8: Great promise of navigating the internet using in chis

How to Link it…

Page 9: Great promise of navigating the internet using in chis

And getting out of overwhelm…

Page 10: Great promise of navigating the internet using in chis

So what is Yohimbine?

Page 11: Great promise of navigating the internet using in chis

Of course it is out there…

Drugbox: 3001/5080 with InChIs
Chembox: 5436/7690 with InChIs

Page 12: Great promise of navigating the internet using in chis

Tell me more…

Where can I find the molfile for Yohimbine?
Papers/Patents about Yohimbine?
What are the side effects of Yohimbine?
Where can I order Yohimbine?
What are the physicochemical properties?
Metabolic pathways?
Different synonyms of Yohimbine?
Synthesis of Yohimbine?
Side effects of Yohimbine?
Etc….

Page 15: Great promise of navigating the internet using in chis

How do we build it?

We deal in Molfiles or SDF files – with coordinates

Deposit anything that has an InChI – we support what InChI can handle, good and bad

Standardization based on “InChI standardization”

InChIs aggregate (certain) tautomers

We link out to external sites using their IDs

Page 16: Great promise of navigating the internet using in chis

Downsides of InChI

InChI was a moving target (multiple versions) but overall worked as planned.

Good for small molecules – but no polymers, issues with inorganics, organometallics, imperfect stereochemistry. ChemSpider is “small molecules”

InChI used as the “deduplicator” – FIRST version of a compound into the database becomes THE structure to deduplicate against…
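The "first version in wins" behaviour can be sketched as a store keyed on the InChI string. This is a minimal illustration, not ChemSpider's actual implementation: `CompoundRecord`, `CompoundStore`, the source names and the external IDs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CompoundRecord:
    """One deduplicated compound (hypothetical record layout)."""
    inchi: str
    molfile: str  # the structure exactly as first deposited
    links: dict = field(default_factory=dict)  # source name -> external ID


class CompoundStore:
    """Toy store that deduplicates depositions on their InChI string."""

    def __init__(self):
        self._by_inchi = {}

    def deposit(self, inchi, molfile, source, external_id):
        """Add a deposition; return True if it created a new record."""
        record = self._by_inchi.get(inchi)
        is_new = record is None
        if is_new:
            # The first deposition becomes the structure that all later
            # depositions with the same InChI are deduplicated against.
            record = CompoundRecord(inchi=inchi, molfile=molfile)
            self._by_inchi[inchi] = record
        # Later depositions only add a link-out using the source's own ID.
        record.links[source] = external_id
        return is_new

    def get(self, inchi):
        return self._by_inchi.get(inchi)

    def __len__(self):
        return len(self._by_inchi)


store = CompoundStore()
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
store.deposit(ETHANOL, "molfile from source A", "SourceA", "A-001")
store.deposit(ETHANOL, "molfile from source B", "SourceB", "B-42")
print(len(store), store.get(ETHANOL).links)
```

The second deposition adds a link but does not replace the stored molfile, which is why a poor first depiction can persist for the record.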

Page 17: Great promise of navigating the internet using in chis

Side Effects of InChI Usage

Page 18: Great promise of navigating the internet using in chis

SMILES by comparison…

Page 19: Great promise of navigating the internet using in chis

Side Effects of InChI Usage

Page 20: Great promise of navigating the internet using in chis

Standardization Issues

Depiction based on molfile

Page 21: Great promise of navigating the internet using in chis

Downsides of Overall Approach

Meshing data together based on InChIs worked for simple molecules

2D layout errors inherited or limited by algorithm

Complex molecules that are meant to be the same thing were NOT deduplicated. Compounds differing by one stereocenter, named the same, meant to be the same, are not the same
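One way to surface such cases is to compare records on their InChI with the stereo layers removed. The sketch below assumes standard InChI strings, treats the `/b`, `/t`, `/m` and `/s` layers as the stereo layers, and uses alanine as a small stand-in example; the function names and the sample records are illustrative only.

```python
from collections import defaultdict

# Layers of a standard InChI that carry stereochemistry.
STEREO_LAYER_PREFIXES = ("b", "t", "m", "s")


def strip_stereo(inchi):
    """Drop the /b, /t, /m and /s layers from a standard InChI string."""
    head, *layers = inchi.split("/")
    kept = [layer for layer in layers
            if not layer.startswith(STEREO_LAYER_PREFIXES)]
    return "/".join([head, *kept])


def stereo_splits(records):
    """Names attached to several InChIs that differ only in stereo layers.

    records: iterable of (name, inchi) pairs.
    """
    variants = defaultdict(set)
    for name, inchi in records:
        variants[(name.lower(), strip_stereo(inchi))].add(inchi)
    return sorted({name for (name, _), inchis in variants.items()
                   if len(inchis) > 1})


L_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"

records = [("Alanine", L_ALA), ("alanine", D_ALA), ("ethanol", ETHANOL)]
print(stereo_splits(records))
```

Names reported this way are candidates for manual curation; the check cannot say which stereo assignment is the intended one.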

Page 22: Great promise of navigating the internet using in chis

Yohimbine on ChemSpider..Quality?

Page 30: Great promise of navigating the internet using in chis

Recognizing Compound Dilution

So much chemistry on the web….

And so much dilution – “structural uniqueness” versus “accidental ambiguity”

InChI as an easy skeleton search
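A skeleton search can be approximated with the standard InChIKey, whose first 14-character block hashes the connectivity part of the InChI, while the second block covers stereochemistry and other layers. The sketch below assumes an in-memory index of (InChIKey, URL) pairs; the URLs are placeholders and alanine is used only as a compact example.

```python
def skeleton_block(inchikey):
    """First block of a standard InChIKey (connectivity hash)."""
    return inchikey.split("-")[0]


def search(index, query_key, skeleton_only=False):
    """Return URLs matching the full InChIKey, or its first block only."""
    if skeleton_only:
        wanted = skeleton_block(query_key)
        return [url for key, url in index if skeleton_block(key) == wanted]
    return [url for key, url in index if key == query_key]


# Hypothetical index; example.org URLs are placeholders.
INDEX = [
    ("QNAYBMKLOCPYGJ-REOHZZAKSA-N", "https://example.org/l-alanine"),
    ("QNAYBMKLOCPYGJ-UWTATZPHSA-N", "https://example.org/d-alanine"),
    ("QNAYBMKLOCPYGJ-UHFFFAOYSA-N", "https://example.org/alanine-no-stereo"),
    ("LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "https://example.org/ethanol"),
]

L_ALANINE = "QNAYBMKLOCPYGJ-REOHZZAKSA-N"
full_hits = search(INDEX, L_ALANINE)
skeleton_hits = search(INDEX, L_ALANINE, skeleton_only=True)
print(len(full_hits), len(skeleton_hits))
```

The gap between the two hit counts is a rough measure of the dilution described above: every extra skeleton hit is a record that shares the connectivity but differs in, or lacks, stereochemistry.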

Page 31: Great promise of navigating the internet using in chis

Vancomycin – Search the Internet

Page 32: Great promise of navigating the internet using in chis

Vancomycin

Search Molecular SKELETON

Search Full Molecule

Page 33: Great promise of navigating the internet using in chis

Full Skeleton Search

Page 34: Great promise of navigating the internet using in chis

All aggregators suffer dilution!

Page 35: Great promise of navigating the internet using in chis

Many Problems Can be Solved…

Clean up databases – structure validation, structure standardization

Warn about valency, charge balance, depiction issues, bond types, absent stereo, and another 100 rules (or so…)

Standardize

Agree community rules to “Standardize”

Page 36: Great promise of navigating the internet using in chis

Structure Validation

Page 37: Great promise of navigating the internet using in chis

Structure Validation - Fixed

Page 38: Great promise of navigating the internet using in chis

What needs to happen?

If we could validate:
Catch errors in databases (and clean)
Proactively catch errors in publications/patents
Reduce junk in the ether – improve QUALITY!

If we standardized:
Interlinking should improve

Page 41: Great promise of navigating the internet using in chis

Download, Deposit, Reprocess

Page 42: Great promise of navigating the internet using in chis

Substructure     # of Hits   # of Correct Hits   No stereochemistry   Incomplete stereochemistry   Complete but incorrect stereochemistry
Gonane           34          5                   8                    21                           0
Gon-4-ene        55          12                  3                    33                           7
Gon-1,4-diene    60          17                  10                   23                           10
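The counts in the table above can be turned into the share of correct hits per substructure. The short sketch below only restates the slide's numbers; the variable and function names are illustrative.

```python
# (hits, correct, no stereo, incomplete stereo, complete but incorrect stereo)
ROWS = {
    "Gonane": (34, 5, 8, 21, 0),
    "Gon-4-ene": (55, 12, 3, 33, 7),
    "Gon-1,4-diene": (60, 17, 10, 23, 10),
}


def percent_correct(hits, correct):
    """Share of hits with correct stereochemistry, as a percentage."""
    return round(100 * correct / hits, 1)


summary = {}
for name, (hits, correct, none, incomplete, incorrect) in ROWS.items():
    # The four outcome columns should account for every hit.
    assert correct + none + incomplete + incorrect == hits
    summary[name] = percent_correct(hits, correct)

for name, pct in summary.items():
    print(f"{name}: {pct}% correct")
```

In each row the four outcome columns add up to the hit count, and fewer than a third of the hits carry correct stereochemistry.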

Page 43: Great promise of navigating the internet using in chis

Structure-Name Validation

[Figure: chemical structure drawings labelled Choladine, Taxol, Chlotrimazole and Cholane]

Page 44: Great promise of navigating the internet using in chis

Standardize

Use the SRS as a guidance document for standardization

Adjust as necessary to our needs

Page 45: Great promise of navigating the internet using in chis

Nitro groups

Page 46: Great promise of navigating the internet using in chis

Salt and Ionic Bonds

Page 47: Great promise of navigating the internet using in chis

Ammonium salts

Page 48: Great promise of navigating the internet using in chis

Millions of structures? Lots of Issues

Page 49: Great promise of navigating the internet using in chis

ChemSpider Standardization

Entire ChemSpider database will be standardized using modified FDA rule set

Original Molfiles will be standardized and all properties (predicted properties, SMILES, InChIs, Names) will all be regenerated

Standardization procedures automatically applied to all future depositions

Page 50: Great promise of navigating the internet using in chis

Identifier Dictionaries

Reciprocal curation processes…share curation with each other.

If a database already has a compound, then use InChIKeys to match “suggested” validation against the compound.

A series of “added” and “removed” synonyms against InChIKeys for matching.
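Such an exchange could be as simple as a list of synonym changes keyed on InChIKey. The sketch below is one possible shape for it, not an agreed format: the `(inchikey, action, name)` tuple layout, the function name and the sample synonyms are assumptions.

```python
def apply_curation(synonyms, deltas):
    """Apply shared curation to a local synonym dictionary.

    synonyms: dict mapping InChIKey -> set of names held locally.
    deltas:   iterable of (inchikey, action, name), where action is
              "added" or "removed" (hypothetical exchange format).
    Returns the deltas that were skipped because the compound is absent.
    """
    skipped = []
    for inchikey, action, name in deltas:
        names = synonyms.get(inchikey)
        if names is None:
            # Only compounds already in the database are matched.
            skipped.append((inchikey, action, name))
        elif action == "added":
            names.add(name)
        elif action == "removed":
            names.discard(name)
        else:
            raise ValueError(f"unknown action: {action!r}")
    return skipped


local = {"LFQSCWFLJHTTHZ-UHFFFAOYSA-N": {"ethanol", "methanol"}}
shared = [
    ("LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "removed", "methanol"),
    ("LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "added", "ethyl alcohol"),
    ("QNAYBMKLOCPYGJ-REOHZZAKSA-N", "added", "L-alanine"),
]
skipped = apply_curation(local, shared)
print(local, skipped)
```

In practice a receiving database would probably treat these as suggestions to review rather than apply them blindly, in line with the “suggested” validation wording above.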

Page 51: Great promise of navigating the internet using in chis

Proof of Concept Data Curation Sharing

Who wants to work with us?

Page 52: Great promise of navigating the internet using in chis

Structure Validation using feed

Look for approved synonyms

Compare feed InChIKey with database InChIKey

If different, flag for inspection
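The three steps above can be sketched as a lookup followed by a comparison. The data shapes here are assumptions: the database is modelled as a dictionary from approved synonym to InChIKey, the feed as (name, InChIKey) pairs, and the deliberately wrong feed entry is invented for the example.

```python
def validate_against_feed(approved, feed):
    """Flag feed entries whose InChIKey disagrees with the database.

    approved: dict mapping approved synonym (lower case) -> InChIKey.
    feed:     iterable of (name, inchikey) pairs from a partner.
    Returns a list of (name, database_key, feed_key) needing inspection.
    """
    flagged = []
    for name, feed_key in feed:
        db_key = approved.get(name.lower())
        if db_key is None:
            continue  # no approved synonym to compare against
        if db_key != feed_key:
            flagged.append((name, db_key, feed_key))
    return flagged


approved = {
    "ethanol": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
    "l-alanine": "QNAYBMKLOCPYGJ-REOHZZAKSA-N",
}
feed = [
    ("Ethanol", "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"),    # agrees
    ("L-Alanine", "QNAYBMKLOCPYGJ-UWTATZPHSA-N"),  # D-form key: mismatch
    ("Unknown name", "AAAAAAAAAAAAAA-UHFFFAOYSA-N"),  # placeholder key
]
flagged = validate_against_feed(approved, feed)
print(flagged)
```

A flag only says the two sources disagree; deciding which structure is right still needs a curator.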

Page 53: Great promise of navigating the internet using in chis

It is so difficult to navigate…

What’s the structure?
Are they in our file?
What’s similar?
What’s the target?
Pharmacology data?
Known Pathways?
Working On Now?
Connections to disease?
Expressed in right cell type?
Competitors?
IP?

Page 54: Great promise of navigating the internet using in chis

Open PHACTS Project

Develop a set of robust standards…
Implement the standards in a semantic integration hub
Deliver services to support drug discovery programs in pharma and public domain
22 partners, 8 pharmaceutical companies, 3 biotechs
36 months project

Guiding principle is open access, open usage, open source - Key to standards adoption -

Page 55: Great promise of navigating the internet using in chis
Page 56: Great promise of navigating the internet using in chis

Chemistry in Open PHACTS

Selected data slices of ChemSpider carrying pharmacological links into the “linked data cache”

ChemSpiderIDs and InChIs/InChIKeys will be in Open PHACTS and available for linking

A structure ID standard to enable further linking across the semantic web of science

Page 57: Great promise of navigating the internet using in chis

Internet Data

ChemSpider and InChI

Commercial Software, Pre-competitive Data

Open Science, Open Data, Publishers, Educators

Open Databases, Chemical Vendors

Small organic molecules, Undefined materials, Organometallics, Nanomaterials, Polymers, Minerals, Particle bound, Links to Biologicals

Page 58: Great promise of navigating the internet using in chis

The great promise should be obvious

InChIs are here to stay
They will evolve, they will encompass, we will adopt and adapt
Public and private databases will federate & build a linked environment of validated data!
Data validation and standardization is needed
Open Data will continue to proliferate
InChIs are in the “Semantic Web” already

Page 59: Great promise of navigating the internet using in chis

If InChI never existed or went away..

ChemSpider would never have been built

Database linking would suffer dramatically

The web would not be “structure searchable”

Cheminformatics tools would likely not be linking to public domain databases in the same way

And we would not have the pleasure of today…

Page 60: Great promise of navigating the internet using in chis

Acknowledgments

The inspiration of the InChI Masters – Steve H., Steve S., Alan, Dmitrii, Igor

IUPAC, NIST, all adopters, supporters, challengers and users

The InChI Trust and its supporters for funding continued development

Al Gore – enabling us to search InChIs on the web

Page 61: Great promise of navigating the internet using in chis

Steve Heller

Page 62: Great promise of navigating the internet using in chis

Steve Heller

Page 63: Great promise of navigating the internet using in chis

Thank you

Email: [email protected]
Twitter: ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams