Hosting public domain chemicals data online for the community – the challenges of handling materials. Antony Williams, NIST Diffusion/CALPHAD Data Informatics and Tools Workshop, May 14th, 2015. ORCID ID: 0000-0002-2668-4821

Hosting Public Domain Chemicals Data Online for the Community – the Challenges of Handling Materials

Jul 18, 2015

Transcript
Page 1: Hosting Public Domain Chemicals Data Online for the Community – the Challenges of Handling Materials

Hosting public domain chemicals data online for the community – the challenges of handling materials

Antony Williams

NIST Diffusion/CALPHAD Data Informatics and Tools Workshop

May 14th, 2015

ORCID ID: 0000-0002-2668-4821

Page 2

Disclaimer…

• Previously at the Royal Society of Chemistry
• Now I am here…

Page 3

Many challenges are the same

• The challenges I will discuss (for publishers, public domain databases and curated chemistry) are the same…
• Need capable tools to handle the data
• Need standards for data exchange
• Meshing data without review is dangerous!
• Quality costs – time, effort and money
• Algorithms can help clean data

Page 4

Where is chemistry online?

• Encyclopedic articles (Wikipedia)
• Chemical vendor databases
• Metabolic pathway databases
• Property databases
• Patents with chemical structures
• Drug Discovery data
• Scientific publications
• Compound aggregators
• Blogs/Wikis and Open Notebook Science

Page 5

Chemistry on the Internet…

• Most searching for chemistry on the internet…
• Name searching Google/Bing/Yahoo
• Name searching Wikipedia
• Name searching Wolfram Alpha
• Name, name, name, name…searching

Page 6

The issue of identifiers

Page 7
Page 8
Page 9

Some names for Aspirin…

Page 10

The CAS Number

• MUCH integration is done using CAS Numbers
• MANY searches are CAS Numbers and Names
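Because so much integration is keyed on CAS Numbers, a cheap sanity check before merging records is to verify the check digit. The sketch below assumes the publicly documented format (two to seven digits, two digits, one check digit) and the weighted-sum rule; the function name `is_valid_cas` is an illustrative choice, not part of any CAS tool. A passing check only shows the number is well formed, not that it is assigned or attached to the right substance.

```python
import re

# Format: 2-7 digits, hyphen, 2 digits, hyphen, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas: str) -> bool:
    """Return True if the string is a well-formed CAS Number with a correct check digit."""
    match = CAS_PATTERN.match(cas.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost body digit.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit


if __name__ == "__main__":
    for candidate in ("50-78-2", "7732-18-5", "50-78-3"):
        print(candidate, is_valid_cas(candidate))
```

Running this kind of filter first keeps obviously mistyped numbers out of a name-and-number merge, but it does not replace curation.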

Page 11

CAS Numbers are GREAT!

Page 12

The CAS Number Index grows…

Page 13

SciFinder

Page 14

Prophetic Enumeration

Page 15

CAS Numbers are “Trademarked”?

• From http://www.cas.org/legal/infopolicy

Page 16

CAS and Wikipedia

• http://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_Chemistry/CAS_validation

Page 17

CAS and Wikipedia

Page 18

CAS and Wikipedia

Page 19

7900 CAS Chemicals Online…

Page 20

How many CAS Numbers?

Page 21

How many CAS Numbers?

• >34 million chemicals from >500 sources

Page 22

But CAS is hard to “Resolve”

Page 23

Why CAS Numbers are not great

• There is no free resolution service… unlike DOIs
• The resolver is a “Google Search”
• Maybe we need another “identifier”?
• And thanks to IUPAC/NIST…

Page 24

The InChI Identifier

Page 25

Multiple Layers
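The layers of an InChI string are separated by "/" and, apart from the version and the formula, each starts with a one-letter prefix. A minimal sketch of pulling them apart follows; the function name `split_inchi_layers` is hypothetical, and the example string is the standard InChI for ethanol. It targets simple standard InChIs only: prefixes that repeat in later sections (for example a second "h" in an isotopic section) would overwrite earlier ones in this dictionary.

```python
def split_inchi_layers(inchi: str) -> dict:
    """Split a simple InChI string into its version, formula and prefixed layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    parts = inchi[len(prefix):].split("/")
    layers = {"version": parts[0]}
    rest = parts[1:]
    # The formula layer carries no prefix letter; it starts with an
    # uppercase element symbol or a leading count digit.
    if rest and rest[0] and not rest[0][0].islower():
        layers["formula"] = rest[0]
        rest = rest[1:]
    for part in rest:
        if part:
            layers[part[0]] = part[1:]  # e.g. "c" connections, "h" hydrogens
    return layers


if __name__ == "__main__":
    ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
    for name, value in split_inchi_layers(ethanol).items():
        print(f"{name}: {value}")
```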

Page 26

InChI

• SINGLE code base managed by IUPAC – integrated into drawing packages and used by MANY databases. No variability as with SMILES

Page 27

Vendor-dependent SMILES

ACD/Labs: CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(\C)=C\CC2=C(C)C(=O)c1ccccc1C2=O

OpenEye: CC1=C(C(=O)c2ccccc2C1=O)C/C=C(\C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C

ChEMBL: CC(C)CCC[C@@H](C)CCC[C@@H](C)CCC\C(=C\CC1=C(C)C(=O)c2ccccc2C1=O)\C
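To see why vendor-dependent SMILES defeat text matching, the three strings above can be compared directly. The sketch below is deliberately crude: it counts only the C and O symbols that occur in these particular strings and is not a SMILES parser. The names `SMILES_BY_SOURCE` and `heavy_atom_counts` are illustrative. It shows that the strings all differ as text while carrying the same heavy-atom composition; deciding whether they describe the same structure, stereochemistry included, needs a cheminformatics toolkit or a single identifier such as InChI.

```python
import re
from collections import Counter

# Raw strings keep the backslash bond symbols intact.
SMILES_BY_SOURCE = {
    "ACD/Labs": r"CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(\C)=C\CC2=C(C)C(=O)c1ccccc1C2=O",
    "OpenEye": r"CC1=C(C(=O)c2ccccc2C1=O)C/C=C(\C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C",
    "ChEMBL": r"CC(C)CCC[C@@H](C)CCC[C@@H](C)CCC\C(=C\CC1=C(C)C(=O)c2ccccc2C1=O)\C",
}


def heavy_atom_counts(smiles: str) -> Counter:
    """Count C and O atoms; valid only for strings limited to those elements."""
    return Counter(symbol.upper() for symbol in re.findall(r"[CcO]", smiles))


if __name__ == "__main__":
    distinct_strings = len(set(SMILES_BY_SOURCE.values()))
    print("distinct SMILES strings:", distinct_strings)
    for source, smiles in SMILES_BY_SOURCE.items():
        print(source, dict(heavy_atom_counts(smiles)))
```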

Page 28

InChI

• SINGLE code base managed by IUPAC – integrated into drawing packages and used by MANY databases. No variability as with SMILES

• InChI Strings can be reversed to structures – same problem as with SMILES – no layout

• Adopted by the community (databases, blogs, Wikipedia) – good for searching the internet

Page 29

InChI Strings Hash to InChIKeys
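An InChIKey is a fixed-length, 27-character digest that the InChI software derives from the InChI string. The sketch below does not recompute the hash; it only checks the layout and splits a key into its blocks: 14 letters for the connectivity hash, 8 letters for the remaining layers, a standard/non-standard flag, a version letter and a protonation letter. The names `InChIKeyParts` and `parse_inchikey` are hypothetical; the example keys are those of aspirin and ethanol.

```python
import re
from typing import NamedTuple

INCHIKEY_PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


class InChIKeyParts(NamedTuple):
    connectivity: str   # hash of formula, connections and hydrogens
    other_layers: str   # hash of stereo, isotopic and other layers
    is_standard: bool   # "S" for standard InChI, "N" for non-standard
    version: str        # "A" for InChI version 1
    protonation: str    # "N" for neutral


def parse_inchikey(key: str) -> InChIKeyParts:
    """Validate the layout of an InChIKey and return its blocks."""
    match = INCHIKEY_PATTERN.match(key.strip())
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    return InChIKeyParts(
        connectivity=match.group(1),
        other_layers=match.group(2),
        is_standard=match.group(3) == "S",
        version=match.group(4),
        protonation=match.group(5),
    )


if __name__ == "__main__":
    for key in ("BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"):
        print(parse_inchikey(key))
```

Because the first block depends only on connectivity, keys that share it describe the same skeleton, which is what makes the key convenient for searching the web.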

Page 30

InChIs for small molecules…

• InChIs are good for “small molecules”
• Read here: http://www.jcheminf.com/series/InChI

Page 31

A Vision in December 2006

Page 32

Lots of data coming online…

Page 33

ChemSpider

Page 34

ChemSpider

Page 35

ChemSpider

Page 36

Experimental/Predicted Properties

Page 37

Literature references

Page 38

Patent references

Page 39

Google Books

Page 40

Vendors and data sources

Page 41

Structure search the web

Page 42

Exact Search

Page 43

Skeleton Search

Page 44

6 years ago this week…

Page 45

ChemSpider strengths

• Serves over 40,000 unique users per day
• Advanced searching of >34 million chemicals

Page 46

Fully documented APIs

Page 47

Fully documented APIs

Page 48

Data Quality/Standardization

• MANY structures meant to be something online are MISREPRESENTED.

• Commonly you will have better success finding information by name searches than structure – with many caveats of course…

• Validating chemical structure representations is laborious work – and it’s shocking to review data…

Page 49

What is the Structure of Vitamin K1?

Page 50

Data Quality Issues

Williams and Ekins, DDT, 16: 747-750 (2011)

Science Translational Medicine 2011

Page 51

Data quality is a known issue

Page 52

Data quality is a known issue

Page 53

Patent data in public databases

Page 54

Patent data in public databases

Page 55

Depiction vs Accurate Representation

Page 56

Depiction vs Accurate Representation

Page 57

There are Unused Standards!

Page 58

There are Unused Standards!

Page 59

There are Unused Standards!

Page 60

Nitro groups

Page 61

Salt and Ionic Bonds

Page 62

Ammonium salts

Page 63

Can we MAKE Quality Data?

• Systems for everyone to validate and standardize their data would be useful

• Would improve structure data in publications, databases etc. and make searching across resources better

• Collaboration to establish community rules would be good!

Page 64

Chemical Validation and Standardization: http://cvsp.chemspider.com

Page 65

CVSP Rule Sets

Page 66

CVSP Filtering of DrugBank

Page 67

CVSP Filtering of DrugBank

Page 68

CVSP is Open to Anyone!

Page 69

ChemSpider limitations

• Supports “small molecules” only – without an InChI there is no possibility to register a compound
• SO MUCH of chemistry is “materials”
• Severe limitation in chemistry coverage:
• Monomers but no polymers
• Inorganic and organometallic handling
• Ambiguous structures – “Markush”
• Nanomaterials
• Minerals
• Bound to beads, surfaces etc.

Page 70

ORGANICS vs. Materials

• Comment – you don’t know all of the challenges until you start to work in the area!
• We, and cheminformatics companies, have solved MANY, but not all, of the issues regarding organic chemistry management
• The majority of our approaches do not map to materials
• No standard ways to represent compounds
• No InChI for materials

Page 71

Questions to consider…

• Organics are hard enough!
• What are your best dictionaries of materials?
• We have chemical ontologies. Status for materials?
• Is open annotation of your databases possible?
• What standards do you have for materials data exchange?

Page 72

Polymorphism is common

Page 73

Known Challenges

• Many materials are non-stoichiometric
• How to represent composite materials (e.g. supported catalysts)?
• Methods to distinguish novelty in materials (equivalent to diversity in organic structures)?
• Lots of challenges ahead… a curated “community dictionary” would be of value…

Page 74

Mapped DICTIONARIES…

• Structure IDs
• Systematic name(s)
• Trivial Name(s)
• SMILES
• InChI Strings
• InChIKeys
• Database IDs
• Registry Number
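One way to picture a mapped dictionary is a record holding every identifier type listed above, plus an index that resolves any of them back to the record. The sketch below is an assumption about how such a structure might look, not a description of ChemSpider's internals; `DictionaryEntry`, `MappedDictionary` and the structure ID `example-0001` are hypothetical. Names are matched case-insensitively, while SMILES, InChI and InChIKey are matched exactly because case is significant in them.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class DictionaryEntry:
    structure_id: str
    systematic_names: List[str] = field(default_factory=list)
    trivial_names: List[str] = field(default_factory=list)
    smiles: str = ""
    inchi: str = ""
    inchikey: str = ""
    database_ids: Dict[str, str] = field(default_factory=dict)
    registry_number: str = ""

    def exact_keys(self) -> Iterator[str]:
        """Identifiers that must match character for character."""
        yield self.structure_id
        for value in (self.smiles, self.inchi, self.inchikey, self.registry_number):
            if value:
                yield value
        for source, identifier in self.database_ids.items():
            yield f"{source}:{identifier}"

    def name_keys(self) -> Iterator[str]:
        """Names, normalized so lookups ignore case."""
        for name in self.systematic_names + self.trivial_names:
            yield name.casefold()


class MappedDictionary:
    def __init__(self) -> None:
        self._exact: Dict[str, List[DictionaryEntry]] = {}
        self._names: Dict[str, List[DictionaryEntry]] = {}

    def add(self, entry: DictionaryEntry) -> None:
        for key in entry.exact_keys():
            self._exact.setdefault(key, []).append(entry)
        for key in entry.name_keys():
            self._names.setdefault(key, []).append(entry)

    def lookup(self, query: str) -> List[DictionaryEntry]:
        """Return every entry the query maps to; a name may map to several."""
        query = query.strip()
        return self._exact.get(query) or self._names.get(query.casefold(), [])


aspirin = DictionaryEntry(
    structure_id="example-0001",  # hypothetical internal ID
    systematic_names=["2-acetyloxybenzoic acid"],
    trivial_names=["aspirin", "acetylsalicylic acid"],
    smiles="CC(=O)Oc1ccccc1C(=O)O",
    inchi="InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)",
    inchikey="BSYNRYMUTXBXSQ-UHFFFAOYSA-N",
    database_ids={"ChemSpider": "2157"},
    registry_number="50-78-2",
)
dictionary = MappedDictionary()
dictionary.add(aspirin)
```

The lookup returns a list rather than a single entry because, in practice, one name or registry number can point at more than one structure, and surfacing that ambiguity is part of curation.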

Page 75
Page 76
Page 77

Pragmatism wins

Page 78

Collaboration is key

Page 79

Wouldn’t it be nice if…

Page 80

Thank you

Email: [email protected]
ORCID: 0000-0002-2668-4821
Twitter: @ChemConnector
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams