ChemSpider – A Community Platform for Chemistry and Resources Supporting the Life Sciences
May 11, 2015
Chemistry on the Internet TODAY
Chemistry searches are generally limited to text-based searches across the internet
Data are dirty: sorting the wheat from the chaff. Who can you trust?
Too many searches required to source data
The Final Search Strategy
All Those Names, One Structure: A problem to solve…
Trustworthy Chemistry?
Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug Discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science
Where Would You look? What Do You Trust?
Question Everything online: www.dhmo.org
Di-Hydrogen Monoxide
2H + 1O
H2O
Water
It’s all on Wikipedia…
Chemistry on The Internet Is Messy
It’s Methane…
What’s Methane?
What ELSE is Methane???
Drugs are REALLY Messy
Vancomycin
Who will curate?
How would you clean such a large dataset?
Assertions!!!
The EXPERTS must get it right?!
Wikipedia, C&E News, PubChem
C&E News (from ACS)
Feedback from C&E Senior Editor
“Although CAS and C&EN are both part of the ACS Publications Division, we at C&EN still have to pay for our SciFinder access, strangely enough.”
“It would be nice to have an authoritative web-based source of standard, well-drawn structures for chemists to go to so they can freely cut and paste structures into their papers, PowerPoint presentations, and anything else they might need. Maybe Wikipedia will be that source one day.”
Structural Data for Life Sciences: DailyMed
Lack of Stereochemistry
Incorrect Structures
Ugh…
Just “Public Compound” Databases
PubChem
Drugbank
ChEBI/ChEMBL
KEGG
LipidMAPs
ChemIDPlus
eMolecules
ZINC
Lots of chemical vendors
ChemSpider
As few interfaces as possible
What do humans want?
A Pragmatic Vision: “Build a Structure Centric Community to Serve Chemists”
December 2006 – A hobby project initiated to connect chemistry on the web
Integrate chemical structure data on the web
Create a “structure-based hub” to information and data
Provide access to structure-based “algorithms”
Let chemists contribute their own data
Allow the community to curate/correct data
Answer Questions
Questions a chemist might ask…
What is the melting point of n-heptanol?
What is the chemical structure of Xanax?
Chemically, what is phenolphthalein?
What are the stereocenters of cholesterol?
Where can I find publications about xylene?
What are the different trade names for Ketoconazole?
What is the NMR spectrum of Aspirin?
What are the safety handling issues for Thymol Blue?
ChemSpider Searches
Search Cholesterol
A Link Farm to Content
Linked across the internet
Kyoto Encyclopedia of Genes and Genomes
Linking SMPDB
Links to Patents based on structure
Articles Linked
Search “OEA”
Linked Patents for OEA
Statistics for Today
>23 million compounds from >300 data sources
About 7000 unique users per day and up to ½ million transactions per day
A crowdsourced deposition and curation platform
Grows daily – more depositions, more links, more data
Searching Chemistry on the Internet
How complete a result set will we get if we search for “chemicals” by name?
Is there a better way to link chemistry databases? Linking by “names” is dangerous
Chemists want structure and SUBstructure searching
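Why linking by name is dangerous can be shown with a toy example. The sketch below is illustrative only: the two sources, their records, and the structure identifiers `S1`, `S2`, `S3` are invented for this example and do not come from any real database.

```python
from collections import defaultdict

# Hypothetical records from two sources: (name as written, structure identifier).
# S3 stands for a different structure that has been mislabeled "methane".
source_a = [("methane", "S1"), ("marsh gas", "S1"), ("water", "S2")]
source_b = [("Methane", "S1"), ("methane", "S3"), ("dihydrogen monoxide", "S2")]


def link_by_name(a, b):
    """Link records whose names match, ignoring case."""
    ids_by_name = defaultdict(set)
    for name, structure_id in b:
        ids_by_name[name.lower()].add(structure_id)
    links = set()
    for name, structure_id in a:
        for other_id in ids_by_name.get(name.lower(), ()):
            links.add((structure_id, other_id))
    return links


def link_by_structure(a, b):
    """Link records that share the same structure identifier."""
    ids_in_b = {structure_id for _, structure_id in b}
    return {(sid, sid) for _, sid in a if sid in ids_in_b}


name_links = link_by_name(source_a, source_b)
structure_links = link_by_structure(source_a, source_b)
print(sorted(name_links))
print(sorted(structure_links))
```

In this toy data the name-based join produces a false link between `S1` and `S3` and misses the water records entirely, because the two sources use different synonyms. The structure-based join finds both true matches.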
The InChI Identifier
Multiple Layers
InChI Strings Hash to InChIKeys
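An InChIKey is a fixed-length, 27-character string: a 14-letter block hashed from the connectivity (roughly the molecular skeleton), then an 8-letter block hashed from the remaining layers such as stereochemistry, a standard/non-standard flag, a version letter, and a final protonation letter. A minimal sketch of splitting a key into those parts follows; the function name `split_inchikey` and the dictionary field names are chosen for this illustration and are not part of any InChI library.

```python
import re

# Layout: 14 letters, hyphen, 8 letters + flag + version letter, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])")


def split_inchikey(key):
    """Split an InChIKey into its blocks; raise ValueError if the layout is wrong."""
    match = INCHIKEY_PATTERN.fullmatch(key.strip())
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, details, flag, version, protonation = match.groups()
    return {
        "skeleton": skeleton,        # hash of the connectivity layers
        "details": details,          # hash of the remaining layers (stereo, isotopes, ...)
        "standard": flag == "S",     # S = standard InChI, N = non-standard
        "version": version,          # A = InChI version 1
        "protonation": protonation,  # N = neutral
    }


parts = split_inchikey("RCINICONZNJXQF-MZXODVADSA-N")
print(parts["skeleton"])
print(parts["details"])
```

Because the key is a hash, this only checks the layout. It cannot tell whether the key corresponds to a real structure, and the structure cannot be recovered from the key itself.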
Link the Internet with InChIKeys!
Taken from: Rafael Sidi’s Blog
Vancomycin – Search the Internet
Vancomycin
Search Molecular SKELETON
Search Full Molecule
Full Molecule Search: 4 Hits
Full Skeleton Search: 104 Hits
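The difference between the two searches can be sketched with InChIKeys: a full-molecule search matches the whole key, while a skeleton search matches only the first 14-letter block, so it also returns stereoisomers and other variants of the same connectivity. The keys below are made-up placeholders, not real compounds, and the helper names are invented for this illustration.

```python
def full_molecule_hits(query_key, indexed_keys):
    """Return indexed keys identical to the query key."""
    return [key for key in indexed_keys if key == query_key]


def skeleton_hits(query_key, indexed_keys):
    """Return indexed keys sharing the query's first (connectivity) block."""
    skeleton = query_key.split("-")[0]
    return [key for key in indexed_keys if key.split("-")[0] == skeleton]


# Placeholder keys: three share a skeleton block, one does not.
indexed = [
    "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "AAAAAAAAAAAAAA-CCCCCCCCSA-N",
    "AAAAAAAAAAAAAA-UHFFFAOYSA-N",
    "ZZZZZZZZZZZZZZ-BBBBBBBBSA-N",
]
query = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"

full = full_molecule_hits(query, indexed)
skeleton = skeleton_hits(query, indexed)
print(len(full), len(skeleton))
```

The skeleton search is the broader of the two, which is consistent with the larger hit count it returns for vancomycin on the slide above.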
Vancomycin
Vancomycin on ChemSpider: 1 compound – 3 days
InChIKeys
RCINICONZNJXQF-MZXODVADSA-N
Make the internet searchable by adding InChIKeys
Publishers add InChIKeys to papers now…
RCINICONZNJXQF-MZXODVADSA-N is what???
The InChI “Resolver”
InChI Resolver to DOIs: Structure Search the Web
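Since an InChIKey is a hash, it cannot be converted back into a structure; a resolver has to be a lookup service that already knows the key. The sketch below is a toy in-memory version of that idea, not the actual resolver: the class `InChIResolver`, its methods, and the DOI `10.0000/example-doi` are invented for illustration. The water InChI and InChIKey are included as a familiar example and should be checked against an authoritative source before reuse.

```python
class InChIResolver:
    """Toy lookup table: InChIKey -> InChI string and linked document identifiers."""

    def __init__(self):
        self._inchi_by_key = {}
        self._docs_by_key = {}

    def register(self, inchikey, inchi, doc_id=None):
        """Record a key/InChI pair and, optionally, a document that mentions it."""
        self._inchi_by_key[inchikey] = inchi
        if doc_id is not None:
            self._docs_by_key.setdefault(inchikey, []).append(doc_id)

    def resolve(self, inchikey):
        """Return the InChI string for a key, or None if the key is unknown."""
        return self._inchi_by_key.get(inchikey)

    def documents(self, inchikey):
        """Return the document identifiers linked to a key (possibly empty)."""
        return list(self._docs_by_key.get(inchikey, []))


resolver = InChIResolver()
resolver.register(
    "XLYOFNOQVPJJNP-UHFFFAOYSA-N",   # water
    "InChI=1S/H2O/h1H2",
    doc_id="10.0000/example-doi",    # placeholder DOI, not a real article
)

print(resolver.resolve("XLYOFNOQVPJJNP-UHFFFAOYSA-N"))
print(resolver.documents("XLYOFNOQVPJJNP-UHFFFAOYSA-N"))
print(resolver.resolve("RCINICONZNJXQF-MZXODVADSA-N"))
```

A key that was never registered resolves to nothing, which is why the resolver is only as useful as the set of structures deposited in it.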
Most Chemistry is NOT Published
Only a fraction of chemistry is published
Only a tiny fraction of chemistry is patented
What of the “Lost Chemistry” - never published and cannot be abstracted
Reactions performed
Structures made and studied
Spectra acquired and then disposed of
Available chemicals never found
The CAS Registry
Crowd-sourcing Curation and Deposition
Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate
Building a Structure Centric Community for Chemists
Multi-level Curation and Approval
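One way to picture multi-level curation is as a queue of proposed changes that trusted curators can approve or reject. The sketch below is an assumption about how such a workflow could be modeled, not a description of ChemSpider's implementation: the class names, the action kinds, and the rule that a curator's own submissions are approved immediately are all invented for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PROPOSED = "proposed"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class CurationAction:
    record_id: int
    kind: str          # e.g. "tag_error", "edit_synonym", "deprecate_record"
    detail: str
    submitted_by: str
    status: Status = Status.PROPOSED


@dataclass
class CurationQueue:
    curators: set
    actions: list = field(default_factory=list)

    def submit(self, action):
        """Queue an action; submissions from curators are approved immediately."""
        if action.submitted_by in self.curators:
            action.status = Status.APPROVED
        self.actions.append(action)
        return action

    def review(self, action, reviewer, approve):
        """Let a curator approve or reject a queued action."""
        if reviewer not in self.curators:
            raise PermissionError(f"{reviewer} is not a curator")
        action.status = Status.APPROVED if approve else Status.REJECTED
        return action

    def pending(self):
        """Return the actions still waiting for review."""
        return [a for a in self.actions if a.status is Status.PROPOSED]


queue = CurationQueue(curators={"alice"})
proposal = queue.submit(
    CurationAction(1, "edit_synonym", "remove wrong trade name", "bob")
)
pending_before = len(queue.pending())
queue.review(proposal, reviewer="alice", approve=True)
pending_after = len(queue.pending())
print(pending_before, pending_after, proposal.status.value)
```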
Entity-Extraction, Mark-up, Annotate
Semantic Markup: Project Prospect
Success Depends on Dictionaries
Link to a Structure or the Right Structure?
Name-Structure Pairs
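Dictionary-based mark-up can be sketched as a lookup of known names in running text, with each hit linked to a structure identifier. This is a minimal illustration, not the Project Prospect pipeline: the three-entry dictionary, the function names, and the sample sentence are invented here, and the InChIKeys should be verified against an authoritative source before reuse.

```python
import re

# Tiny name-structure dictionary (names in lower case -> InChIKey).
NAME_TO_KEY = {
    "water": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",
    "dihydrogen monoxide": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",
    "methane": "VNWKTOKETHGBQD-UHFFFAOYSA-N",
}


def build_pattern(names):
    """Compile one case-insensitive pattern that matches any whole-word name."""
    # Longest names first, so a multi-word name wins over a shorter one it contains.
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)


def annotate(text, dictionary=NAME_TO_KEY):
    """Return (matched text, start offset, InChIKey) for each dictionary hit."""
    pattern = build_pattern(dictionary)
    return [
        (m.group(0), m.start(), dictionary[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


hits = annotate("Dihydrogen monoxide is just water; methanethiol is not methane.")
for matched, offset, key in hits:
    print(matched, offset, key)
```

The word-boundary check keeps "methanethiol" from being linked to methane, but anything missing from the dictionary is silently skipped, and a wrong name-structure pair in the dictionary propagates to every article it marks up.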
Semantic Linking of Structures
What would you want to link off a structure?
Chemical suppliers
Other publications
Analytical Data
Related Reactions
Wikipedia
Patents
“Everything”
Org Prep Daily (Blog)
ChemSpider SyntheticPages
Chemistry on the Internet FUTURE
The semantic web for chemistry is in place
Crowdsourced contributions are commonplace
Chemists will search by structure/substructure
Chemistry articles indexed and searchable
Reduced number of searches to find data
Data are integrated – compounds, vendors, syntheses, data, publications and patents
A world of Open Access and Open Data
Classical business models will have to morph
ChemSpider Web Services
Thank you
[email protected]: ChemSpiderman
www.chemspider.com/blog
SLIDES: www.slideshare.net/AntonyWilliams