Chemicals, Chemical Identifiers and Navigating Through Databases

Post on 13-Dec-2014

DESCRIPTION

This is a presentation given to a group of students at the UNC Eshelman School of Pharmacy. As chemists, many of us want to source information that is high quality, accurate and addresses our query. With the increasing proliferation of online chemistry resources, it is very common for us to turn to them for data. However, are resources such as Wikipedia, PubChem and the plethora of databases delivering information for metabolism, medicinal chemistry and synthetic chemistry trustworthy? Which of these resources, if any, should be treated as authorities? What is the most integrated approach to sourcing chemistry-related data online? What approaches can be taken to validate the data that is available, and how can individual scientists help to improve the content and quality of chemistry-related data on the web? Antony Williams is ChemSpiderman. He started the ChemSpider database (www.chemspider.com) as a hobby to deliver a free platform for the community to source chemistry-related data. Within three years the system was acquired by the Royal Society of Chemistry. It now serves up close to 25 million chemical structures linked to over 400 data sources across the internet, and offers individual scientists the opportunity to host and share their data with the community and to participate in data curation and annotation. Tony will share his experiences of building this chemistry database, with a focus on data validation, curation and sourcing high-quality data. During the presentation he will discuss ways to check chemical structure representations before submission to public systems for searching, and provide an overview of chemical identifiers such as SMILES strings and the International Chemical Identifier (InChI), which allow for the interlinking of resources. Attendees can expect to leave the session with a deeper understanding of using the internet to source chemistry-related data.

Transcript

Chemicals, Chemical Identifiers and Navigating Through Databases

Antony Williams, UNC Chapel Hill, October 2010

Chemistry on the Internet

Where do you source chemistry information?
What can you trust online?
How can you recognize potential issues?
Cross-referencing and curating data

What is the Structure of Vitamin K?

MeSH

A lipid cofactor that is required for normal blood clotting. Several forms of vitamin K have been identified: VITAMIN K 1 (phytomenadione) derived from plants, VITAMIN K 2 (menaquinone) from bacteria, and synthetic naphthoquinone provitamins, VITAMIN K 3 (menadione). Vitamin K 3 provitamins, after being alkylated in vivo, exhibit the antifibrinolytic activity of vitamin K. Green leafy vegetables, liver, cheese, butter, and egg yolk are good sources of vitamin K

What is the Structure of Vitamin K1?

Wikipedia

What is the Structure of Vitamin K1?

CAS’s Common Chemistry

PubChem

“2-methyl-3-(3,7,11,15-tetramethylhexadec-2-enyl)naphthalene-1,4-dione”

Variants of systematic names on PubChem

2-methyl-3-[(E,7R,11R)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7S,11R)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7R,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E,7S,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E,11S)-3,7,11,15-tetramethyl
2-methyl-3-[(E)-3,7,11,15-tetramethyl
2-methyl-3-(3,7,11,15-tetramethyl
2-methyl-3-[(E)-3,7,11,15-tetramethyl

Bioassay Data are Associated…

Lack of Stereochemistry

ChEBI – Manual Curation

Molfiles (http://en.wikipedia.org/wiki/Chemical_table_file)

Molfiles

10 9 0 0 1 0 0 0 0 0 1 V2000
31.2937 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6526 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
31.2937 -7.7066 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -9.6877 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
25.5096 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
28.9731 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
27.8163 -9.7016 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6664 -7.7066 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
32.4367 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -11.0177 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3 1 2 0 0 0 0
4 1 1 0 0 0 0
9 1 1 0 0 0 0
7 2 1 0 0 0 0
5 2 2 0 0 0 0
8 2 1 0 0 0 0
6 4 1 0 0 0 0
4 10 1 6 0 0 0
7 6 1 0 0 0 0
M END

Molfiles

Molfiles are the primary exchange format between structure drawing packages
Can be different between different drawing packages
Most commonly carry X,Y coordinates for layout
Can support polymers, organometallics, etc.
Can carry 3D coordinates
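The record layout described above can be read with a few lines of code. The following is a minimal sketch, not a complete molfile reader: it assumes the records shown on the earlier slide (counts line, atom block, bond block, without the three header lines of a full molfile) and splits on whitespace, whereas the real V2000 format uses fixed-width columns. The function name `parse_v2000` is made up for this illustration.

```python
from collections import Counter

# Counts line, atom block and bond block as shown on the slide.
MOL_BLOCK = """\
10 9 0 0 1 0 0 0 0 0 1 V2000
31.2937 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6526 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
31.2937 -7.7066 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -9.6877 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
25.5096 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
28.9731 -9.0366 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
27.8163 -9.7016 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
26.6664 -7.7066 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
32.4367 -9.6877 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
30.1161 -11.0177 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3 1 2 0 0 0 0
4 1 1 0 0 0 0
9 1 1 0 0 0 0
7 2 1 0 0 0 0
5 2 2 0 0 0 0
8 2 1 0 0 0 0
6 4 1 0 0 0 0
4 10 1 6 0 0 0
7 6 1 0 0 0 0
M END
"""


def parse_v2000(block):
    """Hypothetical helper: whitespace-split the counts, atom and bond records."""
    lines = [ln.split() for ln in block.strip().splitlines()]
    # The first two fields of the counts line are the atom and bond counts.
    n_atoms, n_bonds = int(lines[0][0]), int(lines[0][1])
    atoms = [
        {"x": float(f[0]), "y": float(f[1]), "z": float(f[2]), "element": f[3]}
        for f in lines[1:1 + n_atoms]
    ]
    bonds = [
        {"begin": int(f[0]), "end": int(f[1]), "order": int(f[2]), "stereo": int(f[3])}
        for f in lines[1 + n_atoms:1 + n_atoms + n_bonds]
    ]
    return atoms, bonds


atoms, bonds = parse_v2000(MOL_BLOCK)
heavy_atoms = Counter(a["element"] for a in atoms)
# A non-zero fourth bond field marks a wedge or hash bond drawn in the sketch.
stereo_bonds = [b for b in bonds if b["stereo"] != 0]
print(dict(heavy_atoms))
print(stereo_bonds)
```

Note that every Z coordinate in this record is 0.0000: the file carries a 2D layout, and the only stereo information is the single marked bond.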

SMILES (http://en.wikipedia.org/wiki/SMILES)

SMILES is a common format
Can support polymers, organometallics, etc.
Does NOT carry X,Y or Z coordinates for layout so requires layout algorithms – can be problematic!
Generally different between drawing packages

Stereo

Tautomers

SMILES

ACD/Labs: CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(\C)=C\CC2=C(C)C(=O)c1ccccc1C2=O
OpenEye: CC1=C(C(=O)c2ccccc2C1=O)C/C=C(\C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C
ChEMBL: CC(C)CCC[C@@H](C)CCC[C@@H](C)CCC\C(=C\CC1=C(C)C(=O)c2ccccc2C1=O)\C
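Because the three packages write the same compound differently, comparing SMILES as plain text is not a reliable identity test. The sketch below illustrates this with the three strings from the slide. It uses a deliberately crude tokenizer (the regular expressions and the name `heavy_atom_counts` are assumptions made for this example, not part of any toolkit) to count heavy atoms. Matching counts are only a necessary condition for two strings describing the same molecule; a real comparison needs canonicalization by a cheminformatics toolkit or conversion to InChI.

```python
import re
from collections import Counter

# The three SMILES strings from the slide (raw strings keep the backslashes).
SMILES = {
    "ACD/Labs": r"CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(\C)=C\CC2=C(C)C(=O)c1ccccc1C2=O",
    "OpenEye": r"CC1=C(C(=O)c2ccccc2C1=O)C/C=C(\C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C",
    "ChEMBL": r"CC(C)CCC[C@@H](C)CCC[C@@H](C)CCC\C(=C\CC1=C(C)C(=O)c2ccccc2C1=O)\C",
}

# Bracket atoms first, then two-letter halogens, then the organic subset.
ATOM_TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")
# Element symbol inside a bracket atom, after an optional isotope number.
BRACKET_ELEMENT = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")


def heavy_atom_counts(smiles):
    """Count non-hydrogen atoms per element (hypothetical, simplified tokenizer)."""
    counts = Counter()
    for token in ATOM_TOKEN.findall(smiles):
        if token.startswith("["):
            element = BRACKET_ELEMENT.match(token).group(1)
        else:
            element = token
        if element != "H":
            # Aromatic atoms are lowercase in SMILES; normalise to the element symbol.
            counts[element.capitalize()] += 1
    return counts


formulas = {source: heavy_atom_counts(s) for source, s in SMILES.items()}
all_strings_differ = len(set(SMILES.values())) == len(SMILES)
same_composition = len({tuple(sorted(f.items())) for f in formulas.values()}) == 1
print(all_strings_differ, same_composition)
for source, formula in formulas.items():
    print(source, dict(formula))
```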

The InChI Identifier

InChI

SINGLE code base managed by IUPAC – integrated into drawing packages. No variability as with SMILES

InChI Strings can be reversed to structures – same problem as with SMILES – no layout

Well adopted by the community (databases, publishers, blogs, Wikipedia) – good for searching the internet

Multiple Layers

Tautomers – “Mobile H Perception”

Double Bond Orientation

Stereo
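The layered design can be seen by splitting an InChI string on its "/" separators. This is a small sketch assuming a standard InChI; the example string is the one commonly given for L-alanine, and the layer descriptions in `LAYER_NAMES` are shorthand chosen for this illustration. Non-standard strings, where sub-layers repeat after a fixed-H layer, are not handled.

```python
# One-letter layer prefixes and a short description of each (shorthand for this sketch).
LAYER_NAMES = {
    "c": "connectivity",
    "h": "hydrogens",
    "q": "charge",
    "p": "protonation",
    "b": "double bond stereo",
    "t": "tetrahedral stereo",
    "m": "stereo inversion flag",
    "s": "stereo type",
    "i": "isotopes",
    "f": "fixed-H",
}


def split_inchi(inchi):
    """Split a standard InChI into version, formula and its prefixed layers."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    version, formula, *layers = inchi[len("InChI="):].split("/")
    parsed = {"version": version, "formula": formula}
    for layer in layers:
        if layer:
            parsed[LAYER_NAMES.get(layer[0], layer[0])] = layer[1:]
    return parsed


# Example string for L-alanine; "(H,5,6)" is a mobile hydrogen group
# shared between the two carboxylate oxygens.
layers = split_inchi("InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1")
for name, value in layers.items():
    print(f"{name}: {value}")
```

The mobile hydrogen group in the hydrogen layer is how tautomers collapse to one string, and the stereo layers are simply absent when a structure is drawn without stereochemistry.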

Checking for Stereochemistry

Use your drawing package!

InChI Strings Hash to InChIKeys

PubChem InChIKeys

MBWXNTAXLNYFJB-NKFFZRIASA-N
MBWXNTAXLNYFJB-LKUDQCMESA-N
MBWXNTAXLNYFJB-UHFFFAOYSA-N
MBWXNTAXLNYFJB-FAKCLFGASA-N
MBWXNTAXLNYFJB-NIHVXYICSA-N (O-18 label)
MBWXNTAXLNYFJB-ODDKJFTJSA-N
MBWXNTAXLNYFJB-KSVLJPARSA-N
MBWXNTAXLNYFJB-UDCSOKOMSA-N
MBWXNTAXLNYFJB-JHBCSKSVSA-N
MBWXNTAXLNYFJB-JXAKDHTRSA-N
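The keys above share their first block because an InChIKey has a fixed layout: 14 letters hashed from the connectivity, 8 letters hashed from the remaining layers (stereo, isotopes), a standard/non-standard flag, a version letter, and a final protonation letter. The sketch below takes the ten keys from the slide apart; the function name `parse_inchikey` is an assumption for this example. The second block `UHFFFAOY` is the value produced when there are no stereo or isotope layers to hash.

```python
import re

PUBCHEM_KEYS = [
    "MBWXNTAXLNYFJB-NKFFZRIASA-N",
    "MBWXNTAXLNYFJB-LKUDQCMESA-N",
    "MBWXNTAXLNYFJB-UHFFFAOYSA-N",
    "MBWXNTAXLNYFJB-FAKCLFGASA-N",
    "MBWXNTAXLNYFJB-NIHVXYICSA-N",  # O-18 label
    "MBWXNTAXLNYFJB-ODDKJFTJSA-N",
    "MBWXNTAXLNYFJB-KSVLJPARSA-N",
    "MBWXNTAXLNYFJB-UDCSOKOMSA-N",
    "MBWXNTAXLNYFJB-JHBCSKSVSA-N",
    "MBWXNTAXLNYFJB-JXAKDHTRSA-N",
]

KEY_PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")
NO_EXTRA_LAYERS = "UHFFFAOY"  # second block when no stereo/isotope layers exist


def parse_inchikey(key):
    """Split an InChIKey into its blocks (hypothetical helper for this sketch)."""
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, detail, flag, version, protonation = match.groups()
    return {
        "skeleton": skeleton,        # connectivity hash
        "detail": detail,            # stereo / isotope hash
        "standard": flag == "S",
        "version": version,
        "protonation": protonation,
    }


parsed = [parse_inchikey(k) for k in PUBCHEM_KEYS]
skeletons = {p["skeleton"] for p in parsed}
details = {p["detail"] for p in parsed}
without_stereo = [k for k, p in zip(PUBCHEM_KEYS, parsed) if p["detail"] == NO_EXTRA_LAYERS]
print(len(skeletons), len(details), without_stereo)
```

Here one skeleton maps to ten distinct second blocks, and exactly one record carries no stereo or isotope information at all.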

Databases and Standardization

InChI

No support for polymers, organometallics

Many option settings can lead to variability and make integration across databases difficult – FixedH option especially problematic

“Slight” chance of collisions of InChIKeys

VERY USEFUL FOR INTEGRATING THE WEB

Vancomycin

Search Molecular SKELETON

Search Full Molecule

Full Skeleton Search: 104 Hits

Full Molecule Search: 4 Hits
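The difference between the two hit counts follows from which part of the InChIKey is matched. The sketch below is a toy in-memory index, not how any particular search engine is implemented: the record IDs are invented, the keys are the vitamin K1 keys quoted earlier in the talk (no vancomycin keys appear on the slides), and the last record uses a made-up placeholder key to stand for an unrelated compound.

```python
from collections import defaultdict

# (record id, InChIKey); the ids and the final placeholder key are hypothetical.
RECORDS = [
    ("rec-1", "MBWXNTAXLNYFJB-NKFFZRIASA-N"),
    ("rec-2", "MBWXNTAXLNYFJB-LKUDQCMESA-N"),
    ("rec-3", "MBWXNTAXLNYFJB-UHFFFAOYSA-N"),
    ("rec-4", "MBWXNTAXLNYFJB-NKFFZRIASA-N"),
    ("rec-5", "AAAAAAAAAAAAAA-UHFFFAOYSA-N"),  # placeholder, unrelated skeleton
]

by_full_key = defaultdict(list)
by_skeleton = defaultdict(list)
for record_id, key in RECORDS:
    by_full_key[key].append(record_id)
    # The first 14 letters encode connectivity only.
    by_skeleton[key[:14]].append(record_id)


def search_full(key):
    """Records whose whole key matches: same skeleton, stereo and isotopes."""
    return by_full_key.get(key, [])


def search_skeleton(key):
    """Records sharing the connectivity block, whatever their stereo."""
    return by_skeleton.get(key[:14], [])


query = "MBWXNTAXLNYFJB-NKFFZRIASA-N"
full_hits = search_full(query)
skeleton_hits = search_skeleton(query)
print(len(skeleton_hits), "skeleton hits;", len(full_hits), "full hits")
```

A skeleton search always returns at least as many records as a full search, which is why it is the more forgiving starting point when the stereochemistry in a database is in doubt.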

Where is chemistry online?

Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug Discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science

Linked Data on the Web

Taken from: Rafael Sidi’s Blog

www.chemspider.com

Search for a Chemical…by name

Available Information…

Linked to vendors, safety data, toxicity, metabolism

How do we build it?

25 million chemicals from 400 data sources
We deal in Molfiles or SDF files – including coordinates
We do rudimentary filtering – valence checking, charge imbalance – prior to deposition
We have our own “business logic” to standardize
We use InChI to “aggregate tautomers” to one record
We link out to external sites where possible using their IDs
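To make the rudimentary filtering step concrete, here is a minimal sketch of a valence check and a charge-balance check. It is an illustration under stated assumptions, not ChemSpider's actual logic: the valence table covers only H, C, N and O, the charge adjustment rule is simplified, implicit hydrogens are ignored (so only over-valent atoms are caught), and all names are invented for this example. Atoms are (element, formal charge) pairs and bonds are (begin, end, order) with 1-based atom numbers, as in a molfile.

```python
# Neutral valences for a handful of elements (assumption: deliberately tiny table).
BASE_VALENCE = {"H": 1, "C": 4, "N": 3, "O": 2}


def allowed_valence(element, charge):
    """Simplified rule: N+ may have 4 bonds, O- only 1; charged C or H lose one."""
    base = BASE_VALENCE[element]
    if element in ("C", "H"):
        return base - abs(charge)
    return base + charge


def overvalent_atoms(atoms, bonds):
    """Return 1-based numbers of atoms whose bond orders exceed the allowed valence."""
    totals = [0] * len(atoms)
    for begin, end, order in bonds:
        totals[begin - 1] += order
        totals[end - 1] += order
    return [
        number
        for number, ((element, charge), total) in enumerate(zip(atoms, totals), start=1)
        if total > allowed_valence(element, charge)
    ]


def net_charge(atoms):
    """Sum of formal charges; non-zero suggests a missing counterion."""
    return sum(charge for _, charge in atoms)


# Nitro group drawn with a neutral, five-bonded nitrogen: C-N(=O)=O
pentavalent = ([("C", 0), ("N", 0), ("O", 0), ("O", 0)], [(1, 2, 1), (2, 3, 2), (2, 4, 2)])
# Same group drawn charge-separated: C-[N+](=O)[O-]
separated = ([("C", 0), ("N", 1), ("O", 0), ("O", -1)], [(1, 2, 1), (2, 3, 2), (2, 4, 1)])
# Acetate anion deposited without its counterion: CC(=O)[O-]
acetate = ([("C", 0), ("C", 0), ("O", 0), ("O", -1)], [(1, 2, 1), (2, 3, 2), (2, 4, 1)])

for name, (atoms, bonds) in [("pentavalent", pentavalent), ("separated", separated), ("acetate", acetate)]:
    print(name, overvalent_atoms(atoms, bonds), net_charge(atoms))
```

The two nitro drawings show why a separate standardization step is needed: both pass as chemistry to a human reader, but only one survives a strict valence filter.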

Inherited Errors

We have inherited errors from every database… all public compound databases, including ours, have errors

“Incorrect” structures – assertions, timelines etc
“Incorrect” names associated with structures
Properties
Links
Publications
ENORMOUS CHALLENGE

Compounds and Identifiers

Be careful searching by Name!

Determining the correct structure by name searching is difficult online!
Good, not perfect: Wikipedia, ChEBI/ChEMBL, ChemIDPlus, ChemSpider

Be VERY careful with MOST databases

Validating structures

Check for “full stereo” and use stereo descriptors especially for checking!
Check for quality of associated data sources
Check against reference literature when available – but it can be wrong
Question EVERYTHING!

Online Curation

Online databases generally do NOT allow curation or annotation

If you find errors they stay there!
ChemSpider is unique…immediate curation

ChemSpider live demo following this lecture
Searching
Deposition and Curation
ChemSpider SyntheticPages

Thank you

Email: williamsa@rsc.org
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams
