Higher Quality Chemical Depictions: Lessons Learned and Advice John Mayfield NextMove Software Ltd

RDKit UGM 2016: Higher Quality Chemical Depictions

Apr 13, 2017

Page 1: RDKit UGM 2016: Higher Quality Chemical Depictions

Higher Quality Chemical Depictions: Lessons Learned and Advice

John Mayfield
NextMove Software Ltd

Page 2

Chemistry Development Kit
• Java Library, KNIME Nodes, RCDK
• 16 years old
• 115 contributors
• History in Computer Assisted Structure Elucidation

Many legacy APIs, and poor or incorrect algorithms and data structures. Many of my contributions have focussed on core functionality because I needed it for my PhD at the time.

“Every PhD student in cheminformatics writes their own toolkit”
“Every company in cheminformatics writes their own toolkit”

While writing my thesis (2013) I needed publication-quality depictions. Existing FOSS and affordable commercial offerings were below par (I didn’t want a ChemDraw license for such a short period).

Page 3

Improvements

“John will show us what good coordinate generation looks like” - Greg Landrum

[Figure: depictions before and after the improvements, labelled 1.4 and 2.0]

Page 4

LAYOUT RENDERING

OPEN PROBLEM:

“Structure Diagram Generation”

-Position atoms X,Y coords

-Orientation

-Wedge assignment

-Objective (overlaps)

-Subjective (orientation)

“Drawing”

-Generate and position graphic primitives

-Atom Label alignment

-Fonts

-Annotation coordinates

-Display Shortcuts (Abbreviations)

-Subjective (color, donuts)

Page 5

2D Layout Literature
Clark, A. et al. 2D Structure Depiction. J. Chem. Inf. Model. 2006. 46(3)
Helson, H. Structure Diagram Generation. Reviews in Computational Chemistry, Volume 13. 1999. Ch 6
Weininger, D. SMILES. 3. Depict. Graphical Depiction of Chemical Structures. J. Chem. Inf. Comput. Sci. 1990. 30(3)

Rendering Literature
Brecher, J. Graphical Representation Standards For Chemical Structure Diagrams (IUPAC Recommendations 2008). Pure Appl. Chem. 2008. 80(2)
Clark, A. et al. Rendering Molecular Sketches for Publication Quality Output. Molecular Informatics. 2013. 32
CambridgeSoft. CDX File Format. Online: http://www.cambridgesoft.com/
Clark, A. et al. Basic primitives for molecular diagram sketching. J. Cheminf. 2010. 2(8)
Gushurst, A. et al. The Substance Module: The Representation, Storage, and Searching of Complex Structures. J. Chem. Inf. Comput. Sci. 1991. 31

Homework

Page 6

LAYOUT
RDKit algorithm, better architecture than CDK. My recent patches treat “symptom over cause”, but are useful:
(1) Ring templates
(2) Macrocycle templates
(3) Layout refinement (fix collisions)
(4) Humpty Dumpty

Page 7

Which Way Up?

Page 8

Canonical Ring Indexing

1) Sub+Hetero 2) Hetero 3) Anonymised

Each template ring system is indexed in three ways; lookup follows the same order.
Capture standard orientations (algorithm fallback possible).
Generated library from hand drawn structures - duplication needed (algorithm or hand curated).
Possible sources: ChEBI, Suppliers, Patents, Journals.
Stored as CXSMILES in CDK.

*1**C*1 |(1.21,.39,;.75,-1.03,;-.75,-1.03,;-1.21,.39,;.0,1.28,)|

See Helson, H. 1999
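The template above keeps its 2D coordinates in the CXSMILES layer. A minimal sketch of reading them back, assuming the layer holds only a coordinate block of `x,y,z` triples separated by `;` (other CXSMILES fields are not handled). `parse_cxsmiles_coords` is a hypothetical helper for illustration, not a CDK or RDKit API.

```python
def parse_cxsmiles_coords(cxsmiles):
    """Split 'SMILES |(x,y,z;...)|' into the SMILES string and a list of (x, y)."""
    smiles, _, layer = cxsmiles.partition(" |")
    start = layer.index("(") + 1
    end = layer.index(")", start)
    coords = []
    for triple in layer[start:end].split(";"):
        x, y, _z = triple.split(",")  # z is left empty for 2D templates
        coords.append((float(x), float(y)))
    return smiles, coords


template = "*1**C*1 |(1.21,.39,;.75,-1.03,;-.75,-1.03,;-1.21,.39,;.0,1.28,)|"
smiles, coords = parse_cxsmiles_coords(template)
# One coordinate pair per atom, in SMILES atom order.
```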

Page 9

Macrocycle Indexing
Index multiple layouts for even ring sizes.
Selects multiple templates and scores registry shifts based on: ring attach points, cis/trans correctness, heteroatom positions.
Odd size rings use the n+1 template, last coord dropped.

[Figure: seven template layouts, each labelled R18]

See Clark, A. et al. 2006
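The odd-ring rule and the registry shifts can be sketched as follows. This is an illustration only, not CDK's code: `registry_shifts`, `place_on_template` and the `score` callback are hypothetical names, and the toy template and score stand in for real macrocycle templates and for scoring on attach points, cis/trans correctness and heteroatom positions.

```python
def registry_shifts(ring_atoms):
    """Yield every cyclic rotation of the ring atom order."""
    for shift in range(len(ring_atoms)):
        yield ring_atoms[shift:] + ring_atoms[:shift]


def place_on_template(ring_atoms, template_coords, score):
    """Map ring atoms to template positions using the best-scoring shift.

    An odd ring of size n borrows the n+1 template and drops its last coordinate.
    """
    if len(template_coords) - len(ring_atoms) not in (0, 1):
        raise ValueError("template must have n or n+1 positions")
    coords = template_coords[:len(ring_atoms)]
    best = max(registry_shifts(ring_atoms), key=lambda order: score(order, coords))
    return dict(zip(best, coords))


# Toy data: a 6-position template used for a 5-membered ring, with a score
# that simply prefers atom "a" as far right as possible (an assumption).
toy_template = [(2, 0), (1, 1), (-1, 1), (-2, 0), (-1, -1), (1, -1)]
toy_score = lambda order, coords: coords[order.index("a")][0]
placed = place_on_template(list("abcde"), toy_template, toy_score)
```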

Page 10

RDDepictor.cpp

Page 11

Layout Refinement

RDKit
1. Initialise
2. Sample or Rotate
3. Shrink and Bend
4. Orientation

CDK
1. Initialise
2. Rotate, Bend, Stretch, Invert
3. Orientation

Rotate: flip rotatable bonds (most desirable, optimal)
Bend: unsnap/open angles
Stretch: make bonds longer
Shrink: make bonds shorter
Invert: mirror a terminal bond inside a ring

See Shelley 1983 and Helson, H. 1999
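A "Rotate" step can be sketched as a flip that is kept only when it reduces crowding. The congestion score used here (sum of inverse squared distances over atom pairs) is an assumption chosen for illustration, and `congestion`, `reflect` and `try_flip` are hypothetical names; neither toolkit is claimed to work exactly this way.

```python
from itertools import combinations


def congestion(coords):
    """Sum of 1/d^2 over all atom pairs; grows quickly as atoms crowd together."""
    total = 0.0
    for (x1, y1), (x2, y2) in combinations(coords.values(), 2):
        d2 = (x1 - x2) ** 2 + (y1 - y2) ** 2
        total += 1.0 / max(d2, 1e-6)  # guard against exact overlaps
    return total


def reflect(point, a, b):
    """Mirror point across the line through a and b."""
    (px, py), (ax, ay), (bx, by) = point, a, b
    dx, dy = bx - ax, by - ay
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    fx, fy = ax + t * dx, ay + t * dy  # foot of the perpendicular
    return (2 * fx - px, 2 * fy - py)


def try_flip(coords, bond, moving):
    """Flip the atoms in `moving` about `bond`; keep the flip only if less crowded."""
    a, b = coords[bond[0]], coords[bond[1]]
    flipped = dict(coords)
    for atom in moving:
        flipped[atom] = reflect(coords[atom], a, b)
    return flipped if congestion(flipped) < congestion(coords) else coords


# Atom 2 sits almost on top of atom 3; flipping it about bond 0-1 relieves that.
crowded = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (1.5, 0.87), 3: (1.6, 0.8)}
relaxed = try_flip(crowded, bond=(0, 1), moving=[2])
```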

Page 12

Orientation

Mostly align the principal ring system; rules ~IUPAC naming
• Core ring orientation (fused rings, steroids)
• Layout width/height (RDKit canonical orientation)
• Bond snapping (align to 30°)
• Symmetry (patented in US)

See GR-4.2.1, IUPAC, 2008
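Bond snapping can be sketched as a whole-layout rotation that puts one reference bond on the nearest multiple of 30°. Which bond serves as the reference is not stated on the slide, so that choice is left to the caller here; `snap_rotation` and `rotate` are hypothetical helper names.

```python
import math


def snap_rotation(a, b, step_deg=30.0):
    """Rotation (radians) that moves bond a->b onto the nearest multiple of step_deg."""
    angle = math.degrees(math.atan2(b[1] - a[1], b[0] - a[0]))
    snapped = round(angle / step_deg) * step_deg
    return math.radians(snapped - angle)


def rotate(coords, theta):
    """Rotate every point about the origin by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [(x * c - y * s, x * s + y * c) for x, y in coords]


# A bond that is slightly off horizontal gets straightened.
layout = [(0.0, 0.0), (1.0, 0.1)]
snapped_layout = rotate(layout, snap_rotation(layout[0], layout[1]))
```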

Page 13

Humpty Dumpty sat on a wall,

Humpty Dumpty had a great fall.

All the king's horses and all the king's men

Couldn't put Humpty together again.

CHEMBL590010 (BioVia)

Rebond monoatomic ions before the initial layout, then delete the added bonds afterwards.

Page 14

CHEMBL590010 (CDK)

Humpty Dumpty sat on a wall,

Humpty Dumpty had a great fall.

All the king's horses and all the king's men

Couldn't put Humpty together again.

Page 15

Layout Comparison

[Figure: layouts from Open Babel, Avalon, RDKit, CDK and Indigo]

Page 16

Structure Layout Testset
Based on the 28 structures and 10 obstacles from Clark, A. et al. 2006

1 new obstacle (11 total), 20 new structures (48 total) - have a lot more

Previous post by Noel O’Boyle used random PubChem sample, too easy: http://baoilleach.blogspot.co.uk/2008/10/cheminformatics-toolkit-face-off.html

All layout algorithms make mistakes and produce crowded layouts, or even misrepresent structures (wrong)! CDK definitely still does this, but so do commercial offerings (see Clark 2006). Emphasis here is on silly mistakes rather than perfect layout.

Evaluating layout only, all rendered with CDK here

Page 17

48 Structures 11 Obstacles

1. Find Optimal Solution - avoidable overlaps (+2)
2. Suboptimal Solution - unavoidable overlaps (+7)
3. Global Chain Blocks
4. Double Bond Stereochemistry (+3)
5. Congested Small Rings
6. Bonding Counterions (+2) new
7. Spirocenters
8. Macrocycles (+3)
9. Ring Template Matches (+1)
10. Planar Embedding
11. 3D Ring Systems (+2)

Page 18

Performance

                                      All 48 Structures    “Fair” 41 Subset
Library     Version                   Elapsed    Mean      Elapsed    Mean
Open Babel  v2.4.1                    25:59.0    -         1.9s       46ms
RDKit       V2016.09.1.dev1           10.2s      214ms     0.1s       2ms
Avalon      1.2.0                     0.3s       6ms       0.06s      1ms
CDK         2.0-SNAPSHOT              0.5s       9ms       0.06s      1ms
Indigo      1.2.3.r0 no-smart-layout  1.7s       35ms      0.05s      1ms

“Fair” 41 Subset: skip 3D rings, skip dendritic structure

Indigo ‘smart-layout’ option: better macrocycles, but a lot worse in general, and it degrades performance

Page 19

Level 1 - Find Optimal Solution
[Figure: layouts from CDK, Indigo, RDKit, Open Babel and Avalon]

Page 20

CHEMBL1077020

Open Babel 25m

Page 21

CHEMBL1077020

Avalon 200ms

Page 22

CHEMBL1077020

RDKit 10s

Page 23

CHEMBL1077020

CDK 300ms

Page 24

CHEMBL1077020

Indigo 997ms

Page 25

Level 2 - Suboptimal solution
[Figure: layouts from CDK, Indigo, RDKit, Open Babel and Avalon]

Page 26

Level 2 - Suboptimal solution
[Figure: layouts from CDK, Indigo, RDKit, Open Babel and Avalon]

Page 27

Level 2 - Suboptimal solution
[Figure: layouts from CDK, Indigo, RDKit, Open Babel and Avalon]

Page 28

Level 4 - Cis/Trans Bonds
[Figure: layouts from CDK, Indigo, RDKit, Open Babel and Avalon]

Page 29

Level 5 - Congested Rings
[Figure: layouts from CDK, Indigo, RDKit, Open Babel and Avalon]

Page 30

Level 6 - Counterions
[Figure: layouts from CDK, Indigo, RDKit, Open Babel and Avalon]

Page 31

Level 8 - Macrocycles
[Figure: layouts from CDK, Indigo, Indigo +smart-layout, RDKit, Open Babel and Avalon]

Page 32

Level 8 - Macrocycles
[Figure: layouts from CDK, Indigo, Indigo +smart-layout, RDKit, Open Babel and Avalon]

Page 33

Level 9/10 - Ring Template/Embedding
[Figure: layouts from CDK, Indigo, RDKit, Open Babel and Avalon]

Page 34

RENDERING
A lot of quick wins for RDKit are in improving rendering capabilities.

Page 35

Avoid “Angstroms” in layout (CDK) and drawing (RDKit); 2D depictions are not accurate or scale models!

px is okay for raster, but pt and mm are better for vector graphics and publications

Measurement and Parameters

Journal Style: ACS 1996

Bond Length 14.4pt
Bond Spacing 18% …π bond width
Line Width 0.6pt
Bold Width 2pt …wedge bond width
Hash Spacing 2.5pt
Margin Width 1.6pt
Atom Labels 10pt
Captions 10pt …annotations
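To make the unit advice concrete, here is a small sketch that holds the ACS 1996 values in points and converts them only at output time. The dictionary keys, the 96 dpi default, and the reading of "18%" as a fraction of the bond length are assumptions made for the example.

```python
PT_PER_INCH = 72.0
MM_PER_INCH = 25.4

# Values in points, as quoted on the slide.
ACS_1996 = {
    "bond_length": 14.4,
    "line_width": 0.6,
    "bold_width": 2.0,
    "hash_spacing": 2.5,
    "margin_width": 1.6,
    "atom_label_size": 10.0,
    "caption_size": 10.0,
}
BOND_SPACING_FRACTION = 0.18  # assumed to be relative to bond length


def pt_to_mm(pt):
    """Points to millimetres (vector output, print)."""
    return pt * MM_PER_INCH / PT_PER_INCH


def pt_to_px(pt, dpi=96.0):
    """Points to pixels for raster output; the dpi is an assumption."""
    return pt * dpi / PT_PER_INCH


bond_spacing_pt = BOND_SPACING_FRACTION * ACS_1996["bond_length"]
bond_length_mm = pt_to_mm(ACS_1996["bond_length"])
```

Keeping a single style table in points means the same depiction can be emitted as SVG, PDF or PNG without re-deriving the proportions.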

Page 36

How Many <text> Elements in SVG?
(a) 3
(b) 6
(c) 7

Page 37

How Many <text> Elements?
(a) 3
(b) 6
(c) 7
(d) 0

Page 38

Font Embedding

See Clark, A 2013

More Portable

Convex Hull Bounds
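Once label glyphs are available as outlines, their bounds can be computed from the outline points. As an illustration, here is a convex hull over a set of 2D points using Andrew's monotone chain. The sample points are made up, and nothing here is taken from the CDK or RDKit sources.

```python
def convex_hull(points):
    """Return the convex hull in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # > 0 when o->a->b turns left
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


# Hypothetical outline samples: a square with one interior and one edge point.
outline = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1), (1, 0)]
hull = convex_hull(outline)
```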

Page 39

Bounding Box

See Clark, A 2013

8.46x8.03

7.85x7.68

Page 40

Fun with Fonts

Page 41

Depiction Scale
Control depiction size by bond length parameter. Shrink to fit, but avoid stretch to fit (make optional?).
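The "shrink to fit, but avoid stretch to fit" rule amounts to capping the scale factor at 1 unless stretching is explicitly requested. A minimal sketch, with `fit_scale` and `allow_stretch` as hypothetical names:

```python
def fit_scale(width, height, max_width, max_height, allow_stretch=False):
    """Scale factor that fits a depiction into a box without enlarging it by default."""
    scale = min(max_width / width, max_height / height)
    if scale > 1.0 and not allow_stretch:
        return 1.0  # small molecules keep their natural bond length
    return scale
```

With this rule, a benzene ring and a large macrocycle drawn into the same box keep comparable bond lengths until the larger structure no longer fits.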

Page 42

Adjunct Placement and Alignment

Alignment

Hydrogen Placement

See Clark, A. 2013 and GR-2.1.7, GR-5, IUPAC, 2008

Page 43

Bold and Hashed Wedges

Slanting and Bifurcation of Wedges

Possible with hashes but controversial

See Clark, A. 2013

Page 44

Double Bonds

Offset Double Bonds

Which side? (General rule=more benzene!)

Centred Double Bonds

See GR-1.10, IUPAC, 2008
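For an offset double bond, the second line is a parallel copy shifted to one side. The sketch below shifts it toward a caller-supplied point such as a ring centre, which is one way to get the "more benzene" look. `offset_double_bond` is a hypothetical helper, the spacing is in the same units as the coordinates, and shortening of the inner line is not handled.

```python
import math


def offset_double_bond(a, b, toward, spacing):
    """Return the second line of double bond a-b, shifted toward the point `toward`."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length = math.hypot(dx, dy)
    nx, ny = -dy / length, dx / length  # unit normal, left of a->b
    # Flip the normal if `toward` lies on the other side of the bond.
    if (toward[0] - a[0]) * nx + (toward[1] - a[1]) * ny < 0:
        nx, ny = -nx, -ny
    return (
        (a[0] + nx * spacing, a[1] + ny * spacing),
        (b[0] + nx * spacing, b[1] + ny * spacing),
    )


# Horizontal bond with the ring centre below it: the second line goes below.
inner = offset_double_bond((0.0, 0.0), (1.0, 0.0), toward=(0.5, -1.0), spacing=0.18)
```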

Page 45

Annotations

See GR-11.1, IUPAC, 2008

Page 46

Sgroups and Generic Features

Attachment Points
Substituent Labels
“R” Groups
Positional Variation
Structure Repeat Units
Abbreviations

Page 47

Leblanc, C. Pulz, R. Siefl, N. Pyrrolopyrimidines And Pyrrolopyridines. U.S. Patent Grant (2012)

Abbreviations in Action

Page 48

Leblanc, C. Pulz, R. Siefl, N. Pyrrolopyrimidines And Pyrrolopyridines. U.S. Patent Grant (2012)

Abbreviations in Action

Page 49

Colors

See GR-0.5, IUPAC, 2008

Page 50
Page 51

Layout + Rendering Comparison
[Figure: depictions from CDK, Indigo, RDKit and Open Babel]

Page 52

Layout + Rendering Comparison
[Figure: depictions from CDK, Indigo, RDKit and Open Babel]

Page 53

Acknowledgements

Constructive criticism of depictions: Roger Sayle, Daniel Lowe, Noel O’Boyle

Reviewing patches: Egon Willighagen

Initial CDK layout: Christoph Steinbeck

Seminal Papers: Alex Clark and Jonathan Brecher.

Page 54

Spot the difference

[Figure: the same structure rendered by CDK and RDKit, with the differences annotated: label offset, subscript, H position, slant wedges, label offset]