
Canonicalized systematic nomenclature in cheminformatics

Jun 04, 2015


Jeremy Yang

Poster presented at the 229th National ACS Meeting in San Diego, 2005.

Canonicalized systematic nomenclature in chemoinformatics

And some new canonicalization tools from OpenEye

Jeremy J. Yang

3600 Cerrillos Road Suite 1107 Santa Fe, New Mexico 87507

505.473.7385 [email protected]

www.eyesopen.com

Introduction

Canonicalization in chemoinformatics facilitates rigorous, unambiguous expression and handling of chemical data and knowledge. However, just as chemistry encompasses multiple levels of abstraction and modelling, no single canonicalization method is sufficient to solve all problems. This study reviews some existing canonicalization methodology and describes new methods implemented by the chemoinformatics library OEChem and other OpenEye tools.

Aha! -- Chemo-taxonomy is a “stranded hierarchy”

The Morgan algorithm was a huge step forward, but the basic algorithm has some shortcomings in performance and comprehensiveness, which have been corrected by subsequent investigators. The resulting methods have been implemented and widely used in large-scale database systems. Some key contributions:

• Morgan, 1965 → note to Harry: “You da man!” → CAS
• Wipke & Dyott, 1974 → stereo-enhanced Morgan → MDL
• Jochum & Gasteiger, 1977 → Morgan refinement → CACTVS
• Shelley & Munk, 1977 → Morgan refinement
• Weininger, 1988 → CANSMI canonical line notation → Daylight
• Bradshaw, 1998 → parent compounds → GSK, Daylight
• Delany & Sayle, 1999 → tautomers → OpenEye
• INChI, 2004 → global canonical line notation

Fig 2: Morgan slow due to symmetry.

References

1. H. L. Morgan, "Generation of a unique machine description for chemical structures - A technique developed at Chemical Abstracts Service", J. Chem. Doc. 1965, 5, 107.

2. W. Todd Wipke and Thomas M. Dyott, "Stereochemically unique naming algorithm", J. Am. Chem. Soc. 1974, 96(15), 4834-4842.

3. Clemens Jochum and Johann Gasteiger, "Canonical Numbering and Constitutional Symmetry", J. Chem. Inf. Comput. Sci. 1977, 17(2), 113-117.

4. Craig A. Shelley and Morton E. Munk, "Computer Perception of Topological Symmetry", J. Chem. Inf. Comput. Sci. 1977, 17(2), 110-113.

5. Craig A. Shelley and Morton E. Munk, "An Approach to the Assignment of Canonical Connection Tables and Topological Symmetry Perception", J. Chem. Inf. Comput. Sci. 1979, 19(4), 247-250.

6. David Weininger, Arthur Weininger and Joseph L. Weininger, "SMILES 2: Algorithm for Generation of Unique SMILES Notation", J. Chem. Inf. Comput. Sci. 1989, 29(2), 97-101.

7. "A beginner's guide to responsible parenting or knowing your roots", www.daylight.com/meetings/emug98/Bradshaw/, EuroMUG '98, Cambridge, UK, Oct 1998.

8. Jack Delany and Roger Sayle, "Canonicalization and Enumeration of Tautomers", www.daylight.com/meetings/emug99/Delany/taut_html/sld001.htm, EuroMUG '99, Cambridge, UK, Oct 1999.

9. Roger Sayle and Geoff Skillman, "Hooked on Protonics", www.eyesopen.com/about/events/presentations/acs02/sld001.htm, 224th ACS National Meeting, Boston, Aug 2002.

10. John Bradshaw, "Introduction to Chemical Info Systems", www.daylight.com/meetings/emug02/Bradshaw/Training/, EuroMUG '02, Cambridge, UK, 24-26 Sep 2002.

11. "That INChI Feeling", www.reactivereports.com/40/40_3.html, Reactive Reports, Sep 2004 (issue 40).

12. OEChem, OpenEye Scientific Software, 2002.

13. QuacPac, OpenEye Scientific Software, 2004.

• subatomic → atoms → molecules
• normal weight atoms → isotopes
• Kekule molecule model → aromatic molecule models
• non-stereo molecule → stereoisomers
• single molecule → combinatorial libraries
• single molecule → queries
• small molecule → macromolecule + cofactors + ligands
• single molecule → Markush structures
• single molecule → tautomer set
• single molecule → pKa states
• single molecule → reactions
• 2D → 3D

There is a hierarchical relationship among some of these expansions, while others are independent. For example, a combinatorial library may consist of stereoisomeric or non-stereo individuals. For every combination of molecular representations, canonicalization could be advantageous for the reasons described. Hence the task of canonicalization is a multi-faceted one.

Definition of canonicalization

A canonicalization algorithm must determine a single representation among many possible representations for an individual in its domain.

More Morgan, and more

Benefits of canonicalization

• testing equality of molecules
• database search speed
• rigorous informatics and thinking

Dealing with reality: practical problems

1. Existing formats (may often be):
   • ambiguous – poorly defined spec or poor compliance
   • un-rigorous – both syntax and semantics are important
   • non-comprehensive – only organic, covalent, size limits

2. Stereoisomer canonicalization remains difficult
   • "relative stereo-centers"

3. Differing valence assumptions and conventions
   • implicit-valence and Hcount formats prone to mishandling

4. Information content and model differences in existing formats
   • cannot robustly convert if info must be inferred (e.g. bonds)

5. Disagreement over correct chemistry
   • e.g., valences, aromaticity

6. Local versus global canonicalization
   • Benefits of canonicalization are available locally or globally. But global canonicalization requires cooperation.
   • Locality definition (time, place, software versions)

Morgan demo and study

New: canonicalizing molfiles

Canonicalizing a connection table is not new and was discussed by Morgan [1] and others. But generating canonical forms of current standard formats is not widely done, for historical and practical reasons, despite the available benefits. Those benefits are increasingly within reach now that longer strings are easily handled by current computers. OEChem provides sufficient control to accomplish this task. Proposed algorithm:

• Remove non-structural data
• Suppress hydrogens
• Canonical atom order
• Canonical bond order
• Canonical Kekule bonding based on (selected) aromaticity model

However, the advantages of more terse canonical line notations remain.

RESULTS: Using the test program canmol.py, the 1990 molecules of the NCI Diversity set were converted to canonical SDF files, which were exactly equal to the SDF files converted via SMILES (demo.eyesopen.com/cgi-bin/canmol). The same was done with the MOL2 format. This test validates the ability of OEChem to canonicalize molfiles as strings.
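The proposed molfile algorithm can be illustrated with a small sketch. It is not canmol.py and does not use OEChem: the functions `suppress_hydrogens`, `canonical_table` and `canonical_string` are hypothetical, the connection table is a toy (element symbols plus `(i, j, order)` bonds), and the canonical atom order is found by brute force over all permutations, which stands in for a Morgan-like ordering and is only usable for very small molecules. Non-structural data is dropped simply because only atoms and bonds are passed in; the Kekule/aromaticity step is omitted.

```python
from itertools import permutations


def suppress_hydrogens(atoms, bonds):
    """Fold explicit H atoms into hydrogen counts on their heavy neighbours.

    atoms: list of element symbols; bonds: list of (i, j, order) tuples.
    Returns (heavy_atoms, heavy_bonds): heavy_atoms is a list of
    (element, h_count) and heavy_bonds is re-indexed to that list.
    """
    heavy = [i for i, element in enumerate(atoms) if element != "H"]
    new_index = {old: new for new, old in enumerate(heavy)}
    h_count = [0] * len(heavy)
    heavy_bonds = []
    for i, j, order in bonds:
        if atoms[i] == "H" and atoms[j] == "H":
            continue  # H-H bonds are outside the scope of this toy
        if atoms[i] == "H":
            h_count[new_index[j]] += 1
        elif atoms[j] == "H":
            h_count[new_index[i]] += 1
        else:
            heavy_bonds.append((new_index[i], new_index[j], order))
    heavy_atoms = [(atoms[old], h_count[new]) for new, old in enumerate(heavy)]
    return heavy_atoms, heavy_bonds


def canonical_table(atoms, bonds):
    """Return the lexicographically smallest (atoms, bonds) table.

    Brute force over all N! atom orders: an assumption made for clarity,
    not the ordering method used by any real toolkit.
    """
    heavy_atoms, heavy_bonds = suppress_hydrogens(atoms, bonds)
    n = len(heavy_atoms)
    best = None
    for perm in permutations(range(n)):  # perm[old index] = new index
        new_atoms = [None] * n
        for old, new in enumerate(perm):
            new_atoms[new] = heavy_atoms[old]
        # Canonical bond order: lower atom index first, then sort the list.
        new_bonds = sorted(
            (min(perm[i], perm[j]), max(perm[i], perm[j]), order)
            for i, j, order in heavy_bonds
        )
        candidate = (new_atoms, new_bonds)
        if best is None or candidate < best:
            best = candidate
    return best


def canonical_string(atoms, bonds):
    """Serialize the canonical table so molecules can be compared as strings."""
    table_atoms, table_bonds = canonical_table(atoms, bonds)
    atom_lines = [f"{element} H{h}" for element, h in table_atoms]
    bond_lines = [f"{i + 1} {j + 1} {order}" for i, j, order in table_bonds]
    return "\n".join(atom_lines + bond_lines)


# Ethanol written with two different atom orders and explicit hydrogens.
ethanol_a = (["C", "C", "O", "H", "H", "H", "H", "H", "H"],
             [(0, 1, 1), (1, 2, 1), (0, 3, 1), (0, 4, 1), (0, 5, 1),
              (1, 6, 1), (1, 7, 1), (2, 8, 1)])
ethanol_b = (["O", "H", "C", "H", "H", "C", "H", "H", "H"],
             [(0, 1, 1), (0, 2, 1), (2, 3, 1), (2, 4, 1), (2, 5, 1),
              (5, 6, 1), (5, 7, 1), (5, 8, 1)])
same = canonical_string(*ethanol_a) == canonical_string(*ethanol_b)
```

Here `same` is true: both inputs serialize to the same string, so equality of molecules reduces to equality of strings, which is the benefit the canonical molfile is meant to deliver.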

New: canonical tautomers

Tautomers have the same formula (structural isomers), but may differ in proton and electron location, and formal bond order. Special cases: keto/enol, zwitterion, ring-chain. In the Delany/Sayle algorithm [8,13], hydrogen donors and acceptors are perceived, along with the number of free hydrogens. Donor and acceptor atoms are ordered canonically. At this stage all tautomerically equivalent inputs are represented identically. Hydrogen locations are then exhaustively enumerated. A simple ruleset for enumeration order can designate the first to be the canonical tautomer. Through additional rules, the likelihood can be increased that the canonical tautomer is a low-energy form.

Applications: registration (exact search), substructure searching, property prediction, similarity/clustering, protein-ligand analysis. Failure to perceive tautomerism leads to different results for different valence models which really represent the same chemical entity.
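The enumerate-then-pick-first idea described above can be sketched on a toy model. This is not the Delany/Sayle implementation or QuacPac: the site ranks are assumed to come from an earlier canonical atom ordering, each site is assumed to hold at most one mobile hydrogen, bond orders are ignored, and the `ELEMENT_PENALTY` rule is a hypothetical example of a rule that steers the choice.

```python
from itertools import combinations

# Hypothetical rule: a lower penalty marks a preferred site for a mobile hydrogen.
ELEMENT_PENALTY = {"N": 0, "O": 1, "S": 2}


def enumerate_placements(sites, n_mobile_h):
    """Yield every way to put n_mobile_h hydrogens on the ordered sites."""
    for chosen in combinations(sites, n_mobile_h):
        yield {site: int(site in chosen) for site in sites}


def canonical_tautomer(tautomer):
    """Map any tautomer to the canonical member of its tautomer set.

    tautomer: dict mapping (canonical_rank, element) to 0 or 1 mobile hydrogens.
    """
    # Perception: the ordered sites and the hydrogen count are shared by
    # all tautomers, so every equivalent input looks identical from here on.
    sites = sorted(tautomer)
    n_mobile_h = sum(tautomer.values())

    def penalty(placement):
        return sum(ELEMENT_PENALTY[element]
                   for (rank, element), h in placement.items() if h)

    # min() keeps the first of any tied placements, so enumeration order
    # breaks ties deterministically.
    return min(enumerate_placements(sites, n_mobile_h), key=penalty)


# 2-pyridone (H on N) and 2-hydroxypyridine (H on O); ranks are made up.
lactam = {(1, "N"): 1, (2, "O"): 0}
lactim = {(1, "N"): 0, (2, "O"): 1}
canonical_from_lactam = canonical_tautomer(lactam)
canonical_from_lactim = canonical_tautomer(lactim)
```

Both inputs map to the same placement, so either tautomer registers under one key. A real implementation must also reassign bond orders and reject chemically invalid placements.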

OpenEye canonicalization tools

The OpenEye chemoinformatics toolkit OEChem [12] employs an optimized Morgan-like canonical algorithm to generate canonical SMILES. In addition, the API provides a rich set of tools which can facilitate generation of canonical representations of many types, for many chemical and informational models, and for many standard file formats.

• OEChem::OECanonicalOrderAtoms()
• OEChem::OECanonicalOrderBonds()
• OEChem aromaticity models: OE, Daylight, Tripos, MDL, MMFF
• OEChem: many file formats and flavors, low-level writers
• QuacPac [13]: tautomers application and toolkit

Fig 3: Morgan fails

Fig 4: example: tautomers listed separately in ACD98. The latter is the OE-canonical form.

Conclusion

Rigorous and effective chemoinformatics systems require concepts and methods for canonicalization at multiple levels of chemical abstraction and organization. The current state of the art presents many theoretical and practical challenges. OpenEye tools can help.

Fig 1: Morgan demo. Extended connectivity values and atom orders. Uses OEChem and Ogham. NCI Diversity set processed with no errors.

The Morgan algorithm [1] is the basis of most chemical canonicalization work that followed, and deserves careful study. In 1965 Harry L. Morgan published the algorithm already implemented at CAS for its compound registry system. This work, based on generic graph theory, comprises both a theoretical solution to the problem of molecular canonicalization and a material validation of its efficacy.
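For readers studying the algorithm, a minimal sketch of its extended connectivity step follows. It assumes a hydrogen-suppressed molecular graph given as an adjacency dictionary; the function name `morgan_ec` and the pentane example are illustrative and do not come from the poster, OEChem or Ogham. The subsequent atom numbering and tie-breaking stage is not shown.

```python
def morgan_ec(adjacency):
    """Return Morgan extended connectivity (EC) values for each atom.

    adjacency: dict mapping atom index to a list of neighbour indices.
    Starts from the atom degrees and repeatedly replaces each value by the
    sum of its neighbours' values, stopping when the number of distinct
    values no longer increases. The last set that did increase is returned.
    """
    ec = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(ec.values()))
    while True:
        new_ec = {atom: sum(ec[n] for n in nbrs)
                  for atom, nbrs in adjacency.items()}
        new_classes = len(set(new_ec.values()))
        if new_classes <= n_classes:
            return ec
        ec, n_classes = new_ec, new_classes


# Pentane carbon skeleton: C0-C1-C2-C3-C4
pentane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
ec_values = morgan_ec(pentane)
# Numbering starts at the atom with the highest EC value.
start_atom = max(ec_values, key=ec_values.get)
```

For pentane the degrees (1, 2, 2, 2, 1) give two classes, and one refinement pass separates the central carbon, giving three. Symmetry-equivalent atoms (C0/C4 and C1/C3) keep equal values however many passes are run, so the numbering stage still has to break those ties.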

N! (graph isomorphism is hard) – Morgan to the rescue

This study: canonical molecular descriptions, not descriptors

The study of graph theory and canonicalization applied to chemistry is extensive and diverse. Canonical descriptors which do not fully represent the model can be of great utility in statistical analyses but are not the focus of this nomenclature study.

Results: The Maybridge 2003 database was analyzed by the OE program tautomers [13]. Of 71367 molecules, 97 have tautomers present in the database (47 pairs and one triplet). Additionally, 2381 were found to be non-unique molecules.
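The reported counts are self-consistent: 47 pairs and one triplet account for 47 × 2 + 3 = 97 molecules. How such counts might be gathered from canonical keys is sketched below. The `classify` function, the record layout and the toy keys are assumptions for illustration only; this is not the OE tautomers program, and the poster does not state how its duplicates were counted.

```python
from collections import defaultdict


def classify(records):
    """Count duplicates and collect tautomer sets from canonical keys.

    records: iterable of (name, structure_key, tautomer_key), where
    structure_key is a canonical form of the structure as drawn and
    tautomer_key is the canonical tautomer it maps to.
    """
    by_tautomer = defaultdict(set)
    seen_structures = set()
    n_duplicates = 0
    for name, structure_key, tautomer_key in records:
        if structure_key in seen_structures:
            n_duplicates += 1  # same structure registered more than once
            continue
        seen_structures.add(structure_key)
        by_tautomer[tautomer_key].add(structure_key)
    # A tautomer set is two or more distinct structures sharing one key.
    tautomer_sets = [s for s in by_tautomer.values() if len(s) > 1]
    return n_duplicates, tautomer_sets


# Toy records with made-up keys.
toy_records = [
    ("m1", "lactam", "T1"),
    ("m2", "lactim", "T1"),
    ("m3", "lactam", "T1"),   # exact duplicate of m1
    ("m4", "benzene", "T2"),
]
n_duplicates, tautomer_sets = classify(toy_records)
molecules_with_tautomers = sum(len(s) for s in tautomer_sets)
```

In the toy data, one record is a duplicate and one pair of distinct structures shares a tautomer key.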

Fig 5: tautomer triplet from Maybridge 2003

New: canonical pKa states

The canonicalization of alternative pKa states is accomplished for many classes of molecules by the OpenEye program pkatyper [13]. This problem resembles tautomer canonicalization in many respects, and is an area of active research at OpenEye.