Chemical Databases, Identifiers, Tool Kits and Web Services
October 16, 2003
Marc C. Nicklaus, CADD Group, Lab. of Medicinal Chemistry, CCR, NCI, NIH; [email protected]
Thanks also to the other members of the CADD Group: Rajeshri G. Karki, Megan L. Peach, Karen M. Green, Guangyu Sun, Igor Filippov
Acknowledgements
• Wolf-Dietrich Ihlenfeldt (formerly Computer Chemistry Center [CCC], Erlangen, Germany)
• Frank Oellien (formerly CCC)
• Bruno Bienfait (formerly CCC and LMC)
• Johannes Voigt (formerly LMC)
Reasons to Deal with (Large) Chemical Databases (of Small Molecules)
• Inventory
• Source for drug design, (virtual) screening
• Repository for associated information (assay results, physicochemical data, environmentally important properties etc.)
• Chemoinformatics
• Link to other services, databases
• Source for individual structures (e.g. for comp. chem.)
Questions
• What databases are out there?
• What databases can we share?
• What tools can we share?
• What can we offer the public?
• [How] Can we standardize?
• How can we determine: have we, has anyone, seen this structure before?
• What’s next (XML/CML… )?
The NCI Database &
Enhanced NCI Database Browser
The NCI Database
• Approximately half a million compounds
• Collected since 1955 by the National Cancer Institute, NIH
• Tested in anti-cancer screens; since the ’80s also in AIDS screens
• Managed by NCI’s Developmental Therapeutics Program (DTP); see http://dtp.nci.nih.gov
• Publicly available: currently 260,071 cpds. (“Open NCI Database”)
• Cancer screening data (60 cell lines) available for ca. 43,000 compounds
• AIDS screening data available for ca. 44,000 compounds
• Samples available from DTP for ~60% of compounds
Enhanced CACTVS Browser of the Open NCI Database
• Web-based interface for searching data from the Open NCI Database by numerous criteria, including 2D and 3D structural searches
• Augmented by many additional data:
  - derived: e.g. number of rotatable bonds
  - calculated/predicted: e.g. log P; biological activities
  - systematically determined: e.g. IUPAC names
  - cross-evaluated: e.g. commercial availability
• Boolean searches possible
• Requirements: just a Web browser; several plug-ins are optional
• Based on the chemical information toolkit CACTVS (Wolf-Dietrich Ihlenfeldt, see http://www2.chemie.uni-erlangen.de/software/cactvs/)
How To Get There….
URLs:
http://cactus.nci.nih.gov/ncidb2 (U.S. mirror)
http://www2.chemie.uni-erlangen.de/ncidb2 (European mirror)
http://cactus.nci.nih.gov
Enhanced NCI Database Browser: Query Form
Combined Search: PASS Antiangiogenesis Prediction & Name (Fragment) Exclusion
Search Result: Hitlist
Search Result: Image Gallery
Search Result: Detail View (top)
Search Result: Detail View (continued)
Update of Enhanced NCI Database Browser
• Most up-to-date DTP data sets
• Completion/curation of data where possible
• Many more additional calculated data
• New search capabilities
• New underlying database format
Update of Enhanced NCI Database Browser
Most up-to-date DTP data sets
• 260,071 compounds (new and old releases of Open NCI Database merged; new entry prevails)
• 42,577 have cancer screen results (properties: >300)
• 43,905 have AIDS screening results (properties: 3) (EC50: 3,143; IC50: 39,352)
• 115,324 have animal model assay data (NEW) (properties: ~10, string)
• 139,735 are plated (properties: 1)
• 45,224 have name(s) from DTP records (properties: 0...>20)
• 127,361 have CAS RN (properties: 1)
• 3,576 have experimental log P (properties: 1)
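The merge rule stated above ("new entry prevails") can be illustrated with a minimal Python sketch. The compound keys, record contents, and the function name `merge_releases` are hypothetical placeholders; the slides do not say which key or tool was actually used for the merge.

```python
def merge_releases(old_release, new_release):
    """Merge two releases keyed by compound ID; on conflict the new entry prevails."""
    merged = dict(old_release)   # start from a copy of the old release
    merged.update(new_release)   # entries in the new release overwrite old ones
    return merged


# Hypothetical records for illustration only.
old = {"1001": {"name": "compound A (old record)"}, "1002": {"name": "compound B"}}
new = {"1001": {"name": "compound A (new record)"}, "1003": {"name": "compound C"}}

merged = merge_releases(old, new)
for key in sorted(merged):
    print(key, merged[key]["name"])
```

Compounds present in only one release are carried over unchanged; only keys present in both are resolved in favor of the newer record.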
Update of Enhanced NCI Database Browser
Completion/curation of data
• Add CAS numbers: cross-check with other DBs
• Correct structures and/or names: mostly manually, only occasionally possible, often dangerous
• Calculate averages and SD for cell-line assay data
• (No evaluation/curation planned for animal model assay data)
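The averaging step can be sketched as follows. This is only an illustration: the slides do not specify whether values are averaged over repeated tests or over cell lines, nor whether a sample or population SD is used. The sketch assumes a list of numeric results with missing values marked as `None` and uses the sample standard deviation; `summarize_assay` is a hypothetical name.

```python
from statistics import mean, stdev


def summarize_assay(values):
    """Return (n, mean, sample SD), ignoring missing values (None).

    SD is None when fewer than two values are present.
    """
    present = [v for v in values if v is not None]
    if not present:
        return 0, None, None
    sd = stdev(present) if len(present) > 1 else None
    return len(present), mean(present), sd


# Hypothetical assay values for one compound; None marks a missing result.
n, avg, sd = summarize_assay([-5.0, -6.0, None, -7.0])
print(n, avg, sd)
```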
260,071 2D or 3D structures with USMILES, IChI, IUPAC Name, CAS RN and additional “canonical properties” [in beta test stage: http://cactus.nci.nih.gov/ncidb2/download_NEW_09-03.html ]
250,251 2D structures in SDF format
250,251 structures in USMILES format
249,081 2D structures plus cancer and AIDS data (where present) in SDF format
32,557 2D structures with cancer test data as of August 1999, in SDF format
42,689 2D structures with AIDS test data as of October 1999, in SDF format
23,031 2D structures in SDF format for which both cancer and AIDS data are available
Updated structure and combined structure+data files under preparation
Future Plans and Wishes…
• Apply/add IChI names
• Create Search&Display GUIs for more DBs
• Cross-complete databases (CAS RNs etc.)
• XML/CML-ize databases
• Move toward one virtual chemical database
• Link different database spaces together (small
(1) Possibly, but not consistently, repeated in fields <RNEXTREG> and/or <RN>
Canonical Fields
Field/Property Name: Possible values (or explanation)

Properties dependent on structure:
E_UNIQUE_ID: to be concatenated from Origin + Version/Date + internal ID; example: NLM_04-03_08359076
E_CAS: CAS Reg. No. (if available; otherwise 999-99-9)
E_NAME: Chemical Name (if available)
E_STEREO_SPECIFIED: no_stereocenter | unknown | partial | full (for most DBs, only choices 1 & 4 relevant)
E_COMPOUND_TYPE: unspecified (i.e. typically “normal organic”) | metal_complex (if pre-processed as such) | .... (maybe others?)
E_THREED_SOURCE: is_2D | experimental | CORINA_x.y | 3D_unsuccessful
E_ICHI: IChI chemical identifier
E_SMILES: CACTVS-calculated Unique SMILES code (according to 1989 definition published by Daylight, Inc.)
E_FORMULA: Molecular formula calculated by CACTVS
E_HASHY: Non-stereospecific hash code
E_HASHSY: Stereospecific hash code
E_TAUTO_HASH: Non-stereospecific tautomer-invariant hash code
E_STEREO_TAUTO_HASH: Stereospecific tautomer-invariant hash code
E_MAXFRAG_HASHY: Non-stereospecific hash code of largest fragment
E_MAXFRAG_HASHSY: Stereospecific hash code of largest fragment
E_MAXFRAG_HASHTY: Non-stereospecific tautomer-invariant hash code of largest fragment
E_MAXFRAG_HASHSTY: Stereospecific tautomer-invariant hash code of largest fragment
E_MULTIFRAG: single_fragment | multi_fragment

Properties constant for whole database:
E_ORIGIN: Detailed origin of data file; e.g. “DTP February 2003”
E_DB_VERSION: Explicit DB version number, if available
E_DB_YEAR: Year data file was generated
E_DB_TYPE: government public (e.g. NCI Open DB)
E_IS_PUBLIC: yes | no | unknown
E_SAMPLES_AVAIL: yes | on-demand | no | unknown (for whole DB!)
E_SOURCE: Source of database: Company, Agency etc.; e.g. “NCI”
E_CONTEXT: Nature, or context, of compound: e.g. environmentally relevant cpd., general chemical, drug…
E_SUPPLIER_TYPE: broker | manufacturer | repository | N/A
E_CACTVS_VERSION: x.yyy (version of the CACTVS toolkit used to process the database)
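As an illustration of the E_UNIQUE_ID and E_CAS rules, the following minimal Python sketch builds those fields for one record. The underscore separator is inferred from the example NLM_04-03_08359076; the function names and the dictionary layout are assumptions for illustration, not part of the CACTVS toolkit.

```python
CAS_PLACEHOLDER = "999-99-9"  # value given for E_CAS when no CAS Reg. No. is available


def make_unique_id(origin, version, internal_id):
    """Concatenate Origin + Version/Date + internal ID, separated by underscores."""
    return "_".join([origin, version, internal_id])


def canonical_record(origin, version, internal_id, cas=None, name=None):
    """Build the structure-dependent identification fields for one compound."""
    record = {
        "E_UNIQUE_ID": make_unique_id(origin, version, internal_id),
        "E_CAS": cas if cas else CAS_PLACEHOLDER,
    }
    if name:  # E_NAME is only set when a name is available
        record["E_NAME"] = name
    return record


rec = canonical_record("NLM", "04-03", "08359076")
print(rec)
```

Called with the components of the example on the slide, the sketch reproduces the identifier NLM_04-03_08359076 and falls back to the placeholder CAS number.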
Can’t we all get along with each other?
Identifiers
CACTVS Hash Codes
Overlap Analysis via Hash Codes
• CACTVS hash codes
• Tautomer-invariant hash codes (NEW!)
• Stereospecific vs. non-stereospecific, tautomer-sensitive vs. tautomer-invariant, entire ensemble vs. max. fragment only: all eight combinations calculated and added to SD file.
• Index file created for rapid comparison:
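The idea of comparing databases through an index of hash codes can be sketched in Python. The hash codes are treated here as opaque, precomputed strings (the values below are made up, and the sketch does not compute CACTVS hash codes itself); the field names follow the canonical fields E_UNIQUE_ID and E_HASHSY, while `build_index` and `overlap` are hypothetical helper names.

```python
from collections import defaultdict


def build_index(records, hash_field):
    """Map each hash code to the list of unique IDs that carry it."""
    index = defaultdict(list)
    for rec in records:
        code = rec.get(hash_field)
        if code:
            index[code].append(rec["E_UNIQUE_ID"])
    return index


def overlap(db_a, db_b, hash_field="E_HASHSY"):
    """Return {hash code: (IDs in A, IDs in B)} for codes present in both databases."""
    index_a = build_index(db_a, hash_field)
    index_b = build_index(db_b, hash_field)
    shared = index_a.keys() & index_b.keys()
    return {code: (index_a[code], index_b[code]) for code in shared}


# Made-up records and hash values for illustration only.
db_a = [
    {"E_UNIQUE_ID": "A_1", "E_HASHSY": "0123ABCD", "E_HASHY": "AAAA0001"},
    {"E_UNIQUE_ID": "A_2", "E_HASHSY": "4567EF01", "E_HASHY": "AAAA0002"},
]
db_b = [
    {"E_UNIQUE_ID": "B_1", "E_HASHSY": "0123ABCD", "E_HASHY": "AAAA0001"},
    {"E_UNIQUE_ID": "B_2", "E_HASHSY": "89AB2345", "E_HASHY": "AAAA0002"},
]

strict = overlap(db_a, db_b, "E_HASHSY")  # stereospecific comparison
loose = overlap(db_a, db_b, "E_HASHY")    # non-stereospecific comparison
print(len(strict), len(loose))
```

Choosing a different hash field changes the notion of "same structure": in this toy data the second pair of records matches only when stereochemistry is ignored.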
Create High-Quality 2D Drawings of Your Chemical Structures:
Many input formats supported: SD, MOL, PDB, SMILES, Sybyl…
NCI Screening Data 3D Miner
http://cactus.nci.nih.gov/services/3DMiner/
Visualize & Mine 60-Cell Line Cancer Data for 41,000 NCI Compounds
GUI Generator
CACTVS-based GUI that takes a structure file (SD or other format) and associated data files (TXT, Excel, etc.) as input and generates a Web service from it. Its output is a CACTVS database file (.cbase file) and a CACTVS script (generating JavaScript pages) that allows Web-based searches of this database and display of the results.
Will be freely distributed for non-profit use. Since script=executable, also freely adaptable. Currently: alpha version.
Pipeline Pilot
• SciTegic, Inc.
• GUI-based Icon Pipeline paradigm
• Predefined components (really: scripts); you can modify them, or write your own
• Drag&drop to form pipelines
• Ensemble of pipelines is the program (called ‘Protocol’), can be saved, shared…
• Now Win. 2000/XP only; fall ’03: Linux
• Very fast: up to 20,000 cpds/sec
• Expensive
3D Pharmacophore Searches
• Up to 25 conformers pre-calculated by the program Catalyst (MSI) are stored for each compound.
• Searches are possible by distance constraints and other query features. Most ISIS features are implemented, such as exclusion spheres, centroids, points on lines….
• There are two ways to define a query: prepare a query file in an external program, such as Catalyst, ISIS/Draw etc., and submit it in .mol format; or use the JME Editor available within the service.
Example: 3-point pharmacophore used in previous study on HIV-1 integrase inhibitor discovery. J. Med. Chem. 1997, 40(6), 920-929.
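A search by distance constraints over stored conformers can be sketched in Python as follows. This is a simplified illustration, not the service's implementation: it assumes the pharmacophore points of each conformer are already assigned to named features (a real search would also have to enumerate possible atom-to-feature mappings), and the coordinates, distance ranges, and function names are made up.

```python
from math import dist


def conformer_matches(points, constraints):
    """True if every constrained pair of feature points lies within its distance range."""
    return all(
        lo <= dist(points[a], points[b]) <= hi
        for (a, b), (lo, hi) in constraints.items()
    )


def compound_matches(conformers, constraints):
    """A compound is a hit if at least one stored conformer satisfies the query."""
    return any(conformer_matches(c, constraints) for c in conformers)


# Hypothetical 3-point query: allowed distance ranges (in Angstrom) between features.
query = {
    ("A", "B"): (2.5, 3.5),
    ("A", "C"): (3.5, 4.5),
    ("B", "C"): (4.5, 5.5),
}

# Two made-up conformers of one compound; each maps a feature name to 3D coordinates.
conformers = [
    {"A": (0.0, 0.0, 0.0), "B": (6.0, 0.0, 0.0), "C": (0.0, 4.0, 0.0)},  # A-B too long
    {"A": (0.0, 0.0, 0.0), "B": (3.0, 0.0, 0.0), "C": (0.0, 4.0, 0.0)},  # 3-4-5 triangle
]

hit = compound_matches(conformers, query)
print(hit)
```

Storing several conformers per compound matters because the query is geometric: in this toy case the first conformer fails the A-B constraint, and the compound is found only through the second one.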