The ChEMBL Database
ICIC 2012, Berlin, Germany, October 2012
John P. Overington
EMBL-EBI
Drug-like compounds
Chemical Space All compounds Available compounds
Only certain molecules have features consistent with good pharmacological properties
Druggable targets
Target Space
Only certain targets have binding sites capable of ligand efficient binding of drug-like ligands
All targets Available targets
Accessible Pharmacological Space
Available compounds for target but non-drug-like
Drug-like compounds but no complementary targets
Druggable targets but no complementary compounds
Druggable targets and complementary compounds
Presented to P&G, Cincinnati, April 2005, © 2005 Inpharmatica Ltd.
All reasonable molecules 10^20
All reasonable proteins 10^6
Screened proteins 10^3
Screened molecules 10^7-10^8
ChEMBL
Chemogenomics
Exploration of bioactivity space at genomic scale Structure Activity Relationship (SAR)
Drugs 10^3
Drug targets 10^2
ChEMBL Database
• http://www.ebi.ac.uk/chembl
• Funded by a Strategic Award from the Wellcome Trust
• World’s largest primary source of Open pharmacology/drug discovery data
  – Contains synthetic small molecules, natural products and biologicals
  – Strong integration and annotation of chemical and biological data
  – OSINT approach to data gathering
  – Tight integration with other EBI resources
    • Ensembl, 1000 Genomes, UniProt, PDBe, ArrayExpress, Atlas…
  – Data sharing agreements in place with key public resources, e.g. PubChem
• Open Data – CC-BY-SA licence
• Free downloads, secure private searching, …
• REST web service API
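As a rough illustration of the REST web service mentioned above, the sketch below builds a request URL for a single molecule record and fetches it as JSON. The base URL and the `/molecule/<id>.json` path are assumptions based on the publicly documented ChEMBL web services and are not given in the slides; `molecule_url` and `fetch_molecule` are hypothetical helper names.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: base URL of the public ChEMBL web services (not stated in the slides).
BASE_URL = "https://www.ebi.ac.uk/chembl/api/data"


def molecule_url(chembl_id: str) -> str:
    """Build the URL for one molecule record, requesting JSON output."""
    return f"{BASE_URL}/molecule/{quote(chembl_id)}.json"


def fetch_molecule(chembl_id: str, timeout: float = 30.0) -> dict:
    """Download and decode one molecule record (requires network access)."""
    with urlopen(molecule_url(chembl_id), timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_molecule("CHEMBL25")` would request the record for the example identifier CHEMBL25; the shape of the returned JSON should be checked against the service documentation rather than assumed.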
Target Discovery
Lead Discovery Lead Optimization
Preclinical Development
Phase 1 Phase 2 Phase 3 Launch (Phase 4)
Drug Discovery
~1,400,000 compound records
>10,000,000 bioactivities
~46,000 abstracted papers
~9,000 targets
~12,000 clinical candidates
~1,600 drugs
• Target identification • Microarray profiling • Target validation • Assay development • Biochemistry • Clinical/Animal disease models
• High-throughput Screening (HTS) • Fragment-based screening • Focused libraries • Screening collection
• Medicinal Chemistry • Structure-based drug design • Selectivity screens • ADMET screens • Cellular/Animal disease models • Pharmacokinetics
• Toxicology • In vivo safety pharmacology • Formulation • Dose prediction
PK tolerability Efficacy
Safety & Efficacy
Indication discovery, repurposing & expansion
Med. Chem. SAR Clinical Candidates Drugs
Discovery Development Use
ChEMBL content
Only ~1% of Genome is a Drug Target
Drug Approvals
FDA Approved Drugs
NFκB Pathway – key control mechanism for inflammation
Affinity of Drugs for their ‘Targets’
Ki, Kd, IC50, EC50, & pA2 endpoints for drugs against their ‘efficacy targets’
[Figure: histogram. x-axis: -log10 affinity, 2 to 12 (10 mM to 1 pM); y-axis: Frequency, 0 to 400]
Overington et al., Nature Rev. Drug Discov. 5, pp. 993-996 (2006)
Gleeson et al., Nature Rev. Drug Discov. 10, pp. 197-208 (2011)
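The -log10 affinity scale on this slide can be sketched in a few lines: convert the measured value to molar units, then take the negative base-10 logarithm. The function name `p_affinity` and the unit table are illustrative choices, not taken from the slides.

```python
import math

# Conversion factors from common concentration units to molar.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def p_affinity(value: float, unit: str = "nM") -> float:
    """Return -log10 of an affinity value (Ki, Kd, IC50, ...) expressed in molar."""
    if value <= 0:
        raise ValueError("affinity must be positive")
    return -math.log10(value * UNIT_TO_MOLAR[unit])
```

On this scale 10 mM maps to 2 and 1 pM maps to 12, matching the axis range of the histogram; the Ki of 4.5 nM quoted later in the slides corresponds to about 8.35.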
Clinical Candidates
• Collection of clinical development candidates
  – Contains ~12,000 2-D structures/sequences
    • Estimated size ~35-45,000 compounds
  – Work in progress
    • e.g. Protein kinases, 393 distinct clinical candidates
Different Types of Drugs
Pharma Industry Productivity
File Registration number vs. USAN date
[Figure: file registration number (0 to 800,000) plotted against USAN date (1960 to 2010); labels: Phase 2b date, ~Discovery date]
Overington, unpublished
Pharma Industry Productivity
[Figure: bar chart. x-axis: File registration number range, in bins of 100,000 from 1-100,000 to 700,001-800,000; y-axis: 0 to 70]
64 USANs/100,000 compounds
1.9 USANs/100,000 compounds
16 Drugs/100,000 compounds
0.4 Drugs/100,000 compounds
Large Pharma needs on average to synthesize and test ~250,000 compounds for each launched drug
Overington, unpublished
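The ~250,000 figure appears consistent with the lowest rate quoted on the chart (0.4 drugs per 100,000 compounds): inverting a rate per 100,000 compounds gives the number of compounds per success. That the slide's figure was derived this way is an assumption; the sketch below only shows the arithmetic, with `compounds_per_success` as a hypothetical helper name.

```python
def compounds_per_success(successes_per_100k: float) -> float:
    """Compounds registered per success, given successes per 100,000 compounds."""
    if successes_per_100k <= 0:
        raise ValueError("rate must be positive")
    return 100_000 / successes_per_100k


# Rates as quoted on the slide.
for label, rate in [("USAN", 64), ("USAN", 1.9), ("drug", 16), ("drug", 0.4)]:
    per_success = compounds_per_success(rate)
    print(f"{rate} {label}s per 100,000 compounds: ~{per_success:,.0f} compounds per {label}")
```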
Patent and Publication Lag
IBM Patent data and ChEMBL
Clinical Candidates
What Is the ChEMBL Data?
SAR Data
Compound
Assay
Ki=4.5 nM
>Thrombin MAHVRGLQLPGCLALAALCSLVHSQHVFLAPQQARSLLQRVRRANTFLEEVRKGNLERECVEETCSYEEAFEALESSTATDVFWAKYTACETARTPRDKLAACLEGNCAEGLGTNYRGHVNITRSGIECQLWRSRYPHKPEINSTTHPGADLQENFCRNPDSSTTGPWCYTTDPTVRRQECSIPVCGQDQVTVAMTPRSEGSSVNLSPPLEQCVPDRGQQYQGRLAVTTHGLPCLAWASAQAKALSKHQDFNSAVQLVENFCRNPDGDEEGVWCYVAGKPGDFGYCDLNYCEEAVEEETGDGLDEDSDRAIEGRTATSEYQTFFNPRTFGSGEADCGLRPLFEKKSLEDKTERELLESYIDGRIVEGSDAEIGMSPWQVMLFRKSPQELLCGASLISDRWVLTAAHCLLYPPWDKNFTENDLLVRIGKHSRTRYERNIEKISMLEKIYIHPRYNWRENLDRDIALMKLKKPVAFSDYIHPVCLPDRETAASLLQAGYKGRVTGWGNLKETWTANVGKGQPSVLQVVNLPIVERPVCKDSTRIRITDNMFCAGYKPDEGKRGDACEGDSGGPFVMKSPFNNRWYQMGIVSWGEGCDRDGKYGFYTHVFRLKKWIQKVIDQFGE
ED2=230 nM
What Is the ChEMBL Data?
Inhibition of human Thrombin
PTT (partial thromboplastin time)
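One way to picture the compound/assay/result triple on these slides is as a small record type. The sketch below is illustrative only: the field names are hypothetical and do not reflect the ChEMBL schema, "compound-1" is a placeholder for the structure drawn on the slide, and the pairing of each value with each assay description is an assumption read off the slide layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActivityRecord:
    """One SAR data point: a compound measured in an assay (illustrative fields)."""
    compound: str   # placeholder identifier for the drawn structure
    assay: str      # free-text assay description
    endpoint: str   # measurement type, e.g. "Ki"
    value: float
    units: str


# Assumed pairing of the slide's values with its assay descriptions.
records = [
    ActivityRecord("compound-1", "Inhibition of human Thrombin", "Ki", 4.5, "nM"),
    ActivityRecord("compound-1", "PTT (partial thromboplastin time)", "ED2", 230.0, "nM"),
]


def endpoints_for(compound: str, data: list[ActivityRecord]) -> dict[str, float]:
    """Collect endpoint -> value for every record of one compound."""
    return {r.endpoint: r.value for r in data if r.compound == compound}
```

Keeping the assay description separate from the numeric endpoint lets one compound carry results from several assays, which is the relationship the slide illustrates.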
ChEMBL Target Types
• Protein, e.g. PDE5
• Protein complex, e.g. Nicotinic acetylcholine receptor
• Protein family, e.g. Muscarinic receptors
• Nucleic Acid, e.g. DNA
• Sub-cellular fraction, e.g. Mitochondria
• Tissue, e.g. Trachea
• Cell line, e.g. HEK293 cells
• Organism, e.g. Drosophila
Compound Searching
Spreadsheet Views
Ligand Efficiency
• Ligand efficiency is an objective measure of how much binding energy comes from each atom in a particular interaction
  – Drugs have high ligand efficiency
  – Every atom counts
  – Need to avoid affinity from lipophilicity
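The slides give no formula, so the sketch below uses a commonly quoted definition as an assumption: binding free energy dG = R*T*ln(Kd), divided by the number of heavy (non-hydrogen) atoms. Using Ki or IC50 in place of Kd is a further approximation, and `ligand_efficiency` is a hypothetical helper name.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def ligand_efficiency(kd_molar: float, heavy_atoms: int,
                      temperature_k: float = 298.15) -> float:
    """Binding free energy per heavy atom, in kcal/mol per atom.

    Assumes dG = R*T*ln(Kd) and LE = -dG / heavy_atoms.
    """
    if kd_molar <= 0 or heavy_atoms <= 0:
        raise ValueError("Kd and heavy atom count must be positive")
    delta_g = R_KCAL * temperature_k * math.log(kd_molar)  # negative for Kd < 1 M
    return -delta_g / heavy_atoms
```

For a hypothetical 10 nM binder with 25 heavy atoms this gives roughly 0.44 kcal/mol per heavy atom; the same affinity spread over 50 atoms gives half that, which is the sense in which every atom counts.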
Target Class Data
Assay Organism Data
Allosteric Regulators
• Allosteric drugs can have some advantages over orthosteric drugs
  – Selectivity
  – Orthosteric site may be undruggable
Allosteric/Orthosteric sites for GPCRs
http://www.chemblog.org