Data Mining & Molecular Modeling
Pavel Pospisil
Department of Radiology, Experimental Radionuclide Therapy
Harvard Medical School
To download this talk, see my personal website: type "Pavel Pospisil" in Google; it is the first hit.
Outlook
• Data Mining – why data mining?
• Bioinformatics
• Cheminformatics
• Molecular Modeling – why do we do models?
• Macromolecules
• Small molecules
• Examples of studies
• Let’s try it now!
[Figure: concept map spanning Data Mining and Molecular Modeling. Terms shown: Bioinformatics, Knowledge bases, Databases, Literature, Ligand-based drug design, Structure-based drug design, QSAR, Homology modeling, Pharmacophore, Cheminformatics, Chemical databases, Docking, Microarray, Chemical microarray, HTS]
Bioinformatics
“Bioinformatics derives knowledge from computer analysis of biological data.”
The data include information stored in the genetic code, but also experimental results from various sources, patient statistics, medical imaging, and scientific literature.
Bioinformatics finds genes or proteins involved in a particular disease and identifies novel therapeutic targets.
Bioinformatics creates data mining tools.
Cheminformatics
• Informatics of chemical databases
• Calculation of chemical properties
• QSAR – quantitative structure-activity relationship
• Chemical descriptors – describe molecules in 2-D and 3-D
• Ligand-based drug design
Example from Accelrys web page
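As a minimal sketch of "calculation of chemical properties", the simplest descriptor, molecular weight, can be computed from a molecular formula. The function names, the small table of atomic weights, and the restriction to formulas without parentheses are assumptions made for this illustration; they do not come from any particular cheminformatics package.

```python
import re

# Standard atomic weights (g/mol) for a few common elements; extend as needed.
ATOMIC_WEIGHTS = {
    "H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
    "P": 30.974, "S": 32.06, "I": 126.904,
}


def molecular_weight(formula: str) -> float:
    """Sum atomic weights for a simple formula such as 'C9H8O4' (no parentheses)."""
    total = 0.0
    # Each match is (element symbol, optional count), e.g. ('C', '9') or ('O', '').
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHTS[symbol] * (int(count) if count else 1)
    return total


def is_small_molecule(formula: str, limit: float = 500.0) -> bool:
    """Apply a molecular-weight cutoff (default 500 g/mol) for small molecules."""
    return molecular_weight(formula) <= limit


aspirin_mw = molecular_weight("C9H8O4")  # aspirin
water_mw = molecular_weight("H2O")
```

Real descriptor calculations (logP, polar surface area, 3-D descriptors) need a molecular structure, not just a formula, so this sketch only covers the simplest case.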
Data Mining
• Knowledge and pathway databases
• Text of scientific literature
• Gene/protein databases
• Chemical and drug databases
• High-throughput experiments: microarrays
Text Mining
PubMed
• 16 M articles
• 4000 full-access articles
• 5000 journals
• 52% non-US journals
ISI Web of Knowledge
• Access, analyze, and manage research literature
• More than PubMed! All science
• Cross-product searching
• Links to full text
• Personal journal lists
• Personal bibliographic management
• Cool: your personal impact factor, based on where YOU are cited.
Visit the web pages through the Countway Library with your Harvard PIN.
NCBI GenBank and Entrez Gene
Entrez Gene: “~all genes”
Conflict with American Chemical Society (ACS)
GenBank: “~all genes”
• NIH genetic sequence database
• 130 B nucleotides
• 66 M sequences
• 248 K species
• Billions of inquiries made yearly
www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene
www.ncbi.nlm.nih.gov/gquery
Entrez www.ncbi.nlm.nih.gov/gquery/gquery.fcgi
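The same Entrez searches can also be scripted. The sketch below only builds a query URL for NCBI's E-utilities `esearch` service; the E-utilities endpoint is not mentioned on the slides and is brought in here as an assumption, and the helper name `esearch_url` and the example search terms are hypothetical.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities search service (assumption: not on the slides).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an esearch URL for an Entrez database such as 'gene' or 'pubmed'."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}?{urlencode(params)}"


# Example queries: the same term against the gene and literature databases.
gene_url = esearch_url("gene", "prostatic acid phosphatase")
pubmed_url = esearch_url("pubmed", "prostatic acid phosphatase")
print(gene_url)
print(pubmed_url)
```

Opening either URL in a browser, or fetching it with any HTTP client, returns the matching record identifiers; the sketch stops at URL construction so that it runs without network access.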
Protein Databases
UniProt: all proteins
• UniProt is a centralized resource for protein sequences and functional information.
• UniProt was created by joining together the information from Swiss-Prot, TrEMBL and PIR.
• Over 3 M proteins.
• www.uniprot.org
PDB: all proteins with a resolved 3-dimensional structure
• Over 36 K proteins (and other macromolecules).
• Also modeled proteins.
• www.rcsb.org
PubChem or SciFinder
PubChem: “~all open-access chemicals”
• Small molecules with biological activities.
• Started with 500 K molecules of NCI (National Cancer Institute) and 350 K toxicology molecules of NLM (National Library of Medicine).
• Today 8 M molecules contributed by more than 20 commercial and scientific organizations.
SciFinder
• World's largest collection of biochemical, chemical, medical, and other related information.
• Scientific information in journals and patent literature from around the world.
• 12 M single- and multi-step reactions
• 1.5 B predicted and experimental properties
• Original source and final authority for CAS Registry Numbers®
• All patent records
pubchem.ncbi.nlm.nih.gov
Free access.
Download through www.chem.harvard.edu/library/databases.php
ChemBank and DiscoveryGate
DiscoveryGate: all chemical compounds; also contains
• Beilstein – since the 18th century
• Gmelin – inorganic and organometallic compounds
• ACD – commercially available chemicals
• Access to compounds and related data, reactions, original journal articles and patents
www.discoverygate.com
Harvard access free!
ChemBank
• A public, web-based informatics environment created by the Broad Institute and funded in large part by NCI
• Stores cell measurements derived from cell lines treated with small molecules
• Small-molecule screens
http://chembank.broad.harvard.edu
Harvard access free.
The Gene Ontology
All descriptions
A controlled vocabulary to describe gene and gene product attributes in any organism, giving consistent descriptions of gene products in different databases.
The building blocks of the Gene Ontology are the terms, e.g. cell, fibroblast, growth factor receptor binding, or signal transduction.
The three organizing principles of GO are
• cellular component
• biological process
• molecular function
www.geneontology.org
Microarrays
Oncomine: all cancer microarrays
• 1125 cancer microarray studies
• Over 18 K microarrays
• Over 580 M data points
• 39 cancer types
Stanford Microarray Database: all cancer microarrays
• Over 65 K microarray experiments
• Over 11 K publicly available
• 50 organisms
Partly free access.
www.oncomine.org
Free access.
http://genome-www5.stanford.edu/index.shtml
Knowledge Bases
Ingenuity
• World’s largest ontology
• 20000 genes
• 1 million pathway interactions retrieved from 36 full-text, curated, peer-reviewed journals
• http://www.ingenuity.com
LSGraph
• PubMed-based search based on combined keywords and retrieval of the most cited proteins
• Functional neighbors
• Link to gene/protein databases and Gene Ontology
• http://www.it-omics.com
Example of the Ingenuity Network
[Figure: network laid out by cellular location: Extracellular Space, Plasma Membrane, Cytoplasm, Nucleus]
iHOP – Information Hyperlinked over Proteins
A good way to find everything about a gene or protein
www.ihop-net.org
Example: data mining in the most serious cancers
LSGraph: abstracts with proteins related to the extracellular OR membrane environment; all cited proteins retrieved
Values shown on the slide (per-cancer assignment lost in extraction): 8097, 1602, 5457, 10628, 4222, 12226, 2105, 14680, 1068, 1956, 2771, 1974
Ingenuity:

                                          Prostate  Breast  Lung  Colon  Ovarian  Pancreas
Proteins in cancer and extracell. space      104      147   185    124     159       140
Enzymes                                       23       54    71     41      38        47

Phosphatases: 3, 4, 8, 3, 5 and 4 (column assignment lost in extraction)
Bold: nr. of abstracts with protein relations; normal: nr. of genes or proteins (entities)
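A quick way to read such counts is to ask what share of the extracellular cancer-related proteins are enzymes. The sketch below assumes that the values in the "Proteins in cancer and extracell. space" and "Enzymes" rows follow the column order Prostate, Breast, Lung, Colon, Ovarian, Pancreas; that ordering is an assumption of this illustration.

```python
# Counts taken from the slide; the column order is an assumption (see above).
cancers = ["Prostate", "Breast", "Lung", "Colon", "Ovarian", "Pancreas"]
proteins = [104, 147, 185, 124, 159, 140]   # proteins in cancer and extracell. space
enzymes = [23, 54, 71, 41, 38, 47]          # of which enzymes

# Fraction of enzymes among the retrieved proteins, per cancer type.
enzyme_fraction = {
    cancer: n_enzymes / n_proteins
    for cancer, n_proteins, n_enzymes in zip(cancers, proteins, enzymes)
}

for cancer, fraction in enzyme_fraction.items():
    print(f"{cancer:<9}{fraction:6.1%}")
```

Under that ordering, lung cancer has the largest enzyme share and prostate cancer the smallest.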
From bioinformatics to modeling
Once data mining identifies a suitable therapeutic target, e.g. a protein:
The target structure is studied or modeled – macromolecule modeling
Small organic molecules (ligands) are designed in order to bind to the protein – drug design, small molecule modeling
[Figure: target structure + small-molecule ligand]
Rational drug design
Molecular Modeling
• Visualization of molecules – molecular models
  Small molecules (Mw ≤ 500 g/mol)
  Large molecules – polynucleotides (genes), polypeptides (proteins)
• Calculation of molecular physicochemical properties
• Minimization and dynamic simulation = optimization of conformations
• Interactions between molecules = docking
• Virtual screening = automatic docking of chemical libraries
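To make "minimization = optimization of conformations" concrete, here is a toy sketch: steepest-descent minimization of a single distance between two atoms interacting through a Lennard-Jones potential. The potential, the reduced units (ε = σ = 1), the step size and the function names are all assumptions chosen for illustration; real modeling packages minimize full force fields over all atomic coordinates.

```python
def lj_energy(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones pair energy V(r) = 4*eps*[(sigma/r)^12 - (sigma/r)^6]."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def lj_gradient(r, epsilon=1.0, sigma=1.0):
    """Derivative dV/dr of the Lennard-Jones energy."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (-12.0 * sr6 * sr6 + 6.0 * sr6) / r


def minimize(r_start, step=0.01, tol=1e-10, max_iter=10000):
    """Steepest descent: move against the gradient until it (nearly) vanishes."""
    r = r_start
    for _ in range(max_iter):
        gradient = lj_gradient(r)
        if abs(gradient) < tol:
            break
        r -= step * gradient
    return r


r_min = minimize(1.5)          # start from a stretched geometry
e_min = lj_energy(r_min)
print(f"optimized distance: {r_min:.4f} sigma, energy: {e_min:.4f} epsilon")
```

The analytical minimum lies at r = 2^(1/6) σ with energy −ε, which gives a simple check that the optimizer converged.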
What do we need?
Visualization and simple modeling
Creating molecules, calculation of basic properties
• ChemDraw (2D), Chem3D – www.cambridgesoft.com
• ViewerPro Lite – accelrys.com
• Java web applications – www.molinspiration.com
[Figure: water molecule drawn in 2D and 3D]
Charges
  O -0.366 [O(2)]
  H  0.183 [H(1)]
  H  0.183 [H(3)]
Example of the file format: .pdb (PDB format)
  HETATM 1 O * 1 0.003 -0.007 0.005 0.00 0.00 O
  HETATM 2 H * 1 0.325 0.449 0.794 0.00 0.00 H
  HETATM 3 H * 1 -0.964 -0.007 0.005 0.00 0.00 H
  CONECT 1 3 2
  END
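The coordinates in such a file are enough to compute basic geometry. The sketch below reads the three HETATM records of the water example and derives the O-H bond lengths and the H-O-H angle. It splits records on whitespace, which is an assumption that works for this small example; real PDB files are fixed-column and should be parsed by column position. The function names are hypothetical.

```python
import math

# The water example, one record per line.
PDB_TEXT = """\
HETATM 1 O * 1 0.003 -0.007 0.005 0.00 0.00 O
HETATM 2 H * 1 0.325 0.449 0.794 0.00 0.00 H
HETATM 3 H * 1 -0.964 -0.007 0.005 0.00 0.00 H
CONECT 1 3 2
END
"""


def read_atoms(text):
    """Return {serial: (atom name, (x, y, z))} from whitespace-separated HETATM records."""
    atoms = {}
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "HETATM":
            serial = int(fields[1])
            xyz = tuple(float(value) for value in fields[5:8])
            atoms[serial] = (fields[2], xyz)
    return atoms


def angle_deg(center, a, b):
    """Angle a-center-b in degrees."""
    va = [p - c for p, c in zip(a, center)]
    vb = [p - c for p, c in zip(b, center)]
    dot = sum(x * y for x, y in zip(va, vb))
    return math.degrees(math.acos(dot / (math.hypot(*va) * math.hypot(*vb))))


atoms = read_atoms(PDB_TEXT)
oxygen, h1, h2 = atoms[1][1], atoms[2][1], atoms[3][1]
bond_1 = math.dist(oxygen, h1)             # O-H bond lengths in angstroms
bond_2 = math.dist(oxygen, h2)
hoh_angle = angle_deg(oxygen, h1, h2)      # H-O-H angle in degrees
print(f"O-H: {bond_1:.3f} and {bond_2:.3f} A, H-O-H: {hoh_angle:.1f} deg")
```

Both bond lengths come out close to 0.97 Å and the angle close to 109.5°, i.e. an idealized model geometry for water.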
Homology modeling and Docking program examples
Homology modeling and sequence alignment
• Swiss-Prot – retrieves sequences
• Swiss-Model – creates models based on sequence – http://swissmodel.expasy.org
• Modeller – already contains models – salilab.org/modeller
• GPCRdb – contains GPCR models – gpcr.org
Free macromolecule modeling and visualization programs
• Swiss DeepView – http://ca.expasy.org/spdbv/
• Chimera – http://www.cgl.ucsf.edu/chimera/
Docking of small molecules to macromolecules
• AutoDock 3.0 – free for academia – autodock.scripps.edu
• Dock 6.0 (Kuntz lab) – free for academia – http://dock.compbio.ucsf.edu/DOCK_6/index.htm
• ArgusLab 4.0 docking tool – free for academia – www.planaria-software.com/arguslab40.htm
Large modeling software
• MOE - chemcomp.com
• Tripos – tripos.com
• Accelrys – accelrys.com
• Schrödinger – schrodinger.com
• OpenEye - eyesopen.com
Rational Drug Design
• Receptor-based design – molecular complementarity
• Ligand-based design – molecular overlapping
• Pharmacophore-based design – molecular mimicry
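[Figure: ligand in solution (free rotation) and receptor (bound water, loosely associated water molecules) forming the receptor-ligand complex and free water. Labeled contributions: ∆Srt, ∆Sint, ∆Svib, ∆SW, ∆HLW, ∆HRW, ∆HLR]
Predicting binding affinities
$\Delta G_{binding} = f(\text{Interactions})$
$\Delta G_{binding} = RT \ln K_D$
where $\Delta G_{binding}$ is the binding free energy, $R$ the gas constant, $T$ the temperature, and $K_D$ the equilibrium dissociation constant.
$\Delta G = \Delta H - T\Delta S$
where $\Delta G$ is the free energy, $\Delta H$ the enthalpy, and $\Delta S$ the entropy.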
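The two relations above are easy to evaluate numerically. This sketch converts between a dissociation constant and a binding free energy, taking the logarithm as the natural logarithm, and evaluates ΔG = ΔH − TΔS. The function names, the choice of kcal/mol units and the default temperature of 298.15 K are assumptions of this illustration.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def binding_free_energy(kd_molar, temperature=298.15):
    """Delta G = R*T*ln(K_D), with K_D in mol/L relative to a 1 M standard state."""
    return R_KCAL * temperature * math.log(kd_molar)


def dissociation_constant(delta_g, temperature=298.15):
    """Inverse relation: K_D = exp(Delta G / (R*T))."""
    return math.exp(delta_g / (R_KCAL * temperature))


def gibbs_free_energy(delta_h, delta_s, temperature=298.15):
    """Delta G = Delta H - T * Delta S (Delta S in kcal/(mol*K))."""
    return delta_h - temperature * delta_s


dg_nanomolar = binding_free_energy(1e-9)            # a 1 nM binder
kd_roundtrip = dissociation_constant(dg_nanomolar)  # recovers 1e-9
print(f"1 nM corresponds to {dg_nanomolar:.2f} kcal/mol")
```

At room temperature a nanomolar ligand corresponds to roughly −12.3 kcal/mol, and every factor of ten in K_D changes ΔG by about 1.4 kcal/mol.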
Case Study
EMCIT concept: Enzyme Mediated Cancer Imaging and Therapy
• By data mining, identified extracellular hydrolases overexpressed by tumor cells
[Figure: overexpression of phosphatase by cancer cells (cancer cell versus healthy cell). The water-soluble, non-fluorescent prodrug *IQ2-P is converted to the water-insoluble, fluorescent drug *IQ2-OH. *I = 123I/125I/127I/131I]
Results of PAP-IQ2-P Docking
[Figure: docked ligands in the PAP active site; labeled residues: Arg11, His12, Arg15, Arg79, His257, Asp258]
Docking using AutoDock 3.0: docking of a flexible ligand into the rigid active site of the target; genetic algorithm
PAP-IQ2-P: ∆G = -13.39 kcal/mol
PAP-BABPA: ∆G = -12.35 kcal/mol
Pospisil et al., Cancer Research, in press for March 2007
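To put the two docking energies in perspective, the difference between them can be turned into a ratio of predicted dissociation constants. This is only a sketch: it treats the docking scores as if they were true binding free energies and assumes a temperature of 298.15 K, neither of which is stated on the slide. Docking scores are rough estimates, so the ratio is an order-of-magnitude guide at best.

```python
import math

R_KCAL = 1.987204e-3   # gas constant in kcal/(mol*K)
TEMPERATURE = 298.15   # assumed temperature in K

# Docking energies from the slide, in kcal/mol.
docking_energy = {"PAP-IQ2-P": -13.39, "PAP-BABPA": -12.35}

# Difference in predicted binding free energy between the two ligands.
delta_delta_g = docking_energy["PAP-IQ2-P"] - docking_energy["PAP-BABPA"]

# Ratio K_D(BABPA) / K_D(IQ2-P) implied by that difference.
affinity_ratio = math.exp(-delta_delta_g / (R_KCAL * TEMPERATURE))

print(f"ddG = {delta_delta_g:.2f} kcal/mol, predicted affinity ratio = {affinity_ratio:.1f}")
```

Under these assumptions the 1.04 kcal/mol difference corresponds to IQ2-P being predicted to bind PAP roughly six times more tightly than BABPA.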
IQ2-P Hydrolysis In Vitro by PAP
[Figure: two plots of signal (cps) versus time (min, 0:00 to 16:00) showing the 125IQ2-P and 125IQ2-OH peaks, with PAP at 0.0001 Unit/µl and at 1 Unit/µl. Further labels: PAP, LNCaP, 22Rv1, HMEC]
Pospisil et al., Cancer Research, in press for March 2007
Conclusion
• Bioinformatics can link resources and reveal known/unknown information about genes and proteins, and their relationships to biological functions and diseases.
• Data mining can identify the therapeutically interesting targets present in the huge corpus of knowledge.
• Molecular modeling allows exploration of target candidates.
• Docking places ligands in the target active site.
• QSAR relates compounds' activities to their structures.
Let’s exercise! Let’s try it.
• Think about a protein or its gene, think about its function, cellular process... Say its name... Can we find it?
• Where is the protein expressed?
• Is the 3D structure known?
• Does the protein bind a ligand, inhibitor or substrate?
• Can we draw the ligand?
• Docking or QSAR?