Chloé-Agathe Azencott: Data Mining in Bioinformatics, Page 1 Data Mining in Bioinformatics Day 9: Graph Mining in Chemoinformatics Chloé-Agathe Azencott & Karsten Borgwardt February 18 to March 1, 2013 Machine Learning & Computational Biology Research Group Max Planck Institutes Tübingen and Eberhard Karls Universität Tübingen
Drug discovery
Modern therapeutic research: from serendipity to rationalized drug design
Ancient Greeks treat infections with mould
[Figures: chemical structure diagram; Biapenem in PBP-1A]
Drug discovery process
1. Find a target
2. Identify hits
3. Hit-to-lead: characterize hits
4. Lead optimization and synthesis
5. Assay
Target: protein that we want to inhibit so as to interfere with a biological process
Timeline: 52 months, 90 months
Cost: $500,000,000 to $2,000,000,000
Chemoinformatics
How can computer science help? → Chemoinformatics!
“...the mixing of information resources to transform data into information, and information into knowledge, for the intended purpose of making better decisions faster in the arena of drug lead identification and optimisation.” – F. K. Brown
“... the application of informatics methods to solve chemical problems.” – J. Gasteiger and T. Engel
The chemical space
$10^{60}$ possible small organic molecules
$10^{22}$ stars in the observable universe
(Slide courtesy of Matthew A. Kayala)
Similar Property Principle
Molecules having similar structures should exhibit similar activities.
→ Structure-based representations
Compare molecules by comparing substructures
Molecular graph
Undirected labeled graph
[Figure: molecular graph with atom-labeled vertices (C, N, O, S) and bond-labeled edges]
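A minimal sketch of how such an undirected labeled graph could be stored, assuming atoms are indexed from 0 and each bond carries a one-letter label (`"s"` for single, `"d"` for double). The function name `build_molecular_graph` and the small example molecule are hypothetical, chosen for illustration only.

```python
from collections import defaultdict


def build_molecular_graph(atoms, bonds):
    """Return an adjacency map {atom_index: [(neighbor_index, bond_label), ...]}.

    atoms: list of element symbols, one per vertex.
    bonds: list of (i, j, label) triples, one per edge.
    """
    adjacency = defaultdict(list)
    for i, j, label in bonds:
        # The graph is undirected, so each bond is stored in both directions.
        adjacency[i].append((j, label))
        adjacency[j].append((i, label))
    return {i: sorted(adjacency[i]) for i in range(len(atoms))}


# Hypothetical example: heavy atoms of acetamide, CH3-C(=O)-NH2.
atoms = ["C", "C", "O", "N"]
bonds = [(0, 1, "s"), (1, 2, "d"), (1, 3, "s")]
graph = build_molecular_graph(atoms, bonds)
print(graph)
```

Vertex labels stay in `atoms` and edge labels live in the adjacency lists, which is enough to enumerate substructures such as labeled paths.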
Fingerprints
Define feature vectors that record the presence/absence (or number of occurrences) of particular patterns in a given molecular graph:
$$\phi(A) = (\phi_s(A))_{s \text{ substructure}}, \quad \text{where } \phi_s(A) = \begin{cases} 1 & \text{if } s \text{ occurs in } A \\ 0 & \text{otherwise} \end{cases}$$
Extension of traditional chemical fingerprints
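The definition of a fingerprint can be sketched as follows, assuming the patterns occurring in a molecule have already been extracted as strings. The pattern vocabulary and the helper name `fingerprint` are hypothetical.

```python
from collections import Counter


def fingerprint(patterns_in_molecule, vocabulary, binary=True):
    """Map a list of pattern occurrences to a vector indexed by `vocabulary`.

    binary=True  gives presence/absence (phi_s(A) in {0, 1}).
    binary=False gives the number of occurrences of each pattern.
    """
    counts = Counter(patterns_in_molecule)
    if binary:
        return [1 if counts[s] > 0 else 0 for s in vocabulary]
    return [counts[s] for s in vocabulary]


# Hypothetical vocabulary of substructures and occurrences in one molecule.
vocabulary = ["CsC", "CdO", "CsN", "CsS"]
occurrences = ["CsC", "CdO", "CsN", "CsC"]

binary_fp = fingerprint(occurrences, vocabulary)
count_fp = fingerprint(occurrences, vocabulary, binary=False)
print(binary_fp, count_fp)
```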
Learning from fingerprints
Classical machine learning and data mining techniques can be applied to these vectorial feature representations.
Any distance / kernel can be used
Classification
Feature selection
Clustering
Fingerprints compression
Systematic enumeration → long, sparse vectors
e.g. 50,000 random compounds from ChemDB
→ 300,000 paths of length up to 8
→ 300 non-zeros on average
“Naive” compression
List the positions of the 1s: $2^{19} = 524{,}288 \geq 300{,}000$, so 19 bits encode one position
average encoding: 300 × 19 = 5,700 bits
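The arithmetic behind the naive encoding can be checked with a short sketch. The function name `naive_encoding_bits` is hypothetical; the numbers are those quoted above.

```python
def naive_encoding_bits(vector_length, nonzeros):
    """Bits needed to list the positions of the 1s in a sparse binary vector.

    Each position is an index in [0, vector_length), which needs
    ceil(log2(vector_length)) bits; (n - 1).bit_length() computes that exactly.
    """
    bits_per_position = max(1, (vector_length - 1).bit_length())
    return bits_per_position, nonzeros * bits_per_position


bits_per_position, total_bits = naive_encoding_bits(300_000, 300)
print(bits_per_position, total_bits)  # 19 bits per position, 5700 bits in total
```

By comparison, storing the uncompressed binary vector would take 300,000 bits per molecule.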
MOLFEA [Helma et al., 2004]
P = positive (mutagenic) compounds
N = negative compounds
features: fragments (= patterns) f such that both freq(f,P) ≥ t and freq(f,N) ≥ t
Limited to frequent linear patterns
ML algorithm: SVM with linear or quadratic kernel
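A rough sketch of the frequency filter, taking the criterion exactly as written above. It is not the MOLFEA implementation: it assumes each compound is already given as a set of fragments, and that freq is a relative frequency (fraction of compounds containing the fragment). The names `frequency` and `select_features` and the toy data are hypothetical.

```python
def frequency(fragment, compounds):
    """Fraction of compounds (each a set of fragments) that contain `fragment`."""
    if not compounds:
        return 0.0
    return sum(fragment in c for c in compounds) / len(compounds)


def select_features(positives, negatives, t):
    """Keep fragments f with freq(f, P) >= t and freq(f, N) >= t."""
    candidates = set().union(*positives, *negatives)
    return sorted(
        f for f in candidates
        if frequency(f, positives) >= t and frequency(f, negatives) >= t
    )


# Hypothetical toy data: two positive and two negative compounds.
positives = [{"CsC", "CdO"}, {"CsC", "NdO"}]
negatives = [{"CsC"}, {"CsC", "CdO"}]

features_50 = select_features(positives, negatives, t=0.5)
features_60 = select_features(positives, negatives, t=0.6)
print(features_50, features_60)
```

The selected fragments then index the fingerprint vectors fed to the SVM.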
Frequent patterns fingerprints
MOLFEA [Helma et al., 2004]
CPDB: Carcinogenic Potency DataBase
684 compounds classified in 341 mutagens and 343 non-mutagens according to Ames test on Salmonella
[Figure: Mutagenicity prediction [Helma et al., 2004]; cross-validated sensitivity (50 to 100) versus frequency threshold (1%, 3%, 5%, 10%), for the linear kernel and the quadratic kernel]
Spectrum kernels
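$$\phi(A) = (\phi_s(A))_{s \in S}$$
$$K_{\text{spectrum}}(A, A') = k(\phi(A), \phi(A'))$$
$k \in \mathbb{R}^{\mathbb{R}^{|S|} \times \mathbb{R}^{|S|}}$ (that is, $k : \mathbb{R}^{|S|} \times \mathbb{R}^{|S|} \to \mathbb{R}$) can be
Dot product (linear kernel)
RBF kernel
Tanimoto kernel: $k(A, B) = \frac{|A \cap B|}{|A \cup B|}$
MinMax kernel: $k(A, B) = \frac{\sum_{i=1}^{N} \min(A_i, B_i)}{\sum_{i=1}^{N} \max(A_i, B_i)}$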
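The Tanimoto and MinMax kernels can be sketched in a few lines. Here Tanimoto takes the sets of patterns present in each molecule and MinMax takes count vectors of equal length; returning 0.0 when both inputs are empty is a convention assumed for this sketch, and the example values are hypothetical.

```python
def tanimoto(a, b):
    """|A ∩ B| / |A ∪ B| for two collections of patterns."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def minmax(a, b):
    """sum_i min(a_i, b_i) / sum_i max(a_i, b_i) for two count vectors."""
    numerator = sum(min(x, y) for x, y in zip(a, b))
    denominator = sum(max(x, y) for x, y in zip(a, b))
    return numerator / denominator if denominator else 0.0


# Hypothetical molecules: pattern sets for Tanimoto, count vectors for MinMax.
t_value = tanimoto({"CsC", "CdO", "CsN"}, {"CsC", "CdO", "CsS"})
m_value = minmax([2, 1, 1, 0], [1, 1, 0, 3])
print(t_value, m_value)
```

On binary vectors the two coincide, since min and max then act as intersection and union.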
Tanimoto and MinMax
Both Tanimoto and MinMax are kernels.
Proof for Tanimoto: J. C. Gower, A general coefficient of similarity and some of its properties. Biometrics, 1971.
Proof for MinMax:
$$\text{MinMax}(x, y) = \frac{\langle \phi(x), \phi(y) \rangle}{\langle \phi(x), \phi(x) \rangle + \langle \phi(y), \phi(y) \rangle - \langle \phi(x), \phi(y) \rangle}$$
with $\phi(x)$ a binary vector of length (# patterns) × (max count $q$), where $\phi(x)_i = 1$ iff the pattern indexed by $\lfloor i/q \rfloor$ appears more than $i \bmod q$ times in $x$. MinMax on counts is therefore Tanimoto on the expanded binary vectors.
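The expansion used in the MinMax proof can be checked numerically. This sketch assumes count vectors whose entries never exceed q; the helper names `unary_expand`, `dot` and `minmax` and the example counts are hypothetical.

```python
def unary_expand(counts, q):
    """Binary vector of length len(counts) * q.

    Entry i is 1 iff pattern i // q appears more than i % q times,
    so a pattern with count c contributes exactly c ones.
    """
    return [1 if counts[i // q] > i % q else 0 for i in range(len(counts) * q)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def minmax(a, b):
    return sum(map(min, a, b)) / sum(map(max, a, b))


# Hypothetical count vectors over three patterns, with maximum count q = 3.
x, y, q = [2, 0, 3], [1, 1, 3], 3
px, py = unary_expand(x, q), unary_expand(y, q)

# Right-hand side of the identity: a Tanimoto coefficient on the expansions.
expanded = dot(px, py) / (dot(px, px) + dot(py, py) - dot(px, py))
print(minmax(x, y), expanded)
```

The inner product of the expansions counts, for each pattern, the thresholds both molecules exceed, which is the minimum of the two counts.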
All patterns fingerprints
Paths fingerprints
Labeled sub-paths (walks)
[Figure: molecular graph with highlighted sub-paths]
Some sub-paths of length 3: NsCsCsS, CsCsCdO
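A possible way to enumerate labeled sub-paths by depth-first search is sketched below. It assumes simple paths (no atom visited twice), measures length in bonds, and writes a path as alternating atom and bond labels; a path and its reverse are treated as the same pattern. The function name `labeled_paths` and the example molecule are hypothetical.

```python
def labeled_paths(atoms, bonds, max_length):
    """Set of labeled simple paths with 1..max_length bonds, e.g. 'CsCdO'."""
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j, label in bonds:
        neighbors[i].append((j, label))
        neighbors[j].append((i, label))

    found = set()

    def extend(visited, tokens):
        n_bonds = len(tokens) // 2  # tokens alternate atom, bond, atom, ...
        if n_bonds >= 1:
            # Keep one canonical spelling for a path and its reverse.
            found.add("".join(min(tokens, tokens[::-1])))
        if n_bonds == max_length:
            return
        for nxt, label in neighbors[visited[-1]]:
            if nxt not in visited:
                extend(visited + [nxt], tokens + [label, atoms[nxt]])

    for start in range(len(atoms)):
        extend([start], [atoms[start]])
    return found


# Hypothetical chain N-C-C=O with single (s) and double (d) bonds.
atoms = ["N", "C", "C", "O"]
bonds = [(0, 1, "s"), (1, 2, "s"), (2, 3, "d")]
paths = labeled_paths(atoms, bonds, max_length=3)
print(sorted(paths))
```

Each distinct path string becomes one coordinate of the fingerprint.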
Inhibition of DHFR: ROC Curves [Azencott et al., 2007]
method    AUC
IRV       0.71
SVM       0.59
kNN       0.59
MAX-SIM   0.54
[Figure: ROC curves, TPR versus FPR, for RANDOM, IRV, SVM and MAXSIM]
Measuring performance
Precision-recall curves
$$\text{Precision} = \frac{\#\ \text{True Positives}}{\#\ \text{Predicted Positives}}$$
Recall = sensitivity
[Figure: precision (0 to 1) versus recall (0 to 1) for a perfect and a real classifier; ten points labeled 0.95, 0.94, 0.9, 0.81, 0.73, 0.52, 0.2, 0.17, 0.12, 0.09]
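The points of a precision-recall curve can be computed by ranking examples by decreasing score and predicting the top k as positive, for each k. This is a sketch on hypothetical scores and labels (not the data of the figure), and it ignores tied scores.

```python
def precision_recall_points(scores, labels):
    """Return (recall, precision) after each rank, highest score first.

    labels are 1 for positives and 0 for negatives.
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive example is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = []
    true_positives = 0
    for k, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        recall = true_positives / total_positives  # TP / all positives
        precision = true_positives / k             # TP / predicted positives
        points.append((recall, precision))
    return points


# Hypothetical classifier scores and true labels.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
points = precision_recall_points(scores, labels)
print(points)
```

A perfect classifier ranks every positive above every negative, so its precision stays at 1 until recall reaches 1.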
Other applications
Other applications of graph mining in chemoinformatics
Database indexing and search
Prediction of 3D structures of small compounds and proteins
Reaction prediction
References and further reading
[Azencott et al., 2007] Azencott, C.-A., Ksikes, A., Swamidass, S. J., Chen, J. H., Ralaivola, L. and Baldi, P. (2007). One- to four-dimensional kernels for virtual screening and the prediction of physical, chemical, and biological properties. Journal of chemical information and modeling 47, 965–974.
[Baldi et al., 2007] Baldi, P., Benz, R. W., Hirschberg, D. S. and Swamidass, S. J. (2007). Lossless compression of chemical fingerprints using integer entropy codes improves storage and retrieval. Journal of chemical information and modeling 47, 2098–2109.
[Ceroni et al., 2007] Ceroni, A., Costa, F. and Frasconi, P. (2007). Classification of small molecules by two- and three-dimensional decomposition kernels. Bioinformatics 23, 2038–2045.
[Helma et al., 2004] Helma, C., Cramer, T., Kramer, S. and De Raedt, L. (2004). Data mining and machine learning techniques for the identification of mutagenicity inducing substructures and structure activity relationships of noncongeneric compounds. Journal of chemical information and computer sciences 44, 1402–1411.
[Hinselmann et al., 2010] Hinselmann, G., Fechner, N., Jahn, A., Eckert, M. and Zell, A. (2010). Graph kernels for chemical compounds using topological and three-dimensional local atom pair environments. Neurocomputing 74, 219–229.
[Mahé et al., 2006] Mahé, P., Ralaivola, L., Stoven, V. and Vert, J.-P. (2006). The pharmacophore kernel for virtual screening with support vector machines. Journal of chemical information and modeling 46, 2003–2014.
[Menchetti et al., 2005] Menchetti, S., Costa, F. and Frasconi, P. (2005). Weighted Decomposition Kernels. In Proceedings of the 22nd International Conference on Machine Learning pp. 585–592, ACM, Bonn, Germany.
[Saigo et al., 2009] Saigo, H., Nowozin, S., Kadowaki, T., Kudo, T. and Tsuda, K. (2009). gBoost: a mathematical programming approach to graph classification and regression. Machine Learning 75, 69–89.
[Shervashidze et al., 2011] Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K. and Borgwardt, K. M. (2011). Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research 12, 2539–2561.
The end
Tomorrow: Project Presentations
By 9:45 AM on Friday, March 1, 2013, please submit the following by email to Prof. Borgwardt:
A short report on your project that gives your answers to the questions in Section 2 of your exercise sheet (you can ignore Section 1 here)