CZ3253: Computer Aided Drug Design
Drug Design Methods I: QSAR
Prof. Chen Yu Zong
Tel: 6874-6877
Email: [email protected]
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1, National University of Singapore
• Discern relationships between multiple variables (descriptors)
• Identify connections between structural traits (type of subunits, bond angles, local components) and descriptor values (e.g. activity, LogP, % denatured)
• Principal components are a set of vectors representing the variance in the original data
Principal components – reducing the dimensionality of a dataset
[Figure: scatter plot of y against x]
Clearly there is a relationship between x and y: a high correlation. We can define a new variable z = x + y such that we can express most of the variation in the data as the new variable z. This new variable is a principal component.
$$p_i = \sum_{j=1}^{v} c_{i,j}\, x_j$$
where $p_i$ is the $i$th principal component and $c_{i,j}$ is the coefficient of the variable $x_j$. There are $v$ such variables.
PCA is the transformation of a set of correlated variables to a set of orthogonal, uncorrelated variables called principal components. These new variables are linear combinations of the original variables, in decreasing order of importance.
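The idea can be checked numerically. The following is a minimal sketch, assuming synthetic data and NumPy's SVD as the way to obtain the components (neither is from the slides): two correlated variables x and y are transformed into two uncorrelated principal components, and the first one has nearly equal coefficients on x and y, i.e. it is proportional to x + y.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two strongly correlated variables (synthetic data, for illustration only).
x = rng.normal(size=200)
y = x + 0.1 * rng.normal(size=200)

# Rows are observations, columns are the variables x and y; centre each column.
data = np.column_stack([x, y])
centred = data - data.mean(axis=0)

# SVD of the centred data: each row of vt holds the coefficients c_ij of one PC.
_, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
coefficients = vt                      # shape (2, 2), ordered by decreasing variance
scores = centred @ coefficients.T      # value of each PC for every observation

# Fraction of the total variance carried by each principal component.
explained = singular_values**2 / np.sum(singular_values**2)
# Correlation between the two new variables (should be zero up to rounding).
score_correlation = np.corrcoef(scores, rowvar=False)[0, 1]

print("PC1 coefficients:", coefficients[0])
print("fraction of variance per PC:", explained)
print("correlation between PC1 and PC2 scores:", score_correlation)
```

The overall sign of each component is arbitrary, so the PC1 coefficients may come out both positive or both negative.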
$$Y_{ik} = \bar{Y}_i + \sum_{p=1} b_{ip}\, t_{pk} + r_{ik}$$
where $Y_{ik}$ is the data matrix and $b_{ip}$ are the loadings (a measure of the variation between variables).
– Graphing each object/molecule in space of 2 or more PCs
• # rows = # of objects/molecules
• # columns = # of descriptors OR # of molecules
For benzene, this corresponds to a graph in the PC1 (x’) and PC2 (y’) coordinate system
[Figure: data plotted on the x and y axes, with the principal component axes PC1 and PC2]
The PCs each maximise the variance in the data in orthogonal directions and are ordered by size.
Usually only a few components are needed to explain most (>90%) of the variance in the data; otherwise the properties are not relevant.
The first step is to calculate the variance-covariance matrix from the data.
Principal components
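If there are s observations, each of which contains v values, the data can be represented by a matrix D with s rows and v columns (one row per observation, one column per property).
With the columns of D mean-centred, the variance-covariance matrix is $Z = D^{T}D$ (up to a normalising factor of 1/(s-1)), a v × v matrix over the properties.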
The eigenvectors of Z are the principal components. Z is a square symmetric matrix so the eigenvectors are orthogonal. Usually the matrix is diagonalised to obtain the eigenvectors (the weightings for the properties) and eigenvalues (the explained variance).
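The procedure can be sketched in a few lines of NumPy. This is an illustrative sketch, not the lecture's own code: it assumes D is arranged with one row per observation and one column per property, mean-centres the columns, normalises by s - 1, and uses synthetic data. The function name `principal_components` is hypothetical.

```python
import numpy as np

def principal_components(D):
    """Eigenvalues and eigenvectors of the variance-covariance matrix of D.

    D is assumed to have one row per observation and one column per property.
    """
    centred = D - D.mean(axis=0)
    s = D.shape[0]
    Z = centred.T @ centred / (s - 1)               # v x v variance-covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(Z)   # eigh is meant for symmetric matrices
    order = np.argsort(eigenvalues)[::-1]           # largest explained variance first
    return eigenvalues[order], eigenvectors[:, order]

rng = np.random.default_rng(1)
# Synthetic data: 50 observations of 4 properties driven by 2 hidden factors.
hidden = rng.normal(size=(50, 2))
mixing = rng.normal(size=(2, 4))
D = hidden @ mixing + 0.05 * rng.normal(size=(50, 4))

eigenvalues, eigenvectors = principal_components(D)

# Cumulative fraction of variance explained, and the number of components
# needed to reach the 90% level.
explained = np.cumsum(eigenvalues) / eigenvalues.sum()
n_components = int(np.searchsorted(explained, 0.90) + 1)

print("explained variance (cumulative):", explained)
print("components needed for 90%:", n_components)
```

Each column of `eigenvectors` holds the weightings of the properties for one principal component, and the matching entry of `eigenvalues` is the variance that component explains.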
• Cross-Validation (used in PLS)
– Remove one or more pieces of input data
– Re-derive QSAR equation
– Calculate omitted data
– Compute root-mean-square error to evaluate efficacy of model
• Typically 20% of data is removed for each iteration
• The model with the lowest RMS error has the optimal number of components/descriptors
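The cross-validation loop above can be sketched as follows. This is a minimal illustration under stated assumptions: ordinary least-squares regression stands in for PLS, the data are synthetic, and the helper name `cross_validated_rmse` is hypothetical. Each iteration leaves out 20% of the data, re-derives the equation, predicts the omitted values, and the RMS error is compared across models with different numbers of descriptors.

```python
import numpy as np

def cross_validated_rmse(X, y, n_folds=5, seed=0):
    """RMS error of predictions for omitted data, leaving out 1/n_folds each time."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    squared_errors = []
    for omitted in folds:
        kept = np.setdiff1d(np.arange(len(y)), omitted)
        # Re-derive the linear QSAR equation (with intercept) without the omitted rows.
        A_kept = np.column_stack([np.ones(len(kept)), X[kept]])
        coeffs, *_ = np.linalg.lstsq(A_kept, y[kept], rcond=None)
        # Predict the omitted activities and record the squared errors.
        A_omitted = np.column_stack([np.ones(len(omitted)), X[omitted]])
        squared_errors.extend((A_omitted @ coeffs - y[omitted]) ** 2)
    return float(np.sqrt(np.mean(squared_errors)))

rng = np.random.default_rng(2)
# Synthetic data: 40 "molecules", 6 descriptors; activity depends on the first 2 only.
X = rng.normal(size=(40, 6))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=40)

# Compare models that use the first k descriptors, k = 1..6.
rmse_by_size = {k: cross_validated_rmse(X[:, :k], y) for k in range(1, 7)}
best_size = min(rmse_by_size, key=rmse_by_size.get)

print("cross-validated RMS error by number of descriptors:", rmse_by_size)
print("number of descriptors with lowest RMS error:", best_size)
```

Using the same random split for every model size keeps the comparison fair; in practice the split would usually be repeated with several seeds.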
QSPR Example
• Relation between musk odorant properties and benzenoid structure
– Training set of 148 compounds (81 non-musk and 67 musk)
– 47 chemical descriptors initially
– Pre-qualifications
  • Correlations (47-12=35)
– Post-qualifications
  • Bootstrapping
  • Test-set
    – 6/6 musks, 8/9 non-musks
Narvaez, J. N., Lavine, B. K. and Jurs, P. C. Chemical Senses, 11, 145-156 (1986)
Practical Issues
• Aim for about 10 times as many compounds as fitted parameters
• 3-5 compounds per descriptor
• Traditional QSAR
– Good for activity prediction
– Not good for determining whether activity is due to binding or transport
Advanced Methods
• Neural Networks
• Support Vector Machines
• Genetic/Evolutionary Algorithms
• Monte Carlo
• Alternate descriptors