CHAPTER - 2
QSAR METHODOLOGY
2.1 INTRODUCTION
Quantitative structure-activity relationships (QSAR) represent an attempt to correlate
structural or property descriptors of compounds with activities. These physicochemical
descriptors, which include parameters to account for hydrophobicity, topology, electronic
properties, and steric effects, are determined empirically or, more recently, by computational
methods. Activities used in QSAR include chemical measurements and biological assays.
QSAR currently are being applied in many disciplines, with many pertaining to drug design
and environmental risk assessment.
QSAR date back to the 19th century. In 1863, A.F.A. Cros at the University of Strasbourg
observed that toxicity of alcohols to mammals increased as the water solubility of the
alcohols decreased [1]. In the 1890s, Hans Horst Meyer of the University of Marburg and
Charles Ernest Overton of the University of Zurich, working independently, noted that the
toxicity of organic compounds depended on their lipophilicity [1, 2].
2.2 Linear Free Energy Relationships
Little additional development of QSAR occurred until the work of Louis Hammett (1894-
1987), who correlated electronic properties of organic acids and bases with their equilibrium
constants and reactivity. Consider the dissociation of benzoic acid:
Hammett observed that adding substituents to the aromatic ring of benzoic acid had an
orderly and quantitative effect on the dissociation constant. For example,
a nitro group in the meta position increases the dissociation constant, because the nitro group
is electron-withdrawing, thereby stabilizing the negative charge that develops. Consider now
the effect of a nitro group in the para position:
The equilibrium constant is even larger than for the nitro group in the meta position,
indicating even greater electron-withdrawal.
Now consider the case in which an ethyl group is in the para position:
In this case, the dissociation constant is lower than for the unsubstituted compound,
indicating that the ethyl group is electron-donating, thereby destabilizing the negative charge
that arises upon dissociation.
Hammett also observed that substituents have a similar effect on the dissociation of other
organic acids and bases. Consider the dissociation of phenylacetic acids:
Electron-withdrawal by the nitro group increases dissociation, with the effect being less for
the meta than for the para substituent, just as was observed with benzoic acid. The electron-
donating ethyl group decreases the equilibrium constant, as would be expected.
Data for these equilibria typically are graphed as illustrated below:
Figure 1: Example of a graph for a linear free energy relationship. K0 or K0' represent
equilibrium constants for unsubstituted compounds and K or K', for substituted compounds.
Values for the abscissa are calculated from the dissociation constants of unsubstituted and
substituted benzoic acid. Values for the ordinate are obtained from another organic acid or
base with identical patterns of substitution, in this case phenylacetic acid.
Because this relationship is linear, the following equation can be written:
$$\log \frac{K'}{K'_0} = \rho \log \frac{K}{K_0}$$
where ρ is the slope of the line. The values for the abscissa in Figure 1 are always those for
benzoic acid and are given the symbol σ. Therefore, we can write:
$$\log \frac{K'}{K'_0} = \rho\sigma$$
ρ, the slope of the line, is a proportionality constant pertaining to a given equilibrium. It
relates the effect of substituents on that equilibrium to the effect of those substituents on the
benzoic acid equilibrium. That is, if the effect of substituents is proportionally greater than on
the benzoic acid equilibrium, then ρ > 1; if the effect is less than on the benzoic acid
equilibrium, ρ < 1. By definition, ρ for benzoic acid is equal to 1.
σ is a descriptor of the substituents. The magnitude of σ gives the relative strength of the
electron-withdrawing or -donating properties of the substituents. σ is positive if the
substituent is electron-withdrawing and negative if it is electron-donating.
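The arithmetic behind these two constants can be sketched in a few lines of Python. This is only an illustration of the relations σ = log(K/K0) for benzoic acids and log(K'/K'0) = ρσ for a second equilibrium; the pKa values and the ρ value used here are hypothetical round numbers chosen for the example, not measured data, and the function names are introduced only for this sketch.

```python
def hammett_sigma(pka_unsubstituted: float, pka_substituted: float) -> float:
    """sigma = log10(K/K0) for benzoic acids.

    Because pKa = -log10(K), this equals pKa(unsubstituted) - pKa(substituted).
    """
    return pka_unsubstituted - pka_substituted


def predicted_log_ratio(rho: float, sigma: float) -> float:
    """log10(K'/K'0) = rho * sigma for another equilibrium."""
    return rho * sigma


# Hypothetical, illustrative values (assumptions, not experimental data).
PKA_BENZOIC = 4.20       # unsubstituted benzoic acid
PKA_SUBSTITUTED = 3.50   # benzoic acid carrying an electron-withdrawing group
RHO_OTHER = 0.5          # a second equilibrium that is less sensitive than benzoic acid

sigma = hammett_sigma(PKA_BENZOIC, PKA_SUBSTITUTED)
log_ratio = predicted_log_ratio(RHO_OTHER, sigma)
fold_change = 10 ** log_ratio

print(f"sigma = {sigma:.2f}")                  # positive: electron-withdrawing
print(f"log10(K'/K'0) = {log_ratio:.2f}")
print(f"K'/K'0 = {fold_change:.2f}")
```

A positive σ combined with ρ < 1 gives a smaller shift in the second equilibrium than in the benzoic acid equilibrium, which is the behaviour the text describes.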
These relationships as developed by Hammett are termed linear free energy relationships.
Recall the equation relating free energy to an equilibrium constant:
$$\Delta G^\circ = -RT \ln K = -2.303\,RT \log K$$
That is, the free energy is proportional to the logarithm of the equilibrium constant. These
linear free energy relationships are termed "extrathermodynamic". Although they can be
stated in terms of thermodynamic parameters, no thermodynamic principle states that the
relationships should be true.
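As a short sketch of why the term "linear free energy relationship" applies, the Hammett relation can be restated in terms of free energies. The derivation assumes only the standard relation between free energy and the equilibrium constant, with both equilibria at the same temperature; the subscript 0 marks the unsubstituted compound and the prime marks the second equilibrium, following the notation of Figure 1.

```latex
\begin{align*}
\Delta G^{\circ} &= -2.303\,RT \log K \\
\log\frac{K}{K_0} &= -\frac{\Delta G^{\circ} - \Delta G^{\circ}_{0}}{2.303\,RT} \\
\log\frac{K'}{K'_0} &= \rho \log\frac{K}{K_0} \\
\Delta G^{\circ\prime} - \Delta G^{\circ\prime}_{0} &= \rho\,\bigl(\Delta G^{\circ} - \Delta G^{\circ}_{0}\bigr)
\end{align*}
```

The last line follows by substituting the second line (and its primed analogue) into the third: the substituent-induced change in free energy for one equilibrium is a linear function of the change for the benzoic acid equilibrium.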
To develop a better understanding of these relationships, it is instructive to consider some
values of ρ and σ. Values of ρ are provided below:
In the aniline and phenol equilibria, the hydrogen ion that is dissociating is one atom removed
from the phenyl ring, whereas in the benzoic acid equilibrium it is two atoms removed. Thus,
substituents are able to exert a greater effect on the dissociation in aniline and phenol than in
benzoic acid and the value of ρ > 1. In phenylacetic and phenylpropionic acids, the hydrogen
ion dissociating is three and four atoms removed, respectively, from the phenyl ring.
Substituents are able to exert a lesser effect on the equilibrium than on the benzoic acid
equilibrium and ρ < 1.
Some illustrative values of σ for substituents in the meta and para positions are given below:
By definition, σ for hydrogen is 0. The positive values of σ for the nitro group indicate that it
is electron-withdrawing. In understanding the magnitudes of the σ values for the nitro group
in meta vs. para positions, consider the mechanisms of electron withdrawal or donation. For a
nitro group in the meta position, electron-withdrawal is due to an inductive effect produced
by the electronegativity of the constituent atoms. If only induction were operative, one would
expect the electron-withdrawing effect of a nitro group in the para position to be less than in
the meta position. The larger σ value for a para-substituted nitro group results from the
combination of both inductive and resonance effects. For chlorine, the electronegativity of the
atom produces an inductive electron-withdrawing effect, with the magnitude of the effect in
the para position being less than in the meta position. For chlorine, only the inductive effect is
possible. The methoxy group can be electron-donating or -withdrawing, depending on the
position of substitution. In the meta position, the electronegativity of the oxygen produces an
inductive electron-withdrawing effect. In the para position, only a small inductive effect
would be expected. Moreover, an electron-donating resonance effect occurs for the methoxy
group in the para position, giving an overall electron-donating effect. Tables of σ values for
numerous substituents have been published [3,4]. In some cases, the sigma values are
generally applicable to many different equilibria. In other cases, sigma values have been
derived for specific equilibria, which is particularly true when one considers sigma values for
ortho substituents.
2.3 Computer-Assisted Design
Computer-assisted drug design (CADD), also called computer-assisted molecular design
(CAMD), represents more recent applications of computers as tools in the drug design
process. In considering this topic, it is important to emphasize that computers cannot
substitute for a clear understanding of the system being studied. That is, a computer is only an
additional tool to gain better insight into the chemistry and biology of the problem at hand.
In most current applications of CADD, attempts are made to find a ligand (the putative drug)
that will interact favorably with a receptor that represents the target site. Binding of ligand to
the receptor may include hydrophobic, electrostatic, and hydrogen-bonding interactions. In
addition, solvation energies of the ligand and receptor site also are important because partial
to complete desolvation must occur prior to binding.
This approach to CADD optimizes the fit of a ligand in a receptor site. However, optimum fit
in a target site does not guarantee that the desired activity of the drug will be enhanced or that
undesired side effects will be diminished. Moreover, this approach does not consider the
pharmacokinetics of the drug.
The approach used in CADD is dependent upon the amount of information that is available
about the ligand and receptor. Ideally, one would have 3-dimensional structural information
for the receptor and the ligand-receptor complex from X-ray diffraction or NMR. The ideal is
seldom realized. In the opposite extreme, one may have no experimental data to assist in
building models of the ligand and receptor, in which case computational methods must be
applied without the constraints that the experimental data would provide.
Based on the information that is available, one can apply either ligand-based or receptor-
based molecular design methods. The ligand-based approach is applicable when the structure
of the receptor site is unknown, but when a series of compounds have been identified that
exert the activity of interest. To be used most effectively, one should have structurally similar
compounds with high activity, with no activity, and with a range of intermediate activities. In
recognition site mapping, an attempt is made to identify a pharmacophore, which is a
template derived from the structures of these compounds. It is represented as a collection of
functional groups in three-dimensional space that is complementary to the geometry of the
receptor site.
In applying this approach, conformational analysis will be required, the extent of which will
be dependent on the flexibility of the compounds under investigation. One strategy is to find
the lowest energy conformers of the most rigid compounds and superimpose them.
Conformational searching on the more flexible compounds is then done while applying
distance constraints derived from the structures of the more rigid compounds. Ultimately, all
of the structures are superimposed to generate the pharmacophore. This template may then be
used to develop new compounds with functional groups in the desired positions. In applying
this strategy, one must recognize that one is assuming that it is the minimum energy
conformers that will bind most favorably in the receptor site. In fact, there is no a priori
reason to exclude higher energy conformers as the source of activity.
The receptor-based approach to CADD applies when a reliable model of the receptor site is
available, as from X-ray diffraction, NMR, or homology modeling. With the availability of
the receptor site, the problem is to design ligands that will interact favorably at the site, which
is a docking problem.
2.4 Hansch Analysis
QSAR based on Hammett's relationship utilize electronic properties as the descriptors of
structures. Difficulties were encountered when investigators attempted to apply Hammett-
type relationships to biological systems, indicating that other structural descriptors were
necessary.
Robert Muir, a botanist at Pomona College, was studying the biological activity of
compounds that resembled indoleacetic acid and phenoxyacetic acid, which function as plant
growth regulators. In attempting to correlate the structures of the compounds with their
activities, he consulted his colleague in chemistry, Corwin Hansch. Using Hammett sigma
parameters to account for the electronic effect of substituents did not lead to meaningful
QSAR. However, Hansch recognized the importance of the lipophilicity, expressed as the
octanol-water partition coefficient, on biological activity [5]. We now recognize this
parameter to provide a measure of the bioavailability of compounds, which will determine, in
part, the amount of the compound that gets to the target site.
Relationships were developed to correlate a structural parameter (i.e., lipophilicity) with
activity. In some cases, a univariate relationship correlating structure and activity was
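adequate. The form of the equation is:
$$\log \frac{1}{C} = a \log P + b$$
where C is the molar concentration of compound that produces a standard response (e.g.,
LD50, ED50), P is the octanol-water partition coefficient, and a and b are fitted coefficients.
With other data, it was observed that correlations were improved by
combining Hammett's electronic parameters and Hansch's measure of lipophilicity using an
equation such as
$$\log \frac{1}{C} = a\pi + b\sigma + c$$
where σ is the Hammett substituent parameter and π is defined analogously to σ. That is,
$$\pi = \log P_X - \log P_H$$
where P_X and P_H are the partition coefficients of the substituted and unsubstituted compounds, respectively.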
In yet other cases, parabolic relationships between biological response and hydrophobicity
were observed that could be fit by including a (log P)^2 term in the QSAR. One
interpretation to account for this term is that many membranes must be traversed for
compounds to get to the target site, and those with greatest hydrophobicity will become
localized in the membranes they encounter initially. Thus, an optimum hydrophobicity may
be found in some test systems.
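The idea of an optimum hydrophobicity can be illustrated with a small Python sketch. It assumes the parabolic model has the form log(1/C) = a(log P)^2 + b(log P) + c, fits it by ordinary least squares, and locates the vertex of the parabola at log P = -b/(2a). The data are synthetic, generated from a known parabola purely for illustration, and the function name is hypothetical.

```python
import numpy as np


def fit_parabolic_hansch(log_p, log_inv_c):
    """Least-squares fit of log(1/C) = a*(log P)^2 + b*(log P) + c.

    Returns (a, b, c, optimum_log_p). The optimum is the vertex of the
    parabola and is only meaningful when a < 0 (a maximum exists).
    """
    a, b, c = np.polyfit(log_p, log_inv_c, deg=2)  # highest power first
    optimum_log_p = -b / (2.0 * a) if a < 0 else float("nan")
    return a, b, c, optimum_log_p


# Synthetic, noise-free data from a known parabola (an assumption for the demo).
log_p = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
log_inv_c = -0.25 * log_p**2 + 1.0 * log_p + 2.0

a, b, c, log_p_opt = fit_parabolic_hansch(log_p, log_inv_c)
print(f"a = {a:.3f}, b = {b:.3f}, c = {c:.3f}")
print(f"optimum log P = {log_p_opt:.2f}")
```

With real assay data the fit would not be exact, and the estimated optimum is only trustworthy when the compounds sampled actually bracket it.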
QSAR are now developed using a variety of parameters as descriptors of the structural
properties of molecules. Hammett sigma values are often used for electronic parameters, but
quantum mechanically derived electronic parameters also may be used. Other descriptors to
account for the shape, size, lipophilicity, polarizability, and other structural properties also
have been devised.
Quantitative Structure-Activity/Toxicity Relationship Studies (QSAR) are mathematical
models relating the biological activity measurements of a set of chemical compounds to the
variation in their chemical structures [6-9].
2.5 DEVELOPMENT OF QSAR
The steps involved in the development of a QSAR for a given type of activity and a given
type of chemical compounds are listed below:
(a) Selecting a “training set” of compounds from an almost infinite universe of
possible candidates;
(b) Synthesizing or otherwise getting hold of pure samples of these compounds;
(c) Obtaining appropriate biological activity measurements for the (n x M) matrix
Y, with n being the number of training set compounds and M being the number
of biological activity measurements;
(d) Translating the variation in structure among the training set compounds to the
variation of “structure descriptor variables”, i.e., achieving a relevant
quantitative description of the structural variation among the compounds. The
resulting data are denoted Xik for compound i (i = 1, 2, ..., n) and descriptor
variable k (k = 1, 2, ..., K). Together they form the (n x K) matrix X.
(e) Deriving a mathematical model connecting X and Y. This gives, among other
results, a measure of how well Y is modeled (e.g., explained variance), and
which variables in X are important in the model (e.g., regression
coefficients);
(f) Validating the model by having it predict the biological activity / toxicity for
several new compounds, followed by the actual biological testing of these
compounds and the comparison of predictions with outcomes;
(g) Interpreting the model by relating it to biological and chemical knowledge.
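Steps (c) to (f) can be sketched numerically as follows. The sketch assumes a plain multiple linear regression as the model connecting X and Y, which is only one of several possible choices; all data are random stand-ins generated for the example, and the coefficient matrix true_B is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, M = 12, 3, 2  # training compounds, descriptor variables, activity measurements

# Step (d): descriptor matrix X (n x K); random numbers stand in for real descriptors.
X = rng.normal(size=(n, K))

# Step (c): activity matrix Y (n x M); simulated from hypothetical coefficients plus noise.
true_B = np.array([[1.0, 0.0],
                   [0.5, -1.0],
                   [0.0, 2.0]])
Y = X @ true_B + 0.05 * rng.normal(size=(n, M))

# Step (e): least-squares model with an intercept column.
X1 = np.column_stack([np.ones(n), X])
B, *_ = np.linalg.lstsq(X1, Y, rcond=None)   # (K + 1) x M regression coefficients
Y_fit = X1 @ B

ss_res = ((Y - Y_fit) ** 2).sum(axis=0)
ss_tot = ((Y - Y.mean(axis=0)) ** 2).sum(axis=0)
explained_variance = 1.0 - ss_res / ss_tot   # one value per activity measurement

# Step (f): predictions for new compounds, to be compared with later measurements.
X_new = rng.normal(size=(4, K))
Y_pred = np.column_stack([np.ones(4), X_new]) @ B

print("regression coefficients:\n", B.round(2))
print("explained variance:", explained_variance.round(3))
```

The rows of B after the first show which descriptors carry weight in the model; the comparison of Y_pred with newly measured activities is what step (f) calls validation.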
2.6 MODEL DEVELOPMENT AND STATISTICAL DESIGNS
It is an accepted fact that no chemical or biological measurement is perfectly accurate.
The value of any property or activity varies a little (or much) when measured repeatedly,
even under conditions that are as similar as possible. Therefore, scientific models based on
measured data, including QSARs, are necessarily statistical in nature. These models separate the
variability in the data into two parts, the systematic and the random. For the biological
measurements (BA) we can write the model as:
BA = F(X) + E (2.1)
Here, the systematic part, F(X), is the part of the biological data that is “explained” by
X. The random part, the residuals E, contains the measurement errors in Y plus the model
errors, i.e. the imperfection of the systematic model F(X). The model errors, in turn, have two
sources: the imperfect form or “shape” of the model F, and the incompleteness and possible
errors in the descriptor variables X [10-15].
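The split BA = F(X) + E can be made concrete with a minimal Python sketch. It assumes F is a straight line fitted by least squares with an intercept, in which case the total sum of squares divides exactly into a systematic and a random part; the data are synthetic and the variable names are chosen only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic example: one descriptor x and a noisy "measured" activity BA.
x = np.linspace(0.0, 4.0, 20)
ba = 1.5 * x + 0.5 + rng.normal(scale=0.3, size=x.size)

# Fit F(X) as a straight line with an intercept (an assumption of this sketch).
design = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(design, ba, rcond=None)

f_x = design @ coef   # systematic part F(X)
e = ba - f_x          # random part E (residuals)

ss_total = np.sum((ba - ba.mean()) ** 2)
ss_systematic = np.sum((f_x - ba.mean()) ** 2)
ss_random = np.sum(e ** 2)

print(f"total      = {ss_total:.3f}")
print(f"systematic = {ss_systematic:.3f}")
print(f"random     = {ss_random:.3f}")
print(f"explained fraction = {ss_systematic / ss_total:.3f}")
```

The residuals here lump together measurement noise and any misfit of the straight-line form, which mirrors the two sources of error named in the text.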
Since these structurally related properties of a chemical can presumably be
determined by experimental or computational means much more efficiently than its biological
activity using in vivo or in vitro approaches, a statistically validated QSAR model is capable
of predicting the biological activity of a new chemical within the same series in lieu of the
time-consuming and labour-intensive processes of chemical synthesis and biological
evaluation. Applied judiciously, QSAR can save a substantial amount of time, money and
manpower.
The Quantitative Structure-Activity Relationships (QSARs) are mathematical models
relating the activity measurements of chemical compounds to the variation in their chemical
structure. When properties or activities are related to the structures of chemical compounds,
the methodology is called quantitative structure-property relationships (QSPRs) or
quantitative structure-activity relationships (QSARs), respectively. Such methodologies are
being used successfully in the study of antihypertensive drugs to understand the adverse effects of
chemical compounds. These predictions are made because of the large number of untested
chemicals and because of the high costs of biological testing [16-24].
QSAR models are now regarded as a scientifically credible tool for predicting and
classifying the biological activities of untested chemicals. QSAR has become inexorably
embedded as an essential tool in the pharmaceutical industry, from lead discovery and
optimization to lead development [25, 26]. A growing trend is to use QSAR early in the drug
discovery process as a screening and enrichment tool to eliminate from further development
those chemicals lacking drug-like properties [26] or those chemicals predicted to elicit a toxic
response. The fundamental assumption of QSAR is that variations in the biological activity of
a series of chemicals that target a common mechanism of action are correlated with variations
in their structural, physical and chemical properties [27]. A good number of papers have
been reported in this field [28-34].
The molecular structure and parameters derived from molecular spectral data of
organic compounds acting as drugs can be combined to form powerful models of biological
activity. Such a data-activity relation is nowadays called Quantitative Structure-Data-Activity
Relationships (QSDARs) instead of QSAR.
The pioneer worker in this field, considered the father of QSAR, is Corwin Hansch,
who first introduced this methodology in 1962 [35]. For the past 47 years, the use
of QSAR has become increasingly helpful in understanding chemical and biological
interactions in the drug design process, pesticide research, environmental pollution,
toxicology, etc. QSAR methodology is very useful in elucidating the mechanism of
chemical-biological interactions, in particular enzyme action.
In environmental toxicology QSAR are used to understand the adverse effects of
chemicals, and to predict the biological effects of as yet untested compounds. These predictions
are made because of the large number of untested chemicals, and because of the high costs of
biological testing. It is worth mentioning that, although sometimes taken as a criterion,
prediction is not the primary goal of QSAR. If it results from interpolation, it is often trivial;
if extrapolation goes too far outside the included parameter space, it usually fails. QSAR
helps to understand structure-activity relationships in a quantitative manner and to find the
borders of certain properties, e.g. the optimum lipophilicity within a series of analogs or the
maximum size of a certain group in a stepwise procedure.
The strategy and philosophy of QSAR enables chemists, pharmacologists,
environmentalists and medicinal chemists to look at their structures in terms of
physicochemical properties instead of only considering certain pharmacophore groups in them.
Nevertheless, QSAR methods are still used to prove and to quantify the underlying
hypothesis regarding the dependence of biological activities on physicochemical interactions.
The predictive power of QSAR critically depends on the statistical quality of the data used to
develop the model and estimate its parameters.
Parameters which encode certain structural features and properties are needed to
correlate biological activities with chemical structures in a quantitative manner.
Physicochemical properties, which are directly related to the extra molecular force involved
in the drug-receptor interaction as
The development of a QSAR for a given type of biological activity and a given type
of chemical compounds involves the following steps:
(i) Selecting a “training set” of compounds from an almost infinite
universe of possible candidates;
(ii) Synthesizing or otherwise getting hold of pure samples of these compounds;
(iii) Obtaining appropriate biological activity-measurements from the training set
compounds. Together these data form the (n x M) matrix Y, n being the
number of training set-compounds, and M being number of biological activity
measurements;
(iv) Translating the variation in structure among the training set compounds to the
variation of “structure descriptor variables”. That is, achieving a relevant
quantitative description of the structural variation among the compounds. The
resulting data are denoted Xik for compound i (i = 1, 2, 3, ..., n) and descriptor
variable k (k = 1, 2, ..., K). Together they form the (n x K) matrix X.
(v) Deriving a mathematical model connecting X and Y. This gives, among other
results, a measure of how well Y is modeled and which
variables in X are important in the model (i.e. regression coefficients);
(vi) Validating the model by having it predict the biological activity for several
new compounds, followed by the actual biological testing of these compounds
and the comparison of predictions with outcomes.
(vii) Interpreting the model by relating it to biological and chemical knowledge.
It is interesting to mention that the history of QSAR dates back to the
19th century [32, 43, 44]. At that time QSAR was in the foreground and prediction
played only a minor role. Only a few parameters were used at that time. The
parameters used were log P, π, σ, MR and some steric parameters.
Now, quantitative prediction is made using quantum chemical, geometrical,
connectivity, electro-topological, WHIM and many other distance-based and
connectivity-type topological indices. Based on the origin of the molecular descriptors
used in calculations, QSAR methods can be divided into three groups as shown below:
QSAR Methods

1st Group: Based on a relatively small number of physico-chemical properties and parameters
describing hydrophobic, steric, electrostatic, etc. effects. Usually, these descriptors are used as
independent variables in multiple regression approaches [11]. These methods are referred to as
Hansch analysis [11].

2nd Group: Based on quantitative characteristics of molecular graphs, i.e. molecular topological
descriptors. These include molecular connectivity indices [36-39], molecular shape indices [40],
topological and electro-topological state indices [41,42], atom-pair descriptors [41], etc.
Sometimes topological descriptors are also combined with physico-chemical properties and/or
spectral parameters of the molecule. These methods are referred to as 2D QSAR [27].

3rd Group: Based on descriptors derived from a spatial (three-dimensional) representation of
the molecular structures. These methods are called (3D) QSAR. They require 3D alignment of
all molecules according to a pharmacophore model or based on docking to a ligand binding
site of a receptor. The descriptors used could be electrostatic, steric, hydrophobic, etc. field
values at grid points corresponding to the molecule.
2.7 MODEL VALIDATION AND PREDICTIVE POWER OF QSAR MODELS
In MLR (Multiple Linear Regression) QSAR analysis, more independent variables
result in a higher probability of a chance correlation between predicted and observed
activities [33-35]. This conclusion is true not only for MLR-QSAR, but also for any QSAR
approach when the number of variables (descriptors) is comparable to or higher than the
number of compounds in the data set. Consequently, model validation is one of the most
important aspects of QSAR analysis.
The best way to validate a model is to perform cross-validation, based on
leave-one-out (LOO) or leave-some-out (LSO) procedures [33-35]. The
outcome of this procedure is a set of cross-validated parameters: PRESS (Predicted Residual
Sum of Squares), SSY (Sum of the Squares of the response values), SPRESS (Uncertainty of
Prediction), R²cv (or Q²) (Overall Predictive Ability) and PSE (Predictive Square Error).
Frequently, R²cv (or Q²) is used as a criterion of both robustness and predictive ability of the
model. Many authors consider a high Q² (for instance, Q² > 0.5) as an indicator or even as the
ultimate proof of the high predictive power of the QSAR model [45-49]. They do not test the
models for their ability to predict the activity of compounds of an external test set (i.e.
compounds which have not been used in the QSAR model development) [50, 51]. Other
authors validate their models using only one or two compounds that were not used in QSAR
model development [50, 51] and still claim that their models are highly predictive. In contrast
with such expectations, it has been shown that when a test set with known values of biological
activities is available for prediction, there is no correlation between the LOO cross-validated
Q² and the correlation coefficient R² between the predicted and observed activities for the test set
[52, 53].
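The leave-one-out procedure described above can be sketched in Python as follows. The sketch assumes an ordinary least-squares model with an intercept, takes SSY as the sum of squared deviations of the responses from their mean, and uses Q² = 1 - PRESS/SSY; other parameters such as SPRESS and PSE are omitted because their exact definitions vary between authors. The data are synthetic and the function name is hypothetical.

```python
import numpy as np


def loo_cross_validation(X, y):
    """Leave-one-out cross-validation for a least-squares model with intercept.

    Returns (PRESS, SSY, Q2), with SSY taken about the mean of y
    (an assumption of this sketch).
    """
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i                  # leave compound i out
        coef, *_ = np.linalg.lstsq(X1[keep], y[keep], rcond=None)
        press += (y[i] - X1[i] @ coef) ** 2       # squared error of predicting compound i
    ssy = np.sum((y - y.mean()) ** 2)
    return press, ssy, 1.0 - press / ssy


# Synthetic data: 15 compounds, 2 descriptors, small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(15, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=15)

press, ssy, q2 = loo_cross_validation(X, y)
print(f"PRESS = {press:.3f}, SSY = {ssy:.3f}, Q2 = {q2:.3f}")
```

A high Q² on data like these says only that the model is internally consistent; as the text notes, it does not by itself establish performance on an external test set.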
Another widely used approach to establish model robustness is called y-randomization
(randomization of the response, i.e. the activities) [50,51]. It consists of repeating the
calculation procedure with randomized activities and subsequent probability assessment of
the resultant statistics. Frequently, it is used along with cross-validation. It is expected that
models obtained for the data set with randomized activity should have low values of Q².
However, sometimes models based on the randomized data have high Q² values due to
chance correlation or structural redundancy [50-53].
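A minimal y-randomization sketch in Python is given below. It assumes an ordinary least-squares model and computes the LOO Q² through the hat-matrix identity for least squares (the left-out residual equals the ordinary residual divided by 1 - h_ii), which avoids refitting the model n times. The number of trials, the seed and the empirical p-value formula are choices made for this illustration, and the data are synthetic.

```python
import numpy as np


def loo_q2(X, y):
    """LOO Q2 for least squares with intercept, via the hat-matrix identity.

    For ordinary least squares the leave-one-out residual of compound i is
    e_i / (1 - h_ii), where h_ii is the i-th diagonal element of the hat matrix.
    """
    X1 = np.column_stack([np.ones(len(y)), X])
    hat = X1 @ np.linalg.pinv(X1)
    residuals = y - hat @ y
    loo_residuals = residuals / (1.0 - np.diag(hat))
    press = np.sum(loo_residuals ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)


def y_randomization(X, y, n_trials=200, seed=0):
    """Compare the real Q2 with Q2 values obtained after shuffling the activities."""
    rng = np.random.default_rng(seed)
    q2_real = loo_q2(X, y)
    q2_random = np.array([loo_q2(X, rng.permutation(y)) for _ in range(n_trials)])
    # Empirical probability of a shuffled model doing at least as well as the real one.
    p_value = (np.sum(q2_random >= q2_real) + 1) / (n_trials + 1)
    return q2_real, q2_random, p_value


# Synthetic data: 20 compounds, 2 descriptors.
rng = np.random.default_rng(42)
X = rng.normal(size=(20, 2))
y = 1.5 * X[:, 0] - 0.8 * X[:, 1] + 0.1 * rng.normal(size=20)

q2_real, q2_random, p_value = y_randomization(X, y)
print(f"Q2 (real activities)       = {q2_real:.3f}")
print(f"Q2 (shuffled, mean / max)  = {q2_random.mean():.3f} / {q2_random.max():.3f}")
print(f"empirical p-value          = {p_value:.3f}")
```

If shuffled activities regularly give Q² values close to the real one, the original model is likely to reflect chance correlation rather than a structure-activity relationship.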
Some authors have suggested that the only way to estimate the true predictive power
of a QSAR model is to compare the predicted and observed activities of a sufficiently large
external test set of compounds that were not used in the model development [50-53].
To estimate the predictive power of a QSAR model one needs the following statistical
characteristics of the test model:
(1) Correlation coefficient R between the predicted and observed activities;
(2) Coefficients of determination for predicted versus observed activities (R²obs)
and observed versus predicted activities (R²cal), written R0² and R'0² in
condition (2.4);
(3) Slopes k and k' of the regression lines through the origin.
A QSAR model is predictive if the following conditions are satisfied [50]:
$$Q^2 > 0.5 \qquad (2.2)$$
$$R^2 > 0.6 \qquad (2.3)$$
$$\frac{R^2 - R_0^2}{R^2} < 0.1 \quad \text{or} \quad \frac{R^2 - R_0^{\prime 2}}{R^2} < 0.1 \qquad (2.4)$$
$$0.85 < k < 1.15 \quad \text{or} \quad 0.85 < k' < 1.15 \qquad (2.5)$$
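The four conditions can be checked with a short Python sketch. It rests on several assumptions that should be read as one possible interpretation rather than the definitive procedure: the second condition is taken to be R² > 0.6, R0² and R'0² are computed as 1 - SSres/SStot for the two regressions through the origin (observed on predicted, and predicted on observed), and Q² is supplied from a separate cross-validation of the training set. Function names and example data are hypothetical.

```python
import numpy as np


def origin_fit(target, predictor):
    """Slope and coefficient of determination for target ~ slope * predictor (no intercept)."""
    slope = np.sum(target * predictor) / np.sum(predictor ** 2)
    ss_res = np.sum((target - slope * predictor) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return slope, 1.0 - ss_res / ss_tot


def is_predictive(q2, y_obs, y_pred):
    """Evaluate conditions (2.2)-(2.5) for an external test set (assumed reading)."""
    y_obs = np.asarray(y_obs, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    r2 = np.corrcoef(y_obs, y_pred)[0, 1] ** 2
    k, r0_sq = origin_fit(y_obs, y_pred)              # observed versus predicted
    k_prime, r0_prime_sq = origin_fit(y_pred, y_obs)  # predicted versus observed
    checks = {
        "Q2 > 0.5": bool(q2 > 0.5),
        "R2 > 0.6": bool(r2 > 0.6),
        "(R2 - R0^2)/R2 < 0.1": bool((r2 - r0_sq) / r2 < 0.1
                                     or (r2 - r0_prime_sq) / r2 < 0.1),
        "0.85 < k < 1.15": bool(0.85 < k < 1.15 or 0.85 < k_prime < 1.15),
    }
    return all(checks.values()), checks


# Hypothetical test-set activities (illustrative numbers only).
y_obs = np.array([5.0, 5.5, 6.1, 6.8, 7.2, 7.9, 8.4])
y_good = np.array([5.1, 5.4, 6.3, 6.6, 7.3, 7.7, 8.5])  # close to the observed values
y_scaled = 0.5 * y_obs                                  # perfectly correlated, wrong scale

ok_good, checks_good = is_predictive(0.7, y_obs, y_good)
ok_scaled, checks_scaled = is_predictive(0.7, y_obs, y_scaled)
print(ok_good, checks_good)
print(ok_scaled, checks_scaled)
```

The second example shows why the slope condition matters: predictions that are perfectly correlated with the observations but systematically half their size have R² = 1 and still fail.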
2.8. QSAR MODELING
The QSAR modeling generally involves three steps:
(1) Collect or, if possible, design a training set of chemicals,
(2) Choose descriptors that can properly relate chemical structure to biological
activities, and
(3) Apply statistical methods that correlate changes in structure with changes in
biological activity.
Obtaining a good quality QSAR model with the ability to predict activity of a
chemical outside the training set depends upon many factors in the approach and execution of
each of the three steps mentioned above.
2.9. QUALITY OF DATA
Data should come from the same assay protocol, and care should be taken to avoid
inter-laboratory variability. Any bad data points will tend to corrupt the proper correlation of
structure and activity. Rules of thumb for a good QSAR data set are that the dose-response
relationship should be smooth, the potency (or affinity) should be reproducible, the activity
range should span two or more orders of magnitude from the least active to most active
chemicals in the series, the number of chemicals used to build the QSAR model should be
sufficiently large to ensure statistical stability, the activities of the chemicals should be evenly
distributed across the range of activity, and the chemicals selected for the training set should
possess enough structural diversity to span the range of chemical space associated with the
biological activity under study.
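One of these rules of thumb, the requirement that activities span two or more orders of magnitude, is easy to check programmatically. The following Python sketch is a minimal illustration; the function name and the example concentrations are hypothetical, and activities are assumed to be given as positive concentrations in consistent units.

```python
import math


def activity_span_log_units(activities):
    """Return the activity range in orders of magnitude, log10(max/min).

    Assumes all activities are positive concentrations in the same units.
    """
    values = [float(v) for v in activities]
    if not values or min(values) <= 0:
        raise ValueError("activities must be a non-empty list of positive values")
    return math.log10(max(values) / min(values))


# Hypothetical potencies in mol/L for a small training set.
potencies = [1e-8, 3e-8, 2e-7, 1e-6, 4e-6, 1e-5]
span = activity_span_log_units(potencies)
print(f"activity span = {span:.1f} log units")
print("meets the two-orders-of-magnitude rule of thumb:", span >= 2.0)
```

A check like this covers only the range criterion; reproducibility, even distribution of activities and structural diversity still need to be assessed separately.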
2.10. PHYSICO-CHEMICAL PARAMETERS IN QSAR
QSAR development involves the quantification of the structural variation in the
current structural class by a small number of design variables [6-9]. These design variables
may correspond to characteristic properties of the compounds, e.g. size, solubility,
lipophilicity, oxidation potential etc. In case such “principal properties” are not available,
they may be derived by a multivariate analysis [16,17] of a battery of measured or calculated
properties of parent compounds. If the structural variation in the class corresponds to
the variation of a number of substituents or other structural fragments, the principal properties
of these fragments, i.e. substituent scales [pi (π), sigma (σ), etc.], are suitable design variables.
Sometimes, physicochemical properties related to molecular structure are also used as
suitable design variables. Some such properties are: density (d), molecular weight (MW),