IN-SILICO STRUCTURE AND ANALOGUE BASED STUDIES ON BACE1 INHIBITORS FOR ALZHEIMER’S DISEASE
INDEX
Chapter No Title
1. Abstract
2. Aim of study
3. Introduction
3.1. Drug Designing
3.2. Protein
4. Materials and Methods
4.1. Structure Based Drug Design
4.2. De novo Ligand Design
4.3. Structure Based Pharmacophore Generation
4.4. Analogue Based Drug Design
5. Results and Discussion
5.1. Structure Based Drug Design
6. Analogue Based Drug Design
7. Conclusion
8. Abbreviations
9. References
1. ABSTRACT:
β-Secretase, also called BACE1 (β-site APP cleaving enzyme 1) or
memapsin-2, is an aspartic-acid protease important in the pathogenesis of
Alzheimer's disease and in the formation of myelin sheaths in peripheral nerve cells. This
transmembrane protein contains two active site aspartate residues in its extracellular
protein domain and may function as a dimer. BACE1 produces amyloid β peptide (the
primary constituent of amyloid plaques, implicated in Alzheimer's disease) by
cleavage of the amyloid precursor protein.
Potent BACE1 inhibitors have been suggested to be useful drugs. In this work,
QSAR, pharmacophore and docking studies on BACE1 inhibitors proved useful
for finding new and potent active compounds against the neurodegenerative disorder
Alzheimer's disease (AD). With the LigandFit protocol, the highly active compound had a
dock score of 82.634 and the molecules formed hydrogen bond
interactions with the amino acids ASP 290 and GLY 96, while the low active compound scored
67.26. With the C-DOCKER protocol, the molecules formed hydrogen bond interactions
with the amino acid THR 293, and the high and low active compounds showed C-DOCKER
energies of 36.36 and -14.58 respectively. With the LibDock protocol, the highly active
compound showed a LibDock score of 86.88 and the molecules formed a hydrogen bond
interaction with THR 294, whereas the low active compound showed a LibDock score of
110.87 and the molecules formed a hydrogen bond interaction with ASP 290, which is the
same interaction as that of the crystal ligand when compared with LigPlot.
A novel ligand found through LUDI formed hydrogen bond interactions with the
active site amino acids GLY 96 and SER 291.
Analogue based studies performed using pharmacophore generation on BACE1
inhibitors showed the important features from the HipHop run to be hydrogen bond acceptor,
hydrogen bond donor and hydrophobic aromatic. HypoGen, run with these features on
the training set, gave a cost difference of 53.41 and an RMS value of 1.07. The test set
gave an r² value of 0.66 when plotted against the estimated activity. The QSAR model
generated with the training set had an r² value of 0.972, while the test set gave an r²
value of 0.923.
2. AIM OF THE STUDY
In the field of structure based drug design, there are some major goals that
biologists seek to achieve:
To predict the proper structure of the protein and, if no X-ray
crystallography structure of the protein is available, to derive the protein
structure through homology modeling.
Given the structures of an inhibitor and its target, to predict correctly the
binding site on the target, the orientation of the ligand and the
conformations of both.
Given the structure of a target macromolecule and a set of ligands, to
rank the compounds in the order of their experimental characterization.
The immediate major practical applications of the above study are, firstly, to
improve the binding capacity of existing inhibitors and, secondly, to
suggest lead compounds to focus the experimental screening effort,
either by searching chemical databases or by de novo drug design.
The main objectives of the present study are:
I. To dock the ligand molecule (BACE1 inhibitor) correctly on the active
site of the receptor.
II. To perform QSAR studies to predict the structure activity relationship between the
ligand and the receptor.
III. To identify molecules from the database and to suggest new molecules by structure
based drug design or by analogue based drug design.
3. INTRODUCTION
3.1 DRUG DESIGNING
Drug design, sometimes also referred to as Rational Drug Design, is the inventive
process of finding new medications based on knowledge of the biological target. The
drug is most commonly an organic small molecule which activates or inhibits the function
of a biomolecule such as a protein, which in turn results in a therapeutic benefit to the
patient. In the most basic sense, drug design involves the design of small molecules that are
complementary in shape and charge to the biomolecular target with which they interact and
therefore will bind to it. Drug design frequently but not necessarily relies on computer
modeling techniques. This type of modeling is often referred to as Computer Aided Drug
Design (CADD).
The phrase “Drug Design” is to some extent a misnomer. What is really meant by
drug design is ligand design. Modeling techniques for prediction of binding affinity are
reasonably successful. However, there are many other properties, such as bioavailability,
metabolic half-life and lack of side effects, that must first be optimized before a ligand
can become a safe and efficacious drug. These other characteristics are often difficult to
optimize using Rational Drug Design techniques.
3.1.1 Background
Typically a drug target is a key molecule involved in a particular metabolic or
signaling pathway that is specific to a disease condition or pathology, or to the infectivity
or survival of a microbial pathogen. Some approaches attempt to inhibit the functioning
of the pathway in the diseased state by causing a key molecule to stop functioning. Drugs
may be designed that bind to the active region and inhibit this key molecule. Another
approach may be to enhance the normal pathway by promoting specific molecules in the
normal pathways that may have been affected in the diseased state. In addition, these
For these QSAR, pharmacophore and docking studies, the protein 2ZDZ is
loaded from the RCSB Protein Data Bank (www.rcsb.org/pdb/) and a force field is applied.
A force field refers to the functional form and parameter sets which are used to calculate the
potential energy of a system. Its parameters are obtained through
experimental work and quantum mechanical calculations. All molecules in a mechanical
system are made up of a number of components. Covalently bonded atoms are described by
several parameters such as bond lengths, bond angles and dihedral angles;
similarly, there are non-bonded interactions such as van der Waals and
electrostatic interactions. Thus the total potential energy of the system is calculated as
follows:
$$E_{total} = E_{bond} + E_{angle} + E_{torsion} + E_{van\,der\,Waals} + E_{electrostatic}$$
This summation, when given in an explicit form, represents the force field used to evaluate the
potential energy of a system.
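The summation above can be illustrated with a small Python sketch. The functional forms used here (harmonic bonds and angles, a cosine torsion, a 12-6 Lennard-Jones term and a Coulomb term) are common textbook choices and are an assumption; the source does not state which explicit forms or parameters the applied force field uses, and all function names are hypothetical.

```python
import math

# Illustrative functional forms only (assumed, not taken from the source).

def bond_energy(k, r, r0):
    """Harmonic bond stretch: k * (r - r0)^2."""
    return k * (r - r0) ** 2

def angle_energy(k, theta, theta0):
    """Harmonic angle bend: k * (theta - theta0)^2, angles in radians."""
    return k * (theta - theta0) ** 2

def torsion_energy(v, n, phi, delta=0.0):
    """Cosine torsion: v * (1 + cos(n * phi - delta))."""
    return v * (1.0 + math.cos(n * phi - delta))

def vdw_energy(epsilon, rmin, r):
    """12-6 Lennard-Jones term written with the minimum-energy distance rmin."""
    ratio = rmin / r
    return epsilon * (ratio ** 12 - 2.0 * ratio ** 6)

def electrostatic_energy(q1, q2, r, coulomb_const=332.06):
    """Coulomb term; 332.06 approximates kcal*Angstrom/(mol*e^2) (assumed units)."""
    return coulomb_const * q1 * q2 / r

def total_energy(bond, angle, torsion, vdw, electrostatic):
    """Total potential energy as the plain sum of the five contributions."""
    return bond + angle + torsion + vdw + electrostatic
```

Each term would in practice be summed over all bonds, angles, torsions and non-bonded atom pairs before being combined in `total_energy`.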
Minimization:
The Minimizer uses an algorithm to identify the geometries of the
molecule corresponding to the minimum points on the potential energy surface. The
Minimizer reduces the unwanted forces which are present in the molecule and lowers the
energy of the molecule. Many algorithms are available for
minimization. The minimization methods used in the Smart Minimizer include the Steepest
Descent method, the Conjugate Gradient method, the Newton-Raphson method and the quasi-
Newton method. From the DS protocols, select the Minimization option and run the
protocol for the protein with fixed constraints. Then save the minimized protein for
absence of crystallographic structure data of a protein for which the active site for
receptor binding is clearly identified, a chemist must rely on the structure activity data for
a given set of ligands. If these ligands are known to bind to the same receptor, then one
can attempt to define the commonality between them. The Accelrys Catalyst program can
generate two types of automated pharmacophore models, HypoGen and HipHop,
depending on whether or not activity data is used. In the presence of protein crystal
structure data, active site pharmacophore models can be used as a pre-filter for docking
large libraries. Generation of a pharmacophore model using the active site residue
information is the key to the success of any pharmacophore-based docking algorithm. In
the absence of X-ray bound ligand information, it is a challenge to select a single
pharmacophore model that represents the binding characteristics. A methodology is
proposed in this case study that can be used to analyze and visualize multiple
pharmacophore models. This methodology can be applied to different types of Catalyst
pharmacophore models (qualitative, quantitative, receptor-based, etc.) as it only considers
feature types and coordinates.
This methodology can be applied successfully to the following applications:
VHTS screening
Multiple binding mode identification
Classification of proteins based on binding characteristics
Visualization of pharmacophore model space
To build a better pharmacophore, the following steps were employed:
1. Building a set of molecules
2. Conformer generation
3. Hypothesis Generation
4. Database Search
5. Compare/Fit to estimate Activity
The Feature Dictionary list contains the generalized chemical functions in Catalyst.
Definitions of these functions are:
1. HB ACCEPTOR (vector): Matches the following types of atoms or groups of atoms
with surface accessibility:
sp or sp2 nitrogens that have a lone pair and charge less than or equal to zero
sp3 oxygens or sulfurs that have a lone pair and charge less than or equal to zero
non-basic amines that have a lone pair
Does not match: basic, primary, secondary, and tertiary amines that are protonated at
physiological pH. There is no exclusion of electron-deficient pyridines and imidazoles.
2. HB ACCEPTOR lipid (vector): Matches these types of atoms or groups of atoms:
nitrogens, oxygens, or sulfurs (except hypervalent) that have a lone pair and charge less
than or equal to zero. This function is the same as HB ACCEPTOR except that it includes
basic nitrogen. There is no exclusion of electron-deficient pyridines and imidazoles.
3. HB DONOR (vector): Matches these types of atoms or groups of atoms:
Non-acidic hydroxyls
Thiols
Acetylenic hydrogens
NHs (except tetrazoles and trifluoromethyl sulfonamide hydrogens)
Does not match: electron-rich pyridines and imidazoles that would be protonated, or
nitrogens that would be protonated due to their high basicity.
4. HYDROPHOBIC (point): Matches these types of groups of atoms:
A contiguous set of atoms that is not adjacent to any concentrations of charge (charged
atoms or electronegative atoms) in a conformer such that the atoms have surface
accessibility such as phenyl, cycloalkyl, isopropyl, and methyl.
5. HYDROPHOBIC ALIPHATIC (point): Matches these types of groups of atoms:
A contiguous set of atoms that is not adjacent to any concentrations of charge (charged
atoms or electronegative atoms) in a conformer such that the atoms have surface
accessibility, such as cycloalkyl, isopropyl, and methyl.
6. HYDROPHOBIC AROMATIC (point): Matches these types of groups of atoms:
A contiguous set of atoms that is not adjacent to any concentrations of charge (charged
atoms or electronegative atoms) in a conformer such that the atoms have surface
accessibility such as phenyl and indole.
7. NEG CHARGE (atom): Matches negative charges not adjacent to a positive charge.
8. NEG IONIZABLE (point): Matches atoms or groups of atoms that are likely to be
deprotonated at physiological pH, such as:
Trifluoromethyl sulfonamide hydrogens
Sulfonic acids (centroid of the three oxygens)
Phosphoric acids (centroid of the three oxygens)
Sulfinic, carboxylic, or phosphinic acids (centroid of the two oxygens)
Tetrazoles
Negative charges not adjacent to a positive charge
9. POS CHARGE (atom): Matches positive charges not adjacent to a negative charge.
10. POS IONIZABLE (point): Matches atoms or groups of atoms that are likely to be
protonated at physiological pH, such as:
Basic amines
Basic secondary amidines (iminyl nitrogen)
Basic primary amidines, except guanidines (centroid of the two nitrogens)
Basic guanidines (centroid of the three nitrogens)
Positive charges not adjacent to a negative charge
Does not match: weakly basic aromatic nitrogens such as pyridine and imidazole.
11. RING AROMATIC (vector and plane): Matches 5- and 6-membered aromatic
rings. The feature defines 2 points, the ring centroid and a projected point normal to the
ring plane. The projected point can map both above and below the ring.
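As an illustration of the RING AROMATIC definition, the two points of the feature can be computed from ring atom coordinates as sketched below. This is not Catalyst's implementation: the function names, the use of Newell's method for the plane normal, and the projection distance of 3.0 Å are all assumptions made for the example.

```python
import math

def centroid(points):
    """Arithmetic mean of a list of (x, y, z) ring atom coordinates."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def ring_normal(points):
    """Unit normal of the ring plane by Newell's method.

    Assumes the atoms are listed in bonded order around the ring.
    """
    nx = ny = nz = 0.0
    n = len(points)
    for i in range(n):
        x1, y1, z1 = points[i]
        x2, y2, z2 = points[(i + 1) % n]
        nx += (y1 - y2) * (z1 + z2)
        ny += (z1 - z2) * (x1 + x2)
        nz += (x1 - x2) * (y1 + y2)
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    return (nx / length, ny / length, nz / length)

def ring_aromatic_feature(points, projection_distance=3.0):
    """Return the ring centroid and the projected points above and below the ring.

    projection_distance is a hypothetical value chosen for illustration.
    """
    c = centroid(points)
    n = ring_normal(points)
    above = tuple(c[i] + projection_distance * n[i] for i in range(3))
    below = tuple(c[i] - projection_distance * n[i] for i in range(3))
    return c, above, below

# Idealized planar six-membered ring (radius 1.39 Angstrom) in the xy plane.
benzene = [
    (1.39 * math.cos(math.radians(60 * k)), 1.39 * math.sin(math.radians(60 * k)), 0.0)
    for k in range(6)
]
ring_centre, point_above, point_below = ring_aromatic_feature(benzene)
```

Returning both projected points reflects the statement that the projected point can map either above or below the ring.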
Steps to be followed in DS:
1. Construct or import the molecules.
2. Perform conformational search
3. Examine each conformer for the presence of chemical features.
4. Determine the set of features that correlate with activity
Pharmacophore hypothesis
Catalyst’s Confirm, Common Feature Pharmacophore generation (HipHop) and
3D QSAR generation (HypoGen) are applications that provide tools to generate
pharmacophore hypotheses. The hypotheses are created by generating conformations for a
set of study molecules, then using the conformations to find and align chemically
important functional groups common to the molecules in the study set. Each hypothesis
can also incorporate data on the biological activities of the study molecules.
Steps involved generating a pharmacophore hypothesis:
1. Generate conformations
The interface to Confirm is used to generate conformations for a single molecule or
a set of molecules. The number of conformations needed to produce a good representation
of a compound's conformational space depends on the molecule. Both conformation
generating algorithms available in Confirm (Best and Fast) are adjusted to produce a
diverse set of conformations, avoiding repetitious groups of conformations all representing
local minima.
The conformations generated by Confirm can be used as input into HipHop and
HypoGen to align common molecular features and generate a hypothesis.
Align common features to generate a hypothesis.
The procedure involves the following:
1. Aligning common molecular features.
2. Setting preferences using control panel
3. Incorporating activity data into a hypothesis
4. Using aligned structures to generate receptor models.
HipHop and HypoGen use conformations generated in Confirm to align
chemically important functional groups common in the molecules in the study set. A
pharmacophore hypothesis can then be generated from these aligned structures.
Incorporating biological activity data into a hypothesis
HypoGen is used to incorporate biological activity data into the hypothesis
generating process. Each hypothesis is tested by regression techniques to compare
estimated activity with actual activity data. The software uses the data from these tests to
select the hypotheses that do the best job of predicting activity for the set of study molecules.
This capability is provided by Catalyst/HypoGen.
4.4 ia Common feature pharmacophore generation (HipHop)
Pharmacophores based on the alignment of multiple common features are generated
using HipHop, and the aligned structures can be used to generate receptor
models. The objective is to identify and enumerate all possible
pharmacophore configurations that are common to the training set. The
Model Receptor menu card is included in the hypothesis models card deck so that you
can use structures that have been aligned in HipHop to generate a receptor surface model.
Since structures used in HipHop are aligned by common chemical features, the receptor
surface model that is generated for them can be significantly different from a receptor
surface model generated from template aligned structures.
The characteristics of an ideal HipHop training set are as follows:
2-30 compounds, ideally 6 molecules
Structurally diverse set of input molecules.
Feature rich compounds
Include the most active compounds
Spread sheet set up for HipHop
Molecules are imported from the hypothesis generation workbench into a spreadsheet.
Principal: specifies the reference molecules, whose configuration models are potential
centres for hypotheses.
If (0), do not consider this molecule.
If (1), consider the configurations of this molecule.
If (2), use this compound as a reference molecule (used only for HipHop
hypothesis generation).
Maximum omit features: shows how many features for each compound may be omitted.
If (0), all features must map to generate a hypothesis.
If (1), all but one feature must map to generate a hypothesis.
If (2), no features need to map to generate a hypothesis (used only for HipHop
hypothesis generation).
When compound data appear in the spreadsheet, you are ready to add values in the
Principal and MaxOmitFeat columns. Common feature hypothesis generation uses
values in these columns to determine which molecules should be considered when
building hypothesis space and which molecules should map to all or some of the
features in the final hypotheses.
In the Principal column, a value of 2 means that all the chemical features in the
compound will be considered in building hypothesis space. A value of 1 means that
features will be considered when generating hypotheses and that at least one mapping for
each generated hypothesis will be found unless the Misses or Complete Misses options
are used. A value of 0 means the compound will be ignored.
The MaxOmitFeat column specifies how many hypothesis features must map to
the chemical features in each compound. A 0 in this column forces mapping of all features,
a 1 means that all but one feature must map, and a 2 allows hypotheses to which no
compound features map.
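The meaning of the Principal and MaxOmitFeat values described above can be summarized in a short Python sketch. The function names and the returned labels are hypothetical and only restate the rules given in the text; they are not part of Catalyst.

```python
def principal_role(principal):
    """Describe how a compound is used, given its Principal value (0, 1 or 2)."""
    roles = {
        0: "ignored",
        1: "considered when generating hypotheses",
        2: "reference: all features used to build hypothesis space",
    }
    if principal not in roles:
        raise ValueError("Principal must be 0, 1 or 2")
    return roles[principal]

def required_mappings(n_hypothesis_features, max_omit_feat):
    """Minimum number of hypothesis features the compound must map.

    0 -> all features, 1 -> all but one, 2 -> no feature is required to map.
    """
    if max_omit_feat == 0:
        return n_hypothesis_features
    if max_omit_feat == 1:
        return max(n_hypothesis_features - 1, 0)
    if max_omit_feat == 2:
        return 0
    raise ValueError("MaxOmitFeat must be 0, 1 or 2")
```

For a five-feature hypothesis, for example, a compound with MaxOmitFeat 1 would have to map at least four features.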
4.4.ii 3D QSAR Pharmacophore generation (HypoGen)
HypoGen attempts to derive SAR models for a set of molecules for which activity
values (IC50 or Ki) on a given biological target are available. HypoGen optimizes
hypotheses that are present in the highly active compounds in the training set but missing
among the least active (or inactive) ones. It attempts to construct the simplest hypothesis
that best correlates the activities (estimated vs. measured). The predictive models are created
in three stages:
Constructive
Subtractive
Optimization
Fig 14: HypoGen process flow (constructive phase → pharmacophore domain → subtractive phase → feasible models → optimization phase → top scoring models)
1. Constructive Phase:
The constructive phase identifies hypotheses that are common to the most active
set of compounds. The process flow of this phase is depicted below:
Fig 15: Constructive phase process flow
Training set → most active compounds → pharmacophore domain:
Identify the most active compounds: (Most Active Cmpd x Unc) - (CmpdX / Unc) > 0
Enumerate all possible pharmacophore configurations (the most active, 2nd most active).
Check for duplicates.
Ensure that the rest of the most actives fit to MinSubsetPoint features.
2. Subtractive Phase: The objective of this phase is to identify those pharmacophore
configurations developed in the constructive phase that are also present in
the least active set of molecules and remove them. The process is depicted as
follows:
Fig 16: Subtractive phase process flow
Training set → least active compounds → feasible pharmacophores:
Identify the least active compounds: log(CmpdX) - log(Most Active Cmpd) > 3.5
Enumerate all possible pharmacophore configurations.
Check for configurations shared with the most active compounds (the most active, 2nd most active).
Eliminate if shared by more than half of the least actives.
3. Optimization Phase:
This phase involves improvement of the hypothesis score. HypoGen reports
the top scoring 10 unique pharmacophores. The process flow is depicted as
follows:
Fig 17: Optimization phase process flow
Feasible pharmacophores → top scoring models:
Features and/or locations are varied to optimize activity prediction via a simulated annealing approach.
Geometric fits are calculated.
Linear regression of -log(Activity) vs. geometric fit is performed.
Total cost is calculated for each new hypothesis:
$$\text{Total cost} = [\text{Cost}(Err) \times CC(Err)] + [\text{Cost}(Wt) \times CC(Wt)] + [\text{Cost}(Cnfg) \times CC(Cnfg)]$$
where the CCs are the cost coefficients contained in CATALYST_CONF/hypo.data.
The optimization stops when it no longer improves the score.
"Occam's Razor": the simplest hypothesis that accurately estimates the activity is considered the best.
The constructive phase identifies hypotheses that are common to the most active
set of compounds.
The most active set is determined by the following equation:
$$(MA \times Unc) - (A / Unc) > 0.0$$
where MA is the activity of the most active compound,
Unc is the uncertainty in the measured activity and A is the activity of the compound.
The most active set of compounds is limited to a maximum of 8. Once the set is
determined, HypoGen enumerates all possible pharmacophore features for each of the
conformations of the two most active compounds. Furthermore, a hypothesis must fit a
minimum subset of features of the remaining most active compounds in order to be
considered. At the end of the constructive phase a database of
pharmacophore configurations is generated. The objective of the
subtractive phase is to identify those pharmacophore configurations developed in the
constructive phase that are also present in the least active set of molecules and remove
them. The first step is the identification of the least active compounds. This is
accomplished by the equation
$$\log(A) - \log(MA) > 3.5$$
where A is the activity of the current compound and MA is the activity of the most
active compound.
In simple terms, all compounds whose activity is 3.5 orders of magnitude less than
that of the most active compound are considered to be in the set of least active molecules.
The value 3.5 is a user adjustable parameter that can be changed if needed (e.g., if the
activity range of the dataset does not span more than 3.5 orders of magnitude).
HypoGen then enumerates all possible pharmacophore configurations of the least active
compounds, checks for configurations shared with the most active compounds and eliminates
those shared by more than half of the least actives, leading to the feasible pharmacophores.
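The two selection rules for the constructive and subtractive phases can be sketched in Python as follows. This is only an illustration under stated assumptions: activities are taken to be values such as IC50 where a smaller number means a more active compound, a single uncertainty value of 3.0 is assumed for every compound, and the function names are hypothetical rather than part of Catalyst.

```python
import math

def most_active_set(activities, uncertainty=3.0, max_size=8):
    """Compounds satisfying (MA x Unc) - (A / Unc) > 0, capped at max_size.

    activities maps compound name -> activity (smaller = more active, assumed).
    """
    ma = min(activities.values())
    hits = [
        name for name, a in activities.items()
        if ma * uncertainty - a / uncertainty > 0
    ]
    hits.sort(key=lambda name: activities[name])  # most active first
    return hits[:max_size]

def least_active_set(activities, threshold=3.5):
    """Compounds satisfying log(A) - log(MA) > threshold (base-10 logarithm)."""
    ma = min(activities.values())
    return [
        name for name, a in activities.items()
        if math.log10(a) - math.log10(ma) > threshold
    ]

# Hypothetical activities, invented for the example only.
example = {"cmpd_a": 1.0, "cmpd_b": 5.0, "cmpd_c": 10.0, "cmpd_d": 5000.0}
actives = most_active_set(example)
inactives = least_active_set(example)
```

With these invented values, `cmpd_a` and `cmpd_b` fall in the most active set, while only `cmpd_d` lies more than 3.5 orders of magnitude from the most active compound.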
The optimization phase involves improvement of the hypothesis score.
Small perturbations are applied to those pharmacophore configurations that survived the
subtractive phase, and these are scored based on errors in activity estimates from regression
and on the complexity of the hypothesis. The cost of a hypothesis is a quantitative extension of
Occam's razor (everything else being equal, the simplest model is preferred).
The cost of each pharmacophore is computed as the sum of three costs:
weight, error and configuration. The weight component increases with deviation of
the feature weight from the ideal value of 2.0, while the error component increases with the RMS
difference between the measured and estimated activities. The configuration cost is fixed
and depends on the complexity of the pharmacophore upon completion of this phase.
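The sum of the three cost components can be written as a minimal Python sketch. The cost coefficients default to 1.0 here purely as an assumption for illustration; the function name and the numbers in the usage line are hypothetical.

```python
def total_cost(error_cost, weight_cost, config_cost,
               cc_error=1.0, cc_weight=1.0, cc_config=1.0):
    """Weighted sum of the error, weight and configuration costs (in bits).

    The cost coefficients (cc_*) are assumed to be 1.0 unless supplied.
    """
    return (error_cost * cc_error
            + weight_cost * cc_weight
            + config_cost * cc_config)

# Hypothetical component costs, invented for the example.
cost = total_cost(error_cost=80.0, weight_cost=1.2, config_cost=16.5)
```

Because the configuration cost is fixed for a given run, differences in total cost between hypotheses in this sketch come only from the error and weight terms.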
HipHop and HypoGen use conformations generated in Confirm to align
chemically important functional groups common to the molecules in a study set.
Biological activity data can be incorporated into these hypotheses so that the best
hypotheses for predicting activity are generated and selected. Additionally, you can use
structures that have been aligned in these programs to generate a receptor surface model.
HypoGen Training and Test set selection
Selection of the training set molecules is one of the most important exercises the
user must perform, for the following reasons:
Catalyst derives the information used in subsequent analysis from those structures;
thus, the "garbage in, garbage out" paradigm certainly applies.
The statistical procedures applied during analysis have limits in terms of over- and
under-fitting the data.
Data sets that are ideal for these analysis procedures and data sets from typical
medicinal chemistry structure activity series are often not the same thing.
The ideal training set should satisfy the following conditions:
1. At least 16 compounds are necessary to assure statistical power.
2. Activities should span 4 orders of magnitude.
3. Each order of magnitude should be represented by at least 3 compounds.
4. No redundant information.
5. No excluded volume problems.
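The first three conditions are numerical and can be checked with a short Python sketch. Binning compounds by the integer part of log10(activity) is an assumption made here to count compounds per order of magnitude; the function name and thresholds as defaults simply mirror the list above.

```python
import math
from collections import Counter

def check_training_set(activities, min_compounds=16, min_span=4.0, min_per_order=3):
    """Check compound count, activity span and population per order of magnitude.

    activities is a list of positive activity values (e.g. IC50 in nM).
    """
    logs = [math.log10(a) for a in activities]
    span = max(logs) - min(logs)
    # Assumed binning: one bin per integer value of floor(log10(activity)).
    bins = Counter(math.floor(x) for x in logs)
    orders = range(math.floor(min(logs)), math.floor(max(logs)) + 1)
    return {
        "enough_compounds": len(activities) >= min_compounds,
        "enough_span": span >= min_span,
        "each_order_populated": all(bins[o] >= min_per_order for o in orders),
    }

# Hypothetical activity values, invented for the example.
example_set = [2, 3, 5, 20, 30, 50, 200, 300, 500,
               2000, 3000, 5000, 20000, 30000, 50000, 70000]
report = check_training_set(example_set)
```

Conditions 4 and 5 (redundancy and excluded volume) depend on structural information and are not covered by this sketch.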
Methodology
To build a better pharmacophore the following steps were employed
1. Building set of molecules
2. Conformer generation
3. Hypothesis generation
4. Database generation
5. Database search
6. Compare / fit to estimate activity
Criteria to generate successful hypothesis are:
1. Cost factor: the difference between the fixed and null costs
should be greater than 50 bits, i.e., a larger difference gives better prediction.
2. Fixed cost represents the simplest model that fits all data perfectly, and the
null cost represents the highest cost of a pharmacophore with no features,
which estimates every activity to be the average of the activity data of the training set molecules.
3. The configuration value, which is a measure of the magnitude of the hypothesis space for
a given training set, should be less than 18. If it is above this, there are too many degrees of
freedom and the result may not be useful.
4. The correlation between the estimated and the actual activity data should be around 1.0.
5. The RMS deviation, which represents the quality of the correlation between the
estimated and the actual activity data, should be as low as possible, nearly equal to 0.
Method
1. Building a set of molecules
All molecules were built using the Catalyst View Compound workbench. They were
cleaned using the 2D Beautify option and minimized using a CHARMm-like force field.
2. Conformer generation
A conformer model is a representation of the possible conformational space of a
ligand. It is assumed that the biologically active conformation of a ligand (or a close
approximation thereof) should be contained within this model. Conformers were
generated for all molecules with a cut-off energy range of 20 kcal/mol and up to a maximum
of 255 conformers.
Cost hypothesis:
The lowest cost hypothesis is considered to be the best. However, hypotheses with
costs within 10-15 of the lowest cost hypothesis are also considered good candidates.
The units of cost are binary bits. Hypothesis costs are calculated according to the number
of bits required to completely describe a hypothesis. Simpler hypotheses require fewer bits for a
complete description, and the assumption is made that simpler hypotheses are better.
Hypothesis generation / pharmacophore search
A pharmacophore model consists of a collection of features necessary for the
biological activity of the ligand arranged in 3D space, the common ones being hydrogen
bond acceptor, hydrogen bond donor and hydrophobic features. Hydrogen bond donors
are defined as vectors from the donor atom of the ligand to the corresponding acceptor
atom in the receptor. Hydrogen bond acceptors are analogously defined. Hydrophobic
features are located at the centroids of hydrophobic atoms.
Conformations for all molecules were generated in the View Compound workbench
using the poling algorithm and the Best quality conformer generation method. Best
conformer generation considers the arrangement of atoms and
accepts a maximum of 255 conformers. Catalyst generated
conformers that provided the most comprehensive treatment of flexible ring systems. All
the conformers are automatically saved, along with the number of conformers generated for each
molecule and the lowest conformer energy in kcal/mol. Conformers were selected that fell
within a 20 kcal/mol range above the lowest energy conformation found.
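The selection criterion (a 20 kcal/mol window above the lowest energy conformer, with at most 255 conformers) can be sketched in Python as below. This is a simplified illustration, not the poling algorithm: it only filters a list of already computed conformer energies, and keeping the lowest-energy conformers when the cap is exceeded is an assumption of the example.

```python
def select_conformers(energies, window=20.0, max_conformers=255):
    """Keep conformer energies within `window` kcal/mol of the lowest one.

    Returns the retained energies sorted from lowest to highest, truncated
    to `max_conformers` entries (assumed tie-breaking: lowest energies first).
    """
    if not energies:
        return []
    lowest = min(energies)
    kept = sorted(e for e in energies if e - lowest <= window)
    return kept[:max_conformers]

# Hypothetical conformer energies in kcal/mol, invented for the example.
kept = select_conformers([5.0, 12.0, 30.0, 4.0])
```

In this example the conformer at 30.0 kcal/mol is discarded because it lies 26 kcal/mol above the lowest energy conformer.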
Hypothesis generation
The pharmacophore hypotheses were generated in the Generate Hypothesis workbench. The
molecules were selected as the training set based on the order of magnitude of their activity. Hypothesis
generation was carried out under the following assumptions:
1. The most active and the most inactive molecules should be represented in the training set.
2. At least 3 molecules from each order of magnitude should be selected for
pharmacophore generation.
3. A minimum of 15 molecules should constitute a training set.
4. Molecules selected should represent diversity towards chemical features.
Hypothesis considerations
In order to achieve a better pharmacophore, the following limits or considerations
should be met by generated hypothesis:
Configuration value should be around 17.
RMS should be as low as possible, preferably near zero.
Correlation should be around 1.0.
Cost difference between the fixed cost and the null cost should be between 40-80
bits.
Factors that determine the quality of pharmacophore
The overall cost of a hypothesis is calculated by summing three cost factors, a
weight cost, an error cost and a configuration cost. These are qualitatively defined.
1. Weight cost
A value that increases in a Gaussian form as the feature weight in a model
deviates from an idealized value of 2.0. This cost factor is designed to favour hypotheses
where the feature weights are close to 2.
2. Error cost
A value that increases as the RMS difference between estimated and measured
activities for the training set molecules increases. This cost factor is designed to favour
models where the correlation between estimated and measured activities is better.
3. Configuration cost
This is a fixed cost which depends on the complexity of the hypothesis space
being optimized. It is equal to the entropy of the hypothesis space.
Of the three, the error cost has the major effect in establishing the hypothesis
cost. During the beginning phase of an automated hypothesis generation, Catalyst
calculates the cost of two theoretical hypotheses: one in which the error cost is minimal
(all compounds fall along a line of slope = 1), and one in which the error cost is high (all
compounds fall along a line of slope = 0). These models can be considered upper and
lower bounds for the training set. The cost values for them are useful guides for
estimating the chances of a successful experiment. They are available within 15 minutes
of the start of the run, which matters because these experiments can easily require days
of run time.
The ideal hypothesis cost (fixed cost) is reported in the full file found in the hypothesis
generation directory. This value tends to be 70-100 bits. The null hypothesis cost is
reported in the log file found in the same directory and is usually higher than the fixed
cost. What is important is the difference between these two costs: the greater the
difference, the higher the probability of finding a useful model. In terms of hypothesis
significance, what really matters is the magnitude of the difference between the cost of any
returned hypothesis and the cost of the null hypothesis. In general, if this difference is
greater than 60 bits, there is an excellent chance that the model represents a true correlation.
Since most returned hypotheses will be higher in cost than the fixed cost model, a difference
between fixed cost and null cost of 70 bits or more is necessary in order to achieve the
60 bit difference. If a returned hypothesis has a cost that differs from the null hypothesis
by 40-60 bits, there is a 75-90% chance that it represents a true
correlation in the data. As the difference falls below 40 bits, the likelihood of the
hypothesis representing a true correlation in the data rapidly drops below 50%. Under
these conditions, it may be difficult to find a model that can be shown to be predictive. In
the extreme situation where the fixed and null cost differential is small (<20 bits), there is
little chance of succeeding and it is advisable to reconsider the training set before
proceeding. Another useful number is the entropy of the hypothesis space. This value is
calculated early in the run and is reported near the value for the fixed cost.
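The cost-difference bands described above can be written down as a small lookup. The following is only a sketch of those rules of thumb: the function name and the example cost values are hypothetical and are not part of Catalyst.

```python
def interpret_cost_difference(null_cost, hypothesis_cost):
    """Map (null cost - hypothesis cost), in bits, to the qualitative
    bands quoted in the text. Hypothetical helper, not a Catalyst function."""
    difference = null_cost - hypothesis_cost
    if difference > 60:
        return "excellent chance of a true correlation"
    if difference >= 40:
        return "75-90% chance of a true correlation"
    return "below 50% chance of a true correlation"


# Illustrative (made-up) costs in bits.
example = interpret_cost_difference(null_cost=180.0, hypothesis_cost=110.0)
```

Here the difference is 70 bits, so the hypothesis falls in the first band.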
Training set
1. Training set should contain the most active compounds.
2. Each compound must provide a unique feature to Catalyst.
3. If two compounds have similar structures (collections of features), they must
differ in activity by an order of magnitude to be included; otherwise, pick only the
more active of the two.
4. If two compounds have similar activities (within one order of magnitude), they
must be structurally distinct (from a chemical feature point of view) in order to
both be included; otherwise, pick only the more active of the two.
The pharmacophore features are perceived from the HipHop data. The
features present in the training set molecules are hydrogen bond acceptor, hydrogen bond
donor, hydrophobic and ring aromatic. Nineteen molecules are selected for the training set. The
training set molecules and their activity values are loaded into a spreadsheet, and all the
preferences and uncertainty values are set. Then the HypoGen algorithm is used to
generate the hypotheses.
4.4. iii Quantitative Structure Activity Relationship (QSAR)
The idea of quantitative structure-activity (or structure-property) relationships
(QSAR/QSPR) was introduced by Hansch et al. in 1963 and was first applied to analyze
the importance of lipophilicity for biological potency. This concept is based on the
assumption that the difference in the structural properties of molecules, whether
experimentally measured or computed, accounts for the difference in their observed
biological or chemical properties. In general, QSAR methods deal with identifying and
describing important structural features of molecules that are relevant to explaining
variation in biological or chemical properties. QSAR started as a simple comparison of
properties for two or more molecules using a single number and has grown into a complex
multivariable treatment of properties versus structure, based on statistical analysis and
relying on the power of modern computers.
QSAR is a technique that quantifies the relationship between structure and
biological data and is useful for optimizing the groups that modulate the potency of a
molecule. QSAR has been useful for rationalizing compound activity and for the rational
design of new compounds.
Most QSAR methods developed over the years have dealt with descriptors
derived from 2D representations of molecular structures, i.e., based
on molecular connectivity. Numerous 2D structural descriptors have been reported,
including hydrophobicity constants, molar refractivities, Hammett electronic constants,
Verloop STERIMOL parameters, and topological indices developed by Kier and Hall.
Traditional QSAR methods have utilized several of the above parameters and multiple
regression methods to develop equations relating structure and biological activity.
Fundamental quantitative structure activity relationship studies rely on the fact that
structures can easily be compared, overlaid and displayed. The QSAR is obtained by
providing more parameters to optimize a series of bioactive molecules. A quantitative
structure activity relationship based on physicochemical properties describes the structural,
electronic and physicochemical characteristics of a drug. Data sets are produced using all
available descriptors.
Knowledge of the three-dimensional (3D) structure of the target
(receptor/enzyme/DNA) is applied to rationally design drug molecules that bind to the target,
for the following reasons:
1. To understand the atomic details of the binding strength and specificity of a drug
(drug-receptor interactions).
2. To develop novel drugs (unique chemical structures) for a selected target via de novo
drug design or database searching techniques.
3. To optimize the therapeutic index of an already available drug or lead compound,
establishing the structural requirements for activity from a minimum number of tested
compounds.
A QSAR equation numerically relates chemical and physicochemical properties to biological
activity. Biological activity is defined as a pharmacological response,
usually expressed in millimoles, such as the effective dose in 50% of the subjects (ED50),
the lethal dose in 50% of the subjects (LD50) or the half-maximal inhibitory concentration
(IC50). It is common to express the biological activity as a reciprocal. The QSAR equation is
similar to the equation for a straight line:
y = mx + c
or
Log (biological activity) = a (physicochemical property) + c
a = regression coefficient (slope of the straight line)
c = intercept on the y-axis (when the physicochemical property equals zero)
Fig 18: Concept of QSAR
Biological activity is expressed as a reciprocal to produce a positive slope, and
also because of the inverse relationship between the physicochemical property and
the dose that expresses biological potency. There is therefore a positive relationship between the
reciprocal of the biological activity (1/BA) and the physicochemical property. Since QSAR
studies are based on the relationship between the descriptors and the biological activity, the
error in the biological activity data must be minimal and the choice of the descriptors must be
accurate and appropriate.
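As a minimal sketch of the straight-line form above, the following fits log(1/C) against a single physicochemical property by ordinary least squares. The logP and IC50 values are invented for illustration only, and fit_line is a hypothetical helper rather than a routine from any QSAR package.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares fit of y = a*x + c; returns (a, c)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx            # slope (regression coefficient)
    c = mean_y - a * mean_x  # intercept on the y-axis
    return a, c


# Made-up data: one descriptor (logP) and IC50 values in mol/L.
logp = [1.0, 2.0, 3.0, 4.0]
ic50_molar = [1e-4, 1e-5, 1e-6, 1e-7]

# Reciprocal activity on a log scale: log(1/C) = -log10(C).
log_inverse_activity = [-math.log10(conc) for conc in ic50_molar]

a, c = fit_line(logp, log_inverse_activity)
```

With these invented numbers, a more potent compound (smaller IC50) has a larger log(1/C), so the fitted slope is positive.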
Objective of QSAR:
1. Drug transport/ mechanism
2. Prediction of activity.
3. Classification of molecules as highly active, moderately active and inactive.
4. Optimization of activity through steric, electrostatic and hydrophobic properties.
5. Refinement of synthetic targets.
6. Reduction and replacement of animal testing of drug action.
Basic requirement in QSAR studies:
1. All analogues should belong to congeneric series.
2. All analogues should exert the same mechanism of action.
3. All analogues should bind in a comparable manner.
4. Effect of isosteric replacement can be predicted.
5. Binding affinity can be correlated to interaction energies.
6. Biological activities can be correlated to binding activity.
QSAR studies involve the following steps:
CSD database.
Choice of descriptors.
Statistical methods to derive the QSAR equation.
Validation.
CSD database
Experimental information about the structures of molecules can be
extremely useful for forming theories of conformational analysis and for predicting
the structures of molecules for which no experimental information is available. The most
important technique currently available for determining the three-dimensional structure of
molecules is X-ray crystallography. The crystallographic community has distributed in
electronic form two databases that are particularly important for the molecular modeller:
the Cambridge Structural Database (CSD), which contains crystal structures of organic and
organometallic molecules, and the Protein Data Bank (PDB), which contains structures of
proteins and some DNA fragments.
A database is of little use without software tools to search, extract and manipulate the
data. A simple use of a database is to extract information about a particular molecule
or group of molecules. The data may also be identified by drawing a two-dimensional
representation of a molecule and using a substructure search program to search the
database. Crystallographic databases have also been used to develop an understanding of
the factors that influence the conformations of molecules, and of the ways in which
molecules interact with each other. For example, the CSD has been comprehensively analyzed
to characterize how the lengths of chemical bonds depend upon the atomic numbers,
hybridization and environment of the atoms involved. Analyses of intermolecular
hydrogen bonding have revealed distinct distance and angular preferences. A major use of
the CSD is substructure searching for molecules that contain a particular fragment, in
order to investigate the conformations that the fragment adopts.
A crystallographic database can only provide information about the crystal state
of matter, and the possible influence of crystal packing forces should always be taken
into account. This is less of a concern for proteins than for small molecules, as protein
crystals contain a large amount of water, and indeed NMR studies have established that
proteins have approximately the same structure in solution as in the crystal.
A second, more subtle, bias is that crystallographic databases only contain
molecules that can be crystallized, and indeed only those molecules whose X-ray
structures were considered interesting enough to be published. The structures in a
crystallographic database may therefore not be a wholly representative set.
Molecular descriptors
The study of steric requirements for interaction between ligands and their
corresponding biological acceptor sites is often of decisive importance in understanding
the role played by structural features in promoting activity. In its most general form,
drug receptor theory requires that a ligand exerts its biological action as a consequence of
binding or otherwise interacting with a specific biological acceptor site, such as a
membrane protein or an enzyme, which may be generally termed the receptor. The
concept that is the basis for modern drug receptor theory involves the old principle that a
ligand fits its receptor much as a key fits a lock. This concept, although somewhat
arbitrary since a high degree of flexibility is present in biomacromolecular structures,
governs the principles of molecular recognition and molecular discrimination. Although
stereochemistry often plays a major role in drug bioactivity, care must be taken when
considering structure activity relationships to explore whether other differences in
physicochemical properties exist before one makes significant correlations with the steric
properties of the structure under study.
In early studies, organic chemists defined a number of steric parameters in
order to explain the steric effects of substituents on the reaction centers of organic molecules.
The same type of steric effects observed in studies of the variation of physical properties and
chemical reactivity with structure may be assumed to be involved in biological
activity, which, at least as a first approximation, may be treated in a similar fashion. In
the past 35 years, owing to the development of drug design and the Hansch approach, many
other parameters and methods have been developed that have the merit of trying to
avoid a simple empirical correlation with given ligand properties and of trying to
propose the possible geometric features of the receptor.
Steric descriptors are classified into the following groups:
1. Topological indices, based on characterization of chemical structures by graph
theory.
2. Geometric descriptors, resulting from the view of organic molecules as three
dimensional objects from which standard dimensions can be calculated.
3. Chemical descriptors, derived from the steric influence upon a standard reaction.
4. Physical descriptors, derived when an organic molecule is considered as a three
dimensional object with size-determined physical properties, and difference descriptors,
which result when an organic molecule considered as a three dimensional object is
compared with a reference structure.
Different molecular descriptors available are described below.
Molecular Descriptors
1. Fragment constant descriptors
These are constants that relate the effect of substituents on a “reaction center”
from one type of process to another. The basic idea is that similar changes in
structure are likely to produce similar changes in reactivity, ionization or
binding. There are different constants corresponding to different effects. These
are typically used to parameterize the Hammett equation for some series of
analogs.
log kX = ρσ + log kH
where kX and kH are the reaction rate constants for the substituents X and H,
respectively; σ is an electronic constant derived from an ionization constant; and ρ is
fit to the set of rates. Different properties (electronic, steric, etc.) at different R group
positions are used. In this way, measurements of ionization constants can be
used to predict rate constants once a scaling factor (ρ) is determined for
the rate constant. The default database currently contains the following types
of constants. These come from table VI –I of Hansch except for the Sterimol
constants, which are calculated.
Sm, Sp - Electronic effect sigma meta and sigma para
F, R - Inductive polar part (F) and resonance part (R)
pi – Hydrophobic character
HA, HB – Hydrogen bond acceptor (HA) and donor (HB)
MR - Molar refractivity = ((n^2 - 1)/(n^2 + 2)) * (MW/d)
[n - refractive index, MW - molecular weight and d - compound density]
Sterimol-L – Steric length parameter
Sterimol-B1 through B4 – Steric distances perpendicular to bond axis
Sterimol-B5 – Overall maximum steric distance perpendicular to bond
axis
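Two of the relations in this group lend themselves to a short worked sketch: the Hammett equation, log kX = ρσ + log kH, and molar refractivity, assuming the standard Lorentz-Lorenz form (n^2 - 1)/(n^2 + 2) * MW/d. The function names are hypothetical, and the inputs for water (n = 1.333, MW = 18.015 g/mol, d = 0.997 g/cm^3) are approximate textbook values used only for illustration.

```python
def hammett_rate_constant(k_h, rho, sigma):
    """Predict kX from log kX = rho * sigma + log kH (base-10 logs)."""
    return k_h * 10 ** (rho * sigma)


def molar_refractivity(refractive_index, molecular_weight, density):
    """Lorentz-Lorenz molar refractivity in cm^3/mol
    (molecular weight in g/mol, density in g/cm^3)."""
    n_squared = refractive_index ** 2
    return (n_squared - 1.0) / (n_squared + 2.0) * molecular_weight / density


# An unsubstituted compound (sigma = 0) keeps the parent rate constant.
k_parent = hammett_rate_constant(k_h=2.0, rho=1.5, sigma=0.0)

# Approximate values for water, for illustration only.
mr_water = molar_refractivity(1.333, 18.015, 0.997)
```

A positive σ with a positive ρ raises the predicted rate constant, while a negative σ lowers it.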
2. Conformational descriptors
Energy – Descriptor energy is the energy of the selected conformation
Low Energy – Energy of the most stable conformation in the set of
conformations belonging to each molecular model
E penalty – Difference between Energy and Low Energy
3. Electronic descriptors
Charge – Sum of partial charges
F charge – Sum of formal charges
A pol – Sum of atomic polarizabilities
Dipole – Dipole moment
HOMO – Highest occupied molecular orbital energy
LUMO – Lowest unoccupied molecular orbital energy
Sr – Super delocalizability
4. Graph theoretic descriptors
All these descriptors ultimately base their calculation on representation of
molecular structures as graphs, where atoms are represented by vertices and
covalent chemical bonds by edges. These descriptors fall into 2 categories:
a.) Topological descriptors: These view molecular graphs as connectivity
structures to which numerical invariants can be assigned. There are 20
descriptors based on graph theory concepts. They help to differentiate
molecules mostly according to their size, degree of branching, flexibility and
overall shape. Examples are the Wiener index, Zagreb index, Hosoya index,
Kier and Hall molecular connectivity index and Balaban indices.
b.) Information content descriptors: These view molecular graphs as a source of
certain probability distributions to which Shannon's statistical information
theory can be applied. In this approach, molecules are viewed as
structures which can be partitioned into subsets of elements that are in some
sense equivalent. The notion of equivalence depends on the particular
descriptor.
All of these descriptors perform their evaluations on hydrogen-suppressed
graphs, i.e., there are no vertices corresponding to hydrogens and no edges
corresponding to bonds connecting hydrogen to another atom.
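To make the graph view concrete, here is a minimal sketch of one topological descriptor, the Wiener index, computed on hydrogen-suppressed graphs. It is the sum of shortest-path distances, counted in bonds, over all pairs of heavy atoms. The adjacency-list encoding and the two example molecules are illustrative choices, not the representation used by any particular QSAR program.

```python
from collections import deque


def wiener_index(adjacency):
    """Sum of shortest-path lengths (in bonds) over all unordered atom pairs.

    `adjacency` maps each heavy atom to the list of its bonded neighbours
    in a connected, hydrogen-suppressed molecular graph.
    """
    total = 0
    for start in adjacency:
        # Breadth-first search gives bond-count distances from `start`.
        distances = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for neighbour in adjacency[atom]:
                if neighbour not in distances:
                    distances[neighbour] = distances[atom] + 1
                    queue.append(neighbour)
        total += sum(distances.values())
    return total // 2  # every pair was counted from both ends


# Carbon skeletons only: n-butane is a chain, isobutane is branched.
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}

w_chain = wiener_index(n_butane)      # 10
w_branched = wiener_index(isobutane)  # 9
```

The branched isomer gives the smaller value, which illustrates how such indices separate molecules by degree of branching.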
5. Molecular Shape Analysis (MSA) descriptor
DIFFV – Difference volume
Fo – Common overlap volume (ratio)
NCOSV - Non common overlap steric volume
Shape RMS – RMS to shape reference
COSV – Common overlap steric volume
SRVol – Volume of shape reference compound
6. Spatial descriptors
RadofGyration – Radius of gyration
Jurs descriptors – Jurs charged partial surface area descriptors
Shadow indices – Surface area projections
Area – Molecular surface area
Density – Density
PMI – Principal moment of inertia
Vm – Molecular volume
7. Structural descriptors
MW – Molecular weight
Rotlbonds – Number of rotatable bonds
Hbond acceptors – Number of Hydrogen bond acceptors
Hbond donor - Number of Hydrogen bond donors
8. Thermodynamic descriptors
AlogP – Log of partition coefficient
Fh2o – Desolvation free energy for water
Foct - Desolvation free energy for octanol
Hf – Heat of formation
MolRef – Molar refractivity
9. Molecular Field Analysis (MFA) descriptors:
Molecular field analysis (MFA) evaluates the energy between a probe and a
molecular model at a series of points defined by a rectangular or spherical grid. This
method quantifies the interaction energy between a probe molecule and a set of aligned
target molecules in QSAR. These energies may be added to the study table to form new
columns headed according to the probe type. The new columns may be used as
independent X variables in the generation of a QSAR.
Six probe types are available in this family.
1. H+ probe: This selects a proton as the probe, having a +1 charge and zero van der Waals
radius. It has only electrostatic interactions; non-bonded interactions are not
considered.
2. CH3 probe: This probe has the van der Waals radius of a united CH3 group but
zero charge. The energy of interaction of this probe with a study molecule will
include only non-bonded interactions.
3. Donor / acceptor probe: This is a two-atom probe consisting of oxygen bonded to
hydrogen. The van der Waals radii of the atoms are exactly as they are defined in
the particular force field loaded. The probe is neutral. Depending on the
orientation of this probe, it is capable of behaving as a hydrogen bond donor or
an acceptor.
4. CH3- probe: This is a single-atom probe with the van der Waals radius of a united CH3
group and a charge of -1. The energy of interaction of this probe includes both
non-bonded and electrostatic interactions.
5. Generic probe: This is a generic single-atom probe with a user-specified van der Waals
radius and charge.
6. Other probes: Any multi-atom model may be employed as a probe by specifying it in the
Msi file format.
Statistical methods to evaluate QSAR equation
QSAR analysis uses statistical methods for studying the correlation of biological
activity with the structural and physicochemical properties of candidate molecules. The
different multivariate statistical techniques used to fit the data
include the following:
1. PCA (Principal Component Analysis):
It aims at representing a large amount of multidimensional data by
transforming them into a more intuitive low-dimensional representation. This
method does not create a model, but searches for relationships among the
independent variables. It then creates new variables (the principal components)
which represent most of the information contained in the independent variables.
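The idea of replacing correlated descriptors with a principal component can be sketched in a few lines. The example below centers a small, invented two-descriptor table, builds its covariance matrix and extracts the first principal component by power iteration. This is a teaching sketch under those assumptions (no scaling of descriptors, only the first component, non-constant data), not the procedure of any specific package.

```python
def first_principal_component(rows, iterations=200):
    """Return (loading vector, explained variance fraction, scores)
    for the first principal component of mean-centered data."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Sample covariance matrix of the descriptors.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(m)]
           for i in range(m)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    vec = [1.0] * m
    for _ in range(iterations):
        product = [sum(cov[i][j] * vec[j] for j in range(m)) for i in range(m)]
        norm = sum(v * v for v in product) ** 0.5
        vec = [v / norm for v in product]
    eigenvalue = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(m))
                     for i in range(m))
    total_variance = sum(cov[i][i] for i in range(m))
    # Scores: projection of each centered row onto the loading vector.
    scores = [sum(c[j] * vec[j] for j in range(m)) for c in centered]
    return vec, eigenvalue / total_variance, scores


# Invented table: two strongly correlated descriptors for four molecules.
descriptors = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8]]
loadings, explained, scores = first_principal_component(descriptors)
```

Because the two invented descriptors are nearly collinear, a single new variable carries almost all of their variance.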
2. Cluster Analysis:
The goal of cluster analysis is to partition a data set (typically representing a set of
models in a molecular descriptor property space) into classes or categories
consisting of elements of comparable similarity. The algorithm assumes that
models are represented by points in a multidimensional property space, with the
Euclidean distance between points representing model dissimilarity. The
following are the types in this category:
1. Jarvis-Patrick clustering
2. Variable-length Jarvis-Patrick clustering
3. Relocation clustering
4. Hierarchical Clustering Analysis (HCA)
3. Simple Linear Regression:
It performs a standard linear regression calculation to generate a set of
QSAR equations that includes one equation for each independent variable. It is
good for exploring simple relations between structure and activity.
4. Multiple Linear Regressions (MLR):
This method calculates QSAR equations by performing standard multivariable
regression calculations using multiple variables in a single equation. In this
method the variables are assumed to be independent (not correlated).
5. Stepwise Multiple Linear Regression:
It calculates QSAR equations by adding one variable at a time and
testing each addition for significance; only the significant variables are used in the QSAR
equation. It is useful when the number of variables is large and when the key
descriptors are not known. If the number of variables exceeds the number of structures,
this method should not be used.
6. PLS (Partial Least Squares):
This method carries out regression using latent variables from the
independent and dependent data that lie along their axes of greatest variation and
are most highly correlated. It can be used with more than one dependent variable.
It is typically applied when the independent variables are correlated or the number
of independent variables exceeds the number of observations (rows).
7. GFA (Genetic Function Approximation):
GFA is designed to be applied to problems of function
approximation. When it receives a large number of potential factors influencing a
response, including several powers and other functions of the raw inputs, it should
find the subsets of terms that correlate best with the response.
The central concepts of GFA are simple. The region to be searched is coded into
one or more strings. In the GFA these strings are sets of terms: powers and splines
of the raw inputs. Each string represents a location in the search space. The
algorithm works with a set of these strings called a population. This population is
evolved in a manner that leads it towards the objective of the search. This requires
that a measure of the fitness of each string, corresponding to a model in the GFA, is
available.
Following this, three operations are performed iteratively in succession: selection,
crossover and mutation. Newly added members are screened according to fitness
criteria. In GFA the scoring criteria for models are related to the quality of the
regression fit to the data. The selection probabilities must be re-evaluated each time
a new member is added to the population.
1. Selection: Two parents are selected from the present population with
probabilities proportional to their fitness.
2. Crossover: A crossover splices and rejoins the characters in the two parent
strings to create a new child string. In a conventional genetic algorithm this is
accomplished by selecting a crossover point along each of the parents and
combining the first substring from the first parent with the second substring from
the second parent.
Parents:
X1^2, X2 | 3 X4, X3^3
X1, X3 | X4, X5^2
Child:
X1^2, X2, X4, X5^2
3. Mutation: In a mutation, a single term in a string (a model) is altered.
This is the mechanism for continuously introducing a measure of diversity into
the population, acting to prevent the algorithm from getting stuck in a
suboptimal set of solutions.
In the GFA algorithm, mutations are performed with a user-defined probability
after each crossover. The GFA procedure continues for a specified number of
generations unless convergence occurs in the interim. A generation is a number of
attempted crossovers equal to the size of the population. Convergence is triggered by
a lack of progress in the highest and average scores of the population.
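The three genetic operators described above can be sketched on models stored as lists of term labels. This is an illustrative toy, not the GFA implementation: the term labels, fitness weights and function names are all hypothetical, and the regression scoring that GFA uses to assign fitness is left out.

```python
import random


def select_parent(population, fitness, rng):
    """Pick one model with probability proportional to its fitness."""
    return rng.choices(population, weights=fitness, k=1)[0]


def crossover(parent_a, parent_b, cut_a, cut_b):
    """Join the first part of parent_a to the second part of parent_b."""
    return parent_a[:cut_a] + parent_b[cut_b:]


def mutate(model, candidate_terms, rng):
    """Replace one randomly chosen term with a term not already in the model."""
    child = list(model)
    position = rng.randrange(len(child))
    unused = [term for term in candidate_terms if term not in child]
    child[position] = rng.choice(unused)
    return child


# Hypothetical term labels; "|" in the text marks the crossover points.
parent_a = ["X1^2", "X2", "X3", "X4"]
parent_b = ["X1", "X3", "X4", "X5^2"]
child = crossover(parent_a, parent_b, cut_a=2, cut_b=2)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
all_terms = ["X1", "X1^2", "X2", "X3", "X3^3", "X4", "X5^2", "X6"]
mutant = mutate(child, all_terms, rng)
chosen = select_parent([parent_a, parent_b], fitness=[1.0, 0.0], rng=rng)
```

In a full run these steps would be repeated, with each new model scored by the quality of its regression fit before it is allowed into the population.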
8. GPLS: (Genetic Partial Least Squares):
It is a method derived from GFA and PLS, which are valuable analytical
tools for datasets that have more descriptors than samples. The following
statistical methods are useful in combi chem. and analog builder.
9. FA (Factor Analysis):
It addresses one of the main problems found in PCA, namely that it is not simple to
relate the principal components to molecular properties. All the common factors
have a close relationship to real molecular properties.
10. RP (Recursive Partition):
It identifies the internal representation of classes used by classification
structure activity relationship (CSAR) for deriving recursive partitioning models.
Validation Methods
Once a regression equation is obtained, it is important to determine its
reliability and its significance. Internal validation uses the data set from which the model is
derived and checks for internal consistency. The procedure omits one or more compounds,
derives a new model from the remaining compounds, and uses it
to predict the activities of the molecules that were not included in the new model
set. This is repeated until all compounds have been deleted and predicted once. Internal
validation is less rigorous than external validation. External validation evaluates how well
the equation generalizes. The original data are divided into two groups, the training set
and the test set. The training set is used to derive a model, and the model is used to
predict the activities of the test set members. The following procedures are used to check
that the size of the model is appropriate for the quantity of data available and to
provide some estimate of how well the model can predict activity for new molecules:
1. Cross Validation: This process repeats the regression many times on subsets of the data.
Usually each molecule is left out in turn, and r2 is computed using the predicted values of
the missing molecules (cross-validated r2).
2. Randomization Test: Even with a large number of observations and a small number of
terms, an equation can still have very poor predictive power. This can come about if the
observations are not sufficiently independent of each other.
Interpreting QSAR equation
QSAR is used for predicting the activities of as yet untested (and possibly not yet
synthesized) molecules. The predictive ability of a QSAR is generally more accurate for
interpolative predictions (for compounds that have parameters within the range of those
considered in the data set) than for extrapolative predictions (for compounds that are
outside the range). A QSAR equation also provides insights into the mechanism of the
process being studied.
1. Square of Correlation Coefficient (r2): If the x (independent) and y (dependent) variables
are highly correlated, there is considerable information in x and y that is redundant. The
degree of correlation is measured by the squared correlation coefficient (r2).
2. Cross Validated r2 (termed q2 or XVr2): r2 can be computed using cross validation
methods (XVr2) or bootstrap methods (BSr2). It is also the fraction of the variance
explained by the model. Cross validated r2 is always somewhat lower and often much
lower than r2.
3. PRESS (Predictive Error Sum of Squares): The sum, over all compounds, of the
squared differences between the actual and the predicted values of the dependent variable.
The intensity of the cross validation process is controlled by selecting the number
of groups or the number of times the cross validation step is to be carried out while
predicting all rows (at each stage of model development).
Procedure
Fig 19: Flowchart of QSAR procedure
Calculate molecular properties
The Calculate Molecular Properties protocol will calculate many properties or
perform basic statistical and correlation analysis of the numeric properties as requested.
To set up a Calculate Molecular Properties protocol:
1. Apply the force field to the molecules and load the QSAR / Calculate
Molecular Properties protocol from the Protocols Explorer. The parameters
are displayed in the Parameters Explorer.
2. On the Parameters Explorer, click in the cell for the Input Ligands parameter
and click the button to specify the ligand source on the Specify Ligands dialog.
On the dialog, select all ligands from a Table Browser, a 3D Window, or a file.
3. Select the properties to calculate by clicking the button in a cell for the
Molecular Properties, Semi empirical QM descriptors, or Density Functional QM
descriptors, and follow the instructions in the popup dialog window.
The Create Genetic Function Approximation Model protocol builds a Genetic Function
Approximation model for a dependent property using the selected molecular descriptors.
To set up a Create Genetic Function Approximation Model protocol:
1. Load the QSAR / Create Genetic Function Approximation Model protocol from
the Protocols Explorer. The parameters are displayed in the Parameters Explorer.
2. On the Parameters Explorer, click in the cell for the Input Ligands parameter
and click the button to specify the ligand source on the Specify Ligands dialog.
On the dialog, select all ligands from a Table Browser, a 3D Window, or a file.
3. Set the desired model name using the Model Name parameter. Once created,
this model will appear under the other category of the Molecular Properties
parameter in the Calculate Molecular Properties protocol and can be used to
compute the property for future ligands.
4. Set the initial equation length and remaining parameters as desired. Parameters
presented in red are required.
5.1. LIGAND FIT
The docking score is the negative value of the non-bonded intermolecular energy. If the
ligand atom has a partial charge on it, the electrostatic grid is used to estimate the
electrostatic energy. If it is a hydrogen atom, the hydrogen grid is used for the van der
Waals energy.
Fig1: This figure is showing the binding site of the protein, which is defined for the
ligand fit.
Fig2: Molecule scafold4 molecule1 (high active), which has been subjected to ligand
fit, showing its interactions with the amino acids of 2ZDZ.
Fig3: Molecule 2 (low active), which has been subjected to ligand fit, showing its
interactions with the amino acids of 2ZDZ.
Table showing top 10 Dock scores of high active molecule