STRUCTURE–WATER SOLUBILITY MODELING OF ALIPHATIC ALCOHOLS USING THE WEIGHTED PATH NUMBERS* D. AMIC ´ a , S.C. BASAK b , B. LUC ˇ IC ´ c , S. NIKOLIC ´ c, and N. TRINAJSTIC ´ c a Faculty of Agriculture, The Josip Juraj Strossmayer University, HR-31001 Osijek, The Republic of Croatia; b Natural Resources Research Institute, University of Minnesota, Duluth, MN 55811, USA; c The Rugjer Bos ˇkovic ´ Institute, Department of Physical Chemistry, Bijenic ˇka 54, P.O. Box 180 HR-10002 Zagreb, The Republic of Croatia (Received 17 September 2000; In final form 25 April 2001) The structure–water solubility modeling of aliphatic alcohols was performed using the weighted path numbers. Aliphatic alcohols were represented by weighted trees. The weight of the edge representing C – O bond was taken to be x, while the weights of C – C bonds were taken to be all equal to one. Four (one-, two-, three- and four-descriptor) models with weighted path numbers were considered. They were compared with models based on surface areas of aliphatic alcohols, models based on the vertex-connectivity indices for the corresponding alkanes, models based on orthogonal valence vertex-connectivity indices, models based on valence vertex- and edge-connectivity indices with optimum exponents and models based on weighted line graphs. The main result of this comparative study is that the models based on two, three, or four weighted path numbers posses the best statistical characteristics of all models considered in this paper. In addition, the predictive performance of these models was also tested using the training/test set partition. Very good and stable predictions for 19 test set compounds were obtained. For this data set we find, in all performed tests of models, that optimum x values are in the range 3.0 – 4.0. This result supports views about the potential of the weighted path numbers for deriving high quality structure–property models. Keywords: Aliphatic alcohols; Structure–property modeling; Water solubility; Weighted paths INTRODUCTION In this paper, we report on modeling the water solubility of aliphatic alcohols using recently proposed molecular descriptors named the weighted path numbers [1,2]. We are interested in the physicochemical and biological properties of alcohols for some time because they are toxic materials and thus represent dangerous environmental pollutants [3–5]. The toxic action of alcohols on the higher organisms, presumably through the phenomenon of narcosis, depends on their solubility in water. The lower alcohols: methanol, ethanol, and propanol mix with water in any ratio, while this is not the case with other aliphatic alcohols, especially ISSN 1062-936X q 2002 Taylor & Francis Ltd DOI: 10.1080/10629360290002776 *Presented at the 9th International Workshop on Quantitative Structure–Activity Relationships in Environmental Sciences (QSAR 2000), September 16–20, 2000, Bourgas, Bulgaria Corresponding author. E-mail: [email protected]SAR and QSAR in Environmental Research, 2002 Vol. 13 (2), pp. 281–295
15
Embed
Structure-water solubility modeling of aliphatic alcohols using the weighted path numbers
This document is posted to help you gain knowledge. Please leave a comment to let me know what you think about it! Share it to your friends and learn new things together.
Transcript
STRUCTURE–WATER SOLUBILITY MODELING OFALIPHATIC ALCOHOLS USING THE WEIGHTED PATH
NUMBERS*
D. AMICa, S.C. BASAKb, B. LUCICc, S. NIKOLICc,† and N. TRINAJSTICc
aFaculty of Agriculture, The Josip Juraj Strossmayer University, HR-31001 Osijek, The Republic ofCroatia; bNatural Resources Research Institute, University of Minnesota, Duluth, MN 55811, USA;
cThe Rugjer Boskovic Institute, Department of Physical Chemistry, Bijenicka 54, P.O. Box 180HR-10002 Zagreb, The Republic of Croatia
(Received 17 September 2000; In final form 25 April 2001)
The structure–water solubility modeling of aliphatic alcohols was performed using the weighted path numbers.Aliphatic alcohols were represented by weighted trees. The weight of the edge representing C–O bond was taken tobe x, while the weights of C–C bonds were taken to be all equal to one. Four (one-, two-, three- and four-descriptor)models with weighted path numbers were considered. They were compared with models based on surface areas ofaliphatic alcohols, models based on the vertex-connectivity indices for the corresponding alkanes, models based onorthogonal valence vertex-connectivity indices, models based on valence vertex- and edge-connectivity indices withoptimum exponents and models based on weighted line graphs. The main result of this comparative study is that themodels based on two, three, or four weighted path numbers posses the best statistical characteristics of all modelsconsidered in this paper. In addition, the predictive performance of these models was also tested using thetraining/test set partition. Very good and stable predictions for 19 test set compounds were obtained. For this data setwe find, in all performed tests of models, that optimum x values are in the range 3.0–4.0. This result supports viewsabout the potential of the weighted path numbers for deriving high quality structure–property models.
Keywords: Aliphatic alcohols; Structure–property modeling; Water solubility; Weighted paths
INTRODUCTION
In this paper, we report on modeling the water solubility of aliphatic alcohols using recently
proposed molecular descriptors named the weighted path numbers [1,2]. We are interested in
the physicochemical and biological properties of alcohols for some time because they are
toxic materials and thus represent dangerous environmental pollutants [3–5]. The toxic
action of alcohols on the higher organisms, presumably through the phenomenon of narcosis,
depends on their solubility in water. The lower alcohols: methanol, ethanol, and propanol
mix with water in any ratio, while this is not the case with other aliphatic alcohols, especially
ISSN 1062-936X q 2002 Taylor & Francis Ltd
DOI: 10.1080/10629360290002776
*Presented at the 9th International Workshop on Quantitative Structure–Activity Relationships in EnvironmentalSciences (QSAR 2000), September 16–20, 2000, Bourgas, Bulgaria
SAR and QSAR in Environmental Research, 2002 Vol. 13 (2), pp. 281–295
higher ones. Alcohols are also technologically important materials and are used in the
manufacture of a number of products. And again their usefulness in this respect also depends,
among other things, on their solubility in water.
Our aim in this paper is to investigate how structure–water solubility models based on
weighted path numbers compare to models based on descriptors with prescribed values for
vertex and edge weights [6,7]. Already almost fifty years ago Platt [8] advocated the use of
path numbers in the structure–property modeling. However, the significance of Platt’s
proposal was overlooked until the resurrection of the chemical graph theory in early
seventies [9].
It should be pointed out that there are more than 1800 (empirical and theoretical)
molecular descriptors available [10] for use in QSAR/QSPR modeling, but we are still
lacking good descriptors for many molecular properties of interest [11,12]. It is especially
difficult to unearth good descriptors to handle diverse heterosystems [13]. This was the
motivation behind the effort by Randic and his coworkers when they introduced the concept
of optimal descriptors to treat molecules with structural features such as double bonds,
heteroatoms and heterobonds [1,2,14–19]. Of course, there are a number of other authors
who are trying to do the same, that is, to discover the best possible descriptors for modeling a
given molecular property [7,20–24].
WEIGHTED PATH NUMBERS
Apliphatic alcohols will be represented by weighted graphs [25]. As an example, a weighted
graph wG representing the hydrogen-depleted skeleton of 3,3-dimethyl-1-butanol is shown in
Fig. 1.
A path Pl(l is the length of the path) is a sequence of adjacent edges, which do not pass
through the same vertex more than once [26]. Note that P1 is equal to the number of edges in
FIGURE 1 A labeled weighted graph wG representing the hydrogen-depleted skeleton of 3,3-dimethyl-1-butanol.Black dot denotes the position of oxygen and x stands for the weight of C–O bond. Weights of C–C bonds are takento be one.
D. AMIC et al.282
the graph. A weighted path contains weighted edges. In the case of aliphatic alcohols, the
weight of the edge representing C–O bond is taken to be x, while the weights of edges
representing C–C bonds are all equal to one (see Fig. 1). The count of paths and weighted
paths in a weighted graph wG, given in Fig. 1, is illustrated in Table I. The length of a path
containing the edge with weight x is simply denoted by x.
The weighted path numbers for 3,3-dimethyl-1-butanol are equal to the sums of weighted
paths over all vertices in a graph divided by 2, since each vertex is counted twice. If we
assume x ¼ 1; then the count ðP1 ¼ 6; P2 ¼ 8; P3 ¼ 4; P4 ¼ 3Þ represents paths in
2,2-dimethylpentane. If x is left undefined, then it can be viewed as a variable to be adjusted
from regression analysis to obtain the smallest standard error of estimate for a regression that
is considered. This is the idea on which the use of variable topological indices is based
[1,2,14–19]. However, so far the use of variable descriptors was very little exploited in
QSAR/QSPR modeling [27,28].
RESULTS AND DISCUSSION
In Table II, we give the weighted path numbers and the experimental [29,30] and calculated
water solubilities of 54 aliphatic alcohols in terms of their natural logarithms (ln SOL)
(Table II).
We searched for the optimum value of the weight x by varying it in the range 1.00–5.00.
The standard error of estimate (s ) was used as a criterion for the selection of the optimum
value of x. We considered models with one descriptor (P1), two descriptors (P1, P2), three
descriptors (P1, P2, P3) and four descriptors (P1, P2, P3, P4).
Model With One Descriptor
When P1 is used as a single descriptor the regression does not depend on x, because x is an
additive constant for all paths of length one. Therefore, P1 is not an adequate weighted path
number for modeling the water solubilities of aliphatic alcohols as it was expected since P1
only reflects the molecular size having the same value for all isomers. However, for
completeness we give below the structure-water solubility for P1 and x ¼ 0 :
ln SOL ¼ 5:016ð^0:257Þ2 1:367ð^0:038ÞP1 ð1Þ
n ¼ 54 r ¼ 0:981 s ¼ 0:693 F ¼ 1314
where r is the correlation coefficient, s the standard error of estimate and F the Fisher ratio.
TABLE I The count of paths and weighted paths in a labeled weighted graph wG (see Fig. 1) representing thehydrogen-depleted skeleton of 3,3-dimethyl-1-butanol
Vertex P1 P2 P3 P4
1 1 þ x 1 32 2 3 þ x3 4 1 x4 1 3 1 x5 1 3 1 x6 1 3 1 x7 x x x 3x
Weighted path numbers 5 þ x 7 þ x 3 þ x 3x
STRUCTURE–WATER SOLUBILITY MODELING 283
TABLE II The weighted path numbers, experimental, calculated (fitted and cross-validated), and predicted water solubilities of 54 aliphatic alcohols using the best two-descriptor model
No. Compound P1 P2 P3 P4
ln SOL
exp. calc-FIT calc-CV calc/prd
1 2-Methyl-1-propanol 3 þ x 3 þ x 2x 0 0.0227 0.093 0.099 0.0882 2-Butanol 3 þ x 2 þ 2x 1 þ x 0 0.0658 0.611 0.643 0.605*3 3-Methyl-1-butanol 4 þ x 4 þ x 2 þ x 2x 21.1680 21.205 21.208 21.2044 2-Methyl-1-butanol 4 þ x 4 þ x 2 þ 2x x 21.0584 21.205 21.215 21.204*5 2-Pentanol 4 þ x 3 þ 2x 2 þ x 1 þ x 20.6349 20.687 20.689 20.6866 3-Pentanol 4 þ x 3 þ 2x 2 þ 2x 1 20.4861 20.687 20.695 20.6867 3-Methyl-2-butanol 4 þ x 4 þ 2x 2 þ 2x 0 20.4050 20.457 20.459 20.461*8 2-Methyl-2-butanol 4 þ x 4 þ 3x 2 þ x 0 0.3386 0.291 0.288 0.2829 3-Hexanol 5 þ x 4 þ 2x 3 þ 2x 2 þ x 21.8326 21.986 21.990 21.977
10 3-Methyl-3-pentanol 5 þ x 5 þ 3x 4 þ 2x 1 20.8301 21.007 21.017 21.010*11 2-Methyl-2-pentanol 5 þ x 5 þ 3x 3 þ x 2 þ x 21.1178 21.007 21.001 21.0112 2-Methyl-3-pentanol 5 þ x 5 þ 2x 3 þ 3x 2 21.6094 21.756 21.759 21.75213 3-Methyl-2-pentanol 5 þ x 5 þ 2x 4 þ 2x 1 þ x 21.6399 21.756 21.758 21.752*14 2,3-Dimethyl-2-butanol 5 þ x 6 þ 3x 4 þ 2x 0 20.8510 20.777 20.771 20.78515 3,3-Dimethyl-1-butanol 5 þ x 7 þ x 3 þ x 3x 22.5903 22.043 22.029 22.04516 3,3-Dimethyl-2-butanol 5 þ x 7 þ 2x 3 þ 3x 0 21.4106 21.295 21.291 21.302*17 4-Methyl-1-pentanol 5 þ x 5 þ x 3 þ x 2 þ x 22.2828 22.504 22.514 22.49518 4-Methyl-2-pentanol 5 þ x 5 þ 2x 3 þ x 2 þ 2x 21.8140 21.756 21.754 21.75219 2-Ethyl-1-butanol 5 þ x 5 þ x 4 þ 2x 1 þ 2x 22.7871 22.504 22.491 22.495*20 2-Methyl-2-hexanol 6 þ x 6 þ 3x 4 þ x 3 þ x 22.4734 22.306 22.296 22.30121 3-Methyl-3-hexanol 6 þ x 6 þ 3x 5 þ 2x 3 þ x 22.2634 22.306 22.308 22.30122 3-Ethyl-3-pentanol 6 þ x 6 þ 3x 6 þ 3x 3 21.9173 22.306 22.328 22.301*23 2,3-Dimethyl-2-pentanol 6 þ x 7 þ 3x 6 þ 2x 2 þ x 22.0025 22.075 22.081 22.07624 2,3-Dimethyl-3-pentanol 6 þ x 7 þ 3x 6 þ 3x 2 21.9379 22.075 22.086 22.07625 2,4-Dimethyl-2-pentanol 6 þ x 7 þ 3x 4 þ x 4 þ 2x 22.1456 22.075 22.070 22.076*26 2,4-Dimethyl-3-pentanol 6 þ x 7 þ 2x 4 þ 4x 4 22.8018 22.824 22.824 22.81927 2,2-Dimethyl-3-pentanol 6 þ x 8 þ 2x 4 þ 4x 3 22.6437 22.593 22.592 22.59428 3-Heptanol 6 þ x 5 þ 2x 4 þ 2x 3 þ x 23.1942 23.284 23.286 23.269*29 4-Heptanol 6 þ x 5 þ 2x 4 þ 2x 3 þ 2x 23.1966 23.284 23.286 23.26930 2,2,3-Trimethyl-3-pentanol 7 þ x 10 þ 3x 8 þ 4x 3 22.9318 22.913 22.910 22.91731 2-Octanol 7 þ x 6 þ 2x 5 þ x 4 þ x 24.7560 24.583 24.579 24.560*32 2-Ethyl-1-hexanol 7 þ x 7 þ x 6 þ 2x 4 þ 2x 24.9967 25.101 25.104 25.07833 2-Nonanol 8 þ x 7 þ 2x 6 þ x 5 þ x 26.3200 25.881 25.869 25.85234 3-Nonanol 8 þ x 7 þ 2x 6 þ 2x 5 þ x 26.1193 25.881 25.875 25.852*35 4-Nonanol 8 þ x 7 þ 2x 6 þ 2x 5 þ 2x 25.9522 25.881 25.879 25.852
D.
AM
ICet
al.
28
4
Table II – continued
No. Compound P1 P2 P3 P4 ln SOL
exp. calc-FIT calc-CV calc/prd
36 5-Nonanol 8 þ x 7 þ 2x 6 þ 2x 5 þ 2x 25.7446 25.881 25.885 25.85237 2,6-Dimethyl-4-heptanol 8 þ x 9 þ 2x 6 þ 2x 5 þ 4x 25.7764 25.421 25.407 25.401*38 3,5-Dimethyl-4-heptanol 8 þ x 9 þ 2x 8 þ 4x 5 þ 2x 25.2983 25.421 25.425 25.40139 2,2-Diethyl-1-pentanol 8 þ x 10 þ x 10 þ 3x 6 þ 3x 25.5728 25.939 25.949 25.91940 7-Methyl-1-octanol 8 þ x 8 þ x 6 þ x 5 þ x 25.7446 26.399 26.424 26.369*41 3,5,5-Trimethyl-1-hexanol 8 þ x 11 þ x 7 þ x 7 þ 2x 25.7699 25.708 25.707 25.69442 1-Dodecanol 11 þ x 10 þ x 9 þ x 8 þ x 210.6800 210.525 210.509 210.46843 1-Butanol 3 þ x 2 þ x 1 þ x 0 0.0953 20.137 20.163 20.137*44 1-Pentanol 4 þ x 3 þ x 2 þ x 1 þ x 21.3471 21.436 21.443 21.42945 2,2-Dimethyl-1-propanol 4 þ x 6 þ x 3x 0 20.6463 20.745 20.749 20.75446 1-Hexanol 5 þ x 4 þ x 3 þ x 2 þ x 22.7181 22.734 22.735 22.720*47 2-Hexanol 5 þ x 4 þ 2x 3 þ x 2 þ x 21.9951 21.986 21.986 21.97748 1-Heptanol 6 þ x 5 þ x 4 þ x 3 þ x 24.0745 24.033 24.030 24.01249 1-Octanol 7 þ x 6 þ x 5 þ x 4 þ x 25.4015 25.331 25.328 25.303*50 1-Nonanol 8 þ x 7 þ x 6 þ x 5 þ x 26.9078 26.629 26.615 26.59451 1-Decanol 9 þ x 8 þ x 7 þ x 6 þ x 28.2208 27.9280 27.910 27.886*52 1-Tetradecanol 13 þ x 12 þ x 11 þ x 10 þ x 212.7720 213.122 213.186 213.05153 1-Pentadecanol 14 þ x 13 þ x 12 þ x 11 þ x 214.6140 214.420 214.373 214.342*54 1-Hexadecanol 15 þ x 14 þ x 13 þ x 12 þ x 215.5875 215.718 215.761 215.634
*Prediction for 19 test set molecules is indicated by asterisks (last column). Remaining 35 molecules are used for training and for model generation. Optimum x value ðx ¼ 3:3Þ for the two-descriptor model (based onP1 and P2) is selected according to the fitted statistical parameters on the training set (35 compounds).
ST
RU
CT
UR
E–
WA
TE
RS
OL
UB
ILIT
YM
OD
EL
ING
285
Best Model With Two Descriptors
In Table III, we give the standard error of estimate for various values of x and this is shown in
Fig. 2.
The best two-descriptor model is obtained for x ¼ 3:25 :
ln SOL ¼ 8:209ð^0:136Þ2 1:529ð^0:015ÞP1 þ 0:230ð^0:011ÞP2 ð2Þ
n ¼ 54 r ¼ 0:9980 s ¼ 0:227 F ¼ 6363
This model is much improved over the above model. The standard error of estimate has
dramatically decreased from s ¼ 0:693 to s ¼ 0:227:
Best Model With Three Descriptors
In Table IV, we give the standard error of estimate for various values of x and this is
graphically shown in Fig. 3.
The best three-descriptor model is obtained for x ¼ 3:75 :
ln SOL ¼ 8:880ð^0:136Þ2 1:515ð^0:014ÞP1 þ 0:190ð^0:011ÞP2
þ 0:024ð^0:009ÞP3 ð3Þ
n ¼ 54 r ¼ 0:9982 s ¼ 0:215 F ¼ 4698
This model is only slightly better than model (2).
Best Model With Four Descriptors
In Table V, we give the standard error of estimate for various values of x and this is shown in
Fig. 4.
The best four-descriptor model is obtained for x ¼ 3:65 :
ln SOL ¼ 8:701ð^0:147Þ2 1:506ð^0:020ÞP1 þ 0:193ð^0:011ÞP2
þ 0:023ð^0:009ÞP3 þ 0:008ð^0:009ÞP4 ð4Þ
n ¼ 54 r ¼ 0:9983 s ¼ 0:216 F ¼ 3508
This model is comparable to the model (3).
TABLE III The standard error ofestimate (s ) for various values of xwhen two weighted paths (P1, P2) areused as descriptors. Asterisks denote theoptimum values of x and s
Among the listed models (2)–(4), there is not much difference. Therefore, we selected,
according to the Ockham’s Razor Principle [31], a model with least number of parameters,
that is, the two-descriptor model (2) for calculating fitted values of ln SOL. They are reported
in Table II under heading (ln SOL)calc-FIT.
We also used the leave-one-out cross-validation technique as a criterion for checking the
quality of the model (2) [32]. Since the obtained cross-validated statistical parameters
ðrcv ¼ 0:998; scv ¼ 0:237Þ are close to the fitted values, the good quality of model (2) is thus
confirmed.
FIGURE 2 A plot of the path weight x against the standard error of estimate s for two descriptors (P1, P2).
TABLE IV The standard error ofestimate (s ) for various values of xwhen three weighted paths (P1, P2, P3)are used as descriptors. Asterisks denotethe optimum values of x and s
The cross-validated values are reported in Table II under heading (ln SOL)calc-CV. A plot
(ln SOL)exp vs. (ln SOL)calc-CV is shown in Fig. 5.
Predictive Performance of the Weighted-Paths Based Models
In addition, following suggestions of reviewers, predictive performance of the models
containing two, three and four weighted paths were examined. For the training (selection of
optimal x ) 35 randomly selected molecules were used, and remaining 19 molecules were
FIGURE 3 A plot of the path weight x against the standard error of estimate s for three descriptors (P1, P2, P3).
TABLE V The standard errorof estimate (s ) for various valuesof x when four weighted paths(P1, P2, P3, P4) are used asdescriptors. Asterisks denote theoptimum values of x and s
In the first column weighted paths involved in the models are given. x is the weight of the edge representing C–O bond. rtrain and rtest
are correlation coefficients calculated on 35 molecules from the training set and on 19 molecules from the test set, respectively. strain
and stest are standard error of estimate calculated on 35 molecules from the training set and standard error of prediction calculated on19 molecules from the test set, respectively. Prediction for the test set molecules is made using model developed on the training set.
STRUCTURE–WATER SOLUBILITY MODELING 289
used for testing the predictive model quality. Test set molecules are indicated by asterisks in
the last column in Table II. Statistical parameters of the models obtained for this training/test
set partition are given in Table VI. The best two-descriptor model developed for 35 training
set molecules involved P1 and P2 calculated for x ¼ 3:3: For the training set we obtained
s ¼ 0:181: Standard error of prediction for the test set compounds is 0.281.
The best three- and four-descriptor models involved paths calculated for x ¼ 4:0: Three-
descriptor model contains P1, P2, and P3 and gives s ¼ 0:160 on the training set. The
standard error of prediction on the test set compounds is s ¼ 0:276: The best four descriptor
models (with P1, P2, P3, P4) is also obtained for x ¼ 4:0 and produces s ¼ 0:160 in training
and s ¼ 0:275 in prediction on the test set. We see that inclusion of P3 and P4 does not
produce significant improvement of the models quality, and we take the two-descriptor
model as the best one. This particular selection is in accordance with the selection of the best
model based on the cross-validated parameters. Moreover, there is only a small difference
between x values selected according to the best fitted or cross-validated parameters, and
those selected according the best predictive parameters.
In order to see good predictive quality of the best two-descriptor model containing P1 and
P2, we give scatter plot (ln SOL)exp vs. (ln SOL)calc for 35 training set compounds (black
circles), and, in the same plot, (ln SOL)exp vs. (ln SOL)pred for 19 test set compounds (open
circles) (Fig. 6).
Comparison With Models Reported in the Literature
We list below several structure–water solubility models of aliphatic alcohols based on
different molecular descriptors that are reported last 25 years in the literature.
Models Based on Surface Areas of Aliphatic Alcohols
An early attempt to model water solubility of aliphatic alcohols is by Amidon et al. [29].
These authors produced three models between the water solubility of aliphatic alcohols and
FIGURE 6 A plot of (ln SOL)exp against (ln SOL)calc for 35 molecules from the training set (black circles) and(ln SOL)exp against (ln SOL)pred for 19 molecules from the test set (open circles).
D. AMIC et al.290
their surface areas. They adopted the definition of molecular surface area from Hermann
[33]: A molecule is considered as a collection of spheres with each radius located at the
nuclear center. To each radius on the solute molecule (aliphatic alcohol), a radius for the
solvent (water) is added to give a surface. This approach has the convenient property of
eliminating from the total surface area of a molecule those areas that are not exposed or
accessible to the solvent. Another advantage of Hermann’s procedure is that it allows
computation of the individual atom contributions to the total surface area. This permits
division of the total surface area of a molecule into group contributions, e.g. hydrocarbon and
hydroxyl group parts of alcohols, and the group contribution to the solubility can be
estimated. Standard bond length and bond angles were used for all aliphatic alcohols
considered. It should be also pointed out that long time ago (in 1925) one of the first attempts
to relate a molecular property to critical structural features is due to Langmuir [34] who
related intermolecular interactions in the liquid state to the surface energy. Recently (in
1999) Mebane et al. [35] pointed out that the surface area is an important descriptor to use in
the structure–property modeling.
Amidon et al. [29] have considered only 51 aliphatic alcohols, omitting those which are
solid at the room temperature. Their first model relates ln SOL and the total surface area
(TSA) in A2:
ln SOL ¼ 11:78 2 0:043 TSA ð5Þ
n ¼ 51 r ¼ 0:974 s ¼ 0:499
Their second model relates ln SOL with the surface area of the hydrocarbon part of the
aliphatic alcohol. Amidon et al. [29] calculated the hydrocarbon surface area (HYSA) by
substracting the hydyroxyl group surface area (OHSA) from TSA: HYSA ¼ TSA 2 OHSA:They obtained the following not particularly good model:
ln SOL ¼ 8:94 2 0:0396 HYSA ð6Þ
n ¼ 54 r ¼ 0:94 s ¼ 0:706
The best empirical model of Amidon et al. [29] is a two-descriptor model that relates
ln SOL with the HYSA and the OHSA:
ln SOL ¼ 12:41 2 0:043 HYSA 2 0:060 OHSA ð7Þ
n ¼ 51 r ¼ 0:978 s ¼ 0:462
The TSA and HYSA are highly intercorrelated quantities, the value of r being 0.994.
Models (5) and (7) are somewhat better than the one-descriptor model (1), while models (2)–
(4) from above possess better statistical characteristics.
Kier–Hall–Murray Models
Hall, Kier and Murray [36] were first to study the structure–water solubility of aliphatic
alcohols using topological indices. Their first model has been derived in 1975 before Kier
and Hall [37] introduced the valence vertex-connectivity index. Therefore, Hall, Kier and
Murray [36] used the vertex-connectivity index x in the original Randic’s formulation for
hydrocarbons [38]. Hence, Hall, Kier and Murray [36] instead of computing the vertex-
connectivity index for aliphatic alcohols computed x for the related alkanes. In spite of this,
they obtained a decent model comparable to models based on surface areas of aliphatic
STRUCTURE–WATER SOLUBILITY MODELING 291
alcohols and to our model (1):
ln SOL ¼ 6:702 2 2:666x ð8Þ
n ¼ 51 r ¼ 0:978 s ¼ 0:455
When they used the empirical parameter cOH that takes care about the environment of the
OH group, that is, whether we deal with primary, secondary or tertiary alcohols, the two-
descriptor model considerable improves:
ln SOL ¼ 9:204 2 2:630x 2 4:390cOH ð9Þ
n ¼ 50 r ¼ 0:991 s ¼ 0:289
Model Based on Orthogonal Descriptors
A very good two-descriptor structure–water solubility model for aliphatic alcohols can be
obtained using ordered orthogonal descriptors [39–41] which includes explicitly the OH
group [4]:
ln SOL ¼ 23:666ð^0:037Þ2 2:689ð^0:479ÞVV 2 6:336ð^0:479ÞVOH ð10Þ
n ¼ 54 r ¼ 0:997 s ¼ 0:273 F ¼ 4479
where the symbol V stands for orthogonal descriptors. The valence vertex-connectivity indices
are used as the basis set. The structure–water solubility model for aliphatic alcohols based on
four ordered orthogonal descriptors including the OH group is a very good model, better than any
other model from above except three models (2)–(4) based on the weighted path numbers:
ln SOL ¼ 23:666ð^0:033Þ2 2:689ð^0:025Þ 1VV 2 1:225ð^0:116Þ 3VV
2 0:650ð^0:324Þ 4VV þ 5:747ð^0:510Þ VOH ð11Þ
n ¼ 54 r ¼ 0:998 s ¼ 0:240 F ¼ 2897
Model Based on the Valence Vertex-Connectivity Index With Optimum Exponent
The following is the best structure–water solubility model for aliphatic alcohols based on the
valence vertex-connectivity index with the optimum value of the exponent [42] equal to
20.425 [43]:
ln SOL ¼ 5:864ð^0:210Þ2 2:418ð^0:050ÞðxVÞ½20:425� ð12Þ
n ¼ 54 r ¼ 0:989 s ¼ 0:526 F ¼ 2320
where (x V)[20.425] is a shorthand notation for the valence vertex-connectivity indices
calculated using the value of 20.425 for the exponent in Eq. (12). There is a slight
improvement over the structure–water solubility model when the valence vertex-
connectivity indices are computed using the standard value for the exponent, that is, 20.5:
ln SOL ¼ 5:814ð^0:211Þ2 2:671ð^0:056ÞðxVÞ½20:5� ð13Þ
n ¼ 54 r ¼ 0:989 s ¼ 0:531 F ¼ 2279
D. AMIC et al.292
Model Based on the Valence Edge-Connectivity Index With Optimum Exponent
The same approach as above is applied to the use of the valence edge-connectivity index 1 V
[44,45]. The best structure–water solubility model for aliphatic alcohols based on the
valence edge-connectivity index with the optimum value of the exponent equal to 20.675 is
[43]:
ln SOL ¼ 5:527ð^0:167Þ2 3:306ð^0:056Þð1VÞ½20:675� ð14Þ
n ¼ 54 r ¼ 0:992 s ¼ 0:434 F ¼ 3426
where (1 V)[20.675] is a shorthand notation for the valence edge-connectivity indices
calculated using the value of 20.675 for the exponent in (14). This is a better model than the
structure–water solubility model based on the valence edge-connectivity indices with the
standard value for the exponent, that is, 20.5:
ln SOL ¼ 6:210ð^0:282Þ2 2:727ð^0:074Þð1VÞ½20:5� ð15Þ
n ¼ 54 r ¼ 0:982 s ¼ 0:680 F ¼ 1368
Model (14) is better than models (1), (6), (12), (13) and (15) and is comparable to models
(5), (7) and (8). It falls behind two-, three- and four-descriptor models based on weighted
paths numbers, the Kier–Hall–Murray two-descriptor model and models based on
orthogonal descriptors.
Model Based on Weighted Line Graphs
The same approach as in last two previous subsections is also applied to the use of the
valence edge-connectivity index computed for the weighted line graphs [25]. The best
structure–water solubility model for aliphatic alcohols, represented by weighted line graphs,
based on the valence edge-connectivity index with the optimum value of the exponent equal
to 20.85 is [43]:
ln SOL ¼ 4:899ð^0:247Þ2 4:078ð^0:109Þ½1VðLÞ�½20:85� ð16Þ
n ¼ 54 r ¼ 0:981 s ¼ 0:674 F ¼ 1392
where [1 V(L)][20.85] is a shorthand notation for the valence edge-connectivity indices of line
graphs L, calculated using the value of 20.85 for the exponent in Eq. (16). This is a much
better model than the structure–water solubility model based on the standard value of the
exponent, that is, 20.50:
ln SOL ¼ 4:367ð^1:106Þ2 1:942ð^0:256Þ½eVðLÞ�½20:50� ð17Þ
n ¼ 54 r ¼ 0:725 s ¼ 2:447 F ¼ 58
Model (17) is the worst model of all models listed here, while the model (16) is
comparable to models (1), (6) and (15).
CONCLUDING REMARKS
In the present work, we compared four structure–water solubility models of aliphatic
alcohols based on the weighted path numbers among themselves and with a number of
STRUCTURE–WATER SOLUBILITY MODELING 293
models based on a variety of descriptors. The comparison was based on the correlation
coefficient and standard error of estimate. The result of this comparative study is that two-,
three- and four-descriptor models based on the weighted path numbers are superior to all
other models reviewed here. We have found that the optimal values of the weight of the edge
representing C–O bond (x ) for the studied data set is in the range 3.0–4.0, instead of x ¼ 1:0in the case when non-weighted paths were used. The high quality of these models suggests
that they can be used with confidence for predicting the water solubilities of yet unknown
higher aliphatic alcohols, if such a need arises.
Acknowledgements
This work was supported in part by the Ministry of Science and Technology of the Republic
of Croatia via Grants 00980606 (BL, SN and NT), 106N407 (BL) and 079301 (DA). A part
of this work was completed during the stay of SN and NT at the Center for Water and the
Environment, Natural Resources Research Institute (NRRI), University of Minnesota,
Duluth, Minnesota 55811, USA. SCB and SN and NT (during their stay at NRRI) were
supported, in part, by Grants F49620-94-1-0401 and F49620-96-1-0330 from the United
States Air Force. We thank referees for their helpful comments.
References
[1] Randic, M. and Basak, S.C. (1999) “Optimal molecular descriptors based on weighted path numbers”, J. Chem.Inf. Comput. Sci. 39, 261–266.
[2] Randic, M. and Pompe, M. (1999) “On characterization of the CC double bond in alkenes”, SAR QSAREnviron. Res. 10, 451–471.
[3] Roy, A.B., Basak, S.C., Harriss, D.K. and Magnuson, V.R. (1984) “Neighborhood complexities and symmetryof chemical graphs and their biological applications”, In: Avula, X.J.R., Kalman, R.E., Lipais, A.I. and Rodin,E.Y., eds, Mathematical Modelling in Science and Technology (Pergamon Press, New York), pp. 745–775.
[4] Lucic, B., Nikolic, S., Trinajstic, N., Juric, A. and Mihalic, Z. (1995) “A structure–property study of thesolubility of aliphatic alcohols in water”, Croat. Chem. Acta 68, 417–434.
[5] Nikolic, S. and Trinajstic, N. (1998) “Modeling the aqueous solubility of aliphatic alcohols”, SAR QSAREnviron. Res. 9, 117–126.
[6] Barysz, M., Jashari, G., Lall, R.S., Srivastava, V.K. and Trinajstic, N. (1983) “On the distance matrix containingheteroatoms”, In: King, R.B., ed, Chemical Applications of Topology and Graph Theory (Elsevier,Amsterdam), pp. 222–227.
[7] Ivanciuc, O., Ivanciuc, T., Cabrol-Bass, D. and Balaban, A.T. (2000) “Comparison of weighting schemes formolecular graph descriptors: Application in quantitative structure–retention relationship models foralkylphenols in gas–liquid chromatography”, J. Chem. Inf. Comput. Sci. 40, 732–743.
[8] Platt, J.R. (1952) “Prediction of isomeric differences in paraffin properties”, J. Phys. Chem. 56, 328–336.[9] Hosoya, H. (1971) “Topological index. A newly proposed quantity characterizing the topological nature of
structural isomers of saturated hydrocarbons”, Bull. Chem. Soc. Jpn 44, 2332–2339.[10] Todeschini, R. and Consonni, V. (2000) The Handbook of Molecular Descriptors (Wiley-VCH, New York).[11] Devillers, J. and Balaban, A.T. (1999) Topological Indices and Related Descriptors in QSAR and QSPR
(Gordon and Breach Science Publisher, Amsterdam), p. 811.[12] Karelson, M. (2000) Molecular Descriptors in QSAR/QSPR (Wiley-Interscience, New York), p. 430.[13] Katritzky, A.R., Maran, U., Lobanov, V.S. and Karelson, M. (2000) “Structurally diverse quantitative
structure–property relationship correlations of technologically relevant physical properties”, J. Chem. Inf.Comput. Sci. 40, 1–18.
[14] Randic, M., Hansen, P. and Jurs, P.C. (1988) “Search for useful graph theoretical invariants of molecularstructure”, J. Chem. Inf. Comput. Sci. 28, 60–68.
[15] Randic, M. (1991) “Novel graph theoretical approach to heteroatom in quantitative structure–activityrelationship”, Chemom. Intell. Lab. Syst. 12, 970–980.
[16] Randic, M. (1991) “On computation of optimal parameters for multivariate analysis of structure–propertyrelationship”, J. Comput. Chem. 12, 792–980.
[17] Randic, M. (1991) “Search for optimal molecular descriptors”, Croat. Chem. Acta 64, 43–54.[18] Randic, M. and Basak, S.C. (2000) “Multiple regression analysis with optimal molecular descriptors”, SAR
QSAR Environ. Res. 11, 1–23.
D. AMIC et al.294
[19] Randic, M. and Basak, S.C. (2000) “Variable molecular descriptors”, In: Sinha, D.K., Basak, S.C., Mohanty,R.K. and Busamallick, I.N., eds, Some Aspects of Mathematical Chemistry (Visva-Bharati University Press,Santiniketan).
[20] Balaban, A.T. (1997) From Chemical Topology to Three-Dimensional Geometry (Plenum Press, New York),p. 420.
[21] Gute, B.D. and Basak, S.C. (1997) “Predicting acute toxicity (LC50) of benzene derivatives using theoreticalmolecular descriptors: A hierarchical QSAR approach”, SAR QSAR Environ. Res. 7, 117–131.
[22] Plavsic, D., Trinajstic, N., Amic, D. and Soskic, M. (1998) “Comparison between the structure–boiling pointrelationships with different descriptors for condensed benzenoids”, N. J. Chem. 22, 1075–1078.
[23] Karelson, M., Maran, U., Wang, Y. and Katritzky, A.R. (1999) “QSPR and QSAR models derived using largemolecular descriptors spaces”, Coll. Czech. Chem. Commun. 64, 1551–1571.
[24] Pogliani, L. (1996) “A strategy for molecular modeling of a physicochemical property using a linearcombination of connectivity indexes”, Croat. Chem. Acta 69, 95–109.
[25] Basak, S.C., Nikolic, S., Trinajstic, N., Amic, D. and Beslo, D. (2000) “QSPR Modeling: Graph ConnectivityIndices versus Line Graph Connectivity Indices”, J. Chem. Inf. Comput. Sci. 40, 927–933.
[26] Chartrand, G. (1977) Graph as Mathematical Models, (Prindle, Weber and Schmidt, Boston), pp. 41–42.[27] Randic, M. and Pompe, M. (2000) “The variable molecular descriptors based on distance related matrices”,
J. Chem. Inf. Comput. Sci. 40, 575–581.[28] Amic, D., Lucic, B., Nikolic, S., Trinajstic, N. Variable Wiener number, manuscript in preparation.[29] Amidon, G.L., Yalkowsky, S.H. and Leung, S. (1974) “Solubility of nonelectrolytes in polar solvents. II.
Solubility of aliphatic alcohols in water”, J. Pharm. Sci. 63, 1858–1866.[30] Weast, R.B. (1987) CRC Handbook of Chemistry and Physics, 67th Ed. (CRC Press, Inc., Boca Raton, FL), 3rd
printing.[31] Hoffmann, R., Minkin, V.I. and Carpenter, B.K. (1996) “Ockham’s razor in chemistry”, Bull. Soc. Chim.
France 133, 117–130.[32] Lucic, B. and Trinajstic, N. (1997) “New development in QSPR/QSAR modeling based on topological
indices”, SAR QSAR Environ. Res. 7, 45–62.[33] Hermann, R.B. (1972) “The theory of hydrophobic bonding. II. The correlation of hydrocarbon solubility in
water with solvent cavity surface area”, J. Phys. Chem. 76, 2754–2759.[34] Langmuir, I. (1925) “The distribution and orientation of molecules”, Colloid. Symp. Monogr. 3, 49–75.[35] Mebane, R.C., Schanley, S.A., Rybolt, T.R. and Bruce, C.D. (1999) “The correlation of physical properties of
organic molecules with computed molecular surface areas”, J. Chem. Edu. 76, 688–693.[36] Hall, L.H., Kier, L.B. and Murray, W.J. (1975) “Molecular connectivity. II. Relationship to water solubility and
boiling point”, J. Pharm. Sci. 64, 1974–1977.[37] Kier, L.B. and Hall, L.H. (1976) “Molecular connectivity. VII. Specific treatment of heterosystems”, J. Pharm.
Sci. 65, 1806–1809.[38] Randic, M. (1975) “On characterization of molecular branching”, J. Am. Chem. Soc. 97, 6609–6615.[39] Randic, M. (1991) “Orthogonal molecular descriptors”, N. J. Chem. 15, 517–525.[40] Lucic, B., Nikolic, S., Trinajstic, N. and Juretic, D. (1995) “The structure–property models can be improved
using the orthogonalized descriptors”, J. Chem. Inf. Comput. Sci. 35, 532–538.[41] Lucic, B., Nikolic, S., Trinajstic, N., Juretic, D. and Juric, A. (1995) “A novel QSPR approach to
physicochemical properties of the a-amino acids”, Croat. Chem. Acta 68, 435–450.[42] Amic, D., Beslo, D., Lucic, B., Nikolic, S. and Trinajstic, N. (1998) “The vertex-connectivity index revisited”,
J. Chem. Inf. Comput. Sci. 38, 819–822.[43] Nikolic, S., Trinajstic, N., Amic, D., Beslo, D. and Basak, S.C. (2001) “Modeling the solubility of aliphatic
alcohols in water. Graph connectivity indices versus line graph connectivity indices”. In: QSAR/QSPR Studiesby Molecular Descriptors (Nova Science Publishers, Commack), pp. 63–81.
[44] Estrada, E. (1995) “Edge adjacency relationships and a novel topological index related to molecular volume”,J. Chem. Inf. Comput. Sci. 35, 31–33.
[45] Estrada, E. (1995) “Edge adjacency relationships in molecular graphs containing heteroatoms: A newtopological index related to molecular volume”, J. Chem. Inf. Comput. Sci. 35, 701–707.