A Fuzzy ARTMAP-Based Quantitative Structure-Property Relationship (QSPR) for the Henry’s Law Constant of Organic Compounds
Denise Yaffe,† Yoram Cohen,*,† Gabriela Espinosa,‡ Alex Arenas,§ and Francesc Giralt‡
Department of Chemical Engineering, University of California, Los Angeles, Los Angeles, California 90095-1592, and Departament d’Enginyeria Química, ETSEQ, and Departament d’Enginyeria Informàtica, ETSE, Universitat Rovira i Virgili, 43006 Tarragona, Catalunya, Spain
Received July 1, 2002
Quantitative structure-property relationships (QSPRs) for estimating a dimensionless Henry’s Law constant of organic compounds at 25 °C were developed based on fuzzy ARTMAP and back-propagation neural networks using a heterogeneous set of 495 organic compounds. A set of molecular descriptors developed from PM3 semiempirical MO-theory and topological descriptors (second-order molecular connectivity index) were used as input parameters to the neural networks. Quantum chemical input descriptors included average molecular polarizability, dipole moments (total point charge, total hybridization, and total sum), ionization potential, and heat of formation. The fuzzy ARTMAP/QSPR correlated the Henry’s Law constant for −6.72 ≤ logH ≤ 2.87 with average absolute errors of 0.03 and 0.13 logH units for the overall data and the test set, respectively. The optimal 7-17-1 back-propagation/QSPR model was less accurate and exhibited larger average absolute errors of 0.28 and 0.27 logH units for the validation (recall) and test sets, respectively. The fuzzy ARTMAP-based QSPR was superior to the back-propagation and multiple linear regression/QSPR models for the Henry’s Law constant of organic compounds.
I. INTRODUCTION
The Henry’s Law constant for chemical air-water partitioning, typically defined as the ratio of a chemical’s partial pressure $P_a$ in air to its mole fraction $x_a$ in water, is a key parameter for assessing the distribution of trace organic compounds among the various environmental media (e.g., air, water, vegetation, soil, sediment, and groundwater). Henry’s Law constant data are also essential for the design of air stripping technologies for the remediation of organic-contaminated aqueous streams and groundwater. The Henry’s Law constant, H, which is usually concentration-independent for dilute systems, is typically reported in units of atm·m³/gmol. However, in most correlations, as is the case in the present work, H is reported as a dimensionless partition coefficient (i.e., $H = C_a/C_w$, where $C_a$ and $C_w$ are the chemical equilibrium concentrations in the air and water phases, respectively). Although H can vary with temperature,1-3 its value (usually taken at 25 °C) is often (explicitly or implicitly) assumed to be temperature invariant over the range of interest for most environmental applications. Since the dimensionless Henry’s Law constant varies over many orders of magnitude ($10^{-7}$ to $10^{3}$), it is commonly reported as logH.
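The relation between the two reporting conventions mentioned above can be illustrated with a short sketch. It assumes ideal-gas behavior in the air phase, so that the dimensionless constant is the dimensional one divided by RT; the function names and the input value are hypothetical and are not taken from the paper.

```python
import math

# Gas constant in m^3*atm/(mol*K); the conversion assumes an ideal gas phase.
R_ATM_M3 = 8.205736e-5


def dimensionless_henry(h_atm_m3_per_mol, temp_k=298.15):
    """Convert H in atm*m^3/mol to the dimensionless ratio H = Ca/Cw."""
    return h_atm_m3_per_mol / (R_ATM_M3 * temp_k)


def log_h(h_atm_m3_per_mol, temp_k=298.15):
    """Return log10 of the dimensionless Henry's Law constant."""
    return math.log10(dimensionless_henry(h_atm_m3_per_mol, temp_k))


# Hypothetical input, chosen only to illustrate the arithmetic.
example_h = 5.0e-3  # atm*m^3/mol
print(dimensionless_henry(example_h), log_h(example_h))
```

At 25 °C the divisor RT is about 0.0245 atm·m³/mol, so a dimensional value of this magnitude maps to a dimensionless H of roughly 0.2.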
Experimental determination of the Henry’s Law constant for organic compounds is not a trivial task, especially for organic compounds of very low vapor pressures or low aqueous solubilities.4,5 Brennan et al.6 note that experimentally measured Henry’s Law constants have been reported in the literature for fewer than 600 organic chemicals out of the over 70 000 chemicals that are in current use. Moreover, H values reported by different sources for a given compound can often differ by factors of up to 7.0, with an average variation of a factor of 2.4.6
Given the limited Henry’s Law constant data, several approaches for estimating the Henry’s Law constant have been developed.1-3,7 The simplest estimation is based on the theoretical expression $H = \gamma^{\infty} P^{s}$, where $\gamma^{\infty}$ and $P^{s}$ are the chemical infinite dilution activity coefficient and saturation vapor pressure, respectively. Molecular activity coefficients at infinite dilution can be estimated with UNIFAC,1,7,8 modified-UNIFAC,7-10 and UNIQUAC1 group contribution methods. The above methods require interaction parameters that are obtained from model fits to experimental phase-equilibrium data. Unfortunately, group interaction parameters for UNIFAC and reliable vapor pressure data are often lacking for chemicals of environmental interest, as well as for many heterocyclic aromatic compounds and compounds with functional groups containing sulfur and phosphorus.7-10
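As a rough numerical sketch of the theoretical expression above, the product of the infinite dilution activity coefficient and the saturation vapor pressure gives H on a mole-fraction basis (atm). Converting it to the dimensionless form additionally assumes a dilute aqueous solution (water concentration approximated by the reciprocal molar volume of water) and an ideal gas phase. The input values below are hypothetical and only indicate orders of magnitude.

```python
import math

R_ATM_M3 = 8.205736e-5         # gas constant, m^3*atm/(mol*K)
WATER_MOLAR_VOLUME = 1.807e-5  # m^3/mol, liquid water near 25 C (approximate)


def henry_from_activity(gamma_inf, p_sat_atm):
    """H = gamma_inf * P_sat, in atm per unit mole fraction."""
    return gamma_inf * p_sat_atm


def to_dimensionless(h_atm, temp_k=298.15):
    """Convert mole-fraction-based H (atm) to Ca/Cw.

    Assumes Cw = x / V_water (dilute solution) and Ca = P / (R T) (ideal gas).
    """
    return h_atm * WATER_MOLAR_VOLUME / (R_ATM_M3 * temp_k)


# Hypothetical solute: gamma_inf and P_sat are illustrative assumptions.
h_atm = henry_from_activity(gamma_inf=2400.0, p_sat_atm=0.125)
h_dimless = to_dimensionless(h_atm)
print(h_atm, h_dimless, math.log10(h_dimless))
```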
Group contribution11,12 and quantitative structure-property relations (QSPRs)4-6,13-16 have been popular estimation methods of the Henry’s Law constant for organic compounds. For example, Hine and Mookerjee11 proposed bond contribution and group contribution schemes for logH, based on data sets of 263 and 212 compounds (−5 < logH < 2), with reported standard deviations of 0.42 and 0.12 logH units, respectively, for the above two sets.11 The bond contribution approach of Hine and Mookerjee,11 which was not validated with an external chemical data set, was later revised by Meylan and Howard,12 who expanded the bond contribution factors and added correction factors, based on a data set of 345 chemicals, demonstrating a lower standard deviation of 0.34 logH units. Validation of the above model with a heterogeneous set of 74 chemicals (−5 ≤ logH ≤ 3) demonstrated a standard deviation of 0.46 logH units.
An early example of a logH-linear QSPR was reported by Nirmalakhandan and Speece4 based on a set of 180 compounds (−4.99 ≤ logH ≤ 2.32) using a set of structural chemical descriptors (a first-order valence molecular connectivity index, a hydrogen bonding index, and a polarizability parameter). The above QSPR was validated with a data set of 20 organic compounds (−2.98 ≤ logH ≤ −1.27) with a reported average absolute error of 0.34 (26%) logH units. The Nirmalakhandan and Speece4 QSPR was tested in a subsequent study by Nirmalakhandan et al.5 with a set of 105 organics (−5.23 ≤ logH ≤ 2.32) of ‘similar’ structures (aliphatic hydrocarbons, halogenated aliphatic and aromatic, aromatic, PAHs, esters, acids, alcohols, and phenols), performing with an average absolute error and standard deviation of 0.41 (68.8%) and 0.40 (179.3%) logH units, respectively. An update of the above model, with optimized contributions to the polarizability parameter and an additional set of 87 chemicals (aldehydes, ketones, amines, nitro compounds, pyridines, and sulfonated compounds), resulted in a reduced average absolute error and standard deviation of 0.33 (13%) and 0.36 (14%) logH units, respectively, for the range of −6.06 ≤ logH ≤ −0.94. A test of the above model with a heterogeneous set of 70 chemicals (−8.07 ≤ logH ≤ −0.15) led to logH estimations with an average absolute error and standard deviation of 0.56 and 0.61 logH units, respectively. In a later study, Russell et al.17 reported an optimal logH-linear correlation, derived using a data set of 63 organic compounds (−4.91 ≤ logH ≤ 2.12), with five structural descriptors (selected from a set of 165 descriptors), yielding absolute logH errors of 0.375 and 0.34 for the training set and a test set consisting of only seven compounds, respectively.
Correlations of dimensionless water/air partition coefficients using a solvatochromic approach have also been proposed in the literature. For example, Abraham et al.13 proposed a correlation of the so-called Ostwald solubility coefficient, $L_w$ ($= H^{-1}$), using the linear solvation energy relationship (LSER) method with five solvatochromic parameters that included excess molar refraction, dipolarity/polarizability, effective hydrogen-bond acidity, effective hydrogen-bond basicity, and McGowan characteristic volume. The logLw correlation, for a training set of 408 chemicals (equivalent range of −8 ≤ logH ≤ 2.32), performed with a standard deviation of 0.151 logLw units;13 however, model performance for an external data set was not demonstrated. Another example is the linear logLw QSPR reported by Katritzky et al.,14 based on seven molecular descriptors and a data set of 406 chemicals from Abraham et al.,13 which performed with a standard deviation and average absolute error of 0.52 and 0.42 logLw units, respectively. In a later study, Katritzky et al.15 correlated logLw (for an equivalent Henry’s Law constant range of −6.99 ≤ logH ≤ 2.32), where Lw values were estimated from aqueous solubility and vapor pressure QSPRs, with a standard deviation of 0.63 logLw units.
In recent years, a variety of QSPRs for physicochemical properties that are based on neural networks (NNs) have gained popularity as alternatives to regression-based QSPRs that make use of a priori analytical correlation equations. The advantage of NNs over classical regression analysis methods is their inherent ability to incorporate nonlinear relationships between chemical structural parameters and physicochemical properties.16-21 Neural network/QSPR models for estimating the Henry’s Law constant for a data set of 357 organic compounds (−7.08 ≤ logH ≤ 2.32) have been recently reported by English and Carroll (2001).16 The above authors reported QSPRs based on 12-4-1 and 10-3-1 back-propagation neural network architectures (trained using 303 compounds) that performed with absolute errors of 0.237 and 0.281 logH units for the test set (54 compounds), respectively.
Recently, neural network-based QSPR models that are based on the cognitive classifier fuzzy ARTMAP have been proposed for the estimation of boiling temperature,18,19 critical properties,19 aqueous solubility,20 and octanol-water partition coefficients21 of organics. The approach was shown to be superior to the popular back-propagation neural network approach as well as other statistical QSPR correlations reported in the literature. The application of the fuzzy ARTMAP network for NN-based QSPR development has several advantages since it does not require problem-specific crafting or choice of initial weights, it does not get trapped in local minima (a problem that typically increases with data set size), and it is capable of classifying and analyzing noisy data.22-27
The recent success of NN-based QSPRs suggests that there is merit in exploring the applicability of the approach for estimating the Henry’s Law constant for organic compounds, which is the focus of the present study.
In the present study we report on the development of a QSPR for the Henry’s Law constant based on both fuzzy ARTMAP and back-propagation neural networks. The above approaches were evaluated based on a heterogeneous set of 495 organic compounds and chemical descriptors obtained from PM3 semiempirical MO-theory calculations. The performance of the proposed models and that of other published approaches were compared to demonstrate the potential application and reliability of the fuzzy ARTMAP approach.
II. METHODOLOGY
Data Set and Molecular Descriptors. The overall approach of developing the present neural network-based logH QSPRs is summarized in Figure 1. The first step involved the compilation of experimental Henry’s Law constants at 25 °C for a diverse set of 495 organic compounds.5-7,11-13 The compiled dimensionless H data, presented in Tables 1 and 2 as logH, covered a range of −6.72 ≤ logH ≤ 2.87. The heterogeneous data set of compounds included aromatic (polycyclic aromatic) and nonaromatic (normal, branched, cyclic) hydrocarbons, halogens, PCBs, mercaptans, sulfides, anilines, pyridines, alcohols, carboxylic acids, aldehydes, amines, ketones, and esters. Prior to model development, the input descriptors and Henry’s Law constant data were normalized from 0 to 1 as $X_n = (X - X_{min})/(X_{max} - X_{min})$, where $X$ and $X_n$ are the actual and normalized logH values, respectively, and $X_{min}$ and $X_{max}$ are the minimum and maximum logH values in the data set, respectively.
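The normalization step can be sketched as follows. This is a minimal illustration, assuming each descriptor column and the logH column are scaled independently; the helper names are hypothetical, and the sample values are only the range endpoints quoted above plus one interior value.

```python
def min_max_normalize(values):
    """Scale values to [0, 1] as Xn = (X - Xmin) / (Xmax - Xmin)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant column")
    return [(v - lo) / (hi - lo) for v in values], lo, hi


def denormalize(xn, lo, hi):
    """Invert the scaling to recover a value in original units."""
    return xn * (hi - lo) + lo


# logH range endpoints from the text, plus one interior value.
log_h_values = [-6.72, -2.21, 2.87]
normalized, x_min, x_max = min_max_normalize(log_h_values)
print(normalized)
print(denormalize(normalized[1], x_min, x_max))
```

Keeping `x_min` and `x_max` from the scaling step is what allows a network output in [0, 1] to be mapped back to logH units.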
Molecular descriptors for each compound were calculated based on the chemical molecular structure. Molecular
86 J. Chem. Inf. Comput. Sci., Vol. 43, No. 1, 2003 YAFFE ET AL.
structures were drawn using Molecular Modeling Pro 3.01 (ChemSW Software Inc.)28 and converted to 3-dimensional structures using the CAChe Software (Oxford Molecular Ltd.).29 The geometries of the 3-dimensional structures were subsequently optimized using MOPAC,30 a semiempirical molecular orbital modeling program, with the PM3 (parametric model 3) Hamiltonian31 to arrive at the compounds’ minimum energy conformations. In conjunction with the MOPAC energy minimization, quantum chemical descriptors, derived from the PM3 MO theory, were also calculated. The calculated descriptors included average polarizability, dipole moments (point-charge, hybridization, total), moments of inertia (x, y, and z directions), ionization potential, number of doubly occupied (filled) MO levels, molecular weight, heat of formation, total energy, and energy components that included the one-center (three descriptors) and two-center terms (seven descriptors). The total energy, in terms of the PM3 MO, is the sum of the total one-center and two-center terms, both of which were also used as independent descriptors. The one-center energy terms include electron-electron repulsion and electron-nuclear attraction as well as the sum of these two energy terms. The two-center energy terms consisted of the following descriptors: resonance energy, exchange energy, electron-electron repulsion, electron-nuclear attraction, nuclear-nuclear repulsion, the sum of the exchange and resonance energies, and the total two-center energy terms. Finally, the total electrostatic interaction was also used as a descriptor. This latter descriptor is the sum of the three two-center energy terms: electron-electron repulsion, electron-nuclear attraction, and nuclear-nuclear repulsion energy.
Molecular topological descriptors, which were also utilized in the present QSPRs, included the four valence molecular connectivity indices of orders 1, 2, 3, and 4 ($^1\chi^v$, $^2\chi^v$, $^3\chi^v$, $^4\chi^v$)32,33 and the second kappa shape index, $^2\kappa$.34 The above molecular indices were generated from the 2-dimensional molecular structure using Molecular Modeling Pro 3.01 software. Molecular connectivity indices, which encode 2-dimensional structural information, are determined from a molecular structure expressed topologically by a hydrogen-suppressed graph. The carbons (and heteroatoms) are represented as vertices, and bonds connecting atoms are represented as edges. Briefly, the connectivity indices $^m\chi^v$ are valence-weighted counts of connected subgraphs. The first-order term $^1\chi^v$ is related to the degree of branching and size of the molecule expressed as the number of non-hydrogen atoms. The second-order term $^2\chi^v$ represents a dissection of the molecular skeleton into “two contiguous bond” fragments. The third-order term $^3\chi^v$ is a weighted count of four-atom (three-bond) fragments representing the potential for rotation around the central bond and is the smallest molecular structure necessary for conformational variability. The $^3\chi^v$ index also reflects the degree of branching at each of the four atoms in the fragment. The fourth-order term, $^4\chi^v$, represents path, cluster, path/cluster, and cyclic subgraphs of four edges. Structural information from the $^4\chi^v$ index is useful for compounds with at least five carbon atoms in a chain. Finally, the kappa 2 shape index, $^2\kappa$, is included to characterize the level of branching among isomers.
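To make the second-order term concrete, the sketch below evaluates a second-order valence connectivity index on a hydrogen-suppressed graph using the commonly cited Kier-Hall form, in which each two-bond path contributes the inverse square root of the product of the three valence delta values. This is an illustration only, not the Molecular Modeling Pro implementation; the graph encoding and function name are hypothetical. For saturated carbons the valence delta equals the number of non-hydrogen neighbors, which is what the two examples use.

```python
from itertools import combinations
from math import sqrt


def chi2_valence(neighbors, delta_v):
    """Sum (delta_i * delta_j * delta_k) ** -0.5 over all two-bond paths i-j-k.

    neighbors: dict mapping atom index -> list of bonded non-hydrogen atoms.
    delta_v:   dict mapping atom index -> valence delta value.
    """
    total = 0.0
    for center, nbrs in neighbors.items():
        # Every unordered pair of neighbors of `center` defines one path.
        for i, k in combinations(sorted(nbrs), 2):
            total += 1.0 / sqrt(delta_v[i] * delta_v[center] * delta_v[k])
    return total


# n-butane: C0-C1-C2-C3
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
n_butane_delta = {0: 1, 1: 2, 2: 2, 3: 1}

# isobutane: central C0 bonded to C1, C2, C3
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
isobutane_delta = {0: 3, 1: 1, 2: 1, 3: 1}

print(chi2_valence(n_butane, n_butane_delta))
print(chi2_valence(isobutane, isobutane_delta))
```

The branched isomer has three two-bond paths through its central carbon while the linear one has two, which is why the index separates the two C4H10 isomers.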
The set of input descriptors for the final QSPR was selected, from the initial set of 28 descriptors, using a nonlinear variable selection method based on a dynamic neural network genetic algorithm.35 Variable selection analysis was performed based on the generation of eleven different back-propagation neural networks to identify the optimal set of descriptors based on a frequency distribution (of the descriptors selected in each run). The final set of seven
Figure 1. Process flow diagram for developing fuzzy ARTMAP and back-propagation QSPR/neural networks for predicting the Henry’s Law constant.
Table 1. Molecular Descriptors and Experimental Henry’s Law Constants at 25°C
descriptors selected to build the QSPRs were all at or above the 70% level of selection. The final set of input parameters is listed in Table 1. We note that the dipole moment-total hybridization, $\mu_H$, was the only parameter selected in all eleven NN runs. The dipole moment-total point-charge and heat of formation ranked at the 91st percentile range, while the second-order valence molecular connectivity index, ionization potential, and dipole moment-total sum ranked at the 82nd percentile range. Finally, the first-order average polarizability ranked at the 73rd percentile range.
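The quoted percentages are consistent with selection frequencies over eleven runs (10/11, 9/11, and 8/11 round to 91%, 82%, and 73%). The sketch below reproduces that arithmetic. The per-descriptor counts are inferred from the percentages rather than reported in the text, so they should be read as an assumption, and the variable names are hypothetical.

```python
N_RUNS = 11        # number of back-propagation networks generated
THRESHOLD = 70.0   # minimum selection frequency (%) to keep a descriptor

# Counts inferred from the percentages quoted in the text (assumption).
selection_counts = {
    "dipole moment (hybridization)": 11,
    "dipole moment (point-charge)": 10,
    "heat of formation": 10,
    "second-order valence connectivity index": 9,
    "ionization potential": 9,
    "dipole moment (sum)": 9,
    "average polarizability": 8,
}


def selection_frequency(count, n_runs=N_RUNS):
    """Percentage of runs in which a descriptor was selected."""
    return 100.0 * count / n_runs


selected = {
    name: round(selection_frequency(count))
    for name, count in selection_counts.items()
    if selection_frequency(count) >= THRESHOLD
}
for name, pct in selected.items():
    print(f"{name}: {pct}%")
```

Under this reading, a descriptor chosen in only seven of the eleven runs (about 64%) would fall below the 70% cutoff.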
Fuzzy ARTMAP Neural Network Systems. The present fuzzy ARTMAP neural network system was recently introduced for developing QSPRs for boiling temperatures,18 critical properties,19 aqueous solubilities,20 and octanol-water partition coefficients.21 This fuzzy ARTMAP network is the modification introduced by Giralt et al.27,36 to the original model of Carpenter et al.22-26 Additional information about fuzzy ART and fuzzy ARTMAP systems can be found elsewhere.37-41 Briefly, the basic learning mechanism of the fuzzy ARTMAP neural system consists of creating new categories (equivalent to hidden units in back-propagation) when dissimilar molecular descriptors and different values of the physical property are encountered. The network consists of two fuzzy ART modules, artA and artB, that are linked together via an inter-ART module (Figure 2). Each
a H = heat of formation, kcal/mol. b μP = total dipole moment (point-charge), Debye. c μH = dipole moment (hybridization), Debye. d μS = dipole moment (sum), Debye. e IP = ionization potential, eV. f PO = average polarizability, A.U. g $^2\chi^v$ = second-order valence molecular connectivity index.
Table 2. Experimental and Predicted Henry’s Law Constant [Dimensionless] Using Fuzzy ARTMAP and Back-Propagation QSPRs
data setsa logH absolute logH error absolute % logH error
ART system includes a field of nodes, $F_0$, that represents a current input vector, and a field $F_1$ that receives both bottom-up input from $F_0$ and top-down input from a field $F_2$ that represents the active code, or category. The artA module categorizes input patterns (molecular descriptors), while the artB module develops categories of target patterns (physical property) during supervised learning (training). During supervised learning, the artA module receives the molecular
a Tr = training set, Ts = test set, V = validation set. b FAM = present fuzzy ARTMAP QSPR. c BK-Pr = present back-propagation QSPR.
descriptors and the artB module receives the correct physical property prediction of the input presented to $F_0^a$. The artA module attempts a prediction through the map field of the category to which the current target belongs. The inter-ART module (map field) is an associative learning network that forms an internal controller designed to create a minimal number of artA recognition categories, or “hidden units”, by following a match tracking rule. The fuzzy ARTMAP dynamics are determined by the vigilance ($\rho_a, \rho_b, \rho_{ab} \in [0,1]$), learning rate ($\beta_a, \beta_b \in [0,1]$), and choice ($\alpha > 0$) parameters. The vigilance parameters calibrate how well an input pattern must match the learned prototype or cluster of input features that are relevant for such a category to be accepted. The vigilance parameter controls the degree of generalization. The learning rate parameter determines how the map field weights change through the learning process. Finally, the choice parameter controls the fuzzy subsethood of the category choice function and accounts for the noise in the activation of the $F_1$ layer. In general, the degree of similarity among input features is determined by the vigilance and choice parameters.
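The roles of the vigilance, learning rate, and choice parameters can be illustrated with a minimal sketch of a single fuzzy ART module in its standard textbook form (complement coding, choice function, vigilance test, weight update). This is not the modified network of Giralt et al. used in the present work: the map field and match tracking are omitted, and all function names are hypothetical.

```python
def complement_code(a):
    """Complement coding: a -> (a, 1 - a), so every input has the same norm."""
    return list(a) + [1.0 - x for x in a]


def fuzzy_and(x, y):
    """Component-wise minimum (fuzzy AND)."""
    return [min(p, q) for p, q in zip(x, y)]


def choice(i_vec, w, alpha):
    """Category choice function T_j = |I ^ w_j| / (alpha + |w_j|)."""
    return sum(fuzzy_and(i_vec, w)) / (alpha + sum(w))


def match(i_vec, w):
    """Match function |I ^ w_j| / |I|, compared against the vigilance rho."""
    return sum(fuzzy_and(i_vec, w)) / sum(i_vec)


def present(i_vec, weights, rho, alpha, beta):
    """Present one coded input; return the index of the category that learns it."""
    ranked = sorted(range(len(weights)),
                    key=lambda j: choice(i_vec, weights[j], alpha),
                    reverse=True)
    for j in ranked:
        if match(i_vec, weights[j]) >= rho:  # vigilance test passed: resonance
            w_and = fuzzy_and(i_vec, weights[j])
            weights[j] = [beta * m + (1.0 - beta) * w
                          for m, w in zip(w_and, weights[j])]
            return j
    weights.append(list(i_vec))  # no category matched: create a new one
    return len(weights) - 1


# Three hypothetical two-descriptor inputs, already normalized to [0, 1].
weights = []
inputs = [[0.2, 0.8], [0.25, 0.75], [0.9, 0.1]]
labels = [present(complement_code(a), weights, rho=0.9, alpha=0.0001, beta=1.0)
          for a in inputs]
print(labels, len(weights))
```

With a vigilance of 0.9 the first two inputs share one category and the third, dissimilar input creates a second one; raising the vigilance toward 1 would split the first two as well, which is the sense in which vigilance controls generalization.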
The fuzzy ARTMAP-based QSPR for logH was developed following the methodology described in Figure 1. About 85% (421) of the compounds in the complete data set were selected for training by the fuzzy ART classifier to ensure that adequate information was provided to the system. Training of the fuzzy ARTMAP consisted of presenting the molecular descriptors and target properties of the training set to modules artA and artB (see Figure 2), respectively, to establish input and output categories and relate them through the map field ($F_1^{ab}$). After training with a data set of 421 compounds, the hypothesis component of the artB module ($F_0^b$ and $F_1^b$) was disconnected and the output in its category layer $F_2^b$ was implemented.27,36 Therefore, through the map field module $F_1^{ab}$, a prediction for the target physical property was obtained for any input of descriptors presented to module artA. The model was then evaluated with a test set containing 74 compounds.
Back-Propagation Neural Network System. The logH data and molecular descriptors, normalized from 0 to 1, were divided into training, validation (or recall), and test data sets. The test data set of 74 compounds (about 15% of the complete data set) was selected to be identical to the fuzzy ARTMAP test set to enable direct comparison of the QSPRs derived from the two different networks. A random selection of 331 compounds of the remaining 421 compounds (about 67% of the total data set) made up the training set. To maintain an adequate size of the validation or recall set, the training set (331 compounds) and the remaining data not used for testing (90 compounds) were combined into a single validation set.
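The bookkeeping of the data partition can be sketched as below. The subset sizes are those stated in the text; drawing the test set by random shuffling is an assumption made only for illustration (the text states random selection only for the 331 training compounds), and the function name is hypothetical.

```python
import random

N_TOTAL, N_TEST, N_TRAIN = 495, 74, 331


def split_indices(seed=0):
    """Partition compound indices into training, held-back, validation, and test sets."""
    rng = random.Random(seed)
    indices = list(range(N_TOTAL))
    rng.shuffle(indices)  # assumption: test compounds drawn at random
    test = indices[:N_TEST]
    non_test = indices[N_TEST:]        # 421 compounds not used for testing
    train = non_test[:N_TRAIN]         # 331 compounds used to fit the weights
    held_back = non_test[N_TRAIN:]     # remaining 90 compounds
    validation = train + held_back     # recall set: all 421 non-test compounds
    return train, held_back, validation, test


train, held_back, validation, test = split_indices()
for name, subset in [("train", train), ("held back", held_back),
                     ("validation", validation), ("test", test)]:
    print(name, len(subset), round(100 * len(subset) / N_TOTAL), "%")
```

Because the validation (recall) set contains the training compounds, its error is not an independent estimate; only the 74-compound test set plays that role.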
Model building with the back-propagation neural network proceeded with the same seven input descriptors and logH data set used for the fuzzy ARTMAP model. The neural network architecture was developed using a cascade method of network construction, together with a Kalman filtering learning rule.35 In the above approach, hidden nodes were added one or two at a time, with new hidden units having connections from both the input buffer and previously established hidden nodes. Construction was stopped when the validation (recall) set showed no further performance improvement. The optimal back-propagation neural network/QSPR for logH had a 7-17-1 architecture in which the hyperbolic tangent transfer function was chosen to correlate weighted inputs and outputs of the hidden layer. The resulting optimal neural network architecture was then validated and tested using the two data subsets described above.
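For orientation, the forward pass of a plain 7-17-1 feed-forward network with hyperbolic tangent hidden units can be sketched as follows. This is a simplified stand-in: it omits the cascade connections between hidden nodes and the Kalman filtering learning rule described above, it assumes a linear output unit, and the weights are random placeholders rather than fitted values.

```python
import math
import random

N_IN, N_HIDDEN, N_OUT = 7, 17, 1


def init_network(seed=0):
    """Create placeholder weights for a plain 7-17-1 network (not fitted)."""
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(N_IN)]
                for _ in range(N_HIDDEN)]
    b_hidden = [0.0] * N_HIDDEN
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(N_HIDDEN)]
    b_out = 0.0
    return w_hidden, b_hidden, w_out, b_out


def forward(x, net):
    """Map seven normalized descriptors to one normalized logH estimate."""
    w_hidden, b_hidden, w_out, b_out = net
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    # Assumption: linear output unit; the text specifies tanh only for the hidden layer.
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


def n_parameters():
    """Weights plus biases of the plain (non-cascade) 7-17-1 layout."""
    return N_IN * N_HIDDEN + N_HIDDEN + N_HIDDEN * N_OUT + N_OUT


net = init_network()
print(forward([0.5] * N_IN, net), n_parameters())
```

Even this plain layout has 154 adjustable parameters against 331 training compounds, which gives a sense of why stopping construction on validation performance matters.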
III. RESULTS AND DISCUSSION
The optimal fuzzy ARTMAP/QSPR was obtained (i.e., training phase) for vigilance parameters $\rho_a = 0$, $\rho_b = 0.996$, $\rho_{ab} = 0.996$, learning rate parameters $\beta_a = 1$, $\beta_b = 1$, and choice parameter $\alpha = 0.0001$. The back-propagation QSPR was obtained for a 7-17-1 architecture using a set of seven molecular descriptors (Table 1). It is noted that, although errors of Henry’s Law estimation methods are often reported as absolute logH errors, these can be misleading since actual percent errors can be significant even for seemingly low absolute logH errors if, for example, the logH values for the corresponding range of compounds are small. Conversely, a reported high absolute logH error can also be misleading if it pertains to a range of high logH values. Therefore, in this study, to present a more balanced evaluation of model performance, both the absolute and absolute percent logH errors are reported. The performance of the present two models is presented in Table 2 and Figures 3 and 4, with an error analysis summary provided in Tables 3 and 4.
Figure 2. Block diagram of fuzzy ARTMAP neural network.
Figure 3. Henry’s Law constant (dimensionless) estimates using the present fuzzy ARTMAP QSPR.
The fuzzy ARTMAP logH QSPR performed, for the complete set of 495 compounds, with extremely low average absolute and maximum errors of 0.03 (2.9%) and 0.53 (131%) logH units, respectively, and a corresponding standard deviation of 0.06 (11.1%) logH units (Table 2). For the test set of 74 compounds, with a relaxed vigilance parameter setting of $\rho_a = 0.9$, the fuzzy ARTMAP/QSPR performed with average absolute and maximum absolute errors and standard deviation of 0.13 (7.3%), 0.53 (37%), and 0.12 (8.2%) logH units, respectively. Although the performance of the logH fuzzy ARTMAP/QSPR was excellent for the heterogeneous set of compounds, several compounds were misclassified, thereby resulting in elevated logH errors. It appears that chemical input descriptors can be similar for some compounds (e.g., isomers), while there may be relatively large differences in their logH values. For example, 2,2′,3,3′-PCB (logH = −2.21) was assigned to the category represented by 2,2′,3,3′,4,6-PCB (logH = −2.74), resulting in an absolute error of 0.53 (24%) logH units. Another example of misclassification of structural isomers is the placing of n-butyl propionate (logH = −1.69) into a recognition category generated for isobutyl propionate (logH = −1.17). The condition of conflicting data, where two input patterns are very similar but have different outputs, was also observed for the pairs 2,5-dimethylphenol (logH = −4.34) and 2,6-dimethylphenol (logH = −3.86) as well as trans-1,4-dimethylcyclohexane (logH = 1.55) and 1,2-dimethylcyclohexane (logH = 1.17). Since the accuracy of the fuzzy ARTMAP based model depends on the quality of the training set, it is always desirable to include a rich set comprised of many different chemical categories to arrive at a reasonable number of categories. If a given chemical in a test set does not have a matching category, developed in the training phase, the fuzzy ARTMAP will match the compound with the closest available recognition category according to a tolerance set by the vigilance parameter. For example, during model evaluation with the test set, octanal (logH = −1.68), an aldehyde with a molecular formula of C8H16O, was classified with 2-octanone (logH = −2.11), a ketone also with a molecular formula of C8H16O. The above examples suggest that further improvements to the fuzzy ARTMAP logH QSPR would require a data set that contains a larger number of compounds per class and possibly a refined set of molecular descriptors to allow a greater ability to differentiate among complex or apparently very similar structures.
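The two error measures used throughout this section can be sketched as below, using the misclassification examples quoted in the text. The sketch assumes that a misclassified compound is assigned the logH of the category it was matched to, which is stated explicitly only for the PCB pair; for the other two pairs this is an inference. The function names are hypothetical.

```python
def abs_error(predicted, experimental):
    """Absolute logH error."""
    return abs(predicted - experimental)


def abs_percent_error(predicted, experimental):
    """Absolute error as a percentage of |experimental logH|.

    Undefined when the experimental logH is zero, and inflated when it is
    close to zero, which is why both measures are reported side by side.
    """
    return 100.0 * abs(predicted - experimental) / abs(experimental)


# (experimental logH, logH of the category the compound was matched to)
cases = {
    "2,2',3,3'-PCB": (-2.21, -2.74),
    "n-butyl propionate": (-1.69, -1.17),
    "octanal": (-1.68, -2.11),
}
for name, (experimental, predicted) in cases.items():
    print(name,
          round(abs_error(predicted, experimental), 2),
          round(abs_percent_error(predicted, experimental), 1))
```

For the PCB pair this reproduces the 0.53 logH units and approximately 24% quoted above.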
The logH QSPR, derived based on the optimal 7-17-1 back-propagation network, performed with absolute average and maximum errors, and standard deviation, for the training set, of 0.29 (21.8%), 1.4 (154%), and 0.27 (22%) logH units, respectively. LogH estimates for the validation set (recall phase) were obtained with absolute average and maximum errors, and standard deviation, of 0.28 (21.3%), 1.4 (154%), and 0.26 (22%) logH units, respectively. Performance of the back-propagation/QSPR for the test set (same as that of the fuzzy ARTMAP based QSPR) was better, relative to the training and validation sets, with absolute average and maximum absolute errors, and standard deviation, of 0.27 (12.6%), 1.1 (81%), and 0.24 (12.4%) logH units, respectively. However, the performance of the fuzzy ARTMAP based logH QSPR was superior, as indicated in Table 3. Nonetheless, the overall performance of the back-propagation model is relatively good compared to other published methods of estimating logH. It is interesting to note that Brennan et al.6 suggested that predictions of the Henry’s Law constant within a factor of 2.5 are reasonable for many environmental applications given the variability among measured data, where standard deviations can range from less than 0.05 to about 0.5 logH units.16
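Since errors are reported in logH units while the acceptability criterion is stated as a multiplicative factor, a short conversion helps relate the two. This is a simple arithmetic sketch; the function names are hypothetical.

```python
import math


def factor_to_log_units(factor):
    """A multiplicative factor in H corresponds to log10(factor) logH units."""
    return math.log10(factor)


def log_units_to_factor(delta_log_h):
    """A difference in logH corresponds to a factor of 10**delta in H."""
    return 10.0 ** delta_log_h


print(factor_to_log_units(2.5))    # criterion attributed to Brennan et al.
print(log_units_to_factor(0.27))   # back-propagation test-set average error
print(log_units_to_factor(0.13))   # fuzzy ARTMAP test-set average error
```

A factor of 2.5 corresponds to about 0.40 logH units, so the test-set average errors of both models (0.27 and 0.13 logH units, or factors of about 1.9 and 1.3) fall inside that tolerance.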
The performance of the present logH QSPR models can also be assessed based on error analysis for specific chemical groups (Table 4). It is noted that the errors for the majority of the different chemical groups are within the same order of magnitude, for each respective model; however, errors for the fuzzy ARTMAP are generally about 2 orders of magnitude lower than for the back-propagation-based QSPR. Among the specific chemical groups, absolute average and percent logH errors for the fuzzy ARTMAP/QSPR reached up to 0.6 logH units and 10.17%, respectively. In contrast, the absolute average and percent logH errors for the back-propagation QSPR were higher, ranging from 0.15 to 0.63 logH units or the equivalent of 6.48 to 46.76%, respectively. However, the performance (in terms of absolute percent logH error) of the back-propagation QSPR was better for aldehydes, nitriles, phenols (both halogenated and nonhalogenated), and amides, relative to the other groups (<10%). Eleven out of the twenty-five chemical groups (i.e., representing 58% or 289 compounds of the total data set) exhibited average absolute errors of less than 0.27 logH units. The highest absolute average and percent errors of 0.63 logH and
a Errors are expressed as |logH| estimation errors; H is dimensionless ($H = C_a/C_w$).
46.76%, respectively, obtained for the back-propagation based QSPR, were attributed to three heterocyclic oxygen compounds in the data set. In contrast, the highest absolute percent logH error of 10.17% for the fuzzy ARTMAP model occurred for aromatic hydrocarbons, while logH errors of less than 2% were encountered for 17 of the 25 chemical groups. For both models, average absolute percent logH errors were relatively lower for amides, halogenated phenols, nitriles, alcohols, aromatic amines, and carboxylic acids. However, for both models, relatively high average absolute percent logH errors were encountered for aromatic and halogenated aliphatic compounds since many of those compounds possess relatively low logH values. Overall, the estimation errors for the specific chemical groups are largely attributed to logH data that were either too sparse or too concentrated in a particular region. Clearly, improved correlations would require the use of a uniformly distributed data set for training various chemical groups as well as for the overall data set.
The two QSPR models developed in this work were compared to previously reported multiple linear regression and neural network-based logH QSPRs. It is noted that the literature reports4,5,13-16 a range of average absolute estimation error of about 0.24 to 0.56 logH units, for logH correlations spanning various ranges of the Henry’s Law constant. In contrast, the present back-propagation and fuzzy ARTMAP based models performed with average absolute errors of 0.27 and 0.13 logH units, respectively, for a Henry’s Law constant range of −6.72 ≤ logH ≤ 2.87. More specific comparisons of the performance of the present to previously published logH QSPRs,4,5,15-16 for compounds common with the present data set, are provided in Table 5 and Figure 5. The predictions of the back-propagation and fuzzy ARTMAP models, for 277 compounds common with the data set of Nirmalakhandan and Speece,4,5 were within average absolute and maximum errors and standard deviation of 0.26 (19.5%), 1.3 (127%), and 0.25 (21.4%) and 0.03 (3.6%), 0.5 (131%), and 0.07 (13.2%) logH units, respectively. The above performance was somewhat better than for the Nirmalakhandan and Speece4,5 model predictions that yielded higher average absolute error, maximum error, and standard deviation of 0.38 (36.6%), 7.2 (1550%), and 0.67 (113.6%) logH units, respectively. Estimates for 308 compounds common with the data set of Katritzky et al.15 resulted in average absolute and maximum errors and standard deviation of 0.29 (20.5%), 1.4 (154%), and 0.26 (22.6%) and 0.025 (2.8%), 0.5 (131%), and 0.06 (11.4%) logH units, for the present back-propagation and fuzzy ARTMAP models, respectively. For the same 308 compounds, the multi-linear regression logLw QSPR of Katritzky et al.15 resulted in higher average absolute and maximum errors and standard deviation of 0.50 (91.45%), 3.96 (6700%), and 0.51 (521%) logH units (or $\log L_w^{-1}$ units), respectively. Finally, the present models were compared, for a common set of 275 compounds, to the 10-3-1 neural network/QSPR model of English and Carroll16 which was reported to perform with
Table 4. Error Analysis Based on Chemical Classes^a
Columns: chemical classes and functional groups | no. of compds | average absolute logH error (fuzzy ARTMAP, back-propagation) | average absolute percent logH error (fuzzy ARTMAP, back-propagation)
HENRY’S LAW CONSTANT OF ORGANIC COMPOUNDS J. Chem. Inf. Comput. Sci., Vol. 43, No. 1, 2003109
average absolute and maximum errors and standard deviation of 0.20 (28%), 3.93 (600%), and 0.41 (75.4%) logH units, respectively. The present back-propagation model performed with nearly comparable or lower average absolute and maximum errors and standard deviation of 0.25 (19.7%), 1.27 (154%), and 0.25 (22.2%) logH units, respectively. In contrast, the present fuzzy ARTMAP QSPR performed with lower average absolute and maximum errors and standard deviation of 0.03 (3.9%), 0.54 (131%), and 0.07 (13.3%) logH units, respectively.
a FAM = present fuzzy ARTMAP/QSPR. b Bk-Pr = back-propagation (this work). c N&S = QSPR models of Nirmalakhandan and co-workers (refs 4-6).
d Ketal = QSPR model of Katritzky et al. (ref 14). e E&C = NN/QSPR model of English and Carroll (ref 16).
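The error statistics quoted in the comparisons above (average absolute error, maximum error, and standard deviation, each in logH units and as a percentage) can be computed along the following lines. This is a minimal sketch, not the authors' code: the function name `error_summary`, the sample values, the definition of percent error as the absolute error relative to the magnitude of the observed logH, and the use of the sample standard deviation of the absolute errors are all assumptions made for illustration.

```python
from statistics import mean, stdev


def error_summary(log_h_obs, log_h_pred):
    """Summarize prediction errors for paired observed/predicted logH values.

    Assumptions (not taken from the paper): percent error is
    100 * |pred - obs| / |obs|, and the standard deviation is the sample
    standard deviation of the absolute errors.
    """
    if len(log_h_obs) != len(log_h_pred) or len(log_h_obs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")

    abs_err = [abs(p - o) for o, p in zip(log_h_obs, log_h_pred)]
    # Compounds with observed logH == 0 are skipped to avoid division by zero.
    pct_err = [100.0 * e / abs(o) for e, o in zip(abs_err, log_h_obs) if o != 0]

    return {
        "avg_abs": mean(abs_err),
        "max_abs": max(abs_err),
        "std_abs": stdev(abs_err),
        "avg_pct": mean(pct_err) if pct_err else float("nan"),
        "max_pct": max(pct_err) if pct_err else float("nan"),
    }


# Hypothetical observed and predicted logH values for four compounds.
obs = [-2.0, -1.0, 0.5, 2.0]
pred = [-1.9, -1.2, 0.5, 2.3]
summary = error_summary(obs, pred)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Under this assumed definition, the percent error divides by the magnitude of logH, so compounds with logH close to zero can dominate the maximum percent error even when their absolute error is small. This may be one reason a maximum percent error above 100% can accompany a maximum absolute error of only about 0.5 logH units.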
In closure, the performances of the fuzzy ARTMAP/QSPR and back-propagation QSPRs suggest that quantum chemical descriptors are reasonable for characterizing the structural information of the present data set of 495 organic compounds (containing 25 different chemical groups) with respect to logH predictions. The fuzzy ARTMAP QSPR was superior to the back-propagation-based logH QSPR. Both QSPRs demonstrated equivalent or greater estimation accuracy, for the range of compounds studied, relative to other regression-based QSPRs and group contribution methods.
CONCLUSIONS
The success of using fuzzy ARTMAP and back-propagation networks for developing QSPRs for estimating Henry’s Law constants was demonstrated using a set of molecular descriptors calculated from PM3 semiempirical MO-theory and a heterogeneous set of 495 organic compounds with a range of -6.72 ≤ logH ≤ 2.87. The descriptors obtained from PM3 semiempirical MO-theory calculations represented different forms of 3-dimensional information for characterizing the various atoms and functional groups for a set of heterogeneous organic compounds. For the fuzzy ARTMAP-based QSPR, average absolute errors for logH predictions for the overall data set and test set were 0.03 and 0.13 logH units, respectively. In contrast, the 7-17-1 back-propagation logH QSPR performed with average absolute errors for the validation (recall) and test sets of 0.28 and 0.27 logH units, respectively, which was comparable to published NN/QSPR models. The fuzzy ARTMAP neural network-based QSPR model was also of higher accuracy relative to previously published multi-linear regression and neural network QSPR models for Henry’s Law constant and Ostwald solubility coefficient. The study demonstrated that it is possible to develop reasonably accurate QSPRs for heterogeneous organic compounds based on the fuzzy ART classifier and the fuzzy ARTMAP cognitive system using a set of descriptors calculated from quantum mechanics and graph theory.
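For readers unfamiliar with the fuzzy ART classifier named above, its category search can be sketched as follows. This is a minimal illustration of the standard fuzzy ART steps described by Carpenter et al. (ref 25): complement coding, the choice function, the vigilance test, and the weight update. It is not the authors' implementation. The class name `FuzzyART`, the parameter values (`rho`, `alpha`, `beta`), and the two-component input vectors are hypothetical; real inputs would be molecular descriptors scaled to [0, 1]. Fuzzy ARTMAP additionally couples two such modules through a map field with match tracking, which is omitted here.

```python
def complement_code(x):
    """Map features in [0, 1] to the 2n-vector (x, 1 - x)."""
    return list(x) + [1.0 - v for v in x]


def fuzzy_and(a, b):
    """Component-wise minimum (the fuzzy AND operator)."""
    return [min(u, v) for u, v in zip(a, b)]


def norm1(a):
    """City-block norm; all components are non-negative here."""
    return sum(a)


class FuzzyART:
    def __init__(self, rho=0.75, alpha=0.001, beta=1.0):
        self.rho = rho      # vigilance: minimum acceptable match
        self.alpha = alpha  # choice parameter (small positive number)
        self.beta = beta    # learning rate (1.0 = fast learning)
        self.weights = []   # one weight vector per committed category

    def _choice(self, i_vec, w):
        # Choice function T_j = |I ^ w_j| / (alpha + |w_j|)
        return norm1(fuzzy_and(i_vec, w)) / (self.alpha + norm1(w))

    def train_one(self, x):
        """Present one input; return the index of the category that learns it."""
        i_vec = complement_code(x)
        # Search committed categories in order of decreasing choice value.
        order = sorted(
            range(len(self.weights)),
            key=lambda j: self._choice(i_vec, self.weights[j]),
            reverse=True,
        )
        for j in order:
            w = self.weights[j]
            overlap = fuzzy_and(i_vec, w)
            # Vigilance test: |I ^ w_j| / |I| >= rho
            if norm1(overlap) / norm1(i_vec) >= self.rho:
                # Weight update: w_j <- beta * (I ^ w_j) + (1 - beta) * w_j
                self.weights[j] = [
                    self.beta * o + (1.0 - self.beta) * wi
                    for o, wi in zip(overlap, w)
                ]
                return j
        # No committed category passed the vigilance test: commit a new one.
        self.weights.append(i_vec)
        return len(self.weights) - 1


# Hypothetical two-descriptor inputs, already scaled to [0, 1].
net = FuzzyART(rho=0.75)
categories = [net.train_one(x) for x in ([0.10, 0.20], [0.12, 0.22], [0.90, 0.90])]
print(categories, len(net.weights))
```

Raising the vigilance `rho` makes the match test stricter, so more, narrower categories are created; lowering it lets one category absorb a wider range of inputs.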
REFERENCES AND NOTES
(1) Reid, R. C.; Prausnitz, J. M.; Sherwood, T. K. The Properties of Gases and Liquids, 3rd ed.; McGraw-Hill: New York, 1977.
(2) Prausnitz, J. M.; Shair, F. H. A Thermodynamic correlation of gas solubilities. AIChE J. 1961, 7, 682-687.
(3) Yen, L.; McKetta, J. J. A Thermodynamic Correlation of Nonpolar Gas Solubilities in Polar, Nonassociated Liquids. AIChE J. 1962, 8, 501-507.
(4) Nirmalakhandan, N. N.; Speece, R. E. QSAR model for predicting Henry’s Constant. Environ. Sci. Technol. 1988, 22(11), 1349-1361.
(5) Nirmalakhandan, N.; Brennan, R. A.; Speece, R. E. Predictive Henry’s Law Constant and the effects of Temperature on Henry’s Law Constant. Water Res. 1997, 31(6), 1471-1481.
(6) Brennan, R. A.; Nirmalakhandan, N.; Speece, R. E. Comparison of Predictive Methods for Henry’s Law Constants of Organic Chemicals. Water Res. 1998, 32(6), 1901-1911.
(7) Lohmann, J.; Joh, R.; Gmehling, J. From UNIFAC to modified UNIFAC (Dortmund). Ind. Eng. Chem. Res. 2001, 40, 957-964.
(8) Ornektekin, S.; Paksoy, H.; Demirel, Y. The performance of UNIFAC and related group contribution models Part II. Prediction of Henry’s Law Constant. Thermochim. Acta 1996, 287, 251-259.
(9) Hwang, S.-M.; Lee, J.-M.; Lin, H. New group-interaction parameters of the UNIFAC model: Aromatic methoxyl binaries. Ind. Eng. Chem. Res. 2001, 40, 1740-1747.
(10) Gmehling, J.; Lohmann, J.; Jakob, A.; Li, J.; Joh, R. A modified UNIFAC (Dortmund) Model. 3. Revision and Extension. Ind. Eng. Chem. Res. 1998, 37, 4876-4882.
(11) Hine, J.; Mookerjee, P. K. The intrinsic hydrophilic character of organic compounds, correlations in terms of structural contribution. J. Org. Chem. 1975, 40, 292-298.
(12) Meylan, W. M.; Howard, P. H. Bond contribution method for estimating Henry’s Law Constants. Environ. Toxicol. Chem. 1991, 10, 1283-1293.
(13) Abraham, M. H.; Andonian-Haftvan, J.; Whiting, G. S.; Leo, A.; Taft, S. Hydrogen bonding. Part 34. The factors that influence the solubility of gases and vapors in water at 298 K, and a new method for its determination. J. Chem. Soc., Perkin Trans. 2 1994, 1777-1791.
(14) Katritzky, A.; Mu, L. A QSPR study of the solubility of gases and vapors in water. J. Chem. Inf. Comput. Sci. 1996, 36, 1162-1168.
(15) Katritzky, A. R.; Wang, Y.; Sild, S.; Tamm, T.; Karelson, M. QSPR Studies on Vapor Pressure, Aqueous Solubility, and the Prediction of Water-Air Partition Coefficients. J. Chem. Inf. Comput. Sci. 1998, 38, 720-725.
(16) English, N. J.; Carroll, D. G. Prediction of Henry’s Law Constants by a Quantitative Structure-Property Relationship and Neural Networks. J. Chem. Inf. Comput. Sci. 2001, 41, 1150-1161.
(17) Russell, C. J.; Dixon, S. L.; Jurs, P. C. Computer-Assisted Study of the Relationship between Molecular Structure and Henry’s Law Constant. Anal. Chem. 1992, 64, 1350-1355.
(18) Espinosa, G.; Yaffe, D.; Cohen, Y.; Arenas, A.; Giralt, F. Neural Network Based Quantitative Structural Property Relations (QSPRs) for Predicting Boiling Points of Aliphatic Hydrocarbons. J. Chem. Inf. Comput. Sci. 2000, 40, 859-877.
(19) Espinosa, G.; Yaffe, D.; Cohen, Y.; Arenas, A.; Giralt, F. A fuzzy ARTMAP-Based Quantitative Structure-Property Relations (QSPRs) for Predicting Physical Properties of Organic Compounds. Ind. Eng. Chem. Res. 2001, 40, 2757-2766.
(20) Yaffe, D.; Espinosa, G.; Cohen, Y.; Arenas, A.; Giralt, F. A fuzzy ARTMAP based Quantitative Structure-Property Relationships (QSPRs) for predicting Aqueous Solubility of Organic Compounds. J. Chem. Inf. Comput. Sci. 2001, 41, 1177-1207.
(21) Yaffe, D.; Cohen, Y.; Espinosa, G.; Giralt, F.; Arenas, A. Fuzzy ARTMAP and Back-Propagation Neural Networks based Quantitative Structure-Property Relationships (QSPRs) for Octanol-Water Partition Coefficient of Organic Compounds. J. Chem. Inf. Comput. Sci. 2002, 42(2), 162-183.
(22) Carpenter, A.; Grossberg, S. A Massively Parallel Architecture for a Self-Organizing Neural Pattern Recognition Machine. Computer Vision, Graphics, Image Processing 1987, 37, 54.
(23) Carpenter, A.; Grossberg, S. The ART of Adaptive Pattern Recognition by a Self-organizing Neural Network. Computer 1988, 77.
(24) Carpenter, G. A.; Grossberg, S.; Marcuzon, N.; Reynolds, J. H.; Rosen, D. B. Fuzzy ARTMAP: A Neural Network Architecture for Incremental Supervised Learning of Analogue Multidimensional Maps. IEEE Trans. Neural Networks 1992, 3, 698.
(25) Carpenter, G. A.; Grossberg, S.; Marcuzon, N.; Rosen, D. B. Fuzzy ART: Fast Stable Learning and Categorization of Analogue Patterns by an Adaptive Resonance System. Neural Networks 1991, 4, 759.
(26) Carpenter, G.; Grossberg, S. A Self-Organizing Neural Network for Supervised Learning, Recognition, and Prediction. IEEE Communications Magazine 1992, 38.
(27) Giralt, F.; Arenas, A.; Ferre-Giné, J.; Rallo, R. The Simulation and Interpretation of Turbulence with a Cognitive Neural System. Phys. Fluids 2000, 12, 1826.
(28) Molecular Modeling Pro. Revision 3.14; ChemSM Inc. 1998.
(29) CAChe Version 3.2, CAChe Chemistry Products, Oxford Molecular Ltd.
(30) Stewart, J. J. P. MOPAC 6.0, Quantum Chemistry Program Exchange No. 455, Bloomington, IN, 1989.
(31) Stewart, J. J. P. Optimization of Parameters for Semiempirical Methods I. Method. J. Comput. Chem. 1989, 10, 209-220.
(32) Kier, L. B.; Hall, L. H. Molecular Connectivity in Chemistry and Drug Research; Academic Press: New York, 1976.
(33) Kier, L. B.; Hall, L. H. Molecular Connectivity in Structure-Activity Analysis; John Wiley & Sons Inc.: New York, 1985.
(34) Kier, L. B. A Shape Index from Molecular Graphs. Quantum Struct-
embedded in the velocity field of a turbulent wake, in Solving Engineering Problems with Neural Networks, Proceedings of the International Conference on Engineering Applications of Neural Networks (EANN’96), Ed. Bulsari, A. B.; Kallio, S.; Tsaptsinos, D.; Turku, 1996; Vol. 1, 17-20.
(37) Bartfai, B. Hierarchical Clustering with ART Neural Networks. Technical Report CS-TR-94/1, 1994.
(38) Bartfai, B. On the Match Tracking Anomaly of ARTMAP Neural Network. Technical Report CS-TR-95/1, 1995.
(39) Bartfai, B. An Improved Learning Algorithm for the fuzzy ARTMAP Neural Network. Technical Report CS-TR-95/10, 1995.
(40) Bartfai, B.; White, R. A fuzzy ART-based Modular Neuro-fuzzy Architecture for Learning Hierarchical Clusterings. Technical Report CS-TR-97/6, 1997.
(41) Breindl, A.; Beck, B.; Clark, T. Prediction of the n-Octanol/Water Partition Coefficient, logP, using a combination of Semiempirical MO-Calculations and Neural Network. J. Molecular Modeling 1997, 3, 142-155.
CI025561J