Chemosphere 73 (2008) 415–427
Contents lists available at ScienceDirect
Chemosphere
journal homepage: www.elsevier.com/locate/chemosphere

A novel approach to predict aquatic toxicity from molecular structure

Juan A. Castillo-Garit a,b,c,*, Yovani Marrero-Ponce b,c,d,e, Jeanette Escobar a, Francisco Torrens d, Richard Rotondo f

a Applied Chemistry Research Center, Central University of Las Villas, Santa Clara, 54830, Villa Clara, Cuba
b Unit of Computer-Aided Molecular "Biosilico" Discovery and Bioinformatic Research (CAMD-BIR Unit), Department of Pharmacy, Faculty of Chemistry-Pharmacy, Central University of Las Villas, Santa Clara, 54830, Villa Clara, Cuba
c Department of Drug Design, Chemical Bioactive Center, Central University of Las Villas, Santa Clara, 54830, Villa Clara, Cuba
d Institut Universitari de Ciència Molecular, Universitat de València, Edifici d'Instituts de Paterna, P.O. Box 22085, 46071 Valencia, Spain
e Unidad de Investigación de Diseño de Fármacos y Conectividad Molecular, Departamento de Química Física, Facultad de Farmacia, Universitat de València, València, Spain
f Mediscovery Inc., Suite 1050, 601 Carlson Parkway, Minnetonka, MN 55305, USA

* Corresponding author. Address: Applied Chemistry Research Center, Central University of Las Villas, Santa Clara, 54830, Villa Clara, Cuba. Tel.: +53 42 281192/281473; fax: +53 42 281130/281455.
E-mail addresses: [email protected], juancg.22@gmail.com, [email protected] (J.A. Castillo-Garit).

Article history:
Received 1 February 2008
Received in revised form 29 April 2008
Accepted 7 May 2008
Available online 1 July 2008

Keywords:
Atom-based non-stochastic and stochastic linear index
Multiple linear regression
QSAR
Tetrahymena pyriformis
Program TOMOCOMD–CARDD

Abstract

The main aim of the study was to develop quantitative structure–activity relationship (QSAR) models for the prediction of aquatic toxicity using atom-based non-stochastic and stochastic linear indices. The dataset used consists of 392 benzene derivatives, separated into training and test sets, for which toxicity data to the ciliate Tetrahymena pyriformis were available. Using multiple linear regression, two statistically significant QSAR models were obtained with non-stochastic (R2 = 0.791 and s = 0.344) and stochastic (R2 = 0.799 and s = 0.343) linear indices. A leave-one-out (LOO) cross-validation procedure was carried out, achieving values of q2 = 0.781 (scv = 0.348) and q2 = 0.786 (scv = 0.350), respectively. In addition, a validation through an external test set was performed, which yielded significant R2pred values of 0.762 and 0.797. A brief study of the influence of statistical outliers on QSAR model development was also carried out. Finally, our method was compared with other approaches implemented in the Dragon software, achieving better results. The non-stochastic and stochastic linear indices appear to provide an interesting alternative to costly and time-consuming experiments for determining toxicity.

© 2008 Elsevier Ltd. All rights reserved.
0045-6535/$ - see front matter. doi:10.1016/j.chemosphere.2008.05.024

1. Introduction

Under the European Union Registration, Evaluation and Authorization of Chemicals program, all chemicals produced or imported in quantities above 1 ton per annum (tpa) in the European Union will need to be assessed for human and environmental hazards (Aptula and Roberts, 2006). As pointed out in the European Union White Paper concerning a future chemicals policy (Anon, 2001), the development of tools able to assess potential hazardous effects of chemicals on living organisms needs to receive attention. Therefore, information about the toxicity of industrial organic chemicals to aquatic species is of interest. While experimental testing provides the most reliable data about the effects of chemicals, it is not suitable for screening a large number of potential toxicants (Netzeva and Schultz, 2005), because the generation of toxicological data is often a lengthy and costly process. Predictive models in the form of quantitative structure–activity relationships (QSARs) are therefore a necessary tool for filling data gaps in environmental risk assessment and regulatory concerns (DeWeese and Schultz, 2001).

QSARs are employed as scientifically credible tools for predicting the acute toxicity of chemicals when few empirical data are available. The Office of Toxic Substances of the US Environmental Protection Agency has developed QSARs based on as little as one datum and assumptions about the nature of the relationship between a chemical class and its toxicity (Auer et al., 1990). Consistent with the development and application of QSARs for the design of more efficacious pharmaceuticals and pesticides, structure–activity relationships have gained increasing acceptance for predicting adverse effects of xenobiotics in risk assessment (Bradbury, 1995). In recent years, research in this field, especially as it relates to aquatic toxicity, has become more sophisticated (Bradbury et al., 2003; Comber et al., 2003). QSARs offer the advantages of higher speed and lower cost, especially when compared to experimental testing (Netzeva and Schultz, 2005).

Structure–toxicity models exist at the intersection of biology, chemistry, and statistics. The connection of these three subjects has permitted the development of structure–activity relationships as an accepted sub-discipline in toxicology (McKinney et al., 2000).
Therefore, the development of an ecotoxicity-based QSAR requires these three components. A dataset is required that provides a measure of the toxicity for a group of chemicals, most often organic. This group of chemicals is typically defined by some selection criteria. In addition, molecular structure and/or property data (i.e. the descriptors, variables, or predictors) are required for each compound in the group. These two data arrays must then be related, usually via a statistical method (Schultz et al., 2003).

The growth inhibition database for the ciliated protozoan Tetrahymena pyriformis (Schultz, 1997) is considered to be a high-quality dataset (Bradbury et al., 2003). It has been developed in a single laboratory over more than two decades. Although numerous workers using slight variations in the static protocol and nominal concentrations have generated the data, the dataset remains an excellent primary source of information; it is also unique in terms of its size, molecular diversity, and quality. Moreover, these data have been compiled for the main purpose of QSAR development and validation. Many studies have used T. pyriformis data to develop linear models in recent years (Chen et al., 2004; Cronin et al., 2004; Gonzalez et al., 2004; Schultz et al., 2004; Aptula et al., 2005; Cheng and Yuan, 2005; Netzeva and Schultz, 2005; Spycher et al., 2005; Costescu and Diudea, 2006; Zvinavashe et al., 2006); additionally, some non-linear methods have also been applied (Melagraki et al., 2005; Ivanciuc, 2004, 2007) to predict aquatic toxicity in T. pyriformis.

On the other hand, a novel scheme for rational in silico molecular design (or selection/identification of chemicals) and for QSAR/QSPR studies has recently been introduced by our research team: the so-called TOpological MOlecular COMputer Design–Computer-Aided Rational Drug Design (TOMOCOMD–CARDD) (Marrero-Ponce and Romero, 2002). This approach, which is based on principles of novel methods in chemical graph and algebraic theories, has been successfully used for the description of different physical, chemo-physical, and chemical properties of organic compounds (Marrero-Ponce, 2003, 2004a,b; Marrero-Ponce et al., 2004c). Many biological activities have also been modeled effectively with these TOMOCOMD–CARDD descriptors (Marrero-Ponce et al., 2004a,b; Marrero-Ponce et al., 2005a,b,c,e; Castillo-Garit et al., 2007a), including studies related to proteomics (Marrero-Ponce et al., 2004e, 2005d) and nucleic acid–drug interactions (Marrero-Ponce et al., 2004f, 2005d). In addition, these molecular descriptors (MDs) have been extended to consider three-dimensional (3D) features of small/medium-sized molecules based on the trigonometric-3D-chirality-correction factor approach (Marrero-Ponce et al., 2004d; Marrero-Ponce and Castillo-Garit, 2005; Castillo-Garit et al., 2006, 2007b).

Bearing in mind the above, the main aim of the present work was to test the applicability of the TOMOCOMD–CARDD approach in ecotoxicological research. To this end, we develop QSAR models for the prediction of aquatic toxicity for a large group of substituted benzenes tested in the population growth impairment assay of T. pyriformis.

Table 1
Values of the atomic weights used for linear indices calculation (Pauling, 1939; Kier and Hall, 1986; Todeschini and Gramatica, 1998; Consonni et al., 2002)

| ID | Atomic mass (g/mol) | VdW^a volume (Å3) | Mulliken electronegativity | Polarizability (Å3) | Pauling electronegativity |
|----|---------------------|-------------------|----------------------------|---------------------|---------------------------|
| H  | 1.01   | 6.709  | 2.592 | 0.667 | 2.20 |
| B  | 10.81  | 17.875 | 2.275 | 3.030 | 2.04 |
| C  | 12.01  | 22.449 | 2.746 | 1.760 | 2.55 |
| N  | 14.01  | 15.599 | 3.194 | 1.100 | 3.04 |
| O  | 16.00  | 11.494 | 3.654 | 0.802 | 3.44 |
| F  | 19.00  | 9.203  | 4.000 | 0.557 | 3.98 |
| P  | 30.97  | 26.522 | 2.515 | 3.630 | 2.19 |
| S  | 32.07  | 24.429 | 2.957 | 2.900 | 2.58 |
| Cl | 35.45  | 23.228 | 3.475 | 2.180 | 3.16 |
| Br | 79.90  | 31.059 | 3.219 | 3.050 | 2.96 |
| I  | 126.90 | 38.792 | 2.778 | 5.350 | 2.66 |

a VdW: van der Waals.

2. Materials and methods

2.1. TOMOCOMD–CARDD approach

In earlier publications, we have outlined the main features of the theory of the atom-based non-stochastic and stochastic linear indices. This method codifies molecular structures by means of mathematical linear transformations (Marrero-Ponce, 2004a,c; Marrero-Ponce and Castillo-Garit, 2005; Marrero-Ponce et al., 2005f). These molecular fingerprints were generated by means of the interactive program for molecular design and bioinformatic research TOMOCOMD (Marrero-Ponce and Romero, 2002). It is composed of four subprograms; each one allows both drawing the structures (drawing mode) and calculating molecular 2D/3D descriptors (calculation mode). The modules are named CARDD (Computed-Aided 'Rational' Drug Design), CAMPS (Computed-Aided Modeling in Protein Science), CANAR (Computed-Aided Nucleic Acid Research) and CABPD (Computed-Aided Bio-Polymers Docking). The CARDD module was selected for drawing all structures as well as for the computation of non-stochastic and stochastic atom-based linear indices. The main steps for the application of this method in quantitative structure–activity/toxicity relationships (QSAR/QSTR) and for drug design can be briefly summarized as follows.

1. Drawing of the molecular pseudograph for each molecule in the data set, using the drawing mode.

2. Use of appropriate weights in order to differentiate the atoms of the molecule. The weights used in this work are those previously proposed for the calculation of the DRAGON descriptors (Pauling, 1939; Todeschini and Gramatica, 1998; Consonni et al., 2002), i.e., atomic mass (M), atomic polarizability (P), Mulliken atomic electronegativity (K), van der Waals atomic volume (V), plus the atomic electronegativity in Pauling scale (G) (Kier and Hall, 1986). The values of these atomic labels are shown in Table 1 (Pauling, 1939; Kier and Hall, 1986; Todeschini and Gramatica, 1998; Consonni et al., 2002).

3. Computation of the total and local (atom and atom-type) atom-based linear indices of the molecular pseudograph's atom adjacency matrix. This can be carried out in the software calculation mode, where one can select the atomic properties and the descriptor family before calculating the molecular indices. The software generates a table in which the rows correspond to the compounds and the columns correspond to the atom-based (both total and local) linear maps or to another MD family implemented in the program.

4. Development of a QSAR/QSTR equation by using multivariate analytical techniques, for instance, multiple linear regression. One can thus find a quantitative relationship between an activity A and the atom-based linear fingerprints having, for example, the following form:

$$A = a_0 f_0(x) + a_1 f_1(x) + a_2 f_2(x) + \cdots + a_k f_k(x) + c \qquad (1)$$

where $A$ is the measured activity, $f_k(x)$ is the kth total (atom and atom-type) linear index, and the $a_k$'s and $c$ are the coefficients obtained by the linear regression analysis.

5. Test of the robustness and predictive power of the QSPR/QSAR equation by using internal [leave-one-out (LOO)] and external (using an external prediction set) validation techniques.
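Steps 1 to 3 above can be illustrated with a minimal sketch. It assumes the definition used in the cited earlier work, in which the kth atom-level linear indices are the entries of $M^k x$ (with $M$ the atom adjacency matrix and $x$ the vector of atomic weights), the total index is their sum over all atoms, and a local index is their sum over a chosen group of atoms. The molecule (hydrogen-suppressed ethanol), the function names and the use of a simple graph without multiple bonds or loops are illustrative assumptions, not the TOMOCOMD implementation.

```python
# Pauling electronegativities for C and O, taken from Table 1.
PAULING = {"C": 2.55, "O": 3.44}

# Hydrogen-suppressed graph of ethanol, C-C-O (illustrative molecule only).
ATOMS = ["C", "C", "O"]
ADJACENCY = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]


def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def atom_linear_indices(adjacency, weights, k):
    """Atom-level kth linear indices, assumed here to be the vector M^k x."""
    vector = list(weights)
    for _ in range(k):
        vector = mat_vec(adjacency, vector)
    return vector


def total_linear_index(adjacency, weights, k):
    """Total kth linear index: sum of the atom-level indices over all atoms."""
    return sum(atom_linear_indices(adjacency, weights, k))


def local_linear_index(adjacency, weights, k, atom_ids):
    """Local kth linear index: sum over the selected atoms (e.g. heteroatoms)."""
    values = atom_linear_indices(adjacency, weights, k)
    return sum(values[i] for i in atom_ids)


weights = [PAULING[symbol] for symbol in ATOMS]
for k in range(3):
    total = total_linear_index(ADJACENCY, weights, k)
    local_oxygen = local_linear_index(ADJACENCY, weights, k, [2])
    print(k, round(total, 3), round(local_oxygen, 3))
```

Each printed row corresponds to one order k, which is how a descriptor table with one column per index could be filled for a compound.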

The descriptors computed in this work were the following:

(1) $f_k(x)$ and $f_k^H(x)$ are the kth atom-based non-stochastic total linear indices considering and not considering H-atoms, respectively, in the molecule.

(2) $f_{kL}(x_E)$ and $f_{kL}^H(x_E)$ are the kth atom-based non-stochastic local (atom-type = heteroatoms: S, N, O) linear indices considering and not considering H-atoms, respectively, in the molecule.

(3) $f_{kL}^H(x_{E\text{-}H})$ are the kth atom-based non-stochastic local (atom-type = H-atoms bonded to heteroatoms: S, N, O) linear indices considering H-atoms in the molecular pseudograph (G).

In addition, the kth atom-based stochastic total [${}^{s}f_k(x)$ and ${}^{s}f_k^H(x)$], as well as local [${}^{s}f_{kL}(x_E)$, ${}^{s}f_{kL}^H(x_E)$ and ${}^{s}f_{kL}^H(x_{E\text{-}H})$], linear indices were also computed.
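The text does not restate how the stochastic indices differ from the non-stochastic ones. The sketch below assumes the construction described in the authors' cited earlier publications: the kth stochastic matrix is obtained by dividing each row of $M^k$ by its row sum, and the stochastic index is then computed from that matrix in the same way as before. The example graph, the weights and all function names are hypothetical.

```python
# Hydrogen-suppressed C-C-O chain with Pauling electronegativities from Table 1.
ADJACENCY = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
WEIGHTS = [2.55, 2.55, 3.44]


def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(matrix, k):
    """kth power of a square matrix (identity for k = 0)."""
    n = len(matrix)
    result = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, matrix)
    return result


def stochastic_matrix(matrix, k):
    """Row-normalized M^k (assumed definition of the kth stochastic matrix)."""
    rows = []
    for row in mat_power(matrix, k):
        total = sum(row)
        # A row summing to zero (isolated atom) is left as zeros.
        rows.append([value / total if total else 0.0 for value in row])
    return rows


def stochastic_total_index(matrix, weights, k):
    """Total kth stochastic linear index: sum of the entries of S^k x."""
    s_k = stochastic_matrix(matrix, k)
    return sum(sum(a * b for a, b in zip(row, weights)) for row in s_k)


for k in range(3):
    print(k, round(stochastic_total_index(ADJACENCY, WEIGHTS, k), 3))
```

Because every row of the normalized matrix sums to one, each atom-level value is a weighted average of atomic properties, so the stochastic indices stay on the scale of the weights instead of growing with k.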

2.2. Chemical database selection

Central to the issues of quality, transparency, and domain identification as they relate to toxicological QSAR is biological data. High-quality toxicity data for a structurally diverse set of molecules are required to formulate and validate high-quality QSARs. Quality toxicity data typically come from standardized assays measured in a consistent manner, with a clear and unambiguous endpoint and low experimental error (Schultz and Netzeva, 2004). Toxicity assessments made in a single laboratory by a single protocol tend to be the most precise. Taking these points into consideration, we selected the database of growth inhibition of the ciliated protozoan T. pyriformis. This database has been developed in a single laboratory over more than two decades and has been recognized as a high-quality dataset (Bradbury et al., 2003).

The general dataset used in this study has been recently published by other researchers (Schultz and Netzeva, 2004). It consists of almost 400 substituted benzenes representing several mechanisms of toxic action. Some compounds were reported by Schultz and Netzeva as not toxic at saturation; hence these compounds were not used in the present work. A horizontal validation was performed using a training set of 313 benzene derivatives for model development and a validation set (79 compounds) to assess the predictive capability of the QSAR models. In order to split the database into training and prediction series, a k-means cluster analysis (k-MCA) was carried out for the entire dataset to design, in a rational and representative way, the training (learning) and prediction (test) series (Johnson and Wichern, 1988; Mc Farland and Gans, 1995).

2.3. Chemometric methods

2.3.1. Cluster analysis

Cluster analysis (CA) is the name of a group of methods used to recognize similarities among cases (objects) or among variables and to single out some categories as a set of similar cases (or variables) (Xu and Hagler, 2002). CA comprises a number of different 'classification algorithms' and allows the data to be organized into subsystems. These algorithms are grouped into two categories: hierarchical clustering and partitional (non-hierarchical) clustering. Hierarchical clustering rearranges objects in a tree structure (joining clustering) in an agglomerative (bottom-up) procedure. On the other hand, partitional clustering assumes that the objects have non-hierarchical characters (Johnson and Wichern, 1988; Mc Farland and Gans, 1995; STATISTICA version 6.0, 2001; Xu and Hagler, 2002).

The most used cluster algorithms are the k-means cluster analysis (k-MCA) and the Jarvis–Patrick algorithm (also known as k-nearest neighbor cluster analysis; k-NNCA). In our case, in order to design training and test series that guarantee structural and toxicity variability in both series of the present database, we carried out both kinds of cluster analyses (k-MCA and k-NNCA) for the entire dataset of compounds (Johnson and Wichern, 1988; Mc Farland and Gans, 1995; STATISTICA version 6.0, 2001; Xu and Hagler, 2002). The number of members in each cluster and the standard deviation of the variables in the cluster (kept as low as possible) were taken into account, to have an acceptable statistical quality of data partition into clusters. The values of the standard deviation (SD) between and within clusters, those of the respective Fisher ratio and their p-level of significance were also examined (Johnson and Wichern, 1988; Mc Farland and Gans, 1995; STATISTICA version 6.0, 2001; Xu and Hagler, 2002). Finally, before carrying out the cluster processes, all the variables were standardized. In standardization, all values of selected variables (molecular descriptors) were replaced by standardized values, which are computed as follows: Std. score = (raw score − mean)/Std. deviation.
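The standardization formula above can be written out as a short sketch. It assumes the sample standard deviation (n − 1 in the denominator), which the text does not specify; the function names and the toy descriptor table are illustrative.

```python
from statistics import mean, stdev


def standardize(values):
    """Replace each raw score by (raw score - mean) / standard deviation.

    Uses the sample standard deviation (n - 1), which is an assumption here.
    """
    center = mean(values)
    spread = stdev(values)
    return [(value - center) / spread for value in values]


def standardize_columns(rows):
    """Standardize every descriptor (column) of a compounds-by-descriptors table."""
    columns = [standardize(list(column)) for column in zip(*rows)]
    return [list(row) for row in zip(*columns)]


# Toy table: three compounds (rows) described by two descriptors (columns).
table = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
]
print(standardize_columns(table))
```

After this step every descriptor has zero mean and unit standard deviation, so descriptors with large numerical ranges do not dominate the Euclidean distances used for clustering.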

2.3.2. Multiple linear regression

In the prediction of aquatic toxicity against T. pyriformis, multiple linear regression (MLR) analysis was used as the statistical method. The analysis was performed with the software package STATISTICA version 6.0 (2001). The tolerance parameter (proportion of variance that is unique to the respective variable) was set to the default value for minimum acceptable tolerance, which is 0.01. A forward stepwise procedure was fixed as the strategy for variable selection. The principle of maximal parsimony (Occam's razor) was taken into account as a strategy for model selection. Therefore, we selected the model with the highest statistical significance but with as few parameters ($a_k$) as possible. Log (IGC50)^-1 values (decimal logarithm of the inverse 50% growth inhibitory concentration), with IGC50 reported in mM, were used as the dependent variable.

The quality of the models was determined by examining the regression's statistical parameters and those of the cross-validation procedures (Belsey et al., 1980; Wold and Erikson, 1995). The following parameters were verified: the correlation coefficient ($R$), the determination coefficient or squared correlation coefficient ($R^2$), the Fisher ratio's p-level [$p(F)$], the standard deviation of the regression ($s$) and the LOO press statistics ($q^2$, $s_{cv}$). The predictive power of the obtained models was assessed by using an external prediction (test) set.
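As a rough illustration of how these statistics relate to one another, the sketch below fits an ordinary least-squares model and computes $R^2$, $s$ and the LOO $q^2$. It assumes the common textbook definitions ($s$ with $n - p - 1$ degrees of freedom, $q^2 = 1 - \mathrm{PRESS}/\mathrm{SS}_{tot}$); the exact formulas used by STATISTICA, in particular for $s_{cv}$, are not given in the text, so $s_{cv}$ is left out. The data and function names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit(X, y):
    """Least-squares coefficients [c, a_1, ..., a_p] via the normal equations."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


def predict(coef, row):
    """Evaluate the fitted linear model for one compound."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


def regression_stats(X, y):
    """Return R2, s and the leave-one-out q2 for an MLR model."""
    n, p = len(y), len(X[0])
    coef = fit(X, y)
    y_mean = sum(y) / n
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    ss_res = sum((yi - predict(coef, r)) ** 2 for r, yi in zip(X, y))
    press = 0.0
    for i in range(n):
        # Refit without compound i, then predict the left-out compound.
        loo_coef = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - predict(loo_coef, X[i])) ** 2
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "s": (ss_res / (n - p - 1)) ** 0.5,
        "q2": 1.0 - press / ss_tot,
    }


# Hypothetical single-descriptor data, for illustration only.
X = [[0.0], [1.0], [2.0], [3.0], [4.0]]
y = [0.1, 0.9, 2.2, 2.8, 4.1]
print(regression_stats(X, y))
```

For least-squares models PRESS is never smaller than the residual sum of squares, so $q^2$ cannot exceed $R^2$; a small gap between the two is usually read as a sign of a stable model.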

3. Results and discussion

3.1. Similarity analysis and design of training and test sets

The quality of any QSAR model depends on the quality of the selected dataset. One of the most critical aspects of constructing the training set is to guarantee enough molecular diversity. In order to demonstrate the structural diversity of the dataset, we performed a hierarchical CA of the entire dataset (Johnson and Wichern, 1988; Mc Farland and Gans, 1995). The dendrogram given in Fig. 1, built using the Euclidean distance (X-axis) and complete linkage (Y-axis), illustrates the results of the k-NNCA developed for the dataset. As can be seen in the binary tree, there are a number of different subsets, which demonstrates the molecular variability of the selected chemicals in this database.

Fig. 1. A dendrogram illustrating the results for the hierarchical k-NNCA developed for the dataset.

Because of the difficulty in evaluating the output dendrogram, another kind of CA is usually performed. In this sense, and in order to split the whole group into two datasets (training and prediction sets), we performed a k-MCA. The main idea of this procedure is to partition the chemicals into several statistically representative classes of compounds. This procedure ensures that any chemical class (as determined by the clusters derived from k-MCA) will be represented in both series of compounds. This "rational" design of the training and prediction series allowed us to design two sets that are representative of the whole "experimental universe". This procedure split the dataset of benzene derivatives into 9 clusters.

Afterward, the selection of the training and prediction sets was performed by taking, in a random way, compounds belonging to each cluster. From these 392 compounds, 313 were chosen at random to form the training set. The remaining subset, composed of 79 compounds, was prepared as a test set for the external validation of the models. These compounds were never used in the development of the models. Fig. 2 illustrates graphically the above-described procedure, where a cluster analysis was performed to select a representative sample for the training and test sets.
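The random selection within clusters described above can be sketched as follows. The cluster labels are assumed to come from a previous k-MCA run; the test fraction of 0.2 (close to 79/392), the fixed seed and the function name are illustrative assumptions, not the authors' exact procedure.

```python
import random


def stratified_split(cluster_of, test_fraction=0.2, seed=0):
    """Split compound ids into training and test sets, cluster by cluster.

    cluster_of maps a compound id to its cluster label (hypothetical input,
    e.g. the result of a k-means run). Every cluster with at least two
    members contributes compounds to both sets.
    """
    rng = random.Random(seed)
    clusters = {}
    for compound_id, label in cluster_of.items():
        clusters.setdefault(label, []).append(compound_id)

    train, test = [], []
    for label in sorted(clusters):
        members = sorted(clusters[label])
        rng.shuffle(members)
        if len(members) > 1:
            n_test = max(1, round(len(members) * test_fraction))
            # Never move a whole cluster into the test set.
            n_test = min(n_test, len(members) - 1)
        else:
            n_test = 0
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


# Hypothetical labels: 392 compounds spread over 9 clusters.
labels = {compound_id: compound_id % 9 for compound_id in range(392)}
train_ids, test_ids = stratified_split(labels)
print(len(train_ids), len(test_ids))
```

Sampling inside each cluster, rather than from the pooled dataset, is what keeps every chemical class represented in both the training and the test series.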

Fig. 2. General algorithm used for designing training and test sets through k-MCA.

3.2. Predicting aquatic toxicity of benzene derivatives

As we previously pointed out, the dataset was divided into training and test sets; toxicity data to T. pyriformis for the 392 benzene derivatives are presented in Table 2, where training and validation compounds are also clearly indicated. The MLR analysis was used to develop QSAR models for the prediction of aquatic toxicity against T. pyriformis.

The models obtained by using atom-based non-stochastic linearindices are the following:

$$\begin{aligned}
\log(1/\mathrm{IGC}_{50}) = {} & -1.045(\pm 0.110) + 0.148(\pm 0.015)\,{}^{P}f_{0L}^{H}(x_E) + 0.105(\pm 0.007)\,{}^{K}f_{0}(x) \\
& - 0.362(\pm 0.028)\,{}^{P}f_{1L}^{H}(x_E) + 3.29\times10^{-2}(\pm 0.41\times10^{-2})\,{}^{P}f_{3L}(x_E) \\
& - 3.09\times10^{-3}(\pm 0.345\times10^{-3})\,{}^{P}f_{3}(x) - 4.62\times10^{-7}(\pm 1.77\times10^{-7})\,{}^{M}f_{9L}(x_E) \qquad (2)
\end{aligned}$$

$N = 313$, $R^2 = 0.721$, $s = 0.403$, $F = 131.79$, $p < 0.0001$, $q^2 = 0.687$, $s_{cv} = 0.421$

$$\begin{aligned}
\log(1/\mathrm{IGC}_{50}) = {} & -1.514(\pm 0.111) + 0.167(\pm 0.013)\,{}^{P}f_{0L}^{H}(x_E) + 0.105(\pm 0.006)\,{}^{K}f_{0}(x) \\
& - 0.360(\pm 0.024)\,{}^{P}f_{1L}^{H}(x_E) + 3.26\times10^{-2}(\pm 0.35\times10^{-2})\,{}^{P}f_{3L}(x_E) \\
& - 2.12\times10^{-3}(\pm 0.32\times10^{-3})\,{}^{P}f_{3}(x) - 5.97\times10^{-7}(\pm 1.53\times10^{-7})\,{}^{M}f_{9L}(x_E) \qquad (3)
\end{aligned}$$

$N = 307$, $R^2 = 0.791$, $s = 0.344$, $F = 189.43$, $p < 0.0001$, $q^2 = 0.781$, $s_{cv} = 0.348$, $R^2_{pred} = 0.762$

where $N$ is the size of the dataset, $R$ is the correlation coefficient, $R^2$ is the determination coefficient, $s$ is the standard deviation of the regression, $F$ is the Fisher ratio, $q^2$ ($s_{cv}$) is the squared correlation coefficient (standard deviation) of the cross-validation performed by the LOO procedure and $R^2_{pred}$ is the squared correlation coefficient for the external prediction set.
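To make the use of Eq. (3) concrete, the following sketch evaluates it for a set of descriptor values. The coefficients are those reported in Eq. (3); the dictionary keys are hypothetical names for the six descriptors, and the descriptor values themselves would have to come from the TOMOCOMD-CARDD software, so the inputs shown are placeholders rather than real compound data.

```python
# Intercept and coefficients as reported in Eq. (3).
EQ3_INTERCEPT = -1.514
EQ3_COEFFICIENTS = {
    "P_fH_0L_xE": 0.167,     # polarizability-weighted local index, order 0, with H
    "K_f_0_x": 0.105,        # Mulliken-weighted total index, order 0
    "P_fH_1L_xE": -0.360,    # polarizability-weighted local index, order 1, with H
    "P_f_3L_xE": 3.26e-2,    # polarizability-weighted local index, order 3
    "P_f_3_x": -2.12e-3,     # polarizability-weighted total index, order 3
    "M_f_9L_xE": -5.97e-7,   # mass-weighted local index, order 9
}


def predict_log_inv_igc50(descriptors):
    """Evaluate Eq. (3) for one compound given its six descriptor values."""
    return EQ3_INTERCEPT + sum(
        coefficient * descriptors[name]
        for name, coefficient in EQ3_COEFFICIENTS.items()
    )


# Placeholder descriptor values (not taken from any real compound).
example = {name: 0.0 for name in EQ3_COEFFICIENTS}
example["K_f_0_x"] = 10.0
print(round(predict_log_inv_igc50(example), 3))
```

With all heteroatom-based local indices equal to zero, as in this placeholder, only the intercept and the two total indices contribute to the prediction.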

In the development of the first non-stochastic model (Eq. (2)), compounds 020, 074, 182, 215, 335 and 354 (4-ethylbiphenyl, 4-hexylresorcinol, phenyl isothiocyanate, 4-chloro-3,5-dinitrobenzaldehyde, benzyl-4-hydroxyphenyl ketone and 4-bromophenyl-3-pyridyl ketone, respectively) were detected as statistical outliers. Once the statistical outliers were rejected, a new non-stochastic model (Eq. (3)) was obtained with better statistical parameters. As can be seen, the latter model explains almost 80% of the variance in the experimental toxicity values, whereas the former model, without outlier rejection, explains only about 72%; the latter also has a small standard deviation of 0.344, and the other statistical parameters were also improved. The predictability and stability of the models obtained with non-stochastic linear indices (Eqs. (2) and (3)) under data variation were assessed here by means of LOO cross-validation. The second non-stochastic model (Eq. (3)) showed a good value of the squared correlation coefficient, q2 = 0.781 (scv = 0.348); this value of q2 (q2 > 0.5) can be considered as proof of the high predictive ability of the model (Belsey et al., 1980; Wold and Erikson, 1995).

All toxicity-related QSARs require validation to ensure they are capable of making accurate predictions of toxicity for compounds not included in the training set. The best means of validation is an external dataset. This is the most demanding method because it requires additional testing and attention to the selection of compounds for validation (Schultz and Netzeva, 2004). Efforts should be made to have chemical diversity within the training set and to keep the chemicals in the validation set similar to those in the training set (Golbraikh and Tropsha, 2002). The training chemicals should represent the depth and breadth of all existing chemicals within the domain. The validating chemicals should also

Page 5: A novel approach to predict aquatic toxicity from molecular structure

Table 2Experimental and predicted values [Log (1/IGC50)] for the training and test set

Compounds CAS Log (1/IGC50) Training set Test set

Non-stochastic Stochastic Non-stochastic Stochastic

001 Benzene 71-43-2 -0.12 -0.388 -0.376
002 p-Xylene 106-42-3 0.25 -0.013 0.015
003 1-Phenyl-2-butanol 120055-09-6 -0.16 0.437 0.179
004 Toluene 108-88-3 0.25 -0.200 -0.182
005 n-Butylbenzene 104-51-8 1.25 0.561 0.695
006 n-Amylbenzene 538-68-1 1.79 0.820 0.979
007 Benzylamine 100-46-9 -0.24 -0.572 -0.499
008 Isopropylbenzene 98-82-8 0.69 0.272 0.308
009 6-Phenyl-1-hexanol 2430-16-2 0.87 0.870 0.959
010 5-Phenyl-1-pentanol 10521-91-2 0.42 0.611 0.742
011 α,α-Dimethylbenzenepropanol 103-05-9 -0.07 0.679 0.444
012 4-Phenyl-1-butanol 3360-41-6 0.12 0.352 0.465
013 3-Phenyl-1-propanol 122-97-4 -0.21 0.091 0.209
014 Benzyl alcohol 100-51-6 -0.83 -0.330 -0.408
015 sec-Phenethyl alcohol 98-85-1 -0.66 -0.048 -0.146
016 4-Ethylbenzyl alcohol 768-59-2 0.07 0.100 0.093
017 3-Phenyl-1-butanol 2722-36-3 0.01 0.311 0.370
018 (R)-1-phenyl-1-butanol 22144-60-1 -0.01 0.520 0.524
019 4-Biphenylmethanol 3597-91-9 0.92 0.531 0.514
020 4-Ethylbiphenyl^a 5707-44-8 1.97 0.908 1.040
021 Biphenyl 92-52-4 1.05 0.477 0.540
022 (±)-2-Phenyl-2-butanol 1565-75-9 0.06 0.519 0.484
023 (±)-1,2-Diphenyl-2-propanol 5342-87-0 0.8 1.227 1.149
024 1,1-Diphenyl-2-propanol 29338-49-6 0.75 1.217 0.951
025 3,4-Dimethylaniline 95-64-7 -0.16 -0.171 0.000
026 3-Aminobenzyl alcohol 1877-77-6 -1.13 -0.480 -0.426
027 4-Butoxyaniline 4344-55-2 0.61 0.458 0.528
028 4-Pentyloxyaniline 39905-50-5 0.97 0.716 0.843
029 4-Hexyloxyaniline 39905-57-2 1.38 0.975 1.131
030 4-Methylaniline 106-49-0 -0.05 -0.344 -0.187
031 4-Isopropylaniline 99-88-7 0.22 0.126 0.307
032 3-Ethylaniline 587-02-0 -0.03 -0.104 0.103
033 4-Ethylaniline 589-16-2 0.03 -0.101 0.116
034 3-Methylaniline 108-44-1 0.28 -0.346 -0.207
035 4-Butylaniline 104-13-2 1.07 0.416 0.695
036 (2-Bromoethyl)benzene 103-63-9 0.42 -0.266 0.406
037 2-Methylaniline 95-53-4 -0.16 -0.353 -0.037
038 2,6-Diisopropylaniline 24544-04-5 0.76 0.736 0.991
039 Aniline 62-53-3 -0.23 -0.526 -0.341
040 2-Ethylaniline 578-54-1 -0.22 -0.061 0.130
041 2,6-Diethylaniline 579-66-8 0.31 0.404 0.598
042 Thioanisole 100-68-5 0.18 -0.001 -0.152
043 4-Methoxyphenol 150-76-5 -0.14 -0.193 -0.244
044 3,4,5-Trimethylphenol 527-54-8 0.93 0.236 0.330
045 Benzyl chloride 100-44-7 0.06 0.105 -0.170
046 4-Methylanisole 104-93-8 0.25 -0.094 -0.179
047 2,3,5-Trimethylphenol 697-82-5 0.36 0.286 0.282
048 2,4,6-Trimethylphenol 527-60-6 0.42 0.341 0.272
049 4-tert-Butylphenol 98-54-4 0.91 0.574 0.593
050 4-tert-Pentylphenol 80-46-6 1.23 0.817 0.960
051 2,3,6-Trimethylphenol 2416-94-6 0.28 0.330 0.242
052 Phenetole 103-73-1 -0.14 0.040 -0.143
053 Anisole 100-66-3 -0.1 -0.275 -0.388
054 2,4-Dimethylphenol 105-67-9 0.14 0.116 0.100
055 2-Phenyl-3-butyn-2-ol 127-66-2 -0.18 0.271 0.318
056 p-Cresol 106-44-5 -0.16 -0.109 -0.009
057 4-Ethylphenol 123-07-9 0.21 0.134 0.294
058 4-Propylphenol 645-56-7 0.64 0.392 0.577
059 3-Ethylphenol 620-17-7 0.29 0.131 0.270
060 Nonylphenol 104-40-5 2.47 1.984 2.239
061 m-Cresol 108-39-4 -0.08 -0.111 -0.044
062 o-Cresol 95-48-7 -0.29 -0.066 -0.061
063 2-Ethylphenol 90-00-6 0.16 0.175 0.225
064 Phenol 108-95-2 -0.35 -0.291 -0.224
065 2-Allylphenol 1745-81-9 0.33 0.340 0.366
066 Iodobenzene 591-50-4 0.36 0.616 0.469
067 4-Chloroaniline 106-47-8 0.05 0.010 0.200
068 2-Tolunitrile 529-19-1 -0.24 -0.024 -0.051
069 4-Hydroxyphenethyl alcohol 501-94-0 -0.83 -0.084 0.040
070 2-Chloro-4-methylaniline 615-65-6 0.18 0.262 0.392
071 2-Chloroaniline 95-51-2 -0.17 0.088 0.195
072 5-Pentylresorcinol 500-66-3 1.31 0.977 1.317
073 3-Methoxyphenol 150-19-6 -0.33 -0.198 -0.228
074 4-Hexylresorcinol^a 136-77-6 1.8 0.718 1.030


J.A. Castillo-Garit et al. / Chemosphere 73 (2008) 415–427 419


075 4-Chloro-3,5-dimethylphenol 88-04-0 1.2 0.670 0.614
076 4-Bromotoluene 106-38-7 0.47 0.457 0.582
077 1-Bromo-4-ethylbenzene 1585-07-5 0.67 0.700 0.905
078 4-Chloroanisole 623-12-1 0.6 0.258 0.173
079 4-Chloro-3-methylphenol 59-50-7 0.8 0.458 0.458
080 1,3-Dihydroxybenzene 108-46-3 -0.65 -0.211 -0.067
081 Bromobenzene 108-86-1 0.08 0.275 0.385
082 4-Chlorophenol 106-48-9 0.54 0.245 0.333
083 4-Iodophenol 540-38-5 0.85 0.673 0.589
084 2-(4-Chlorophenyl)ethylamine 156-41-2 0.14 0.131 0.364
085 4-Chlorobenzylamine 104-86-9 0.16 -0.028 0.077
086 2,4-Dichloroaniline 554-00-7 0.56 0.594 0.747
087 Chlorobenzene 108-90-7 -0.13 0.166 0.200
088 3-Chloroaniline 108-42-9 0.22 0.005 0.171
089 1,2-Dimethyl-4-nitrobenzene 99-51-4 0.59 0.681 0.593
090 4-(Pentyloxy)benzaldehyde 5736-91-4 1.18 1.000 0.986
091 4-Nitrotoluene 99-99-0 0.65 0.521 0.393
092 4-Isopropylbenzaldehyde 122-03-2 0.67 0.409 0.452
093 1,2-Dimethyl-3-nitrobenzene 83-41-0 0.56 0.697 0.493
094 3-Chlorophenol 108-43-0 0.87 0.240 0.343
095 3-Nitrotoluene 99-08-1 0.42 0.514 0.379
096 2-Nitrotoluene 88-72-2 0.26 0.539 0.291
097 1,4-Dibromobenzene 106-37-6 0.68 0.894 1.083
098 Benzaldehyde 100-52-7 -0.2 -0.246 -0.245
099 3-Ethoxy-4-hydroxybenzaldehyde 121-32-4 0.02 0.252 0.112
100 3-Methoxy-4-hydroxybenzaldehyde 121-33-5 -0.03 -0.062 -0.171
101 4-Hydroxypropiophenone 70-70-2 0.12 0.478 0.366
102 2,4-Dichlorophenol 120-83-2 1.04 0.819 0.863
103 Valerophenone 1009-14-9 0.56 0.912 0.822
104 Propiophenone 93-55-0 -0.07 0.397 0.233
105 Butyrophenone 495-40-9 0.21 0.654 0.467
106 2-Hydroxybenzaldehyde 90-02-8 0.42 -0.139 -0.089
107 Heptanophenone 1671-75-6 1.56 1.429 1.405
108 Acetophenone 98-86-2 -0.46 0.042 -0.265
109 Nitrobenzene 98-95-3 0.14 0.346 0.181
110 Octanophenone 1674-37-9 1.89 1.687 1.688
111 2,5-Dichloroaniline 95-82-9 0.58 0.595 0.681
112 3,4-Dichlorotoluene 95-75-0 1.07 0.977 0.902
113 3-Nitroaniline 99-09-2 0.03 0.167 0.164
114 3,5-Dichloroaniline 626-43-7 0.71 0.507 0.684
115 4-Bromo-6-chloro-o-cresol 7530-27-0 1.28 1.109 1.147
116 1,2-Dichlorobenzene 95-50-1 0.53 0.803 0.705
117 3-Nitroanisole 555-03-3 0.72 0.410 0.137
118 Benzophenone 119-61-9 0.87 0.950 0.838
119 3-Chloro-5-methoxyphenol 65262-96-6 0.76 0.615 0.422
120 4-Nitrobenzyl chloride 100-14-1 1.18 0.816 0.364
121 2,4-Dibromophenol 615-58-7 1.4 1.006 1.196
122 2-Amino-5-chlorobenzonitrile 5922-60-1 0.44 0.136 0.297
123 2-Hydroxy-4-methoxyacetophenone 552-41-0 0.55 0.215 -0.138
124 3,5-Dichlorophenol 591-35-5 1.56 0.741 0.829
125 4-Chlorobenzaldehyde 104-88-1 0.4 0.290 0.295
126 4-Chlorobenzophenone 134-85-0 1.5 1.481 1.356
127 1,3,5-Trichlorobenzene 108-70-3 0.87 1.186 1.223
128 2,4,5-Trichloroaniline 636-30-6 1.3 1.181 1.228
129 4-Bromobenzophenone 90-90-4 1.26 1.582 1.470
130 1,2,4-Trichlorobenzene 120-82-1 1.08 1.303 1.233
131 2,4,6-Trichlorophenol 88-06-2 1.41 1.357 1.425
132 4-Ethoxy-2-nitroaniline 616-86-4 0.76 0.551 0.477
133 5-Bromovanillin 2973-76-4 0.62 0.569 0.603
134 4-Nitrophenetole 100-29-8 0.83 1.096 0.788
135 4-Chloro-2-nitrotoluene 89-59-8 0.82 1.026 0.854
136 1-Bromo-3-nitrobenzene 585-79-5 1.03 0.909 0.874
137 4-Bromo-2,6-dichlorophenol 3217-15-0 1.78 1.427 1.535
138 2-Chloro-6-nitrotoluene 83-42-1 0.68 1.060 0.789
139 2,3,5,6-Tetrachloroaniline 3481-20-7 1.76 1.801 1.758
140 3-Nitrobenzonitrile 619-24-9 0.45 0.480 0.313
141 2,4,5-Trichlorophenol 95-95-4 2.1 1.405 1.381
142 1,2,4,5-Tetrachlorobenzene 95-94-3 2 1.884 1.771
143 4-Methyl-2-nitroaniline 89-62-3 0.37 0.350 0.430
144 1-Chloro-3-nitrobenzene 121-73-3 0.73 0.840 0.702
145 2-Nitroaniline 88-74-4 0.08 0.188 0.226
146 2,3,4,5-Tetrachloroaniline 634-83-3 1.96 1.829 1.778
147 2,4,6-Tribromophenol 118-79-6 1.91 1.602 1.850
148 2-Bromo-5-nitrotoluene 7149-70-4 1.16 1.133 1.099


149 1-Fluoro-3-iodo-5-nitrobenzene 3819-88-3 1.09 1.439 1.237
150 2-Nitrophenol 88-75-5 0.67 0.411 0.391
151 2-Chloro-4-nitroaniline 121-87-9 0.75 0.733 0.709
152 5-Hydroxy-2-nitrobenzaldehyde 42454-06-8 0.33 0.468 0.424
153 3,4,5,6-Tetrabromo-o-cresol 576-55-6 2.57 2.614 2.707
154 2,3,4,6-Tetrachlorophenol 58-90-2 2.18 2.014 1.938
155 1-Fluoro-4-nitrobenzene 350-46-9 0.1 0.645 0.467
156 Pentafluoroaniline^b 771-60-8 0.26 0.964 1.456
157 1-Bromo-2-nitrobenzene 577-19-5 0.75 0.942 0.959
158 3,5-Dibromo-salicylaldehyde 90-59-5 1.65 1.072 1.333
159 3,5-Dichloro-nitrobenzene 618-62-2 1.13 1.304 1.252
160 4-Chloro-3-nitrophenol 610-78-6 1.27 0.914 0.895
161 2,3,4,5-Tetrachlorophenol 4901-51-3 2.72 2.053 1.938
162 Thiobenzamide 2227-79-4 0.09 -0.195 0.175
163 1-Chloro-4-nitrobenzene 100-00-5 0.43 0.858 0.726
164 α,α,α,4-Tetrafluoro-m-toluidine 2357-47-3 0.77 0.547 0.991
165 1-Chloro-2-nitrobenzene 88-73-3 0.68 0.878 0.788
166 4-Chloro-6-nitro-m-cresol 7147-89-9 1.63 1.084 1.073
167 Pentachlorophenol 87-86-5 2.07 2.632 2.491
168 1,3-Dinitrobenzene 99-65-0 0.76 0.987 0.733
169 2,4-Dinitrotoluene 121-14-2 0.87 1.166 0.856
170 4,5-Dichloro-2-nitroaniline 6641-64-1 1.66 1.230 1.341
171 Pentafluorophenol 771-61-9 1.63 1.177 1.649
172 Pentabromophenol 608-71-9 2.66 3.008 3.206
173 3-Chloro-4-fluoronitrobenzene 350-30-1 0.8 1.178 1.072
174 1,4-Dinitrobenzene 100-25-4 1.3 1.023 0.757
175 3,4-Dichloronitrobenzene 99-54-7 1.16 1.433 1.294
176 2,5-Dichloronitrobenzene 89-61-2 1.13 1.347 1.300
177 2,4-Dichloro-6-nitroaniline 2683-43-4 1.26 1.178 1.286
178 3,4-Dinitrobenzyl alcohol 79544-31-3 1.09 1.090 1.077
179 2,4-Dichloronitrobenzene 611-06-3 0.99 1.359 1.330
180 2,3-Dichloronitrobenzene 3209-22-1 1.07 1.446 1.344
181 1,2-Dinitrobenzene 528-29-0 1.25 0.921 0.800
182 Phenyl isothiocyanate^a,b 103-72-0 1.41 0.125 0.304
183 3-Trifluoromethyl-4-nitrophenol^b 88-30-2 1.65 0.677 0.303
184 2,6-Iodo-4-nitrophenol 305-85-1 1.81 0.972 0.998
185 2,4-Chloro-6-nitrophenol 609-89-2 1.75 1.390 1.468
186 1,3,5-Trichloro-2-nitrobenzene 18708-70-8 1.43 1.824 1.930
187 1,2,4-Trichloro-5-nitrobenzene 89-69-0 1.53 1.908 1.847
188 1,2,3-Trichloro-4-nitrobenzene 17700-09-3 1.51 2.003 1.853
189 2-Chloro-5-nitrobenzaldehyde 6361-21-3 0.53 0.947 0.936
190 Pentafluorobenzaldehyde 653-37-2 0.82 1.217 1.664
191 2,4-Dinitro-1-iodobenzene 709-49-9 2.12 1.809 1.572
192 2,3,5,6-Tetrachloronitrobenzene 117-18-0 1.82 2.424 2.447
193 2,5-Dinitrophenol 329-71-5 1.04 1.044 0.988
194 2,4-Dinitroaniline 97-02-9 0.72 0.800 0.830
195 2,3,4,5-Tetrachloronitrobenzene 879-39-0 1.78 2.513 2.411
196 1,2,3-Trifluoro-4-nitrobenzene 771-69-7 1.89 1.180 1.293
197 1,2-Dichloro-4,5-dinitrobenzene 6306-39-4 2.21 1.900 1.887
198 2,6-Dinitroaniline 606-22-4 0.84 0.800 0.836
199 4,6-Dinitro-2-methylphenol 534-52-1 1.73 1.202 1.113
200 4-tert-Butyl-2,6-dinitrophenol 4097-49-8 1.8 1.803 1.885
201 1-Bromo-2,4-dinitrobenzene 584-48-5 2.31 1.507 1.550
202 2,4-Dinitrophenol 51-28-5 1.06 1.022 0.998
203 1,5-Dichloro-2,3-dinitrobenzene 28689-08-9 2.42 1.801 1.928
204 6-Chloro-2,4-dinitroaniline 3531-19-9 1.12 1.283 1.347
205 2-Bromo-4,6-dinitroaniline 1817-73-8 1.24 1.312 1.522
206 2,3,4,6-Tetrafluoronitrobenzene 314-41-0 1.87 1.402 1.744
207 2,6-Dinitrophenol 573-56-8 0.83 1.011 1.040
208 1-Chloro-2,4-dinitrobenzene 97-00-7 2.16 1.475 1.387
209 2,4-Dinitro-1-fluorobenzene 70-34-8 1.71 1.239 1.179
210 Pentafluoronitrobenzene 880-78-4 2.43 1.630 2.213
211 1,4-Dinitrotetrachlorobenzene 20098-38-8 2.82 2.898 3.155
212 1,5-Difluoro-2,4-dinitrobenzene 327-92-4 2.08 1.470 1.628
213 1,3-Dinitro-2,4,5-trichlorobenzene 2678-21-9 2.6 2.413 2.549
214 1,3,5-Trichloro-2,4-dinitrobenzene hemihydrate 6284-83-9 2.19 2.353 2.578
215 4-Chloro-3,5-dinitrobenzaldehyde^a 1930-72-9 2.66 1.512 1.581
216 1-Phenyl-2-propanol 14898-87-4 -0.62 0.127 0.055
217 4-Methylbenzyl alcohol 589-18-4 -0.49 -0.144 -0.209
218 (±)1-Phenyl-2-pentanol 705-73-7 0.16 0.696 0.754
219 4-Isopropylbenzyl alcohol 536-60-7 0.18 0.555 0.497
220 2-(p-Tolyl)ethylamine 3261-62-9 -0.04 -0.227 -0.027
221 4-Methyl benzylamine 104-84-7 -0.01 -0.386 -0.302
222 3-Methylbenzyl alcohol 587-03-1 -0.24 -0.145 -0.215


223 3-Phenyl-2-propen-1-ol 104-54-1 -0.08 -0.011 0.000
224 4-tert-Buthylbenzyl alcohol 877-65-6 0.48 0.542 0.391
225 4-Methylphenetyl alcohol 699-02-5 -0.26 0.014 0.081
226 1-Phenylethylamine 618-36-0 -0.18 -0.290 -0.195
227 2-Methylbenzyl alcohol 89-95-2 -0.43 -0.154 -0.237
228 2-Methyl-1-phenyl-2-propanol 100-86-7 -0.41 0.412 0.142
229 N-Methylphenethylamine 589-08-2 -0.41 -0.512 -0.454
230 β-Methylphenethylamine 582-22-9 -0.28 -0.137 0.021
231 (±)-1-Phenyl-1-butanol 22135-49-5 -0.09 0.520 0.524
232 (±)-1-Phenyl-1-propanol 93-54-9 -0.43 0.262 0.267
233 Phenetyl alcohol 60-12-8 -0.59 -0.174 -0.116
234 2-Phenyl-1-propanol 1123-85-9 -0.4 0.105 0.125
235 2-Phenyl-2-propanol 617-94-7 -0.57 0.218 0.025
236 2-Phenyl-1-butanol 89104-46-1 -0.11 0.355 0.390
237 Benzhydrol 91-01-0 0.5 0.792 0.738
238 Benzaldoxime 622-32-2 -0.11 -0.239 -0.347
239 3,5-Dimethylaniline 108-69-0 -0.36 -0.166 -0.032
240 4-tert-Buthylaniline 769-92-6 0.36 0.338 0.415
241 2,4-Dimethylaniline 95-68-1 -0.29 -0.120 -0.009
242 4-Phenylbutyronitrile 2046-18-6 0.15 0.549 0.328
243 2,4,6-Trimethylaniline 88-05-1 -0.05 0.104 0.160
244 3-Phenylpropionitrile 645-59-0 -0.16 0.276 0.110
245 4-sec-Butylaniline 30273-11-1 0.61 0.414 0.594
246 2,3-Dimethylaniline 87-59-2 -0.43 -0.130 -0.016
247 Benzyl cyanide 140-29-4 -0.36 -0.027 -0.120
248 2,5-Dimethylaniline 95-78-3 -0.33 -0.121 -0.019
249 α-Methylbenzyl cyanide 1823-91-2 0.01 0.145 0.086
250 2-Isopropylaniline 643-28-7 0.12 0.164 0.258
251 2,6-Dimethylaniline 87-62-7 -0.43 -0.078 -0.027
252 N-ethylaniline 103-69-5 0.07 -0.197 -0.146
253 2-Propylaniline 1821-39-2 0.08 0.197 0.377
254 N-Methylaniline 100-61-8 0.06 -0.512 -0.454
255 2-Amino-4-tert-buthylaniline 1199-46-8 0.37 0.448 0.574
256 2-Methoxyaniline 90-04-0 -0.69 -0.392 -0.451
257 3-Phenylpyridine 1008-88-4 0.47 0.318 0.223
258 2-Aminobenzyl alcohol 5344-90-1 -1.07 -0.440 -0.404
259 2-Benzylpyridine 101-82-6 0.38 0.781 0.525
260 3,5-Di-tert-buthylphenol 1138-52-9 1.64 1.430 1.356
261 Phenyl propargyl sulfide 5651-88-7 0.54 0.407 0.228
262 4-Ethoxyphenol 622-62-8 0.01 0.122 -0.059
263 4-Buthoxyphenol 122-94-1 0.7 0.698 0.646
264 4-Benzylpyridine 2116-65-6 0.63 0.595 0.455
265 2-Phenylpyridine 1008-89-5 0.27 0.530 0.304
266 3,4-Dimethylphenol 95-65-8 0.12 0.064 0.127
267 3-tert-Buthylphenol 585-34-2 0.74 0.570 0.545
268 3,5-Dimethylphenol 108-68-9 0.11 0.070 0.085
269 6-tert-Buthyl-2,4-dimethylphenol 1879-09-0 1.16 1.015 0.797
270 4-Isopropylphenol 99-89-8 0.47 0.361 0.442
271 3-Isopropylphenol 618-45-1 0.61 0.358 0.431
272 2,3-Dimethylphenol 526-75-0 0.12 0.106 0.084
273 2,5-Dimethylphenol 95-87-4 0.14 0.114 0.081
274 4-Hydroxy-3-methoxybenzyl alcohol 498-00-0 -0.7 -0.132 -0.349
275 2-Isopropylphenol 88-69-7 0.61 0.400 0.355
276 3-Amino-2-cresol 53222-92-7 -0.55 -0.186 -0.107
277 4-Chloro-2-methylaniline 95-69-2 0.35 0.226 0.326
278 2-Methoxy-4-propenylphenol 97-54-1 0.75 0.355 0.273
279 2,4,6-tris(Dimethylaminomethyl)phenol 90-72-2 -0.52 -0.299 0.134
280 2-Fluoroaniline 348-54-9 -0.37 -0.179 -0.108
281 4-Aminobenzyl cyanide 3544-25-0 -0.76 -0.175 -0.125
282 3-Iodoaniline 626-01-7 0.65 0.429 0.406
283 3-Cinnamonitrile 4360-47-8 0.16 0.164 0.097
284 3-Fluorobenzyl alcohol 456-47-3 -0.39 -0.013 -0.162
285 3-Cyanoaniline 2237-30-1 -0.47 -0.351 -0.245
286 4-Fluorophenol 371-41-5 0.02 0.025 -0.006
287 2-Iodoaniline 615-43-0 0.35 0.594 0.427
288 3-Fluoroaniline 372-19-0 -0.1 -0.214 -0.112
289 4-Chloro-2-methylphenol 1570-64-5 0.7 0.462 0.426
290 4-Chloro-3-ethylphenol 14143-32-9 1.08 0.697 0.749
291 2-Chloro-4,5-dimethylphenol 1124-04-5 0.69 0.653 0.687
292 3,5-Dimethoxyphenol 500-99-2 -0.09 -0.129 -0.263
293 4-Hydroxybenzyl cyanide 14191-95-8 -0.38 0.060 0.006
294 4-Bromo-2,6-dimethylphenol 2374-05-2 1.16 0.774 0.761
295 2-Bromobenzyl alcohol 18982-54-2 0.1 0.345 0.322
296 2-Chloro-5-methylphenol 615-74-7 0.54 0.487 0.485
297 2-Fluorophenol 367-12-4 0.19 0.046 0.045


298 4-(Dimethylamino)benzaldehyde 100-10-7 0.23 -0.276 -0.298
299 4-Bromophenol 106-41-2 0.68 0.344 0.455
300 3-Chloro-2-methylaniline 95-79-4 0.5 0.224 0.342
301 3-Chloro-4-methylaniline 95-74-9 0.39 0.220 0.330
302 3-Chloro-2-methylaniline 87-60-5 0.38 0.260 0.312
303 4-Chlorophenethyl alcohol 1875-88-3 0.32 0.372 0.450
304 4-Chlorobenzyl alcohol 873-76-7 0.25 0.213 0.146
305 2-Bromo-4-methylphenol 6627-55-0 0.6 0.603 0.684
306 1,3,5-Trimethyl-2-nitrobenzene 603-71-4 0.86 0.905 0.559
307 3-Chlorobenzyl alcohol 873-63-2 0.15 0.209 0.132
308 2-Bromophenol 95-56-7 0.33 0.429 0.478
309 4-Hydroxy-3-methoxybenzonitrile 4421-08-3 -0.03 -0.020 -0.159
310 3-Nitrobenzyl alcohol 619-25-0 -0.22 0.373 0.142
311 4-Bromophenyl acetonitrile 16532-79-9 0.6 0.619 0.534
312 4-Methoxybenzonitrile 874-90-8 0.1 -0.088 -0.282
313 2-Hydroxy-4,5-dimethylacetophenone 36436-65-4 0.71 0.485 0.267
314 2-Anisaldehyde 135-02-4 0.15 -0.130 -0.257
315 4-Chlororesorcinol 95-88-5 0.13 0.374 0.433
316 Methyl-4-methylaminobenzoate 18358-63-9 0.31 -0.150 0.061
317 4-Phenoxybenzaldehyde 67-36-7 1.26 1.030 0.837
318 3-Hydroxy-4-methoxybenzaldehyde 621-59-0 -0.14 -0.062 -0.171
319 4-Biphenylcarboxaldehyde 3218-36-8 1.12 0.611 0.678
320 2,4,5-Trimethoxybenzaldehyde 4460-86-0 -0.1 0.018 -0.342
321 4-Benzoylaniline 1137-41-3 0.68 0.794 0.839
322 3-Anisaldehyde 5991-31-1 0.23 -0.157 -0.289
323 n-Propyl cinnamate 7778-83-8 1.23 0.908 1.160
324 (Trans)ethyl cinnamate 103-36-6 0.99 0.592 0.683
325 Hexanophenone 942-92-7 1.19 1.170 1.112
326 n-Butyl cinnamate 538-65-8 1.53 1.166 1.427
327 4-Chlorobenzyl cyanide 140-53-4 0.66 0.514 0.403
328 (Trans)methyl cinnamate 103-26-4 0.58 0.277 0.558
329 Ethyl-4-methoxybenzoate 94-30-4 0.77 0.401 0.235
330 Phenylacetic acid hydrazide 937-39-3 -0.48 -0.524 -0.438
331 3-Hydroxybenzaldehyde 100-83-4 0.08 -0.170 -0.126
332 2,6-Dichlorophenol 87-65-0 0.73 0.882 0.851
333 Benzyl methacrylate 2495-37-6 0.65 0.925 0.878
334 Isoamyl-4-hydroxybenzoate 6521-30-8 1.48 1.205 1.290
335 Benzyl-4-hydroxyphenyl ketone^a,b 2491-32-9 1.07 3.261 3.304
336 Benzyl benzoate 120-51-4 1.45 1.166 1.208
337 4-Benzoylphenol 1137-42-4 1.02 1.029 0.968
338 2-Methyl-5-nitrophenol 5428-54-6 0.66 0.612 0.470
339 3-Acetoamidophenol 621-42-1 -0.16 -0.051 -0.243
340 4-Cyanobenzamide 3034-34-2 -0.38 -0.189 -0.177
341 2-Nitrobiphenyl 86-00-0 1.3 1.125 1.100
342 5-Chloro-2-hydroxybenzamide 7120-43-6 0.59 0.212 0.378
343 3-Nitrophenol 554-84-7 0.51 0.401 0.292
344 Phenyl-1,3-dialdehyde 626-19-7 0.18 -0.123 -0.090
345 Ethyl-4-bromobenzoate 5798-75-4 1.33 0.935 1.000
346 2,4-Dihydroxyacetophenone 89-84-9 0.25 0.204 0.017
347 3-Chlorobenzophenone 1016-78-0 1.55 1.469 1.364
348 Phenyl-4-hydroxybenzoate 17696-62-7 1.37 1.222 1.199
349 Phenyl benzoate 93-99-2 1.35 1.145 1.070
350 2-Hydroxy-4-methoxybenzophenone 131-57-7 1.42 1.107 1.007
351 Benzylidene malononitrile 2700-22-3 0.64 0.319 0.471
352 4-Nitrophenyl phenyl ether 620-88-2 1.58 1.603 1.238
353 Resorcinol monobenzoate 136-36-7 1.11 1.217 1.212
354 4-Bromophenyl-3-pyridyl ketone^a,b 14548-45-9 0.82 3.887 3.529
355 3-Nitroacetophenone 121-89-1 0.32 0.720 0.302
356 3-Nitrobenzaldehyde 99-61-6 0.11 0.440 0.316
357 Ethyl phenylcyanoacetate 4553-07-5 -0.02 0.659 0.786
358 2-Nitroanisole 91-23-6 -0.07 0.412 0.173
359 3-Methyl-2-nitrophenol 4920-77-8 0.61 0.594 0.442
360 2,5-Diphenyl-1,4-benzoquinone 844-51-9 1.48 1.471 1.661
361 2-Nitrobenzamide 610-15-1 -0.72 0.252 0.222
362 Methyl-2,5-dichlorobenzoate 2905-69-3 0.81 1.012 1.360
363 2-Nitrobenzaldehyde 552-89-6 0.17 0.422 0.288
364 4-Methyl-2-nitrophenol 119-33-5 0.57 0.573 0.564
365 2,2′,4,4′-Tetrahydroxybenzophenone 131-55-5 0.96 1.239 1.484
366 4-Nitrobenzaldehyde 555-16-8 0.2 0.458 0.306
367 5-Methyl-2-nitrophenol 700-38-9 0.59 0.578 0.545
368 3,5-Dichlorosalicylaldehyde 90-60-8 1.55 0.912 1.009
369 2-(Benzylthio)-3-nitropyridine 69212-31-3 1.72 1.855 1.406
370 Ethyl-4-nitrobenzoate 99-77-4 0.71 1.004 0.822
371 2,4-Dichlorobenzaldehyde 874-42-0 1.04 0.811 0.868


372 2′,3′,4′-Trichloroacetophenone 13608-87-2 1.34 1.767 1.443
373 2,2′-Dihydroxybenzophenone 835-11-0 1.16 1.117 1.179
374 Methyl-4-nitrobenzoate 619-50-1 0.39 0.689 0.674
375 2-Chloromethyl-4-nitrophenol 2973-19-5 0.75 0.896 0.474
376 α,α,α-Trifluoro-p-cresol 402-45-9 0.62 0.473 0.744
377 Dimethylnitroterephthalate 5292-45-5 0.43 0.948 1.091
378 Thioacetanilide 637-53-6 -0.01 0.085 0.210
379 2-Nitroresorcinol 601-89-8 0.66 0.455 0.563
380 3,5-Dibromo-4-hydroxybenzonitrile 1689-84-5 1.16 1.112 1.156
381 Pentafluorobenzyl alcohol^c 440-60-8 -0.2 -0.61 1.371
382 Methyl-4-chloro-2-nitrobenzoate 42087-80-9 0.82 1.084 1.311
383 1-Fluoro-2-nitrobenzene 1493-27-2 0.23 0.631 0.533
384 α,α,α-Tetrafluoro-o-toluidine 393-39-5 -0.02 0.547 0.951
385 3-Hydroxy-4-nitrobenzaldehyde 704-13-2 0.27 0.501 0.509
386 2,5-Dibromonitrobenzene 3460-18-2 1.37 1.455 1.640
387 Benzoyl cyanide 613-90-1 0.31 -0.049 0.156
388 4,5-Difluoro-2-nitroaniline 78056-39-0 0.75 0.734 0.863
389 2,5-Difluoronitrobenzene 364-74-9 0.33 0.900 0.816
390 2,4-Dibromo-6-nitroaniline 827-23-6 1.62 1.277 1.604
391 4-Hydroxy-3-nitrobenzaldehyde 3011-34-5 0.61 0.490 0.531
392 Benzoyl isothiocyanate 532-55-8 0.1 0.499 0.745

^a Statistical outliers for Eq. (2).
^b Statistical outliers for Eq. (4).
^c Statistical outliers for Eq. (5).


represent the distribution of existing chemicals within the training domain. In this exercise, CA was used to assess both diversity for training and representation for validation.

The main function of the horizontal validation is to prove the robustness of the model. Previous reports have recognized that external validation is the only way to establish the real predictability of a model (Golbraikh and Tropsha, 2002). An external set of 79 benzene derivatives was therefore used as a test set to judge the predictability of the best model (Eq. (3)). The determination coefficient for the test set ($R^2_{pred}$) obtained with this model was 0.762; the good prediction for the test compounds confirms the significance of the selected molecular descriptors and of the model based on them. The predicted values for the compounds of the test set, using the non-stochastic linear indices (Eq. (3)), are shown in Table 2.
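The external determination coefficient mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper does not state which definition of $R^2_{pred}$ was used, so the common form $1 - \mathrm{PRESS}/\mathrm{SS}$ with the training-set mean as reference is an assumption, and the function name `r2_pred` is hypothetical.

```python
import numpy as np

def r2_pred(y_obs, y_hat, y_train_mean):
    """External R^2 (assumed definition): 1 - PRESS / SS.

    PRESS is the sum of squared prediction errors on the test set;
    SS is the sum of squared deviations of the observed test values
    from the mean activity of the TRAINING set.
    """
    y_obs = np.asarray(y_obs, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    press = np.sum((y_obs - y_hat) ** 2)
    ss = np.sum((y_obs - y_train_mean) ** 2)
    return 1.0 - press / ss
```

Under this definition a model that predicts every test compound exactly scores 1, and a model that always returns the training-set mean scores 0.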

The models obtained by using atom-based stochastic linear indices are the following:

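$$
\begin{aligned}
\log(1/\mathrm{IGC}_{50}) ={} & -1.081(\pm 0.100) + 9.13\times10^{-3}(\pm 1.46\times10^{-3})\,{}^{Ms}f_{15}(x) + 1.500(\pm 0.138)\,{}^{Ks}f_{0}^{H}(x) \\
& - 0.700(\pm 0.065)\,{}^{Ks}f_{5}^{H}(x) - 0.881(\pm 0.096)\,{}^{Gs}f_{0}^{H}(x) \\
& - 6.51\times10^{-2}(\pm 0.67\times10^{-2})\,{}^{Vs}f_{4L}(x_E) + 7.05\times10^{-2}(\pm 0.73\times10^{-2})\,{}^{Vs}f_{2L}^{H}(x_E)
\end{aligned}
\tag{4}
$$

$N = 313$, $R^2 = 0.733$, $s = 0.394$, $F = 139.94$, $p < 0.0001$, $q^2 = 0.704$, $s_{cv} = 0.411$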

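$$
\begin{aligned}
\log(1/\mathrm{IGC}_{50}) ={} & -1.471(\pm 0.098) + 1.13\times10^{-2}(\pm 0.13\times10^{-2})\,{}^{Ms}f_{15}(x) + 1.242(\pm 0.125)\,{}^{Ks}f_{0}^{H}(x) \\
& - 0.634(\pm 0.057)\,{}^{Ks}f_{5}^{H}(x) - 0.663(\pm 0.088)\,{}^{Gs}f_{0}^{H}(x) \\
& - 7.59\times10^{-2}(\pm 0.60\times10^{-2})\,{}^{Vs}f_{4L}(x_E) + 7.95\times10^{-2}(\pm 0.65\times10^{-2})\,{}^{Vs}f_{2L}^{H}(x_E)
\end{aligned}
\tag{5}
$$

$N = 308$, $R^2 = 0.799$, $s = 0.343$, $F = 198.88$, $p < 0.0001$, $q^2 = 0.786$, $s_{cv} = 0.350$, $R^2_{pred} = 0.797$ (6)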

In the development of the first stochastic model (Eq. (4)), compounds 156, 182, 183, 335 and 354 (pentafluoroaniline, phenyl isothiocyanate, 3-trifluoromethyl-4-nitrophenol, benzyl-4-hydroxyphenyl ketone and 4-bromophenyl-3-pyridyl ketone, respectively) were detected as statistical outliers. Once these outliers had been removed, a new stochastic model (Eq. (5)) was obtained with better values for all statistical parameters. Notice that this second model (like the non-stochastic model, Eq. (3)) explains almost 80% of the variance of the experimental toxicity values, whereas the first model, obtained with the entire dataset, explains only about 73%. In order to assess the predictability and stability of the models obtained with stochastic linear indices, a LOO cross-validation procedure was carried out. The second stochastic model achieved good values of the PRESS statistics, $q^2$ = 0.786 and $s_{cv}$ = 0.350.
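The leave-one-out procedure described above can be sketched for an ordinary least-squares model as follows. This is an illustrative sketch under stated assumptions, not the software used in the paper: the function name `loo_q2` is hypothetical, $q^2$ is taken as $1 - \mathrm{PRESS}/\mathrm{SS}_{tot}$ with the overall mean as reference, and the descriptor matrix is assumed to have full column rank. For linear regression, each leave-one-out residual equals the ordinary residual divided by $1 - h_{ii}$, where $h_{ii}$ is the leverage, so no refitting loop is needed.

```python
import numpy as np

def loo_q2(X, y):
    """Leave-one-out q^2 for multiple linear regression with intercept.

    X: (n, k) descriptor matrix; y: (n,) observed activities.
    Uses the identity e_loo_i = e_i / (1 - h_ii) instead of n refits.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    A = np.column_stack([np.ones(len(y)), X])   # add intercept column
    H = A @ np.linalg.pinv(A)                   # hat (projection) matrix
    resid = y - H @ y                           # ordinary residuals
    loo_resid = resid / (1.0 - np.diag(H))      # leave-one-out residuals
    press = np.sum(loo_resid ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / ss_tot
```

A value of $q^2$ close to $R^2$, as reported for Eq. (5) (0.786 versus 0.799), suggests that no single compound dominates the fit.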

As pointed out above, the predictive power of a QSAR model has to be estimated using an external test set. The real predictive power of the stochastic linear indices model (Eq. (5)) was therefore validated with the same external test set of 79 compounds. Here, pentafluorobenzyl alcohol was detected as a statistical outlier and removed from the analysis, which yields a final $R^2_{pred}$ value of 0.797. The values predicted for the training and test sets using stochastic linear indices (Eq. (5)) are also shown in Table 2.
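As a hedged illustration of how Eq. (5) would be applied to a new compound, the sketch below evaluates the regression from a set of precomputed descriptor values. The coefficients are those of Eq. (5) as read from the text. The signs follow the reading in which each term is preceded by an explicit plus or minus. The dictionary keys and the function name are hypothetical labels introduced here. The descriptor values themselves would have to come from the TOMOCOMD–CARDD program and are not reproduced in this section.

```python
# Coefficients of Eq. (5); keys are hypothetical labels for the six
# stochastic linear indices (assumption: signs as read from the text).
EQ5_INTERCEPT = -1.471
EQ5_COEFFICIENTS = {
    "Msf15": 1.13e-2,    # total index of order 15, atomic-mass weighting
    "KsfH0": 1.242,      # total index of order 0, H-atoms included
    "KsfH5": -0.634,     # total index of order 5, H-atoms included
    "GsfH0": -0.663,     # total index of order 0, H-atoms included
    "Vsf4L": -7.59e-2,   # local (heteroatom) index of order 4
    "VsfH2L": 7.95e-2,   # local (heteroatom) index of order 2, H-atoms included
}

def predict_log_inv_igc50(descriptors):
    """Return log(1/IGC50) predicted by Eq. (5) for one compound.

    descriptors: mapping from the six labels above to numeric values.
    """
    missing = set(EQ5_COEFFICIENTS) - set(descriptors)
    if missing:
        raise KeyError(f"missing descriptor values: {sorted(missing)}")
    return EQ5_INTERCEPT + sum(
        coef * descriptors[name] for name, coef in EQ5_COEFFICIENTS.items()
    )
```

With all six descriptors set to zero the function returns the intercept, which is a quick sanity check on the coefficient table.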

3.3. Comparison with other approaches

The use of atom-based non-stochastic and stochastic linear indices for the prediction of the aquatic toxicity of benzene derivatives against T. pyriformis was compared with other methods implemented in the Dragon software (Todeschini et al., Dragon Software version 2.1, 2002). Five kinds of indices were specifically used: Topological, BCUT, Gálvez’s topological charge indices, 2D Autocorrelations and Molecular Walk Counts; all these families are made up of two-dimensional molecular descriptors, like the linear indices.


Table 3
Statistical parameters of the QSAR models obtained using different molecular descriptors to predict aquatic toxicity

Index | Descriptors | N | R2 | s | F | q2 | scv | Eq. no.
Non-stochastic linear indices | Pf H0L(xE), Kf0(x), Pf H1L(xE), Pf3L(xE), Pf3(x), Mf H9L(xE) | 313 | 0.721 | 0.403 | 131.79 | 0.687 | 0.421 | 2
Stochastic linear indices | Msf15(x), Ksf H0(x), Ksf H5(x), Gsf H0(x), Vsf H4L(xE), Vsf H2L(xE) | 313 | 0.733 | 0.394 | 139.94 | 0.704 | 0.411 | 4
2D autocorrelations | ATS3v, ATS8v, ATS3e, ATS8e, MATS1e, GATS1m | 313 | 0.609 | 0.476 | 79.54 | 0.585 | 0.486 | 6
BCUT | BEHm7, BELm4, BELm6, BELv4, BELe6, BEHp6 | 313 | 0.690 | 0.424 | 113.56 | 0.675 | 0.431 | 7
Gálvez topological charge indices | GGI2, GGI6, GGI8, JGI2, JGI5, JGI8 | 313 | 0.516 | 0.530 | 54.30 | 0.478 | 0.545 | 8
Topological descriptors | ISIZ, X2sol, S2K, PW2, TIC1, pilD | 313 | 0.716 | 0.406 | 128.70 | 0.682 | 0.423 | 9
Molecular walk count | MWC08, MWC09, TWC, SRW10 | 313 | 0.346 | 0.614 | 40.80 | 0.303 | 0.630 | 10


These five models were developed with the same dataset that was used for the linear indices models and, in most of the approaches, with the same number of variables in the equation. The five models obtained are shown in Table 3 together with their statistical parameters.

The comparison was based on the quality of the statistical parameters of the regression. The present approach showed the highest values of the squared correlation coefficient, 0.733 (s = 0.394) and 0.721 (s = 0.403), with stochastic and non-stochastic linear indices, respectively. The values of $R^2$ (s) achieved by the other models, Eqs. (6)–(9), lie between 0.516 (0.530) and 0.716 (0.406); the model obtained with molecular walk count descriptors (Eq. (10)) was not included in the comparison because of its poor performance. All these results are summarized in Table 3, where a detailed comparison can be made more easily.

The models obtained with Dragon’s molecular descriptors were validated using a LOO cross-validation procedure. The atom-based linear indices models achieve the best values of the PRESS statistics, $q^2$ and $s_{cv}$, so our models have better statistical parameters than the models obtained with Dragon’s molecular descriptors. The model obtained with stochastic linear indices showed the highest value of $q^2$ (0.704) and the lowest value of $s_{cv}$ (0.411); the model obtained with non-stochastic linear indices behaved similarly, with $q^2$ = 0.687 and $s_{cv}$ = 0.421. For the other models, $q^2$ ranges from 0.478 to 0.682 and $s_{cv}$ from 0.423 to 0.545.
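The comparison drawn from Table 3 can be reproduced with a few lines of code. The sketch below simply transcribes the $R^2$ and $q^2$ columns of Table 3 and ranks the seven descriptor families; the variable and function names are illustrative only.

```python
# (descriptor family, R2, q2) transcribed from Table 3; all models use N = 313.
TABLE3 = [
    ("Non-stochastic linear indices", 0.721, 0.687),
    ("Stochastic linear indices", 0.733, 0.704),
    ("2D autocorrelations", 0.609, 0.585),
    ("BCUT", 0.690, 0.675),
    ("Galvez topological charge indices", 0.516, 0.478),
    ("Topological descriptors", 0.716, 0.682),
    ("Molecular walk count", 0.346, 0.303),
]

def rank_by(column):
    """Return family names sorted from best to worst on 'r2' or 'q2'."""
    index = {"r2": 1, "q2": 2}[column]
    return [row[0] for row in sorted(TABLE3, key=lambda r: r[index], reverse=True)]

ranking_q2 = rank_by("q2")
ranking_r2 = rank_by("r2")
```

Both rankings place the two linear indices models first and the molecular walk count model last, which is the ordering the text describes.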

We now briefly discuss the presence of outliers in the developed QSAR models. Outliers are useful in QSAR development because they help to establish the chemical domain of the model. Outliers from a QSAR are compounds that do not fit the model, or that are poorly predicted by it (Egan and Morgan, 1998). There are several potential reasons for a chemical to be an outlier from a QSAR. Customarily, such compounds have been recognized as acting by a mechanism of action different from that of the other chemicals, which are well modeled by the QSAR. Examples of outliers from toxicological QSARs abound for all endpoints, and have actually been extremely useful in their development. In the 1980s and more recently, the analysis of outliers proved to be the spur for the further analysis and identification of mechanisms of action (Cronin and Schultz, 2003).

Several methods can be used to highlight outliers, including, at the most basic level, the identification of those compounds with significantly high standardized residuals in regression-based techniques. In this work, outlier detection was performed using the following standard statistical tests: residual, standardized residual, Mahalanobis distance, deleted residual and Cook’s distance (Belsey et al., 1980; STATISTICA version 6.0, 2001). After their identification, outliers were removed from the dataset and the QSAR was recalculated (as described above).
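Several of the diagnostics listed above can be computed directly from the hat matrix of the regression. The following is a minimal sketch under stated assumptions, not the STATISTICA procedure used in the paper: the function name is hypothetical, the "standardized" residual is taken here as the internally studentized residual (STATISTICA may scale it differently), and the Mahalanobis distance is omitted.

```python
import numpy as np

def regression_diagnostics(X, y):
    """Residual-based outlier diagnostics for least-squares regression.

    Returns a dict with raw residuals, leverages, internally studentized
    residuals, deleted (leave-one-out) residuals and Cook's distances.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    A = np.column_stack([np.ones(len(y)), X])    # design matrix with intercept
    n, p = A.shape
    H = A @ np.linalg.pinv(A)                    # hat matrix
    h = np.diag(H)                               # leverages
    e = y - H @ y                                # raw residuals
    s2 = e @ e / (n - p)                         # residual variance
    studentized = e / np.sqrt(s2 * (1.0 - h))
    deleted = e / (1.0 - h)                      # y_i minus prediction without i
    cooks = studentized ** 2 * h / (p * (1.0 - h))
    return {"residual": e, "leverage": h, "studentized": studentized,
            "deleted": deleted, "cooks": cooks}
```

A compound with a large Cook's distance combines a poor fit with a strong influence on the coefficients, which is why removing it and recalculating the model, as done in the text, can change the statistics noticeably.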

A closer analysis of the outlier compounds showed that three compounds, 182, 335 and 354 (phenyl isothiocyanate, benzyl-4-hydroxyphenyl ketone and 4-bromophenyl-3-pyridyl ketone, respectively), were detected as outliers for both models (Eqs. (2) and (4)). Two of these compounds, 335 and 354, belong to cluster number nine. That is a logical result, because this cluster is composed of only three compounds, so the structures of these three compounds are markedly different from the rest of the structures in the whole dataset. Taking this into account, outlier behavior could be expected for these compounds, as was demonstrated in the development of the models. For that reason these compounds were included only in the training set. On the other hand, compound 183 (3-trifluoromethyl-4-nitrophenol) belongs to a chemical family (4-nitrophenols) that is known to undergo abiotic transformation (Seward et al., 2001). It is interesting to note that compounds 074 and 215 (4-hexylresorcinol and 4-chloro-3,5-dinitrobenzaldehyde) were also detected as outliers in previous reports by other authors (Schultz, 1999; Schultz and Netzeva, 2004). Other outliers without any apparent structural pattern were detected. Compound 381 (pentafluorobenzyl alcohol), which has a residual value of −1.57, was detected as an outlier for the test set with the stochastic model; this compound was previously reported as an outlier in some of the QSAR models developed by Schultz and Netzeva (2004).
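The residual quoted for compound 381 can be checked against the values listed in Table 2. The sketch below transcribes the observed and predicted log(1/IGC50) values of the outliers named in the text and computes residual = observed − predicted. It is assumed here that the two predicted values in each Table 2 row are the non-stochastic and stochastic predictions, in that order; the variable names are illustrative.

```python
# compound number -> (observed, predicted non-stochastic, predicted stochastic),
# transcribed from Table 2
OUTLIERS = {
    156: (0.26, 0.964, 1.456),   # pentafluoroaniline
    182: (1.41, 0.125, 0.304),   # phenyl isothiocyanate
    183: (1.65, 0.677, 0.303),   # 3-trifluoromethyl-4-nitrophenol
    335: (1.07, 3.261, 3.304),   # benzyl-4-hydroxyphenyl ketone
    354: (0.82, 3.887, 3.529),   # 4-bromophenyl-3-pyridyl ketone
    381: (-0.2, -0.61, 1.371),   # pentafluorobenzyl alcohol
}

# residual = observed - predicted, for each model
residuals = {
    number: {"non_stochastic": obs - pred_ns, "stochastic": obs - pred_s}
    for number, (obs, pred_ns, pred_s) in OUTLIERS.items()
}

# compound with the largest absolute residual under the stochastic model
worst_stochastic = max(residuals, key=lambda k: abs(residuals[k]["stochastic"]))
```

For compound 381 the stochastic residual is −0.2 − 1.371 = −1.571, which matches the value of −1.57 reported in the text.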

4. Conclusions

The growing need for more reliable QSAR/QSTR models to support drug discovery and chemical environmental risk assessment has been recognized in the literature (Kulkarni and Hopfinger, 1999; Devillers, 2000; Gonzalez et al., 2004). Here, atom-based linear indices were used to develop reasonably good regression-based models for predicting the aquatic toxicity of benzenes against T. pyriformis. The dataset was carefully split into training and validation sets by using non-hierarchical cluster analysis, guaranteeing enough molecular diversity in each subset. The models obtained were statistically significant and robust in terms of their $R^2$, s, $q^2$ and $s_{cv}$ values. The best model was developed with atom-based stochastic linear indices; it showed good values of $R^2$ = 0.799 and $q^2$ = 0.786. For the validation set, the high $R^2_{pred}$ values of our models (0.762 and 0.797 for the non-stochastic and stochastic models, respectively) indicate their capability to predict the aquatic toxicity of benzene derivatives, measured as impairment of the population growth of T. pyriformis. The stochastic model performs better than the non-stochastic model, but both can be used efficiently to predict the aquatic toxicity of benzene derivatives. Our method compares favorably with other approaches implemented in the Dragon software. The models obtained in the current work are certainly not ideal, because the dataset used here, although reliable and of good quality, is limited. However, the method proposed here (atom-based linear indices) could be a substitute for costly and time-consuming experiments to determine toxicity.

Acknowledgements

We sincerely thank Dr. T.W. Schultz for providing reprints of several of his papers, which contributed significantly to the development of this paper. Castillo-Garit thanks the program ‘Estades Temporals per a Investigadors Convidats’ for a fellowship to work at Valencia University in 2008.

References

Anon, 2001. White paper on a strategy for a future chemicals policy (COM (2001) 88final). Available from <http://europa.eu.int/comm/enviroment/chemicals/whitepaper.htm>.

Aptula, A.O., Roberts, D.W., 2006. Mechanistic applicability domains for nonanimal-based prediction of toxicological end points: general principles and applicationto reactive toxicity. Chem. Res. Toxicol. 19, 1097–1105.

Aptula, A.O., Roberts, D.W., Cronin, M.T., Schultz, T.W., 2005. Chemistry–toxicityrelationships for the effects of di- and trihydroxybenzenes to Tetrahymenapyriformis. Chem. Res. Toxicol. 18, 844–854.

Auer, C.M., Nabholz, J.V., Baetcke, K.P., 1990. Mode of action and the assessment ofchemical hazards in the presence of limited data: use of structure–activityrelationships (SAR) under TSCA, Section 5. Environ. Health Perspect. 87, 183–197.

Belsey, D.A., Kuh, E., Welsch, R.E., 1980. Regression Diagnostics. Wiley, New York.

Bradbury, S.P., 1995. Quantitative structure–activity relationships and ecological risk assessment: an overview of predictive aquatic toxicology research. Toxicol. Lett. 79, 229–237.

Bradbury, S.P., Russom, C.L., Ankley, G.T., Schultz, T.W., Walker, J.D., 2003. Overview of data and conceptual approaches for derivation of quantitative structure–activity relationships for ecotoxicological effects of organic chemicals. Environ. Toxicol. Chem. 22, 1789–1798.

Castillo-Garit, J.A., Marrero-Ponce, Y., Torrens, F., 2006. Atom-based 3D-chiral quadratic indices. Part 2: prediction of the corticosteroid-binding globulin binding affinity of the 31 benchmark steroids dataset. Bioorg. Med. Chem. 14, 2398–2408.

Castillo-Garit, J.A., Marrero-Ponce, Y., Torrens, F., García-Domenech, R., 2007a. Estimation of ADME properties in drug discovery: predicting Caco-2 cell permeability using atom-based stochastic and non-stochastic linear indices. J. Pharm. Sci., doi:10.1002/jps.21122.

Castillo-Garit, J.A., Marrero-Ponce, Y., Torrens, F., Rotondo, R., 2007b. Atom-based stochastic and non-stochastic 3D-chiral bilinear indices and their applications to central chirality codification. J. Mol. Graphics Model. 26, 32–47.

Chen, D., Yin, C., Wang, X., Wang, L., 2004. Holographic QSAR of selected esters. Chemosphere 57, 1739–1745.

Cheng, Y.Y., Yuan, H., 2005. Quantitative study of electrostatic and steric effects on physicochemical property and biological activity. J. Mol. Graph Model.

Comber, M.H.I., Walker, J.D., Watts, C., Hermens, J., 2003. Quantitative structure–activity relationships for predicting potential ecological hazard of organic chemicals for use in regulatory risk assessments. Environ. Toxicol. Chem. 22, 1822–1828.

Consonni, V., Todeschini, R., Pavan, M., 2002. Structure/response correlations and similarity/diversity analysis by GETAWAY descriptors. 1. Theory of the novel 3D molecular descriptors. J. Chem. Inf. Comput. Sci. 42, 682–692.

Costescu, A., Diudea, M.V., 2006. QSTR study on aquatic toxicity against Poecilia reticulata and Tetrahymena pyriformis using topological indices. Internet Electron. J. Mol. Des. 5, 116–134.

Cronin, M.T.D., Schultz, T.W., 2003. Pitfalls in QSAR. J. Mol. Struct. (Theochem.) 622, 39–51.

Cronin, M.T., Netzeva, T.I., Dearden, J.C., Edwards, R., Worgan, A.D., 2004. Assessment and modeling of the toxicity of organic chemicals to Chlorella vulgaris: development of a novel database. Chem. Res. Toxicol. 17, 545–554.

Devillers, J., 2000. New trends in (Q)SAR modeling with topological indices. Curr. Opin. Drug Discov. Dev. 3, 275–279.

DeWeese, A.D., Schultz, T.W., 2001. Structure–activity relationships for aquatic toxicity to Tetrahymena: halogen-substituted aliphatic esters. Environ. Toxicol. 16, 54–60.

Egan, W.J., Morgan, S.L., 1998. Outlier detection in multivariate analytical chemical data. Anal. Chem. 70, 2372–2379.

Golbraikh, A., Tropsha, A., 2002. Beware of q2! J. Mol. Graphics Model. 20, 269–276.

Gonzalez, M.P., Diaz, H.G., Cabrera, M.A., Ruiz, R.M., 2004. A novel approach to predict a toxicological property of aromatic compounds in the Tetrahymena pyriformis. Bioorg. Med. Chem. 12, 735–744.

Ivanciuc, O., 2004. Support vector machines prediction of the mechanism of toxic action from hydrophobicity and experimental toxicity against Pimephales promelas and Tetrahymena pyriformis. Internet Electron. J. Mol. Des. 3, 802–821.

Ivanciuc, O., 2007. Applications of support vector machines in chemistry. In: Lipkowitz, K.B., Cundari, T.R. (Eds.), Rev. Comput. Chem. Wiley-VCH, Weinheim.

Johnson, R.A., Wichern, D.W., 1988. Applied Multivariate Statistical Analysis. Prentice-Hall, Englewood Cliffs, NJ.

Kier, L.B., Hall, L.H., 1986. Molecular Connectivity in Structure–Activity Analysis. Research Studies Press, Letchworth, U.K.

Kulkarni, A.S., Hopfinger, A.J., 1999. Membrane-interaction QSAR analysis: application to the estimation of eye irritation by organic compounds. Pharm. Res. 16, 1245–1253.

Marrero-Ponce, Y., 2003. Total and local quadratic indices of the molecular pseudograph’s atom adjacency matrix: applications to the prediction of physical properties of organic compounds. Molecules 8, 687–726.

Marrero-Ponce, Y., 2004a. Linear indices of the “molecular pseudograph’s atom adjacency matrix”: definition, significance-interpretation, and application to QSAR analysis of flavone derivatives as HIV-1 integrase inhibitors. J. Chem. Inf. Comput. Sci. 44, 2010–2026.

Marrero-Ponce, Y., 2004b. Total and local (atom and atom type) molecular quadratic indices: significance interpretation, comparison to other molecular descriptors, and QSPR/QSAR applications. Bioorg. Med. Chem. 12, 6351–6369.

Marrero-Ponce, Y., Castillo-Garit, J.A., 2005. 3D-chiral atom, atom-type, and total non-stochastic and stochastic molecular linear indices and their applications to central chirality codification. J. Computer-Aided Mol. Des. 19, 369–383.

Marrero-Ponce, Y., Romero, V., 2002. TOMOCOMD software. TOMOCOMD (TOpological MOlecular COMputer Design) for Windows, version 1.0 is a preliminary experimental version; in future a professional version will be obtained upon request to Marrero: [email protected]; [email protected]. Central University of Las Villas.

Marrero-Ponce, Y., Cabrera Perez, M.A., Romero Zaldivar, V., Gonzalez Diaz, H., Torrens, F., 2004a. A new topological descriptors based model for predicting intestinal epithelial transport of drugs in Caco-2 cell culture. J. Pharm. Pharm. Sci. 7, 186–199.

Marrero-Ponce, Y., Castillo-Garit, J.A., Olazabal, E., Serrano, H.S., Morales, A., Castañedo, N., Ibarra-Velarde, F., Huesca-Guillen, A., Jorge, E., del Valle, A., Torrens, F., Castro, E.A., 2004b. TOMOCOMD–CARDD, a novel approach for computer-aided ‘rational’ drug design: I. Theoretical and experimental assessment of a promising method for computational screening and in silico design of new anthelmintic compounds. J. Computer-Aided Mol. Des. 18, 615–634.

Marrero-Ponce, Y., Castillo-Garit, J.A., Torrens, F., Romero-Zaldivar, V., Castro, E., 2004c. Atom, atom-type, and total linear indices of the “Molecular Pseudograph’s Atom Adjacency Matrix”: application to QSPR/QSAR studies of organic compounds. Molecules 9, 1100–1123.

Marrero-Ponce, Y., Díaz, H.G., Romero, V., Torrens, F., Castro, E.A., 2004d. 3D-Chiral quadratic indices of the “molecular pseudograph’s atom adjacency matrix” and their application to central chirality codification: classification of ACE inhibitors and prediction of σ-receptor antagonist activities. Bioorg. Med. Chem. 12, 5331–5342.

Marrero-Ponce, Y., Medina, R., Castro, E.A., de Armas, R., González, H., Romero, V., Torrens, F., 2004e. Protein Quadratic Indices of the “Macromolecular Pseudograph’s α-Carbon Atom Adjacency Matrix”. 1. Prediction of Arc Repressor Alanine-mutant’s Stability. Molecules 9, 1124–1147.

Marrero-Ponce, Y., Nodarse, D., González, H.D., Ramos de Armas, R., Romero-Zaldivar, V., Torrens, F., Castro, E., 2004f. Nucleic acid quadratic indices of the Macromolecular Graph’s Nucleotides Adjacency Matrix. Modeling of footprints after the interaction of paromomycin with the HIV-1 Ψ-RNA Packaging Region. Int. J. Mol. Sci. 5, 276–293.

Marrero-Ponce, Y., Cabrera, M.A., Romero-Zaldivar, V., Bermejo, M., Siverio, D., Torrens, F., 2005a. Prediction of intestinal epithelial transport of drug in (Caco-2) cell culture from molecular structure using in silico approaches during early drug discovery. Internet Electron. J. Mol. Des. 4, 124–150.

Marrero-Ponce, Y., Castillo-Garit, J.A., Olazabal, E., Serrano, H.S., Morales, A., Castanedo, N., Ibarra-Velarde, F., Huesca-Guillen, A., Sanchez, A.M., Torrens, F., Castro, E.A., 2005b. Atom, atom-type and total molecular linear indices as a promising approach for bioorganic and medicinal chemistry: theoretical and experimental assessment of a novel method for virtual screening and rational design of new lead anthelmintic. Bioorg. Med. Chem. 13, 1005–1020.

Marrero-Ponce, Y., Iyarreta-Veitia, M., Montero-Torres, A., Romero-Zaldivar, C., Brandt, C.A., Avila, P.E., Kirchgatter, K., Machado, Y., 2005c. Ligand-based virtual screening and in silico design of new antimalarial compounds using nonstochastic and stochastic total and atom-type quadratic maps. J. Chem. Inf. Model. 45, 1082–1100.

Marrero-Ponce, Y., Medina-Marrero, R., Castillo-Garit, J.A., Romero-Zaldivar, V., Torrens, F., Castro, E.A., 2005d. Protein linear indices of the ‘macromolecular pseudograph alpha-carbon atom adjacency matrix’ in bioinformatics. Part 1: prediction of protein stability effects of a complete set of alanine substitutions in Arc repressor. Bioorg. Med. Chem. 13, 3003–3015.

Marrero-Ponce, Y., Medina-Marrero, R., Torrens, F., Martinez, Y., Romero-Zaldivar, V., Castro, E.A., 2005e. Atom, atom-type, and total non-stochastic and stochastic quadratic fingerprints: a promising approach for modeling of antibacterial activity. Bioorg. Med. Chem. 13, 2881–2899.

Marrero-Ponce, Y., Montero-Torres, A., Zaldivar, C.R., Veitia, M.I., Perez, M.M., Sanchez, R.N., 2005f. Non-stochastic and stochastic linear indices of the ‘molecular pseudograph’s atom adjacency matrix’: application to ‘in silico’ studies for the rational discovery of new antimalarial compounds. Bioorg. Med. Chem. 13, 1293–1304.

Mc Farland, J.W., Gans, D.J., 1995. Cluster significance analysis. In: Waterbeemd, H. (Ed.), Chemometric Methods in Molecular Design. VCH Publishers, Weinheim, pp. 295–307.

McKinney, J.D., Richard, A., Waller, C., Newman, M.C., Gerberick, F., 2000. The practice of structure activity relationships (SAR) in toxicology. Toxicol. Sci. 56, 8–17.

Melagraki, G., Afantitis, A., Makridima, K., Sarimveis, H., Igglessi-Markopoulou, O., 2005. Prediction of toxicity using a novel RBF neural network training methodology. J. Mol. Model. 12, 297–305.

Netzeva, T.I., Schultz, T.W., 2005. QSARs for the aquatic toxicity of aromatic aldehydes from Tetrahymena data. Chemosphere 61, 1632–1643.


Pauling, L., 1939. The Nature of the Chemical Bond. Cornell University Press, Ithaca, NY.

Schultz, T.W., 1997. TETRATOX: Tetrahymena pyriformis population growth impairment endpoint – a surrogate for fish lethality. Toxicol. Methods 7, 289–309.

Schultz, T.W., 1999. Structure-toxicity relationships for benzenes evaluated with Tetrahymena pyriformis. Chem. Res. Toxicol. 12, 1262–1267.

Schultz, T.W., Netzeva, T.I., 2004. Development and evaluation of QSARs for ecotoxic endpoints: the benzene response-surface model for Tetrahymena toxicity. In: Cronin, M.T., Livingstone, D. (Eds.), Modelling Environmental Fate and Toxicity. CRC Press, Boca Raton, FL, pp. 265–284.

Schultz, T.W., Cronin, M.T.D., Walker, J.D., Aptula, A.O., 2003. Quantitative structure–activity relationships (QSARs) in toxicology: a historical perspective. J. Mol. Struct. (Theochem.) 622, 1–22.

Schultz, T.W., Seward-Nagel, J., Foster, K.A., Tucker, V.A., 2004. Population growth impairment of aliphatic alcohols to Tetrahymena. Environ. Toxicol. 19, 1–10.

Seward, J.R., Cronin, M.T., Schultz, T.W., 2001. Structure-toxicity analyses of Tetrahymena pyriformis exposed to pyridines – an examination into extension of surface-response domains. SAR QSAR Environ. Res. 11, 489–512.

Spycher, S., Pellegrini, E., Gasteiger, J., 2005. Use of structure descriptors to discriminate between modes of toxic action of phenols. J. Chem. Inf. Model. 45, 200–208.

STATISTICA version 6.0, 2001. StatSoft, Tulsa.

Todeschini, R., Gramatica, P., 1998. New 3D molecular descriptors: the WHIM theory and QSAR applications. Persp. Drug Disc. Des. 9–11, 355–380.

Todeschini, R., Consonni, V., Pavan, M., 2002. Dragon Software version 2.1.

Wold, S., Erikson, L., 1995. Chemometric methods in molecular design. In: van de Waterbeemd, H. (Ed.), Chemometric Methods in Molecular Design. VCH Publishers, Weinheim, pp. 309–318.

Xu, J., Hagler, A., 2002. Chemoinformatics and drug discovery. Molecules 7, 566–700.

Zvinavashe, E., Murk, A.J., Vervoort, J., Soffers, A.E., Freidig, A., Rietjens, I.M., 2006. Quantum chemistry based quantitative structure–activity relationships for modeling the (sub)acute toxicity of substituted mononitrobenzenes in aquatic systems. Environ. Toxicol. Chem. 25, 2313–2321.