Electronic Supplementary Information for:
Molecular Dynamics Based Descriptors for
Predicting Supramolecular Gelation
Ruben Van Lommel,[1,2] Jianyu Zhao,[2] Wim M. De Borggraeve,[1] Frank De Proft[2] and
Mercedes Alonso*[2]
1. Molecular Design and Synthesis, Department of Chemistry, KU Leuven, Celestijnenlaan 200F Leuven Chem&Tech, box 2404, 3001 Leuven, Belgium
In the next section, the method to calculate the MD-based descriptors derived in this work (rSASA, HB%, rH and F) is explained in a detailed, tutorial-like fashion.
1) Preparation
The calculation of the descriptors rSASA, rH and F requires the solvent accessible surface area (SASA), the maximum end-to-end distance (Rmax) and the volume (V) of a fully extended gelator molecule. To obtain a fully extended molecule, the molecule is built in an adequate building software (for example GaussView1) and all dihedral angles of the backbone are set to 180°. No further optimization of the structure is required. A graphical representation, together with the respective coordinates, of the fully extended gelator molecules considered in this study can be found below.
To obtain Rmax, the distance is measured between the atoms that are furthest away from each other in the
fully extended conformation.
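As an illustrative sketch, this maximum pairwise distance can be computed directly from the atomic coordinates; the four-atom chain below uses placeholder coordinates, not values from this study (a real script would read them from name.pdb).

```python
from itertools import combinations
from math import dist

# Toy (x, y, z) coordinates in nm for a four-atom chain (assumed values,
# not taken from the study).
coords = [(0.0, 0.0, 0.0), (0.15, 0.0, 0.0), (0.30, 0.05, 0.0), (0.45, 0.0, 0.0)]

# Rmax: the largest distance between any two atoms of the extended conformer.
r_max = max(dist(a, b) for a, b in combinations(coords, 2))
print(round(r_max, 2))  # prints 0.45
```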
The SASA of the extended conformation can be obtained through the gmx sasa implementation in the
GROMACS software.2 First, the extended conformer is saved as a pdb file (name.pdb). Then from this pdb
file an index file (index.ndx) can be generated through the following command line in GROMACS:
gmx make_ndx -f name.pdb -o index.ndx
Following this, the SASA of the molecule can be computed by means of the command described below:
gmx sasa -f name.pdb -s name.pdb -n index.ndx -o sasa.xvg
Before the computation, the software will ask for the index group for which the SASA needs to be computed. Make sure to select the entire system in this case.
Similarly, the volume of the extended gelator molecule can be obtained through the next command:
gmx sasa -f name.pdb -s name.pdb -n index.ndx -tv volume.xvg
The SASA, Rmax and volume of the extended molecules considered in this work are provided below.
2) Calculating rSASA

rSASA is calculated through the following equation:

rSASA = SASA / SASAmax
SASAmax is obtained by multiplying the SASA of a fully extended molecule (see above) by the total number of gelator molecules present in the simulation (in our case 5). To compute the SASA, a trajectory file of the simulation is necessary in .xtc format (md.xtc), together with a GROMACS structure file (md.gro) in which the solvent and gelator molecules are labeled distinctively, for example as SOL and GEL. Next, an index file can be created through the following command in GROMACS:
gmx make_ndx -f md.gro -o indexmd.ndx
Then the evolution of the SASA during the simulation can be calculated through:
gmx sasa -f md.xtc -s md.gro -n indexmd.ndx -surface GEL -o sasamd.xvg
Following this, the average SASA can be obtained straightforwardly from the sasamd.xvg file through any mathematical software that is capable of calculating averages from a list of data points.
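As a sketch of that averaging step, the snippet below parses a mock .xvg file (real GROMACS .xvg files have #/@ header lines followed by time/value columns) and forms rSASA; the SASA of the extended molecule and the file contents are assumed placeholder values.

```python
import io

# Mock sasamd.xvg contents (assumed values, not data from the study).
mock_xvg = """# GROMACS sasa output (mock)
@ title "Solvent Accessible Surface"
0.0  11.2
1.0  10.8
2.0  10.0
"""

def average_sasa(handle):
    values = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith(("#", "@")):
            continue  # skip comments and xmgrace directives
        values.append(float(line.split()[1]))  # second column: SASA in nm^2
    return sum(values) / len(values)

sasa_extended = 12.0  # SASA of one fully extended molecule (assumed, nm^2)
n_molecules = 5       # gelator molecules in the simulation box
sasa_max = sasa_extended * n_molecules

avg = average_sasa(io.StringIO(mock_xvg))
r_sasa = avg / sasa_max
print(round(avg, 3), round(r_sasa, 4))  # prints 10.667 0.1778
```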
3) Calculating HB%

To obtain the HB%, first all hydrogen bond donor and hydrogen bond acceptor atoms need to be identified
in the gelator molecule. This is done on the basis of prior chemical knowledge. Next, the indices of these
atoms, which can be retrieved from the md.gro file used to run the simulation, are collected in an index
file (indexHB.ndx) as follows:
[HBdonor1] "index number"
[HBdonor2] "index number"
[HBacceptor1] "index number"
[HBacceptor2] "index number"
…
Through the following command line in GROMACS, a distance histogram is provided for the distances
between the hydrogen bond acceptor and donor atoms.
gmx distance -f md.xtc -s md.gro -n indexHB.ndx -oh HB.xvg -select 'com of group "HBdonor1" plus com of group "HBacceptor1"' 'com of group "HBdonor1" plus com of group "HBacceptor2"' 'com of group "HBdonor2" plus com of group "HBacceptor1"' 'com of group "HBdonor2" plus com of group "HBacceptor2"' -len 0.15
At the end of the output file generated by this command (HB.xvg), the probability of finding the corresponding atoms at a distance of 3.0 Å or more from each other is given. From this value, the probability of finding the atoms closer than 3.0 Å to each other can be calculated. Summing these values for all hydrogen bond donor/hydrogen bond acceptor combinations and multiplying the sum by 100% renders the HB%.
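The summation can be sketched as follows; the per-pair probabilities below are hypothetical placeholders standing in for the values reported at the end of HB.xvg, not results from this work.

```python
# Hypothetical probabilities of finding each donor/acceptor pair at a
# distance >= 3.0 A, one value per donor/acceptor combination.
p_beyond_cutoff = [0.92, 0.75, 0.98, 0.60]

# Probability of a hydrogen-bonding contact (< 3.0 A) for each pair.
p_contact = [1.0 - p for p in p_beyond_cutoff]

# HB%: sum over all donor/acceptor combinations, expressed as a percentage.
hb_percent = 100.0 * sum(p_contact)
print(round(hb_percent, 1))  # prints 75.0
```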
4) Calculating rH
rH is calculated through the following equation:

rH = ⟨R⟩ / Rmax

where ⟨R⟩ denotes the average distance during the simulation between the two atoms that are furthest apart in the gelator molecule. The computation of Rmax from the extended conformation is explained above. To obtain ⟨R⟩, the indices of
the atoms that are furthest away from each other in the gelator molecule need to be gathered in an index
file (indexrH.ndx) as follows:
[gelator1] "index1" "index2"
[gelator2] "index1" "index2"
…
Following this, the evolution of the distance between these atoms during the simulation can be calculated with the gmx distance command, analogously to the procedure described above.
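As a sketch of the final rH evaluation, assuming the per-frame end-to-end distances have already been extracted from the gmx distance output, the numbers below are illustrative placeholders, not data from this study.

```python
# Mock per-frame end-to-end distances in nm (assumed values; a real
# workflow would read the second column of the gmx distance .xvg file).
distances = [1.1, 1.3, 1.2, 1.0, 1.4]
r_max = 2.0  # nm, from the fully extended conformation (assumed value)

# rH: average simulated end-to-end distance over Rmax of the extended molecule.
avg_distance = sum(distances) / len(distances)
r_h = avg_distance / r_max
print(round(r_h, 2))  # prints 0.6
```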
Data partitioning

To ensure that the machine learning models are not overfitted, the data is partitioned into a training, validation and test set as indicated in Table S3. Note that for the decision tree model, 44 data points are used for model training, while 15 data points are left out to validate the model. This partitioning is determined by a stratified random approach over the experimental response. For the construction of the artificial neural network, on the other hand, a JMP implementation of random k-fold (k = 5) cross-validation was employed to utilize the data maximally. Additionally, for the artificial neural network, the data is balanced by oversampling the data points with a gel or soluble response with respect to the data points that were classified as a precipitate (Table S3). In both cases, all data points originating from compound 11 are used as the test set.
Table S3: Data partitioning for machine learning methods and experimental outcomes (G = gel, P = precipitate and S = soluble).
Molecule  Solvent  Training/Validation/Test  k-fold  Experimental outcome  Frequency
1 Toluene Training 3 G 3
1 Benzene Training 2 G 2
1 Acetone Training 1 P 1
1 Methanol Training 2 P 1
1 Dimethylsulfoxide Training 1 S 2
1 Hexane Training 4 P 1
2 Toluene Training 4 G 3
2 Dibutylether Training 5 G 2
2 Ethanol Validation 1 P 1
2 Dimethylsulfoxide Training 2 G 3
2 1-propanol Training 4 P 1
3 1-propanol Training 5 S 2
3 Dimethylsulfoxide Training 3 S 2
3 Nitromethane Validation 5 G 2
3 Nitrobenzene Training 1 G 3
3 1,2-dichlorobenzene Validation 2 G 2
3 1,3-dichlorobenzene Training 4 G 3
4 1-propanol Training 2 S 2
4 Dimethylsulfoxide Training 2 S 2
4 Dichloromethane Validation 5 P 1
4 Hexane Training 3 P 1
4 Nitrobenzene Training 3 G 2
4 1,2-dichlorobenzene Training 1 G 3
5 Ethanol Validation 5 S 2
5 1-octanol Training 3 S 2
5 Dimethylsulfoxide Training 4 S 2
5 Water Training 3 P 1
5 Hexane Training 1 P 1
5 Toluene Training 1 S 2
6 Water Validation 1 G 3
6 Dimethylsulfoxide Validation 5 S 2
6 1-octanol Validation 4 S 1
6 Acetonitrile Training 2 P 1
6 Methyl tert-butyl ether Validation 4 P 1
6 Heptane Validation 2 P 1
7 Water Training 3 P 1
7 Dimethylsulfoxide Training 4 S 2
7 1-octanol Training 4 P 1
7 Acetonitrile Validation 5 P 1
7 Methyl tert-butyl ether Validation 5 P 1
7 Heptane Training 1 P 1
8 Water Training 3 P 1
8 Dimethylsulfoxide Training 1 S 2
8 1-octanol Training 2 P 1
8 Acetonitrile Training 4 P 1
8 Methyl tert-butyl ether Training 3 P 1
8 Heptane Training 2 P 1
9 Water Training 2 P 1
9 Dimethylsulfoxide Training 4 S 2
9 1-octanol Training 1 P 1
9 Acetonitrile Validation 3 P 1
9 Methyl tert-butyl ether Training 5 P 1
9 Heptane Training 5 P 1
10 Water Training 4 P 1
10 Dimethylsulfoxide Validation 2 S 2
10 1-octanol Training 3 S 2
10 Acetonitrile Training 1 P 1
10 Methyl tert-butyl ether Validation 5 P 1
10 Heptane Training 5 P 1
11 Water Test Test G N.A.
11 Dimethylsulfoxide Test Test S N.A.
11 1-octanol Test Test S N.A.
11 Acetonitrile Test Test P N.A.
11 Methyl tert-butyl ether Test Test P N.A.
11 Heptane Test Test P N.A.
Decision tree model
Optimization
The decision tree model is constructed by consecutive splits which recursively partition the data according to a relationship between the descriptors and the experimental results. The optimal model was retrieved by following the evolution of the entropy R² value of the validation data over the total number of splits (Figure S1). The total number of splits resulting in the maximum entropy R² value of the validation data set is deemed the optimal model, which in this study was equal to 5. Initially, the minimum split size was set to a count of 5 and subsequently decreased by 1 until no adequate splits were possible anymore, i.e. until further splits resulted in a lower entropy R² value for the training data.
Figure S1: Split history. The red curve describes the evolution of the entropy R² of the validation data while the blue curve describes the evolution of the entropy R² of the training data.
Performance
Figure S2: Receiver operating characteristic (ROC) curves for the training data (left) and validation data (right) of the optimized decision tree model. Predictions on precipitates are visualized by a red line, gels by a blue line and soluble samples by a green line.
Table S4: Confusion matrix for the optimized decision tree model.
Training (predicted count)
Actual   G    P    S
G        6    1    2
P        2   21    0
S        0    5    7

Validation (predicted count)
Actual   G    P    S
G        2    0    1
P        1    7    0
S        0    2    2
Artificial neural network
Optimization
The artificial neural network is defined by a set of hyperparameters such as the number of neurons,
activation function, number of hidden layers and the penalty method to optimize the weights. In this study
we restricted the architecture of the neural network to 1 hidden layer and selected the weight decay
method, where the penalty function p(βi) is given as:

p(βi) = Σi βi² / (1 + βi²)

A manual grid search over the other hyperparameters was performed to find the optimal network. The total number of neurons was varied between 1 and 5, while three different transformation functions were
considered: a linear identity function, a hyperbolic tangent function and a Gaussian function. The model with the lowest misclassification rate on the validation set is considered to have the optimal settings. From Table S5 we can discern that two architectures classify the validation data perfectly; to differentiate between these two models, we therefore looked at the misclassification rate on the training data. On this basis, we opted for the model with 5 hyperbolic tangent neurons, as it has a slightly better misclassification rate on the training data than the model with 4 Gaussian neurons. Hence, the former model should be trained slightly better.
Table S5: Hyperparameter optimization of the artificial neural network by a manual grid search. The optimal model is marked green.
Activation  # neurons  Misclassification rate (training data)  Misclassification rate (validation data)
Linear 1 0.4324 0.1579
Linear 2 0.3514 0.1053
Linear 3 0.3684 0.2353
Linear 4 0.3553 0.1765
Linear 5 0.3649 0.1131
tanH 1 0.3194 0.6190
tanH 2 0.1486 0.4211
tanH 3 0.1622 0.0526
tanH 4 0.1711 0.1176
tanH 5 0.0263 0.0000
Gaussian 1 0.4189 0.3684
Gaussian 2 0.1892 0.1053
Gaussian 3 0.2237 0.1765
Gaussian 4 0.0270 0.0000
Gaussian 5 0.1447 0.1176
Performance
Figure S3: Receiver operating characteristic (ROC) curves for the training data (left) and validation data (right) of the optimized artificial neural network. Predictions on precipitates are visualized by a red line, gels by a blue line and soluble samples by a green line.
The three lines on the ROC-curve of the validation data fully overlap.
Table S6: Confusion matrix for the optimized artificial neural network. Note that the frequency of each data point is taken into consideration for the counts (see Table S3).
Training (predicted count)
Actual   G    P    S
G       27    0    0
P        0   24    0
S        0    2   23

Validation (predicted count)
Actual   G    P    S
G        4    0    0
P        0    7    0
S        0    0    6
Measures of fit
A myriad of evaluation metrics exist to quantitatively assess the quality of a predictive model and summarize discrepancies between observed and predicted outcomes. It is important to note that different measures of fit can suggest different qualities for the same model depending on the data (dataset size, balance, ...) that was used to build the model. As a result, it is crucial to recognize the features of the used data and adopt an appropriate measure of fit. A valid strategy to determine the quality of a predictive model is to calculate multiple measures of fit and verify whether they indicate a similar quality for the model. Below, a definition is provided for all measures of fit that were used in this work.
The balanced accuracy (BA) is calculated as the average of the proportion of correct predictions of each class in a classification model.3 The main difference between the balanced accuracy and the overall accuracy is that with the BA, imbalance of the data is taken into consideration in the metric. As an example, the BA is
calculated below for the confusion matrix provided in Table S6 (training set).
BA = (27/27 + 24/24 + 23/25) / 3 = 0.97
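Using the counts of the Table S6 training confusion matrix, this calculation can be reproduced in a few lines (an illustrative sketch; the original analysis uses JMP):

```python
# Training confusion matrix from Table S6: rows = actual class,
# columns = predicted class, in the order G, P, S.
confusion = [
    [27, 0, 0],   # actual G
    [0, 24, 0],   # actual P
    [0, 2, 23],   # actual S
]

# Per-class recall (proportion of correct predictions within each class),
# then the balanced accuracy as their unweighted average.
recalls = [row[i] / sum(row) for i, row in enumerate(confusion)]
ba = sum(recalls) / len(recalls)
print(round(ba, 2))  # prints 0.97
```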
Entropy R² is defined as one minus the ratio of the negative log-likelihood of the fitted model to that of a hypothetical constant-probability (random) model.4 Perfect predictive models have an entropy R² value of 1, while random models have a value of 0.
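This definition can be illustrated with a small binary toy example; the labels and predicted probabilities below are made up for illustration and are not data from this study.

```python
from math import log

# Toy binary data: observed labels and the fitted model's P(class = 1).
labels = [1, 1, 0, 1, 0]
p_model = [0.9, 0.8, 0.2, 0.7, 0.3]
p_null = sum(labels) / len(labels)  # constant-probability (null) model

def neg_log_lik(labels, probs):
    # Negative log-likelihood of Bernoulli observations.
    return -sum(log(p) if y == 1 else log(1 - p) for y, p in zip(labels, probs))

nll_model = neg_log_lik(labels, p_model)
nll_null = neg_log_lik(labels, [p_null] * len(labels))

# Entropy R^2: 1 minus the ratio of the two negative log-likelihoods.
entropy_r2 = 1 - nll_model / nll_null
print(round(entropy_r2, 3))  # prints 0.624
```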
Misclassification Rate (MR) is the probability that the prediction does not correspond to the observed outcome.4 As an example, the MR is calculated below from the confusion matrix provided in Table S6 (training set).

MR = 2/76 = 0.03
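The same Table S6 training confusion matrix gives the MR as the off-diagonal (misclassified) count over the total number of frequency-weighted data points:

```python
# Training confusion matrix from Table S6 (rows = actual, columns = predicted,
# order G, P, S).
confusion = [
    [27, 0, 0],
    [0, 24, 0],
    [0, 2, 23],
]

total = sum(sum(row) for row in confusion)
wrong = total - sum(confusion[i][i] for i in range(3))  # off-diagonal counts
mr = wrong / total
print(round(mr, 2))  # prints 0.03
```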
Cohen’s Kappa (K) classically measures inter-observer agreement.5 However, in the case of a classification model it can also be used as a measure of fit when the observers (also called raters) are chosen as an “actual” observer and a “predictive” observer. To calculate K, the relative observed agreement between the two raters (P0) and the hypothetical probability of chance agreement (Pe) are necessary. Again, a calculation is provided based on the data in Table S6 (training set).
P0 = (27 + 24 + 23)/76 = 0.97

Pe = (27/76)·(27/76) + (24/76)·(26/76) + (25/76)·(23/76) = 0.334
Next, K can be calculated as follows. A value of K larger than 0.61 indicates substantial agreement between the two observers, while a value of 0 indicates an agreement that is equivalent to chance.

K = 1 − (1 − P0)/(1 − Pe) = 0.96
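The P0, Pe and K values above can be reproduced from the Table S6 training confusion matrix (a sketch; the original analysis uses JMP):

```python
# Training confusion matrix from Table S6 (rows = actual, columns = predicted,
# order G, P, S).
confusion = [
    [27, 0, 0],
    [0, 24, 0],
    [0, 2, 23],
]
total = sum(sum(row) for row in confusion)

# Observed agreement: fraction of points on the diagonal.
p0 = sum(confusion[i][i] for i in range(3)) / total

# Chance agreement: product of actual and predicted marginals per class.
row_totals = [sum(row) for row in confusion]
col_totals = [sum(confusion[i][j] for i in range(3)) for j in range(3)]
pe = sum(r * c for r, c in zip(row_totals, col_totals)) / total**2

kappa = 1 - (1 - p0) / (1 - pe)
print(round(p0, 2), round(pe, 3), round(kappa, 2))  # prints 0.97 0.334 0.96
```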
The area under the receiver operating characteristic curve (AUROC) is often used as a measure to assess the quality of a machine learning predictor. AUROC values of 1 indicate perfect classification, values of 0.5 indicate classification similar to random models, while values below 0.5 indicate models that perform worse than random. To calculate the AUROC, receiver operating characteristic curves need to be plotted for every class in the prediction model, which is accomplished by depicting the sensitivity (true positive rate) on the y-axis and 1 − specificity (false positive rate) on the x-axis (Figures S2 and S3).4
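As an illustrative sketch (not part of the original JMP workflow), the AUROC of a single class in a one-vs-rest setting can be computed with the rank-based Mann-Whitney formulation; the labels and scores below are made up for illustration.

```python
def auroc(labels, scores):
    # Mann-Whitney formulation of the AUROC: the probability that a randomly
    # chosen positive is scored higher than a randomly chosen negative,
    # counting ties as half a win.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy one-vs-rest labels and predicted scores (assumed values).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(auroc(labels, scores), 3))  # prints 0.889
```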
References and notes
1. GaussView, Version 6.1, R. Dennington, T. A. Keith, J. M. Millam, Semichem Inc., Shawnee Mission, KS, 2016.
2. M. J. Abraham, T. Murtola, R. Schulz, S. Páll, J. C. Smith, B. Hess and E. Lindahl, SoftwareX, 2015, 1-2, 19-25.
3. K. H. Brodersen, C. S. Ong, K. E. Stephan and J. M. Buhmann, in 2010 20th International Conference on Pattern Recognition, 2010, 3121-3124.
4. JMP®, Version Pro 14, SAS Institute Inc., Cary, NC, 1989-2019.
5. M. Banerjee, M. Capozzoli, L. McSweeney and D. Sinha, Canadian Journal of Statistics, 1999, 27,