Journal of Computer Aided Chemistry, Vol.20, 92-103 (2019) ISSN 1345-8647
Copyright 2019 Chemical Society of Japan

Chemoinformatics Approach for Estimating Recovery Rates of Pesticides in Fruits and Vegetables

Takeshi Serino a, b, Yoshizumi Takigawa b, Sadao Nakamura b, Ming Huang a, Naoaki Ono a, c, Altaf-Ul-Amin a, Shigehiko Kanaya a, c *

a Division of Information Science, Graduate School of Science and Technology, Nara Institute of Science and Technology, Ikoma Nara 630-0192, Japan
b Japan Field Support Center, Chemical Analysis Marketing Group, Agilent Technologies Japan, Ltd. Hachioji Site, 9-1 Takakura-machi, Hachioji-shi, Tokyo, 192-8510, Japan
c Data Science Center, Graduate School of Science and Technology, Nara Institute of Science and Technology, Ikoma Nara 630-0192, Japan

(Received July 17, 2019; Accepted October 30, 2019)

Pesticides are considered a vital component of modern farming, playing major roles in maintaining high agricultural productivity. Pesticide recovery rates in vegetables and fruits determined using GC/MS depend on various factors, including the matrix effect and chemical interactions between pesticides and coexisting compounds in crops. In this study, the recovery rate of a pesticide is defined as the ratio of the peak area at 50 ppb spiked in a crop sample to that in the solvent standard calibration curve. Estimating the recovery rates of pesticides in crops enables precise evaluation of their contents in the crops. In the present study, we constructed regression models of the recovery rates based on molecular descriptors using the R packages rcdk and caret. Each of the chemical structures of 248 pesticides was converted to 174 molecular descriptors; then, for 7 crops, we created 69 ordinary and 20 ensemble learning regression models for estimating the recovery rates from the molecular descriptors using the R package caret. Two machine learning regression methods, mSBC and xgbLinear, performed the best in terms of prediction rates and execution times. In those two regression models, predictions of recovery rates of pesticides are carried out in the local distribution of chemical properties out of the 174 molecular descriptors. From this we conclude that closely related pesticides in the chemical space also have very similar recovery rates.

Key Words: recovery rate, regression analysis, quantitative structure-property relationships, machine learning

1. Introduction

Chemoinformatics is a discipline focusing on extracting, processing, and extrapolating meaningful data from chemical structures, which includes the development of quantitative structure-property relationships [1]. Pesticides are considered a vital component of modern farming, playing major roles in maintaining high agricultural productivity. In high-input intensive agricultural production systems, the widespread use of pesticides to manage pests has emerged as a dominant feature [2]. In the case of GC-MS, special attention should be paid to characterize the matrix
bpol (Sum of the absolute value of the difference between atomic polarizabilities of all bonded atoms in the molecule (including implicit hydrogens))
Carbon Types Descriptor (9): C1SP1 (Triply bound carbon bound to one other carbon), C2SP1 (Triply bound carbon bound to two other carbons), C1SP2 (Doubly bound carbon bound to one other carbon), C2SP2 (Doubly bound carbon bound to two other carbons), C3SP2 (Doubly bound carbon bound to three other carbons), C1SP3 (Singly bound carbon bound to one other carbon), C2SP3 (Singly bound carbon bound to two other carbons), C3SP3 (Singly bound carbon bound to three other carbons), C4SP3 (Singly bound carbon bound to four other carbons)
Fragment Complexity Descriptor (1): fragC (Complexity of a system)
H Bond Acceptor Count Descriptor (1): nHBAcc (Number of hydrogen bond acceptors)
H Bond Donor Count Descriptor (1): nHBDon (Number of hydrogen bond donors)
KappaShape Indices Descriptor (3): Kier1-3 (First, second and third kappa (κ) shape indexes)
Largest Chain Descriptor (1): nAtomLC (Number of atoms in the largest chain)
Longest Aliphatic Chain Descriptor (1): nAtomLAC (Number of atoms in the longest aliphatic chain)
Mannhold LogP Descriptor (1): MLogP (Mannhold LogP)
MDEDescriptor (19): MDEC.11 (Molecular distance edge between all primary carbons), MDEC.12 (between all primary and secondary carbons), MDEC.13 (between all primary and tertiary carbons), MDEC.14 (between all primary and quaternary carbons), MDEC.22 (between all secondary carbons), MDEC.23 (between all secondary and tertiary carbons), MDEC.24 (between all secondary and quaternary carbons), MDEC.33 (between all tertiary carbons), MDEC.34 (between all tertiary and quaternary carbons), MDEC.44 (between all quaternary carbons), MDEO.11 (between all primary oxygens), MDEO.12 (between all primary and secondary oxygens), MDEO.22 (between all secondary oxygens), MDEN.11 (between all primary nitrogens), MDEN.12 (between all primary and secondary nitrogens), MDEN.13 (between all primary and tertiary nitrogens), MDEN.22 (between all secondary nitrogens), MDEN.23 (between all secondary and tertiary nitrogens), MDEN.33 (between all tertiary nitrogens)
PetitjeanNumberDescriptor (1): PetitjeanNumber (Petitjean number)
RotatableBondsCountDescriptor (1): nRotB (Number of rotatable bonds, excluding terminal bonds)
RuleOfFiveDescriptor (1): LipinskiFailures (Number of failures of Lipinski's Rule of 5)
TPSADescriptor (19): TopoPSA (Topological polar surface area)
VAdjMaDescriptor (1): VAdjMat (Vertex adjacency information (magnitude))
WeightDescriptor (1): MW (Molecular weight)
WeightedPathDescriptor (5): WTPT.1 (Molecular ID), WTPT.2 (Molecular ID / number of atoms), WTPT.3 (Sum of path lengths starting from heteroatoms), WTPT.4 (Sum of path lengths starting from oxygens), WTPT.5 (Sum of path lengths starting from nitrogens)
WienerNumbersDescriptor (2): WPATH (Wiener path number), WPOL (Wiener polarity number)
XLogPDescriptor (1): XLogP (XLogP)
ZagrebIndexDescriptor (1): Zagreb (Sum of the squares of atom degree over all heavy atoms i)
Petitjean Shape Index Descriptor (1): topoShape (Petitjean topological shape index)
Others (17): nAcid (Acidic group count descriptor), nBase (Basic group count descriptor), nSmallRings (the number of small rings from size 3 to 9), nAromRings (the number of aromatic rings), nRingBlocks (total number of distinct ring blocks), nAromBlocks (total number of "aromatically connected components"), nRings3, 5, 6, 7 (individual breakdown of small rings), tpsaEfficiency (Polar surface area expressed as a ratio to molecular size), VABC (Atomic and Bond Contributions of van der Waals volume), HybRatio (the ratio of heavy atoms in the framework to the total number of heavy atoms in the molecule), tpsaEfficiency.1 (Polar surface area expressed as a ratio to molecular size), TopoPSA.1 (Topological polar surface area), topoShape.1 (A measure of the anisotropy in a molecule)
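
To make two of the topological descriptors listed above concrete, the following is a minimal Python sketch, not the rcdk/CDK implementation used in this study. It assumes the standard graph definitions: Zagreb as the sum of squared heavy-atom degrees, and the Wiener path number as the sum of shortest-path lengths over all pairs of heavy atoms. The function names and the adjacency-list input format are hypothetical, and the example molecule (isobutane, hydrogens suppressed) is chosen only because it can be checked by hand.

```python
from collections import deque


def zagreb_index(adjacency):
    """Sum of squared degrees over all heavy atoms (hydrogen-suppressed graph)."""
    return sum(len(neighbors) ** 2 for neighbors in adjacency.values())


def wiener_path_number(adjacency):
    """Sum of shortest-path lengths over all unordered pairs of heavy atoms."""
    total = 0
    for source in adjacency:
        # Breadth-first search gives topological distances from `source`.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            atom = queue.popleft()
            for neighbor in adjacency[atom]:
                if neighbor not in distance:
                    distance[neighbor] = distance[atom] + 1
                    queue.append(neighbor)
        total += sum(distance.values())
    # Every pair was counted twice (once from each end).
    return total // 2


# Isobutane, heavy atoms only: central carbon 0 bonded to carbons 1, 2 and 3.
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}

print(zagreb_index(isobutane))        # 3**2 + 1 + 1 + 1 = 12
print(wiener_path_number(isobutane))  # three pairs at distance 1, three at distance 2 = 9
```

Both values depend only on the connectivity of the heavy atoms, which is why descriptors of this kind can be derived from a SMILES string without three-dimensional coordinates.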
of JPL method. A total of 305 pesticides were spiked into
the sample at 50 ppb and then analyzed by GC-MS in SIM/Scan
mode. The recovery rate of each pesticide was calculated as
the ratio of the peak area at 50 ppb spiked in the sample to
that in the solvent standard calibration curve.
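
The definition above amounts to a single division. The following Python sketch is given for illustration only: the function name is hypothetical, the peak areas are invented numbers rather than measured data, and the result is returned as a plain ratio because the text does not state whether the rates were reported as ratios or percentages.

```python
def recovery_rate(area_spiked_sample, area_solvent_standard):
    """Ratio of the 50 ppb peak area in the crop sample to that in the solvent standard."""
    if area_solvent_standard <= 0:
        raise ValueError("solvent standard peak area must be positive")
    return area_spiked_sample / area_solvent_standard


# Hypothetical peak areas (arbitrary units) for one pesticide at 50 ppb.
rate = recovery_rate(area_spiked_sample=8400.0, area_solvent_standard=10500.0)
print(round(rate, 3))  # 0.8
```

A ratio below 1 would indicate signal suppression in the crop matrix relative to the solvent standard, and a ratio above 1 would indicate enhancement.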
2.2 Chemical Descriptors
We examined 248 unique pesticides after removing the
pesticides that have isomers, because such isomers share
the same canonical SMILES but have different recovery
rates. Canonical SMILES strings were added to the data
set from PubChem (https://pubchem.ncbi.nlm.nih.gov).
To evaluate recovery rates from chemical structures, the
R package rcdk was used to convert each chemical
structure to molecular properties using its connectivity
information, and then to 178 molecular descriptors
for each of the 248 pesticides (Table 1).
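
The isomer filter described above can be sketched in a few lines of Python. This is an illustrative assumption about how such a filter might be written, not the procedure actually used in this study. The function name and the records are hypothetical, and the toy molecules are not pesticides: cis- and trans-2-butene stand in for isomers that collapse to the same non-stereo SMILES.

```python
from collections import Counter


def drop_shared_smiles(records):
    """Keep only compounds whose canonical SMILES occurs exactly once."""
    counts = Counter(smiles for _, smiles in records)
    return [(name, smiles) for name, smiles in records if counts[smiles] == 1]


# Hypothetical (name, canonical SMILES) records.
records = [
    ("compound_A", "CCO"),
    ("isomer_B_cis", "CC=CC"),
    ("isomer_B_trans", "CC=CC"),
    ("compound_C", "c1ccccc1"),
]

unique = drop_shared_smiles(records)
print(unique)  # [('compound_A', 'CCO'), ('compound_C', 'c1ccccc1')]
```

Dropping every member of an isomer group, rather than keeping one representative, avoids assigning several different recovery rates to a single descriptor vector.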
2.3 Regression algorithms
Regression algorithms can be classified into (1) ordinary
learning approaches, which construct one learner from
training data, and (2) ensemble methods, which construct a
set of learners and combine them. The caret package for
machine learning in R has been introduced as one of the
publicly accessible resources and tools for machine learning
(Butler et al., 2018). In the present study, we examined
regression models of 69 ordinary and 20 ensemble
learning methods in caret (Table 2). Most of the ordinary
learning methods are kernel-based regression models
or simple linear models. Among the kernel-based
regression models, gaussprLinear, rvmLinear, svmLinear,
svmLinear2 and svmLinear3 implement a linear kernel;