    Using vis-NIRS and Machine Learning methods to diagnose sugarcane soil chemical properties

    Diego A. Delgadillo-Duran (a), Cesar A. Vargas-García (a), Viviana M. Varón-Ramírez (a), Francisco Calderón (b), Andrea C. Montenegro (a), Paula H. Reyes-Herrera (a)

    (a) Corporación Colombiana de Investigación Agropecuaria, CI Tibaitatá, Bogotá, Colombia
    (b) School of Engineering, Pontificia Universidad Javeriana, Bogotá, Colombia

    Abstract

    Knowing the chemical properties of a soil can be decisive for crop management and total yield. Traditional property estimation approaches are time-consuming and require complex lab setups, which keeps farmers from promptly adopting optimal practices in their crops. Property estimation from spectral signals (vis-NIRS) has emerged as a low-cost, non-invasive, and non-destructive alternative. Current approaches rely on mathematical and statistical techniques and largely avoid the machine learning framework. Here we propose both regression and classification with machine learning techniques to predict the values and infer the categories of common soil properties (pH, soil organic matter, Ca, Na, K, and Mg), and we assess performance with the most common metrics. In sugarcane soils, we use regression to estimate properties and classification to assess the status of each soil property, and we report the direct relation between spectral bands and the direct measurement of certain properties. In both cases, we achieved performance similar to that reported in the literature for similar setups.

    Keywords: Vis-NIR, Soil properties, Machine learning

    1. Introduction

    As the population grows, the demand for food continues to increase, but unsustainable practices reduce the arable soil. Soils are dynamic systems that change in response to different natural and anthropogenic activities. Soil health must be a priority, particularly in agricultural practices, to increase productivity without affecting the soil. It is essential to monitor soil quality through physicochemical analyses to provide a specific assessment oriented towards sustainability [1].

    Email addresses: [email protected] (Diego A. Delgadillo-Duran), [email protected] (Paula H. Reyes-Herrera)

    Preprint submitted to Catena January 22, 2021

    arXiv:2012.12995v2 [cs.LG] 20 Jan 2021

    Laboratory analysis is widely used to determine soil properties; it relies on traditional chemical analyses that are expensive, time-consuming, and generate environmental contamination because of the number of chemical reagents used [2]. There is currently a growing demand for immediate results. The search for alternatives to conventional laboratory analysis has made the NIRS technique a potential candidate. Visible and near-infrared reflectance spectroscopy (vis-NIRS) emerges as a precision agriculture technique to monitor soil physicochemical characteristics in the field and in the laboratory. This non-destructive analysis method is used because it is cost-effective, delivers rapid results, and can infer multiple components simultaneously from a single spectrum. It also requires no chemical agents in the analysis procedure; thus, it is not harmful to the environment.

    vis-NIRS is a method based on the absorption of light by different materials in the visible and near-infrared region (400 - 2500 nm) of the electromagnetic spectrum [3][4]. Materials absorb specific frequencies when irradiated with visible-NIR light. Absorption occurs when the frequency of the incoming light corresponds to the molecular vibration frequency of a constituent in the sample. A detector monitors the portion of the light that is reflected and decomposes it into its components at different frequencies of the spectrum, with the corresponding magnitudes.

    Nevertheless, processing raw NIRS data requires (1) advanced mathematical and statistical analysis to provide information on what substance is present in the sample and how much of it, and (2) good calibrations that guarantee the values obtained and their associated accuracy. Soil vis-NIR spectra are largely nonspecific because of the overlapping absorption of soil constituents. Complex absorption patterns generated by soil constituents and quartz need to be mathematically extracted from the spectra [5]. The most frequently used method to estimate chemical properties from vis-NIRS is Partial Least Squares Regression (PLSR), but it hides the nonlinear relationships between the spectrum and the soil constituents [6].

    The usage of machine learning (ML) in soil science has increased in the last decade [7], also impacting the use of infrared spectral data to infer soil properties [8][9]. Recent studies [10][11][7][6] adopt ML approaches (SVM, neural networks, random forest, and cubist) to estimate organic carbon and matter, cation exchange capacity, pH, clay content, and nitrogen from vis-NIR spectra of fresh and processed samples. However, soil properties depend on the soil-forming factors and processes of a specific region, and the performance of ML approaches depends on the training set. Therefore, models trained with data from one location are not easily extended to others.

    Colombia is a country with a diversity of soils. The IGAC institute has identified, at a 1:100,000 scale, 11 of the 12 soil orders of the USDA classification in Colombia [12]. Previous studies in Colombia used NIRS in an oxisol to predict total carbon and total nitrogen and to incorporate these predictions into mapping with geostatistical techniques over a region of about 5100 hectares [13]. Later, NIRS was also found useful to predict clay content in the same study area [14]. However, we are not aware of any study exploring ML and vis-NIRS in the country.

    In this study, we apply vis-NIRS and ML approaches to Colombian soil samples from sugarcane for panela, with a two-fold purpose. First, we evaluate the capacity of ML approaches to estimate six chemical properties: pH, organic matter (OM), calcium (Ca), magnesium (Mg), sodium (Na), and potassium (K) content. We compare the selected ML model for each property with two scenarios that simulate traditional chemometric techniques: (1) using the band with the highest correlation coefficient, and (2) Partial Least Squares Regression (PLSR) [15][16]. Second, we estimate the value of soil properties as a first step towards tuning a recommendation. Moreover, we use ML classification to infer categories for soil properties to see whether this is a viable alternative.

    2. Materials and methods

    2.1. Data

    2.1.1. Study area and sample collection procedure

    We used a data set derived from a previous study in the Hoya del río Suárez region in Colombia (coordinates: 73°22' - 73°39' West longitude and 5°53' - 6°10' North latitude). This region covers an area of about 470 km². Entisols, inceptisols, and vertisols characterize this region, according to the soil survey [12]. The area has two principal crops: sugar cane for the panela agro-industry and grasslands.

    The sampling stage took place during 2015 and 2016; samples were taken from the surface to a depth of 20 cm. Each sample point (Figure 1) represents four sub-samples collected and mixed; the sampling area corresponds to a regular grid of 700 meters.

    Figure 1: Sampling area for 653 points.

    2.1.2. Chemical measurements

    We dried the samples and analyzed them with classic laboratory methods. We measured pH with a pH meter in a 1:2.5 soil-water suspension (NTC 5264, 2008) and organic matter (OM) with Walkley and Black's wet digestion method. We used the ammonium acetate extraction method to measure exchangeable cations (Ca2+, K+, Mg2+, and Na+) following NTC 5349 - 2008 [17].

    2.2. Methods

    We transform the spectrum to obtain informative features and use five ML regression models from scikit-learn [18] in Python and classification models from the Statistics and Machine Learning Toolbox in MATLAB. The coefficient of determination R² and the correlation coefficient ρ helped us to select the model. For some properties, none of the ML regression models is promising (none reaches ρ > 0.6); we did not proceed in these cases because we consider that ML regression is not suitable for the property.

    In all cases, we trained an ML classifier to infer categories for soil properties to see whether this is a viable alternative. We selected the classification model using accuracy and then ran a grid search over the misclassification penalties in the confusion matrix to look for a model that handles the class imbalance. Finally, we performed feature selection to identify the wavelengths that have the strongest effect on each property model.

    Features and Preprocessing. The vis-NIR spectrum of each sample covers the range between 400 and 2491 nm in steps of 8.5 nm (a vector of 247 elements). We took each data point as a feature and applied transformations to the spectra: the first derivative (D1), the second derivative (D2), and the Fast Fourier Transform (FFT). We applied standard normalization by feature across the whole sample dataset to ensure zero mean and unit variance [19], and concatenated the feature sets, resulting in 247 x 4 = 988 features for every sample.
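
    The feature construction described above can be sketched in Python as follows. This is a minimal illustration, not the original script: the derivative estimator (np.gradient) and the use of the FFT magnitude are assumptions, since the text does not specify either, and the spectra are synthetic stand-ins.

```python
import numpy as np

def build_features(spectra: np.ndarray) -> np.ndarray:
    """Stack raw spectra, D1, D2, and FFT magnitude, each standardized per feature."""
    d1 = np.gradient(spectra, axis=1)               # first derivative along wavelength (assumed estimator)
    d2 = np.gradient(d1, axis=1)                    # second derivative
    fft_mag = np.abs(np.fft.fft(spectra, axis=1))   # FFT magnitude (assumption: magnitude, not real/imag parts)
    scaled = []
    for block in (spectra, d1, d2, fft_mag):
        mean = block.mean(axis=0)
        std = block.std(axis=0)
        std[std == 0] = 1.0                         # guard against constant columns
        scaled.append((block - mean) / std)         # zero mean, unit variance per feature
    return np.hstack(scaled)

# Wavelength grid from the text: 400 to 2491 nm in 8.5 nm steps (247 points).
wavelengths = np.linspace(400.0, 2491.0, 247)

# Synthetic reflectance values standing in for the real spectra.
rng = np.random.default_rng(0)
spectra = rng.random((20, wavelengths.size))
X = build_features(spectra)
print(X.shape)
```

    With 247 bands and four feature sets, each sample ends up with 988 columns, matching the count given in the text.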

    2.2.1. ML regression models

    The dataset contains 653 samples and 988 features for six soil properties. First, we randomly split this dataset into 70% for training (to evaluate and adjust parameters) and 30% reserved for validation only.

    We evaluated four regression models from [18]: (1) linear regression (LR), (2) support vector regression (SVR) with a linear kernel, (3) LASSO with cross-validation, and (4) a multilayer perceptron neural network.

    Cross-validation: We selected the best model with 5-fold cross-validation on the 70% reserved for training, using the 988 features. We based the selection on the distributions of the correlation (ρ) and determination (R²) coefficients and of the mean squared error (MSE).
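
    A hedged sketch of this selection step with scikit-learn is shown below. The data are synthetic, the hyperparameters (for example the MLP layer size) are illustrative assumptions, and the metrics are computed on pooled out-of-fold predictions rather than as per-fold distributions, which is a simplification of the procedure described above.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import KFold, cross_val_predict, train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR

# Synthetic stand-in for the feature matrix and one soil property.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 30))
y = X[:, :3] @ np.array([1.0, -0.5, 0.25]) + 0.1 * rng.normal(size=120)

# 70% for training and model selection, 30% held out.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

models = {
    "LR": LinearRegression(),
    "SVR": SVR(kernel="linear"),
    "LASSO": LassoCV(cv=5),
    "MLP": MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
}

# 5-fold cross-validation on the training portion only.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = {}
for name, model in models.items():
    pred = cross_val_predict(model, X_train, y_train, cv=cv)
    scores[name] = {
        "rho": float(np.corrcoef(y_train, pred)[0, 1]),
        "R2": float(r2_score(y_train, pred)),
        "MSE": float(mean_squared_error(y_train, pred)),
    }

best = max(scores, key=lambda name: scores[name]["rho"])
print(best, scores[best])
```

    The held-out 30% (X_test, y_test) is deliberately left untouched here, so that it can serve for the final comparison only.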

    Comparison of ML regression to chemometric approaches: We used the 30% test set to compare the selected model against (1) a regression on the band with the highest correlation coefficient with the target label, and (2) Partial Least Squares with six components from [18].

    2.2.2. ML classifiers

    Classes: We defined the target classes for the properties according to soil fertility requirements for sugarcane crops for panela: K (Low: 0.4), Na (Acceptable: 1), pH (acidity correction: 7.3), Mg (Low: < 1.5, Medium: 3-5, High: > 5), Ca (Low: 5), OM (Low: 5). However, this definition led to imbalanced classes.

    Classification models: Thanks to the data pre-processing and the practicality of modern ML tools, we were able to make an initial selection and evaluation of 24 ML models. These 24 classifiers fall into six groups: (1) three based on binary trees, (2) linear and quadratic discriminants, (3) Naive Bayes and Kernel Naive Bayes, (4) Support Vector Machines (SVM) with six different kernel configurations, (5) KNN with six different distance metrics, and (6) five ensemble-based architectures.

    Cross-validation: Due to the class imbalance, we opted for 5-fold cross-validation and selected the best performing ML model from the 24 available in the toolbox. Cross-validation estimates the predictive accuracy of the final model trained with all the data. It requires multiple fits but makes efficient use of all the data, so it is recommended for small data sets [20].

    Misclassification cost grid search: To choose the classifier configuration in the face of class imbalance, we also apply a penalty to misclassifications during training. This cost applies to all Type I and Type II errors in the confusion matrix; by default, ML models assign a cost of 1 to all errors and 0 to the diagonal of the confusion matrix. We use a grid search to select the best performing combination of misclassification costs.
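
    To make the role of the cost matrix concrete, the sketch below builds a three-class cost matrix from six off-diagonal penalties and applies a minimum-expected-cost decision rule to class probabilities. This is one common way to use such a matrix, shown in Python for illustration; it is not necessarily how the MATLAB toolbox applies the costs, and the penalty values are hypothetical.

```python
import numpy as np

def cost_matrix(off_diagonal, n_classes=3):
    """Zeros on the diagonal, given penalties elsewhere (row = true class, column = predicted class)."""
    cost = np.zeros((n_classes, n_classes))
    cost[~np.eye(n_classes, dtype=bool)] = off_diagonal   # filled in row-major order
    return cost

def min_expected_cost(proba, cost):
    """For each sample, pick the class with the lowest expected misclassification cost."""
    expected = proba @ cost   # expected[i, j] = sum_k P(k | x_i) * cost[k, j]
    return expected.argmin(axis=1)

# Default setting: every error costs 1.
default_cost = cost_matrix(np.ones(6))

# Hypothetical weighting: missing the third (rare) class costs 7 instead of 1.
weighted_cost = cost_matrix([1, 1, 1, 1, 7, 7])

proba = np.array([[0.50, 0.30, 0.20]])   # class probabilities for one sample
print(min_expected_cost(proba, default_cost))    # unit costs: most probable class wins
print(min_expected_cost(proba, weighted_cost))   # heavier penalty shifts the decision
```

    With unit costs the rule reduces to picking the most probable class; raising the penalties of one row pushes borderline samples towards that class, which is the lever the grid search tunes.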

    For pH, OM, Ca, Mg, and K, we use a grid search over 6 parameters corresponding to all Type I and II errors in a three-class confusion matrix; each parameter varied in gs = {1, 2, ..., 7}, for a total of 7^6 = 117649 different cross-validation experiments. For Na, we optimize only two misclassification costs, so we extended the grid to gs = {1, 2, ..., 150}, for a total of 150^2 = 22500 cross-validations. Finally, to evaluate performance, we use the Matthews correlation coefficient as our preferred metric because of our dataset's imbalance [21].
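
    The grid sizes and the motivation for the Matthews correlation coefficient (MCC) can be checked with a short standard-library sketch. The Na counts below are read from the two-class confusion matrix in Figure 2 (taken here to be panel J), with Medium treated as the positive class; that choice of positive class is an assumption made for the illustration.

```python
from itertools import product
from math import sqrt

# Number of cost combinations in each grid search.
three_class_grid = sum(1 for _ in product(range(1, 8), repeat=6))    # six costs in {1..7}
two_class_grid = sum(1 for _ in product(range(1, 151), repeat=2))    # two costs in {1..150}

def binary_mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a 2x2 confusion matrix."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Na confusion matrix: Medium is the rare (positive) class.
tp, fn = 3, 5      # true Medium predicted as Medium / as Low
tn, fp = 643, 2    # true Low predicted as Low / as Medium

accuracy = (tp + tn) / (tp + tn + fp + fn)
mcc = binary_mcc(tp, tn, fp, fn)
print(three_class_grid, two_class_grid)
print(round(accuracy, 3), round(mcc, 3))
```

    Accuracy stays close to 1 because the Low class dominates, while the MCC is much lower because most Medium samples are missed. This gap is why an imbalance-aware metric is preferable for this dataset.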

    2.2.3. Feature ranking

    Finally, we propose a feature ranking approach to unveil the relation between each band of the spectrum and the properties. First, we obtained and normalized the correlation coefficient and the LASSO ranking between each spectral band (and its transformations, the first and second derivatives) and the target label in the training set. We then added the coefficients for each band (for the spectrum and its first and second derivatives) to obtain a single value per band, similar to the traditional chemometric approach.
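
    One possible reading of this ranking is sketched below. The normalization (division by the largest absolute value), the LASSO penalty alpha, and the derivative estimator are all assumptions; the text does not fix these details, so the sketch illustrates the idea rather than reproducing the exact scores of Figure 3.

```python
import numpy as np
from sklearn.linear_model import Lasso

def normalize(values):
    """Scale absolute values to [0, 1] by the largest absolute value."""
    peak = np.max(np.abs(values))
    return np.abs(values) / peak if peak > 0 else np.zeros_like(values, dtype=float)

def band_scores(blocks, y, alpha=0.05):
    """Sum normalized |correlation| and |LASSO coefficient| per band over all blocks."""
    scores = np.zeros(blocks[0].shape[1])
    for block in blocks:
        z = (block - block.mean(axis=0)) / block.std(axis=0)
        corr = z.T @ (y - y.mean()) / (len(y) * y.std())   # Pearson correlation per band
        lasso = Lasso(alpha=alpha, max_iter=10000).fit(z, y)
        scores += normalize(corr) + normalize(lasso.coef_)
    return scores

# Synthetic spectra in which band 10 drives the target.
rng = np.random.default_rng(2)
raw = rng.normal(size=(100, 25))
y = 3.0 * raw[:, 10] + 0.1 * rng.normal(size=100)

d1 = np.gradient(raw, axis=1)   # first derivative (assumed estimator)
d2 = np.gradient(d1, axis=1)    # second derivative
scores = band_scores([raw, d1, d2], y)
print(int(np.argmax(scores)))
```

    Summing over the spectrum and its derivatives yields one score per band, so the result can be drawn as a single heatmap row per property.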

    [Figure 2 panels. A-D (regression): scatter plots of Predicted versus True values, with Log10(Predicted) versus Log10(True) axes in one panel; legend: Reference, All features, Best feature, PLSR (6 components). E-J (classification): confusion matrices of True versus Predicted class. Counts per panel in the order E to J; rows are true classes, columns are predicted classes.]

    Panel    True class   Pred. High   Pred. Medium   Pred. Low   Correct
    E. pH    High                348             62           4   84.1% (348/414)
    E. pH    Medium               13            104          21   75.4% (104/138)
    E. pH    Low                   4             31          66   65.3% (66/101)
    F. OM    High                311            114           7   72.0% (311/432)
    F. OM    Medium               28            155           7   81.6% (155/190)
    F. OM    Low                   4             18           9   29.0% (9/31)
    G. Ca    High                254             20           6   90.7% (254/280)
    G. Ca    Medium               16             66          57   47.5% (66/139)
    G. Ca    Low                  19             51         164   70.1% (164/234)
    H. K     High                  9              6           4   47.4% (9/19)
    H. K     Medium                6             70          25   69.3% (70/101)
    H. K     Low                  18            152         363   68.1% (363/533)
    I. Mg    High                 65             30          21   56.0% (65/116)
    I. Mg    Medium               86            164         104   46.3% (164/354)
    I. Mg    Low                  28             49         106   57.9% (106/183)

    Panel    True class   Pred. Low   Pred. Medium   Correct
    J. Na    Low                643              2   99.7% (643/645)
    J. Na    Medium               5              3   37.5% (3/8)

    Figure 2: Regression results: A. pH, B. organic matter, C. Ca, D. Mg. For each property, we present the result of the best ML model (red) together with the results that simulate chemometric techniques: the regression on the band with the highest correlation (blue) and PLSR (green). Classification results: confusion matrix for each property, E. pH, F. OM, G. Ca, H. K, I. Mg, J. Na.

    3. Results

    Figure 2 shows the prediction results of both regressors and classifiers. We get the best pH estimates with an SVR regressor, which reaches a correlation between true and predicted values of ρ = 0.90 (R² = 0.80) in the test set. When using only the feature that best correlates with pH (see Figure 3), we get ρ = 0.70 (R² = 0.48) with linear regression. SVR performed slightly better than PLSR (ρ = 0.87 and R² = 0.75). Both SVR and PLSR significantly improved the accuracy of the pH estimates over the best feature, as shown by the non-overlapping 95% confidence intervals of the three models (SVR, best feature, and PLSR).

    Table 1: Model comparison for each property: correlation (ρ_test) and determination (R²_test) coefficients, and MSE in the test set. The models presented are the best ML regression result and the two approaches that simulate chemometric techniques: (1) linear regression on the most highly correlated band and (2) PLSR.

    Property  Model  Model detail            ρ_test  R²_test  MSE_test
    pH        SVR    All bands               0.898   0.802       0.270
    pH        LR     Band D1 at 621 nm       0.694   0.479       0.709
    pH        PLSR   6 components            0.865   0.745       0.347
    OM        LASSO  Selected by model       0.620   0.372       6.364
    OM        LR     Band D1 at 1913 nm      0.471   0.220       7.907
    OM        PLSR   6 components            0.611   0.359       6.495
    Ca        LASSO  Selected by model       0.746   0.541      70.371
    Ca        LR     Band D1 at 612.5 nm     0.545   0.280     110.480
    Ca        PLSR   6 components            0.691   0.424      88.284
    Mg        LASSO  Selected by model       0.649   0.415       0.246
    Mg        LR     Band D1 at 621 nm       0.454   0.193       0.340
    Mg        PLSR   6 components            0.634   0.399       0.253
    K         SVR    All bands               0.478   0.167       0.022
    K         SVR    Band D2 at 493.5 nm     0.173   0.022       0.026
    K         PLSR   6 components            0.360   0.019       0.026
    Na        LASSO  Selected by model       0.253   0.060       0.053
    Na        LR     Band D2 at 1751.5 nm    0.145   0.013       0.055
    Na        PLSR   6 components            0.276   0.056       0.053

    We obtained OM estimates correlated at ρ = 0.62 with ground-truth values (R² = 0.37) using the LASSO regressor. Compared to the best-feature regressor (ρ = 0.47, R² = 0.22), LASSO and PLSR significantly improved the estimates (non-overlapping 95% confidence intervals). Ca estimates using LASSO correlated at ρ = 0.75 with true values (R² = 0.54), a significant increase in accuracy compared with the best-feature regressor. LASSO also slightly improved on the PLSR estimates, although not significantly. Mg estimates using LASSO and PLSR were similar (ρ = 0.65, R² = 0.42 for LASSO) and significantly different from those of the regressor based on the best correlated feature. We tested several regression models on the remaining soil properties (K and Na), obtaining a correlation ρ below 0.5. Table 1 summarises the regression results, presenting the best ML regressor alongside the results that simulate chemometric approaches.

    The performance of the best ML classifiers, based on pre-defined labels from experts, is shown in Figure 2 (panels E-J). Averaged over all properties, we obtain an accuracy of 74%. For pH, the accuracy for each label is over 65%, and every label represents at least 15% of the samples. However, for OM, K, and Na, some labels are under-represented (5% or less of the training dataset), and the accuracy for those labels decreased to 30 - 40%. For K, labels are predicted correctly most of the time except for Medium levels. For Na, the Low label is correctly predicted 99% of the time, whereas Medium values (under-represented) are mislabeled more than 60% of the time.
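
    The overall and per-label accuracies can be recomputed from the confusion matrix counts printed in Figure 2. The short sketch below does so in plain Python; the assignment of matrices to properties assumes the panels appear in the order E to J (pH, OM, Ca, K, Mg, Na).

```python
# Confusion matrices from Figure 2 (rows: true class, columns: predicted class).
# Class order is High, Medium, Low, except Na, which is Low, Medium.
confusions = {
    "pH": [[348, 62, 4], [13, 104, 21], [4, 31, 66]],
    "OM": [[311, 114, 7], [28, 155, 7], [4, 18, 9]],
    "Ca": [[254, 20, 6], [16, 66, 57], [19, 51, 164]],
    "K": [[9, 6, 4], [6, 70, 25], [18, 152, 363]],
    "Mg": [[65, 30, 21], [86, 164, 104], [28, 49, 106]],
    "Na": [[643, 2], [5, 3]],
}

def overall_accuracy(matrix):
    """Share of samples on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def per_class_recall(matrix):
    """Share of each true class that was predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

accuracies = {name: overall_accuracy(m) for name, m in confusions.items()}
mean_accuracy = sum(accuracies.values()) / len(accuracies)

for name, acc in accuracies.items():
    print(f"{name}: {acc:.3f}")
print(f"mean: {mean_accuracy:.3f}")
```

    The mean over the six properties comes out at about 0.74, while the per-property values range widely, so the single figure should be read together with the per-label rates.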

    Finally, Figure 3.A shows the feature ranking for each property and the regions with distinct absorption (towards red). The visible 450-670 nm range contains highly ranked features for all properties. pH and Ca have similar feature ranking heatmaps, with the highest ranked bands around 600 nm. Mg has a highly correlated range at 600-670 nm, while K has a highly ranked area near 500 nm. Na, instead, presents highly ranked features between 2100 and 2400 nm.

    [Figure 3 panels. A: heatmap of the ranking score (color scale from 2 to 4) per property (rows: Ca, K, Mg, Na, OM, pH) against Wavelength [nm] (400 to 2500). B: absorption groups against Wavelength [nm] (400 to 2500): Aliphatics, Alkyl asymmetric-symmetric doublet, Amides, Amine, Aromatics, Carbohydrates, Carbonate, Carboxylic acids, Goethite, Haematite, Hydroxyl, Illite, Kaolin doublet, Methyls, Phenolics, Polysaccharides, Smectite, Water.]

    Figure 3: A. Bands with the highest correlation for each property. B. NIRS spectra and bands with relative peak positions for the absorption of soil constituents.

    4. Discussion

    Classifiers, misclassification cost, and oversampling

    We proposed an alternative tool for property estimation from soil samples by recasting the regression as a classification problem. We labeled conventional test results depending on the property to be estimated, using standard mappings from the literature of real property values into qualitative classes. For our dataset, such mappings produced unbalanced classes, some with few samples (classes with less than 5% of the samples). The surveyed ML models classified the test samples mostly into the label with the largest training set. We marginally improved the classification of the under-represented labels by introducing weighted metrics in the cross-validation stage. We also tested oversampling approaches to synthetically balance our training dataset; however, weighted metrics outperformed that approach. These classifiers can be used as a qualitative assessment tool that might help in optimal sampling design for further, expensive conventional lab tests or in the design of initial intervention plans.

    vis-NIRS to diagnose soil condition

    In sugarcane soils, Viscarra Rossel et al. [4] used vis-NIRS to predict soil properties and moved towards a soil fertility index. They used 184 soil samples and PLSR to estimate soil properties, in addition to 17 terrain attributes to derive the index. Awiti et al. [22] fed vis-NIR spectra into an odds logistic model to classify soil into good, average, and poor condition. The combination of vis-NIRS and ML is a rapid strategy that offers the possibility to diagnose soil conditions. This study is a first step in evaluating the performance of vis-NIRS with ML regressors and classifiers, and we look forward to moving on to soil diagnosis.

    Bands with highest correlations and chemical hypotheses

    The regions of highly ranked features are centered near 500, 600, 1400, 1700, 1900, 2200, and 2400 nm (Figure 3). For pH (H2O), absorptions near 500 and 600 nm are primarily associated with minerals such as hematite and goethite [23, 5], while those near 600 nm also result from chromophores and the darkness of organic matter. In the vis-NIR range, the overtones and combination bands due to organic matter result from the stretching and bending of CO, CH, and NH groups [24]. The band around 1400 nm is linked to the vibration of OH and to residual water in organic matter [25]. On the other hand, the wavelengths at 1700 and 1930 nm are assigned to the (C-H) and (C=O) groups, which correspond to aromatics and the alkyl asymmetric-symmetric doublet, and to carboxylic acids, respectively [26]. These bands have been identified as important for organic matter calibration [5]. The band near 2200 nm can be attributed to metal-OH bend plus O-H stretch combinations of several clay minerals, among them illitic types [27], as well as to organic compounds and carbonate. The wavelength at 2350 nm is related to Mg-OH [28]. Finally, in Figure 3 the region between 500 and 600 nm has a high correlation with the chemical parameters analyzed, which could be related both to the dissolution mechanisms of iron oxides within the soils, particularly within the rhizosphere (protonation, reduction, complexation) [29], and to the reactions of organic matter (humic and fulvic acids) with cations (Ca2+, K+, Mg2+, and Na+) [30, 31, 32, 33, 34]. Although there is no direct association between the properties and the NIRS spectra, the highly ranked features could be associated with components of the properties.

    5. Conclusions

    ML regressors using a combination of the spectra, their first and second derivatives, and FFT features as input were the best models for pH, OM, Ca, and Mg soil content. Although the estimation performance is close to that reported in the literature, it is critical to increase the number of samples, adding soil samples with extreme values, to enhance prediction power. ML classifiers are a feasible strategy when ML regressors perform poorly. ML classifiers can also be used as a qualitative assessment tool for optimal sampling design.

    The feature ranking approach gives the researcher insight into the bands that correlate highly with each property. It is essential to understand what is behind ML approaches; thus, feature ranking is a first step in getting back to the data.

    6. Data availability upon acceptance

    The filtered datasets and scripts are archived on GitHub (available upon acceptance).

    7. Acknowledgements

    Special thanks to Oscar Daniel Torres Rodríguez and Andrés Felipe Mariño Guerra for a preliminary study. We are also grateful to project 243, Recomendaciones técnicas preliminares de manejo de suelos en ladera para el sistema de producción de caña panelera en la HRS, from AGROSAVIA, which obtained the data used in this study.

    References

    [1] E. K. Bünemann, G. Bongiorno, Z. Bai, R. E. Creamer, G. De Deyn, R. de Goede, L. Fleskens, V. Geissen, T. W. Kuyper, P. Mäder, et al., Soil quality - a critical review, Soil Biology and Biochemistry 120 (2018) 105–125.

    [2] M. R. Nanni, J. A. M. Demattê, Spectral Reflectance Methodology in Comparison to Traditional Soil Analysis, Soil Science Society of America Journal 70 (2006) 393–407. URL: http://doi.wiley.com/10.2136/sssaj2003.0285. doi:10.2136/sssaj2003.0285.

    [3] J. C. Cañasveras, V. Barrón, M. C. del Campillo, R. A. Viscarra Rossel, Espectroscopía de reflectancia: Una herramienta para predecir las propiedades del suelo relacionadas con la clorosis férrica, Spanish Journal of Agricultural Research 10 (2012) 1133–1142. doi:10.5424/sjar/2012104-681-11.

    [4] R. Viscarra Rossel, R. Rizzo, J. Demattê, T. Behrens, Spatial modeling of a soil fertility index using visible–near-infrared spectra and terrain attributes, Soil Science Society of America Journal 74 (2010) 1293–1300.

    [5] B. Stenberg, R. A. Viscarra Rossel, A. M. Mouazen, J. Wetterlind, Chapter five - Visible and near infrared spectroscopy in soil science, in: D. L. Sparks (Ed.), Advances in Agronomy, volume 107, Academic Press, 2010, pp. 163–215. URL: http://www.sciencedirect.com/science/article/pii/S0065211310070057. doi:10.1016/S0065-2113(10)07005-7.

    [6] M. Yang, D. Xu, S. Chen, H. Li, Z. Shi, Evaluation of machine learning approaches to predict soil organic matter and pH using vis-NIR spectra, Sensors (Switzerland) 19 (2019). doi:10.3390/s19020263.

    [7] J. Padarian, B. Minasny, A. B. McBratney, Machine learning and soil sciences: A review aided by machine learning tools, Soil 6 (2020) 35–52.

    [8] J. Ding, A. Yang, J. Wang, V. Sagan, D. Yu, Machine-learning-based quantitative estimation of soil organic carbon content by vis/NIR spectroscopy, PeerJ 6 (2018) e5714.

    [9] M. Yang, D. Xu, S. Chen, H. Li, Z. Shi, Evaluation of machine learning approaches to predict soil organic matter and pH using vis-NIR spectra, Sensors 19 (2019) 263.

    [10] A. Morellos, X.-E. Pantazi, D. Moshou, T. Alexandridis, R. Whetton, G. Tziotzios, J. Wiebensohn, R. Bill, A. M. Mouazen, Machine learning based prediction of soil total nitrogen, organic carbon and moisture content by using vis-NIR spectroscopy, Biosystems Engineering 152 (2016) 104–116.

    [11] S. Nawar, A. Mouazen, On-line vis-NIR spectroscopy prediction of soil organic carbon using machine learning, Soil and Tillage Research 190 (2019) 120–127.

    [12] IGAC, SUELOS Y TIERRAS DE COLOMBIA, 3 ed., Instituto Geográfico Agustín Codazzi, 2015.

    [13] J. H. Camacho-Tamayo, Y. Rubiano S, M. d. P. Hurtado S, Near-infrared (NIR) diffuse reflectance spectroscopy for the prediction of carbon and nitrogen in an oxisol, Agronomia colombiana 32 (2014) 86–94.

    [14] J. H. Camacho-Tamayo, N. M. Forero-Cabrera, L. Ramírez-López, Y. Rubiano, Near-infrared spectroscopic assessment of soil texture in an oxisol of the eastern plains of Colombia, Colombia Forestal 20 (2017) 5–18.

    [15] D. Cozzolino, A. Morón, The potential of near-infrared reflectance spectroscopy to analyse soil chemical and physical characteristics, Journal of Agricultural Science 140 (2003) 65–71. doi:10.1017/S0021859602002836.

    [16] R. Zornoza, C. Guerrero, J. Mataix-Solera, K. Scow, V. Arcenegui, J. Mataix-Beneyto, Near infrared spectroscopy for determination of various physical, chemical and biochemical properties in Mediterranean soils, Soil Biology and Biochemistry 40 (2008) 1923–1930.

  • [17] S. B. Aguiar Herrera, Bases técnicas para el establecimiento y manejo delcultivo de caña en el departamento de Casanare, 1 ed., Corpoica, 2001.

    [18] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Pas-sos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn:Machine learning in Python, Journal of Machine Learning Research 12(2011) 2825–2830.

    [19] P. Juszczak, D. M. J. Tax, R. P. W. Duin, Feature scaling in support vectordata description, 2002.

    [20] S. J. Russell, P. Norvig, E. Al, Artificial intelligence : a modern approach,Pearson, Cop, 2010.

    [21] D. Chicco, G. Jurman, The advantages of the Matthews correlation co-efficient (MCC) over F1 score and accuracy in binary classification eval-uation, BMC Genomics 21 (2020) 6. URL: https://doi.org/10.1186/s12864-019-6413-7. doi:10.1186/s12864-019-6413-7.

    [22] A. O. Awiti, M. G. Walsh, K. D. Shepherd, J. Kinyamario, Soil condition classification using infrared spectroscopy: A proposition for assessment of soil condition along a tropical forest-cropland chronosequence, Geoderma 143 (2008) 73–84.

    [23] R. V. Morris, H. V. Lauer, C. A. Lawson, E. K. Gibson, G. A. Nace, C. Stewart, Spectral and other physicochemical properties of submicron powders of hematite (alpha-Fe2O3), maghemite (gamma-Fe2O3), magnetite (Fe3O4), goethite (alpha-FeOOH) and lepidocrocite (gamma-FeOOH), Journal of Geophysical Research 90 (1985) 3126–3144. doi:10.1029/JB090iB04p03126.

    [24] E. Ben-Dor, J. R. Irons, G. F. Epema, Soil reflectance, in: Remote Sensing for the Earth Sciences, volume 3 of Manual of Remote Sensing, Wiley, New York, 1999, pp. 111–188.

    [25] R. Reda, T. Saffaj, B. Ilham, O. Saidi, K. Issam, L. Brahim, E. M. El Hadrami, A comparative study between a new method and other machine learning algorithms for soil organic carbon and total nitrogen prediction using near infrared spectroscopy, Chemometrics and Intelligent Laboratory Systems 195 (2019). doi:10.1016/j.chemolab.2019.103873.

    [26] R. V. Rossel, T. Behrens, Using data mining to model and interpret soil diffuse reflectance spectra, Geoderma 158 (2010) 46–54.

    [27] R. N. Clark, T. V. King, M. Klejwa, G. A. Swayze, N. Vergo, High spectral resolution reflectance spectroscopy of minerals, Journal of Geophysical Research 95 (1990) 12653–12680. URL: https://agupubs.onlinelibrary.wiley.com/doi/full/10.1029/JB095iB08p12653. doi:10.1029/jb095ib08p12653.

    [28] Q. Fang, H. Hong, L. Zhao, S. Kukolich, K. Yin, C. Wang, Visible and near-infrared reflectance spectroscopy for investigating soil mineralogy: A review, Journal of Spectroscopy (2018). URL: http://hdl.handle.net/10150/628358. doi:10.1155/2018/3168974.

    [29] U. Schwertmann, Solubility and dissolution of iron oxides, Plant and Soil 130 (1991) 1–25.

    [30] M. Ali, W. Mindari, Effect of humic acid on soil chemical and physical characteristics of embankment, MATEC Web of Conferences (2015). doi:10.1051/conf/2016.

    [31] H. R. Sindelar, M. T. Brown, T. H. Boyer, Effects of natural organic matter on calcium and phosphorus co-precipitation, Chemosphere 138 (2015) 218–224. doi:10.1016/j.chemosphere.2015.05.008.

    [32] F. L. Wang, P. M. Huang, Effects of organic matter on the rate of potassium adsorption by soils, Canadian Journal of Soil Science (2001). URL: www.nrcresearchpress.com.

    [33] M. Yan, Y. Lu, Y. Gao, M. F. Benedetti, G. V. Korshin, In-Situ Investigation of Interactions between Magnesium Ion and Natural Organic Matter, Environmental Science and Technology 49 (2015) 8323–8329. doi:10.1021/acs.est.5b00003.

    [34] S. Droge, K. U. Goss, Effect of sodium and calcium cations on the ion-exchange affinity of organic cations for soil organic matter, Environmental Science and Technology 46 (2012) 5894–5901. doi:10.1021/es204449r.

    [35] R. N. Clark, T. V. King, M. Klejwa, G. A. Swayze, N. Vergo, High spectral resolution reflectance spectroscopy of minerals, Journal of Geophysical Research 95 (1990). doi:10.1029/jb095ib08p12653.

    [36] E. Suess, Interaction of organic compounds with calcium carbonate-II. Organo-carbonate association in Recent sediments, Geochimica et Cosmochimica Acta 37 (1973) 2435–2447.

    [37] C. Pasquini, Near infrared spectroscopy: Fundamentals, practical aspects and analytical applications, 2003. doi:10.1590/S0103-50532003000200006.

    [38] A. Niemöller, D. Behmer, Use of Near Infrared Spectroscopy in the Food Industry, Nondestructive Testing of Food Quality (2008) 67–118. doi:10.1002/9780470388310.ch4.


    [39] B. S. Bansod, N. Kamboj, Measurement of soil attributes using NIR spectroscopy: A review, International Journal of Advance Research in Science and Engineering (2016) 601–606.

    [40] T. Udelhoven, C. Emmerling, T. Jarmer, Quantitative analysis of soil chemical properties with diffuse reflectance spectrometry and partial least-square regression: A feasibility study, Plant and Soil 251 (2003) 319–329. doi:10.1023/A:1023008322682.

    [41] H. U. Rehman, M. Knadel, L. Wollesen de Jonge, E. Arthur, Predicting soil cation exchange capacity for variable soil types with visible near infrared spectra, in: EGU General Assembly Conference Abstracts, 2018, p. 3595.

    [42] A. P. Leone, G. Leone, N. Leone, C. Galeone, E. Grilli, N. Orefice, V. Ancona, Capability of Diffuse Reflectance Spectroscopy to Predict Soil Water Retention and Related Soil, Water (Switzerland) 11 (2019) 1–16.

    [43] Y. Ulusoy, Y. Tekin, Z. Tümsavaş, A. M. Mouazen, Prediction of soil cation exchange capacity using visible and near infrared spectroscopy, Biosystems Engineering 152 (2016) 79–93. doi:10.1016/j.biosystemseng.2016.03.005.

    [44] J. Padarian, B. Minasny, A. McBratney, Transfer learning to localise a continental soil vis-NIR calibration model, Geoderma 340 (2019) 279–288.

    [45] R. V. Rossel, T. Behrens, E. Ben-Dor, D. Brown, J. Demattê, K. D. Shepherd, Z. Shi, B. Stenberg, A. Stevens, V. Adamchuk, et al., A global spectral library to characterize the world’s soil, Earth-Science Reviews 155 (2016) 198–230.
