DIGITAL MAPPING OF SOIL CARBON FRACTIONS
By
HAMZA KESKIN
A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE
UNIVERSITY OF FLORIDA
2015
© 2015 Hamza Keskin
To new generation soil scientists
ACKNOWLEDGMENTS
I would like to thank my parents, who always encouraged me to pursue my
dreams. I specifically thank my major advisor, Dr. Sabine Grunwald, for believing in me
and providing me with the data from a research project she led as principal investigator,
titled “Rapid Assessment and Trajectory Modeling of Changes in Soil Carbon
across a Southeastern Landscape” (USDA-CSREES-NRI grant award 2007-35107-18368
by the National Institute of Food and Agriculture (NIFA), U.S. Department of
Agriculture). I also thank my supervisory committee members, Dr. Willie Harris and
Dr. Samira Daroub, for their professional advice and suggestions to increase the quality
of the manuscripts.
I owe a great deal of thanks to the Republic of Turkey Ministry of Forestry and
Water Affairs for financial support throughout the master’s program and to the General
Directorate of Combating Desertification and Erosion for the opportunity to work with
them after graduation.
Acknowledgments for field sampling, laboratory analysis, and development of the
soil-environmental database go to the co-principal investigators of the project, Dr.
Nicholas B. Comerford, Dr. Willie G. Harris, and Dr. Gregory L. Bruland. I also thank
D. Brenton Myers, Nichola M. Knox, Deoyani Sarkhot, Elena Azuaje, C. Wade Ross,
Xiong Xiong, Jongsung Kim, Gustavo M. Vasques, Pasicha Chaikaew, Aja Stoppe, Lisa
Stanley, Adriana Comerford, Xiaoling Dong, Samiah Moustafa, and Anne Quidez, who
contributed to the construction of the carbon data used in Chapter 3 of the thesis. I
would also like to thank Esther Kaufman for unconditional help and guidance on the
revisions of the manuscripts.
TABLE OF CONTENTS
ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT
CHAPTER
1 INTRODUCTION
2 REGRESSION KRIGING AS WORKHORSE IN THE PEDOMETRICIAN’S TOOLBOX
  2.1 Introduction
  2.2 Material and Methods
  2.3 Results
    2.3.1 Spatial Scale
      2.3.1.1 Geographic region
      2.3.1.2 Area extent
      2.3.1.3 Grain size
    2.3.2 Target Soil Properties and Classes
    2.3.3 Sampling
      2.3.3.1 Sampling design
      2.3.3.2 Sample size-density
      2.3.3.3 Sample depth(s)
    2.3.4 SCORP Factors
    2.3.5 Preprocessing
      2.3.5.1 Logarithmic transformation
      2.3.5.2 Factor analysis
    2.3.6 Regression Type to Quantify Deterministic Variation
    2.3.7 Variogram
      2.3.7.1 Model type
      2.3.7.2 N:S ratio
      2.3.7.3 Range
    2.3.8 Validation
  2.4 Discussion and Recommendations
    2.4.1 Factors Effecting Performance of RK
    2.4.2 Regression Kriging as a Default Soil Mapping Method
      2.4.2.1 Satisfactory performance of regression kriging over its competitors
      2.4.2.2 Unsatisfactory performance of regression kriging over its competitors
    2.4.3 REML-EBLUP vs. RK
    2.4.4 Future Trend of RK
    2.4.5 Model Averaging
  2.5 Conclusions and Outlook
3 DIGITAL MAPPING OF SOIL CARBON FRACTIONS
  3.1 Introduction
  3.2 Materials and Methods
    3.2.1 Study Area
    3.2.2 Soil Data
      3.2.2.1 Sampling design and field sampling
      3.2.2.2 Laboratory and chemical analysis
      3.2.2.3 Determination of total, recalcitrant and labile carbon stocks
    3.2.3 Environmental Data
      3.2.3.1 Assembled environmental variables representing STEP-AWBH factors
      3.2.3.2 Boruta feature selection technique
    3.2.4 Modeling Techniques
    3.2.5 Evaluation of Model Performance
    3.2.6 Application of Models
    3.2.7 Mapping of Total, Labile and Recalcitrant Carbon Stocks
  3.3 Results and Discussion
    3.3.1 Descriptive Summary Statistic of Carbon Fractions
    3.3.2 Spatial Autocorrelation with Trend and without Trend
    3.3.3 Important Variables
    3.3.4 Assessment of the Prediction Capability of the Selected Methods
    3.3.5 Residual Spatial Autocorrelation of Evaluated Methods
    3.3.6 Regional Scale Controls on Stabilization of Soil Carbon
    3.3.7 Spatial Distribution of C fractions
  3.4 Conclusions
4 SUMMARY AND SYNTHESIS
APPENDIX: LITERATURE REVIEW
LIST OF REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES
Table
2-1 Spatial range (m) from reviewed studies under three different
2-2 Modified Version of Regression Kriging (RK)
3-1 Assembled environmental variables representing STEP-AWBH factors
3-2 R packages to perform evaluated methods
3-3 Descriptive statistic of observed soil C fractions.
3-4 Spearman’s correlation analysis of the paired soil C fractions.
3-5 Z score as a sign for relative importance of all-relevant variables identified by Boruta.
3-6 Performance of eight different modelling methods to predict soil total carbon (TC), recalcitrant carbon (RC) and labile carbon (HC) on validation.
3-7 Cross-validation (on the 70% calibration dataset) and independent validation (on the 30% validation dataset) results of Random Forest models
LIST OF FIGURES
Figure
2-1 Evolution of Hybrid Interpolation Techniques
2-2 General framework for Regression Kriging
2-3 The cumulative amount of RK studied over time.
2-4 Effects of coefficient of variation on the accuracy of RK methods compared in the 71 cases.
3-1 A total of 1014 soil sampling locations
3-2 Upper part of figure depicts the omnidirectional variograms for total carbon (TC), recalcitrant carbon (RC)
3-3 Predicted vs. observed soil total carbon (TC) of validation dataset
3-4 Predicted vs. observed soil recalcitrant carbon (RC) of validation dataset
3-5 Predicted vs. observed soil hot-water extractable carbon (HC) of validation
3-6 Relative increase (%) in root mean squared deviations (RMSD) of evaluated prediction techniques compared to RMSD of OK.
3-7 Strength of the spatial autocorrelation among evaluated model residuals for total carbon (TC).
3-8 Strength of the spatial autocorrelation among evaluated model residuals for recalcitrant carbon (RC).
3-9 Strength of the spatial autocorrelation among evaluated model residuals for hot-water extractable carbon (HC).
3-10 Violin plot of soil total C (kg m−2) of the 1014 current samples (2008–2009) grouped by the field-observed land use/land cover (LULC).
3-11 Violin plot of soil recalcitrant C (kg m−2) of the 1014 current samples (2008–2009) grouped by the field-observed land use/land cover (LULC).
3-12 Violin plot of soil hydrolysable C (kg m−2) of the 1014 current samples (2008–2009) grouped by the field-observed land use/land cover (LULC).
3-13 Spatial distribution of landcover/landuse classes
3-14 Violin plot of soil total C (kg m−2) of the 1014 current samples (2008–2009) grouped by the field-observed suborders.
3-15 Violin plot of soil recalcitrant C (kg m−2) of the 1014 current samples (2008–2009) grouped by the field-observed suborders.
3-16 Violin plot of soil hydrolysable C (kg m−2) of the 1014 current samples (2008–2009) grouped by the field-observed suborders.
3-17 Spatial distribution of soil suborders
3-18 Spatial distribution patterns of estimated soil total carbon stocks (kg m-2) across Florida, U.S.
3-19 Spatial distribution patterns of estimated recalcitrant carbon stocks (kg m-2) across Florida, U.S.
3-20 Spatial distribution patterns of estimated hot-water extractable carbon stocks (kg m-2) across Florida, U.S.
LIST OF ABBREVIATIONS
ANN Artificial neural network
AWC Available water holding capacity (cm)
BaRT Bagged regression tree
BK Block kriging
BoRT Boosted regression tree
C Carbon (kg m-2)
C:N Carbon : nitrogen
CART Classification and regression tree
CLHS Conditioned Latin hypercube sampling
CV Coefficient of variation (%)
DEM Digital elevation model (m)
DSM Digital soil mapping
DSMM Digital soil mapping and modeling
GAM Generalized additive model
GLM Generalized linear model
HC Hot-water extractable carbon (kg m-2)
LULC Land use/land cover
ME Mean error (kg m-2)
MLR Multiple linear regression
N:S Nugget to sill ratio (%)
NRMSD Normalized root mean squared deviation
OK Ordinary kriging
PCA Principal component analysis
PLSR Partial least square regression
RC Recalcitrant carbon (kg m-2)
REML-EBLUP Residual maximum likelihood-empirical best linear unbiased prediction
RF Random forest
RK Regression kriging
RMSD Root mean squared deviation (kg m-2)
RPD Residual prediction deviation
RPIQ Ratio of prediction error to inter-quartile range
RSA Residual spatial autocorrelation
SCORP S: Soil, C: Climate, O: Organism, R: Relief, P: Parent material
SMLR Stepwise multiple linear regression
SOC Soil organic carbon (kg m-2)
SOM Soil organic matter (kg m-2)
STEP-AWBH S: Soil, T: Topography, E: Ecology, P: Parent material,
A: Atmosphere, W: Water, B: Biota, H: Human
SVM Support vector machine
T Training dataset (N = 710)
TC Total carbon (kg m-2)
V Validation dataset (N = 304)
$x$ Location in one, two or three dimensions
$Z(x)$ The random variable $Z$ at location $x$
$\mu(x)$ Deterministic structural component, trend (drift)
$\varepsilon'(x)$ Stochastic component, spatially dependent residual from $\mu(x)$ [the
regionalized variable]
$\varepsilon''(x)$ Spatially independent component, noise, unexplained variability
Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science
DIGITAL MAPPING OF SOIL CARBON FRACTIONS
By
Hamza Keskin
December 2015
Chair: Sabine Grunwald
Major: Soil and Water Science
Our understanding of the spatial distribution of soil carbon (C) pools across
diverse land uses, soils, and climatic gradients at regional scale is still limited. Research
in digital soil mapping and modeling that investigates the interplay between (i) soil C
pools and environmental factors (“deterministic trend model”) and (ii) stochastic,
spatially dependent variations in soil C fractions (“stochastic model”) is just emerging.
This motivated us to investigate soil C pools across the State of Florida, which covers
about 150,000 km². Our specific objectives were to (i) compare different soil C pool
models that quantify stochastic and/or deterministic components, (ii) assess the
prediction performance of soil C models, and (iii) identify environmental factors that
exert the most control on labile and recalcitrant pools and total soil C (TC). We used soil
data (0-20 cm) from a research project (USDA-CSREES-NRI grant award 2007-35107-
18368) collected at 1,014 georeferenced sites including measured bulk density (BD),
recalcitrant carbon (RC), labile (hot-water extractable) carbon (HC) and TC. A
comprehensive set of 327 geospatial soil-environmental variables was acquired. The
Boruta method was employed to identify “all-relevant” soil-environmental predictors. We
employed eight methods, namely Classification and Regression Tree (CART), Bagged
Regression Tree (BaRT), Boosted Regression Tree (BoRT), Random Forest (RF),
Support Vector Machine (SVM), Partial Least Square Regression (PLSR), Regression
Kriging (RK), and Ordinary Kriging (OK), to predict soil C fractions and TC. Overall, 36,
20 and 25 predictors stood out as “all-relevant” to estimate TC, RC and HC,
respectively. We predicted a mean of 5.39 ± 3.74 kg TC m-2 in the top 20 cm with the
best model. The prediction performance assessed by the Ratio of Prediction Error to
Inter-quartile Range for TC stocks was as follows: RF > SVM > BoRT > BaRT > PLSR >
RK > CART > OK. The best models explained 71.6%, 71.7% and 30.5% of the total
variation for TC, RC and HC, respectively. Biotic and hydro-pedological factors
explained most of the variation in soil C pools and TC; lithologic and climatic factors
showed some relationships to soil C pools and TC, whereas topographic factors faded
from the soil C models.
CHAPTER 1
INTRODUCTION
At the beginning of the 21st century, advances in computational power,
geographic information systems, remote sensing and statistical methods have
collectively enabled pedologists to produce state-of-the-art, reliable, categorical and
continuous spatial soil information at multiple scales in space and time. This information
empowers environmental scientists to model, and policy makers to deal with, wicked
environmental problems such as land degradation, climate change, food and water
security, and the protection of biodiversity and ecosystem functions (Bouma and
McBratney, 2013; Hartemink and McBratney, 2008). Consequently, providing high-quality, justifiably
reliable, reproducible spatiotemporal soil information with limited uncertainty has been
the major focus in digital soil mapping (DSM) which has now shifted from the research
phase into an operational phase (Minasny and McBratney, 2015). Decreasing
inaccuracies in DSM is an essential requirement in the quest to comprehend variability
in soil properties/classes at multiple scales. A better understanding of soil variability will
pave the way for a better understanding of geo-patterns on the Earth’s surface
(Bockheim and Gennadiyev, 2010).
Soil, as the central component of the Earth’s critical zone (Lin, 2010) and a large
terrestrial C pool, dictates the quantity and quality of soil ecosystem services at multiple
scales. At the local scale, SOC management is particularly important as it influences the
physical, chemical, and biological properties of soil. At the global scale, appropriate
management of soil is critical because of its role in mitigating the atmospheric level of
greenhouse gases (GHGs) by sequestering C from the atmosphere (Milne et al.,
2007). Due to the enormous capacity of soil to retain C, even relatively small shifts in its
quantity could cause dramatic changes in the global C balance (Smith et al., 2008).
Baldock et al. (2012) estimated that a mere 1% increase in the C content of the pedosphere
could offset 8 ppm of C content in the atmosphere. Hence, the re-equilibrium between
above- and below-ground C storage can be accomplished by reducing C loss and
boosting C build-up in the pedosphere with appropriate management practices (Lal,
2004). Accurately assessing soil C storage is challenged by the complexity of highly
variable C fluxes and C forming/degrading processes in space and time. Thus,
decreasing the uncertainty associated with today’s regional and global scale C
estimation can facilitate Earth System Science to address human-driven threats to soil
quality and soil security.
Regression Kriging (RK) is one of the most popular, practical and robust hybrid
spatial interpolation techniques in the pedometrician’s toolbox, enabling the
modeling of soil distribution patterns at multiple scales in space and time. It explicitly
accounts for the deterministic and stochastic portions of the total variation of the
phenomena of interest.
A review of the literature is provided in Chapter 2, which articulates the past, present and
future development of DSM-RK studies. It describes the evolution of RK from a
historical perspective and traces its development in the last decade using an extensive
literature review, with the purpose of characterizing the factors that affect the prediction
performance of RK. The review also illustrates the steps taken to develop efficient RK
models and identifies the limitations and strengths of RK. The results of this review
raised further questions: (i) Is it possible to incorporate data-mining methods into the RK
framework? (ii) What effect does the residual spatial autocorrelation (RSA) have in
geo-spatial hybrid methods? (iii) Do hybrid methods that were developed with sophisticated
data-mining methods yield better prediction than standalone hybrid methods?
Chapter 3 addresses the modelling and mapping of TC, RC and HC across the
State of Florida by constructing parsimonious geo-spatial soil landscape models without
sacrificing prediction accuracy. What makes Chapter 3 compelling is that the labile and
recalcitrant portions of soil total C are modelled and mapped along pedo-climatic
trajectories in a diverse subtropical region. Additionally, the quantification of the
stochastic and deterministic variability of soil C pools is scarce in the literature.
Moreover, the RSA of the eight different methods populated by strategically identified
ancillary variables is compared to answer the abovementioned questions that shape the
future development of the RK framework. Last but not least, regional scale
environmental controls on the stabilization and destabilization mechanisms of soil C are
discussed to enhance the interpretability of the modelling efforts.
CHAPTER 2
REGRESSION KRIGING AS WORKHORSE IN THE PEDOMETRICIAN’S TOOLBOX
2.1 Introduction
Inherent soil variation reduces the accuracy and reliability of soil maps, and its
great complexity has led pedologists to seek alternative ways to represent this
variability spatially (Burrough et al., 1994). Two general yet distinct approaches have
been offered to account for soil variation: discrete modeling of soil variation
(polygon-based) and continuous modeling of soil variation (pixel-based). While the first
approach partitions the soil into more or less discrete classes, the latter approach
treats the soil-landscape as a continuum.
Traditional soil classification uses a polygon-based soil map unit model that has
numerous drawbacks. As Hartemink et al. (2010) articulated, the maps produced by
traditional soil classification methodology are static, inflexible, inaccurate, undetailed
and difficult to integrate with grid-based digital soil sources. Moreover, and most
importantly, polygon-based models do not formally specify the uncertainty (Grunwald,
2006). Altogether, these drawbacks largely contributed to the decrease in funding for
pedological research in the late 1990s (Basher, 1997; Ryan et al., 2000). “The challenge
was to use actual knowledge about soil forming processes and to develop a spatially-
realistic, mathematical soil-landscape model useful for a variety of purposes beyond
taxonomic classification” (McSweeney et al., 1994). Consequently, soil scientists
inevitably shifted from qualitative subjective modeling of soil properties and classes to
quantitative objective modeling; “soil science under uncertainty” (Goovaerts, 2001).
The unifying modeling of soil spatial variation can be formalized by using the
regionalized variable theory with the following equation (after Burrough, 1986):
$$Z(x) = \mu(x) + \varepsilon'(x) + \varepsilon''(x) \tag{2-1}$$
where:
• $x$: location in one, two or three dimensions,
• $Z(x)$: the random variable $Z$ at location $x$,
• $\mu(x)$: deterministic structural component, trend (drift),
• $\varepsilon'(x)$: stochastic component, spatially dependent residual from $\mu(x)$ [the regionalized variable] but locally varying in both lateral and vertical direction,
• $\varepsilon''(x)$: spatially independent component, noise, unexplained variability.
Spatial variability in soil forms a spectrum of variation ranging from microscopic
to megascopic scale (Wright and Wilson, 1979) as a function of many possible factors,
including target area of extent, grain size, specific soil properties or processes, spatial
location and time (Lin et al., 2005). Altogether, these factors may form a trend at multiple
scales, and these trends may be depicted with a deterministic function ($\mu(x)$ in Equation
2-1). However, the processes responsible for soil variation are generally unknown, and
with current expertise soil variability is unlikely to be captured analytically at multiple
scales in either space or time (Heuvelink and Webster, 2001). Typically, the values of a
soil property from samples taken at close geographic spacing are similar, i.e., spatially
correlated (Oliver, 1987). This is the premise of the spatially dependent random
component ($\varepsilon'(x)$ in Equation 2-1). Semivariograms have been used to characterize this
stochastic structural component as a function of the separation distance between pairs of
points under the stationarity assumption. The spatially independent component of the
variation, noise, is the unexplainable variability ($\varepsilon''(x)$ in Equation 2-1) that is present
in any model and has zero mean and variance $\sigma^2$ (Webster, 2000).
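To make the three components of Equation 2-1 concrete, the following minimal sketch simulates them on a one-dimensional transect and computes an empirical semivariogram of the detrended values. Everything in it is an illustrative assumption rather than part of the reviewed studies: the linear trend, the exponential covariance, the parameter values, and the function names `simulate_transect` and `empirical_semivariogram` are hypothetical.

```python
import numpy as np

def simulate_transect(n=200, spacing=1.0, sill=1.0, corr_range=10.0,
                      noise_var=0.1, seed=0):
    """Simulate the three components of Z(x) on a 1-D transect (assumed setup)."""
    rng = np.random.default_rng(seed)
    x = np.arange(n) * spacing
    # mu(x): hypothetical linear trend (deterministic component)
    mu = 5.0 + 0.02 * x
    # eps'(x): zero-mean Gaussian field with an assumed exponential covariance
    lags = np.abs(x[:, None] - x[None, :])
    cov = sill * np.exp(-lags / corr_range)
    chol = np.linalg.cholesky(cov + 1e-10 * np.eye(n))
    eps_dep = chol @ rng.standard_normal(n)
    # eps''(x): spatially independent noise with zero mean and variance noise_var
    eps_ind = rng.normal(0.0, np.sqrt(noise_var), n)
    return x, mu, eps_dep, eps_ind

def empirical_semivariogram(x, values, max_lag, n_bins=10):
    """Average 0.5 * (z_i - z_j)^2 over all point pairs falling in each lag bin."""
    dist = np.abs(x[:, None] - x[None, :])
    half_sq = 0.5 * (values[:, None] - values[None, :]) ** 2
    upper = np.triu_indices(len(x), k=1)      # count each pair once
    dist, half_sq = dist[upper], half_sq[upper]
    edges = np.linspace(0.0, max_lag, n_bins + 1)
    centers, gamma = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (dist > lo) & (dist <= hi)
        if mask.any():
            centers.append(dist[mask].mean())
            gamma.append(half_sq[mask].mean())
    return np.array(centers), np.array(gamma)

x, mu, eps_dep, eps_ind = simulate_transect()
z = mu + eps_dep + eps_ind                    # Equation 2-1
h, gamma = empirical_semivariogram(x, z - mu, max_lag=30.0)
```

Under these assumptions the semivariogram of `z - mu` would be expected to start near the noise variance at short lags and level off near the sum of the noise variance and the sill, although any single simulated realization can deviate from that pattern.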
The soil factorial model, an empirical-deterministic model of soil formation
developed by V.V. Dokuchaev (Glinka, 1927) and formalized by Jenny (Jenny, 1941),
has been widely utilized to explore the deterministic part of the variation, whereas
regionalized variable theory (Matheron, 1971) has mainly enabled researchers to
characterize the stochastic, spatially dependent variation (Webster, 1994). While the
theory of soil forming factors has attracted researchers’ attention as a way to describe
quantitatively the relationship between soil and its forming factors, regionalized variable
theory has mainly flourished due to its ability to predict the values of various soil
properties at unknown locations. Many statistical and purely geostatistical methods
used since the 1960s have been collectively categorized under the new branch of soil
science called “pedometrics”. Pedometrics can be defined as the application of
probability and statistical methods to soil science (Webster, 1994) or the application of
mathematics and statistics to study the distribution and genesis of soil (McBratney et al.,
2000). Deterministic and stochastic models of soil variation have been systematically
studied in the discipline of pedometrics since the 1990s.
There are two main generic approaches that are representative of these two
distinct model paradigms that address soil variation and predict soil properties and
classes at an unvisited location: (1) non-geostatistical techniques (e.g., simple and
multiple linear regression (MLR), generalized additive model (GAM), regression tree
(RT)) and (2) geostatistical techniques (e.g., ordinary kriging (OK), simple kriging (SK),
universal kriging (UK)) (Burgess and Webster, 1980; Moore et al., 1993; Odeh et al.,
1994; McBratney et al., 2000). Non-geostatistical techniques have been used to quantify
the relationship between soil properties and state factors accounting for the
deterministic portion of the total variation “µ(x)” (Figure 2-1). Geostatistical methods,
conversely, have been used to quantify changes in soil properties over distance
accounting for the spatially dependent stochastic portion of the total variation “ɛ’(x)”
(Figure 2-1). In the mid-1990s, these two generic approaches were combined to create
hybrid techniques (i.e., non-stationary geostatistical methods) (Wackernagel, 2003).
While the non-geostatistical part detects the deterministic part of the total variation, the
geostatistical part quantifies the spatially dependent stochastic part of the total variation.
A number of hybrid techniques have been developed over the years, including
universal kriging or kriging with internal drift (UK) (Webster and Burgess, 1980) as well
as kriging with external drift (KED) (Goovaerts, 1997). Odeh et al. (1994, 1995) coined
the term RK and introduced RK type A, B and later RK type C.
RK type A, which is called “kriging combined with regression” (Knotters et al.,
1995), involves kriging of the predicted values. In other words, multivariate
regression is first applied to predict the value at unvisited sites. This is followed by the
kriging of the regressed values.
RK type B, which is also called “kriging with guess field” (Ahmed and De Marsily,
1987), kriges both the regressed values and the residuals arising from the regression
and sums the two kriged surfaces to create the final map.
RK type C (Odeh et al., 1995), which is called “kriging after detrending”
(Goovaerts, 1999), is defined as the sum of the regressed values and the kriged residuals
from the regression. RK type C differs from type B in that only the residuals are kriged
to obtain the final prediction. An extensive review of the hybrid kriging
techniques, including a full discussion of RK, can be found elsewhere (Knotters et al., 2010).
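The three variants described above can be written compactly as follows. This is a sketch in assumed notation that does not come from the cited papers: $\hat{m}(x)$ denotes the regression prediction at location $x$, $e(x_i) = z(x_i) - \hat{m}(x_i)$ is the regression residual at sampled location $x_i$, $x_0$ is an unvisited location, and $\lambda_i$ and $\omega_i$ are the kriging weights for the regressed values and the residuals, respectively.

$$
\begin{aligned}
\text{Type A:}\quad \hat{z}_A(x_0) &= \sum_{i=1}^{n} \lambda_i\, \hat{m}(x_i) \\
\text{Type B:}\quad \hat{z}_B(x_0) &= \sum_{i=1}^{n} \lambda_i\, \hat{m}(x_i) + \sum_{i=1}^{n} \omega_i\, e(x_i) \\
\text{Type C:}\quad \hat{z}_C(x_0) &= \hat{m}(x_0) + \sum_{i=1}^{n} \omega_i\, e(x_i)
\end{aligned}
$$

Written this way, types B and C share the kriged residual term and differ only in whether the regression part is itself kriged or evaluated directly at $x_0$.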
RK type C is one of the most widely used hybrid spatial interpolation method
used in soil science to predict soil properties (Minasny and McBratney, 2007). The steps
to execute RK are provided (Figure 2-2).
First, soil and ancillary environmental data are collected for a given study region.
The next step is to compute a regression between the state factors and the target soil
property. Then the trend model, identified by the regression equation, is subtracted from
Z(x) and residuals are quantified. The residuals from the trend are treated as spatially
correlated stationary random variables. Finally, the regression estimates and the rigged
residual values are summed together to create the final map. McBratney et al. (2003)
called this approach SCORP kriging as a default modeling technique in the thoughtful
review article of DSM. Hengl et al. (2004) presented the general framework for RK. Lark
et al. (2006) articulated the problem with estimating the residual variogram,
concluded that RK is statistically suboptimal, and offered the restricted maximum
likelihood-empirical best linear unbiased predictor (REML-EBLUP) method as
a mathematically unbiased alternative to RK. However, Minasny and McBratney (2007)
compared the prediction accuracy of RK and REML-EBLUP with different datasets and
found the performance of both techniques to be quite similar across the case studies.
The authors concluded that although RK is biased from a mathematical point of view, it
performs as well as its unbiased counterpart (REML-EBLUP). This finding has
contributed to the popularity of RK, which is much easier to implement than the
mathematically complex REML-EBLUP.
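The RK type C steps described above can be illustrated with a minimal Python sketch. It is not the implementation used in any of the reviewed studies: the trend is fitted by ordinary least squares, the residuals are interpolated by ordinary kriging with an exponential variogram whose parameters are fixed by hand rather than fitted, and all function names, parameter values and the synthetic data are assumptions made for illustration only.

```python
import numpy as np

def exponential_variogram(h, nugget, partial_sill, practical_range):
    """Exponential semivariogram (assumed parameterisation); gamma(0) = 0."""
    h = np.asarray(h, dtype=float)
    gamma = nugget + partial_sill * (1.0 - np.exp(-3.0 * h / practical_range))
    return np.where(h > 0.0, gamma, 0.0)

def ordinary_kriging(coords, values, targets, variogram):
    """Ordinary kriging of `values` observed at `coords` onto `targets`."""
    n = len(coords)
    d_obs = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    # Kriging system; the extra row/column is the Lagrange multiplier
    # that forces the weights to sum to one.
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = variogram(d_obs)
    A[n, n] = 0.0
    d_tgt = np.linalg.norm(targets[:, None, :] - coords[None, :, :], axis=2)
    b = np.ones((n + 1, len(targets)))
    b[:n, :] = variogram(d_tgt).T
    weights = np.linalg.solve(A, b)[:n, :]
    return weights.T @ values

def regression_kriging(coords, covariates, z, target_coords,
                       target_covariates, variogram):
    """RK type C: regression trend plus kriged regression residuals."""
    # Step 1: regress the target soil property on the covariates.
    X = np.column_stack([np.ones(len(z)), covariates])
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    # Step 2: subtract the trend to obtain the residuals.
    residuals = z - X @ beta
    # Step 3: krige the residuals onto the target locations.
    kriged = ordinary_kriging(coords, residuals, target_coords, variogram)
    # Step 4: sum the regression estimates and the kriged residuals.
    Xt = np.column_stack([np.ones(len(target_coords)), target_covariates])
    return Xt @ beta + kriged

# Synthetic, purely illustrative data (not taken from any reviewed study).
rng = np.random.default_rng(42)
coords = rng.uniform(0.0, 1.0, size=(30, 2))
elevation = rng.normal(size=30)  # hypothetical covariate
z = 2.0 + 3.0 * elevation + 0.5 * np.sin(6.0 * coords[:, 0])

def variogram(h):
    # Hand-picked parameters; in practice they are fitted to the residuals.
    return exponential_variogram(h, nugget=0.0, partial_sill=1.0,
                                 practical_range=0.3)

z_hat = regression_kriging(coords, elevation, z, coords, elevation, variogram)
```

Because kriging is an exact interpolator, predicting at the sampled locations in this sketch returns the observations themselves; in practice the prediction targets would be the nodes of the covariate grid.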
The hybrid method is reduced to ordinary kriging if no linear or non-linear
relationships are present between phenomena of interest and auxiliary variables. On the
other hand, if no autocorrelation in residuals is present then the hybrid method is
reduced to the (multiple linear) regression (Vanwalleghem et al., 2010). Multiple DSM
studies showed that RK outperformed geostatistical and non-geostatistical methods
(Bishop and McBratney, 2001; Carré and Girard, 2002; Odeh et al., 1994; Odeh et al.,
1995; Odeh and McBratney, 2000; Triantafilis et al., 2001; Rivero et al., 2007).
Several theoretical and applied aspects of RK have been discussed in the
literature. However, there is no systematic, extensive review of RK to predict soil
properties and classes. The objectives of the study are as follows:
1. Review the studies that utilized RK to predict soil properties and classes at multiple spatio-temporal scales ranging from field to regional scale.
2. Identify the strengths and weaknesses of RK studies.
3. Quantify the factors affecting the accuracy of RK.
4. Document the development of RK through the last decade and characterize future trends.
2.2 Material and Methods
Three high-quality international soil science journals were selected: “Geoderma”,
“Soil Science Society of America Journal” and “Catena”. The time period of 2004 to 2014 was
identified to review DSMM studies that utilized RK as a spatial interpolation method.
“Regression kriging and soil properties or classes” were defined as the key words to
gather articles that employed RK as an interpolation method to predict soil properties
and classes. All retrieved articles were thoroughly analyzed. As a result of conscientious
examination, a total of 40 articles were gathered to be included in this review. Multiple
criteria were used to assess RK studies, including factors that affect the accuracy of
prediction performance, such as sample size, area of extent, soil properties and
classes, sample depth, regression type, and auxiliary variables. Each soil model was
evaluated as a separate case to quantify the results accurately and reliably. Hence,
142 cases from the 40 articles are included in this extensive review.
The following criteria were considered in order to characterize the individual RK
studies:
i) Data collection process:
• Geographic location of soil attributes or classes
• Area extent
• Grain size
• Target soil properties or classes
• Sampling design
• Sample size and density
• Sample depth(s)
• Auxiliary environmental variables which are explicitly specified SCORP factors: soils (S), climate (C), organism/vegetation (O), relief (R), and parent material (P)
ii) Model development process:
• Transformation of target soil properties
• Factor analysis of SCORP factors
• Regression type which is used to explore the relationship between target soil property or class and SCORP factors
• Variogram model
• Nugget to sill ratio (N:S)
• Spatial autocorrelation range
iii) Validation process:
• Training (T) and validation (V) size of the target soil property/class
• Coefficient of variation (CV%) of training dataset
• Root mean squared deviation (RMSD) (kg C m-2)
• Coefficient of determination (R2) of final prediction
2.3 Results
2.3.1 Spatial Scale
2.3.1.1 Geographic region
Out of 142 cases, RK studies were conducted in the following geographic
regions: 30% (N=42) in Europe, 24% (N=34) in the U.S., 20% (N=29) in Australia, 15%
(N=22) in China and 11% (N=15) in the remaining regions (Africa, Canada, India,
Middle East, and South America). The increase in the total number of RK studies is
shown in Figure 2-3. Over time the locations for RK studies have become more diverse
around the world, revealing that the approach is well recognized among DSMM
practitioners. While Europe, the U.S. and Australia, with their major DSM research
groups, are the most prominent geographic locations for RK, training and DSM workshops have
contributed to the spread of RK studies around the globe. With the recognition of
2015 as the International Year of Soils, DSMM studies utilizing RK appear to be
on the rise.
2.3.1.2 Area extent
The 142 cases within the 40 different articles were categorized under four spatial
extents: field (smaller than 0.25 km2), local (between 0.25 and 10^4 km2), regional
(between 10^4 and 10^7 km2) and global (wider than 10^7 km2), according to the classification
proposed by Thompson et al. (2012).
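The extent classes can be expressed as a small lookup function. The sketch below reads the thresholds as 0.25, 10^4 and 10^7 km2; the function name and the treatment of values falling exactly on a boundary (assigned to the coarser class) are assumptions, since the text does not specify them.

```python
def classify_extent(area_km2):
    """Assign an area (km2) to one of the four spatial extent classes."""
    if area_km2 <= 0:
        raise ValueError("area must be positive")
    if area_km2 < 0.25:
        return "field"
    if area_km2 < 1e4:
        return "local"
    if area_km2 < 1e7:
        return "regional"
    return "global"

# Areas quoted in this review, used here only as inputs.
for area in (0.042, 432, 6157, 4217241, 9600000):
    print(area, classify_extent(area))
```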
A total of 20.4% (N=29) of the cases covered a field extent, more than half
(58.5%; N=83) covered a local extent, 16.9% (N=24) covered a
regional extent, and none covered a global extent (Appendix). Although Hengl et
al. (2014) used RK to model and map some soil properties and classes at global extent,
that study was not among the reviewed studies.
The area of extent ranged from the finest, 0.042 km2 (Dlugoß et al.,
2010) and 0.06 km2 (Baxter and Oliver, 2005), through coarser extents such as 432 km2 (Rivero
et al., 2007) and 6,157 km2 (Shi et al., 2011), up to the coarsest, covering large soil
regions of millions of km2 (4,217,241 km2 by Lado et al. (2008) to 9,600,000
km2 by Li et al. (2013)). Out of 40 different studies 10% (N=4) conducted their studies
within more than one spatial coverage to investigate the effect of area extent for either
the same or different soil properties and classes (Baxter and Oliver, 2005; Simbahan et
al., 2006; Minasny and McBratney, 2007; Poggio et al., 2010). Minasny and McBratney
(2007) tested the prediction capability of the restricted maximum likelihood-empirical
best linear unbiased predictor (REML-EBLUP), RK and OK in four different areas.
Poggio et al. (2010) modeled the spatial uncertainty of interpolated values of available
water capacity (AWC) at three different nested extents: catchment, regional and
national. Simbahan et al. (2006) assessed soil organic carbon (SOC) stocks in three
large no-till fields.
2.3.1.3 Grain size
A total of 142 cases within the 40 reviewed articles had a grain size (i.e., spatial
resolution) ranging from 0.5 m up to 5,000 m. Out of 142 studies 38.7% (N=55) utilized
relatively finer resolutions ranging from 0.5 m to 25 m, 29.6% (N=42) employed 30 m
resolution, 16.2% (N=23) used relatively coarser resolution from 50 to 5,000 m and
15.5% (N=22) could not be grouped into any of the categories due to a lack of the
necessary information reported in the articles (Appendix). Grunwald (2009) observed a
similar trend in a comprehensive review of DSM studies. She found that as the spatial
domain size increases, the cell size also increases. Minasny et al. (2013) reported that
grid spacing of digital soil maps increases logarithmically with area of extent. Among the
reviewed studies, Chaplot et al. (2010) predicted the thickness of A horizon with the cell
size of 0.5 m in a 0.003 km2 area, Umali et al. (2012) predicted standard soil properties
with a 5 m resolution in a 0.056 km2 area, Li et al. (2013) explored the spatial variability
of soil organic matter (SOM) throughout China with 9,600,000 km2 area and a 1 km cell
size, and Lado et al. (2008) interpolated several heavy metal concentrations within a
large area (4,217,241 km2) with a 5 km spatial resolution. An appropriate spatial
resolution for a soil map therefore needs to be selected with care.
2.3.2 Target Soil Properties and Classes
Out of 142 examined cases, 38.7% (N=55) were concerned with SOM, soil
carbon, carbon stocks and carbon fractions to supply information dealing with global
environmental problems including climate change, desertification and soil security. For
example, Karunaratne et al. (2014) predicted fractions of SOC, namely resistant organic
carbon (ROC), humus organic carbon (HOC) and particulate organic carbon (POC) at a
100 cm depth at local scale. Vasques et al. (2010b) modeled and mapped the total soil
carbon stocks from 0 to 180 cm within a 3,580 km2 subtropical region. Quantifying the
spatial distribution of soil carbon across multiple scales is urgently needed because soil
carbon is the largest manageable carbon pool compared to carbon in the biosphere and
atmosphere (Lal et al., 2004). Accurately assessing the soil carbon within a study region
is profoundly important for better decision-making in sustainable development (Minasny
et al., 2013). Accordingly, the spatial distribution of soil carbon has been of great
interest as exemplified by the increasing number of publications in mapping soil carbon
stocks globally and nationally (Grunwald, 2009).
A total of 23.2% (N=33) of RK-DSMM studies addressed the issue of land
degradation and environmental concern by focusing on soil nutrients, mainly total
phosphorus (TP) (Roger et al., 2014), nitrogen (TN) (Baxter and Oliver, 2005; Rivero et
al., 2007), heavy metal concentration (Lado et al., 2008; Lin et al., 2011; Shi et al.,
2011), soil health, land degradation (Lamsal et al., 2006; Watt and Palmer, 2012) and
salinization (Douaoui et al., 2006) (Appendix). These environmentally-centered DSMM
research studies are responding to critical societal needs including environmental
quality assessment, soil degradation, and health.
A total of 11.3% (N=16) of the 142 observed cases predicted chemical soil
properties, such as pH (Hengl et al., 2004; Umali et al., 2012; Sun et al., 2012; Malone
et al., 2014) and electrical conductivity (EC) (Umali et al., 2012). 16.9% (N=24) of 142
cases predicted physical soil properties, such as depth to soil horizons (Chaplot et al.,
2010; Vanwalleghem et al., 2010) or clay, sand and silt (Minasny and McBratney, 2007;
Mora-Vallejo et al., 2008; Umali et al., 2012; de Carvalho Junior et al., 2014; Niang et
al., 2014).
Only 8.5% (N=12) of the RK studies investigated hydrological soil properties,
such as available water capacity (Poggio et al., 2010) and soil moisture (Herbst et al.,
2006; Takagi and Lin, 2012).
Surprisingly, out of 142 cases, only very few (1.4%; N=2) predicted soil
categorical data. Hengl et al. (2007b) predicted soil texture classes and soil groups in
Iran and (2004) the presence or absence of a plant species (Taxus baccata L.) in each
grid cell using logistic regression. These numbers are significantly lower compared with
previous review studies. McBratney et al. (2003) found that 30% of reviewed studies
predicted soil categories and Grunwald (2009) observed 15.6%. Hence, the significant
decrease in frequency of soil classes studied shows a lack of preference by the DSMM
community for using RK to predict soil classes. One of the reasons may be the
satisfactory prediction accuracy of an increasing number of disaggregation methods to
predict soil categorical information, such as PROPR (Digital Soil Property Mapping
using Soil Class Probability Raster) and DSMART (Disaggregation and harmonization
of Soil Map Units Through Resampled Classification Trees) (Odgers et al., 2014; 2015)
and machine learning methods, such as Classification and Regression Tree (CART)
and Random Forest (RF). Another reason may be a declining interest in mapping of
taxonomic classes since many soil class maps have been already produced in the past
especially by governmental organizations.
Out of 40 articles, 60% (N=24) predicted only one specific soil property/class
(e.g., Kuriakose et al., 2009; Poggio et al., 2010; Li, 2010; Zhang et al., 2012; Li et al.,
2013), whereas 40% (N=16) modeled and mapped more than one soil property (e.g.,
Lado et al., 2008; Vasques et al., 2010b; Shi et al., 2011; Sun et al., 2012; Umali et al.,
2012; Roger et al., 2014). For instance, Chai et al. (2008) compared the performance of
REML-EBLUP with that of RK to predict SOM in the presence of different external trends.
Lado et al. (2008) modeled and mapped the distribution of eight critical heavy metals
(arsenic, cadmium, chromium, copper, mercury, nickel, lead and zinc) in the topsoil
using 1,588 georeferenced samples from the Forum of European Geological Surveys
Geochemical database (26 European countries). Roger et al. (2014) predicted the
spatial distribution of soil P forms (total, organic, inorganic, and available P) at regional
scale.
2.3.3 Sampling
2.3.3.1 Sampling design
Out of 142 models, 21.1% (N=30) of them used regular grid sampling with a
sample spacing ranging from 2 m (Chaplot et al., 2010) to 10 km (Poggio and Gimona,
2014), while 15.5% (N=22) of all cases sampled based on stratified random sampling
design (e.g., Karunaratne et al., 2014; Vasques et al., 2010a), 13.4% (N=19) of all
cases used a random sampling design at field scale (e.g., Umali et al., 2012) and at
regional scale (e.g., Kumar et al., 2012), and 9.2% (N=13) of all cases employed a
purposive sampling scheme (e.g., Kuriakose et al., 2009; Takagi and Lin, 2012)
(Appendix). Lastly, the conditioned Latin Hypercube sampling (cLHS) design was
employed in only 2.8% (N=4) of the 142 studies (e.g., Levi and Rasmussen, 2014;
Minasny and McBratney, 2007). The cLHS may directly contribute to the performance of
prediction because it maximizes the efficiency of sampling by enabling users to
adequately characterize the variability in a target geographic region for a given target of
interest (Minasny and McBratney, 2006). The small scale variability may not be
captured if the sample spacing is larger than the effective range (McBratney, 1998).
Unfortunately, 38.0% (N=54) of the studies did not specify the sampling design. As the
area of extent increased from field to regional scale, the sampling design shifted from
model-based to design-based (probability-based) sampling. Furthermore, as spatial
coverage increased, the preference of regular grid sampling decreased as follows:
43.3% (N=13) of the cases at field scale, 40% (N=12) at local scale and 10% (N=3) at
regional scale.
2.3.3.2 Sample size-density
Sample density is a function of the area of extent and the sample size. Typically,
as the area of extent increases, the sampling density decreases drastically. For
instance, Watt and Palmer (2012) predicted C:N ratio with 0.0059 samples per km2 at
regional scale (1,949,359 km2), Sun et al. (2012) predicted clay, pH and SOC with
25.9 samples per km2 at local scale (38 km2), and Takagi and Lin (2012) predicted soil
moisture with 1,341.7 samples per km2 at field scale (0.079 km2). No consistent
sampling density or spatial resolution was evident in the reviewed studies. However, an
increase in sampling density is likely to increase the reliability of the prediction maps.
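The relationship between sample size, area of extent and sample density is simple arithmetic, sketched below. The function names are hypothetical, and the sample count printed at the end is only the value implied by the density and area quoted above, not a figure taken from the cited article.

```python
def sample_density(n_samples, area_km2):
    """Samples per km2 for a given sample size and area of extent."""
    if area_km2 <= 0:
        raise ValueError("area must be positive")
    return n_samples / area_km2

def implied_sample_count(density_per_km2, area_km2):
    """Sample size implied by a reported density and area of extent."""
    return density_per_km2 * area_km2

# Hypothetical example: 100 samples over 4 km2.
print(sample_density(100, 4.0))

# Count implied by the field-scale density and area quoted in the text.
print(round(implied_sample_count(1341.7, 0.079)))
```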
2.3.3.3 Sample depth(s)
Out of 40 reviewed articles, 82.5% (N=33) were conducted at a single depth interval,
with lower limits ranging from 10 to 100 cm: 0-10 cm (Umali et al., 2012; Watt and Palmer, 2012), 0-20
cm (Chai et al., 2008; Li, 2010; Shi et al., 2011; Roger et al., 2014), 0-25 cm (Zhang et
al., 2012), 0-30 cm (Baxter and Oliver, 2005; Lamsal et al., 2006; Mora-Vallejo et al.,
2008; Mishra et al., 2012; Karunaratne et al., 2014; Levi and Rasmussen, 2014), 0-100
cm (Chaplot et al., 2010; Kumar et al., 2012; Poggio and Gimona, 2014). Only 17.5%
(N=7) of the articles examined multiple depths (Vasques et al., 2010a; Takagi and Lin,
2012; Sun et al., 2012; de Carvalho Junior et al., 2014; Malone et al., 2014). For
example, Malone et al. (2014) predicted pH to a depth of 200 cm at 6 different horizons
using the mass preserving spline. Out of 142 cases, 15% (N=22) did not report the
horizon depth(s) (e.g., Hengl et al., 2004; Douaoui et al., 2006; Hengl et al., 2007a;
Minasny and McBratney, 2007; Lado et al., 2008; Poggio et al., 2010; Li et al., 2013)
(Appendix). In summary, the majority of DSMM-RK studies only focused on mapping of
the topsoil. This is rather reductionistic because biogeochemical processes occur within
the whole soil profile. The critical zone of the Earth surface extends far beyond the top
0-30 cm and for a more complete understanding of pedogenesis subsurface horizons
would also need to be considered.
2.3.4 SCORP Factors
Different combinations of the SCORP factors were explicitly used to predict soil
properties or classes. Out of 142 different cases the frequency of SCORP factors to
predict soil properties and classes are as follows: 43.7% (N=62) incorporated the S
factor, 34.5% (N=49) the C factor, 64.1% (N=91) the O factor, 86.6 % (N=123) the R
factor and 35.2 % (N=50) the P factor. The findings are in line with the review of DSM
studies by McBratney et al. (2003). They found the frequency of SCORP factors was as
follows: S (35%), C (5%), O (25%), R (80%) and P (25%). The increase of usage of O
and C factors may be a function of the increase in carbon modeling studies and the
increasing importance of O and C factor for modeling and mapping of carbon and/or an
increase in readily available data for digital soil mapping practitioners. For example,
Hijmans et al. (2005) produced global, very high resolution (1 km) raster layers of
mean annual temperature, precipitation and bioclimatic variables
for public use. DEMs derived from the Shuttle
Radar Topography Mission (SRTM) have been extensively used at differing
resolutions (30 to 1000 m) as the first choice for pedometricians desiring to derive
primary and secondary terrain attributes to model soil properties and classes
(McBratney et al., 2003). The availability of high resolution DEMs (10, 30 and 90 m) at
global scales may have contributed to the high percentage of R factors used in DSMM,
when compared to covariates for other SCOP factors. Grunwald (2009) found that
29.3% of all investigated DSMM studies (N=75) documented R factor as covariates.
Different combinations of SCORP factors have been utilized for modeling different
soil properties. Some authors utilized a combination of ancillary variables (e.g.,
Vanwalleghem et al. (2010) used S and R factors and Simbahan et al. (2006) used S, O
and R factors), while others included all SCORP factors (e.g., Vasques et al. (2010b)
predicted fractions of soil carbon). The impact of different sets of auxiliary variables
on prediction accuracy was addressed by Li (2010), who examined the effect of
topography, organism, climate and parent material on the accuracy of soil predictions.
Surprisingly, adding environmental factors deteriorated the accuracy of prediction due to
an inappropriate spatial scale of environmental variables and the total number of soil
observations. Zhang et al. (2012) investigated whether the inclusion of categorical
variables improves the accuracy of SOM predictions. They concluded that the prediction
accuracy was improved with the inclusion of soil genetic types.
Rivero et al. (2007) conducted an investigation of auxiliary variables obtained at
different grain size to find whether or not a finer resolution of environmental variables
increases the accuracy in modeling. In that study, they incorporated a number of indices
obtained from 30 m spatial resolution Landsat 7 Enhanced Thematic Mapper Plus (ETM+) and
the same number of indices from the 15 m spatial resolution Advanced Spaceborne
Thermal Emission and Reflection Radiometer (ASTER). The RK model with ASTER
derived indices yielded better prediction accuracy in terms of ME and RMSD than the
RK model with ETM+. However, there is no explicit guidance on the relationship between
the spatial scale of remote sensing derived environmental variables and prediction
efficiency, and finer resolution images may not always increase the performance of soil predictions (Kim et
al., 2014; Hong, 2011). At resolutions coarser than 40 m, erratic behavior of
terrain variables and certain landscape attributes becomes apparent, leading to a
loss of predictive capability (McKenzie and Ryan, 1999; Gessler et al., 2000).
2.3.5 Preprocessing
2.3.5.1 Logarithmic transformation
Parametric statistical methods such as Generalized Linear Model (GLM) and
Stepwise Multiple Linear Regression (SMLR) assume the Gaussian distribution of the
dependent variables; however, the distribution of soil properties generally shows a
skewed (right or left) distribution. To overcome this issue, 62.7% (N=89) of the
142 studies explicitly specified the use of a logarithmic transformation to reveal the deterministic
part of the total variation while ensuring the normality of measured data when using
parametric methods.
With non-parametric statistical methods, there is no requirement to transform the
target variable (Lamsal et al., 2006; Kumar et al., 2012; Li et al., 2013; Poggio
and Gimona, 2014; Niang et al., 2014). Webster and Oliver (2007) noted that
transformation can increase model complexity and that converting transformed output
back to original units can be problematic, so careful examination is needed.
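One reason back-transformation can be problematic is illustrated by the sketch below, which uses synthetic lognormal data (an assumption, not data from the reviewed studies). Exponentiating the mean of the log-transformed values returns the geometric mean, which underestimates the arithmetic mean; the half-variance correction shown here is valid only if the data are truly lognormal.

```python
import numpy as np

# Synthetic right-skewed "soil property" values (illustrative only).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=1.0, sigma=0.8, size=10_000)

log_x = np.log(x)

# Naive back-transform: exp of the mean log value (the geometric mean).
naive = np.exp(log_x.mean())

# Lognormal correction: add half of the log-scale variance before exp.
corrected = np.exp(log_x.mean() + log_x.var(ddof=1) / 2.0)

arithmetic = x.mean()
print(naive, corrected, arithmetic)
```

The naive value falls below the arithmetic mean of the original data, while the corrected value lies close to it under the lognormal assumption.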
2.3.5.2 Factor analysis
Of the 142 reviewed studies, 33.8% (N=48) explicitly addressed the multicollinearity
issue arising from the correlation between certain original SCORP factors by
applying principal component analysis (PCA; Wallis,
1965) before the statistical analysis. With the advance of
Geographic Information Systems (GIS), Global Positioning System (GPS), and remote
and proximal sensing technologies, exhaustive sets of environmental variables may be
compiled to gather a spectrum of variables that represent the soil-forming environment.
Selecting auxiliary variables based on the
researchers’ domain knowledge of soil-environment processes is now a common
practice in soil-landscape modeling, but it can lead to biased and suboptimal
model performance (Grunwald, 2009). Systematic selection of covariates may be
addressed by modern statistical methods such as Boruta (Xiong et al., 2014a). PCA has
been used extensively in a number of disciplines for a variety of reasons, including
its simplicity. Investigation of modern statistical algorithms as an alternative to PCA may
yield improvements in prediction efficiency.
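How PCA removes multicollinearity among covariates can be sketched with a singular value decomposition of the standardized covariate matrix. The function name, the synthetic covariates and the choice to standardize before decomposition are assumptions made for illustration; the reviewed studies may have configured PCA differently.

```python
import numpy as np

def pca_scores(covariates, n_components):
    """Return uncorrelated component scores and explained variance ratios."""
    X = np.asarray(covariates, dtype=float)
    # Standardize so that covariates with large units do not dominate.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    _, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    scores = Xs @ Vt[:n_components].T
    explained = s**2 / np.sum(s**2)
    return scores, explained[:n_components]

# Three synthetic covariates; the second is strongly correlated with the first.
rng = np.random.default_rng(1)
slope = rng.normal(size=200)
wetness = -0.9 * slope + 0.1 * rng.normal(size=200)
ndvi = rng.normal(size=200)
covariates = np.column_stack([slope, wetness, ndvi])

scores, explained = pca_scores(covariates, n_components=2)
print(np.corrcoef(scores, rowvar=False).round(6))
print(explained)
```

The component scores, rather than the original correlated covariates, would then enter the regression part of RK.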
2.3.6 Regression Type to Quantify Deterministic Variation
62.7% (N=89) of 142 reviewed studies utilized SMLR, 8.5% (N=12) utilized
REML-EBLUP, and 28.9% (N=41) used one of the following: Logistic Regression (LR),
Generalized Linear Model (GLM), Classification and Regression Tree (CART),
Generalized Additive Model (GAM), and Geographically Weighted Regression (GWR)
(Appendix). Thus, more than half of the studies followed the general framework
presented by Hengl et al. (2004). A number of studies compared the accuracy
of RK to other pure geostatistical or hybrid methods. For instance, Levi and
Rasmussen (2014) compared OK and RK; Herbst et al. (2006) compared kriging with
external drift (KED), OK and RK; Li (2010) compared OK, RK and universal kriging
(UK); Baxter and Oliver (2005) compared OK, RK and cokriging (COK). On the other
hand, numerous studies utilized different advanced statistical data mining algorithms to
determine trends (drift) between soil properties-classes and SCORP factors. For
example, Kumar et al. (2012) incorporated geographically weighted regression (GWR),
Malone et al. (2014) employed CUBIST, Lin et al. (2011) used logistic regression (LR)
and Lamsal et al. (2006) utilized CART for the regression portion of the RK. Additionally,
these studies compared the final accuracy of these novel hybrid methods with global RK
which used SMLR for the regression part of the RK. Moreover, some authors presented
novel approaches to modifying the kriging part of the RK. For instance, Leopold et al.
(2006) applied block kriging (BK) to krige the residuals from the regression part, and Sun et al.
(2012) demonstrated local RK as an enhanced version of RK.
2.3.7 Variogram
2.3.7.1 Model type
In order to reveal the spatial correlation present in soil properties, detrended data
from the separated residuals of the regression part in RK were interpolated using mainly
two types of variogram models: spherical and exponential. Of the 142 cases, 48.6%
(N=69) utilized exponential and 28.2% (N=40) utilized spherical variograms.
This finding is in line with Minasny and McBratney (2005), who noted that exponential
models represent most soil properties and are stable when nonlinear least squares fitting
is applied. Also, the Gaussian semivariogram model is generally unrealistic and leads to
unstable kriging systems and artifacts in the estimated maps (Wackernagel, 2003).
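For reference, the two dominant model types can be written as short functions. This is a sketch under stated assumptions: the sill is taken as nugget plus partial sill, and the exponential model uses the practical-range convention (about 95% of the partial sill is reached at the range), which is one of several parameterisations in use.

```python
import numpy as np

def spherical(h, nugget, sill, a):
    """Spherical semivariogram; reaches the sill exactly at range a."""
    h = np.asarray(h, dtype=float)
    partial = sill - nugget
    rising = nugget + partial * (1.5 * h / a - 0.5 * (h / a) ** 3)
    gamma = np.where(h < a, rising, sill)
    return np.where(h > 0.0, gamma, 0.0)

def exponential(h, nugget, sill, a):
    """Exponential semivariogram; approaches the sill asymptotically."""
    h = np.asarray(h, dtype=float)
    partial = sill - nugget
    gamma = nugget + partial * (1.0 - np.exp(-3.0 * h / a))
    return np.where(h > 0.0, gamma, 0.0)

# Hypothetical parameters: nugget 0.2, sill 1.0, range 10 distance units.
lags = np.array([0.0, 2.5, 5.0, 10.0, 20.0])
print(spherical(lags, 0.2, 1.0, 10.0))
print(exponential(lags, 0.2, 1.0, 10.0))
```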
2.3.7.2 N:S ratio
While the nugget (N) may be interpreted as the signature of the variability from
uncorrelated stochastic processes or microscale processes, the sill (S) is the sum of the
nugget and the partial sill, and represents the total variation (Oliver and Webster,
2014). The N:S ratio has been used to quantify the strength of spatial structure or the
unexplainable portion of short-range variability that is not quantified by a variogram (Zhu
and Lin, 2010). An N:S ratio of 0.5 signifies that 50% of the variation is unexplained,
spatially independent, stochastic variation. If the N:S ratio is less than or equal to 25%, the
variable is considered strongly (S) spatially dependent; between 25 and 75%, moderately (M)
spatially dependent; and greater than 75%, weakly (W) spatially dependent
(Cambardella et al., 1994). However, it should be noted that the cut-off values are
arbitrary, and the 25% and 75% N:S thresholds have no statistical basis.
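The classification of Cambardella et al. (1994) described above can be written as a short function. The function name and the input checks are hypothetical; the class boundaries follow the text, with a ratio of exactly 75% falling in the moderate class.

```python
def spatial_dependence_class(nugget, sill):
    """Classify spatial dependence from the nugget-to-sill (N:S) ratio."""
    if sill <= 0 or nugget < 0 or nugget > sill:
        raise ValueError("expected 0 <= nugget <= sill and sill > 0")
    ratio = nugget / sill
    if ratio <= 0.25:
        return "S"  # strongly spatially dependent
    if ratio <= 0.75:
        return "M"  # moderately spatially dependent
    return "W"      # weakly spatially dependent

# Hypothetical variogram parameters.
print(spatial_dependence_class(0.1, 1.0))
print(spatial_dependence_class(0.5, 1.0))
print(spatial_dependence_class(0.9, 1.0))
```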
Out of 142 cases, 38.0% (N=54) are moderately spatially dependent, 21.1%
(N=30) are strongly spatially dependent and 9.2% (N=13) are weakly spatially
dependent. The rest of the reviewed cases did not explicitly or implicitly specify the N:S
ratio. The N:S ratio can be a significant signal in deciding which spatial interpolation method
should be used. For example, Kravchenko (2003) observed that ordinary kriging yielded more
accurate predictions of soil properties with an N:S ratio less than 0.1.
2.3.7.3 Range
The range is commonly the most important semivariogram parameter with regard to
the spacing between sample locations (Mulla and McBratney, 2001). At separation
distances greater than the range, sampled points are not spatially correlated; this has
great implications for sampling design. Thus, to create an effective variogram,
the sample spacing should not exceed the range of the semivariogram.
Additionally, sample spacing should be within ¼ to ½ of the range (Flatman and
Yfantis, 1984).
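The spacing guideline can be turned into a quick check when planning a survey from a pre-existing range estimate. The function names and the 200 m range used below are hypothetical.

```python
def recommended_spacing(range_m):
    """Lower and upper sample spacing (1/4 to 1/2 of the variogram range)."""
    if range_m <= 0:
        raise ValueError("range must be positive")
    return 0.25 * range_m, 0.5 * range_m

def spacing_within_guideline(spacing_m, range_m):
    """True if the spacing lies between 1/4 and 1/2 of the range."""
    lower, upper = recommended_spacing(range_m)
    return lower <= spacing_m <= upper

# Hypothetical autocorrelation range of 200 m.
print(recommended_spacing(200.0))
print(spacing_within_guideline(80.0, 200.0))
print(spacing_within_guideline(150.0, 200.0))
```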
To reveal the general trend for spatial range in the reviewed studies, the area of
extent was categorized into three nested classes (field, local and regional), and
soil attributes were grouped into five main groups for investigation. The
average autocorrelation range values for the reviewed studies are given in Table 2-1. A
statistical investigation of the variogram range did not
yield reliable results due to the unavailability of range values in many reviewed studies and
the large amount of variability in modeled target soil attributes. Out of 142 cases, only
57.4% (N=81) reported their spatial range. In these cases, as the area of extent
increases, the spatial range increases drastically as a general trend regardless of which
soil properties were modeled; however, the rate of increase does appear to depend
on the soil properties.
For the phenomena of interest, pre-existing semivariogram range information is
useful in determining where useful information should be obtained, whether in the area
of interest, in sites nearby or within a site located in the same region. The range values
reported in this review may allow a researcher to easily formulate an initial hypothesis
for range at multiple areas of extent, soil properties and soil classes.
2.3.8 Validation
Mainly, jack-knife and cross-validation procedures were preferred to test the
performance of the DSMM studies. Out of 142 studies, 64.8% (N=92) split field
observation data randomly to create separate training and validation datasets. 31.0%
(N=44) of the reviewed studies used cross-validation by either leaving one out or by the
k-fold method. Out of 142 reviewed studies, only 4.2% (N=6) did not use any validation
procedures. The result is significantly different compared to other review studies.
Grunwald (2009) found that out of 90 investigated studies 21.1% used cross-validation,
46.7% used validation and 35.6% did not use any validation procedures. The large
increase in the use of validation procedures shows that the importance of quantifying
uncertainty is now established among DSMM practitioners.
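The two families of validation procedures can be sketched as index generators. The function names, the 70/30 split and the five folds are illustrative assumptions; leave-one-out cross-validation corresponds to setting the number of folds equal to the number of observations.

```python
import numpy as np

def holdout_split(n, train_fraction, rng):
    """Random split of n observations into training and validation indices."""
    idx = rng.permutation(n)
    n_train = int(round(train_fraction * n))
    return idx[:n_train], idx[n_train:]

def k_fold_indices(n, k, rng):
    """Yield (training, validation) index arrays for k-fold cross-validation."""
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, folds[i]

rng = np.random.default_rng(7)

# Hypothetical dataset of 100 observations, 70% used for training.
train_idx, val_idx = holdout_split(100, 0.7, rng)
print(len(train_idx), len(val_idx))

# 5-fold cross-validation on the same number of observations.
folds = list(k_fold_indices(100, 5, rng))
print([len(val) for _, val in folds])
```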
Furthermore, the measures used to evaluate prediction uncertainty in the various spatial
interpolation methods were assessed. Of the 40 articles, 65.0% (N=26) utilized more than one of the following methods:
Mean Error (ME), Mean Absolute Error (MAE), Root Mean Square Deviation (RMSD),
Mean Squared Deviation Ratio (MSDR), Normalized Root Mean Square Deviation
(NRMSD), Residual Prediction Deviation (RPD). 30.0% (N=12) employed only one of
the above methods, and 5.0% (N=2) did not perform any uncertainty analysis.
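Several of these measures can be computed directly from paired observed and predicted values, as sketched below. The sign convention for ME (predicted minus observed) and the definition of RPD as the standard deviation of the observations divided by the RMSD are assumptions, since conventions vary between studies; MSDR is omitted because it also requires the kriging variance at each validation point.

```python
import numpy as np

def validation_metrics(observed, predicted):
    """ME, MAE, RMSD and RPD from paired observed and predicted values."""
    obs = np.asarray(observed, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    err = pred - obs
    rmsd = np.sqrt(np.mean(err**2))
    return {
        "ME": np.mean(err),
        "MAE": np.mean(np.abs(err)),
        "RMSD": rmsd,
        "RPD": np.std(obs, ddof=1) / rmsd,
    }

# Hypothetical validation set of four observations.
metrics = validation_metrics([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 4.0])
print(metrics)
```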
64.8% (N=92) of the reviewed models assessed the accuracy of predicted soil
properties and classes with the jack-knife validation procedure, in which soil samples
were divided into two sets: training (T) and validation (V). However, there is no
standard ratio to divide the original observed values. The share of the training
dataset varied from 35% to 90%, while the share of the validation dataset varied from
10% to 65% of the total number of samples. The split depended on both the statistical methods used
and the minimum sample requirement for a reliable variogram. Thus, careful
consideration should be given when splitting the original soil dataset into T and V datasets,
especially if the sample size is low, because the means of the training and validation datasets
may differ significantly. The minimum cut-off value for number of observations
necessary for accurate interpolation is dependent on the spatial characteristic of the
phenomena of interest including the sample distribution in a geographic space, the
strength of spatial dependence and the relationship with environmental factors.
Generally, for an isotropic variogram, 100 samples are needed
at a minimum, 150 samples for a satisfactory result and 225 for a reliable result
(Webster and Oliver, 1992). Also, previous studies showed that REML-EBLUP may be a
useful technique when the sample size is smaller than 100. For example, Chai et al.
(2008) used REML-EBLUP with 70 (V) and 131 (T) sites. In spite of its drawbacks, the
REML method of estimating variogram parameters may still have a valuable role to
play in pedometrics when practitioners have fewer than 100 data points (Kerry and Oliver,
2007).
2.4 Discussion and Recommendations
2.4.1 Factors Affecting Performance of RK
To compare the performance of different RK models across the variety of scales,
regions, soil properties and classes, the following criteria are considered essential: i)
landscape heterogeneity, ii) sampling design, iii) sample size, iv) sample density or
distribution of samples, v) strength of the correlation between target soil properties and
SCORP factors (R2), and vi) nugget to sill ratio as a measure of the strength of spatial dependence.
Unit/scale dependent measures were removed in order to compare the accuracy of RK
models in the reviewed studies. Thus, a normalization of RMSD was necessary. Li and
Heap (2011) proposed NRMSD as follows:
NRMSD = RMSD / mean(validation)     (2-2)
Since the basic statistical information needed to calculate NRMSD was not generally
available in the reviewed studies, a simplification was made: the mean of the validation
dataset was replaced by the mean of the observed dataset. This is called the
standardized NRMSD (Haberlandt, 2007). Even though a more reliable unit-free
metric, the ratio of performance to inter-quartile range (RPIQ) (Bellon-Maurel et al., 2010),
could be used to compare the factors affecting the prediction accuracy of the 142
models, almost none of the studies reported the information needed to calculate it
(i.e., the 25th and 75th percentiles and the standard deviation), nor did they provide the RPIQ itself.
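The two unit-free metrics discussed above can be illustrated with a minimal Python sketch. It assumes that NRMSD follows Equation 2-2 and that RPIQ is computed as the inter-quartile range of the observations divided by RMSD, which is the usual reading of Bellon-Maurel et al. (2010) but is not spelled out in this chapter. The function names and the validation values are hypothetical and serve only as an illustration.

```python
import numpy as np


def rmsd(observed, predicted):
    """Root mean squared deviation between observations and predictions."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))


def nrmsd(observed, predicted):
    """RMSD divided by the mean of the validation observations (Eq. 2-2)."""
    return rmsd(observed, predicted) / float(np.mean(observed))


def rpiq(observed, predicted):
    """Inter-quartile range of the observations divided by RMSD (assumed form)."""
    q25, q75 = np.percentile(observed, [25, 75])
    return float((q75 - q25) / rmsd(observed, predicted))


# Hypothetical validation set, for illustration only (not data from any study).
obs = np.array([10.0, 12.0, 14.0, 16.0, 18.0])
pred = np.array([11.0, 11.0, 15.0, 15.0, 19.0])

print(rmsd(obs, pred), nrmsd(obs, pred), rpiq(obs, pred))
```

Because RPIQ needs only the 25th and 75th percentiles of the observations and the RMSD, reporting these three numbers would be sufficient for later cross-study comparison.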
Of 142 reviewed cases, 48.6% (N=71) included essential information to calculate
the NRMSD; therefore, the effect of sample size, sample density, sample design, N:S
ratio on prediction performance of RK were evaluated for only 71 cases. A basic
statistical correlation analysis was performed in order to identify any possible trends
present between any of the five considered factors listed above and the NRMSDs.
However, the author found no discernible patterns between the above-mentioned
criteria and NRMSD. This finding may be interpreted in two ways. First, there may be
no pattern because all of the above factors collectively affect the
final accuracy of RK, or second, the result may reflect erratic behavior of the
simplified NRMSD.
To prevent this problem in the future, authors who use RK to model and map soil
properties and classes could report the following parameters: area
of extent, sample design, sample size (training and validation
separately), sample depth(s), SCORP factors, spatial resolution of the final map,
transformation methods, method of factorial analysis, regression type, coefficient of
determination of the deterministic function, variogram model type, spatial
autocorrelation range, N:S ratio, validation method, and RPIQ as a reliable metric.
Out of 142 cases, 47.2% (N=69) reported the coefficient of variation (CV%).
Upon further evaluation, a logarithmic transformation was applied to CV% and
NRMSD, which revealed a strong trend between the two: as the CV% of the measured
dataset increases, the accuracy of RK models decreases (Figure 2-4). Li and Heap (2011) found a
similar trend between CV% and RK type C.
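The log-log relationship between CV% and NRMSD described above can be sketched as follows. This is only an illustration of the transformation and the trend fit. The input values are synthetic and generated from an assumed power law, they are not the values from the reviewed studies, and the function name `loglog_trend` is hypothetical.

```python
import numpy as np


def loglog_trend(cv_percent, nrmsd_values):
    """Fit log(NRMSD) = intercept + slope * log(CV%) and return the fit and Pearson r."""
    x = np.log(np.asarray(cv_percent, dtype=float))
    y = np.log(np.asarray(nrmsd_values, dtype=float))
    slope, intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    return float(slope), float(intercept), float(r)


# Synthetic values following an assumed power law NRMSD = 0.02 * CV^0.8.
cv = np.array([10.0, 20.0, 40.0, 80.0, 160.0])
nrmsd_values = 0.02 * cv ** 0.8

slope, intercept, r = loglog_trend(cv, nrmsd_values)
print(slope, intercept, r)
```

A positive slope on the log-log scale corresponds to the pattern reported here: the normalized error grows as the variability of the measured dataset grows.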
2.4.2 Regression Kriging as a Default Soil Mapping Method
2.4.2.1 Satisfactory performance of regression kriging over its competitors
Since the emergence of RK in soil science, hybrid methods, especially RK, have
often yielded more accurate predictions than their competitors, namely geostatistical and
non-geostatistical methods. RK is frequently used and has been shown to be a robust,
practical hybrid method. Studies reviewed from the last decade have reported that
RK is superior to kriging with external drift (KED) (Herbst et al., 2006; Simbahan et al., 2006), cokriging (COK)
(Baxter and Oliver, 2005; Rivero et al., 2007; Niang et al., 2014), ordinary kriging (OK)
(Hengl et al., 2007a; Herbst et al., 2006; Hengl et al., 2004; Lado et al., 2008; Chai et
al., 2008; Kuriakose et al., 2009; Dlugoß et al., 2010b; Watt and Palmer, 2012; Zhang et
al., 2012), multiple linear regression (MLR) (Takagi and Lin, 2012; Mishra et al., 2010;
Umali et al., 2012; Chaplot et al., 2010), generalized linear models (GLM), CART (de
Carvalho Junior et al., 2014; Lamsal et al., 2006; Vasques et al., 2010b), random
forest (RF), and logistic regression (LR) (Lin et al., 2011). Theoretically, RK is a
combination of a linear or nonlinear regression and the kriged residuals (i.e., the
variation not explained by the regression); thus, the accuracy of RK is likely superior
to that of a pure geostatistical interpolation method and of simple or modern statistical
interpolation methods, provided that the residuals show weak, moderate
or strong spatial correlation throughout the area of interest.
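The combination of a regression trend and kriged residuals described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in any reviewed study: the trend is fitted by ordinary least squares, the residuals are interpolated by ordinary kriging with a global neighborhood, and the exponential variogram parameters (`nugget`, `psill`, `a_range`) are supplied by the user rather than fitted. All function names and the synthetic data are hypothetical.

```python
import numpy as np


def exp_variogram(h, nugget, psill, a_range):
    """Exponential semivariogram with gamma(0) = 0 and practical range a_range."""
    h = np.asarray(h, dtype=float)
    gamma = nugget + psill * (1.0 - np.exp(-3.0 * h / a_range))
    return np.where(h > 0.0, gamma, 0.0)


def ordinary_kriging(coords, values, targets, nugget, psill, a_range):
    """Ordinary kriging of `values` at `targets`, using all observations."""
    n = len(coords)
    d_obs = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    d_new = np.linalg.norm(coords[:, None, :] - targets[None, :, :], axis=2)
    # Kriging system in semivariogram form; the extra row and column hold the
    # Lagrange multiplier that forces the weights to sum to one.
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = exp_variogram(d_obs, nugget, psill, a_range)
    A[n, n] = 0.0
    b = np.ones((n + 1, len(targets)))
    b[:n, :] = exp_variogram(d_new, nugget, psill, a_range)
    weights = np.linalg.solve(A, b)[:n, :]
    return weights.T @ values


def regression_kriging(coords, X, z, targets, X_targets, nugget, psill, a_range):
    """Trend from OLS plus ordinary kriging of the OLS residuals."""
    design = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(design, z, rcond=None)
    residuals = z - design @ beta
    trend = np.column_stack([np.ones(len(X_targets)), X_targets]) @ beta
    kriged = ordinary_kriging(coords, residuals, targets, nugget, psill, a_range)
    return trend + kriged


# Synthetic example: 30 sampled sites, 2 covariates, 5 prediction locations.
gen = np.random.default_rng(0)
coords = gen.uniform(0.0, 100.0, size=(30, 2))
X = gen.normal(size=(30, 2))
z = 2.0 + X @ np.array([1.5, -0.5]) + gen.normal(scale=0.3, size=30)
targets = gen.uniform(0.0, 100.0, size=(5, 2))
X_targets = gen.normal(size=(5, 2))

z_hat = regression_kriging(coords, X, z, targets, X_targets,
                           nugget=0.0, psill=0.1, a_range=30.0)
print(z_hat)
```

If the residuals carry no spatial structure, the kriged term contributes little and the prediction reduces to the regression trend, which is consistent with the condition on residual spatial correlation stated above.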
2.4.2.2 Unsatisfactory performance of regression kriging over its competitors
Some authors, on the other hand, reported that RK did not improve prediction
performance when compared to OK (Li, 2010; Roger et al., 2014; Umali et al., 2012)
and MLR (Mora-Vallejo et al., 2008). Limitations adversely affecting the prediction
efficiency of RK are discussed by Hengl et al. (2007a). The literature
emphasizes that the main reasons for the unexpectedly poor results of RK relative to OK and MLR
are: i) a limited number of soil observations that cannot accurately reflect variability in the
area of interest, leading to an unreliable variogram; and ii) a poor relationship
between target soil properties and auxiliary variables due to the unintentional
exclusion of useful covariates, a lack of high-quality auxiliary variables, or an
improper choice of method that cannot capture the hierarchical, complex relationships
between soil and auxiliary variables.
First, the heterogeneity of the area of interest must be captured by a limited number
of samples, so sample spacing is also particularly important. As the variability in the soil
landscape increases through soil forming factors, the predictive capability of RK
often decreases because capturing the deterministic part of the variation across a
heterogeneous landscape becomes harder with a sparsely distributed, finite number of
soil samples. The strength of the fit between soil properties and environmental factors
may decrease as the heterogeneity increases. Also, an increase in the complexity of the
soil landscape may decrease the effective range, which directly controls the strength of the
spatial autocorrelation. For example, Zhu and Lin (2010) showed that while OK is
preferred in a gently rolling agricultural landscape, RK is more favorable in a
steep-sloped forested landscape. Thus, slope may affect the variation, and the spatial range
could be smaller than in a gently sloped area, limiting the ability of RK to
capture stochastic, spatially dependent variation.
The RK technique was purposely developed to use the exhaustively available
auxiliary variables as well as various data from different sources with differing spatial
scales. The introduction of environmental covariates into the model is thought to
improve the interpolation accuracy by reducing the number of observations needed for
target variable. Hence, the interpolation accuracy of RK depends on the selection of
high-quality, useful auxiliary variables that are representative of the main dynamics of
phenomena of interest. However, in most cases, researchers do not have a choice of
the auxiliary variables, and the impact of various spatial scales of auxiliary data on the
performance of RK prediction is still largely unknown. The relationship between a target
soil property and auxiliary variables, often represented by the coefficient of determination (R²)
from a linear or non-linear regression, is important in determining whether RK will be
more accurate than OK (Kravchenko and Robertson, 2007). Also, processes which
govern the total variation of target soil properties may not be fully represented due to a
lack of high-quality data, limited knowledge of the processes that
account for the variation, or an inappropriate spatial scale of the variables. In terms of the spatial
dependence of soil properties and classes, with the N:S ratio as an indicator of the strength of
spatial dependence, Kravchenko (2003) found that soil properties with N:S < 0.1 can be
mapped more accurately by ordinary kriging (OK) than those with N:S > 0.1.
Therefore, if the spatial autocorrelation is weak (N:S > 0.75), then a variogram cannot
substantially contribute to the performance of prediction. On the other hand, if a strong
spatial dependence (N:S < 0.25) is detected, then the accuracy of prediction may be
improved substantially since there is some explainable variation present in the
residuals.
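The N:S thresholds quoted above can be expressed as a small helper. This sketch assumes that the sill is the total sill (nugget plus partial sill) and that ratios between 0.25 and 0.75 are labeled "moderate", a class implied but not defined explicitly in this section. The function name is hypothetical.

```python
def spatial_dependence_class(nugget, sill):
    """Return the N:S ratio and a dependence class using the 0.25 / 0.75 thresholds."""
    if sill <= 0:
        raise ValueError("sill must be positive")
    ratio = nugget / sill
    if ratio < 0.25:
        label = "strong"
    elif ratio > 0.75:
        label = "weak"
    else:
        label = "moderate"  # assumed label for the intermediate range
    return ratio, label


# Hypothetical residual variogram parameters, for illustration only.
print(spatial_dependence_class(nugget=0.1, sill=1.0))
print(spatial_dependence_class(nugget=0.9, sill=1.0))
```

Applied to the variogram of the regression residuals, such a check gives a quick indication of whether kriging the residuals is likely to add much to the trend prediction.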
In addition, the residual spatial autocorrelation of a model is largely dependent on
the input variables used in the deterministic function. During the model development
process, the introduction of spatially correlated environmental predictors will largely
influence the residual spatial autocorrelation of the model. In other words, models
populated with all-relevant variables (i.e., some of them spatially autocorrelated) will
leave no residual spatial autocorrelation; hence, ordinary kriging of the residual for
those models will not substantially improve the RK prediction performance.
2.4.3 REML-EBLUP vs. RK
In the conceptual framework, soil properties and classes are treated as a
realization of spatially correlated random functions (Lark, 2012) which are a combination
of the deterministic variation and the spatially varying stochastic variation. The
empirical best linear unbiased predictor (E-BLUP) accounts for both sources of variation by
incorporating a trend function f(x) and a random variable ε(x) with a mean of zero and
spatial dependence as described by the variogram. The predicted value at an unsampled
location x0 is a combination of the trend prediction, f(x0), and a kriged estimate of ε(x0)
(Stacey et al., 2006; Stein, 1999).
$$Z(x_0) = f(x_0) + \varepsilon(x_0) \tag{2-3}$$
One way of obtaining E-BLUP is with the C method RK model (Odeh et al.,
1995). However, the major drawback of using RK is the requirement that the
deterministic model parameters and covariance function parameters must be estimated
separately. The parameters of the deterministic model are estimated with the chosen
regression type and used to compute the trend in the area of interest. The residuals
arising from this trend model are described by a variogram; typically, a variogram model
is fitted to the method-of-moments estimates of Matheron. Nevertheless, the trend
cannot be estimated without bias because the distribution of the random residual is
unknown at this stage, and the variogram of the residuals cannot be estimated without
bias when the trend is unknown (Cressie, 1993). In this process, the same regression
coefficients are used to compute the trend at all locations, even if the kriging estimation
is only done in a local neighborhood (Stacey et al., 2006). Hence, RK is mathematically
biased, which leaves room for improvement.
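The method-of-moments estimator mentioned above can be sketched as follows. The code computes Matheron's empirical semivariogram, i.e., half the mean squared difference between pairs of values in each distance class, and in RK it would be applied to the regression residuals. The function name, the binning convention (distances in the half-open interval from lower to upper edge) and the transect data are assumptions made for illustration.

```python
import numpy as np


def empirical_semivariogram(coords, values, bin_edges):
    """Matheron's method-of-moments semivariogram for the given distance bins."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    i, j = np.triu_indices(len(values), k=1)        # all unique pairs
    h = np.linalg.norm(coords[i] - coords[j], axis=1)
    sq_diff = (values[i] - values[j]) ** 2
    lags, gammas, counts = [], [], []
    for lower, upper in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (h > lower) & (h <= upper)
        n_pairs = int(in_bin.sum())
        if n_pairs == 0:
            continue                                 # skip empty distance classes
        lags.append(h[in_bin].mean())
        gammas.append(sq_diff[in_bin].sum() / (2.0 * n_pairs))
        counts.append(n_pairs)
    return np.array(lags), np.array(gammas), np.array(counts)


# Hypothetical transect of four points with a linear gradient in the values.
xy = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
vals = np.array([0.0, 1.0, 2.0, 3.0])
lags, gammas, counts = empirical_semivariogram(xy, vals, [0.0, 1.5, 2.5, 3.5])
print(lags, gammas, counts)
```

In this example the semivariance keeps increasing with lag because the values contain an unremoved trend, which illustrates why the variogram of the residuals depends on how well the trend has been estimated.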
As one possible improvement, Lark et al. (2006) introduced the
residual maximum likelihood-empirical best linear unbiased predictor
(REML-EBLUP) to model the spatial variability of soil properties and classes. The advantage of
REML-EBLUP over RK results from the estimation of the variance parameters
by REML, since these estimates are subject to substantially less bias than
method-of-moments estimates from OLS or GLS residuals. Theoretically, REML-EBLUP may give
better prediction accuracy when compared with RK. The REML-EBLUP may provide
more efficient predictions and unbiased estimates of the error variances for quantifying
the uncertainty, whereas RK separates the errors of deterministic and random
components of the prediction, which contributes to the final uncertainty. A fuller
description of the theory underlying REML and its justification and use is given
elsewhere (Lark and Cullis, 2004; Lark et al., 2006b; Lark and Webster, 2006).
Though RK is statistically biased, the prediction performance of REML-EBLUP
and RK has been found to be similar. Minasny and McBratney (2007) tested the accuracy of
RK and REML-EBLUP by modeling different soil properties in different geographic
regions. There were slight improvements in prediction when using REML-EBLUP;
however, the advantage does not appear to be great. REML-EBLUP is useful when
there is a strong trend, when one needs to understand the underlying spatial process
and when the number of observations is small (< 200). Minasny and McBratney (2007)
concluded that although RK is statistically inappropriate, it is easy to use and has
proven to be a robust technique for practical application of soil landscape modeling and
mapping. Chai et al. (2008) compared the accuracy of RK and REML-EBLUP in predicting
SOM with different auxiliary variables. The improvement of REML-EBLUP over RK was
not significant in this study. They showed, however, that REML-EBLUP improved
prediction accuracy more than RK did, especially when a smaller
proportion of variation in the target variable was accounted for by the trend model. Also,
other studies show that REML-EBLUP is preferred when the number of observations is
fewer than 100 (Kerry and Oliver, 2007) or fewer than 200 (Minasny and McBratney,
2007). Therefore, when a fixed trend between target soil properties and classes and
SCORP factors is strong and there are too few observations to compute a reliable
variogram (fewer than about 100 to 200; Webster and Oliver, 1992), REML-EBLUP may be
preferable. As the number of studies using REML-EBLUP to predict soil properties and
classes increases, the advantages of using REML-EBLUP over RK may be better
documented in the near future.
2.4.4 Future Trend of RK
The current global framework for RK is generally a combination of linear models
(GLS, LM), which capture the deterministic portion of the variation, and OK, which captures the
spatially dependent stochastic portion of the variation; the two are combined
to form the final map (Odeh et al., 1995; Hengl et al., 2004). There are no restrictions
on how to quantify the relationship between sparsely available soil properties and
exhaustively available exogenous variables. McBratney et al. (2000) incorporated
modern statistical techniques, including generalized linear models (GLM), generalized
additive models (GAM), classification and regression trees (CART) and neural networks
(NN). After detrending the data, the residuals were kriged with ordinary kriging and combined with the trend
to create a final map, and they reported an obvious improvement of the modified RK
types over global RK.
During the last decade, machine learning algorithms have gained tremendous
attention from environmental scientists working to improve the accuracy of the
deterministic portion of the variation. As statistical methods advance, RK has
been revised and refined by using novel and robust parametric or non-parametric regression
types or by replacing ordinary kriging with block kriging. Modified RK types usually yield better
prediction accuracy than global RK. Among the reviewed studies, Sun et al. (2012)
compared local RK with global RK and found, in general, that local RK performs
no worse than global RK; local RK had been thought to be a stepped-up version of RK
(Hengl et al., 2007a). Using geographically weighted regression (GWR) (Brunsdon
et al., 1996), Mishra et al. (2010) reported a relative improvement of 22% over MLR and
of 2% over RK in SOC prediction. Kumar et al. (2012)
used geographically weighted regression kriging (GWRK), which combines GWR and OK
as a modified version of RK, and reported the least biased and most accurate results
compared to RK for estimating the SOC stock, based on the lowest RMSD. Niang et al.
(2014) used support vector regression and obtained the best prediction accuracy
compared with the geostatistical interpolation techniques. Poggio and Gimona (2014)
employed a hybrid GAM-geostatistical 3D model (3DGAM + GS), which combines the
fitting of a GAM to estimate the trend of the variable, using a 3D smoother with related
covariates, with kriging or Gaussian simulations of the GAM residuals as the spatial component
to account for local details, and found better prediction accuracy. Shi et al.
(2011) used high accuracy surface modeling (HASM), a spatial interpolation
technique based on the fundamental theorem of surfaces, and proposed a modified
HASM method that incorporates ancillary land use information (HASM_LU). The results
have shown that HASM_LU generally performs better than HASM, OK_LU, SK and
RK_GLM (with a lower estimation bias). MAE and RMSD generally perform with a
greater prediction error (PE). Li et al. (2013) proposed a radial basis function neural
network model (RBFNN) combined with principal component
analysis (PCA) to predict the spatial distribution of SOM content across China. They
reported a higher ratio of performance to deviation (RPD) and lower mean absolute error
(MAE), mean relative error (MRE) and root mean squared deviation (RMSD) when
compared to RK. Guo et al. (2015) used random forest (Breiman, 2001) with residual
kriging (RFRK) to predict and map the spatial distribution of SOM and, comparing the
results with SMLR, found that RFRK yielded much better prediction accuracy.
Very complex relationships between soil properties and environmental variables
are present in pedologic data (Lark, 1999). Machine learning algorithms do not require
the assumption of a Gaussian distribution and can handle nonlinear and hierarchical, complex
relationships between soil properties and classes and SCORP factors. Therefore,
pedometricians may wish to explore whether DSMM practitioners can gain
ground by combining machine learning algorithms with kriging methods. This
combination is a promising area for further investigation in the near future. Modified
versions of RK types are proposed with the hope that further investigation of these
combinations may increase prediction accuracy for soil science (Table 2-2).
2.4.5 Model Averaging
Numerous DSMM studies have utilized different geostatistical, non-geostatistical,
and hybrid methods to predict soil properties and classes. In order to identify the best
method for the particular soil properties at multiple scales and time, numerous methods
are being investigated by the DSMM community. Testing all of the
different geostatistical, non-geostatistical and hybrid methods may be cumbersome but
may increase the performance of predictions because each method has its
own strengths and weaknesses. To take full advantage of the strengths of each method,
model averaging may be an opportunity to make further gains, as it has
been applied in a variety of disciplines (Hoeting et al., 1999; Goswami and O’Connor,
2007; Raftery et al., 2005). The model averaging framework combines
the predictions from two or more methods, enhancing the strengths of each while
reducing the weaknesses of each source map (Malone et al., 2014). Further investigation
is needed to determine whether averaging the predictions of the best performing methods
yields further gains in prediction accuracy. Li et al.
(2011) could not find any increase in prediction accuracy from averaging the predictions
of RK_{RF,OK}, RK_{RF,IDS} and RF in their review. To the author's knowledge, the reviewed studies contain too
few examples of averaging different model results to
predict soil properties and classes to make a statement about the likelihood of
accuracy improvements. The only example of averaging different methods to
predict soil properties and classes is given by Malone et al. (2014). In that study, four
model averaging methods were employed, namely equal weights averaging (EW),
Bates–Granger or variance weighted averaging (VW), Granger–Ramanathan averaging
(GRA), and Bayesian model averaging (BMA), in order to average the disaggregated
conventional soil map produced using the DSMART (Odgers et al., 2014) and PROPR algorithms
(Odgers et al., 2015) with the RK-based digital soil map. The most accurate results were
found by averaging the disaggregated soil map and the RK-based soil map.
The model averaging technique is analogous to leveraging the best aspects of
each contributing model, and discarding the worst aspects. If both contributing models
are poor, ultimately the quality of the combined outcome will also be relatively poor;
however, one can at least expect the quality of the combined output to be comparable
to or better than the best of the contributing models (Malone et al., 2014). Biswas and
Cheng (2013) employed model averaging to reduce the uncertainty associated with
semivariogram model parameters. In short, the use of model averaging in the DSMM
community is scarce. Therefore, pedometricians may also investigate whether or not
model averaging can improve the prediction efficiency for soil properties and
classes.
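Two of the averaging schemes named in this section, equal weights averaging (EW) and Bates–Granger variance weighted averaging (VW), can be sketched as follows. The sketch assumes the simplified Bates–Granger form in which each model is weighted in proportion to the inverse of its prediction error variance and error correlations between models are ignored. The function names, the prediction values and the error variances are hypothetical.

```python
import numpy as np


def equal_weights_average(predictions):
    """EW: simple mean of the model predictions at each location."""
    return np.mean(np.asarray(predictions, dtype=float), axis=0)


def variance_weighted_average(predictions, error_variances):
    """VW (Bates-Granger): weights proportional to inverse error variance."""
    predictions = np.asarray(predictions, dtype=float)
    inverse_var = 1.0 / np.asarray(error_variances, dtype=float)
    weights = inverse_var / inverse_var.sum()
    return weights @ predictions, weights


# Hypothetical predictions of two models at three locations (rows = models).
preds = np.array([[1.0, 2.0, 3.0],
                  [3.0, 4.0, 5.0]])
# Hypothetical error variances, e.g. estimated from validation residuals.
err_var = np.array([1.0, 3.0])

ew = equal_weights_average(preds)
vw, w = variance_weighted_average(preds, err_var)
print(ew, vw, w)
```

The variance weighted scheme shifts the combined map toward the model with the smaller validation error, whereas equal weighting treats all contributing maps alike.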
2.5 Conclusions and Outlook
To address today’s environmental problems at
multiple scales in space and time, one of the key factors is the upscaling of scarcely
available categorical and continuous soil information to local, regional and global
scales. Since the 1960s, soil spatial variability has been studied in a systematic way in
order to characterize the pedogenic processes and complex distribution patterns of soils
in space and time and to depict categorical and continuous soil information on a map
(Burrough et al., 1994). As it is impractical to gather soil information at every possible
location for any target property of interest in space and time, quantitative
characterization of soil properties and classes in the soil-landscape continuum requires
predictive modeling based on a sparsely distributed, finite number of soil observations.
Hence, the investigation of consistent and robust methods resulting in higher prediction
accuracy for soil properties and classes has profound importance in pedometrics for the
foreseeable future.
Given the scarcity of field observations, pedologists have developed conceptual
models in order to capture the significant factors and processes responsible for the
genesis and spatial distribution of soil and its horizons (Minasny et al., 2008). RK, as a
combination of Jenny’s factorial model (Jenny, 1941) to quantify deterministic variation
and Matheron’s regionalized variable theory (Matheron, 1971) to quantify spatially
dependent stochastic variation, has proven to be one of the most widely accepted
methods among DSMM practitioners. As computational power, SCORP factors and
knowledge about soil forming processes increase and coevolve, more satisfactory
prediction efficiency for geo-spatial soil landscape models has been achieved. RK is
being used as a workhorse in the pedometrician’s toolbox and has been shown to be a
robust and widely accepted soil mapping method for over 20 years. It appears to have
reached maturity, given the large body of literature that now exists. However, there are
no consistent findings about the factors affecting the accuracy of RK due to a lack of
reporting of essential information and to unreliable metrics. To support more consistent
findings and allow more accurate comparisons, it is recommended that future DSM
studies report the following essential criteria: soil property
descriptive statistics (at least mean, minimum, maximum, median, range and coefficient
of variation), area of extent, total number of samples, sample design,
sample size (training and validation separately), sample depth(s), SCORP factors,
spatial resolution of the final map, transformation methods, the method of factorial analysis,
regression type, coefficient of determination of the deterministic function (i.e., R², fit
statistics), variogram model type, spatial autocorrelation range, N:S ratio, validation
method, R² of the fitted models and RPIQ (accuracy metric).
The appropriate selection of variables for input into RK is essential because the
functional relationship between SCORP factors and a soil variable is often unknown and
noisy. The variable selection strategy may suffer bias or even fail in regions where the
process knowledge is insufficient (Xiong et al., 2014a). The reviewed studies have
shown that the number of SCORP factors has been increasing over the past decade
with the advance of Geographic Information Systems (GIS), Global Positioning System
(GPS), and remote and proximal sensing technologies. The challenge is to gather a
comprehensive set of spatially exhaustive environmental predictors to characterize the
mosaic of soil-environmental systems and identify the relevant set of predictors.
Furthermore, it is still important to develop the most parsimonious yet well-performing
soil prediction model while dealing with multicollinearity between SCORP
factors and without sacrificing prediction accuracy. This may be addressed with the
incorporation of machine learning algorithms into the RK framework and systematic
variable selection algorithms (e.g., Boruta) that are used to increase the efficiency of
predictions.
Since machine learning algorithms do not require normally distributed soil data
and can handle hierarchical and nonlinear relationships between soil observations
and auxiliary variables, they have produced the same or better predictions than
conventional multivariate regression methods. Successful modification of RK with
modern statistical methods, especially machine learning algorithms, may allow
researchers to capture all attainable information offered by the data and decrease the
inaccuracies of geo-spatial soil landscape models. Even though the performance of
prediction is heavily dependent on the data quality, further gains can be made by
modifications in the specific methods underlying RK. Several variations of RK have
been offered: RK_{RF,OK}, RK_{RF,IDS}, RK_{RF,BK}, RK_{SVM,OK}, RK_{SVM,IDS}, RK_{GWR,OK},
RK_{GWR,IDS}, RK_{GWR,BK}, RK_{PLSR,OK}, RK_{PLSR,IDS}, RK_{PLSR,BK}, RK_{PCR,OK}, RK_{PCR,IDS} and
RK_{PCR,BK}. It may not be possible to identify a spatial prediction method that is best for every
case (Sun et al., 2012), but it is possible to develop models that characterize all
attainable variability with a given dataset.
In order to take full advantage of the strengths of different methods, model
averaging techniques may be utilized to reduce the prediction error. The increasing
amount of pedological data made available by emerging technologies such as
electromagnetic induction may allow pedometricians to focus on the depth
and time components of soil phenomena, which are generally overlooked, and may
improve the accuracy of target predictions.
Figure 2-1. Evolution of hybrid interpolation techniques. GLM = generalized linear
model, SMLR = stepwise multiple linear regression, CART = classification and regression tree, BK = block kriging, OK = ordinary kriging, SK = simple kriging, RK = regression kriging, KED = kriging with external drift, COK = cokriging, x = location in one, two or three dimensions, Z(x) = the random variable Z at location x, μ(x) = deterministic structural component, trend (drift), ε′(x) = stochastic component, spatially dependent residual from μ(x) (the regionalized variable), ε″(x) = spatially independent component, noise, unexplained variability.
Figure 2-2. General framework for regression kriging (RK). PCA = principal component analysis, RMSD = root mean squared deviation, ME = mean error, MAE = mean absolute error, x = location in one, two or three dimensions, Z(x) = the random variable Z at location x, μ(x) = deterministic structural component, trend (drift), ε′(x) = stochastic component, spatially dependent residual from μ(x) (the regionalized variable), ε″(x) = spatially independent component, noise, unexplained variability.
Figure 2-3. Cumulative number of RK studies over time.
Figure 2-4. Effects of coefficient of variation on the accuracy of RK methods in the 71
cases.
Table 2-1. Spatial range (m) from reviewed studies under three different areas of extent

Area of extent (km2)      Carbon   Chemical   Hydrological   Nutrient   Physical   Total average
Field < 0.25                  59          -              -        196          2              77
0.25 < Local < 10^4         7222       2883           2629      26635       9951           12511
10^4 < Regional < 10^7    194548       4361          13250      90778      30000           86123
Total average              35103       4150           6169      48485      11394           32236

Carbon: total carbon, soil organic carbon, soil organic matter, fractions of organic matter (HOC, POC, ROC, RC, HC, MC); Chemical: pH; Nutrient: N, K, Al, Ca, Mg, Zn, Cr, Cu, Ni; Hydrological: AWC, salinization, Ks; Physical: sand, silt, clay, horizon thickness, depth to C1.

Table 2-2. Modified versions of Regression Kriging (RK)

RK version       Deterministic                          Stochastic
RK_{GLM,OK}      Generalized Linear Model               Ordinary Kriging
RK_{GLS,OK}      Generalized Least Square               Ordinary Kriging
RK_{RF,OK}       Random Forest                          Ordinary Kriging
RK_{RF,IDS}      Random Forest                          Inverse Distance Squared
RK_{RF,BK}       Random Forest                          Block Kriging
RK_{CART,OK}     Regression Tree                        Ordinary Kriging
RK_{CART,IDS}    Regression Tree                        Inverse Distance Squared
RK_{CART,BK}     Regression Tree                        Block Kriging
RK_{SVM,OK}      Support Vector Regression              Ordinary Kriging
RK_{SVM,IDS}     Support Vector Regression              Inverse Distance Squared
RK_{SVM,BK}      Support Vector Regression              Block Kriging
RK_{GWR,OK}      Geographically Weighted Regression     Ordinary Kriging
RK_{GWR,IDS}     Geographically Weighted Regression     Inverse Distance Squared
RK_{GWR,BK}      Geographically Weighted Regression     Block Kriging
RK_{PLSR,OK}     Partial Least Square Regression        Ordinary Kriging
RK_{PLSR,IDS}    Partial Least Square Regression        Inverse Distance Squared
RK_{PLSR,BK}     Partial Least Square Regression        Block Kriging
RK_{PCR,OK}      Principal Component Regression         Ordinary Kriging
RK_{PCR,IDS}     Principal Component Regression         Inverse Distance Squared
RK_{PCR,BK}      Principal Component Regression         Block Kriging
CHAPTER 3
DIGITAL MAPPING OF SOIL CARBON FRACTIONS
3.1 Introduction
Quantifying only the soil total C stocks in a particular soil body to mirror its role
over a majority of soil functions does not adequately reflect the true gravity of soil C as
an ecosystem property (Parton et al., 1987a; Elliott et al., 1996). Organic C in soil is
comprised of a large variety of thermodynamically unstable materials with varying
degrees of decomposition and residence time which are located within the architecture
of the soil matrix (Jastrow and Miller, 1998). Accordingly, the soil organic C (SOC) may
simply be conceptualized into two major sub-pools: a labile pool with turnover rates
ranging from days to decades and a recalcitrant pool that persists in soil for hundreds to
thousands of years (Cheng et al., 2007). Therefore, modelling fractions of soil TC yields
multiple benefits, including identifying anthropogenically induced short-term C loss in
soil through labile pool of soil TC and adequately determining the long-term C budget
through recalcitrant pool of soil TC.
Soil is the key to the majority of today’s global environmental problems, such as
food, water, energy and biodiversity security (Bouma and McBratney, 2013). Framing
the role of soil C to deal with the global challenges of our time mandates the
understanding of the dynamics and characteristics of distinct SOC pools, along with
their interaction with soil-environmental factors. Lately, an emerging view contends that
total soil C storage and decomposition are not necessarily driven only by the inherent
molecular structure of soil organic matter (Marschner et al., 2008; Kleber et al., 2010;
Conant et al., 2011; Schmidt et al., 2011). Several authors argue that the quantity and
quality of soil C are predominantly controlled by physical, chemical, and biological
factors (Oades, 1988; Sollins et al., 1996; Jobbágy and Jackson, 2000; Ekschmitt et al.,
2008; Totsche et al., 2010; Schmidt et al., 2011). Our understanding of how environmental
factors control stabilization and destabilization mechanisms of soil C at particle,
aggregate, and pedon scales is still limited. Additionally, the up-scaling of spatially and
temporally heterogeneous C dynamics to local, regional, and global scales is still in its
infancy.
Today’s predictive soil mapping and modeling studies date back to the widely
recognized and globally accepted soil factorial model. The empirical-deterministic model
of soil formation developed by V.V. Dokuchaev (Glinka, 1927) and formulized by Jenny
(Jenny, 1941) define soil formation as a function of Climate, Organism, Relief, Parent
material, and Time (CLOPRT). Technological advancements in geographic information
systems and remote sensing allow pedologists to reframe the soil factorial model.
Hence, McBratney et al. (2003) proposed the SCORPAN model (S: Soil, C: Climate, O:
Organisms, R: Relief, P: Parent material, A: Age, and N: Space), incorporating
spatially and temporally explicit environmental factors and soil data into an equation.
This conceptual model serves as a framework for spatially explicit predictions of soil
properties at unvisited locations. Functional linkages are quantified between sparse site-
specific soil data and exhaustively available environmental covariates to derive soil
models. In the last epoch, humanity has rapidly become the main driver of the
functioning of the Earth System (Steffen et al., 2011). Soil and environmental scientists
recognize the extent, complexity and intensity of human influences on soil; hence,
humans are acknowledged as integral to soil genesis (Richter et al., 2011). In response,
the STEP-AWBH model (S: Soil, T: Topography, E: Ecology, P: Parent material, A:
Atmosphere, W: Water, B: Biota, and H: Human) was proposed to explicitly account for
both human-induced and natural factors that determine and modulate soil and space-
time interactions (Grunwald et al., 2011; Thompson et al., 2012).
As an ecosystem property, soil C is not randomly distributed (Webster, 2000).
Soil C is often spatially autocorrelated; that is, values close to each other in geographic
space generally correspond to similar values in feature space (Rossiter, 2012).
Accordingly, variation in soil C across a soil-landscape can be partitioned into two parts:
large-scale deterministic spatial variation as a function of a certain set of soil-
environmental variables and small-scale stochastic variation as a function of distance
between soil samples (McBratney, 1992). In addition, the relationships between soil C
storage and the accompanying ancillary variables are usually complex, non-linear, and
hierarchical. Hence, the ability of empirical geo-spatial soil-landscape models to
accurately quantify the spatial distribution pattern of soil C stocks over vast areas is
directly proportional to their capacity to capture non-linear, hierarchical relationships
between soil C and its environment, and also to explicitly account for spatial
autocorrelation often present in pedological data.
Overall, three generic, yet distinct, approaches have been adopted in
pedometrics to quantify the relationship between soil C and its environment with the
purpose of describing, analyzing, predicting, mapping, and assessing spatial distribution
across the soil-landscape continuum. The first approach is feature-space-based models
(statistical, machine learning) which do not explicitly account for stochastic spatially
dependent variation, such as Multiple Linear Regression (MLR) (Meersmans et al.,
2008), Classification and Regression Trees (CART) (McKenzie and Ryan, 1999;
Stoorvogel et al., 2009; Vasques et al., 2008), Random Forest (RF) (Grimm et al., 2008;
Wiesmeier et al., 2014), Support Vector Machines (SVM) (Were et al., 2015), Boosted
Regression Trees (BoRT) (Martin et al., 2011) and Bagged Regression Trees (BaRT)
(Xiong et al., 2014a). The second approach is geographic-space-based (geostatistical)
models which model the spatial dependence structure of site observations without
accounting for the deterministic trend, such as Ordinary Kriging (OK) (Rawlins et al.,
2011). The third approach is hybrid methods which explicitly account for the stochastic
spatially dependent variation and the deterministic trend, such as Regression Kriging
(RK) type C (Simbahan et al., 2006; Vasques et al., 2010a; Mishra et al., 2012; Sun et
al., 2012; Malone et al., 2014). The hybrid methods are expected to outperform statistical
and geostatistical models because of their dual nature. An overview of the
state-of-the-art methods used to model and map soil C and soil C change is presented
by Minasny et al. (2013).
Recently, a few studies have reported improvements in the prediction accuracies
through the incorporation of machine learning methods into the RK framework. For
example, Guo et al. (2015) used RF and OK sequentially to model and map soil organic
matter (SOM) across tropical regions of China. In another regional study, Li et al. (2013)
proposed a radial basis function neural network combined with the OK model to predict
the spatial distribution of SOM content across China. In France, Martin et al. (2014)
employed BoRT and OK in a national-scale C study to characterize the spatial
distribution of SOC. However, findings on coupling residual spatial autocorrelation
with machine learning algorithms have not been consistent. Although roughly half of the
digital soil mapping and modeling (DSMM) studies in a large meta-review focused on
soil C and SOM (Grunwald, 2009), only a few individual studies have focused on
modeling different pools of soil C, possibly because of the analytical and computational
costs needed to perform such studies (Vasques et al., 2010b). In a catchment-scale
study in Australia, Karunaratne et al. (2014) modeled and mapped the measurable
fractions of soil C, namely resistant, humus, and particulate organic C. In the United
States, Vasques et al. (2010b) mapped and modeled different carbon fractions in a
watershed in Florida. Also, Knox et al. (2015) developed prediction models for TC,
SOC, and labile C, specifically hot-water extractable C (HC), using visible/near-infrared
and mid-infrared spectroscopy in Florida. None of these studies created soil C fraction models
that describe spatial distribution patterns across Florida, which is addressed in this
chapter.
Since labile C responds much faster to land use and other human-induced
changes (e.g., management) than TC and RC (Conant et al., 2003; Veldkamp et al.,
2003; Haynes, 2005), labile/active C fractions provide critical signatures serving as an
indicator of change. However, not much is known about how labile C varies across larger
regions with mixed land use, soil, and hydrologic settings. This motivated our research.
Our objectives are as follows:
1. Identify and characterize the most sensitive STEP-AWBH factors relevant to soil C pools to develop parsimonious geo-spatial soil-landscape models without sacrificing prediction accuracy.
2. Compare eight methods representing three distinct approaches, identify the best method for modeling each soil C fraction, and rank the prediction performance of the evaluated models.
3. Investigate the spatial autocorrelation of soil C model residuals and assess whether accounting for it improves the variability explained by the soil C fraction models.
3.2 Materials and Methods
3.2.1 Study Area
This study was conducted in the state of Florida, which is located in the
southeastern United States, with latitudes from 24°52’ N to 31°02’ N and longitudes
from 80°03’ W to 87°64’ W (Figure 3-1). As a peninsula, Florida is surrounded by the
Gulf of Mexico and the Atlantic Ocean on three sides and has a total area of
approximately 150,000 km2 (United States Census Bureau, 2000). While a humid,
subtropical climate is predominant in northern and central Florida, a humid, tropical
climate is predominant in southern Florida. The mean annual precipitation is 1,373 mm,
predominantly from the very high frequency of thunderstorms. The mean
annual temperature is 22.3°C (National Climatic Data Center, 2008). Elevation ranges
from sea level to 106 m across Florida (United States Geological Survey, 1999).
Landforms associated with a nearly level, gentle slope dominate almost the whole state
with the exception of the northwestern part of the state (i.e., Florida Panhandle) and an
escarpment in north-central Florida (Cody Scarp). Micro-topography can greatly
influence the hydrological pattern (Mulkey et al., 2008). A generally high amount of
rainfall and low elevations, coupled with a relatively high water table, combine to form a
relatively high number of wetlands and marshes across the state. Even though Florida
is among the wettest states in the United States, Florida is susceptible to wildfires
during the driest months of the year, typically between October and May. The soils in
the study area were formed in sandy to loamy marine-derived parent material with sand
as the dominant particle size fraction. Dominant soil orders in Florida include Spodosols
(32%), Entisols (22%), Ultisols (19%), Alfisols (13%), Histosols (11%), and Mollisols and
Inceptisols (3% combined) (Vasques et al., 2010a). The most frequent soil subgroups
are Aeric Alaquods, Ultic Alaquods, Lamellic Quartzipsamments, Typic
Quartzipsamments, and Arenic Glossaqualfs (Natural Resources Conservation Service,
2009). Land use/cover (LULC) in Florida is dominated by wetlands (28%), pinelands (18%), urban and
barren lands (15%), agriculture (9%), rangelands (9%), and improved pasture (8%)
(Florida Fish and Wildlife Conservation Commission, 2003). With 19.9 million people
and counting, Florida is the third most populous U.S. state (United States Census
Bureau, 2015). Its increasing population has resulted in major changes, including rapid
urban growth and loss of agricultural and forest land over the past several decades
(Kautz et al., 2007). From the 1970s to 2011, the urban area in Florida increased by
more than 140% to about 24,900 km2, primarily converted from agriculture and upland
forest (Xiong et al., 2014b).
3.2.2 Soil Data
The soil data used in this thesis were derived as part of a larger project funded by
USDA-CSREES-NRI grant award 2007-35107-18368 titled “Rapid Assessment and
Trajectory Modeling of Changes in Soil Carbon across a Southeastern Landscape”
(National Institute of Food and Agriculture [NIFA], Agriculture and Food Research
Initiative [AFRI]). The Principal Investigator of this project is Dr. S. Grunwald and the
Co-Principal Investigators are Dr. W.G. Harris, Dr. N.B. Comerford, and Dr. G.L. Bruland. This
project is a Core Project of the North American Carbon Program. The following sections
briefly describe the sampling design, field sampling, and laboratory analyses, which were
performed by the project team. My role in the project begins with model development.
3.2.2.1 Sampling design and field sampling
As a product of the statewide project known as “Florida Soil Carbon Project”
conducted between July 2008 and June 2009, a total of 1,014 soil samples were
collected at a fixed depth of 20 cm across the state of Florida. A stratified random
sampling approach was implemented to capture the broad range in the variability of soil
C across Florida. Sixty-three land use/cover (LULC)-suborder strata were designed
based on a combination of the reclassified LULC map obtained from the Florida Fish
and Wildlife Conservation Commission (2003) and the 10 soil suborders acquired from
the Soil Data Mart-Soil Survey Geographic Database (SSURGO) (Natural Resources
Conservation Service, 2006).
To reflect local variability at each predefined site, four soil samples were
collected (20 cm deep x 5.8 cm diameter soil cores within a 2 m diameter radius) and
then georeferenced, bulked, and transported in a cooler for lab analysis. Afterward, the
bulk samples were air-dried and sieved to retrieve the fine earth fraction (< 2 mm).
These samples were thoroughly mixed and different quantities of subsamples were ball
milled for use in chemical analysis to derive different pools of soil C: TC, RC, and HC.
3.2.2.2 Laboratory and chemical analysis
Carbon fractions were measured using a Shimadzu TOC-VCPN catalytic
combustion oxidation instrument with a SSM-5000a solid sample module (Shimadzu
Scientific Instruments, Kyoto, Japan). Total C was measured from the 80–700 mg ball
milled samples combusted at 900°C. Measurement of hydrolysable ‘labile’ carbon (hot
water extractable – HC) was performed by incubating 4 g of soil in 40 mL (1:10) of
double de-ionized water for 16 h at 80°C. Samples were then filtered to 0.22 μm.
Measurement of the non-hydrolysable ‘recalcitrant’ C (RC) was accomplished by
digesting 2 g of the ball-milled soil in 10 mL of 5 M HCl under reflux conditions for 16 h.
The soil digest was washed three times by centrifugation and dried, and the remaining undigested C
was then combusted at 900°C (Knox et al., 2015).
3.2.2.3 Determination of total, recalcitrant and labile carbon stocks
Soil carbon stocks in areal units (kg m−2) were derived for each of the C fractions
by multiplying TC, RC, and HC concentrations by the measured oven-dry bulk density (BD)
values (Eq. 3-1). The mass of each soil C fraction present in the top 20 cm (kg
C m-2) was computed using the following equation:
TC, RC or HC stocks = (TC, RC or HC x BD x 2000) / 1000 (3-1)
TC, RC, or HC stocks : Soil total carbon, recalcitrant and labile carbon stocks in kg
C m-2 (0–20 cm soil profile)
BD : Oven dry bulk density (g cm-3)
PD : Profile Depth (0.2 m)
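
Equation 3-1 can be checked with a few lines of arithmetic. The sketch below is illustrative Python (the thesis analyses were run in R); the function name and the sample values are hypothetical. The constants 2000 and 1000 are taken from Eq. 3-1 as written, which reduces to stock = concentration x BD x 2. That form is consistent with a concentration expressed in percent by mass over a 0.2 m layer, but this is an assumption, because the concentration unit is not stated above.

```python
def carbon_stock(concentration, bulk_density):
    """Eq. 3-1 as written: (C x BD x 2000) / 1000, giving kg C m-2 for 0-20 cm.

    concentration: TC, RC, or HC concentration (unit assumed to be % by mass)
    bulk_density:  oven-dry bulk density in g cm-3
    """
    return concentration * bulk_density * 2000 / 1000


# Hypothetical sample: 1.5 (assumed % C by mass) and BD = 1.4 g cm-3
example_stock = carbon_stock(1.5, 1.4)
print(round(example_stock, 2))
```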
3.2.3 Environmental Data
3.2.3.1 Assembled environmental variables representing STEP-AWBH factors
In Digital Soil Mapping and Modeling (DSMM), the prediction performance of
geospatial models for soil properties has depended largely on assembling
useful qualitative and quantitative soil-spatial information at an appropriate scale
rather than on employing more sophisticated statistical or geostatistical methods (Minasny and
McBratney, 2007; Grunwald, 2009). Even though having a set of potential predictors
may substantially improve prediction accuracy for soil properties and classes, the
selection of parsimonious environmental variable sets does not receive as much attention as
the calibration and validation parts of the geo-spatial modeling process. Building a pool of
potential predictors can be overlooked because some researchers are unaware of the
availability of potential predictors or because of a general belief in the similarity to or superiority of
the variables chosen (Miller et al., 2015). Accurate, efficient, and unbiased model
development requires the inclusion of all possible environmental determinants;
otherwise, selection of predictors based on the researchers’ knowledge could lead to
biased and suboptimal model performance (Grunwald, 2009). This research is
based on the approach presented in Xiong et al. (2014a).
To represent a spectrum of possible soil-forming processes that may have an
impact on the fate of TC, RC, and HC, a large set of up-to-date STEP-AWBH variables
(N: 332) with statewide coverage was gathered from numerous data sources with
ArcGIS 10.2 (Environmental Systems Research Institute, ESRI Inc., Redlands, CA)
(Table 3-1). About 12% (N: 40) of the variables were categorical (i.e., ordinal, nominal,
binary), including SSURGO-derived soil taxonomic properties, LULC classes obtained
from different sources, soil drainage and hydrological classes, vegetation type, etc.,
whereas about 88% (N: 293) were continuous (i.e., floating point, integer), including
proximal and remote sensing derived variables (with a variety of spatial resolutions) and
terrain variables, such as soil water-holding capacities, historic organic matter content,
primary and secondary terrain attributes, and climatic and biotic variables. Some
topographic variables, including elevation, compound topographic index, slope, and flow
accumulation, were collected at three spatial resolutions (30, 90, and 1000 m).
Moreover, some biotic and climatic variables, such as the normalized difference
vegetation index (NDVI), the enhanced vegetation index (EVI), and monthly precipitation,
were also represented as multi-temporal sequences.
3.2.3.2 Boruta feature selection technique
Many of the available variables may introduce noise or may
not provide information about a target soil property. Additionally, variables may be
redundant or highly correlated, which makes the task of gathering a comprehensive set
of environmental predictors problematic (Xiong et al., 2014a). Thus, strategically
identifying the variables related to the major pedogenic and environmental
processes behind the phenomena of interest, in this case TC, RC, and HC,
is a focal point of any such research. This problem has been addressed in the machine learning literature under
the topics of minimal-optimal and all-relevant variable selection (Liu and
Motoda, 2012). According to Xiong et al. (2014a), the minimal-optimal set is preferable
to yield the best prediction accuracy when the focus is on developing a predictive
model, whereas the all-relevant variable selection method is preferable to characterize
the mechanism between environmental variables and phenomena of interest.
Furthermore, the selection of all-relevant variables out of a broad set of environmental
variables reduces overfitting and the time needed for model development and application, and also
increases model interpretability (Belanche-Muñoz and Blanch, 2008; Merow et al.,
2014).
Boruta, an all-relevant variable selection method, was applied to characterize
and identify the variables that control the spatial distribution of
soil TC, RC, and HC. This method can detect linear and non-linear relationships
between soil C fractions and environmental predictors because the Boruta algorithm is
based on the RF classification algorithm. In short, the Boruta algorithm produces five random
probes whose values are acquired by shuffling the values of the original predictors to remove
their correlation with the phenomenon of interest (e.g., TC, RC, and HC). Afterwards, RF
regression is conducted on the original predictors and random probes combined, and a Z
score is determined as the importance of each variable. Then, the maximum Z score
among the random probes (MZRP) is identified and used as a reference to decide whether a
predictor is relevant to TC, RC, and HC with a two-sided test of equality. Only the
predictors with a Z score significantly higher than the MZRP were accepted as relevant
variables (Kursa and Rudnicki, 2010). A full discussion can be found in Xiong et al.
(2014a). The Boruta package (Kursa and Rudnicki, 2010) was used to perform the Boruta
all-relevant search in R 3.2.0 (R Core Team, 2015).
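
The idea of comparing each predictor against shuffled probes can be illustrated with a deliberately simplified sketch. This is not the Boruta algorithm as implemented in the R package: the stand-in importance measure here is the absolute Pearson correlation rather than an RF Z score, only a single pass is made, and no statistical test is applied. All names and the synthetic data are hypothetical.

```python
import random


def abs_pearson(x, y):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return abs(sxy / (sxx * syy) ** 0.5)


def shadow_screen(predictors, response, rng):
    """Keep predictors whose importance exceeds the best shuffled probe."""
    importance = {name: abs_pearson(values, response)
                  for name, values in predictors.items()}
    probe_importance = []
    for values in predictors.values():
        probe = values[:]          # copy, then shuffle to break any link to the response
        rng.shuffle(probe)
        probe_importance.append(abs_pearson(probe, response))
    threshold = max(probe_importance)  # analogous in spirit to the MZRP
    selected = [name for name, score in importance.items() if score > threshold]
    return selected, threshold


# Synthetic data: one predictor drives the response, the other is pure noise.
rng = random.Random(42)
n = 200
x_relevant = [rng.gauss(0, 1) for _ in range(n)]
x_noise = [rng.gauss(0, 1) for _ in range(n)]
y = [2 * a + rng.gauss(0, 1) for a in x_relevant]

selected, threshold = shadow_screen(
    {"x_relevant": x_relevant, "x_noise": x_noise}, y, rng)
print(selected, round(threshold, 3))
```

A single pass like this can still admit a noise variable by chance, which is why the actual method repeats the comparison and applies a statistical test.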
3.2.4 Modeling Techniques
The whole dataset (N: 1,014) was randomly split into calibration (N: 710) and
validation (N: 304) sets to model TC, RC, and HC. The calibration dataset was used to
train models with all-relevant variables identified by Boruta. The independent validation
set was used to evaluate the predictive performance of each model.
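
A random 710/304 split of the 1,014 samples can be sketched as follows. The seed and the use of integer sample identifiers are assumptions made for illustration only; the thesis does not report how the split was seeded.

```python
import random

sample_ids = list(range(1, 1015))   # 1,014 hypothetical sample identifiers

rng = random.Random(2015)           # arbitrary seed, chosen only for reproducibility
shuffled = sample_ids[:]
rng.shuffle(shuffled)

calibration_ids = sorted(shuffled[:710])   # about 70% for model training
validation_ids = sorted(shuffled[710:])    # about 30% held out for validation

print(len(calibration_ids), len(validation_ids))
```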
For comparative assessment, eight different techniques were selected to
characterize the spatial distribution pattern of soil TC, RC, and HC across the state. The
methods fall into three generic modelling approaches: feature-space-based methods,
including statistical and machine learning methods (i.e., PLSR, CART, BaRT, BoRT,
RF, and SVM), geostatistical (i.e., OK), and hybrid methods (i.e., RK). A thorough
explanation of the evaluated methods and applications can be found in James et al.
(2013) and Kuhn and Johnson (2013).
CART involves constructing a decision tree on the predictor variables.
The tree is grown by repeatedly stratifying the dataset into successively smaller
subsets (child nodes) with binary splits, each based on a single categorical or continuous
predictor variable (Breiman, 1984). At each node, the split chosen is
the one that best separates the response into two homogeneous groups
(i.e., minimizing variability within each child node) (Prasad et al., 2006).
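
The split criterion can be made concrete with a minimal sketch for one continuous predictor. It searches the midpoints between sorted predictor values and keeps the threshold that minimizes the summed squared deviation within the two child nodes. This illustrates the criterion only and is not the rpart implementation; the function names and data are hypothetical.

```python
def sum_squared_dev(values):
    """Sum of squared deviations from the mean (within-node variability)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, cost) of the binary split minimizing within-node variability."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate identical predictor values
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [resp for _, resp in pairs[:i]]
        right = [resp for _, resp in pairs[i:]]
        cost = sum_squared_dev(left) + sum_squared_dev(right)
        if best is None or cost < best[1]:
            best = (threshold, cost)
    return best


# Hypothetical data: low C stocks at low predictor values, high stocks at high values
predictor = [1, 2, 3, 10, 11, 12]
stock = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8]
threshold, cost = best_split(predictor, stock)
print(threshold)
```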
BaRT is an ensemble decision tree method that involves the averaging of several
individual trees to acquire a final prediction. Individual regression trees have been
recognized as unstable learners (Breiman, 1996); that is, small changes in the
calibration dataset can give rise to very different output trees (Hastie et al., 2009).
Bagging (bootstrap aggregating) is a relatively simple ensemble procedure that uses
many bootstrap sets drawn with replacement from the original training data set and
grows a regression tree from each bootstrap sample (Efron and Tibshirani, 1993). The
results of each individual tree are subsequently averaged to obtain the overall
prediction.
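
Bagging can be sketched in a few lines: draw bootstrap samples with replacement, fit a tree to each, and average the predictions. To keep the example short, each tree is a one-split stump on a single predictor, which is a simplification of the full regression trees used in BaRT. The names, data, and seed are hypothetical.

```python
import random


def fit_stump(x, y):
    """One-split regression tree: returns (threshold, left_mean, right_mean)."""
    pairs = sorted(zip(x, y))
    overall_mean = sum(y) / len(y)
    best = (None, overall_mean, overall_mean)  # fallback when no split is possible
    best_cost = float("inf")
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue
        left = [resp for _, resp in pairs[:i]]
        right = [resp for _, resp in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        cost = (sum((v - left_mean) ** 2 for v in left)
                + sum((v - right_mean) ** 2 for v in right))
        if cost < best_cost:
            best_cost = cost
            best = ((pairs[i][0] + pairs[i - 1][0]) / 2, left_mean, right_mean)
    return best


def predict_stump(stump, x_new):
    threshold, left_mean, right_mean = stump
    if threshold is None:
        return left_mean
    return left_mean if x_new <= threshold else right_mean


def bagged_predict(x, y, x_new, n_trees, rng):
    """Average the predictions of stumps fitted to bootstrap samples."""
    n = len(x)
    predictions = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stump = fit_stump([x[i] for i in idx], [y[i] for i in idx])
        predictions.append(predict_stump(stump, x_new))
    return sum(predictions) / len(predictions)


x = [1, 2, 3, 4, 10, 11, 12, 13]
y = [1.0, 1.2, 0.9, 1.1, 5.0, 5.3, 4.8, 5.1]
rng = random.Random(1)
low = bagged_predict(x, y, 2.5, 50, rng)
high = bagged_predict(x, y, 11.5, 50, rng)
print(round(low, 2), round(high, 2))
```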
RF is an ensemble approach that involves the bagging of un-pruned trees (weak
learners) with a random selection of predictors at each split (Breiman, 2001). The main
difference between RF and BaRT is that the set of predictor variables is randomly restricted at
each split (Prasad et al., 2006); this reduces the correlation between the
individual trees and hence improves the final prediction accuracy and efficiency of
the ensemble.
BoRT is an ensemble approach that diverges from RF and BaRT in one main
respect. BaRT and RF involve fitting a separate decision tree to each bootstrap sample
derived from the original data and then combining the individual trees to create a single
predictive model. In BoRT, trees are instead grown sequentially, with each tree grown
using the information from previously grown trees (Hastie et al., 2009).
SVM belongs to the regression model family and emerges from the area of
statistical learning theory (Vapnik, 1998). SVMs mainly involve a projection of the data
into a high-dimensional feature space using a valid kernel function, followed by a
simple linear regression within this enhanced space (Hornik et al., 2006). The resulting
linear regression function in the high-dimensional feature space corresponds to a non-linear
regression in the original input space (Smola and Schölkopf, 2004). In this study,
the radial basis kernel function was applied to project data into the high-dimensional
feature space before fitting a linear regression. To tune the model, the
cost and sigma parameters were determined with a grid search method (Grunwald et al.,
2014).
The PLSR algorithm relates the response variable (e.g., TC, RC, and HC) and a
large number of highly collinear predictor variables (e.g., STEP-AWBH variables)
through a linear multivariate model to identify successive orthogonal principal
components (latent variables) that maximize the covariance between the response and
predictor variables (Garthwaite, 1994). Predictions were finalized by linear multivariate
regression of the response variable on the calibration dataset. For this study, 14, 18,
and 12 principal components were selected for TC, RC, and HC, respectively, by
identifying the minimum root mean square deviation (RMSD) of cross-validation on the
calibration datasets.
OK is the most commonly used weighted average interpolation technique that is
based on regionalized variable theory and depends on the spatial autocorrelation
structure of the target variable (McBratney et al., 2000). Because the sample distributions
of TC, RC, and HC were non-normal, the TC, RC, and HC values were
log-transformed (log10) to approximate a Gaussian distribution. Spherical and
exponential models were tested to select the best fit to the experimental
semivariograms for TC, RC, and HC. Exponential models were fitted to each of the
omnidirectional variograms of log-transformed TC, RC, and HC. After OK was
conducted on the validation locations, the log-transformed SOC pools were back-
transformed to the original units as outlined by Webster and Oliver (2007).
RK is the most commonly used hybrid interpolation technique that combines the
regression (e.g., the trend between the target variable and the auxiliary variables) and
ordinary kriging of the residual (i.e., stochastic unexplained variation) (Odeh et al., 1995;
Hengl et al., 2004). Stepwise multiple linear regression (SMLR) was employed to model
global spatial trend using the log-transformed TC, RC, and HC. This is followed by the
ordinary kriging of the regression residuals. The final prediction was then obtained by
summing the predicted and interpolated outputs in the original scale. After
back-transformation, the final C pool estimates were validated using the independent
validation set.
3.2.5 Evaluation of Model Performance
Independent validation was used to assess the prediction performance of the
evaluated methods. The Kolmogorov-Smirnov test was conducted to confirm that the
calibration and validation soil C fraction datasets had similar distributions. The difference
between the measured and predicted values of the eight models for TC, RC, and HC was
evaluated in the original scale.
As the goodness-of-fit statistic, the coefficient of determination, R2, was used to
compare the amount of variation each model was able to explain. The Root Mean
Square Deviation (RMSD, kg C m-2) was used to make further inquiries on model
precision. Furthermore, to clearly illustrate the contribution of different methods to the
prediction performance, the relative decrease in RMSD was evaluated by taking the
RMSD of OK as a reference, with the following formula
$$\text{Relative decrease in RMSD (\%)} = \frac{\mathrm{RMSD}_{OK} - \mathrm{RMSD}_{CART,\,BaRT,\,BoRT,\,RF,\,SVM,\,PLSR,\,RK}}{\mathrm{RMSD}_{OK}} \times 100 \qquad (3\text{-}2)$$
where RMSD_OK is the error of OK as the benchmark (i.e., the initial, primitive soil-landscape
model) and RMSD_(CART, BaRT, BoRT, RF, SVM, PLSR, RK) is the error of the evaluated model
being investigated.
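
As a worked example of Eq. 3-2, the sketch below computes the relative decrease in RMSD for one model against the OK benchmark. The function name and the two RMSD values are hypothetical and are not results from this study.

```python
def relative_rmsd_decrease(rmsd_ok, rmsd_model):
    """Eq. 3-2: percentage decrease in RMSD relative to the OK benchmark."""
    return (rmsd_ok - rmsd_model) / rmsd_ok * 100


# Hypothetical errors in kg C m-2: OK benchmark 4.0, evaluated model 3.0
improvement = relative_rmsd_decrease(4.0, 3.0)
print(improvement)
```

A positive value means the evaluated model has a lower error than OK; a negative value means it performs worse than the benchmark.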
Residual prediction deviation (RPD) (Williams, 1987) was selected to compare
our results with other studies documented in the literature. In addition to the RPD, the
ratio of performance to inter-quartile range (RPIQ) (Bellon-Maurel et al., 2010) was
reported since RPIQ is well suited to non-Gaussian distributed data.
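
The validation statistics can be sketched as below, assuming the commonly used definitions RPD = SD of the observed values / RMSD and RPIQ = (Q3 - Q1) of the observed values / RMSD. The quartile convention (the default 'exclusive' method of Python's statistics module) is an assumption and may differ from the one used in the thesis. The data are hypothetical.

```python
import math
import statistics


def rmsd(observed, predicted):
    """Root mean square deviation between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def rpd(observed, predicted):
    """Residual prediction deviation: sample SD of observations over RMSD."""
    return statistics.stdev(observed) / rmsd(observed, predicted)


def rpiq(observed, predicted):
    """Inter-quartile range of observations over RMSD."""
    q1, _, q3 = statistics.quantiles(observed, n=4)
    return (q3 - q1) / rmsd(observed, predicted)


# Hypothetical observed and predicted stocks (kg C m-2)
observed = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
predicted = [1.5, 1.5, 3.5, 3.5, 5.5, 5.5, 7.5, 7.5]

print(rmsd(observed, predicted), rpd(observed, predicted), rpiq(observed, predicted))
```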
3.2.6 Application of Models
Data preparation, manipulation, model development, model accuracy
assessment, and mapping were carried out using R 3.2.0 and its add-in packages (R
Core Team, 2015) (Table 3-2). To develop CART, BaRT, BoRT, RF, SVM, PLSR, and
OK, the ‘rpart’ (Therneau et al., 2015), ‘ipred’ (Peters et al., 2015), ‘gbm’ (Ridgeway,
2007), ‘randomForest’ (Liaw and Wiener, 2002), ‘kernlab’ (Karatzoglou et al., 2007), ‘pls’
(Wehrens et al., 2007), and ‘gstat’ (Pebesma, 2004) packages were used, respectively.
3.2.7 Mapping of Total, Labile and Recalcitrant Carbon Stocks
The Random Forest model for each soil C fraction was employed to create
high-resolution (30 m) soil C fraction maps covering the state of Florida. Categorical
variables identified by Boruta were excluded from the model development process. Xiong et al.
(2014a) stressed that adding more categorical variables leads to missing values in
the produced maps because not all predictor classes are represented in the calibration
samples. Therefore, C maps were produced based on the all-relevant continuous
predictors available across Florida.
3.3 Results and Discussion
3.3.1 Descriptive Summary Statistics of Carbon Fractions
The descriptive statistics of the measured soil carbon fractions observed with
respect to the entire dataset, calibration set (70%), and validation set (30%) are
presented (Table 3-3). Total C ranged from 0.45 to 34.15 kg TC m-2 with a mean of 4.74
and a median of 3.32 kg TC m-2. Considering the whole dataset, high kurtosis values
and strongly positively skewed distributions of TC, RC, and HC revealed the diverse
characteristics of Florida’s soil-scapes. The non-Gaussian distribution was probably due
to the frequent low values of soil C typical of the well-drained, coarse-grained
upland soils and the few extremely high values found in the poorly-drained wetland
soils. Collectively, the characteristics of the sample distributions reflected the large
variation in soil suborders, climatic conditions, land-uses, and environmental conditions
throughout the surface soil of Florida.
In this study, SOC is essentially equivalent to soil TC since soil inorganic
carbon was a very minor constituent in most of the samples, most likely due to
the acidic nature of Florida’s soils. The analysis revealed that most of the
total C stored in the surface soil of Florida is concentrated within the stable recalcitrant
sub-pool (RC) rather than the labile sub-pool (HC). While RC varied from 0.22 to
25.08 kg m-2 with a mean of 2.81 and a median of 1.69 kg m-2, HC ranged from 0.02 to
0.71 kg m-2 with a mean of 0.16 and a median of 0.14 kg m-2 (Table 3-3).
Even though the HC made up a relatively low portion of the TC, the HC is
essential in understanding soil C dynamics as it is a sensitive measure for short-term
changes induced by management practices, temperature, and soil moisture. Usually, a
high correlation of HC with microbial biomass indicates that C is readily available for
microbial utilization (Leinweber et al., 1995). Additionally, HC is thought to be an
important labile component for soil micro-aggregation in organic matter and the soil
physical parameter to be studied with regard to soil quality (Ghani et al., 2003).
The Shapiro-Wilk test confirmed the approximately normal distribution of
log-transformed soil TC, RC, and HC. The Kolmogorov-Smirnov test confirmed that the
randomly separated calibration and validation samples appropriately represent the
population for TC, RC, and HC, respectively. This similarity between the calibration and
validation sets demonstrated that they were randomly sampled from identical
populations for each dataset. Spearman’s pairwise correlation analysis between the
different soil C fractions indicated that TC and RC were strongly correlated, whereas HC and TC
were weakly correlated (Table 3-4). This implied differences in the processes that are
responsible for the accumulation and decomposition of RC and HC.
3.3.2 Spatial Autocorrelation with Trend and without Trend
The omnidirectional variograms of log-transformed TC, RC, and HC and
residuals of the SMLR are presented to assess how much variation is captured by the
SMLR (Figure 3-2). For TC, RC, and HC, an exponential model was fitted with an
effective range of 28, 30, and 18 km, respectively. Also, an exponential model was fitted
with an effective range of 8, 7, and 3 km, respectively, to the residuals of the SMLR.
This finding indicated that the SMLR method was only able to explain some portions of
the stochastic spatially dependent variation. The similarity in range values for TC and RC
indicated that similar major environmental factors likely control both carbon pools at
the regional scale. The lowest range was obtained for HC, which could be attributed to
both the inherent instability of HC in its environment (i.e., easily decomposable
molecular structure) and the uncaptured short-range variability due to the tradeoff between
areal extent and sampling density. Labile C fractions are composed of simple
molecules, including carbohydrates and proteins, which are also constituents of the
soil microbial biomass (Balaria et al., 2009). Thus, HC can dynamically mineralize into
the atmosphere, transform into RC, or leach out to subsurface horizons. This may all
contribute to the uncertainty associated with HC modeling. In the present study, the
sampling density was about 0.007 samples per km2, which was sufficient to capture the
long-range variability of TC and RC (Figure 3-2), while the shorter spatial autocorrelation
range for HC suggested that there was uncaptured short-range variability in the labile C
fraction.
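
The reported sampling density follows directly from the sample count and the approximate area of the state. The sketch below also derives a rough mean sample spacing for comparison with the variogram ranges above; treating the spacing as the square root of the area per sample assumes an evenly spread design, which is a simplification of the stratified random sampling actually used.

```python
import math

n_samples = 1014        # soil samples collected statewide
area_km2 = 150_000      # approximate area of Florida reported in Section 3.2.1

density = n_samples / area_km2                # samples per km2
area_per_sample = area_km2 / n_samples        # km2 represented by one sample
mean_spacing_km = math.sqrt(area_per_sample)  # rough spacing if samples were evenly spread

print(round(density, 4), round(area_per_sample, 1), round(mean_spacing_km, 1))
```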
The nugget to sill ratio (N:S), which expresses the magnitude of the spatial
autocorrelation, amounted to 37%, 36%, and 44% for TC, RC, and HC, respectively.
The N:S ratio for the residuals of TC, RC and HC was 38%, 32%, and 80%,
respectively. These findings also indicated that the explainable proportion of the total
variance for TC and RC was greater than that for HC. In line with our findings,
Vasques et al. (2010b) found strong spatial structure for TC and RC, but only moderate
spatial dependence for HC, in a large mixed-use watershed in Florida. The variogram
analysis in the present study revealed that the soil stable carbon pool, represented by
RC, showed the longest range (30 km). It is well-established that the stable C pool, RC,
is protected through physical, chemical, and biological stabilization mechanisms; thus,
the mean residence time can be decades to even thousands of years within the soil
system (Sollins et al., 1996; Goh, 2004; Lutzow et al., 2006). In contrast, the labile C
pool, HC, showed the shortest range in the surface soil, suggesting that the controls on
the stabilization of soil organic matter across Florida are highly variable and, therefore,
the labile C pool is less predictable than the recalcitrant C pool.
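The nugget-to-sill comparison above can be illustrated with a short sketch. It assumes that the sill is the total sill (nugget plus partial sill) and that spatial dependence is classed with the commonly used 25% and 75% cut-offs. These cut-offs and the function names are assumptions made for illustration; they are not stated in this chapter.

```python
def nugget_to_sill(nugget, sill):
    """Return the nugget-to-sill ratio in percent (sill = nugget + partial sill)."""
    if sill <= 0:
        raise ValueError("sill must be positive")
    return 100.0 * nugget / sill


def dependence_class(ns_percent):
    """Classify spatial dependence from N:S (assumed 25% / 75% cut-offs)."""
    if ns_percent < 25:
        return "strong"
    if ns_percent <= 75:
        return "moderate"
    return "weak"


# N:S values (%) reported above: (measured values, residuals).
reported = {
    "TC": (37, 38),
    "RC": (36, 32),
    "HC": (44, 80),
}
for fraction, (raw, resid) in reported.items():
    print(fraction, dependence_class(raw), dependence_class(resid))
```

Under these assumed cut-offs, only the HC residuals (80%) fall into the weak class, while all other reported ratios are moderate.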
3.3.3 Important Variables
Boruta, the all-relevant variable discovery method, identified 53 environmental
factors out of 332 variables to be relevant to topsoil C fractions in the state of Florida
(Table 3-5). For TC, RC, and HC, 36, 30, and 25 soil-environmental factors,
respectively, stood out as relevant with varying degrees of explanatory power. The Z
score, which represents the importance of the all-relevant variables, ranged from 3.5 to
22.3. The explanatory power of the all-relevant predictors was grouped under 4 classes
based on the Z score: weakly relevant (Z < 5), slightly relevant (5 < Z < 10), moderately
relevant (10 < Z < 15), and strongly relevant (Z > 15) to soil C. Furthermore, the first 13
variables common to TC, RC, and HC had Z scores ranging from slightly relevant to
strongly relevant; the 8 variables common to TC and RC were weakly relevant to slightly
relevant; and the 4 variables common to TC and HC were slightly relevant to
moderately relevant. The remaining 11, 9, and 5 variables were identified only as
weakly relevant to TC, RC and HC, respectively. This implies that the major predictors
which have the highest explanatory power for TC, RC, and HC were similar for all three
investigated C pools.
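The four relevance classes described above amount to a simple binning of the Boruta Z score. The sketch below is illustrative only. The text does not say how scores exactly equal to 5, 10, or 15 are treated, so assigning them to the higher class is an assumption, as is the function name.

```python
def relevance_class(z):
    """Map a Boruta importance Z score to the four classes used in the text.

    Assumption: boundary values (5, 10, 15) are assigned to the higher class.
    """
    if z < 5:
        return "weakly relevant"
    if z < 10:
        return "slightly relevant"
    if z < 15:
        return "moderately relevant"
    return "strongly relevant"


# The reported Z scores ranged from 3.5 to 22.3.
for z in (3.5, 22.3):
    print(z, relevance_class(z))
```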
Boruta filtered out most of the irrelevant climatic and topographic variables, and
multi-collinearity among the all-relevant variables was drastically reduced. For instance,
out of 180 variables associated with the atmosphere only 3 variables were selected.
However, there were still obviously redundant variables in the all-relevant variable set.
These included certain land cover variables that were obtained from different times,
certain vegetation properties obtained from different sources, and variables aggregated
over different profile depths (e.g., AWC25, AWC50, and AWC100). Xiong et al. (2014a)
pointed out that developing the most parsimonious model with the minimal optimal
variables and comparing it to an exhaustive model that includes all-relevant variables
could both decrease the overall model complexity and increase the uncertainty.
Ultimately, the variables selected by Boruta were included in each C fraction model
without removing the redundant variables because we wished to maintain high-quality
model prediction performance.
Soil taxonomic variables (i.e., soil suborder, soil great group) were strongly
relevant in the explanation of the total variation of TC, RC, and HC in surface soil in
Florida. In addition, soil order and historic soil organic matter (derived from the soil map
of the Soil Survey Geographic Database, SSURGO) were both moderately relevant to
the stabilization and destabilization processes of soil C. Furthermore, the soil
environmental variables with respect to soil moisture status (e.g., soil drainage classes,
plant available soil water holding capacity [AWC] at different depths [25, 50, and 100],
soil hydric rating, and soil runoff potential) were identified as weakly relevant to strongly
relevant variables. These results confirm earlier findings by Vasques et al. (2012), who
found that, among all the ecological variables, soil AWC had well-structured spatial
dependence at both short and long ranges across Florida.
The field-obtained LULC classes and the reclassified LULC (LULCRecls), a similar
predictor representing land use/cover across Florida, were the most strongly
relevant to TC, RC, and HC. In addition, the national cropland data layer (Cropland) and
the national land cover dataset (LandCovCls) were identified as slightly relevant to TC.
Also, biotic variables, such as vegetation type (VegType), were moderately relevant to
TC, RC, and HC. Seasonally active vegetation (SmallNdviPkInt) was strongly relevant
to RC, while biophysical setting (BiophySet) was slightly relevant to TC and HC.
Moreover, NDVI and EVI were slightly relevant to variation in RC and HC.
Furthermore, several variables representing parent material (i.e., physiographic
province name and type) were strongly relevant to TC, RC, and HC. A few
others (i.e., environmental geology and surficial geology) were found to be weakly
relevant to TC only. These variables expressed the influence of parent material on the
soil C budget through modulating soil mineralogy. With vegetation and soil water being
somewhat dependent on surficial geology, surficial geology also has an indirect
influence through biotic/water complexes on the soil C pools (Eberhardt and Latham,
2000). Hence, the explanatory power of surficial geology was high as it may modify
vegetation and water processes.
As the topography across much of Florida is nearly level, soil slope was
the only topographic variable that stood out as slightly relevant to TC, RC, and HC,
whereas none of the other topographic variables were identified by Boruta. The
influence of slope as opposed to elevation suggests that micro-topography
across Florida controls soil C by modifying the soil water pattern (Mulkey et al.,
2008). Thus, soil slope was the only topographic predictor useful for inferring the
variation in TC, RC, and HC in the topsoil.
Interestingly, all variables reflecting atmospheric properties demonstrated only
weak to slight relevance for inferring TC, RC, and HC variation. Among the atmospheric
variables, average monthly precipitation (i.e., PrecipFeb, PrecipMay, PrecipDecem,
PrecipJune, PrecipOct) and temperature and solar radiation (i.e., MaxTempDec, MaxTempJan,
SolarRadMay, MaxTempApr) were identified as weakly relevant to surface soil C across
Florida. The present study agrees with others’ findings of relatively weak
associations between precipitation or temperature and soil C in a subtropical climate
(Vasques et al., 2010a; Xiong et al., 2014b). Xiong et al. (2014b) attributed the weak
relationship between precipitation and variation in SOC in the topsoil of Florida to both
the translocation of SOC from top layers to underlying horizons, which mainly controls
the formation of Spodosols, and the high decomposition rate resulting from high
precipitation and net primary production (NPP).
3.3.4 Assessment of the Prediction Capability of the Selected Methods
A summary of the parameters characterizing the efficiency and quality of the
fitted models for each soil C fraction is presented (Table 3-6). In addition, graphs that
showcase observed and predicted soil C stocks with evaluated methods are illustrated
(Figures 3-3, 3-4 and 3-5). These graphs highlight deviations from the 1:1 line (i.e.,
“true” model) for TC, RC, and HC. Overall, the observed vs. predicted TC, RC, and HC
in the validation dataset matched well for RF, BaRT, and SVM, with values aligned close to
the 1:1 line. The high C values were under-predicted, whereas low C values were over-
predicted. For HC, there was significant scatter around the 1:1 line and large prediction
errors for all models.
Total C and RC appeared to respond similarly to the evaluated techniques in
terms of R2, RMSD, RPD, and RPIQ. In contrast, HC behaved significantly differently
from TC and RC. Overall, the best of the eight models was able to account
for 71.6% of the total variation of TC and RC, but only 30.5% of the total variation of HC.
This high proportion of unexplainable variation present in the labile C fraction may be
due to the inherent characteristic of HC. Modelling the labile pool of SOC was relatively
difficult because its formation is affected by dynamic biochemical processes which are
largely controlled by complex interacting soil-environmental factors. Specifically, several
space-time factors related to HC, such as SoilRunoff, PrecipJune, NdviJune, and EviAgust,
are inherently dynamic and vary across different spatial and temporal
scales. Altogether, these spatial-temporal complexities led to an increase in the
uncertainty of HC. In a catchment-scale study, Karunaratne et al. (2014) indicated that the
labile pool of soil C, represented by particulate organic carbon (POC), had the noisiest data, was
spatially correlated over the shortest range, and was the hardest to fit with a model when
compared with resistant organic carbon (ROC). Therefore, the short-range variation
inherent to POC was not captured in their study. Despite the lower performance of HC
models, some of the variation was explained by a mixture of pedogenic, lithologic,
biotic, climatic, and water-specific factors.
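For readers unfamiliar with the validation statistics named above, the following sketch shows one common way to compute RMSD, RPD, and RPIQ. It assumes that RPD is the sample standard deviation of the observations divided by RMSD and that RPIQ is the interquartile range of the observations divided by RMSD. The exact definitions and quantile method used in this study are not given in this section, and the data are hypothetical.

```python
import math
import statistics


def validation_metrics(observed, predicted):
    """Return RMSD, RPD, and RPIQ for paired observed and predicted values."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    rmsd = math.sqrt(sum(squared_errors) / len(squared_errors))
    sd = statistics.stdev(observed)  # sample standard deviation of observations
    q1, _, q3 = statistics.quantiles(observed, n=4)  # quartiles of observations
    # Note: a perfect prediction (rmsd == 0) would make both ratios undefined.
    return {"rmsd": rmsd, "rpd": sd / rmsd, "rpiq": (q3 - q1) / rmsd}


# Hypothetical carbon stocks in kg m-2, for illustration only.
obs = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.0, 2.0, 3.0, 4.0, 7.0]
print(validation_metrics(obs, pred))
```

Because RPIQ scales the error by the interquartile range rather than the standard deviation, it is less sensitive to skewed distributions, which is relevant for soil C stocks with a few very high values.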
In terms of RPIQ, the hierarchy in prediction performance for TC was as follows:
RF > SVM > BoRT > BaRT > PLSR > RK > CART > OK (Table 3-6). For the RC pool
the performance of models in terms of RPIQ was RF > PLSR > BoRT ~ BaRT ~ RK >
SVM > CART > OK, and for the labile carbon pool the model ranking was RF > SVM ~
BoRT > BaRT > PLSR > RK > CART > OK. Overall, the findings implied that OK was
the worst, as it did not use any covariates and relied solely on soil C fraction measurements at
sites. As an individual machine learning method, CART showed significantly lower
prediction accuracy compared to ensemble machine learning methods (i.e., RF, BaRT,
BoRT). RK usually yielded better prediction accuracy than OK and CART. Because
RK relied on a simple regression model (derived from SMLR), it possibly did not
have the ability to predict as well as RF and the other ensemble regression methods.
In terms of prediction error, RMSDs from the validation results ranged from 2.39
kg m-2 to 3.80 kg m-2 for TC (i.e., RF: 2.39 kg m-2, SVM: 2.69 kg m-2, BoRT: 2.74 kg m-2,
BaRT: 2.78 kg m-2, PLSR: 2.82 kg m-2, RK: 2.99 kg m-2, and OK: 3.80 kg m-2). Compared
to the RMSDs of TC, there was an overall decrease in RMSDs of RC for each method,
which varied from 1.89 kg m-2 to 3.27 kg m-2 (i.e., RF: 1.89 kg m-2, PLSR: 2.08 kg m-2,
RK: 2.13 kg m-2, BaRT: 2.16 kg m-2, BoRT and SVM: 2.21 kg m-2, CART: 2.57 kg m-2,
and OK: 3.27 kg m-2). The lowest RMSDs were achieved with HC, compared to TC and
RC. Also, multiple methods achieved the same RMSDs, which ranged from 0.06 kg m-2
to 0.08 kg m-2 (i.e., RF ~ BoRT ~ BaRT ~ SVM: 0.063 kg m-2, PLSR ~ RK ~ OK: 0.07 kg
m-2, and CART: 0.08 kg m-2).
Using OK, which models the spatial autocorrelation of soil C fractions, as a
reference method, the relative improvement in model performance (in %) was assessed
for the seven methods (Figure 3-6). All evaluated methods significantly improved the
accuracy of prediction for TC, RC, and HC. In other words, inclusion of the all-relevant
variables substantially decreased the prediction error for TC, RC, and HC. The overall
improvement was highest in RC, followed by TC and HC. Using the most sophisticated
methods, we were able to decrease prediction error up to 37%, 42%, and 12% for TC,
RC, and HC, respectively (Figure 3-6). For RC, the relative improvement was the
highest in using RF with a 42% gain, followed by PLSR, BoRT, RK, SVM, and BaRT
with 36%, 34%, 34%, 32%, and 32% gains, respectively, with the least improvement
with CART (21% gain) relative to the reference OK model. The greater improvement of
RC over TC can be attributed to higher accuracy of OK of RC over OK of TC. For TC,
RF and SVM improved the prediction accuracy by 37% and 29%, respectively, followed
by BoRT, BaRT, and PLSR with 27%, 27%, and 25% gains. The lowest improvements
for TC were found with RK and CART with 21% and 20% gains.
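The relative improvement over OK can be sketched as the percent reduction in validation RMSD relative to the OK reference. This formula is an assumption: the text does not state how the gains in Figure 3-6 were computed. It is adopted here because, with the RMSDs quoted above, it gives values for RF that round to the reported gains.

```python
def relative_improvement(rmsd_reference, rmsd_model):
    """Percent reduction in RMSD relative to the reference (OK) model."""
    return 100.0 * (rmsd_reference - rmsd_model) / rmsd_reference


# Validation RMSDs (kg m-2) quoted in the text for the reference OK model and RF.
rmsd = {
    "TC": {"OK": 3.80, "RF": 2.39},
    "RC": {"OK": 3.27, "RF": 1.89},
}
for fraction, values in rmsd.items():
    gain = relative_improvement(values["OK"], values["RF"])
    print(f"{fraction}: RF gain over OK = {gain:.0f}%")
```

For RF this gives about 37% for TC and 42% for RC. Because the published RMSDs are rounded, gains recomputed this way for other methods may differ slightly from the reported percentages.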
Surprisingly, for HC, the relative improvement was nearly identical for all machine
learning methods with a gain of 12.5% for RF, BoRT, SVM, and BaRT. This suggests
that applying the all-relevant variable approach for the modeling of HC did not
substantially improve the prediction accuracy when compared to OK. This finding is in
line with a watershed-scale study by Vasques et al. (2010b), who found that for
three out of the five soil organic C fractions, namely HC, MC, and SC, RK did not
outperform block kriging, whereas it did outperform block kriging in the cases of TC and
RC. This implies that the environmental factors utilized were not fully representative of the
controls on soil HC, and that missing predictors contributed to its higher uncertainty
relative to TC and RC. Ahn et al. (2009) recommended that utilizing TC is more efficient than,
and preferable to, utilizing HC for the purpose of detecting the mineralization rate of C because
HC extraction is time-consuming and generally has high measurement uncertainty.
Overall, for all soil C fractions, the RF model yielded the most satisfactory results
in terms of model fit (R2), RMSD, RPD, and RPIQ among the competitors, including
other machine learning methods (CART, BaRT, BoRT), advanced statistical methods
(i.e., SVM, PLSR), a hybrid method (RK), and a geostatistical method (OK). The
performances of machine learning methods were stable for different carbon pools. The
simpler CART was outperformed by its more complex counterparts (i.e., RF, BaRT, and
BoRT) as expected. The superior performance of RF, as a modified version of CART
and BaRT, stemmed from random selection of variables during tree building and
assembly. On the other hand, the superior performance of BoRT over BaRT and CART
can be explained by its stochastic gradient boosting procedure, which may
minimize overfitting and increase the accuracy of prediction within the validation
dataset (Lawrence, 2004).
The SVM, in terms of predictive capability, closely followed RF and was
comparable to the model performance of the machine learning methods for TC, RC, and
HC. One peculiarity of SVM is that it is susceptible to overfitting (Hernández et al.,
2009) when compared to its competitors, such as RF, BaRT, and BoRT. On the other
hand, SVM successfully captured the non-linear relationship in the data when compared
to PLSR and CART. Even though ensemble data-mining methods and SVM take their
power from the ability to detect the non-linear, complex, hierarchical relationship
between predictors (i.e., the STEP-AWBH factors) and predictands (i.e., distinct SOC
pools in the present study), they may be susceptible to overfitting, which might limit their
competitiveness. For example, Grunwald et al. (2014) found that PLSR was superior to
SVM when transferring and scaling spectral-based soil TC prediction models, which
indicates that the predictors’ domains largely affect the behavior of the modeling
techniques.
PLSR has been commonly applied in chemometric modeling using hyperspectral
soil datasets which enables the user to characterize the linear association between the
generally high numbers of predictors with the phenomena of interest (i.e., the target
variable). PLSR can handle multicollinearity, while simple multivariate regression
cannot. Actually, the latter assumes that predictors are independent, which in DSM is
often not the case (i.e., STEP-AWBH variables are often correlated). However, there
are only a few studies that have utilized PLSR as an upscaling method for regional
scale soil mapping (Rodríguez-Lado and Martínez-Cortizas, 2015). Thus, it is important
to note the prediction behavior and its predictive power among other powerful soil
prediction techniques. In this study, PLSR was able to explain 64%, 65%, and 26% of
the total variation with an RMSD of 2.82 kg C m-2, 2.08 kg C m-2, and 0.07 kg C m-2 for
TC, RC, and HC, respectively. The PLSR models predicted soil TC,
RC, and HC better than RK, CART, and OK and performed very similarly to the other methods (BaRT,
SVM, and BoRT). These findings imply that PLSR is a promising method for mapping
soil properties and classes. With PLSR, satisfactory predictive performance may be
obtained even when relatively few predictors are used to train models on calibration
datasets.
As expected, OK yielded the poorest model among all methods for the three soil
C pools (R2 = 0.43, 0.29, and 0.11 for TC, RC, and HC, respectively). This was because it solely
accounted for the spatially correlated stochastic variation characterized by the spatial
autocorrelation of distinct soil C fractions. RK, on the other hand, significantly improved
the prediction accuracy compared to OK for TC and RC. RK performs better when
auxiliary variables can be used to explain a significant part of the variation (Hengl et
al., 2007a). In the validation mode, the SMLR for TC and RC was significantly better
than that for HC, with R2 = 0.46, 0.51, and 0.26, and RMSD = 0.22, 0.24, and 0.19 kg m-2,
respectively. Hence, the RKSMLR,OK for TC and RC outperformed the standalone OK.
However, there was no gain with RKSMLR,OK for HC because the R2 of the regression with
auxiliary variables (0.26) was not significant.
3.3.5 Residual Spatial Autocorrelation of Evaluated Methods
Residual spatial autocorrelation (RSA) of the evaluated models was investigated
with the omnidirectional variogram (Figures 3-7, 3-8 and 3-9). No meaningful RSA was
left behind in the evaluated models. This suggested that all attainable variation present
in the data was captured with the prediction models (BaRT, BoRT, CART, PLSR, RF,
and SVM), which were developed with the all-relevant variables identified by Boruta. This
can be interpreted to mean that the spatially autocorrelated all-relevant STEP-AWBH factors, such
as LULC, soil suborder, and AWC, enabled the models to capture all attainable variation in
the validation datasets. However, the residuals of CART were much more erratic for TC and
RC and somewhat more erratic for HC.
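The omnidirectional residual variogram used to check for RSA can be sketched as follows. This is a minimal standard-library illustration of the classical empirical semivariance estimator, not the software or settings used in this study. The lag binning scheme, function name, and example residuals are hypothetical.

```python
import math
from collections import defaultdict


def empirical_semivariogram(coords, residuals, lag_width, n_lags):
    """Omnidirectional empirical semivariogram of model residuals.

    Returns {lag_index: (mean_distance, semivariance, pair_count)}, where lag
    bin k covers distances in [k * lag_width, (k + 1) * lag_width).
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # distance sum, squared-diff sum, pairs
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            dist = math.hypot(coords[i][0] - coords[j][0], coords[i][1] - coords[j][1])
            lag = int(dist // lag_width)
            if lag >= n_lags:
                continue  # ignore pairs beyond the maximum lag distance
            entry = sums[lag]
            entry[0] += dist
            entry[1] += (residuals[i] - residuals[j]) ** 2
            entry[2] += 1
    return {
        lag: (d / n, sq / (2.0 * n), n) for lag, (d, sq, n) in sorted(sums.items())
    }


# Hypothetical residuals at three sites on a line (coordinates in km).
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
residuals = [1.0, 2.0, 4.0]
print(empirical_semivariogram(coords, residuals, lag_width=1.0, n_lags=3))
```

A semivariogram that is flat across lags (pure nugget) indicates no meaningful RSA, whereas semivariance that rises with distance before levelling off indicates spatial structure left in the residuals.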
In contrast, but in line with our hypothesis, Zhao and Shi (2010) accomplished
the best prediction performance, explaining up to 67% of the overall variation in SOC
stocks across a province in China, by combining artificial neural networks (ANN) to
capture deterministic trends and utilizing OK to capture stochastic variation (RKANN,OK)
compared to other models (MLR, Universal Kriging, RK). Similarly, Dai et al. (2014)
succeeded in improving SOM prediction performance in Tibet with RKANN,OK. In a
regional scale SOM characterization study, Guo et al. (2015) found a significant
improvement in overall model performance with RKRF,OK compared to SMLR. As data-
mining approaches have been increasingly employed in DSMM, it is critical to assess
the spatial autocorrelations in residuals (OK) that would potentially facilitate the
improvement of model performance (e.g., RK). Interestingly, in this study, no significant
spatial autocorrelation was found in the best performing RF model (TC, RC, and HC).
Because the more complex and best performing machine learning RF model could not
be used as a trend model in RK, we modeled using SMLR instead. This is probably why
RKSMLR,OK did not perform as well as RF (and other machine learning methods) to
predict TC, RC, and HC.
In soil C modeling studies, incomplete knowledge of the processes largely
affecting stabilization and destabilization of soil organic C, a lack of appropriate scale-
relevant environmental predictors, observation and prediction scale mismatch, poor
sampling design or insufficient sampling density, and improper model choice may
contribute to the strength of the RSA of any model. Therefore, a soil model which has
some autocorrelation left in its residuals can be improved. Though some of these
factors (e.g., sampling design, density) should be addressed before the model
development process, the adverse influence of other factors (e.g., proper model choice,
useful predictors) on prediction performance can be ameliorated in the model
development process.
Prediction performance is known to depend on the gathering of useful data and
not on sophisticated methods. Thus, missing predictors can leave some RSA, even if
the right choice of methods was employed. In a national scale SOC stocks prediction
study, Martin et al. (2014) showed that simple BoRT models developed with a limited
number of predictors coupled with geostatistical modelling of residuals can significantly
improve standalone BoRT predictions, whereas complex models developed with a
relatively higher number of predictors coupled with ordinary kriging of residuals did not
significantly improve standalone BoRT predictions. Unless all-relevant variables have
been included in a model, there is the chance that some additional explainable variation
can be captured with further analysis. In the present study, where the identified
parsimonious predictors sufficiently captured the stochastic spatially dependent
variation, there was no meaningful RSA left in the model residuals. When aiming to
improve predictions, one cannot be assured of the inclusion of all-relevant variables
without investigating RSA. Thus, the more successfully the all-relevant variable model
performs, the less likely it is that significant RSA can be identified and modeled.
The choice of method when characterizing the relationships between the soil-
environmental factors and target soil properties (e.g., TC) is particularly important with
respect to RSA. In the current study, the evaluated models (BaRT, BoRT, CART, PLSR,
RF, and SVM) did not have any substantial RSA because they were capable of
detecting a hierarchical, non-linear relationship between TC, RC, and HC and the all-
relevant STEP-AWBH factors. However, the SMLR of TC, RC, and HC did not capture
all the stochastic spatially dependent attainable variability in the data because
SMLR is unable to capture the non-linear, hierarchical, complex relationship
between a dependent variable (i.e., TC, RC, and HC) and independent variables (i.e.,
the STEP-AWBH factors). Consequently, the residuals of SMLR were moderately spatially
autocorrelated for TC and RC and weakly spatially autocorrelated for HC (Figure 3-2). That is why
RKSMLR,OK for TC and RC improved the prediction accuracy when compared to the
SMLR of TC and RC. For HC, however, RKSMLR,OK did not improve the prediction
accuracy due to its dynamic, noisy nature.
Because the success of modeling is determined by the proximity to the true
model of the soil-landscape at a scale of interest, the coupling of stochastic spatially
dependent and deterministic variation is necessary to characterize the spatial
distribution of soil properties and classes. Efforts to acquire exhaustive environmental
variable sets (i.e., STEP-AWBH variables) and then select those most relevant for
inferring a soil property of interest are less user biased. In addition, the Boruta all-
relevant approach for six machine learning methods ensured that no significant RSA
was present (Figures 3-7, 3-8 and 3-9). Although machine learning models are
frequently superior to other modeling techniques, there is no guarantee that they
account for all of the stochastic spatially dependent variation. As an identifiable RSA
can lead to an increase in model performance, the RSA of any model
should be investigated in DSMM.
3.3.6 Regional Scale Controls on Stabilization of Soil Carbon
The all-relevant variable approach enabled us to explain 71.6% of the overall
variation in TC and RC and 30.5% of the variation in HC at the regional scale. The result confirms
that biotic (e.g., vegetation and land use) and abiotic (soil-water gradient) environmental
determinants mainly control the storage of topsoil TC, RC, and HC in this subtropical
region. The TC and RC stocks were formed by various C forming and degrading
processes, such as aggregation, decomposition, humification, translocation, and
transformation, which have acted over prolonged periods of time. In contrast, the HC
stocks are more dynamically controlled by ecosystem processes resulting in temporal
trends that were more pronounced than spatial ones across the state.
The complex, hierarchical and multiscale interaction of distinct soil-environmental
determinants on formation and decomposition of soil C makes it difficult to assess
unambiguously how these predominant environmental factors regulate the fate of soil C.
Though many studies have aimed to map soil C stocks or concentrations (Simbahan et
al., 2006; Mishra et al., 2012; Karunaratne et al., 2014a; Poggio and Gimona, 2014), the
links between ancillary environmental variables and storage of C are still not clear
(Doetterl et al., 2013) and vary among geographic regions.
The amount of soil C in a particular soil body is primarily determined by the
tension between the influx through NPP (its quantity and mainly quality) and outflux (its
decomposition and leaching rate) (Janzen, 2004). In their landmark paper, Sollins et al.
(1996) described three main mechanisms responsible for the protection of SOM from
decomposition: 1) recalcitrance (i.e., molecular structure of SOM), 2) low accessibility
for biological decomposition, and 3) interaction with mineral particles. The following
section discusses how important variables and their interactions
can possibly modulate the above- and belowground C balance through important
destabilization processes operating at the regional scale.
Biotic variables, specifically LULC, stand out as one of the most important
predictors explaining the spatial distribution of topsoil C in Florida. Vegetation directly
controls the quantity and quality of organic matter residues via litter cover, species
diversity, distribution, and canopy cover (Davidson and Janssens, 2006). Additionally,
vegetation can alter the chemical properties of soils and microbial community
composition (Lange et al., 2015). Variations in the relative abundance of labile and
recalcitrant compounds, which depend on vegetation type, control the decomposability
of fresh organic matter residues (Melillo et al., 1989; Fissore et al., 2008). In a meta-
analysis study, Guo and Gifford (2002) reported the paramount influence of LULC
change on soil C stocks.
In the present study, the Kruskal-Wallis test indicated that land uses differed
significantly in concentration of TC, RC, and HC present in Florida’s topsoil (20 cm) at
the significance level of 0.0001. Post-hoc multiple comparison results are shown to
identify the pattern of those differences (Figures 3-10, 3-11, and 3-12). For visual assessment
purposes, the spatial distribution of land cover/land use classes is given (Figure 3-13),
adapted from Florida Fish and Wildlife Commission (2003). Overall, the amount
of C stored in different LULCs shows similar patterns for TC and RC. In particular, the
amount of TC and RC stored by LULC classes followed the general trend: sugarcane
and wetland contain the highest amount, followed by improved pasture, urban, mesic
upland forest, rangeland, and pineland while crop, citrus, and xeric upland forest
contained the lowest amount. In general, poorly drained, water-saturated
wetland soils had much higher TC than well-drained upland soils. Although
small differences were observed among the wetland soils due to water saturation
periods, they were not significant, as in the case of mixed forest (7.2 kg m-2) and
cypress swamp (9.6 kg m-2). No significant difference was observed between pineland
and mesic upland forest. While citrus, crop, and improved pasture have similar TC storage,
urban has significantly higher TC than upland soil. Vasenev et al. (2014) found that the
urban SOC contents were comparable to or higher than those of natural and agricultural
areas in the Moscow region. Xiong et al. (2014b) quantified the changes in SOC stocks
depending on changes in LULC across the state. Also, they found that at the sites that
had undergone LULC changes, the conversion of wetland to other LULCs resulted in
dramatic SOC losses, whereas conversion from other LULCs to wetland promoted SOC
accretion. In addition, Xiong et al. (2014b) found moderately higher SOC stocks in
urban soils and that the conversion of barren land, crop, and pineland to urban soils
leads to C build-up. This confirms that better characterization and understanding of
urban C stocks may have a significant impact on our global C cycling understanding.
Soil taxonomic groupings are known to be determined by the configuration of
environmental factors; hence, C forming/degrading processes and C budget can
significantly differ with soil types. Previously, Histosols and Spodosols with mean SOC
contents of 97.6 and 9.9 kg m-2, respectively, standardized to a 1 m soil profile were
estimated to have the highest C sequestration potential in a study based on STATSGO
data (Guo et al., 2006). In a Florida watershed, Vasques et al. (2010a)
found that Histosols and Inceptisols had substantially higher TC in the 1 m soil
profile. In the same study, Alfisols had higher TC in the 1 m soil profile than Ultisols and
Entisols because of the higher base saturation of Alfisols, which promotes natural
fertility. Entisols were dominated by quartz-rich sandy soils and were depleted in
organic matter and reactive minerals.
In the present study, the Kruskal-Wallis test indicated that suborders differed
significantly in concentration of TC, RC, and HC present in Florida’s topsoil (20 cm) at
the significance level of 0.0001. Post-hoc multiple comparison results are shown to
identify the pattern of those differences (Figures 3-14, 3-15, and 3-16). For visual assessment
purposes, the spatial distribution of soil landcover/landuse classes is given (Figure 3-17),
adapted from Florida Fish and Wildlife Commission (2003). The greatest stocks
of TC were measured in Saprists and Aquolls with medians of 13.9 and 8.4 kg m−2,
respectively. In contrast, the smallest stocks of TC were measured in Psamments and
Udalfs with medians of 2.1 kg m−2. The greatest stocks of RC were measured in Saprists
and Aquolls with medians of 10.2 and 5.5 kg m−2, respectively. In contrast, the smallest
stocks of RC were measured in Psamments and Udalfs with medians of 1.1 kg m−2. The
greatest stocks of HC were measured in Aquolls, Saprists, Aquepts, and Arents with
medians of 0.24, 0.21, 0.19, and 0.19 kg m−2, respectively. In contrast, the smallest
stocks of HC were measured in Psamments and Udults with medians of 0.1 kg m−2.
Overall, soil suborders found on poorly drained portions of the landscape (e.g., Saprists,
Aquepts, and Aquents) exhibited higher soil C than those found on better drained areas
(e.g., Psamments, Udults, and Orthods) for each soil carbon fraction.
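The Kruskal-Wallis comparisons of C stocks among LULC classes and soil suborders rest on a rank-based H statistic. The sketch below computes the tie-corrected H statistic from first principles. It is an illustration with hypothetical group values, not the procedure or software used in this study, and it omits the p-value, which is normally obtained from a chi-square distribution with k - 1 degrees of freedom.

```python
from itertools import chain


def kruskal_wallis_h(groups):
    """Return the tie-corrected Kruskal-Wallis H statistic for a list of samples."""
    pooled = sorted(chain.from_iterable(groups))
    n_total = len(pooled)
    rank_of = {}     # value -> average rank (ties share the mean of their ranks)
    tie_term = 0.0   # sum of t**3 - t over groups of tied values
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        t = j - i
        rank_of[pooled[i]] = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        tie_term += t**3 - t
        i = j
    h = 12.0 / (n_total * (n_total + 1)) * sum(
        sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups
    ) - 3.0 * (n_total + 1)
    # Undefined (division by zero) if every observation has the same value.
    correction = 1.0 - tie_term / (n_total**3 - n_total)
    return h / correction


# Hypothetical TC stocks (kg m-2) for three classes, for illustration only.
classes = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
print(kruskal_wallis_h(classes))
```

In practice, a statistical library routine such as `scipy.stats.kruskal` returns both the H statistic and its p-value, and a post-hoc multiple comparison procedure is then needed to identify which classes differ.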
Chemical protection (i.e., adsorption of organic molecules onto clay surfaces)
and physical protection (i.e., incorporation of organic molecules into aggregates) retard
decomposition of SOM through the mechanisms associated with soil mineralogy
(Schimel et al., 1985; Hassink, 1997; Torn et al., 1997; J. Six, 2002). Previous works
have reported that the extent of protection offered by fine-textured soil is greater than
coarse-textured soil (Parton et al., 1987b; Schimel et al., 1994; Baldock and Skjemstad,
2000). Accordingly, given the sand-rich, acidic nature of Florida’s topsoil, the protection
offered by mineral surfaces is relatively low. For instance, in a study in southeastern
Florida (Santa Fe River Watershed), Ahn et al. (2009) showed that the low clay content
was associated with relatively low TC and HC concentrations. They demonstrated
through incubation experiments that the sandy nature of these surface soils imparted a
lack of protection against C mineralization. Interestingly, in this study, variables that
reflect soil textural composition were not identified as relevant to TC, RC, and HC,
possibly because the topsoil was dominated by sand texture. Others also did not
observe a strong relationship between soil C stocks and clay because of the limited
range of clay content (Fissore et al., 2008; Angers et al., 2011; Doetterl et al., 2015).
Lawrence et al. (2015) stressed that the type of clay (expandable/non-expandable), high
surface area, and presence of very reactive forms of Al- and Fe oxides (including
hydroxides and oxy-hydroxides) are better parameters to explain correlation of SOC
with minerals than clay content by itself.
Previous studies have indicated that soil moisture (Thomsen et al., 2003), soil
aeration (Holden and Fierer, 2005), and soil temperature control microbial activity
and hence the stability of SOC. Hydropedologic characteristics of the landscape across
Florida may influence the stabilization of soil C in several ways. First, an excessive
amount of soil-water associated with convergent soil-scapes (e.g., depressions,
wetlands, depositional valley bottoms) can retard microbial C mineralization because
water-filled soil pores limit the oxygen available for microbial activity; ultimately, this
can lead to stabilization of soil C across the soil profile (Ekschmitt et al., 2008; Rumpel
and Kögel-Knabner, 2010). Second, especially in subtropical climates, soil-water status
promotes NPP and can directly influence soil C storage by increasing the quantity of
C supplied as residue to the soil system. Therefore, pedological and hydrological
processes can inhibit the microbial-controlled decomposition of SOC and this may
stabilize the soil C specifically in subtropical regions such as Florida.
Though topography is muted across the study area, micro-topography affects the
soil C status by regulating hydrological processes. The high water table, the high
amount of rainfall, and the coarse texture of the surface soils may
collectively enhance the accretion of soil C. Consequently, in poorly drained
depressions where water is often ponded for periods of time (e.g., flatwoods), anaerobic
conditions decrease decomposition and enhance soil C accumulation. For example,
Vasques et al. (2010a) found large variations in SOC stocks between drainage types
from very poorly drained to well-drained types in a subset region in northeastern Florida.
This implies that micro-topography is of secondary importance, modifying the soil
matrix and indirectly facilitating the stabilization of soil C.
Climatic factors, specifically precipitation and temperature, have been commonly
documented as the most important environmental determinants of soil C storage, flux,
and processes (Amundson, 2001; Baldock and Skjemstad, 2000) because of their
pronounced influence on the rate of organic matter decomposition and the quantity and
quality of organic matter (Liu et al., 2011). However, in the present study, climatic
predictors were weakly relevant to TC, RC, and HC. This finding is in line with other
studies (Percival et al., 2000; Liu et al., 2011). Michaletz et al. (2014) argued that
climate and temperature have an indirect influence on the variation in terrestrial net
primary production by modifying plant age, stand biomass, and growing season length.
Moreover, microtopography; a fluctuating high water table; sand-dominated topsoils
that promote infiltration and percolation; the relatively acidic nature of the soils; and high
precipitation in excessively drained, nearly level landscapes promote the vertical
leaching of metals and organic material to subsurface horizons, forming spodic C-rich
layers. Spodosols tend to have a high proportion of recalcitrant C not only in the topsoil but
also in subsurface horizons (Stone et al., 1993). Xiong et al. (2014b) found a negative
relationship between the SOC sequestration rate in topsoil and the mean annual
precipitation, possibly because the coarse texture allows the organic material to
translocate to lower layers. Given the landscape conditions in Florida, Histosols and
Spodosols are most prominent throughout the state and provide ample opportunities to
sequester C. This implies that even though climatic properties are not the most
important variables at the regional scale, they greatly influence soil C storage at the
local scale by modifying the interplay between pedogenic and biotic factors.
3.3.7 Spatial Distribution of C fractions
The random forest method was used to map the spatial distribution patterns of the SOC
pools at a resolution of 30 m across the whole region. Only all-relevant continuous
predictors were included during model calibration because the calibration sets lacked
adequate representation of the categorical predictors. The cross-validation statistics of the
calibration and validation sets are presented in Table 3-7. As expected, relatively lower
prediction accuracy was obtained with the RF models relying solely on continuous
predictors. Even though categorical variables were among the most important predictors
explaining variability in TC, RC, and HC (Table 3-5), the RF models developed
with only continuous variables performed comparably to the RF models
developed with all-relevant variables. Similar results were also found by Xiong et al.
(2014a). They reported that the introduction of categorical variables into the RF models
leads to gaps in the produced maps because every predictor class must be
represented in the calibration dataset. In this study, the findings also suggest that the
continuous variables employed to produce the soil C maps may capture the major
processes relevant to TC, RC, and HC across the surface soil of the region. Hence,
they serve as good surrogates for their categorical counterparts.
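The map-gap problem described above can be illustrated with a minimal Python sketch. It is not part of the thesis workflow: the helper `find_gap_cells` and the example class lists are hypothetical and serve only to show that a grid cell whose categorical class never occurs among the calibration samples cannot receive a prediction from a model calibrated on that predictor.

```python
def find_gap_cells(calibration_classes, grid_classes):
    """Return the indices of grid cells whose class is absent from calibration."""
    seen = set(calibration_classes)
    return [i for i, cls in enumerate(grid_classes) if cls not in seen]


# Hypothetical soil suborders observed at calibration sites and on the
# prediction grid (illustrative values only, not data from this study).
calibration = ["Aquod", "Psamment", "Udult", "Aquod", "Saprist"]
grid = ["Aquod", "Orthod", "Psamment", "Fibrist", "Udult"]

gaps = find_gap_cells(calibration, grid)
print(gaps)  # indices of cells that would be left blank on the map
```

Restricting the model to continuous predictors avoids this issue because any value on the grid can be passed through the fitted trees, at the cost of the somewhat lower accuracy reported in Table 3-7.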
Soil total C was predicted with a mean of 5.39 and a standard deviation of 3.74,
the recalcitrant pool of soil C with a mean of 3.25 and a standard deviation of
2.66, and the labile pool of soil C with a mean of 0.17 and a standard deviation of 0.05.
The predicted spatial distribution patterns of TC, RC, and HC are mapped in
Figures 3-18, 3-19, and 3-20. In general, the locations of low and high values were consistent across all the
maps. For instance, a similar cluster of large soil C values can be observed in the bend
area along the Gulf coast which spans from the Everglades agricultural area to south of
Lake Okeechobee. Also, high C stocks can be observed in the wetlands interspersed in
the pine forests in northern Florida. These areas are generally characterized by
flatwoods, wetland forests, and cypress swamps with significant accretion of organic
matter in the O and A horizons. In addition, similar clusters can be found at the western
border of Florida, which is also dominated by flatwoods and swamps. On the other hand,
the north-central portions of the Panhandle area are dominated by upland soils with
rolling topography and relatively lower C stocks. The gaps in all the maps are due to the
lack of soil data from the SSURGO database.
3.4 Conclusions
The present study demonstrated that the Boruta all-relevant variable search
algorithm can be employed to select the best-performing parsimonious predictors
from a spectrum of environmental factors without user bias. The results reveal that
human-induced vegetative and hydro-pedological characteristics of the region
predominantly control the soil C stocks of the surface soil. In general, the lower C stocks
were associated with well-drained upland soils, and the higher C stocks were related to
water-rich wetland soils. This study also used the most important STEP-ABWH factors
to trace their role in the stabilization of soil C across Florida, with distinct signatures for
TC, RC, and HC.
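The core idea behind Boruta, comparing each real predictor against shuffled "shadow" copies, can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the procedure used in this study: Boruta ranks variables with random forest importance Z scores (as reported in Table 3-5) and applies a statistical test to the hit counts, whereas the sketch below substitutes the absolute Pearson correlation as a stand-in importance score and simply counts hits. The function names and the synthetic data are hypothetical.

```python
import math
import random


def abs_corr(x, y):
    """Absolute Pearson correlation, used as a stand-in importance score."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return abs(sxy) / math.sqrt(sxx * syy)


def shadow_hits(features, target, n_iter=20, seed=0):
    """Count how often each feature beats the best shadow (shuffled) feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_iter):
        # Shadow features keep each predictor's distribution but break
        # any relationship with the target.
        shadow_scores = []
        for values in features.values():
            shadow = values[:]
            rng.shuffle(shadow)
            shadow_scores.append(abs_corr(shadow, target))
        threshold = max(shadow_scores)
        for name, values in features.items():
            if abs_corr(values, target) > threshold:
                hits[name] += 1
    return hits


# Synthetic example: one predictor drives the target, the other does not.
n = 40
x_relevant = [float(i) for i in range(n)]
x_noise = [1.0 if i % 2 == 0 else -1.0 for i in range(n)]
target = [2.0 * x + 0.5 * ((i * 7) % 5) for i, x in enumerate(x_relevant)]

hits = shadow_hits({"relevant": x_relevant, "noise": x_noise}, target)
print(hits)
```

A predictor that beats the best shadow in every iteration would be retained as relevant, while one that rarely does would be rejected; the real algorithm formalizes this decision with a statistical test on the hit counts.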
Comparisons of common geostatistical, machine learning, and hybrid methods in
the pedometricians’ toolbox indicated that RF, as an ensemble machine learning method,
outperformed all the competitors in terms of R2, RPD, RPIQ, and RMSD. RF models
also accounted for up to three-fourths of the total variability in TC and RC, but only one-fourth
of that in HC, probably due to its unstable, dynamic nature and/or its low concentrations
in relation to analytical error. Also, the best-performing RF models yielded up to a 40%
decrease in the RMSD of TC and RC compared to the RMSD of OK, the reference
soil-landscape model.
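For readers who want to reproduce these comparisons, the following Python sketch shows one way the four validation metrics and the relative RMSD reduction could be computed. Several points are assumptions rather than statements about the thesis: R2 is computed here as 1 minus the ratio of residual to total sum of squares, RPD as the standard deviation of the observations divided by RMSD, and RPIQ as the interquartile range of the observations divided by RMSD. The RPD definition is at least consistent with the reported values (validation standard deviation of TC, 4.49 in Table 3-3, divided by the RF RMSD, 2.39 in Table 3-6, gives 1.88). The toy observations are made up; only the RMSD values passed to `rmsd_reduction_pct` come from Table 3-6.

```python
import math
import statistics


def rmsd(obs, pred):
    """Root mean squared deviation between observations and predictions."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def r2(obs, pred):
    """Coefficient of determination (assumed form: 1 - SS_res / SS_tot)."""
    mean_obs = statistics.fmean(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def rpd(obs, pred):
    """Standard deviation of the observations divided by RMSD."""
    return statistics.stdev(obs) / rmsd(obs, pred)


def rpiq(obs, pred):
    """Interquartile range of the observations divided by RMSD."""
    q1, _, q3 = statistics.quantiles(obs, n=4)
    return (q3 - q1) / rmsd(obs, pred)


def rmsd_reduction_pct(rmsd_model, rmsd_reference):
    """Percentage decrease in RMSD relative to a reference model."""
    return 100.0 * (rmsd_reference - rmsd_model) / rmsd_reference


# Toy validation data (illustrative only).
obs = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.5, 1.5, 3.0, 4.5, 4.5]
print(r2(obs, pred), rmsd(obs, pred), rpd(obs, pred), rpiq(obs, pred))

# RF versus OK using the validation RMSD values reported in Table 3-6.
tc_reduction = rmsd_reduction_pct(2.39, 3.81)
rc_reduction = rmsd_reduction_pct(1.89, 3.27)
print(round(tc_reduction, 1), round(rc_reduction, 1))
```

Using OK as the reference in `rmsd_reduction_pct` mirrors the comparison made in the text, where OK serves as the baseline soil-landscape model.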
Investigation of the RSA of the evaluated models revealed that the inclusion of
the all-relevant STEP-ABWH factors with proper methodologies could guarantee little to
no RSA. Because one cannot be sure that all of the relevant variables have been
included in the model development process, further characterization of RSA with
appropriate statistical metrics could become routine in future DSMM studies. More
sophisticated predictors in the representation of vegetation, soil-water, and soil
geochemistry may lead to more accurate empirical geo-spatial soil landscape models.
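As one example of a statistical metric for spatial autocorrelation in model residuals, the sketch below computes global Moran's I with inverse-distance weights. This is a swapped-in illustration, not the method used in this study, and it assumes that all sampling locations are distinct (a zero distance would cause a division by zero). The function name `morans_i`, the coordinates, and the residual values are hypothetical.

```python
import math


def morans_i(coords, values):
    """Global Moran's I with inverse-distance weights (distinct locations assumed)."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    weighted_cross = 0.0
    weight_sum = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            w = 1.0 / math.dist(coords[i], coords[j])
            weight_sum += w
            weighted_cross += w * dev[i] * dev[j]
    sum_sq = sum(e * e for e in dev)
    return (n / weight_sum) * (weighted_cross / sum_sq)


# Six hypothetical sites along a transect.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (5.0, 0.0)]

# Residuals that cluster in space versus residuals that alternate in sign.
clustered = morans_i(coords, [1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
alternating = morans_i(coords, [1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
print(clustered, alternating)
```

Positive values indicate that similar residuals occur close together, which would suggest that spatially structured variation remains unexplained by the model; values near or below zero indicate little or no such clustering.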
Table 3-1. Assembled environmental variables representing STEP-ABWH factors
Variable (a) | Relevant variable | N (a) | Factor | Data type (a) | Source (a) | Original scale (m) | Date
Soil taxonomic order | SoilOrder | 1 | S | Cat. | SSURGO | 1:24,000 | 2009
Soil taxonomic suborder | SoilSuborder | 1 | S | Cat. | SSURGO | 1:24,000 | 2009
Soil taxonomic subgroup | | 1 | S | Cat. | SSURGO | 1:24,000 | 2009
Soil taxonomic great group | SoilGreatGrp | 1 | S | Cat. | SSURGO | 1:24,000 | 2009
Soil particle size class | | 1 | S | Cat. | SSURGO | 1:24,000 | 2009
Soil family CEC activity class | | 1 | S | Cat. | SSURGO | 1:24,000 | 2009
Soil family reaction class | SoilReaction | 1 | S | Cat. | SSURGO | 1:24,000 | 2009
Soil family temperature class | | 1 | S | Cat. | SSURGO | 1:24,000 | 2009
Soil family moisture subclass | | 1 | S | Cat. | SSURGO | 1:24,000 | 2009
Soil muck | | 1 | S | Cat. | SSURGO | 1:24,000 | 2009
Soil hydration expansion | SoilHydration | 1 | S | Cat. | SSURGO | 1:24,000 | 2009
Soil leaching potential | | 1 | S | Cat. | SSURGO | 1:24,000 | 2009
Soil runoff potential | SoilRunoff | 1 | S/W | Cat. | SSURGO | 1:24,000 | 2009
Soil albedo | SoilAlbedo | 1 | S | Con. | SSURGO | 1:24,000 | 2009
Soil sand content (0-20 cm) | | 1 | S | Con. | SSURGO | 1:24,000 | 2009
Soil silt content (0-20 cm) | | 1 | S | Con. | SSURGO | 1:24,000 | 2009
Soil clay content (0-20 cm) | | 1 | S | Con. | SSURGO | 1:24,000 | 2009
Soil organic matter (0-20 cm) (historic) | SOM | 1 | S | Con. | SSURGO | 1:24,000 | 2009
Soil moisture (b) | SoilMoistFeb, … | 17 | S-W | Con. | SMOS | 15,000 | 2010-11
Elevation (30 m, 90 m, 1 km) (c) | | 3 | T | Con. | USGS | 30/90/1000 | 1999
Slope (30 m, 90 m, 1 km) (c) | | 3 | T | Con. | USGS | 30/90/1000 | 1999
Flow accumulation (30 m, 90 m, 1 km) (c) | | 3 | T | Con. | USGS | 30/90/1000 | 1999
CTI (30 m, 90 m, 1 km) (c) | | 3 | T | Con. | USGS | 30/90/1000 | 1999
Soil slope | SoilSlope | 1 | T | Con. | SSURGO | 1:24,000 | 2009
Distance from coast | | 1 | T | Con. | FMRI | 30 | 1999
Distance from sinkhole | | 1 | T | Con. | FGS | 30 | 1999
Distance from stream | | 1 | T/W | Con. | USGS | 30 | 1999
Distance from open water | | 1 | T/W | Con. | USGS | 30 | 1999
Easting, northing (d) | | 2 | T | Con. | Field sampling | N/A | 2009
Ecological regions | EcoRegion | 1 | E | Cat. | USGS | 1:250,000 | 2009
Physiographic province name and type | PhysiogName, … | 2 | E/P | Cat. | USGS | 1:500,000 | 1998
Environmental geology | EnvGeology | 1 | P | Cat. | USGS | 1:500,000 | 1998
Surficial geology | SurGeology | 1 | P | Cat. | USGS | 1:500,000 | 1998
Surficial geology epoch and period | | 2 | P | Cat. | USGS | 1:500,000 | 1998
Gamma-ray absorbed dose rate | | 1 | P | Con. | USGS | 4000 | 1999-2005
Gamma Ray concentrations of potassium, thorium, uranium | | 3 | P | Con. | USGS | 2000 | 1975-1983
Gamma Ray Bouguer gravity anomaly | | 2 | P | Con. | USGS | 4000 | 1998-1999
Gamma Ray magnetic anomaly | | 3 | P | Con. | USGS | 2000 | 1945-2001
Precipitation (b) | PrecipFeb, … | 26 | A | Con. | PRISM | 800 | 1971-2000
Temperature (b) | MaxTempJan, … | 65 | A | Con. | PRISM | 800 | 1971-2000, 1981-2010
Solar radiation (b) | SolarRadMay | 13 | A | Con. | NARR | 32,000 | 1979-2009
Total ET (b) | | 13 | W | Con. | Uni. of Montana | 1000 | 2000
Total Potential ET (b) | | 13 | W | Con. | Uni. of Montana | 1000 | 2000
Annual latent heat flux (b) | LatHeat2009 | 13 | W | Con. | Uni. of Montana | 1000 | 2000
Long-term average annual ET (b) | | 2 | W | Con. | USGS | 800 | 1971-2000
Soil annual minimum water table (b) | | 2 | W | Con. | SSURGO | 1:24,000 | 2009
Soil available water capacity (0-25 cm, 0-50 cm, 0-100 cm and 0-150 cm) | AWC25, AWC50, AWC100 | 4 | W | Con. | SSURGO | 1:24,000 | 2009
Flooding frequency class | | 1 | W | Cat. | SSURGO | 1:24,000 | 2009
Ponding frequency class | PondFreq | 1 | W | Cat. | SSURGO | 1:24,000 | 2009
Drainage class | DrainCls | 1 | W | Cat. | SSURGO | 1:24,000 | 2009
Hydrologic group | | 1 | W | Cat. | SSURGO | 1:24,000 | 2009
Runoff class | | 1 | W | Cat. | SSURGO | 1:24,000 | 2009
Vegetation type | VegType | 1 | B | Cat. | LANDFIRE | 30 | 2002
Vegetation type system group 1 | VegTpSysGrp1 | 1 | B | Cat. | LANDFIRE | 30 | 2002
Vegetation type system group 2 | VegTpSysGrp2 | 1 | B | Cat. | LANDFIRE | 30 | 2002
Vegetation type order | | 1 | B | Cat. | LANDFIRE | 30 | 2002
Vegetation type class | | 1 | B | Cat. | LANDFIRE | 30 | 2002
Vegetation type subclass | | 1 | B | Cat. | LANDFIRE | 30 | 2002
Biophysical settings | BiophySet | 1 | B | Cat. | LANDFIRE | 30 | 2002
Environmental site potential | EnvSitePot | 1 | B | Cat. | LANDFIRE | 30 | 2002
Vegetation height | | 1 | B | Cat. | LANDFIRE | 30 | 2002
Vegetation cover | | 1 | B | Cat. | LANDFIRE | 30 | 2002
Forest canopy properties | | 4 | B | Con. | LANDFIRE | 30 | 2002
Landsat ETM+ bands | LsatB5, … | 6 | B | Con. | USGS | 30 | 2003
Landsat ETM+ tasseled cap indices | LsatTC1, … | 6 | B | Con. | USGS | 30 | 2003
Landsat ETM+ principal components | LsatPC1, … | 6 | B | Con. | USGS | 30 | 2003
Monthly MODIS NDVI | NdviMay, … | 12 | B | Con. | MODIS4NACP | 500 | 2005
Monthly MODIS EVI | EviAgust, … | 12 | B | Con. | MODIS4NACP | 500 | 2005
Monthly MODIS LAI | | 12 | B | Con. | MODIS4NACP | 500 | 2005
Monthly MODIS FPAR | | 12 | B | Con. | MODIS4NACP | 500 | 2005
Annual min, max and mean NDVI | | 3 | B | Con. | MODIS4NACP | 1000 | 2005
NDVI greenup, peak and browndown day of year | | 3 | B | Con. | MODIS4NACP | 1000 | 2005
NDVI greenup and browndown rate | | 2 | B | Con. | MODIS4NACP | 1000 | 2005
NDVI season length | | 1 | B | Con. | MODIS4NACP | 1000 | 2005
NDVI amplitude and base NDVI level | NDVIAmplitude | 2 | B | Con. | MODIS4NACP | 1000 | 2005
Max peak NDVI | | 1 | B | Con. | MODIS4NACP | 1000 | 2005
Large NDVI peak integral (e) | | 1 | B | Con. | MODIS4NACP | 1000 | 2005
Small NDVI peak integral (e) | SmallNdviPkInt | 1 | B | Con. | MODIS4NACP | 1000 | 2005
Canopy coverage and imperviousness | | 2 | B | Con. | NLCD | 30 | 2001
Aboveground live dry biomass | | 1 | B | Con. | NBCD | 30 | 2000
Gross and net primary production | | 2 | B | Con. | MODIS4NACP | 1000 | 2005
Land cover class | LandCovCls | 1 | B/H | Cat. | NLCD | 30 | 2001
Cropland data layer | Cropland | 1 | B/H | Cat. | NCDL | 30 | 2004
Land use and land cover | LULCSampled | 1 | B/H | Cat. | Field sampling | N/A | 2009
Land use and land cover | LULC | 1 | B/H | Cat. | FFWCC | 30 | 2003
Land use and land cover (f) | LULCRecls | 1 | B/H | Cat. | FFWCC | 30 | 2003
(a) Abbreviations: CEC, Cation Exchange Capacity; CTI, Compound Topographic Index; Landsat ETM+, Enhanced Thematic Mapper Plus; MODIS, Moderate-Resolution Imaging Spectroradiometer; NDVI, Normalized Difference Vegetation Index; EVI, Enhanced Vegetation Index; LAI, Leaf Area Index; FPAR, Fraction of Photosynthetically Active Radiation; SSURGO, Soil Survey Geographic Database; STATSGO2, State Soil Geographic Database; SMOS, Soil Moisture and Ocean Salinity; USGS, United States Geological Survey; FMRI, Florida Marine Research Institute; PRISM, Parameter-elevation Regressions on Independent Slopes Model; NARR, North American Regional Reanalysis; LANDFIRE, LANDscape FIRE and resource management tools project; MODIS4NACP, MODIS for North American Carbon Project; ET, Evapotranspiration; NLCD, National Land Cover Data; NBCD, National Biomass and Carbon Dataset; NCDL, National Cropland Data Layer; FFWCC, Florida Fish and Wildlife Conservation Commission; FGS, Florida Geological Survey; N, number of variables; Cat., Categorical; Con., Continuous.
(b) The 17 soil moisture variables are 12 monthly averages, 4 seasonal averages (spring, summer, autumn, winter), and one overall average over 2010-2011. The 26 precipitation variables are 12 monthly averages and one overall average over 1971-2000 and the same for 1981-2010. The 65 temperature variables are 24 monthly averages of daily max and min temperatures plus 2 long-term averages (1971-2000). Also, there are 39 monthly averages of daily max, mean and min temperatures plus 2 long-term averages (1981-2010). The 13 solar radiation variables are 12 monthly averages over 1979-2009 and one long-term average. 36 evapotranspiration (ET) variables consist of 13 annual total evapotranspiration, potential ET and latent heat flux from 2000 to 2012 and 13 annual total potential ET Long-term average annual ET one annual average over 1971-2000 and long-term average ratio over precipitation between 1971-2000. The 2 soil water depth variables are soil annual minimum water table depth and annual minimum water table depth from April to June.
(c) Topographic attributes are gathered from different data sources at multiple scales including 30, 90 and 1000 m.
(d) Easting and northing are the projected coordinates where soil samples were collected.
(e) Small peak integral, given by the area of the region between the fitted function and the average of green-up NDVI and brown-down NDVI values, represents the seasonally active vegetation, which may be large for herbaceous vegetation cover and small for evergreen vegetation cover. Large peak integral, given by the area between the fitted function and the zero NDVI value bounded by the green-up time and brown-down time, represents the total vegetation stand and is a proxy for vegetation production.
(f) Reclassified land use and land cover layer was created by combining relatively small and similar groups.
Table 3-2. R packages used to perform the evaluated methods
Methods | R packages | References
SMLR | stats | Base R team
CART | rpart | Therneau et al. (2015)
BaRT | ipred | Peters et al. (2015)
BoRT | gbm | Ridgeway (2004)
RF | randomForest | Liaw and Wiener (2002)
SVM | kernlab | Karatzoglou et al. (2004)
PLSR | pls | Wehrens et al. (2007)
OK | gstat | Pebesma (2004)
BaRT = bagged regression tree, BoRT = boosted regression tree, CART = classification and regression tree, OK = ordinary kriging, PLSR = partial least square regression, RF = random forest, RK = regression kriging, SMLR = stepwise multiple linear regression, SVM = support vector machine.

Table 3-3. Descriptive statistics of observed soil C fractions (TC: total carbon, RC: recalcitrant carbon, and HC: hot-water extractable carbon).
Fraction | Min. | Max. | Mean | Median | St.Dev. (1) | Range | Skew (2) | Kurtosis
kg m-2
Whole set (N = 1014)
TC | 0.45 | 34.15 | 4.74 | 3.32 | 4.35 | 0.45-34.15 | 2.94 | 11.21
RC | 0.22 | 25.08 | 2.81 | 1.69 | 3.28 | 0.22-25.08 | 3.40 | 14.09
HC | 0.02 | 0.71 | 0.16 | 0.14 | 0.09 | 0.02-0.71 | 1.49 | 3.71
Calibration (N = 710)
TC | 0.45 | 34.15 | 4.71 | 3.33 | 4.30 | 0.45-34.15 | 2.96 | 11.73
RC | 0.22 | 25.08 | 2.77 | 1.69 | 3.19 | 0.22-25.08 | 3.56 | 15.91
HC | 0.02 | 0.71 | 0.16 | 0.14 | 0.09 | 0.02-0.71 | 1.53 | 3.70
Validation (N = 304)
TC | 0.81 | 28.96 | 4.80 | 3.30 | 4.49 | 0.81-28.96 | 2.88 | 10.01
RC | 0.30 | 24.02 | 2.91 | 1.66 | 3.47 | 0.30-24.02 | 3.07 | 10.66
HC | 0.04 | 0.55 | 0.15 | 0.14 | 0.07 | 0.04-0.55 | 1.14 | 2.12
(1) St.Dev. = standard deviation. (2) Skew = skewness.

Table 3-4. Spearman’s correlation analysis of the paired soil C fractions.
 | TC | RC | HC
TC | 1.00 | 0.94 | 0.79
RC | | 1.00 | 0.73
HC | | | 1.00
HC = hot-water extractable carbon, RC = recalcitrant carbon, TC = total carbon
Table 3-5. Z scores indicating the relative importance of the all-relevant variables identified by Boruta for inferring total carbon (TC), recalcitrant carbon (RC), and hot-water extractable carbon (HC) in kg m-2. Note: The variables are described in Table 3-1.
Factors (a) | Relevant variables (b) | TC | RC | HC
S | SoilSuborder | 22.3 | 26.8 | 16.8
B/H | LULCSampled | 27.5 | 31.7 | 14.2
S | SoilGreatGrp | 11.2 | 11.0 | 14.2
P | PhysiogName | 9.7 | 7.3 | 11.9
W | DrainCls | 7.7 | 6.5 | 12.1
B/H | LULC | 11.6 | 8.8 | 6.3
P | SurGeology | 11.7 | 8.1 | 13.3
B/H | LULCRecls | 11.5 | 8.0 | 4.4
S | SOM | 9.3 | 10.4 | 4.6
B | VegType | 8.2 | 6.3 | 4.3
A | PrecipFeb | 4.8 | 6.8 | 4.3
T | SoilSlope | 5.9 | 4.5 | 4.1
W | AWC50 | 12.5 | 11.6
S | SoilReaction | 10.6 | 8.3
S | SoilOrder | 6.2 | 10.0
B | SmallNdviPkInt | 5.7 | 12.7
W | AWC25 | 7.5 | 8.1
W | AWC100 | 6.0 | 5.3
S | SoilAlbedo | 5.3 | 5.8
B | LsatTC1 | 4.8 | 6.7
W | SoilRunoff | 7.3 | 11.8
E | EcoRegion | 6.7 | 5.1
B | BiophySet | 5.6 | 5.4
S | SoilHydration | 6.4 | 4.5
B/H | LandCovCls | 7.3
B/H | Cropland | 6.7
B | VegTpSysGrp1 | 6.0
E | EnvSitePot | 5.8
A | MaxTempDec | 5.6
P | PhysiogType | 5.6
B | VegTpSysGrp2 | 5.5
A | MaxTempJan | 5.3
A | PrecipMay | 5.2
A | MaxTempApr | 5.0
P | EnvGeology | 4.9
B | NDVIAmplitude | 8.1
A | PrecipDecem | 7.0
A | PrecipDec | 5.6
B | LsatPC1 | 4.7
B | EviOct | 4.8
B | LsatB5 | 4.8
W | PondFreq | 4.5
A | SolarRadMay | 4.3
A | SoilMoistSep | 4.3
A | PrecipJune | 6.6
B | NdviJune | 5.1
B | EviJune | 5.1
A | PrecipOct | 4.5
B | NdviMay | 4.3
B | EviAgust | 4.2
B | LatHeat2009 | 4.2
S-W | SoilMoistFeb | 3.9
Abbreviations: S = soil, T = topography, E = ecology, P = parent material, A = atmosphere, B = biota, W = water, H = human. See Table 3-1 for the description of the relevant variables. HC = hot-water extractable carbon, RC = recalcitrant carbon, TC = total carbon.
Table 3-6. Performance of eight different modelling methods for predicting soil total carbon (TC), recalcitrant carbon (RC), and labile carbon (HC) on the validation dataset (n = 304) across the topsoil (0-20 cm) of Florida.
Method | R2: TC | RC | HC | RMSD (kg m-2): TC | RC | HC | RPD: TC | RC | HC | RPIQ: TC | RC | HC
RF | 0.72 | 0.72 | 0.31 | 2.39 | 1.89 | 0.06 | 1.88 | 1.84 | 1.19 | 1.35 | 0.90 | 1.54
BoRT | 0.63 | 0.63 | 0.30 | 2.75 | 2.22 | 0.06 | 1.64 | 1.64 | 1.18 | 1.18 | 0.80 | 1.53
BaRT | 0.62 | 0.62 | 0.28 | 2.78 | 2.16 | 0.06 | 1.62 | 1.61 | 1.17 | 1.16 | 0.78 | 1.51
SVM | 0.65 | 0.62 | 0.30 | 2.69 | 2.21 | 0.06 | 1.67 | 1.57 | 1.19 | 1.20 | 0.76 | 1.54
PLSR | 0.64 | 0.65 | 0.26 | 2.82 | 2.08 | 0.07 | 1.59 | 1.68 | 1.14 | 1.14 | 0.82 | 1.47
RK | 0.63 | 0.63 | 0.21 | 2.99 | 2.13 | 0.07 | 1.51 | 1.63 | 1.05 | 1.08 | 0.79 | 1.36
CART | 0.56 | 0.51 | 0.17 | 3.03 | 2.57 | 0.08 | 1.48 | 1.35 | 1.00 | 1.06 | 0.66 | 1.29
OK | 0.43 | 0.29 | 0.11 | 3.81 | 3.27 | 0.07 | 1.18 | 1.06 | 1.04 | 0.85 | 0.52 | 0.35
Abbreviations: R2 = coefficient of determination, RMSD = root mean squared deviations, RPD = residual prediction deviation, RPIQ = ratio of prediction error to inter-quartile range; TC = total carbon, RC = recalcitrant carbon, HC = hot-water extractable carbon; BaRT = bagged regression tree, BoRT = boosted regression tree, CART = classification and regression tree, OK = ordinary kriging, PLSR = partial least square regression, RF = random forest, RK = regression kriging, SVM = support vector machine.
Table 3-7. Cross-validation (on the 70% calibration dataset) and independent validation (on the 30% validation dataset) results of the random forest models used to produce the spatial distribution patterns of soil total carbon (TC), recalcitrant carbon (RC), and hot-water extractable carbon (HC) across Florida.
Fraction | Calibration (70%): R2 | RMSD (a) (kg m-2) | Validation (30%): R2 | RMSD (kg m-2)
TC (b) | 0.55 | 2.89 | 0.65 | 2.68
RC (c) | 0.50 | 2.25 | 0.63 | 3.15
HC (d) | 0.26 | 0.08 | 0.22 | 0.07
(a) Abbreviations: HC = hot-water extractable carbon, RC = recalcitrant carbon, TC = total carbon, RMSD = root mean squared deviation.
(b) The 10 continuous variables are AWC25, AWC50, LsatTC1, MaxTempDec, MaxTempJan, PrecipMay, SmallNdviPkInt, SOM, SoilAlbedo, and SoilSlope.
(c) The 13 continuous variables are AWC25, AWC50, EviOct, LsatB5, LsatPC1, LsatTC1, NDVIAmplitude, PrecipDecem, SmallNdviPkInt, SolarRadMay, SOM, SoilAlbedo, and SoilSlope.
(d) The 7 continuous variables are EviAgust, EviJune, NdviMay, NdviJune, PrecipOct, SOM, and SoilSlope.
Figure 3-1. A total of 1014 soil sampling locations (70% calibration samples in light blue
and 30% validation samples in red) and elevation in Florida.
Figure 3-2. The upper part of the figure depicts the omnidirectional variograms for total carbon (TC), recalcitrant carbon (RC), and hot-water extractable carbon (HC) in log kg m-2. The lower part of the figure illustrates the omnidirectional variograms for the residuals arising from stepwise multiple linear regression (SMLR) of TC, RC, and HC.
Figure 3-3. Predicted vs. observed soil total carbon (TC) of validation dataset derived from evaluated methods. BaRT =
bagged regression tree, BoRT = boosted regression tree, CART = classification and regression tree, OK = ordinary kriging, PLSR = partial least square regression, RF = random forest, RK = regression kriging, SVM = support vector machine. R2 = coefficient of determination; RMSD = root mean squared deviations; RPD = residual prediction deviation; RPIQ = ratio of prediction error to inter-quartile range.
Figure 3-4. Predicted vs. observed soil recalcitrant carbon (RC) of validation dataset derived from evaluated methods.
BaRT = bagged regression tree, BoRT = boosted regression tree, CART = classification and regression tree, OK = ordinary kriging, PLSR = partial least square regression, RF = random forest, RK = regression kriging, SVM = support vector machine. R2 = coefficient of determination; RMSD = root mean squared deviations; RPD = residual prediction deviation; RPIQ = ratio of prediction error to inter-quartile range.
Figure 3-5. Predicted vs. observed soil hot-water extractable carbon (HC) of validation dataset derived from evaluated
methods. BaRT = bagged regression tree, BoRT = boosted regression tree, CART = classification and regression tree, OK = ordinary kriging, PLSR = partial least square regression, RF = random forest, RK = regression kriging, SVM = support vector machine. R2 = coefficient of determination; RMSD = root mean squared deviations; RPD = residual prediction deviation; RPIQ = ratio of prediction error to inter-quartile range.
Figure 3-6. Relative increase (%) in root mean squared deviations (RMSD) of the evaluated prediction techniques compared to the RMSD of OK. BaRT = Bagged regression tree, BoRT = Boosted regression tree, CART = Classification and regression tree, OK = Ordinary kriging, PLSR = Partial least square regression, RF = Random forest, RK = Regression kriging, SVM = Support vector machine.
Figure 3-7. Strength of the spatial autocorrelation among evaluated model residuals for total carbon (TC). BaRT = Bagged
regression tree, BoRT = Boosted regression tree, CART = Classification and regression tree, OK = Ordinary kriging, PLSR = Partial least square regression, RF = Random forest, RK = Regression kriging, SVM = Support vector machine.
Figure 3-8. Strength of the spatial autocorrelation among evaluated model residuals for recalcitrant carbon (RC). BaRT =
Bagged regression tree, BoRT = Boosted regression tree, CART = Classification and regression tree, OK = Ordinary kriging, PLSR = Partial least square regression, RF = Random forest, RK = Regression kriging, SVM = Support vector machine.
Figure 3-9. Strength of the spatial autocorrelation among evaluated model residuals for hot-water extractable carbon
(HC). BaRT = Bagged regression tree, BoRT = Boosted regression tree, CART = Classification and regression tree, OK = Ordinary kriging, PLSR = Partial least square regression, RF = Random forest, RK = Regression kriging, SVM = Support vector machine.
Figure 3-10. Violin plot of soil total C (kg m−2) of the 1014 current samples (2008–2009) grouped by the field-observed land use/land cover (LULC). The Kruskal–Wallis test shows a significant effect of LULC on total C at the significance level of 0.0001, and post hoc multiple comparison results are denoted by the letter codes above the class names (classes that share no common letter have significantly different TC at α = 0.05).
Figure 3-11. Violin plot of soil recalcitrant C (kg m−2) of the 1014 current samples (2008–2009) grouped by the field-observed land use/land cover (LULC). The Kruskal–Wallis test shows a significant effect of LULC on recalcitrant C at the significance level of 0.0001, and post hoc multiple comparison results are denoted by the letter codes above the class names (classes that share no common letter have significantly different RC at α = 0.05).
Figure 3-12. Violin plot of soil hot-water extractable C (kg m−2) of the 1014 current samples (2008–2009) grouped by the field-observed land use/land cover (LULC). The Kruskal–Wallis test shows a significant effect of LULC on hot-water extractable C at the significance level of 0.0001, and post hoc multiple comparison results are denoted by the letter codes above the class names (classes that share no common letter have significantly different HC at α = 0.05).
Figure 3-13. Spatial distribution of landcover/landuse classes [Adapted from Florida
Fish and Wildlife Commission. 2003. Florida vegetation and land cover data derived from 2003 Landsat ETM+ imagery by B Styes et al. Office of Environmental Services, Florida Fish and Wildlife Conservation Commission, Tallahassee, Fl.]
Figure 3-14. Violin plot of soil total C (kg m−2) of the 1014 current samples (2008–2009) grouped by the field-observed suborders. The Kruskal–Wallis test shows a significant effect of suborders on total C at the significance level of 0.0001, and post hoc multiple comparison results are denoted by the letter codes above the class names (classes that share no common letter have significantly different TC at α = 0.05).
Figure 3-15. Violin plot of soil recalcitrant C (kg m−2) of the 1014 current samples (2008–2009) grouped by the field-observed suborders. The Kruskal–Wallis test shows a significant effect of suborders on recalcitrant C at the significance level of 0.0001, and post hoc multiple comparison results are denoted by the letter codes above the class names (classes that share no common letter have significantly different RC at α = 0.05).
Figure 3-16. Violin plot of soil hot-water extractable C (kg m−2) of the 1014 current samples (2008–2009) grouped by the field-observed suborders. The Kruskal–Wallis test shows a significant effect of suborders on hot-water extractable C at the significance level of 0.0001, and post hoc multiple comparison results are denoted by the letter codes above the class names (classes that share no common letter have significantly different HC at α = 0.05).
Figure 3-17. Spatial distribution of soil suborders [Adapted from Natural Resources Conservation Service (NRCS), 2007. Soil Survey Geographic Database (SSURGO). United States Department of Agriculture (USDA). Map scale 1:24,000. Accessible through http://datagateway.nrcs.usda.gov/GDGOrder.aspx].
Figure 3-18. Spatial distribution patterns of estimated soil total carbon stocks (kg m-2) across Florida, U.S. The map was generated with a random forest model developed only with the continuous all-relevant variables identified by the Boruta algorithm.
Figure 3-19. Spatial distribution patterns of estimated recalcitrant carbon stocks (kg m-2) across Florida, U.S. The map was generated with a random forest model developed only with the continuous all-relevant variables identified by the Boruta algorithm.
Figure 3-20. Spatial distribution patterns of estimated hot-water extractable carbon stocks (kg m-2) across Florida, U.S. The map was generated with a random forest model developed only with the continuous all-relevant variables identified by the Boruta algorithm.
CHAPTER 4 SUMMARY AND SYNTHESIS
Soil C storage in Florida to a standardized depth of 1 m has been estimated as
the highest among all conterminous U.S. states (Guo et al., 2006). In the Anthropocene,
however, increasing population, industrialization, urbanization, and human-induced
impacts on natural forces have substantially affected the global soil C budget. Therefore,
increasing the reliability of soil C estimation for Florida is particularly important to determine
the future status of Florida’s soil C in a changing world. Hence, the research presented in
this thesis focused on constructing accurate, realistic, and parsimonious geo-spatial soil
landscape models to explore both the deterministic and stochastic components that explain the
variability of distinct soil carbon fractions, namely recalcitrant, labile, and total
soil C.
In the first part of the thesis (Chapter 2), a comprehensive synthesis of RK, one of
the most widely used methods in DSM, was conducted to gain insights into the
stochastic and deterministic variation of the investigated soil properties. The evolution of
hybrid techniques in pedometrics was outlined from a historical perspective. Moreover, the
parameters that may influence the performance of RK predictions were reported by
reviewing 40 articles published in international soil science journals between 2004 and
2014. Findings from the 140 cases documented in these articles revealed that
sample density and the strength of the relationship between auxiliary variables and the
soil property predominantly influence the prediction performance of RK. In addition, we
propose that the following criteria be explicitly documented in future soil science DSM
papers to ensure consistency among studies: area of extent, sample design, sample
depth(s), sample size (training and validation separately), SCORPAN
factors, spatial resolution of the final map, transformation methods, method of factorial
analysis, regression type, coefficient of determination of the deterministic function,
variogram model type, spatial autocorrelation range, N:S ratio, validation method,
R2 and RPIQ. We also found that incorporating non-parametric machine learning
methods into the standard RK framework can improve the prediction accuracy of soil
properties. However, how parameters and methods specifically affect the spatial
dependence of residuals needs further investigation. Lastly, various RK types were
proposed for comparative assessment to gain further insight into the RK protocol:
RK_{RF,OK}, RK_{RF,IDS}, RK_{RF,BK}, RK_{SVM,OK}, RK_{SVM,IDS}, RK_{SVM,BK},
RK_{GWR,OK}, RK_{GWR,IDS}, RK_{GWR,BK}, RK_{PLSR,OK}, RK_{PLSR,IDS}, RK_{PLSR,BK},
RK_{PCR,OK}, RK_{PCR,IDS} and RK_{PCR,BK}.
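As a hedged illustration of the hybrid idea summarized above, the following minimal sketch splits a prediction into a deterministic trend and an interpolated residual. It is not the workflow used in the thesis: a single-covariate least-squares line stands in for the regression model, and inverse-distance weighting stands in for kriging of the residuals (no variogram is fitted). All function and argument names, and the choice of Python, are assumptions made for illustration.

```python
import math


def fit_linear_trend(covariate, target):
    """Least-squares fit of target = a + b * covariate (deterministic part)."""
    n = len(covariate)
    mean_x = sum(covariate) / n
    mean_y = sum(target) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(covariate, target))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def idw(coords, values, location, power=2.0):
    """Inverse-distance-weighted estimate of `values` at `location`.

    Assumption: used here only as a simple stand-in for the kriging step.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in zip(coords, values):
        distance = math.hypot(x - location[0], y - location[1])
        if distance == 0.0:
            return value  # the location coincides with a sample point
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


def hybrid_predict(coords, covariate, soil_c, location, location_covariate):
    """Prediction = regression trend + interpolated regression residual."""
    a, b = fit_linear_trend(covariate, soil_c)
    residuals = [y - (a + b * x) for x, y in zip(covariate, soil_c)]
    trend = a + b * location_covariate
    return trend + idw(coords, residuals, location)
```

Calling `hybrid_predict` at a sampled location returns the observed value, because the interpolated residual there equals the regression residual. Where the residuals carry no spatial structure, the second term adds little, which is the situation in which the regression model alone is sufficient.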
In the second part of the thesis (Chapter 3), we aimed to develop accurate,
realistic and parsimonious soil C models for the total, labile and recalcitrant soil C pools.
To strategically select important predictor variables, the machine learning data reduction
technique Boruta was employed to select the all-relevant environmental variables from
327 STEP-AWBH factors. This not only enabled us to reduce the multicollinearity
among exhaustive grids of environmental variables but also to develop unbiased
models. It identified 36, 30 and 25 all-relevant variables that optimize
prediction quality in terms of fit, accuracy and parsimony for TC, RC and HC, respectively.
Results revealed that human-induced biotic and hydro-pedological factors of a given
soilscape predominantly control the stabilization and destabilization processes of soil C
pools. To identify the most accurate model, eight different pedometric methods were
employed for comparative assessment: PLSR, CART, BaRT, BoRT, RF, SVM, OK and
RK. Findings revealed that RF, an ensemble machine learning method, outperformed all
of its competitors in terms of R2, RPD, RPIQ and RMSD and accounted for up to
three-fourths of the total variability in TC and RC, but only one-fourth of the variability
in HC because of the unstable, dynamic nature of this pool. The spatial dependence of
residuals derived from the different methods was investigated to develop the most
realistic model. No significant RSA remained in the residuals of the evaluated methods,
except in those derived from SMLR. In other words, incorporating data-mining methods
into the RK framework was not necessary because there was no stochastic variation left
in the model residuals. This can be attributed to two factors: first, the sophisticated
methods were capable of capturing all attainable variation offered by the environmental
variables; and second, the introduction of all-relevant auxiliary environmental variables
guaranteed the capture of all attainable information present as deterministic and
stochastic variation. Based on these findings, we propose that in cases where a biased,
smaller set of environmental predictors is used to model soil properties, the residuals
should be reported explicitly and routinely in future DSM studies.
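The two checks discussed above, validation statistics and the spatial dependence of residuals, can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the thesis: R2 is computed as one minus the ratio of residual to total sum of squares, RPD as the standard deviation of the observations divided by RMSD, and RPIQ as the interquartile range divided by RMSD (following Bellon-Maurel et al., 2010). The function names, the lag binning and the choice of Python are hypothetical.

```python
import math
import statistics


def validation_metrics(observed, predicted):
    """Return R2, RMSD, RPD and RPIQ for paired observations and predictions.

    Assumes at least two observations and a non-zero RMSD.
    """
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    rmsd = math.sqrt(ss_res / n)
    q1, _, q3 = statistics.quantiles(observed, n=4)  # quartiles of observations
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSD": rmsd,
        "RPD": statistics.stdev(observed) / rmsd,
        "RPIQ": (q3 - q1) / rmsd,
    }


def empirical_semivariance(coords, residuals, lag_width, n_lags):
    """Empirical semivariance of residuals per distance bin.

    Bin k collects point pairs separated by [k * lag_width, (k + 1) * lag_width).
    Bins without pairs are returned as None.
    """
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            distance = math.hypot(coords[i][0] - coords[j][0],
                                  coords[i][1] - coords[j][1])
            k = int(distance // lag_width)
            if k < n_lags:
                sums[k] += (residuals[i] - residuals[j]) ** 2
                counts[k] += 1
    return [s / (2 * c) if c else None for s, c in zip(sums, counts)]
```

A semivariance that stays roughly flat across lags suggests that the residuals carry no spatial structure, so kriging them would add little. A semivariance that rises with distance suggests remaining spatial dependence that an RK step could exploit.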
It may not be possible to identify a spatial prediction method that is best for every
case (Sun et al., 2012), but it may be possible to develop models that identify and
quantify the attainable variability with a given sampling configuration. Hence, in this
research we illustrated how to capture all attainable variation in three
different soil properties. For further improvement, introducing more sophisticated
environmental predictors representing vegetation, soil water and soil geochemistry is
the way forward to decrease the uncertainty associated with regional-scale C estimation.
The study also elaborated how the most sensitive environmental factors may influence
the soil C budget along pedo-climatic trajectories across Florida. The predicted maps
clearly displayed lower C density associated with well-drained uplands and higher C
density associated with water-rich wetlands.
APPENDIX LITERATURE REVIEW
Summary of DSMM papers (2004-2014) that utilized RK to map soil properties and classes
Description of properties Literature Review
Predicted soil properties and classes: soil organic carbon (SOC), total phosphorus (P total), organic P (P org), inorganic P (P inorg) and available P; soil available P characterized by different chemical extractions: ammonium acetate (P-AEE), water (P-H2O), CO2-saturated nanopure water (P-CO2), sodium bicarbonate (P-NaHCO3); soil organic carbon (SOC) stocks, resistant organic carbon (ROC), humus organic carbon (HOC) and particulate organic carbon (POC), soil pH (pH), soil organic matter (SOM), carbon to nitrogen ratio (C:N), alkali-hydrolysable N (AN), total C, N, K, Al, Ca, Mg and Zn, Cr, Cu, Ni, decalcified loess material (C1), arsenic (As), cadmium (Cd), chromium (Cr), copper (Cu), mercury (Hg), nickel (Ni), lead and zinc (Z), soil texture classes (Tex. Clas), nitrate–nitrogen concentrations (NO3-N conc), sparse mineral nitrogen (MinN), potentially available nitrogen (PAN), soil (regolith) depth (Reg. Dep), electrical conductivity (EC), recalcitrant C (RC), hydrolysable C (HC), hot-water-soluble C (SC) and mineralizable C (MC), available water capacity (AWC).
Sample Design: regular grid and its sample spacing (m) (G-m), random sampling (R), stratified random sampling (SR), purposive sampling (PS), conditioned Latin hypercube sampling (cLHS).
Total number of training set (T), total number of validation (V), cross-validation (Cval).
SCORP: Soil (S), Climate (C), Organism (O), Relief (R), Parent material (P); spatial resolution of digital elevation model (m) (DEM).
Logarithmic transformation (log), principal component analysis (PCA).
Regression Type: stepwise multiple linear regression (SMLR), generalized linear model (GLM), regression tree (RT), support vector regression (SVR), residual maximum likelihood (REML), geographically weighted regression (GWR), logistic regression (LR), classification and regression tree (CART), generalized additive model (GAM).
Variogram Model: exponential (Exp), spherical (Sph).
Validation: coefficient of determination (R2), mean error (ME), root mean square error (RMSE), RMSE normalized by the total variation (standard deviation) (RMSEr), mean standardized squared deviation ratio (MSDR), Lin's concordance correlation coefficient (CCC), standardized prediction error (θ), root mean square error normalized by ymax-ymin (NRMSE), residual prediction deviation (RPD), relative operating characteristic (ROC), model efficiency coefficient (MEF), mean rank of a method (MR), mean square error (MSE), mean relative error (MRE).
LIST OF REFERENCES
Ahmed, S., De Marsily, G., 1987. Comparison of geostatistical methods for estimating transmissivity using data on transmissivity and specific capacity. Water Resour. Res. 23, 1717–1737. doi:10.1029/WR023i009p01717
Ahn, M.-Y., Zimmerman, A.R., Comerford, N.B., Sickman, J.O., Grunwald, S., 2009. Carbon mineralization and labile organic carbon pools in the sandy soils of a North Florida watershed. Ecosystems 12, 672–685. doi:10.1007/s10021-009-9250-8
Amundson, R., 2001. The carbon budget in soils. Annu. Rev. Earth Planet. Sci. 29, 535–562. doi:10.1146/annurev.earth.29.1.535
Angers, D.A., Arrouays, D., Saby, N.P.A., Walter, C., 2011. Estimating and mapping the carbon saturation deficit of French agricultural topsoils: Carbon saturation of French soils. Soil Use Manag. 27, 448–452. doi:10.1111/j.1475-2743.2011.00366.x
Balaria, A., Johnson, C.E., Xu, Z., 2009. Molecular-scale characterization of hot-water-extractable organic matter in organic horizons of a forest soil. Soil Sci. Soc. Am. J. 73, 812. doi:10.2136/sssaj2008.0075
Baldock, J.A., Skjemstad, J.O., 2000. Role of the soil matrix and minerals in protecting natural organic materials against biological attack. Org. Geochem. 31, 697–710.
Baldock, J.A., Wheeler, I., McKenzie, N., McBratney, A., 2012. Soils and climate change: potential impacts on carbon stocks and greenhouse gas emissions, and future research for Australian agriculture. Crop Pasture Sci. 63, 269–283.
Basher, L.R., 1997. Is pedology dead and buried? Aust. J. Soil Res. 35, 979–994.
Baxter, S.J., Oliver, M.A., 2005. The spatial prediction of soil mineral N and potentially available N using elevation. Geoderma, Pedometrics 2003 128, 325–339. doi:10.1016/j.geoderma.2005.04.013
Belanche-Muñoz, L., Blanch, A.R., 2008. Machine learning methods for microbial source tracking. Environ. Model. Softw. 23, 741–750. doi:10.1016/j.envsoft.2007.09.013
Bellon-Maurel, V., Fernandez-Ahumada, E., Palagos, B., Roger, J.M., McBratney, A., 2010. Critical review of chemometric indicators commonly used for assessing the quality of the prediction of soil attributes by NIR spectroscopy. TrAC Trends Anal. Chem. 29, 1073–1081. doi:10.1016/j.trac.2010.05.006
Bishop, T.F.A., McBratney, A.B., 2001. A comparison of prediction methods for the creation of field-extent soil property maps. Geoderma, Estimating uncertainty in soil models 103, 149–160. doi:10.1016/S0016-7061(01)00074-X
Biswas, A., Cheng, B., 2013. Model Averaging for Semivariogram Model Parameters, in: Grundas, S. (Ed.), Advances in Agrophysical Research. InTech.
Blöschl, G., Sivapalan, M., 1995. Scale issues in hydrological modelling: A review. Hydrol. Process. 9, 251–290. doi:10.1002/hyp.3360090305
Bockheim, J.G., Gennadiyev, A.N., 2010. Soil-factorial models and earth-system science: A review. Geoderma 159, 243–251. doi:10.1016/j.geoderma.2010.09.005
Bouma, J., McBratney, A., 2013. Framing soils as an actor when dealing with wicked environmental problems. Geoderma 200–201, 130–139. doi:10.1016/j.geoderma.2013.02.011
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and regression trees. Chapman & Hall, New York.
Breiman, L., 1996. Bagging predictors. Mach. Learn. 24, 123–140.
Breiman, L., 2001. Random Forests. Mach. Learn. 45, 5–32. doi:10.1023/A:1010933404324
Brunsdon, C., Fotheringham, A.S., Charlton, M., 2008. Geographically weighted regression: a method for exploring spatial nonstationarity. Encycl. Geogr. Inf. Sci. 558.
Burgess, T.M., Webster, R., 1980. Optimal interpolation and isarithmic mapping of soil properties. 1. The semi-variogram and punctual kriging. J. Soil Sci. 31, 315–331.
Burrough, P.A., 1986. Principles of geographical information systems for land resources assessment, Monographs on soil and resources survey. Clarendon Press; Oxford University Press, Oxford, New York.
Burrough, P.A., Bouma, J., Yates, S.R., 1994. The state of the art in pedometrics. Geoderma 62, 311–326. doi:10.1016/0016-7061(94)90043-4
Cambardella, C.A., Moorman, T.B., Parkin, T.B., Karlen, D.L., Novak, J.M., Turco, R.F., Konopka, A.E., 1994. Field-scale variability of soil properties in Central Iowa Soils. Soil Sci. Soc. Am. J. 58, 1501. doi:10.2136/sssaj1994.03615995005800050033x
Carré, F., Girard, M.C., 2002. Quantitative mapping of soil types based on regression kriging of taxonomic distances with landform and land cover attributes. Geoderma 110, 241–263. doi:10.1016/S0016-7061(02)00233-1
Chai, X., Shen, C., Yuan, X., Huang, Y., 2008. Spatial prediction of soil organic matter in the presence of different external trends with REML-EBLUP. Geoderma 148, 159–166. doi:10.1016/j.geoderma.2008.09.018
Chaplot, V., Lorentz, S., Podwojewski, P., Jewitt, G., 2010. Digital mapping of A horizon thickness using the correlation between various soil properties and soil apparent electrical resistivity. Geoderma 157, 154–164. doi:10.1016/j.geoderma.2010.04.006
Cheng, L., Leavitt, S.W., Kimball, B.A., Pinter, P.J., Ottman, M.J., Matthias, A., Wall, G.W., Brooks, T., Williams, D.G., Thompson, T.L., 2007. Dynamics of labile and recalcitrant soil carbon pools in a sorghum free-air CO2 enrichment (FACE) agroecosystem. Soil Biol. Biochem. 39, 2250–2263. doi:10.1016/j.soilbio.2007.03.031
Conant, R.T., Ryan, M.G., Ågren, G.I., Birge, H.E., Davidson, E.A., Eliasson, P.E., Evans, S.E., Frey, S.D., Giardina, C.P., Hopkins, F.M., Hyvönen, R., Kirschbaum, M.U.F., Lavallee, J.M., Leifeld, J., Parton, W.J., Megan Steinweg, J., Wallenstein, M.D., Martin Wetterstedt, J.A., Bradford, M.A., 2011. Temperature and soil organic matter decomposition rates - synthesis of current knowledge and a way forward. Glob. Change Biol. 17, 3392–3404. doi:10.1111/j.1365-2486.2011.02496.x
Conant, R.T., Six, J., Paustian, K., 2003. Land use effects on soil carbon fractions in the southeastern United States. I. Management-intensive versus extensive grazing. Biol. Fertil. Soils 38, 386–392. doi:10.1007/s00374-003-0652-z
Cressie, N.A.C., 1993. Statistics for spatial data, Rev. ed. Wiley series in probability and mathematical statistics. Wiley, New York.
Cruz-Cárdenas, G., López-Mata, L., Ortiz-Solorio, C.A., Villaseñor, J.L., Ortiz, E., Silva, J.T., Estrada-Godoy, F., 2014. Interpolation of Mexican soil properties at a scale of 1:1,000,000. Geoderma 213, 29–35. doi:10.1016/j.geoderma.2013.07.014
Dai, F., Zhou, Q., Lv, Z., Wang, X., Liu, G., 2014. Spatial prediction of soil organic matter content integrating artificial neural network and ordinary kriging in Tibetan Plateau. Ecol. Indic. 45, 184–194. doi:10.1016/j.ecolind.2014.04.003
Davidson, E.A., Janssens, I.A., 2006. Temperature sensitivity of soil carbon decomposition and feedbacks to climate change. Nature 440, 165–173. doi:10.1038/nature04514
de Carvalho Junior, W., Lagacherie, P., da Silva Chagas, C., Calderano Filho, B., Bhering, S.B., 2014. A regional-scale assessment of digital mapping of soil attributes in a tropical hillslope environment. Geoderma 232–234, 479–486. doi:10.1016/j.geoderma.2014.06.007
Dlugoß, V., Fiener, P., Schneider, K., 2010. Layer-specific analysis and spatial prediction of soil organic carbon using terrain attributes and erosion modeling. Soil Sci. Soc. Am. J. 74, 922. doi:10.2136/sssaj2009.0325
Doetterl, S., Stevens, A., Six, J., Merckx, R., Van Oost, K., Casanova Pinto, M., Casanova-Katny, A., Muñoz, C., Boudin, M., Zagal Venegas, E., Boeckx, P., 2015. Soil carbon storage controlled by interactions between geochemistry and climate. Nat. Geosci. doi:10.1038/ngeo2516
Doetterl, S., Stevens, A., van Oost, K., Quine, T.A., van Wesemael, B., 2013. Spatially-explicit regional-scale prediction of soil organic carbon stocks in cropland using environmental variables and mixed model approaches. Geoderma 204-205, 31–42. doi:10.1016/j.geoderma.2013.04.007
Douaoui, A.E.K., Nicolas, H., Walter, C., 2006. Detecting salinity hazards within a semiarid context by means of combining soil and remote-sensing data. Geoderma 134, 217–230. doi:10.1016/j.geoderma.2005.10.009
Eberhardt, R.W., Latham, R.E., 2000. Relationships among vegetation, surficial geology and soil water content at the Pocono Mesic Till Barrens. J. Torrey Bot. Soc. 127, 115–124. doi:10.2307/3088689
Efron, B., Tibshirani, R., 1993. An introduction to the bootstrap, Monographs on statistics and applied probability. Chapman & Hall, New York.
Ekschmitt, K., Kandeler, E., Poll, C., Brune, A., Buscot, F., Friedrich, M., Gleixner, G., Hartmann, A., Kästner, M., Marhan, S., Miltner, A., Scheu, S., Wolters, V., 2008. Soil-carbon preservation through habitat constraints and biological limitations on decomposer activity. J. Plant Nutr. Soil Sci. 171, 27–35. doi:10.1002/jpln.200700051
Elliott, E.T., Paustian, K., Frey, S.D., 1996. Modeling the measurable or measuring the modelable: a hierarchical approach to isolating meaningful soil organic matter fractionations, in: Powlson, D.S., Smith, P., Smith, J.U. (Eds.), Evaluation of Soil Organic Matter Models, NATO ASI Series. Springer Berlin Heidelberg, pp. 161–179.
Fissore, C., Giardina, C.P., Kolka, R.K., Trettin, C.C., King, G.M., Jurgensen, M.F., Barton, C.D., Mcdowell, S.D., 2008. Temperature and vegetation effects on soil organic carbon quality along a forested mean annual temperature gradient in North America. Glob. Change Biol. 14, 193–205. doi:10.1111/j.1365-2486.2007.01478.x
Flatman, G.T., Yfantis, A.A., 1984. Geostatistical strategy for soil sampling: the survey and the census. Environ. Monit. Assess. 4, 335–349. doi:10.1007/BF00394172
Florida Fish and Wildlife Conservation Commission (FFWCC), 2003. Florida Vegetation and Land Cover Data Derived from Landsat ETM Imagery. Available at: http://myfwc.com/research/gis/data-maps/terrestrial/fl-vegetation-land-cover/.
Garthwaite, P.H., 1994. An interpretation of partial least squares. J. Am. Stat. Assoc. 89, 122–127. doi:10.2307/2291207
Gessler, P.E., Chadwick, O.A., Chamran, F., Althouse, L., Holmes, K., 2000. Modeling soil–landscape and ecosystem properties using terrain attributes. Soil Sci. Soc. Am. J. 64, 2046. doi:10.2136/sssaj2000.6462046x
Ghani, A., Dexter, M., Perrott, K., 2003. Hot-water extractable carbon in soils: a sensitive measurement for determining impacts of fertilisation, grazing and cultivation. Soil Biol. Biochem. 35, 1231–1243. doi:10.1016/S0038-0717(03)00186-X
Glinka, K.D., 1927. Dokuchaiev’s ideas in the development of pedology and cognate sciences. Academy of Science, Leningrad.
Goh, K.M., 2004. Carbon sequestration and stabilization in soils: Implications for soil productivity and climate change. Soil Sci. Plant Nutr. 50, 467–476. doi:10.1080/00380768.2004.10408502
Goovaerts, P., 1997. Geostatistics for natural resources evaluation, Applied geostatistics series. Oxford University Press, New York.
Goovaerts, P., 1999. Using elevation to aid the geostatistical mapping of rainfall erosivity. Catena 34, 227–242.
Goovaerts, P., 2001. Geostatistical modelling of uncertainty in soil science. Geoderma, Estimating uncertainty in soil models 103, 3–26. doi:10.1016/S0016-7061(01)00067-2
Goswami, M., O’Connor, K.M., 2007. Real-time flow forecasting in the absence of quantitative precipitation forecasts: A multi-model approach. J. Hydrol. 334, 125–140. doi:10.1016/j.jhydrol.2006.10.002
Grimm, R., Behrens, T., Märker, M., Elsenbeer, H., 2008. Soil organic carbon concentrations and stocks on Barro Colorado Island — Digital soil mapping using Random Forests analysis. Geoderma 146, 102–113. doi:10.1016/j.geoderma.2008.05.008
Grunwald, S., 2006. What do we really know about the space-time continuum of soil-landscapes? In: Grunwald, S. (Ed.), Environmental soil-landscape modeling: geographic information technologies and pedometrics. CRC Press, Boca Raton, FL, pp. 3–36.
Grunwald, S., 2009. Multi-criteria characterization of recent digital soil mapping and modeling approaches. Geoderma 152, 195–207. doi:10.1016/j.geoderma.2009.06.003
Grunwald, S., McBratney, A.B., Thompson, J.A., Minasny, B., Boettinger, J.L., 2016. Digital Soil Mapping in a changing world. Comput. Ethics Multicult. Approach 301.
Grunwald, S., Thompson, J.A., Boettinger, J.L., 2011. Digital Soil Mapping and Modeling at Continental Scales: Finding Solutions for Global Issues. Soil Sci. Soc. Am. J. 75, 1201. doi:10.2136/sssaj2011.0025
Grunwald, S., Yu, C., Xiong, X., 2014. Transferability and scaling of soil total carbon prediction models in Florida. PeerJ PrePrints.
Guo, L.B., Gifford, R.M., 2002. Soil carbon stocks and land use change: a meta analysis. Glob. Change Biol. 8, 345–360. doi:10.1046/j.1354-1013.2002.00486.x
Guo, P.T., Li, M.F., Luo, W., Tang, Q.F., Liu, Z.W., Lin, Z.M., 2015. Digital mapping of soil organic matter for rubber plantation at regional scale: An application of random forest plus residuals kriging approach. Geoderma 237–238, 49–59. doi:10.1016/j.geoderma.2014.08.009
Guo, Y., Amundson, R., Gong, P., Yu, Q., 2006. Quantity and Spatial Variability of Soil Carbon in the Conterminous United States. Soil Sci. Soc. Am. J. 70, 590. doi:10.2136/sssaj2005.0162
Haberlandt, U., 2007. Geostatistical interpolation of hourly precipitation from rain gauges and radar for a large-scale extreme rainfall event. J. Hydrol. 332, 144–157. doi:10.1016/j.jhydrol.2006.06.028
Hartemink, A.E., Hempel, J., Lagacherie, P., McBratney, A., McKenzie, N., MacMillan, R.A., Minasny, B., Montanarella, L., Santos, M.L. de M., Sanchez, P., Walsh, M., Zhang, G.-L., 2010. GlobalSoilMap.net – A New Digital Soil Map of the World, in: Boettinger, D.J.L., Howell, D.W., Moore, A.C., Hartemink, P.D.A.E., Kienast-Brown, S. (Eds.), Digital Soil Mapping, Progress in Soil Science. Springer Netherlands, pp. 423–428.
Hartemink, A.E., McBratney, A., 2008. A soil science renaissance. Geoderma 148, 123–129. doi:10.1016/j.geoderma.2008.10.006
Hassink, J., 1997. The capacity of soils to preserve organic C and N by their association with clay and silt particles. Plant Soil 191, 77–87. doi:10.1023/A:1004213929699
Hastie, T., Tibshirani, R., Friedman, J., 2009. The Elements of Statistical Learning, Springer Series in Statistics. Springer New York, New York, NY.
Haynes, R.J., 2005. Labile organic matter fractions as central components of the quality of agricultural soils: an overview, in: Advances in Agronomy. Academic Press, pp. 221–268.
Hengl, T., de Jesus, J.M., MacMillan, R.A., Batjes, N.H., Heuvelink, G.B.M., Ribeiro, E., Samuel-Rosa, A., Kempen, B., Leenaars, J.G.B., Walsh, M.G., Gonzalez, M.R., 2014. SoilGrids1km — Global soil information based on automated mapping. PLoS ONE 9, e105992. doi:10.1371/journal.pone.0105992
Hengl, T., Heuvelink, G.B.M., Rossiter, D.G., 2007a. About regression-kriging: From equations to case studies. Comput. Geosci., Spatial Analysis 33, 1301–1315. doi:10.1016/j.cageo.2007.05.001
Hengl, T., Heuvelink, G.B.M., Stein, A., 2004. A generic framework for spatial prediction of soil variables based on regression-kriging. Geoderma 120, 75–93. doi:10.1016/j.geoderma.2003.08.018
Hengl, T., Toomanian, N., Reuter, H.I., Malakouti, M.J., 2007b. Methods to interpolate soil categorical variables from profile observations: Lessons from Iran. Geoderma, Pedometrics 2005 140, 417–427. doi:10.1016/j.geoderma.2007.04.022
Herbst, M., Diekkrüger, B., Vereecken, H., 2006. Geostatistical co-regionalization of soil hydraulic properties in a micro-scale catchment using terrain attributes. Geoderma 132, 206–221. doi:10.1016/j.geoderma.2005.05.008
Hernández, N., Kiralj, R., Ferreira, M.M.C., Talavera, I., 2009. Critical comparative analysis, validation and interpretation of SVM and PLS regression models in a QSAR study on HIV-1 protease inhibitors. Chemom. Intell. Lab. Syst. 98, 65–77. doi:10.1016/j.chemolab.2009.04.012
Heuvelink, G.B.M., Webster, R., 2001. Modelling soil variation: past, present, and future. Geoderma, Developments and Trends in Soil Science 100, 269–301. doi:10.1016/S0016-7061(01)00025-8
Hoeting, J.A., Madigan, D., Raftery, A.E., Volinsky, C.T., 1999. Bayesian model averaging: a tutorial (with comments by M. Clyde, David Draper and E.I. George, and a rejoinder by the authors). Stat. Sci. 14, 382–417. doi:10.1214/ss/1009212519
Holden, P.A., Fierer, N., 2005. Microbial processes in the vadose zone. Vadose Zone J. 4, 1–21.
Hornik, K., Meyer, D., Karatzoglou, A., 2006. Support vector machines in R. J. Stat. Softw. 15, 1–28.
Hu, K., Wang, S., Li, H., Huang, F., Li, B., 2014. Spatial scaling effects on variability of soil organic matter and total nitrogen in suburban Beijing. Geoderma 226–227, 54–63. doi:10.1016/j.geoderma.2014.03.001
Hudson, B.D., 1992. The soil survey as paradigm-based science. Soil Sci. Soc. Am. J. 56, 836–841.
Six, J., Conant, R.T., Paul, E.A., Paustian, K., 2002. Stabilization mechanisms of soil organic matter: Implications for C-saturation of soils. Plant Soil 241, 155–176. doi:10.1023/A:1016125726789
James, G., Witten, D., Hastie, T., Tibshirani, R., 2013. An Introduction to Statistical Learning: with Applications in R. Springer Science & Business Media.
Janzen, H.H., 2004. Carbon cycling in earth systems - a soil science perspective. Agric. Ecosyst. Environ. 104, 399–417. doi:10.1016/j.agee.2004.01.040
Jastrow, J.D., Miller, R.M., 1998. Soil aggregate stabilization and carbon sequestration: Feedbacks through organomineral associations, in: Lal, R., Kimble, J.M., Follett, R.F., Stewart, B.A. (Eds.), Soil Processes and the Carbon Cycle. CRC Press, Boca Raton, FL, pp. 207–223.
Jenny, H., 1941. Factors of soil formation. McGraw-Hill Book Company, New York, NY.
Jobbágy, E.G., Jackson, R.B., 2000. The vertical distribution of soil organic carbon and its relation to climate and vegetation. Ecol. Appl. 10, 423–436.
Karatzoglou, A., Smola, A., Hornik, K., 2007. The kernlab package. Compr. R Arch. Netw.
Karunaratne, S.B., Bishop, T.F.A., Baldock, J.A., Odeh, I.O.A., 2014. Catchment scale mapping of measureable soil organic carbon fractions. Geoderma 219–220, 14–23. doi:10.1016/j.geoderma.2013.12.005
Kautz, R., Stys, B., Kawula, R., 2007. Florida vegetation 2003 and land use change between 1985–89 and 2003. Fla. Sci. 70, 12.
Kerry, R., Oliver, M.A., 2007. Comparing sampling needs for variograms of soil properties computed by the method of moments and residual maximum likelihood. Geoderma, Pedometrics 2005 140, 383–396. doi:10.1016/j.geoderma.2007.04.019
Kleber, M., Nico, P.S., Plante, A., Filley, T., Kramer, M., Swanston, C., Sollins, P., 2011. Old and stable soil organic matter is not necessarily chemically recalcitrant: implications for modeling concepts and temperature sensitivity: slow turnover of labile soil organic matter. Glob. Change Biol. 17, 1097–1107. doi:10.1111/j.1365-2486.2010.02278.x
Knotters, M., Brus, D.J., Oude Voshaar, J.H., 1995. A comparison of kriging, co-kriging and kriging combined with regression for spatial interpolation of horizon depth with censored observations. Geoderma 67, 227–246. doi:10.1016/0016-7061(95)00011-C
Knox, N.M., Grunwald, S., McDowell, M.L., Bruland, G.L., Myers, D.B., Harris, W.G., 2015. Modelling soil carbon fractions with visible near-infrared (VNIR) and mid-infrared (MIR) spectroscopy. Geoderma 239–240, 229–239. doi:10.1016/j.geoderma.2014.10.019
Kravchenko, A.N., 2003. Influence of spatial structure on accuracy of interpolation methods. Soil Sci. Soc. Am. J. 67, 1564. doi:10.2136/sssaj2003.1564
Kravchenko, A.N., Robertson, G.P., 2007. Can topographical and yield data substantially improve total soil carbon mapping by regression kriging? Agron. J. 99, 12. doi:10.2134/agronj2005.0251
Kuhn, M., Johnson, K., 2013. Applied predictive modeling. Springer New York, New York, NY.
Kumar, S., Lal, R., Liu, D., 2012. A geographically weighted regression kriging approach for mapping soil organic carbon stock. Geoderma 189–190, 627–634. doi:10.1016/j.geoderma.2012.05.022
Kuriakose, S.L., Devkota, S., Rossiter, D.G., Jetten, V.G., 2009. Prediction of soil depth using environmental variables in an anthropogenic landscape, a case study in the Western Ghats of Kerala, India. CATENA 79, 27–38. doi:10.1016/j.catena.2009.05.005
Kursa, M.B., Rudnicki, W.R., 2010. Feature selection with the Boruta package. J. Stat. Softw. 36, 1–13.
Lado, L.R., Hengl, T., Reuter, H.I., 2008. Heavy metals in European soils: A geostatistical analysis of the FOREGS Geochemical database. Geoderma 148, 189–199. doi:10.1016/j.geoderma.2008.09.020
Lal, R., 2004. Soil carbon sequestration impacts on global climate change and food security. Science 304, 1623–1627. doi:10.1126/science.1097396
Lamsal, S., Grunwald, S., Bruland, G.L., Bliss, C.M., Comerford, N.B., 2006. Regional hybrid geospatial modeling of soil nitrate–nitrogen in the Santa Fe River Watershed. Geoderma 135, 233–247. doi:10.1016/j.geoderma.2005.12.009
Lange, M., Eisenhauer, N., Sierra, C.A., Bessler, H., Engels, C., Griffiths, R.I., Mellado-Vázquez, P.G., Malik, A.A., Roy, J., Scheu, S., Steinbeiss, S., Thomson, B.C., Trumbore, S.E., Gleixner, G., 2015. Plant diversity increases soil microbial activity and soil carbon storage. Nat. Commun. 6, 6707. doi:10.1038/ncomms7707
Lark, R.M., 1999. Soil–landform relationships at within-field scales: an investigation using continuous classification. Geoderma 92, 141–165. doi:10.1016/S0016-7061(99)00028-2
Lark, R.M., 2012. Towards soil geostatistics. Spat. Stat. 1, 92–99. doi:10.1016/j.spasta.2012.02.001
Lark, R.M., Cullis, B.R., 2004. Model-based analysis using REML for inference from systematically sampled data on soil. Eur. J. Soil Sci. 55, 799–813. doi:10.1111/j.1365-2389.2004.00637.x
Lark, R.M., Cullis, B.R., Welham, S.J., 2006. On spatial prediction of soil properties in the presence of a spatial trend: the empirical best linear unbiased predictor (E-BLUP) with REML. Eur. J. Soil Sci. 57, 787–799. doi:10.1111/j.1365-2389.2005.00768.x
Lark, R.M., Webster, R., 2006. Geostatistical mapping of geomorphic variables in the presence of trend. Earth Surf. Process. Landf. 31, 862–874. doi:10.1002/esp.1296
Lawrence, C.R., Harden, J.W., Xu, X., Schulz, M.S., Trumbore, S.E., 2015. Long-term controls on soil organic carbon with depth and time: A case study from the Cowlitz River Chronosequence, WA USA. Geoderma 247–248, 73–87. doi:10.1016/j.geoderma.2015.02.005
Lawrence, R., 2004. Classification of remotely sensed imagery using stochastic gradient boosting as a refinement of classification tree analysis. Remote Sens. Environ. 90, 331–336. doi:10.1016/j.rse.2004.01.007
Leinweber, P., Schulten, H.-R., Körschens, M., 1995. Hot water extracted organic matter: chemical composition and temporal variations in a long-term field experiment. Biol. Fertil. Soils 20, 17–23. doi:10.1007/BF00307836
Leopold, U., Heuvelink, G.B.M., Tiktak, A., Finke, P.A., Schoumans, O., 2006. Accounting for change of support in spatial accuracy assessment of modelled soil mineral phosphorous concentration. Geoderma 130, 368–386. doi:10.1016/j.geoderma.2005.02.008
Levi, M.R., Rasmussen, C., 2014. Covariate selection with iterative principal component analysis for predicting physical soil properties. Geoderma 219–220, 46–57. doi:10.1016/j.geoderma.2013.12.013
Li, J., Heap, A.D., 2011a. A review of comparative studies of spatial interpolation methods in environmental sciences: Performance and impact factors. Ecol. Inform. 6, 228–241. doi:10.1016/j.ecoinf.2010.12.003
Li, J., Heap, A.D., Potter, A., Daniell, J.J., 2011b. Application of machine learning methods to spatial interpolation of environmental variables. Environ. Model. Softw. 26, 1647–1659. doi:10.1016/j.envsoft.2011.07.004
Li, Q., Yue, T., Wang, C., Zhang, W., Yu, Y., Li, B., Yang, J., Bai, G., 2013. Spatially distributed modeling of soil organic matter across China: An application of artificial neural network approach. CATENA 104, 210–218. doi:10.1016/j.catena.2012.11.012
Li, Y., 2010. Can the spatial prediction of soil organic matter contents at various sampling scales be improved by using regression kriging with auxiliary information? Geoderma 159, 63–75. doi:10.1016/j.geoderma.2010.06.017
Liaw, A., Wiener, M., 2002. Classification and regression by randomForest. R News 2, 18–22.
Lin, H., 2010. Earth’s Critical Zone and hydropedology: concepts, characteristics, and advances. Hydrol. Earth Syst. Sci. 14, 25–45. doi:10.5194/hess-14-25-2010
Lin, H., 2012. Hydropedology, in: Hydropedology. Elsevier, pp. 3–39.
Lin, H., Wheeler, D., Bell, J., Wilding, L., 2005. Assessment of soil spatial variability at multiple scales. Ecol. Model., Scaling, fractals and diversity in soils and ecohydrology 182, 271–290. doi:10.1016/j.ecolmodel.2004.04.006
Lin, Y.P., Cheng, B.Y., Chu, H.J., Chang, T.K., Yu, H.L., 2011. Assessing how heavy metal pollution and human activity are related by using logistic regression and kriging methods. Geoderma 163, 275–282. doi:10.1016/j.geoderma.2011.05.004
Liu, H., Motoda, H., 2012. Feature Selection for Knowledge Discovery and Data Mining. Springer Science & Business Media.
Lützow, M. v., Kögel-Knabner, I., Ekschmitt, K., Matzner, E., Guggenberger, G., Marschner, B., Flessa, H., 2006. Stabilization of organic matter in temperate soils: mechanisms and their relevance under different soil conditions - a review. Eur. J. Soil Sci. 57, 426–445. doi:10.1111/j.1365-2389.2006.00809.x
Malone, B.P., Minasny, B., Odgers, N.P., McBratney, A.B., 2014. Using model averaging to combine soil property rasters from legacy soil maps and from point data. Geoderma 232–234, 34–44. doi:10.1016/j.geoderma.2014.04.033
Marschner, B., Brodowski, S., Dreves, A., Gleixner, G., Gude, A., Grootes, P.M., Hamer, U., Heim, A., Jandl, G., Ji, R., Kaiser, K., Kalbitz, K., Kramer, C., Leinweber, P., Rethemeyer, J., Schäffer, A., Schmidt, M.W.I., Schwark, L., Wiesenberg, G.L.B., 2008. How relevant is recalcitrance for the stabilization of organic matter in soils? J. Plant Nutr. Soil Sci. 171, 91–110. doi:10.1002/jpln.200700049
Martin, M.P., Orton, T.G., Lacarce, E., Meersmans, J., Saby, N.P.A., Paroissien, J.B., Jolivet, C., Boulonne, L., Arrouays, D., 2014. Evaluation of modelling approaches for predicting the spatial distribution of soil organic carbon stocks at the national scale. Geoderma 223–225, 97–107. doi:10.1016/j.geoderma.2014.01.005
Martin, M.P., Wattenbach, M., Smith, P., Meersmans, J., Jolivet, C., Boulonne, L., Arrouays, D., 2011. Spatial distribution of soil organic carbon stocks in France. Biogeosciences 8, 1053–1065. doi:10.5194/bg-8-1053-2011
Matheron, G., 1971. The theory of regionalized variables and its applications.
McBratney, A., 1992. On variation, uncertainty and informatics in environmental soil management. Soil Res. 30, 913–935.
McBratney, A., Mendonça Santos, M., Minasny, B., 2003. On digital soil mapping. Geoderma 117, 3–52. doi:10.1016/S0016-7061(03)00223-4
McBratney, A.B., 1998. Some considerations on methods for spatially aggregating and disaggregating soil information, in: Soil and Water Quality at Different Scales. Springer, pp. 51–62.
McBratney, A.B., Odeh, I.O.A., Bishop, T.F.A., Dunbar, M.S., Shatar, T.M., 2000. An overview of pedometric techniques for use in soil survey. Geoderma 97, 293–327. doi:10.1016/S0016-7061(00)00043-4
McKenzie, N.J., Ryan, P.J., 1999. Spatial prediction of soil properties using environmental correlation. Geoderma 89, 67–94. doi:10.1016/S0016-7061(98)00137-2
McSweeney, K., Slater, B.K., David Hammer, R., Bell, J.C., Gessler, P.E., Petersen, G.W., 1994. Towards a new framework for modeling the soil-landscape continuum, in: SSSA Special Publication. Soil Science Society of America.
Meersmans, J., De Ridder, F., Canters, F., De Baets, S., Van Molle, M., 2008. A multiple regression approach to assess the spatial distribution of Soil Organic Carbon (SOC) at the regional scale (Flanders, Belgium). Geoderma 143, 1–13. doi:10.1016/j.geoderma.2007.08.025
Melillo, J.M., Aber, J.D., Linkins, A.E., Ricca, A., Fry, B., Nadelhoffer, K.J., 1989. Carbon and nitrogen dynamics along the decay continuum: plant litter to soil organic matter, in: Ecology of Arable Land—Perspectives and Challenges. Springer, pp. 53–62.
Merow, C., Smith, M.J., Edwards, T.C., Guisan, A., McMahon, S.M., Normand, S., Thuiller, W., Wüest, R.O., Zimmermann, N.E., Elith, J., 2014. What do we gain from simplicity versus complexity in species distribution models? Ecography 37, 1267–1281. doi:10.1111/ecog.00845
Michaletz, S.T., Cheng, D., Kerkhoff, A.J., Enquist, B.J., 2014. Convergence of terrestrial plant production across global climate gradients. Nature. doi:10.1038/nature13470
Miller, B.A., Koszinski, S., Wehrhan, M., Sommer, M., 2015. Impact of multi-scale predictor selection for modeling soil properties. Geoderma 239–240, 97–106. doi:10.1016/j.geoderma.2014.09.018
Milne, E., Powlson, D.S., Cerri, C.E., 2007. Soil carbon stocks at regional scales. Agric. Ecosyst. Environ., Soil carbon stocks at regional scales Assessment of Soil Organic Carbon Stocks and Change at National Scale, Final Project Presentation, The United Nations Environment Programme, Nairobi, Kenya, 23-24 May 2005 122, 1–2. doi:10.1016/j.agee.2007.01.001
Minasny, B., McBratney, A.B., 2005. The Matérn function as a general model for soil variograms. Geoderma, Pedometrics 2003 128, 192–207. doi:10.1016/j.geoderma.2005.04.003
Minasny, B., McBratney, A.B., 2006. A conditioned Latin hypercube method for sampling in the presence of ancillary information. Comput. Geosci. 32, 1378–1388. doi:10.1016/j.cageo.2005.12.009
Minasny, B., McBratney, A.B., 2007. Spatial prediction of soil properties using EBLUP with the Matérn covariance function. Geoderma, Pedometrics 2005 140, 324–336. doi:10.1016/j.geoderma.2007.04.028
Minasny, B., McBratney, A.B., 2015. Digital soil mapping: A brief history and some lessons. Geoderma. doi:10.1016/j.geoderma.2015.07.017
Minasny, B., McBratney, A.B., Malone, B.P., Wheeler, I., 2013. Digital mapping of soil Carbon, in: Advances in Agronomy. Elsevier, pp. 1–47.
Minasny, B., McBratney, A.B., Salvador-Blanes, S., 2008. Quantitative models for pedogenesis — A review. Geoderma, Antarctic Soils and Soil Forming Processes in a Changing Environment 144, 140–157. doi:10.1016/j.geoderma.2007.12.013
Mishra, U., Lal, R., Liu, D., Van Meirvenne, M., 2010. Predicting the spatial variation of the soil organic carbon pool at a regional scale. Soil Sci. Soc. Am. J. 74, 906. doi:10.2136/sssaj2009.0158
Mishra, U., Torn, M.S., Masanet, E., Ogle, S.M., 2012. Improving regional soil carbon inventories: Combining the IPCC carbon inventory method with regression kriging. Geoderma 189–190, 288–295. doi:10.1016/j.geoderma.2012.06.022
Moore, I.D., Gessler, P.E., Nielsen, G.A., Peterson, G.A., 1993. Soil attribute prediction using terrain analysis. Soil Sci. Soc. Am. J. 57, 443–452.
Mora-Vallejo, A., Claessens, L., Stoorvogel, J., Heuvelink, G.B.M., 2008. Small scale digital soil mapping in Southeastern Kenya. CATENA 76, 44–53. doi:10.1016/j.catena.2008.09.008
Mulkey, S., Alavalapati, J., Hodges, A., Wilkie, A.C., Grunwald, S., 2008. Opportunities for greenhouse gas reduction through forestry and agriculture in Florida. Univ. Fla. Sch. Nat. Resour. Retrieved January 20, 2008.
Mulla, D.J., McBratney, A.B., 2002. Soil spatial variability, in: Warrick, A.W. (Ed.), Soil Physics Companion. CRC Press LLC, Boca Raton.
National Climatic Data Center (NCDC), National Oceanic and Atmospheric Administration (NOAA), 2008. Monthly Surface Data. Available at: http://www.ncdc.noaa.gov.
Natural Resources Conservation Service (NRCS), U.S. Department of Agriculture, 2006. Soil Survey Geographic Database (SSURGO). Available at: http://www.nrcs.usda.gov/wps/portal/nrcs/main/soils/.
Natural Resources Conservation Service (NRCS), U.S. Department of Agriculture, 2007. Soil Survey Geographic Database (SSURGO). Available at: http://www.nrcs.usda.gov/wps/portal/nrcs/main/soils/.
Natural Resources Conservation Service (NRCS), U.S. Department of Agriculture, 2009. Soil Survey Geographic Database (SSURGO). Available at: http://www.nrcs.usda.gov/wps/portal/nrcs/main/soils/.
Niang, M.A., Nolin, M.C., Jégo, G., Perron, I., 2014. Digital mapping of soil texture using RADARSAT-2 polarimetric synthetic aperture radar data. Soil Sci. Soc. Am. J. 78, 673. doi:10.2136/sssaj2013.07.0307
Oades, J.M., 1988. The retention of organic matter in soils. Biogeochemistry 5, 35–70. doi:10.1007/BF02180317
Odeh, I.O.A., McBratney, A.B., 2000. Using AVHRR images for spatial prediction of clay content in the lower Namoi Valley of eastern Australia. Geoderma 97, 237–254. doi:10.1016/S0016-7061(00)00041-0
Odeh, I.O.A., McBratney, A.B., Chittleborough, D.J., 1995. Further results on prediction of soil properties from terrain attributes: heterotopic cokriging and regression-kriging. Geoderma 67, 215–226. doi:10.1016/0016-7061(95)00007-B
Odeh, I.O.A., McBratney, A.B., Chittleborough, D.J., 1994. Spatial prediction of soil properties from landform attributes derived from a digital elevation model. Geoderma 63, 197–214. doi:10.1016/0016-7061(94)90063-9
Odgers, N.P., McBratney, A.B., Minasny, B., 2015. Digital soil property mapping and uncertainty estimation using soil class probability rasters. Geoderma 237–238, 190–198. doi:10.1016/j.geoderma.2014.09.009
Odgers, N.P., Sun, W., McBratney, A.B., Minasny, B., Clifford, D., 2014. Disaggregating and harmonising soil map units through resampled classification trees. Geoderma 214–215, 91–100. doi:10.1016/j.geoderma.2013.09.024
Oliver, M.A., 1987. Geostatistics and its application to soil science. Soil Use Manag. 3, 8–20. doi:10.1111/j.1475-2743.1987.tb00703.x
Oliver, M.A., Webster, R., 2014. A tutorial guide to geostatistics: Computing and modelling variograms and kriging. CATENA 113, 56–69. doi:10.1016/j.catena.2013.09.006
Parton, W.J., Schimel, D.S., Cole, C.V., Ojima, D.S., 1987a. Division s-3-soil microbiology and biochemistry. Soil Sci Soc Am J 51, 1173–1179.
Parton, W.J., Schimel, D.S., Cole, C.V., Ojima, D.S., 1987b. Analysis of factors controlling soil organic matter levels in Great Plains grasslands. Soil Sci. Soc. Am. J. 51, 1173. doi:10.2136/sssaj1987.03615995005100050015x
Pebesma, E.J., 2004. Multivariable geostatistics in S: the gstat package. Comput. Geosci. 30, 683–691. doi:10.1016/j.cageo.2004.03.012
Percival, H.J., Parfitt, R.L., Scott, N.A., 2000. Factors controlling soil carbon levels in New Zealand grasslands: Is clay content important? Soil Sci. Soc. Am. J. 64, 1623–1630.
Peters, A., Hothorn, T., Ripley, B.D., Therneau, T., Atkinson, B., 2015. Package “ipred.”
Poggio, L., Gimona, A., 2014. National scale 3D modelling of soil organic carbon stocks with uncertainty propagation — An example from Scotland. Geoderma 232–234, 284–299. doi:10.1016/j.geoderma.2014.05.004
Poggio, L., Gimona, A., Brown, I., Castellazzi, M., 2010. Soil available water capacity interpolation and spatial uncertainty modelling at multiple geographical extents. Geoderma 160, 175–188. doi:10.1016/j.geoderma.2010.09.015
Prasad, A.M., Iverson, L.R., Liaw, A., 2006. Newer classification and regression tree techniques: bagging and random forests for ecological prediction. Ecosystems 9, 181–199. doi:10.1007/s10021-005-0054-1
R Core Team, 2015. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Available at: http://www.R-project.org/.
Raftery, A.E., Gneiting, T., Balabdaoui, F., Polakowski, M., 2005. Using Bayesian model averaging to calibrate forecast ensembles. Mon. Weather Rev. 133, 1155–1174.
Rawlins, B.G., Henrys, P., Breward, N., Robinson, D.A., Keith, A.M., Garcia-Bajo, M., 2011. The importance of inorganic carbon in soil carbon databases and stock estimates: a case study from England. Soil Use Manag. 27, 312–320. doi:10.1111/j.1475-2743.2011.00348.x
Richter, D. deB., Bacon, A.R., Mobley, M.L., Richardson, C.J., Andrews, S.S., West, L., Wills, S., Billings, S., Cambardella, C.A., Cavallaro, N., DeMeester, J.E., Franzluebbers, A.J., Grandy, A.S., Grunwald, S., Gruver, J., Hartshorn, A.S., Janzen, H., Kramer, M.G., Ladha, J.K., Lajtha, K., Liles, G.C., Markewitz, D., Megonigal, P.J., Mermut, A.R., Rasmussen, C., Robinson, D.A., Smith, P., Stiles, C.A., Tate, R.L., Thompson, A., Tugel, A.J., van Es, H., Yaalon, D., Zobeck, T.M., 2011. Human–soil relations are changing rapidly: Proposals from SSSA’s Cross-Divisional Soil Change Working Group. Soil Sci. Soc. Am. J. 75, 2079. doi:10.2136/sssaj2011.0124
Ridgeway, G., 2004. The gbm package. R Found. Stat. Comput. Vienna Austria.
Rivero, R.G., Grunwald, S., Bruland, G.L., 2007. Incorporation of spectral data into multivariate geostatistical models to map soil phosphorus variability in a Florida wetland. Geoderma, Pedometrics 2005 140, 428–443. doi:10.1016/j.geoderma.2007.04.026
Rodríguez-Lado, L., Martínez-Cortizas, A., 2015. Modelling and mapping organic carbon content of topsoils in an Atlantic area of southwestern Europe (Galicia, NW-Spain). Geoderma 245–246, 65–73. doi:10.1016/j.geoderma.2015.01.015
Roger, A., Libohova, Z., Rossier, N., Joost, S., Maltas, A., Frossard, E., Sinaj, S., 2014. Spatial variability of soil phosphorus in the Fribourg canton, Switzerland. Geoderma 217–218, 26–36. doi:10.1016/j.geoderma.2013.11.001
Rossiter, D.G., 2012. Applied geostatistics Exercise 3: Modelling spatial structure from point samples.
Rumpel, C., Kögel-Knabner, I., 2010. Deep soil organic matter—a key but poorly understood component of terrestrial C cycle. Plant Soil 338, 143–158. doi:10.1007/s11104-010-0391-5
Ryan, P.J., McKenzie, N.J., O’Connell, D., Loughhead, A.N., Leppert, P.M., Jacquier, D., Ashton, L., 2000. Integrating forest soils information across scales: spatial prediction of soil properties under Australian forests. For. Ecol. Manag. 138, 139–157. doi:10.1016/S0378-1127(00)00393-5
Schimel, D., Stillwell, M.A., Woodmansee, R.G., 1985. Biogeochemistry of C, N, and P in a Soil Catena of the Shortgrass Steppe. Ecology 66, 276–282. doi:10.2307/1941328
Schimel, D.S., Braswell, B.H., Holland, E.A., McKeown, R., Ojima, D.S., Painter, T.H., Parton, W.J., Townsend, A.R., 1994. Climatic, edaphic, and biotic controls over storage and turnover of carbon in soils. Glob. Biogeochem. Cycles 8, 279–293. doi:10.1029/94GB00993
Schmidt, M.W.I., Torn, M.S., Abiven, S., Dittmar, T., Guggenberger, G., Janssens, I.A., Kleber, M., Kögel-Knabner, I., Lehmann, J., Manning, D.A.C., Nannipieri, P., Rasse, D.P., Weiner, S., Trumbore, S.E., 2011. Persistence of soil organic matter as an ecosystem property. Nature 478, 49–56. doi:10.1038/nature10386
Shi, W., Liu, J., Du, Z., Stein, A., Yue, T., 2011. Surface modelling of soil properties based on land use information. Geoderma 162, 347–357. doi:10.1016/j.geoderma.2011.03.007
Simbahan, G.C., Dobermann, A., Goovaerts, P., Ping, J., Haddix, M.L., 2006. Fine-resolution mapping of soil organic carbon based on multivariate secondary data. Geoderma 132, 471–489. doi:10.1016/j.geoderma.2005.07.001
Smith, P., Fang, C., Dawson, J.J.C., Moncrieff, J.B., 2008. Impact of global warming on soil organic carbon, in: Advances in Agronomy. Academic Press, pp. 1–43.
Smola, A.J., Schölkopf, B., 2004. A tutorial on support vector regression. Stat. Comput. 14, 199–222. doi:10.1023/B:STCO.0000035301.49549.88
Sollins, P., Homann, P., Caldwell, B.A., 1996. Stabilization and destabilization of soil organic matter: mechanisms and controls. Geoderma 74, 65–105. doi:10.1016/S0016-7061(96)00036-5
Stacey, K.F., Lark, R.M., Whitmore, A.P., Milne, A.E., 2006. Using a process model and regression kriging to improve predictions of nitrous oxide emissions from soil. Geoderma 135, 107–117. doi:10.1016/j.geoderma.2005.11.008
Steffen, W., Grinevald, J., Crutzen, P., McNeill, J., 2011. The Anthropocene: conceptual and historical perspectives. Philos. Trans. R. Soc. Lond. Math. Phys. Eng. Sci. 369, 842–867. doi:10.1098/rsta.2010.0327
Stein, M.L., 1999. Interpolation of spatial data, Springer Series in Statistics. Springer New York, New York, NY.
Stone, E.L., Harris, W.G., Brown, R.B., Kuehl, R.J., 1993. Carbon storage in Florida Spodosols. Soil Sci. Soc. Am. J. 57, 179. doi:10.2136/sssaj1993.03615995005700010032x
Stoorvogel, J.J., Kempen, B., Heuvelink, G.B.M., de Bruin, S., 2009. Implementation and evaluation of existing knowledge for digital soil mapping in Senegal. Geoderma 149, 161–170. doi:10.1016/j.geoderma.2008.11.039
Sun, W., Minasny, B., McBratney, A., 2012. Analysis and prediction of soil properties using local regression-kriging. Geoderma, Entering the Digital Era: Special Issue of Pedometrics 2009, Beijing 171–172, 16–23. doi:10.1016/j.geoderma.2011.02.010
Takagi, K., Lin, H.S., 2012. Changing controls of soil moisture spatial organization in the Shale Hills Catchment. Geoderma 173–174, 289–302. doi:10.1016/j.geoderma.2011.11.003
Therneau, T., Atkinson, B., Ripley, B., 2015. Package “rpart”. Version.
Thompson, J.A., Roecker, S., Grunwald, S., Owens, P.R., 2012. Digital soil mapping, in: Hydropedology. Elsevier, pp. 665–709.
Thomsen, I.K., Schjønning, P., Olesen, J.E., Christensen, B.T., 2003. C and N turnover in structurally intact soils of different texture. Soil Biol. Biochem. 35, 765–774. doi:10.1016/S0038-0717(03)00093-2
Torn, M.S., Trumbore, S.E., Chadwick, O.A., Vitousek, P.M., Hendricks, D.M., 1997. Mineral control of soil organic carbon storage and turnover. Nature 389, 170–173. doi:10.1038/38260
Totsche, K.U., Rennert, T., Gerzabek, M.H., Kögel-Knabner, I., Smalla, K., Spiteller, M., Vogel, H.-J., 2010. Biogeochemical interfaces in soil: The interdisciplinary challenge for soil science. J. Plant Nutr. Soil Sci. 173, 88–99. doi:10.1002/jpln.200900105
Triantafilis, J., Odeh, I.O.A., McBratney, A.B., 2001. Five geostatistical models to predict soil salinity from electromagnetic induction data across irrigated cotton. Soil Sci. Soc. Am. J. 65, 869–878.
Umali, B.P., Oliver, D.P., Forrester, S., Chittleborough, D.J., Hutson, J.L., Kookana, R.S., Ostendorf, B., 2012. The effect of terrain and management on the spatial variability of soil properties in an apple orchard. CATENA 93, 38–48. doi:10.1016/j.catena.2012.01.010
United States Census Bureau, 2000. The Boundary of the State of Florida. Available at: http://www.census.gov/geo/www/cob/cbf_state.html.
United States Geological Survey (USGS), 1999. National Elevation Dataset (NED). Available at: http://ned.usgs.gov/.
United States Census Bureau, 2015. Population estimates. Available at: https://www.census.gov/newsroom/press-releases/2014/cb14-232.html.
Vanwalleghem, T., Poesen, J., McBratney, A., Deckers, J., 2010. Spatial variability of soil horizon depth in natural loess-derived soils. Geoderma 157, 37–45. doi:10.1016/j.geoderma.2010.03.013
Vapnik, V.N., 1998. Statistical Learning Theory. Wiley, New York.
Vasenev, V.I., Stoorvogel, J.J., Vasenev, I.I., Valentini, R., 2014. How to map soil organic carbon stocks in highly urbanized regions? Geoderma 226–227, 103–115. doi:10.1016/j.geoderma.2014.03.007
Vasques, G.M., Grunwald, S., Comerford, N.B., Sickman, J.O., 2010. Regional modelling of soil carbon at multiple depths within a subtropical watershed. Geoderma 156, 326–336. doi:10.1016/j.geoderma.2010.03.002
Vasques, G.M., Grunwald, S., Myers, D.B., 2012. Associations between soil carbon and ecological landscape variables at escalating spatial scales in Florida, USA. Landsc. Ecol. 27, 355–367. doi:10.1007/s10980-011-9702-3
Vasques, G.M., Grunwald, S., Sickman, J.O., 2008. Comparison of multivariate methods for inferential modeling of soil carbon using visible/near-infrared spectra. Geoderma 146, 14–25. doi:10.1016/j.geoderma.2008.04.007
Vasques, G.M., Grunwald, S., Sickman, J.O., Comerford, N.B., 2010. Upscaling of dynamic soil organic carbon pools in a north-central Florida watershed. Soil Sci. Soc. Am. J. 74, 870. doi:10.2136/sssaj2009.0242
Veldkamp, E., Becker, A., Schwendenmann, L., Clark, D.A., Schulte-Bisping, H., 2003. Substantial labile carbon stocks and microbial activity in deeply weathered soils below a tropical wet forest. Glob. Change Biol. 9, 1171–1184. doi:10.1046/j.1365-2486.2003.00656.x
Wackernagel, H., 2003. Multivariate geostatistics: an introduction with applications. Springer, Berlin; New York.
Wallis, J.R., 1965. Multivariate statistical methods in hydrology—A comparison using data of known functional relationship. Water Resour. Res. 1, 447–461. doi:10.1029/WR001i004p00447
Watt, M.S., Palmer, D.J., 2012. Use of regression kriging to develop a Carbon:Nitrogen ratio surface for New Zealand. Geoderma 183–184, 49–57. doi:10.1016/j.geoderma.2012.03.013
Webster, R., 1994. The development of pedometrics. Geoderma 62, 1–15. doi:10.1016/0016-7061(94)90024-8
Webster, R., 2000. Is soil variation random? Geoderma 97, 149–163. doi:10.1016/S0016-7061(00)00036-7
Webster, R., Burgess, T.M., 1980. Optimal interpolation and isarithmic mapping of soil properties: III. Changing drift and universal kriging. J. Soil Sci. 31, 505–524. doi:10.1111/j.1365-2389.1980.tb02100.x
Webster, R., Oliver, M.A., 1992. Sample adequately to estimate variograms of soil properties. J. Soil Sci. 43, 177–192. doi:10.1111/j.1365-2389.1992.tb00128.x
Webster, R., Oliver, M.A., 2007. Geostatistics for environmental scientists, 2nd ed. John Wiley, Chichester, United Kingdom.
Wehrens, R., Mevik, B.-H., 2007. The pls package. Ref. Man.
Were, K., Bui, D.T., Dick, Ø.B., Singh, B.R., 2015. A comparative assessment of support vector regression, artificial neural networks, and random forests for predicting and mapping soil organic carbon stocks across an Afromontane landscape. Ecol. Indic. 52, 394–403. doi:10.1016/j.ecolind.2014.12.028
Wiesmeier, M., Barthold, F., Spörlein, P., Geuß, U., Hangen, E., Reischl, A., Schilling, B., Angst, G., von Lützow, M., Kögel-Knabner, I., 2014. Estimation of total organic carbon storage and its driving factors in soils of Bavaria (southeast Germany). Geoderma Reg. 1, 67–78. doi:10.1016/j.geodrs.2014.09.001
Williams, P.C., 1987. Variables affecting near-infrared reflectance spectroscopic analysis, in: Williams, P., Norris, K. (Eds.), Near-infrared Technology in the Agricultural and Food Industries. American Association of Cereal Chemists, St. Paul, MN, pp. 143–167.
Wright, R.L., Wilson, S.R., 1979. On the analysis of soil variability, with an example from Spain. Geoderma 22, 297–313. doi:10.1016/0016-7061(79)90026-0
Xiong, X., Grunwald, S., Myers, D.B., Kim, J., Harris, W.G., Comerford, N.B., 2014a. Holistic environmental soil-landscape modeling of soil organic carbon. Environ. Model. Softw. 57, 202–215. doi:10.1016/j.envsoft.2014.03.004
Xiong, X., Grunwald, S., Myers, D.B., Ross, C.W., Harris, W.G., Comerford, N.B., 2014b. Interaction effects of climate and land use/land cover change on soil organic carbon sequestration. Sci. Total Environ. 493, 974–982. doi:10.1016/j.scitotenv.2014.06.088
Zhang, S., Huang, Y., Shen, C., Ye, H., Du, Y., 2012. Spatial prediction of soil organic matter using terrain indices and categorical variables as auxiliary information. Geoderma, Beijing 171–172, 35–43. doi:10.1016/j.geoderma.2011.07.012
Zhao, Y.-C., Shi, X.-Z., 2010. Spatial prediction and uncertainty assessment of soil organic carbon in Hebei province, China, in: Boettinger, J.L., Howell, D.W., Moore, A.C., Hartemink, A.E., Kienast-Brown, S. (Eds.), Digital Soil Mapping. Springer Netherlands, Dordrecht, pp. 227–239.
Zhu, Q., Lin, H.S., 2010. Comparing ordinary kriging and regression kriging for soil properties in contrasting landscapes. Pedosphere 20, 594–606. doi:10.1016/S1002-0160(10)60049-5
BIOGRAPHICAL SKETCH
Hamza Keskin was born in Istanbul, Turkey, in 1988. He received his Bachelor of
Science degree in forestry engineering from the University of Istanbul, Turkey, in 2010. He
was awarded a scholarship from the Republic of Turkey Ministry of Forestry and Water Affairs
in 2012, during his second year as a graduate student at the University of Istanbul. He
then decided to pursue his academic career in the United States. He enrolled at the University of Florida in
2013, where he earned his Master of Science degree in the Soil and Water Science
Department in 2015. His academic and professional interests involve modeling and
mapping soil properties to better understand the genesis and distribution of soils.