DIGITAL MAPPING OF SOIL CARBON FRACTIONS

By

HAMZA KESKIN

A THESIS PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT

OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

UNIVERSITY OF FLORIDA

2015

© 2015 Hamza Keskin

To new generation soil scientists

ACKNOWLEDGMENTS

I would like to thank my parents who always encouraged me to pursue my

dreams. I specifically thank my major advisor Dr. Sabine Grunwald for believing in me

and providing me with the data from a research project she led as principal investigator,

named “Rapid Assessment and Trajectory Modeling of Changes in Soil Carbon

across a Southeastern Landscape” (USDA-CSREES-NRI grant award 2007-35107-18368

by the National Institute of Food and Agriculture (NIFA), U.S. Department of

Agriculture). I also thank my supervisory committee members, Dr. Willie Harris and Dr. Samira

Daroub for their professional advice and suggestions to increase the quality of the

manuscripts.

I owe a great deal of thanks to the Republic of Turkey Ministry of Forestry and

Water Affairs for financial support throughout the master’s program and to the General

Directorate of Combating Desertification and Erosion for the opportunity to work with

them after graduation.

Acknowledgments for field sampling, laboratory analysis, and development of the

soil-environmental database go to the co-principal investigators of the project, Dr.

Nicholas B. Comerford, Dr. Willie G. Harris, and Dr. Gregory L. Bruland. I also thank D.

Brenton Myers, Nichola M. Knox, Deoyani Sarkhot, Elena Azuaje, C. Wade Ross,

Xiong Xiong, Jongsung Kim, Gustavo M. Vasques, Pasicha Chaikaew, Aja Stoppe, Lisa

Stanley, Adriana Comerford, Xiaoling Dong, Samiah Moustafa, and Anne Quidez who

contributed to the construction of the carbon data used in Chapter 3 of the thesis. I

also would like to thank Esther Kaufman for unconditional help and guidance on the

revisions of the manuscripts.

TABLE OF CONTENTS

ACKNOWLEDGMENTS

LIST OF TABLES

LIST OF FIGURES

LIST OF ABBREVIATIONS

ABSTRACT

CHAPTER

1 INTRODUCTION

2 REGRESSION KRIGING AS WORKHORSE IN THE PEDOMETRICIAN’S TOOLBOX

    2.1 Introduction
    2.2 Material and Methods
    2.3 Results
        2.3.1 Spatial Scale
            2.3.1.1 Geographic region
            2.3.1.2 Area extent
            2.3.1.3 Grain size
        2.3.2 Target Soil Properties and Classes
        2.3.3 Sampling
            2.3.3.1 Sampling design
            2.3.3.2 Sample size-density
            2.3.3.3 Sample depth(s)
        2.3.4 SCORP Factors
        2.3.5 Preprocessing
            2.3.5.1 Logarithmic transformation
            2.3.5.2 Factor analysis
        2.3.6 Regression Type to Quantify Deterministic Variation
        2.3.7 Variogram
            2.3.7.1 Model type
            2.3.7.2 N:S ratio
            2.3.7.3 Range
        2.3.8 Validation
    2.4 Discussion and Recommendations
        2.4.1 Factors Affecting Performance of RK
        2.4.2 Regression Kriging as a Default Soil Mapping Method
            2.4.2.1 Satisfactory performance of regression kriging over its competitors
            2.4.2.2 Unsatisfactory performance of regression kriging over its competitors
        2.4.3 REML-EBLUP vs. RK
        2.4.4 Future Trend of RK
        2.4.5 Model Averaging
    2.5 Conclusions and Outlook

3 DIGITAL MAPPING OF SOIL CARBON FRACTIONS

    3.1 Introduction
    3.2 Materials and Methods
        3.2.1 Study Area
        3.2.2 Soil Data
            3.2.2.1 Sampling design and field sampling
            3.2.2.2 Laboratory and chemical analysis
            3.2.2.3 Determination of total, recalcitrant and labile carbon stocks
        3.2.3 Environmental Data
            3.2.3.1 Assembled environmental variables representing STEP-AWBH factors
            3.2.3.2 Boruta feature selection technique
        3.2.4 Modeling Techniques
        3.2.5 Evaluation of Model Performance
        3.2.6 Application of Models
        3.2.7 Mapping of Total, Labile and Recalcitrant Carbon Stocks
    3.3 Results and Discussion
        3.3.1 Descriptive Summary Statistic of Carbon Fractions
        3.3.2 Spatial Autocorrelation with Trend and without Trend
        3.3.3 Important Variables
        3.3.4 Assessment of the Prediction Capability of the Selected Methods
        3.3.5 Residual Spatial Autocorrelation of Evaluated Methods
        3.3.6 Regional Scale Controls on Stabilization of Soil Carbon
        3.3.7 Spatial Distribution of C fractions
    3.4 Conclusions

4 SUMMARY AND SYNTHESIS

APPENDIX: LITERATURE REVIEW

LIST OF REFERENCES

BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Spatial range (m) from reviewed studies under three different

2-2 Modified Version of Regression Kriging (RK)

3-1 Assembled environmental variables representing STEP-AWBH factors

3-2 R packages to perform evaluated methods

3-3 Descriptive statistic of observed soil C fractions.

3-4 Spearman’s correlation analysis of the paired soil C fractions.

3-5 Z score as a sign for relative importance of all-relevant variables identified by Boruta.

3-6 Performance of eight different modelling methods to predict soil total carbon (TC), recalcitrant carbon (RC) and labile carbon (HC) on validation.

3-7 Cross-validation (on the 70% calibration dataset) and independent validation (on the 30% validation dataset) results of Random Forest models

LIST OF FIGURES

2-1 Evolution of Hybrid Interpolation Techniques

2-2 General framework for Regression Kriging

2-3 The cumulative amount of RK studied over time.

2-4 Effects of coefficient of variation on the accuracy of RK methods compared in the 71 cases.

3-1 A total of 1014 soil sampling locations

3-2 Upper part of figure depicts the omnidirectional variograms for total carbon (TC), recalcitrant carbon (RC)

3-3 Predicted vs. observed soil total carbon (TC) of validation dataset

3-4 Predicted vs. observed soil recalcitrant carbon (RC) of validation dataset

3-5 Predicted vs. observed soil hot-water extractable carbon (HC) of validation

3-6 Relative increase (%) in root mean squared deviations (RMSD) of evaluated prediction techniques compared to RMSD of OK.

3-7 Strength of the spatial autocorrelation among evaluated model residuals for total carbon (TC).

3-8 Strength of the spatial autocorrelation among evaluated model residuals for recalcitrant carbon (RC).

3-9 Strength of the spatial autocorrelation among evaluated model residuals for hot-water extractable carbon (HC).

3-10 Violin plot of soil total C (kg m−2) of the 1014 current samples (2008–2009) grouped by the field-observed land use/land cover (LULC).

3-11 Violin plot of soil recalcitrant C (kg m−2) of the 1014 current samples (2008–2009) grouped by the field-observed land use/land cover (LULC).

3-12 Violin plot of soil hydrolysable C (kg m−2) of the 1014 current samples (2008–2009) grouped by the field-observed land use/land cover (LULC).

3-13 Spatial distribution of landcover/landuse classes

3-14 Violin plot of soil total C (kg m−2) of the 1014 current samples (2008–2009) grouped by the field-observed suborders.

3-15 Violin plot of soil recalcitrant C (kg m−2) of the 1014 current samples (2008–2009) grouped by the field-observed suborders.

3-16 Violin plot of soil hydrolysable C (kg m−2) of the 1014 current samples (2008–2009) grouped by the field-observed suborders.

3-17 Spatial distribution of soil suborders

3-18 Spatial distribution patterns of estimated soil total carbon stocks (kg m−2) across Florida, U.S.

3-19 Spatial distribution patterns of estimated recalcitrant carbon stocks (kg m−2) across Florida, U.S.

3-20 Spatial distribution patterns of estimated hot-water extractable carbon stocks (kg m−2) across Florida, U.S.

LIST OF ABBREVIATIONS

ANN Artificial neural network

AWC Available water holding capacity (cm)

BaRT Bagged regression tree

BK Block kriging

BoRT Boosted regression tree

C Carbon (kg m-2)

C:N Carbon : nitrogen

CART Classification and regression tree

CLHS Conditioned Latin hypercube sampling

CV Coefficient of variation (%)

DEM Digital elevation model (m)

DSM Digital soil mapping

DSMM Digital soil mapping and modeling

GAM Generalized additive model

GLM Generalized linear model

HC Hot-water extractable carbon (kg m-2)

LULC Land use/land cover

ME Mean error (kg m-2)

MLR Multiple linear regression

N:S Nugget to sill ratio (%)

NRMSD Normalized root mean squared deviation

OK Ordinary kriging

PCA Principal component analysis

PLSR Partial least square regression

RC Recalcitrant carbon (kg m-2)

REML-EBLUP Restricted maximum likelihood-empirical best linear unbiased predictor

RF Random forest

RK Regression kriging

RMSD Root mean squared deviation (kg m-2)

RPD Residual prediction deviation

RPIQ Ratio of prediction error to inter-quartile range

RSA Residual spatial autocorrelation

SCORP S: Soil, C: Climate, O: Organism, R: Relief, P: Parent material

SMLR Stepwise multiple linear regression

SOC Soil organic carbon (kg m-2)

SOM Soil organic matter (kg m-2)

STEP-AWBH S: Soil, T: Topography, E: Ecology, P: Parent material,

A: Atmosphere, W: Water, B: Biota, H: Human

SVM Support vector machine

T Training dataset (N = 710)

TC Total carbon (kg m-2)

V Validation dataset (N = 304)

$x$ Location in one, two or three dimensions

$Z(x)$ The random variable $Z$ at location $x$

$\mu(x)$ Deterministic structural component, trend (drift)

$\varepsilon'(x)$ Stochastic component, spatially dependent residual from $\mu(x)$ [the regionalized variable]

$\varepsilon''(x)$ Spatially independent component, noise, unexplained variability

Abstract of Thesis Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Master of Science

DIGITAL MAPPING OF SOIL CARBON FRACTIONS

By

Hamza Keskin

December 2015

Chair: Sabine Grunwald

Major: Soil and Water Science

Our understanding of the spatial distribution of soil carbon (C) pools across

diverse land uses, soils, and climatic gradients at regional scale is still limited. Research

in digital soil mapping and modeling that investigates the interplay between (i) soil C

pools and environmental factors (“deterministic trend model”) and (ii) stochastic,

spatially dependent variations in soil C fractions (“stochastic model”) is just emerging.

This motivated us to investigate soil C pools in the State of Florida, which covers

about 150,000 km². Our specific objectives were to (i) compare different soil C pool

models that quantify stochastic and/or deterministic components, (ii) assess the

prediction performance of soil C models, and (iii) identify environmental factors that

impart most control on labile and recalcitrant pools and total soil C (TC). We used soil

data (0-20 cm) from a research project (USDA-CSREES-NRI grant award 2007-35107-

18368) collected at 1,014 georeferenced sites including measured bulk density (BD),

recalcitrant carbon (RC), labile (hot-water extractable) carbon (HC) and TC. A

comprehensive set of 327 geospatial soil-environmental variables was acquired. The

Boruta method was employed to identify “all-relevant” soil-environmental predictors. We

employed eight methods to predict soil C fractions and TC: Classification and Regression

Tree (CART), Bagged Regression Tree (BaRT), Boosted Regression Tree (BoRT), Random Forest (RF),

Support Vector Machine (SVM), Partial Least Square Regression (PLSR), Regression

Kriging (RK), and Ordinary Kriging (OK). Overall, 36,

20 and 25 predictors stood out as “all-relevant” to estimate TC, RC and HC,

respectively. We predicted a mean of 5.39 ± 3.74 kg TC m-2 in the top 20 cm with the

best model. The prediction performance assessed by the Ratio of Prediction Error to

Inter-quartile Range for TC stocks was as follows: RF > SVM > BoRT > BaRT > PLSR >

RK > CART > OK. The best models explained 71.6%, 71.7% and 30.5% of the total

variation for TC, RC and HC, respectively. Biotic and hydro-pedological factors

explained most of the variation in soil C pools and TC; lithologic and climatic factors

showed some relationships to soil C pools and TC, whereas topographic factors faded

from soil C models.

CHAPTER 1

INTRODUCTION

At the beginning of the 21st century, advances in computational power,

geographic information systems, remote sensing and statistical methods have

collectively enabled pedologists to produce state-of-the-art, reliable, categorical and

continuous spatial soil information at multiple scales in space and time, which empowers

environmental scientists to model and policy makers to deal with wicked environmental

problems, such as land degradation, climate change, food and water security,

biodiversity and ecosystem functions protection (Bouma and McBratney, 2013;

Hartemink and McBratney, 2008). Consequently, providing high-quality, justifiably

reliable, reproducible spatiotemporal soil information with limited uncertainty has been

the major focus in digital soil mapping (DSM) which has now shifted from the research

phase into an operational phase (Minasny and McBratney, 2015). Decreasing

inaccuracies in DSM is an essential requirement in the quest to comprehend variability

in soil properties/classes at multiple scales. A better understanding of soil variability will

pave the way for a better understanding of geo-patterns on the Earth’s surface

(Bockheim and Gennadiyev, 2010).

Soil, as the central component of the Earth’s critical zone (Lin, 2010) and a large

terrestrial C pool, dictates quantity and quality of soil ecosystem services at multiple

scales. At the local scale, SOC management is particularly important as it influences the

physical, chemical, and biological properties of soil. At the global scale, appropriate

management of soil is critical because of its role in mitigating the atmospheric level of

greenhouse gases (GHGs) by sequestering C from the atmosphere (Milne et al.,

2007). Due to the enormous capacity of soil to retain C, even relatively small shifts in its

quantity could cause dramatic changes in the global C balance (Smith et al., 2008).

Baldock et al. (2012) estimated that a mere 1% increase in the C content in the pedosphere

could offset 8 ppm of C content in the atmosphere. Hence, the re-equilibrium between

above- and below-ground C storage can be accomplished by reducing C loss and

boosting C build-up in the pedosphere with appropriate management practices (Lal,

2004). Accurately assessing soil C storage is challenged by the complexity of highly

variable C fluxes and C forming/degrading processes in space and time. Thus,

decreasing the uncertainty associated with today’s regional and global scale C

estimation can facilitate Earth System Science to address human-driven threats to soil

quality and soil security.

Regression Kriging (RK) is one of the most popular, practical and robust hybrid

spatial interpolation techniques in the pedometrician’s toolbox which enables the

modeling of soil distribution patterns at multiple scales in space and time. It explicitly

accounts for the deterministic and stochastic portions of the total variation of the phenomena of

interest.

A review of literature is provided in Chapter 2 which articulates past, present and

future development of DSM-RK studies. It describes the evolution of RK from a

historical perspective and traces development in the last decade using an extensive

literature review with the purpose of characterizing factors that affect the prediction

performance of RK. The review also illustrates the steps taken to develop efficient RK

models and identifies the limitations and strengths of RK. The results of this review

raised further questions: (i) Is it possible to incorporate data-mining methods into the RK

framework? (ii) What effect does the residual spatial autocorrelation (RSA) have in

geo-spatial hybrid methods? (iii) Do hybrid methods that were developed with sophisticated

data-mining methods yield better predictions than standalone hybrid methods?

Chapter 3 addresses the modelling and mapping of TC, RC and HC across the

State of Florida by constructing parsimonious geo-spatial soil landscape models without

sacrificing prediction accuracy. What makes Chapter 3 compelling is that the labile and

recalcitrant portions of soil total C are modelled and mapped along pedo-climatic

trajectories in a diverse subtropical region. Additionally, the quantification of the

stochastic and deterministic variability of soil C pools is scarce in the literature.

Moreover, the RSA of the eight different methods populated by strategically identified

ancillary variables is compared to answer the abovementioned questions that shape the

future development of the RK framework. Last but not least, regional scale

environmental controls on the stabilization and destabilization mechanisms of soil C are

discussed to enhance interpretability of modelling efforts.

CHAPTER 2

REGRESSION KRIGING AS WORKHORSE IN THE PEDOMETRICIAN’S TOOLBOX

2.1 Introduction

Soil variation inherently decreases the

accuracy and reliability of soil maps; its great complexity has therefore led pedologists

to seek alternative ways to represent the variability spatially (Burrough et al., 1994). Two

general, yet distinct approaches have been offered to account for the soil variation:

discrete modeling of soil variation (polygon-based), and continuous modeling of soil

variation (pixel-based). While the first approach partitions the soil into more or less

discrete classes, the latter approach looks at the soil-landscape as a continuum.

Traditional soil classification uses a polygon-based soil map unit model that has

numerous drawbacks. As Hartemink et al. (2010) articulated, the maps produced by

traditional soil classification methodology are static, inflexible, inaccurate, undetailed

and difficult to integrate with grid-based digital soil sources. Moreover, and most

importantly, polygon-based models do not formally specify the uncertainty (Grunwald,

2006). Altogether, these drawbacks largely contributed to the decrease in funding to

pedological research in the late 1990s (Basher, 1997; Ryan et al., 2000). “The challenge

was to use actual knowledge about soil forming processes and to develop a spatially-

realistic, mathematical soil-landscape model useful for a variety of purposes beyond

taxonomic classification” (McSweeney et al., 1994). Consequently, soil scientists

inevitably shifted from qualitative subjective modeling of soil properties and classes to

quantitative objective modeling; “soil science under uncertainty” (Goovaerts, 2001).

A unified model of soil spatial variation can be formalized using

regionalized variable theory with the following equation (after Burrough, 1986):

$$Z(x) = \mu(x) + \varepsilon'(x) + \varepsilon''(x) \tag{2-1}$$

where:

• $x$: location in one, two or three dimensions,

• $Z(x)$: the random variable $Z$ at location $x$,

• $\mu(x)$: deterministic structural component, trend (drift),

• $\varepsilon'(x)$: stochastic component, spatially dependent residual from $\mu(x)$ [the regionalized variable] but locally varying in both lateral and vertical directions,

• $\varepsilon''(x)$: spatially independent component, noise, unexplained variability.

Spatial variability in soil forms a spectrum of variation ranging from microscopic

to megascopic scale (Wright and Wilson, 1979) as a function of many possible factors,

including target area of extent, grain size, specific soil properties or processes, spatial

location and time (Lin et al., 2005). Altogether these factors may form a trend at multiple

scales and these trends may be depicted with a deterministic function ($\mu(x)$ in Equation

2-1). However, the processes responsible for soil variation are generally unknown and

with current expertise soil variability is unlikely to be captured analytically at multiple

scales in either space or time (Heuvelink and Webster, 2001). Typically, the values of a

soil property from samples taken at close geographic spacing are similar or spatially

correlated (Oliver, 1987). This is the premise of the spatially dependent random

component ($\varepsilon'(x)$ in Equation 2-1). Semivariograms have been used to characterize the

stochastic structural component as a function of the separation distance between pairs of points

under the stationarity assumption. The spatially independent component of the variation,

noise, is the unexplained variability ($\varepsilon''(x)$ in Equation 2-1), which is present in any

model and has a mean of zero and variance σ² (Webster, 2000).
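
To make the role of the semivariogram concrete, the following is a minimal sketch of the classical method-of-moments estimator, in which the semivariance at a lag is half the mean squared difference between all pairs of observations separated by that lag. The function name `empirical_semivariogram`, the lag bins, and the toy transect are illustrative assumptions and are not taken from the thesis.

```python
import numpy as np

def empirical_semivariogram(coords, values, bin_edges):
    """Method-of-moments estimate: gamma(h) = sum((z_i - z_j)^2) / (2 * N(h))."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    i, j = np.triu_indices(len(values), k=1)            # each unordered pair once
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    sqdiff = (values[i] - values[j]) ** 2
    lags, gammas, counts = [], [], []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (dist > lo) & (dist <= hi)               # pairs falling in this lag bin
        if mask.any():
            lags.append(dist[mask].mean())
            gammas.append(0.5 * sqdiff[mask].mean())
            counts.append(int(mask.sum()))
    return np.array(lags), np.array(gammas), np.array(counts)

# Toy transect (assumed data): values rise linearly with position.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
values = np.array([1.0, 2.0, 3.0, 4.0])
lags, gammas, counts = empirical_semivariogram(
    coords, values, np.array([0.0, 1.5, 2.5, 3.5])
)
```

In this toy case the semivariance keeps growing with lag instead of levelling off at a sill, which is the typical signature of a trend that has not been removed and illustrates why the stationarity assumption matters.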

The soil factorial model, an empirical-deterministic model of soil formation

developed by V.V. Dokuchaev (Glinka, 1927) and formalized by Jenny (1941)

has been widely utilized to explore the deterministic part of the variation, whereas

regionalized variable theory (Matheron, 1971) has mainly enabled researchers to

characterize the stochastic, spatially dependent variation (Webster, 1994). While soil

forming factors theory has attracted researchers’ attention as a way to quantitatively describe the

relationship between soil and its forming factors, the regionalized variable theory (Matheron,

1971) has mainly flourished due to its ability to predict the values of various soil

properties at unknown locations. Many statistical and purely geostatistical methods

used since the 1960s have been collectively categorized under the new branch in soil

science called “pedometrics”. Pedometrics can be defined as the application of

probability and statistical methods to soil science (Webster, 1994) or the application of

mathematics and statistics to study the distribution and genesis of soil (McBratney et al.,

2000). Deterministic and stochastic models of soil variation have been systematically

studied in the discipline of pedometrics since the 1990s.

There are two main generic approaches that are representative of these two

distinct model paradigms that address soil variation and predict soil properties and

classes at an unvisited location: (1) non-geostatistical techniques (e.g., simple and

multiple linear regression (MLR), generalized additive model (GAM), regression tree

(RT)) and (2) geostatistical techniques (e.g., ordinary kriging (OK), simple kriging (SK),

universal kriging (UK)) (Burgess and Webster, 1980; Moore et al., 1993; Odeh et al.,

1994; McBratney et al., 2000). Non-geostatistical techniques have been used to quantify

the relationship between soil properties and state factors accounting for the

deterministic portion of the total variation “µ(x)” (Figure 2-1). Geostatistical methods,

conversely, have been used to quantify changes in soil properties over distance

accounting for the spatially dependent stochastic portion of the total variation “ɛ’(x)”

(Figure 2-1). These two generic approaches were combined to create hybrid techniques

(i.e., non-stationary geostatistical methods) (Wackernagel, 2003), in the mid-1990s.

While the non-geostatistical part detects the deterministic part of the total variation, the

geostatistical part quantifies the spatially dependent stochastic part of the total variation.

A number of hybrid techniques have been developed over the years, including

universal kriging, or kriging with internal drift (UK) (Webster and Burgess, 1980), and

kriging with external drift (KED) (Goovaerts, 1997). Odeh et al. (1994, 1995) coined

the term RK and introduced RK type A, B and later RK type C.

RK type A, which is called “kriging combined with regression” (Knotters et al.,

1995), involves kriging of the predicted values. In other words, multivariate

regression is first applied to predict the values at unvisited sites. This is followed by the

kriging of the regressed values.

RK type B, which is also called “kriging with guess field” (Ahmed and De Marsily,

1987), is obtained by kriging both the regressed values and the residuals arising from the

regression and summing the results to create the final map.

RK type C (Odeh et al., 1995), which is called “kriging after detrending”

(Goovaerts, 1999), is defined as the sum of the regressed values and kriged residuals

from the regression. RK type C differs from type B in that only the residuals are

kriged to obtain the final prediction. An extensive review of hybrid kriging

techniques can be found elsewhere (Knotters et al., 2010).
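
In the notation of Equation 2-1, the RK type C predictor described above can be written compactly. The expression below is a sketch under assumed notation that does not appear in the thesis: $q_k(x_0)$ are the $p$ covariate values at an unvisited location $x_0$ (with $q_0(x_0) = 1$ for the intercept), $\hat{\beta}_k$ are the estimated regression coefficients, $e(x_i)$ are the regression residuals at the $n$ sampled locations, and $\lambda_i(x_0)$ are the kriging weights derived from the residual variogram.

```latex
\hat{Z}_{RK}(x_0) = \hat{\mu}(x_0) + \hat{\varepsilon}'(x_0)
                  = \sum_{k=0}^{p} \hat{\beta}_k \, q_k(x_0)
                  + \sum_{i=1}^{n} \lambda_i(x_0) \, e(x_i),
\qquad e(x_i) = Z(x_i) - \hat{\mu}(x_i)
```

The first sum is the regression estimate of the trend and the second sum is the kriged residual, so the two terms correspond to the deterministic and the spatially dependent stochastic components of Equation 2-1.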

RK type C is one of the most widely used hybrid spatial interpolation methods

in soil science for predicting soil properties (Minasny and McBratney, 2007). The steps

to execute RK are shown in Figure 2-2.

First, soil and ancillary environmental data are collected for a given study region.

The next step is to compute a regression between the state factors and the target soil

property. Then the trend model, identified by the regression equation, is subtracted from

Z(x) and residuals are quantified. The residuals from the trend are treated as spatially

correlated stationary random variables and are interpolated by kriging. Finally, the regression

estimates and the kriged residual values are summed to create the final map. In their review

article on DSM, McBratney et al. (2003) called this approach SCORP kriging and presented it as

a default modeling technique. Hengl et al. (2004) presented the general framework for RK. Lark

et al. (2006) articulated the problem with estimating the residual variogram and

concluded that RK is statistically suboptimal and offered the restricted maximum

likelihood-empirical best linear unbiased predictor (REML-EBLUP) method as

mathematically unbiased alternative to RK. However, Minasny and McBratney (2007)

compared the prediction accuracy of RK and REML-EBLUP with different datasets and

the performance of both techniques are found quite similar in different case studies.

Authors concluded that although RK is biased from a mathematical point of view it is

performing equally well as the unbiased counterpart (REML-EBLUP). This finding has

contributed to the popularity of RK which is much easier to implement than the

mathematical complex REML-EBLUP.
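As a rough illustration of the RK type C steps described above (regression, kriging of the residuals, summation), the following Python sketch may be helpful. It is not the implementation used in any of the reviewed studies. It assumes an ordinary least squares trend, an exponential variogram whose parameters are supplied by the user rather than fitted, and ordinary kriging of the residuals; all function and parameter names are hypothetical.

```python
import numpy as np

def exp_variogram(h, nugget, psill, a_range):
    """Exponential semivariogram (assumed practical-range form); gamma(0) = 0."""
    g = nugget + psill * (1.0 - np.exp(-3.0 * h / a_range))
    return np.where(h > 0, g, 0.0)

def ordinary_kriging(xy_obs, z_obs, xy_new, nugget, psill, a_range):
    """Ordinary kriging of z_obs to xy_new with a user-supplied variogram."""
    n = len(z_obs)
    d_obs = np.linalg.norm(xy_obs[:, None, :] - xy_obs[None, :, :], axis=2)
    # Kriging system: semivariances bordered by the unbiasedness constraint.
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = exp_variogram(d_obs, nugget, psill, a_range)
    A[n, n] = 0.0
    d_new = np.linalg.norm(xy_obs[:, None, :] - xy_new[None, :, :], axis=2)
    b = np.ones((n + 1, len(xy_new)))
    b[:n, :] = exp_variogram(d_new, nugget, psill, a_range)
    weights = np.linalg.solve(A, b)[:n, :]  # drop the Lagrange multiplier row
    return weights.T @ z_obs

def regression_kriging(xy_obs, X_obs, z_obs, xy_new, X_new,
                       nugget, psill, a_range):
    """RK type C sketch: OLS trend plus ordinary-kriged residuals."""
    # Step 1: regression between covariates and the target property.
    D_obs = np.column_stack([np.ones(len(z_obs)), X_obs])
    beta, *_ = np.linalg.lstsq(D_obs, z_obs, rcond=None)
    # Step 2: subtract the trend to obtain residuals.
    resid = z_obs - D_obs @ beta
    # Step 3: krige the residuals to the prediction locations.
    resid_new = ordinary_kriging(xy_obs, resid, xy_new, nugget, psill, a_range)
    # Step 4: sum the regression estimate and the kriged residuals.
    D_new = np.column_stack([np.ones(len(xy_new)), X_new])
    return D_new @ beta + resid_new
```

In practice the residual variogram would be estimated from the residuals themselves, which is precisely the step that Lark et al. (2006) identified as problematic; the sketch sidesteps this by taking the parameters as given.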

The hybrid method is reduced to ordinary kriging if no linear or non-linear

relationships are present between phenomena of interest and auxiliary variables. On the

other hand, if no autocorrelation in residuals is present then the hybrid method is

reduced to the (multiple linear) regression (Vanwalleghem et al., 2010). Multiple DSM

studies showed that RK outperformed geostatistical and non-geostatistical methods

(Bishop and McBratney, 2001; Carré and Girard, 2002; Odeh et al., 1994; Odeh et al.,

1995; Odeh and McBratney, 2000; Triantafilis et al., 2001; Rivero et al. 2007).

Several theoretical and applied aspects of RK have been discussed in the

literature. However, there is no systematic, extensive review of RK to predict soil

properties and classes. The objectives of the study are as follows:

1. Review studies that utilized RK to predict soil properties and classes at multiple spatio-temporal scales ranging from field to regional scale.

2. Identify the strengths and weaknesses of RK studies.

3. Quantify the factors affecting the accuracy of RK.

4. Document the development of RK through the last decade and characterize future trends.

2.2 Material and Methods

Three high-quality international soil science journals were selected: “Geoderma”, “Soil Science Society of America Journal” and “Catena”. The time period of 2004 to 2014 was identified to review DSMM studies that utilized RK as a spatial interpolation method.

“Regression kriging and soil properties or classes” were defined as the key words to

gather articles that employed RK as an interpolation method to predict soil properties

and classes. All retrieved articles were thoroughly analyzed. As a result of conscientious

examination, a total of 40 articles were gathered to be included in this review. Multiple

criteria were used to assess RK studies, including factors that affect the accuracy of

prediction performance, such as sample size, area of extent, soil properties and

classes, sample depth, regression type, and auxiliary variables. Each soil model was evaluated as a separate case to quantify results accurately and reliably. Hence, 142 cases from the 40 articles are included in this extensive review.

The following criteria were considered in order to characterize the individual RK

studies:

i) Data collection process:

• Geographic location of soil attributes or classes
• Area extent
• Grain size
• Target soil properties or classes
• Sampling design
• Sample size and density
• Sample depth(s)
• Auxiliary environmental variables which are explicitly specified SCORP factors: soils (S), climate (C), organism/vegetation (O), relief (R), and parent material (P)

ii) Model development process:

• Transformation of target soil properties
• Factor analysis of SCORP factors
• Regression type which is used to explore the relationship between target soil property or class and SCORP factors
• Variogram model
• Nugget to sill ratio (N:S)
• Spatial autocorrelation range

iii) Validation process:

• Training (T) and validation (V) size of the target soil property/class
• Coefficient of variation (CV%) of training dataset
• Root mean squared deviation (RMSD) (kg C m-2)
• Coefficient of determination (R2) of final prediction

2.3 Results

2.3.1 Spatial Scale

2.3.1.1 Geographic region

Out of 142 cases, RK studies were conducted in the following geographic

regions: 30% (N=42) in Europe, 24% (N=34) in the U.S., 20% (N=29) in Australia, 15%

(N=22) in China and 11% (N=15) in the remaining regions (Africa, Canada, India,

Middle East, and South America). The increase in the total number of RK studies is shown in Figure 2-3. Over time the locations for RK studies have become more diverse around the world, revealing that the approach is well recognized among DSMM practitioners. While Europe, the U.S. and Australia, with their major DSM research groups, are the most prominent geographic locations for RK, training and DSM workshops have contributed to the spread of RK studies around the globe. With the recognition of 2015 as the International Year of Soils, DSMM studies utilizing RK appear to be on the rise.

2.3.1.2 Area extent

The 142 cases within the 40 articles were categorized under four spatial extents: field (smaller than 0.25 km2), local (between 0.25 and 10^4 km2), regional (between 10^4 and 10^7 km2) and global (wider than 10^7 km2) according to a classification proposed by Thompson et al. (2012).
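The extent classes above can be expressed as a small lookup. The sketch below is illustrative only; the function name is hypothetical, and the treatment of values falling exactly on a class boundary is an assumption, since the text only gives the ranges.

```python
def classify_extent(area_km2):
    """Assign a spatial-extent class from the area in km2.

    Thresholds follow the ranges quoted in the text (0.25, 10^4, 10^7 km2).
    Assigning boundary values to the coarser class is an assumption.
    """
    if area_km2 <= 0:
        raise ValueError("area must be positive")
    if area_km2 < 0.25:
        return "field"
    if area_km2 < 1e4:
        return "local"
    if area_km2 < 1e7:
        return "regional"
    return "global"
```

For example, the areas cited in this section (0.042, 432, 6,157 and 9,600,000 km2) would fall into the field, local, local and regional classes, respectively.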

A total of 20.4% (N=29) of the cases covered a field extent, more than half (58.5%; N=83) covered a local extent, 16.9% (N=24) covered a regional extent, and none covered a global extent (Appendix). Although Hengl et al. (2014) used RK to model and map some soil properties and classes at global extent, it was not among the reviewed studies.

The area of extent ranged from the finest, with 0.042 km2 (Dlugoß et al., 2010) and 0.06 km2 (Baxter and Oliver, 2005), through coarser extents, such as 432 km2 (Rivero et al., 2007) and 6,157 km2 (Shi et al., 2011), up to the coarsest, covering large soil regions of millions of km2 (4,217,241 km2 by Lado et al. (2008) to 9,600,000 km2 by Li et al. (2013)). Out of the 40 articles, 10% (N=4) conducted their studies within more than one spatial coverage to investigate the effect of area extent for either the same or different soil properties and classes (Baxter and Oliver, 2005; Simbahan et al., 2006; Minasny and McBratney, 2007; Poggio et al., 2010). Minasny and McBratney (2007) tested the prediction capability of restricted maximum likelihood-empirical best linear unbiased predictor (REML-EBLUP), RK and OK in four different areas.

Poggio et al. (2010) modeled the spatial uncertainty of interpolated values of available

water capacity (AWC) at three different nested extents: catchment, regional and

national. Simbahan et al. (2006) assessed soil organic carbon (SOC) stocks in three

large no-till fields.

2.3.1.3 Grain size

A total of 142 cases within the 40 reviewed articles had a grain size (i.e., spatial

resolution) ranging from 0.5 m up to 5,000 m. Out of 142 studies 38.7% (N=55) utilized

relatively finer resolutions ranging from 0.5 m to 25 m, 29.6% (N=42) employed 30 m

resolution, 16.2% (N=23) used relatively coarser resolution from 50 to 5,000 m and

15.5% (N=22) could not be grouped into any of the categories due to a lack of the

necessary information reported in the articles (Appendix). Grunwald (2009) observed a

similar trend in a comprehensive review of DSM studies. She found that as the spatial

domain size increases, the cell size also increases. Minasny et al. (2013) reported that

grid spacing of digital soil maps increases logarithmically with area of extent. Among the

reviewed studies, Chaplot et al. (2010) predicted the thickness of A horizon with the cell

size of 0.5 m in a 0.003 km2 area, Umali et al. (2012) predicted standard soil properties

with a 5 m resolution in a 0.056 km2 area, Li et al. (2013) explored the spatial variability

of soil organic matter (SOM) throughout China with 9,600,000 km2 area and a 1 km cell

size, and Lado et al. (2008) interpolated several heavy metal concentrations within a

large area (4,217,241 km2) with a 5 km spatial resolution. Careful examination is

needed when selecting an appropriate spatial resolution for a soil map.

2.3.2 Target Soil Properties and Classes

Out of 142 examined cases, 38.7% (N=55) were concerned with SOM, soil

carbon, carbon stocks and carbon fractions to supply information dealing with global

environmental problems including climate change, desertification and soil security. For

example, Karunaratne et al. (2014) predicted fractions of SOC, namely resistant organic

carbon (ROC), humus organic carbon (HOC) and particulate organic carbon (POC) at a

100 cm depth at local scale. Vasques et al. (2010b) modeled and mapped the total soil

carbon stocks from 0 to 180 cm within a 3,580 km2 subtropical region. Quantifying the spatial distribution of soil carbon across multiple scales is urgently needed because soil carbon is the largest manageable carbon pool compared to carbon in the biosphere and atmosphere (Lal et al., 2004). Accurately assessing the soil carbon within a study region is profoundly important for better decision-making in sustainable development (Minasny et al., 2013). Accordingly, the spatial distribution of soil carbon has been of great

interest as exemplified by the increasing number of publications in mapping soil carbon

stocks globally and nationally (Grunwald, 2009).

A total of 23.2% (N=33) of RK-DSMM studies addressed the issue of land

degradation and environmental concern by focusing on soil nutrients, mainly total

phosphorus (TP) (Roger et al., 2014), nitrogen (TN) (Baxter and Oliver, 2005; Rivero et

al., 2007), heavy metal concentration (Lado et al., 2008; Lin et al., 2011; Shi et al.,

2011), soil health, land degradation (Lamsal et al., 2006; Watt and Palmer, 2012) and

salinization (Douaoui et al., 2006) (Appendix). These environmentally-centered DSMM

research studies are responding to critical societal needs including environmental

quality assessment, soil degradation, and health.

A total of 11.3% (N=16) of the 142 observed cases predicted chemical soil

properties, such as pH (Hengl et al., 2004; Umali et al., 2012; Sun et al., 2012; Malone

et al., 2014) and electrical conductivity (EC) (Umali et al., 2012). 16.9% (N=24) of 142

cases predicted physical soil properties, such as depth to soil horizons (Chaplot et al.,

2010; Vanwalleghem et al., 2010) or clay, sand and silt (Minasny and McBratney, 2007;

Mora-Vallejo et al., 2008; Umali et al., 2012; de Carvalho Junior et al., 2014; Niang et

al., 2014).

Only 8.5% (N=12) of the RK studies investigated hydrological soil properties,

such as available water capacity (Poggio et al., 2010) and soil moisture (Herbst et al.,

2006; Takagi and Lin, 2012).

Surprisingly, out of 142 cases, only very few (1.4%; N=2) predicted soil

categorical data. Hengl et al. (2007b) predicted soil texture classes and soil groups in

Iran and (2004) the presence or absence of a plant species (Taxus baccata L.) in each

grid cell using logistic regression. These numbers are significantly lower compared with

previous review studies. McBratney et al. (2003) found that 30% of reviewed studies

predicted soil categories and Grunwald (2009) observed 15.6%. Hence, the significant

decrease in frequency of soil classes studied shows a lack of preference by the DSMM

community for using RK to predict soil classes. One of the reasons may be the

satisfactory prediction accuracy of an increasing number of disaggregation methods to

predict soil categorical information, such as PROPR (Digital Soil Property Mapping

using Soil Class Probability Raster) and DSMART (Disaggregation and harmonization

of Soil Map Units Through Resampled Classification Trees) (Odgers et al., 2014; 2015)

and machine learning methods, such as Classification and Regression Tree (CART)

and Random Forest (RF). Another reason may be a declining interest in mapping of

taxonomic classes since many soil class maps have been already produced in the past

especially by governmental organizations.

Out of the 40 reviewed articles, 60% (N=24) predicted only one specific soil property/class

(e.g., Kuriakose et al., 2009; Poggio et al., 2010; Li, 2010; Zhang et al., 2012; Li et al.,

2013), whereas 40% (N=16) modeled and mapped more than one soil property (e.g.,

Lado et al., 2008; Vasques et al., 2010b; Shi et al., 2011; Sun et al., 2012; Umali et al.,

2012; Roger et al., 2014). For instance, Chai et al. (2008) compared the performance of

REML-EBLUP with that of RK to predict SOM in the presence of different external trends.

Lado et al. (2008) modeled and mapped the distribution of eight critical heavy metals

(arsenic, cadmium, chromium, copper, mercury, nickel, lead and zinc) in the topsoil

using 1,588 georeferenced samples from the Forum of European Geological Surveys

Geochemical database (26 European countries). Roger et al. (2014) predicted the

spatial distribution of soil P forms (total, organic, inorganic, and available P) at regional

scale.

2.3.3 Sampling

2.3.3.1 Sampling design

Out of 142 models, 21.1% (N=30) of them used regular grid sampling with a

sample spacing ranging from 2 m (Chaplot et al., 2010) to 10 km (Poggio and Gimona,

2014), while 15.5% (N=22) of all cases sampled based on stratified random sampling

design (e.g., Karunaratne et al., 2014; Vasques et al., 2010a), 13.4% (N=19) of all

cases used a random sampling design at field scale (e.g., Umali et al., 2012) and at

regional scale (e.g., Kumar et al., 2012), and 9.2% (N=13) of all cases employed a

purposive sampling scheme (e.g., Kuriakose et al., 2009; Takagi and Lin, 2012)

(Appendix). Lastly, the conditioned Latin Hypercube sampling (cLHS) design was

employed in only 2.8% (N=4) of the 142 studies (e.g., Levi and Rasmussen, 2014;

Minasny and McBratney, 2007). The cLHS may directly contribute to the performance of

prediction because it maximizes the efficiency of sampling by enabling users to

adequately characterize the variability in a target geographic region for a given target of

interest (Minasny and McBratney, 2006). The small scale variability may not be

captured if the sample spacing is larger than the effective range (McBratney, 1998).

Unfortunately, 38.0% (N=54) of the studies did not specify the sampling design. As the

area of extent increased from field to regional scale, the sampling design shifted from

model-based to design-based (probability-based) sampling. Furthermore, as spatial

coverage increased, the preference of regular grid sampling decreased as follows:

43.3% (N=13) of the cases at field scale, 40% (N=12) at local scale and 10% (N=3) at

regional scale.

2.3.3.2 Sample size-density

Sample density is a function of the area of extent and the number of samples. Typically, as the area of extent increases the sampling density drastically decreases. For instance, Watt and Palmer (2012) predicted C:N ratio with 0.0059 samples per km2 at regional scale (1,949,359 km2), Sun et al. (2012) predicted clay, pH and SOC with 25.9 samples per km2 at local scale (38 km2), and Takagi and Lin (2012) predicted soil moisture with 1,341.7 samples per km2 at field scale (0.079 km2). No consistent sampling density or spatial resolution was evident in the reviewed studies. However, an increase in sampling density is likely to increase the reliability of the prediction mapping.
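The relationship between sample size, area of extent and density can be made concrete with a short sketch. The function names are hypothetical, and the nominal spacing assumes samples spread evenly on a square grid, which is only an approximation for most of the sampling designs reviewed here.

```python
import math

def sample_density(n_samples, area_km2):
    """Samples per km2 for a given sample size and area of extent."""
    if n_samples <= 0 or area_km2 <= 0:
        raise ValueError("sample size and area must be positive")
    return n_samples / area_km2

def nominal_spacing_km(n_samples, area_km2):
    """Approximate spacing (km) if samples lay on a square grid (assumption)."""
    return math.sqrt(area_km2 / n_samples)
```

With a fixed number of samples, a hundredfold increase in area lowers the density a hundredfold and increases the nominal spacing tenfold, which is consistent with the drastic decrease in density from field to regional scale noted above.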

2.3.3.3 Sample depth(s)

Out of 40 reviewed articles, 82.5% (N=33) were conducted at a single depth ranging from 10 to 100 cm: 0-10 cm (Umali et al., 2012; Watt and Palmer, 2012), 0-20

cm (Chai et al., 2008; Li, 2010; Shi et al., 2011; Roger et al., 2014), 0-25 cm (Zhang et

al., 2012), 0-30 cm (Baxter and Oliver, 2005; Lamsal et al., 2006; Mora-Vallejo et al.,

2008; Mishra et al., 2012; Karunaratne et al., 2014; Levi and Rasmussen, 2014), 0-100

cm (Chaplot et al., 2010; Kumar et al., 2012; Poggio and Gimona, 2014). Only 17.5%

(N=7) of the articles examined multiple depths (Vasques et al., 2010a; Takagi and Lin,

2012; Sun et al., 2012; de Carvalho Junior et al., 2014; Malone et al., 2014). For

example, Malone et al. (2014) predicted pH to a depth of 200 cm at 6 different horizons

using the mass preserving spline. Out of 142 cases, 15% (N=22) did not report the

horizon depth(s) (e.g., Hengl et al., 2004; Douaoui et al., 2006; Hengl et al., 2007a;

Minasny and McBratney, 2007; Lado et al., 2008; Poggio et al., 2010; Li et al., 2013)

(Appendix). In summary, the majority of DSMM-RK studies only focused on mapping of

the topsoil. This is rather reductionistic because biogeochemical processes occur within

the whole soil profile. The critical zone of the Earth surface extends far beyond the top

0-30 cm and for a more complete understanding of pedogenesis subsurface horizons

would also need to be considered.

2.3.4 SCORP Factors

Different combinations of the SCORP factors were explicitly used to predict soil

properties or classes. Out of 142 different cases, the frequency of SCORP factors used to predict soil properties and classes is as follows: 43.7% (N=62) incorporated the S factor, 34.5% (N=49) the C factor, 64.1% (N=91) the O factor, 86.6% (N=123) the R factor and 35.2% (N=50) the P factor. The findings are in line with the review of DSM

studies by McBratney et al. (2003). They found the frequency of SCORP factors was as follows: S (35%), C (5%), O (25%), R (80%) and P (25%). The increased usage of the O and C factors may be a function of the increase in carbon modeling studies and the increasing importance of the O and C factors for modeling and mapping of carbon, and/or an increase in readily available data for digital soil mapping practitioners. For example, Hijmans et al. (2005) produced global raster layers of mean annual temperature, precipitation and bioclimatic variables at 1 km resolution for public use. DEMs derived from the Shuttle Radar Topography Mission (SRTM) have been used extensively at differing resolutions (30 to 1000 m) as the first choice for pedometricians desiring to derive primary and secondary terrain attributes to model soil properties and classes (McBratney et al., 2003). The availability of high resolution DEMs (10, 30 and 90 m) at global scales may have contributed to the high percentage of R factors used in DSMM, when compared to covariates for other SCORP factors. Grunwald (2009) found that 29.3% of all investigated DSMM studies (N=75) documented R factor as covariates.

Different combinations of SCORP factors have been utilized for modeling different soil properties. Some authors utilized a subset of the ancillary variables (e.g., Vanwalleghem et al. (2010) used S and R factors and Simbahan et al. (2006) used S, O and R factors), while others included all SCORP factors (e.g., Vasques et al. (2010b) predicted fractions of soil carbon). The impact of different sets of auxiliary variables on the prediction accuracy was addressed by Li (2010), who examined the effect of topography, organism, climate and parent material on the accuracy of soil predictions.

Surprisingly, adding environmental factors deteriorated the accuracy of prediction due to

an inappropriate spatial scale of environmental variables and the total number of soil

observations. Zhang et al. (2012) investigated whether the inclusion of categorical

variables improves the accuracy of SOM predictions. They concluded that the prediction

accuracy was improved with the inclusion of soil genetic types.

Rivero et al. (2007) investigated auxiliary variables obtained at different grain sizes to determine whether a finer resolution of environmental variables increases the accuracy of modeling. In that study, they incorporated a number of indices obtained from 30 m spatial resolution Landsat 7 Enhanced Thematic Mapper Plus (ETM+) imagery and the same number of indices from 15 m spatial resolution Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) imagery. The RK model with ASTER derived indices yielded better prediction accuracy in terms of ME and RMSD than the RK model with ETM+. However, there is no explicit directive with regard to the spatial scale of remote sensing derived environmental variables and prediction efficiency, and finer resolution images may not always increase the performance of soil predictions (Kim et al., 2014; Hong, 2011). At resolutions coarser than 40 m, though, erratic behavior of terrain variables and certain landscape attributes becomes apparent, leading to a loss of predictive capability (McKenzie and Ryan, 1999; Gessler et al., 2000).

2.3.5 Preprocessing

2.3.5.1 Logarithmic transformation

Parametric statistical methods such as Generalized Linear Model (GLM) and

Stepwise Multiple Linear Regression (SMLR) assume the Gaussian distribution of the

dependent variables; however, the distribution of soil properties generally shows a

skewed (right or left) distribution. To overcome this issue, 62.7% (N=89) of the 142 cases explicitly specified the use of a logarithmic transformation to reveal the deterministic part of the total variation while ensuring the normality of measured data when using parametric methods.

With non-parametric statistical methods, there is no requirement to transform the target variable (Lamsal et al., 2006; Kumar et al., 2012; Li et al., 2013; Poggio and Gimona, 2014; Niang et al., 2014). Webster and Oliver (2007) articulated that a careful examination is needed because transformation can increase the model complexity and converting transformed output back to original units can be problematic.
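One reason back-transformation can be problematic is that simply exponentiating a mean computed on the log scale returns the geometric mean, which is lower than the arithmetic mean for skewed data. The sketch below illustrates this with made-up values; it is an assumed example and not necessarily the specific issue Webster and Oliver (2007) had in mind. The function name is hypothetical.

```python
import numpy as np

def log_transform(z):
    """Natural-log transform; requires strictly positive values."""
    z = np.asarray(z, dtype=float)
    if np.any(z <= 0):
        raise ValueError("log transform requires positive values")
    return np.log(z)

# Hypothetical right-skewed measurements (illustrative values only).
z = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = log_transform(z)

# Naive back-transformation of the log-scale mean gives the geometric mean.
naive_back_transformed_mean = np.exp(y.mean())
arithmetic_mean = z.mean()
```

Here the naive back-transformed mean is clearly smaller than the arithmetic mean of the original values, so predictions made on the log scale need a bias-aware back-transformation before they are reported in original units.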

2.3.5.2 Factor analysis

A total of 33.8% (N=48) of the 142 reviewed cases explicitly addressed the multicollinearity issue arising from the correlation between certain original SCORP factors by applying principal component analysis (PCA; Wallis, 1965) before the statistical analysis. With the advance of Geographic Information Systems (GIS), Global Positioning System (GPS), and remote and proximal sensing technologies, exhaustive sets of environmental variables may be compiled to represent the environment in which soils form. Selecting auxiliary variables based on the researchers’ domain knowledge of soil-environment processes is now a common practice in soil-landscape modeling, but it can lead to biased and suboptimal model performance (Grunwald, 2009). Systematic selection of covariates may be addressed by modern statistical methods such as Boruta (Xiong et al., 2014a). PCA has been used extensively in a number of disciplines for a variety of reasons, not least due to its simplicity. Investigation of modern statistical algorithms as an alternative to PCA may yield improvements in prediction efficiency.
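To illustrate how PCA removes multicollinearity among covariates, a minimal sketch using a singular value decomposition is given below. It assumes the covariates are standardized before decomposition, which is a common but not universal choice, and the function name is hypothetical.

```python
import numpy as np

def pca_scores(X, n_components):
    """Return uncorrelated component scores and explained-variance ratios.

    X is an (n_samples, n_covariates) array. Covariates are standardized
    to zero mean and unit variance first (an assumption of this sketch).
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    explained = s**2 / np.sum(s**2)
    return scores, explained[:n_components]
```

The resulting score columns are mutually uncorrelated by construction and can replace the original, correlated SCORP covariates as predictors in the regression part of RK.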

2.3.6 Regression Type to Quantify Deterministic Variation

62.7% (N=89) of 142 reviewed studies utilized SMLR, 8.5% (N=12) utilized

REML-EBLUP, and 28.9% (N=41) used one of the following: Logistic Regression (LR),

Generalized Linear Model (GLM), Classification and Regression Tree (CART),

Generalized Additive Model (GAM) and Geographically Weighted Regression (GWR) (Appendix). Evidently, more than half of the studies followed the general framework presented by Hengl et al. (2004). A number of studies compared the accuracy of RK to other pure geostatistical or hybrid methods. For instance, Levi and Rasmussen (2014) compared OK and RK; Herbst et al. (2006) compared kriging with external drift (KED), OK and RK; Li (2010) compared OK, RK and universal kriging (UK); Baxter and Oliver (2005) compared OK, RK and cokriging (COK). On the other hand, numerous studies utilized different advanced statistical data mining algorithms to determine trends (drift) between soil properties or classes and SCORP factors. For example, Kumar et al. (2012) incorporated geographically weighted regression (GWR), Malone et al. (2014) employed CUBIST, Lin et al. (2011) used logistic regression (LR) and Lamsal et al. (2006) utilized CART for the regression portion of RK. Additionally, these studies compared the final accuracy of these novel hybrid methods with global RK, which used SMLR for the regression part of RK. Moreover, some authors presented novel approaches to modifying the kriging part of RK. For instance, Leopold et al. (2006) applied block kriging (BK) to the residuals from the regression part and Sun et al. (2012) demonstrated local RK as an enhanced version of RK.

2.3.7 Variogram

2.3.7.1 Model type

In order to reveal the spatial correlation present in soil properties, the residuals separated from the regression part of RK were interpolated using mainly two types of variogram models: spherical and exponential. Out of 142 cases, 48.6% (N=69) and 28.2% (N=40) utilized exponential and spherical variograms, respectively. This finding is in line with what Minasny and McBratney (2005) articulated: exponential models represent most soil properties and are stable when nonlinear least square fitting

is applied. Also, the Gaussian semivariogram model is generally unrealistic and leads to

unstable kriging systems and artifacts in the estimated maps (Wackernagel, 2003).
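For reference, the two variogram models most often used in the reviewed studies can be written as short functions. The sketch below uses common textbook parameterizations; the exponential model is written in its practical-range form (reaching about 95% of the partial sill at the range), which is an assumption, since software packages differ in this convention. Function and parameter names are hypothetical.

```python
import numpy as np

def spherical_variogram(h, nugget, psill, a_range):
    """Spherical model: reaches the sill exactly at the range."""
    h = np.asarray(h, dtype=float)
    hr = np.minimum(h / a_range, 1.0)
    g = nugget + psill * (1.5 * hr - 0.5 * hr**3)
    return np.where(h > 0, g, 0.0)

def exponential_variogram(h, nugget, psill, a_range):
    """Exponential model (practical-range form): approaches the sill asymptotically."""
    h = np.asarray(h, dtype=float)
    g = nugget + psill * (1.0 - np.exp(-3.0 * h / a_range))
    return np.where(h > 0, g, 0.0)
```

Both functions return zero at zero lag, so the nugget appears as a discontinuity at the origin, and the sill equals the nugget plus the partial sill.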

2.3.7.2 N:S ratio

While the nugget (N) may be interpreted as the signature of the variability from uncorrelated stochastic processes or microscale processes, the sill (S) is the sum of the nugget and the partial sill and represents the total variation (Oliver and Webster, 2014). The N:S ratio has been used to quantify the strength of spatial structure, or the unexplainable portion of short-range variability that is not quantified by a variogram (Zhu and Lin, 2010). An N:S ratio of 0.5 signifies that 50% of the variation is unexplainable or spatially independent, stochastic variation. If the N:S ratio is less than or equal to 25%, the variable is considered strongly (S) spatially dependent; between 25 and 75%, moderately (M) spatially dependent; and greater than 75%, weakly (W) spatially dependent (Cambardella et al., 1994). However, it should be noted that the cut-off values are arbitrary, and there is no statistical basis for the 25 and 75% N:S thresholds.
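The classification attributed above to Cambardella et al. (1994) can be sketched as a simple function. The function name and the returned labels are hypothetical; the thresholds are those quoted in the text.

```python
def spatial_dependence_class(nugget, sill):
    """Classify spatial dependence from the nugget-to-sill (N:S) ratio.

    The sill here is the total sill (nugget plus partial sill).
    Thresholds of 0.25 and 0.75 follow the classes quoted in the text.
    """
    if sill <= 0 or nugget < 0 or nugget > sill:
        raise ValueError("require 0 <= nugget <= sill and sill > 0")
    ratio = nugget / sill
    if ratio <= 0.25:
        return "strong"
    if ratio <= 0.75:
        return "moderate"
    return "weak"
```

As the text notes, the thresholds are arbitrary, so a ratio just above or below a cut-off should not be read as a meaningful difference in spatial structure.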

Out of 142 cases, 38.0% (N=54) are moderately spatially dependent, 21.1%

(N=30) are strongly spatially dependent and 9.2% (N=13) are weakly spatially

dependent. The rest of the reviewed cases did not explicitly or implicitly specify the N:S ratio. The N:S ratio can be a significant signal in deciding which spatial interpolation method should be used; for example, Kravchenko (2003) observed that ordinary kriging yielded more accurate predictions of soil properties with an N:S ratio less than 0.1.

2.3.7.3 Range

The range is commonly the most important semivariogram parameter with regard to the spacing between sample locations (Mulla and McBratney, 2001). At separation distances greater than the range, sampled points are not spatially correlated; this has great implications for sampling design. Thus, to create an effective variogram, the sample spacing should not exceed the range of the semivariogram. Additionally, sample spacing should be within ¼ to ½ of the range (Flatman and Yfantis, 1984).
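The spacing guidance above can be turned into a simple check when planning a survey. The sketch below is illustrative; the function name and labels are hypothetical, and how to treat spacings denser than one quarter of the range is an assumption, since the text does not address that case.

```python
def check_sample_spacing(spacing, variogram_range):
    """Compare a planned sample spacing with the semivariogram range.

    Both arguments must be in the same length unit.
    """
    if spacing <= 0 or variogram_range <= 0:
        raise ValueError("spacing and range must be positive")
    ratio = spacing / variogram_range
    if ratio > 1.0:
        return "exceeds_range"   # samples would not be spatially correlated
    if 0.25 <= ratio <= 0.5:
        return "recommended"     # within 1/4 to 1/2 of the range
    return "within_range"        # below the range but outside the 1/4 to 1/2 band
```

For a range of 100 m, for instance, a 30 m spacing would fall in the recommended band, whereas a 150 m spacing would exceed the range.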

To reveal the general trend for spatial range in the reviewed studies, the area of extent is categorized into three nested classes (field, local and regional); additionally, soil attributes are grouped into five main groups for investigation. The average autocorrelation range values for the reviewed studies are given in Table 2-1. A statistical investigation quantifying variogram attributes in terms of range did not yield reliably specific results due to the unavailability of the range in many reviewed studies and the large amount of variability in modeled target soil attributes. Out of 142 cases, only 57.4% (N=81) reported their spatial range. In these cases, as the area of extent increases the spatial range drastically increases as a general trend regardless of which soil properties were used; however, the rate of increase does appear to be dependent on the soil properties.

For the phenomena of interest, pre-existing semivariogram range information is

useful in determining where useful information should be obtained, whether in the area

of interest, in sites nearby or within a site located in the same region. The range values

reported in this review may allow a researcher to easily formulate an initial hypothesis

for range at multiple areas of extent, soil properties and soil classes.

2.3.8 Validation

Mainly, jack-knife and cross-validation procedures were preferred to test the

performance of the DSMM studies. Out of 142 studies, 64.8% (N=92) split field

observation data randomly to create separate training and validation datasets. 31.0%

(N=44) of the reviewed studies used cross-validation by either leaving one out or by the

k-fold method. Out of 142 reviewed studies, only 4.2% (N=6) did not use any validation procedures. This result is markedly different from other review studies. Grunwald (2009) found that out of 90 investigated studies 21.1% used cross-validation, 46.7% used validation and 35.6% did not use any validation procedures. The large increase in the use of validation procedures shows that the importance of quantifying uncertainty is now established among DSMM practitioners.
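The two families of validation procedures mentioned above, a random split into training and validation sets (referred to in this review as jack-knife validation) and k-fold cross-validation, can be sketched as index generators. The function names and the seed handling are hypothetical; leave-one-out cross-validation corresponds to setting k equal to the number of samples.

```python
import numpy as np

def holdout_split(n, train_fraction, seed=0):
    """Randomly split indices 0..n-1 into training and validation sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(round(train_fraction * n))
    return idx[:n_train], idx[n_train:]

def kfold_indices(n, k, seed=0):
    """Yield (train, validation) index arrays for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        validation = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, validation
```

Fixing the seed makes a split reproducible, which matters when prediction accuracy is compared across methods on the same training and validation sets.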

Furthermore, the use of uncertainty measures in the reviewed articles was assessed. Of the 40 articles, 65.0% (N=26) utilized more than one of the following measures: Mean Error (ME), Mean Absolute Error (MAE), Root Mean Square Deviation (RMSD), Mean Squared Deviation Ratio (MSDR), Normalized Root Mean Square Deviation (NRMSD), Residual Prediction Deviation (RPD). A further 30.0% (N=12) employed only one of the above measures, and 5.0% (N=2) did not perform any uncertainty analysis.
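Three of the measures listed above (ME, MAE and RMSD) follow directly from the prediction errors at validation sites, as the sketch below shows. The function name is hypothetical, and the sign convention for ME (predicted minus observed) is an assumption, since the reviewed studies may define it either way.

```python
import numpy as np

def error_metrics(observed, predicted):
    """Compute ME, MAE and RMSD from paired observed and predicted values."""
    o = np.asarray(observed, dtype=float)
    p = np.asarray(predicted, dtype=float)
    e = p - o  # assumed sign convention: positive means over-prediction
    return {
        "ME": e.mean(),                    # bias
        "MAE": np.abs(e).mean(),           # average error magnitude
        "RMSD": np.sqrt((e ** 2).mean()),  # penalizes large errors more heavily
    }
```

Because ME lets positive and negative errors cancel, it is usually reported together with MAE or RMSD rather than on its own.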

A total of 64.8% (N=92) of the reviewed models assessed the accuracy of predicted soil properties and classes with the jack-knife validation procedure, in which soil samples were divided into two sets: training (T) and validation (V). However, there is no standard ratio to divide the original observed values. The proportion of the training dataset varied from 35% to 90%, while the proportion of the validation dataset varied from 10% to 65% of the total number of samples. The split ratio depended on both the statistical methods used and the minimum sample requirement for a reliable variogram. Thus, careful consideration should be given when splitting the original soil dataset into T and V datasets, especially if the sample size is low, because the means of the training and validation datasets may differ significantly. The minimum cut-off value for number of observations

necessary for accurate interpolation is dependent on the spatial characteristic of the

phenomena of interest including the sample distribution in a geographic space, the

strength of spatial dependence and the relationship with environmental factors.

Generally, the sample set in variogram development should be isotropic; 100 samples

at a minimum, 150 samples for a satisfactory result and 225 for a reliable result

(Webster and Oliver, 1992). Also, previous studies proved that REML-EBLUP may be a

useful technique when the sample size is smaller than 100. For example, Chai et al.

(2008) used REML-EBLUP with 70 (V) and 131 (T) sites. In spite of its drawbacks, the

REML method of estimating variogram parameters is still may have a valuable role to

play in pedometrics when practitioners have fewer than 100 data (Kerry and Oliver,

2007)
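The splitting issue described above can be illustrated with a minimal sketch. The function names, the 70/30 default and the fixed seed are illustrative assumptions, not values prescribed by the reviewed studies; the sketch only shows how one might check how far the T and V means drift apart after a random split.

```python
import random
from statistics import mean


def split_training_validation(values, training_fraction=0.7, seed=0):
    """Randomly split observations into training (T) and validation (V) sets.

    training_fraction and seed are illustrative assumptions; the reviewed
    studies used training shares between 35% and 90%.
    """
    if not 0.0 < training_fraction < 1.0:
        raise ValueError("training_fraction must be between 0 and 1")
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    n_training = round(len(shuffled) * training_fraction)
    return shuffled[:n_training], shuffled[n_training:]


def relative_mean_shift(training, validation):
    """Absolute difference between the T and V means, relative to the T mean."""
    return abs(mean(training) - mean(validation)) / abs(mean(training))
```

With small sample sizes, repeating the split with several seeds and inspecting `relative_mean_shift` gives a rough impression of how sensitive the two means are to the particular division.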

2.4 Discussion and Recommendations

2.4.1 Factors Affecting Performance of RK

To compare the performance of different RK models across the variety of scales, regions, soil properties and classes, the following criteria are considered essential: i) landscape heterogeneity, ii) sampling design, iii) sample size, iv) sample density or distribution of samples, v) strength of the correlation between target soil properties and SCORP factors (R²) and vi) nugget-to-sill (N:S) ratio as a measure of the strength of spatial dependence.

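Unit- and scale-dependent measures were removed in order to compare the accuracy of RK models in the reviewed studies. Thus, a normalization of RMSD was necessary. Li and Heap (2011) proposed NRMSD as follows:

$$\mathrm{NRMSD} = \frac{\mathrm{RMSD}}{\overline{y}_{\mathrm{validation}}} \qquad (2\text{-}2)$$

where $\overline{y}_{\mathrm{validation}}$ is the mean of the validation dataset.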

Since the basic statistical information needed to calculate NRMSD was not generally available in the reviewed studies, a simplification was made: the mean of the validation dataset was replaced by the mean of the full observed dataset. This is called standardized NRMSD (Haberlandt, 2007). A more reliable unit-free metric, the ratio of performance to inter-quartile range (RPIQ) (Bellon-Maurel et al., 2010), could have been used to compare the considered factors affecting prediction accuracy of the 142 models; however, almost none of the studies reported the RPIQ or the information needed to calculate it (i.e., the 25th and 75th percentiles and the standard deviation).
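The metrics discussed above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the review: the function names are hypothetical, RPIQ is taken here as the inter-quartile range of the observations divided by the RMSD (an assumption about the exact error term), and the quartile convention is one of several in common use.

```python
import math
from statistics import mean, quantiles


def rmsd(observed, predicted):
    """Root mean squared deviation between paired observations and predictions."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(mean(squared))


def nrmsd(observed, predicted, reference_mean=None):
    """RMSD divided by a mean, as in Equation 2-2.

    By default the mean of the validation observations is used. Passing the
    mean of the full observed dataset as reference_mean gives the simplified
    (standardized) variant described in the text.
    """
    denominator = mean(observed) if reference_mean is None else reference_mean
    return rmsd(observed, predicted) / denominator


def rpiq(observed, predicted):
    """Inter-quartile range of the observations divided by the RMSD.

    Assumption: RMSD stands in for the prediction error, and quartiles use
    the 'inclusive' method of statistics.quantiles; other quartile
    conventions give slightly different values.
    """
    q1, _, q3 = quantiles(observed, n=4, method="inclusive")
    return (q3 - q1) / rmsd(observed, predicted)
```

Because `rpiq` needs the individual validation observations (or at least their quartiles), it cannot be recovered from a published RMSD alone, which is the reporting gap noted above.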

Of 142 reviewed cases, 48.6% (N=71) included the essential information to calculate the NRMSD; therefore, the effects of sample size, sample density, sample design and N:S ratio on the prediction performance of RK were evaluated for only 71 cases. A basic statistical correlation analysis was performed in order to identify any trends between the considered factors listed above and the NRMSDs. However, the author found no discernible patterns between the above-mentioned criteria and NRMSD. This finding may be interpreted in two ways: either there is no pattern because all of the considered factors collectively affect the final accuracy of RK, or the result is a function of erratic behavior of the simplified NRMSD.

In order to prevent this discrepancy in the future, the following parameters could be reported by authors utilizing RK to model and map soil properties and classes: area of extent, sample design, sample depth(s), sample size (training and validation separately), SCORP factors, spatial resolution of final map, transformation methods, method of factorial analysis, regression type, coefficient of determination from the deterministic function, model type for the variogram, spatial autocorrelation range, N:S ratio, validation method and RPIQ as a reliable metric.

Out of 142 cases, 47.2% (N=69) reported the coefficient of variation (CV%).

Upon further evaluation, a logarithmic transformation was applied to CV% and NRMSD, which revealed a strong trend between the two: as the CV% increases, the performance of RK models decreases. Li and Heap (2011) found a similar trend between CV% and RK type C. Figure 2-4 shows that as the CV of the measured dataset increases, the accuracy of RK decreases.
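A trend of this kind can be quantified as sketched below. The text does not state which trend measure was used, so the Pearson correlation on log10-transformed values is an assumption, and the function name is hypothetical.

```python
import math
from statistics import mean


def log_log_correlation(cv_percent, nrmsd_values):
    """Pearson correlation between log10(CV%) and log10(NRMSD).

    Assumption: both inputs are strictly positive and paired by study case.
    """
    if len(cv_percent) != len(nrmsd_values):
        raise ValueError("inputs must have the same length")
    x = [math.log10(v) for v in cv_percent]
    y = [math.log10(v) for v in nrmsd_values]
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```

A value close to +1 would correspond to the pattern described above, where a larger CV% goes with a larger NRMSD and therefore lower RK accuracy.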

2.4.2 Regression Kriging as a Default Soil Mapping Method

2.4.2.1 Satisfactory performance of regression kriging over its competitors

Since the emergence of RK in soil science, hybrid methods, especially RK, have often yielded more accurate predictions than their competitors: geostatistical and non-geostatistical methods. RK is frequently used and has proven to be a robust, practical hybrid method. The reviewed studies within the last decade have reported that RK is superior to kriging with external drift (KED) (Herbst et al., 2006; Simbahan et al., 2006), cokriging (COK) (Baxter and Oliver, 2005; Rivero et al., 2007; Niang et al., 2014), ordinary kriging (OK) (Hengl et al., 2007a; Herbst et al., 2006; Hengl et al., 2004; Lado et al., 2008; Chai et al., 2008; Kuriakose et al., 2009; Dlugoß et al., 2010b; Watt and Palmer, 2012; Zhang et al., 2012), multiple linear regression (MLR) (Takagi and Lin, 2012; Mishra et al., 2010; Umali et al., 2012; Chaplot et al., 2010), generalized linear model (GLM), classification and regression trees (CART) (de Carvalho Junior et al., 2014; Lamsal et al., 2006; Vasques et al., 2010b), random forest (RF), and logistic regression (LR) (Lin et al., 2011). Theoretically, RK is a combination of a linear or nonlinear regression and the kriged residual (i.e., the variation left unexplained by the regression); thus, the accuracy of RK is likely superior to that of pure geostatistical interpolation methods and of simple or modern statistical methods, provided that the residuals show spatial correlation, whether weak, moderate or strong, throughout the area of interest.

2.4.2.2 Unsatisfactory performance of regression kriging over its competitors

Some authors, on the other hand, reported that RK did not improve prediction

performance when compared to OK (Li, 2010; Roger et al., 2014; Umali et al., 2012)

and MLR (Mora-Vallejo et al., 2008). Limitations adversely affecting the prediction

efficiency of RK are discussed by Hengl et al. (2007a). The body of literature emphasizes that the main reasons for the unexpectedly poor results of RK relative to OK and MLR are i) a limited number of soil observations that cannot accurately reflect variability in the area of interest, leading to an unreliable variogram, and ii) a poor relationship between target soil properties and auxiliary variables due to the unintentional exclusion of useful covariates, a lack of high-quality auxiliary variables, or an improper method choice that cannot capture the hierarchical, complex relationships between soil and auxiliary variables.

Firstly, the heterogeneity of the area of interest has to be captured by a limited number of samples, though sample spacing is also particularly important. As the variability in the soil landscape increases through soil forming factors, the predictive capability of RK often decreases because capturing the deterministic part of the variation across a heterogeneous landscape becomes harder with a sparsely distributed, finite number of soil samples. The strength of the fit between soil properties and environmental factors may decrease as the heterogeneity increases. Also, an increase in the complexity of the soil-landscape may decrease the effective range, which directly controls the strength of the spatial autocorrelation. For example, Zhu and Lin (2010) showed that while OK is preferred in a gently rolling agricultural landscape, RK is more favorable in a steep-sloped forested landscape. Thus, slope may affect the variation, and the spatial range could be smaller than in a gently sloped area, limiting the ability of RK to capture stochastic, spatially dependent variation.

The RK technique was purposely developed to use the exhaustively available

auxiliary variables as well as various data from different sources with differing spatial

scales. The introduction of environmental covariates into the model is thought to

improve the interpolation accuracy by reducing the number of observations needed for the target variable. Hence, the interpolation accuracy of RK depends on the selection of

high-quality, useful auxiliary variables that are representative of the main dynamics of

phenomena of interest. However, in most cases, researchers do not have a choice of

the auxiliary variables, and the impact of various spatial scales of auxiliary data on the

performance of RK prediction is still largely unknown. The relationship between a target soil property and auxiliary variables, often represented by the coefficient of determination (R²) from a linear or non-linear regression, is important in determining whether RK will be more accurate than OK (Kravchenko and Robertson, 2007). Also, processes which govern the total variation of target soil properties may not be fully represented due to a lack of high-quality data and of knowledge pertaining to the processes that account for the variation, or due to an inappropriate spatial scale of the variables. In terms of the spatial dependence of soil properties and classes, with the N:S ratio as an indicator of the strength of spatial dependence, Kravchenko (2003) found that soil properties with N:S < 0.1 can be mapped more accurately by ordinary kriging (OK) than those with N:S > 0.1. Therefore, if the spatial autocorrelation is weak (N:S > 0.75), a variogram cannot contribute substantially to the performance of prediction. On the other hand, if a strong spatial dependence (N:S < 0.25) is detected, the accuracy of prediction may be improved substantially since there is some explainable variation present in the residuals.
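The N:S thresholds quoted above can be written as a small helper. This is only a sketch: the function names are hypothetical, the sill is assumed to be the total sill (nugget plus partial sill), and the label "moderate" for ratios between 0.25 and 0.75 is an assumption, since the text names only the strong and weak classes explicitly.

```python
def nugget_to_sill_ratio(nugget, sill):
    """Nugget-to-sill (N:S) ratio; sill is assumed to be the total sill."""
    if sill <= 0:
        raise ValueError("sill must be positive")
    if not 0 <= nugget <= sill:
        raise ValueError("nugget must lie between 0 and the sill")
    return nugget / sill


def spatial_dependence_class(ns_ratio):
    """Classify spatial dependence from the N:S ratio.

    Thresholds follow the text: < 0.25 strong, > 0.75 weak.
    The in-between label 'moderate' is an assumption.
    """
    if ns_ratio < 0.25:
        return "strong"
    if ns_ratio > 0.75:
        return "weak"
    return "moderate"
```

Applied to the variogram of the regression residuals, a "weak" result would suggest that kriging the residuals is unlikely to add much to the regression alone.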

In addition, the residual spatial autocorrelation of a model is largely dependent on

the input variables used in the deterministic function. During the model development

process, the introduction of spatially correlated environmental predictors will largely

influence the residual spatial autocorrelation of the model. In other words, models

populated with all-relevant variables (i.e., some of them spatially autocorrelated) will

leave no residual spatial autocorrelation; hence, ordinary kriging of the residual for

those models will not substantially improve the RK prediction performance.

2.4.3 REML-EBLUP vs. RK

In the conceptual framework, soil properties and classes are treated as a

realization of spatially correlated random functions (Lark, 2012) which are a combination

of the deterministic variation and the spatially varying stochastic variation. The empirical best linear unbiased predictor (E-BLUP) accounts for both by incorporating a trend function f(x) and a random variable ε(x) with a mean of zero and spatial dependence as described by the variogram. The predicted value at an unsampled location x0 is a combination of the trend prediction, f(x0), and a kriged estimate of ε(x0) (Stacey et al., 2006; Stein, 1999):

$$Z(x_0) = f(x_0) + \varepsilon(x_0) \qquad (2\text{-}3)$$
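Equation 2-3 can be illustrated with a deliberately small regression kriging sketch. Everything specific here is an assumption made for illustration: a single covariate with an ordinary least squares trend, an exponential variogram whose parameters are supplied by the user rather than fitted, a global kriging neighborhood, and distinct sample locations. All function names are hypothetical.

```python
import math


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x


def exponential_variogram(h, nugget, partial_sill, range_parameter):
    """Semivariance at lag h; zero at h = 0 so kriging honours the data."""
    if h == 0:
        return 0.0
    return nugget + partial_sill * (1.0 - math.exp(-h / range_parameter))


def fit_linear_trend(covariate, target):
    """Ordinary least squares fit of target = intercept + slope * covariate."""
    n = len(target)
    x_bar = sum(covariate) / n
    y_bar = sum(target) / n
    sxx = sum((x - x_bar) ** 2 for x in covariate)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(covariate, target))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def krige_residual(coords, residuals, target_coord, variogram):
    """Ordinary kriging estimate of the residual at target_coord."""
    n = len(coords)
    # Kriging system in variogram form, with a Lagrange multiplier row/column
    # that forces the weights to sum to one.
    lhs = [[variogram(math.dist(coords[i], coords[j])) for j in range(n)] + [1.0]
           for i in range(n)]
    lhs.append([1.0] * n + [0.0])
    rhs = [variogram(math.dist(c, target_coord)) for c in coords] + [1.0]
    weights = solve_linear_system(lhs, rhs)[:n]
    return sum(w * r for w, r in zip(weights, residuals))


def regression_kriging(coords, covariate, target, target_coord,
                       target_covariate, variogram):
    """Z(x0) = f(x0) + kriged residual at x0, as in Equation 2-3."""
    intercept, slope = fit_linear_trend(covariate, target)
    residuals = [y - (intercept + slope * x) for x, y in zip(covariate, target)]
    trend = intercept + slope * target_covariate
    return trend + krige_residual(coords, residuals, target_coord, variogram)
```

A call such as `regression_kriging(coords, covariate, target, (0.5, 0.5), 2.5, lambda h: exponential_variogram(h, 0.1, 1.0, 1.5))` returns the trend at the new location plus the interpolated residual; at a sampled location the observed value is reproduced, because the variogram is zero at lag zero.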

One way of obtaining the E-BLUP is with RK type C (Odeh et al., 1995). However, the major drawback of using RK is that the deterministic model parameters and the covariance function parameters must be estimated separately. The parameters of the deterministic model are estimated with the chosen regression type and used to compute the trend in the area of interest. The residuals arising from this trend model are described by a variogram, which is typically estimated with the method-of-moments estimator of Matheron. Nevertheless, the trend cannot be estimated without bias because the distribution of the random residual is unknown at this stage, and the variogram of the residuals cannot be estimated without bias when the trend is unknown (Cressie, 1993). In this process, the same regression coefficients are used to compute the trend at all locations, even if the kriging estimation is only done in a local neighborhood (Stacey et al., 2006). Hence, RK is mathematically biased, which leaves room for improvement.
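For reference, the method-of-moments (Matheron) estimator mentioned above averages half the squared differences of all data pairs falling in a lag class. The sketch below is a minimal version under stated assumptions: isotropy, equal-width lag bins, and a hypothetical function name and return layout. In RK it would be applied to the regression residuals.

```python
import math
from collections import defaultdict


def empirical_semivariogram(coords, values, lag_width, max_lag):
    """Method-of-moments semivariance per lag bin.

    For each bin, the semivariance is the sum of squared differences over all
    pairs in the bin divided by twice the number of pairs.
    Returns {bin_index: (mean_distance, semivariance, pair_count)}.
    """
    # Per bin: [sum of distances, sum of squared differences, number of pairs]
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    n = len(values)
    for i in range(n - 1):
        for j in range(i + 1, n):
            h = math.dist(coords[i], coords[j])
            if h > max_lag:
                continue
            entry = sums[int(h // lag_width)]
            entry[0] += h
            entry[1] += (values[i] - values[j]) ** 2
            entry[2] += 1
    return {k: (d / c, s / (2 * c), c) for k, (d, s, c) in sorted(sums.items())}
```

The pair counts returned per bin make the sample-size concern concrete: with few observations, each lag class contains few pairs and the estimated semivariances become erratic.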

As one possible improvement, Lark et al. (2006) introduced the Residual Maximum Likelihood-Empirical Best Linear Unbiased Predictor (REML-EBLUP) to model the spatial variability of soil properties and classes. The advantage of REML-EBLUP over RK results from estimating the variance parameters by REML, since these estimates are subject to substantially less bias than method-of-moments estimates from ordinary least squares (OLS) or generalized least squares (GLS) residuals. Theoretically, REML-EBLUP may give better prediction accuracy than RK. REML-EBLUP may provide more efficient predictions and unbiased estimates of the error variances for quantifying the uncertainty, whereas RK treats the errors of the deterministic and random components of the prediction separately, which contributes to the final uncertainty. A fuller description of the theory underlying REML and its justification and use is given elsewhere (Lark and Cullis, 2004; Lark et al., 2006b; Lark and Webster, 2006).

Though RK is statistically biased, the prediction performance of REML-EBLUP and RK has been found to be similar. Minasny and McBratney (2007) tested the accuracy of RK and REML-EBLUP by modeling different soil properties in different geographic regions. There were slight improvements in prediction when using REML-EBLUP; however, the advantage does not appear to be great. REML-EBLUP is useful when there is a strong trend, when one needs to understand the underlying spatial process and when the number of observations is small (< 200). Minasny and McBratney (2007) concluded that although RK is statistically inappropriate, it is easy to use and has proven to be a robust technique for practical application in soil landscape modeling and mapping. Chai et al. (2008) compared the accuracy of RK and REML-EBLUP in predicting SOM with different auxiliary variables. The improvement of REML-EBLUP over RK was not significant in that study, although they reported that REML-EBLUP increased prediction accuracy more than RK, especially when a smaller proportion of the variation in the target variable was accounted for by the trend model. Other studies show that REML-EBLUP is preferred when the number of observations is fewer than 100 (Kerry and Oliver, 2007) or fewer than 200 (Minasny and McBratney, 2007). Therefore, when the fixed trend between target soil properties and classes and SCORP factors is strong and there are too few observations to estimate a reliable variogram (fewer than about 100 to 200; Webster and Oliver, 1992), REML-EBLUP may be preferable. As the number of studies using REML-EBLUP to predict soil properties and classes increases, its advantages over RK may become better documented.

2.4.4 Future Trend of RK

The current global framework for RK is generally a combination of linear models (GLS, LM), which capture the deterministic portion of the variation, and OK, which captures the spatially dependent stochastic portion of the variation; the two are combined to form the final map (Odeh et al., 1995; Hengl et al., 2004). There are no restrictions on how to quantify the relationship between sparsely available soil properties and exhaustively available exogenous variables. McBratney et al. (2000) incorporated modern statistical techniques, including generalized linear models (GLM), generalized additive models (GAM), classification and regression trees (CART) and neural networks (NN). After detrending the data, the residuals kriged with ordinary kriging were added to the trend to create a final map, and they reported an obvious improvement of the modified RK types over global RK.

During the last decade, machine learning algorithms have gained tremendous

attention from environmental scientists working to improve the accuracy of the

deterministic portion of the variation. As modern statistical methods have advanced, RK has been revised and refined with novel and robust parametric and non-parametric regression types, or by replacing ordinary kriging with block kriging. Modified RK types usually yield better prediction accuracy than global RK. Among the reviewed studies, Sun et al. (2012) tested local RK, which had been thought to be a stepped-up version of RK (Hengl et al., 2007a), against global RK and found that, in general, local RK performs no worse than global RK. Using geographically weighted regression (GWR) (Brunsdon et al., 1996), Mishra et al. (2010) reported relative improvements of 22% over MLR and 2% over RK in SOC prediction. Kumar et al. (2012)

used geographically weighted regression kriging (GWRK) by combining GWR and OK

as a modified version of RK and reported the least biased and most accurate results

compared to RK for estimating the SOC stock based on the lowest RMSD. Niang et al.

(2014) used a support vector regression and produced the best prediction accuracy

compared with the geostatistical interpolation techniques. Poggio and Gimona (2014)

employed a hybrid GAM-geostatistical 3D model (3DGAM + GS), by combining the

fitting of a GAM to estimate the trend of the variable, using a 3D smoother with related

covariates and kriging or Gaussian simulations of GAM residuals as spatial component

in order to account for local details and found better prediction accuracy. Shi et al.

(2011) used high accuracy surface modeling (HASM) which uses a spatial interpolation

technique based on the fundamental theorem of surfaces, and proposed a modified

HASM method based on the incorporation of ancillary land use information. The results

have shown that HASM_LU generally performs better than HASM, OK_LU, SK and

RK_GLM (with a lower estimation bias). MAE and RMSD generally perform with a

greater prediction error (PE). Li et al. (2013) proposed a radial basis function neural network (RBFNN) model combined with principal component analysis (PCA) to predict the spatial distribution of SOM content across China. They reported a higher ratio of performance to deviation (RPD) and lower mean absolute error (MAE), mean relative error (MRE) and root mean squared deviation (RMSD) compared to RK. Guo et al. (2015) used random forest (Breiman, 2001) with residual kriging (RFRK) to predict and map the spatial distribution of SOM and compared the results with SMLR; RFRK yielded much better prediction accuracy.

Very complex relationships between soil properties and environmental variables

are present in pedologic data (Lark, 1999). Machine learning algorithms do not require the assumption of a Gaussian distribution and can handle nonlinear, hierarchical and complex relationships between soil properties and classes and SCORP factors. Therefore, pedometricians may wish to explore whether DSMM practitioners can gain ground by combining machine learning algorithms with kriging methods. This

combination is a promising area for further investigation in the near future. Modified

versions of RK types are proposed with the hope that further investigation of these

combinations may increase prediction accuracy for soil science (Table 2-2).

2.4.5 Model Averaging

Numerous DSMM studies have utilized different geostatistical, non-geostatistical,

and hybrid methods to predict soil properties and classes. In order to identify the best

method for the particular soil properties at multiple scales and time, numerous methods

are being investigated by the DSMM community. Testing all of the different geostatistical, non-geostatistical and hybrid methods may be cumbersome, but it may increase the performance of predictions because each method has its own strengths and weaknesses. In order to take full advantage of the best methods, model averaging may be an opportunity to make further gains, as model averaging has

been applied in a variety of disciplines (Hoeting et al., 1999; Goswami and O’Connor,

2007; Raftery et al., 2005). The model averaging framework involves a combination of

the predictions from two or more methods by enhancing the strengths of each while

reducing the weaknesses of each source map (Malone et al., 2014). Further investigation is needed into whether averaging the predictions of the best performing methods increases prediction accuracy. Li et al. (2011) could not find any increase in prediction accuracy from averaging the predictions of RK_RF,OK; RK_RF,IDS; and RF in their review. To the author's knowledge, there are too few examples of model averaging for the prediction of soil properties and classes in the reviewed studies to make a statement about the likelihood of accuracy improvements. The only example of averaging different methods to predict soil properties and classes is given by Malone et al. (2014). In that study, four model averaging methods were employed, namely equal weights averaging (EW), Bates–Granger or variance weighted averaging (VW), Granger–Ramanathan averaging (GRA), and Bayesian model averaging (BMA), in order to average a disaggregated conventional soil map produced using the DSMART (Odgers et al., 2014) and PROPR (Odgers et al., 2015) algorithms with an RK-based digital soil map. The most accurate results were found by averaging the disaggregated soil map and the RK-based soil map.

The model averaging technique is analogous to leveraging the best aspects of

each contributing model, and discarding the worst aspects. If both contributing models

are poor, ultimately the quality of the combined outcome will also be relatively poor;

however, one can at least expect the quality of the combined output to be comparable

to or better than the best of the contributing models (Malone et al., 2014). Biswas and Cheng (2013) employed model averaging to reduce the uncertainty associated with semivariogram model parameters. In short, the use of model averaging in the DSMM community is scarce. Therefore, pedometricians may also investigate whether model averaging can improve the prediction of soil properties and classes.
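Two of the simpler averaging schemes named in this section, equal weights and variance weighted averaging, can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of Malone et al. (2014): the variance weights are taken as inverse error variances normalised to sum to one, which treats the errors of the contributing models as uncorrelated, and all function names are hypothetical.

```python
def equal_weights(n_models):
    """Equal weights averaging (EW): every model gets the same weight."""
    if n_models < 1:
        raise ValueError("at least one model is required")
    return [1.0 / n_models] * n_models


def variance_weights(error_variances):
    """Variance weighted averaging (VW) in the Bates-Granger style.

    Assumption: weights are inverse error variances (e.g., validation MSE)
    normalised to sum to one, ignoring error covariance between models.
    """
    inverse = [1.0 / v for v in error_variances]
    total = sum(inverse)
    return [w / total for w in inverse]


def average_predictions(predictions_per_model, weights):
    """Weighted average per location.

    predictions_per_model[m][i] is the prediction of model m at location i.
    """
    n_locations = len(predictions_per_model[0])
    return [
        sum(w * preds[i] for w, preds in zip(weights, predictions_per_model))
        for i in range(n_locations)
    ]
```

With two source maps whose validation error variances are 1 and 3, for example, `variance_weights` assigns the more accurate map three times the weight of the other, so a poor contributing model is down-weighted rather than discarded.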

2.5 Conclusions and Outlook

To address the environmental problems of today's world at multiple scales in space and time, one of the key tasks is upscaling scarcely available categorical and continuous soil information to local, regional and global scales. Since the 1960s, soil spatial variability has been studied in a systematic way in order to characterize the pedogenic processes and complex distribution patterns of soils in space and time and to depict categorical and continuous soil information on a map (Burrough et al., 1994). As it is not feasible to gather soil information at every possible location for any target property of interest in space and time, quantitative characterization of soil properties and classes in the soil-landscape continuum requires predictive modeling based on a sparsely distributed, finite number of soil observations. Hence, the investigation of consistent and robust methods resulting in higher prediction accuracy for soil properties and classes has profound importance in pedometrics for the foreseeable future.

Given the scarcity of field observations, pedologists have developed conceptual

models in order to capture the significant factors and processes responsible for the

genesis and spatial distribution of soil and its horizons (Minasny et al., 2008). RK, as a

combination of Jenny’s factorial model (Jenny, 1941) to quantify deterministic variation

and Matheron’s regionalized variable theory (Matheron, 1971) to quantify spatially

dependent stochastic variation, has proven to be one of the most widely accepted

methods among DSMM practitioners. As computational power, SCORP factors and

knowledge about soil forming processes increase and coevolve, more satisfactory

prediction efficiency for geo-spatial soil landscape models has been achieved. RK is

being used as a workhorse in the pedometrician’s toolbox and has been shown to be a

robust and widely accepted soil mapping method for over 20 years. It appears to have

reached maturity, given the large body of literature that now exists. However, there are no consistent findings about the factors affecting the accuracy of RK due to a lack of reporting of essential information and unreliable metrics. To support more consistent findings and allow more accurate comparisons, it is recommended that future DSM studies report the following essential criteria: soil property descriptive statistics (at least mean, minimum, maximum, median, range and coefficient of variation), area of extent, total number of samples, sample design, sample depth(s), sample size (training and validation separately), SCORP factors, spatial resolution of final map, transformation methods, the method of factorial analysis, regression type, coefficient of determination of the deterministic function (i.e., R², fit statistics), variogram model type, spatial autocorrelation range, N:S ratio, validation method, R² of the fitted models and RPIQ (accuracy metric).

The appropriate selection of variables for input into RK is essential because the

functional relationship between SCORP factors and a soil variable is often unknown and

noisy. The variable selection strategy may suffer bias or even fail in regions where the

process knowledge is insufficient (Xiong et al., 2014a). The reviewed studies have

shown that the number of SCORP factors has been increasing over the past decade

with the advance of Geographic Information Systems (GIS), Global Positioning System

(GPS), and remote and proximal sensing technologies. The challenge is to gather a

comprehensive set of spatially exhaustive environmental predictors to characterize the mosaic of soil-environmental systems and identify the relevant set of predictors. Furthermore, it is still important to develop the most parsimonious yet well-performing soil prediction model while dealing with multicollinearity between SCORP factors and without sacrificing prediction accuracy. This may be addressed by incorporating machine learning algorithms into the RK framework, together with systematic variable selection algorithms (e.g., Boruta) that are used to increase the efficiency of predictions.

Since machine learning algorithms do not require normally distributed soil data and can handle hierarchical and nonlinear relationships between soil observations and auxiliary variables, they have produced the same or better predictions than those achieved using conventional multivariate regression methods. Successful modification of RK with modern statistical methods, especially machine learning algorithms, may allow researchers to capture all attainable information offered by the data and decrease the inaccuracies of geo-spatial soil landscape models. Even though the performance of prediction is heavily dependent on the data quality, further gains can be made by modifications in the specific methods underlying RK. Several variations of RK have been offered: RK_RF,OK; RK_RF,IDS; RK_RF,BK; RK_SVM,OK; RK_SVM,IDS; RK_GWR,OK; RK_GWR,IDS; RK_GWR,BK; RK_PLSR,OK; RK_PLSR,IDS; RK_PLSR,BK; RK_PCR,OK; RK_PCR,IDS; RK_PCR,BK. It may not be possible to identify a spatial prediction method that is best for every case (Sun et al., 2012), but it is possible to develop models that characterize all attainable variability with a given dataset.

In order to take full advantage of the strengths of different methods, model averaging techniques may be utilized to reduce the prediction error. The increasing amount of pedological data from emerging technologies such as electromagnetic induction techniques may allow pedometricians to focus on the depth and time components of soil phenomena, which are generally overlooked, and may improve the accuracy of target predictions.

Figure 2-1. Evolution of hybrid interpolation techniques. GLM = generalized linear model, SMLR = stepwise multiple linear regression, CART = classification and regression tree, BK = block kriging, OK = ordinary kriging, SK = simple kriging, RK = regression kriging, KED = kriging with external drift, COK = cokriging, x = location in one, two or three dimensions, Z(x) = the random variable Z at location x, μ(x) = deterministic structural component, trend (drift), ε′(x) = stochastic component, spatially dependent residual from μ(x) (the regionalized variable), ε″(x) = spatially independent component, noise, unexplained variability.

Figure 2-2. General framework for regression kriging (RK). PCA = principal component analysis, RMSD = root mean squared deviation, ME = mean error, MAE = mean absolute error, x = location in one, two or three dimensions, Z(x) = the random variable Z at location x, μ(x) = deterministic structural component, trend (drift), ε′(x) = stochastic component, spatially dependent residual from μ(x) (the regionalized variable), ε″(x) = spatially independent component, noise, unexplained variability.

Figure 2-3. The cumulative number of RK studies over time.

Figure 2-4. Effects of coefficient of variation on the accuracy of RK methods in the 71

cases.

Table 2-1. Spatial range (m) from reviewed studies under three different areas of extent

Area of extent (km²)        Carbon   Chemical   Hydrological   Nutrient   Physical   Total average
Field < 0.25                    59          -              -        196          2              77
0.25 < Local < 10^4           7222       2883           2629      26635       9951           12511
10^4 < Regional < 10^7      194548       4361          13250      90778      30000           86123
Total average                35103       4150           6169      48485      11394           32236

Carbon: total carbon, soil organic carbon, soil organic matter, fractions of organic matter (HOC, POC, ROC, RC, HC, MC); Chemical: pH; Nutrient: N, K, Al, Ca, Mg, Zn, Cr, Cu, Ni; Hydrological: AWC, salinization, Ks; Physical: sand, silt, clay, horizon thickness, depth to C1.

Table 2-2. Modified versions of Regression Kriging (RK)

RK version     Deterministic                         Stochastic
RK_GLM,OK      Generalized Linear Model              Ordinary Kriging
RK_GLS,OK      Generalized Least Square              Ordinary Kriging
RK_RF,OK       Random Forest                         Ordinary Kriging
RK_RF,IDS      Random Forest                         Inverse Distance Squared
RK_RF,BK       Random Forest                         Block Kriging
RK_CART,OK     Regression Tree                       Ordinary Kriging
RK_CART,IDS    Regression Tree                       Inverse Distance Squared
RK_CART,BK     Regression Tree                       Block Kriging
RK_SVM,OK      Support Vector Regression             Ordinary Kriging
RK_SVM,IDS     Support Vector Regression             Inverse Distance Squared
RK_SVM,BK      Support Vector Regression             Block Kriging
RK_GWR,OK      Geographically Weighted Regression    Ordinary Kriging
RK_GWR,IDS     Geographically Weighted Regression    Inverse Distance Squared
RK_GWR,BK      Geographically Weighted Regression    Block Kriging
RK_PLSR,OK     Partial Least Square Regression       Ordinary Kriging
RK_PLSR,IDS    Partial Least Square Regression       Inverse Distance Squared
RK_PLSR,BK     Partial Least Square Regression       Block Kriging
RK_PCR,OK      Principal Component Regression        Ordinary Kriging
RK_PCR,IDS     Principal Component Regression        Inverse Distance Squared
RK_PCR,BK      Principal Component Regression        Block Kriging

CHAPTER 3 DIGITAL MAPPING OF SOIL CARBON FRACTIONS

3.1 Introduction

Quantifying only the soil total C stocks in a particular soil body to mirror its role

over a majority of soil functions does not adequately reflect the true gravity of soil C as

an ecosystem property (Parton et al., 1987a; Elliott et al., 1996). Organic C in soil is

comprised of a large variety of thermodynamically unstable materials with varying

degrees of decomposition and residence time which are located within the architecture

of the soil matrix (Jastrow and Miller, 1998). Accordingly, the soil organic C (SOC) may

simply be conceptualized into two major sub-pools: a labile pool with turnover rates

ranging from days to decades and a recalcitrant pool that persists in soil for hundreds to thousands of years (Cheng et al., 2007). Therefore, modelling fractions of soil total C (TC) yields

multiple benefits, including identifying anthropogenically induced short-term C loss in

soil through labile pool of soil TC and adequately determining the long-term C budget

through recalcitrant pool of soil TC.

Soil is the key to the majority of today’s global environmental problems, such as

food, water, energy and biodiversity security (Bouma and McBratney, 2013). Framing

the role of soil C to deal with the global challenges of our time mandates the

understanding of the dynamics and characteristics of distinct SOC pools, along with

their interaction with soil-environmental factors. Lately, an emerging view contends that

total soil C storage and decomposition are not necessarily driven only by the inherent

molecular structure of soil organic matter (Marschner et al., 2008; Kleber et al., 2010;

Conant et al., 2011; Schmidt et al., 2011). Several authors argue that the quantity and

quality of soil C are predominantly controlled by physical, chemical, and biological

factors (Oades, 1988; Sollins et al., 1996; Jobbágy and Jackson, 2000; Ekschmitt et al.,

2008; Totsche et al., 2010; Schmidt et al., 2011). Our awareness of how environmental

factors control stabilization and destabilization mechanisms of soil C at particle,

aggregate, and pedon scales is still limited. Additionally, the up-scaling of spatially and

temporally heterogeneous C dynamics to local, regional, and global scales is still in its

infancy.

Today’s predictive soil mapping and modeling studies date back to the widely

recognized and globally accepted soil factorial model. The empirical-deterministic model

of soil formation developed by V.V. Dokuchaev (Glinka, 1927) and formalized by Jenny

(1941) defines soil formation as a function of Climate, Organisms, Relief, Parent

material, and Time (CLORPT). Technological advancements in geographic information

systems and remote sensing allow pedologists to reframe the soil factorial model.

Hence, McBratney et al. (2003) proposed the SCORPAN (S: Soil, C: Climate, O:

Organism, R: Relief, P: Parent material, A: Age, and N: Space) model incorporating

spatially and temporally explicit environmental factors and soil data into an equation.

This conceptual model serves as a framework for spatially explicit predictions of soil

properties at unvisited locations. Functional linkages are quantified between sparse site-

specific soil data and exhaustively available environmental covariates to derive soil

models. In the last epoch, humanity has rapidly become the main driver of the

functioning of the Earth System (Steffen et al., 2011). Soil and environmental scientists

recognize the extent, complexity and intensity of human influences on soil; hence,

humans are acknowledged as integral to soil genesis (Richter et al., 2011). In response,

the STEP-AWBH model (S: Soil, T: Topography, E: Ecology, P: Parent material, A:

Atmosphere, W: Water, B: Biota, and H: Human) was proposed to explicitly account for

both human-induced and natural factors that determine and modulate soil and space-

time interactions (Grunwald et al., 2011; Thompson et al., 2012).

As an ecosystem property, soil C is not randomly distributed (Webster, 2000).

Soil C is often spatially autocorrelated; that is, values close to each other in geographic

space generally correspond to similar values in feature space (Rossiter, 2012).

Accordingly, variation in soil C across a soil-landscape can be partitioned into two parts:

large-scale deterministic spatial variation as a function of a certain set of soil-

environmental variables and small-scale stochastic variation as a function of distance

between soil samples (McBratney, 1992). In addition, the relationships between soil C

storage and the accompanying ancillary variables are usually complex, non-linear, and

hierarchical. Hence, the ability of empirical geo-spatial soil-landscape models to

accurately quantify the spatial distribution pattern of soil C stocks over vast areas is

directly proportional to their capacity to capture non-linear, hierarchical relationships

between soil C and its environment, and also to explicitly account for spatial

autocorrelation often present in pedological data.

Overall, three generic, yet distinct, approaches have been adopted in

pedometrics to quantify the relationship between soil C and its environment with the

purpose of describing, analyzing, predicting, mapping, and assessing spatial distribution

across the soil-landscape continuum. The first approach is feature-space-based models

(statistical, machine learning) which do not explicitly account for stochastic spatially

dependent variation, such as Multiple Linear Regression (MLR) (Meersmans et al.,

2008), Classification and Regression Trees (CART) (McKenzie and Ryan, 1999;

Stoorvogel et al., 2009; Vasques et al., 2008), Random Forest (RF) (Grimm et al., 2008;

Wiesmeier et al., 2014), Support Vector Machines (SVM) (Were et al., 2015), Boosted

Regression Trees (BoRT) (Martin et al., 2011) and Bagged Regression Trees (BaRT)

(Xiong et al., 2014a). The second approach is geographic-space-based (geostatistical)

models which model the spatial dependence structure of site observations without

accounting for the deterministic trend, such as Ordinary Kriging (OK) (Rawlins et al.,

2011). The third approach is hybrid methods which explicitly account for the stochastic

spatially dependent variation and the deterministic trend, such as Regression Kriging

(RK) type C (Simbahan et al., 2006; Vasques et al., 2010a; Mishra et al., 2012; Sun et

al., 2012; Malone et al., 2014). The hybrid methods are expected to outperform statistical

and geostatistical models due to their dualistic nature. An overview of the state-

of-the-art methods used to model and map soil C and soil C change is presented by

Minasny et al. (2013).

Recently, a few studies have reported improvements in the prediction accuracies

through the incorporation of machine learning methods into the RK framework. For

example, Guo et al. (2015) used RF and OK sequentially to model and map soil organic

matter (SOM) across tropical regions of China. In another regional study, Li et al. (2013)

proposed a radial basis function neural network along with the OK model to predict

the spatial distribution of SOM content across China. In France, Martin et al. (2014)

employed BoRT and OK in national scale C studies to characterize the spatial

distribution of SOC. However, findings on coupling residual spatial autocorrelation

with machine learning algorithms are not consistent. Although roughly half of the

digital soil mapping and modeling (DSMM) studies focused on soil C and SOM in a

large meta-review (Grunwald, 2009), only a few individual studies have focused on

modeling different pools of soil C, possibly because of the analytical and computational

costs needed to perform such studies (Vasques et al., 2010b). In a catchment-scale

study in Australia, Karunaratne et al. (2014) modeled and mapped the measurable

fractions of soil C, namely resistant, humus, and particulate organic C. In the United

States, Vasques et al. (2010b) mapped and modeled different carbon fractions in a

watershed in Florida. Also, Knox et al. (2015) developed prediction models for TC,

SOC, and labile C, specifically hot-water extractable C (HC), using visible-/near-infrared

and mid-infrared spectroscopy in Florida. Neither of the Florida studies created soil C fraction

models that describe spatial distribution patterns across the state, a gap that is addressed in this

chapter.

Since labile C responds much faster to land use and other human-induced

changes (e.g., management) than TC and RC (Conant et al., 2003; Veldkamp et al.,

2003; Haynes, 2005), labile/active C fractions provide critical signatures serving as an

indicator of change. However, not much is known about how labile C varies across larger

regions with mixed land use, soil, and hydrologic settings. This motivated our research.

Our objectives are as follows:

1. Identify and characterize the most sensitive STEP-AWBH factors relevant to soil C pools to develop parsimonious geo-spatial soil-landscape models without sacrificing prediction accuracy.

2. Compare three distinct approaches under eight different methods to pick the best method to model each of the soil C fractions and rank the prediction performance of evaluated models.

3. Investigate the spatial autocorrelation of soil C model residuals and assess its capability to improve the explained variability of soil C fraction models.

3.2 Materials and Methods

3.2.1 Study Area

This study was conducted in the state of Florida, which is located in the

southeastern United States, with latitudes from 24°52’ N to 31°02’ N and longitudes

from 80°03’ W to 87°64’ W (Figure 3-1). As a peninsula, Florida is surrounded by the

Gulf of Mexico and the Atlantic Ocean on three sides and has a total area of

approximately 150,000 km2 (United States Census Bureau, 2000). While a humid,

subtropical climate is predominant in northern and central Florida, a humid, tropical

climate is predominant in southern Florida. The mean annual precipitation is 1,373 mm,

predominately from the extraordinarily high prevalence of thunderstorms. The mean

annual temperature is 22.3°C (National Climatic Data Center, 2008). Elevation ranges

from sea level to 106 m across Florida (United States Geological Survey, 1999).

Landforms associated with a nearly level, gentle slope dominate almost the whole state

with the exception of the northwestern part of the state (i.e., Florida Panhandle) and an

escarpment in north-central Florida (Cody Scarp). Micro-topography can greatly

influence the hydrological pattern (Mulkey et al., 2008). A generally high amount of

rainfall and low elevations, coupled with a relatively high water table, combine to form a

relatively high number of wetlands and marshes across the state. Even though Florida

is among the wettest states in the United States, Florida is susceptible to wildfires

during the driest months of the year, typically between October and May. The soils in

the study area were formed in sandy to loamy marine-derived parent material with sand

as the dominant particle size fraction. Dominant soil orders in Florida include Spodosols

(32%), Entisols (22%), Ultisols (19%), Alfisols (13%), Histosols (11%), and Mollisols and

Inceptisols (3% combined) (Vasques et al., 2010a). The most frequent soil subgroups

are Aeric Alaquods, Ultic Alaquods, Lamellic Quartzipsamments, Typic

Quartzipsamments, and Arenic Glossaqualfs (Natural Resources Conservation Service,

2009). Land use/cover (LULC) in Florida is dominated by wetlands (28%), pinelands (18%), urban and

barren lands (15%), agriculture (9%), rangelands (9%), and improved pasture (8%)

(Florida Fish and Wildlife Conservation Commission, 2003). With 19.9 million people

and counting, Florida is the third most populous U.S. state (United States Census

Bureau, 2015). Its increasing population has resulted in major changes, including rapid

urban growth and loss of agricultural and forest land over the past several decades

(Kautz et al., 2007). From the 1970s to 2011, the urban area in Florida increased by

more than 140% to about 24,900 km2, primarily converted from agriculture and upland

forest (Xiong et al., 2014b).

3.2.2 Soil Data

The soil data used in this thesis are derived as part of a larger project funded by

USDA-CSREES-NRI grant award 2007-35107-18368 titled “Rapid Assessment and

Trajectory Modeling of Changes in Soil Carbon across a Southeastern Landscape”

(National Institute of Food and Agriculture [NIFA], Agriculture and Food Research

Initiative [AFRI]). The Principal Investigator of this project is Dr. S. Grunwald and Co-

Principal Investigators are Drs. W.G. Harris, N.B. Comerford, and G.L. Bruland. This

project is a Core Project of the North American Carbon Program. The following sections

briefly describe the sampling design, field sampling, and laboratory analyses, which

were performed by the project team. My

role in the project begins with model development.

3.2.2.1 Sampling design and field sampling

As a product of the statewide project known as “Florida Soil Carbon Project”

conducted between July 2008 and June 2009, a total of 1,014 soil samples were

collected at a fixed depth of 20 cm across the state of Florida. A stratified random

sampling approach was implemented to capture the broad range in the variability of soil

C across Florida. Sixty-three land use/cover (LULC)-suborder strata were designed

based on a combination of the reclassified LULC map obtained from the Florida Fish

and Wildlife Conservation Commission (2003) and the 10 soil suborders acquired from

the Soil Data Mart-Soil Survey Geographic Database (SSURGO) (Natural Resources

Conservation Service, 2006).

To reflect local variability at each predefined site, four soil samples were

collected (20 cm deep x 5.8 cm diameter soil cores within a 2 m diameter radius) and

then georeferenced, bulked, and transported in a cooler for lab analysis. Afterward, the

bulk samples were air-dried and sieved to retrieve the fine earth fraction (< 2 mm).

These samples were thoroughly mixed and different quantities of subsamples were ball

milled for use in chemical analysis to derive different pools of soil C: TC, RC, and HC.

3.2.2.2 Laboratory and chemical analysis

Carbon fractions were measured using a Shimadzu TOC-VCPN catalytic

combustion oxidation instrument with a SSM-5000a solid sample module (Shimadzu

Scientific Instruments, Kyoto, Japan). Total C was measured from the 80–700 mg ball

milled samples combusted at 900°C. Measurement of hydrolysable ‘labile’ carbon (hot

water extractable – HC) was performed by incubating 4 g of soil in 40 mL (1:10) of

double de-ionized water for 16 h at 80°C. Samples were then filtered to 0.22 μm.

Measurement of the non-hydrolysable ‘recalcitrant’ C (RC) was accomplished by

digesting 2 g of the ball milled soil in 10 mL of 5 M HCl under reflux conditions for 16 h.

The soil digest was washed 3 times by centrifuge, dried and the remaining undigested C

was then combusted at 900°C (Knox et al., 2015).

3.2.2.3 Determination of total, recalcitrant and labile carbon stocks

Soil carbon stocks in areal units (kg m−2) were derived for each of the C fractions

by multiplying TC, RC, and HC concentrations with oven dry bulk density (BD) values,

which were also measured (Eq. 3-1). Mass of soil C fractions present in the top 20 cm (kg

C m-2) was computed using the following equation:

$$\text{TC, RC, or HC stocks} = \frac{(\text{TC, RC, or HC}) \times \mathrm{BD} \times 2000}{1000} \qquad (3\text{-}1)$$

TC, RC, or HC stocks : Soil total, recalcitrant, and labile carbon stocks in kg C m-2 (0–20 cm soil profile)

BD : Oven dry bulk density (g cm-3)

PD : Profile Depth (0.2 m)
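
The stock calculation in Eq. 3-1 can be sketched in a few lines. This is an illustrative Python sketch only (the thesis analyses were run in R). The function name is hypothetical, and the code assumes that the C concentrations are expressed in percent by mass, which is not stated in the text but makes the factor 2000/1000 dimensionally consistent for a 0.2 m layer and a bulk density in g cm-3.

```python
def carbon_stock_kg_m2(concentration_pct, bulk_density_g_cm3):
    """Eq. 3-1: C stock (kg C m-2) in the top 20 cm of soil.

    Assumption: concentration is given in percent by mass (g C per 100 g soil).
    """
    return (concentration_pct * bulk_density_g_cm3 * 2000.0) / 1000.0


# Hypothetical sample: 1.5 % total C and a bulk density of 1.4 g cm-3
example_stock = carbon_stock_kg_m2(1.5, 1.4)
print(round(example_stock, 2))
```

The same function applies to TC, RC, and HC, since only the concentration changes between fractions.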

3.2.3 Environmental Data

3.2.3.1 Assembled environmental variables representing STEP-AWBH factors

In Digital Soil Mapping and Modeling (DSMM), the prediction performance of

geospatial models for soil properties has been largely dependent on the assembling of

useful and appropriate scale qualitative and quantitative soil-spatial information rather

than employing more sophisticated statistical or geostatistical methods (Minasny and

McBratney, 2007; Grunwald, 2009). Even though having a set of potential predictors

may substantially improve prediction accuracy for soil properties and classes, the

selection of parsimonious environmental variable sets does not command attention like

the calibration and validation part of a geo-spatial modeling process. Building a pool of

potential predictors can be overlooked because some researchers are unaware of the

availability of potential predictors or the general belief in the similarity to or superiority of

the variables chosen (Miller et al., 2015). Accurate, efficient, and unbiased model

development requires the inclusion of all possible environmental determinants;

otherwise, selection of predictors based on the researchers’ knowledge could lead to

biased and suboptimal model performance (Grunwald, 2009). This research is

based on the approach presented in Xiong et al. (2014a).

To represent a spectrum of possible soil-forming processes that may have an

impact on the fate of TC, RC, and HC, a large set of up-to-date STEP-AWBH variables

(N: 332) with statewide coverage were gathered from numerous data sources with

ArcGIS 10.2 (Environmental Systems Research Institute, ESRI Inc., Redlands, CA)

(Table 3-1). About 12% (N: 40) of variables were categorical (i.e., ordinal, nominal,

binary), including SSURGO derived soil taxonomic properties, LULC classes obtained

from different sources, soil drainage and hydrological classes, vegetation type, etc.,

whereas about 88% (N: 293) were continuous (i.e., floating point, integer), including

proximal and remote sensing derived variables (with a variety of spatial resolutions) and

terrain variables, such as soil water-holding capacities, historic organic matter content,

primary and secondary terrain attributes, and climatic and biotic variables. Some

topographic variables, including elevation, compound topographic index, slope, and flow

accumulations, were collected at three spatial resolutions (30, 90, and 1000 m).

Moreover, some biotic and climatic variables such as the normalized difference

vegetation index (NDVI), enhanced vegetation index (EVI), and monthly precipitation

were also represented as multi-temporal sequences.

3.2.3.2 Boruta feature selection technique

Many of the variables available may have the effect of introducing noise or may

not provide information to infer on a target soil property. Additionally, variables may be

redundant or highly correlated, which makes the task of gathering the comprehensive set

of environmental predictors problematic (Xiong et al., 2014a). Thus, the need for

strategically identifying variables related to major pedogenic and environmental

processes for phenomena of interest is a focal point to any research, in this case TC,

RC, and HC. This problem has been addressed in the machine learning literature under

the topic of minimal-optimal versus all-relevant variable selection (Liu and

Motoda, 2012). According to Xiong et al. (2014a), the minimal-optimal set is preferable

to yield the best prediction accuracy when the focus is on developing a predictive

model, whereas the all-relevant variable selection method is preferable to characterize

the mechanism between environmental variables and phenomena of interest.

Furthermore, the selection of all-relevant variables out of a broad set of environmental

variables reduces overfitting and the time needed for model development and application, and also

increases model interpretability (Belanche-Muñoz and Blanch, 2008; Merow et al.,

2014).

Boruta, an all-relevant variable selection method, was applied to characterize

and identify the variables which impart control on the fate of the spatial distribution of

soil TC, RC, and HC. This method can detect linear and non-linear relationships

between soil C fractions and environmental predictors because the Boruta algorithm is

based on the RF classification algorithm. In short, the Boruta algorithm produces five random

probes whose values are acquired by shuffling values of the original predictors to remove

their correlation with the phenomenon of interest (e.g., TC, RC, and HC). Afterwards, RF

regression is conducted on the original predictors and random probes combined, and a Z

score is determined as the importance of each variable. Then, the maximum Z score

among the random probes (MZRP) is identified and used as a reference to test whether a

predictor is relevant to TC, RC, and HC with a two-sided test of equality. Only the

predictors with a Z score significantly higher than the MZRP were accepted as relevant

variables (Kursa and Rudnicki, 2010). A full discussion can be found in Xiong et al.

(2014a). The Boruta package (Kursa and Rudnicki, 2010) was used to perform Boruta

all-relevant searching method in R 3.2.0 (R Core Team, 2015).
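
The core idea of comparing each predictor against shuffled probes can be illustrated with a deliberately simplified Python sketch. It is not the Boruta package and not the procedure used in this study: the absolute Pearson correlation is swapped in for the RF-based Z score, and a fixed share of winning runs replaces the statistical test. All function names, parameters, and data below are hypothetical.

```python
import random


def abs_correlation(x, y):
    """Absolute Pearson correlation, used here as a stand-in importance score."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def shadow_probe_selection(predictors, response, n_runs=30, min_hit_share=0.9, seed=0):
    """Keep predictors whose score beats the best shuffled probe in most runs."""
    rng = random.Random(seed)
    hits = {name: 0 for name in predictors}
    for _ in range(n_runs):
        # Shuffling a predictor breaks its link with the response
        probe_scores = []
        for values in predictors.values():
            shuffled = list(values)
            rng.shuffle(shuffled)
            probe_scores.append(abs_correlation(shuffled, response))
        best_probe = max(probe_scores)
        for name, values in predictors.items():
            if abs_correlation(values, response) > best_probe:
                hits[name] += 1
    return [name for name, h in hits.items() if h / n_runs >= min_hit_share]


# Hypothetical data: one informative predictor and one pure-noise predictor
data_rng = random.Random(42)
signal = [data_rng.gauss(0, 1) for _ in range(200)]
noise = [data_rng.gauss(0, 1) for _ in range(200)]
response = [3.0 * s + data_rng.gauss(0, 0.5) for s in signal]
selected = shadow_probe_selection({"signal": signal, "noise": noise}, response)
print(selected)
```

A correlation score only captures linear association, whereas the RF importance used by Boruta can also pick up non-linear relationships, so this sketch illustrates the probe logic rather than the selection power of the method.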

3.2.4 Modeling Techniques

The whole dataset (N: 1,014) was randomly split into calibration (N: 710) and

validation (N: 304) sets to model TC, RC, and HC. The calibration dataset was used to

train models with all-relevant variables identified by Boruta. The independent validation

sets were used to evaluate the predictive performance of each model.
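
A random 70/30 split of this kind can be sketched as follows. The function name, the seed, and the use of integer sample identifiers are assumptions made for illustration; the split in this study was carried out in R.

```python
import random


def split_calibration_validation(sample_ids, calibration_share=0.7, seed=1):
    """Randomly split sample identifiers into calibration and validation sets."""
    rng = random.Random(seed)
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    n_calibration = round(calibration_share * len(shuffled))
    return shuffled[:n_calibration], shuffled[n_calibration:]


# 1,014 samples give 710 calibration and 304 validation samples
calibration_ids, validation_ids = split_calibration_validation(range(1014))
print(len(calibration_ids), len(validation_ids))
```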

For comparative assessment, eight different techniques were selected to

characterize the spatial distribution pattern of soil TC, RC, and HC across the state. The

methods fall into three generic modelling approaches: feature-space-based methods,

including statistical and machine learning methods (i.e., PLSR, CART, BaRT, BoRT,

RF, and SVM), geostatistical (i.e., OK), and hybrid methods (i.e., RK). A thorough

explanation of the evaluated methods and applications can be found in James et al.

(2013) and Kuhn and Johnson (2013).

CART involves constructing a set of decision trees on the predictor variables.

The trees are grown by repeatedly stratifying the dataset into successively smaller

subsets (child nodes) with binary splits based on a single categorical or continuous

predictor variable (Breiman, 1984). At each node, the best split is chosen as the one

that partitions the response into the two most homogeneous groups

(i.e., minimizing variability within each child node) (Prasad et al., 2006).

BaRT is an ensemble decision tree method that involves the averaging of several

individual trees to acquire a final prediction. Individual regression trees have been

recognized as unstable learners (Breiman, 1996); that is, small changes in the

calibration dataset can give rise to very different output trees (Hastie et al., 2009).

Bagging (bootstrap aggregating) is a relatively simple ensemble procedure that uses

many bootstrap sets drawn with replacement from the original training data set and

grows a regression tree from each bootstrap sample (Efron and Tibshirani, 1993). The

results of each individual tree are subsequently averaged to obtain the overall

prediction.

RF is the ensemble approach that involves the bagging of un-pruned trees (weak

learners) by randomly selecting predictors in each split (Breiman, 2001). The main

difference of RF over BaRT is that the set of predictor variables is randomly restricted in

each split (Prasad et al., 2006) and this reduces the problem of correlation between the

individual trees, and hence improves the final prediction accuracy and efficiency of

the ensemble.

BoRT is an ensemble approach which diverges from RF and BaRT by one main

difference. BaRT and RF involve fitting a separate decision tree to each bootstrap sample

derived from the original data and combining the individual trees to create a single

predictive model. In BoRT, trees are instead grown sequentially with each tree grown

using the information from previously grown trees (Hastie et al., 2009).

SVM belongs to the regression model family and emerges from the area of

statistical learning theory (Vapnik, 1998). SVMs mainly involve a projection of the data

into a high-dimensional feature space using a valid kernel function and then apply a

simple linear regression within this enhanced space (Hornik et al., 2006). The resulting

linear regression function in the high-dimensional feature space corresponds to a non-

linear regression in the original input space (Smola and Schölkopf, 2004). In this study,

the radial basis kernel function was applied to project data into the high-dimensional

feature space before fitting a linear regression. To tune the kernel function, the

parameters cost and sigma were determined with a grid search method (Grunwald et al.,

2014).

The PLSR algorithm relates the response variable (e.g., TC, RC, and HC) and a

large number of highly collinear predictor variables (e.g., STEP-AWBH variables)

through a linear multivariate model to identify successive orthogonal principal

components (latent variables) that maximize the covariance between the response and

predictor variables (Garthwaite, 1994). Predictions were finalized by linear multivariate

regression of the response variable on the calibration dataset. For this study, 14, 18,

and 12 principal components were selected for TC, RC, and HC, respectively, by

identifying the minimum root mean square deviation (RMSD) of cross-validation on the

calibration datasets.

OK is the most commonly used weighted average interpolation technique that is

based on regionalized variable theory and depends on the spatial autocorrelation

structure of the target variable (McBratney et al., 2000). Because the sample distributions

of TC, RC, and HC were non-normal, the TC, RC, and HC values were transformed with

a log transformation (i.e., log10) to approximate the Gaussian distribution. Spherical and

exponential models were tested to select the best fit to the experimental

semivariograms for TC, RC, and HC. Exponential models were fitted to each of the

omnidirectional variograms of log-transformed TC, RC, and HC. After OK was

conducted on the validation locations, the log-transformed SOC pools were back-

transformed to the original units as outlined by Webster and Oliver (2007).

RK is the most commonly used hybrid interpolation technique that combines the

regression (e.g., the trend between the target variable and the auxiliary variables) and

ordinary kriging of the residual (i.e., stochastic unexplained variation) (Odeh et al., 1995;

Hengl et al., 2004). Stepwise multiple linear regression (SMLR) was employed to model

global spatial trend using the log-transformed TC, RC, and HC. This is followed by the

ordinary kriging of the regression residuals. The final prediction was then obtained by

summing the predicted and interpolated outputs in the original scale. After back-

transformation, the final C pool estimates were validated using the independent

validation set.
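
The two-step structure of RK (a regression trend plus an interpolated residual) can be sketched as below. This is a simplified illustration, not the SMLR and OK workflow used in this study: a single-covariate least-squares regression stands in for SMLR, and inverse distance squared interpolation stands in for ordinary kriging of the residuals. The sum is formed on the log scale and back-transformation is left out. All names and data are hypothetical.

```python
def fit_simple_linear_regression(x, y):
    """Ordinary least squares with one covariate; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope


def ids_interpolate(coords, values, target):
    """Inverse distance squared interpolation (stand-in for ordinary kriging)."""
    numerator = denominator = 0.0
    for (cx, cy), value in zip(coords, values):
        d2 = (cx - target[0]) ** 2 + (cy - target[1]) ** 2
        if d2 == 0.0:
            return value  # exact at sampled locations
        numerator += value / d2
        denominator += 1.0 / d2
    return numerator / denominator


def regression_kriging_sketch(coords, covariate, log_response, target_coord, target_covariate):
    """Trend from regression plus spatially interpolated residual (log scale)."""
    intercept, slope = fit_simple_linear_regression(covariate, log_response)
    residuals = [y - (intercept + slope * x) for x, y in zip(covariate, log_response)]
    trend = intercept + slope * target_covariate
    return trend + ids_interpolate(coords, residuals, target_coord)


# Hypothetical calibration sites: coordinates (km), one covariate, log10 C stock
coords = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
covariate = [1.0, 2.0, 3.0, 4.0]
log_response = [0.2, 0.5, 0.4, 0.9]
prediction_center = regression_kriging_sketch(coords, covariate, log_response, (5.0, 5.0), 2.5)
print(round(prediction_center, 3))
```

Because the residual step is an exact interpolator, the sketch reproduces the observed value at any calibration site, which is why validation has to rely on locations that were withheld from calibration.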

3.2.5 Evaluation of Model Performance

Independent validation was used to assess the prediction performance of the

evaluated methods. The Kolmogorov-Smirnov test was conducted to confirm the similar

distributions of the calibration and validation soil C fraction datasets. The difference

between the measured and predicted values in the eight models for TC, RC, and HC was

computed in the original scale.
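
The statistic behind the two-sample Kolmogorov-Smirnov comparison of calibration and validation sets is the largest gap between the two empirical cumulative distribution functions. A minimal Python sketch follows; it returns only the statistic (no p-value), and the function name and sample values are hypothetical.

```python
def ks_two_sample_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    d_max = 0.0
    for value in sorted(set(sample_a) | set(sample_b)):
        cdf_a = sum(1 for v in sample_a if v <= value) / len(sample_a)
        cdf_b = sum(1 for v in sample_b if v <= value) / len(sample_b)
        d_max = max(d_max, abs(cdf_a - cdf_b))
    return d_max


# Hypothetical stocks (kg C m-2) for a calibration and a validation subset
calibration_stocks = [1.2, 2.5, 3.3, 4.7, 8.9, 15.0]
validation_stocks = [1.0, 2.8, 3.1, 5.2, 9.4]
print(ks_two_sample_statistic(calibration_stocks, validation_stocks))
```

A statistic close to zero indicates that the two subsets follow similar distributions, while a value of one means they do not overlap at all.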

As the goodness-of-fit statistic, the coefficient of determination, R2, was used to

compare the amount of variation each model was able to explain. The Root Mean

Square Deviation (RMSD, kg C m-2) was used to make further inquiries on model

precision. Furthermore, to clearly illustrate the contribution of different methods to the

prediction performance, the relative decrease in RMSD was evaluated by taking the

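RMSD of OK as a reference, with the following formula:

$$\text{Relative decrease in RMSD}\ (\%) = \frac{\mathrm{RMSD}_{\mathrm{OK}} - \mathrm{RMSD}_{\mathrm{CART,\ BaRT,\ BoRT,\ RF,\ SVM,\ PLSR,\ RK}}}{\mathrm{RMSD}_{\mathrm{OK}}} \times 100 \qquad (3\text{-}2)$$

where $\mathrm{RMSD}_{\mathrm{OK}}$ is the error of OK as the benchmark (i.e., initial, primitive model of soil-

landscape) and $\mathrm{RMSD}_{\mathrm{CART,\ BaRT,\ BoRT,\ RF,\ SVM,\ PLSR,\ RK}}$ is the error of the evaluated model

being investigated.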
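
Eq. 3-2 translates directly into code. The function name and the RMSD values in the example are hypothetical and are not results from this study.

```python
def relative_rmsd_decrease_pct(rmsd_ok, rmsd_model):
    """Eq. 3-2: percent decrease in RMSD relative to the OK benchmark."""
    return (rmsd_ok - rmsd_model) / rmsd_ok * 100.0


# Hypothetical errors (kg C m-2): OK benchmark 4.0, evaluated model 3.0
print(relative_rmsd_decrease_pct(4.0, 3.0))
```

A positive value means the evaluated model has a lower error than OK; a negative value means it performs worse than the benchmark.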

Residual prediction deviation (RPD) (Williams, 1987) was selected to compare

our results with other studies documented in the literature. In addition to the RPD, the

ratio of prediction error to inter-quartile range (RPIQ) (Bellon-Maurel et al., 2010) was

reported since RPIQ is well suited to non-Gaussian distributed data.
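
The error metrics above can be sketched as follows, assuming the common definitions RPD = standard deviation of the observations / RMSD and RPIQ = inter-quartile range of the observations / RMSD. The quartile convention (the default of Python's `statistics.quantiles`) is an assumption and may differ slightly from the R implementation used in this study; function names and data are hypothetical.

```python
import math
import statistics


def rmsd(observed, predicted):
    """Root mean square deviation between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def rpd(observed, predicted):
    """Standard deviation of the observations divided by the RMSD."""
    return statistics.stdev(observed) / rmsd(observed, predicted)


def rpiq(observed, predicted):
    """Inter-quartile range of the observations divided by the RMSD."""
    q1, _, q3 = statistics.quantiles(observed, n=4)
    return (q3 - q1) / rmsd(observed, predicted)


# Hypothetical validation values (kg C m-2)
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [2.0, 3.0, 4.0, 5.0, 6.0]
print(rmsd(observed, predicted), rpd(observed, predicted), rpiq(observed, predicted))
```

Because the inter-quartile range is insensitive to a few extreme values, RPIQ is less affected than RPD by strongly skewed distributions such as those of the soil C stocks.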

3.2.6 Application of Models

Data preparation, manipulation, model development, model accuracy

assessment, and mapping were carried out using R 3.2.0 and its add-in packages (R

Core Team, 2015) (Table 3-2). To develop CART, BaRT, BoRT, RF, SVM, PLSR, and

OK, the ‘rpart’ (Therneau et al., 2015), ‘ipred’ (Peters et al., 2015), ‘gbm’ (Ridgeway,

2007), ‘randomForest’ (Liaw and Wiener, 2002), ‘kernlab’ (Karatzoglou et al., 2007), ‘pls’

(Wehrens et al., 2007), and ‘gstat’ (Pebesma, 2004) packages were used, respectively.

3.2.7 Mapping of Total, Labile and Recalcitrant Carbon Stocks

The Random Forest model for each soil C fraction was employed to create high-

resolution (30 m) soil C fraction maps covering the state of Florida. Categorical

variables identified by Boruta were excluded from the model development process. Xiong et al.

(2014a) stressed that adding more categorical variables leads to missing values in

produced maps because not all predictor classes were represented in calibration

samples. Therefore, C maps were produced based on the all-relevant continuous

predictors available across Florida.

3.3 Results and Discussion

3.3.1 Descriptive Summary Statistic of Carbon Fractions

The descriptive statistics of the measured soil carbon fractions observed with

respect to the entire dataset, calibration set (70%), and validation set (30%) are

presented (Table 3-3). Total C ranged from 0.45 to 34.15 kg TC m-2 with a mean of 4.74

and a median of 3.32 kg TC m-2. Considering the whole dataset, high kurtosis values

and strongly positively skewed distributions of TC, RC, and HC revealed the diverse

characteristics of Florida’s soil-scapes. The non-Gaussian distribution was probably due

to the frequent low values of soil C representative in the well-drained, coarse-grained

upland soils and the few extremely high values found in the poorly-drained wetland

soils. Collectively, the characteristics of the sample distributions reflected the large

variation in soil suborders, climatic conditions, land-uses, and environmental conditions

throughout the surface soil of Florida.

In this study, SOC essentially is equivalent to soil TC since the soil inorganic

carbon was a very minor constituent in most of the samples which is most likely due to

the acidic nature of Florida’s soil. The analysis revealed that the major amount of the

total C stored in the surface soil of Florida is concentrated within the stable recalcitrant

sub-pool (RC) as compared to the labile sub-pool (HC). While RC varied from 0.22 to

25.08 kg m-2 with a mean of 2.81 and a median of 1.69 kg m-2, HC ranged from 0.02 to

0.71 kg m-2 with a mean of 0.16 and a median of 0.14 kg m-2 (Table 3-3).

Even though the HC made up a relatively low portion of the TC, the HC is

essential in understanding soil C dynamics as it is a sensitive measure for short-term

changes induced by management practices, temperature, and soil moisture. Usually a

high correlation of HC to microbial biomass indicates that C is readily available for

microbial utilization (Leinweber et al., 1995). Additionally, HC is thought to be an

important labile component for soil micro-aggregation in organic matter and the soil

physical parameter to be studied with regard to soil quality (Ghani et al., 2003).

The Shapiro-Wilk test confirmed the approximately normal distribution for log-

transformed soil TC, RC, and HC. The Kolmogorov-Smirnov test confirmed that the

randomly separated calibration and validation samples appropriately represent the

population for TC, RC, and HC, respectively. This similarity between the calibration and

validation sets demonstrated that they were randomly sampled from identical

populations for each dataset. The Spearman’s pairwise correlation analysis between the

different soil C fractions indicated that TC and RC were strongly correlated, whereas HC and TC

were weakly correlated (Table 3-4). This implied a difference in the processes that are

responsible for the accumulation and decomposition of RC and HC.

3.3.2 Spatial Autocorrelation with Trend and without Trend

The omnidirectional variograms of log-transformed TC, RC, and HC and

residuals of the SMLR are presented to assess how much variation is captured by the

SMLR (Figure 3-2). For TC, RC, and HC, an exponential model was fitted with an

effective range of 28, 30, and 18 km, respectively. Also, an exponential model was fitted

with an effective range of 8, 7, and 3 km, respectively, to the residuals of the SMLR.

This finding indicated that the SMLR method was only able to explain some portions of

the stochastic spatially dependent variation. Similarity in range values for TC and RC

indicated the likelihood of similar major environmental factors controlling both carbon pools at

the regional scale. The lowest range was obtained for HC which could be attributed to

both the unstable inherent structure of HC to its environment (i.e., easily decomposable

molecular structure) and the uncaptured short-range variability due to area of extent and

sampling density tradeoff. Labile C fractions are composed of simple chemical

molecules, including carbohydrates and proteins, which are also constituents of the

soil microbial biomass (Balaria et al., 2009). Thus, HC can dynamically mineralize into

the atmosphere, transform into RC, or leach out to subsurface horizons. This may all

contribute to the uncertainty associated with HC modeling. In the present study, the

sampling density was about 0.007 per km2, which was sufficient to capture the long-

range variability of TC and RC (Figure 3-2), while the shorter spatial autocorrelation

range for HC suggested that there was uncaptured short-range variability in the labile C

fraction.
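As an illustration of what an effective range means for the exponential model, the sketch below assumes the common convention in which the semivariance reaches about 95% of the sill at the effective range. That convention, the function name, and the nugget and partial sill values are assumptions for illustration; only the three effective ranges are taken from the text.

```python
import math


def exponential_variogram(h, nugget, partial_sill, effective_range):
    """Semivariance at lag h for an exponential model.

    Assumed convention: the factor 3 makes the structured part reach
    about 95% of the partial sill at h = effective_range.
    """
    if h == 0:
        return 0.0
    return nugget + partial_sill * (1.0 - math.exp(-3.0 * h / effective_range))


# Effective ranges (km) reported above; nugget and partial sill are hypothetical.
effective_ranges_km = {"TC": 28.0, "RC": 30.0, "HC": 18.0}
nugget, partial_sill = 0.4, 0.6

# Fraction of the partial sill reached at the effective range (same for all three).
fraction_at_range = {
    name: (exponential_variogram(a, nugget, partial_sill, a) - nugget) / partial_sill
    for name, a in effective_ranges_km.items()
}

# At a fixed 18 km lag, the HC model is already near its sill while TC and RC are not.
gamma_at_18_km = {
    name: exponential_variogram(18.0, nugget, partial_sill, a)
    for name, a in effective_ranges_km.items()
}
```

Some software reports the range parameter of the exponential model rather than the effective range; under the convention assumed here the effective range is three times that parameter.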

The nugget to sill ratio (N:S), which expresses the magnitude of the spatial

autocorrelation, amounted to 37%, 36%, and 44% for TC, RC, and HC, respectively.

The N:S ratio for the residuals of TC, RC and HC was 38%, 32%, and 80%,

respectively. These findings also indicated that the explainable proportion of the total

variance for TC and RC was greater than that for HC. In line with our findings,

Vasques et al. (2010b) found strong spatial structure for TC and RC but only moderate

spatial dependence for HC in a large mixed-use watershed in Florida. The variogram

analysis in the present study revealed that the soil stable carbon pool, represented by

RC, showed the longest range (30 km). It is well-established that the stable C pool, RC,

is protected through physical, chemical, and biological stabilization mechanisms; thus,

the mean residence time can be decades to even thousands of years within the soil

system (Sollins et al., 1996; Goh, 2004; Lutzow et al., 2006). In contrast, the labile C

pool, HC, shows the shortest range in the surface soil, suggesting that the controls on

the stabilization of soil organic matter across Florida are highly variable and, therefore,

the labile C pool is less predictable when compared to the recalcitrant C pool.
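For reference, the nugget to sill ratio discussed above is commonly written as follows. The symbols are assumptions introduced here for illustration: $c_0$ denotes the nugget and $c_1$ the partial sill, so that $c_0 + c_1$ is the total sill.

```latex
\[
N{:}S \;=\; 100 \times \frac{c_0}{c_0 + c_1} \quad (\%)
\]
```

Under this reading, the N:S of 80% reported for the HC residuals means that about four fifths of the residual sill is spatially unstructured at the sampled scale, compared with 38% and 32% for the TC and RC residuals.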

3.3.3 Important Variables

Boruta, the all-relevant variable discovery method, identified 53 environmental

factors out of 332 variables to be relevant to topsoil C fractions in the state of Florida

(Table 3-5). For TC, RC, and HC, 36, 30, and 25 soil-environmental factors,

respectively, stood out as relevant with varying degrees of explanatory power. The Z

score, which represents the importance of the all-relevant variables, ranged from 3.5 to

22.3. The explanatory power of the all-relevant predictors was grouped into four classes

based on the Z score: weakly relevant (Z < 5), slightly relevant (5 < Z < 10), moderately

relevant (10 < Z < 15), and strongly relevant (Z > 15) to soil C. Furthermore, the first 13

variables common to TC, RC, and HC had Z scores ranging from slightly relevant to

strongly relevant; the 8 variables common to TC and RC were weakly relevant to slightly

relevant; and the 4 variables common to TC and HC were slightly relevant to

moderately relevant. The remaining 11, 9, and 5 variables were identified only as

weakly relevant to TC, RC and HC, respectively. This implies that the major predictors

which have the highest explanatory power for TC, RC, and HC were similar for all three

investigated C pools.
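The four relevance classes described above can be expressed as a small helper. The function name is hypothetical, and because the text does not say how scores exactly equal to 5, 10, or 15 are treated, assigning them to the higher class is an assumption.

```python
def relevance_class(z_score):
    """Map a Boruta importance Z score to the four classes used in the text.

    Assumption: boundary values (5, 10, 15) fall into the higher class.
    """
    if z_score < 5:
        return "weakly relevant"
    if z_score < 10:
        return "slightly relevant"
    if z_score < 15:
        return "moderately relevant"
    return "strongly relevant"


# The reported minimum and maximum Z scores (3.5 and 22.3).
lowest_class = relevance_class(3.5)
highest_class = relevance_class(22.3)
```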

Boruta filtered out most of the irrelevant climatic and topographic variables, and

multi-collinearity among the all-relevant variables was drastically reduced. For instance,

out of 180 variables associated with the atmosphere only 3 variables were selected.

However, there were still obviously redundant variables in the all-relevant variable set.

These included certain land cover variables that were obtained from different times,

certain vegetation properties obtained from different sources, and variables aggregated

over different profile depths (e.g., AWC25, AWC50, and AWC100). Xiong et al. (2014a)

pointed out that, compared with an exhaustive model that includes all-relevant variables,

the most parsimonious model built from a minimal-optimal variable set could

decrease the overall model complexity but also increase the uncertainty.

Ultimately, the variables selected by Boruta were included in each C fraction model

without removing the redundant variables because we wished to maintain high-quality

model prediction performance.

Soil taxonomic variables (i.e., soil suborder, soil great group) were strongly

relevant in the explanation of the total variation of TC, RC, and HC in surface soil in

Florida. In addition, soil order and historic soil organic matter (derived from the soil map

of the Soil Survey Geographic Database, SSURGO) were both moderately relevant to

the stabilization and destabilization processes of soil C. Furthermore, the soil

environmental variables with respect to soil moisture status (e.g., soil drainage classes,

plant-available soil water holding capacity [AWC] at different depths [25, 50, and 100],

soil hydric rating, and soil runoff potential) were identified as weakly relevant to strongly

relevant variables. These results confirm earlier findings by Vasques et al. (2012), who

found that, among all the ecological variables, soil AWC had well-structured spatial dependence

at both short and long ranges across Florida.

The field-obtained LULC classes and other similar predictors representing land

use/cover across Florida, together with the reclassified LULC (LULCRecls), were the most strongly

relevant to TC, RC, and HC. In addition, the national cropland data layer (Cropland) and

the national land cover dataset (LandCovCls) were identified as slightly relevant to TC.

Also, biotic variables, such as vegetation type (VegType), were moderately relevant to

TC, RC, and HC. Seasonally active vegetation (SmallNdviPkInt) was strongly relevant

to RC, while biophysical setting (BiophySet) was slightly relevant to TC and HC.

Moreover, NDVI and EVI were slightly relevant to variation in RC and HC.

Furthermore, some variables representing parent material (i.e., physiographic

province name and type) were strongly relevant to TC, RC, and HC. A few

others (i.e., environmental geology and surficial geology) were found to be weakly

relevant to TC only. These variables expressed the influence of parent material on the

soil C budget through modulating soil mineralogy. With vegetation and soil water being

somewhat dependent on surficial geology, surficial geology also has an indirect

influence through biotic/water complexes on the soil C pools (Eberhardt and Latham,

2000). Hence, the explanatory power of surficial geology was high as it may modify

vegetation and water processes.

As the topography across much of Florida is nearly level, soil slope was

the only topographic variable that stood out as slightly important to TC, RC, and HC,

whereas none of the other topographic variables were identified by Boruta. The

influence of slope as opposed to elevation revealed the fact that micro-topography

across Florida controls the soil C by modifying the soil water pattern (Mulkey et al.,

2008). Thus, soil slope was the only topographic predictor of the variation in TC, RC, and

HC in the topsoil.

Interestingly, all variables reflecting atmospheric properties demonstrated only

weak to slight relevance for inferring TC, RC, and HC variation. Among the atmospheric

variables, average monthly precipitation (i.e., PrecipFeb, PrecipMay, PrecipDecem,

PrecipJune, PrecipOct) and temperature or solar radiation (i.e., MaxTempDec, MaxTempJan,

SolarRadMay, MaxTempApr) were identified as weakly relevant to surface soil C across

Florida. The present study corroborated others’ findings by showing relatively insignificant

associations between precipitation or temperature and soil C in a subtropical climate

(Vasques et al., 2010a; Xiong et al., 2014b). Xiong et al. (2014b) attributed the weak

relationship between precipitation and variation in SOC in the topsoil of Florida to both

the translocation of SOC from top layers to underlying horizons, which mainly controls

the formation of Spodosols, and the high decomposition rate resulting from high

precipitation and net primary production (NPP).

3.3.4 Assessment of the Prediction Capability of the Selected Methods

A summary of the parameters characterizing the efficiency and quality of the

fitted models for each soil C fraction is presented (Table 3-6). In addition, graphs that

showcase observed and predicted soil C stocks with evaluated methods are illustrated

(Figures 3-3, 3-4 and 3-5). These graphs highlight deviations from the 1:1 line (i.e.,

“true” model) for TC, RC, and HC. Overall, the observed vs. predicted TC, RC, and HC

in the validation dataset matched well for RF, BaRT, and SVM, with values aligned close to

the 1:1 line. The high C values were under-predicted, whereas low C values were

over-predicted. For HC, there was significant scatter around the 1:1 line and large prediction

errors for all models.

Total C and RC appeared to respond similarly to the evaluated techniques in

terms of R2, RMSD, RPD, and RPIQ. In contrast, HC behaved significantly differently

compared with TC and RC. Overall, the best of the eight models was able to account

for 71.6% of the total variation of TC and RC, but only 30.5% of the total variation of HC.

This high proportion of unexplainable variation present in the labile C fraction may be

due to the inherent characteristic of HC. Modelling the labile pool of SOC was relatively

difficult because its formation is affected by dynamic biochemical processes which are

largely controlled by complex interacting soil-environmental factors. Specifically, several

space-time factors related to HC, such as SoilRunoff, PrecipJune, NdviJune, and EviAgust,

are inherently dynamic and vary across different spatial and temporal

scales. Altogether, these spatial-temporal complexities led to an increase in the

uncertainty of HC. In a catchment-scale study, Karunaratne et al. (2014) indicated that the

labile pool of soil C, represented by particulate organic carbon (POC), had the noisiest data,

showed the shortest range of spatial correlation, and was the hardest to model when

compared with resistant organic carbon (ROC). Therefore, the short-range variation

inherent to POC was not captured in their study. Despite the lower performance of HC

models, some of the variation was explained by a mixture of pedogenic, lithologic,

biotic, climatic, and water-specific factors.
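The four validation statistics referred to above can be sketched as follows. The definitions are assumptions based on common usage (R2 as the coefficient of determination, RPD as the standard deviation of the observations divided by RMSD, and RPIQ as their interquartile range divided by RMSD) and may differ in detail from those used in this study; the function name and sample values are hypothetical.

```python
import statistics


def validation_metrics(observed, predicted):
    """Return R2, RMSD, RPD, and RPIQ for paired observed and predicted values."""
    n = len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    rmsd = (ss_res / n) ** 0.5
    # Quartiles with the statistics module's default ("exclusive") method.
    q1, _, q3 = statistics.quantiles(observed, n=4)
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSD": rmsd,
        "RPD": statistics.stdev(observed) / rmsd,
        "RPIQ": (q3 - q1) / rmsd,
    }


# Hypothetical validation values (kg m-2), for illustration only.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 2.5, 3.5, 4.5, 5.5]
metrics = validation_metrics(observed, predicted)
```

RPIQ uses the interquartile range rather than the standard deviation, which makes it less sensitive to the skewed distributions typical of soil C stocks.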

In terms of RPIQ, the hierarchy in prediction performance for TC was as follows:

RF > SVM > BoRT > BaRT > PLSR > RK > CART > OK (Table 3-6). For the RC pool

the performance of models in terms of RPIQ was RF > PLSR > BoRT ~ BaRT ~ RK >

SVM > CART > OK, and for the labile carbon pool the model ranking was RF > SVM ~

BoRT > BaRT > PLSR > RK > CART > OK. Overall, the findings implied that OK was

the worst, as it did not use any covariates and relied solely on soil C fraction measurements at

sites. As an individual machine learning method, CART showed significantly lower

prediction accuracy compared to ensemble machine learning methods (i.e., RF, BaRT,

BoRT). RK usually yielded better prediction accuracy than OK and CART. Because the

RK relied on a simple regression model (derived from SMLR), it is possible it did not

have the ability to predict as well as RF and the other ensemble regression methods.

In terms of prediction error, RMSDs from the validation results ranged from 2.39

kg m-2 to 3.80 kg m-2 for TC (i.e., RF: 2.39 kg m-2, SVM: 2.69 kg m-2, BoRT: 2.74 kg m-2,

BaRT: 2.78 kg m-2, PLSR: 2.82 kg m-2, RK: 2.99 kg m-2, and OK: 3.80 kg m-2). Compared

to the RMSDs of TC, there was an overall decrease in RMSDs of RC for each method,

which varied from 1.89 kg m-2 to 3.27 kg m-2 (i.e., RF: 1.89 kg m-2, PLSR: 2.08 kg m-2,

RK: 2.13 kg m-2, BaRT: 2.16 kg m-2, BoRT and SVM: 2.21 kg m-2, CART: 2.57 kg m-2,

and OK: 3.27 kg m-2). The lowest RMSDs were achieved with HC, compared to TC and

RC. Also, multiple methods achieved the same RMSDs, which ranged from 0.06 kg m-2

to 0.08 kg m-2 (i.e., RF ~ BoRT ~ BaRT ~ SVM: 0.063 kg m-2, PLSR ~ RK ~ OK: 0.07 kg m-2,

and CART: 0.08 kg m-2).

Using OK, which models the spatial autocorrelation of soil C fractions, as a

reference method, the relative improvement in model performance (in %) was assessed

for the seven methods (Figure 3-6). All evaluated methods significantly improved the

accuracy of prediction for TC, RC, and HC. In other words, inclusion of the all-relevant

variables substantially decreased the prediction error for TC, RC, and HC. The overall

improvement was highest in RC, followed by TC and HC. Using the most sophisticated

methods, we were able to decrease prediction error up to 37%, 42%, and 12% for TC,

RC, and HC, respectively (Figure 3-6). For RC, the relative improvement was the

highest in using RF with a 42% gain, followed by PLSR, BoRT, RK, SVM, and BaRT

with 36%, 34%, 34%, 32%, and 32% gains, respectively, with the least improvement

with CART (21% gain) relative to the reference OK model. The greater improvement of

RC over TC can be attributed to higher accuracy of OK of RC over OK of TC. For TC,

RF and SVM improved the prediction accuracy by 37% and 29%, respectively, followed

by BoRT, BaRT, and PLSR with 27%, 27%, and 25% gains. The lowest improvements

for TC were found with RK and CART with 21% and 20% gains, respectively.
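The relative improvement over OK can be sketched from the validation RMSDs reported for TC. The definition used here, the percentage reduction in RMSD relative to the OK reference, is an assumption; it reproduces the gains stated in the text for RF, SVM, and RK. The function name is hypothetical.

```python
# Validation RMSDs for TC (kg m-2) as reported in the text.
rmsd_tc = {
    "RF": 2.39, "SVM": 2.69, "BoRT": 2.74, "BaRT": 2.78,
    "PLSR": 2.82, "RK": 2.99, "OK": 3.80,
}


def relative_improvement(rmsd_model, rmsd_reference):
    """Percentage reduction in RMSD relative to a reference model (assumed definition)."""
    return 100.0 * (rmsd_reference - rmsd_model) / rmsd_reference


gains_tc = {
    method: relative_improvement(rmsd, rmsd_tc["OK"])
    for method, rmsd in rmsd_tc.items()
    if method != "OK"
}
```

Because the gain is expressed relative to the OK error, a fraction with a weaker OK baseline can show a larger percentage gain for the same absolute reduction in RMSD.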

Surprisingly, for HC, the relative improvement was nearly identical for all machine

learning methods with a gain of 12.5% for RF, BoRT, SVM, and BaRT. This suggests

that applying the all-relevant variable approach for the modeling of HC did not

substantially improve the prediction accuracy when compared to OK. This finding is in

line with a watershed-scale study by Vasques et al. (2010b), who found that for

three out of five soil organic C fractions (namely HC, MC, and SC), RK did not

outperform block kriging, whereas it did outperform block kriging in the cases of TC and

RC. This implies that the environmental factors used to model soil HC

were not fully representative, and that missing predictors may account for the higher uncertainty

relative to TC and RC. Ahn et al. (2009) recommended TC over HC as the more efficient

and preferable indicator for detecting the mineralization rate of C because

HC extraction is time-consuming and generally has high measurement uncertainty.

Overall, for all soil C fractions, the RF model yielded the most satisfactory results

in terms of model fit (R2), RMSD, RPD, and RPIQ among the competitors, including

other machine learning methods (CART, BaRT, BoRT), advanced statistical methods

(i.e., SVM, PLSR), a hybrid method (RK), and a geostatistical method (OK). The

performances of machine learning methods were stable for different carbon pools. The

simpler CART was outperformed by its more complex counterparts (i.e., RF, BaRT, and

BoRT) as expected. The superior performance of RF, as a modified version of CART

and BaRT, stemmed from random selection of variables during tree building and

assembly. On the other hand, the superior performance of BoRT over BaRT and CART

can be explained by its stochastic gradient boosting procedure, which may

minimize overfitting and increase the accuracy of prediction within the validation

dataset (Lawrence, 2004).

The SVM, in terms of predictive capability, closely followed RF and was

comparable to the model performance of the machine learning methods for TC, RC, and

HC. One peculiarity of SVM is that it is susceptible to overfitting (Hernández et al.,

2009) when compared to its competitors, such as RF, BaRT, and BoRT. On the other

hand, SVM successfully captured the non-linear relationship in the data when compared

to PLSR and CART. Even though ensemble data-mining methods and SVM take their

power from the ability to detect the non-linear, complex, hierarchical relationship

between predictors (i.e., the STEP-AWBH factors) and predictands (i.e., distinct SOC

pools in the present study), they may be susceptible to overfitting, which might limit their

competitiveness. For example, Grunwald et al. (2014) found that PLSR was superior to

SVM when transferring and scaling spectral-based soil TC prediction models, which

indicates that the predictors’ domains largely affect the behavior of the modeling

techniques.

PLSR has been commonly applied in chemometric modeling using hyperspectral

soil datasets, where it enables the user to characterize the linear association between the

generally large number of predictors and the phenomenon of interest (i.e., the target

variable). PLSR can handle multicollinearity, while simple multivariate regression

cannot. Actually, the latter assumes that predictors are independent, which in DSM is

often not the case (i.e., STEP-AWBH variables are often correlated). However, there

are only a few studies that have utilized PLSR as an upscaling method for regional

scale soil mapping (Rodríguez-Lado and Martínez-Cortizas, 2015). Thus, it is important

to note its prediction behavior and predictive power relative to other powerful soil

prediction techniques. In this study, PLSR was able to explain 64%, 65%, and 26% of

the total variation with an RMSD of 2.82 kg C m-2, 2.08 kg C m-2, and 0.07 kg C m-2 for

TC, RC, and HC, respectively. The PLSR models predicted soil TC,

RC, and HC better than RK, CART, and OK and performed very similarly to the other methods (BaRT,

SVM, and BoRT). These findings imply that PLSR is a promising method for mapping

soil properties and classes. With PLSR, satisfactory predictive performance may be

obtained even when a relatively low number of predictors is used to calibrate the

models.

As expected, OK yielded the poorest model among all methods for the three soil

C pools (R2 = 0.43, 0.29, and 0.11 for TC, RC, and HC, respectively). This was because it solely

accounted for the spatially correlated stochastic variation characterized by the spatial

autocorrelation of distinct soil C fractions. RK, on the other hand, significantly improved

the prediction accuracy compared to OK for TC and RC. RK performed better when

auxiliary variables could be used to explain the significant part of the variation (Hengl et

al., 2007a). In the validation mode, the SMLR for TC and RC was significantly better

than that for HC, with R2 = 0.46, 0.51, and 0.26 and RMSD = 0.22, 0.24, and 0.19 kg m-2,

respectively. Hence, the RKSMLR,OK for TC and RC outperformed the standalone OK.

However, there was no gain from RKSMLR,OK for HC because the R2 of the regression with

auxiliary variables (0.26) was not significant.
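The logic of regression kriging discussed above can be summarized in a generic form. The notation is an assumption introduced here for illustration and is not necessarily the formulation used in this study: $\hat{m}(s_0)$ is the SMLR trend predicted from the auxiliary variables at location $s_0$, and $\hat{e}_{\mathrm{OK}}(s_0)$ is the ordinary-kriged regression residual at the same location.

```latex
\[
\hat{z}_{\mathrm{RK}}(s_0) \;=\; \hat{m}(s_0) \;+\; \hat{e}_{\mathrm{OK}}(s_0)
\]
```

Read this way, RK can only improve on OK when the trend term explains a meaningful share of the variation, and it can only improve on the regression alone when the residuals retain spatial structure for the kriging term to exploit.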

3.3.5 Residual Spatial Autocorrelation of Evaluated Methods

Residual spatial autocorrelation (RSA) of the evaluated models was investigated

with the omnidirectional variogram (Figures 3-7, 3-8 and 3-9). No meaningful RSA was

left behind in the evaluated models. This suggested that all attainable variation present

in the data was captured with the prediction models (BaRT, BoRT, CART, PLSR, RF,

and SVM), which were developed with the all-relevant variables identified by Boruta. This

can be interpreted to mean that the spatially autocorrelated all-relevant STEP-AWBH factors, such

as LULC, soil suborder, and AWC, enabled the models to capture all attainable variation in the

validation datasets. However, the residuals of CART were much more erratic for TC and

RC and somewhat more erratic for HC.
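A residual variogram of the kind examined above can be sketched with a simple omnidirectional estimator. The function name, lag settings, and the four-point transect are hypothetical; this is a minimal illustration of the calculation and not the software or settings used in this study.

```python
import math
from collections import defaultdict


def empirical_variogram(coords, residuals, lag_width, max_lag):
    """Omnidirectional empirical semivariogram.

    Returns {lag bin centre: mean of 0.5 * squared residual difference}
    over all point pairs separated by a distance in (0, max_lag).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    n = len(residuals)
    for i in range(n):
        for j in range(i + 1, n):
            h = math.dist(coords[i], coords[j])
            if 0 < h < max_lag:
                lag_bin = int(h // lag_width)
                sums[lag_bin] += 0.5 * (residuals[i] - residuals[j]) ** 2
                counts[lag_bin] += 1
    return {(b + 0.5) * lag_width: sums[b] / counts[b] for b in sorted(counts)}


# Hypothetical residuals along a short transect (coordinates in km).
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
residuals = [1.0, 2.0, 3.0, 4.0]
gamma = empirical_variogram(coords, residuals, lag_width=1.0, max_lag=3.5)
```

In this toy transect the semivariance rises with lag distance, which would signal leftover spatial structure; a roughly flat residual variogram is the pattern consistent with no meaningful RSA.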

In contrast, but in line with our hypothesis, Zhao and Shi (2010) accomplished

the best prediction performance, explaining up to 67% of the overall variation in SOC

stocks across a province in China, by combining artificial neural networks (ANN) to

capture deterministic trends and utilizing OK to capture stochastic variation (RKANN,OK)

compared to other models (MLR, Universal Kriging, RK). Similarly, Dai et al. (2014)

succeeded in improving SOM prediction performance in Tibet with RKANN,OK. In a

regional scale SOM characterization study, Guo et al. (2015) found a significant

improvement in overall model performance with RKRF,OK compared to SMLR. As data-

mining approaches have been increasingly employed in DSMM, it is critical to assess

the spatial autocorrelations in residuals (OK) that would potentially facilitate the

improvement of model performance (e.g., RK). Interestingly, in this study, no significant

spatial autocorrelation was found in the best performing RF model (TC, RC, and HC).

Because the more complex and best performing machine learning RF model could not

be used as a trend model in RK, we modeled the trend using SMLR instead. This is probably why

RKSMLR,OK did not perform as well as RF (and other machine learning methods) to

predict TC, RC, and HC.

In soil C modeling studies, incomplete knowledge of the processes largely

affecting stabilization and destabilization of soil organic C, a lack of appropriate scale-

relevant environmental predictors, observation and prediction scale mismatch, poor

sampling design or insufficient sampling density, and improper model choice may

contribute to the strength of the RSA of any model. Therefore, a soil model which has

some autocorrelation left in its residuals can be improved. Though some of these

factors (e.g., sampling design, density) should be addressed before the model

development process, the adverse influence of other factors (e.g., proper model choice,

useful predictors) on prediction performance can be ameliorated in the model

development process.

Prediction performance is known to depend more on the gathering of useful data

than on sophisticated methods. Thus, missing predictors can leave some RSA, even if

the right choice of methods was employed. In a national scale SOC stocks prediction

study, Martin et al. (2014) showed that simple BoRT models developed with a limited

number of predictors coupled with geostatistical modelling of residuals can significantly

improve standalone BoRT predictions, whereas the complex models developed with

a relatively higher number of predictors coupled with ordinary kriging of residuals did not

significantly improve standalone BoRT predictions. Unless all-relevant variables have

been included in a model, there is the chance that some additional explainable variation

can be captured with further analysis. In the present study, where the identified

parsimonious predictors sufficiently captured the stochastic spatially dependent

variation, there was no meaningful RSA left in the model residuals. When aiming to

improve predictions, one cannot be assured of the inclusion of all-relevant variables

without investigating RSA. Thus, the more successfully the all-relevant variable model

performs, the less likely it is that significant RSA can be identified and modeled.

The choice of method when characterizing the relationships between the soil-

environmental factors and target soil properties (e.g., TC) is particularly important with

respect to RSA. In the current study, the evaluated models (BaRT, BoRT, CART, PLSR,

RF, and SVM) did not have any substantial RSA because they were capable of

detecting a hierarchical, non-linear relationship between TC, RC, and HC and the

all-relevant STEP-AWBH factors. However, the SMLR of TC, RC, and HC did not capture

all the stochastic spatially dependent attainable variability in the data because

SMLR cannot capture the non-linear, hierarchical, complex relationship

between a dependent variable (i.e., TC, RC, and HC) and independent variables (i.e.,

the STEP-AWBH factors). Consequently, the SMLR residuals for TC and RC were

moderately, and those for HC weakly, spatially autocorrelated (Figure 3-2). That is why

RKSMLR,OK for TC and RC improved the prediction accuracy when compared to the

SMLR of TC and RC. For HC, however, RKSMLR,OK did not improve the prediction

accuracy due to its dynamic, noisy nature.

Because the success of modeling is determined by the proximity to the true

model of the soil-landscape at a scale of interest, the coupling of stochastic spatially

dependent and deterministic variation is necessary to characterize the spatial

distribution of soil properties and classes. Efforts to acquire exhaustive environmental

variable sets (i.e., STEP-AWBH variables) and then select those most relevant to

infer a soil property of interest are less user biased. In addition, the Boruta

all-relevant approach for six machine learning methods ensured that no significant RSA

was present (Figures 3-7, 3-8 and 3-9). Although machine learning models are

frequently superior to other modeling techniques, there is no guarantee that they

account for all of the stochastic spatially dependent variation. Because an identifiable RSA

can lead to an increase in model performance, the RSA of any model

should be investigated in DSMM.

3.3.6 Regional Scale Controls on Stabilization of Soil Carbon

The all-relevant variable approach enabled us to explain 71.6% of the overall

variation in TC and RC and 30.5% of the variation in HC at the regional scale. The result confirms

that biotic (e.g., vegetation and land use) and abiotic (soil-water gradient) environmental

determinants mainly control the storage of topsoil TC, RC, and HC in this subtropical

region. The TC and RC stocks were formed by various C forming and degrading

processes, such as aggregation, decomposition, humification, translocation, and

transformation, which have acted over prolonged periods of time. In contrast, the HC

stocks are more dynamically controlled by ecosystem processes resulting in temporal

trends that were more pronounced than spatial ones across the state.

The complex, hierarchical and multiscale interaction of distinct soil-environmental

determinants on formation and decomposition of soil C makes it difficult to assess

unambiguously how these predominant environmental factors regulate the fate of soil C.

Though many studies have aimed to map soil C stocks or concentrations (Simbahan et

al., 2006; Mishra et al., 2012; Karunaratne et al., 2014a; Poggio and Gimona, 2014), the

links between ancillary environmental variables and storage of C are still not clear

(Doetterl et al., 2013) and vary among geographic regions.

The amount of soil C in a particular soil body is primarily determined by the

tension between the influx through NPP (its quantity and mainly quality) and outflux (its

decomposition and leaching rate) (Janzen, 2004). In their landmark paper, Sollins et al.

(1996) described three main mechanisms responsible for the protection of SOM from

decomposition: 1) recalcitrance (i.e., molecular structure of SOM), 2) low accessibility

for biological decomposition, and 3) interaction with mineral particles. The following

discussion addresses how important variables and their interactions

can possibly modulate the above- and belowground C balance through important

destabilization processes operating at the regional scale.

Biotic variables, specifically LULC, stand out as one of the most important

predictors explaining the spatial distribution of topsoil C in Florida. Vegetation directly

controls the quantity and quality of organic matter residues via litter cover, species

diversity, distribution, and canopy cover (Davidson and Janssens, 2006). Additionally,

vegetation can alter the chemical properties of soils and microbial community

composition (Lange et al., 2015). Variations in relative abundance of labile and

recalcitrant compounds depending on vegetation type impart control on decomposability

of fresh organic matter residues (Melillo et al., 1989; Fissore et al., 2008). In a meta-

analysis study, Guo and Gifford (2002) reported the paramount influence of LULC

change on soil C stocks.

In the present study, the Kruskal-Wallis test indicated that land uses differed

significantly in concentration of TC, RC, and HC present in Florida’s topsoil (20 cm) at

the significance level of 0.0001. Post-hoc multiple comparison results are shown to

identify the pattern of those differences (Figures 3-10, 3-11, and 3-12). For visual assessment

purposes, the spatial distribution of land cover/land use classes is given (Figure 3-13),

adapted from Florida Fish and Wildlife Commission (2003). Overall, the amount

of C stored in different LULCs shows similar patterns for TC and RC. In particular, the

amount of TC and RC stored by LULC classes followed the general trend: sugarcane

and wetland contain the highest amount, followed by improved pasture, urban, mesic

upland forest, rangeland, and pineland while crop, citrus, and xeric upland forest

contained the lowest amount. In general, poorly drained, water-saturated

wetland soils had much higher TC than well-drained upland soils. Although

small differences were observed among the wetland soils owing to differing water saturation

periods, these were not significant, as in the case of mixed forest (7.2 kg m-2) and

cypress swamp (9.6 kg m-2). No significant difference was observed between pineland

and mesic upland forest. While citrus, crop, and improved pasture had similar TC storage,

urban soils had significantly higher TC than upland soils. Vasenev et al. (2014) found that the

urban SOC contents were comparable to or higher than those of natural and agricultural

areas in the Moscow region. Xiong et al. (2014b) quantified the changes in SOC stocks

depending on changes in LULC across the state. Also, they found that at the sites that

had undergone LULC changes, the conversion of wetland to other LULCs resulted in

dramatic SOC losses, whereas conversion from other LULCs to wetland promoted SOC

accretion. In addition, Xiong et al. (2014b) found moderately higher SOC stocks in

urban soils and that the conversion of barren land, crop, and pineland to urban soils

leads to C build-up. This confirms that better characterization and understanding of

urban C stocks may have a significant impact on our global C cycling understanding.
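The Kruskal-Wallis comparison of land-use classes used earlier in this section can be sketched as follows. The function names and the three groups of stock values are hypothetical and serve only to show how the H statistic is obtained; the closed-form p-value applies only to the three-group case shown.

```python
import math


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with the standard correction for ties."""
    pooled = sorted((value, g) for g, values in enumerate(groups) for value in values)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values and assign them their mean rank.
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        run_length = j - i + 1
        tie_term += run_length ** 3 - run_length
        for k in range(i, j + 1):
            rank_sums[pooled[k][1]] += mean_rank
        i = j + 1
    h = 12.0 / (n * (n + 1)) * sum(
        r * r / len(values) for r, values in zip(rank_sums, groups)
    ) - 3 * (n + 1)
    return h / (1 - tie_term / (n ** 3 - n))


def p_value_three_groups(h):
    """Chi-square upper tail for 2 degrees of freedom (three groups only)."""
    return math.exp(-h / 2)


# Hypothetical TC stocks (kg m-2) for three land-use classes, illustration only.
xeric_upland = [1.0, 2.0, 3.0]
pineland = [4.0, 5.0, 6.0]
wetland = [7.0, 8.0, 9.0]
h_stat = kruskal_wallis_h([xeric_upland, pineland, wetland])
p_value = p_value_three_groups(h_stat)
```

A significant H only indicates that at least one class differs from the others, which is why post-hoc multiple comparisons are needed to identify the pattern of differences.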

Soil taxonomic groupings are known to be determined by the configuration of

environmental factors; hence, C forming/degrading processes and C budget can

significantly differ with soil types. Previously, Histosols and Spodosols with mean SOC

contents of 97.6 and 9.9 kg m-2, respectively, standardized to a 1 m soil profile were

estimated to have the highest C sequestration potential in a study based on STATSGO

data (Guo et al., 2006). In a Florida watershed, Vasques et al. (2010a)

found that Histosols and Inceptisols had substantially higher TC in the 1 m soil

profile. In the same study, Alfisols had higher TC in the 1 m soil profile than Ultisols and

Entisols because the higher base saturation of Alfisols promotes natural

fertility. Entisols were dominated by quartz-rich sandy soils and were depleted in

organic matter and reactive minerals.

In the present study, the Kruskal-Wallis test indicated that suborders differed

significantly in concentration of TC, RC, and HC present in Florida’s topsoil (20 cm) at

the significance level of 0.0001. Post-hoc multiple comparison results are shown to

identify the pattern of those differences (Figures 3-14, 3-15, and 3-16). For visual assessment

purposes, the spatial distribution of land cover/land use classes is given (Figure 3-17),

adapted from Florida Fish and Wildlife Commission (2003). The greatest stocks

of TC were measured in Saprists and Aquolls with medians of 13.9 and 8.4 kg m−2,

respectively. In contrast, the smallest stocks of TC were measured in Psamments and

Udalfs with medians of 2.1 kg m−2. The greatest stocks of RC were measured in Saprists

and Aquolls with medians of 10.2 and 5.5 kg m−2, respectively. In contrast, the smallest

stocks of RC were measured in Psamments and Udalfs with medians of 1.1 kg m−2. The

greatest stocks of HC were measured in Aquolls, Saprists, Aquepts, and Arents with

medians of 0.24, 0.21, 0.19, and 0.19 kg m−2, respectively. In contrast, the smallest

stocks of HC were measured in Psamments and Udults with medians of 0.1 kg m−2.

Overall, soil suborders found on poorly drained portions of the landscape (e.g., Saprist,

Aquept, and Aquent) exhibited higher soil C than those found on better drained areas

(e.g., Psamment, Udult, and Orthod) for each soil carbon fractions.
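The rank-based comparison behind the Kruskal-Wallis test can be sketched in a few lines. The Python example below is only an illustration of the H statistic, not the R workflow used in this thesis; the function names and the toy values are hypothetical, and the tie correction and p-value are omitted. In practice H is compared against a chi-square distribution with (number of groups − 1) degrees of freedom.

```python
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic for a list of groups (no tie correction)."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


# Hypothetical stocks (kg m-2) for three illustrative groups, not thesis data.
toy_groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
h_stat = kruskal_wallis_h(toy_groups)
```

A larger H indicates that the groups occupy more clearly separated rank ranges; groups drawn from the same distribution give values of H near zero.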

Chemical protection (i.e., adsorption of organic molecules onto clay surfaces)

and physical protection (i.e., incorporation of organic molecules into aggregates) retard

decomposition of SOM through the mechanisms associated with soil mineralogy

(Schimel et al., 1985; Hassink, 1997; Torn et al., 1997; Six et al., 2002). Previous works have reported that the extent of protection offered by fine-textured soil is greater than that offered by coarse-textured soil (Parton et al., 1987b; Schimel et al., 1994; Baldock and Skjemstad, 2000). Accordingly, given the sand-rich, acidic nature of Florida’s topsoil, the protection offered by mineral surfaces is relatively low. For instance, in a study in north-central Florida (Santa Fe River Watershed), Ahn et al. (2009) showed that the low clay content was associated with relatively low TC and HC concentrations. They demonstrated

through incubation experiments that the sandy nature of these surface soils imparted a

lack of protection against C mineralization. Interestingly, in this study, variables that

reflect soil textural composition were not identified as relevant to TC, RC, and HC,

possibly because the topsoil was dominated by sand texture. Others also did not

observe a strong relationship between soil C stocks and clay because of the limited

range of clay content (Fissore et al., 2008; Angers et al., 2011; Doetterl et al., 2015).

Lawrence et al. (2015) stressed that the type of clay (expandable/non-expandable), high

surface area, and presence of very reactive forms of Al- and Fe-oxides (including

hydroxides and oxy-hydroxides) are better parameters to explain correlation of SOC

with minerals than clay content by itself.

Previous studies have indicated that soil moisture (Thomsen et al., 2003), soil aeration (Holden and Fierer, 2005), and soil temperature control microbial activity and hence the stability of SOC. The hydropedologic characteristics of the landscape across Florida may influence the stabilization of soil C in several ways. First, an excessive amount of soil water associated with convergent soil-scapes (e.g., depressions, wetlands, depositional valley bottoms) can retard microbial C mineralization because water-filled soil pores limit the oxygen available for microbial activity; ultimately, this can lead to stabilization of soil C across the soil profile (Ekschmitt et al., 2008; Rumpel and Kögel-Knabner, 2010). Second, especially in subtropical climates, soil-water status promotes NPP and can directly influence soil C storage by increasing the quantity of C supplied as residue to the soil system. Therefore, pedological and hydrological processes can inhibit the microbially controlled decomposition of SOC, and this may stabilize soil C, specifically in subtropical regions such as Florida.

Though topography is muted across the study area, micro-topography affects the soil C status by regulating hydrological processes. The high water table, the high amount of rainfall, and the coarse texture of the surface soils may collectively enhance the accretion of soil C. Consequently, in poorly drained depressions where water is often ponded for periods of time (e.g., flatwoods), anaerobic conditions decrease decomposition and enhance soil C accumulation. For example, Vasques et al. (2010a) found large variations in SOC stocks among drainage classes, from very poorly drained to well drained, in a subset region in northeastern Florida. This implies that micro-topography is of secondary importance, modifying the soil matrix and indirectly facilitating the stabilization of soil C.

Climatic factors, specifically precipitation and temperature, have been commonly

documented as the most important environmental determinants of soil C storage, flux,

and processes (Amundson, 2001; Baldock and Skjemstad, 2000) because of their

pronounced influence on the rate of organic matter decomposition and the quantity and

quality of organic matter (Liu et al., 2011). However, in the present study, climatic

predictors were weakly relevant to TC, RC, and HC. This finding is in line with other

studies (Percival et al., 2000; Liu et al., 2011). Michaletz et al. (2014) advocated that temperature and precipitation have an indirect influence on the variation in terrestrial net primary production by modifying plant age, stand biomass, and growing season length. Moreover, microtopography; a fluctuating high water table; sand-dominated topsoils that promote infiltration and percolation; the relatively acidic nature of the soils; and high precipitation in excessively drained, nearly level landscapes promote the vertical leaching of metals and organic material to subsurface horizons, forming C-rich spodic layers. Spodosols tend to have a high proportion of recalcitrant C not only in the topsoil but also in subsurface horizons (Stone et al., 1993). Xiong et al. (2014b) found a negative relationship between the SOC sequestration rate in topsoil and the mean annual precipitation, possibly because the coarse texture allows organic material to translocate to lower layers. Given the landscape conditions in Florida, Histosols and

Spodosols are most prominent throughout the state and provide ample opportunities to

sequester C. This implies that even though climatic properties are not the most

important variables at the regional scale, they greatly influence soil C storage at the

local scale by modifying the interplay between pedogenic and biotic factors.

3.3.7 Spatial Distribution of C fractions

The random forest method was used to map the spatial distribution patterns of the SOC pools at a resolution of 30 m across the whole region. Only the all-relevant continuous predictors were included during model calibration because the calibration sets lacked adequate representation of the categorical predictors. The cross-validation statistics of the calibration set and the validation statistics of the validation set are presented in Table 3-7. As expected, the RF models relying solely on continuous predictors achieved somewhat lower prediction accuracy. Even though categorical variables were among the most important predictors explaining the variability in TC, RC, and HC (Table 3-5), the RF models developed with only continuous variables performed comparably to the RF models developed with all relevant variables. Similar results were found by Xiong et al. (2014a), who reported that introducing categorical variables into RF models leads to gaps in the produced maps because every predictor class must be represented in the calibration dataset. In this study, the findings also suggest that the continuous variables employed to produce the soil C maps may capture the major processes relevant to TC, RC, and HC across the surface soil of the region. Hence, they serve as good surrogates for their categorical counterparts.
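The map-gap problem caused by categorical predictors can be illustrated with a small check for classes that occur in the prediction grid but not in the calibration set. This is a minimal Python sketch under stated assumptions: the function names and the class lists are hypothetical, and it is not the procedure used to build the maps in this study.

```python
def unseen_levels(calibration_levels, grid_levels):
    """Classes that occur in the prediction grid but not in calibration."""
    return sorted(set(grid_levels) - set(calibration_levels))


def predictable_mask(calibration_levels, grid_levels):
    """True where a grid cell's class was seen during calibration."""
    seen = set(calibration_levels)
    return [level in seen for level in grid_levels]


# Hypothetical suborder labels: calibration samples versus grid cells.
calibration = ["Psamments", "Aquods", "Saprists", "Psamments"]
grid = ["Psamments", "Udults", "Saprists", "Aquents"]

missing = unseen_levels(calibration, grid)   # ['Aquents', 'Udults']
mask = predictable_mask(calibration, grid)   # [True, False, True, False]
```

Grid cells flagged False would have to be left blank or handled separately, which is one way to see why a model restricted to continuous predictors yields more complete maps.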

Predicted soil TC had a mean of 5.39 and a standard deviation of 3.74 kg m−2; the recalcitrant pool (RC) had a mean of 3.25 and a standard deviation of 2.66 kg m−2; and the labile pool (HC) had a mean of 0.17 and a standard deviation of 0.05 kg m−2.

The predicted spatial distribution patterns of TC, RC, and HC are mapped in Figures 3-18, 3-19, and 3-20. In general, the locations of low and high values were consistent across all the maps. For instance, a similar cluster of large soil C values can be observed in the bend area along the Gulf coast, which spans from the Everglades Agricultural Area to south of Lake Okeechobee. High C stocks can also be observed in the wetlands interspersed in the pine forests in northern Florida. These areas are generally characterized by flatwoods, wetland forests, and cypress swamps with significant accretion of organic matter in the O and A horizons. In addition, similar clusters can be found at the western border of Florida, which is also dominated by flatwoods and swamps. On the other hand, the north-central portions of the Panhandle area are dominated by upland soils with rolling topography and relatively lower C stocks. The gaps in all the maps are due to the lack of soil data in the SSURGO database.

3.4 Conclusions

The present study demonstrated that the Boruta all-relevant variable search algorithm can be employed to select the best performing, parsimonious set of predictors from a spectrum of environmental factors without user bias. The results reveal that the human-induced vegetative and hydro-pedological characteristics of the region predominantly control the soil C stocks of the surface soil. In general, the lower C stocks were associated with well-drained upland soils, and the higher C stocks were related to water-rich wetland soils. This study also used the most important STEP-ABWH factors to trace their role in the stabilization of soil C across Florida, with distinct signatures for TC, RC, and HC.

Comparisons of common geostatistical, machine learning, and hybrid methods in the pedometricians’ toolbox indicated that RF, an ensemble machine learning method, outperformed all competitors in terms of R2, RPD, RPIQ, and RMSD. RF models accounted for up to about three-fourths of the total variability in TC and RC (R2 = 0.72; Table 3-6), but only about one-third of that in HC (R2 = 0.31), probably due to its unstable, dynamic nature and/or its low concentrations relative to analytical error. In addition, the best performing RF models reduced the RMSD of TC and RC by up to about 40% compared with the RMSD of OK, the reference soil-landscape model.
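The validation metrics named above can be written out explicitly. The following Python sketch assumes the common definitions RPD = standard deviation of the observations / RMSD and RPIQ = inter-quartile range of the observations / RMSD; the function names are illustrative, and the quartile method of the thesis's R workflow is not known here, so RPIQ values may differ slightly. The last lines reuse the RMSD values reported in Table 3-6 and the validation standard deviation in Table 3-3.

```python
import math
import statistics


def rmsd(observed, predicted):
    """Root mean squared deviation between observations and predictions."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def rpd(observed, predicted):
    """Sample standard deviation of the observations divided by the RMSD."""
    return statistics.stdev(observed) / rmsd(observed, predicted)


def rpiq(observed, predicted):
    """Inter-quartile range of the observations divided by the RMSD."""
    q1, _, q3 = statistics.quantiles(observed, n=4)
    return (q3 - q1) / rmsd(observed, predicted)


def relative_rmsd_change(model_rmsd, reference_rmsd):
    """Percent change in RMSD relative to a reference model (negative = better)."""
    return 100.0 * (model_rmsd - reference_rmsd) / reference_rmsd


# RMSD values from Table 3-6 (kg m-2): RF versus OK on the validation set.
tc_change = relative_rmsd_change(2.39, 3.81)   # about -37.3 %
rc_change = relative_rmsd_change(1.89, 3.27)   # about -42.2 %
# Validation SD of TC (Table 3-3) over the RF RMSD reproduces the reported RPD.
tc_rpd = 4.49 / 2.39                           # about 1.88
```

Under these definitions the RF models reduce the RMSD of OK by roughly 37% for TC and 42% for RC, which is consistent with the "about 40%" summary.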

Investigation of the RSA of the evaluated models revealed that including all relevant STEP-ABWH factors with proper methodologies could leave little to no RSA. Because one cannot be sure that all of the relevant variables have been included in the model development process, further characterization of RSA with appropriate statistical metrics could become routine in future DSMM studies. More sophisticated predictors representing vegetation, soil water, and soil geochemistry may lead to more accurate empirical geo-spatial soil-landscape models.
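One commonly used statistic for characterizing residual spatial autocorrelation is Moran's I. The Python sketch below is an assumption for illustration only: it is not necessarily the metric used in this thesis, the inverse-distance weighting is one of several possible choices, and the coordinates and residuals are hypothetical. It assumes no two sampling locations coincide.

```python
import math


def morans_i(coords, residuals):
    """Moran's I of residuals using inverse-distance weights between all pairs."""
    n = len(residuals)
    mean = sum(residuals) / n
    dev = [r - mean for r in residuals]
    weighted_cross, weight_sum = 0.0, 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            w = 1.0 / math.dist(coords[i], coords[j])  # closer pairs weigh more
            weighted_cross += w * dev[i] * dev[j]
            weight_sum += w
    return (n / weight_sum) * (weighted_cross / sum(d * d for d in dev))


# Hypothetical transect of six locations with unit spacing.
transect = [(float(x), 0.0) for x in range(6)]
clustered = morans_i(transect, [1, 1, 1, -1, -1, -1])    # similar neighbours
alternating = morans_i(transect, [1, -1, 1, -1, 1, -1])  # dissimilar neighbours
```

Residuals that are spatially clustered give a positive value, alternating residuals give a negative value, and values near −1/(n − 1) are expected when residuals show no spatial structure.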

Table 3-1. Assembled environmental variables representing STEP-ABWH factors

Variableᵃ | Relevant variable | Nᵃ | Factor | Data typeᵃ | Sourceᵃ | Original scale (m) | Date
Soil taxonomic order | SoilOrder | 1 | S | Cat. | SSURGO | 1:24,000 | 2009
Soil taxonomic suborder | SoilSuborder | 1 | S | Cat. | SSURGO | 1:24,000 | 2009
Soil taxonomic subgroup | - | 1 | S | Cat. | SSURGO | 1:24,000 | 2009
Soil taxonomic great group | SoilGreatGrp | 1 | S | Cat. | SSURGO | 1:24,000 | 2009
Soil particle size class | - | 1 | S | Cat. | SSURGO | 1:24,000 | 2009
Soil family CEC activity class | - | 1 | S | Cat. | SSURGO | 1:24,000 | 2009
Soil family reaction class | SoilReaction | 1 | S | Cat. | SSURGO | 1:24,000 | 2009
Soil family temperature class | - | 1 | S | Cat. | SSURGO | 1:24,000 | 2009
Soil family moisture subclass | - | 1 | S | Cat. | SSURGO | 1:24,000 | 2009
Soil muck | - | 1 | S | Cat. | SSURGO | 1:24,000 | 2009
Soil hydration expansion | SoilHydration | 1 | S | Cat. | SSURGO | 1:24,000 | 2009
Soil leaching potential | - | 1 | S | Cat. | SSURGO | 1:24,000 | 2009
Soil runoff potential | SoilRunoff | 1 | S/W | Cat. | SSURGO | 1:24,000 | 2009
Soil albedo | SoilAlbedo | 1 | S | Con. | SSURGO | 1:24,000 | 2009
Soil sand content (0-20 cm) | - | 1 | S | Con. | SSURGO | 1:24,000 | 2009
Soil silt content (0-20 cm) | - | 1 | S | Con. | SSURGO | 1:24,000 | 2009
Soil clay content (0-20 cm) | - | 1 | S | Con. | SSURGO | 1:24,000 | 2009
Soil organic matter (0-20 cm) (historic) | SOM | 1 | S | Con. | SSURGO | 1:24,000 | 2009
Soil moistureᵇ | SoilMoistFeb, … | 17 | S-W | Con. | SMOS | 15,000 | 2010-11
Elevation (30 m, 90 m, 1 km)ᶜ | - | 3 | T | Con. | USGS | 30/90/1000 | 1999
Slope (30 m, 90 m, 1 km)ᶜ | - | 3 | T | Con. | USGS | 30/90/1000 | 1999
Flow accumulation (30 m, 90 m, 1 km)ᶜ | - | 3 | T | Con. | USGS | 30/90/1000 | 1999
CTI (30 m, 90 m, 1 km)ᶜ | - | 3 | T | Con. | USGS | 30/90/1000 | 1999
Soil slope | SoilSlope | 1 | T | Con. | SSURGO | 1:24,000 | 2009
Distance from coast | - | 1 | T | Con. | FMRI | 30 | 1999
Distance from sinkhole | - | 1 | T | Con. | FGS | 30 | 1999
Distance from stream | - | 1 | T/W | Con. | USGS | 30 | 1999
Distance from open water | - | 1 | T/W | Con. | USGS | 30 | 1999
Easting, northingᵈ | - | 2 | T | Con. | Field sampling | N/A | 2009
Ecological regions | EcoRegion | 1 | E | Cat. | USGS | 1:250,000 | 2009
Physiographic province name and type | PhysiogName, .. | 2 | E/P | Cat. | USGS | 1:500,000 | 1998
Environmental geology | EnvGeology | 1 | P | Cat. | USGS | 1:500,000 | 1998
Surficial geology | SurGeology | 1 | P | Cat. | USGS | 1:500,000 | 1998
Surficial geology epoch and period | - | 2 | P | Cat. | USGS | 1:500,000 | 1998
Gamma-ray absorbed dose rate | - | 1 | P | Con. | USGS | 4000 | 1999-2005
Gamma Ray Concentrations of potassium, thorium, uranium | - | 3 | P | Con. | USGS | 2000 | 1975-1983
Gamma Ray Bouguer gravity anomaly | - | 2 | P | Con. | USGS | 4000 | 1998-1999
Gamma Ray magnetic anomaly | - | 3 | P | Con. | USGS | 2000 | 1945-2001
Precipitationᵇ | PrecipFeb, … | 26 | A | Con. | PRISM | 800 | 1971-2000
Temperatureᵇ | MaxTempJan, … | 65 | A | Con. | PRISM | 800 | 1971-2000, 1981-2010
Solar radiationᵇ | SolarRadMay | 13 | A | Con. | NARR | 32,000 | 1979-2009
Total ETᵇ | - | 13 | W | Con. | Uni. of Montana | 1000 | 2000
Total Potential ETᵇ | - | 13 | W | Con. | Uni. of Montana | 1000 | 2000
Annual latent heat fluxᵇ | LatHeat2009 | 13 | W | Con. | Uni. of Montana | 1000 | 2000
Long-term average annual ETᵇ | - | 2 | W | Con. | USGS | 800 | 1971-2000
Soil annual minimum water tableᵇ | - | 2 | W | Con. | SSURGO | 1:24,000 | 2009
Soil available water capacity (0-25 cm, 0-50 cm, 0-100 cm and 0-150 cm) | AWC25, AWC50, AWC100 | 4 | W | Con. | SSURGO | 1:24,000 | 2009
Flooding frequency class | - | 1 | W | Cat. | SSURGO | 1:24,000 | 2009
Ponding frequency class | PondFreq | 1 | W | Cat. | SSURGO | 1:24,000 | 2009
Drainage class | DrainCls | 1 | W | Cat. | SSURGO | 1:24,000 | 2009
Hydrologic group | - | 1 | W | Cat. | SSURGO | 1:24,000 | 2009
Runoff class | - | 1 | W | Cat. | SSURGO | 1:24,000 | 2009
Vegetation type | VegType | 1 | B | Cat. | LANDFIRE | 30 | 2002
Vegetation type system group 1 | VegTpSysGrp1 | 1 | B | Cat. | LANDFIRE | 30 | 2002
Vegetation type system group 2 | VegTpSysGrp2 | 1 | B | Cat. | LANDFIRE | 30 | 2002
Vegetation type order | - | 1 | B | Cat. | LANDFIRE | 30 | 2002
Vegetation type class | - | 1 | B | Cat. | LANDFIRE | 30 | 2002
Vegetation type subclass | - | 1 | B | Cat. | LANDFIRE | 30 | 2002
Biophysical settings | BiophySet | 1 | B | Cat. | LANDFIRE | 30 | 2002
Environmental site potential | EnvSitePot | 1 | B | Cat. | LANDFIRE | 30 | 2002
Vegetation height | - | 1 | B | Cat. | LANDFIRE | 30 | 2002
Vegetation cover | - | 1 | B | Cat. | LANDFIRE | 30 | 2002
Forest canopy properties | - | 4 | B | Con. | LANDFIRE | 30 | 2002
Landsat ETM+ bands | LsatB5, … | 6 | B | Con. | USGS | 30 | 2003
Landsat ETM+ tasseled cap indices | LsatTC1, … | 6 | B | Con. | USGS | 30 | 2003
Landsat ETM+ principal components | LsatPC1, … | 6 | B | Con. | USGS | 30 | 2003
Monthly MODIS NDVI | NdviMay, … | 12 | B | Con. | MODIS4NACP | 500 | 2005
Monthly MODIS EVI | EviAgust, … | 12 | B | Con. | MODIS4NACP | 500 | 2005
Monthly MODIS LAI | - | 12 | B | Con. | MODIS4NACP | 500 | 2005
Monthly MODIS FPAR | - | 12 | B | Con. | MODIS4NACP | 500 | 2005
Annual min, max and mean NDVI | - | 3 | B | Con. | MODIS4NACP | 1000 | 2005
NDVI greenup, peak and browndown day of year | - | 3 | B | Con. | MODIS4NACP | 1000 | 2005
NDVI greenup and browndown rate | - | 2 | B | | | |
NDVI Season length | - | 1 | B | Con. | MODIS4NACP | 1000 | 2005
NDVI amplitude and base NDVI level | NDVIAmplitude | 2 | B | | | |
Max peak NDVI | - | 1 | B | Con. | MODIS4NACP | 1000 | 2005
Large NDVI peak integralᵉ | - | 1 | B | Con. | MODIS4NACP | 1000 | 2005
Small NDVI peak integralᵉ | SmallNdviPkInt | 1 | B | Con. | MODIS4NACP | 1000 | 2005
Canopy coverage and Imperviousness | - | 2 | B | Con. | NLCD | 30 | 2001
Aboveground live dry biomass | - | 1 | B | Con. | NBCD | 30 | 2000
Gross and net primary production | - | 2 | B | | | |
Land cover class | LandCovCls | 1 | B/H | Cat. | NLCD | 30 | 2001
Cropland data layer | Cropland | 1 | B/H | Cat. | NCDL | 30 | 2004
Land use and land cover | LULCSampled | 1 | B/H | Cat. | Field sampling | N/A | 2009
Land use and land cover | LULC | 1 | B/H | Cat. | FFWCC | 30 | 2003
Land use and land coverᶠ | LULCRecls | 1 | B/H | Cat. | FFWCC | 30 | 2003

ᵃ Abbreviations: CEC, Cation Exchange Capacity; CTI, Compound Topographic Index; Landsat ETM+, Enhanced Thematic Mapper; MODIS, Moderate-Resolution Imaging Spectroradiometer; NDVI, Normalized Difference Vegetation Index; EVI, Enhanced Vegetation Index; LAI, Leaf Area Index; FPAR, Fraction of Photosynthetically Active Radiation; SSURGO, Soil Survey Geographic Database; STATSGO2, State Soil Geographic Database; SMOS, Soil Moisture and Ocean Salinity; USGS, United States Geological Survey; FMRI, Florida Marine Research Institute; PRISM, Parameter-elevation Regressions on Independent Slopes Model; NARR, North American Regional Reanalysis; LANDFIRE, LANDscape FIRE and resource management tools project; MODIS4NACP, MODIS for North American Carbon Project; ET, Evapotranspiration; NLCD, National Land Cover Data; NBCD, National Biomass and Carbon Dataset; NCDL, National Cropland Data Layer; FFWCC, Florida Fish and Wildlife Conservation Commission; FGS, Florida Geological Survey; N, number of variables; Cat., Categorical; Con., Continuous.

ᵇ The 17 soil moisture variables are 12 monthly averages and 4 seasonal (e.g., spring, summer, autumn, winter) and one overall average over 2010-2011. The 26 precipitation variables are 12 monthly averages and one overall average over 1971-2000 and the same for 1981-2010. The 65 temperature variables are 24 monthly averages of daily max and min temperatures plus 2 long-term averages (1971-2000). Also, there are 39 monthly averages of daily max, mean and min temperatures plus 2 long-term averages (1981-2010). The 13 solar radiation variables are 12 monthly averages over 1979-2009 and one long-term average. 36 evapotranspiration (ET) variables consist of 13 annual total evapotranspiration, potential ET and latent heat flux from 2000 to 2012 and 13 annual total potential ET Long-term average annual ET one annual average over 1971-2000 and long-term average ratio over precipitation between 1971-2000. The 2 soil water depth variables are soil annual minimum water table depth and annual minimum water table depth from April to June.

ᶜ Topographic attributes are gathered from different data sources at multiple scales including 30, 90 and 1000 m.

ᵈ Easting and northing are the projected coordinates where soil samples were collected.

ᵉ Small peak integral, given by the area of the region between the fitted function and the average of green-up NDVI and brown-down NDVI values, represents the seasonally active vegetation, which may be large for herbaceous vegetation cover and small for evergreen vegetation cover. Large peak integral, given by the area between the fitted function and the zero NDVI value bounded by the green-up time and brown-down time, represents the total vegetation stand and is a proxy for vegetation production.

ᶠ Reclassified land use and land cover layer was created by combining relatively small and similar groups.

Table 3-2. R packages to perform evaluated methods

Methods | R Packages | References
SMLR | stats | Base R team
CART | rpart | Therneau et al. (2015)
BaRT | ipred | Peters et al. (2015)
BoRT | gbm | Ridgeway (2004)
RF | randomForest | Liaw and Wiener (2002)
SVM | kernlab | Karatzoglou et al. (2004)
PLSR | pls | Wehrens et al. (2007)
OK | gstat | Pebesma (2004)

BaRT = bagged regression tree, BoRT = boosted regression tree, CART = classification and regression tree, OK = ordinary kriging, PLSR = partial least square regression, RF = random forest, RK = regression kriging, SMLR = stepwise multiple linear regression, SVM = support vector machine.

Table 3-3. Descriptive statistics of observed soil C fractions (TC: Total carbon, RC: Recalcitrant carbon and HC: Hot-water extractable carbon). Units: kg m-2.

Set | Fraction | Min. | Max. | Mean | Median | St.Dev.¹ | Range | Skew² | Kurtosis
Whole set (N = 1014) | TC | 0.45 | 34.15 | 4.74 | 3.32 | 4.35 | 0.45-34.15 | 2.94 | 11.21
Whole set (N = 1014) | RC | 0.22 | 25.08 | 2.81 | 1.69 | 3.28 | 0.22-25.08 | 3.40 | 14.09
Whole set (N = 1014) | HC | 0.02 | 0.71 | 0.16 | 0.14 | 0.09 | 0.02-0.71 | 1.49 | 3.71
Calibration (N = 710) | TC | 0.45 | 34.15 | 4.71 | 3.33 | 4.30 | 0.45-34.15 | 2.96 | 11.73
Calibration (N = 710) | RC | 0.22 | 25.08 | 2.77 | 1.69 | 3.19 | 0.22-25.08 | 3.56 | 15.91
Calibration (N = 710) | HC | 0.02 | 0.71 | 0.16 | 0.14 | 0.09 | 0.02-0.71 | 1.53 | 3.70
Validation (N = 304) | TC | 0.81 | 28.96 | 4.80 | 3.30 | 4.49 | 0.81-28.96 | 2.88 | 10.01
Validation (N = 304) | RC | 0.30 | 24.02 | 2.91 | 1.66 | 3.47 | 0.30-24.02 | 3.07 | 10.66
Validation (N = 304) | HC | 0.04 | 0.55 | 0.15 | 0.14 | 0.07 | 0.04-0.55 | 1.14 | 2.12

¹ St.Dev. = standard deviation. ² Skew = skewness.

Table 3-4. Spearman’s correlation analysis of the paired soil C fractions.

 | TC | RC | HC
TC | 1.00 | 0.94 | 0.79
RC | | 1.00 | 0.73
HC | | | 1.00

HC = hot water-extractable carbon, RC = recalcitrant carbon, TC = total carbon
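Spearman's correlation, as reported in Table 3-4, is the Pearson correlation computed on ranks. The Python sketch below illustrates that definition with hypothetical paired values; the function names are illustrative and this is not the R code used for the table.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed pairs."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical paired stocks (kg m-2); any monotonic relation gives rho = 1.
rho_monotonic = spearman_rho([1, 2, 3, 4, 5], [2, 4, 8, 16, 32])
rho_partial = spearman_rho([1, 2, 3, 4, 5], [1, 3, 2, 5, 4])
```

Because only ranks enter the calculation, the statistic is insensitive to the strong positive skew of the C fractions reported in Table 3-3.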

Table 3-5. Z score as a sign for relative importance of all-relevant variables identified by Boruta to infer on total carbon (TC), recalcitrant carbon (RC) and hot-water extractable carbon (HC) in kg m-2 in the. Note: The variables are described in Table 3-1.

Factors a Relevant Variables b TC RC HC
S SoilSuborder 22.3 26.8 16.8
B/H LULCSampled 27.5 31.7 14.2
S SoilGreatGrp 11.2 11.0 14.2
P PhysiogName 9.7 7.3 11.9
W DrainCls 7.7 6.5 12.1
B/H LULC 11.6 8.8 6.3
P SurGeology 11.7 8.1 13.3
B/H LULCRecls 11.5 8.0 4.4
S SOM 9.3 10.4 4.6
B VegType 8.2 6.3 4.3
A PrecipFeb 4.8 6.8 4.3
T SoilSlope 5.9 4.5 4.1
W AWC50 12.5 11.6
S SoilReaction 10.6 8.3
S SoilOrder 6.2 10.0
B SmallNdviPkInt 5.7 12.7
W AWC25 7.5 8.1
W AWC100 6.0 5.3
S SoilAlbedo 5.3 5.8
B LsatTC1 4.8 6.7
W SoilRunoff 7.3 11.8
E EcoRegion 6.7 5.1
B BiophySet 5.6 5.4
S SoilHydration 6.4 4.5
B/H LandCovCls 7.3
B/H Cropland 6.7
B VegTpSysGrp1 6.0
E EnvSitePot 5.8
A MaxTempDec 5.6
P PhysiogType 5.6
B VegTpSysGrp2 5.5
A MaxTempJan 5.3
A PrecipMay 5.2
A MaxTempApr 5.0
P EnvGeology 4.9
B NDVIAmplitude 8.1
A PrecipDecem 7.0
A PrecipDec 5.6
B LsatPC1 4.7
B EviOct 4.8
B LsatB5 4.8
W PondFreq 4.5
A SolarRadMay 4.3
A SoilMoistSep 4.3
A PrecipJune 6.6
B NdviJune 5.1
B EviJune 5.1
A PrecipOct 4.5
B NdviMay 4.3
B EviAgust 4.2
B LatHeat2009 4.2
S-W SoilMoistFeb 3.9

Abbreviations: S = soil, T = topography, E = ecology, P = parent material, A = atmosphere, B = biota, W = water, H = human. See Table 3-1 for the description of relevant variables. HC = hot-water extractable carbon, RC = recalcitrant carbon, TC = total carbon.

Table 3-6. Performance of eight different modelling methods to predict soil total carbon (TC), recalcitrant carbon (RC) and hot-water extractable carbon (HC) on the validation dataset (n = 304) across the topsoil (0-20 cm) of Florida. RMSD is given in kg m-2.

Method | R2 TC | R2 RC | R2 HC | RMSD TC | RMSD RC | RMSD HC | RPD TC | RPD RC | RPD HC | RPIQ TC | RPIQ RC | RPIQ HC
RF | 0.72 | 0.72 | 0.31 | 2.39 | 1.89 | 0.06 | 1.88 | 1.84 | 1.19 | 1.35 | 0.90 | 1.54
BoRT | 0.63 | 0.63 | 0.30 | 2.75 | 2.22 | 0.06 | 1.64 | 1.64 | 1.18 | 1.18 | 0.80 | 1.53
BaRT | 0.62 | 0.62 | 0.28 | 2.78 | 2.16 | 0.06 | 1.62 | 1.61 | 1.17 | 1.16 | 0.78 | 1.51
SVM | 0.65 | 0.62 | 0.30 | 2.69 | 2.21 | 0.06 | 1.67 | 1.57 | 1.19 | 1.20 | 0.76 | 1.54
PLSR | 0.64 | 0.65 | 0.26 | 2.82 | 2.08 | 0.07 | 1.59 | 1.68 | 1.14 | 1.14 | 0.82 | 1.47
RK | 0.63 | 0.63 | 0.21 | 2.99 | 2.13 | 0.07 | 1.51 | 1.63 | 1.05 | 1.08 | 0.79 | 1.36
CART | 0.56 | 0.51 | 0.17 | 3.03 | 2.57 | 0.08 | 1.48 | 1.35 | 1.00 | 1.06 | 0.66 | 1.29
OK | 0.43 | 0.29 | 0.11 | 3.81 | 3.27 | 0.07 | 1.18 | 1.06 | 1.04 | 0.85 | 0.52 | 0.35

Abbreviations: R2 = coefficient of determination, RMSD = root mean squared deviations, RPD = residual prediction deviation, RPIQ = ratio of prediction error to inter-quartile range; TC = total carbon, RC = recalcitrant carbon, HC = hot-water extractable carbon; BaRT = bagged regression tree, BoRT = boosted regression tree, CART = classification and regression tree, OK = ordinary kriging, PLSR = partial least square regression, RF = random forest, RK = regression kriging, SVM = support vector machine.

Table 3-7. Cross-validation (on the 70% calibration dataset) and independent validation (on the 30% validation dataset) results of Random Forest models to produce spatial distribution patterns for soil total carbon (TC), recalcitrant carbon (RC) and hot-water extractable carbon (HC) across Florida.

Fraction | Calibration (70%) R2 | Calibration (70%) RMSDᵃ (kg m-2) | Validation (30%) R2 | Validation (30%) RMSD (kg m-2)
TCᵇ | 0.55 | 2.89 | 0.65 | 2.68
RCᶜ | 0.50 | 2.25 | 0.63 | 3.15
HCᵈ | 0.26 | 0.08 | 0.22 | 0.07

ᵃ Abbreviations: HC = hot-water extractable carbon, RC = recalcitrant carbon, TC = total carbon, RMSD = root mean squared deviation.
ᵇ The 10 continuous variables are AWC25, AWC50, LsatTC1, MaxTempDec, MaxTempJan, PrecipMay, SmallNdviPkInt, SOM, SoilAlbedo, and SoilSlope.
ᶜ The 13 continuous variables are AWC25, AWC50, EviOct, LsatB5, LsatPC1, LsatTC1, NDVIAmplitude, PrecipDecem, SmallNdviPkInt, SolarRadMay, SOM, SoilAlbedo, and SoilSlope.
ᵈ The 7 continuous variables are EviAgust, EviJune, NdviMay, NdviJune, PrecipOct, SOM, and SoilSlope.

Figure 3-1. A total of 1014 soil sampling locations (70% calibration samples in light blue

and 30% validation samples in red) and elevation in Florida.

Figure 3-2. The upper part of the figure depicts the omnidirectional variograms for total carbon (TC), recalcitrant carbon (RC) and hot-water extractable carbon (HC) in log kg m-2. The lower part of the figure illustrates the omnidirectional variograms for the residuals arising from Stepwise Multiple Linear Regression (SMLR) of TC, RC and HC.

Figure 3-3. Predicted vs. observed soil total carbon (TC) of validation dataset derived from evaluated methods. BaRT =

bagged regression tree, BoRT = boosted regression tree, CART = classification and regression tree, OK = ordinary kriging, PLSR = partial least square regression, RF = random forest, RK = regression kriging, SVM = support vector machine. R2 = coefficient of determination; RMSD = root mean squared deviations; RPD = residual prediction deviation; RPIQ = ratio of prediction error to inter-quartile range.

Figure 3-4. Predicted vs. observed soil recalcitrant carbon (RC) of validation dataset derived from evaluated methods.

BaRT = bagged regression tree, BoRT = boosted regression tree, CART = classification and regression tree, OK = ordinary kriging, PLSR = partial least square regression, RF = random forest, RK = regression kriging, SVM = support vector machine. R2 = coefficient of determination; RMSD = root mean squared deviations; RPD = residual prediction deviation; RPIQ = ratio of prediction error to inter-quartile range.

Figure 3-5. Predicted vs. observed soil hot-water extractable carbon (HC) of validation dataset derived from evaluated

methods. BaRT = bagged regression tree, BoRT = boosted regression tree, CART = classification and regression tree, OK = ordinary kriging, PLSR = partial least square regression, RF = random forest, RK = regression kriging, SVM = support vector machine. R2 = coefficient of determination; RMSD = root mean squared deviations; RPD = residual prediction deviation; RPIQ = ratio of prediction error to inter-quartile range.

Figure 3-6. Relative increase (%) in root mean squared deviations (RMSD) of the evaluated prediction techniques compared to the RMSD of OK. BaRT = Bagged regression tree, BoRT = Boosted regression tree, CART = Classification and regression tree, OK = Ordinary kriging, PLSR = Partial least square regression, RF = Random forest, RK = Regression kriging, SVM = Support vector machine.

Figure 3-7. Strength of the spatial autocorrelation among evaluated model residuals for total carbon (TC). BaRT = Bagged

regression tree, BoRT = Boosted regression tree, CART = Classification and regression tree, OK = Ordinary kriging, PLSR = Partial least square regression, RF = Random forest, RK = Regression kriging, SVM = Support vector machine.

Figure 3-8. Strength of the spatial autocorrelation among evaluated model residuals for recalcitrant carbon (RC). BaRT =

Bagged regression tree, BoRT = Boosted regression tree, CART = Classification and regression tree, OK = Ordinary kriging, PLSR = Partial least square regression, RF = Random forest, RK = Regression kriging, SVM = Support vector machine.

Figure 3-9. Strength of the spatial autocorrelation among evaluated model residuals for hot-water extractable carbon

(HC). BaRT = Bagged regression tree, BoRT = Boosted regression tree, CART = Classification and regression tree, OK = Ordinary kriging, PLSR = Partial least square regression, RF = Random forest, RK = Regression kriging, SVM = Support vector machine.

Figure 3-10. Violin plot of soil total C (kg m−2) of the 1014 current samples (2008–2009) grouped by the field-observed land use/land cover (LULC). The Kruskal–Wallis test shows a significant effect of LULC on total C at the significance level of 0.0001, and post hoc multiple comparison results are denoted by the letter codes above the class names (classes that share no common letter had significantly different TC at α = 0.05).

Figure 3-11. Violin plot of soil recalcitrant C (kg m−2) of the 1014 current samples (2008–2009) grouped by the field-observed land use/land cover (LULC). The Kruskal–Wallis test shows a significant effect of LULC on recalcitrant C at the significance level of 0.0001, and post hoc multiple comparison results are denoted by the letter codes above the class names (classes that share no common letter had significantly different RC at α = 0.05).

Figure 3-12. Violin plot of soil hot-water extractable C (kg m−2) of the 1014 current samples (2008–2009) grouped by the field-observed land use/land cover (LULC). The Kruskal–Wallis test shows a significant effect of LULC on hot-water extractable C at the significance level of 0.0001, and post hoc multiple comparison results are denoted by the letter codes above the class names (classes that share no common letter had significantly different HC at α = 0.05).

Figure 3-13. Spatial distribution of land cover/land use classes [Adapted from Florida Fish and Wildlife Conservation Commission. 2003. Florida vegetation and land cover data derived from 2003 Landsat ETM+ imagery by B Styes et al. Office of Environmental Services, Florida Fish and Wildlife Conservation Commission, Tallahassee, FL.]

Figure 3-14. Violin plot of soil total C (kg m−2) of the 1014 current samples (2008–2009) grouped by the field-observed suborders. The Kruskal–Wallis test shows a significant effect of suborders on total C at the significance level of 0.0001, and post hoc multiple comparison results are denoted by the letter codes above the class names (classes that share no common letter had significantly different TC at α = 0.05).

Figure 3-15. Violin plot of soil recalcitrant C (kg m−2) of the 1014 current samples (2008–2009) grouped by the field-observed suborders. The Kruskal–Wallis test shows a significant effect of suborders on recalcitrant C at the significance level of 0.0001, and post hoc multiple comparison results are denoted by the letter codes above the class names (classes that share no common letter had significantly different RC at α = 0.05).

Figure 3-16. Violin plot of soil hot-water extractable C (kg m−2) of the 1014 current samples (2008–2009) grouped by the field-observed suborders. The Kruskal–Wallis test shows a significant effect of suborders on hot-water extractable C at the significance level of 0.0001, and post hoc multiple comparison results are denoted by the letter codes above the class names (classes that share no common letter had significantly different HC at α = 0.05).

Figure 3-17. Spatial distribution of soil suborders [Adapted from Natural Resources Conservation Service (NRCS), 2007. Soil Survey Geographic Database (SSURGO). United States Department of Agriculture (USDA). Map scale 1:24,000. Accessible through http://datagateway.nrcs.usda.gov/GDGOrder.aspx].

Figure 3-18. Spatial distribution patterns of estimated soil total carbon stocks (kg m-2) across Florida, U.S. The map was generated with a random forest model developed only with the continuous all-relevant variables identified by the Boruta algorithm.

Figure 3-19. Spatial distribution patterns of estimated recalcitrant carbon stocks (kg m-2) across Florida, U.S. The map was generated with a random forest model developed only with the continuous all-relevant variables identified by the Boruta algorithm.

Figure 3-20. Spatial distribution patterns of estimated hot-water extractable carbon stocks (kg m-2) across Florida, U.S. The map was generated with a random forest model developed only with the continuous all-relevant variables identified by the Boruta algorithm.


CHAPTER 4 SUMMARY AND SYNTHESIS

Soil C storage in Florida to a standardized depth of 1 m has been estimated as the highest among all conterminous U.S. states (Guo et al., 2006). In the Anthropocene, however, population growth, industrialization, urbanization and human-induced impacts on natural forces have strongly affected the global soil C budget. Increasing the reliability of soil C estimates for Florida is therefore particularly important for determining the future status of Florida’s soil C in a changing world. Hence, the research presented in this thesis focused on constructing accurate, realistic and parsimonious geospatial soil-landscape models to explore both the deterministic and the stochastic components that explain the variability of distinct soil carbon fractions, namely recalcitrant, labile and total soil C.

In the first part of the thesis (Chapter 2), a comprehensive synthesis of RK, one of the most widely used methods in DSM, was conducted to gain insights into the stochastic and deterministic variation of the investigated soil properties. The evolution of hybrid techniques in pedometrics was outlined from a historical perspective. Moreover, the parameters that may influence the performance of RK predictions were reported by reviewing 40 articles published in international soil science journals between 2004 and 2014. Findings from the 140 cases documented in these articles revealed that sample density and the strength of the relationship between auxiliary variables and the soil property predominantly influence the prediction performance of RK. In addition, we propose that the following criteria be explicitly documented in future soil science DSM papers to ensure consistency among studies: area of extent, sample design, sample size (training and validation separately), sample depth(s), SCORPAN factors, spatial resolution of the final map, transformation methods, the method of factorial analysis, regression type, coefficient of determination of the deterministic function, variogram model type, spatial autocorrelation range, N:S ratio, validation method, R² and RPIQ. We also found that incorporating non-parametric machine learning methods into the standard RK framework can improve the prediction accuracy of soil properties. However, how parameters and methods specifically affect the spatial dependence of residuals needs further investigation. Lastly, various RK types were proposed for comparative assessment to gain further insight into the RK protocol: RK_{RF,OK}, RK_{RF,IDS}, RK_{RF,BK}, RK_{SVM,OK}, RK_{SVM,IDS}, RK_{GWR,OK}, RK_{GWR,IDS}, RK_{GWR,BK}, RK_{PLSR,OK}, RK_{PLSR,IDS}, RK_{PLSR,BK}, RK_{PCR,OK}, RK_{PCR,IDS} and RK_{PCR,BK}.
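The two-part structure that all of these RK variants share, a deterministic regression on auxiliary variables plus a spatial interpolation of the regression residuals, can be sketched as follows. This is a minimal illustration under strong assumptions and not the procedure used in this thesis: a single-covariate least-squares line stands in for the regression step, and inverse-distance weighting stands in for kriging of the residuals (no variogram is fitted). All function names and parameters (`fit_linear_trend`, `idw`, `hybrid_predict`, `power`) are hypothetical.

```python
import math


def fit_linear_trend(covariate, z):
    """Least-squares fit z ~ a + b * covariate (deterministic part)."""
    n = len(covariate)
    mean_x = sum(covariate) / n
    mean_z = sum(z) / n
    sxx = sum((x - mean_x) ** 2 for x in covariate)
    sxz = sum((x - mean_x) * (v - mean_z) for x, v in zip(covariate, z))
    b = sxz / sxx
    a = mean_z - b * mean_x
    return a, b


def idw(coords, values, target, power=2.0):
    """Inverse-distance-weighted interpolation.

    Assumption: used here only as a simple stand-in for kriging.
    """
    num = 0.0
    den = 0.0
    for point, value in zip(coords, values):
        d = math.dist(point, target)
        if d == 0.0:
            return value  # exact at sampled locations
        w = 1.0 / d ** power
        num += w * value
        den += w
    return num / den


def hybrid_predict(coords, covariate, z, target, target_covariate):
    """Prediction = regression trend at the target + interpolated residual."""
    a, b = fit_linear_trend(covariate, z)
    residuals = [v - (a + b * x) for x, v in zip(covariate, z)]
    trend = a + b * target_covariate
    return trend + idw(coords, residuals, target)
```

If the regression explains the data completely, the residual term vanishes and the prediction reduces to the trend alone; swapping the regression or the interpolator yields the different RK types named above.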

In the second part of the thesis (Chapter 3), we aimed to develop accurate, realistic and parsimonious soil C models for the total, labile and recalcitrant soil C pools. To select important predictor variables strategically, the machine learning data reduction technique Boruta was employed to identify the all-relevant environmental predictors among 327 STEP-AWBH factors. This not only reduced the multicollinearity among the exhaustive grids of environmental variables but also allowed us to develop unbiased models. Boruta identified 36, 30 and 25 all-relevant variables for TC, RC and HC, respectively, which optimized prediction quality in terms of fit, accuracy and parsimony. Results revealed that human-induced biotic and hydro-pedological factors of a given soilscape predominantly control the stabilization and destabilization processes of soil C pools. To ensure an accurate model, eight pedometric methods were also compared: PLSR, CART, BaRT, BoRT, RF, SVM, OK and RK. RF, an ensemble machine learning method, outperformed all of its competitors in terms of R², RPD, RPIQ and RMSD and accounted for up to three-fourths of the total variability in TC and RC, but only one-fourth of the variability in HC because of its unstable, dynamic nature. The spatial dependence of the residuals derived from the different methods was investigated to develop the most realistic model. No significant RSA remained for any of the evaluated methods, except in the residuals derived from SMLR. In other words, incorporating a data-mining method into the RK framework was not necessary because no stochastic variation was left in the model residuals. This can be attributed to two factors: first, the sophisticated methods were able to capture all the attainable variation offered by the environmental variables; and second, the introduction of all-relevant auxiliary environmental variables ensured that all attainable information present as deterministic and stochastic variation was captured. Based on these findings, we propose that where a biased, smaller set of environmental predictors is used to model soil properties, the residuals should be reported explicitly and routinely in future DSM studies.
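For readers who wish to report the same diagnostics, the validation metrics and a simple check of residual spatial autocorrelation can be sketched as below. The definitions are assumptions chosen for illustration and may differ in detail from those used in Chapter 3: R² is computed as 1 − SS_res/SS_tot, RPD as the sample standard deviation of the observations divided by RMSD, RPIQ as the interquartile range of the observations divided by RMSD (quartiles from the default method of Python's `statistics.quantiles`), and spatial autocorrelation as global Moran's I with binary distance-band weights. All function names and the `max_dist` parameter are hypothetical.

```python
import math
import statistics


def rmsd(obs, pred):
    """Root mean square deviation between observations and predictions."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def r_squared(obs, pred):
    """Coefficient of determination, defined here as 1 - SS_res / SS_tot."""
    mean_obs = statistics.fmean(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def rpd(obs, pred):
    """Residual prediction deviation: sample SD of observations / RMSD."""
    return statistics.stdev(obs) / rmsd(obs, pred)


def rpiq(obs, pred):
    """Ratio of performance to interquartile range: (Q3 - Q1) / RMSD."""
    q1, _, q3 = statistics.quantiles(obs, n=4)
    return (q3 - q1) / rmsd(obs, pred)


def morans_i(coords, residuals, max_dist):
    """Global Moran's I of residuals.

    Assumption: weight 1 for distinct pairs within max_dist, else 0.
    """
    n = len(residuals)
    mean_res = statistics.fmean(residuals)
    dev = [r - mean_res for r in residuals]
    cross = 0.0
    weight_sum = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= max_dist:
                cross += dev[i] * dev[j]
                weight_sum += 1.0
    return (n / weight_sum) * cross / sum(d * d for d in dev)
```

Values of Moran's I near zero suggest that little spatial structure remains in the residuals, whereas clearly positive values would indicate that a residual interpolation step, as in RK, could still add information. A significance test (for example by permutation) is not included in this sketch.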

It is unlikely that any single spatial prediction method is best for every case (Sun et al., 2012), but it may be possible to develop models that identify and quantify the attainable variability for a given sampling configuration. In this research we therefore illustrated how to capture all the attainable variation in three different soil properties. For further improvement, introducing more sophisticated environmental predictors representing vegetation, soil water and soil geochemistry is the way forward to decrease the uncertainty associated with regional-scale C estimation. The study also elaborated how the most sensitive environmental factors may influence the soil C budget along pedo-climatic trajectories across Florida. The predicted maps clearly displayed lower C density in well-drained uplands and higher C density in water-rich wetlands.


APPENDIX LITERATURE REVIEW


Summary of DSMM papers (2004–2014) that used RK to map soil properties and classes


Description of properties Literature Review


Predicted soil properties and classes: soil organic carbon (SOC); total phosphorus (P total), organic P (P org), inorganic P (P inorg) and available P; soil available P characterized by different chemical extractions: ammonium acetate (P-AEE), water (P-H2O), CO2-saturated nanopure water (P-CO2) and sodium bicarbonate (P-NaHCO3); soil organic carbon (SOC) stocks; resistant organic carbon (ROC), humus organic carbon (HOC) and particulate organic carbon (POC); soil pH (pH); soil organic matter (SOM); carbon to nitrogen ratio (C:N); alkali-hydrolysable N (AN); total C, N, K, Al, Ca, Mg, Zn, Cr, Cu and Ni; decalcified loess material (C1); arsenic (As), cadmium (Cd), chromium (Cr), copper (Cu), mercury (Hg), nickel (Ni), lead and zinc (Z); soil texture classes (Tex. Clas); nitrate–nitrogen concentrations (NO3-N conc); sparse mineral nitrogen (MinN); potentially available nitrogen (PAN); soil (regolith) depth (Reg. Dep); electrical conductivity (EC); recalcitrant C (RC), hydrolysable C (HC), hot-water-soluble C (SC) and mineralizable C (MC); available water capacity (AWC).

Sample design: regular grid and its sample spacing in m (G-m); random sampling (R); stratified random sampling (SR); purposive sampling (PS); conditioned Latin hypercube sampling (cLHS).

Total number of samples in the training set (T); total number of samples in the validation set (V); cross-validation (Cval).

SCORP: soil (S), climate (C), organism (O), relief (R), parent material (P); spatial resolution of the digital elevation model in m (DEM).

Logarithmic transformation (log); principal component analysis (PCA).

Regression type: stepwise multiple linear regression (SMLR); generalized linear model (GLM); regression tree (RT); support vector regression (SVR); residual maximum likelihood (REML); geographically weighted regression (GWR); logistic regression (LR); classification and regression tree (CART); generalized additive model (GAM).

Variogram model: exponential (Exp); spherical (Sph).

Validation: coefficient of determination (R2); mean error (ME); root mean square error (RMSE); RMSE normalized by the total variation, i.e., the standard deviation (RMSEr); mean standardized squared deviation ratio (MSDR); Lin's concordance correlation coefficient (CCC); standardized prediction error (θ); root mean square error normalized by ymax − ymin (NRMSE); residual prediction deviation (RPD); relative operating characteristic (ROC); model efficiency coefficient (MEF); mean rank of a method (MR); mean square error (MSE); mean relative error (MRE).


LIST OF REFERENCES

Ahmed, S., De Marsily, G., 1987. Comparison of geostatistical methods for estimating transmissivity using data on transmissivity and specific capacity. Water Resour. Res. 23, 1717–1737. doi:10.1029/WR023i009p01717

Ahn, M.-Y., Zimmerman, A.R., Comerford, N.B., Sickman, J.O., Grunwald, S., 2009. Carbon mineralization and labile organic carbon pools in the sandy soils of a North Florida watershed. Ecosystems 12, 672–685. doi:10.1007/s10021-009-9250-8

Amundson, R., 2001. The carbon budget in soils. Annu. Rev. Earth Planet. Sci. 29, 535–562. doi:10.1146/annurev.earth.29.1.535

Angers, D.A., Arrouays, D., Saby, N.P.A., Walter, C., 2011. Estimating and mapping the carbon saturation deficit of French agricultural topsoils: Carbon saturation of French soils. Soil Use Manag. 27, 448–452. doi:10.1111/j.1475-2743.2011.00366.x

Balaria, A., Johnson, C.E., Xu, Z., 2009. Molecular-scale characterization of hot-water-extractable organic matter in organic horizons of a forest soil. Soil Sci. Soc. Am. J. 73, 812. doi:10.2136/sssaj2008.0075

Baldock, J.A., Skjemstad, J.O., 2000. Role of the soil matrix and minerals in protecting natural organic materials against biological attack. Org. Geochem. 31, 697–710.

Baldock, J.A., Wheeler, I., McKenzie, N., McBratney, A., 2012. Soils and climate change: potential impacts on carbon stocks and greenhouse gas emissions, and future research for Australian agriculture. Crop Pasture Sci. 63, 269–283.

Basher, L.R., 1997. Is pedology dead and buried? Aust. J. Soil Res. 35, 979–994.

Baxter, S.J., Oliver, M.A., 2005. The spatial prediction of soil mineral N and potentially available N using elevation. Geoderma, Pedometrics 2003 128, 325–339. doi:10.1016/j.geoderma.2005.04.013

Belanche-Muñoz, L., Blanch, A.R., 2008. Machine learning methods for microbial source tracking. Environ. Model. Softw. 23, 741–750. doi:10.1016/j.envsoft.2007.09.013

Bellon-Maurel, V., Fernandez-Ahumada, E., Palagos, B., Roger, J. M., McBratney, A., 2010. Critical review of chemometric indicators commonly used for assessing the quality of the prediction of soil attributes by NIR spectroscopy. TrAC Trends Anal. Chem. 29, 1073–1081. doi:10.1016/j.trac.2010.05.006

Bishop, T.F.A., McBratney, A.B., 2001. A comparison of prediction methods for the creation of field-extent soil property maps. Geoderma, Estimating uncertainty in soil models 103, 149–160. doi:10.1016/S0016-7061(01)00074-X


Biswas, A., Cheng, B., 2013. Model Averaging for Semivariogram Model Parameters, in: Grundas, S. (Ed.), Advances in Agrophysical Research. InTech.

Blöschl, G., Sivapalan, M., 1995. Scale issues in hydrological modelling: A review. Hydrol. Process. 9, 251–290. doi:10.1002/hyp.3360090305

Bockheim, J.G., Gennadiyev, A.N., 2010. Soil-factorial models and earth-system science: A review. Geoderma 159, 243–251. doi:10.1016/j.geoderma.2010.09.005

Bouma, J., McBratney, A., 2013. Framing soils as an actor when dealing with wicked environmental problems. Geoderma 200–201, 130–139. doi:10.1016/j.geoderma.2013.02.011

Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and regression trees. Chapman & Hall, New York.

Breiman, L., 1996. Bagging predictors. Mach. Learn. 24, 123–140.

Breiman, L., 2001. Random Forests. Mach. Learn. 45, 5–32. doi:10.1023/A:1010933404324

Brunsdon, C., Fotheringham, A.S., Charlton, M., 2008. Geographically weighted regression: a method for exploring spatial nonstationarity. Encycl. Geogr. Inf. Sci. 558.

Burgess, T.M., Webster, R., 1980. Optimal interpolation and isarithmic mapping of soil properties. 1. The semi-variogram and punctual kriging. J. Soil Sci. 31, 315–331.

Burrough, P.A., 1986. Principles of geographical information systems for land resources assessment, Monographs on soil and resources survey. Clarendon Press; Oxford University Press, Oxford; New York.

Burrough, P.A., Bouma, J., Yates, S.R., 1994. The state of the art in pedometrics. Geoderma 62, 311–326. doi:10.1016/0016-7061(94)90043-4

Cambardella, C.A., Moorman, T.B., Parkin, T.B., Karlen, D.L., Novak, J.M., Turco, R.F., Konopka, A.E., 1994. Field-scale variability of soil properties in Central Iowa Soils. Soil Sci. Soc. Am. J. 58, 1501. doi:10.2136/sssaj1994.03615995005800050033x

Carré, F., Girard, M.C., 2002. Quantitative mapping of soil types based on regression kriging of taxonomic distances with landform and land cover attributes. Geoderma 110, 241–263. doi:10.1016/S0016-7061(02)00233-1


Chai, X., Shen, C., Yuan, X., Huang, Y., 2008. Spatial prediction of soil organic matter in the presence of different external trends with REML-EBLUP. Geoderma 148, 159–166. doi:10.1016/j.geoderma.2008.09.018

Chaplot, V., Lorentz, S., Podwojewski, P., Jewitt, G., 2010. Digital mapping of A horizon thickness using the correlation between various soil properties and soil apparent electrical resistivity. Geoderma 157, 154–164. doi:10.1016/j.geoderma.2010.04.006

Cheng, L., Leavitt, S.W., Kimball, B.A., Pinter, P.J., Ottman, M.J., Matthias, A., Wall, G.W., Brooks, T., Williams, D.G., Thompson, T.L., 2007. Dynamics of labile and recalcitrant soil carbon pools in a sorghum free-air CO2 enrichment (FACE) agroecosystem. Soil Biol. Biochem. 39, 2250–2263. doi:10.1016/j.soilbio.2007.03.031

Conant, R.T., Ryan, M.G., Ågren, G.I., Birge, H.E., Davidson, E.A., Eliasson, P.E., Evans, S.E., Frey, S.D., Giardina, C.P., Hopkins, F.M., Hyvönen, R., Kirschbaum, M.U.F., Lavallee, J.M., Leifeld, J., Parton, W.J., Megan Steinweg, J., Wallenstein, M.D., Martin Wetterstedt, J.A., Bradford, M.A., 2011. Temperature and soil organic matter decomposition rates - synthesis of current knowledge and a way forward. Glob. Change Biol. 17, 3392–3404. doi:10.1111/j.1365-2486.2011.02496.x

Conant, R.T., Six, J., Paustian, K., 2003. Land use effects on soil carbon fractions in the southeastern United States. I. Management-intensive versus extensive grazing. Biol. Fertil. Soils 38, 386–392. doi:10.1007/s00374-003-0652-z

Cressie, N.A.C., 1993. Statistics for spatial data, Rev. ed. Wiley series in probability and mathematical statistics. Wiley, New York.

Cruz-Cárdenas, G., López-Mata, L., Ortiz-Solorio, C.A., Villaseñor, J.L., Ortiz, E., Silva, J.T., Estrada-Godoy, F., 2014. Interpolation of Mexican soil properties at a scale of 1:1,000,000. Geoderma 213, 29–35. doi:10.1016/j.geoderma.2013.07.014

Dai, F., Zhou, Q., Lv, Z., Wang, X., Liu, G., 2014. Spatial prediction of soil organic matter content integrating artificial neural network and ordinary kriging in Tibetan Plateau. Ecol. Indic. 45, 184–194. doi:10.1016/j.ecolind.2014.04.003

Davidson, E.A., Janssens, I.A., 2006. Temperature sensitivity of soil carbon decomposition and feedbacks to climate change. Nature 440, 165–173. doi:10.1038/nature04514

de Carvalho Junior, W., Lagacherie, P., da Silva Chagas, C., Calderano Filho, B., Bhering, S.B., 2014. A regional-scale assessment of digital mapping of soil attributes in a tropical hillslope environment. Geoderma 232–234, 479–486. doi:10.1016/j.geoderma.2014.06.007


Dlugoß, V., Fiener, P., Schneider, K., 2010. Layer-specific analysis and spatial prediction of soil organic carbon using terrain attributes and erosion modeling. Soil Sci. Soc. Am. J. 74, 922. doi:10.2136/sssaj2009.0325

Doetterl, S., Stevens, A., Six, J., Merckx, R., Van Oost, K., Casanova Pinto, M., Casanova-Katny, A., Muñoz, C., Boudin, M., Zagal Venegas, E., Boeckx, P., 2015. Soil carbon storage controlled by interactions between geochemistry and climate. Nat. Geosci. doi:10.1038/ngeo2516

Doetterl, S., Stevens, A., van Oost, K., Quine, T.A., van Wesemael, B., 2013. Spatially-explicit regional-scale prediction of soil organic carbon stocks in cropland using environmental variables and mixed model approaches. Geoderma 204-205, 31–42. doi:10.1016/j.geoderma.2013.04.007

Douaoui, A.E.K., Nicolas, H., Walter, C., 2006. Detecting salinity hazards within a semiarid context by means of combining soil and remote-sensing data. Geoderma 134, 217–230. doi:10.1016/j.geoderma.2005.10.009

Eberhardt, R.W., Latham, R.E., 2000. Relationships among vegetation, surficial geology and soil water content at the Pocono Mesic Till Barrens. J. Torrey Bot. Soc. 127, 115–124. doi:10.2307/3088689

Efron, B., Tibshirani, R., 1993. An introduction to the bootstrap, Monographs on statistics and applied probability. Chapman & Hall, New York.

Ekschmitt, K., Kandeler, E., Poll, C., Brune, A., Buscot, F., Friedrich, M., Gleixner, G., Hartmann, A., Kästner, M., Marhan, S., Miltner, A., Scheu, S., Wolters, V., 2008. Soil-carbon preservation through habitat constraints and biological limitations on decomposer activity. J. Plant Nutr. Soil Sci. 171, 27–35. doi:10.1002/jpln.200700051

Elliott, E.T., Paustian, K., Frey, S.D., 1996. Modeling the measurable or measuring the modelable: a hierarchical approach to isolating meaningful soil organic matter fractionations, in: Powlson, D.S., Smith, P., Smith, J.U. (Eds.), Evaluation of Soil Organic Matter Models, NATO ASI Series. Springer Berlin Heidelberg, pp. 161–179.

Fissore, C., Giardina, C.P., Kolka, R.K., Trettin, C.C., King, G.M., Jurgensen, M.F., Barton, C.D., Mcdowell, S.D., 2008. Temperature and vegetation effects on soil organic carbon quality along a forested mean annual temperature gradient in North America. Glob. Change Biol. 14, 193–205. doi:10.1111/j.1365-2486.2007.01478.x

Flatman, G.T., Yfantis, A.A., 1984. Geostatistical strategy for soil sampling: the survey and the census. Environ. Monit. Assess. 4, 335–349. doi:10.1007/BF00394172


Florida Fish and Wildlife Conservation Commission (FFWCC), 2003. Florida Vegetation and Land Cover Data Derived from Landsat ETM Imagery. Available at: http://myfwc.com/research/gis/data-maps/terrestrial/fl-vegetation-land-cover/.

Garthwaite, P.H., 1994. An interpretation of partial least squares. J. Am. Stat. Assoc. 89, 122–127. doi:10.2307/2291207

Gessler, P.E., Chadwick, O.A., Chamran, F., Althouse, L., Holmes, K., 2000. Modeling soil–landscape and ecosystem properties using terrain attributes. Soil Sci. Soc. Am. J. 64, 2046. doi:10.2136/sssaj2000.6462046x

Ghani, A., Dexter, M., Perrott, K.W., 2003. Hot-water extractable carbon in soils: a sensitive measurement for determining impacts of fertilisation, grazing and cultivation. Soil Biol. Biochem. 35, 1231–1243. doi:10.1016/S0038-0717(03)00186-X

Glinka, K.D., 1927. Dokuchaiev’s ideas in the development of pedology and cognate sciences. Academy of Science, Leningrad.

Goh, K.M., 2004. Carbon sequestration and stabilization in soils: Implications for soil productivity and climate change. Soil Sci. Plant Nutr. 50, 467–476. doi:10.1080/00380768.2004.10408502

Goovaerts, P., 1997. Geostatistics for natural resources evaluation, Applied geostatistics series. Oxford University Press, New York.

Goovaerts, P., 1999. Using elevation to aid the geostatistical mapping of rainfall erosivity. Catena 34, 227–242.

Goovaerts, P., 2001. Geostatistical modelling of uncertainty in soil science. Geoderma, Estimating uncertainty in soil models 103, 3–26. doi:10.1016/S0016-7061(01)00067-2

Goswami, M., O’Connor, K.M., 2007. Real-time flow forecasting in the absence of quantitative precipitation forecasts: A multi-model approach. J. Hydrol. 334, 125–140. doi:10.1016/j.jhydrol.2006.10.002

Grimm, R., Behrens, T., Märker, M., Elsenbeer, H., 2008. Soil organic carbon concentrations and stocks on Barro Colorado Island — Digital soil mapping using Random Forests analysis. Geoderma 146, 102–113. doi:10.1016/j.geoderma.2008.05.008

Grunwald, S., 2006. What do we really know about the space-time continuum of soil-landscapes? In: Grunwald, S. (Ed.), Environmental soil-landscape modeling: geographic information technologies and pedometrics. CRC Press, Boca Raton, FL, pp. 3–36.


Grunwald, S., 2009. Multi-criteria characterization of recent digital soil mapping and modeling approaches. Geoderma 152, 195–207. doi:10.1016/j.geoderma.2009.06.003

Grunwald, S., McBratney, A.B., Thompson, J.A., Minasny, B., Boettinger, J.L., 2016. Digital Soil Mapping in a changing world. Comput. Ethics Multicult. Approach 301.

Grunwald, S., Thompson, J.A., Boettinger, J.L., 2011. Digital Soil Mapping and Modeling at Continental Scales: Finding Solutions for Global Issues. Soil Sci. Soc. Am. J. 75, 1201. doi:10.2136/sssaj2011.0025

Grunwald, S., Yu, C., Xiong, X., 2014. Transferability and scaling of soil total carbon prediction models in Florida. PeerJ PrePrints.

Guo, L.B., Gifford, R.M., 2002. Soil carbon stocks and land use change: a meta analysis. Glob. Change Biol. 8, 345–360. doi:10.1046/j.1354-1013.2002.00486.x

Guo, P.T., Li, M.F., Luo, W., Tang, Q.F., Liu, Z.W., Lin, Z.M., 2015. Digital mapping of soil organic matter for rubber plantation at regional scale: An application of random forest plus residuals kriging approach. Geoderma 237–238, 49–59. doi:10.1016/j.geoderma.2014.08.009

Guo, Y., Amundson, R., Gong, P., Yu, Q., 2006. Quantity and Spatial Variability of Soil Carbon in the Conterminous United States. Soil Sci. Soc. Am. J. 70, 590. doi:10.2136/sssaj2005.0162

Haberlandt, U., 2007. Geostatistical interpolation of hourly precipitation from rain gauges and radar for a large-scale extreme rainfall event. J. Hydrol. 332, 144–157. doi:10.1016/j.jhydrol.2006.06.028

Hartemink, A.E., Hempel, J., Lagacherie, P., McBratney, A., McKenzie, N., MacMillan, R.A., Minasny, B., Montanarella, L., Santos, M.L. de M., Sanchez, P., Walsh, M., Zhang, G.-L., 2010. GlobalSoilMap.net – A New Digital Soil Map of the World, in: Boettinger, J.L., Howell, D.W., Moore, A.C., Hartemink, A.E., Kienast-Brown, S. (Eds.), Digital Soil Mapping, Progress in Soil Science. Springer Netherlands, pp. 423–428.

Hartemink, A.E., McBratney, A., 2008. A soil science renaissance. Geoderma 148, 123–129. doi:10.1016/j.geoderma.2008.10.006

Hassink, J., 1997. The capacity of soils to preserve organic C and N by their association with clay and silt particles. Plant Soil 191, 77–87. doi:10.1023/A:1004213929699

Hastie, T., Tibshirani, R., Friedman, J., 2009. The Elements of Statistical Learning, Springer Series in Statistics. Springer New York, New York, NY.


Haynes, R.J., 2005. Labile organic matter fractions as central components of the quality of agricultural soils: an overview, in: Advances in Agronomy. Academic Press, pp. 221–268.

Hengl, T., de Jesus, J.M., MacMillan, R.A., Batjes, N.H., Heuvelink, G.B.M., Ribeiro, E., Samuel-Rosa, A., Kempen, B., Leenaars, J.G.B., Walsh, M.G., Gonzalez, M.R., 2014. SoilGrids1km — Global soil information based on automated mapping. PLoS ONE 9, e105992. doi:10.1371/journal.pone.0105992

Hengl, T., Heuvelink, G.B.M., Rossiter, D.G., 2007a. About regression-kriging: From equations to case studies. Comput. Geosci. 33, 1301–1315. doi:10.1016/j.cageo.2007.05.001

Hengl, T., Heuvelink, G.B.M., Stein, A., 2004. A generic framework for spatial prediction of soil variables based on regression-kriging. Geoderma 120, 75–93. doi:10.1016/j.geoderma.2003.08.018

Hengl, T., Toomanian, N., Reuter, H.I., Malakouti, M.J., 2007b. Methods to interpolate soil categorical variables from profile observations: Lessons from Iran. Geoderma, Pedometrics 2005 140, 417–427. doi:10.1016/j.geoderma.2007.04.022

Herbst, M., Diekkrüger, B., Vereecken, H., 2006. Geostatistical co-regionalization of soil hydraulic properties in a micro-scale catchment using terrain attributes. Geoderma 132, 206–221. doi:10.1016/j.geoderma.2005.05.008

Hernández, N., Kiralj, R., Ferreira, M.M.C., Talavera, I., 2009. Critical comparative analysis, validation and interpretation of SVM and PLS regression models in a QSAR study on HIV-1 protease inhibitors. Chemom. Intell. Lab. Syst. 98, 65–77. doi:10.1016/j.chemolab.2009.04.012

Heuvelink, G.B.M., Webster, R., 2001. Modelling soil variation: past, present, and future. Geoderma, Developments and Trends in Soil Science 100, 269–301. doi:10.1016/S0016-7061(01)00025-8

Hoeting, J.A., Madigan, D., Raftery, A.E., Volinsky, C.T., 1999. Bayesian model averaging: a tutorial (with comments by M. Clyde, David Draper and E. I. George, and a rejoinder by the authors). Stat. Sci. 14, 382–417. doi:10.1214/ss/1009212519

Holden, P.A., Fierer, N., 2005. Microbial processes in the vadose zone. Vadose Zone J. 4, 1–21.

Hornik, K., Meyer, D., Karatzoglou, A., 2006. Support vector machines in R. J. Stat. Softw. 15, 1–28.


Hu, K., Wang, S., Li, H., Huang, F., Li, B., 2014. Spatial scaling effects on variability of soil organic matter and total nitrogen in suburban Beijing. Geoderma 226–227, 54–63. doi:10.1016/j.geoderma.2014.03.001

Hudson, B.D., 1992. The soil survey as paradigm-based science. Soil Sci. Soc. Am. J. 56, 836–841.

Six, J., Conant, R.T., Paul, E.A., Paustian, K., 2002. Stabilization mechanisms of soil organic matter: Implications for C-saturation of soils. Plant Soil 241, 155–176. doi:10.1023/A:1016125726789

James, G., Witten, D., Hastie, T., Tibshirani, R., 2013. An Introduction to Statistical Learning: with Applications in R. Springer Science & Business Media.

Janzen, H.H., 2004. Carbon cycling in earth systems - a soil science perspective. Agric. Ecosyst. Environ. 104, 399–417. doi:10.1016/j.agee.2004.01.040

Jastrow, J.D., Miller, R.M., 1998. Soil aggregate stabilization and carbon sequestration: Feedbacks through organomineral associations, in: Lal, R., Kimble, J.M., Follett, R.F., Stewart, B.A. (Eds.), Soil Processes and the Carbon Cycle. CRC Press, Boca Raton, FL, pp. 207–223.

Jenny, H., 1941. Factors of soil formation. McGraw-Hill Book Company New York, NY.

Jobbágy, E.G., Jackson, R.B., 2000. The vertical distribution of soil organic carbon and its relation to climate and vegetation. Ecol. Appl. 10, 423–436.

Karatzoglou, A., Smola, A., Hornik, K., 2007. The kernlab package. Compr. R Arch. Netw.

Karunaratne, S.B., Bishop, T.F.A., Baldock, J.A., Odeh, I.O.A., 2014. Catchment scale mapping of measureable soil organic carbon fractions. Geoderma 219–220, 14–23. doi:10.1016/j.geoderma.2013.12.005

Kautz, R., Stys, B., Kawula, R., 2007. Florida vegetation 2003 and land use change between 1985–89 and 2003. Fla. Sci. 70, 12.

Kerry, R., Oliver, M.A., 2007. Comparing sampling needs for variograms of soil properties computed by the method of moments and residual maximum likelihood. Geoderma, Pedometrics 2005 140, 383–396. doi:10.1016/j.geoderma.2007.04.019

Kleber, M., Nico, P.S., Plante, A., Filley, T., Kramer, M., Swanston, C., Sollins, P., 2011. Old and stable soil organic matter is not necessarily chemically recalcitrant: implications for modeling concepts and temperature sensitivity: slow turnover of labile soil organic matter. Glob. Change Biol. 17, 1097–1107. doi:10.1111/j.1365-2486.2010.02278.x


Knotters, M., Brus, D.J., Oude Voshaar, J.H., 1995. A comparison of kriging, co-kriging and kriging combined with regression for spatial interpolation of horizon depth with censored observations. Geoderma 67, 227–246. doi:10.1016/0016-7061(95)00011-C

Knox, N.M., Grunwald, S., McDowell, M.L., Bruland, G.L., Myers, D.B., Harris, W.G., 2015. Modelling soil carbon fractions with visible near-infrared (VNIR) and mid-infrared (MIR) spectroscopy. Geoderma 239–240, 229–239. doi:10.1016/j.geoderma.2014.10.019

Kravchenko, A.N., 2003. Influence of spatial structure on accuracy of interpolation methods. Soil Sci. Soc. Am. J. 67, 1564. doi:10.2136/sssaj2003.1564

Kravchenko, A.N., Robertson, G.P., 2007. Can topographical and yield data substantially improve total soil carbon mapping by regression kriging? Agron. J. 99, 12. doi:10.2134/agronj2005.0251

Kuhn, M., Johnson, K., 2013. Applied predictive modeling. Springer New York, New York, NY.

Kumar, S., Lal, R., Liu, D., 2012. A geographically weighted regression kriging approach for mapping soil organic carbon stock. Geoderma 189–190, 627–634. doi:10.1016/j.geoderma.2012.05.022

Kuriakose, S.L., Devkota, S., Rossiter, D.G., Jetten, V.G., 2009. Prediction of soil depth using environmental variables in an anthropogenic landscape, a case study in the Western Ghats of Kerala, India. CATENA 79, 27–38. doi:10.1016/j.catena.2009.05.005

Kursa, M.B., Rudnicki, W.R., 2010. Feature selection with the Boruta package. J. Stat. Softw. 36, 1–13.

Lado, L.R., Hengl, T., Reuter, H.I., 2008. Heavy metals in European soils: A geostatistical analysis of the FOREGS Geochemical database. Geoderma 148, 189–199. doi:10.1016/j.geoderma.2008.09.020

Lal, R., 2004. Soil carbon sequestration impacts on global climate change and food security. Science 304, 1623–1627. doi:10.1126/science.1097396

Lamsal, S., Grunwald, S., Bruland, G.L., Bliss, C.M., Comerford, N.B., 2006. Regional hybrid geospatial modeling of soil nitrate–nitrogen in the Santa Fe River Watershed. Geoderma 135, 233–247. doi:10.1016/j.geoderma.2005.12.009

Lange, M., Eisenhauer, N., Sierra, C.A., Bessler, H., Engels, C., Griffiths, R.I., Mellado-Vázquez, P.G., Malik, A.A., Roy, J., Scheu, S., Steinbeiss, S., Thomson, B.C., Trumbore, S.E., Gleixner, G., 2015. Plant diversity increases soil microbial activity and soil carbon storage. Nat. Commun. 6, 6707. doi:10.1038/ncomms7707


Lark, R.M., 1999. Soil–landform relationships at within-field scales: an investigation using continuous classification. Geoderma 92, 141–165. doi:10.1016/S0016-7061(99)00028-2

Lark, R.M., 2012. Towards soil geostatistics. Spat. Stat. 1, 92–99. doi:10.1016/j.spasta.2012.02.001

Lark, R.M., Cullis, B.R., 2004. Model-based analysis using REML for inference from systematically sampled data on soil. Eur. J. Soil Sci. 55, 799–813. doi:10.1111/j.1365-2389.2004.00637.x

Lark, R.M., Cullis, B.R., Welham, S.J., 2006. On spatial prediction of soil properties in the presence of a spatial trend: the empirical best linear unbiased predictor (E-BLUP) with REML. Eur. J. Soil Sci. 57, 787–799. doi:10.1111/j.1365-2389.2005.00768.x

Lark, R.M., Webster, R., 2006. Geostatistical mapping of geomorphic variables in the presence of trend. Earth Surf. Process. Landf. 31, 862–874. doi:10.1002/esp.1296

Lawrence, C.R., Harden, J.W., Xu, X., Schulz, M.S., Trumbore, S.E., 2015. Long-term controls on soil organic carbon with depth and time: A case study from the Cowlitz River Chronosequence, WA USA. Geoderma 247–248, 73–87. doi:10.1016/j.geoderma.2015.02.005

Lawrence, R., 2004. Classification of remotely sensed imagery using stochastic gradient boosting as a refinement of classification tree analysis. Remote Sens. Environ. 90, 331–336. doi:10.1016/j.rse.2004.01.007

Leinweber, P., Schulten, H.-R., Körschens, M., 1995. Hot water extracted organic matter: chemical composition and temporal variations in a long-term field experiment. Biol. Fertil. Soils 20, 17–23. doi:10.1007/BF00307836

Leopold, U., Heuvelink, G.B.M., Tiktak, A., Finke, P.A., Schoumans, O., 2006. Accounting for change of support in spatial accuracy assessment of modelled soil mineral phosphorous concentration. Geoderma 130, 368–386. doi:10.1016/j.geoderma.2005.02.008

Levi, M.R., Rasmussen, C., 2014. Covariate selection with iterative principal component analysis for predicting physical soil properties. Geoderma 219–220, 46–57. doi:10.1016/j.geoderma.2013.12.013

Li, J., Heap, A.D., 2011a. A review of comparative studies of spatial interpolation methods in environmental sciences: Performance and impact factors. Ecol. Inform. 6, 228–241. doi:10.1016/j.ecoinf.2010.12.003

Li, J., Heap, A.D., Potter, A., Daniell, J.J., 2011b. Application of machine learning methods to spatial interpolation of environmental variables. Environ. Model. Softw. 26, 1647–1659. doi:10.1016/j.envsoft.2011.07.004

Li, Q., Yue, T., Wang, C., Zhang, W., Yu, Y., Li, B., Yang, J., Bai, G., 2013. Spatially distributed modeling of soil organic matter across China: An application of artificial neural network approach. CATENA 104, 210–218. doi:10.1016/j.catena.2012.11.012

Li, Y., 2010. Can the spatial prediction of soil organic matter contents at various sampling scales be improved by using regression kriging with auxiliary information? Geoderma 159, 63–75. doi:10.1016/j.geoderma.2010.06.017

Liaw, A., Wiener, M., 2002. Classification and regression by randomForest. R News 2, 18–22. Available at: http://www.r-project.org/doc/Rnews/Rnews_2002-3.pdf.

Lin, H., 2010. Earth’s Critical Zone and hydropedology: concepts, characteristics, and advances. Hydrol Earth Syst Sci 14, 25–45. doi:10.5194/hess-14-25-2010

Lin, H., 2012. Hydropedology, in: Hydropedology. Elsevier, pp. 3–39.

Lin, H., Wheeler, D., Bell, J., Wilding, L., 2005. Assessment of soil spatial variability at multiple scales. Ecol. Model., Scaling, fractals and diversity in soils and ecohydrology 182, 271–290. doi:10.1016/j.ecolmodel.2004.04.006

Lin, Y.P., Cheng, B.Y., Chu, H.J., Chang, T.K., Yu, H.L., 2011. Assessing how heavy metal pollution and human activity are related by using logistic regression and kriging methods. Geoderma 163, 275–282. doi:10.1016/j.geoderma.2011.05.004

Liu, H., Motoda, H., 2012. Feature Selection for Knowledge Discovery and Data Mining. Springer Science & Business Media.

Lützow, M. v., Kögel-Knabner, I., Ekschmitt, K., Matzner, E., Guggenberger, G., Marschner, B., Flessa, H., 2006. Stabilization of organic matter in temperate soils: mechanisms and their relevance under different soil conditions - a review. Eur. J. Soil Sci. 57, 426–445. doi:10.1111/j.1365-2389.2006.00809.x

Malone, B.P., Minasny, B., Odgers, N.P., McBratney, A.B., 2014. Using model averaging to combine soil property rasters from legacy soil maps and from point data. Geoderma 232–234, 34–44. doi:10.1016/j.geoderma.2014.04.033

Marschner, B., Brodowski, S., Dreves, A., Gleixner, G., Gude, A., Grootes, P.M., Hamer, U., Heim, A., Jandl, G., Ji, R., Kaiser, K., Kalbitz, K., Kramer, C., Leinweber, P., Rethemeyer, J., Schäffer, A., Schmidt, M.W.I., Schwark, L., Wiesenberg, G.L.B., 2008. How relevant is recalcitrance for the stabilization of organic matter in soils? J. Plant Nutr. Soil Sci. 171, 91–110. doi:10.1002/jpln.200700049

Martin, M.P., Orton, T.G., Lacarce, E., Meersmans, J., Saby, N.P.A., Paroissien, J.B., Jolivet, C., Boulonne, L., Arrouays, D., 2014. Evaluation of modelling approaches for predicting the spatial distribution of soil organic carbon stocks at the national scale. Geoderma 223–225, 97–107. doi:10.1016/j.geoderma.2014.01.005

Martin, M.P., Wattenbach, M., Smith, P., Meersmans, J., Jolivet, C., Boulonne, L., Arrouays, D., 2011. Spatial distribution of soil organic carbon stocks in France. Biogeosciences 8, 1053–1065. doi:10.5194/bg-8-1053-2011

Matheron, G., 1971. The theory of regionalized variables and its applications.

McBratney, A., 1992. On variation, uncertainty and informatics in environmental soil management. Soil Res. 30, 913–935.

McBratney, A., Mendonça Santos, M., Minasny, B., 2003. On digital soil mapping. Geoderma 117, 3–52. doi:10.1016/S0016-7061(03)00223-4

McBratney, A.B., 1998. Some considerations on methods for spatially aggregating and disaggregating soil information, in: Soil and Water Quality at Different Scales. Springer, pp. 51–62.

McBratney, A.B., Odeh, I.O.A., Bishop, T.F.A., Dunbar, M.S., Shatar, T.M., 2000. An overview of pedometric techniques for use in soil survey. Geoderma 97, 293–327. doi:10.1016/S0016-7061(00)00043-4

McKenzie, N.J., Ryan, P.J., 1999. Spatial prediction of soil properties using environmental correlation. Geoderma 89, 67–94. doi:10.1016/S0016-7061(98)00137-2

McSweeney, K., Slater, B.K., David Hammer, R., Bell, J.C., Gessler, P.E., Petersen, G.W., 1994. Towards a new framework for modeling the soil-landscape continuum, in: SSSA Special Publication. Soil Science Society of America.

Meersmans, J., De Ridder, F., Canters, F., De Baets, S., Van Molle, M., 2008. A multiple regression approach to assess the spatial distribution of Soil Organic Carbon (SOC) at the regional scale (Flanders, Belgium). Geoderma 143, 1–13. doi:10.1016/j.geoderma.2007.08.025

Melillo, J.M., Aber, J.D., Linkins, A.E., Ricca, A., Fry, B., Nadelhoffer, K.J., 1989. Carbon and nitrogen dynamics along the decay continuum: plant litter to soil organic matter, in: Ecology of Arable Land—Perspectives and Challenges. Springer, pp. 53–62.

Merow, C., Smith, M.J., Edwards, T.C., Guisan, A., McMahon, S.M., Normand, S., Thuiller, W., Wüest, R.O., Zimmermann, N.E., Elith, J., 2014. What do we gain from simplicity versus complexity in species distribution models? Ecography 37, 1267–1281. doi:10.1111/ecog.00845

Michaletz, S.T., Cheng, D., Kerkhoff, A.J., Enquist, B.J., 2014. Convergence of terrestrial plant production across global climate gradients. Nature. doi:10.1038/nature13470

Miller, B.A., Koszinski, S., Wehrhan, M., Sommer, M., 2015. Impact of multi-scale predictor selection for modeling soil properties. Geoderma 239–240, 97–106. doi:10.1016/j.geoderma.2014.09.018

Milne, E., Powlson, D.S., Cerri, C.E., 2007. Soil carbon stocks at regional scales. Agric. Ecosyst. Environ., Soil carbon stocks at regional scales Assessment of Soil Organic Carbon Stocks and Change at National Scale, Final Project Presentation, The United Nations Environment Programme, Nairobi, Kenya, 23-24 May 2005 122, 1–2. doi:10.1016/j.agee.2007.01.001

Minasny, B., McBratney, A.B., 2005. The Matérn function as a general model for soil variograms. Geoderma, Pedometrics 2003 128, 192–207. doi:10.1016/j.geoderma.2005.04.003

Minasny, B., McBratney, A.B., 2006. A conditioned Latin hypercube method for sampling in the presence of ancillary information. Comput. Geosci. 32, 1378–1388. doi:10.1016/j.cageo.2005.12.009

Minasny, B., McBratney, A.B., 2007. Spatial prediction of soil properties using EBLUP with the Matérn covariance function. Geoderma, Pedometrics 2005 140, 324–336. doi:10.1016/j.geoderma.2007.04.028

Minasny, B., McBratney, A.B., 2015. Digital soil mapping: A brief history and some lessons. Geoderma. doi:10.1016/j.geoderma.2015.07.017

Minasny, B., McBratney, A.B., Malone, B.P., Wheeler, I., 2013. Digital mapping of soil Carbon, in: Advances in Agronomy. Elsevier, pp. 1–47.

Minasny, B., McBratney, A.B., Salvador-Blanes, S., 2008. Quantitative models for pedogenesis — A review. Geoderma, Antarctic Soils and Soil Forming Processes in a Changing Environment 144, 140–157. doi:10.1016/j.geoderma.2007.12.013

Mishra, U., Lal, R., Liu, D., Van Meirvenne, M., 2010. Predicting the spatial variation of the soil organic carbon pool at a regional scale. Soil Sci. Soc. Am. J. 74, 906. doi:10.2136/sssaj2009.0158

Mishra, U., Torn, M.S., Masanet, E., Ogle, S.M., 2012. Improving regional soil carbon inventories: Combining the IPCC carbon inventory method with regression kriging. Geoderma 189–190, 288–295. doi:10.1016/j.geoderma.2012.06.022

Moore, I.D., Gessler, P.E., Nielsen, G.A., Peterson, G.A., 1993. Soil attribute prediction using terrain analysis. Soil Sci. Soc. Am. J. 57, 443–452.

Mora-Vallejo, A., Claessens, L., Stoorvogel, J., Heuvelink, G.B.M., 2008. Small scale digital soil mapping in Southeastern Kenya. CATENA 76, 44–53. doi:10.1016/j.catena.2008.09.008

Mulkey, S., Alavalapati, J., Hodges, A., Wilkie, A.C., Grunwald, S., 2008. Opportunities for greenhouse gas reduction through forestry and agriculture in Florida. Univ. Fla. Sch. Nat. Resour. Retrieved January 20, 2008.

Mulla, D.J., McBratney, A.B., 2002. Soil spatial variability, in: Warrick, A.W. (Ed.), Soil Physics Companion. CRC Press, Boca Raton.

National Climatic Data Center (NCDC), National Oceanic and Atmospheric Administration (NOAA), 2008. Monthly Surface Data. Available at: http://www.ncdc.noaa.gov.

Natural Resources Conservation Service (NRCS), U.S. Department of Agriculture, 2006. Soil Survey Geographic Database (SSURGO). Available at: http://www.nrcs.usda.gov/wps/portal/nrcs/main/soils/.



Niang, M.A., Nolin, M.C., Jégo, G., Perron, I., 2014. Digital mapping of soil texture using RADARSAT-2 polarimetric synthetic aperture radar data. Soil Sci. Soc. Am. J. 78, 673. doi:10.2136/sssaj2013.07.0307

Oades, J.M., 1988. The retention of organic matter in soils. Biogeochemistry 5, 35–70. doi:10.1007/BF02180317

Odeh, I.O.A., McBratney, A.B., 2000. Using AVHRR images for spatial prediction of clay content in the lower Namoi Valley of eastern Australia. Geoderma 97, 237–254. doi:10.1016/S0016-7061(00)00041-0

Odeh, I.O.A., McBratney, A.B., Chittleborough, D.J., 1995. Further results on prediction of soil properties from terrain attributes: heterotopic cokriging and regression-kriging. Geoderma 67, 215–226. doi:10.1016/0016-7061(95)00007-B

Odeh, I.O.A., McBratney, A.B., Chittleborough, D.J., 1994. Spatial prediction of soil properties from landform attributes derived from a digital elevation model. Geoderma 63, 197–214. doi:10.1016/0016-7061(94)90063-9

Odgers, N.P., McBratney, A.B., Minasny, B., 2015. Digital soil property mapping and uncertainty estimation using soil class probability rasters. Geoderma 237–238, 190–198. doi:10.1016/j.geoderma.2014.09.009

Odgers, N.P., Sun, W., McBratney, A.B., Minasny, B., Clifford, D., 2014. Disaggregating and harmonising soil map units through resampled classification trees. Geoderma 214–215, 91–100. doi:10.1016/j.geoderma.2013.09.024

Oliver, M.A., 1987. Geostatistics and its application to soil science. Soil Use Manag. 3, 8–20. doi:10.1111/j.1475-2743.1987.tb00703.x

Oliver, M.A., Webster, R., 2014. A tutorial guide to geostatistics: Computing and modelling variograms and kriging. CATENA 113, 56–69. doi:10.1016/j.catena.2013.09.006

Parton, W.J., Schimel, D.S., Cole, C.V., Ojima, D.S., 1987a. Division s-3-soil microbiology and biochemistry. Soil Sci Soc Am J 51, 1173–1179.

Parton, W.J., Schimel, D.S., Cole, C.V., Ojima, D.S., 1987b. Analysis of factors controlling soil organic matter levels in Great Plains grasslands. Soil Sci. Soc. Am. J. 51, 1173. doi:10.2136/sssaj1987.03615995005100050015x

Pebesma, E.J., 2004. Multivariable geostatistics in S: the gstat package. Comput. Geosci. 30, 683–691. doi:10.1016/j.cageo.2004.03.012

Percival, H.J., Parfitt, R.L., Scott, N.A., 2000. Factors controlling soil carbon levels in New Zealand grasslands: Is clay content important? Soil Sci. Soc. Am. J. 64, 1623–1630.

Peters, A., Hothorn, T., Ripley, B.D., Therneau, T., Atkinson, B., Hothorn, M.T., 2015. Package “ipred.”

Poggio, L., Gimona, A., 2014. National scale 3D modelling of soil organic carbon stocks with uncertainty propagation — An example from Scotland. Geoderma 232–234, 284–299. doi:10.1016/j.geoderma.2014.05.004

Poggio, L., Gimona, A., Brown, I., Castellazzi, M., 2010. Soil available water capacity interpolation and spatial uncertainty modelling at multiple geographical extents. Geoderma 160, 175–188. doi:10.1016/j.geoderma.2010.09.015

Prasad, A.M., Iverson, L.R., Liaw, A., 2006. Newer classification and regression tree techniques: bagging and random forests for ecological prediction. Ecosystems 9, 181–199. doi:10.1007/s10021-005-0054-1

R Core Team, 2015. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Available at: http://www.R-project.org/.

Raftery, A.E., Gneiting, T., Balabdaoui, F., Polakowski, M., 2005. Using Bayesian model averaging to calibrate forecast ensembles. Mon. Weather Rev. 133, 1155–1174.

Rawlins, B.G., Henrys, P., Breward, N., Robinson, D.A., Keith, A.M., Garcia-Bajo, M., 2011. The importance of inorganic carbon in soil carbon databases and stock estimates: a case study from England. Soil Use Manag. 27, 312–320. doi:10.1111/j.1475-2743.2011.00348.x

Richter, D. deB., Bacon, A.R., Mobley, M.L., Richardson, C.J., Andrews, S.S., West, L., Wills, S., Billings, S., Cambardella, C.A., Cavallaro, N., DeMeester, J.E., Franzluebbers, A.J., Grandy, A.S., Grunwald, S., Gruver, J., Hartshorn, A.S., Janzen, H., Kramer, M.G., Ladha, J.K., Lajtha, K., Liles, G.C., Markewitz, D., Megonigal, P.J., Mermut, A.R., Rasmussen, C., Robinson, D.A., Smith, P., Stiles, C.A., Tate, R.L., Thompson, A., Tugel, A.J., van Es, H., Yaalon, D., Zobeck, T.M., 2011. Human–soil relations are changing rapidly: Proposals from SSSA’s Cross-Divisional Soil Change Working Group. Soil Sci. Soc. Am. J. 75, 2079. doi:10.2136/sssaj2011.0124

Ridgeway, G., Ridgeway, M.G., 2004. The gbm package. R Found. Stat. Comput. Vienna Austria.

Rivero, R.G., Grunwald, S., Bruland, G.L., 2007. Incorporation of spectral data into multivariate geostatistical models to map soil phosphorus variability in a Florida wetland. Geoderma, Pedometrics 2005 140, 428–443. doi:10.1016/j.geoderma.2007.04.026

Rodríguez-Lado, L., Martínez-Cortizas, A., 2015. Modelling and mapping organic carbon content of topsoils in an Atlantic area of southwestern Europe (Galicia, NW-Spain). Geoderma 245–246, 65–73. doi:10.1016/j.geoderma.2015.01.015

Roger, A., Libohova, Z., Rossier, N., Joost, S., Maltas, A., Frossard, E., Sinaj, S., 2014. Spatial variability of soil phosphorus in the Fribourg canton, Switzerland. Geoderma 217–218, 26–36. doi:10.1016/j.geoderma.2013.11.001

Rossiter, D.G., 2012. Applied geostatistics Exercise 3: Modelling spatial structure from point samples.

Rumpel, C., Kögel-Knabner, I., 2010. Deep soil organic matter—a key but poorly understood component of terrestrial C cycle. Plant Soil 338, 143–158. doi:10.1007/s11104-010-0391-5

Ryan, P.J., McKenzie, N.J., O’Connell, D., Loughhead, A.N., Leppert, P.M., Jacquier, D., Ashton, L., 2000. Integrating forest soils information across scales: spatial prediction of soil properties under Australian forests. For. Ecol. Manag. 138, 139–157. doi:10.1016/S0378-1127(00)00393-5

Schimel, D., Stillwell, M.A., Woodmansee, R.G., 1985. Biogeochemistry of C, N, and P in a Soil Catena of the Shortgrass Steppe. Ecology 66, 276–282. doi:10.2307/1941328

Schimel, D.S., Braswell, B.H., Holland, E.A., McKeown, R., Ojima, D.S., Painter, T.H., Parton, W.J., Townsend, A.R., 1994. Climatic, edaphic, and biotic controls over storage and turnover of carbon in soils. Glob. Biogeochem. Cycles 8, 279–293. doi:10.1029/94GB00993

Schmidt, M.W.I., Torn, M.S., Abiven, S., Dittmar, T., Guggenberger, G., Janssens, I.A., Kleber, M., Kögel-Knabner, I., Lehmann, J., Manning, D.A.C., Nannipieri, P., Rasse, D.P., Weiner, S., Trumbore, S.E., 2011. Persistence of soil organic matter as an ecosystem property. Nature 478, 49–56. doi:10.1038/nature10386

Shi, W., Liu, J., Du, Z., Stein, A., Yue, T., 2011. Surface modelling of soil properties based on land use information. Geoderma 162, 347–357. doi:10.1016/j.geoderma.2011.03.007

Simbahan, G.C., Dobermann, A., Goovaerts, P., Ping, J., Haddix, M.L., 2006. Fine-resolution mapping of soil organic carbon based on multivariate secondary data. Geoderma 132, 471–489. doi:10.1016/j.geoderma.2005.07.001

Smith, P., Fang, C., Dawson, J.J.C., Moncrieff, J.B., 2008. Impact of global warming on soil organic carbon, in: Advances in Agronomy. Academic Press, pp. 1–43.

Smola, A.J., Schölkopf, B., 2004. A tutorial on support vector regression. Stat. Comput. 14, 199–222. doi:10.1023/B:STCO.0000035301.49549.88

Sollins, P., Homann, P., Caldwell, B.A., 1996. Stabilization and destabilization of soil organic matter: mechanisms and controls. Geoderma 74, 65–105. doi:10.1016/S0016-7061(96)00036-5

Stacey, K.F., Lark, R.M., Whitmore, A.P., Milne, A.E., 2006. Using a process model and regression kriging to improve predictions of nitrous oxide emissions from soil. Geoderma 135, 107–117. doi:10.1016/j.geoderma.2005.11.008

Steffen, W., Grinevald, J., Crutzen, P., McNeill, J., 2011. The Anthropocene: conceptual and historical perspectives. Philos. Trans. R. Soc. Lond. Math. Phys. Eng. Sci. 369, 842–867. doi:10.1098/rsta.2010.0327

Stein, M.L., 1999. Interpolation of spatial data, Springer Series in Statistics. Springer New York, New York, NY.

Stone, E.L., Harris, W.G., Brown, R.B., Kuehl, R.J., 1993. Carbon storage in Florida Spodosols. Soil Sci. Soc. Am. J. 57, 179. doi:10.2136/sssaj1993.03615995005700010032x

Stoorvogel, J.J., Kempen, B., Heuvelink, G.B.M., de Bruin, S., 2009. Implementation and evaluation of existing knowledge for digital soil mapping in Senegal. Geoderma 149, 161–170. doi:10.1016/j.geoderma.2008.11.039

Sun, W., Minasny, B., McBratney, A., 2012. Analysis and prediction of soil properties using local regression-kriging. Geoderma, Entering the Digital Era: Special Issue of Pedometrics 2009, Beijing 171–172, 16–23. doi:10.1016/j.geoderma.2011.02.010

Takagi, K., Lin, H.S., 2012. Changing controls of soil moisture spatial organization in the Shale Hills Catchment. Geoderma 173–174, 289–302. doi:10.1016/j.geoderma.2011.11.003

Therneau, T., Atkinson, B., Ripley, B., Ripley, M.B., 2015. Package “rpart”. Version.

Thompson, J.A., Roecker, S., Grunwald, S., Owens, P.R., 2012. Digital soil mapping, in: Hydropedology. Elsevier, pp. 665–709.

Thomsen, I.K., Schjønning, P., Olesen, J.E., Christensen, B.T., 2003. C and N turnover in structurally intact soils of different texture. Soil Biol. Biochem. 35, 765–774. doi:10.1016/S0038-0717(03)00093-2

Torn, M.S., Trumbore, S.E., Chadwick, O.A., Vitousek, P.M., Hendricks, D.M., 1997. Mineral control of soil organic carbon storage and turnover. Nature 389, 170–173. doi:10.1038/38260

Totsche, K.U., Rennert, T., Gerzabek, M.H., Kögel-Knabner, I., Smalla, K., Spiteller, M., Vogel, H.-J., 2010. Biogeochemical interfaces in soil: The interdisciplinary challenge for soil science. J. Plant Nutr. Soil Sci. 173, 88–99. doi:10.1002/jpln.200900105

Triantafilis, J., Odeh, I.O.A., McBratney, A.B., 2001. Five geostatistical models to predict soil salinity from electromagnetic induction data across irrigated cotton. Soil Sci. Soc. Am. J. 65, 869–878.

Umali, B.P., Oliver, D.P., Forrester, S., Chittleborough, D.J., Hutson, J.L., Kookana, R.S., Ostendorf, B., 2012. The effect of terrain and management on the spatial variability of soil properties in an apple orchard. CATENA 93, 38–48. doi:10.1016/j.catena.2012.01.010

United States Census Bureau, 2000. The Boundary of the State of Florida. Available at: http://www.census.gov/geo/www/cob/cbf_state.html.

United States Geological Survey (USGS), 1999. National Elevation Dataset (NED). Available at: http://ned.usgs.gov/.

United States Census Bureau, 2015. Population estimates. Available at: https://www.census.gov/newsroom/press-releases/2014/cb14-232.html.

Vanwalleghem, T., Poesen, J., McBratney, A., Deckers, J., 2010. Spatial variability of soil horizon depth in natural loess-derived soils. Geoderma 157, 37–45. doi:10.1016/j.geoderma.2010.03.013

Vapnik, V.N., 1998. Statistical Learning Theory. New York.

Vasenev, V.I., Stoorvogel, J.J., Vasenev, I.I., Valentini, R., 2014. How to map soil organic carbon stocks in highly urbanized regions? Geoderma 226–227, 103–115. doi:10.1016/j.geoderma.2014.03.007

Vasques, G.M., Grunwald, S., Comerford, N.B., Sickman, J.O., 2010. Regional modelling of soil carbon at multiple depths within a subtropical watershed. Geoderma 156, 326–336. doi:10.1016/j.geoderma.2010.03.002

Vasques, G.M., Grunwald, S., Myers, D.B., 2012. Associations between soil carbon and ecological landscape variables at escalating spatial scales in Florida, USA. Landsc. Ecol. 27, 355–367. doi:10.1007/s10980-011-9702-3

Vasques, G.M., Grunwald, S., Sickman, J.O., 2008. Comparison of multivariate methods for inferential modeling of soil carbon using visible/near-infrared spectra. Geoderma 146, 14–25. doi:10.1016/j.geoderma.2008.04.007

Vasques, G.M., Grunwald, S., Sickman, J.O., Comerford, N.B., 2010. Upscaling of dynamic soil organic carbon pools in a north-central Florida watershed. Soil Sci. Soc. Am. J. 74, 870. doi:10.2136/sssaj2009.0242

Veldkamp, E., Becker, A., Schwendenmann, L., Clark, D.A., Schulte-Bisping, H., 2003. Substantial labile carbon stocks and microbial activity in deeply weathered soils below a tropical wet forest. Glob. Change Biol. 9, 1171–1184. doi:10.1046/j.1365-2486.2003.00656.x

Wackernagel, H., 2003. Multivariate geostatistics: an introduction with applications. Springer, Berlin; New York.

Wallis, J.R., 1965. Multivariate statistical methods in hydrology—A comparison using data of known functional relationship. Water Resour. Res. 1, 447–461. doi:10.1029/WR001i004p00447

Watt, M.S., Palmer, D.J., 2012. Use of regression kriging to develop a Carbon:Nitrogen ratio surface for New Zealand. Geoderma 183–184, 49–57. doi:10.1016/j.geoderma.2012.03.013

Webster, R., 1994. The development of pedometrics. Geoderma 62, 1–15. doi:10.1016/0016-7061(94)90024-8

Webster, R., 2000. Is soil variation random? Geoderma 97, 149–163. doi:10.1016/S0016-7061(00)00036-7

Webster, R., Burgess, T.M., 1980. Optimal interpolation and isarithmic mapping of soil properties. III. Changing drift and universal kriging. J. Soil Sci. 31, 505–524. doi:10.1111/j.1365-2389.1980.tb02100.x

Webster, R., Oliver, M.A., 1992. Sample adequately to estimate variograms of soil properties. J. Soil Sci. 43, 177–192. doi:10.1111/j.1365-2389.1992.tb00128.x

Webster, R., Oliver, M.A., 2007. Geostatistics for environmental scientists, 2nd ed. John Wiley, Chichester, United Kingdom.

Wehrens, R., Mevik, B.-H., Mevik, M.B.-H., 2007. The pls package. Ref. Man.

Were, K., Bui, D.T., Dick, Ø.B., Singh, B.R., 2015. A comparative assessment of support vector regression, artificial neural networks, and random forests for predicting and mapping soil organic carbon stocks across an Afromontane landscape. Ecol. Indic. 52, 394–403. doi:10.1016/j.ecolind.2014.12.028

Wiesmeier, M., Barthold, F., Spörlein, P., Geuß, U., Hangen, E., Reischl, A., Schilling, B., Angst, G., von Lützow, M., Kögel-Knabner, I., 2014. Estimation of total organic carbon storage and its driving factors in soils of Bavaria (southeast Germany). Geoderma Reg. 1, 67–78. doi:10.1016/j.geodrs.2014.09.001

Williams, P.C., 1987. Variables affecting near-infrared reflectance spectroscopic analysis, in: Williams, P., Norris, K. (Eds.), Near-infrared Technology in the Agricultural and Food Industries. American Association of Cereal Chemists, St. Paul, MN, pp. 143–167.

Wright, R.L., Wilson, S.R., 1979. On the analysis of soil variability, with an example from Spain. Geoderma 22, 297–313. doi:10.1016/0016-7061(79)90026-0

Xiong, X., Grunwald, S., Myers, D.B., Kim, J., Harris, W.G., Comerford, N.B., 2014a. Holistic environmental soil-landscape modeling of soil organic carbon. Environ. Model. Softw. 57, 202–215. doi:10.1016/j.envsoft.2014.03.004

Xiong, X., Grunwald, S., Myers, D.B., Ross, C.W., Harris, W.G., Comerford, N.B., 2014b. Interaction effects of climate and land use/land cover change on soil organic carbon sequestration. Sci. Total Environ. 493, 974–982. doi:10.1016/j.scitotenv.2014.06.088

Zhang, S., Huang, Y., Shen, C., Ye, H., Du, Y., 2012. Spatial prediction of soil organic matter using terrain indices and categorical variables as auxiliary information. Geoderma, Beijing 171–172, 35–43. doi:10.1016/j.geoderma.2011.07.012

Zhao, Y.-C., Shi, X.-Z., 2010. Spatial prediction and uncertainty assessment of soil organic carbon in Hebei province, China, in: Boettinger, J.L., Howell, D.W., Moore, A.C., Hartemink, A.E., Kienast-Brown, S. (Eds.), Digital Soil Mapping. Springer Netherlands, Dordrecht, pp. 227–239.

Zhu, Q., Lin, H.S., 2010. Comparing ordinary kriging and regression kriging for soil properties in contrasting landscapes. Pedosphere 20, 594–606. doi:10.1016/S1002-0160(10)60049-5

BIOGRAPHICAL SKETCH

Hamza Keskin was born in Istanbul, Turkey, in 1988. He received his Bachelor of Science degree in forestry engineering from the University of Istanbul, Turkey, in 2010. In 2012, during his second year as a graduate student at the University of Istanbul, he was awarded a scholarship from the Republic of Turkey Ministry of Forestry and Water Affairs. He then decided to pursue his academic career in the United States and enrolled at the University of Florida in 2013, where he earned his Master of Science degree in the Soil and Water Science Department in 2015. His academic and professional interests involve modeling and mapping soil properties to better understand the genesis and distribution of soils.
