The spatial assessment of soil functions requires maps of basic soil properties. Unfortunately, these are either missing for many regions or are not available at the desired spatial resolution or down to the required soil depth. The field-based generation of large soil datasets and conventional soil maps remains costly. Meanwhile, legacy soil data and comprehensive sets of spatial environmental data are available for many regions.

Digital soil mapping (DSM) approaches relating soil data (responses) to environmental data (covariates) face the challenge of building statistical models from large sets of covariates originating, for example, from airborne imaging spectroscopy or multi-scale terrain analysis. We evaluated six approaches for DSM in three study regions in Switzerland (Berne, Greifensee, ZH forest) by mapping the effective soil depth available to plants (SD), pH, soil organic matter (SOM), effective cation exchange capacity (ECEC), clay, silt, gravel content and fine fraction bulk density for four soil depths (totalling 48 responses). Models were built from 300–500 environmental covariates by selecting linear models through (1) grouped lasso and (2) an ad hoc stepwise procedure for robust external-drift kriging (georob). For (3) geoadditive models we selected penalized smoothing spline terms by component-wise gradient boosting (geoGAM). We further used two tree-based methods: (4) boosted regression trees (BRTs) and (5) random forest (RF). Lastly, we computed (6) weighted model averages (MAs) from the predictions obtained from methods 1–5.

Lasso, georob and geoGAM successfully selected strongly reduced sets of covariates (subsets of 3–6 % of all covariates). Differences in predictive performance, tested on independent validation data, were mostly small and did not reveal a single best method across the 48 responses. Nevertheless, RF was often the best among methods 1–5 (28 of 48 responses), but was outcompeted by MA for 14 of these 28 responses. RF tended to over-fit the data. The performance of BRT was slightly worse than that of RF. GeoGAM performed poorly on some responses and was the best for only 7 of 48 responses. The prediction accuracy of lasso was intermediate. All models generally had small bias. Only the computationally very efficient lasso had slightly larger bias because it tended to under-fit the data. In summary, although differences were small, the frequencies of the best and worst performance clearly favoured RF if a single method is applied and MA if multiple prediction models can be developed.

Human well-being depends on numerous services that soils provide in
agriculture, forestry, natural hazards, water protection, resources
management and other environmental domains. The capacity of soil to deliver
services is largely determined by its functions, e.g. regulation of water,
nutrient and carbon cycles, filtering of compounds, production of food and
biomass or providing habitats for plants and soil fauna

Many recent DSM studies used relatively small sets of no more than 30
covariates (e.g.

efficiently build models without much user interaction, even if there are more covariates

cope with numerous multi-collinear and likely noisy covariates,

result in predictions with good accuracy and

avoid over-fitting the calibration data.

DSM approaches used in the past can broadly be grouped into (1) linear
regression models (LMs), (2) variants of geostatistical approaches,
(3) generalized additive models (GAMs), (4) methods based on single trees like
classification and regression trees (CARTs), (5) (ensemble) machine learners
like boosted regression trees (BRTs) or random forest (RF), and (6) averaging
predictions of any of the mentioned methods (model averaging, MA). LM

Generally, more complex approaches seem to yield more accurate predictions
than simpler DSM methods

The comparative studies mentioned above used only small sets of covariates
(

The objectives of this study were to evaluate for a broad choice of currently
used DSM methods how well they cope with requirements 1–4 listed above. We
compared in our study (a) lasso, (b) robust EDK (georob), (c) spatial GAM
with model selection based on boosting (geoGAM), two ensemble tree methods,
namely (d) BRT and (e) RF, and (f) weighted MA. In more detail, our objectives
were to

automatically build models with methods (a)–(e) and compute MA of (a)–(e) for numerous responses from large sets of covariates (300–500),

evaluate the predictive performance of these models with independent validation data,

evaluate over-fitting behaviour and the practical usage of approaches

and briefly compare the accuracies of DSM predictions and predictions derived from a legacy soil map with scale

We focused on three study regions in Switzerland: a forested region and two
regions with agricultural land where harmonized legacy soil data and, in
the latter regions, airborne imaging spectrometer data were available. For the
agricultural land, the soil properties required for assessing regulation, habitat
and production functions were mapped
(Table

Basic soil properties needed for spatial soil function assessment in
the three study regions. Most soil functions required data on further,
expensive-to-measure soil properties that were inferred by pedotransfer
functions (PTFs) from the basic soil properties (see

We chose three study regions on the Swiss Plateau with contrasting patterns
regarding land use, geology, soil types and availability of airborne remote
sensing images (Fig.

Description of the three study regions (

Location of study regions Berne, Greifensee (agricultural soils) and the Canton of Zurich (forest soils).

The majority (80 %) of the study region Berne was covered by cropland
and 15 % by permanent grassland. In the Greifensee region cropland
covered roughly half of the area and one-third was permanent grassland. The
remaining areas were orchards, vineyards, horticultural areas or mountain
pastures

The third study region comprises the forested areas of the Canton of Zurich
(ZH forest), as derived from the forested area of the topographic landscape
model

Soils are rather young in all study regions (

We gathered and harmonized legacy soil data from various soil surveys
performed between 1960 and 2014. Berne data were collected mostly before 1980
in small soil mapping projects for land improvement. Data for Greifensee and
ZH forest originate from long-term soil monitoring of the Canton of Zurich
(KaBo), a soil pollutant survey

Horizon-based (and non-fixed depth) soil property data were converted to
fixed-depth data for 0–10, 10–30, 30–50 and 50–100

Soil properties were either measured by standard laboratory procedures, estimated in the field or calculated by PTF (see overview in Table S1). By statistical modelling, we accounted for fluctuations in the observations over the long period during which the data had been collected and for possible differences between laboratory measurements, field estimates and PTF predictions. To this end, we included categorical covariates (factors) in the statistical models that coded the period when the data had been gathered, separately for laboratory measurements, field estimates and PTF predictions. For Berne three periods (1968–1974, 1975–1978 and 1979–2010) were coded separately for laboratory measurements and field estimates. For Greifensee and ZH forest, coding required more care because we had replicate samples from soil monitoring. This coding allowed us to use all individual observations instead of only mean or median values per site. For Greifensee we coded the years 1960–1989, 1990–1994 and 1995–1999 separately for laboratory and field data and 2000–2014 for laboratory measurements only. For ZH forest we distinguished the periods 1985–1994, 1995–1999, 2000–2004, 2005–2009 and 2010–2014 for laboratory measurements and a further two levels for predictions by PTF or pH measurements on field-moist samples (see Table S1). Data on pH, soil organic matter (SOM) and effective cation exchange capacity (ECEC) that were older (or newer) than reported above were discarded. To compute model predictions for mapping we used the most recent time period and laboratory measurements as the reference level.
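The use of a reference level for mapping can be illustrated with a small Python sketch. It is only an illustration under stated assumptions and not the code used in this study: the linear model form, the level names and all coefficient values are hypothetical. The point it shows is that the legacy-data factor enters calibration with the level observed at each site, whereas predictions for the map always use the reference level (most recent period, laboratory measurement).

```python
# Hypothetical fitted linear model with one continuous covariate and a factor
# that codes data type and sampling period. All values are invented.
REFERENCE_LEVEL = "lab_2000_2014"  # most recent period, laboratory measurement

coefficients = {
    "intercept": 6.2,
    "slope_covariate": -0.004,
    # Offsets relative to the reference level, whose offset is 0 by construction.
    "legacy_factor": {
        "lab_2000_2014": 0.0,
        "lab_1990_1994": 0.15,
        "field_1960_1989": -0.30,
    },
}


def predict(covariate_value, legacy_level=REFERENCE_LEVEL):
    """Predict the response for one location and one level of the legacy factor."""
    return (
        coefficients["intercept"]
        + coefficients["slope_covariate"] * covariate_value
        + coefficients["legacy_factor"][legacy_level]
    )


def predict_for_map(covariate_values):
    """Predict for all map pixels with the legacy factor fixed at the reference level."""
    return [predict(value) for value in covariate_values]


map_predictions = predict_for_map([0.0, 100.0, 250.0])
```

In calibration, a field estimate from 1960–1989 would be fitted with `predict(x, "field_1960_1989")`, so that its systematic offset is absorbed by the factor and does not propagate into the map.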

For the agricultural land (Berne, Greifensee) we modelled clay and silt,
gravel content, pH, SOM and effective soil depth available to plants (SD).
For ZH forest we modelled ECEC, pH and bulk density of the fine soil fraction
(

Models for properties of agricultural soils were calibrated with data from
700–900 sites. For SOM there were more topsoil sites available (1140), but
in the subsoil we had only data from 400 (Greifensee) and 530 (Berne) sites,
respectively. For ZH forest topsoil chemical properties were available for
1055 (ECEC) to 1470 (pH) sites, but for subsoils data were again scarce (ECEC
380 and pH 690 sites). For modelling BD we had only 550 (topsoil) to 370
(subsoil) sites. On average we calibrated the models with the following
spatial data densities: Berne 2.9–3.6, Greifensee 4.2–5.1 and ZH forest
1.2–1.8 observations per km

Tables S3 to S7 report descriptive statistics for all soil properties. In
general, soils in the Greifensee region were richer in clay (mean clay
content 26 %) than in Berne (17–19 %) and had larger gravel content
(8–13 % vs. 3–5 %). In both agricultural study regions, large SOM
contents were occasionally found (

To represent soil-forming factors we used data from 28 sources, totalling
roughly 480 covariates for Berne and Greifensee and 330 for ZH forest where
APEX imaging spectrometer data were not available
(Tables

Overview of geodata sets and derived covariates (for more information see
Table S2 in the Supplement);

The large number of responses – 21 each for Berne and Greifensee and 6 for
ZH forest – and of covariates (Table

For parametric methods (Sects.

For tree-based models (Sects.

To find optimal tuning parameters, we minimized root mean square error
(RMSE; Eq.

The lasso (least absolute shrinkage and selection operator) is a shrinkage
method that likely excludes non-relevant covariates and is therefore an
attractive framework for high-dimensional covariate selection. Lasso
estimates the coefficients of a linear model by minimizing a penalized residual
sum of squares, with the penalty being equal to the weighted sum of absolute
values of the estimated coefficients. By increasing the weight

We used the grouped lasso, which jointly shrinks all coefficients of a factor
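The penalized least-squares idea behind the lasso can be sketched in a few lines of Python. The sketch fits an ordinary lasso, not the grouped lasso used in this study, by cyclic coordinate descent with soft-thresholding. The scaling of the objective (residual sum of squares divided by 2n plus `lam` times the sum of absolute coefficients), the function names and the toy data are assumptions made for illustration.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t and set it to exactly zero if |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Ordinary lasso by cyclic coordinate descent.

    Minimizes (1 / (2n)) * RSS + lam * sum(|beta_j|); the intercept is unpenalized.
    """
    n, p = len(y), len(X[0])
    y_mean = sum(y) / n
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    # Centre response and covariates so that the intercept can be handled separately.
    Xc = [[row[j] - col_means[j] for j in range(p)] for row in X]
    resid = [v - y_mean for v in y]
    col_ss = [sum(row[j] ** 2 for row in Xc) / n for j in range(p)]
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue  # constant covariate carries no information
            # Covariance of covariate j with the partial residual (excluding j).
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n + col_ss[j] * beta[j]
            new_beta = soft_threshold(rho, lam) / col_ss[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * Xc[i][j]
                beta[j] = new_beta
    intercept = y_mean - sum(b * m for b, m in zip(beta, col_means))
    return intercept, beta


# Toy data: the response depends strongly on the first covariate only.
X_demo = [[-1.0, -1.0, 1.0], [-1.0, 1.0, -1.0], [1.0, -1.0, -1.0], [1.0, 1.0, 1.0]]
y_demo = [-1.1, -0.9, 2.9, 3.1]
intercept_demo, beta_demo = lasso_cd(X_demo, y_demo, lam=0.5)
```

With a sufficiently large penalty weight the coefficients of the weakly related covariates become exactly zero, which is the covariate selection property exploited in this study. The grouped variant applies the same principle to all coefficients of a factor at once.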

We applied external-drift kriging (EDK) with robustly estimated trend
coefficients and exponential variogram parameters

Additive models accommodate linear effects and smooth non-linear
effects of continuous covariates. Spatial autocorrelation can be represented
in geoGAM by a smooth function of the spatial coordinates (smooth spatial
surface), and non-stationary effects are modelled by interactions between
smooth spatial functions and covariates. We based model building for geoGAM
on component-wise gradient boosting, a slow stage-wise additive model-building
algorithm. At each stage, base procedures are fitted to the residuals of the
previous model and the best-fitting base procedure is retained to update the
model by a small step size
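The stage-wise procedure described above can be sketched in Python as follows. This is a simplified illustration and not the geoGAM implementation: it assumes squared-error loss and uses simple linear base procedures, one per covariate, in place of the penalized smoothing splines and smooth spatial surfaces of the actual models. The function name, the step size of 0.1 and the toy data are likewise assumptions.

```python
def componentwise_boost(X, y, step=0.1, n_iter=100):
    """Component-wise gradient boosting for squared-error loss.

    At each iteration every single-covariate linear base procedure is fitted to
    the current residuals; only the best-fitting one updates the model, and
    only by a fraction `step` of its fitted coefficient.
    """
    n, p = len(y), len(X[0])
    offset = sum(y) / n
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]
    ss = [sum(row[j] ** 2 for row in Xc) for j in range(p)]
    coef = [0.0] * p
    resid = [v - offset for v in y]
    for _ in range(n_iter):
        best_j, best_b, best_gain = None, 0.0, 0.0
        for j in range(p):
            if ss[j] == 0.0:
                continue
            sxy = sum(Xc[i][j] * resid[i] for i in range(n))
            b = sxy / ss[j]   # least-squares fit of the residuals on covariate j
            gain = b * sxy    # reduction in residual sum of squares
            if gain > best_gain:
                best_j, best_b, best_gain = j, b, gain
        if best_j is None:
            break  # no base procedure improves the fit
        coef[best_j] += step * best_b
        for i in range(n):
            resid[i] -= step * best_b * Xc[i][best_j]
    return offset, means, coef


# Toy data: strong effect of the first covariate, very weak effect of the second.
X_demo = [[-1.0, -1.0, 1.0], [-1.0, 1.0, -1.0], [1.0, -1.0, -1.0], [1.0, 1.0, 1.0]]
y_demo = [-1.1, -0.9, 2.9, 3.1]
offset_demo, means_demo, coef_demo = componentwise_boost(X_demo, y_demo, n_iter=20)
```

Covariates that are never selected before the iterations stop keep a coefficient of exactly zero, so stopping early acts as covariate selection. In this toy example only the first covariate enters the model within 20 iterations.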

Non-stationary effects were added for all continuous covariates, but
cross-validation RMSE did not substantially decrease, and we preferred the
simpler stationary models throughout. Maximum boosting iterations

Classification and regression trees (CARTs) are based on recursive binary
partitioning of the covariates and can capture complex interaction structures
in a dataset. Generally, single trees tend to be noisy (large variance), but
have small bias. Combining trees by ensemble methods aims to reduce their
variance

The optimal number of trees (the number of boosting iterations)

The learning rate was kept similarly small as for geoGAM with

RF

Tuning parameters are the number of trees

The five methods described above likely represent different aspects of the
covariates and can be seen as different means of reducing the
high-dimensional covariate input. Hence, combining the predictions of several
models possibly improves predictive performance over single methods as the large
variance in individual models is reduced through averaging
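A weighted model average of this kind can be sketched as follows. The weighting scheme is an assumption made for illustration (weights proportional to the inverse of the squared cross-validation or out-of-bag RMSE, normalized to sum to one), and the RMSE values and predictions are invented.

```python
def ma_weights(rmse_by_method):
    """Weights proportional to 1 / RMSE^2, normalized to sum to one (assumed scheme)."""
    inverse_mse = {method: 1.0 / (rmse * rmse) for method, rmse in rmse_by_method.items()}
    total = sum(inverse_mse.values())
    return {method: value / total for method, value in inverse_mse.items()}


def model_average(predictions_by_method, weights):
    """Weighted average of the predictions of several methods, location by location."""
    n_locations = len(next(iter(predictions_by_method.values())))
    return [
        sum(weights[method] * predictions_by_method[method][i] for method in weights)
        for i in range(n_locations)
    ]


# Invented cross-validation (or out-of-bag) RMSE values for one response.
rmse_demo = {"lasso": 7.4, "georob": 7.1, "geoGAM": 7.3, "BRT": 7.0, "RF": 6.8}
# Invented predictions of clay content (%) at three locations.
predictions_demo = {
    "lasso": [18.0, 25.0, 31.0],
    "georob": [17.0, 27.0, 30.0],
    "geoGAM": [19.0, 26.0, 29.0],
    "BRT": [18.5, 24.0, 33.0],
    "RF": [18.0, 25.5, 32.0],
}
weights_demo = ma_weights(rmse_demo)
ma_demo = model_average(predictions_demo, weights_demo)
```

When the RMSE values of the methods are similar, as in this invented example, the weights are all close to one-fifth and the result is close to an unweighted mean.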

For the Greifensee region a legacy soil map with scale

The map defined topsoil by pedogenetic A horizon without indicating a
particular depth. We therefore compared predictions for topsoil to values
observed in 0–10 and 10–30

The accuracy of predictions by the six statistical DSM approaches and the
legacy soil map was evaluated by comparing predicted
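The accuracy measures used throughout this study (bias, RMSE and the mean square error skill score) can be computed as in the following Python sketch. The formulas are the usual textbook definitions and are stated here as assumptions: bias is taken as the mean of predicted minus observed values, which is a sign convention, and the skill score compares the squared prediction errors with the squared deviations of the observations from their mean. Function names and data are illustrative.

```python
import math


def bias(observed, predicted):
    """Mean of predicted minus observed values (sign convention assumed)."""
    return sum(p - o for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean square error of the predictions."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


def ss_mse(observed, predicted):
    """Mean square error skill score.

    1 means perfect predictions, 0 means no better than predicting the mean
    of the observations, and negative values mean worse than that.
    """
    mean_observed = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - sse / sst


# Invented validation data.
observed_demo = [1.0, 2.0, 3.0, 4.0]
predicted_demo = [1.0, 2.0, 3.0, 5.0]
bias_demo = bias(observed_demo, predicted_demo)
rmse_demo = rmse(observed_demo, predicted_demo)
ss_demo = ss_mse(observed_demo, predicted_demo)
```

Computing the same skill score once from cross-validation or out-of-bag predictions and once from independent validation data, and taking the difference, gives a simple indication of over-fitting.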

Grouped lasso, robust external-drift kriging (georob) and boosted geoadditive
models (geoGAMs) successfully selected strongly reduced sets of covariates. On
average, lasso models had 21, georob 27 and geoGAM only 12 covariates in the
final models. This corresponds to only 3–6 % of all covariates. Boosted
regression trees (BRTs) performed weak covariate selection. The stage-wise
forward algorithm selected on average 43 % of all covariates (covariates
with importance

Optimal values of

In contrast, BRT profited more from tuning its parameters

The residual spatial autocorrelation of georob models was much weaker than
the autocorrelation of the original responses (Tables S4, S6 and S7). Effective
ranges for Greifensee and ZH forest were less than 300

Since cross-validation and out-of-bag RMSE did not vary much between the five
methods, model averaging (MA) weights generally did not differ much from

Summing up, lasso, georob, geoGAM and partly BRT effectively selected relevant covariates from a large set. The reduction of covariates in RF, as tested on a few responses, seems promising. The benefit of tuning model parameters was sometimes only small, but remained relevant when considering all responses.

Accuracy of the predictions of soil properties by study region and soil
depth computed with independent validation data (RMSE: root mean square
error, SS

Table

Box plots of SS

There was no method that consistently performed best for all soil properties,
soil depths and study regions. Each of the tested methods (lasso, georob,
geoGAM, BRT, RF) performed best for at least one response, and
SS

Apart from overall accuracy as captured by RMSE and SS

Box plots of SS

Box plots of bias

Difference of 10-fold cross-validation and independent validation
SS

Lastly, we evaluated whether the various methods tended to over-fit the data
by computing differences between cross-validation (CV) or out-of-bag (OOB,
RF) SS

Mean predictive skill (%) of covariates

We explored whether characteristics of the (spatial) empirical distributions
of the responses were in some way related to variations in predictive
performance observed between responses. We checked whether SS

Only for extremely positively skewed responses (SOM below 30

The RMSE and SS

To characterize the “predictive skill” of covariates by topic, we computed
weighted averages of RF covariate importance

The sampling period and type of soil data were important for many responses
(legacy data correction; Fig.

Overall, APEX covariates had very small importance (average rank of covariate
importance 168 for RF and 48 for BRT). Differences in reflectance
intensities between autumn and spring flights and between agricultural land
with partly bare soils and various crops most likely obscured relations
between the surface reflectance of vegetation and soil properties. Preprocessing
using co-kriging with data from bare soil areas possibly improves predictive
capabilities for the present study regions

In addition to studying covariate importance, we evaluated the effects of single
covariates on the responses by using partial effects or dependence plots.
Given the large number of models and covariates we chose a continuous and a
factor covariate to illustrate the effects for one response.
Figure

Example partial residual plots

In addition to the reported analysis, we visually inspected the soil property
maps generated by the six DSM methods. Figure

Predictions of clay content (%) in 0–10

Maps of georob and BRT predictions showed artefacts (single pixels in georob
or bands in BRT) with very large predicted values. In the MA map, outlying
predictions were smoothed out. Outlying georob predictions were caused by the
multiflow specific catchment area (2

In addition to creating extrapolation errors, parametric methods (lasso, georob,
geoGAM) predicted physically impossible values (e.g. clay content

All tested DSM methods were able to process large sets of factors and
continuous covariates. Although RF more often performed best and MA even
improved on that, the advantage measured in validation SS

In our study residual spatial autocorrelation was weak or short ranged. For a response with strong residual autocorrelation a geostatistical approach might still offer an advantage. The smooth spatial surface of geoGAM is possibly too coarse to capture short-ranged autocorrelation. BRT and RF include spatial coordinates as covariates, but if the response depends only weakly on other covariates, spatial coordinates become overly important. Repeated recursive splitting on coordinates likely leads to “chessboard-type” artefacts.

All methods allowed for an interpretation of modelled relationships
(Fig.

R packages are readily available for all methods used in this study. Lasso
and geoGAM optimize their tuning parameters directly without any further
input to the software, while RF and BRT require specification of parameter
ranges to be tested. The number of parameters to tune influences computing
times considerably. Using default

The ease of modelling predictive uncertainty is another factor relevant
for the choice of a DSM method. In georob, uncertainties can be directly derived
from the kriging variances. For RF, conditional quantiles of predictive
distributions can be estimated directly at the cost of a larger memory
requirement

Responses for DSM are not always continuous soil properties. Binary,
multinomial (e.g. soil types) or ordinal (e.g. drainage classes) responses
are sometimes relevant. Grouped lasso is available for binary

We applied – to a total of 48 soil responses observed in three study regions in Switzerland – six statistical digital soil mapping (DSM) methods: grouped lasso (least absolute shrinkage and selection operator), robust external-drift kriging (georob), boosted geoadditive models (geoGAMs), boosted regression trees (BRTs), random forest (RF) and model averaging (MA). We used 300–500 environmental covariates as input to each method. Performance was assessed by comparing model predictions with independent validation data.

From this study we conclude the following.

All methods successfully built models automatically from large sets of covariates. The applied ad hoc procedure to find a parsimonious trend model for georob was, however, very inefficient.

Except for lasso, cross-validation and out-of-bag accuracy measures were sometimes better than actually observed for the validation data. This suggests that the methods partly tended to over-fit the data and underpins the necessity of model evaluation with independent data.

The best-performing method frequently did not have a much larger mean square error skill score (SS

Correcting for sampling period and soil data type by adding a factor to the models turned out to be important. Legacy soil data are inherently heterogeneous for various reasons, but one can (and should) compensate for this variation through careful statistical modelling.

The geoGAM model-building procedure was published as R
package geoGAM

The soil data for the Canton of Zurich were used under a
non-public data licence (Canton of Zurich, contract number TID 22742; WSL)
and could not be published. Data from the Berne study region were partly published
as test data

The supplement related to this article is available online at:

AP proposed the comparison of the selected DSM approaches and defined the model selection procedure for georob. MN implemented the DSM approaches for the three study regions and evaluated the results. KS explored the influence of tuning parameters and covariate selection on RF for selected responses. AB heavily contributed to the computation of the multi-scale terrain attributes, and UG and AK harmonized the soil data with collaborators. LG defined the demand for the soil properties to be mapped and proposed the derivation of SD from horizon qualifiers of Swiss soil classification data. MN prepared the paper with considerable input from AP and further contributions from all co-authors.

The authors declare that they have no conflict of interest.

We thank the Swiss National Science Foundation (SNSF) for funding this work in the framework of the national research program “Sustainable Use of Soil as a Resource” (NRP 68), the Swiss Earth Observatory Network (SEON) for funding aerial surveys using APEX and Sanne Diek for preprocessing the imagery. The contribution of Michael E. Schaepman was supported by the University of Zurich Research Priority Program on Global Change and Biodiversity (URPP GCB). Special thanks go to WSL and the cantonal agencies for soil protection in Zurich and Berne for sharing their soil data to make this study possible. Moreover, we are grateful to soil surveyors Peter Schwab, Martin Zürrer and Alexander Lehmann for their support in comparing the legacy soil map with DSM results and to Sudan Tandy for the language improvements.

Edited by: Bas van Wesemae
Reviewed by: two anonymous referees