Performance of three machine learning algorithms for predicting soil

Soil organic carbon (SOC), as the largest terrestrial carbon pool, has the potential to influence climate change and mitigation, and consequently SOC monitoring is important in the frameworks of different international treaties. There is therefore a need for high-resolution SOC maps. Machine learning (ML) offers new opportunities for this due to its capability for data mining of large datasets. The aim of this study, therefore, was to test three algorithms commonly used in digital soil mapping – random forest (RF), boosted regression trees (BRT) and support vector machine for regression (SVR) – on the first German Agricultural Soil Inventory to model agricultural topsoil SOC content. Nested cross-validation was implemented for model evaluation and parameter tuning. Moreover, grid search and the differential evolution algorithm were applied to ensure that each algorithm was tuned and optimised suitably. The SOC content of the German Agricultural Soil Inventory was highly variable, ranging from 4 g kg⁻¹ to 480 g kg⁻¹. However, only 4% of all soils contained more than 87 g kg⁻¹ SOC and were considered organic or degraded organic soils. The results show that SVR provided the best performance, with an RMSE of 32 g kg⁻¹, when the algorithms were trained on the full dataset. However, the average RMSE of all algorithms decreased by 34% when mineral and organic soils were modelled separately, with the best result from SVR at an RMSE of 21 g kg⁻¹. Model performance is often limited by the size and quality of the soil dataset available for calibration and validation. Therefore, the impact of enlarging the training data was tested by including 1223 data points from the European Land Use/Land Cover Area Frame Survey (LUCAS) for agricultural sites in Germany. Model performance improved by a maximum of 1% for mineral soils and 2% for organic soils.
Despite the capability of machine learning algorithms in general, and SVR in particular, to model SOC on a national scale, the study showed that the most important step towards improving model performance was the separate modelling of mineral and organic soils.

and the original 0-20 cm LUCAS data were then used by each algorithm to check the effect of depth extrapolation.

organic soils (Roßkopf et al., 2015) that distinguishes mineral soils from organic ones and explains their spatial

https://doi.org/10.5194/soil-2021-107 Preprint. Discussion started: 8 November 2021. © Author(s) 2021. CC BY 4.0 License.

decision tree. Subsequently, by aggregating the results of a large number of decision trees, the bias and variance of the final model can be reduced (Breiman, 1999). Bootstrapping in conjunction with aggregating, known as bagging, increases the robustness and stability of RF. However, the trees from different bootstrap samples may form a similar structure if all covariates can participate in the split of each node; thus, the variance cannot be reduced optimally through the bagging process (Kuhn and Johnson, 2013). In order to avoid this tree correlation, a random subset of predictors is selected at each split. The parameter mtry defines the number of predictors included in this subset and should be tuned (Kuhn and Johnson, 2013). The RF algorithm was implemented with the "ranger" package (Wright and Ziegler, 2017) in R.
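The study implemented RF with the "ranger" package in R; as a rough, non-authoritative sketch of the same tuning idea in Python, scikit-learn's RandomForestRegressor exposes max_features as the analogue of mtry. The covariates and SOC values below are synthetic and purely illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for a covariate matrix and SOC contents (g kg^-1)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 5 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=200) + 30

# max_features is scikit-learn's analogue of ranger's mtry:
# the number of covariates considered as split candidates at each node.
search = GridSearchCV(
    RandomForestRegressor(n_estimators=200, random_state=0),
    param_grid={"max_features": [2, 3, 5, 8]},
    cv=5,
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_["max_features"])
```

With only two informative covariates out of ten, small values of max_features tend to decorrelate the trees without discarding too much signal, which is exactly the trade-off mtry tuning addresses.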

SVR is a form of support vector machine adapted for regression. From all possible solutions, i.e. estimation functions, for the problem, SVR tries to obtain an estimation function whose deviation from the observations stays within a maximum error while minimising model complexity (Smola and Schölkopf, 2004). Thus, a symmetrical tolerance threshold, the ε-insensitivity zone, is created around the estimation function, within which the vectors are not penalised (Awad and Khanna, 2015).

The vectors that lie on the boundary of the ε-insensitivity zone are called support vectors. Therefore, ε is an optimisable parameter that controls the width of the ε-insensitivity zone, alters the model complexity and impacts the number of support vectors. The cost parameter C penalises vectors lying outside the ε-insensitivity zone; a high C can lead to overfitting, while a low C can cause underfitting (Kuhn and Johnson, 2013). The use of kernel functions allows nonlinear relationships between the covariates and SOC to be modelled. In this study, the Radial Basis Function (RBF) kernel was used with gamma as its tuneable parameter. This parameter controls the influence of individual training samples and thereby the flexibility of the estimation function.
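A minimal Python sketch with scikit-learn's SVR on synthetic one-dimensional data illustrates how epsilon, C and gamma enter the model; the paper's actual implementation, tuning and data differ:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Synthetic nonlinear relationship with mild noise
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(150, 1))
y = 10 * np.sin(X[:, 0]) + 30 + rng.normal(scale=0.5, size=150)

# epsilon sets the half-width of the insensitivity zone (no penalty inside),
# C penalises deviations outside it, and gamma shapes the RBF kernel.
model = make_pipeline(
    StandardScaler(),
    SVR(kernel="rbf", C=10.0, epsilon=0.5, gamma=0.1),
)
model.fit(X, y)

# Only vectors on or outside the epsilon tube become support vectors,
# so widening the tube generally reduces their number.
svr = model.named_steps["svr"]
print(len(svr.support_))
```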

When training a predictive model, it is important to evaluate its generalisation performance on unseen data of the same type (Hawkins et al., 2003). However, as the number of available samples is usually a limiting factor, the evaluation process is often done by randomly splitting the available dataset into training and testing sets multiple times, i.e. cross-validation (CV). Although this process is effective, it is not entirely immune to biased error estimation. To ensure that the estimated error in model evaluation is as unbiased as possible, every model training step should be performed within the CV, including finding the best parameter set for the chosen algorithm (Varma and Simon, 2006). Thus, the algorithms in this study were applied in a stratified nested CV.
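Nested CV as described, with parameter tuning kept strictly inside the outer training folds, can be sketched in scikit-learn by wrapping a grid search inside an outer cross-validation. The data, folds and parameter grid below are synthetic placeholders (the study's folds were additionally stratified spatially):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.0, 0.0]) + rng.normal(scale=0.3, size=120)

# Inner loop: parameter tuning on the outer training set only
inner = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid={"C": [1, 10, 100], "epsilon": [0.1, 0.5]},
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)

# Outer loop: unbiased estimate of generalisation error
outer_scores = cross_val_score(
    inner, X, y,
    cv=KFold(n_splits=5, shuffle=True, random_state=1),
    scoring="neg_root_mean_squared_error",
)
print(-outer_scores.mean())
```

Because the grid search is refit from scratch within each outer training fold, the outer test folds never influence parameter selection, which is the point of nesting.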

First, to ensure that the SOC distribution was represented in the CV scheme, Germany was divided into 50 strata using a 100 × 100 km INSPIRE grid. Random samples from each stratum were then taken and compiled into a fold.

This procedure was continued to create five folds and was repeated five times, forming the outer loop of the CV used for model evaluation. The large distance between neighbouring samples, 8120 m on average, prevents the training and test data from being spatially autocorrelated. Since the aim was to tune the parameters of the algorithms, the training set of the outer CV loop was nested, creating five folds as the inner loop on which the parameter tuning was performed. To evaluate the performance of the algorithms, root-mean-squared error (RMSE, Eq. 1), mean absolute error (MAE, Eq. 2) and mean absolute percentage error (MAPE, Eq. 3) were used.
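The grid-based stratified fold construction described above can be sketched with numpy, assuming hypothetical projected coordinates in metres (the real inventory uses the INSPIRE 100 km grid; the coordinates and sample count here are invented):

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical projected sample coordinates in metres
x = rng.uniform(4_000_000, 4_600_000, size=1000)
y = rng.uniform(2_700_000, 3_500_000, size=1000)

# Assign each sample to a 100 km x 100 km grid cell (stratum)
cell = (x // 100_000).astype(int) * 10_000 + (y // 100_000).astype(int)

# Within each stratum, deal samples round-robin into five folds so that
# every fold reflects the spatial distribution of the strata.
folds = np.empty(len(x), dtype=int)
for c in np.unique(cell):
    idx = np.flatnonzero(cell == c)
    rng.shuffle(idx)
    folds[idx] = np.arange(len(idx)) % 5

print(np.bincount(folds, minlength=5))
```

Each stratum contributes roughly equally to every fold, which is what keeps the SOC distribution represented across the CV splits.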

$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$ (Eq. 1)

$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|$ (Eq. 2)

$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{\hat{y}_i - y_i}{y_i}\right|$ (Eq. 3)

where $n$ is the number of samples, and $\hat{y}_i$ and $y_i$ are the predicted and observed values, respectively. Grid search was applied to RF, since its tuning parameter is discrete.
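As a sketch, the three error metrics can be computed directly with numpy (the observed and predicted values below are invented for illustration):

```python
import numpy as np

def rmse(y_obs, y_pred):
    """Root-mean-squared error (Eq. 1)."""
    return float(np.sqrt(np.mean((y_pred - y_obs) ** 2)))

def mae(y_obs, y_pred):
    """Mean absolute error (Eq. 2)."""
    return float(np.mean(np.abs(y_pred - y_obs)))

def mape(y_obs, y_pred):
    """Mean absolute percentage error (Eq. 3), in percent.
    Assumes no zero observations (SOC contents are strictly positive)."""
    return float(np.mean(np.abs((y_pred - y_obs) / y_obs)) * 100)

y_obs = np.array([10.0, 20.0, 40.0])
y_pred = np.array([12.0, 18.0, 44.0])
print(rmse(y_obs, y_pred), mae(y_obs, y_pred), mape(y_obs, y_pred))
```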
were less accurate in the north of Germany compared with the centre and south of the country (Fig. 3A). Mineral soils in the north-west of Germany were generally underpredicted, while RF overpredicted them.

Furthermore, unlike RF and SVR, BRT distinctly overpredicted the SOC of the north-east's mineral soils with the lowest SOC content (<10 g kg⁻¹). This result indicates that the algorithms differed in their performance on mineral soils. This difference was mainly due to the information they obtained from land use. Although land use was the second most important covariate for all three algorithms (Fig. 4A), its variable importance was 22% in SVR, but just 11% in RF and 9% in BRT. Thus, SVR exploited more information from this covariate than RF and BRT.

With the Shapiro-Wilk test rejecting the normality assumption for the residuals of all corresponding algorithms at 20 cm and 30 cm, the non-parametric Kruskal-Wallis test showed no significant difference between the residuals at the two depths. Thus, the extrapolation of the soil depth had no significant impact on the data quality for regionalising SOC. As a result, any further change in the performance of the algorithms after adding the LUCAS data was due to the training set being enlarged. The results of the algorithms at both depths can be found in the supplementary information (Fig. S1).
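The testing sequence, Shapiro-Wilk on the residuals followed by a non-parametric Kruskal-Wallis comparison, can be sketched with scipy. The residuals below are synthetic and deliberately skewed, so the normality test rejects, as it did for the real residuals:

```python
import numpy as np
from scipy.stats import shapiro, kruskal

rng = np.random.default_rng(4)
# Hypothetical model residuals at two sampling depths (skewed distributions)
res_20cm = rng.lognormal(mean=0.0, sigma=0.8, size=300) - 1.0
res_30cm = rng.lognormal(mean=0.0, sigma=0.8, size=300) - 1.0

# Shapiro-Wilk: a small p-value rejects the normality assumption ...
stat_norm, p_norm = shapiro(res_20cm)

# ... which motivates the rank-based Kruskal-Wallis test for comparing
# the residual distributions across depths instead of, e.g., an ANOVA.
stat_kw, p_kw = kruskal(res_20cm, res_30cm)
print(p_norm, p_kw)
```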
enlarging the training set does not provide enough information for BRT or RF to capture the high variability of