On the beneﬁts of clustering approaches in digital soil mapping: an application example concerning soil texture regionalization

. High-resolution soil maps are urgently needed by land managers and researchers for a variety of applications. Digital soil mapping (DSM) allows us to regionalize soil properties by relating them to environmental covariates with the help of an empirical model. In this study, a legacy soil dataset was used to train a machine learning algorithm in order to predict the particle size distribution within the catchment of the Bode River in Saxony-Anhalt (Germany). The random forest ensemble learning method was used to predict soil texture based on environmental covariates originating from a digital elevation model, land cover data and geologic maps. We studied the usefulness of clustering applications in addressing various aspects of the DSM procedure. To improve areal representativity of the legacy soil data in terms of spatial variability, the environmental covariates were used to cluster the landscape of the study area into spatial units for stratiﬁed random sampling. Different sampling strategies were used to create balanced training data and were evaluated on their ability to improve model performance. Clustering applications were also involved in feature selection and stratiﬁed cross-validation. Under the best-performing sampling strategy, the resulting models achieved an R 2 of 0.29 to 0.50 in topsoils and 0.16–0.32 in deeper soil layers. Overall, clustering applications appear to be a versatile tool to be employed at


Introduction
In order to sustain soil resources, land managers and researchers are in need of information on the continuous landscape-scale distribution of soil properties. One of the important soil properties which governs most physical, chemical, and biological soil processes is soil texture. Soil texture maps can be used for the assessment of erosion risk, water deficit, or pesticide and nutrient storage and percolation (Blume et al., 2016).
Conventional soil maps are usually created by a qualitative analysis of the landscape based on a conceptual model which subdivides the area into spatially assigned units with all soil properties set to uniform values within the units. The categories of these units do not necessarily represent soil systematic units and do not allow for the representation of smallscale, continuous variability. Overall, these soil maps were never meant to be used as input to landscape-scale process models that strive to simulate gas, matter and water flows. From this demand and an advance in information technology, the domain of digital soil mapping (DSM) has quickly advanced (Grunwald et al., 2011).
DSM strives to capture and quantify the influence of the soil-forming factors, which are represented by continuous gridded geoinformation from remote sensing and other sources (Scull et al., 2003). Laboratory and field observations are coupled with spatial environmental covariates covering the study area and are used to build an empirical model to predict the surveyed target variable based on the quantitative relationship between soil properties and environmental covariates (McBratney et al., 2003;Grunwald, 2009;Minasny and McBratney, 2016). The key technological advantages that allowed DSM are the increase in computational power which facilitates model development and the widespread availability of satellite systems (Rossiter, 2018). The latter are used for accurate georeferencing and as platforms for a variety of sensors which provide spatially continuous measurements which can be used as environmental covariates.
The algorithms used for DSM applications are of different degrees of complexity, ranging from linear regression (Gobin et al., 2001;Park and Vlek, 2002;de Carvalho Junior et al., 2014) to artificial neural networks (Park and Vlek, 2002;Zhao et al., 2009). Most of these studies used continuous covariates based on a digital elevation model (DEM) as predictors, but certain applications also included categorical covariates, such as information based on geologic maps (Adhikari et al., 2013;Vaysse and Lagacherie, 2017). The machine learning algorithm most frequently used in DSM approaches is the random forest (RF) ensemble learning method (Padarian et al., 2019). A key characteristic of RF is its adaptive nature which allows it to explore complex, nonlinear, and high-dimensional relationships without a prior understanding of the problem to be solved (Evans et al., 2011). Compared to decision tree methods, RF is less prone to overfitting and is less sensitive to irrelevant predictors and outliers (Heung et al., 2014). Nevertheless, many RF and other modelling applications use feature selection preceding the model-building procedure to detect and exclude predictors with little information content with regards to the response variable. Feature selection can be achieved though filter methods, which investigate the predictor-response relationship of each predictor individually without considering the model algorithm, or alternatively by using wrapper methods that evaluate the performance of the model using a variety of predictor subsets.
The essential foundation of creating soil maps is the availability of a soil dataset of sufficient size and adequate distribution, but the soil surveys providing these data are associated with high cost and labour (Grunwald et al., 2011). To forego this effort, DSM uses legacy soil data whenever available. However, sampling in traditional soil surveys usually did not follow statistical sampling theory, which can lead to a bias in the data and the models derived from it (Carré et al., 2007;Ließ, 2020). Because soil-forming factors operate on different scales, it is important that the spatial distribution of the data is suitable for capturing the large-and smallscale variation of soil. In order to construct a model that can effectively predict throughout the landscape, it is important to have a statistically representative sample of training and validation data that allows for the generalization from the data to the spatial landscape context (Ließ, 2020). The most common approaches in dealing with this issue involve (a) creating a more balanced training set by sampling from the entirety of observations and (b) cost-sensitive learning frameworks, in which the learning algorithm penalizes the prediction error of underrepresented samples (He and Garcia, 2008). Many DSM applications tackle the problem of data imbalance with the subsampling approach (Moran and Bui, 2002;Subburayalu and Slater, 2013;Heung et al., 2016;Sharififar et al., 2019). This can be achieved by clustering the study area into homogeneous subareas with regards to the covariates and drawing a certain number of samples from each of these clusters.
Another hurdle of modelling applications lies in training and tuning. Model building and performance evaluation can be sensitive to the data splitting into training and testing sets. Although resampling techniques like cross-validation (CV) reduce the influence of data splitting, the model outcome can still be compromised by an uneven distribution of sample characteristics between training and testing datasets.
Many of these challenges in the DSM procedure are related to identifying structures and similarities in the data. Therefore, here we want to investigate the usefulness of data-clustering applications in tackling some of the abovementioned challenges in DSM. Specifically, we want to examine the benefits of using clustering applications for feature selection, for landscape stratification to conduct data subsampling, and for stratified cross-validation to build robust models. This will be done on the basis of training an RF model to predict soil texture within the catchment of the Bode River in Saxony-Anhalt, Germany. The model is trained and validated using a soil legacy dataset containing soil survey data. Environmental covariates related to soilforming factors are obtained and used as predictors.

Study area
The study area of approximately 3300 km 3 is part of the TERENO network for environmental observations (Zacharias et al., 2011) and covers the water catchment of the Bode River in central Germany (Fig. 1). It corresponds to three federal German states: Saxony-Anhalt, Lower Saxony, and Thuringia. The elevation ranges between 1 and 1141 m a.s.l. with the Harz Mountains in the south-west, the north-eastern Harz foreland, and the Magdeburg Börde of the North German Plain covering the rest of the area. The climate is subarctic to humid continental (Peel et al., 2007), with the mean annual precipitation ranging from 433 to 1771 mm (Deutscher Wetterdienst, 2020). The geologic material in the area consists mostly of Triassic limestone and Carboniferous shale and granite (BGR, 2007). Dominating soils according to the German soil classification (Finnern and Kühn, 1994) are Braunerde, Parabraunerde, Gley, and Pararendzina (BGR, 2012).

Soil legacy data
The soil samples used for model training and validation are from a legacy dataset provided by the regional geological survey of the German federal state of Saxony-Anhalt -Landesamt für Geologie und Bergwesen (LAGB, 2018). The data were recorded by various soil surveyors between 1963 and 2006 and consist of soil profile data from 574 sites. For every site, a soil diagnostic survey was conducted. Soil horizon boundaries were recorded according to either the TGL (TGL, 1985) or KA4 (Finnern and Kühn, 1994) soil systematic system. For every soil horizon, the particle size distribution was measured in the laboratory using DIN ISO 11277:2002-08. The fractions of three particle sizes were measured according to the German soil separates (sand [2 to 0.063 mm], silt [0.063 to 0.002 mm], and clay [< 0.002 mm]). Sand, silt, and clay contents were extracted from the horizon data at two discrete soil depths (10 and 70 cm) and used as the response variables of the models. The two depths were chosen to investigate whether different soil-forming factors dominated soil-landscape development in the topsoil and subsoil, respectively. Because the maximum depth of the surveyed soil profiles is not uniform, a depth of 70 cm was chosen as a trade-off between maximum soil depth (closeness to parent material) while not compromising the sample size. One sample is located in a Quaternary sand dune of less than 2 km 2 (BGR, 2007) near the town of Blankenburg and has a sand content of 96 %. The sample was removed because one sample alone would not be sufficient for model training and validation. The soil texture of the soil legacy dataset used for model training and evaluation is shown in Fig. 2a and b. A cluster analysis targeting three equally sized subgroups was applied to differentiate clayey samples from silty and sandy samples; please refer to the "Cluster analysis" section for details. Figure 2c and d show the spatial distribution of these three clusters within the study area.

Model predictors
Spatially continuous geodata of the study area corresponding to the soil-forming factors parent material, topography, and land cover were gathered. They comprise geologic maps of 1 : 200 000 (GUEK 200) and 1 : 1 000 000 (GUEK 1000) map scale (BGR, 2007(BGR, , 2006, a DEM of 10 m resolution (BKG, 2012), and CORINE Land Cover data from 1990, 2000, and 2012 (Büttner et al., 2004). The local river network was generated from the OpenStreetMap dataset by querying rivers and streams with the Overpass API service (OpenStreetMap contributors, 2018). Some of the geodata were used without further modification, like the land cover data and the elevation from the DEM. Additionally, further predictors were derived from these data and have been resampled to the 10 m resolution of the DEM.
In the digital vector information underlying the geologic maps, a variety of attributes is contained, including age, material, and origin. The "petrography" layer was used from both geologic maps, and the "genesis" layer was used from the 1 : 200 000 map. As the information contained in the petrography layer is descriptive, it was categorized into binary information on the occurrence of particle size classes in addition to its inclusion as an unmodified layer. Three new predictors (Sandbin, Siltbin, and Claybin) were created for every landscape unit based on the occurrence of the words sand or sandstone, silt or siltstone, and clay or claystone, respectively. Several topographic predictors were derived from the DEM, since relief is often considered the main driver of soil formation (McBratney et al., 2003;Scull et al., 2003;Behrens et al., 2010). Topographic predictors were calculated with the SAGA GIS software Version 6.4.0 (Conrad et al., 2015). The used topographic predictors were selected according to their appearance in similar digital soil mapping applications Vaysse and Lagacherie, 2017;Blanco et al., 2018;Kalambukattu et al., 2018;Zhou et al., 2019; Table 1). Sink removal by Wang and Liu (2006) was applied prior to the calculation of the hydrological terrain parameters (minimal slope = 0.01). For the calculation of the vertical distance to the channel network, the layer of waterways acquired from OpenStreetMap was used. Indices for terrain convexity and terrain surface texture were calculated by using a flat area threshold of 0.08 in order to minimize the impact of inaccuracies and insignificantly small depressions and mounds (Conrad et al., 2015).
Since soil-forming factors can take effect on different spatial scales, it is advised to take multiscale approaches into account (Behrens et al., 2010). Accordingly, convergence index (Köthe and Lehmeier, 1996), terrain ruggedness index (Riley, 1999), convexity, and terrain surface texture were calcu-lated with a search radius of 10, 50, 100, and 200 m in order to express local to regional landscape attributes. The annulusbased topographic position index (Guisan et al., 1999) was calculated on two scales, one ranging from 0 to 100 m and one from 100 to 200 m.
2.2 Modelling procedure 2.2.1 Random forest RF models are based on regression trees (RTs), which use selected values of predictor variables to repeatedly split the data in a way that maximizes the homogeneity of the subsets regarding the response variable (Kuhn and Johnson, 2013). Instead of building a single tree as is the case in RTs, RF uses the bagging ensemble method which constructs several trees based on bootstrapped samples of the data. The resulting averaged prediction has a lower variance and thus increased model stability compared to RTs. Although randomness is added to the procedure through resampling of the data, the underlying predictor-response relationship is not altered by bagging. As a consequence, many of the trees share similar structures. This correlation between trees can lead to a decrease in the predictive performance of the ensemble  (Breiman, 2001). To introduce diversity to the ensemble and decorrelate the trees, RF is extended by a random feature selection. Instead of using the entire set of predictors to build a tree, a random subset of the predictors is used for each tree. This reduction of predictors leads to a trade-off between the strength of individual trees (high number of predictors) and more diversity between trees (low number of predictors). The respective tuning parameter, which controls this trade-off, is "mtry", the size of the predictor subset. Further parameters include "ntree", the number of trees, and "nodesize", the minimum number of samples to be kept in a terminal node of the trees (Were et al., 2015). For the interpretation of the RF models, the model function calculates a variable importance measure. This is done by building models which use permutations of a predictor variable. The accuracy of the permuted model is then compared to a model built from the original data. The returned value indicates the decrease in prediction accuracy after permutation.

Cluster analysis
A cluster analysis (CA) was conducted for several purposes.
-CA-1: feature selection and process understanding -CA-2: landscape stratification into subareas for subsampling -CA-3: data stratification in the CV approach In CA-1, k-means clustering was used to split the soil texture data of both depth levels into three clusters to distinguish between sandy, silty, and clayey soils (Fig. 2). The distribution of every predictor value among the three clusters is analysed to (a) determine whether the predictor has any influence on soil texture (feature selection in Sect. 2.2.3) and (b) gain process understanding by analysing the relationship between predictors and soil texture. The clustering was performed with the k-means function using 40 initializations for each of the 30 iterations on the centre-scaled sand, silt, and clay contents. The resulting clusters' predictor ranges at the assigned soil survey sites were retrieved. CA-2 was applied for landscape stratification into homogeneous subareas on behalf of the gridded continuous multivariate predictor data. These data, however, have certain traits which provide a challenge to cluster analysis. These traits are the high dimensionality of the data, correlation between predictor variables, and its consistency of numerical as well as categorical predictors. These issues were tackled by applying a factor analysis of mixed data (FAMD) from the FactoMineR package (Lê et al., 2008) to the dataset. Because of the high resource demand of conducting an FAMD on the whole dataset (33 million grid cells), the function was applied to a random data subset of 100 000 samples first, and then the resulting FAMD model was applied to the whole dataset. In order to allow for the application of the function for FAMD transformation trained on the data subset to the complete dataset, the minimum and maximum values of all numerical predictors and all classes of the categorical predictors were additionally included. The FAMD returns n factors, with the percentage of explained variance decreasing with every factor. A reliable method to determine the number of dimensions to retain from the FAMD is the "elbow" approach (Linting et al., 2007). The contribution of each retained dimension to the percentage of explained variance decreases strongly with the first dimensions until it reaches a nearly constant value. The "elbow" approach suggests using all dimensions before the stagnation of the explained variance. The resulting FAMD-transformed data were clustered using k-means in CA-2.
The number of clusters was determined by the use of cluster validation indices calculated with the NbClust function from the NbClust package (Charrad et al., 2014). The function calculates 27 clustering indices for each clustering solution in a given range of number of clusters. All of the clustering indices cast their vote for their favoured number of clusters. Because of the high computational cost, the function was repeatedly applied (2000 times) to random data subsets of size 2000 of the data resulting from the FAMD. Preliminary test runs have shown that a sample of size 2000 produced stable clustering results. A number of 2 to 17 clusters was tested.
CA-3 was conducted in order to perform a stratified CV. Stratification was conducted on behalf of the response variable. In a first step, the legacy dataset including sand, silt, and clay content was clustered into five equally sized subgroups. From each of the subgroups, the profiles were then in a second step equally assigned to the k folds in order to obtain a similar data distribution in each of the k folds. The clustering was achieved by using a same-size k-means algorithm (Schubert and Zimek, 2019) to divide the profiles of both depth levels into five clusters based on the soil texture.

Feature selection
The RF algorithm is relatively robust against uninformative predictors by only selecting the strongest predictors as a splitting criterion (Hamza and Larocque, 2005;Kuhn and Johnson, 2013), and even though the reduction of the predictor set may not necessarily improve model performance, it can still benefit model interpretability and reduce computational time (Chandrashekar and Sahin, 2014). However, the feature selection based on CA-1 was not only conducted to remove uninformative predictors, but also to study the predictor-response relation.
The clustered profiles were paired with the corresponding predictor values. The Kruskal-Wallis test was used to compare the distributions of predictor values between the three clusters. The resulting p values were adjusted for multiple comparison by controlling the false discovery rate (Benjamini and Yekutieli, 2001). All predictors with significant differences in means between the three clusters (α = 0.05) were used as predictors for the RF models of the respective depth levels. To gain further insight into the predictorresponse relationship, Dunn's test was performed on these predictors as a post hoc analysis. This allows us to determine which clusters show significant differences concerning the particular predictors.
Preliminary results have shown that categorical predictors and the usage of the Cartesian coordinate space can lead to artefacts in the maps of predicted soil texture. Two more models were built in addition to the model using the full predictor set (full) in order to tackle this problem. One model leaves out the petrography and genesis layers as predictors (no geo), and the other leaves out petrography, genesis, longitude, and latitude (no geo + coords).

Strategies for unbalanced data
Statistical sampling from the soil dataset was used in order to create training and validation data better balanced in regard to landscape characteristics corresponding to the interaction of the soil-forming factors. Please compare Ließ (2015Ließ ( , 2020 concerning a detailed discussion of this aspect. This was done by applying four subsampling approaches to the model training data based on the landscape clusters obtained from CA-2. Performance of the models trained on the thereby adapted data was compared to that of models built with the legacy dataset in its original distribution. Subsampling was conducted to match the spatial coverage of the landscape clusters (area-weighted method: AW) or in order to provide a sample that represents each landscape cluster with the same amount of data (equal number approach: EN) (similar to Heung et al., 2016). The subsampled dataset is obtained either by oversampling or undersampling (He and Garcia, 2008). Oversampling obtains the dataset by including all samples from all clusters and then replicating certain randomly selected samples until the desired sample size for each cluster is reached. Undersampling includes all samples from the minority cluster and then randomly draws samples from all other clusters until the desired sample size is obtained. The four applied sampling approaches are displayed in Fig. 3.

Model training, tuning, and evaluation
Model tuning and evaluation for the RF models was conducted by a nested approach of repeated stratified 5-fold CV (5 repetitions). The detailed procedure is shown in Fig. 4a. As a performance measure, the root-mean-square error (RMSE) was derived. In order to make the model performance values comparable for all models, the respective test set was kept the same, while data subsampling was only applied to the respective training sets (Fig. 4b). Furthermore, response data were centred and scaled (SD = 1) to allow for the comparability of model performance between models targeting sand, silt, and clay content. The k folds of the nested approach were derived by stratified sampling regarding the response data. In order to stratify the dataset regarding all three response variables at once, response strata were formed by applying CA-3. Tuning takes place in the inner CV, where the model is evaluated for mtry parameter values within the range of 5 to 25, while ntree was set to 1000 and nodesize to 5. Overall, the model-building procedure was applied six times in order to create individual RF models for each of the three particle sizes for the two soil depths.
For the data analysis and modelling, R version 3.5.1 was used (R Core Team, 2018). All computation was performed on a machine running Windows Server 2016 Standard with four Intel ® Xeon ® Processor E7-8867 v4 and 6.00 TB of memory.

Feature selection
The soil profiles used for model building were split into three groups based on their soil texture with CA-1. A clayey cluster was, thereby, distinguished from a silty cluster and a sandy cluster. Primarily, this was done in order to understand which predictors are best at separating these three groups and therefore are expected to have a high explanatory power in the models to predict spatial soil texture distribution within the investigation area. The soil texture of the profiles at 10 and 70 cm depth and their cluster affiliation is shown in Fig. 2a and b. The spatial distribution of the clusters is shown in Fig. 2c and d. The distribution of cluster affiliation within the study area shows that most of the profiles in the lowlands belong to the silty cluster. This is typical for the soils of this area, which are influenced by loess deposits.
The predictor data at each sampling site were assigned to the soil profile data. The data distribution of the predictors between the three soil texture clusters was then compared by applying a Kruskal-Wallis test. Out of 39 numeric predictors, 27 predictors showed a significant difference of the mean at either 10 or 70 cm depth ( Table 2). The predictors displaying significant differences between any two of the soil texture clusters were included in the random forest models. The trends in differences between the clusters are predominantly in agreement across the two depth levels (Fig. 5). For many of the predictor values with significant differences in the means, it was the silty cluster that was the most distinguishable from the other two clusters. From the 54 statistical tests (27 significant predictors for two soil depths), 51 showed differences between the sandy and silty clusters and 28 showed differences between the clayey and silty clusters, while only 19 tests showed differences between the sandy and clayey clusters.
Since the clustering application used here for feature selection is a filter method, it is unable to take interactions between different predictors into account. This could compromise the efficacy of the feature selection if there are predictor-response relationships which are only revealed in combination with other predictors. The advantage of the clustering method is to create meaningful categories in the data and investigate their relationship with the predictor values, which cannot be provided by a wrapper method.

Landscape stratification for subsampling approaches
CA-2 was conducted in order to subsample from the legacy soil dataset and create a balanced model training set. The FAMD data transformation showed an increased drop of explained variance with the sixth factor, resulting in the first five factors being used as input for CA-2. The NbClust application resulted in 29 % of the votes being appointed to finding two clusters in the data, while the second-largest vote was 13 % in favour of three clusters. Hence, the environmental data of the study area were divided into two clusters using k-means. The resulting clusters broadly divided the study area into the mountainous region and the lowlands (compare    Fig. 1). This way of stratifying the landscape is an apparent choice, since the relatively low number of training samples suggests taking a small number of clusters to have sufficient samples per cluster. Further, the heuristic approach of dividing the landscape, which is often superior to automated classification (MacMillan et al., 2004), also suggests the separation between the highlands and lowlands due to the relatively sharp divide.

Model performance
The predictive performance of the RF models was investigated under different subsampling approaches, a range of mtry tuning parameter values, and three predictor sets. Since the RMSE values for model evaluation were calculated for the modeled response variables scaled to an SD of 1 to provide a comparable metric, the RMSE values can be interpreted as zero being perfect predictability and values over one meaning a worse performance than using the observed mean as the predicted value. The RMSE values of all subsampling approaches for the full predictor set are shown in Fig. 6. The median RMSE is between 0.67 and 0.94, with the silt and sand models clearly outperforming the clay models. For all particle size classes, model performance is better for 10 cm compared to 70 cm depth, with an average difference in the RMSE of 0.08 for clay and silt and 0.12 for sand. This decrease in performance may be due to a decrease in sample size with soil depth. Studies where sample size have been consistent along profile depth have shown that the predictive performance does not necessarily decrease with soil depth (Adhikari et al., 2013;Vaysse and Lagacherie, 2015). There is no consistently better-performing subsampling method. However, both undersampling approaches seem to have higher median RMSE values than the two oversampling methods. It seems likely that the decline in model performance was due to the reduction of the sample size. Using the RMSEs of the whole study area as the selection criteria for the subsampling approach also has its limitations because it does not provide information on the spatial distri- bution of the prediction accuracy. Adding more weight to samples of a certain cluster can lead to increased accuracy in the respective area, while this gain is not necessarily covered by the validation data. The role of subsampling in the distribution of prediction accuracy is exemplarily displayed in Fig. 7. Although there are strong differences in the overall accuracy between clusters, neither of them profits explicitly from a certain subsampling method. The right choice of the subsampling method most likely depends on the underlying data, since other DSM studies have not revealed a distinctly better-performing method. While the EN approach increased model accuracy for the minority class in Heung et al. (2014), Schmidt et al. (2008) found the contrary effect in their study and Moran and Bui (2002) found AW to be the best-performing model. Sharififar et al. (2019) used a combination of oversampling and undersampling to create a balanced dataset which significantly improved model performance, while oversampling and undersampling decreased model performance in Taghizadeh -Mehrjardi et al. (2019).
In order to prevent the occurrence of artefacts, predictors have been retained from model building. This led to a decrease in model performance across all particle sizes and depth layers (Fig. 8). The no geo + coords models showed an average increase in scaled RMSE of 3 %, 7 % and 12 %  for sand silt and clay at 10 cm depth when compared to the full model. The R 2 values for model performance of the full and no geo + coords models are shown in Table 3. Model performance of the silt and sand models at 10 cm depth are comparable to the results of Vaysse and Lagacherie (2015) and de Carvalho Junior et al. (2014), while other publications have shown that R 2 values above 0.5 are achievable (Moore et al., 1993;Gobin et al., 2001;Adhikari et al., 2013). Moore et al. (1993) argue that R 2 values above 0.7 are not to be expected due to the underlying random variability of soil and limitations in the accuracy of measurements. Differences in model performance are most likely to be related to the size and the heterogeneity of the study area and the quality of soil samples. This is illustrated well in Fig. 7, which demonstrates the variability of predictive performance across landscape types.

Model specification
The mtry values for the full predictor model are shown in Fig. 9. There is no clear trend of optimal mtry value with model performance, and many models have a relatively large range of selected mtry values. It is worthwhile mentioning, though, that for certain models the selected mtry value is right at the lower boundary of the tested mtry parameter range, which is the case for the silt model at 10 cm. Accordingly, an extension of this lower boundary and the corresponding lower model complexity would likely have resulted in even better model performance.
Predictor importance is shown in Fig. 10a and b. The better model performance at 10 cm depth is reflected in the overall higher importance values. Altogether, petrography has the highest explanatory power. It should be noted, though, that GUEK 200 petrography was included for both depths, and GUEK 1000 petrography was only included in the model to predict soil texture at 10 cm depth (Table 2). There are few remaining predictors with notably increased predictive ability. These are latitude for silt and sand, positive openness, and the topographic ruggedness index (100 and 200 m radius) for sand and terrain surface texture (200 m radius) for clay.
Omitting the geologic information and coordinates leads to an overall increase in importance values for the remaining predictors (Fig. 10c). The importance value of elevation increased strongly. The same applies for many other topographical predictors, although in a less pronounced manner (positive and negative openness, diffuse insulation, terrain surface texture, 200 m radius).

Spatial prediction
Model output was generated by taking the median of all 25 models (CV procedure with five folds and five repetitions). The predicted, spatially continuous values of the sand, silt, and clay content at 10 cm depth corresponding to the models with the best median predictive performance (Fig. 6) are shown in Fig. 11. It needs to be noted that the maps of predicted values show the results of independent models for different soil texture classes, and the results do not add up to 100 %. The method to scale the data to 100 % should be selected with the purpose of the specific data utilization in mind. Different approaches could include leaving one of the texture classes out and summing up to 100 %, weighted scaling by texture class, or weighted scaling by the regional accuracy of the texture classes.
In the predicted spatial distribution of the sand and silt content, there is a strong regional difference between the lowlands and the mountainous region in the south-west. Sand content generally increases with elevation and is very high in riparian regions and valley bottoms. High silt contents can be expected in the lowlands outside riparian regions. The spatial variability of all three response variables is dominated by categorical predictor traits (petrography) that draw clear boundaries and even transfer artefacts present in the geological map products. However, it is more evident in the sand and clay model output. A limitation of the geologic maps, which is the lack of unity in the naming of geologic units between different geographic regions, but also at federal state boundaries for the GUEK 200, is also reproduced in the results. While the GUEK 1000 was generated by the German Federal Institute for Geosciences and Natural Resources (BGR), the GUEK 200 is a joint product between BGR and the regional geological survey institutions. Although the unit boundaries align across the map tiles, their description may differ because GUEK 200 harmonization at national level is not yet completed. This leads to an abrupt change in predicted sand and silt values in an otherwise homogeneous region (Fig. 11, areal zoom).
These model outputs clearly show the limitation in predictive capacity due to the limitations in the available data to represent the parent material. The prediction of soil texture is predominantly based on parent material, which allows us to distinguish the observed variability of soil texture between the lowlands and the mountains. Once parent material and coordinates are removed, the models increase the importance of those topographic predictors, which can , and (f) are for 70 cm depth. Models were built using either all predictors (full), leaving out the geologic predictors (no geo) or leaving out geology, latitude, and longitude (no geo + coords). The black horizontal lines stand for an RMSE of one, which equals the RMSE of predicting the observed mean.    Table 2. The scale bar shows distance in metres. be used to distinguish between these broad geographic regions (elevation, positive openness, diffuse insulation). Pedogenetic processes related to topography like the lateral redistribution of particles along slopes can only play a minor role, as the low importance values of predictors based on the immediate pixel neighbourhood have low importance values. However, other DSM approaches have successfully captured relief-based variability of soil texture on the scale of hillslopes using only topographical predictors (Moore et al., 1991;De Bruin and Stein, 1998;McBratney et al., 2000).
The inclusion of expert knowledge such as geological map products in machine learning models for the spatial continuous soil prediction at high resolution still requires further investigation. While the geologic maps have strong predictive power, they consist of too many geologic units. This leads to some units not having a sufficient number of soil samples to be able to generalize for that unit. However, our approach of reducing the number of geologic units by creating the particle size class bins was not able to produce useful predictors. One approach could be to use expert knowledge to merge geologic units that have parent material with similar soil texture classes. This merging should happen under the restriction that the resulting units should be as homogeneous as possible while providing enough samples for training and validation. Solving the issue of abrupt change in predicted values across geologic units could possibly be addressed by a fuzzy approach. Additionally, knowledge of the level of certainty of boundary demarcation between geologic units could be used to create fuzzy geologic maps as predictors.
In order to tackle the issue of artefacts present in the model output, two more models with a reduced set of predictors were built. The models using all predictors as specified in Table 2 (full) were compared to models leaving out the geologic predictors (no geo) or leaving out geology, latitude, and longitude (no geo + coords). Although the dominance of the categorical predictors on the model output was lifted in the "no geo" model version, an artefact due to the predictors longitude and latitude emerged. This new phenomenon appeared as a horizontal or vertical abrupt change in the predicted values across major parts of the study area (not shown). This aspect has already been observed in other DSM applications employing recursive partitioning algorithms (Behrens et al., 2018;Hengl et al., 2018;Nussbaum et al., 2018). Møller et al. (2020) addressed this problem with oblique geographic coordinates and provide an overview of ready applied approaches. Accordingly, we tested the usage of three Euclidean distance fields instead of Cartesian coordinates. However, the use of this alternative coordinate system led to the emergence of radial artefacts (results not shown).
The additional omission of latitude and longitude from the predictors leads to smoother maps, where only minor abrupt boundaries exist due to land cover, which is also a categorical predictor (Fig. 12). However, this aspect has to be differentiated from that of the geologic predictors. CORINE land cover classes were classified in remote sensing data products. Hence, spatial class boundaries do not reflect expert knowledge. Abrupt changes might in fact be due to land cover changes. The large agricultural fields of the lowlands are heavily impacted by wind erosion of the loess material during bare soil conditions. These "no geo + coords" predictions reproduce the spatial variability of the "full" model, even on relatively small scales. Strong deviations between the two model versions are in the eastern Harz region and in the riparian zones of the southern lowlands.
The differences in sand and silt content between the Harz and the lowlands were most likely derived from the predictors' elevation and positive openness. These predictors are strongly correlated (−0.95), have high importance values in both predictor sets, and show strong significant differences between the texture clusters (Fig. 5). The other predictors of Fig. 5 have lower values of absolute correlation with elevation (0.27-0.37) while still having a significant effect on the texture clusters. These predictors were more likely related to the variability within the two large-scale regions. The output of the "no geo + coords" models shows much more variability on smaller scales than the "full" models.

Conclusions
Our DSM approach has shown that RF is an appropriate method to model the variability of soil texture in the study area. The predictive performance of the silt and sand models is within the range of similar studies, while the prediction of the clay content did not seem feasible.
Clustering applications appear to be a versatile tool to be employed at various steps of the DSM procedure. Using a clustering application for feature selection offers additional insight into the predictor-response relationship, while clustering to conduct a stratified CV allowed for a robust model evaluation. Overall, stratified k-fold CV is common in DSM. To use the described cluster application allows for a simultaneous stratification regarding multiple response variables. However, to truly evaluate the power of this filter method, it would have to be compared to other feature selection methods which would have exceeded the workload for this study. We intend to do so in future studies.
The biggest area of application for data clustering in DSM appears to be in landscape stratification to divide the landscape into homogeneous subareas. Beyond their usage for stratified sampling and subsampling, the resulting stratification of the study area has further potential, like the use of landscape strata as predictors, the construction of individual models per landscape stratum, or the interpretation of the predictor-response relationship in different landscapes. A remaining difficulty in clustering applications is the determination of the number of clusters. Here, the combinations of clustering indices and heuristic methods have proven to be useful tools. Finally, clustering applications could also provide solutions to the problems encountered during the model-building procedure, like the replacement of the Cartesian coordinates, the inclusion of expert knowledge, pooling of geologic units, and blurring of the transitions between geologic units.