Decision tree algorithms, such as random forest, have
become a widely adopted method for mapping soil properties in geographic
space. However, implementing explicit spatial trends into these algorithms
has proven problematic. Using
Machine learning has become a frequently applied means for mapping soil properties in geographic space. The most common approach is to train models from soil observations and covariates in the form of geographic data layers. The models can often provide reliable predictions of soil properties. Many researchers have used decision tree algorithms as they are computationally efficient, do not rely on assumptions about the distributions of the input variables, and can use both numeric and categorical data (Quinlan, 1996; Mitchell, 1997; Rokach and Maimon, 2005; Tan et al., 2014). Additionally, they effectively handle nonlinear relationships and complex interactions (Strobl et al., 2009).
However, a disadvantage of decision tree models is that they do not explicitly take into account spatial trends in the data. Unlike geostatistical methods, such as kriging, the predictions can therefore contain spatial bias.
A number of studies have applied regression kriging (RK) as a solution (Knotters et al., 1995; Odeh et al., 1995; Hengl et al., 2004). By kriging the residuals of the predictive model and adding the kriged residuals to the prediction surface, this approach can account for spatial trends and achieve higher accuracies. A disadvantage of RK is that splitting the analysis into two separate models hinders the integration of spatial trends with the other covariates. Spatial trends therefore remain disconnected from the other statistical relationships in the analysis, which makes the model and its associated uncertainties difficult to interpret.
An obvious solution to this problem would be to use the
Several researchers have proposed solutions to this problem.
Behrens et al. (2018) proposed the use of Euclidean distance fields
(EDFs) in the form of distances to the corners and middle of the study area
and the
On the other hand, Hengl et al. (2018) suggested an approach referred to as spatial random forest (RFsp). This method consists of calculating data layers with buffer distances to each of the soil observations in the training dataset. It then trains a random forest model, using the buffer distances as covariates, either combined with auxiliary data or on their own. One of the main advantages of this approach is that it incorporates distances between observations in a similar manner to geostatistical models. The authors assessed the use of RFsp on a large number of spatial prediction problems and showed that it effectively eliminated spatial trends in the residuals.
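The buffer-distance covariates described above can be illustrated with a minimal sketch. It assumes plain Euclidean distances on projected coordinates and uses NumPy; the function name `buffer_distance_layers` and the toy coordinates are hypothetical and are not taken from Hengl et al. (2018).

```python
import numpy as np

def buffer_distance_layers(obs_xy, grid_xy):
    """Distance from every grid cell to every observation.

    Returns an array of shape (n_cells, n_obs): one covariate column
    per observation, which is why the number of layers grows with the
    size of the training dataset.
    """
    obs_xy = np.asarray(obs_xy, dtype=float)
    grid_xy = np.asarray(grid_xy, dtype=float)
    # Broadcast to (n_cells, n_obs, 2) coordinate differences.
    diff = grid_xy[:, None, :] - obs_xy[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

# Toy example (hypothetical coordinates): two observations, three cells.
obs = np.array([[0.0, 0.0], [3.0, 4.0]])
grid = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0]])
layers = buffer_distance_layers(obs, grid)
print(layers.shape)  # one column per observation
```

The resulting columns would then serve as covariates for a random forest model, either on their own or alongside auxiliary data.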
Although these two methods are able to integrate spatial trends in machine-learning models, they can be difficult to interpret. The distances used in EDFs depend on the geometry of the study area, and for RFsp, they depend on the locations of the soil samples. The meaning and interpretation of the distances therefore vary depending on the study area and the soil observations.
EDFs and RFsp also have limited flexibility as both methods specify the number of geographic data layers a priori. For EDFs, the number of distance fields is seven, and for RFsp, the number of buffer distances is equal to the number of soil observations. This means that there is no straightforward way to increase the number of spatially explicit covariates if the number is insufficient to account for spatial trends. Conversely, there is no way to decrease the number of spatially explicit covariates, even if a smaller number would suffice. The latter is especially relevant for RFsp as the method is computationally unfeasible for datasets with a large number of observations (Hengl et al., 2018).
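For comparison, a fixed set of seven EDF layers could be computed as in the sketch below. It rests on two assumptions that the text above does not confirm: that the seven layers are the two raw coordinates plus the distances to the four corners and to the middle, and that the corners are those of the bounding box of the supplied cells. The function name is hypothetical.

```python
import numpy as np

def euclidean_distance_fields(x, y):
    """Seven spatial covariates per cell (assumed composition):
    x, y, distances to four bounding-box corners, distance to the middle."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xmin, xmax, ymin, ymax = x.min(), x.max(), y.min(), y.max()
    # Assumption: the study area is approximated by its bounding box.
    corners = [(xmin, ymin), (xmin, ymax), (xmax, ymin), (xmax, ymax)]
    middle = ((xmin + xmax) / 2.0, (ymin + ymax) / 2.0)
    layers = [x, y]
    for cx, cy in corners + [middle]:
        layers.append(np.hypot(x - cx, y - cy))
    return np.column_stack(layers)

# Toy cells (hypothetical): four corners and the centre of a 4 x 3 box.
cell_x = np.array([0.0, 4.0, 0.0, 4.0, 2.0])
cell_y = np.array([0.0, 0.0, 3.0, 3.0, 1.5])
edf = euclidean_distance_fields(cell_x, cell_y)
print(edf.shape)  # the number of columns is fixed, whatever the data
```

The fixed column count illustrates the limited flexibility discussed above: the number of layers cannot be tuned to the strength of the spatial trend.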
In this study, we propose an alternative method for including spatially
explicit covariates for mapping soil properties. With the method, we aim to
directly address the cause of the orthogonal artifacts produced with
We refer to the method as oblique geographic coordinates (OGCs). In short, it
works by calculating coordinates for the observations along a series of
axes tilted at several oblique angles relative to the
We test the method on four spatial datasets. Firstly, we test it for
predicting soil organic matter contents in a densely sampled agricultural
field in Denmark, located in northern Europe. Secondly, we test it on three
publicly available spatial datasets (
We test OGCs and compare them to other methods based on four spatial datasets. Firstly, we test them for predicting soil organic matter (SOM) contents in an agricultural field in Denmark (Vindum). Secondly, we test them on three publicly available datasets. For Vindum, we present methods and results in detail. For the other three datasets, we present methods and results in brief, while Appendix A contains a detailed presentation of the methods and results for these datasets.
This study area is a 12 ha agricultural field located in Denmark in northern
Europe (9.568
The SOM contents of the topsoil in the field range from 1.3 % to 38.8 %, with a mean value of 3.5 % and a median of 2.2 %. The values have a strong positive skew of 4.7 and are leptokurtic with a kurtosis of 26.9. Logarithmic transformation reduces the skewness to 2.9 and the kurtosis to 11.1. Pouladi et al. (2019) described the spatial structure of the data using a stable variogram model with a range of 139 m, a nugget of 0 and a sill of 23.8.
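The effect of a logarithmic transformation on skewness and kurtosis can be illustrated with a short sketch. It uses simple moment-based estimators and synthetic, lognormally distributed values; the estimators behind the figures reported above are not stated, so the definitions below, including the use of non-excess kurtosis, are assumptions, and the synthetic values are not the Vindum data.

```python
import numpy as np

def skewness(values):
    """Moment-based skewness: m3 / m2**1.5."""
    d = np.asarray(values, dtype=float)
    d = d - d.mean()
    return float((d ** 3).mean() / (d ** 2).mean() ** 1.5)

def kurtosis(values):
    """Moment-based (non-excess) kurtosis: m4 / m2**2."""
    d = np.asarray(values, dtype=float)
    d = d - d.mean()
    return float((d ** 4).mean() / (d ** 2).mean() ** 2)

# Synthetic right-skewed values standing in for SOM contents (hypothetical).
rng = np.random.default_rng(0)
som = rng.lognormal(mean=1.0, sigma=0.8, size=285)

print(skewness(som), kurtosis(som))
print(skewness(np.log(som)), kurtosis(np.log(som)))
```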
For additional analyses, we included the
Illustration for the derivation of the oblique geographic
coordinate for the point (
The method that we propose consists of calculating coordinates along a
number of axes tilted at various oblique angles, relative to the
As the
Examples of rasters with coordinates tilted at six different angles for the Vindum study area. Easting and northing for the Universal Transverse Mercator (UTM) zone 32N (ETRS89).
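The idea of coordinates along tilted axes can be sketched as a projection of each location onto a unit vector at a given angle. The formula below is a standard rotation identity, and the even spacing of the angles over half a circle is an assumption; neither is quoted from the method description, and the function name is hypothetical.

```python
import numpy as np

def oblique_coordinates(x, y, n_angles):
    """Coordinates of points (x, y) along n_angles tilted axes.

    Assumption: angles are evenly spaced over [0, pi), so that
    n_angles = 2 reproduces the ordinary x and y coordinates.
    Returns (thetas, coords) with coords of shape (n_points, n_angles).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    thetas = np.pi * np.arange(n_angles) / n_angles
    # Projection onto the unit vector (cos(theta), sin(theta)).
    coords = np.outer(x, np.cos(thetas)) + np.outer(y, np.sin(thetas))
    return thetas, coords

# Toy points (hypothetical): with two angles the columns are x and y.
px = np.array([1.0, 0.0])
py = np.array([0.0, 2.0])
thetas, coords = oblique_coordinates(px, py, 2)
print(np.round(coords, 6))
```

Because the number of angles is a free parameter, the number of spatial covariates can be raised or lowered as needed.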
We use the 285 SOM observations from the Vindum study area in order to test the accuracy of predictions made by random forest models using OGCs as covariates. In addition to OGCs, we also employed 19 data layers with auxiliary data, which Pouladi et al. (2019) derived from a 1.6 m digital elevation model (DEM), satellite imagery and electromagnetic induction. Topographic variables included the sine and cosine of the aspect, depth of sinks, plan and profile curvature, elevation, flow accumulation, valley bottom flatness, midslope position, standard and modified topographic wetness index, slope gradient, slope length, and valley depth. Satellite imagery included normalized difference, absolute difference, ratio and soil-adjusted vegetation indices. Lastly, we used the apparent electrical conductivity from a DUALEM-1 sensor in perpendicular mode.
In order to optimize the number of raster layers for OGCs, we generated
datasets with 2–100 coordinate rasters. We then trained random forest
models from each dataset, both with and without auxiliary data. In order to
assess predictive accuracy, we used 100 repeated splits on the SOM
observations, each using 75 % of the observations for model training and a
25 % holdout dataset for accuracy assessment. We trained models using the
R package
We used the same 100 repeated splits for each number of coordinate rasters,
with and without auxiliary data. We calculated accuracy based on Pearson's
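The repeated-split procedure can be sketched as follows, assuming simple random splits without stratification and showing only Pearson's correlation and the root mean square error (RMSE). This is an illustration in Python with NumPy rather than the R workflow used in the study, and all function names are hypothetical.

```python
import numpy as np

def repeated_splits(n_obs, n_repeats=100, train_fraction=0.75, seed=0):
    """Return a list of (train_idx, test_idx) pairs.

    Generating the splits once and reusing them for every method and
    every number of coordinate rasters keeps the comparisons paired.
    """
    rng = np.random.default_rng(seed)
    n_train = int(round(train_fraction * n_obs))
    splits = []
    for _ in range(n_repeats):
        order = rng.permutation(n_obs)
        splits.append((order[:n_train], order[n_train:]))
    return splits

def pearson_r(observed, predicted):
    """Pearson correlation between observations and predictions."""
    return float(np.corrcoef(observed, predicted)[0, 1])

def rmse(observed, predicted):
    """Root mean square error of the predictions."""
    diff = np.asarray(observed, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

splits = repeated_splits(285)
train_idx, test_idx = splits[0]
print(len(splits), len(train_idx), len(test_idx))
```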
We then compared the accuracies obtained with the optimal numbers of
coordinate rasters, with and without auxiliary data, to the accuracies
obtained with other methods. We tested kriging, random forest models trained
only on the auxiliary data and random forest models trained using EDFs and
RFsp, with and without auxiliary data. We trained the random forest models
using the same procedure outlined above. For kriging, we used variograms
automatically fitted on logarithmic-transformed SOM observations using the
We used the same 100 repeated splits for assessing the accuracies of all
methods. This allowed us to carry out pairwise
We also investigated the covariate importance of models trained with OGCs and
tested all methods for spatially autocorrelated residuals using experimental
variograms. To produce sample variograms of the residuals, we produced maps
with each method using all observations. We converted both observations and
predictions to a natural logarithmic scale. We then subtracted the predictions
from the observations and calculated variograms for these residuals. For
this purpose, we used the function
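The residual check can be illustrated with a minimal sketch of an experimental variogram computed on log-scale residuals. It uses the classical estimator, half the mean squared difference per distance bin, and is a NumPy illustration rather than the variogram function used in the study; the function names, bin edges and toy values are hypothetical.

```python
import numpy as np

def log_residuals(observed, predicted):
    """Observations minus predictions, both on a natural log scale."""
    return np.log(observed) - np.log(predicted)

def experimental_variogram(xy, values, bin_edges):
    """Mean semivariance of all point pairs within each distance bin."""
    xy = np.asarray(xy, dtype=float)
    values = np.asarray(values, dtype=float)
    i, j = np.triu_indices(len(values), k=1)  # every pair once
    dist = np.hypot(*(xy[i] - xy[j]).T)
    semivar = 0.5 * (values[i] - values[j]) ** 2
    bins = np.digitize(dist, bin_edges) - 1
    gamma = np.full(len(bin_edges) - 1, np.nan)  # NaN marks empty bins
    for b in range(len(bin_edges) - 1):
        mask = bins == b
        if mask.any():
            gamma[b] = semivar[mask].mean()
    return gamma

# Toy example (hypothetical): three points on a line.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
res = np.array([0.0, 1.0, 3.0])
gamma = experimental_variogram(pts, res, bin_edges=[0.5, 1.5, 2.5])
print(gamma)
```

A variogram that stays flat across distance bins would indicate little remaining spatial dependence in the residuals.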
We also compared OGCs to other methods based on the three additional datasets
Effects of the number of coordinate rasters on the accuracy of SOM
predictions on the Vindum dataset, calculated as Pearson's
For the Vindum dataset, accuracies of predictions obtained with OGCs, without
auxiliary data, increased with the number of coordinate rasters up to an
optimum at seven coordinate rasters (Fig. 4).
However, with more than seven coordinate rasters, accuracies deteriorated
slightly with the number of coordinate rasters. This pattern was the same
for all three metrics. On the other hand, with OGCs in combination with
auxiliary data, accuracies generally increased with the number of coordinate
rasters. The increase was greatest when the number of coordinate rasters was
small, while the effect of more coordinate rasters decreased for larger
numbers of coordinate rasters. With auxiliary data, the optimal number of
coordinate rasters was 94 for Pearson's
Maps of soil organic matter (SOM) contents in the topsoil at Vindum predicted using random forest models trained with coordinate rasters at two to seven different angles as covariates. Easting and northing for UTM zone 32N (ETRS89).
Figure 5 shows SOM contents mapped for Vindum with
increasing numbers of coordinate rasters, without auxiliary data. The
predictions with only two coordinate rasters showed a pattern very typical
of predictions with
Maps of soil organic matter (SOM) contents in the topsoil at
Vindum predicted using random forest models trained using auxiliary data in
conjunction with coordinate rasters at
With auxiliary data, the effect of increasing the number of coordinate rasters was less clearly visible for the Vindum dataset (Fig. 6). Even with only two coordinate rasters, the predictions had no orthogonal artifacts. However, they contained noisy patterns and sharp boundaries in some areas. These are most likely artifacts of the auxiliary data. For example, using a high-resolution DEM may have created noise in the predictions. However, with coordinate rasters at 80 different angles, the spatial pattern of the predicted SOM contents became substantially smoother, with a reduction in both noise and sharp boundaries. Furthermore, some areas with moderately high SOM contents became more clearly visible and coherent, for example, in the area approximately one-third of the way from the western to the northern corner of the study area. The predicted patterns with a higher number of coordinate rasters were therefore not only more accurate but also more realistic.
Violin plots showing accuracies of soil organic matter predictions
on the Vindum dataset with kriging, and random forest models trained using
either auxiliary data (AUX), Euclidean distance fields (EDFs), distances to
observations (RFsp), oblique geographic coordinates (OGCs), or EDFs, RFsp, or
OGCs in conjunction with AUX. The figure shows Pearson's
For the three additional datasets, the effect of increasing the number of
coordinate rasters without auxiliary data was generally the same as for the
Vindum dataset. In all three cases, there was relatively little, if any,
increase in accuracy after an initially very steep increase. For the
As for the Vindum dataset, the optimal number of coordinate rasters was
generally larger in combination with auxiliary data than without auxiliary
data. For the
In summary, the combination of OGCs with auxiliary data generally increased
the optimal number of coordinate rasters. Furthermore, in some cases,
accuracy deteriorates when the number of coordinate rasters surpasses an
optimal value, while in other cases it reaches a plateau. The decrease in
accuracy past the optimum may be due to the correlation between the coordinate
rasters. Coordinates
With OGCs in combination with auxiliary data, the process-based covariates in
the auxiliary data most likely help to reduce the effect of correlation
between the coordinate rasters. Furthermore, in this case, the number of
coordinate rasters also affects the relative weighting between the auxiliary
data and the coordinate rasters. When
At present, several factors could therefore explain the optimal number of coordinate rasters for each dataset with and without auxiliary data. The exact interplay between these factors is unclear, and the best option may therefore be to experiment with different numbers of coordinate rasters.
For all four datasets, there were large overlaps in the accuracies of the
methods, as accuracies varied across the 100 repeated splits
(Figs. 7, A2, A4 and A6).
However, an analysis on the Vindum dataset revealed that the accuracies
generally correlated between the methods across the repeated splits. The
mean correlation coefficient (Pearson's R) was 0.52 (0.19–0.88) for
Auxiliary data variables used as covariates in the study including the name, the description, the mean value and the range. Pouladi et al. (2019) describe the derivation of the variables.
Ranks for the accuracies of the methods on the Vindum dataset
calculated as Pearson's
For the Vindum dataset, kriging achieved the highest rank for
Auxiliary data used on their own, and RFsp without auxiliary data, had the lowest rank for all three accuracy metrics on the Vindum dataset. Furthermore, OGCs without auxiliary data had the same rank as EDFs without auxiliary data for all three accuracy metrics.
Pouladi et al. (2019) tested several methods for predicting SOM on the Vindum dataset, including kriging and the machine-learning algorithms cubist and random forest, with and without kriged residuals. The authors found that kriging provided the most accurate predictions of SOM. The results for Vindum affirm the high accuracy of kriging predictions, but they also show that random forest models combining auxiliary data with spatial trends can achieve similar accuracies.
Ranked accuracies obtained with each method on the
For the
For the
Ranks of the accuracies (percent cases correctly predicted)
obtained with each method on the
For the Swiss rainfall dataset, OGCs were the most accurate method for all three metrics (Table 5). RFsp was the second most accurate method, followed by EDFs. OK was the least accurate method.
Ranked accuracies on the Swiss rainfall dataset for each method.
Pairwise
In summary, for Vindum,
It is important to consider that in most cases all methods yielded acceptable accuracies. Although the differences between the accuracies of the methods were in many cases statistically significant, they were generally small. However, the results show that OGCs compare well with other methods for integrating spatial trends in machine-learning models.
Prediction of soil organic matter (SOM) contents for the topsoil
at Vindum using
For the Vindum dataset, kriging produced a smooth prediction surface, which
is very common for this method (Fig. 8a). The
prediction surface with EDFs was mostly smooth, but it also contained a
distinct “rings in the water” artifact caused by the raster with the
distance to the middle of the study area (Fig. 8b). The prediction surface with RFsp was smoother than the prediction
surface produced by kriging (Fig. 8c). The
predictions with only auxiliary data were very similar to the predictions
made with
Zinc contents predicted with each method for the
For the
Soil types predicted with each method for the
For the
Maps of rainfall on 8 May 1986 in Switzerland predicted with each
method. Northing and easting are coordinates according to the Swiss
coordinate system LV95.
For the Swiss rainfall dataset, OK produced a smooth, highly anisotropic prediction surface (Fig. 11a). The prediction surfaces of EDFs, RFsp and OGCs also showed anisotropy, but they were generally smoother and more rounded. For example, with OK, some individual observations showed an effect on the prediction surface as elongated spots in the direction of the anisotropy. With the other three methods, a few individual observations showed an effect in the prediction surface, but their effects are more rounded and less distinct. The predictions with EDFs, RFsp and OGCs therefore appear more general than the OK predictions. Moreover, the prediction surfaces of EDFs, RFsp and OGCs appear very similar.
Experimental variograms for the residuals of the SOM predictions made with each method for the Vindum dataset. The variograms use residuals from natural logarithmic-transformed SOM measurements and predictions. AUX – auxiliary data; EDFs – Euclidean distance fields; RFsp – spatial random forest; and OGCs – oblique geographic coordinates.
For the Vindum dataset, the residuals of the SOM predictions had some degree
of spatial dependence for all methods except kriging
(Fig. 12). This finding contrasts with Hengl et al. (2018), who found that there was no spatial trend in the residuals of
predictions with RFsp. EDFs, RFsp and OGCs used without auxiliary data had the
most spatially dependent residuals. However, the residuals of the combined
methods (EDFs
Covariate importance of the model using OGCs in combination with auxiliary data for Vindum. The importance of OGCs represents the sum of the importance of the coordinate rasters at 80 different angles.
For the Vindum dataset, the most important covariate from the auxiliary data was the depth of sinks (Table 6). The most likely reason for its high importance is the presence of a large sink with very high SOM contents northwest of the middle of this study area (Fig. 1). As sinks trap surface runoff, they often have wet conditions, which give rise to peat accumulation.
When used in combination with the auxiliary data, the importance of the
individual coordinate rasters varied from 0.6 % to 3.1 % of the
importance of the depth of sinks, with a mean value of 1.7 %. The most
important coordinate raster had
Covariate importance of the coordinate rasters at various angles
for Vindum.
Figure 13 shows the importance of the coordinate
rasters relative to
Without auxiliary data, the most important coordinate rasters had a general
northwestern to southeastern angle (Fig. 13). On the
other hand, the coordinate rasters with angles between a north–south and a
northeast–southwest axis had low importance. The most likely reason for this
pattern is the location of the sink with very high SOM contents to the
northwest of the middle of this study area. This creates a large difference
in the SOM contents of the northwestern and southeastern parts of the study
area, giving large importance to covariates that can explain this
difference. Additionally, the northwestern side of the sink has a very steep
slope, creating a steep gradient in SOM contents in this direction. A stable
variogram showed anisotropy along a north–northeast to south–southwest
axis
On the other hand, with OGCs in combination with auxiliary data, the most
important coordinate rasters had tilt angles close to a north–south axis
(
Orthophoto of the study area from 27 September 2016 (Esri, 2019). Sources: Esri, DigitalGlobe, Earthstar Geographics, CNES/Airbus DS, GeoEye, USDA FSA, USGS, Aerogrid, IGN, IGP, and the GIS User Community.
A possible cause of the anisotropy in the residuals may be the plowing
direction. The main plowing direction in the Vindum study area is along an
east–northeast to west–southwest axis (
At Vindum, the three most accurate methods were kriging, RFsp with auxiliary
data and OGCs with auxiliary data. For
Although kriging was in most cases less accurate than other methods, some soil mappers would probably still choose it for mapping soil properties due to its computational efficiency and conceptual simplicity. However, aside from accuracy, an advantage of methods based on machine learning lies in the fact that they provide larger amounts of information than geostatistical models. Kriging in itself does not provide information on the processes that control spatial variation in soil properties, but machine-learning models can include covariates related to soil processes, providing information on the processes that are most likely to affect the spatial distribution of a soil property.
With spatial approaches such as EDFs, RFsp and OGCs, researchers can
incorporate feature space and geographic space in a machine-learning model.
Of the previously used approaches, OGCs are most similar to EDFs, as they used
the
One advantage of using spatially explicit covariates (EDFs, RFsp or OGCs) is that researchers can interpret local and spatial effects at once. In this regard, OGCs have an advantage over EDFs and RFsp, as it is clear what the coordinate rasters represent. It is less clear how researchers should interpret distances to the corners of the study area or the distance to a specific observation. We have also shown that it is straightforward to illustrate covariate importance of OGCs.
Furthermore, an advantage of OGCs relative to RFsp is that OGCs required fewer
covariates to achieve the same accuracy. In fact, without auxiliary data,
OGCs achieved a higher accuracy with a smaller number of covariates for the
datasets of Vindum,
We stress that, as a rule, soil mappers should not use machine-learning
models relying only on spatial trends, as EDFs, RFsp and OGCs all yielded
lower accuracies without auxiliary data for the soil datasets (Vindum,
We have shown in this study that the use of oblique geographic coordinates
(OGCs) is a reliable method for integrating auxiliary data with spatial
trends for modeling and mapping soil properties. In most cases, the method
eliminated the orthogonal artifacts that arise from the use of
OGCs are more interpretable than previous similar approaches, and more flexible, as it is possible to adjust the number of coordinate rasters. This should allow soil mappers to find a good compromise between accuracy and computational efficiency for mapping soil properties, as the optimal number of coordinate rasters may vary depending on the study area and the soil property in question.
At this point, we have only tested the method for three soil datasets and one meteorological dataset. It will therefore be highly relevant to test the method for other soil properties and areas. It will especially be relevant to test the method in larger, less densely sampled areas. Previous studies have shown that machine learning is likely to provide higher accuracies in such areas (Zhang et al., 2008; Greve et al., 2010; Keskin et al., 2019), and it will be relevant to test if this is also the case for oblique geographic coordinates. Results from the Vindum and the Swiss rainfall datasets also suggest that the method can be useful for mapping variables with anisotropic spatial distributions, and it will therefore be relevant to test it on datasets with a high degree of anisotropy. Lastly, one should note that we carried out this study for relatively small areas using “flat” coordinate systems. Using OGCs for larger areas and other coordinate systems may require alterations to the method.
We call upon researchers within digital soil mapping to aid us in testing oblique geographic coordinates as covariates for additional datasets, and we have therefore made the function for generating oblique geographic coordinates available as an R package. Moreover, to allow other researchers to test methods on the Vindum dataset, we have made it available as part of the same package.
We mapped zinc contents for the
We tested all the methods applied to the Vindum dataset, with the addition
of regression kriging (RK). We used random forest models trained on the
auxiliary data for regression and then kriged the residuals using the
function
We mapped soil types for the
The
The dataset is highly clustered, which is likely to affect accuracy
assessments, as some areas have much higher point densities than others. To
counter this effect, we organized the data in 100 groups using
As we aimed to predict a categorical variable, we did not use kriging. Furthermore, due to the large size of the dataset, we did not use RFsp, as this would require us to produce more than 2000 raster layers with buffer distances. Hengl et al. (2018) avoided this by calculating only buffer distances to each soil type. However, we did not choose this solution as it would create problems for accuracy assessment. If a raster layer contains distances to test observations and training observations, the result would be circular logic, invalidating the accuracy assessment. Buffer distances based only on the training data would be less problematic. However, as we used 100 repeated splits, this was not an option.
We therefore tested only five methods for the
Due to the large size of the dataset, model training was slower than for the
other datasets. We therefore tuned a random forest model only once for each
method and used the resulting parameterization for all 100 data splits. For
each split, we calculated the accuracy on the test data as the fraction of
observations correctly predicted. We carried out pairwise
We produced maps of soil types with each of the five methods in order to compare results.
The Swiss rainfall dataset contains 467 rainfall observations from
Switzerland from 8 May 1986. We did not use any covariates for this
dataset, and we therefore tested only purely spatial methods. We tested
ordinary kriging with correction for anisotropy, EDFs, RFsp and OGCs. As for
the Vindum dataset, we tested each method with 100 repeated splits into
training data (75 %) and test data (25 %). For each split, we calculated
Pearson's
For the
For the
For the
For the
For the Swiss rainfall dataset, the accuracy of OGCs generally increased with
the number of coordinate rasters (Fig. A5). The
increase in accuracy was steep at first, then gradual. For Pearson's
As for the other datasets, variation in accuracies on the Swiss rainfall
dataset was greater between the splits into training and test data than
between the methods (Fig. A6). The distributions
of RMSE were mostly symmetric, but the distributions of
Accuracy of predictions on the
Violin plots showing the accuracies obtained on the
Accuracy (percent of cases correctly predicted) of predictions on
the
Violin plot showing the accuracies obtained on the
Accuracy of predictions on the Swiss rainfall dataset versus the number of rasters with oblique geographic coordinates. The values are averages obtained with 100 splits into training and test data.
Violin plots showing the accuracies obtained on the Swiss rainfall dataset with each method. The plots show values obtained with 100 splits into training and test datasets.
The function for generating oblique geographic coordinates is available as an R package at
ABM and NP prepared the data. ABM carried out the analyses and prepared the paper with inputs from all coauthors.
The authors declare that they have no conflict of interest.
We are obliged to the two anonymous referees and to Alexandre Wadoux, who provided vital feedback on the paper. Their comments and advice have greatly improved the paper, and we give them our thanks.
This paper was edited by Kristof Van Oost and reviewed by two anonymous referees.