Oblique geographic coordinates as covariates for digital soil mapping

Møller, Anders Bjørn; Beucher, Amélie Marie; Pouladi, Nastaran; Greve, Mogens Humlekrog

doi:https://doi.org/10.5194/soil-6-269-2020

Articles | Volume 6, issue 2

Original research article

14 Jul 2020

Abstract

Decision tree algorithms, such as random forest, have become a widely adopted method for mapping soil properties in geographic space. However, implementing explicit spatial trends into these algorithms has proven problematic. Using x and y coordinates as covariates gives orthogonal artifacts in the maps, and alternative methods using distances as covariates can be inflexible and difficult to interpret. We propose instead the use of coordinates along several axes tilted at oblique angles to provide an easily interpretable method for obtaining a realistic prediction surface. We test the method on four spatial datasets and compare it to similar methods. The results show that the method provides accuracies better than or on par with the most reliable alternative methods, namely kriging and distance-based covariates. Furthermore, the proposed method is highly flexible, scalable and easily interpretable. This makes it a promising tool for mapping soil properties with complex spatial variation.

Received: 25 Oct 2019 – Discussion started: 07 Nov 2019 – Revised: 11 May 2020 – Accepted: 18 Jun 2020 – Published: 14 Jul 2020

1 Introduction

Machine learning has become a frequently applied means for mapping soil properties in geographic space. The most common approach is to train models from soil observations and covariates in the form of geographic data layers. The models can often provide reliable predictions of soil properties. Many researchers have used decision tree algorithms as they are computationally efficient, do not rely on assumptions about the distributions of the input variables, and can use both numeric and categorical data (Quinlan, 1996; Mitchell, 1997; Rokach and Maimon, 2005; Tan et al., 2014). Additionally, they effectively handle nonlinear relationships and complex interactions (Strobl et al., 2009).

However, a disadvantage of decision tree models is that they do not explicitly take into account spatial trends in the data. Unlike predictions from geostatistical methods, such as kriging, their predictions can therefore contain spatial bias.

A number of studies have applied regression kriging (RK) as a solution (Knotters et al., 1995; Odeh et al., 1995; Hengl et al., 2004). By kriging the residuals of the predictive model and adding the kriged residuals to the prediction surface, this approach can account for spatial trends and achieve higher accuracies. A disadvantage of RK is that splitting the analysis between two models hinders the integration of spatial trends with the other covariates. Spatial trends therefore remain disconnected from other statistical relationships in the analysis, leading to difficulties in interpreting the model and its associated uncertainties.

An obvious solution to this problem would be to use the x and y coordinates of the soil observations as covariates. However, results have shown that this approach can lead to unrealistic orthogonal artifacts in the output maps when used in conjunction with decision tree algorithms (Behrens et al., 2018; Hengl et al., 2018; Nussbaum et al., 2018). The cause of this problem lies in the splitting procedure of decision tree algorithms, as they use only one covariate for each split. Therefore, a dataset containing only the x and y coordinates will force the algorithm to make orthogonal splits in geographic space.
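The splitting behaviour described above can be illustrated with a minimal sketch. It assumes a hypothetical 10 × 10 grid with a target that changes across a diagonal line, and a simplified split search (`best_split`, `sum_sq`) that stands in for the splitting step of a regression tree; none of these names or data come from the study. With only x and y, every candidate split is parallel to an axis and cannot separate the two groups. Adding one coordinate along an axis tilted at 45° lets a single split do so.

```python
import math

def sum_sq(values):
    """Sum of squared deviations from the mean (the split cost)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows, targets):
    """Return (cost, covariate_index, threshold) for the best one-covariate split."""
    best = None
    for j in range(len(rows[0])):
        distinct = sorted(set(row[j] for row in rows))
        # Candidate thresholds are midpoints between neighbouring values.
        for low, high in zip(distinct, distinct[1:]):
            threshold = (low + high) / 2
            left = [t for row, t in zip(rows, targets) if row[j] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[j] > threshold]
            cost = sum_sq(left) + sum_sq(right)
            if best is None or cost < best[0]:
                best = (cost, j, threshold)
    return best

# Hypothetical grid whose target changes across a diagonal line.
points = [(x, y) for x in range(10) for y in range(10)]
targets = [1.0 if x + y > 9 else 0.0 for x, y in points]

# Only x and y: every candidate split is parallel to an axis.
xy_split = best_split(points, targets)

# Add one coordinate along an axis tilted 45 degrees from the x axis.
# Rounding collapses floating-point ties along the same diagonal.
theta = math.pi / 4
with_oblique = [
    (x, y, round(x * math.cos(theta) + y * math.sin(theta), 9))
    for x, y in points
]
oblique_split = best_split(with_oblique, targets)

print("x/y only:", xy_split)
print("with oblique coordinate:", oblique_split)
```

In this toy case the best axis-parallel split leaves a nonzero cost, whereas the split on the third (oblique) covariate has zero cost. A full tree would approximate the diagonal with a staircase of orthogonal splits, which is the artifact described above.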

Several researchers have proposed solutions to this problem. Behrens et al. (2018) proposed the use of Euclidean distance fields (EDFs) in the form of distances to the corners and middle of the study area and the x and y coordinates. Their results showed that this approach efficiently integrated spatial trends and that accuracies were better than or on par with other methods for integrating spatial context.
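As a rough sketch of the EDF idea, the function below computes the x and y coordinates plus the distances to the four corners and the middle of the study area, giving seven covariates per location. It assumes the study area is represented by its bounding rectangle; the function name `edf_covariates` and its parameters are hypothetical and not taken from Behrens et al. (2018).

```python
import math

def edf_covariates(x, y, xmin, ymin, xmax, ymax):
    """Return [x, y, four corner distances, centre distance] for one location.

    Assumes the study area is the rectangle (xmin, ymin)-(xmax, ymax).
    """
    corners = [(xmin, ymin), (xmin, ymax), (xmax, ymin), (xmax, ymax)]
    centre = ((xmin + xmax) / 2, (ymin + ymax) / 2)
    distances = [math.hypot(x - cx, y - cy) for cx, cy in corners + [centre]]
    return [x, y] + distances

# Hypothetical location at the middle of a 6 x 8 rectangle.
example = edf_covariates(3, 4, 0, 0, 6, 8)
print(example)
```

The number of covariates is fixed by the geometry (two coordinates, four corners, one centre), regardless of how complex the spatial trend is.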

On the other hand, Hengl et al. (2018) suggested an approach referred to as spatial random forest (RFsp). This method consists of calculating data layers with buffer distances to each of the soil observations in the training dataset. It then trains a random forest model, using the buffer distances as covariates, either combined with auxiliary data or on their own. One of the main advantages of this approach is that it incorporates distances between observations in a similar manner to geostatistical models. The authors assessed the use of RFsp on a large number of spatial prediction problems and showed that it effectively eliminated spatial trends in the residuals.
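The buffer-distance covariates can be sketched as a distance table with one column per training observation. The sketch below assumes plain Euclidean distances between point pairs; the function name `buffer_distances` and the example coordinates are hypothetical, and the published RFsp workflow may derive the distances differently (for example as raster layers).

```python
import math

def buffer_distances(locations, observations):
    """One row per location, one column per training observation."""
    return [
        [math.hypot(px - ox, py - oy) for ox, oy in observations]
        for px, py in locations
    ]

# Hypothetical coordinates: two observations, two prediction locations.
observations = [(0, 0), (3, 4)]
locations = [(0, 0), (3, 0)]
table = buffer_distances(locations, observations)
print(table)
```

Because every observation adds a column, the number of covariates grows with the size of the training dataset.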

Although these two methods are able to integrate spatial trends in machine-learning models, they can be difficult to interpret. The distances used in EDFs depend on the geometry of the study area, and for RFsp, they depend on the locations of the soil samples. The meaning and interpretation of the distances therefore vary depending on the study area and the soil observations.

EDFs and RFsp also have limited flexibility as both methods specify the number of geographic data layers a priori. For EDFs, the number of distance fields is seven, and for RFsp, the number of buffer distances is equal to the number of soil observations. This means that there is no straightforward way to increase the number of spatially explicit covariates if the number is insufficient to account for spatial trends. Conversely, there is no way to decrease the number of spatially explicit covariates, even if a smaller number would suffice. The latter is especially relevant for RFsp as the method is computationally unfeasible for datasets with a large number of observations (Hengl et al., 2018).

In this study, we propose an alternative method for including spatially explicit covariates for mapping soil properties. With the method, we aim to directly address the cause of the orthogonal artifacts produced with x and y coordinates as covariates in decision tree models. Furthermore, we aim to improve upon the shortcomings of previous methods by developing a method that is both flexible and easily interpretable.

We refer to the method as oblique geographic coordinates (OGCs). In short, it works by calculating coordinates for the observations along a series of axes tilted at several oblique angles relative to the x axis. By including oblique coordinates as covariates, we enable the decision tree algorithm to make oblique splits in geographic space. As this is not possible with only x and y coordinates as covariates, this addition should allow the model to produce a more realistic prediction surface. Furthermore, the number of oblique angles is adjustable, and soil mappers can therefore choose a number that suits their purpose. Some mapping tasks may require a higher number of oblique angles than others, and soil mappers can therefore increase the number as necessary. Alternatively, if a small number of oblique angles suffices, soil mappers can reduce their number and thereby shorten computation times.
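A minimal sketch of the calculation follows. It makes two assumptions that this section does not state: the coordinate along an axis tilted by θ is taken as the projection x·cos θ + y·sin θ, and the angles are spaced evenly over the half circle from 0 to π. The function name `oblique_coordinates` and the parameter `n_angles` are hypothetical.

```python
import math

def oblique_coordinates(x, y, n_angles):
    """Coordinates of (x, y) along n_angles axes tilted from the x axis.

    Assumption: angles are evenly spaced in [0, pi); an axis tilted by
    theta + pi would only repeat the axis at theta with the sign flipped.
    """
    coordinates = []
    for i in range(n_angles):
        theta = math.pi * i / n_angles
        coordinates.append(x * math.cos(theta) + y * math.sin(theta))
    return coordinates

# Hypothetical point: with two angles the result is simply (x, y);
# with four angles, the 45 and 135 degree axes are added.
print(oblique_coordinates(3.0, 4.0, 2))
print(oblique_coordinates(3.0, 4.0, 4))
```

Under these assumptions, `n_angles` is the single setting that controls how many spatially explicit covariates the model receives.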

We test the method on four spatial datasets. Firstly, we test it for predicting soil organic matter contents in a densely sampled agricultural field in Denmark, located in northern Europe. Secondly, we test it on three publicly available spatial datasets (meuse, eberg and Swiss rainfall). We hypothesize that OGCs can provide accuracies on par with previous methods for including explicitly spatial covariates. We also hypothesize that it is possible to adjust the number of oblique angles in order to optimize accuracy and that the results allow for meaningful interpretations.

2 Materials and methods

2.1 Study areas

We test OGCs and compare them to other methods based on four spatial datasets. Firstly, we test them for predicting soil organic matter (SOM) for an agricultural field in Denmark (Vindum). Secondly, we test them on three publicly available datasets. For Vindum, we will present methods and results in detail. For the other three datasets, we will present methods and results in brief, while Appendix A contains a detailed presentation of the methods and results for these datasets.

https://soil.copernicus.org/articles/6/269/2020/soil-6-269-2020-f01

Figure 1. (a) Location of Denmark in northern Europe. (b) Location of the Vindum field within Denmark. (c) Map of the Vindum field, including locations of the samples extracted for soil organic matter (SOM) measurements. The thin black lines are 2 m contour lines. The background shows hill shade (northwest; 45° altitude) based on a digital elevation model (DEM) in 1.6 m × 1.6 m resolution (National Survey and Cadastre of Denmark, 2012).

2.1.1 Vindum

This study area is a 12 ha agricultural field located in Denmark in northern Europe (9.568° E, 56.375° N; European Terrestrial Reference System (ETRS89), 1989; Fig. 1). It lies in a kettled moraine landscape 55–66 m above sea level (a.s.l.). The parent materials in the field include clay till, glaciofluvial sand and peat. The climate is temperate coastal, with mean monthly temperatures ranging from 1 °C in January to 17 °C in July and a mean annual precipitation of 850 mm (Wang, 2013). The field contains 285 measurements of SOM from the depth interval 0–25 cm located in a 20 m grid.

The SOM contents of the topsoil in the field range from 1.3 % to 38.8 %, with a mean value of 3.5 % and a median of 2.2 %. The values have a strong positive skew of 4.7 and are leptokurtic with a kurtosis of 26.9. Logarithmic transformation reduces skewness (2.9) and kurtosis (11.1). Pouladi et al. (2019) described the spatial structure of the data, with a stable variogram with 139 m range, nugget of 0 and sill of 23.8.

2.1.2 Additional datasets

For additional analyses, we included the meuse dataset, the eberg dataset and the Swiss rainfall dataset. The meuse dataset, available through the R package sp (Pebesma et al., 2020), contains 155 measurements of soil heavy-metal concentrations from a 5 km² flood plain of the Meuse River near the village of Stein in the Netherlands. For this dataset, we mapped zinc concentrations. The eberg dataset, available through the R package plotKML (Hengl et al., 2020), contains 3670 soil observations from a 100 km² area in Ebergötzen near the city of Göttingen in Germany. For this dataset, we mapped soil types. Lastly, the Swiss rainfall dataset contains 476 rainfall measurements from 8 May 1986 in Switzerland (Dubois et al., 2003). Although this is not a soil dataset, we included it because of the high anisotropy of the data, which makes it useful for comparing methods on their ability to account for anisotropic spatial problems. We describe these three datasets in more detail in Appendix A.

https://soil.copernicus.org/articles/6/269/2020/soil-6-269-2020-f02

Figure 2. Illustration for the derivation of the oblique geographic coordinate for the point (b₁, a₁) along an axis tilted with the angle θ from the x axis. The coordinate is equal to the length of b₂. Triangles a₁b₁c and a₂b₂c are right triangles with the same hypotenuse c. The sides b₁ and a₁ are the x and y coordinates of the point (b₁, a₁), respectively. A₁ is the angle between the x axis and the line c between the origin of the coordinate system and the point (b₁, a₁); A₂ is the difference between θ and A₁.
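The geometry in Figure 2 can be written out as a short derivation. It reads the point (b₁, a₁) in (x, y) order, so that b₁ is the side along the x axis and a₁ the side parallel to the y axis; this reading is an assumption based on the caption. The arctangent form of A₁ holds for b₁ > 0, whereas the final expression holds for any point.

```latex
\begin{align}
c   &= \sqrt{a_1^2 + b_1^2}, \\
A_1 &= \arctan\!\left(\frac{a_1}{b_1}\right), \qquad A_2 = \theta - A_1, \\
b_2 &= c \cos A_2
     = \sqrt{a_1^2 + b_1^2}\,\cos\!\left(\theta - \arctan\frac{a_1}{b_1}\right) \\
    &= c \cos A_1 \cos\theta + c \sin A_1 \sin\theta
     = b_1 \cos\theta + a_1 \sin\theta .
\end{align}
```

The last step uses c cos A₁ = b₁ and c sin A₁ = a₁, so the oblique coordinate is the projection of the point onto the tilted axis. For θ = 0 it reduces to the x coordinate b₁, and for θ = π/2 to the y coordinate a₁.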
