Using deep learning for Digital Soil Mapping

Digital soil mapping has been widely used as a cost-effective method for generating soil maps. However, current DSM data representation rarely incorporate contextual information of the landscape. DSM models are usually calibrated using point observations intersected with spatially corresponding point covariates. Here, we demonstrate the use of the convolutional neural network model that incorporates contextual information surrounding an observation to significantly improve the prediction accuracy over conventional DSM models. We describe a convolutional neural network (CNN) model that takes inputs 5 as images of covariates and explores spatial contextual information by finding non-linear local spatial relationships of neighbouring pixels. Unique features of the proposed model include: input represented as 3D stack of images, data augmentation to reduce overfitting, and simultaneously predicting multiple outputs. Using a soil mapping example in Chile, the CNN model was trained to simultaneously predict soil organic carbon at multiples depths across the country. The results showed the CNN model reduced the error by 30% compared with conventional techniques that only used point information of covariates. In the 10 example of country-wide mapping at 100 m resolution, the neighbourhood size from 3 to 9 pixels is more effective than at a point location and larger neighbourhood sizes. In addition, the CNN model produces less prediction uncertainty and it is able to predict soil carbon at deeper soil layers more accurately. Because the CNN model takes covariate represented as images, it offers a simple and effective framework for future DSM models. Copyright statement. Author(s) 2018. CC BY 4.0 License 15

The formalisation of the DSM methodology was done by the publication of McBratney et al. (2003). Following the ideas of Dokuchaev (1883) and Jenny (1941), they described the scorpan model as the empirical quantitative relationship of a soil attribute and its spatially implicit forming factors. Such factors correspond to a) s: soil, other properties of the soil at a point; b) c: climate, climatic properties of the environment at a point; c) o: organisms, vegetation or fauna or human activity; d) r: topography, landscape attributes; e) p: parent material, lithology; f) a: age, the time factor; and g) n: space, spatial position. 5 Explicitly, the scorpan model can be written as: S (x,y) = f (s (x,y) , c (x,y) , o (x,y) , r (x,y) , p (x,y) , a (x,y) , n (x,y) ) + e (x,y) (1) where (x, y) corresponds to the coordinates of a soil observation, and e is the spatial residual.
The usual steps for deriving the scorpan spatial soil prediction functions include intersecting soil observations (point data) with the scorpan factors (raster images at a particular resolution), and calibrating a prediction function f . In effect, we are only 10 looking at relationships between point observations and point representation of covariates. The scorpan factors have implicit spatial information, however the prediction function f does not explicitly take into account the spatial relationship.
Attempts have been made to incorporate more local information in the scorpan covariates, in particular topography. Approaches to include covariates information about the vicinity around the observations (x, y) have been devised. One approach is to derive topographic or terrain attributes (e.g. slope, curvature) at multiple scales by expanding the size of the window or 15 neighbour size in the calculation (Miller et al., 2015;Behrens et al., 2010). Another approach includes multi-scale analysis using spatial filters such as wavelets on the covariate raster (Biswas and Si, 2011;Sun et al., 2017). Thus, the raster represents larger spatial support. Studies indicated that, generally, covariates with larger support than its original resolution could enhance the prediction accuracy of the model (Mendonça-Santos et al., 2006;Sun et al., 2017).
DSM can be thought as linking observable landscape structure and soil processes expressed as observed soil properties. To 20 effectively link structure and processes, Deumlich et al. (2010) suggested the use of analysis that spans over several spatial and temporal scales. Behrens et al. (2018) proposed the contextual spatial modelling to account for the interactions of covariates across multiple scales and their influence on soil formation. The authors' approach (e.g.: Behrens et al. (2010Behrens et al. ( , 2014) derived covariates based on the elevation at the local to the regional extent. Their approaches include ConMap (Behrens et al., 2010) which is based on elevation differences from the centre pixel to each pixel in a sparse neighbourhood, and ConStat (Behrens 25 et al., 2014) which used statistical measures of elevation within growing sparse circular spatial neighbourhoods. These approaches produce a large number of predictors computed for each location, as shown in an example with 100 distance scales (e.g., from 20 m to 20 km) and 1000 predictors per grid cell. These hyper-covariates, solely based on elevation, are used as predictors in a random forest regression model.
Spatial filtering, multi-scale terrain calculation, and contextual mapping approaches require the pre-processing of each co- 30 variate independently. The useful scale for each covariate needs to be figured out via numerical experiments and most of the time the process relies on ad hoc decisions. Here, we take advantage of the success of deep learning models that are used for image recognition, as an effective tool in DSM to optimally search for local contextual information of covariates. This work aims to expand the classic DSM approach by including information about the vicinity around (x, y), to fully leverage the spatial context of a soil observation. The aim is achieved by devising a convolutional neural network (CNN) which can take multiple spatial contextual inputs.

Rationale
The theoretical background of DSM is based on the relationship between a soil attribute and soil forming factors. In practice, 5 a single soil observation is usually described as a point p with coordinates (x, y) (Eq. 1), and the corresponding soil forming factors are represented by a vector of pixel values of multiple covariate rasters (a 1 , a 2 , ..., a n ) at the same location, where n is the total number of covariate rasters.
Soils are highly dependant on their position in the landscape, and information at a particular pixel might not be sufficient to represent that complex relationship. Our method expands the classic DSM approach by replacing the covariates, usually 10 represented as a vector, with a 3D array with shape (w, h, n), where w and h are the width and height in pixels of a window centred at point p ( Fig. 1). Methods commonly used in DSM are not designed to adequately handle the data structure depicted in Fig. 1. The data representation is similar to the network model by Lee and Kwon (2017) Figure 1. Representation of the vicinity around a soil observation p, for n number of covariate rasters. w and h are the width and height in pixels, respectively.
As described in the introduction, while multi-scale or contextual mapping approaches have been used in DSM, they still 15 rely on a vector representation of covariate and rely on machine learning methods such as random forest to select important predictors. While deep learning methods have been used in DSM (e.g., (Song et al., 2016)), most studies still use a vector representation of covariate.
In the following sections, we introduce the use of convolutional neural networks (CNNs) to exploit spatial information of covariates that will perform a more effective DSM.

Deep learning 5
Deep learning is a machine learning method that is able to learn the representation of data through a series of processing layers.
In agriculture and environmental mapping, it is mainly used in hyperspectral and multispectral image classification problems, e.g. land cover classification (Kamilaris and Prenafeta-Boldú, 2018). We have not seen much application of deep learning in DSM, except for Song et al. (2016) which used the deep belief network for predicting soil moisture.
In this section we briefly introduce CNNs and some associated methods used during this work. For a more detailed and 10 general description about CNNs we refer the reader to LeCun et al. (1990) and Krizhevsky et al. (2012).

CNN
CNNs are based on the concept of a layer of convolving windows which move along a data array in order to detect features (e.g.: edges) of the data by using different filters (Fig. 2). When stacked together, convolutional layers are capable of extracting features of increasing complexity and abstraction (LeCun et al., 1990). Since CNNs have the capacity to leverage the spatial 15 structure of the data, they have been widely and effectively used in computer vision for image recognition or extraction (LeCun et al., 1995).  A CNN has a number of three dimensional hidden layers, each layer learning to detect different features of the input images (LeCun et al., 2015). In our case, each of the layers can perform one of the two types of operations: convolution, or pooling.
Convolution takes the input images through a set of convolutional filters (e.g., a 3x3 size filter), each of which detects and 20 enhances certain features from the images. Units in a convolutional layer are organised in feature maps (here we used 3 x 3).
Each unit of the feature map is connected to local patches in the feature maps of the previous layer through a set of weights.
These local weighted sum is then passed through a non-linear transfer function.
A pooling operation merges similar features by performing non-linear down-sampling. Here we used Max-Pooling layers which combine inputs from a small 2x2 window. Pooling also makes the features robust against noise. All the convolutional 5 and pooling layers are finally "flattened" to the fully connected layer. In effect, the fully connected layer is a weighted sum of the previous layers.
To obtain optimal weights for the network, we train the network using a training data set. Weights were adjusted based on a gradient-based algorithm to minimise the error using an Adam optimiser (Kingma and Ba, 2014). We refer to a review by LeCun et al. (2015) on the details of CNN.

Multi-task learning
CNNs have the capacity to predict multiple properties simultaneously. By doing so, a multi-task CNNs is capable of sharing learned representations between different targets and also use the other targets as "clues" during the prediction process. In consequence, the error of the simultaneous prediction is generally lower compared with a single prediction for each target (Padarian et al., 2019;Ramsundar et al., 2015). An additional advantage of using a multi-task CNN is the reported reduction 15 on the risk of overfitting (Ruder, 2017).
In DSM, where the combination of large extents, high resolution, and bootstrap routines, leads to running multiple model realisations on billions of pixels, combined with the fact that CNNs use a group of pixels around the soil observation instead of a single pixel, the time and computational resources required for training and inference is an important factor. Due to the simultaneous training an inference of multiple targets, a multi-task CNN present the advantage of reducing both, training and 20 inference time, compared with a single-task model.

Data
The data used in this work correspond to Chilean soil information. Since most observations are distributed on agricultural lands, we complemented that information with a second small data collection compiled from the literature and collaborators. 25 We selected soil organic carbon content (%) at depths 0-5, 5-15, 15-30, 30-60 and 60-100 cm as our target attribute. 485 soil profile where used after excluding soil profiles with total depth lower than 100 cm (in order to assure that all the profiles have observations at all depth intervals). For more details about the data and depth standardisation we refer the reader to Padarian et al. (2017).
As covariates, we used a) digital elevation model (HydroSHEDS, Lehner et al. (2008)), which are provided at 3 arc-second 30 resolution, in addition to its derived slope and topographic wetness index, calculated using SAGA (Conrad et al., 2015); and b) long term mean annual temperature and total annual rainfall derived from information provided by WorldClim (Hijmans et al., 2005), at 30 arc-second resolution. All data layers were standardised to a 100 m grid size.

Data augmentation
Deep learning techniques are described as "data-hungry" since they usually work better with large volumes of data. The direct effect of data augmentation is to generate new samples by modifying the original data without changing its meaning (Simard 5 et al., 2003). To achieve this, we rotated the 3D array shown in Fig. 1 by 90, 180, and 270 degrees, hence quadruplicating the number of observations. It is important to note that the central pixel preserves its initial position.
A secondary effect of data augmentation is regularisation, reducing the variance of the model and overfitting (Krizhevsky et al., 2012). Data augmentation also induces rotation invariance (Vo and Hays, 2016) by generating alternative situations (rotated data) where the model response should be similar to the original data (e.g: a soil profile next to a gully is expected to 10 be similar to a profile next to the opposite side of the gully, ceteris paribus).

Network architecture
The multi-task CNN used in this study ( Fig. 3; Table 1) consists of an input layer pass through a series of convolutional and pooling layers with a ReLU activation function, which adds a non-linearity by passing the learned weights through the function f (x) = max(0, x). The initial common/shared network has a function of extracting features shared between the five 15 target depth ranges. Next, the common features are propagated through independent branches, one per depth range, of 3 fullyconnected layers. The multiple connection between the layers generates a high number of parameters. In order to reduce the risk of overfitting, we introduce a dropout rate. In between the layers, 0.3 of the connections were randomly disconnected (Nitish et al., 2014).
We added this dropout operation in the shared layer and another dropout before the output.  Figure 3. Architecture of the multi-task network. "Shared layers" represent the layers shared by all the depth ranges. Each branch, one per depth range, first flattens the information to a 1D array, followed by a series of 2 fully-connected layer and a fully-connected layer of size=1, which corresponds to the final prediction.

Inputs
As explained in Section 2, our method uses a window around a soil observation which encloses a group of pixels instead of the single pixel that coincides with the observation. Most likely, the extent or size of that window will affect the model performance. To assess this effect, we compared the results of different models trained with a window size of 3, 5, 7, 9, 15, 21 and 29 pixels.

5
As the vicinity size increases, so does the number of parameters of the network (considering a fixed network architecture) and the risk of overfitting. To minimise overfitting, we modified the architecture of the network depending on the vicinity size (Table 2). • Droputs changed to 0.5

Training & Validation
First, 10% (n = 49) of the total dataset was randomly selected and used as a test set. The remaining 90% of the samples (n = 436) were augmented (see Section 4.2) obtaining a total of 1,744 samples. Following the data augmentation, we performed a bootstrapping routine (Efron and Tibshirani, 1993) with 100 repetitions, where the training set is obtained by sampling with replacement, generating a set of size 1,744. The samples which were not selected, about 1/3 (one-third), correspond to the 5 out-of-bag validation set.
As a control, we compared our results with a previous study by Padarian et al. (2017) where we used a Cubist regression tree model (Quinlan et al., 1992) to predict soil organic carbon (SOC) at a national extent. Cubist has been used in many other DSM studies due to its interpretability and robustness. In that study, we used the same set of soil observations and covariates described in Section 4.1 10

Uncertainty analysis
In this work (and in Padarian et al. (2017)), the uncertainty is represented as the 90% prediction interval derived from the 100 bootstrap iterations. To estimate the upper and lower prediction interval limits, we used the formula: wherex and σ 2 are the mean and variance of the 100 iterations, and M SE is the mean square error of the 100 fitted models.
5 Results and discussion 20

Data augmentation
To generalise and improve the CNN model, we created new data using only information from the training data by rotating the original image input. Data augmentation was effective at reducing model error and variability (Fig. 4). It was possible to observe a decrease of the error, by 10.56, 10.56, 11.25, 14.51, and 24.77% for 0-5, 5-15, 15-30, 30-60 and 60-100 cm depth ranges, respectively. The results are in accordance with image classification studies which generally showed that data 25 augmentation increased the accuracy of classification tasks (Perez and Wang, 2017). It is hypothesised that by increasing the amount of training data, we can reduce overfitting of CNN models.  In terms of the data spatial autocorrelation, we need to consider that after augmenting the data we have 4 samples in the same locations with exactly the same SOC content, therefore assuming that there is no variance when distance=0. That is theoretically true if we consider that the distance is exactly equal to 0. In practice, when calculating the semivariogram, the semivariance value of the first bin will be lower, but that does not significantly affect the final model.

Vicinity size 5
To incorporate contextual information for DSM prediction, we represent the input as image. The image is represented as observation in the centre, with surrounding pixels in a square format. The size of the neighbourhood window (vicinity) has a significant effect on the prediction error (Fig. 5). There is not significant difference when using a vicinity size of 3, 5, 7 or 9 pixels, but sizes above 9 pixels showed an increase in the error. Is possible to observe a lower error in the test dataset, compared with the training and validation, due to the slight differences in the dataset distributions (Fig. 6). Since the SOC distribution is  Paterson et al. (2018), where, based on 41 variograms, the authors estimated an average spatial correlation range of around 400 m.
As described in Section 4.3, we slightly modified the architecture of the network as input window size increased, in order to 20 minimise the risk of overfitting and isolate the effect of the vicinity size. As we increase the vicinity size, we give the model a  broader spatial context. Our results show that just a small amount of extra context provides enough information to improve the model predictions, and a larger amount of neighbouring information acts as noise, impairing the generalisation of the model.
Since we used the relatively large resolution of 100 m, it is hard to tell specifically what is the minimum amount of context needed to improve SOC predictions. We believe that using higher resolutions (< 10m) could produce more insights about this matter.
Soil forming factors interact in complex ways and affect soil properties with different strength. At local scale, a broader context (e.i.: larger vicinity size) does not necessarily provides extra information to the model, for instance when one of the factors is relatively homogeneous. The extra information could be even detrimental if the vicinity size is well beyond the area 5 of influence of a factor, which is what probably happened when we increased the vicinity size above 9 pixels (radius≈450m).
Representing this complexity in numerical terms would imply varying the size of the input array, such as each forming factor has a different vicinity size, most likely also varying depending on the spatial location of the soil observation (e.g.: smaller vicinity for homogeneous areas, larger for heterogeneous areas). This is technically possible but considerably increases the complexity of the modelling.

Comparison with other methods
We compared our approach with the Cubist model used in our previous study (Padarian et al., 2017), where we did not use any contextual information. We observed a significant decrease in the error (Fig. 5) by 23.0, 23.8, 26.9, 35.8, 39.8% for the 0-5, 5-15, 15-30, 30-60 and 60-100 cm depth ranges, respectively. Most current DSM studies rely on punctual observations without contextual information and, given the improvements shown by our approach, we believe there is a big potential for 15 CNNs to be used in operational DSM.
To compare our results with a method that uses contextual information, we ran a test using wavelet decomposition as per Mendonça-Santos et al. (2006). In addition to the five covariates, we used their approximation coefficients from the first, second and third levels of a Haar decomposition (Chui, 2016;Haar, 1910). The results including wavelet decomposed variables were similar to ones obtained with the Cubist model. The CNN approach reduced the error by 24.8, 24.7, 28.5, 28.6, 23.5 for 20 the 0-5, 5-15, 15-30, 30-60 and 60-100 cm depth ranges, respectively. Mendonça-Santos et al. (2006) reported an average improvement of 1% for the prediction of clay content. In our case the wavelet decomposition reduced the error of SOC content by 5.1%, in average, compared with Cubist but the reduction was only observed in depth (2.4, 1.2, 2.3, -10.1, -21.4% error change for the 0-5, 5-15, 15-30, 30-60 and 60-100 cm depth ranges, respectively), where SOC content is low, hence reducing the effect in applications such as carbon accounting. Our approach showed greater error reductions and through the whole 25 profile.

Prediction of deeper soil layers
Our approach uses a multi-task CNN to predict multiple depths simultaneously in order to produce a synergistic effect. Compared with predicting each depth range in isolation by training a network with the same structure (Section 4.3) but with only one output, our approach reduced the error by 1.5, 6.7, 6.6, 8.9, 13.0% for the 0-5, 5-15, 15-30, 30-60 and 60-100 cm depth 30 ranges, respectively. In this case, the reduction was modest and we believe the effect can be greater when more soil observations are available.
In DSM, there are two main approaches to deal with the vertical variation of a soil attribute: 2.5D and 3D modelling. In the first one, an independent model is fitted for each depth range. The latter explicitly incorporates depth in order to obtain a single model for the whole profile. Interestingly, both approaches show a decrease in the variance explained by the model as the prediction depth increases. In a 3D mapping of SOC for a 125 km 2 region in the Netherlands, Kempen et al. (2011) presented R 2 values of 0.75, 0.23 and 0.09 for the 0-30, 30-60, and 60-90cm depth ranges, respectively. Our previous study (Padarian 5 et al., 2017) the 2.5D mapping showed R 2 values of 0. 39, 0.39, 0.27, 0.19, and 0.17 for the 0-5, 5-15, 15-30, 30-60 and 60-100 cm depth ranges, respectively. Similar studies show the same trend (Akpa et al., 2016;Mulder et al., 2016;Adhikari et al., 2014), independent of the models used or the soil attribute predicted. This is expected as the information used as covariates usually represents surface conditions. Our multi-task network presented the opposite trend (Fig. 7), showing an increase of the explained variance as the prediction depth increases.  Adhikari et al. (2014) The prediction of the adjacent layers served as guidance, producing a synergistic effect. A soil attribute through a profile usually has a predictable behaviour (unless there are lithological discontinuities), which has been described by many authors in the form of depth functions (Kempen et al., 2011;Nakane, 1976;Russell and Moore, 1968). A CNN is capable of generating an internal representation of the vertical distribution of the target attribute, which resembles the observed pattern (Fig. 8).

Visual evaluation of maps 15
Visually, the maps generated with the Cubist tree model and our multi-task CNN showed differences (Fig. 9). In an example for an area in southern Chile (around 72.57º S), the map generated with the Cubist model (Fig. 9a) shows more details related with the topography, but also presents some artefacts due to the sharp limits generated by the tree rules. On the other hand, the map generated with the multi-task CNN using a 7x7 window (Fig. 9b) shows a smoothing effect, an expected behaviour consequence of using neighbour pixels.

Uncertainty
A recommended DSM practice is to present a map of a predicted attribute along its associated uncertainty (Arrouays et al., 2014). Our multi-task CNN significantly reduced the prediction interval width (PIW , Table 3) compared with the Cubist model.
In average, we observed a reduction of 13.1 and 13.8% for the CNN model generated without and with data augmentation pretreatment, respectively, for the first three depth intervals. Our multi-task CNN model showed a slightly lower prediction interval In terms of the spatial patterns of the uncertainty (Fig 10), the greater reductions of the PIW where observed in elevated areas of the Andes, followed by the central valleys. A slight increase, in the order of 6-8%, was observed in the western coastal ranges. The reduction of the PIW in the Andes is most most likely due to a more reserved extrapolation by the CNN models compared with Cubist. It is worth noting that the central valleys is were most of the agricultural lands are located and the uncertainty reduction observed in these areas could have important implications. PIW change (%) Figure 10. Percentage change on the prediction interval width using as a reference a Cubist model (Padarian et al., 2017) with and without data augmentation pre-treatment (right and left panels, respectively).

Conclusions
Incorporating contextual information into DSM models is an important aspect that deserves more attention. Since a soil surveyor will look at the surrounding landscape to make a prediction of soil type, DSM models should also incorporate information surrounding an observation. We demonstrated the use of a convolutional neural network as an efficient, effective, and accurate method to achieve this goal. In particular we introduce a deep learning model for DSM which has the following innovative features: -The representation of input as an image, which takes into account information surrounding a point observation. CNN is able to recognise contextual information, and extract multi-scale information automatically, which circumvents the need to pre-process the data in the form of spatial filtering or multi-scale analysis.

5
-The use of data augmentation as a general representation of soil in the landscape, which can reduce overfitting and improve model accuracy.
-The ability to predict different soil depth simultaneously in a model, and thus take into account the depth correlation of soil properties and attributes. In our example, prediction of soil properties at deeper layer, common problem in DSM studies, improved significantly.

10
The resulting prediction has less uncertainty, further, the use of this data structure with CNN seems to eliminate artefacts generally found in DSM products due to incompatible scale of covariates and sharp discontinuities due to tree models.
A CNN can handle large number of covariates and has advantages over other machine learning algorithms used in DSM, such as random forests, and Cubist regression tree because its architecture is flexible, and explicitly takes spatial information of covariates around observations. While there are attempts to include information surrounding an observation as covariates in 15 a random forest model, those inputs still do not have spatial relationship. CNN does not require pre-processing such as wavelet transformation, rather such function is built in the model. There are other features such as handling missing values via data imputation (Duan et al., 2016) which can be readily added in the network.
The example presented in this paper is for a country-wide modelling at 100 m resolution, and we need to further test such approach in the regional to landscape mapping. The CNN model would be highly suitable for mapping soil class. In addition, 20 the presented model can be used for other environmental mapping.