Performance of three machine learning algorithms for predicting soil organic carbon in German agricultural soil
- 1Thünen Institute of Climate Smart Agriculture, Braunschweig, Germany
- 2Department Soil System Science, Helmholtz Centre for Environmental Research – UFZ, Halle (Saale), Germany
Abstract. Soil organic carbon (SOC), as the largest terrestrial carbon pool, has the potential to influence climate change and mitigation, and consequently SOC monitoring is important in the frameworks of different international treaties. There is therefore a need for high resolution SOC maps. Machine learning (ML) offers new opportunities to do this due to its capability for data mining of large datasets. The aim of this study, therefore, was to test three commonly used algorithms in digital soil mapping – random forest (RF), boosted regression trees (BRT) and support vector machine for regression (SVR) – on the first German Agricultural Soil Inventory to model agricultural topsoil SOC content. Nested cross-validation was implemented for model evaluation and parameter tuning. Moreover, grid search and a differential evolution algorithm were applied to ensure that each algorithm was tuned and optimised suitably. The SOC content of the German Agricultural Soil Inventory was highly variable, ranging from 4 g kg−1 to 480 g kg−1. However, only 4 % of all soils contained more than 87 g kg−1 SOC and were considered organic or degraded organic soils. The results show that SVR provided the best performance with RMSE of 32 g kg−1 when the algorithms were trained on the full dataset. However, the average RMSE of all algorithms decreased by 34 % when mineral and organic soils were modeled separately, with the best result from SVR with RMSE of 21 g kg−1. Model performance is often limited by the size and quality of the available soil dataset for calibration and validation. Therefore, the impact of enlarging the training data was tested by including 1223 data points from the European Land Use/Land Cover Area Frame Survey for agricultural sites in Germany. The model performance improved by at most 1 % for mineral soils and 2 % for organic soils.
Despite the capability of machine learning algorithms in general, and particularly SVR, in modelling SOC on a national scale, the study showed that the most important factor for improving model performance was the separate modelling of mineral and organic soils.
Ali Sakhaee et al.
Status: final response (author comments only)
-
RC1: 'Comment on soil-2021-107', Anonymous Referee #1, 27 Dec 2021
Overall this is a very interesting paper in which a fair comparison as regards different DSM approaches has been made across Germany, including the effect of a data-size extension (after combining 2 databases) and whether mineral and organic soils should be treated separately (by creating two different models).
In addition, the paper is well structured and well written, though some minor spelling and grammar improvements are possible (please note that I only focused on the language in the first couple of pages, but I’m convinced that the entire paper could benefit from some slight language polishing).
Nevertheless, I believe that this paper may require some major revisions based on the following comments:
I. Main Suggestions/Remarks
I.1 This research considers agricultural soils, including both grassland and cropland, and as such I have some serious concerns as regards the presented (0-20 to 0-30) depth interpolation approach used to match both databases (P. 3), which seems to be based on a (first order?) linear function depending on the soil class. However, in my opinion this analysis should be carried out per land use – soil type combination, because the depth distribution in cropland topsoil is remarkably different to that in grassland topsoil (i.e. a more or less constant value versus an exponential decline, respectively). Hence, I would like to ask the authors to carry out this analysis again per land use – soil type combination. Moreover, only a (general / average?) slope parameter value has been given (in L115 – 116), and as such I would like to ask the authors to provide the readers with a much more detailed picture of the different slope parameter values obtained depending on the land use (and soil type) setting. This can be done in a tabular format (in an annex) by presenting the slope parameter (+/- the associated SE) for each land use and soil type combination, or in a graphical format showing the distribution of slope parameter values per land use type.
I.2. From section 2.2 I can see that a wide range of covariates has been considered. However, I was wondering whether the authors carried out any multi-collinearity analysis in order to identify those that may be too strongly correlated (e.g. r > 0.9). Subsequently, I was wondering what they have done to solve this potential issue.
I.3 The model performance evaluation indicators (section 2.6) are all quite similar and have a particular focus on “random error”. Hence, I would like to suggest including some others that could provide the readers with some information as regards the (%)bias. In addition, within ‘the spirit’ of SVR, I think that including a model performance evaluation indicator that takes into account the concept of ‘model complexity’ could be an interesting add-on here. (I know that in the context of this kind of model this can be interpreted quite widely and may include a penalization term that depends on the number of parameters (like AIC and BIC) or the complexity of the trees / nodes, etc.)
I.4 I think that the “Results and Discussion” section requires some clarifications as regards the structure. In essence, I would like to suggest adding a short intro paragraph briefly explaining the logic behind the structure (and thereby clarifying the meaning of AP1, AP1L, AP2 and AP2L). Moreover, in the (bold) headings of the separate sub-sections you could add the corresponding abbreviation in brackets at the end, as well as give a short statement at the start of each section on which case you are going to consider (actually, similar to what has been done in L318). In addition, I believe that in some cases a bigger effort could be made to discuss the regional differences (in results obtained by applying the different approaches). In that respect I would like to suggest that the authors provide the readers with a relative residual map annotated with + or – in order to be able to interpret the under- / overprediction patterns in a spatially explicit way (I think this will have more value than the maps in fig. 3 and 6).
I.5 I believe that ‘the main message’ should be highlighted more, i.e. the fact that creating 2 separate models (one for mineral soils and one for organic soils) is much more important than the choice of the type of model (at least those considered here) and/or the suggested data-size extension. Please make sure that this is highlighted in the discussion and the conclusions sections. In that respect I think that some small additional analysis could be useful, for example a table / figure showing the potential model improvement (e.g. average RMSE decrease – or any other model performance indicator – see comment I.3.) due to these 3 factors (i.e. model separation (org vs. mineral), type of model, data extension). I think that this can be calculated rather easily from the information given in figure 2.
I.6 As I understood from reading section 2.2, all the covariates are represented by spatially continuous maps, so I was wondering whether it could also be an option to provide the readers with one spatially continuous predicted SOC map, for example created by applying ‘the best model’ (AP2L?) to the various covariate maps. I think this could be useful in order to obtain a more detailed interpretation of the results taking into account regional differences depending on various environmental settings.
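Two of the checks requested in comments I.2 and I.3 can be sketched as follows. This is a minimal illustration in plain Python, not the authors' workflow: the covariate names and values are invented toy data, the greedy filter is only one of several ways to handle collinearity, and the sign convention for bias (predicted minus observed, so positive values mean overprediction) is an assumption.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_collinear(covariates, threshold=0.9):
    """Greedy filter: keep a covariate only if |r| with every covariate
    kept so far does not exceed the threshold (e.g. r > 0.9 in I.2)."""
    kept = []
    for name, values in covariates.items():
        if all(abs(pearson_r(values, covariates[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


def error_metrics(observed, predicted):
    """RMSE (random plus systematic error) next to two bias indicators."""
    n = len(observed)
    residuals = [p - o for o, p in zip(observed, predicted)]
    return {
        "rmse": math.sqrt(sum(r * r for r in residuals) / n),
        "mean_error": sum(residuals) / n,                   # bias in g kg-1
        "pbias": 100.0 * sum(residuals) / sum(observed),    # bias in %
    }


# Toy covariates (hypothetical values): temperature is almost a linear
# function of elevation, clay content is not.
toy_covariates = {
    "elevation": [100, 200, 300, 400, 500],
    "temperature": [9.5, 9.0, 8.4, 8.1, 7.5],
    "clay": [12, 30, 18, 25, 15],
}
kept_covariates = drop_collinear(toy_covariates)

# Toy predictions that overshoot every observation by 2 g kg-1.
toy_metrics = error_metrics([10, 20, 30, 40], [12, 22, 32, 42])
```

In this toy case the RMSE and the mean error coincide because the error is purely systematic, which is the kind of situation an RMSE-only evaluation cannot reveal.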
II. Specific Suggestion/Remarks:
L 9-10: “to influence climate change and mitigation” is a somewhat strange formulation. (I guess this should have been “to influence and mitigate climate change”?) Please rephrase.
L 15: define topsoil, (e.g. add ‘(0-30cm)’)
L 32 – 37: you make several references to Meersmans et al 2012 but in your reference list there is 2012a and 2012b, so please specify “a” or “b” here.
L 46: “at a different scale” is a somewhat strange formulation. (I guess this should have been “at different scales” or “across different scales”) Please rephrase or delete.
L 54: I suggest replacing “SOC inventory” by “SOC monitoring” because you make reference to the periodic character of it.
L57: “with a sampling depth down to 100 cm” is a somewhat strange formulation. A more common way to say this could be “considering a sampling depth of 1m” or “considering a reference depth of 1m”.
L61: What do you mean by “complete”? Is this a good spatial distribution? Please clarify.
L73: add a space between “(2017)” and “concluded”.
L 81: What do you mean by disparity? (Do you mean “sample design”? Or “spatial distribution”?) Please clarify.
L 120-126: Why didn’t you just use the best quality product? Are all covariates resampled to a resolution of 100m? And if yes, why not use the higher level of detail / precision if you have been provided with it anyway? Was this done in order to deal with some computational intensity issues?
L 137: What is the (initial) resolution of this DEM? Was this layer also resampled to 100m (see previous comment)? And if so, was this done before or after deriving the related covariates (such as slope, curvature, etc.)? Please be more clear / specific about the exact methodological approach followed here.
L145-147: please add a reference to the source of this map.
L 164: What kind of erosion map has been considered? Is it a map highlighting water erosion and/or tillage erosion? Hence, please specify what kind of model has been considered to generate this map (e.g. is this map based on RUSLE or WaTEM/SEDEM?). I’m also wondering whether it was really required to add this map, because you already have a lot of topography-related input variables which may provide you with similar info. (In that respect I would like to reiterate my main comment I.2 – see above)
L 175: What kind of interaction depth did you consider?
L 188: Are you sure this needs to be “maximum error”? To me it sounds more logical to go for a model with “minimum error” but still with a limited model complexity.
L237: Can you clarify what exactly you mean by “shuffled 10 times”? I guess this is a kind of random perturbation (following a normal distribution?). Is it similar to what one would do in Monte Carlo?
L 265 / Figure 2: Please add subplot labels to fig2 (a1, a2, a3, ….. b1, b2,…) and always make reference to the specific subplots in the text so that the reader knows immediately which subplots need to be considered / compared (and which ones he / she can ignore).
Figure 2: Besides adding subplot labels (see the comment just above this one), I think there is an error in the x-axis labelling, because in all cases it is either “AP2” or “AP2L”, so there is no “AP1” or “AP1L” present, whereas I think that all the plots on the left-hand side of the figure (which refer to the “one model approach”) should have the labels “AP1” or “AP1L” (and not “AP2” or “AP2L”, as is currently the case). Right?
Figure 5: Please add a regression line through these clouds of dots so one can evaluate a potential bias and / or over- / underprediction. (please note that this suggestion is related to my main comment I.3)
-
AC1: 'Reply on RC1', Ali Sakhaee, 20 Feb 2022
Dear Editor, Dear Referee1,
Because the format does not allow us to show the responses in detail, we kindly refer you to the attached zip file, where you can find the detailed answers and the supplementary document. Thank you for your understanding.
Best regards
Ali Sakhaee
-
RC2: 'Comment on soil-2021-107', Anonymous Referee #2, 13 Jan 2022
In general, a well-written paper that presents soil organic carbon modelling at a national scale (Germany). The authors focus on three aspects, namely the comparison of three machine learning models, expanding the national dataset with samples from a continental scale survey (LUCAS dataset) and how generating two separate models for mineral and organic soils affects the performance of such models.
Since this is a national scale digital soil mapping (DSM) study, I think that a major revision is required.
General comments
- I have a problem with the way maps are presented. As far as I understand, the paper is a digital soil mapping (DSM) study but I do not see any maps with continuous predictions, just some points on a map. Or are those the areas corresponding to croplands? Please clarify. Second, you use a discrete colour map to show the results, which does not allow the reader to see the spatial pattern of the predicted maps. You discuss the distribution of the residuals but a more detailed visual inspection of maps could be useful (which is common in DSM). For instance, Boosted Regression Tree (BRT) seems to mostly use categorical covariates (except for total nitrogen). What does that map look like?
- The largest difference can be seen when you split the dataset in mineral/organic. There is no doubt that the difference is significant. What about the rest of the comparisons? You use a Kruskal-Wallis test to show that extrapolation in depth of the LUCAS dataset is valid but it is not clear if the main comparison (between three models according to the title) is significant.
- Perhaps the paper is focussing too much on the differences between models, which is not very interesting. We have seen hundreds of papers comparing different models just to confirm that the "best model" depends on many factors. However, your results on modelling mineral and organic soils separately seem interesting and perhaps focussing on that could benefit the community and the readers.
- How do you actually use two separate models (mineral/organic) in practice? In this approach, to make a SOC prediction you first need to decide which model to use. But to make that decision, you need to know the SOC concentration. This is an important point that should be discussed. For instance, how do we generate a national map in this particular study? Is your potential solution applicable to other countries?
- I think a bit more discussion about the covariates could be useful. Many of the soil covariates used correspond to continental scale predictions (with significant uncertainty) which usually perform poorly at other scales (national). In addition to that, it is interesting to see how just a few covariates are actually used by the models. Are we using too many useless covariates in DSM (studies with dozens of covariates)?
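The circularity raised in the fourth general comment (choosing between the mineral and the organic model seems to require the SOC value that is being predicted) can be made concrete with a small sketch. It shows one conceivable workaround, routing each site through a rule that depends only on mapped covariates; this is an assumption for illustration and not a method described by the authors. The covariate `peat_map_flag`, the function names and the constant toy models are all hypothetical.

```python
from typing import Callable, Mapping

Covariates = Mapping[str, float]


def predict_soc(
    site: Covariates,
    is_organic: Callable[[Covariates], bool],
    mineral_model: Callable[[Covariates], float],
    organic_model: Callable[[Covariates], float],
) -> float:
    """Two-stage prediction: decide the soil group from covariates only,
    then apply the regression model trained on that group."""
    model = organic_model if is_organic(site) else mineral_model
    return model(site)


# Hypothetical routing rule: relies on a mapped covariate, never on SOC itself.
def toy_is_organic(site: Covariates) -> bool:
    return site["peat_map_flag"] >= 0.5


# Toy stand-ins for the two fitted regressors (values in g kg-1 are invented).
def toy_mineral_model(site: Covariates) -> float:
    return 15.0


def toy_organic_model(site: Covariates) -> float:
    return 250.0


upland_site = {"peat_map_flag": 0.0}
fen_site = {"peat_map_flag": 1.0}

upland_soc = predict_soc(upland_site, toy_is_organic, toy_mineral_model, toy_organic_model)
fen_soc = predict_soc(fen_site, toy_is_organic, toy_mineral_model, toy_organic_model)
```

With such a design, any misclassification in the first stage propagates into the SOC prediction, so the routing step would need its own evaluation.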
Specific comments
- Section 2.6.1: I think the way parameter tuning is described is not correct. First, you mention that grid search parameters need to be discrete or discretised, which is not true. You can use continuous parameters without a problem (e.g. [1.0, 0.1, 0.01]). Second, you used a DE algorithm for BRT since the parameters are continuous, but `number of trees` and `interaction depth` are discrete. Based on your criteria, you couldn't use either of the strategies for BRT. A clarification is required.
- Figure 2: The limits of the whiskers and boxes sometimes represent different things depending on the library. Please add what they represent in the caption.
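The point made about Section 2.6.1 can be illustrated with a small sketch: a grid search simply evaluates a finite list of candidate values, which may be real-valued, and a differential evolution search can handle integer parameters by rounding the continuous vector before evaluation. Everything here is an assumption for illustration: `toy_loss` is a made-up stand-in for a cross-validated RMSE, the bounds and settings are arbitrary, and the DE routine is a bare-bones DE/rand/1/bin variant, not the implementation used in the manuscript.

```python
import itertools
import math
import random


def toy_loss(learning_rate, n_trees, depth):
    """Made-up stand-in for a cross-validated error; minimum at (0.01, 300, 4)."""
    return ((math.log10(learning_rate) + 2.0) ** 2
            + ((n_trees - 300) / 100.0) ** 2
            + (depth - 4) ** 2)


# Grid search: candidate values of a continuous parameter are just listed.
grid = itertools.product([1.0, 0.1, 0.01], [100, 300, 500], [2, 4, 6])
grid_best = min(grid, key=lambda params: toy_loss(*params))


def decode(vector):
    """Map a continuous DE vector to (learning_rate, n_trees, depth);
    the two integer parameters are obtained by rounding."""
    return 10.0 ** vector[0], int(round(vector[1])), int(round(vector[2]))


def differential_evolution(objective, bounds, pop_size=20, generations=100,
                           f=0.7, cr=0.9, seed=0):
    """Minimal DE/rand/1/bin minimiser over box bounds."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one mutated coordinate
            trial = []
            for d, (lo, hi) in enumerate(bounds):
                if d == forced or rng.random() < cr:
                    value = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    value = min(max(value, lo), hi)  # clip to the bounds
                else:
                    value = pop[i][d]
                trial.append(value)
            trial_score = objective(trial)
            if trial_score <= scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


# Search log10(learning_rate), n_trees and depth within arbitrary bounds.
de_vector, de_loss = differential_evolution(
    lambda v: toy_loss(*decode(v)),
    bounds=[(-3.0, 0.0), (50.0, 1000.0), (1.0, 10.0)],
)
de_best = decode(de_vector)
```

Rounding inside the objective keeps the optimiser itself unchanged, at the cost of flat regions in the search space where neighbouring vectors decode to the same integer setting.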
-
AC2: 'Reply on RC2', Ali Sakhaee, 20 Feb 2022
Dear Editor, Dear Referee2,
Because the format does not allow us to show the responses in detail, we kindly refer you to the attached zip file, where you can find the detailed answers and the supplementary document. Thank you for your understanding.
Best regards
Ali Sakhaee