Insights into the prediction uncertainty of machine-learning-based digital soil mapping through a local attribution approach

Rohmer, Jeremy; Belbeze, Stephane; Guyonnet, Dominique

doi:https://doi.org/10.5194/soil-10-679-2024

Articles | Volume 10, issue 2

https://doi.org/10.5194/soil-10-679-2024

© Author(s) 2024. This work is distributed under
the Creative Commons Attribution 4.0 License.

https://doi.org/10.5194/soil-10-679-2024

© Author(s) 2024. This work is distributed under
the Creative Commons Attribution 4.0 License.

Articles | Volume 10, issue 2

Original research article

|

30 Sep 2024

Original research article |

| 30 Sep 2024

Insights into the prediction uncertainty of machine-learning-based digital soil mapping through a local attribution approach

Jeremy Rohmer, Stephane Belbeze, and Dominique Guyonnet

Download

Final revised paper (published on 30 Sep 2024)
Supplement to the final revised paper
Preprint (discussion started on 21 Feb 2024)

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2024-323', Anonymous Referee #1, 19 Mar 2024

This is a review for the manuscript Insights into the prediction uncertainty of machine-learning-based digital soil mapping through a local attribution approach by Rohmer et al. The authors use SHAP, a common tool for assessing machine learning predictions at local scale, to investigate the contribution of covariates (or rather groups of covariates) on the uncertainty of a random forest model. It is well known that Shapley values are computationally very expensive, and so the authors propose to reduce the number of covariates to speed up computations. This is done before model training (a rather odd proposal) by using a statistical dependence test (i.e., HSIC), and then after model training by grouping covariates (again with the same dependence test). The main aim of investigating covariates with the model's uncertainty is intriguing within the field of digital soil mapping, but the manuscript has some major flaws. Major concerns are related to the methodology of the entire selection procedure of covariates as well as the with the presented case study. The quality of the writing is also unfortunately poor.

Main methodological concerns

• My first criticism is related to the first step, that is, the elimination of covariates before model training. This is a common pitfall within machine learning in DSM. The problem is with data leakage which may cause bias, and this occurred when covariates are removed from the entire training data set, and not within for example a cross-validation within each fold. Note that any data preprocessing (e.g., normalisation) dealt with in such a way can lead to data leakage. Data leakage may also cause the model’s uncertainty to be lower, and this is then also problematic if interpretative machine learning (IML) methods (like SHAP) are used to analyse the relationships between covariates and the model’s uncertainty. In addition, with a model such as random forest, covariate selection is not really required, especially with so few covariates (i.e., 15). I invite the authors to refer to the work such as that of Zhu et al. (2023) for guidance on data preparation so that data leakage is avoided.

• Linking to my previous point. if the goal is to speed up computations, then removing covariates should not be a first choice. In addition, in typical DSM projects the number of covariates is usually more than 100. Therefore, the presented case study, which only has 15 covariates, is not the best choice to showcase the proposed methodology. One could rather perform a sample of grid cells at which Shapley values are estimated. Like for example in the Wadoux et al. (2023) paper. Again, in many DSM projects, maps are sometimes created over millions of grid cells, so the presented case study is not the best one to showcase this methodology. Therefore, to speed up computations with a small data set (like the one in this study), I would rather use a stronger machine to do the calculations than to omit potentially important parts of my data. If not possible, then let the computations run for a few days.

• The grouping of covariates is a practical way of speeding up computation, but I am afraid it holds no meaning for DSM practitioners. The authors acknowledge this in the discussion, starting at Line 519. Doing inference on machine learning output with IML methods is hard enough. I cannot see how the grouping of covariates could hold much interpretive meaning.

• To sum up, exploring the relationship between covariates and model uncertainty is intriguing and worth exploring. However, the paper's emphasis on reducing computation with (questionable?) methods distracts from the main goal of the paper. That is, I would have liked to see more in-depth analysis of covariates related to SHAP (prediction) vs SHAP (uncertainty). I would also like to have seen more emphasis on: do we expect the same covariates to be related to both, why do we see different covariates in terms of predictions vs uncertainty.

Some other concerns / suggestions

• The synthetic case study adds no value to the paper. I suggest removing it as the paper is already a bit long for the topic at hand.

• Section 3.1 is difficult to follow without the knowledge of HSIC and some of the information in the many cited references. Maybe just restructure the manuscript and include essential methodology.

• Random forests are standard and already widely known in DSM. The sections on RF and QRF can be removed, and replaced with brief references to RF and QRF.

• Maps presented in this manuscript are of poor quality and not visually appealing. Captions and legend can also be improved. With Figure 3, show more information. Not everyone is that familiar with this region in France. The histogram is not very clear, especially the long right tail can be enhanced visually.

• General writing of the manuscript is poor. Some examples: The overuse of “etc”, too many brackets to give additional information, brief introductions at each section.

• The mathematical writing can also be improved. For example, are the authors sure that ML model is just y=f(x)? See Line 142.

• Figure 6 does not make sense. Why is there an arrow from Step 2 to 4?
References:

Wadoux, A., Saby, N., Martin, M. (2023). Shapley values reveal the drivers of soil organic carbon stock prediction. SOIL, 9, 21-38. doi: 10.5194/soil-9-21-2023.

Zhu et al. (2023). Machine Learning in Environmental Research: Common Pitfalls and Best Practices. https://pubs.acs.org/doi/10.1021/acs.est.3c00026.

Citation: https://doi.org/10.5194/egusphere-2024-323-RC1
- AC1: 'Reply on RC1', Jeremy Rohmer, 29 Apr 2024
  
  We would like to thank Referee #1 for the constructive comments. We agree with most of the suggestions and, therefore, we have modified the manuscript to take on board their comments. We recall in the attached document the reviews and we reply to each of the comments in turn (outlined in blue). The main corrections made to the manuscript are described in a specific section of each response.
  
  Citation: https://doi.org/10.5194/egusphere-2024-323-AC1
RC2:
'Comment on egusphere-2024-323', Anonymous Referee #2, 11 Apr 2024

This manuscript is well written, clear and relevant, and presents methods that could provide stakeholders with valuable insights into where the uncertainty comes from: this has the potential to make uncertainty more concrete for them.
I appreciate the use of a synthetic test case, which makes the whole procedure a lot easier to understand.
I don’t have any major criticisms. I would be pleased to see this manuscript published after attention to the following minor details :
Line 44: However, at a local scale, these methods don’t (?) provide any information for a prediction at a certain spatial location.
Line 157: pushes the prediction uncertainty?
Line 442: I don’t see any circular pattern on the bottom middle panel of Figure 13 (in the bottom right one however, they are really clear).

Synthetic test case: isn’t the fact that in Z1, the biggest contributor to uncertainty is Tmean-Tmax (and that respectively in Z2, the biggest contributor is Pwettest) be linked to the fact that these covariates have uniquely high (respectively low) values there, that are not represented in the dataset? If you agree, this in my opinion would be interesting to put in the discussion.

Citation: https://doi.org/10.5194/egusphere-2024-323-RC2
- AC2: 'Reply on RC2', Jeremy Rohmer, 29 Apr 2024
  
  We would like to thank Referee #2 for the positive analysis and the constructive comments. We agree with most of the suggestions and, therefore, we have modified the manuscript to take on board their comments. We recall the reviews in the attached document and we reply to each of the comments in turn.
  
  Citation: https://doi.org/10.5194/egusphere-2024-323-AC2

Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload

ED: Reconsider after major revisions (14 May 2024) by Alexandre Wadoux

AR by Jeremy Rohmer on behalf of the Authors (25 Jun 2024) Author's response Author's tracked changes Manuscript

ED: Referee Nomination & Report Request started (01 Jul 2024) by Alexandre Wadoux

RR by Anonymous Referee #1 (16 Jul 2024)

Suggestions for revision or reasons for rejection

I commend the authors on an improved manuscript. The methodology section now reads better, and the various steps are also clearer. I am also pleased with the improved methodology in terms of performing the screening analysis within the cross-validation. However, even with the improvements, certain sections are still a bit cryptic. See additional concerns below (some are minor and other are more major). I based my second review on the track changes document.
Line 260. The sentence starting with “Overall, the RF …” reads strange.
Line 280. maybe rather: “Shapley values, as defined in Sect. 3.2, … ”. Avoid to overly depend on brackets.
Line 289-291. Try rewriting in more than one sentence. It is currently hard to follow. In addition, I am unsure what the authors mean by “… having uniquely high and low values…”.
Line 299-300. First part of the sentence does not make sense. The part with “to estimate the conditional mean, which is used as the best estimate of the prediction,…”. Rewrite, because this is technically wrong. How can the estimate of the conditional mean be the estimate of the prediction?
Line 301. I must admit I am getting lost with this part. Maybe other readers will as well. The authors mention the difficulty of related to the clustering (i.e., verb) of the observations. Because this term is also used in this manuscript to refer to the clustering algorithm, this is a bad choice for meaning how the points are distributed. Could the authors clarify what they mean here. Are they referring to how the points are spatially distributed?
Line 301: I am also confused as to why this is a problem? Given that my understanding of the above point is right.
Lines 302-308. Is all of this necessary? Was this discuss in the methodology section? So, to make sure I understand all of this. Since the points are spatially clustered, that is, the points are not well dispersed over the region, the authors define weights which must then be used when observations are sampled when the bootstrap samples (i.e. trees) are drawn. If my understanding is correct, then this seems all a bit unnecessary. Could the authors elaborate why this is necessary? In addition, why would you bring additional methodology that was not discussed in the previous sections? Also, what if the weights do not address the feature space well? Another question, is this step necessary when you include covariates that used to address the spatial aspect of the data? I mean, you included covariates such as the coordinates and various distances. Can the authors highlight DSM studies where this has been done? Again, I am just trying to understand the motivation behind this methodology in these lines.
Line 337: “…covariates are retained in the construction of the RF model.” But the RF was already constructed if the cross-validation was performed. So why are covariates retained? What does this mean?
Line: 361. Oh, I see retained for the group based shap. Is this what the authors meant at Line 337? If so, then make it clearer. If not, please explain.
Line 406: models, plural?
Line 406: This is also a very strange sentence, because the RF model cannot extrapolate. See this post for example that explains it (https://stats.stackexchange.com/questions/235189/random-forest-regression-not-predicting-higher-than-training-data#:~:text=Decision%20Trees%20%2F%20Random%20Forrest%20cannot,outside%20of%20the%20observed%20range. ). So again, all of this is a bit cryptic, and I am cautious to what the authors mean (Lines 405-412). The authors referenced here the paper by Takoutsing and Heuvelink. Note the paragraph right above section 3.5 that also notes that RF cannot extrapolate beyond training data.
L411: What limitations?
Lines 438-443: rewrite to include the long line-in reference in the quotes.
Line 456: extrapolation mode: odd way of stating that RF is used to make spatial extrapolations. See also in Line 411.

Hide

ED: Revision (19 Jul 2024) by Alexandre Wadoux

AR by Jeremy Rohmer on behalf of the Authors (12 Aug 2024) Author's response Author's tracked changes Manuscript

ED: Publish as is (13 Aug 2024) by Alexandre Wadoux

ED: Publish as is (13 Aug 2024) by Rémi Cardinael (Executive editor)

AR by Jeremy Rohmer on behalf of the Authors (20 Aug 2024) Manuscript

Short summary

Machine learning (ML) models have become key ingredients for digital soil mapping. To explain why the ML model is reliable, we apply a popular method from explainable artificial intelligence to the uncertainty prediction, with an application to the mapping of hydrocarbon pollutants on urban soil. We show the benefit of a joint analysis of the influence on the best estimate and the uncertainty to improve communication with end users and support decisions regarding covariates’ characterisation.