Articles | Volume 10, issue 2
https://doi.org/10.5194/soil-10-679-2024
© Author(s) 2024. This work is distributed under the Creative Commons Attribution 4.0 License.
Insights into the prediction uncertainty of machine-learning-based digital soil mapping through a local attribution approach
- Final revised paper (published on 30 Sep 2024)
- Supplement to the final revised paper
- Preprint (discussion started on 21 Feb 2024)
Interactive discussion
Status: closed
Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
- RC1: 'Comment on egusphere-2024-323', Anonymous Referee #1, 19 Mar 2024
  - AC1: 'Reply on RC1', Jeremy Rohmer, 29 Apr 2024
- RC2: 'Comment on egusphere-2024-323', Anonymous Referee #2, 11 Apr 2024
  - AC2: 'Reply on RC2', Jeremy Rohmer, 29 Apr 2024
Peer review completion
AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload
ED: Reconsider after major revisions (14 May 2024) by Alexandre Wadoux
AR by Jeremy Rohmer on behalf of the Authors (25 Jun 2024)
Author's response
Author's tracked changes
Manuscript
ED: Referee Nomination & Report Request started (01 Jul 2024) by Alexandre Wadoux
RR by Anonymous Referee #1 (16 Jul 2024)
ED: Revision (19 Jul 2024) by Alexandre Wadoux
AR by Jeremy Rohmer on behalf of the Authors (12 Aug 2024)
Author's response
Author's tracked changes
Manuscript
ED: Publish as is (13 Aug 2024) by Alexandre Wadoux
ED: Publish as is (13 Aug 2024) by Rémi Cardinael (Executive editor)
AR by Jeremy Rohmer on behalf of the Authors (20 Aug 2024)
Manuscript
This is a review of the manuscript Insights into the prediction uncertainty of machine-learning-based digital soil mapping through a local attribution approach by Rohmer et al. The authors use SHAP, a common tool for interpreting machine learning predictions at the local scale, to investigate the contribution of covariates (or rather groups of covariates) to the uncertainty of a random forest model. Shapley values are well known to be computationally very expensive, so the authors propose to reduce the number of covariates to speed up computations. This is done before model training (a rather odd proposal) using a statistical dependence test (i.e., HSIC), and then after model training by grouping covariates (again with the same dependence test). The main aim, relating covariates to the model's uncertainty, is intriguing within the field of digital soil mapping, but the manuscript has some major flaws. My main concerns relate to the methodology of the entire covariate selection procedure as well as to the presented case study. The quality of the writing is also unfortunately poor.
Main methodological concerns
• My first criticism relates to the first step, that is, the elimination of covariates before model training. This is a common pitfall of machine learning in DSM. The problem is data leakage, which may cause bias: it occurs when covariates are removed using the entire training data set rather than, for example, within each fold of a cross-validation. Note that any data preprocessing (e.g., normalisation) handled in this way can lead to data leakage. Data leakage may also cause the model's uncertainty to be underestimated, which is then also problematic if interpretable machine learning (IML) methods such as SHAP are used to analyse the relationships between covariates and the model's uncertainty. In addition, with a model such as random forest, covariate selection is not really required, especially with so few covariates (i.e., 15). I invite the authors to refer to work such as that of Zhu et al. (2023) for guidance on data preparation that avoids data leakage.
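To make this concrete, here is a small numerical sketch (entirely my own illustration, not the authors' code) of why selecting covariates on the full data set is a leak. The target is pure noise, so the honest test error equals the spread of the target; selecting the top covariates on all rows nevertheless makes the test error look lower than it really is:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 60, 500, 10            # few samples, many candidate covariates
X = rng.normal(size=(n, p))
y = rng.normal(size=n)           # target is pure noise: honest skill is zero

def top_k_by_corr(rows, k=k):
    """Rank covariates by |correlation| with y over the given rows, keep top k."""
    r = np.abs([np.corrcoef(X[rows, j], y[rows])[0, 1] for j in range(p)])
    return np.argsort(r)[-k:]

def mean_test_rmse(rows_for_selection, n_rep=20):
    """Mean test RMSE over random splits; covariate selection is done on the
    rows returned by rows_for_selection(train_rows)."""
    errs = []
    for _ in range(n_rep):
        perm = rng.permutation(n)
        tr, te = perm[:40], perm[40:]
        idx = top_k_by_corr(rows_for_selection(tr))
        coef, *_ = np.linalg.lstsq(X[np.ix_(tr, idx)], y[tr], rcond=None)
        pred = X[np.ix_(te, idx)] @ coef
        errs.append(np.sqrt(np.mean((y[te] - pred) ** 2)))
    return float(np.mean(errs))

rmse_leaky = mean_test_rmse(lambda tr: np.arange(n))  # selection sees ALL rows
rmse_clean = mean_test_rmse(lambda tr: tr)            # selection sees train rows only
print(f"leaky: {rmse_leaky:.2f}  clean: {rmse_clean:.2f}")
```

The leaky protocol reports an optimistically low RMSE even though there is nothing to learn; the same mechanism biases any downstream SHAP analysis of the model.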
• Linking to my previous point: if the goal is to speed up computations, then removing covariates should not be the first choice. In typical DSM projects the number of covariates usually exceeds 100, and maps are often created over millions of grid cells, so the presented case study, with only 15 covariates, is not the best one to showcase the proposed methodology. One could instead estimate Shapley values at a sample of grid cells, as for example in Wadoux et al. (2023). To speed up computations with a small data set like the one in this study, I would rather use a stronger machine than omit potentially important parts of my data; if that is not possible, then let the computations run for a few days.
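The cost argument can be made concrete: exact Shapley values enumerate all 2^d coalitions per instance, so the total cost scales linearly in the number of grid cells evaluated, and sampling cells (rather than dropping covariates) cuts cost without touching the model. A minimal sketch of the exact computation (toy model and names are my own, not the authors'):

```python
from itertools import combinations
from math import factorial
import numpy as np

def exact_shapley(f, x, background):
    """Exact Shapley values for one instance by enumerating all 2^d coalitions.
    'Absent' features are replaced by the background row. Cost: O(2^d) per cell."""
    d = len(x)
    phi = np.zeros(d)
    for j in range(d):
        others = [k for k in range(d) if k != j]
        for s in range(d):                       # coalition sizes 0 .. d-1
            w = factorial(s) * factorial(d - s - 1) / factorial(d)
            for S in combinations(others, s):
                z = background.copy().astype(float)
                z[list(S)] = x[list(S)]
                f_without = f(z)                 # coalition S without feature j
                z[j] = x[j]
                f_with = f(z)                    # coalition S plus feature j
                phi[j] += w * (f_with - f_without)
    return phi

# Sanity check on a linear model, where phi_j = w_j * (x_j - background_j)
w = np.array([1.0, -2.0, 0.5])
f = lambda z: float(z @ w)
phi = exact_shapley(f, np.array([1.0, 1.0, 2.0]), np.zeros(3))
print(phi)  # [ 1. -2.  1.]
```

Running this at a random sample of a few thousand cells, as in Wadoux et al. (2023), keeps the per-cell cost fixed while making the map-scale cost tractable.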
• The grouping of covariates is a practical way of speeding up computation, but I am afraid it holds no meaning for DSM practitioners. The authors acknowledge this in the discussion, starting at Line 519. Doing inference on machine learning output with IML methods is hard enough. I cannot see how the grouping of covariates could hold much interpretive meaning.
• To sum up, exploring the relationship between covariates and model uncertainty is intriguing and worth pursuing. However, the paper's emphasis on reducing computation with (questionable?) methods distracts from this main goal. I would have liked to see a more in-depth analysis of covariates with respect to SHAP (prediction) versus SHAP (uncertainty), with more emphasis on questions such as: do we expect the same covariates to be related to both, and why do different covariates emerge for predictions versus uncertainty?
Some other concerns / suggestions
• The synthetic case study adds no value to the paper. I suggest removing it, as the paper is already a bit long for the topic at hand.
• Section 3.1 is difficult to follow without prior knowledge of HSIC and of some of the information in the many cited references. Consider restructuring the manuscript to include the essential methodology.
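For readers unfamiliar with the statistic, the essential idea is compact enough to state in the manuscript itself. A sketch of the standard biased HSIC estimate with Gaussian kernels (my own illustration; kernel bandwidth chosen arbitrarily here):

```python
import numpy as np

def hsic(x, y, sigma=1.0):
    """Biased V-statistic estimate of HSIC with Gaussian kernels:
    tr(K H L H) / (n-1)^2, where H centers the Gram matrices.
    Larger values indicate stronger (possibly nonlinear) dependence."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(x)
    def gram(v):
        d2 = (v[:, None] - v[None, :]) ** 2
        return np.exp(-d2 / (2 * sigma ** 2))
    K, L = gram(x), gram(y)
    H = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return float(np.trace(K @ H @ L @ H) / (n - 1) ** 2)

rng = np.random.default_rng(1)
x = rng.normal(size=200)
print(hsic(x, x**2), hsic(x, rng.normal(size=200)))  # dependent >> independent
```

Note that HSIC detects the nonlinear dependence between x and x², which a plain correlation coefficient would miss; this is presumably why the authors chose it for covariate screening.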
• Random forests are standard and already widely known in DSM. The sections on RF and QRF can be removed and replaced with brief references.
• The maps presented in this manuscript are of poor quality and not visually appealing. Captions and legends can also be improved. Figure 3 should show more information; not everyone is familiar with this region of France. The histogram is not very clear; in particular, the long right tail could be made more visible.
• The general writing of the manuscript is poor. Some examples: the overuse of "etc.", too many parenthetical asides in brackets, and the brief introductions at each section.
• The mathematical writing can also be improved. For example, are the authors sure that the ML model is just y = f(x)? See Line 142.
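To make the point about Line 142 explicit, the conventional statement of the regression setting includes an error term,

```latex
y = f(\mathbf{x}) + \varepsilon, \qquad \mathbb{E}[\varepsilon \mid \mathbf{x}] = 0,
```

and the fitted model only estimates f; writing y = f(x) conflates the noisy observation with the model's point prediction, which matters particularly in a paper whose subject is prediction uncertainty.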
• Figure 6 does not make sense. Why is there an arrow from Step 2 to Step 4?
References:
Wadoux, A., Saby, N., Martin, M. (2023). Shapley values reveal the drivers of soil organic carbon stock prediction. SOIL, 9, 21-38. doi: 10.5194/soil-9-21-2023.
Zhu et al. (2023). Machine Learning in Environmental Research: Common Pitfalls and Best Practices. Environmental Science & Technology. doi: 10.1021/acs.est.3c00026.