Articles | Volume 12, issue 1
https://doi.org/10.5194/soil-12-321-2026
© Author(s) 2026. This work is distributed under the Creative Commons Attribution 4.0 License.
Assessing the potential of complex artificial neural networks for modelling small-scale soil erosion by water
Download
- Final revised paper (published on 30 Mar 2026)
- Preprint (discussion started on 08 Aug 2025)
Interactive discussion
Status: closed
Comment types: AC – author | RC – referee | CC – community | EC – editor | CEC – chief editor
- RC1: 'Comment on egusphere-2025-3583', Anonymous Referee #1, 29 Aug 2025
- AC1: 'Reply on RC1', Nils Barthel, 16 Oct 2025
- RC2: 'Comment on egusphere-2025-3583', Anonymous Referee #2, 17 Sep 2025
- AC2: 'Reply on RC2', Nils Barthel, 16 Oct 2025
Peer review completion
AR – Author's response | RR – Referee report | ED – Editor decision | EF – Editorial file upload
ED: Reconsider after major revisions (further review by editor and referees) (21 Oct 2025) by Pedro Batista
AR by Nils Barthel on behalf of the Authors (02 Dec 2025)
Author's response
Author's tracked changes
Manuscript
ED: Referee Nomination & Report Request started (04 Dec 2025) by Pedro Batista
RR by Anonymous Referee #1 (14 Dec 2025)
RR by Anonymous Referee #2 (22 Jan 2026)
ED: Publish subject to revisions (further review by editor and referees) (25 Jan 2026) by Pedro Batista
AR by Nils Barthel on behalf of the Authors (04 Mar 2026)
Author's response
Author's tracked changes
Manuscript
ED: Publish as is (10 Mar 2026) by Pedro Batista
ED: Publish subject to technical corrections (10 Mar 2026) by Peter Fiener (Executive editor)
AR by Nils Barthel on behalf of the Authors (16 Mar 2026)
Manuscript
The manuscript explores whether machine learning approaches could improve our ability to predict soil erosion. This field is still in its infancy, and I appreciate the authors' efforts. The manuscript reports an interesting piece of work that needs some improvement, but it is clearly worth publishing.
However, the conclusions are based on unreplicated results and are thus speculative. Setting up a replicated experiment would be relatively easy and fast (see my last paragraph). Furthermore, the authors justify their work with clearly wrong statements, and I wonder whether a better justification exists. Enthusiasm about something relatively new cannot replace logically sound arguments:
Please be accurate in your arguments. They are not random quantities.
Details
L 77: Which models?
Chapter Data collection: In general, this chapter does not give enough details about the sources of data, the measurement methods, their range, their resolution, and their quality. The lack of reference to the sources also makes it impossible for the reader to get an idea about these relevant aspects.
L 95: What is the accuracy of the data? Were there independent repeated surveyors to estimate the accuracy? How did you know there had been an erosion event, given that high-intensity rain cells have only a spatial extent of about 1 km² (see Lochbihler et al. 2017, Geophysical Research Letters)?
L 97: What is sheet-to-linear erosion? Isn't this rill erosion, which is already in the first group?
L 101: Nineteen variables is a rather limited set. I would not criticise this in itself, but in L 38 you criticised a limited number of variables, so your arguments do not match. (By the way, the (R)USLE uses more than 19 variables to calculate its final six factors; hence, your data set is the more limited one.)
L 110: Better call it the Pearson correlation coefficient, because both Pearson statistics and regression involve several coefficients. In the following text, r is mostly, but not consistently, set in italics; please be consistent.
Table 1: 'DEM' is definitely the wrong variable name, because a DEM is the entirety of the elevation data. Do you mean altitude?
More details about the resolution and the quality of your DEM have to be given (see the general remark regarding the data chapter) because many of your following variables depend strongly on these two parameters.
How was slope length defined, in the sense of the USLE or in a geomorphological sense? Was it defined for the field or for the raster cell? I guess you did not use slope length, which would be one value for the entire slope, but you may have used the upslope length of each raster cell. I do not like guessing what you did (a similar question could be raised for almost all variables).
Flow accumulation is described as the total accumulated runoff. This would require runoff modelling because runoff will depend on soil, crops, heterogeneity of rain and other variables. I guess you mean the upslope drainage area. More explanation required!
Wetness index: What is a 'modified catchment area calculation'?
Machining direction: This will differ on different field parts because of the headland and complex topography. How was it defined? It may also vary over time.
Regarding the R and LS factors, see below. How was the C factor determined? Did you consider individual rains and the corresponding field states, or did you use some more generalised C factor? Which degree of generalisation did you use? K factor, based on which data?
The table must be complemented with statistical metrics like mean, SD, min, and max, which give an idea of the range the data covers. This is essential for the interpretation of Fig. 5.
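The requested metrics are cheap to produce. A minimal sketch, using pandas and hypothetical stand-in variables (altitude_m, slope_deg, flow_acc_m2 are invented names, not the manuscript's columns):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the manuscript's variable table; the point
# is only the requested mean / SD / min / max summary per predictor.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "altitude_m": rng.uniform(300, 600, 100),
    "slope_deg": rng.uniform(0, 15, 100),
    "flow_acc_m2": rng.lognormal(5, 1, 100),
})

# One row per statistic, one column per variable
summary = df.agg(["mean", "std", "min", "max"]).round(2)
print(summary)
```

Such a summary table also lets the reader judge whether Fig. 5's importances span the full range of each variable or only a narrow slice of it.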
L 178: Conventional cross-validation is inappropriate in your case because your raster cells are highly autocorrelated. Hence, the left-out data are not an independent data set. I suggest using a seven-fold cross-validation by leaving out one of your study areas at a time.
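The suggested leave-one-study-area-out scheme is straightforward with scikit-learn. A sketch under assumptions: X, y, and the per-cell study-area label `area` are synthetic stand-ins for the authors' data, and a random forest stands in for their models:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneGroupOut

# Synthetic stand-in data: 7 study areas, 100 raster cells each
rng = np.random.default_rng(0)
n = 700
X = rng.normal(size=(n, 5))           # 5 predictor variables
y = 2 * X[:, 0] + rng.normal(size=n)  # synthetic erosion target
area = np.repeat(np.arange(7), 100)   # study-area label per cell

# Leave one whole study area out per fold, so the held-out cells are
# not spatially autocorrelated with the training cells
logo = LeaveOneGroupOut()
scores = []
for train_idx, test_idx in logo.split(X, y, groups=area):
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    scores.append(r2_score(y[test_idx], model.predict(X[test_idx])))

print("spatial CV R^2 per held-out area:", np.round(scores, 2))
```

The contrast between these scores and conventional random-split scores would directly quantify how much the autocorrelation inflates the reported performance.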
L 185: I cannot see the five pairs in Table 1. Which pairs do you mean?
L 187: The correlation between R and altitude is strange. I am not aware of any meteorological process that would influence rain within your altitudinal and spatial range. I guess the correlation is an artefact of an inappropriate resampling procedure. Unfortunately, resampling was not described.
Fig. 4: The x-axis appears to use a log scale. On a log scale, zero is not possible, although it is shown (likely plotted at 0.001) and although zeros occur in the data set. I recommend a square-root scale, which allows a true zero and does not compress the data in the relevant range of 0.1 to 50 t by inflating the irrelevant range between 0.001 and 0.1.
This also leads to the question: Were there no negative values in your data set (colluviation)? Including negative values would be a clear advantage compared to the USLE. In any case, the reason for the lack of negative values has to be explained.
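A square-root axis is easy to realise in matplotlib via a function scale. A minimal sketch with invented erosion values, including a true zero:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for non-interactive use
import matplotlib.pyplot as plt

# Hypothetical erosion values in t, including true zeros
erosion = np.array([0.0, 0.05, 0.5, 2.0, 10.0, 50.0])

fig, ax = plt.subplots()
ax.hist(erosion, bins=10)
# A square-root x-scale keeps zero representable and does not
# inflate the 0.001-0.1 range the way a log scale does
ax.set_xscale("function", functions=(np.sqrt, np.square))
ax.set_xlim(0, 55)
fig.savefig("erosion_hist.png")
```

The `functions` pair is the forward transform and its inverse; matplotlib applies them to axis positions only, so the data themselves stay untransformed.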
L 224: The high importance of altitude shows that the results of your approach lack transferability to other areas. I can easily imagine a similar erosion situation (similar topography, similar soils, similar land use, similar rain), but a few hundred metres higher (or even a few thousand metres higher if we think of a high valley in the Andes). The large importance of altitude would then cause very strange predictions. Matching the training and the application situation is an indispensable prerequisite for your approach, which does not restrict the input data to meaningful and universally valid variables (especially if you call for unlimited variables). This constraint is worth discussing, and it is especially important for the black box of neural networks. Whether the network uses the variables in a way that is meaningful with respect to the erosion process is unknown and irrelevant for the result; it is, however, highly relevant for transferability. While it is relatively easy to find out whether, for instance, the K factor equation is applicable in a specific case (e.g., peatland erosion), it is difficult to find out in which cases a neural network result will fail when transferred to a different situation.
Fig. 5: The low importance of LS is strange, particularly given the higher importance of flow accumulation and slope. Essentially, LS is the product of flow accumulation and slope gradient and thus must be of higher importance. Could LS be wrongly calculated by assuming straight slopes, although you have converging and diverging slopes? Furthermore, did you use the field's LS factor or the pixel's LS factor, which is entirely different information? Your M&M section clearly requires more information; otherwise, the results cannot be understood.
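The dependence of LS on the two "more important" variables can be made explicit. A sketch using one common pixel-based formulation (Moore & Burch 1986); the manuscript's actual LS equation is unknown, and the input values are invented:

```python
import numpy as np

def ls_moore_burch(upslope_area_m2, cell_size_m, slope_deg):
    """Pixel-based LS after Moore & Burch (1986). This is one common
    formulation, used here for illustration only."""
    # Specific catchment area = upslope area per unit contour width
    a_s = upslope_area_m2 / cell_size_m
    beta = np.radians(slope_deg)
    return (a_s / 22.13) ** 0.4 * (np.sin(beta) / 0.0896) ** 1.3

# LS grows monotonically with both flow accumulation and slope, so a
# model that ranks those two variables highly should rank LS highly too
ls_flat = ls_moore_burch(500.0, 5.0, 2.0)
ls_steep = ls_moore_burch(5000.0, 5.0, 12.0)
print(ls_flat, ls_steep)
```

If the reported LS was computed per field rather than per pixel, this monotone relationship with the pixel-level inputs breaks down, which could explain the low importance.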
CNN was the best method in your case. Does this have any relevance? Will CNN always, or at least often, be the best? We don't know, because this is an unreplicated experiment. Usually, we regard unreplicated results as meaningless. I wonder whether you could improve the validity of your analysis. For instance, you could run your seven study areas separately. Is CNN the best in all seven cases? Is the ranking of variables similar in all seven cases (which would allow us to say something about transferability at least within your region)? You could run your analysis ten times with a subset of 10 randomly selected variables from your data set. Is CNN the best method in all cases? Presently, we do not know, and hence your conclusion that CNN outperforms other methods remains mere speculation.
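The per-area replication I suggest amounts to a simple bookkeeping loop. A sketch with synthetic data, where a random forest and a linear model stand in for the manuscript's CNN and competing methods:

```python
from collections import Counter

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
wins = Counter()
for area in range(7):
    # Synthetic data standing in for one study area
    X = rng.normal(size=(200, 5))
    y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200)
    Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=area)

    models = {
        "forest": RandomForestRegressor(n_estimators=50, random_state=0),
        "linear": LinearRegression(),
    }
    area_scores = {name: r2_score(yte, m.fit(Xtr, ytr).predict(Xte))
                   for name, m in models.items()}
    # Record which method wins this replicate
    wins[max(area_scores, key=area_scores.get)] += 1

print(dict(wins))
```

Only if one method wins in (nearly) all seven replicates does the claim of superiority carry weight; a 4:3 split would show that the single-run ranking is noise.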