To understand global soil organic matter (SOM) chemistry and its dynamics, we need tools to efficiently quantify SOM properties, for example, prediction models using mid-infrared spectra. However, the advantages of such models rely on their validity and accuracy. Recently,

A research gap to fill is that these models have not been validated in detail yet. What are their limitations and how can we improve them? This study provides a validation with the aim to identify concrete steps to improve these models. As a first step, we provide several improvements using the original training data.

The major limitation we identified is that the original training data are not representative for a range of diverse peat samples. This causes both biased estimates and extrapolation uncertainty under the original models. In addition, the original models can in practice produce unrealistic predictions (negative values or values

A key step to improve the models will be to collect training data that are representative for SOM formed under various conditions. This study opens directions to develop operational models to predict SOM holocellulose and Klason lignin contents from mid-infrared spectra.

Knowledge of soil organic matter chemistry and how it changes is important for understanding future global carbon dynamics. The chemistry of soil organic matter (SOM) controls how fast it can be decomposed

Mid-infrared spectra (MIRS)-based models to predict SOM properties are a promising high-throughput approach which can replace more labor-intensive or costly measurements

Soil and plant OM is often characterized by step-wise chemical fractionation into holocellulose (acid soluble) – a proxy for polysaccharides – and lignin (acid insoluble) – a proxy for aromatics –

Several models to predict holocellulose and lignin contents in different OM types have been developed (some based on near infrared spectra) (summarized for lignin by

Recently,

A major problem is that neither

Sample spectrum with the peaks and troughs, their heights, and their baselines (green) as detected by the script developed by

Our goals are to identify limitations in the original models and to give concrete recommendations for improvements. Moreover, we use the original data from

In investigating these issues, our main goal is to provide a concrete plan for improving the original models. Moreover, we want to analyze under which conditions it may not be appropriate to use the original or modified models. Where possible, we use the original data to provide improved models, which we hope can be refined further in the future. In addition, this study provides general guidance on pitfalls in the validation of spectral prediction models. With this, we want to contribute to the development of models to predict SOM holocellulose and Klason lignin contents, which are important for providing diverse data to fit SOM decomposition models and for understanding how environmental change affects global soil carbon dynamics.

We conducted a series of statistical analyses to address each of the research questions listed in the introduction. In doing so, we also computed improved models using the same training data that were used to compute the original models. Since we used these improved models to uncover and analyze the limitations of the original models, we first describe how we improved the original models and then how we analyzed the limitations.

We use the data and models collected and developed by

We computed all models as Bayesian models. This allowed us to consider parameter uncertainty in predictions and to compute models with good predictive performance

All analyses were performed in R (version 4.0.1, 6 June 2020)

Above, we mentioned that assuming a model with normal distribution can result in unrealistic predictions outside the interval [0, 100] mass-%. We were interested in whether and under which conditions unrealistic predictions can occur. Here, we differentiate two ways in which a prediction can be unrealistic: either the point estimate (median prediction) is unrealistic, or the prediction interval covers values outside [0, 100] mass-% even though the median prediction is within [0, 100] mass-%.
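The two cases can be illustrated with a minimal sketch. The function names, the 95 % equal-tailed interval, and the simulated normal draws are assumptions made for illustration only; they are not the implementation used in this study.

```python
import random

def quantile(sorted_values, p):
    """Empirical quantile with linear interpolation between order statistics."""
    pos = p * (len(sorted_values) - 1)
    low = int(pos)
    high = min(low + 1, len(sorted_values) - 1)
    frac = pos - low
    return sorted_values[low] * (1 - frac) + sorted_values[high] * frac

def classify_prediction(draws, lower=0.0, upper=100.0, level=0.95):
    """Classify predictive draws (mass-%) for one sample.

    The interval level is a hypothetical choice, not taken from the study.
    """
    s = sorted(draws)
    tail = (1 - level) / 2
    point = quantile(s, 0.5)  # median prediction
    lo, hi = quantile(s, tail), quantile(s, 1 - tail)
    if point < lower or point > upper:
        return "unrealistic point estimate"
    if lo < lower or hi > upper:
        return "unrealistic interval"
    return "realistic"

# Illustrative draws from a normal predictive distribution centred near zero.
rng = random.Random(1)
draws = [rng.gauss(3.0, 4.0) for _ in range(4000)]
print(classify_prediction(draws))
```

A sample is assigned to the first case whenever its median lies outside the admissible range, so the second case only captures samples whose median is admissible but whose interval is not.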

For our analysis, we first computed a sequence of normalized

After identifying conditions under which unrealistic predictions occur, we compared this to the behavior of a model with assumptions in line with the data generating process. Beta regression models assume that all values of the dependent variable lie in (0, 100) mass-%, which is a reasonable assumption given how the actual data are generated. Therefore, we repeated the previous analyses, but now using beta regression models. We used the same model structure and priors for intercept and slopes as the original models (“original beta models”).
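Why a beta model cannot produce predictions outside the admissible range can be sketched as follows. The sketch assumes a logit link and a mean-precision parametrization with a single predictor and fixed parameter values; these choices, the function names, and the numbers are hypothetical and may differ from the models fitted here, which additionally propagate parameter uncertainty.

```python
import math
import random

def inv_logit(eta):
    """Map a linear predictor to a mean in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

def beta_predictive_draws(x, intercept, slope, precision, n_draws=2000, seed=1):
    """Draw predicted contents (mass-%) from a beta model with fixed parameters.

    Assumed parametrization: y / 100 ~ Beta(mu * phi, (1 - mu) * phi),
    with logit(mu) = intercept + slope * x and phi = precision.
    """
    rng = random.Random(seed)
    mu = inv_logit(intercept + slope * x)
    shape1 = mu * precision
    shape2 = (1.0 - mu) * precision
    return [100.0 * rng.betavariate(shape1, shape2) for _ in range(n_draws)]

# Even for a predictor value that pushes the mean close to zero,
# all draws stay within the admissible range.
draws = beta_predictive_draws(x=-3.0, intercept=-1.0, slope=1.0, precision=50.0)
print(min(draws), max(draws))
```

Because the mean is mapped through the inverse logit and the draws come from a beta distribution, the predictive distribution is bounded by construction, whereas a normal model places probability mass on all real values.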

To facilitate practical comparison between the modeling approaches, we compared the predictions of all models for the peat samples by plotting predicted values versus depths of the peat samples. Altogether, this allowed us to identify unrealistic predictions of the original models and to analyze how the improved models (beta regression models) perform in comparison.

The original approach of

We used all peak heights and trough heights returned by the peak identification algorithm of

We used all variables in the spectra and Bayesian regularization

A popular alternative to approach 2 would be dimension reduction, for example, via partial least squares regression, principal component regression, or variants of these

Out-of-sample model performance was estimated with PSIS-LOO ELPD as described above

To further validate and interpret the models using binned spectra, we visually identified the most important bins in the models (largest absolute median coefficients

Based on the model validation, we defined the following set of models used in the subsequent analyses: “best all peaks” denotes the models for holocellulose and Klason lignin with best average predictive accuracy of approach 1 described above. “Best binned spectra” denotes the models for holocellulose and Klason lignin with nearly the best average predictive accuracy of approach 2 described above. With “nearly the best predictive accuracy” we mean here that we used the models using binned spectra with a bin width of 20 cm

Are the training data used to compute the models representative for the spectral properties of peat and peat-forming vegetation? To answer this question, we needed to compare the spectral properties of the training samples to those of the peat and vegetation samples

For the original models, the prediction domain is simply the range of area normalized heights of the

For the improved models using binned spectra, the predictor variables form a multivariate prediction domain. We therefore created plots with which we could compare the value ranges for each bin for the training data with the respective values in the peat and vegetation spectra. This allowed us to identify for which spectral variables extrapolation is an issue. Finally, by identifying the most important variables in the improved models, we could qualitatively summarize how large the risk of this extrapolation is.
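The per-bin comparison described above can be sketched as a simple range check. The function names and the toy values are hypothetical; spectra are assumed to be already binned and normalized and to share the same bins.

```python
def bin_ranges(training_spectra):
    """Per-bin (min, max) over the training spectra (lists of equal length)."""
    return [(min(column), max(column)) for column in zip(*training_spectra)]

def extrapolated_bins(spectrum, ranges):
    """Indices of bins whose value lies outside the training range."""
    return [
        i
        for i, (value, (low, high)) in enumerate(zip(spectrum, ranges))
        if value < low or value > high
    ]

# Toy example with three bins (values are illustrative, not measured data).
training = [[0.1, 0.5, 0.3], [0.2, 0.4, 0.6]]
new_sample = [0.15, 0.7, 0.2]
print(extrapolated_bins(new_sample, bin_ranges(training)))
```

Such a check treats each bin separately and therefore only detects extrapolation along single variables; a sample can lie within all univariate ranges and still be outside the multivariate prediction domain.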

Since models with a bin width of 20 cm

We analyzed how predictions of the improved models (“best all peaks” and “best binned spectra” defined in Sect. 2.2) differ from the original models. We found considerable differences, even within the spectral range of the training data, and therefore analyzed what factors probably cause these differences.

In a first step, we examined whether the original or improved models are biased.
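One simple way to look at bias is the mean difference between predicted and measured values, computed both across all samples and within groups of samples. The following sketch assumes this definition; the function names, group labels, and numbers are made up for illustration and are not data from this study.

```python
from collections import defaultdict
from statistics import mean

def bias(measured, predicted):
    """Mean residual (predicted minus measured) across all samples."""
    return mean(p - m for m, p in zip(measured, predicted))

def bias_by_group(measured, predicted, groups):
    """Mean residual within each group of samples."""
    residuals = defaultdict(list)
    for m, p, g in zip(measured, predicted, groups):
        residuals[g].append(p - m)
    return {g: mean(r) for g, r in residuals.items()}

# Illustrative values only: opposite biases in two groups cancel overall.
measured = [30.0, 32.0, 10.0, 12.0]
predicted = [27.0, 29.0, 13.0, 15.0]
groups = ["class 1", "class 1", "class 2", "class 2"]
print(bias(measured, predicted))
print(bias_by_group(measured, predicted, groups))
```

The toy example shows why an overall mean residual near zero does not rule out bias within subsets of the samples.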

In a second step, we compared how predictions of the improved models differ in practice from the original models. To this end, we compared the models' predictions for the peat and vegetation data from

In the introduction we mentioned that the original

The original model for holocellulose indeed produced unrealistic predictions: it had a negative point prediction for one strongly decomposed sample from a tropical peat core. Moreover, for 22 % of the peat samples in the data from

Figure

The beta regression models produce realistic predictions across the complete range of the spectral predictor variables. In addition, the beta regression model has smaller prediction uncertainties for extreme values (Fig.

Thus, whereas our results indicate that the choice of model distribution makes little difference for Klason lignin, for holocellulose contents of peat samples it is crucial to use a beta regression model to avoid unrealistic predictions. Nevertheless, it is generally advisable to use a beta regression model for mass contents. It may not be known in advance how large holocellulose or Klason lignin contents are for a given sample, and even for Klason lignin, contents may be low, e.g., due to high contents of minerals.

Predicted holocellulose

Our analysis shows that both strategies to include more variables result in on average more accurate predictions (Table

Is one of the strategies to include more variables advantageous? The models with the largest predictive accuracy are models with small to moderate bin width (Table

To interpret the improved models using binned spectra, we plotted the median coefficients for the best models using binned spectra (Fig.

Similarly to the original model, a variable near the

For Klason lignin, the improved model does not use bins near to the

The bins with especially large absolute coefficients in the models using binned spectra are not represented in the extracted peaks because these cover different wavenumber ranges (compare Fig.

Coefficients of the best models using binned spectra for holocellulose (

Prediction domain for the original models

The training data do not cover the range of the spectral variables relevant for predicting peat and peat-forming vegetation holocellulose contents (Fig.

The same is true for the improved models using binned spectra (Fig.

Under what conditions does extrapolation occur? Samples with small

Overall, this indicates that the prediction domain formed by the training data does not cover the range needed for peat – particularly decomposed peat – and partly also does not cover the range needed for peat-forming vegetation samples. We assume that this is probably also true for non-peat SOM. Therefore the models' predictions can represent extrapolations in practice; the training data are not in general representative for peat and peat-forming vegetation.

Overview on the relative predictive performance of the models for holocellulose and Klason lignin content as measured using PSIS-LOO ELPD. For each variable, the model with the best average predictive performance (largest ELPD) is at the top and the other models follow in descending order. Models ending in “.2” and “.3” are the models with the original model structure as developed by

There were considerable differences in predictions of the original and improved models using binned spectra for both Klason lignin and holocellulose for the peat samples (Figs.

This is surprising for two reasons: first, for the training data, both models make quite similar predictions (Fig.

Predicted values of the original model versus predicted values of the original model (first column), and the best improved models using extracted peaks (second column) and binned spectra (last column), respectively, for holocellulose

We hypothesize that both the original and the improved models are not unbiased for samples with other spectral properties, even if these differences occur in variables not directly included in the models. For holocellulose, we could not find indications of which model is better. For Klason lignin, we provide mechanistic evidence for why the predictions of the improved models probably are more correct. Therefore, a key result of our analysis is that additional training and validation data are required to compute models to accurately predict SOM holocellulose and Klason lignin contents from MIRS.

In comparison to the training data, spectra for peat typically have larger absorption values at around 1250 cm

These spectral differences are directly related to the differences in predicted values of the original and improved model (Supplement Fig. S8): the smaller the OH peak is, the larger are predicted values of the original model in comparison to the improved model using binned spectra, especially for larger

What causes this sensitivity? Peak heights – for the original model – or spectral variables (bins) – for the improved models using binned spectra – are normalized by the sum of absorbance values in the complete spectra

Why do peat samples have such different spectral properties? We hypothesize the following mechanistic explanation for the differences: decomposition of peat results in distinct changes in the absorbance at specific wavenumbers, e.g., the “OH peak” and the fingerprint region

A consequence therefore is that it is questionable if the models (both original and improved) can be applied to peat and other SOM samples in general, without adapting the training data by including more representative samples.

For Klason lignin, whereas the original model is unbiased across all samples (Supplement Fig. S6), it is not within two classes of training samples. Samples of these classes differ in the relative contribution of the

We suggest that it is this bias which causes most of the difference in predictions of the original and improved models. Conditional on the relative contribution of the

What causes this bias? We suggest that

This is what distinguishes samples of class 1 and 2 in the training data (Supplement Fig. S11): samples of class 1 are wood samples and paper product samples derived from wood. Wood typically has smaller nitrogen contents, but larger Klason lignin contents

The improved model using binned spectra gives more weight to variables related to aromatic skeletal structures and not in a region where proteins cause large absorbances

Measured training data Klason lignin contents versus fitted values [mass-%] of the original model

Our analysis shows that models with a similar fit to the training data can also be computed if mineral-rich samples are included. We see this as proof of concept that holocellulose contents can also be predicted from MIRS for SOM samples with mineral admixtures.

If the clay-rich old magazine samples are included, the model using binned spectra had the best average predictive performance. The model using extracted peaks had a worse but similar performance (

How do coefficients for the model trained with old magazine samples differ from the improved model trained without old magazine samples? According to Fig.

Our validation analysis provides general lessons for validating models using spectral data. What can we learn from the model validation? First, even though

Second, a linear relation between the target variable being predicted and a predictor variable is not sufficient validation of a spectral prediction model if the training data are not representative for the samples to which the model is applied. Most importantly, due to spectral normalization, predictions can be sensitive even to variables not included in the model. Therefore, to assess if training data (the prediction domain) are representative, whole spectra have to be compared.
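The effect of normalization can be illustrated with a minimal sketch in which each spectral variable is divided by the sum of all absorbance values in the spectrum. The toy spectra and the function name are hypothetical.

```python
def area_normalize(spectrum):
    """Divide each absorbance value by the sum over the complete spectrum."""
    total = sum(spectrum)
    return [value / total for value in spectrum]

# Two toy spectra with the same raw value for the variable of interest
# (index 1) but different absorbance elsewhere (index 0).
spectrum_a = [1.0, 2.0, 1.0]
spectrum_b = [3.0, 2.0, 1.0]

print(area_normalize(spectrum_a)[1])
print(area_normalize(spectrum_b)[1])
```

Although the raw value at index 1 is identical in both spectra, its normalized value differs because absorbance in another region changes the normalization constant. A model using only this variable is therefore still affected by the rest of the spectrum.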

Third, it is helpful to identify potential causal mechanisms that may affect the MIRS to which the model will be applied differently from the MIRS on which it was trained. As shown here (Sect. 3.5), differences in the degree of decomposition or the relative contribution of proteins and aromatic skeletal structures cause differences in MIRS which can result in biased predictions. Consequently, a causal explanation of what causes the correlation between a target variable and specific MIRS variables is a useful tool to assess whether a model may be applicable to new samples.

Fourth, our analysis also shows that re-evaluating existing models with their original training data can be an effective way to improve the models. Here, it was important that one modification of the original models often addressed multiple limitations at once due to interdependences between the limitations: improving the models' predictive performance did not only address underfitting, but also reduced the bias in the models (Fig.

There are several problems we could not solve. The most important problem is that, even for the improved models, it is unclear whether predictions are correct outside the prediction domain of the training data and therefore whether the models are applicable to SOM in general.

Further issues are the following: (1) the overall accuracy of the models can certainly be improved with more training samples. (2) It is unclear how robust the original models and our improved models – especially models using binned spectra – are in terms of calibration transfer (calibration transfer is the application of a model to spectra measured differently than the training data, e.g., with a different procedure, on a different device, in a different laboratory

A further limitation is that, like the original models, the modified beta regression models do not consider the constraint that the contents of holocellulose, Klason lignin, and any remaining compounds should sum to 100 mass-%. This also represents a further test of how realistic model predictions are (compare with Sect. 2.1). In principle, this constraint could be considered by using a Dirichlet regression model
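As a simple diagnostic for the sum constraint, one could compute how often the predicted holocellulose and Klason lignin contents of a sample jointly exceed 100 mass-%. The sketch below assumes that predictive draws for both variables are available and can be paired draw by draw; the function name and values are hypothetical.

```python
def fraction_violating_sum(holocellulose_draws, lignin_draws, total=100.0):
    """Fraction of paired draws whose summed contents exceed the total (mass-%)."""
    pairs = list(zip(holocellulose_draws, lignin_draws))
    violations = sum(1 for h, k in pairs if h + k > total)
    return violations / len(pairs)

# Illustrative draws for one sample (not data from this study).
print(fraction_violating_sum([60.0, 50.0, 70.0], [30.0, 55.0, 20.0]))
```

This check only flags violations after the fact. It does not enforce the constraint during model fitting, which is what a joint model for all components would do.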

In summary, our analysis opens concrete and promising directions to further improve the models: we need training and validation data that include peat, particularly highly decomposed peat. Such data make it possible to analyze the impact of the bias and to compute models with less bias, higher representativity for SOM and peat, and potentially larger predictive accuracy. In addition, it is likely that an operational model for predicting holocellulose contents in SOM samples with mineral admixtures can be developed by including such samples with more diverse minerals. Ideally, such improvements would be performed across multiple labs with an archive of sample materials such that calibration transfer of the models between different mid-infrared spectrometers can be further explored.

To support such developments, we implemented the best models using binned spectra (for holocellulose, the model which was also trained on mineral-rich samples) into the R package irpeat

Our aim was to validate the original models of

The main weakness of the original models is the underlying training data: they are not in general representative for SOM, peat, and peat-forming vegetation. This results in biased predictions for holocellulose and Klason lignin for SOM, such as peat. Results from currently published studies using the original models should be interpreted with caution (we are currently preparing a manuscript, see

Even though it was impossible to address the key problem of unrepresentative training data using the original data, we could address some of these issues, provide improved models, and develop a concrete strategy for future improvements. The improved models have less bias, avoid unrealistic predictions, and use information from the complete spectra and thus have a better predictive accuracy. Moreover, we provide a proof of concept that it is possible to predict holocellulose contents also for OM with mineral admixtures.

Our analysis thus opens concrete and promising directions to further improve the models: a major opportunity is to collect training and validation data representative for SOM, such that the improved models can be extended and thoroughly validated. In a next step, potential calibration transfer issues can be addressed.

Improved models to predict SOM holocellulose and Klason lignin contents can be of large importance in the long run because they allow cost-efficient high-throughput analyses of SOM. Detailed understanding of SOM chemistry across large scales, and the processes that result in changes in SOM chemistry, is only possible if such fast and effective tools are available.

The organic matter sample data from

The supplement related to this article is available online at:

HT conceptualized the study, developed the methodology and software, and performed and validated the formal analysis, data curation, writing, and visualization. KHK supervised the study, provided the resources, acquired funding, and edited the manuscript.

The contact author has declared that neither of the authors has any competing interests.

Publisher’s note: Copernicus Publications remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

We would like to thank

This study was funded by the Deutsche Forschungsgemeinschaft (DFG) (grant no. KN 929/23-1 to Klaus-Holger Knorr and grant no. PE 1632/18-1 to Edzer Pebesma) and the Open Access Publication Fund of the University of Münster.

This paper was edited by Bas van Wesemael and reviewed by Stephen Chapman and one anonymous referee.