Filling a key gap: a soil infrared library for central Africa

Information on soil properties is crucial for soil preservation, improving food security, and the provision of ecosystem services. For the African continent in particular, spatially explicit information on soils and their ability to sustain these services is still scarce. To address such data gaps, infrared spectroscopy has become established in recent decades as a cost-effective way to quantify soil properties. Here, we present a mid-infrared soil spectral library (SSL) for central Africa (CSSL) that can predict key soil properties, allowing for future soil estimates with a minimal need for expensive and time-consuming wet chemistry. Currently, our CSSL contains over 1,800 soils from ten distinct geo-climatic regions throughout the Congo Basin and wider African Great Lakes region. We selected six hold-out core regions from our SSL and augmented them with the continental AfSIS SSL, which does not cover central African soils. We present three levels of geographical extrapolation, deploying memory-based learning (MBL) to predict total carbon (TC) and total nitrogen (TN) contents in the selected regions. The root mean square error of the predictions (RMSEpred) ranged between 0.38–0.86 % and 0.04–0.17 % for TC and TN, respectively, when using the AfSIS SSL only to predict the six regions. Prediction accuracy could be improved for four out of six regions when adding central African soils to the AfSIS SSL. This reduction of extrapolation resulted in RMSEpred ranges of 0.41–0.89 % for TC and 0.03–0.12 % for TN. In general, MBL leveraged spectral similarity and thereby predicted the soils in each of the six regions accurately; the effect of avoiding geographical extrapolation and forcing regional samples into the local neighborhood (MBL-spiking) was small.
We conclude that our CSSL adds valuable soil diversity that can improve predictions for the regions compared to using the continental-scale AfSIS SSL alone; thus, analyses of other soils in central Africa will be able to profit from a more diverse spectral feature space. Given these promising results, the library is an important tool to facilitate economical soil analyses and predict soil properties in an understudied yet critical region of Africa.
https://doi.org/10.5194/soil-2020-99. Preprint. Discussion started: 8 January 2021. © Author(s) 2021. CC BY 4.0 License.


Introduction
Soil health is critical to crop nutrition, agricultural production, food security, erosion prevention, and climate change mitigation via carbon (C) storage. Global climate change and soil degradation by deforestation and soil mismanagement critically threaten these ecosystem services (Birgé et al., 2016). In particular, the humid tropics are a front line for these anthropogenic impacts.
For example, increasing temperatures and accelerating deforestation in the humid tropics are estimated to enhance greenhouse gas emissions (Cox et al., 2013; Don et al., 2011), but also to significantly reduce soil functions and ecosystem services such as soil fertility, water storage and filtration capabilities, and erosion protection (Veldkamp et al., 2020). Despite the expected severity of these impacts, our understanding of the effects in the humid tropics is limited by sparse data and the uneven distribution of low-latitude research.
Within the tropics, both the future impacts and data gaps are most severe in the Congo Basin, which contains the second [...] analyzed for a subset of samples. The chemical and MIR prediction results for these soil characteristics are not presented in this manuscript but were obtained using the same methods and are available in our GitHub repository. The large majority of the soil samples originate from highly weathered and acidic soils and do not contain any carbonates. Therefore, TC contents correspond to total organic carbon contents. Only in a few samples from termite mounds in the subtropical Haut-Katanga province was calcium carbonate detected, with pH values > 8 (Mujinya, 2012). Note that, although the proportion of samples with inorganic carbon was very low (5 %), the term TC is used throughout the study.

MIR spectral libraries

Central African spectral library
All samples were finely ground using a ball mill and measured with a VERTEX70 Fourier Transform-IR (FT-IR) spectrometer with a High Throughput Screening Extension (HTS-XT) (Bruker Optics GmbH, Germany) in order to determine the MIR reflectance. A gold standard was used as the background material for all measured soils in order to normalize the sample spectra. Reflectance was transformed into absorbance (1/reflectance) prior to further processing and subsequent modeling. Two replicates per sample were filled into the cups of a 24-well plate and the surface was flattened without compression using a spatula. For each sample, 32 co-added internal measurements were averaged and corrected for CO2 and H2O using the OPUS spectrometer software (Bruker Optics GmbH, Ettlingen, Germany).
AfSIS spectral library
We used a MIR SSL created by the World Agroforestry (ICRAF) centre. This SSL was created as part of the Africa Soil Information Service (AfSIS) in order to improve soil information and land management at the continental scale of Sub-Saharan Africa (Vågen et al., 2020). For this continental library (see Figure A1), reference values for TC and TN were obtained using a ThermoQuest EA 1112 elemental analyzer. The MIR spectra of the samples were obtained by scanning them on a Tensor 27 FT-IR spectrometer (Bruker Optics, Karlsruhe, Germany) with a high throughput screening extension. Four replicates per sample were measured and 32 co-added scans were averaged for each sample (Sila et al., 2016).

Spectral resampling and pre-processing
All CSSL and AfSIS spectra were processed using the R packages 'simplerspec' (Baumann, 2020), 'prospectr' (Stevens and Ramirez-Lopez, 2020) and 'resemble' (Ramirez-Lopez, 2020) in the R statistical computing environment (R Core Team, 2020). Replicates of spectral measurements were mean aggregated to obtain one spectrum per sample. The spectra were then resampled to a resolution of 16 cm⁻¹ and trimmed to the 4000–600 cm⁻¹ spectral range.
As spectral pre-treatments have a marked impact on the performance of quantitative infrared models (Rinnan, 2014), the pre-processing procedure was specifically optimized for the MIR spectra of the central African samples. This procedure was based on the partial least squares (PLS) method, which is also known as projection to latent structures. This method has traditionally been used for regression analysis in infrared spectroscopy. However, it is also useful for projecting the spectral data onto a low-dimensional (and therefore less complex) subspace containing all the meaningful information of the original data. This projection model can be expressed as:
$$X = T P^{\top} + E \tag{1}$$
where $X$ is the original spectral matrix of $n \times d$ dimensions, $T$ is the PLS score matrix of $n \times l$ dimensions (where $l \leq \min(n, d)$) which contains the extracted variables, $P$ is the matrix of loadings of $d \times l$ dimensions which captures the spectral variability across observations, and $E$ is an error term. For spectral data with high collinearity, the optimal $l$ (the number of PLS factors) is usually small, which means that the first few PLS factors are enough to properly represent the original variability of $X$. An important aspect of this type of projection is that it is obtained in such a way that the covariance between $T$ and an external set of one or more variables is maximized. For a detailed description of PLS, see Wold et al. (2001). In PLS, $P$ can be used on new spectral observations to project them onto the lower dimensional space:
$$T_{new} = X_{new} P \tag{2}$$
The spectral reconstruction error of the projection model can then be computed by back-transforming the matrix of scores to a spectral matrix and comparing it against the original spectral matrix as follows:
$$E = X - T P^{\top} \tag{3}$$
The above spectral reconstruction error concept was used to find an optimal combination of spectral pre-treatments.
We defined a set of different pre-treatments $\{h_1, h_2, ..., h_z\}$, where $h_i(X)$ represents one pre-treatment or a sequence of pre-treatments (with unique parameter values) to be applied to the spectral data. A projection model was built with the AfSIS spectra (using TC and TN as external variables) for each combination of spectral pre-treatments; this model was then applied to the pre-treated CSSL spectra and the reconstruction error ($E_{CSSL}$) was computed as follows:
$$E_{CSSL} = h_i(X_{CSSL}) - h_i(X_{CSSL}) P_i P_i^{\top} \tag{4}$$
where $P_i$ is the loading matrix of the projection model built for the $i$th pre-treatment. The final reconstruction error ($re$) is computed as the root mean square of the elements in $E_{CSSL}$:
$$re(i) = \sqrt{\frac{1}{m d} \sum_{j=1}^{m} \sum_{k=1}^{d} e_{jk}^{2}} \tag{5}$$
where $m$ is the number of samples in the CSSL, $d$ is the number of spectral variables and $e_{jk}$ are the elements of $E_{CSSL}$. To allow for comparisons across the reconstruction errors obtained for the different pre-treatments, $re(i)$ was standardized as follows: [...]
The pre-treatments tested included different combinations of standard normal variate, multiplicative scatter correction, spectral detrend, and first and second derivatives (with different window sizes).
The aim behind our reconstruction error approach was to identify a sequence of pre-processing steps that return spectral matrices which can be properly represented by a PLS model. In this respect, we assumed that a proper representation of the spectral data by a global PLS projection model might also be appropriate for local PLS models, which are at the core of the predictive methods presented in the next sections.
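The reconstruction error idea described above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the loading vectors are orthonormal so that scores can be obtained as X P; the function names and toy numbers are hypothetical, and this is not the authors' implementation, which relies on PLS projections in R.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def reconstruction_error(x, loadings):
    """Root mean square of the elements of E = X - (X P) P^T.

    x        : m x d matrix of pre-treated spectra (rows = samples)
    loadings : d x l matrix P, assumed to have orthonormal columns
               (an illustrative assumption, not taken from the paper)
    """
    scores = matmul(x, loadings)                   # project: T = X P
    x_hat = matmul(scores, transpose(loadings))    # back-transform: T P^T
    squared = [(xv - hv) ** 2
               for x_row, h_row in zip(x, x_hat)
               for xv, hv in zip(x_row, h_row)]
    return math.sqrt(sum(squared) / len(squared))


def best_pretreatment(errors):
    """Return the pre-treatment label with the smallest (standardized) error."""
    return min(errors, key=errors.get)


# Toy example: two "spectra" with two variables and a single loading vector.
x_toy = [[3.0, 4.0], [1.0, 2.0]]
p_toy = [[1.0], [0.0]]
re_toy = reconstruction_error(x_toy, p_toy)
```

In practice one such error would be computed per candidate pre-treatment and, after standardization, passed to a selection step like `best_pretreatment`.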
Minimal spectral reconstruction error was achieved with a Savitzky-Golay filter combined with a second derivative using a second-order polynomial approximation with a window size of 17 points, resulting in a final resolution of 272 cm⁻¹ (resampling resolution of 16 cm⁻¹ × window size of 17 points) (Savitzky and Golay, 1964), and a subsequent multiplicative scatter correction; this optimized pre-treatment was used for MBL.

Modeling scenarios
Here we describe the method we used to assess the performance of MBL for predicting TC and TN for six distinct regions under different scenarios of regional soil extrapolation. Figure 2 gives an overview of the modeling strategies. Three specific modeling strategies were tested on the selected regional sets, which we call prediction sets (see subsubsection 2.5.2). With the regional analysis we demonstrate how predictions of soil properties within new sites from distinct regions, which are compositionally less variable than the available SSLs, might perform and profit from knowledge present in the AfSIS SSL. It also demonstrates the added value of our new CSSL in addition to the AfSIS SSL alone. The aim of the modeling scenarios was twofold: 1) to minimize the costs and time for traditional methods by optimizing the transfer of stored spectral information to the new region of interest, and 2) to test different levels of geographical extrapolation to demonstrate how accurate predictions are for new regions when no local samples are available.

Modeling and prediction data
We used two main data sources and subsets as follows.
1. The AfSIS data set (A): continental large-scale SSL including 1902 soil samples with MIR spectra and corresponding reference data, originating from Sub-Saharan Africa (Figure A1).
2. The central African data set (C): [...]
(a) [...] This set can be written as
$$C = \bigcup_{i=1}^{6} G_i$$
where $G_i$ represents the data of the $i$th region.
(b) Six regional sets $G_i$ (n = 80–718 after removal of 20 spiking samples for every set; see below).
(c) Six regional spiking sets ($K_i$): for each complete regional set, 20 samples were selected using the k-means sampling algorithm (Naes, 1987; Stevens and Ramirez-Lopez, 2020).

Modeling strategies
Three different scenarios were compared, which relate to the scale of the geographical extrapolation:
- Strategy 1: MBL predictions for the C set were computed from A. This scenario represents an extreme case of extrapolation (from the geographical perspective), since no samples from the entire central African area are present in the AfSIS set (Figure A1), which is the only data used to build the predictive models. In addition, the MIR data from the central African (C) set originate from a different spectrometer type than the one used for scanning the AfSIS samples.
- Strategy 2: Predictions for every $G_i$ were computed using MBL models built from the pooled AfSIS data A together with the data from the remaining five regions ($C_i$), i.e. $A \cup C_i$. Although in this case there is also extrapolation (from the geographical perspective), it is not as extreme as in Strategy 1.
- Strategy 3: Strategy 2 was repeated, but in this case extrapolation was avoided by using the spiking samples from the same geographical region; each regional set $G_i$ was predicted by the pooled AfSIS data, the data of the remaining regions and the respective spiking set, i.e. $A \cup C_i \cup K_i$.
[Figure caption fragment: Strategy 2: predicting each hold-out region by the pooled remaining five regions (adding closer samples) together with the AfSIS SSL; Strategy 3: avoiding extrapolation by adding 1 to 20 spiking samples to the regional models of Strategy 2.]
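How the reference data differ between the three strategies can be sketched with plain Python sets. The function name and all sample identifiers below are hypothetical and only serve to illustrate the pooling logic described above; the region names are borrowed from the study but the contents are made up.

```python
def reference_set(strategy, target, afsis, regions, spiking):
    """Return the sample IDs used to build models for one target region.

    afsis   : set of AfSIS sample IDs (A)
    regions : dict mapping region name -> set of sample IDs (regional sets,
              spiking samples already removed)
    spiking : dict mapping region name -> set of spiking sample IDs
    """
    if strategy == 1:
        # Strategy 1: AfSIS only (largest geographical extrapolation).
        return set(afsis)
    # Data of all regions except the one being predicted.
    remaining = set().union(*(ids for name, ids in regions.items() if name != target))
    if strategy == 2:
        return set(afsis) | remaining
    if strategy == 3:
        # Strategy 3: additionally include spiking samples of the target region.
        return set(afsis) | remaining | set(spiking[target])
    raise ValueError("strategy must be 1, 2 or 3")


# Hypothetical toy data (IDs are invented for illustration).
afsis_toy = {"a1", "a2"}
regions_toy = {"Tshopo": {"t1", "t2"}, "Tshuapa": {"u1"}}
spiking_toy = {"Tshopo": {"ts1"}, "Tshuapa": {"us1"}}

refs_s3 = reference_set(3, "Tshopo", afsis_toy, regions_toy, spiking_toy)
```

The target region's own non-spiking samples never enter the reference set, which is what keeps each regional set an independent prediction set.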

Predictive modeling
We used memory-based learning (MBL) as our predictive modeling approach. MBL describes a family of (non-linear) machine learning methods designed to handle complex spectral datasets (Ramirez-Lopez et al., 2013). In the chemometrics literature, MBL is also known as local modeling. This type of learning method does not attempt to fit a general (global) predictive function using all available data. Instead, a new and unique function ($f_i$) is built on demand every time a new prediction for a given response variable is required. This new function is built using only a subset of relevant observations from a reference set that are queried through k-nearest neighbour search. The MBL method implemented for this study uses a spectral nearest neighbour search based on a moving window correlation dissimilarity. To measure the dissimilarity ($d$) between two spectra ($x_i$ and $x_j$), the following equation is used:
$$d(x_i, x_j) = \frac{1}{2w} \sum_{k=1}^{p-w} \left(1 - \rho\left(x_{i,\{k:k+w\}}, x_{j,\{k:k+w\}}\right)\right) \tag{7}$$
where $\rho$ represents the Pearson's correlation function, $w$ the window size and $p$ the number of spectral variables. After nearest neighbor retrieval, our MBL method fits a local model using the Weighted Average Partial Least Squares (WA-PLS) regression algorithm proposed by Shenk et al. (1997). In WA-PLS, the final prediction is a weighted average of multiple predictions generated by PLS models built from different numbers of PLS factors. The weight for each component is calculated as follows:
$$w_j = \frac{1}{s_{1:j} \times g_j}$$
where $s_{1:j}$ is the root mean square of the spectral residuals of the new observation when a total of $j$ PLS components are used and $g_j$ is the root mean square of the regression coefficients corresponding to the $j$th PLS component (see Shenk et al. (1997) for more details).
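The two building blocks described above, the moving window correlation dissimilarity and the WA-PLS weighting, can be sketched in a few lines of Python. This is an illustrative sketch only: the scaling of the dissimilarity by 1/(2w), the window indexing and the normalisation of the weights to sum to one are assumptions made here, and the function names are hypothetical rather than taken from the 'resemble' package.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation undefined for a constant window")
    return cov / math.sqrt(var_a * var_b)


def moving_window_corr_diss(x_i, x_j, w):
    """Sum of (1 - rho) over windows k..k+w, scaled by 1 / (2 w) (assumed scaling)."""
    p = len(x_i)
    total = sum(1.0 - pearson(x_i[k:k + w + 1], x_j[k:k + w + 1])
                for k in range(p - w))
    return total / (2.0 * w)


def wapls_weights(s, g):
    """Weights w_j = 1 / (s_1:j * g_j), normalised here to sum to one."""
    raw = [1.0 / (s_j * g_j) for s_j, g_j in zip(s, g)]
    total = sum(raw)
    return [r / total for r in raw]


def wapls_predict(predictions, s, g):
    """Weighted average of predictions from models with 1..J PLS factors."""
    return sum(w * y for w, y in zip(wapls_weights(s, g), predictions))


# Toy spectra: identical spectra are maximally similar, mirrored ones are not.
spec_a = [1.0, 2.0, 4.0, 3.0, 5.0]
spec_b = [-v for v in spec_a]
d_same = moving_window_corr_diss(spec_a, spec_a, w=2)
d_mirror = moving_window_corr_diss(spec_a, spec_b, w=2)
```

Components with large spectral residuals or large regression coefficients receive small weights, so the weighted average leans towards the more stable local PLS models.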

The number of neighbors to retrieve was optimized using nearest neighbor (NN) cross-validation (Ramirez-Lopez et al., 2013). Using this method, for each observation to be predicted, its nearest neighbor is excluded from the group of neighbors and a WA-PLS model is then fitted using the remaining ones. This model is then used to predict the value of the response variable of the nearest observation. These predicted values are finally compared with the actual values (see Ramirez-Lopez et al. (2013) for additional details). To avoid overfitting, the region was used as a grouping factor, which was a 'sentinel site' for the AfSIS library and a 'province' or 'district' of the particular country for the CSSL. Samples from the same sampling region were consequently assigned to the same fold when dividing them into hold-out and validation sets. Neighborhood sizes varying from 150 to 500 neighbors in increments of 10 were tested. The best model and the optimal number of neighbours were determined by the minimal RMSE (Equation 8) of the nearest neighbour validation:
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)^2} \tag{8}$$
where $n$ is the number of neighbours used for the model, $y_i$ is the measured value of the hold-out neighbor, and $\hat{y}_i$ is the value predicted by the remaining neighbours.
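The selection of the neighbourhood size by minimal RMSE can be sketched as follows. The candidate grid (150 to 500 in steps of 10) follows the text; the function names and the toy validation values are hypothetical, and the sketch assumes that the nearest neighbour validation predictions have already been produced for each candidate size.

```python
import math


def rmse(measured, predicted):
    """Root mean square error between measured and predicted values."""
    return math.sqrt(sum((p - m) ** 2 for m, p in zip(measured, predicted)) / len(measured))


# Candidate neighbourhood sizes: 150 to 500 in increments of 10.
candidate_k = list(range(150, 501, 10))


def select_k(validation):
    """Pick the neighbourhood size with the lowest validation RMSE.

    validation: dict mapping k -> (measured values, NN-validation predictions)
    """
    scores = {k: rmse(y, y_hat) for k, (y, y_hat) in validation.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


# Invented toy results for two candidate sizes only.
validation_toy = {
    150: ([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]),
    160: ([1.0, 2.0, 3.0], [1.5, 2.0, 3.0]),
}
best_k, best_rmse = select_k(validation_toy)
```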

Subsequently, 1 to 20 spiking samples from the target region were added and forced into the neighbourhood of every observation, and thus used in the predictive models, independent of their distances to the validation set. The stepwise spiking was applied to test the effect of spiking in general and to find the smallest number of samples required for satisfactory model performance.
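Forcing spiking samples into a neighbourhood can be sketched as below. The sketch assumes that the spiking samples are added on top of the k nearest non-spiking neighbours; whether forced samples count towards k in the software used by the authors is not stated here, so this is an assumption, and the function name and sample IDs are hypothetical.

```python
def neighbourhood(distances, k, spiking_ids=()):
    """Return the sample IDs used for one local model.

    distances   : dict mapping sample ID -> dissimilarity to the observation
                  being predicted
    k           : number of nearest neighbours to retrieve
    spiking_ids : IDs forced into the neighbourhood regardless of distance
    """
    forced = [sid for sid in spiking_ids if sid in distances]
    others = sorted((sid for sid in distances if sid not in forced),
                    key=distances.get)
    # Assumption: forced samples are added to, not counted within, the k neighbours.
    return forced + others[:k]


# Invented toy dissimilarities; "s1" is a spiking sample that is far away.
distances_toy = {"a": 0.1, "b": 0.5, "c": 0.3, "s1": 0.9}
plain = neighbourhood(distances_toy, 2)
spiked = neighbourhood(distances_toy, 2, ["s1"])
```

Stepwise spiking then amounts to calling such a function with the first 1, 2, ..., 20 spiking samples and re-evaluating the prediction error each time.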

Model validation and prediction accuracy
For model validation, the RMSE statistics of the nearest neighbor validation described in the previous section were used.
Prediction accuracy for the seven sets (the combined six regions C and the six individual regional sets $G_i$; see above), which is the so-called independent or external validation, was also calculated using RMSE (Equation 8), where in this case $y_i$ is the actual measured reference value and $\hat{y}_i$ the prediction of the final model.

Model validation and prediction performance were additionally evaluated using the ratio of performance to the interquartile distance (RPIQ), as suggested by Bellon-Maurel et al. (2010). The interquartile range of the observed reference data is divided by the RMSE of the nearest neighbor validation or by the RMSE of the prediction (RMSEpred), respectively. The RPIQ is useful because it does not make any assumptions about the distribution of the reference data.
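As a small worked example of the RPIQ, the sketch below divides the interquartile range of the measured values by the RMSE. The quartile convention (linear interpolation between data points, Python's "inclusive" method) is an assumption, since the convention used in the study is not stated; the function name and numbers are illustrative.

```python
import math
import statistics


def rpiq(measured, predicted):
    """Interquartile range of the measured values divided by the RMSE."""
    # Quartiles by linear interpolation between data points (assumed convention).
    q1, _, q3 = statistics.quantiles(measured, n=4, method="inclusive")
    error = math.sqrt(sum((p - m) ** 2 for m, p in zip(measured, predicted)) / len(measured))
    return (q3 - q1) / error


# Toy data: every prediction is off by 0.5, and the interquartile range is 2.
measured_toy = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted_toy = [1.5, 2.5, 3.5, 4.5, 5.5]
rpiq_toy = rpiq(measured_toy, predicted_toy)
```

Higher values indicate that the prediction error is small relative to the spread of the reference data.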
The sample archive of the CSSL covered a wide range of TC and TN contents (

Predictive performance of the three strategies
The prediction results for the three strategies are presented in Table 4 and [...] and Iburengerazuba) as well as TN predictions in all six regions showed a clear underestimation trend (Figure 4). This might be caused by one or a combination of the following two effects: i) spectral offset and/or multiplicative effects in the spectra [...] Stepwise addition was done in order to find the lowest number of spiking samples that reduces the prediction error to a satisfactory tolerance level.

Using soil spectral libraries in geographically different domains
We showed that TC and TN in six regions of our CSSL can be accurately predicted by leveraging existing SSLs informed by soils from completely different geographical areas using MBL methods (Table 4, Figure 4). The advantage of using MBL is that it finds spectrally similar observations for every new observation to fit specific models. The spectral similarity in fact reflects the similarity between observations in terms of soil composition, information on which is largely contained in the MIR features.
This means that the predictive success of MBL models largely depends on the quality of the spectral dissimilarity methods used to find the spectral neighbors. In other words, MBL can be described as a method driven by compositional similarity search. [...] (Table 3). Both sites contain samples from both tropical forests and agricultural fields, from diverse altitudes (Table 2) and parent materials, and have therefore transformed under a variety of environmental conditions. We conclude that the particularly high soil diversity in these two regions in terms of soil biogeochemical properties introduces additional complexity in the soil spectral prediction workflow. To improve predictions for these diverse regions, more data, particularly with high TC and TN values, are needed for calibrating the CSSL and ultimately delivering better regional estimates using local methods (i.e., memory-based learning). High diversity in organic compounds and their stabilization in soils (i.e., organo-mineral association, complexation, aggregation) can introduce non-linear relationships that are difficult to predict with linear calibration models (i.e., memory-based learning in combination with PLS regression). Similarly high RMSEs have been shown in other studies for samples with organic C higher than 15 % (Nocita et al., 2014). As in our study, these high errors were attributed to low numbers of samples with high organic C contents. The creation of subsets from large spectral libraries via spectral similarities has been shown to be effective for training calibration models (e.g., Wetterlind and Stenberg, 2010; Clairotte et al., 2016; Tziolas et al., 2019; Dangal et al., 2019; Sanderman et al., 2020). Hence, in order to reduce uncertainties for regions in central Africa that are diverse in terms of soil chemical composition, in particular for the Eastern Congo Basin, there is an urgent need to fill the existing gaps in the continental library by gathering more data on the ground.

Effect of spiking with local samples on prediction performance
The effect of spiking the calibration models with local target samples was smaller than expected (Figure 5). Although spiking could reduce RMSEpred somewhat for two regions (Iburengerazuba, South Kivu and Kabarole, Table 4), the effect was rather small for the remaining regions. Regions that occupied the same score space of the first two principal components as the corresponding other regions and the AfSIS SSL (Figure 3) showed only a minimal effect from spiking (Figure 1). This is especially true for the Tshuapa, Tshopo and Haut-Katanga regions. In these regions, similar spectra were apparently already available and the MBL found the required neighbours to build accurate models and predict TC and TN. For South Kivu and Iburengerazuba, the predictions could not be improved by adding the other regions to the AfSIS SSL, but spiking with samples from the target area could slightly improve their results. However, the prediction error (RMSEpred) remained relatively high (Figure 5). On the one hand, neither any other region nor the AfSIS SSL covers the same score space as these two regions; on the other hand, the variability of soil properties within these two regions is large, which also minimizes the effect of spiking.
Even though spiking is described as particularly effective in improving the performance of small-sized models (Guerrero et al., 2010), spiking in our study did not have as strong an effect as reported by earlier studies (e.g., Guerrero et al., 2014; Seidel et al., 2019; Barthès et al., 2020; Wetterlind and Stenberg, 2010).

Suggestions for building new models and extending the existing spectral library
Our regional predictions of TC and TN show promising results when analyzing soils from geographically distinct areas in central Africa that are not covered by the continental AfSIS SSL (Figure A1). The addition of geographically proximal regions to the large-scale library, which are included in our CSSL, improved prediction accuracy significantly. This improvement underlines the usability of spectral libraries for new regions in general, but also encourages the future amendment of currently existing libraries to improve accuracy. To improve future soil analyses and to extend the geographical area covered by the SSL, we suggest the following workflow:
1. Preprocessing: Different spectral pre-processing methods influence model and prediction performance. We suggest selecting the best pre-processing strategies using spectral projections and minimizing the reconstruction error (see subsection 2.4).
2. Estimate uncertainty for new samples: When analyzing new soil samples from a region which is not covered by the SSL, samples with different composition and hence chemical properties are more likely to be introduced. Samples with high distances in the score space to the SSL cannot be predicted with high certainty, since they are often highly divergent from the SSL. A preliminary graphical inspection of resampled and pre-processed spectra can already allow for recognition of differences. A further dimension reduction (e.g. with a PCA) with a subsequent 2D or 3D visualization of the first factors provides additional insights into dissimilarity.
3. Reference analysis for independent validation: If the new samples are from a completely new region or the new sample set tends to differ from the SSL, a certain number of validation samples is recommended to test for prediction accuracy. The number is dependent on the similarity/dissimilarity to the SSL.

Our CSSL is freely available to use and build upon at GitHub. As shown with the AfSIS SSL, the application of already existing libraries and the extrapolation to new regions is accurate and suitable to estimate soil properties. However, to make predictions more accurate, especially for more diverse, heterogeneous and complex soils, more data are required. As demonstrated, the overall prediction accuracy improved when more proximal central African regions were added to the large-scale library. These results encourage the use and amendment of existing libraries, rather than the construction of new, separate, and extensive databases. Given the existing distribution of samples in the new CSSL, it is especially important to increase the number of forest soils with high TC content, which represent a large portion of the Congo Basin. The future enlargement of the CSSL, preferably facilitated by our suggested workflow, is crucial to fill the gap of soil information in this highly understudied part of the world.
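The uncertainty check in step 2 of the workflow could, for instance, be approximated by flagging new samples that lie outside the range of the library in score space. The sketch below assumes that scores on the first two components are already available and uses a deliberately simple rule (distance to the library centroid larger than that of any library sample); both the rule and the function names are hypothetical illustrations rather than a procedure taken from the study.

```python
import math


def centroid(points):
    """Component-wise mean of a list of score coordinates."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def flag_divergent(library_scores, new_scores):
    """Flag new samples lying farther from the library centroid than any library sample.

    Both arguments are lists of score tuples (e.g. the first two components).
    The threshold rule is an illustrative assumption.
    """
    centre = centroid(library_scores)
    limit = max(math.dist(point, centre) for point in library_scores)
    return [math.dist(point, centre) > limit for point in new_scores]


# Invented toy scores: the second new sample lies far outside the library.
library_toy = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
new_toy = [(1.0, 1.5), (5.0, 5.0)]
flags = flag_divergent(library_toy, new_toy)
```

Flagged samples would be natural candidates for the reference analyses recommended in step 3.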

Conclusions
Our study presents the results and workflow for building the first central African SSL for predicting soil properties (TC and TN) using lab-based MIR spectroscopy in a crucial but understudied area of the African continent. Extrapolations were possible [...]
Thanks also to Heather Maclean and Travis Drake for the additional editing of the manuscript.
Tziolas, N., Tsakiridis, N., Ben-Dor, E., Theocharis, J., and Zalidis, G.: A memory-based learning approach utilizing combined spectral sources and geographical proximity for improved VIS-NIR-SWIR soil properties estimation, Geoderma, 340, 11-24,