The central African soil spectral library: a new soil infrared repository and a geographical prediction analysis

Information on soil properties is crucial for soil preservation, the improvement of food security, and the provision of ecosystem services. In particular, for the African continent, spatially explicit information on soils and their ability to sustain these services is still scarce. To address data gaps, infrared spectroscopy has achieved great success as a cost-effective solution to quantify soil properties in recent decades. Here, we present a mid-infrared soil spectral library (SSL) for central Africa (CSSL) that can predict key soil properties, allowing for future soil estimates with a minimal need for expensive and time-consuming wet chemistry. Currently, our CSSL contains over 1800 soil samples from 10 distinct geoclimatic regions throughout the Congo Basin and along the Albertine Rift. For the analysis, we selected six regions from the CSSL, for which we built predictive models for total carbon (TC) and total nitrogen (TN) using an existing continental SSL (African Soil Information Service, AfSIS SSL; n = 1902) that does not include central African soils. Using memory-based learning (MBL), we explored three different strategies at decreasing degrees of geographic extrapolation, using models built with (1) the AfSIS SSL only, (2) AfSIS SSL combined with the five remaining central African regions, and (3) a combination of AfSIS SSL, the remaining five regions, and selected samples from the target region (spiking). For this last strategy we introduce a method for spiking MBL models. We found that when using the AfSIS SSL only to predict the six central African regions, the root mean square error of the predictions (RMSEpred) was between 3.85–8.74 and 0.40–1.66 g kg−1 for TC and TN, respectively. The ratio of performance to the interquartile distance (RPIQpred) ranged between 0.96–3.95 for TC and 0.59–2.86 for TN. While the effect of the second strategy compared to the first strategy was mixed, the third strategy, spiking with samples from the target regions, could clearly reduce the RMSEpred to 3.19–7.32 g kg−1 for TC and 0.24–0.89 g kg−1 for TN. RPIQpred values were increased to ranges of 1.43–5.48 and 1.62–4.45 for TC and TN, respectively. In general, predicted TC and TN for soils of each of the six regions were accurate; the effect of spiking and avoiding geographical extrapolation was noticeably large. We conclude that our CSSL adds valuable soil diversity that can improve predictions for the Congo Basin region compared to using the continental AfSIS SSL alone; thus, analyses of other soils in central Africa will be able to profit from a more diverse spectral feature space. Given these promising results, the library comprises an important tool to facilitate economical soil analyses and predict soil properties in an understudied yet critical region of Africa. Our SSL is openly available for application and for enlargement with more spectral and reference data to further improve soil diagnostic accuracy and cost-effectiveness.

L. Summerauer et al.: The central African soil spectral library
Published by Copernicus Publications on behalf of the European Geosciences Union.


Introduction
Soil health is critical to crop nutrition, agricultural production, food security, erosion prevention, and climate change mitigation via carbon storage. Global climate change and soil degradation by deforestation and soil mismanagement critically threaten these soil ecosystem services (Birgé et al., 2016). In particular, the humid tropics are a front line for these anthropogenic impacts. For example, increasing temperatures and accelerating deforestation in the humid tropics are estimated to enhance greenhouse gas emissions (Don et al., 2011; Cox et al., 2013) but also to significantly reduce soil functions and ecosystem services such as plant nutrient supply, water storage and filtration capabilities, and erosion protection (Veldkamp et al., 2020). Despite the expected severity of these impacts, our understanding of the effects on soils in the humid tropics of Africa is limited by sparse data and uneven distribution of low-latitude research. Within the tropics, both the future impacts and data gaps are most severe in the Congo Basin, which contains the second largest tropical forest ecosystem on Earth, represents a considerable reservoir of soil carbon, and is critically endangered by fast deforestation (Hansen et al., 2013). Forest loss in central Africa is mainly driven by smallholder farmers practicing shifting cultivation (Curtis et al., 2018) and by cropland expansion to feed a fast-growing population. For example, the human population in Uganda, Rwanda, and the DRC is projected to more than double in the coming 80 years (Vollset et al., 2020). Such dramatic growth will likely contribute to further conversion of forest to agricultural land. As a result of these current and future impacts, more spatially explicit soil information is urgently needed in many research fields ranging from agricultural to soil biogeochemistry and climate sciences.
In recent decades, improvements have been made in carrying out soil surveys and creating soil databases and maps for central Africa (Goyens et al., 2007), for Rwanda (Imerzoukene and Van Ranst, 2002), and for the DRC. Unfortunately, accessibility of such data is limited, and gaps are still large in central Africa, in part due to the high cost of specialized equipment and chemicals for analyses, limited accessibility to sampling areas, and lack of infrastructure.
Diffuse reflectance infrared Fourier transform (DRIFT) spectroscopy has gained attention as a cost-effective and rapid method for soil analyses (e.g., Nocita et al., 2015). Many soil minerals, as well as functional groups of soil organic matter, show distinct energy absorption features in the infrared (IR) region of the electromagnetic spectrum. These relationships can be empirically modeled to quantify soil properties relevant for soil quality, such as carbon (C), nitrogen (N), and other crop nutrients (e.g., Janik et al., 1998; Soriano-Disla et al., 2014). Due to its simple handling, quick measurements, low costs, and minimal need for chemical consumables, IR spectroscopy is an important tool for soil analyses that further allows for high reproducibility and coverage of spatial soil heterogeneity. Especially in developing countries, where practices are often hampered by the prohibitive costs of conventional soil analyses, IR spectroscopy has great potential (Shepherd and Walsh, 2007; Ramirez-Lopez et al., 2019).
Partial least squares (PLS) is a projection-based regression method which can be considered the most widely used tool to calibrate models that translate spectral data into meaningful chemical and/or physical information. The method is especially useful in noncomplex contexts, where the relationships between spectra and response variables are essentially linear (e.g., spectral models developed for a small field where soil-forming factors are relatively constant). One of the main aims of establishing large-scale soil spectral libraries (SSLs) is to minimize the need for future wet chemical analyses (e.g., Stevens et al., 2013; Shi et al., 2014; Viscarra Rossel et al., 2016; Demattê et al., 2019). However, these libraries often span vast geographical areas that include different soil types and climate zones, which comprise complex soil organic carbon forms and mineral compositions. Due to this heterogeneity, predictions rendered by global linear regression models are often unfeasible for new local soil property assessments at a regional, field, or plot scale, especially when the new set covers a different geographical domain to the library. Despite the abundance of literature on the calibration of quantitative models of soil properties using both mid-infrared (MIR) and near-infrared (NIR) data, there is still a lack of simple and efficient modeling strategies that could bring SSLs to an operational level. Padarian et al. (2019) could considerably improve prediction accuracies for a new local set when using a compositionally related subset from a large-scale SSL, together with a small number of local reference analyses. Thus, a cost-accuracy trade-off can be met when the accuracy of the library-based prediction is similar to the one made when applying a local but more costly calibration strategy. Several data-driven methods have proven to be successful in overcoming this issue, for example RS-LOCAL (Lobsey et al., 2017) and memory-based learning (a.k.a. local learning, e.g., Naes et al., 1990; Shenk et al., 1997; Ramirez-Lopez et al., 2013a). In addition, other promising approaches have also been proposed, although they require more research (e.g., deep learning, Ng et al., 2019; fuzzy rule-based systems, Tsakiridis et al., 2019).
Memory-based learning (MBL), for example, searches, for each new spectral observation, a subset of similar observations in a spectral library, which are then used to fit a custom predictive model for every new observation. This method has shown promising results when applied to extremely complex SSLs such as the MIR library of the United States (Dangal et al., 2019) and in one developed for the European continent. Spiking of libraries with samples from the target site has also been shown to improve prediction accuracy (e.g., Guerrero et al., 2010; Wetterlind and Stenberg, 2010; Seidel et al., 2019; Barthès et al., 2020). So far, SSLs have mainly been used for predictions of soil samples originating from the same geographical domain. Studies have shown that subsetting large-scale libraries for new spectra by their geographical zones can result in good prediction accuracy (Nocita et al., 2014; Shi et al., 2015). These geographical restrictions could allow for extrapolation to new areas that contain similar soils.
The aim of the present work was to propose three strategies that leverage the use of a large soil infrared spectral library to accurately predict soil properties in regions which are poorly covered by it. Furthermore, we describe a convenient method for spiking MBL or local models. We also present the first SSL for central Africa (CSSL), which can be used to enlarge the existing continental library of African soils (a.k.a. African Soil Information Service, AfSIS). This effort represents an important first step towards fulfilling the need for spatially explicit and high-resolution soil data in an important yet understudied region in the humid tropics of Africa, promoting vital soil information that is critical to the future of the region.

Site descriptions
Soil samples were collected from past projects in the Congo Basin and along the Albertine Rift, the western branch of the East African Rift System. Table 1 gives an overview of corresponding data sources and data contributors to the different sample sets and denotes the origin, the number of samples, and sampling layers used for our CSSL. The sample locations of the entire library are clustered over a large geographical area of central Africa, from a latitude of 2.8 to −11.6° and a longitude of 12.9 to 30.4°. From our entire library, six clustered regions were identified, which contained at least 80 samples to allow for reliable analysis. Therefore, this subset will be further presented in the paper (see Tables A1 and A2 in the Appendix for information on the entire library). Four of the selected regions are located in the Democratic Republic of the Congo (Haut-Katanga, South Kivu, Tshopo, Tshuapa), while the other two are located in Rwanda (Iburengerazuba) and in Uganda (Kabarole), respectively (Figs. 1 and A1). Site-specific characteristics, coordinate ranges, altitudes, average climate, and dominant soil types are summarized in Table 2. Annual precipitation ranges from about 1200 mm in Haut-Katanga to over 2000 mm in the tropical forest of Tshuapa. Mean annual temperature varies from 17.6 °C in the high altitudes of Iburengerazuba and South Kivu to 24.9 °C in Tshopo (Fick and Hijmans, 2017). The study elevations range from 380 m a.s.l. in Tshuapa and Tshopo to high altitudes of 2300 m a.s.l. in South Kivu along the rift valley (Jarvis et al., 2008). Soil types are primarily Ferralsols, Acrisols, or Nitisols (IUSS Working Group WRB, 2015).
The different regions contain multiple Köppen-Geiger climatic zones: The three regions located close to the Equator (Tshuapa, Tshopo, Kabarole) are classified as Af (tropical rainforest), while the eastern DRC and western Rwanda are classified as a mixture of climate zones Cfb (temperate, without dry season, warm summer), Csb (temperate, dry summer, warm summer), Aw (tropical savannah), and Cwb (temperate, dry winter, warm summer). The regions along the rift valley (South Kivu, Iburengerazuba, Kabarole) are partially also classified as Am (tropical monsoon). Finally, the southeast of the DRC is classified as Cwa (temperate, dry winter, hot summer) (Beck et al., 2018).

Laboratory soil analyses
In preparation for total carbon (TC) and total nitrogen (TN) analyses, all soil samples were sieved through a 2 mm mesh and either air-dried or oven-dried at temperatures of 50 or 60 °C. After sieving and drying, soil samples were ground to a powder (< 50 µm) using a ball mill. TC and TN were analyzed via dry combustion using either a LECO 628 elemental analyzer (LECO Corporation, USA), an ANCA-SL automated nitrogen carbon analyzer (SerCon, UK), or a vario EL cube CNS elemental analyzer (Elementar, Germany). In order to ensure data quality and facilitate the harmonization of all TC and TN data, a subset of these samples was remeasured on the LECO. This performance comparison demonstrated high comparability of TC and TN data across all three instruments ($R^2 = 0.99$ for TC and TN, results not shown). The large majority of the soil samples originate from highly weathered and acidic soils and do not contain any carbonates, and therefore, TC contents correspond to total organic carbon contents. Only in a few samples from termite mounds in the subtropical Haut-Katanga province has calcium carbonate been detected where pH values are > 8 (Mujinya, 2012). Moreover, the widely used slash-and-burn practices could additionally have influenced soil TC contents, even when visible charcoal pieces were removed prior to any measurement. Additionally, soil pH (either in H2O, KCl, or CaCl2, depending on the study), texture (laser diffraction particle size analyzer), and aqua-regia-extractable macro-/micronutrients (Al, Fe, Ca, Mg, Mn, Na, P, and K; inductively coupled plasma-optical emission spectroscopy) were analyzed for a subset of samples. The chemical and MIR prediction results for these soil characteristics are not presented in this paper but were carried out using the same methods and are available on our GitHub repository at https://doi.org/10.5281/zenodo.4351254 (last access: 20 December 2020).

Central African spectral library
In order to determine the MIR reflectance, all ground soil samples were measured with a VERTEX70 Fourier transform IR (FT-IR) spectrometer with a high-throughput screening extension (HTS-XT) (Bruker Optics GmbH, Germany). Spectra were acquired at a resolution of 2 cm−1 within a range of 7500 to 600 cm−1, which corresponds to a wavelength range of 1333 to 16 667 nm. A gold-coated reflectance standard (Infragold NIR-MIR Reflectance Coating, Labsphere) was used as a background material for all measured soils in order to normalize the sample spectra. Reflectance was transformed into absorbance using log(1/reflectance) prior to further processing and subsequent modeling. Two replicates per sample were filled into the cups of a 24-well plate, and the surface was flattened without compression using a spatula. For each replicate, 32 co-added internal measurements were averaged and corrected for CO2 and H2O using OPUS spectrometer software (Bruker Optics GmbH, Germany). This library is denoted as $C = \{Y_c, X_c\}_{1}^{m}$ throughout the rest of the paper, where $Y_c$ is the matrix containing the two response variables (TC and TN), $X_c$ is the matrix of spectra, and $m$ is the total number of samples in the library.

Table 1. Soil sample archive used for the central African soil spectral library. The references show the publications from which the corresponding data were sourced. For previously unpublished data, the contributor institution is listed. The listed regions are provinces of the Democratic Republic of Congo (DRC) and Rwanda (RWA) and a district of Uganda (UGA).
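The unit conversions described for the spectral acquisition can be checked with a few lines of code. The following is a minimal sketch, not the authors' processing pipeline: the function names are hypothetical, and the logarithm is assumed to be base 10, which the text does not state explicitly.

```python
import math


def wavenumber_to_wavelength_nm(wavenumber_cm):
    """Convert a wavenumber in cm^-1 to a wavelength in nm (1 cm = 1e7 nm)."""
    return 1e7 / wavenumber_cm


def reflectance_to_absorbance(reflectance):
    """Apparent absorbance log(1/R) per reflectance value R.

    Assumption: base-10 logarithm; R must lie in (0, 1].
    """
    return [math.log10(1.0 / r) for r in reflectance]


# Limits of the acquisition range quoted in the text.
short_end = round(wavenumber_to_wavelength_nm(7500))
long_end = round(wavenumber_to_wavelength_nm(600))
absorbance = reflectance_to_absorbance([1.0, 0.1])
```

Under these assumptions, the two range limits reproduce the 1333 and 16 667 nm quoted above.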

AfSIS spectral library
We used a MIR SSL created by the World Agroforestry Centre (ICRAF) to predict soil samples of the six selected regions of central Africa for their TC and TN contents. This SSL was created as part of the Africa Soil Information Service (AfSIS) in order to improve soil information and land management on the continental scale of sub-Saharan Africa (Vågen et al., 2020). For this continental library (see Fig. 1), reference values for TC and TN were measured using a ThermoQuest EA 1112 elemental analyzer. The MIR spectra of the samples were obtained by scanning them on a Tensor27 FT-IR spectrometer (Bruker Optics GmbH, Germany) with a high-throughput screening extension. Soil samples were measured in a wavenumber range of 4000 to 600 cm−1 (2500 to 16 667 nm) with a spectral resolution of 2 cm−1. Four replicates per sample were measured, and an average of 32 co-added scans was used for each sample (Sila et al., 2016).
Here we denote this library as $A = \{Y_a, X_a\}_{1}^{n}$ throughout the rest of the paper, where for all its samples ($n$), $Y_a$ represents the matrix containing the two response variables (TC and TN), and $X_a$ represents the matrix of spectra.

Spectral resampling and preprocessing
All CSSL and AfSIS spectra were processed using the R packages "prospectr" (Stevens and Ramirez-Lopez, 2020), "simplerspec" (Baumann, 2020), and "resemble" (Ramirez-Lopez, 2020) in the R statistical computing environment (R Core Team, 2020). Replicates of spectral measurements were aggregated to one average spectrum per sample. The spectra were then resampled to a resolution of 16 cm−1 and trimmed to a spectral range of 4000–600 cm−1. Both spectral libraries were scanned on two FT-IR Bruker spectrometers (Bruker Optics GmbH, Germany), which use the same settings and the same internal standards. The scanning methods of the CSSL were adapted to the standard operating procedures of the Soil Plant Spectral Diagnostics Laboratory at ICRAF. For these reasons, no instrument standardization was necessary.
As spectral pretreatments have a marked impact on the performance of quantitative infrared models (Rinnan, 2014; Seybold et al., 2019), the preprocessing procedure was specifically optimized for the MIR spectra of the central African samples. This procedure was based on the PLS method (Wold et al., 1984), which is also known as projection to latent structures and is widely used for regression analysis in infrared spectroscopy. However, it is also useful for projecting the spectral data onto a low-dimensional (and therefore less complex) subspace containing all the meaningful information of the original data. The projection model can be expressed as

$$X = S P^{T} + E, \tag{1}$$

where $X$ is the original spectral matrix of $n \times d$ dimensions, $S$ is the PLS score matrix of $n \times l$ dimensions (where $l \leq \min(n, d)$) which contains the extracted variables, and $P$ is the matrix of loadings of $d \times l$ dimensions which captures the spectral variability across observations. $E$ is an error term. For spectral data with high collinearity, the optimal $l$ (or the number of PLS factors) is usually small, which means that only a few PLS factors or latent variables are enough to properly represent the original variability of $X$. An important aspect of this type of projection is that it is obtained in such a way that the covariance between $S$ and an external set of one or more variables is maximized. For a detailed description of PLS we refer the reader to Wold et al. (2001).

In PLS, $P$ can be used on new spectral observations to project them onto the lower dimensional subspace:

$$S_{\mathrm{new}} = X_{\mathrm{new}} P. \tag{2}$$

The spectral reconstruction residuals of the projection model can then be computed by back-transforming the matrix of scores to a spectral matrix and comparing it against the original spectral matrix as follows:

$$E_{\mathrm{new}} = X_{\mathrm{new}} - S_{\mathrm{new}} P^{T}. \tag{3}$$

Finally, the spectral reconstruction error (also known as the Q statistic) is computed as the sum of squares of $E_{\mathrm{new}}$:

$$Q = \sum_{j=1}^{d} e_{\mathrm{new},j}^{2}, \tag{4}$$

with $e_{\mathrm{new},j}$ the $j$th element of the residual vector of a new observation. The Q statistic indicates how well a given new sample is represented by the PLS model (Wise and Gallagher, 1996; Ballabio and Consonni, 2013). This statistic is widely used in chemometrics for outlier identification and uncertainty assessment (Wise and Roginski, 2015).
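The projection and reconstruction steps described above can be illustrated with a small numerical sketch. It is written under explicit assumptions: the loading matrix is a toy example with a single orthonormal column, the helper names (`matmul`, `transpose`, `q_statistic`) are hypothetical, and a real PLS implementation may use a separate projection matrix instead of the loadings to obtain the scores.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def q_statistic(x_new, loadings):
    """Return one Q value (sum of squared reconstruction residuals) per row.

    x_new: m x d spectra; loadings: d x l matrix P (assumed orthonormal here).
    """
    scores = matmul(x_new, loadings)                      # project onto the subspace
    reconstructed = matmul(scores, transpose(loadings))   # back-transform to spectra
    return [
        sum((x - r) ** 2 for x, r in zip(x_row, r_row))
        for x_row, r_row in zip(x_new, reconstructed)
    ]


# Toy example: three spectral variables, one latent variable along the first axis.
P = [[1.0], [0.0], [0.0]]
X_new = [[2.0, 0.0, 0.0],   # lies in the subspace, so Q = 0
         [2.0, 1.0, 2.0]]   # has variation outside the subspace, so Q > 0
q_values = q_statistic(X_new, P)
```

The second toy spectrum carries signal that the single latent variable cannot reproduce, which is the situation a large Q value is meant to flag.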
In summary, our approach offers a data-driven solution to the selection of the spectral preprocessing steps which are optimized for the target/prediction set. The optimal set of steps is defined as the one that minimizes the Q statistic. This approach does not require prior knowledge of the response values of the target set and therefore is well suited for preprocessing optimization. It assumes that PLS models that cannot account for the spectral variability in the target set may also fail at producing accurate predictions of the response variable. In other words, as suggested by Wise and Roginski (2015), large Q values can be used as proxies for large prediction errors, and therefore Q values can be used to judge the suitability of a set of preprocessing steps. To find an optimal combination of spectral pretreatments, we defined a set of different pretreatments $\{h_1, h_2, \ldots, h_z\}$, where $h_i$ represents one pretreatment or a sequence of pretreatments (with unique parameter values) to be applied on the spectral data. For this purpose, a projection model was built with the AfSIS spectra (using TC and TN as external variables) for each combination of spectral pretreatments:

$$h_i(X_a) = S_a P_a^{T} + E_a. \tag{5}$$

This model was then used on the CSSL pretreated spectra with reconstruction residuals ($E_c$) computed as follows:

$$E_c = h_i(X_c) - S_c P_a^{T}, \tag{6}$$

where $P_a$ denotes the loadings corresponding to the PLS model built with the AfSIS library and $S_c$ the projected scores of the central African library. For this analysis we fixed the number of PLS factors to 20 because projected variables beyond this dimension did not capture a sufficient amount of the original spectral variance. For example, PLS variable 21 amounted to less than 0.01 % of the original variance in all the cases. The mean Q value ($\bar{Q}_i$) for the $i$th set of pretreatments was obtained by

$$\bar{Q}_i = \frac{1}{m} \sum_{j=1}^{m} \sum_{k=1}^{d} e_{c,jk}^{2}, \tag{7}$$

where $e_{c,jk}$ is the element of $E_c$ for sample $j$ and spectral variable $k$, and $m$ and $d$ are the number of samples and the number of spectral variables in the CSSL, respectively.
To allow for comparisons across the reconstruction errors obtained for the different pretreatments, Q was standardized as follows: .
Tested pretreatments included different combinations of standard normal variate, multiplicative scatter correction, spectral detrending, first and second derivatives, and window sizes from 3 to 35 points in increments of 2. Minimal spectral reconstruction error was achieved with a Savitzky-Golay filter with a second-order derivative using a second-order polynomial approximation with a window size of 17 cm−1 (Savitzky and Golay, 1964), and a subsequent multiplicative scatter correction. This pretreatment was then applied to the spectra prior to MBL.
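Of the pretreatments listed, multiplicative scatter correction is compact enough to sketch directly. The code below is an illustrative implementation under stated assumptions, not the routine from the R packages used in the study: the function name `msc` is hypothetical, the reference defaults to the mean spectrum, and the Savitzky-Golay derivative step that precedes it in the selected pretreatment is not shown.

```python
def msc(spectra, reference=None):
    """Multiplicative scatter correction (illustrative sketch).

    Each spectrum x is regressed on a reference spectrum r (x ~ a + b * r)
    by ordinary least squares and returned as (x - a) / b.
    Assumption: the reference defaults to the mean of the supplied spectra.
    """
    n_vars = len(spectra[0])
    if reference is None:
        reference = [sum(s[k] for s in spectra) / len(spectra) for k in range(n_vars)]
    r_mean = sum(reference) / n_vars
    r_ss = sum((r - r_mean) ** 2 for r in reference)
    corrected = []
    for s in spectra:
        s_mean = sum(s) / n_vars
        slope = sum((r - r_mean) * (x - s_mean) for r, x in zip(reference, s)) / r_ss
        intercept = s_mean - slope * r_mean
        corrected.append([(x - intercept) / slope for x in s])
    return corrected


# Toy check: a spectrum that is an offset and scaled copy of the reference
# should be mapped back onto the reference.
ref = [1.0, 2.0, 3.0, 4.0]
scattered = [0.5 + 2.0 * r for r in ref]
restored = msc([scattered], reference=ref)[0]
```

When one library is used to correct another, passing the same reference spectrum to both calls keeps the two sets on a common scale; that design choice is an assumption of this sketch rather than something stated in the text.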

Principal component analysis data visualization
To analyze the difference between the two spectral libraries and to visualize the similarities between soil samples, a principal component analysis (PCA) was conducted on the preprocessed spectra of both libraries. The PCA was performed with centering but without scaling of the absorbance values.

Modeling approach
In the following we describe the method we used to assess the performance of MBL for predicting TC and TN for six distinct regions under different scenarios of regional soil extrapolation. Three specific modeling strategies were tested on the selected regional sets, which we call validation sets (see Sect. 2.6.2). With the regional analysis we demonstrate how predictions of soil properties within new sites from distinct regions (which are compositionally less variable than the available SSLs) perform and profit from knowledge present in the AfSIS SSL. The analysis also demonstrates the added value of our new CSSL in addition to the AfSIS SSL alone. In doing so, the aims of the modeling scenarios were (1) to minimize the costs and time for traditional methods by optimizing the transfer of stored spectral information to the new region of interest and (2) to test different levels of geographical extrapolation for new regions, when no chemical analyses of local samples are available.

Modeling and prediction data
We used two main data sources and subsets as follows:

1. The AfSIS data set (A). The continental SSL from sub-Saharan Africa includes 1902 soil samples with both MIR spectra data and analytical reference data (Fig. 1).

2. The central African data set (C  (100 samples), after the removal of one outlier sample from South Kivu with a large Mahalanobis distance to the AfSIS SSL and therefore high prediction uncertainties (distance > 3; results not shown). Each regional set was split up into a regional validation set ($G_i \setminus K_i$) and a spiking set ($K_i$). For this work we differentiated between three different subsets which are defined as follows:

(a) the union of the six regional subsets C:

$$C = \bigcup_{i=1}^{6} G_i; \tag{9}$$

(b) regional validation subsets, which are the regional sets without the spiking samples, $G_i \setminus K_i$;

(c) six representative regional spiking subsets $K_i$, which were selected from each regional set $G_i$ using the k-means sampling method, which selects one sample per cluster calculated on a principal component analysis as described in Naes (1987) (for examples of k-means sampling in soil spectroscopy, we refer the reader to Ramirez-Lopez et al., 2014; Vohland et al., 2016; Viscarra Rossel and Brus, 2018); a size of 20 samples per region was selected to show a pronounced effect of spiking that avoided any geographical extrapolation.

Modeling strategies
Three different scenarios were compared, which differ in the degree of geographical extrapolation:
- Strategy 1. MBL predictions for the regional validation subsets ($G_i \setminus K_i$) were computed from models built only with $A$. This scenario represents an extreme case of extrapolation (from the geographical perspective) because no samples from the entire central African area are present in the AfSIS set (Fig. 1), which is the only data used to build the predictive models.
- Strategy 2. Predictions for every $G_i \setminus K_i$ were computed using MBL models built from the pooled AfSIS data $A$ together with the data from the remaining five regions $C_i$, i.e., $A \cup C_i$, where $C_i = C \setminus G_i$ (10). Strategy 2 involves less pronounced geographical extrapolation than strategy 1.
- Strategy 3. Strategy 2 was repeated, but extrapolation was avoided by using the spiking samples from the same geographical region. Each regional set $G_i \setminus K_i$ was predicted by the pooled AfSIS data, the data of the remaining regions, and the respective spiking set, i.e., $A \cup C_i \cup K_i$.
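The composition of the calibration data under the three strategies can be written out with plain set operations. This is a minimal sketch with made-up sample identifiers; the function name `training_sets` and the dictionary layout are hypothetical and only serve to make the set algebra concrete.

```python
def training_sets(afsis, regions, spiking, target):
    """Return calibration sets for strategies 1-3 and the validation set.

    afsis:   set of AfSIS sample IDs (A)
    regions: dict region name -> set of sample IDs (G_i)
    spiking: dict region name -> set of spiking sample IDs (K_i, subset of G_i)
    target:  name of the region to be predicted
    """
    # C_i: all central African samples except those of the target region.
    remaining = set().union(*(ids for name, ids in regions.items() if name != target))
    strategy_1 = set(afsis)                       # A
    strategy_2 = strategy_1 | remaining           # A with C_i
    strategy_3 = strategy_2 | spiking[target]     # A with C_i and K_i
    validation = regions[target] - spiking[target]  # G_i without K_i
    return strategy_1, strategy_2, strategy_3, validation


# Toy identifiers, purely illustrative.
A = {"a1", "a2"}
G = {"Tshopo": {"t1", "t2", "t3"}, "Kabarole": {"k1", "k2"}}
K = {"Tshopo": {"t1"}, "Kabarole": {"k1"}}
s1, s2, s3, val = training_sets(A, G, K, "Tshopo")
```

In every strategy the validation samples stay out of the calibration data; only strategy 3 contains any sample from the target region.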

Predictive modeling
We used MBL as our predictive modeling approach. In the chemometrics literature, MBL is also known as local modeling, which describes a family of (nonlinear) machine learning methods designed to handle complex spectral data sets (Ramirez-Lopez et al., 2013b). This type of learning method does not attempt to fit a general (global) predictive function using all available data. Instead, a new and unique function ($f_i$) is built on demand, every time a new prediction for a given response variable is required. This new function is built using only a subset of relevant observations from a reference set that are queried through a k-nearest neighbor search. The MBL method implemented for this study uses a spectral nearest neighbor search based on a moving window correlation dissimilarity. To measure the dissimilarity ($r$) between two spectra ($X_i$ and $X_j$), the following equation was used:

$$r(X_i, X_j; w) = \frac{1}{2w} \sum_{k=1}^{d-w} \left(1 - \rho\left(X_{i,\{k:k+w\}}, X_{j,\{k:k+w\}}\right)\right), \tag{11}$$

where $d$ is the number of spectral variables, $\rho$ represents the Pearson's correlation function, and $w$ is the size of a moving window. This window size was optimized based on a spectral nearest-neighbor search within the AfSIS library. For every sample in the AfSIS library, its closest sample (in the spectral space) was identified. Then, samples were compared against their closest neighbors in terms of TC and TN and root mean squared differences (RMSDs), computed according to the following equations:

$$j(i) = \mathrm{NN}\left(x_{a,i}, X_a^{-i}\right), \tag{12}$$

$$\mathrm{RMSD}_h = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_{a,i,h} - y_{a,j(i),h}\right)^{2}}, \tag{13}$$

where $\mathrm{NN}(x_{a,i}, X_a^{-i})$ represents a function to obtain the index of the nearest neighbor of the $i$th observation found in $X_a$ (excluding the $i$th observation), and $y_{a,i,h}$ is the value of the $i$th observation for the $h$th property variable (either TC or TN). In total 10 window sizes were evaluated using this approach (from 31 up to 121 in steps of 10), and according to the RMSD, an optimal window size $w$ of 71 was chosen. After nearest neighbor retrieval, our MBL method fits a local model using the weighted average partial least squares (WA-PLS) regression algorithm proposed by Shenk et al. (1997). In this WA-PLS, the final prediction is a weighted average of multiple predictions generated by PLS models built from different PLS factors. A range of latent variables from 5 to 30 in increments of 1 was used for the WA-PLS calculations. The weight for each component is calculated as follows:

$$w_j = \frac{1}{s_{1:j} \times g_j}, \tag{14}$$

where $s_{1:j}$ is the root mean square of the spectral residuals of the new observation when a total of $j$ PLS components are used (i.e., all the components from the first one to the $j$th one), and $g_j$ is the root mean square of the regression coefficients corresponding to the $j$th PLS component (see Shenk et al., 1997, for more details). The number of neighbors that needed to be retrieved was optimized using nearest neighbor (NN) cross-validation (Ramirez-Lopez et al., 2013b). Using this method, for each observation to be predicted, its nearest neighbor is excluded from the group of neighbors, and then a WA-PLS model is fitted using the remaining ones. This model is then used to predict the value of the response variable of the nearest observation. Predicted values are finally cross-validated with the actual values (see Ramirez-Lopez et al., 2013b, for additional details). For the optimization of the nearest neighbor search, i.e., the nearest neighbor cross-validation, a grouping factor was used to avoid overfitting: keeping the nearest neighbor out, the model was trained with the remaining neighbors which were not from the same region as the hold-out neighbor (region corresponds to the sentinel sites within the AfSIS SSL). The minimum number of available neighbors was tested for each region prior to training the respective final models, which were then trained with neighborhood sizes varying from 150 to 500 neighbors in increments of 10. The best model and the optimal number of neighbors were determined by the minimal RMSE (Eq. 15) of the nearest neighbor cross-validation,

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)^{2}}, \tag{15}$$

where $n$ is the number of neighbors used for the model, $y_i$ is the measured value of the hold-out neighbor, and $\hat{y}_i$ is the value predicted by the remaining neighbors.
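Two building blocks of this procedure, the moving window correlation dissimilarity and the WA-PLS weighting, can be sketched in a few lines. The code below is illustrative only and rests on assumptions that should be checked against the "resemble" package: each window is taken to span `w + 1` consecutive variables, the sum is scaled by `1 / (2 * w)`, windows with zero variance are not handled, and the function names are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equally long sequences (non-constant input)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def window_correlation_dissimilarity(xi, xj, w):
    """Sum of (1 - rho) over moving windows, scaled by 1 / (2 * w).

    Assumption: window k covers variables k .. k + w (w + 1 values),
    giving len(xi) - w windows in total.
    """
    d = len(xi)
    total = sum(
        1.0 - pearson(xi[k:k + w + 1], xj[k:k + w + 1]) for k in range(d - w)
    )
    return total / (2.0 * w)


def wapls_prediction(predictions, s, g):
    """Weighted average of per-component PLS predictions.

    predictions[j]: prediction using components 1..j
    s[j]: RMS of spectral residuals with components 1..j
    g[j]: RMS of the regression coefficients of component j
    Weights are 1 / (s[j] * g[j]), normalised to sum to one.
    """
    weights = [1.0 / (sj * gj) for sj, gj in zip(s, g)]
    return sum(wt * p for wt, p in zip(weights, predictions)) / sum(weights)


same = window_correlation_dissimilarity([1, 2, 3, 4, 5], [1, 2, 3, 4, 5], 2)
opposite = window_correlation_dissimilarity([1, 2, 3, 4, 5], [5, 4, 3, 2, 1], 2)
blended = wapls_prediction([10.0, 20.0], s=[1.0, 1.0], g=[1.0, 3.0])
```

Identical spectra give a dissimilarity of zero, and components with large residuals or large coefficients contribute less to the final prediction, which is the behaviour the weighting is designed to produce.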
Subsequently, independent of their distances to the validation set, 1 to 20 spiking samples were added from the target region and forced into the neighborhood of every observation and thus used in the predictive models. Our approach differs from previous studies using local modeling methods in combination with spiking, where the samples were not forced into the neighborhoods (e.g., Barthès et al., 2020; Lobsey et al., 2017). Our approach guarantees that the spiking set (which is assumed to carry important information) is fully used.
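The idea of forcing spiking samples into each neighborhood can be sketched as follows. This is a simplified illustration, not the implementation used in the study: the function name is hypothetical, dissimilarities are assumed to be precomputed, and the spiking samples are assumed to be added on top of the k retrieved neighbors rather than replacing some of them.

```python
def neighborhood_with_spiking(dissimilarities, k, spiking_ids):
    """Select the k nearest library samples and append all spiking samples.

    dissimilarities: dict sample ID -> dissimilarity to the target observation
    k:               number of nearest neighbors to retrieve from the library
    spiking_ids:     set of spiking sample IDs, included regardless of distance
    """
    ranked = sorted(dissimilarities, key=dissimilarities.get)
    nearest = [sample for sample in ranked if sample not in spiking_ids][:k]
    # Spiking samples are always part of the local calibration set.
    return nearest + sorted(spiking_ids)


# Toy example: "s1" is spectrally far away but is still used.
diss = {"a": 0.10, "b": 0.50, "c": 0.30, "s1": 0.90}
local_set = neighborhood_with_spiking(diss, k=2, spiking_ids={"s1", "s2"})
```

A plain k-nearest neighbor search would have dropped "s1" here, which is the difference from spiking approaches that leave neighbor selection purely to spectral distance.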
Stepwise spiking was applied to test the effect of spiking in general and to find the smallest number of samples required for satisfying model performances. This was necessary since soil samples from the same geographical region are usually governed by very similar formation processes (spatial autocorrelation; Fortin et al., 2016), and MIR spectra partially reflect the compositional characteristics of these samples. Moreover, it is widely accepted that the most accurate predictions can be achieved by models built with samples originating from the same region because large nonlinear complexity is avoided (e.g., Tziolas et al., 2019).

Model validation and prediction accuracy
For model validation, the RMSE statistics of the nearest neighbor cross-validation described in the previous section were used. Prediction accuracy of the predicted vs. the measured values was also calculated using RMSE (Eq. 15), where in this case $y_i$ is the actual measured reference value and $\hat{y}_i$ the prediction of the final model.
Model validation and prediction performance were additionally evaluated using the mean error (ME; mean of the absolute difference between predicted and observed values) and the ratio of performance to the interquartile distance (RPIQ; Bellon-Maurel et al., 2010). For calculating RPIQ, the interquartile range of the observed reference data is divided by the RMSE of the nearest neighbor validation or by the RMSE of the prediction (RMSEpred). This is particularly useful since RPIQ does not make any assumptions about the distribution of the reference data.
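As a worked illustration of these statistics, the sketch below computes RMSE, ME, and RPIQ from paired observed and predicted values using only the Python standard library. The function names are hypothetical. ME follows the definition given in the text (mean of the absolute differences), and the quartile convention (`method="inclusive"`) is an assumption, since the text does not state which quantile definition was used.

```python
import math
import statistics


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - o) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mean_error(observed, predicted):
    """ME as defined in the text: mean absolute difference."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - o) for o, p in zip(observed, predicted)) / len(observed)


def rpiq(observed, predicted):
    """Interquartile range of the observed values divided by the RMSE."""
    # Assumption: quartiles by linear interpolation over the observed range.
    q1, _, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    return (q3 - q1) / rmse(observed, predicted)
```

Because RPIQ scales the error by the interquartile range rather than the standard deviation, a few extreme reference values change it less than they would change a ratio based on the standard deviation.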

Results
The samples that comprise the CSSL exhibited a wide range of TC and TN contents (Fig. 2). Validation and spiking sets for four of the six regions (Haut-Katanga, Tshopo, Tshuapa, Kabarole) had mean TC and TN of 9.30–18.10 and 0.95–1.74 g kg−1, respectively. Maximum TC and TN values for these four regions were 56.69 and 5.05 g kg−1, respectively. The other two regions, South Kivu in the eastern DRC and Iburengerazuba in western Rwanda, had considerably higher TC and TN contents, with mean values of 23.55–35.43 and 1.34–3.07 g kg−1, respectively. The AfSIS SSL had generally lower mean TC and TN contents of 12.37 and 0.82 g kg−1, respectively.

Principal components and spectral variability in the two libraries
The first three principal components accounted for 85 % of the spectral variability (Fig. 3). These components indicate that the majority of CSSL samples lie within the spectral domains of the AfSIS SSL as their PCA scores overlap. This overlapping is, however, less evident for the spectra of the South Kivu region and, to a lesser extent, for the samples of the Iburengerazuba and Tshuapa regions, which suggests that the type of soils in these regions may not be well represented by the AfSIS SSL compared to the other regions.

Predictive performance of the three strategies
In general, MBL retrieved accurate TC and TN predictions for all the strategies (with RMSEpred values below 9 g kg−1 for TC and below 1.7 g kg−1 for TN). The South Kivu and Iburengerazuba regions showed the highest RMSEpred, which was mainly due to the high TC and TN ranges (Fig. 2; Fig. 4). Moreover, TC predictions for Haut-Katanga, Tshopo, Tshuapa, and Iburengerazuba, as well as TN predictions in all six regions, showed a clear trend towards underestimation (Fig. 4). This can be caused by one or a combination of the following three effects: (i) the central African samples were poorly represented by the continental AfSIS SSL due to the differing pedogenic features (Fig. 3); (ii) the preprocessing methods did not completely account for the spectral offset and/or multiplicative effects in the spectra (due to instrument differences); and (iii) performance differences exist between the conventional laboratory analyses used to obtain TC and TN reference values.

Strategy 2: regional predictions by soil spectral libraries
Compared with strategy 1, strategy 2 showed better predictive performance for TC in some regions and retrieved better TN predictions in all cases. These improvements are exemplified by the larger RPIQpred and smaller RMSEpred values in strategy 2 (Table 3).
Comparing the TC RMSEpred of each region across the first two strategies, errors for Haut-Katanga, Tshopo, and Iburengerazuba were substantially reduced in strategy 2. Two regions performed equally well (South Kivu and Tshuapa) in both strategies, and only one region (Kabarole) saw an increase in errors (Table 3). For all regions, TN prediction errors (RMSEpred) were consistently lower in strategy 2 than in strategy 1 (Table 3). The R2pred values of the TC and TN predictions indicate that the precision of such models was, in general, equal or slightly better for strategy 2 than for strategy 1.

Table 3. Statistics of the independent validations of the predictions of total carbon and total nitrogen for each region and three strategies. Strategy 1: predictions of the combined six regions by the AfSIS soil spectral library (SSL); strategy 2: predictions of the individual regions by the remaining five regions together with the AfSIS SSL; strategy 3: spiking six regional models from strategy 2 with 20 samples from each target area.
Columns: Strategy; Region; for total carbon (g kg−1) and for total nitrogen (g kg−1) each: npred, RMSEpred, R2pred, MEpred, RPIQpred.

Strategy 3: spiking of the regional models
For all regions, spiking the regional models with up to 20 local samples from each corresponding regional spiking set $K_i$ consistently produced lower prediction errors (Fig. 5) compared to strategy 1 and strategy 2. For Haut-Katanga, Tshopo, Tshuapa, and Iburengerazuba, the RMSEpred for TC and TN could be reduced with 10 to 13 spiking samples and did not change substantially thereafter (Fig. 5). In contrast, for South Kivu and Kabarole, RMSEpred values were minimized with 16 or more spiking samples from each target region (Fig. 5). To present the strong and contrasting effect of forgoing any spatial extrapolation in strategy 3, the results for 20 spiking samples are presented in Table 3 and Fig. 4. The strongest reduction of the RMSEpred for TC in strategy 3 (with 20 spiking samples) compared to strategy 2 (no spiking) was achieved for Kabarole (4.44 g kg−1), Iburengerazuba (1.62 g kg−1), and South Kivu (1.56 g kg−1), followed by Tshuapa, Haut-Katanga, and Tshopo, which decreased by 0.45–0.93 g kg−1. Similarly, shifting from strategy 2 to 3 had the strongest effect on the RMSEpred for TN for South Kivu (0.2 g kg−1), for Kabarole (0.2 g kg−1), and for Iburengerazuba (0.15 g kg−1), whereas differences were smaller for Haut-Katanga, Tshuapa, and Tshopo (0.03–0.06 g kg−1). Strategy 3 also resulted in predictions that better represented the measured values (consistently higher R2pred and RPIQpred values than in strategy 1 or 2; Table 3). The Kabarole region showed the best predictive performance for TC in strategy 3 (RPIQpred of 5.48), followed by Iburengerazuba, South Kivu, Haut-Katanga, and Tshuapa (RPIQpred 2.22–3.57). For TN, Iburengerazuba, Kabarole, South Kivu, Tshuapa, and Haut-Katanga showed accurate predictions (RPIQpred of 1.87–4.45). RPIQpred values for the predictions of TC and TN for Tshopo were less than 2 (RPIQpred TC: 1.43 and RPIQpred TN: 1.62).
However, the trend from strategy 1 to strategy 3 was a clear reduction in prediction errors and an increase in accuracy.

Strategy 1 and strategy 2: using soil spectral libraries outside of their respective geographical domains
Our analysis shows that TC and TN in six regions of our CSSL can be reasonably well predicted with MBL methods through the use of existing SSLs composed of soils from completely different geographical areas and without any local samples (RMSEpred < 9 g kg−1 TC and < 1.7 g kg−1 TN; Table 3). The resulting prediction errors were comparable to other large-scale MIR prediction studies (e.g., Dangal et al., 2019; Angelopoulou et al., 2020) and also to other soil infrared studies that analyzed geographical extrapolation possibilities (e.g., Padarian et al., 2019; Briedis et al., 2020; Gomez et al., 2020). The advantage of using MBL as the method to build prediction models is that it finds similar spectral observations for every new observation to fit suitable models. This approach works efficiently since spectral similarity reflects the similarity between observations in terms of soil composition, information which is largely contained in the MIR features of a sample. This means that the predictive success of MBL models largely depends on the quality of the spectral dissimilarity methods used to find spectral neighbors. In other words, MBL can be described as a method driven by compositional similarity search. The improved prediction accuracy (lower RMSEpred and higher RPIQpred) when reducing extrapolation (strategy 2) can be explained by the addition of more proximal central African soil samples to the library that are more similar to each predicted region. The continental AfSIS SSL is missing data for most of central Africa (Fig. 1); none of the tropical forest soils with high contents of organic carbon or with distinctive mineral-organic composition are covered by this large-scale SSL. Naturally, this variability impacts the generalization ability of any predictive model or modeling strategy. Moreover, variance arising from instrument and reference laboratory differences was avoided through the use of local models.
However, it is not clear why Kabarole exhibited higher prediction errors in strategy 2. A possible reason could be random variance (Fig. 4) or nonlinearity. Two regions (South Kivu and Tshuapa) did not show any substantial changes in RMSEpred and RPIQpred values for TC when comparing strategy 1 and strategy 2. Note that both South Kivu and to some extent also Tshuapa cover a distinct score space in Fig. 3 and therefore are not well represented by the remaining central African regions, nor by the AfSIS SSL. All central African regions from the CSSL show large variability in TC and TN contents (Fig. 2) and contain samples from various land cover types (forest/croplands), altitudes (Table 2), and parent materials. These differences suggest that soils have developed and been transformed under a variety of environmental conditions. For example, high diversity in organic compounds and their stabilization in soils (i.e., organo-mineral association, complexation, aggregation) can introduce nonlinear relationships that are difficult to predict with locally linear calibration methods (i.e., memory-based learning in combination with PLS regression). Thus, we conclude that the particularly high soil diversity in these two regions, in terms of biogeochemical and physical properties, introduces additional complexity into the soil spectral prediction workflow. Similarly high RMSEs have been shown in other studies for samples with organic carbon higher than 150 g kg−1 (Nocita et al., 2014). As in our study, these high errors were attributed to high TC contents. To improve prediction accuracies for these diverse regions, more data are needed. The creation of subsets from large spectral libraries via spectral similarities, for example, has been shown to be effective for training calibration models (e.g., Wetterlind and Stenberg, 2010; Clairotte et al., 2016; Sanderman et al., 2020).
Hence, in order to reduce uncertainties for regions in central Africa that are diverse in terms of soil chemical composition, in particular for the Great Lakes region, there is a pressing need to fill the existing gaps in the continental library by gathering more data on the ground.

Strategy 3: effect of spiking with local samples on prediction performance
The spiking of the calibration models with local target samples had a positive effect for all included regions (Fig. 5 and Table 3). Kabarole, Iburengerazuba, and South Kivu, which showed the most substantial reductions of RMSEpred for TC and TN by spiking, cover different land uses, high altitudes along the Albertine Rift, and larger climatic ranges (Table 2). These soils are not adequately represented by the continental AfSIS SSL, nor by the remaining central African regions, and therefore exhibited a strong effect when spiked with local soil data. Although the effect of spiking on RMSEpred for TC and TN was somewhat smaller for the other included regions (Haut-Katanga, Tshopo, and Tshuapa), it still produced noticeable improvements compared to strategy 1 and strategy 2 (smaller RMSEpred and larger RPIQpred values). The TC and TN ranges of Haut-Katanga, Tshopo, and Tshuapa were narrower, and they also seem to be better represented by each other and by the AfSIS SSL (with the exception of a few samples of Tshuapa; Fig. 3). In these three regions, sufficiently similar spectra were available, and the MBL found the required neighbors to build accurate models and predict TC and TN, thus lowering the positive effect of spiking. Additionally, the weaker influence of spiking on soils of Tshopo (RPIQpred TC: 1.43 and RPIQpred TN: 1.62) can be explained by an outlier in the predictions (Fig. 4) and a slightly uneven distribution of the reference data between the validation and spiking sets (Fig. 2). In summary, spiking has already been shown to improve performance (e.g., Guerrero et al., 2014; Seidel et al., 2019; Barthès et al., 2020) and also proved its value in our study. However, a threshold of 20 samples poses non-negligible additional costs for laboratory reference analysis, and the benefit in terms of gain of accuracy by spiking depends on the region and is not always guaranteed.
In some cases, however, a smaller number of spiking samples can substantially reduce the RMSEpred (e.g., Iburengerazuba and Kabarole). The required prediction accuracy and the additional investment therefore depend on the field of application.
The achieved predictions and their errors from this study are more than satisfactory for the study of TC and TN dynamics and will improve the availability of high-resolution soil data of central Africa. Thus, spiking is recommended when soils are highly variable and show large distances to existing spectral libraries.

Suggestions for building new models and extending the existing spectral library
Our regional predictions of TC and TN show promising results when analyzing soils from geographically distinct areas in central Africa that are not covered by the continental AfSIS SSL (Fig. 1). Six central African regions were predicted for soil TC and TN with sufficient accuracy using the large-scale AfSIS soil spectral library only. The generally positive effect of adding geographically closer samples to the AfSIS SSL (strategy 2) underlines the usability of spectral libraries for new regions. The generally positive effect of strategy 3, spiking of all regional predictions for TC and TN with samples from the target area, encourages the future amendment of currently existing libraries to improve prediction accuracy.
To improve future soil analyses and to extend the geographical area covered by an SSL, we suggest the following workflow:
1. Preprocessing. Different spectral preprocessing methods influence model and prediction performance. We suggest selecting the best preprocessing strategies using spectral projections and minimizing the reconstruction error (see Sect. 2.4).
2. Estimation of uncertainty for new samples. When analyzing new soil samples from a region which is not covered by the existing SSL, samples with different composition and hence chemical properties are more likely to be introduced. Samples with high distances in the score space to the SSL cannot be predicted accurately with a high certainty, since they are often highly divergent from the SSL. We recommend a preliminary graphical inspection of the resampled and preprocessed spectra, which can already reveal differences. A further dimension reduction (e.g., with a PCA) with a subsequent 2D or 3D visualization of the first factors provides additional insights into dissimilarity.
As shown with the AfSIS SSL, applying existing libraries and extrapolating them to new regions is accurate and suitable for estimating soil properties. However, to make predictions more accurate, especially for more diverse, heterogeneous, and complex soils, more data are required. As demonstrated, adding more proximal central African regions to the large-scale library improved the overall prediction accuracy. These results encourage the use and amendment of existing libraries, rather than the construction of new, separate, and extensive databases. Given the existing distribution of samples in the new CSSL, it is especially important to increase the number of forest soils with high TC contents, which represent a large portion of the Congo Basin. The future enlargement of the CSSL, preferably facilitated by our suggested workflow, is crucial to fill the gap of soil information in this highly understudied part of the world and can be assisted by the soil science community by adopting a sharing-oriented open data policy.
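The score-space check in step 2 of the workflow could be complemented by a simple numerical screen, sketched below. This is a hypothetical heuristic rather than a method from this study: it assumes that PCA scores have already been computed for both the library and the new samples, uses Euclidean distance in that score space, and flags a new sample when its distance to the nearest library sample exceeds a chosen quantile of the nearest neighbor distances within the library. The function names and the default quantile of 0.95 are illustrative assumptions.

```python
import math


def nearest_distance(point, library, skip=None):
    """Euclidean distance from `point` to its closest library sample."""
    return min(
        math.dist(point, other)
        for idx, other in enumerate(library)
        if idx != skip
    )


def flag_poorly_represented(library_scores, new_scores, quantile=0.95):
    """Flag new samples that lie far from the library in score space.

    Assumption: a new sample is 'far' if its nearest-library distance
    exceeds the given quantile of within-library nearest neighbor distances.
    """
    if len(library_scores) < 2:
        raise ValueError("the library needs at least two samples")

    # Reference distribution: how close library samples are to each other.
    reference = sorted(
        nearest_distance(p, library_scores, skip=i)
        for i, p in enumerate(library_scores)
    )
    cut = min(len(reference) - 1, int(quantile * len(reference)))
    threshold = reference[cut]

    flags = [nearest_distance(p, library_scores) > threshold for p in new_scores]
    return flags, threshold
```

Flagged samples would be natural candidates for reference analysis and spiking, since these are the samples for which the existing library offers the fewest close spectral neighbors.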

Conclusions
Our study presents the results and workflow for building the first central African SSL for predicting soil properties (TC and TN) using lab-based MIR spectroscopy in a crucial but understudied area of the African continent. Extrapolations were possible for central Africa and for all six selected regions. Our results further demonstrate how MBL algorithms are useful to find spectral similarities and reduce the need for spiking when a new set covers the same score space as the existing library. These encouraging insights highlight the utility of spectral libraries for future applications, since they are not necessarily limited to certain geographical areas. Our approach of augmenting a smaller SSL with a continental SSL, even when scanned on different instruments, leads to reasonably accurate predictions for new regions, which allows for analyses of TC and TN dynamics in soils but also meets a competitive cost-benefit trade-off. Furthermore, the CSSL fills an appreciable gap in the continental-scale AfSIS SSL and contributes to covering an important range of soil variability with spectral data, particularly from tropical forests. However, in order to improve the accuracy of predicting soil organic matter across regions, especially for soil compartments with high TC and TN contents, our study highlights the need to extend the existing library into new regions. The inclusion of more samples and regions, in particular with more (varying) data of humid tropical forest soils, is crucial to fill existing gaps. Combining spectral libraries will allow for fast analyses of soil samples and provide spatially explicit data across humid tropical Africa.

Table A1. Number of samples, GPS coordinates, elevation, annual precipitation (AP), mean annual temperature (MAT), Köppen-Geiger climate classifications, and soil types for the entire soil spectral library for the Democratic Republic of Congo, Rwanda, and Uganda. Data were extracted for all coordinates from raster files. Climate data are sourced from Fick and Hijmans (2017), elevation from SRTM (90 m resolution; Jarvis et al., 2008), Köppen-Geiger climate classifications from Beck et al. (2018), and soil types from the Soil Atlas of Africa IUSS Working Group WRB, 2015.