The number of samples used in the calibration data set affects the quality of the generated predictive models using visible, near and shortwave infrared (VIS–NIR–SWIR) spectroscopy for soil attributes. Recently, the convolutional neural network (CNN) has been regarded as a highly accurate model for predicting soil properties on a large database. However, it has not yet been ascertained how large the sample size should be for a CNN model to be effective. This paper investigates the effect of the training sample size on the accuracy of deep learning and machine learning models. It aims at providing an estimate of how many calibration samples are needed to improve the model performance of soil property predictions with CNN as compared to conventional machine learning models. In addition, this paper also looks at a way to interpret the CNN models, which are commonly labelled as a black box. It is hypothesised that the performance of machine learning models will increase with an increasing number of training samples, but it will plateau when it reaches a certain number, while the performance of CNN will keep improving. The performances of two machine learning models (partial least squares regression – PLSR; Cubist) are compared against the CNN model. A VIS–NIR–SWIR spectral library from Brazil, containing 4251 unique sites with averages of two to three samples per depth (a total of 12 044 samples), was divided into calibration (3188 sites) and validation (1063 sites) sets. Subsets of the calibration data set were then created to represent smaller calibration data sets of 125, 300, 500, 1000, 1500, 2000, 2500 and 2700 unique sites, equivalent to sample sizes of approximately 350, 840, 1400, 2800, 4200, 5600, 7000 and 7650. All three models (PLSR, Cubist and CNN) were generated for each sample size of the unique sites for the prediction of five different soil properties, i.e. cation exchange capacity, organic carbon, sand, silt and clay content. 
These calibration subset sampling processes and modelling were repeated 10 times to provide a better representation of the model performances. Learning curves showed that the accuracy increased with an increasing number of training samples. At a lower number of samples (
There has been an increasing demand for a rapid and cost-effective alternative to conventional laboratory soil analysis. Visible, near and shortwave infrared (VIS–NIR–SWIR) spectroscopy has been proposed as an alternative tool for soil analysis over the last few decades (Bendor and Banin, 1995; Shepherd and Walsh, 2002; Stenberg et al., 2010). This method enables the simultaneous prediction of various properties and is non-destructive.
Various machine learning models, such as partial least squares regression (PLSR), Cubist, random forest and support vector machines, have been utilised to model spectroscopy data. However, the performances of these regression models are dependent on the spectral preprocessing methods (Rinnan et al., 2009) and the size and representativeness of the calibration samples (Kuang and Mouazen, 2012; Ng et al., 2018). Different combinations of spectral preprocessing methods result in different model performances. Furthermore, the spectral preprocessing techniques developed for a particular data set might not work for a different data set. Better generalisation can be achieved by training the model on a larger data set. However, several studies demonstrated that the performance of the machine learning model did not increase significantly, or it even plateaued, as the calibration sample size increased (Figueroa et al., 2012; Ramirez-Lopez et al., 2014; Ng et al., 2018).
Advances in artificial intelligence, such as deep learning, make it
possible to extract features from data without hand-engineering
(LeCun et al., 2015), such as preprocessing. Various
deep learning convolutional neural network (CNN) models (e.g. AlexNet, VGGnet, GoogLeNet and ResNet) have been developed and trained on large volumes of data, including over 10 million images (Krizhevsky et al.,
2012; Simonyan and Zisserman, 2014; Szegedy et al., 2015; He et al., 2016). CNN has recently been applied in soil science (Padarian et al., 2019; Tsakiridis et al., 2020). Although CNN often deals with images as input data, it has recently been successfully applied to vibrational and reflectance spectroscopy (Acquarelli et al., 2017; Cui and Fearn, 2018; Liu et al., 2018; Ng et al., 2019; Padarian et al., 2019; Tsakiridis et al., 2020; Zhang et al., 2020). Acquarelli et al. (2017) found that the CNN-based model outperformed other models (partial least square, least discriminant analysis, logistic regression and
Deep learning models such as CNN were developed to handle large amounts of data (millions of images), whereas soil spectral data sets are not yet that large. For example, a recent study used deep learning on 135 soil samples (Chen et al., 2018). The advantage of using CNN on such a small number of samples is uncertain. A recent review on spectroscopy showed that there were several studies in which deep learning was used with a small calibration sample size (Yang et al., 2019). The review indicated that an increase in calibration sample size should further improve the calibration performance. However, there was no guideline as to how much improvement can be expected or what minimum number of samples is needed for deep learning to be effective.
A strategy to select an adequate calibration set in terms of representativeness and size is vital for obtaining a model with good generalisation ability. Although various sampling algorithms (e.g. Kennard–Stone, conditioned Latin Hypercube sampling and
Model performance of deep learning vs. other machine learning algorithms as a function of number of samples.
Thus, the purpose of this study is to assess the amount of calibration data
needed for the CNN model to outperform machine learning models. PLSR and
Cubist are chosen as representatives of the machine learning models,
as both have been found to perform well on soil spectra (e.g. Dangal
et al., 2019). In addition to predicting soil properties accurately, we need to understand and interpret how a CNN model predicts soil properties from spectra. This paper presents the following specific
contributions:
(i) testing the idea that common machine learning models will reach a plateau in accuracy with an increasing number of calibration samples; (ii) establishing the number of calibration samples required for deep learning to be effective for VIS–NIR–SWIR spectra; (iii) establishing how much improvement in accuracy is achieved when the number of calibration samples for deep learning and machine learning models is increased; and (iv) demonstrating how to interpret a deep learning model using a sensitivity analysis.
This data set comprises 12 044 soil samples from 4251 unique sites. The soil samples were collected from several regions in Brazil, i.e. the states of São Paulo, Minas Gerais, Goiás and Mato Grosso do Sul. This data set is part of the Brazilian Soil Spectral Library and has been extracted from Terra et al. (2018) and Bellinaso et al. (2010). The soils were derived mostly from basalt (volcanic rock) and sedimentary rocks (sandstone). Each site has up to seven measurements from the surface up to 1 m depth.
The measured properties include soil texture (sand, silt and clay), organic
carbon (OC) and cation exchange capacity (CEC). The soil particle size was
quantified by the pipette method, as described in Donagema et al. (2011). The method consists of using a 0.1 M NaOH solution as a dispersing
agent under high-speed mechanical stirring for 10 min. Then, the sand
fraction was separated by sieving, and the clay fraction was measured by
sedimentation. The silt was quantified based on the pre- and post-difference.
Organic carbon (OC) was determined by the Walkley–Black method
(Walkley and Black, 1934), in which OC was oxidised, using
K
Descriptive statistics of the soil properties measurements.
The VIS–NIR–SWIR spectra of the soil samples were obtained with a FieldSpec3 spectroradiometer (Analytical Spectral Devices, Boulder, Colorado), with a spectral range of visible to shortwave infrared (350–2500 nm) and a spectral resolution of 1 nm from 350 to 700 nm, of 3 nm from 700 to 1400 nm and of 10 nm from 1400 to 2500 nm. The sensor scanned an area of approximately 2 cm
To better represent the soil distribution, we split and subset the data based on sites. The data set was first randomly split into 75 % calibration (3188 sites) and 25 % validation (1063 sites) based on the unique sites.
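The site-based split can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used in the study: the function names, the `site_ids` input and the seeds are assumptions made for the example. The point it shows is that all samples from one site stay on the same side of the split, and that smaller calibration subsets are drawn by site rather than by sample.

```python
import random
from collections import defaultdict


def split_by_site(site_ids, calibration_fraction=0.75, seed=0):
    """Split sample indices into calibration and validation sets by site.

    All samples that share a site ID end up on the same side of the split,
    so depth increments from one profile never appear in both sets.
    """
    samples_per_site = defaultdict(list)
    for index, site in enumerate(site_ids):
        samples_per_site[site].append(index)

    # Shuffle the unique sites (not the samples) with a fixed seed.
    sites = sorted(samples_per_site)
    random.Random(seed).shuffle(sites)

    n_calibration = round(calibration_fraction * len(sites))
    calibration_sites = sites[:n_calibration]
    validation_sites = sites[n_calibration:]

    calibration = [i for s in calibration_sites for i in samples_per_site[s]]
    validation = [i for s in validation_sites for i in samples_per_site[s]]
    return calibration, validation, calibration_sites, validation_sites


def draw_site_subset(calibration_sites, n_sites, seed=0):
    """Randomly draw n_sites unique sites from the calibration sites."""
    return random.Random(seed).sample(calibration_sites, n_sites)
```

With 4251 sites and a fraction of 0.75, `round(0.75 * 4251)` gives the 3188 calibration sites reported above. Calling `draw_site_subset` with a different seed for each replicate yields a slightly different number of samples each time, because the number of measurements varies between sites.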
From the calibration data set, smaller subsets of 125, 300,
500, 1000, 1500, 2000, 2500 and 2700 unique sites were created,
equivalent to sample sizes of approximately 350, 840, 1400, 2800, 4200,
5600, 7000 and 7650. Each subset size was replicated 10 times to give a
better representation of model performance. Each sampling for the same number
of sites could generate a slightly different number of samples, since the
number of measurements varied from one site to another. However, the model
performance was always evaluated on the common validation data set of
1063 sites (sample size
Prior to the development of machine learning models (PLSR and Cubist), the
spectra were subjected to some preprocessing methods, namely the (i) conversion to absorbance followed by (ii) a Savitzky–Golay smoothing filter, with a window size of 11 and second-order polynomial (Savitzky and Golay, 1964), (iii) spectra trimming to discard region that has a low signal-to-noise ratio (
PLSR is one of the standard and most commonly used models with spectroscopy data. It is a linear chemometric regression model that projects spectra into latent variables that explain the variances within the spectra and the response variables (Wold et al., 1983). The number of latent variables that gave the smallest cross-validated root mean square error (RMSE) was used to build each model. PLSR was implemented in the R statistical software (R Core Team, 2019) using the “pls” package (Mevik et al., 2018).
Cubist is a rule-based data mining model, which is an extension of the M5 model tree by Quinlan (1993). Cubist has been used successfully in soil spectroscopy studies and, in many cases, has been found to perform better than PLSR and other machine learning models (Dangal et al., 2019). Cubist creates one or more rules so that, if the conditions of a rule are met, the linear model attached to that rule is used to predict the target variable. The model was implemented using the “Cubist” package (Kuhn and Quinlan, 2018) in R.
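To make the rule-plus-linear-model idea concrete, the following minimal Python sketch evaluates a hand-written rule set. It is not the Cubist algorithm itself and does not learn anything from data. The feature names (`abs_1400`, `abs_1900`, `abs_2200`), thresholds and coefficients are hypothetical values chosen for illustration, and averaging the predictions of all matching rules is a simplifying assumption of this sketch.

```python
def linear_model(intercept, coefficients):
    """Build a linear predictor from an intercept and per-feature slopes."""
    def predict(features):
        return intercept + sum(
            slope * features[name] for name, slope in coefficients.items()
        )
    return predict


# Hypothetical rule set: each rule pairs a condition with its own linear model.
RULES = [
    {
        "condition": lambda f: f["abs_1900"] <= 0.40,
        "model": linear_model(5.0, {"abs_2200": 30.0}),
    },
    {
        "condition": lambda f: f["abs_1900"] > 0.40,
        "model": linear_model(12.0, {"abs_2200": 45.0, "abs_1400": -10.0}),
    },
]


def predict_with_rules(features, rules=RULES):
    """Average the linear models of all rules whose condition is met."""
    predictions = [r["model"](features) for r in rules if r["condition"](features)]
    if not predictions:
        raise ValueError("no rule covers this sample")
    return sum(predictions) / len(predictions)
```

A sample with `abs_1900 = 0.3` is handled by the first linear model only, whereas a sample with `abs_1900 = 0.5` is handled by the second, so the fitted relationship can differ between groups of spectra.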
The CNN model is composed of three types of layers, namely convolutional, pooling and fully connected layers. The convolutional layer extracts features from the inputs, the pooling layer reduces the dimensionality of the input features and the fully connected layer connects the outputs from previous layers to the desired target outputs. The CNN model utilised in this study was derived from our previous study (Ng et al., 2019), in which the spectra were fed into the model as 1D data. The architecture of the CNN model is included in Table 2 and Fig. 2. Some of the layers within the network are shared to enable simultaneous output predictions.
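The role of the three layer types can be illustrated with a toy forward pass in plain Python. This sketch is not the architecture of Table 2: it uses a single filter, a single pooling step and a single output, and the kernel, pooling size and weights are arbitrary values assumed for the example.

```python
def conv1d(signal, kernel):
    """Slide one filter along the spectrum (valid-mode cross-correlation)."""
    width = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(width))
        for i in range(len(signal) - width + 1)
    ]


def relu(values):
    """Rectified linear unit: negative activations are set to zero."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size):
    """Keep the maximum of each non-overlapping window of length `size`."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


def dense(values, weights, bias):
    """Fully connected layer with one output: weighted sum plus bias."""
    return sum(v * w for v, w in zip(values, weights)) + bias


def forward(spectrum, kernel, weights, bias, pool_size=2):
    """Convolution -> ReLU -> max pooling -> fully connected output."""
    features = max_pool1d(relu(conv1d(spectrum, kernel)), pool_size)
    return dense(features, weights, bias)
```

With the kernel `[-1.0, 1.0]`, the convolution returns the first difference of the spectrum, which gives an intuition for how a learned filter can resemble a derivative transformation.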
Architecture of the 1D convolutional neural network (CNN) model.
Architecture of the convolutional neural network model.
ReLUs – rectified linear units.
The CNN model was trained with an initial learning rate of 0.001 and an Adam optimiser (Kingma and Ba, 2014). The network was trained using a batch size of 50 and a maximum of 200 epochs. For model optimisation purposes, the calibration data were further divided into a 75 % training and a 25 % testing set. Dropout, early stopping and learning-rate reduction were used as regularisation techniques to prevent network overfitting. For further details of the CNN model, the reader is referred to Ng et al. (2019). The CNN model was implemented in Python (version 3.5.1; Python Software Foundation, 2017) using the Keras library (version 2.1.2; Chollet, 2015) and TensorFlow (version 1.4.1; Abadi et al., 2015) back end.
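The logic behind early stopping and learning-rate reduction can be sketched independently of any deep learning library. The class below is a simplified stand-in written for illustration, not the Keras callbacks used in the study. Only the initial learning rate (0.001) and the 200-epoch cap come from the text; the reduction factor and both patience values are hypothetical defaults, since they are not reported here.

```python
class PlateauMonitor:
    """Track validation loss to reduce the learning rate and stop early."""

    def __init__(self, learning_rate=0.001, factor=0.5,
                 lr_patience=5, stop_patience=10):
        self.learning_rate = learning_rate
        self.factor = factor                # hypothetical reduction factor
        self.lr_patience = lr_patience      # hypothetical patience values
        self.stop_patience = stop_patience
        self.best_loss = float("inf")
        self.epochs_since_best = 0
        self.epochs_since_lr_change = 0

    def update(self, validation_loss):
        """Record one epoch; return True when training should stop."""
        if validation_loss < self.best_loss:
            self.best_loss = validation_loss
            self.epochs_since_best = 0
            self.epochs_since_lr_change = 0
            return False
        self.epochs_since_best += 1
        self.epochs_since_lr_change += 1
        if self.epochs_since_lr_change >= self.lr_patience:
            # No improvement for a while: take smaller optimisation steps.
            self.learning_rate *= self.factor
            self.epochs_since_lr_change = 0
        return self.epochs_since_best >= self.stop_patience


def train_with_monitor(validation_losses, monitor, max_epochs=200):
    """Return the number of epochs run before stopping."""
    for epoch, loss in enumerate(validation_losses[:max_epochs], start=1):
        if monitor.update(loss):
            return epoch
    return min(len(validation_losses), max_epochs)
```

Here `validation_losses` stands in for the loss computed on the 25 % testing portion of the calibration data after each epoch; in a real training loop the loss would be computed epoch by epoch rather than supplied as a list.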
All the model performances were compared in terms of the coefficient of
determination (
To uncover how CNN predicts different soil properties, a sensitivity
analysis was conducted to assess the importance of each wavelength in
contributing to predictions. Evaluating the sensitivity of the model can be
done in several ways; for example, Cui and Fearn (2018)
calculated the sensitivity of a CNN model for NIR by taking a numerical
partial derivative of the output with respect to each wavelength. For
wavelength
In our previous study (Ng et al., 2019), we calculated the sensitivity as
a function of the variance of the model for each window of spectra. Here, we
calculated the sensitivity based on the variance principle as an alternative
approach, as follows:
The current sensitivity analysis (Eq. 2) considered the actual variance of the data for a better approximation of the wavelength's sensitivity. To calculate the variance sensitivity, two new data frames were
created. The first data frame contained data which was the average of all
the validation spectra (
An illustration of how the new data frames were derived is included in
Fig. 3. Both data frames were then fed into the pretrained CNN model (
Illustration of the sensitivity analysis process, with
Large variability within the soil properties and texture could potentially
influence the soil spectra characteristics (shown in Fig. 4). In general, there was an increase in reflectance between 400 and 1000 nm, with several prominent absorption features at 1400, 1900 and 2200 nm. There are absorption features in the VIS–NIR (400–1000 nm), which are related to iron oxides, such as haematite (Fe
Visible, near and shortwave infrared (VIS–NIR–SWIR) spectra of 10
soil samples without spectral preprocessing
We examined what the CNN model actually learns. As the raw reflectance spectrum was fed into the CNN model, it passed through a convolutional layer which extracted information from the spectra. Filters from the first two convolutional layers are shown in Fig. 5. Though only raw spectra were fed into the CNN model, we could see that the spectra underwent some spectral preprocessing within each filter of the layers. Some of the filters shown in the first convolutional layer looked like the input spectra pattern (filter nos. 3, 4 and 10), and some of them mimicked transformation patterns, namely absorbance (filter nos. 1, 5, 6, 7, 9, 13 and 16) and derivatives (filter nos. 2, 8, 11, 12, 14 and 15). The spectra became smoother when they passed through the second convolutional layer, where some filters only accentuated certain peaks (Fig. 5).
Visualisation of the filters in the first two convolutional layers within the 1D convolutional neural network (CNN) model of the visible, near and shortwave infrared (VIS–NIR–SWIR) spectra.
The model performances for the validation data set using the full calibration
data (
Results of model validation for the prediction of various soil attributes using the full calibration data set.
OC – organic carbon; CEC – cation exchange capacity.
The performance was achieved using the CNN model with the prediction of sand
(
Among all the properties predicted, the sand and clay content showed the
best performance with
A total of nine subset models based on the unique sample sizes were
generated to investigate the effect of the training sample size. The performance comparison of all the models expressed as average
Model performances (in terms of average
In general, the PLSR and Cubist models tended to perform better when the sample size was relatively small (
We further compared the average model performance based on the RMSE ratios
of machine learning models against the CNN model (Fig. 7). This comparison was developed using the model performance for each unique property, and the variances presented were based on 10 simulations. If a particular
Model performances in terms of root mean square error (RMSE) ratios of
Upon comparing the RMSE ratios of the PLSR and Cubist models, we found that PLSR performed better than the Cubist model when the sample size was less than 1400. The Cubist model performed better than the PLSR model as the sample size increased. Using the RMSE ratios of the PLSR and CNN models, PLSR was found to perform better than CNN when the sample size was less than 1400 (Fig. 7). The PLSR and CNN models achieved similar performance when the sample size was approximately 1400. In terms of the RMSE ratios of Cubist and CNN, the CNN model performed better overall than the Cubist model, regardless of sample size. This was slightly different to the one that was observed when only the
A common critique of CNN is that it is a complex model and a black box. To uncover how the CNN model works, a sensitivity analysis was conducted to show how CNN predicts each of the soil properties, as illustrated in Fig. 8. Only certain parts of the spectra were used by the CNN model for prediction, which corresponded to the soil properties and composition. The important wavelengths for the prediction of CEC lie between 1600 and 2000 nm. This result is similar to the observations made by Lee et al. (2009) on the surface horizon data set, where 1772 and 1805 nm are essential for predicting the CEC. High CEC is often linked to the presence of OC and clay content. It is interesting that the same region is important for predicting organic carbon but not clay content. Aside from the region shared with CEC, the wavelength region between 1100 and 1200 nm is also deemed relevant by the CNN model for the prediction of OC content. This finding is slightly different to those reported by Lee et al. (2009), in which the important wavelengths reported are at 1772, 1871, 2069, 2246, 2351 and 2483 nm for the profile data set and 1871, 2072 and 2177 nm for the surface horizon data set.
Sensitivity analysis of the visible, near and shortwave infrared (VIS–NIR–SWIR) spectra in predicting various soil properties using the convolutional neural network (CNN) model. The graph depicts the sensitivity index (calculated from Eq. 2) for different soil properties as a function of wavelength.
Similar wavelength regions are deemed to be important for predicting the soil texture, although the importance varied slightly among the texture fractions of interest (sand, silt and clay) at wavelengths between 500 and 1800 nm. The important wavelengths for the prediction of sand and clay content are more similar to each other than to those for silt content. The most crucial wavelength identified is around 850 nm for the prediction of sand and clay content and around 1100 nm for the prediction of silt content. These observations are also different from those reported by Demattê (2002) and Lee et al. (2009), where the important wavelengths for the prediction of soil texture are at 1800–2400 nm. In particular, the soil texture prediction found in the CNN model is strongly related to haematite and/or goethite, -OH and Al–OH groups from kaolinite (Viscarra Rossel and Behrens, 2010; Pinheiro et al., 2017; Fang et al., 2018).
We also compare important wavelengths from the machine learning models against those from the deep learning model for the prediction of OC, as an example. Common wavelengths found to be related to organic carbon predictions are 1100, 1600, 1700–1800, 2000 and 2200–2400 nm (Dalal and Henry, 1986; Stenberg et al., 2010).
As a comparison, we calculated important wavelengths used in the PLSR and Cubist models. The important wavelengths utilised in the PLSR model were derived based on the absolute value of the regression coefficients. The height of the line indicates the importance of particular wavelengths for the determination of organic carbon content in the soil. Important wavelengths identified for the prediction of organic carbon were 500–700, 1400 and 1715 nm.
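Ranking wavelengths by the absolute value of their regression coefficients is straightforward, as the short Python sketch below shows. The function name and the coefficient values used to exercise it are hypothetical; in practice the coefficients would be taken from the fitted PLSR model.

```python
def rank_wavelengths(wavelengths, coefficients, top_n=3):
    """Rank wavelengths by the absolute value of their regression coefficient."""
    if len(wavelengths) != len(coefficients):
        raise ValueError("one coefficient is needed per wavelength")
    importance = [abs(c) for c in coefficients]
    # Sort indices from the largest to the smallest absolute coefficient.
    order = sorted(range(len(wavelengths)), key=lambda i: importance[i], reverse=True)
    return [(wavelengths[i], importance[i]) for i in order[:top_n]]
```

Because the sign is discarded, a strongly negative coefficient counts as much as a strongly positive one. The ranking is also only meaningful when all wavelengths are on a comparable scale after preprocessing.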
The wavelengths used in the Cubist model were derived based on model usage, either as predictors (blue lines) or conditions (pink lines) (Fig. 9). Some of the wavelengths used in the Cubist model are similar to those observed in the PLSR model, particularly the visible (500–700 nm) and shortwave infrared regions (1400 and 1900 nm).
Important wavelengths for the prediction of organic carbon (OC) content using partial least squares regression (PLSR), Cubist and convolutional neural network (CNN) models.
While conventional PLSR and machine learning models require preprocessing of the input spectra, the CNN model takes raw spectra as inputs. CNN has been shown to be a successful end-to-end learning model which learns features automatically while minimising hand-crafted preprocessing steps. Upon taking a closer look at the various filters within the convolutional layers, we found that the filters behaved like spectral preprocessing methods. It is interesting to note that, with raw spectra as input, transformations resembling the spectral preprocessing commonly used in spectroscopy could be observed within the layers themselves. Given that the CNN model learns such transformations internally, spectral preprocessing prior to feeding the spectra into the model is unnecessary. This advantage opens up possibilities for developing a highly accurate chemometrics model which also performs spectral preprocessing automatically.
CNN has been proven to be extremely successful; however, how it works remains largely a mystery as it is buried in layers of computations (Tsakiridis et al., 2020). Sensitivity analysis enabled us to see the inner workings of the CNN model better. We could better understand which wavelength features of the spectra are essential when developing the regression prediction. Important wavelengths derived from the sensitivity analysis based on the CNN model looked slightly different from those of the PLSR and Cubist models. Wavelengths around the 1700 nm region were deemed to be the most important, followed by those in the 1150 nm region. Nonetheless, some of the important regions overlapped. It is also worth noting that the model did not use the visible part of the spectra for prediction. In comparison to the sensitivity of MIR spectra in a previous study (Ng et al., 2019), the NIR model's sensitivity index was much broader, which reflected NIR's characteristic broad peak.
Although all three methods used different ways to derive important wavelengths, the PLSR model tended to use most parts of the spectra. Including irrelevant wavelengths in model development may reduce model performance. The Cubist model seemed more selective in terms of wavelengths used; however, this example showed that it also used most parts of the VIS–NIR–SWIR spectra. The CNN model used wavelengths between 800 and 2000 nm, with emphasis at around 1100 and 1700 nm.
PLSR, Cubist and CNN represent models of increasing complexity. By
combining results from five soil properties, we can better show a
generalisation of the performance of the models as a function of training
sample sizes. Simpler models (PLSR) performed better at smaller sample sizes (
Previous studies by Ng et al. (2019) and Padarian et al. (2019) showed that CNN performed better than PLSR and Cubist when the model was trained with more than 10 000 samples. However, there were also studies using CNN with a small number of training samples. This study showed that the CNN model only outperformed the PLSR and Cubist models when the sample size was greater than 2000. As the sample size increases, so does the efficiency of the CNN model. We observed a larger reduction in RMSE (CNN compared to the other two models) with an increasing calibration sample size. Thus, we recommend using a minimum of 2000 samples to train the CNN model for the VIS–NIR–SWIR spectra. To further improve the performance of the CNN model, simultaneous prediction of soil properties could also be implemented within the model.
The advantage of using deep learning on a small number of samples is minimal, as CNN is a data-hungry model; it is also more computationally expensive than the typical machine learning models. While our results pertain to the spectra set from Brazil and a particular structure of the CNN, we believe our results can serve as a guide for the number of samples needed to create a better deep learning model. Future research could test this idea on larger and more variable data sets (e.g. a global spectral library with more than 100 000 samples) to see if a more complex and deeper CNN can handle such a data set.
We assessed the effect of the training sample size and identified important
wavelengths in predicting various soil properties using Cubist and CNN
models. Here, we found that, with its current model structure, CNN is more
accurate than machine learning models when the number of calibration samples
is above 2000. The more complex and deeper the network of a deep learning model, the more likely it is to need a larger number of samples for training. PLSR and Cubist models perform less accurately than the CNN model as sample size increases, and both models reached a plateau after a sample size of 4000–5000. Meanwhile, the performance of CNN still increased until the maximum number of samples used in this study (
Data are courtesy of Demattê; they are not publicly accessible as they were taken from private farms.
WN was responsible for the data analysis and prepared the paper. BM contributed to the idea, data analysis and editing the paper. WdSM and JAMD contributed to the idea, provided the data and assisted with the editing of the paper.
The authors declare that they have no conflict of interest.
This study was supported in part by the Australian Research Council (ARC) Linkage Project (grant no. LP150100566) on the optimised field delineation of contaminated soils. The authors would also
like to thank members of the Geotechnologies in Soil Science Group
(
This research has been supported by the ARC Linkage Project (grant no. LP150100566) and the São Paulo Research Foundation (FAPESP; grant nos. 2014/22262-0 and 2016/26124-6).
This paper was edited by Bas van Wesemael and reviewed by three anonymous referees.