SOIL An interactive open-access journal of the European Geosciences Union

Journal metrics

  • IF: 3.343
  • IF 5-year: 4.963
  • CiteScore: 9.6
  • SNIP: 1.637
  • IPP: 4.28
  • SJR: 1.403
  • Scimago H index: 25
© Author(s) 2019. This work is distributed under
the Creative Commons Attribution 4.0 License.

  17 Sep 2019


Review status
A revised version of this preprint is currently under review for the journal SOIL.

Estimation of effective calibration sample size using visible near infrared spectroscopy: deep learning vs machine learning

Wartini Ng (1), Budiman Minasny (1), Wanderson de Sousa Mendes (2), and José A. M. Demattê (2)
  • (1) School of Life and Environmental Sciences & Sydney Institute of Agriculture, The University of Sydney, NSW, Australia
  • (2) Department of Soil Science, "Luiz de Queiroz" College of Agriculture, University of São Paulo, Av. Pádua Dias 11, Portal Box 9, Piracicaba, São Paulo state Code 13418-900, Brazil

Abstract. The number of samples in the calibration dataset affects the quality of predictive models of soil attributes built from visible, near and shortwave infrared (VIS-NIR-SWIR) spectroscopy. Recently, the convolutional neural network (CNN) has come to be regarded as a highly accurate model for predicting soil properties from large databases; however, it has not yet been ascertained how large the sample size must be for a CNN model to be effective. This paper aims to estimate how many calibration samples are needed to improve the performance of CNN models for soil property prediction. It is hypothesized that the larger the amount of data, the more accurate the CNN model. The performance of two commonly used machine learning models, partial least squares regression (PLSR) and Cubist, is compared against that of the CNN model. A VIS-NIR-SWIR spectral library from Brazil containing 4251 unique sites, with an average of 2–3 samples per site across depths (a total of 12,044 samples), was divided into calibration (3188 sites) and validation (1063 sites) sets. Subsets of the calibration dataset were then drawn to represent smaller calibration datasets of 125, 300, 500, 1000, 1500, 2000, 2500 and 2700 unique sites, equivalent to sample sizes of approximately 350, 840, 1400, 2800, 4200, 5600, 7000 and 7650. All three models (PLSR, Cubist and CNN) were built for each subset size to predict five soil properties: cation exchange capacity, organic matter, sand, silt and clay content. The calibration subset sampling and modelling were repeated ten times to give a better representation of model performance. The comparisons against the CNN model gave similar results for both conventional models: the CNN outperformed the PLSR and Cubist models at sample sizes of 1500 and 1800, respectively. It can be recommended that deep learning is most efficient for spectral modelling at sample sizes above 2000. The accuracy of the PLSR and Cubist models appeared to reach a plateau above sample sizes of 4200 and 5000, respectively. A sensitivity analysis was performed on the CNN model to determine the important wavelength regions that affected the predictions of the various soil attributes.
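The site-based sampling design described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the abstract does not state how sites were selected, so simple random sampling of sites is an assumption, and the integer site IDs, the `sample_site_ids` table (2 or 3 samples per site) and all function names are hypothetical. The point it shows is that splitting and subsampling operate on unique sites, so that all samples from one site stay together.

```python
import random

# Subset sizes (unique sites) and number of repeats, as given in the abstract.
SITE_COUNTS = [125, 300, 500, 1000, 1500, 2000, 2500, 2700]
N_REPEATS = 10


def split_sites(site_ids, n_validation, rng):
    """Split unique site IDs into (calibration, validation) lists."""
    shuffled = list(site_ids)
    rng.shuffle(shuffled)
    return shuffled[n_validation:], shuffled[:n_validation]


def calibration_subsets(calibration_sites, site_counts, n_repeats, rng):
    """Yield (n_sites, repeat, chosen_sites) for every subset size and repeat."""
    for n_sites in site_counts:
        for repeat in range(n_repeats):
            # Assumption: sites are drawn by simple random sampling.
            yield n_sites, repeat, rng.sample(calibration_sites, n_sites)


def rows_for_sites(sample_site_ids, chosen_sites):
    """Return the indices of all samples that belong to the chosen sites."""
    chosen = set(chosen_sites)
    return [i for i, site in enumerate(sample_site_ids) if site in chosen]


rng = random.Random(0)

# Hypothetical stand-in for the spectral library: 4251 sites with integer IDs.
all_sites = list(range(4251))
calibration, validation = split_sites(all_sites, 1063, rng)

# Hypothetical sample table: each site contributes 2 or 3 samples.
sample_site_ids = [s for s in all_sites for _ in range(rng.choice([2, 3]))]

# 8 subset sizes x 10 repeats = 80 calibration subsets.
subsets = list(calibration_subsets(calibration, SITE_COUNTS, N_REPEATS, rng))
```

In such a design, each element of `subsets` would be passed through `rows_for_sites` to select the spectra used to fit one PLSR, Cubist or CNN model, with the validation sites held out throughout.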

Latest update: 24 Sep 2020
Short summary
The number of samples used to build predictive models affects model performance. This research compares how many samples a deep learning model needs in order to outperform traditional machine learning models when predicting soil properties from visible near-infrared spectroscopy data. The deep learning model was found to outperform the machine learning models when the sample size is above 2000.