Interactive comment on “No Silver Bullet for Digital Soil Mapping: Country-specific Soil Organic Carbon Estimates across Latin America”

We appreciate this comment. We clarify that we did not develop any new pedotransfer function for missing bulk density values. To fill missing values, we used a simple pedotransfer function based on organic matter (OM): BD = 1/(0.6268 + 0.0361 * OM) (Drew, 1973). We chose this equation because it produced less extreme values than other available pedotransfer functions during preliminary training exercises (data not shown; see FAO, 2017, p. 7). Another reason is that no single pedotransfer function is applicable to all soil types across Latin America. The equation is representative of soils with organic matter content between 0.17 and 13.5% (Drew, 1973). We assumed a value of 0 when coarse fragments were missing, which could lead to overestimation of SOC stocks. For these reasons, we focus on model comparisons across 19 possible data scenarios under a variety of environmental conditions, rather than on reporting SOC stocks. The United Nations requires country-specific SOC stocks to be officially reported by the institutions of each country with the mandate to generate soil information, following certain data specifications (e.g., 1 km or less). That effort is beyond the scope of this study; our intention is to provide a fully reproducible framework with no major computational requirements (i.e., a conventional laptop) that runs in short periods of time (2-6 hours). This approach is meant to support capacity building for digital soil mapping across Latin America.
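The pedotransfer function quoted above can be written as a short helper. This is a minimal sketch: the function and constant names are hypothetical, and the output unit (g cm^-3) is an assumption, since the text does not state it.

```python
# Sketch of the pedotransfer function quoted in the text (Drew, 1973).
# Names are illustrative; the unit of the result (g cm^-3) is assumed.

# Organic matter range (%) for which the text says the equation is representative.
OM_VALID_RANGE = (0.17, 13.5)


def bulk_density_from_om(om_percent: float) -> float:
    """Estimate bulk density from organic matter content (%).

    Implements BD = 1 / (0.6268 + 0.0361 * OM).
    """
    if om_percent < 0:
        raise ValueError("organic matter content cannot be negative")
    return 1.0 / (0.6268 + 0.0361 * om_percent)


def in_valid_range(om_percent: float) -> bool:
    """Return True when OM lies inside the stated range of the equation."""
    low, high = OM_VALID_RANGE
    return low <= om_percent <= high
```

Checking `in_valid_range` before filling a missing value makes it easy to flag profiles where the equation is extrapolated.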


Introduction
Soils store nearly 1500 Pg of carbon and represent the largest terrestrial carbon pool; thus, it is critical to accurately quantify the variability of soil organic carbon (SOC) from local to global scales. During the 4th Session of the Global Soil Partnership (GSP) Plenary Assembly, held in May 2016 in Rome, it was agreed to develop a Global Soil Organic Carbon Map (GSOCmap) (FAO, 2017). The overarching goal is that a Global SOC Map of the Global Soil Partnership (GSOCmap-GSP) will be developed using a distributed approach relying on country-specific SOC maps. The Food and Agriculture Organization (FAO) recently compiled how different statistical methods (e.g., regression-kriging and machine learning) could be used to generate country-specific SOC maps and uncertainty estimates (Yigini et al., 2017). All these approaches consider the reference framework of the SCORPAN model for digital soil mapping (DSM; McBratney et al. (2003)).
In the SCORPAN reference framework, a soil attribute (e.g., SOC) can be predicted as a function of the soil forming environment, in correspondence with the soil forming factors from the Dokuchaev hypothesis and Jenny's soil forming equation, which is based on climate, organisms, relief, parent material and elapsed time of soil formation (Florinsky, 2012). The SCORPAN (Soils, Climate, Organisms, Relief, Parent material, Age and (N) space or spatial position; see McBratney et al. (2003)) reference framework is an empirical approach that can be expressed as in Eq. (1):

$$Sa_{[x;y \sim t]} = f\left(S_{[x;y \sim t]},\ C_{[x;y \sim t]},\ O_{[x;y \sim t]},\ R_{[x;y \sim t]},\ P_{[x;y \sim t]},\ A_{[x;y \sim t]}\right) \qquad (1)$$

where Sa is the soil attribute of interest at a specific location N (represented by the spatial coordinates of field observations, x; y) and representative of a specific time frame (t); S is the soil or other soil properties that are correlated with Sa; C is the climate or climatic properties of the environment; O are the organisms, vegetation, fauna or human activity; R is topography or landscape attributes; P is parent material or lithology; and A is the substrate age or the time factor. To generate predictions of Sa across places where no soil data are available, N should be explicit for the information layers representing the soil forming factors. These predictions will be representative of the time period (t) when the available soil data were collected. Therefore, the prediction factors should ideally represent the conditions of the soil forming environment for the same period of time (as much as possible) in which the available soil data were collected. In Eq. (1), the left side is usually represented by the available geospatial soil observational data (e.g., from legacy soil profile collections) and the right side by the soil prediction factors. These prediction factors are normally derived from four main sources of information: a) thematic maps (i.e., soil type, rock type, land use type); b) remote sensing (i.e., active and passive); c) climate surfaces and meteorological data; and d) digital terrain analysis or geomorphometry. The SCORPAN reference framework is widely used, but one critical challenge is to quantify the relative importance of the soil forming factors (i.e., prediction factors) that could explain the underlying soil processes controlling the spatial variability of a specific soil attribute (i.e., SOC).

SOIL Discuss., https://doi.org/10.5194/soil-2017-40. Manuscript under review for journal SOIL. Discussion started: 25 January 2018. © Author(s) 2018. CC BY 4.0 License.
Arguably, there are two cultures for statistical modeling (Breiman, 2001) that influence the predictions of the spatial variability of SOC. One assumes that the variability of observations can be reproduced by a given stochastic data model (e.g., with hypotheses about the spatial structure of the variable). The other uses algorithmic models and treats the mechanisms generating the structure of values in available datasets as unknown (e.g., with hypotheses about the statistical distribution and moments of the variable). Different mapping approaches use a given set of available predictors in different ways. Thus, comparing different approaches and methods is useful to quantify the relative importance of prediction factors across data configurations and distributional properties. We argue that a systematic analysis of predictive algorithms, and consequently of the predictors selected by each algorithm, could provide insights about the underlying factors that control the spatial variability of SOC.
The last decade has seen an increasing diversity of approaches for DSM. Data mining techniques have been successfully used to model and predict the spatial variability of soil properties (Rossel and Behrens, 2010; Hengl et al., 2017; Shangguan et al., 2017) and to generate country-specific SOC maps (Viscarra Rossel et al., 2014; Adhikari et al., 2014). The combination of regression modeling with geostatistics of model residuals (i.e., regression-kriging) is a strategy that has been widely used to map SOC (Hengl et al., 2004; Mishra et al., 2009; Marchetti et al., 2012; Kumar et al., 2012; Peng et al., 2013; Adhikari et al., 2014; Yigini and Panagos, 2016; Nussbaum et al., 2014; Mondal et al., 2017). Machine learning algorithms such as random forests or support vector machines have also been used to increase the statistical accuracy of soil models (Martin et al., 2011; Hashimoto et al., 2017; Hengl et al., 2017), including applications for SOC mapping (Grimm et al., 2008; Sreenivas et al., 2016; Yang et al., 2016; Hengl et al., 2017; Delgado-Baquerizo et al., 2017; Ließ et al., 2016; Viscarra Rossel et al., 2014). Machine learning methods do not necessarily allow extracting information about the main effects of prediction factors on the response variable (e.g., SOC); consequently, a selection strategy is always useful to increase the interpretability of machine learning algorithms. With this diversity of approaches, one constant question is whether any method systematically outperforms the others when predicting SOC across large geographic areas (e.g., Latin America). We postulate that there is probably no universal method (i.e., silver bullet) for DSM, and that country-specific efforts are needed to test a variety of predictive algorithms to maximize explained variance while minimizing prediction bias.
The overarching goal of this study is to compare different predictive algorithms across 19 data/country scenarios with publicly available information to support the development of country-specific SOC maps to be included in the GSOCmap-GSP. Currently, SOC information across Latin America is derived from global models such as the SoilGrids system or the Harmonized World Soil Database (Köchy et al., 2015), which lack quantification of uncertainty and in which large areas are parameterized with limited country-specific information. This challenge is not unique to Latin America, as many regions around the world (e.g., Africa, Siberia) have limited SOC information to parameterize models that predict SOC. To inform future SOC mapping efforts, this study addresses two specific questions: a) Which environmental variables (derived from publicly available information) have the highest correlations with country-specific SOC information? and b) Which is the best method (i.e., predictive algorithm) to represent SOC across Latin America and within each country? The ultimate aim of this study is to contribute to the discussion about the importance of integrating country-specific information for representing and predicting soil-related variables (e.g., SOC) to improve regional-to-global predictions.

SOC observations
Soil organic carbon information was extracted from the WoSIS soil profile database. This dataset includes local-to-national soil profile collections with a sampling strategy generally based on morphological soil attributes. The goal of the GSOCmap-GSP is to produce global information for the first 30 cm; thus, we generated synthetic horizons for this depth using a mass-preserving spline approach (Bishop et al., 1999). We applied a pedotransfer function when bulk density (BLD) information was missing (Yigini et al., 2017), and assumed a value of 0% when information on coarse fragments (CRFVOL) was missing. The organic carbon stock for 0 to 30 cm was estimated using the Global Soil Information Facilities (GSIF) R package, following a standardized SOC calculation method (D.W. and L.E., 1982) in which ORC is the SOC content (g · kg⁻¹) and H is the soil depth (30 cm). Finally, each country-specific dataset was transformed to its natural logarithm to reduce the right-skewed distribution of SOC values and because exploratory analysis showed that this transformation increased the correlations with the prediction factors.
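The stock calculation described above can be sketched as follows. The formula is the form commonly used for this calculation and is assumed, not confirmed by the text, to match what the GSIF package computes. The function names are hypothetical, and the bulk density unit (kg m^-3) is an assumption.

```python
import math


def soc_stock_kg_m2(orc_g_kg, bld_kg_m3, crfvol_pct=0.0, depth_cm=30.0):
    """SOC stock (kg m^-2) of a single layer.

    Assumed form: ORC/1000 * H/100 * BLD * (100 - CRFVOL)/100, with
    ORC in g kg^-1, H in cm, BLD in kg m^-3 and CRFVOL in percent.
    crfvol_pct defaults to 0, mirroring the assumption made in the text
    when coarse fragment data are missing.
    """
    fine_earth_fraction = (100.0 - crfvol_pct) / 100.0
    return (orc_g_kg / 1000.0) * (depth_cm / 100.0) * bld_kg_m3 * fine_earth_fraction


def log_transform(stock_kg_m2):
    """Natural-log transform applied before modeling to reduce right skew."""
    return math.log(stock_kg_m2)
```

For example, 20 g kg^-1 of organic carbon at 1300 kg m^-3 with no coarse fragments gives about 7.8 kg m^-2 for the top 30 cm. Setting `crfvol_pct=10` lowers that to about 7.0, which illustrates why assuming 0% coarse fragments can overestimate stocks.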

Soil prediction factors
We used environmental information from WorldGrids (worldgrids.org), an initiative of ISRIC-World Soil Information. We downloaded and masked 118 environmental layers (i.e., prediction factors) for each country to quantitatively represent the soil forming environment. The prediction factors were harmonized into a 1 × 1 km global grid by the WorldGrids project from three main information sources: remote sensing, climate surfaces, and digital terrain analysis (http://worldgrids.org/doku.php/wiki:layers).
Additional terrain parameters (e.g., terrain slope, aspect, catchment area, channel network base level, terrain curvature, topographic wetness index, length-slope factor) were calculated from elevation data in SAGA GIS for each country, following the standard implementation for basic terrain parameters (Conrad et al., 2015). We resampled the prediction factors to a 5 × 5 km grid to reduce the computational demand of making predictions and to facilitate the reproducibility of this DSM framework without the need for high-performance computing.
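The text does not say which resampling method was used. One simple option for going from a 1 km to a 5 km grid is block averaging, sketched below with plain Python lists. The function name and the use of `None` for no-data cells are illustrative assumptions.

```python
def block_mean(grid, factor=5):
    """Aggregate a 2-D grid (list of rows) by averaging factor x factor blocks.

    None marks a no-data cell and is ignored; a block that contains only
    no-data cells yields None. Edge blocks may be smaller than factor.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    coarse = []
    for r0 in range(0, n_rows, factor):
        coarse_row = []
        for c0 in range(0, n_cols, factor):
            cells = [
                grid[r][c]
                for r in range(r0, min(r0 + factor, n_rows))
                for c in range(c0, min(c0 + factor, n_cols))
                if grid[r][c] is not None
            ]
            coarse_row.append(sum(cells) / len(cells) if cells else None)
        coarse.append(coarse_row)
    return coarse
```

Averaging suits continuous layers such as temperature or slope. Categorical layers (e.g., land use type) would need a majority rule instead.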

Prediction of SOC and model evaluation
First, the relationship between SOC and prediction factors was explored using simple correlation analysis. Second, the 10 prediction factors with the highest correlations with SOC data were selected for each country and used for further analyses. Third, we implemented regression-kriging (based on a multiple linear regression model (RK) and partial least squares regression (PLS)), and three machine learning models: support vector machines (SVM), random forests (RF), and kernel weighted nearest neighbors (KK), to generate SOC maps for each country. A brief explanation of each modeling approach is provided in Appendix A1.
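The first two steps (correlation screening and selection of the 10 best predictors) can be sketched as below. Ranking by the absolute Pearson correlation is an assumption, made so that strongly negative relationships are not discarded. All names are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant series carries no correlation signal
    return sxy / math.sqrt(sxx * syy)


def top_predictors(soc, predictors, k=10):
    """Return the names of the k predictors most correlated with soc.

    predictors maps a layer name to its values at the SOC sample locations.
    """
    ranked = sorted(
        predictors,
        key=lambda name: abs(pearson(soc, predictors[name])),
        reverse=True,
    )
    return ranked[:k]
```

Running this once per country gives a country-specific predictor set, which is the input for each of the five prediction algorithms.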
We also analyzed the influence of the maximum allowed prediction limit for each prediction algorithm. The units of the SOC estimates are kg · m⁻². The sensitivity of the total SOC stock to the model prediction limit was tested by changing the maximum prediction limit from 2.7 kg · m⁻² (1 on a log scale) to 2980.95 kg · m⁻² (8 on a log scale).
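The link between the log-scale limits and the values in kg m^-2 is a natural exponential. The sketch below assumes that the limit is applied by clipping log-scale predictions before back-transformation; the text does not describe the mechanism. The names are hypothetical.

```python
import math

# Limits explored in the text: 1 to 8 on the natural-log scale, in kg m^-2.
LIMITS_KG_M2 = {k: math.exp(k) for k in range(1, 9)}


def cap_predictions(log_predictions, log_limit):
    """Clip log-scale predictions at log_limit, then back-transform."""
    return [math.exp(min(p, log_limit)) for p in log_predictions]


def total_stock(log_predictions, log_limit):
    """Sum of capped, back-transformed predictions (per-pixel area omitted)."""
    return sum(cap_predictions(log_predictions, log_limit))
```

Because the back-transform is exponential, a few pixels with extreme log-scale predictions can dominate the total when the limit is high, which is the sensitivity the text refers to.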
To generate a combined SOC map, we used a weighted average of the country-specific predictions. The weights of this average were defined by the relationship between the errors (measured as the RMSE) and the correlation (ECr). We propose the ECr as an approach to better understand the agreement between the correlation (calculated by means of cross-validation) and the RMSE (derived from the unbiased residuals of cross-validation). Before calculating the RMSE/correlation ratio, the RMSE and the correlation between observed and predicted values were standardized (by their maximum and minimum values) to a range between 0 and 1 using:

$$EC_r = \frac{\left(RMSE_i - \min(RMSE)\right) / \mathrm{range}(RMSE)}{\left(corr_i - \min(corr)\right) / \mathrm{range}(corr)}$$

where ECr is the proposed ratio between the errors and the correlation between observed and predicted values (derived by cross-validation); RMSE_i is the observed RMSE for the ith model; min(RMSE) is the minimum observed value of RMSE; range(RMSE) is the difference between the maximum and minimum observed values of RMSE; and corr_i is the observed correlation for the ith model, with min(corr) and range(corr) defined analogously. If the value of the ECr was close to 0, then there was a stronger agreement between high RMSE and low correlation, or low RMSE and high correlation. If this value deviated from 0 (up to 1 or more), then the RMSE tended to be high while the correlation was also high, suggesting that the method represents the variability of SOC but with high bias. Finally, the uncertainty (represented by the variance of the different prediction approaches) was divided by the mean and multiplied by 100 to provide an interpretable standardized visualization of uncertainty (i.e., in percent). Country-level SOC stocks are reported as the sum of the predicted SOC values (i.e., the weighted average of SOC) of all 5 × 5 km pixels within each country. All analyses were performed using the R software (R Core Team, 2017).
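A small sketch of the ECr and of the standardized uncertainty may help make the definitions concrete. It assumes that both RMSE and correlation are min-max standardized across the compared models, and it returns infinity for the model with the lowest correlation, whose standardized correlation is 0. That edge case is not addressed in the text. Function names are hypothetical.

```python
def min_max(values):
    """Standardize values to the range [0, 1] using their minimum and range."""
    lowest, highest = min(values), max(values)
    span = highest - lowest
    if span == 0:
        raise ValueError("all values are equal; the range is zero")
    return [(v - lowest) / span for v in values]


def ec_ratio(rmse, corr):
    """Ratio of standardized RMSE to standardized correlation, one per model."""
    rmse_std = min_max(rmse)
    corr_std = min_max(corr)
    # Assumption: an undefined ratio (zero denominator) is reported as infinity.
    return [r / c if c > 0 else float("inf") for r, c in zip(rmse_std, corr_std)]


def standardized_uncertainty(predictions):
    """Variance across model predictions divided by their mean, in percent."""
    n = len(predictions)
    mean = sum(predictions) / n
    variance = sum((p - mean) ** 2 for p in predictions) / (n - 1)  # sample variance
    return 100.0 * variance / mean
```

With RMSE values of 1, 2 and 3 and correlations of 0.75, 0.5 and 0.25, the best model (lowest RMSE, highest correlation) gets a ratio of 0 and the middle model a ratio of 1.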

Descriptive statistics
SOC across the different countries showed a wide diversity of data scenarios (Table 1). Costa Rica (with a mean of 11.05 g · kg⁻¹), Chile (with a mean of 9.88 g · kg⁻¹) and Colombia (with a mean of 8.15 g · kg⁻¹) were the countries with the highest SOC values. Brazil (n=5616) and Mexico (n=4321) were the countries with the highest data availability. In contrast, Honduras (n=11), Guatemala (n=20) and Belize (n=21) were the countries with the lowest density of SOC estimates (Table 1). With the original (untransformed) dataset, the only countries that showed a normal distribution according to the Shapiro-Wilk test of normality (alpha of 0.05) were Belize, Guatemala, Honduras and Suriname.

Correlation of SOC and its predictors
The best correlated predictors were not the same across countries. We found higher correlations when the original datasets were transformed to their natural logarithm, as the data had a right-skewed distribution and did not follow a normal distribution (i.e., they were log-normal). The highest correlations between available SOC data and environmental predictors were associated with temperature-related variables across Honduras, Costa Rica, Peru, Chile, Guatemala and Suriname (r² varied from 0.35 to 0.58). However, there was a low number of available SOC observations across these countries in the WoSIS system (between 11 and 34).
Similarly, across countries with high data availability (e.g., Mexico and Brazil) the strongest correlations between SOC and prediction factors were associated with temperature-related variables (Table 2). In all cases, the relationship between SOC

SOC related properties
Correlations between ORCDR and prediction factors were highest with maximum and mean night-time temperature, with Costa Rica and Chile showing the highest correlations (r² varied from 0.61 to 0.71). The variables best correlated with BLD were terrain parameters: relative slope position, vertical distance to channel network, flow accumulation areas, and potential incoming solar radiation. These correlations were stronger across Guatemala, Belize and Panama (r² varied from 0.52 to 0.67). Terrain slope and the standard deviation of temperature were the variables with the highest correlations with CRFVOL, with Nicaragua, Honduras and Argentina showing the highest correlations (r² varied from 0.40 to 0.55). We did not find a dominant algorithm to predict SOC-related properties. Slightly higher correlations between observed and predicted values were achieved with RF, but in most cases the different methods showed similar prediction capacity. The highest prediction error was found with RK for CRFVOL, but for all other output variables all prediction algorithms had a similar range of errors.

Country-specific SOC predictions
We did not find a dominant algorithm to predict SOC on a country-specific basis (Fig. 2). Overall, machine learning prediction algorithms generated similar results. Higher agreement among machine learning prediction algorithms was found in small countries, where environmental conditions and land cover/use characteristics tend to be more homogeneous (e.g., Jamaica, Suriname).
RK showed higher discrepancies in countries where the data distribution was sparse (e.g., Suriname, Chile, Guatemala), but was effective across countries with higher and/or well-distributed data availability (e.g., Mexico, Brazil). Machine learning SOC predictions were conservative compared with RK (RK generated the highest density of extreme and unreliable SOC values).
PL had comparable results to the machine learning algorithms (i.e., KK, SVM, RF). The highest correlation between observed and predicted data was found for Costa Rica (0.76; n=21) using SVM, while the lowest error was found for Suriname (0.36; n=37) using PL. In contrast, algorithms had lower prediction capacity for countries with large areas (e.g., Brazil, Mexico) despite the large data availability.
The correlation between r² and RMSE for RF, PL, KK and RK was positive (0.18, 0.35, 0.32 and 0.1, respectively). In contrast, this correlation was stronger for SVM (but negative; -0.65), where increasing the explained variance resulted in a lower error.
These results suggest a low level of agreement between these two information criteria (r² and RMSE) commonly used in DSM to assess the performance of prediction algorithms. Agreement between RMSE and r² was found in only 12 of the 19 countries, resulting in country-specific "recommended" prediction algorithms. Here we list the prediction algorithms that generated the best correlation and the best RMSE for each country (Table 1). High discrepancy was found across the SOC predictions because algorithms use available data in different ways (Fig. 3B).
The highest ECr was found with PL (0.96), followed by RF (0.54) and KK (0.43), indicating that these predictive algorithms do not minimize prediction bias while increasing the explained variance. SVM (0.008) and RK (0.003) had the lowest ECr (inset histogram in Fig. 4), indicating that they maximize the explained variance while minimizing prediction bias. We found a strong linear relationship (r² = 0.84) between SOC stocks and the area of each country (Fig. 4A). The relationship between predicted SOC per unit area and SOC prediction variance was negative (Fig. 4B). Higher uncertainty and a relatively low density of SOC per unit area were found across Mexico, Bolivia, Brazil, Honduras, Peru, Suriname and Cuba. The standardized uncertainty of the total stocks reached values over 300% for countries such as Mexico and Bolivia (Fig. 4B). In contrast, countries with higher SOC per unit area and relatively low prediction variance included Panama, Guatemala and Costa Rica (Fig. 3A, B).

Figure. Mosaic of country-specific SOC maps (kg · m⁻²) and SOC prediction variance. Panel A shows a weighted average of predictions in which the weights were the ECr. In the map, the range of values is rescaled to between 0 and 1 to better illustrate the gradients of spatial variability of SOC. Note that no-data countries were filled using simple geostatistics. Panel B shows the standardized uncertainty, which was generated by dividing the SOC variance by the mean. Red areas represent areas where the discrepancy among models reached 100% or more. The inset histogram shows the median ECr for each method. PL was the method with the highest discrepancy between explained variance and bias. The inset scatter plot shows the relationship between the SOC stock (in Pg) and the maximum limit of SOC prediction values, illustrating the sensitivity of the total estimated stock to the maximum prediction limit (1 to 8 on a log scale).

Estimated SOC stocks and uncertainties
Across countries, SOC stocks varied from 28.14 ± 14.92 Pg (considering a maximum prediction limit of 2.71 kg · m⁻²) to 62.99 ± 33.39 Pg (considering a maximum prediction limit of 2980 kg · m⁻²; inset scatterplot in Fig. 4).

Discussion
We developed a reproducible DSM framework to characterize the spatial variability of SOC across Latin America. Our results suggest that several predictive algorithms can be used to better understand modeling bias, which can be associated with a) the property of interest (i.e., SOC), b) the environmental complexity and area/country of interest, and c) the characteristics of available data (e.g., spatial distribution and representativeness) to meet model-specific assumptions.
Our results incorporate a multi-model perspective for quantifying and evaluating the spatial variability of SOC. This effort is expected to increase the capacity of Latin American institutions to provide accurate baseline estimates of SOC with a country-specific perspective, following the recommendations of the GSOCmap-GSP. Ultimately, these efforts will enhance the development of new guidelines for measuring, mapping, reporting, verifying and monitoring SOC stocks at the national level (Vargas et al., 2013). Accurate country-specific DSM frameworks for SOC are required to facilitate interoperability and inform environmental policy across developing countries. Our results highlight that attention is needed to better understand the influence of model prediction limits (e.g., the full conditional distribution) on the predicted SOC stocks. Setting an unreliable (excessively high or low) prediction limit can lead to substantial over- or underestimation of the overall estimated stocks (Fig. 3). Therefore, we argue that data science systems for DSM carbon assessments should be fundamentally based on SOC expert knowledge and informed by expert-based soil mapping systems.
Across Latin America we did not find a common predictive algorithm for SOC. These results suggest that country-specific environmental predictors and available data influence the applicability of different approaches. This assessment is needed to address the requirements of the GSOCmap-GSP, with its official mandate to generate and update country-specific soil information by means of DSM. Thus, we argue that the DSM framework of each country should assess and incorporate country-specific available data and environmental predictors to select the best prediction algorithm. The FAO SOC mapping cookbook explores possibilities to derive country-specific SOC maps from a variety of prediction algorithms (Yigini et al., 2017), and multiple resources have described the state of the art of modeling methods focused on DSM of soil carbon (Minasny et al., 2013; Malone et al., 2017), including geostatistics (Hengl, 2009). Thus, data characteristics (e.g., spatial structure, representativeness) are especially important for developing a DSM framework, as legacy soil profile collections, generated for long-term soil inventory purposes, will determine data availability and spatial distribution within a country.
This country-specific approach to mapping regional SOC results in artifacts across geopolitical borders. Therefore, data sharing, model validation and calibration experiments across borders (i.e., between countries) are required to better capture the spatial variability of SOC. The use of a naturally defined prediction domain (e.g., an ecoregional or physiographic map) could reduce the border effects. However, we understand that geopolitical limits are required for public policy decisions around country-specific needs. We highlight that there is a lack of publicly available country-specific data, which ultimately influences the assessment of country-specific prediction algorithms (Fig. 3A). The selection of a proper prediction algorithm for sparse data scenarios is therefore required to achieve the highest possible accuracy of country-specific SOC estimates. Our results highlight important uncertainty levels (>100%) across large areas of Latin America (Fig. 3B). The data contained in WoSIS have a low density given the large area and environmental complexity of several analyzed countries. Thus, larger uncertainty dominates countries with larger carbon pools, probably because the available data do not capture the large spatial heterogeneity of SOC stocks. We highlight that the WoSIS dataset is a unique and invaluable effort that has proven able to generate global SOC predictions (Sanderman et al., 2017), but there is a global need to increase information and networking capabilities for SOC (Harden et al., 2017).
This study generated predictions of SOC across Latin America, but also provided information about the main relationships driving the spatial distribution of SOC. Machine learning (i.e., data-driven) models have proven to be more efficient at modeling non-linear relationships of SOC (Hengl et al., 2015), but our results suggest that linear-based models (e.g., RK) could outperform machine learning methods under well-distributed and representative SOC data scenarios. Similar results were found across productive landscapes of Brazil (Bonfatti et al., 2016). We argue that our capacity to meet modeling assumptions will determine the most suitable prediction algorithm. Machine learning models are usually conceived as black boxes, and the influence of non-informative SOC prediction factors on machine learning-based SOC models has not been evaluated in detail.
Therefore, we propose that the use of simple linear methods (i.e., correlation of available data and its predictors) can be a useful and parsimonious first step to inform data-driven approaches and enhance the interpretability of machine learning models that predict SOC. Furthermore, our data suggest that country-specific prediction factors are needed to better parameterize models and could also be useful for country-specific model interpretation. These results have important implications because it has been proposed that an extensive set of prediction factors is required to capture the large variance of the global SOC pool. Thus, we propose that a limited but informative set of country-specific prediction factors should be explored to describe the local biophysical characteristics controlling SOC variability.

Conclusions
We provide a multi-model comparison approach to map SOC stocks across Latin America and found that there is no dominant best prediction algorithm given the available data. The relative performance of the different methods varies from one place to another, as does the relative correlation of SOC with the prediction factors given the available data. We tested hypothesis-driven approaches (e.g., linear geostatistics) and data-driven algorithms (e.g., machine learning), which are used, respectively, to generate interpretable and predictable models of soil variability. We argue that models should not be conceived as competitors, because they have different assumptions (about the data itself, or about the empirical relationship between the response variable and its predictors). Therefore, different models will capture different portions of soil variability. There are no silver bullets for digital soil mapping across the 19 analyzed countries given the available data in the WoSIS system. We highlight important levels of uncertainty in SOC stocks associated with the maximum allowed prediction limit. Public data may not be representative across large areas, and we call for the countries to strengthen digital soil mapping capacity building initiatives, reproducible research and data sharing. The use of country-specific information and of different modeling approaches will enhance regional soil carbon mapping efforts, so that we can easily identify where and why different modeling approaches generate different results.
This scientific paper shows that the initiative to build digital soil mapping capacities in Latin America offers very positive results. Each country has its particularities, and the best methods or algorithms are not the same for everyone. It encourages us to keep increasing our capacities and collaboration at a regional level.
Code availability. The code used for this work will be available under the AGPL 3.0 license at https://github.com/DSM-LAC/NoSilverBulletsForDSM (Guevara et al., 2018).
Data availability. The soil dataset used in this paper was kindly provided by ISRIC. It can be downloaded from WoSIS (http://www.isric.org/explore/wosis) and corresponds to the July 2016 snapshot.
PLS is a common method to deal with the presence of highly correlated predictors. The PLS algorithm integrates the compression and regression steps, and it selects successive orthogonal factors that maximize the covariance between predictor and response variables (Wold, 1983; Viscarra Rossel et al., 2014). Most of its development and application has been in the field of chemometrics, but it is used in several research areas to effectively solve regression and classification problems.

SVM applies a simple linear method to the data, but in a high-dimensional feature space that is non-linearly related to the input space (Karatzoglou et al., 2006). It creates a hyperplane through the n-dimensional spectral space. SVM then separates numerical data based on a kernel function and parameters (e.g., gamma and cost) that maximize the margin between the hyperplane and the closest points; the support vectors are the points that fall within this margin (Heumann, 2011). Linear models are then fitted to the support vectors.

RF is an ensemble of regression trees based on bagging (Breiman, 1996). This machine learning algorithm uses a different combination of prediction factors to train multiple regression trees. Each tree is generated using a different subset of the available data (Breiman, 2001). The number of prediction factors to use in each tree is known as the mtry parameter. The final prediction is the weighted average of all individual trees.
KK is a pattern recognition technique based on the distances to training examples in the feature space (Silverman and Jones, 1989). The observations within the learning set that are particularly close to the new observation (y, x) should get a higher weight in the decision than neighbors that are far away from (y, x) (Hechenbichler and Schliep, 2004).
The parameter k determines the number of neighbors from which information will be considered for prediction, and a kernel function (e.g., triangular or Gaussian, among others) converts distances into weights, which are used for regression problems.
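The idea of converting neighbor distances into kernel weights can be illustrated with a few lines of standard-library Python. This is a simplified sketch, not the implementation used in the study. Scaling distances by the distance to the (k+1)-th neighbor is an assumption about how such methods are commonly set up, and all names are hypothetical.

```python
import math


def triangular(u):
    """Triangular kernel: weight falls linearly from 1 at u=0 to 0 at |u|>=1."""
    return max(0.0, 1.0 - abs(u))


def gaussian(u):
    """Gaussian kernel (unnormalized)."""
    return math.exp(-0.5 * u * u)


def kknn_predict(train_x, train_y, query, k=3, kernel=triangular):
    """Kernel-weighted k-nearest-neighbor regression for one query point.

    train_x is a list of feature tuples, train_y the matching responses.
    """
    by_distance = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    nearest = by_distance[:k]
    # Assumed scaling: distance to the (k+1)-th neighbor when it exists.
    scale = by_distance[k][0] if len(by_distance) > k else nearest[-1][0]
    plain_mean = sum(y for _, y in nearest) / len(nearest)
    if scale == 0:
        return plain_mean  # all neighbors coincide with the query
    weights = [kernel(d / scale) for d, _ in nearest]
    total = sum(weights)
    if total == 0:
        return plain_mean  # kernel gave zero weight to every neighbor
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / total
```

Passing `kernel=gaussian` instead of the default shows how the choice of kernel changes how quickly distant neighbors lose influence.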
Competing interests. The authors declare that they have no conflict of interest.