A review on the global soil datasets for earth system modeling

Abstract. Global soil dataset is a pillar to the challenge of earth system modeling. But it is one of the most important uncertainty sources for Earth System Models (ESMs). Soil datasets function as model parameters, initial variables and benchmark datasets for model calibration, validation and comparison. For modeling use, the dataset should be geographically continuous, scalable and with uncertainty estimates. The popular soil datasets used in ESMs are often based on limited soil profiles and coarse resolution soil maps. Updated and comprehensive soil information needs to be incorporated in ESMs. New generation soil datasets derived by digital soil mapping with abundant soil observations and environmental covariates are preferred to those by the linkage method for ESMs. Because there is no universal pedotransfer function, an ensemble of them may be more suitable to provide secondary soil parameters to ESMs. Aggregation and upscaling of soil data are needed for model use but can be avoid by taking a subgrid method in ESMs at the cost of increases in model complexity. Uncertainty of soil data needs to be incorporated in ESMs.



Introduction
Soil or pedosphere is a key component of Earth system, and plays an important role in the water, energy and carbon balances and biogeochemical processes.An accurate description of soil properties is essential in advancing the modeling capabilities of Earth System Models (ESMs) to predict land surface processes at the global and regional scales.Soil information is required by the land surface models (LSMs), which is a component of ESMs.With the help of computer-based geographic systems, many researchers have produced geographical databases to organize and harmonize large amount of soil information generated from soil surveys during the last decades (Batjes, 2017;Hengl et al., 2017).However, soil dataset used in ESMs is not well updated nor well utilized yet.The popular soil datasets used in ESMs are outdated and with limited accuracy.Some available soil properties such as gravel (or coarse fragment) and depth to bedrock are not utilized in most ESMs.Meanwhile, it is needed to change ESMs' schemes and structure to better represent the soil process in utilizing new soil information (Brunke et al., 2016;Oleson et al., 2010).Better soil information with high resolution and better representation of soil in models have improved and will improve the performance in simulating the Earth system (Kearney and Maino, 2018).
ESMs require detailed information on the soil physical and chemical properties.
Site observations (called soil profiles) from soil surveys include soil properties such as soil depth, soil texture (sand, silt and clay fractions), organic matter, coarse fragments, bulk density, soil colour, soil nutrients (carbon (C), nitrogen (N), phosphorus (P), potassium (K) and sulfur (S)), etc.However, soil hydraulic and thermal parameters as well as biogeochemical parameters are usually not observed in soil surveys, which need to be estimated by pedotransfer functions (PTFs) (Looy et al., 2017).This review focus on the soil data (usually time-invariant) from soil surveys, while variables such as soil temperature and soil moisture are beyond this paper's scope.
Soil properties are functioned in three aspects in ESMs: 1) Model parameters.The soil thermal (soil heat capacity and the thermal conductivity) and hydraulic characteristics (empirical parameters of soil water retention curve and hydraulic conductivity) are usually obtained by fitting equations (PTFs) to easily measured and widely available soil properties, such as sand, silt and clay fractions, organic matter content, rock fragments and bulk density (Clapp and Hornberger, 1978;Farouki, 1981;Vereecken et al., 2010;Dai et al., 2013).Soil albedos are significantly correlated with Munsell soil color value (Post et al., 2000).
2) Initial variables.The nutrient (C, N, P, K, S, etc.) amounts and the nutrients associated parameters (pH, cation-exchange capacity, etc.) in soils can be used to initialize the simulations.Generally, their initial values are assumed to be at steady state by running model over thousands of model years (i.e., spin-up) until no trend of change in pool sizes (McGuire et al., 1997;Thornton and Rosenbloom, 2005;Doney et al., 2006;Luo et al., 2016).To initialize nutrient amounts using the reliable soil data as background field could largely reduce the times of model spin-up, and also could avoid the possibility of the non-linear singularity evolution of the modeling, and 3) Benchmark data.Soil data, as observations, could serve as a reference for modeling calibration, validation and comparison.Soil carbon stock is one of the most frequently used soil properties as benchmark data (Todd-Brown et al., 2013).
Soil properties are of great spatial heterogeneity both horizontally and vertically.
As a result, ESMs usually incorporate soil property maps for multiply layers rather than a global constant.ESMs, especially LSMs, are evolving towards hyperresolutions of 1km or finer with more detailed parameterization schemes to accommodate the land surface heterogeneity (Singh et al., 2015;Ji et al., 2017).So spatially explicit soil data at high resolutions are necessary to improve land surface representation and simulation.Because soil properties are observed at individual locations, soil mapping or spatial prediction model is needed to derive the 3D representation of soil distribution.The traditional way (i.e., the linkage method) is to link soil profiles and soil mapping units on soil type maps, sometimes with ancillary maps such as topography and land use (Batjes, 2003).In the past decades, various digital soil mapping technologies were proposed by finding the relationships between soil and environmental covariates (usually remote sensing data) such as climate, topography, land use, geology and so on (McBratney et al., 2003).Soil datasets are usually not ready for the use of ESMs and some upscaling issues need to be addressed.The soil datasets produced by the linkage methods are polygon-based and need to be converted to fit the grid-based ESMs.This conversion can be done by either subgrid method or spatial aggregation.The up-to-date soil data are provided at a resolution of 1km or finer, while the LSMs are mostly ran at a coarser resolution.So upscaling of soil data is necessary before it can be used by ESMs.Proper upscaling methods need to be chosen carefully to minimize uncertainty in the modeling results introduced by them (Hoffmann and Christian Biernath, 2016;Kuhnert et al., 2017).This paper is organized in the following sections.In section 2, we first introduce soil datasets at global and national scales produced by the linkage method and digital soil mapping technology and then the soil datasets that have already been incorporated in ESMs.Section 3 presents PTFs that are used in ESMs to estimate soil hydraulic and thermal parameters.Section 4 describes how to deal with soil data derived by the linkage methods.Section 5 introduces the upscaling of high resolution soil data to the coarse resolution of ESMs.Section 6 gives the summary and an outlook of further improvements.

Global and national soil datasets
Two kinds of soil data are generated from soil surveys: soil polygon maps representing distribution of soil types and soil profiles with observations of soil properties.ESMs usually require the spatial distribution of soil properties, or soil property maps rather than soil classification information.Two kinds of methods, i.e. the linkage method and the digital soil mapping method, are used to derive soil property maps.Soil maps show the geographical distribution of soil types, which are compiled under a certain soil classification system.At the global level, there is only one generally accepted global soil map, i.e., the FAO-UNESCO Soil Map of the World (SMW) (FAO, 1971(FAO, -1981)).It was made based on soil surveys conducted between the 1930s and the 1970s, and technology available in 1960s.Several versions exist in the digital format (FAO, 2003b(FAO, , 1995;;Zöbler, 1986).At the regional and national level, there are many soil maps based on either national or international soil classifications.
Here are some examples of major soil maps available in digital formats: the Soil and Terrain Database (SOTER) databases (Van Engelen and Dijkshoorn, 2012) for different regions, the European Soil Database (ESB, 2004), the 1: 1 million Soil Map of China (National Soil Survey Office, 1995), the U.S. General Soil Map (GSM), the 1:1 million Soil Map of Canada (Soil Landscapes of Canada Working Group, 2010) and the Australian Soil Resource Information System (ASRIS) (Johnston et al., 2003).
Soil profiles are composed of multiply layers called soil horizons.For each horizon, soil properties are observed.At the global level, several soil profile databases exist.Here we only discuss the two most comprehensive ones.The World Inventory of Soil Emission Potentials (WISE) database was developed as a homogenized set of soil profiles (Batjes, 2008).The newest version (WISE 3.1) contains 10,253 soil profiles and 26 physical and chemical properties.The soil profiles database of World Soil Information Service (WoSIS) contains the most abundant profiles (about 118,400) from national and global databases including most of the databases mentioned below, though only a selection of important soil properties are included (Batjes, 2017).At the regional and national level, there are many soil profile databases, usually with soil classifications corresponding to the local soil maps.Here are some examples: the USA National Cooperative Soil Survey Soil Characterization database (http://ncsslabdatamart.sc.egov.usda.gov/),profiles from the USA National Soil Information System (http://soils.usda.gov/technical/nasis/),Africa Soil Profiles database (Leenaars, 2012), the Australian Soil Resource Information System (Karssies, 2011), the Chinese National Soil Profile database (Shangguan et al., 2013), soil profile archive from the Canadian Soil Information System (MacDonald and Valentine, 1992), soil profiles from SOTER (Van Engelen and Dijkshoorn, 2012), the soil profile analytical database for Europe (Hannam et al., 2009), the Mexico soil profile database ( Instituto Nacional de Estadística y Geografía, 2016), and the Brazilian national soil profile database (Cooper et al., 2005).
The linkage method (called the taxotransfer rule-based method) is to link soil mapping units or soil polygons and soil profiles according to taxonomy-based pedotransfer (taxotransfer) rules (Batjes, 2003).The criteria used in the linkage could be one or many factors as following: soil class, soil texture class, depth zone, topographic class, distance between soil polygons and soil profiles and so on (Shangguan et al., 2012).Each soil type is represented by one or a group of soil profiles that meet the criteria, and usually the median or mean value of a soil property is assigned to the soil type.There are many sources of uncertainty in the linkage method (Shangguan et al., 2012).The major source is spatial errors of soil maps, i.e. the location of soil types, as the estimation relies heavily on the soil map and the SOIL Discuss., https://doi.org/10.5194/soil-2018-32Manuscript under review for journal SOIL Discussion started: 15 October 2018 c Author(s) 2018.CC BY 4.0 License.purity of soil map units is likely to be around 50 to 65%.Because the linkage method assigned only one value or a statistical distribution to a soil type in soil polygons, the intra-polygonal variation is not taken into account.At the global level, many databases were derived by the linkage method: the FAO Soil Map of the World with derived soil properties (FAO, 2003a), the Soil and Terrain Database (Van Engelen and Dijkshoorn, 2012) for multiply regions and countries, the ISRIC-WISE derived soil property maps (Batjes, 2006), the Harmonized World Soil Database (HWSD) (FAO/IIASA/ISRIC/ISS-CAS/JRC, 2012), and the Global Soil Dataset for Earth System Model (GSDE) (Shangguan et al., 2014).Two most recent ones are HWSD and GSDE.HWSD was built via combining the existing regional and national updates of soil information.GSDE as an improvement of HWSD incorporated more soil maps and more soil profiles related to the soil maps, with more soil properties.GSDE accomplished the linkage based on the local soil classification, which required no correlation between classification systems and avoided the error brought by taxonomy reference.In addition, GSDE provided estimation of eight layers to the depth of 2.3 m, while HWSD provided estimation of two layers to the depth of 1 m.Many national and regional agencies around the world have organized their soil surveys by linking soil maps and soil profiles, including the USA State Soil Geographic Database (STATSGO2) (Soil Survey Staff, 2017), Soil Landscapes of Canada (Soil Landscapes of Canada Working Group, 2010), the ASRIS (Johnston et al., 2003), the Soil-Geographic Database of Russia (Shoba et al., 2008) the European Soil Database (ESB, 2004), the China dataset of soil properties (Shangguan et al., 2013) and so on.
Digital soil mapping (McBratney et al., 2003) is the creation and the population of a geographically referenced soil database, generated at a given resolution by using field and laboratory observation methods coupled with environmental data through quantitative relationships (http://digitalsoilmapping.org/).Usually, the soil datasets derived by digital soil mapping provide grid-based spatial continuous estimation while the soil datasets derived by the linkage method provide estimations with abrupt changes at the boundary of soil polygons.The uncertainty could be estimated quantitatively by methods such as geostatistical methods and quantile regression forest (Vaysse and Lagacherie, 2017).The GlobalSoilMap is a global consortium that aims to create global digital maps for key soil properties (Sanchez et al., 2009).This global effort takes a bottom-up framework and will produce the best available map of soil at a resolution of 3 arc sec (about 100 m) along with the 90% confidence of predictions.Soil properties will be provided for six soil layers (i.e.0-5, 5-15, 15-30, 30-60, 60-100, and 100-200 cm).Many countries have produced soil maps following the GlobalSoilMap specifications (Odgers et al., 2012;Viscarra Rossel et al., 2015;Mulder et al., 2016;Ballabio et al., 2016;Ramcharan et al., 2018;Arrouays, 2018).The Soilgrids system (https://www.soilgrids.org) is another global soil mapping project (Hengl et al., 2014;Hengl et al., 2015;Hengl et al., 2017).The newest version (Hengl et al., 2017)  The new generation soil dataset produced by digital soil mapping method gave a quite different distribution of soil properties from those produced by the linkage method.Figure 1 shows soil sand and clay fraction at the surface 0-30 cm layer from Soilgrids, IGBP-DIS (Data and Information System of International Geosphere-Biosphere Programme) and GSDE. Figure 2 shows soil organic carbon and bulk density at the surface 0-30 cm layer from Soilgrids and GSDE.Significant differences are visible in these datasets.This will lead to different modelling results in ESMs.(Marwa et al., 2018) found that the global soil organic carbon stocks at 1m depth is 3,400 Pg estimated by Soilgrids while it is 2500 Pg by HWSD, and the estimates by Soilgrids are closer to the observations.

Soil dataset incorporated in ESMs
Table 1 shows several most popular ESMs (specifically, their land surface models) and the input soil datasets.Land surface models (LSMs) are key tools to predict the dynamic of land surface under climate change and land use.Five datasets are widely used, e.g., the datasets by Wilson and Henderson-Sellers (1985), Zöbler (1986), Webb et al. (1993), Reynolds et al. (2000), Global Soil Data Task (2000), and Miller and White (1998).Except HWSD and STATSGO (Miller and White, 1998) for USA, these datasets were derived from the Soil Map of the World (FAO, 1971(FAO, -1981) ) and limited soil profile data (no more than 5,800 profiles), which gained popularity because its simplicity and ease of use.But they are outdated and should no longer be used because much better soil information as introduced in Section 2.1 can be incorporated.
In recent years, efforts were taken to improve the soil data condition in ESMs.
This was started by the Land-Atmosphere Interaction Research Group at Beijing Normal University (BNU, now at Sun Yat-sen University).Shangguan et al. (2012Shangguan et al. ( , 2013) ) 2018) proposed an approach that can assimilate coarse global soil data by finer land use and coverage dataset which improved the performance of hydrologic modeling at watershed scale.Kearney and Maino (2018) incorporated the new generation of soil data produced by digital soil mapping method into a climate model and found that, compared to the old soil information, this improved the simulation of soil moisture at fine spatial and temporal resolution over Australia.A global gridded hydrologic soil groups (HYSOGs250m) was developed based on soil texture and depth to bedrock of Soilgrids (Hengl et al., 2017) and groundwater table depth (Fan et al., 2013) for curve-number based runoff modeling of U.S. Department of Agriculture (Ross et al., 2018).
Except soil properties, the estimation of underground boundaries including the groundwater table depth, the depth to bedrock (DTB) and depth to regolith and its implementation in ESMs is also a new focus.Fan et al. (2013) compiled global observations of water table depth and inferred the global patterns using a groundwater model.Pelletier et al. (2016) developed a global DTB dataset by using process based models for upland and an empirical model for lowland.This dataset was implemented in the CLM4.5 and found that there were significant influences on water and energy simulations compared to the default constant depth (Brunke et al., 2015).Shangguan et al. (2017) developed a global DTB by digital soil mapping based on about 1.7 million observations from soil profiles and water wells, which has a much higher accuracy than the dataset by Pelletier et al. (2016).Vrettas and Fung (2016) shows that the weathered bedrock stores a significant fraction (more than 30%) of the total water despite its low porosity.Jordan et al. ( 2018) estimated global permeability of the unconsolidated and consolidated earth for groundwater modelling.However, due to the lack of data, an accurate global estimation of depth to regolith is not feasible.Caution should be paid to use of the products of so-called soil depth in ESMs.Soil depth maps are usually estimated based on observations from soil survey, and soil depth (or depth to the R horizon) is assumed to be equal to DTB.However, these observations are usually less than 2 meters and usually do not meet the depth to bedrock (Shangguan et al., 2017).Thus, soil depth maps based on soil profiles only are significantly underestimated (one order of magnitude lower) compared to the actual depth to bedrock and should not be taken as the lower boundary of ESMs.For the convenience of ESMs' application, we present basic descriptions about the new soil datasets in Table 2 and 3.As described in section 2.1, three available global soil datasets, i.e.HWSD, GSDE and Soilgrids, have been developed in the last several years (Table 2).Table 3 shows the available soil properties of these soil datasets.
Though all three databases do not contain uncertainty estimation, Soilgrids is considered to be the most accurate one.The explained variance of soil properties in Soilgrids is between 56% and 83%, while HWSD and GSDE do not offer quantitative accuracy assessment.GSDE has the largest number of soil properties, while Soilgrids contains only ten primary soil properties.

Estimating secondary parameters using pedotransfer functions
Earth system modellers have employed different pedotransfer functions (PTFs) to estimate soil hydraulic parameters (SHP), soil thermal parameters (STP), and biogeochemical parameters (Looy et al., 2017;Dai et al., 2013).Almost all ESMs incorporated SHPs and STPs estimated by PTFs but not biogeochemical parameters.
PTFs are the empirical functions that account for the relationships between these secondary parameters and more easily obtainable soil property data.Direct measurement of these parameters is difficult, expensive and in most cases impractical to take sufficient samples to reflect the spatial variation.Thus, most soil databases do not contain these secondary parameters.PTFs provide the alternative to estimate them.In ESMs, SHPs and STPs are usually derived using simple PTFs taking only soil texture data as the input.As more soil properties become available globally, including gravel, soil organic matter and bulk density, more sophisticated PTFs using additional soil properties can be utilized in ESMs.
PTFs can be expressed as either numerical equations or by machine learning methodology which is more flexible to simulate the highly nonlinear relationship in analysed data.PTFs can also be developed based on soil processes.Most researches did not indicate where the PTFs can potentially be used, and the accuracy of a PTF outside of its development dataset is essentially unknown McBratney et al. (2011).
Therefore, they should never be considered as an ultimate source of parameters in soil modelling.Looy et al. (2017) reviewed PTFs extensively in earth system science and emphasized that PTF development has to go hand in hand with suitable extrapolation and upscaling techniques such that the PTFs correctly represent the spatial heterogeneity of soils in ESMs.Though the PTFs were evaluated, it is not clear which are the best set of PTFs for global applications.Due to these limitations, a better way to estimate the secondary parameters may be to use an ensemble of PTFs, which can give the variability of parameters.Dai et al. (2013)

Model use of soil data derived by the linkage method
In the soil map, a soil mapping unit (SMU) is composed of more than one component in most cases, and thus a one-to-many relationship exists between the SMU and the profile attribute.This condition makes representing attributes characterizing a SMU a non-trivial task.To keep the whole variation of soil in a SMU, the best way is using the subgrid method in ESMs (Oleson et al., 2010), i.e. aggregate values of soil properties and provide the area percentage of each value.
However, this will increase the computing time and the complexity of ESMs' structure, which needs to implement the soil processes over each subgrid soil column within a grid instead of the entire model grid.
Usually, the ESMs uses grid data as input and each grid cell has one unique value of a soil property.Three spatial aggregation methods were proposed to aggregate compositional attributes in a SMU to a representative value (Batjes, 2006;Shangguan et al., 2014).The area-weighting method (method A) takes area-weighting of soil attributes.The dominant type method (method D) takes the soil attribute of the dominant type.The dominant binned method (method B) classifies the soil attribute into several preselected classes and takes the dominant class.All three methods can be applied to quantitative data, while the method D and the method B can be applied to categorical data.The advantages and disadvantages of these methods were discussed (Batjes, 2006;Shangguan et al., 2014).The choice should be made according to the specific applications (Hoffmann and Christian Biernath, 2016).The method B provides binned classes, which are not convenient for modelling, though method B is considered more appropriate to represent a grid cell.The method A keeps mass conservation, which can meet most demands of model applications.However, the method A may be misleading in cases when extreme values appeared in a SMU.For the linkage method, the uncertainty is usually estimated by giving the 5 and 95 percentile soil properties of the soil profiles that linked to a SMU.Because the frequency distribution of soil properties within a SMU is usually not a normal distribution or any other typical statistic distribution, the application of statistic such as standard deviation in model use is not proper.This means that the uncertainty of soil dataset derived by the linkage method can not be incorporated into ESMs in a straight forward way.
The basic soil properties are often used to derive secondary parameters including SHPs and STPs by PTFs and soil carbon stock or other nutrient stocks by certain equations (Shangguan et al., 2014).This procedure could be done either before or after the aggregation (here referred to ''aggregating after" and ''aggregating first'').
Because the relationship between the soil basic properties and the derived soil parameters is usually nonlinear, the ''aggregating first'' method should be taken.This was also proved by case studies (Romanowicz et al., 2005;Shangguan et al., 2014).
However, some researchers (Hiederer and Köchy, 2012) were not aware of this and used the ''aggregating after" method producing misleading results.
The aggregation smooths the variation of soil properties between soil components within a given SMU (Odgers et al., 2012).To avoid the aggregation, the spatial disaggregation of soil type maps can be used to determine the location of the SMU components, though the location error may be high in some cases (Thompson et al., 2010;Stoorvogel et al., 2017).This method depends on high density of soil profiles to establish soil and landscape relationships.Folberth et al. (2016) shows that the correct spatial allocation of the soil type to present cropland was very important in global crop yield simulations.Currently, aggregation is still the pragmatic way at the global scale due to lack of data.

Upscaling detailed soil data for model use
The updated soil datasets derived by both the linkage method and digital soil mapping are usually at a high resolution from 1 km to 100 m, and upscaling or aggregation is required to derive lower resolution datasets for model use.The aggregation methods mentioned above can be used.Moreover, there are plenty of upscaling methods such as the window median, variability-weighted methods (Wang et al., 2004), variogram method (Oz et al., 2002), fractal theory (Quattrochi et al., 2001) and Miller-Miller scaling approach (Montzka et al., 2017).However, few studies have been devoted to test out which upscaling methods are suitable for soil data.A preliminary effort was done by (Shangguan, 2014).Five upscaling methods compared were the window average, widow median, widow modal, arithmetic average variability-weighted method and bilinear interpolation method.Differences between aggregation methods varied from 10% to 100% for different parameters.The upscaling methods affected the data derived by the linkage method more than the data by digital soil mapping.The window average, window median and arithmetic average variabilityweighted method performed similar in upscaling.The root mean square error increased rapidly when the window size was less than 40 pixels.Similar to the aggregation of SMUs, the ''aggregating first'' method is recommended when secondary soil parameters are derived.Again, alternative to avoid the aggregation into one single value The upscaling effect of soil data on model simulation has been investigated in previous studies with controversial conclusions.For example, Melton et al. (2017) used two linked algorithms to provide tiles of representative soil textures for subgrids in a terrestrial ecosystem model and found that the model is relatively insensitive to subgrid soil textures compared to a simple grid-mean soil texture at a global scale.However, the treatment without soil subgrid structure in JULES resulted in soil-moisture dependent anomalies in simulated carbon flux (Park et al., 2018).Further researches are necessary to investigate the upscaling effect on models

Summary and outlook
This paper reviews the status of soil datasets and their usage in ESMs.Soil physical and chemical properties served as model parameters, initial variables or benchmark datasets in ESMs.Soil profiles, soil maps and soil datasets derived by the linkage method and digital soil mapping are reviewed at national, regional and global levels.The soil datasets derived by digital soil mapping are considered to provide more realistic estimation of soils than those derived by the linkage method.
The popular soil datasets used in ESMs are outdated and there are soil datasets available for the updates.In the recent several years, efforts were taken to update the soil data in ESMs.The effects of updated soil properties which are used to estimate soil hydraulic and thermal parameters were evaluated.Other major updates include soil reflectance, ground water tables and depth to bedrock.
Pedotransfer functions (PTFs) are employed to estimate secondary soil parameters, including soil hydraulic and thermal parameters, and biogeochemical parameters.PTFs can take more soil properties (i.e., soil organic carbon, bulk density etc.) as input in addition to soil texture data.An ensemble of PTFs may be more robust to provide secondary soil parameters as direct input to ESMs.Soil data derived by the linkage methods and high resolution data can be aggregated by different methods to fit the use in ESMs.The aggregation should be done after the secondary parameters are estimated.However, the aggregation will omit the variation of soil properties.To avoid the aggregation, the subgrid method in ESMs is an alternative which increases the model complexity.The effect of different upscaling methods on the performance of ESMs needs to be investigated further.
Because digital soil mapping has many advantages compared to the traditional linkage method, especially in representing spatial heterogeneity, the new generation soil datasets derived by digital soil mapping needs to gain popularity in ESMs.As a datadriven method, digital soil mapping requires soil profiles observations and environmental covariates (in which the importance of soil maps is low), and including more these data in mapping will improve the global predictions (Hengl et al., 2017).included by global soil databases such as WoSIS make up a very small fraction of the soil pits dug by human beings.For example, there are more than 100,000 soil profiles from the second national soil survey of China (Zhang et al., 2010) and no more than 9,000 were used to produce the national soil property maps freely available (Shangguan et al., 2013).In the last century, national soil survey was accomplished widely, majorly for agriculture purpose.However, most of these legacy data are not digitalized and they are usually not made available to the science community even if digitalized.How to flush out these hidden soil data requires some mechanism such as government mandatory regulations and investing money on making them available.In addition, investments on new soil samplings should be made, especially in the under-represented areas.A good example is the US, which has the most abundant soil data freely available like many other data.Data compatibility of different analysis methods and different description protocols including soil classifications is also an important issue and data harmonization is necessary when the data are made available to public.
The gap between the soil data availability and data usage in ESMs is still large.
Most popular ESMs have not utilized the recent available global soil datasets yet.

Soil texture classes Soil color classes
Prescribed for BATS vegetation/land cover type (Elguindi et al., 2014).
at a resolution of 250 m was produced by fitting an ensemble of machine learning methods based on about 150,000 soil profiles and 158 soil covariates, which is currently the most detailed estimation of global soil distribution.SOIL Discuss., https://doi.org/10.5194/soil-2018-32Manuscript under review for journal SOIL Discussion started: 15 October 2018 c Author(s) 2018.CC BY 4.0 License.
developed a China dataset of soil properties for land surface modeling based on 8,979 soil profiles and the Soil Map of China using the linkage method.Dai et al. (2013) derived soil hydraulic parameters using pedotransfer functions based on the soil properties byShangguan et al. (2013).Shangguan et al. (2014) further developed a comprehensive global dataset for ESMs.The above soil datasets were widely used in the ESMs.Soil properties from these soil datasets, including soil texture fraction, organic carbon, bulk density and derived soil hydraulic parameters, were implemented in the Common Land Model Version 2014 (CoLM2014, http://land.sysu.edu.cn/).Li et al. (2017) shows that CoLM2014 was more stable than the previous version and had comparable performance to that of CLM4.5 which may be attributed in part to the new soil parameters as input.Wu et al. (2014) shows that soil moisture values are closer to the observations when simulated by CLM3.5 with the China dataset than those simulated with FAO.Zheng and Yang (2016) estimated effects of soil texture datasets from FAO and BNU on regional terrestrial water cycle simulations with the Noah-MP land surface model.Tian et al. (2012) used the China soil texture data in a land surface model (GWSiB) coupled with a groundwater model.Lei et al. (2014) used the China soil texture data in CLM to estimate the impacts of climate change and vegetation dynamics on runoff in the mountainous region of the Haihe River basin.SOIL Discuss., https://doi.org/10.5194/soil-2018-32Manuscript under review for journal SOIL Discussion started: 15 October 2018 c Author(s) 2018.CC BY 4.0 License.Zhou et al. (2015) estimated age-dependent forest carbon sink with a terrestrial ecosystem model utilizing the soil carbon data of China.Dy and Fung (2016) updated the soil data for the Weather Research and Forecasting model (WRF).Researchers have also put efforts to update ESMs with other soil data.Lawrence and Chase (2007) used MODIS data to derive soil reflectance, which was used as a soil colour parameter in Community Land Model 3.0 (CLM).De Lannoy et al. (2014) updated the Catchment land surface model of the NASA with soil texture and organic matter data from HWSD and STATSGO2.Livneh et al. (2015) evaluated the influence of soil textural properties on hydrologic fluxes by comparing the FAO data and STATSGO2.Folberth et al. (2016) evaluated the impact of soil input data on yield estimates in a global gridded crop model.Slevin et al. (2017) utilized the HWSD to simulate global gross primary productivity in the JULES land surface model.Trinh et al. ( SOIL Discuss., https://doi.org/10.5194/soil-2018-32Manuscript under review for journal SOIL Discussion started: 15 October 2018 c Author(s) 2018.CC BY 4.0 License.
derived a global soil hydraulic parameters using the ensemble method.Selection of PTFs was carried out based on the following rules, including the consistent physic definition, large enough training sample and positive evaluations in comparison with other PTFs.The PTFs selected included not only those in equations but also PTFs of machine learning.As a result, the modellers could use these parameters as inputs instead of calculating them in ESMs every time running the model.The new generation soil information has already been utilized to derive SHPs SOIL Discuss., https://doi.org/10.5194/soil-2018-32Manuscript under review for journal SOIL Discussion started: 15 October 2018 c Author(s) 2018.CC BY 4.0 License.and STPs in some researches.Montzka et al. (2017) produced a global map of SHPs at a resolution of 0.25° based on the SoilGrids 1km dataset.Tóth et al. (2017) calculated SHPs for Europe with the EU-HYDI PTFs (Tóth et al., 2015) based on SoilGrids 250 m.Wu et al. (2018) used an integrated approach that ensembles PTFs to map field capacity of China based on multi-source soil datasets.The performance of PTF in ESMs is evaluated in many researches, though PTFs has not been fully exploited and integrated into ESMs (Looy et al., 2017).Here are some examples.Chen et al. (2012) incorporated soil organic matter to estimate soil porosity and thermal parameters for the use of land surface models.Zhao et al. (2018a) evaluated PTFs performance to estimate SHPs and STPs for land surface modelling over the Tibetan Plateau.Zheng et al. (2018) developed PTFs to estimate the soil optical parameters to derive soil albedo for the Tibetan Plateau, and the PTFs incorporated into an eco-hydrological model which improved the model simulation of surface energy budget.Looy et al. (2017) envisaged two possible approaches to improve parameterization of Earth system models by PTFs.One is to replace constant coefficients in the current ESMs with spatially distributed values by PTFs.The other is to develop spatially exploitable PTFs to parameterize specific processes using knowledge of environmental controls and variation of soil properties.
SOIL Discuss., https://doi.org/10.5194/soil-2018-32Manuscript under review for journal SOIL Discussion started: 15 October 2018 c Author(s) 2018.CC BY 4.0 License.for a grid cell is to use the subgrid methods in ESMs.
The temporal variation of global soil is quite challenging due to lack of data.Soil image fusion is also needed to merge the local and global soil maps.Mapping the soil depth and depth to bedrock separately at the global level is also still challenging due to lack of data and the understanding of relevant processes.Uncertainty estimation should be included in the soil datasets developed in the future.The gap between soil data existence and data availability is huge.The soil profiles SOIL Discuss., https://doi.org/10.5194/soil-2018-32Manuscript under review for journal SOIL Discussion started: 15 October 2018 c Author(s) 2018.CC BY 4.0 License.

960*Figure 2
Figure 2 Soil organic carbon and bulk density at the surface 0-30 cm layer from Soilgrids971

Table 1 .
Lists of the soil dataset used by land surface models (LSM) of Earth System Models (ESM) or climate models (CM) SOIL Discuss., https://doi.org/10.5194/soil-2018-32Manuscript under review for journal SOIL Discussion started: 15 October 2018 c Author(s) 2018.CC BY 4.0 License.Three new global soil datasets for the updates of ESMs.SOIL Discuss., https://doi.org/10.5194/soil-2018-32Manuscript under review for journal SOIL Discussion started: 15 October 2018 c Author(s) 2018.CC BY 4.0 License.exchangeable cations (nutrients) Na, Ca, Mg and K as a percentage of the overall 954 exchange capacity of the soil (including the same cations plus H and Al).TEB is total 955 exchangeable base including Na, Ca, Mg and K. ESP is exchangeable sodium percentage, 956 which is calculated as Na*100/CECsoil.ECE is electrical conductivity.AWC is the 957 available water storage capacity.The first 9 soil properties on the left including drainage 958 class, AWC class and so on are available for soil types, while the other properties are 948Table2959 available for each layer.