How well does Predictive Soil Mapping represent soil geography? An investigation from the USA

We present methods to evaluate the spatial patterns of the geographic distribution of soil properties in the USA, as shown in gridded maps produced by Predictive Soil Mapping (PSM) at global (SoilGrids v2), national (Soil Properties and Class 100m Grids of the USA), and regional (POLARIS soil properties) scales, and compare them to spatial patterns known from detailed field surveys (gSSURGO). The methods are illustrated with an example: topsoil pH for an area in central New York State. A companion report examines other areas, soil properties, and depth slices. A set of R Markdown scripts is 5 referenced so that readers can apply the analysis for areas of their interest. For the test case we discover and discuss substantial discrepancies between PSM products, as well as large differences between the PSM products and legacy field surveys. These differences are in whole-map statistics, visually-identifiable landscape features, level of detail, range and strength of spatial autocorrelation, landscape metrics (Shannon diversity and evenness, shape, aggregation, mean fractal dimension, cooccurence vectors), and spatial patterns of property maps classified by histogram equalization. Histograms and variogram 10 analysis revealed the smoothing effect of machine-learning models. Property class maps made by histogram equalization were substantially different, but there was no consistent trend in their landscape metrics. The model using only national points and covariates was not better than the global model, and in some cases introduced artefacts from a lithology covariate. Uncertainty (5–95% confidence intervals) provided by SoilGrids and POLARIS were unrealistically wide compared to gSSURGO low and high estimated values and show substantially different spatial patterns. We discuss the potential use of the PSM products as a 15 (partial) replacement for field-based soil surveys. Copyright statement. Creative Commons Attribution 4.0 License

Predictive Soil Mapping (PSM), also commonly referred to as Digital Soil Mapping (DSM), has been defined as "the development of a numerical or statistical model of the relationship among environmental variables and soil properties, which is then applied to a geographic data base to create a predictive map" (Scull et al., 2003). Since the seminal paper of McBratney et al. 55 (2003), recently reviewed by Minasny and McBratney (2016), PSM has been widely-applied from the field to global levels.
A principal attraction of PSM is that it produces consistent, geometrically-correct and reproducible gridded maps over large areas, given training data ("point" observations of soil classes, properties or conditions), a set of environmental covariates covering the entire area to be mapped at some fixed grid resolution, and a set of algorithms implemented in computer code.
This removes the need for expertise in discovering and interpreting the soil-landscape relations, also known as the "paradigm" 60 of soil survey (Hudson, 1992), which is vital for traditional soil survey and difficult to acquire and harmonize among surveyors.
Further, fewer locations can be visited in order to develop reliable models, as compared to traditional survey techniques.
If the relation with covariates is strong, and locations representative of the entire covariate feature space are included in the training set, large areas can be mapped from relatively few field observations. Maps made by PSM can include areas that are not accessible to field mappers because of permissions or difficult access, if the available training data cover the covariate space 65 of the inaccessible area. However, PSM requires sufficient sampling density to cover the full covariate space, since most PSM methods do not extrapolate, and in any case extrapolation is inadvisable.
It is not claimed that PSM is inherently superior to traditional survey, but well-known problems of traditional survey are avoided: multiple survey projects over time with inconsistent standards and mapping concepts, inconsistency among mappers, difficulties in objectively identifying boundaries, and indeed the need to identify boundaries. However, traditional soil surveyors 70 and users of their maps are often critical of PSM products, and may not understand how they were made and how they should be used (Arrouays et al., 2020). In the USA there is increasing awareness of, and interest in, PSM products. Here the most important point of contention has to do with PSM resolution (pixel size), which implies a mapping scale, compared to the scale at which differences can be reliably interpreted for user needs. Criticism of PSM products is proportional to the degree to which their implied spatial precision and accuracy is over-sold.

75
The success of PSM in reproducing known "point" observations is typically reported by evaluation ("validation") statistics based on data splitting (independent evaluation) or by cross-validation. These evaluations are almost never based on random sampling (Brus et al., 2011), and since the source point datasets are almost always biased towards certain land uses, access constraints or landscape locations, these evaluations carry forward these biases and must be interpreted with caution.
A more serious issue is that point evaluations of PSM products do not consider the spatial pattern of predictions. By contrast, 80 soil surveys based on extensive field observation of the soil landscape (sometimes called the "traditional" soil survey method) produce polygon maps of soil map units (SMU) composed of one or more soil type units (STU), e.g., soil series, which then provide modal or representative profiles with measurements or estimates of soil properties, usually by genetic horizon. These explicitly show the soil landscape, as interpreted by the soil surveyor and as managed by the land user. Since PSM maps are not derived from a paradigm such as the surveyor's explicit stratification of the soil landscape, their spatial pattern depends 85 on the input data (training points and covariates) and the PSM algorithm applied. The question is thus to what degree PSM products represent the actual soil landscape spatial pattern and, more importantly, the underlying pedogenetic and geomorphic processes.
PSM maps are most commonly produced at grid cell resolutions from 1 km to 30 m, and even to 10 m for precision agriculture applications. Environmental covariates are available at these resolution, so that PSM products at high resolutions can 90 show fine details that can not be presented at the design scale of conventional maps. These have minimum legible delineations (MLD) of 0.25 cm 2 (Vink, 1975) or 0.40 cm 2 (Forbes et al., 1982) on the published map, multiplied by the scale factor. For example, a polygon map at 1:24 000, typical of USA conventional soil survey, can represent spatial patterns of 1.44 (Vink) to 2.3 (Forbes) ha minimum-size polygons. These correspond to single grid cell resolutions of 240 to 384 m, coarser than higher-resolution PSM products from (30 to 100 m). But the question remains whether this implied fine detail represents 95 true differences or artefacts of the mapping process -in other words, should the PSM map unit trust the apparent differences between adjacent grid cells, or are some or most of these differences due to artefacts ("noise") of the PSM process? Further, there is the question of how well the medium-resolution products (e.g., 250 m) represent the soil landscape at regional extent.
The objective of this study is to present methods with which to evaluate the landscape and detailed level spatial patterns of PSM maps. These maps may have been developed for global, national, or regional spatial extents. These patterns are compared 100 with digital soil maps based on polygon maps produced by field survey and expert soil-landscape analysis. We chose the USA as a study area because of the availability of field-based soil surveys at 1:12 000 to 1:24 000 design scale, linked to detailed descriptions of modal soil profiles, available as a seamless digital product. These comparisons may be useful in the context of current plans (Thompson et al., 2020) for updating and completing the USA soil survey using PSM methods and GlobalSoilMap (GSM) specifications (Arrouays et al., 2014). They should also be useful for developing realistic expectations 105 for what PSM can and cannot deliver (Arrouays et al., 2020).
To evaluate PSM methods we apply them to selected test areas and soil properties, and comment on the results. This paper introduces the methods and data sources, and includes an illustrative example (one area, one soil property, one depth slice), with a minimum of discussion of soil geography. A companion ISRIC Report (Rossiter et al., 2021) presents four case studies in diverse soil geographic contexts, each with different soil properties and depth slices. We encourage readers to apply the 110 methods to their own study areas within the USA and to their soil properties of interest, to evaluate the utility of the several PSM products. For this, we provide our analysis scripts as R Markdown documents (R Studio, 2020); see "Code availability" at the end of this paper.

Data sources
The source maps are of three kinds: (1) digital products based on field survey without any statistical modelling; (2) PSM 115 products based on field survey products and enhanced by statistical modelling using environmental covariates; and (3) PSM products based on statistical modelling using training points and environmental covariates. This latter is the most common PSM method worldwide, especially for areas without extensive field surveys. These differ in their primary data, their environmental covariates and geographic scope, their mapping methods, the resolution of their predictions, and their uncertainty assessment.
These have implications for interpreting their relative success. We summarize these below; see the Supplementary Materials 120 and corresponding papers for details.
The first source is represented by the reference product from the Natural Resources Conservation Service ( There are two products representing the third source, one for the world and one for the continental USA only. This allows us to compare globally-and nationally-consistent products. The global product is SoilGrids v2.0 (further SG2) (ISRIC -World Soil Information, 2020; Poggio et al., 2021), a further development of SoilGrids1km (Hengl et al., 2014) and SoilGrids250m (Hengl et al., 2017). This uses a global point dataset and environmental covariates that cover the entire world (except the high Arctic and Antarctica), and models that are globally-consistent. The continental product is the Soil Properties and Class 100m Grids of the United States (further SPCG) (Ramcharan et al., 2018), which followed the methodology of Hengl et al. (2017) with the addition of USA-specific covariates, notably parent material and drainage classes extracted from SSURGO or STATSGO2, and only used the CONUS extent of environmental covariates in model building.

150
A major difference between products is the extent to which primary data relies on field soil survey. However, the transfer from unrectified photos to topographic base and the edge matching between survey areas has not 170 always been flawless, and in addition polygons may have been mis-drawn on the original survey ( Supplementary Fig. 1). Thus we can not take gSSURGO as a completely reliable reference.
PSP does not use any point observations; rather, it samples pseudo-points from gSSURGO and uses these as training points for the DSMART disaggregation algorithm (see below).
At the other extreme is SG2, which does not use any information derived from SSURGO or STATSGO. Its training points are 175 extracted from the freely-shareable World Soil Information Service (WoSIS) point dataset from ISRIC-World Soil Information . These include all profiles in the NRCS pedon database. The WoSIS points are augmented by several datasets that can not be published externally due to restrictions by the original data providers to ISRIC, but which can be used in mapping. In total ≈ 240 000 profiles were used in model building.
SPCG is similar to SG2 in that it is primarily based on point observations, but it has a richer source of these than SG2:

Environmental Covariates and geographic scope
The three PSM products (SG2, PSP, and SPCG) use a large number of gridded GIS coverages as environmental covariates in 185 their predictive models. These represent soil-forming factors, and include climate, ecology, geology, land use/cover, terrain, vegetation and hydrography; see Supplementary Information for details.
PSP also uses coarse-resolution (≈ 2 km) estimates of U, Th, and K γ-ray decay products. The model is trained in overlapping tiles, thus each tile is derived from a local model. SPCG also uses SSURGO and STATSGO2 polygons to derive parent material (87) and drainage (4) classes as environmental covariates. The model is trained for the CONUS, thus it is a regional model.  (Schoeneberger et al., 2012). Mapping is based on conceptual models of soil-landscape relations developed in each survey area (Hudson, 1992), confirmed by purposive auger observations and a small number of full profile descriptions to characterize map unit composition. SSURGO is progressively updated by field inspection and correlation, as problems are identified by soil surveyors or map users. The SSURGO polygons are rasterized to gSSURGO. Since SSURGO is compiled from diverse field surveys over many years, in some areas there are artefacts of that survey process ( Supplementary Fig. 2).

205
PSP uses the DSMART disaggregation algorithm (Odgers et al., 2014) to predict the most probable component (STU), along with their probability of occurrence, at each 30 m resolution grid cell, and from the modal soil properties of the component, a probability-weighted aggregation. In this context "disaggregation" is the process of attempting to take a coarser-resolution gridded or smaller-scale polygon product which is known to have multiple STU, and identify the locations at a finer grid resolution where these components would be found, should the original survey have been at larger scale. This depends on 210 fine-scale covariates that, in theory, relate to the STU within an SMU. It attempts to deal with the problems caused by multiple surveys over time, inconsistencies among mappers, and poor georeference of SMU boundaries by sampling out of mapped SMU polygons according to declared proportions of map unit components (STU) and using these as pseudo-observations to train PSM models of STU occurrence. PSP provides a fine-scale map equivalent to ≈1:3 000 design scale, i.e., from 16 to 64 times finer resolution than the original 1:12 000 to 1:24 000 surveys included in SSURGO. An obvious question is whether it is possible to map at this resolution from the SSURGO source, even with the fine-resolution covariates used by DSMART. See Chaney et al. (2019) for details.
The other two methods are representative of the dominant PSM method as implemented, with some differences in detail, in many countries and for many properties (e.g., Reddy et al., 2021;Liu et al., 2020;Araujo-Carrillo et al., 2021). SG2 uses random forests implemented in the ranger R package, with prior covariate selection by recursive feature elimination and

Uncertainty assessment
SG2 and PSP predict the 5% and 95% quantiles of the distribution of predictions using Quantile Regression Forests (QRF) (Meinshausen, 2006). These limits are specified by the GlobalSoilMap consortium (Arrouays et al., 2014), defined as "the 90% Prediction Interval (PI) which reports the range of values within which the true value is expected to occur 9 times out of 10 . . . there is no assumption that this prediction interval is necessarily symmetric around the predicted value." (Science Resources Conservation Service, n.d.), these are used as the basis for establishing the range. In all cases expert opinion is used to adjust these to represent the range that a map user can expect to find in the field. Thus these are not directly comparable to the results of QRF, but do give some idea of how the field mappers, supported by laboratory observations, conceive of the spread of a property. Note that none of these assessments imply a parametric probability distribution.
As pointed out by Arrouays et al. (2020), "[t]he user community requires training in, and experience with, the new digital 255 soil map products, especially about the use of uncertainties". It would be hoped that the uncertainties computed by different methods would be similar, but as will be seen here, that is not the case.

Evaluation methods
We compared PSM products at regional (nominal 250 m grid cells) and local (nominal 30 m grid cells) levels. We evaluated both qualitatively, i.e., by visual inspection followed by expert interpretation, and numerically, over a 1 × 1 • tile. Spatial 260 patterns were evaluated over a 0.2 × 0.2 • subtile. To compare maps at the regional resolution, the higher-resolution maps were aggregated to the lower resolution by weighted averaging (resampling) of the high-resolution pixels within one low-resolution pixel. To compare maps at the local resolution, we only included the two products (gSSURGO and PSP) provided at that resolution, along with the global product (SG2) as reference, this latter downscaled by increasing the grid resolution without any attempt to disaggregate within the larger grid cell, over a 0.15 × 0.15 • subtile.

Qualitative methods
Qualitative methods for comparing maps rely on expert judgement to identify known soil-geographic patterns and evaluate to what extent they are represented on the gridded maps. The maps are displayed side-by-side along with a map of their pairwise differences. Areas of disagreement are identified and discussed.

Numerical methods -whole map 270
Numerical methods for comparing gridded maps as a whole include (1) MD: Mean difference, i.e., the disagreement between maps; (2) RMSD: root mean squared difference; (3) RMSD adjusted for MD, i.e., the RMSD after accounting for any bias.
These take the first-listed map as reference and the second as the map to evaluate. They can be normalized by the number of grid cells or total area. In addition, all maps can be compared by their Pearson (linear) correlations. These methods are of limited interpretive value -they do characterize the entire map, especially for overall bias (MD), but do not explain where the 275 spatial discrepancies occur.

Numerical methods -spatial continuity
Soil properties are usually spatially-correlated: we expect similar values of properties in nearby grid cells. The degree of local spatial continuity can be assessed by the variogram computed over local neighbourhoods of the gridded map. We computed and modelled the variogram within a local neighbourhood and automatically fit with an exponential model, using the 280 fit.variogram function of the gstat R package (Pebesma, 2004). Spatial structure is characterized by the range, proportional nugget and structural sill of the fitted variogram model. The range shows the radius over which the selected property has spatial correlation. The proportional nugget shows the inherent variability at a point, at a scale shorter than the grid spacing.
The structural sill shows the overall variability within the range.

V-measure
The V-measure method evaluates the spatial association between two regionalizations, i.e., partitions of a map into classes, called regions in the first map and zones in the second map. These are intersected to produce segments of the combined map.
The segments, labelled with both zone and region, are used to compute two metrics, called homogeneity and completeness, with 300 respect to the base (first-listed) regionalization. Homogeneity of the second map is a measure of the variance of the regions within a zone, normalized by the variance of the regions in the entire domain of the first map. If the regions within zones have less variance than within the entire second map, the second map is to some extent homogeneous with respect to the first map.
A perfectly homogeneous partition (second map) (with value 1) is when each zone of the first map is entirely within a single region of the second map. A perfectly inhomogeneous partition (with value 0) is when each zone has the same composition of 305 regions as the entire domain of the first map, i.e., the second map's partition (to be tested) is essentially random with respect to the first map's regionalization.
Completeness of the second map is the inverse of homogeneity: it assesses the variance of the zones within a region, normalized by the variance of the zones in the entire domain of the second map. This shows how well the partition of the second map fits inside the regionalization.

310
These two together are combined into a single measure, the V-measure, as the harmonic mean of homogeneity h and completeness c (Equation 1). This has a range between 0 (no spatial association between the maps) and 1 (perfect association).
Obviously, we prefer high association between maps produced by PSM and a reference map. We can also assess the agreement of the patterns produced by different PSM methods.

Landscape metrics
The landscape metrics applicable to soil maps (as opposed to, e.g., vegetation maps) have diverse interpretations. We compare the metrics of two maps to see if they have a similar concept of the (classified) soil landscape. The landscapemetrics package can compute many indices; we choose a few that will best show the landscape-level difference between maps. We do not consider metrics of individual patches, except as they contribute to landscape-level metrics. The algorithms for these can 320 be read from the code repository (Hesselbarth, 2021); we present the formulas and interpretations in the following text.

AI =
3.5 Regional patterns Regional patterns are at the scale of regional trends such as lithologic units, elevation zones in mountains, and repeating patterns 360 (e.g., basin-and-range, ridge-and-valley). A "region" in this context is on the order of 1×1 • to 5×5 • . An appropriate resolution for this scale is (250 m) 2 , as used in SG2.
For this evaluation we produced difference maps of the PSM products (SG2, SPCG, PSP) vs. the gNATSGO gridded maps.
All except SG2 were upscaled to the SG2 resolution (250 m). The gNATSGO maps are assumed to be the most correct. We then comment on the differences and speculate on the causes, based on our knowledge of the PSM procedures used to make 365 each product and the nature of the soil landscape.
We selected an example 1 × 1 • test area, based on its well-known soil-forming factors and environments and our familiarity with the soil geography. For the pattern analysis within this area we selected a 0.20 × 0.20 • subtile and projected the maps to the UTM18N grid on the WGS84 datum (ESPG code 32618).

Local patterns 370
Local patterns are at the scale of geomorphic features such as hillslope catenas, fluvial terraces, outwash fans, valley trains and drumlin fields. "Local" in this context is on the order of 10 × 10 km. An appropriate resolution for this scale is ≈ 30 × 30 m.
Only gSSURGO and PSP map at this level; we included the other two products to see the effect of (unjustified) downscaling.
For this evaluation we used the same test areas as for the regional patterns, but examined smaller areas with distinctive soil-landscape relations. These were evaluated by two methods.

Quantitative method
This follows the procedures of the regional assessment, except that V-measures are not computed, due to the very fine pattern 390 of classified polygons.

Example area and soil property
To illustrate the method, we selected one area familiar to the first author, and an important soil property with strong spatial variability and pattern, namely pH in the 0-5 layer. We selected this property because in our experience this is often well- 5.1 Regional maps Table 1 shows the statistical differences between gNATSGO (reference) and the PSM products. All PSM products under-405 predict topsoil pH with respect to gNATSGO, by about 0.38-0.48 pH units. The RMSD is substantial also, on the order of 0.49-0.67 pH units, somewhat less than this when corrected for bias.  Figure 2 shows whole-map histograms. PSP has a bimodal distributions, and predicts few pH values around pH 5.8. This is quite strange since this value is well-represented in the gNATSGO map. The other distributions are fairly symmetric, although SG2 and SPCG are more even than gNATSGO, which is strongly concentrated near pH 6.2. This shows the smoothing effect 410 of the machine learning models. Figure 3 shows the pairwise Pearson correlations between the products. The products are overall well-correlated. SG2 and SPCG are very closely correlated, since they use similar mapping methods, despite the additional covariates used by SPCG.
PSP and gNATSGO are also closely-correlated. These correlations do not account for bias. They do however show that the maps are similar in their overall pattern as evaluated per-grid cell. 415 Figure 4 shows gNATSGO (reference) along with the predictions of pH of the PSP products. Figure 5 shows these as difference maps. These figures reveal substantial differences between products. The most obvious difference is in the detail of the spatial pattern. Despite having been upscaled to regional resolution, gNATSGO shows finer detail than the other products, especially PSP.
These figures also show the spatial distribution of the bias. Compared to gNATSGO, SG2 and SPCG underestimate pH in 420 the higher hills in the NE portion of the map, and in the glacio-lacustrine sediments along the lakeshores. SG2 misses the soils derived from Onondaga limestone glacial till towards the southern end of the till plain. SG2 has no information on parent material and uses global models. SPCG has very similar differences, despite using SSURGO-derived parent material as a covariate.
PSP predictions are closer to gNATSGO than are those of SG2, which is not surprising since PSP also uses gSSURGO as its 425 primary information source. This product has removed some of the fine variation of gNATSGO. However the disaggregation by DSMART results in quite some discrepancies with gNATSGO. In particular, the Homer-Tully outwash valley (northeast side of map) is under-predicted by one pH unit, and the surrounding hills over-predicted by almost as much. Many of the valley trains (southern side of map, running towards the Susquehanna River) are under-predicted. This is likely due to PSP' soil series

Uncertainty
The 5%, 50%, and 95% prediction quantile maps are shown in Fig. 6 (SG2) and 7 (PSP). The "low", "representative" and "high" values from gNATSGO are shown in Fig. 8 Each figure has its own stretch. Figure 9 shows the inter-quartile range 5-95% (IQR) for the two PSM products, along with the low-high range for gNATSGO.

435
SG2 has a fairly consistent IQR, mostly from about 2.5 to 3.5 pH, whereas PSP has a much wider range of uncertainties, mostly from about 1.5 to 4.5 pH, and shows much more spatial pattern. PSP has the widest ranges on the steep valley sides and especially in the Seneca Army Depot at the north inter-lake area, and the lowest on the broad till plains and through valleys.
These are wide ranges, and although an honest reflection of the PSM models, should give pause to map users. This suggests Figure 3. Pearson correlations between all products, pH, 0-5 cm that the GlobalSoilMap specifications for uncertainty (Arrouays et al., 2014) are unduly pessimistic. Sources for uncertainty 440 assessment (SG2: training points and global covariates, PSP: mapped soil series and national covariates) and the different machine learning methods lead to greatly different estimates of prediction uncertainty. gNATSGO has narrower ranges than the two PSM products and by design does not include unrealistic values.

Local spatial autocorrelation
The local variograms and their fitted exponential models are shown in Fig. 10. Table 2  gNATSGO has the shortest effective range. This indicate fine-scale structure at 250 m resolution. The PSM products have longer ranges, showing that these models do not capture well fine-scale variation. These show a smoothing effect, likely due Figure 5. Difference between gNATSGO and PSM products, pHx10, 0-5 cm Figure 6. Quantiles of the prediction, SG2, pHx10, 0-5 cm to spatial continuity in the covariates. PSP has a long range and low sill, due to the harmonization from DSMART. The low proportional nuggets are due to the relatively coarse resolution. Figure 11 shows the topsoil pH classified into eight histogram-equalized classes in the 0.2 x 0.2 • sub-tile. The higher pH are shown in green, the lower in red. Class limits are approximately 5. 01, 5.14, 5.27, 5.40, 5.54, 5.71, and 6.02    class, covering the highest hills, whereas SG2 presents a more nuanced view.  Table 3. V-measure statistics, pHx10 0-5 cm Figure 12. Homogeneity (left) and Completeness (right) of the SG2 pH class map, with respect to gNATSGO pH class map, 0-5 cm Figure 12 shows the computed homogeneity and completeness of the SG2 pH class map, with respect to the gNATSGO pH class map. In the yellow areas of the homogeneity map one, one gNATSGO predicted class is contained in the SG2 region; in the blue areas many are. Overall the agreement is fairly good. Table 4 shows the statistics from the landscape metrics calculations. The mean fractal dimensions are almost identical. There 465 is quite some range of aggregations, with SPCG most aggregated, i.e., least complex. Otherwise the results are inconsistent; all we can say is that the map patterns vary considerably among products.  Table 5. Jensen-Shannon distance beween co-occurence vectors co-occurence vectors of the four products. The co-occurence patterns of SG2 is quite similar to that of the other products, whereas gNATSGO is quite different than PSP.

Landscape metrics
6 Local spatial patterns 470 The main interest here is to see how well PSM methods at relatively fine resolution reproduce known relations at the local geomorphic level, e.g., hillslopes, transects across valleys with multiple terrace levels, and within farms. It has been claimed that PSM at 30 m resolution is sufficient for management of, or even within, individual farm fields. This is the only PSM product which predicts at this resolution.
We examine this first qualitatively, i.e., by visual inspection, and then quantitatively, mostly following the methods of the 475 regional assessment.

Qualitative assessment
Here we use silt concentration, as it reveals stronger qualitative discrepancies than pH in this test area. Figure 13  on the hilltops. These differences are because the predicted silt concentrations are taken from the official series descriptions.

485
PSP follows the map unit lines fairly well, but is much finer-grained; each 30 m pixel is separately predicted. This results in some smoothing of the abrupt boundary lines from gSSURGO on the hilltops. However within some SSURGO map units PSP predicts quite some differences in topsoil silt concentration. These are map units with contrasting components, which PSP attempts to disaggregate according to their correlation with covariates. For the most part these do not seem to be related to terrain or land use.

490
For example, Fig. 14 shows detail of the Holly-Papakting map unit within this PSP window. This map unit has two contrasting soils in similar proportions: a mineral alluvial soil (Holly series) and an organic soil (Papakting series); the second has much lower silt concentration. It is difficult to see the reason for the pattern within this map unit. PSP has placed the component series in their proper proportions but not according to any landscape feature. To see the fine differences at this high resolution, we consider a 0.15×0.15 • subtile with lower-right corner −76.30 • E, 42.45 • N and evaluate pH, as in the regional assessment ( §5). Table 6 shows the statistical differences between gSSURGO (reference) and the PSM products, along with the predictions of pH. Figure 15 shows the pairwise Pearson correlations between the maps. These results are comparable to those for the full tile at regional resolution: both SG2 and PSP under-predict pH by about 0.35-0.45 pH. Correlations are fairly strong between 500 PSP and gSSURGO, and between SG2 and PSP, but weak between SG2 and gSSURGO.  Table 6. Statistical differences between gSSURGO and PSM products, pHx10, 0-5 cm. Centre of map −76 • 30 30"E, 42 • 52 30"N Figure 15. Pearson correlations between local products, pH, 0-5 cm Figure 16 shows gSSURGO (reference) along with the predictions of pH by the PSP products. Figure 17 shows these as difference maps. Clearly, gSSURGO has overall higher values than the other two products, and despite the fine resolution, has in general large areas of identical values. The differentiation between map units follows sharp boundaries even within a single landscape (e.g., the plateau towards the S of the map), and this is likely an artefact of relying on the representative profiles in 505 the official series descriptions for property values. PSP has a finer pattern, due to disaggregation, and shows a more realistic local pattern by smoothing out the sharp boundaries between map units within a landscape. PSP shows large areas of low pH. SG2 does not follow well the landscape lines, especially the sharp boundaries between uplands and valleys, and predicts very low pH (≈ 4.5) on the plateau. It is difficult to recognize local landscape units in this global product.
6.2.1 Class maps 510 Figure 18 shows the topsoil pH classified into eight histogram-equalized classes. Class limits in this area are approximately 5. 30, 5.44, 5.55, 5.61, 5.74, 5.89, and 6.15 pH, with the extreme values of 4.44 and 7.00 pH. SG2 clearly is less detailed than the other two products. PSP shows a fine pattern, not closely related to the fine pattern of gSSURGO. As previously noted, gSSURGO is consistently about one pH class higher than the other products.  The local variograms and their fitted exponential models are shown in Fig. 19. Table 7 shows their statistics. gSSURGO has the shortest effective range and highest sill. PSP has a longer range and low sill, due to the harmonization from DSMART that removes some of the overall variability. SG2 has no nugget variance, a low sill, and long range, consistent with its regional scale.  Table 9. Jensen-Shannon distance beween co-occurence vectors (local) 6.2.3 Landscape metrics 520 Table 8 shows the statistics from the landscape metrics calculations. The mean fractal dimensions are almost identical. SG2 is much more aggregated, i.e., least complex, than gSSURGO or PSP. PSP has a higher landscape shape and Shannon diversity than the other products. Table 5 shows the Jensen-Shannon distance beween co-occurence vectors of the four products. The co-occurence patterns of SG2 is somewhat similar to that PSP but quite different from gSSURGO.

525
The presented methods are well-able to expose differences between PSM mapping methods, and between these and field survey.
It is clear from the "best case" example presented in this paper that different PSM methods, with different training points, covariates, and algorithms, can produce quite different predictive soil maps. Comparing maps with point-wise evaluation from (almost always biased) field observations gives an incomplete picture of how the different methods represent the soil landscape, which is after all what dictates how the soil is used and managed. The main findings from the example case are: 530 1. Although the regional products (250 m resolution) are well-correlated, the PSM products are biased, under-predicting topsoil pH by about 0.38-0.48 pH units. They also differ substantially, with a RMSD adjusted for bias on the order of 0.31-0.48 pH. This is based on representative pH values of the mapped soil series, not on measured values.
2. The PSM products differ substantially among themselves and with the reference product in their local spatial pattern, as revealed by empirical variograms. gNATSGO has a short effective range, but this is smoothed to a range 2 to 3.5 times 535 as long by PSM.
3. Classification by histogram equalization reveals major differences in the spatial patterns of the produced class maps, as evaluated both by visual inspection and landscape metrics.
4. Despite using USA-specific covariates (parent material, drainage classes) derived from gSSURGO and covariates limited in geographic scope to the USA, the predictive map made by SPCG is not substantially different from that made by SG2, 540 likely due to the similar modelling method.
5. The estimates of uncertainty provided by SG2 and PSP are substantially different, both in width of the uncertainty interval (much narrower in SG2) and in spatial pattern. The confidence intervals seem unrealistically wide compared to the expert-derived high-low value range provided by gSSURGO. 6. At the local level (30 m resolution) the disaggregation provided by PSP does not appear to correspond to landscape 545 positions associated with STU components. PSP obscures the fine-scale details of the local spatial pattern, and SG is substantially more general, due to its resolution. These results will differ in different soil geographic regions, for different soil properties, and for different depth slices, as shown in the companion Case Studies report.
Our methods have two limitations due to the decision that they be applicable, using the supplied computer code, to any area 550 within the USA. The first is the use of histogram equalization for the class maps which are then evaluated for the class pattern.
For specific areas and properties it would be preferable to use established class limits relevant for land use, for example limits from soil survey interpretation tables. The second is the choice of the exponential model for automatic variogram fitting, as well as the somewhat arbitrary choice of empirical variogram cutoff and bin width. For each area, property and depth slice variograms could be computed and fit according to the analysts' prior knowledge.

555
Despite the discrepencies between PSM products and field survey, PSM can be a valuable tool for soil survey. It has the advantage of being reproducible and objective, given a set of training points, relevant environmental covariates, and a PSM method. Therefore, its output should be examined and compared with field-based maps to identify possible improvements, especially in areas with difficult field access or complex, difficult to interpret soil-landscape patterns. For unsurveyed areas PSM provides a useful pre-map for planning sampling and field survey. There is no substitute for actually examining the soil 560 and landscape, but despite the issues revealed in this study, PSM can be an important aid to the soil surveyor.
Code availability. Source code as R Markdown documents are available at https://github.com/ncss-tech/compare-psm. These can be used to (1) import all products to compare, as well as some others not considered in this study; (2) create ground overlays and corresponding KML files for display in Google Earth; (3) compare SG2 and PSP for 1 × 1 • tiles; (4) compare SG2 with SPCG and gNATSGO for any rectangular tile; (5) compute landscape metrics and compare them between products for any subtile of these; (6) evaluate the success of PSP