Stratification Process


This document outlines the workflow developed for the SoilStack stratification service to optimally locate soil carbon samples across agricultural fields to achieve a set of samples that best represent potential variability in soil carbon content. It is based on previous work by researchers in the digital soil mapping field. This research demonstrates that using these ancillary data considered predictive of spatial variability in soil carbon to select a series of points across an area can help to more efficiently achieve a representative sample with a limited sampling effort as opposed to simple random sampling.

Briefly, the process entails collecting remote sensing and digital soil mapping data clipped to a target field, processing that data for use, and implementation of k-means clustering and the conditioned Latin Hypercube Sampling (cLHS) approach for determination of sample coordinates. Key inputs and data processing steps are detailed below.

Data Inputs

Normalized differential vegetation index (NDVI)

  • Source: Sentinel-2, Level 1C Top of Atmosphere Reflectance
  • Resolution: 10 m
  • Notes:
    • Retrieved from S2
    • Retrieved all available tiles for target range within most recent full calendar year
    • Used the S2 QA60 layer of corresponding tile for cloud masking
      • Workflow drawn from here:
      • Filtered to subset of images in which cloud cover was 30% or less
      • In remaining images, masked out pixels obscured by clouds (Band 10) or cirrus (Band 11)
    • NDVI calculated for each remaining image as the normalized difference between Band 8 and Band 4.
    • A composite ‘greenest’ pixel image was then created by selecting the maximum NDVI for each pixel location across the year of interest.


  • Source (USA): USGS National Elevation Dataset
    • Resolution: ⅓ arc second
    • Notes:
    • Elevation data retrieved from AWS and converted from WGS84 to NAD83 10m resolution
    • Slope calculated using RasterModelGrid calc_slope_at_node function.
  • Source (global): Copernicus GLO-30
    • Resolution: 30m
    • Notes:
      • Elevation data retrieved from AWS and expanded into 10m resolution using the bilinear resampling algorithm in the rio.warp.Resampling python package
      • Slope calculated using RasterModelGrid calc_slope_at_node function.

Soil type

  • Source: SSURGO (USA)
  • Notes
    • Relevant state level geodatabases retrieved and locally downloaded
    • Map Unit polygon (MUPOLYGON) layer extracted from geodatabase via API through python
    • Extract ‘om_r’ and ‘claytotal_r’ layers
    • Map Unit symbol (MUSYM) data joined to MUPOLYGON data and converted into a 10 m resolution raster in the NAD83 projection
    • MUSYM is a string representing the soil type of each map unit.
    • Converting these data to raster format also required converting them from a string format to numerical format. Discovered an error after stratification wherein some MUSYM values were unsuccessfully converted, resulting in NA values.

Layer Processing

Data inputs including all those described above are resampled to match the highest resolution of all layers included (10 m) using the bilinear interpolation technique from the rasterio python package then clipped to the target field boundary. A 10-30m buffer is applied to the field boundary based on field size to exclude field edges and avoid potential edge effects. These layers are then converted into a dataframe using the pandas package in python.


K-Means Clustering

This dataframe is then passed into the sklearn KMeans python package with 2-20 clusters. Each clusters’ wss (Within-Cluster-Sum) score is plotted to determine the “elbow” of the graph, where the wss score reduction flattens as the strata number increases. This is done to find the optimum strata number that has the highest reduction in wws score with the lowest strata number. Once the optimum cluster number is determined, the KMeans cluster with that number of strata’s values are added as a column to the layers dataframe.

Sampling Point Locations - CLHS

To generate a set of sampling points for soil carbon stock quantification, we load each of the strata’s data layers into the cLHS algorithm available through the ‘cLHS’ package in python. Using this methodology we are able to find the optimum sampling points to best capture the variability in each strata.

The number of samples is based on the sample densities requirements of each individual project.


  1. J.J. de Gruijter et al., “Farm-Scale Soil Carbon Auditing,” Geoderma 265 (March 2016): 120–30,; Brendan Malone et al., “Auditing On-Farm Soil Carbon Stocks Using Downscaled National Mapping Products: Examples from Australia and New Zealand,” Geoderma Regional 13 (June 1, 2018): 1–14,
  2. de Gruijter et al., “Farm-Scale Soil Carbon Auditing”; Budiman Minasny and Alex B. McBratney, “A Conditioned Latin Hypercube Method for Sampling in the Presence of Ancillary Information,” Computers & Geosciences 32, no. 9 (November 1, 2006): 1378–88,