Zonal Statistics computes summary statistics for raster band values within zone polygons. Each row in the output DataFrame represents statistics for a zone and raster band.
Usage notes
- Zonal Statistics requires an input DataFrame containing rasters and a zone DataFrame containing polygon geometries. The tool calculates statistics for pixels that fall inside each zone. A pixel is included in the statistic calculation if the center of the pixel is contained within a zone polygon.
- Use `.setZones()` to specify the DataFrame containing the zone geometries, and `.setZoneIdColumns()` to specify one or more columns in the zone DataFrame that identify zones. The specified columns will be included in the output DataFrame as the zone IDs.

  When multiple polygon features share the same zone ID, raster pixels from all associated polygons are included in the statistics, and the statistics are returned per zone ID.

  If the zone ID columns are not specified, the tool generates a unique zone ID for each polygon geometry in the zone DataFrame.
- Use `.includeZoneGeometry()` to specify whether the zone geometries are included in the output DataFrame. When set to `True`, the output DataFrame includes the geometry column representing each zone. When not set or set to `False`, zone geometry is not included in the output DataFrame, which can improve performance if geometry is not needed.

  When multiple polygon features share the same zone ID, the returned geometry is a multipart polygon composed of the original zone features. The result statistics, such as count, sum, and mean, reflect the combined contribution of all pixels from those polygons. The tool does not dissolve the zone geometries with the same zone ID.
- If the raster and zone geometries have different spatial references, the tool transforms the raster to match the zone geometries. For better performance, put both the raster and the zone geometries in the same coordinate system before running the tool.
- You can specify one or more band IDs to summarize from the input raster with `.setBandIds()`. Band IDs are 1-based. By default, all bands of the raster are used for statistical calculations.
- The supported statistic types depend on the pixel type of the input raster and the statistics calculation type.

  By default, the tool calculates arithmetic statistics, which are listed in the following table.
| Pixel type | Count | Minimum | Maximum | Range | Mean | Standard deviation | Sum | Median | Percentile | Variety | Majority | Majority count | Majority percentage | Minority | Minority count | Minority percentage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Integer | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Float | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | No | No | No | No | No |
- You can optionally use `.setCircularWrap(low, high)` to enable circular statistic calculation. Circular statistics can be used for directional or cyclic variables (for example, aspect or wind direction), where values wrap around at the range boundary.

  The circular statistic types depend on the pixel type. If pixel values are uniformly distributed across the circular range, the circular mean and standard deviation will be `NULL`.
| Count | Minimum | Maximum | Range | Mean | Standard deviation | Sum | Median | Percentile | Variety | Majority | Majority count | Majority percentage | Minority | Minority count | Minority percentage | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Integer pixel type | ||||||||||||||||
| Float pixel type |
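To illustrate why cyclic values need different handling, the following is a minimal sketch of a circular mean and standard deviation over a wrap range. It uses the textbook definition based on the mean resultant vector; the function name `circular_mean_std` and the choice of formula are assumptions for illustration, not the tool's documented implementation.

```python
import math

def circular_mean_std(values, low, high):
    """Return (mean, std) for cyclic values on [low, high), or (None, None).

    Hypothetical helper: uses the mean resultant vector, a common textbook
    definition. The tool's exact formula is not stated in this document.
    """
    if not values:
        return None, None
    span = high - low
    # Map each value to an angle in radians on the unit circle.
    angles = [2 * math.pi * (v - low) / span for v in values]
    c = sum(math.cos(a) for a in angles) / len(angles)
    s = sum(math.sin(a) for a in angles) / len(angles)
    # Mean resultant length, clamped to 1 to absorb rounding error.
    r = min(math.hypot(c, s), 1.0)
    if r < 1e-9:
        # The directions cancel out, so no mean direction is defined.
        return None, None
    mean_angle = math.atan2(s, c) % (2 * math.pi)
    mean = low + (mean_angle * span / (2 * math.pi)) % span
    std = math.sqrt(-2 * math.log(r)) * span / (2 * math.pi)
    return mean, std

# Directions of 350 and 10 degrees average to a value at the 0/360 boundary,
# unlike their arithmetic mean of 180.
print(circular_mean_std([350, 10], 0, 360))

# Evenly spread directions have no meaningful mean direction.
print(circular_mean_std([0, 90, 180, 270], 0, 360))
```

The second call returns `(None, None)`, which mirrors the `NULL` result described above for uniformly distributed values.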
- For majority and minority calculations, when there is a tie, the output may be any one of the tied values.
- Percentile statistics are optional and can be enabled using `includePercentiles()` and `setPercentileValue()`. If neither is set, percentiles will not be calculated. If `includePercentiles(True)` is called, the tool calculates the median and the 90th percentile using the pixel values.

  The percentile value can be specified using `setPercentileValue()` with any value from 0 to 100. For example, if you specify a percentile value of 25, the tool returns the 25th percentile value of pixel values that fall within each zone.
- Percentile values are computed using an approximate quantile implementation based on the Greenwald-Khanna algorithm with additional optimizations for performance. As a result, percentile statistics produced by the Zonal Statistics tool may differ slightly from results computed by RT_ZonalStatistics, which uses an exact quantile calculation. This approach is designed to provide efficient and scalable performance for large datasets while still delivering near-accurate percentile estimates.
- Pixels with `NoData` values are excluded from all statistics calculations.
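The pixel inclusion rule and the NoData handling described in the usage notes can be sketched in plain Python. This is an illustration only: the helper names, the grid layout (row 0 at the top, origin at the upper-left corner), and the NoData value of -9999 are assumptions, and the tool's internal implementation is not described in this document.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: is (x, y) inside the ring given as [(x, y), ...]?"""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def zone_pixel_values(grid, origin_x, origin_y, cell_size, ring, nodata):
    """Collect values of pixels whose centers fall inside the ring.

    Hypothetical helper: grid[row][col] with row 0 at the top, and
    (origin_x, origin_y) as the upper-left corner of the raster.
    """
    values = []
    for row, line in enumerate(grid):
        for col, value in enumerate(line):
            if value == nodata:
                continue  # NoData pixels never contribute to statistics
            center_x = origin_x + (col + 0.5) * cell_size
            center_y = origin_y - (row + 0.5) * cell_size
            if point_in_polygon(center_x, center_y, ring):
                values.append(value)
    return values

grid = [
    [1, 2, 3],
    [4, -9999, 6],
    [7, 8, 9],
]
# The zone spans x from 0 to 1.6, so it only partly covers the middle column,
# but the middle column's pixel centers (x = 1.5) are still inside it.
zone = [(0, 0), (1.6, 0), (1.6, 3), (0, 3)]
values = zone_pixel_values(grid, 0.0, 3.0, 1.0, zone, nodata=-9999)
print(values)  # [1, 2, 4, 7, 8]
```

The right-hand column is skipped because its pixel centers (x = 2.5) lie outside the zone, and the NoData pixel is skipped even though its center is inside.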
Results
Zonal Statistics returns a DataFrame with one row for each unique combination of zone ID and band ID.
The output DataFrame includes the zone ID columns specified by `.setZoneIdColumns()`, or a generated zone ID if none are specified.
Statistics are returned in a wide format, where each statistic is represented as a separate column.
| Field | Description |
|---|---|
| Zone | The zone identifier. If one or more zone ID columns are specified, the output includes the provided columns. Otherwise, a Zone column is generated for each zone geometry. |
| BandID | The raster band ID (1-based). |
| Count | The number of raster pixels included in the zone. |
| Min | The minimum pixel value among all pixels in the zone. |
| Max | The maximum pixel value among all pixels in the zone. |
| Range | The difference between the maximum and minimum pixel value in the zone. |
| Mean | The mean pixel value among all pixels in the zone. |
| Stdev | The standard deviation of all pixels in the zone. |
| Sum | The total value of all pixels in the zone. |
| Variety | The number of unique pixel values among all pixels in the zone. It will be NULL if the pixel type is float. |
| Majority | The pixel value that occurs most often among all pixels in the zone. It will be NULL if the pixel type is float. |
| MajorityCount | The number of pixels that contain the majority value in the zone. It will be NULL if the pixel type is float. |
| MajorityPercent | The percentage of pixels that contain the majority value in the zone. It will be NULL if the pixel type is float. |
| Minority | The pixel value that occurs least often among all pixels in the zone. It will be NULL if the pixel type is float. |
| MinorityCount | The number of pixels that contain the minority value in the zone. It will be NULL if the pixel type is float. |
| MinorityPercent | The percentage of pixels that contain the minority value in the zone. It will be NULL if the pixel type is float. |
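As a rough illustration of how these fields relate to the pixel values of a single zone and band, the sketch below computes them in plain Python. The function `zone_statistics` is hypothetical, the column names follow the example output later on this page, and two details are assumptions: the population form of the standard deviation and percentages expressed on a 0 to 100 scale.

```python
from collections import Counter
import statistics

def zone_statistics(values):
    """Compute the wide-format statistics for one zone's integer pixel values."""
    counts = Counter(values)
    ranked = counts.most_common()  # ordered by descending frequency
    majority, majority_count = ranked[0]
    minority, minority_count = ranked[-1]
    n = len(values)
    return {
        "Count": n,
        "Min": min(values),
        "Max": max(values),
        "Range": max(values) - min(values),
        "Mean": sum(values) / n,
        "Stdev": statistics.pstdev(values),  # assumption: population form
        "Sum": sum(values),
        "Variety": len(counts),
        "Majority": majority,
        "MajorityCount": majority_count,
        "MajorityPercent": 100.0 * majority_count / n,  # assumption: 0-100 scale
        "Minority": minority,
        "MinorityCount": minority_count,
        "MinorityPercent": 100.0 * minority_count / n,
    }

stats = zone_statistics([3, 3, 3, 5, 5, 7])
for name, value in stats.items():
    print(name, value)
```

For this input the majority value is 3 (three of six pixels) and the minority value is 7 (one of six pixels). When several values tie, `Counter.most_common` returns them in first-seen order, which is just one possible way to pick among tied values.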
If percentile statistics are enabled, the output DataFrame will also include the following fields:
| Field | Description |
|---|---|
| Median | The median pixel value among all pixels in the zone. |
| Percentile | The percentile value specified by `.setPercentileValue()`. If not specified, the 90th percentile is calculated by default. |
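To make the meaning of a percentile value concrete, here is a small sketch of an exact percentile using the nearest-rank definition. It is illustrative only: the tool computes approximate percentiles, the helper `nearest_rank_percentile` is hypothetical, and other percentile definitions (for example, linear interpolation) give slightly different results.

```python
import math

def nearest_rank_percentile(values, p):
    """Exact p-th percentile (0 <= p <= 100) using the nearest-rank definition."""
    if not values:
        return None
    ordered = sorted(values)
    # The rank is the smallest position that covers at least p percent of the values.
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

pixels = [12, 15, 11, 20, 18, 14, 16, 19]
print(nearest_rank_percentile(pixels, 25))  # 12
print(nearest_rank_percentile(pixels, 90))  # 20
```

With eight pixel values, the 25th percentile is the second smallest value and the 90th percentile is the largest one under this definition.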
If `.includeZoneGeometry(True)` is called, the output DataFrame includes the following field:
| Field | Description |
|---|---|
| zone_geometry | The geometry of the zone. |
Performance notes
Improve the performance of Zonal Statistics by doing one or more of the following:
- Exclude percentile calculation if percentiles are not required for the analysis.
- Exclude zone geometries from the output DataFrame.
- Restrict the analysis to specific raster bands using `.setBandIds()`.
- Use unique zone IDs when possible. While duplicate zone IDs are supported, unique zone IDs avoid additional geometry grouping and multipart geometry handling, which can improve execution efficiency.
- When the raster and zone geometries have different spatial references, the raster is transformed to match the zone geometries at runtime. Transforming the inputs to the same spatial reference before running the tool can reduce processing overhead.
- Only analyze the records in your area of interest. You can pick the records of interest by using one of the following SQL functions:
  - ST_Intersection: Clip to an area of interest represented by a polygon. This will modify your input records.
  - ST_BboxIntersects: Select records that intersect an envelope.
  - ST_EnvIntersects: Select records having an envelope that intersects the envelope of another geometry.
  - ST_Intersects: Select records that intersect another dataset or area of interest represented by a polygon.
Similar capabilities
Syntax
For more details, go to the GeoAnalytics Engine API reference for zonal statistics.
| Setter (Python) | Setter (Scala) | Description | Required |
|---|---|---|---|
| `setZones()` | `setZones()` | Sets the zone DataFrame containing the zone polygons and attributes. | Yes |
| `setRasterColumn()` | `setRasterColumn()` | Sets the raster column in the rasters DataFrame. The raster pixel values are used to calculate zonal statistics for each zone geometry. | Yes |
| `setZoneIdColumns()` | `setZoneIdColumns()` | Sets one or more zone ID columns from the zone DataFrame. If not specified, the tool generates a zone ID for each of the zone geometries in the zone DataFrame. | No |
| `setBandIds()` | `setBandIds()` | Sets the IDs of one or more raster bands used to calculate zonal statistics. Band IDs are 1-based. If not specified, the tool calculates zonal statistics for all bands in the input raster. | No |
| `setCircularWrap()` | `setCircularWrap()` | Sets the circular wrap values to enable circular statistic calculation. | No |
| `includePercentiles()` | `includePercentiles()` | Sets whether the output includes percentiles. When set to True, the output DataFrame includes percentile statistics columns, including the median and the percentile. | No |
| `setPercentileValue()` | `setPercentileValue()` | Sets the percentile value to compute. The tool includes percentile results when either `includePercentiles()` is set to True or this setter is called with a custom percentile value. | No |
| `includeZoneGeometry()` | `includeZoneGeometry()` | Sets whether the output includes the zone geometry column. When set to True, the output DataFrame includes a geometry column representing the zone geometries associated with each zone ID. If not set or set to False, the output does not include the zone geometry column. | No |
| `run(dataframe)` | `run(rasters)` | Runs the Zonal Statistics tool using the input raster DataFrame. | Yes |
Examples
Run Zonal Statistics
# Imports
from geoanalytics.tools import ZonalStatistics
from geoanalytics.sql import functions as ST
# Path to the US Annual Average Wind Speed image service
raster_path = "https://tiledimageservices.arcgis.com/P3ePLMYs2RVChkJx/arcgis/rest/services/US_Annual_Average_Wind_Speed/ImageServer"
raster_df = spark.read.format("image-service").load(raster_path)
# Path to the US County layer
zones_path = "https://services.arcgis.com/P3ePLMYs2RVChkJx/arcgis/rest/services/USA_Counties_Generalized_Boundaries/FeatureServer/0"
zones_df = spark.read.format("feature-service").load(zones_path) \
    .withColumn("shape", ST.transform("shape", 102100))
# Use Zonal Statistics to summarize annual wind speed into county zones
result = ZonalStatistics() \
    .setZones(zones_df) \
    .setZoneIdColumns("STATE_NAME", "NAME") \
    .setRasterColumn("raster") \
    .includeZoneGeometry(True) \
    .run(raster_df)
# Show the first 5 rows of the result DataFrame
result.sort("STATE_NAME", "NAME").show(5)

+----------+--------------+------+-----+------------------+------------------+------------------+------------------+-------------------+------------------+-------+--------+-------------+---------------+--------+-------------+---------------+--------------------+
|STATE_NAME| NAME|BandID|Count| Min| Max| Range| Mean| Stdev| Sum|Variety|Majority|MajorityCount|MajorityPercent|Minority|MinorityCount|MinorityPercent| zone_geometry|
+----------+--------------+------+-----+------------------+------------------+------------------+------------------+-------------------+------------------+-------+--------+-------------+---------------+--------+-------------+---------------+--------------------+
| Alabama|Autauga County| 1| 552| 2.087801456451416|3.3896076679229736|1.3018062114715576| 2.590863266284918| 0.2431685697847177|1430.1565229892747| NULL| NULL| NULL| NULL| NULL| NULL| NULL|{"rings":[[[-9664...|
| Alabama|Baldwin County| 1| 1474|2.1721317768096924| 6.154635429382324| 3.982503652572632|2.9251511904083722| 0.5298230901258558| 4311.672854661941| NULL| NULL| NULL| NULL| NULL| NULL| NULL|{"rings":[[[-9793...|
| Alabama|Barbour County| 1| 806|2.0778417587280273|3.1731228828430176|1.0952811241149902|2.5632828003715336|0.23228035438458972| 2066.005937099456| NULL| NULL| NULL| NULL| NULL| NULL| NULL|{"rings":[[[-9544...|
| Alabama| Bibb County| 1| 590|2.0551438331604004|3.0911967754364014| 1.036052942276001| 2.530248590647161|0.21468073344232885| 1492.846668481825| NULL| NULL| NULL| NULL| NULL| NULL| NULL|{"rings":[[[-9731...|
| Alabama| Blount County| 1| 628|1.9109033346176147| 5.404839038848877| 3.493935704231262| 2.751029093174417| 0.5631666218622156|1727.6462705135339| NULL| NULL| NULL| NULL| NULL| NULL| NULL|{"rings":[[[-9681...|
+----------+--------------+------+-----+------------------+------------------+------------------+------------------+-------------------+------------------+-------+--------+-------------+---------------+--------+-------------+---------------+--------------------+
only showing top 5 rows

Plot results
# Plot the county-level mean annual wind speed across the United States
result_plot = result.st.plot(cmap_values="Mean",
                             legend=True,
                             legend_kwds={"orientation": "horizontal", "location": "bottom", "shrink": 0.7, "pad": 0.08},
                             figsize=(14, 8),
                             basemap="light")
result_plot.set_title("County-level mean annual wind speed across the United States")
Version table
| Release | Notes |
|---|---|
| 2.0.0 | Python tool introduced |
| 2.0.0 | Scala tool introduced |