Spatial joins

GeoAnalytics Engine includes a spatial join that combines rows from two DataFrames based on a spatial relationship. This join generates spatial indices for your data automatically at runtime to improve performance. You can perform a spatial join with either Spark SQL or the Spatiotemporal Join tool. Both options utilize the same underlying spatial join implementation included with GeoAnalytics Engine. Differences between the two options and examples of when you might use one versus the other are detailed below.

Spatial joins in Spark SQL

To perform a spatial join with Spark SQL, you must use a GeoAnalytics Engine SQL function in the join condition expression. For example, if you want to join two DataFrames based on where one DataFrame's geometries are within the other, you could use ST_Within in the join condition expression. The following example shows joining a DataFrame of wind turbine locations to a DataFrame of burn area polygons using the Spark SQL join. Because a GeoAnalytics Engine SQL function is part of the join condition, spatial indices are generated automatically and a spatial join is performed.

SQL
Use dark colors for code blocksCopy
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
burn_areas_url = "https://services.arcgis.com/P3ePLMYs2RVChkJx/arcgis/rest/services/MTBS_Polygons_v1/FeatureServer/0"
turbines_url = "https://services.arcgis.com/P3ePLMYs2RVChkJx/arcgis/rest/services/US_Wind_Turbine_Database/FeatureServer/0"

spark.read.format("feature-service").load(burn_areas_url) \
     .withColumn("polygon", ST.transform("shape", 102007)) \
     .createOrReplaceTempView("burn_areas")
spark.read.format("feature-service").load(turbines_url) \
     .withColumn("point", ST.transform("shape", 102007)) \
     .createOrReplaceTempView("turbines")

spark.sql(
    """
    SELECT COUNT(*) AS `Count of turbines in burn areas`
    FROM turbines JOIN burn_areas ON
    ST_Within(turbines.point, burn_areas.polygon)
    """
).show()
Result
Use dark colors for code blocksCopy
1
2
3
4
5
+-------------------------------+
|Count of turbines in burn areas|
+-------------------------------+
|                           4812|
+-------------------------------+

You can also perform the same spatial join with Python only using pyspark.sql.DataFrame.join, as shown in the following example.

Python
Use dark colors for code blocksCopy
1
2
3
4
5
6
7
8
9
10
11
12
13
14
burn_areas_url = "https://services.arcgis.com/P3ePLMYs2RVChkJx/arcgis/rest/services/MTBS_Polygons_v1/FeatureServer/0"
turbines_url = "https://services.arcgis.com/P3ePLMYs2RVChkJx/arcgis/rest/services/US_Wind_Turbine_Database/FeatureServer/0"

burn_areas = spark.read.format("feature-service").load(burn_areas_url) \
                  .select(ST.transform("shape", 102007).alias("polygon"))
turbines = spark.read.format("feature-service").load(turbines_url) \
                .select(ST.transform("shape", 102007).alias("point"))

result = turbines.join(burn_areas, ST.within(turbines.point, burn_areas.polygon))

print("Count of turbines in burn areas:")
result.count()
Result
Use dark colors for code blocksCopy
1
2
Count of turbines in burn areas:
4812

You can verify that a spatial join is being used by Spark SQL by calling explain on the join result and looking for SpatialJoin in the physical plan string, as shown below.

Python
Use dark colors for code blocksCopy
1
2
3
result.explain()
Result
Use dark colors for code blocksCopy
1
2
3
4
5
6
== Physical Plan ==
SpatialJoin point#928: point, polygon#868: polygon, SpatialXYRelationship(Within), Default, false
:- *(1) Project [ST_Transform(shape#898, in=WGS_1984_Web_Mercator_Auxiliary_Sphere:3857, out=Hawaii_Albers_Equal_Area_Conic:102007, datum_transform="WGS_1984_(ITRF00)_To_NAD_1983") AS point#928]
:  +- BatchScan Wind Turbines[shape#898] FeatureServiceLayerScan[f=json, query="resultType=standard&returnZ=true&where=1=1&returnM=true&outFields=", paging=oid-range(field=FID,size=32000,count=3)] RuntimeFilters: []
+- *(2) Project [ST_Transform(shape#857, in=GCS_North_American_1983:4269, out=Hawaii_Albers_Equal_Area_Conic:102007) AS polygon#868]
   +- BatchScan MTBS_Polygons[shape#857] FeatureServiceLayerScan[f=json, query="resultType=standard&returnZ=true&where=1=1&returnM=true&outFields=", paging=oid-range(field=OBJECTID,size=4000,count=8)] RuntimeFilters: []

Spatiotemporal Join tool

The Spatiotemporal Join tool offers similar functionality to spatially joining with Spark SQL but also supports joining on temporal relationships and attribute relationships. See the tool documentation for more information. The table below summarizes some key differences between performing spatial joins with PySpark or Spark SQL vs. the Spatiotemporal Join tool.

Spark SQL joinSpatiotemporal Join tool
UsageCan be called with SQL or Python syntax.Available in Python only as a tool.
Spatial relationshipsUse one or more of 9 SQL functions to define a spatial relationship. Using the following functions in your join condition expression will trigger a spatial join:
  • ST_Contains
  • ST_Crosses
  • ST_DWithin
  • ST_Equals
  • ST_Intersects
  • ST_Overlaps
  • ST_Relate
  • ST_Touches
  • ST_Within
Choose from one of 9 spatial relationships. These relationships are evaluated using the primary geometry column of each DataFrame. Available spatial relationships include:
  • Contains
  • Crosses
  • Equals
  • Intersects
  • NearGeodesic
  • NearPlanar
  • Overlaps
  • Touches
  • Within
Temporal relationshipsUse unix timestamps and Spark SQL to define temporal relationships.Choose from one of 14 temporal relationships. These relationships are evaluated using the primary time column(s) of each DataFrame. Available temporal relationships include:
  • Contains
  • During
  • Equals
  • Finishes
  • Finished by
  • Intersects
  • Meets
  • Met by
  • Near
  • Near after
  • Near before
  • Overlaps
  • Overlapped by
  • Starts
  • Started by
Attribute relationshipsUse Spark SQL to define an attribute relationship in your join expression.Use ArcGIS Arcade to define an attribute relationship.

Performance considerations & best practices

  • When using ST_DWithin or the Near spatial relationships, transform your data to a projected spatial reference that preserves distances between geometries for the most accurate results. For more information see Choosing a projected coordinate system.
  • When using ST_DWithin or the Near spatial relationships, a larger near distance will be slower. To improve performance of near joins across large distances, increase the ratio of memory to cores in your Spark cluster.
  • Spatial joins are memory-intensive operations. It is recommended that you have at least 8 GB of memory available per core in your Spark cluster to allow for optimal performance.
  • DataFrames with geometries that are highly clustered will be slower than if the geometries are dispersed evenly over the DataFrame's spatial extent. If possible, add a temporal relationship and/or attribute relationship to your join condition when joining highly clustered data. You can also increase the ratio of memory to cores in your Spark cluster to improve performance with clustered or dense data.
  • Geodesic distance calculations will be slower than planar distance calculations. For more information see Planar vs. geodesic distance calculations.
  • Limiting the extent of your data can improve performance by filtering out invalid coordinates and locations irrelevant to your analysis. You can filter by extent prior to joining with ST_EnvIntersects.

What's next?

Learn more about spatial relationships and spatial functions:

Your browser is no longer supported. Please upgrade your browser for the best experience. See our browser deprecation post for more details.