Group by proximity

Groups records that are within spatial or spatiotemporal proximity to each other.

Usage notes

The output result is a copy of the input with a new field named group_id. The group_id field represents the grouping of records. Records with the same group_id value are in the same group. The group numbers represent membership in a particular group and do not imply value. The group numbers may not be sequential or the same number in repeated runs of the tool.
The supported spatial relationships and input geometries are described in the following table:
	Intersects	Touches	Geodesic Near	Planar Near	Local Projection Near
Point
Linestring
Polygon
Legend: Full support, Partial support, No support
The spatial relationship definitions are outlined below.

Spatial relationship method	Description
Intersects	Records intersect when records or portions of records overlap. This is the default.
Touches	Records touch another record if they have an intersecting vertex, but the records do not overlap.
Geodesic Near	Records are near if a vertex or edge is within a given geodesic distance of another record.
Bin-local Projection Near	Records are near if a vertex or edge is within a given geodesic distance of another record, as approximated in several temporary local projections with locally-planar distance calculation.
Planar Near	Records are near if a vertex or edge is within a given planar distance of another record.

When 'PlanarNear' is specified with setSpatialRelationship(), the geometry column of the input DataFrame must be in a projected coordinate system, or the tool will fail. You can transform your data to a projected coordinate system by using ST_Transform.
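As a minimal sketch of this requirement, the helper below projects the geometry column before running the tool with planar near. The helper name `group_by_planar_near`, its parameters, the default unit string "Meters", and the default well-known ID 6558 (the projected coordinate system used in the plotting example on this page) are assumptions for illustration; check the API reference for the accepted unit values.

```python
def group_by_planar_near(df, distance, unit="Meters", wkid=6558,
                         geometry_column="shape"):
    """Project the geometry column, then group records by planar distance.

    Hypothetical helper: the unit string and default wkid are assumptions.
    """
    # Imports are deferred so the helper can be defined before the
    # GeoAnalytics Engine session is authorized.
    from geoanalytics.tools import GroupByProximity
    from geoanalytics.sql import functions as ST

    # Planar near fails on unprojected geometry, so transform first.
    projected = df.withColumn(geometry_column,
                              ST.transform(geometry_column, wkid))

    return GroupByProximity() \
        .setSpatialRelationship(spatial_relationship="PlanarNear",
                                near_distance=distance,
                                near_distance_unit=unit) \
        .run(dataframe=projected)
```

For example, `group_by_planar_near(oregon_rivers_df, 100)` would group records within 100 units of each other, assuming the DataFrame's geometry column is named `shape`.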

Learn more about coordinate systems and transformations
Bin-local projection near is an optional distance calculation method available in GeoAnalytics Engine 1.6+. It is a fast approximation of geodesic near with reduced distortion over large areas compared to the planar near method. It provides a practical balance between the accuracy of geodesic distance calculations and the performance of planar distance calculations.
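To illustrate why geodesic and planar distances can disagree, the following sketch compares a spherical (haversine) approximation of geodesic distance with a naive planar distance computed directly on unprojected degrees. It is not how the tool calculates distance; the function names and the mean Earth radius constant are assumptions chosen for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius; spherical approximation


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def planar_degrees(lon1, lat1, lon2, lat2):
    """Naive planar distance computed on unprojected degrees."""
    return math.hypot(lon2 - lon1, lat2 - lat1)


# One degree of longitude at the equator and at 60 degrees north.
equator_m = haversine_m(0, 0, 1, 0)
north_m = haversine_m(0, 60, 1, 60)
equator_deg = planar_degrees(0, 0, 1, 0)
north_deg = planar_degrees(0, 60, 1, 60)

print(f"Equator: {equator_m:.0f} m on the sphere, {equator_deg} planar units")
print(f"60 N:    {north_m:.0f} m on the sphere, {north_deg} planar units")
```

The planar calculation reports the same separation at both latitudes, while the distance on the sphere at 60 degrees north is roughly half of that at the equator.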
The supported temporal relationships and temporal types are described in the following table:
	Intersects	Near
None
Instant
Interval
Legend: Full support, Partial support, No support
The temporal relationship definitions are outlined below.

Temporal relationship method	Description
Intersects	Records intersect when any part of a record's time overlaps another record's time.
Near	Records are near one another if a record's time is within a given time distance of another record.
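The two temporal relationships can be sketched in plain Python as follows. This is an illustration of the definitions above, not the tool's implementation. It assumes that intervals are closed (endpoints count as overlapping), that an instant is an interval whose start equals its end, and that intersecting records are also considered near; the function names are hypothetical.

```python
from datetime import datetime, timedelta


def time_intersects(a, b):
    """a and b are (start, end) tuples; an instant has start == end."""
    return a[0] <= b[1] and b[0] <= a[1]


def time_near(a, b, max_gap):
    """True if the gap between the two time extents is at most max_gap."""
    if time_intersects(a, b):
        return True
    # Measure the gap between the closest endpoints of the two extents.
    gap = b[0] - a[1] if a[1] < b[0] else a[0] - b[1]
    return gap <= max_gap


day = datetime(2024, 1, 1)
morning = (day.replace(hour=9), day.replace(hour=10))
late_morning = (day.replace(hour=9, minute=30), day.replace(hour=11))
instant_a = (day.replace(hour=10), day.replace(hour=10))
instant_b = (day.replace(hour=10, minute=45), day.replace(hour=10, minute=45))

print(time_intersects(morning, late_morning))                # overlapping intervals
print(time_intersects(instant_a, instant_b))                 # distinct instants
print(time_near(instant_a, instant_b, timedelta(hours=1)))   # 45-minute gap
```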

You can specify any of the following combinations of relationships:
- A spatial relationship value
- A spatial relationship and a temporal relationship
- A spatial relationship and an attribute relationship
- A spatial relationship, temporal relationship, and an attribute relationship
Records are grouped when all specified relationships are met.
The attribute expression is a symmetric operation. The tool takes a single DataFrame and compares it against itself to form groups. Because of this, the input dataset is denoted as both a and b, and all expressions should include both a and b.
When specifying the attribute relationship, you can create a Spark SQL expression or an Arcade expression. For example, to group all records where the column Amount has the same value, use one of the following:
- SQL: a.Amount = b.Amount
- Arcade: $a.Amount == $b.Amount
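The sketch below illustrates how a combined spatial and attribute relationship could assign groups when a dataset is compared against itself. It is a conceptual illustration only, not the tool's algorithm. It assumes that group membership is transitive (records connected through a chain of related records share a group), which this page does not state explicitly, and it uses plain x/y points with hypothetical values.

```python
import math
from itertools import combinations

# Hypothetical records; Amount plays the role of the attribute column.
records = [
    {"x": 0.0, "y": 0.0, "Amount": 10},
    {"x": 3.0, "y": 4.0, "Amount": 10},      # 5 units from record 0
    {"x": 6.0, "y": 8.0, "Amount": 10},      # 5 units from record 1
    {"x": 6.0, "y": 9.0, "Amount": 99},      # near record 2, different Amount
    {"x": 100.0, "y": 100.0, "Amount": 10},  # far from everything
]


def related(a, b, max_distance):
    """Both relationships must hold, mirroring a.Amount = b.Amount."""
    near = math.hypot(a["x"] - b["x"], a["y"] - b["y"]) <= max_distance
    return near and a["Amount"] == b["Amount"]


def group_records(records, max_distance):
    """Label connected records with a shared group ID using union-find."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(records)), 2):
        if related(records[i], records[j], max_distance):
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(records))]


group_ids = group_records(records, max_distance=5.0)
print(group_ids)
```

The first three records share a group because each is related to its neighbor, while the fourth record is excluded by the attribute relationship even though it is spatially close. The resulting IDs are labels only and are not sequential.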

Limitations

Values will not be grouped across the anti-meridian when using the planar near spatial relationship.
When using bin-local projection near, group membership may change between runs due to slight variations in the distance calculation resulting from changes in data partitioning (e.g., repartitioning before calling the tool). These variations can affect geometries on the very edge of a group boundary defined by the provided search distance.
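The anti-meridian limitation can be illustrated with a small calculation. The sketch assumes a hypothetical world-wide projected coordinate system (Web Mercator x-coordinates, where x equals the sphere radius times the longitude in radians) and two points on the equator on either side of the anti-meridian; it is not taken from the tool itself.

```python
import math

WEB_MERCATOR_RADIUS_M = 6_378_137.0  # sphere radius used by Web Mercator


def web_mercator_x(lon_deg):
    """Projected x-coordinate in meters for a longitude in degrees."""
    return WEB_MERCATOR_RADIUS_M * math.radians(lon_deg)


lon_east, lon_west = 179.9, -179.9

# Planar separation measured in projected coordinates.
planar_dx_m = abs(web_mercator_x(lon_east) - web_mercator_x(lon_west))

# Separation across the anti-meridian along the equator.
lon_gap_deg = 360.0 - abs(lon_east - lon_west)
across_dx_m = WEB_MERCATOR_RADIUS_M * math.radians(lon_gap_deg)

print(f"Planar separation:        {planar_dx_m:,.0f} m")
print(f"Across the anti-meridian: {across_dx_m:,.0f} m")
```

The planar separation spans almost the full width of the projection, so a planar near search with a realistic distance would not relate the two points.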

Results

In addition to the original fields, the following additional fields are included:

Field	Description
`group_id`	The grouping of records. Records with the same group_id value are in the same group. The group numbers represent membership in a particular group and don't imply value. The group numbers may not be sequential or the same number in repeated runs of the tool.
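Because group_id is only a label, a common follow-up is to count the members of each group. The helper below is a minimal sketch using standard PySpark DataFrame methods; the name `group_sizes` is hypothetical, and it assumes `result` is the DataFrame returned by the tool.

```python
def group_sizes(result):
    """Return one row per group with its record count, largest first."""
    return result.groupBy("group_id") \
                 .count() \
                 .orderBy("count", ascending=False)
```

For example, `group_sizes(result).show(5)` would list the five largest groups.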

Performance notes

Improve the performance of Group By Proximity by doing one or more of the following:

Only analyze the records in your area of interest. You can pick the records of interest by using one of the following SQL functions:
- ST_Intersection: Clip to an area of interest represented by a polygon. This will modify your input records.
- ST_BboxIntersects: Select records that intersect an envelope.
- ST_EnvIntersects: Select records having an envelope that intersects the envelope of another geometry.
- ST_Intersects: Select records that intersect another dataset or area of interest represented by a polygon.
When using planar, bin-local projection, or geodesic near, use a smaller distance.
Among the near spatial relationships, planar near is the fastest option and geodesic near is the most accurate. Bin-local projection near is a compromise: it approximates geodesic distance more closely than planar near and performs better than geodesic near, especially over large areas.
When using the temporal relationship parameter's near option, use a smaller temporal near distance.
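The envelope-based selections listed above rely on a cheap bounding-box comparison. The following sketch shows that comparison in plain Python to clarify what selecting by envelope means; it is a conceptual illustration with hypothetical coordinates and names, not the implementation of the SQL functions.

```python
def envelopes_intersect(a, b):
    """a and b are (xmin, ymin, xmax, ymax); touching edges count as intersecting."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# Hypothetical area of interest and record envelopes.
area_of_interest = (0.0, 0.0, 10.0, 10.0)
record_envelopes = {
    "a": (2.0, 2.0, 4.0, 4.0),     # fully inside
    "b": (9.0, 9.0, 12.0, 12.0),   # partially overlapping
    "c": (11.0, 0.0, 15.0, 5.0),   # outside
}

selected = [name for name, env in record_envelopes.items()
            if envelopes_intersect(env, area_of_interest)]
print(selected)
```

Filtering this way before running the tool reduces the number of records that have to be compared against each other.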

Similar capabilities

Similar tools:

The following functions complete spatial overlay operations:

Syntax

For more details, go to the GeoAnalytics Engine API reference for group by proximity.

Setter (Python)	Setter (Scala)	Description	Required
`run(dataframe)`	`run(input)`	Runs the Group By Proximity tool using the provided DataFrame.	Yes
`setSpatialRelationship(spatial_relationship='Intersects', near_distance=None, near_distance_unit=None)`	`setSpatialRelationship(spatialRelationship, nearDistance=None, nearDistanceUnit=None)`	Sets the type of spatial relationship to group by.	Yes
`setTemporalRelationship(temporal_relationship='Intersects', temporal_distance=None, temporal_distance_unit=None)`	`setTemporalRelationship(temporalRelationship, nearDuration=None, nearDurationUnit=None)`	Sets the type of temporal relationship to group by.	No
`setAttributeRelationship(expression=None, expression_type="sql")`	`setAttributeRelationship(expression, expressionType="sql")`	Sets the attribute expression to group by. The expression type can be `sql` or `Arcade`.	No
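As a sketch of how the setters in the table can be chained, the helper below combines a spatial, a temporal, and an attribute relationship. The helper name is hypothetical, and it assumes the input DataFrame is time-enabled and has an Amount column; neither is guaranteed by this page.

```python
def group_by_all_relationships(df):
    """Group records that intersect in space and time and share an Amount."""
    # Import is deferred so the helper can be defined before the
    # GeoAnalytics Engine session is authorized.
    from geoanalytics.tools import GroupByProximity

    return GroupByProximity() \
        .setSpatialRelationship(spatial_relationship="Intersects") \
        .setTemporalRelationship(temporal_relationship="Intersects") \
        .setAttributeRelationship(expression="a.Amount = b.Amount",
                                  expression_type="sql") \
        .run(dataframe=df)
```

Records end up in the same group only when all three relationships are met.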

Examples

Run Group by Proximity

Python

# Log in
import geoanalytics
geoanalytics.auth(username="myusername", password="mypassword")

# Imports
from geoanalytics.tools import GroupByProximity
from geoanalytics.sql import functions as ST

# Path to the USA rivers and streams data
usa_rivers_data_path = "https://services.arcgis.com/P3ePLMYs2RVChkJx/ArcGIS/rest" \
            "/services/USA_Rivers_and_Streams/FeatureServer/0"

# Create an Oregon rivers DataFrame from the USA rivers and streams data
oregon_rivers_df = spark.read.format("feature-service") \
                                .load(usa_rivers_data_path) \
                                .where("State = 'OR'")

# Run the Group by Proximity tool to find intersecting rivers and streams
result = GroupByProximity() \
           .setSpatialRelationship(spatial_relationship="Intersects") \
           .run(dataframe=oregon_rivers_df)

# View the 5 longest rivers assigned to the same group as the Willamette River
query_group_id = result.where("Name == 'Willamette River'").select("GROUP_ID").collect()[0]["GROUP_ID"]

result.where("GROUP_ID == '{}'".format(query_group_id)) \
      .select("State", "Name", "Region", "Feature", "Miles") \
      .sort("Miles", ascending=False).show(5)

Result
+-----+-------------------+------+-------+-----+
|State|               Name|Region|Feature|Miles|
+-----+-------------------+------+-------+-----+
|   OR|   Willamette River|    17| Stream|160.8|
|   OR|    Clackamas River|    17| Stream| 71.1|
|   OR|     Tualatin River|    17| Stream|71.05|
|   OR|    Calapooia River|    17| Stream|64.31|
|   OR|South Yamhill River|    17| Stream|53.06|
+-----+-------------------+------+-------+-----+
only showing top 5 rows

Plot results

Python

# Plot the grouped results
# Create an Oregon boundary DataFrame and transform geometry to NAD 1983 StatePlane Oregon
usa_states_path = "https://services.arcgis.com/P3ePLMYs2RVChkJx/ArcGIS/rest" \
          "/services/USA_State_Boundaries/FeatureServer/0"
oregon_df = spark.read.format("feature-service").load(usa_states_path) \
                            .where("STATE_NAME == 'Oregon'") \
                            .withColumn("shape", ST.transform("shape", 6558))

# Transform the grouped rivers' geometry to NAD 1983 StatePlane Oregon spatial reference
result = result.withColumn("shape", ST.transform("shape", 6558))

# Plot the result DataFrame with the Oregon data
oregon_plot = oregon_df.st.plot(facecolor="none", linewidth=2, edgecolors="black",
                                figsize=(16, 10), basemap="light")
result_plot = result.st.plot(geometry="shape", cmap_values="GROUP_ID",
                             is_categorical=True, cmap="tab20c",
                             ax=oregon_plot)
result_plot.set_title("Oregon rivers grouped by proximity")
result_plot.set_xlabel("X (Meters)")
result_plot.set_ylabel("Y (Meters)");

Plotting example for a Group by Proximity result. Oregon rivers are colored by group.

Version table

Release	Notes
1.0.0	Python tool introduced
1.6.0	Scala tool introduced
1.6.0	Added support for bin-local projection near spatial relationship.
