geoanalytics.tools¶
Aggregate Points¶
- class geoanalytics.tools.AggregatePoints¶
Aggregates points into square or hexagon bins, or existing polygons.
The tool first determines which points fall within each specified area. After determining this point-in-area spatial relationship, statistics about all points in the area are calculated and assigned to the area.
Refer to the GeoAnalytics Engine guide for examples and usage notes: Aggregate Points
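The two steps described above (assign each point to an area, then summarize the points in each area) can be sketched in plain Python for square bins. This is a minimal illustration on planar coordinates, not the tool's implementation; the function name `aggregate_square_bins` and the `(x, y, value)` tuple layout are hypothetical.

```python
import math
from collections import defaultdict

def aggregate_square_bins(points, bin_size):
    """Group (x, y, value) points into square bins and summarize each bin."""
    bins = defaultdict(list)
    for x, y, value in points:
        # The bin key is the integer grid cell that contains the point.
        key = (math.floor(x / bin_size), math.floor(y / bin_size))
        bins[key].append(value)
    # One summary per bin: point count plus Sum and Mean of the value field.
    return {
        key: {"count": len(vals), "sum": sum(vals), "mean": sum(vals) / len(vals)}
        for key, vals in bins.items()
    }

summary = aggregate_square_bins(
    [(0.5, 0.5, 10), (0.9, 0.1, 30), (1.5, 0.2, 5)], bin_size=1.0
)
```

Only bins that contain at least one point appear in this sketch's result.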
- addSummaryField(summary_field, statistic, alias=None)¶
Adds a summary statistic of a field in the input DataFrame to the result DataFrame.
- Parameters
summary_field (str) – The name of a field from the input DataFrame.
statistic (str) – Choose from Count, Sum, Mean, Max, Min, Range, Stddev, Var, or Any.
alias (str) – The name of the result field containing the statistic. The default is the field name and statistic separated by an underscore.
- run(dataframe)¶
Runs the AggregatePoints tool using the provided DataFrame.
- Parameters
dataframe (DataFrame) – A DataFrame containing a point column.
- Returns
A DataFrame containing a polygon column, count of points within the polygon, and any summary statistics for each polygon.
- Return type
DataFrame
- setBins(bin_size, bin_size_unit, bin_type='square')¶
Sets the size and shape of the bins into which points are aggregated.
Note
This method will override setPolygons.
- Parameters
bin_size (float) – Distance between parallel sides of a bin.
bin_size_unit (str) – Choose from Meters, Kilometers, Feet, Miles, NauticalMiles, or Yards.
bin_type (str) – Choose from Square or Hexagon.
- setPolygons(polygons)¶
Sets the DataFrame containing a column of polygons into which the input points will be aggregated.
Note
This method will override setBins.
- Parameters
polygons (pyspark.sql.DataFrame) – A DataFrame containing a column of polygons.
- setTimeStep(interval_duration, interval_unit, repeat_duration=None, repeat_unit=None, reference_time=None)¶
Sets the time step interval, time step repeat, and reference time. If set, points will be aggregated into each bin for each time step. The input DataFrame must have a datetime column to use this setter.
- Parameters
interval_duration (int) – Duration of each time step.
interval_unit (str) – Choose from Milliseconds, Seconds, Minutes, Hours, Days, Weeks, Months, or Years.
repeat_duration (int) – Time between one time step and the next.
repeat_unit (str) – Choose from Milliseconds, Seconds, Minutes, Hours, Days, Weeks, Months, or Years.
reference_time (int/long/datetime.datetime) – A reference datetime to align the time steps to. The default is epoch time 0.
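One plausible reading of how the interval, repeat, and reference time interact is sketched below. It assumes that a step starts every `repeat` units from the reference time, that each step lasts `interval` units, and that the repeat defaults to the interval when omitted. The helper `time_steps_containing` is hypothetical and uses plain integers (for example, seconds) rather than the tool's units.

```python
def time_steps_containing(t, interval, repeat=None, reference=0):
    """Return the (start, end) time steps that contain timestamp t.

    A step k covers [reference + k * repeat, reference + k * repeat + interval).
    """
    repeat = interval if repeat is None else repeat
    offset = t - reference
    k_max = offset // repeat                   # last step starting at or before t
    k_min = (offset - interval) // repeat + 1  # first step that has not ended by t
    return [
        (reference + k * repeat, reference + k * repeat + interval)
        for k in range(k_min, k_max + 1)
    ]

back_to_back = time_steps_containing(25, interval=10)
overlapping = time_steps_containing(25, interval=10, repeat=5)
gapped = time_steps_containing(15, interval=10, repeat=20)
```

Under these assumptions a repeat shorter than the interval yields overlapping steps, and a repeat longer than the interval leaves gaps in which a point belongs to no step.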
Calculate Density¶
- class geoanalytics.tools.CalculateDensity¶
Calculates the density of points and their attributes.
Each point represents the location of some event or incident, and the result calculation represents a count of incidents per unit area. A higher density value in a new location means that there are more points near that location.
In many cases, the result layer can be interpreted as a risk surface for future events. For example, if the input points represent locations of lightning strikes, the result layer can be interpreted as a risk surface for future lightning strikes.
Refer to the GeoAnalytics Engine guide for examples and usage notes: Calculate Density
- run(dataframe)¶
Runs the CalculateDensity tool using the provided DataFrame.
- Parameters
dataframe (DataFrame) – A DataFrame containing a point column with a spatial reference.
- Returns
A DataFrame of square or hexagon bins with a column of calculated density values.
- Return type
DataFrame
- setAreaUnit(area_unit)¶
Sets the desired output units of the density values. The default is SquareKilometers. If density values are very small, you can increase the scale of the area units to return larger values.
- Parameters
area_unit (str) – Choose from SquareMeters, SquareKilometers, Hectares, SquareFeet, SquareYards, SquareMiles or Acres.
- setBins(bin_size, bin_size_unit, bin_type='square')¶
Sets the size and shape of bins used to calculate density.
- Parameters
bin_size (float) – Distance between parallel sides of a bin.
bin_size_unit (str) – Choose from Meters, Kilometers, Feet, Miles, NauticalMiles, or Yards.
bin_type (str) – Choose from Square or Hexagon.
- setFields(*fields)¶
Sets one or more fields specifying the number of incidents at each location. You can calculate the density on multiple fields. The density of the count of points will always be calculated.
- Parameters
fields (*str) – The names of one or more fields from the input DataFrame.
- setNeighborhood(distance, distance_unit)¶
Sets the size of the neighborhood within which to calculate density. The distance must be larger than the bin size.
- Parameters
distance (float) – Radius of the neighborhood, measured from each bin center.
distance_unit (str) – Choose from Meters, Kilometers, Feet, Miles, NauticalMiles, or Yards.
- setTimeStep(interval_duration, interval_unit, repeat_duration=None, repeat_unit=None, reference_time=None)¶
Sets the time step interval, time step repeat, and reference time. If set, density will be calculated for each time step at each bin location. The input DataFrame must have a datetime column to use this setter.
- Parameters
interval_duration (int) – Duration of each time step.
interval_unit (str) – Choose from Milliseconds, Seconds, Minutes, Hours, Days, Weeks, Months, or Years.
repeat_duration (int) – Time between one time step and the next.
repeat_unit (str) – Choose from Milliseconds, Seconds, Minutes, Hours, Days, Weeks, Months, or Years.
reference_time (int/long/datetime.datetime) – A reference datetime to align the time steps to. The default is epoch time 0.
- setWeightType(weight_type)¶
Sets the type of weighting applied to density calculations. This parameter supports two options:
Uniform: calculates density as magnitude-per-area. This is the default.
Kernel: calculates density by applying a kernel function to fit a smooth tapered surface to each point.
- Parameters
weight_type (str) – Choose from Uniform or Kernel.
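The difference between the two weight types can be illustrated with a small sketch. The reference does not say which kernel function the tool applies, so the quartic kernel below is an assumption chosen for illustration, and both function names are hypothetical.

```python
import math

def uniform_density(distances, radius):
    """Count of points inside the neighborhood divided by the neighborhood area."""
    inside = [d for d in distances if d <= radius]
    return len(inside) / (math.pi * radius ** 2)

def kernel_density(distances, radius):
    """Quartic-kernel density (assumed kernel): nearby points contribute more."""
    scale = 3.0 / (math.pi * radius ** 2)
    return sum(scale * (1 - (d / radius) ** 2) ** 2 for d in distances if d < radius)

# Distances from one bin center to nearby points, in meters.
distances = [0.0, 50.0, 100.0, 150.0]
uniform = uniform_density(distances, radius=100.0)
kernel = kernel_density(distances, radius=100.0)
```

With uniform weighting every point inside the neighborhood counts equally; with the kernel, a point's contribution tapers to zero at the neighborhood edge.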
Calculate Field¶
- class geoanalytics.tools.CalculateField¶
Creates and populates a new field or edits an existing field using ArcGIS Arcade.
Your calculation can optionally be track aware. Track-aware equations use Arcade expressions that include track functions. To include a track-aware calculation, setTrackFields must be called and the input DataFrame must have datetime and track ID columns.
Refer to the GeoAnalytics Engine guide for examples and usage notes: Calculate Field
- run(dataframe)¶
Runs the CalculateField tool using the provided DataFrame.
- Parameters
dataframe (DataFrame) – A DataFrame.
- Returns
A copy of the input DataFrame with the calculated field appended or overwritten.
- Return type
DataFrame
- setExpression(expression)¶
Sets an Arcade expression used to calculate the new field values. You can use any of the Date, Logical, Mathematical, or Text functions available with Arcade expressions.
- Parameters
expression (str) – An Arcade expression.
- setField(field_name, field_type)¶
Sets the name and type of the new field. If the name already exists in the dataset the field will be overwritten.
- Parameters
field_name (str) – The name of the column that will be appended to the input DataFrame.
field_type (str) – Choose from Date, Double, Integer, or String.
- setTimeBoundarySplit(time_boundary_split, time_boundary_split_unit, time_boundary_reference=None)¶
Sets boundaries to limit calculations to defined spans of time. For example, if you use a time boundary of 1 day starting on January 1, 1980, tracks will be analyzed one day at a time.
- Parameters
time_boundary_split (int) – The scale of the time boundary.
time_boundary_split_unit (str) – Choose from Milliseconds, Seconds, Minutes, Hours, Days, Weeks, Months, or Years.
time_boundary_reference (int/long/datetime.datetime) – A reference datetime to align the time boundaries to. The default is epoch time 0.
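The effect of a time boundary can be sketched as grouping observations by the boundary interval they fall in, aligned to the reference time. This is a simplified illustration only; the helper `split_by_time_boundary` is hypothetical and ignores track IDs.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def split_by_time_boundary(timestamps, split, reference):
    """Group timestamps by the start of the boundary interval containing them."""
    segments = defaultdict(list)
    for ts in timestamps:
        # Floor division of timedeltas gives the index of the containing interval.
        index = (ts - reference) // split
        segments[reference + index * split].append(ts)
    return dict(segments)

reference = datetime(1980, 1, 1)
observations = [
    datetime(1980, 1, 1, 8, 0),
    datetime(1980, 1, 1, 23, 30),
    datetime(1980, 1, 2, 0, 15),
]
segments = split_by_time_boundary(observations, timedelta(days=1), reference)
```

Here the first two observations are analyzed together and the third starts a new segment, even though it is only 45 minutes after the second.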
- setTrackFields(*track_fields)¶
Sets one or more fields used to identify distinct tracks.
- Parameters
track_fields (*str) – The names of one or more fields from the input DataFrame.
Calculate Motion Statistics¶
- class geoanalytics.tools.CalculateMotionStatistics¶
Calculates motion statistics and descriptors for time-enabled points that represent one or more moving entities.
Points are grouped together into tracks representing each entity using a unique identifier. Motion statistics are calculated at each point using one or more points in the track history. Calculations include summaries of distance traveled, duration, elevation, speed, acceleration, bearing, and idle status.
Refer to the GeoAnalytics Engine guide for examples and usage notes: Calculate Motion Statistics
- run(dataframe)¶
Runs the CalculateMotionStatistics tool using the provided DataFrame.
- Parameters
dataframe (DataFrame) – A DataFrame containing a track ID column and a datetime column.
- Returns
A copy of the input DataFrame with motion statistics appended to each row.
- Return type
DataFrame
- setDistanceMethod(distance_method)¶
Sets the method used to calculate distances between track observations. There are two methods to choose from:
Planar: measures distances using a Euclidean plane and will not calculate statistics across the date line.
Geodesic: calculations will cross the date line when appropriate. This is the default. If the spatial reference cannot be panned, calculations will be limited to the coordinate system extent and may not wrap.
- Parameters
distance_method (str) – Choose from Planar or Geodesic.
- setIdleTolerance(distance_tolerance, distance_tolerance_unit, time_tolerance, time_tolerance_unit)¶
Sets the tolerances to use to decide if an entity is idling. An entity is idling when it hasn’t moved more than the distance tolerance in at least the time tolerance.
- Parameters
distance_tolerance (float) – Spatial idling tolerance.
distance_tolerance_unit (str) – Choose from Meters, Kilometers, Feet, Miles, NauticalMiles, or Yards.
time_tolerance (int) – Temporal idling tolerance.
time_tolerance_unit (str) – Choose from Milliseconds, Seconds, Minutes, Hours, Days, Weeks, Months, or Years.
- setMotionStatistics(*motion_statistics)¶
Sets the statistic groups that will be calculated.
- Parameters
motion_statistics (*str) – Choose from Distance, Speed, Acceleration, Duration, Elevation, Slope, Idle, and Bearing.
- setStatisticUnits(distance_unit='Meters', duration_unit='Seconds', speed_unit='MetersPerSecond', acceleration_unit='MetersPerSecondSquared', elevation_unit='Meters')¶
Sets the output units for each statistic group.
- Parameters
distance_unit (str) – Choose from Meters, Kilometers, Feet, Miles, NauticalMiles, or Yards.
duration_unit (str) – Choose from Milliseconds, Seconds, Minutes, Hours, Days, Weeks, Months, or Years.
speed_unit (str) – Choose from MetersPerSecond, KilometersPerHour, FeetPerSecond, MilesPerHour, or NauticalMilesPerHour.
acceleration_unit (str) – Choose from MetersPerSecondSquared or FeetPerSecondSquared.
elevation_unit (str) – Choose from Meters, Kilometers, Feet, Miles, NauticalMiles, or Yards.
- setTimeBoundarySplit(time_boundary_split, time_boundary_split_unit, time_boundary_reference=None)¶
Sets boundaries to limit calculations to defined spans of time. For example, if you use a time boundary of 1 day starting on January 1, 1980, tracks will be analyzed one day at a time.
- Parameters
time_boundary_split (int) – The scale of the time boundary.
time_boundary_split_unit (str) – Choose from Milliseconds, Seconds, Minutes, Hours, Days, Weeks, Months, or Years.
time_boundary_reference (int/long/datetime.datetime) – A reference datetime to align the time boundaries to. The default is epoch time 0.
- setTrackFields(*track_fields)¶
Sets one or more fields used to identify distinct tracks.
- Parameters
track_fields (*str) – The names of one or more fields from the input DataFrame.
- setTrackHistoryWindow(track_history_window)¶
Sets the number of observations (including the current observation) that will be used when calculating summary statistics that are not instantaneous. This includes minimum, maximum, average, and total statistics.
The default track history window is 3, which means that at each point in a track, summary statistics will be calculated using the current observation and the previous two observations.
Note
This setter does not affect instantaneous statistics or idle classification.
- Parameters
track_history_window (int) – Number of observations.
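The windowing behavior described above can be sketched as a rolling summary over the current observation and the ones before it. The function `windowed_stats` and the statistic names in its output are hypothetical; the sketch assumes that observations near the start of a track simply use however many observations are available.

```python
def windowed_stats(values, window=3):
    """Summarize each observation together with up to window - 1 previous ones."""
    results = []
    for i in range(len(values)):
        # The slice includes the current observation and the preceding ones.
        recent = values[max(0, i - window + 1): i + 1]
        results.append({
            "min": min(recent),
            "max": max(recent),
            "avg": sum(recent) / len(recent),
            "total": sum(recent),
        })
    return results

speeds = [2, 4, 6, 8]
stats = windowed_stats(speeds, window=3)
```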
Clip¶
- class geoanalytics.tools.Clip¶
Extracts geometries that overlay clip geometries.
Note
This tool operates on the entire input DataFrame and thus can be more performant than equivalent row-wise operations using SQL functions.
Refer to the GeoAnalytics Engine guide for examples and usage notes: Clip
- run(input_dataframe, clip_dataframe)¶
Runs the Clip tool using the provided DataFrames.
- Parameters
input_dataframe (DataFrame) – A DataFrame containing a geometry column.
clip_dataframe (DataFrame) – A DataFrame containing a polygon column to clip with.
- Returns
A DataFrame containing the result of the clip.
- Return type
DataFrame
Detect Incidents¶
- class geoanalytics.tools.DetectIncidents¶
Determines which observations are incidents of interest using a specified condition.
Rows in the input DataFrame are grouped using a track ID and ordered sequentially before an incident condition is applied. Rows that meet the starting incident condition are marked as an incident. An ending incident condition can be applied; when the end condition is true, the track is no longer in an incident. You can return all input rows or only rows that are incidents.
Refer to the GeoAnalytics Engine guide for examples and usage notes: Detect Incidents
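The start and end conditions described above behave like a small state machine over the ordered rows of one track. The sketch below is an interpretation of that description, not the tool's implementation: plain Python callables stand in for Arcade expressions, and `flag_incidents` is a hypothetical name.

```python
def flag_incidents(rows, start, end=None):
    """Return one boolean per row: True while the track is in an incident."""
    flags = []
    in_incident = False
    for row in rows:
        if end is None:
            # Without an end condition, each row is judged on its own.
            in_incident = start(row)
        elif end(row):
            # A row that meets the end condition is never an incident.
            in_incident = False
        elif start(row):
            in_incident = True
        # Otherwise the previous state carries over to this row.
        flags.append(in_incident)
    return flags

speeds = [1, 5, 3, 3, 0, 4]
with_end = flag_incidents(speeds, start=lambda v: v > 4, end=lambda v: v < 1)
without_end = flag_incidents(speeds, start=lambda v: v > 4)
```

With an end condition the incident persists through rows that no longer meet the start condition; without one, only the rows that meet the start condition are flagged.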
- run(dataframe)¶
Runs the DetectIncidents tool using the provided DataFrame.
- Parameters
dataframe (DataFrame) – A DataFrame containing a track ID column and a datetime column.
- Returns
A copy of the input DataFrame with incident status appended to each row.
- Return type
DataFrame
- setEndConditionExpression(end_condition_expression)¶
Sets the condition used to end incidents. If there is an end condition, any feature that meets the start condition expression and does not meet the end condition expression is an incident.
- Parameters
end_condition_expression (str) – Arcade expression used to identify incidents.
- setOutputMode(output_mode)¶
Sets which observations are returned. There are two options:
All: all of the input observations are returned. This is the default.
Incidents: only observations that were found to be incidents are returned.
- Parameters
output_mode (str) – Choose from All or Incidents.
- setStartConditionExpression(start_condition_expression)¶
Sets the condition used to start incidents. If there is no end condition expression specified, any feature that meets this condition is an incident. If there is an end condition, any feature that meets the start condition expression and does not meet the end condition expression is an incident.
- Parameters
start_condition_expression (str) – Arcade expression used to identify incidents.
- setTimeBoundarySplit(time_boundary_split, time_boundary_split_unit, time_boundary_reference=None)¶
Sets boundaries to limit calculations to defined spans of time. For example, if you use a time boundary of 1 day starting on January 1, 1980, tracks will be analyzed one day at a time.
- Parameters
time_boundary_split (int) – The scale of the time boundary.
time_boundary_split_unit (str) – Choose from Milliseconds, Seconds, Minutes, Hours, Days, Weeks, Months, or Years.
time_boundary_reference (int/long/datetime.datetime) – A reference datetime to align the time boundaries to. The default is epoch time 0.
- setTrackFields(*track_fields)¶
Sets one or more fields used to identify distinct tracks.
- Parameters
track_fields (*str) – The names of one or more fields from the input DataFrame.
Find Dwell Locations¶
- class geoanalytics.tools.FindDwellLocations¶
Finds where entities dwell within a specific distance and duration using a record of their location through time.
Dwell locations are determined using time and distance tolerances. First, the tool groups points into tracks representing each entity using a track identifier and orders them sequentially. Next, the distance between the first point in a track and the next is calculated. If two temporally consecutive points stay within the given distance for at least the given duration, they are considered part of a dwell. When two points are found to be part of a dwell, the first point in the dwell is used as a reference point, and the tool finds consecutive points that are within the specified distance of the reference point in the dwell.
Once all points within the specified distance are found, the tool collects the dwell points and calculates their mean center. Features before and after the current dwell are added to the dwell if they are within the given distance of the dwell location’s mean center. This process continues until the end of the track.
Refer to the GeoAnalytics Engine guide for examples and usage notes: Find Dwell Locations
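A simplified version of the dwell search described above is sketched below. It covers only the reference-point step (consecutive points within the distance of the first point, lasting at least the minimum duration) and the mean-center calculation; it omits the later step that adds neighboring features around the mean center. It uses planar distances on a single time-ordered track, and `find_dwells` is a hypothetical name.

```python
import math

def find_dwells(track, max_distance, min_duration):
    """Return (mean_center, points) for each dwell in a time-ordered list of (t, x, y)."""
    dwells = []
    i = 0
    while i < len(track):
        t0, x0, y0 = track[i]  # reference point for a candidate dwell
        j = i
        # Extend while the next consecutive point stays near the reference point.
        while (j + 1 < len(track)
               and math.hypot(track[j + 1][1] - x0, track[j + 1][2] - y0) <= max_distance):
            j += 1
        if j > i and track[j][0] - t0 >= min_duration:
            points = track[i:j + 1]
            center = (
                sum(p[1] for p in points) / len(points),
                sum(p[2] for p in points) / len(points),
            )
            dwells.append((center, points))
            i = j + 1  # continue after the dwell
        else:
            i += 1
    return dwells

track = [(0, 0.0, 0.0), (10, 1.0, 0.0), (20, 0.0, 1.0), (30, 50.0, 50.0), (40, 100.0, 100.0)]
dwells = find_dwells(track, max_distance=2.0, min_duration=15)
```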
- addSummaryField(summary_field, statistic, alias=None)¶
Adds a summary statistic of a field in the input DataFrame to the result DataFrame.
- Parameters
summary_field (str) – The name of a field from the input DataFrame.
statistic (str) – Choose from Count, Sum, Mean, Max, Min, Range, Stddev, Var, or Any.
alias (str) – The name of the result field containing the statistic. The default is the field name and statistic separated by an underscore.
- run(dataframe)¶
Runs the FindDwellLocations tool using the provided DataFrame.
- Parameters
dataframe (DataFrame) – A DataFrame containing a point column with a spatial reference, a track ID column, and a datetime column.
- Return type
DataFrame
- setDistanceMethod(distance_method)¶
Sets the method used to calculate distances between track observations. There are two methods to choose from:
Planar: measures distances using a Euclidean plane and will not calculate statistics across the date line.
Geodesic: calculations will cross the date line when appropriate. This is the default. If the spatial reference cannot be panned, calculations will be limited to the coordinate system extent and may not wrap.
- Parameters
distance_method (str) – Choose from Planar or Geodesic.
- setDwellMaxDistance(max_distance, max_distance_unit)¶
Sets the maximum distance between points for them to be considered part of a single dwell event.
Note
This method is used along with setDwellMinDuration to define dwell criteria.
- Parameters
max_distance (float) – The maximum distance between points to be considered in a single dwell location.
max_distance_unit (str) – Choose from Meters, Kilometers, Feet, Miles, NauticalMiles, or Yards.
- setDwellMinDuration(min_duration, min_duration_unit)¶
Sets the minimum time between points for them to be considered part of a single dwell event.
Note
This method is used along with setDwellMaxDistance to define dwell criteria.
- Parameters
min_duration (int) – The minimum time duration of a dwell to be considered in a single dwell location.
min_duration_unit (str) – Choose from Milliseconds, Seconds, Minutes, Hours, Days, Weeks, Months, or Years.
- setOutputType(output_type)¶
Sets the output type.
DwellMeanCenters: A point representing the centroid of each discovered dwell location. This is the default.
DwellConvexHulls: Polygons representing the convex hull of each dwell group.
DwellPoints: All of the input points determined to belong to a dwell are returned.
AllPoints: All of the input points are returned.
- Parameters
output_type (str) – Choose from DwellMeanCenters, DwellConvexHulls, DwellPoints, or AllPoints.
- Returns
The result DataFrame specified by output_type.
- setTimeBoundarySplit(time_boundary_split, time_boundary_split_unit, time_boundary_reference=None)¶
Sets boundaries to limit calculations to defined spans of time. For example, if you use a time boundary of 1 day starting on January 1, 1980, tracks will be analyzed one day at a time.
- Parameters
time_boundary_split (int) – The scale of the time boundary.
time_boundary_split_unit (str) – Choose from Milliseconds, Seconds, Minutes, Hours, Days, Weeks, Months, or Years.
time_boundary_reference (int/long/datetime.datetime) – A reference datetime to align the time boundaries to. The default is epoch time 0.
- setTrackFields(*track_fields)¶
Sets one or more fields used to identify distinct tracks.
- Parameters
track_fields (*str) – The names of one or more fields from the input DataFrame.
Find Hot Spots¶
- class geoanalytics.tools.FindHotSpots¶
Aggregates points into square bins and finds statistically significant bins of high incidents (hot spots) and low incidents (cold spots).
This tool finds hot and cold spots using the Getis-Ord Gi* statistic. The local counts of points for a bin and its neighbors are compared proportionally to the sum of points in all bins. A local sum is considered statistically significant (larger z-score) when it is very different from the expected local sum and when that difference is too large to be the result of random chance.
Refer to the GeoAnalytics Engine guide for examples and usage notes: Find Hot Spots
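For reference, the standard Getis-Ord Gi* z-score for a single bin can be computed as in the sketch below. It assumes binary weights (1 for the bin and each neighbor, 0 otherwise); how the tool derives neighbors and weights from the neighborhood distance is not specified here, and `getis_ord_gi_star` is a hypothetical helper.

```python
import math

def getis_ord_gi_star(counts, neighbor_indices):
    """Gi* z-score for one bin using binary weights.

    counts holds the point count of every bin; neighbor_indices lists the
    bin's neighbors and must include the bin itself.
    """
    n = len(counts)
    mean = sum(counts) / n
    s = math.sqrt(sum(c * c for c in counts) / n - mean ** 2)
    w_sum = len(neighbor_indices)  # with binary weights, sum(w) equals sum(w^2)
    local_sum = sum(counts[j] for j in neighbor_indices)
    numerator = local_sum - mean * w_sum
    denominator = s * math.sqrt((n * w_sum - w_sum ** 2) / (n - 1))
    return numerator / denominator

# Bin 0 has a high count; its neighborhood is itself and bin 1.
z = getis_ord_gi_star([10, 1, 1, 1, 1], neighbor_indices=[0, 1])
```

A large positive z-score indicates a local sum well above what the overall mean would predict (a hot spot); a large negative one indicates a cold spot.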
- run(dataframe)¶
Runs the FindHotSpots tool using the provided DataFrame.
- Parameters
dataframe (DataFrame) – A DataFrame containing a point column with a projected spatial reference.
- Returns
A DataFrame of square bins assigned a z-score, p-value, and confidence level.
- Return type
DataFrame
- setBins(bin_size, bin_size_unit)¶
Sets the size of square bins used to find hot spots.
- Parameters
bin_size (float) – Distance between parallel sides of a bin.
bin_size_unit (str) – Choose from Meters, Kilometers, Feet, Miles, NauticalMiles, or Yards.
- setNeighborhood(distance, distance_unit)¶
Sets the size of the neighborhood used to find hot spots. The neighborhood size must be larger than the bin size.
- Parameters
distance (float) – Radius of the neighborhood, measured from each bin center.
distance_unit (str) – Choose from Meters, Kilometers, Feet, Miles, NauticalMiles, or Yards.
- setTimeStep(interval_duration, interval_unit, reference_time=None, alignment=None)¶
Sets the time step interval, reference time, and alignment. If set, hot spots will be calculated for each time step at each bin location. The input DataFrame must have a datetime column to use this setter.
- Parameters
interval_duration (int) – Duration of each time step.
interval_unit (str) – Choose from Milliseconds, Seconds, Minutes, Hours, Days, Weeks, Months, or Years.
reference_time (int/long/datetime.datetime) – A reference datetime to align the time steps to if alignment is ReferenceTime. The default is epoch time 0.
alignment (str) – Defines how aggregation will occur based on a given interval duration. Choose from StartTime, EndTime, or ReferenceTime.
Find Point Clusters¶
- class geoanalytics.tools.FindPointClusters¶
Finds clusters of points within surrounding noise based on their spatial or spatiotemporal distribution.
Two clustering methods are supported: DBSCAN and HDBSCAN. Both methods can find clusters in space, and DBSCAN can also find spatiotemporal clusters in time-enabled point layers.
Refer to the GeoAnalytics Engine guide for examples and usage notes: Find Point Clusters
- run(dataframe)¶
Runs the FindPointClusters tool using the provided DataFrame.
- Parameters
dataframe (DataFrame) – A DataFrame containing a point column with a projected spatial reference.
- Returns
A copy of the input DataFrame with a cluster ID assigned to each point.
- Return type
DataFrame
- setClusterMethod(cluster_method)¶
Sets the algorithm used for cluster analysis. Supported options are “DBSCAN” and “HDBSCAN”.
The DBSCAN algorithm uses a specified distance to separate dense clusters from sparser noise. DBSCAN is faster than HDBSCAN, but is only appropriate if there is a clear search distance to use that works well to define all clusters that may be present.
DBSCAN finds clusters that have similar densities. The HDBSCAN algorithm allows for clusters with varying densities based on cluster probability (or stability).
HDBSCAN is data-driven and does not use a search distance, but is a more time-consuming calculation than DBSCAN. The DBSCAN algorithm finds clusters in two-dimensional space by default. When setTimeMethod is called, DBSCAN will discover clusters in both space and time.
- Parameters
cluster_method (str) – Choose from DBSCAN or HDBSCAN.
- setMinPointsCluster(min_points_cluster)¶
This setter is used differently depending on the clustering method chosen. For DBSCAN, min_points_cluster specifies the number of points that must be found within a search range of a point for that point to start forming a cluster. The results may include clusters with fewer points than this value.
For HDBSCAN, min_points_cluster specifies the number of points neighboring each point (including the point itself) that will be considered when estimating density. This number is also the minimum cluster size allowed when extracting clusters.
- Parameters
min_points_cluster (int) – Number of points.
- setSearchDistance(search_distance, search_distance_unit)¶
Sets the search distance within which the number of points specified by setMinPointsCluster must be found (in addition to being within the search duration, if applicable) to form a cluster using the DBSCAN algorithm. This method is not used by HDBSCAN.
- Parameters
search_distance (float) – Distance within which min_points_cluster must be found to start forming a cluster. Results may include clusters with fewer points than min_points_cluster.
search_distance_unit (str) – Choose from Meters, Kilometers, Feet, Miles, NauticalMiles, or Yards.
- setSearchDuration(search_duration, search_duration_unit)¶
Sets the search duration within which the number of points specified by setMinPointsCluster must be found (in addition to being within the search distance) to form a cluster using the DBSCAN algorithm.
Warning
The input DataFrame must have a datetime column to use this setter.
Note
This method is not used by HDBSCAN.
- Parameters
search_duration (int) – Duration within which min_points_cluster must be found to start forming a cluster. Results may include clusters with fewer points than min_points_cluster.
search_duration_unit (str) – Choose from Milliseconds, Seconds, Minutes, Hours, Days, Weeks, Months, or Years.
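The way the search distance, search duration, and minimum point count combine for DBSCAN can be sketched as a test of whether a point can start forming a cluster. The sketch assumes planar distances and that the point counts itself as one of its neighbors; neither detail is stated in this reference, and `is_core_point` is a hypothetical helper.

```python
import math

def is_core_point(index, points, search_distance, min_points, search_duration=None):
    """Check whether points[index] has enough neighbors to start forming a cluster.

    points is a list of (x, y, t). The point itself is counted (an assumption).
    """
    x0, y0, t0 = points[index]
    neighbors = 0
    for x, y, t in points:
        near_in_space = math.hypot(x - x0, y - y0) <= search_distance
        # Without a search duration, only the spatial test applies.
        near_in_time = search_duration is None or abs(t - t0) <= search_duration
        if near_in_space and near_in_time:
            neighbors += 1
    return neighbors >= min_points

points = [(0, 0, 0), (1, 0, 5), (0, 1, 100), (50, 50, 0)]
spatial_only = is_core_point(0, points, search_distance=2, min_points=3)
space_and_time = is_core_point(0, points, search_distance=2, min_points=3, search_duration=10)
```

The third point is close in space but far in time, so it counts toward the spatial test and not the spatiotemporal one.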
Find Similar Locations¶
- class geoanalytics.tools.FindSimilarLocations¶
Measures the similarity of candidate locations to one or more reference locations.
This tool requires two DataFrames, one containing the reference locations and one containing candidate locations. Using specified fields representing the criteria to match, the tool will rank all of the candidate locations by how closely they match the reference locations.
Refer to the GeoAnalytics Engine guide for examples and usage notes: Find Similar Locations
- run(reference_dataframe, search_dataframe)¶
Runs the FindSimilarLocations tool using the provided DataFrames.
- Parameters
reference_dataframe (DataFrame) – A DataFrame containing one or more reference rows with attributes.
search_dataframe (DataFrame) – A DataFrame containing candidate locations that will be evaluated for similarity to the reference rows.
- Returns
The similarity statistics with appended fields.
- Return type
DataFrame
- setAnalysisFields(*analysis_fields)¶
Sets the fields that will be used to determine similarity. They must be numeric fields, and the fields must exist on both input DataFrames. Depending on the match method selected, the tool will find rows that are most similar based on values or profiles of the fields.
- Parameters
analysis_fields (*str) – The names of one or more fields from the input DataFrames.
- setAppendFields(*append_fields)¶
Sets which fields from the search DataFrame are included in the result. By default, all fields from the search DataFrame are appended.
- Parameters
append_fields (*str) – The names of one or more fields from the search DataFrame.
- setMatchMethod(match_method)¶
Sets the method that specifies how matching is determined. There are two options:
AttributeValues: uses the squared differences of standardized values. This is the default.
AttributeProfiles: uses cosine similarity mathematics to compare the profile of standardized values. This option requires the use of at least two analysis fields.
- Parameters
match_method (str) – Choose from AttributeValues or AttributeProfiles.
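The two match methods correspond to two different measures between attribute vectors, sketched below. The inputs are assumed to be already standardized; how the tool standardizes and ranks them is not shown, and both function names are hypothetical.

```python
import math

def squared_difference(reference, candidate):
    """AttributeValues-style score: smaller means more similar."""
    return sum((r - c) ** 2 for r, c in zip(reference, candidate))

def cosine_similarity(reference, candidate):
    """AttributeProfiles-style score: 1.0 means the profiles have the same shape."""
    dot = sum(r * c for r, c in zip(reference, candidate))
    norm = math.sqrt(sum(r * r for r in reference)) * math.sqrt(sum(c * c for c in candidate))
    return dot / norm

reference = [1.0, 2.0]
candidate = [2.0, 4.0]  # same profile shape, larger magnitude
value_score = squared_difference(reference, candidate)
profile_score = cosine_similarity(reference, candidate)
```

The candidate differs from the reference in magnitude but not in proportion, so it scores as a poor match by values and a perfect match by profile.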
- setMostOrLeastSimilar(most_or_least_similar)¶
Sets the rows that will be returned. You can return the rows that are most similar to the reference, the rows that are least similar, or both.
- Parameters
most_or_least_similar (str) – Choose from MostSimilar, LeastSimilar, or Both.
- setNumberOfResults(number_of_results)¶
Sets the number of ranked candidate rows to return. The default is 10 and the maximum allowed is 10000.
- Parameters
number_of_results (int) – Number of most or least similar locations to return.
GWR¶
- class geoanalytics.tools.GWR¶
Performs Geographically Weighted Regression (GWR), a local form of linear regression used to model spatially varying relationships.
GWR provides a local model of a variable by fitting a regression equation to every row in the input DataFrame using the geometry and any specified explanatory variables.
Refer to the GeoAnalytics Engine guide for examples and usage notes: GWR
- run(dataframe)¶
Runs the GWR tool using the provided DataFrame.
- Parameters
dataframe (DataFrame) – A DataFrame containing a point column with a projected spatial reference, dependent variables, and explanatory variables.
- Returns
A copy of the input DataFrame with model attributes appended to each row.
- Return type
DataFrame
- setDependentVariable(dependent_variable)¶
Sets the numeric field containing the observed values to model.
- Parameters
dependent_variable (str) – The name of a field in the input DataFrame.
- setDistanceBand(distance_band=None, distance_band_unit=None)¶
Sets the neighborhood size as a fixed distance for each feature.
Note
This method will override setNumNeighbors if called last.
- Parameters
distance_band (float) – The distance for the spatial extent of the neighborhood.
distance_band_unit (str) – Choose from Meters, Kilometers, Feet, Miles, NauticalMiles, or Yards.
- setExplanatoryVariables(*explanatory_variables)¶
Sets one or more fields to represent independent explanatory variables in the model.
- Parameters
explanatory_variables (*str) – The names of one or more fields from the input DataFrame.
- setLocalWeightingScheme(local_weighting_scheme)¶
Sets the kernel type that will be used to provide the spatial weighting in the model. The kernel defines how each point is related to other points within its neighborhood. Two options are supported:
Bisquare: assigns a weight of 0 to any geometry outside the neighborhood. This is the default.
Gaussian: assigns weights to all geometries, but weights become exponentially smaller the farther away they are from the target geometry.
- Parameters
local_weighting_scheme (str) – Choose from Bisquare or Gaussian.
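The qualitative difference between the two kernels can be seen in the common textbook forms below. The exact formulas the tool uses are not given in this reference, so treat these as assumptions; `bandwidth` stands for the neighborhood size and both function names are hypothetical.

```python
import math

def bisquare_weight(distance, bandwidth):
    """Weight is exactly 0 at and beyond the neighborhood edge."""
    if distance >= bandwidth:
        return 0.0
    return (1 - (distance / bandwidth) ** 2) ** 2

def gaussian_weight(distance, bandwidth):
    """Weight decays smoothly with distance and never reaches 0."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)

inside = (bisquare_weight(5, 10), gaussian_weight(5, 10))
outside = (bisquare_weight(12, 10), gaussian_weight(12, 10))
```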
- setNumNeighbors(number_of_neighbors)¶
Sets the neighborhood size as a function of a specified number of neighbors included in calculations for each point. Where points are dense, the spatial extent of the neighborhood is smaller; where points are sparse, the spatial extent of the neighborhood is larger.
Note
This method will override setDistanceBand if called last.
- Parameters
number_of_neighbors (int) – The number of neighbors included in calculations for each point.
Group By Proximity¶
- class geoanalytics.tools.GroupByProximity¶
Groups geometries that are within spatial or spatiotemporal proximity of each other.
Refer to the GeoAnalytics Engine guide for examples and usage notes: Group By Proximity
- run(dataframe)¶
Runs the GroupByProximity tool using the provided DataFrame.
- Parameters
dataframe (DataFrame) – A DataFrame containing a geometry column.
- Returns
A copy of the input DataFrame with a column of group IDs appended.
- Return type
DataFrame
- setAttributeRelationship(expression, expression_type='sql')¶
Sets the attribute relationship expression to further refine groupings.
- Parameters
expression (str) – Expression representing the attribute relationship.
expression_type (str) – Choose from Arcade or SQL.
- setSpatialRelationship(spatial_relationship='Intersects', near_distance=None, near_distance_unit=None)¶
Sets the type of spatial relationship to group by.
- Parameters
spatial_relationship (str) – Choose from Intersects, Touches, NearGeodesic, or NearPlanar.
near_distance (float) – The search distance to determine if geometries are near one another. This is only applied if NearGeodesic or NearPlanar is set as the spatial relationship.
near_distance_unit (str) – Choose from Meters, Kilometers, Feet, Miles, NauticalMiles, or Yards.
- setTemporalRelationship(temporal_relationship='Intersects', temporal_distance=None, temporal_distance_unit=None)¶
Sets the type of temporal relationship to group by.
- Parameters
temporal_relationship (str) – Choose from Intersects or Near.
temporal_distance (int) – Sets the temporal search distance to determine if geometries are near one another.
temporal_distance_unit (str) – Choose from Milliseconds, Seconds, Minutes, Hours, Days, Weeks, Months, or Years.
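The idea of assigning group IDs by proximity can be sketched with a small union-find over planar points. This is a conceptual illustration only: it assumes that groups are transitive (A near B and B near C places all three in one group) and that the distance comparison is inclusive, neither of which is stated above. The function name is hypothetical.

```python
import math

def group_by_near_planar(points, near_distance):
    """Return one group ID per point; points within near_distance are merged transitively."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps the trees shallow
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= near_distance:
                parent[find(j)] = find(i)  # merge the two groups

    return [find(i) for i in range(len(points))]

points = [(0, 0), (1, 0), (2, 0), (50, 50)]
print(group_by_near_planar(points, near_distance=1.5))
```

The first three points end up in one group even though the first and third are 2 units apart, because each is near the point between them.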
Overlay¶
- class geoanalytics.tools.Overlay¶
Combines two or more geometry columns into a single column using a spatial overlay operation.
Note
This tool operates on the entire input DataFrame and thus can be more performant than equivalent row-wise operations using SQL functions.
Refer to the GeoAnalytics Engine guide for examples and usage notes: Overlay
- run(input_dataframe, overlay_dataframe)¶
Runs the Overlay tool using the provided DataFrames.
- Parameters
input_dataframe (DataFrame) – A DataFrame containing a geometry column.
overlay_dataframe (DataFrame) – A DataFrame containing a geometry column to overlay.
- Returns
A DataFrame containing the result of the overlay.
- Return type
DataFrame
- setOverlayType(overlay_type)¶
Sets the type of overlay to be performed.
- Parameters
overlay_type (str) – Choose from Intersect, Erase, Union, Identity, or SymmetricalDifference.
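As a rough analogy, the area covered by four of the overlay types can be pictured with set operations on the grid cells each geometry occupies. This is an illustrative stand-in rather than the tool's behavior: real overlays also split geometries and carry attributes, and Identity is omitted here because its result depends on how attributes are combined.

```python
# Each "geometry" is represented by the set of unit grid cells it covers.
input_cells = {(0, 0), (1, 0), (2, 0)}
overlay_cells = {(2, 0), (3, 0)}

overlay_analogy = {
    "Intersect": input_cells & overlay_cells,              # area common to both
    "Erase": input_cells - overlay_cells,                  # input area not covered by the overlay
    "Union": input_cells | overlay_cells,                  # area covered by either
    "SymmetricalDifference": input_cells ^ overlay_cells,  # area covered by exactly one
}

for name, cells in overlay_analogy.items():
    print(name, sorted(cells))
```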
Reconstruct Tracks¶
- class geoanalytics.tools.ReconstructTracks¶
Creates a line or polygon representing an entity’s path of movement over time using points or polygons with associated timestamps.
This tool groups input rows into tracks representing unique entities using a track identifier field. It then creates a linestring by connecting the point observations for each entity sequentially. The linestring can be buffered with a variable distance using a field from the input DataFrame.
Refer to the GeoAnalytics Engine guide for examples and usage notes: Reconstruct Tracks
- addSummaryField(summary_field, statistic, alias=None)¶
Adds a summary statistic of a field in the input DataFrame to the result DataFrame.
- Parameters
summary_field (str) – The name of a field from the input DataFrame.
statistic (str) – Choose from Count, Sum, Mean, Max, Min, Range, Stddev, Var, or Any.
alias (str) – The name of the result field containing the statistic. The default is the field name and statistic separated by an underscore.
- run(dataframe)¶
Runs the ReconstructTracks tool using the provided DataFrame.
- Parameters
dataframe (DataFrame) – A DataFrame containing a point or polygon column, a track ID column, and a datetime column.
- Returns
A DataFrame containing the result linestrings or polygons.
- Return type
DataFrame
- setArcadeSplit(arcade_split)¶
Sets an Arcade expression used to split tracks. The expression is evaluated for each point in a track, and the track is split wherever the expression evaluates to True.
- Parameters
arcade_split (str) – An Arcade expression.
- setBufferField(buffer_field)¶
Sets a field in the input DataFrame that contains a buffer distance or a buffer expression. A buffer expression must begin with an equal sign (=).
- Parameters
buffer_field (str) – The name of a field from the input DataFrame.
- setDistanceMethod(distance_method)¶
Sets the method used to calculate distances between track observations. There are two methods to choose from:
Planar: measures distances using a Euclidean plane and will not calculate statistics across the date line.
Geodesic: calculations will cross the date line when appropriate. This is the default. If the spatial reference cannot be panned, calculations will be limited to the coordinate system extent and may not wrap.
- Parameters
distance_method (str) – Choose from Planar or Geodesic.
- setDistanceSplit(distance_split, distance_split_unit)¶
Sets the distance used to split tracks. Any rows in the input DataFrame that are in the same track and are farther apart than this distance will be split into a new track. If both the distance split and the time split are used, the track is split when at least one condition is met.
- Parameters
distance_split (float) – The distance used to split tracks.
distance_split_unit (str) – Choose from Meters, Kilometers, Feet, Miles, NauticalMiles, or Yards.
- setSplitBoundaryOption(split_boundary_option)¶
Sets how the track segment between two points is created when a track is split. The split type is applied to split expressions, distance splits, and time splits. There are three options:
Gap: no segment is created between the two points. This is the default.
FinishLast: a segment is created between the two points that ends after the split.
StartNext: a segment is created between the two points that ends before the split.
- Parameters
split_boundary_option (str) – Choose from Gap, FinishLast, or StartNext.
- setTimeBoundarySplit(time_boundary_split, time_boundary_split_unit, time_boundary_reference=None)¶
Sets boundaries to limit calculations to defined spans of time. For example, if you use a time boundary of 1 day, starting on January 1, 1980, tracks will be analyzed one day at a time.
- Parameters
time_boundary_split (int) – The scale of the time boundary.
time_boundary_split_unit (str) – Choose from Milliseconds, Seconds, Minutes, Hours, Days, Weeks, Months, or Years.
time_boundary_reference (int/long/datetime.datetime) – A reference datetime to align the time boundaries to. The default is epoch time 0.
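The alignment of time boundaries to a reference time can be sketched as integer division of the elapsed time by the boundary length. The helper below is hypothetical, and how the tool treats an observation that falls exactly on a boundary is not specified above.

```python
from datetime import datetime, timedelta, timezone

def time_boundary_index(timestamp, boundary, reference):
    """Index of the boundary-length interval, counted from reference, that contains timestamp."""
    return (timestamp - reference) // boundary

reference = datetime(1980, 1, 1, tzinfo=timezone.utc)
one_day = timedelta(days=1)
observations = [
    datetime(1980, 1, 1, 23, 59, tzinfo=timezone.utc),
    datetime(1980, 1, 2, 0, 1, tzinfo=timezone.utc),
]

# Observations two minutes apart land in different one-day spans.
print([time_boundary_index(t, one_day, reference) for t in observations])
```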
- setTimeSplit(time_split, time_split_unit)¶
Sets the time duration used to split tracks. Any rows in the input DataFrame that are in the same track and are farther apart than this time will be split into a new track. If both the distance split and time split are used, a track is split when at least one condition is met.
- Parameters
time_split (int) – The time duration used to split tracks.
time_split_unit (str) – Choose from Milliseconds, Seconds, Minutes, Hours, Days, Weeks, Months, or Years.
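The rule that a track is split when at least one of the distance and time conditions is met can be sketched in plain Python. This is a simplified planar illustration with hypothetical names, with time in seconds and distance in coordinate units; it starts a new track at the later observation and creates no connecting segment, which corresponds to the Gap boundary option.

```python
import math

def split_track(observations, distance_split=None, time_split=None):
    """Split time-ordered (t_seconds, x, y) observations when either threshold is exceeded."""
    if not observations:
        return []
    tracks = [[observations[0]]]
    for prev, cur in zip(observations, observations[1:]):
        too_far = distance_split is not None and math.dist(prev[1:], cur[1:]) > distance_split
        too_late = time_split is not None and cur[0] - prev[0] > time_split
        if too_far or too_late:
            tracks.append([cur])    # start a new track at the current observation
        else:
            tracks[-1].append(cur)  # continue the current track
    return tracks

obs = [(0, 0.0, 0.0), (60, 10.0, 0.0), (4000, 20.0, 0.0), (4060, 500.0, 0.0)]
print(len(split_track(obs, distance_split=100.0, time_split=3600)))
```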
- setTrackFields(*track_fields)¶
Sets one or more fields used to identify distinct tracks.
- Parameters
track_fields (*str) – The names of one or more fields from the input DataFrame.
Spatiotemporal Join¶
- class geoanalytics.tools.SpatiotemporalJoin¶
Joins attributes from one DataFrame to another based on spatial, temporal, and attribute relationships or some combination of the three.
The tool determines all input rows that meet the specified join conditions and joins the second DataFrame to the first. You can optionally join all rows to the matching rows or summarize the matching rows.
Refer to the GeoAnalytics Engine guide for examples and usage notes: Spatiotemporal Join
- addSummaryField(summary_field, statistic, alias=None)¶
Adds a summary statistic of a field in the input DataFrame to the result DataFrame.
- Parameters
summary_field (str) – The name of a field from the input DataFrame.
statistic (str) – Choose from Count, Sum, Mean, Max, Min, Range, Stddev, Var, or Any.
alias (str) – The name of the result field containing the statistic. The default is the field name and statistic separated by an underscore.
- run(target_dataframe, join_dataframe)¶
Runs the SpatiotemporalJoin tool using the provided DataFrames.
- Parameters
target_dataframe (DataFrame) – A DataFrame.
join_dataframe (DataFrame) – A DataFrame to join.
- Returns
A DataFrame containing the result of the join.
- Return type
DataFrame
- setAttributeRelationship(attribute_relationship)¶
Sets a target field, relationship, and join field used to join equal attributes.
Use an equals relationship (equal in JSON, or = in the string format) to join on exactly equal values. To join strings that are equal regardless of casing and of leading or trailing white space, use equalIgnoreCaseTrimWhiteSpace in JSON, or ~= in the string format.
- Parameters
attribute_relationship (str) – Expression representing the attribute relationship.
- setJoinCondition(join_condition)¶
Applies a condition to specified fields using an Arcade expression. Only rows with columns that meet this condition will be joined.
- Parameters
join_condition (str) – An Arcade expression.
- setJoinOneToMany()¶
Sets the join operation to one to many. If multiple join rows are found that have the same relationships with a single target row, the result DataFrame will contain multiple copies of the target row.
For example, if a single point in the target DataFrame is found within two separate polygons in the join DataFrame, the result DataFrame will contain two copies of the target row: one row with the attributes of one polygon and another row with the attributes of the other polygon. There are no summary statistics available with this method.
Note
This method will override setJoinOneToOne.
- setJoinOneToOne()¶
Sets the join operation to one to one. If multiple join rows are found that have the same relationships with a single target row, the fields from the multiple join rows will be aggregated using the specified summary statistics.
For example, if a point is found within two separate polygons, the fields associated with the two polygons will be aggregated before being returned in the result DataFrame. If one polygon has an attribute value of 3 and the other has a value of 7, and a summary statistic of sum is specified, the aggregated value in the output DataFrame will be 10. There will always be a Count field calculated, with a value of 2, for the number of rows specified.
Note
This method will override setJoinOneToMany.
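The aggregation described for a one-to-one join can be sketched in plain Python. The helper and the output keys are hypothetical and only mirror the worked example above (values 3 and 7 with a sum statistic); they are not the tool's output schema.

```python
def summarize_matches(join_values, statistic):
    """Collapse the values of all matching join rows into one statistic plus a count."""
    statistics = {
        "Sum": sum,
        "Min": min,
        "Max": max,
        "Mean": lambda values: sum(values) / len(values),
    }
    return {"Count": len(join_values), statistic: statistics[statistic](join_values)}

# One target point falls within two polygons whose attribute values are 3 and 7.
print(summarize_matches([3, 7], "Sum"))
```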
- setLeftJoin(left_join=True)¶
Specifies whether all target rows will be returned in the result DataFrame (known as a left outer join) or only those that have the specified relationships with the join rows (inner join). Left outer join can be used only with a one-to-one join and is not supported for one-to-many join.
- Parameters
left_join (bool) – If True a left outer join will be used, if False an inner join will be used.
- setSpatialRelationship(spatial_relationship, near_distance=None, near_distance_unit=None)¶
Sets the spatial relationship used to spatially join rows.
- Parameters
spatial_relationship (str) – Choose from Equals, Intersects, Contains, Within, Crosses, Touches, Overlaps, NearPlanar, or NearGeodesic.
near_distance (float) – A double value used for the search distance to determine if a target geometry is near a join geometry. This is only applied if NearPlanar or NearGeodesic is the specified spatial relationship.
near_distance_unit (str) – Choose from Meters, Kilometers, Feet, Miles, NauticalMiles, or Yards.
- setTemporalRelationship(temporal_relationship, near_duration=None, near_duration_unit=None)¶
Sets the temporal relationship used to temporally join rows.
- Parameters
temporal_relationship (str) – Choose from Equals, Intersects, During, Contains, Finishes, FinishedBy, Meets, MetBy, Overlaps, OverlappedBy, Starts, StartedBy, Near, NearBefore, or NearAfter.
near_duration (int) – An integer value used for the temporal search distance to determine if a target geometry is temporally near a join geometry.
near_duration_unit (str) – Choose from Milliseconds, Seconds, Minutes, Hours, Days, Weeks, Months, or Years.
Summarize Within¶
- class geoanalytics.tools.SummarizeWithin¶
Summarizes geometries from the input DataFrame where they intersect summary polygons or bins using statistics.
Refer to the GeoAnalytics Engine guide for examples and usage notes: Summarize Within
- Result¶
alias of
geoanalytics.tools.summarize_data.SummarizeWithinResult
- addStandardSummaryField(summary_field, statistic, alias=None)¶
Adds a summary statistic of a field in the input DataFrame to the result DataFrame.
- Parameters
summary_field (str) – The name of a field from the input DataFrame.
statistic (str) – Choose from Count, Sum, Mean, Max, Min, Range, Stddev, Var, or Any.
alias (str) – The name of the result field containing the statistic. The default is the field name and statistic separated by an underscore.
- includeShapeSummary(include=True, units=None)¶
Sets the inclusion of calculated statistics based on the geometry type of the primary geometry column in the input DataFrame, such as the length of lines or areas of polygons within each summary polygon.
- Parameters
include (bool) – If True, geometry summary statistics will be included in the result.
units (str) – Choose from Meters, Kilometers, Feet, Miles, NauticalMiles, Yards, SquareMeters, SquareKilometers, Hectares, SquareFeet, SquareYards, SquareMiles or Acres.
- run(dataframe)¶
Runs the SummarizeWithin tool using the provided DataFrame.
- Parameters
dataframe (DataFrame) – A DataFrame containing a geometry column.
- Returns
A named tuple with a DataFrame containing the summary polygons and a DataFrame containing the group-by summary (if applicable).
- Return type
namedtuple
- setGroupBy(group_by_field, include_minor_major_fields=True, include_group_percentages=True)¶
Sets a field from the input DataFrame that will be used to calculate statistics for each unique value.
When setGroupBy is called, the tool will return a DataFrame containing the statistics in addition to a DataFrame containing the summaries.
For example, suppose the input DataFrame contains city boundaries and the polygons set by setSummaryPolygons are parcels. One of the fields of the parcels is Status which contains two values: VACANT and OCCUPIED. To calculate the total area of vacant and occupied parcels within the boundaries of cities, use Status as the group-by field.
- Parameters
group_by_field (str) – The name of a field from the input DataFrame.
include_minor_major_fields (bool) – If True, the minority (least dominant) or the majority (most dominant) attribute values for each group will be included in the result.
include_group_percentages (bool) – If True, the percentage of each unique field value is calculated for each summary polygon.
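What the minority, majority, and percentage outputs express can be sketched for a single summary polygon. This illustration ranks values by row count; the tool may instead weight by area or length, so treat the weighting, the helper name, and the result keys as assumptions.

```python
from collections import Counter

def group_summary(values):
    """Majority, minority, and percentage share of each unique value, by row count."""
    counts = Counter(values)
    ranked = counts.most_common()  # most frequent value first
    total = sum(counts.values())
    return {
        "majority": ranked[0][0],
        "minority": ranked[-1][0],
        "percentages": {value: 100.0 * n / total for value, n in counts.items()},
    }

statuses_in_city = ["OCCUPIED", "OCCUPIED", "OCCUPIED", "VACANT"]
print(group_summary(statuses_in_city))
```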
- setSummaryBins(bin_size, bin_size_unit, bin_type='square')¶
Sets the size and shape of bins that the input DataFrame will be summarized into.
Note
This method overrides setSummaryPolygons. Use setSummaryPolygons if summarizing into an existing column of polygons.
- Parameters
bin_size (float) – Distance between parallel sides of a bin.
bin_size_unit (str) – Choose from Meters, Kilometers, Feet, Miles, NauticalMiles, or Yards.
bin_type (str) – Choose from Square or Hexagon.
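For square bins, the bin size is the side length, so assigning a point to a bin reduces to a floor division per axis. The sketch below assumes planar coordinates in the same unit as the bin size and bins anchored at the origin; the tool's actual bin origin is not stated above, and the function name is hypothetical.

```python
import math

def square_bin_index(x, y, bin_size):
    """Column and row of the square bin containing (x, y), with bins anchored at the origin."""
    return (math.floor(x / bin_size), math.floor(y / bin_size))

print(square_bin_index(250.0, 1999.0, 1000.0))
print(square_bin_index(-1.0, 0.0, 1000.0))
```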
- setSummaryPolygons(summary_polygons)¶
Sets the DataFrame containing a column of polygons that the input DataFrame will be summarized into.
Note
This method overrides setSummaryBins. Use setSummaryBins instead if summarizing into square or hexagon bins that are generated when the tool runs.
- Parameters
summary_polygons (pyspark.sql.DataFrame) – A DataFrame containing a polygon column.
Trace Proximity Events¶
- class geoanalytics.tools.TraceProximityEvents¶
Analyzes points representing moving entities. The tool follows entities of interest in space (location) and time to see which other entities they have interacted with. The trace continues from entity to entity up to a configurable maximum number of degrees of separation from the original entity of interest.
For example, suppose an organization monitors company-issued devices carried by workers. The company is interested in determining which employees were near an individual known to have COVID-19. Using the point layer representing device locations and time, they can identify devices that have been within 6 meters and 5 minutes of the contagious person and other possibly contagious employees.
Refer to the GeoAnalytics Engine guide for examples and usage notes: Trace Proximity Events
- Result¶
alias of
geoanalytics.tools.use_proximity.TraceProximityEventsResult
- includeTracksDataFrame()¶
Includes a second DataFrame with the points used in the trace.
- run(dataframe)¶
Runs the TraceProximityEvents tool using the provided DataFrame.
- Parameters
dataframe (DataFrame) – A DataFrame containing a point column, timestamp column, and entity ID column.
- Returns
A named tuple containing a copy of the input DataFrame with proximity event info appended and a DataFrame containing only points used in the trace.
- Return type
namedtuple
- setAttributeMatchCriteria(*attribute_match_criteria)¶
Sets one or more fields used to constrain the proximity events. Entities will only be considered near when the spatial search distance and temporal search distance criteria are met and the two entities have equal values of the fields specified.
- Parameters
attribute_match_criteria (*str) – The names of one or more fields from the input DataFrame.
- setDistanceMethod(distance_method)¶
Sets the method used to calculate distances between track observations. There are two methods to choose from:
Planar: measures distances using a Euclidean plane and will not calculate statistics across the date line.
Geodesic: calculations will cross the date line when appropriate. This is the default. If the spatial reference cannot be panned, calculations will be limited to the coordinate system extent and may not wrap.
- Parameters
distance_method (str) – Choose from Planar or Geodesic.
- setEntitiesOfInterestIds(entities_of_interest_ids)¶
Sets one or more entities that you are interested in tracing from, as well as a time to start tracing from. If you do not specify a time, January 1, 1970, at 12:00 a.m. will be used.
- Parameters
entities_of_interest_ids (str) – A stringified list of dictionaries containing entity IDs and times in epoch ms.
- Example
'[{"entityID": "user5", "epochTimeStamp": 1598390663000}, {"entityID": "user9", "epochTimeStamp": None}]'
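One way to build such a string from Python datetimes is sketched below, using only the standard library. The helper name is hypothetical. Note that json.dumps would write a missing time as null, whereas the example above writes None; this sketch sidesteps that question by including only entities that have a start time.

```python
import json
from datetime import datetime, timezone

def to_epoch_ms(moment):
    """Milliseconds since 1970-01-01T00:00:00Z for a timezone-aware datetime."""
    return int(moment.timestamp() * 1000)

start = datetime(2020, 8, 25, 21, 24, 23, tzinfo=timezone.utc)
entities = [{"entityID": "user5", "epochTimeStamp": to_epoch_ms(start)}]

# Stringify the list of dictionaries for use as the entities_of_interest_ids argument.
entities_of_interest_ids = json.dumps(entities)
print(entities_of_interest_ids)
```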
- setEntityIdField(entity_id_field)¶
Sets the field used to identify distinct entities.
- Parameters
entity_id_field (str) – The name of a field from the input DataFrame.
- setMaxTraceDepth(max_trace_depth)¶
Sets the maximum degrees of separation between an entity of interest and an entity further down the trace.
- Parameters
max_trace_depth (int) – Degrees of separation.
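The effect of a maximum trace depth can be pictured as a breadth-first search over a graph of proximity events that stops expanding at the configured depth. This sketch is conceptual only: it ignores the time ordering that a real trace follows, it numbers the entity of interest as depth 0 (an assumption), and the names are hypothetical.

```python
from collections import deque

def trace(contacts, entities_of_interest, max_trace_depth):
    """Breadth-first trace over a contact graph, returning each reached entity's depth."""
    depth = {entity: 0 for entity in entities_of_interest}
    queue = deque(entities_of_interest)
    while queue:
        current = queue.popleft()
        if depth[current] == max_trace_depth:
            continue  # do not trace beyond the configured separation
        for other in contacts.get(current, ()):
            if other not in depth:
                depth[other] = depth[current] + 1
                queue.append(other)
    return depth

contacts = {"user5": ["user9"], "user9": ["user5", "user2"], "user2": ["user9", "user7"]}
print(trace(contacts, ["user5"], max_trace_depth=2))
```

With a depth of 2, user7 is not reached because it is three steps away from user5.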
- setSearchDistance(search_distance, search_distance_unit)¶
Sets the maximum distance between two points to be considered in proximity. Points closer together in space and that also meet the search duration criteria are considered in proximity of each other.
Note
This method is used along with setSearchDuration to define proximity.
- Parameters
search_distance (float) – The search distance used to determine if points are in proximity.
search_distance_unit (str) – Choose from Meters, Kilometers, Feet, Miles, NauticalMiles, or Yards.
- setSearchDuration(search_duration, search_duration_unit)¶
Sets the maximum duration between two points that are considered in proximity. Points closer together in time and that also meet the search distance criteria are considered in proximity of each other.
Note
This method is used along with setSearchDistance to define proximity.
- Parameters
search_duration (int) – The search duration used to determine if points are in proximity.
search_duration_unit (str) – Choose from Milliseconds, Seconds, Minutes, Hours, Days, Weeks, Months, or Years.
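The way search distance and search duration combine can be sketched as a single predicate that requires both criteria to hold. This is a planar simplification with hypothetical names, using seconds for time and inclusive comparisons, which are assumptions.

```python
import math

def in_proximity(a, b, search_distance, search_duration):
    """True when two (t_seconds, x, y) points are close enough in both space and time."""
    close_in_space = math.dist(a[1:], b[1:]) <= search_distance
    close_in_time = abs(a[0] - b[0]) <= search_duration
    return close_in_space and close_in_time

# 6 meters and 5 minutes (300 seconds), as in the example in the class description.
print(in_proximity((0, 0.0, 0.0), (120, 3.0, 4.0), 6.0, 300))  # near in space and time
print(in_proximity((0, 0.0, 0.0), (600, 3.0, 4.0), 6.0, 300))  # near in space, too far apart in time
```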