This quick tutorial demonstrates some of the basic capabilities of ArcGIS GeoAnalytics Engine, including how to access and manipulate data through DataFrames, how to use SQL functions and analysis tools, and how to visualize and save your results. To run the code samples in this tutorial, you should have GeoAnalytics Engine installed and authorized in a running Spark session.
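For reference, a minimal setup sketch, assuming authorization with a username and password (the credentials below are placeholders; other authorization methods are also supported):
Python
import geoanalytics

# Authorize GeoAnalytics Engine for use in the active Spark session
# (replace the placeholder credentials with your own)
geoanalytics.auth(username="user1", password="p@ssw0rd")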
Creating DataFrames
GeoAnalytics Engine extends PySpark, the Python interface for Spark, and uses Spark
DataFrames along with custom geometry data types to represent spatial
data. A Spark DataFrame is like a Pandas DataFrame or a table in a
relational database but is optimized for distributed queries.
GeoAnalytics Engine comes with several DataFrame extensions for reading from spatial
data sources like shapefiles and feature services, in addition to any
data source that PySpark supports. When reading from a shapefile or
feature service, a geometry column will be created automatically. For
other data sources, a geometry column can be created from text or binary
columns using GeoAnalytics Engine functions.
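For example, a minimal sketch that parses a well-known text (WKT) column into a geometry column; the CSV path and the column name wkt are assumptions:
Python
import geoanalytics.sql.functions as ST

# Read a CSV with a standard PySpark data source (the path is a placeholder)
csv_df = spark.read.options(header=True, inferSchema=True).csv("path/to/data.csv")

# Parse the WKT strings into geometries; 4326 sets the spatial reference to WGS84
geom_df = csv_df.withColumn("shape", ST.geom_from_text("wkt", 4326))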
The following example shows how to create a Spark DataFrame from a feature service of USA county boundaries, and then show the column names and types.
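A minimal sketch, assuming an active SparkSession named spark; the feature service URL below is a placeholder for a USA counties layer:
Python
# Read from a feature service; the geometry column is created automatically
url = "https://services.arcgis.com/<org-id>/arcgis/rest/services/USA_Counties/FeatureServer/0"
df = spark.read.format("feature-service").load(url)

# Show the column names and types
df.printSchema()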
Running SQL functions
geoanalytics.sql.functions contains spatial functions that operate on columns to do things like create or export geometries, identify spatial relationships, generate bins, and more. These functions can be called as Python functions or through SQL expressions, similar to Spark SQL functions.
The following example shows how to use a SQL function through Python.
Python
import geoanalytics.sql.functions as ST
# Calculate the centroid of each county polygon
county_centroids = df.select("Name", ST.centroid("shape"))

# Display the first 5 rows of the result
county_centroids.show(5)
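The same function can also be invoked through SQL by registering the DataFrame as a temporary view (a sketch; the view name counties is an assumption):
Python
# Register the DataFrame so it can be queried with Spark SQL
df.createOrReplaceTempView("counties")

# Call the same function by its SQL name, ST_Centroid
spark.sql("SELECT Name, ST_Centroid(shape) AS centroid FROM counties").show(5)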
Running analysis tools
geoanalytics.tools contains spatial and spatiotemporal analysis tools that execute multi-step workflows on entire DataFrames using geometry, time, and other values. These tools can only be invoked through their associated Python classes. The following example uses the Find Similar Locations tool to find the counties most similar to one county by population count and density.
Python
from geoanalytics.tools import FindSimilarLocations
# Use Find Similar Locations to find counties with population count and density like Alexander County
fsl = FindSimilarLocations() \
.setAnalysisFields("POP_SQMI","POPULATION") \
.setMostOrLeastSimilar("MostSimilar") \
.setNumberOfResults(5) \
.setAppendFields("NAME", "STATE_NAME") \
.run(df.where("NAME = 'Alexander County'"), df.where("NAME != 'Alexander County'"))
# Show the result
fsl.select("simrank", "NAME", "STATE_NAME").filter("NAME is not NULL").sort("simrank").show()
Visualizing results
When scripting in a notebook-like environment, GeoAnalytics Engine supports simple visualization of spatial data with an included plotting API based on matplotlib.
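For example, a minimal sketch that draws the county polygons with the st.plot accessor; the styling keyword arguments are assumptions passed through to matplotlib:
Python
# Plot the geometry column of the counties DataFrame
df.st.plot(facecolor="none", edgecolor="black", figsize=(12, 6))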
Saving results
Any DataFrame can be persisted by writing it to a collection of shapefiles, vector tiles, or any data sink supported by PySpark.
Python
# Write the DataFrame returned from the tool as a collection of shapefiles to an S3 bucket
fsl.write.format("shapefile").save("s3a://my-bucket/fsl_result")
What next?
To get started, learn about loading data into DataFrames, running analysis tools and SQL functions, and visualizing results through the available guides and tutorials: