Frequently asked questions
Refer to the Get started guide topic to learn how to install GeoAnalytics Engine and about some of its features. There are a variety of resources for getting familiar with the module. For example:
Documentation is divided into two main components:
- API Reference—A concise reference manual containing details about all functions, classes, return types, and arguments in GeoAnalytics Engine.
- Guide—Descriptions, code samples, and usage notes for each function and tool, as well as installation instructions, core concepts, frequently asked questions, and tutorials.
The Spark SQL programming guide provides a high-level overview of Spark DataFrames and Spark SQL functions and includes extensive examples in Scala, Java, and Python. See the Machine Learning Library (MLlib) guide to learn more about Spark’s capabilities in the areas of classification, regression, clustering, and more.
To learn more about PySpark (the Python API for Spark) specifically, see the PySpark Quickstart and API reference. Spark also comes with a collection of PySpark examples that you can use to become more familiar with the API.
See Esri’s guide called What is GIS? to find more information and resources. The ArcGIS Book is a great free resource for learning about all things GIS, especially the basics of spatial analysis. For more inspiration see the sample notebooks and blog posts for GeoAnalytics Engine.
See the install guide for a complete list of officially supported Spark environments. These configurations have been tested and certified to work as documented in the install guide. Using GeoAnalytics Engine with other Spark runtimes or deployment environments may cause some functions or tools to not work correctly.
See Dependencies for a complete description of the install requirements for each version of Spark and GeoAnalytics Engine.
Yes, you must authorize the geoanalytics module before running any function or tool. See Licensing and Authorization for more information.
The size and scale of the Spark cluster you should use depends on the amount of data you’re working with, the type of analysis or queries being run, and the performance required by your use case.
Deploying a Spark cluster in the cloud is a great option if you don’t know what size you need. Managed Spark services let you scale resources up or down quickly without purchasing hardware or making long-term commitments. This means that you can start with an estimate of the cluster size you need and scale it out if the performance you observe calls for it.
Just as important as the number of cores and the amount of RAM is the ratio of RAM to cores. It is recommended that you have at least 10 GB of RAM per core so that each Spark executor has sufficient memory for computations.
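As a rough illustration of the RAM-to-core guideline, the following sketch checks whether a candidate cluster meets the 10 GB per core recommendation. The function names and the example cluster sizes are hypothetical; only the 10 GB threshold comes from the guidance above.

```python
RECOMMENDED_GB_PER_CORE = 10  # guideline quoted in the text above


def ram_per_core(total_ram_gb: float, total_cores: int) -> float:
    """Return the GB of RAM available per core."""
    if total_cores <= 0:
        raise ValueError("total_cores must be positive")
    return total_ram_gb / total_cores


def meets_guideline(total_ram_gb: float, total_cores: int) -> bool:
    """True if the cluster has at least the recommended RAM per core."""
    return ram_per_core(total_ram_gb, total_cores) >= RECOMMENDED_GB_PER_CORE


# Hypothetical clusters: 16 cores with 128 GB vs. 16 cores with 256 GB
print(meets_guideline(128, 16))  # 8 GB per core, below the guideline
print(meets_guideline(256, 16))  # 16 GB per core, meets the guideline
```

A check like this is only a starting point; the right size still depends on your data volume and the analysis being run.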
All functions and tools in GeoAnalytics Engine operate on Spark DataFrames or DataFrame columns. Therefore, the API supports any data source or format that can be loaded into a DataFrame. Spark includes built-in support for reading from Parquet, ORC, JSON, CSV, Text, Binary, and Avro files as well as Hive Tables and JDBC to other Databases. GeoAnalytics Engine also includes native support for reading from shapefiles and feature services, and for writing to vector tiles and shapefiles. See Data sources for a summary of the spatial data sources and sinks supported by GeoAnalytics Engine.
The way you connect to a cloud store is different for every cloud store and cloud provider. Some cloud stores have connectors that are included with Apache Hadoop and thus Apache Spark. For example, Hadoop comes with an Amazon S3 connector called s3a that can be used from any Spark cluster that is connected to the internet. Other cloud providers may manage their own connector or may not have direct Spark integration and may require you to mount the cloud store as a local drive on your Spark cluster.
No, GeoAnalytics Engine functions and tools operate on vector geometry data only. This includes points, lines, polygons, multipoints, and generic vector geometries.
The most common way to create a DataFrame is by loading data from a supported data source using spark.read.load(). For example:
df = spark.read.load("examples/src/main/resources/users.parquet")
You can also create a DataFrame from a list of values or a Pandas DataFrame. See Using DataFrames for more information.
PySpark DataFrames (often referred to as DataFrames or Spark DataFrames in this documentation) are distributed across a Spark cluster and any operations on them are executed in parallel on all nodes of the cluster. Pandas DataFrames are stored in memory on a single node and operations on them are executed on a single thread. This means that the performance of Pandas DataFrames cannot be scaled out to handle larger datasets and is limited by the memory available on a single machine.
PySpark DataFrames are also immutable, whereas Pandas DataFrames are mutable. In addition, PySpark uses lazy execution, which means that tasks are not executed until specific actions are taken. In contrast, Pandas uses eager execution, which means that tasks are executed as soon as they are called.
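The difference between eager and lazy execution can be illustrated without a Spark cluster. The sketch below is an analogy in plain Python, not PySpark code: a list comprehension stands in for eager execution, a generator stands in for a lazily built plan, and `list()` plays the role of an action. The `double` function and the `calls` list are hypothetical names used only to record when work happens.

```python
calls = []  # records each time the transformation actually runs


def double(value):
    calls.append(value)
    return value * 2


data = [1, 2, 3]

# Eager (pandas-like): the work happens as soon as the line runs.
eager_result = [double(v) for v in data]
print(len(calls))  # all three calls have already happened

# Lazy (Spark-like): building the generator runs nothing yet.
calls.clear()
lazy_plan = (double(v) for v in data)
print(len(calls))  # still zero calls

# Consuming the generator acts like a Spark action and triggers the work.
lazy_result = list(lazy_plan)
print(len(calls))  # now the three calls have run
```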
Several options are available. Koalas is a pandas API for Apache Spark that provides a scalable way to convert between PySpark DataFrames and a pandas-like DataFrame. You must first convert any geometry column into a string or binary column before converting to a Koalas DataFrame.
GeoAnalytics Engine also includes a function that converts a PySpark DataFrame to a spatially enabled DataFrame supported by the ArcGIS API for Python. This option preserves any geometry columns in your PySpark DataFrame, but the result cannot be distributed across a Spark cluster, so it is not as scalable as using Koalas.
To learn more about spatial references and how to set them see Coordinate systems and transformations.
ST_SRID gets or sets the spatial reference ID of a geometry column but does not change any of the data in the column. ST_Transform transforms the geometry data within a column from an existing spatial reference to a new spatial reference and also sets the result column’s spatial reference ID.
To learn more about spatial references and how to transform between them see Coordinate systems and transformations.
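The distinction between setting a spatial reference ID and transforming geometry can be sketched in plain Python without GeoAnalytics Engine. In the example below, `Point`, `set_srid`, and `to_web_mercator` are hypothetical stand-ins that only mimic the behavior described above: the first function changes the spatial reference label and leaves the coordinates untouched, while the second recomputes the coordinates (here with the standard spherical Web Mercator formula, from SRID 4326 to SRID 3857) and updates the label.

```python
import math
from dataclasses import dataclass, replace

EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator (SRID 3857)


@dataclass(frozen=True)
class Point:
    x: float
    y: float
    srid: int


def set_srid(point: Point, srid: int) -> Point:
    """Relabel the spatial reference; coordinates are not changed."""
    return replace(point, srid=srid)


def to_web_mercator(point: Point) -> Point:
    """Reproject longitude/latitude degrees (SRID 4326) to meters (SRID 3857)."""
    if point.srid != 4326:
        raise ValueError("this sketch only transforms from SRID 4326")
    x = EARTH_RADIUS_M * math.radians(point.x)
    y = EARTH_RADIUS_M * math.log(math.tan(math.pi / 4 + math.radians(point.y) / 2))
    return Point(x, y, 3857)


lon_lat = Point(-117.19, 34.05, 4326)

relabeled = set_srid(lon_lat, 3857)     # same numbers, now mislabeled as meters
transformed = to_web_mercator(lon_lat)  # new numbers that match the new label

print(relabeled)
print(transformed)
```

Relabeling data that is really in degrees as SRID 3857 would place it within a few hundred meters of the origin, which is why a transformation is needed when the coordinates themselves must change.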
This usually happens when using the wrong function to create the geometry column or when using an invalid or unsupported format. Double check that you are using the SQL function corresponding to the same geometry type as your input data. If you are unsure of the geometry type of your input data, use one of the generic geometry import functions:
Also verify that you’re using the SQL function corresponding to the format of your geometry data (EsriJSON, GeoJSON, WKT, WKB, or Shape), and that the representation is valid.
GeoAnalytics Engine uses the TimestampType included with PySpark to represent instants in time. You can create a timestamp column from a numeric or string column using Spark’s datetime patterns for formatting and parsing.
Intervals in time are represented by two timestamp columns containing the start and end instants of each interval.
To check that your geometry column is set correctly, use the
If there is only one geometry column in a DataFrame, it is used automatically. If there are multiple geometry columns in a DataFrame, you must call st.set_ on the DataFrame to specify the primary geometry column.
Similarly, if there is one timestamp column in a DataFrame, it is used automatically as instant time when time is required by a tool. If there are multiple timestamp columns, or you want to represent intervals of time, you must call
The Spark Web UI is the best way to watch the progress of your jobs. The web UI is started on port 4040 by default when you start Spark. All managed Spark services offer their own UIs for tracking the progress of Spark jobs. See the documentation for each service below:
- Azure Databricks - View cluster information in the Apache Spark UI
- Azure Synapse Analytics - Use Synapse Studio to monitor your Apache Spark applications
- Amazon EMR - Access the Spark web UIs
- Google Cloud Dataproc - Cluster web interfaces
PySpark uses lazy evaluation, which means that functions are not executed until certain actions are called. In other words, calling a SQL function will not run that function on your Spark cluster until you call an action on the function’s return value. Examples of actions include
This exception is raised when the arguments to a SQL function are not all of the same documented type or when there are unexpected arguments. For SQL functions that accept x, y, z, and m values in particular, all coordinates must be of the same valid type or the exception above will be thrown. For example, ST_ is valid because the x and y coordinates are both floats, but ST_ is not because one coordinate is an integer and the other is a float.
Check that the types of your function arguments match the expected types documented in the API reference.
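One way to avoid mixed coordinate types is to normalize Python numbers before passing them to a function. The helper below is a hypothetical sketch, not part of GeoAnalytics Engine or PySpark; it simply converts every coordinate to a float so that a call never mixes integers and floats.

```python
from numbers import Real


def as_floats(*coords):
    """Convert every coordinate to float so all arguments share one type."""
    for value in coords:
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, Real):
            raise TypeError(f"coordinate must be a number, got {value!r}")
    return tuple(float(value) for value in coords)


x, y = as_floats(1, 2.0)  # the integer 1 becomes the float 1.0
print(type(x).__name__, type(y).__name__)
```

For DataFrame columns rather than Python literals, casting the columns to a common numeric type before calling the function serves the same purpose.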
Why are all functions failing with "Error: 'JavaPackage' object is not callable" or "Exception: Undefined function..."?
This message indicates that the geoanalytics module has been installed in Python but the accompanying jar has not been properly configured with Spark. To learn more about configuring additional jars with Spark, see the documentation for the spark.jars runtime environment properties and advanced dependency management.
When the geometries in two or more DataFrames are in different spatial references, they won't plot in the expected locations relative to each other. Transforming one to the spatial reference of the other ensures that they use the same coordinate system and units and thus plot together as expected. To learn more see Coordinate systems and transformations.