2.0.0 Release notes
GeoAnalytics Engine 2.0.0 includes a new suite of functions and tools for working with raster data, so that you can use raster and vector data together in your big data spatial analysis workflows. GeoAnalytics Engine 2.0.0 also includes a variety of enhancements to previously released tools and functions, as well as performance and usability improvements throughout the API. Support has also been added for new Apache Spark versions and cloud runtimes, including new Databricks and AWS EMR runtime versions.
Version 2.0.0 is a major release of GeoAnalytics Engine and includes minimal breaking changes that may impact some workflows established with GeoAnalytics Engine 1.x. For more information, see the Breaking API changes section below.
Added support for raster data and analysis
Read, write, and visualize raster data
GeoAnalytics Engine 2.0.0 includes a new raster data type for storing both raster values and raster references in Spark DataFrames. When reading from most raster data sources, rasters are tiled and each tile is stored as a separate record in a raster column, enabling scalable distributed analysis.
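To make the tiling idea concrete, here is a minimal pure-Python sketch. It is not GeoAnalytics Engine code. It splits a small 2D pixel grid into fixed-size tiles and emits one record per tile, mirroring how each tile becomes a separate row in a raster column. The function name tile_raster, the record keys, and the tile size are assumptions made for illustration; the engine's actual tiling scheme and defaults are not described in these notes.

```python
from typing import Dict, List

Grid = List[List[int]]


def tile_raster(pixels: Grid, tile_size: int) -> List[Dict]:
    """Split a 2D pixel grid into square tiles, one record per tile."""
    n_rows = len(pixels)
    n_cols = len(pixels[0]) if pixels else 0
    records = []
    for row0 in range(0, n_rows, tile_size):
        for col0 in range(0, n_cols, tile_size):
            # Tiles on the right or bottom edge may be smaller than tile_size.
            tile = [row[col0:col0 + tile_size]
                    for row in pixels[row0:row0 + tile_size]]
            records.append({"row_offset": row0, "col_offset": col0, "pixels": tile})
    return records


# A 4x6 raster whose pixel value encodes its position (row * 10 + col).
raster = [[r * 10 + c for c in range(6)] for r in range(4)]
tiles = tile_raster(raster, tile_size=4)
for t in tiles:
    print(t["row_offset"], t["col_offset"], len(t["pixels"]), "x", len(t["pixels"][0]))
```

Because every record carries its own offsets, each tile can be processed independently. That independence is what lets a distributed engine spread the work across a cluster.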
Using the raster data source, you can now read from common raster file types
like GeoTIFF, Cloud Optimized GeoTIFF, and PNG, and write results back to those same file types as well.
The new Raster Python class converts NumPy arrays to GeoAnalytics rasters
and exports GeoAnalytics rasters to NumPy arrays, for
integration with many data science and machine learning Python libraries.
This release also adds support for reading and writing image services hosted in ArcGIS Online with the
image-service data source. This allows you to easily bring in hosted imagery data
from your ArcGIS Online organization or the Living Atlas of the World, and
then write results back for sharing and visualization. The image-service data source currently supports reading but
not writing image services hosted in ArcGIS Enterprise.
For quick visualization in a PySpark notebook, you can plot rasters on basemaps using
rt.plot, a lightweight raster plotting API
that extends matplotlib. You can use rt.plot along with st.plot to view both rasters and vector geometries together
in a single plot. For more information, see Visualize results with rt.plot().
Raster functions
GeoAnalytics Engine 2.0.0 includes 30 new raster-type functions, also known as RT functions. These are row-level functions that operate on rasters and can be called with Python, SQL, or Scala syntax.
Some of these raster functions allow you to access raster properties in order to learn more about the data prior to performing analysis. For example, you can use RT_Info on a raster column to access properties like cell size, extent, spatial reference, and more. You can also access these properties individually using functions like RT_CellSizeX, RT_CellSizeY, RT_NumBands, RT_NumColumns, RT_NumRows, RT_PixelType, RT_Extent, RT_SRText, and RT_SRID.
You can also learn more about a raster using functions that calculate summary statistics, which include:
- RT_Statistics—Calculate summary statistics for all values in a raster.
- RT_BandStatistics—Calculate summary statistics for values in a specified raster band.
- RT_ZonalStatistics—Calculate summary statistics for band values that fall within a specified zone (polygon).
Many functions allow you to manipulate your raster data, for example:
- RT_Apply—Applies a user-defined function to each pixel value in a raster band.
- RT_BBoxClip—Clips a raster using a bounding box specified with xmin, ymin, xmax, and ymax.
- RT_Calculator—Applies a map algebra expression to calculate pixel values using up to four rasters.
- RT_ConvertPixelType—Updates the input raster to utilize the specified pixel type.
- RT_Merge—Combines two or more rasters into a single raster.
- RT_Resample—Changes the spatial resolution of a raster using nearest neighbor cell assignment or bilinear interpolation.
- RT_SelectBands—Selects a subset of bands from a raster or reorders raster bands.
- RT_SetExtent—Updates a raster's spatial extent using xmin, ymin, xmax, and ymax.
- RT_Tiles—Re-tiles a raster and stores raster tiles of a specified size in each row of a raster column.
- RT_Transform—Transforms a raster to the specified spatial reference.
There are also several functions to help you import and export raster data. These functions complement the functionality of the raster and image service data sources, and include:
- RT_AddBand—Adds a band to an existing raster.
- RT_BandMask—Returns an array of mask values for a raster.
- RT_BandValues—Returns an array of pixel values for a raster.
- RT_CreateRaster—Creates a new raster using an array of pixel values.
- RT_FromBinary—Creates a raster tile from the binary data in each row.
- RT_Materialize—Forces loading of pixel values into memory, which can improve performance in certain scenarios.
- RT_ToBinary—Converts a raster column to a binary column.
All of these functions are scalable and will distribute computation across the cores of your Spark cluster, allowing analysis of big raster data. They can be seamlessly chained with ST and TRK functions where applicable, as well as with other SQL functions included with Spark.
Raster tools
GeoAnalytics Engine 2.0.0 includes 4 new analysis tools that work specifically with raster data and help you integrate raster and vector datasets. Like other tools in GeoAnalytics Engine, these are aware of all columns in a DataFrame and use all rows to compute a result if required. The new raster-focused tools are:
- Bins to raster—Converts a square bin column to a raster column, using data from other columns as pixel values.
- Enrich point with raster—Joins pixel values from a raster to point geometries.
- Geometry to raster—Rasterizes point, line, or polygon geometries, using data from other columns as pixel values.
- Zonal statistics—Computes summary statistics for raster band values within zone polygons.
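As an illustration of what a zonal summary computes, the following pure-Python sketch gathers the band values inside one zone polygon and summarizes them. It is not the engine's implementation. The rule that a cell belongs to a zone when its center falls inside the polygon, and the names point_in_polygon and zonal_values, are assumptions made for this example.

```python
from statistics import mean
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def point_in_polygon(x: float, y: float, ring: Sequence[Point]) -> bool:
    """Even-odd ray casting test against a single polygon ring."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Count edges that a horizontal ray from (x, y) toward +x crosses.
        if (yi > y) != (yj > y):
            x_cross = xi + (y - yi) * (xj - xi) / (yj - yi)
            if x < x_cross:
                inside = not inside
        j = i
    return inside


def zonal_values(band: List[List[float]], xmin: float, ymax: float,
                 cell_size: float, zone: Sequence[Point]) -> List[float]:
    """Collect band values whose cell centers fall inside the zone (assumed rule)."""
    values = []
    for r, row in enumerate(band):
        for c, value in enumerate(row):
            cx = xmin + (c + 0.5) * cell_size  # cell center x
            cy = ymax - (r + 0.5) * cell_size  # row 0 is the top row
            if point_in_polygon(cx, cy, zone):
                values.append(value)
    return values


# A 3x3 band covering x in [0, 3] and y in [0, 3], with a square zone
# over the lower-left 2x2 cells.
band = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
zone = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
values = zonal_values(band, xmin=0.0, ymax=3.0, cell_size=1.0, zone=zone)
stats = {"count": len(values), "min": min(values), "max": max(values), "mean": mean(values)}
print(stats)
```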
Other new features
Beginning with GeoAnalytics Engine 2.0.0, Summarize Within allows you to summarize point, line, and polygon geometries into H3 bins of a specified resolution. This choice has been added to the existing options of either summarizing into Esri square/hexagon bins or polygon geometries that you provide.
Also introduced at GeoAnalytics Engine 2.0.0 is the
st.with DataFrame
extension, which generates a GeoDisplay column from a geometry column in your DataFrame. A GeoDisplay column is a
spatial index used for fast rendering of geometries with the ArcGIS Maps SDK and other GeoDisplay-compatible mapping tools.
As with every release, version 2.0.0 adds support for new Apache Spark versions and related cloud runtimes. This release includes new compatibility with Spark 4.1.x, Databricks 18.0 and 18.1, and AWS EMR 7.11 and 7.12.
Breaking API changes
Dropped environment support
Installing GeoAnalytics Engine in the following runtimes is not supported beginning with version 2.0.0. These runtimes were formerly supported at version 1.7.x:
- Spark 3.2.x
- Spark 3.3.x
- Amazon EMR 6.6.x – 6.11.x
- Databricks 12.2 LTS
- Google Dataproc 2.1-x
- Azure Synapse Runtime for Apache Spark 3.4
When upgrading to GeoAnalytics Engine 2.0, update your environment to a supported version if needed. See the install guide for a list of supported versions.
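Before upgrading, it can help to check a cluster's Spark version against the dropped release lines. The sketch below covers only the two Spark lines listed above. The helper name is_dropped_spark is hypothetical, and the check does not cover the vendor runtimes (EMR, Databricks, Dataproc, Synapse), which use their own version numbers.

```python
# Spark release lines listed above as unsupported beginning with 2.0.0.
DROPPED_SPARK_PREFIXES = ("3.2.", "3.3.")


def is_dropped_spark(version: str) -> bool:
    """Return True if a Spark version string belongs to a dropped release line."""
    return version.startswith(DROPPED_SPARK_PREFIXES)


for v in ("3.2.4", "3.3.2", "3.5.1", "4.1.0"):
    status = "dropped in 2.0.0" if is_dropped_spark(v) else "not in the dropped list"
    print(v, "->", status)
```

In a live PySpark session, the version string could be taken from spark.version. A result of "not in the dropped list" does not by itself mean the version is supported, so the install guide remains the source of truth.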
SQL functions
GeoAnalytics Engine 2.0.0 introduces several breaking changes to ST and TRK functions to improve accuracy and usability. These changes include:
- The default behavior of ST_GeodesicBuffer has been changed to preserve the geographic location of the input geometry’s interior, more accurately maintaining its original shape. You may see differences in the geometry returned by ST_GeodesicBuffer if you rely on the default behavior. To return to the legacy result of the function, set the new preserve_shape parameter to False. By default, the parameter is set to True.
- ST_EnvIntersects no longer supports passing a geometry column and 4 numbers (x-min, y-min, x-max, and y-max) as input. The only accepted input is two geometry columns (geometry1, geometry2). This behavior was formally deprecated in GeoAnalytics Engine 1.2.0 but remained as an undocumented legacy input in later 1.x.x versions. If needed, replace calls to ST_EnvIntersects that pass 4 bounding coordinates with ST_BboxIntersects.
- Functions that accept a spatial reference as input no longer support srid as a named argument; only sr will be recognized. The srid name was formally deprecated in favor of sr starting at version 1.1.0 but remained as an undocumented legacy input in later 1.x.x versions. This is applicable to the following functions:
  - ST_Point
  - ST_PointZ
  - ST_PointZM
  - ST_PointM
  - ST_PointFromText
  - ST_MPointFromText
  - ST_LineFromText
  - ST_PolyFromText
  - ST_GeomFromText
  - ST_PointFromBinary
  - ST_MPointFromBinary
  - ST_LineFromBinary
  - ST_PolyFromBinary
  - ST_GeomFromBinary
  - ST_PointFromEsriJSON
  - ST_MPointFromEsriJSON
  - ST_LineFromEsriJSON
  - ST_PolyFromEsriJSON
  - ST_GeomFromEsriJSON
  - ST_PointFromGeoJSON
  - ST_MPointFromGeoJSON
  - ST_LineFromGeoJSON
  - ST_PolyFromGeoJSON
  - ST_GeomFromGeoJSON
  - ST_PointFromShape
  - ST_MPointFromShape
  - ST_LineFromShape
  - ST_PolyFromShape
  - ST_GeomFromShape
  - ST_Transform
  Scripts relying on these functions should replace the named argument srid with sr, if necessary.
- The ST_MultiLinestring and ST_MultiPolygon functions no longer support arrayOfPoints as a named argument. The arrayOfPoints name was formally deprecated in favor of accepting a variable number of positional arguments (one or more arrays) at version 1.1.0 but remained as an undocumented legacy input in later 1.x.x versions. Any scripts that use these functions should be updated to not use a named argument, if necessary.
- Track functions that accept an output unit string as input no longer support output_units as a named argument; only output_unit will be recognized. The output_units name was formally deprecated in favor of output_unit but remained as an undocumented legacy input. This is applicable to the following functions:
- The ST_FrechetDistance function is no longer callable using a diacritic in the name (i.e., ST_FréchetDistance) due to lack of full support for the diacritic in Apache Spark. Replace the name of the function with ST_FrechetDistance in any scripts that previously used the function name with a diacritic.
- Several functions are updated to use geodesic calculations if one input geometry is in an unprojected spatial reference and the other geometry has no spatial reference defined. In GeoAnalytics Engine 1.x.x, these functions used planar distance calculations if the first geometry did not have a defined spatial reference. The functions are:
  To replicate the legacy behavior, set the spatial reference of both geometries to 0 (undefined) or a projected spatial reference to force planar distance calculations.
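When auditing existing scripts for the renamed arguments above, a simple text scan can flag lines to review. The following is a minimal sketch, not an official migration tool. The helper name find_legacy_usages, the pattern labels, and the sample snippet are all hypothetical. The patterns are deliberately broad, so they may flag unrelated code that happens to use the same keyword names.

```python
import re
from typing import List, Tuple

# Legacy spellings named in the list above. Each pattern matches a keyword
# argument (name followed by a single "="), not an equality comparison.
LEGACY_PATTERNS = {
    "srid named argument (use sr)": re.compile(r"\bsrid\s*=(?!=)"),
    "output_units named argument (use output_unit)": re.compile(r"\boutput_units\s*=(?!=)"),
    # Matches only the precomposed "é" character, not "e" plus a combining accent.
    "diacritic in ST_FrechetDistance": re.compile(r"Fréchet", re.IGNORECASE),
}


def find_legacy_usages(source: str) -> List[Tuple[int, str]]:
    """Return (line number, label) pairs for lines that use a legacy spelling."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for label, pattern in LEGACY_PATTERNS.items():
            if pattern.search(line):
                hits.append((line_no, label))
    return hits


# Hypothetical script text used only to exercise the scan.
sample = "\n".join([
    'df.select(ST.point("x", "y", srid=4326))',
    'keep = df.where("srid == 4326")',
    "SELECT ST_FréchetDistance(a, b)",
])
for line_no, label in find_legacy_usages(sample):
    print(f"line {line_no}: {label}")
```

The scan only reports candidates. Each flagged line still needs a manual check before the argument or function name is changed.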
Tools
GeoAnalytics Engine 2.0.0 also introduces breaking changes to some analysis tools, including:
- The setDistanceMethod setter is now required for running certain tools if the input geometry has projected coordinates. Additionally, the term “geodesic” has been replaced by the term “geodetic” in setDistanceMethod and in spatial relationship choice lists. The term “geodetic” is a more accurate descriptor of the distance calculation. These changes apply to the following tools:
- The Summarize Within tool has been changed in several ways:
  - The addRateField setter has been removed. Formerly, fields included in addRateField would not be proportioned prior to calculating statistics, whereas other fields would be proportioned by default. Beginning with version 2.0, proportioning is controlled by the proportion parameter in the addStandardSummaryField and addWeightedSummaryField setters. Set proportion to True to proportion any field prior to calculating statistics. The default is False, meaning no field is proportioned unless you request it. The results returned from SummarizeWithin in version 2.0 will differ from results in 1.x unless the parameters are updated.
  - The tool output has been changed to return a DataFrame if run using the run method, and a named tuple if run using the new runIncludeGroupBy method. Formerly, the tool would always return a named tuple if run using the run method. The return type of Summarize Within in version 2.0 will differ from that of 1.x if no updates are made.
- The Calculate Motion Statistics tool has been corrected to ignore track observations with null geometry and/or time. The tool would formerly return null for some statistics in cases where one or more track observations had a null geometry or timestamp. No mitigation is required; however, results in version 2.0 may differ from those returned in earlier versions of GeoAnalytics Engine because these observations are now ignored.
- The Reconstruct Tracks tool has been updated to include timestamps as m-values in the resulting linestrings by default. This allows the resulting linestrings to be used by TRK functions and other track-related functionality. Any existing m-values in the input points will be overwritten by the timestamps and not included in the result linestrings by default. To obtain legacy results, call the new preserveM setter before running the tool.
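To show why the Summarize Within proportion default changes results, here is a small worked example in plain Python. It assumes that proportioning scales a field value by the fraction of the feature that falls inside the summary polygon. That rule, the helper name summarize_sum, and the sample numbers are illustrative assumptions, not taken from the tool's implementation.

```python
from typing import List, Tuple

# (attribute value, fraction of the feature inside the summary polygon)
Feature = Tuple[float, float]


def summarize_sum(features: List[Feature], proportion: bool) -> float:
    """Sum a field over features, optionally scaling each value by its overlap."""
    total = 0.0
    for value, fraction_inside in features:
        total += value * fraction_inside if proportion else value
    return total


# One feature overlaps the summary polygon by 25%, the other lies fully inside.
features = [(1000.0, 0.25), (400.0, 1.0)]
print(summarize_sum(features, proportion=True))   # 650.0
print(summarize_sum(features, proportion=False))  # 1400.0
```

Under this assumption, a workflow that relied on the old proportioned-by-default behavior would see larger sums in version 2.0 unless proportion is set to True.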