Data sources

GeoAnalytics Engine supports loading data from and saving data to several common spatial data sources, in addition to the data sources supported by Spark. After you import the geoanalytics module, these spatial data sources can be accessed by setting the format when loading or saving a Spark DataFrame.

Most data sources support loading from a single file or from a folder containing multiple files. When loading from a folder, all files within the folder must be in the same format and have the same schema. For example, the line below creates a DataFrame, df, from multiple shapefiles that are stored in a folder called hurricanes.

df = spark.read.format("shapefile").load("s3://my-data/hurricanes")

The same format() string can also be used to save. For example, the line below shows a DataFrame, df, being saved as a collection of shapefiles stored in HDFS.

df.write.format("shapefile").save("hdfs://nn1home:8020/hurricanes")

Because Spark is a distributed engine, multiple writers are used to save a single DataFrame, which results in a collection of files unless the data is explicitly coalesced to one partition before saving.
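The coalescing step described above can be sketched as follows. This is a minimal illustration, not part of the GeoAnalytics Engine API: save_as_one_partition is a hypothetical helper name, and the sketch assumes that df is a Spark DataFrame and that the geoanalytics module has already been imported so the spatial format is available. It relies only on the standard DataFrame.coalesce and DataFrameWriter calls.

```python
def save_as_one_partition(df, path, fmt="shapefile"):
    """Coalesce df to one partition, then save it with a single writer.

    Hypothetical helper for illustration; df is assumed to be a Spark DataFrame.
    """
    # coalesce(1) merges all partitions into one, so only one task writes output.
    single = df.coalesce(1)
    # Same format() and save() calls as in the example above.
    single.write.format(fmt).save(path)
    return single
```

For example, save_as_one_partition(df, "hdfs://nn1home:8020/hurricanes") would save the hurricanes data using one writer. Coalescing to one partition gives up write parallelism, so it is likely to be slow for large DataFrames.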

The table below summarizes the spatial data sources available for loading and saving in ArcGIS GeoAnalytics Engine 1.4.x.

Data source	Format	Load	Save
CSV	`csv`	Yes	Yes
Feature service	`feature-service`	Yes	Yes
File geodatabase	`filegdb`	Yes	No
GeoJSON	`geojson`	Yes	Yes
GeoParquet	`geoparquet`	Yes	Yes
JDBC	`jdbc`	Yes	Yes
ORC	`orc`	Yes	Yes
Parquet	`parquet`	Yes	Yes
Esri shapefile	`shapefile`	Yes	Yes
Vector tiles	`vector-tile`	No	Yes
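
One way to use the table above in code is to check a format string before calling load or save. The sketch below is an illustration only: SPATIAL_FORMATS and check_format are hypothetical names, not part of GeoAnalytics Engine, and the dictionary simply transcribes the table for version 1.4.x.

```python
# Format string -> (can_load, can_save), transcribed from the table above.
SPATIAL_FORMATS = {
    "csv": (True, True),
    "feature-service": (True, True),
    "filegdb": (True, False),
    "geojson": (True, True),
    "geoparquet": (True, True),
    "jdbc": (True, True),
    "orc": (True, True),
    "parquet": (True, True),
    "shapefile": (True, True),
    "vector-tile": (False, True),
}


def check_format(fmt, operation):
    """Return fmt if it supports operation ('load' or 'save'), else raise ValueError."""
    if operation not in ("load", "save"):
        raise ValueError(f"operation must be 'load' or 'save', got {operation!r}")
    if fmt not in SPATIAL_FORMATS:
        raise ValueError(f"unknown format string: {fmt!r}")
    can_load, can_save = SPATIAL_FORMATS[fmt]
    supported = can_load if operation == "load" else can_save
    if not supported:
        raise ValueError(f"format {fmt!r} does not support {operation}")
    return fmt
```

With this helper, check_format("filegdb", "load") returns "filegdb", while check_format("filegdb", "save") raises a ValueError because file geodatabases can be loaded but not saved.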

What next?

For more examples on how to read and save data formats, see the following tutorials: