Shapefile is an Esri vector data storage format commonly used in geospatial analysis and GIS software applications. For more information on shapefiles, see the Shapefile format specification.
Shapefile data can be stored in a distributed file system such as HDFS, cloud storage such as S3, or a local directory.
When loading shapefile data, a geometry column will be automatically created in the result DataFrame and its spatial
reference set. Shapefile supports point, line, polygon, and multipart collections of point, line, and
polygon geometries. After loading shapefiles
into a Spark DataFrame, you can perform analysis and visualize the data by using the SQL functions and tools available
in GeoAnalytics Engine in addition to functions offered in Spark. Once you save a DataFrame
as shapefile, you can store the files or access and visualize them through other systems.
The following table shows examples of the Python syntax for loading and saving shapefiles with GeoAnalytics Engine, where
path is a path to a directory of shapefiles.
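As a minimal sketch of the load and save syntax described above, the two helpers below wrap the standard Spark reader and writer chains. The short format name `"shapefile"` and the helper names `load_shapefiles` and `save_shapefiles` are assumptions made for illustration; the Spark session and DataFrame are passed in, so the snippet does not import Spark itself.

```python
def load_shapefiles(spark, path):
    """Read the shapefiles under `path` into a DataFrame.

    Assumption: GeoAnalytics Engine registers its data source under the
    short name "shapefile". `spark` is an existing SparkSession.
    """
    return spark.read.format("shapefile").load(path)


def save_shapefiles(df, path):
    """Write `df` to `path` as shapefiles (same assumed format name)."""
    return df.write.format("shapefile").save(path)
```

With a session that has GeoAnalytics Engine available, `load_shapefiles(spark, path)` would return a DataFrame that includes the geometry column.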
The Spark DataFrameReader and DataFrameWriter classes provide optional parameters that you
can use when reading or writing shapefiles. For a full list of options, see the
Spark API reference.
|Filter the shapefile by the specified spatial extent.|
|Merge the schemas of a collection of shapefile datasets in the input directory.|
|Save data to the number of partitions specified.|
|Partition the output shapefiles by the values in the given column.|
|Overwrite existing data in the specified path. Other available save modes are append, ignore, and error.|
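The writer-side options correspond to standard Spark calls (`coalesce`, `partitionBy`, and `mode`), which the sketch below combines in one hypothetical helper. The helper name `write_shapefiles`, its parameters, and the format name `"shapefile"` are assumptions for illustration and are not part of the documented API.

```python
def write_shapefiles(df, path, num_partitions=None, partition_column=None,
                     mode="error"):
    """Write `df` as shapefiles using common Spark writer options.

    Assumption: the data source short name is "shapefile".
    `mode` is a standard Spark save mode: "append", "overwrite",
    "ignore", or "error".
    """
    if num_partitions is not None:
        # coalesce is a DataFrame method; it reduces the partition count
        # (and so the number of output files) before writing.
        df = df.coalesce(num_partitions)

    writer = df.write.format("shapefile").mode(mode)

    if partition_column is not None:
        # Produces one `column=value` subdirectory per distinct value.
        writer = writer.partitionBy(partition_column)

    writer.save(path)
```

For example, `write_shapefiles(df, path, partition_column="district", mode="overwrite")` would replace any existing output and write one subdirectory per district value.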
The default string format of timestamps saved to shapefiles is
Shapefile doesn't support saving the generic geometry type. Geometry columns must be of a specific type, such as point, line, or polygon.
Writing to shapefile requires exactly one geometry field.
From Engine 1.1.x, bin2d columns will be saved automatically as a long field in the result shapefile, where the field value contains the bin ID.
Spark will read shapefiles from multiple directories if the directory names start with
column=. For example, when shapefile data is partitioned by district, Spark can infer
district as a column name in the DataFrame by reading the subdirectory names starting with
district=.
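To make the naming convention concrete, here is a small standard-library sketch that lists which subdirectories of a root directory follow the `column=value` pattern. It only illustrates the convention and is not how Spark implements partition discovery; the function name `infer_partitions` and the sample directory names are hypothetical.

```python
import tempfile
from pathlib import Path


def infer_partitions(root):
    """Map each `column=value` subdirectory name to a (column, value) pair."""
    found = {}
    for child in sorted(Path(root).iterdir()):
        # Only directories named like `column=value` count as partitions.
        if child.is_dir() and "=" in child.name:
            column, value = child.name.split("=", 1)
            found[child.name] = (column, value)
    return found


# Build a throwaway layout with two partition directories and one that
# does not follow the convention.
with tempfile.TemporaryDirectory() as root:
    for name in ("district=1", "district=2", "other"):
        (Path(root) / name).mkdir()
    print(infer_partitions(root))
    # {'district=1': ('district', '1'), 'district=2': ('district', '2')}
```

The `other` directory is skipped because its name carries no column information.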
When reading a directory of shapefiles whose subdirectories do not follow the
column= naming convention, Spark won't be able to read all of the data from the subdirectories in bulk. You will need to add a glob pattern to the end of the root path.
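The sketch below shows one way to build such a path. It assumes that a single-level `/*` wildcard is the pattern needed, which the text above does not confirm, and it uses Python's `glob` module only as a local stand-in to show what the wildcard matches; Spark resolves the pattern itself when the path is passed to the reader. The helper name `with_subdirectory_glob` is hypothetical.

```python
import glob
import os
import tempfile


def with_subdirectory_glob(root):
    """Append a single-level wildcard so every direct subdirectory matches.

    Assumption: "/*" is the glob pattern required for the layout at hand.
    """
    return root.rstrip("/") + "/*"


# Demonstrate what the pattern matches on a throwaway local directory.
with tempfile.TemporaryDirectory() as root:
    for name in ("north", "south"):
        os.mkdir(os.path.join(root, name))
    pattern = with_subdirectory_glob(root)
    matched = sorted(os.path.basename(p) for p in glob.glob(pattern))
    print(matched)  # ['north', 'south']
```

The resulting string would be passed to the reader in place of the root path.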
Shapefiles have a maximum size of 2 GB. Use
.coalesce(1) with caution when writing a large DataFrame to shapefile format.
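One way to reason about the size limit is to estimate how many output partitions keep each file under it. The sketch below is a rough calculation only: it assumes you already have an estimate of the total output size, that rows are spread roughly evenly across partitions, and that the limit is treated as 2 GiB. The name `min_partitions` and the `headroom` safety factor are hypothetical.

```python
import math

# Assumption: interpret the 2 GB shapefile limit as 2 GiB.
MAX_SHAPEFILE_BYTES = 2 * 1024**3


def min_partitions(estimated_bytes, limit=MAX_SHAPEFILE_BYTES, headroom=0.8):
    """Return the smallest partition count that keeps each file under
    `limit * headroom`, assuming data is spread evenly across partitions."""
    if estimated_bytes <= 0:
        return 1
    return max(1, math.ceil(estimated_bytes / (limit * headroom)))


print(min_partitions(1 * 1024**3))  # 1
print(min_partitions(5 * 1024**3))  # 4
```

If the result is greater than 1, repartitioning the DataFrame to at least that many partitions before writing is safer than `.coalesce(1)`.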
Writing to shapefile doesn't require a spatial reference to be set on the geometry column. However, it is recommended to always set and check the spatial reference of a DataFrame before writing to shapefile. An invalid or missing spatial reference in a shapefile could lead to inaccurate visualization or analysis results.