Shapefile

Shapefile is an Esri vector data storage format commonly used in geospatial analysis and GIS software applications. For more information on shapefiles, see the Shapefile format specification.

Shapefile data can be stored in a distributed file system such as HDFS, cloud storage such as S3, or a local directory.

When loading shapefile data, a geometry column is automatically created in the result DataFrame and its spatial reference is set. Shapefile supports point, line, and polygon geometries, as well as multipart collections of each. After loading shapefiles into a Spark DataFrame, you can analyze and visualize the data using the SQL functions and tools available in GeoAnalytics Engine, in addition to the functions offered in Spark. Once you save a DataFrame as a shapefile, you can store the files or access and visualize them through other systems.

Reading Shapefiles in Spark

The following example demonstrates how to load a shapefile into a Spark DataFrame using Python.

df = spark.read.format("shapefile").load("path/to/your/file")

The following options can be used when reading shapefiles:

DataFrameReader option | Example | Description
extent | .option("extent", "-90.0, 30.0, 90, 50") | Filters the shapefile by the spatial extent specified in the format "<x min>, <y min>, <x max>, <y max>".
mergeSchemas | .option("mergeSchemas", True) | Merges the schemas of a collection of shapefile datasets in the input directory.
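As a sketch of how these options combine in practice (the input path and bounding box below are placeholders, and `spark` is assumed to be an active SparkSession in an authorized GeoAnalytics Engine session):

```python
def extent_string(xmin, ymin, xmax, ymax):
    """Format a bounding box as the "<x min>, <y min>, <x max>, <y max>"
    string expected by the extent option."""
    return f"{xmin}, {ymin}, {xmax}, {ymax}"

bbox = extent_string(-90.0, 30.0, 90.0, 50.0)

# The read itself only works in a live Spark session with the shapefile
# data source available, so it is guarded here for illustration.
if "spark" in dir():
    df = (spark.read.format("shapefile")
          .option("extent", bbox)          # keep only features in the bbox
          .option("mergeSchemas", True)    # merge schemas across datasets
          .load("path/to/your/file"))      # placeholder path
```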

Writing DataFrames to Shapefile

The following example demonstrates how to write a Spark DataFrame to shapefile using Python.


df.write.format("shapefile").save("path/to/output/directory")

The following options can be used when writing shapefiles:

DataFrameWriter option | Example | Description
coalesce | .coalesce(1) | Saves data to the specified number of partitions. For example, .coalesce(1) writes the data to one shapefile.
partitionBy | .partitionBy("date") | Partitions the output by the given column name. This example partitions the output shapefiles by the values in the date column.
overwrite | .mode("overwrite") | Overwrites existing data in the specified path. Other available modes are append, error, and ignore.
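A sketch of a full write call combining these options; `df` exists only in a live GeoAnalytics Engine session, and the mode check is a small convenience helper assumed here, not part of the GeoAnalytics Engine API:

```python
# The four save modes accepted by DataFrameWriter.mode().
VALID_MODES = {"overwrite", "append", "error", "ignore"}

def check_mode(mode):
    """Fail fast on a mistyped save mode before handing it to Spark."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown save mode: {mode!r}")
    return mode

# Guarded: `df` is only defined in a live Spark session.
if "df" in dir():
    (df.coalesce(4)                        # cap the number of output files
       .write.format("shapefile")
       .partitionBy("date")                # one subdirectory per date value
       .mode(check_mode("overwrite"))      # replace any existing output
       .save("path/to/output/directory"))  # placeholder path
```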

Usage notes

  • The default string format of timestamps saved to shapefiles is yyyy-MM-dd HH:mm:ss.SSS.
  • Shapefile doesn't support saving the generic geometry type. Geometry columns must be point, multipoint, line, or polygon type.
  • Writing to shapefile requires exactly one geometry field.
  • As of GeoAnalytics Engine 1.1.x, bin2d columns are automatically saved as a long field in the result shapefile, where the field value contains the bin ID.
  • Shapefiles have a maximum size of 2GB, so use .coalesce(1) with caution when writing a large DataFrame to shapefile.
  • Writing to shapefile does not require a spatial reference to be set on the geometry column, but it is recommended to always set and check the spatial reference of a DataFrame before writing. An invalid or missing spatial reference can lead to inaccurate visualization or analysis results.
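Since the 2GB cap applies per file, .coalesce(1) is only safe for small outputs. One way to pick a partition count is sketched below with an assumed 20% safety margin; the helper and margin are illustrative, not part of the format or the API:

```python
import math

SHAPEFILE_MAX_BYTES = 2 * 1024**3  # the format's 2GB per-file limit

def min_partitions(estimated_bytes, margin=0.8):
    """Smallest partition count that keeps each output shapefile under
    the 2GB cap, with headroom (margin) for per-file overhead."""
    return max(1, math.ceil(estimated_bytes / (SHAPEFILE_MAX_BYTES * margin)))

# A ~5GB DataFrame would then be written with, for example:
# df.coalesce(min_partitions(5 * 1024**3)).write.format("shapefile").save(path)
```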
