GeoParquet

GeoParquet is a standardized, open-source columnar storage format that extends Apache Parquet by defining how geospatial data should be stored, including the representation of geometries and the additional metadata that is required. Parquet is highly structured, meaning it stores the schema and data type of each column with the data files. GeoParquet's structure enables interoperability between systems that read or write spatial data in Parquet format. To write to Parquet format without geometry data, see the Parquet data source included with Apache Spark.

The following table shows examples of the Python syntax for loading and saving GeoParquet with GeoAnalytics Engine, where path is a path to a directory of GeoParquet files or a single GeoParquet file.

Load | Save
spark.read.format("geoparquet").load(path) | df.write.format("geoparquet").save(path)
spark.read.load(path, format="geoparquet") | df.write.save(path, format="geoparquet")
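
As a minimal sketch of the syntax in the table above, the two helpers below wrap the load and save calls. The function names load_geoparquet and save_geoparquet are hypothetical, and spark is assumed to be a SparkSession in which GeoAnalytics Engine is already installed and authorized.

```python
def load_geoparquet(spark, path):
    """Load a single GeoParquet file or a directory of files into a DataFrame."""
    # Equivalent shorthand: spark.read.load(path, format="geoparquet")
    return spark.read.format("geoparquet").load(path)


def save_geoparquet(df, path):
    """Write a DataFrame to GeoParquet at the given path."""
    # Equivalent shorthand: df.write.save(path, format="geoparquet")
    df.write.format("geoparquet").save(path)
```

For example, save_geoparquet(load_geoparquet(spark, in_path), out_path) would copy a dataset from one location to another, where in_path and out_path are placeholder paths.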

Additionally, the Spark DataFrameReader and DataFrameWriter classes provide extra options that can be used when reading or writing GeoParquet. For a full list of options offered in DataFrameReader and DataFrameWriter, see the Spark API reference.

DataFrameWriter option | Example | Description
version | .option("version", "1.0.0") | Defines the version of the GeoParquet specification that will be used when writing. The default is version 1.1.0.
encoding | .option("encoding", "wkb") | Defines the geometry encoding that will be used when writing. Can be chosen from wkb or native when version is set to 1.1.0. The default is wkb.
includeZ | .option("includeZ", "true") | Defines if z-values will be included when writing with native encoding. The default is false. The option has no effect on WKB encoding. WKB encoding always includes z-values and m-values.
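
The writer options above can be combined in one call. The sketch below is illustrative only: the function names geoparquet_write_options and write_geoparquet are hypothetical, and the validation simply mirrors the constraints stated in the table (native encoding only with version 1.1.0, includeZ only relevant to native encoding). It uses the standard Spark DataFrameWriter.options method to pass the values.

```python
def geoparquet_write_options(version="1.1.0", encoding="wkb", include_z=False):
    """Build a dict of GeoParquet writer options from the documented settings."""
    if encoding not in ("wkb", "native"):
        raise ValueError(f"unsupported encoding: {encoding!r}")
    # The table above documents native encoding only for version 1.1.0.
    if encoding == "native" and version != "1.1.0":
        raise ValueError("native encoding requires version 1.1.0")
    options = {"version": version, "encoding": encoding}
    # includeZ has no effect on WKB encoding, so it is only set for native.
    if encoding == "native":
        options["includeZ"] = "true" if include_z else "false"
    return options


def write_geoparquet(df, path, **settings):
    """Write df to path as GeoParquet using the validated options."""
    options = geoparquet_write_options(**settings)
    df.write.format("geoparquet").options(**options).save(path)
```

Calling write_geoparquet(df, path, encoding="native", include_z=True) would write native-encoded geometries with z-values, assuming df has a geometry column of a specific (not generic) type.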

Usage notes

  • Writing to GeoParquet requires one or more geometry fields.
  • When loading data from GeoParquet into a DataFrame, geometry columns will be created automatically.
  • When writing to GeoParquet using native encoding, the generic geometry type is not supported.
  • When writing to GeoParquet using native encoding, m-values are not supported.
  • The following table outlines how each geometry type in GeoAnalytics Engine maps to geometry types in the GeoParquet specification. Note that the GeoParquet specification allows for geometry columns to have multiple types.
    GeoAnalytics Engine | GeoParquet
    point | [Point]
    multipoint | [MultiPoint]
    linestring | [LineString, MultiLineString]
    polygon | [Polygon, MultiPolygon]
    geometry | [GeometryCollection]
  • When loading data from GeoParquet, GeoAnalytics Engine reads the spatial reference of geometries from the included metadata. If the spatial reference is not recognized, the SRID is set to 0 (unknown). Always verify that the spatial reference was read correctly by calling st.get_spatial_reference on the resulting DataFrame. If the spatial reference was not recognized, you can set it using ST_SRID.
  • If no spatial reference is defined in the metadata for a geometry column, GeoAnalytics Engine will assume the data is in World Geodetic System 1984 and set the SRID to 4326.
  • When writing data to GeoParquet, some spatial references may not be recognized or supported by other systems that read GeoParquet. In these cases, transform your geometry data to World Geodetic System 1984 (SRID:4326) before writing to guarantee interoperability.
  • Setting a spatial reference on each geometry column prior to writing is recommended but not required. The GeoParquet specification defines that any data without a spatial reference is assumed to be in World Geodetic System 1984 (SRID:4326).
  • Spark writes to multiple files by default. For interoperability with systems that do not support collections of GeoParquet files, use .coalesce(1) on a DataFrame to return a new DataFrame that has exactly one partition and will be written to one GeoParquet file. Be cautious when using .coalesce(1) on a large dataset, because all of the data is moved into a single partition.
  • The schema of a nested geometry field may load differently from GeoParquet files depending on the Spark version. For example, Spark 3.4 loads the schema of a point geometry column in a nested column as StructField(point,StructType(StructField(x,DoubleType,false),StructField(y,DoubleType,false),StructField(z,DoubleType,false),StructField(m,DoubleType,false)), whereas Spark 3.3 loads it as StructField(point,PointUDT(None),true)).
  • The following table outlines which versions of the GeoParquet schema are supported by each version of GeoAnalytics Engine:
    GeoAnalytics Engine | GeoParquet schema
    1.0.x | Not supported
    1.1.x - 1.2.x | 0.1.0 - 0.4.0
    1.3.x - 1.4.x | 0.1.0 - 1.0.0
    1.5.x | 0.1.0 - 1.1.0
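
The spatial reference check and the single-file write described in the usage notes can be sketched as follows. The function names srid_is_known and write_single_file are hypothetical. The sketch assumes that df.st.get_spatial_reference() returns an object exposing an srid attribute; confirm this against the GeoAnalytics Engine API reference for your version.

```python
def srid_is_known(df):
    """Return True if the spatial reference was recognized when loading."""
    # Assumption: the returned object has an `srid` attribute. Per the usage
    # notes, an SRID of 0 means the spatial reference was not recognized.
    return df.st.get_spatial_reference().srid != 0


def write_single_file(df, path):
    """Write df as one GeoParquet file by collapsing it to one partition."""
    # coalesce(1) places every row in a single partition, so reserve this
    # for datasets small enough to be handled by one task.
    df.coalesce(1).write.format("geoparquet").save(path)
```

If srid_is_known(df) returns False, set the spatial reference (for example with ST_SRID) before analysis or writing.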
