GeoParquet is a standardized, open-source columnar storage format that extends Apache Parquet by defining how geospatial data should be stored, including the representation of geometries and the additional metadata that is required. Parquet is highly structured, meaning it stores the schema and data type of each column with the data files. GeoParquet's structure enables interoperability between systems that read or write spatial data in Parquet format. To write to Parquet format without geometry data, see the Parquet data source included with Apache Spark.
The following table shows examples of the Python syntax for loading and saving GeoParquet with GeoAnalytics Engine, where `path` is a path to a directory of GeoParquet files or a single GeoParquet file.
Load | Save |
---|---|
spark.read.format("geoparquet").load(path) | df.write.format("geoparquet").save(path) |
spark.read.load(path, format="geoparquet") | df.write.save(path, format="geoparquet") |
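As a minimal sketch of the syntax in the table above, the helpers below wrap a GeoParquet round trip. They assume `spark` is a SparkSession in which GeoAnalytics Engine is already installed and authorized, since that is what makes the `geoparquet` format available. The function names are hypothetical and are not part of GeoAnalytics Engine.

```python
def load_geoparquet(spark, path):
    """Load a GeoParquet file, or a directory of GeoParquet files, into a DataFrame."""
    # Same effect as spark.read.load(path, format="geoparquet").
    return spark.read.format("geoparquet").load(path)


def save_geoparquet(df, path):
    """Write a DataFrame to GeoParquet at the given path."""
    # Same effect as df.write.save(path, format="geoparquet").
    df.write.format("geoparquet").save(path)
```

Both forms shown in the table are interchangeable; the chained `format(...)` style is used here only because it composes more easily with additional reader or writer options.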
Additionally, the Spark DataFrameReader and DataFrameWriter classes provide extra options that can be used when reading or writing GeoParquet. For a full list of options offered by DataFrameReader and DataFrameWriter, see the Spark API reference.
DataFrameWriter option | Example | Description |
---|---|---|
version | .option("version", "0.4.0") | Defines the version of the GeoParquet specification that will be used when writing. The default is version 1.0.0. |
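The `version` option from the table can be chained onto the writer before saving. The sketch below assumes `df` is a DataFrame with at least one geometry column in a session where GeoAnalytics Engine is authorized; the function name and its default argument are illustrative only.

```python
def save_geoparquet_as_version(df, path, version="1.0.0"):
    """Write a DataFrame to GeoParquet using a specific specification version."""
    # The "version" writer option selects the GeoParquet specification
    # version used for the output, for example "0.4.0" for older readers.
    df.write.format("geoparquet").option("version", version).save(path)
```

Pinning an older version such as `"0.4.0"` is mainly useful when the files will be read by a system that has not yet adopted version 1.0.0 of the specification.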
Usage notes
- Writing to GeoParquet requires one or more geometry fields.
- When loading data from GeoParquet into a DataFrame, geometry columns will be created automatically.
- The following table outlines how each geometry type in GeoAnalytics Engine maps to geometry types in the GeoParquet specification. Note that the GeoParquet specification allows geometry columns to have multiple types.

  GeoAnalytics Engine | GeoParquet |
  ---|---|
  point | [Point] |
  multipoint | [MultiPoint] |
  linestring | [LineString, MultiLineString] |
  polygon | [Polygon, MultiPolygon] |
  geometry | [GeometryCollection] |

- When loading data from GeoParquet, GeoAnalytics Engine will read the spatial reference of geometries using included metadata. If the spatial reference is not recognized, the SRID will be set to 0 (unknown). Always verify that the spatial reference was read correctly using st.get_spatial_reference on the result DataFrame. If the spatial reference was not recognized, you can set it using ST_SRID.
- If no spatial reference is defined in the metadata for a geometry column, GeoAnalytics Engine will assume the data is in World Geodetic System 1984 and set the SRID to 4326.
- When writing data to GeoParquet, some spatial references may not be recognized or supported by other systems that read GeoParquet. In these cases, transform your geometry data to World Geodetic System 1984 (SRID:4326) before writing to guarantee interoperability.
- Setting a spatial reference on each geometry column prior to writing is recommended but not required. The GeoParquet specification states that any data without a spatial reference is assumed to be in World Geodetic System 1984 (SRID:4326).
- Spark writes to multiple files by default. For interoperability with systems that do not support collections of GeoParquet files, use `.coalesce(1)` on a DataFrame to return a new DataFrame that has exactly one partition and will be written to one GeoParquet file. Be cautious when using `.coalesce(1)` on a large dataset, because all of the data is then written by a single task.
- The schema of a nested geometry field may load differently from GeoParquet files depending on the Spark version. For example, in Spark 3.4, the schema of a point geometry column in a nested column is loaded as `StructField(point,StructType(StructField(x,DoubleType,false),StructField(y,DoubleType,false),StructField(z,DoubleType,false),StructField(m,DoubleType,false))` instead of `StructField(point,PointUDT(None),true))` as in Spark 3.3.
- The following table outlines which versions of the GeoParquet schema are supported by each version of GeoAnalytics Engine:

  GeoAnalytics Engine | GeoParquet schema |
  ---|---|
  1.0.x | Not supported |
  1.1.x - 1.2.x | 0.1.0 - 0.4.0 |
  1.3.x - 1.4.x | 0.1.0 - 1.0.0 |
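Two of the usage notes above lend themselves to a short sketch: checking the spatial reference after loading, and writing a single output file. The helpers below are hypothetical. They assume that GeoAnalytics Engine has been imported and authorized so that DataFrames expose the `st` accessor, that the DataFrame has a single geometry column, and that the object returned by `get_spatial_reference()` exposes an `srid` attribute; verify these details against the API reference for your version.

```python
def spatial_reference_unrecognized(df):
    """Return True if the SRID read from GeoParquet is 0 (unknown)."""
    # Assumption: df.st.get_spatial_reference() returns an object with an
    # `srid` attribute describing the DataFrame's geometry column.
    return df.st.get_spatial_reference().srid == 0


def save_single_geoparquet_file(df, path):
    """Write a DataFrame to exactly one GeoParquet file."""
    # coalesce(1) returns a new DataFrame with exactly one partition, so
    # Spark writes one data file. Avoid this on large datasets, because
    # all rows are then written by a single task.
    df.coalesce(1).write.format("geoparquet").save(path)
```

If `spatial_reference_unrecognized` returns `True`, the spatial reference can be set with ST_SRID as described in the usage notes before the data is used or written again.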