GeoParquet is a standardized open-source columnar storage format that extends Apache Parquet by defining how geospatial data should be stored, including the representation of geometries and the required additional metadata. Parquet is highly structured, meaning it stores the schema and data type of each column with the data files. GeoParquet's structure enables interoperability between any systems that read or write spatial data in Parquet format. To write to Parquet format without geometry data, see the Parquet data source included with Apache Spark.
Reading GeoParquet in Spark
The following examples demonstrate how to load GeoParquet into Spark DataFrames using both Python and Scala.
df = spark.read.format("geoparquet").load("path/to/your/file.parquet")

Writing DataFrames to GeoParquet
The following examples demonstrate how to write Spark DataFrames to GeoParquet using both Python and Scala.
df.write.format("geoparquet").option("version", "1.0.0").save("path/to/output/directory")

Here are the options that can be used in writing GeoParquet files:
| DataFrameWriter option | Example | Description |
|---|---|---|
| version | .option("version", "1.0.0") | Defines the version of the GeoParquet specification that will be used when writing. The default is version 1.0.0. |
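The write call and its version option can be wrapped in a small helper. This is a minimal sketch, not part of the product API: the function name `write_geoparquet` and its parameters are hypothetical, and it assumes `df` is a Spark DataFrame with at least one geometry column and that the `geoparquet` format is available in the session.

```python
def write_geoparquet(df, path, version="1.0.0"):
    """Write a DataFrame to GeoParquet using a chosen specification version.

    Hypothetical helper: `df` is assumed to be a Spark DataFrame that has
    at least one geometry column; `path` is the output directory.
    """
    # "version" is the DataFrameWriter option described in the table above.
    writer = df.write.format("geoparquet").option("version", version)
    writer.save(path)
```

A call such as `write_geoparquet(df, "path/to/output/directory")` would then use the default version 1.0.0.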
Usage notes
- Writing to GeoParquet requires one or more geometry fields.
- When loading data from GeoParquet into a DataFrame, geometry columns will be created automatically.
- The following table outlines how each geometry type in GeoAnalytics for Microsoft Fabric maps to geometry types in the GeoParquet specification. Note that the GeoParquet specification allows for geometry columns to have multiple types.
| GeoAnalytics for Microsoft Fabric | GeoParquet |
|---|---|
| point | [Point] |
| multipoint | [MultiPoint] |
| linestring | [LineString, MultiLineString] |
| polygon | [Polygon, MultiPolygon] |
| geometry | [GeometryCollection] |
- When loading data from GeoParquet, GeoAnalytics for Microsoft Fabric will read the spatial reference of geometries using included metadata. If the spatial reference is not recognized, the SRID will be set to 0 (unknown). Always verify that the spatial reference was read correctly using st.get_spatial_reference on the result DataFrame. If the spatial reference was not recognized, you can set it using ST_SRID.
- If no spatial reference is defined in the metadata for a geometry column, GeoAnalytics for Microsoft Fabric will assume the data is in World Geodetic System 1984 and set the SRID to 4326.
- When writing data to GeoParquet, some spatial references may not be recognized or supported by other systems that read GeoParquet. In these cases, transform your geometry data to World Geodetic System 1984 (SRID:4326) before writing to maximize interoperability.
- Setting a spatial reference on each geometry column prior to writing is recommended but not required. The GeoParquet specification defines that any data without a spatial reference is assumed to be in World Geodetic System 1984 (SRID:4326).
- Spark writes to multiple files by default. For interoperability with systems that do not support collections of GeoParquet, use .coalesce(1) on a DataFrame to return a new DataFrame that has exactly one partition and will be written to one GeoParquet file. Be cautious using .coalesce(1) on a large dataset.
- The schema of a nested geometry field may load differently from GeoParquet files depending on the Spark version.
- The following table outlines which versions of the GeoParquet schema are supported by each version of GeoAnalytics for Microsoft Fabric:
| GeoAnalytics for Microsoft Fabric | GeoParquet schema |
|---|---|
| 1.0.0 | 0.1.0 - 1.1.0 |
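Two of the usage notes above, verifying the spatial reference after loading and writing a single file, can be sketched as helpers. This is an illustrative sketch only: the function names are hypothetical, and it assumes that GeoAnalytics for Microsoft Fabric has been imported so that the `st` accessor is available on the DataFrame, and that the object returned by `st.get_spatial_reference` exposes an integer `srid` attribute.

```python
def srid_is_known(df):
    """Return True if the DataFrame's geometry column has a recognized SRID.

    Assumption: `df.st.get_spatial_reference()` returns an object with an
    integer `srid` attribute, where 0 means the spatial reference is unknown.
    """
    spatial_ref = df.st.get_spatial_reference()
    return spatial_ref.srid != 0


def write_single_file(df, path):
    """Write a DataFrame to exactly one GeoParquet file.

    coalesce(1) collapses the data into a single partition, so use this
    with caution on large datasets.
    """
    df.coalesce(1).write.format("geoparquet").save(path)
```

If `srid_is_known` returns False for a loaded DataFrame, the spatial reference can be set with ST_SRID before further processing, as described in the notes above.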