GeoParquet

GeoParquet is a standardized, open-source columnar storage format that extends Apache Parquet by defining how geospatial data should be stored, including how geometries are represented and which additional metadata is required. Parquet is highly structured, meaning it stores the schema and data type of each column with the data files. GeoParquet's structure enables interoperability between systems that read or write spatial data in Parquet format. To write to Parquet format without geometry data, use the Parquet data source included with Apache Spark.

The following table shows examples of the Python syntax for loading and saving GeoParquet with GeoAnalytics for Microsoft Fabric, where `path` is a path to a directory of GeoParquet files or a single GeoParquet file.

Load	Save
`spark.read.format("geoparquet").load(path)`	`df.write.format("geoparquet").save(path)`
`spark.read.load(path, format="geoparquet")`	`df.write.save(path, format="geoparquet")`
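
The load and save calls in the table can be wrapped into a small round trip. This is a minimal sketch, not a definitive implementation: the helper names `load_geoparquet` and `save_geoparquet` and the example paths are hypothetical, and it assumes a notebook where `spark` is the active Spark session and GeoAnalytics for Microsoft Fabric is already enabled.

```python
# Minimal sketch of a GeoParquet round trip.
# Assumption: `spark` is the active SparkSession and GeoAnalytics for
# Microsoft Fabric is enabled so that the "geoparquet" format is available.
# The helper names below are illustrative and not part of any library.

def load_geoparquet(spark, path):
    """Load a GeoParquet file, or a directory of GeoParquet files, into a DataFrame."""
    return spark.read.format("geoparquet").load(path)


def save_geoparquet(df, path):
    """Write a DataFrame that has at least one geometry column to GeoParquet."""
    df.write.format("geoparquet").save(path)


# Hypothetical usage (paths are placeholders):
# df = load_geoparquet(spark, "Files/input_geoparquet")
# save_geoparquet(df, "Files/output_geoparquet")
```

Passing the session and DataFrame in as arguments keeps the helpers free of global state, which makes them easier to reuse across notebooks.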

Additionally, the Spark DataFrameReader and DataFrameWriter classes provide extra options that can be used when reading or writing GeoParquet. For a full list of options offered in DataFrameReader and DataFrameWriter, see the Spark API reference.

DataFrameWriter option	Example	Description
`version`	`.option("version", "1.0.0")`	Defines the version of the GeoParquet specification that will be used when writing. The default is version 1.0.0.
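
As an illustration of the `version` option, the sketch below pins the specification version at write time. The helper name `save_geoparquet_with_version` and the example path are hypothetical; the `.option("version", ...)` call is the one shown in the table above.

```python
# Minimal sketch: write GeoParquet with an explicit specification version.
# Assumption: `df` is a Spark DataFrame with at least one geometry column and
# GeoAnalytics for Microsoft Fabric is enabled. The helper name is illustrative.

def save_geoparquet_with_version(df, path, version="1.0.0"):
    """Write `df` to `path` as GeoParquet using the given specification version."""
    (
        df.write.format("geoparquet")
        .option("version", version)  # defaults to 1.0.0 when the option is omitted
        .save(path)
    )


# Hypothetical usage (path is a placeholder):
# save_geoparquet_with_version(df, "Files/output_geoparquet", version="1.0.0")
```

Setting the version explicitly, even when it matches the default, documents which specification downstream readers should expect.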

Usage notes

Writing to GeoParquet requires one or more geometry fields.
When loading data from GeoParquet into a DataFrame, geometry columns will be created automatically.
The following table outlines how each geometry type in GeoAnalytics for Microsoft Fabric maps to geometry types in the GeoParquet specification. Note that the GeoParquet specification allows for geometry columns to have multiple types.

GeoAnalytics for Microsoft Fabric	GeoParquet
point	[Point]
multipoint	[MultiPoint]
linestring	[LineString, MultiLineString]
polygon	[Polygon, MultiPolygon]
geometry	[GeometryCollection]

When loading data from GeoParquet, GeoAnalytics for Microsoft Fabric will read the spatial reference of geometries using included metadata. If the spatial reference is not recognized, the SRID will be set to 0 (unknown). Always verify that the spatial reference was read correctly using st.get_spatial_reference on the result DataFrame. If the spatial reference was not recognized, you can set it using ST_SRID.
If no spatial reference is defined in the metadata for a geometry column, GeoAnalytics for Microsoft Fabric will assume the data is in World Geodetic System 1984 and set the SRID to 4326.
When writing data to GeoParquet, some spatial references may not be recognized or supported by other systems that read GeoParquet. In these cases, transform your geometry data to World Geodetic System 1984 (SRID:4326) before writing to guarantee interoperability.
Setting a spatial reference on each geometry column prior to writing is recommended but not required. The GeoParquet specification defines that any data without a spatial reference is assumed to be in World Geodetic System 1984 (SRID:4326).
Spark writes to multiple files by default. For interoperability with systems that do not support collections of GeoParquet files, use .coalesce(1) on a DataFrame to return a new DataFrame that has exactly one partition and will be written to one GeoParquet file. Be cautious using .coalesce(1) on a large dataset, because all of the data is moved into a single partition and written by a single task, which can be slow or exhaust memory.
The schema of a nested geometry field may load differently from GeoParquet files depending on the Spark version. For example, in Spark 3.4 a point geometry column inside a nested column is loaded as `StructField(point,StructType(StructField(x,DoubleType,false),StructField(y,DoubleType,false),StructField(z,DoubleType,false),StructField(m,DoubleType,false))`, whereas in Spark 3.3 it is loaded as `StructField(point,PointUDT(None),true))`.
The following table outlines which versions of the GeoParquet specification are supported by each version of GeoAnalytics for Microsoft Fabric:

GeoAnalytics for Microsoft Fabric	GeoParquet specification
1.0.0-beta	0.1.0 - 1.0.0
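
Two of the usage notes above, verifying the spatial reference after loading and writing a single file, can be sketched as follows. This is a hedged example under stated assumptions: the helper names are hypothetical, and the `srid` attribute on the object returned by `st.get_spatial_reference` is assumed rather than confirmed by this page.

```python
# Minimal sketch of two usage notes: checking the spatial reference after a
# load and writing a single GeoParquet file.
# Assumptions: `df` is a DataFrame loaded with the "geoparquet" format, and the
# object returned by df.st.get_spatial_reference() exposes an `srid` attribute.
# The helper names are illustrative and not part of any library.

UNKNOWN_SRID = 0  # the notes state that an unrecognized spatial reference is set to 0


def spatial_reference_is_known(df):
    """Return True when the loaded spatial reference was recognized."""
    spatial_reference = df.st.get_spatial_reference()
    return spatial_reference.srid != UNKNOWN_SRID


def save_single_geoparquet(df, path):
    """Write `df` as one GeoParquet file by first reducing it to one partition."""
    # coalesce(1) moves all rows into a single partition, so use with care
    # on large datasets.
    df.coalesce(1).write.format("geoparquet").save(path)


# Hypothetical usage (paths are placeholders):
# df = spark.read.format("geoparquet").load("Files/input_geoparquet")
# if not spatial_reference_is_known(df):
#     print("Spatial reference not recognized; set it before analysis.")
# save_single_geoparquet(df, "Files/output_single")
```

Note that Spark still creates a directory at the output path; with a single partition, that directory contains one data file.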