GeoJSON

GeoJSON is an open standard geospatial data interchange format that represents simple geographic features and their nonspatial attributes. Based on JavaScript Object Notation (JSON), GeoJSON is a format for encoding a variety of geographic data structures. It uses a geographic coordinate reference system, World Geodetic System 1984, and units of decimal degrees. To learn more about GeoJSON, see the GeoJSON specification.
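
For orientation, the snippet below is a minimal sketch of what a GeoJSON document looks like, built with Python's standard `json` module. The feature name, attribute values, and coordinates are made up for illustration. Note that GeoJSON positions are ordered longitude first, then latitude.

```python
import json

# A minimal GeoJSON FeatureCollection holding one point feature.
# Coordinates are [longitude, latitude] in decimal degrees (WGS 84).
# The property names and values are illustrative only.
feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [-117.1956, 34.0564]},
            "properties": {"name": "Sample location", "visits": 12},
        }
    ],
}

# Serialize to GeoJSON text, then parse it back to confirm it is valid JSON.
geojson_text = json.dumps(feature_collection, indent=2)
parsed = json.loads(geojson_text)
print(geojson_text)
```

The `geometry` member carries the spatial part of each feature, and `properties` carries its nonspatial attributes.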

GeoJSON data can be stored in a distributed file system such as HDFS, cloud storage such as S3, a local directory, or any other location that is accessible through Spark.

When loading GeoJSON data, a geometry column will be automatically created in the result DataFrame and its spatial reference set. GeoJSON supports point, line, polygon, and multipart collections of point, line, or polygon geometries. After loading GeoJSON files into a Spark DataFrame, you can perform analysis and visualize the data by using the SQL functions and tools available in GeoAnalytics Engine in addition to functions offered in Spark. Once you save a DataFrame as GeoJSON, you can store the files or access and visualize them through other systems.

The following table shows examples of the Python syntax for loading and saving GeoJSON with GeoAnalytics Engine, where path is a path to a directory of GeoJSON files or a single GeoJSON file.

Load	Save
`spark.read.format("geojson").load(path)`	`df.write.format("geojson").save(path)`
`spark.read.load(path, format="geojson")`	`df.write.save(path, format="geojson")`
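
As a rough illustration of the first row of the table, the sketch below wraps the load and save calls in two small helpers. It assumes a `SparkSession` on which GeoAnalytics Engine is already installed and authorized. The helper names and the paths in the usage comments are hypothetical.

```python
def load_geojson(spark, path):
    """Load a GeoJSON file, or a directory of GeoJSON files, into a DataFrame.

    `spark` is assumed to be a SparkSession with GeoAnalytics Engine available.
    """
    return spark.read.format("geojson").load(path)


def save_geojson(df, path):
    """Write a DataFrame that has exactly one geometry column to GeoJSON."""
    df.write.format("geojson").save(path)


# Example usage (placeholder paths, not real locations):
# df = load_geojson(spark, "/data/input_geojson/")
# save_geojson(df, "/data/output_geojson/")
```

The second row of the table expresses the same operations with the format passed as a keyword argument instead of through `format()`.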

By default, GeoJSON uses World Geodetic System 1984 (SRID:4326) and decimal degrees when saving the DataFrame. If the DataFrame geometry is in a different spatial reference, it will be automatically transformed into World Geodetic System 1984. In addition, GeoAnalytics Engine supports the option to save the DataFrame with a custom spatial reference using customCrs. To learn more about spatial references, see Coordinate systems and transformations. Additionally, the Spark DataFrameReader and DataFrameWriter classes provide other options that can be used when loading and saving GeoJSON files, as shown below.

DataFrameReader option	Example	Description
`sampleSize`	`.option("sampleSize", 5)`	Specify the number of records to sample when inferring the schema.
`samplingRatio`	`.option("samplingRatio", 0.25)`	Specify the ratio of records to sample when inferring the schema. It must be a number between 0 and 1.
`mergeSchemas`	`.option("mergeSchemas", True)`	Merge the schemas of a collection of GeoJSON datasets in the input directory.
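
The reader options above can be chained before `load`. The following is a minimal sketch, assuming a `SparkSession` with GeoAnalytics Engine available. The helper name and its default values are hypothetical, and the range check simply mirrors the "between 0 and 1" requirement stated in the table.

```python
def load_geojson_with_inference(spark, path, sampling_ratio=0.25, merge_schemas=True):
    """Load GeoJSON while controlling how the schema is inferred.

    `spark` is assumed to be a SparkSession with GeoAnalytics Engine available.
    """
    # samplingRatio must be a number between 0 and 1.
    if not 0 <= sampling_ratio <= 1:
        raise ValueError("sampling_ratio must be between 0 and 1")

    reader = (
        spark.read.format("geojson")
        # Ratio of records to sample when inferring the schema.
        .option("samplingRatio", sampling_ratio)
        # Merge the schemas of the GeoJSON datasets in the input directory.
        .option("mergeSchemas", merge_schemas)
    )
    return reader.load(path)
```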

DataFrameWriter option	Example	Description
`customCrs`	`.option("customCrs", True)`	When set to `True`, geometries will not be automatically transformed to World Geodetic System 1984. The default is `False`.
`ignoreNullFields`	`.option("ignoreNullFields", False)`	Specifies whether to ignore null fields when generating GeoJSON objects. The default is `True`.
`partitionBy`	`.partitionBy("date")`	Partition the output by the given column name. This example will partition the output GeoJSON files by values in the `date` column.
`overwrite`	`.mode("overwrite")`	Overwrite existing data in the specified path. Other available modes are `append`, `error`, and `ignore`.
`multiline`	`.option("multiline", True)`	Writes the GeoJSON in multiline format.
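
Several of the writer settings above can be combined in one call chain. The sketch below is illustrative only: the helper name and its parameters are hypothetical, and the particular combination of settings is an assumption rather than a recommended configuration.

```python
def save_geojson_partitioned(df, path, partition_column, ignore_null_fields=False):
    """Write a DataFrame to GeoJSON, partitioned by one column.

    `df` is assumed to be a Spark DataFrame with exactly one geometry column.
    """
    writer = (
        df.write.format("geojson")
        # Keep null fields in the generated GeoJSON objects when False.
        .option("ignoreNullFields", ignore_null_fields)
        # Replace any data that already exists at the target path.
        .mode("overwrite")
        # Write one subdirectory per distinct value of the partition column.
        .partitionBy(partition_column)
    )
    writer.save(path)


# Example usage (placeholder path; "date" is the column used in the table above):
# save_geojson_partitioned(df, "/data/output_geojson/", "date")
```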

Usage notes

GeoJSON doesn't support the generic geometry data type. A point, multipoint, line, or polygon column is required when writing to GeoJSON.
Writing to GeoJSON requires exactly one geometry field.
When loading GeoJSON, if no spatial reference is defined, World Geodetic System 1984 (SRID:4326) is assumed.
Spark will read GeoJSON files from multiple directories if the directory names start with column=, where column is the name of a partition column. For example, a directory of GeoJSON data that is partitioned by district contains subdirectories whose names start with district=, and Spark infers district as a column name in the DataFrame by reading those subdirectory names.
Writing DataFrames to GeoJSON doesn't require a spatial reference to be set on the geometry column. However, it is recommended to always set and check the spatial reference of a DataFrame before writing to GeoJSON if the data is not in World Geodetic System 1984.
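
To make the column= directory convention from the notes above concrete, the snippet below parses partition columns out of example file paths using only the Python standard library. This is an illustration of the naming convention, not Spark's actual partition discovery code. The function name, directory layout, and file names are hypothetical.

```python
from pathlib import PurePosixPath


def partition_values(file_path):
    """Return {column: value} for each 'column=value' directory in a path."""
    values = {}
    # Only directory components are inspected, not the file name itself.
    for part in PurePosixPath(file_path).parent.parts:
        column, separator, value = part.partition("=")
        if separator and column:
            values[column] = value
    return values


# Hypothetical layout of GeoJSON data partitioned by district.
example_paths = [
    "data/district=North/part-00000.geojson",
    "data/district=South/part-00000.geojson",
]

for example_path in example_paths:
    print(example_path, partition_values(example_path))
```

Loading the parent directory of such a layout would give each record a district value taken from the subdirectory its file sits in.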