
A raster consists of cells (or pixels) organized into rows and columns (or a grid) in which each cell contains a value representing information.

Raster data can be stored in many locations, including cloud storage, a local directory, or any other location accessible to Spark.

GeoAnalytics Engine supports loading the following raster file formats: GTiff, JPEG, and PNG. After loading the raster data into a Spark DataFrame, you can use GeoAnalytics Engine raster functions to perform analysis and visualize the data. You can also export a Spark DataFrame raster column to a raster file (GTiff, JPEG, or PNG) for data storage or for use in other systems.

Reading raster files in Spark

Read in a single raster file

The following example demonstrates how to load a single raster file into a Spark DataFrame using Python.


df = spark.read.format("raster").load("path/to/your/file.tif")

Read in a directory of raster files

The following example demonstrates how to load raster files from a directory into a Spark DataFrame using Python.

Example raster directory structure:

path/to/your/directory/
├── raster_1.tif
├── raster_2.png
└── raster_3.jpg

df = spark.read.format("raster").load("path/to/your/directory")

Here are the options that can be used when reading in raster files:

  • maxPartitionBytes (example: .option("maxPartitionBytes", "256m")): Defines the maximum size in bytes (e.g., "256m" or "1g") for each Spark partition when reading in a raster file. If not specified, the value from spark.sql.files.maxPartitionBytes is used, which defaults to "128m".
  • materialize (example: .option("materialize", "false")): If false, the raster objects are stored as references to the original files; if true, they are brought into memory immediately. For more information, see raster references. The default is false.
  • tileColumns (example: .option("tileColumns", 2048)): Defines the number of columns in each raster tile. The default is 1024.
  • tileRows (example: .option("tileRows", 2048)): Defines the number of rows in each raster tile. The default is 1024.
  • overlap (example: .option("overlap", 2)): Defines the number of rows and columns that overlap in each raster tile. If not specified, the tiles will not overlap.
  • sr (example: .option("sr", 4326)): The spatial reference that the raster will be transformed to. If sr is different from the input raster's spatial reference, the result will be a value raster regardless of the materialize option.
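As a sketch, the reader options above might be combined like this. The directory path is a placeholder and the option values are illustrative; the Spark call itself is shown commented out because it requires a running GeoAnalytics Engine session.

```python
# Reader options from the table above, collected in one place (illustrative values).
reader_options = {
    "maxPartitionBytes": "256m",  # cap each Spark partition at 256 MB
    "materialize": "false",       # keep rasters as references, not in memory
    "tileColumns": "2048",        # 2048 columns per tile
    "tileRows": "2048",           # 2048 rows per tile
    "overlap": "2",               # 2 rows/columns of overlap between tiles
    "sr": "4326",                 # transform to WGS 84 on read
}

# df = (spark.read.format("raster")
#           .options(**reader_options)
#           .load("path/to/your/directory"))
```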

Writing DataFrames to raster files

The following example demonstrates how to write a Spark DataFrame to a raster file using Python.


df.write.format("raster").save("path/to/output/directory")

Here are the options that can be used when writing raster files:

  • compressionType (example: .option("compressionType", "lzw")): Defines the GTiff output raster compression type. Supported types are "none", "jpeg", "lzw", "packbits", and "deflate". The default is "none".
  • fieldName (example: .option("fieldName", "raster")): Defines the raster column to write when there are multiple raster columns.
  • format (example: .option("format", "png")): Defines the output raster file type. The default is "gtiff".
  • noDataPolicy (example: .option("noDataPolicy", "promotion")): Defines the masking policy for the NoData values on write. Supported policies are "promotion", a numeric value (e.g., "1"), "maximum", "minimum", and "best-effort".
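As a sketch, the writer options above might be gathered like this. The output path and column name are placeholders; the Spark call is commented out because it needs a live session.

```python
# Writer options from the table above, collected in one place (illustrative values).
writer_options = {
    "format": "gtiff",             # write GeoTIFF output
    "compressionType": "lzw",      # lossless LZW compression (GTiff only)
    "fieldName": "raster",         # which raster column to write
    "noDataPolicy": "promotion",   # preserve NoData by promoting the pixel type
}

# (df.write.format("raster")
#     .options(**writer_options)
#     .save("path/to/output/directory"))
```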

Raster pixels can have a NoData value to represent the absence of data. When writing a raster you can optionally specify a masking policy (noDataPolicy) for the NoData values. A mask is used to determine which cells in the raster are valid.

A mask is essentially a boolean array of the same length as the data. If the value at the index is false, the cell is not valid and should not be included for analysis.
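The mask idea can be illustrated with a minimal, library-free Python sketch. The NoData value of 255 here is an arbitrary choice for illustration.

```python
# A NoData mask is a boolean array parallel to the pixel values, where False
# marks cells that are invalid and should be excluded from analysis.
pixels = [10, 255, 42, 255, 7]   # 255 plays the role of the NoData value here
nodata = 255

mask = [value != nodata for value in pixels]        # False = invalid cell
valid = [v for v, ok in zip(pixels, mask) if ok]    # keep only masked-in cells

print(mask)   # [True, False, True, False, True]
print(valid)  # [10, 42, 7]
```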

The following policies are supported:

  • promotion: Converts the raster to the next largest pixel type and sets the NoData value to the largest value for the new type. If the pixel type is Float64, no conversion will occur and the NoData value will be set to 1.7976931348623157E308, the maximum value for Float64.

  • <numeric value>: Sets the NoData value to the numeric value provided. Masked pixels will be set to this value.

    • Warning: any unmasked pixels with this value will be masked when read back in.
  • maximum: Sets the NoData value to the maximum value for the pixel type.

  • minimum: Sets the NoData value to the minimum value for the pixel type.

  • best-effort: No policy is applied. The NoData value present will be used if it exists.

NoData metadata is not guaranteed to be preserved when processing an input raster. To reliably preserve the NoData value when writing a raster, either explicitly specify the NoData value or use the promotion policy.
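For illustration, the promotion rule described above can be sketched as a small lookup. The uint8-to-uint16 mapping is an assumed example of "next largest pixel type", not the engine's actual implementation; only the float64 special case is stated explicitly above.

```python
# Sketch of the "promotion" policy: convert to the next largest pixel type and
# set NoData to the maximum value of the new type. Float64 has no larger type,
# so its NoData becomes the Float64 maximum.
promote = {"uint8": "uint16", "uint16": "uint32", "float32": "float64"}
type_max = {
    "uint16": 2**16 - 1,
    "uint32": 2**32 - 1,
    "float64": 1.7976931348623157e308,  # maximum value for Float64
}

def promoted_nodata(pixel_type):
    new_type = promote.get(pixel_type, pixel_type)  # float64 stays float64
    return new_type, type_max[new_type]

print(promoted_nodata("uint8"))    # ('uint16', 65535)
print(promoted_nodata("float64"))  # ('float64', 1.7976931348623157e+308)
```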

The table below lists the supported pixel types when writing a raster column to a raster file in the specified format:

  • GTiff (.tif): uint1, uint2, uint4, uint8, int8, uint16, int16, uint32, int32, float32, float64
  • JPEG (.jpg): uint8, int8
  • PNG (.png): uint8, int8, uint16, int16

The short form of each pixel type is also supported: for example, u1 can be used instead of uint1, or f32 instead of float32. Note that pixel type sizes are bit-based: u1 is one bit, not one byte, so it uses one bit per pixel.
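Because pixel type sizes are bit-based, the uncompressed size of a raster can be estimated with simple arithmetic. The helper below is hypothetical, not part of the API, and covers only a few of the pixel types listed above.

```python
# Rough storage arithmetic for bit-based pixel types (uncompressed).
# u1 uses one bit per pixel; f32 uses 32 bits per pixel.
bits_per_pixel = {"u1": 1, "u8": 8, "u16": 16, "f32": 32, "f64": 64}

def uncompressed_bytes(rows, cols, bands, pixel_type):
    """Estimate uncompressed raster size in bytes."""
    total_bits = rows * cols * bands * bits_per_pixel[pixel_type]
    return total_bits // 8

print(uncompressed_bytes(1024, 1024, 1, "u1"))   # 131072 bytes (128 KiB)
print(uncompressed_bytes(1024, 1024, 3, "f32"))  # 12582912 bytes (12 MiB)
```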

Unlike most Spark data sources, the raster data source writes out each row as its own file. While .coalesce() and .repartition() change the partitioning before writing out, they won't change the number of files written out and using them may even adversely affect performance due to cluster underutilization. If larger files need to be written out, RT_Merge can be used to combine tiles into a single DataFrame row. The maximum uncompressed raster size for a single DataFrame row is 2.147 gigabytes due to Java Virtual Machine constraints.
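The 2.147 gigabyte figure above is the signed 32-bit byte limit (2**31 - 1 bytes). A quick feasibility check before merging tiles into a single DataFrame row might look like the hypothetical helper below (not part of the API).

```python
# Maximum uncompressed raster size for a single DataFrame row: the JVM's
# signed 32-bit array limit, i.e. 2**31 - 1 bytes (~2.147 GB).
JVM_ROW_LIMIT_BYTES = 2**31 - 1

def fits_in_one_row(rows, cols, bands, bytes_per_pixel):
    """Check whether an uncompressed raster fits in a single DataFrame row."""
    return rows * cols * bands * bytes_per_pixel <= JVM_ROW_LIMIT_BYTES

print(fits_in_one_row(20_000, 20_000, 4, 1))  # True  (1.6 GB)
print(fits_in_one_row(30_000, 30_000, 4, 1))  # False (3.6 GB)
```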

Usage notes

  • The tileColumns and tileRows options do not apply when reading in a JPEG raster file. A single tile has a maximum size of 2.147 gigabytes due to Java Virtual Machine constraints. If the JPEG raster file is larger than this, break the JPEG into tiles before reading it in or convert it to a cloud native format like Cloud Optimized GeoTIFF (COG).
  • Writing to a raster requires a DataFrame with one or more raster columns. Use the fieldName option to specify the raster column that will be used for write.
  • When loading data from a raster file into a DataFrame, a raster column will be created automatically.
  • An additional alpha band is added when writing a raster column to a PNG file.
  • The mean and standard deviation statistics for each band may be different after writing a raster to a JPEG file and reading it back in as a raster, given that JPEG is a "lossy" compression format.
  • When working with large rasters that will undergo a spatial transformation, specify the sr (spatial reference) option directly in the raster read operation rather than applying the RT.Transform function after reading. Reading large rasters typically involves tiling, and transforming afterward can introduce slight gaps or missing pixels at tile boundaries. Specifying the desired spatial reference during the read ensures the transformation is handled efficiently and accurately, minimizing the risk of data loss or visual artifacts.
  • When working with remote raster datasets, such as those stored in Amazon S3, the Cloud Optimized GeoTIFF (COG) format is often the preferred choice for achieving the fastest remote read access.
