A raster consists of cells (or pixels) organized into rows and columns (or a grid) in which each cell contains a value representing information.
Raster data can be stored in many locations, including cloud storage, a local directory, or any other location accessible to Spark.
GeoAnalytics Engine
supports loading the following raster file formats: GeoTIFF, JPEG, and PNG.
After loading the raster data into a Spark DataFrame, you can use GeoAnalytics Engine
raster functions to perform analysis and visualize the data.
You can also export a Spark DataFrame raster column to a raster file (GeoTIFF, JPEG, or PNG) for data
storage or export to other systems.
Reading raster files in Spark
Read in a single raster file
The following example demonstrates how to load a single raster file into a Spark DataFrame using Python.
```python
df = spark.read.format("raster").load("path/to/your/file.tif")
```

Read in a directory of raster files
The following example demonstrates how to load raster files from a directory into a Spark DataFrame using Python.
path/to/your/directory/
├── raster_1.tif
├── raster_2.png
└── raster_3.jpg
```python
df = spark.read.format("raster").load("path/to/your/directory")
```

Here are the options that can be used when reading in raster files:
| DataFrameReader option | Example | Description |
|---|---|---|
| maxSizePerPartition | .option("maxSizePerPartition", "256m") | Defines the maximum size in bytes (e.g., "256m" or "1g") for each Spark partition when reading in a raster file. If not specified, the value of spark.sql.files.maxPartitionBytes will be used, which has a default of "128m". |
| materialize | .option("materialize", "false") | If false, the raster objects are stored as references to the original files; if true, they are brought into memory immediately. For more information see raster references. The default is false. |
| tileColumns | .option("tileColumns", 1024) | Defines the number of columns in each raster tile. The default is 1024. |
| tileRows | .option("tileRows", 1024) | Defines the number of rows in each raster tile. The default is 1024. |
| overlap | .option("overlap", 2) | Defines the number of rows and columns that will overlap in each raster tile. If not specified, the tiles will not overlap. |
| sr | .option("sr", 4326) | The spatial reference that the raster will be transformed to. If sr is different from the input raster's spatial reference, the result will be a value raster regardless of the materialize option. |
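The maximum partition size accepts Spark-style size strings. As a rough illustration of how such strings map to bytes and how many partitions a file of a given size would span, consider the sketch below; the helper names and parsing rules are assumptions for illustration, not part of the GeoAnalytics Engine API:

```python
import math
import re

# Hypothetical helper: parse a Spark-style size string ("128m", "1g", "256k")
# into a number of bytes. For illustration only.
def parse_size(size: str) -> int:
    match = re.fullmatch(r"(\d+)([kmg]?)", size.strip().lower())
    if not match:
        raise ValueError(f"unrecognized size string: {size!r}")
    value, unit = int(match.group(1)), match.group(2)
    factor = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}[unit]
    return value * factor

# Estimate how many Spark partitions a raster file would be split into
# for a given maximum partition size.
def estimate_partitions(file_bytes: int, max_partition: str = "128m") -> int:
    return max(1, math.ceil(file_bytes / parse_size(max_partition)))

# A 1 GiB file read with the default "128m" would span 8 partitions.
print(estimate_partitions(parse_size("1g")))  # 8
```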
Writing DataFrames to raster files
The following example demonstrates how to write a Spark DataFrame to raster files using Python.
```python
df.write.format("raster").save("path/to/output/directory")
```

Here are the options that can be used when writing raster files:
| DataFrameWriter option | Example | Description |
|---|---|---|
| compression | .option("compression", "lzw") | Defines the GTiff output raster compression type. Supported types are: "none", "jpeg", "lzw", "packbits", and "deflate". The default is "none". |
| fieldName | .option("fieldName", "raster") | Defines the raster column to write when there are multiple raster columns. |
| format | .option("format", "png") | Defines the output raster file type. The default is "gtiff". |
| noDataPolicy | .option("noDataPolicy", "promotion") | Defines the masking policy for the NoData values on write. Supported policies are: "promotion", a numeric value (e.g., "1"), "maximum", "minimum", and "best-effort". |
Raster pixels can have a NoData value to represent the absence of data. When writing a raster you can optionally
specify a masking policy (noDataPolicy) for the NoData values. A mask is used to determine which cells in the
raster are valid.
A mask is essentially a boolean array of the same length as the data. If the value at the index is false, the cell is not valid and should not be included for analysis.
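As a concrete illustration, applying such a mask to a small block of pixel values might look like the following sketch (plain Python, independent of the raster API):

```python
# A toy 1-D block of pixel values and a boolean mask of the same length.
pixels = [12, 7, 255, 42, 255, 3]
mask = [True, True, False, True, False, True]

# Only cells whose mask entry is True are valid and included in analysis.
valid = [value for value, ok in zip(pixels, mask) if ok]
print(valid)                    # [12, 7, 42, 3]
print(sum(valid) / len(valid))  # mean over valid cells only: 16.0
```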
The following policies are supported:

- promotion: Converts the raster to the next largest pixel type and sets the NoData value to the largest value for the new type. If the pixel type is Float64, no conversion will occur and the NoData value will be set to 1.7976931348623157E308, the maximum value for Float64.
- <numeric value>: Sets the NoData value to the numeric value provided. Masked pixels will be set to this value. WARNING: any unmasked pixels with this value will be masked when read back in.
- maximum: Sets the NoData value to the maximum value for the pixel type.
- minimum: Sets the NoData value to the minimum value for the pixel type.
- best-effort: No policy is applied. The NoData value present will be used if it exists.
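The warning on the numeric-value policy can be seen with a small simulation (plain Python, not the actual writer): masked cells are written with the chosen NoData value, so any valid cell that already holds that value becomes indistinguishable on read-back.

```python
NODATA = 255  # the numeric NoData value chosen at write time

pixels = [10, 255, 30, 40]        # note: 255 at index 1 is a *valid* pixel
mask = [True, True, True, False]  # the last cell is masked (invalid)

# On write, masked cells are set to the NoData value.
written = [v if ok else NODATA for v, ok in zip(pixels, mask)]

# On read, every cell equal to NODATA is treated as masked -- including
# the originally valid 255 at index 1.
mask_after = [v != NODATA for v in written]
print(mask_after)  # [True, False, True, False]
```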
It is not guaranteed that NoData metadata is handled when processing an input raster. The best way to correctly preserve
the NoData value when writing a raster is to either explicitly specify the NoData value or to use the promotion policy.
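To make the promotion policy concrete, the sketch below shows what promoting a pixel type implies for the NoData value. The type ladder is a partial assumption for a few common types, written for illustration rather than taken from the library:

```python
# Hypothetical promotion ladder: each pixel type maps to the next-largest
# type, and the maximum value of that larger type becomes the NoData value.
PROMOTION = {
    "uint8":   ("uint16", 2**16 - 1),
    "uint16":  ("uint32", 2**32 - 1),
    "float32": ("float64", 1.7976931348623157e308),
    # Float64 cannot be promoted further; its own maximum is used.
    "float64": ("float64", 1.7976931348623157e308),
}

def promote(pixel_type: str):
    """Return the promoted pixel type and its NoData value."""
    return PROMOTION[pixel_type]

print(promote("uint8"))    # ('uint16', 65535)
print(promote("float64"))  # ('float64', 1.7976931348623157e+308)
```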
The table below lists the supported pixel types when writing a raster column to a raster file in the specified format:
| Format | Supported Pixel Types |
|---|---|
| GTiff (.tif) | uint1, uint2, uint4, uint8, int8, uint16, int16, uint32, int32, float32, float64 |
| JPEG (.jpg) | uint8, int8 |
| PNG (.png) | uint8, int8, uint16, int16 |
The short version of pixel type is also supported. For example, u1 can be used instead of uint1 or f32 instead of float32.
Note that pixel type sizes are bit-based: u1, for example, is one bit rather than one byte, so a u1 raster uses one bit per pixel.
Unlike most Spark data sources, the raster data source writes out each row as its own file. While .coalesce() and .repartition() change the partitioning before writing out, they won't change the number of files written out and using them may even adversely affect performance due to cluster underutilization. If larger files need to be written out, RT_Merge can be used to combine tiles into a single DataFrame row. The maximum uncompressed raster size for a single DataFrame row is 2.147 gigabytes due to Java Virtual Machine constraints.
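Because pixel types are bit-sized, the uncompressed footprint of a raster can be estimated and checked against the 2.147 gigabyte (2^31 byte) JVM ceiling. The helper below is an illustrative sketch, not part of the API:

```python
import math

# Bits per pixel for the short pixel-type names (bit-based, per the text:
# u1 is one bit, not one byte).
BITS = {"u1": 1, "u2": 2, "u4": 4, "u8": 8, "i8": 8, "u16": 16,
        "i16": 16, "u32": 32, "i32": 32, "f32": 32, "f64": 64}

def raster_bytes(rows: int, cols: int, bands: int, pixel_type: str) -> int:
    """Uncompressed size in bytes of a raster of the given dimensions."""
    return math.ceil(rows * cols * bands * BITS[pixel_type] / 8)

JVM_LIMIT = 2**31  # ~2.147 GB: maximum uncompressed raster per DataFrame row

# A 1024x1024 single-band u1 tile uses only 131072 bytes...
print(raster_bytes(1024, 1024, 1, "u1"))  # 131072
# ...while a 17000x17000, 3-band f64 raster would exceed the JVM limit.
print(raster_bytes(17000, 17000, 3, "f64") < JVM_LIMIT)  # False
```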
Usage notes
- The tileColumns and tileRows options do not apply when reading in a JPEG raster file. A single tile has a maximum size of 2.147 gigabytes due to Java Virtual Machine constraints. If the JPEG raster file is larger than this, break the JPEG into tiles before reading it in or convert it to a cloud-native format like Cloud Optimized GeoTIFF (COG).
- Writing to a raster requires a DataFrame with one or more raster columns. Use the fieldName option to specify the raster column that will be used for the write.
- When loading data from a raster file into a DataFrame, a raster column will be created automatically.
- An additional alpha band is added when writing a raster column to a PNG file.
- The mean and standard deviation statistics for each band may be different after writing a raster to a JPEG file and reading it back in as a raster, given that JPEG is a "lossy" compression format.
- When working with large rasters and planning to perform a spatial transformation, it is recommended to specify the sr (spatial reference) option directly in the raster read operation, rather than applying the RT_Transform function after reading the raster. Reading large rasters typically involves tiling, and transforming afterward can introduce slight gaps or missing pixels at tile boundaries. Specifying the desired spatial reference during the read ensures the transformation is handled efficiently and accurately, minimizing the risk of data loss or visual artifacts.
- When working with remote raster datasets, such as those stored in Amazon S3, the Cloud Optimized GeoTIFF (COG) format is often the preferred choice for achieving the fastest remote read access.