GeoAnalytics Engine uses Spark DataFrame columns to represent raster datasets, where each row in the column contains one or more raster bands.

Raster values

A raster consists of cells (or pixels) organized into rows and columns (or a grid) in which each cell contains a value representing information.

Cell values can represent information such as a spectral value, category, magnitude, or height. The category can be a land-use class such as grassland, forest, or road. Spectral values are used in satellite and aerial imagery to represent light reflectance and color. A magnitude may represent gravity, noise pollution, or percent of rainfall. Height (distance) can represent surface elevation above mean sea level, which can be used to derive slope, aspect, and watershed properties.

Cell values can be positive or negative, and either integer or floating point. Integer values are best suited to categorical (discrete) data, and floating-point values are well suited to continuous surfaces. Cells can also have a NoData value to represent the absence of data. The type of a raster value is known as the Pixel Type.
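
As a small illustration of how NoData affects a calculation, the sketch below averages a row of floating-point elevation values while skipping a NoData sentinel. The sentinel -9999.0 and the helper name mean_ignoring_nodata are assumptions made for this example; they are not part of GeoAnalytics Engine.

```python
NODATA = -9999.0  # assumed sentinel; real rasters declare their own NoData value

def mean_ignoring_nodata(cells, nodata=NODATA):
    """Average the valid cells, or return None if every cell is NoData."""
    valid = [v for v in cells if v != nodata]
    return sum(valid) / len(valid) if valid else None

elevations = [120.0, NODATA, 119.0, 121.0]
mean_elevation = mean_ignoring_nodata(elevations)
```

Including the sentinel in the average would drag the result far below any real elevation, which is why NoData cells are excluded rather than treated as ordinary values.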

Raster properties

The location of each cell is defined by the row and column where it is located in the raster matrix. The matrix is represented by a Cartesian coordinate system in which the rows of the matrix are parallel to the x-axis and the columns are parallel to the y-axis of the Cartesian plane. Row and column values begin with 0. In the example below, if the raster is in a Universal Transverse Mercator (UTM) projected coordinate system and has a pixel size of 100, the pixel location at 5,1 is 300,500 East, 5,900,600 North.

[Figure: Cartesian vs. geographic location within a raster]
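
The arithmetic behind the UTM example above can be sketched as follows. The upper-left origin of 300,000 East, 5,900,700 North is an assumption back-solved from the stated result (the text does not give it), as is reading 5,1 as column 5, row 1. pixel_to_map is a hypothetical helper, not a GeoAnalytics Engine function.

```python
def pixel_to_map(col, row, x_origin, y_origin, cell_size):
    """Return the map coordinate of the upper-left corner of cell (col, row).

    Columns count east from the origin; rows count south from it,
    so the northing decreases as the row index grows.
    """
    x = x_origin + col * cell_size
    y = y_origin - row * cell_size
    return x, y

# Assumed origin: the upper-left corner of the raster, in map units.
easting, northing = pixel_to_map(
    col=5, row=1, x_origin=300_000, y_origin=5_900_700, cell_size=100
)
```

Under these assumptions the helper reproduces the coordinates quoted in the text: 300,500 East, 5,900,600 North.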

When you need to specify the extent of a raster, the extent is defined by the top, bottom, left, and right coordinates of the rectangular area covered by the raster. Other geographic properties of rasters include the following:

This information can be used to find the location of any specific cell. When this information is available, the raster data structure lists the cell values in order from the upper left cell along each row to the lower right cell, as illustrated below.

[Figure: Raster properties used to translate from image space]
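
The row-by-row ordering described above means a cell's position in the flat list of values follows directly from its row and column. A minimal sketch, assuming zero-based indices and a hypothetical raster with 3 rows and 4 columns:

```python
N_COLS = 4
# Cell values listed from the upper-left cell along each row to the lower-right cell.
values = [10, 11, 12, 13,
          20, 21, 22, 23,
          30, 31, 32, 33]

def cell_value(values, n_cols, row, col):
    """Look up the value at (row, col) in a row-major list of cell values."""
    return values[row * n_cols + col]

def row_col(index, n_cols):
    """Recover (row, col) from a position in the row-major list."""
    return divmod(index, n_cols)
```

For instance, cell_value(values, N_COLS, 1, 2) picks the third value of the second row, and row_col(6, N_COLS) maps that list position back to row 1, column 2.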

Raster bands

Some rasters have a single band (a measure of a single characteristic) of data, while others have multiple bands. A band is represented by a single matrix of cell values, and a raster with multiple bands contains multiple spatially coincident matrices of cell values representing the same spatial area.

An example of a single-band raster dataset is a digital elevation model (DEM). Each cell in a DEM contains only one value representing surface elevation. You can also have a single-band orthoimage, which is called a panchromatic or grayscale image.

When there are multiple bands, every cell location has more than one value associated with it. With multiple bands, each band usually represents a segment of the electromagnetic spectrum collected by a sensor. Bands can represent any portion of the electromagnetic spectrum, including ranges not visible to the eye, such as the infrared or ultraviolet sections. Most satellite, airborne, and drone imagery has multiple bands.
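
One way to picture spatially coincident bands is as a set of equally sized matrices, one per band, that share the same row and column indices. The band names and values below are made up for illustration.

```python
# Three hypothetical 2x2 bands covering the same area.
bands = {
    "red":   [[52, 60], [48, 55]],
    "green": [[70, 74], [66, 71]],
    "nir":   [[130, 142], [125, 138]],
}

def values_at(bands, row, col):
    """Collect the value of every band at one cell location."""
    return {name: matrix[row][col] for name, matrix in bands.items()}

cell = values_at(bands, row=0, col=1)
```

Each cell location yields one value per band, which is what distinguishes a multiband raster from a single-band raster such as a DEM.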

Raster references

Internally, GeoAnalytics Engine has two types of rasters: value and reference.

Value rasters

Value rasters contain the full pixel array and metadata in each Spark DataFrame row. When a tile is read from a raster file, all pixels are loaded and included in each DataFrame row. Doing this incurs the full read cost up front.

Reference rasters

Reference rasters store only the original file path and the tile bounding box. Pixels are read from the raster file only when an operation needs them, for example RT_ZonalStatistics or RT_Calculator. Sections of the raster file that are never accessed are never read, which can significantly reduce read costs and may improve performance.
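
The difference in read cost can be illustrated with a toy model in plain Python. Nothing here is GeoAnalytics Engine API: ReferenceTile, read_pixels, the file name, and the read log are hypothetical stand-ins for a tile that stores only a path and bounding box and defers the file read.

```python
read_log = []  # records each simulated file read

def read_pixels(path, bbox):
    """Stand-in for an expensive read of one tile from a raster file."""
    read_log.append((path, bbox))
    return [[0] * 4 for _ in range(4)]  # dummy 4x4 pixel block

class ReferenceTile:
    """Holds only the file path and tile bounding box; reads pixels on demand."""

    def __init__(self, path, bbox):
        self.path = path
        self.bbox = bbox

    def pixels(self):
        return read_pixels(self.path, self.bbox)

# Creating references reads nothing from the file.
tiles = [ReferenceTile("dem.tif", (col * 4, 0, col * 4 + 4, 4)) for col in range(3)]
reads_after_creation = len(read_log)

# Only the tile an operation touches is read; the other two never are.
block = tiles[1].pixels()
reads_after_access = len(read_log)
```

A value raster corresponds to calling the read for every tile at creation time, which pays the full cost whether or not the pixels are ever used.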

Python raster class

The Python raster class provides a Python interface for working with raster data in a Spark DataFrame. It lets you extract raster properties and band values through a set of functions and properties, and create new raster datasets. The Python raster class is also compatible with Python UDFs. To use the raster functions, use a value raster.

To learn more about using the Python raster class, see the API reference and the tutorial on using the Python raster class.

Raster color maps and attribute tables

Color maps

A color map is a set of values, each associated with a color, so that a single-band raster is always displayed with the same colors. Each pixel value is associated with a color, defined as a set of RGB values. Color maps support any bit depth except floating point. They also support positive and negative values and can contain missing color-mapped values. When a dataset is displayed with a color map that contains missing values, the pixels with the missing values are not displayed.

[Figure: Color map information for each pixel]
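
A color map is essentially a lookup from pixel value to an RGB triple. The sketch below uses made-up values and a hypothetical colorize helper; it mirrors the rule that pixels without a color-mapped value are not displayed by returning None for them.

```python
# Hypothetical color map: integer pixel value -> (R, G, B). Negative keys are allowed.
color_map = {
    -1: (0, 0, 255),
    1: (34, 139, 34),
    2: (255, 255, 0),
}

def colorize(pixels, color_map):
    """Map each pixel value to its RGB color, or None when no color is defined."""
    return [[color_map.get(value) for value in row] for row in pixels]

image = colorize([[1, 2], [-1, 7]], color_map)
```

The pixel with value 7 has no entry in the color map, so it maps to None and would be left undrawn.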

In GeoAnalytics Engine, you can read a color map using the following methods:

  • From a color map file (.clr)
  • Bundled in the raster GTiff file (.tif)

Writing a color map is supported when saving to a GTiff file (.tif). The color map values and colors can be accessed using the colormap_values and colormap_colors properties in the Python raster class.

Attribute tables

Raster attribute tables store information for categorical raster data (e.g., land cover), linking each pixel value and its count to a specific class. In GeoAnalytics Engine, you can read and write rasters with attribute tables.

The attribute table values can be accessed using the attribute_table property in the Python raster class.
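
Conceptually, an attribute table pairs each distinct pixel value with a count and a class description. A minimal sketch with invented land-cover codes follows; the row layout is illustrative and is not the structure returned by the attribute_table property.

```python
from collections import Counter

class_names = {1: "grassland", 2: "forest", 3: "road"}  # invented codes
pixels = [[1, 1, 2],
          [2, 2, 3]]

# Count how many cells carry each categorical value.
counts = Counter(value for row in pixels for value in row)

# One table row per distinct pixel value, ordered by value.
attribute_table = [
    {"value": value, "count": counts[value], "class": class_names[value]}
    for value in sorted(counts)
]
```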

Recommendations

  • The Raster data source reads files as reference rasters by default.
  • Use RT_Materialize when a tile is going to be reused. For example, if a raster tile intersects many polygons, RT_ZonalStatistics will read that tile repeatedly by default. Call RT_Materialize before calling RT_ZonalStatistics to avoid re-reading the tile from disk for each polygon. Materializing creates a value raster and avoids repeated reads from the original raster file.
  • Materializing a raster stores the raster data in memory, which may be impractical for large rasters. Use it only when the cost of repeated reads outweighs the one-time materialization cost.
  • You can also call persist() or cache() on a DataFrame of materialized rows so that Spark can reuse them efficiently.
  • You can visualize raster data using the rt.plot function. For more information, see the tutorial on plotting raster data.
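
The trade-off behind materializing can be sketched with a toy model. This is not the RT_Materialize implementation or its API; read_tile and zonal_sum are hypothetical stand-ins that count how often a tile is read when several polygons overlap it.

```python
reads = {"count": 0}

def read_tile():
    """Stand-in for reading one tile from the original raster file."""
    reads["count"] += 1
    return [[1, 2], [3, 4]]

def zonal_sum(get_pixels, n_polygons):
    """Sum the tile once per polygon, fetching pixels through get_pixels."""
    return [sum(sum(row) for row in get_pixels()) for _ in range(n_polygons)]

# Reference-style access: every polygon triggers another read.
zonal_sum(read_tile, n_polygons=5)
reads_without_materialize = reads["count"]

# Materialized access: read once, keep the pixels in memory, reuse them.
reads["count"] = 0
pixels = read_tile()
totals = zonal_sum(lambda: pixels, n_polygons=5)
reads_with_materialize = reads["count"]
```

In this model the materialized path trades one read and the memory to hold the pixels for the repeated reads of the reference path, which is the same balance the recommendations above describe.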
