Learn the functionalities for exploring the properties of raster datasets.

In this tutorial, you will read from raster datasets and explore their properties using different functions in GeoAnalytics Engine.

Read in Raster data

Set up the workspace

  1. Import the required modules in your notebook.
    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    
    # import geoanalytics functionality
    import geoanalytics
    import geoanalytics.raster.functions as RT
    import geoanalytics.sql.functions as ST
    
    import pyspark.sql.functions as F
    
    geoanalytics.auth(username="user1", password="p@ssword")

Prepare your input raster

  1. Set the location of the raster dataset.

    The data used in this tutorial comes from a multi-band image from the United States Geological Survey's National Agriculture Image Program (NAIP). Note that the image provided is a clipped piece of a larger NAIP image.

    The file used here is available here.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    
    # raster dataset location
    raster_data_loc = r"data/m_4209351_se_15_030_20230902_clip.tif"
  2. Read in the raster file as a DataFrame.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    
    df_raster = spark.read.format("raster").load(raster_data_loc)
  3. Show the tiles of the raster. These tiles are stored in the raster DataFrame as rows.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    
    df_raster.show()
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    +--------------------+--------------------+
    |                path|              raster|
    +--------------------+--------------------+
    |file:/C:/_workboo...|SqlRaster(4x1024x...|
    |file:/C:/_workboo...|SqlRaster(4x1024x...|
    |file:/C:/_workboo...|SqlRaster(4x1024x...|
    |file:/C:/_workboo...|SqlRaster(4x1024x...|
    |file:/C:/_workboo...|SqlRaster(4x935x1...|
    |file:/C:/_workboo...|SqlRaster(4x1024x...|
    |file:/C:/_workboo...|SqlRaster(4x1024x...|
    |file:/C:/_workboo...|SqlRaster(4x1024x...|
    |file:/C:/_workboo...|SqlRaster(4x1024x...|
    |file:/C:/_workboo...|SqlRaster(4x935x8...|
    +--------------------+--------------------+

Raster Properties

In this tutorial, we will explore different raster properties using raster functions.

The raster dataset was read into a DataFrame with multiple tiles. To keep examples concise, this tutorial typically shows the properties of a single row, corresponding to one tile. In many cases, metadata such as cell size and spatial reference is shared across all tiles in the dataset.

  1. RT_info lists all the general metadata of a raster.

    Show all the metadata of a single raster tile.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    
    # print raster info
    df_raster.withColumn("info", RT.info("raster")) \
                 .selectExpr("info.*") \
                 .show(1, vertical=True)
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    -RECORD 0--------------------------
     numColumns | 1024
     numRows    | 1024
     numBands   | 4
     cellSizeX  | 0.30000000000001137
     cellSizeY  | 0.3000000000001819
     pixelType  | UInt8
     srid       | 26915
     srText     | PROJCS["NAD_1983_...
     extent     | [444822.9, 466684...
    only showing top 1 row
  2. A variety of functions are available to explore specific properties of raster metadata:

    • RT_extent - overall extent of the raster
    • RT_cellsizex - cell size on the x axis
    • RT_cellsizey - cell size on the y axis
    • RT_numbands - number of bands in the raster
    • RT_numcolumns - number of columns in the raster
    • RT_numrows - number of rows in the raster
    • RT_pixeltype - the pixel type (e.g., UInt8)
    • RT_srid - the spatial reference ID
    • RT_srtext - the full text string with the spatial reference metadata

    Show all the metadata of a single raster tile again without using RT_Info.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    
    df_raster\
        .select(RT.extent("raster"),
              RT.cell_size_x("raster"),
              RT.cell_size_y("raster"),
              RT.num_bands("raster"),
              RT.num_columns("raster"),
              RT.num_rows("raster"),
              RT.pixel_type("raster"),
              RT.srid("raster"),\
              RT.sr_text("raster"))\
        .show(1, vertical=True)
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    -RECORD 0----------------------------------
     Extent(raster)     | {"rings":[[[44482...
     CellSizeX(raster)  | 0.30000000000001137
     CellSizeY(raster)  | 0.3000000000001819
     NumBands(raster)   | 4
     NumColumns(raster) | 1024
     NumRows(raster)    | 1024
     PixelType(raster)  | UInt8
     Srid(raster)       | 26915
     SrText(raster)     | PROJCS["NAD_1983_...
    only showing top 1 row

Summarize properties across all tiles for a raster dataset

When you read in a large raster it will be broken into multiple tiles (DataFrame rows) to improve performance. This means that when you retrieve properties for the tiles in a raster DataFrame, you will see the properties for each tile separately.

If you want to aggregate and retrieve properties across the entire dataset, you need to work across the rows in the DataFrame.

In the example below, we aggregate across all rows (raster tiles) in the DataFrame to obtain properties about the full raster.

The aggregation below includes:

  • The total number of rows in the table (equivalent to the number of tiles in the raster)
  • The full extent of the raster, calculated as the union of all tile extents
  • The cell size (x and y) for the first cell; assuming the cell size is consistent across all tiles in the DataFrame
  • The raster height and width, calculated from the full extent
  • The number of rows and columns, calculated by dividing the full extent dimensions by the cell size in the x and y directions
PythonPythonScala
Use dark colors for code blocksCopy
1
2
3
4
5
6
7
8
9
10
11

# height, width, number of rows, and number of columns across _all tiles created when a raster was ingested_
df_raster.agg(F.sum(F.lit(1)).alias("number_of_tiles"),
              ST.aggr_union(RT.extent("raster")).alias("full_extent"),
              F.first(RT.cell_size_x("raster")).alias("cell_size_x"),
              F.first(RT.cell_size_y("raster")).alias("cell_size_y"))\
    .withColumn("width_total", ST.max_x("full_extent") - ST.min_x("full_extent"))\
    .withColumn("height_total", ST.max_y("full_extent") - ST.min_y("full_extent"))\
    .withColumn("num_rows_total", F.round(F.col("width_total") / F.col("cell_size_x"), 0))\
    .withColumn("num_cols_total", F.round(F.col("height_total") / F.col("cell_size_y"), 0))\
    .show(vertical=True)
Result
Use dark colors for code blocksCopy
1
2
3
4
5
6
7
8
9
-RECORD 0-------------------------------
 number_of_tiles | 10
 full_extent     | {"rings":[[[44482...
 cell_size_x     | 0.30000000000001137
 cell_size_y     | 0.3000000000001819
 width_total     | 1509.2999999999884
 height_total    | 573.0
 num_rows_total  | 5031.0
 num_cols_total  | 1910.0

Band statistics

Raster datasets may have more than one band. RT_Statistics and RT_BandStatistics can be used to explore the properties of raster bands.

RT_Statistics provides statistics for all bands (min, max, mean, standard deviation).

RT_BandStastistics provides statistics for a single, specified band.

  1. Start by looking at band properties for each tile. In the first examples, only one row of data is shown for simplicity.

    The information returned by RT_statistics is an array of elements. There is one element in the array for each band, and the elements returned include the four individual measures (min, max, mean, standard deviation).

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    
    # individual tile
    # statistics for five bands
    df_stats = df_raster.select(RT.statistics("raster").alias("statistics"))
    
    df_stats.withColumn("stats", F.explode("statistics")) \
                .select("stats.*") \
                .show(5, vertical=True)
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    -RECORD 0-------------------
     min   | 15.0
     max   | 210.0
     mean  | 92.35225296020508
     stdev | 26.602800170008393
    -RECORD 1-------------------
     min   | 44.0
     max   | 222.0
     mean  | 129.31229305267334
     stdev | 17.710729011348835
    -RECORD 2-------------------
     min   | 34.0
     max   | 220.0
     mean  | 72.41140651702881
     stdev | 19.689422600849085
    -RECORD 3-------------------
     min   | 44.0
     max   | 229.0
     mean  | 174.80743598937988
     stdev | 14.41309833053885
    -RECORD 4-------------------
     min   | 14.0
     max   | 240.0
     mean  | 84.154541015625
     stdev | 26.81645369218381
    only showing top 5 rows
  2. Show the statistics for one band at a time. This example shows the statistics for the first band in the first raster tile of the DataFrame.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    
    # individual tile
    # statistics for one band
    df_raster.select(RT.band_statistics("raster", 1)).show(1, vertical=True, truncate=False)
    Result
    Use dark colors for code blocksCopy
    1
    2
    -RECORD 0-------------------------------------------------------------------------
    BandStatistics(raster, 1) | {15.0, 210.0, 92.35225296020508, 26.602800170008393}
  3. For each band, RT_Statistics returns a min, max, mean, and standard deviation.

    Return these values in an array of elements.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    
    # for each band, statistics returns a min, max, mean, and standard deviation
    df_raster.select(RT.statistics("raster")).printSchema()
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    root
     |-- Statistics(raster): array (nullable = true)
     |    |-- element: struct (containsNull = false)
     |    |    |-- min: double (nullable = false)
     |    |    |-- max: double (nullable = false)
     |    |    |-- mean: double (nullable = true)
     |    |    |-- stdev: double (nullable = true)
  4. Unpack the array and show the statistics of all bands for a single tile.

    Do this by exploding the set of statistics so one row represents a single band.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    
    # show statistics for all bands for a single tile
    # First, explode the statistics result into a column called "element"
    # Second, access the individual named parts of the element e.g., F.col("element").min to get the min value (see the naming structure from the cell above)
    df_raster.limit(1).select(F.explode(RT.statistics("raster")).alias("element"))\
        .withColumn("min", F.col("element").min)\
        .withColumn("max", F.col("element").max)\
        .withColumn("mean", F.col("element").mean)\
        .withColumn("stdev", F.col("element").stdev)\
        .show()
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    +--------------------+----+-----+------------------+------------------+
    |             element| min|  max|              mean|             stdev|
    +--------------------+----+-----+------------------+------------------+
    |{15.0, 210.0, 92....|15.0|210.0| 92.35225296020508|26.602800170008393|
    |{44.0, 222.0, 129...|44.0|222.0|129.31229305267334|17.710729011348835|
    |{34.0, 220.0, 72....|34.0|220.0| 72.41140651702881|19.689422600849085|
    |{44.0, 229.0, 174...|44.0|229.0|174.80743598937988| 14.41309833053885|
    +--------------------+----+-----+------------------+------------------+
  5. The example above showed one tile, but statistics for all tiles and all bands can be displayed as a single table.

    Use windowing to get labels for the tiles and bands to keep the results organized in the output table.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    
    from pyspark.sql.window import Window
    
    # Define a window specification to allow us to list the tile number and the band number for each individual tile
    # for numbering tile
    window_spec_tile = Window.orderBy(col("path"))
    # for numbering band when the statistics are exploded into elements
    window_spec = Window.partitionBy("tile_number").orderBy(col("element"))
    
    
    df_raster\
        .withColumn("tile_number", F.row_number().over(window_spec_tile))\
        .select(RT.extent("raster").alias("extent"), "tile_number", F.explode(RT.statistics("raster")).alias("element"))\
        .withColumn("band", F.row_number().over(window_spec))\
        .withColumn("min", F.col("element").min)\
        .withColumn("max", F.col("element").max)\
        .withColumn("mean", F.col("element").mean)\
        .withColumn("stdev", F.col("element").stdev)\
        .drop("element")\
        .sort("tile_number")\
        .show()
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    +--------------------+-----------+----+----+-----+------------------+------------------+
    |              extent|tile_number|band| min|  max|              mean|             stdev|
    +--------------------+-----------+----+----+-----+------------------+------------------+
    |{"rings":[[[44482...|          1|   1|15.0|210.0| 92.35225296020508|26.602800170008393|
    |{"rings":[[[44482...|          1|   2|34.0|220.0| 72.41140651702881|19.689422600849085|
    |{"rings":[[[44482...|          1|   3|44.0|222.0|129.31229305267334|17.710729011348835|
    |{"rings":[[[44482...|          1|   4|44.0|229.0|174.80743598937988| 14.41309833053885|
    |{"rings":[[[44513...|          2|   1|14.0|240.0|   84.154541015625| 26.81645369218381|
    |{"rings":[[[44513...|          2|   2|23.0|255.0| 74.64390182495117|24.848689245873594|
    |{"rings":[[[44513...|          2|   3|38.0|255.0| 124.0490951538086| 19.33410483909644|
    |{"rings":[[[44513...|          2|   4|43.0|255.0|172.65335083007812| 16.11312650314451|
    |{"rings":[[[44543...|          3|   1|11.0|233.0|  90.1596269607544| 35.35228140271004|
    |{"rings":[[[44543...|          3|   2|22.0|247.0| 80.91951274871826| 35.10800917517936|
    |{"rings":[[[44543...|          3|   3|40.0|248.0|124.84498310089111| 27.80693983917014|
    |{"rings":[[[44543...|          3|   4|44.0|255.0|159.28014087677002| 20.60250979293797|
    |{"rings":[[[44574...|          4|   1|16.0|232.0| 80.22738838195801|26.324469299187722|
    |{"rings":[[[44574...|          4|   2|16.0|248.0| 73.15077209472656|24.554763438866985|
    |{"rings":[[[44574...|          4|   3|45.0|250.0|117.66667747497559|20.641196882943824|
    |{"rings":[[[44574...|          4|   4|51.0|254.0| 159.6287441253662|11.813605740654758|
    |{"rings":[[[44605...|          5|   1|15.0|231.0| 80.11518215240642| 25.71023617309586|
    |{"rings":[[[44605...|          5|   2|19.0|233.0| 72.73027657085561| 23.62374643935028|
    |{"rings":[[[44605...|          5|   3|48.0|244.0|158.23040712733956|11.794383958706929|
    |{"rings":[[[44605...|          5|   4|48.0|246.0|117.44518298796791| 20.90472557528909|
    +--------------------+-----------+----+----+-----+------------------+------------------+
    only showing top 20 rows

Global statistics for bands

Instead of looking at statistics for each tile in a raster, you may want global statistics across all tiles/rows in the DataFrame. Here are a few ways that you can get those values:

  1. If the raster is small enough to merge into a single tile, you can use RT_Merge to combine all the raster tiles into a single row in the DataFrame

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    
    df_raster_merge = df_raster.agg(RT.merge(F.collect_list("raster")).alias("raster"))
    
    # pull out statistics for the merged dataset (it is only one row at this point since it's one giant tile)
    df_raster_merge.select(F.explode(RT.statistics("raster")).alias("element"))\
        .withColumn("min", F.col("element").min)\
        .withColumn("max", F.col("element").max)\
        .withColumn("mean", F.col("element").mean)\
        .withColumn("stdev", F.col("element").stdev)\
        .drop("element")\
        .show()
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    +----+-----+---------------+---------------+
    | min|  max|           mean|          stdev|
    +----+-----+---------------+---------------+
    | 9.0|240.0|80.369452743768|28.752628468188|
    |25.0|255.0| 119.5062109164| 21.84675346408|
    | 2.0|255.0|  73.6978419662|24.961412705739|
    |36.0|255.0|167.75836983476|  18.1467047388|
    +----+-----+---------------+---------------+
  2. These can also be run one band at a time if preferred:

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    
    # get number of bands
    num_bands = df_raster.select(RT.num_bands("raster").alias("num_bands")).head(1)[0]["num_bands"]
    
    # walk through each of the bands and print results
    for i in range(0, num_bands):
        # bands are numbered starting at 1, not at 0
        print(f"Band {i+1} statistics")
    
        # list the statistics for the band
        df_raster_merge.select(RT.band_statistics("raster", i+1).alias("band_stats"))\
            .withColumn("min", F.col("band_stats").min)\
            .withColumn("max", F.col("band_stats").max)\
            .withColumn("mean", F.col("band_stats").mean)\
            .withColumn("stdev", F.col("band_stats").stdev)\
            .drop("band_stats")\
            .show()
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    Band 1 statistics
    +---+-----+---------------+---------------+
    |min|  max|           mean|          stdev|
    +---+-----+---------------+---------------+
    |9.0|240.0|80.369452743768|28.752628468188|
    +---+-----+---------------+---------------+
    
    Band 2 statistics
    +----+-----+--------------+--------------+
    | min|  max|          mean|         stdev|
    +----+-----+--------------+--------------+
    |25.0|255.0|119.5062109164|21.84675346408|
    +----+-----+--------------+--------------+
    
    Band 3 statistics
    +---+-----+-------------+---------------+
    |min|  max|         mean|          stdev|
    +---+-----+-------------+---------------+
    |2.0|255.0|73.6978419662|24.961412705739|
    +---+-----+-------------+---------------+
    
    Band 4 statistics
    +----+-----+---------------+-------------+
    | min|  max|           mean|        stdev|
    +----+-----+---------------+-------------+
    |36.0|255.0|167.75836983476|18.1467047388|
    +----+-----+---------------+-------------+
  3. If the tiles in the DataFrame are too large to merge into a single row, you can obtain min and max through aggregation of the individual tiles. This will not work for mean and standard deviation.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    
    # get number of bands
    num_bands = df_raster.select(RT.num_bands("raster").alias("num_bands")).head(1)[0]["num_bands"]
    
    # walk through each of the bands and print results
    for i in range(0, num_bands):
        # bands are numbered starting at 1, not at 0
        print(f"Band {i+1} statistics")
    
        # calculate the aggregated min and max across all tiles
        df_raster\
            .select(RT.band_statistics("raster", i+1).alias("band_stats"))\
                .withColumn("min", F.col("band_stats").min)\
                .withColumn("max", F.col("band_stats").max)\
                .agg(F.min("min").alias("min"),F.max("max").alias("max"))\
                .show()
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    Band 1 statistics
    +---+-----+
    |min|  max|
    +---+-----+
    |9.0|240.0|
    +---+-----+
    
    Band 2 statistics
    +----+-----+
    | min|  max|
    +----+-----+
    |25.0|255.0|
    +----+-----+
    
    Band 3 statistics
    +---+-----+
    |min|  max|
    +---+-----+
    |2.0|255.0|
    +---+-----+
    
    Band 4 statistics
    +----+-----+
    | min|  max|
    +----+-----+
    |36.0|255.0|
    +----+-----+

Raster values

  1. Obtain values for all the cells for a single band within a raster tile using RT_BandValues. For this example, just show the first 10 values of bands 1 and 2.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    
    # first 10 values for band 1
    raster = df_raster.limit(1).select(RT.band_values("raster", 1).alias("values")).head(1)[0]["values"]
    
    # print the first 10 values
    raster[0:10]
    Result
    Use dark colors for code blocksCopy
    1
    [65.0, 68.0, 60.0, 61.0, 71.0, 72.0, 67.0, 75.0, 72.0, 67.0]
    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    
    # first 10 values for band 2
    raster = df_raster.limit(1).select(RT.band_values("raster", 2).alias("values")).head(1)[0]["values"]
    
    # print the first 10 values
    raster[0:10]
    Result
    Use dark colors for code blocksCopy
    1
    [115.0, 116.0, 108.0, 109.0, 119.0, 120.0, 115.0, 123.0, 122.0, 116.0]
  2. Using RT_BandMask, check the mask values for all the cells for a single band of a raster tile. Because all cells are valid in this raster, all mask values should be true here.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    
    # first 10 values for band 1
    df_band_mask = df_raster.select(RT.band_mask("raster", 1).alias("band_mask")).head(1)[0]["band_mask"]
    
    # print the first 10 values
    df_band_mask[0:10]
    Result
    Use dark colors for code blocksCopy
    1
    [True, True, True, True, True, True, True, True, True, True]
    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    
    # first 10 values for band 2
    df_band_mask = df_raster.select(RT.band_mask("raster", 2).alias("band_mask")).head(1)[0]["band_mask"]
    
    # print the first 10 values
    df_band_mask[0:10]
    Result
    Use dark colors for code blocksCopy
    1
    [True, True, True, True, True, True, True, True, True, True]

What's next?

Learn about how to read in other data types or analyze your data through SQL functions and analysis tools:

Your browser is no longer supported. Please upgrade your browser for the best experience. See our browser deprecation post for more details.