Skip to content

Learn the functionalities for exploring the properties of raster datasets.

In this tutorial you will read from raster datasets and explore their properties using different functions in GeoAnalytics Engine

Read in Raster data

Set up the workspace

  1. Import the required modules.
    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    
    # import geoanalytics functionality
    import geoanalytics
    import geoanalytics.raster.functions as RT
    import geoanalytics.sql.functions as ST
    
    import pyspark.sql.functions as F
    
    geoanalytics.auth(username="user1", password="p@ssword")

Prepare your input raster

  1. Set the location of the raster dataset.

    The data comes from a multi-band image from the United States Geological Survey's National Agriculture Image Program (NAIP). Note that the image provided for use is a clipped piece of a larger NAIP image.

    The file used here is available here.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    
    # raster dataset location
    raster_data_loc = r"data/m_4209351_se_15_030_20230902_clip.tif"
  2. Read in the raster file

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    
    df_raster = spark.read.format("raster").load(raster_data_loc)
  3. The raster was read in and split into multiple tiles, visualize the tiles as rows in the DataFrame.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    
    df_raster.show()
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    +--------------------+--------------------+
    |                path|              raster|
    +--------------------+--------------------+
    |file:/C:/_workboo...|SqlRaster(4x1024x...|
    |file:/C:/_workboo...|SqlRaster(4x1024x...|
    |file:/C:/_workboo...|SqlRaster(4x1024x...|
    |file:/C:/_workboo...|SqlRaster(4x1024x...|
    |file:/C:/_workboo...|SqlRaster(4x935x1...|
    |file:/C:/_workboo...|SqlRaster(4x1024x...|
    |file:/C:/_workboo...|SqlRaster(4x1024x...|
    |file:/C:/_workboo...|SqlRaster(4x1024x...|
    |file:/C:/_workboo...|SqlRaster(4x1024x...|
    |file:/C:/_workboo...|SqlRaster(4x935x8...|
    +--------------------+--------------------+

Raster Properties

There are numerous functions available for working with raster datasets. In this tutorial we will explore raster properties.

Since our raster dataset was read into a DataFrame with multiple rows/tiles, in most of the examples, we will only show one row when we explore the values since the general properties are likely the same for every tile (e.g., cell size, spatial reference, etc.)

  1. RT_info - list all of the general metadata information about a raster.

    The individual components can be collected separately as well.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    
    # print raster info
    df_raster.withColumn("info", RT.info("raster")) \
                 .selectExpr("info.*") \
                 .show(1, vertical=True)
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    -RECORD 0--------------------------
     numColumns | 1024
     numRows    | 1024
     numBands   | 4
     cellSizeX  | 0.30000000000001137
     cellSizeY  | 0.3000000000001819
     pixelType  | UInt8
     srid       | 26915
     srText     | PROJCS["NAD_1983_...
     extent     | [444822.9, 466684...
    only showing top 1 row
  2. When exploring specific individual metadeta items about the raster, we can use a variety of functions for this:

    • RT_extent - overall extent of the raster
    • RT_cellsizex - cell size on the x axis
    • RT_cellsizey - cell size on the y axis
    • RT_numbands - number of bands in the raster
    • RT_numcolumns - number of columns in the raster
    • RT_numrows - number of rows in the raster
    • RT_pixeltype - the pixel type (e.g., UInt8)
    • RT_srid - the spatial reference ID
    • RT_srtext - the full text string with the spatial reference metadata
    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    
    df_raster\
        .select(RT.extent("raster"),
              RT.cell_size_x("raster"),
              RT.cell_size_y("raster"),
              RT.num_bands("raster"),
              RT.num_columns("raster"),
              RT.num_rows("raster"),
              RT.pixel_type("raster"),
              RT.srid("raster"),\
              RT.sr_text("raster"))\
        .show(1, vertical=True)
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    -RECORD 0----------------------------------
     Extent(raster)     | {"rings":[[[44482...
     CellSizeX(raster)  | 0.30000000000001137
     CellSizeY(raster)  | 0.3000000000001819
     NumBands(raster)   | 4
     NumColumns(raster) | 1024
     NumRows(raster)    | 1024
     PixelType(raster)  | UInt8
     Srid(raster)       | 26915
     SrText(raster)     | PROJCS["NAD_1983_...
    only showing top 1 row

Summarize properties across all tiles for a raster dataset

When you read in a large raster it will be broken into multiple tiles (DataFrame rows) to improve performance. This means that when you retrieve properties for the tiles in a raster DataFrame, you will see the properties for each tile separately.

If you want to aggregate to retrieve properties across the entire dataset, you need to work across the rows in the DataFrame

In the example below, we aggregate across all rows (raster tiles) in our DataFrame to obtain properties about the full raster that was ingested.

The aggregation below includes:

  • Count all rows in the table (sum a count of 1 for each row)
  • Union the extent of all rows to get the full extent
  • The cell size (x and y) for the first cell; assuming the cell size is consistent across all tiles in the DataFrame
  • Calculation of the raster height and width based on the full extent
  • Calculation of the number of rows and columns based on the full extent divided by the cell size (x and y dimensions)
PythonPythonScala
Use dark colors for code blocksCopy
1
2
3
4
5
6
7
8
9
10
11

# height, width, number of rows, and number of columns across _all tiles created when a raster was ingested_
df_raster.agg(F.sum(F.lit(1)).alias("number_of_tiles"),
              ST.aggr_union(RT.extent("raster")).alias("full_extent"),
              F.first(RT.cell_size_x("raster")).alias("cell_size_x"),
              F.first(RT.cell_size_y("raster")).alias("cell_size_y"))\
    .withColumn("width_total", ST.max_x("full_extent") - ST.min_x("full_extent"))\
    .withColumn("height_total", ST.max_y("full_extent") - ST.min_y("full_extent"))\
    .withColumn("num_rows_total", F.round(F.col("width_total") / F.col("cell_size_x"), 0))\
    .withColumn("num_cols_total", F.round(F.col("height_total") / F.col("cell_size_y"), 0))\
    .show(vertical=True)
Result
Use dark colors for code blocksCopy
1
2
3
4
5
6
7
8
9
-RECORD 0-------------------------------
 number_of_tiles | 10
 full_extent     | {"rings":[[[44482...
 cell_size_x     | 0.30000000000001137
 cell_size_y     | 0.3000000000001819
 width_total     | 1509.2999999999884
 height_total    | 573.0
 num_rows_total  | 5031.0
 num_cols_total  | 1910.0

Band statistics

Raster datasets may have more than one band. We can look at statistics for each raster band using RT_statistics and RT_bandstatistics

RT_statistics provides statistics for all bands (min, max, mean, standard deviation)

RT_bandstastistics provides statistics for a single, specified band

  1. We can start by looking at these for each individual tile. In the first examples we only show one row of data for simplicity.

    The information returned by RT_statistics is an array of elements. There is one element in the array for each band, and the elements returned include the four individual measures (min, max, mean, standard deviation).

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    
    # individual tile
    # statistics for five bands
    df_stats = df_raster.select(RT.statistics("raster").alias("statistics"))
    
    df_stats.withColumn("stats", F.explode("statistics")) \
                .select("stats.*") \
                .show(5, vertical=True)
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    -RECORD 0-------------------
     min   | 15.0
     max   | 210.0
     mean  | 92.35225296020508
     stdev | 26.602800170008393
    -RECORD 1-------------------
     min   | 44.0
     max   | 222.0
     mean  | 129.31229305267334
     stdev | 17.710729011348835
    -RECORD 2-------------------
     min   | 34.0
     max   | 220.0
     mean  | 72.41140651702881
     stdev | 19.689422600849085
    -RECORD 3-------------------
     min   | 44.0
     max   | 229.0
     mean  | 174.80743598937988
     stdev | 14.41309833053885
    -RECORD 4-------------------
     min   | 14.0
     max   | 240.0
     mean  | 84.154541015625
     stdev | 26.81645369218381
    only showing top 5 rows
  2. We can also look at the statistics for one band at a time. In this example we'll look at the statistics for band 1 in the raster tile in the first row of the DataFrame

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    
    # individual tile
    # statistics for one band
    df_raster.select(RT.band_statistics("raster", 1)).show(1, vertical=True, truncate=False)
    Result
    Use dark colors for code blocksCopy
    1
    2
    -RECORD 0-------------------------------------------------------------------------
    BandStatistics(raster, 1) | {15.0, 210.0, 92.35225296020508, 26.602800170008393}
  3. For each band, statistics returns a min, max, mean, and standard deviation.

    The returned values are in an array of individual statistical elements

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    
    # for each band, statistics returns a min, max, mean, and standard deviation
    df_raster.select(RT.statistics("raster")).printSchema()
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    root
     |-- Statistics(raster): array (nullable = true)
     |    |-- element: struct (containsNull = false)
     |    |    |-- min: double (nullable = false)
     |    |    |-- max: double (nullable = false)
     |    |    |-- mean: double (nullable = true)
     |    |    |-- stdev: double (nullable = true)
  4. We can unpack these and show statistics for all bands in a single tile in one table

    We can do this by exploding the full set of statistics so that we have one row for each band

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    
    # show statistics for all bands for a single tile
    # First, explode the statistics result into a column called "element"
    # Second, access the individual named parts of the element e.g., F.col("element").min to get the min value (see the naming structure from the cell above)
    df_raster.limit(1).select(F.explode(RT.statistics("raster")).alias("element"))\
        .withColumn("min", F.col("element").min)\
        .withColumn("max", F.col("element").max)\
        .withColumn("mean", F.col("element").mean)\
        .withColumn("stdev", F.col("element").stdev)\
        .show()
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    +--------------------+----+-----+------------------+------------------+
    |             element| min|  max|              mean|             stdev|
    +--------------------+----+-----+------------------+------------------+
    |{15.0, 210.0, 92....|15.0|210.0| 92.35225296020508|26.602800170008393|
    |{44.0, 222.0, 129...|44.0|222.0|129.31229305267334|17.710729011348835|
    |{34.0, 220.0, 72....|34.0|220.0| 72.41140651702881|19.689422600849085|
    |{44.0, 229.0, 174...|44.0|229.0|174.80743598937988| 14.41309833053885|
    +--------------------+----+-----+------------------+------------------+
  5. The example above showed one row, but we can also show statistics for all tiles and all bands in a single table

    We've added in some windowing to get labels for the tiles and bands to keep the results organized in the output table

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    
    from pyspark.sql.window import Window
    
    # Define a window specification to allow us to list the tile number and the band number for each individual tile
    # for numbering tile
    window_spec_tile = Window.orderBy(col("path"))
    # for numbering band when the statistics are exploded into elements
    window_spec = Window.partitionBy("tile_number").orderBy(col("element"))
    
    
    df_raster\
        .withColumn("tile_number", F.row_number().over(window_spec_tile))\
        .select(RT.extent("raster").alias("extent"), "tile_number", F.explode(RT.statistics("raster")).alias("element"))\
        .withColumn("band", F.row_number().over(window_spec))\
        .withColumn("min", F.col("element").min)\
        .withColumn("max", F.col("element").max)\
        .withColumn("mean", F.col("element").mean)\
        .withColumn("stdev", F.col("element").stdev)\
        .drop("element")\
        .sort("tile_number")\
        .show()
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    +--------------------+-----------+----+----+-----+------------------+------------------+
    |              extent|tile_number|band| min|  max|              mean|             stdev|
    +--------------------+-----------+----+----+-----+------------------+------------------+
    |{"rings":[[[44482...|          1|   1|15.0|210.0| 92.35225296020508|26.602800170008393|
    |{"rings":[[[44482...|          1|   2|34.0|220.0| 72.41140651702881|19.689422600849085|
    |{"rings":[[[44482...|          1|   3|44.0|222.0|129.31229305267334|17.710729011348835|
    |{"rings":[[[44482...|          1|   4|44.0|229.0|174.80743598937988| 14.41309833053885|
    |{"rings":[[[44513...|          2|   1|14.0|240.0|   84.154541015625| 26.81645369218381|
    |{"rings":[[[44513...|          2|   2|23.0|255.0| 74.64390182495117|24.848689245873594|
    |{"rings":[[[44513...|          2|   3|38.0|255.0| 124.0490951538086| 19.33410483909644|
    |{"rings":[[[44513...|          2|   4|43.0|255.0|172.65335083007812| 16.11312650314451|
    |{"rings":[[[44543...|          3|   1|11.0|233.0|  90.1596269607544| 35.35228140271004|
    |{"rings":[[[44543...|          3|   2|22.0|247.0| 80.91951274871826| 35.10800917517936|
    |{"rings":[[[44543...|          3|   3|40.0|248.0|124.84498310089111| 27.80693983917014|
    |{"rings":[[[44543...|          3|   4|44.0|255.0|159.28014087677002| 20.60250979293797|
    |{"rings":[[[44574...|          4|   1|16.0|232.0| 80.22738838195801|26.324469299187722|
    |{"rings":[[[44574...|          4|   2|16.0|248.0| 73.15077209472656|24.554763438866985|
    |{"rings":[[[44574...|          4|   3|45.0|250.0|117.66667747497559|20.641196882943824|
    |{"rings":[[[44574...|          4|   4|51.0|254.0| 159.6287441253662|11.813605740654758|
    |{"rings":[[[44605...|          5|   1|15.0|231.0| 80.11518215240642| 25.71023617309586|
    |{"rings":[[[44605...|          5|   2|19.0|233.0| 72.73027657085561| 23.62374643935028|
    |{"rings":[[[44605...|          5|   3|48.0|244.0|158.23040712733956|11.794383958706929|
    |{"rings":[[[44605...|          5|   4|48.0|246.0|117.44518298796791| 20.90472557528909|
    +--------------------+-----------+----+----+-----+------------------+------------------+
    only showing top 20 rows

Global statistics for bands

Instead of looking at statistics for each tile / row in a DataFrame, you may want global statistics across all tiles / rows in the DataFrame. Here are a few ways that you can get those values:

  1. If the raster is small enough to merge into a single tile, you can use RT_merge to combine all of the tiles into a single raster tile / row in the DataFrame

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    
    df_raster_merge = df_raster.agg(RT.merge(F.collect_list("raster")).alias("raster"))
    
    # pull out statistics for the merged dataset (it is only one row at this point since it's one giant tile)
    df_raster_merge.select(F.explode(RT.statistics("raster")).alias("element"))\
        .withColumn("min", F.col("element").min)\
        .withColumn("max", F.col("element").max)\
        .withColumn("mean", F.col("element").mean)\
        .withColumn("stdev", F.col("element").stdev)\
        .drop("element")\
        .show()
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    +----+-----+---------------+---------------+
    | min|  max|           mean|          stdev|
    +----+-----+---------------+---------------+
    | 9.0|240.0|80.369452743768|28.752628468188|
    |25.0|255.0| 119.5062109164| 21.84675346408|
    | 2.0|255.0|  73.6978419662|24.961412705739|
    |36.0|255.0|167.75836983476|  18.1467047388|
    +----+-----+---------------+---------------+
  2. These can also be run off one band at a time if preferred:

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    
    # get number of bands
    num_bands = df_raster.select(RT.num_bands("raster").alias("num_bands")).head(1)[0]["num_bands"]
    
    # walk through each of the bands and print results
    for i in range(0, num_bands):
        # bands are numbered starting at 1, not at 0
        print(f"Band {i+1} statistics")
    
        # list the statistics for the band
        df_raster_merge.select(RT.band_statistics("raster", i+1).alias("band_stats"))\
            .withColumn("min", F.col("band_stats").min)\
            .withColumn("max", F.col("band_stats").max)\
            .withColumn("mean", F.col("band_stats").mean)\
            .withColumn("stdev", F.col("band_stats").stdev)\
            .drop("band_stats")\
            .show()
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    Band 1 statistics
    +---+-----+---------------+---------------+
    |min|  max|           mean|          stdev|
    +---+-----+---------------+---------------+
    |9.0|240.0|80.369452743768|28.752628468188|
    +---+-----+---------------+---------------+
    
    Band 2 statistics
    +----+-----+--------------+--------------+
    | min|  max|          mean|         stdev|
    +----+-----+--------------+--------------+
    |25.0|255.0|119.5062109164|21.84675346408|
    +----+-----+--------------+--------------+
    
    Band 3 statistics
    +---+-----+-------------+---------------+
    |min|  max|         mean|          stdev|
    +---+-----+-------------+---------------+
    |2.0|255.0|73.6978419662|24.961412705739|
    +---+-----+-------------+---------------+
    
    Band 4 statistics
    +----+-----+---------------+-------------+
    | min|  max|           mean|        stdev|
    +----+-----+---------------+-------------+
    |36.0|255.0|167.75836983476|18.1467047388|
    +----+-----+---------------+-------------+
  3. If the tiles in the DataFrame are too large to merge into a single row, we can obtain min and max through aggregation of the individual tiles. This will not work for mean and standard deviation

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    
    # get number of bands
    num_bands = df_raster.select(RT.num_bands("raster").alias("num_bands")).head(1)[0]["num_bands"]
    
    # walk through each of the bands and print results
    for i in range(0, num_bands):
        # bands are numbered starting at 1, not at 0
        print(f"Band {i+1} statistics")
    
        # calculate the aggregated min and max across all tiles
        df_raster\
            .select(RT.band_statistics("raster", i+1).alias("band_stats"))\
                .withColumn("min", F.col("band_stats").min)\
                .withColumn("max", F.col("band_stats").max)\
                .agg(F.min("min").alias("min"),F.max("max").alias("max"))\
                .show()
    Result
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    Band 1 statistics
    +---+-----+
    |min|  max|
    +---+-----+
    |9.0|240.0|
    +---+-----+
    
    Band 2 statistics
    +----+-----+
    | min|  max|
    +----+-----+
    |25.0|255.0|
    +----+-----+
    
    Band 3 statistics
    +---+-----+
    |min|  max|
    +---+-----+
    |2.0|255.0|
    +---+-----+
    
    Band 4 statistics
    +----+-----+
    | min|  max|
    +----+-----+
    |36.0|255.0|
    +----+-----+

Raster values

  1. You can obtain values for all of the cells for a single band within a raster tile using RT_bandvalues. Here we can look at the first 10 values of band 1 and 2.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    
    # first 10 values for band 1
    raster = df_raster.limit(1).select(RT.band_values("raster", 1).alias("values")).head(1)[0]["values"]
    
    # print the first 10 values
    raster[0:10]
    Result
    Use dark colors for code blocksCopy
    1
    [65.0, 68.0, 60.0, 61.0, 71.0, 72.0, 67.0, 75.0, 72.0, 67.0]
    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    
    # first 10 values for band 2
    raster = df_raster.limit(1).select(RT.band_values("raster", 2).alias("values")).head(1)[0]["values"]
    
    # print the first 10 values
    raster[0:10]
    Result
    Use dark colors for code blocksCopy
    1
    [115.0, 116.0, 108.0, 109.0, 119.0, 120.0, 115.0, 123.0, 122.0, 116.0]
  2. Using RT_bandmask, we can check the mask values for all of the cells for a single band within a raster tile. Because all cells are valid in this raster, all mask values should be true here.

    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    
    # first 10 values for band 1
    df_band_mask = df_raster.select(RT.band_mask("raster", 1).alias("band_mask")).head(1)[0]["band_mask"]
    
    # print the first 10 values
    df_band_mask[0:10]
    Result
    Use dark colors for code blocksCopy
    1
    [True, True, True, True, True, True, True, True, True, True]
    PythonPythonScala
    Use dark colors for code blocksCopy
    1
    2
    3
    4
    5
    6
    
    # first 10 values for band 2
    df_band_mask = df_raster.select(RT.band_mask("raster", 2).alias("band_mask")).head(1)[0]["band_mask"]
    
    # print the first 10 values
    df_band_mask[0:10]
    Result
    Use dark colors for code blocksCopy
    1
    [True, True, True, True, True, True, True, True, True, True]

What's next?

Learn about how to read in other data types or analyze your data through SQL functions and analysis tools:

Your browser is no longer supported. Please upgrade your browser for the best experience. See our browser deprecation post for more details.