Explore properties for raster data | ArcGIS GeoAnalytics Engine

Learn the functionalities for exploring the properties of raster datasets.

In this tutorial you will read from raster datasets and explore their properties using different functions in GeoAnalytics Engine

Read in Raster data

Set up the workspace

Import the required modules.

Python

Scala

Use dark colors for code blocksCopy

# import geoanalytics functionality
import geoanalytics
import geoanalytics.raster.functions as RT
import geoanalytics.sql.functions as ST

import pyspark.sql.functions as F

geoanalytics.auth(username="user1", password="p@ssword")

Prepare your input raster

Set the location of the raster dataset.

The data comes from a multi-band image from the United States Geological Survey's National Agriculture Image Program (NAIP). Note that the image provided for use is a clipped piece of a larger NAIP image.

The file used here is available here.

PythonPythonScala
Use dark colors for code blocksCopy
```
1
2
3

# raster dataset location
raster_data_loc = r"data/m_4209351_se_15_030_20230902_clip.tif"
```
Read in the raster file

PythonPythonScala
Use dark colors for code blocksCopy
```
1
2

df_raster = spark.read.format("raster").load(raster_data_loc)
```

The raster was read in and split into multiple tiles, visualize the tiles as rows in the DataFrame.

Python

Scala

Use dark colors for code blocksCopy

df_raster.show()

Result
Use dark colors for code blocksCopy
+--------------------+--------------------+
|                path|              raster|
+--------------------+--------------------+
|file:/C:/_workboo...|SqlRaster(4x1024x...|
|file:/C:/_workboo...|SqlRaster(4x1024x...|
|file:/C:/_workboo...|SqlRaster(4x1024x...|
|file:/C:/_workboo...|SqlRaster(4x1024x...|
|file:/C:/_workboo...|SqlRaster(4x935x1...|
|file:/C:/_workboo...|SqlRaster(4x1024x...|
|file:/C:/_workboo...|SqlRaster(4x1024x...|
|file:/C:/_workboo...|SqlRaster(4x1024x...|
|file:/C:/_workboo...|SqlRaster(4x1024x...|
|file:/C:/_workboo...|SqlRaster(4x935x8...|
+--------------------+--------------------+

Raster Properties

There are numerous functions available for working with raster datasets. In this tutorial we will explore raster properties.

Since our raster dataset was read into a DataFrame with multiple rows/tiles, in most of the examples, we will only show one row when we explore the values since the general properties are likely the same for every tile (e.g., cell size, spatial reference, etc.)

RT_info - list all of the general metadata information about a raster.

The individual components can be collected separately as well.

Python

Scala

Use dark colors for code blocksCopy

# print raster info
df_raster.withColumn("info", RT.info("raster")) \
             .selectExpr("info.*") \
             .show(1, vertical=True)

Result
Use dark colors for code blocksCopy
-RECORD 0--------------------------
 numColumns | 1024
 numRows    | 1024
 numBands   | 4
 cellSizeX  | 0.30000000000001137
 cellSizeY  | 0.3000000000001819
 pixelType  | UInt8
 srid       | 26915
 srText     | PROJCS["NAD_1983_...
 extent     | [444822.9, 466684...
only showing top 1 row

When exploring specific individual metadeta items about the raster, we can use a variety of functions for this:

RT_extent - overall extent of the raster
RT_cellsizex - cell size on the x axis
RT_cellsizey - cell size on the y axis
RT_numbands - number of bands in the raster
RT_numcolumns - number of columns in the raster
RT_numrows - number of rows in the raster
RT_pixeltype - the pixel type (e.g., UInt8)
RT_srid - the spatial reference ID
RT_srtext - the full text string with the spatial reference metadata

Python

Scala

Use dark colors for code blocksCopy

df_raster\
    .select(RT.extent("raster"),
          RT.cell_size_x("raster"),
          RT.cell_size_y("raster"),
          RT.num_bands("raster"),
          RT.num_columns("raster"),
          RT.num_rows("raster"),
          RT.pixel_type("raster"),
          RT.srid("raster"),\
          RT.sr_text("raster"))\
    .show(1, vertical=True)

Result
Use dark colors for code blocksCopy
-RECORD 0----------------------------------
 Extent(raster)     | {"rings":[[[44482...
 CellSizeX(raster)  | 0.30000000000001137
 CellSizeY(raster)  | 0.3000000000001819
 NumBands(raster)   | 4
 NumColumns(raster) | 1024
 NumRows(raster)    | 1024
 PixelType(raster)  | UInt8
 Srid(raster)       | 26915
 SrText(raster)     | PROJCS["NAD_1983_...
only showing top 1 row

Summarize properties across all tiles for a raster dataset

When you read in a large raster it will be broken into multiple tiles (DataFrame rows) to improve performance. This means that when you retrieve properties for the tiles in a raster DataFrame, you will see the properties for each tile separately.

If you want to aggregate to retrieve properties across the entire dataset, you need to work across the rows in the DataFrame

In the example below, we aggregate across all rows (raster tiles) in our DataFrame to obtain properties about the full raster that was ingested.

The aggregation below includes:

Count all rows in the table (sum a count of 1 for each row)
Union the extent of all rows to get the full extent
The cell size (x and y) for the first cell; assuming the cell size is consistent across all tiles in the DataFrame
Calculation of the raster height and width based on the full extent
Calculation of the number of rows and columns based on the full extent divided by the cell size (x and y dimensions)

Python

Scala

Use dark colors for code blocksCopy

# height, width, number of rows, and number of columns across _all tiles created when a raster was ingested_
df_raster.agg(F.sum(F.lit(1)).alias("number_of_tiles"),
              ST.aggr_union(RT.extent("raster")).alias("full_extent"),
              F.first(RT.cell_size_x("raster")).alias("cell_size_x"),
              F.first(RT.cell_size_y("raster")).alias("cell_size_y"))\
    .withColumn("width_total", ST.max_x("full_extent") - ST.min_x("full_extent"))\
    .withColumn("height_total", ST.max_y("full_extent") - ST.min_y("full_extent"))\
    .withColumn("num_rows_total", F.round(F.col("width_total") / F.col("cell_size_x"), 0))\
    .withColumn("num_cols_total", F.round(F.col("height_total") / F.col("cell_size_y"), 0))\
    .show(vertical=True)

Result
Use dark colors for code blocksCopy
-RECORD 0-------------------------------
 number_of_tiles | 10
 full_extent     | {"rings":[[[44482...
 cell_size_x     | 0.30000000000001137
 cell_size_y     | 0.3000000000001819
 width_total     | 1509.2999999999884
 height_total    | 573.0
 num_rows_total  | 5031.0
 num_cols_total  | 1910.0

Band statistics

Raster datasets may have more than one band. We can look at statistics for each raster band using RT_statistics and RT_bandstatistics

RT_statistics provides statistics for all bands (min, max, mean, standard deviation)

RT_bandstastistics provides statistics for a single, specified band

We can start by looking at these for each individual tile. In the first examples we only show one row of data for simplicity.

The information returned by RT_statistics is an array of elements. There is one element in the array for each band, and the elements returned include the four individual measures (min, max, mean, standard deviation).

Python

Scala

Use dark colors for code blocksCopy

# individual tile
# statistics for five bands
df_stats = df_raster.select(RT.statistics("raster").alias("statistics"))

df_stats.withColumn("stats", F.explode("statistics")) \
            .select("stats.*") \
            .show(5, vertical=True)

Result
Use dark colors for code blocksCopy
-RECORD 0-------------------
 min   | 15.0
 max   | 210.0
 mean  | 92.35225296020508
 stdev | 26.602800170008393
-RECORD 1-------------------
 min   | 44.0
 max   | 222.0
 mean  | 129.31229305267334
 stdev | 17.710729011348835
-RECORD 2-------------------
 min   | 34.0
 max   | 220.0
 mean  | 72.41140651702881
 stdev | 19.689422600849085
-RECORD 3-------------------
 min   | 44.0
 max   | 229.0
 mean  | 174.80743598937988
 stdev | 14.41309833053885
-RECORD 4-------------------
 min   | 14.0
 max   | 240.0
 mean  | 84.154541015625
 stdev | 26.81645369218381
only showing top 5 rows

We can also look at the statistics for one band at a time. In this example we'll look at the statistics for band 1 in the raster tile in the first row of the DataFrame

Python

Scala

Use dark colors for code blocksCopy

# individual tile
# statistics for one band
df_raster.select(RT.band_statistics("raster", 1)).show(1, vertical=True, truncate=False)

Result
Use dark colors for code blocksCopy
-RECORD 0-------------------------------------------------------------------------
BandStatistics(raster, 1) | {15.0, 210.0, 92.35225296020508, 26.602800170008393}

For each band, statistics returns a min, max, mean, and standard deviation.

The returned values are in an array of individual statistical elements

Python

Scala

Use dark colors for code blocksCopy

# for each band, statistics returns a min, max, mean, and standard deviation
df_raster.select(RT.statistics("raster")).printSchema()

Result
Use dark colors for code blocksCopy
root
 |-- Statistics(raster): array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- min: double (nullable = false)
 |    |    |-- max: double (nullable = false)
 |    |    |-- mean: double (nullable = true)
 |    |    |-- stdev: double (nullable = true)

We can unpack these and show statistics for all bands in a single tile in one table

We can do this by exploding the full set of statistics so that we have one row for each band

Python

Scala

Use dark colors for code blocksCopy

# show statistics for all bands for a single tile
# First, explode the statistics result into a column called "element"
# Second, access the individual named parts of the element e.g., F.col("element").min to get the min value (see the naming structure from the cell above)
df_raster.limit(1).select(F.explode(RT.statistics("raster")).alias("element"))\
    .withColumn("min", F.col("element").min)\
    .withColumn("max", F.col("element").max)\
    .withColumn("mean", F.col("element").mean)\
    .withColumn("stdev", F.col("element").stdev)\
    .show()

Result
Use dark colors for code blocksCopy
+--------------------+----+-----+------------------+------------------+
|             element| min|  max|              mean|             stdev|
+--------------------+----+-----+------------------+------------------+
|{15.0, 210.0, 92....|15.0|210.0| 92.35225296020508|26.602800170008393|
|{44.0, 222.0, 129...|44.0|222.0|129.31229305267334|17.710729011348835|
|{34.0, 220.0, 72....|34.0|220.0| 72.41140651702881|19.689422600849085|
|{44.0, 229.0, 174...|44.0|229.0|174.80743598937988| 14.41309833053885|
+--------------------+----+-----+------------------+------------------+

The example above showed one row, but we can also show statistics for all tiles and all bands in a single table

We've added in some windowing to get labels for the tiles and bands to keep the results organized in the output table

Python

Scala

Use dark colors for code blocksCopy

from pyspark.sql.window import Window

# Define a window specification to allow us to list the tile number and the band number for each individual tile
# for numbering tile
window_spec_tile = Window.orderBy(col("path"))
# for numbering band when the statistics are exploded into elements
window_spec = Window.partitionBy("tile_number").orderBy(col("element"))


df_raster\
    .withColumn("tile_number", F.row_number().over(window_spec_tile))\
    .select(RT.extent("raster").alias("extent"), "tile_number", F.explode(RT.statistics("raster")).alias("element"))\
    .withColumn("band", F.row_number().over(window_spec))\
    .withColumn("min", F.col("element").min)\
    .withColumn("max", F.col("element").max)\
    .withColumn("mean", F.col("element").mean)\
    .withColumn("stdev", F.col("element").stdev)\
    .drop("element")\
    .sort("tile_number")\
    .show()

Result
Use dark colors for code blocksCopy
+--------------------+-----------+----+----+-----+------------------+------------------+
|              extent|tile_number|band| min|  max|              mean|             stdev|
+--------------------+-----------+----+----+-----+------------------+------------------+
|{"rings":[[[44482...|          1|   1|15.0|210.0| 92.35225296020508|26.602800170008393|
|{"rings":[[[44482...|          1|   2|34.0|220.0| 72.41140651702881|19.689422600849085|
|{"rings":[[[44482...|          1|   3|44.0|222.0|129.31229305267334|17.710729011348835|
|{"rings":[[[44482...|          1|   4|44.0|229.0|174.80743598937988| 14.41309833053885|
|{"rings":[[[44513...|          2|   1|14.0|240.0|   84.154541015625| 26.81645369218381|
|{"rings":[[[44513...|          2|   2|23.0|255.0| 74.64390182495117|24.848689245873594|
|{"rings":[[[44513...|          2|   3|38.0|255.0| 124.0490951538086| 19.33410483909644|
|{"rings":[[[44513...|          2|   4|43.0|255.0|172.65335083007812| 16.11312650314451|
|{"rings":[[[44543...|          3|   1|11.0|233.0|  90.1596269607544| 35.35228140271004|
|{"rings":[[[44543...|          3|   2|22.0|247.0| 80.91951274871826| 35.10800917517936|
|{"rings":[[[44543...|          3|   3|40.0|248.0|124.84498310089111| 27.80693983917014|
|{"rings":[[[44543...|          3|   4|44.0|255.0|159.28014087677002| 20.60250979293797|
|{"rings":[[[44574...|          4|   1|16.0|232.0| 80.22738838195801|26.324469299187722|
|{"rings":[[[44574...|          4|   2|16.0|248.0| 73.15077209472656|24.554763438866985|
|{"rings":[[[44574...|          4|   3|45.0|250.0|117.66667747497559|20.641196882943824|
|{"rings":[[[44574...|          4|   4|51.0|254.0| 159.6287441253662|11.813605740654758|
|{"rings":[[[44605...|          5|   1|15.0|231.0| 80.11518215240642| 25.71023617309586|
|{"rings":[[[44605...|          5|   2|19.0|233.0| 72.73027657085561| 23.62374643935028|
|{"rings":[[[44605...|          5|   3|48.0|244.0|158.23040712733956|11.794383958706929|
|{"rings":[[[44605...|          5|   4|48.0|246.0|117.44518298796791| 20.90472557528909|
+--------------------+-----------+----+----+-----+------------------+------------------+
only showing top 20 rows

Global statistics for bands

Instead of looking at statistics for each tile / row in a DataFrame, you may want global statistics across all tiles / rows in the DataFrame. Here are a few ways that you can get those values:

If the raster is small enough to merge into a single tile, you can use RT_merge to combine all of the tiles into a single raster tile / row in the DataFrame

Python

Scala

Use dark colors for code blocksCopy

df_raster_merge = df_raster.agg(RT.merge(F.collect_list("raster")).alias("raster"))

# pull out statistics for the merged dataset (it is only one row at this point since it's one giant tile)
df_raster_merge.select(F.explode(RT.statistics("raster")).alias("element"))\
    .withColumn("min", F.col("element").min)\
    .withColumn("max", F.col("element").max)\
    .withColumn("mean", F.col("element").mean)\
    .withColumn("stdev", F.col("element").stdev)\
    .drop("element")\
    .show()

Result
Use dark colors for code blocksCopy
+----+-----+---------------+---------------+
| min|  max|           mean|          stdev|
+----+-----+---------------+---------------+
| 9.0|240.0|80.369452743768|28.752628468188|
|25.0|255.0| 119.5062109164| 21.84675346408|
| 2.0|255.0|  73.6978419662|24.961412705739|
|36.0|255.0|167.75836983476|  18.1467047388|
+----+-----+---------------+---------------+

These can also be run off one band at a time if preferred:

Python

Scala

Use dark colors for code blocksCopy

# get number of bands
num_bands = df_raster.select(RT.num_bands("raster").alias("num_bands")).head(1)[0]["num_bands"]

# walk through each of the bands and print results
for i in range(0, num_bands):
    # bands are numbered starting at 1, not at 0
    print(f"Band {i+1} statistics")

    # list the statistics for the band
    df_raster_merge.select(RT.band_statistics("raster", i+1).alias("band_stats"))\
        .withColumn("min", F.col("band_stats").min)\
        .withColumn("max", F.col("band_stats").max)\
        .withColumn("mean", F.col("band_stats").mean)\
        .withColumn("stdev", F.col("band_stats").stdev)\
        .drop("band_stats")\
        .show()

Result
Use dark colors for code blocksCopy
Band 1 statistics
+---+-----+---------------+---------------+
|min|  max|           mean|          stdev|
+---+-----+---------------+---------------+
|9.0|240.0|80.369452743768|28.752628468188|
+---+-----+---------------+---------------+

Band 2 statistics
+----+-----+--------------+--------------+
| min|  max|          mean|         stdev|
+----+-----+--------------+--------------+
|25.0|255.0|119.5062109164|21.84675346408|
+----+-----+--------------+--------------+

Band 3 statistics
+---+-----+-------------+---------------+
|min|  max|         mean|          stdev|
+---+-----+-------------+---------------+
|2.0|255.0|73.6978419662|24.961412705739|
+---+-----+-------------+---------------+

Band 4 statistics
+----+-----+---------------+-------------+
| min|  max|           mean|        stdev|
+----+-----+---------------+-------------+
|36.0|255.0|167.75836983476|18.1467047388|
+----+-----+---------------+-------------+

If the tiles in the DataFrame are too large to merge into a single row, we can obtain min and max through aggregation of the individual tiles. This will not work for mean and standard deviation

Python

Scala

Use dark colors for code blocksCopy

# get number of bands
num_bands = df_raster.select(RT.num_bands("raster").alias("num_bands")).head(1)[0]["num_bands"]

# walk through each of the bands and print results
for i in range(0, num_bands):
    # bands are numbered starting at 1, not at 0
    print(f"Band {i+1} statistics")

    # calculate the aggregated min and max across all tiles
    df_raster\
        .select(RT.band_statistics("raster", i+1).alias("band_stats"))\
            .withColumn("min", F.col("band_stats").min)\
            .withColumn("max", F.col("band_stats").max)\
            .agg(F.min("min").alias("min"),F.max("max").alias("max"))\
            .show()

Result
Use dark colors for code blocksCopy
Band 1 statistics
+---+-----+
|min|  max|
+---+-----+
|9.0|240.0|
+---+-----+

Band 2 statistics
+----+-----+
| min|  max|
+----+-----+
|25.0|255.0|
+----+-----+

Band 3 statistics
+---+-----+
|min|  max|
+---+-----+
|2.0|255.0|
+---+-----+

Band 4 statistics
+----+-----+
| min|  max|
+----+-----+
|36.0|255.0|
+----+-----+

Raster values

You can obtain values for all of the cells for a single band within a raster tile using RT_bandvalues. Here we can look at the first 10 values of band 1 and 2.

Python

Scala

Use dark colors for code blocksCopy

# first 10 values for band 1
raster = df_raster.limit(1).select(RT.band_values("raster", 1).alias("values")).head(1)[0]["values"]

# print the first 10 values
raster[0:10]

Result
Use dark colors for code blocksCopy
[65.0, 68.0, 60.0, 61.0, 71.0, 72.0, 67.0, 75.0, 72.0, 67.0]

Python

Scala

Use dark colors for code blocksCopy

# first 10 values for band 2
raster = df_raster.limit(1).select(RT.band_values("raster", 2).alias("values")).head(1)[0]["values"]

# print the first 10 values
raster[0:10]

Result
Use dark colors for code blocksCopy
[115.0, 116.0, 108.0, 109.0, 119.0, 120.0, 115.0, 123.0, 122.0, 116.0]

Using RT_bandmask, we can check the mask values for all of the cells for a single band within a raster tile. Because all cells are valid in this raster, all mask values should be true here.

Python

Scala

Use dark colors for code blocksCopy

# first 10 values for band 1
df_band_mask = df_raster.select(RT.band_mask("raster", 1).alias("band_mask")).head(1)[0]["band_mask"]

# print the first 10 values
df_band_mask[0:10]

Result
Use dark colors for code blocksCopy
[True, True, True, True, True, True, True, True, True, True]

Python

Scala

Use dark colors for code blocksCopy

# first 10 values for band 2
df_band_mask = df_raster.select(RT.band_mask("raster", 2).alias("band_mask")).head(1)[0]["band_mask"]

# print the first 10 values
df_band_mask[0:10]

Result
Use dark colors for code blocksCopy
[True, True, True, True, True, True, True, True, True, True]

What's next?

Learn about how to read in other data types or analyze your data through SQL functions and analysis tools:

Read and write to databases