Learn the functionalities for exploring the properties of raster datasets.
In this tutorial you will read from raster datasets and explore their properties using different functions in GeoAnalytics Engine
Read in Raster data
Set up the workspace
- Import the required modules.
Python Python Scala Use dark colors for code blocks Copy # import geoanalytics functionality import geoanalytics import geoanalytics.raster.functions as RT import geoanalytics.sql.functions as ST import pyspark.sql.functions as F geoanalytics.auth(username="user1", password="p@ssword")
Prepare your input raster
-
Set the location of the raster dataset.
The data comes from a multi-band image from the United States Geological Survey's National Agriculture Image Program (NAIP). Note that the image provided for use is a clipped piece of a larger NAIP image.
The file used here is available here.
Python Python Scala Use dark colors for code blocks Copy # raster dataset location raster_data_loc = r"data/m_4209351_se_15_030_20230902_clip.tif" -
Read in the raster file
Python Python Scala Use dark colors for code blocks Copy df_raster = spark.read.format("raster").load(raster_data_loc) -
The raster was read in and split into multiple tiles, visualize the tiles as rows in the DataFrame.
Python Python Scala Use dark colors for code blocks Copy df_raster.show()ResultUse dark colors for code blocks Copy +--------------------+--------------------+ | path| raster| +--------------------+--------------------+ |file:/C:/_workboo...|SqlRaster(4x1024x...| |file:/C:/_workboo...|SqlRaster(4x1024x...| |file:/C:/_workboo...|SqlRaster(4x1024x...| |file:/C:/_workboo...|SqlRaster(4x1024x...| |file:/C:/_workboo...|SqlRaster(4x935x1...| |file:/C:/_workboo...|SqlRaster(4x1024x...| |file:/C:/_workboo...|SqlRaster(4x1024x...| |file:/C:/_workboo...|SqlRaster(4x1024x...| |file:/C:/_workboo...|SqlRaster(4x1024x...| |file:/C:/_workboo...|SqlRaster(4x935x8...| +--------------------+--------------------+
Raster Properties
There are numerous functions available for working with raster datasets. In this tutorial we will explore raster properties.
Since our raster dataset was read into a DataFrame with multiple rows/tiles, in most of the examples, we will only show one row when we explore the values since the general properties are likely the same for every tile (e.g., cell size, spatial reference, etc.)
-
RT- list all of the general metadata information about a raster._info The individual components can be collected separately as well.
Python Python Scala Use dark colors for code blocks Copy # print raster info df_raster.withColumn("info", RT.info("raster")) \ .selectExpr("info.*") \ .show(1, vertical=True)ResultUse dark colors for code blocks Copy -RECORD 0-------------------------- numColumns | 1024 numRows | 1024 numBands | 4 cellSizeX | 0.30000000000001137 cellSizeY | 0.3000000000001819 pixelType | UInt8 srid | 26915 srText | PROJCS["NAD_1983_... extent | [444822.9, 466684... only showing top 1 row -
When exploring specific individual metadeta items about the raster, we can use a variety of functions for this:
RT- overall extent of the raster_extent RT- cell size on the x axis_cellsizex RT- cell size on the y axis_cellsizey RT- number of bands in the raster_numbands RT- number of columns in the raster_numcolumns RT- number of rows in the raster_numrows RT- the pixel type (e.g., UInt8)_pixeltype RT- the spatial reference ID_srid RT- the full text string with the spatial reference metadata_srtext
Python Python Scala Use dark colors for code blocks Copy df_raster\ .select(RT.extent("raster"), RT.cell_size_x("raster"), RT.cell_size_y("raster"), RT.num_bands("raster"), RT.num_columns("raster"), RT.num_rows("raster"), RT.pixel_type("raster"), RT.srid("raster"),\ RT.sr_text("raster"))\ .show(1, vertical=True)ResultUse dark colors for code blocks Copy -RECORD 0---------------------------------- Extent(raster) | {"rings":[[[44482... CellSizeX(raster) | 0.30000000000001137 CellSizeY(raster) | 0.3000000000001819 NumBands(raster) | 4 NumColumns(raster) | 1024 NumRows(raster) | 1024 PixelType(raster) | UInt8 Srid(raster) | 26915 SrText(raster) | PROJCS["NAD_1983_... only showing top 1 row
Summarize properties across all tiles for a raster dataset
When you read in a large raster it will be broken into multiple tiles (DataFrame rows) to improve performance. This means that when you retrieve properties for the tiles in a raster DataFrame, you will see the properties for each tile separately.
If you want to aggregate to retrieve properties across the entire dataset, you need to work across the rows in the DataFrame
In the example below, we aggregate across all rows (raster tiles) in our DataFrame to obtain properties about the full raster that was ingested.
The aggregation below includes:
- Count all rows in the table (sum a count of 1 for each row)
- Union the extent of all rows to get the full extent
- The cell size (x and y) for the first cell; assuming the cell size is consistent across all tiles in the DataFrame
- Calculation of the raster height and width based on the full extent
- Calculation of the number of rows and columns based on the full extent divided by the cell size (x and y dimensions)
# height, width, number of rows, and number of columns across _all tiles created when a raster was ingested_
df_raster.agg(F.sum(F.lit(1)).alias("number_of_tiles"),
ST.aggr_union(RT.extent("raster")).alias("full_extent"),
F.first(RT.cell_size_x("raster")).alias("cell_size_x"),
F.first(RT.cell_size_y("raster")).alias("cell_size_y"))\
.withColumn("width_total", ST.max_x("full_extent") - ST.min_x("full_extent"))\
.withColumn("height_total", ST.max_y("full_extent") - ST.min_y("full_extent"))\
.withColumn("num_rows_total", F.round(F.col("width_total") / F.col("cell_size_x"), 0))\
.withColumn("num_cols_total", F.round(F.col("height_total") / F.col("cell_size_y"), 0))\
.show(vertical=True)-RECORD 0-------------------------------
number_of_tiles | 10
full_extent | {"rings":[[[44482...
cell_size_x | 0.30000000000001137
cell_size_y | 0.3000000000001819
width_total | 1509.2999999999884
height_total | 573.0
num_rows_total | 5031.0
num_cols_total | 1910.0Band statistics
Raster datasets may have more than one band. We can look at statistics for each raster band using RT and RT
RT provides statistics for all bands (min, max, mean, standard deviation)
RT provides statistics for a single, specified band
-
We can start by looking at these for each individual tile. In the first examples we only show one row of data for simplicity.
The information returned by
RTis an array of elements. There is one element in the array for each band, and the elements returned include the four individual measures (min, max, mean, standard deviation)._statistics Python Python Scala Use dark colors for code blocks Copy # individual tile # statistics for five bands df_stats = df_raster.select(RT.statistics("raster").alias("statistics")) df_stats.withColumn("stats", F.explode("statistics")) \ .select("stats.*") \ .show(5, vertical=True)ResultUse dark colors for code blocks Copy -RECORD 0------------------- min | 15.0 max | 210.0 mean | 92.35225296020508 stdev | 26.602800170008393 -RECORD 1------------------- min | 44.0 max | 222.0 mean | 129.31229305267334 stdev | 17.710729011348835 -RECORD 2------------------- min | 34.0 max | 220.0 mean | 72.41140651702881 stdev | 19.689422600849085 -RECORD 3------------------- min | 44.0 max | 229.0 mean | 174.80743598937988 stdev | 14.41309833053885 -RECORD 4------------------- min | 14.0 max | 240.0 mean | 84.154541015625 stdev | 26.81645369218381 only showing top 5 rows -
We can also look at the statistics for one band at a time. In this example we'll look at the statistics for band 1 in the raster tile in the first row of the DataFrame
Python Python Scala Use dark colors for code blocks Copy # individual tile # statistics for one band df_raster.select(RT.band_statistics("raster", 1)).show(1, vertical=True, truncate=False)ResultUse dark colors for code blocks Copy -RECORD 0------------------------------------------------------------------------- BandStatistics(raster, 1) | {15.0, 210.0, 92.35225296020508, 26.602800170008393} -
For each band, statistics returns a min, max, mean, and standard deviation.
The returned values are in an array of individual statistical elements
Python Python Scala Use dark colors for code blocks Copy # for each band, statistics returns a min, max, mean, and standard deviation df_raster.select(RT.statistics("raster")).printSchema()ResultUse dark colors for code blocks Copy root |-- Statistics(raster): array (nullable = true) | |-- element: struct (containsNull = false) | | |-- min: double (nullable = false) | | |-- max: double (nullable = false) | | |-- mean: double (nullable = true) | | |-- stdev: double (nullable = true) -
We can unpack these and show statistics for all bands in a single tile in one table
We can do this by exploding the full set of statistics so that we have one row for each band
Python Python Scala Use dark colors for code blocks Copy # show statistics for all bands for a single tile # First, explode the statistics result into a column called "element" # Second, access the individual named parts of the element e.g., F.col("element").min to get the min value (see the naming structure from the cell above) df_raster.limit(1).select(F.explode(RT.statistics("raster")).alias("element"))\ .withColumn("min", F.col("element").min)\ .withColumn("max", F.col("element").max)\ .withColumn("mean", F.col("element").mean)\ .withColumn("stdev", F.col("element").stdev)\ .show()ResultUse dark colors for code blocks Copy +--------------------+----+-----+------------------+------------------+ | element| min| max| mean| stdev| +--------------------+----+-----+------------------+------------------+ |{15.0, 210.0, 92....|15.0|210.0| 92.35225296020508|26.602800170008393| |{44.0, 222.0, 129...|44.0|222.0|129.31229305267334|17.710729011348835| |{34.0, 220.0, 72....|34.0|220.0| 72.41140651702881|19.689422600849085| |{44.0, 229.0, 174...|44.0|229.0|174.80743598937988| 14.41309833053885| +--------------------+----+-----+------------------+------------------+ -
The example above showed one row, but we can also show statistics for all tiles and all bands in a single table
We've added in some windowing to get labels for the tiles and bands to keep the results organized in the output table
Python Python Scala Use dark colors for code blocks Copy from pyspark.sql.window import Window # Define a window specification to allow us to list the tile number and the band number for each individual tile # for numbering tile window_spec_tile = Window.orderBy(col("path")) # for numbering band when the statistics are exploded into elements window_spec = Window.partitionBy("tile_number").orderBy(col("element")) df_raster\ .withColumn("tile_number", F.row_number().over(window_spec_tile))\ .select(RT.extent("raster").alias("extent"), "tile_number", F.explode(RT.statistics("raster")).alias("element"))\ .withColumn("band", F.row_number().over(window_spec))\ .withColumn("min", F.col("element").min)\ .withColumn("max", F.col("element").max)\ .withColumn("mean", F.col("element").mean)\ .withColumn("stdev", F.col("element").stdev)\ .drop("element")\ .sort("tile_number")\ .show()ResultUse dark colors for code blocks Copy +--------------------+-----------+----+----+-----+------------------+------------------+ | extent|tile_number|band| min| max| mean| stdev| +--------------------+-----------+----+----+-----+------------------+------------------+ |{"rings":[[[44482...| 1| 1|15.0|210.0| 92.35225296020508|26.602800170008393| |{"rings":[[[44482...| 1| 2|34.0|220.0| 72.41140651702881|19.689422600849085| |{"rings":[[[44482...| 1| 3|44.0|222.0|129.31229305267334|17.710729011348835| |{"rings":[[[44482...| 1| 4|44.0|229.0|174.80743598937988| 14.41309833053885| |{"rings":[[[44513...| 2| 1|14.0|240.0| 84.154541015625| 26.81645369218381| |{"rings":[[[44513...| 2| 2|23.0|255.0| 74.64390182495117|24.848689245873594| |{"rings":[[[44513...| 2| 3|38.0|255.0| 124.0490951538086| 19.33410483909644| |{"rings":[[[44513...| 2| 4|43.0|255.0|172.65335083007812| 16.11312650314451| |{"rings":[[[44543...| 3| 1|11.0|233.0| 90.1596269607544| 35.35228140271004| |{"rings":[[[44543...| 3| 2|22.0|247.0| 80.91951274871826| 35.10800917517936| |{"rings":[[[44543...| 3| 3|40.0|248.0|124.84498310089111| 27.80693983917014| |{"rings":[[[44543...| 3| 4|44.0|255.0|159.28014087677002| 20.60250979293797| |{"rings":[[[44574...| 4| 1|16.0|232.0| 80.22738838195801|26.324469299187722| |{"rings":[[[44574...| 4| 2|16.0|248.0| 73.15077209472656|24.554763438866985| |{"rings":[[[44574...| 4| 3|45.0|250.0|117.66667747497559|20.641196882943824| |{"rings":[[[44574...| 4| 4|51.0|254.0| 159.6287441253662|11.813605740654758| |{"rings":[[[44605...| 5| 1|15.0|231.0| 80.11518215240642| 25.71023617309586| |{"rings":[[[44605...| 5| 2|19.0|233.0| 72.73027657085561| 23.62374643935028| |{"rings":[[[44605...| 5| 3|48.0|244.0|158.23040712733956|11.794383958706929| |{"rings":[[[44605...| 5| 4|48.0|246.0|117.44518298796791| 20.90472557528909| +--------------------+-----------+----+----+-----+------------------+------------------+ only showing top 20 rows
Global statistics for bands
Instead of looking at statistics for each tile / row in a DataFrame, you may want global statistics across all tiles / rows in the DataFrame. Here are a few ways that you can get those values:
-
If the raster is small enough to merge into a single tile, you can use
RTto combine all of the tiles into a single raster tile / row in the DataFrame_merge Python Python Scala Use dark colors for code blocks Copy df_raster_merge = df_raster.agg(RT.merge(F.collect_list("raster")).alias("raster")) # pull out statistics for the merged dataset (it is only one row at this point since it's one giant tile) df_raster_merge.select(F.explode(RT.statistics("raster")).alias("element"))\ .withColumn("min", F.col("element").min)\ .withColumn("max", F.col("element").max)\ .withColumn("mean", F.col("element").mean)\ .withColumn("stdev", F.col("element").stdev)\ .drop("element")\ .show()ResultUse dark colors for code blocks Copy +----+-----+---------------+---------------+ | min| max| mean| stdev| +----+-----+---------------+---------------+ | 9.0|240.0|80.369452743768|28.752628468188| |25.0|255.0| 119.5062109164| 21.84675346408| | 2.0|255.0| 73.6978419662|24.961412705739| |36.0|255.0|167.75836983476| 18.1467047388| +----+-----+---------------+---------------+ -
These can also be run off one band at a time if preferred:
Python Python Scala Use dark colors for code blocks Copy # get number of bands num_bands = df_raster.select(RT.num_bands("raster").alias("num_bands")).head(1)[0]["num_bands"] # walk through each of the bands and print results for i in range(0, num_bands): # bands are numbered starting at 1, not at 0 print(f"Band {i+1} statistics") # list the statistics for the band df_raster_merge.select(RT.band_statistics("raster", i+1).alias("band_stats"))\ .withColumn("min", F.col("band_stats").min)\ .withColumn("max", F.col("band_stats").max)\ .withColumn("mean", F.col("band_stats").mean)\ .withColumn("stdev", F.col("band_stats").stdev)\ .drop("band_stats")\ .show()ResultUse dark colors for code blocks Copy Band 1 statistics +---+-----+---------------+---------------+ |min| max| mean| stdev| +---+-----+---------------+---------------+ |9.0|240.0|80.369452743768|28.752628468188| +---+-----+---------------+---------------+ Band 2 statistics +----+-----+--------------+--------------+ | min| max| mean| stdev| +----+-----+--------------+--------------+ |25.0|255.0|119.5062109164|21.84675346408| +----+-----+--------------+--------------+ Band 3 statistics +---+-----+-------------+---------------+ |min| max| mean| stdev| +---+-----+-------------+---------------+ |2.0|255.0|73.6978419662|24.961412705739| +---+-----+-------------+---------------+ Band 4 statistics +----+-----+---------------+-------------+ | min| max| mean| stdev| +----+-----+---------------+-------------+ |36.0|255.0|167.75836983476|18.1467047388| +----+-----+---------------+-------------+ -
If the tiles in the DataFrame are too large to merge into a single row, we can obtain min and max through aggregation of the individual tiles. This will not work for mean and standard deviation
Python Python Scala Use dark colors for code blocks Copy # get number of bands num_bands = df_raster.select(RT.num_bands("raster").alias("num_bands")).head(1)[0]["num_bands"] # walk through each of the bands and print results for i in range(0, num_bands): # bands are numbered starting at 1, not at 0 print(f"Band {i+1} statistics") # calculate the aggregated min and max across all tiles df_raster\ .select(RT.band_statistics("raster", i+1).alias("band_stats"))\ .withColumn("min", F.col("band_stats").min)\ .withColumn("max", F.col("band_stats").max)\ .agg(F.min("min").alias("min"),F.max("max").alias("max"))\ .show()ResultUse dark colors for code blocks Copy Band 1 statistics +---+-----+ |min| max| +---+-----+ |9.0|240.0| +---+-----+ Band 2 statistics +----+-----+ | min| max| +----+-----+ |25.0|255.0| +----+-----+ Band 3 statistics +---+-----+ |min| max| +---+-----+ |2.0|255.0| +---+-----+ Band 4 statistics +----+-----+ | min| max| +----+-----+ |36.0|255.0| +----+-----+
Raster values
-
You can obtain values for all of the cells for a single band within a raster tile using
RT. Here we can look at the first 10 values of band 1 and 2._bandvalues Python Python Scala Use dark colors for code blocks Copy # first 10 values for band 1 raster = df_raster.limit(1).select(RT.band_values("raster", 1).alias("values")).head(1)[0]["values"] # print the first 10 values raster[0:10]ResultUse dark colors for code blocks Copy [65.0, 68.0, 60.0, 61.0, 71.0, 72.0, 67.0, 75.0, 72.0, 67.0]Python Python Scala Use dark colors for code blocks Copy # first 10 values for band 2 raster = df_raster.limit(1).select(RT.band_values("raster", 2).alias("values")).head(1)[0]["values"] # print the first 10 values raster[0:10]ResultUse dark colors for code blocks Copy [115.0, 116.0, 108.0, 109.0, 119.0, 120.0, 115.0, 123.0, 122.0, 116.0] -
Using
RT, we can check the mask values for all of the cells for a single band within a raster tile. Because all cells are valid in this raster, all mask values should be_bandmask truehere.Python Python Scala Use dark colors for code blocks Copy # first 10 values for band 1 df_band_mask = df_raster.select(RT.band_mask("raster", 1).alias("band_mask")).head(1)[0]["band_mask"] # print the first 10 values df_band_mask[0:10]ResultUse dark colors for code blocks Copy [True, True, True, True, True, True, True, True, True, True]Python Python Scala Use dark colors for code blocks Copy # first 10 values for band 2 df_band_mask = df_raster.select(RT.band_mask("raster", 2).alias("band_mask")).head(1)[0]["band_mask"] # print the first 10 values df_band_mask[0:10]ResultUse dark colors for code blocks Copy [True, True, True, True, True, True, True, True, True, True]
What's next?
Learn about how to read in other data types or analyze your data through SQL functions and analysis tools: