Learn the functionalities for exploring the properties of raster datasets.
In this tutorial, you will read from raster datasets and explore their properties using different functions in GeoAnalytics Engine.
Read in Raster data
Set up the workspace
- Import the required modules in your notebook.
Python Python Scala Use dark colors for code blocks Copy # import geoanalytics functionality import geoanalytics import geoanalytics.raster.functions as RT import geoanalytics.sql.functions as ST import pyspark.sql.functions as F geoanalytics.auth(username="user1", password="p@ssword")
Prepare your input raster
-
Set the location of the raster dataset.
The data used in this tutorial comes from a multi-band image from the United States Geological Survey's National Agriculture Image Program (NAIP). Note that the image provided is a clipped piece of a larger NAIP image.
The file used here is available here.
Python Python Scala Use dark colors for code blocks Copy # raster dataset location raster_data_loc = r"data/m_4209351_se_15_030_20230902_clip.tif" -
Read in the raster file as a DataFrame.
Python Python Scala Use dark colors for code blocks Copy df_raster = spark.read.format("raster").load(raster_data_loc) -
Show the tiles of the raster. These tiles are stored in the raster DataFrame as rows.
Python Python Scala Use dark colors for code blocks Copy df_raster.show()ResultUse dark colors for code blocks Copy +--------------------+--------------------+ | path| raster| +--------------------+--------------------+ |file:/C:/_workboo...|SqlRaster(4x1024x...| |file:/C:/_workboo...|SqlRaster(4x1024x...| |file:/C:/_workboo...|SqlRaster(4x1024x...| |file:/C:/_workboo...|SqlRaster(4x1024x...| |file:/C:/_workboo...|SqlRaster(4x935x1...| |file:/C:/_workboo...|SqlRaster(4x1024x...| |file:/C:/_workboo...|SqlRaster(4x1024x...| |file:/C:/_workboo...|SqlRaster(4x1024x...| |file:/C:/_workboo...|SqlRaster(4x1024x...| |file:/C:/_workboo...|SqlRaster(4x935x8...| +--------------------+--------------------+
Raster Properties
In this tutorial, we will explore different raster properties using raster functions.
The raster dataset was read into a DataFrame with multiple tiles. To keep examples concise, this tutorial typically shows the properties of a single row, corresponding to one tile. In many cases, metadata such as cell size and spatial reference is shared across all tiles in the dataset.
-
RTlists all the general metadata of a raster._info Show all the metadata of a single raster tile.
Python Python Scala Use dark colors for code blocks Copy # print raster info df_raster.withColumn("info", RT.info("raster")) \ .selectExpr("info.*") \ .show(1, vertical=True)ResultUse dark colors for code blocks Copy -RECORD 0-------------------------- numColumns | 1024 numRows | 1024 numBands | 4 cellSizeX | 0.30000000000001137 cellSizeY | 0.3000000000001819 pixelType | UInt8 srid | 26915 srText | PROJCS["NAD_1983_... extent | [444822.9, 466684... only showing top 1 row -
A variety of functions are available to explore specific properties of raster metadata:
RT- overall extent of the raster_extent RT- cell size on the x axis_cellsizex RT- cell size on the y axis_cellsizey RT- number of bands in the raster_numbands RT- number of columns in the raster_numcolumns RT- number of rows in the raster_numrows RT- the pixel type (e.g., UInt8)_pixeltype RT- the spatial reference ID_srid RT- the full text string with the spatial reference metadata_srtext
Show all the metadata of a single raster tile again without using
RT._Info Python Python Scala Use dark colors for code blocks Copy df_raster\ .select(RT.extent("raster"), RT.cell_size_x("raster"), RT.cell_size_y("raster"), RT.num_bands("raster"), RT.num_columns("raster"), RT.num_rows("raster"), RT.pixel_type("raster"), RT.srid("raster"),\ RT.sr_text("raster"))\ .show(1, vertical=True)ResultUse dark colors for code blocks Copy -RECORD 0---------------------------------- Extent(raster) | {"rings":[[[44482... CellSizeX(raster) | 0.30000000000001137 CellSizeY(raster) | 0.3000000000001819 NumBands(raster) | 4 NumColumns(raster) | 1024 NumRows(raster) | 1024 PixelType(raster) | UInt8 Srid(raster) | 26915 SrText(raster) | PROJCS["NAD_1983_... only showing top 1 row
Summarize properties across all tiles for a raster dataset
When you read in a large raster it will be broken into multiple tiles (DataFrame rows) to improve performance. This means that when you retrieve properties for the tiles in a raster DataFrame, you will see the properties for each tile separately.
If you want to aggregate and retrieve properties across the entire dataset, you need to work across the rows in the DataFrame.
In the example below, we aggregate across all rows (raster tiles) in the DataFrame to obtain properties about the full raster.
The aggregation below includes:
- The total number of rows in the table (equivalent to the number of tiles in the raster)
- The full extent of the raster, calculated as the union of all tile extents
- The cell size (x and y) for the first cell; assuming the cell size is consistent across all tiles in the DataFrame
- The raster height and width, calculated from the full extent
- The number of rows and columns, calculated by dividing the full extent dimensions by the cell size in the x and y directions
# height, width, number of rows, and number of columns across _all tiles created when a raster was ingested_
df_raster.agg(F.sum(F.lit(1)).alias("number_of_tiles"),
ST.aggr_union(RT.extent("raster")).alias("full_extent"),
F.first(RT.cell_size_x("raster")).alias("cell_size_x"),
F.first(RT.cell_size_y("raster")).alias("cell_size_y"))\
.withColumn("width_total", ST.max_x("full_extent") - ST.min_x("full_extent"))\
.withColumn("height_total", ST.max_y("full_extent") - ST.min_y("full_extent"))\
.withColumn("num_rows_total", F.round(F.col("width_total") / F.col("cell_size_x"), 0))\
.withColumn("num_cols_total", F.round(F.col("height_total") / F.col("cell_size_y"), 0))\
.show(vertical=True)-RECORD 0-------------------------------
number_of_tiles | 10
full_extent | {"rings":[[[44482...
cell_size_x | 0.30000000000001137
cell_size_y | 0.3000000000001819
width_total | 1509.2999999999884
height_total | 573.0
num_rows_total | 5031.0
num_cols_total | 1910.0Band statistics
Raster datasets may have more than one band. RT and RT can be used to explore the properties of raster bands.
RT provides statistics for all bands (min, max, mean, standard deviation).
RT provides statistics for a single, specified band.
-
Start by looking at band properties for each tile. In the first examples, only one row of data is shown for simplicity.
The information returned by
RTis an array of elements. There is one element in the array for each band, and the elements returned include the four individual measures (min, max, mean, standard deviation)._statistics Python Python Scala Use dark colors for code blocks Copy # individual tile # statistics for five bands df_stats = df_raster.select(RT.statistics("raster").alias("statistics")) df_stats.withColumn("stats", F.explode("statistics")) \ .select("stats.*") \ .show(5, vertical=True)ResultUse dark colors for code blocks Copy -RECORD 0------------------- min | 15.0 max | 210.0 mean | 92.35225296020508 stdev | 26.602800170008393 -RECORD 1------------------- min | 44.0 max | 222.0 mean | 129.31229305267334 stdev | 17.710729011348835 -RECORD 2------------------- min | 34.0 max | 220.0 mean | 72.41140651702881 stdev | 19.689422600849085 -RECORD 3------------------- min | 44.0 max | 229.0 mean | 174.80743598937988 stdev | 14.41309833053885 -RECORD 4------------------- min | 14.0 max | 240.0 mean | 84.154541015625 stdev | 26.81645369218381 only showing top 5 rows -
Show the statistics for one band at a time. This example shows the statistics for the first band in the first raster tile of the DataFrame.
Python Python Scala Use dark colors for code blocks Copy # individual tile # statistics for one band df_raster.select(RT.band_statistics("raster", 1)).show(1, vertical=True, truncate=False)ResultUse dark colors for code blocks Copy -RECORD 0------------------------------------------------------------------------- BandStatistics(raster, 1) | {15.0, 210.0, 92.35225296020508, 26.602800170008393} -
For each band,
RTreturns a min, max, mean, and standard deviation._Statistics Return these values in an array of elements.
Python Python Scala Use dark colors for code blocks Copy # for each band, statistics returns a min, max, mean, and standard deviation df_raster.select(RT.statistics("raster")).printSchema()ResultUse dark colors for code blocks Copy root |-- Statistics(raster): array (nullable = true) | |-- element: struct (containsNull = false) | | |-- min: double (nullable = false) | | |-- max: double (nullable = false) | | |-- mean: double (nullable = true) | | |-- stdev: double (nullable = true) -
Unpack the array and show the statistics of all bands for a single tile.
Do this by exploding the set of statistics so one row represents a single band.
Python Python Scala Use dark colors for code blocks Copy # show statistics for all bands for a single tile # First, explode the statistics result into a column called "element" # Second, access the individual named parts of the element e.g., F.col("element").min to get the min value (see the naming structure from the cell above) df_raster.limit(1).select(F.explode(RT.statistics("raster")).alias("element"))\ .withColumn("min", F.col("element").min)\ .withColumn("max", F.col("element").max)\ .withColumn("mean", F.col("element").mean)\ .withColumn("stdev", F.col("element").stdev)\ .show()ResultUse dark colors for code blocks Copy +--------------------+----+-----+------------------+------------------+ | element| min| max| mean| stdev| +--------------------+----+-----+------------------+------------------+ |{15.0, 210.0, 92....|15.0|210.0| 92.35225296020508|26.602800170008393| |{44.0, 222.0, 129...|44.0|222.0|129.31229305267334|17.710729011348835| |{34.0, 220.0, 72....|34.0|220.0| 72.41140651702881|19.689422600849085| |{44.0, 229.0, 174...|44.0|229.0|174.80743598937988| 14.41309833053885| +--------------------+----+-----+------------------+------------------+ -
The example above showed one tile, but statistics for all tiles and all bands can be displayed as a single table.
Use windowing to get labels for the tiles and bands to keep the results organized in the output table.
Python Python Scala Use dark colors for code blocks Copy from pyspark.sql.window import Window # Define a window specification to allow us to list the tile number and the band number for each individual tile # for numbering tile window_spec_tile = Window.orderBy(col("path")) # for numbering band when the statistics are exploded into elements window_spec = Window.partitionBy("tile_number").orderBy(col("element")) df_raster\ .withColumn("tile_number", F.row_number().over(window_spec_tile))\ .select(RT.extent("raster").alias("extent"), "tile_number", F.explode(RT.statistics("raster")).alias("element"))\ .withColumn("band", F.row_number().over(window_spec))\ .withColumn("min", F.col("element").min)\ .withColumn("max", F.col("element").max)\ .withColumn("mean", F.col("element").mean)\ .withColumn("stdev", F.col("element").stdev)\ .drop("element")\ .sort("tile_number")\ .show()ResultUse dark colors for code blocks Copy +--------------------+-----------+----+----+-----+------------------+------------------+ | extent|tile_number|band| min| max| mean| stdev| +--------------------+-----------+----+----+-----+------------------+------------------+ |{"rings":[[[44482...| 1| 1|15.0|210.0| 92.35225296020508|26.602800170008393| |{"rings":[[[44482...| 1| 2|34.0|220.0| 72.41140651702881|19.689422600849085| |{"rings":[[[44482...| 1| 3|44.0|222.0|129.31229305267334|17.710729011348835| |{"rings":[[[44482...| 1| 4|44.0|229.0|174.80743598937988| 14.41309833053885| |{"rings":[[[44513...| 2| 1|14.0|240.0| 84.154541015625| 26.81645369218381| |{"rings":[[[44513...| 2| 2|23.0|255.0| 74.64390182495117|24.848689245873594| |{"rings":[[[44513...| 2| 3|38.0|255.0| 124.0490951538086| 19.33410483909644| |{"rings":[[[44513...| 2| 4|43.0|255.0|172.65335083007812| 16.11312650314451| |{"rings":[[[44543...| 3| 1|11.0|233.0| 90.1596269607544| 35.35228140271004| |{"rings":[[[44543...| 3| 2|22.0|247.0| 80.91951274871826| 35.10800917517936| |{"rings":[[[44543...| 3| 3|40.0|248.0|124.84498310089111| 27.80693983917014| |{"rings":[[[44543...| 3| 4|44.0|255.0|159.28014087677002| 20.60250979293797| |{"rings":[[[44574...| 4| 1|16.0|232.0| 80.22738838195801|26.324469299187722| |{"rings":[[[44574...| 4| 2|16.0|248.0| 73.15077209472656|24.554763438866985| |{"rings":[[[44574...| 4| 3|45.0|250.0|117.66667747497559|20.641196882943824| |{"rings":[[[44574...| 4| 4|51.0|254.0| 159.6287441253662|11.813605740654758| |{"rings":[[[44605...| 5| 1|15.0|231.0| 80.11518215240642| 25.71023617309586| |{"rings":[[[44605...| 5| 2|19.0|233.0| 72.73027657085561| 23.62374643935028| |{"rings":[[[44605...| 5| 3|48.0|244.0|158.23040712733956|11.794383958706929| |{"rings":[[[44605...| 5| 4|48.0|246.0|117.44518298796791| 20.90472557528909| +--------------------+-----------+----+----+-----+------------------+------------------+ only showing top 20 rows
Global statistics for bands
Instead of looking at statistics for each tile in a raster, you may want global statistics across all tiles/rows in the DataFrame. Here are a few ways that you can get those values:
-
If the raster is small enough to merge into a single tile, you can use
RTto combine all the raster tiles into a single row in the DataFrame_Merge Python Python Scala Use dark colors for code blocks Copy df_raster_merge = df_raster.agg(RT.merge(F.collect_list("raster")).alias("raster")) # pull out statistics for the merged dataset (it is only one row at this point since it's one giant tile) df_raster_merge.select(F.explode(RT.statistics("raster")).alias("element"))\ .withColumn("min", F.col("element").min)\ .withColumn("max", F.col("element").max)\ .withColumn("mean", F.col("element").mean)\ .withColumn("stdev", F.col("element").stdev)\ .drop("element")\ .show()ResultUse dark colors for code blocks Copy +----+-----+---------------+---------------+ | min| max| mean| stdev| +----+-----+---------------+---------------+ | 9.0|240.0|80.369452743768|28.752628468188| |25.0|255.0| 119.5062109164| 21.84675346408| | 2.0|255.0| 73.6978419662|24.961412705739| |36.0|255.0|167.75836983476| 18.1467047388| +----+-----+---------------+---------------+ -
These can also be run one band at a time if preferred:
Python Python Scala Use dark colors for code blocks Copy # get number of bands num_bands = df_raster.select(RT.num_bands("raster").alias("num_bands")).head(1)[0]["num_bands"] # walk through each of the bands and print results for i in range(0, num_bands): # bands are numbered starting at 1, not at 0 print(f"Band {i+1} statistics") # list the statistics for the band df_raster_merge.select(RT.band_statistics("raster", i+1).alias("band_stats"))\ .withColumn("min", F.col("band_stats").min)\ .withColumn("max", F.col("band_stats").max)\ .withColumn("mean", F.col("band_stats").mean)\ .withColumn("stdev", F.col("band_stats").stdev)\ .drop("band_stats")\ .show()ResultUse dark colors for code blocks Copy Band 1 statistics +---+-----+---------------+---------------+ |min| max| mean| stdev| +---+-----+---------------+---------------+ |9.0|240.0|80.369452743768|28.752628468188| +---+-----+---------------+---------------+ Band 2 statistics +----+-----+--------------+--------------+ | min| max| mean| stdev| +----+-----+--------------+--------------+ |25.0|255.0|119.5062109164|21.84675346408| +----+-----+--------------+--------------+ Band 3 statistics +---+-----+-------------+---------------+ |min| max| mean| stdev| +---+-----+-------------+---------------+ |2.0|255.0|73.6978419662|24.961412705739| +---+-----+-------------+---------------+ Band 4 statistics +----+-----+---------------+-------------+ | min| max| mean| stdev| +----+-----+---------------+-------------+ |36.0|255.0|167.75836983476|18.1467047388| +----+-----+---------------+-------------+ -
If the tiles in the DataFrame are too large to merge into a single row, you can obtain min and max through aggregation of the individual tiles. This will not work for mean and standard deviation.
Python Python Scala Use dark colors for code blocks Copy # get number of bands num_bands = df_raster.select(RT.num_bands("raster").alias("num_bands")).head(1)[0]["num_bands"] # walk through each of the bands and print results for i in range(0, num_bands): # bands are numbered starting at 1, not at 0 print(f"Band {i+1} statistics") # calculate the aggregated min and max across all tiles df_raster\ .select(RT.band_statistics("raster", i+1).alias("band_stats"))\ .withColumn("min", F.col("band_stats").min)\ .withColumn("max", F.col("band_stats").max)\ .agg(F.min("min").alias("min"),F.max("max").alias("max"))\ .show()ResultUse dark colors for code blocks Copy Band 1 statistics +---+-----+ |min| max| +---+-----+ |9.0|240.0| +---+-----+ Band 2 statistics +----+-----+ | min| max| +----+-----+ |25.0|255.0| +----+-----+ Band 3 statistics +---+-----+ |min| max| +---+-----+ |2.0|255.0| +---+-----+ Band 4 statistics +----+-----+ | min| max| +----+-----+ |36.0|255.0| +----+-----+
Raster values
-
Obtain values for all the cells for a single band within a raster tile using
RT. For this example, just show the first 10 values of bands 1 and 2._Band Values Python Python Scala Use dark colors for code blocks Copy # first 10 values for band 1 raster = df_raster.limit(1).select(RT.band_values("raster", 1).alias("values")).head(1)[0]["values"] # print the first 10 values raster[0:10]ResultUse dark colors for code blocks Copy [65.0, 68.0, 60.0, 61.0, 71.0, 72.0, 67.0, 75.0, 72.0, 67.0]Python Python Scala Use dark colors for code blocks Copy # first 10 values for band 2 raster = df_raster.limit(1).select(RT.band_values("raster", 2).alias("values")).head(1)[0]["values"] # print the first 10 values raster[0:10]ResultUse dark colors for code blocks Copy [115.0, 116.0, 108.0, 109.0, 119.0, 120.0, 115.0, 123.0, 122.0, 116.0] -
Using
RT, check the mask values for all the cells for a single band of a raster tile. Because all cells are valid in this raster, all mask values should be_Band Mask truehere.Python Python Scala Use dark colors for code blocks Copy # first 10 values for band 1 df_band_mask = df_raster.select(RT.band_mask("raster", 1).alias("band_mask")).head(1)[0]["band_mask"] # print the first 10 values df_band_mask[0:10]ResultUse dark colors for code blocks Copy [True, True, True, True, True, True, True, True, True, True]Python Python Scala Use dark colors for code blocks Copy # first 10 values for band 2 df_band_mask = df_raster.select(RT.band_mask("raster", 2).alias("band_mask")).head(1)[0]["band_mask"] # print the first 10 values df_band_mask[0:10]ResultUse dark colors for code blocks Copy [True, True, True, True, True, True, True, True, True, True]
What's next?
Learn about how to read in other data types or analyze your data through SQL functions and analysis tools: