Generating cloud masks from satellite imagery - Part I

Introduction

This notebook describes the different ways cloud masks can be generated from satellite imagery, in this instance using sentinel imagery.

Cloud presence causes problems for remote sensing analysis of surface properties, for instance in analyses like land use and land cover classification, image compositing, or change detection. In cases of single scene image processing, it is relatively easy to manually filter out clouds; however, for studies that use a larger number of images, an automated approach for removing or masking out clouds is necessary.

In parts 1 and 2 of this notebook series, we will demonstrate three methods of cloud mask extraction:

  • First, we use the sentinel2 cloudless python package, which is Sentinel's hub cloud detector that works only on sentinel imagery.

  • Second, an unsupervised model using the mlmodel framework is applied to generate a cloud mask. This can run on any kind of imagery.

  • Third, we train a pixel classifier model based on Unet.

The first two methods are described in this notebook, while the third method is described in the second part of the notebook series, along with a comparison of the results of the three different methods.

Imports

Here, we import the modules we will be using in this notebook, including the pixel cloud detector from the Sentinel Cloudless package.

%matplotlib inline

import pandas as pd
from datetime import datetime
from IPython.display import Image
from IPython.display import HTML
import matplotlib.pyplot as plt
import numpy as np

from sklearn.preprocessing import MinMaxScaler
from datetime import datetime as dt

import arcgis
from arcgis.gis import GIS
from arcgis.learn import MLModel, prepare_tabulardata
from arcgis.raster import Raster

from fastai.vision import *

The Sentinel cloudless package can be installed using the following commands:

pip install sentinelhub --upgrade

pip install s2cloudless
from s2cloudless import S2PixelCloudDetector, CloudMaskRequest, get_s2_evalscript

Connecting to ArcGIS

gis = GIS("home")

Accessing & Visualizing datasets

Next, we will access the Sentinel-2 imagery, which has a high resolution of 10m and has 13 bands. This imagery can be accessed from the ArcGIS Enterprise portal, where it is sourced from the AWS collection.

# get image
s2 = gis.content.get('fd61b9e0c69c4e14bebd50a9a968348c')
sentinel = s2.layers[0]
s2
Sentinel-2 Views
Sentinel-2, 10m Multispectral, Multitemporal, 13-band images with visual renderings and indices. This Imagery Layer is sourced from the Sentinel-2 on AWS collections and is updated daily with new imagery. This layer is in beta release. Imagery Layer by esri
Last Modified: May 28, 2020
23 comments, 2461936 views

Method 1: Cloud mask using Sentinel Cloudless

Data Preparation

Define the Area of Interest in NYC

The area of interest is defined using the four latitude and longitude values from a certain part of NYC with cloud presence, as can be seen from the images.

#  extent in 3857 for NYU rain clouds
NYU_cloud_extent = {
    "xmin": -8231176.77,
    "ymin": 4967559.25,
    "xmax": -8242898.16,
    "ymax": 4973524.61,
    "spatialReference": {"wkid": 3857}
}
# NYU - The respective scene having the above area is selected
selected = sentinel.filter_by(where="(Category = 1) AND (cloudcover >=0.05)",                              
                              geometry=arcgis.geometry.filters.intersects(NYU_cloud_extent))
df = selected.query(out_fields="AcquisitionDate, GroupName, CloudCover, DayOfYear",
                   order_by_fields="AcquisitionDate").sdf

df['AcquisitionDate'] = pd.to_datetime(df['acquisitiondate'], unit='ms')
df
objectidacquisitiondategroupnamecloudcoverdayofyearSHAPEAcquisitionDate
042419242021-03-16 16:01:4720210316T160146_18TWL_01.000075{"rings": [[[-8213212.075300001, 4963896.74279...2021-03-16 16:01:47
142372382021-03-18 15:51:5020210318T155150_18TWL_01.000077{"rings": [[[-8204375.4298, 4983638.8807], [-8...2021-03-18 15:51:50
242952422021-03-23 15:51:5020210323T155149_18TWL_00.400382{"rings": [[[-8204375.4298, 4983638.8838], [-8...2021-03-23 15:51:50
343312252021-03-26 16:01:4520210326T160145_18TWL_00.261585{"rings": [[[-8208581.755000001, 4979145.44569...2021-03-26 16:01:45
443578512021-03-28 15:51:5020210328T155149_18TWL_01.000087{"rings": [[[-8202748.8665, 5092663.800099999]...2021-03-28 15:51:50
........................
13587953502022-04-27 15:51:5620220427T155156_18TWL_00.7319117{"rings": [[[-8204375.4296, 4983638.886600003]...2022-04-27 15:51:56
13688295662022-04-30 16:01:5220220430T160152_18TWL_00.2307120{"rings": [[[-8208559.822899999, 4979109.30569...2022-04-30 16:01:52
13788538452022-05-02 15:51:4520220502T155145_18TWL_01.0000122{"rings": [[[-8204375.429300001, 4983638.90709...2022-05-02 15:51:45
13888883882022-05-05 16:01:4320220505T160142_18TWL_00.9993125{"rings": [[[-8212688.777000001, 4963035.21440...2022-05-05 16:01:43
13989122932022-05-07 15:51:5520220507T155155_18TWL_00.9983127{"rings": [[[-8204375.4296, 4983638.897500001]...2022-05-07 15:51:55

140 rows × 7 columns

Visualize Area of Interest in NYC

# The scene is selected with the least cloud cover and extracted using the selected NYC extent
NYU_clouds_scene4 = sentinel.filter_by('OBJECTID=352211')
NYU_clouds_scene4.extent = NYU_cloud_extent
NYU_clouds_scene4
<ImageryLayer url:"https://sentinel.arcgis.com/arcgis/rest/services/Sentinel2/ImageServer">
from arcgis.raster.functions import apply
s2_ms = apply(NYU_clouds_scene4, None)
pd.DataFrame(s2_ms.key_properties()['BandProperties'])
BandNameWavelengthMinWavelengthMax
0B1_Aerosols433453
1B2_Blue458522
2B3_Green543577
3B4_Red650680
4B5_RedEdge698712
5B6_RedEdge733747
6B7_RedEdge773793
7B8_NearInfraRed784899
8B8A_NarrowNIR855875
9B9_WaterVapour935955
10B10_Cirrus13601390
11B11_ShortWaveInfraRed15651655
12B12_ShortWaveInfraRed21002280

Convert scene to arrray

# convert the scene to a numpy array
test_array = s2_ms.export_image(f='numpy_array')
test_array.shape
(450, 1200, 13)
# Change array to float data type
test_array_sr = test_array/10000
test_array_sr = test_array_sr.astype('float32')

Initiate cloud detector

Finally, the cloud detector can be initiated using specified parameters. The values can be adjusted for different amounts of cloud cover.

# Initiate sentinel cloud detector 
cloud_detector = S2PixelCloudDetector(
    threshold=0.7,
    average_over=4,
    dilation_size=2,
    all_bands=True
)

Generate cloud mask

Once initiated, the cloud detector can be used to extract the cloud mask from the chosen scene.

%%time
test_cloud_mask = cloud_detector.get_cloud_masks(test_array_sr)
Wall time: 2.85 s
test_cloud_mask.shape
(450, 1200)
test_cloud_mask
array([[0, 0, 0, 0, ..., 0, 0, 0, 0],
       [0, 0, 0, 0, ..., 0, 0, 0, 0],
       [0, 0, 0, 0, ..., 0, 0, 0, 0],
       [0, 0, 0, 0, ..., 0, 0, 0, 0],
       ...,
       [0, 0, 0, 0, ..., 0, 0, 0, 0],
       [0, 0, 0, 0, ..., 0, 0, 0, 0],
       [0, 0, 0, 0, ..., 0, 0, 0, 0],
       [0, 0, 0, 0, ..., 0, 0, 0, 0]], dtype=int8)

Visualize the cloud mask

plt.figure(figsize=(25,8))
plt.imshow(test_cloud_mask)
plt.title('Clouds mask by Sentinel Cloudless')
plt.axis('off')
plt.show()
<Figure size 1800x576 with 1 Axes>

Upon successfully visualizing the results, we can see that the clouds have been detected properly, separating the actual clouds from other objects with similar appearance.

Next, a different scene with clouds over forested areas will be tested with the same workflow used above. First, an area of interest is defined, then that scene is converted to an array to which the the cloud detector is appllied, resulting in a final cloud mask.

#  extent in 3857 for amazon rain clouds
cloud_extent = {
    "xmin": -6501664.396,
    "ymin": -610686.285,
    "xmax": -6490039.703,
    "ymax": -600222.721,
    "spatialReference": {"wkid": 3857}
}
# The respective scene having the above area is selected
selected = sentinel.filter_by(where="(Category = 1) AND (cloudcover >=0.05)",                              
                              geometry=arcgis.geometry.filters.intersects(cloud_extent))
df = selected.query(out_fields="AcquisitionDate, GroupName, CloudCover, DayOfYear",
                   order_by_fields="AcquisitionDate").sdf

df['AcquisitionDate'] = pd.to_datetime(df['acquisitiondate'], unit='ms')
df
objectidacquisitiondategroupnamecloudcoverdayofyearSHAPEAcquisitionDate
041795052021-03-13 14:24:0020210313T142359_21MUQ_00.976172{"rings": [[[-6454184.5053, -600807.9221000001...2021-03-13 14:24:00
141795092021-03-13 14:24:1420210313T142414_21MUP_00.985372{"rings": [[[-6476198.8497, -702069.1108000018...2021-03-13 14:24:14
242364052021-03-18 14:23:5920210318T142359_21MUQ_00.996677{"rings": [[[-6447913.4047, -573839.8112000003...2021-03-18 14:23:59
342364092021-03-18 14:24:1420210318T142414_21MUP_00.999977{"rings": [[[-6460798.414899999, -633008.85029...2021-03-18 14:24:14
442943902021-03-23 14:23:5920210323T142358_21MUQ_00.880282{"rings": [[[-6453754.216600001, -601144.96370...2021-03-23 14:23:59
........................
14987945122022-04-27 14:24:2020220427T142419_21MUP_00.9294117{"rings": [[[-6461088.789899999, -632967.87530...2022-04-27 14:24:20
15088530122022-05-02 14:23:5520220502T142354_21MUQ_00.9954122{"rings": [[[-6449537.895099999, -583912.98730...2022-05-02 14:23:55
15188530162022-05-02 14:24:0920220502T142409_21MUP_00.9707122{"rings": [[[-6460231.455399999, -632979.51350...2022-05-02 14:24:09
15289115312022-05-07 14:24:0520220507T142404_21MUQ_00.9896127{"rings": [[[-6449574.7908, -584023.4871000014...2022-05-07 14:24:05
15389115352022-05-07 14:24:1920220507T142419_21MUP_00.9836127{"rings": [[[-6472218.8981, -688137.7030000016...2022-05-07 14:24:19

154 rows × 7 columns

# The scene is selected with the least cloud cover and extracted using the amzon extent
amazon_clouds_scene3 = sentinel.filter_by('OBJECTID=6738997')
amazon_clouds_scene3.extent = cloud_extent
amazon_clouds_scene3
<ImageryLayer url:"https://sentinel.arcgis.com/arcgis/rest/services/Sentinel2/ImageServer">
s2_amazon_ms = apply(amazon_clouds_scene3, None)
pd.DataFrame(s2_amazon_ms.key_properties()['BandProperties'])
BandNameWavelengthMinWavelengthMax
0B1_Aerosols433453
1B2_Blue458522
2B3_Green543577
3B4_Red650680
4B5_RedEdge698712
5B6_RedEdge733747
6B7_RedEdge773793
7B8_NearInfraRed784899
8B8A_NarrowNIR855875
9B9_WaterVapour935955
10B10_Cirrus13601390
11B11_ShortWaveInfraRed15651655
12B12_ShortWaveInfraRed21002280
test_array = s2_amazon_ms.export_image(f='numpy_array')
test_array_sr = test_array/10000
test_array_sr = test_array_sr.astype('float32')
# Initiate sentinel cloud detector 
cloud_detector = S2PixelCloudDetector(
    threshold=0.3,
    average_over=4,
    dilation_size=2,
    all_bands=True
)
%%time
test_cloud_mask = cloud_detector.get_cloud_masks(test_array_sr)
Wall time: 2.9 s
plt.figure(figsize=(25,8))
plt.imshow(test_cloud_mask)
plt.title('Clouds mask by Sentinel Cloudless')
plt.axis('off')
plt.show()
<Figure size 1800x576 with 1 Axes>

The results show the clouds detected as yellow pixels, with some light clouds not originally visible in the scene being detected in the left portion of the image. The parameters of the cloud detector can be changed to detect different intensities of clouds.

Method 2: Cloud mask using unsupervised learning

In this section, the unsupervised learning method of k-means clustering is used for cloud detection. First, a scene from a preselected area of interest over Iowa is accessed.

Accessing & Visualizing datasets

iowa_clouds = Raster(r"https://iservices6.arcgis.com/SMX5BErCXLM7eDtY/arcgis/rest/services/small_rgb_iowa3/ImageServer", gis=gis)
iowa_clouds.extent
{
  "xmin": 659230,
  "ymin": 4642610,
  "xmax": 686720,
  "ymax": 4660770,
  "spatialReference": {
    "wkid": 32615,
    "latestWkid": 32615
  }
}
ncols = iowa_clouds.columns
nrows = iowa_clouds.rows 
iowa_clouds.read(ncols=ncols,nrows=nrows).shape
(1816, 2749, 4)
iowa_clouds.export_image(size=[2749,1816])
<IPython.core.display.Image object>

In the scene above, only clouds are present in the area of interest.

Data Pre-processing

iowa_clouds.name
'small_rgb_iowa3'

Here, the imagery name is rgbnir_iowa.tif, and the name of the 4 bands of blue, green, red, and near infrared are small_rgb_iowa3, small_rgb_iowa3_1, small_rgb_iowa3_2, small_rgb_iowa3_3 respectively. These bands will be used for defining the preprocessors.

iowa_clouds.band_names
['Band_1', 'Band_2', 'Band_3', 'Band_4']

This is a four band imagery, so the preprocessors are first defined for scaling.

preprocessors = [('small_rgb_iowa3', 'small_rgb_iowa3_1', 'small_rgb_iowa3_2', 'small_rgb_iowa3_3', MinMaxScaler())]
data = prepare_tabulardata(explanatory_rasters=[iowa_clouds], preprocessors=preprocessors)
data.show_batch()
small_rgb_iowa3small_rgb_iowa3_1small_rgb_iowa3_2small_rgb_iowa3_3
18460703928392541345941
2499805143012508406298
37379054619446347986414
49174805813572661477259
4945138170213659893131

Model Initialization

Once the data is prepared, an unsupervised model of k-means clustering from scikit-learn can be used for clustering the pixels into areas of clouds and no clouds. The clustering model is passed inside an MLModel, with the number of clusters set as three for the classes of no clouds, medium clouds, and dense clouds.

model = MLModel(data, 'sklearn.cluster.KMeans', n_clusters=3, init='k-means++', random_state=43)

Model Training

Finally, the model is ready to be trained, labeling the pixels into the three different classes.

model.fit()
model.show_results()
small_rgb_iowa3small_rgb_iowa3_1small_rgb_iowa3_2small_rgb_iowa3_3prediction_results
24721279900992510355117351
31284471220108471348670
313330145524091369559662
38516221451122795152950
43018711501129288561570

Cloud Prediction

# creating the local output raster path
import tempfile
local_path=tempfile.gettempdir()
output_raster_path=local_path+r"/result"+str(dt.now().microsecond)+".tif"
output_raster_path
'C:\\Users\\Supratim\\AppData\\Local\\Temp/result668383.tif'
# predicting the cloud masks using the fitted model
pred_new = model.predict(explanatory_rasters=[iowa_clouds], 
                         prediction_type='raster',
                         output_raster_path=output_raster_path)

Result Visualization

iowa_predicted_cloud_mask = Raster(output_raster_path)
iowa_predicted_cloud_mask.export_image(size=[2749,1816])
<IPython.core.display.Image object>

From the results, it can be seen that the cloud mask has been created and is represented by the white and grey pixels, while the black pixels represent areas with no cloud coverage. These results could be further reclassified to consist of two distinct classes of clouds and no-clouds in Arcgis Pro, for generating the final mask. This mask can be further processed in Arcgis Pro to create a polygon mask if needed.

Conclusion

In this sample notebook, two methods were described to create cloud masks from satellite images. A third method will be described in the second part of this notebook series.

In the first method the cloudless sentinel package performed well in detecting clouds and provides flexibility to the user to detect different intensities of clouds by changing the model's initialization parameters. However, the core caveat of this model is that it can only be used on Sentinel imagery.

The second method performed well with scenes where there were only clouds, and has the benefit of being able to run an analysis on any kind of imagery like Sentinel, Landsat, Surface reflectance etc. The model can also be used for automatically labelling training data for a pixel classification cloud detection model, which will be covered in the next part.

Data resources

DatasetSourceLink
sat imagerysentinel2https://registry.opendata.aws/sentinel-2/

Your browser is no longer supported. Please upgrade your browser for the best experience. See our browser deprecation post for more details.