Detecting deforestation in the Amazon rainforest using unsupervised K-means clustering on satellite imagery

Introduction

Deforestation around the world has reached a critical level, causing irreversible damage to environmental sustainability that is contributing to climate change around the world. Widespread forest fires, from the Amazon Basin in Brazil, to the west coast of the United States, are raging all year-round. This notebook will allow us to detect deforested areas in the Brazilian Amazon rainforest, using satellite imagery.

Imports

%matplotlib inline

import pandas as pd
from datetime import datetime
from IPython.display import Image
from IPython.display import HTML
import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler
from datetime import datetime as dt

import arcgis
from arcgis.gis import GIS
from arcgis.learn import MLModel, prepare_tabulardata
from arcgis.raster import Raster

from fastai.vision import *

Connecting to ArcGIS

gis = GIS("home")   
gis_enterp = GIS("https://pythonapi.playground.esri.com/portal", "arcgis_python", "amazing_arcgis_123")

Accessing & Visualizing datasets

Here, we use Sentinel-2 imagery, which has a high resolution of 10m and 13 bands. This imagery is accessed from the ArcGIS Enterprise portal, where it is sourced from the AWS collection.

# get image
s2 = gis.content.get('fd61b9e0c69c4e14bebd50a9a968348c')
sentinel = s2.layers[0]
s2
Sentinel-2 Views
Sentinel-2, 10m Multispectral, Multitemporal, 13-band images with visual renderings and indices. This Imagery Layer is sourced from the Sentinel-2 on AWS collections and is updated daily with new imagery. This layer is in beta release. Imagery Layer by esri
Last Modified: May 28, 2020
21 comments, 1,513,352 views

Data Preparation

Define Area of Interest in the Amazon

The area of interest is defined using the four latitude and longitude values from a certain region of the Amazon rainforest where a considerable area of forest has been deforested, as can be seen from the images above.

#  extent in 3857 for amazon rainforest
amazon_extent = {
    "xmin": -6589488.51,
    "ymin": -325145.08,
    "xmax": -6586199.09,
    "ymax": -327024.74,
    "spatialReference": {"wkid": 3857}
}

Here, we select all the scenes from the sentinel imagery containing the area of interest for our study.

# The respective scene having the above area is selected
selected = sentinel.filter_by(where="(Category = 1) AND (cloudcover <=0.05)",
                              geometry=arcgis.geometry.filters.intersects(amazon_extent))
df = selected.query(out_fields="AcquisitionDate, GroupName, CloudCover, DayOfYear",
                   order_by_fields="AcquisitionDate").sdf

df['AcquisitionDate'] = pd.to_datetime(df['acquisitiondate'], unit='ms')
df
objectidacquisitiondategroupnamecloudcoverdayofyearSHAPEAcquisitionDate
024599602020-06-21 14:23:3720200621T142337_21MTS_00.0359173{"rings": [[[-6535992.3057, -302418.3759999983...2020-06-21 14:23:37
133431762020-07-06 14:23:3920200706T142339_21MTS_00.0037188{"rings": [[[-6535992.3057, -302418.3759999983...2020-07-06 14:23:39
236200962020-07-21 14:23:3720200721T142336_21MTS_00.0091203{"rings": [[[-6535992.3057, -302418.3759999983...2020-07-21 14:23:37
335961772020-07-26 14:23:4020200726T142340_21MTS_00.0020208{"rings": [[[-6535992.3057, -302418.3759999983...2020-07-26 14:23:40
415848182020-07-31 14:23:3820200731T142337_21MTS_00.0000213{"rings": [[[-6535992.3057, -302418.3759999983...2020-07-31 14:23:38
57765682020-08-10 14:23:3820200810T142338_21MTS_00.0039223{"rings": [[[-6535992.3057, -302418.3759999983...2020-08-10 14:23:38
65566562020-08-15 14:23:4120200815T142340_21MTS_00.0087228{"rings": [[[-6535992.3057, -302418.3759999983...2020-08-15 14:23:41
715518212020-09-04 14:23:3920200904T142339_21MTS_00.0421248{"rings": [[[-6535992.3057, -302418.3759999983...2020-09-04 14:23:39
82249972020-09-19 14:23:3720200919T142336_21MTS_00.0160263{"rings": [[[-6535992.3057, -302418.3759999983...2020-09-19 14:23:37

The satellite imagery with the least cloud cover is selected and visualized for further processing.

# The scene is selected with the least cloud cover and extracted using the amazon extent
amazon_scene = sentinel.filter_by('OBJECTID=1584818')
amazon_scene.extent = amazon_extent
amazon_scene
<ImageryLayer url:"https://sentinel.arcgis.com/arcgis/rest/services/Sentinel2/ImageServer">

In the above scene, the brown patches are the deforested areas that are to be identified. This selected scene is then published to the portal.

# publish the scene to the portal
amazon_scene.save('amazon_scene'+ str(dt.now().microsecond), gis=gis_enterp)
amazon_scene331940
Analysis Image Service generated from GenerateRasterImagery Layer by arcgis_python
Last Modified: July 21, 2021
0 comments, 0 views

The published imagery of the Amazon rainforest is exported back to an image file on disk for further processing.

raster_amazon_13bands = Raster("https://pythonapi.playground.esri.com/ra/rest/services/Hosted/amazon_scene_may26/ImageServer",
                               gis=gis_enterp,
                               engine="image_server")
# visualizing the image 
raster_amazon_13bands.export_image(size=[3330,1880])
<IPython.core.display.Image object>

Model Building

The first part of model building consists of defining the preprocessors, which will be used to scale the bands before feeding them into the model. The band names use the conventional naming method of the imagery name with an id number appended at the end as follows:

Sentinel-2 imagery has 13 bands, of which 4 bands, namely the blue, green, red, and near infrared bands, we will use here for modelling. These bands work well for differentiating green forested areas from barren land. The band information, along with the band name and their respective ids, are obtained for selecting the required bands.

# get the band names and their ids, sentinel images have 13 bands
pd.DataFrame(amazon_scene.key_properties()['BandProperties'])
BandNameWavelengthMinWavelengthMax
0B1_Aerosols433453
1B2_Blue458522
2B3_Green543577
3B4_Red650680
4B5_RedEdge698712
5B6_RedEdge733747
6B7_RedEdge773793
7B8_NearInfraRed784899
8B8A_NarrowNIR855875
9B9_WaterVapour935955
10B10_Cirrus13601390
11B11_ShortWaveInfraRed15651655
12B12_ShortWaveInfraRed21002280
# get the imagery name to define the band names
raster_amazon_13bands.name
'Hosted/amazon_scene_may26'

Here, the imagery name is 'Hosted/amazon_scene_april9'. Subsequently, the names of the blue, green, red, and near infrared bands would be 'Hosted/amazon_scene_april9_1', 'Hosted/amazon_scene_april9_2', 'Hosted/amazon_scene_april9_3', 'Hosted/amazon_scene_april9_7' respectively. These bands will be used for defining the preprocessors.

Data Pre-processing

from sklearn.preprocessing import MinMaxScaler

The four bands are listed in the preprocessors for scaling, with the last item as the designated scaler as follows.

preprocessors = [('Hosted/amazon_scene_may26_1', 
                  'Hosted/amazon_scene_may26_2', 
                  'Hosted/amazon_scene_may26_3',
                  'Hosted/amazon_scene_may26_7', MinMaxScaler())]

Here, in the explanatory raster, we pass the name of the explanatory raster and the selected bands by their id's — 1 for blue, 2 for green, 3 for red, and 7 for NIR, as follows:

# Data is prepared for the MLModel using the selected scene and the preprocessors
data = prepare_tabulardata(explanatory_rasters=[(raster_amazon_13bands,(1,2,3,7))], preprocessors=preprocessors)
# visualization of the data to be processed by the model
data.show_batch()
Hosted/amazon_scene_may26_1Hosted/amazon_scene_may26_2Hosted/amazon_scene_may26_3Hosted/amazon_scene_may26_7
6478206954062508
53451144115511812303
287648187073892901
420421201125212572596
615158027003882410

Model Initialization

Once the data is prepared, an unsupervised model of k-means clustering from scikit-learn is used for clustering the pixels into deforested areas and forested areas. The clustering model is passed inside MLModel as follows, with the number of clusters set as three in the parameters.

from arcgis.learn import MLModel, prepare_tabulardata
model = MLModel(data, 'sklearn.cluster.KMeans', n_clusters=3, init='k-means++', random_state=43)

Model Training

Now the model is ready to be trained, and will label the pixels as being one of three classes, either forested, slightly deforested, or highly deforested.

# here model is trained which would label the pixels into the designated classes 
model.fit()
# the labelled pixels can be visualized as follows with the last column returning the predicted labels by the model 
model.show_results()
Hosted/amazon_scene_may26_1Hosted/amazon_scene_may26_2Hosted/amazon_scene_may26_3Hosted/amazon_scene_may26_7prediction_results
2542981170939132942
3011382973039732612
3134583576741927832
32640986101578030282
5578381570842129582

Deforestation Clusters Prediction

Next, the trained model is used to predict clusters within the entire selected scene of the Amazon rainforest. This is passed as the explanatory raster, with the prediction type as raster and a local path provided for output. The output_layer_name parameter can also be used for publishing.

pred_new = model.predict(explanatory_rasters=[raster_amazon_13bands],
                         prediction_type='raster',
                         output_layer_name=('deforest_predicted2'+str(dt.now().microsecond)),
                         output_raster_path=r"/tmp/result5.tif")
The Published Item ID is :  3df30cd12a48421dade893490c967c62

Result Visualization

The resulting raster with the predicted classes of deforested areas and forested areas is now read back for visualization. The predicted local raster can be accessed here.

amazon_predict = gis.content.get('b81b89aac4cd4e08bcd7cd400fac558f')
amazon_predict
amazon_deforest_result
Image Collection by arcgis_python
Last Modified: April 16, 2021
0 comments, 12 views
import os, zipfile
filepath_new = amazon_predict.download(file_name=amazon_predict.name)
with zipfile.ZipFile(filepath_new, 'r') as zip_ref:
    zip_ref.extractall(Path(filepath_new).parent)
output_path = Path(os.path.join(os.path.splitext(filepath_new)[0]))
output_path = os.path.join(output_path, "result5.tif")    
raster_predict = Raster(output_path)
raster_predict.export_image(size=[3290,1880])
<IPython.core.display.Image object>

The model has correctly labelled the deforested areas in white, distinguishing them from the rest of the forested areas in black.

The boundaries of the detected deforested areas could be further extracted into polygons using the convert raster to feature function.

Conclusion

In this sample notebook we were able to detect deforestation in the Amazon rainforest using the unsupervised model of k-means clustering on satellite imagery. This was implemented via the MLModel framework, which exhibited the application of several unsupervised models from the scikit learn library on raster imagery.

Summary of methods used

MethodDescriptionExamples
prepare_tabulardataprepare data including imputation, normalization and train-test splitprepare data ready for fitting a MLModel
MLModel()select the ML algorithm to be used for fittingany supervised and unsupervised models from scikit learn can be used
model.fit()train a modeltraining the unsupervised model with suitable input
model.score()find the appropriate model metric of the trained modelreturns suitable value after training the unsupervised MLModel
model.predict()predict on a test setpredict values using the trained models on the trained data itself

Data resources

DatasetSourceLink
sat imagerysentinel2https://registry.opendata.aws/sentinel-2/

Your browser is no longer supported. Please upgrade your browser for the best experience. See our browser deprecation post for more details.