
ArcGIS API for Python

Object Detection Workflow with arcgis.learn

Deep learning models 'learn' by looking at several examples of imagery and the expected outputs. In the case of object detection, this requires imagery as well as known or labelled locations of objects that the model can learn from. With the ArcGIS platform, these datasets are represented as layers, and are available in GIS. In the workflow below, we will be training a model to identify well pads from Planet imagery.


  • Please refer to the prerequisites section in our guide for more information. This sample demonstrates how to export training data and perform model inference using ArcGIS Image Server. Alternatively, both can be done using ArcGIS Pro.
  • If you have already exported training samples using ArcGIS Pro, you can jump straight to the training section. The saved model can also be imported into ArcGIS Pro directly.

The code below connects to our GIS and accesses the known well pad locations and the imagery, in this case provided by Planet:

In [1]:
from arcgis.gis import GIS
from arcgis.raster.functions import extract_band 
from arcgis.learn import export_training_data

gis = GIS("home")
In [2]:
# layers we need - The input to generate training samples and the imagery
well_pads = gis.content.get('ae6f1c62027c42b8a88c4cf5deb86bbf') # Well pads layer
Well Pads Permian Basin
Well Points in Hobbs County
Feature Layer Collection by portaladmin
Last Modified: March 19, 2019
0 comments, 352 views
In [3]:
# Weekly mosaics provided by Planet
planet_mosaic_item = gis.content.search("PlanetGlobalMosaics")[0]
PlanetGlobalMosaics
Imagery Layer by portaladmin
Last Modified: March 10, 2019
0 comments, 88 views

Export Training Samples

The export_training_data() method generates training samples for training deep learning models, given the input imagery, along with labelled vector data or classified images. Deep learning training samples are small sub images, called image chips, and contain the feature or class of interest. This tool creates folders containing image chips for training the model, labels and metadata files and stores them in the raster store of your enterprise GIS. The image chips are often small (e.g. 256x256), unless the training sample size is large. These training samples support model training workflows using the arcgis.learn package as well as by third-party deep learning libraries, such as TensorFlow or PyTorch. The supported models in arcgis.learn accept the PASCAL_VOC_rectangles format for object detection models, which is a standardized image dataset for object class recognition. The label files are XML files containing information about image name, class value, and bounding boxes.
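To make the label format concrete, the sketch below parses a small hand-written XML string laid out in the usual PASCAL VOC style. The file name, class value and coordinates are invented for illustration and are not taken from the exported data; the exact set of fields written by the export tool may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical label file in the common PASCAL VOC layout (all values invented).
SAMPLE_LABEL = """
<annotation>
  <filename>000000000.png</filename>
  <size><width>448</width><height>448</height><depth>3</depth></size>
  <object>
    <name>1</name>
    <bndbox><xmin>120</xmin><ymin>96</ymin><xmax>168</xmax><ymax>150</ymax></bndbox>
  </object>
  <object>
    <name>1</name>
    <bndbox><xmin>300</xmin><ymin>210</ymin><xmax>352</xmax><ymax>260</ymax></bndbox>
  </object>
</annotation>
"""


def read_boxes(xml_text):
    """Return (image name, [(class value, xmin, ymin, xmax, ymax), ...])."""
    root = ET.fromstring(xml_text.strip())
    image_name = root.findtext("filename")
    boxes = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        coords = tuple(int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((obj.findtext("name"),) + coords)
    return image_name, boxes


image_name, boxes = read_boxes(SAMPLE_LABEL)
print(image_name, boxes)
```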

In order to take advantage of pretrained models that have been trained on large image collections (e.g. ImageNet), we have to pick 3 bands from the multispectral imagery, as those pretrained models are trained with images that have only 3 (RGB) channels. The extract_band() method can be used to specify which 3 bands should be extracted for fine-tuning the models:

In [4]:
planet_mosaic_data = extract_band(planet_mosaic_item.layers[0], [1,2,3])

We recommend exporting image chips with a larger size than that used for training the models. This allows arcgis.learn to perform random center cropping as part of its default data augmentation, so the model sees a different sub-area of each chip during training, which leads to better generalization and helps avoid overfitting to the training data. By default, a chip size of 448 x 448 pixels works well, but this can be adjusted based on the amount of context you wish to provide to the model, as well as the amount of GPU memory available.
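The export call in the next cell passes two dictionaries, 448 x 448 and 224 x 224. Assuming the first is the chip (tile) size and the second is the stride between neighbouring chips, the sketch below shows how those two numbers determine where chips start along one axis and how much they overlap. The raster width of 1120 pixels and the helper name chip_origins are hypothetical.

```python
def chip_origins(extent, tile, stride):
    """Top-left offsets of full-size chips along one axis of length `extent`."""
    return list(range(0, extent - tile + 1, stride))


# Hypothetical 1120-pixel-wide raster, tiled with a 448-pixel chip and 224-pixel stride.
origins = chip_origins(1120, tile=448, stride=224)
overlap = 1 - 224 / 448  # fraction of each chip shared with its neighbour

print(origins)  # [0, 224, 448, 672]
print(overlap)  # 0.5
```

With a stride of half the chip size, every well pad away from the raster edge appears in more than one chip, at a different position in each.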

In [5]:
chips = export_training_data(planet_mosaic_data, well_pads, "PNG", {"x":448,"y":448}, {"x":224,"y":224}, 
                             "PASCAL_VOC_rectangles", 75, "planetdemo")

Data Preparation

Data preparation can be a time consuming process that typically involves splitting the data into training and validation sets, applying various data augmentation techniques, creating the necessary data structures for loading data into the model, memory management by using the appropriately sized mini-batches of data and so on. The prepare_data() method can directly read the training samples exported by ArcGIS and automate the entire process.
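As a rough picture of two of the chores listed above, the following sketch splits a list of chip names into training and validation sets and groups the training set into mini-batches. It is a plain-Python illustration, not what prepare_data() does internally; the 20% validation fraction, batch size of 16 and file names are assumptions chosen for the example.

```python
import random


def split_and_batch(items, val_fraction, batch_size, seed=0):
    """Shuffle items, hold out a validation share, and batch the remainder."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    val, train = shuffled[:n_val], shuffled[n_val:]
    batches = [train[i:i + batch_size] for i in range(0, len(train), batch_size)]
    return train, val, batches


# Hypothetical chip file names.
chip_names = [f"chip_{i:03d}.png" for i in range(100)]
train, val, batches = split_and_batch(chip_names, val_fraction=0.2, batch_size=16)
print(len(train), len(val), len(batches))  # 80 20 5
```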

By default, prepare_data() uses a default set of transforms for data augmentation that work well for satellite imagery. These transforms randomly rotate, scale and flip the images so the model sees a different image each time. Alternatively, users can compose their own transforms using fast.ai transforms for the specific data augmentations they wish to perform.
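The effect of such augmentations can be illustrated on a tiny grid of numbers standing in for an image. This is a simplified sketch with hypothetical helper names; it handles pixels only, whereas a real object detection pipeline must apply the same transform to the bounding boxes, and it does not reproduce the default transforms used by prepare_data().

```python
import random


def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def rot90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def random_crop(img, size, rng):
    """Cut a size x size window at a random position."""
    top = rng.randint(0, len(img) - size)
    left = rng.randint(0, len(img[0]) - size)
    return [row[left:left + size] for row in img[top:top + size]]


def augment(img, crop_size, rng):
    """Random crop, then a random flip and a random number of quarter turns."""
    out = random_crop(img, crop_size, rng)
    if rng.random() < 0.5:
        out = hflip(out)
    for _ in range(rng.randrange(4)):
        out = rot90(out)
    return out


image = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4 x 4 stand-in "chip"
rng = random.Random(0)
for _ in range(3):
    print(augment(image, crop_size=2, rng=rng))
```

Each call returns a different 2 x 2 view of the same chip, which is the reason the model rarely sees exactly the same input twice.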

In [6]:
from arcgis.learn import prepare_data

data = prepare_data('/arcgis/directories/rasterstore/planetdemo', {1: '  Pad'})

The show_batch() method can be used to visualize the exported training samples, along with labels, after data augmentation transformations have been applied.

In [7]:
data.show_batch()

Model Training

arcgis.learn includes support for training deep learning models for object detection.

The models in arcgis.learn are based upon pretrained Convolutional Neural Networks (CNNs, or in short, convnets) that have been trained on millions of common images such as those in the ImageNet dataset. The intuition behind a CNN is that it uses a hierarchy of layers, with the earlier layers learning to identify simple features like edges and blobs, middle layers combining these primitive features to identify corners and object parts, and the later layers combining the inputs from these in unique ways to grasp what the whole image is about. The final layer in a typical convnet is a fully connected layer that looks at all the extracted features and essentially computes a weighted sum of these to determine a probability of each object class (whether it's an image of a cat or a dog, etc.).
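The final step described above can be written out in a few lines. The sketch assumes a softmax is used to turn the weighted sums into probabilities, which is the usual choice for image classifiers; the feature values, weights and the two classes are made up.

```python
import math


def class_probabilities(features, weights, biases):
    """Weighted sum per class followed by a softmax."""
    scores = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


features = [0.5, 1.0, -0.5]           # hypothetical extracted features
weights = [[1.0, 2.0, 0.0],           # hypothetical weights for class "cat"
           [0.0, 1.0, 1.0]]           # hypothetical weights for class "dog"
biases = [0.0, 0.0]

probs = class_probabilities(features, weights, biases)
print(dict(zip(["cat", "dog"], probs)))
```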

A convnet trained on a huge corpus of images such as ImageNet is thus considered as a ready-to-use feature extractor. In practice, we could replace the last layer of these convnets with something else that uses those features for other useful tasks (e.g. object detection and pixel classification), which is also called transfer learning. The advantage of transfer learning is that we now don't need as much data to train an excellent model.

The arcgis.learn module is based on PyTorch and fast.ai and enables fine-tuning of pretrained torchvision models on satellite imagery. The arcgis.learn models leverage fast.ai's learning rate finder and one-cycle learning, which allows for much faster training and removes guesswork in picking hyperparameters.

arcgis.learn provides the SingleShotDetector (SSD) model for object detection tasks, which is based on a pretrained convnet, such as ResNet, that acts as the 'backbone'. More details about SSD can be found here.

Train SingleShotDetector Model

Since the image chips visualized in the section above indicate that most well pads are roughly of the same size and square in shape, we can keep an aspect ratio of 1:1 and zoom scale of 1. This will help simplify the model and make it easier to train. Also, since the size of well pads in the image chips is such that approximately nine could fit side by side, we can keep a grid size of 9.

In [8]:
from arcgis.learn import SingleShotDetector

ssd = SingleShotDetector(data, grids=[9], zooms=[1.0], ratios=[[1.0, 1.0]])
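One way to picture what a grid size of 9 with a single zoom and a single aspect ratio implies is to enumerate the candidate (anchor) boxes it produces. The sketch below is a simplified illustration in normalized image coordinates; the helper make_anchors is hypothetical and is not how SingleShotDetector builds its anchors internally.

```python
def make_anchors(grid, zoom, ratio):
    """Return (cx, cy, w, h) anchor boxes in normalized [0, 1] image coordinates."""
    cell = 1.0 / grid
    w = cell * zoom * ratio[0]
    h = cell * zoom * ratio[1]
    return [((col + 0.5) * cell, (row + 0.5) * cell, w, h)
            for row in range(grid) for col in range(grid)]


anchors = make_anchors(grid=9, zoom=1.0, ratio=(1.0, 1.0))
print(len(anchors))  # 81 candidate boxes, one per grid cell
print(anchors[0])    # centre and size of the top-left anchor
```

Adding more zooms or ratios multiplies the number of candidate boxes per cell, which is why keeping a single square anchor simplifies the model.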

Find the efficient learning rate

Now, once a model architecture is defined we can start to train it. This process involves setting a good learning rate.

Choosing a very small learning rate leads to very slow training of the model, while selecting an extremely high rate can 'overshoot' the minima where the loss (or error rate) is lowest, and prevent the model from converging.
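The trade-off can be seen on a toy problem. The sketch below runs plain gradient descent on loss(w) = w**2, which is a stand-in chosen for illustration and has nothing to do with the detector's actual loss.

```python
def minimise(lr, steps=50, start=5.0):
    """Run gradient descent on loss(w) = w**2 and return the final loss."""
    w = start
    for _ in range(steps):
        grad = 2 * w  # derivative of w**2
        w -= lr * grad
    return w * w


for lr in (0.001, 0.1, 1.1):
    print(lr, minimise(lr))
```

With the smallest rate the loss has barely moved from its starting value of 25 after 50 steps, the middle rate drives it close to zero, and the largest rate makes each step overshoot further than the last, so the loss grows without bound.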

arcgis.learn includes a learning rate finder, accessible through the model's lr_find() method, that can automatically select an optimum learning rate without requiring repeated experiments.

In [9]:
# Users can visualize how the loss varies with the learning rate.
ssd.lr_find()

The above function returns 0.001 as the learning rate.
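The general idea behind a learning rate finder can be sketched on a toy problem: increase the rate geometrically from step to step, record the loss, stop once the loss blows up, and pick a rate somewhat below the point where the loss was lowest. The quadratic loss, the sweep range and the "divide by ten" rule below are assumptions made for illustration; this is not a description of how lr_find() chooses its value.

```python
def find_learning_rate(min_lr=1e-4, max_lr=10.0, steps=100, start=5.0):
    """Sweep the learning rate upwards on loss(w) = w**2 and suggest a rate."""
    w = start
    best = float("inf")
    history = []  # (learning rate, loss measured before the update)
    for i in range(steps):
        lr = min_lr * (max_lr / min_lr) ** (i / (steps - 1))
        loss = w * w
        if loss > 4 * best:  # the loss has blown up, so end the sweep
            break
        best = min(best, loss)
        history.append((lr, loss))
        w -= lr * 2 * w  # one gradient step at the current rate
    lr_at_min, _ = min(history, key=lambda pair: pair[1])
    return lr_at_min / 10, history  # heuristic: stay well below the minimum


suggested, history = find_learning_rate()
print(suggested, len(history))
```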

Train the model

As discussed earlier, the idea of transfer learning is to fine-tune earlier layers of the pretrained model and focus on training the newly added layers, meaning we need two different learning rates to better fit the model. We have already selected a good learning rate to train the later layers above (i.e. 0.02). An empirical value of lower learning rate for fine-tuning the earlier layers is usually one tenth of the higher rate. We choose 0.001 to be more careful not to disturb the weights of the pretrained backbone by too much. It can be adjusted depending upon how different the imagery is from natural images on which the backbone network is trained.
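When a model is split into several layer groups, a lower and a higher rate are often spread across the groups so that the earliest layers change least. The sketch below uses the two rates mentioned above, 0.001 and 0.02, with geometric spacing over three groups. Both the spacing rule and the number of groups are assumptions for illustration, not documented behaviour of arcgis.learn.

```python
def spread_learning_rates(low, high, n_groups):
    """Geometrically spaced rates from `low` (earliest layers) to `high` (head)."""
    if n_groups == 1:
        return [high]
    step = (high / low) ** (1 / (n_groups - 1))
    return [low * step ** i for i in range(n_groups)]


rates = spread_learning_rates(0.001, 0.02, n_groups=3)
for name, rate in zip(["early backbone", "late backbone", "head"], rates):
    print(f"{name}: {rate:.5f}")
```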

Training the network is an iterative process. We can keep training the model using its fit() method as long as the validation loss (or error rate) continues to go down with each training pass, also known as an epoch. This is indicative of the model learning the task.

Optionally, if we pass early_stopping=True to the fit() method, training stops if the validation loss doesn't decrease for 5 consecutive epochs. Moreover, the checkpoint=True parameter saves the best model based on validation loss during training.
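The logic of these two options can be sketched without any deep learning library. The function below walks through a list of made-up validation losses, remembers the best epoch (the point at which a checkpoint would be saved) and stops after five epochs without improvement. It is an illustration of the idea only, with a hypothetical function name, and not the implementation used by fit().

```python
def run_with_early_stopping(valid_losses, patience=5):
    """Return (last epoch run, best epoch, best loss) for a sequence of losses."""
    best_loss = float("inf")
    best_epoch = None
    stale = 0  # epochs since the last improvement
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, stale = loss, epoch, 0  # a checkpoint would be saved here
        else:
            stale += 1
            if stale >= patience:
                return epoch, best_epoch, best_loss  # stop early
    return len(valid_losses), best_epoch, best_loss


# Hypothetical validation losses: improvement stalls after epoch 3.
losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.63, 0.64, 0.65, 0.50]
print(run_with_early_stopping(losses))  # (8, 3, 0.6)
```

Training halts at epoch 8, so the better loss at epoch 9 is never reached; a larger patience trades extra training time for the chance to get past such plateaus.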

Note: You may also choose not to pass the lr parameter. If lr is not set, the method automatically calls lr_find() to find an optimum learning rate.

In [10]:
# here we are training the model for 10 epochs
ssd.fit(10, lr=0.001)
Total time: 57:44

epoch train_loss valid_loss
1 1743.360840 759.151855
2 1700.622559 763.675842
3 1691.474487 733.454163
4 1705.710205 736.463928
5 1715.943115 731.263000
6 1718.531738 734.463257
7 1705.809692 736.284851
8 1706.338623 738.023926
9 1707.944458 724.500916
10 1701.971069 727.411438

As each epoch progresses, the loss (the error that we are trying to minimize) for the training data and the validation set is reported. In the table above, both losses end lower than they started, with some fluctuation between epochs, indicating that the model is learning to recognize the well pads. We continue training the model for several iterations until we observe the validation loss going up. That indicates that the model is starting to overfit to the training data and is not generalizing well enough for the validation data. When that happens, we can either add more data (or data augmentations), increase regularization by increasing the dropout parameter in the SingleShotDetector model, or reduce the model complexity.
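The trend in the table can be checked with a few lines of arithmetic. The values below are copied from the training output above; the variable names are only for this example.

```python
train_loss = [1743.360840, 1700.622559, 1691.474487, 1705.710205, 1715.943115,
              1718.531738, 1705.809692, 1706.338623, 1707.944458, 1701.971069]
valid_loss = [759.151855, 763.675842, 733.454163, 736.463928, 731.263000,
              734.463257, 736.284851, 738.023926, 724.500916, 727.411438]

# Epoch (1-based) with the lowest validation loss.
best_epoch = min(range(len(valid_loss)), key=valid_loss.__getitem__) + 1

# Relative reduction from the first to the last epoch.
train_drop = (train_loss[0] - train_loss[-1]) / train_loss[0]
valid_drop = (valid_loss[0] - valid_loss[-1]) / valid_loss[0]

print(best_epoch)                                 # 9
print(f"{train_drop:.1%}", f"{valid_drop:.1%}")   # 2.4% 4.2%
```

The lowest validation loss occurs at epoch 9, and the overall reduction over ten epochs is modest, which suggests there may still be room to train further before overfitting sets in.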

Unfreezing the backbone and fine-tuning

By default, the earlier layers of the model (i.e. the backbone or encoder) are frozen and their weights are not updated when the model is being trained. This allows the model to take advantage of the (ImageNet) pretrained weights for training the 'head' of the network.

Once the later layers have been sufficiently trained, the earlier layers are unfrozen (by calling unfreeze()) and fine-tuned to the nuances of the particular satellite imagery. Fine-tuning on satellite imagery, rather than relying only on the photos of everyday objects (from ImageNet) that the backbone was initially trained on, helps to improve model performance and accuracy.

The learning rate finder can be used to identify the optimum learning rate between the different training phases of the model. Please note that this step is optional. If we don't call unfreeze(), the lower learning rate we specified in the fit() won't be used.
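What freezing and unfreezing mean for a single update step can be shown with two small layer groups. This is a plain-Python sketch with made-up weights and gradients; the dictionary layout and the sgd_step helper are hypothetical and only mimic the behaviour described above.

```python
def sgd_step(groups, lr_by_group):
    """Apply one gradient step to every trainable group; skip frozen ones."""
    for group in groups:
        if not group["trainable"]:
            continue
        rate = lr_by_group[group["name"]]
        group["weights"] = [w - rate * g
                            for w, g in zip(group["weights"], group["grads"])]


backbone = {"name": "backbone", "weights": [1.0, 1.0], "grads": [0.5, -0.5], "trainable": False}
head = {"name": "head", "weights": [1.0, 1.0], "grads": [0.5, -0.5], "trainable": True}
rates = {"backbone": 0.001, "head": 0.02}  # lower rate for the pretrained layers

sgd_step([backbone, head], rates)   # backbone frozen: only the head moves
print(backbone["weights"], head["weights"])

backbone["trainable"] = True        # the analogue of calling unfreeze()
sgd_step([backbone, head], rates)   # now both groups move, the backbone only slightly
print(backbone["weights"], head["weights"])
```

While the backbone is frozen its learning rate is never used, which mirrors the remark that the lower rate has no effect unless unfreeze() is called.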

Visualize results

The results of how well the model has learnt can be visually observed using the model's show_results() method. The ground truth is shown in the left column and the corresponding predictions from the model on the right. As we can see below, the model has learnt to detect well pads fairly well. In some cases, it is even able to detect the well pads that are missing in the ground truth data (due to inaccuracies in labelling or the records).

In [11]:
ssd.show_results(rows=25, thresh=0.05)
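Assuming thresh acts as a minimum confidence for a detection to be shown, the comparison between predictions and ground truth can be sketched as a confidence filter followed by an intersection-over-union (IoU) check. The boxes and scores below are invented, and the helpers are hypothetical rather than part of arcgis.learn.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def keep_confident(predictions, thresh):
    """Drop predictions whose confidence score is below the threshold."""
    return [p for p in predictions if p[1] >= thresh]


# Hypothetical (box, confidence) predictions and one labelled well pad.
predictions = [((10, 10, 30, 30), 0.90),
               ((100, 100, 120, 120), 0.04),
               ((50, 50, 70, 70), 0.30)]
ground_truth = [(12, 12, 32, 32)]

kept = keep_confident(predictions, thresh=0.05)
for box, score in kept:
    overlap = max(iou(box, gt) for gt in ground_truth)
    print(box, score, round(overlap, 3))
```

The third prediction survives the threshold but overlaps no labelled box; on inspection such a detection may be a false positive or, as noted above, a real well pad missing from the ground truth.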