
ArcGIS API for Python

Object Detection Workflow with arcgis.learn

Deep learning models 'learn' by looking at several examples of imagery and the expected outputs. In the case of object detection, this requires imagery as well as known or labelled locations of objects that the model can learn from. With the ArcGIS platform, these datasets are represented as layers, and are available in our GIS. In the workflow below, we will be training a model to identify well pads from Planet imagery.


  • Please refer to the prerequisites section in our guide for more information. This sample demonstrates how to export training data and perform model inference using ArcGIS Image Server. Alternatively, both steps can be done using ArcGIS Pro.
  • If you have already exported training samples using ArcGIS Pro, you can jump straight to the training section. The saved model can also be imported into ArcGIS Pro directly.

The code below connects to our GIS and accesses the known well pad locations and the imagery, in this case provided by Planet:

In [1]:
from arcgis.gis import GIS
from arcgis.raster.functions import extract_band 
from arcgis.learn import export_training_data

gis = GIS("home")
In [2]:
# layers we need - The input to generate training samples and the imagery
well_pads = gis.content.get('ae6f1c62027c42b8a88c4cf5deb86bbf') # Well pads layer
Well Pads Permian Basin
Well Points in Hobbs County (Feature Layer Collection by portaladmin)
Last Modified: March 19, 2019
In [3]:
# Weekly mosaics provided by Planet
planet_mosaic_item ="PlanetGlobalMosaics")[0]  # Planet imagery layer
PlanetGlobalMosaics (Imagery Layer by portaladmin)
Last Modified: March 10, 2019

Export Training Samples

The export_training_data() method generates training samples for training deep learning models, given the input imagery along with labeled vector data or classified images. Deep learning training samples are small sub-images, called image chips, that contain the feature or class of interest. This tool creates folders containing image chips for training the model, along with label and metadata files, and stores them in the raster store of your enterprise GIS. The image chips are often small (e.g. 256 x 256 pixels), unless the training sample size is large. These training samples support model training workflows using the arcgis.learn package as well as third-party deep learning libraries, such as TensorFlow or PyTorch. The object detection models in arcgis.learn accept the PASCAL_VOC_rectangles format, a standardized image dataset format for object class recognition. The label files are XML files containing the image name, class value, and bounding boxes.
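For illustration, a PASCAL VOC label file of the kind described above can be read with Python's standard library. The sample XML below is a hypothetical chip label written for this sketch, not actual output of export_training_data:

```python
# Sketch: parsing a PASCAL VOC-style label file. The SAMPLE_VOC_XML string
# below is a hypothetical example; real files may carry additional fields.
import xml.etree.ElementTree as ET

SAMPLE_VOC_XML = """<annotation>
    <filename>000000001.png</filename>
    <object>
        <name>1</name>
        <bndbox>
            <xmin>120</xmin><ymin>85</ymin>
            <xmax>210</xmax><ymax>170</ymax>
        </bndbox>
    </object>
</annotation>"""

def parse_voc_boxes(xml_text):
    """Return a list of (class_name, xmin, ymin, xmax, ymax) tuples."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        bb = obj.find("bndbox")
        boxes.append((name,
                      int(bb.findtext("xmin")), int(bb.findtext("ymin")),
                      int(bb.findtext("xmax")), int(bb.findtext("ymax"))))
    return boxes

boxes = parse_voc_boxes(SAMPLE_VOC_XML)  # [('1', 120, 85, 210, 170)]
```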

In order to take advantage of pretrained models that have been trained on large image collections (e.g. ImageNet), we have to pick 3 bands from multispectral imagery, as those pretrained models are trained with images that have only 3 RGB channels. The extract_band() method can be used to specify which 3 bands should be extracted for fine-tuning the models:

In [4]:
planet_mosaic_data = extract_band(planet_mosaic_item.layers[0], [1,2,3])

We recommend exporting image chips with a larger size than that used for training the models. This allows arcgis.learn to perform random center cropping as part of its default data augmentation, so the model sees a different sub-area of each chip during training, leading to better generalization and less overfitting to the training data. By default, a chip size of 448 x 448 pixels works well, but this can be adjusted based on the amount of context you wish to provide to the model, as well as the amount of GPU memory available.

In [8]:
chips = export_training_data(planet_mosaic_data, well_pads, "PNG", {"x":448,"y":448}, {"x":224,"y":224}, 
                             "PASCAL_VOC_rectangles", 75, "planetdemo")
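To make the crop-within-chip idea concrete, here is a minimal sketch (not the arcgis.learn implementation) of picking a 224 x 224 training window inside a 448 x 448 exported chip:

```python
# Sketch: choosing a random sub-window of an exported chip. The sizes match
# the 448 px export / 224 px training sizes used in this workflow; the
# uniform-random placement is a simplification of the real augmentation.
import random

def random_crop_window(chip_size=448, train_size=224, rng=random):
    """Pick (x0, y0, x1, y1) for a train_size crop inside a chip_size chip."""
    max_offset = chip_size - train_size
    x0 = rng.randrange(max_offset + 1)
    y0 = rng.randrange(max_offset + 1)
    return (x0, y0, x0 + train_size, y0 + train_size)

window = random_crop_window()  # a different sub-area each call
```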

Data Preparation

Data preparation can be a time-consuming process that typically involves splitting the data into training and validation sets, applying various data augmentation techniques, creating the necessary data structures for loading data into the model, managing memory by using appropriately sized mini-batches of data, and so on. The prepare_data() method can directly read the training samples exported by ArcGIS and automate the entire process.
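As an illustration of the train/validation split that prepare_data() automates, a minimal sketch might look like the following (the 20% validation fraction and fixed seed are arbitrary choices for this example, not arcgis.learn defaults):

```python
# Sketch: a shuffled train/validation split, one of the steps prepare_data()
# performs automatically.
import random

def train_val_split(samples, val_pct=0.2, seed=42):
    """Shuffle a list of samples and split it into (train, validation)."""
    rng = random.Random(seed)       # fixed seed for a reproducible split
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_pct)
    return shuffled[n_val:], shuffled[:n_val]

train, val = train_val_split(list(range(100)))  # 80 train, 20 validation
```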

By default, prepare_data() uses a default set of transforms for data augmentation that work well for satellite imagery. These transforms randomly rotate, scale and flip the images so the model sees a different image each time. Alternatively, users can compose their own transforms using transforms for the specific data augmentations they wish to perform.

In [ ]:
from arcgis.learn import prepare_data

data = prepare_data('/arcgis/directories/rasterstore/planetdemo', {1: 'Pad'})

The show_batch() method can be used to visualize the exported training samples, along with labels, after data augmentation transformations have been applied.

In [ ]:
data.show_batch()

Model Training

arcgis.learn includes support for training deep learning models for object detection.

The models in arcgis.learn are based upon pretrained Convolutional Neural Networks (CNNs, or in short, convnets) that have been trained on millions of common images such as those in the ImageNet dataset. The intuition of a CNN is that it uses a hierarchy of layers, with the earlier layers learning to identify simple features like edges and blobs, middle layers combining these primitive features to identify corners and object parts, and the later layers combining the inputs from these in unique ways to grasp what the whole image is about. The final layer in a typical convnet is a fully connected layer that looks at all the extracted features and essentially computes a weighted sum of these to determine the probability of each object class (whether it's an image of a cat or a dog, etc.).
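The step that turns those weighted sums into class probabilities is typically a softmax; a self-contained sketch:

```python
# Sketch: softmax turns raw class scores (the weighted sums from the final
# layer) into probabilities that sum to 1.
import math

def softmax(scores):
    """Convert raw class scores into a probability distribution."""
    m = max(scores)                        # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # highest score -> highest probability
```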

A convnet trained on a huge corpus of images such as ImageNet is thus considered a ready-to-use feature extractor. In practice, we could replace the last layer of these convnets with something else that uses those features for other useful tasks (e.g. object detection and pixel classification), an approach known as transfer learning. The advantage of transfer learning is that we don't need as much data to train an excellent model.

The arcgis.learn module is built on PyTorch and and enables fine-tuning of pretrained torchvision models on satellite imagery. The arcgis.learn models leverage's learning rate finder and one-cycle learning, which allow for much faster training and remove guesswork in picking hyperparameters.

arcgis.learn provides the SingleShotDetector (SSD) model for object detection tasks, which is based on a pretrained convnet, like ResNet that acts as the 'backbone'. More details about SSD can be found here.

Train SingleShotDetector Model

Since the image chips visualized in the section above indicate that most well pads are roughly of the same size and square in shape, we can keep an aspect ratio of 1:1 and zoom scale of 1. This will help simplify the model and make it easier to train. Also, since the size of well pads in the image chips is such that approximately nine could fit side by side, we can keep a grid size of 9.

In [21]:
from arcgis.learn import SingleShotDetector

ssd = SingleShotDetector(data, grids=[9], zooms=[1.0], ratios=[[1.0, 1.0]])
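To make the grids, zooms, and ratios parameters concrete, the toy function below (an illustration only, not the SSD internals) generates the anchor boxes implied by a 9 x 9 grid with a single zoom and a 1:1 aspect ratio:

```python
# Sketch: generating anchor boxes for an SSD-style 9x9 grid. Coordinates are
# normalized to [0, 1]; one anchor per zoom/ratio pair per grid cell.
def anchor_boxes(grid=9, zooms=(1.0,), ratios=((1.0, 1.0),)):
    """Return (cx, cy, w, h) anchor boxes for a grid x grid layout."""
    cell = 1.0 / grid
    anchors = []
    for row in range(grid):
        for col in range(grid):
            cx, cy = (col + 0.5) * cell, (row + 0.5) * cell
            for z in zooms:
                for rw, rh in ratios:
                    anchors.append((cx, cy, cell * z * rw, cell * z * rh))
    return anchors

anchors = anchor_boxes()  # 81 anchors: one per cell of the 9x9 grid
```

With one zoom and one ratio, each of the 81 cells contributes a single square anchor roughly the size of a well pad, which is why this configuration keeps the model small and easy to train.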

Find a good learning rate

Now that we have defined the model architecture, we can start training it. This process involves setting a good learning rate. Picking a very small learning rate leads to very slow training of the model, while picking one that is too high can prevent the model from converging and 'overshoot' the minima where the loss (or error rate) is lowest. arcgis.learn includes's learning rate finder, accessible through the model's lr_find() method, which helps in picking a good learning rate without needing to experiment with several learning rates and choosing among them.

In [22]:
# plot loss against a range of learning rates to find a good one
ssd.lr_find()

In the chart above we find that the loss is going down steeply at 2e-02 (0.02) and we pick that as the max learning rate.
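The mechanics behind lr_find() can be illustrated with a toy sketch on a quadratic loss: the learning rate grows exponentially each step while the loss is recorded, and a good maximum rate sits where the loss is still falling steeply. This is an illustration of the idea only, not the fastai implementation:

```python
# Sketch: a learning-rate range test on a toy quadratic loss (minimum at w=0).
def lr_find_sketch(loss_fn, grad_fn, w0=5.0, lr_min=1e-4, lr_max=1.0, steps=50):
    """Increase the learning rate exponentially and record (lr, loss) pairs."""
    factor = (lr_max / lr_min) ** (1.0 / (steps - 1))
    w, lr, history = w0, lr_min, []
    for _ in range(steps):
        history.append((lr, loss_fn(w)))
        w -= lr * grad_fn(w)     # one SGD step at the current rate
        lr *= factor             # exponential learning-rate schedule
    return history

history = lr_find_sketch(lambda w: w * w, lambda w: 2 * w)
```

Plotting loss against learning rate from such a sweep produces a curve like the chart above: flat at tiny rates, steeply falling in the useful range, then diverging once the rate is too high.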

Train the model

As discussed earlier, the idea of transfer learning is to fine-tune the earlier layers of the pretrained model and focus on training the newly added layers, meaning we need two different learning rates to better fit the model. We have already selected a good learning rate to train the later layers above (i.e. 0.02). An empirical value for the lower learning rate used to fine-tune the earlier layers is usually one tenth of the higher rate. We choose 0.001 to be more careful not to disturb the weights of the pretrained backbone too much. It can be adjusted depending upon how different the imagery is from the natural images on which the backbone network was trained.
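Passing slice(0.001, 0.02) to fit() spreads learning rates across the layer groups, lowest for the pretrained backbone and highest for the new head. A toy sketch of such a geometric spread (the three-group split is an assumption for illustration; the real grouping depends on the model):

```python
# Sketch: spreading learning rates geometrically across layer groups, as
# discriminative fine-tuning does with slice(lr_low, lr_high).
import math

def discriminative_lrs(lr_low, lr_high, n_groups=3):
    """Return n_groups rates from lr_low (earliest layers) to lr_high (head)."""
    if n_groups == 1:
        return [lr_high]
    step = (math.log(lr_high) - math.log(lr_low)) / (n_groups - 1)
    return [math.exp(math.log(lr_low) + i * step) for i in range(n_groups)]

lrs = discriminative_lrs(0.001, 0.02)  # backbone gets 0.001, head gets 0.02
```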

Training the network is an iterative process. We can train the model using its fit() method for as long as the validation loss (or error rate) continues to go down with each training pass, also known as an epoch. This is indicative of the model learning the task.

In [31]:
# train the model for 10 epochs, passing the lower and higher learning rates, slice(0.001, 0.02))
Total time: 57:44

epoch train_loss valid_loss
1 1743.360840 759.151855
2 1700.622559 763.675842
3 1691.474487 733.454163
4 1705.710205 736.463928
5 1715.943115 731.263000
6 1718.531738 734.463257
7 1705.809692 736.284851
8 1706.338623 738.023926
9 1707.944458 724.500916
10 1701.971069 727.411438

As each epoch progresses, the loss (error rate, that we are trying to minimize) for the training data and the validation set are reported. In the table above we can see the losses going down for both the training and validation datasets, indicating that the model is learning to recognize the well pads. We continue training the model for several iterations like this till we observe the validation loss going up. That indicates that the model is starting to overfit to the training data, and is not generalizing well enough for the validation data. When that happens, we can try adding more data (or data augmentations), increase regularization by increasing the dropout parameter in the SingleShotDetector model, or reduce the model complexity.
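A simple way to operationalize "stop when the validation loss starts going up" is a patience-based check. The sketch below uses abbreviated validation losses from the table above (rounded to one decimal); the patience value of two epochs is an arbitrary choice for illustration:

```python
# Sketch: deciding when to stop training based on validation loss. Stop once
# the loss has failed to improve for `patience` consecutive epochs.
def epochs_before_overfit(valid_losses, patience=2):
    """Return the epoch (1-based) with the best validation loss so far,
    once `patience` epochs pass without improvement; otherwise train fully."""
    best, best_epoch, since_best = float("inf"), 0, 0
    for epoch, loss in enumerate(valid_losses, start=1):
        if loss < best:
            best, best_epoch, since_best = loss, epoch, 0
        else:
            since_best += 1
            if since_best >= patience:
                return best_epoch
    return len(valid_losses)

# Abbreviated validation losses from the first 8 epochs in the table above
keep = epochs_before_overfit([759.2, 763.7, 733.5, 736.5, 731.3, 734.5, 736.3, 738.0])
```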

Unfreezing the backbone and fine-tuning

By default, the earlier layers of the model (i.e. the backbone or encoder) are frozen and their weights are not updated when the model is being trained. This allows the model to take advantage of the (ImageNet) pretrained weights for training the 'head' of the network. Once the later layers have been sufficiently trained, it helps to improve model performance and accuracy to unfreeze() the earlier layers and allow their weights to be fine-tuned to the nuances of the particular satellite imagery, compared to the photos of everyday objects (from ImageNet) that the backbone was trained on. The learning rate finder can be used to identify the optimum learning rate between the different training phases of the model. Please note that this step is optional. If we don't call unfreeze(), the lower learning rate we specified in fit() won't be used.

Visualize results

The results of how well the model has learnt can be visually observed using the model’s show_results() method. The ground truth is shown in the left column and the corresponding predictions from the model on the right. As we can see below, the model has learnt to detect well pads fairly well. In some cases, it is even able to detect the well pads that are missing in the ground truth data (due to inaccuracies in labelling or the records).

In [ ]:
ssd.show_results(rows=25, thresh=0.05)