Image Captioning Using Deep Learning

  • 🔬 Data Science
  • 🥠 Deep Learning and Image Captioning

Introduction and objective

An image caption is a concise textual summary that describes the content of an image. It has applications in numerous fields, such as scene classification, virtual assistants, image indexing, social media, and assistive tools for visually impaired persons. Deep learning has achieved strong, in some cases human-level, performance on tasks ranging from object detection in computer vision to natural language processing. ImageCaptioner, which combines image and text modeling, is a deep learning model that generates captions for remote sensing image data.

This sample shows how the ArcGIS API for Python can be used to train an ImageCaptioner model on the Remote Sensing Image Captioning Dataset (RSICD) [1], a publicly available dataset for the remote sensing image captioning task. RSICD contains more than ten thousand remote sensing images collected from Google Earth, Baidu Map, MapABC, and Tianditu. The images are fixed to 224x224 pixels and have various resolutions. The total number of remote sensing images is 10,921, with five sentence descriptions per image. The screenshot below shows an example of this data:

The trained model can be deployed on ArcGIS Pro or ArcGIS Enterprise to generate captions on high-resolution satellite imagery.

Necessary imports

from pathlib import Path
import os, json

from arcgis.learn import prepare_data, ImageCaptioner
from arcgis.gis import GIS
gis = GIS('home')

Prepare data that will be used for training

We need to put the RSICD dataset in a specific format: a root folder containing a folder named "images" and a JSON file named "annotations.json" that holds the annotations. The specific format of the JSON file can be seen here.

Folder structure for RSICD dataset. A root folder containing "images" folder and "annotations.json" file.
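Before calling prepare_data, it can help to confirm that the root folder matches the layout described above. The following is a minimal sketch using only the standard library. The helper name check_dataset_layout is hypothetical, and it only checks that the "images" folder exists and that "annotations.json" parses as JSON; it does not validate the annotation schema, which is not described in this sample.

```python
import json
from pathlib import Path


def check_dataset_layout(root):
    """Return a list of problems found in the dataset root (empty if none).

    Hypothetical helper: checks only the folder layout described in the text,
    not the internal structure of the annotations.
    """
    root = Path(root)
    problems = []

    images_dir = root / "images"
    annotations_file = root / "annotations.json"

    if not images_dir.is_dir():
        problems.append("missing 'images' folder")

    if not annotations_file.is_file():
        problems.append("missing 'annotations.json' file")
    else:
        try:
            with open(annotations_file, encoding="utf-8") as f:
                json.load(f)  # only verifies that the file is valid JSON
        except json.JSONDecodeError as err:
            problems.append(f"'annotations.json' is not valid JSON: {err}")

    return problems
```

An empty list means the folder has the expected top-level structure; any returned strings describe what is missing.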

Model training

Let's set a path to the folder that contains training images and their corresponding labels.

training_data = gis.content.get('8c4fc46930a044a9b20bb974d667e074')
Image Collection by api_data_owner
Last Modified: May 12, 2022
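# Assumption: the original call was lost in extraction; Item.download() returns the local file path.
filepath = training_data.download()
import zipfile
with zipfile.ZipFile(filepath, 'r') as zip_ref:
    # Assumption: extract next to the archive so that data_path below points at the extracted folder.
    zip_ref.extractall(Path(filepath).parent)
data_path = Path(os.path.splitext(filepath)[0])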

We'll use the prepare_data function to create a databunch with the necessary parameters, such as batch_size and chip_size. A complete list of parameters can be found in the API reference.

data = prepare_data(data_path, 

Visualize training data

To visualize and get a sense of the training data, we can use the data.show_batch method.

<Figure size 720x720 with 4 Axes>

Load model architecture

arcgis.learn provides image captioning models that are based on pretrained convnets, such as ResNet, which act as backbones. We will use ImageCaptioner with the backbone parameter set to ResNet-50 to create our image captioning model. For more details on ImageCaptioner, check out "How image_captioning works?" and the API reference.

ic = ImageCaptioner(data, backbone='resnet50')

We will use the lr_find() method to find an optimum learning rate. It is important to set a learning rate at which we can train a model with good accuracy and speed.

lr = ic.lr_find()
<Figure size 432x288 with 1 Axes>

Train the model

We will now train the ImageCaptioner model using the suggested learning rate from the previous step. We can specify how many epochs we want to train for. Let's train the model for 100 epochs.

ic.fit(100, lr, early_stopping=True)
33.00% [33/100 29:17:58<59:29:12]

100.00% [273/273 04:59<00:00]
Epoch 33: early stopping
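The log above shows training ending at epoch 33 of 100 because early stopping was enabled. As a rough illustration of the general idea, the sketch below stops once the validation loss has not improved for a given number of consecutive epochs. The function name, the patience value, and the use of validation loss as the criterion are assumptions made for illustration; this sample does not state the exact rule that arcgis.learn applies.

```python
def epoch_of_early_stop(val_losses, patience=5):
    """Return the 1-based epoch at which training would stop.

    Illustrative only: stops after `patience` consecutive epochs without a
    new best validation loss. If that never happens, all epochs are run.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch

    return len(val_losses)
```

For example, with losses [1.0, 0.9, 0.8, 0.85, 0.86, 0.9] and patience=3, the best loss is reached at epoch 3 and the function returns 6.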

Visualize results on validation set

To see sample results we can use the show_results method. This method displays the chips from the validation dataset with ground truth (left) and predictions (right). This visual analysis helps in assessing the qualitative results of the trained model.

<Figure size 1440x1440 with 8 Axes>

Evaluate model performance

To see the quantitative results of our model, we will use the bleu_score method. The Bilingual Evaluation Understudy (BLEU) score is a popular metric that measures the overlap of word sequences (n-grams) between the predicted caption and the ground truth captions. It compares n-grams of lengths 1 through 4 to do this. A perfect match results in a score of 1.0, whereas a complete mismatch results in a score of 0.0. The score summarizes how close the generated text is to the expected text.

{'bleu-1': 0.5853038148042357,
 'bleu-2': 0.3385487762905085,
 'bleu-3': 0.2464713554187269,
 'bleu-4': 0.1893991004368455,
 'BLEU': 0.2583728068745172}
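To make the metric more concrete, here is a minimal pure-Python sketch of a sentence-level BLEU computation: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It follows the textbook definition and is not necessarily how the bleu_score method computes the values above (for instance, it applies no smoothing and scores a single caption rather than a whole validation set). All function names are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against several references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the highest count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate, references):
    """Penalize candidates shorter than the closest reference length."""
    c = len(candidate)
    if c == 0:
        return 0.0
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate, references, max_n=4):
    """Unsmoothed BLEU: zero if any n-gram order has no match."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, references) * math.exp(log_mean)
```

Captions are passed as token lists, for example bleu("some cars are parked in a parking lot .".split(), [reference_tokens]), where reference_tokens is a tokenized ground truth caption. Because this version is unsmoothed, short captions with no matching 4-gram score 0.0, which is one reason library implementations often differ in detail.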

Save the model

Let's save the model by giving it a name and calling the save method, so that we can load it later whenever required. The model is saved by default in a directory called models in the data_path initialized earlier, but a custom path can be provided.

ic.save('image-captioner-33epochs')
Computing model metrics...

Prediction on test image

We can perform inferencing on a small test image using the predict function.

'some cars are parked in a parking lot .'
<Figure size 432x288 with 1 Axes>

Now that we are satisfied with the model performance on a test image, we are ready to perform model inferencing on our desired images. In our case, we are interested in inferencing on high-resolution satellite imagery.

Model inference

Before using the model for inference, we need to make some changes in the <model_name>.emd file. You can learn more about this file here.

By default, CropSizeFixed is set to 1. We want to change CropSizeFixed to 0 so that the size of the tile cropped around each feature is not fixed. The code below edits the .emd file to set CropSizeFixed to 0.

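with open(
    os.path.join(data_path, "models", "image-captioner-33epochs", "image-captioner-33epochs" + ".emd"), "r+"
) as emd_file:
    emd = json.load(emd_file)  # separate name so the databunch `data` is not overwritten
    emd["CropSizeFixed"] = 0
    emd_file.seek(0)  # rewind; otherwise the dump is appended after the original JSON
    json.dump(emd, emd_file, indent=4)
    emd_file.truncate()  # drop leftover bytes if the new content is shorter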

In order to perform inferencing in ArcGIS Pro, we need to create a feature class on the map using the Create Feature Class or Create Fishnet tool.

The feature class and the trained model have been provided for reference. You can download these files directly to perform model inferencing on a desired area.

import arcpy

with arcpy.EnvManager(extent="-13049125.3076102 4033595.5228646 -13036389.0790898 4042562.3896354", cellSize=1, processorType="GPU"):
    arcpy.ia.ClassifyObjectsUsingDeepLearning("Inferencing_Image", r"C:\Users\Admin\Documents\ImgCap\captioner.gdb\Classified_ImageCaptions", r"D:\image-captioner-33epochs\image-captioner-33epochs.emd", "California_Features", '', "PROCESS_AS_MOSAICKED_IMAGE", "batch_size 1;beam_width 5;max_length 20", "Caption")


We selected an area unseen (by the model) and generated some features using the Create Feature Class tool. We then used our model to generate captions. Below are the results that we have achieved.


In this notebook, we demonstrated how to use the ImageCaptioner model from the ArcGIS API for Python to generate image captions using RSICD as training data.

