Image Captioning Using Deep Learning

  • 🔬 Data Science
  • 🥠 Deep Learning and Image Captioning

Introduction and objective

An image caption is a concise textual summary that describes the content of an image. Image captioning has applications in numerous fields such as scene classification, virtual assistants, image indexing, social media, and assistive technology for visually impaired persons. Deep learning has been achieving superhuman performance on tasks ranging from object detection in computer vision to machine translation in natural language processing. ImageCaptioner, which combines both image and text modeling, is a deep learning model that generates captions for remote sensing imagery.

This sample shows how the ArcGIS API for Python can be used to train an ImageCaptioner model using the Remote Sensing Image Captioning Dataset (RSICD) [1], a publicly available dataset for the remote sensing image captioning task. RSICD contains 10,921 remote sensing images collected from Google Earth, Baidu Map, MapABC, and Tianditu. The images are fixed to 224 x 224 pixels at various resolutions, and each image is paired with five sentence descriptions. The screenshot below shows an example of this data:

The trained model can be deployed to ArcGIS Pro or ArcGIS Enterprise to generate captions on high resolution satellite imagery.

Necessary imports

Input
from pathlib import Path
import os, json

from arcgis.learn import prepare_data, ImageCaptioner
from arcgis.gis import GIS
Input
gis = GIS('home')

Prepare data that will be used for training

We need to put the RSICD dataset in a specific format: a root folder containing a folder named "images" and a JSON file named "annotations.json" that contains the annotations. The specific format of the JSON can be seen here.


Folder structure for RSICD dataset. A root folder containing "images" folder and "annotations.json" file.
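If you are assembling this layout yourself, a quick sanity check along the following lines can confirm the structure before training. This is a minimal, illustrative sketch: the dataset path is a placeholder, and the exact schema of annotations.json is described in the documentation linked above.

import json
from pathlib import Path

rsicd_root = Path(r'path/to/RSICD')            # placeholder path to the dataset root
print([p.name for p in rsicd_root.iterdir()])  # expect an 'images' folder and 'annotations.json'

with open(rsicd_root / 'annotations.json') as f:
    annotations = json.load(f)
print(type(annotations))                       # inspect the top-level structure of the annotations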

Model training

Let's set a path to the folder that contains training images and their corresponding labels.

Input
training_data = gis.content.get('8c4fc46930a044a9b20bb974d667e074')
training_data
Output
ImageCaptioning
Image Collection by api_data_owner
Last Modified: May 12, 2022
0 comments, 0 views
Input
filepath = training_data.download(file_name=training_data.name)
Input
import zipfile
with zipfile.ZipFile(filepath, 'r') as zip_ref:
    zip_ref.extractall(Path(filepath).parent)
Input
data_path = Path(os.path.splitext(filepath)[0])

We'll use the prepare_data function to create a databunch with the necessary parameters such as batch_size, and chip_size. A complete list of parameters can be found in the API reference.

Input
data = prepare_data(data_path, 
                    chip_size=224,
                    batch_size=4,
                    dataset_type='ImageCaptioning')

Visualize training data

To visualize and get a sense of the training data, we can use the data.show_batch method.

Input
data.show_batch()

Load model architecture

arcgis.learn provides an image captioning model based on pretrained convnets, such as ResNet, that act as backbones. We will use ImageCaptioner with the backbone parameter set to resnet50 to create our image captioning model. For more details on ImageCaptioner, check out How image_captioning works? and the API reference.

Input
ic = ImageCaptioner(data, backbone='resnet50')

We will use the lr_find() method to find an optimum learning rate. It is important to set a learning rate at which we can train a model with good accuracy and speed.

Input
lr = ic.lr_find()

Train the model

We will now train the ImageCaptioner model using the suggested learning rate from the previous step. We can specify how many epochs we want to train for. Let's train the model for 100 epochs.

Input
ic.fit(100, lr, early_stopping=True)
33.00% [33/100 29:17:58<59:29:12]
epoch train_loss valid_loss accuracy bleu time
0 4.695280 4.683484 0.169359 0.000000 53:17
1 4.149865 4.193302 0.212167 0.000000 53:33
2 3.847940 3.863936 0.302988 0.062242 53:08
3 3.531038 3.593763 0.350844 0.089471 53:07
4 3.237615 3.324673 0.380708 0.107219 53:04
5 2.885094 3.115619 0.413582 0.128764 53:05
6 2.745963 2.863758 0.445932 0.152206 53:10
7 2.540881 2.766686 0.462775 0.177337 53:09
8 2.355527 2.600504 0.489885 0.207568 53:06
9 2.313179 2.487214 0.504936 0.235590 53:04
10 2.054211 2.346865 0.529235 0.254470 53:10
11 2.047108 2.199641 0.551783 0.283868 53:10
12 2.064740 2.172241 0.556525 0.291952 53:06
13 1.986457 2.128242 0.563766 0.296588 53:06
14 1.826541 2.031750 0.579189 0.311093 53:11
15 1.763997 2.046451 0.566121 0.291099 53:05
16 1.817860 1.919416 0.592966 0.326027 53:06
17 1.766880 1.871730 0.592674 0.319502 53:58
18 1.714902 1.899555 0.593588 0.329748 53:33
19 1.617795 1.881732 0.595126 0.329128 53:17
20 1.604728 1.855326 0.600784 0.330772 53:13
21 1.565843 1.844262 0.603579 0.337001 53:16
22 1.554246 1.790282 0.609448 0.341711 53:15
23 1.581763 1.795434 0.613312 0.348124 53:32
24 1.543676 1.773622 0.615620 0.348699 53:47
25 1.504090 1.731651 0.625233 0.360189 53:47
26 1.532415 1.756249 0.615012 0.340073 53:10
27 1.473316 1.719556 0.620736 0.349469 53:05
28 1.420376 1.727326 0.616957 0.344545 53:05
29 1.487180 1.737915 0.622099 0.362117 53:05
30 1.449937 1.751929 0.618553 0.347141 53:08
31 1.485789 1.755898 0.624464 0.363179 53:08
32 1.371849 1.719512 0.621808 0.355478 53:08

100.00% [273/273 04:59<00:00]
Epoch 33: early stopping

Visualize results on validation set

To see sample results we can use the show_results method. This method displays the chips from the validation dataset with ground truth (left) and predictions (right). This visual analysis helps in assessing the qualitative results of the trained model.

Input
ic.show_results()

Evaluate model performance

To see the quantitative results of our model, we will use the bleu_score method. The Bilingual Evaluation Understudy (BLEU) score is a popular metric that measures how many sequential words match between the predicted and the ground truth captions, comparing n-grams of lengths 1 through 4. A perfect match results in a score of 1.0, whereas a complete mismatch results in a score of 0.0. In short, it summarizes how close the generated text is to the expected text.

Input
ic.bleu_score()
Output
{'bleu-1': 0.5853038148042357,
 'bleu-2': 0.3385487762905085,
 'bleu-3': 0.2464713554187269,
 'bleu-4': 0.1893991004368455,
 'BLEU': 0.2583728068745172}
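To build intuition for what the bleu-1 through bleu-4 values above measure, the toy sketch below computes clipped n-gram precision, the core ingredient of BLEU, for a pair of captions. It is a simplified illustration only; bleu_score computes the full metric (including the weighting across n-gram orders) internally.

from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(predicted, reference, n):
    # Each predicted n-gram is credited at most as many times as it appears
    # in the reference ("clipping"), then divided by the number of predicted n-grams.
    pred_counts = Counter(ngrams(predicted, n))
    ref_counts = Counter(ngrams(reference, n))
    matches = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return matches / max(sum(pred_counts.values()), 1)

reference = 'some cars are parked in a parking lot'.split()
predicted = 'many cars are parked in a lot'.split()

print(clipped_precision(predicted, reference, 1))  # unigram precision: 6/7
print(clipped_precision(predicted, reference, 2))  # bigram precision: 4/6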

Save the model

Let's save the model by giving it a name and calling the save method, so that we can load it later whenever required. The model is saved by default in a directory called models in the data_path initialized earlier, but a custom path can be provided.

Input
ic.save('image-captioner-33epochs')
Computing model metrics...
Output
WindowsPath('//ImageCaptioning/models/image-captioner-33epochs')

Prediction on test image

We can perform inferencing on a small test image using the predict function.

Input
ic.predict(r'\image-captioner\test_img.tif')
Output
'some cars are parked in a parking lot .'

Now that we are satisfied with the model's performance on a test image, we are ready to perform inferencing on our desired images. In our case, we are interested in inferencing on high resolution satellite imagery.

Model inference

Before using the model for inference, we need to make some changes to the <model_name>.emd file. You can learn more about this file here.

By default, CropSizeFixed is set to 1. We want to change CropSizeFixed to 0 so that the size of the tile cropped around the features is not fixed. The code below edits the .emd file to set CropSizeFixed to 0.

Input
with open(
    os.path.join(data_path, "models", "image-captioner-33epochs", "image-captioner-33epochs" + ".emd"), "r+"
) as emd_file:
    data = json.load(emd_file)
    data["CropSizeFixed"] = 0
    emd_file.seek(0)
    json.dump(data, emd_file, indent=4)
    emd_file.truncate()

In order to perform inferencing in ArcGIS Pro, we need to create a feature class on the map using the Create Feature Class or Create Fishnet tool, as sketched below.
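As one way to generate such features, the following sketch uses the Create Fishnet geoprocessing tool from arcpy to build a grid of polygon cells over the area of interest; each cell then receives its own caption during inferencing. The output path, corner coordinates, and cell size are placeholders to be adjusted to your data and spatial reference.

import arcpy

# Placeholder values throughout: adjust the output location, the corner
# coordinates (in the map's spatial reference), and the cell size.
arcpy.management.CreateFishnet(
    r'C:\path\to\captioner.gdb\California_Features',  # output feature class (placeholder)
    '-13049125 4033595',                              # origin (lower-left corner)
    '-13049125 4033605',                              # point along the Y axis to orient the grid
    500, 500,                                         # cell width and height in map units
    0, 0,                                             # rows/columns derived from the opposite corner
    '-13036389 4042562',                              # opposite (upper-right) corner
    'NO_LABELS',                                      # do not create label points
    '#',                                              # no template extent
    'POLYGON'                                         # create polygon cells rather than polylines
)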

The Feature Class and the trained model have been provided for reference. You can download these files directly to perform model inferencing on the desired area.

with arcpy.EnvManager(extent="-13049125.3076102 4033595.5228646 -13036389.0790898 4042562.3896354", cellSize=1, processorType="GPU"):
    arcpy.ia.ClassifyObjectsUsingDeepLearning("Inferencing_Image", r"C:\Users\Admin\Documents\ImgCap\captioner.gdb\Classified_ImageCaptions", r"D:\image-captioner-33epochs\image-captioner-33epochs.emd", "California_Features", '', "PROCESS_AS_MOSAICKED_IMAGE", "batch_size 1;beam_width 5;max_length 20", "Caption")

Results

We selected an area unseen by the model and generated some features using the Create Feature Class tool. We then used our model to generate captions for them. Below are the results we achieved.

Conclusion

In this notebook, we demonstrated how to use the ImageCaptioner model from the ArcGIS API for Python to generate image captions using RSICD as training data.

References

[1] Lu, X., Wang, B., Zheng, X., & Li, X. Exploring Models and Data for Remote Sensing Image Caption Generation. IEEE Transactions on Geoscience and Remote Sensing, 2017.