Deep learning has been achieving superhuman level performance in computer vision that can do object classification, object detection, or semantic segmentation of different features. On the other hand, natural language processing models perform well on tasks such as named entity recognition, text classification, etc. This guide explains a new model which is a combination of both image and text. Image captioning model, as the name suggests, generates textual captions of an image. Like all supervised learning scenarios, this model also requires labelled training data in order to train a model. Image captioning technique is mostly done on images taken from handheld camera, however, research continues to explore captioning for remote sensing images. These could help describe the features on the map for accessibility purposes. Figure 1 shows an example of a few images from the RSICD dataset . This dataset contains up to 5 unique captions for ~11k images.