How Mask R-CNN Works?

Instance segmentation

In our guide titled How SSD Works, we learned how SSD detects objects and locates them with bounding boxes. This class of algorithms is called Object Detection.

In another guide titled How U-net Works, we saw how to achieve pixel-level classification, which helps in solving problems like land cover classification. Algorithms that perform tasks like this are categorized as Semantic Segmentation.

Object Instance Segmentation is a more recent approach that gives us the best of both worlds. It combines object detection, where the goal is to predict the class and bounding box of each object in an image, with semantic segmentation, which classifies each pixel into pre-defined categories. Thus, it enables us to detect objects in an image while precisely segmenting a mask for each object instance.

Instance segmentation allows us to solve problems like damage detection, where it's important to know the extent of the damage. Another use case is self-driving cars, where it's important to know the position of each car in the scene. Generating a footprint for each individual building is a popular problem in the field of GIS. arcgis.learn lets us use the Mask R-CNN model to solve such real-life problems.

Let us take building footprint detection as an example.

Figure 1: Segmentation Types

Image (a) has two types of pixels: those that belong to the object (building) and those that belong to the background. It is difficult to count the number of buildings present in the image. In image (b), each building is identified as a distinct entity, which overcomes this limitation of semantic segmentation.
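The counting problem can be illustrated with a minimal sketch. The tiny masks below are made-up data, not taken from Figure 1, and `count_connected_regions` is a hypothetical helper written for this illustration only. It assumes that the natural way to count buildings in a semantic mask is to count connected groups of building pixels, which undercounts when two buildings touch.

```python
from collections import deque


def count_connected_regions(mask):
    """Count 4-connected groups of foreground (1) pixels in a binary mask."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] != 1 or seen[r][c]:
                continue
            # New unvisited foreground pixel: flood-fill its whole region.
            regions += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return regions


# Hypothetical instance mask: 0 = background, 1..3 = individual building ids.
# Buildings 1 and 2 share a wall, so their pixels are adjacent.
instance_mask = [
    [1, 1, 2, 2, 0, 0],
    [1, 1, 2, 2, 0, 3],
    [0, 0, 0, 0, 0, 3],
]

# The semantic view of the same scene only knows "building" vs "background".
semantic_mask = [[1 if v > 0 else 0 for v in row] for row in instance_mask]

buildings_from_semantic = count_connected_regions(semantic_mask)
buildings_from_instance = len({v for row in instance_mask for v in row if v > 0})

print(buildings_from_semantic)  # 2: the two touching buildings merge into one blob
print(buildings_from_instance)  # 3: every building keeps its own id
```

Because the instance mask stores a separate id per building, the count stays correct even when footprints are adjacent.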

Mask R-CNN architecture

Mask R-CNN is a state-of-the-art model for instance segmentation, developed on top of Faster R-CNN. Faster R-CNN is a region-based convolutional neural network [2] that returns a bounding box and a class label with a confidence score for each object.

To understand Mask R-CNN, let's first discuss the architecture of Faster R-CNN, which works in two stages:

Stage1: The first stage consists of two networks: a backbone (ResNet, VGG, Inception, etc.) and a region proposal network. These networks run once per image to give a set of region proposals. Region proposals are regions in the feature map that are likely to contain an object.

Stage2: In the second stage, the network predicts a bounding box and an object class for each of the proposed regions obtained in stage1. The proposed regions can be of different sizes, whereas the fully connected layers in the network always require a fixed-size vector to make predictions. The proposed regions are therefore brought to a fixed size using either RoI pool (which is very similar to MaxPooling) or the RoIAlign method.
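To see how regions of different sizes can end up as fixed-size outputs, here is a simplified sketch of RoI max pooling on a single-channel feature map. The function name `roi_max_pool`, the `(top, left, bottom, right)` box convention, and the integer bin boundaries are assumptions made for this illustration. Real implementations work on multi-channel tensors, and RoIAlign replaces the coordinate rounding with bilinear interpolation.

```python
def roi_max_pool(feature_map, roi, output_size):
    """Max-pool one region of a 2D feature map into an
    output_size x output_size grid.

    roi is (top, left, bottom, right) in feature-map cells, with bottom and
    right exclusive. The region is assumed to be non-empty and inside the map.
    """
    top, left, bottom, right = roi
    height, width = bottom - top, right - left
    pooled = []
    for i in range(output_size):
        # Integer bin edges along y; each bin covers at least one cell.
        y0 = top + (i * height) // output_size
        y1 = max(y0 + 1, top + ((i + 1) * height) // output_size)
        row = []
        for j in range(output_size):
            x0 = left + (j * width) // output_size
            x1 = max(x0 + 1, left + ((j + 1) * width) // output_size)
            # Keep the strongest activation in this bin, as in max pooling.
            row.append(max(feature_map[y][x]
                           for y in range(y0, y1)
                           for x in range(x0, x1)))
        pooled.append(row)
    return pooled


# Toy 6x6 feature map whose value at (row, col) is row * 6 + col.
feature_map = [[r * 6 + c for c in range(6)] for r in range(6)]

# Two hypothetical proposals of different sizes: 4x4 and 5x3 cells.
pooled_a = roi_max_pool(feature_map, (0, 0, 4, 4), output_size=2)
pooled_b = roi_max_pool(feature_map, (1, 2, 6, 5), output_size=2)

print(pooled_a)  # [[7, 9], [19, 21]]
print(pooled_b)  # [[14, 16], [32, 34]]
```

Both proposals come out as 2 x 2 grids, so they can be flattened into vectors of the same length before the fully connected layers.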

Figure 2: Faster R-CNN is a single, unified network for object detection [2]

Faster R-CNN predicts an object class and a bounding box for each object. Mask R-CNN is an extension of Faster R-CNN with an additional branch for predicting a segmentation mask on each Region of Interest (RoI).
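One way to picture the difference is to compare what each model hands back per object. The sketch below is purely illustrative: the class names `BoxDetection` and `InstanceDetection`, their fields, and the sample values are hypothetical and do not reflect the output format of arcgis.learn or of any particular library.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BoxDetection:
    """Per-object output of a Faster R-CNN style detector (hypothetical layout)."""
    label: str
    score: float
    box: Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


@dataclass
class InstanceDetection(BoxDetection):
    """Mask R-CNN style output: the same fields plus a per-instance binary mask."""
    mask: List[List[int]]  # 1 = pixel belongs to this instance; same shape as the box

    def mask_area(self) -> int:
        """Number of pixels that actually belong to the object."""
        return sum(sum(row) for row in self.mask)


def box_area(detection: BoxDetection) -> int:
    """Number of pixels covered by the bounding box."""
    top, left, bottom, right = detection.box
    return (bottom - top) * (right - left)


# Made-up example: an L-shaped building inside a 3 x 4 pixel bounding box.
building = InstanceDetection(
    label="building",
    score=0.97,
    box=(0, 0, 3, 4),
    mask=[
        [1, 1, 1, 1],
        [1, 1, 0, 0],
        [1, 1, 0, 0],
    ],
)

print(box_area(building))      # 12 pixels inside the box
print(building.mask_area())    # 8 pixels belong to the building itself
```

The box alone overstates the footprint, while the mask gives the actual extent of the object, which is what tasks like building footprint extraction or damage assessment need.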