How DeepLabV3 Works

Introduction

Fully convolutional networks (FCNs) are often used for semantic segmentation. One challenge with using FCNs for segmentation is that feature maps become progressively smaller as they pass through the strided convolution and pooling layers of the network. This downsampling discards spatial detail, so the predictions are low resolution and object boundaries are fuzzy.
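To make the shrinking concrete, the sketch below applies the standard output-size formula for a convolution or pooling layer to a hypothetical stack of five 3 x 3, stride-2 layers. The layer configuration and the 224-pixel input are assumptions chosen for illustration, not the layout of any particular backbone.

```python
def layer_output_size(size, kernel, stride, padding=0):
    """Spatial size after one convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical stack: five 3x3 layers, each with stride 2 and padding 1.
size = 224
sizes = [size]
for _ in range(5):
    size = layer_output_size(size, kernel=3, stride=2, padding=1)
    sizes.append(size)

print(sizes)  # [224, 112, 56, 28, 14, 7]
```

Under these assumptions, a 224 x 224 input ends up as a 7 x 7 feature map, so each output location has to summarize a 32 x 32 pixel region of the input.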

The DeepLab family of models addresses this challenge with atrous (dilated) convolutions and Atrous Spatial Pyramid Pooling (ASPP) modules. The architecture has evolved over several generations:

DeepLabV1: Uses atrous convolution to control the resolution at which image features are computed, and a fully connected Conditional Random Field (CRF) to refine object boundaries.

DeepLabV2: Adds Atrous Spatial Pyramid Pooling (ASPP), which applies atrous convolutions at several rates in parallel to capture objects at different scales and improve segmentation accuracy.

DeepLabV3: In addition to atrous convolution, DeepLabV3 uses an improved ASPP module that includes batch normalization and image-level features. It drops the CRF post-processing used in V1 and V2.
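The atrous convolution shared by all three generations can be illustrated in one dimension. The sketch below is a minimal pure-Python version written for this guide, not code from any DeepLab implementation: the function name `atrous_conv1d` is hypothetical, and, as is conventional in deep learning, the operation is a cross-correlation with zero padding and stride 1.

```python
def atrous_conv1d(signal, kernel, rate=1):
    """1-D atrous (dilated) convolution with zero padding and stride 1.

    Kernel taps are spaced `rate` samples apart; rate=1 is an ordinary
    convolution. The output has the same length as the input.
    """
    k = len(kernel)
    span = rate * (k - 1)  # distance between the first and last tap
    pad = span // 2        # zeros added on the left; the rest go on the right
    padded = [0.0] * pad + list(signal) + [0.0] * (span - pad)
    return [
        sum(kernel[j] * padded[i + j * rate] for j in range(k))
        for i in range(len(signal))
    ]


signal = [float(x) for x in range(10)]
kernel = [1.0, 1.0, 1.0]

for rate in (1, 2, 4):
    out = atrous_conv1d(signal, kernel, rate)
    receptive_field = rate * (len(kernel) - 1) + 1
    print(f"rate={rate} receptive_field={receptive_field} output_length={len(out)}")
```

With a 3-tap kernel, rates 1, 2, and 4 cover 3, 5, and 9 input samples respectively, while the output length and the number of weights stay the same. That is the property DeepLab relies on: a wider field of view without further downsampling.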

DeepLabV3 Model Architecture

The DeepLabV3 model has the following architecture:

Features are extracted by a backbone network (for example VGG, DenseNet, or ResNet).
To control the size of the feature map, atrous convolution is used in the last few blocks of the backbone.
On top of the features extracted by the backbone, an ASPP network is added to classify each pixel into its class.
The output from the ASPP network is passed through a 1 x 1 convolution to produce a score for each class at every location, and the result is upsampled to the size of the input image to give the final segmentation mask.
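The steps above can be traced as tensor shapes. The sketch below is a simplified model under stated assumptions: a backbone with five stride-2 stages, 2048 backbone channels (as in ResNet-50/101), 256 ASPP channels, and "same"-style padding so that sizes divide cleanly. The function names `backbone_plan` and `trace_shapes` are hypothetical, and real implementations handle dilation inside each block in more detail.

```python
from math import ceil


def backbone_plan(strides, output_stride):
    """Replace striding with dilation once the target output stride is reached.

    Returns a list of (stride, dilation_rate) pairs, one per stage, and the
    overall stride of the backbone.
    """
    plan = []
    current_stride = 1
    rate = 1
    for s in strides:
        if current_stride >= output_stride:
            rate *= s  # keep the resolution and dilate instead of striding
            plan.append((1, rate))
        else:
            current_stride *= s
            plan.append((s, 1))
    return plan, current_stride


def trace_shapes(height, width, num_classes, output_stride=16,
                 backbone_channels=2048, aspp_channels=256):
    """Return (channels, height, width) after each stage of the pipeline."""
    _, stride = backbone_plan([2, 2, 2, 2, 2], output_stride)
    fh, fw = ceil(height / stride), ceil(width / stride)
    return {
        "backbone": (backbone_channels, fh, fw),
        "aspp": (aspp_channels, fh, fw),
        "classifier_1x1": (num_classes, fh, fw),   # 1x1 conv changes channels only
        "upsampled": (num_classes, height, width),  # resize back to the input size
    }


plan, stride = backbone_plan([2, 2, 2, 2, 2], output_stride=16)
print(plan, stride)

shapes = trace_shapes(512, 512, num_classes=21)
for name, shape in shapes.items():
    print(name, shape)
```

Note that the 1 x 1 convolution only maps channels to class scores; the spatial size is restored by the final upsampling step. Lowering `output_stride` to 8 in this sketch converts one more stage from striding to dilation, which gives a denser feature map at a higher computational cost.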
