In the guide How u-net works, we have learned in detail about semantic segmentation using U-net in the ArcGIS API for Python. There are many other semantic segmentation algorithms like PSPNet, Deeplab, etc. In this guide, we will mainly focus on Pyramid scene parsing network (PSPNet)  which is one of the most well-recognized image segmentation algorithms as it won ImageNet Scene Parsing Challenge 2016 and its paper is highly cited by the computer vision community. PSPNet is another semantic segmentation model along with the Unet that has been implemented into the
arcgis.learn module which can be trained to classify pixels in a raster.
Note: To follow the guide below, we assume that you have some basic understanding of deep learning and the convolutional neural networks (CNNs). For a detailed review of CNNs, please review Stanford University's CS231n: Convolutional Neural Networks for Visual Recognition course (Course Notes). Also, we recommend to read How u-net works before reading this one.
The PSPNet architechture takes into account the global context of the image to predict the local level predictions hence gives better performance on benchmark datasets like PASCAL VOC 2012 and cityscapes. The model was needed because FCN based pixel classifiers were not able to capture the context of the whole image.
Let's understand through an example below:
The boat inside the yellow bounding box in (a) is classified as a "car" by a FCN based classifier (c). It is because of its shape and appearance. But it is uncommon to see a car in river. If the model could gain information from the context, for example in our case water around the object, it will be able to correctly classify it. The PSPNet model (d) is able to capture the context of the whole image to classify the object as a boat (Green).
Most semantic segmentation models contains two parts, i.e an Encoder and a Decoder. The Encoder is responsible for the extracting out features from the image, the decoder is the one which predicts the class of the pixel at the end. A typical Encoder-Decoder for segmentation task looks like the architecture shown below.
The PSPNet encoder contains the CNN backbone with dilated convolutions along with the pyramid pooling module.
In the last layers of the backbone, we replace the traditional convolutional layers with Dilated convolution layers, which helps in increasing the receptive field. This Dilated convolution layers are placed in the last two blocks of the backbone. Hence the feature recieved at the end of the backbone contains richer features. The illustration shows what dilated convolutions do and how is it different from convolutions.
The value of dilation specifies the sparsity while doing the convolution. We can see that the receptive field for dilated convolution is larger as compared to the standard convolution. The size of the receptive field indicates how much context information we use. In PSPNet, the last two blocks of the backbone have dilation values 2 and 4 respectively.
The pyramid pooling module is the main part of this model as it helps the model to capture the global context in the image which helps it to classify the pixels based on the global information present in the image.
The feature map from the backbone is pooled at different sizes and then passed through a convolution layer and after which upsampling takes place on the pooled features to make them the same size as of the original feature map. Finally, the upsampled maps are concatenated with the original feature map to be passed to the decoder. This technique fuses the features different scales hence aggregating the overall context.
For example in the above figure from the paper, the four colours represent different scales which are 6, 3, 2, and 1 for colours green, blue, orange and red respectively. The feature map is pooled at these scales after which they are convolved with 1x1 filters to reduce the feature depth. Then all of these features are then upsampled at the size of the feature map and are concatenated. The 1x1 feature is the most coarse pooled feature map as it captures all the information in just 1x1 spatial location, whereas as the spatial resolution increases, the high resolution features are also taken in account as in the case of 6x6 pyramid size. A rule of thumb is that the if the pyramid size is small i.e ~1 or 2, the model will capture the low resolution and bigger features whereas if the the pyramid size is bigger i.e ~6-8, the model will be able to capture high resolution features.
How to set scales in
arcgis.learn you can pass the parameter
pyramid_sizes to the
PSPNetClassifier model to set the values at which you want the model to pool the features from the feature map. The default value of this parameter is (1, 3, 5, 6) which is used in most implementations. We can set
pyramid_sizes according to our dataset. So, if we have a large variance in the size of features to be segmented, we might want to increase the number of pyramids and the size of pyramids in the PSPNet module, on the other hand if we have similiar sized features, we can get away with only one or two level pyramid.
After the encoder has extracted out features of the image, it is the turn of the decoder to take those features and convert them into predictions by passing them into its layers. The decoder is just another network which takes in features and results into predictions.
8x upsampling decoder¶
The PSPNet model is not a complete segmentation model in itself, it is just an encoder, which means it is just half of what is required for image segmentation. The most common decoders that are found in various implementations of PSPNet is a convolution layer followed by a 8x bilinear-upsampling.
There is a downside of having a 8x upsmapling decoder in the end is that there are no learnable parameters in them hence the results that we get are blobby and it fails to capture high resolution information from the image.
Now, to get the high resolution output from the model, one way is to have a decoder which has learnable parameters and can take in intermidiate features from the encoder as the input.
To achieve that we turned to the feature pyramid network (FPN) decoder, which is what used in the U-Net  as well. So, we added the FPN decoder to the PSPNet encoder, which can is capable of capturing the small features from the image.
By default we create a FPN like decoder while initializing the
PSPNetClassifier object. We can do that by
psp = PSPNetClassifier(data=data). Here data is fastai databunch, object returned by
To create a pspnet classifier with 8x upsampling decoder.
psp = PSPNetClassifier(data=data, use_unet=False)
Then we can continue with the basic
In this guide we learnt the various details of PSPNet model and how we can initialize the model of our choice in
 H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid sceneparsing network.arXiv:1612.01105, 2016
 Sik-Ho Tsang. Review: DilatedNet — Dilated Convolution (Semantic Segmentation). https://towardsdatascience.com/review-dilated-convolution-semantic-segmentation-9d5a5bd768f5. Accessed 10 November 2019.
 Olaf Ronneberger, Philipp Fischer: “U-Net: Convolutional Networks for Biomedical Image Segmentation”, 2015; [http://arxiv.org/abs/1505.04597 arXiv:1505.04597].
 Nan Zhang's. A PyTorch implementation of Image Segmentation Using UNet, Stratification and K-Fold Learning. http://goldsequence.blogspot.com/2018/11/a-pytorch-implementation-of-image.html. Accessed 11 November 2019
 Hengshuang Zhao. PSPNet. https://github.com/hszhao/PSPNet. Accessed 11 November 2019
 Machine Intelligence for humans. Image segmentation with Deep learning. https://www.hackevolve.com/image-segmentation-using-deep-learning/. Accessed 12 November 2019
 kazucmpt. Questions about PSPNet. https://github.com/hszhao/PSPNet/issues/101. Accessed 12 November 2019
 Fisher Yu, Vladlen Koltun: “Dilated Residual Networks”, 2017
Feedback on this topic?