Abstract

In recent years, the development of smart transportation has accelerated research on semantic segmentation, one of the most important problems in this area. A large receptive field has always been a central concern when designing convolutional neural networks for semantic segmentation. Many recent techniques use max-pooling to increase the receptive field of a network at the expense of its spatial resolution. Although this idea has shown improved results in object detection applications, in semantic segmentation a high spatial resolution also needs to be maintained. To address this issue, this paper proposes a new deep learning model, the M-Net, which provides both high spatial resolution and a sufficiently large receptive field while keeping the size of the model to a minimum. The proposed network is based on an encoder-decoder architecture. The encoder uses atrous convolution to encode the features at full resolution, and instead of relying on heavy transposed convolutions, the decoder consists of a multipath feature extraction module that extracts multiscale context information from the encoded features. The experimental results reported in the paper demonstrate the viability of the proposed scheme.

1. Introduction

Computer vision stands as the backbone of various modern autonomous driving systems [1] with semantic segmentation being one of its fundamental tasks. The goal of semantic segmentation is to assign a label to every pixel of an image. Deep convolutional neural networks have opened up a wide area of extremely effective solutions to problems like object detection [2], lane detection [3], object tracking [4], and semantic segmentation.

Improvements in the performance of deep neural networks have largely been achieved by increasing the number of learnable parameters along with careful network design, making the networks computationally expensive. Reducing computational cost and extracting the maximum possible performance from the minimum number of learnable parameters is an extremely important requirement for embedded systems in autonomous driving. Detecting large objects in an image requires a receptive field large enough to gather sufficient context information, and the pooling layers used in many recent networks to enlarge the receptive field mean that this information is found at a coarser scale in higher layers. Finer details, such as object edges or small and thin objects, need high spatial resolution for accurate segmentation.

To increase the receptive field of the network, the encoder in encoder-decoder methods normally downsamples the image using strided convolution, pooling layers, or both, at the expense of spatial resolution. The decoder then uses transposed convolution to upsample the encoded features into a high-resolution final feature map; this makes segmenting small objects difficult. Encoder-decoder structures like FCN [5] and U-Net [6] use skip connections to connect lower layers of the encoder to higher layers of the decoder; this partially solves the problem by allowing both coarse high-layer information and fine low-layer information to contribute to the final prediction. The technique is effective to some extent but can lead to deeper models with a large number of learnable parameters.

An alternative is to maintain the spatial resolution of the features in the encoder while using atrous convolution to increase the receptive field. DeepLab [7] modifies FCN [5] by replacing the last two downsampling operations with atrous convolutions to maintain the receptive field. In the architecture proposed by [8], atrous convolutions are used extensively to increase the receptive field while maintaining the spatial resolution throughout the network so that smaller objects can be segmented. Figure 1 shows how atrous convolution expands the receptive field by adding holes to a normal convolutional layer: a convolution layer with a 3×3 kernel and a dilation rate of 2 has the same field of view as a layer with a 5×5 kernel while using only 9 parameters. Dilated convolution is an effective way to maintain spatial resolution, but going deeper with high-resolution feature maps can introduce latency into the system. Since processing features at full resolution can be computationally expensive, we use max-pooling halfway down our network to reduce the spatial resolution by half; this reduces the run time of the network and at the same time increases the receptive field for larger objects. Capturing useful image context information at multiple scales has been shown to enhance segmentation accuracy. The pyramid pooling module introduced in [9] uses pyramid pooling operations for multiscale context aggregation. The authors in [10] divided the initial input into four subregions and obtained pooled features from each of the four subregions, respectively. DeepLab [7], on the other hand, uses atrous spatial pyramid pooling (ASPP), which exploits atrous convolution instead of pooling layers to capture features at different scales. A deeper version of the ASPP module was introduced in [11] by adding a standard convolution after the atrous convolutions. We take a similar approach by using a multipath feature extraction module as a decoder to fuse the key information from three different scales, leading to better segmentation ability.
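To make the resolution-preserving behavior concrete, the following is a minimal PyTorch sketch (the single channel and the 64×64 input size are arbitrary illustrative choices, not values from the paper): a 3×3 convolution with dilation 2 covers a 5×5 window while using only 9 weights, and matching padding keeps the spatial size unchanged.

```python
import torch
import torch.nn as nn

# A 3x3 convolution with dilation 2 covers a 5x5 window of the input
# while still using only 3*3 = 9 weights per input/output channel pair.
# Padding of dilation * (kernel_size - 1) // 2 keeps the spatial size fixed.
dilated = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3,
                    dilation=2, padding=2, bias=False)

x = torch.randn(1, 1, 64, 64)
y = dilated(x)
print(y.shape)                  # torch.Size([1, 1, 64, 64]) -- resolution preserved
print(dilated.weight.numel())   # 9 parameters, same as an undilated 3x3 kernel
```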

2. Related Work

Semantic segmentation is of great importance in self-driving cars and various driving aids. Deep convolutional neural networks used in encoder-decoder architectures have shown remarkable segmentation performance. Encoder-decoder architectures were first introduced by Bayesian SegNet [12] and SegNet [13]; they used the encoder to downsample the features, and the decoder was then responsible for recovering the spatial dimensions of the features. FCN [5] took a similar approach, using a classification model such as VGG [14] as an encoder to extract features, which were then upsampled to perform pixelwise prediction at full resolution.

Recent works have brought various changes to the encoder-decoder structure. Instead of using transposed convolution in the decoder, the architecture in [15] introduced a joint pyramid upsampling (JPU) unit to decode the features encoded by FCN [5]; the JPU unit upsamples the last three feature maps from FCN and then uses four dilated convolution layers to extract features at multiple scales, which both shrinks and speeds up the network. Encoder-decoders like those in [16, 17] use an encoder to extract multilevel features and then a decoder to combine them into a high-resolution final prediction, avoiding extensive use of transposed convolution.

DeepLabs [7, 18] introduced atrous spatial pyramid pooling (ASPP) to extract context information at different scales for better segmentation. PSPNet [9] used global average pooling to capture context information. A similar multipath module was used by [19] to generate a feature pyramid in a generative adversarial network for road segmentation. The authors in [20] use multiple paths in the decoder to capture different variations of faces with the same expression label. In [21], the input is taken at three different scales and an attention map is learned for each scale. Yap [22] proposed an architecture to segment damage on roads; it contains a detail branch and a segmentation branch, using VGG [14] and MobileNetV2 [23], respectively, as backbone architectures.

All these developments have led to large improvements in prediction accuracy, but some of them are computationally demanding. There have also been efforts to reduce the computational cost required to achieve a given segmentation accuracy. ENet [24] used early downsampling to reduce the cost of processing large frames and used PReLU as its activation function. PReLU tends to increase the computational cost, but the savings from reducing the spatial dimensions of the features early in the network were large enough to make the overall network faster than its counterparts. SINet [25] introduced an extremely lightweight multipath structure containing spatial squeeze modules. These modules halve the number of feature maps using pointwise convolution and, to further reduce computation, use average pooling to squeeze the resolution of the feature maps, beating ENet [24] in the total number of parameters.

3. Proposed Method

This section discusses our proposed methodology in detail. Our encoder is designed to effectively encode the features at full resolution without introducing too much latency into the system. Since the encoded features are at full resolution, the need for extensive transposed convolutions in the decoder is eliminated. The decoder in our case is a multipath feature extraction module that extracts features at different scales, making better use of the high-resolution encoded features. We propose two architectures, both with the same encoder: one with a PSP module as the decoder and one with an ASPP module as the decoder.

3.1. Architecture 1: M-Net Encoder+PSP Decoder

The encoder aims to encode the features at full resolution, making much finer predictions possible, while also having a receptive field large enough to effectively segment large objects.

Our encoder is four conv-blocks deep, as shown in Figure 2. Each conv-block has one standard convolutional layer and two atrous convolutional layers with dilation rates of 2 and 4, respectively, each with a 3×3 kernel. Stacking convolutional layers in this particular order connects each output pixel with 15 input pixels. To explain this concept, we use 1D convolutions to keep the illustration simple: Figure 3 shows a set of 1D convolutions, each with a kernel size of 3, where dilation factors of 1, 2, and 4 are used for the top, middle, and bottom layers, respectively. Each conv-block effectively increases the receptive field by 15 pixels while maintaining constant spatial dimensions; this ordering of dilation rates also avoids the problem, pointed out by [8], where information from adjacent pixels does not overlap if only even dilation rates are used. Since going deep at high spatial resolution can be computationally expensive, the first two conv-blocks are followed by a max-pooling layer that reduces the spatial dimensions of the features by half, after which two more conv-blocks are added; this also increases the receptive field of the network and enables it to segment larger objects in the image. Table 1 shows the input and output dimensions of every layer. We selected a kernel size of 3 for every layer throughout the network, avoiding larger kernel sizes to reduce computation. Padding specific to each dilation rate is used to maintain the spatial resolution. The outputs of the last two conv-blocks are upscaled using transposed convolution to recover their full spatial dimensions. All four feature maps are then concatenated, and the resulting feature map is passed on to the decoder, which in this first case is a PSP module.
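As an illustration, the following PyTorch sketch captures the encoder structure described above under stated assumptions: the channel widths, the ReLU activations, and the 2×2 transposed-convolution upsampling are our own placeholder choices, not configuration taken from the paper.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """One encoder conv-block: a standard 3x3 convolution followed by two
    atrous 3x3 convolutions with dilation rates 2 and 4. Padding equals the
    dilation rate, so the spatial resolution is unchanged."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=4, dilation=4), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class MNetEncoder(nn.Module):
    """Four conv-blocks; max-pooling after the first two halves the resolution,
    and transposed convolutions restore the outputs of the last two blocks to
    full resolution before all four feature maps are concatenated."""
    def __init__(self, in_ch=3, ch=32):  # 'ch' is an assumed channel width
        super().__init__()
        self.block1 = ConvBlock(in_ch, ch)
        self.block2 = ConvBlock(ch, ch)
        self.pool = nn.MaxPool2d(2)
        self.block3 = ConvBlock(ch, ch)
        self.block4 = ConvBlock(ch, ch)
        self.up3 = nn.ConvTranspose2d(ch, ch, kernel_size=2, stride=2)
        self.up4 = nn.ConvTranspose2d(ch, ch, kernel_size=2, stride=2)

    def forward(self, x):
        f1 = self.block1(x)
        f2 = self.block2(f1)
        p = self.pool(f2)          # spatial dimensions halved
        f3 = self.block3(p)
        f4 = self.block4(f3)
        # After upsampling f3 and f4, all four maps are at full resolution.
        return torch.cat([f1, f2, self.up3(f3), self.up4(f4)], dim=1)
```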

The motivation for using a PSP module as the decoder is to extract features at different scales, further increasing the receptive field, and to fuse the information obtained from those scales, widening the range of context information captured.

This idea was inspired by the PSP module proposed in [9], which uses spatial pyramid pooling to capture global context information from high-resolution features.

Multipath structures like those used in Google's Inception nets, and like this PSP module, can be computationally demanding. To counter the high computational requirements, we use a convolution layer to reduce the number of channels. The feature maps are then pooled into their respective subregions, each followed by a convolution layer and batch normalization, as shown in Figure 2. The features from each scale are then upsampled using bilinear interpolation and concatenated together.
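A minimal sketch of such a pyramid-pooling decoder is given below. The pointwise (1×1) channel reduction, the bin sizes (1, 2, 3, 6), and the channel widths are assumptions borrowed from the original PSP design [9], not necessarily the exact configuration used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PSPDecoder(nn.Module):
    """Pyramid-pooling decoder sketch: channels are first reduced, then the
    maps are pooled to several grid sizes; each scale gets a convolution with
    batch normalization, is bilinearly upsampled, and all scales are fused."""
    def __init__(self, in_ch, mid_ch=64, bins=(1, 2, 3, 6), n_classes=10):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, mid_ch, kernel_size=1)  # channel reduction
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),                        # pool to b x b subregions
                nn.Conv2d(mid_ch, mid_ch, kernel_size=1, bias=False),
                nn.BatchNorm2d(mid_ch),
                nn.ReLU(inplace=True),
            ) for b in bins
        ])
        self.classify = nn.Conv2d(mid_ch * (len(bins) + 1), n_classes, kernel_size=1)

    def forward(self, x):
        x = self.reduce(x)
        h, w = x.shape[2:]
        # Upsample every pooled scale back to the input resolution and fuse.
        feats = [x] + [F.interpolate(s(x), size=(h, w), mode='bilinear',
                                     align_corners=False) for s in self.stages]
        return self.classify(torch.cat(feats, dim=1))
```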

3.2. Architecture 2: M-Net Encoder+ASPP Decoder

Another way to extract multiscale information is atrous spatial pyramid pooling (ASPP). The ASPP module replaces the pooling layers with atrous convolutions at different dilation rates to extract features at multiple scales. The reason we have not relied entirely on the pooling layers of the PSP module for multiscale features is that, despite being effective at increasing the receptive field of the network, max-pooling layers have been shown to lose some information; this effect is demonstrated by the authors in [26], and we have also observed finer results with the ASPP module. We use three atrous convolution layers with dilation rates of 2, 4, and 8, respectively. Each atrous convolution is followed by a standard convolution layer with a 3×3 kernel, as shown in Figure 4. We decided not to go deep with the convolution layers in the PSP and ASPP modules, as a large number of computations on multiple paths can slow the system down. Table 2 shows the architectural difference between our PSP and ASPP decoders.
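The corresponding ASPP decoder can be sketched as follows. The dilation rates of 2, 4, and 8 and the trailing standard 3×3 convolutions follow the description above, while the channel widths, activations, and final 1×1 classifier are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ASPPDecoder(nn.Module):
    """ASPP decoder sketch: three parallel atrous 3x3 convolutions (dilation
    2, 4, and 8), each followed by a standard 3x3 convolution; the branch
    outputs are concatenated and projected to per-class scores."""
    def __init__(self, in_ch, mid_ch=64, n_classes=10):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, mid_ch, 3, padding=d, dilation=d),  # atrous conv
                nn.ReLU(inplace=True),
                nn.Conv2d(mid_ch, mid_ch, 3, padding=1),             # standard 3x3 conv
                nn.ReLU(inplace=True),
            ) for d in (2, 4, 8)
        ])
        self.classify = nn.Conv2d(mid_ch * 3, n_classes, kernel_size=1)

    def forward(self, x):
        return self.classify(torch.cat([b(x) for b in self.branches], dim=1))
```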

4. Experiments and Results

We have used PyTorch as our deep learning framework to train and test our models. The Adam optimizer [27] with a learning rate of , weight decay of , and batch size of 10 was used to train our networks on Cityscapes [28] and Mapillary Vistas [29]. We have compared our results with ENet and SINet since both are known for working with a low number of parameters.
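For concreteness, a minimal training-step sketch is shown below. The stand-in model, the cross-entropy loss, the input resolution, and the learning rate and weight decay values are all illustrative placeholders (the paper's exact values are not reproduced here); only the batch size of 10 follows the text.

```python
import torch
import torch.nn as nn

# 'model' stands in for either M-Net variant; the learning rate and weight
# decay below are illustrative placeholders, not the paper's values.
model = nn.Conv2d(3, 10, kernel_size=1)  # stand-in module for demonstration
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()        # assumed loss; not stated in the paper

images = torch.randn(10, 3, 256, 512)            # batch size 10, as in the paper
targets = torch.randint(0, 10, (10, 256, 512))   # 10 classes on Cityscapes

optimizer.zero_grad()
loss = criterion(model(images), targets)
loss.backward()
optimizer.step()
```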

We have used mean intersection over union (mIoU) as our evaluation metric; mIoU is the mean of the per-class IoU scores, as given in Equation (1), where TP, FP, and FN represent true positives, false positives, and false negatives, respectively, and $N$ is the number of classes:

$$\text{mIoU} = \frac{1}{N}\sum_{i=1}^{N}\frac{TP_i}{TP_i + FP_i + FN_i} \quad (1)$$
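A straightforward NumPy implementation of this metric, assuming integer label maps and our own convention of skipping classes absent from both prediction and ground truth, might look as follows.

```python
import numpy as np

def mean_iou(pred, target, n_classes):
    """Mean IoU over classes: per-class IoU = TP / (TP + FP + FN),
    averaged over classes that appear in the prediction or ground truth."""
    ious = []
    for c in range(n_classes):
        tp = np.sum((pred == c) & (target == c))  # true positives
        fp = np.sum((pred == c) & (target != c))  # false positives
        fn = np.sum((pred != c) & (target == c))  # false negatives
        denom = tp + fp + fn
        if denom > 0:                             # skip classes never seen
            ious.append(tp / denom)
    return float(np.mean(ious)) if ious else 0.0
```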

4.1. Cityscapes

Cityscapes is a large dataset with video sequences recorded on the streets of 50 different cities. The dataset has 2975 training samples, 500 validation samples, and 1525 test images. We trained our networks at an image resolution of and with 10 classes. To test the performance of both multipath feature extraction methods, we trained our network first with a PSP module as the decoder and then with an ASPP module. Table 3 compares ENet, SINet, and our proposed networks on the Cityscapes dataset; SINet still has far fewer parameters than our proposed architecture, but the jump in mIoU is significant while still using half the trainable parameters. The graphical comparison in Figure 5 shows that both of our models tend to converge somewhat sooner. Despite having slightly more parameters, the ASPP module is faster than the PSP module while also producing better results.

The results in Figure 6 show that the high-resolution feature encoding in our model makes it better at segmenting thin and small objects and at predicting masks with fine edges. The pole in the first example is segmented by both of our networks with reasonable accuracy, while ENet and SINet miss it completely. The person in the second image is segmented as a blob by SINet and ENet, whereas our proposed architectures produce better edges and a more human-like shape.

4.2. Mapillary Vistas

Mapillary Vistas has 25,000 high-resolution images, five times as many as Cityscapes; it covers 66 object categories with labels for 37 classes. It contains images from a wide range of devices, captured all around the world in various weather conditions and seasons. We augmented the dataset by flipping the images along the -axis, doubling its size. We divided the dataset into 40,000 training samples, 5,000 validation samples, and 5,000 test samples. Table 4 shows how each network performed on the Mapillary dataset, and Figure 7 shows that our architectures maintain a pattern similar to that of Figure 5 on the much larger Mapillary Vistas dataset.
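As a sketch, assuming the flip is horizontal (the mirroring axis is not spelled out above) and that images and masks are tensors whose last dimension is width, the augmentation could be implemented as follows.

```python
import torch

def flip_augment(image, mask):
    """Double the dataset by adding a mirrored copy of each sample.
    A horizontal flip is assumed; the segmentation mask must be
    flipped identically so labels stay aligned with pixels."""
    return [(image, mask),
            (torch.flip(image, dims=[-1]), torch.flip(mask, dims=[-1]))]
```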

All four networks were trained on 3 classes, namely vehicle, pedestrian, and road. Figure 8 shows the visual comparison between the results of all four networks on Mapillary Vistas.

Both of our networks show similar improvements on both datasets. Careful encoding of features at high resolution combined with multipath feature extraction segments finer edges without any increase in the number of learnable parameters. The first example in Figure 8 shows how both M-Net architectures were able to segment three different cars separately instead of merging all three into one. In the second example, both M-Nets produce much finer results, showing that the network segments both large and small objects well. The graphs in panels (a) to (d) show the change in mIoU with each epoch on the validation set.

5. Discussion

This paper improves on the traditional encoder-decoder technique for segmentation, proposing a method that encodes the features at full resolution and uses a multipath feature extraction module to predict much finer segmentation masks than its traditional encoder-decoder counterparts.

U-Net [6] has been one of the most widely used encoder-decoder architectures for semantic segmentation; its effectiveness and simplicity are the main reasons behind its popularity. It is safe to say that aggressive downsampling in segmentation models can cause the loss of important spatial information. It can be argued that the skip connections in U-Net [6] and SegNet [13] can compensate for the information lost to downsampling, but looking at it from a different angle, the convolutional layer immediately after a pooling layer still does not receive the spatial information it needs. Small models like ENet [24] and SINet [25] downsample the features at the beginning of the network and then go deep with much smaller feature maps to reduce the size and computational requirements of the model. In this paper, we show why that is not a good idea when the network is to be used for road scene segmentation. Encoding the features at full resolution and using a multipath feature extraction module yields much finer and more accurate segmentation masks while still maintaining low computational requirements. Future work may include upscaling the network to compare its performance with larger segmentation models. The main limitation of this technique is that going too deep with full-scale features can be expensive; this is one of the reasons we had to use max-pooling in order to beat the networks under consideration (ENet and SINet) in both size and speed. Future work might also study the effect of going deeper with full-scale features in applications where computational resources are not an issue.

6. Conclusion

Unlike detection and classification applications, segmentation depends critically on the spatial resolution of the features. This is especially true for road scenes when segmenting small objects such as people and traffic signs. This paper proposes a new deep learning-based model for semantic segmentation using an encoder-decoder architecture.

Instead of following the conventional approach of extensively downsampling the features in the encoder, we have introduced the idea of high-resolution feature encoding, enabling the decoder to extract valuable multiscale features from the high-resolution encoded features. To address the latency caused by high-resolution features, the spatial resolution is reduced by half after the first two convolution blocks, and the downsampled features are upsampled before being concatenated with the rest of the features, so the output of the encoder is at full resolution. The decoder consists of a multipath feature extraction module that decodes the necessary information from three different scales. The proposed scheme is compared with classical encoder-decoder architectures for semantic segmentation, and the experimental results reported in the paper show that encoding at full resolution results in the prediction of much finer segmentation masks for both large and small objects. This research shows the overall effectiveness of the proposed architecture in terms of improved segmentation performance.

Data Availability

Two datasets were used in this set of experiments, namely Cityscapes and Mapillary Vistas; both are open-access datasets and are available at the following links: https://www.kaggle.com/datasets/zhangyunsheng/cityscapes-data and https://www.mapillary.com/dataset/vistas.

Conflicts of Interest

There are no potential conflicts of interest. The work has been undertaken in accordance with accepted standards of ethics and professional conduct.

Acknowledgments

One of the authors (Harish Kumar) extends his gratitude to the Deanship of Scientific Research at King Khalid University for funding this work through the research groups program under grant number R. G. P. 2/198/43.