Abstract

In recent years, convolutional neural networks (CNNs) have been central to advances in advanced driver assistance systems and autonomous driving. This paper presents a point-wise pyramid attention network, PPANet, which follows an encoder-decoder approach to semantic segmentation. Specifically, the encoder adopts a novel squeeze nonbottleneck module as its base module for extracting feature representations, where squeeze and expansion operations are utilized to obtain high segmentation accuracy. An upsampling module is designed to serve as the decoder; its purpose is to recover the pixel-wise detail lost in the encoding part. The middle part consists of a point-wise pyramid attention (PPA) module and an attention-like module connected in parallel; the PPA module is proposed to exploit contextual information effectively. Furthermore, we develop a loss function that combines dice loss and binary cross-entropy to improve accuracy and obtain faster training convergence on KITTI road segmentation. We conducted training and testing experiments on the KITTI road segmentation and Camvid datasets, and the evaluation results demonstrate the effectiveness of the proposed method for road semantic segmentation.

1. Introduction

Advanced driver assistance systems (ADAS) have gained massive popularity in the past decades, with much attention from large technology and car companies such as Tesla, Google, and Uber. ADAS features, including adaptive cruise control (ACC), lateral guidance assistance, collision avoidance, traffic sign recognition, and lane change assistance, can be considered crucial building blocks of autonomous driving systems [1–3]. Early studies detected lanes using mathematical models and traditional computer vision algorithms; for instance, many algorithms based on supervised and unsupervised approaches have been developed [4–7]. The research paradigm has since shifted towards nontraditional machine learning methods, namely, deep learning. Deep learning methods deliver notable performance improvements and have become the dominant solution for many problems in academia and industry because convolutional neural networks (CNNs) extract robust and representative features; the significant progress in ADAS and autonomous driving has been driven largely by this success.

Road detection is an essential component of many ADAS and autonomous vehicles. Much active research focuses on road detection [8–19], and wide-ranging algorithms with various representations have been proposed for this task. Semantic segmentation has been at the centre of this development, and a significant amount of research uses convolutional neural network-based segmentation: region-based representations [20] and encoder-decoder networks [21–26] have been used along with several supporting approaches, while other techniques fuse 3D LiDAR point clouds with 2D images, such as [27, 28]. In this paper, we focus on road segmentation using RGB images. Inspired by the seminal segmentation model U-Net [29], Inception [30], SqueezeNet [31], and deep residual learning [32], we propose an architecture that takes the strengths of these well-established models and performs semantic segmentation more effectively. The proposed architecture is named PPANet (point-wise pyramid attention network) and follows the encoder-decoder approach. Our main contributions are summarized as follows.

Firstly, we introduce a novel module named the point-wise pyramid attention (PPA) module to acquire long-range dependencies and multiscale features without much computational burden. Secondly, we design an upsampling module to help recover the details lost in the encoder. Thirdly, we propose a squeeze-nbt module to extract feature representations in the encoder. Finally, we combine these modules in an encoder-decoder manner to construct PPANet for semantic segmentation. The designed model is used to improve road scene understanding on the KITTI road segmentation and Camvid datasets.

2. Related Work

2.1. Encoder-Decoder Method

In semantic segmentation, the main objective is to assign a categorical label to every pixel in an image, which plays a significant role in road scene understanding. The success of deep convolutional neural network (CNN) models [30, 32–34] has had a remarkable impact on pixel-wise semantic segmentation due to their rich hierarchical features [29, 35–38]. Usually, to obtain a more delicate result from such a deep network, it is essential to retain high-level semantic information while using low-level details. However, training such a deep neural network requires a large amount of data, and in many practical cases only a limited number of training examples are available. One way to overcome this problem is transfer learning: a network pretrained on a big dataset is then fine-tuned on the target dataset, as done in [36, 39]. Another solution is extensive data augmentation, as done in U-Net [29]. In addition to data augmentation, the model architecture should be designed so that information propagates easily from low levels to the corresponding high levels, as in U-Net.

2.2. Deep Neural Networks

Since the seminal AlexNet [33], a model with only eight layers, many studies have proposed new approaches for the classification task. These models were later applied successfully to other computer vision tasks, for example, segmentation [36], object detection [34], video classification [40, 41], object tracking [42], human pose estimation [43], and superresolution [44]. These successes spurred the design of models with very large numbers of layers. However, growing layer counts require tedious hyperparameter tuning, which increases the difficulty of designing such models. In 2014, VGGNet [34] made a significant improvement by utilizing a wider and deeper network; its approach introduced a simple yet effective strategy for designing very deep networks. Deeper networks have in turn had a significant impact on other computer vision tasks, and ResNet [32] went deeper still. However, increasing the depth of a network can cause a vanishing gradient problem [32]. Many techniques have been introduced to prevent vanishing gradients, for instance, the MSR initialization method [45] and batch normalization [46].

Meanwhile, skip connections (identity mappings) were used to ease the training of deep networks without vanishing gradient problems. Although VGGNet has a simple architecture, it requires high computation capabilities. In contrast, the Inception model family was designed to perform well under memory constraints and a low computation budget. An Inception module adopts a split-transform-merge strategy: the input feature maps are split into lower dimensions (using 1×1 convolutions), transformed by a combination of specialized filters (1×1, 3×3, and 5×5), and merged at the end by concatenating the branches.

2.3. Semantic Segmentation with CNN

Recent segmentation-based methods have contributed significantly to solving many computer vision problems, using a wide range of techniques such as the Fully Convolutional Network (FCN) [36], FCN with a conditional random field (CRF) [46], region-based representations [20], encoder-decoder networks [21–23], and multidimensional recurrent networks [47]. Furthermore, pyramid pooling and its variants have had a great impact on recent advances in semantic segmentation [48–53].

2.4. Dilated Convolution-Based Architecture

Dilated (atrous) convolution [53] is a powerful tool in the recent progress of semantic segmentation [52]. It enlarges the receptive field while maintaining the same number of parameters. Recently, many approaches have focused on multimodal fusion and contextual information aggregation to improve semantic segmentation [52, 54, 55]. ParseNet [56] applies average pooling to the full image to capture global contextual information. Spatial pyramid pooling (SPP) [57] inspired the use of pyramid pooling to aggregate multiscale contextual information, as in the pyramid pooling module [51] and the atrous spatial pyramid pooling (ASPP) module [53, 58]. DenseASPP [59] generates dense connections to acquire a larger receptive field. To empower the ASPP module, Xie et al. [60] introduced vortex pooling to utilize contextual information.
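To make the receptive-field claim concrete, the following pure-Python sketch (our own illustration, not code from any of the cited works) computes the receptive field of a stack of stride-1 convolutions, with and without dilation:

```python
def receptive_field(layers):
    """Receptive field of a stack of stride-1 conv layers.

    Each layer is a (kernel_size, dilation) pair. The effective kernel
    size of a dilated conv is d*(k-1)+1, and stacking stride-1 layers
    grows the receptive field additively.
    """
    rf = 1
    for k, d in layers:
        rf += d * (k - 1)
    return rf

# Three 3x3 convs: same parameter count in both settings, but the
# dilated stack sees a much larger context window.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
print(receptive_field([(3, 1), (3, 2), (3, 4)]))  # 15
```

This is why dilation enlarges the receptive field "for free": the parameter count depends only on the kernel size, not on the dilation rate.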

3. Methodology

3.1. Architecture

In this work, we propose a point-wise pyramid attention network (PPANet) for semantic segmentation, as shown in Figure 1. The network is built on an encoder-decoder framework. The encoder is similar to a classification network; it extracts features and encodes the input into compact representations, while the decoder recovers the corresponding representations. The squeeze-nbt unit in Figure 2 is the main building block of the stages in the encoder part. Each encoder stage has two squeeze-nbt blocks, and the feature map is downsampled by half in the first block of each stage using strided convolution (for more details, see Table 1). The network has two other parts, the point-wise pyramid attention module and an attention module, inserted between the encoder and the decoder. These central modules enrich the receptive field and provide sufficient context information. More details are discussed in the following sections.

3.2. Basic Building Unit

This subsection elaborates the squeeze-nbt module architecture (illustrated in Figure 2). It draws inspiration from several concepts introduced in recent state-of-the-art classification and segmentation models, such as the fire module in SqueezeNet [31], depthwise separable convolution [61], and dilated convolution [58]. Figure 2 shows the squeeze-nbt module and the encoder-decoder framework. We introduce a new module named the squeeze-nbt (squeeze nonbottleneck) module, based on a reduce-split-squeeze-merge strategy: it first uses point-wise convolution to reduce the feature maps and then applies a parallel fire module to learn useful representations. To make squeeze-nbt computationally efficient, we adopt dilated depthwise separable convolution instead of computationally expensive standard convolution.
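The saving from replacing a standard convolution with a depthwise separable one can be seen with a back-of-the-envelope parameter count (a hypothetical sketch of ours, not code from the paper; dilation changes the receptive field but not the parameter count):

```python
def conv_params(c_in, c_out, k):
    """Parameters of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def dw_separable_params(c_in, c_out, k):
    """Depthwise separable conv: a k x k depthwise conv (one filter
    per input channel) followed by a 1x1 point-wise conv."""
    return c_in * k * k + c_in * c_out

c_in, c_out, k = 64, 64, 3  # illustrative channel counts
print(conv_params(c_in, c_out, k))          # 36864
print(dw_separable_params(c_in, c_out, k))  # 4672
```

For these illustrative channel counts the separable variant uses roughly 8× fewer parameters, which is the efficiency the squeeze-nbt module relies on.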

3.3. Upsampling Module

Several methods such as [21–23, 62], transpose convolution [63], and bilinear upsampling have been used broadly to gradually upsample encoded feature maps. In this work, we propose an upsampling module that works as a decoder and refines the encoded feature maps by aggregating features of different resolutions. First, the low-level features are processed with a convolution while, in parallel, the high-level features are upsampled to match the features coming from the encoder; these features are then concatenated and refined with two consecutive convolutions.

3.4. Point-Wise Pyramid Attention (PPA) Module

Segmentation requires both a sizeable receptive field and rich spatial information. We propose the point-wise pyramid attention (PPA) module, which is effective for aggregating global contextual information. As shown in Figure 3, the PPA module consists of two parts: a nonlocal part and vortex pooling. On the one hand, the nonlocal part generates dense pixel-wise weights and extracts long-range dependencies; on the other hand, vortex atrous convolution is useful for detecting objects at multiple scales. By analysing vortex pooling and nonlocal dependency, we fuse the advantages of these two modules in a single module, the point-wise pyramid attention (PPA) module. The PPA module consists of three parallel vortex atrous convolution blocks with dilation rates of 3, 6, and 9 and one nonatrous convolution block.

The point-wise pyramid attention module is shown in Figure 3. Let $X \in \mathbb{R}^{W \times H \times C}$ be the input feature map, where $W$, $H$, and $C$ are its width, height, and channels, respectively. First, we apply two parallel convolution layers $f$ and $g$ to generate feature maps $F = f(X)$ and $G = g(X)$ with $C'$ channels, where $C'$ indicates the number of channels after the reduction.

Then, we calculate the similarity matrix of $F$ and $G$ by a matrix multiplication, $S = F G^{\top}$, after reshaping both to $(W \cdot H) \times C'$.

Lastly, softmax is applied to normalize the result and transform it into a self-attention-like mechanism: $A = \mathrm{softmax}(S)$.

4. Experimental Results and Analysis

In this section, comprehensive experiments on the KITTI road segmentation dataset [64] and the Camvid dataset [65] are carried out to evaluate the efficiency and effectiveness of the proposed architecture. Firstly, the datasets and the implementation protocols are introduced. We then elaborate on the loss function and the evaluation metrics used for KITTI, followed by ablation studies and comparisons with SOTA models. Finally, we report a comparison on the Camvid dataset.

4.1. Datasets and Implementation Details
4.1.1. Datasets

(1) KITTI Road Segmentation Dataset. It consists of 289 training images with their corresponding ground truth. The benchmark is divided into three categories: urban marked (UM) with 95 frames, urban multiple marked lanes (UMM) with 96 frames, and urban unmarked (UU) with the remaining 98 frames. The small number of frames and difficult lighting conditions make the dataset very challenging. In total, it has 290 frames for testing (testing frames have no ground truth information). Training and testing frames were extracted from the KITTI dataset [64] at a minimum spatial distance of 20 m. Each image has a resolution of approximately 1242×375 pixels. We split the dataset into three subsets: (a) 173 training images, (b) 58 validation images, and (c) 58 testing images.

(2) Camvid Dataset. Camvid is an urban street scene understanding dataset for autonomous driving. It consists of 701 samples: 367 training samples, 101 validation samples, and 233 test samples, with 11 semantic categories such as building, road, sky, and bicycle; a 12th class contains unlabelled data that we ignore during training. The original image resolution is 960×720; images are downsampled to 360×480 before training for a fair comparison. A weighted categorical cross-entropy loss is used to compensate for underrepresented categories in the dataset.

4.1.2. Implementation Details

All experiments were implemented on one GTX 1080Ti with CUDA 10.2 and cuDNN 8.0 using the PyTorch [58] deep learning framework. The Adam [59] optimizer, a stochastic gradient-based optimizer, is used with an initial learning rate $lr_0$ to train the KITTI road segmentation and Camvid datasets. The learning rate is adjusted according to Equation (3), $lr = lr_0 \times \gamma^{\lfloor e / s \rfloor}$, where $lr_0$ is the initial learning rate, $\gamma$ is a factor that controls the learning rate drop, $s$ is the number of epochs after which the learning rate is decreased, and $e$ is the current epoch. In the PPANet implementation, the learning rate is reduced by a factor of $\gamma$ every 15 epochs. The network runs for a maximum of 300 epochs. Normal weight initialization [45] is used to initialize the model. Finally, $L_2$ regularization is applied to deal with model overfitting.
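The step-decay schedule described above can be written as a few lines of Python. This is our reading of the schedule (a drop by a fixed factor every fixed number of epochs); the concrete values below, a drop factor of 0.5 and an initial rate of 1e-3, are illustrative assumptions, since the paper leaves them unspecified here:

```python
def step_lr(lr0, gamma, step_size, epoch):
    """Step decay: multiply the learning rate by gamma once every
    step_size epochs (our reading of Equation (3))."""
    return lr0 * (gamma ** (epoch // step_size))

# Hypothetical values: drop by 0.5 every 15 epochs from 1e-3.
print(step_lr(1e-3, 0.5, 15, 0))   # 0.001
print(step_lr(1e-3, 0.5, 15, 15))  # 0.0005
print(step_lr(1e-3, 0.5, 15, 30))  # 0.00025
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.StepLR` with `step_size=15` and the chosen `gamma`.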

(1) Loss Function. A wide range of loss functions has been proposed over the years for semantic segmentation tasks. For instance, binary cross-entropy (BCE) has been applied in much classification and segmentation research with remarkable success. Although it is convenient to train neural networks with BCE, it may not perform well under class imbalance; for instance, it does not perform well when used as the only loss function on KITTI road segmentation with PPANet. In this work, our total loss, Equation (6), is a combination of the dice loss, Equation (5), which was proposed in Zhou et al. [62], and binary cross-entropy, Equation (4). Let $p$ be the prediction given by a sigmoid nonlinearity and let $g$ be the corresponding ground truth. Dice loss has been implemented in different forms in the literature; for instance, the definitions in [62, 63] are equivalent except for the denominator. Our experiments found that the dice loss variant that uses the sums of squared probabilities and ground-truth values in the denominator performs better. These functions are defined as follows.

The binary cross-entropy:
$$L_{\mathrm{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ g_i \log p_i + (1 - g_i) \log (1 - p_i) \right].$$

Dice loss:
$$L_{\mathrm{Dice}} = 1 - \frac{2 \sum_{i=1}^{N} p_i g_i}{\sum_{i=1}^{N} p_i^2 + \sum_{i=1}^{N} g_i^2}.$$

Total loss:
$$L = L_{\mathrm{BCE}} + L_{\mathrm{Dice}}.$$
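A minimal pure-Python sketch of the combined loss, under our assumptions (flattened prediction and ground-truth vectors, a small epsilon for numerical stability, and an unweighted sum of the two terms), is:

```python
import math

def bce_loss(p, g, eps=1e-7):
    """Mean binary cross-entropy over flattened predictions p in (0,1)
    and binary ground truth g."""
    n = len(p)
    return -sum(gi * math.log(pi + eps) + (1 - gi) * math.log(1 - pi + eps)
                for pi, gi in zip(p, g)) / n

def dice_loss(p, g, eps=1e-7):
    """Dice loss with squared terms in the denominator, the variant
    the text reports as performing better."""
    inter = sum(pi * gi for pi, gi in zip(p, g))
    denom = sum(pi * pi for pi in p) + sum(gi * gi for gi in g)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def total_loss(p, g):
    """Unweighted sum of BCE and dice (an assumption of this sketch)."""
    return bce_loss(p, g) + dice_loss(p, g)

p = [0.9, 0.8, 0.1, 0.2]  # sigmoid outputs
g = [1.0, 1.0, 0.0, 0.0]  # ground truth
print(round(dice_loss(p, g), 4))  # 0.0286
```

The dice term directly rewards overlap between prediction and ground truth, which compensates for the class imbalance that BCE alone handles poorly.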

4.1.3. Evaluation Metrics on KITTI

Precision and recall are among the most common metrics for evaluating binary classification. Following the methods used in [64, 66, 67], we evaluate our segmentation model using precision, Equation (7), recall, Equation (8), and the $F_\beta$-measure, Equation (11):
$$\mathrm{PRE} = \frac{TP}{TP + FP}, \qquad \mathrm{REC} = \frac{TP}{TP + FN}, \qquad F_\beta = (1 + \beta^2)\,\frac{\mathrm{PRE} \cdot \mathrm{REC}}{\beta^2 \cdot \mathrm{PRE} + \mathrm{REC}},$$
where $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, respectively.
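These metrics are straightforward to compute from binary masks; a small sketch (our own illustration on flattened masks, with $\beta = 1$ by default) follows:

```python
def precision_recall_f(pred, gt, beta=1.0):
    """Pixel-wise precision, recall and F_beta from flattened binary
    masks (lists of 0/1 values)."""
    tp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gt) if p == 0 and g == 1)
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    if pre + rec == 0:
        return pre, rec, 0.0
    f = (1 + beta ** 2) * pre * rec / (beta ** 2 * pre + rec)
    return pre, rec, f

pred = [1, 1, 0, 0, 1]
gt   = [1, 0, 0, 1, 1]
print(tuple(round(x, 3) for x in precision_recall_f(pred, gt)))
# (0.667, 0.667, 0.667)
```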

4.1.4. KITTI Data Augmentation

Data augmentation comprises a wide range of techniques used to extend the training samples by applying random perturbations and jitters to the original data. In our model, an online data augmentation approach helps the model learn more robust features and increases generalizability by preventing the model from seeing the same image twice, as a slightly different random modification of the input data is performed each time. We therefore perform a series of data transformations to deal with typical changes in road images, such as texture, colour, and illumination changes. In particular, we implemented normalization, blurring, and illumination changes. Data augmentation lends itself naturally to computer vision; we acquire additional training data from the original KITTI road segmentation images by applying the following transforms:
(1) Transformations applied to both the image and the ground truth:
 (i) Geometric transformations that alter point positions, such as translation, scaling, and rotation
 (ii) Mirroring (horizontal flip)
(2) Transformations applied to the image only, since they affect only pixel values:
 (i) Normalizing the input image by standardizing each pixel using Equation (13)
 (ii) Random brightness adjustment
 (iii) Gaussian blur
 (iv) Random noise
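The key distinction above, transforms that must be applied to the image and mask jointly versus image-only photometric transforms, can be sketched as follows (a toy illustration on nested lists; the function names and the brightness range are our assumptions):

```python
import random

def hflip(image, mask):
    """Geometric transform: mirror the image AND its ground-truth mask
    together, so pixel labels stay aligned."""
    return [row[::-1] for row in image], [row[::-1] for row in mask]

def random_brightness(image, max_delta=0.1):
    """Photometric transform applied to the image only: add a random
    offset and clamp each pixel to [0, 1]. The mask is untouched."""
    delta = random.uniform(-max_delta, max_delta)
    return [[min(1.0, max(0.0, px + delta)) for px in row] for row in image]

img  = [[0.1, 0.9], [0.5, 0.3]]
mask = [[0, 1], [1, 0]]
fi, fm = hflip(img, mask)
print(fi)  # [[0.9, 0.1], [0.3, 0.5]]
print(fm)  # [[1, 0], [0, 1]]
```

Applying a geometric transform to the image without the mask would silently corrupt the training labels, which is why the two groups are kept separate.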

(1) Mathematical Morphology. Applying deep learning methods to road segmentation sometimes produces noise: nonroad pixels can be classified as road and vice versa. Several mathematical morphology techniques can be used to remove this noise and improve the model's performance at test time. An opening operation with square structuring elements was used. It helped eliminate some of the nonroad pixels classified as road (false positives), as illustrated in Figure 4, where (a) shows the model's output with visible noise beside the road and (b) shows the effect of removing these false positives.
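Opening is an erosion followed by a dilation; small isolated blobs vanish under the erosion and never come back, while large regions survive almost unchanged. A minimal pure-Python sketch with a 3×3 square structuring element (our own illustration; border pixels are treated as background, and real pipelines would use a library such as `scipy.ndimage.binary_opening` instead):

```python
def erode(grid):
    """Binary erosion with a 3x3 square structuring element.
    Border pixels are treated as background for simplicity."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            out[i][j] = int(all(grid[i + di][j + dj]
                                for di in (-1, 0, 1) for dj in (-1, 0, 1)))
    return out

def dilate(grid):
    """Binary dilation with a 3x3 square structuring element."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            out[i][j] = int(any(grid[i + di][j + dj]
                                for di in (-1, 0, 1) for dj in (-1, 0, 1)
                                if 0 <= i + di < h and 0 <= j + dj < w))
    return out

def opening(grid):
    """Opening = erosion then dilation: removes small false-positive
    blobs while roughly preserving large regions."""
    return dilate(erode(grid))

# A 5x5 mask: a solid 3x3 "road" block plus one isolated noise pixel.
mask = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0],
]
print(opening(mask)[0][0])  # 0: the isolated noise pixel is removed
```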

4.2. Ablation Study
4.2.1. Encoder

We carried out ablation studies to highlight the effectiveness of the proposed model structure. The baseline achieves a 96.3% $F$-measure and 96.2% AP; we then ran experiments with different dilation rate settings. First, we gradually increased the dilation rates to 3, 9, and 13 in encoder stages 2, 3, and 4, respectively, which resulted in a decrease of 6.32% in $F$-measure and 4.25% in average precision. To further examine the effectiveness of our method across a range of dilation rates, we employed another combination, 2, 4, and 8, which yielded the lowest result in the ablation experiments, an 89.36% $F$-measure and 78.9% AP; this combination of dilation rates is not effective for the PPANet encoder. Finally, we tested the model with a dilation rate of 2 in all three stages of the encoder, which yielded the best outcome, as shown in Table 2. Therefore, we set the dilation rate to 2 for all three stages of the encoder.

4.2.2. Decoder

We tested two settings in the decoding part. First, the upsampling unit, comprising bilinear upsampling and convolution, restores high-resolution features from low-resolution features; this approach achieved good results. However, some information is still lost while downsampling the feature maps in the encoding process. To retain as much global context as possible, we designed the point-wise pyramid attention module to increase the model's prediction performance. With both the PPA module and the upsampling unit, the decoder aggregates information through a fusion of multiscale features and therefore effectively captures local and global context (see Table 2 for further comparison). It can be seen that the proposed upsampling unit and the point-wise pyramid attention (PPA) module improve PPANet's segmentation and help it achieve superior AP, precision, recall, and $F$ score compared with the other settings.

4.3. Comparing with the SOTA

In this subsection, we present the overall qualitative and quantitative assessment of the trained model. Training and evaluation were conducted on the KITTI road segmentation and Camvid datasets, and the model is then compared with selected SOTA models. Table 3 shows the comparison of PPANet with other SOTA models on the KITTI road segmentation dataset. PPANet is designed for road scene understanding and is trained end-to-end. As previously stated, the dataset has a limited amount of data divided into three categories, urban marked (UM road), urban multiple marked lanes (UMM road), and urban unmarked (UU road), treated as one category to help, alongside data augmentation, overcome model overfitting. To rank the compared models, we report precision (PRE), recall (REC), and the $F$-measure, which are well-known metrics for evaluating binary semantic segmentation approaches. The chosen state-of-the-art models include SegNet [23], ENet [21], FastFCN [68], LBN-AA [66], DABNet [67], and AGLNet [69]. The overall results of PPANet and the other SOTA models are reported in Table 3. Our PPANet obtained the highest scores on all metrics, demonstrating the effectiveness of the proposed method for robust road detection; FastFCN ranked second in terms of precision, while AGLNet achieved a better $F$-measure than the remaining baselines. ENet, which is designed for speed, achieved the lowest results of all the compared models.

For a qualitative evaluation of our model on road segmentation, visual examples of PPANet predictions on the KITTI test set are presented in Figures 5–7 in perspective view for UM, UMM, and UU, respectively. The urban marked road (Figure 5) obtained the best predictions, with almost no misclassification. For urban unmarked roads, there are slightly noisy predictions that could be improved with postprocessing optimization techniques such as CRF or by increasing the amount of data. The urban multiple marked lanes category has a larger misclassified road area, with some regions outside the road predicted as road. These false positive detections mainly occur around railway poles close to the road, and road detection is also affected by shadows. Overall, our model, with only 3.01 M parameters and without pretrained weights, obtained excellent results on a small dataset such as KITTI road segmentation.

4.4. Comparison with SOTA Models on Camvid

In this subsection, we design an experiment to demonstrate the effectiveness and validity of the proposed network on the Camvid dataset. We train and validate the model on the training and validation images for 400 epochs, then test it on the testing images and report the results in Table 4 in terms of mean intersection over union (mIoU). From Table 4, we can see that the proposed PPANet has superior performance in terms of mIoU. First, the model is compared with models designed for real-time semantic segmentation, such as ENet [21], BiSeNetV1 [70], CGNet [71], NDNet45-FCN8-LF [72], LBN-AA [66], DABNet [67], and AGLNet [69]. We also compare the proposed method with non-real-time models such as DeepLabv2 [58], PSPNet [51], DenseDecoder [73], and SegNet [23]. In addition, we present the per-category results on the Camvid test set in Table 5. As can be seen, the proposed method obtains better accuracy in most classes. Visual results are provided in Figure 8.

5. Conclusion

This paper has presented an approach to scene understanding in monocular images: a novel encoder-decoder network for effective semantic segmentation, named PPANet. The encoder adopts split and squeeze operations in the residual layer to enhance information propagation and feature reuse. To refine the encoded feature maps effectively, we design a decoder consisting of an upsampling unit and the point-wise pyramid attention (PPA) module. The PPA module is inserted in the centre to enrich the receptive field and aggregate global contextual information, and an attention mechanism refines the prediction using a sequence of depthwise convolutions followed by a sigmoid. The interaction among features from the upsampling unit, the PPA module, and the attention module guides high-level and low-level features to improve performance. The network is trained end-to-end on two popular datasets, KITTI road segmentation and Camvid. The experimental results show that the proposed method improves the state of the art for road segmentation on small datasets such as KITTI and Camvid. Future work includes using pretrained weights, as has become the paradigm for most SOTA models in this field, and investigating the incorporation of other sensors such as LiDAR into the architecture to test the effectiveness of our approach for data fusion and 3D road segmentation.

Data Availability

This study used the Camvid dataset and the KITTI road segmentation dataset.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The study was funded by the Fujian province Innovation Strategy Research Program (No. 2020R01020196) and Yongtai Artificial Intelligence Institute.