Abstract

In recent years, convolutional neural networks (CNNs) have been central to advances in advanced driver assistance systems and autonomous driving. This paper presents a point-wise pyramid attention network, PPANet, which follows an encoder-decoder approach to semantic segmentation. Specifically, the encoder adopts a novel squeeze nonbottleneck module as its base module for extracting feature representations, where squeeze and expansion operations are utilized to obtain high segmentation accuracy. An upsampling module is designed to serve as the decoder; its purpose is to recover the pixel-wise detail lost in the encoding part. The middle part consists of a point-wise pyramid attention (PPA) module and an attention-like module connected in parallel; the PPA module is proposed to exploit contextual information effectively. Furthermore, we develop a loss function that combines dice loss and binary cross-entropy to improve accuracy and obtain faster training convergence on KITTI road segmentation. We conducted training and testing experiments on the KITTI road segmentation and Camvid datasets, and the evaluation results demonstrate the effectiveness of the proposed method for road semantic segmentation.

1. Introduction

Advanced driver assistance systems (ADAS) have gained massive popularity in the past decades, with much attention from large technology and car companies such as Tesla, Google, and Uber. ADAS features, including adaptive cruise control (ACC), lateral guidance assistance, collision avoidance, traffic sign recognition, and lane change assistance, can be considered crucial building blocks of autonomous driving systems [1–3]. Early studies detected lanes using mathematical models and traditional computer vision algorithms; for instance, many algorithms based on supervised and unsupervised approaches have been developed [4–7]. The research paradigm has since shifted towards nontraditional machine learning methods, namely, deep learning. Deep learning methods deliver notable performance improvements and have become the dominant solution for many problems in academia and industry because convolutional neural networks (CNNs) extract robust and representative features; the significant progress in ADAS and autonomous driving has been driven largely by this success.

Road detection is an essential component of many ADAS and autonomous vehicles. Much active research focuses on road detection [8–19], and wide-ranging algorithms with various representations have been proposed for this task. Semantic segmentation has been at the centre of this development, and a significant amount of research uses convolutional neural network-based segmentation: region-based representations [20] and encoder-decoder networks [21–26] have been used along with several supporting approaches, while other techniques fuse 3D LiDAR point clouds with 2D images, such as [27, 28]. In this paper, we focus on road segmentation using RGB images. Inspired by the seminal segmentation model U-Net [29], Inception [30], SqueezeNet [31], and deep residual learning [32], we propose an architecture that takes the strengths of these well-established models and performs semantic segmentation more effectively. The proposed architecture is named PPANet (point-wise pyramid attention network) and follows the encoder-decoder approach. Our main contributions are summarized as follows.

Firstly, we introduce a novel module named the point-wise pyramid attention (PPA) module to acquire long-range dependencies and multiscale features without much computational burden. Secondly, we design an upsampling module to help recover the details lost in the encoder. Thirdly, we propose a squeeze-nbt module to extract feature representations in the encoder. Finally, we combine these modules in an encoder-decoder manner to construct PPANet for semantic segmentation. The designed model is used to improve road scene understanding on the KITTI road segmentation and Camvid datasets.

2. Related Work

2.1. Encoder-Decoder Method

In semantic segmentation, the main objective is to assign a categorical label to every pixel in an image, which plays a significant role in road scene understanding. The success of deep convolutional neural network (CNN) models [30, 32–34] has had a remarkable impact on pixel-wise semantic segmentation due to their rich hierarchical features [29, 35–38]. Usually, to obtain a more delicate result from such a deep network, it is essential to retain high-level semantic information while using low-level details. However, training such a deep neural network requires a large amount of data, and in many practical cases only a limited number of training examples are available. One way to overcome this problem is transfer learning: a network pretrained on a big dataset is then fine-tuned on the target dataset, as done in [36, 39]. Another solution is extensive data augmentation, as done in U-Net [29]. In addition to data augmentation, the model architecture should be designed so that information propagates easily from low levels to the corresponding high levels, as in U-Net.

2.2. Deep Neural Networks

Since the seminal AlexNet [33], a model with only eight layers, many studies have proposed new approaches for the classification task. These models were later applied successfully to other computer vision tasks, for example, segmentation [36], object detection [34], video classification [40, 41], object tracking [42], human pose estimation [43], and superresolution [44]. These successes spurred the design of models with very large numbers of layers. However, growing layer counts require tedious hyperparameter tuning, which increases the difficulty of designing such models. In 2014, VGGNet [34] made a significant improvement by utilizing a wider and deeper network; its approach introduced a simple yet effective strategy for designing very deep networks. Deeper networks have in turn had a significant impact on other computer vision tasks, and ResNet [32] went deeper still. However, increasing the depth of a network can cause a vanishing gradient problem [32]. Many techniques have been introduced to prevent vanishing gradients, for instance, the MSR initialization method [45] and batch normalization [46].

Meanwhile, skip connections (identity mappings) were used to ease the training of deep networks without vanishing gradient problems. Although VGGNet has a simple architecture, it requires high computation capabilities. In contrast, the Inception model family was designed to perform well under memory constraints and a low computation budget. An Inception module adopts a split-transform-merge strategy: the input feature maps are split into lower dimensions (using 1×1 convolutions), transformed by a combination of specialized filters (1×1, 3×3, and 5×5), and merged at the end by concatenating the branches.

2.3. Semantic Segmentation with CNN

Recent segmentation-based methods have contributed significantly to solving many computer vision problems, using a wide range of techniques such as the Fully Convolutional Network (FCN) [36], FCN with a conditional random field (CRF) [46], region-based representations [20], encoder-decoder networks [21–23], and multidimensional recurrent networks [47]. Furthermore, pyramid pooling and its variants have had a great impact on recent advances in semantic segmentation [48–53].

2.4. Dilated Convolution-Based Architecture

Dilated (atrous) convolution [53] is a powerful tool in the recent progress of semantic segmentation [52]. It enlarges the receptive field while maintaining the same number of parameters. Recently, many approaches have focused on multimodal fusion and contextual information aggregation to improve semantic segmentation [52, 54, 55]. ParseNet [56] applies average pooling to the full image to capture global contextual information. Spatial pyramid pooling (SPP) [57] inspired the use of pyramid pooling to aggregate multiscale contextual information, as in the pyramid pooling module [51] and the atrous spatial pyramid pooling (ASPP) module [53, 58]. DenseASPP [59] generates dense connections to acquire a larger receptive field. To empower the ASPP module, Xie et al. [60] introduced vortex pooling to utilize contextual information.
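To make the receptive-field claim concrete, the following pure-Python sketch (our own illustration, not code from any of the cited works) computes the receptive field of a stack of stride-1 convolutions, with and without dilation:

```python
def receptive_field(layers):
    """Receptive field of a stack of stride-1 conv layers.

    Each layer is a (kernel_size, dilation) pair. The effective kernel
    size of a dilated conv is d*(k-1)+1, and stacking stride-1 layers
    grows the receptive field additively.
    """
    rf = 1
    for k, d in layers:
        rf += d * (k - 1)
    return rf

# Three 3x3 convs: same parameter count in both settings, but the
# dilated stack sees a much larger context window.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
print(receptive_field([(3, 1), (3, 2), (3, 4)]))  # 15
```

This is why dilation enlarges the receptive field "for free": the parameter count depends only on the kernel size, not on the dilation rate.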

3. Methodology

3.1. Architecture

In this work, we propose a point-wise pyramid attention network (PPANet) for semantic segmentation, as shown in Figure 1. The network is built on an encoder-decoder framework. The encoder is similar to a classification network; it extracts features and encodes the input into compact representations, while the decoder recovers the corresponding representations. The squeeze-nbt unit in Figure 2 is the main building block of the stages in the encoder part. Each encoder stage has two squeeze-nbt blocks, and the feature map is downsampled by half in the first block of each stage using strided convolution (for more details, see Table 1). The network has two other parts, the point-wise pyramid attention module and an attention module, inserted between the encoder and the decoder. These central modules enrich the receptive field and provide sufficient context information. More details are discussed in the following sections.

3.2. Basic Building Unit

This subsection elaborates the squeeze-nbt module architecture (illustrated in Figure 2). It draws inspiration from several concepts introduced in recent state-of-the-art classification and segmentation models, such as the fire module in SqueezeNet [31], depthwise separable convolution [61], and dilated convolution [58]. Figure 2 shows the squeeze-nbt module and the encoder-decoder framework. We introduce a new module named the squeeze-nbt (squeeze nonbottleneck) module, based on a reduce-split-squeeze-merge strategy: it first uses point-wise convolution to reduce the feature maps and then applies a parallel fire module to learn useful representations. To make squeeze-nbt computationally efficient, we adopt dilated depthwise separable convolution instead of computationally expensive standard convolution.
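The saving from replacing a standard convolution with a depthwise separable one can be seen with a back-of-the-envelope parameter count (a hypothetical sketch of ours, not code from the paper; dilation changes the receptive field but not the parameter count):

```python
def conv_params(c_in, c_out, k):
    """Parameters of a standard k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def dw_separable_params(c_in, c_out, k):
    """Depthwise separable conv: a k x k depthwise conv (one filter
    per input channel) followed by a 1x1 point-wise conv."""
    return c_in * k * k + c_in * c_out

c_in, c_out, k = 64, 64, 3  # illustrative channel counts
print(conv_params(c_in, c_out, k))          # 36864
print(dw_separable_params(c_in, c_out, k))  # 4672
```

For these illustrative channel counts the separable variant uses roughly 8× fewer parameters, which is the efficiency the squeeze-nbt module relies on.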

3.3. Upsampling Module

Several methods such as [21–23, 62], transpose convolution [63], and bilinear upsampling have been used broadly to gradually upsample encoded feature maps. In this work, we propose an upsampling module that works as a decoder and refines the encoded feature maps by aggregating features of different resolutions. First, the low-level features are processed with a convolution while, in parallel, the high-level features are upsampled to match the features coming from the encoder; these features are then concatenated and refined with two consecutive convolutions.

3.4. Point-Wise Pyramid Attention (PPA) Module

Segmentation requires both a sizeable receptive field and rich spatial information. We propose the point-wise pyramid attention (PPA) module, which is effective for aggregating global contextual information. As shown in Figure 3, the PPA module consists of two parts: a nonlocal part and vortex pooling. On the one hand, the nonlocal part generates dense pixel-wise weights and extracts long-range dependencies; on the other hand, vortex atrous convolution is useful for detecting objects at multiple scales. By analysing vortex pooling and nonlocal dependency, we fuse the advantages of these two modules in a single module, the point-wise pyramid attention (PPA) module. The PPA module consists of three parallel vortex atrous convolution blocks with dilation rates of 3, 6, and 9 and one nonatrous convolution block.

The point-wise pyramid attention module is shown in Figure 3. Let $X \in \mathbb{R}^{W \times H \times C}$ be the input feature map, where $W$, $H$, and $C$ are its width, height, and channels, respectively. First, we apply two parallel convolution layers $f$ and $g$ to generate feature maps $F = f(X)$ and $G = g(X)$ with $C'$ channels, where $C'$ indicates the number of channels after the reduction.

Then, we calculate the similarity matrix of $F$ and $G$ by a matrix multiplication, $S = F G^{\top}$, after reshaping both to $(W \cdot H) \times C'$.

Lastly, softmax is applied to normalize the result and transform it into a self-attention-like mechanism: $A = \mathrm{softmax}(S)$.

4. Experimental Results and Analysis

In this section, comprehensive experiments on the KITTI road segmentation dataset [64] and the Camvid dataset [65] are carried out to evaluate the efficiency and effectiveness of the proposed architecture. Firstly, the datasets and the implementation protocols are introduced. We then elaborate on the loss function and the evaluation metrics used for KITTI, followed by ablation studies and comparisons with SOTA models. Finally, we report a comparison on the Camvid dataset.

4.1. Datasets and Implementation Details
4.1.1. Datasets

(1) KITTI Road Segmentation Dataset. It consists of 289 training images with their corresponding ground truth. The benchmark is divided into three categories: urban marked (UM) with 95 frames, urban multiple marked lanes (UMM) with 96 frames, and urban unmarked (UU) with the remaining 98 frames. The small number of frames and difficult lighting conditions make the dataset very challenging. In total, it has 290 frames for testing (testing frames have no ground truth information). Training and testing frames were extracted from the KITTI dataset [64] at a minimum spatial distance of 20 m. Each image has a resolution of approximately 1242×375 pixels. We split the dataset into three subsets: (a) 173 training images, (b) 58 validation images, and (c) 58 testing images.

(2) Camvid Dataset. Camvid is an urban street scene understanding dataset for autonomous driving. It consists of 701 samples: 367 training samples, 101 validation samples, and 233 test samples, with 11 semantic categories such as building, road, sky, and bicycle; a 12th class contains unlabelled data that we ignore during training. The original image resolution is 960×720; images are downsampled to 360×480 before training for a fair comparison. A weighted categorical cross-entropy loss is used to compensate for underrepresented categories in the dataset.

4.1.2. Implementation Details

All experiments were implemented on one GTX 1080Ti with CUDA 10.2 and cuDNN 8.0 using the PyTorch [58] deep learning framework. The Adam [59] optimizer, a stochastic gradient-based optimizer, is used with an initial learning rate $lr_0$ to train the KITTI road segmentation and Camvid datasets. The learning rate is adjusted according to Equation (3), $lr = lr_0 \times \gamma^{\lfloor e / s \rfloor}$, where $lr_0$ is the initial learning rate, $\gamma$ is a factor that controls the learning rate drop, $s$ is the number of epochs after which the learning rate is decreased, and $e$ is the current epoch. In the PPANet implementation, the learning rate is reduced by a factor of $\gamma$ every 15 epochs. The network runs for a maximum of 300 epochs. Normal weight initialization [45] is used to initialize the model. Finally, $L_2$ regularization is applied to deal with model overfitting.
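The step-decay schedule described above can be written as a few lines of Python. This is our reading of the schedule (a drop by a fixed factor every fixed number of epochs); the concrete values below, a drop factor of 0.5 and an initial rate of 1e-3, are illustrative assumptions, since the paper leaves them unspecified here:

```python
def step_lr(lr0, gamma, step_size, epoch):
    """Step decay: multiply the learning rate by gamma once every
    step_size epochs (our reading of Equation (3))."""
    return lr0 * (gamma ** (epoch // step_size))

# Hypothetical values: drop by 0.5 every 15 epochs from 1e-3.
print(step_lr(1e-3, 0.5, 15, 0))   # 0.001
print(step_lr(1e-3, 0.5, 15, 15))  # 0.0005
print(step_lr(1e-3, 0.5, 15, 30))  # 0.00025
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.StepLR` with `step_size=15` and the chosen `gamma`.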

(1) Loss Function. A wide range of loss functions has been proposed over the years for semantic segmentation tasks. For instance, binary cross-entropy (BCE) has been applied in much classification and segmentation research with remarkable success. Although it is convenient to train neural networks with BCE, it may not perform well under class imbalance; for instance, it does not perform well when used as the only loss function on KITTI road segmentation with PPANet. In this work, our total loss, Equation (6), is a combination of the dice loss, Equation (5), which was proposed in Zhou et al. [62], and binary cross-entropy, Equation (4). Let $p$ be the prediction given by a sigmoid nonlinearity and let $g$ be the corresponding ground truth. Dice loss has been implemented in different forms in the literature; for instance, the definitions in [62, 63] are equivalent except for the denominator. Our experiments found that the dice loss variant that uses the sums of squared probabilities and ground-truth values in the denominator performs better. These functions are defined as follows.

The binary cross-entropy:
$$L_{\mathrm{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ g_i \log p_i + (1 - g_i) \log (1 - p_i) \right].$$

Dice loss:
$$L_{\mathrm{Dice}} = 1 - \frac{2 \sum_{i=1}^{N} p_i g_i}{\sum_{i=1}^{N} p_i^2 + \sum_{i=1}^{N} g_i^2}.$$

Total loss:
$$L = L_{\mathrm{BCE}} + L_{\mathrm{Dice}}.$$
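A minimal pure-Python sketch of the combined loss, under our assumptions (flattened prediction and ground-truth vectors, a small epsilon for numerical stability, and an unweighted sum of the two terms), is:

```python
import math

def bce_loss(p, g, eps=1e-7):
    """Mean binary cross-entropy over flattened predictions p in (0,1)
    and binary ground truth g."""
    n = len(p)
    return -sum(gi * math.log(pi + eps) + (1 - gi) * math.log(1 - pi + eps)
                for pi, gi in zip(p, g)) / n

def dice_loss(p, g, eps=1e-7):
    """Dice loss with squared terms in the denominator, the variant
    the text reports as performing better."""
    inter = sum(pi * gi for pi, gi in zip(p, g))
    denom = sum(pi * pi for pi in p) + sum(gi * gi for gi in g)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def total_loss(p, g):
    """Unweighted sum of BCE and dice (an assumption of this sketch)."""
    return bce_loss(p, g) + dice_loss(p, g)

p = [0.9, 0.8, 0.1, 0.2]  # sigmoid outputs
g = [1.0, 1.0, 0.0, 0.0]  # ground truth
print(round(dice_loss(p, g), 4))  # 0.0286
```

The dice term directly rewards overlap between prediction and ground truth, which compensates for the class imbalance that BCE alone handles poorly.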

4.1.3. Evaluation Metrics on KITTI

Precision and recall are among the most common metrics for evaluating binary classification. Following the methods used in [64, 66, 67], we evaluate our segmentation model using precision, Equation (7), recall, Equation (8), and the $F_\beta$-measure, Equation (11):
$$\mathrm{PRE} = \frac{TP}{TP + FP}, \qquad \mathrm{REC} = \frac{TP}{TP + FN}, \qquad F_\beta = (1 + \beta^2)\,\frac{\mathrm{PRE} \cdot \mathrm{REC}}{\beta^2 \cdot \mathrm{PRE} + \mathrm{REC}},$$
where $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, respectively.
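These metrics are straightforward to compute from binary masks; a small sketch (our own illustration on flattened masks, with $\beta = 1$ by default) follows:

```python
def precision_recall_f(pred, gt, beta=1.0):
    """Pixel-wise precision, recall and F_beta from flattened binary
    masks (lists of 0/1 values)."""
    tp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gt) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gt) if p == 0 and g == 1)
    pre = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    if pre + rec == 0:
        return pre, rec, 0.0
    f = (1 + beta ** 2) * pre * rec / (beta ** 2 * pre + rec)
    return pre, rec, f

pred = [1, 1, 0, 0, 1]
gt   = [1, 0, 0, 1, 1]
print(tuple(round(x, 3) for x in precision_recall_f(pred, gt)))
# (0.667, 0.667, 0.667)
```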

4.1.4. KITTI Data Augmentation

Data augmentation comprises a wide range of techniques used to extend the training samples by applying random perturbations and jitters to the original data. In our model, an online data augmentation approach helps the model learn more robust features and increases generalizability by preventing the model from seeing the same image twice, as a slightly different random modification of the input data is performed each time. We therefore perform a series of data transformations to deal with typical changes in road images, such as texture, colour, and illumination changes. In particular, we implemented normalization, blurring, and illumination changes. Data augmentation lends itself naturally to computer vision; we acquire additional training data from the original KITTI road segmentation images by applying the following transforms:
(1) Transformations applied to both the image and the ground truth:
 (i) Geometric transformations that alter point positions, such as translation, scaling, and rotation
 (ii) Mirroring (horizontal flip)
(2) Transformations applied to the image only, since they affect only pixel values:
 (i) Normalizing the input image by standardizing each pixel using Equation (13)
 (ii) Random brightness adjustment
 (iii) Gaussian blur
 (iv) Random noise
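The key distinction above, transforms that must be applied to the image and mask jointly versus image-only photometric transforms, can be sketched as follows (a toy illustration on nested lists; the function names and the brightness range are our assumptions):

```python
import random

def hflip(image, mask):
    """Geometric transform: mirror the image AND its ground-truth mask
    together, so pixel labels stay aligned."""
    return [row[::-1] for row in image], [row[::-1] for row in mask]

def random_brightness(image, max_delta=0.1):
    """Photometric transform applied to the image only: add a random
    offset and clamp each pixel to [0, 1]. The mask is untouched."""
    delta = random.uniform(-max_delta, max_delta)
    return [[min(1.0, max(0.0, px + delta)) for px in row] for row in image]

img  = [[0.1, 0.9], [0.5, 0.3]]
mask = [[0, 1], [1, 0]]
fi, fm = hflip(img, mask)
print(fi)  # [[0.9, 0.1], [0.3, 0.5]]
print(fm)  # [[1, 0], [0, 1]]
```

Applying a geometric transform to the image without the mask would silently corrupt the training labels, which is why the two groups are kept separate.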

(1) Mathematical Morphology. Applying deep learning methods to road segmentation sometimes produces noise: nonroad pixels can be classified as road and vice versa. Several mathematical morphology techniques can be used to remove this noise and improve the model's performance at test time. An opening operation with square structuring elements was used. It helped eliminate some of the nonroad pixels classified as road (false positives), as illustrated in Figure 4, where (a) shows the model's output with visible noise beside the road and (b) shows the effect of removing these false positives.
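Opening is an erosion followed by a dilation; small isolated blobs vanish under the erosion and never come back, while large regions survive almost unchanged. A minimal pure-Python sketch with a 3×3 square structuring element (our own illustration; border pixels are treated as background, and real pipelines would use a library such as `scipy.ndimage.binary_opening` instead):

```python
def erode(grid):
    """Binary erosion with a 3x3 square structuring element.
    Border pixels are treated as background for simplicity."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            out[i][j] = int(all(grid[i + di][j + dj]
                                for di in (-1, 0, 1) for dj in (-1, 0, 1)))
    return out

def dilate(grid):
    """Binary dilation with a 3x3 square structuring element."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            out[i][j] = int(any(grid[i + di][j + dj]
                                for di in (-1, 0, 1) for dj in (-1, 0, 1)
                                if 0 <= i + di < h and 0 <= j + dj < w))
    return out

def opening(grid):
    """Opening = erosion then dilation: removes small false-positive
    blobs while roughly preserving large regions."""
    return dilate(erode(grid))

# A 5x5 mask: a solid 3x3 "road" block plus one isolated noise pixel.
mask = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0],
]
print(opening(mask)[0][0])  # 0: the isolated noise pixel is removed
```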

4.2. Ablation Study
4.2.1. Encoder

We carried out ablation studies to highlight the effectiveness of the proposed model structure. The baseline achieves a 96.3% $F$-measure and 96.2% AP; we then ran experiments with different dilation rate settings. First, we gradually increased the dilation rates to 3, 9, and 13 in encoder stages 2, 3, and 4, respectively, which resulted in a decrease of 6.32% in $F$-measure and 4.25% in average precision. To further examine the effectiveness of our method across a range of dilation rates, we employed another combination, 2, 4, and 8, which yielded the lowest result in the ablation experiments, an 89.36% $F$-measure and 78.9% AP; this combination of dilation rates is not effective for the PPANet encoder. Finally, we tested the model with a dilation rate of 2 in all three stages of the encoder, which yielded the best outcome, as shown in Table 2. Therefore, we set the dilation rate to 2 for all three stages of the encoder.

4.2.2. Decoder

We tested two settings in the decoding part. First, the upsampling unit, comprising bilinear upsampling and convolution, restores high-resolution features from low-resolution features; this approach achieved good results. However, some information is still lost while downsampling the feature maps in the encoding process. To retain as much global context as possible, we designed the point-wise pyramid attention module to increase the model's prediction performance. With both the PPA module and the upsampling unit, the decoder aggregates information through a fusion of multiscale features and therefore effectively captures local and global context (see Table 2 for further comparison). It can be seen that the proposed upsampling unit and the point-wise pyramid attention (PPA) module improve PPANet's segmentation and help it achieve superior AP, precision, recall, and $F$ score compared with the other settings.

4.3. Comparing with the SOTA

In this subsection, we present the overall qualitative and quantitative assessment of the trained model. Training and evaluation were conducted on the KITTI road segmentation and Camvid datasets, and the model is then compared with selected SOTA models. Table 3 shows the comparison of PPANet with other SOTA models on the KITTI road segmentation dataset. PPANet is designed for road scene understanding and is trained end-to-end. As previously stated, the dataset has a limited amount of data divided into three categories, urban marked (UM road), urban multiple marked lanes (UMM road), and urban unmarked (UU road), treated as one category to help, alongside data augmentation, overcome model overfitting. To rank the compared models, we report precision (PRE), recall (REC), and the $F$-measure, which are well-known metrics for evaluating binary semantic segmentation approaches. The chosen state-of-the-art models include SegNet [23], ENet [21], FastFCN [68], LBN-AA [66], DABNet [67], and AGLNet [69]. The overall results of PPANet and the other SOTA models are reported in Table 3. Our PPANet obtained the highest scores on all metrics, demonstrating the effectiveness of the proposed method for robust road detection; FastFCN ranked second in terms of precision, while AGLNet achieved a better $F$-measure than the remaining baselines. ENet, which is designed for speed, achieved the lowest results of all the compared models.

For a qualitative evaluation of our model on road segmentation, visual examples of PPANet predictions on the KITTI test set are presented in Figures 5–7 in perspective view for UM, UMM, and UU, respectively. The urban marked road (Figure 5) obtained the best predictions, with almost no misclassification. For urban unmarked roads, there are slightly noisy predictions that could be improved with postprocessing optimization techniques such as CRF or by increasing the amount of data. The urban multiple marked lanes category has a larger misclassified road area, with some regions outside the road predicted as road. These false positive detections mainly occur around railway poles close to the road, and road detection is also affected by shadows. Overall, our model, with only 3.01 M parameters and without pretrained weights, obtained excellent results on a small dataset such as KITTI road segmentation.

4.4. Comparison with SOTA Models on Camvid

In this subsection, we design an experiment to demonstrate the effectiveness and validity of the proposed network on the Camvid dataset. We train and validate the model on the training and validation images for 400 epochs, then test it on the testing images and report the results in Table 4 in terms of mean intersection over union (mIoU). From Table 4, we can see that the proposed PPANet has superior performance in terms of mIoU. First, the model is compared with models designed for real-time semantic segmentation, such as ENet [21], BiSeNetV1 [70], CGNet [71], NDNet45-FCN8-LF [72], LBN-AA [66], DABNet [67], and AGLNet [69]. We also compare the proposed method with non-real-time models such as DeepLabv2 [58], PSPNet [51], DenseDecoder [73], and SegNet [23]. In addition, we present the per-category results on the Camvid test set in Table 5. As can be seen, the proposed method obtains better accuracy in most classes. Visual results are provided in Figure 8.

5. Conclusion

This paper has presented an approach to scene understanding in monocular images: a novel encoder-decoder network for effective semantic segmentation, named PPANet. The encoder adopts split and squeeze operations in the residual layer to enhance information propagation and feature reuse. To refine the encoded feature maps effectively, we design a decoder consisting of an upsampling unit and the point-wise pyramid attention (PPA) module. The PPA module is inserted in the centre to enrich the receptive field and aggregate global contextual information, and an attention mechanism refines the prediction using a sequence of depthwise convolutions followed by a sigmoid. The interaction among features from the upsampling unit, the PPA module, and the attention module guides high-level and low-level features to improve performance. The network is trained end-to-end on two popular datasets, KITTI road segmentation and Camvid. The experimental results show that the proposed method improves the state of the art for road segmentation on small datasets such as KITTI and Camvid. Future work includes using pretrained weights, as has become the paradigm for most SOTA models in this field, and investigating the incorporation of other sensors such as LiDAR into the architecture to test the effectiveness of our approach for data fusion and 3D road segmentation.

Data Availability

This study used the Camvid dataset and the KITTI road segmentation dataset.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The study was funded by the Fujian province Innovation Strategy Research Program (No. 2020R01020196) and Yongtai Artificial Intelligence Institute.