Semantic Segmentation Algorithm Based on Attention Mechanism and Transfer Learning
In this paper, we propose a semantic segmentation algorithm (RoadNet) that combines an attention mechanism with an auxiliary edge detection task. RoadNet improves the dispersion of the low-level features of the network model and further enhances the performance and applicability of the semantic segmentation algorithm. In RoadNet, a fully convolutional neural network is used as the basic model and is combined with the auxiliary loss used in image classification, multitask learning from machine learning, and the attention mechanism from natural language processing. To improve the generalization of the model, we select and analyze a proper domain difference measure. Subsequently, the context semantic distribution module and the annotation distribution loss are designed based on the context semantic encoding structure. The domain discriminator based on adversarial training and the adversarial training algorithm based on transfer learning are then integrated to provide a transfer learning-based semantic segmentation algorithm (TransRoadNet). The experimental results indicate that the proposed TransRoadNet and RoadNet outperform their equivalent comparison models.
The application of deep learning methods to image classification has achieved remarkable results; see, e.g., [1–4]. Deep learning is also extensively applied in image semantic segmentation. For instance, Shelhamer et al. present a fully convolutional neural network (FCN) for image segmentation. The combination of transfer learning and deep learning has also been used to introduce the concepts and methods of deep learning into a variety of research fields in the social and natural sciences [7–9]. However, in practical applications, the efficacy of deep learning based methods is challenged by the availability of enriched datasets, the inference accuracy, and the generalization performance of the deep learning models. In this paper, we address these three challenges:
(1) Although PASCAL VOC2012, CIFAR-10/100, and the Cityscapes dataset provide powerful training data sources, multi-view observation scenes still need to be constructed for complex urban road images. In this paper, a high-altitude road monitoring dataset is formed using Eagle Eye, and a virtual dataset is built from virtual images collected via the communication between the graphics library and the game. Hereafter, we refer to these enriched datasets as Surv-Cityscapes and Virt-Cityscapes.
(2) Using the attention mechanism, and to improve the dispersion of the low-level features of the network, an auxiliary task branch for edge detection is designed based on the shape and edge information of objects. Meanwhile, an auxiliary task learning module and an attention identity residual network are constructed to form a semantic segmentation model, namely, RoadNet. To enlarge the receptive field of the semantic segmentation task, global pooling and comprehensive cascading are used to improve atrous spatial pyramid pooling (ASPP), yielding a cascaded atrous spatial pyramid pooling (CascadeASPP).
(3) We further investigate the transfer learning algorithm for RoadNet from the perspectives of domain difference measurement, semantic distribution loss, and adversarial learning and then design a semantic segmentation model based on transfer learning, namely, TransRoadNet. TransRoadNet effectively reduces the performance loss of the basic model, RoadNet, when it is migrated and deployed on different data (i.e., Cityscapes to Virt-Cityscapes and/or Surv-Cityscapes to Cityscapes).
2. Related Works
2.1. Attention Mechanism
Chen et al. apply conditional random fields to the FCN as a post-processing algorithm. Zhao et al. design a pyramid pooling module that aggregates the context information of different regions by combining four feature maps of different scales, which improves the capability of the neural network to obtain global information.
The Region-based Convolutional Neural Network (R-CNN) triggered the application of convolutional neural networks to target detection based on candidate regions. He et al. suggest using shared convolutions to speed up the calculation of R-CNN. Girshick designs the region-of-interest pooling layer based on spatial pyramid pooling, which pools regions of different sizes into a fixed-size feature vector. Ren et al. suggest handing over the task of finding candidate target areas to a deep convolutional neural network and propose the Region Proposal Network (RPN). Further, He et al. add a network branch based on RPN to predict the segmentation mask of the target object, expanding the method from simple target recognition to instance segmentation. Current research results are greatly influenced by these ideas [20, 21].
In the above works, the attention mechanism is often utilized to explicitly model the interdependence between semantic features, here between the output of the residual function and its input $x_l$. This is done by combining the residual module with the self-attention mechanism into the attention residual module (ARM). Because the channel maps of related semantics are adaptively enhanced, the ARM can replace the feature fusion in the original residual network and further enhance the ability of the residual module to express the relevant semantics.
2.2. Receptive Fields and Auxiliary Tasks
From the multitasking perspective, Badrinarayanan et al. introduce the encoder-decoder structure into the FCN, where the pooling layer indices are retained in the encoding stage to store more image information. In the decoding stage, the pooling layer indices are used to restore the lost image information.
Holmstrom indicated that the occasional addition of noise during training can enhance the generalization capability of the network model. In contrast to other methods, which focus on enhancing the training effect of the auxiliary tasks, RoadNet focuses on enhancing the training effect of the main task. In this context, edge detection is considered as the auxiliary task. The low-level shared network mainly considers the edge and shape information of the object; hence, it can obtain more features regarding the differences between object categories. The annotated images required for edge detection can be derived directly from the annotated images for semantic segmentation.
Regarding the receptive field, Yu and Koltun showed that FCN upsampling is unable to restore the information lost through pooling-layer downsampling. To address this issue, they suggest atrous convolution, which extends the range of the original convolution without downsampling and thus increases the receptive field of the network. Atrous spatial pyramid pooling (ASPP) has also been applied by several researchers.
Here, we combine the cascading idea in DenseASPP with the global pooling branch in ASPPv2 and propose the cascade atrous spatial pyramid pooling (CascadeASPP). In our proposed design, the atrous convolution of multiple atrous rates is connected step by step. It provides a larger receptive field, improves the pixel sampling density of the atrous convolution, and hence forms more receptive fields to provide a higher level of size invariance. Moreover, to tackle the degradation problem of atrous convolution, here the global context information is obtained through the global pooling branch.
2.3. Transfer Learning and GAN
To find a suitable domain difference measure, we train the following three methods on the basic FCN network and RoadNet with the above-mentioned three transferred datasets:
(1) Correlation Alignment (CORAL), proposed by Sun et al., which is an unsupervised domain adaptive algorithm
(2) Maximum Mean Discrepancy (MMD), one of the most commonly used distance measures in transfer learning
(3) Contrastive Domain Discrepancy (CDD), which adds category information to MMD and hence measures the intraclass and interclass differences across domains
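To make the first two measures concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes features are given as `(samples, dimensions)` arrays, uses the Deep CORAL normalization `1 / (4 d^2)`, and uses a linear kernel for MMD; the function names and the synthetic data are hypothetical. CDD is omitted because it additionally needs class labels.

```python
import numpy as np

def coral_loss(source, target):
    """CORAL: squared Frobenius distance between feature covariance matrices."""
    d = source.shape[1]
    cov_s = np.cov(source, rowvar=False)  # (d, d) covariance of source features
    cov_t = np.cov(target, rowvar=False)  # (d, d) covariance of target features
    return float(np.sum((cov_s - cov_t) ** 2) / (4.0 * d * d))

def linear_mmd(source, target):
    """MMD with a linear kernel: squared distance between feature means."""
    delta = source.mean(axis=0) - target.mean(axis=0)
    return float(delta @ delta)

# Synthetic stand-ins for source-domain and target-domain features (assumption).
rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(200, 8))
tgt = rng.normal(0.5, 2.0, size=(200, 8))

print(coral_loss(src, src), linear_mmd(src, src))  # identical domains give zero
print(coral_loss(src, tgt), linear_mmd(src, tgt))  # shifted domains give positive values
```

Either quantity can be added to the task loss as a penalty, so that minimizing it pulls the source and target feature statistics together.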
Feature-representation-based transfer finds a good shared representation through feature transformation. Context semantic encoding (CSE) captures the global context information of the scene and strengthens the scene-related feature maps. Nevertheless, context semantic encoding has an obvious defect: it only predicts the existence of each category as prior knowledge of the scene. Hence, a semantic distribution loss is proposed to replace the semantic encoding loss. With the proposed semantic distribution loss, the model must predict both the existence of the categories and the proportion of each category in the image, which adds more prior knowledge about scenes and the relationship between categories to the model.
The generative adversarial network (GAN) is a network model proposed by Goodfellow et al. By using a discriminator network instead of a directly specified loss function, it can better grasp global information. TransRoadNet integrates the domain adversarial idea of GAN and replaces the image generation network in GAN with source and target domain feature extractors that extract the image features. The task of the discriminator network is to determine whether the extracted image features come from source domain or target domain images. The encoder extracts domain-invariant features as far as possible so that the discriminator cannot distinguish between the two domains. Meanwhile, the discriminator tries to distinguish the two domains as well as possible, which results in adversarial training.
3. Network Structure
3.1. Attention Residual Module
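The original residual module is shown in Figure 1(a) and can be written as

$$y_l = h(x_l) + F(x_l, W_l), \qquad x_{l+1} = f(y_l), \tag{1}$$

where $x_l$ and $x_{l+1}$ represent the input and output of the l-th layer, respectively, $F$ is the residual function with weights $W_l$, $h(x_l) = x_l$ denotes the identity mapping function, and $f$ is the rectified linear activation function. Although the identity mapping function in the residual module can ensure no loss in the information flow, the information flow of the entire network still suffers loss due to the activation function. Therefore, $f$ is also made an identity mapping function to obtain an enhanced residual module, namely, the identity residual module, which ensures that information flows between the layers without loss (Figure 1(b)).
The mathematical expression is as follows:

$$x_{l+1} = x_l + F(x_l, W_l), \qquad x_L = x_l + \sum_{i=l}^{L-1} F(x_i, W_i), \tag{2}$$

where $x_L$ is the input of any deeper layer $L > l$.
Based on the backpropagation chain rule, the following partial derivative of the loss $\varepsilon$ is obtained:

$$\frac{\partial \varepsilon}{\partial x_l} = \frac{\partial \varepsilon}{\partial x_L}\,\frac{\partial x_L}{\partial x_l} = \frac{\partial \varepsilon}{\partial x_L}\left(1 + \frac{\partial}{\partial x_l}\sum_{i=l}^{L-1} F(x_i, W_i)\right). \tag{3}$$

Equation (3) indicates that the loss gradient can be transferred to any residual module without loss. Likewise, the loss gradient of any residual module can be passed without loss to the remaining residual modules; hence, the probability of vanishing gradients is reduced.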
Nevertheless, if each channel of the feature map is regarded as the semantic feature response map of a segmentation target, there must be a correlation between the response maps of the various segmentation targets in the image. The semantic features of $x_l$ and $F(x_l, W_l)$ in the residual module are not consistent and should not be added directly. Hence, the self-attention mechanism is inserted into the fusion of $x_l$ and $F(x_l, W_l)$ in the identity residual module to explicitly model the interdependence between semantic features. Using the interdependence between the channels, it is possible to emphasize interdependent features and improve the representation of specific semantic features.
The input feature map of the attention residual module (Figure 2) is $x_l \in \mathbb{R}^{C \times H \times W}$. A new feature map, $F \in \mathbb{R}^{C \times H \times W}$, is then obtained after two rounds of batch normalization, convolution, and activation. Next, $x_l$ and $F$ are each reshaped into $\mathbb{R}^{C \times HW}$. Matrix multiplication is performed between $F$ and the transpose of $x_l$, and after exponential normalization (softmax) the channel attention map $A \in \mathbb{R}^{C \times C}$ is obtained:

$$a_{ji} = \frac{\exp(x_i \cdot F_j)}{\sum_{i=1}^{C} \exp(x_i \cdot F_j)},$$

where $a_{ji}$ represents the influence factor of the i-th channel of $x_l$ on the j-th channel of $F$. Matrix multiplication is then conducted on $A$ and $x_l$, and the result is reshaped to $\mathbb{R}^{C \times H \times W}$ as the improved feature map. The final output feature map, $x_{l+1}$, is obtained by the element-wise addition of the improved feature map and $F$.
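The channel attention fusion described above can be sketched in NumPy as follows. This is an illustrative sketch under assumptions rather than the authors' code: the operand order (attention computed between the residual branch output and the module input, applied to the input, then added to the residual branch output) is one reading of the description, and the function name `channel_attention_fusion` is hypothetical.

```python
import numpy as np

def channel_attention_fusion(x, f):
    """Fuse the module input x and residual branch output f, both (C, H, W).

    Returns the fused feature map (C, H, W) and the (C, C) attention map.
    """
    c, h, w = x.shape
    x2 = x.reshape(c, h * w)                     # flatten spatial dimensions
    f2 = f.reshape(c, h * w)
    energy = f2 @ x2.T                           # energy[j, i] = <f_j, x_i>
    energy -= energy.max(axis=1, keepdims=True)  # subtract row max for stability
    attn = np.exp(energy)
    attn /= attn.sum(axis=1, keepdims=True)      # softmax over input channels i
    refined = (attn @ x2).reshape(c, h, w)       # attention-weighted input channels
    return refined + f, attn                     # element-wise addition

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3, 3))
f = rng.normal(size=(4, 3, 3))
out, attn = channel_attention_fusion(x, f)
print(out.shape, attn.sum(axis=1))  # each attention row sums to 1
```

Compared with the plain sum of the input and the residual branch output, each output channel here receives a weighted mixture of all input channels, which is how the inter-channel dependence enters the fusion.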
3.2. RoadNet and Auxiliary Edge Detection Tasks
Figure 3 shows the RoadNet structure. The training signal of the auxiliary task carries domain-specific information that improves the generalization of the main task. Following the pyramid network structure of FPN [32, 33], a semantic segmentation network model is designed to test the auxiliary tasks, including a top-down basic network, horizontal connections, and a bottom-up edge detection auxiliary network. Accurate edge detail information is obtained from the shallow features, and semantic information is obtained from the deep features. Consequently, the lack of image detail information in the original semantic segmentation network is remedied.
The network takes an image of any size as input and then calculates a feature map of multiple scaling ratios using the basic network. The network is also divided into five stages based on the size of the feature map. The relative scale of the feature map output by the last residual module to the input image in each stage is 4, 8, 16, and 32, respectively.
The edge detection auxiliary network restores resolution by upsampling the high-level feature maps of the feature pyramid. The basic network is connected with the edge detection auxiliary network through horizontal connections that merge feature maps of the same size. Furthermore, the annotated images for the edge detection auxiliary network are obtained from the semantic segmentation annotations using the Canny algorithm.
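As an illustration of how edge annotations can be derived from segmentation annotations, the sketch below marks a pixel as an edge when a 4-connected neighbour carries a different class label. This is a simplified stand-in chosen to keep the example dependency-free; it is not the Canny algorithm used in the paper, and the function name `edges_from_labels` is hypothetical.

```python
import numpy as np

def edges_from_labels(labels):
    """labels: (H, W) integer class map. Returns an (H, W) uint8 edge map."""
    edge = np.zeros(labels.shape, dtype=bool)
    diff_v = labels[1:, :] != labels[:-1, :]  # label change between vertical neighbours
    diff_h = labels[:, 1:] != labels[:, :-1]  # label change between horizontal neighbours
    edge[1:, :] |= diff_v                     # mark both pixels on each side of a change
    edge[:-1, :] |= diff_v
    edge[:, 1:] |= diff_h
    edge[:, :-1] |= diff_h
    return edge.astype(np.uint8)

labels = np.zeros((4, 4), dtype=np.int64)
labels[:, 2:] = 1                             # left half class 0, right half class 1
edges = edges_from_labels(labels)
print(edges)                                  # columns 1 and 2 are marked as edges
```

Because the edge map is computed from the existing labels, the auxiliary task needs no additional manual annotation.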
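The loss function of the edge detection network is the multi-class empirical cross-entropy. The predicted feature map is first normalized exponentially (softmax). The calculation formula is

$$p_{i,j}^{n} = \frac{\exp\left(x_{i,j}^{n}\right)}{\sum_{c=1}^{C} \exp\left(x_{i,j}^{c}\right)}.$$

Then, the cross-entropy is calculated as

$$L_{edge} = -\frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\sum_{n=1}^{C} \mathbb{1}\left(y_{i,j} = n\right)\log p_{i,j}^{n},$$

where $y_{i,j}$ is the label of pixel (i, j) in the annotated image, $p_{i,j}^{n}$ represents pixel (i, j) of the n-th channel after exponential normalization, $x_{i,j}^{n}$ is pixel (i, j) of the n-th channel of the predicted feature map, M is the image length, N is the image width, and C is the number of categories.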
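The two steps above (exponential normalization, then cross-entropy) can be sketched in NumPy as follows. The sketch assumes integer class labels and averages the loss over all pixels; the averaging convention and the function name `pixel_cross_entropy` are assumptions, not details confirmed by the paper.

```python
import numpy as np

def pixel_cross_entropy(logits, labels):
    """logits: (C, M, N) raw scores; labels: (M, N) ints in [0, C).

    Returns the mean softmax cross-entropy over all pixels.
    """
    shifted = logits - logits.max(axis=0, keepdims=True)       # numerical stability
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    m, n = labels.shape
    rows, cols = np.meshgrid(np.arange(m), np.arange(n), indexing="ij")
    picked = log_prob[labels, rows, cols]                      # log-probability of the true class
    return float(-picked.mean())

logits = np.zeros((2, 3, 3))                                   # uniform scores over 2 classes
labels = np.zeros((3, 3), dtype=np.int64)
loss = pixel_cross_entropy(logits, labels)
print(loss)                                                    # equals log(2) for uniform predictions
```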
3.3. CascadeASPP
ASPP gains a large receptive field; however, a great deal of image information is lost in the calculation due to the low pixel sampling rate. For example, the receptive field is 13 × 13 for a 3 × 3 atrous convolution with an atrous rate of 6; however, only 9 pixels are sampled for calculation, so the pixel sampling rate is 9/169 ≈ 0.05. By connecting two 3 × 3 convolutions with an atrous rate of 3 in series, the receptive field is also 13 × 13, while 25 pixels are sampled for calculation. The pixel sampling rate is then 25/169 ≈ 0.15, which is nearly three times that of the former. With a higher atrous rate, this sparse-sampling effect becomes more pronounced; the proposed model is designed to overcome it.
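The arithmetic in this example can be checked with a short script. The sketch below enumerates, along one axis, the input offsets reached by a stack of atrous convolutions and derives the receptive field and the number of sampled pixels; the helper name `stacked_atrous_stats` is hypothetical, and an odd kernel size is assumed.

```python
def stacked_atrous_stats(rates, kernel=3):
    """Return (receptive field side, sampled positions per side) for a stack
    of kernel x kernel atrous convolutions with the given atrous rates."""
    offsets = {0}
    for rate in rates:
        # Tap positions of one atrous convolution, relative to its centre.
        taps = [rate * (k - kernel // 2) for k in range(kernel)]
        offsets = {o + t for o in offsets for t in taps}
    side = max(offsets) - min(offsets) + 1
    return side, len(offsets)

for rates in ([6], [3, 3]):
    side, per_side = stacked_atrous_stats(rates)
    sampled = per_side ** 2
    print(rates, side, sampled, round(sampled / side ** 2, 3))
# [6]    -> side 13,  9 sampled pixels, density 0.053
# [3, 3] -> side 13, 25 sampled pixels, density 0.148
```

The ratio of the two densities is 25/9, about 2.8, which is the gain obtained by cascading two smaller rates instead of using one large rate.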
In CascadeASPP, the global pooling branch and all atrous convolutions are cascaded (Figure 4). After 1 × 1 convolution and batch normalization, the output of the global pooling branch is upsampled to the required spatial dimension and then merged, for feature fusion, with the outputs of the atrous convolutions with different atrous rates. Through the cascading of different atrous rates, 13 sizes of receptive fields are covered. At the same time, the pixels within the coverage of the atrous convolutions are sampled with a higher density.
3.4. Transfer Learning Mechanism
Using the feature-representation transfer technique, the difference between the target domain and the source domain is added to the loss function of the network model. Thus, the difference between the features of the target domain and the source domain is minimized through model training. After comparative testing, the MMD difference measure is selected for the loss function, and a context semantic distribution (CSD) module is designed. The structure is illustrated in Figure 5. The input feature map of the context semantic distribution module passes through two fully connected layers. One fully connected layer outputs the proportion of each category in the scene, i.e., the category distribution information; the semantic distribution loss is calculated between this category distribution information and that of the annotated image. The other fully connected layer outputs the scaling factors of the input feature map, and the input feature map is multiplied channel-wise by the scaling factors to form the output of the module. The aim is to strengthen the feature maps related to the current scene, based on the prior knowledge of the scene, and to weaken the feature maps that are not related to it.
The semantic distribution information of the annotated image is denoted as the feature vector $p$ with length $C$. Each value $p_c$ indicates the ratio of the pixels occupied by category $c$ in the annotated image to the total pixels of the image (i.e., the percentage of the image occupied by each category). The calculation formula is stated as

$$p_c = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} \mathbb{1}\left(Y_{i,j} = c\right),$$

where $\mathbb{1}(\cdot)$ equals 1 if its argument is true and 0 otherwise.
The semantic distribution information of the inference graph is then determined as

$$\hat{p}_c = \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} \mathbb{1}\left(\hat{Y}_{i,j} = c\right),$$

where C is the total number of categories in the source domain dataset, $Y$ is the annotated image, $\hat{Y}$ is the model inference graph, H is the pixel height of the annotated image, and W is the pixel width of the annotated image.
Using a multi-class cross-entropy loss function, the semantic distribution loss is determined from the semantic distribution information of the annotated image and the inference graph, which is

$$L_{sd} = -\sum_{c=1}^{C} p_c \log \hat{p}_c.$$
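A minimal NumPy sketch of the semantic distribution and its cross-entropy loss is given below. It computes class proportions by counting pixels, which matches the description of the annotated-image distribution; using hard predicted labels for the inference side is an assumption made for illustration (during training, a differentiable estimate such as the module's predicted distribution would be needed). The function names and the small `eps` constant are hypothetical.

```python
import numpy as np

def class_distribution(label_map, num_classes):
    """Fraction of pixels belonging to each class in an (H, W) label map."""
    counts = np.bincount(label_map.ravel(), minlength=num_classes)
    return counts / label_map.size

def semantic_distribution_loss(p, q, eps=1e-12):
    """Cross-entropy between the annotated distribution p and the predicted q."""
    return float(-np.sum(p * np.log(q + eps)))

annotation = np.array([[0, 0, 1, 1],
                       [0, 0, 1, 2]])
prediction = np.array([[0, 0, 1, 1],
                       [0, 1, 1, 2]])
p = class_distribution(annotation, 3)   # [0.5, 0.375, 0.125]
q = class_distribution(prediction, 3)   # [0.375, 0.5, 0.125]
print(p, q, semantic_distribution_loss(p, q))
```

The loss is smallest when the predicted proportions equal the annotated ones, so it penalizes a model that finds the right categories but misjudges how much of the scene each one occupies.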
To simplify the training process, a gradient inversion layer is added between the domain discriminator and the basic network as a connection layer. During forward propagation, the gradient inversion layer acts as an identity mapping: it performs no other operation, and the input is passed directly to the next layer. During backpropagation, the gradient inversion layer multiplies the gradient received from the next layer by −1 before passing it to the previous layer. The structure of TransRoadNet is illustrated in Figure 6.
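The behaviour of the gradient inversion layer can be sketched as a tiny framework-free layer with explicit forward and backward methods. The class name `GradientReversal` and the manual `backward` interface are illustrative assumptions; in a deep learning framework the same idea would be expressed through its custom-gradient mechanism.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; negates the gradient in the backward pass."""

    def forward(self, x):
        return x                 # input is passed through unchanged

    def backward(self, grad_output):
        return -grad_output      # gradient from the next layer multiplied by -1

grl = GradientReversal()
features = np.array([1.0, -2.0, 3.0])
print(grl.forward(features))                       # same values as the input
print(grl.backward(np.array([0.5, 1.0, -1.5])))    # sign-flipped gradient
```

With this layer in place, a single backward pass updates the discriminator to separate the domains while pushing the feature extractor in the opposite direction, which is why no alternating training loop is needed.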
4. Experiments
4.1. Datasets
Following the Cityscapes dataset, urban road traffic images are collected using surveillance video from an Eagle Eye camera as the source. Frames are extracted from the video at fixed intervals, and a total of 400 images are collected. Subsequently, the dataset is divided into a test set of 200 images and a training set of 200 images (i.e., a ratio of 5 : 5). This dataset is referred to as Surv-Cityscapes.
For the virtual dataset, Grand Theft Auto V (GTA5) is selected as the virtual environment for data collection. GTA5 is started from RenderDoc with a resolution of 1920 × 1080. The character is manipulated to drive a vehicle, and the first-person view is selected. In total, 4000 images are collected, and they are randomly split into 2000 test images and 2000 training images (a ratio of 5 : 5); this dataset is referred to as Virt-Cityscapes.
Principal component (PCA) jittering, random image cropping, and other algorithms are also used to augment the datasets and to create transfer learning datasets from these three city image datasets.
Three public datasets are also used in the experiments: PASCAL VOC2012, CIFAR-10, and CIFAR-100. The CIFAR-10 dataset consists of 60,000 32 × 32 color images from 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images in the dataset. The CIFAR-100 dataset includes 100 classes, each containing 600 images, of which 500 are training images and 100 are test images. The PASCAL VOC2012 dataset supports image recognition tasks such as classification, target detection, and semantic segmentation. In our experiments, we use the semantic segmentation sub-dataset of PASCAL VOC2012.
4.2. Attention Residual Module Testing
For the first residual module in the attention residual network, the channel correlation between $F(x_1, W_1)$ and $x_1$ is obtained. The channel correlation heatmap is visualized in Figure 7. Each cell shows the correlation between a pair of channels of the two feature maps, represented by the color of the cell. A higher (lower) correlation is shown by a lighter (darker) color. As seen in Figure 7, the correlation heatmap becomes significantly lighter after the attention mechanism is applied. This means that the correlation between the feature map channels is significantly enhanced; therefore, the correlation between features is improved by the attention residual module.
ResNet is obtained by stacking the original residual modules, IdentityResNet by stacking the identity residual modules, and AttentionResNet by stacking the attention residual modules. The weights are initialized using the Xavier algorithm, and the ResNet model trained on the ImageNet dataset is used as the pre-training model. The test results are presented in Table 1.
On the CIFAR-100 dataset, the 50-layer attention residual network is 1.45% lower than the original residual network of the same depth, and the 101-layer attention residual network is 1.49% lower than its original counterpart (Table 1). It is also seen that, using the attention residual module, the probability of convergence problems and degradation problems is greatly reduced.
4.3. CascadeASPP Testing
Here, we compare the FCN basic model, FCN-ASPP, and FCN-CascadeASPP on multiple datasets. The results of these comparisons are shown in Table 2. The evaluation metric of the model with the ASPP structure is greatly improved over the basic model on all datasets, and it is further improved by using CascadeASPP. These results suggest that the context mechanism is an important factor and that a larger receptive field is essential for capturing more contextual information and prior knowledge of the scene.
4.4. CSD Testing
The encoder in RoadNet is the top-down basic network, while the bottom-up semantic segmentation main network is the decoder. The context semantic encoding module and the context semantic distribution module are each added to the FCN and RoadNet, and the new models are referred to as FCN-CSD, FCN-CSE, RoadNet-CSD, and RoadNet-CSE. We examine these models on the three transfer learning datasets. The test results shown in Table 3 suggest the following:
(1) The context semantic distribution module yields only 0.2% and 0.4% performance improvement on the Surv-Cityscapes transfer learning dataset and around 3% performance improvement on both the Cityscapes and Virt-Cityscapes transferred datasets. The reason is the fixed recording position and angle of the Surv-Cityscapes surveillance camera. This results in a fixed image scene, and hence the prior knowledge of the scene is relatively simple. Nevertheless, the model performance is still improved by the context semantic distribution module, as it adds more scene prior information to the model.
(2) For the three transfer learning datasets and the proposed network model, a performance improvement of about 0.2% to 3% is achieved by adding the context semantic distribution module and transferring the model. The results also indicate a further performance improvement over the context semantic encoding module. This validates the effectiveness of the context semantic distribution module.
4.5. RoadNet Testing
To obtain a larger receptive field, CascadeASPP is added to the skip connections of RoadNet. For comparison with EncNet, the number of training epochs is kept consistent and set to 62500. Table 4 presents the test results. Compared with the basic FCN model, RoadNet improves by 18.9%, 29.3%, and 12.8% on the three datasets, respectively. Moreover, it improves by 1.7%, 2.1%, and 2.2% compared to the EncNet semantic segmentation model.
4.6. TransRoadNet Testing
To validate the model, the transfer learning-based semantic segmentation network model, TransRoadNet, is examined on the three transfer learning datasets. The training parameters are consistent with those of RoadNet, and the results are recorded in Table 5. The mean intersection over union (mIoU) of TransRoadNet is 62.7%, 30.6%, and 35.8% on the three datasets, respectively, which is 4.1%, 24.4%, and 12.7% higher than the model without transfer learning and 1.9%, 3.7%, and 4.4% higher than the common transfer learning algorithm.
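For reference, the segmentation metric reported above, read here as mean intersection over union, can be computed from a confusion matrix as in the following NumPy sketch. The function name `mean_iou` and the choice to skip classes absent from both prediction and annotation are assumptions; evaluation scripts differ in how they treat such classes.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean intersection over union for integer label maps of equal shape."""
    # conf[t, p] counts pixels with true class t that were predicted as class p.
    conf = np.bincount(num_classes * target.ravel() + pred.ravel(),
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    valid = union > 0                       # ignore classes that never appear
    return float((inter[valid] / union[valid]).mean())

target = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, target, num_classes=2)
print(score)  # class 0 IoU is 1/2, class 1 IoU is 2/3, so the mean is 7/12
```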
For Cityscapes transfer learning dataset, the performance of the transfer learning algorithm is only enhanced by about 4%. This is owing to the limitations of the urban transferred dataset itself. The deviation of the dataset and the performance loss are small; hence, the effect of the transfer learning algorithm is not as clear as it is in the remaining datasets.
Figure 8 shows the inference results of the semantic segmentation models based on transfer learning, where panel (a) shows the original image, panel (b) shows the annotated image, and panels (c), (d), and (e) show the inference graphs of the semantic segmentation models based on transfer learning. It is observed that TransRoadNet is noticeably better than the other transfer learning-based semantic segmentation models in terms of edge segmentation and target classification accuracy.
5. Conclusions
Based on Cityscapes, two datasets with different perspectives of urban roads, together with their transferred datasets, are constructed. RoadNet, designed based on ARM and CascadeASPP, possesses good portability and performance. TransRoadNet, based on CSD, shows higher performance in the experiments than the un-transferred RoadNet and the compared transfer learning algorithms.
Data Availability
All data included in this study are available upon request from the corresponding author.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (NSFC) under grant no. 61363066.
References
A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proceedings of the Neural Information Processing Systems, pp. 1106–1114, Lake Tahoe, NV, USA, December 2012.
K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proceedings of the International Conference on Learning Representations, pp. 1–14, San Diego, CA, USA, May 2015.
D. George, H. Shen, and E. Huerta, “Deep transfer learning: a new deep learning glitch classification method for advanced LIGO,” 2017, arXiv:1706.07446.
M. Long, H. Zhu, J. Wang, and M. I. Jordan, “Unsupervised domain adaptation with residual transfer networks,” in Proceedings of the Conference on Neural Information Processing Systems, pp. 136–144, Barcelona, Spain, December 2016.
L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Semantic image segmentation with deep convolutional nets and fully connected CRFs,” in Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, May 2016.
K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, 2014.
R. Girshick, “Fast R-CNN,” in Proceedings of the International Conference on Computer Vision, pp. 1440–1448, Las Condes, Chile, December 2015.
K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 99, no. 1, pp. 2980–2988, 2018.
F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” in Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, May 2016.
B. Sun, J. Feng, and K. Saenko, “Return of frustratingly easy domain adaptation,” in Proceedings of the 30th AAAI Conference on Artificial Intelligence, pp. 2058–2065, Phoenix, AZ, USA, February 2016.
G. Kang, L. Jiang, Y. Yang, and A. G. Hauptmann, “Contrastive adaptation network for unsupervised domain adaptation,” in Proceedings of the Computer Vision and Pattern Recognition, pp. 4893–4902, Long Beach, CA, USA, June 2019.
H. Zhang, K. J. Dana, J. Shi et al., “Context encoding for semantic segmentation,” in Proceedings of the Computer Vision and Pattern Recognition, pp. 7151–7160, Salt Lake City, UT, USA, 2018.
I. Goodfellow, J. Pouget-Abadie, M. Mirza et al., “Generative adversarial nets,” in Proceedings of the Neural Information Processing Systems, pp. 2672–2680, Montreal, Canada, December 2014.
K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in Proceedings of the European Conference on Computer Vision, pp. 630–645, Amsterdam, The Netherlands, October 2016.
T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the Computer Vision and Pattern Recognition, pp. 936–944, Honolulu, HI, USA, July 2017.
X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 249–256, Sardinia, Italy, May 2010.
Z. Zhang, X. Zhang, C. Peng, X. Xue, and J. Sun, “ExFuse: enhancing feature fusion for semantic segmentation,” in Proceedings of the European Conference on Computer Vision, pp. 273–288, Munich, Germany, September 2018.