Abstract

To address the low segmentation accuracy and blurred edges of buildings in high-resolution remote sensing images, an improved fully convolutional neural network based on SegNet is proposed. First, GELU, which performs well in deep learning tasks, is selected as the activation function to avoid neuron deactivation. Second, an improved residual bottleneck structure is used in the encoding network to extract more building features. Then, skip connections are used to fuse the low-level and high-level semantic features of the image to assist image reconstruction. Finally, an improved edge correction module is connected at the end of the decoding network to further correct the edge details of buildings and improve their edge integrity. Experiments on the Massachusetts building dataset show that the precision rate, recall rate, and F1 value reach 93.5%, 79.3%, and 81.9%, respectively, and that the comprehensive evaluation index, the F1 value, is improved by about 5% compared with the basic network.

1. Introduction

With the development of remote sensing technology, massive volumes of high-resolution remote sensing images provide a data foundation for research in the field of remote sensing [1, 2]. As the most important part of the national basic geographic database, buildings have very important research value in the fields of urban planning, change detection, and geographic information system construction. Building segmentation using high-resolution remote sensing images has always been both a focus and a difficulty of remote sensing research [3–5]. Traditional building segmentation methods are mostly based on conventional remote sensing image classification techniques, which cannot achieve high-precision, fully automated segmentation. With the development of deep learning in the field of computer vision, Shao and Cai [6] proposed Fully Convolutional Networks (FCN) for image segmentation tasks, which overcome the shortcomings of traditional image segmentation methods and have become the mainstream approach in image segmentation. Subsequently, researchers successively proposed image segmentation networks such as U-Net [7, 8] and SegNet [9, 10] on the basis of FCN. To improve building segmentation, many researchers in the field of remote sensing have made improvements based on the U-Net and SegNet networks. These methods either improve the feature extraction part of the network or combine the basic network with classical structures from other networks. They improve the segmentation accuracy of buildings, but the problem of edge blur caused by loss of details remains [11–13].

Therefore, based on the SegNet network, this paper modifies the activation function and designs a residual bottleneck structure that can extract multiscale features in parallel. Combined with skip connections and an improved edge correction module, these changes yield an improved deep semantic segmentation network, RsBR-SegNet (Residual + Boundary Refinement-SegNet), which is used to improve the accuracy and edge integrity of building segmentation in high-resolution remote sensing images and to provide a reference for the practical application of remote sensing image building segmentation.

2. Experimental Data

2.1. Introduction to Datasets

To verify the effectiveness and practicability of RsBR-SegNet in the task of building segmentation, experiments were carried out successively on the satellite dataset “Satellite dataset I (global cities)” [14] and the aerial remote sensing image dataset “Massachusetts Buildings Dataset” [15]. “Satellite dataset I (global cities)” contains 204 satellite remote sensing images of 512 × 512 pixels, with resolutions ranging from 0.3 m to 2.5 m. The “Massachusetts Buildings Dataset” consists of 151 aerial remote sensing images of the Boston area with a resolution of 1 m; each image is 1500 × 1500 pixels, and the data are divided into 137 training images, 10 test images, and 4 validation images.

2.2. Dataset Preprocessing and Expansion

In this paper, the “Satellite dataset I (global cities)” dataset is divided into a training set and a test set at a ratio of 4:1, without any transformation, and is used only to verify the effectiveness of the model in the task of building segmentation. Then, to demonstrate the practicability of the network model for remote sensing image building segmentation, and considering the limited computing power available, the “Massachusetts Buildings Dataset” was cropped and expanded. First, each 1500 × 1500 image in the original training set is cropped into 9 images of 512 × 512 pixels, and then the training set is expanded to 12330 images through a series of data augmentation operations such as translation, mirroring, rotation, and random combinations of these. The test set is only cropped, which yields 90 images of 512 × 512 pixels.
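The cropping step can be illustrated with a small sketch. Three 512-pixel tiles per side exceed 1500 pixels, so the nine crops must overlap; the sketch below assumes evenly spaced tiles, which is only one plausible layout and is not stated in the text. The helper names `tile_origins` and `crop_boxes` are hypothetical.

```python
def tile_origins(image_size, tile_size, tiles_per_side):
    """Evenly spaced start offsets so the tiles span one full side.

    Assumption: even spacing with overlap; the source does not say
    how the 9 crops were positioned.
    """
    if tiles_per_side == 1:
        return [0]
    span = image_size - tile_size  # total distance the window can slide
    return [round(i * span / (tiles_per_side - 1)) for i in range(tiles_per_side)]


def crop_boxes(image_size=1500, tile_size=512, tiles_per_side=3):
    """Return (top, left, bottom, right) boxes for a square grid of crops."""
    origins = tile_origins(image_size, tile_size, tiles_per_side)
    return [
        (top, left, top + tile_size, left + tile_size)
        for top in origins
        for left in origins
    ]


boxes = crop_boxes()
training_crops = 137 * len(boxes)  # crops before augmentation
test_crops = 10 * len(boxes)       # the test set is cropped only
```

Under this layout, 137 training images give 1233 crops, so the reported 12330 training images correspond to a tenfold expansion through augmentation, and the 10 test images give the reported 90 crops.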

3. The Working Principle of the RsBR-SegNet Network Model

In order to improve the segmentation of building edges and details, this paper improves the SegNet network structure and builds a fully convolutional neural network, RsBR-SegNet, for building segmentation in remote sensing images. Its structure is shown in Figure 1. RsBR-SegNet preserves the upsampling scheme of the original SegNet and makes four changes. First, it uses GELU [16] as the activation function to avoid neuron necrosis. Second, it retains the first standard convolution layer of each convolution group in the encoding network to connect with the max-pooling operation and replaces the remaining convolution layers in the encoding network with stacked improved residual bottleneck structures, which further extract image features, deepen the network, and improve the segmentation accuracy of buildings. Third, it uses skip connections between the encoding network and the decoding network to fuse low-level and high-level features across image channels, further retaining the original detail information of buildings. Fourth, it connects an improved edge correction module to the end of the decoding network to refine the edges of buildings and improve the integrity of building segmentation. The input of the network is a three-channel (red, green, and blue) remote sensing image of buildings, and the output is a single-channel segmentation result map, in which white pixels are the segmented buildings and black pixels are the background.

3.1. Activation Function

The original SegNet network uses ReLU (Rectified Linear Units) [17] as the activation function. When the input to ReLU is negative, its output and gradient are both zero, so the neuron can become necrotic (permanently inactive), which is an inherent defect of the ReLU function. For this reason, this paper selects GELU (Gaussian Error Linear Units), which performs well in deep learning tasks, as the activation function in the RsBR-SegNet network, because it is differentiable at the origin and introduces the idea of stochastic regularization. The final activation transformation thus establishes a stochastic connection with the input, which avoids neuron necrosis and improves the speed and accuracy of learning. The function curve is shown in Figure 2.
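The difference between the two activations can be checked numerically. The sketch below uses the exact GELU definition, GELU(x) = x · Φ(x), where Φ is the standard normal cumulative distribution function; the function names are illustrative and not taken from the paper's code.

```python
import math


def relu(x):
    """ReLU: zero output (and zero gradient) for every negative input."""
    return max(0.0, x)


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_grad(x):
    """Derivative of x * Phi(x): Phi(x) + x * phi(x)."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf


for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):+.4f}  gelu'={gelu_grad(x):+.4f}")
```

For negative inputs GELU keeps a small nonzero output and gradient, so a unit that receives negative pre-activations can still be updated, and the derivative at the origin is well defined (0.5).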

3.2. Improved Residual Bottleneck Structure

By increasing the network depth, the model can learn more complex detailed features, but a deeper network also leads to problems such as gradient instability and network degradation during training. The residual bottleneck structure proposed in the ResNet network can alleviate this phenomenon. The MobileNetV2 network proposed by Guillermo et al. [18] builds on the original residual bottleneck structure and proposes an inverted residual bottleneck structure, which reverses the original order of channel dimensions and uses depth-wise separable convolution for feature extraction, improving segmentation speed and accuracy.

Although the depth-wise separable convolution used in the literature significantly reduces the number of weights, there is still room for improvement in segmentation performance. For this reason, this paper proposes an improved residual bottleneck structure to obtain more feature map information and improve the accuracy of building segmentation. In the improved structure, the first layer applies channel-by-channel (depth-wise) convolutions with kernels of 5 × 5, 3 × 3, 2 × 2, and 1 × 1 in parallel, so that feature maps with different receptive fields are obtained. The feature maps of all paths are concatenated to obtain more features, and a point-by-point (point-wise) convolution then reduces the number of channels to that of the original input. In this way, the improved residual bottleneck structure effectively reduces the number of weights while improving segmentation performance. In addition, the ReLU activation function causes information loss through neuron inactivation on low-dimensional input, whereas GELU can effectively alleviate this phenomenon and improve performance. Therefore, GELU is also used as the activation function after the channel-by-channel convolution and the point-by-point convolution. The improved residual bottleneck structure, with its reduced nonlinear transformations, is shown in Figure 3.

The improved residual bottleneck structure is influenced by the idea of the inverted residual bottleneck structure: the channel dimension is likewise first expanded and then contracted. By stacking depth-wise separable convolutions of different kernel sizes, more global features are obtained, the feature extraction ability is improved, and the running memory footprint is reduced. The number of parameters of the improved structure is given in equation (1), and the number of parameters of the inverted residual bottleneck structure in the MobileNetV2 network is given in equation (2).

In the above formula, P represents the number of parameters, M represents the number of input channels, and N represents the number of output channels.

In the RsBR-SegNet network, the number of input channels of the residual bottleneck structure is equal to the number of output channels, so the improved residual bottleneck structure has fewer parameters than the inverted residual bottleneck structure in the MobileNetV2 network.
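As a rough plausibility check of this comparison, the weights of both structures can be counted under explicit assumptions. The formulas below are not taken from the paper's equations; they assume that (a) the improved structure applies four parallel depth-wise convolutions with kernels 5, 3, 2, and 1 to M channels, concatenates them to 4M channels, and projects to N channels with a 1 × 1 convolution, (b) the MobileNetV2 inverted residual block uses an expansion factor of 6 with a 3 × 3 depth-wise convolution, and (c) biases and normalization parameters are ignored.

```python
def improved_bottleneck_params(m, n, kernels=(5, 3, 2, 1)):
    """Weight count under the assumptions stated above (hypothetical formula)."""
    depthwise = sum(k * k * m for k in kernels)  # one k x k filter per channel and branch
    pointwise = len(kernels) * m * n             # 1 x 1 conv on the concatenated 4M channels
    return depthwise + pointwise


def inverted_residual_params(m, n, expansion=6, kernel=3):
    """Weight count of an expand / depth-wise / project block (biases ignored)."""
    hidden = expansion * m
    expand = m * hidden                   # 1 x 1 expansion
    depthwise = kernel * kernel * hidden  # k x k depth-wise convolution
    project = hidden * n                  # 1 x 1 linear projection
    return expand + depthwise + project


channels = 64  # illustrative value with M = N
improved = improved_bottleneck_params(channels, channels)
inverted = inverted_residual_params(channels, channels)
print(improved, inverted, improved < inverted)
```

With M = N = 64 this gives 18880 weights for the improved structure against 52608 for the inverted residual block, which agrees with the direction of the claim above under these assumptions.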

3.3. Improved Edge Correction Module

At present, most deep learning-based methods for building segmentation in remote sensing images generate the segmentation result in one step and do not correct it further, so there is a large difference between the segmentation results and the ground truth [19–21]. To further correct the segmentation results, this paper proposes an improved edge correction module, which takes the single-channel probability map output by the model as input, automatically learns the residual between the input image and the corresponding ground truth during training, and further refines the input image to obtain more accurate segmentation results. The original edge correction module was proposed by Song et al. [22] to further refine boundary information, and its structure is shown in Figure 4(a). Although this structure improves the segmentation accuracy of the boundary to a certain extent, its small number of network layers prevents it from extracting deeper features of the input image. Therefore, an improved edge correction module is proposed. Building on the original edge correction module, it increases the network depth and adds more receptive fields; its structure is shown in Figure 4(b).

In the improved edge correction module, four dilated (atrous) convolutions with dilation rates of 1, 6, 12, and 18 are used to extract image features, and the extracted feature maps are then superimposed. Each convolution operation is followed by normalization and activation; to avoid the neuron necrosis of ReLU [23–28], GELU is again selected as the activation function. A standard 3 × 3 convolution then converts the number of feature map channels to 1, and the resulting feature map is fused with the input image of this module to obtain the preliminary information of the prediction module. Finally, the fused feature map is classified by the Sigmoid function to obtain the final segmentation result map [29–33]. Compared with the original module, the improved edge correction module proposed in this paper has a deeper structure and extracts richer image features. At the same time, the dilated convolutions with different dilation rates capture more global information, which makes the final segmentation result of the building more accurate and complete.
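The effect of the four rates (1, 6, 12, and 18) on the receptive field can be made concrete. A dilated convolution with kernel size k and rate d covers k + (k − 1)(d − 1) pixels per side. The sketch below assumes 3 × 3 kernels for the dilated branches and a stride of 1, neither of which is stated in the text.

```python
def effective_kernel(kernel, dilation):
    """Side length covered by one dilated convolution kernel."""
    return kernel + (kernel - 1) * (dilation - 1)


def same_padding(kernel, dilation):
    """Padding per side that preserves spatial size at stride 1 (odd kernels)."""
    return dilation * (kernel - 1) // 2


RATES = (1, 6, 12, 18)  # rates quoted in the text
KERNEL = 3              # assumption: kernel size of the dilated branches

fields = [effective_kernel(KERNEL, rate) for rate in RATES]
paddings = [same_padding(KERNEL, rate) for rate in RATES]

for rate, field, pad in zip(RATES, fields, paddings):
    print(f"rate={rate:2d}  covers {field:2d} x {field:2d} pixels  padding={pad}")
```

Under these assumptions the four branches cover 3, 13, 25, and 37 pixels per side with the same nine weights each, which is how the module gathers wider context without a matching growth in parameters.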

4. Experimental Results and Analysis

The computer hardware configuration in this experiment is an Intel Xeon(R) Gold [email protected] GHz, 64 GB of memory, and an NVIDIA GeForce RTX 2080 Ti GPU. The operating system is 64-bit Ubuntu 18.04 with CUDA 10.0 and cuDNN 7.5, and the code is based on the PyTorch framework.

4.1. Evaluation Indicators

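We use precision rate, recall rate, F1-score, and intersection over union (IoU) to evaluate and analyze the segmentation of buildings in remote sensing images. They are calculated as follows: Precision = tp / (tp + fp), Recall = tp / (tp + fn), F1 = 2 × Precision × Recall / (Precision + Recall), and IoU = tp / (tp + fp + fn).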

Here, tp denotes the pixels correctly segmented as buildings, fp denotes the pixels wrongly classified as buildings, and fn denotes the building pixels that are not correctly segmented. The precision rate measures the proportion of correctly predicted building samples among all samples predicted as buildings; the larger the value, the more accurate the building segmentation. The recall rate measures the proportion of correctly predicted building samples among all actual building samples; the larger the value, the more complete the segmentation of the buildings in the sample. The F1 value integrates precision and recall; the larger the value, the more effective the segmentation of the network model. IoU evaluates the similarity between the identified building area and the ground truth area; a higher value indicates a closer match between the identified buildings and the ground truth.
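The four indicators follow directly from the pixel counts tp, fp, and fn. The sketch below uses their standard definitions; the function name and the example counts are illustrative and do not come from the experiments.

```python
def segmentation_metrics(tp, fp, fn):
    """Precision, recall, F1, and IoU from building-pixel counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}


# Illustrative counts: 80 building pixels found, 20 false alarms, 20 missed.
metrics = segmentation_metrics(tp=80, fp=20, fn=20)
print(metrics)
```

The zero-denominator guards handle tiles that contain no building pixels and no building predictions, a case that arises when images are cropped into small patches.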

4.2. Evaluation of Segmentation Results

To demonstrate the effectiveness of the network in this paper, the classical semantic segmentation networks FCN, U-Net, and SegNet and the network in this paper are tested on the small-sample dataset “Satellite dataset I (global cities),” and the experimental results are shown in Figure 5. Here, (a) is the original image, (b) is the label corresponding to the buildings in the original image, (c) is the segmentation result of the FCN network, (d) is the segmentation result of the SegNet network, (e) is the segmentation result of the U-Net network, and (f) is the segmentation result of the network in this paper, RsBR-SegNet. The areas enclosed by dotted frames compare segmentation details, and the areas enclosed by solid frames mark misclassifications and omissions in the segmentation results. The results show that, compared with the other networks, the network in this paper is less affected by changes in image scale, produces fewer misclassifications and omissions, performs better on small buildings, and recovers edges more completely. The first row of segmentation results shows that, compared with the other networks, RsBR-SegNet effectively overcomes the misclassification of buildings. The second row shows that U-Net segments buildings better than FCN and SegNet, while RsBR-SegNet further identifies small buildings that U-Net misses, and the loss of detailed information is effectively alleviated. The third row shows that, for buildings interfered with by vegetation and road shadows, RsBR-SegNet has a certain anti-interference ability and yields more complete building edges.

Table 1 records the test results of each network model on the “Satellite dataset I (global cities)” dataset. As the data in the table show, compared with SegNet, the improved network increases the precision rate, recall rate, and F1 value by 3.5%, 13.4%, and 9.3%, respectively, and the IoU by 11.2%. The comparison shows that the improved network RsBR-SegNet achieves a significant improvement in building segmentation performance over SegNet and FCN and also has a certain advantage over the U-Net network. A good segmentation effect can therefore be achieved even on this small-sample dataset.

In order to prove the practicability of the network in this paper in the task of building segmentation, each network is tested on the expanded Massachusetts building dataset. The experimental results are shown in Figure 6. The meaning and legend of each column are consistent with Figure 5.

The segmentation results show that the improved network has a greater advantage in segmenting dense small buildings. The first row of segmentation results shows that, compared with the other networks, RsBR-SegNet produces fewer misclassifications and omissions and restores the edges of buildings more completely. The following rows show that the improved network can still effectively identify small buildings that the other networks fail to recognize, and that the overall segmentation of RsBR-SegNet is better than that of the comparison networks.

Table 2 records the evaluation results of each network on the expanded Massachusetts building dataset. The data in the table show that, on the large-sample dataset, the indicators of all networks improve. Compared with SegNet, the improved network gains 1.7%, 6.1%, 5.0%, and 6.7% in precision, recall, F1 value, and IoU, respectively. Compared with the other classical semantic segmentation networks, the RsBR-SegNet network improves every evaluation indicator: its precision reaches 93.503% and its IoU reaches 69.746%, which demonstrates the practicability of the improved network in the task of remote sensing image building segmentation.

5. Conclusion

This paper proposes a fully convolutional neural network, RsBR-SegNet, suitable for building segmentation. GELU is used as the activation function in the network to improve the learning ability of neurons, skip connections are used to fuse the low-level and high-level semantic features of the image and alleviate the loss of details, and the improved residual bottleneck structure and edge correction module are used to extract more building features, which improves the segmentation accuracy and edge integrity of buildings. Experiments are carried out on satellite and aerial remote sensing image datasets, and the results show that the RsBR-SegNet network produces more accurate segmentation results than the classical segmentation networks FCN, U-Net, and SegNet and effectively overcomes edge blurring. RsBR-SegNet achieves the highest values on the evaluation indicators of precision rate, recall rate, F1 value, and IoU, which makes it more suitable for remote sensing image building segmentation tasks.

Data Availability

The dataset can be accessed upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Fujian Province Young and Middle-Aged Teachers’ Education and Scientific Research Project “Research on the Activation Design of Traditional Fujian Village Cultural Landscape” (Grant no. JAS21007) and by the Fujian Province Social Science Planning project “Fujian Folk Animation Research” (Grant no. FJ2021B195).