Abstract

Boundary pixel blur and category imbalance are common problems in semantic segmentation of urban remote sensing images. Inspired by DenseU-Net, this paper proposes a new end-to-end network, SiameseDenseU-Net. First, the network uses true orthophoto (TOP) images and their corresponding normalized digital surface models (nDSM) simultaneously as inputs to the network structure. Deep image features are extracted in parallel by downsampling blocks, and information such as shallow textures and high-level abstract semantic features is fused through the connected channels. The features extracted by the two parallel processing chains are then fused. Finally, a softmax layer performs prediction to generate dense label maps. Experiments on the Vaihingen dataset show that SiameseDenseU-Net improves the F1-score by 8.2% and 7.63% compared with the Hourglass-ShapeNetwork (HSN) model and the U-Net model, respectively. Regarding the boundary pixels, when using the same focal loss function based on median frequency balance weighting, the F1-score of SiameseDenseU-Net on the small-target “car” category improved by 0.92% compared with the original DenseU-Net. The overall accuracy and the average F1-score also improved to varying degrees. The proposed SiameseDenseU-Net is better at identifying small-target categories and boundary pixels, and it is numerically and visually superior to the compared models.

1. Introduction

In the computer vision field, semantic segmentation is an important problem. In the past few decades, many classic traditional segmentation algorithms have emerged, including region-based methods, watershed algorithms, threshold methods, and cluster-based segmentation methods. In practical applications, high-resolution images are difficult to interpret automatically for two reasons: first, their spatial resolution is higher, but their spectral resolution is lower; second, the surface texture features of small targets become visible. These two factors increase the intraclass variability in the image while decreasing the differences between classes. Image semantic segmentation aims to determine the most appropriate class label for each pixel in an image, drawn from a predefined, limited set of labels.

In 2012, the AlexNet network proposed by Krizhevsky et al. [1] triggered a new wave of deep learning applications in imaging. Later, Tsogkas and Kokkinos [2] combined a convolutional neural network (CNN) with a fully connected conditional random field (CRF) to learn the lost prior information. In the 2015 data fusion competition, Lagrange et al. [3] used a pretrained CNN model as a feature extractor to classify land cover. Paisitkriangkrai et al. [4] used true orthophoto images, corresponding digital surface model (DSM) images, and normalized digital surface model images to train a relatively small set of CNN models; the results were then further optimized using a CRF. Long et al. [5] proposed the fully convolutional network (FCN) to classify images at the pixel level. Unlike a classic CNN, an FCN can accept an input image of any size and restore the output to the same size as the input, thus generating a prediction for each pixel while retaining the spatial information in the original input image.

In 2016, Volpi and Tuia [6] proposed a CNN-based system, CNN-FPL. This system relies on a downsample-then-upsample architecture so that the CNN learns to densely label every pixel at the original resolution of the image. In 2017, Nogueira et al. [7] compared several popular neural networks and training strategies; their experimental results on three remote sensing image datasets indicated that fine-tuning networks is the best training strategy. Liu et al. [8] used a composed inception module to replace common convolutional layers, providing multiscale receptive fields with rich context for the network. Badrinarayanan et al. [9] proposed an architecture for semantic pixelwise segmentation termed SegNet that eliminates the need to learn to upsample; the upsampled maps are convolved with trainable filters to produce dense feature maps. In 2018, Gao et al. [10] proposed a weighted equilibrium function and a neural network based on a multifeature pyramid structure. Chen et al. [11] proposed the FCN-based model structures SNFCN and SDFCN to process VHR remote sensing images. They designed the SNFCN and SDFCN frameworks with dense-shortcut connection structures, and SDFCN adds three additional identity-mapping shortcut connections between the symmetrical encoder-decoder pairs. This approach ensures that gradient information can be passed directly to the upper layers of the network. Zhang et al. [12] proposed a novel supervised deep-CNN-based OBIC framework to deal with segmented superpixels and introduced two mask policies for network models. Chen et al. [13] proposed DeepLabv3+, which applies depthwise separable convolution to both the Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network. The TreeUNet model proposed by Yu K et al. [14] in 2019 was the first to use both an adaptive hierarchy and deep neural networks in a unified deep learning structure. TreeSegNet adopts an adaptive network to increase the classification rate at the pixelwise level, and this algorithm achieved improved results on the ISPRS Potsdam dataset.

In summary, deep learning has been widely used for image preprocessing, target recognition tasks, high-level semantic feature extraction, and remote sensing scene understanding, but improving the image semantic segmentation accuracy and resolving interclass imbalance remain challenging problems. The main contributions of this paper are as follows:
(i) As technology has developed, the types of data available in the field of image processing have become more diverse and include true orthophoto images, normalized digital surface models, RGB-D images containing depth information, and even three-dimensional image data. We consider data with two different statistical characteristics and use them as simultaneous model inputs, achieving parallel processing of different remote sensing image data types. Finally, the features of the two parallel processing chains are fused to generate dense label maps.
(ii) We adopt a loss function suited to semantic segmentation of remote sensing images. This function introduces a weighting factor into the traditional cross-entropy function to suppress the dominant position of large target categories in training and focus the training process on small-target categories. This approach both guarantees the overall accuracy and improves the segmentation of small-target categories.
(iii) Because of complex textures and lighting, the “building” category is easily misclassified as “impervious surface” by other models. Based on the visual maps of local results, our model handles incompletely segmented “buildings” well and can segment the “building” category almost completely. We attribute this result to the model's strong feature fusion capabilities.

However, the SiameseDenseU-Net model uses a max pooling layer in each downsampling block to expand the receptive field, which causes some information to be lost during downsampling. The idea of atrous convolution [15] could be borrowed to increase the receptive field without losing information; this is left to future work.

2. Related Work

Through extensive research on satellite remote sensing images, researchers have found that high-resolution remote sensing images have lower spectral resolution than low-resolution remote sensing images. In most cases, only the three RGB channels are available, and category information is not fully captured. Therefore, for high-resolution remote sensing images, analyzing texture and spatial context is particularly important, and many studies have focused on extracting features from pixel spatial neighborhoods [16, 17]. The semantic segmentation task for high-resolution remote sensing images is designed to assign each pixel a category from a predefined set of semantic categories, such as buildings, low vegetation, trees, or cars. Timely access to accurate segmentation results is critical for tasks such as urban planning, environmental monitoring, and economic forecasting.

In the past few decades, a large number of statistical methods based on spectral features, including the maximum likelihood method [18] and the K-means method [19], as well as machine learning-based methods such as neural networks (NN) [20], the support vector machine (SVM) [21], object-oriented classification [22], and sparse representation [23], have been widely used in remote sensing image segmentation tasks. However, these shallow methods often fail to adequately consider the interrelationships between global and local samples. In recent years, deep learning methods, especially convolutional neural networks, have performed well on visual learning tasks. A deep network takes the original image as input and transforms it through multiple processing layers. By aggregating features over gradually increasing context neighborhoods, the information becomes more explicit, thus achieving a distinction between different object categories [24]. The parameters of the entire network model are learned from the original data and labels, including the lower layers that contain the original features, the middle layers that encode task-specific context information, and the top layers that perform the actual classification.

The remote sensing image semantic segmentation task can be described as follows: given a labeled training dataset, the classifier learns a predictive conditional probability over the classes from the spectral features. The original pixel intensities, simple combinations of raw values, and various types of statistical information describing the local image texture [25, 26] are typical choices of input features. Another common method is to precalculate a large number of redundant feature sets for training and then let the classifier select the optimal subset [27, 28]. In this way, less relevant information can be ignored during the feature encoding process.

HSN [8] uses inception and residual modules. The inception module enables the network to extract information from multiscale receptive fields. Residual modules are employed together with skip connections, feeding information forward from the encoder directly to the decoder to make more effective use of the spatial information. In addition, the model uses overlap inference (OI) to mitigate boundary effects in the image, and postprocessing based on weighted belief propagation (WBP) visually enhances the classification results. HSN is superior to the state-of-the-art FCN [5], FPL [6], and SegNet models in terms of overall accuracy and average F1-score. The core idea underlying DenseU-Net is to connect CNN features through cascade operations and use a symmetric model structure to fuse shallow information with high-level abstract semantic features. DenseU-Net has made significant progress in the segmentation accuracy for small-target categories.

However, HSN and DenseU-Net simply add more complex processing modules to the existing network structure; they do not consider the problem of processing different statistical feature images at the same time. In the field of image semantic segmentation, it has been difficult to make a large breakthrough in network structure since the emergence of U-Net. More complex network structures not only require longer training times but also lead to model overfitting. Therefore, we focus on data processing and utilization. SiameseDenseU-Net combines two parallel DenseU-Net modules to process images with different statistical characteristics simultaneously. The resulting feature information is fused by the connected channels, which improves the network’s ability to extract image features.

This study was inspired by DenseU-Net [29] and makes improvements based on its work. We further explore the potential of CNNs for end-to-end semantic segmentation of high-resolution remote sensing images.

3. Proposed Methods

SiameseDenseU-Net uses two similar parallel DenseU-Nets, each of which is composed of an encoder and a decoder. The encoder consists of five consecutive downsampling blocks that double the number of feature dimensions, while the decoder consists of five consecutive upsampling blocks that halve the number of feature dimensions. The downsampling blocks extract context information from the input features to obtain hierarchical features, and the upsampling blocks then recover the resolution of the extracted features, restoring the spatial position information lost by the encoder. Simultaneously, each downsampling block has a connection with its corresponding upsampling block, so shallow texture, color, and other details are combined with the high-level abstract semantic features to form a single DenseU-Net network. SiameseDenseU-Net fuses the features extracted from the two parallel processing chains and uses a softmax layer to predict the output features to generate dense label maps.

3.1. Sampling Blocks

The D-dimensional H × W feature map is the input to the downsampling block structure. The input features first pass through two convolutional layers with a padding of 1, a stride of 1, and a filter size of 3 × 3. The input x of the downsampling block and the output features yd1 and yd2 of the two convolutional layers are subjected to a cascade operation to obtain a 3D-dimensional feature map. Finally, a 1 × 1 convolution performs dimensionality reduction to obtain the feature z, which is passed to the corresponding upsampling block on the one hand and forms the input to the max pooling layer on the other. Continuous downsampling blocks extract CNN features, providing a wider receptive field for the network and generating more accurate classifications.

Figure 1 shows that the structures of the upsampling blocks and downsampling blocks are similar. In the upsampling block, the D-dimensional H × W feature map is used as input, and a 2 × 2 transposed convolution layer produces a 2H × 2W feature map. Feature fusion is then performed with the same-size feature map passed from the corresponding downsampling block, and the dimensionality is reduced by a 1 × 1 convolution to give yu2. The feature yu2 then passes through two densely concatenated convolutional layers; yu2 and yu3, together with the output yu4 of the second convolutional layer, are subjected to a cascade operation to obtain a 3D-dimensional feature map. Finally, the feature map is reduced to D dimensions by a 1 × 1 convolution. Every convolutional layer is followed by a batch normalization (BN) operation and a rectified linear unit (ReLU). In the expansion path phase, the resolution of the image is recovered layer by layer using successive upsampling blocks, after which the model can obtain accurate positional information.

The model structure can be formalized as follows:

Model = <x, convi, H, yi, cascade, o, conv1×1, z, maxpool, x1, Tconv>

The meaning of each variable is described below:
(1) x: the input of the downsampling block
(2) convi: the i-th convolution operation; the filter size is 3 × 3, and i = 1, 2
(3) H: the compound function, which denotes the ReLU and BN operations
(4) yi: the output after the i-th convolution operation, where i = 1, 2
(5) cascade: the cascade operation
(6) o: the output
(7) conv1×1: the convolutional dimensionality reduction operation; the filter size is 1 × 1
(8) z: the features after dimensionality reduction
(9) maxpool: the max pooling operation
(10) x1: the input of the upsampling block
(11) Tconv: the transposed convolution operation

The output characteristic yd1 of the first convolution layer is given by

yd1 = H(conv1(x))

The output yd1 of the first convolution layer is connected to the input x of the downsampling block by a cascade operation; therefore, the output yd2 of the second convolution layer is given by

yd2 = H(conv2(cascade(x, yd1)))

Similarly, the outputs yd1 and yd2 and the input x of the downsampling block are connected to form a 3D-dimensional feature map; a 1 × 1 convolution is then used for dimensionality reduction, reducing the dimensions of the feature map and improving the calculation efficiency. The feature z after dimensionality reduction is as follows:

z = conv1×1(cascade(x, yd1, yd2))
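The following is a minimal PyTorch sketch of the downsampling block described by the equations above; the exact channel widths (in_ch, out_ch) and the 2 × 2 max pooling window are assumptions not fixed by the text.

```python
# Minimal sketch of the dense downsampling block (assumed channel widths).
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Sequential(                      # conv1: 3x3 conv + BN + ReLU
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(                      # conv2 acts on cascade(x, yd1)
            nn.Conv2d(in_ch + out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.reduce = nn.Conv2d(in_ch + 2 * out_ch, out_ch, 1)  # 1x1 dimensionality reduction
        self.pool = nn.MaxPool2d(2)                      # assumed 2x2 max pooling window

    def forward(self, x):
        yd1 = self.conv1(x)                              # yd1 = H(conv1(x))
        yd2 = self.conv2(torch.cat([x, yd1], dim=1))     # yd2 = H(conv2(cascade(x, yd1)))
        z = self.reduce(torch.cat([x, yd1, yd2], dim=1)) # z = conv1x1(cascade(x, yd1, yd2))
        return self.pool(z), z                           # pooled features and the skip feature z
```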

The dimensionally reduced feature z is passed as the input to both the max pooling layer and the corresponding upsampling block. x1 represents the D-dimensional input feature of the upsampling block; consequently, the output characteristic yu1 of the transposed convolutional layer is given by

yu1 = Tconv(x1)

The output feature yu1 of the transposed convolution layer is cascaded with the feature z transmitted by the corresponding downsampling block through the connection channel, and the connected features are subjected to dimensionality reduction by a 1 × 1 convolution:

yu2 = conv1×1(cascade(yu1, z))

The dimensionally reduced feature yu2 is used as the input to the two densely concatenated convolutional layers; thus, the output characteristic yu3 is given by

yu3 = H(conv1(yu2))

The outputs yu2 and yu3 are connected by a cascade operation, and the output yu4 of the second convolution layer is as follows:

yu4 = H(conv2(cascade(yu2, yu3)))

Finally, yu3 and yu4 are connected to the input yu2 of the densely concatenated convolution layers through the cascade operation, and the connected feature map is subjected to dimensionality reduction using a 1 × 1 convolution. The output o of the final upsampling block is given by

o = conv1×1(cascade(yu2, yu3, yu4))
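Continuing the sketch above, a minimal PyTorch version of the upsampling block might look as follows; again, the channel widths are assumptions, and the skip feature z comes from the matching downsampling block.

```python
# Minimal sketch of the dense upsampling block (assumed channel widths).
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, 2, stride=2)  # 2x2 transposed convolution
        self.reduce_in = nn.Conv2d(out_ch + skip_ch, out_ch, 1)   # yu2 = conv1x1(cascade(yu1, z))
        self.conv1 = nn.Sequential(                                # yu3 = H(conv1(yu2))
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.conv2 = nn.Sequential(                                # yu4 = H(conv2(cascade(yu2, yu3)))
            nn.Conv2d(2 * out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.reduce_out = nn.Conv2d(3 * out_ch, out_ch, 1)         # o = conv1x1(cascade(yu2, yu3, yu4))

    def forward(self, x1, z):
        yu1 = self.up(x1)                                          # upsample to 2H x 2W
        yu2 = self.reduce_in(torch.cat([yu1, z], dim=1))
        yu3 = self.conv1(yu2)
        yu4 = self.conv2(torch.cat([yu2, yu3], dim=1))
        return self.reduce_out(torch.cat([yu2, yu3, yu4], dim=1))
```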

The model uses skip connections to fuse the shallow color and texture details with the high-level abstract semantic features, which effectively improves the segmentation accuracy for relatively small classes.

3.2. Loss Function

Cross-entropy loss is commonly applied in image segmentation tasks. However, that loss function is calculated by summing over all pixels, which fails to account for category imbalance. Inspired by Eigen and Fergus [30], the median frequency balance is used to weight the loss of each class: it weights the class loss by the ratio of the median of the class frequencies in the training set to the frequency of the target class. However, this approach is insufficient to distinguish easy from difficult samples. To improve the segmentation accuracy for small-target categories in remote sensing images, we therefore introduce the idea of focal loss proposed by Lin et al. [31]. By suppressing the leading role of simple samples during training, the training process can concentrate on complex and difficult samples.

Here, N represents the number of samples in a minibatch, C represents the number of categories, yn,c represents the true one-hot label of sample n for class c, and pn,c is the softmax probability of sample n being in class c. The cross-entropy loss function is defined as follows:

LCE = −(1/N) ∑(n=1 to N) ∑(c=1 to C) yn,c log(pn,c)

The focal loss function MFB_Focalloss, which is based on the median frequency balance, is defined as follows:

MFB_Focalloss = −(1/N) ∑(n=1 to N) ∑(c=1 to C) wc (1 − pn,c)^γ yn,c log(pn,c)

where γ ≥ 0 is the focusing parameter of the focal loss.

The frequency of the category c pixels is denoted by fc, median(fc) is the median of the pixel frequencies over all categories, and wc represents the weight value corresponding to category c:

wc = median(fc)/fc
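A minimal PyTorch sketch of this weighted focal loss is given below; the focusing parameter gamma and the reduction by averaging are assumptions, since the text only specifies the median frequency balance weighting and the focal modulation.

```python
# Minimal sketch of MFB_Focalloss: focal loss weighted by wc = median(fc)/fc.
import torch
import torch.nn.functional as F

def mfb_weights(class_pixel_freq):
    # class_pixel_freq: 1-D tensor of per-class pixel frequencies fc
    return class_pixel_freq.median() / class_pixel_freq

def mfb_focal_loss(logits, target, weights, gamma=2.0):
    # logits: (N, C, H, W); target: (N, H, W) integer labels; weights: (C,)
    log_p = F.log_softmax(logits, dim=1)
    p = log_p.exp()
    log_pt = log_p.gather(1, target.unsqueeze(1)).squeeze(1)  # log-probability of the true class
    pt = p.gather(1, target.unsqueeze(1)).squeeze(1)
    wt = weights.to(logits.device)[target]                    # wc for each pixel's true class
    return -(wt * (1.0 - pt) ** gamma * log_pt).mean()        # gamma is an assumed focusing parameter
```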

3.3. SiameseDenseU-Net

Inspired by DenseU-Net [29], this paper proposes a new end-to-end neural network called SiameseDenseU-Net. The network uses true orthophoto images and their corresponding normalized digital surface model images as the inputs to two DenseU-Net structures, and the downsampling blocks extract deep image features in parallel. Information is fused through the connected channels. Finally, a softmax layer is used to predict the output characteristics to generate dense label maps. The model structure is shown in Figure 2.

The dual-channel data input and the parallel model structure used for feature extraction inevitably increase model complexity, which would make training and prediction slow, prevent us from quickly verifying our ideas, and thus hinder model improvement. Too many parameters can also cause the model to overfit. Because the depth of the model is closely related to its feature extraction capability and the convolution kernels are already small, we instead halve the number of channels of the original DenseU-Net in each stream when clipping the model. Consequently, the SiameseDenseU-Net model does not add additional parameters or calculation costs.
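A high-level sketch of this two-stream design is shown below. The DenseU-Net streams are passed in as modules (each built from downsampling and upsampling blocks with halved channel widths), and the fusion by channel concatenation followed by a 1 × 1 classifier and softmax is an assumption about the exact fusion layer.

```python
# High-level sketch of the two-stream SiameseDenseU-Net (fusion details assumed).
import torch
import torch.nn as nn

class SiameseDenseUNet(nn.Module):
    def __init__(self, top_stream: nn.Module, ndsm_stream: nn.Module,
                 feat_ch: int, num_classes: int):
        super().__init__()
        self.top_stream = top_stream      # DenseU-Net (halved channels) for the TOP image
        self.ndsm_stream = ndsm_stream    # DenseU-Net (halved channels) for the nDSM image
        self.classifier = nn.Conv2d(2 * feat_ch, num_classes, 1)

    def forward(self, top, ndsm):
        f_top = self.top_stream(top)                          # features from the TOP branch
        f_ndsm = self.ndsm_stream(ndsm)                       # features from the nDSM branch
        fused = torch.cat([f_top, f_ndsm], dim=1)             # fuse the two parallel chains
        return torch.softmax(self.classifier(fused), dim=1)   # dense per-pixel class probabilities
```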

Table 1 gives the detailed parameters of each layer of SiameseDenseU-Net. The experimental results on the Vaihingen dataset show that SiameseDenseU-Net still performs better than does the original DenseU-Net without increasing the complexity of the model.

4. Experiments and Analysis

This experiment uses the MFB_Focalloss and the cross-entropy loss function. The effectiveness of the SiameseDenseU-Net model was verified by comparing it with the original DenseU-Net and U-Net models. The HSN [8] model uses the cross-entropy loss function MFB_CEloss based on the median frequency balance in this experiment; OI is used to further improve the prediction accuracy, and finally, WBP is performed during postprocessing to further improve the overall accuracy.

4.1. Dataset

The experiment used the Vaihingen dataset from the 2D semantic labeling contest of the 23rd International Society for Photogrammetry and Remote Sensing (ISPRS) Congress in 2016 [32]. The dataset contains 33 high-resolution TOP images and corresponding DSM images taken over the German town of Vaihingen. Of the 33 images in the dataset, 16 are labeled. The official ISPRS organizer also provided 33 normalized digital surface model (nDSM) images corresponding to the TOP images to limit the effects of different ground heights. Two ground-truth versions are used in the evaluation: the original version (denoted by GT) and the eroded version (denoted by erGT). Some examples are shown in Figure 3.

The experiment divided the 16 available GT images into training and testing sample sets. The training set consists of 11 images (regions 1, 3, 5, 7, 13, 17, 21, 23, 26, 32, and 37), and the test set includes 5 images (regions 11, 15, 28, 30, and 34).

The Vaihingen dataset contains six categories: impervious surfaces, low vegetation, cars, clutter/background, buildings, and trees. In the dataset, the “car” category is relatively small compared to the other categories; thus, it belongs to the small-target category, as shown in Figure 4. At the same time, in the image, the diversity of car colors also leads to large intraclass differences.

In this experiment, we cut the 11 training set images and the corresponding GT and nDSM images into 256 × 256 pixel images, with a 50% overlap between adjacent images. Then, each of the cut images and the corresponding GT image were rotated at four angles (0°, 90°, 180°, and 270°), and each rotated image was horizontally mirrored. Following this approach, each picture is represented by 8 enhanced images, including itself. These operations increased the diversity of the data.
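A minimal NumPy sketch of this tiling and augmentation scheme might look as follows; the stride of 128 pixels follows from the 50% overlap, and applying the same operations to the GT and nDSM tiles is implied by the text.

```python
# Minimal sketch of 256x256 tiling with 50% overlap and 8-fold augmentation.
import numpy as np

def tile(image, size=256, overlap=0.5):
    step = int(size * (1 - overlap))                 # 128-pixel stride for 50% overlap
    h, w = image.shape[:2]
    return [image[r:r + size, c:c + size]
            for r in range(0, h - size + 1, step)
            for c in range(0, w - size + 1, step)]

def augment(patch):
    variants = []
    for k in range(4):                               # rotations of 0, 90, 180, and 270 degrees
        rotated = np.rot90(patch, k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))          # horizontal mirror of each rotation
    return variants                                  # 8 images per patch, including the original
```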

4.2. Evaluation Index

Following the ISPRS 2D semantic labeling contest, the overall accuracy is the percentage of pixels for which the correct category is predicted, and the F1-score is used as the criterion for measuring the segmentation accuracy of each category. These metrics all lie between 0 and 1: the larger their values are, the higher the accuracy is. The F1-score balances precision and recall and thus better measures model performance. The F1-score is defined as follows:

F1 = 2 × P × R/(P + R)

N represents the total number of predictions, M is the number of correct prediction results, G is the sum of the correctly predicted results and the unpredicted correct results, P represents the precision rate, and R represents the recall rate. P and R are defined as follows:

P = M/N, R = M/G

The percentage of correctly predicted pixels among all pixels is used as the overall accuracy, where TP represents the number of correctly predicted pixels and AP represents the total number of pixels. This metric is defined as follows:

OA = TP/AP
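The following NumPy sketch computes these indices from predicted and ground-truth label maps; the per-class counting with boolean masks is an assumption about the implementation, not part of the paper.

```python
# Minimal sketch of the per-class F1-score and the overall accuracy.
import numpy as np

def per_class_f1(pred, gt, cls):
    m = np.sum((pred == cls) & (gt == cls))          # correctly predicted pixels of this class (M)
    n = np.sum(pred == cls)                          # all pixels predicted as this class (N)
    g = np.sum(gt == cls)                            # all ground-truth pixels of this class (G)
    p = m / n if n else 0.0                          # precision P = M / N
    r = m / g if g else 0.0                          # recall R = M / G
    return 2 * p * r / (p + r) if (p + r) else 0.0   # F1 = 2PR / (P + R)

def overall_accuracy(pred, gt):
    return np.mean(pred == gt)                       # TP / AP: correct pixels over all pixels
```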

4.3. Experimental Results

The experiment uses true orthophoto images and the normalized digital surface model images as the inputs to SiameseDenseU-Net; these are sent, respectively, to the two parallel DenseU-Net models for training. Finally, the features extracted by the two parallel DenseU-Net models are fused, and the softmax layer densely predicts the fused features to generate dense label maps. It is worth noting that the number of channels of each of the two parallel DenseU-Nets in the SiameseDenseU-Net model is half that of the original DenseU-Net model, so compared to the original DenseU-Net, the SiameseDenseU-Net model does not add additional parameters or computational costs.

As shown in Table 2, SiameseDenseU-Net + MFB_Focalloss outperforms the original DenseU-Net + MFB_Focalloss model on every category except “low vegetation”, improving the F1-scores of the other categories as well as the overall accuracy and the average F1-score to varying degrees. When considering boundary pixels, the overall accuracy and average F1-score increased by 0.57% and 0.58%, respectively. This experiment shows that the SiameseDenseU-Net model outperforms the DenseU-Net and U-Net models without requiring additional parameters or increasing the computational cost. It is particularly noteworthy that, when considering edge pixels, the newly proposed SiameseDenseU-Net + MFB_Focalloss increases the F1-score of the small-target “car” category by 0.92% compared to the original DenseU-Net + MFB_Focalloss model. That is, SiameseDenseU-Net + MFB_Focalloss achieves excellent performance at enhancing the semantic segmentation of small-target categories.

It can also be seen from Table 2 that SiameseDenseU-Net + MFB_Focalloss achieves a better overall accuracy than does HSN + OI + WBP even without postprocessing, reaching 86.2%. Moreover, its F1-scores on each category are better than those of HSN + OI + WBP. Especially for the small-target “car” category, SiameseDenseU-Net + MFB_Focalloss’s F1-score increased by 8.2% over that of HSN + OI + WBP.

When the boundary pixels are ignored (erGT), all the networks perform better than when the boundary pixels are considered (GT) due to object boundary ambiguity.

The experiments on the Vaihingen dataset show that the SiameseDenseU-Net model can better identify small-target “car” categories while maintaining its overall accuracy, making it numerically and visually superior to the existing DenseU-Net, U-Net, and HSN models.

Figure 5 shows the experimental results of different models on the global image. It can be seen that SiameseDenseU-Net + MFB_Focalloss outperforms the other models on the Vaihingen dataset.

Figure 6 shows a local comparison of the experimental results on the “car” category. For the small-target “car” category, the segmentation effect of the DenseU-Net + MFB_Focalloss model is already excellent, but the new SiameseDenseU-Net + MFB_Focalloss model performs even better on “car” boundary pixels and defective “cars”.

Figures 7 and 8 show local visual comparisons of the “building” category segmentation produced by the different models. In Figure 7, the SiameseDenseU-Net + MFB_Focalloss model gives the best semantic segmentation of the “building” category. For the “building” in the upper left corner of the image, some pixels are misclassified by the other models as “impervious surfaces” due to complex textures and lighting, which leaves the “building” incomplete. In contrast, the SiameseDenseU-Net + MFB_Focalloss model solves the problem of defective “buildings” and segments the “building” category completely.

As shown in Figure 8, the SiameseDenseU-Net + MFB_Focalloss model performs best at segmenting the boundary pixels of the “building” category. The boundary of the “building” in the image is jagged, and the other models fail to recognize these boundary pixels; some models also produce incomplete “building” predictions. In contrast, the SiameseDenseU-Net + MFB_Focalloss model not only solves the problem of the incomplete “building” predictions but also accurately identifies the boundary pixels of the “building” category.

5. Conclusions

To address the problems of blurred boundary pixels and unbalanced categories in urban remote sensing image segmentation tasks, this paper proposed an end-to-end SiameseDenseU-Net model based on DenseU-Net. The model uses two parallel DenseU-Net networks to extract features from true orthophoto images and their corresponding normalized digital surface model images, with the two sets of downsampling blocks extracting image features simultaneously. The features of the downsampling blocks are transmitted to the upsampling blocks for feature fusion through the connected channels. Finally, a softmax layer performs prediction and generates dense label maps. The number of channels in each stream of the SiameseDenseU-Net model is half that of the original DenseU-Net model. The experimental results show that the SiameseDenseU-Net model is better at identifying the small-target “car” category and the “building” category without requiring additional parameters or increasing the calculation cost, and it also better addresses the incomplete segmentation of the “building” category. Simultaneously, it improves the overall accuracy and the average F1-score and outperforms the compared models in both numerical and visual comparisons.

Data Availability

The data used to support the results of this study can be obtained by visiting http://www2.isprs.org/commissions/comm3/wg4/2d-sem-label-vaihingen.html.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant 61762024 and in part by the Natural Science Foundation of Guangxi Province under Grant nos. 2017GXNSFDA198050 and 2016GXNSFAA380054.