Abstract

With the increase in the spatial resolution of remote sensing images, the imaging characteristics of ground objects become more and more complex, and change detection methods based on techniques such as texture representation and local semantics struggle to meet the demand. Most change detection methods focus on extracting semantic features and ignore the importance of high-resolution shallow information and fine-grained features, which often leads to uncertainty in edge detection and small target detection. In single-input networks, where the two temporal images are concatenated before being fed to the network, the shallow layers cannot provide the deep layers with information from the individual original images to help reconstruct the image; therefore, the change detection results may lack detail and feature compactness. To address this, a twin context aggregation network (TCANet) is proposed to perform change detection on remote sensing images. In order to reduce the loss of spatial accuracy of remote sensing images and maintain a high-resolution representation, we introduce HRNet as our backbone network to initially extract the features of interest. Our proposed context aggregation module (CAM) can amplify the receptive field of the convolutional neural network to obtain more detailed contextual information without significantly increasing the computational effort. The side output embedding module (SOEM) is proposed to improve the accuracy of small-volume target change detection as well as to shorten the training process and speed up detection while maintaining performance. The method was evaluated on the publicly available CDD dataset, the SYSU-CD dataset, and the challenging DSIFN dataset. With significant improvements in precision, recall, F1 score, and overall accuracy, the method outperforms the five comparison methods from the literature.

1. Introduction

Change detection on remote sensing images acquired over the same geographical area at different times is an important part of many practical applications such as land use monitoring, vegetation change detection, ecosystem monitoring, and damage assessment [1]. Manually analyzing changes in remotely sensed images is traditionally time-consuming and laborious, which makes the automation of this process an important and practically needed research area. Automatic change detection on time-series images is of great scientific and application value, and research in this field has been carried out in the remote sensing community for decades [26]. In recent years, artificial intelligence techniques represented by deep learning have developed rapidly and have been applied in various fields, such as computer vision [7], speech recognition [8], and information retrieval [9], with computer vision being a particularly active area. The fully convolutional neural network [10] achieved end-to-end pixel classification of images by replacing fully connected layers with convolutional layers. Hughes et al. proposed a pseudo-twin convolutional neural network-based method for detecting changes between SAR images and optical images [11]. As the research progressed, Vaswani et al. published the landmark paper that introduced the transformer architecture and ushered in the era of large models in 2017 [12]. In 2018, BERT set new state-of-the-art records on 11 NLP tasks, and the transformer architecture was at its core. The transformer has since been used extensively in the remote sensing field, especially for semantic segmentation. Gori et al. [13] used RNNs to compress node information and learn graph node labels, first proposing the concept of the graph neural network (GNN). Graph neural networks (GNNs) are a deep learning-based method for operating in the graph domain. Later, the graph convolutional network (GCN) was proposed in the literature [14], which formally used CNNs for modeling graph-structured data. Although both the transformer and graph neural networks are popular deep learning methods in recent years, relatively little research has applied them to dual-temporal remote sensing image change detection. Therefore, twin convolutional neural networks are still used in this paper.

At present, scholars at home and abroad have proposed various deep learning-based remote sensing image change detection methods. In terms of framework, they can be broadly divided into three categories. The first type extracts features first and then detects changes, i.e., a deep network extracts image features, and change detection is then performed on those features [15]. The second category preclassifies first and then detects. That is, the images are first preclassified using traditional algorithms; the deep network is then trained with clearly changed and clearly unchanged samples; finally, uncertain samples are fed into the trained network to obtain the result map [16]. Although both types of methods are based on deep learning and their results are better than those of traditional methods, they remain subject to human experience and prone to errors in the threshold judgment, clustering, and sample selection steps of the detection process. The third category comprises methods based on fully convolutional networks. These are completely end-to-end learning frameworks with no human intervention in between, making the whole process more robust and efficient [6]. Depending on how the images are input, this type of method can be subdivided into single-input and dual-input networks.

Although fully convolutional network-based approaches achieve better change detection performance, some shortcomings remain. The cascading pooling operations of encoders from high resolution to low resolution reduce the spatial resolution, which is difficult for the decoder to recover. Directly using a fully twin convolutional change detection network suffers from low detection completeness and is prone to false and missed detections. This is primarily limited by insufficient network feature extraction capability and ineffective use of contextual semantic information in the spatial and channel domains. Fully convolutional networks also struggle to capture the edge information of images when extracting the deep features of dual-temporal images. To this end, this paper proposes a twin context aggregation network (TCANet) to address the above problems. First, to obtain a high-resolution representation, we use HRNet as our backbone network. Second, in order to enhance the network's feature extraction capability and effectively utilize channel-domain contextual semantic information, we propose the context aggregation module (CAM). Finally, in the decoding part, we introduce the side output embedding module (SOEM) to obtain the edge information and small target information of the image while suppressing useless information, further improving the accuracy of change detection. Remote sensing image change detection is a cutting-edge and still relatively under-studied research direction, and the best experimental results are obtained when our three proposed modules are used together. The main contributions are as follows.

(1) We introduce a dual HRNet as the backbone of our twin network. This network maintains a high-resolution representation from beginning to end, and the information interaction between different branches compensates for the information loss caused by the reduced number of channels.

(2) In the encoding part, the context aggregation module (CAM) is proposed to improve network feature extraction and to efficiently utilize channel-domain contextual semantic information. The module splits the output feature map into four 1/4-channel feature maps and then uses dilated convolutions with different dilation rates in parallel to integrate multichannel contextual information.

(3) In the decoding part, we introduce the side output embedding module (SOEM) to obtain the edge change information of the dual-temporal images as well as the finer image details and complex texture features of high-resolution remote sensing images.

(4) Our network achieves impressive results on all three datasets. More specifically, we obtain an 89.87% F1 score on the challenging DSIFN test set.

The rest of the paper is organized as follows: Section 2 reviews work related to change detection. Section 3 describes the proposed method. Section 4 presents the experimental datasets and evaluation metrics. Section 5 presents the experimental design and results. Finally, Section 6 gives our conclusions and future work.

2. Related Work

In recent years, many neural network techniques and components for scene segmentation have been applied to change detection tasks to extract deeper representations. U-Net [17] first served as the benchmark model, and Siamese networks [18–24] later became the standard approach for change detection. To improve change detection performance, a great deal of work has been devoted to depth feature extraction and refinement.

2.1. Siamese Neural Network

The Siamese neural network is a coupled architecture based on two artificial neural networks. In simple terms, a Siamese neural network consists of two neural networks with the same structure and shared weights spliced together. The coupling of the two networks is achieved by sharing their weights.
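As a minimal illustration of this weight sharing, the sketch below builds a toy Siamese pair in TensorFlow/Keras. The two-layer encoder, the input size, and the absolute-difference fusion are illustrative assumptions, not the network proposed in this paper.

```python
# Minimal weight-sharing (Siamese) sketch in TensorFlow/Keras.
# The encoder layers are illustrative placeholders, not the paper's HRNet backbone.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_shared_encoder(channels=64):
    # A single encoder instance; reusing the same object on two inputs
    # is what makes the weights shared between the two branches.
    return tf.keras.Sequential([
        layers.Conv2D(channels, 3, padding="same", activation="relu"),
        layers.Conv2D(channels, 3, padding="same", activation="relu"),
    ])

img_t1 = layers.Input(shape=(256, 256, 3))
img_t2 = layers.Input(shape=(256, 256, 3))

encoder = build_shared_encoder()
feat_t1 = encoder(img_t1)   # both branches call the same layer objects,
feat_t2 = encoder(img_t2)   # so one set of weights receives both gradients

# A simple fusion of the twin features (absolute difference is one common choice).
diff = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([feat_t1, feat_t2])
change_map = layers.Conv2D(1, 1, activation="sigmoid")(diff)

siamese = Model(inputs=[img_t1, img_t2], outputs=change_map)
```

Because both branches call the same layer objects, a single set of weights is updated by gradients from both temporal images, which is what "shared weights" means in practice.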

Change detection methods based on fully convolutional networks are roughly divided into two categories: early fusion approaches (single-input) and twin network approaches (dual-input). A single-input network concatenates the dual-temporal images into one image before feeding it into the network [25, 26]. For example, in the literature [26], dual-temporal image pairs are concatenated as input to an improved UNet++ network, and change maps at different semantic levels are merged to generate the final change map. In contrast to single-input networks, dual-input networks borrow from twin networks [18, 22, 27], where the front-end feature extraction part of the fully convolutional network is replaced by two networks with the same structure. For instance, the literature [18] proposed three fully convolutional neural network frameworks for change detection in remote sensing images, one single-input and two dual-input. Many change detection experiments demonstrate that a dual-input network architecture is more suitable for change detection, and many scholars have therefore studied Siamese network-based change detection methods. Yu et al. proposed the NestNet [28] network, a model that introduces two parallel modules to extract the respective features of the diachronic images and then processes the two sets of features with an absolute-difference operation. The literature [22] proposes an IFN-based method, a dual-input, fully convolutional network approach. In this method, depth features of the dual-temporal images are extracted by a twin network, the down-sampled change maps are fed directly into the middle layers of the network during training, and finally the losses are calculated independently to update the network parameters. Fang et al. proposed the SNUNet-CD network [29]. This network is a modification of UNet++ that uses two twin convolutional branches to extract features from the two images and aggregates and refines features from multiple semantic layers by integrating a channel attention module before producing the final detection results.

2.2. Contextual Information Aggregation

Since no pixel in an image is isolated, each pixel is related to its surrounding pixels in some way. The interconnection of large numbers of pixels is what produces the various objects in an image, so the contextual features of an image are of great importance. Insufficient access to rich contextual information during the change detection task degrades the detection results. Many approaches add modules on top of the encoder network to expand its effective receptive field and integrate more contextual information. In the paper [30], global pooling operations are introduced to learn scene-level global context, and the importance of the receptive field is discussed. PSPNet [31] extends global pooling to image subregions and proposes a parallel spatial pooling design that aggregates multiscale contextual information. Dilated convolution is another design that can amplify the receptive field of CNNs without significantly increasing the computational effort [32, 33]. Combining atrous convolution with the multilevel pooling design of PSPNet, the atrous spatial pyramid pooling (ASPP) module was proposed in [34] and improved in [35–37]. Attention mechanisms [38–40], which use a sigmoid function to generate "attention" descriptors after global pooling operations, are another contextual aggregation design. In order to better serve the change detection of remote sensing images, this paper uses multiple parallel dilated convolutions to obtain global and local contextual information.

3. The Proposed Methodology

In this section, we describe in detail the proposed twin context aggregation network (TCANet) for remote sensing image change detection. First, the general structure we proposed will be outlined. After this, we give an illustration of the HRNet (our baseline network) architecture. Finally, the design of each module is presented, including the CAM module and the SOEM module.

3.1. Overview of the Network

As shown in Figure 1, a twin context aggregation network (TCANet) is designed in this paper. The model feeds the dual-temporal images into two networks with shared parameters to extract features separately. First, the dual-temporal images are input into the backbone network HRNet to obtain four change feature maps with different sizes and different numbers of channels. This structure preserves spatial detail information but does not take full advantage of contextual information. Therefore, we integrate more contextual information to improve network performance by introducing a scale-dependent contextual aggregation module in each of the four branches. Since the four parallel outputs generated by HRNet disperse information across branches, we embed different levels of local contextual information from the context aggregation module (CAM) into these features to make the outputs more informative. Then, the two sets of feature values obtained from the feature encoding stage are differenced, and the absolute values are taken to obtain dual-temporal feature fusion information at different scales. Finally, we input the fused feature maps to the side output embedding module to facilitate the detection of edges and small targets. The detailed process of the network is as follows.

As shown in Figure 1, B1, B2, B3, and B4 are the four parallel branches generated by HRNet. CAM processing is performed first, then the output of each CAM module is up-sampled (the output of the first CAM module is not up-sampled), and the associated feature maps are concatenated:

$$F_i = \mathrm{Cat}\big(\mathrm{CAM}(B_i), \mathrm{Up}(\mathrm{CAM}(B_{i+1}))\big), \quad i = 1, 2, 3, \qquad F_4 = \mathrm{CAM}(B_4),$$

where CAM(·) denotes the output feature map after CAM processing, Up(·) denotes the up-sampling operation, and Cat(·) represents the tandem stitching (concatenation) operation. Immediately afterwards, F_1, F_2, F_3, and F_4 are channel-compressed using 1 × 1 convolution so that the four output feature maps have the same number of channels. Then, an absolute-difference operation is performed with the output of the other encoder to fuse the features of the two images. That is,

$$D_i = \big|\mathrm{Conv}_{1\times 1}(F_i^{T_1}) - \mathrm{Conv}_{1\times 1}(F_i^{T_2})\big|, \quad i = 1, 2, 3, 4,$$

where Conv_{1×1}(·) represents the 1 × 1 convolution operation that achieves channel compression, F_i^{T1} is the feature map extracted by the encoding structure from the image at time T1, F_i^{T2} is the feature map extracted by the encoding structure from the image at time T2, and D_i represents the feature fusion of the two images after the absolute-difference processing. Finally, the obtained D_1, D_2, D_3, and D_4 are input to the side output embedding module (SOEM) to obtain the final predicted map.
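The sketch below illustrates the per-scale fusion step described above in TensorFlow/Keras: each scale's twin features are channel-compressed by a shared 1 × 1 convolution and fused by an absolute difference. The function name `fuse_scales` and the channel width are illustrative assumptions.

```python
# Hedged sketch of the per-scale twin-feature fusion: D_i = |Conv1x1(F_i^T1) - Conv1x1(F_i^T2)|.
import tensorflow as tf
from tensorflow.keras import layers

def fuse_scales(feats_t1, feats_t2, out_channels=64):
    """feats_t1 / feats_t2: lists of per-scale feature maps from the two encoders."""
    fused = []
    for f1, f2 in zip(feats_t1, feats_t2):
        # One shared 1x1 convolution per scale compresses both temporal features
        # to the same number of channels before differencing.
        compress = layers.Conv2D(out_channels, 1, padding="same")
        d = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([compress(f1), compress(f2)])
        fused.append(d)
    return fused  # D_1..D_4, passed on to the SOEM decoder
```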

3.2. Baseline: HRNet

Most existing encoder designs perform cascading pooling operations (down-sampling) from high resolution to low resolution to obtain deep semantic features. However, the cascaded pooling operations cause a loss of spatial accuracy that is difficult for the decoder to recover. To overcome this limitation, HRNet [41, 42] introduced a multiscale parallel design. It maintains a high-resolution output throughout and fuses multiscale information so that the network can extract as many image features as possible from the start.

As shown in Figure 2, HRNet consists of parallel subnetworks from high resolution to low resolution, with repeated information exchange across the multiresolution subnetworks (multiscale fusion). Specifically, the four parallel subnetworks are B1, B2, B3, and B4, where B1 always maintains a high-resolution representation. After each convolution block, the feature map is down-sampled by a 3 × 3 convolution with a stride of 2 to reduce its spatial size, or up-sampled, so that it can be connected to the other branches for multiscale fusion (the first convolution block of B1 does not need to be up-sampled). Finally, the network generates four sets of feature maps with different resolutions. For segmentation, these would be up-sampled to the size of branch B1 and then fused to generate the segmentation result, which we do not need to generate here. The four branches of HRNet correspond to 1/4, 1/8, 1/16, and 1/32 of the original input size.
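To make the exchange concrete, the sketch below shows one HRNet-style exchange between a high-resolution and a low-resolution branch: stride-2 3 × 3 convolution for down-sampling, 1 × 1 convolution plus bilinear up-sampling for the reverse direction. The channel widths and function name are illustrative assumptions, not the exact HRNet configuration.

```python
# Hedged sketch of one multiscale exchange unit between two HRNet-style branches.
import tensorflow as tf
from tensorflow.keras import layers

def exchange_two_branches(high_res, low_res, c_high=64, c_low=128):
    """high_res: (H, W, c_high); low_res: (H/2, W/2, c_low)."""
    # High -> low: a 3x3 convolution with stride 2 halves the spatial size.
    down = layers.Conv2D(c_low, 3, strides=2, padding="same")(high_res)

    # Low -> high: a 1x1 convolution adjusts channels, then bilinear up-sampling.
    up = layers.UpSampling2D(2, interpolation="bilinear")(
        layers.Conv2D(c_high, 1, padding="same")(low_res))

    # Each branch keeps its own resolution but absorbs the other's information.
    new_high = layers.Add()([high_res, up])
    new_low = layers.Add()([low_res, down])
    return new_high, new_low
```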

3.3. Contextual Aggregation Module (CAM)

Contextual information is essential for distinguishing the two categories of objects, changed and unchanged, and many approaches add modules on top of the encoding network to expand their effective receptive fields and integrate more contextual information. Atrous (dilated) convolution is one design that can amplify the receptive fields of a convolutional neural network without significantly increasing the computational effort [32, 33]; its structure is simple, easy to understand, and used directly or indirectly by most related work. Therefore, this paper proposes a contextual aggregation module that obtains not only global information but also more detailed local information, and it can be used together with the other two modules proposed in this paper for better performance.

The detailed design of the CAM is shown in Figure 3. Given an input feature map T of size C × H × W, the number of channels of T is first reduced to C/4 by a 1 × 1 convolution. After that, four parallel dilated convolutions with dilation rates of [1, 2, 4, 8] are used to integrate more contextual information. This increases the receptive field of the convolution kernel to capture a larger range of information while keeping the number of parameters constant and ensuring that the output feature map size remains unchanged. Finally, the convolved feature maps are concatenated with T to obtain X of size 2C × H × W, and a 1 × 1 convolution then compresses X back to C × H × W. The receptive field is calculated by the following formula:

$$\mathrm{RF} = (d - 1) \times (k + 1) + k,$$

where d denotes the dilation rate and k is the original convolutional kernel size. In this paper, parallel dilated convolutions with different dilation rates sample features to obtain feature maps with different receptive fields, and these features are then fused across channels to gather information from different channels. If the input image size is 512 × 512, the output sizes of the four branches of HRNet are 128 × 128 × 64, 64 × 64 × 128, 32 × 32 × 256, and 16 × 16 × 512. Based on experience, we set the dilation rates to 1, 2, 4, and 8 (a dilated convolution with rate = 1 is equivalent to an ordinary convolution), and the corresponding receptive fields are 3 × 3, 7 × 7, 15 × 15, and 31 × 31, respectively. Since the output feature maps of the last two branches of HRNet are 32 × 32 and 16 × 16, these receptive fields essentially cover their main areas and achieve global awareness. Although the four CAM modules share the same design, the four branches of HRNet have different output scales, so multiscale information is also obtained. Finally, a feature map that incorporates these different receptive fields further improves the performance of the network.
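A hedged sketch of this design in TensorFlow/Keras is given below; the function name and activation choice are illustrative assumptions, while the channel reduction to C/4, the four dilation rates, and the 2C-to-C compression follow the description above.

```python
# Hedged sketch of the CAM design described above.
import tensorflow as tf
from tensorflow.keras import layers

def context_aggregation_module(x, channels):
    """x: input tensor with `channels` feature channels."""
    reduced = layers.Conv2D(channels // 4, 1, padding="same")(x)   # C -> C/4

    # Four parallel 3x3 dilated convolutions with rates 1, 2, 4, 8.
    branches = [
        layers.Conv2D(channels // 4, 3, dilation_rate=r,
                      padding="same", activation="relu")(reduced)
        for r in (1, 2, 4, 8)
    ]

    # Concatenating the four branches (C in total) with the input (C) gives 2C
    # channels, which a final 1x1 convolution compresses back to C.
    merged = layers.Concatenate()(branches + [x])
    return layers.Conv2D(channels, 1, padding="same")(merged)
```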

3.4. Side Output Embedding Module (SOEM)

The side output embedding module (SOEM) proposed in this paper combines a feature pyramid network (FPN) with intermediate supervision. Feature pyramid networks fuse shallow and deep features and can improve the accuracy of small-volume target and edge detection. Intermediate supervision allows the shallow layers to be trained more fully, avoiding vanishing gradients and slow convergence. This module therefore improves the network in both speed and accuracy.

As shown in Figure 4, we use the four feature maps of different sizes obtained after dual-temporal feature fusion (D_1, D_2, D_3, D_4) to replace the top-down part of the pyramid network. First, each 1 × 1 convolved feature map is summed with the up-sampled coarser feature map to obtain a feature map containing both coarse-grained and fine-grained features. The resulting small-size feature maps are then compressed to 2 channels by 1 × 1 convolution and up-sampled to a size of 128 × 128 (the large-size feature map is not up-sampled). Although the four groups of feature maps obtained are of the same size, their semantic levels differ, as do their spatial location representations. Finally, these four feature maps are concatenated, compressed to 2 channels using a 1 × 1 convolution, and up-sampled to a size of 512 × 512, and the loss between this feature map and the ground-truth map supervises the generation of the final prediction map. The process of obtaining the prediction map is as follows:

$$P_4 = \mathrm{Conv}_{1\times 1}(D_4), \qquad P_i = \mathrm{Conv}_{1\times 1}(D_i) \oplus \mathrm{Up}(P_{i+1}), \quad i = 1, 2, 3,$$

$$\hat{Y} = \psi\big(\mathrm{Cat}(P_1, \mathrm{Up}(P_2), \mathrm{Up}(P_3), \mathrm{Up}(P_4))\big),$$

where P_i denotes the feature map obtained by the feature pyramid network, ⊕ denotes the summation of the horizontal and vertical results, Up(·) denotes the up-sampling operation, Conv_{1×1}(·) denotes the 1 × 1 convolution operation, and ψ(·) denotes compressing the feature map to 2 channels and up-sampling it to a size of 512 × 512; the final prediction map Ŷ is generated under loss supervision with the ground truth.
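The sketch below follows this decoding path in TensorFlow/Keras for a 512 × 512 input, where D_1..D_4 sit at 1/4, 1/8, 1/16, and 1/32 resolution. The intermediate channel width (64) and the function name are illustrative assumptions.

```python
# Hedged sketch of the SOEM decoding path described above.
import tensorflow as tf
from tensorflow.keras import layers

def side_output_embedding(d1, d2, d3, d4):
    """d1..d4: fused difference maps at 1/4, 1/8, 1/16, 1/32 of the input size."""
    # Top-down pathway: each 1x1-convolved map is summed with the up-sampled
    # coarser map, mixing fine-grained and coarse-grained features.
    p4 = layers.Conv2D(64, 1, padding="same")(d4)
    p3 = layers.Add()([layers.Conv2D(64, 1, padding="same")(d3),
                       layers.UpSampling2D(2, interpolation="bilinear")(p4)])
    p2 = layers.Add()([layers.Conv2D(64, 1, padding="same")(d2),
                       layers.UpSampling2D(2, interpolation="bilinear")(p3)])
    p1 = layers.Add()([layers.Conv2D(64, 1, padding="same")(d1),
                       layers.UpSampling2D(2, interpolation="bilinear")(p2)])

    # Each level is compressed to 2 channels and brought to the 128x128 grid.
    heads = []
    for p, scale in zip((p1, p2, p3, p4), (1, 2, 4, 8)):
        h = layers.Conv2D(2, 1, padding="same")(p)
        if scale > 1:
            h = layers.UpSampling2D(scale, interpolation="bilinear")(h)
        heads.append(h)

    # Concatenate, compress to 2 channels, and up-sample to the 512x512 input
    # size; the result is supervised against the ground-truth map.
    out = layers.Concatenate()(heads)
    out = layers.Conv2D(2, 1, padding="same")(out)
    return layers.UpSampling2D(4, interpolation="bilinear")(out)
```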

3.5. Loss Function

The sample categories of changed and unchanged regions are significantly imbalanced: changed targets show diverse scale characteristics and occupy only a small fraction of the image relative to the background. We therefore use a loss function that combines balanced binary cross-entropy and dice coefficient loss, which is effective under sample imbalance; the loss function [43] is a weighted sum of the two:

$$L = \lambda L_{\mathrm{bce}} + (1 - \lambda) L_{\mathrm{dice}},$$

$$L_{\mathrm{bce}} = -\sum_{j} \big[ w_c\, y_j \log \hat{y}_j + w_u\, (1 - y_j) \log (1 - \hat{y}_j) \big],$$

$$L_{\mathrm{dice}} = 1 - \frac{2 \sum_{j} y_j \hat{y}_j}{\sum_{j} y_j + \sum_{j} \hat{y}_j},$$

where L_bce is the balanced binary cross-entropy loss, L_dice is the dice coefficient loss, and λ is the weighting factor, which takes the value of 0.5. The class weights are w_c = N_u / (N_c + N_u) and w_u = N_c / (N_c + N_u), where N_c and N_u represent the numbers of changed and unchanged pixels in the ground-truth label images, respectively, y_j is the label at pixel j, and ŷ_j is the sigmoid output at pixel j.
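A minimal TensorFlow sketch of this combined loss is given below; the function name and the per-batch computation of the class weights are assumptions made for illustration, and the exact formulation of reference [43] may differ in detail.

```python
# Hedged sketch of a balanced BCE + dice loss, following the weighting above.
import tensorflow as tf

def balanced_bce_dice_loss(y_true, y_pred, lam=0.5, eps=1e-7):
    y_true = tf.cast(y_true, tf.float32)
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)

    # Class-balancing weights from the numbers of changed/unchanged pixels.
    n_changed = tf.reduce_sum(y_true)
    n_unchanged = tf.reduce_sum(1.0 - y_true)
    total = n_changed + n_unchanged
    w_c = n_unchanged / total
    w_u = n_changed / total

    bce = -tf.reduce_mean(w_c * y_true * tf.math.log(y_pred)
                          + w_u * (1.0 - y_true) * tf.math.log(1.0 - y_pred))

    dice = 1.0 - (2.0 * tf.reduce_sum(y_true * y_pred) + eps) / (
        tf.reduce_sum(y_true) + tf.reduce_sum(y_pred) + eps)

    # Weighted sum of the two terms, with lambda = 0.5 as in the paper.
    return lam * bce + (1.0 - lam) * dice
```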

4. Experimental Dataset and Evaluation

To evaluate the effectiveness of the method, we conducted comprehensive experiments on three datasets, CDD, DSIFN, and SYSU-CD, using precision (P), recall (R), F1 score (F1), and overall accuracy (OA) as evaluation metrics.

4.1. The CDD Dataset

The CDD [44] dataset with real seasonal variation was used as the first experimental dataset. It contains 7 pairs of images of 4725 × 2700 pixels with a spatial resolution of 3–100 cm. To meet the hardware requirements, the original images were cropped and rotated to produce 16,000 sample pairs of size 256 × 256 pixels, which were divided into training, validation, and test sets in the ratio of 10 : 3 : 3.

4.2. The DSIFN Dataset

The second experimental dataset consists of 6 large dual-temporal, high-resolution images covering 6 cities in China (Beijing, Chengdu, Shenzhen, Chongqing, Wuhan, and Xi'an). The five pairs of dual-temporal images of Beijing, Chengdu, Shenzhen, Chongqing, and Wuhan were cropped into 394 pairs of subimages of size 512 × 512. After data augmentation, a collection of 3,940 dual-temporal image pairs was obtained. The Xi'an image pair was cropped into 48 image pairs for model testing. There are 3,600 image pairs in the training set, 340 in the validation set, and 48 in the test set.

4.3. The SYSU-CD Dataset

The dataset contains 20,000 pairs of aerial images of size 256 × 256 taken in Hong Kong between 2007 and 2014. The main types of changes in the SYSU-CD dataset include (a) new urban buildings; (b) suburban sprawl; (c) preconstruction foundation works; (d) changes in vegetation; (e) road expansion; and (f) marine construction. The 20,000 pairs of images are divided into a training set, a validation set, and a test set in the ratio of 3 : 1 : 1. There are 12,000 pairs of images in the training data set, 4,000 pairs of images in the validation data set, and 4,000 pairs of images in the test data set.

4.4. Evaluation Metrics

Remote sensing image change detection usually uses precision (P), recall (R), F1 score (F1), and overall accuracy (OA) as evaluation metrics, as shown in equations (6) to (9). The F1 score is the harmonic mean of precision and recall; the higher the F1 score, the more robust the model. In the CD task, a large value of P denotes a small number of false alarms, and a large value of R represents a small number of missed detections. Meanwhile, F1 and OA reflect overall performance, with larger values indicating better performance. The four evaluation metrics are defined as follows:

$$P = \frac{TP}{TP + FP},$$

$$R = \frac{TP}{TP + FN},$$

$$F1 = \frac{2 \times P \times R}{P + R},$$

$$\mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN},$$

where precision (P) represents the precision rate and recall (R) represents the recall rate. P and N denote the positive and negative judgments of the model, T and F indicate whether those judgments are correct, TP refers to true positive cases, TN refers to true negative cases, FP refers to false positive cases, and FN refers to false negative cases.
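For reference, the short sketch below computes these four metrics from flattened binary prediction and ground-truth maps; the function name and the NumPy-based formulation are illustrative assumptions.

```python
# Hedged sketch computing P, R, F1, and OA from a binary confusion matrix.
import numpy as np

def change_detection_metrics(y_true, y_pred):
    """y_true, y_pred: arrays of 0/1 labels of the same shape."""
    y_true = np.asarray(y_true).astype(bool).ravel()
    y_pred = np.asarray(y_pred).astype(bool).ravel()

    tp = np.sum(y_pred & y_true)     # changed, correctly detected
    fp = np.sum(y_pred & ~y_true)    # false alarms
    fn = np.sum(~y_pred & y_true)    # missed detections
    tn = np.sum(~y_pred & ~y_true)   # unchanged, correctly detected

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    oa = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, oa
```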

5. Experimental Design and Results

Our network is implemented with the Keras framework using TensorFlow as the backend. The laboratory is equipped with a dedicated server for training the network, on which we use mini-batch gradient descent with a batch size of 4. We chose the Adam optimizer to optimize the network, with the initial learning rate for each dataset set to 0.001, and all experiments were trained for 500 epochs.
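A hedged Keras sketch of this training configuration is shown below; the function name, the model object, and the data arguments are placeholders supplied by the caller, not artifacts released with the paper.

```python
# Hedged sketch of the training setup: Adam, lr=0.001, batch size 4, 500 epochs.
import tensorflow as tf

def train_tcanet(model, x_t1, x_t2, y, x_t1_val, x_t2_val, y_val,
                 loss_fn, lr=1e-3, batch_size=4, epochs=500):
    # Adam optimizer with an initial learning rate of 0.001, as in the paper.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                  loss=loss_fn)
    # Mini-batch gradient descent over dual-temporal image pairs and labels.
    return model.fit([x_t1, x_t2], y,
                     validation_data=([x_t1_val, x_t2_val], y_val),
                     batch_size=batch_size, epochs=epochs)
```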

5.1. Intermodule Ablation Experiments

In this section, we conduct intermodule ablation experiments on the CDD, SYSU-CD, and DSIFN datasets. Table 1 shows the quantitative analysis of the different modules on the CDD and DSIFN datasets. Table 2 shows the quantitative analysis of the different modules on the SYSU-CD dataset. Figures 5–7 show the qualitative analysis on the three datasets.

5.2. Ablation Study of the Baseline Network

We conduct experiments with vgg_16 and HRNet as the backbone networks, respectively. The experimental results shown in Tables 1 and 2 demonstrate better performance when HRNet is used as the backbone network. Therefore, our benchmark network first uses two twin high-resolution networks (HRNet) as the feature encoding module for feature extraction. Next, the feature maps of different sizes output by the feature encoding module are channel-normalized, and an up-sampling operation and a tandem splicing operation are performed to obtain feature values of the same dimensionality. Then, the feature values of the dual-temporal images are differenced, and the absolute values are taken to obtain dual-temporal feature fusion information at different scales. In the feature decoding module, the fused difference feature maps of different sizes are subjected to different levels of up-sampling and fused for output. We quantitatively evaluated the performance of the benchmark network, as shown in the second row of Tables 1 and 2. The fourth and fifth columns in Figures 5–7 show that the results with HRNet as the backbone network are better than those with vgg_16 as the backbone network; with HRNet, the general outline of the changed regions is already visible in the visualization.

5.3. Ablation Studies of CAM

We designed the contextual aggregation module (CAM). This module can amplify the receptive field of the convolutional neural network without significantly increasing the computational effort, obtaining not only global but also (across different channels) detailed local contextual information. As can be seen from Table 1, adding the CAM module to the baseline network results in a significant improvement in all metrics. On the CDD dataset, the precision (P), recall (R), F1 score, and overall accuracy (OA) were 95.55%, 86.59%, 90.85%, and 96.73%, respectively, with this module added to the baseline network. Compared to the baseline network, P, R, F1, and OA improved by 0.71%, 0.82%, 0.78%, and 0.70%, respectively. On the DSIFN dataset, the precision (P), recall (R), F1 score, and overall accuracy (OA) were 90.02%, 84.43%, 87.14%, and 94.84%, respectively, after adding the module. Compared to the baseline network, P, R, F1, and OA improved by 3.59%, 0.88%, 2.18%, and 0.48%, respectively. On the SYSU-CD dataset, the precision (P), recall (R), F1 score, and overall accuracy (OA) were 88.07%, 81.97%, 87.77%, and 92.25%, respectively, with this module added to the baseline network. Compared to the baseline network, P, R, F1, and OA improved by 1.23%, 2.09%, 1.53%, and 2.51%, respectively. From the sixth column of Figures 5–7, it is clear that the CAM module improves the boundaries compared to the baseline network, and some small targets can be seen in the sixth column of Figures 5 and 6: their outlines are fully revealed, although some detailed features have not yet fully emerged and need to be extracted further. This indicates that parallel atrous convolution over multiple channels ensures maximum information extraction.

5.4. Ablation Studies of SOEM

We also studied the contribution of the SOEM module to the network. The side output embedding module (SOEM) fuses shallow and deep features, improving the accuracy of small-volume target and edge detection, and it enables the shallow layers to be trained more fully, avoiding vanishing gradients and overly slow convergence. As can be seen in Tables 1 and 2, significant gains in precision (P), recall (R), F1 score, and overall accuracy (OA) were achieved with the addition of the SOEM module compared to the first two ablation experiments. On the CDD dataset, adding this module improves P, R, F1, and OA by 1.86%, 2.21%, 2.06%, and 1.83%, respectively, compared to the baseline network. Compared to baseline + CAM, P, R, F1, and OA improved by 1.15%, 1.39%, 1.28%, and 1.15%, respectively. On the DSIFN dataset, P, R, F1, and OA improved by 4.97%, 5.38%, 4.91%, and 1.01%, respectively, compared to the baseline network after adding this module. Compared to baseline + CAM, P, R, F1, and OA improved by 1.02%, 4.50%, 2.73%, and 0.53%, respectively. On the SYSU-CD dataset, adding this module improves P, R, F1, and OA by 4.36%, 4.05%, 2.47%, and 5.15%, respectively, compared to the baseline network. Compared to baseline + CAM, P, R, F1, and OA improved by 3.13%, 1.96%, 0.94%, and 2.64%, respectively. As can be seen in the seventh column of Figures 5–7, adding this module not only improves the overall detection performance but also makes the edges more complete, and in the seventh column of Figures 5 and 6, some small targets are displayed more accurately. At the same time, the detailed features extracted by this module make the predicted result maps closer to the real labels. Therefore, using the three modules together allows the network to perform at its best.

5.5. Comparative Experiments

In order to show the superiority of our proposed method, a comparison with five other change detection networks is presented in this paper. The five change detection networks are the fully convolutional network with pyramid pooling (FCN-PP) [45], fully convolutional Siamese-concatenation (FC-siam-conc) [46], fully convolutional Siamese-difference (FC-siam-diff) [46], UNet++_MSOF [26], and IFN [22]. Table 3 shows the experimental comparison of our method with the other five methods on the CDD and DSIFN datasets. Table 4 shows the experimental comparison of our method with the other five methods on the SYSU-CD dataset. As shown in Figure 8, the performance of the different methods on the CDD, DSIFN, and SYSU-CD datasets is quantitatively analyzed using a line graph. Figures 9–11 show the visualization of the different methods on the CDD, DSIFN, and SYSU-CD datasets.

As shown in Table 3, we quantitatively evaluated the results of TCANet and the different methods on the CDD and DSIFN datasets. As shown in Table 4, we quantitatively evaluated the results of TCANet and the different methods on the SYSU-CD dataset. As shown in Figure 8, our network has the highest performance metrics on all three datasets. On the CDD dataset, the precision, recall, F1 score, and OA of TCANet were 96.70%, 87.98%, 92.13%, and 97.88%, respectively, which are higher than those of IFN by 1.00%, 0.18%, 0.55%, and 0.17%, respectively. On the DSIFN dataset, these four metrics for TCANet were 91.04%, 88.93%, 89.87%, and 95.37%, respectively, which are higher than those of IFN by 2.19%, 3.73%, 3.18%, and 6.51%, respectively. On the SYSU-CD dataset, the precision, recall, F1 score, and OA of TCANet were 91.20%, 83.93%, 87.02%, and 94.89%, respectively, which are higher than those of IFN by 3.36%, 0.60%, 1.37%, and 3.78%, respectively. Figures 9–11 show the visualization of the different methods on the three datasets, where the red boxes mark the regions with improvements. Comparing the visualization results, the prediction maps of our proposed method are closer to the real labels, demonstrating the effectiveness of the proposed method.

6. Conclusion

In this paper, a twin context aggregation network (TCANet) is investigated. The model extracts features separately by feeding dual-temporal images into two networks with shared parameters. In the feature extraction stage, considering the limitations of the traditional "encoder-decoder" structure, we introduce the parallel multiscale branches of HRNet to reduce the loss of spatial information. In addition, we designed a separate contextual aggregation module (CAM) for each branch, expanding its effective receptive field and integrating more contextual information. Then, the two sets of feature values obtained from the feature encoding stage are differenced, and the absolute values are taken to obtain dual-temporal feature fusion information at different scales. Finally, we input the fused feature maps into the side output embedding module to facilitate the detection of edges and small targets. Our proposed architecture greatly improves on existing architectures and achieves better results on three remote sensing image datasets (CDD, DSIFN, and SYSU-CD). One limitation of the method is that, in order to avoid intensive computation, HRNet reduces the spatial size of the input data in its early layers.

In the future, we plan to improve the training speed and accuracy of change detection by reducing the computation and attempting to merge more parallel branches of HRNet. Since the transformer has been widely used in the remote sensing field, and many scholars have migrated it to semantic segmentation, we also plan to introduce the transformer into change detection in the next step.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Key Research and Development Program under Contract 2017YFB0504203, Planned project of Gansu Science and Technology Department under Contract 21JR7RA310, and Youth Science Fund Project of Lanzhou Jiaotong University under Contract 2021029.