Abstract

Infrared and visible image fusion needs to preserve both the salient target of the infrared image and the texture details of the visible image. Therefore, an infrared and visible image fusion method based on saliency detection is proposed. Firstly, the saliency map of the infrared image is obtained by saliency detection. Then, a specific loss function and network architecture are designed based on the saliency map to improve the performance of the fusion algorithm. Specifically, the saliency map is normalized to [0, 1] and used as a weight map to constrain the loss function. At the same time, the saliency map is binarized to separate salient regions from nonsalient regions, and a generative adversarial network with dual discriminators is constructed. The two discriminators are used to distinguish the salient regions and the nonsalient regions, respectively, which promotes the generator to produce better fusion results. The experimental results show that the fusion results of our method are better than those of the existing methods in both subjective and objective aspects.

1. Introduction

Image fusion aims to utilize complementary information of two source images to synthesize a fused image with a more comprehensive understanding of the scene [1, 2]. The infrared image can identify the target according to thermal radiation contrast, and the visible image can provide a clear image in line with the human visual system [3, 4]. Due to these characteristics, the fusion result of the infrared and visible images can preserve the significant target of the infrared image and the texture detail of the visible image simultaneously [5, 6]. Infrared and visible image fusion has been widely used in many fields, such as target recognition, video surveillance, and scene understanding [7–9].

The key to image fusion is to integrate the effective information and remove the redundant information of the source images to obtain a better fused image [10, 11]. For this purpose, a large number of infrared and visible image fusion methods have been proposed. These methods can be divided into two categories: (i) traditional methods, which usually complete the fusion task based on mathematical transformations and manually designed rules; (ii) deep learning-based methods, which usually optimize a neural network with a specific loss function to obtain the fusion result [12].

Although the abovementioned methods can complete the fusion task successfully, several aspects still need to be improved. Firstly, manually designed fusion rules make the traditional methods complex and time-consuming. Secondly, some methods apply deep learning to only part of the fusion process, which makes it difficult to give full play to the advantages of deep learning [13]. Thirdly, due to the lack of ground truth, it is difficult for GAN-based methods to determine the input of the discriminator. Existing methods usually follow two technical routes to solve this problem: (i) using one source image as the input of the discriminator, which inevitably leads to the gradual loss of information from the other source image [14]; (ii) using a generative adversarial network with dual discriminators that takes both source images as input. However, it is difficult for this scheme to control the balance between the two discriminators [15].

To address these challenges, this paper proposes a generative adversarial network with dual discriminators for infrared and visible image fusion based on saliency detection (SDGAN). Firstly, the proposed method is based on deep learning: it optimizes the network through specific loss functions and avoids the complexity caused by manually designed fusion rules. Secondly, to address the lack of ground truth, we use a generative adversarial network with dual discriminators to handle the fusion problem. At the same time, to maintain the balance between the two discriminators, we introduce saliency detection into image fusion. The two discriminators take salient regions and nonsalient regions as inputs, respectively, which ensures that the two discriminators can work smoothly without conflict.

2. Related Work

2.1. Infrared and Visible Image Fusion
2.1.1. Traditional Methods

Traditional fusion methods can be divided into three steps: feature extraction, feature fusion, and feature reconstruction [16]. As feature reconstruction is usually the inverse process of feature extraction, the key steps of traditional methods are feature extraction and feature fusion. By employing different strategies for feature extraction and feature fusion, a large number of fusion methods have been proposed.

For feature extraction, there are four categories: (i) multiscale transform, such as pyramid transform [17], wavelet transform, and edge-preserving filter; (ii) sparse representation [18]; (iii) subspace analysis, such as independent component analysis (ICA) [19], principal component analysis (PCA) [20], and nonnegative matrix factorization (NMF) [21]; (iv) hybrid methods, which can combine the advantages of other methods to obtain better fusion results [22]. After feature extraction, appropriate feature fusion methods must be selected to fuse these features. The commonly used rules include four categories: (i) maximum-operation; (ii) minimum-operation; (iii) addition-operation; (iv) L1-norm constraints.

Although the above three steps summarize most traditional methods, some methods do not fit this framework, such as GTF, which is based on gradient transfer and total variation minimization [23].

2.1.2. Deep Learning-Based Methods

Although traditional fusion methods can produce satisfactory fused images, they are generally complex and time-consuming due to the artificially designed fusion rules. With the rise of deep learning, more and more fusion methods based on deep learning have been proposed.

Li and Wu [24] employed an encoder/decoder network architecture and introduced densely connected convolution layers in the encoder to extract features of the source images and avoid losing information during convolution. Yang et al. [25] proposed a fusion model based on visual saliency sparse representation and detail injection to avoid the loss of significant thermal radiation targets of infrared images. Zhang et al. [26] proposed an image fusion network based on proportional maintenance of gradient and intensity, named PMGI, which can preserve source image information through the gradient and intensity paths. With the rise of generative adversarial networks, Feng et al. [11] tried to use GAN to solve the image fusion problem, resulting in FusionGAN. Subsequently, its variant [27] was proposed by introducing a target-enhancement loss to enhance the edge details of the fused image. However, these methods force the fused image to obtain more details from the visible image as the adversarial game proceeds, so the thermal information of the infrared image is gradually lost. To address this issue, Ma et al. [15] introduced dual discriminators into GAN to avoid excessive loss of information from the source images.

2.2. Saliency Detection

The human visual system will focus on important regions of the image, which helps humans easily obtain important information. Saliency detection aims to simulate the human visual system to extract the significant regions of the image and prioritize allocating computing resources for important regions in subsequent processing.

Itti et al. [28] first combined multiscale features to get an initial saliency map and then used a neural network to optimize the initial saliency map to get the final result. Hou and Zhang [29] extracted the spectral residual of an image by analyzing its log spectrum and proposed a fast method to construct the corresponding saliency map in the spatial domain. Cheng et al. [30] proposed a saliency detection method based on regional contrast, simultaneously evaluating global contrast differences and spatial coherence. Traditional saliency detection methods mainly rely on manually extracted features, which are then combined to obtain a saliency map. Vig et al. [31] proposed an entirely automatic data-driven method that performs a large-scale search for optimal features to obtain a saliency map. Kümmerer et al. [32] first used deep networks for saliency detection, reusing existing networks pretrained on object recognition for fixation prediction. Since then, a large number of saliency detection methods based on neural networks have been proposed and have achieved good results.

3. Proposed Method

3.1. Problem Formulation

The infrared image can highlight the target by the difference of thermal radiation. Relatively, the visible image contains richer texture details. Infrared and visible image fusion can retain the highlighted target of the infrared image and the texture details of the visible image simultaneously. Saliency detection can extract highlighted targets of the image. Therefore, introducing saliency detection into infrared and visible image fusion can improve the performance of the image fusion algorithm.

For a given infrared image I_ir, the significance value S(p) of pixel p is obtained by calculating the distance between pixel p and all other pixels in the image, which can be defined as follows:

$$S(p) = \sum_{q \in I_{ir}} \left| I_{ir}(p) - I_{ir}(q) \right|.$$

The saliency map of the infrared image can be obtained by computing the significance value of every pixel in this way, pixel by pixel.

Then, the weight map W can be obtained by normalizing all values of the saliency map to the interval [0, 1], and it is used to constrain the fusion weights of different targets in the loss function. The calculation process of the weight is shown as follows:

$$W(p) = \frac{S(p) - \min_{q} S(q)}{\max_{q} S(q) - \min_{q} S(q)}.$$

Finally, the saliency map is binarized to extract the salient region of the image. Specifically, the normalized value of each pixel is compared with a threshold b: if the value is greater than b, the corresponding pixel of the mask M is set to 1; otherwise, it is set to 0. In this paper, b is set to 0.25, and the mask is obtained once all pixels have been processed. The mask calculation process is shown as follows:

$$M(p) = \begin{cases} 1, & W(p) > b, \\ 0, & \text{otherwise.} \end{cases}$$
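To make this pipeline concrete, the following is a minimal NumPy sketch of the saliency map, weight map, and mask computation described above. The histogram-based formulation is an efficient equivalent of summing pixel-wise distances, and the array names and toy input are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def saliency_map(ir: np.ndarray) -> np.ndarray:
    """Global-contrast saliency: S(p) = sum over q of |I(p) - I(q)|.

    A 256-bin histogram lets the sum over all pixels be computed once per
    gray level instead of once per pixel pair.
    """
    hist, _ = np.histogram(ir, bins=256, range=(0, 256))
    levels = np.arange(256, dtype=np.float64)
    # dist[l] = sum over all pixels q of |l - I(q)|
    dist = np.array([(hist * np.abs(l - levels)).sum() for l in levels])
    return dist[ir.astype(np.uint8)]

def weight_map(sal: np.ndarray) -> np.ndarray:
    """Normalize the saliency map to [0, 1] to obtain the weight map W."""
    return (sal - sal.min()) / (sal.max() - sal.min() + 1e-12)

def binary_mask(weight: np.ndarray, b: float = 0.25) -> np.ndarray:
    """Binarize the normalized saliency map with threshold b (0.25 here)."""
    return (weight > b).astype(np.float32)

# Toy usage on a random "infrared" image; replace with a real TNO image.
ir = np.random.randint(0, 256, (120, 120), dtype=np.uint8)
W = weight_map(saliency_map(ir))
M = binary_mask(W, b=0.25)
```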

As shown in Figure 1, the saliency map and mask of two typical infrared images are given. It can be seen that the saliency map can indeed detect the significant target of the infrared image and the mask can indeed represent the significant area.

Given an infrared image and a visible image, the goal of image fusion is to train a generator constrained by the source images. The fused image produced by the generator should retain the salient target of the infrared image and the texture details of the visible image at the same time.

This paper proposes a generative adversarial network with dual discriminators for infrared and visible image fusion based on saliency detection, named SDGAN. The entire procedure of our proposed SDGAN is shown in Figure 2. The infrared image and the visible image are input to the generator G to obtain an initial fused image. However, it is difficult to obtain a satisfactory fused image with the generator G alone; therefore, two discriminators D1 and D2 are introduced in our network to establish adversarial games with the generator. Through these adversarial games, the generator can generate a better fused image. The discriminator D1 is used to distinguish the salient regions of the fused image from those of the infrared image, and the discriminator D2 is used to distinguish the nonsalient regions of the fused image from those of the visible image. The salient region of an image can be obtained by multiplying the mask M and the image pixel by pixel, and the nonsalient region can be obtained by multiplying (1 − M) and the image pixel by pixel. Since the two discriminators distinguish complementary regions of the source images and the fused image, they can complete their own tasks independently without conflict.

The goal of the generator G is to synthesize a fused image that makes it difficult for both discriminators to tell whether their input comes from the fused image or from the source images. Mathematically, the generator G is trained to minimize the following objective:

$$V(G, D_1, D_2) = \mathbb{E}\big[\log D_1(M \odot I_{ir})\big] + \mathbb{E}\big[\log\big(1 - D_1(M \odot I_f)\big)\big] + \mathbb{E}\big[\log D_2\big((1 - M) \odot I_{vis}\big)\big] + \mathbb{E}\big[\log\big(1 - D_2\big((1 - M) \odot I_f\big)\big)\big],$$

where ⊙ represents the Hadamard product, G represents the generator, D1 and D2 represent the two discriminators, I_ir, I_vis, and I_f denote the infrared, visible, and fused images, M represents the mask, which is used to extract the salient area of an image, and (1 − M) is used to extract the nonsalient area. The training goal of D1 and D2 is to maximize this objective.

3.2. Loss Function

The original GAN is prone to produce artifacts, noise, or other incomprehensible results in the generated image due to the instability of its training process. In order to make the training process more stable, a common solution is to introduce a content loss. To further improve the quality of the fused image, in addition to the adversarial loss L_adv, this paper also introduces an enhancement loss L_enh. Therefore, the loss function of the generator consists of three parts, the content loss L_con, the enhancement loss L_enh, and the adversarial loss L_adv, as shown in

$$L_{G} = L_{con} + \lambda_{1} L_{enh} + \lambda_{2} L_{adv},$$

where λ1 and λ2 are introduced to control the tradeoff.

The content loss L_con is used to constrain the similarity in content between the fused image and the source images. It mainly consists of two parts, the gradient loss L_grad and the intensity loss L_int, as shown in

$$L_{con} = L_{grad} + \gamma L_{int},$$

where γ maintains the balance between the gradient loss and the intensity loss.

The gradient loss L_grad is committed to preserving the texture details of the source images in the fused image, which is defined as follows:

$$L_{grad} = \big\| \nabla I_{f} - \nabla I_{vis} \big\|_{2}^{2} + \mu \big\| \nabla I_{f} - \nabla I_{ir} \big\|_{2}^{2},$$

where ∇ represents the gradient operator, which is used to extract the gradient of an image, ‖·‖₂ represents the Euclidean norm, and μ is used to balance the two terms.

The intensity loss L_int is used to constrain the fused image and the source images to have similar intensity distributions, which is defined as follows:

$$L_{int} = \big\| I_{f} - I_{ir} \big\|_{2}^{2} + \eta \big\| I_{f} - I_{vis} \big\|_{2}^{2},$$

where η is introduced to control the tradeoff.

The enhancement loss L_enh is mainly used to enhance the highlighted targets and the texture details, which is defined as follows:

$$L_{enh} = \big\| W \odot (I_{f} - I_{ir}) \big\|_{2}^{2} + \xi \big\| (1 - W) \odot (I_{f} - I_{vis}) \big\|_{2}^{2},$$

where W represents the weight map, which is used to control the retention degree of significant targets in the fused image, (1 − W) is used to control the retention degree of nonsignificant targets, and ξ is used to balance the two terms.
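As an illustration of how the content and enhancement losses could be implemented, the following PyTorch sketch follows the definitions above. The Sobel-based gradient operator and the balance weights gamma, mu, eta, and xi are placeholders that the paper does not specify, so this should be read as an assumption-laden sketch rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def gradient(img: torch.Tensor) -> torch.Tensor:
    """Gradient magnitude via Sobel filters (one possible choice of gradient
    operator); img is expected to have shape [B, 1, H, W]."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(img, kx.to(img), padding=1)
    gy = F.conv2d(img, ky.to(img), padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-12)

def content_loss(fused, ir, vis, gamma=1.0, mu=1.0, eta=1.0):
    """L_con = L_grad + gamma * L_int; all balance weights are placeholders."""
    l_grad = (F.mse_loss(gradient(fused), gradient(vis))
              + mu * F.mse_loss(gradient(fused), gradient(ir)))
    l_int = F.mse_loss(fused, ir) + eta * F.mse_loss(fused, vis)
    return l_grad + gamma * l_int

def enhancement_loss(fused, ir, vis, weight, xi=1.0):
    """L_enh: the weight map W emphasizes salient (infrared) regions and
    1 - W emphasizes the remaining (visible) regions."""
    return (F.mse_loss(weight * fused, weight * ir)
            + xi * F.mse_loss((1 - weight) * fused, (1 - weight) * vis))
```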

The adversarial loss L_adv comes from the game between the generator and the discriminators, as shown in

$$L_{adv} = \mathbb{E}\big[\log\big(1 - D_{1}(M \odot I_{f})\big)\big] + \mathbb{E}\big[\log\big(1 - D_{2}\big((1 - M) \odot I_{f}\big)\big)\big],$$

where M represents the mask, which is used to extract the significant region of the fused image, and (1 − M) is used to extract the nonsignificant region of the fused image.

In order to make the generator converge smoothly, the two discriminators D1 and D2 are used to construct the adversarial relationship with the generator. The loss functions of the two discriminators D1 and D2 are defined as follows:

$$L_{D_{1}} = \mathbb{E}\big[-\log D_{1}(M \odot I_{ir})\big] + \mathbb{E}\big[-\log\big(1 - D_{1}(M \odot I_{f})\big)\big],$$

$$L_{D_{2}} = \mathbb{E}\big[-\log D_{2}\big((1 - M) \odot I_{vis}\big)\big] + \mathbb{E}\big[-\log\big(1 - D_{2}\big((1 - M) \odot I_{f}\big)\big)\big].$$
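The adversarial terms could be instantiated as in the sketch below. For compatibility with the tanh-scored discriminators described in Section 3.3.2, this sketch uses a least-squares variant of the adversarial objectives above; the function names and the exact loss form are assumptions, not the paper's released code.

```python
import torch

def adversarial_loss_G(D1, D2, fused, mask):
    """Generator side: push both discriminators to score the masked fused
    regions as real (target score 1)."""
    p_sal = D1(mask * fused)            # salient region of the fused image
    p_non = D2((1 - mask) * fused)      # nonsalient region of the fused image
    return (p_sal - 1).pow(2).mean() + (p_non - 1).pow(2).mean()

def discriminator_losses(D1, D2, fused, ir, vis, mask):
    """D1 separates salient regions of the infrared vs. fused image;
    D2 separates nonsalient regions of the visible vs. fused image."""
    def d_loss(D, real, fake):
        return (D(real) - 1).pow(2).mean() + D(fake.detach()).pow(2).mean()
    return (d_loss(D1, mask * ir, mask * fused),
            d_loss(D2, (1 - mask) * vis, (1 - mask) * fused))
```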

3.3. Network Architecture
3.3.1. Generator Architecture

As shown in Figure 3, a dual-encoder, single-decoder structure is adopted in the generator. Two encoders are used to extract the features of the two source images, respectively. Each encoder path adopts a four-layer network architecture for feature extraction. All convolution kernel sizes are set to 3 × 3, all strides are set to 1, and batch normalization and the ReLU activation function are used to avoid vanishing gradients and speed up network convergence. Moreover, dense connections are employed in each encoder path to realize feature reuse [33]. For the decoder, the outputs of the two encoders are concatenated as the input to reconstruct the fused image. The decoder also adopts a four-layer network architecture with a convolution kernel size of 3 × 3 and contains batch normalization and LReLU activation functions.
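A possible PyTorch realization of this dual-encoder, single-decoder generator is sketched below. The channel widths, growth rate, and final tanh output layer are assumptions; only the layer counts, kernel size, stride, dense connections, and activation functions follow the description above.

```python
import torch
import torch.nn as nn

class DenseEncoder(nn.Module):
    """Four 3x3 conv layers (stride 1, BN + ReLU) with dense connections;
    the growth rate of 16 channels per layer is an assumed value."""
    def __init__(self, in_ch: int = 1, growth: int = 16):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(4):
            self.layers.append(nn.Sequential(
                nn.Conv2d(ch, growth, 3, stride=1, padding=1),
                nn.BatchNorm2d(growth),
                nn.ReLU(inplace=True)))
            ch += growth              # dense connection: reuse all earlier features
        self.out_ch = ch

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)

class Generator(nn.Module):
    """Dual-encoder, single-decoder generator; decoder widths are assumed."""
    def __init__(self):
        super().__init__()
        self.enc_ir, self.enc_vis = DenseEncoder(), DenseEncoder()
        ch = self.enc_ir.out_ch + self.enc_vis.out_ch
        dec = []
        for w in (64, 32, 16):        # three BN + LReLU decoder layers ...
            dec += [nn.Conv2d(ch, w, 3, stride=1, padding=1),
                    nn.BatchNorm2d(w),
                    nn.LeakyReLU(0.2, inplace=True)]
            ch = w
        dec += [nn.Conv2d(ch, 1, 3, stride=1, padding=1), nn.Tanh()]  # ... plus an output layer
        self.decoder = nn.Sequential(*dec)

    def forward(self, ir, vis):
        return self.decoder(torch.cat([self.enc_ir(ir), self.enc_vis(vis)], dim=1))
```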

3.3.2. Discriminator Architecture

Two discriminators D1 and D2 are used to establish the adversarial game with the generator and promote the generator to produce more realistic and detailed images by discriminating their inputs. The discriminator D1 is used to distinguish real from generated salient regions between the infrared image and the fused image, and the discriminator D2 is used to distinguish real from generated nonsalient regions between the visible image and the fused image. The two discriminators share the same architecture but do not share parameters. The network architecture is shown in Figure 4. The first four layers are convolutions with a kernel size of 3 × 3 and the LReLU activation function. The last layer is a linear layer with the tanh activation function, which generates a scalar that estimates the probability that the input comes from real data. The stride of all convolutions is set to 2.
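A corresponding sketch of the discriminator is given below; the channel widths and the 120 × 120 input size are assumptions, while the layer count, kernel size, stride, and activation functions follow the description above. The two discriminators are simply two independent instances of the same class.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Four 3x3 stride-2 convolutions with LReLU, followed by a linear layer
    with tanh that outputs a scalar realness score."""
    def __init__(self, in_ch: int = 1, widths=(16, 32, 64, 128), patch: int = 120):
        super().__init__()
        convs, ch = [], in_ch
        for w in widths:
            convs += [nn.Conv2d(ch, w, 3, stride=2, padding=1),
                      nn.LeakyReLU(0.2, inplace=True)]
            ch = w
        self.features = nn.Sequential(*convs)
        with torch.no_grad():         # infer the flattened feature size
            n = self.features(torch.zeros(1, in_ch, patch, patch)).numel()
        self.score = nn.Sequential(nn.Flatten(), nn.Linear(n, 1), nn.Tanh())

    def forward(self, x):
        return self.score(self.features(x))

# Two independent instances: same architecture, no shared parameters.
D_salient, D_nonsalient = Discriminator(), Discriminator()
```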

4. Experimental Results

4.1. Dataset and Training Details

The training dataset comes from the public infrared and visible dataset TNO, which is the most commonly used dataset for infrared and visible image fusion tasks. 28 images are selected from TNO to train the model; however, 28 images alone are not enough to train a good model. Therefore, a cropping strategy is adopted to expand the training dataset, and each image is cropped into patches of size 120 × 120. Eventually, 23364 image patches are available to train the model.
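As an example of such a cropping strategy, the sketch below slides a 120 × 120 window over a registered image pair; the stride value is an assumption chosen only to illustrate how a small set of source images can yield tens of thousands of patch pairs.

```python
import numpy as np

def crop_patches(ir: np.ndarray, vis: np.ndarray, size: int = 120, stride: int = 14):
    """Slide a size x size window over a registered infrared/visible pair
    and return the resulting patch pairs (stride is illustrative only)."""
    patches = []
    h, w = ir.shape
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            patches.append((ir[y:y + size, x:x + size],
                            vis[y:y + size, x:x + size]))
    return patches
```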

In the training process, the generator takes 32 pairs of infrared and visible image patches as input at a time. Next, 32 pairs of salient regions from the infrared and fused image patches are used as the input of the discriminator D1. Simultaneously, 32 pairs of nonsalient regions from the visible and fused image patches are fed into the discriminator D2. We first train the discriminators once and then train the generator, repeating this process until the maximum number of training iterations is reached. All parameters of our model are updated by the Adam optimizer [34] with a learning rate of 10^-4.
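Putting the pieces together, a training loop consistent with this description might look as follows. It assumes the Generator, Discriminator, and loss sketches given earlier are in scope, that a data loader yields batches of 32 patch pairs together with their masks and weight maps, and that the loss tradeoff weights are folded into the loss functions; none of this is taken from the authors' released code.

```python
import torch

def train_sdgan(loader, epochs: int = 10, device: str = "cuda"):
    """Alternate one discriminator step and one generator step per batch."""
    G = Generator().to(device)
    D1, D2 = Discriminator().to(device), Discriminator().to(device)
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
    opt_d = torch.optim.Adam(list(D1.parameters()) + list(D2.parameters()), lr=1e-4)

    for _ in range(epochs):
        for ir, vis, mask, weight in loader:
            ir, vis = ir.to(device), vis.to(device)
            mask, weight = mask.to(device), weight.to(device)

            # 1) train the two discriminators once on their masked regions
            fused = G(ir, vis)
            loss_d1, loss_d2 = discriminator_losses(D1, D2, fused, ir, vis, mask)
            opt_d.zero_grad()
            (loss_d1 + loss_d2).backward()
            opt_d.step()

            # 2) then train the generator against both discriminators
            fused = G(ir, vis)
            loss_g = (content_loss(fused, ir, vis)
                      + enhancement_loss(fused, ir, vis, weight)
                      + adversarial_loss_G(D1, D2, fused, mask))
            opt_g.zero_grad()
            loss_g.backward()
            opt_g.step()
    return G
```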

4.2. Compared Methods and Objective Indexes

To evaluate the performance of our method, we make qualitative and quantitative comparisons with existing advanced methods. We compare our method with five existing methods, including three traditional methods, i.e., LP [35], DTCWT [36], and FPDE [37], and two deep learning-based methods, i.e., FusionGAN [14] and DenseFuse [24]. All traditional methods run on the same CPU (i7-7700K), while the deep learning-based methods run on the same GPU (GTX 1080 Ti). All comparison methods are implemented with their public code and default parameters.

Although qualitative comparison can measure the performance of a method to a certain extent, it is easily affected by people's subjectivity. Therefore, quantitative comparisons are also used to evaluate our method more comprehensively. Three quantitative metrics are adopted to evaluate our SDGAN and the comparison methods, i.e., EN [38], SD [39], and SSIM [40].

Entropy (EN) is a common statistical measure of image information, reflecting the amount of information the fused image obtains from the infrared and visible images. Mathematically, entropy can be defined as follows:

$$EN = -\sum_{l=0}^{L-1} p_{l} \log_{2} p_{l},$$

where L denotes the number of gray levels of the image and p_l is the normalized histogram value of gray level l in the fused image.
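For reference, EN can be computed from the fused image's histogram as in the short NumPy sketch below (256 gray levels assumed).

```python
import numpy as np

def entropy(fused: np.ndarray, levels: int = 256) -> float:
    """EN = -sum_l p_l * log2(p_l) over the normalized gray-level histogram."""
    hist, _ = np.histogram(fused, bins=levels, range=(0, levels))
    p = hist / hist.sum()
    p = p[p > 0]                      # skip empty bins so log2 is defined
    return float(-(p * np.log2(p)).sum())
```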

Standard deviation (SD) represents the dispersion of the image gray-scale values relative to the average gray-scale value, defined as

$$SD = \sqrt{\frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N}\big(F(i, j) - \bar{F}\big)^{2}},$$

where F(i, j) is the pixel value of the fused image in the i-th row and the j-th column, M × N denotes the size of the fused image F, and F̄ is the average pixel value of the fused image.
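SD reduces to the usual pixel-wise standard deviation, as the short sketch below shows.

```python
import numpy as np

def standard_deviation(fused: np.ndarray) -> float:
    """SD = sqrt(mean((F(i, j) - mean(F))^2)) over all pixels."""
    f = fused.astype(np.float64)
    return float(np.sqrt(((f - f.mean()) ** 2).mean()))  # equivalent to np.std(f)
```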

Structural similarity (SSIM) mainly models image loss and distortion from three aspects: loss of correlation, luminance distortion, and contrast distortion. The product of the three components is the evaluation result for one source image, and the final score is computed against both source images, which can be defined as follows:

$$SSIM(X, F) = \frac{2\mu_{X}\mu_{F} + C_{1}}{\mu_{X}^{2} + \mu_{F}^{2} + C_{1}} \cdot \frac{2\sigma_{X}\sigma_{F} + C_{2}}{\sigma_{X}^{2} + \sigma_{F}^{2} + C_{2}} \cdot \frac{\sigma_{XF} + C_{3}}{\sigma_{X}\sigma_{F} + C_{3}}, \qquad SSIM = SSIM(A, F) + SSIM(B, F),$$

where A and B represent the two source images and F the fused image, μ denotes the mean value, σ represents the standard deviation or covariance, and C1, C2, and C3 are parameters that make the metric stable.
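A simplified, single-window version of this SSIM computation is sketched below; the constants follow the common K1 = 0.01, K2 = 0.03, L = 255 convention with C3 = C2 / 2, and summing the per-source scores is an assumption about how they are aggregated.

```python
import numpy as np

def ssim_global(x: np.ndarray, f: np.ndarray,
                c1: float = 6.5025, c2: float = 58.5225, c3: float = 29.26125) -> float:
    """Single-window SSIM: product of luminance, contrast, and structure terms."""
    x, f = x.astype(np.float64), f.astype(np.float64)
    mx, mf = x.mean(), f.mean()
    sx, sf = x.std(), f.std()
    sxf = ((x - mx) * (f - mf)).mean()
    luminance = (2 * mx * mf + c1) / (mx ** 2 + mf ** 2 + c1)
    contrast = (2 * sx * sf + c2) / (sx ** 2 + sf ** 2 + c2)
    structure = (sxf + c3) / (sx * sf + c3)
    return float(luminance * contrast * structure)

def fusion_ssim(ir: np.ndarray, vis: np.ndarray, fused: np.ndarray) -> float:
    """Score the fused image against both source images (summed here)."""
    return ssim_global(ir, fused) + ssim_global(vis, fused)
```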

4.3. Qualitative Comparisons

Qualitative comparisons mainly evaluate the performance of a method according to the human visual system. In this paper, three typical infrared and visible image pairs are used to evaluate the methods. The experimental results are shown in Figure 5.

From top to bottom in Figure 5, the three rows show the fusion results of Kaptein_1654, Marne_02, and soldier_in_trench_2. From left to right, the first two columns present the original infrared and visible images. The last column is the fusion result of the proposed SDGAN, and the remaining columns correspond to the fusion results of LP, DTCWT, FPDE, FusionGAN, and DenseFuse.

As shown in Figure 5, all methods can complete the fusion task, but each comparison method only retains the information of one source image well. For example, the fusion result of FusionGAN retains the significant target of the infrared image but loses many texture details of the visible image. Conversely, although the fusion results of LP, DTCWT, FPDE, and DenseFuse retain the texture details of the visible image, the significant targets of the infrared image are not prominent. In contrast, the fusion results of our SDGAN highlight significant targets and retain the texture details of the visible image at the same time; for example, the human in Kaptein_1654 and soldier_in_trench_2 is significantly brighter than in the other methods, and the texture details on the wall in Marne_02 are clearer.

4.4. Quantitative Comparisons

Qualitative comparisons can hardly avoid the influence of people's subjective emotions. In order to evaluate our SDGAN more comprehensively, quantitative comparisons are also employed. Since a single objective metric cannot evaluate a fusion method comprehensively, entropy (EN), standard deviation (SD), and structural similarity (SSIM) are used in this paper. Quantitative comparisons are performed on 32 image pairs. The results are shown in Table 1.

From the experimental results, we can find that the proposed SDGAN achieves the optimal results on all three metrics. The optimal entropy shows that the fusion result of our SDGAN obtains the most information from the source images, which indicates that the proposed fusion method is indeed effective and retains rich source image information. The largest standard deviation shows that the fused image of our method has higher contrast, which indicates that, through saliency detection, the fused image retains more intensity information of the infrared image in the salient area and more texture details of the visible image in the nonsalient area. The optimal structural similarity shows a strong correlation between the fusion results of SDGAN and the source images and that the fused image is not seriously distorted, which again demonstrates that the proposed method retains more information from the source images.

The average running time of LP, DTCWT, FPDE, FusionGAN, DenseFuse, and the proposed SDGAN is presented in Table 2. It can be seen that the average running time of our method is second only to LP, indicating that the method does not sacrifice efficiency while improving the quality of the fused image.

4.5. Ablation Experiment

In order to generate high-quality fused images, two discriminators are employed in our network. We conducted an ablation experiment to verify the role of the two discriminators by removing both of them. The comparison results are given in Figure 6. We can find that our SDGAN better preserves the texture details of the visible image while preserving the significant targets of the infrared image; for example, the result of our SDGAN better preserves the details of the shrubs in the first row.

5. Conclusions

In this paper, an infrared and visible image fusion method based on saliency detection is proposed. The saliency map of the infrared image is extracted through saliency detection and is employed not only in the loss function to train the model but also in the network architecture. We construct a generative adversarial network with dual discriminators. The saliency map divides the image into significant regions and nonsignificant regions, and the dual discriminators are used to identify them, respectively. Through the adversarial game, the generator can generate more realistic fused images with highlighted targets and rich texture details. Qualitative and quantitative experiments show that the proposed SDGAN achieves the intended effect: the fused image retains both the salient target and rich texture details.

Data Availability

The dataset used to support the findings of this study is included within the open data collection at https://figshare.com/articles/TNO_Image_Fusion_Dataset/1008029.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors would like to thank Jiayi Ma and Hui Li for providing their codes. This work was supported in part by the National Natural Science Foundation of China under Grants 62171327, 62171328, and 62072350, the Hubei Technology Innovation Project under Grant 2019AAA045, the Key Scientific and Technological Research Project of Hubei Provincial Education Department under Grant D20201507, the first batch of application basic technology and science research foundation in Hubei Nuclear Power Operation Engineering Technology Research Center under Grant B210610, and the Nuclear Energy Development Project (Sub-Project: Artificial Intelligence in Nuclear Reactors) in State Administration of Science, Technology and Industry for National Defence, PRC under Grant ZX200302.