Abstract

A novel infrared and visible image fusion method in a multilevel low-rank decomposition framework based on guided filtering and feature extraction is proposed to address the lack of edge information and blurred details in fused images. Based on multilevel low-rank decomposition, the fusion strategies for the base part and the detail contents are improved. Firstly, the source infrared and visible images are decomposed into base part coefficients and multilevel detail content coefficients by multilevel low-rank decomposition. Secondly, the base part coefficients are fed to the VGG-19 network to obtain the weight map; an improved weight map is then obtained by guided filtering, and the base part coefficients are fused with it to acquire the fused base part coefficients. The detail content coefficients at each level are fused using the rule of dynamic level measurement with maximum value and then reconstructed to obtain the final fused detail content coefficients. Finally, the fused base part and detail content information are superimposed to obtain the final fusion result. The results show that the fusion algorithm can effectively preserve the edge and detail features of the source images. Compared with other state-of-the-art fusion methods, the proposed method performs better in objective assessment and visual quality. The average values of two of the evaluation metrics are improved by 0.5337 and 1.0673 on the six image pairs.

1. Introduction

Image fusion is an image enhancement technology aimed at combining different images to generate a stable and informative image, which facilitates subsequent processing and helps in decision making. Recently, many fusion methods have been proposed to fuse the features of infrared and visible images into a single image [1]. Visible images usually have high spatial resolution and rich detail contrast but are easily affected by harsh environments and climatic conditions. Infrared images depict the temperature or radiation of an object and are not easily affected by the environment or climatic conditions. However, infrared images have some shortcomings, such as inconspicuous texture details and poor resolution. Therefore, the two modalities can be exploited together to convey complementary information, which is useful in many applications, such as surveillance [2], object detection, and target recognition [3–5]. Multiscale transforms [6, 7] and representation learning [8] are generally used in the image fusion field.

The traditional multiscale transformation methods decompose the source images into base parts and detail contents at distinct scales. The base part mainly represents the contours and edges of the source image, and the detail content contains more detailed texture information. The base part and detail content are fused according to predefined rules in the transform domain. Then, the final fused image is obtained through inverse multiscale transformation [9]. Typical algorithms include the discrete wavelet transform (DWT) [10], contourlet transform [11, 12], shearlet transform [13], and multilevel decomposition latent low-rank representation (MDLatLRR) [14]. These decomposition methods can be consistent with human visual characteristics but easily introduce artifacts. Hence, many other approaches have attracted great attention, such as sparse representation and low-rank representation.

In the sparse domain, sparse representation (SR) [15] and dictionary learning [16] are widely used in image fusion. For instance, Li et al. [17] proposed a novel multimodal fusion method via three-layer decomposition and SR. There are also many methods that combine SR with other approaches for image fusion, such as low-rank representation (LRR) [18]. Zhu et al. [19] proposed a novel multimodality image fusion method based on image decomposition and sparse representation, in which the texture components are preserved well by a sparse representation-based method. In [20], Liu et al. proposed a fusion method based on convolutional sparse representation (CSR), in which the details of the source images are retained well; multilayer features were further exploited to capture more detail in [21]. Besides, joint sparse representation (JSR) [22] and cosparse representation [23] are also used in the sparse domain. Although SR-based methods can improve image fusion performance, they are too time-consuming in the dictionary learning operations [24]. These issues have prompted growing interest in using deep learning to replace dictionary learning in SR.

In deep learning-based fusion methods, deep features of the source images are extracted to reconstruct the fused images. For example, the VGG-19 [25], ResNet-50 [26], and DenseFuse [27] network architectures are commonly used in deep learning-based methods. Ma et al. [28, 29] proposed multimodal image fusion methods based on adversarial networks, which improve the performance of image fusion to a large extent. Although deep learning-based methods have performed well in image fusion, they still have some drawbacks; for example, the deeper the network, the more complex the choice of parameters becomes.

To preserve more of the edge and detail features of the source images, we propose a multilevel low-rank decomposition framework based on guided filtering and feature extraction for infrared and visible image fusion. This solution uses the MDLatLRR method to decompose the original images and extract the detail content coefficients. The fused detail content coefficients are obtained by dynamic level measurement with maximum value. After superimposing these detail content coefficients, the edge and structure information of the original images is well retained, and the display of the object is improved. Then, the VGG-19 network is used to extract the significant area, structure, and object characteristics of the base part coefficients, and the weight maps are produced according to the activity level of the base part. In order to better preserve the edge information of the base part, the improved weight map is obtained by guided filtering. The improved weight map and the base part coefficients are then combined by the Hadamard product to acquire the fused base part coefficients. Finally, the fused base part and detail content information are superimposed to obtain the final fusion result. The experimental results show that the proposed method significantly outperforms the comparison methods in retaining image information. The significant contributions of this paper are summarized as follows:
(1) We introduce MDLatLRR to decompose the source images and determine the optimal number of decomposition layers for infrared and visible image fusion.
(2) Base part fusion: to obtain more feature information, we use the VGG-19 network and guided filtering to fuse the base parts. Firstly, the base part coefficients are processed by the VGG network to obtain the weight map. The weight map obtained in this way adapts well to the block-wise distribution of pixel information in the base part coefficients of the source images. Then, the improved weight map is obtained by guided filtering, which effectively preserves edge information and reduces noise in the weight map. Finally, the fused base part coefficients are acquired by multiplying the improved weight map and the base part coefficients.
(3) Detail content fusion: it is well known that the larger a detail content coefficient is, the more information it contains. The detail content coefficients at each level are fused using dynamic level measurement with maximum value and then reconstructed to obtain the final fused detail content coefficients, which preserves more detail content information from the source images.
(4) We conduct ablation experiments on the number of decomposition layers of MDLatLRR and the number of layers of VGG-19 and finally select a five-layer VGG-19 network to sufficiently extract features.

The remainder of this paper is organized as follows. Section 2 introduces multilevel decomposition latent low-rank representation, which is used to decompose the source images. Section 3 presents the fusion method of this paper: the base part is fused with the VGG-19 network and guided filtering, and the detail content is fused with dynamic level measurement with maximum value. Section 4 presents the structure of the proposed image fusion algorithm. The experimental results are discussed in Section 5. Section 6 concludes this paper.

2. Multilevel Decomposition Latent Low-Rank Representation

In this section, the method of MDLatLRR is introduced. Liu et al. [30] proposed the method of low-rank representation (LRR), which can extract features from the input data. LRR explores the multisubspace structure of data by finding the lowest-rank representation of the data.

However, this method does not work well when the input data are insufficient or corrupted. In order to obtain good performance, the theory of latent low-rank representation (LatLRR) [31] was proposed. The method utilizes more data to acquire the dictionary. In addition, salient features can be extracted from the source data [31] by using LatLRR. More specifically, the single-level decomposition by LatLRR (DLatLRR) is formulated as
$$V_d = L \cdot P(I), \qquad I_d = R(V_d), \qquad I_b = I - I_d,$$
where $I$ is the source image, $P(\cdot)$ represents the two-stage operator composed of reshuffling and the sliding-window technique, $L$ denotes the projection matrix obtained by LatLRR, $V_d$ denotes the decomposed result of the source image, $R(\cdot)$ is the operator that reconstructs the detail image from the detail content, and $I_d$ and $I_b$, respectively, signify the detail content and the base part of the source image.
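For completeness, the projection matrix $L$ used above is obtained by solving the LatLRR problem of [31], which is commonly written as follows (the notation here is generic and may differ from the authors' own):
$$\min_{Z, L, E} \; \|Z\|_* + \|L\|_* + \lambda \|E\|_1 \quad \text{s.t.} \quad X = XZ + LX + E,$$
where $X$ is the observed data matrix built from image patches, $Z$ is the low-rank coefficient matrix, $E$ is the sparse error, $\|\cdot\|_*$ denotes the nuclear norm, and $\lambda > 0$ balances the low-rank and sparse terms. The learned $L$ projects the data onto the salient (detail) component.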

Based on DLatLRR, the multilevel latent low-rank representation (MDLatLRR) [14] is formed, which is able to extract salient features from the source image at several scales. The method of MDLatLRR is formulated as
$$V_d^i = L \cdot P\left(I_b^{i-1}\right), \qquad I_d^i = R\left(V_d^i\right), \qquad I_b^i = I_b^{i-1} - I_d^i, \qquad i = 1, 2, \ldots, r,$$
where $i$ and $r$ represent the present and the highest decomposition level, respectively, $V_d^i$ means the $i$th-level decomposition result of the source image, $I_d^i$ and $I_b^i$, respectively, signify the $i$th-level detail content and base part of the source image, and $I_b^0 = I$ indicates the source image. In the end, one base part $I_b^r$ and $r$ detail contents are obtained at the different decomposition levels.

The framework of MDLatLRR is described in Figure 1. The source image is decomposed into the base part $I_b^1$ and the detail content $I_d^1$ by DLatLRR. In order to obtain more feature information from the base part, $I_b^1$ is further decomposed into $I_b^2$ and $I_d^2$ by DLatLRR. If the decomposition level is $r$, $r$ detail contents and one base part are obtained. As a result, the fused image can show more information from the source image. Nevertheless, as the decomposition level increases, more artifacts are introduced. An important problem is how to select a suitable decomposition level; a detailed description is given in Section 5.1. Next, we introduce the fusion methods of the base part and the detail content, respectively.
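As a concrete illustration, the following Python sketch shows how the multilevel decomposition could be implemented once the projection matrix $L$ has been learned offline by LatLRR. The function names, the 16 × 16 window, and the averaging of overlapping patches are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def im2patches(img, win=16, stride=1):
    """P(.): slide a win x win window over the image and reshuffle each
    patch into a column vector."""
    h, w = img.shape
    cols = [img[y:y + win, x:x + win].reshape(-1)
            for y in range(0, h - win + 1, stride)
            for x in range(0, w - win + 1, stride)]
    return np.stack(cols, axis=1)                      # (win*win, n_patches)

def patches2im(cols, shape, win=16, stride=1):
    """R(.): put the patch columns back and average overlapping pixels."""
    h, w = shape
    out, cnt, i = np.zeros((h, w)), np.zeros((h, w)), 0
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            out[y:y + win, x:x + win] += cols[:, i].reshape(win, win)
            cnt[y:y + win, x:x + win] += 1
            i += 1
    return out / np.maximum(cnt, 1)

def mdlatlrr_decompose(img, L, levels=1, win=16):
    """V_d^i = L @ P(I_b^{i-1}), I_d^i = R(V_d^i), I_b^i = I_b^{i-1} - I_d^i."""
    base, detail_vectors = img.astype(np.float64), []
    for _ in range(levels):
        Vd = L @ im2patches(base, win)                 # project onto the detail subspace
        detail_vectors.append(Vd)                      # keep the vectors for later fusion
        base = base - patches2im(Vd, base.shape, win)  # peel off this detail layer
    return base, detail_vectors
```

The detail vectors are kept (rather than only the detail images) because the detail content fusion described in Section 3.2 operates on them before the refactor operator $R(\cdot)$ rebuilds the detail images.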

3. Fusion Method

The source images are decomposed into base parts and detail contents using the MDLatLRR method. The base part contains edge information and basic contour information. Simonyan and Zisserman [25] first employed the VGG network to extract features from images at different levels and obtained excellent results. As the level of decomposition increases, the amount of information contained in the base layer becomes less and less. By applying the VGG network to the base part, more helpful information can be identified and integrated, and the generated weight map will contain more useful information. Then, in order to retain more edge information, guided filtering [32] is used to smooth the weight map. Finally, the fused base part is acquired by multiplying the refined weight maps and the base parts. In contrast to the base part, the detail content contains more structural and textural information. The fused detail content is obtained using the rule of dynamic level measurement with maximum value [33].

3.1. Fusion of Base Parts

VGG-19 is a convolutional neural network with 19 layers, including 16 convolutional layers and three fully connected layers [34]. The structure of VGGNet is straightforward, using the same convolutional kernel size (3 × 3) and max-pooling size (2 × 2) throughout the network. The performance can be improved by continuously deepening the network structure. The structure diagram is shown in Figure 2. So that the fused base part contains more information, a five-layer VGG-19 network is used to extract features from the base part and form the feature maps.

For the base parts $I_{b,1}$ and $I_{b,2}$ of the infrared and visible images, $\phi_1^5$ and $\phi_2^5$ indicate the deep features extracted from the base parts by the fifth convolutional layer of VGG-19. As shown in Figure 2, the 5th convolutional layer is conv3-512, so there are 512 feature maps for each base part. In addition, the pooling operations resize the feature maps, which at the fifth layer are $1/2^4$ of the original size in each dimension. Inspired by [18], the $l_1$-norm of $\phi_k^5(x, y)$ is used as the activity level measure of the base part, where $k \in \{1, 2\}$. Hence, the activity level map $C_k$ is given by
$$C_k(x, y) = \left\| \phi_k^5(x, y) \right\|_1 .$$

The soft-max operator is used to obtain the initial weight maps $W_k$, as shown in
$$W_k(x, y) = \frac{C_k(x, y)}{\sum_{n=1}^{K} C_n(x, y)},$$
where $K$ is the number of weight maps, which is set to 2 (one per source image), and $W_k(x, y)$ denotes the value of the $k$th weight map at position $(x, y)$.

Using the upsampling operator, the final weight maps $\hat{W}_k$ are obtained so that they are consistent with the size of the base part:
$$\hat{W}_k(x + p, y + q) = W_k(x, y), \qquad p, q \in \left\{0, 1, \ldots, 2^4 - 1\right\}.$$
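The following sketch illustrates this step with torchvision's pretrained VGG-19. The layer index for relu_5_1, the grayscale-to-RGB replication, and the omission of ImageNet normalization are simplifying assumptions, not the authors' exact pipeline.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# features[:30] ends at relu_5_1 (fifth convolutional block) in torchvision's VGG-19
_vgg = vgg19(weights="IMAGENET1K_V1").features[:30].eval()   # torchvision >= 0.13

def base_weight_maps(bases):
    """bases: two base parts as 2-D float tensors in [0, 1].
    Returns one upsampled soft-max weight map per base part."""
    acts = []
    with torch.no_grad():
        for b in bases:
            x = b.unsqueeze(0).unsqueeze(0).repeat(1, 3, 1, 1)   # grey -> 3 channels
            feat = _vgg(x)                                       # 1 x 512 x H/16 x W/16
            acts.append(feat.abs().sum(dim=1, keepdim=True))     # l1-norm activity C_k
    total = torch.stack(acts).sum(dim=0) + 1e-12
    weights = [a / total for a in acts]                          # soft-max weights W_k
    h, w = bases[0].shape
    # nearest-neighbour upsampling keeps the block structure of the weight map
    return [F.interpolate(wm, size=(h, w), mode="nearest")[0, 0] for wm in weights]
```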

In order to retain more edge information in the base part, guided filtering is used to smooth the final weight maps $\hat{W}_k$. The detailed calculation procedure for guided filtering is described in [32]. First, $\hat{W}_k$ is processed to obtain the binary map
$$P_k(i) = \begin{cases} 1, & \hat{W}_k(i) = \max\left(\hat{W}_1(i), \hat{W}_2(i)\right), \\ 0, & \text{otherwise}, \end{cases}$$
where $P_k(i)$ denotes the value of the $i$th pixel of the $k$th binary map and $\hat{W}_k(i)$ means the value of the $i$th pixel of the $k$th weight map.

Then, using the source images $I_1$ and $I_2$ as guidance images, guided filtering is applied to $P_1$ and $P_2$, as shown in
$$W_k^g = G_{r, \varepsilon}\left(P_k, I_k\right), \qquad k = 1, 2,$$
where $W_k^g$ denotes the refined weight map smoothed by guided filtering, $G_{r,\varepsilon}$ is the guided filtering function, and $r$ and $\varepsilon$ represent the parameters of guided filtering. If the filtering is too smooth, the edges and features of the image become inconspicuous. The values of the parameters $r$ and $\varepsilon$ are given in Section 5.1.

The fused base part is calculated by
$$F_b(x, y) = \sum_{k=1}^{2} W_k^g(x, y) \cdot I_{b,k}(x, y),$$
where $\cdot$ denotes the element-wise (Hadamard) product.
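A possible implementation of the refinement and base fusion is sketched below with OpenCV's guided filter (available in the opencv-contrib-python package). Normalizing the refined weights so they sum to one at every pixel is an extra assumption added here for numerical safety; the paper does not state it.

```python
import numpy as np
import cv2  # cv2.ximgproc requires the opencv-contrib-python package

def fuse_base_parts(bases, weights, sources, radius=45, eps=0.3):
    """bases / sources: two float images in [0, 1];
    weights: the upsampled VGG-19 weight maps from the previous step."""
    stack = np.stack([np.asarray(w, dtype=np.float32) for w in weights])
    binary = (stack == stack.max(axis=0, keepdims=True)).astype(np.float32)   # P_k
    refined = [cv2.ximgproc.guidedFilter(guide.astype(np.float32), P, radius, eps)
               for P, guide in zip(binary, sources)]                          # W_k^g
    norm = sum(refined) + 1e-12                # optional per-pixel normalisation (assumed)
    fused = np.zeros_like(np.asarray(bases[0], dtype=np.float32))
    for Wg, B in zip(refined, bases):
        fused += (Wg / norm) * np.asarray(B, dtype=np.float32)                # F_b
    return fused
```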

3.2. Fusion of Detail Content

In general, the greater the activity level of a coefficient, the more information the image contains at that location. To make the fused image include rich information, we use a fusion method called dynamic level measurement with maximum value to fuse the detail content. The variance of each image patch over a small window can be calculated as a measure of activity, and the activity measure is associated with the pixel in the center of that window. The activity measurements at corresponding positions can be taken either as the maximum or as the average, which give similar results. Since the activity measure in [35] corresponds to cascading a linear high-pass filter with a nonlinear high-pass filter, it has no clear physical meaning. In our implementation, we use the maximum absolute value within the window as the activity measure associated with the center pixel. Consistency verification can be understood as switching to image B in the transform domain if the central pixel value comes from image A while most of the surrounding pixel values come from image B. The fusion strategy for the detail content matrices is shown in Figure 3.

Firstly, the local energies $E_1$ and $E_2$ are calculated for the corresponding local regions (patch vectors) of the infrared and visible detail content, as shown in
$$E_k(j) = \sum_{i \in \Omega} \left[ V_{d,k}(i, j) \right]^2, \qquad k = 1, 2,$$
where $E_k(j)$ denotes the magnitude of the local energy of the $j$th detail content vector and $\Omega$ defines the local area, i.e., the support of the patch vector.

Then, the local area matching degree is calculated by
$$M_{1,2}(j) = \frac{2 \sum_{i \in \Omega} V_{d,1}(i, j) \, V_{d,2}(i, j)}{E_1(j) + E_2(j)} .$$

When the two detail contents are strongly correlated, a weighted average is used; otherwise, the coefficient with the higher local energy is taken. The fused detail content vector $V_{dF}$ is acquired by
$$V_{dF}(\cdot, j) = \begin{cases} V_{d,1}(\cdot, j), & M_{1,2}(j) < T \text{ and } E_1(j) \ge E_2(j), \\ V_{d,2}(\cdot, j), & M_{1,2}(j) < T \text{ and } E_1(j) < E_2(j), \\ \omega_1 V_{d,1}(\cdot, j) + \omega_2 V_{d,2}(\cdot, j), & M_{1,2}(j) \ge T, \end{cases}$$
where $T$ is the matching threshold and $\omega_1$ and $\omega_2$ are the weighting factors.

This strategy is applied to all detail content vectors $V_{d,k}^i$. The detail content fusion procedure is shown in Figure 3. Each fused detail content is obtained by
$$I_{dF}^i = R\left(V_{dF}^i\right),$$
where $R(\cdot)$ denotes the refactor operator, which is mainly used to reorganize the fused vectors into image blocks.
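A compact sketch of this rule is given below. The paper does not restate the exact weighting factors or the threshold value, so the classic match-and-salience weights and $T = 0.75$ are used here purely as illustrative assumptions.

```python
import numpy as np

def fuse_detail_vectors(V1, V2, T=0.75):
    """V1, V2: detail-content matrices from DLatLRR (one patch vector per column).
    T and the weight formula are assumptions; the paper only specifies
    'weighted average when strongly correlated, max-energy coefficient otherwise'."""
    E1 = np.sum(V1 ** 2, axis=0)                           # local energy per vector
    E2 = np.sum(V2 ** 2, axis=0)
    M = 2.0 * np.sum(V1 * V2, axis=0) / (E1 + E2 + 1e-12)  # matching degree
    selected = np.where(E1 >= E2, V1, V2)                  # max-energy selection
    w_max = np.clip(0.5 + 0.5 * (1.0 - M) / (1.0 - T), 0.5, 1.0)
    averaged = (w_max * np.where(E1 >= E2, V1, V2)
                + (1.0 - w_max) * np.where(E1 >= E2, V2, V1))
    return np.where(M >= T, averaged, selected)            # V_dF
```

The fused matrix is then passed through the refactor operator (patches2im in the earlier sketch) to rebuild the fused detail image.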

3.3. Reconstruction

The fused base part and the fused detail contents are superposed to reconstruct the fused image $F$, as shown in
$$F(x, y) = F_b(x, y) + \sum_{i=1}^{r} I_{dF}^i(x, y).$$

4. Structure of Fusion Algorithm

We develop a novel infrared and visible image fusion method based on a multilevel low-rank decomposition framework with guided filtering and feature extraction. The source images are denoted as $I_1$ and $I_2$ and are preregistered. The framework of the proposed algorithm is shown in Figure 4.

The general steps of the proposed algorithm in this paper are shown in Algorithm 1.

Input:
The source images $I_1$ and $I_2$.
Output:
Fused image $F$.
/* Part 1: multilevel DLatLRR decomposition. */
1: for each source image $I_k$, $k \in \{1, 2\}$ do
2:  for each decomposition level $i = 1, \ldots, r$ do
3:   Run DLatLRR decomposition on $I_{b,k}^{i-1}$ to obtain $V_{d,k}^{i}$ and $I_{b,k}^{i}$
4:  end for
5: end for
/* Part 2: fusion of base parts. */
6: for each base part $I_{b,k}^{r}$, $k \in \{1, 2\}$ do
7:  Extract the deep features $\phi_k^5$ from $I_{b,k}^{r}$ with the 5th layer of the VGG-19 network;
8:  Transform the $l_1$-norm of $\phi_k^5$ into the activity level map $C_k$ by Equation (7);
9:  Calculate the final weight map $\hat{W}_k$ via Equations (8) and (9);
10:  Use guided filtering to smooth the final weight map and obtain the refined weight map $W_k^g$ via Equations (11) and (12).
11: end for
12: Calculate the fused base part $F_b$ via Equation (13).
/* Part 3: fusion of detail contents. */
13: for each decomposition level $i = 1, \ldots, r$ do
14:  Apply the dynamic activity level with the maximum value to $V_{d,1}^{i}$ and $V_{d,2}^{i}$ to obtain the fused vector $V_{dF}^{i}$ as in Equation (16);
15:  Reconstruct the vector $V_{dF}^{i}$ to $I_{dF}^{i}$ via Equation (18).
16: end for
/* Part 4: reconstruction. */
17: Superpose the fused base part and detail contents to reconstruct the fused image $F$, as shown in Equation (19).
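To make the control flow of Algorithm 1 concrete, the sketch below chains the helper functions from the earlier sketches (im2patches, patches2im, mdlatlrr_decompose, base_weight_maps, fuse_base_parts, fuse_detail_vectors). The names, default parameters, and the final clipping to [0, 1] are assumptions, not the authors' released code.

```python
import numpy as np
import torch

def fuse_images(I_ir, I_vis, L, levels=1, win=16):
    """End-to-end sketch of Algorithm 1 for two preregistered grayscale
    images (float arrays in [0, 1]) and a pre-learned projection matrix L."""
    # Part 1: multilevel DLatLRR decomposition of both source images
    base_ir,  Vd_ir  = mdlatlrr_decompose(I_ir,  L, levels, win)
    base_vis, Vd_vis = mdlatlrr_decompose(I_vis, L, levels, win)
    # Part 2: base-part fusion (VGG-19 weight maps refined by guided filtering)
    bases   = [base_ir, base_vis]
    weights = base_weight_maps([torch.from_numpy(b).float() for b in bases])
    F_b = fuse_base_parts(bases, [w.numpy() for w in weights],
                          sources=[I_ir, I_vis])
    # Parts 3 and 4: fuse the detail vectors at every level, rebuild with R(.),
    # and superpose them onto the fused base part
    F = F_b.astype(np.float64)
    for V1, V2 in zip(Vd_ir, Vd_vis):
        V_f = fuse_detail_vectors(V1, V2)
        F += patches2im(V_f, I_ir.shape, win)
    return np.clip(F, 0.0, 1.0)
```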

5. Experiments

The aim of the experiments is to provide supporting evidence for the proposed method. The experimental part of this paper consists of the experimental settings, ablation experiments, subjective evaluation, and objective evaluation.

5.1. Experimental Settings

In our experiments, the infrared and visible images were collected from [36], which contains many registered infrared and visible image pairs from different scenes. We randomly selected six pairs of images to compare the fusion results, as shown in Figure 5. From left to right, these pairs are named Men in front of house, Bench, Bunker, Man_in_doorway, Soldier_in_trench_1, and Lake, respectively.

For the parameter setting of the guided filter, according to [37], the values of $r$ and $\varepsilon$ are set to 45 and 0.3, respectively. The stride of the sliding window, which decomposes the source images into patches, is set to 1. The window size is set to . The number of decomposition layers of MDLatLRR and the number of VGG-19 network layers used to extract the feature maps of the base part are determined by the subsequent ablation experiments.

Six classical infrared and visible image fusion methods are applied in the same experiments for comparison: the generative adversarial network for image fusion (FusionGAN) [28], the joint sparse representation model (JSR) [38], the JSR model with saliency detection (JSR_SD) [39], the multilevel decomposition method MDLatLRR [14], and two deep learning-based approaches, VGG-19 [40] and ResNet50 [41].

In order to obtain a quantitative comparison of the different methods, four quality metrics are used to assess the fused images. Entropy (EN) [42] measures the amount of information contained in the fused image based on information theory; mutual information (MI) [43] measures the amount of information transferred from the source images to the fused image; the edge-information metric [44] indicates the quality of edge information acquired from the source images; and MS-SSIM [45] counts only the structural information based on the refined structural similarity. The larger these metrics are, the better the fusion quality.
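For reference, a minimal sketch of how EN and MI can be computed on 8-bit images is given below; the exact definitions in [42, 43] may differ in details such as histogram binning.

```python
import numpy as np

def entropy(img, bins=256):
    """EN: Shannon entropy of the grey-level histogram (img values in [0, 255])."""
    hist, _ = np.histogram(img.ravel(), bins=bins, range=(0, 255))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(src, fused, bins=256):
    """MI between one source image and the fused image."""
    joint, _, _ = np.histogram2d(src.ravel(), fused.ravel(),
                                 bins=bins, range=[[0, 255], [0, 255]])
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px[:, None] * py[None, :])[nz])))

# The MI score reported for fusion is commonly MI(ir, fused) + MI(vis, fused).
```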

All the fusion experiments are carried out in MATLAB R2020a on a 3.95 GHz AMD Ryzen 5 3500X six-core processor with 16 GB of RAM running 64-bit Windows 10. The graphics card is a GeForce RTX 2070 SUPER with 8 GB of memory.

5.2. Ablation Experiment
5.2.1. Ablation Experiments for Decomposing Layers

To select the best decomposition level for the proposed method, five pairs of images from Figure 5 are processed by the proposed algorithm with different decomposition levels. To test the decomposition levels of MDLatLRR, the level is set from 1 to 4. The fused results for the five pairs of source images at different decomposition levels are shown in Figure 6. As the MDLatLRR decomposition level increases, the luminance and contours of the fused images improve. However, artifacts are introduced around the objects, and some detailed information is degraded. To obtain better fusion quality, the fewer artifacts, the better.

The experimental results, evaluated with the above quality metrics, are shown in Figure 7. As can be seen, it is not the case that the more layers, the larger the values of the quality metrics. When the number of decomposition layers is one, the values of EN and MI are more prominent than at the other layers, which suggests that the first layer makes the fused image contain more information from the source images. In addition, for several images the value of the edge-information metric is best at the second layer, indicating that more edge information is preserved in the fused image as the decomposition level increases. As for MS-SSIM, the first two layers show better values, which shows that the structure of the fused image is similar to the source images. However, when the decomposition level exceeds two, the fusion performance declines. This is because the detail content obtains more luminance and contour information from the base part, and this information cannot be fused well by the detail content fusion method. Note that the larger the values of the above metrics, the better the fused images. Based on the above analysis, the decomposition level of MDLatLRR is set to one in our proposed algorithm.

5.2.2. Ablation Experiments for VGG-19 Network Layers

In order to select an appropriate number of layers of the VGG-19 network for the proposed method, five pairs of images from Figure 5 are used for ablation experiments on the VGG-19 network layers. The layer is set from 1 to 5, which represents relu_1_1, relu_2_1, relu_3_1, relu_4_1, and relu_5_1, respectively. The fusion results for the different VGG-19 network layers are shown in Figure 8. As can be seen from Figure 8, compared with the other layers of the VGG-19 network, the fifth layer extracts more detail features and salient target information from the source images. For example, in Figure 8(t), the traffic sign extracted at the fifth layer is clearer than at the other layers, so we choose the five-layer VGG-19 network to extract features.

The experimental results for the different network layers are shown in Figure 9. As can be seen, as the number of network layers increases, the values of the evaluation metrics EN and MI become larger, which indicates that the five-layer VGG-19 network can extract more feature information from the source images. The edge-information metric is basically the best when the source images are processed with a three-layer VGG-19 network, which indicates that the three-layer VGG-19 network extracts edge information well. As for MS-SSIM, the values of the third and fifth pairs are best with the five-layer VGG-19 network, while the values of the first, second, and fourth pairs are best with the three-layer VGG-19 network. Summing up the above, we select the five-layer VGG-19 network to extract features.

5.3. Subjective Evaluation

Figure 10 shows the subjective fusion results of the first pair of images. Figures 10(a) and 10(b) are the original images. The man in the red box and the grass in the green box obtained by JSR and JSR_SD are fuzzy, and the fused images obtained by JSR and JSR_SD contain significantly more visible components than infrared ones. The fused images obtained by FusionGAN contain more infrared components than visible ones. In addition, the fused images obtained by MDLatLRR, VGG-19, and ResNet-ZCA have fewer artifacts, but the detailed texture information of the visible image is not well preserved. As shown in Figure 10(i), the man in the red box and the grass in the green box are the clearest compared with the other methods. The proposed method adds more detailed texture information so that the result resembles the visible image while retaining the thermal radiation information of the infrared image, and it has excellent visibility. Figure 11 shows the subjective fusion results of the second pair of images; Figures 11(a) and 11(b) are the original images. It can be seen from Figure 11(i) that the pixel consistency of the object edge structure is the best among the fused images, and the objects in the red and green boxes obtain more texture information. Figure 12 shows the subjective fusion results of the third pair of images; Figures 12(a) and 12(b) are the original images. In Figure 12(i), produced by the proposed method, the building in the red box contrasts with its surroundings, the chromatic aberration is consistent with the visible image, and the grass in the green box has more texture information. Figure 13 shows the subjective fusion results of the fourth pair of images; Figures 13(a) and 13(b) are the original images. In Figure 13(i), the sign in the green box is the most recognizable compared with the results of the other methods, and the object in the red box contains more edge feature information. Besides, target object information is lost in some images, such as Figures 13(c), 13(d), and 13(f). The proposed method performs well and has good visibility. Figure 14 shows the subjective fusion results of the fifth pair of images; Figures 14(a) and 14(b) are the original images. Figure 15 shows the subjective fusion results of the sixth pair of images; Figures 15(a) and 15(b) are the original images. From the targets in the red boxes and the detail features in the green boxes in Figures 15(c)–15(i), the proposed method contains more detail information from the source images compared with the other methods. The objects in the red and green boxes have more texture information, the contrast between light and dark details is sharp, and the structure is the most consistent with the original images. In addition, we randomly chose 20 pairs of images from [36] to further verify the performance of the proposed method, as shown in Figure 16.

5.4. Objective Evaluation

To exhibit the attractive characteristics of our proposed method, four evaluation metrics are applied to compare the fusion performance of the six popular fusion methods and our proposed algorithm. In the tables, the best values are shown in italics.

In Table 1, the evaluation metrics EN, MI, and the edge-information metric are the best, which indicates that the proposed method retains more detailed information and edge information from the original images. In addition, MS-SSIM is not the best, but the gap between the MS-SSIM of the proposed method and the best value, obtained by MDLatLRR, is tiny. As mentioned in Section 5.1, EN and MI measure the amount of information transferred from the source images to the fused image, but EN is susceptible to noise. As shown in Figure 12(d), the object in the JSR_SD fusion image is distorted and has apparent artifacts, which is why EN and MI can show seemingly strong values for the third pair of images. MS-SSIM counts the structural information based on the refined structural similarity, and artifacts and distortion of the image structure lower this metric and result in poor visibility. The proposed method has an obvious advantage in the MS-SSIM index, which indicates that it contains little noise and structural distortion; this is crucial for infrared and visible images. In Tables 2–5, the proposed method mostly performs the best in the EN, MI, and MS-SSIM indices, which shows that our algorithm makes the fused image contain more information from the source images and that the structure of the fused image is similar to the source images. In addition, for the objective evaluation, Table 6 reports the average values of all test images on the different metrics. The evaluation metrics obtained by the proposed method are the best except for the running time (t/s), which indicates that our proposed method retains more feature information from the source images and that the structure of the fused image is more similar to the source images than that of the other compared methods. Among all the compared methods, the proposed method ranks in the middle in terms of time consumption in Tables 1–6. Based on the above analysis, our fusion algorithm is effective.

6. Conclusion

This paper proposes a multilevel low-rank decomposition method based on guided filtering and feature extraction for infrared and visible image fusion. The VGG-19 network and guided filtering are used in the base layer fusion to obtain the weight map. Then, the final base layer is acquired by multiplying the initial base layer and the weight map. As for detail content fusion, we use the dynamic activity level with maximum value to obtain the final detail content. The results show that our proposed method performs attractively in retaining object detail features and edge features compared with other fusion methods, in both subjective and objective evaluations. The proposed method can be applied to target detection and recognition in everyday computer vision. In addition, there are some drawbacks to our proposed algorithm. As the number of decomposition levels increases, more luminance and contour information is introduced, which degrades the fusion performance, and the brighter artifacts interfere with the targets. In future work, we will be committed to reducing the effect of artifacts and enhancing the fusion performance as the number of decomposition layers increases.

Data Availability

The figure data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was partially supported by grants from the National Natural Science Foundation of China (Grant No. 22178036), Chongqing Nature Science Foundation for Fundamental Science and Frontier Technologies (Grant No. cstc2018jcyjAX0483), Science and Technology Research Program of Chongqing Education Commission of China (Grant Nos. KJQN201900821 and KJQN202000803), Innovative Research Group of Universities in Chongqing (Grant No. CXQT21024), Graduate Innovation Project of Chongqing Technology and Business University (Grant No. yjscxx2021-112-45), and Major Science and Technology Funded Project of Chongqing Education Commission (KJZD-M201900802).