Abstract

Image fusion aims to effectively enhance the accuracy, stability, and comprehensiveness of information. Generally, infrared images lack sufficient background details to provide an accurate description of the target scene, while visible images struggle to capture the scene under adverse conditions such as low light. An effective fusion algorithm can therefore improve the richness of image details. In this paper, we propose an infrared and visible image fusion algorithm that aims to overcome some common defects of the image fusion process. Firstly, we use a fast approximate bilateral filter to decompose the infrared and visible images into small-scale layers, a large-scale layer, and a base layer. Then, the fused base layer is obtained based on local energy characteristics, which avoids the information loss of traditional fusion rules. The fused small-scale layers are acquired by selecting the coefficients with the maximum absolute value, and the fused large-scale layer is obtained by the summation rule. Finally, the fused small-scale layers, large-scale layer, and base layer are merged to reconstruct the final fused image. Experimental results show that our method retains more detailed appearance information in the fused image and achieves good results in both qualitative and quantitative evaluations.

1. Introduction

With the growing application of infrared imaging in military surveillance [1], remote sensing [2], medical imaging [3], space exploration [4], and other fields, the fusion of infrared and visible images has attracted more and more attention. Infrared and visible images differ in their characteristics and imaging mechanisms. An infrared image (IR), which captures thermal radiation, has strong anti-interference capability and can be acquired around the clock without being affected by lighting conditions, but its contrast is low and its ability to distinguish details is poor. In contrast, a visible image (VI), which captures reflected light, has high spatial resolution and clear detail texture information, but it is greatly affected by the weather and performs poorly in cloudy, rainy, and night-time environments. Image fusion technology can make up for their respective shortcomings, enrich the information of a single source image, and obtain a sufficient and accurate representation of the target.

The research on infrared and visible image fusion has important theoretical significance in many real-life application scenarios. Existing fusion methods are effective, but they also share some common problems, such as poor contrast, block effects, and information distortion. To address these issues and obtain better fusion performance, we develop a novel fusion algorithm based on the fast approximate bilateral filter and local energy characteristics. Our contributions are as follows:
(1) We introduce a fast approximate bilateral filter to decompose the original infrared or visible image into small-scale layers with rich details, a large-scale layer with obvious edge features, and a base layer with low-frequency information.
(2) We apply the fast approximate bilateral filter to the infrared or visible input image five times to obtain an increasingly coarsened image sequence. Then, the two images separated by one image in the sequence are subtracted to obtain multiple small-scale layers, and the large-scale layer is obtained as the difference between the fourth filtered image and its Gaussian-filtered version.
(3) We fuse the base layers using local energy characteristics, the small-scale layers using the absolute maximum rule, and the large-scale layers using the summation rule, so as to retain more local structures and salient features in the fused image. Most of the information of the source images is transferred to the composite image, and the fusion loss is negligible.

To verify the effectiveness of our fusion algorithm, we compare it with 10 other infrared and visible image fusion methods. Experimental results show that the images fused by our method are of high quality and superior to those of similar methods on different IR and VI data sets. The rest of this paper is organized as follows: Section 2 summarizes the related works on image fusion. Section 3 describes the proposed fusion algorithm in detail. Section 4 analyzes the experimental results of our algorithm compared with several classical algorithms. Finally, Section 5 gives the conclusion.

2. Related Works

In recent years, the advantages of image fusion have been highly valued by many researchers, and image fusion technology has developed rapidly, from the weighted average method to color space methods and then to multiscale transformation. Generally, image fusion methods can be divided into spatial domain-based methods and frequency domain-based methods according to the domain in which the fusion is performed.

Fusion algorithms based on the spatial domain compute and process the pixel gray values directly in the space formed by the image pixels. Such methods mainly include principal component analysis [5], color space mapping and pseudocolor image fusion [6], gray or contrast modulation [7], and artificial neural networks [8]. Image fusion algorithms based on the frequency domain analyze and process the transform coefficients obtained by converting the image to the frequency domain with multiscale transform tools. In the mid-1980s, Burt and Adelson first proposed the Laplacian pyramid algorithm [9], and pyramid-based methods, which can represent prominent image features, include the Laplacian pyramid (LAP) [10], the gradient pyramid [11], and the contrast pyramid [12]. However, these methods lose image information, and the decomposition process is not directional. To overcome this problem, Mallat proposed a multiresolution algorithm based on the wavelet transform [13], which obtains not only the low-frequency information but also the horizontal, vertical, and diagonal information of the high-frequency part. Compared with the traditional algorithms based on pyramid decomposition, the overall fusion performance of the wavelet transform is better. However, because the wavelet transform captures only a limited number of directions, some texture and contour features of the image cannot be accurately represented. Later, Candes and Donoho proposed the curvelet transform (CVT) [14], and Cunha et al. introduced the nonsubsampled contourlet transform [15], which overcomes the lack of shift invariance.

With the progress of deep learning in target tracking [16-19], target detection [20], and image restoration [21], algorithms based on deep learning have also appeared in the field of image processing [22, 23]. Compared with traditional methods, deep learning allows computing models composed of multiple processing layers to learn data representations with multiple levels of abstraction. Deep learning uses the backpropagation algorithm to guide how the machine updates its parameters, so as to discover complex structure in large data sets and overcome the limitations of handcrafted features. Applications in the field of image processing mainly include CNNs [24], GANs [25], Siamese networks [26], and autoencoders [27].

3. Proposed Method

3.1. Overview and Notations

We summarize the main symbols used in this paper in Table 1. Specifically, we use IR to represent the infrared image, VI to represent the visible image, and FU to represent the fused image. On a special note, B^t represents the base layer of an image of type t, where t ∈ {IR, VI}; S_n^t is the nth small-scale layer of an image of type t, where t ∈ {IR, VI} and n is the filtering index, n = 1, 2, 3, 4; and L^t is the large-scale layer. If the superscript t is not explicitly specified, the layer may belong to either the infrared or the visible image by default.

The framework of the proposed algorithm is shown in Figure 1. At the beginning, we decompose the input image using the fast approximate bilateral filter [28]; the specific decomposition process is introduced in the following sections. Then, we use the absolute maximum selection rule, the summation rule, and local energy characteristics to fuse the small-scale layers, the large-scale layer, and the base layer, respectively. Finally, the fused image is reconstructed by merging the fused small-scale layers, the fused large-scale layer, and the fused base layer. Our method can better suppress the noise in the source images, transfer the valuable detail texture information to the fused image, and alleviate the problem of insufficient visual detail in the fusion process.

3.2. Image Decomposition

We divide the image into three parts: the base layer containing most of the residual low-frequency information, the small-scale layers containing image detail texture information, and the large-scale layer containing the image edge structure. Figure 2 shows the specific decomposition process. Firstly, the small structures of the source infrared and visible images are repeatedly removed by the fast approximate bilateral filter. We apply the fast approximate bilateral filter five times to the input image to acquire the gradually coarsened image sequence A_0^t, A_1^t, ..., A_5^t, where A_0^t is the input image; experiments show that filtering five times gives the best results. Then, the two images separated by one image in the sequence are subtracted to obtain the small-scale layers S_n^t (n = 1, 2, 3, 4). Finally, the large-scale layer L^t is obtained by subtracting the Gaussian-filtered image G_σ(A_4^t) from A_4^t. This decomposition method can extract fine texture details from the visible image, which is very important for the fusion of infrared and visible images.

The initial image A_0^t is the original visible or infrared input image. The decomposition can be regarded as an iterative process, where the nth iteration is calculated as

A_n^t = FABF(A_{n−1}^t, σ_s, σ_r),   n = 1, 2, ..., 5,
S_n^t = A_{n−1}^t − A_{n+1}^t,   n = 1, 2, 3, 4,   (1)
L^t = A_4^t − G_σ(A_4^t),

where FABF(·) represents the fast approximate bilateral filter, and σ_s and σ_r are the standard deviations of the spatial and range Gaussians, with parameters set as recommended in [28]: σ_s is mainly determined by the image size, and σ_r is mainly determined by the pixel intensity differences of the image and is set according to the maximum pixel difference of A_0^t. S_n^t and L^t are the obtained small-scale and large-scale layers, G_σ(A_4^t) is the image obtained after Gaussian filtering of A_4^t, and structures at scales smaller than the standard deviation of the Gaussian filter G_σ are removed by the filter.

Generally, the base layer does not need to retain the details and edge information of the source image; it is the coarsest layer of the input image and is mainly used to control the appearance and global contrast of the image. When too much detail and edge information is retained in the base layer, the fused image will lose some useful information. In our method, we choose the Gaussian-smoothed image B^t = G_σ(A_4^t) as the base layer.
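To make the decomposition concrete, the following Python/NumPy sketch mirrors the steps above. It uses OpenCV's built-in bilateralFilter as a stand-in for the fast approximate bilateral filter of [28], assumes a single-channel image normalized to [0, 1], and uses illustrative sigma values; the function and variable names are ours, not the paper's.

# Sketch of the three-part decomposition (Section 3.2). OpenCV's bilateralFilter
# is used here only as a stand-in for the fast approximate bilateral filter [28];
# the sigma values are illustrative, not the paper's settings.
import cv2
import numpy as np
from scipy.ndimage import gaussian_filter

def decompose(img, sigma_s=16.0, sigma_r=0.1, levels=5):
    """Return (small_scale_layers, large_scale_layer, base_layer)."""
    a = [img.astype(np.float32)]            # A_0 is the input image
    for _ in range(levels):                 # A_1 ... A_5: repeated filtering
        a.append(cv2.bilateralFilter(a[-1], d=-1,
                                     sigmaColor=sigma_r, sigmaSpace=sigma_s))
    # Small-scale layers: difference of images separated by one image,
    # S_n = A_{n-1} - A_{n+1}, n = 1..4
    small = [a[n - 1] - a[n + 1] for n in range(1, levels)]
    # Large-scale layer: fourth filtered image minus its Gaussian-smoothed version
    smoothed = gaussian_filter(a[4], sigma=sigma_s)
    large = a[4] - smoothed
    base = smoothed                          # coarsest (base) layer
    return small, large, base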

Figure 3 shows the decomposed layers: (a) and (b) are the infrared and visible images, (c) is the fused base layer, and (d) is the fused image. The second row corresponds to the infrared image and the third row to the visible image. (a1)-(d1) and (a2)-(d2) are the small-scale layers obtained from the infrared and visible image decompositions, and different small-scale layers contain different features. It can be seen that the small-scale layers (a1) and (a2) mainly contain fine-detail texture information. From (b1), (b2) to (d1), (d2), the small-scale layers contain increasingly rich target feature information, and the hillside, trees, and road information contained in (d1) and (d2) are the most obvious. (e1) and (e2) are the large-scale layers, and (g1) and (g2) are the base layers. We can see that each scale contains a corresponding proportion of the image content. In short, owing to its edge-preserving and scale-aware characteristics, this decomposition method can reduce halos and preserve edge features well, which is beneficial to the subsequent image fusion.

3.3. Fast Approximate Bilateral Filter

The bilateral filter (BF) [29] is a nonlinear filter that can remove noise while preserving edges. Its kernel is the product of two functions: one determines the filter coefficient from the pixel intensity difference, and the other determines it from the geometric distance in space, which yields the denoising effect. The advantage of the bilateral filter over an ordinary Gaussian low-pass filter [30] is that it considers both the spatial distance to the center pixel and the radiometric difference in the pixel range. The filtered output of image A at pixel position p is computed as

BF[A]_p = (1 / W_p) Σ_{q ∈ S} G_{σ_s}(‖p − q‖) G_{σ_r}(|A_p − A_q|) A_q,   (2)

W_p = Σ_{q ∈ S} G_{σ_s}(‖p − q‖) G_{σ_r}(|A_p − A_q|),   (3)

where BF[A] is the filtered image, W_p represents a normalization factor, σ_s and σ_r are the standard deviations of the spatial Gaussian function G_{σ_s} and the range Gaussian function G_{σ_r}, p and q represent pixel coordinates, A_p and A_q are the pixel intensity values at p and q, ‖p − q‖ is the Euclidean distance between pixels p and q, |A_p − A_q| is the absolute value of the intensity difference, and S is the spatial domain.
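As a reference point, a direct (brute-force) implementation of equations (2) and (3) might look as follows. It is only meant to make the weighting explicit; it is far too slow for practical use, which motivates the fast approximation described next. The function and parameter names are illustrative.

# Direct (brute-force) bilateral filter following equations (2)-(3).
# A is assumed to be a 2-D float array; radius limits the neighborhood size.
import numpy as np

def bilateral_filter_naive(A, sigma_s, sigma_r, radius):
    H, W = A.shape
    out = np.zeros_like(A)
    for py in range(H):
        for px in range(W):
            y0, y1 = max(py - radius, 0), min(py + radius + 1, H)
            x0, x1 = max(px - radius, 0), min(px + radius + 1, W)
            patch = A[y0:y1, x0:x1]
            yy, xx = np.mgrid[y0:y1, x0:x1]
            g_spatial = np.exp(-((yy - py) ** 2 + (xx - px) ** 2) / (2 * sigma_s ** 2))
            g_range = np.exp(-(patch - A[py, px]) ** 2 / (2 * sigma_r ** 2))
            w = g_spatial * g_range                       # combined weight
            out[py, px] = np.sum(w * patch) / np.sum(w)   # eq. (2) with W_p from eq. (3)
    return out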

The bilateral filter (BF) has been widely used in many fields and has proved to be very effective. However, the bilateral filter needs the gray information of the neighborhood of each center pixel to determine its coefficients, which leads to a long running time. In this paper, we suggest decomposing the original image with the fast approximate bilateral filter [28], which uses a signal processing approach: the one-dimensional intensity axis is added to the original spatial domain to form a higher-dimensional space, and the computation is carried out on a downsampled version of this space. This scheme produces results of comparable accuracy while significantly reducing the running time. Next, we describe the implementation of this method in detail.

Firstly, we rewrite (2) using a two-dimensional vector:

(W_p BF[A]_p, W_p) = Σ_{q ∈ S} G_{σ_s}(‖p − q‖) G_{σ_r}(|A_p − A_q|) (A_q, 1).   (4)

We specify the weight w_q = 1 to keep the characteristics of the weighted average of the bilateral filter:

(W_p BF[A]_p, W_p) = Σ_{q ∈ S} G_{σ_s}(‖p − q‖) G_{σ_r}(|A_p − A_q|) (w_q A_q, w_q).   (5)

We further extend the interpretation of the above equation to 3D space. In order to define the sum over the whole three-dimensional space more conveniently, we add a new dimension ζ and the Kronecker symbol δ(ζ) (δ(ζ) = 1 if ζ = 0, and 0 otherwise), define R as the interval spanned by the pixel intensities of A, and rewrite equation (5) using δ; when ζ ≠ A_q, the corresponding terms vanish:

(W_p BF[A]_p, W_p) = Σ_{q ∈ S} Σ_{ζ ∈ R} G_{σ_s}(‖p − q‖) G_{σ_r}(|A_p − ζ|) δ(ζ − A_q) (w_q A_q, w_q).   (6)

Equation (6) is a sum over the product space S × R. The product of the two Gaussians represents a separable Gaussian kernel G_{σ_s,σ_r} on S × R:

G_{σ_s,σ_r}(x, ζ) = G_{σ_s}(‖x‖) G_{σ_r}(|ζ|),   (x, ζ) ∈ S × R.   (7)

Then, we introduce two new functions, a and w, defined on S × R:

a(q, ζ) = A_q,   (8)
w(q, ζ) = δ(ζ − A_q).   (9)

We rewrite the right part of (6) according to (8) and (9):

δ(ζ − A_q) (w_q A_q, w_q) = (w(q, ζ) a(q, ζ), w(q, ζ)).   (10)

With definition (10), we get

(W_p BF[A]_p, W_p) = Σ_{(q, ζ) ∈ S × R} G_{σ_s,σ_r}(p − q, A_p − ζ) (w a, w)(q, ζ).   (11)

The value of the above formula at the point (p, A_p) is as follows:

(W_p BF[A]_p, W_p) = [G_{σ_s,σ_r} ⊗ (w a, w)](p, A_p),   (12)

where ⊗ represents the convolution operation. On the basis of the above formula, we introduce the functions a^bf and w^bf:

(w^bf a^bf, w^bf) = G_{σ_s,σ_r} ⊗ (w a, w).   (13)

Therefore, the bilateral filter is represented as a convolution operation followed by a nonlinear operation:

(W_p BF[A]_p, W_p) = (w^bf a^bf, w^bf)(p, A_p),   (14)
BF[A]_p = a^bf(p, A_p) = (w^bf a^bf)(p, A_p) / w^bf(p, A_p).   (15)

In fact, the nonlinear part consists of two operations. The first operation is slicing, that is, evaluating the functions w^bf a^bf and w^bf at the point (p, A_p). The second is division. In our case, the results of slicing and division are independent of the order in which they are performed.
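The following sketch illustrates the space-range ("bilateral grid") style of approximation described above: splat (wa, w) onto a coarse three-dimensional grid, blur it with a separable Gaussian, then slice at (p, A_p) and divide. The grid resolution and boundary handling are simplified choices of ours and do not reproduce the exact implementation of [28].

# Simplified space-range approximation of the bilateral filter: splat, blur,
# slice, divide. One grid cell per sigma in each dimension (a coarse choice).
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def fast_bilateral(A, sigma_s, sigma_r):
    A = A.astype(np.float64)
    amin, amax = A.min(), A.max()
    H, W = A.shape
    gy = int(np.ceil(H / sigma_s)) + 3
    gx = int(np.ceil(W / sigma_s)) + 3
    gz = int(np.ceil((amax - amin) / sigma_r)) + 3
    wa = np.zeros((gy, gx, gz))          # accumulates w * a
    w = np.zeros((gy, gx, gz))           # accumulates w (here w = 1)
    yy, xx = np.mgrid[0:H, 0:W]
    iy = np.round(yy / sigma_s).astype(int) + 1
    ix = np.round(xx / sigma_s).astype(int) + 1
    iz = np.round((A - amin) / sigma_r).astype(int) + 1
    np.add.at(wa, (iy, ix, iz), A)       # splat intensities
    np.add.at(w, (iy, ix, iz), 1.0)      # splat weights
    # Separable Gaussian blur on the downsampled 3-D grid (one cell ~ one sigma).
    wa = gaussian_filter(wa, sigma=1.0)
    w = gaussian_filter(w, sigma=1.0)
    # Slice: interpolate the blurred grid at (p, A_p), then divide (nonlinearity).
    coords = np.stack([yy / sigma_s + 1, xx / sigma_s + 1, (A - amin) / sigma_r + 1])
    num = map_coordinates(wa, coords, order=1)
    den = map_coordinates(w, coords, order=1)
    return num / np.maximum(den, 1e-12)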

3.4. Image Reconstruction
3.4.1. Base Layer Fusion

The traditional averaging fusion rule cannot effectively fuse the information of the base layers of the source images, which leads to the loss of some low-frequency information. We preserve more image information by fusing according to local energy characteristics, obtaining a better visual effect. The gray value of the image is what we call energy; in this paper, we define the energy as the weighted average gray value of a region, so the greater the gray values, the higher the energy. For image fusion, each image has good parts and bad parts, and what we need to do is to extract the good parts of each image. As can be seen in Figure 3 (f1), the human figure is more prominent, so its energy is greater; in Figure 3 (f2), there is strong detail texture information in the left corner of the image, and the energy in this area is greater. When fusing, we choose the part with the larger local energy in each region, so the fused image contains most of the useful information of the original images. The fused base layer is calculated as

E^t(p, q) = Σ_{k=1}^{m} Σ_{l=1}^{n} W(k, l) · B^t(p + k − 2, q + l − 2),   t ∈ {IR, VI},

B^FU(p, q) = B^IR(p, q), if E^IR(p, q) ≥ E^VI(p, q),
B^FU(p, q) = B^VI(p, q), otherwise,   (16)

where k and l index the pixels of the m × n region, B^t is the base layer of IR or VI, E^t is its local energy, and (p, q) is the central point of the neighborhood. In this paper, we set m = n = 3, and W is the 3 × 3 weight template given in (17).
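A minimal sketch of the local-energy selection in (16) is given below. The 3 × 3 template used here is a common center-weighted choice and is only an assumption; the paper's exact weight template is the one given in (17).

# Local-energy-based selection of the fused base layer (cf. eq. (16)).
# The 3x3 weight template is an assumed, center-weighted example.
import numpy as np
from scipy.ndimage import convolve

def fuse_base(base_ir, base_vi):
    Wt = np.array([[1, 2, 1],
                   [2, 4, 2],
                   [1, 2, 1]], dtype=np.float64) / 16.0   # assumed template
    # Local energy = weighted average gray value of the 3x3 neighborhood.
    e_ir = convolve(base_ir, Wt, mode='nearest')
    e_vi = convolve(base_vi, Wt, mode='nearest')
    # Keep, at every pixel, the base layer with the larger local energy.
    return np.where(e_ir >= e_vi, base_ir, base_vi)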

3.4.2. Small-Scale Layers Fusion

At the small-scale level, we use all decomposition levels from n = 1 to n = 4. The absolute maximum selection rule is used to integrate all important texture features and edge intensities into the fused small-scale layers. The formula is given by

S_n^FU(p, q) = S_n^IR(p, q), if |S_n^IR(p, q)| ≥ |S_n^VI(p, q)|,
S_n^FU(p, q) = S_n^VI(p, q), otherwise,   (18)

where S_n^FU is the fused small-scale layer, (p, q) represents the corresponding position in S_n^FU, S_n^IR, and S_n^VI, and n denotes the decomposition level.
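A one-function sketch of the absolute maximum selection in (18), applied level by level (function names are ours):

# Absolute-maximum selection of eq. (18), applied per decomposition level.
import numpy as np

def fuse_small_scale(small_ir, small_vi):
    """small_ir, small_vi: lists of small-scale layers S_n, n = 1..4."""
    return [np.where(np.abs(s_ir) >= np.abs(s_vi), s_ir, s_vi)
            for s_ir, s_vi in zip(small_ir, small_vi)]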

3.4.3. Large-Scale Layer Fusion

At the large-scale level, we only use the bottom decomposition level (n = 5). Because of the different imaging principles of infrared and visible images, most of the feature information contained in the two large-scale layers does not overlap. It can be seen from Figure 3 (e1), (e2) that the two layers carry a large amount of complementary information, so using the weighted average method would obviously lose a large amount of important information. Therefore, we use the summation rule to fuse the large-scale layers to prevent information loss. The formula is as follows:

L^FU(p, q) = λ_1 L^IR(p, q) + λ_2 L^VI(p, q),   (19)

where λ_1 and λ_2 are the weights of L^IR and L^VI, respectively. We choose λ_1 = 1 and λ_2 = 1 to preserve more image information. In order to show the superiority of using the summation rule for the large-scale layer, we compare it with the absolute maximum selection rule and the weighted average rule, using API [31], SD [32], AG [32], and EN [33] to measure the quality of the three rules. It can easily be seen from Table 2 that, except for the API being slightly lower than that of the absolute maximum selection rule, the fusion result obtained by the summation rule is better on all performance indicators.
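For reference, the three large-scale fusion rules compared in Table 2 can be sketched as follows; the paper adopts the summation rule of (19) with both weights equal to 1, and the function names are ours.

# The three large-scale fusion rules compared in Table 2.
import numpy as np

def fuse_large_sum(l_ir, l_vi, w1=1.0, w2=1.0):
    return w1 * l_ir + w2 * l_vi                      # summation rule, eq. (19)

def fuse_large_absmax(l_ir, l_vi):
    return np.where(np.abs(l_ir) >= np.abs(l_vi), l_ir, l_vi)

def fuse_large_average(l_ir, l_vi):
    return 0.5 * (l_ir + l_vi)                        # weighted average baseline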

3.4.4. Image Composition

Finally, the decomposition information at all levels is fused to obtain the fused image FU:

FU = B^FU + Σ_{n=1}^{4} S_n^FU + L^FU.   (20)

This process is briefly introduced in Algorithm 1.

Input: infrared image IR and visible image VI
Output: fused image FU
1) Initialization: A_0^IR = IR, A_0^VI = VI
2) Image decomposition (for t = IR, VI)
  for n = 1,2,3,4,5 do
    A_n^t = FABF(A_{n−1}^t, σ_s, σ_r)
  end for
  Ã_4^t = G_σ(A_4^t)
  for n = 1,2,3,4 do
    S_n^t = A_{n−1}^t − A_{n+1}^t
  end for
  L^t = A_4^t − Ã_4^t,  B^t = Ã_4^t
3) Image fusion
  The fused base layer B^FU is obtained by (16)
  for n = 1,2,3,4 do
    The fused small-scale layers S_n^FU are obtained by (18)
  end for
  The fused large-scale layer L^FU is obtained by (19)
4) The fused image FU is obtained by (20)
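Assuming the helper functions sketched in the previous sections (decompose, fuse_base, fuse_small_scale, and fuse_large_sum), the whole pipeline of Algorithm 1 can be strung together as follows; the file names are placeholders.

# End-to-end sketch of Algorithm 1, reusing the helper functions sketched above.
import cv2
import numpy as np

ir = cv2.imread('camp_ir.png', cv2.IMREAD_GRAYSCALE).astype(np.float32) / 255.0
vi = cv2.imread('camp_vi.png', cv2.IMREAD_GRAYSCALE).astype(np.float32) / 255.0

small_ir, large_ir, base_ir = decompose(ir)      # step 2: decomposition
small_vi, large_vi, base_vi = decompose(vi)

base_fu = fuse_base(base_ir, base_vi)            # step 3: fusion, eq. (16)
small_fu = fuse_small_scale(small_ir, small_vi)  #          eq. (18)
large_fu = fuse_large_sum(large_ir, large_vi)    #          eq. (19)

fu = base_fu + sum(small_fu) + large_fu          # step 4: composition, eq. (20)
cv2.imwrite('camp_fused.png', np.clip(fu * 255.0, 0, 255).astype(np.uint8))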

4. Results and Discussion

4.1. Experimental Setup

To verify the effectiveness of our fusion algorithm, we compare it with 10 other recent infrared and visible image fusion methods, namely LP [9], CVT [14], NSCT [34], DTCWT [35], MSVD [36], TIF [37], VSM-WLS [38], FPDE [39], MGFF [40], and GTF [41]. All the experimental parameters of these ten algorithms are set according to their original papers.

In our experiments, the test images are taken from the TNO Image Fusion Dataset, and five pairs of infrared and visible images of different scenes are used. All the experiments in this paper are implemented in MATLAB (2017b) on an Intel Core i5-10210U CPU.

4.2. Qualitative Evaluation

The visual quality comparisons on the Camp, Road, Keptein, Meting, and Steamboat data sets are shown in Figures 4-8. In each figure, (a) is the infrared image, (b) is the visible image, (c)-(l) are the results of the compared methods, and (m) is the proposed method. To show the differences more distinctly, we mark a rectangle in the fused images. Our method transfers most of the important background and details to the fused image, properly retains the important infrared structure information, and also reduces image noise. Figure 4 is a fusion example on the "Camp" image. In the red rectangle, we can observe that the fence information presented by the MSVD, FPDE, and GTF methods is fuzzy and difficult to identify. In contrast, our method produces clearer fence texture details and reduces the interference of noise to a certain extent.

In Figure 5, we can find that our method has more advantages in extracting detail features from the source images under weak light conditions because it improves the overall brightness of the image, while the other methods obviously do not achieve good visual results. Among them, FPDE and NSCT produce fusion artifacts, and the people in the NSCT, MSVD, and TIF results are not clearly displayed.

In Figure 6, our method makes the edge structure of the tree branches vivid and the ground texture information clear. In contrast, DTCWT, CVT, and TIF show distorted artifacts due to the incorrect introduction of visible light information, while the overall images of LP, MSVD, and GTF are dark and their spectral features are not rich enough.

In Figure 7, our method can accurately fuse the salient person information of the infrared image and the brick hole structure details on both sides of the visible image. It can be seen from the figure that MSVD, VSM-WLS, FPDE, and GTF tend to produce some unnatural artifact information: because the brightness difference between the source images is large and these methods cannot effectively suppress noise, they fail to show a clear hole structure. Overall, our method produces better fusion results. Similarly, as shown in Figure 8, for the "Steamboat" image, our method yields better contours and contrast and improves the clarity of the image.

4.3. Quantitative Evaluation

It is difficult for human vision to distinguish small differences between images. Thus, it is unreliable to evaluate fused images only qualitatively, and the quality of image fusion also needs quantitative analysis. For quantitative evaluation, we use four indicators to evaluate the fusion accuracy, defined as follows:
(1) Average pixel intensity (API), or mean, is the arithmetic mean of the gray values of all pixels in the image and is calculated by API = (1/(M × N)) Σ_{i=1}^{M} Σ_{j=1}^{N} H(i, j), where M × N is the size of the image and H(i, j) is the pixel intensity at (i, j).
(2) Standard deviation (SD) is the dispersion of the image gray values relative to the mean, which is used to evaluate the image contrast, and is calculated by SD = sqrt((1/(M × N)) Σ_{i=1}^{M} Σ_{j=1}^{N} (H(i, j) − API)^2).
(3) Average gradient (AG) sensitively reflects how well the image expresses the contrast of small details and is calculated by AG = (1/((M − 1)(N − 1))) Σ_{i=1}^{M−1} Σ_{j=1}^{N−1} sqrt(((H(i + 1, j) − H(i, j))^2 + (H(i, j + 1) − H(i, j))^2) / 2).
(4) Entropy (EN) is an objective evaluation index that measures the amount of information contained in the image and is calculated by EN = −Σ_i p_i log_2 p_i, where i is the gray value and p_i is its probability in the gray-level distribution.

For all the four metrics, the higher the objective indexes API, SD, AG, and EN are, the better the fusion effect will be.
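Possible NumPy implementations of the four metrics are sketched below; they assume 8-bit gray values for the entropy histogram, and details such as the AG normalization follow common definitions rather than being taken verbatim from [31-33].

# Sketches of the four quality metrics used in the quantitative evaluation.
import numpy as np

def api(img):                                   # average pixel intensity (mean)
    return img.mean()

def sd(img):                                    # standard deviation around the mean
    return np.sqrt(((img - img.mean()) ** 2).mean())

def ag(img):                                    # average gradient over the image
    gx = np.diff(img, axis=1)[:-1, :]
    gy = np.diff(img, axis=0)[:, :-1]
    return np.sqrt((gx ** 2 + gy ** 2) / 2.0).mean()

def en(img):                                    # Shannon entropy of the gray histogram
    hist, _ = np.histogram(img, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()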

The quantitative comparisons of the experimental results are presented in Table 3, where the maximum value among the different algorithms is shown in bold. It can be seen from the table that our method achieves higher values than the other methods on the quality indexes API, SD, AG, and EN, which correlates well with the results of the visual comparison.

In order to show the advantages of the proposed method more intuitively, we computed the average value of each index over the five image pairs (Camp, Road, Keptein, Meting, and Steamboat) and plotted bar charts of the four indexes. As shown in Figure 9, the performance of the proposed method is the best on all indexes. We also record the run time of the different fusion methods for each group of images. As shown in Table 4, our method is slightly slower, which we will further improve in future work.

5. Conclusions

In this paper, an infrared and visible image fusion algorithm based on fast approximate bilateral filtering and local energy characteristics is proposed. The image decomposed by the fast bilateral filter is smoother and contains less noise. According to the characteristics of each decomposition layer, we use different fusion rules to fuse the decomposed layers, which not only avoids the information loss of the traditional fusion rules but also enriches the visual information of the fused image. In most cases, the fused image looks more natural and contains less artificial information. Experimental results fully illustrate the superiority of the proposed algorithm. The comparison with 10 other fusion methods shows that our algorithm can better describe the most significant information in the image, improve the overall contrast of the image, and preserve the information of the source images to the greatest extent.

Data Availability

The TNO Image Fusion Dataset is available at https://figshare.com/articles/dataset/TNO_Image_Fusion_Dataset/1008029.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Authors’ Contributions

Zongping Li was mainly responsible for data collating and draft writing. Wenxin Lei designed the method and validated the results. Xudong Li and Tingting Liao designed and debugged the program code. Jianming Zhang mainly reviewed and polished the article. All the authors have read and agreed on the current version of the article.

Acknowledgments

This work was supported by the Natural Science Foundation of China (61972056 and 61901061), the Basic Research Fund of Zhongye Changtian International Engineering Co., Ltd. (2020JCYJ07), the Research Fund of Hunan Provincial Education Department (19C0028 and 19C0031), the Enterprise-University Joint Postgraduate Scientific Research Innovation Fund of Hunan Province (QL20210205), and the Postgraduate Scientific Research Innovation Fund of CSUST (CX2021SS70).