Abstract

In recent years, big data analysis has greatly improved the success rate of solving major criminal cases, and the analysis of multimodal big data plays a key role in identifying suspects. However, traditional multiexposure image fusion methods are inefficient and time-consuming owing to artifacts at image edges and other sensitive factors. Therefore, this paper focuses on multiexposure image fusion for suspect images. The self-coding (autoencoder) neural network based on deep learning has become a research hotspot in data dimension reduction, as it can effectively eliminate irrelevant and redundant learning data. When the depth of field is limited, the focal plane of the camera cannot capture a globally sharp image of a target in a deep scene, so defocus blur easily occurs. Therefore, this paper proposes a multifocus image fusion method based on a sparse denoising autoencoder neural network. To realize an unsupervised end-to-end fusion network, the sparse denoising autoencoder is adopted to extract features and to learn the fusion and reconstruction rules simultaneously. The initial decision graph of the multifocus image is taken as a prior input so that the network can learn rich image detail. A local strategy is added to the loss function to ensure that the image is restored accurately. The results show that this method is superior to state-of-the-art fusion methods.

1. Introduction

Image fusion refers to the comprehensive processing of two or more complementary source images obtained from different sensors to obtain a new fused image with higher credibility [1–4], better clarity, and better understandability. When the depth of field is limited, the focal plane of the camera cannot capture a globally sharp image of a target in a deep scene, so defocus blur easily occurs. Multifocus image fusion fuses multiple images of the same scene with different focus positions into a single fully focused image containing more information [5]. At present, multifocus image fusion algorithms can be divided into transform domain-based, spatial domain-based, and deep learning-based methods according to the fusion strategy.

The transform domain-based fusion method generally uses decomposition tools to decompose the source images into multilevel coefficients and then designs different fusion rules according to the characteristics of the coefficients at each level [6, 7]. Finally, the inverse multiscale transform is applied to the fused coefficients of each level to obtain the fused image. The choice of transformation tool and the design of the fusion rules largely determine the performance of transform domain-based fusion methods.

Common transformation tools include the curvelet transform (CVT) [8], nonsubsampled contourlet transform (NSCT) [9], Laplacian pyramid (LP) [10], low-pass pyramid, and gradient pyramid (GP) [11]. The fusion rules include maximization, weighted average, saliency, and active contour. Multifocus image fusion methods based on sparse representation (SR), higher-order singular value decomposition (HOSVD) [12], and robust principal component analysis (RPCA) [13] have also attracted increasing attention.

The spatial domain-based fusion method can be divided into three types according to the focus measurement object: pixel-based, block-based, and region-based. Pixel-based multifocus image fusion methods extract feature information from the source images and retain the original information to the greatest extent; they offer high accuracy and strong robustness and include dense scale-invariant feature transform (DSIFT), guided filtering (GF), and image matting (IM). Block-based and region-based multifocus image fusion methods adopt a segmentation strategy to divide the source images into blocks or regions and then select the more focused blocks or regions as parts of the fused image through focus measurement [14]. Common focus measures include the image gradient and the spatial frequency. The block size and the segmentation algorithm directly affect the visual quality of the fused image, which is prone to the "block effect." Both transform domain-based and spatial domain-based fusion methods require manually designed fusion rules. However, complex image scenes limit the expressive ability of the features and the robustness of such rules.

To improve the feature expression ability and the robustness of fusion rules, deep learning has been introduced into multifocus image fusion research [15–17]. Karim et al. [18] proposed a drone for monitoring and targeting street criminals based on real-time image processing techniques. Liu et al. [19] used multiscale Gaussian filters with different standard deviations to blur random regions of grayscale images in order to simulate multifocus images. Using supervised learning, each pixel was classified as focused or defocused, yielding a focus map with the same size as the input image. The focus decision graph was then generated by verifying the size and consistency of the focus map. Finally, based on this decision criterion, a weighted-average strategy was used to obtain the fused image in the spatial domain. Tang et al. [20] proposed a multifocus image fusion method based on a pixel-wise convolutional neural network (P-CNN). The model was trained on Cifar10 and learned to distinguish three kinds of pixels from neighboring pixel information: focused, defocused, and unknown. After the source images were scored by the P-CNN, a score matrix representing the focus level of each pixel was formed; the decision graph was then obtained by comparing the score matrices of the two source images. Finally, the fused image was computed as the weighted average of the two input images according to the final decision graph filtered by a threshold. The model performed well in terms of run time and fusion quality, but the limitation of supervised learning is that accurate label data cannot be obtained for image fusion.

To further distinguish the private and public features in multifocus images, Luo et al. [21] proposed a joint convolutional self-encoding network, which obtained the focus map from the image features learned by the private branch and used a pixel-level weighted-average rule to obtain the fully focused fused image. This method adopted unsupervised learning, did not need manually designed labels, and achieved good results in subjective evaluation and on multiple objective metrics. However, these methods only exploit the feature extraction and classification capability of CNNs and still rely on manually designed fusion rules, so the model cannot adjust its fusion strategy to the application scenario.

To realize self-learning of the fusion rules and make full use of the feature extraction ability of CNNs, this paper combines the prior knowledge of handcrafted features and designs a multifocus image fusion network with self-learned fusion rules. The multifocus images and the initial decision graph are taken as the input of the network, so that the network can learn more accurate detail information. The structural similarity index measure (SSIM) and the local mean squared error (MSE) are used as loss functions to drive the fusion rules.

The rest of this paper is organized as follows. Section 2 presents the proposed approach, Section 3 describes the experimental results, and Section 4 concludes the paper.

2. Proposed Multifocus Image Fusion

This section first introduces the network structure for multifocus image fusion, then discusses the fusion layer in detail, and finally discusses the design of the loss function.

2.1. Feature Extraction Network Based on Sparse Denoising Autoencoder Neural Network

Figure 1 shows the sparse denoising autoencoder neural network (SDNA-ENN).

The whole network is divided into an input layer, a coding layer, a fusion layer, a decoding layer, and an output layer. The input layer takes multifocus image A, multifocus image B, and the initial decision graph of multifocus image A. The coding layer includes 9 trainable convolutional layers with a 3 × 3 kernel, each followed by a ReLU layer. The coding layer can be divided into the private branch PriA and public branch ComA of multifocus image A and the private branch PriB and public branch ComB of multifocus image B, where PriA and PriB extract the private features of their respective input images, while ComA and ComB share weights to extract the features common to the input images. The fusion layer cascades the feature maps output by PriA and PriB along the channel dimension and then feeds the cascaded feature map into a trainable convolutional layer with a 1 × 1 kernel; the feature maps output by ComA and ComB are treated in the same way. The decoding layer consists of four trainable convolutional layers with a 3 × 3 kernel, and the last convolutional layer reconstructs the fully focused image. A short connection is added to the public branch to alleviate gradient vanishing during training. Compared with previous networks, the proposed network adds fusion units and uses short connections to improve the robustness of feature learning.
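The following Keras sketch illustrates one way to lay out a network of this kind. The layer names, filter counts, and the way the initial decision graph is injected into the private branch are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of the SDNA-ENN-style layout: private/public encoder branches,
# a cascade + 1x1 fusion layer, a short connection, and a 4-layer decoder.
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv3x3(filters, name):
    return layers.Conv2D(filters, 3, padding="same", activation="relu", name=name)

def build_fusion_net(h=224, w=224):
    img_a = layers.Input((h, w, 1), name="image_a")
    img_b = layers.Input((h, w, 1), name="image_b")
    dec_a = layers.Input((h, w, 1), name="decision_map_a")  # initial decision graph prior

    # Private branches: features specific to each source image (decision map fed to PriA here).
    pri_a = conv3x3(64, "pri_a_2")(conv3x3(64, "pri_a_1")(layers.Concatenate()([img_a, dec_a])))
    pri_b = conv3x3(64, "pri_b_2")(conv3x3(64, "pri_b_1")(img_b))

    # Public branch with shared weights: the same layer objects process both inputs.
    com_1, com_2 = conv3x3(64, "com_1"), conv3x3(64, "com_2")
    com_a = layers.Add()([com_2(com_1(img_a)), com_1(img_a)])  # short (skip) connection
    com_b = layers.Add()([com_2(com_1(img_b)), com_1(img_b)])

    # Fusion layer: cascade along the channel axis, then a trainable 1x1 convolution.
    fused_pri = layers.Conv2D(64, 1, name="fuse_pri_1x1")(layers.Concatenate()([pri_a, pri_b]))
    fused_com = layers.Conv2D(64, 1, name="fuse_com_1x1")(layers.Concatenate()([com_a, com_b]))

    # Decoder: four 3x3 conv layers, the last one reconstructs the fully focused image.
    x = layers.Concatenate()([fused_pri, fused_com])
    for i in range(3):
        x = conv3x3(64, f"dec_{i}")(x)
    fused = layers.Conv2D(1, 3, padding="same", activation="sigmoid", name="fused_image")(x)
    return Model([img_a, img_b, dec_a], fused)
```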

2.2. Fusion Layer Design

In studies of multifocus image fusion based on deep learning, the fusion layer of the network usually adopts one of two methods to fuse the convolutional features of multiple inputs:
(1) Cascade the convolutional features of the multiple inputs along the channel dimension and then fuse them with the next convolutional layer.
(2) Fuse the convolutional features of the multiple inputs with a pixel-level fusion rule.

The cascade fusion method stacks multiple inputs, so that the network can learn sufficient feature information.

Pixel-level fusion rules include summation, maximum (taking the larger value), and mean [22], and the fusion strategy can be selected according to the characteristics of the data set. In multifocus images, the pixel value reflects the saliency of the information, so the proposed method introduces the mean rule on top of cascade fusion to ensure the diversity and accuracy of feature learning; the pixel-level rules are illustrated in the sketch below. The concrete realization of the fusion layer includes weight initialization and weight constraint.
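As a quick illustration, the following hypothetical helpers apply the summation, maximum ("taking large"), and mean rules to two feature tensors of identical shape.

```python
# Pixel-level fusion rules applied to two feature tensors fa, fb of the same shape.
import tensorflow as tf

def fuse_sum(fa, fb):   # summation rule
    return fa + fb

def fuse_max(fa, fb):   # "taking large" rule: keep the more salient activation
    return tf.maximum(fa, fb)

def fuse_mean(fa, fb):  # mean rule, used here alongside cascade fusion
    return 0.5 * (fa + fb)
```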

2.2.1. Weight Initialization

Weight initialization simulates the weighted-average fusion rule, so that the features extracted by the coding layer can be fused accurately through a reasonable weight assignment in the fusion layer. The output feature maps of the PriA, PriB, ComA, and ComB coding branches are spliced along the channel dimension and followed by a trainable 1 × 1 convolutional layer. The 1st and the (1 + p)-th weight of the k-th channel in the 1 × 1 convolutional layer are initialized to 0.5; that is,

$$ w_k^{(i)} = \begin{cases} 0.5, & i = 1 \ \text{or} \ i = 1 + p, \\ 0, & \text{otherwise}, \end{cases} $$

where $k$ is the index of the channels after the convolution operation, $i$ is the index of the $2p$ weights (filter inputs) of the k-th channel, $w_k^{(i)}$ is the i-th weight value of the k-th channel, and $p = 128$, which can be adjusted according to actual requirements.
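A possible Keras implementation of this initialization is a custom initializer attached to the 1 × 1 fusion convolution; the class name AverageFusionInit and the kernel-shape assumptions below are ours, not part of the original method.

```python
# Sketch of the weight initialization: for each output channel of the 1x1 fusion
# convolution, the 1st and (1+p)-th input weights start at 0.5 and the rest at 0,
# mimicking a weighted-average fusion rule at the start of training (p = 128).
import numpy as np
import tensorflow as tf

class AverageFusionInit(tf.keras.initializers.Initializer):
    def __init__(self, p=128):
        self.p = p

    def __call__(self, shape, dtype=None):
        # For a 1x1 Conv2D, shape = (1, 1, 2p, K): 2p input channels, K output channels.
        kernel = np.zeros(shape, dtype="float32")
        kernel[0, 0, 0, :] = 0.5          # 1st weight of every output channel
        kernel[0, 0, self.p, :] = 0.5     # (1+p)-th weight of every output channel
        return tf.constant(kernel, dtype=dtype)

# Usage (K output channels are an assumption):
# fuse = tf.keras.layers.Conv2D(128, 1, kernel_initializer=AverageFusionInit(p=128))
```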

2.2.2. Weight Constraint

Because the weight values may go out of bounds numerically during network iteration, constraints are added to each weight value so that it fluctuates within an effective range. According to the mean-value rule in image fusion, the sum of the fusion coefficients of the two images should be 1. However, because the activation function of the training network is ReLU, this condition is not automatically satisfied for the k-th channel. Therefore, we make two improvements: the first is to improve the activation cost function, and the second is to apply a minimum/maximum norm weight constraint to the 2p weights of the k-th channel in the fusion layer.

In order to let fewer hidden-layer activation units represent the most effective features, building on research into the traditional autoencoder neural network, this paper adds a sparsity restriction to the hidden neurons of the denoising autoencoder (DAE), which suppresses most of the output neurons and represents the features with fewer active units.

The sparse denoising autoencoder network consists of a sparse denoising autoencoder and a softmax classifier, as shown in Figure 2, where $x$ denotes the original data layer, $\tilde{x}$ denotes the data layer with perturbing noise, and $h$ denotes the hidden layer.

Specifically, assume that the number of input samples is $N$, $x^{(i)}$ denotes the $i$-th input, $\hat{x}^{(i)}$ denotes the corresponding output, $n_l$ denotes the number of layers of the neural network, and $s_l$ denotes the number of neurons in hidden layer $l$. Then the activation cost function of the sparse denoising autoencoder neural network is defined as follows:

$$ J_{\mathrm{sparse}}(W, b) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{2} \left\| \hat{x}^{(i)} - x^{(i)} \right\|^{2} + \beta \sum_{j} \mathrm{KL}\!\left( \rho \,\middle\|\, \hat{\rho}_{j} \right), $$

where the second sum runs over the hidden-layer neurons, $\rho$ is the sparsity target, $\hat{\rho}_{j}$ is the average activation of hidden neuron $j$, $\beta$ controls the weight of the sparsity penalty, and $\mathrm{KL}(\rho \,\|\, \hat{\rho}_{j}) = \rho \log \frac{\rho}{\hat{\rho}_{j}} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_{j}}$ is the Kullback–Leibler divergence.

The residual of each neuron $i$ in the hidden layer is

$$ \delta_{i}^{(l)} = \left( \sum_{j=1}^{s_{l+1}} W_{ji}^{(l)} \delta_{j}^{(l+1)} + \beta \left( -\frac{\rho}{\hat{\rho}_{i}} + \frac{1 - \rho}{1 - \hat{\rho}_{i}} \right) \right) f'\!\left( z_{i}^{(l)} \right), $$

where $z_{i}^{(l)}$ is the weighted input of neuron $i$ in layer $l$ and $f'(\cdot)$ is the derivative of the activation function.

Then, the partial derivatives with respect to the weight and bias terms are calculated as follows:

$$ \frac{\partial J_{\mathrm{sparse}}(W, b)}{\partial W_{ij}^{(l)}} = a_{j}^{(l)} \delta_{i}^{(l+1)}, \qquad \frac{\partial J_{\mathrm{sparse}}(W, b)}{\partial b_{i}^{(l)}} = \delta_{i}^{(l+1)}, $$

where $a_{j}^{(l)}$ is the activation of neuron $j$ in layer $l$.
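In code, the sparsity restriction amounts to a KL-divergence term added to the denoising reconstruction error; the sketch below uses typical values for the sparsity target and penalty weight, which are assumptions rather than the paper's settings.

```python
# KL-divergence sparsity penalty plus denoising reconstruction error (sketch).
import tensorflow as tf

def kl_sparsity(hidden, rho=0.05):
    # Average activation of each hidden unit over the batch.
    rho_hat = tf.reduce_mean(hidden, axis=0)
    rho_hat = tf.clip_by_value(rho_hat, 1e-8, 1.0 - 1e-8)
    return tf.reduce_sum(rho * tf.math.log(rho / rho_hat)
                         + (1.0 - rho) * tf.math.log((1.0 - rho) / (1.0 - rho_hat)))

def sparse_dae_loss(x_clean, x_recon, hidden, beta=3.0):
    # Reconstruction error against the clean input (the network sees the noisy input).
    recon = tf.reduce_mean(tf.reduce_sum(tf.square(x_recon - x_clean), axis=1)) / 2.0
    return recon + beta * kl_sparsity(hidden)
```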

Then, we calculate the L2-norm of the $2p$ weights of the k-th channel:

$$ \left\| w_{k} \right\|_{2} = \sqrt{ \sum_{i=1}^{2p} \left( w_{k}^{(i)} \right)^{2} }. $$

$\left\| w_{k} \right\|_{2}$ is truncated to the range $[c_{\min}, c_{\max}]$; that is,

$$ \left\| w_{k} \right\|_{2}' = \min\!\left( \max\!\left( \left\| w_{k} \right\|_{2},\, c_{\min} \right),\, c_{\max} \right), $$

where $c_{\min}$ is the minimum L2-norm of the input weight values and $c_{\max}$ is the maximum L2-norm of the input weight values.

Finally, each weight value of the k-th channel is readjusted:

$$ w_{k}^{(m)} \leftarrow w_{k}^{(m)} \cdot \frac{ r \left\| w_{k} \right\|_{2}' + (1 - r) \left\| w_{k} \right\|_{2} }{ \left\| w_{k} \right\|_{2} }, $$

where $w_{k}^{(m)}$ is the m-th weight value of the k-th channel, $[c_{\min}, c_{\max}]$ is the constraint range of the weight values, and $r$ is the proportion of the constraint: when $r = 1$, the constraint is strictly enforced, and when $r < 1$, the weights are only adjusted toward the constraint at each step. Constraining the weights in this way also helps to avoid gradient explosion. After weight initialization and constraint, the rule of the fusion layer is finally converted to the weighted sum

$$ F_{k} = \sum_{m=1}^{2p} w_{k}^{(m)} \phi_{m}, $$

where $\phi_{m}$ denotes the m-th cascaded input feature map and $F_{k}$ denotes the k-th fused feature map.
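This truncation-and-readjustment scheme matches the behaviour of the min/max-norm constraint shipped with Keras, so a plausible implementation simply attaches that constraint to the 1 × 1 fusion convolution; the concrete bounds and rate below are illustrative values, not the paper's settings.

```python
# Keras min/max-norm weight constraint applied to the 1x1 fusion convolution.
from tensorflow.keras.constraints import MinMaxNorm
from tensorflow.keras.layers import Conv2D

# Constrain the L2-norm of each output filter's incoming (2p) weights to [c_min, c_max].
fusion_constraint = MinMaxNorm(min_value=0.0,   # c_min
                               max_value=1.0,   # c_max
                               rate=1.0,        # r = 1: enforce the constraint strictly
                               axis=[0, 1, 2])  # norm taken over each filter's 1x1xC_in kernel

fuse_1x1 = Conv2D(filters=128, kernel_size=1, kernel_constraint=fusion_constraint)
```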

2.3. The Design of Loss Function

In order to ensure that the network learns the features of the input images accurately and effectively, a local strategy is added to the loss function, consisting of the local structural similarity and the local mean square error.

2.3.1. Local Structure Similarity

The human visual system is more sensitive to structural loss and deformation. Therefore, the structural similarity index measure (SSIM) [23] can be used to compare the structural information of distorted and original images intuitively. SSIM is composed of three parts, correlation (relevancy), brightness, and contrast, as shown in the following:

$$ \mathrm{SSIM}(X, F) = \sum_{x, f} \frac{2 \mu_{x} \mu_{f} + C_{1}}{\mu_{x}^{2} + \mu_{f}^{2} + C_{1}} \cdot \frac{2 \sigma_{x} \sigma_{f} + C_{2}}{\sigma_{x}^{2} + \sigma_{f}^{2} + C_{2}} \cdot \frac{\sigma_{xf} + C_{3}}{\sigma_{x} \sigma_{f} + C_{3}}, $$

where $\mathrm{SSIM}(X, F)$ represents the structural similarity of the source image X and the fused image F, $x$ and $f$ represent image blocks in the source image and the fused image, respectively, $\mu_{x}$ and $\sigma_{x}$ represent the mean and standard deviation of image X, $\mu_{f}$ and $\sigma_{f}$ represent the mean and standard deviation of the fused image F, $\sigma_{xf}$ represents the covariance of the source image and the fused image, and $C_{1}$, $C_{2}$, and $C_{3}$ are parameters used to stabilize the algorithm.
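TensorFlow provides a ready-made SSIM operator that can serve as this structural term; the wrapper below assumes image batches scaled to [0, 1].

```python
# SSIM between batches of image patches, shape (batch, H, W, channels), values in [0, 1].
import tensorflow as tf

def ssim_score(x, f):
    return tf.image.ssim(x, f, max_val=1.0)
```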

On the basis of SSIM, the corresponding region of image X is extracted by combining it with the initial decision graph of the input image X.

The initial decision graphs corresponding to the input images A and B are $D_A$ and $D_B$, respectively. According to (10), the corresponding regions $\hat{A}$, $\hat{B}$, and $\hat{F}$ of images A and B and the fused image F can be obtained, respectively. According to (9), $\mathrm{SSIM}(\hat{A}, \hat{F})$ and $\mathrm{SSIM}(\hat{B}, \hat{F})$ can then be calculated.

2.3.2. Local Mean Square Error

The mean square error measures the degree of difference between the source image and the fused image; it is inversely related to the quality of the fused image, and a smaller value denotes higher fusion quality. It is calculated as

$$ \mathrm{MSE}(X, F) = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left( X(i, j) - F(i, j) \right)^{2}, $$

where $\mathrm{MSE}(X, F)$ represents the difference between the input image X of size $M \times N$ and the fused image F.

According to (11), $\mathrm{MSE}(\hat{A}, \hat{F})$ and $\mathrm{MSE}(\hat{B}, \hat{F})$ can be obtained.

The final loss function of the proposed network is

$$ \mathrm{Loss} = \gamma_{1} \left[ \left( 1 - \mathrm{SSIM}(\hat{A}, \hat{F}) \right) + \left( 1 - \mathrm{SSIM}(\hat{B}, \hat{F}) \right) \right] + \gamma_{2} \left[ \mathrm{MSE}(\hat{A}, \hat{F}) + \mathrm{MSE}(\hat{B}, \hat{F}) \right], $$

where $\gamma_{1}$ and $\gamma_{2}$ represent the weights of the local structural similarity and the local mean square error, respectively. $\gamma_{1}$ adjusts the similarity between the fused image and the source images: a larger $\gamma_{1}$ yields higher similarity. $\gamma_{2}$ enhances the focus areas of the source images in the fused image: a larger $\gamma_{2}$ makes those focus areas more significant. Based on extensive experiments, this paper sets $\gamma_{1} = 5$ and $\gamma_{2} = 5$.
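Putting the pieces together, a sketch of the local loss might mask the focused regions with the initial decision maps and combine an SSIM term with an MSE term. The exact composition of the terms here is our reading of the text, not the authors' released code.

```python
# Local loss sketch: decision maps mask the focused regions, which drive an SSIM
# term and an MSE term weighted by gamma1 and gamma2 (both 5 in the paper).
import tensorflow as tf

def local_fusion_loss(img_a, img_b, dec_a, dec_b, fused, gamma1=5.0, gamma2=5.0):
    # Local (masked) regions of the sources and of the fused image.
    a_loc, fa_loc = img_a * dec_a, fused * dec_a
    b_loc, fb_loc = img_b * dec_b, fused * dec_b

    ssim_term = (1.0 - tf.reduce_mean(tf.image.ssim(a_loc, fa_loc, max_val=1.0))
                 + 1.0 - tf.reduce_mean(tf.image.ssim(b_loc, fb_loc, max_val=1.0)))
    mse_term = (tf.reduce_mean(tf.square(a_loc - fa_loc))
                + tf.reduce_mean(tf.square(b_loc - fb_loc)))
    return gamma1 * ssim_term + gamma2 * mse_term
```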

3. Experiment and Analysis

To verify the performance of the proposed fusion method, we conduct comparison experiments with seven state-of-the-art fusion methods, namely, DE [24], NFBD [25], GDMC [26], LRRW [27], NNSR [28], CFM [29], and FRL-PCNN [30]. The experimental environment is MATLAB R2017a, Windows 10, an NVIDIA GTX 1060 GPU, 16 GB of memory, and an Intel Core i7-6700 CPU. The Keras framework with the TensorFlow backend is used for network training. All the comparison methods use the same parameters [31, 32]. Detailed subjective and objective comparisons are then carried out on multiple multifocus image pairs.

Because suspect data are classified as state secrets, both the suspect images and the open datasets were tested in the laboratory, and the reported results come only from the public datasets. The experiments are conducted on 60 pairs of multifocus images: 20 pairs are from the open-source dataset Lytro [33], another 20 pairs have been widely used in multifocus image fusion research, and the remaining 20 pairs are actual suspect images. A sliding window with a stride of 14 is adopted to take blocks, and each image in the dataset is divided into M image blocks of 224 × 224 pixels. The initial decision graph is obtained in three steps: segmentation, mapping, and reprocessing. First, each image in the dataset is segmented into blocks of 4 × 4 pixels, and the spatial frequency of each block is calculated. Then, the spatial frequency matrix is mapped back to the original size of the source image, and the overlapping parts are averaged to obtain the spatial frequency map. The binary map is obtained by comparing the spatial frequency maps of the two source images. Finally, the initial decision graph fed to the network is obtained through consistency verification and guided filtering. The fusion results of the different methods are shown in Figures 3–8.
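A simplified sketch of this decision-graph pipeline is given below: block-wise spatial frequency, comparison of the two frequency maps, and guided-filter refinement. The consistency-verification step is omitted, the block size follows the text, the filter parameters are illustrative, and guidedFilter requires opencv-contrib-python.

```python
# Initial decision map from block-wise spatial frequency (sketch).
import numpy as np
import cv2

def spatial_frequency_map(img, block=4):
    h, w = img.shape
    sf = np.zeros((h, w), dtype=np.float64)
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            blk = img[y:y + block, x:x + block].astype(np.float64)
            rf = np.sqrt(np.mean(np.diff(blk, axis=1) ** 2))  # row frequency
            cf = np.sqrt(np.mean(np.diff(blk, axis=0) ** 2))  # column frequency
            sf[y:y + block, x:x + block] = np.sqrt(rf ** 2 + cf ** 2)
    return sf

def initial_decision_map(img_a, img_b):
    # Binary map: 1 where source A is sharper (higher spatial frequency) than source B.
    binary = (spatial_frequency_map(img_a) > spatial_frequency_map(img_b)).astype(np.float32)
    # Guided filtering refines the binary map, using source A as the guide image.
    return cv2.ximgproc.guidedFilter(guide=img_a.astype(np.float32),
                                     src=binary, radius=8, eps=0.04)
```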

To compare the fusion methods more intuitively, a small region around a salient contour in each fused image is selected, marked with a rectangular box, and shown enlarged. We analyze the image "disk." As can be seen from Figure 7, all of the above methods obtain fully focused images with good subjective quality. DE and NFBD introduce false information such as artifacts at the edge of the alarm clock. The fusion effect of IM is good, but a certain "Gibbs" phenomenon appears in the disk area and some details are lost. GDMC shows fuzzy distortion in the locally magnified region because it emphasizes finding boundaries and performs the focus measure within a single block. The fusion results of LRRW, NNSR, CFM, and FRL-PCNN are good, but there is a slight "sag" on the left edge of the alarm clock.

Comparatively, the visual effect of the proposed method is similar to the subjective visual effect of the other methods. As can be seen from the enlarged area in Figure 7, the proposed method handles details well; in particular, the edge area of the alarm clock is smooth and natural, so a better fusion result is obtained. Since the initial decision graph of the focused image and the local strategy of the loss function are incorporated into the network, the fused image obtained by the proposed method retains key information well and is suited to human visual perception. Figures 3–6 and 8 show the fusion results of the other 5 pairs of multifocus images under the various fusion methods. As can be seen from the figures, all the methods can fuse the multifocus images to some extent, and compared with the other methods, the proposed method achieves better fusion results.

To objectively evaluate the results of each fusion method, this paper uses four evaluation indexes, entropy (EN), the weighted fusion quality index $Q_W$ proposed by Piella and Heijmans, the correlation coefficient (CC), and visual information fidelity for fusion (VIFF), to verify the effectiveness of the proposed method. Entropy is an index based on information theory that reflects the amount of information in an image; a larger entropy value indicates that the fused image contains more information. $Q_W$ is a variant of the universal image quality index, which accounts for the position and size of distortions by assigning high weights to visually salient areas; a larger $Q_W$ denotes a better fusion effect. The correlation coefficient measures the correlation between the source image and the fused image, and its value is positively correlated with the fusion effect. VIFF simulates the subjective vision of the human eye to measure the fidelity of the fused image; it includes four steps: partitioning, evaluation, calculating the fidelity of each subband, and calculating the total fidelity. A higher VIFF indicates lower distortion between the fused image and the source image. To ensure the fairness of the objective evaluation, all indexes use the same parameters.
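For reference, generic NumPy versions of two of these indexes, entropy and the correlation coefficient, are sketched below; they follow the standard definitions rather than the exact evaluation scripts used in the experiments.

```python
# Entropy (EN) of a fused image and correlation coefficient (CC) between a source
# image and the fused image, for 8-bit grayscale inputs.
import numpy as np

def entropy(img, bins=256):
    hist, _ = np.histogram(img, bins=bins, range=(0, 255))
    p = hist / hist.sum()
    p = p[p > 0]                      # drop empty bins before taking the logarithm
    return -np.sum(p * np.log2(p))

def correlation_coefficient(src, fused):
    s = src.astype(np.float64).ravel()
    f = fused.astype(np.float64).ravel()
    return np.corrcoef(s, f)[0, 1]
```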

Tables 1–6 display the objective evaluation results on 6 pairs of multifocus images for the eight fusion methods. As can be seen from the tables, the proposed fusion method has obvious advantages over the other fusion methods in terms of the fusion indexes. In general, the proposed method achieves the best results in terms of $Q_W$, CC, EN, VIFF, and the average index values, indicating that the new algorithm is an effective fusion method.

4. Conclusions

In this paper, an end-to-end unsupervised multifocus image fusion algorithm based on a sparse denoising autoencoder neural network is proposed. Combined with the prior knowledge of the multifocus images, the network can learn accurate image details. Reasonable weight initialization and weight constraints are designed in the fusion layer. Local structural similarity and local mean square error strategies are used in the loss function to drive the fusion unit to learn the fusion rules effectively. Experimental results show that the proposed method not only learns the fusion rules by itself during the fusion process but also obtains good results in subjective vision and objective evaluation. This work is of significance for further understanding the mechanism of multifocus image fusion based on deep learning and for studying a general multimodal image fusion framework. In the future, newer deep learning methods will be used to analyze multifocus image fusion.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.