Computational Intelligence and Neuroscience

Volume 2019, Article ID 4179397, 23 pages

https://doi.org/10.1155/2019/4179397

## Multifocus Image Fusion Using Wavelet-Domain-Based Deep CNN

¹School of Computer Science and Technology, Shandong Technology and Business University, Yantai 264005, China
²Co-innovation Center of Shandong Colleges and Universities: Future Intelligent Computing, Yantai 264005, China

Correspondence should be addressed to Genji Yuan; yuangenji@outlook.com

Received 11 September 2018; Revised 5 January 2019; Accepted 20 January 2019; Published 20 February 2019

Academic Editor: Pedro Antonio Gutierrez

Copyright © 2019 Jinjiang Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Multifocus image fusion merges multiple images of the same scene, each captured with a different focus, into a single all-in-focus image. Most existing fusion algorithms extract high-frequency information by designing local filters and then apply different fusion rules to obtain the fused image. In this paper, a wavelet transform is used for multiscale decomposition of the source and fused images to obtain their high-frequency and low-frequency subband images. To obtain a clearer and more complete fused image, a deep convolutional neural network is used to learn the direct mapping between the high-frequency and low-frequency subbands of the source images and those of the fused image. Two convolutional networks are trained on the high-frequency and low-frequency subband images, respectively, to encode this mapping. The experimental results show that the proposed method obtains satisfactory fused images, superior to those produced by several state-of-the-art image fusion algorithms in terms of both visual quality and objective evaluation.

#### 1. Introduction

Because sensor imaging is constrained by the imaging mode, the imaging environment, and other factors, a single generated image provides only a partial, superficial view of the target object; combining such information can describe the object more comprehensively and in greater detail. Image fusion refers to the process of fusing multiple images of the same scene into one image according to corresponding fusion rules [1, 2]. The resulting image conveys more comprehensive information than any single source image, exhibits clearer detail, and is more consistent with human and machine perception [3, 4]. Therefore, the realization of multifocus image fusion is of practical significance.

The focus range of a visible-light imaging system over a target area is limited by the depth of field of the optical system. In an image of a given scene, only the vicinity of the focal plane is sharp, and other objects are blurred to varying degrees. Multifocus image fusion technology can combine differently focused images into a single image, merging objects and information to obtain a more accurate description of the scene. Multifocus image fusion can overcome the limitations of a single sensor in terms of spatial resolution, geometry, and spectrum, improving the reliability of downstream image processing tasks [5] such as feature extraction, edge detection, object recognition, and image segmentation. Multifocus image fusion has been widely used in remote sensing, transportation, medical imaging [6], military applications, and machine vision.

At present, many spatial-domain-based fusion methods exist, and they can produce fused images of good quality. Nevertheless, image artefacts are usually present in the fusion results obtained by these classical spatial methods. In response to this problem, scholars have proposed a variety of image fusion algorithms based on spatial transformation, such as image matting [7], guided filtering fusion [8], multiscale weighted gradients [9], and quad-tree decomposition with weighted focus measures [10]; these algorithms can extract the details of the original image and maintain the spatial consistency of the fused image. However, such methods cause a block effect at the boundaries of image subblocks due to windowing, which strongly affects the quality of the fused image. The fused images obtained by transformation-domain methods are, in turn, usually accompanied by image distortion and other undesirable phenomena. Therefore, developing a new multifocus image fusion algorithm has important theoretical significance and practical value.

The key to multifocus image fusion is to extract the information from the sharp parts of the source images for fusion. This paper uses deep learning to learn the direct mapping between the source images and the fused image. A deep convolutional neural network (CNN) is trained on clear images and their corresponding blurred versions to encode this mapping, so that the fusion rules for multifocus images are learned by the CNN model rather than designed by hand. On the basis of this idea, a wavelet transform is used to extract the high-frequency and low-frequency information of the image, and the fused high-frequency and low-frequency information is inverse-transformed into the fused image. The low-frequency subband of an image contains its key features, while the high-frequency subbands contain detailed information related to image sharpness. Convolutional neural networks are used to learn the direct mappings between the high-frequency and low-frequency subbands of the source images and those of the fused image, respectively [11], yielding fusion rules for the low-frequency and high-frequency subbands. These rules determine the high-frequency and low-frequency content of the fused image. Experiments demonstrate that the fused images obtained using the convolutional neural networks are reliable.
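To make the wavelet decomposition step concrete, the following is a minimal NumPy sketch (not the authors' code) of a one-level 2-D Haar transform, which splits an even-sized image into one low-frequency approximation subband and three high-frequency detail subbands:

```python
import numpy as np

def haar_dwt2(img):
    """One-level 2-D Haar wavelet decomposition.

    Returns the low-frequency approximation (LL) and the three
    high-frequency detail subbands (LH, HL, HH), each half the size
    of the input; the input's sides must have even length.
    """
    a = img[0::2, :] + img[1::2, :]          # vertical sums
    d = img[0::2, :] - img[1::2, :]          # vertical differences
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0     # LL: coarse image features
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0     # horizontal details
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0     # vertical details
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0     # diagonal details
    return ll, lh, hl, hh
```

In the scheme described above, the LL subband would feed the low-frequency network (key features), while LH/HL/HH would feed the high-frequency network (sharpness-related detail).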

Overall, the primary contributions of this paper cover the following three aspects:
(1) An end-to-end CNN-based method is proposed for multifocus image fusion.
(2) Using the multiscale property of the wavelet transform, the image is decomposed into high-frequency and low-frequency information, and two CNNs are trained separately to encode the high-frequency and low-frequency subband images.
(3) The multifocus fused images obtained through end-to-end training are of higher quality.

The remainder of this paper is organized as follows. Section 2 introduces related work. Section 3 discusses the network structure employed in this paper in more detail. Section 4 provides details concerning the training set, training methods, evaluation indicators, and experimental results. Section 5 summarizes the main ideas and findings of this paper.

#### 2. Related Work

Multifocus image fusion can be performed at three levels: pixel-level fusion, feature-level fusion, and decision-level fusion. Pixel-level fusion, the lowest of the three levels, processes the pixel values of the images directly; it preserves more of the original image information, making the result more conducive to human observation or further computer processing. Pixel-level image fusion methods can be grouped into two categories: methods based on the spatial domain and methods based on the transformation domain.

Image fusion methods based on the spatial domain [7–10] select the pixels in the sharp parts of the source images to form the fused image. A sharp region is identified according to a certain sharpness indicator, and the sharp blocks, usually obtained by windowing with a fixed size or by image segmentation, are then merged into the fused image. To obtain subblocks of an appropriate size, Bai et al. [10] used the quad-tree method to divide images into subblocks of different sizes adaptively. Some spatial-domain methods based on gradient information [12–14] have also been proposed recently.

Image fusion methods based on the transformation domain usually decompose the original images into transformation coefficients, fuse these coefficients according to the corresponding fusion rules, and finally obtain the fused image by reconstruction from the fused coefficients. With the development of multiscale theory, multiscale transform (MST) methods have been widely applied in image fusion, including pyramid decomposition [15], the discrete wavelet transform [16], the dual-tree complex wavelet transform [17], and the nonsubsampled contourlet transform [18]. The basic concept of these methods is to perform multiscale decomposition on each source image, fuse all the decomposition coefficients, and reconstruct the fused image through the inverse transform. How the decomposition coefficients are combined plays a key role in MST-based image fusion [19, 20]. These methods all share the same framework, consisting of decomposition [21], fusion, and reconstruction.
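As an illustration of this decomposition-fusion-reconstruction framework, the sketch below (a hypothetical baseline, not the method of this paper) fuses two images with a one-level Haar transform, averaging the low-frequency coefficients and keeping the larger-magnitude coefficient in each high-frequency subband:

```python
import numpy as np

def haar_dwt2(img):
    """One-level 2-D Haar analysis: approximation plus 3 detail subbands."""
    a = img[0::2, :] + img[1::2, :]
    d = img[0::2, :] - img[1::2, :]
    return ((a[:, 0::2] + a[:, 1::2]) / 2.0,   # LL: low frequency
            (a[:, 0::2] - a[:, 1::2]) / 2.0,   # LH
            (d[:, 0::2] + d[:, 1::2]) / 2.0,   # HL
            (d[:, 0::2] - d[:, 1::2]) / 2.0)   # HH

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2 (perfect reconstruction)."""
    h, w = ll.shape
    out = np.empty((2 * h, 2 * w), dtype=float)
    out[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    out[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    out[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    out[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return out

def fuse(img_a, img_b):
    """Decompose, fuse, reconstruct: average the low-frequency subband,
    take the max-absolute coefficient in each high-frequency subband."""
    ca, cb = haar_dwt2(img_a), haar_dwt2(img_b)
    fused = [(ca[0] + cb[0]) / 2.0]                               # average LL
    for da, db in zip(ca[1:], cb[1:]):
        fused.append(np.where(np.abs(da) >= np.abs(db), da, db))  # max-abs
    return haar_idwt2(*fused)
```

The paper's contribution replaces the hand-designed averaging and max-abs rules in `fuse` with rules learned by two CNNs.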

The spatial-domain-based approach has the advantage of directly fusing the focal regions of the source images; however, it is highly dependent on the choice of sharpness measure, such as the energy of gradient, standard deviation, or spatial frequency of the image. Since structural information cannot be represented by a single pixel, spatial-domain methods require efficient extraction of the focused area from the source image. Li et al. [7] used a matting technique to obtain the focused area of each source image. However, because the matting technique is not always stable, the boundary of the focal region obtained by this method is not completely reliable. Considering the grey-scale similarity and geometric similarity of adjacent pixels, Kumar [22] proposed a cross-bilateral filter to fuse multifocus images. However, the universality of this technique is not satisfactory, and the size of its filtering window cannot be adjusted adaptively.
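The sharpness measures mentioned above are simple to compute. The following NumPy sketch shows two common ones, the energy of gradient and the spatial frequency (illustrative implementations; exact conventions vary slightly across the literature):

```python
import numpy as np

def energy_of_gradient(img):
    """Sum of squared forward differences; larger values indicate a sharper region."""
    gx = np.diff(img, axis=1)   # horizontal gradient
    gy = np.diff(img, axis=0)   # vertical gradient
    return float((gx ** 2).sum() + (gy ** 2).sum())

def spatial_frequency(img):
    """Root of the mean squared row-wise and column-wise differences."""
    rf = np.mean(np.diff(img, axis=1) ** 2)   # row frequency
    cf = np.mean(np.diff(img, axis=0) ** 2)   # column frequency
    return float(np.sqrt(rf + cf))
```

A spatial-domain method would evaluate such a measure on corresponding windows of each source image and copy pixels from whichever source scores higher.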

Some new image fusion methods, such as those based on sparse representation (SR) [23, 24], variational and partial differential equations [25–27], and dictionary learning [6, 21], have attracted increasing attention. These methods overcome the block effect in image fusion, but their results can be unstable and the fused edges may appear unnatural. Zhang and Levine [23] proposed a robust sparse representation model (RSR) and a multitask robust sparse representation model (MRSR). In contrast to the traditional SR model, the reconstruction error obtained from the MRSR decomposition serves as the criterion for discriminating the focused region of the image, and the focused region obtained is more accurate. However, this method builds the dictionary from a single source image, which can easily lead to an incomplete dictionary. Guorong et al. [28] introduced the structure tensor (ST) into image fusion to enhance the visual quality of images. Li et al. [6] incorporated low-rank and sparse regularization terms into the dictionary-learning model, which can effectively remove image noise and preserve texture details when merging images.

To further improve the fusion rules, many new methods have been proposed. Guorong et al. [28] and Zhao et al. [29] proposed new transformation domains, Li et al. [19] and Liu et al. [20] proposed new fusion rules, Liu and Wang [30] proposed a new sparse model with more complex fusion rules, and Bai et al. [10] proposed a new method of subblock division. Existing multifocus image fusion algorithms, in particular those based on the spatial domain, focus on proposing a new model, designing more complex fusion rules, or finding an index that measures the sharpness of image pixels or subblocks to guide fusion. However, a single image feature cannot be applied suitably to a variety of complex imaging conditions, and it is almost impossible to design an ideal fusion model that considers all factors.

Liu et al. [31] used a deep neural network for multifocus image fusion; however, the designed network is essentially a classification network, which may lead to an inaccurate boundary between the focused and unfocused regions. Du and Gao [32] observed that a decision map contains complete and clear information about the images to be fused and proposed a new multifocus image fusion algorithm based on image segmentation: a convolutional neural network analyses the input image at multiple scales, and the corresponding decision map is derived by segmenting the focused and unfocused regions of the source image. Zhao et al. [33] proposed a multilevel deeply supervised convolutional neural network for multifocus image fusion and designed an end-to-end network through which feature extraction, fusion rules, and image fusion could be learned jointly. Zhao et al. [33] also constructed a new model to fuse the captured low-frequency features with the high-frequency features; in the present paper, the multiscale property of the wavelet transform is instead used to decompose the image into its low-frequency and high-frequency information. Xu et al. [34] used images with different foci for end-to-end mapping, establishing a many-to-one mapping between the source images and the output image; a fully convolutional dual-stream network architecture was designed to realize pixel-level image fusion. Mingrui et al. [35] designed a pixel-wise convolutional neural network that recognizes focused and defocused pixels in the source images according to neighbourhood information; however, additional labels need to be designed for the focused area.
The performance of deep networks can also be improved by using the wavelet transform, as in this paper. The work in [36] proposed a residual network in the directional-wavelet transform domain for low-dose X-ray CT reconstruction; the directional wavelet was used to embed the input and label datasets into a high-dimensional feature space and learn the mapping between them. The work in [37] proposed a wavelet-based multiscale CNN for face super-resolution that can recover finer details in high-resolution images. The work in [38] combined multifocus image fusion with super-resolution, using a CNN to directly produce a super-resolved, all-in-focus output image with enhanced detail; its network structure is similar to that of [34], but the fusion rule designed in [38] directly uses weighted fusion, which may not achieve the ideal fusion effect. The work in [14] proposed a new boundary-based multifocus image fusion algorithm: the focus detection task is cast as finding the boundary between the focused and unfocused regions in the source images, and the method can accurately handle this boundary. Compared with traditional methods, using a CNN to fuse multifocus images is more advantageous. Designing fusion rules is the main task in multifocus image fusion, whereas a CNN does not require manual design: the fusion rules are obtained directly through network learning and can be regarded as "optimal" to some extent, so the main task becomes the design of the network structure. With the advent of CNN frameworks such as Caffe [39], network design has become more convenient, and the rapid development of GPUs makes it possible to exploit large amounts of image data. Therefore, a CNN-based method is more likely to obtain high-quality fusion results.

#### 3. Multifocus Deep Convolutional Neural Network

In recent years, breakthroughs in deep neural networks have come from deep convolutional neural networks. Convolutional neural networks are a special case of artificial neural networks, inspired by the neural networks of the animal visual cortex. They consist of successive linear and nonlinear functions: the local nature of the convolution allows the image to be processed effectively, while the nonlinear functions allow for more complex data representations. A CNN tries to learn representations of image features at different levels of abstraction; each convolution layer contains a certain number of feature maps corresponding to a level of feature abstraction. Local receptive fields, weight sharing, and subsampling are three basic structural concepts of CNNs.
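A single CNN layer is simply a convolution followed by a nonlinearity. The sketch below shows a "valid" 2-D cross-correlation with one kernel and the ReLU activation in plain NumPy (an illustrative toy, not the paper's network):

```python
import numpy as np

def conv2d_valid(x, w, b=0.0):
    """'Valid' 2-D cross-correlation of a single-channel map with one kernel,
    plus a scalar bias; the output shrinks by the kernel size minus one."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w) + b
    return out

def relu(x):
    """Rectified linear unit: elementwise max(0, x)."""
    return np.maximum(x, 0.0)
```

A real convolution layer applies many such kernels across many input channels, producing one feature map per kernel; stacking layers yields the increasing levels of abstraction described above.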

The correspondence between the input and output of the convolutional neural network is

$$F_i(Y) = \sigma\left(W_i * F_{i-1}(Y) + b_i\right), \quad i = 1, \ldots, d, \qquad (1)$$

where $Y$ is the input (with $F_0(Y) = Y$), $F_i(Y)$ is the output of the *i*-th layer, $W_i$ is the convolution matrix of the *i*-th layer, and $b_i$ is the bias of the *i*-th layer. $\sigma$ is the excitation function and can be selected from several options, but the rectified linear unit (ReLU), $\sigma(x) = \max(0, x)$, is commonly used. $\Theta = \{W_1, \ldots, W_d, b_1, \ldots, b_d\}$ is the set of all tunable parameters, including the $W_i$ and $b_i$. The goal of the CNN framework is to determine appropriate parameters $\Theta$ to minimize the empirical loss

$$L(\Theta) = \frac{1}{K} \sum_{k=1}^{K} \left\| F(Y_k; \Theta) - X_k \right\|^2, \qquad (2)$$

where $Y_k$ and $X_k$ represent the *k*-th input and output, respectively, and $\|\cdot\|$ denotes the Euclidean distance. The backpropagation algorithm is used to minimize equation (2). The basic structure of a CNN is composed of an input layer, convolution layers, pooling layers, fully connected layers, and an output layer. To alleviate the overfitting problem, the training data are subdivided into small batches and the intermediate activations are normalized within each batch, a process known as batch normalization. The basic structure of a CNN is shown in Figure 1.
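The empirical loss of equation (2) and the batch normalization step can be sketched as follows (illustrative NumPy code; the learnable scale and shift of batch normalization are omitted for brevity):

```python
import numpy as np

def mse_loss(preds, targets):
    """Empirical loss of equation (2): mean squared Euclidean distance
    between network outputs and ground-truth images over a batch."""
    preds = np.asarray(preds, dtype=float)
    targets = np.asarray(targets, dtype=float)
    sq_dist = np.sum((preds - targets) ** 2, axis=tuple(range(1, preds.ndim)))
    return float(np.mean(sq_dist))

def batch_norm(x, eps=1e-5):
    """Normalize a mini-batch (first axis) of features to zero mean and
    unit variance per feature; eps guards against division by zero."""
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)
```

During training, `mse_loss` is evaluated on each mini-batch and its gradient is backpropagated through the layers; `batch_norm` is inserted between layers to keep activation statistics stable.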