#### Abstract

We propose a novel algorithm that performs image sequence fusion and denoising simultaneously in the 3D shearlet transform domain. Most existing image fusion methods only combine the important information of the source images and do not deal with artifacts. If the source images contain noise, the noise may be transferred into the fused image together with the useful pixels. In the 3D shearlet transform domain, we first apply a recursive filter to the high-pass subbands to obtain denoised high-pass coefficients. The high-pass subbands are then combined using a maximum-selection fusion rule based on a 3D pulse coupled neural network (PCNN), and the low-pass subband is fused using a weighted-sum fusion rule. Experimental results demonstrate that the proposed algorithm yields encouraging results.

#### 1. Introduction

Video sensors have been applied extensively in video monitoring and machine vision as their performance has improved and their cost has fallen. In real applications, because of object movement, occlusion, and illumination variation, a single video sensor may not satisfy the requirements [1]. To obtain complete information about a scene, multiple different sensors are employed simultaneously to capture the same scene [2]. To use the information captured by multiple sensors sufficiently and efficiently, the contents from the different video sensors need to be combined into one image sequence. Image sequence fusion methods serve this purpose: they merge multiple image sequences from different sensors into a single image sequence that contains all the important information, eliminating redundancy and improving the availability of the information.

In the past decade, a variety of image fusion approaches have been developed for different applications [3–5]. The simplest fusion method is the weighted average of the source images. However, this method is brittle and tends to reduce contrast and introduce artifacts. Fusion methods based on multiscale decomposition in the transform domain are now increasingly popular because of their better robustness and reliability [6–8]. Most existing fusion methods are designed for static images; fusion methods designed specifically for image sequences and videos are rare. Fusion methods for static images can ensure the quality of each single frame through frame-by-frame fusion [9], but the temporal consistency and stability of the image sequence are hard to preserve. In recent years, several state-of-the-art video fusion approaches that utilize information along the temporal axis have been developed [10, 11]. Based on the three-dimensional surfacelet transform, Zhang et al. [12] propose a video fusion framework that fuses multiple frames of the input videos as a whole. However, the method is still insufficient in extracting spatial-temporal information from videos with a dynamic background. Zhang et al. [13] propose a multisensor video fusion method based on the three-dimensional uniform discrete curvelet transform (3D-UDCT) and a spatial-temporal structure tensor. However, these schemes still do not obtain satisfactory results because of their insufficient capability to represent motion information.

In addition, most existing image fusion methods only pay attention to combining the useful pixels of the source images and seldom deal with artifacts. Noise is easily introduced into the fused image together with the useful pixels unless further processing is performed to eliminate it. Some fusion methods do consider enhancement and denoising during the fusion procedure. Piella [14] presents a variational model that fuses the input images while preserving the salient information and enhancing the contrast. Yang and Li [15] propose a sparse representation-based multifocus image fusion method that can simultaneously denoise and fuse noisy source images.

In this paper, we propose a novel image sequence fusion algorithm that implements fusion and denoising simultaneously in the 3D shearlet transform domain. The input image sequences are first decomposed by the 3D shearlet transform into high- and low-pass subbands, and the coefficients are then denoised and merged in the frequency domain. Finally, the inverse 3D shearlet transform is applied to the fused coefficients to reconstruct the image sequence. With the 3D shearlet transform, multiple frames are decomposed into different frequency scales in 3D space in a single pass; the transform has directional selectivity and avoids aliasing and instability in the coefficients of neighboring frames. Because the 3D shearlet transform accounts for motion along the temporal axis, the decomposed coefficients can approximate the spatial-temporal features of image sequences well at different scales. To preserve the consistency and stability of the interframe coefficients when merging, we propose a spatial-temporal fusion rule based on the 3D PCNN [16], which can extract the spatial-temporal information of the corresponding subbands from neighboring frames. To handle noise, which most existing fusion methods ignore, we combine fusion and denoising and apply the recursive filter directly to the coefficients. Compared with separate fusion and denoising based on multiscale transforms, our method reduces the error because decomposition and reconstruction need to be performed only once.

The remainder of the paper is organized as follows. Section 2 reviews basic 3D shearlet transform theory in brief. Section 3 describes the proposed image sequence fusion and denoising algorithm in detail. Section 4 presents and discusses the experimental results. Section 5 concludes.

#### 2. 3D Shearlet Transform

In this section, we briefly review the theory and properties of the 3D shearlet transform, which will be used in the rest of this paper (see [17] for details).

The shearlet approach inherits the general advantages of curvelets and surfacelets. During the last decade, to overcome the limitations of wavelets and other traditional methods, a new class of multiscale methods has been introduced through a novel framework. The framework combines standard multiscale decomposition with the ability to capture anisotropic features efficiently. Shearlets belong to this new class of multiscale methods. The shearlet representation is a multiscale pyramid of well-localized waveforms defined at various locations and orientations. This representation overcomes the limitations of traditional multiscale systems in dealing with multidimensional data.

The 3D shearlet transform is constructed from a shearlet system associated with pyramidal regions. Three pyramidal regions are obtained by partitioning the Fourier space, as shown in Figure 1.

**Figure 1:** (a)–(c) The three pyramidal regions that partition the Fourier space.
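The defining expressions for the pyramidal regions are not reproduced in this text. For orientation only, one convention that appears in the 3D shearlet literature is sketched below. The symbols $\mathcal{P}_1$, $\mathcal{P}_2$, $\mathcal{P}_3$, and $\xi_1, \xi_2, \xi_3$ are assumptions made for this sketch, and [17] should be consulted for the exact definition used here.

```latex
\begin{aligned}
\mathcal{P}_1 &= \left\{ (\xi_1,\xi_2,\xi_3) \in \widehat{\mathbb{R}}^3 :
  \left|\tfrac{\xi_2}{\xi_1}\right| \le 1,\ \left|\tfrac{\xi_3}{\xi_1}\right| \le 1 \right\}, \\
\mathcal{P}_2 &= \left\{ (\xi_1,\xi_2,\xi_3) \in \widehat{\mathbb{R}}^3 :
  \left|\tfrac{\xi_1}{\xi_2}\right| < 1,\ \left|\tfrac{\xi_3}{\xi_2}\right| \le 1 \right\}, \\
\mathcal{P}_3 &= \left\{ (\xi_1,\xi_2,\xi_3) \in \widehat{\mathbb{R}}^3 :
  \left|\tfrac{\xi_1}{\xi_3}\right| < 1,\ \left|\tfrac{\xi_2}{\xi_3}\right| < 1 \right\}.
\end{aligned}
```

Under this convention, each region collects the frequencies whose dominant component lies along one coordinate axis, so the three regions together cover the whole Fourier space.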

The directionality of the shearlet systems is controlled through the use of shearing matrices. The 3D shearlet systems form a Parseval frame for $L^2(\mathbb{R}^3)$, which is obtained by appropriately combining the systems of shearlets associated with the three pyramidal regions. In this way, the 3D shearlet systems are defined as the collections consisting of the coarse-scale shearlets, the interior shearlets, and the boundary shearlets, where the shearing parameters control the orientations of the support regions. Figure 2 illustrates a typical support region. The support region becomes more elongated as the scale index increases.

A numerical implementation of the 3D discrete shearlet transform takes advantage of the sparsity properties of the corresponding continuous representation. Because it uses shearing matrices rather than rotations, the 3D digital shearlet transform preserves the discrete integer lattice and enables a natural transition from the continuous to the discrete setting. The algorithm is implemented as a cascade of a multiscale decomposition and a directional filtering stage. The multiscale decomposition is first implemented using the Laplacian pyramid algorithm. The directional components are then obtained using shearing matrices to control the orientations in the pseudospherical domain.
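To make the multiscale stage more concrete, the following is a minimal sketch of one Laplacian pyramid level on a 1D signal. It is not the 3D implementation used in the paper: the smoothing kernel, the interpolation scheme, and all function names are assumptions chosen for illustration. It only shows the principle that the detail band is the residual between the signal and the expanded coarse approximation, which makes the level exactly invertible.

```python
def reduce_signal(x):
    """Smooth with a [1, 2, 1]/4 kernel (edges replicated), then keep every second sample."""
    n = len(x)
    smoothed = [
        (x[max(i - 1, 0)] + 2 * x[i] + x[min(i + 1, n - 1)]) / 4.0
        for i in range(n)
    ]
    return smoothed[::2]


def expand_signal(coarse, n):
    """Upsample back to length n by linear interpolation between coarse samples."""
    out = []
    last = len(coarse) - 1
    for i in range(n):
        k, frac = divmod(i, 2)
        left = coarse[min(k, last)]
        right = coarse[min(k + 1, last)]
        out.append(left if frac == 0 else 0.5 * (left + right))
    return out


def laplacian_level(x):
    """Split x into a coarse (low-pass) part and a detail (band-pass) residual."""
    coarse = reduce_signal(x)
    predicted = expand_signal(coarse, len(x))
    detail = [a - b for a, b in zip(x, predicted)]
    return coarse, detail


def reconstruct_level(coarse, detail):
    """Invert laplacian_level by adding the detail back onto the expanded coarse part."""
    predicted = expand_signal(coarse, len(detail))
    return [p + d for p, d in zip(predicted, detail)]
```

Applying `laplacian_level` repeatedly to the coarse output would give a multi-level pyramid; in the shearlet setting, the directional filtering stage would then act on each detail band.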

#### 3. Proposed Image Sequence Fusion and Denoising Algorithm

In this section, the proposed image sequence fusion and denoising algorithm is presented in detail. The main idea is that image sequence fusion and denoising are implemented simultaneously in the 3D shearlet transform domain. The framework of the proposed algorithm is shown in Figure 3. For clarity of presentation, we assume that two registered image sequences are to be combined.

The two input image sequences are decomposed by the 3D shearlet transform into two groups of high-pass subbands and two groups of low-pass subbands over several frames. The high-pass subbands are then denoised by the recursive filter (RF) [18]. Next, the fused subband coefficients of all frames are obtained using a 3D PCNN-based spatial-temporal saliency fusion rule. Finally, the fused image sequence is reconstructed by applying the inverse 3D shearlet transform to the fused coefficients of all frames. The steps of the proposed algorithm are as follows.

*Step 1.* The input image sequences are transformed to the frequency domain by the 3D shearlet transform, producing, for each frame, the low-pass subband coefficients at the coarsest scale and the high-pass subband coefficients at each scale and in each direction.

*Step 2.* The recursive filter is applied to the high-pass subband coefficients of each frame to eliminate noise. The 3D PCNN then computes the spatial-temporal activity levels of the denoised high-pass coefficients, yielding spatial-temporal activity maps, which are used to form the fused high-pass subbands by maximum selection.

*Step 3.* The low-pass subband coefficients are merged using a spatial-temporal energy weighted fusion rule based on the activity maps produced by the 3D PCNN.

*Step 4.* The inverse 3D shearlet transform is applied to the fused coefficients of all frames to obtain the fused image sequence.
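The four steps can be sketched as a small orchestration function. This is only a skeleton under stated assumptions: the Python standard library has no 3D shearlet transform, so the transform, the denoiser, and the two fusion rules are passed in as caller-supplied callables. All names and signatures here are hypothetical and are not taken from the paper.

```python
def fuse_sequences(seq_a, seq_b, decompose, reconstruct, denoise, fuse_high, fuse_low):
    """Fuse two registered sequences following Steps 1-4.

    All callables are placeholders supplied by the caller (assumed signatures):
      decompose(seq)          -> (low, highs), where highs is a list of subbands
      reconstruct(low, highs) -> sequence
      denoise(subband)        -> denoised subband
      fuse_high(a, b)         -> fused high-pass subband
      fuse_low(a, b)          -> fused low-pass subband
    """
    # Step 1: decompose both inputs into low- and high-pass subbands.
    low_a, highs_a = decompose(seq_a)
    low_b, highs_b = decompose(seq_b)

    # Step 2: denoise the high-pass subbands, then fuse them pairwise.
    highs_a = [denoise(h) for h in highs_a]
    highs_b = [denoise(h) for h in highs_b]
    fused_highs = [fuse_high(ha, hb) for ha, hb in zip(highs_a, highs_b)]

    # Step 3: fuse the low-pass subbands.
    fused_low = fuse_low(low_a, low_b)

    # Step 4: invert the transform on the fused coefficients.
    return reconstruct(fused_low, fused_highs)
```

Note that the decomposition and reconstruction each run once per sequence, which is the property the combined fusion and denoising scheme relies on.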

##### 3.1. 3D PCNN-Based Spatial-Temporal Fusion

This section discusses the proposed 3D PCNN-based spatial-temporal fusion in detail. The PCNN is a biologically inspired neural network developed from experimental observations of synchronous pulse bursts in the cat and monkey visual cortex. A PCNN neuron consists of a receptive field, a modulation field, and a pulse generator [19, 20]. Unlike traditional neural networks, the PCNN is a feedback network and does not need to be trained. A secondary receptive field of the PCNN, known as the linking field, integrates inputs from adjacent neurons to modulate the primary feeding field. In image processing, the 2D PCNN is a single-layer, two-dimensionally connected neural network. To account for the spatial features of the 2D image plane, the 2D PCNN uses the outputs of the spatially neighboring pixels as the inner input of the next iteration. Similar neurons in the PCNN generate pulses simultaneously, which effectively compensates for spatial incoherence and slight amplitude changes, so that the PCNN can measure salient object regions completely. It has been used successfully on single static images. However, if the 2D PCNN is applied directly to an image sequence frame by frame, it may not extract the temporal motion information.

To adapt the PCNN to image sequences, the 2D PCNN is extended to a 3D PCNN that utilizes the correlation between neighboring frames. The 3D PCNN has been used successfully for the segmentation of stereo images [16]. Here, the 3D PCNN is employed to measure the activity energy of the coefficients from the 3D shearlet decomposition. Each coefficient, indexed by its spatial location, scale, direction, and frame, is fed into the 3D PCNN as the external feeding input within its subband. Both the last output of the spatial neighbor pixels and the corresponding output of the neighbor frames are used as the inner linking input. In this way, the 3D PCNN can sufficiently extract the spatial-temporal information in an image sequence. In the 3D PCNN, the feeding input of a neuron is set to its coefficient. The linking input sums the firings of the neurons within the linking range, taken from the previous iteration and weighted by the linking weight coefficients; it is governed by a decay constant and an amplitude gain, and the linking range has a size along each of the three dimensions. The internal state signal is obtained by modulating the feeding input with the linking input, scaled by the linking strength. The threshold is governed by its own decay constant and amplitude gain and is updated at every iteration. If the internal state exceeds the threshold, the neuron generates a pulse, called one firing; otherwise, the neuron does not generate a pulse.
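The defining equations of the 3D PCNN are not reproduced in this text. As a reading aid, the simplified PCNN form commonly used in image fusion, extended to a 3D neighborhood, can be written as below. Every symbol is an assumption introduced for this sketch rather than the paper's own notation: $C$ is the input coefficient, $F$ the feeding input, $L$ the linking input, $U$ the internal state, $\theta$ the threshold, $Y$ the output, $W$ the linking weights, $\alpha_L, V_L$ and $\alpha_\theta, V_\theta$ the decay constants and amplitude gains, $\beta$ the linking strength, and $n$ the iteration index.

```latex
\begin{aligned}
F_{x,y,t}[n] &= C_{x,y,t}, \\
L_{x,y,t}[n] &= e^{-\alpha_L} L_{x,y,t}[n-1]
  + V_L \sum_{(p,q,r)} W_{p,q,r}\, Y_{x+p,\,y+q,\,t+r}[n-1], \\
U_{x,y,t}[n] &= F_{x,y,t}[n] \left( 1 + \beta L_{x,y,t}[n] \right), \\
\theta_{x,y,t}[n] &= e^{-\alpha_\theta}\, \theta_{x,y,t}[n-1] + V_\theta\, Y_{x,y,t}[n-1], \\
Y_{x,y,t}[n] &=
  \begin{cases}
    1, & U_{x,y,t}[n] > \theta_{x,y,t}[n], \\
    0, & \text{otherwise}.
  \end{cases}
\end{aligned}
```

In this form, the sum over $(p,q,r)$ runs over the spatial and temporal neighbors in the linking range, which is what lets firings propagate across frames as well as within a frame.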

In this paper, the 3D PCNN is used to measure the spatial-temporal saliency of the coefficients. By combining the coefficients of the corresponding scales and directions from neighboring frames, we construct a 3D volume. In this way, a neuron model of the 3D PCNN is constructed, in which each coefficient is an external input of the 3D PCNN. The neuron with the maximum coefficient value fires first. Subsequently, similar neurons in the 3D space defined by the internal linking matrix are stimulated to produce synchronous pulses through pulse propagation. The generated pulse sequence forms a 3D binary sequence, which contains the saliency information of the images, for example, regions, edges, and textures.

In applications, the firing times are generally employed to represent image information. The firing times of a neuron are computed by accumulating its output pulses over all iterations up to and including the present one, and the value accumulated after the final iteration is used as the total firing times. Here, the firing times indicate the activity energy of the coefficients.
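The iteration and the accumulation of firing times can be illustrated with a small pure-Python sketch. It follows the simplified PCNN form that is common in image fusion and is not the authors' implementation. The parameter values, the initial threshold of 1.0, the unit linking weights over the 26 nearest neighbors, and the use of the coefficient magnitude as feeding input are all assumptions made for illustration.

```python
import math
from itertools import product


def pcnn3d_firing_times(volume, iterations=20, beta=0.2,
                        alpha_l=1.0, v_l=1.0, alpha_t=0.2, v_t=20.0):
    """Return accumulated firing times for a 3D volume indexed as volume[t][y][x]."""
    nt, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    idx = list(product(range(nt), range(ny), range(nx)))
    # 26-connected spatial-temporal neighbourhood with unit weights (assumption).
    offsets = [o for o in product((-1, 0, 1), repeat=3) if o != (0, 0, 0)]

    feed = {p: abs(volume[p[0]][p[1]][p[2]]) for p in idx}  # magnitude as feeding input
    link = dict.fromkeys(idx, 0.0)
    theta = dict.fromkeys(idx, 1.0)   # initial threshold (assumption)
    fired = dict.fromkeys(idx, 0)     # outputs of the previous iteration
    times = dict.fromkeys(idx, 0)     # accumulated firing times

    for _ in range(iterations):
        new_fired = {}
        for (t, y, x) in idx:
            p = (t, y, x)
            # Linking input: decayed state plus previous firings of the neighbours.
            neighbours = sum(
                fired.get((t + dt, y + dy, x + dx), 0) for dt, dy, dx in offsets
            )
            link[p] = math.exp(-alpha_l) * link[p] + v_l * neighbours
            # Internal state: feeding input modulated by the linking input.
            internal = feed[p] * (1.0 + beta * link[p])
            # Threshold: decays, and jumps after the neuron's own previous firing.
            theta[p] = math.exp(-alpha_t) * theta[p] + v_t * fired[p]
            new_fired[p] = 1 if internal > theta[p] else 0
        fired = new_fired
        for p in idx:
            times[p] += fired[p]

    return [[[times[(t, y, x)] for x in range(nx)] for y in range(ny)]
            for t in range(nt)]
```

For a volume with one dominant coefficient, that neuron fires first and accumulates more firings than its weak neighbors, which is the behavior the fusion rules rely on. A practical implementation would vectorize the neighborhood sum rather than loop over voxels.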

The high-pass subbands of the 3D shearlet decomposition contain abundant detail information, for example, lines, edges, and contours. To preserve the detail components in the fused images, we propose a spatial-temporal maximum-selection fusion rule based on the 3D PCNN for the high-pass subbands. Coefficients with larger firing times are selected as the fused coefficients. Consequently, at each location, scale, direction, and frame, the fused high-pass coefficient is taken from the source whose denoised coefficient has the larger firing times.
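A minimal sketch of this selection rule on flattened subbands follows. The function name and the tie-breaking choice (ties go to the first source) are assumptions for illustration; the text does not specify how ties are handled.

```python
def fuse_high_select_max(coef_a, coef_b, times_a, times_b):
    """Pick, element-wise, the coefficient whose PCNN firing times are larger.

    coef_a, coef_b:   denoised high-pass coefficients of the two sources (flat lists)
    times_a, times_b: corresponding firing times
    Ties go to source A, an arbitrary choice made for this sketch.
    """
    return [
        a if ta >= tb else b
        for a, b, ta, tb in zip(coef_a, coef_b, times_a, times_b)
    ]
```

For example, with firing times `[5, 2, 4]` and `[3, 6, 4]`, the first and third fused coefficients come from source A and the second from source B.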

The low-pass subband of the 3D shearlet decomposition at the coarsest scale contains the main energy of the source images and carries abundant structural information. The low-pass subband is fused with a spatial-temporal weighted fusion rule based on the firing times of the 3D PCNN: each fused low-pass coefficient is a weighted sum of the two corresponding source coefficients, where the weight of each coefficient is derived from the firing times computed by (3) and (4).
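The exact weight expression is not reproduced in this text. The sketch below assumes one natural choice, in which each source is weighted by its share of the total firing times, with equal weights when neither neuron fired. Both the normalization and the function name are assumptions, not the paper's stated formula.

```python
def fuse_low_weighted(coef_a, coef_b, times_a, times_b):
    """Weighted sum of low-pass coefficients, weights taken from firing times.

    Assumed weight: w_a = T_a / (T_a + T_b), w_b = 1 - w_a.
    Falls back to equal weights when both firing times are zero.
    """
    fused = []
    for a, b, ta, tb in zip(coef_a, coef_b, times_a, times_b):
        total = ta + tb
        w_a = 0.5 if total == 0 else ta / total
        fused.append(w_a * a + (1.0 - w_a) * b)
    return fused
```

Because the weights sum to one, the fused coefficient always lies between the two source coefficients, which keeps the overall energy of the low-pass band in range.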

Finally, the inverse 3D shearlet transform is applied to the fused coefficients of all frames to obtain the fused image sequence.

##### 3.2. Recursive Filter Denoising

The previous section introduced the fusion method for the decomposed coefficients. When the input images contain noise, the fused image will also contain noise artifacts if the coefficients are merged directly, so the coefficients must first be denoised. This section presents the recursive filter [18], which is used to denoise the coefficients. The recursive filter is a real-time edge-preserving smoothing filter. Compared with separate fusion and denoising based on a multiscale transform, performing fusion and denoising simultaneously on the coefficients of the 3D shearlet transform reduces the error caused by decomposition and reconstruction. In addition, the denoising filter runs on the coefficients of different scales and directions, which enhances the robustness of the algorithm.

The low-pass subband contains the main energy of the images and carries the structural information. The high-pass subbands contain abundant details, for example, lines, edges, and contours. In general, noise appears in the high-pass subbands, so the recursive filter (RF) is applied only to the high-pass subband coefficients to obtain the denoised coefficients. While the noise is eliminated, the details in the high-pass subbands need to be preserved, which the edge-preserving recursive filter achieves. Each filtered coefficient is a combination of the corresponding input coefficient and the previous filtered coefficient. The weight of the previous filtered coefficient is a feedback coefficient raised to the power of the distance between neighboring coefficients of the high-pass subband. As this distance increases, the weight goes to zero, stopping the propagation chain and preserving details in the high-pass subbands. More details about the recursive filter can be found in [18].
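The behavior described above can be sketched on a 1D run of coefficients. This sketch assumes the recurrence J[n] = (1 - a^d) I[n] + a^d J[n-1] with a distance d that grows with the local coefficient difference, in the style of domain-transform recursive filtering. The symbols, the parameters `sigma_s` and `sigma_r`, the distance formula, and the two-pass scheme are assumptions for illustration, and the exact formulation is in [18].

```python
import math


def recursive_filter_1d(signal, sigma_s=3.0, sigma_r=0.5):
    """Edge-preserving recursive smoothing of a 1D list of coefficients.

    Assumed recurrence: J[n] = (1 - a**d) * I[n] + a**d * J[n - 1],
    where d = 1 + (sigma_s / sigma_r) * |I[n] - I[n - 1]| is large across edges.
    """
    a = math.exp(-math.sqrt(2.0) / sigma_s)   # feedback coefficient in (0, 1)
    # dist[k] is the distance between samples k and k + 1 of the input.
    dist = [
        1.0 + (sigma_s / sigma_r) * abs(signal[n] - signal[n - 1])
        for n in range(1, len(signal))
    ]
    out = list(signal)
    # Left-to-right (causal) pass.
    for n in range(1, len(out)):
        w = a ** dist[n - 1]
        out[n] = (1.0 - w) * out[n] + w * out[n - 1]
    # Right-to-left (anti-causal) pass, to make the response symmetric.
    for n in range(len(out) - 2, -1, -1):
        w = a ** dist[n]
        out[n] = (1.0 - w) * out[n] + w * out[n + 1]
    return out
```

Across a large jump the distance is large, so `a ** d` is close to zero and the filter effectively restarts, leaving the edge intact while small fluctuations on either side are smoothed. For 2D or 3D subbands, such a filter would be applied along each dimension in turn.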

The recursive filter is applied to all frames to obtain the denoised high-pass coefficients. The fused coefficients are then obtained by merging the denoised coefficients.

#### 4. Experiments and Analysis

In this section, the proposed image sequence fusion algorithm based on the 3D shearlet transform and the 3D PCNN (named 3DShearlet-3DPCNN) is tested on several groups of image sequences. For comparison, three other fusion algorithms, based on the discrete wavelet transform (DWT), the 3D DWT, and the 3D dual-tree complex wavelet transform (3D DTCWT), are used to fuse the same image sequences. All of these methods use averaging and absolute-maximum selection to merge the low- and high-pass subband coefficients, respectively. The decomposition level of all the transforms is three. The source image sequences are assumed to have been registered. The test collections contain both clear and noisy image sequences.

##### 4.1. Fusion Results of Clear Image Sequences

The first experiment uses a pair of visible light and infrared image sequences without noise. Figure 4 shows a pair of frames from the input image sequences and the four fused images produced by DWT, 3D DWT, 3D DTCWT, and the proposed 3DShearlet-3DPCNN. Figure 4(a) is an infrared frame from one input image sequence, and Figure 4(b) is the corresponding frame from the visible light image sequence. Figure 4(c) is the worst result: it has lower contrast, and the moving objects tend to blur. Figures 4(d)–4(f), produced by 3D DWT, 3D DTCWT, and the proposed 3DShearlet-3DPCNN, are better, with higher contrast and clearer moving objects. However, Figures 4(d) and 4(e) show some distortion near the windows: the window edges appear warped, and dark regions are introduced on the wall between the windows. In contrast, the proposed method (Figure 4(f)) yields the best result. This comparison indicates that the proposed fusion approach effectively identifies complementary and redundant information between the input frames and preserves the useful information of the input frames while avoiding artifacts. In addition, to evaluate the temporal stability and consistency of the image sequence fusion methods, we examine the interframe difference (IFD) between the current frame and the previous frame. Figure 5 shows one set of IFD images for the source and fused frames, computed between the frames in Figure 4 and their previous frames. The IFD in Figure 5(c) clearly contains many artifacts that exist in neither Figure 5(a) nor Figure 5(b). In Figure 5(d), the artifacts are greatly reduced. In Figures 5(e) and 5(f), almost no artifacts are visible. This further demonstrates that the frames fused by 3D DTCWT and the proposed method have better temporal stability and consistency.

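**Figure 4:** (a) Infrared source frame; (b) visible light source frame; (c)–(f) frames fused by DWT, 3D DWT, 3D DTCWT, and the proposed 3DShearlet-3DPCNN.

**Figure 5:** IFD images for the frames in Figure 4: (a), (b) source frames; (c)–(f) frames fused by DWT, 3D DWT, 3D DTCWT, and the proposed 3DShearlet-3DPCNN.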

Figure 6 shows another pair of input image frames and the corresponding fused results. Among the fused results (Figures 6(c)–6(f)), the frames fused using DWT, 3D DWT, and 3D DTCWT (Figures 6(c)–6(e)) are not clear enough and have lower contrast. In particular, artifacts are introduced around the man in Figure 6(c). The frame fused using the proposed approach (Figure 6(f)) is noticeably clearer and has stronger contrast. These results further demonstrate that the proposed algorithm can effectively improve the quality of the fused image sequence.

**Figure 6:** (a), (b) Input image frames; (c)–(e) frames fused by DWT, 3D DWT, and 3D DTCWT; (f) frame fused by the proposed approach.

For a more accurate comparison, the performance of the fusion algorithms is also measured with objective quantitative metrics in addition to the visual analysis. Two metrics are used: the spatial-temporal gradient preservation based video fusion performance metric [21] and the mutual information of IFD images (IFD_MI) [9]. The spatial-temporal gradient preservation metric indicates how much spatial-temporal information is extracted from the sources and transferred into the fused image sequence. IFD_MI reflects the performance of the image sequence fusion method in temporal stability and consistency. For both metrics, higher values indicate better fusion results.
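The two building blocks of IFD_MI can be sketched as follows: the interframe difference and a histogram-based mutual information between two difference images. How [9] aggregates the mutual information over the source and fused sequences is not reproduced here, so the sketch stops at the pairwise quantity. The bin count, the value range, and the function names are assumptions for illustration.

```python
import math
from collections import Counter


def interframe_difference(frame, prev_frame):
    """Pixel-wise difference between the current frame and the previous frame."""
    return [[c - p for c, p in zip(row_c, row_p)]
            for row_c, row_p in zip(frame, prev_frame)]


def mutual_information(img_x, img_y, bins=8, lo=-255.0, hi=255.0):
    """Mutual information (in bits) of two equally sized images via a joint histogram."""
    def bin_of(v):
        k = int((v - lo) / (hi - lo) * bins)
        return min(max(k, 0), bins - 1)   # clamp values on or beyond the range edges

    pairs = [
        (bin_of(x), bin_of(y))
        for row_x, row_y in zip(img_x, img_y)
        for x, y in zip(row_x, row_y)
    ]
    n = len(pairs)
    joint = Counter(pairs)
    marg_x = Counter(i for i, _ in pairs)
    marg_y = Counter(j for _, j in pairs)
    # Sum of p(i, j) * log2(p(i, j) / (p(i) * p(j))) over occupied bins.
    return sum(
        (c / n) * math.log2(c * n / (marg_x[i] * marg_y[j]))
        for (i, j), c in joint.items()
    )
```

A fused sequence whose IFD images share more information with the IFD images of the sources would score higher, which matches the interpretation of IFD_MI as a measure of temporal stability and consistency.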

The measurement results of the different fusion methods for the image frames in Figure 4 and Figure 6 are shown in Figure 7. The DWT method has the lowest scores. Compared with DWT, the 3D DWT and 3D DTCWT methods give improved results. The proposed method achieves the highest scores among the listed fusion approaches on both the spatial-temporal gradient preservation metric and IFD_MI. The quantitative results are consistent with the visual analysis and show that the proposed fusion method performs best while preserving the temporal stability and consistency of the image sequences.

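**Figure 7:** (a), (b) Measurement results of the different fusion methods for the image frames in Figure 4 and Figure 6.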

##### 4.2. Fusion Results of Noisy Image Sequences

In the previous section, the fusion results of the different algorithms on clear image sequences were analyzed both visually and with objective metrics. In this section, the denoising and fusion results of different algorithms on noisy image sequences are compared. Our method is compared with the 3D DTCWT fusion method followed by 3D DTCWT denoising [22], named 3D DTCWT-FD.

Figure 8 presents the fusion and denoising results produced by our method and the 3D DTCWT-FD method on a pair of real-world noisy image sequences. Of the source frames (Figures 8(a) and 8(b)), the visible light frame (Figure 8(a)) is clear, whereas the infrared frame (Figure 8(b)) is contaminated by noise. Both fused frames (Figures 8(c) and 8(d)) have eliminated the noise. However, the 3D DTCWT-FD result (Figure 8(c)) is blurred because of over-smoothing. The proposed denoising and fusion method (Figure 8(d)) yields the better result, removing nearly all the noise while preserving high contrast.

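**Figure 8:** (a) Visible light source frame; (b) noisy infrared source frame; (c) result of 3D DTCWT-FD; (d) result of the proposed method.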

In addition, to evaluate the performance of the different fusion and denoising schemes, the spatial-temporal gradient preservation metric [21], IFD_MI, and the peak signal-to-noise ratio (PSNR) are used. Table 1 shows the results of the spatial-temporal gradient preservation metric and IFD_MI for Figure 8. The proposed method obtains the higher score on both metrics. This indicates that the proposed simultaneous fusion and denoising method retains the useful information from the source images while eliminating the noise. Figure 9 presents the PSNR results for 1000 fused frames associated with Figure 8. To compute the PSNR, the clear visible light source frames are used as the reference images. Figure 9 shows that the proposed method has higher PSNR scores, which indicates that it yields better denoised images. Consequently, the comparisons on the spatial-temporal gradient preservation metric, IFD_MI, and PSNR further demonstrate that the proposed simultaneous image sequence fusion and denoising method yields satisfactory results, merging the important complementary information and removing the artifacts.
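For reference, the PSNR between a reference frame and a fused frame follows the standard definition, 10 log10(peak^2 / MSE). The sketch below assumes 8-bit frames stored as nested lists, so the peak value defaults to 255; the function name is illustrative.

```python
import math


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    squared_errors = [
        (r - t) ** 2
        for row_r, row_t in zip(reference, test)
        for r, t in zip(row_r, row_t)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return float("inf")   # identical frames
    return 10.0 * math.log10(peak * peak / mse)
```

In the experiment described above, `reference` would be a clear visible light source frame and `test` the corresponding fused frame, evaluated frame by frame over the sequence.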

#### 5. Conclusion

This paper proposes a novel algorithm for image sequence fusion and denoising based on the 3D shearlet transform. Most existing image fusion methods do not deal with artifacts during the fusion procedure. If the source images contain noise, the noise may be transferred into the fused image together with the useful pixels. Therefore, in the 3D shearlet transform domain, we first apply the recursive filter to the high-pass subbands to obtain denoised high-pass coefficients. The high-pass subbands are then combined using a maximum-selection fusion rule based on the 3D PCNN, and the low-pass subband is fused using a weighted-sum fusion rule. In this way, fusion and denoising are achieved simultaneously for noisy image sequences. Experiments demonstrate that the proposed method greatly improves the quality of the fused image sequence.

#### Conflict of Interests

The authors do not have any actual or potential conflict of interests.

#### Acknowledgments

This work was supported by the National Basic Research Program of China (973 Program) 2012CB821200 (2012CB821206) and the National Natural Science Foundation of China (no. 61320106006).