Abstract

Deblurring images of dynamic scenes is a challenging problem. Recently, significant progress has been made in image deblurring methods based on deep learning. However, these methods usually stack ordinary convolutional layers or enlarge the convolution kernel size, resulting in limited receptive fields, an unsatisfactory deblurring effect, and a heavy computational burden. Therefore, we propose an improved U-Net (U-shaped Convolutional Neural Network) model to restore blurred images. We first design the model structure, which mainly includes depth-wise separable convolution, residual depth-wise separable convolution, wavelet transform, inverse wavelet transform, and a DMRFC (dense multireceptive field channel) module. Next, a depth-wise separable convolution is designed, which reduces model calculations and the number of parameters compared with standard convolution. A residual depth-wise separable convolution is designed, which allows detailed information to propagate across layers compared with standard convolution and a standard residual block. The wavelet transform realizes downsampling by separating the contextual and texture information of the image, and it also reduces model training difficulty. The inverse wavelet transform realizes upsampling, which reduces the loss of image information. Finally, by combining an extensional receptive field with a channel attention mechanism, a DMRFC module is proposed to extract detailed image information, which further improves the quality of the image reconstructed by the inverse wavelet transform. Experiments on the public GOPRO dataset show that the proposed image deblurring method produces higher-quality visual results, with the PSNR (peak signal-to-noise ratio) reaching 30.83 and the SSIM (structural similarity) reaching 0.948. It also requires fewer model parameters and a shorter recovery time, providing a more lightweight image deblurring method.

1. Introduction

In dynamic scenes, blurred images arise from complex causes such as camera shake, atmospheric turbulence, low light levels, and target motion [1]. Blurred images not only degrade the subjective visual experience but also hinder subsequent visual tasks such as object detection and object tracking [25]. Therefore, image deblurring in dynamic scenes is a challenging problem in computer vision [6].

Most traditional image deblurring approaches use regularization and hand-crafted image priors to estimate the blur kernel [7], followed by iterative optimization to progressively recover the image. These conventional methods must estimate complex blur kernels, which results in a lengthy deblurring process as well as poor real-time and restoration performance. Recently, image deblurring approaches based on deep learning have seen growing application [8]. Li et al. [9] proposed a new convolution architecture that greatly expanded the receptive field, together with a scale-aware convolutional neural network to restore a clear image. Quan et al. [10] proposed a deep-learning-based method for nonblind image deblurring, which combined a Gabor-domain prior with a complex-valued CNN-based prior to better handle noise with unknown parameters or statistical characteristics. Liu et al. [11] proposed a deblurring module that sharpens blurred images of dynamic scenes based on high-frequency residual image learning. Zhang et al. [12] proposed an enhanced adversarial network model, which used feature-channel weights to generate clear images and remove checkerboard artefacts. Zhang et al. [13] proposed a deep hierarchical multipatch network that handles blurred images using a fine-to-coarse hierarchical representation; its performance and runtime can be adjusted within a single network to fit different application scenarios. Chen et al. [14] proposed an attention-adaptive module, which adaptively decides the arrangement of channel and spatial attention modules; a transformable convolutional module was also proposed to process geometric variations. Zeng and Diao [15] applied a dense network to remove image blur, which can prevent the gradient vanishing problem. Analysis of the above deep learning methods shows that their receptive fields are insufficient, the amount of image information they obtain is limited, and their models are computationally extensive and time consuming. This paper proposes an improved U-Net-based method to solve these problems.

U-Net follows an encoder-decoder processing flow; the encoder-decoder idea was proposed by Hinton as early as 2006, when its main function was image compression and denoising. For image compression, an input image is encoded by downsampling to obtain a string of features that partially describes the original image, which is equivalent to compression; decoding by upsampling then recovers the original image. Therefore, storing an image only requires storing its features and the decoder. The same idea can be used to denoise the original image: in the training stage, noise is artificially added to the original image, which is fed into the encoder, and the network is trained on the difference between the output and the original image so that the original image can be restored. Later, this idea was applied to image segmentation, so U-Net has been widely used in image segmentation and related fields in recent years [16, 17]. U-Net can integrate features from different levels and has a flexible network structure, so a deep network can dramatically reduce the number of parameters while remaining within an acceptable accuracy range. In recent years, U-Net variants have also been used in other image processing fields. Zhang et al. [18] incorporated domain-specific knowledge to design an attention-based Tri-U-Net architecture that includes feature extraction, feature fusion, and image reconstruction, generating high-quality and high-resolution multispectral images. Fan et al. [19] proposed a SUNET model, which combines Swin Transformer layers with U-Net to denoise images. Zang et al. [20] proposed a cascaded Dense-U-Net structure to fully utilize all hierarchical features for single-image super-resolution; a Dense-U-Net block includes many short and dense skip pathways, which benefit information flow and integrate different receptive fields. Guan et al. [21] proposed an improved convolutional neural network architecture called Fully Dense U-Net to remove artifacts from 2D photoacoustic tomography images reconstructed from sparse data, and compared the reconstructed image quality of Dense U-Net and standard U-Net. Alimjan et al. [22] proposed a remote image change detection algorithm based on the U-Net network, which adds a multifeature self-attention mechanism between the encoder and decoder to capture richer contextual dependencies.

Inspired by the above literature, we propose an image deblurring method based on U-Net. The key components of U-Net are downsampling, upsampling, and skip connections, whose implementations are very important to its performance. The wavelet transform can realize the downsampling operation of a CNN, while the inverse wavelet transform (IWT) can realize the upsampling operation [23]. The wavelet transform can represent contextual dependencies and texture information of the image at different levels; the high-frequency sub-bands help restore image texture details and constrain the reconstruction of the low-frequency sub-band. Consequently, the wavelet transform has been used for image processing tasks such as super-resolution [24], image reconstruction [25], image defogging [26], and image deblurring [27]. In these works, wavelet transforms are combined with deep convolutional neural networks to remove blur, and can further remove blur from noisy images. Considering that U-Net adopts a fully convolutional structure, we therefore propose to use the wavelet transform to realize downsampling and the IWT to realize upsampling. However, conventional convolutions have many parameters and require a large amount of calculation, and deepening the network causes the gradient to vanish; depth-wise separable convolution [28] and residual networks [29] can solve these problems [30]. Also, the skip connections in U-Net directly connect the features extracted by the encoder to the corresponding decoder layers, which produces a lot of redundant information; an attention mechanism can extract the key information [18, 31], which solves this problem. Inspired by the above methods, we aim to increase the receptive field and reduce calculation by combining the wavelet transform, depth-wise separable convolution, residual networks, and an attention mechanism to improve U-Net. As such, we propose a comprehensive image deblurring model and achieve better deblurring performance.

In this paper, an improved U-Net model is proposed to remove image blur, with the main contributions as follows: (1) We embed a multilevel two-dimensional discrete wavelet transform and IWT into U-Net. The wavelet transform realizes downsampling, which obtains contextual and texture information at different image frequencies and reduces model training difficulty. The IWT realizes upsampling, which reconstructs the processed information into a more accurate image. (2) We propose the DMRFC module to extract semantic features of the image and thereby improve image deblurring performance. The module consists of four multireceptive field channel blocks and a bottleneck layer. A multireceptive field channel block is composed of an extensional receptive field block and a channel attention module; this increases the diversity of the extracted features, which are adaptively weighted at each channel according to their importance, while the bottleneck layer reduces the number of feature inputs. (3) We embed a residual depth-wise separable convolution module instead of a standard residual block, which propagates detailed information from different layers to improve blur reconstruction quality. The proposed model is then analysed quantitatively and qualitatively on the GOPRO dataset. The experimental results show that it improves the quality of image restoration, reduces information reuse, increases the image receptive field, and produces a good visual effect.

2. The Proposed Methodology

2.1. The Model Structure

We propose an improved U-Net model with the model structure shown in Figure 1.

As shown in Figure 1, the model mainly includes depth-wise separable convolution, residual depth-wise separable convolution, wavelet transform, inverse wavelet transform, and a DMRFC module. The model structure presents a U shape, with the encoder on the left and the decoder on the right. The convolution kernel is 3 × 3 and the activation function in the model is Leaky ReLU. The encoder extracts image information through four stages. The first stage uses one 32-channel depth-wise separable convolution and three 32-channel residual depth-wise separable convolutions. The second stage uses a wavelet transform, one 64-channel depth-wise separable convolution, and three 64-channel residual depth-wise separable convolutions. The third stage uses a wavelet transform, one 128-channel depth-wise separable convolution, and three 128-channel residual depth-wise separable convolutions. The fourth stage uses a wavelet transform, one 256-channel depth-wise separable convolution, and two 256-channel residual depth-wise separable convolutions, followed by a dense multireceptive field channel module. The decoder realizes image reconstruction through information fusion and also consists of four stages. The first stage uses two 256-channel residual depth-wise separable convolutions, one 512-channel depth-wise separable convolution, and an IWT. The second stage fuses its input with the output of the third encoder stage through a skip connection, then applies three 128-channel residual depth-wise separable convolutions, one 256-channel depth-wise separable convolution, and an IWT. The third stage fuses its input with the output of the second encoder stage through a skip connection, then applies three 64-channel residual depth-wise separable convolutions, one 128-channel depth-wise separable convolution, and an IWT. The fourth stage fuses its input with the output of the first encoder stage through a skip connection, and three 32-channel residual depth-wise separable convolutions plus one 3-channel depth-wise separable convolution restore the feature map to its original resolution. Finally, the restored image is obtained by fusing this result with the input image.

Therefore, the encoder uses depth-wise separable convolution in place of standard convolution and residual depth-wise separable convolution in place of the standard residual block, which reduces the number of network parameters. A two-dimensional DWT (discrete wavelet transform) implements downsampling to obtain contextual and texture information at different image frequencies, which reduces the computational complexity and training difficulty. The DMRFC module obtains image information at different scales while mitigating gradient vanishing and redundant feature reuse. The decoder implements upsampling by IWT to reduce the loss of image information; the image is then reconstructed effectively using depth-wise separable convolution and residual depth-wise separable convolution, which propagate detailed information from different layers and improve the deblurring effect. The main modules are introduced in detail below.

2.2. Depth-Wise Separable Convolution

The depth-wise separable convolution improves on conventional convolution, reduces the number of parameters, and makes the network lightweight. The depth-wise separable convolution structure used in this paper is shown in Figure 2; it consists of a depth-wise convolution and a point-wise convolution. The depth-wise convolution first splits the multichannel features of the previous layer into single-channel feature maps, convolves each one with a 3 × 3 convolution kernel, and then stacks them back together; that is, the spatial size of the previous layer's feature map is adjusted while the number of channels remains unchanged. The feature maps obtained by the depth-wise convolution are then convolved a second time by the point-wise convolution with a 1 × 1 convolution kernel, which fuses the depth-wise convolution results across channels and freely changes the number of output channels. The related parameters of the depth-wise separable convolution (Figure 2) in the model are shown in Table 1.
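For reference, the following is a minimal PyTorch sketch of a depth-wise separable convolution as described above; the padding, bias settings, and Leaky ReLU slope are assumptions, since the paper only specifies the 3 × 3 depth-wise and 1 × 1 point-wise kernels.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depth-wise separable convolution: a 3 x 3 depth-wise convolution applied per channel
    (groups = in_channels), followed by a 1 x 1 point-wise convolution that fuses the
    per-channel results and sets the number of output channels."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))
```

For a C-channel input, the 3 × 3 depth-wise step costs roughly 9C multiplications per pixel plus C·C_out for the 1 × 1 point-wise step, compared with 9·C·C_out for a standard 3 × 3 convolution, which is where the parameter and computation savings come from.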

2.3. Residual Depth-Wise Separable Convolution

When the number of network layers increases, the gradient gradually vanishes during back propagation, making it impossible to effectively adjust the weights of earlier layers. The residual network structure alleviates this gradient vanishing problem. The residual depth-wise separable convolution structure used in this paper, inspired by the residual network structure, is shown in Figure 3.

The residual depth-wise separable convolution follows the residual structure: it applies two depth-wise separable convolutions and adds a skip connection. The input x is passed directly to the output as the initial result, and the training goal is for the residual branch to approach 0 so that the accuracy does not decrease as the network deepens. The related parameters of the residual depth-wise separable convolution in the model are shown in Table 2.
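Reusing the DepthwiseSeparableConv sketch above, a residual depth-wise separable convolution block might look as follows; the exact placement of activations inside the block is an assumption.

```python
import torch.nn as nn

class ResidualDSConv(nn.Module):
    """Residual depth-wise separable convolution: two depth-wise separable convolutions
    on the residual branch plus an identity skip connection, so the branch only needs
    to learn the residual with respect to its input."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            DepthwiseSeparableConv(channels, channels),
            DepthwiseSeparableConv(channels, channels),
        )

    def forward(self, x):
        return x + self.body(x)
```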

2.4. Two-Dimensional DWT and IWT

The Haar wavelet is a wavelet basis function that is easy to implement and operate. Therefore, this paper uses the two-dimensional Haar wavelet for the wavelet transform operations, dividing the image signal into directional sub-bands to obtain image information at different frequencies. Filtering is an effective way to realize the DWT. Let the one-dimensional low-pass filter be denoted f_L and the high-pass filter f_H. The detailed transformation process is as follows: first, f_L is used to filter and vertically downsample each column of the image; f_L and f_H are then used to filter and horizontally downsample each row, which yields the sub-band information x_LL and x_HL. Second, f_H is used to filter and vertically downsample each column of the image; f_L and f_H are then used to filter and horizontally downsample each row, which yields the sub-band information x_LH and x_HH. The four sub-bands are shown as follows:

The x and y in equations (1)–(4) denote the row and column coordinates of the image information. x_LL expresses the horizontal and vertical low-frequency information of the image, x_HL expresses the horizontal high-frequency and vertical low-frequency information, x_LH expresses the horizontal low-frequency and vertical high-frequency information, and x_HH expresses the horizontal and vertical high-frequency information. In contrast, the IWT performs the inverse operations on the four sub-images using the above filters, so x_LL, x_HL, x_LH, and x_HH are fused back into the original image. Consequently, as shown in Figure 4, the original image is decomposed by the DWT and then reconstructed by the IWT without loss of information. In the same way, a multilevel wavelet transform can be implemented by further processing x_LL, x_HL, x_LH, and x_HH according to the above method. For the two-dimensional Haar wavelet, the low-frequency information is taken as the mean of the sum of neighbouring pixels (the low-pass filter f_L), and the high-frequency information is taken as the mean of their difference (the high-pass filter f_H). We take a 4 × 4 two-dimensional matrix A (Figure 5) as an example.

A is divided into four sub-matrices: the sub-matrix composed of odd rows and odd columns is represented by x1, the sub-matrix composed of odd rows and even columns by x2, the sub-matrix composed of even rows and odd columns by x3, and the sub-matrix composed of even rows and even columns by x4. They are shown in equations (5)–(8), while equations (1)–(4) can then be expressed by equations (9)–(12), where x_LL, x_HL, x_LH, and x_HH are the results of the two-dimensional Haar wavelet transform.

The Haar inverse wavelet transform calculates x1, x2, x3, and x4 using equations (13)–(16) to reconstruct the matrix A.
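To make the sub-matrix formulation concrete, the following is a small PyTorch sketch of a single-level 2D Haar DWT and its inverse built from the sub-matrices x1–x4; since the exact scaling and sign convention of equations (9)–(16) is not reproduced here, a common orthonormal Haar convention is assumed.

```python
import torch

def haar_dwt2d(x):
    """Single-level 2D Haar DWT on a tensor of shape (..., H, W) with even H and W.
    The four sub-matrices are taken from the odd/even rows and columns of the input."""
    x1 = x[..., 0::2, 0::2]  # odd rows, odd columns
    x2 = x[..., 0::2, 1::2]  # odd rows, even columns
    x3 = x[..., 1::2, 0::2]  # even rows, odd columns
    x4 = x[..., 1::2, 1::2]  # even rows, even columns
    x_ll = (x1 + x2 + x3 + x4) / 2
    x_hl = (-x1 + x2 - x3 + x4) / 2
    x_lh = (-x1 - x2 + x3 + x4) / 2
    x_hh = (x1 - x2 - x3 + x4) / 2
    return x_ll, x_hl, x_lh, x_hh

def haar_iwt2d(x_ll, x_hl, x_lh, x_hh):
    """Inverse transform: recover x1-x4 and interleave them back to full resolution,
    so that haar_iwt2d(*haar_dwt2d(x)) reproduces x exactly (lossless)."""
    x1 = (x_ll - x_hl - x_lh + x_hh) / 2
    x2 = (x_ll + x_hl - x_lh - x_hh) / 2
    x3 = (x_ll - x_hl + x_lh - x_hh) / 2
    x4 = (x_ll + x_hl + x_lh + x_hh) / 2
    h, w = x_ll.shape[-2], x_ll.shape[-1]
    out = torch.zeros(*x_ll.shape[:-2], 2 * h, 2 * w, dtype=x_ll.dtype, device=x_ll.device)
    out[..., 0::2, 0::2] = x1
    out[..., 0::2, 1::2] = x2
    out[..., 1::2, 0::2] = x3
    out[..., 1::2, 1::2] = x4
    return out
```

When the four sub-bands are stacked along the channel dimension, each DWT halves the spatial resolution and quadruples the channel count, and each IWT does the reverse.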

2.5. DMRFC Module

In order to obtain deep semantic information from the image and improve image deblurring performance, this paper proposes the DMRFC module, which is shown in Figure 6.

The DMRFC module is composed of four multireceptive field channel blocks and a bottleneck layer. The semantic features of the image are extracted by the multireceptive field channel blocks, while the bottleneck layer reduces the number of feature inputs to improve the compactness and computational efficiency of the model. Dense connections are used to strengthen the transmission of image features and make more effective use of them. The DMRFC module is shown as follows, where the input of the i-th block is the concatenation of the feature maps generated by the multireceptive field channel blocks of layers 0, 1, …, i − 1; Hi denotes converting the multiple input tensors into a single tensor; the output size of the bottleneck layer is controlled by its hyper-parameter, and the filter size used in the bottleneck layer is 1 × 1.

The multireceptive field channel blocks used by the DMRFC module are shown in Figure 7; each block is a combination of an extensional receptive field block and a channel attention module. To increase the diversity of the extracted features, the extensional receptive field block uses four feature extraction branches with 3 × 3 convolution kernels and extensional (dilation) rates of 1, 3, 5, and 7. A connection operation fuses the parallel feature maps of the four branches, as shown in equation (18). The channel attention module from CBAM [32] is then used to learn the weight of each channel, so that the nonlinear features are adaptively weighted per channel; merging the average-pooled and max-pooled features improves the nonlinear representation ability of the network and its deblurring ability. The output of the multireceptive field channel block is shown in equation (19).

In equations (18) and (19), ∗ means the convolution operation applied to the input features by convolution layers whose superscript gives the extensional (dilation) rate and whose subscript gives the convolution kernel size; the Leaky ReLU activation function is applied to each branch, and the connection operation fuses the four branch outputs into the fused features. Avgpool means average pooling, the fully connected layer and the Sigmoid activation function produce the channel weights, and Out means the output of the multireceptive field channel block.
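As a concrete illustration, the following PyTorch sketch implements a multireceptive field channel block (four dilated 3 × 3 branches, concatenation, CBAM-style channel attention) and a densely connected DMRFC wrapper with a 1 × 1 bottleneck; the growth rate, reduction ratio, and the 1 × 1 fusion convolution are assumptions not specified in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """CBAM-style channel attention: average- and max-pooled descriptors share a
    bottleneck MLP; a Sigmoid turns their sum into per-channel weights."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))
        return x * torch.sigmoid(avg + mx)

class MRFCBlock(nn.Module):
    """Multireceptive field channel block: four parallel 3x3 branches with dilation
    rates 1, 3, 5, 7, concatenated, fused by a 1x1 convolution, then reweighted by
    channel attention."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 3, padding=rate, dilation=rate),
                nn.LeakyReLU(0.2, inplace=True),
            )
            for rate in (1, 3, 5, 7)
        ])
        self.fuse = nn.Conv2d(4 * out_channels, out_channels, kernel_size=1)
        self.attention = ChannelAttention(out_channels)

    def forward(self, x):
        feats = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.attention(self.fuse(feats))

class DMRFC(nn.Module):
    """Dense multireceptive field channel module: four MRFC blocks, each fed with the
    concatenation of the module input and all previous block outputs, followed by a
    1x1 bottleneck layer that restores the original channel count."""
    def __init__(self, channels, growth=32, num_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList()
        in_ch = channels
        for _ in range(num_blocks):
            self.blocks.append(MRFCBlock(in_ch, growth))
            in_ch += growth
        self.bottleneck = nn.Conv2d(in_ch, channels, kernel_size=1)

    def forward(self, x):
        feats = [x]
        for block in self.blocks:
            feats.append(block(torch.cat(feats, dim=1)))
        return self.bottleneck(torch.cat(feats, dim=1))
```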

2.6. Loss Function

The mean square error (MSE) measures the difference between predicted and actual values and is widely used as the loss function for model training. However, MSE does not consider the distribution characteristics of the image, which makes the restored image too smooth, whereas SSIM captures the image edge details. The calculation methods of MSE and SSIM are shown in equations (20) and (21). Thus, in order to obtain a clearer image, we fuse MSE and SSIM to design the loss function shown in equation (22), and train the network with this combined MSE and SSIM loss.

In equations (20) and (21), the two compared images are the restored image and the corresponding clear image, and the SSIM term measures their structural similarity. The trade-off parameter in equation (22) is set to 0.001.
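Since the exact form of equation (22) is not reproduced here, the following sketch assumes a combined loss of the form L = MSE + λ·(1 − SSIM) with λ = 0.001; any differentiable SSIM implementation (here the pytorch_msssim package) can be used.

```python
import torch.nn as nn
from pytorch_msssim import ssim  # any differentiable SSIM implementation works

class DeblurLoss(nn.Module):
    """Combined loss: MSE between the restored and clear images plus a weighted
    SSIM term, with the trade-off weight set to 0.001 as in the paper."""
    def __init__(self, weight=0.001):
        super().__init__()
        self.weight = weight
        self.mse = nn.MSELoss()

    def forward(self, restored, sharp):
        mse_term = self.mse(restored, sharp)
        ssim_term = 1.0 - ssim(restored, sharp, data_range=1.0)  # images assumed in [0, 1]
        return mse_term + self.weight * ssim_term
```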

Therefore, the main steps of our method are shown in Figure 8, in which Algorithm 2, shown in Figure 9, implements the forward propagation.

3. Results

3.1. Dataset

In this paper, the GOPRO dataset was used to train the model; it consists of 3214 clear and blurred image pairs covering 22 different scenes, of which 2103 image pairs were used as the training set and 1111 pairs as the test set. To enhance the generalization ability of the model, data augmentation was applied to the training set: images were randomly flipped left, right, up, and down, randomly rotated by 90, 180, or 270 degrees, and Gaussian noise with mean 0 and variance 0.0001 was added.
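A minimal sketch of such a paired augmentation step might look as follows; whether the noise is added to the blurred input only, and the exact flip probabilities, are assumptions.

```python
import random
import torch

def augment_pair(blur, sharp, noise_var=1e-4):
    """Joint augmentation for a blurred/sharp training pair (float tensors in [0, 1],
    shape C x H x W): random flips, random 90/180/270-degree rotation, Gaussian noise."""
    if random.random() < 0.5:                                   # horizontal flip
        blur, sharp = blur.flip(-1), sharp.flip(-1)
    if random.random() < 0.5:                                   # vertical flip
        blur, sharp = blur.flip(-2), sharp.flip(-2)
    k = random.choice([0, 1, 2, 3])                             # rotation by k * 90 degrees
    blur = torch.rot90(blur, k, dims=(-2, -1))
    sharp = torch.rot90(sharp, k, dims=(-2, -1))
    blur = (blur + noise_var ** 0.5 * torch.randn_like(blur)).clamp(0.0, 1.0)
    return blur, sharp
```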

3.2. Training Details

In order to avoid model overfitting, images from the training set were randomly cropped to a size of 256 × 256 pixels. The number of training epochs was set to 4000 and the initial learning rate was set to 1e−4, which was halved every 1000 epochs. The network was optimized with Adam [33], with parameters β1 = 0.9 and β2 = 0.999. The network model was implemented with the PyTorch framework on a GTX 2080 GPU.
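Under these settings, the optimizer and learning-rate schedule can be sketched as follows; model and train_loader stand for the improved U-Net and a GOPRO loader returning randomly cropped 256 × 256 pairs, and DeblurLoss is the combined loss sketched in Section 2.6.

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

optimizer = Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
scheduler = StepLR(optimizer, step_size=1000, gamma=0.5)   # halve the learning rate every 1000 epochs
criterion = DeblurLoss(weight=0.001)

for epoch in range(4000):
    for blur, sharp in train_loader:                       # 256 x 256 random crops
        optimizer.zero_grad()
        restored = model(blur)
        loss = criterion(restored, sharp)
        loss.backward()
        optimizer.step()
    scheduler.step()
```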

3.3. Quantitative Evaluation

We use PSNR and SSIM to quantitatively evaluate the model; the larger the values, the better the image quality. Their calculation expressions are given below, where MAX is the maximum pixel value of the image (255 here) and RMSE denotes the root mean square error between the restored image and the clear reference image. The SSIM expression uses the average values, variances, and covariance of the two compared images, together with two small constants that prevent division by zero.
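As a quick reference, PSNR can be computed as follows; SSIM can be computed analogously, for example with skimage.metrics.structural_similarity.

```python
import numpy as np

def psnr(restored, reference, max_val=255.0):
    """Peak signal-to-noise ratio in dB, computed as 20 * log10(MAX / RMSE) between
    the restored image and the clear reference image (arrays of identical shape)."""
    diff = restored.astype(np.float64) - reference.astype(np.float64)
    rmse = np.sqrt(np.mean(diff ** 2))
    if rmse == 0:
        return float("inf")
    return 20.0 * np.log10(max_val / rmse)
```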

Table 3 shows the average PSNR and average SSIM on the GOPRO test dataset, compared quantitatively with those of other methods. The comparison shows that our method outperforms the other methods on both PSNR and SSIM.

Table 4 compares the running time (labelled "Time") and the model parameter size (labelled "Size") of different algorithms on the GOPRO test dataset. The proposed model requires less time and fewer model parameters than the methods in [33, 39].

4. Discussion

4.1. Visual Effect

Figure 10 shows a visual comparison of results generated by our method and other methods on the GOPRO test dataset. The methods in [15, 37, 38] all have relatively limited receptive fields; although they can produce high-quality images, they fall short in restoring texture details. In [39], deblurring is realized by introducing a high-quality reference image into the deep network; this method has a significant deblurring effect, but high-quality reference images are not always easy to obtain, which prevents recovery of blurred images with spatial variation. The proposed model uses the wavelet transform to retain image details and the DMRFC module to learn nonlinear image features. Compared with the details and structures generated by other methods, our network maintains clear texture details, has a stronger deblurring effect, and produces higher-quality visual results.

4.2. Effectiveness of Each Module

To prove the effectiveness of the designed modules, we conduct three model experiments to verify the effect of the wavelet transform and the DMRFC module on the proposed model. Model 1 includes only depth-wise separable convolution and residual depth-wise separable convolution; upsampling and downsampling are controlled by the convolution stride, and five residual depth-wise separable blocks replace the DMRFC module. Model 2 is based on model 1 but uses the wavelet transform and IWT instead of strided downsampling and upsampling. Model 3 further introduces the DMRFC module on top of model 2 and is the proposed model. Their results are compared in Table 5.

As shown in Table 5, model 1 achieves a PSNR of 26.21 and an SSIM of 0.871. After introducing the wavelet transform in model 2, the PSNR rises to 28.17 and the SSIM rises to 0.891, which shows that the wavelet transform improves model performance because it obtains sub-frequency information in four directions and learns more image details. When the DMRFC module is introduced in model 3, the PSNR rises to 30.83 and the SSIM rises to 0.948, which shows that the DMRFC module helps improve restoration quality because it reduces information reuse and increases the image receptive field.

5. Conclusion

In this paper, an image deblurring method using an improved U-Net model is proposed, in which a two-dimensional discrete Haar wavelet transform is introduced and a DMRFC module is designed. The wavelet transform is used instead of downsampling in the model and the inverse wavelet transform is used instead of upsampling; through the wavelet transform, more image details are obtained and the computational complexity is reduced. The DMRFC module connects the multireceptive field channel blocks using dense connections, which reduces the number of multireceptive field channel block parameters and strengthens feature transmission. Our method significantly reduces the number of model parameters and the running time needed to restore a clear image, and achieves good visual results when deblurring images. The average PSNR and average SSIM on the GOPRO dataset are 30.83 and 0.948, respectively, which are higher than those of other image deblurring methods. Therefore, our proposed method achieves a better deblurring effect and is more lightweight.

In future work, we will further reduce the model parameters and calculations and further improve the model's ability to restore blurred images. This will make the model more lightweight and allow exploration of its application on resource-constrained and widely used mobile devices.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publishing of this paper.

Acknowledgments

This work was supported by the Heilongjiang Education Department Basic Scientific Research Business Research Innovation Platform “Scientific Research Project Funding of Qiqihar University” (Grant: 135409421).