Abstract

Images blurred by rainy weather can negatively influence the performance of outdoor vision systems. Therefore, it is necessary to remove rain streaks from single images. In this work, a multiscale generative adversarial network- (GAN-) based model, called DR-Net, is presented for single image deraining. The proposed architecture includes two subnetworks, i.e., a generator subnetwork and a discriminator subnetwork. We introduce a multiscale generator subnetwork which contains two convolution branches with different kernel sizes, where the smaller one captures local rain drop information and the larger one focuses on spatial information. The discriminator subnetwork acts as a supervision signal that pushes the generator subnetwork to produce higher-quality derained images. It is demonstrated that the proposed method yields relatively higher performance than other state-of-the-art deraining models in terms of derained image quality and computing efficiency.

1. Introduction

In recent years, with the development of network communication, intelligent monitoring systems based on image and video processing technology have achieved promising progress. Such systems play a vital role in the maintenance of public security. Therefore, several computer vision issues related to intelligent monitoring systems have attracted widespread attention. Most computer vision algorithms proposed for these issues work well on high-visibility video or image data. However, when faced with degraded data, their performance may degrade noticeably, because these algorithms are trained on high-visibility video or image datasets. Bad weather such as rain seriously degrades the visual quality of captured videos or images, which may affect the performance of many computer vision algorithms for tracking, recognition, and retrieval [1]. Such circumstances may arise when safety-related incidents are recorded by mobile phones or monitoring cameras on rainy days. In these cases, the captured video or image data may contain a large number of fast-moving rain streaks, which distort the image signal and reduce the signal-to-noise ratio and image quality. These impacts of rainy weather on video or image data bring great difficulties to intelligent traffic, outdoor monitoring, military reconnaissance, and so on [2, 3]. To enhance the reliability of outdoor computer vision systems, effective algorithms are needed to remove rain streaks from single images degraded by rainy weather.

Mathematically, the deraining procedure can be written as

$$I = O + R,$$

where $I$ represents the rainy image and $O$ and $R$ represent the rain streaks and the restored clean image, respectively.
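As a minimal illustration of this additive formulation (our own sketch, not code from the paper), the following Python snippet composes a synthetic rainy image from a clean background and a rain streak layer; deraining is the inverse problem of recovering R from I:

```python
import numpy as np

def compose_rainy(streaks: np.ndarray, clean: np.ndarray) -> np.ndarray:
    """Additive rain model I = O + R: the rainy image I is the sum of the
    rain streak layer O and the clean background R (float arrays in [0, 1])."""
    return np.clip(streaks + clean, 0.0, 1.0)

# Deraining estimates R (and implicitly O) given only the observation I.
```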

In the past few decades, some algorithms have been proposed to jointly address the rain detection and removal task. According to their focus, these methods can be roughly classified into two categories, i.e., video-based methods and single-image-based methods. Video-based deraining methods concentrate on eliminating rain streaks from video sequences [4, 5] by exploiting the frequency properties and temporal information of rain streaks. Single-image-based methods consider the problem from two aspects:

A task of blind single image decomposition: these methods mainly include morphological component analysis with sparse coding [2, 3], the generalized low-rank model [6], structural similarity constraints [7], and nonlocal means smoothing [8].

A task of learning an end-to-end projection between the rainy image and its corresponding derained clean image: recently, owing to their superior ability to learn nonlinear functions, some deep learning-based methods have been proposed to address this issue, in which an end-to-end projection between the rainy image and its corresponding ground truth is directly learned. These methods include convolution neural network- (CNN-) based models [9, 10] and generative adversarial network- (GAN-) based models [11, 12].

Although existing methods have achieved some degree of success, several limitations remain. First, since the basic operations of many existing approaches are performed on a small receptive field or a local image patch, the spatial contextual information between patches or receptive fields is usually ignored. Second, because background texture patterns and rain streaks intrinsically overlap, most approaches also remove texture details in rain-free regions, leaving oversmoothed regions in the restored images. Third, some models introduce additional image enhancement techniques to improve the visual effect, which reduces the efficiency of the algorithm.

To alleviate the limitations mentioned above, our goal is to design a novel architecture that removes rain and keeps the details of the restored clean image jointly. More specifically, we introduce a multiscale GAN-based architecture, called the single image deraining generative adversarial network (DR-Net), to handle the single image deraining issue. The architecture includes two subnetworks, i.e., a generator network and a discriminator network. The generator network acts as a feature extractor that eliminates rain streaks while encoding the image contents. In other words, it learns a nonlinear projection function that transfers a rainy image into a restored clean image while keeping the details of the raw image. To capture more spatial information and local rain drops, we propose a multiscale parallel convolution generator network consisting of two convolution branches with different kernel sizes. The discriminator network takes the restored image generated by the generator and the ground truth image as inputs and aims to differentiate the restored clean image from the real ground truth. Its function is to push the generator to produce higher-quality derained images that closely resemble the ground truth.

To sum up, the contributions of this work are as follows:

We design a novel generative adversarial architecture to address the single image deraining issue. The generator consists of two parallel convolution subnetworks with different kernel sizes: one subnetwork with a large kernel size captures more spatial information of the raw image, and the other, with a relatively small kernel size, acquires more local rain streak knowledge. The multiscale operators help keep the details of the raw image while eliminating rain streaks. Additionally, with fewer feature maps, our network has fewer parameters and requires less computation; thus, its training convergence and test speed are relatively fast among the compared methods.

Experiments on a publicly available synthesized dataset and on real images show the effectiveness of the proposed network. The proposed model performs better than other recent state-of-the-art single image deraining techniques.

The remainder of this paper is organized as follows. A brief review of existing methods for image deraining is given in Section 2. Section 3 provides the details of the proposed DR-Net architecture. Section 4 presents the experimental results on both synthetic and real images. Finally, Section 5 concludes the paper with a brief discussion.

2. Related Work

In the past few years, a number of models have been presented to enhance the visibility of images captured in rain. These models can be divided into two categories: video sequences-based methods and single-image-based methods. In this section, we give a brief review of these image deraining models.

(A) Video Sequences-Based Methods. Video sequences-based rainy image recovery has been extensively studied. Garg et al. [1] proposed a deraining model for rain streak detection and removal from video sequences. Rain streak detection relies on two constraints [13]: first, since rain streaks are dynamic, the intensity changes they cause across several frames are comparatively high; second, because other objects may also be dynamic, rain streaks can be distinguished by examining whether the relation between the intensity changes along a streak and the background intensity is photometrically linear. The second constraint reduces the false alarms caused by the first. After detecting the rain streaks, the average intensity of the pixels taken from the previous and subsequent frames is used to remove them. Soon afterwards, the authors developed a postprocessing architecture for video sequences-based deraining [14]. Specifically, they first proposed a photometric model that describes the intensities produced by individual rain streaks. Then, a dynamic model that captures the spatiotemporal attributes of rain streaks was presented. Finally, they used these models together to describe the visual appearance of rain. Zhang et al. [15] introduced another constraint, named the chromaticity constraint, pointing out that the intensity changes in the R, G, and B channels are alike for rain streaks. Based on the size information and photometric properties of rain streaks, Bossu et al. [16] proposed a rain detection algorithm that fits a Gaussian distribution to rain streak histograms. They adopted a Gaussian mixture model to separate the foreground, used to detect the rain streaks, from the background in video sequences.

(B) Single-Image-Based Methods. Since there is no temporal information for rain streak detection and removal, single-image rain removal (SIRR) is more challenging than the video sequences-based removal problem. Some researchers regard the SIRR problem as a task of layer separation. Kang et al. [3] used a bilateral filter to decompose the rainy image into high-frequency and low-frequency parts and then utilized sparse coding and dictionary learning to separate the rain component from the high-frequency part. Kim et al. [8] first detected the rain streak regions by analyzing the aspect ratio of the elliptical kernel and the orientation angle at each pixel location and then applied an adaptive nonlocal means filter to these regions to remove the rain streaks. Luo et al. [17] proposed a nonlinear screen blend model for rainy images; by learning a dictionary with mutual exclusivity, they used sparse coding to separate the rain layer from the derained layer. Since rain streaks in an imaging scene usually appear recursively with similar patterns, Chen et al. [6] proposed a generalized low-rank rain appearance model to capture the spatiotemporal relationship among rain streaks. Li et al. [13] proposed a model that uses patch-based priors based on Gaussian mixture models for the rain and background layers; these priors can accommodate multiple scales and orientations of rain streaks to a certain degree.

Recently, deep learning-based methods have achieved outstanding performance in many domains [18-21], including image deblurring [22], image denoising [23], superresolution [24], style transfer [25, 26], and inpainting [27]. Some works also adopt deep learning to address the SIRR issue. These deep learning-based methods aim to learn a nonlinear mapping between the rainy image and the corresponding derained image and can be mainly divided into CNN-based and GAN-based models. Fu et al. [9] designed a convolution neural network named DerainNet for removing rain streaks from a single image. They first decomposed the input rainy image into a detail layer and a base layer, in which the base layer keeps the structure and the detail layer contains object details and rain streaks. They then used the detail layer as the input of the deep architecture to detect and remove rain streaks and finally added the output of the deep architecture back to the base layer to obtain the final output. Yang et al. [28] proposed a multitask deep convolution neural network that simultaneously learns the appearance of rain streaks, the binary rain streak map, and the background. Besides, they developed a recurrent rain detection and removal network to clear up rain accumulation and remove rain streaks iteratively. As a popular deep learning technology, GAN [29, 30] has been adopted in many computer vision tasks. Most recently, Zhang et al. [11] proposed a conditional generative adversarial model for SIRR, with a new refined loss function that combines perceptual loss, Euclidean loss, and adversarial loss. In this paper, we also use a GAN-based method to address the SIRR problem; however, the architecture of the proposed generator subnetwork differs from the models mentioned above. More specifically, we design a generator with multiscale convolution operators that can simultaneously focus on the local rain drops and the spatial information of the rainy image.

3. Proposed Method

Our purpose is to learn a nonlinear projection between the input rainy image and the output derained image by constructing a GAN-based deep architecture. The proposed network consists of two subnetworks, i.e., a generator network and a discriminator network. The primary target of the generator subnetwork is to remove rain streaks without losing any details of the rainy image. The discriminator subnetwork acts as a supervisory signal to boost the quality of the derained images generated by the generator subnetwork. In this section, we discuss the architecture in detail.

3.1. Generative Adversarial Loss

The generator subnetwork is trained to produce high-quality derained images that can fool the discriminator subnetwork, while a well-trained discriminator should be able to judge whether a derained image looks real. Given a rainy image I, the optimization function of the GAN can be formulated as

$$\min_{G}\max_{D}\;\mathbb{E}_{R\sim p_{\mathrm{clean}}(R)}\left[\log D(R)\right]+\mathbb{E}_{I\sim p_{\mathrm{rainy}}(I)}\left[\log\left(1-D\left(G(I)\right)\right)\right],$$

where O = G(I) is the generated output image, D represents the discriminative subnetwork, and G is the generative subnetwork.

3.2. Generator Network

Architecture. As mentioned above, the generator subnetwork aims to learn a mapping function from a rainy image to a derained image. The proposed generator framework is shown at the top of Figure 1. Specifically, we first apply a convolution operator to the raw image with a kernel size of 7×7 and 64 feature maps. After the first convolution layer, we introduce two parallel convolution branches. The kernel size of one convolution branch is set to 3×3, and the other is set to 5×5. The number of feature maps of each convolution layer in both branches is set to 64. The generator subnetwork with multiscale branches has two advantages. First, the small kernel size can capture more local rain information, while the larger kernel size acquires more spatial information. Second, the multiscale convolution kernels give the generator a variety of filters, making the learning of weights and biases more diverse, so the useful information of the image can be fully and effectively extracted. Both settings help to generate higher-quality derained images. After five convolution layers in each branch, the outputs of the last convolution layers are fused by a simple addition operator. In addition, three skip connections between the front convolution layers and the later convolution layers are introduced. As noted above, our network contains several convolution operations, which may seriously damage the details of the raw image. However, the feature maps generated by the front convolution layers contain many image details, and integrating these feature maps into the later convolution layers helps the generator retain those details. Moreover, similar to deep residual networks [31], the skip connections are conducive to backpropagating the gradient to the bottom layers, which makes the training phase more stable. The fused branches are then passed through two convolution layers, both with 32 feature maps and a 3×3 kernel size. Finally, the output layer is stacked after these convolution operators. To maintain the size of the raw image through the convolution operators, we set the padding to 3 pixels for the 7×7 conv layer, 1 pixel for the 3×3 conv layers, and 2 pixels for the 5×5 conv layers. The detailed parameters of the generative subnetwork layers are shown in Table 1.
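The following PyTorch sketch illustrates this two-branch layout under stated assumptions: the exact skip-connection placement and activation functions follow Table 1 (not reproduced here), so the PReLU activations and the single residual connection below are our simplifications, not the authors' exact design.

```python
import torch.nn as nn

class Generator(nn.Module):
    """Sketch of the multiscale generator: a 7x7 stem with 64 maps, two
    parallel branches (3x3 and 5x5 kernels, 64 maps, five layers each)
    fused by addition, then two 3x3 layers with 32 maps and an output layer."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(3, 64, 7, padding=3), nn.PReLU())
        self.branch3 = nn.Sequential(
            *[nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.PReLU())
              for _ in range(5)])
        self.branch5 = nn.Sequential(
            *[nn.Sequential(nn.Conv2d(64, 64, 5, padding=2), nn.PReLU())
              for _ in range(5)])
        self.tail = nn.Sequential(
            nn.Conv2d(64, 32, 3, padding=1), nn.PReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.PReLU(),
            nn.Conv2d(32, 3, 3, padding=1))

    def forward(self, x):
        f = self.stem(x)
        fused = self.branch3(f) + self.branch5(f)  # multiscale fusion by addition
        fused = fused + f  # one skip connection (simplified placement)
        return self.tail(fused)
```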

Generative Loss Function. Since the generator subnetwork aims to generate a derained image as close as possible to the ground truth, we adopt the Euclidean loss to supervise the generator. Given a rainy image I, the loss function can be defined as

$$L_{E}=\frac{1}{CMN}\left\|G(I)-R\right\|_{2}^{2},$$

where C, M, and N represent the channel, width, and height of the images, respectively, and R is the corresponding ground truth of the input image I.

Aside from the Euclidean loss, we also introduce the perceptual loss [32], which measures the global difference between the features of the ground truth and those of the generator output at a certain layer. The perceptual loss helps to improve the visual quality of the generator output. It can be written as

$$L_{P}=\frac{1}{SWH}\left\|V\left(G(I)\right)-V(R)\right\|_{2}^{2},$$

where S, W, and H represent the channel, width, and height of the output of a certain convolution layer, respectively, and V denotes the feature maps extracted by the VGG-16 model [33]. Following the work of [34], we use the VGG-16 model to compute the feature loss at the chosen layer.

Based on the two formulations above, we define the refined generative loss function as

$$L_{G}=L_{E}+\lambda L_{P},$$

in which λ is the predefined weight for the perceptual loss.
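A hedged PyTorch sketch of this refined generative loss is given below. The paper does not specify here which VGG-16 layer the perceptual term uses; following common practice after Johnson et al. [32], the sketch takes the relu2_2 features (features[:9] in torchvision), which is our assumption, as is the default λ value:

```python
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights

class GeneratorLoss(nn.Module):
    """L_G = L_E + lambda * L_P (Euclidean plus weighted perceptual loss)."""
    def __init__(self, lam: float = 1.0):  # lambda value is an assumption
        super().__init__()
        self.lam = lam
        # Frozen VGG-16 feature extractor; the cutoff layer is an assumption,
        # and inputs are assumed already normalized for VGG.
        self.vgg = vgg16(weights=VGG16_Weights.DEFAULT).features[:9].eval()
        for p in self.vgg.parameters():
            p.requires_grad = False
        self.mse = nn.MSELoss()  # mean over C*M*N elements, matching L_E

    def forward(self, derained, ground_truth):
        l_e = self.mse(derained, ground_truth)                      # L_E
        l_p = self.mse(self.vgg(derained), self.vgg(ground_truth))  # L_P
        return l_e + self.lam * l_p
```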

3.3. Discriminator Architecture

Architecture. In our GAN-based network, the discriminator architecture is designed to push the derained images synthesized by the generator subnetwork closer to the ground truth. It takes mixed derained images and ground truths as inputs and classifies whether each input is fake or real. Following the work of [11, 33], a convolution operator with PReLU activation and batch normalization is used as the basic unit throughout the discriminator subnetwork. The subnetwork contains five convolution layers, whose numbers of feature maps are set to 24, 48, 96, 192, and 384, respectively. The kernel size of all five convolution layers is set to 3×3. After the convolution layers, a sigmoid function is attached at the output layer to produce a probability that the input image is real. The proposed discriminator subnetwork is shown at the bottom of Figure 1, and the detailed parameters of its layers are shown in Table 2.
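A PyTorch sketch of this discriminator follows. The strides and the final reduction to a single probability are our assumptions (Table 2 in the paper gives the exact settings); only the kernel size, feature-map counts, BN + PReLU units, and sigmoid output are taken from the text:

```python
import torch.nn as nn

class Discriminator(nn.Module):
    """Sketch: five 3x3 conv layers with 24/48/96/192/384 feature maps,
    each with batch normalization and PReLU, then a sigmoid output."""
    def __init__(self):
        super().__init__()
        layers, in_ch = [], 3
        for out_ch in (24, 48, 96, 192, 384):
            layers += [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                       nn.BatchNorm2d(out_ch), nn.PReLU()]
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        # Global pooling + linear head is an assumed reduction to one score.
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(384, 1), nn.Sigmoid())

    def forward(self, x):
        return self.head(self.features(x))  # probability that x is real
```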

Discriminator Loss Function. Since the goal of the discriminator subnetwork is to differentiate the synthesized derained image from its corresponding ground truth, we regard it as a binary classification network. Given a mixed set of N images, the discriminator loss function can be expressed as

$$L_{D}=-\frac{1}{N}\sum_{i=1}^{N}\left[T_{i}\log D\left(I_{i}\right)+\left(1-T_{i}\right)\log\left(1-D\left(I_{i}\right)\right)\right],$$

where T_i is the label of input image I_i: T_i = 1 indicates that I_i is a real ground truth, while T_i = 0 indicates that it is a fake.
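Putting the pieces together, one adversarial update could look like the sketch below; the unit weight on the adversarial term relative to L_G is an assumption, as the paper does not give this coefficient:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()  # binary cross entropy, matching L_D above

def train_step(gen, disc, opt_g, opt_d, rainy, clean, content_loss):
    """One adversarial update. `content_loss` is the GeneratorLoss sketched
    above; `gen`/`disc` are the generator and discriminator modules."""
    # Discriminator step: real ground truths labeled 1, derained outputs 0.
    fake = gen(rainy).detach()
    d_real, d_fake = disc(clean), disc(fake)
    loss_d = bce(d_real, torch.ones_like(d_real)) + \
             bce(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: content loss L_G plus the term that fools the discriminator.
    derained = gen(rainy)
    d_out = disc(derained)
    loss_g = content_loss(derained, clean) + bce(d_out, torch.ones_like(d_out))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```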

4. Experiment and Results

In this section, we first introduce the dataset and evaluation protocols used in this work. Then, the details of the experiments conducted to evaluate the proposed DR-Net model are presented. Finally, the comparison results are discussed.

4.1. Dataset and Evaluation Protocols

Synthetic Dataset. We use the synthesized dataset created by [11] as the training and testing data. The training set contains 700 paired images, of which 200 are selected from the BSD training set [35] and the remaining 500 are chosen from the UCID dataset [36]. The test set contains 100 paired images in total, of which 50 are selected from the BSD dataset and the other 50 from the UCID dataset. Besides, we also use the test set created by [9] to evaluate the proposed model.

Real-World Rainy Images. To validate the effectiveness of the proposed DR-Net, we also test it on real-world rainy images. Specifically, we use the real-world dataset created by [11] and some traffic images downloaded from the Internet to evaluate the performance of our model.

Evaluation Protocols. We adopt the structural similarity index (SSIM) [37] and the visual information fidelity (VIF) [38] to evaluate the performance of our model and the compared state-of-the-art methods. The higher the SSIM value, the closer the derained image is to the ground truth; for a clean image, the SSIM value is 1. Similarly, a higher VIF indicates a higher quality of the derained result.
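For reproducibility, SSIM can be computed with scikit-image as in the sketch below (our example; the paper does not state which implementation it used, and VIF has no scikit-image implementation, so only SSIM is shown):

```python
import numpy as np
from skimage.metrics import structural_similarity

def ssim_score(derained: np.ndarray, ground_truth: np.ndarray) -> float:
    """SSIM between a derained image and its ground truth (uint8, HxWx3).
    Returns 1.0 for identical images; higher means closer to ground truth."""
    return structural_similarity(derained, ground_truth, channel_axis=-1)
```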

4.2. Training Setting

In this study, we use the torch framework [39] to implement our network. The batch size is set to 9. We use the Adaptive Moment Estimation (Adam) algorithm with a learning rate of 0.0002 to optimize the network. All training images are resized to 480×480 pixels. The training process converges in roughly 4-5 hours on an NVIDIA GTX Titan Xp GPU.
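The stated hyperparameters translate into the following PyTorch-style setup; `RainDataset` is a hypothetical paired-image dataset class used only for illustration:

```python
import torch
from torch.utils.data import DataLoader

# `gen` and `disc` are the generator/discriminator sketched in Section 3;
# `RainDataset` is a hypothetical dataset yielding (rainy, clean) pairs
# resized to 480x480, as described in the text.
opt_g = torch.optim.Adam(gen.parameters(), lr=2e-4)  # Adam, lr = 0.0002
opt_d = torch.optim.Adam(disc.parameters(), lr=2e-4)
loader = DataLoader(RainDataset(size=(480, 480)), batch_size=9, shuffle=True)
```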

4.3. Comparison with Baseline Networks

In this section, we compare the performance of the proposed DR-Net with the following four baseline architectures:

(i) 3×3 Single: a single-scale network trained using only the 3×3 convolution branch.
(ii) 5×5 Single: a single-scale network trained using only the 5×5 convolution branch.
(iii) GEN: only the generator subnetwork is used, which is equivalent to a traditional CNN architecture.
(iv) DGAN: the depth and the number of kernels of the network are increased.

We train these four networks as well as DR-Net on the synthetic training dataset. Table 3 shows the SSIM results of DR-Net compared with the four baseline architectures on synthesized test images. From Table 3, we can observe that the proposed DR-Net achieves the highest SSIM values among the five configurations. Compared with DR-Net, the generator with a single-scale network (i.e., only the 3×3 or 5×5 convolution branch) achieves lower SSIM values. When we discard the discriminator subnetwork, the performance of the architecture also decreases. The DGAN achieves SSIM values comparable to DR-Net; however, increasing the depth and the number of kernels leads to more running time in both the training and testing phases. Sample results of the proposed method compared with the four networks on synthesized test images are shown in Figure 2. It can be seen that the four baseline networks improve the quality of the rainy image but exhibit obvious chromatic aberration and blurred backgrounds, whereas the proposed DR-Net achieves the best visual effect in terms of derained image quality.

4.4. Comparison with State-of-the-Art Methods

We compare the proposed method with the following representative single-image deraining methods:

(i) DSC [17]: a discriminative sparse coding-based method.
(ii) GMM-LP [13]: a Gaussian mixture model layer prior-based method.
(iii) DerainNet [9]: a convolution neural network-based method.
(iv) ID-CGAN [11]: a conditional generative adversarial network-based method.

Results on Synthesized Data. In this set of experiments, we compare the proposed model with the four methods above on the newly synthesized image data. Since the ground truths of these test images are known, the structural similarity index (SSIM) and the visual information fidelity (VIF) can be calculated as quantitative measures. From Tables 4 and 5, we can observe that the proposed model achieves the highest SSIM and VIF values.

Visual comparisons for five synthesized images with rain of different intensities and orientations are shown in Figure 3. As can be seen, DSC can remove part of the rain streaks and reduce their density, but it cannot remove them completely. The same holds for the GMM-LP algorithm. Among the four compared algorithms, DerainNet and ID-CGAN produce better visual results. However, compared with these two models, the derained results of the proposed model are better, and our SSIM and VIF values for the five derained images are higher than theirs. Both experiments demonstrate the effectiveness of the proposed method.

Results on Real-World Data. We also evaluate the proposed model on several real-world rainy images. Figure 4 presents the testing results on four real-world rainy images. From Figure 4, we can observe that both GMM-LP and DSC fail to remove rain streaks completely. The deraining effects of DerainNet and ID-CGAN are on a par with the proposed model from a visual perspective; however, our derained results maintain more details of the raw input images. For a closer comparison, we show one region of interest for each derained result of the five algorithms. From these regions of interest, we can see that, compared with the other four algorithms, our model provides the best visual performance in jointly removing rain streaks and retaining details, which further verifies the validity of the proposed method.

User Study Comparisons. Since there is no ground truth for real-world data, we conducted a user study to obtain real feedback and quantify the subjective evaluation of the proposed model. We selected 10 real-world rainy images from the real-world dataset created by [11] and some rainy traffic images downloaded from the Internet. Figure 4 shows some derained results. For the user study, we first used all methods to generate derained images and randomly ordered the results. We then displayed the ordered results on the screen and asked 10 participants with computer vision expertise to rank them from 1 to 5, with 1 being the worst and 5 being the best. Table 6 shows the average subjective scores of the five methods on the real-world rainy images. Our model achieves the highest average score among the five methods, which indicates that the proposed method generates better deraining results on real-world rainy images from the subjective perspective.

Running Time and Parameter Number Comparisons. To estimate the efficiency of the proposed method, we measured the running time of the compared state-of-the-art methods as well as the proposed method. All evaluations are performed on 480×480 rainy images. DSC and GMM-LP are non-deep-learning methods and run on the CPU using the code provided by their authors; DerainNet, ID-CGAN, and our method run on the GPU. Table 7 presents the comparison results. Our multiscale network requires only 1.4 seconds to process a 480×480 rainy image, on a par with existing single-scale deep learning methods. Table 8 compares the parameter numbers of DerainNet, ID-CGAN, and the proposed network. Although our generator has two branches, it has fewer feature maps; the parameter number of our multiscale rain-removal network is slightly larger than that of ID-CGAN but smaller than that of DerainNet.

5. Conclusion

In this study, we have proposed a generative adversarial network-based architecture for single-image rain removal. The presented architecture consists of two subnetworks, i.e., a generation subnetwork and a discrimination subnetwork. The generation subnetwork, with multiscale convolution operators, can capture the local rain drops and the spatial information of rainy images simultaneously. To preserve more details of the raw background of rainy images and to improve the stability of the training process, three skip connections between the front convolution layers and the later convolution layers are introduced. Acting as a supervisory signal, the discrimination subnetwork helps to improve the quality of the derained images generated by the generation model. Experiments on synthetic and real-world images show that the proposed architecture outperforms other state-of-the-art methods. In the future, we will consider how to use superresolution and attention mechanisms to further improve the deraining ability of the network.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China [Grant no. 61673222], the Jiangsu Universities Natural Science Research Project [Grant no. 13KJA510001], and the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement [no. 701697].