Abstract

In foggy environments, images collected outdoors are prone to problems such as low contrast and loss of detail. To solve this problem, this paper proposes an algorithm based on a multiscale parallel-depth separable convolutional neural network (MSP-DSCNN) to remove fog from foggy images and improve image quality. The multiscale feature extraction module extracts texture details from foggy images at different scales, and parallel deep and shallow channels extract high-dimensional and low-dimensional features, respectively. To further optimize the network model, a split convolution method is proposed that divides the feature maps into two categories: primary features and secondary features. Key information is extracted from the primary features with high complexity, and compensation information is extracted from the secondary features with low complexity. Experiments show that, compared with other algorithms, the model constructed in this paper has obvious advantages: the defogging effect is strong, the color of the restored images is natural, details are well preserved, and the objective indicators lead. It effectively solves the problems of incomplete haze removal, color shift, and poor detail preservation in current images.

1. Introduction

Intelligent equipment such as video surveillance plays an important role in computer vision systems. Because of fog and haze, the quality of images collected by imaging equipment is seriously degraded and target objects cannot be monitored well, which brings great inconvenience to the vision system [1–3]. Therefore, image defogging has very important practical significance and research value.

The visibility of a foggy image decreases evidently because of the fog. There are two main causes of foggy images. First, the light reflected by the object is absorbed and scattered by suspended particles in the atmosphere, attenuating the reflected light energy and reducing image brightness and contrast. Second, much of the ambient light, such as sunlight, is scattered by particles in the atmosphere to form a bright background [4]. In various applications based on computer vision, foggy input images may lead to degraded system performance or even errors [5–8].

In recent years, image defogging has attracted extensive attention. Many single-image defogging methods have been developed based on the atmospheric scattering model [9]. The key idea shared by most of these methods is to adopt various image priors, such as the dark channel prior [10] and the color attenuation prior [11]. Literature [12] estimated the local atmospheric light components of the image and then used a multiscale fusion method to restore the fog-free image. Literature [13] estimated the transmission map with a per-pixel alpha blending method by estimating the light-source and non-light-source regions separately and then blending them effectively, since objects at different distances from the camera in a foggy image have different transmission coefficients. Based on the assumption that the colors of a fog-free image form tight clusters, literature [14] observed that each cluster becomes a line in RGB space in foggy images, and the algorithm can use these lines to recover depth maps and fog-free images. These methods are not ideal when dealing with foggy images under complex imaging conditions.

To overcome the shortcomings of traditional algorithms, deep-learning-based defogging methods have developed rapidly and become the mainstream direction of image defogging. These methods use a convolutional neural network (CNN) to learn the required features by constructing an end-to-end network architecture.

Literature [15] built the DehazeNet defogging network based on a deep CNN to estimate the transmittance and recovered fog-free images with the help of the atmospheric scattering model, achieving good results, especially in sky areas. However, this network is constructed with the priors and assumptions of traditional methods, which has certain limitations. The end-to-end gated fusion defogging network GFN proposed in literature [16] realizes defogging through a series of operations such as white balance, contrast enhancement, and gamma correction, but the network implementation process is quite complicated. The AOD-Net proposed in literature [17] integrates the transmittance and atmospheric light into a unified parameter K and trains through a reformulated atmospheric scattering model, changing the previous practice of estimating atmospheric light and transmittance separately. However, due to the small number of network layers and limited accuracy, a good defogging effect cannot always be achieved. Literature [18] proposed FFA-Net, a feature attention network combining channel attention and pixel attention mechanisms, since different channel features carry completely different weighted information and the fog concentration is distributed differently over image pixels. However, this network only considers the difference between foggy and fog-free images and does not start from the overall style of the image; although it works well on synthetic images, serious detail loss and incomplete fog removal remain in real scenes. Literature [19] found, through statistics over the color channels of many foggy images, that fog is mainly concentrated in the luminance channel of the YCbCr color space. It therefore proposed the atmospheric illumination prior and built the AIPNet defogging network. Image quality is improved to a certain extent, but feature extraction is still incomplete and residual fog remains in some restored images.

Based on the above traditional methods and the shortcomings of existing networks, this paper proposes an image defogging model based on a multiscale parallel-depth separable convolutional neural network (MSP-DSCNN). This model draws on the idea of multiscale information extraction [20] in target detection [21]. In the network training stage, convolution kernels of different sizes are applied to the input image, and the extracted features are fused. The fused feature maps are then fed into a deep channel and a shallow channel, respectively, for learning. To further optimize the network model, a split convolution operation is proposed, which divides the feature maps into two categories, namely, primary features and secondary features, simplifying the defogging process and highlighting the importance of features.

In the study of foggy image removal, the main innovations of this paper are as follows:
(1) A multiscale feature extraction module is used to extract image feature information at different scales.
(2) Parallel deep and shallow channels are proposed to extract high- and low-dimensional features from foggy images, respectively.
(3) A split convolution operation is proposed to further reduce the number of network parameters, accelerate model inference, and improve the model's defogging accuracy.

This paper consists of the following four main parts:
(1) The first part is the introduction.
(2) The second part is the image defogging model.
(3) The third part is the experiment and analysis.
(4) The fourth part is the conclusion.
In addition, the paper includes the abstract and references.

2. Image Defogging Model

2.1. Problem Description

Assume that the foggy image is $y$, the fog-free image is $x$, and the fog noise image contained in the foggy image is $n$. Equation (1) can be obtained as follows:

$$y = x + n. \quad (1)$$

In a traditional convolutional neural network, the foggy image $y$ is taken as the network input, and the mapping model from $y$ to the clear image $x$ in Equation (2) is learned directly by training the model:

$$x = F(y). \quad (2)$$

However, it is very difficult to train the model by learning the network mapping in such a direct way, and the defogging precision is not high. Therefore, this paper uses a residual learning strategy to reduce the difficulty of network learning: the outputs and inputs of the network form one large residual unit.

The actual mapping that the network learns is shown in Equation (3); that is, the network learns the fog noise distribution in the image:

$$R(y) = y - x \approx n. \quad (3)$$

The residual learning method transforms the directly learned mapping from foggy to defogged images into learning the fog noise distribution in the foggy image, which reduces the difficulty of network learning and improves the model's defogging accuracy.
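As a minimal PyTorch sketch of this strategy (assuming the additive model in Equation (1); the `backbone` argument is a hypothetical stand-in for the noise-predicting network described in the following sections):

```python
import torch
import torch.nn as nn

class ResidualDefogger(nn.Module):
    """Wraps a noise-predicting backbone in one large residual unit."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone  # learns R(y), the fog noise distribution

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        noise = self.backbone(y)  # R(y) ~ n (Equation (3))
        return y - noise          # restored image x ~ y - n (Equation (1))
```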

2.2. Multiscale Feature Extraction Module

In a convolutional neural network, the receptive field is the region of the input image that each unit of a layer's output maps to. The larger the receptive field, the larger the area of the input image that is attended to. In the image input stage, three convolution kernels of different scales (1 × 1, 3 × 3, and 5 × 5) are used for feature extraction. A large convolution kernel has a larger receptive field and can extract features over a wide range, while a small convolution kernel extracts fine detail features over a small range.

The parallel-superposition feature fusion method [22] is used to fuse all feature maps together as the input of the subsequent channels; see Figure 1. The ReLU layer is the activation function layer, whose purpose is to improve the nonlinearity of the model, as shown in Equation (4):

$$f(x) = \max(0, x). \quad (4)$$

In actual model training, a 5 × 5 dilated convolution kernel with a dilation rate of 2 is used, which expands the receptive field while keeping the number of model parameters unchanged.
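The following is a minimal PyTorch sketch of the multiscale extraction module under these assumptions; channel concatenation is used for the parallel-superposition fusion, and the branch widths (48/64/32) are taken from Table 1:

```python
import torch
import torch.nn as nn

class MultiScaleExtractor(nn.Module):
    """Parallel 1x1, 3x3, and dilated 5x5 branches fused by channel concatenation."""
    def __init__(self, in_ch: int = 3):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 48, kernel_size=1)
        self.branch3 = nn.Conv2d(in_ch, 64, kernel_size=3, padding=1)
        # 5x5 kernel, dilation 2: larger receptive field, same parameter count
        self.branch5 = nn.Conv2d(in_ch, 32, kernel_size=5, padding=4, dilation=2)
        self.act = nn.ReLU(inplace=True)  # Equation (4)

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        f1, f3, f5 = self.branch1(y), self.branch3(y), self.branch5(y)
        return self.act(torch.cat([f1, f3, f5], dim=1))  # 144 fused channels
```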

2.3. Deep and Shallow Channel Parallel Method

This paper uses parallel deep and shallow channels to extract feature information of different dimensions from foggy images. The shallow channel is composed of a 6-layer neural network and is mainly used to extract low-dimensional feature information.

The deep channel is composed of a 16-layer neural network and is used to extract high-dimensional feature information. The parallel-superposition feature fusion method is used to merge the deep and shallow channel information.
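A schematic PyTorch sketch of this parallel arrangement is given below; the plain 3 × 3 blocks are illustrative stand-ins for the separable convolution blocks described in Section 2.5, and concatenation is assumed for the parallel-superposition fusion:

```python
import torch
import torch.nn as nn

def conv_block(ch: int) -> nn.Sequential:
    # Illustrative conv + BN + ReLU unit; the actual model uses separable convolutions.
    return nn.Sequential(
        nn.Conv2d(ch, ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(ch),
        nn.ReLU(inplace=True),
    )

class ParallelChannels(nn.Module):
    """Shallow (6-layer) and deep (16-layer) channels run in parallel and are fused."""
    def __init__(self, ch: int = 144):
        super().__init__()
        self.shallow = nn.Sequential(*[conv_block(ch) for _ in range(6)])
        self.deep = nn.Sequential(*[conv_block(ch) for _ in range(16)])

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        low = self.shallow(f)   # low-dimensional features
        high = self.deep(f)     # high-dimensional features
        return torch.cat([low, high], dim=1)  # parallel superposition
```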

2.4. Split Convolution Operation

To further optimize the network model, a split convolution operation is proposed, as shown in Figure 2. Feature maps extracted at the same layer of an image often have similar responses, which causes feature redundancy. Therefore, following the principle of bisection, the feature maps of each layer are divided into two parts for independent convolution operations. A group convolution with a kernel size of 3 × 3 and a group number of 2 is applied to one part of the feature maps to obtain the primary features. Meanwhile, a point convolution with a kernel size of 1 × 1 is applied to the remaining feature maps to obtain the secondary features. Then, global average pooling is performed on the feature maps using Equation (5) to generate the initial weight of each channel [23–25], yielding $z_1$ and $z_2$, respectively:

$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j), \quad (5)$$

where $H$ and $W$ represent the height and width of the feature map, respectively, and $u_c(i, j)$ is the value of the $c$-th channel at position $(i, j)$.

In order to dynamically adjust the weight of each channel, the generated $z_1$ and $z_2$ are stacked together and passed to the fully connected layer. Then, the Softmax function is used to regenerate the primary and secondary attention features and the weights $w_1$ and $w_2$ of each channel, as shown in Equation (6):

$$[w_1, w_2] = \operatorname{Softmax}\left(\operatorname{FC}([z_1, z_2])\right). \quad (6)$$

The output of the split convolution module is shown in Equation (7):

$$U_{\mathrm{out}} = w_1 \cdot U_1 + w_2 \cdot U_2, \quad (7)$$

where $U_1$ and $U_2$ are the primary and secondary feature maps, respectively.
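A minimal PyTorch sketch of the split convolution operation under these assumptions (channel bisection, a 2-group 3 × 3 branch for primary features, a 1 × 1 branch for secondary features, and a softmax-weighted recombination; exact layer sizes are illustrative):

```python
import torch
import torch.nn as nn

class SplitConv(nn.Module):
    """Split feature maps in half, convolve each part separately, then reweight."""
    def __init__(self, ch: int):
        super().__init__()
        half = ch // 2
        # Primary branch: 3x3 group convolution, 2 groups (key information)
        self.primary = nn.Conv2d(half, half, kernel_size=3, padding=1, groups=2)
        # Secondary branch: 1x1 point convolution (compensation information)
        self.secondary = nn.Conv2d(half, half, kernel_size=1)
        self.fc = nn.Linear(ch, ch)  # dynamic channel weighting

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.chunk(x, 2, dim=1)         # bisect the feature maps
        u = torch.cat([self.primary(x1), self.secondary(x2)], dim=1)
        z = u.mean(dim=(2, 3))                    # Eq. (5): global average pooling
        w = torch.softmax(self.fc(z), dim=1)      # Eq. (6): per-channel weights
        return u * w.unsqueeze(-1).unsqueeze(-1)  # Eq. (7): reweighted output
```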

2.5. Overall Structure of MSP-DSCNN Model

As shown in Figure 3, the overall framework of the MSP-DSCNN model consists of four parts. In the initial stage of the network, a multiscale feature extraction module extracts the detail features of the foggy image at multiple scales using convolution kernels of different sizes. The upper part of the model is the Sconv Block module, where Sconv denotes the separable convolution operation and BN denotes batch normalization. The shallow channel module consists of a 6-layer network, and each separable convolution operation is followed by a BN layer and a ReLU layer. The deep channel module consists of a 16-layer network. In order to reduce the difficulty of network learning, a residual network is first used to pass the features processed by the first separable convolution layer directly to the fifth, ninth, and thirteenth layers. Then, the parallel-superposition feature fusion module merges the information extracted from the deep and shallow channels. Finally, one convolution layer transforms the feature maps into the output residual image. To reduce the learning difficulty of the whole network, the entire network forms one large residual unit: the input and output are combined by a subtraction operation. In this way, the network directly learns the fog noise distribution in the foggy image instead of directly learning the clean image.

This paper uses the mean square error as the loss function of the model, as shown in Equation (8), where $R(y_i)$ represents the residual image block obtained from network learning and $y_i - x_i$ represents the actual residual image block, that is, the label:

$$L = \frac{1}{N} \sum_{i=1}^{N} \left\| R(y_i) - (y_i - x_i) \right\|^2. \quad (8)$$
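In PyTorch this corresponds to a standard MSE criterion computed against the fog residual, as in the following sketch (tensor names are assumptions):

```python
import torch.nn.functional as F

def residual_mse_loss(pred_residual, foggy, clear):
    """Equation (8): MSE between the learned residual and the actual residual."""
    target = foggy - clear  # actual residual image block, i.e., the label
    return F.mse_loss(pred_residual, target)
```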

The convolution kernel size and the number of output channels of each layer of the MSP-DSCNN model are shown in Table 1. The first layer is the multiscale feature extraction layer, which consists of 48 1 × 1 convolution kernels, 64 3 × 3 convolution kernels, and 32 5 × 5 convolution kernels (dilation rate 2). In the deep channel, the separable convolution operation is used first. The primary features use 38 3 × 3 convolution kernels with a group number of 2, and the secondary features use 26 1 × 1 convolution kernels for the feature completion operation. Then, the 64 feature layers of each deep channel are superimposed in parallel. After the final convolution layer with one output channel, the residual image block is obtained.

3. Experiment and Analysis

In order to verify the effectiveness and rationality of the proposed network, this paper quantitatively evaluates its defogging effect on synthetic and real images, and the results are made more convincing by comparing with current advanced algorithms in terms of both visual effect and objective data.

3.1. Experimental Settings and Data Sets

The network in this paper is implemented on the PyTorch framework and trained on an M2000 GPU in an Ubuntu environment. The initial learning rate is set to the empirical value 0.0001 and the batch size to 32. The momentum decay exponents of the Adam optimizer are beta1 = 0.899 and beta2 = 0.999, respectively. The number of iterations is 40, and the training time for one iteration is 32 minutes. The size of the trained network model is 1608 KB.
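A sketch of the corresponding optimizer configuration (the placeholder module stands in for the MSP-DSCNN network):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 3, kernel_size=3, padding=1))  # placeholder network
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-4,               # empirical initial learning rate
    betas=(0.899, 0.999),  # momentum decay exponents as reported
)
```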

The RESIDE (Realistic Single Image Dehazing) data set [26] is used to train the network in this paper. RESIDE images are selected from the NYU2 depth data set [27]. By randomly selecting the atmospheric light A (A ∈ (0.8, 1.0)) and the scattering coefficient β (β ∈ (0.5, 1.5)), foggy images of different densities are synthesized. RESIDE includes the Indoor Training Set (ITS), the Outdoor Training Set (OTS), and the Synthetic Objective Testing Set (SOTS); the ITS subset is used as the training set and SOTS as the test set. The training set (ITS) includes 1399 clear images and 13990 foggy images with different fog concentrations. The test set (SOTS) includes 500 indoor foggy images and 500 outdoor foggy images. In addition, in order to better verify the effectiveness and practicability of the proposed network, foggy images with rich colors captured in real environments are selected for testing.

In terms of quantitative evaluation of network performance, this paper selects some representative algorithms in the defogging field for comparison, including the traditional algorithm based on a physical model in literature [28], the algorithm in literature [29], and the deep-learning-based algorithm in literature [30].

3.2. Quantitative Evaluation of Synthetic Image Data Sets
3.2.1. Subjective Evaluation

The network is evaluated from both subjective and objective aspects on the synthetic image data set SOTS; one indoor image and one outdoor image are randomly selected for effect evaluation. The defogging effect of each algorithm is shown in Figures 4 and 5. The image obtained by the method in reference [28] is completely defogged, but the overall image is darker than the renderings of the other algorithms, and the sky region is seriously distorted due to the limitations of the dark channel prior, as shown in Figure 5. In addition, the transmittance estimate at abrupt edges is inaccurate, leading to halo and block effects at the edges of the restored image; as shown in Figure 5, there is halation around the roof. The image obtained by the method of literature [29] has natural color, especially in the sky area, but suffers from incomplete defogging and serious detail loss: in Figure 4 there is obvious residual fog in the background wall area, and in Figure 5 the details of the lawn are blurred. Compared with the algorithm in reference [28], the algorithm in reference [28] improves the color bias in the sky region, but some images suffer from incomplete defogging and color oversaturation; the defogging effect on the fourth image in Figure 4 and in Figure 5 is poor. The image obtained by the method in reference [30] has good color fidelity and thorough defogging; however, due to the shallow network structure, the extracted features are not very rich, and detail recovery in distant areas is poor, as shown in Figure 5. For the fog-free image obtained by the algorithm in literature [28], the indoor image is well defogged with natural color, while the outdoor image is not completely defogged, although the sky area has natural color without obvious color deviation, as shown in Figure 4. Compared with the previous classical algorithms, the algorithm presented in this paper has a significant defogging effect, preserves details well, and shows no obvious distortion or color deviation; the color of the defogged images is natural, and the performance is good on both indoor and outdoor images.

3.2.2. Objective Evaluation

Objective evaluation indexes adopt the Structural Similarity Index Measurement (SSIM) and Peak Signal-to-Noise Ratio (PSNR), which are uniformly used in deep learning. The structural similarity SSIM measures the degree of similarity between two images, with a value range of [0, 1]. The peak signal-to-noise ratio PSNR represents the ratio of useful signal to noise in an image; the larger, the better. The mathematical expressions of the two indicators are shown in Equations (9) and (10), respectively:

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}, \quad (9)$$

where $\mu_x$ and $\mu_y$ are the mean values of images $x$ and $y$, $\sigma_x$ and $\sigma_y$ are the standard deviations of the two images, $\sigma_{xy}$ is the covariance of the two images, and $c_1$ and $c_2$ are constants;

$$\mathrm{PSNR} = 10 \log_{10} \frac{M^2}{\mathrm{MSE}}, \quad (10)$$

where $\mathrm{MSE}$ is the mean square error between the defogged image and the clear image and $M$ is generally 255. The objective indicators on SOTS are shown in Table 2 (data are mean values). For the pictures obtained by the proposed network, the SSIM index is ahead of all other algorithms and the PSNR index is better than most algorithms; good results are obtained, which verifies the validity of the proposed network.
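As a usage sketch, both indicators can be computed with scikit-image, assuming 8-bit images loaded as NumPy arrays:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(defogged: np.ndarray, clear: np.ndarray):
    """Compute SSIM (Equation (9)) and PSNR (Equation (10)) for one image pair."""
    ssim = structural_similarity(defogged, clear, channel_axis=-1)  # color images
    psnr = peak_signal_noise_ratio(clear, defogged, data_range=255)
    return ssim, psnr
```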

3.3. Quantitative Evaluation of Real Image Data Sets
3.3.1. Subjective Evaluation

In order to further confirm the practical value of the proposed network, this paper conducts an experimental evaluation of its effect on images of real scenes; the defogging effect of each algorithm is shown in Figure 6. It can be seen that the algorithm in literature [28] has a good defogging effect on images with rich close-range content, but because the minimum filter misestimates the transmittance, the recovered image is prone to halos at edges, and the inaccurate estimation of atmospheric light causes serious distortion of the defogged image in distant areas containing sky. Compared with other algorithms, both the algorithm in literature [29] and the algorithm in literature [28] suffer from incomplete defogging and serious detail loss. The algorithm in literature [30] has a remarkable defogging effect, especially in the sky area, where the color is restored naturally; however, the degree of defogging in some images is excessive, resulting in the loss of some pixels and a darkening of the image. Compared with the other algorithms, the effect picture of the algorithm in this paper is completely defogged, details are well maintained, and the scene is clear and natural.

3.3.2. Objective Evaluation

Objective evaluation adopts the no-reference image quality evaluation method often used in the field of real-scene defogging [31], whose evaluation indexes include the number of saturated pixels $\sigma$, the visible edge set $e$, the average gradient $\bar{r}$, and the running time (in seconds). The smaller the number of saturated pixels $\sigma$ and the running time, the better the performance of the algorithm; the larger the visible edge set $e$ and the mean gradient $\bar{r}$, the better the algorithm.
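As a minimal sketch, the average gradient can be computed as the mean gradient magnitude of a grayscale image (one common definition; the exact formulation follows [31]):

```python
import numpy as np

def mean_gradient(gray: np.ndarray) -> float:
    """Average gradient magnitude; larger values indicate sharper, richer edges."""
    gy, gx = np.gradient(gray.astype(np.float64))
    return float(np.mean(np.sqrt(gx**2 + gy**2)))
```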

The indicators of each algorithm are shown in Table 3 (the data in the table are mean values).

As can be seen from Table 3, the network proposed in this paper performs well in the number of saturated pixels and the running time and leads in the visible edge set and the mean gradient. Compared with the other algorithms, the proposed network improves the visible edge set by 20.6% and the average gradient by 20.9%.

3.4. Ablation Experiment

In order to further verify the contribution of the proposed modules to the result of fog removal, an image in the SOTS data set was randomly selected for the ablation experiment as follows.

It mainly includes the following steps:
(1) Only the large-range features (LF) of the foggy image are extracted to reconstruct the original image.
(2) Only the small-range detail features (DF) of the foggy image are extracted to reconstruct the original image.
(3) The large-range features and the small-range detail features act jointly.

It can be seen from the comparison diagram in Figure 7 that the large-range feature image retains only some global features and loses many details. The detail information of the small-range detail feature image increases obviously, but incomplete defogging and color deviation remain. In contrast, the image produced by the joint action of the two features not only retains the color of the original image but also preserves distinct details and is defogged more thoroughly. The indicators of each image are shown in Table 4.

As can be seen from Table 4, the SSIM and PSNR of the image obtained under the combined action of the two features are significantly improved.

4. Conclusion

In order to preserve details as much as possible while defogging the image, a multiscale parallel-depth separable convolutional neural network (MSP-DSCNN) is designed in this paper. The experimental results show that, compared with the other comparison algorithms, the MSP-DSCNN model performs better in both subjective visual effect and objective evaluation indicators. At the same time, the network ablation experiment proves that the proposed multiscale feature extraction module and the parallel strategy of shallow and deep channels achieve good haze removal and high image fidelity. Compared with other algorithms, the proposed algorithm has distinct advantages and effectively alleviates the problems of incomplete defogging, color shift, poor detail retention, and reduced visibility in current images, and the accuracy of fog removal is further improved. The split convolution operation further reduces the number of parameters in the network model, alleviates the feature redundancy of traditional convolutional neural networks, and accelerates the training and convergence of the network. The overall experiments show that the MSP-DSCNN model can complete the image defogging task faster and better. However, since the restoration module of the network essentially relies on the atmospheric scattering model, some restored images still have shortcomings in detail preservation and in the color fidelity of the sky area. Therefore, improving the accuracy of the defogging model will be the focus of further research.

Data Availability

The labeled data set used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest

The author declares no competing interests.

Acknowledgments

This work was supported by Wenhua University.