Abstract

Digital cameras with a single sensor use a color filter array (CFA) that captures only one color component at each pixel. Noise and artifacts are therefore generated when the color image is reconstructed, which reduces image resolution. In this paper, we propose an image demosaicing method based on the generative adversarial network (GAN) to obtain high-quality color images. The proposed network needs no initial interpolation in the data preparation phase, which greatly reduces the computational complexity. The generator of the GAN is designed using the U-net to directly generate the demosaiced images. A dense residual network is used for the discriminator to improve its discriminant ability. We compared the proposed method with several interpolation-based algorithms and the DnCNN. The comparative experiments show that the proposed method more effectively eliminates image artifacts and better recovers the color image.

1. Introduction

Images are widely used in daily life. Compared to analog images, digital images offer higher resolution, easier storage, and better suitability for computer processing. With the development of computer technology, digital imaging has attracted much attention, and digital cameras have gradually become the mainstream imaging equipment, widely used in intelligent transportation [1, 2], medical imaging [3, 4], remote sensing [5, 6], and other fields. Digital color images, which contain three color components (red, green, and blue) at each pixel, are the most commonly used. Ideally, a digital camera with three sensors can capture a full-color image, with each sensor recording one color component and the three components then combined into a color image. In practice, however, the arrangement of the three color sensors affects the subsequent color synthesis, and three-sensor cameras are usually expensive and relatively large. Therefore, most digital cameras use a single sensor with a color filter array (CFA) placed in front of it. The obtained CFA image must be processed to acquire the full-color image, and this process is known as image demosaicing [7]. Because only one color component is captured at each pixel of the CFA, without demosaicing the CFA image reflects only the general outline of the scene rather than the complete color information, which in turn affects subsequent image processing [8].

CFA image demosaicing is essentially an ill-posed inverse problem [9]. Demosaicing methods generally fall into interpolation-based and learning-based algorithms. Interpolation methods can achieve high accuracy for smooth areas with approximately uniform colors and gradual brightness changes. In a color image, the red, green, and blue components occupy different color channels. Where the high-frequency signal changes (high-frequency information refers to regions with strong color variation, such as edges and corners), there may be spatial offsets among the color channels, so interpolation can produce color artifacts and zippering in the reconstructed image [7]. In addition, some traditional interpolation-based methods ignore the correlation among color channels, which results in unsmooth images [8]. On the whole, interpolation-based algorithms still have limitations for image demosaicing, especially in high-frequency areas.

In recent years, neural networks have developed rapidly and are widely used in image processing tasks such as image classification [10, 11], motion recognition [12, 13], and image super-resolution [14, 15]. The generative adversarial network (GAN) [16] has rapidly attracted the attention of many researchers since it was proposed. Ledig et al. [17] proposed the super-resolution generative adversarial network (SRGAN), which trains a deep residual network and can recover image textures well from heavily downsampled images. Inspired by super-resolution image reconstruction and the conditional generative adversarial network (CGAN), Kupyn et al. [18] applied the CGAN to image deblurring and effectively restored clear images. Pan et al. [19] proposed a physics-model-constrained learning algorithm that guides the estimation of the specific task within the conventional GAN framework and can directly solve image restoration problems such as image deblurring and image denoising.

GAN has played important roles in several areas; however, it has not been used for image demosaicing. In this paper, we propose a novel learning-based image demosaicing method using GAN to improve color image recovery. Our contributions are as follows:

(1) We propose a CFA image demosaicing method based on GAN.
(2) We carefully design each part of the GAN model.
(3) We introduce long jump connections into an improved U-net [20] model to design the generator.
(4) We use a dense residual network, which includes dense residual blocks with long jump links and dense connections, for the discriminator.
(5) We combine the adversarial loss, the feature loss, and the pixel loss to further strengthen the network performance.

In the experimental section, we show the performance of our method through comparative experiments. The results demonstrate that the proposed method more effectively removes artifacts and recovers the full-color image, especially in high-frequency areas such as edges and corners.

2. Related Work

2.1. Interpolation-Based Algorithms

There are many interpolation-based methods for image demosaicing. The linear interpolation algorithm is the simplest, but it often causes artifacts and blurring at image edges [21]. The bilinear interpolation algorithm [22] estimates unknown pixels from their adjacent pixels and often causes color distortion in the reconstructed image. Malvar et al. [23] proposed a high-quality linear (HQL) interpolation algorithm that greatly reduces computational complexity; however, artifacts still occur at high-frequency components of the image. To further reduce artifacts, various interpolation techniques have been proposed.

Within the gradient-based schemes, Hamilton and Adams [24] proposed the Hamilton–Adams algorithm, which uses the second derivative of the sampled color channels during interpolation; it thus considers the correlation among color channels and significantly improves image details. Mukherjee et al. [25] proposed a two-line (TL) interpolation algorithm, which uses the homogeneity of the cross-ratios of different spectral components in a small neighborhood to interpolate pixels lying along low-gradient directions, producing high-quality images.

Within the directional interpolation schemes, Chung and Chan [26] used a prior decision between horizontal and vertical interpolation and obtained the interpolation result according to the trend of the image edges. This method is prone to producing false colors at tiny edges, especially when the edges are not horizontal or vertical. Zhang et al. [27] proposed a local directional interpolation and nonlocal adaptive thresholding (LDI-NAT) algorithm, which uses the nonlocal redundancy of the image to improve local color reproduction and can better reconstruct edges and reduce color artifacts.

Within the residual interpolation schemes, Kiku et al. [28] proposed a minimized-Laplacian residual interpolation (MLRI) algorithm, which estimates tentative pixel values by minimizing the Laplacian energy of the residuals and effectively reduces color artifacts. Monno et al. [29] proposed an adaptive residual interpolation (ARI) algorithm, which adaptively selects a suitable iteration number and combines two different types of residual interpolation at each pixel. Kiku et al. [30] incorporated residual interpolation into the gradient-based threshold-free (RI-GBTF) algorithm, greatly improving interpolation accuracy. Besides, L. Zhang and D. Zhang [21] proposed a joint demosaicing-zooming scheme, which uses the spectral-spatial correlation of the CFA image to calculate the color difference and restore the three color components, effectively eliminating color artifacts.

2.2. Learning-Based Algorithms

Recently, neural networks have also been applied to image demosaicing. Prakash et al. [31] used a denoising convolutional neural network (DnCNN) to perform demosaicing and denoising independently, effectively suppressing noise and artifacts. Tan et al. [32] used a deep residual network for image demosaicing and denoising and also obtained high-resolution color images. Shopovska et al. [33] proposed an improved residual U-net for image demosaicing, which achieved high-quality reconstructed color images for different CFA patterns. Generally, learning-based strategies achieve better performance than traditional interpolation-based methods. Still, higher-resolution, clearer recovered color images remain the constant pursuit of image demosaicing, which is why we apply GAN to this task.

3. Problem Formulation

3.1. CFA Image

To obtain a color image with a detailed description of the natural scene, the best solution is to use three sensors that capture the red, green, and blue components at each pixel, respectively; the color image is then synthesized by combining the three color components. Considering cost and volume, however, most digital cameras use a single image sensor in their image acquisition systems. The image acquisition of a single-sensor camera is shown in Figure 1. The CFA is placed in front of the sensor. For a common CFA such as the Bayer pattern [34], which is used in this work, the light reaching the sensor mainly consists of the red, green, and blue components, and each pixel accepts only one of them. As shown in Figure 1, the obtained Bayer pattern image gives only an approximate gray outline of the scene rather than the complete color information. The color arrangement of the Bayer pattern can be clearly seen in the zoomed-in area: rows of alternating red and green filters are interleaved with rows of alternating green and blue filters, so green pixels account for 1/2 of the total number of pixels, while red and blue pixels each account for 1/4. Because only one color component is captured at each pixel, the other two components must be recovered from the color information of adjacent pixels to obtain a full-color image from the CFA image. This process is called image demosaicing.
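As an illustration, the following NumPy sketch simulates Bayer (RGGB) sampling of a full-color image. The particular phase (red at the top-left pixel) is an assumption for illustration; real cameras may use a shifted arrangement.

```python
import numpy as np

def bayer_mosaic(rgb):
    """Simulate a Bayer (RGGB) CFA: keep one color per pixel, zero the rest.

    rgb: float array of shape (H, W, 3) with even H and W.
    Returns an (H, W, 3) array in which each pixel retains only the
    component selected by the assumed RGGB layout.
    """
    mosaic = np.zeros_like(rgb)
    mosaic[0::2, 0::2, 0] = rgb[0::2, 0::2, 0]  # R at even rows, even cols
    mosaic[0::2, 1::2, 1] = rgb[0::2, 1::2, 1]  # G at even rows, odd cols
    mosaic[1::2, 0::2, 1] = rgb[1::2, 0::2, 1]  # G at odd rows, even cols
    mosaic[1::2, 1::2, 2] = rgb[1::2, 1::2, 2]  # B at odd rows, odd cols
    return mosaic
```

Note that this sampling reproduces the proportions stated above: half of the pixels keep a green value and a quarter each keep red and blue.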

3.2. Theory of GAN

GAN is a kind of probabilistic generative network, first introduced into the deep learning field by Goodfellow et al. [16]. The general architecture of GAN is shown in Figure 2. GAN uses the generator $G$ to perform inverse transformation sampling of a probability distribution and capture the distribution of the ground truth data $x$. Based on noise data $z$ obeying a certain distribution (such as a Gaussian distribution), $G$ generates a fake sample $G(z)$ similar to $x$. The output of the discriminator $D$ represents the probability that the incoming data are real: if the input is $x$, the output is a large probability value; otherwise, it outputs a small probability. The training of GAN maximizes the discrimination accuracy by training $D$ and minimizes the difference between the generated sample and the real sample by training $G$; thus, the training of $D$ and $G$ is a min-max game. The performances of $G$ and $D$ are improved by alternating optimization until they reach a Nash equilibrium, so that the data distribution synthesized by $G$ is similar to that of the ground truth data $x$. The loss function of the above process is defined as

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))], \tag{1}$$

where $V(D, G)$ represents the value function [16], $x$ represents ground truth data obeying the real data distribution $p_{\mathrm{data}}(x)$, and $z$ represents noise data obeying a simulated distribution $p_z(z)$ (such as the Gaussian distribution). $D(x)$ and $D(G(z))$ are the classification outputs of $D$ for the ground truth data $x$ and the generated data $G(z)$, respectively. $\mathbb{E}$ denotes expectation.
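For concreteness, the following sketch shows a Monte Carlo estimate of the value function in Equation (1) over one batch. It is illustrative only; `d_real` and `d_fake` are assumed to be discriminator outputs in (0, 1).

```python
import tensorflow as tf

def gan_value(d_real, d_fake):
    """Batch estimate of V(D, G) from Equation (1).

    d_real: D(x) for a batch of ground truth samples.
    d_fake: D(G(z)) for a batch of generated samples.
    D is trained to maximize this value; G is trained to minimize it.
    """
    eps = 1e-8  # numerical stability for the logarithms
    return tf.reduce_mean(tf.math.log(d_real + eps)) + \
           tf.reduce_mean(tf.math.log(1.0 - d_fake + eps))
```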

4. The Proposed Method

In this section, we propose an effective demosaicing algorithm based on GAN. The whole process is shown in Figure 3. The algorithm first extracts the red, green, and blue components from the original CFA image to form the 3-channel split CFA image. The extracted green component is then further separated into two channels to form the 4-channel split CFA image. Subsequently, the algorithm keeps only the nonzero pixel values to compress the 4-channel split CFA image. The compressed 4-channel image is taken as the input of the generator $G$, whose output is the interpolated 3-channel full-color image. The output images from $G$ and the ground truth images are then fed into the discriminator $D$, and the parameters of $G$ are optimized according to the output of $D$. We designed the architectures of $G$ and $D$ and trained them end to end. In addition, the algorithm combines the adversarial loss, pixel loss, and feature loss in the generator loss function to further improve the network performance [35]. In the following, we introduce each part of the network in detail.
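The following NumPy sketch illustrates the packing step under the assumption of an RGGB Bayer phase and an (R, G1, G2, B) channel order, neither of which is fixed by the paper.

```python
import numpy as np

def pack_cfa(mosaic):
    """Compress an RGGB Bayer mosaic of shape (H, W) into a 4-channel,
    half-resolution image of shape (H/2, W/2, 4) by keeping only the
    sampled (nonzero-position) pixels."""
    r  = mosaic[0::2, 0::2]  # red samples
    g1 = mosaic[0::2, 1::2]  # green samples on red rows
    g2 = mosaic[1::2, 0::2]  # green samples on blue rows
    b  = mosaic[1::2, 1::2]  # blue samples
    return np.stack([r, g1, g2, b], axis=-1)
```

Because every retained value is an actual sensor sample, no initial interpolation is needed before the network sees the data, which is the source of the computational saving mentioned in the abstract.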

4.1. Generator

The purpose of the generator is to convert the 4-channel compressed CFA images into 3-channel full-color output images. The structure of $G$ is shown in Figure 4. We use an improved U-net [20] model for $G$. Overall, the generator consists of an encoder (the first half) and a decoder (the second half). One layer in the encoder and the corresponding layer in the decoder form a U-shaped symmetric pair, and the long jump links within each symmetric pair of the U-net reduce information redundancy. Besides, we remove the pooling layers of the U-net, which avoids losing useful information in the feature maps and increases the stability of the training process.

The encoder is mainly based on the downsampling operation (i.e., the convolution operation). It analyzes the input data to extract the most significant features and provides feature maps to the corresponding layers in the decoder. The activation function of the encoder is the leaky rectified linear unit (LReLU), which is defined as

$$\mathrm{LReLU}(x) = \begin{cases} x, & x > 0, \\ \alpha x, & x \le 0, \end{cases} \tag{2}$$

where $\alpha$ is a positive constant ($0 < \alpha < 1$) and $x$ represents the input vector of a specific layer of the encoder. In our experiments, we set $\alpha$ to 0.1.

The decoder is mainly based on the upsampling operation (i.e., the deconvolution operation) to restore the full-color images. The activation function of the decoder is the standard rectified linear unit (ReLU), which is defined as

$$\mathrm{ReLU}(x) = \max(0, x), \tag{3}$$

where $x$ represents the input vector of a specific layer of the decoder.

For the final layer of the decoder in particular, the activation function is the tanh function, which is defined as

$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}, \tag{4}$$

where $x$ represents the input vector of the final layer of the decoder.

To accelerate convergence and improve network performance, we introduce a batch normalization (BN) operation after each convolution and deconvolution to reduce internal covariate shift and lower the network's sensitivity to the initialization weights [36].

Detailed parameters for the convolution and deconvolution layers are shown in Table 1.
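For readers who prefer code, the following Keras sketch illustrates the generator design described above: a strided-convolution encoder with LReLU ($\alpha = 0.1$), a transposed-convolution decoder with ReLU, long skip connections between symmetric layers, BN after every (de)convolution, and a tanh output. The depth and channel widths are placeholders, not the paper's exact Table 1 settings.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_generator(h=64, w=64):
    """Sketch of the U-net-style generator; widths/depth are assumptions."""
    inp = tf.keras.Input(shape=(h // 2, w // 2, 4))  # packed 4-channel CFA

    skips, x = [], inp
    for ch in (64, 128, 256):                       # encoder (downsampling)
        x = layers.Conv2D(ch, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU(0.1)(x)
        skips.append(x)                             # feature map for skip link
        x = layers.Conv2D(ch, 3, strides=2, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.LeakyReLU(0.1)(x)

    for ch, skip in zip((256, 128, 64), reversed(skips)):   # decoder
        x = layers.Conv2DTranspose(ch, 3, strides=2, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
        x = layers.Concatenate()([x, skip])         # long jump connection

    # Extra 2x upsampling: the packed input is half the output resolution.
    x = layers.Conv2DTranspose(32, 3, strides=2, padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    out = layers.Conv2D(3, 3, padding="same", activation="tanh")(x)
    return tf.keras.Model(inp, out)
```

A final transposed convolution restores the full spatial resolution, since the compressed CFA input has half the height and width of the target image.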

4.2. Discriminator

We use a dense residual network, inspired by ResNet [36], for the discriminator $D$. ResNet is formed by stacking multiple consecutive residual blocks (RBs). To improve network performance and alleviate gradient vanishing and gradient dispersion during training, we use an improved residual dense block (RDB). The structure of $D$ is shown in Figure 5. A long jump connection after each RDB transfers its output to the final convolution layer. Within each RDB, there are several units, each consisting of the ReLU activation function, a convolution layer, and a BN operation, with dense connections of different distances among the units. The output of the final convolution layer is mapped to the range (0, 1) by the sigmoid activation function, which performs a probability analysis that normalizes the discriminant result and is defined as

$$\sigma(x) = \frac{1}{1 + e^{-x}}, \tag{5}$$

where $x$ represents the input vector of the sigmoid function.

For the convolution layers in $D$, the kernel size is 3 × 3, the stride is 1 × 1, and the number of output channels is 64.
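The following Keras sketch shows one possible realization of the RDB-based discriminator. The number of blocks, the units per block, and the pooling-plus-dense head before the sigmoid are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_dense_block(x, units=4, ch=64):
    """One RDB sketch: ReLU -> Conv(3x3, stride 1, 64 ch) -> BN units with
    dense connections among them, plus a local residual link."""
    feats = [x]
    for _ in range(units):
        h = layers.Concatenate()(feats) if len(feats) > 1 else feats[0]
        h = layers.ReLU()(h)
        h = layers.Conv2D(ch, 3, strides=1, padding="same")(h)
        h = layers.BatchNormalization()(h)
        feats.append(h)
    out = layers.Conv2D(ch, 1, padding="same")(layers.Concatenate()(feats))
    return layers.Add()([x, out])

def build_discriminator(h=64, w=64, n_blocks=3):
    """Stacked RDBs with long jump connections to a final convolution,
    followed by a sigmoid real/fake probability."""
    inp = tf.keras.Input(shape=(h, w, 3))
    x = layers.Conv2D(64, 3, padding="same")(inp)
    block_outs = []
    for _ in range(n_blocks):
        x = residual_dense_block(x)
        block_outs.append(x)                 # long jump connection to the end
    x = layers.Concatenate()(block_outs)
    x = layers.Conv2D(64, 3, padding="same")(x)
    x = layers.GlobalAveragePooling2D()(x)
    out = layers.Dense(1, activation="sigmoid")(x)
    return tf.keras.Model(inp, out)
```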

4.3. Loss Function

Denote the ground truth images by $\{y_i\}_{i=1}^{N}$, where $N$ represents the number of images. After a series of operations, the CFA images are transformed into the corresponding 4-channel compressed CFA images, denoted by $\{x_i\}_{i=1}^{N}$, which are taken as the input of $G$. Following the loss function of Alsaiari et al. [35], we combine the adversarial loss, the feature loss, and the pixel loss with appropriate weights as the final loss function of the generator. The adversarial loss function $L_A$ is expressed as

$$L_A = -\frac{1}{N} \sum_{i=1}^{N} \log D(G(x_i)), \tag{6}$$

where $x_i$ represents the 4-channel compressed CFA image and $G(x_i)$ is the produced 3-channel color image. Minimizing Equation (6) encourages the generated images to fool the discriminator $D$.

The feature loss function $L_F$ is defined as

$$L_F = \frac{1}{N} \sum_{i=1}^{N} \left\| \phi(G(x_i)) - \phi(y_i) \right\|_2^2, \tag{7}$$

where $\phi(\cdot)$ represents the feature mapping extracted from the pretrained VGG network [35] and $\|\cdot\|_2$ represents the L2 norm. Using Equation (7), we can extract image features and restore image details by comparing the feature data of the generated image $G(x_i)$ and the ground truth image $y_i$.

The pixel loss (pixel-to-pixel Euclidean distance) function $L_P$ is defined as

$$L_P = \frac{1}{N} \sum_{i=1}^{N} \left\| G(x_i) - y_i \right\|_2^2 + \lambda \|w\|_2^2, \tag{8}$$

where $\lambda \|w\|_2^2$ is the regularization term, with $\lambda$ representing the regularization weight and $w$ the network weights. Using Equation (8), we can correctly restore the image information by comparing each pixel of the generated image $G(x_i)$ with the ground truth image $y_i$.

We combine $L_A$, $L_F$, and $L_P$ with appropriate weights to form the final loss function of the generator:

$$L_G = \lambda_1 L_A + \lambda_2 L_F + \lambda_3 L_P, \tag{9}$$

where $\lambda_1$, $\lambda_2$, and $\lambda_3$ represent predefined positive weights chosen according to the empirical values in [35].
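A minimal TensorFlow sketch of Equations (6)-(9) is given below. The weight values are placeholders, as the empirical settings follow [35], and `vgg_fake`/`vgg_real` are assumed to be precomputed VGG feature maps.

```python
import tensorflow as tf

def generator_loss(d_fake, vgg_fake, vgg_real, g_out, y, weights, lam=1e-4):
    """Combined generator loss sketch.

    d_fake:   D(G(x)) for the generated batch, values in (0, 1).
    vgg_fake, vgg_real: VGG feature maps of generated / ground truth images.
    g_out, y: generated and ground truth images.
    weights:  list of trainable weight tensors (for the regularizer).
    """
    eps = 1e-8
    l_a = -tf.reduce_mean(tf.math.log(d_fake + eps))            # Eq. (6)
    l_f = tf.reduce_mean(tf.square(vgg_fake - vgg_real))        # Eq. (7)
    l_p = tf.reduce_mean(tf.square(g_out - y)) \
          + lam * tf.add_n([tf.nn.l2_loss(w) for w in weights])  # Eq. (8)
    lam1, lam2, lam3 = 1.0, 1.0, 1.0   # placeholder positive weights
    return lam1 * l_a + lam2 * l_f + lam3 * l_p                  # Eq. (9)
```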

According to Equation (1), the discriminator updates its parameters using

$$L_D = -\frac{1}{N} \sum_{i=1}^{N} \left[ \log D(y_i) + \log\left(1 - D(G(x_i))\right) \right]. \tag{10}$$

For a ground truth image $y_i$, the output probability $D(y_i)$ is close to 1; for a generated image $G(x_i)$, the output probability $D(G(x_i))$ is close to 0.

Based on the above strategy, the generator and the discriminator will be alternately optimized.
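The alternation can be sketched as follows. For brevity, this illustrative training step omits the VGG feature loss and uses only the adversarial and pixel terms for the generator.

```python
import tensorflow as tf

@tf.function
def train_step(G, D, g_opt, d_opt, x, y):
    """One alternating optimization step (sketch): update D with the
    objective of Equation (10), then update G with its combined loss."""
    eps = 1e-8
    with tf.GradientTape() as tape:                 # discriminator update
        fake = G(x, training=True)
        d_loss = -tf.reduce_mean(tf.math.log(D(y, training=True) + eps)
                  + tf.math.log(1.0 - D(fake, training=True) + eps))
    d_opt.apply_gradients(zip(tape.gradient(d_loss, D.trainable_variables),
                              D.trainable_variables))

    with tf.GradientTape() as tape:                 # generator update
        fake = G(x, training=True)
        g_loss = -tf.reduce_mean(tf.math.log(D(fake, training=True) + eps)) \
                 + tf.reduce_mean(tf.square(fake - y))  # L_A + L_P only
    g_opt.apply_gradients(zip(tape.gradient(g_loss, G.trainable_variables),
                              G.trainable_variables))
    return d_loss, g_loss
```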

Figure 6 shows the whole pipeline of the proposed method. The real scene is captured by the camera and converted into a CFA image, which is further converted into the 4-channel compressed CFA image and fed into the generator designed with the U-net model. The output of the generator and the ground truth image are then input into the discriminator designed with the dense residual network. Through network training, the generator finally produces near-real demosaiced images.

5. Experiments

In this section, we demonstrate the performance of the proposed network with numerical experiments. The network is trained under the TensorFlow environment on a PC with an Nvidia GeForce® MX250 GPU and an Intel Core i5-8265U CPU. The training sets are created beforehand and then loaded into TensorFlow.

5.1. Training Details

The training database used in this paper is the Waterloo Exploration Database (WED) [37], which contains 4744 pristine natural images. We randomly selected 400 images to create the training set and applied data augmentation operations such as cropping and rotation to increase the number of images. Specifically, we first scaled each selected image by factors of 1, 0.9, 0.8, and 0.7 and then cropped the scaled images into fixed-size patches with a sliding window, using a step length of 20 pixels in both the horizontal and vertical directions. The obtained patches were then vertically and horizontally flipped and rotated by 90°, 180°, and 270°, respectively, as shown in Figure 7. Through these augmentation operations, we obtained 86400 training images, which are input in batches during training to reduce computation and avoid local extremum problems.
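A sketch of this augmentation pipeline is shown below. The patch size is a placeholder (the original value is not reproduced here), and OpenCV is assumed to be available for rescaling.

```python
import numpy as np
import cv2  # assumed available for rescaling

def make_patches(img, patch_size=50, step=20, scales=(1.0, 0.9, 0.8, 0.7)):
    """Rescale by each factor, crop with a sliding window (20 px step),
    then flip and rotate each patch, yielding 6 variants per crop."""
    patches = []
    for s in scales:
        scaled = cv2.resize(img, None, fx=s, fy=s)
        h, w = scaled.shape[:2]
        for i in range(0, h - patch_size + 1, step):
            for j in range(0, w - patch_size + 1, step):
                p = scaled[i:i + patch_size, j:j + patch_size]
                patches.append(p)
                patches.append(np.flipud(p))   # vertical flip
                patches.append(np.fliplr(p))   # horizontal flip
                for k in (1, 2, 3):            # 90, 180, 270 degrees
                    patches.append(np.rot90(p, k))
    return patches
```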

During training, the weighting parameters $\lambda_1$, $\lambda_2$, $\lambda_3$, and the regularization weight $\lambda$ in the loss functions are set according to the empirical values in [35]. The batch size is set to 256, and the whole network is trained for 200 iterations. We use a variable learning rate: the initial learning rate is 0.01 and is reduced to 1/10 of its value every 40 iterations. The trained network is tested on the Kodak database and the McMaster database [38]. The Kodak database consists of 24 images of 768 × 512 pixels, and the McMaster database consists of 18 images of 500 × 500 pixels.

To quantitatively evaluate the performance of the proposed network, we use the color peak signal-to-noise ratio (CPSNR) and the structural similarity index (SSIM) as measurement standards for the demosaicing results. The CPSNR value is calculated as

$$\mathrm{CPSNR} = 10 \log_{10} \frac{255^2}{\mathrm{CMSE}}, \qquad \mathrm{CMSE} = \frac{1}{3HW} \sum_{c=1}^{3} \sum_{i=1}^{H} \sum_{j=1}^{W} \left[ y_c(i, j) - \hat{y}_c(i, j) \right]^2, \tag{11}$$

where $y_c(i, j)$ and $\hat{y}_c(i, j)$ represent the pixel values of the ground truth image and the demosaiced image for color channel $c$, respectively, and $H$ and $W$ represent the height and width of the image.

The SSIM measures the similarity between two images and is defined as

$$\mathrm{SSIM}(y, \hat{y}) = \frac{(2\mu_y \mu_{\hat{y}} + C_1)(2\sigma_{y\hat{y}} + C_2)}{(\mu_y^2 + \mu_{\hat{y}}^2 + C_1)(\sigma_y^2 + \sigma_{\hat{y}}^2 + C_2)}, \tag{12}$$

where $\mu_y$ and $\sigma_y$ represent the mean intensity and standard deviation of the ground truth image $y$, $\mu_{\hat{y}}$ and $\sigma_{\hat{y}}$ represent the mean intensity and standard deviation of the demosaiced image $\hat{y}$, and $\sigma_{y\hat{y}}$ is the covariance between $y$ and $\hat{y}$. $C_1 = (k_1 L)^2$ and $C_2 = (k_2 L)^2$ are two constants used to keep the equation stable, usually with $k_1 = 0.01$, $k_2 = 0.03$, and $L = 255$.
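For reference, a direct NumPy implementation of Equation (11) looks as follows.

```python
import numpy as np

def cpsnr(gt, rec):
    """CPSNR of Equation (11): mean squared error averaged over all three
    color channels of 8-bit images, converted to decibels."""
    gt = gt.astype(np.float64)
    rec = rec.astype(np.float64)
    cmse = np.mean((gt - rec) ** 2)   # average over H, W, and 3 channels
    return 10.0 * np.log10(255.0 ** 2 / cmse)
```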

5.2. Image Demosaicing Test

In this section, we demonstrate the effectiveness of the proposed method by comparing it with different demosaicing methods: the bilinear [22], TL [25], HQL [23], Zhang's [21], LDI-NAT [27], ARI [29], MLRI [28], RI-GBTF [30], and DnCNN [31] methods.

Table 2 shows the CPSNR and SSIM of the test results on the Kodak database for the different methods, and Figure 8 shows the corresponding box plots for easier comparison. Table 3 and Figure 9 show the same for the McMaster database. The proposed method achieves higher CPSNR and SSIM values, indicating better performance than the other methods.

For further comparison, Figure 10 shows the reconstructed images produced by the different methods for the 19th image of the Kodak database. The portion marked by the black box (the fence) is enlarged for clearer comparison; this part has obvious vertical textures and is prone to artifacts. Residual images (i.e., the differences between the ground truth image and the demosaiced images) for the enlarged portion are also shown. From the reconstructed and residual images, we can see that, compared with the other methods, the proposed method more effectively suppresses artifacts, especially at tiny edges and corner areas.

Figure 11 shows the reconstructed images for the 22nd image of the Kodak database. The portion marked by the black box (the window) is enlarged; this part is prone to color stripes and zippering. The residual images are also shown for easier comparison. Most methods obtain satisfactory results in the smooth areas but may produce wrong colors at the edges, whereas the results of the proposed method show relatively fewer artifacts and color stripes.

Figures 12 and 13 show the reconstructed images for the 1st and 12th images of the McMaster database, respectively. Again, the marked portions are enlarged and the corresponding residual images are shown for clear comparison. Compared with the other methods, the proposed method recovers the images with fewer artifacts, especially at tiny edges, which demonstrates its validity and performance.

6. Discussion

In this work, we proposed a new GAN-based method for image demosaicing that aims to reconstruct the full-color image more effectively. One of the challenges of this task is recovering the high-frequency information in the image, such as edges and corners: many related algorithms handle the smooth parts of the image well, yet artifacts, zippering, and color stripes still appear in the high-frequency parts. We redesigned the generator and discriminator of GAN and combined the adversarial loss, the feature loss, and the pixel loss to further improve the network performance. Numerical experiments showed that the proposed algorithm can effectively reduce the artifacts at the edges and produce near-real reconstructed images, which can serve as the basis for subsequent image processing such as image recognition and image transmission.

The proposed method produces better recovered color images; however, the learning-based strategy is relatively time-consuming in the training phase, so improving the efficiency of network training is an important direction for further enhancing learning-based technology. In practice, there are many kinds of CFA patterns; we used the Bayer pattern in this paper. Different CFA patterns may affect the reconstructed image differently, so we will test the network with differently designed CFA images in the near future. We also assumed that the CFA images are noiseless, whereas images from real cameras may be affected by noise; we will therefore try combining image demosaicing and denoising in the future. Finally, the current work focuses on directly generating the demosaiced images with a neural network; we will explore combining traditional demosaicing algorithms with the neural network in future work.

7. Conclusions

In this paper, we proposed an image demosaicing method based on GAN. The generator is designed with an improved U-net architecture to directly generate the demosaiced images. For the discriminator, we used a dense residual network consisting of dense residual blocks with long jump connections and dense connections, which overcomes gradient vanishing and gradient dispersion during training and improves the discriminant ability of the network. In addition, we combined the adversarial loss, the pixel loss, and the feature loss to improve the loss function. The network was trained on images from the Waterloo Exploration Database and tested on the Kodak and McMaster databases. Comparisons among different image demosaicing methods showed that the proposed method better eliminates artifacts in the reconstructed image and, in particular, better restores high-frequency features such as edges and corners.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request. The datasets used are open datasets that are freely available online.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors acknowledge the National Natural Science Foundation of China (Grant no. 41704118) and the Natural Science Basic Research Plan in Shaanxi Province of China (Grant no. 2020JM-446).