Abstract

In recent years, with the rise of the Internet of Things (IoT), many smart technologies, such as autonomous vehicles, smart healthcare, and urban surveillance, require a huge number of images of high quality and resolution. Currently, image superresolution reconstruction technologies are widely used to obtain high-quality images. Unfortunately, the existing methods generally treat the whole image uniformly without highlighting foreground information and therefore lack visual focus; they also make poor use of shallow features and require numerous training parameters. In this paper, we propose a feature extraction module that focuses on foreground information: the parallel attention module (PAM). PAM computes channel and spatial attention in parallel, feeds the obtained attention values into a gated network, and dynamically adjusts the weights of both using a nonuniform joint loss, so that the module focuses on image foreground information and detail features and improves the reconstructed image's foreground sharpness. To further improve performance, we connect multiple PAM modules in series with skip connections and call the resulting network PAMNet. PAMNet makes better use of shallow residual features, and its reconstructed images are closer to the ground truth. Thereby, applications in urban image processing IoT systems can obtain high-resolution images more quickly and accurately. Comprehensive experimental results show that PAMNet performs better than state-of-the-art technologies.

1. Introduction

With the rapid development of artificial intelligence (AI) [1-5] and 5G [6, 7], many emerging technologies that meet people's aspirations for a better life, such as the Internet of Things [8-17], blockchain [18-21], autonomous vehicles [22-24], smart healthcare [25-31], and urban surveillance, are developing very quickly. Among these smart technologies, image processing IoT applications such as autonomous vehicles, smart healthcare, and urban surveillance are playing important roles in the upcoming smart society. Figure 1 shows the application of urban IoT systems.

However, due to the heterogeneous properties of smart camera devices and the complicated network environment, the smart applications deployed in the remote cloud can often only obtain low-resolution images, which largely limits the usability of these smart applications. For example, (1) high-speed cars need to recognize the contents of road signs as early as possible, but because of the long shooting distance and the small size of road signs, the captured low-resolution images must be converted into high-resolution images with the help of image superresolution methods. (2) In suburban community hospitals, the transmitted low-resolution images must be superresolved to improve the accuracy of doctors' remote diagnoses, since the capture equipment is of limited quality. (3) The police often track the trajectories of suspects through urban surveillance systems, and after image superresolution reconstruction, they obtain a clearer picture of a suspect's appearance and characteristics, speeding up crime solving. In summary, image superresolution reconstruction has broad applications in urban IoT systems.

Single image superresolution (SISR) is the task of generating high-resolution images using a single low-resolution image [32]. SISR algorithms are divided into three main categories: interpolation-based methods [33], reconstruction-based methods [34, 35], and learning-based methods [36, 37]. Learning-based methods are one of the most widely used methods at present. In particular, with the development of deep learning and generative adversarial networks, image superresolution has made great progress.

Dong et al. [38] proposed SRCNN, which realized end-to-end superresolution image reconstruction and achieved better performance than previous methods. However, its simple network structure limits its ability to extract features, and the MSE loss used by SRCNN stresses improving the objective indices of the image while ignoring its subjective effect, so the detailed features of the reconstructed images are blurred. VDSR, a deep model based on residual learning proposed by Kim et al. [39], improves model performance by introducing a residual structure, but it suffers from problems such as a large number of training parameters and an unclear background in the reconstructed images. EDSR, proposed by Lim et al. [40], removes the BN layers and stacks more layers, improving the reconstructed image quality while reducing the memory consumed by the BN layers. However, since only L1 loss is used for training, the objective indices of the reconstructed images are low.

Thanks to the generative adversarial networks proposed by Goodfellow et al. [41], the image superresolution task has opened a new chapter dominated by generative adversarial structures. SRGAN, proposed by Ledig et al. [42], uses generative adversarial networks for image superresolution together with perceptual loss and adversarial loss to improve the realism of the reconstructed image, making the reconstructed image and the ground truth closer in semantics and style. However, the reconstructed image loses some high-frequency information because only MSE loss is used to train the generator. ESRGAN, proposed by Wang et al. [43], removes the BN layers of SRGAN and introduces dense connections to avoid artifacts; VGG features before activation are used to improve the perceptual loss and make the edges and details of the reconstructed images clearer. The idea of the relativistic GAN [44] is adopted so that the discriminator judges the probability that real images are more realistic than generated images, greatly enhancing the subjective effect of the reconstructed images. Nevertheless, ESRGAN has many parameters and a long training time. RFB-ESRGAN, proposed by Shang et al. [45], introduces a multiscale receptive field module to extract edge features of images and alternately uses nearest-neighbor interpolation [46] and pixel shuffle [47] in the upsampling module to promote the information interaction between network space and depth. However, although the asymmetric convolutions in the multiscale module reduce the number of parameters, they affect the accuracy of feature extraction, which is not conducive to restoring the original image's detailed features.

Due to the good performance of attention mechanisms in computer vision tasks such as image classification [48], object detection [49], and semantic segmentation [50], Zhang et al. [51] first introduced channel attention into the image superresolution reconstruction task and proposed RCAN, which highlights the foreground information of the reconstructed images to some extent. The SAN proposed by Dai et al. [52] uses a second-order attention network to capture long-range spatial features, leveraging the underlying image features, and the color of the reconstructed image is closer to the original image. Liu et al. [53] proposed RFANet, which builds on EDSR with an RFA module that exploits shallow residual features to achieve a good balance between model performance and parameter count, and an ESA spatial attention module that extracts spatial features using strided convolution and pooling, rather than dilated convolution, for dimensionality reduction, avoiding the loss of image detail caused by dilated convolution and achieving better results.

Although the above methods have achieved good results in image superresolution tasks, there remain problems such as the image foreground not being highlighted and a lack of visual focus. In this paper, we propose the parallel attention module (PAM) and, using it as the basis, introduce skip connections and group convolution to build PAMNet, aiming to design a high-performance, high-quality image superresolution model that attends more to image foreground information and detailed features while having a smaller number of training parameters. The main contributions of this paper are as follows:
(1) We propose a generic module named PAM, which computes channel attention and spatial attention in parallel on the residual branch of a residual block and then dynamically adjusts the weights of the two using a gated network and a nonuniform joint loss, so that the PAM module focuses on the attention domain with the higher weight and can thus extract foreground information deeply.
(2) Based on the PAM module, we propose PAMNet. By concatenating multiple PAM modules and introducing skip connections, the residual features from all preceding PAM modules are fed directly to the PAM module at the end of the network for aggregation, which leverages the shallow residual features and brings the reconstructed images closer to the ground truth. In addition, by using group convolution, PAMNet is more lightweight than other methods.

The remainder of the paper is organized as follows. Section 2 describes the PAM module, PAMNet, and the loss function of this paper in detail. Section 3 verifies the effectiveness and generality of the proposed method through ablation and comparison experiments. Finally, Section 4 presents the conclusion of this study.

2. Method

2.1. PAM

The PAM module proposed in this paper can directly replace the residual block in the ResNet [54] backbone network, compute channel and spatial attention in parallel, splice the results in the channel dimension, and feed them into the gated network to extract the weight coefficients. In the backpropagation process, the channel attention and spatial attention weights are dynamically adjusted by nonuniform joint loss, focusing on extracting image foreground information in the attention domain with higher weights. The specific structure of PAM is shown in Figure 2.

In computing channel attention, a structure similar to SENet [55] is used, with the fully connected layers in SENet replaced by convolutions, which preserves the image's spatial features. The specific computation of channel attention is given by Eq. (1):

$$F_{\mathrm{CA}} = \mathrm{CA}(F(x)) \cdot F(x), \tag{1}$$

where $x$ represents the input of the residual block, $F(x)$ represents the output after computing the residuals, $\mathrm{CA}(\cdot)$ represents computing channel attention, and $F_{\mathrm{CA}}$ represents the final output of the channel attention. Meanwhile, in this paper, $C$ represents the channel dimension of the feature map, $H$ represents the height of the feature map, and $W$ represents the width of the feature map, so that the three dimensions of a feature map can be represented as $C \times H \times W$.
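To make this concrete, the following is a minimal PyTorch sketch of such an SE-style channel-attention branch, with 1×1 convolutions standing in for SENet's fully connected layers; the reduction ratio and channel width used here are illustrative assumptions rather than the settings of this paper.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention with 1x1 convolutions in place of fully connected layers."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)            # C x H x W -> C x 1 x 1
        self.excite = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),                                  # per-channel weights in (0, 1)
        )

    def forward(self, residual: torch.Tensor) -> torch.Tensor:
        # residual: output F(x) of the residual branch, shape (N, C, H, W)
        weights = self.excite(self.squeeze(residual))
        return residual * weights                          # F_CA = CA(F(x)) * F(x)


if __name__ == "__main__":
    feat = torch.randn(1, 64, 48, 48)
    print(ChannelAttention(64)(feat).shape)                # torch.Size([1, 64, 48, 48])
```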

Referring to the HDC idea [56], PAM computes spatial attention using a three-layer cascade of dilated convolutions with dilation rates of 1, 2, and 3. First, we use a convolution to downscale the feature map with input dimensions $C \times H \times W$ into a feature map with dimensions $(C/r) \times H \times W$, where $r$ is the downscaling factor (a fixed value in this paper). Second, the downscaled feature map is convolved with the three different dilation rates to expand the receptive field with a minimum number of parameters in a finite number of steps, ensuring the continuity of the receptive field and avoiding the information loss caused by pooling. Finally, we use a convolution to fuse the information of the different channels of the feature map and apply Sigmoid activation to obtain feature map weights of dimension $1 \times H \times W$, which are multiplied onto the input feature map to focus on the image foreground information. The specific computation of spatial attention is given by Eq. (2):

$$F_{\mathrm{SA}} = \mathrm{SA}(F(x)) \cdot F(x), \tag{2}$$

where $x$ represents the input of the residual block, $F(x)$ represents the output after computing the residuals, $\mathrm{SA}(\cdot)$ represents computing spatial attention, and $F_{\mathrm{SA}}$ represents the final output of spatial attention.
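A corresponding sketch of the spatial-attention branch is given below, assuming 1×1 convolutions for the channel-reduction and fusion steps and an illustrative downscaling factor; only the dilation rates 1, 2, and 3 are taken from the description above.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention via a three-layer HDC-style cascade of dilated convolutions."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        mid = channels // reduction
        self.reduce = nn.Conv2d(channels, mid, kernel_size=1)        # channel downscaling
        self.hdc = nn.Sequential(                                     # dilation rates 1, 2, 3
            nn.Conv2d(mid, mid, 3, padding=1, dilation=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=2, dilation=2), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=3, dilation=3), nn.ReLU(inplace=True),
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(mid, 1, kernel_size=1),                         # fuse channels -> 1 x H x W
            nn.Sigmoid(),
        )

    def forward(self, residual: torch.Tensor) -> torch.Tensor:
        weights = self.fuse(self.hdc(self.reduce(residual)))          # shape (N, 1, H, W)
        return residual * weights                                     # F_SA = SA(F(x)) * F(x)
```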

After obtaining the channel attention and spatial attention with the above method, the two are spliced in the channel dimension to obtain the input of the gated network. Then, we use a convolution to fuse the information and reduce its dimension. Next, two convolutions for feature extraction and a Sigmoid activation are used to obtain an activation output $\alpha$ with values in the range (0, 1). Finally, the final output $F_{\mathrm{out}}$ is obtained by multiplying $\alpha$ and $1-\alpha$ by $F_{\mathrm{CA}}$ and $F_{\mathrm{SA}}$ as a linear combination of coefficients. Meanwhile, this weight is continuously updated during backpropagation, and the weights of channel attention and spatial attention are dynamically assigned as learning progresses, focusing on extracting image foreground information in the attention domain with the higher weight. The computation is given by Eq. (3):

$$F_{\mathrm{out}} = \alpha \cdot F_{\mathrm{CA}} + (1 - \alpha) \cdot F_{\mathrm{SA}}. \tag{3}$$
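The gating step can be sketched as follows; the exact number and shapes of the convolutions inside the gate are assumptions, but the (0, 1) activation and the linear combination of the two attention outputs follow Eq. (3).

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse channel- and spatial-attention outputs with a learned gate alpha in (0, 1)."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),          # fuse the spliced input
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),   # two feature-extraction convs
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, f_ca: torch.Tensor, f_sa: torch.Tensor) -> torch.Tensor:
        alpha = self.gate(torch.cat([f_ca, f_sa], dim=1))
        return alpha * f_ca + (1.0 - alpha) * f_sa                     # Eq. (3)
```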

The specific structure of the gated network module is shown in Figure 3.

2.2. PAMNet

PAMNet is built with the PAM module as the core unit and postupsampling as the base structure, using skip connections, group convolution, and feature fusion. The network comprises a downsampling layer, a feature extraction layer, and an upsampling layer. The downsampling layer uses serial convolutions to initially extract image color, contour, and texture features, and the upsampling layer uses pixel shuffle to enlarge the image. Benefiting from the PAM modules and skip connections, PAMNet focuses more on reconstructing image foreground information and can leverage shallow residual features, highlighting the visual focus of the reconstructed images. The specific structure of PAMNet is shown in Figure 4.

The downsampling layer initially extracts the image's underlying features by two serial convolutions and increases the number of feature map channels. The feature extraction layer and the upsampling layer are the core of PAMNet. The feature extraction layer uses PAM as the basic unit and serially connects multiple PAM modules to extract detailed features. The basic structure of the traditional residual block is two identical convolutions; serializing multiple such blocks induces many parameters and complex computations, which seriously slows down the model's training. Therefore, in this paper, we use group convolution in the PAMNet feature extraction layer to reduce the number of parameters and add $1 \times 1$ convolutions to fuse the information across groups. Taking an input feature map with $C_{\mathrm{in}}$ channels, an output feature map with $C_{\mathrm{out}}$ channels, and a $k \times k$ convolution kernel as an example, the number of parameters of a residual block is given by Eq. (4):

$$P_1 = 2 k^2 C_{\mathrm{in}} C_{\mathrm{out}}. \tag{4}$$

When group convolution with group number $g$ and $1 \times 1$ convolutions are used instead, the number of parameters is reduced to Eq. (5):

$$P_2 = 2\left(\frac{k^2 C_{\mathrm{in}} C_{\mathrm{out}}}{g} + C_{\mathrm{in}} C_{\mathrm{out}}\right). \tag{5}$$
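The two counts of Eqs. (4) and (5) can be checked with a few lines of Python; the channel width of 64 is an illustrative assumption, and biases are ignored.

```python
def residual_block_params(c_in: int, c_out: int, k: int) -> int:
    # Eq. (4): two ordinary k x k convolutions
    return 2 * k * k * c_in * c_out

def grouped_block_params(c_in: int, c_out: int, k: int, g: int) -> int:
    # Eq. (5): two grouped k x k convolutions plus 1 x 1 convolutions fusing the groups
    return 2 * (k * k * c_in * c_out // g + c_in * c_out)

if __name__ == "__main__":
    p1 = residual_block_params(64, 64, 3)
    p2 = grouped_block_params(64, 64, 3, 16)
    print(p1, p2, f"{100 * p2 / p1:.2f}%")   # 73728 12800 17.36%
```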

According to reference [57], we take $g = 16$ and $k = 3$; substituting into Eqs. (4) and (5) gives $P_2 / P_1 = (k^2/g + 1)/k^2 \approx 17.36\%$, so the number of parameters using grouped convolution is only 17.36% of that of ordinary convolution, which simplifies the model while significantly increasing the training speed.

Since the shallow residual features must pass through multiple computations before reaching the last PAM module, the deeper layers of the network fail to leverage the shallow information and lose some of the image's shallow features, which hinders the reconstruction of the image's color and texture information and severely limits the model's image reconstruction capability. Existing SR methods, such as RFANet, only use skip connections inside the RFA module, which fails to preserve the shallow features of the image completely. In this study, we introduce skip connections within PAMNet: the residual features of all preceding PAM modules are fed to the last PAM module in the feature extraction layer, with a convolution reducing the dimensionality and aggregating the shallow features. Compared with simply stacking multiple residual blocks, PAMNet retains the underlying image information so that it can both participate in the subsequent computation to further extract high-level semantic information and be sent directly to the end PAM module without interference, thus retaining the underlying features while focusing on extracting high-level image information.
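For orientation, the overall data flow of PAMNet can be sketched as below, with a plain residual block standing in for the full PAM module; the channel width, block count, and layer shapes are illustrative assumptions rather than the configuration of this paper.

```python
import torch
import torch.nn as nn

class PAMStub(nn.Module):
    """Placeholder for the PAM block of Section 2.1 (a plain residual block here)."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
    def forward(self, x):
        return x + self.body(x)

class PAMNetSketch(nn.Module):
    def __init__(self, channels: int = 64, n_blocks: int = 5, scale: int = 4):
        super().__init__()
        # downsampling layer: two serial convolutions extracting shallow features
        self.head = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.blocks = nn.ModuleList([PAMStub(channels) for _ in range(n_blocks - 1)])
        # convolution aggregating the skip-connected outputs of all preceding blocks
        self.aggregate = nn.Conv2d(channels * (n_blocks - 1), channels, kernel_size=1)
        self.last_block = PAMStub(channels)
        # upsampling layer: pixel shuffle to the target magnification
        self.upsample = nn.Sequential(
            nn.Conv2d(channels, channels * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
            nn.Conv2d(channels, 3, 3, padding=1),
        )

    def forward(self, lr: torch.Tensor) -> torch.Tensor:
        feat = self.head(lr)
        skips = []
        for block in self.blocks:
            feat = block(feat)
            skips.append(feat)                       # keep every block's residual features
        feat = self.aggregate(torch.cat(skips, dim=1))
        feat = self.last_block(feat)                 # the end block sees all shallow features
        return self.upsample(feat)
```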

The upsampling layer is the final layer of the network and is responsible for scaling the image to a specified magnification. Commonly used upsampling methods include linear interpolation, deconvolution [58], transposed convolution [59], subpixel convolution [60], and metaupscale [61]. Interpolation methods are the fastest, but the reconstructed images are blurred and have low definition. Deconvolution and transposed convolution have a receptive field only as large as the magnification factor, which is not conducive to obtaining global features, and the reconstructed images are prone to checkerboard artifacts. Subpixel convolution has a larger receptive field and more contextual information, and the details of the reconstructed image are clear. Metaupscale does not need the scale factor to be determined in advance, so the image can be continuously enlarged by any factor with high definition, and it is often used for video superresolution reconstruction. Because subpixel convolution is fast and reconstructs high-quality images, pixel shuffle is used for upsampling in this paper.
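The following small example illustrates how sub-pixel (pixel-shuffle) upsampling rearranges channels into space; the channel width of 64 is an assumption for illustration.

```python
import torch
import torch.nn as nn

scale = 4
upsampler = nn.Sequential(
    nn.Conv2d(64, 64 * scale * scale, kernel_size=3, padding=1),   # expand channels by scale^2
    nn.PixelShuffle(scale),                                        # (N, 64*16, H, W) -> (N, 64, 4H, 4W)
)

feat = torch.randn(1, 64, 32, 32)
print(upsampler(feat).shape)                                       # torch.Size([1, 64, 128, 128])
```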

2.3. Loss Function

Similar to existing methods [42, 43, 51-53], this paper trains the network model with a generative adversarial structure and optimizes the model parameters jointly through the discriminator loss and the generator loss, where the discriminator loss is defined as Eq. (6):

$$L_D = -\mathbb{E}_{x_r}\left[\log D(x_r, x_f)\right] - \mathbb{E}_{x_f}\left[\log\left(1 - D(x_f, x_r)\right)\right], \tag{6}$$

where $x_r$ is the real image, $x_f$ is the reconstructed image, and $D(x_r, x_f)$ computes the difference between the real image and the reconstructed image, restricted by a Sigmoid so that $D(x_r, x_f) \in (0, 1)$.
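A hedged sketch of such a relativistic-style discriminator objective is shown below, with the discriminator's raw logits passed in; the exact formulation in this paper may differ in detail.

```python
import torch

def d_relativistic(logits_a: torch.Tensor, logits_b: torch.Tensor) -> torch.Tensor:
    # Sigmoid of the difference between logits, so the output lies in (0, 1)
    return torch.sigmoid(logits_a - logits_b.mean())

def discriminator_loss(logits_real: torch.Tensor, logits_fake: torch.Tensor) -> torch.Tensor:
    d_rf = d_relativistic(logits_real, logits_fake)   # real judged against reconstructed
    d_fr = d_relativistic(logits_fake, logits_real)   # reconstructed judged against real
    return -(torch.log(d_rf + 1e-8).mean() + torch.log(1.0 - d_fr + 1e-8).mean())
```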

Unlike the above methods, the generator loss in this paper comprises a nonuniform joint loss, an adversarial loss, and a content loss. The nonuniform joint loss constrains the network to learn image color and texture features while extracting more discriminative features and detail information, focusing more on the reconstruction of image foreground information.

The nonuniform joint loss is based on L1 loss: the reconstructed image and the original image are fed into a pretrained VGG-19 network, the L1 loss $L_{\mathrm{low}}$ is computed on the features before the first pooling layer and the L1 loss $L_{\mathrm{high}}$ on the features before the last pooling layer, and the weights of $L_{\mathrm{low}}$ and $L_{\mathrm{high}}$ are adjusted to constrain the generator to extract the underlying features while learning more detail information and discriminative features. The specific computation is given by Eq. (7):

$$L_{\mathrm{NU}} = \lambda_{\mathrm{low}} L_{\mathrm{low}} + \lambda_{\mathrm{high}} L_{\mathrm{high}}, \tag{7}$$

where $\lambda_{\mathrm{low}}$ is the weight of $L_{\mathrm{low}}$ and $\lambda_{\mathrm{high}}$ is the weight of $L_{\mathrm{high}}$; both are fixed values in this paper.
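A sketch of this loss is given below, assuming the two feature taps correspond to torchvision's vgg19 layers before the first and the fifth (last) pooling layers; the weights are placeholders, and input normalization to ImageNet statistics is omitted for brevity.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class NonuniformJointLoss(nn.Module):
    def __init__(self, w_low: float = 0.2, w_high: float = 0.8):
        super().__init__()
        features = vgg19(weights="IMAGENET1K_V1").features.eval()
        self.low = features[:4]       # conv1_1..relu1_2, before the first pooling layer
        self.high = features[:36]     # everything before the last pooling layer
        for p in self.parameters():
            p.requires_grad_(False)   # the pretrained VGG-19 stays frozen
        self.w_low, self.w_high = w_low, w_high
        self.l1 = nn.L1Loss()

    def forward(self, sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
        loss_low = self.l1(self.low(sr), self.low(hr))
        loss_high = self.l1(self.high(sr), self.high(hr))
        return self.w_low * loss_low + self.w_high * loss_high   # weighted sum as in Eq. (7)
```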

The adversarial loss is computed as in [11], and the specific computation is defined by Eq. (8):

$$L_{\mathrm{adv}} = -\mathbb{E}_{x_r}\left[\log\left(1 - D(x_r, x_f)\right)\right] - \mathbb{E}_{x_f}\left[\log D(x_f, x_r)\right]. \tag{8}$$

The content loss computes the pixel difference between the real image and the reconstructed image using both L1 loss and L2 loss. Methods such as RFANet use only L1 loss for the content loss, which causes the loss of some high-frequency information in the reconstructed images; moreover, L1 loss is prone to sparse solutions and is not differentiable at zero, increasing the instability of GAN training. SRGAN uses only L2 loss for the content loss, which is influenced by outliers; although the reconstructed image has a higher peak signal-to-noise ratio (PSNR (dB)), it is prone to artifacts and its visual effect is poor, which runs against the original intention of image superresolution. PAMNet computes the content loss using both L1 loss and L2 loss to enhance the method's robustness while reducing sparse solutions. The specific computation is given by Eq. (9):

$$L_{\mathrm{con}} = \eta_1 \left\lVert I_{\mathrm{HR}} - I_{\mathrm{SR}} \right\rVert_1 + \eta_2 \left\lVert I_{\mathrm{HR}} - I_{\mathrm{SR}} \right\rVert_2^2, \tag{9}$$

where $I_{\mathrm{HR}}$ represents the ground truth, $I_{\mathrm{SR}}$ represents the reconstructed image, and $\eta_1$ and $\eta_2$ represent the weights of the L1 loss and L2 loss, respectively; both are fixed values in this study.

In summary, the generator loss is defined by Eq. (10):

$$L_G = \lambda_{\mathrm{adv}} L_{\mathrm{adv}} + \lambda_{\mathrm{NU}} L_{\mathrm{NU}} + \lambda_{\mathrm{con}} L_{\mathrm{con}}, \tag{10}$$

where $\lambda_{\mathrm{adv}}$, $\lambda_{\mathrm{NU}}$, and $\lambda_{\mathrm{con}}$ represent the weights of the adversarial loss, nonuniform joint loss, and content loss, respectively; all three are fixed values in this paper.
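Putting the pieces together, a sketch of the generator objective of Eqs. (8)-(10) might look as follows; the relativistic adversarial term and all weights are illustrative assumptions, and the nonuniform joint loss is passed in as a callable (for example, the NonuniformJointLoss sketch above).

```python
import torch
import torch.nn.functional as F

def content_loss(sr, hr, w_l1=0.5, w_l2=0.5):
    # Eq. (9)-style mix of L1 and L2 pixel losses (weights are placeholders)
    return w_l1 * F.l1_loss(sr, hr) + w_l2 * F.mse_loss(sr, hr)

def adversarial_loss(d_real, d_fake):
    # Relativistic generator term: reconstructed images should look "more real" than real ones
    real = torch.sigmoid(d_real - d_fake.mean())
    fake = torch.sigmoid(d_fake - d_real.mean())
    return -(torch.log(1.0 - real + 1e-8).mean() + torch.log(fake + 1e-8).mean())

def generator_loss(sr, hr, d_real, d_fake, nonuniform_loss_fn,
                   w_adv=5e-3, w_nu=1.0, w_con=1e-2):
    # Eq. (10)-style weighted combination (weights are placeholders)
    return (w_adv * adversarial_loss(d_real, d_fake)
            + w_nu * nonuniform_loss_fn(sr, hr)
            + w_con * content_loss(sr, hr))
```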

3. Experiment

3.1. Settings

Following the existing methods [40, 43, 51-53], to verify the effectiveness of the proposed method, we select 3450 images from DIV2K [62] and Flickr2K [63] as the training dataset and, after cropping and mirror-flipping the original images, randomly select 60,000 subimages as the training images. Meanwhile, we select Set5 [64], Set14 [65], BSD100 [66], and Urban100 [67] as the test datasets. The main parameters of the network are shown in Table 1.

The experiments are implemented in PyTorch with the following hardware: an Intel i7-9700 CPU, an NVIDIA RTX 2080 Ti GPU, and 32 GB of RAM.

3.2. Results

In this paper, we focused on SISR reconstruction with a ×4 scale factor and used Set5, Set14, BSD100, and Urban100 as the test sets to compare with existing image superresolution methods from both subjective and objective aspects. We also embedded the PAM module into the backbone networks of SRGAN and ESRGAN to verify the effectiveness and generality of the module. Meanwhile, PSNR and SSIM were used as objective indices to quantify the quality of the reconstructed images.

3.2.1. Effectiveness and Generality of PAM

This section verifies the effectiveness and generality of the PAM module by replacing the basic residual block of SRGAN and the RRDB structure in ESRGAN with the PAM module while keeping the other structures and loss functions of the original networks unchanged. The resulting models are called PAM-SRGAN and PAM-ESRGAN, and images from Set5, Set14, BSD100, and Urban100 were selected for analysis. The results are shown in Figure 5.

The performance of SRGAN and ESRGAN with embedded PAM modules on different datasets is shown in Table 2.

As seen in Table 2, PAM-SRGAN improves PSNR by 1.84 dB, 1.91 dB, 0.75 dB, and 0.54 dB over SRGAN on the four test sets; PAM-ESRGAN improves PSNR by 0.14 dB, 0.03 dB, 0.04 dB, and 0.21 dB over ESRGAN on the four test sets. The results show that the PAM module improves the performance of SRGAN and ESRGAN networks with good generality.

3.2.2. Performance of PAMNet

To give PAMNet the best performance, we performed the following experiments on the number of PAM modules in the feature extraction layer. Let the total number of PAM modules in PAMNet be $n$, with $n \in \{3, 5, 7, 9, 11\}$. Keeping the other structures in PAMNet unchanged, the test results on different datasets are shown in Table 3.

As seen in Table 3, PAMNet outperforms the SOTA methods RFB-ESRGAN (32.66 dB, 28.88 dB, 27.79 dB, and 26.92 dB) and RFANet (32.72 dB, 28.91 dB, 27.77 dB, and 26.89 dB) when $n$ is sufficiently large.

As the number of PAM modules ($n$) increases, the PSNR of PAMNet's reconstructed images on the different datasets grows accordingly. Taking the Urban100 dataset as an example, the relationship is shown in Figure 6.

Figure 6 shows that the PSNR of the reconstructed images does not continue to improve significantly as the number of PAM modules ($n$) increases, so, for a good balance between model performance and complexity, PAMNet fixes $n$ at the value where the improvement levels off. After determining the number of PAM modules, one image from each of the Set5, Set14, BSD100, and Urban100 test sets was selected for analysis, and the results are shown in Figure 7.

As can be seen in Figure 7, thanks to the gated network and nonuniform joint loss in PAMNet, our method produces sharper foreground information than existing methods (Figures 7(a) and 7(b)), and the detailed texture features of the reconstructed images are closer to the ground truth (Figures 7(c) and 7(d)). In addition, by introducing skip connections, PAMNet largely preserves the color and texture features of the image, and the overall image sharpness is essentially on par with SOTA methods such as RFB-ESRGAN and RFANet.

To verify the effectiveness of PAMNet from an objective perspective, we selected PSNR and SSIM as objective indices. The PSNR and SSIM of each image in Figure 7 are shown in Table 4.

As shown in Table 4, the PSNR and SSIM of the images reconstructed by PAMNet outperform those of the other methods; only RFB-ESRGAN achieves a slightly higher PSNR than PAMNet on Figure 7(a). To verify the generalization performance of PAMNet, the PSNR and SSIM of different methods on different test sets are shown in Table 5.

As shown in Table 5, the PSNR of the images reconstructed by PAMNet outperforms the other methods on every test set, improving by 0.01 dB, 0.02 dB, 0.04 dB, and 0.04 dB over RFB-ESRGAN and by 0.07 dB, 0.05 dB, 0.02 dB, and 0.01 dB over RFANet on the four test sets, while the SSIM on the Set5 and Urban100 datasets is slightly lower than that of RFB-ESRGAN. The experimental results show that, thanks to the PAM module and the nonuniform joint loss, PAMNet can effectively extract image foreground information, improve the PSNR and SSIM of the reconstructed images, and enhance the foreground clarity while keeping the background of the reconstructed images clear.

3.2.3. The Effect of Skip Connection

In this section, the skip connection in PAMNet was removed, and the other structures and loss functions were kept unchanged to investigate the effect of skip connection on PAMNet. The experimental results are presented in Table 6.

As shown in Table 6, after removing the skip connections in PAMNet, the PSNR of the reconstructed images on the different test sets decreased by 0.52 dB, 0.57 dB, 0.16 dB, and 0.24 dB, and the performance of PAMNet degraded significantly, because removing the skip connections constrains the model's utilization of shallow features. The experimental results show that the skip connections have a significant impact on PAMNet and that using them improves the utilization of shallow features, thereby enhancing the comprehensive performance of the model.

3.2.4. Model Complexity

To evaluate the complexity of the PAMNet model, it was compared with existing SR methods: SRCNN, SRGAN, VDSR, EDSR, DBPN, SAN, ESRGAN, RFB-ESRGAN, and RFANet. The results are shown in Figure 8.

As seen in Figure 8, PAMNet has fewer parameters and better performance than DBPN, RFANet, SAN, and ESRGAN. Compared with RFB-ESRGAN, PAMNet has slightly more parameters but slightly better overall performance.

4. Conclusion

In this paper, we proposed a generic PAM module for image superresolution reconstruction to extract foreground information and high-frequency features of images. The module computes channel attention and spatial attention in parallel, uses a gated network to extract the weight coefficients of the two, and cooperates with a nonuniform joint loss to dynamically modify the two weights during backpropagation, so that the network attends more to the extraction of foreground information and discriminative features. To fully exploit the PAM module, we further proposed PAMNet, which connects multiple PAM modules in series. Ablation experiments verified the effectiveness and generality of the PAM module and the necessity of the skip connections in PAMNet. In comparison experiments with existing state-of-the-art image superresolution methods, PAMNet achieves an average PSNR improvement of 0.4 dB and an average SSIM improvement of 0.005 on the different datasets, verifying that PAMNet strikes a good balance between performance and model complexity. By using PAMNet, many applications of urban IoT systems, such as autonomous vehicles, smart healthcare, and urban surveillance, can generate clearer and more foreground-focused high-resolution images than existing image superresolution methods, improving the reliability of urban IoT systems and satisfying people's vision of a better life. Limited by the training equipment and the scope of this research, only image superresolution at the ×4 magnification factor has been studied. In the future, we will continue to research faster image superresolution methods with greater magnification factors, so that various smart technologies can continue to benefit humanity and all families.

Data Availability

The open datasets used to support the findings of this study are included within the article. The link is as follows: https://data.vision.ee.ethz.ch/cvl/DIV2K/.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is partially supported by Telecommunications Advancement Foundation (Japan) Research Grant, RIEC Nationwide Cooperative Research Projects, Research Institute of Electrical Communication, Tohoku University, Japan, H31/B18, and ROIS NII Open Collaborative Research 2021 (21FA03).