Abstract

Image superresolution (SR) is a classical problem in computer vision. Recently, elaborately designed convolutional neural networks (CNNs) have demonstrated remarkable effectiveness on image SR. However, most previous works lack effective exploration of structural information, which plays a critical role in image quality. In this paper, we find that a hierarchical design can effectively restore structural information and devise a multilevel feature exploration network for image SR (MFSR). Specifically, we design an encoder-decoder architecture to concentrate on structural information from different levels and devise a spatial attention mechanism to address the inherent correlation among features for effective restoration. Experimental results show that the proposed MFSR restores more correct edges and lines and achieves better objective and subjective performance than state-of-the-art methods, with higher PSNR/SSIM results, indicating its effectiveness on structural information restoration.

1. Introduction

Image superresolution (SR), as a traditional image processing problem, is widely involved in advanced vision applications, such as medical image enhancement, slope collapse detection, and art image classification [1]. Given a low-resolution (LR) image, the task of image SR is to find a corresponding high-resolution (HR) instance with a satisfactory visual experience and more correct textures [2].

Recently, convolutional neural networks (CNNs) have shown amazing performance on image restoration. To the best of our knowledge, the superresolution convolutional neural network (SRCNN) [3] proposed by Dong et al. is the first CNN-based method for image SR. After SRCNN, elaborate network designs were proposed for effective image SR, such as the enhanced deep residual superresolution network (EDSR) [4], the residual dense network for image superresolution (RDN) [5, 6], the deep residual channel attention image superresolution network (RCAN) [7], the deep Laplacian pyramid superresolution network (LapSRN) [8, 9], and the cascading residual network (CARN) [10]. These works concentrate on elaborate designs and utilize different blocks for effective performance. However, very few of them specifically concentrate on structural information restoration.

There are also works that consider structural information as a prior. Fang et al. design a soft-edge assisted network for single image SR, termed SeaNet [11]. Cheng et al. incorporate gradient guidance into structure-preserving superresolution (SPSR) [12] and utilize a generative adversarial network for satisfactory performance. A sparse-mask superresolution network (SMSR) is also investigated by Zhang et al. for efficient inference. Yang et al. devise a deep edge guided recurrent residual network for image SR (DEGREE) [13]. However, structural information is usually applied only as a prior term, and these works largely neglect designing effective architectures to explore edges and textures.

Multiscale structures prove to be an effective design pattern for structural information exploration. The multiscale residual network for image superresolution (MSRN) [14] designs a multiscale residual block for effective feature exploration. Recently, the multiscale dense cross network (MDCN) [15] was also proposed for hierarchical exploration. However, these works fail to capture the inherent correlations among features, which restricts their performance.

The attention mechanism has become a popular design in recent CNNs for different tasks. Channel-wise attention [16, 17] has also been widely adopted in image SR networks, such as the residual-in-residual channel attention network for image SR (RCAN) [7] and the channel-wise and spatial feature modulation network (CSFM) for single image SR [18]. However, channel-wise attention usually relies on a global pooling operation to estimate the information of different channels, which misses the spatial diversity of the feature maps.

In this paper, we design a multilevel feature exploration network for image superresolution, termed MFSR. To effectively explore structural information, we adopt a UNet-style architecture [19] to hierarchically process the image features and concatenate the multiscale features for joint restoration. Besides the UNet-like backbone, we also devise a spatial attention mechanism that considers the inherent correlation among spatial information and removes the global pooling operation of traditional channel-wise attention, which effectively avoids information loss. Based on this spatial attention mechanism, we devise a spatial attention residual block (SARB) for effective structural information exploration. Experimental results show that MFSR restores HR images with more satisfactory textures that are closer to the ground truth. Compared with state-of-the-art methods, especially multiscale structure-based networks, our network achieves much higher peak signal-to-noise ratio (PSNR) [20] and structural similarity (SSIM) [21] results with better visual quality.

The main contributions of this paper are as follows:
(i) We design a multilevel feature exploration network for image superresolution, which can effectively process hierarchical structural information.
(ii) We devise a spatial attention mechanism to explore the inherent correlation among spatial information without a global pooling operation, which effectively avoids information loss. Based on this mechanism, we devise a spatial attention residual block (SARB) for effective structural information exploration.
(iii) Experimental results show that MFSR restores HR images with more satisfactory textures that are closer to the ground truth, with higher PSNR/SSIM results.

The remainder of this paper is organized as follows. Section 2 introduces the related work. Section 3 describes the details of MFSR. Finally, Section 4 presents the experimental results.

2. Related Work

2.1. Image Superresolution

Image SR is a traditional topic in the computer vision area, which has been widely investigated in recent years. Classical SR methods can be generally separated into five types: interpolation-based methods, filter-based methods, patch-based methods [22, 23], optimization-based methods [24, 25], and sparse-coding-based methods [26, 27]. Interpolation aims to build a function that uses neighboring pixel values to estimate the missing pixel values. Cao et al. bring low-rank matrix completion and recovery into image SR and investigate a novel representation of the interpolation function [28]. Similar to interpolation methods, filter-based methods try to find a feasible filter that changes pixel values with the help of neighboring information and generates pleasant visual results. Patch-based methods separate the image into several adjacent patches and optimize the performance of each patch separately to improve the overall image quality. Zhu et al. use deformable patches for single image SR [22]. Yang et al. convert the patches into sparse representations and perform image restoration on raw data [23]. Optimization-based methods cast single image SR as minimizing an energy function and utilize iterative optimization to approximate the solution. Huang and Xia use matrix-variable optimization for fast blind image superresolution [24], and Shi et al. optimize the image quality with low-rank and total variation regularization [25]. Different from the previous categories, sparse-coding-based methods introduce a learning step into image SR and derive a codebook for adaptive image restoration. Shi and Qi consider low-rank sparse representation with self-similarity for image SR [26]. Lu et al. also perform geometry-constrained sparse coding to produce acceptable results.

Convolutional neural networks (CNNs) demonstrate amazing performance on the image SR task. To the best of our knowledge, the superresolution convolutional neural network (SRCNN) [3] is the first CNN-based method for image SR, which brings a great performance improvement over traditional methods. SRCNN follows a sparse-coding-like structure and uses three convolutional layers to represent the feature extraction, nonlinear mapping, and restoration operations, respectively. After SRCNN, numerous CNN-based works with elaborate designs were proposed for superior capacity. A fast version of SRCNN, termed FSRCNN [29], is proposed by Dong et al. to accelerate the inference phase; it achieves better performance than SRCNN with a deeper but faster network. A very deep network with residual connections, termed VDSR [30], also increases the network depth to boost capacity. Kim et al. design a deep recursive network for image SR, termed DRCN [31]. These works use deconvolution to upscale the feature maps and generate the HR image, or preupsample the image before network processing. Shi et al. provide a new perspective on upscaling the feature maps by subpixel convolution and integrate the upsampling operation into the efficient subpixel convolutional neural network (ESPCN) [32]. After ESPCN, most image SR networks adopt subpixel convolution to upscale the feature maps instead of deconvolution. Recent end-to-end networks can be generally separated into single-scale and multiscale methods. Single-scale networks pursue deeper and wider designs to seek higher quantitative performance. The enhanced deep residual network for image SR (EDSR) is a representative method achieving state-of-the-art performance, which stacks numerous residual blocks [33] and removes the batch normalization [34] layers. After EDSR, more well-designed blocks were proposed for image SR with state-of-the-art performance. The dense net for image SR (SRDenseNet) [35] introduces dense connections into image SR. Zhang et al. combine the advantages of dense connections and residual connections and propose the residual dense network (RDN) for image SR [5, 6]. Recently, the residual-in-residual network with channel attention for image SR (RCAN) [7], the residual feature aggregation deep network (RFDN) [36], and other works also achieve state-of-the-art performance. A deep network with component learning (DNCL) was recently presented for fast and accurate image SR. Inspired by traditional filter-based methods, Li et al. propose an adaptive filter learning network termed FilterNet [37].
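To make the subpixel (pixel-shuffle) idea concrete, here is a minimal PyTorch sketch; the channel count and scale are illustrative choices, not values from ESPCN or any other cited network. A convolution expands the channels by a factor of r², and nn.PixelShuffle rearranges them into an r-times larger spatial grid:

```python
import torch
import torch.nn as nn

# Minimal subpixel (pixel-shuffle) upsampling in the spirit of ESPCN.
# The channel count 64 and the scale r = 2 are illustrative choices.
r = 2
upsample = nn.Sequential(
    nn.Conv2d(64, 64 * r * r, kernel_size=3, padding=1),  # expand channels by r^2
    nn.PixelShuffle(r),  # rearrange (C*r^2, H, W) -> (C, H*r, W*r)
)

x = torch.randn(1, 64, 32, 32)  # a dummy feature map
print(upsample(x).shape)        # torch.Size([1, 64, 64, 64])
```

Because the rearrangement is a fixed permutation, the learnable work stays at LR resolution, which is why subpixel convolution is cheaper than deconvolution at the same output size.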

Besides single-scale networks, there are also multiscale methods that concentrate on hierarchical structural information exploration. The Laplacian pyramid image SR network (LapSRN) [8, 9] designs a Laplacian-pyramid-inspired network to progressively restore the residual image and increase the image resolution. The multiscale residual network (MSRN) [14] utilizes a multiscale residual block for accurate image restoration. The multiscale dense cross network (MDCN) [15] upgrades MSRN and achieves state-of-the-art image SR performance. The resolution-aware network (RAN) [38] was recently proposed for simultaneous SR at multiple factors. He et al. consider a multiscale design to accelerate the network, named MRFN [39]. Recently, a deep inception-residual Laplacian pyramid network (IRLP) was also proposed for accurate single image SR [40].

2.2. Attention Mechanism

To the best of our knowledge, channel-wise attention [16, 17] is the first tiny but effective attention block for CNN design. Channel-wise attention omits spatial information and only concentrates on the different importance of channels. The residual-in-residual channel attention network (RCAN) [7], the information multi-distillation network (IMDN) [41], and other recent works adopt channel-wise attention for better restoration performance. Besides channel-wise attention, there are also other effective attention mechanisms for image SR. The residual feature distillation network (RFDN) [42] proposes an enhanced spatial attention to overcome the shortcoming of channel-wise attention and considers spatial information for better attention calculation.
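For later contrast with our spatial attention, the following is a minimal sketch of the channel-wise attention pattern discussed above, in the squeeze-and-excitation style of [16]; the channel count and reduction ratio are illustrative. Note how the global pooling step collapses each feature map to one scalar, which is exactly where the spatial diversity is lost:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention (illustrative sketch)."""
    def __init__(self, channels: int = 64, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # global pooling: (B, C, H, W) -> (B, C, 1, 1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.fc(self.pool(x))  # one weight per channel: spatial detail is lost here
        return x * w               # rescale each channel uniformly over space

x = torch.randn(1, 64, 32, 32)
print(ChannelAttention()(x).shape)  # torch.Size([1, 64, 32, 32])
```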

3. Methodology

In this section, we introduce MFSR in the following manner. First, we describe the overall network design for restoring the HR image from the LR instance. Then, we discuss the block design of SARB. Finally, we present the hyperparameter settings of MFSR.

3.1. Network Design

We aim to restore the HR image $I_{SR}$ from the given LR image $I_{LR}$. Let $\mathcal{F}_{MFSR}$ denote the proposed MFSR; then, the restoration is
$$I_{SR} = \mathcal{F}_{MFSR}(I_{LR}). \tag{1}$$

Figure 1 shows the network structure of the proposed MFSR. There are three main modules in MFSR: the feature extraction module, the multilevel feature exploration module, and the restoration module. Before restoration, one convolutional layer converts the image into the feature space for further exploration as
$$F_0 = \mathrm{Conv}(I_{LR}), \tag{2}$$
where $F_0$ denotes the feature maps converted from $I_{LR}$.

The feature extraction module extracts deep features from $F_0$. Let $\mathcal{H}_{FE}$ denote the feature extraction module; then,
$$F_E = \mathcal{H}_{FE}(F_0), \tag{3}$$
where $F_E$ represents the features extracted from the LR image.

After feature extraction, the multilevel feature exploration module processes $F_E$ in a hierarchical manner to effectively restore the structural information. First, we build the multilevel features from $F_E$ for further processing. We use max pooling to downscale $F_E$ and generate feature maps with different resolutions as
$$F^{(s)} = \mathrm{MaxPool}_s(F_E), \tag{4}$$
where $\mathrm{MaxPool}_s$ denotes max pooling with scaling factor $s$ (for $s = 1$, the feature keeps its original resolution). The downscaled features with different resolutions are explored by several spatial attention residual blocks (SARBs) to generate feature maps with HR information. There are $N_s$ SARBs for processing the feature maps with scaling factor $s$. After the SARBs, a padding structure with a residual connection is devised to keep the original information and improve gradient transmission. The padding structure consists of two convolutional layers and one ReLU activation. Thus, the multilevel exploration is
$$H^{(s)} = \mathcal{P}_s\left(\mathcal{B}_s\left(F^{(s)}\right)\right) + F^{(s)}, \tag{5}$$
where $\mathcal{P}_s$ is the padding structure for scaling factor $s$, $\mathcal{B}_s$ denotes the processing procedure with the $N_s$ SARBs, and $H^{(s)}$ is the HR feature at scaling factor $s$.

After exploration, the multilevel features are concatenated for joint hierarchical consideration. In this paper, we use bilinear interpolation to upscale the resolution. Bilinear interpolation considers the image interpolation from two directions and finds a suitable value for the missing pixels [43]. The concatenation is
$$H_M = \mathrm{Concat}\left(\mathrm{Up}_s\left(H^{(s)}\right)\right), \tag{6}$$
where $\mathrm{Concat}(\cdot)$ is the concatenation operation over the three scales, $\mathrm{Up}_s(\cdot)$ is the bilinear upsampling operation with factor $s$, and $H_M$ is the concatenated multilevel HR feature.
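A minimal PyTorch sketch of the multilevel path in equations (4)–(6) may help. The SARB chains are stubbed by single convolutions (the real block is defined in Section 3.2), and the pooling factors {1, 2, 4} and the channel count are our assumptions where the settings are not spelled out here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

CH = 64  # illustrative channel count

def conv():
    return nn.Conv2d(CH, CH, 3, padding=1)

class PaddingStructure(nn.Module):
    """Two convolutions with one ReLU in between (the padding structure in eq. (5))."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(conv(), nn.ReLU(inplace=True), conv())

    def forward(self, x):
        return self.body(x)

class MultilevelExploration(nn.Module):
    """Eqs. (4)-(6). sarbs[s] stands in for the SARB chain at scale s; here each
    chain is stubbed by a single conv (the real SARB appears in Section 3.2)."""
    def __init__(self, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.sarbs = nn.ModuleDict({str(s): conv() for s in scales})            # stub for B_s
        self.pads = nn.ModuleDict({str(s): PaddingStructure() for s in scales})  # P_s

    def forward(self, f_e):
        feats = []
        for s in self.scales:
            f_s = F.max_pool2d(f_e, s) if s > 1 else f_e                 # eq. (4)
            h_s = self.pads[str(s)](self.sarbs[str(s)](f_s)) + f_s       # eq. (5)
            if s > 1:                                                    # eq. (6): bilinear up
                h_s = F.interpolate(h_s, scale_factor=s, mode="bilinear",
                                    align_corners=False)
            feats.append(h_s)
        return torch.cat(feats, dim=1)                                   # multilevel feature H_M

x = torch.randn(1, CH, 48, 48)
print(MultilevelExploration()(x).shape)  # torch.Size([1, 192, 48, 48])
```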

After the multilevel exploration, the restoration module generates the HR image from the feature maps. There are two steps in the restoration module. First, we explore the multilevel HR feature $H_M$ to further integrate the multilevel information. There are $N_R$ SARBs for the information integration as
$$H_R = \mathcal{B}_R(H_M), \tag{7}$$
where $\mathcal{B}_R$ denotes the $N_R$ SARBs and $H_R$ is the explored hierarchical HR feature. Then, the upscale block generates the image from the feature maps, which is composed of one convolutional layer and one subpixel convolution. A skip connection is designed between the LR and HR features for better gradient transmission. Finally,
$$I_{SR} = \mathcal{U}(H_R + F_0), \tag{8}$$
where $\mathcal{U}$ is the upscale block.
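A matching sketch of the restoration module in equations (7) and (8); the 1×1 fusion convolution that squeezes the concatenated feature back to 64 channels and the placement of the global skip before upscaling are our assumptions, and the SARB chain is again stubbed:

```python
import torch
import torch.nn as nn

class UpscaleBlock(nn.Module):
    """One convolution plus subpixel convolution, producing a 3-channel image (eq. (8))."""
    def __init__(self, c: int = 64, scale: int = 2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),  # rearrange channels into a larger spatial grid
        )

    def forward(self, x):
        return self.body(x)

class RestorationModule(nn.Module):
    """Eq. (7): integrate the concatenated multilevel feature (SARBs stubbed by convs),
    then eq. (8): add the global skip from F_0 and upscale."""
    def __init__(self, c: int = 64, levels: int = 3, scale: int = 2):
        super().__init__()
        self.fuse = nn.Conv2d(levels * c, c, 1)  # assumed 1x1 fusion back to c channels
        self.sarbs = nn.Sequential(nn.Conv2d(c, c, 3, padding=1),
                                   nn.Conv2d(c, c, 3, padding=1))  # stub for the SARB chain
        self.upscale = UpscaleBlock(c, scale)

    def forward(self, h_m, f0):
        h_r = self.sarbs(self.fuse(h_m))  # eq. (7)
        return self.upscale(h_r + f0)     # eq. (8): skip connection from F_0

h_m = torch.randn(1, 192, 48, 48)         # concatenated multilevel feature
f0 = torch.randn(1, 64, 48, 48)           # shallow feature from the first convolution
print(RestorationModule()(h_m, f0).shape)  # torch.Size([1, 3, 96, 96])
```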

The overall proposed MFSR is summarized in Algorithm 1.
(1) MFSR. Input: the LR image $I_{LR}$. Output: the restored HR image $I_{SR}$.
(2) Convert the LR image into the feature space by convolution as $F_0 = \mathrm{Conv}(I_{LR})$.
(3) Extract the LR features by the feature extraction module as $F_E = \mathcal{H}_{FE}(F_0)$.
(4) Generate the hierarchical HR feature $H_R$ according to equations (4)–(7).
(5) Produce the HR image from the feature as $I_{SR} = \mathcal{U}(H_R + F_0)$.
(6) Return $I_{SR}$.

Algorithm 1: The proposed MFSR for generating the HR image from the LR instance.

3.2. Spatial Attention Residual Block

The spatial attention residual block (SARB) is the main component of MFSR, which is utilized in both the multilevel feature exploration module and the restoration module. The left-middle box in Figure 1 shows the block design of SARB. SARB adopts a residual design with two convolutional layers, one Leaky ReLU activation, and one spatial attention (SA) layer. The skip connection in SARB helps effective gradient transmission and information preservation. In particular, we mainly introduce the spatial attention in SARB, which effectively explores the spatial correlation among features.

Figure 2 shows the architecture of the SA layer, which is composed of four convolutional layers, three Leaky ReLU activations, and a Sigmoid activation. Let $X$ be the input feature of SA; one convolution with Leaky ReLU activation first processes the feature as
$$X_1 = \sigma_L(\mathrm{Conv}(X)), \tag{9}$$
where $\sigma_L(\cdot)$ denotes the Leaky ReLU activation.

After this convolution, the feature maps are downsampled by max pooling with scaling factor $k$ to increase the receptive field. Then, one convolutional layer and one Leaky ReLU activation explore the correlation among features as
$$X_2 = \sigma_L\left(\mathrm{Conv}\left(\mathrm{MaxPool}_k(X_1)\right)\right). \tag{10}$$

The processed feature is then upscaled by bilinear upsampling to the original size for calculating the attention. After upscaling, two convolutional layers with a Leaky ReLU activation in between explore the feature to find the attention map, and a Sigmoid activation keeps the attention values nonnegative. Finally, the output feature of the SA layer is
$$X_{SA} = X \odot \delta\left(\mathrm{Conv}\left(\sigma_L\left(\mathrm{Conv}\left(\mathrm{Up}_k(X_2)\right)\right)\right)\right), \tag{11}$$
where $\delta(\cdot)$ denotes the Sigmoid activation, $\mathrm{Up}_k(\cdot)$ denotes the bilinear interpolation with scaling factor $k$, and $\odot$ denotes element-wise multiplication.
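Putting equations (9)–(11) together, the following is a minimal sketch of the SA layer and the SARB that wraps it; the pooling/upsampling factor k = 2, the Leaky ReLU slope of 0.2, and the element-wise gating of the input by the attention map are our assumptions where the text leaves them unspecified:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """SA layer: four convolutions, three Leaky ReLUs, one Sigmoid (eqs. (9)-(11))."""
    def __init__(self, c: int = 64, k: int = 2):
        super().__init__()
        self.k = k  # assumed pooling/upsampling factor
        self.conv1 = nn.Conv2d(c, c, 3, padding=1)  # eq. (9)
        self.conv2 = nn.Conv2d(c, c, 3, padding=1)  # eq. (10), after max pooling
        self.conv3 = nn.Conv2d(c, c, 3, padding=1)  # eq. (11), after upsampling
        self.conv4 = nn.Conv2d(c, c, 3, padding=1)
        self.act = nn.LeakyReLU(0.2, inplace=True)  # slope 0.2 is an assumption

    def forward(self, x):
        a = self.act(self.conv1(x))                            # eq. (9)
        a = self.act(self.conv2(F.max_pool2d(a, self.k)))      # eq. (10)
        a = F.interpolate(a, scale_factor=self.k,
                          mode="bilinear", align_corners=False)  # back to input size
        a = torch.sigmoid(self.conv4(self.act(self.conv3(a))))  # eq. (11): attention map
        return x * a                                           # a value for every pixel

class SARB(nn.Module):
    """Spatial attention residual block: conv, Leaky ReLU, conv, SA, plus skip."""
    def __init__(self, c: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(c, c, 3, padding=1),
            SpatialAttention(c),
        )

    def forward(self, x):
        return x + self.body(x)

x = torch.randn(1, 64, 48, 48)
print(SARB()(x).shape)  # torch.Size([1, 64, 48, 48])
```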

It is worth noting that SA considers the spatial correlation of features. Different from channel-wise attention, which utilizes global pooling to embed the channel information [7, 16, 18], SA explores an attention map of the same size as the input feature maps. By utilizing SA, every pixel of the feature map receives a specific attention value.

3.3. Settings

In this paper, the filter number of all layers is set to 64 except for the upscale block, whose filter numbers follow the same design as IMDN [41]. There are $N_1$ SARBs at the first (full-resolution) scale of the multilevel feature exploration module, $N_2$ SARBs at the second scale, and $N_3$ SARBs at the third scale. It is worth noting that the block numbers are set to maintain the same receptive field across the different scaling factors. There are four convolutions in the feature extraction module and five SARBs in the restoration module. All convolutional layers share the same filter size.

4. Experiment

In this paper, we train our network on the DIV2K [44] dataset. DIV2K was first released for the NTIRE 2017 challenge [44] and contains numerous high-resolution instances for image SR. Numerous SR works are trained on this dataset [47]. We choose 800 images from DIV2K for training and 5 images for validation. For testing, we use five widely used benchmarks to evaluate the performance: Set5 [45], Set14 [46], B100 [47], Urban100 [48], and Manga109 [49]. Peak signal-to-noise ratio (PSNR) [20] and structural similarity (SSIM) [21] are selected to estimate the objective quality of the restored images. We update our model with the Adam [50] optimizer and halve the learning rate every 200 epochs. The loss function is set as the L1 loss. We use PyTorch 0.4.1 to train the network, with CUDA version 10.0. The batch size is set to 32. The experimental environment is Ubuntu 18.04, and we train the model on one NVIDIA GTX-1080Ti GPU.
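The configuration above translates into roughly the following training loop; the learning rate of 1e-4 is a common choice for SR training and only an assumption here, and the stand-in network and synthetic tensors are placeholders for MFSR and DIV2K patches:

```python
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

# Stand-in x2 SR network and synthetic data so the loop runs as-is;
# swap in MFSR and a DIV2K patch loader for real training.
model = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(64, 3 * 4, 3, padding=1), nn.PixelShuffle(2))
criterion = nn.L1Loss()                                  # L1 loss, as in the paper
optimizer = Adam(model.parameters(), lr=1e-4)            # lr value is an assumption
scheduler = StepLR(optimizer, step_size=200, gamma=0.5)  # halve the lr every 200 epochs

for epoch in range(2):                                   # a token number of epochs
    lr_img = torch.randn(32, 3, 48, 48)                  # batch size 32, as in the paper
    hr_img = torch.randn(32, 3, 96, 96)                  # x2 ground-truth patches
    loss = criterion(model(lr_img), hr_img)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
    print(f"epoch {epoch}: L1 = {loss.item():.4f}")
```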

4.1. Ablation Study
4.1.1. Investigation on the Multilevel Exploration

In this paper, we set different numbers of SARBs at the three scales of the multilevel exploration module to keep the receptive field consistent. To investigate the effectiveness of these settings, we compare different combinations. Table 1 shows the performance of different block settings. Herein, $(N_1, N_2, N_3)$ denotes the numbers of SARBs at the three scales from the finest to the coarsest. From the table, we find that the PSNR/SSIM results become higher as the number of SARBs increases. Comparing the third and fourth rows, the PSNR on Set5 drops from 32.23 dB to 32.17 dB, so decreasing $N_1$ leads to a near 0.06 dB PSNR drop on Set5. However, the PSNR results of the second, third, and last rows on Set5 are 32.26 dB, 32.23 dB, and 32.26 dB, respectively, which shows that decreasing $N_2$ and $N_3$ leads to less than a 0.03 dB drop in PSNR. From this point of view, $N_1$ influences the performance more than the blocks at the other levels. We believe this is because the resolution at the first level is larger than at the other levels, and its computation complexity is much higher.

4.1.2. Investigation on Spatial Attention

We also investigate the effectiveness of the spatial attention. Table 2 shows the performance comparison. The model with spatial attention adopts the SARB design proposed in this paper, while the model without spatial attention uses the same block design as EDSR [4]. From the table, we can find that the spatial attention mechanism brings 0.04 dB, 0.02 dB, and 0.04 dB PSNR gains on the Set5, Set14, and B100 datasets, respectively. Similarly, the model with spatial attention achieves SSIM results of 0.8961, 0.7838, and 0.7381 on the three benchmarks, which are all higher than those of the model without the attention mechanism. From this point of view, the attention mechanism is an effective component for better performance.

4.2. Comparison with State-of-the-Art Methods

We mainly compare our model with several classical and recent works, including SRCNN [3], FSRCNN [29], VDSR [30], DRCN [31], LapSRN [8, 9], RAN [38], DNCL [51], FilterNet [37], MRFN [39], SeaNet [11], DEGREE [13], IRLP [40], FSN [52], and DSRLN [53]. Table 3 shows the PSNR/SSIM performance on the five benchmarks with scaling factors ×2, ×3, and ×4.

From the table, we can find that the proposed MFSR achieves the best performance on all testing benchmarks. It is worth noting that our model is mainly designed for structural information restoration. Accordingly, we achieve significant PSNR/SSIM improvements on the benchmarks with plentiful lines and edges, such as Urban100 and Manga109. In particular, our method achieves near 0.3 dB higher PSNR on Manga109 and Urban100 than other works at scaling factors ×2 and ×3, and near 0.2 dB higher at scaling factor ×4. Urban100 is composed of high-resolution building images, and Manga109 is composed of comic cover pages; thus, the high PSNR/SSIM performance demonstrates our effectiveness on structural information restoration.

We also achieve performance superior to other edge-preserving and multiscale SR works. FilterNet, DEGREE, and SeaNet aim at preserving structural information, and MRFN is a modern, effective multiscale network for image SR. In Table 3, our model gains more than 0.2 dB in PSNR over them, which is a significant improvement.

Besides the quantitative comparisons, we also compare the visual quality of different methods. We mainly compare our method with LapSRN [9], MSRN [14], and CARN [10]. LapSRN is a representative progressive structure inspired by the Laplacian pyramid, and MSRN is a state-of-the-art multiscale design; thus, comparing with them shows the effectiveness of our multilevel design. Figure 3 shows the visual comparison. In the first and second rows of the figure, our method recovers the lines on the buildings more accurately than MSRN and LapSRN. In the last row, our method restores more of the correct small lines, while both LapSRN and MSRN fail to recover the straight lines.

5. Conclusion

In this paper, we proposed a multilevel feature exploration network (MFSR) to restore structural information for a better visual experience, which is designed with a UNet-like architecture for hierarchical exploration. To further explore the structural information, spatial attention is adopted in MFSR for better performance. Spatial attention residual blocks (SARBs) are specially designed based on the proposed spatial attention mechanism. Experimental results show that MFSR achieves better PSNR/SSIM performance than other multiscale and structure-preserving image SR works at scaling factors ×2, ×3, and ×4. At the largest scaling factor, there is a near 0.2 dB PSNR improvement on the benchmarks with plentiful edges and lines, indicating the effectiveness on structural information restoration. The visual comparison also demonstrates the superiority of MFSR in producing more accurate lines and edges.

Data Availability

The image data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was partially supported by the Program for Liaoning Innovation Talents in University (no. LR2019034) and the Overseas Training Foundation of Liaoning (no. 2019GJWYB015).