Abstract

Convolutional neural networks (CNNs) have driven significant progress in single image super-resolution (SISR). However, most existing CNN-based models suffer from numerous parameters and excessively deep structures. Moreover, by relying on in-depth features, these models commonly ignore the hints in low-level features, resulting in poor performance. This paper presents a network for SISR with cascading and residual connections (CASR), which alleviates these problems by extracting features in a small subnetwork, called the head module, built on depthwise separable convolution and deformable convolution. We also introduce a cascading residual block (CAS-Block) for the upsampling process, which benefits gradient propagation and feature learning while easing model training. Extensive experiments conducted on four benchmark datasets demonstrate that the proposed method is superior to recent SISR methods in terms of both quantitative indicators and realistic visual effects.

1. Introduction

Superresolution (SR) image reconstruction is widely used in various applications, such as military surveillance, medical diagnostics [1, 2], remote sensing [3], and video streaming [4, 5]. Single image superresolution (SISR) aims to reconstruct a high-resolution (HR) image from its low-resolution (LR) counterpart, an essential and classic task in computer vision. The demand for HR images has grown rapidly in recent years, yet the physical constraints of imaging devices limit the direct acquisition of high-resolution pictures, and a series of successful works has drawn the research community's attention to this task.

The task of recovering HR images from their LR counterparts is ill-posed. Researchers have devoted much effort to this task and invented numerous algorithms, including interpolation-based, reconstruction-based, and learning-based methods [1].

Traditional SISR algorithms, for instance bicubic interpolation [6], are fast but suffer from poor accuracy and easily fail in practice. To constrain the space of possible solutions, researchers proposed more advanced reconstruction-based algorithms [7, 8] that introduce available prior knowledge. These algorithms can restore clear details (e.g., texture details), but extensive experiments show that they degrade sharply as the scale factor increases. Subsequently, algorithms with learnable parameters [9] were proposed to learn the relationship between LR images and their HR counterparts from concrete training instances [10, 11]. Although such learning-based methods perform very well, the time-consuming optimization problems they involve are tricky.

In recent years, CNNs have been introduced to facilitate progress in the SISR field because of their excellent feature representation ability. Dong et al. [12] were the first to propose a three-stage convolutional network to solve the SISR problem, which has become a milestone in this field. Since then, the research community has set out to design more complex networks to improve performance. EDSR, a very large network with residual blocks, was presented by Lim et al. [13] and achieved satisfactory performance in both PSNR and SSIM [14]. However, these state-of-the-art methods still have some limitations: (1) the state-of-the-art (SotA) models [13] mainly improve performance by considerably growing the depth and width of the network, so massive parameters and growing resource consumption are inevitable; (2) many progressive models do not fully take advantage of the hierarchical information in the primary LR images, which is essential for improving visual performance.

To address these shortcomings, we present a model named CASR, which explores two separate strategies to extract features effectively for precise SISR. Figure 1 shows the ×4 SR results of our proposed model on the DIV2K dataset [15]. First, we propose a small but effective depthwise separable convolution network, named the head module, aimed at more systematic feature extraction.

Second, we present a cascaded residual network (CAS-Block) for better feature and gradient propagation. With this architecture, our method combines features from successive layers at both the regional and global levels. Moreover, a stack of broader local residual connections is applied to exploit the features of shallow layers and let the vast low-level mappings be transmitted. This scheme unites nonlocal actions to capture remote spatial features from former inputs.

As the crucial component of the presented method, the CAS-Block includes six subtrunks, each consisting of two convolutional layers and a pReLU nonlinear activation. Because applying the activation function in low-dimensional bottlenecks hurts performance, we expand the channels before the pReLU layer to construct an inverted residual block, resulting in a performance improvement.

The three main contributions of this article are summarized as follows: (1) we propose a head module that applies a series of depthwise separable convolution operations for feature extraction and replaces all conventional convolution operations in the module with deformable convolution layers; at the same time, to retain features effectively, we expand the low-dimensional representation to a high-dimensional one before the activation function, maintaining a balance between parameter count and performance; (2) to effectively improve feature fusion and gradient propagation, we introduce a cascaded block called the CAS-Block, a mechanism that allows our network to combine features from diverse layers and is also used to construct the network and promote its expressive power; (3) we utilize the $L_1$ loss with the addition of a total variation (TV) loss instead of the traditional sole loss function, which significantly improves the quality of the reconstructed image; meanwhile, to obtain better optimization weights, we explored various parameter settings.

2. Related Work

2.1. SISR Using Deep CNNs

In the field of superresolution, CNN-based models have a stronger feature expression ability than conventional image restoration methods and have achieved great success. Dong et al. [12] first proposed an end-to-end CNN-based algorithm named SRCNN. It consists of three convolutional layers, and its performance is impressive compared with traditional methods (e.g., sparse coding [7] and bicubic interpolation [6]). Later, the research community designed more intricate CNN architectures and developed deeper networks. For example, to grow the depth of the network, VDSR [16] introduced residual learning, and verification experiments proved that this strategy heightens SR image quality and promotes convergence. DRCN [17] constructs a neural network with deep recursion, reusing the same convolution kernel 16 times in the reference network and effectively reducing the number of parameters. He et al. [18], inspired by ordinary differential equations (ODEs), proposed an intriguing network named OISR, which provides a new understanding of network design. It is worth noting that most of these methods take interpolated images as input, which not only over-smooths details but also incurs additional computational cost and time consumption.
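For reference, the following is a minimal PyTorch sketch of the three-layer SRCNN baseline described above, assuming the commonly reported 9-1-5 kernel layout and 64/32 channel widths; it is illustrative rather than the authors' exact configuration:

```python
# A minimal sketch of the three-stage SRCNN [12], assuming the widely
# reported 9-1-5 kernel layout and 64/32 channel widths.
import torch
import torch.nn as nn

class SRCNN(nn.Module):
    def __init__(self, channels: int = 1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=9, padding=4),  # patch extraction
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=1),                   # nonlinear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, kernel_size=5, padding=2),  # reconstruction
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SRCNN operates on a bicubic-interpolated LR input at the target size.
        return self.body(x)
```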

2.2. Skip Connection

ResNet [19] was the first to adopt the concept of skip connection, and the idea was then extended to various computer vision tasks, such as image restoration [20] and semantic segmentation [21]. Since it is difficult for ordinary SR networks to be built extremely deep, employing various skip connections avoids the vanishing gradient trap and boosts performance. The strategy roughly falls into two categories: global or local residual connections, and dense connections.

2.2.1. Global or Local Residual Connections

In image restoration tasks, LR images are closely related to HR ones. Learning the residual feature maps between the images' pixels recovers the absent high-frequency detail. VDSR was the first residual model for superresolution. Extensive experiments have proved that residual learning can enhance reconstruction performance and accelerate convergence. Therefore, this method has been widely used in various computer vision tasks [22].

2.2.2. Dense Connections

A dense connection links the current layer with all former layers, and this architecture is particularly effective at restoring high-resolution patterns. DenseNet [23] first presented the dense connection, which was then brought into the SR field; by reusing features to the extreme, it achieves better results with fewer parameters.

Traditional neural networks are basically unidirectional propagations, and the signals received by later layers are very weak. To solve this problem, MemNet [20] stacks memory blocks and adds dense connections among them, forming a long-term memory model. Such an architecture reduces the weight of the entire network, facilitates convergence, and deepens the network.

RDN [24] uses a similar architecture; however, whereas MemNet does not exploit all of the intermediate feature information, RDN applies global residual learning to use all of it.

Jiang et al. [25] proposed a hierarchical dense network (HDRN) in 2019, which can effectively establish realistic mapping relationships between LR and HR images, promoting information interaction and representation.

Unlike the above models, CARN uses a cascade mechanism at both the local and global levels to integrate features from multiple layers, reflecting input representations at different levels and receiving more information [26]. Haris et al. [27] proposed D-DBPN, which connects the features of the up- and downsampling stages and improves the SR result.

2.3. Depthwise Separable Convolution

The cross-channel correlation and spatial correlation of a convolutional layer can be decoupled and mapped separately to achieve better results. Some lightweight networks, such as MobileNet [28], apply depthwise separable convolution, a combination of depthwise (DW) and pointwise (PW) convolutions, to extract feature maps. Compared with the conventional convolution operation, the number of parameters and the operation cost are relatively small. As shown in Figure 2, we use depthwise separable convolution in the head module.
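The parameter saving can be made concrete with a short PyTorch sketch; the 64-channel, 3×3 setting below is chosen only for illustration:

```python
# A sketch contrasting a standard convolution with its depthwise separable
# counterpart (depthwise + pointwise), as used in MobileNet-style designs.
import torch
import torch.nn as nn

def depthwise_separable(in_ch: int, out_ch: int, k: int = 3) -> nn.Sequential:
    return nn.Sequential(
        # Depthwise (DW): one k x k filter per input channel (groups=in_ch).
        nn.Conv2d(in_ch, in_ch, kernel_size=k, padding=k // 2, groups=in_ch),
        # Pointwise (PW): 1 x 1 convolution mixing information across channels.
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
    )

# Parameter comparison for 64 -> 64 channels with a 3 x 3 kernel:
standard = nn.Conv2d(64, 64, 3, padding=1)
separable = depthwise_separable(64, 64, 3)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable))  # 36928 vs 4800 parameters
```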

2.4. Multiscale Learning

To utilize computing resources more efficiently and extract more features under the same amount of computation, Szegedy et al. [29] presented the inception module. The inception structure makes two main contributions: it uses 1×1 convolution to perform dimensionality reduction, and it simultaneously performs convolution and reaggregation at multiple kernel sizes. Inspired by [29] and [30], MSRB [31] was proposed; multiscale feature fusion and local residual learning adaptively detect image features at different scales with convolution kernels of different sizes. The results show that operating with different kernels provides better extraction capability. However, this method cannot further expand the receptive field or generate more detailed structural information.
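The following is a hedged sketch of an inception-style multiscale block in the spirit of MSRB [31], with parallel 3×3 and 5×5 branches fused by a 1×1 convolution and a local residual connection; the branch widths are illustrative assumptions, not MSRB's exact design:

```python
# A simplified inception-style multiscale block: parallel 3x3 and 5x5
# branches, 1x1 fusion for dimensionality reduction, and a local residual.
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    def __init__(self, ch: int = 64):
        super().__init__()
        self.branch3 = nn.Conv2d(ch, ch, 3, padding=1)   # 3x3 branch
        self.branch5 = nn.Conv2d(ch, ch, 5, padding=2)   # 5x5 branch
        self.fuse = nn.Conv2d(2 * ch, ch, 1)             # 1x1 fusion/reduction
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = torch.cat([self.act(self.branch3(x)),
                       self.act(self.branch5(x))], dim=1)
        return self.fuse(y) + x                          # local residual learning
```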

2.5. Deformable Convolution

Conventional convolution kernels are usually of fixed size (for example, 3×3). The biggest problem with such kernels is their poor adaptability to unknown changes and weak generalization ability. To solve the object spatial deformation problem, deformable convolution [32] was proposed to heighten the transformation modeling ability of CNNs. Deformable convolution builds on traditional convolution by adding learned offset vectors that adjust the convolution kernel's sampling locations, making the kernel's shape conform more closely to the feature.
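As a usage sketch, deformable convolution is available in torchvision as DeformConv2d, where a small convolution head predicts the per-location sampling offsets; the layer sizes below are assumptions for illustration:

```python
# A usage sketch of deformable convolution with torchvision's DeformConv2d.
# A small conv head predicts per-location sampling offsets (2 values, dx and
# dy, per kernel tap), which DeformConv2d adds to the regular sampling grid.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableLayer(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3, dilation: int = 1):
        super().__init__()
        pad = dilation * (k // 2)
        # 2 * k * k offset channels: (dx, dy) for each of the k*k kernel taps.
        self.offset = nn.Conv2d(in_ch, 2 * k * k, k, padding=pad, dilation=dilation)
        self.deform = DeformConv2d(in_ch, out_ch, k, padding=pad, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.deform(x, self.offset(x))

x = torch.randn(1, 64, 48, 48)
print(DeformableLayer(64, 64)(x).shape)  # torch.Size([1, 64, 48, 48])
```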

2.6. Real-World Image Superresolution

In real-world image restoration scenarios, the lack of corresponding high-quality references usually leads to poor experimental results. We additionally introduce the naturalness image quality evaluator (NIQE) [33] and the Perceptual Index (PI) [22] for evaluation. These indicators sensitively reflect content sharpness, detail contrast, and texture diversity. They are highly consistent with subjective quality and can effectively reflect the visual quality of images without a reference. In particular, smaller NIQE/PI values indicate better perceptual quality and clearer content. We apply them to the DRIVE [34] dataset to estimate the restoration capability of the proposed method.

3. Proposed Approach

3.1. Network Architectures

Earlier SISR algorithms, such as ESPCN [35] and FSRCNN [36], do not take full advantage of low-level feature information, and deeper structures bring more parameters. As shown in Figure 3, the proposed CASR consists of three components: (1) the head module, (2) the cascading block, and (3) the upsample module. What we seek is a balance between performance and cost.

To better explore the mentioned issues, we adopt two different strategies: (1) original feature extraction and (2) cascading connection structure.

3.1.1. Original Feature Extraction

We denote $I_{LR}$ and $I_{SR}$ as the input and output of our model, respectively. Figure 2 illustrates how the head module extracts the original information from LR images:

$$F_{0} = f_{\text{head}}(I_{LR}),$$

where $f_{\text{head}}(\cdot)$ denotes a series of convolution operations. In the head module, we first replace the conventional convolution with a depthwise convolution layer to reduce parameters. After an activation layer, the feature maps are sent to another specific convolution layer, a deformable convolution. As discussed in Section 2, deformable convolution adds an offset to each convolution sampling point, thus achieving free deformation of the sampling grid. Then, after passing through a further small-kernel convolutional layer and another pReLU activation function, $F_{0}$ is sent to the next stage for higher-level abstraction.
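A minimal sketch of this pipeline, under assumed channel widths (the paper's exact configuration may differ), could look as follows:

```python
# A hedged sketch of the head module pipeline: depthwise (+ pointwise)
# convolution, pReLU, deformable convolution, a further 3x3 convolution,
# and another pReLU. Channel widths are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class HeadModule(nn.Module):
    def __init__(self, in_ch: int = 3, feat: int = 64):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, feat, 1)   # PW step of the separable conv
        self.act1 = nn.PReLU(feat)
        self.offset = nn.Conv2d(feat, 2 * 3 * 3, 3, padding=1)  # offsets per tap
        self.deform = DeformConv2d(feat, feat, 3, padding=1)
        self.conv = nn.Conv2d(feat, feat, 3, padding=1)
        self.act2 = nn.PReLU(feat)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.act1(self.pointwise(self.depthwise(x)))
        f = self.deform(f, self.offset(f))           # deformed sampling grid
        return self.act2(self.conv(f))
```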

3.1.2. Cascading Connection Structure

Now, we present the CAS-Block. The cascade connection allows information to spread across multiple paths in the network, which greatly enhances feature fusion, and it has been widely applied in various computer vision tasks [10]. As shown in Figure 4, the mapping process of our cascade network comprises $N$ CAS-Blocks, each with a skip connection:

$$F_{n} = f_{n}(F_{n-1}) + F_{n-1}, \quad n = 1, \ldots, N,$$

where $F_{n}$ presents the output of the $n$-th CAS-Block. Each CAS-Block contains one group convolution layer, one traditional convolution layer for adjusting the number of channels, and a pReLU layer. We prefer stacking several kernels of smaller sizes to directly applying larger kernels, which enlarges the receptive field of the feature extraction module while decreasing the number of learnable parameters:

$$F_{\text{cas}} = \mathrm{Cat}(F_{\text{mid}}),$$

where $F_{\text{mid}}$ means all the outputs of the middle three CAS-Blocks and $\mathrm{Cat}(\cdot)$ denotes the cascading operation:

$$F_{\text{out}} = f_{\text{CASR}}(I_{LR}),$$

where $f_{\text{CASR}}(\cdot)$ denotes our proposed mapping function. Finally, we use a common upsampling module to fuse the hierarchical structural features and amplify the image size:

$$I_{SR} = f_{\text{up}}(F_{\text{out}}),$$

where $f_{\text{up}}(\cdot)$ indicates the upscale module. Many upsampling methods have been proposed in recent years, such as [12, 27, 36]; we adopt the postupsampling strategy, which has proven effective. The process of our model roughly includes three steps. First, taking the low-resolution image as the original input, the feature extraction module obtains initial features from the low-quality image. Then, these features are delivered to higher abstraction layers. Finally, we adopt a simple upsampling block, including a convolutional layer and a pixel-shuffle layer, to enlarge the SR image.
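Under illustrative assumptions about the block count and channel widths (not the paper's exact configuration), the cascading structure and post-upsampling step could be sketched as follows:

```python
# A hedged sketch of the cascading structure: each CAS-Block is a small
# residual unit (group conv + 1x1 channel adjustment + pReLU), intermediate
# outputs are concatenated (the cascading operation Cat), fused by a 1x1
# convolution, and passed to a pixel-shuffle upsampler.
import torch
import torch.nn as nn

class CASBlock(nn.Module):
    def __init__(self, ch: int = 64, groups: int = 4):
        super().__init__()
        self.group_conv = nn.Conv2d(ch, ch, 3, padding=1, groups=groups)
        self.fuse = nn.Conv2d(ch, ch, 1)   # adjusts the number of channels
        self.act = nn.PReLU(ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.fuse(self.group_conv(x))) + x  # skip connection

class CascadeNet(nn.Module):
    def __init__(self, ch: int = 64, n_blocks: int = 5, scale: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(CASBlock(ch) for _ in range(n_blocks))
        self.cat_fuse = nn.Conv2d(3 * ch, ch, 1)  # fuses three cascaded outputs
        self.upsample = nn.Sequential(            # post-upsampling module
            nn.Conv2d(ch, ch * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
            nn.Conv2d(ch, 3, 3, padding=1),
        )

    def forward(self, f0: torch.Tensor) -> torch.Tensor:
        feats, f = [], f0
        for block in self.blocks:
            f = block(f)
            feats.append(f)
        f_cas = self.cat_fuse(torch.cat(feats[1:4], dim=1))  # middle three blocks
        return self.upsample(f_cas)
```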

3.2. Total Variation Loss

Aly and Dubois brought the total variation (TV) loss [37] to the SR field to suppress noise in generated images, and Yuan et al. also selected the TV loss to impose spatial smoothness:

$$\mathcal{L}_{TV} = \frac{1}{hwC} \sum_{c=1}^{C} \sum_{i=1}^{h-1} \sum_{j=1}^{w-1} \sqrt{\left(\hat{I}_{i+1,j,c} - \hat{I}_{i,j,c}\right)^{2} + \left(\hat{I}_{i,j+1,c} - \hat{I}_{i,j,c}\right)^{2}},$$

where $\hat{I}$ depicts the reconstructed HR image, $h$ and $w$ represent the spatial dimensions of the corresponding feature maps, and $C$ symbolizes the number of channels. On the other hand, although the mean square error (MSE) is available, previous work [38] proved that it is not a good choice. Thus, the second loss function is defined as the $L_1$ loss:

$$\mathcal{L}_{1} = \frac{1}{M} \sum_{m=1}^{M} \left\| I_{SR}^{(m)} - I_{HR}^{(m)} \right\|_{1},$$

where $M$ is the number of training samples.

We applied these loss functions in the training process of our presented model. From the experiments, we found that the model achieves better performance when the combined $\mathcal{L}_{1} + \lambda\mathcal{L}_{TV}$ loss is adopted instead of the simple $\mathcal{L}_{1}$ loss, with a small weight $\lambda$ working well. As shown in Figure 5, the TV loss enables the network to generate smoother recovered images, and Figure 6 comparatively illustrates that the combined loss function produces sharper SR results.
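A sketch of this combined objective in PyTorch is given below; the weight lambda_tv is a placeholder, since the tuned value is not reproduced here:

```python
# The combined objective: L1 reconstruction loss plus a lambda-weighted
# total variation (TV) term for spatial smoothness.
import torch
import torch.nn.functional as F

def tv_loss(sr: torch.Tensor) -> torch.Tensor:
    # Isotropic TV matching the equation above: sqrt of squared forward
    # differences in height and width (eps added for gradient stability).
    dh = sr[:, :, 1:, :-1] - sr[:, :, :-1, :-1]
    dw = sr[:, :, :-1, 1:] - sr[:, :, :-1, :-1]
    return torch.sqrt(dh.pow(2) + dw.pow(2) + 1e-8).mean()

def combined_loss(sr: torch.Tensor, hr: torch.Tensor,
                  lambda_tv: float = 1e-5) -> torch.Tensor:
    # lambda_tv is a placeholder weight, not the paper's tuned value.
    return F.l1_loss(sr, hr) + lambda_tv * tv_loss(sr)
```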

3.3. Comparison with Other CNN-Based Methods
3.3.1. Comparison with MSRN

Compared with MSRN, our CASR differs as follows. First, the basic module design is distinct. In MSRN, the multiscale residual block (MSRB) incorporates parallel convolutions with multiple feature channels, and the output of each multiscale residual block is cascaded through hierarchical feature fusion to produce the final result, which incurs heavy computation. In contrast, our multiscale modules are branch-based, using regional skip connections and cascades extensively and scaling down the parameters. Second, the activation functions differ: MSRN utilizes the ReLU function, while we employ pReLU. According to the comparison in Figure 7, pReLU optimizes and improves on ReLU; with almost no increase in computation, the pReLU function effectively alleviates model overfitting, accelerates convergence, and lowers the error. Therefore, our proposed multiscale module has more effective representation capabilities.

3.3.2. Comparison with MemNet

We summarize the main differences between MemNet [20] and our CASR. The former stacks memory blocks with massive shortcuts, while our method avoids extensive dense connections to lower the number of parameters. Moreover, Tai et al. trained MemNet with the $L_2$ loss, whereas we prefer the combined $\mathcal{L}_{1} + \lambda\mathcal{L}_{TV}$ loss. Besides, MemNet takes interpolated images as input; in contrast, our proposed method directly extracts hierarchical features from the original LR images and upsamples at the end of the process, improving computational efficiency and SR performance.

4. Experimental Results

4.1. Training Details

We place depthwise separable convolution operations in the head module, shown in Figure 2; first illustrated in the Inception network, such operations effectively reduce the size of the network parameters. Figure 4 graphically illustrates the cascading process: the intermediate layers' outputs are cascaded into the posterior layers and finally assembled in a convolutional trunk consisting of a depthwise separable convolution, whose output features are three times the input features, followed by a pReLU activation function.

We prefer the $L_1$ loss to the $L_2$ loss as the loss function, although the latter has generally been applied in computer vision tasks because of its intimate relation to the calculation of PSNR. However, the research community has recently indicated that the $L_1$ loss provides better accuracy and faster convergence, while the TV loss ($\mathcal{L}_{TV}$) imposes spatial smoothness on reconstructed images. We train on fixed-size patches in mini-batches, employing 16000 iterations of back-propagation per epoch, and adopt the ADAM [39] optimizer with hyperparameters $\beta_1$, $\beta_2$, and $\epsilon$ for optimization. We train for 850 epochs, setting an initial learning rate for all layers that is halved every 50 epochs. All experiments are run under the PyTorch framework and deployed on an NVIDIA RTX 2080Ti GPU.
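The optimization schedule translates directly into PyTorch; in this sketch the initial learning rate of 1e-4 and the default ADAM hyperparameters are placeholder assumptions, while the 850 epochs and the halving every 50 epochs follow the text:

```python
# A sketch of the optimization schedule: ADAM with the learning rate
# halved every 50 epochs over 850 epochs.
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)  # stand-in for the CASR network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is an assumption
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)

for epoch in range(850):        # 850 training epochs, as in the text
    # ... 16000 back-propagation iterations per epoch would run here ...
    optimizer.step()
    scheduler.step()            # halves the learning rate every 50 epochs
```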

4.2. Datasets

DIV2K [15] is a high-definition dataset containing diverse image content. It has 800 training images, 100 validation images, and 100 test images. We employ the 800 training images to train the proposed model and randomly select ten validation images for evaluation. In the testing process, we adopt the following benchmark datasets: Set5 [40], Set14 [41], B100 [42], and Urban100 [43]. They contain various scenes from real life, such as landscapes, buildings, and people. In addition, the Digital Retinal Images for Vessel Extraction (DRIVE [34]) dataset is a dataset for retinal vessel segmentation; it consists of 40 JPEG color fundus images, including 7 cases with abnormal pathology.

4.3. Experimental Analyses
4.3.1. Results on Benchmark Datasets

Our proposed method is compared with state-of-the-art SR models on two commonly adopted image quality metrics (i.e., PSNR and SSIM). We compare our method with several progressive networks: (1) bicubic, (2) SRCNN [12], (3) VDSR [16], (4) LapSRN [44], (5) DRCN [17], (6) DRRN [8], (7) MemNet [20], (8) RDN [24], (9) HDRN [25], (10) OISR [18], and (11) IDN [45]. As described in the technical literature, these methods were evaluated on the four aforementioned datasets. Table 1 lists the performance of all the algorithms. Our network outperforms the comparison models at various scale factors, with the exception of RDN. On some datasets, the performance of CASR is close to that of RDN, while RDN's consumption is much larger; we discuss this in detail later.
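For reference, the following is a minimal sketch of the PSNR metric used in Table 1, assuming images scaled to [0, 1]; benchmark protocols typically evaluate on the Y channel with border cropping, which is omitted here:

```python
# Peak signal-to-noise ratio for images in [0, 1] (Y-channel conversion
# and border cropping, as used in benchmark protocols, are omitted).
import torch

def psnr(sr: torch.Tensor, hr: torch.Tensor, max_val: float = 1.0) -> float:
    mse = torch.mean((sr - hr) ** 2)
    return (10 * torch.log10(max_val ** 2 / mse)).item()
```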

Compared with the other methods, our proposed method performs better at all scales on the Set5 dataset. Especially at ×2 superresolution, the reconstructed image contains very clear texture details.

Our method performs well in image superresolution restoration tasks at various scales on the Set14 dataset. Specifically, the best-performing benchmark SR models are RDN [24] and IDN [45], which reach 30.67/0.8482 and 30.52/0.8462 on the PSNR/SSIM metrics, respectively, at the ×3 scale; our model is 0.1 dB lower than the former and 0.05 dB higher than the latter.

As mentioned earlier, the B100 dataset contains many real-world images. As seen in Figure 8, the vase image recovered by our method has clearer edges, reaching 32.33 dB. Figure 8 shows visual comparisons on the B100 and Urban100 datasets at scales ×2 and ×4, respectively.

The Urban100 dataset consists of 100 pictures of various buildings, which usually contain clear edges and rich textures. Accordingly, as reported in [24], RDN performs well on these superresolution tasks, reaching 33.09 dB at ×2. Our method performs well at the ×3 and ×4 superresolution tasks, reaching 28.90 dB and 26.67 dB, respectively, approximately 0.1 dB and 0.15 dB lower than RDN, while CASR costs much less than its competitors.

4.3.2. Comparison Results on Time Complexity

In addition, we provide a comparison of model efficiency in terms of time complexity on the public Set14 dataset (taking ×4 as an instance), as tabulated in Table 2. The table shows that the CASR model achieves competitive results compared to VDSR [16] and RDN [24], reaching 28.96/0.7899 on the PSNR/SSIM metrics, while spending less time (0.1017 s per image) and consuming the fewest resources for image restoration. We conclude that our proposed CASR model has the lowest time consumption with an acceptable parameter count compared with VDSR and RDN.

4.3.3. Superresolution on Real-World Images

Table 3 indicates that our proposed CASR method is highly competitive, achieving the lowest average NIQE/PI values on the benchmark Set14 dataset at scale factor ×4. Figure 9 illustrates a visual comparison with several SR methods on the real-world image "chip". The results show that our method achieves better restoration than the others: it not only achieves competitive PI and NIQE values but also delivers more pleasant visual quality in terms of edges, textures, colors, and feature-rich regions. Moreover, as shown in Figure 10, the restoration performance at the larger scale, e.g., ×4, is also acceptable; the vessels in the retinal image are clearer than those of the competitors, and the edge of the retina is as sharp as expected. Considering that the whole experiment was designed and conducted on DIV2K, a supervised public dataset with ground-truth images, which is an acceptable compromise, we believe this points to a further research direction: exploring a more realistically oriented image SR process with a better degradation kernel on a real-world image dataset.

4.4. Ablation Study

To further explore the details of the experiment, we design two ablation experiments: one investigates the influence of different dilation factors on deformable convolution, and the other examines the effects of different loss functions.

4.4.1. Study of the Deformable Convolution

Figure 11 illustrates two training processes with different dilation scales, examining whether the dilation scale of deformable convolution affects recovery performance. As shown in Figure 11, as the epochs increase, both training results improve, while the model with dilation 2 achieves better performance but fluctuates more. We also compare the effect of different scale factors on performance, as shown in Table 4. With the same scale factor ×2, our proposed method, which replaces the conventional convolution with deformable convolution, achieves better results on both Set5 and B100, and the performance improves as the dilation factor enlarges. This occurs mainly because the operation effectively and dynamically expands the receptive field: different input feature maps may correspond to objects of different deformation scales, so for some tasks it is essential to determine the dilation ratio or receptive field size adaptively.

4.4.2. Study of the Loss Function

To examine the effects of the mentioned loss functions, we design an ablation experiment. Formally, let the first model be trained with the sole $\mathcal{L}_{1}$ loss and the other with the enhanced loss function $\mathcal{L}_{1} + \lambda\mathcal{L}_{TV}$. We tried different combinations of scale factors and loss functions to examine which achieves better performance on the DIV2K dataset, as demonstrated in Figure 6 and Table 5. Afterward, the validation process on the Set14 and Urban100 datasets proves that the enhanced loss function indeed yields a clearer image with more details, as shown in Figure 5.

5. Conclusion

This paper presents two novel CNN architectures, namely the head module and the CAS-Block, to improve SISR performance. Compared with state-of-the-art (SotA) CNN-based algorithms, our head module considers low-level feature expression by applying depthwise separable convolution and deformable convolution, which is demonstrated not only to extract patterns effectively but also to reduce the parameter size. At the same time, the CAS-Block employs a global residual connection and abundantly utilizes cascading connections to capture remote spatial features from former inputs. Extensive experiments illustrate that our model effectively improves both the quality of the reconstructed images and the processing speed compared with SotA methods, in terms of quantitative indicators and realistic visual effects.

Data Availability

The image data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (Nos. U1701267 and 61962014), Guangxi Science and Technology Project (AD18216004 and AD18281079), Guangxi Bagui Scholars Special Project (2019GXNSFFA245014, AA17202024, Ji Li, 2018), Guangxi Key Laboratory of Image and Graphic Intelligent Processing (GIIP202001), and Innovation Project of GUET Graduate Education (No. 2020YCXS053).