Abstract

With the power of deep learning, super-resolution (SR) methods have enjoyed a dramatic boost in performance. However, they usually have a large model size and high computational complexity, which hinders their application on devices with limited memory and computing power. Some lightweight SR methods address this issue by directly designing shallower architectures, but this adversely affects the representation capability of convolutional neural networks. To overcome this limitation, we propose a dual feature aggregation strategy for image SR. It enhances feature utilization via feature reuse, which largely improves the representation ability while introducing only marginal computational cost. Thus, a smaller model can achieve better cost-effectiveness with the dual feature aggregation strategy. Specifically, it consists of a Local Aggregation Module (LAM) and a Global Aggregation Module (GAM). LAM and GAM work together to fuse hierarchical features adaptively along the channel and spatial dimensions. In addition, we propose a compact basic building block to compress the model size and extract hierarchical features more efficiently. Extensive experiments suggest that the proposed network performs favorably against state-of-the-art SR methods in terms of visual quality, memory footprint, and computational complexity.

1. Introduction

Single image super-resolution (SISR) aims to reconstruct a visually natural high-resolution (HR) image from its low-resolution (LR) counterpart, which is an inherently ill-posed inverse problem. Owing to its essential role in video processing [1], surveillance systems [2], and object restoration [3], super-resolution (SR) remains an active research area.

Recently, deep learning-based image super-resolution methods [4–7] have shown prominent performance over conventional methods such as bicubic interpolation and Lanczos resampling. After the proposal of residual learning [8], which simplifies the optimization of deep convolutional neural networks (CNNs), SR networks have tended to become even deeper and larger. However, it is impractical to simply pursue performance gains without considering the model size and computational complexity. For devices with limited memory and battery capacity, cost-effective methods are preferred, which encourages the design of lightweight SR models. To reduce the number of parameters, some approaches adopt a recursive manner or a parameter sharing scheme [9, 10]. However, to compensate for the performance drop, these methods have to increase the network width or depth, thus resulting in high computational complexity, as shown in Figure 1. Some other methods directly design shallower network architectures, which reduce parameters and calculations simultaneously. For example, [11, 12] are such compact models with fewer than 40 layers. However, their representation ability is restricted by the shallow architecture.

To address these drawbacks, we propose the Dual Feature Aggregation Network (DFAN), which strikes a better trade-off between SR performance and computational cost, as illustrated in Figure 1. The key component of DFAN is the dual feature aggregation strategy. It aggregates local features and global features in a coarse-to-fine manner and can largely improve feature utilization via feature reuse. Specifically, the dual feature aggregation strategy consists of two modules: the Local Aggregation Module (LAM) and the Global Aggregation Module (GAM). LAM uses an efficient connection method and one convolutional layer to adaptively fuse hierarchical features along the channel dimension. Then, GAM further fuses the local aggregated features along the spatial dimension in an iterative manner. This progressive aggregation strategy fully leverages all hierarchical features, which enables the lightweight model to achieve better SR performance. In this paper, we also design an Efficient Convolutional Block (ECB) as the basic building block of DFAN. It comprises group convolutional layers with a channel shuffle operation. Although ECB is compact, DFAN can still achieve competitive results with the help of the dual feature aggregation strategy.

In summary, our main contributions are as follows:
(i) We propose DFAN, which can achieve better SR performance with limited computational cost. It is more practical in real applications.
(ii) We propose the dual feature aggregation strategy, which aggregates local and global features in a progressive manner. It makes full use of all hierarchical features through feature reuse, which enhances the feature utilization while introducing only marginal computational cost. With our dual feature aggregation strategy, the lightweight SR model can achieve better cost-effectiveness.
(iii) We also propose ECB as the basic building structure, which can extract hierarchical features in a computationally economical way.
(iv) We show through extensive experiments that our model can achieve competitive results against state-of-the-art methods with relatively fewer parameters and calculations.

2.1. Lightweight SISR

Since Dong et al. [4] first applied CNNs to design the Super-Resolution Convolutional Neural Network (SRCNN) and achieved significant improvement, deep learning-based SISR methods have been actively explored and have shown great advantages in representation capability. To obtain more powerful features for image reconstruction, they continue to enlarge the model size or network depth. Most existing SR methods have hundreds of convolutional layers, such as Residual Channel Attention Network (RCAN) [13], Residual Dense Network (RDN), and Deep Alternating Network (DAN) [14]. However, these methods are computationally expensive for real applications. Thus, more and more lightweight SR methods have been proposed. Deep Recursive Residual Network (DRRN) [9] and Memory Network (MemNet) [10] introduce recursive learning or weight sharing schemes to reduce parameters. However, they need to increase the computational complexity to compensate for the performance drop. Another idea is to build relatively shallower models, which can cut down the model size and calculations at the same time. Cascading Residual Network (CARN) [11], Information Distillation Network (IDN) [12], and Information Multi-Distillation Network (IMDN) [15] are all lightweight networks that have fewer than 40 layers. However, the shallow architecture could restrict their representation ability to some extent. In our method, we improve the feature utilization through dual feature aggregation, which can better balance the SR performance and computational cost.

2.2. Group Convolution

There has been rising interest in designing small and efficient neural networks [16–19] since many deep and complicated neural networks are infeasible in practical applications. Group convolution is an important method for designing efficient neural networks. The application of group convolution dates back to [20], where the model is distributed over two GPUs, resulting in gains in accuracy and convergence speed. Depthwise convolution is a special case of group convolution and was originally introduced in [21]. In depthwise convolution, the number of groups is equal to the number of channels. Based on depthwise convolution, Mobile Network (MobileNet) [18] achieves state-of-the-art results among lightweight models in many visual tasks. Group convolution and depthwise convolution are then generalized in a novel form in [22]. The channel shuffle operation is also proposed in [22] to overcome the side effect of group convolution. Recently, group convolution has been used in some lightweight image super-resolution methods. Ahn et al. [11] proposed an efficient residual block containing group convolutional layers, and Hui et al. [12] introduced group convolution to some specific layers. However, there is still room for improvement in the reconstruction performance of these two models. In our DFAN, group convolution is used as a basic building unit without affecting the reconstruction performance.

2.3. Deep Feature Aggregation

As the feature representation capability of a single network layer is limited [23, 24], deep feature aggregation is typically used to fuse features of different layers, which can improve the representation capability in a computationally economical way. For instance, the Densely Connected Network (DenseNet) [25] and the Feature Pyramid Network (FPN) [26] are the dominant architectures for semantic feature aggregation and spatial feature aggregation [27]. DenseNet can better propagate features and gradients through dense connections that connect each layer to every other layer in a feed-forward fashion. FPN can equalize resolution and standardize semantics across the levels of a pyramidal feature hierarchy through top-down and lateral connections. Besides, Residual Network (ResNet) [8] is also a typical feature aggregation method which aggregates features via simple element-wise summation. Recently, Yu et al. [28] proposed an iterative aggregation method and a hierarchical aggregation method, which can further improve the performance of the aforementioned dominant architectures in many visual tasks. Inspired by this work, we introduce an iterative and adaptive global feature aggregation module to DFAN, obtaining more comprehensive information and improving reconstruction performance.

3. Proposed Method

3.1. Network Architecture

As depicted in Figure 2(a), DFAN mainly consists of four parts: the shallow feature extraction layer, stacked local feature aggregation modules, the global feature aggregation module, and the upsampling module.

The shallow feature extraction layer contains only one convolutional layer. It extracts shallow features $F_0$ from the LR image $I_{LR}$. Then, $F_0$ is input into the stacked LAMs for global residual learning. There are $N$ stacked LAMs, and the local aggregated feature from the $n$-th LAM can be formulated as

$$F_n = H_{LAM}^{n}(F_{n-1}), \quad n = 1, \ldots, N, \tag{1}$$

where $H_{LAM}^{n}$ refers to the operation of the $n$-th LAM, and $F_n$ is the local aggregated feature from it. As shown in Figure 2, each LAM is composed of a series of ECBs; therefore, $H_{LAM}^{n}$ can be viewed as a composite function.

After that, GAM fully leverages the local aggregated features $F_1, \ldots, F_N$ from the $N$ LAMs in an iterative way, which can be expressed as

$$G_N = H_{GAM}(F_1, F_2, \ldots, F_N), \tag{2}$$

where $G_N$ is the global aggregated feature and $H_{GAM}$ denotes the operation of GAM. Then, the global long skip connection adds the shallow feature $F_0$ to $G_N$, obtaining the final aggregated feature $F_A = G_N + F_0$. The global skip connection can better propagate information and gradients, thus stabilizing the training of DFAN.

Finally, we use the upscaling module proposed in [29] to restore the final SR image $I_{SR}$ from the final aggregated feature $F_A$. The reconstruction applies a group convolution $H_{GC}$, a standard convolution $H_{C}$, and the upscaling module $H_{UP}$.
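The upscaling module of [29] is built on the pixelshuffle (sub-pixel) rearrangement. As a minimal sketch, the pure-Python function below rearranges a feature of shape (C·r², H, W) into (C, H·r, W·r). The function name `pixel_shuffle`, the nested-list representation, and the channel ordering (the one commonly used by deep learning frameworks) are assumptions made for illustration, not details taken from DFAN.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) nested list into a (C, H*r, W*r) nested list."""
    channels_in = len(x)
    height, width = len(x[0]), len(x[0][0])
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r * r")
    channels_out = channels_in // (r * r)
    out = [[[0] * (width * r) for _ in range(height * r)]
           for _ in range(channels_out)]
    for c in range(channels_out):
        for h in range(height):
            for w in range(width):
                for i in range(r):
                    for j in range(r):
                        # Input channel (c, i, j) supplies the sub-pixel at offset (i, j).
                        src = c * r * r + i * r + j
                        out[c][h * r + i][w * r + j] = x[src][h][w]
    return out


# Four 1x1 channels become one 2x2 plane for an upscaling factor of 2.
lr_feature = [[[0]], [[1]], [[2]], [[3]]]
sr_plane = pixel_shuffle(lr_feature, 2)
```

Because the convolution that precedes this rearrangement must output C·r² channels, its size grows with the upscaling factor r.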

3.2. Local Feature Aggregation

Since features of different layers contain different weighted information, adaptively aggregating all hierarchical features could effectively improve the representation ability. Referring to [28], the key axes of feature fusion are semantic and spatial, which are closely related to channel and spatial dimensions, respectively. Thus, we propose the dual feature aggregation strategy, in which features are locally aggregated along the channel dimension, and then globally aggregated along the spatial dimension. In this subsection, we first explain the local feature aggregation.

3.2.1. Efficient Convolutional Block

As depicted in Figure 2(c), ECB is the basic building block of LAM. ECB is a residual learning module consisting of two group convolutional layers with a channel shuffle operation [22] and a channel attention module [7]. Group convolution with the channel shuffle operation can extract useful features in a computationally economical way. Assuming the number of groups of a group convolutional kernel is $g$, both its parameter count and its computational complexity are $1/g$ of those of a standard convolutional kernel of the same size. Moreover, the channel shuffle operation enhances the information exchange among channels without extra parameters or calculations. There are $M$ ECBs in each LAM. LAM fuses hierarchical features from the ECBs by exploring the interchannel relationship. The local aggregated feature from the $n$-th LAM can be obtained by

$$F_n = H_{F}^{n}([B_{n,1}, B_{n,2}, \ldots, B_{n,M}]), \tag{4}$$

where $[B_{n,1}, B_{n,2}, \ldots, B_{n,M}]$ represents the concatenation of local features from the $M$ ECBs in the $n$-th LAM, and $H_{F}^{n}$ is the convolutional layer that fuses them.
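The two properties of group convolution with channel shuffle used above can be checked with a small sketch. The helper names `conv_params` and `channel_shuffle` are hypothetical, and the 3 × 3 kernel size is an assumption; the 64 channels and 8 groups follow the DFAN configuration reported in the experiments.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution (bias ignored)."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    # Each output channel only sees c_in / groups input channels.
    return (c_in // groups) * c_out * k * k


def channel_shuffle(channels, groups):
    """Interleave channels so that the next group convolution mixes all groups."""
    if len(channels) % groups:
        raise ValueError("channel count must be divisible by groups")
    per_group = len(channels) // groups
    # Equivalent to reshape (groups, per_group) -> transpose -> flatten.
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


standard = conv_params(64, 64, 3)           # standard convolution
grouped = conv_params(64, 64, 3, groups=8)  # group convolution, g = 8
shuffled = channel_shuffle([0, 1, 2, 3, 4, 5], groups=2)
```

The shuffle is a pure permutation of channel indices, which is why it adds neither parameters nor multiply-adds.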

3.2.2. Balanced Connection

The connection method in LAM is what we call balanced connection. As shown in Figure 3, compared with two commonly used connection methods in SR, i.e., skip connection and dense connection, our balanced connection is more flexible than skip connection and more lightweight than dense connection. The analysis is as follows:
(1) Difference to Skip Connection. As shown in Figure 3(b), for each LAM, if we only use skip connection, which takes the elementwise sum of the hierarchical feature maps, all hierarchical features will contribute equally to the final aggregated feature. This may be inflexible since different features contain information of different importance. Our balanced connection solves this issue with a single convolutional kernel. This convolutional kernel assigns specific learned weights to each pixel of the local features, thus adaptively aggregating them along the channel dimension.
(2) Difference to Dense Connection. As shown in Figure 3(c), dense connection concatenates the outputs of each ECB and all preceding ECBs and compresses them as inputs to all subsequent ECBs, which requires more convolutional kernels and harms the overall efficiency. In contrast, our balanced connection directly connects each ECB for feature aggregation, which not only fully uses local features but also greatly reduces the number of parameters and computation operations.
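To give a rough sense of the efficiency argument above, the sketch below counts only the fusion and compression weights of the two schemes. It assumes 1 × 1 fusion kernels and assumes that, under dense connection, the i-th ECB compresses a concatenation of i feature maps back to C channels; both are illustrative assumptions rather than details stated for DFAN, and the function names are hypothetical.

```python
def balanced_fusion_params(num_blocks, channels, k=1):
    """One fusion convolution over the concatenation of all block outputs."""
    return (num_blocks * channels) * channels * k * k


def dense_compression_params(num_blocks, channels, k=1):
    """One compression convolution per block, over all preceding outputs."""
    return sum((i * channels) * channels * k * k
               for i in range(1, num_blocks + 1))


# 10 ECBs per LAM and 64 channels, as in the DFAN configuration.
balanced = balanced_fusion_params(10, 64)
dense = dense_compression_params(10, 64)
```

Under these assumptions the balanced connection grows linearly with the number of ECBs, whereas the dense variant grows quadratically.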

3.3. Global Feature Aggregation

The spatial dimension is orthogonal to the channel dimension. Thus, further fusing local aggregated features along the spatial dimension could supplement more information. Besides, since local aggregated features contain abundant information, it could be suitable to aggregate them in a coarse to fine fashion. Therefore, we design GAM, which can further fuse local aggregated features with spatial attention mechanism in an iterative manner.

In Figure 2(d), $G_n$ represents the global aggregated feature in the $n$-th iteration, and $F_n$ represents the output of the $n$-th LAM. The iterative fusion of GAM can be formulated as

$$G_n = H_{A}(G_{n-1}, F_n), \quad n = 2, \ldots, N, \tag{5}$$

where $G_1$ is initialized with $F_1$, which is the output of the first LAM, and $H_{A}$ represents the global aggregation of GAM.

The main parts of GAM are (1) spatial attention generation and (2) iterative feature aggregation. First, the spatial attention $A_n$ is generated by the following operation,

$$A_n = \sigma(H_{DW}(H_{R}([G_{n-1}, F_n]))), \tag{6}$$

where $H_{R}$ denotes a convolutional kernel that reduces the channel number of the concatenation $[G_{n-1}, F_n]$ by half, and $H_{DW}$ denotes a depthwise convolutional kernel to extract spatial information. Depthwise convolution applies a single filter to each input channel, which is more efficient than common convolution in terms of memory and computation. $\sigma$ is the Sigmoid activation function constraining the spatial attention to $(0, 1)$. The spatial attention $A_n$ is the same size as $G_{n-1}$ and $F_n$. Second, as shown in Figure 2(d), the feature fusion in A-Unit can be formulated as

$$G_n = (\mathbf{1} - A_n) \odot G_{n-1} + A_n \odot F_n, \tag{7}$$

where $\odot$ denotes the Hadamard product and $\mathbf{1}$ is the tensor with all elements being 1.

After the final iteration, we obtain the final global aggregated feature $G_N$. The overall iterative global aggregation can be summarized as follows,

$$G_N = \sum_{n=1}^{N} W_n \odot F_n, \quad \sum_{n=1}^{N} W_n = \mathbf{1}, \tag{8}$$

where $W_n$ denotes the final spatial attention for $F_n$, which is determined by all the local aggregated features from the LAMs and is thus highly comprehensive. Additionally, Eq. (8) indicates that the global feature aggregation strategy satisfies the convex combination.
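The convex-combination property can be checked numerically at a single pixel. In this minimal sketch, each iteration blends the running aggregate with the next local feature using a gate in (0, 1); which of the two inputs receives the gate and which its complement is an assumption here, and the gate values are given directly instead of being produced by convolutions and a Sigmoid. The function names are hypothetical.

```python
def iterative_aggregate(features, gates):
    """Fuse F_1..F_N step by step; gates holds one value per later feature."""
    aggregated = features[0]  # initialised with the first local feature
    for feature, gate in zip(features[1:], gates):
        aggregated = (1.0 - gate) * aggregated + gate * feature
    return aggregated


def effective_weights(gates):
    """Unroll the iteration into one final weight per local feature."""
    all_gates = [1.0] + list(gates)  # the first feature enters with weight 1
    weights = []
    for i, gate in enumerate(all_gates):
        weight = gate
        for later in all_gates[i + 1:]:
            weight *= (1.0 - later)  # attenuated by every later fusion step
        weights.append(weight)
    return weights


features = [1.0, 2.0, 4.0]  # toy values of F_1, F_2, F_3 at one pixel
gates = [0.5, 0.25]         # toy attention values for the two fusion steps
fused = iterative_aggregate(features, gates)
weights = effective_weights(gates)
unrolled = sum(w * f for w, f in zip(weights, features))
```

The unrolled weights are nonnegative and sum to one, so the fused value always lies between the smallest and largest local feature at that pixel.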

4. Experiments

4.1. Experimental Setup
4.1.1. Datasets and Metrics

We use the training set of DIV2K [30] to train all of our models. For testing, we use five standard benchmark datasets: Set5 [31], Set14 [32], BSD100 [33], Urban100 [34], and Manga109 [35]. The visual quality of SR results is evaluated with Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) [36] on the Y channel (i.e., luminance) of the transformed YCbCr space. We also report the number of parameters and multiply-adds to evaluate the memory footprint and computational complexity, respectively.
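As a minimal sketch of the PSNR part of this protocol, the code below converts 8-bit RGB values to luminance and computes PSNR on flat lists of pixel values. The ITU-R BT.601 conversion with a [16, 235] output range is assumed because it is the convention commonly used in SR evaluation; the exact conversion used for DFAN is not specified in the text, and the function names are hypothetical.

```python
import math


def rgb_to_y(r, g, b):
    """Luminance of an 8-bit RGB pixel (BT.601, output range [16, 235])."""
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def psnr(reference, estimate, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized pixel lists."""
    if not reference or len(reference) != len(estimate):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)


hr_y = [rgb_to_y(*p) for p in [(255, 255, 255), (0, 0, 0), (128, 64, 32)]]
sr_y = [rgb_to_y(*p) for p in [(250, 250, 250), (4, 4, 4), (128, 60, 36)]]
score = psnr(hr_y, sr_y)
```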

4.1.2. Degradation Models

To fully demonstrate the effectiveness of our DFAN, we use two degradation models to simulate LR images. The first is the bicubic degradation model, which simulates LR images at three scales. The second is the blur-down degradation model, which blurs HR images with a Gaussian kernel and then downsamples the blurred images.

4.1.3. Training Details

The size of LR patches is . During training, we randomly rotate input images by , , or and flip them horizontally or vertically. The batch size is . We use loss as the loss function. We use Adam as the optimizer. The initial learning rate is , decayed by half every 200 epochs. We train our model for 1000 epochs.
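The learning-rate schedule described above (halved every 200 epochs over 1000 epochs) can be sketched as a step decay. The function name `step_decay_lr` is hypothetical, and the initial learning rate is passed in as a parameter; the value 1.0 used below is only a placeholder so that the decay factors are easy to read.

```python
def step_decay_lr(epoch, initial_lr, step=200, factor=0.5):
    """Learning rate at a given epoch when it is halved every `step` epochs."""
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return initial_lr * factor ** (epoch // step)


# Decay factors at the start of each 200-epoch stage of a 1000-epoch run.
stage_factors = [step_decay_lr(e, 1.0) for e in range(0, 1000, 200)]
```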

4.2. Study on Efficient Convolutional Block

Different from most super-resolution networks, our DFAN uses group convolutional kernels instead of standard convolutional kernels in an ECB to extract features. Since group convolution is a basic operation of our ECB, we design DFAN_W and DFAN_D to validate the effectiveness of ECB. These two models have the same structure as DFAN, but the group convolutional kernels in ECBs are replaced with standard convolutional kernels. All three models have a similar number of parameters and computation operations, i.e., approximately 900 K and 60 G, respectively.

We denote the number of ECBs in each LAM as $M$, the number of LAMs as $N$, and the number of channels of each intermediate feature as $C$. For our DFAN, we set $M$, $N$, and $C$ to 10, 6, and 64, respectively, and the group number of each group convolutional kernel in ECBs is 8. We set $M$, $N$, and $C$ to 3, 2, and 64, respectively, for DFAN_W, and to 10, 6, and 27, respectively, for DFAN_D. In other words, the width of DFAN_W is the same as that of DFAN, while the depth of DFAN_D is the same as that of DFAN.

As shown in Table 1, group convolution makes an outstanding trade-off between representation capability and computational costs. Compared with standard convolution, group convolution can make the model deeper or wider with limited parameters and calculations, which is beneficial to obtain richer hierarchical information.

4.3. Study on Dual Feature Aggregation

In this section, we experimentally investigate the effectiveness of the dual feature aggregation strategy. LAM0_GAM0 is the baseline network obtained by removing both the balanced connections in LAM and the GAM from DFAN. LAM1_GAM0 is built by removing only GAM from DFAN. LAM1_GAM1 has both LAM and GAM and is therefore the same as DFAN. As shown in Table 2, adding LAM alone improves PSNR, and adding both LAM and GAM improves the performance by a large margin (e.g., on Set14).

4.3.1. LAM Analysis

To intuitively show the effectiveness of LAM, we plot the training curves of LAM0_GAM0 and LAM1_GAM0 in Figure 4(a). Benefitting from the balanced connection in LAM, gradients can be better propagated. The margin between the two curves indicates that LAM not only helps the network converge faster but also helps it converge to a better point. Additionally, the weight distribution is visualized in Figure 4(b). It indicates how much the information of each ECB in an LAM contributes to the local aggregated feature generated by this LAM. Features from different ECBs contribute differently to local aggregated features, which suggests that LAM can adaptively aggregate hierarchical features to improve the final performance.

4.3.2. GAM Analysis

We experimentally show that GAM also works well for some other networks. We use a shallower RCAN [7] as the baseline network (denoted as sRCAN). To facilitate network training, we set the RG number to 3 and the RCAB number to 5 for sRCAN. Then, we apply our GAM to sRCAN, which is denoted as sRCAN_GAM. As Table 3 shows, with only a small increase in parameters and computational complexity (Parameters: +9 K, MultAdds: +0.9 G), GAM can significantly improve the SR performance on all the benchmark datasets. Therefore, GAM could be used as a general lightweight tool to improve the performance of some existing SR methods.

To better understand the adaptive and iterative aggregation strategy of GAM, we visualize the spatial attention heatmaps generated by GAM in Figure 5. The 3D attention is transformed to 2D by taking the absolute mean along the channel dimension and is then normalized over the spatial dimension. We make two observations. (1) Spatial attention for different LAMs focuses on regions of different frequencies. For example, the spatial attention for LAM_1 (Figure 5(a)) focuses on low-frequency regions such as the background, while the spatial attention for LAM_6 (Figure 5(f)) focuses more on high-frequency regions with rich textures. Thus, both high-frequency and low-frequency information is important for SR. (2) Although several spatial attention maps focus on high-frequency regions, they emphasize different parts. In LAM_6, more attention is given to regions of the main object, whereas in LAM_5, high-frequency regions in the background are emphasized. This indicates that GAM provides additional flexibility to deal with different types of information, which could enhance the representation capability.
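The heatmap transformation described above can be sketched in a few lines. The attention tensor is represented as a nested list of shape (C, H, W), and min-max scaling to [0, 1] is assumed for the spatial normalization step; both the representation and the function name `attention_heatmap` are illustrative assumptions.

```python
def attention_heatmap(attention):
    """Collapse a (C, H, W) attention tensor into a normalized (H, W) heatmap."""
    channels = len(attention)
    height, width = len(attention[0]), len(attention[0][0])
    # Absolute mean along the channel dimension.
    mean_abs = [[sum(abs(attention[c][i][j]) for c in range(channels)) / channels
                 for j in range(width)]
                for i in range(height)]
    flat = [value for row in mean_abs for value in row]
    low, high = min(flat), max(flat)
    if high == low:
        return [[0.0] * width for _ in range(height)]  # constant map
    # Min-max normalization over the spatial dimension (assumed range [0, 1]).
    return [[(value - low) / (high - low) for value in row] for row in mean_abs]


heatmap = attention_heatmap([[[1, -3]], [[-1, 5]]])
```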

4.4. Results with Bicubic Degradation Model

We compare DFAN with other state-of-the-art methods: SRCNN [4], Fast Super-Resolution Convolutional Neural Network (FSRCNN) [37], Very Deep Super-Resolution(VDSR) [5], Deeply-Recursive Convolutional Network (DRCN) [38], DRRN [9], MemNet [10], CARN [11], IDN [12], and IMDN [15].

4.4.1. Quantitative Results

We evaluate the average PSNR and SSIM on five benchmark datasets. We also calculate the number of parameters and multiply-adds of these models by assuming the HR image size to be 720p ($1280 \times 720$). In Table 4, the proposed DFAN performs favorably against these methods on all benchmark datasets at all three scales. Note that the number of parameters of our method differs across scales because we apply the pixelshuffle operation [29] for upscaling, and the convolutional kernels in the upscaling module are of different sizes for different scales. CARN [11] used to be a strong baseline for lightweight SR models, but our DFAN outperforms it by a large margin in both PSNR and SSIM (SSIM: +0.0024 on Set5) with fewer parameters and fewer MultAdds. This indicates that our method can achieve a better trade-off between computational cost and effectiveness. Therefore, feature aggregation has promising prospects in the research of lightweight image SR.
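For reference, the multiply-adds of a single convolutional layer under the 720p convention can be computed as follows. The upscaling factor of 4, the 3 × 3 kernel, and the function name `conv_mult_adds` are assumptions chosen for illustration; the 64 channels and 8 groups follow the DFAN configuration.

```python
def conv_mult_adds(out_h, out_w, c_in, c_out, k, groups=1):
    """Multiply-adds of one k x k convolution over an out_h x out_w output."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    return out_h * out_w * (c_in // groups) * c_out * k * k


hr_w, hr_h = 1280, 720  # 720p HR output
scale = 4               # assumed upscaling factor
lr_w, lr_h = hr_w // scale, hr_h // scale  # LR feature maps are 320 x 180

standard = conv_mult_adds(lr_h, lr_w, 64, 64, 3)
grouped = conv_mult_adds(lr_h, lr_w, 64, 64, 3, groups=8)
```

Since most layers operate at LR resolution, the per-layer cost shrinks quadratically with the upscaling factor for a fixed HR size.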

4.4.2. Visual Results

In Figure 6, we show visual comparisons. Our method restores the letter “g” in “ppt3” more clearly, while most other methods suffer from artifacts or edge distortion. For “img030” in Urban100 and “img86000” in BSD100, most methods do not reconstruct the contour of the window well, but our method reconstructs these edges better.

4.5. Results with Blur-Down Degradation Model

We further apply our method to super-resolve images with blur-down degradation, which is also commonly used in [7, 13]. We compare DFAN with SRCNN [4], FSRCNN [37], VDSR [5], CARN [11], IDN [12], and IMDN [15].

As shown in Table 5, compared with the networks that are stacked by several elaborately designed building blocks, such as CARN, IDN, and IMDN, our lightweight network with the dual feature aggregation strategy can better leverage the hierarchical features. In addition, the visual comparison in Figure 7 also demonstrates the superiority of our method.

5. Conclusions

We propose DFAN, which strikes a better trade-off between SR performance and computational cost. The proposed dual feature aggregation strategy performs local and global feature aggregation adaptively. Through feature reuse, it simultaneously improves feature utilization and representation ability. Benefitting from the dual feature aggregation strategy, our network achieves competitive performance with fewer parameters and lower computational complexity, which makes it more practical for real applications.

Data Availability

The image datasets supporting this work are from previously reported studies and datasets, which have been cited. The processed data are available at the repository: BasicSR (https://github.com/xinntao/BasicSR/blob/master/docs/DatasetPreparation.md#Image-Super-Resolution).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Key R&D Program of China (2019YFB1406200) and was also the research achievement of the Key Laboratory of Digital Rights Services. It is based on our previous teamwork, Lightweight Image Super-Resolution via Dual Feature Aggregation Network, presented in 2021 at the 2nd International Conference on Culture-oriented Science & Technology (ICCST).