Special Issue: Exploration of Human Cognition using Artificial Intelligence in Healthcare
Med-SRNet: GAN-Based Medical Image Super-Resolution via High-Resolution Representation Learning
High-resolution (HR) medical imaging data provide more anatomical detail of the human body, which facilitates early-stage disease diagnosis. However, clear HR medical images are difficult to obtain because of limiting factors such as imaging systems, imaging environments, and human factors. This work presents a novel medical image super-resolution (SR) method, Med-SRNet, based on high-resolution representation learning within a generative adversarial network (GAN). We use a GAN as the backbone of SR because GANs markedly improve the visual quality of reconstructed images and produce more realistic high-frequency details in the image SR task. Furthermore, we employ the HR network (HRNet) in the GAN generator to maintain HR representations and repeatedly apply multi-scale fusions to strengthen them. Moreover, instead of the simple bilinear interpolation used in HRNetV2, we adopt deconvolution operations to recover high-quality HR representations from all the parallel lower-resolution (LR) streams, with the aim of yielding richer aggregated features. When evaluated on a home-made medical image dataset and two public COVID-19 CT datasets, the proposed Med-SRNet outperforms other leading-edge methods, obtaining higher peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) values: compared with the other methods, the PSNR improvement on the “Brain” test sets ranges from 0.433 to 1.75, and the SSIM improvement on the “Lung” test sets ranges from 0.016 to 0.048.
Low-resolution (LR) medical images lack important pathological details and thus compromise diagnostic accuracy. High-resolution (HR) medical images provide detailed anatomical information that is vital for clinical applications and quantitative image analysis. However, image quality is often constrained by many limiting factors. Super-resolution (SR) is therefore a crucial technique for medical image processing [1, 2].
Recently, CNN-based SR methods [3–10] have achieved impressive performance. Networks have grown steadily deeper, from the first SRCNN to the deeper VDSR, DRRN, and MemNet, and then to the very deep RCAN. In addition, other effective methods build the whole network by simply connecting a series of identical feature extraction modules, e.g., RDN, IDN, MSRN, and SRFBN, which indicates that the capability of each block is important. The GAN model provides a new idea for image generation and also a model basis for HR image generation. SRGAN was the first work to introduce the GAN model into SR reconstruction, obtaining higher visual quality and more realistic high-frequency details. However, because the SRGAN generator network is relatively simple, the extracted features are often insufficient, which degrades the reconstruction quality. Subsequently, new SR methods based on GAN models and deep convolutional networks have been proposed to improve the quality of image SR at different levels [19–24].
Unsurprisingly, deep learning methods that intensively exploit multi-scale features and HR representations have achieved impressive results on numerous vision tasks [15, 25–30]. HRNet and its variant HRNetV2 perform particularly well. However, they do not make appropriate use of LR representations to provide contextual information for HR representations.
Although GAN-based SR models can achieve relatively satisfactory results, some shortcomings remain: (i) with the original GAN framework, the training process is unstable and the SR performance fluctuates greatly; (ii) the generator network is too simple for feature extraction in the SR task, which leads to insufficient image features and degrades the reconstruction quality. Therefore, we combine the advantages of GANs and CNNs and propose a novel GAN-based architecture for medical image SR via HR representation learning, namely, Med-SRNet. We modify the feature aggregation parts of HRNet and HRNetV2 and bring the HRNet framework to the SR task. Figure 1 shows an SR result produced by Med-SRNet, which exhibits clearer structures, such as the multiple punctate lesions in the red square regions. In summary, the contributions of this paper are threefold:
(1) We use a GAN as the backbone of SR because GANs markedly improve the visual quality of reconstructed images and produce more realistic high-frequency details in the image SR task.
(2) We employ HRNet as the backbone of SR to maintain HR representations and repeat multi-scale fusions to strengthen them. Also, instead of the simple bilinear interpolation used in HRNetV2, we adopt deconvolution operations to recover HR representations from the LR medical images with the aim of yielding richer aggregated features.
(3) We evaluate the proposed method on the constructed medical image dataset and two open-access COVID-19 CT datasets. The experimental results demonstrate, qualitatively and quantitatively, that the proposed method obtains higher PSNR and SSIM values and preserves more local details and global features than other leading-edge methods.
The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 describes the proposed method. Section 4 presents the performance evaluation. Section 5 concludes the paper with a brief summary.
2. Related Work
In the last few years, the quality of SR has improved significantly with CNN models, from the first SRCNN to the latest feedback network. The superiority of CNN-based SR methods over conventional ones is remarkable. Because of its shallow structure, SRCNN shows limited performance. To boost performance, networks have become deeper and deeper. For example, the VDSR model proposed by Kim et al. has a deeper structure, and some recently proposed SR models with very deep structures, e.g., RCAN, achieve satisfying SR performance. Besides, SR models with dense connections, e.g., SRDenseNet and MemNet, further improve performance. Moreover, some other forms of methods have been proposed [9, 10, 33]. Kong et al. proposed the ClassSR framework to accelerate SR networks; its classification method reduces the computational cost. Mei et al. proposed a non-local sparse attention mechanism with a dynamic sparse attention pattern, which achieves the robustness and efficiency of sparse representation while retaining the ability to model long-range non-local dependencies. Lin et al. proposed an improved RCAN model that adds training iterations in the training stage to improve performance.
In representative computer vision tasks, i.e., object detection, image classification, and semantic segmentation, multi-scale networks [8, 15, 25–30] have achieved outstanding results. For SR tasks, multi-scale networks [8, 15, 25] also perform well. Li et al. presented a multi-scale residual network for image SR that adaptively detects image features at different scales. The multi-scale information distillation network for single-image SR by Sang et al. fully exploits image features and restores LR images to HR ones with high efficiency. More relevant to this work, the deep multi-scale network (DMSN) for medical image SR by Wang et al. enables a better representation of the global topological structure and local texture details of HR medical images. However, a common deficiency of these multi-scale networks is the high computational load caused by their huge number of parameters. To solve this problem, Gao et al. proposed a building block, called Res2Net, that establishes hierarchical residual-like connections within a single residual block and is superior to leading-edge baseline methods. For better performance, Sun et al. proposed HRNet and its variant HRNetV2, which maintain HR representations through the whole process. However, HRNet and HRNetV2 do not make appropriate use of LR representations to provide contextual information for HR representations. Besides, Guo et al. proposed a deep wavelet SR (DWSR) network that recovers the HR image from the LR image by predicting the “missing details” of wavelet coefficients. Huang et al. used the wavelet transform in CNN-based face SR and captured accurate global topology information and local textural details of faces.
GAN-based SR methods have emerged recently. SRGAN was the first work to introduce the GAN model into SR reconstruction, obtaining higher visual quality and more realistic high-frequency details. Subsequently, other GAN-based methods have been proposed, including the enhanced super-resolution generative adversarial network (ESRGAN), the deep convolutional generative adversarial network (DCGAN), WGAN, patch GAN, the conditional generative adversarial network (CGAN), and so on. Wang et al. proposed ESRGAN, which replaces the residual block with the dense block and removes the batch normalization (BN) layers. Although the PSNR of the generated image is not ideal, the perceptual quality is greatly improved. The discriminator of patch GAN reduces the number of training parameters and makes the model lightweight and easy to train. Gao et al. proposed a CGAN-based image SR network that addresses the possible mismatch between input and output when a GAN is directly applied to SR; its generator adopts a symmetric encoder-decoder structure with skip connections to transfer low-level information across layers. Zun et al. proposed a multi-scale parallel learning generative network based on SRGAN, which consists of two residual subnetworks: the subnetworks learn multi-scale characteristics from the LR images, and a fusion network then fuses the high-frequency information at different scales to generate HR images.
3. Proposed Method
In this section, we present the architecture of the proposed Med-SRNet. This work aims to reconstruct an SR medical image from an LR one, where the LR image is obtained by bicubic downsampling of the HR image. Let $I^{LR}$ and $I^{HR}$ denote the LR and HR images, respectively. The end-to-end mapping function $G_{\theta_G}$ between $I^{LR}$ and $I^{HR}$ can be derived by solving the following problem:
$$\hat{\theta}_G = \arg\min_{\theta_G} \frac{1}{N} \sum_{n=1}^{N} L\left(G_{\theta_G}(I_n^{LR}),\, I_n^{HR}\right),$$
where $\theta_G$ is the network parameter set that needs to be optimized; $L(\cdot)$ is the loss function for minimizing the difference between $G_{\theta_G}(I_n^{LR})$ and $I_n^{HR}$; and $N$ is the training sample number.
The GAN is an effective framework for this task. As shown in Figure 2, a GAN is a generative model built on the idea of a zero-sum game, consisting of a generator G and a discriminator D. The generator G synthesizes data from the initial noisy input, while the discriminator D determines whether its input was produced by the generator or is real data. The two networks play against each other repeatedly, each using the feedback to improve its own capability, until the discriminator D can accurately judge the authenticity of data while the generator G produces data convincing enough to blur the judgment of D.
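Thus, following SRGAN, we further define a discriminator network $D_{\theta_D}$, which we optimize in an alternating manner along with the generator $G_{\theta_G}$ to solve the adversarial min-max problem:
$$\min_{\theta_G} \max_{\theta_D} \; \mathbb{E}_{I^{HR} \sim p_{\mathrm{train}}(I^{HR})}\left[\log D_{\theta_D}(I^{HR})\right] + \mathbb{E}_{I^{LR} \sim p_G(I^{LR})}\left[\log\left(1 - D_{\theta_D}\left(G_{\theta_G}(I^{LR})\right)\right)\right],$$
where $I^{HR}$ and $I^{LR}$ are the HR and LR images, $p_{\mathrm{train}}$ denotes the true sample distribution, and $p_G$ denotes the generator distribution.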
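To make the adversarial min-max objective more concrete, the following Python sketch evaluates the two-term value function from a handful of discriminator outputs. It is only an illustration: the function name `gan_value` and the sample probabilities are hypothetical and are not taken from the paper, and in practice the probabilities would come from the trained discriminator.

```python
import math

def gan_value(d_real, d_fake):
    """Empirical value E[log D(x_real)] + E[log(1 - D(G(x_lr)))].

    d_real: discriminator probabilities assigned to real HR images.
    d_fake: discriminator probabilities assigned to generated SR images.
    The discriminator tries to maximize this value; the generator tries
    to minimize it.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# Hypothetical discriminator outputs (assumed values for illustration).
confident = gan_value(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])  # D separates well
fooled = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])     # D cannot tell
print(confident, fooled)
```

When the discriminator outputs 0.5 everywhere, the value equals -2 log 2, which is lower than the value obtained by a discriminator that separates real from generated samples.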
Figure 3 shows the complete architecture of the proposed Med-SRNet. The network takes LR images as input and learns representations with the HRNet backbone. Here, we mainly focus on the backbone network, i.e., the feature extraction part and the feature aggregation part of the generator network in Figure 3.
For the generator network shown in Figure 3(a), we employ HRNet as the backbone of SR, which repeatedly applies multi-scale fusions to maintain HR representations through the whole process. However, HRNet outputs only the representation from the highest-resolution branch, without feature aggregation. In its variant HRNetV2, Sun et al. aggregated the upsampled representations from all the parallel convolutions rather than only the HR representations. Xiao et al. showed that deconvolutional layers can recover high-quality HR representations. Inspired by this, we adopt deconvolution operations, instead of the bilinear interpolation used in HRNetV2, to recover HR representations from all the parallel LR representations with the aim of yielding richer aggregated features, as shown by the red up arrows in the feature aggregation part of Figure 3(a). The effectiveness of this design is examined experimentally in Section 4.4.
The feature extraction part starts from an HR subnetwork as the first stage and gradually adds high-to-low resolution subnetworks one by one to form more stages, connecting the multi-resolution subnetworks in parallel. Multi-scale fusions are conducted repeatedly so that each of the high-to-low resolution representations receives information from the other parallel representations over and over, leading to rich HR representations.
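The repeated exchange between parallel resolutions can be pictured with a deliberately simplified Python sketch for two branches. This is not the authors' implementation: HRNet uses learned strided convolutions for downsampling and a 1 × 1 convolution after upsampling, whereas the sketch below substitutes average pooling and nearest-neighbour upsampling so that it stays self-contained. All function names are hypothetical.

```python
def upsample_nearest(fmap, factor=2):
    """Nearest-neighbour upsampling of a 2D feature map (list of rows)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out

def downsample_avg(fmap, factor=2):
    """Average pooling; a stand-in for the learned strided convolution."""
    h, w = len(fmap) // factor, len(fmap[0]) // factor
    return [[sum(fmap[i * factor + di][j * factor + dj]
                 for di in range(factor) for dj in range(factor)) / factor ** 2
             for j in range(w)] for i in range(h)]

def add(a, b):
    """Element-wise sum of two feature maps of equal size."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def fuse(high, low):
    """Each branch receives the resized representation of the other one."""
    new_high = add(high, upsample_nearest(low))
    new_low = add(low, downsample_avg(high))
    return new_high, new_low

high = [[1.0, 2.0, 3.0, 4.0] for _ in range(4)]  # 4x4 high-resolution branch
low = [[10.0, 20.0], [30.0, 40.0]]               # 2x2 low-resolution branch
new_high, new_low = fuse(high, low)
print(new_high[0], new_low)
```

Applying `fuse` several times in a row mimics, in a very reduced form, how each branch keeps receiving information from the other resolution.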
In the generator of Med-SRNet, we use 4 stages with 4 parallel subnetworks, similar to HRNet-W32. From one subnetwork to the next, the resolution is halved while the channel number is doubled accordingly. The first stage is composed of 4 residual units, each containing a 64-channel (width) bottleneck, followed by a 3 × 3 convolution that reduces the width to 32. Stages 2 to 4 contain 1, 4, and 3 convolution units, respectively. Every convolution unit has 4 residual blocks, each of which has two 3 × 3 convolutional layers. This yields 4 different widths (32, 64, 128, and 256). After that, we apply 5 × 5, 7 × 7, and 11 × 11 deconvolutional layers to the 3 lower-resolution representations, respectively. Finally, the four groups of HR representations are aggregated via concatenation, followed by one 1 × 1 convolution for prediction. All convolutional layers are followed by ReLU.
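The shape bookkeeping of the generator can be checked with a small Python sketch. The widths (32, 64, 128, 256) and the deconvolution kernel sizes (5, 7, 11) come from the text; the strides (2, 4, 8), the padding of 2, the output padding of 1, the 64 × 64 input size, and the assumption that each deconvolution keeps its channel count are illustrative choices that are not stated in the paper. They are picked so that every lower-resolution branch is restored exactly to the resolution of the first branch.

```python
def branch_shapes(height, width, base_channels=32, branches=4):
    """(channels, height, width) per parallel branch: the resolution halves
    while the channel count doubles from one branch to the next."""
    return [(base_channels * 2 ** i, height // 2 ** i, width // 2 ** i)
            for i in range(branches)]

def deconv_out(size, kernel, stride, padding, output_padding):
    """Output length of a transposed convolution along one axis (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

shapes = branch_shapes(64, 64)  # assumed 64x64 feature map in the first branch

# Kernel sizes are from the text; strides and paddings are assumptions.
deconv_cfg = [(5, 2), (7, 4), (11, 8)]  # (kernel, stride) for branches 2 to 4
restored = [deconv_out(h, k, s, padding=2, output_padding=1)
            for (_, h, _), (k, s) in zip(shapes[1:], deconv_cfg)]

# Channels seen by the final 1x1 convolution if the deconvolutions keep
# their input channel counts (an assumption).
concat_channels = sum(c for c, _, _ in shapes)
print(shapes, restored, concat_channels)
```

Under these assumptions all three deconvolved branches return to 64 × 64, so they can be concatenated with the first branch before the 1 × 1 prediction convolution.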
To discriminate real HR images from generated SR samples, we train a discriminator network identical to that of SRGAN. The general architecture is illustrated in Figure 3(b). Here we follow the architectural guidelines summarized by Radford et al., using ReLU activation and avoiding max-pooling throughout the network. The discriminator network contains eight convolutional layers with an increasing number (64 to 512) of filter kernels. Strided convolutions are used to reduce the image resolution each time the number of features is doubled. The resulting 512 feature maps are followed by two dense layers and a sigmoid activation function to obtain a probability for sample classification.
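As a rough illustration of how the discriminator shrinks the image while widening the features, the sketch below traces the feature-map size through eight 3 × 3 convolutions. The per-layer channel and stride layout follows the SRGAN discriminator; the padding of 1 and the 96 × 96 input size are assumptions made for this example, not values given in the text.

```python
# (out_channels, stride) of the eight 3x3 convolutions, SRGAN-style layout.
DISC_LAYERS = [(64, 1), (64, 2), (128, 1), (128, 2),
               (256, 1), (256, 2), (512, 1), (512, 2)]

def conv_out(size, kernel=3, stride=1, padding=1):
    """Output length of a convolution along one axis (padding assumed to be 1)."""
    return (size + 2 * padding - kernel) // stride + 1

def trace_discriminator(size):
    """Return (channels, spatial size) after each convolutional layer."""
    trace = []
    for channels, stride in DISC_LAYERS:
        size = conv_out(size, stride=stride)
        trace.append((channels, size))
    return trace

trace = trace_discriminator(96)  # assumed 96x96 input patch
for channels, size in trace:
    print(channels, size)
```

With these assumptions the last layer produces 512 feature maps of size 6 × 6, which would then feed the two dense layers and the sigmoid.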
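Following SRGAN, the total loss function of the proposed model is defined as a weighted sum of individual loss functions:
$$l^{SR} = \alpha\, l_{MSE}^{SR} + \beta\, l_{Gen}^{SR},$$
$$l_{MSE}^{SR} = \frac{1}{r^{2} W H} \sum_{x=1}^{rW} \sum_{y=1}^{rH} \left(I_{x,y}^{HR} - G_{\theta_G}(I^{LR})_{x,y}\right)^{2},$$
$$l_{Gen}^{SR} = \sum_{n=1}^{N} -\log D_{\theta_D}\left(G_{\theta_G}(I_n^{LR})\right),$$
where $\alpha$ and $\beta$ are weighting parameters; $l_{MSE}^{SR}$ denotes the content loss, which is the most widely used optimization target for image SR and on which many state-of-the-art approaches rely; $l_{Gen}^{SR}$ denotes the adversarial loss of the generative network, which tries to fool the discriminator network; $G_{\theta_G}$ and $D_{\theta_D}$ are the generator and discriminator networks; $I^{LR}$ and $I^{HR}$ are the LR and HR images; $r$ is the downsampling factor in the downsampling operation; and $W$ and $H$ denote the width and height of the LR image, respectively.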
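The weighted sum of a content loss and an adversarial loss can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the content loss is taken to be the pixel-wise mean squared error, the adversarial loss is the SRGAN-style term -log D(G(I_LR)), and the weights (1.0 and 1e-3, the latter mirroring the value used in SRGAN) are placeholders because the text does not give the values used for Med-SRNet. All function and parameter names are hypothetical.

```python
import math

def mse_content_loss(hr, sr):
    """Pixel-wise mean squared error between HR and SR images (lists of rows)."""
    n = len(hr) * len(hr[0])
    return sum((h - s) ** 2
               for row_h, row_s in zip(hr, sr)
               for h, s in zip(row_h, row_s)) / n

def adversarial_loss(d_fake):
    """Generator-side loss: sum of -log D(G(I_LR)) over the batch."""
    return sum(-math.log(p) for p in d_fake)

def total_loss(hr, sr, d_fake, w_content=1.0, w_adv=1e-3):
    """Weighted sum of both terms; the weights are assumed placeholder values."""
    return w_content * mse_content_loss(hr, sr) + w_adv * adversarial_loss(d_fake)

hr = [[0.0, 1.0], [1.0, 0.0]]   # toy HR image
sr = [[0.0, 0.5], [1.0, 0.5]]   # toy generator output
loss = total_loss(hr, sr, d_fake=[0.5])
print(mse_content_loss(hr, sr), loss)
```

The adversarial term shrinks as the discriminator assigns higher probability to the generated image, which is what pushes the generator towards more realistic outputs.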
4. Performance Evaluation
In this section, experiments are performed to evaluate the proposed method qualitatively and quantitatively. The quantitative evaluation is based on PSNR and SSIM.
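For reference, PSNR can be computed as in the following Python sketch. It assumes images stored as lists of rows and a peak value of 255 (8-bit images); the function name and the toy pixel values are hypothetical. SSIM is omitted here because it requires local window statistics.

```python
import math

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    n = len(reference) * len(reference[0])
    mse = sum((r - e) ** 2
              for row_r, row_e in zip(reference, estimate)
              for r, e in zip(row_r, row_e)) / n
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

reference = [[100.0, 100.0]]
estimate = [[125.5, 74.5]]   # every pixel is off by 25.5 grey levels
score = psnr(reference, estimate)
print(score)
```

A higher PSNR means a smaller mean squared error relative to the peak intensity.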
4.1. Medical Image Datasets
In this work, a database suitable for medical image SR is constructed from medical images of four body parts: Brain, Lung, Abdomen, and Bone, with 250 images for each part. Brain and Lung images are chosen from the Cancer Imaging Archive (TCIA). Bone and Abdomen images are provided by the radiology department of a hospital in China. The training set is composed of 175 images for each part, i.e., 700 medical images in total; the test set consists of the remaining 300 images.
In addition, we select two publicly available COVID-19 CT datasets, termed COVID-CT_349 (https://github.com/UCSD-AI4H/COVID-CT), which includes 349 images, and COVID-CT_19 (https://github.com/ieee8023/covid-chestxray-dataset), which includes 19 images. We use COVID-CT_349 as the training set and both COVID-CT_349 and COVID-CT_19 as the test sets.
4.2. Implementation Details
For the constructed medical image database, data augmentation is applied to the 700-image training set. Following [4, 5], the original training images are rotated by 90°, 180°, and 270° and flipped horizontally, which yields seven additional augmented versions of each original image. The same data augmentation method is applied to COVID-CT_349 and COVID-CT_19.
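The count of seven additional versions follows from combining the four orientations (0°, 90°, 180°, 270°) with an optional horizontal flip, which gives eight images including the original. The Python sketch below illustrates this on a toy 2 × 2 image; it is one plausible reading of the augmentation described above, and the helper names are hypothetical.

```python
def rot90(img):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def augment(img):
    """Return the original plus its 7 rotated/flipped variants (8 images)."""
    versions = []
    current = [list(row) for row in img]
    for _ in range(4):  # 0, 90, 180, 270 degrees
        versions.append(current)
        versions.append(flip_horizontal(current))
        current = rot90(current)
    return versions

variants = augment([[1, 2], [3, 4]])
print(len(variants))
```

For an image without symmetries, all eight variants are distinct, so the 700 training images would expand to 5600 samples under this reading.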
We run the experiments on an HP 7920 series tower server with an NVIDIA RTX3090 graphics card. We use the Adam optimizer to train the proposed model. The initial learning rate is set to 0.0001 for all layers and is halved after every 50 epochs. The proposed model converges after 200 epochs. The training procedure takes roughly 9 hours on a single Tesla P40 GPU.
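The step schedule described above (initial rate 0.0001, halved after every 50 epochs) can be written as a one-line function. The sketch assumes zero-based epoch indices, which the text does not specify; the function name is hypothetical.

```python
def learning_rate(epoch, initial=1e-4, step=50, factor=0.5):
    """Learning rate at a given zero-based epoch under step decay."""
    return initial * factor ** (epoch // step)

schedule = {epoch: learning_rate(epoch) for epoch in (0, 49, 50, 100, 150, 199)}
for epoch, lr in schedule.items():
    print(epoch, lr)
```

Over the 200 training epochs the rate therefore takes four values, ending at one eighth of the initial rate.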
4.3. Comparison with State-of-the-Art Methods
In this section, the performance of the proposed method is evaluated on both the constructed medical image database (i.e., Brain, Lung, Abdomen, and Bone) and the COVID-19 datasets. For a direct comparison, the published code of the other models and the same training set are used for all methods. Tables 1–6 show the PSNR and SSIM values for scales 4 and 8, indicating that, on average, the proposed Med-SRNet obtains higher PSNR and SSIM values on these datasets than the other methods. Bold indicates the best.
Figure 4 presents visual results at scale 8 for the four image datasets, i.e., Brain with suspected cerebrovascular malformation, Lung with atherosclerosis of the aorta in the pulmonary mediastinal window, Abdomen with renal cyst, and normal Bone sites. The images reconstructed by the proposed Med-SRNet have a clearer structure and more abundant detail, which is clearly visible in the zoomed regions. Figure 5 shows visual results at scale 8 for COVID-19 images characterized by ground-glass opacities. The proposed Med-SRNet recovers details better than the other methods.
4.4. Ablation Study
This section evaluates the performance of feature aggregation component on the constructed medical image database. Compared to SRGAN  and HRNet , the proposed feature aggregation part adds one component: deconvolution. The comparison (scale: ) of PSNR for different feature aggregation parts is shown in Table 7. Our method obtains higher PSNR on average. “BI” and “MR” are the abbreviations of upsampled bilinear interpolation operation and multi-resolution, respectively.
5. Conclusion
We present a GAN-based medical image SR network via HR representation learning. It effectively exploits features of medical images to boost SR performance, building on the ability of GANs to markedly improve the visual quality of reconstructed images. Importantly, HRNet is employed as the backbone of SR to maintain HR representations and repeat multi-scale fusions to strengthen them. Also, instead of the simple bilinear interpolation used in HRNetV2, deconvolution operations are adopted to recover HR representations from the LR images with the aim of yielding richer aggregated features. Experimental results demonstrate, qualitatively and quantitatively, that the proposed method is superior to other leading-edge methods in LR image restoration. In the future, we will study multi-scale transform methods integrated with the SR task to better exploit features from medical images.
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This study was supported in part by the Basic Scientific Research Project of Liaoning Provincial Department of Education under grant no. LJKQZ2021152, in part by the National Science Foundation of China (NSFC) under grant no. 61602226, and in part by the PhD Start-Up Foundation of Liaoning Technical University of China under grant no. 18-1021.
J. Zhu, G. Yang, and P. Lio, “How can we make GAN perform better in single medical image super-resolution? A lesion focused multi-scale approach,” in Proceedings of the IEEE International Symposium on Biomedical Imaging (ISBI), IEEE, Venice, Italy, April 2019.
Y. Sang, J. Sun, S. Wang, Y. Peng, X. Zhang, and Z. Yang, “Multi-scale information distillation network for image super resolution in NSCT domain,” in Proceedings of the International Conference on Neural Information Processing (ICONIP), pp. 50–59, ACM, Sydney, NSW, Australia, December 2019.
X. T. Kong, H. Y. Zhao, Y. Qiao, and C. Dong, “ClassSR: a general framework to accelerate super-resolution networks by data characteristic,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12016–12025, IEEE, Nashville, TN, USA, June 2021.
Z. Hui, X. M. Wang, and X. B. Gao, “Fast and accurate single image super-resolution via information distillation network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 723–731, IEEE, Salt Lake City, UT, USA, June 2018.
I. J. Goodfellow, J. P. Abadie, and M. Mirza, “Generative adversarial nets,” in Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), pp. 2672–2680, Curran Associates, Montreal, Canada, December 2014.
C. Ledig, L. Theis, F. Huszár, A. Cunningham, and A. Acosta, “Photo-realistic single image super-resolution using a generative adversarial network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 105–114, IEEE, Honolulu, HI, USA, July 2017.
X. Wang, K. Yu, and S. Wu, “ESRGAN: enhanced super-resolution generative adversarial networks,” in Proceedings of the European Conference on Computer Vision (ECCV), pp. 63–79, Springer, Munich, Germany, September 2018.
P. Isola, J. Y. Zhu, and T. Zhou, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1125–1134, IEEE, Honolulu, HI, USA, July 2017.
X. L. Zun, H. J. Zhong, and L. R. Xing, “Multi-scale generative adversarial networks for image super-resolution algorithms,” Science Technology and Engineering, vol. 20, no. 13, pp. 5217–5223, 2020.
C. P. Wang, S. M. Wang, B. Ma, J. Li, X. J. Dong, and Z. Q. Xia, “Transform domain based medical image super-resolution via deep multi-scale network,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2387–2391, IEEE, Brighton, UK, May.
Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng, “Dual path networks,” in Proceedings of the International Conference and Workshop on Neural Information Processing Systems (NeurIPS), pp. 4467–4475, Springer, Long Beach, CA, USA, December 2017.
X. H. Liu, Y. R. Ma, Z. H. Shi, and J. Chen, “Griddehazenet: attention-based multi-scale network for image dehazing,” in Proceedings of the International Conference on Computer Vision (ICCV), IEEE, Seoul, Republic of Korea, November 2019.
H. B. Huang, R. He, Z. N. Sun, and T. N. Tan, “Wavelet-SRNet: a wavelet-based CNN for multi-scale face super resolution,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1689–1697, IEEE, Venice, Italy, October 2017.
X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 315–323, MLR Press, Fort Lauderdale, FL, USA, April 2011.