Special Issue: Multimodality Data Analysis in Information Security
R2AU-Net: Attention Recurrent Residual Convolutional Neural Network for Multimodal Medical Image Segmentation
In recent years, semantic segmentation methods based on deep learning have delivered state-of-the-art performance in medical image segmentation. As one of the typical segmentation networks, U-Net has been successfully applied to multimodal medical image segmentation. This paper proposes a recurrent residual convolutional neural network with attention gate connections (R2AU-Net) based on U-Net. It enhances the capability of integrating contextual information by replacing the basic convolutional units of U-Net with recurrent residual convolutional units. Furthermore, R2AU-Net adopts attention gates instead of the original skip connections. The experiments are performed on three multimodal datasets: ISIC 2018, DRIVE, and the public dataset used in LUNA and the Kaggle Data Science Bowl 2017. Experimental results show that R2AU-Net achieves better performance than other improved U-Net algorithms for multimodal medical image segmentation.
1. Introduction
Medical images play a key role in medical treatment. Computer-aided diagnosis (CAD) is designed to give doctors an accurate, systematic interpretation of medical images so that patients can be treated better. Manual segmentation is slow, and its accuracy depends heavily on the doctor's own knowledge and clinical experience. The application of deep learning to medical image segmentation has therefore attracted widespread attention. Because labeling medical images demands considerable time and effort from experts, it is difficult to acquire thousands of training images for a medical image segmentation task. Ciresan et al. trained a network in a sliding-window setup to predict the class label of each pixel from the local region (patch) around it. However, this network must be run separately for each patch, and overlapping patches cause a great deal of redundant computation. Furthermore, larger patches require more max-pooling layers, which reduces localization accuracy. The fully convolutional network (FCN) is one of the earliest deep neural networks applied to image segmentation. It has no traditional fully connected layers and uses deconvolution in its last layer to restore the original image size. Ronneberger et al. extended this architecture and proposed U-Net, which consists of an encoding path and a decoding path. The encoder represents the original image as output feature maps. From the information output by the encoder, the decoder restores the details and size of the original image. U-Net adds multiple skip connections between the encoder and decoder, which transfer the features of the shallow layers to the deep layers and thus help the decoding path recover image details better.
Since then, U-Net has become a very popular segmentation network and has been applied to medical image segmentation tasks including cardiac MRI, cardiac CT, and abdominal CT segmentation, pulmonary nodule detection, and liver segmentation. However, target organs vary greatly among patients, so U-Net-based pipelines rely heavily on multistage cascaded CNNs. A cascaded framework makes dense predictions over the ROI, which leads to repeated extraction of similar low-level features, wasting computational resources and increasing the number of model parameters. Designing an efficient deep CNN structure is therefore very important.
So far, many improved versions of U-Net have been proposed. Azad et al. proposed BCDU-Net, whose most important changes concern the feature extraction method and the skip connections. The original U-Net relies on multistage cascaded CNNs, which wastes computing resources and increases the number of parameters, and it merges shallow and deep features simply by concatenating them through skip connections. In this paper, an extended version of U-Net is proposed, which uses a recurrent residual convolutional neural network with attention gate connections (R2AU-Net) for medical image segmentation. The contributions of this paper can be summarized as follows. Firstly, R2AU-Net uses attention gates (AGs) to combine deep features and shallow features. The AGs use the deep feature map in the decoding path as a gating signal to modify the feature map generated in the encoding path and suppress feature responses in irrelevant background areas, so as to highlight features that are useful for a specific task. Secondly, R2AU-Net replaces the basic convolutional unit of U-Net with a recurrent residual convolutional unit. In this unit, recurrent connections and residual connections are added to each convolutional layer without increasing the number of network parameters. The recurrent connections enhance the ability to integrate context information, and the residual connections help train deeper networks [13, 14]. In addition, batch normalization (BN) is used to accelerate the convergence of the network. R2AU-Net is evaluated on three datasets: retinal vascular segmentation (DRIVE dataset), skin lesion segmentation (ISIC 2018 dataset), and lung nodule segmentation (lung dataset).
2. Proposed Method
New networks are usually constructed by combining several deep learning models as functional modules. Inspired by U-Net, R2AU-Net is proposed in this paper. The network structure, shown in Figure 1, takes advantage of four recently developed deep learning models. There are three differences between R2AU-Net and U-Net. Firstly, recurrent convolutional blocks with residual units are used in the encoding and decoding paths. Secondly, the skip connections are replaced by AGs, which use deep features to correct the shallow features. Thirdly, BN is used in the upsampling process to increase the stability of the network and speed up its convergence. BN standardizes the data, provides a mild regularization effect, reduces generalization error, and improves network performance.
2.1. Encoding Path
The encoding path of R2AU-Net contains four steps. Every step contains a recurrent residual convolutional unit (the R2CL unit), which consists of two 3 × 3 convolutions, with recurrent connections added to each convolutional layer to enhance the model's capability of integrating contextual information. In addition, residual connections are added to develop more efficient and deeper models. Each time a recurrent residual convolutional unit is passed, the number of feature maps is doubled and their size is halved. The R2AU-Net model concatenates the feature maps from each encoding unit with those of the corresponding decoding unit. The recurrent convolutional layers (RCLs) in the R2CL unit operate over discrete time steps, as in RCNN. Suppose $x_l$ is the input sample in the $l$-th layer of the R2CL block and $(i, j)$ is the location of a pixel in an input sample on the $k$-th feature map in the RCL. $O_{ijk}^{l}(t)$ denotes the output at time step $t$ and can be expressed as follows:
$$O_{ijk}^{l}(t) = (w_k^{f})^{T} x_l^{f(i,j)}(t) + (w_k^{r})^{T} x_l^{r(i,j)}(t-1) + b_k \tag{1}$$
In formula (1), $x_l^{f(i,j)}(t)$ and $x_l^{r(i,j)}(t-1)$ represent the inputs to the standard convolutional layer and to the RCL, respectively. The standard convolutional layer and the RCL of the $k$-th feature map are weighted by $w_k^{f}$ and $w_k^{r}$, respectively, and $b_k$ is the bias. The output of the RCL is activated by the standard ReLU function $f$ as follows:
$$F(x_l, w_l) = f\left(O_{ijk}^{l}(t)\right) = \max\left(0, O_{ijk}^{l}(t)\right) \tag{2}$$
The output of the R2CL unit can be calculated as follows:
$$x_{l+1} = x_l + F(x_l, w_l) \tag{3}$$
where $x_l$ is an input sample of the R2CL layer. $x_{l+1}$ is the output of the R2CL unit and is passed to the following downsampling layer in the encoding path or upsampling layer in the decoding path. The basic convolutional unit of U-Net is shown in Figure 2(a), and the structure of the R2CL block is shown in Figure 2(b).
Formulas (1) and (2) describe the dynamic behavior of the RCL. When the RCL is unrolled for T time steps, a feedforward subnetwork of depth T + 1 is obtained. In this paper, the RCL is unrolled for two time steps; namely, T = 2. The RCL thus comprises a single convolutional layer followed by two subsequent recurrent convolutional layers. The unrolled RCL structure is shown in Figure 3.
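The unrolling described by formulas (1) to (3) can be illustrated with a small toy. The following is a minimal sketch, not the authors' implementation: it assumes a single-channel 1-D signal instead of 2-D feature maps, one RCL instead of the two stacked 3 × 3 convolutions, and hypothetical helper names (`conv1d_same`, `recurrent_conv_layer`, `recurrent_residual_unit`).

```python
def conv1d_same(signal, kernel):
    """1-D cross-correlation with zero padding; output length equals input length."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def relu(values):
    """Element-wise ReLU, the activation f in formula (2)."""
    return [max(0.0, v) for v in values]


def recurrent_conv_layer(x, w_f, w_r, bias, steps=2):
    """Unroll one RCL: a feedforward convolution plus `steps` recurrent steps (T = 2)."""
    feed_forward = conv1d_same(x, w_f)  # feedforward term, identical at every step
    state = relu([v + bias for v in feed_forward])  # t = 0: no recurrent input yet
    for _ in range(steps):
        recurrent = conv1d_same(state, w_r)  # recurrent term from the previous step
        state = relu([f + r + bias for f, r in zip(feed_forward, recurrent)])
    return state


def recurrent_residual_unit(x, w_f, w_r, bias, steps=2):
    """Residual connection around the RCL: x_{l+1} = x_l + F(x_l, w_l)."""
    rcl_out = recurrent_conv_layer(x, w_f, w_r, bias, steps)
    return [xi + fi for xi, fi in zip(x, rcl_out)]
```

Because the same recurrent kernel `w_r` is reused at every time step, unrolling deepens the effective network without adding parameters, which is the property the text attributes to the recurrent connections.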
2.2. Decoding Path
Each step of the decoding path upsamples the output of the R2CL unit of the previous layer. With each upsampling operation, the number of feature maps is halved and their size is doubled. At the last layer of the decoding path, the feature map is restored to the original size of the input image. The LRN (local response normalization) layer in R2CL is replaced by a BN layer, so that the input of each layer keeps the same distribution. During training, the changing distribution of activations in each layer of the network slows down training. Therefore, BN is applied after upsampling at each step to make the network more stable. It does so by subtracting the batch mean from the inputs and dividing them by the batch standard deviation. BN accelerates training and improves the performance of the network model.
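The normalization step can be sketched as follows. This is a simplified illustration under stated assumptions: it normalizes a flat batch of scalar activations, whereas a real BN layer computes statistics per channel over the batch and spatial dimensions and tracks running averages for inference. The function name and the `gamma`, `beta`, and `eps` defaults are illustrative, not taken from the paper.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of scalar activations, then apply scale (gamma) and shift (beta)."""
    mean = sum(batch) / len(batch)
    # Biased variance estimate over the batch
    variance = sum((v - mean) ** 2 for v in batch) / len(batch)
    std = math.sqrt(variance + eps)  # eps guards against division by zero
    return [gamma * (v - mean) / std + beta for v in batch]
```

With the default `gamma` and `beta`, the output has approximately zero mean and unit variance regardless of the scale of the input activations.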
The output of the BN layer is sent to the AGs. R2AU-Net uses AGs to readjust the output features of the encoder before concatenating the encoder features at each resolution with the corresponding features in the decoder. This module generates a gating signal that controls the importance of features at different locations. AGs progressively suppress feature responses in irrelevant background regions without requiring an ROI to be cropped between networks.
Figure 4 shows the proposed additive attention gate. An attention value is calculated for each pixel $i$. For convenience, the encoder feature vector and the gating vector at pixel $i$ are denoted as $x_i^{l}$ and $g_i$, respectively. The gating signal determines the focus area for each pixel. In order to obtain higher accuracy, additive attention is used to obtain the attention coefficient. The additive formula is as follows:
$$\begin{aligned} q_{att}^{l} &= \psi^{T} \sigma_1\left(W_x^{T} x_i^{l} + W_g^{T} g_i + b_g\right) + b_\psi, \\ \alpha_i^{l} &= \sigma_2\left(q_{att}^{l}\right), \end{aligned} \tag{4}$$
where $\sigma_1$ and $\sigma_2$ denote the ReLU and sigmoid activation functions, respectively, $W_x$, $W_g$, and $\psi$ are the weights, and $b_g$ and $b_\psi$ are the biases. Wang et al. used attention based on vector concatenation. The linear transformations are computed with channel-wise 1 × 1 × 1 convolutions of the input tensors. Grid resampling of the attention coefficients is achieved using trilinear interpolation. The AG parameters are trained by backpropagation instead of a sampling-based update method.
Finally, the output of the AG is the element-wise multiplication of the feature map and the attention coefficient, as shown in formula (5):
$$\hat{x}_i^{l} = x_i^{l} \cdot \alpha_i^{l} \tag{5}$$
Attention coefficients tend to obtain large values in target organ regions and relatively small values in background regions, which can improve the accuracy of image segmentation.
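The gating computation for a single pixel can be sketched as below. This is a minimal illustration that assumes the standard additive attention gate form (a ReLU over the summed linear projections of the skip feature and the gating signal, followed by a sigmoid); the function and parameter names are hypothetical, and the 1 × 1 × 1 convolutions are written as plain matrix-vector products.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def attention_gate_pixel(x, g, w_x, w_g, b_g, psi, b_psi):
    """Additive attention for one pixel.

    x: encoder (skip) feature vector; g: gating vector from the decoder.
    Returns (alpha, gated_x), where gated_x is x scaled element-wise by alpha.
    """
    hidden = [
        max(0.0, xv + gv + b)  # first activation: ReLU
        for xv, gv, b in zip(matvec(w_x, x), matvec(w_g, g), b_g)
    ]
    q_att = sum(p * h for p, h in zip(psi, hidden)) + b_psi
    alpha = 1.0 / (1.0 + math.exp(-q_att))  # second activation: sigmoid
    return alpha, [alpha * xv for xv in x]
```

A coefficient close to 1 passes the skip feature through almost unchanged, while a coefficient close to 0 suppresses it, which matches the behavior described for target and background regions.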
3. Experimental Results
The experiments are performed on three datasets: DRIVE, ISIC 2018, and the public dataset used in LUNA and the Kaggle Data Science Bowl 2017. The following performance indicators are adopted in this paper: accuracy (AC), sensitivity (SE), specificity (SP), and F1-score (F1), which are computed from the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
The accuracy measures the proportion of correctly classified pixels and is obtained by the following formula:
$$AC = \frac{TP + TN}{TP + TN + FP + FN} \tag{6}$$
Sensitivity represents the proportion of actual positive samples that are correctly predicted as positive. It reflects how well the positive samples are detected. Sensitivity is calculated by the following formula:
$$SE = \frac{TP}{TP + FN} \tag{7}$$
The specificity is the proportion of actual negative samples that are correctly predicted as negative and is calculated by the following formula:
$$SP = \frac{TN}{TN + FP} \tag{8}$$
F1-score is used to measure the accuracy of a binary classification model. It considers both the precision and the recall of the model and can be regarded as their harmonic mean. With precision defined as $TP / (TP + FP)$ and recall equal to the sensitivity, F1-score is calculated by the following formula:
$$F1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{9}$$
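These indicators follow directly from the confusion counts. The sketch below uses the standard definitions and assumes flattened binary masks with 1 for the target class; the function names are illustrative, and the code does not guard against empty classes (which would cause a division by zero).

```python
def confusion_counts(predicted, truth):
    """Count TP, TN, FP, FN for two flat binary masks of equal length."""
    tp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == 1)
    tn = sum(1 for p, t in zip(predicted, truth) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(predicted, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(predicted, truth) if p == 0 and t == 1)
    return tp, tn, fp, fn


def segmentation_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and F1-score from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)  # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }
```

In segmentation the background usually dominates, so accuracy and specificity can be high even when sensitivity is poor; reporting all four values together gives a more balanced picture.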
In addition, the receiver operating characteristic (ROC) curve and the precision-recall (PR) curve are used to compare the performance of each network more intuitively. The area under the curve (AUC) of both the ROC curve and the PR curve is calculated for each network in this paper.
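As an illustration of how an ROC AUC value can be obtained from per-pixel scores, the following sketch sweeps the decision threshold and integrates with the trapezoidal rule. It is a generic textbook procedure rather than the evaluation code used in the paper; the function names are hypothetical, and labels are assumed to be 0 or 1 with both classes present.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points for all thresholds."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for index, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only after the last sample in a group of tied scores
        is_last = index == len(ranked) - 1
        if is_last or ranked[index + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

A perfect ranking of positive over negative pixels gives an AUC of 1, while scores that carry no information give about 0.5.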
3.1. Skin Lesion Segmentation
The ISIC dataset is published by the International Skin Imaging Collaboration (ISIC) and contains 2594 dermoscopy images of common pigmented skin lesions. All the images have been annotated by recognized skin cancer specialists. These annotations include dermoscopic features, that is, known global and local morphological elements in the image that are used to identify the type of skin lesion. During the experiment, 1815 images are used for training, 259 images for validation, and 520 images for testing. The size of each image in ISIC is 700 × 900. Each input image is first resized to 256 × 256. The training data consist of the original images and the ground truth masks labeled by professional physicians. Figure 5 shows the segmentation results on ISIC. The first column shows the original images, the second column the ground truth images, and the third column the segmentation results of R2AU-Net. The first row of Figure 5 shows that R2AU-Net accurately segments dark skin lesions and is not affected by the hair around the lesions. For the less obvious light-colored lesions in the second row, R2AU-Net also segments the lesions well. R2AU-Net segments the skin lesions accurately, and its output is almost identical to the ground truth image. Table 1 compares the segmentation results of R2AU-Net and other improved versions of U-Net in terms of F1-score, sensitivity, specificity, accuracy, and AUC. R2AU-Net performs well on all of these indicators. For this binary segmentation task, the ROC curve and the PR curve allow an intuitive comparison of the classifiers. The AUC values of the ROC curve and the PR curve are shown in Figures 6 and 7, respectively. The ROC curve tends toward the upper left and the PR curve tends toward the upper right, which indicates that the segmentation model performs well.
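The 1815/259/520 partition corresponds to roughly 70%, 10%, and 20% of the 2594 images. The paper does not state how the images were assigned to each subset, so the sketch below assumes a seeded random shuffle; the function name and seed are illustrative.

```python
import random


def split_dataset(items, n_train, n_val, seed=0):
    """Shuffle the items reproducibly and cut them into train, validation, and test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # everything left over
    return train, val, test


# Image indices stand in for the 2594 ISIC 2018 images
train_ids, val_ids, test_ids = split_dataset(range(2594), n_train=1815, n_val=259)
```

Fixing the seed keeps the three subsets disjoint and reproducible across runs, which matters when several networks are compared on the same test images.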
3.2. Retinal Vascular Segmentation
The images of the DRIVE dataset were obtained from a diabetic retinopathy screening program in the Netherlands. The screening population consisted of 400 diabetic subjects between 25 and 90 years of age, from which 40 color retinal images were randomly selected. Through blood vessel segmentation of retinal images and the morphological properties of the retinal vessels, doctors can diagnose, screen, treat, and evaluate a variety of cardiovascular and ophthalmic diseases, such as diabetes, hypertension, arteriosclerosis, and choroidal neovascularization. In the experiment, 20 images are used for training and 20 images are used for testing. The size of the original images is 565 × 584. This number of samples is clearly not enough to train a deep neural network model. Therefore, 190000 patches are randomly sampled from the 20 training images. Among them, 171000 patches are used for training, and the remaining 19000 patches are held out for validation. The size of the patches fed into the network is 64 × 64. The segmentation result for an input image is shown in Figure 8. The first image is the original color image, the second image is the ground truth mask, and the third image is the segmentation result output by R2AU-Net; even most of the thin vessel endings are still segmented. Table 2 shows the results of the comparative experiments on the DRIVE dataset, including F1-score, sensitivity, specificity, accuracy, and AUC. According to the experimental results, R2AU-Net performs better than the traditional methods and the original U-Net. Figures 9 and 10 show the AUC values of the ROC curve and the PR curve of each network.
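The patch-based augmentation can be sketched as follows. The paper does not specify the sampling scheme, so this example assumes uniform random top-left corners, an equal share of patches per image, and images stored as lists of rows with a height of 584 and a width of 565; all names are illustrative.

```python
import random


def sample_patch_corners(height, width, patch, count, rng):
    """Return `count` random top-left corners of patch x patch windows inside the image."""
    return [
        (rng.randrange(height - patch + 1), rng.randrange(width - patch + 1))
        for _ in range(count)
    ]


def crop(image, top, left, patch):
    """Cut a patch x patch window out of an image stored as a list of rows."""
    return [row[left:left + patch] for row in image[top:top + patch]]


rng = random.Random(0)
PATCHES_PER_IMAGE = 190_000 // 20  # 9500 patches from each of the 20 training images
corners = sample_patch_corners(584, 565, 64, PATCHES_PER_IMAGE, rng)
```

The same corner must be applied to the image and to its ground truth mask so that each input patch stays aligned with its label. The 171000/19000 division corresponds to holding out 10% of the patches.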
3.3. Lung Segmentation
The public dataset used in LUNA and the Kaggle Data Science Bowl 2017 is provided by the National Cancer Research Center of the United States. This dataset consists of 2D and 3D images. It contains 267 lung CT images with an original size of 512 × 512. Of these, 134 images are used for training, 54 images for validation, and 79 images for testing. Figure 11 shows the segmentation results of R2AU-Net on the lung dataset. The first column is the input image, the second column is the ground truth mask, and the third column is the lung segmentation produced by R2AU-Net. The third image of the first row shows that even a very small lung region in a CT image can be segmented. The segmentation results of R2AU-Net are essentially the same as the ground truth images. Table 3 compares the performance of R2AU-Net with that of other improved versions of U-Net. Figures 12 and 13 show the AUC values of the ROC curve and the PR curve of each network.
4. Conclusions
In this paper, R2AU-Net is proposed for medical image segmentation. The recurrent residual convolutional block is used to enhance the ability to capture context information, and AGs are added to the skip connections. The attention gates use the deep features of the decoding path as a gating signal to modify the shallow features and suppress feature responses in background areas, so that the network obtains more accurate segmentation results. Moreover, BN is used in the upsampling process to accelerate convergence and improve the stability of the network. The experimental results on three datasets show that R2AU-Net performs well in medical image segmentation.
Data Availability
The following are the links to the datasets used in this article: skin lesion segmentation (https://challenge2018.isic-archive.com/), retinal vascular segmentation (https://drive.google.com/file/d/17wVfELqgwbp4Q02GD247jJyjq6lwB0l6/view), and lung nodule segmentation (https://www.kaggle.com/kmader/finding-lungs-in-ct-data/data).
Conflicts of Interest
The authors declare that they have no conflicts of interest.
References
D. C. Cireşan, A. Giusti, L. M. Gambardella et al., “Deep neural networks segment neuronal membranes in electron microscopy images,” Advances in Neural Information Processing Systems, vol. 25, pp. 2852–2860, 2012.
J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 640–651, 2015.
O. Ronneberger, P. Fischer, and T. Brox, “U-Net: convolutional networks for biomedical image segmentation,” in Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241, Munich, Germany, October 2015.
C. Payer, D. Štern, H. Bischof, and M. Urschler, “Multi-label whole heart segmentation using CNNs and anatomical label configurations,” in Statistical Atlases and Computational Models of the Heart. ACDC and MMWHS Challenges, pp. 190–198, Springer, Berlin, Germany, 2018.
F. Liao, M. Liang, Z. Li et al., “Evaluate the malignancy of pulmonary nodules using the 3D deep leaky noisy-or network,” IEEE Transactions on Neural Networks & Learning Systems, vol. 30, no. 11, pp. 3484–3495, 2017.
K. He, X. Zhang, S. Ren et al., “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, Las Vegas, NV, USA, June 2016.
S. Ioffe and C. Szegedy, “Batch normalization: accelerating deep network training by reducing internal covariate shift,” in Proceedings of the 32nd International Conference on Machine Learning, pp. 448–456, Lille, France, July 2015.
V. Mnih, N. Heess, A. Graves et al., “Recurrent models of visual attention,” Advances in Neural Information Processing Systems, MIT Press, Cambridge, MA, USA, 2014.