Medical image segmentation is a key topic in image processing and computer vision. The existing literature mainly focuses on single-organ segmentation. However, since maximizing the concentration of radiotherapy drugs in the target area while protecting the surrounding organs is essential for making an effective radiotherapy plan, multiorgan segmentation has attracted increasing attention. An improved Mask R-CNN (region-based convolutional neural network) model is proposed for multiorgan segmentation to aid esophageal radiation treatment. Because organ boundaries may be fuzzy and organ shapes vary, the original Mask R-CNN works well on natural image segmentation but leaves something to be desired on the multiorgan segmentation task. To address this, the contributions of this work are threefold: (1) a ROI (region of interest) generation method that is able to utilize multiscale semantic features is presented in the RPN (region proposal network); (2) a prebackground classification subnetwork is integrated into the original mask generation branch to improve the precision of multiorgan segmentation; (3) 4341 CT images of 44 patients are collected and annotated to evaluate the proposed method. Extensive experiments on the collected dataset demonstrate that the proposed method can segment the heart, right lung, left lung, planning target volume (PTV), and clinical target volume (CTV) accurately and efficiently. Specifically, missed or false detections occurred in less than 5% of the cases on the test set, which shows great potential for real clinical use.

1. Introduction

Diagnostic imaging plays an important role in modern medicine. Computed tomography (CT), magnetic resonance imaging (MRI), and other imaging modalities provide important assistance for diagnosis and treatment planning. Take esophageal cancer, a primary malignant tumor of the esophagus, as an example. At least 200,000 people suffer from esophageal cancer every year [1], and radiotherapy is one of the main treatments in China. However, radiotherapy treatment planning depends heavily on an accurate description of the planning target volume (PTV) and the organs at risk. The accuracy of organ contour segmentation determines the quality of dose planning optimization in radiotherapy and thus affects the success or failure of radiotherapy and the incidence of complications [2].

With the increasing scale and quantity of medical images, organ segmentation via manual delineation, which relies on the clinical experience of radiologists, is inefficient [3]. It is therefore necessary to use computers to process and analyze medical images automatically. With the development of computer vision technology, many different automatic image segmentation and delineation algorithms have been developed. These algorithms are referred to as medical image segmentation or organ segmentation [4] in the literature.

Conventional medical image segmentation/organ segmentation algorithms can be roughly divided into eight categories [4]: (a) thresholding approaches, (b) region-growing approaches, (c) classifiers, (d) clustering approaches, (e) Markov random field models, (f) deformable models, (g) artificial neural networks, and (h) atlas-guided approaches. Although these methods have made some progress, the accuracy is not sufficient.

Benefiting from the continuous progress of deep learning technology, medical image segmentation/organ segmentation is currently dominated by the CNN (convolutional neural network) [5]. Similar to object detection methods, CNN-based organ segmentation can be divided into two types: (a) one-stage algorithms, which treat organ segmentation as a one-stage pixel classification task and whose typical structure is the fully convolutional network (FCN) [6]; (b) two-stage algorithms, which decouple organ segmentation into an organ localization stage and an instance segmentation stage and whose typical structure is the region CNN (R-CNN) [7]. The most well-known one-stage CNN architecture for organ segmentation is U-Net, published by Ronneberger et al. [8]. Most state-of-the-art organ segmentation methods are variants of U-Net [9–11]. Although they have achieved encouraging performance, two shortcomings exist. On the one hand, many studies focus on single-organ segmentation, while only a few works address the multiorgan segmentation problem [12, 13]. On the other hand, two-stage segmentation methods work well for multiobject segmentation on natural image segmentation datasets [14] but perform worse than one-stage algorithms on medical image segmentation [15]. Therefore, mining the potential of the two-stage multiorgan segmentation algorithm has great research value.

In this paper, to address the shortcomings mentioned above, we present an improved Mask R-CNN framework for multiorgan segmentation. The original Mask R-CNN [16] was proposed to address the multi-instance segmentation problem on natural images. Although the original Mask R-CNN has achieved state-of-the-art instance segmentation performance on general image datasets, the latest research [15] shows that it is able to accurately find bounding boxes for organs, while its segmentation performance is worse than that of U-Net on the medical image segmentation dataset. We think a major reason for this is that the semantic representation obtained from the original Mask R-CNN framework is too coarse for organ segmentation because organ boundaries may be blurred and organ shapes vary. To address this, we have made two improvements to the original Mask R-CNN: (a) a ROI (region of interest) generation method that is able to utilize multiscale semantic features is presented in the RPN; (b) a prebackground classification subnetwork is integrated to improve the precision of multiorgan segmentation. Moreover, CT images of 44 esophageal cancer patients are collected and annotated as a benchmark to evaluate the proposed method.

To sum up, our contributions are as follows:
(1) We applied the Mask R-CNN to esophageal cancer medical image processing successfully. Most existing methods focus on single-organ segmentation, while this paper is devoted to addressing the multiorgan segmentation problem.
(2) To provide a better multiorgan segmentation model, we propose two improvements over the original Mask R-CNN framework.
(3) We conduct extensive experiments and analysis on the collected real multiorgan dataset and demonstrate the excellent performance of our proposed method on the multiorgan segmentation task.

The rest of this paper is organized as follows. Section 2 reviews and discusses the related works. Section 3 describes the proposed improved Mask R-CNN model in detail. Experimental results and comparisons are discussed in Section 4, and Section 5 concludes the paper and outlines future work.

2. Related Work

Pham et al. [4] and Litjens et al. [5] reviewed the conventional and deep learning-based organ segmentation methods, respectively. In this section, we briefly review the previous methods that are most related to our work, including conventional medical image segmentation methods, deep learning-based single-organ segmentation methods, and deep learning-based multiorgan segmentation methods.

2.1. Conventional Medical Image Segmentation Method

Conventional medical image segmentation methods can be roughly divided into eight categories:
(a) Thresholding approaches [17]: these approaches first determine an intensity value (threshold) and then group all pixels with intensity greater than the threshold into one class and all other pixels into another class.
(b) Region-growing approaches [18]: these approaches use intensity information and/or edges in the medical image as predefined criteria for extracting a connected region of the image.
(c) Classifiers [19, 20]: classifier methods first convert the medical image from the image space to a feature space and then train classifiers on the feature space to determine the class to which each pixel belongs.
(d) Clustering approaches [21]: commonly used clustering approaches for medical image segmentation are K-means, fuzzy c-means, and expectation-maximization. Unlike classifiers, clustering approaches are unsupervised.
(e) Markov random field models: a Markov random field (MRF) is a statistical model that can be used within segmentation methods to model spatial interactions between neighboring or nearby pixels.
(f) Deformable models: deformable models use closed parametric curves or surfaces to delineate region boundaries.
(g) Artificial neural networks (ANNs) [22]: the most common use of the ANN in conventional medical image processing is as a classifier.
(h) Atlas-guided approaches [23, 24]: an atlas is generated by compiling information on the anatomy that requires segmenting. This atlas is then used as a reference frame for segmenting new images.
In addition, level set optimization has also been utilized for multiorgan segmentation [25]. Although the methods mentioned above have achieved some progress, their segmentation accuracy is limited because all conventional methods depend on manually designed feature representations.
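As a concrete illustration of the simplest category above, thresholding, the following minimal Python sketch labels every pixel whose intensity exceeds a chosen threshold. The function name `threshold_segment`, the toy intensity values, and the threshold of 100 are illustrative assumptions and do not come from any of the cited methods.

```python
def threshold_segment(image, threshold):
    """Label each pixel 1 if its intensity exceeds the threshold, else 0."""
    return [[1 if pixel > threshold else 0 for pixel in row] for row in image]


# Toy 3 x 4 "slice" with a bright structure on a dark background
# (made-up intensities, for illustration only).
toy_slice = [
    [10, 12, 200, 210],
    [11, 180, 220, 13],
    [9, 10, 12, 11],
]
mask = threshold_segment(toy_slice, threshold=100)
print(mask)  # [[0, 0, 1, 1], [0, 1, 1, 0], [0, 0, 0, 0]]
```

A single global threshold like this ignores spatial context, which is one reason such hand-crafted criteria struggle when organ boundaries are fuzzy.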

2.2. Deep Learning-Based Single-Organ Segmentation Method

Ronneberger et al. [8] first presented U-Net, a novel CNN architecture that became the most popular structure in medical image analysis. The main novelty of U-Net is the combination of an equal number of upsampling and downsampling layers. Inspired by U-Net, Zhou et al. [26] presented U-Net++, a more powerful architecture for medical image segmentation. Milletari et al. [27] proposed V-Net (a 3D variant of the U-Net architecture), which performs 3D image segmentation using 3D convolutional layers with an objective function directly based on the Dice coefficient. Drozdzal et al. [11] investigated the use of short ResNet-like skip connections in addition to the long skip connections in a regular U-Net. Besides CNNs, Xie et al. [28], Stollenga et al. [29], Chen et al. [30], and Poudel et al. [31] utilized the recurrent neural network (RNN) for organ segmentation tasks. To combat spurious responses, a few papers attempt to combine the CNN/RNN with graphical models such as MRFs [32] and conditional random fields (CRFs) [33] to refine the segmentation output. Although these methods have achieved encouraging performance, they were designed for the single-organ segmentation problem and may not be suitable or optimal for multiorgan segmentation: it is difficult for them to segment multiple organs at the same time, which limits their value as a clinical aid.

2.3. Deep Learning-Based Multiorgan Segmentation Method

The research on the deep learning-based multiorgan segmentation method is in its early phase. Tong et al. [34] introduced discriminative dictionary learning for abdominal multiorgan segmentation. Lay et al. [35] used context integration and discriminative models for rapid multiorgan segmentation. Roth et al. [36] and Chen et al. [37] adopted the 3D fully convolutional network. Recently, Dong et al. [38] presented a generative model (U-Net-GAN), and Wang et al. [39] proposed densely connected U-Net for multiorgan segmentation. Lei et al. [40] presented a review of deep learning in multiorgan segmentation. Different from these methods, the proposed method in this paper aims to improve the two-stage instance segmentation algorithm which is widely used in the natural image dataset, making it suitable for the multiorgan segmentation task.

3. Methods

In this section, we introduce the proposed method (which is named improved Mask R-CNN) for multiorgan segmentation. As shown in Figure 1, the proposed method is based on the existing well-known multi-instance segmentation method, Mask R-CNN. Compared with the original Mask R-CNN, we have made two improvements: (a) a ROI (region of interest) generation method is presented in the RPN which is able to utilize multiscale semantic features; (b) a prebackground classification subnetwork is integrated to improve the precision of multiorgan localization. The detailed proposed approach is presented in two sections: (a) the network structure and (b) loss function.

3.1. The Network Structure

The network of the proposed algorithm can be mainly divided into three modules. The first module is called feature extraction and ROI generation, which is mainly composed of ResNet50 + FPN + RPN. In this module, we generate multilayer feature maps first. Then, each point on the feature map is mapped into the original image to acquire the corresponding ROI.

The second module is named region of interest alignment, which pools the ROIs obtained from the first module to a fixed size while avoiding quantization error. The third module is mask acquisition. In this module, the fixed-size ROIs obtained from the second module are sent to the organ region segmentation network to generate the organ masks. At the same time, they are also sent to the fully connected layer for organ-position rectangular bounding box regression and organ classification. The three modules are detailed as follows.

3.1.1. Feature Extraction and ROI Generation

The purpose of this step is to extract the features of the input image and generate the ROIs in the corresponding feature layers. First, a medical CT image containing multiple organs is input to the ResNet50 network. Res2, Res3, Res4, and Res5 are the feature output layers of the ResNet [15, 41]. Then, a feature pyramid network (FPN) [42] is adopted to fuse these multilayer features to obtain strong semantic information and improve the accuracy of organ detection. As shown in Figure 2, the specific approach is to apply a dimensionality reduction operation to the Res4 features (that is, to add a 1 × 1 convolution layer) and an upsampling operation to the P5 features so that the two have the same size. The processed P5 and the processed Res4 are then added element-wise, and the result is output as P4; P3 and P2 are obtained in the same way. Finally, the RPN predicts on each of the output layers P2, P3, P4, and P5 to obtain the ROIs.
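The top-down fusion described in this subsection can be sketched in a few lines of NumPy. This is a simplified illustration rather than the implementation used in the paper: the channel counts, spatial sizes, random weights, and nearest-neighbour upsampling are assumptions, and the smoothing convolutions that FPN implementations usually apply after each merge are omitted.

```python
import numpy as np


def conv1x1(feature, weight):
    """1 x 1 convolution: mix channels at every spatial position.

    feature: (C_in, H, W), weight: (C_out, C_in) -> (C_out, H, W)
    """
    return np.einsum("oc,chw->ohw", weight, feature)


def upsample2x(feature):
    """Nearest-neighbour 2x upsampling along both spatial axes."""
    return feature.repeat(2, axis=1).repeat(2, axis=2)


def fpn_top_down(backbone_feats, lateral_weights):
    """Fuse backbone maps ordered fine-to-coarse (Res2..Res5) into P2..P5."""
    laterals = [conv1x1(f, w) for f, w in zip(backbone_feats, lateral_weights)]
    pyramid = [laterals[-1]]                 # P5 comes from Res5 alone
    for lateral in reversed(laterals[:-1]):  # Res4, then Res3, then Res2
        # Element-wise addition of the reduced map and the upsampled coarser map.
        pyramid.insert(0, lateral + upsample2x(pyramid[0]))
    return pyramid                           # [P2, P3, P4, P5]


# Hypothetical shapes: each stage halves the resolution and doubles the channels.
rng = np.random.default_rng(0)
channels, sizes, out_channels = [8, 16, 32, 64], [32, 16, 8, 4], 4
feats = [rng.standard_normal((c, s, s)) for c, s in zip(channels, sizes)]
weights = [rng.standard_normal((out_channels, c)) for c in channels]
p2, p3, p4, p5 = fpn_top_down(feats, weights)
print([p.shape for p in (p2, p3, p4, p5)])
```

After fusion every level has the same number of channels, which is what allows the RPN to run the same prediction head on P2 through P5.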

3.1.2. Region of Interest Alignment

This step aims to pool all ROIs remaining on the feature maps to a fixed size. Since the ROI position is usually obtained by a regression model, it is generally a floating-point number, while the pooled feature map requires a fixed size. In order to avoid quantization errors, the ROI align [15] layer (illustrated in Figure 3) is adopted. In the presented framework, the ROI align layer first traverses each ROI, keeping its floating-point boundary unquantized. The ROI is then divided into cells, again without quantizing the boundary of each cell. Next, four fixed sampling positions are determined in each cell, and the values at these four positions are calculated by bilinear interpolation. Finally, the max-pooling operation is carried out. Through the above operations, a fixed-size ROI can be obtained with no quantization error.
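The procedure above can be illustrated on a single-channel feature map with plain Python. The sketch below is a simplified reading of the description, not the paper's code: the (y1, x1, y2, x2) box format, the placement of the four sampling points at the quarter positions of each cell, and the convention that pixel values sit at integer coordinates are all assumptions.

```python
import math


def bilinear(feature, y, x):
    """Bilinearly interpolate a 2D feature map at a fractional (y, x)."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feature, roi, out_size):
    """Pool a float-valued ROI (y1, x1, y2, x2) to out_size x out_size."""
    ry1, rx1, ry2, rx2 = roi
    cell_h = (ry2 - ry1) / out_size   # cell boundaries stay floating point
    cell_w = (rx2 - rx1) / out_size
    pooled = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            # Four regularly spaced sampling points inside the cell.
            samples = [
                bilinear(feature,
                         ry1 + (i + fy) * cell_h,
                         rx1 + (j + fx) * cell_w)
                for fy in (0.25, 0.75) for fx in (0.25, 0.75)
            ]
            row.append(max(samples))  # max-pool the four interpolated values
        pooled.append(row)
    return pooled


# 4 x 4 toy feature map whose value at (r, c) is 4 * r + c.
feature_map = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
pooled = roi_align(feature_map, roi=(0.5, 0.5, 2.5, 2.5), out_size=2)
print(pooled)  # [[6.25, 7.25], [10.25, 11.25]]
```

Because the ROI corners (0.5 and 2.5) are never rounded to integer pixel positions, the pooled values change smoothly when the box shifts slightly, which is the property that matters for precise mask boundaries.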

In the original Mask R-CNN segmentation algorithm, each ROI obtained by the RPN is aligned to extract the ROI features, and this alignment uses a single-layer (single-scale) feature map. In the presented method, as shown in Figure 4, we replace the single-layer features with multilayer features; that is, the ROI alignment operation is performed for each ROI on multiple feature layers, and the resulting ROI features of the different layers are then fused so that each ROI feature carries multilayer information.

3.1.3. Mask Acquisition

The goal of this step is to obtain the multiorgan segmentation result. The ROIs pooled to a fixed size are sent to the fully connected layer for organ classification (6 categories including background) and organ-position rectangular bounding box regression. Meanwhile, the same fixed-size ROIs are also sent to a mask generation branch (i.e., a fully convolutional network applied to each ROI). Organ area segmentation is thus a branch parallel to organ classification and organ-position rectangular bounding box regression. As shown in Figure 5(a), the branch consists of four consecutive convolution layers and a deconvolution layer (with 2× upsampling). The kernel size and number of channels of each convolution layer are 3 × 3 and 25, respectively. A binary classification branch is added before the original mask branch to distinguish foreground from background (illustrated in Figure 5(b)). The new branch contains two 3 × 3 convolution layers and a fully connected layer. A reshape operation makes the output dimension of the new branch the same as that of the original branch. The output masks of these two branches are fused to obtain the final multiorgan segmentation result.

3.2. Loss Function

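In terms of the loss function, a third loss term, which is used to generate the mask, is added on the basis of Fast R-CNN [43], so that the total loss function of our improved Mask R-CNN framework is

$$L = L_{cls} + L_{box} + L_{mask}. \tag{1}$$

Here, the classification and regression losses are defined as $L_{cls}$ and $L_{box}$, respectively:

$$L_{cls}(p, u) = -\log p_u, \tag{2}$$

$$L_{box}(t^u, v) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}\left(t_i^u - v_i\right), \qquad \mathrm{smooth}_{L_1}(z) = \begin{cases} 0.5 z^2, & |z| < 1, \\ |z| - 0.5, & \text{otherwise}. \end{cases} \tag{3}$$

$p$ is a $(K+1)$-dimensional vector representing the probabilities of the $K$ classes and the background. For each ROI, $p = (p_0, \ldots, p_K)$, and $p_u$ represents the probability corresponding to class $u$. $t^u = (t_x^u, t_y^u, t_w^u, t_h^u)$ represents the predicted translation and scaling parameters of class $u$: $t_x^u$ and $t_y^u$ refer to the translation with the same scale as the object proposal, and $t_w^u$ and $t_h^u$ refer to the width and height in logarithmic space relative to the object proposal. The indices $x$, $y$, $w$, and $h$ in equation (3) denote these horizontal translation, vertical translation, width, and height components, respectively. Moreover, $v = (v_x, v_y, v_w, v_h)$ represents the corresponding parameters of the ground-truth bounding box.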

Note that the smooth L1 loss is utilized in equation (3) for two reasons: (a) compared with the widely used L2 loss, the smooth L1 loss is robust to outliers; (b) many well-known object detection frameworks use the smooth L1 loss, e.g., Faster R-CNN and Mask R-CNN, so using the same bounding box loss function guarantees a fair comparison between algorithms. Of course, box regression loss functions that have been proposed recently (e.g., GIoU, DIoU, and CIoU) are also compatible with the proposed framework.
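The robustness argument can be checked numerically. The sketch below uses the standard Fast R-CNN form of the smooth L1 function (quadratic for residuals below 1, linear otherwise) and compares it with a plain squared error on made-up box offsets; the function names and the example numbers are illustrative assumptions.

```python
def smooth_l1(z):
    """Smooth L1: quadratic near zero, linear for |z| >= 1."""
    return 0.5 * z * z if abs(z) < 1.0 else abs(z) - 0.5


def box_loss(pred, target):
    """Sum of smooth L1 over the four box parameters (x, y, w, h)."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))


def l2_loss(pred, target):
    """Plain squared error over the same four parameters, for comparison."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


target = (0.0, 0.0, 0.0, 0.0)
close = (0.1, -0.2, 0.05, 0.0)
outlier = (0.1, -0.2, 0.05, 10.0)  # one badly wrong coordinate

for name, pred in (("close", close), ("outlier", outlier)):
    print(name, box_loss(pred, target), l2_loss(pred, target))
```

For the outlier case the squared error grows with the square of the residual, while the smooth L1 term grows only linearly, so a single bad box does not dominate the gradient.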

$L_{mask}$ in equation (1) is the mask loss of the newly added background segmentation branch (as described in Section 3.1.3). In our improved Mask R-CNN framework, the output dimension of each ROI is $K m^2$ for the newly added mask branch, where $m \times m$ represents the size of the mask and $K$ represents the number of categories, so a total of $K$ binary masks are generated. After the predicted mask is obtained, the sigmoid function is applied to each pixel of the mask, and the result is taken as one of the inputs of $L_{mask}$ (a cross-entropy loss function). It should be noted that only positive sample ROIs are used to calculate $L_{mask}$. The definition of a positive sample is the same as in general object detection algorithms: a ROI with an IoU greater than 0.5 is defined as a positive sample. In fact, $L_{mask}$ is very similar to $L_{cls}$, except that the former is calculated on the basis of pixels and the latter on the basis of images. As with $L_{cls}$, although $K$ masks are given here, only the one corresponding to the ground truth is used in calculating the cross-entropy loss. A mask contains multiple pixels, so $L_{mask}$ is the average of the cross-entropy loss over all pixels:

$$L_{mask} = -\frac{1}{m^2} \sum_{j=1}^{m^2} \left[ y_{ij} \log \sigma(x_{ij}) + (1 - y_{ij}) \log\bigl(1 - \sigma(x_{ij})\bigr) \right]. \tag{4}$$

Here, $x_{ij}$ is the $j$-th pixel of the $i$-th generated mask, $y_{ij} \in \{0, 1\}$ is the corresponding ground-truth label, $\sigma$ is the sigmoid function, and $i$ is the index of the ground-truth class.
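A minimal Python sketch of this mask loss, following the description above: a sigmoid is applied to every pixel, only the mask of the ground-truth class contributes, and the per-pixel binary cross-entropy is averaged. The flat-list data layout, the function names, and the toy numbers are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mask_loss(mask_logits, gt_mask, gt_class):
    """Average per-pixel binary cross-entropy on the ground-truth class mask.

    mask_logits: K masks, each a flat list of m * m raw scores.
    gt_mask:     flat list of m * m labels in {0, 1}.
    gt_class:    index of the ground-truth class; the other masks are ignored.
    """
    logits = mask_logits[gt_class]
    total = 0.0
    for z, y in zip(logits, gt_mask):
        p = sigmoid(z)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)


# K = 2 classes, 2 x 2 masks flattened to length 4 (made-up numbers).
logits = [
    [5.0, 5.0, 5.0, 5.0],    # mask for class 0, ignored below
    [3.0, -3.0, 2.0, -2.0],  # mask for class 1
]
gt = [1, 0, 1, 0]
loss = mask_loss(logits, gt, gt_class=1)
print(round(loss, 4))
```

Because each class has its own sigmoid mask, the classes do not compete with each other at the pixel level; the class decision is left to the classification branch. A production implementation would compute the loss directly from the logits for numerical stability.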

4. Experiments

In this section, we conduct extensive experiments to evaluate the proposed improved Mask R-CNN multiorgan segmentation framework. We first introduce the collected and annotated dataset in Section 4.1 followed by the evaluation criteria in Section 4.2. Then, Section 4.3 describes the implementation details. Finally, we discuss the comparison with state-of-the-art methods in Section 4.4.

4.1. Dataset

The utilized multiorgan segmentation dataset consists of all the slice information of 44 esophageal cancer patients, with a total of 4341 CT images. Each image was labeled with five areas (heart, right lung, left lung, PTV, and CTV) by the doctor. We use 80% of these CT images as the training set, 5% as the validation set, and the remaining 15% as the test set.
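For reference, the split sizes implied by these percentages can be worked out as follows. This is only a sketch: the random shuffling, the fixed seed, the rounding rule, and splitting at the level of individual images are assumptions, since the exact splitting procedure is not described here.

```python
import random


def split_indices(n_items, train_frac=0.80, val_frac=0.05, seed=0):
    """Shuffle item indices and cut them into train/validation/test parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = round(n_items * train_frac)
    n_val = round(n_items * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remaining ~15%
    return train, val, test


train, val, test = split_indices(4341)
print(len(train), len(val), len(test))  # 3473 217 651
```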

4.2. Evaluation Criteria

Many criteria have been proposed to evaluate image segmentation results, e.g., region overlap and boundary similarity [44]. Here, we select the Dice coefficient (DICE) [45] and the Jaccard index (JAC) [46] as criteria to evaluate the overlap between the predicted and the ground-truth organ regions. Suppose that x and y are the organ regions of the prediction and the ground truth, respectively; JAC and DICE are calculated as follows:

$$\mathrm{JAC}(x, y) = \frac{|x \cap y|}{|x \cup y|}, \qquad \mathrm{DICE}(x, y) = \frac{2\,|x \cap y|}{|x| + |y|}.$$
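Both criteria reduce to simple set operations when a region is represented as a set of pixel coordinates. The sketch below uses the standard definitions of the Jaccard index (intersection over union) and the Dice coefficient (twice the intersection over the sum of the two region sizes); the set representation, the convention of returning 1.0 for two empty regions, and the toy regions are illustrative assumptions.

```python
def jaccard(pred, truth):
    """Intersection over union of two sets of pixel coordinates."""
    union = pred | truth
    return len(pred & truth) / len(union) if union else 1.0


def dice(pred, truth):
    """Twice the intersection divided by the sum of the two region sizes."""
    total = len(pred) + len(truth)
    return 2 * len(pred & truth) / total if total else 1.0


# Toy regions: 4 predicted pixels, 3 ground-truth pixels, 2 shared.
pred = {(0, 0), (0, 1), (1, 0), (1, 1)}
truth = {(0, 1), (1, 1), (2, 1)}
jac, dsc = jaccard(pred, truth), dice(pred, truth)
print(jac, dsc)
```

The two measures are monotonically related by DICE = 2 JAC / (1 + JAC), so they always rank methods in the same order; DICE is simply the more lenient number.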

4.3. Implementation Details

We implement our improved Mask R-CNN model in PyTorch. The backbone is the adjusted ResNet50 detailed in Section 3.1.1. We use the stochastic gradient descent (SGD) optimizer with the learning rate initially set to 0.01 and the batch size set to 64. The maximum number of iterations is set to 100,000. When the number of iterations reaches 50,000 and 80,000, the learning rate is reduced by a factor of 10. All images are resized to the same size. The weight decay is set to 0.0001, and the momentum is set to 0.9 for all convolution and fully connected layers. It should be noted that all parameters in the proposed model are trained from scratch.
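The learning rate schedule described above is a two-step decay, which can be written as a small function. The function name and the choice to switch exactly at the milestone iteration are assumptions; only the initial rate, the milestones, and the factor of 10 come from the text.

```python
def learning_rate(iteration, base_lr=0.01, milestones=(50_000, 80_000), gamma=0.1):
    """Step schedule: multiply the rate by gamma at each milestone reached."""
    passed = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** passed


for it in (0, 49_999, 50_000, 80_000, 99_999):
    print(it, learning_rate(it))
```

In PyTorch the same behaviour is usually obtained with `torch.optim.lr_scheduler.MultiStepLR` and `gamma=0.1`, stepping the scheduler once per iteration rather than once per epoch.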

4.4. Results and Discussion

4.4.1. Quantitative Evaluation with State-of-the-Art Methods

We compare our proposed method against currently widely used multiorgan segmentation models (Linguraru et al. [47], He et al. [48], and Gauriau et al. [49]), and the comparison results are shown in Table 1. In general, we can observe that the proposed improved Mask R-CNN framework achieves the best performance. Moreover, Figure 6 shows the accuracy (JAC) and loss curves of the improved Mask R-CNN and the original Mask R-CNN framework in the training stage. From Table 1 and Figure 6, we can conclude that the presented technique improves the multiorgan segmentation performance of the original Mask R-CNN significantly and steadily.

4.4.2. Qualitative Evaluation

To illustrate the effectiveness of our method more visually, some multiorgan segmentation results are shown in Figure 7. The selected images are mainly drawn from slices 35 to 100 because, in this range, each slice contains the five organ regions of interest and the information of each organ region is relatively rich. We found that the area of some organs from the 60th to the 80th layers of patients is very small and is difficult to observe with the naked eye due to the perspective. Nevertheless, our improved Mask R-CNN algorithm can still achieve good results (as shown in Figure 4; in particular, the area indicated by the arrow in the figure may be difficult for doctors to annotate).

Although the proposed method can achieve encouraging performance, there are still some shortcomings. Examples of false detection and missed detection are shown in Figure 8. After analyzing all failure results, we find that the failure cases were mainly concentrated in the slices from the 1st to the 35th layer and from the 110th to the 130th layer of the patient. By observing the constructed dataset, we find that the slices near the front and near the back contain relatively little data; that is, these slices contain relatively little target organ area, so the doctor's label information in these parts is sparse. Therefore, we believe the major reason for these failure cases is that the training data are insufficient and unbalanced.

5. Conclusion

In this paper, we present an improved Mask R-CNN segmentation framework for the medical domain that works well on the multiorgan segmentation task. The proposed framework is built on the original Mask R-CNN framework [15]. Compared with the original Mask R-CNN framework, the improved Mask R-CNN makes two major improvements: (a) a ROI (region of interest) generation method that is able to utilize multiscale semantic features is presented in the RPN (region proposal network); (b) a prebackground classification subnetwork is integrated into the original mask generation branch to improve the precision of multiorgan segmentation. Additionally, extensive experiments on the collected and annotated esophageal cancer dataset demonstrate the effectiveness of the proposed framework, i.e., the improved Mask R-CNN framework can segment the heart, right lung, left lung, PTV, and CTV accurately and simultaneously. Since labeling medical images is time consuming and laborious, we will investigate semi-supervised and weakly supervised multiorgan segmentation techniques in the future.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.


Acknowledgments

This work was supported by the National Natural Science Foundation (NSF) of China (nos. 61702001 and 61902104), the Key Project of the Natural Science Foundation of Anhui University of Traditional Chinese Medicine (2019zrzd10), the Scientific Research Development Foundation of Hefei University (no. 19ZR15ZDA), the Talent Research Foundation of Hefei University (no. 18-19RC54), and the Hefei University Annual Academy Research Development Fund Project (Natural Science) (no. 18ZR12ZDA).