Research Article  Open Access
Automated Segmentation of Colorectal Tumor in 3D MRI Using 3D Multiscale Densely Connected Convolutional Neural Network
Abstract
The main goal of this work is to automatically segment colorectal tumors in 3D T2-weighted (T2w) MRI with reasonable accuracy. For this purpose, a novel deep learning-based algorithm suited for volumetric colorectal tumor segmentation is proposed. The proposed CNN architecture, based on a densely connected neural network, contains multiscale dense interconnectivity between layers of fine and coarse scales, thus leveraging multiscale contextual information to improve the flow of information throughout the network. Additionally, a 3D level-set algorithm was incorporated as a postprocessing step to refine the contours of the network-predicted segmentation. The method was assessed on T2-weighted 3D MRI of 43 patients diagnosed with locally advanced colorectal tumor (cT3/T4). Cross validation was performed in 100 rounds by partitioning the dataset into 30 volumes for training and 13 for testing. Three performance metrics were computed to assess the similarity between the predicted segmentation and the ground truth (i.e., manual segmentation by an expert radiologist/oncologist): Dice similarity coefficient (DSC), recall rate (RR), and average surface distance (ASD). Each metric is reported as mean ± standard deviation. The DSC, RR, and ASD were 0.8406 ± 0.0191, 0.8513 ± 0.0201, and 2.6407 ± 2.7975 before postprocessing, and 0.8585 ± 0.0184, 0.8719 ± 0.0195, and 2.5401 ± 2.402 after postprocessing, respectively. We compared the proposed method to existing volumetric medical image segmentation baselines (particularly 3D U-net and DenseVoxNet) on our segmentation task. The experimental results show that the proposed method achieves better performance in colorectal tumor segmentation in volumetric MRI than the baseline techniques.
1. Introduction
The colon and rectum are fundamental parts of the gastrointestinal (GI) or digestive system. The colon, which is also called the large intestine, starts from the small intestine and connects to the rectum. Its main function is to absorb minerals, nutrients, and water and to remove waste from the body [1, 2]. According to recent cancer statistics, colorectal cancer is the second leading cause of cancer death in the United States [3].
Nowadays, magnetic resonance imaging (MRI) is the preferred medical imaging modality in primary colorectal cancer diagnosis for radiotherapy treatment planning [2, 4, 5]. Usually, the oncologist or radiologist delineates colorectal tumor regions from volumetric MRI data manually. This manual delineation or segmentation is time-consuming and laborious and presents inter- and intraobserver variability. Therefore, clinical radiotherapy practice needs efficient automatic methods to segment the colorectal tumor from large volumetric data, as this may save time and reduce human intervention. In contrast to natural images, medical images are generally more chaotic, as the shape of the cancerous regions may vary from slice to slice, as shown in Figure 1. Hence, automatic segmentation of the colorectal tumor is a very challenging task, not only because the tumor may be very small but also because of its rather inconsistent shape and intensity distribution.
Figure 1: panels (a)–(c).
Lately, automatic segmentation of the colorectal tumor from volumetric MRI data based on atlas [6] and supervoxel clustering [7] has been presented with good performance. More recently, deep learning-based approaches have been widely employed with impressive results in medical image segmentation [8–16]. Trebeschi et al. [8] presented a deep learning-based automatic segmentation method to localize and segment the rectal tumor in multiparametric MRI by fusing T2-weighted (T2w) MRI and diffusion-weighted imaging (DWI) MRI. Although their method displays good performance, it is unclear whether the T2w modality alone, which provides more anatomical information than the DWI modality, could be sufficient for colorectal tumor segmentation. Moreover, they applied their implementation to 2D data, as is common with real data, whereas medical data such as CT (computed tomography) and MRI are in 3D volumetric form. The 2D convolutional neural network (CNN) algorithms segment the volumetric MRI or CT data slice by slice [9–11], where 2D kernels are used by aggregating the axial, coronal, and sagittal planes in a one-to-one association, individually. Although these 2D CNN-based methods demonstrated great improvement in segmentation accuracy [17], the inherent 2D nature of the kernels limits their ability to use volumetric spatial information. Based on this consideration, 3D CNN-based algorithms [12–16] have recently been presented, where 3D kernels are used instead of 2D kernels to extract spatial information across all three volumetric dimensions. For example, Çiçek et al. [12] proposed a 3D U-net volume-to-volume segmentation network that is an extension of the 2D U-net [18]. 3D U-net uses dual paths: an analysis path where features are abstracted, and a synthesis or upsampling path where the full-resolution segmentation is produced. 
Additionally, 3D U-net establishes shortcut connections between early and later layers of the same resolution in the analysis and synthesis paths. Chen et al. [13] presented a voxelwise residual network (VoxResNet) that extends 2D deep residual learning [19] to a 3D deep network. VoxResNet provides a skip connection to pass features from one layer to the next layer. Although 3D U-net and VoxResNet provide several skip connections to ease training, these skip connections create a short path from the early layers to the last one, which may end up reducing the net to a very simple configuration, with the unwanted additional burden of a very high number of parameters to be adjusted during training. Huang et al. [20] introduced DenseNet, which extends the concept of skip connections in [18, 19] by constructing direct connections from every layer to its corresponding previous layers to ensure maximum gradient flow between layers. In [20], DenseNet was shown to be an accurate and efficient method for natural image classification. Yu et al. [16] proposed the densely connected volumetric convolutional neural network (DenseVoxNet) for volumetric cardiac segmentation, which is an extended 3D version of DenseNet [20]. DenseVoxNet utilizes two dense blocks followed by pooling layers. The first block learns high-level feature maps, and the second block learns low-level feature maps; the latter is followed by a pooling layer that further reduces the resolution of the high-level feature maps learned in the first block. Finally, the high-resolution feature maps are restored by incorporating some deconvolution layers. In DenseVoxNet, early layers of the first block learn fine-scale features (i.e., high-level features) based on a small receptive field, while coarse-scale features (i.e., low-level features) are learned by later layers of the second block with a larger receptive field. 
In short, fine-scale and coarse-scale features are learned in early and later layers, respectively, and this may reduce the ability of the network to learn multiscale contextual information throughout the network, thus leading to suboptimal performance [21].
In this study, a novel method to overcome the abovementioned problems in 3D volumetric segmentation is presented. We propose a 3D multiscale densely connected convolutional neural network (3D MSDenseNet), a volumetric network that extends the recently proposed 2D multiscale dense networks (MSDNet) for natural image classification [22]. In summary, we have employed 3D MSDenseNet for the segmentation of the colorectal tumor, with the following contributions:
(1) A multiscale training scheme with parallel 3D densely interconnected convolutional layers along two dimensions, depth and coarser scale, is used, where low- and high-level features are generated from each scale individually. A diagonal propagation layout is incorporated to couple the depth features with the coarser features from the first layer, thus maintaining local and global (multiscale) contextual information throughout the network to improve segmentation results efficiently.
(2) The proposed network is based on volume-to-volume learning and inference, which eliminates computational redundancy.
(3) The method is validated on colorectal tumor segmentation in 3D MR images, where it outperformed previous baseline methods. Given the encouraging results obtained with MR images, the proposed method could be applied to further medical imaging applications.
2. Methods
Figure 2 gives an overview of our proposed methodology. We have extended the multiscale densely connected network to colorectal tumor segmentation in a 3D volume-to-volume learning fashion. The network is divided into two paths: the depth path and the scaled path. The depth path is similar to the dense network and extracts fine-scale features at high resolution. The scaled path is downsampled with a pooling layer with a stride of power 2. In this path, low-resolution features are learned. Furthermore, fine-scale features from the depth path are downsampled into coarse features via the diagonal path shown in Figure 2 and concatenated to the output of the convolution layer in the scaled path. In this way, both local and global contextual information is incorporated in a dense network.
2.1. DenseNet: Densely Connected Convolutional Network
Generally, in a feedforward CNN or ConvNet, the output of the l^{th} layer is represented as x_{l}, which is obtained by applying a nonlinear transformation H_{l} to the output x_{l-1} of the preceding layer such that
x_{l} = H_{l}(x_{l-1}),
where H_{l} is composed of a convolution or pooling operation followed by a nonlinear activation function such as the rectified linear unit (ReLU) or batch normalization-ReLU (BN-ReLU). Recent works in computer vision have shown that a deeper network (i.e., with more layers) increases accuracy with better learning [20]. However, the performance of deeply modeled networks tends to decrease, and their training accuracy saturates as the network depth increases, due to the vanishing/exploding gradient [20]. Later, Ronneberger et al. [18] solved this vanishing gradient problem in the deep network by incorporating skip connections, which propagate output features from layers of the same resolution in the contraction path to the output features from the layers in the expansion path. This skip connection allows the gradient to flow directly from the low-resolution path to the high-resolution path, which makes training easy, but it generally produces an enormous number of feature channels in every layer and forces the network to adjust a large number of parameters during training. To overcome this problem, Huang et al. [20] introduced a densely connected network (DenseNet). DenseNet extends the concept of skip connections by constructing a direct connection from every layer to its corresponding previous layers, to ensure maximum gradient flow between layers. In DenseNet, the feature maps produced by the preceding layers are concatenated as the input to the subsequent layer, thus providing a direct connection from any layer to the subsequent layers such that
x_{l} = H_{l}([x_{0}, x_{1}, \ldots, x_{l-1}]),
where [\cdots] represents the concatenation operation. In [20], DenseNet emerged as an accurate and efficient method for natural image classification. Yu et al. [16] proposed the densely connected volumetric convolutional neural network (DenseVoxNet) for volumetric cardiac segmentation, which is an extended 3D version of DenseNet [20].
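The dense connectivity described above can be illustrated with a small NumPy sketch. It is not the authors' Caffe implementation: `toy_layer` is a hypothetical stand-in for the transformation of each layer (a random 1 × 1 × 1 convolution followed by ReLU), and the channel counts are arbitrary. The point is only that every layer receives the concatenation of all earlier feature maps.

```python
import numpy as np

def toy_layer(x, growth_rate, rng):
    """Hypothetical stand-in for one layer: 1x1x1 convolution followed by ReLU."""
    in_channels = x.shape[0]
    weights = rng.normal(0.0, 0.01, size=(growth_rate, in_channels))
    # Contract over the channel axis: (k, C) x (C, D, H, W) -> (k, D, H, W)
    return np.maximum(np.tensordot(weights, x, axes=([1], [0])), 0.0)

def dense_block(x0, num_layers, growth_rate, seed=0):
    """Apply num_layers densely connected layers to a (C, D, H, W) volume."""
    rng = np.random.default_rng(seed)
    features = [x0]
    for _ in range(num_layers):
        # Each layer sees the concatenation of ALL earlier feature maps.
        layer_input = np.concatenate(features, axis=0)
        features.append(toy_layer(layer_input, growth_rate, rng))
    return np.concatenate(features, axis=0)

x0 = np.ones((4, 8, 8, 8))           # 4 input channels, 8^3 voxels
out = dense_block(x0, num_layers=3, growth_rate=2)
print(out.shape)                     # (10, 8, 8, 8): 4 + 3 * 2 channels
```

The channel count grows linearly with depth (here by 2 per layer), which is why dense networks can keep each individual layer narrow.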
2.2. Proposed Method (3D MSDenseNet)
In 3D MSDenseNet, we have two interconnected levels, the depth level and the scaled level, for simultaneous computation of high- and low-level features, respectively. Let x_{0} be the original input volume, and let the feature volume produced by layer l at scale s be represented as x_{l}^{s}. Considering two scales in the network (i.e., s_{1} and s_{2}), we represent the depth level (horizontal path) and the scaled level as s_{1} and s_{2}, respectively, as shown in Figure 2. The first layer is a unique layer in which the feature map of the very first convolution layer is passed to scale s_{2} via pooling with a stride of power 2. The high-resolution feature maps (x_{l}^{s_{1}}) in the horizontal path (s_{1}) produced at subsequent layers (l > 1) are densely connected [20]. However, the output feature maps of subsequent layers in the vertical path (i.e., coarser scale, s_{2}) result from the concatenation of transformed feature maps from previous layers in s_{2} and downsampled feature maps from previous layers of s_{1}, propagated diagonally, as shown in Figure 2. In this way, the output features of the coarser scale s_{2} at layer l in our proposed network can be expressed as
x_{l}^{s_{2}} = [\tilde{H}_{l}^{s_{2}}([x_{1}^{s_{1}}, \ldots, x_{l-1}^{s_{1}}]), H_{l}^{s_{2}}([x_{1}^{s_{2}}, \ldots, x_{l-1}^{s_{2}}])],
where [\cdots] denotes the concatenation operator, \tilde{H}_{l}^{s_{2}}(\cdot) represents those feature maps from the finer scale s_{1} which are transformed diagonally by the pooling layer with a stride of power 2 (as shown in Figure 2), and H_{l}^{s_{2}}(\cdot) indicates those feature maps from the coarser scale s_{2} transformed by regular convolution. Here, the outputs of \tilde{H}_{l}^{s_{2}}(\cdot) and H_{l}^{s_{2}}(\cdot) have the same feature map size. In our network, the classifier only utilizes the feature maps from the coarser scale at layer l for the final prediction.
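As a rough illustration of the diagonal coupling between the two scales, the sketch below pools fine-scale feature maps by a factor of 2 and concatenates them with coarse-scale feature maps along the channel axis. It is a simplification under stated assumptions: the convolutions are omitted, max pooling with stride 2 is assumed (the text does not specify the pooling type), and `max_pool_2x` and `coarse_scale_input` are hypothetical helper names.

```python
import numpy as np

def max_pool_2x(x):
    """Max-pool a (C, D, H, W) array by 2 along each spatial axis (even sizes assumed)."""
    c, d, h, w = x.shape
    blocks = x.reshape(c, d // 2, 2, h // 2, 2, w // 2, 2)
    return blocks.max(axis=(2, 4, 6))

def coarse_scale_input(fine_maps, coarse_maps):
    """Concatenate pooled fine-scale maps with coarse-scale maps along channels."""
    pooled = [max_pool_2x(f) for f in fine_maps]
    return np.concatenate(pooled + list(coarse_maps), axis=0)

# Two fine-scale maps (scale s1) and one coarse-scale map (scale s2).
fine = [np.zeros((2, 8, 8, 8)), np.ones((2, 8, 8, 8))]
coarse = [np.full((2, 4, 4, 4), 0.5)]
combined = coarse_scale_input(fine, coarse)
print(combined.shape)  # (6, 4, 4, 4)
```

After pooling, the fine-scale maps have the same spatial size as the coarse-scale maps, which is what makes the channel-wise concatenation possible.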
2.3. Contour Refinement with 3D LevelSet Algorithm
A 3D level set based on the geodesic active contour method [23] is employed as a postprocessor to refine the final prediction of each network discussed above. The 3D level set adjusts the predicted tumor boundaries by incorporating prior information and a smoothing function. This 3D level-set method identifies a relationship between the computation of geodesic distance curves and active contours. This relationship provides a precise detection of boundaries even in the presence of large gradient variations and gaps. The level-set method based on the geodesic active contour is described in more detail, with mathematical derivations, in [23, 24]. To simplify this algorithm, let \psi(x, t) be a level-set function which is initialized with the provided initial surface \psi_{0}(x) at t = 0. Here, \psi_{0}(x) is the probability map yielded by each method. This probability map, \psi_{0}(x), is employed as the starting surface to initialize the 3D level set. Thereafter, the evolution of the level-set function regulates the boundaries of the predicted tumor. In the geodesic active contour, a partial differential equation is used to evolve the level-set function [23] such that
\frac{\partial \psi}{\partial t} = -\alpha A(x) \cdot \nabla\psi - \beta P(x) |\nabla\psi| + \gamma Z(x) \kappa |\nabla\psi|,
where A(x), P(x), and Z(x) denote the convection function, expansion/contraction, and spatial modifier (i.e., smoothing) functions, respectively, and \kappa is the mean curvature of the level set. In addition, α, β, and γ are constant scalar quantities whose values change the behavior of the above functions. For example, negative values of β lead the initial surface to propagate outward with a given speed, while positive values move the initial surface inward. Evolution of the level-set function is an iterative process; therefore, we set the maximum number of iterations to 50 to stop the evolution process.
3. Experimental Setup
3.1. Experimental Datasets
The proposed method has been validated and compared on T2-weighted 3D colorectal MR images. Data were collected from two different hospitals: Department of Radiological Sciences, Oncology and Pathology, University La Sapienza, AOU Sant’Andrea, Via di Grottarossa 1035, 00189 Rome, Italy; and Department of Radiological Sciences, University of Pisa, Via Savi 10, 56126 Pisa, Italy. MR data were acquired in a sagittal view on a 3.0 Tesla scanner without a contrast agent. The overall dataset consists of 43 T2-weighted MRI volumes. The number of slices per volume varies across subjects in the range 69∼122, giving volume dimensions of 512 × 512 × (69∼122). The voxel spacing varied from 0.46 × 0.46 × 0.5 to 0.6 × 0.6 × 1.2 mm/voxel across subjects. As the data have a slight slice gap, we did not incorporate any spatial resampling. The whole dataset was divided into training and testing sets for 100 repeated rounds of cross validation; i.e., in each round, 30 volumes were used for training and 13 for testing, until the combined results gave a numerically stable segmentation performance. All MRI volumes were preprocessed by normalizing them to zero mean and unit variance. We cropped all the volumes to a size of 195 × 114 × 61 mm. Furthermore, during training, the data were augmented with random rotations of 90°, 180°, and 270° in the sagittal plane to enlarge the training data. In addition, two medical experts manually segmented the colorectal tumor in all volumes using the ITK-SNAP software [25, 26]. These manual delineations of the tumors were then used as ground truth labels to train the network and to validate it in the test phase.
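The normalization and rotation augmentation described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: it assumes that the first two array axes span the sagittal plane, and `normalize` and `random_rotation` are hypothetical helper names.

```python
import numpy as np

def normalize(volume):
    """Scale a volume to zero mean and unit variance."""
    volume = np.asarray(volume, dtype=np.float64)
    std = volume.std()
    if std == 0:
        raise ValueError("constant volume cannot be normalized")
    return (volume - volume.mean()) / std

def random_rotation(volume, rng, plane=(0, 1)):
    """Rotate by 90, 180, or 270 degrees in the given plane (assumed sagittal)."""
    quarter_turns = int(rng.integers(1, 4))  # 1, 2, or 3
    return np.rot90(volume, k=quarter_turns, axes=plane)

rng = np.random.default_rng(0)
vol = rng.normal(5.0, 3.0, size=(6, 6, 4))   # synthetic stand-in for an MRI crop
norm = normalize(vol)
aug = random_rotation(norm, rng)
print(norm.mean(), norm.std(), aug.shape)
```

Note that a 90° or 270° rotation swaps the two in-plane dimensions, so the shape is preserved only when the in-plane crop is square, as with cubic subvolumes. The same rotation must be applied to the ground truth mask.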
3.2. Proposed Network Implementation
Our network architecture is composed of dual parallel paths, i.e., the depth path and the scaled path, as illustrated in Figure 2, and achieves 3D end-to-end training by adopting the nature of a fully convolutional network. The depth path consists of eight transformation layers, and the scaled path consists of nine transformation layers. In each path, every transformation layer is composed of a BN and a ReLU followed by a 3 × 3 × 3 convolution (Conv), in a fashion similar to DenseVoxNet. Furthermore, a 3D upsampling block is utilized as in DenseVoxNet. Like DenseVoxNet, the proposed network uses a dropout layer with a dropout rate of 0.2 after each Conv layer to increase the robustness of the network against overfitting. Our proposed method has approximately 0.7 million parameters in total, far fewer than DenseVoxNet [16] with 1.8 million and 3D U-net [12] with 19.0 million parameters. We have implemented our proposed method in the Caffe library [27]. Our implementation code is available online at http://host.uniroma3.it/laboratori/sp4te/teaching/sp4bme/documents/codemsdn.zip.
3.3. Networks Training Procedures
All the networks (3D FCNNs [15], 3D U-net [12], and DenseVoxNet [16]) were originally implemented in the Caffe library [27]. For the sake of comparison, we applied a training procedure very similar to that used by 3D U-net and DenseVoxNet.
Firstly, we randomly initialized the weights from a Gaussian distribution with μ = 0 and σ = 0.01. The stochastic gradient descent (SGD) algorithm [28] was used for network optimization. We set the metaparameters of the SGD algorithm for updating the weights as batch size = 4, weight decay = 0.0005, and momentum = 0.05. We set the initial learning rate to 0.05 and divided it by 10 every 50 epochs. The same learning rate policy as in DenseVoxNet, i.e., “poly,” was adopted for all the methods. The “poly” learning rate policy changes the learning rate at each iteration following a polynomial decay, where the learning rate is multiplied by the term (1 - iter/max_iter)^{power} [29]; the power was set to 0.9 and the maximum number of iterations to 40000. Moreover, to reduce GPU memory usage, subvolumes of 32 × 32 × 32 voxels were randomly cropped from the training volumes as inputs to the network, and a majority voting strategy [30] was used to obtain the final segmentation results from the predictions of the overlapping subvolumes. Finally, the softmax with cross-entropy loss was used to measure the loss between the predicted network output and the ground truth labels.
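For reference, the "poly" policy as defined in Caffe scales the base learning rate by (1 - iter/max_iter)^power. The short sketch below evaluates it with the settings quoted above (base rate 0.05, power 0.9, 40000 iterations); the function name `poly_learning_rate` is ours and only illustrative.

```python
def poly_learning_rate(base_lr, iteration, max_iter=40000, power=0.9):
    """Learning rate under the 'poly' policy: base_lr * (1 - iter/max_iter)^power."""
    if not 0 <= iteration <= max_iter:
        raise ValueError("iteration must lie in [0, max_iter]")
    return base_lr * (1.0 - iteration / max_iter) ** power

# Learning rate at a few points of the schedule.
for it in (0, 10000, 20000, 30000, 40000):
    print(it, poly_learning_rate(0.05, it))
```

The rate starts at the base value, decays monotonically, and reaches exactly zero at the final iteration.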
3.4. Performance Metrics
In this study, three evaluation metrics were used to validate and compare the proposed algorithm, namely, Dice similarity coefficient (DSC) [31], recall rate (RR), and average symmetric surface distance (ASD) [32]. These metrics are briefly explained as follows.
3.4.1. Dice Similarity Coefficient (DSC)
The DSC is a widely used performance metric in medical image segmentation. It is also known as the overlap index. It computes a general overlap similarity rate between the given ground truth label and the segmentation output predicted by a segmentation method. DSC is expressed as
DSC(S_{p}, S_{g}) = \frac{2|S_{p} \cap S_{g}|}{|S_{p}| + |S_{g}|} = \frac{2TP}{FP + 2TP + FN},
where S_{p} and S_{g} are the predicted segmentation output and the ground truth label, respectively. FP, TP, and FN indicate false positives, true positives, and false negatives, respectively. DSC gives a score between 0 and 1, where 1 is the best prediction and indicates that the predicted segmentation output is identical to the ground truth.
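A minimal NumPy sketch of the DSC for binary masks is given below, using the standard definition 2TP / (FP + 2TP + FN). The function name and the convention of returning 1.0 when both masks are empty are our assumptions.

```python
import numpy as np

def dice_coefficient(pred, truth):
    """Dice similarity coefficient between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.logical_and(pred, truth).sum()
    denom = pred.sum() + truth.sum()       # equals FP + 2*TP + FN
    if denom == 0:
        return 1.0                         # assumed convention: two empty masks agree
    return 2.0 * tp / denom

pred = np.array([1, 1, 0, 0])
truth = np.array([1, 0, 1, 0])
print(dice_coefficient(pred, truth))       # 0.5 (TP = 1, FP = 1, FN = 1)
```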
3.4.2. Recall Rate (RR)
RR is also referred to as the true-positive rate (TPR) or sensitivity. We use this term as the voxelwise recall rate to assess the recall performance of the different algorithms. This metric measures the proportion of tumor-related voxels that are correctly classified rather than missed. It is mathematically expressed as
RR = \frac{TP}{TP + FN}.
It also gives a value between 0 and 1. Higher values indicate better predictions.
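The voxelwise recall rate can be sketched in the same way, taking the standard definition of sensitivity, TP / (TP + FN). The function name and the error raised for an empty ground truth are our choices.

```python
import numpy as np

def recall_rate(pred, truth):
    """Voxelwise recall (sensitivity): TP / (TP + FN) for binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.logical_and(pred, truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    if tp + fn == 0:
        raise ValueError("recall is undefined when the ground truth is empty")
    return tp / (tp + fn)

pred = np.array([1, 1, 1, 0])
truth = np.array([1, 1, 0, 1])
print(recall_rate(pred, truth))  # 2 of the 3 tumor voxels are recovered
```

Unlike the DSC, recall does not penalize false positives, so a mask covering the whole volume would reach a recall of 1; this is why the two metrics are reported together.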
3.4.3. Average Symmetric Surface Distance (ASD)
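ASD measures the average distance between the sets of boundary voxels of the predicted segmentation and the ground truth and is mathematically given as
ASD(S_{p}, S_{g}) = \frac{1}{|S_{p}| + |S_{g}|} \left( \sum_{k=1}^{|S_{p}|} d(p_{k}, S_{g}) + \sum_{k=1}^{|S_{g}|} d(g_{k}, S_{p}) \right),
where p_{k} and g_{k} represent the k^{th} voxel from the S_{p} and S_{g} sets, respectively. The function d denotes the point-to-set distance and is defined as d(p_{k}, S_{g}) = \min_{g \in S_{g}} \|p_{k} - g\|, where \|\cdot\| is the Euclidean distance. Lower values of ASD indicate higher closeness between the two sets, hence a better segmentation, and vice versa.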
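The sketch below computes the ASD for small binary masks by brute force. Several choices are assumptions rather than details given in the text: boundary voxels are taken as foreground voxels with at least one 6-connected background neighbour, both masks are assumed non-empty, and the `spacing` argument (voxel size in mm) is added for illustration. The pairwise distance matrix makes this suitable only for small volumes.

```python
import numpy as np

def boundary_voxels(mask):
    """Coordinates of foreground voxels with a 6-connected background neighbour."""
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, 1, constant_values=False)
    interior = padded.copy()
    for axis in range(mask.ndim):
        interior &= np.roll(padded, 1, axis=axis) & np.roll(padded, -1, axis=axis)
    boundary = padded & ~interior
    return np.argwhere(boundary[(slice(1, -1),) * mask.ndim])

def point_to_set(points, targets, spacing):
    """Distance from each point to its nearest target (Euclidean, in mm)."""
    diffs = (points[:, None, :] - targets[None, :, :]) * spacing
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)

def average_surface_distance(pred, truth, spacing=(1.0, 1.0, 1.0)):
    """Average symmetric surface distance between two non-empty binary masks."""
    sp, sg = boundary_voxels(pred), boundary_voxels(truth)
    spacing = np.asarray(spacing, dtype=float)
    total = point_to_set(sp, sg, spacing).sum() + point_to_set(sg, sp, spacing).sum()
    return total / (len(sp) + len(sg))

a = np.zeros((5, 5, 5), dtype=bool)
b = np.zeros((5, 5, 5), dtype=bool)
a[1, 1, 1] = True
b[1, 1, 3] = True
print(average_surface_distance(a, b))  # 2.0: single voxels two steps apart
```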
4. Experimental Results
In this section, we experimentally evaluate the efficacy of the multiscale end-to-end training scheme of our proposed method, which incorporates parallel 3D densely interconnected convolutional layers along the depth and coarser-scale paths. Since this study focuses on the segmentation of tumors by 3D networks, the use of 2D networks is outside the scope of this paper. Nevertheless, we tried 2D networks in preliminary trials with a small set of image data. The 2D network was able to correctly recognize the tumor but could not segment the whole tumor accurately, especially for small tumors.
In this work, the proposed network has been assessed on 3D colorectal MRI data. For a more comprehensive analysis and comparison of segmentation results, the dataset was divided into ground truth masks (i.e., manual segmentation done by medical experts) and training and validation subsets. Quantitative and qualitative evaluations and comparisons with baseline networks are reported for the segmentation of the colorectal tumor. First, we analyze and compare the learning process of each method, as described in Section 4.1. Secondly, we assess the efficiency of each algorithm qualitatively; Section 4.2 presents a comparison of qualitative results. Finally, in Section 4.3, we quantitatively evaluate the segmentation results yielded by each algorithm, using the evaluation metrics described in Section 3.4.
4.1. Learning Curves
The learning process of each method is illustrated in Figure 3, where the training loss and the validation loss of the proposed method and of the baseline methods are shown individually. Figure 3 demonstrates that no method exhibits serious overfitting, as the validation loss consistently decreases along with the training loss. Each method adopts a 3D fully convolutional architecture, where error backpropagation is carried out with a per-voxel strategy instead of the patch-based training scheme [33]. In other words, each single voxel is independently utilized as a training sample, which dramatically enlarges the training dataset and thus reduces the overfitting risk. In contrast, the traditional patch-based training scheme [33] needs a dense prediction (i.e., many patches are required) for each voxel in the 3D volumetric data, and the computation of these redundant patches for every voxel makes the network computationally too complex and impractical for volumetric segmentation.
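Figure 3: Learning curves (training and validation loss) of (a) 3D FCNNs, (b) 3D U-net, (c) DenseVoxNet, and (d) the proposed method.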
Comparing the learning curves of 3D FCNNs (Figure 3(a)), 3D U-net (Figure 3(b)), and DenseVoxNet (Figure 3(c)) shows that 3D U-net and DenseVoxNet converge much faster and to a lower error rate than 3D FCNNs. This demonstrates that both 3D U-net and DenseVoxNet successfully overcome the vanishing/exploding gradient problem by reusing the features of early layers in later layers. On the other hand, there is no significant difference between the learning curves of 3D U-net and DenseVoxNet, although DenseVoxNet attains a steady drop of validation loss in the beginning. This further suggests that the feature reuse of DenseVoxNet, in which features from preceding layers are passed to every subsequent layer, propagates the maximum gradients, in contrast to the skip connections employed by 3D U-net, which propagate output features from layers of the same resolution in the contraction path to the output features of the layers in the expansion path. Furthermore, Figure 3(d) shows that the proposed method, which incorporates the multiscale dense training scheme, has the best loss rate among all the examined methods. This indicates that the multiscale training scheme in our method optimizes and speeds up the network training procedure. Thus, the proposed method has the fastest convergence and the lowest loss rate among all the methods.
4.2. Qualitative Results
In this section, we report the qualitative results to assess the effectiveness of each method in segmenting the colorectal tumors. Figure 4(a) gives a visual comparison of the colorectal tumor segmentation results achieved by the examined methods. In Figure 4(a), from left to right, the first two columns are the raw MRI input volume and its cropped volume, and the three following columns show the segmentation results produced by each method: the predicted foreground probability, the initial colorectal segmentation results, and the segmentation results refined by the 3D level set. Moreover, the segmentation results produced by each method are outlined in red and overlaid on the ground truth, which is outlined in green. In Figure 4(b), we have overlaid the segmented 3D mask on the ground truth 3D mask to visualize the false-negative rate in the segmentation results. It can be observed that the proposed method (3D MSDenseNet) outperforms DenseVoxNet, 3D U-net, and 3D FCNNs, with the lowest false-negative rate. It is also noteworthy that the segmentation results obtained by each method improve significantly if a 3D level set is incorporated.
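Figure 4: (a) Visual comparison of the colorectal tumor segmentation results of the examined methods; (b) overlap of the segmented 3D mask with the ground truth 3D mask.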
4.3. Quantitative Results
Table 1 presents the quantitative results of colorectal tumor segmentation produced by each method. The quantitative results are obtained by computing the mean and standard deviation of each performance metric over all 13 test volumes. We first compare the results obtained by each method without postprocessing by the 3D level set, considered here as baseline methods. Then, we present a comparison in which the 3D level set is incorporated as a postprocessor to refine the boundaries of the segmentation results obtained by these baseline algorithms. In this way, we obtain a total of eight settings: 3D FCNNs, 3D U-net, DenseVoxNet, 3D MSDenseNet, 3D FCNNs + 3D level set, 3D U-net + 3D level set, DenseVoxNet + 3D level set, and 3D MSDenseNet + 3D level set. Table 1 reveals that 3D FCNNs have the lowest performance across all metrics, followed by 3D U-net and DenseVoxNet, whereas the proposed method achieves the highest values of DSC and RR and the lowest value of ASD. After postprocessing, every method improves its performance: 3D FCNNs + 3D level set improves DSC and RR by 16.44% and 15.23%, respectively, and reduces ASD from 4.2613 mm to 3.0029 mm. Similarly, 3D U-net + 3D level set attains improvements in DSC and RR of 5% and 5.97%, and DenseVoxNet + 3D level set of 4.99% and 4.29%, respectively. Both also achieve a significant reduction in ASD: 3D U-net + 3D level set reduces ASD from 3.0173 to 2.8815, and DenseVoxNet + 3D level set from 2.7253 to 2.5249. In comparison, 3D MSDenseNet + 3D level set improves DSC and RR by 2.13% and 2.42%, respectively, and reduces ASD from 2.6407 to 2.5401. Thus, although 3D MSDenseNet does not gain a significant improvement from the postprocessing step, 3D MSDenseNet + 3D level set still outperforms the other methods. 
Considering both qualitative and quantitative results, it can be observed that the addition of the 3D level set as a postprocessor improves the segmentation results of each method.
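The percentage gains quoted above appear to be relative changes with respect to the value before postprocessing; the text does not state the formula, so this is an assumption. As a check, the short sketch below applies it to the DSC and RR values of 3D MSDenseNet reported in the abstract (0.8406 to 0.8585 and 0.8513 to 0.8719), which reproduces the stated 2.13% and 2.42%.

```python
def relative_improvement(before, after):
    """Relative change in percent, measured against the value before postprocessing."""
    return 100.0 * (after - before) / before

dsc_gain = relative_improvement(0.8406, 0.8585)
rr_gain = relative_improvement(0.8513, 0.8719)
print(round(dsc_gain, 2))  # 2.13
print(round(rr_gain, 2))   # 2.42
```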

5. Discussion
In this work, we have tested the 3D FCNNs + 3D level set method [15], devised by mostly the same authors as this paper, together with two further prominent and widely explored volumetric segmentation algorithms, namely, 3D U-net [12] and 3D DenseVoxNet [16], for volumetric segmentation of the colorectal tumor from T2-weighted abdominal MRI. Furthermore, we have extended their ability for the colorectal tumor segmentation task by incorporating the 3D level set into their original implementations. In order to improve the performance, we have proposed a novel algorithm based on a 3D multiscale densely connected neural network (3D MSDenseNet). Many studies in the literature have developed techniques for medical image segmentation; they are mostly based on geometrical methods that address the challenges of medical image segmentation, including statistical shape models, graph cuts, level sets, and so on [34]. Recently, level set-based segmentation algorithms have been commonly explored for medical image segmentation. Generally, they utilize energy minimization approaches that incorporate different regularization terms (smoothing terms) and prior information (i.e., initial contour, etc.) depending on the segmentation task. Level set-based segmentation algorithms are attractive because of their ability to vary the topological properties of the segmentation function [35]. However, they always require an appropriate initial contour to segment a desired object, and in medical image segmentation this initialization requires the intervention of an expert user. In addition, since medical images have a disordered intensity distribution and show high variability (among imaging modalities, slices, etc.), a segmentation based on statistical models of the intensity distribution is not successful. 
More precisely, level set-based approaches, given their simple appearance model [36] and their lack of generalization ability and transferability, are in some cases unable to learn the chaotic intensity distribution of medical images on their own. Currently, deep learning-based approaches (i.e., CNNs) have been successfully explored in the medical imaging domain, specifically for classification, detection, and segmentation tasks. Usually, deep learning-based approaches learn a model by deeply extracting features from intricate structures and patterns in well-defined, large training datasets; the trained model is then used for prediction. In contrast to level set-based approaches, deep learning-based approaches can learn appearance models automatically from large training data, which improves their transferability and generalization ability. However, deep learning-based approaches do not provide an explicit way to incorporate regularization or smoothing terms, as the level-set function does. Therefore, in order to take advantage of both level set and deep learning, we have incorporated the 3D level set in each method that we used in our task.
Moreover, traditional CNNs are 2D in nature and were designed especially for 2D natural images, whereas medical images such as MRI or CT are inherently 3D. Generally, these 2D CNNs with 2D kernels have been used for medical image segmentation by processing the volume slice by slice in sequential order. Such 2D kernels cannot fully exploit volumetric spatial information because they do not share spatial information among the three planes. A 3D CNN architecture, whose 3D kernels share spatial information among the three planes simultaneously, can offer a more effective solution.
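To make the difference concrete, the following is a minimal sketch rather than the implementation used in this work. It applies a hypothetical 3 × 3 averaging kernel within a single slice and a 3 × 3 × 3 averaging kernel across slices to a toy volume; the function names and the toy data are assumptions chosen only for illustration.

```python
def mean_2d_at(volume, z, y, x):
    """3x3 averaging kernel applied inside slice z only (slice-by-slice processing)."""
    values = [volume[z][y + dy][x + dx]
              for dy in (-1, 0, 1)
              for dx in (-1, 0, 1)]
    return sum(values) / len(values)


def mean_3d_at(volume, z, y, x):
    """3x3x3 averaging kernel that also looks at the two neighbouring slices."""
    values = [volume[z + dz][y + dy][x + dx]
              for dz in (-1, 0, 1)
              for dy in (-1, 0, 1)
              for dx in (-1, 0, 1)]
    return sum(values) / len(values)


# Toy 3x3x3 volume (indexed as volume[z][y][x]) that is zero everywhere
# except for one bright voxel in the slice adjacent to the central slice.
volume = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
volume[0][1][1] = 27.0

response_2d = mean_2d_at(volume, 1, 1, 1)  # sees only slice z = 1
response_3d = mean_3d_at(volume, 1, 1, 1)  # sees slices z = 0, 1, 2

print(response_2d, response_3d)  # 0.0 1.0
```

The 2D kernel returns zero because the bright voxel lies in a neighbouring slice, whereas the 3D kernel responds to it, which is the through-plane context a slice-by-slice approach cannot use.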
Another challenge of 3D CNNs is the difficulty of optimizing the network as it goes deeper. Deeper networks are more prone to overfitting and to vanishing gradients as depth increases. This is confirmed in this work: the segmentation results produced by 3D FCNNs in Figure 4 show how the patterns/gradients are attenuated. To preserve the gradients in subsequent layers as the network goes deeper, 3D U-Net and DenseVoxNet reuse features from earlier layers in later ones. 3D U-Net overcomes the vanishing gradient problem in deep networks by incorporating skip connections, which propagate the output features of layers in the contraction path to the output features of layers of the same resolution in the expansion path. Such a skip connection allows the gradient to flow directly from the low-resolution path to the high-resolution one, which eases training; however, it generally produces a very large number of feature channels in every layer and therefore a large number of parameters to adjust during training. To overcome this problem, DenseVoxNet extends the concept of skip connections by constructing a direct connection from every layer to all of its preceding layers, to ensure maximum gradient flow between layers. In simple terms, the feature maps produced by the preceding layers are concatenated as input to each subsequent layer, thus providing a direct connection from any layer to the layers that follow it. Our results indicate that the direct connection strategy of DenseVoxNet provides better segmentation than the skip connection strategy of 3D U-Net. However, DenseVoxNet has a drawback: the network learns high-level and low-level features separately, in different (early and later) layers, which limits its ability to learn multiscale contextual information throughout the network and may lead to poor performance.
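The concatenation scheme described above can be illustrated with a short sketch of the channel bookkeeping in a densely connected block. It assumes that every layer emits a fixed number of feature maps (the growth rate, as in densely connected networks [20]); the function name and the values 16, 12, and 4 are hypothetical and are not the configuration of DenseVoxNet or of the proposed network.

```python
def dense_block_channels(initial_channels, growth_rate, num_layers):
    """Channel bookkeeping for one densely connected block.

    Layer i receives the block input concatenated with the outputs of all
    i preceding layers, each of which contributes `growth_rate` feature maps.
    Returns the per-layer input channel counts and the block output count.
    """
    inputs = [initial_channels + i * growth_rate for i in range(num_layers)]
    output = initial_channels + num_layers * growth_rate
    return inputs, output


# Hypothetical configuration, chosen only for illustration.
inputs, output = dense_block_channels(initial_channels=16, growth_rate=12, num_layers=4)
print(inputs, output)  # [16, 28, 40, 52] 64
```

Because each layer adds only a small, fixed number of feature maps while still seeing everything produced before it, features are reused instead of being relearned, which is one reason densely connected designs can remain compact.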
The network we have proposed provides a multiscale dense training scheme in which high-resolution and low-resolution features are learned simultaneously, thus maintaining maximum gradient flow throughout the network. Our experimental analysis reveals that reusing features through multiscale dense connectivity produces an effective colorectal tumor segmentation. Nevertheless, although the proposed method obtained better performance in colorectal tumor segmentation, it shows higher variance in DSC and RR values than the other methods, as shown in Table 1. This suggests that the proposed algorithm may not fully handle contrast variations within the cancerous region and variations of the slice gap along the z-axis among the datasets. Better normalization and a superresolution method, together with more training samples, might be required to circumvent this problem.
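For reference, the two overlap metrics discussed here can be computed as in the following minimal sketch, which uses the standard definitions DSC = 2|A ∩ B| / (|A| + |B|) and RR = TP / (TP + FN). The masks and the per-round values are made-up toy data, the function name is hypothetical, and the choice of the sample standard deviation is an assumption, since this section does not state which estimator was used.

```python
from statistics import mean, stdev


def dice_and_recall(pred, truth):
    """Return (DSC, RR) for two flattened binary masks of equal length."""
    if len(pred) != len(truth):
        raise ValueError("masks must contain the same number of voxels")
    true_positive = sum(1 for p, t in zip(pred, truth) if p and t)
    n_pred = sum(1 for p in pred if p)    # voxels labelled tumor by the method
    n_truth = sum(1 for t in truth if t)  # voxels labelled tumor by the expert
    # Two empty masks agree perfectly, so both metrics default to 1.0.
    dsc = 2 * true_positive / (n_pred + n_truth) if (n_pred + n_truth) else 1.0
    rr = true_positive / n_truth if n_truth else 1.0
    return dsc, rr


# Toy masks: 3 predicted voxels, 3 ground-truth voxels, 2 of them shared.
pred = [1, 1, 1, 0, 0, 0]
truth = [0, 1, 1, 1, 0, 0]
dsc, rr = dice_and_recall(pred, truth)

# Toy per-round DSC values, summarized as mean and standard deviation.
toy_dsc_per_round = [0.84, 0.86, 0.88]
dsc_mean, dsc_std = mean(toy_dsc_per_round), stdev(toy_dsc_per_round)
```

A larger standard deviation across cross-validation rounds, as reported for the proposed method, means that the per-round scores are spread more widely around their mean.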
6. Conclusion
In this research work, a novel 3D fully convolutional network architecture (3D MSDenseNet) is presented for accurate colorectal tumor segmentation in T2-weighted MRI volumes. The proposed network provides dense interconnectivity among the horizontal (depth) and vertical (scale) layers. In this way, finer (high-resolution) and coarser (low-resolution) features are coupled in a two-dimensional array of horizontal and vertical layers, and thus, features of all resolutions are produced from the first layer on and maintained throughout the network. In the other networks (viz., traditional CNNs, 3D U-Net, or DenseVoxNet), by contrast, coarse-level features are generated only with increasing network depth. The experimental results show that the multiscale scheme of our approach attained the best performance among all the methods tested. Moreover, we incorporated the 3D level set algorithm into each method as a postprocessor that refines the predicted segmentation. It has also been shown that adding a 3D level set increases the performance of all the deep learning-based approaches. In addition, the proposed method, owing to its simple network architecture, has approximately 0.7 million parameters in total, far fewer than DenseVoxNet with 1.8 million and 3D U-Net with 19.0 million parameters. As a possible future direction, the proposed method could be further validated on other volumetric medical image segmentation tasks.
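The parameter totals quoted above can be put in proportion with a short sketch. The per-layer formula is the standard count for a 3D convolution with a cubic kernel (weights plus one bias per output channel); the function names and the 16-to-32-channel example layer are hypothetical, and only the three totals in millions come from the text.

```python
def conv3d_params(in_channels, out_channels, kernel_size=3, bias=True):
    """Trainable parameters of one 3D convolution with a cubic kernel."""
    weights = kernel_size ** 3 * in_channels * out_channels
    return weights + (out_channels if bias else 0)


# Totals reported in the text, in millions of parameters.
REPORTED_PARAMS_MILLIONS = {
    "3D MSDenseNet": 0.7,
    "DenseVoxNet": 1.8,
    "3D U-Net": 19.0,
}


def relative_size(name, baseline="3D MSDenseNet"):
    """How many times larger `name` is than `baseline`."""
    return REPORTED_PARAMS_MILLIONS[name] / REPORTED_PARAMS_MILLIONS[baseline]


# Hypothetical layer: 16 input channels, 32 output channels, 3x3x3 kernel.
print(conv3d_params(16, 32))  # 13856

for name in ("DenseVoxNet", "3D U-Net"):
    print(f"{name}: about {relative_size(name):.1f}x the proposed network")
```

With the reported totals, DenseVoxNet is roughly 2.6 times and 3D U-Net roughly 27 times the size of the proposed network.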
Data Availability
The T2-weighted MRI data used to support the findings of this study are restricted by the ethical boards of the Department of Radiological Sciences, University of Pisa, Via Savi 10, 56126 Pisa, Italy, and the Department of Radiological Sciences, Oncology and Pathology, University La Sapienza, AOU Sant’Andrea, Via di Grottarossa 1035, 00189 Rome, Italy, in order to protect patient privacy. Data are available from Prof. Andrea Laghi (Department of Radiological Sciences, Oncology and Pathology, University La Sapienza, AOU Sant’Andrea, Via di Grottarossa 1035, 00189 Rome, Italy) and Prof. Emanuele Neri (Department of Radiological Sciences, University of Pisa, Via Savi 10, 56126 Pisa, Italy) for researchers who meet the criteria for access to confidential data.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
References
[1] Ashiya, “Notes on the structure and functions of large intestine of human body,” February 2013, http://www.preservearticles.com/201105216897/notesonthestructureandfunctionsoflargeintestineofhumanbody.html.
[2] M. H. Soomro, G. Giunta, A. Laghi et al., “Haralick’s texture analysis applied to colorectal T2-weighted MRI: a preliminary study of significance for cancer evolution,” in Proceedings of 13th IASTED International Conference on Biomedical Engineering (BioMed 2017), pp. 16–19, Innsbruck, Austria, February 2017.
[3] R. L. Siegel, K. D. Miller, and A. Jemal, “Cancer statistics, 2017,” CA: A Cancer Journal for Clinicians, vol. 67, no. 1, pp. 7–30, 2017.
[4] H. Kaur, H. Choi, Y. N. You et al., “MR imaging for preoperative evaluation of primary rectal cancer: practical considerations,” RadioGraphics, vol. 32, no. 2, pp. 389–409, 2012.
[5] U. Tapan, M. Ozbayrak, and S. Tatli, “MRI in local staging of rectal cancer: an update,” Diagnostic and Interventional Radiology, vol. 20, no. 5, pp. 390–398, 2014.
[6] M. A. Gambacorta, C. Valentini, N. Dinapoli et al., “Clinical validation of atlas-based auto-segmentation of pelvic volumes and normal tissue in rectal tumors using auto-segmentation computed system,” Acta Oncologica, vol. 52, no. 8, pp. 1676–1681, 2013.
[7] B. Irving, A. Cifor, B. W. Papież et al., “Automated colorectal tumor segmentation in DCE-MRI using supervoxel neighbourhood contrast characteristics,” in Proceedings of 17th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2014), pp. 609–616, Springer, Boston, MA, USA, September 2014.
[8] S. Trebeschi, J. J. M. van Griethuysen, D. M. J. Lambregts et al., “Deep learning for fully-automated localization and segmentation of rectal cancer on multiparametric MR,” Scientific Reports, vol. 7, p. 5301, 2017.
[9] A. Prasoon, K. Petersen, C. Igel, F. Lauze, E. Dam, and M. Nielsen, “Deep feature learning for knee cartilage segmentation using a triplanar convolutional neural network,” in Proceedings of 16th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2013), pp. 246–253, Springer, Nagoya, Japan, September 2013.
[10] M. Havaei, A. Davy, D. Warde-Farley et al., “Brain tumor segmentation with deep neural networks,” Medical Image Analysis, vol. 35, pp. 18–31, 2017.
[11] H. R. Roth, L. Lu, A. Farag, A. Sohn, and R. M. Summers, “Spatial aggregation of holistically-nested networks for automated pancreas segmentation,” in Proceedings of 19th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2016), pp. 451–459, Springer, Athens, Greece, October 2016.
[12] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3D U-Net: learning dense volumetric segmentation from sparse annotation,” in Proceedings of 19th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2016), pp. 424–432, Springer, Athens, Greece, October 2016.
[13] H. Chen, Q. Dou, L. Yu, J. Qin, and P. A. Heng, “VoxResNet: deep voxelwise residual networks for brain segmentation from 3D MR images,” NeuroImage, vol. 170, pp. 446–455, 2017.
[14] Q. Dou, L. Yu, H. Chen et al., “3D deeply supervised network for automated segmentation of volumetric medical images,” Medical Image Analysis, vol. 41, pp. 40–54, 2017.
[15] M. H. Soomro, G. De Cola, S. Conforto et al., “Automatic segmentation of colorectal cancer in 3D MRI by combining deep learning and 3D level-set algorithm - a preliminary study,” in Proceedings of IEEE 4th Middle East Conference on Biomedical Engineering (MECBME), pp. 198–203, Tunis, Tunisia, March 2018.
[16] L. Yu, J. Z. Cheng, Q. Dou et al., “Automatic 3D cardiovascular MR segmentation with densely-connected volumetric convnets,” in Proceedings of 20th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2017), pp. 287–295, Quebec City, QC, Canada, September 2017.
[17] B. H. Menze, A. Jakab, S. Bauer et al., “The multimodal brain tumor image segmentation benchmark (BRATS),” IEEE Transactions on Medical Imaging, vol. 34, no. 10, pp. 1993–2024, 2015.
[18] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: convolutional networks for biomedical image segmentation,” in Proceedings of 18th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015), pp. 234–241, Springer, Munich, Germany, October 2015.
[19] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, Las Vegas, NV, USA, June 2016.
[20] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, July 2017.
[21] T. D. Bui, J. Shin, and T. Moon, “3D densely convolution networks for volumetric segmentation,” 2017, http://arxiv.org/abs/1709.03199.
[22] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Weinberger, “Multi-scale dense networks for resource efficient image classification,” in Proceedings of International Conference on Learning Representations, Vancouver, BC, Canada, April-May 2018.
[23] V. Caselles, R. Kimmel, and G. Sapiro, “Geodesic active contours,” International Journal of Computer Vision, vol. 22, no. 1, pp. 61–79, 1997.
[24] M. H. Soomro, G. Giunta, A. Laghi et al., “Segmenting MR images by level-set algorithms for perspective colorectal cancer diagnosis,” in Proceedings of the VI ECCOMAS Thematic Conference on Computational Vision and Medical Image Processing (VipIMAGE 2017), vol. 27, Springer, Porto, Portugal, October 2017.
[25] T. S. Yoo, M. J. Ackerman, W. E. Lorensen et al., “Engineering and algorithm design for an image processing API: a technical report on ITK - the insight toolkit,” Studies in Health Technology and Informatics, vol. 85, pp. 586–592, 2002.
[26] P. A. Yushkevich, J. Piven, H. C. Hazlett et al., “User-guided 3D active contour segmentation of anatomical structures: significantly improved efficiency and reliability,” NeuroImage, vol. 31, no. 3, pp. 1116–1128, 2006.
[27] Y. Jia, E. Shelhamer, J. Donahue et al., “Caffe: convolutional architecture for fast feature embedding,” 2014, http://arxiv.org/abs/1408.5093.
[28] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of 19th International Conference on Computational Statistics (COMPSTAT’2010), pp. 177–186, Springer, Paris, France, August 2010.
[29] W. Liu, A. Rabinovich, and A. C. Berg, “ParseNet: looking wider to see better,” 2015, http://arxiv.org/abs/1506.04579v2.
[30] P. Kontschieder, S. R. Bulo, H. Bischof, and M. Pelillo, “Structured class-labels in random forests for semantic image labelling,” in Proceedings of 2011 International Conference on Computer Vision (ICCV), pp. 2190–2197, Barcelona, Spain, November 2011.
[31] L. R. Dice, “Measures of the amount of ecologic association between species,” Ecology, vol. 26, no. 3, pp. 297–302, 1945.
[32] A. A. Taha and A. Hanbury, “Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool,” BMC Medical Imaging, vol. 15, p. 29, 2015.
[33] D. C. Ciresan, L. M. Gambardella, A. Giusti, and J. Schmidhuber, “Deep neural networks segment neuronal membranes in electron microscopy images,” in Proceedings of 25th International Conference on Neural Information Processing Systems (NIPS 2012), pp. 2852–2860, Lake Tahoe, NV, USA, December 2012.
[34] A. Kronman and L. Joskowicz, “A geometric method for the detection and correction of segmentation leaks of anatomical structures in volumetric medical images,” International Journal of Computer Assisted Radiology and Surgery, vol. 11, no. 3, pp. 369–380, 2015.
[35] Y.-T. Chen, “A novel approach to segmentation and measurement of medical image using level set methods,” Magnetic Resonance Imaging, vol. 39, pp. 175–193, 2017.
[36] D. Cremers, M. Rousson, and R. Deriche, “A review of statistical approaches to level set segmentation: integrating color, texture, motion and shape,” International Journal of Computer Vision, vol. 72, no. 2, pp. 195–215, 2007.
Copyright
Copyright © 2019 Mumtaz Hussain Soomro et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.