In this paper, we propose an improved 2D U-Net model integrated with squeeze-and-excitation (SE) layers for prostate cancer segmentation. The proposed model combines a more complex 2D U-Net with the squeeze-and-excitation technique and consists of an encoder stage and a decoder stage. The encoder extracts features from the input using CONV blocks, SE layers, and max-pooling layers, which together improve the feature extraction capability of the model. The decoder maps the extracted features back to the original image using CONV blocks, SE layers, and upsampling layers. The SE layers are used to learn more global and local features. Experiments on the public PROMISE12 dataset demonstrate that the proposed model achieves state-of-the-art segmentation performance compared with other traditional methods.

1. Introduction

Prostate cancer has become a high-incidence cancer among men. Early detection and diagnosis can substantially improve the cure rate among patients. Currently, radiation therapy, which uses ionizing radiation to kill cancer cells, is a very common treatment for prostate cancer [1]. However, its main disadvantage is that the radiation may damage cells in the surrounding tissue while it kills the cancer. To raise the accuracy of radiation therapy and reduce side effects in surrounding tissue such as the bladder and rectum, more delicate prostate cancer diagnosis and more accurate localization methods are required.

At present, there are two main approaches to prostate cancer segmentation on magnetic resonance imaging (MRI): manual and automatic [2]. The former, however, is gradually being displaced by the latter. Manual segmentation by radiologists is time-consuming, and radiologists' diagnoses differ subjectively. For example, the same radiologist may segment an image differently on different occasions, and different radiologists may obtain different results on the same image.

Automatic segmentation methods can help radiologists obtain prostate cancer segmentation results faster and with higher accuracy. Two main families of methods are usually used: atlas-based methods and deformable model-based methods [3]. In atlas-based methods, training images and their corresponding manual labels are combined; then, through nonrigid registration (NRR), a reference image called the atlas and a corresponding labeled atlas are formed [3]. The atlas is a trained image that represents the prostate and its surrounding tissue, while the labeled atlas gives the probability of each voxel being part of the prostate [2, 3]. In model-based methods, the model can use the atlas-based segmentation for its initialization and then use the grey-level information of the image to deform itself to match the boundaries of the prostate [4]. A distance metric, usually the Mahalanobis distance, is then used to match the contour of the feature model with the contour extracted from the case images [3]. Both methods can be time-consuming, since they require a good initialization to perform well on prostate cancer segmentation [2].

Recently, deep learning-based methods have achieved remarkable performance in medical image segmentation. Several studies based on deep learning have obtained accurate segmentation results, showing that a well-trained deep learning model can improve both the accuracy and the speed of medical image segmentation [5–7]. Karimi et al. put forward a two-step segmentation method that contains two convolutional neural networks (CNNs), where the first CNN determines a prostate bounding box and the second CNN provides an accurate delineation of the prostate boundary [5]. Guo et al. designed a deformable MR prostate segmentation method by integrating deep feature learning with sparse patch matching [6]. Cheng et al. presented a supervised learning framework that merges the atlas-based active appearance model (AAM) and support vector machines (SVM) to achieve highly accurate segmentation of the prostate boundary [7]. However, all the methods mentioned above share a common disadvantage: it is difficult for them to achieve pixelwise segmentation with high accuracy.

Fully convolutional networks (FCN), proposed by Long et al., replace the last fully connected layer of a regular CNN with a convolution layer and can therefore obtain the classification of every pixel, which solves the problem of pixelwise segmentation [8]. Ronneberger et al. further optimised FCN and presented a symmetric structure called U-Net, which is a regular CNN followed by an upsampling path, where deconvolutions are used to increase the size of the feature maps [9]. At present, FCN and U-Net are the most popular backbone networks in the medical image segmentation field, and many new structures have been derived from them. For example, Zhou et al. modified the skip connections between the encoder and decoder layers of U-Net and designed a new model called U-Net++ [10], and Milletari et al. put forward a variant named V-Net, which can perform 3D segmentation [11]. However, these methods share a common disadvantage: similar low-level features are extracted by the model repeatedly, which wastes computational resources.

To solve the problems above, in this paper we propose a more effective model that uses U-Net as its backbone and adds a squeeze-and-excitation layer to every convolution operation to select and emphasize the features that contribute to prostate cancer segmentation.

2. Related Work

Many studies [5, 6, 10–12] have, like ours, used deep learning to segment prostate cancer on MRI, because it performs remarkably better in this field than traditional methods. Optimising U-Net has attracted much attention in recent years, and many related studies have reported good results. For example, U-Net++, proposed by Zhou et al., modifies the skip connections between the encoder and the decoder [10], and V-Net, a 3D network put forward by Milletari et al., builds on the 2D U-Net [11].

Our use of the SE layer was inspired by the channel attention used in the biattention adversarial network designed by Zhang et al. [12], which has been shown to have a positive effect on model performance.

3. Background

3.1. Structure

Our proposed model builds on U-Net and the fully convolutional network (FCN), both of which divide the model into an encoder stage and a decoder stage (an autoencoder-style layout). The overall structure of our model can be seen in Figure 1. The encoder (also called the contraction path) captures the context in the image, and the decoder (also called the symmetric expanding path) enables precise localization. U-Net and FCN are very similar, and both were published in 2015, with U-Net appearing slightly later. There are nevertheless two main differences between them. First, U-Net is completely symmetric: its encoder and decoder stages are similar, whereas the FCN decoder is simpler, using only one deconvolution operation and none of the further convolution structures found in U-Net. Second, the skip connections differ: FCN uses a summation operation, while U-Net uses a concatenation operation.

3.2. The Activation Layer

In U-Net, an activation layer is always used after a convolution layer to decide whether a particular neuron should be activated. The two most common activation functions in U-Net are the rectified linear unit (ReLU) and the leaky rectified linear unit (Leaky ReLU), which we introduce in this section.

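The ReLU formula is as follows:

$$f(x) = \max(0, x).$$

For the Leaky ReLU,

$$f(x) = \begin{cases} x, & x > 0, \\ \alpha x, & x \le 0, \end{cases}$$

where $\alpha$ is a small positive slope.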

Compared with traditional activation functions, such as the logistic sigmoid, tanh, and other hyperbolic functions, the rectified linear function has the following advantages:

(1) Imitation of biological principles: brain studies have shown that the message encoding of biological neurons is relatively scattered and sparse [13]. Only about 1–4% of the neurons in the brain are active at the same time. With linear rectification and regularization, the level of neuron activity in an artificial neural network can be controlled. The logistic function, by contrast, reaches 1/2 at input 0, which is already half saturated and does not match the behavior that scientists expect from a network meant to simulate real biology [14].

(2) More efficient gradient descent and backpropagation.

(3) Simpler calculation: ReLU avoids more complicated functions, such as exponential functions, and reduces the total computing cost of the model.
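
The two activation functions discussed in this section can be sketched in a few lines of plain Python. This is only an illustration: the function names are our own, and the default slope `alpha = 0.01` is an assumed value, not one reported in this paper.

```python
def relu(x: float) -> float:
    """Rectified linear unit: passes positive inputs, zeroes the rest."""
    return max(0.0, x)


def leaky_relu(x: float, alpha: float = 0.01) -> float:
    """Leaky ReLU: like ReLU, but negative inputs keep a small slope.

    alpha = 0.01 is an assumed default, not a value taken from the paper.
    """
    return x if x > 0 else alpha * x


if __name__ == "__main__":
    for value in (-2.0, 0.0, 3.0):
        print(value, relu(value), leaky_relu(value))
```

Neither function needs an exponential, which is the source of the lower computing cost mentioned in advantage (3).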

3.3. Dropout Layer

Dropout is a popular way to prevent overfitting in neural network training. During training, dropout temporarily discards units from the network with a certain probability, so that each batch effectively trains a different network. Averaging over these thinned networks improves the generalization ability of the model. Besides reducing overfitting, dropout also alleviates the problem of long training times for large-scale neural networks.
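
The mechanism can be illustrated with a short NumPy sketch. It uses the "inverted" form of dropout, in which the surviving units are rescaled during training, so that nothing has to change at inference time. The rate of 0.5, the array shape, and the function name are assumptions made for illustration; the dropout rate of our model is not specified here.

```python
import numpy as np


def dropout(x, rate, training, rng):
    """Inverted dropout: zero units with probability `rate` during training.

    Surviving units are scaled by 1 / (1 - rate) so the expected activation
    is unchanged, which lets inference use the layer as an identity.
    """
    if not training or rate == 0.0:
        return x
    keep_mask = rng.random(x.shape) >= rate  # True where the unit is kept
    return x * keep_mask / (1.0 - rate)


rng = np.random.default_rng(0)       # fixed seed, chosen only for repeatability
features = np.ones((4, 8))           # hypothetical activations
train_out = dropout(features, rate=0.5, training=True, rng=rng)
test_out = dropout(features, rate=0.5, training=False, rng=rng)
```

Because a fresh mask is drawn on every call, each training batch sees a different thinned network, which is the effect described above.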

3.4. Skip Connections

A skip connection bypasses some layers of the network and feeds the output of an earlier layer directly to later layers. In U-Net, skip connections are used to fight the vanishing gradient problem and to learn pyramid-level features [9]. The main idea of skip connections in U-Net is to keep the features already learned by the encoder and reuse them in later layers to improve performance. The features are transferred from each encoder layer to the corresponding decoder layer, where they are combined by concatenation instead of summation.
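
The difference between the two ways of merging skip features can be seen from the array shapes alone. The sketch below uses NumPy with hypothetical feature maps in (height, width, channels) layout; the sizes are arbitrary and are not taken from our model.

```python
import numpy as np

# Hypothetical feature maps in (height, width, channels) layout.
encoder_features = np.ones((80, 80, 64))       # saved from the encoder
decoder_features = np.full((80, 80, 64), 2.0)  # upsampled in the decoder

# U-Net style: stack along the channel axis, so both feature sets survive intact.
concatenated = np.concatenate([encoder_features, decoder_features], axis=-1)

# FCN style: element-wise sum, so the channel count is unchanged.
summed = encoder_features + decoder_features

print(concatenated.shape)  # (80, 80, 128)
print(summed.shape)        # (80, 80, 64)
```

Concatenation doubles the channel count and leaves it to the following convolution to decide how to mix the two sources, whereas summation mixes them immediately.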

4. Proposed Methods

In this paper, we propose an improved 2D U-Net model integrated with squeeze-and-excitation layers to segment prostate cancer automatically. This section introduces the proposed model and its main blocks.

4.1. Model Structure

We made several improvements to the traditional U-Net. Inspired by [8, 9], we added squeeze-and-excitation (SE) layers, which are introduced later, to U-Net. Our model is divided into an encoder stage and a decoder stage. In the encoder stage, the model extracts features from the input image through successive convolution and pooling layers. In the decoder stage, the model maps the extracted features step by step back to the original image through successive upsampling layers and finally outputs the predicted mask. Figure 2 shows our proposed model, which is more complex than the traditional U-Net. In particular, we added an SE layer before each pooling layer in the encoder and after each upsampling layer in the decoder.


4.2. CONV Block

We use a skip connection to concatenate two consecutive convolution and activation layers and group them into a block, which we name the CONV BLOCK. Figure 3 shows its inner structure.

4.3. SE Layer

Inspired by [10] and following the practice of SE-Net [15], we calculate an importance weight for each channel and use it to mark the more useful features. We implemented this as a layer that extracts important features from channels and named it the SE layer; Figure 4 shows its detailed structure.

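First, let the input feature be $F \in \mathbb{R}^{H \times W \times C}$, where $H$, $W$, and $C$ represent the height, width, and number of channels of $F$, respectively. $F$ can be written as

$$F = [f_1, f_2, \ldots, f_C],$$

where $f_i \in \mathbb{R}^{H \times W}$ is the feature of the $i$-th channel. For feature $F$, we use a global average pooling (GAP) layer to generate a vector $z \in \mathbb{R}^{C}$, whose $i$-th element is

$$z_i = \frac{1}{H \times W} \sum_{h=1}^{H} \sum_{w=1}^{W} f_i(h, w),$$

where $z_i$ is the global average of the $i$-th channel. After that, we use two fully connected layers, the first followed by a ReLU activation and the second by a sigmoid activation, to achieve information aggregation as

$$s = \sigma\left(W_2 \, \delta(W_1 z)\right),$$

where $\delta$ refers to the ReLU function, $\sigma$ refers to the sigmoid function, $W_1 \in \mathbb{R}^{\frac{C}{r} \times C}$ and $W_2 \in \mathbb{R}^{C \times \frac{C}{r}}$ refer to the two fully connected layers, and $r$ is a ratio parameter that reduces the dimensional complexity and is set to 4. The vector $s \in \mathbb{R}^{C}$ holds the learned importance of each feature channel.

We can extract important features by multiplying $F$ with $s$ channel by channel, which can be described as

$$\tilde{f}_i = s_i \cdot f_i, \quad i = 1, \ldots, C.$$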

By strengthening the more important features, the SE layer enhances the model's ability to learn global information, which was shown to be effective in [15]. We use it in both the encoder stage and the decoder stage; its exact placement is described in Section 4.1.
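
The three steps of the SE layer (global average pooling, two fully connected layers, and channel-wise rescaling) can be traced in a small NumPy sketch. This is not our TensorFlow-Keras implementation: the weight matrices are random stand-ins for learned parameters, biases are omitted, and the feature map size is arbitrary. Only the reduction ratio of 4 comes from the text.

```python
import numpy as np


def se_layer(features, w1, w2):
    """Squeeze-and-excitation over a (H, W, C) feature map.

    w1 has shape (C // r, C) and w2 has shape (C, C // r); biases are
    omitted to keep the sketch short.
    """
    squeezed = features.mean(axis=(0, 1))           # global average pooling -> (C,)
    hidden = np.maximum(0.0, w1 @ squeezed)         # first FC layer + ReLU -> (C // r,)
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))  # second FC layer + sigmoid -> (C,)
    return features * weights, weights              # rescale every channel


rng = np.random.default_rng(0)
H, W, C, r = 8, 8, 16, 4                            # r = 4 as in the text; the rest is arbitrary
features = rng.standard_normal((H, W, C))
w1 = rng.standard_normal((C // r, C))               # random stand-ins for learned weights
w2 = rng.standard_normal((C, C // r))
recalibrated, channel_weights = se_layer(features, w1, w2)
```

The output has the same shape as the input, so a layer of this kind can be placed between existing layers without changing the rest of the network.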

4.4. Evaluation Function

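We choose the Dice similarity coefficient (DSC) as our evaluation function, following [16]. Let $P$ denote the predicted mask and $GT$ the ground truth:

$$\mathrm{DSC} = \frac{2\,|P \cap GT|}{|P| + |GT|}.$$

In addition, we also report accuracy (AC), Jaccard index (JA), and sensitivity (SE). TP, FP, TN, and FN represent the numbers of true positives, false positives, true negatives, and false negatives, respectively. The metrics can be described as

$$\mathrm{AC} = \frac{TP + TN}{TP + FP + TN + FN}, \quad \mathrm{JA} = \frac{TP}{TP + FP + FN}, \quad \mathrm{SE} = \frac{TP}{TP + FN}.$$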
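
As a worked illustration, the four metrics can be computed from a pair of binary masks with NumPy. The function name and the toy 3 × 3 masks are invented for this sketch. The DSC is written in its pixel-count form, 2TP / (2TP + FP + FN), which equals the set form because |P| = TP + FP and |GT| = TP + FN.

```python
import numpy as np


def segmentation_metrics(pred, truth):
    """Return DSC, accuracy, Jaccard index and sensitivity for binary masks.

    Assumes the masks contain at least one foreground pixel, so that no
    denominator is zero.
    """
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))    # prostate pixels correctly found
    fp = int(np.sum(pred & ~truth))   # background marked as prostate
    fn = int(np.sum(~pred & truth))   # prostate pixels missed
    tn = int(np.sum(~pred & ~truth))  # background correctly rejected
    return {
        "DSC": 2 * tp / (2 * tp + fp + fn),
        "AC": (tp + tn) / (tp + fp + tn + fn),
        "JA": tp / (tp + fp + fn),
        "SE": tp / (tp + fn),
    }


# Toy 3 x 3 masks, invented for illustration only.
pred = [[1, 1, 0], [0, 1, 0], [0, 0, 0]]
truth = [[1, 1, 0], [0, 0, 0], [0, 1, 0]]
scores = segmentation_metrics(pred, truth)
```

For these masks TP = 2, FP = 1, FN = 1, and TN = 5, which gives DSC = 2/3, AC = 7/9, JA = 1/2, and SE = 2/3.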

5. Results

5.1. Dataset

The performance of the model is evaluated on the public PROMISE12 dataset, which includes 50 training sets with 30 consecutive T2-weighted MR images in each set. After loading the original images, we resize them to 320 × 320 as the input of the model.

5.2. Training

The model is implemented with the TensorFlow-Keras library. Both training and testing run on a 6 GB NVIDIA GTX 1660 Ti GPU with an Intel(R) Core(TM) i7-9750H CPU @ 2.60 GHz and 16 GB of RAM. The initial learning rate is , and the number of epochs is 150. Before training, we augment the training sets with random flips, rotations, and cropping to get better training results.

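We use the Adam optimizer [17] with the learning rate mentioned above and a binary cross-entropy loss function [18], given by

$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right],$$

where $p_i$ is the prediction of the network on sample $i$, in a range between 0 and 1, $y_i$ is the ground truth of sample $i$, which is either 0 or 1, and $N$ is the number of samples.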
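
The behavior of the binary cross-entropy loss can be checked with a short plain-Python sketch. The function name, the clipping constant `eps`, and the toy labels and predictions are assumptions made for illustration and are not part of our training code.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over paired labels and predictions.

    Predictions are clipped to [eps, 1 - eps] to avoid log(0); the value of
    eps is an assumption made for this sketch.
    """
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


labels = [1, 0, 1, 0]                 # toy ground-truth pixels
confident = [0.9, 0.1, 0.8, 0.2]      # mostly correct predictions
wrong = [0.1, 0.9, 0.2, 0.8]          # mostly incorrect predictions
print(binary_cross_entropy(labels, confident) < binary_cross_entropy(labels, wrong))
```

Predictions close to the labels give a loss near zero, and confident mistakes are penalized heavily because of the logarithm.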

5.3. Results and Discussion

After training for 150 epochs, using five-fold cross validation to select the training and test sets, we obtain the model loss and accuracy curves.
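
A five-fold split of this kind can be sketched with the Python standard library. The sketch assumes that the 50 training sets are shuffled once and divided case by case into five folds of ten; the exact assignment of cases to folds in our experiments is not described here, and the function name and seed are illustrative.

```python
import random


def five_fold_splits(n_cases=50, n_folds=5, seed=0):
    """Yield (train_ids, test_ids) pairs; each case is tested exactly once."""
    case_ids = list(range(n_cases))
    random.Random(seed).shuffle(case_ids)          # seed is arbitrary
    folds = [case_ids[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test_ids = folds[k]
        train_ids = [c for j, fold in enumerate(folds) if j != k for c in fold]
        yield train_ids, test_ids


splits = list(five_fold_splits())
for fold_number, (train_ids, test_ids) in enumerate(splits, start=1):
    print(f"fold {fold_number}: {len(train_ids)} train, {len(test_ids)} test")
```

Splitting by case rather than by slice keeps all images of one patient on the same side of the split, which avoids leaking near-identical slices into the test set.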

As can be seen in Figures 5 and 6, both the loss and accuracy curves behave well, which gives preliminary evidence that the training was effective. Both curves remain stable over dozens of epochs, which suggests that the model is not overfitted, and the gradual decline of the loss curve demonstrates good convergence of the model.

To show the effectiveness of our model, we implemented three traditional prostate segmentation methods [8, 9, 19]: the fully convolutional network (FCN) [8], the traditional U-Net [9], and a multiatlas method [19]. We compare the results of our model with those of these three methods.

Evaluated on the whole dataset with five-fold cross validation, our model performed well compared with the other three models, reaching a mean DSC of 0.87 and a median DSC of 0.89. Its remaining three metrics were also higher than those of the other models.

The detailed five-fold cross-validation results can be seen in Figure 7.

As can be seen in Figure 7, our model performed well in five-fold cross validation. Most of its DSC scores lie in the range of 0.70 to 0.95. On the first fold, the median DSC is above 0.90, and the mean DSC is slightly lower, in the range of 0.85 to 0.90. The second, fourth, and fifth folds are similar to the first, with a median DSC of around 0.9. The mean DSC over all five folds is 0.87, as shown in Table 1.

6. Conclusion

In this paper, we developed an improved 2D U-Net model integrated with squeeze-and-excitation layers for prostate cancer segmentation. The model has two important components: the SE layer and the CONV BLOCK. With the SE layer, our model can learn more global and local features. In the CONV BLOCK, we combined feature maps through a skip connection with a concatenation operation to further improve model performance. In future work, we will try different MRI modalities with our model to segment prostate cancer automatically.

Data Availability

The prostate MRI image dataset can be downloaded from the website (https://promise12.grand-challenge.org/Download/).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

The authors contributed equally to this paper.


Acknowledgments

This study was funded by the Science and Technology Planning Project of Xiamen City (no. 3502Z20184036).