Late-gadolinium enhancement (LGE) cardiac magnetic resonance (CMR) imaging visualizes myocardial infarction, while the balanced steady-state free precession (bSSFP) cine sequence captures cardiac motion and presents clear boundaries. Multimodal CMR segmentation therefore plays an important role in the assessment of myocardial viability and in clinical diagnosis. However, automatic and accurate CMR segmentation remains challenging because very little labeled LGE data is available and LGE images have relatively low contrast. The main purpose of our work is to learn from real/fake bSSFP images with ground truths in order to indirectly segment the LGE modality of cardiac MR, using a proposed cross-modality multicascade framework that consists of a cross-modality translation network and an automatic segmentation network. In the segmentation stage, a novel multicascade pix2pix network is designed to segment the fake bSSFP sequence obtained from the cross-modality translation network. Moreover, we propose a perceptual loss that measures the difference between features of the ground truth and of the prediction, which are extracted from a pretrained VGG network in the segmentation stage. We evaluate the proposed method on the multimodal CMR dataset and verify its superiority over other state-of-the-art approaches under different network structures and different types of adversarial losses in terms of dice accuracy in testing. The proposed network is therefore promising for indirect cardiac LGE segmentation in clinical applications.

1. Introduction

Multimodal CMR imaging is an essential tool in clinics for the screening and diagnosis of cardiac diseases. Different imaging modalities contain different sorts of useful information for cardiac disease screening tasks, and combining them can overcome the limitations of an individual modality. LGE MR images are acquired 10-20 minutes after injection of the contrast agent; in these images, myocardial necrosis or scarring is enhanced and appears locally brighter than healthy tissue. LGE imaging is therefore standard practice for evaluating cardiac structure, cardiac function, myocardial perfusion, and myocardial activity. In contrast, bSSFP highlights fluid as a high-signal area but shows a uniform signal for other tissues; e.g., the large blood vessels and coronary arteries can be observed clearly in bSSFP because of the more obvious contrast between the heart muscle and the blood pool. T2-weighted MRI is effective in reducing false-positive results. Considering different MRI modalities is thus important for the acquisition of accurate cardiac information [1].

Segmentation of multimodal CMR images is a critical step for subsequent diagnosis and surgical planning. However, it takes an experienced doctor 20 minutes per case to manually segment the LGE images. Manually identifying and delineating the corresponding cardiac structures is extremely time-consuming, and the result depends on the professional ability of the doctor and varies from person to person. Therefore, the development of automatic and reliable LGE image segmentation algorithms is of high clinical value for patients suffering from myocardial infarction.

Tao and Der Geest proposed a method for segmenting the LGE images using myocardial morphological information [2]. Popescu et al. used a mask SLIC clustering method and Otsu threshold to segment LGE images [3]. In recent years, deep learning has achieved remarkable success in computer vision, and more and more image processing methods are based on the CNN model [4, 5]. Chen et al. [6] proposed to use domain adaptation to fuse the features of unlabeled LGE images and then use the fused features to train the segmentation network. In addition, many approaches based on attention mechanisms [7, 8] and multiview methods [9] have been developed recently for segmenting medical images. Yang et al. combined multiview and attention mechanisms to segment cardiac LGE images [10]. CNN-based automatic cardiac LGE segmentation algorithms are far more efficient and robust, and commonly more accurate, than traditional methods [11, 12], which motivates the automatic segmentation of LGE images.

However, automatic LGE CMR segmentation is still arduous. Besides the great variations in the location and geometry of the heart region across different patients, Zhuang [1] pointed out three major challenges related to the intensity distributions of the LGE CMR modality: (i) the intensity range of the myocardium in LGE imaging leads to boundaries that are indistinguishable from its adjacent organs; (ii) the pathologies result in heterogeneous intensity of the myocardium, making the assumption of a simple distribution such as the single-component Gaussian density invalid; and (iii) the preprocessing enhancement for the LGE CMR modality can be complex. It is therefore more difficult to segment the LGE modality directly, especially when only a small amount of labeled LGE data is available.

GAN was first proposed by Goodfellow et al. [13] for image synthesis. It pits a generator network against a discriminator network (hence “adversarial”) in order to generate synthetic instances that can pass for real data. The generator produces a fake image from random noise, and the discriminator judges whether its input is real (the data comes from the real samples) or fake (the data comes from the output of the generator). The aim of GANs is to learn the underlying distribution of the training data so that the generator produces data that the discriminator cannot distinguish from real data. When the game between the generator and the discriminator reaches the Nash equilibrium, the generated data distribution is equal to the real data distribution. With the development of GANs [14], such models are widely used in image processing, including image and video generation [15], image segmentation [16], image synthesis [17], and image super resolution [18].
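For reference, the two-player game described above is usually written as the following minimax objective from the original GAN formulation [13]. The symbols $G$ (generator), $D$ (discriminator), $p_{\mathrm{data}}$ (real data distribution), and $p_z$ (noise distribution) are notation assumed here for illustration.

```latex
\min_{G}\,\max_{D}\; V(D, G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

At the equilibrium of this game, the distribution of $G(z)$ matches $p_{\mathrm{data}}$ and the optimal discriminator outputs $1/2$ for every input.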

In this work, we propose a novel cross-modality multicascade framework for indirect LGE segmentation (CMMCSegNet), which is trained on multimodal cardiac MR data with a very small amount of LGE labels (for the LGE modality in the Multisequence Cardiac MR Segmentation Challenge 2019 datasets [1], only five patients are labeled). The main contributions of this work are summarized as follows:

(1) We develop a novel indirect LGE segmentation framework based on multimodal images. One of its primary components translates the LGE modality, which needs to be segmented but has only a very small amount of labeled data, into the bSSFP modality, which is easy to segment with our proposed method.

(2) We propose a multicascade pix2pix network for image segmentation; that is, the generator is formed by cascading multiple subnetworks. In the segmentation network, we regard segmentation as the translation process from the original image to the segmentation target.

(3) We employ a perceptual loss that uses a pretrained VGG19 network to compare the feature differences between the labels and the generation during the training of the proposed multicascade pix2pix network.

The rest of this work is organized as follows. We first give some preliminaries in Section 2. We describe our CMMCSegNet in detail in Section 3. We give experimental results in Section 4. Finally, we conclude this work in Section 5.

2. Related Work

Tissue or organ segmentation plays an important role in the field of medical image processing. Medical image segmentation has been explored extensively; however, challenges in generality, robustness, and efficiency still remain. For brevity, we only focus below on the most closely related works.

2.1. Cascade Structure

A cascade network connects multiple subnetworks together to form a multilevel network. The cascading method has been used effectively in many vision applications such as classification [19], image translation [20], detection [21], super resolution [22], and semantic segmentation [23]. For example, Cui et al. proposed a deep cascade network for image super resolution [22]. Cai and Vasconcelos proposed the use of a cascade structure for object detection [21]. Zhao et al. proposed recursive cascaded networks for medical image registration [24]. Armanious et al. proposed the use of a cascaded generator network for image translation [20]. Havaei et al. proposed a new cascade architecture for brain tumor segmentation [23]. Li et al. [25] proposed to classify easy regions in a shallow network and train deeper networks to deal with hard regions. Lin et al. [26] proposed a top-down architecture with lateral connections to propagate deep semantic features to shallow layers.

Different from previous cascade networks, the multicascade pix2pix network proposed in this paper is a cascade of multiple U-nets for image segmentation, which allows each generator to be supervised individually in pix2pix GANs. To the best of our knowledge, this is an early and original attempt to adopt a cascade architecture in pix2pix GAN-based medical image segmentation. We will introduce more details in Section 3.

2.2. Multimodal Cardiac MR Image Segmentation

Recent literature suggests two main approaches to multimodal CMR image segmentation. One popular approach is the GAN strategy based on cross-modality image translation, that is, the translation of images of one modality into images of another modality, which plays an increasingly important role in computer vision. Isola et al. [18] proposed the use of conditional GAN to implement paired image-to-image translation. Ben-Cohen et al. used CT images to synthesize PET images based on the pix2pix network [27]. Cycle-GAN [28] was proposed for unpaired image-to-image translation. BiCycle-GAN [29] addressed the translation from a single image to multiple categories of images. In addition, GAN networks such as DualGAN [30] and UNIT [31] were also proposed for unpaired image-to-image translation.

In CMR datasets [1], MR images of the different modalities are not strictly matched, so the classical unpaired image-to-image translation [32] can be applied to cross-modality CMR segmentation. Chen et al. [33] proposed to use UNIT to translate bSSFP images into LGE images and then train the segmentation network where the LGE images are provided by the translation architecture. Campello et al. also proposed to use Cycle-GAN to translate bSSFP images into LGE images but train the U-net network [34] for LGE segmentation. Tao et al. [35] proposed to integrate the translation network (Cycle-GAN) with the segmentation network to achieve LGE image segmentation.

Another promising approach is based on image registration. Roth et al. proposed to register LGE images with ground truths to LGE images without ground truths; after multiatlas label fusion by majority voting, they obtained noisy LGE labels and then trained an LGE segmentation network [36]. Liu et al. proposed a registration method with histogram matching to achieve augmentation of the LGE images [37].

3. Proposed Cross-Modality SegNet

The goal of this work is to achieve cardiac segmentation for the LGE modality, where only a small number of samples are labeled. Our CMMCSegNet (https://github.com/wangyu719/CmmcSegNet) framework is designed to facilitate indirect segmentation for multimodal CMR images. The overall framework is shown in Figure 1, including a training architecture and a testing architecture.

Our datasets are from the Multisequence Cardiac MR Segmentation Challenge 2019 datasets (MS-CMRSeg 2019) [1]. In this work, we use the LGE modality with 45 patients and the bSSFP modality with 35 annotated patients (see Figure 2 for more details). Only five ground truth annotations are available in the LGE modality of the MS-CMRSeg 2019 datasets; hence, it is difficult to directly segment the LGE modality using deep CNN-based methods. Figure 2 shows the differences between LGE and bSSFP images from the same patient. The bSSFP modality has a more obvious contrast than the LGE modality, so we believe that bSSFP is easier to segment. Besides, the bSSFP modality has a large number of images (35 patients) with ground truth annotations, so it is not difficult to train a deep learning-based segmentation method on the bSSFP modality.

3.1. Cross-Modality Image Translation

One of the primary components in training architecture is a cross-modality translation network, which can be trained end-to-end with unpaired modalities. Before segmenting the bSSFP images to achieve indirect segmentation of LGE images, we first present a Cycle-GAN architecture of translating LGE into bSSFP images.

Inspired by the knowledge distillation between unpaired image-to-image translation networks [32], we employ Cycle-GAN to achieve cross-modality image translation for CMR datasets. Consider two image domains that represent the LGE and bSSFP modalities, respectively. The cross-modality translation network has two generators, one mapping LGE to bSSFP and one mapping bSSFP to LGE, which are trained to be inverse mappings of each other for any unpaired images from the two domains. It also has two discriminators, one for each domain, each of which distinguishes whether its input is real or fake.

The Cycle-GAN architecture implementing cross-modality image translation for unpaired LGE/bSSFP datasets consists of two cycles: the LGE cycle and the bSSFP cycle. In the LGE cycle, the first generator is trained to transform the LGE modality into a fake bSSFP modality, the second generator is trained to transform the generated fake bSSFP modality back to the original LGE modality, and the bSSFP discriminator discriminates between real and synthesized bSSFP images. In fact, enlightened by activation-based attention transfer strategies, the discriminator is designed to extract the supervision information that modulates the learning of the corresponding generator. In the bSSFP cycle, the second generator transforms real bSSFP into fake LGE, the first generator transforms the generated LGE back to the original bSSFP, and the LGE discriminator discriminates between real and fake LGE images. The network framework is shown in Figure 1(b).

The overall training loss of our translation network combines two adversarial losses, one for each translation direction, with a generation similarity loss. A weight parameter balances the contribution of the generation loss against the two adversarial losses.
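As a point of reference, the standard Cycle-GAN objective [28] has exactly this structure and can be sketched as follows. The notation is assumed here for illustration and may differ from the original formulation: $A$ and $B$ denote the LGE and bSSFP domains, $G_{AB}$ and $G_{BA}$ the two generators, $D_A$ and $D_B$ the two discriminators, and $\lambda$ the balancing weight; the generation similarity term is taken to be the usual cycle-consistency loss.

```latex
\begin{aligned}
\mathcal{L}(G_{AB}, G_{BA}, D_A, D_B)
  &= \mathcal{L}_{\mathrm{adv}}(G_{AB}, D_B)
   + \mathcal{L}_{\mathrm{adv}}(G_{BA}, D_A)
   + \lambda\, \mathcal{L}_{\mathrm{cyc}}(G_{AB}, G_{BA}), \\
\mathcal{L}_{\mathrm{adv}}(G_{AB}, D_B)
  &= \mathbb{E}_{b \sim B}\bigl[\log D_B(b)\bigr]
   + \mathbb{E}_{a \sim A}\bigl[\log\bigl(1 - D_B(G_{AB}(a))\bigr)\bigr], \\
\mathcal{L}_{\mathrm{adv}}(G_{BA}, D_A)
  &= \mathbb{E}_{a \sim A}\bigl[\log D_A(a)\bigr]
   + \mathbb{E}_{b \sim B}\bigl[\log\bigl(1 - D_A(G_{BA}(b))\bigr)\bigr], \\
\mathcal{L}_{\mathrm{cyc}}(G_{AB}, G_{BA})
  &= \mathbb{E}_{a \sim A}\bigl[\lVert G_{BA}(G_{AB}(a)) - a \rVert_1\bigr]
   + \mathbb{E}_{b \sim B}\bigl[\lVert G_{AB}(G_{BA}(b)) - b \rVert_1\bigr].
\end{aligned}
```

The generators minimize this objective while the discriminators maximize the two adversarial terms.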

3.2. Multicascade pix2pix Segmentation

Recently, a GAN-based framework was proposed to segment retinal vessels [38]. We understand image segmentation as a paired image-to-image translation (from an original image to a predicted segmentation result); hence, we propose a new image segmentation method using a multicascade technique and the pix2pix structure, which we call a multicascade pix2pix network.

3.2.1. Multicascade Network

Our multicascade pix2pix segmentation network, shown in Figure 1(c), is based on the GAN architecture and consists of multiple cascaded generators and a shared discriminator.

The first generator translates the original input, which is a real or fake bSSFP image, into a first generation that is a prediction of the corresponding label. Each of the other generators further improves the previous predicted probability map to obtain a better prediction of the same size. In this work, each generator is formed by a U-net [5] or ResNet [39] network for the purpose of more accurate segmentation. In the experimental evaluation, we compare the effects of different generator networks on the segmentation results. The purpose of this network is to obtain the final segmented result of the original input, which is also the result of the LGE segmentation. Therefore, the prediction generated by the multicascade pix2pix segmentation network is obtained by applying the cascaded generators in sequence to the original input.
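The cascade described above can be sketched in a few lines of Python. This is a minimal illustration only: `run_cascade`, `g1`, and `g2` are hypothetical names, the images are plain lists of floats, and the two toy functions merely stand in for the U-net or ResNet generators.

```python
from typing import Callable, List, Sequence

Image = List[float]                    # a flattened image or probability map (illustrative)
Generator = Callable[[Image], Image]   # any function mapping an image to a same-size image


def run_cascade(generators: Sequence[Generator], x: Image) -> List[Image]:
    """Apply the generators in order and return every intermediate prediction."""
    outputs: List[Image] = []
    current = x
    for g in generators:
        current = g(current)           # each stage refines the previous output
        outputs.append(current)
    return outputs


# Toy stand-ins for trained generators (assumptions, not the real networks).
def g1(x: Image) -> Image:
    # First stage: map raw intensities to a rough probability map in [0, 1].
    return [min(1.0, max(0.0, v)) for v in x]


def g2(y: Image) -> Image:
    # Second stage: sharpen the previous prediction into a binary mask.
    return [1.0 if v >= 0.5 else 0.0 for v in y]


preds = run_cascade([g1, g2], [-0.2, 0.4, 0.7, 1.3])
final = preds[-1]                      # output of the last generator
```

Keeping every intermediate prediction, rather than only the last one, is what makes it possible to supervise each generator individually during training.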

The discriminator is a binary classifier based on pixels or image patches, which provides a learning-based stopping criterion during generation. For the discriminator in our multicascade pix2pix segmentation network, we employ a convolutional Patch-GAN [18] to distinguish between the prediction (fake) and the ground truth (real): the prediction is divided into overlapping patches, each patch is discriminated against the corresponding patch of the ground truth, and a 2D probability map is finally obtained as the discriminator output.

To train an optimal segmentation network, the measures between the prediction and the target label are estimated and minimized to update the discriminator, which is thereby forced to discriminate between the generation and the ground truth. The segmentation network we propose is a conditional version of pix2pix GAN with the multicascade architecture, so the adversarial input of the discriminator is composed of three components: the first is the source image used as the condition, and the others are the generation and the ground truth. At the same time, each generator is optimized to generate domain-invariant representations that confuse the discriminator.

3.2.2. Loss Functions in Segmentation Stage

The dice score and the Jaccard index are commonly used as metrics for the evaluation of image segmentation tasks. CNNs trained for image segmentation are usually optimized by minimizing a weighted cross-entropy. In this work, we employ a specially designed loss function to simultaneously measure the generation similarity and the adversarial error, which contains three types of loss functions: an adversarial loss, an L1 loss, and a perceptual loss.
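For concreteness, the two evaluation metrics mentioned above can be computed for binary masks as in the following sketch. The function names and the flattened 0/1 list representation are choices made here for illustration; returning 1.0 for two empty masks is likewise an assumed convention.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2|P ∩ T| / (|P| + |T|) for flattened binary masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total


def jaccard_index(pred: Sequence[int], target: Sequence[int]) -> float:
    """Jaccard = |P ∩ T| / |P ∪ T| for flattened binary masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    union = sum(pred) + sum(target) - intersection
    return 1.0 if union == 0 else intersection / union


pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 1, 1, 0]
d = dice_score(pred, target)      # 2 * 2 / (3 + 3)
j = jaccard_index(pred, target)   # 2 / (3 + 3 - 2)
```

The two metrics are monotonically related by J = D / (2 - D), so they always rank segmentations of the same structure in the same order.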

The original adversarial loss (Vanilla GAN loss) is given by the Kullback-Leibler (KL) divergence score, accumulated over the cascade blocks with a given weight per block that enforces the trade-off between the cascade cross-entropy losses; the source image is the condition input of each convolutional Patch-GAN in our multicascade pix2pix segmentation network. Recently, the most commonly used adversarial losses are WGAN-GP [40] and LSGAN [41]. In the next section, we compare the performance of these three adversarial losses in our experiments.
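To make the difference between two of these adversarial losses concrete, the sketch below computes the discriminator-side Vanilla GAN (cross-entropy) loss and the LSGAN (least-squares) loss in their standard textbook forms. The function names are hypothetical, the inputs are flattened Patch-GAN probability maps given as plain lists, and the per-cascade weighting is omitted. WGAN-GP is not shown because its gradient penalty requires automatic differentiation.

```python
import math
from typing import Sequence


def vanilla_d_loss(d_real: Sequence[float], d_fake: Sequence[float],
                   eps: float = 1e-12) -> float:
    """Cross-entropy discriminator loss; inputs are probabilities in (0, 1)."""
    real_term = -sum(math.log(p + eps) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p + eps) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def lsgan_d_loss(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """Least-squares discriminator loss with targets 1 (real) and 0 (fake)."""
    real_term = sum((p - 1.0) ** 2 for p in d_real) / len(d_real)
    fake_term = sum(p ** 2 for p in d_fake) / len(d_fake)
    return 0.5 * (real_term + fake_term)


# A discriminator that cannot tell real from fake outputs 0.5 everywhere.
undecided_vanilla = vanilla_d_loss([0.5, 0.5], [0.5, 0.5])
undecided_lsgan = lsgan_d_loss([0.5, 0.5], [0.5, 0.5])
perfect_lsgan = lsgan_d_loss([1.0, 1.0], [0.0, 0.0])
```

The least-squares form keeps penalizing samples in proportion to their distance from the target value, whereas the cross-entropy form saturates for confidently classified samples.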

The L1 loss is a weighted sum of the absolute distances between the output computed in each cascade block and the ground truth, which brings the segmentation results closer to the real results [18]; each cascade block has its own weight constant. Without loss of generality, we will take for all in our experiments.
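A weighted L1 loss over cascade outputs of this kind can be sketched as follows. The names `l1_distance` and `cascade_l1_loss`, the use of the mean rather than the sum of absolute differences within each block, and the example weights are all assumptions made for illustration.

```python
from typing import Sequence


def l1_distance(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute difference between a prediction and the ground truth."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def cascade_l1_loss(preds: Sequence[Sequence[float]],
                    target: Sequence[float],
                    weights: Sequence[float]) -> float:
    """Weighted sum of per-block L1 distances, one weight per cascade block."""
    if len(preds) != len(weights):
        raise ValueError("need exactly one weight per cascade block")
    return sum(w * l1_distance(p, target) for w, p in zip(weights, preds))


target = [0.0, 1.0, 1.0, 0.0]
preds = [
    [0.5, 0.5, 0.5, 0.5],   # coarse output of the first block
    [0.0, 1.0, 0.5, 0.0],   # refined output of the second block
]
equal = cascade_l1_loss(preds, target, weights=[1.0, 1.0])      # hypothetical weights
late_heavy = cascade_l1_loss(preds, target, weights=[0.5, 1.0])  # hypothetical weights
```

Lowering the weight of an early block reduces how strongly its coarse prediction is penalized relative to the refined predictions of later blocks.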

Besides, we also employ the perceptual loss in our multicascade pix2pix segmentation network, which is computed with a pretrained VGG19 network and was first proposed for image super resolution [42]. The perceptual loss compares the feature maps of the output data with those of the ground truth [43]. It is computed by accumulating an error measure between the two sets of feature maps over every feature channel of every feature layer (after activation) [42]; the computation involves the number of feature channels in each feature layer, the number of convolution layers, and the size of each feature map in the VGG19 network. Here, the error measure is taken between the VGG/ResNet feature maps of the ground truth and those of the prediction. Other widely used feature distances include the Manhattan distance and the cosine similarity between feature maps.
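The feature distances named above can be illustrated with the sketch below. It is not the VGG19-based loss itself: the feature maps are replaced by short lists of floats standing in for flattened network activations, and the names `manhattan_distance`, `cosine_similarity`, and `perceptual_distance`, as well as the per-layer size normalization, are assumptions.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def manhattan_distance(f: Vector, g: Vector) -> float:
    """Sum of absolute differences between two flattened feature maps."""
    return sum(abs(a - b) for a, b in zip(f, g))


def cosine_similarity(f: Vector, g: Vector) -> float:
    """Cosine of the angle between two flattened feature maps (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(f, g))
    norm_f = math.sqrt(sum(a * a for a in f))
    norm_g = math.sqrt(sum(b * b for b in g))
    if norm_f == 0.0 or norm_g == 0.0:
        return 0.0
    return dot / (norm_f * norm_g)


def perceptual_distance(feats_pred: Sequence[Vector],
                        feats_true: Sequence[Vector],
                        distance: Callable[[Vector, Vector], float] = manhattan_distance
                        ) -> float:
    """Average a size-normalized feature distance over all feature layers."""
    per_layer = [distance(fp, ft) / len(ft) for fp, ft in zip(feats_pred, feats_true)]
    return sum(per_layer) / len(per_layer)


# Two hypothetical layers of features for a prediction and its ground truth.
feats_pred = [[1.0, 2.0], [0.0, 0.0]]
feats_true = [[1.0, 0.0], [0.0, 4.0]]
loss = perceptual_distance(feats_pred, feats_true)   # (2/2 + 4/2) / 2
```

In an actual training loop the two feature lists would come from the same pretrained network applied to the prediction and to the ground truth, with the network weights kept frozen.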

The proposed segmentation model is trained by jointly minimizing a total loss that combines the three parts above, with two given weight parameters balancing their contributions.
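One common way to write such a three-part objective is sketched below. The symbols $\mathcal{L}_{\mathrm{adv}}$, $\mathcal{L}_{1}$, $\mathcal{L}_{\mathrm{per}}$, $\alpha$, and $\beta$, as well as the placement of the two weights on the L1 and perceptual terms, are assumptions made here for illustration, following the usual pix2pix convention [18]; they are not taken from the original formulation.

```latex
\mathcal{L}_{\mathrm{total}}
  = \mathcal{L}_{\mathrm{adv}}
  + \alpha\, \mathcal{L}_{1}
  + \beta\, \mathcal{L}_{\mathrm{per}},
  \qquad \alpha, \beta \ge 0
```

Under this form, setting $\beta = 0$ recovers a model trained without the perceptual loss.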

4. Results and Discussion

The proposed CMMCSegNet framework is implemented using PyTorch. The experiments are conducted on a single GeForce RTX 2080Ti GPU with 11 GB of memory. To validate the model design, we performed several ablation experiments, which are described as follows.

4.1. Dataset and Experimental Setting

To demonstrate our CMMCSegNet framework, we use the MS-CMRSeg 2019 datasets [1], which contain three different modalities: LGE with 45 patients, of which only 5 are labeled; bSSFP with 35 annotated patients; and T2-weighted. The goal of the CMR segmentation challenge is to achieve LGE image segmentation. Since there are few T2-weighted slices for each patient in the dataset (about 3-7 slices), we only use the bSSFP and LGE modalities in our experiments.

The cross-modality translation network is trained for epochs, and the model that performs best on the validation set was selected for translation from LGE to bSSFP in the proposed CMMCSegNet framework. The dataset training the segmentation network contains two parts, most of them are from real annotated bSSFP images (slices from the patients), and a small amount of fake bSSFP images are translated from the annotated LGE images (slices from about two patients) by the Cycle-GAN translation network.

We also train epochs for the segmentation network. The both models are trained using Adam optimization with a minibatch size of , a decayed learning rate with an initial value , the size of patch D in the discriminator based on Patch-GAN, and the weight hyperparameters , , , and .

4.2. Performance of Cross-Modality Translation

We first use Cycle-GAN to achieve translation between the LGE and bSSFP modalities. To evaluate the translation network, we employ three metrics, Structural Similarity (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Mutual Information (MI), computed on the whole set of LGE and bSSFP images. Several randomly chosen results from the translated (fake) LGE and bSSFP modalities are shown in Figure 3. As shown in Table 1, our translation model leads to a comparable synthesis quality between the LGE and bSSFP modalities over the whole datasets; the table compares real LGE, fake LGE, real bSSFP, and fake bSSFP images.
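Of the three metrics, PSNR is the simplest to state, and a minimal sketch is given below. The function names and the default `max_value` of 255 (8-bit intensities) are assumptions made for illustration; SSIM and MI involve local statistics and joint histograms and are not reproduced here.

```python
import math
from typing import Sequence


def mean_squared_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared difference between two flattened images of equal size."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr(a: Sequence[float], b: Sequence[float], max_value: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB: 10 * log10(max_value^2 / MSE)."""
    err = mean_squared_error(a, b)
    if err == 0.0:
        return math.inf            # identical images
    return 10.0 * math.log10(max_value ** 2 / err)


identical = psnr([10.0, 20.0], [10.0, 20.0])
worst_case = psnr([0.0, 0.0], [255.0, 255.0])   # MSE equals max_value^2
small_error = psnr([0.0], [25.5])               # error is one tenth of the range
```

Higher PSNR means the two images are closer in intensity; unlike SSIM, it does not account for structural similarity.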

4.3. Comparisons for Different Choices of Adversarial Loss and Perceptual Loss

After the cross-modality translation, two fake bSSFP patients with annotated masks (obtained from the cross-modality translation of two LGE patients with the ground truth) and fully real labeled bSSFP patients (35 patients) are used to train our proposed segmentation network. Next, we did several different comparison experiments for segmentation evaluations of fake bSSFP without annotated data (obtained from the cross-modality translation).

Table 2 shows the dice score of cardiac LGE segmentation when using different adversarial losses (Vanilla GAN, LSGAN, and WGAN-GP), different CMMCSegNet generator blocks (U-net and ResNet), and training with or without either variant of the perceptual loss. The overall segmentation performance of the U-net generator is slightly better than that of the ResNet generator across the 6 different losses in terms of the LV (left ventricle), MYO (myocardium), and RV (right ventricle). For the U-net generator, the model using the LSGAN loss yields better performance than those using the Vanilla GAN and WGAN-GP losses. Besides, adding either variant of the perceptual loss for feature comparisons encourages the network to learn relevant high-level features and content features, which improves the segmentation results for Vanilla GAN and LSGAN. However, the dice scores of LV and RV segmentation decrease slightly when WGAN-GP is used with the perceptual loss. In the ResNet generation network, by contrast, the models with either variant of the perceptual loss achieve higher segmentation performance in all three terms than those without it.

4.4. Comparisons of the Cascade Generators

High-level semantic features in each branch contain sufficient localization information of the corresponding region. To make full use of these features, we propose the multicascade architecture to implicitly extract geometrical and textural information that guides the cardiac segmentation. To assess the proposed architecture, we evaluate the performance of pix2pix segmentation trained on the training dataset (real/fake bSSFP images with ground truths). Final results are obtained with an ensemble of 1-4 cascades using the LSGAN adversarial loss and the perceptual loss. Comparisons with different numbers of cascades are shown in Table 3: as the number of cascades increases from one to four, the dice values of some terms drop slightly for the models with and without the perceptual loss. A possible reason is that increasing the number of cascades causes much of the edge information in the original fake bSSFP images to be lost. As Figure 1 shows, once the first segmentation network has produced the segmentation result of the input fake bSSFP images, a later generator that modifies the previous result without the original fake bSSFP image as a conditional input extracts fewer features than the first generator. To reduce the computational cost, starting from the second generator, we reduce the number of upsampling/downsampling layers in the middle part of the U-net generators. From Table 3, we observe that the proposed network with the simplified U-net versions can improve the segmentation results. Figure 4 shows the original LGE images, the translated bSSFP images, the corresponding ground truths, and the prediction results for varying numbers of cascades.

4.5. Comparisons of the Loss Weights of Multicascade Blocks

The performance of the multicascade architecture may be directly limited by the loss weight parameter of each cascade generator . We compare the choice of the weights and , and represents the output of the -th generator . From Table 4, the model with LSGAN adversarial loss and vgg perceptual loss is optimized solely using loss weights and achieves the better results on the evaluation dice of . Due to the efficiency of the multicascade technique, the proposed segmentation network automatically improves image multilevel features that benefits the segmentation performance. Figure 5 shows the results of different generators in a multicascade pix2pix network with different weights; can further modify the details of making the output result closer to ground truth.

4.6. Comparison to Conventional Methods

Table 5 benchmarks the performance of the proposed framework against direct and indirect LGE segmentation networks. First, we compare four direct segmentation methods, FCNs [4], U-net [5], U-net++ [44], and Attention U-net [45], each trained directly on the small number of annotated LGE images. As reported in Table 5, although U-net performs better than the others, it still produces a low dice value. Figure 6(a) visualizes the segmentation results of the direct methods. We also compare five indirect segmentation methods, FCNs, U-net, U-net++, Attention U-net, and the proposed CMMCSegNet, each trained on the small number of annotated fake bSSFP images together with all the real annotated bSSFP images. As shown in Table 5, the proposed technique provides the highest dice scores for LV and MYO and a fair value for RV; that is, CMMCSegNet outperforms the other techniques on LV and MYO. Figure 6(b) further illustrates a more detailed comparison between the proposed and other techniques; CMMCSegNet has the obvious advantage that it learns the location information of the target area more easily.

5. Conclusion

In this work, we proposed a CMMCSegNet framework based on multimodal cardiac MR images for indirect LGE segmentation. First, we utilized Cycle-GAN to translate the LGE modality into the bSSFP modality and then segmented the translated (fake) bSSFP images to achieve indirect segmentation of LGE images. The advantage of this method is that only a small number of annotated LGE images are required to achieve accurate segmentation of LGE, because many annotated bSSFP images are employed. This indirection also mitigates the problem that LGE images themselves have a low contrast. Compared with the direct segmentation of LGE images, the indirect segmentation method has better segmentation performance.

For the multicascade pix2pix network, we regard the segmentation as a translation from image to ground truth; the purpose of multicascade architecture is to better improve the previous prediction through several generators. We also compared the use of different adversarial losses, the experimental results show LSGAN loss is better than the Vanilla GAN and WGAN-GP, and WGAN-GP loss is not significantly better than the Vanilla GAN loss. To improve the training effect of the model, the perceptual losses based on and measures are also used to optimize the features of each feature layer. In addition, we investigated the influence of the weights of the generation loss of multicascade structures, where the optimal weight coefficient is set to () for 3 cascade generation networks.

We also demonstrated the effectiveness of the proposed CMMCSegNet by comparing with FCNs, U-net, U-net++, and Attention U-net. In the future, we will consider the end-to-end segmentation method to segment the multimodal cardiac MR, combining the translation and segmentation together.

Data Availability

The dataset was obtained from the Multisequence Cardiac MR Segmentation Challenge (MS-CMRSeg 2019; https://zmiclab.github.io/mscmrseg19/). This challenge is aimed at creating an open and fair competition for various research groups to test and validate their methods, particularly for multisequence ventricle and myocardium segmentation. Also refer to publication [1].

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


Acknowledgments

This work was supported by the National Natural Science Foundation of China (NSFC Project number 11771369) and also partly by grants from the Outstanding Young Scholars of Education Bureau of Hunan Province, PR China (number 17B257), and Natural Science Foundation of Hunan Province, PR China (numbers 2018JJ2375, 2017SK2014, and 2018XK2304).