Abstract

Aim. This study proposes a new artificial intelligence model based on cardiovascular computed tomography for more efficient and precise recognition of Tetralogy of Fallot (TOF). Methods. Our model is a structurally optimized stochastic pooling convolutional neural network (SOSPCNN), which combines stochastic pooling, structural optimization, and convolutional neural network. In addition, multiple-way data augmentation is used to overcome overfitting. Grad-CAM is employed to provide explainability to the proposed SOSPCNN model. Meanwhile, both desktop and web apps are developed based on this SOSPCNN model. Results. The results on ten runs of 10-fold crossvalidation show that our SOSPCNN model yields a sensitivity of , a specificity of , a precision of , an accuracy of , an F1 score of , an MCC of , an FMI of , and an AUC of 0.9587. Conclusion. The SOSPCNN method performed better than three state-of-the-art TOF recognition approaches.

1. Introduction

Tetralogy of Fallot (TOF) is a congenital defect that disturbs normal blood flow through the heart [1]. It comprises four defects of the heart and its blood vessels [2]: (a) ventricular septal defect, (b) overriding aorta, (c) right ventricular outflow tract stenosis, and (d) right ventricular hypertrophy. These defects reduce the amount of oxygen in the blood that flows to the rest of the body. Infants with TOF have a bluish-looking skin color [3] because their blood does not carry enough oxygen.

Traditionally, TOF is diagnosed after a baby is born, often after the infant has had an episode of cyanosis during crying or feeding. The most common test is an echocardiogram [4], an ultrasound of the heart that can show problems with the heart structure and how well the heart is working despite the defect. Recently, computed tomography (CT) has shown success in the differential diagnosis of TOF [5], since it can provide detailed images of many types of cardiovascular issues; besides, CT can be performed even if the subject has an implanted medical device, unlike magnetic resonance imaging (MRI) [6].

Manual diagnosis on CT is labor-intensive, onerous, and requires expert skills. Besides, manual results vary due to intraexpert and interexpert factors. Shan et al. (2021) [7] mention “fully manual delineation that often takes hours,” whereas modern automatic diagnosis models based on artificial intelligence (AI) take only seconds to minutes to reach a decision; automatic diagnosis has therefore become a hot research field.

For example, Ye et al. (2011) [8] presented a morphological classification (MC) method, in which morphological features are extracted by registering cardiac MRI scans to a template. Later, deep learning (DL) emerged as a new type of artificial intelligence (AI) technique and has shown its power in many academic and industrial fields. Within DL, the convolutional neural network (CNN) is a standard algorithm that is particularly suitable for handling images. Giannakidis et al. (2016) [9] presented a multiscale three-dimensional CNN (3DCNN) for segmentation of the right ventricle. Tandon et al. (2021) [10] presented a ventricular contouring CNN (VCCNN) algorithm.

The difference between this study and previous studies is that we simplify the problem to a binary-coded classification problem [11]; that is, given an input cardiovascular CT image, the AI model should output a binary prediction of whether the subject has TOF or is healthy. This simplification lets the AI model focus on the prediction task itself without needing to generate human-understandable intermediate outputs (such as segmentation or contouring), with the expectation of making the model more accurate. Furthermore, we propose a new stochastic pooling CNN that uses a new pooling technique, stochastic pooling, to improve the prediction performance. All in all, our contributions are fourfold:
(a) Stochastic pooling is employed to replace traditional max pooling.
(b) Structural optimization is carried out to fix the optimal structure.
(c) Multiple-way DA is introduced to increase the diversity of training images.
(d) Experiments with ten runs of 10-fold crossvalidation show that our method outperforms three state-of-the-art approaches.

The rest of this paper is structured as follows: Section 2 describes the dataset. Section 3 contains the rationale of methodology, including the preprocessing, stochastic pooling, structural optimization, multiple-way data augmentation, the implementation, Grad-CAM, and evaluation measures. Section 4 presents the experimental results and discussions. Section 5 concludes this paper.

2. Dataset

This study is a retrospective study, for which ethical approval was exempted. The imaging protocol is as follows: Philips Brilliance 256-row spiral CT scanner; tube voltage: 80 kV; tube current-time product: 138 mAs; slice thickness: 0.8 mm; lung window (W: 1600 HU, L: -600 HU); mediastinal window (W: 750 HU, L: 90 HU); thin-slice reconstruction according to the lesion display, with slice thickness and spacing both 0.8 mm for the mediastinal window images. The patient was placed in the supine position, asked to hold the breath after a deep inspiration, and scanned conventionally from the apex of the lung to the costophrenic angle. The resolution of all images is 512 by 512 pixels. Data are available upon reasonable request to the corresponding authors.

We selected ten children with Tetralogy of Fallot who were admitted to Nanjing Children’s Hospital from March 2017 to March 2020. We then used a systematic random sampling method to select ten normal children from healthy children who underwent medical examination during the same period. The TOF observation group included three males and seven females, aged 4-22 months, with an average age of () months. The control group of normal children included six males and four females, aged 3-24 months, with an average age of () months. The inclusion criteria for children with confirmed Tetralogy of Fallot are as follows: (1) CT suggests Tetralogy of Fallot; (2) surgery confirmed that the anatomical deformity of the heart is Tetralogy of Fallot.

3. Methodology

3.1. Preprocessing

Table 1 lists the abbreviations used in this paper for ease of reading. A five-step preprocessing was carried out on all images to select the important slices, save storage, enhance contrast, remove unnecessary image regions, and reduce the image resolution.

First, four slices per subject were chosen by radiologists using a slice-level selection method. For TOF patients, the slices showing the largest lesion size and the greatest number of lesions were selected. For healthy control (HC) subjects, slices at any level could be selected. In total, we obtained 40 TOF images and 40 HC images.

Second, all images were converted to grayscale and stored in TIFF format [12] using lossless compression. Third, histogram stretching (HS) was employed to enhance image contrast. Suppose the $k$th input and output of HS are $x_k$ and $y_k$, respectively. HS can be formulated as
$$y_k = \frac{x_k - \min(x_k)}{\max(x_k) - \min(x_k)},$$
where $\min(x_k)$ and $\max(x_k)$ stand for the minimum and maximum grayscale values in the input $x_k$.
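
For concreteness, the following is a minimal NumPy sketch of this histogram stretching step; the function name and the uint8 output range are illustrative choices of ours rather than details taken from the paper.

```python
import numpy as np

def histogram_stretch(image: np.ndarray) -> np.ndarray:
    """Linearly rescale grayscale values to the full [0, 255] range.

    Each pixel is mapped to (x - min) / (max - min), then scaled back to uint8.
    """
    x = image.astype(np.float64)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:                     # flat image: nothing to stretch
        return np.zeros_like(image, dtype=np.uint8)
    y = (x - x_min) / (x_max - x_min)      # normalized to [0, 1]
    return (y * 255.0).round().astype(np.uint8)
```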

Fourth, cropping was performed to remove the examination bed at the bottom, the subject’s arms on both sides, the rulers at the bottom and right side, and text information (hospital, scanning protocol, subject information, image header information, and labeling) at the four corners.

Lastly, downscaling was performed to reduce each image to the size of . Figure 1 displays the diagram of our preprocessing procedure. Figures 2(a) and 2(b) show preprocessed examples of TOF and HC, respectively.

3.2. Stochastic Pooling

Pooling is an essential operation in standard convolutional neural networks (CNNs) [13]. Two common types of pooling exist: max pooling (MP) and average pooling (AP). The objective of pooling is to downsample an input image or feature map (FM), reducing its dimensionality (width and height) while allowing assumptions to be made about the features within each block.

Suppose we have an input image or FM that can be split into non-overlapping blocks, where every block has the extent of $m \times n$ pixels. Let us focus on the block $B_{i,j}$ at the $i$th row and $j$th column of blocks, shown as the red rectangle in Figure 3, where $b(x, y)$ denotes the pixel value at coordinate $(x, y)$ within the block.

The strided convolution (SC) traverses the input activation map with a stride equal to the size of the block, producing one output value per block (a weighted sum of the block’s pixels with learned kernel weights).

The $\ell_2$-norm pooling (L2P), average pooling (AP) [14], and max pooling (MP) [15] produce the root-mean-square, average, and maximum values within the block $B_{i,j}$, respectively:
$$\mathrm{L2P}(B_{i,j}) = \sqrt{\frac{1}{mn}\sum_{(x,y)\in B_{i,j}} b(x,y)^{2}},\qquad \mathrm{AP}(B_{i,j}) = \frac{1}{mn}\sum_{(x,y)\in B_{i,j}} b(x,y),\qquad \mathrm{MP}(B_{i,j}) = \max_{(x,y)\in B_{i,j}} b(x,y).$$

Nevertheless, AP outputs the average, which downscales the greatest value, where the important features may lie. In contrast, MP keeps the greatest value but tends to aggravate overfitting. To address these concerns, stochastic pooling (SP) [15] is introduced as a remedy to the drawbacks of AP and MP. SP is a four-step process.

Step 1. It produces the probability map (PM) for each pixel in the block $B_{i,j}$:
$$p(x, y) = \frac{b(x, y)}{\sum_{(x', y') \in B_{i,j}} b(x', y')},\quad (7)$$
where $p(x, y)$ stands for the PM value at pixel $(x, y)$.

Step 2. It creates a random location vector (RLV) $r$ that follows the discrete probability distribution (DPD) defined by the PM:
$$P\big(r = (x, y)\big) = p(x, y),\quad (8)$$
where $p(x, y)$ represents the probability of selecting location $(x, y)$. In short, the distribution of the RLV $r$ is the DPD given by the PM $p$.

Step 3. A sample location $(x_s, y_s)$ is drawn from the RLV $r$:
$$(x_s, y_s) \sim P(r).\quad (9)$$

Step 4. SP outputs the pixel value at the sampled location $(x_s, y_s)$, namely,
$$\mathrm{SP}(B_{i,j}) = b(x_s, y_s).\quad (10)$$

Input: block $B_{i,j}$.
Step 1: produce the PM $p(x, y)$ for each pixel $(x, y)$. See Equation (7).
Step 2: create an RLV $r$. See Equation (8).
Step 3: draw a sample location $(x_s, y_s)$ from the RLV $r$. See Equation (9).
Step 4: output the pixel at location $(x_s, y_s)$. See Equation (10).
Output: the pooled matrix obtained by applying Steps 1-4 to every block.

Figure 4 shows a realistic example of the four pooling methods, and Algorithm 1 presents the pseudocode of SP. Take the block highlighted by the red rectangle in Figure 4 as an example: L2P generates an output of 6.98, while AP and MP give 5.99 and 9.9, respectively. Meanwhile, SP first generates the PM matrix, then draws a sample location, and finally outputs the pixel value at that sampled location.
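
To make the four steps concrete, below is a small NumPy sketch of stochastic pooling applied to one non-negative block (as would appear after a ReLU). The block values and random seed in the usage example are made up for illustration and are not the block from Figure 4.

```python
import numpy as np

def stochastic_pool_block(block: np.ndarray, rng: np.random.Generator) -> float:
    """Stochastic pooling of one non-negative block (Steps 1-4)."""
    flat = block.ravel().astype(np.float64)
    total = flat.sum()
    if total == 0.0:                       # degenerate all-zero block
        return 0.0
    pm = flat / total                      # Step 1: probability map
    idx = rng.choice(flat.size, p=pm)      # Steps 2-3: draw a location from the DPD
    return float(flat[idx])                # Step 4: output the sampled pixel

# Toy usage on a 2x2 block with made-up values; AP and MP shown for comparison
rng = np.random.default_rng(0)
block = np.array([[8.0, 2.0],
                  [1.0, 5.0]])
print(stochastic_pool_block(block, rng))   # one of 8.0, 2.0, 1.0, 5.0
print(block.mean(), block.max())           # AP output 4.0, MP output 8.0
```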

3.3. Structural Optimization

How can the best network structure be obtained [16]? We design nine different configurations in this study; their structural hyperparameters are listed in Table 2. Two hyperparameters are considered: (i) the number of convolutional (Conv) layers and (ii) the number of fully connected layers (FCLs). These two types of layers are standard in CNNs, so we do not introduce them in detail due to the page limit.

In the following experiments, we will observe that configuration V, a five-layer customized neural network, gives the best performance. Here, we briefly give its detailed structure in Table 3. The input size is listed in Table 3. The first Conv layer (Conv_1) is followed by a batch normalization (BN) layer and a rectified linear unit (ReLU) activation. Conv_1 uses 32 kernels with a stride of 2 (kernel size given in Table 3). Afterward, the first SP layer (SP_1) further reduces the spatial size of the FM (see Table 3).

After three Conv layers and three SP layers, the resulting FM is flattened to a vector of 8192 neurons. With two FCLs of 100 and 2 hidden neurons, respectively, the network finally outputs whether the input is TOF or HC. All in all, our model is termed the structurally optimized stochastic pooling convolutional neural network (SOSPCNN). The FM plot is portrayed in Figure 5.
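
The following PyTorch sketch illustrates how a configuration-V-style network with stochastic pooling could be assembled. Only the 32 kernels and stride of 2 in Conv_1, the three Conv + SP stages, and the FCL widths of 100 and 2 come from the paper; the input size, kernel sizes, and Conv_2/Conv_3 channel counts are assumptions of ours, so this is a structural sketch rather than the authors' exact model.

```python
import torch
import torch.nn as nn

class StochasticPool2d(nn.Module):
    """2x2 stochastic pooling: sample within each block with activation-proportional
    probabilities during training, use the probability-weighted mean at test time.
    Assumes non-negative (post-ReLU) inputs and spatial sizes divisible by 2."""
    def __init__(self, kernel_size: int = 2):
        super().__init__()
        self.k = kernel_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        k = self.k
        # split into non-overlapping k*k blocks: (b, c, h//k, w//k, k*k)
        blocks = x.unfold(2, k, k).unfold(3, k, k).reshape(b, c, h // k, w // k, k * k)
        weights = blocks.clamp(min=0) + 1e-12
        probs = weights / weights.sum(dim=-1, keepdim=True)
        if self.training:
            idx = torch.multinomial(probs.reshape(-1, k * k), 1)   # one location per block
            out = blocks.reshape(-1, k * k).gather(1, idx)
            return out.reshape(b, c, h // k, w // k)
        return (probs * blocks).sum(dim=-1)                        # expectation at test time

class SOSPCNNSketch(nn.Module):
    """Rough sketch of a 3 Conv + 2 FC network in the spirit of configuration V."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        def conv_block(cin, cout):
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
                StochasticPool2d(2),
            )
        self.features = nn.Sequential(
            conv_block(1, 32),       # Conv_1 + SP_1 (32 kernels, stride 2, per the paper)
            conv_block(32, 64),      # Conv_2 + SP_2 (channel count assumed)
            conv_block(64, 128),     # Conv_3 + SP_3 (channel count assumed)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(100),      # FCL_1 (input width inferred at first forward pass)
            nn.ReLU(inplace=True),
            nn.Linear(100, num_classes),   # FCL_2: TOF vs. HC
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Usage with an assumed 256x256 grayscale input
model = SOSPCNNSketch()
dummy = torch.rand(1, 1, 256, 256)
print(model(dummy).shape)            # torch.Size([1, 2])
```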

3.4. Multiple-Way Data Augmentation

The relatively small dataset (40 + 40 = 80 images) may cause overfitting. To avoid overfitting, data augmentation (DA) [17] is a powerful tool because it generates synthetic images from the training set [18]. Zhu (2021) [19] presented an 18-way DA method and showed that it works better than the traditional DA approach; its diagram is shown in Figure 6. The differences between traditional DA and multiple-way DA (MDA) are that (i) MDA applies a combination of different DA methods to the training set and (ii) MDA has a modular design, so users can easily add or remove particular DA methods.

Suppose we have the raw training image $x(i)$, where $i$ represents the image index. First, $J$ different DA methods displayed in Figure 6 are applied to $x(i)$. Let $F_j,\ j = 1, \dots, J$, denote each DA operation; we obtain $J$ augmented datasets on the raw image $x(i)$ as
$$Y_j(i) = F_j\big[x(i)\big],\quad j = 1, \dots, J.\quad (12)$$

Let $W$ stand for the number of new images generated by each DA method; thus,
$$|Y_j(i)| = W,\quad j = 1, \dots, J.\quad (13)$$

Second, the horizontally mirrored image $x^{\mathrm{m}}(i)$ is produced by
$$x^{\mathrm{m}}(i) = f_{\mathrm{hm}}\big[x(i)\big],\quad (14)$$
where $f_{\mathrm{hm}}$ denotes the horizontal mirror function.

Third, all $J$ different DA methods are carried out on the mirrored image $x^{\mathrm{m}}(i)$ and produce $J$ new datasets as
$$Y_j^{\mathrm{m}}(i) = F_j\big[x^{\mathrm{m}}(i)\big],\quad j = 1, \dots, J.\quad (15)$$

Fourth, the raw image $x(i)$, the mirrored image $x^{\mathrm{m}}(i)$, all $J$-way results of the raw image, and all $J$-way DA results of the horizontally mirrored image are combined. The final dataset generated from $x(i)$ is defined as $Z(i)$:
$$Z(i) = f_{\mathrm{cat}}\big[x(i),\ x^{\mathrm{m}}(i),\ Y_1(i), \dots, Y_J(i),\ Y_1^{\mathrm{m}}(i), \dots, Y_J^{\mathrm{m}}(i)\big],\quad (16)$$
where $f_{\mathrm{cat}}$ stands for the concatenation function. Let the augmentation factor $N_{\mathrm{aug}}$ be the number of images in $Z(i)$; we obtain
$$N_{\mathrm{aug}} = |Z(i)| = 2 \times J \times W + 2.\quad (17)$$

Algorithm 2 recaps the pseudocode of this 18-way DA. We set $J = 9$ to achieve an 18-way DA (nine DA methods applied to the raw image and nine to its mirror). We also set $W = 30$; thus, $N_{\mathrm{aug}} = 2 \times 9 \times 30 + 2 = 542$, indicating that each raw training image generates 542 images, including the raw image itself.

Input: the raw preprocessed $i$th training image $x(i)$.
$J$ geometric, photometric, or noise-injection DA transforms are applied to $x(i)$.
Step 1: we obtain $Y_j(i),\ j = 1, \dots, J$. See Equation (12).
 Each enhanced dataset contains $W$ new images. See Equation (13).
Step 2: a horizontally mirrored image $x^{\mathrm{m}}(i)$ is produced. See Equation (14).
Step 3: the $J$-way data augmentation methods are carried out on $x^{\mathrm{m}}(i)$;
 we obtain $Y_j^{\mathrm{m}}(i),\ j = 1, \dots, J$. See Equation (15).
Step 4: $x(i)$, $x^{\mathrm{m}}(i)$, $Y_j(i)$, and $Y_j^{\mathrm{m}}(i)$ are merged via $f_{\mathrm{cat}}$. See Equation (16).
Output: a new dataset $Z(i)$ is produced. The number of images in $Z(i)$ is $N_{\mathrm{aug}}$. See Equation (17).
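
The sketch below illustrates the $J$-way scheme of Algorithm 2 in Python. The three DA operations shown (Gaussian noise, rotation, gamma correction) are placeholders of ours; the paper's 18-way DA uses its own specific set of nine geometric, photometric, and noise-injection transforms, which are not reproduced exactly here.

```python
import numpy as np
from typing import Callable, List

def add_gaussian_noise(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    noisy = img.astype(np.float64) + rng.normal(0.0, 10.0, img.shape)
    return np.clip(noisy, 0, 255).astype(img.dtype)

def random_rotation(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    # 90-degree rotations as a stand-in for the paper's rotation-based DA
    return np.rot90(img, k=int(rng.integers(1, 4)))

def gamma_correction(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    gamma = rng.uniform(0.7, 1.5)
    return (255.0 * (img / 255.0) ** gamma).astype(img.dtype)

def multiway_augment(
    image: np.ndarray,
    ops: List[Callable[[np.ndarray, np.random.Generator], np.ndarray]],
    images_per_op: int,
    rng: np.random.Generator,
) -> List[np.ndarray]:
    """Apply every op W times to the raw image and to its horizontal mirror,
    then concatenate with the two source images: 2*J*W + 2 images in total."""
    mirrored = np.fliplr(image)
    out = [image, mirrored]
    for source in (image, mirrored):
        for op in ops:
            out.extend(op(source, rng) for _ in range(images_per_op))
    return out

# Usage: with J = 3 ops and W = 30 images per op, one image yields 2*3*30 + 2 = 182 images
rng = np.random.default_rng(0)
dummy = rng.integers(0, 256, size=(256, 256), dtype=np.uint8)
augmented = multiway_augment(dummy, [add_gaussian_noise, random_rotation, gamma_correction], 30, rng)
print(len(augmented))  # 182
```
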
3.5. Implementation and Grad-CAM

$K$-fold crossvalidation [20] is employed. The whole dataset is divided into $K$ folds (see Figure 7). At the $k$th trial, $k = 1, \dots, K$, the $k$th fold is used as the test set, and the remaining $K - 1$ folds are used as the training set [21]. In this study, we set $K = 10$, namely, a 10-fold crossvalidation. Furthermore, we run the 10-fold crossvalidation 10 times, i.e., $10 \times 10$-fold crossvalidation.
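
As a protocol illustration, the snippet below sets up the 10 × 10-fold crossvalidation with scikit-learn on placeholder data; `train_and_score` is a stand-in of ours for training and evaluating the SOSPCNN on one split.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 64))            # 80 samples with dummy features
y = np.array([1] * 40 + [0] * 40)        # 40 TOF (1) and 40 HC (0) labels

def train_and_score(X_tr, y_tr, X_te, y_te) -> float:
    # Placeholder: a real run would train and evaluate the CNN here.
    return float(np.mean(y_te == rng.integers(0, 2, size=y_te.shape)))

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = [train_and_score(X[tr], y[tr], X[te], y[te]) for tr, te in cv.split(X, y)]
print(f"{np.mean(scores):.4f} +/- {np.std(scores):.4f} over {len(scores)} folds")
```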

Gradient-weighted class activation mapping (Grad-CAM) [22] is employed to explain how our model makes its classification decisions. Grad-CAM uses the gradient of the classification score with respect to the convolutional features computed by the network to identify which parts of the image are most important for classification. Grad-CAM generalizes the class activation mapping (CAM) method [23] to a broader range of CNN models, since the original CAM requires the convolutional feature maps to feed directly into global average pooling and the final classification layer. The output of SP_3 (see Table 3) is used as the feature layer for Grad-CAM.

Mathematically, suppose our classification network produces an output score $y^c$ for class $c$. We would like to compute the Grad-CAM map for a layer with feature maps $A^k$, where $(u, v)$ indexes the pixels of each feature map. The neuron importance weight is obtained as
$$\alpha_k^c = \frac{1}{Z} \sum_{u} \sum_{v} \frac{\partial y^c}{\partial A^k_{uv}},$$
where $Z$ stands for the total number of pixels in the feature map. The Grad-CAM map is a weighted combination of the feature maps followed by a ReLU:
$$L^c_{\mathrm{Grad\text{-}CAM}} = \mathrm{ReLU}\Big(\sum_{k} \alpha_k^c A^k\Big).$$

The Grad-CAM map is then upsampled to the size of the input image.
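
A minimal PyTorch-style sketch of this computation is given below, assuming a trained classifier `model` and a chosen feature layer (e.g., the module producing the SP_3 output); it is an illustration of the standard Grad-CAM recipe rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, feature_layer, image, class_idx):
    """Weight each feature map of `feature_layer` by the spatially averaged gradient
    of the class score, sum, apply ReLU, and upsample to the input size."""
    activations, gradients = {}, {}

    def fwd_hook(_, __, output):
        activations["value"] = output

    def bwd_hook(_, __, grad_output):
        gradients["value"] = grad_output[0]

    h1 = feature_layer.register_forward_hook(fwd_hook)
    h2 = feature_layer.register_full_backward_hook(bwd_hook)
    try:
        model.eval()
        scores = model(image)                      # image: (1, C, H, W)
        scores[0, class_idx].backward()            # gradient of y^c
    finally:
        h1.remove()
        h2.remove()

    acts = activations["value"]                    # feature maps A^k, shape (1, K, h, w)
    grads = gradients["value"]                     # dy^c / dA^k, same shape
    weights = grads.mean(dim=(2, 3), keepdim=True) # alpha_k^c: global-average of gradients
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    cam = cam - cam.min()
    return (cam / cam.max().clamp(min=1e-12)).squeeze()   # normalized (H, W) heat map
```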

3.6. Measures

The confusion matrix of 10 runs of 10-fold crossvalidation takes the form
$$C = \begin{pmatrix} \mathrm{TP} & \mathrm{FN} \\ \mathrm{FP} & \mathrm{TN} \end{pmatrix}.$$

Note that $\mathrm{FN} = \mathrm{FP} = 0$ (i.e., $C$ is diagonal) for a perfect classification. The meanings of TP, FN, FP, TN, and the other symbols used in the measures are itemized in Table 4.

Nine measures are used: sensitivity, specificity, precision, accuracy, F1 score, Matthews correlation coefficient (MCC), Fowlkes–Mallows index (FMI), receiver operating characteristic (ROC), and area under the curve (AUC).

The first four measures are defined as
$$\mathrm{Sensitivity} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}},\qquad \mathrm{Specificity} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}},\qquad \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}},\qquad \mathrm{Accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}.$$

F1, MCC [24], and FMI [25] are defined as
$$\mathrm{F1} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}},$$
$$\mathrm{MCC} = \frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}},$$
$$\mathrm{FMI} = \sqrt{\frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} \times \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}}.$$

The above measures are reported in mean and standard deviation (MSD) format. Furthermore, the ROC is a curve that characterizes a binary classifier under varying discrimination thresholds [26]; it is created by plotting the sensitivity against 1 - specificity. The AUC is calculated as the area under the ROC curve [27].
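
The snippet below shows how the seven threshold-based measures can be computed from a 2 × 2 confusion matrix; the counts in the usage line are made up and are not the paper's results.

```python
import math

def binary_measures(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Compute the threshold-based measures from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    fmi = math.sqrt(precision * sensitivity)
    return {
        "sensitivity": sensitivity, "specificity": specificity,
        "precision": precision, "accuracy": accuracy,
        "F1": f1, "MCC": mcc, "FMI": fmi,
    }

# Example with made-up counts
print(binary_measures(tp=350, fn=50, fp=40, tn=360))
```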

4. Experimental Results

4.1. Statistical Analysis

The result of the SOSPCNN model using configuration V is itemized in Table 5. The model arrives at a performance with a sensitivity of , a specificity of , a precision of , an accuracy of , an F1 score of , an MCC of , and an FMI of .

Figure 8 shows the confusion matrix of the $10 \times 10$-fold crossvalidation, where TP = 369, FN = 31, FP = 29, and TN = 371; that is, 31 TOF images are wrongly classified as HC, while 29 HC images are misclassified as TOF. Hence, the sensitivity is 369/400 = 92.25%, and the specificity is 371/400 = 92.75%.

4.2. Configuration Comparison

We compare the nine configurations listed in Table 2 under the same validation scheme as in the previous experiment. Due to the page limit, the detailed statistical analysis is not shown. The ROC curves and AUC values are displayed in Figure 9. The AUC values of the nine configurations are 0.9502, 0.9511, 0.9504, 0.9532, 0.9587, 0.9577, 0.9360, 0.9419, and 0.9389 (as shown in Figure 10). We can observe from Figure 10 that the best network is configuration V, whose structure is shown in Table 3.

4.3. Effect of Multiple-Way Data Augmentation

Figure 11 shows the multiple-way DA results when Figure 2(a) is taken as the raw training example. Due to the page limit, the multiple-way DA results on the horizontally mirrored image are not displayed. As can be seen from Figure 11, multiple-way DA increases the diversity of the training images.

If we remove the multiple-way data augmentation from our model, the performance decreases, as shown in Table 6, where MSD stands for mean and standard deviation. Comparing Table 5 with Table 6, we observe that multiple-way DA is effective in improving the classification performance. The reason is that it helps our model resist overfitting by enhancing the diversity of the training set.

4.4. Explainability

Figure 12 shows the manual delineation and the Grad-CAM heat map of Figure 2(a), obtained as described in Section 3.5. The manual delineation shows that the radiologist bases the “TOF” diagnosis on all areas of the abnormal heart, while the heat map shows that the proposed SOSPCNN model likewise focuses on the heart region rather than on the surrounding tissues and background areas.

4.5. Comparison with State-of-the-Art Approaches

We compare the proposed SOSPCNN model with three other approaches: MC [8], 3DCNN [9], and VCCNN [10]. The results are shown in Table 7. Note that some comparison methods are not directly suitable for our dataset, so we modify them to adapt them to our dataset.

The error bar comparison is drawn in Figure 13, which clearly shows that the proposed SOSPCNN outperforms all three comparative approaches. The reasons are threefold: (i) we use stochastic pooling to replace traditional max pooling; (ii) we use structural optimization to determine the optimal structure of our SOSPCNN model; (iii) multiple-way DA is included to increase the diversity of training images. In the future, more advanced techniques [28-30] will be tested and integrated into our model.

4.6. Desktop and Web Apps

MATLAB App Designer is used to create a professional application for both desktop and web. The input to the app is any cardiovascular CT image, and the aforementioned SOSPCNN model is embedded in the app. Figure 14(a) displays the graphical user interface (GUI) of the standalone desktop app. Users can upload their own images, and the software shows the result by turning the knob to the correct label: TOF, HC, or none.

Figure 14(b) shows the GUI of the web app accessed through the Google Chrome web browser. The web app follows a client-server structure [31]; i.e., the user is served by an off-site server hosted on a third-party cloud service, Microsoft Azure in our study. The developed online web app can assist hospital clinicians in making decisions remotely and efficiently.
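
For readers who want to prototype a similar service in Python, the sketch below shows an analogous client-server inference endpoint using FastAPI; it is only an illustration of the client-server idea, since the authors' app is built with MATLAB App Designer and hosted on Microsoft Azure, and `predict_stub` merely stands in for the trained SOSPCNN.

```python
import io

import numpy as np
from fastapi import FastAPI, File, UploadFile
from PIL import Image

app = FastAPI()
LABELS = ["HC", "TOF"]

def predict_stub(image: np.ndarray) -> int:
    # Placeholder for the trained SOSPCNN forward pass.
    return int(image.mean() > 128)

@app.post("/classify")
async def classify(file: UploadFile = File(...)):
    data = await file.read()
    img = Image.open(io.BytesIO(data)).convert("L").resize((256, 256))
    label = LABELS[predict_stub(np.asarray(img))]
    return {"prediction": label}
```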

5. Conclusion

This paper proposes a TOF recognition approach built around a structurally optimized stochastic pooling convolutional neural network (SOSPCNN), whose explainability is achieved by Grad-CAM. The results of ten runs of 10-fold crossvalidation show that this SOSPCNN model yields a sensitivity of , a specificity of , a precision of , an accuracy of , an F1 score of , an MCC of , an FMI of , and an AUC of 0.9587. Further, we develop both desktop and web apps that deploy this SOSPCNN model.

The shortcomings of our method are as follows: (i) our model is trained on a small dataset; (ii) our model has not gone through strict medical verification; (iii) our model only considers TOF and HC.

Therefore, we shall attempt to address these three weaknesses in the future. We shall try to collect more TOF and HC cardiovascular CT images. We shall invite clinicians to use our web app and provide feedback so that we can continue to improve our model. We shall also try to collect data on other heart diseases so that our model can identify more types of disease.

Data Availability

Data is available upon reasonable requests to corresponding authors.

Conflicts of Interest

The authors declare that they have no conflicts of interest to report regarding the present study.

Authors’ Contributions

Shui-Hua Wang and Kaihong Wu contributed equally to this work.

Acknowledgments

This paper is partially supported by the Royal Society International Exchanges Cost Share Award, UK (RP202G0230); Medical Research Council Confidence in Concept Award, UK (MC_PC_17171); Hope Foundation for Cancer Research, UK (RM60G0680); British Heart Foundation Accelerator Award, UK; Sino-UK Industrial Fund, UK (RP202G0289); and Global Challenges Research Fund (GCRF), UK (P202PF11).