Abstract

Nearly two million patients worldwide die of gastrointestinal diseases each year. Video endoscopy is one of the latest technologies in the medical imaging field for the diagnosis of gastrointestinal diseases, such as stomach ulcers, bleeding, and polyps. Medical video endoscopy generates many images, so doctors need considerable time to review them all. This creates a challenge for manual diagnosis and has encouraged investigations into computer-aided techniques that can diagnose all the generated images in a short period and with high accuracy. The novelty of the proposed methodology lies in developing a system for the diagnosis of gastrointestinal diseases. This paper introduces three networks, GoogleNet, ResNet-50, and AlexNet, which are based on deep learning, and evaluates their potential for diagnosing a dataset of lower gastrointestinal diseases. All images are enhanced, and the noise is removed, before they are input into the deep learning networks. The Kvasir dataset contains 5,000 images divided equally into five classes of lower gastrointestinal findings (dyed-lifted polyps, normal cecum, normal pylorus, polyps, and ulcerative colitis). In the classification stage, pretrained convolutional neural network (CNN) models are tuned by transfer learning to perform the new task. The softmax activation function receives the deep feature vector and classifies the input images into five classes. All CNN models achieved superior results. AlexNet achieved an accuracy of 97%, sensitivity of 96.8%, specificity of 99.20%, and AUC of 99.98%.

1. Introduction

Cancer is the most common cause of death in the world, and gastrointestinal cancer is among the most frequently occurring types. The World Health Organization (WHO) estimates that 1.8 million people die annually from gastrointestinal diseases [1], and gastrointestinal cancer is the fourth leading cause of death in the world. Gastrointestinal cancer grows from gastrointestinal polyps, which are abnormal tissue growths on the mucosa of the stomach and colon. The polyps grow slowly, and symptoms only appear when they are large [2]. However, polyps can be prevented and cured if detected at an early stage [3].

Video endoscopy plays an important role in increasing the early diagnosis of polyps in the gastrointestinal tract and reducing mortality [4]. Endoscopy can determine the severity of ulcerative colitis by detecting mucosal patterns that include spatial differences in mucosal colour and texture (the degree of roughness on the mucosal surface of the gastrointestinal tract) [5]. Hundreds of images can be extracted from a gastrointestinal video, but disease appears in only a few of them, and no medical person can devote the amount of time needed to monitor all the images. Consequently, the accuracy of the diagnosis depends mainly on the experience of the doctor; even experts miss polyps in up to 27% of cases [6]. Therefore, during the examination, polyps may remain undetected and lead to future malignancies.

Deficiencies in the radiologist's performance and other human factors can lead to a false diagnosis, so a computer-aided automated method would be valuable for diagnosing polyps with high accuracy and at the early stages of cancer. Artificial intelligence techniques have shown massive potential in various medical fields for helping humans to visualize disease that cannot be discovered with the naked eye [7–9]. For example, artificial intelligence techniques can extract complex microscopic structures from endoscopy images and identify key features. Clinically, artificial intelligence techniques can distinguish between neoplastic and nonneoplastic tissues. Techniques are also available for extracting texture features to evaluate the risk of gastric cancers [10, 11]. Colonoscopy images have been used to classify colitis by extracting texture features [12, 13]. However, the challenges in extracting the features of gastrointestinal images limit the diagnostic accuracy [14].

Machine learning techniques have been used to extract colour, texture, and edge features from endoscopic images, but these approaches depend on trial and error for disease diagnosis [15, 16]. Convolutional neural networks (CNNs) have begun to solve these feature engineering limitations, and the use of CNNs in supervised learning has greatly improved medical image diagnostics [17]. CNNs have proved their tremendous ability to extract features by shifting feature engineering into the learning process itself [18]. Deep learning algorithms have been shown to outperform experts in medical image diagnostics [19]. Hence, computer-aided diagnostics using deep learning techniques for endoscopic images have the potential to achieve diagnostic accuracy that is better than that obtained by trained specialists [20].

Karkanis et al. [21] presented a technique for extracting colour features based on wavelet decomposition for diagnosing colon polyps. Many studies have applied machine learning techniques to diagnose gastrointestinal images, with features extracted through approaches that include polyp-based local binary patterns, grey-level co-occurrence matrices (GLCMs), wavelets, context-based features, and edge shape and valley information [22–24]. The system proposed by Tajbakhsh et al. [25] achieved better performance than other methods. Challenges remain for extracting hand-crafted features, such as light reflection, camera angle, and structural polyps. CNN techniques are powerful extractors of deep features, and CNNs have achieved promising results in recent years in diagnosing medical images. Zhang et al. [26] introduced a polyp detection system based on a Single Shot MultiBox Detector (SSD), where they reused missing features from the max pooling layers and added them to feature maps to increase the detection and classification accuracy. Godkhindi and Gowda [27] presented a CNN system for detecting polyps from CT colonography images. Their algorithm segments the colon from the CT image and isolates it from the rest of the organs. It then diagnoses a colon polyp by extracting the shape features. Ozawa et al. [28] presented a system for detecting colorectal polyps by SSD and reported promising results in diagnosing these polyps. Pozdeev et al. [29] presented a two-stage polyp segmentation and automated classification system. The first stage uses the global features of endoscopic images to classify the presence or absence of a tumor, while the second stage includes segmentation by CNN. Wan et al. [30] introduced biomedical optical spectroscopy techniques to detect gastrointestinal cancers in early stages. This type of spectroscopy has the potential to provide structural and chemical information and has many advantages, including noninvasiveness, a reagent-free protocol, and a nondestructive procedure. Ribeiro et al. [31] proposed the use of CNNs for diagnostics of the colon mucosa to uncover colon polyps for early-stage colon cancer classification.

CNN extracts features by exploiting the input pixels to handle distortions caused by different light conditions. Min et al. [32] developed a computer-aided system to diagnose linked colour imaging by extracting colour features from lesions. The system classified images as adenomatous polyps and nonadenomatous polyps, and the system achieved satisfactory results. Song et al. [33] also developed a computer-aided system for diagnosing colorectal polyp histology by CNN techniques; their network classified the polyps into three types: serrated polyp, deep submucosal cancer, and benign adenoma mucosal or superficial submucosal cancer.

The main contribution of the present paper is the provision of a computer-aided detection method for lower gastrointestinal diseases with modified criteria for extracting deep shape, colour, and texture features and adapting them through a fine-tuned transfer learning technique. Extensive experiments were conducted to select pretrained models to diagnose lower gastrointestinal diseases. New models were developed for transferring features learned from a nonmedical deep learning dataset and adapting them to the new dataset. The remainder of the paper is organised as follows. Section 2 describes the background and motivations. Section 3 discusses the materials and methods used to diagnose the dataset. Section 4 presents the analysis and results of the study and compares the proposed systems with those from previous studies. Finally, the conclusions are presented in Section 5.

2. Background and Motivations

This section provides the fundamentals of gastrointestinal diseases and an overview of deep learning for diagnosing medical images.

2.1. Overview and Status of Gastrointestinal Diseases

Computer-aided early detection of disease is an important research field that can improve healthcare systems and medical practice around the world. The Kvasir dataset, which contains gastrointestinal images, is classified into three clinically important findings, three significant anatomical landmarks, and two categories of endoscopic polyp removal. The gastrointestinal tract is affected by many diseases, with 2.8 million new cases and 1.8 million deaths caused annually by oesophageal and stomach cancers. The gold standard for gastrointestinal examination is endoscopy. The upper gastrointestinal examination, involving the stomach, oesophagus, and upper part of the small intestine, is performed by gastroscopy, while the colon and rectum are examined by colonoscopy. Both of these examinations are recorded as real-time videos at high resolution. Endoscopy equipment is expensive and requires extensive experience and training. Endoscopic detection and removal of lesions in their early stages, followed by appropriate treatment, are important for preventing colorectal cancer. Doctors vary in their abilities to detect colorectal cancer, and a limited ability to evaluate the images may compromise the diagnosis. An accurate diagnosis of the type of disease is also important for treatment and follow-up. Therefore, automatic diagnostics would be very welcome: automatic diagnosis of pathological findings could contribute to the evaluation and identification of gastrointestinal cancers, thereby improving the efficiency and use of medical resources.

2.2. Deep Learning

CNNs are computational systems designed for pattern recognition. CNNs have entered a number of fields, including healthcare [34], and play an important role in the diagnostics of images obtained in early disease stages. Image recognition and diagnostic accuracy are two tasks in which CNNs excel compared to human experts. A CNN has three types of layers: convolutional layers, pooling layers, and fully connected layers [35]. CNNs are better suited than traditional networks, and than other architectures such as RNNs, to deal with images because they use a combination of technologies within these layers [36, 37]. The basic idea underlying CNNs is the use of two-dimensional images and the application of two-dimensional filters, in addition to a transfer learning technique in which models are initialised from the best pretrained models and the last three layers are replaced to learn the weights of the problem to be solved. CNN features are extracted from the dataset that the network is trained on, so experts do not need to extract the features manually [38]. A CNN's strength comes from its ability to learn the representative features in its training dataset. Convolutional layers operate in a feed-forward fashion, with each layer's output feeding the next layer, and the process continues until precise features are obtained.

2.2.1. GoogleNet

GoogleNet, developed by Google researchers, is a CNN model, sometimes called Inception V1; it consists of 22 layers with parameters (27 layers including the pooling layers). The GoogleNet architecture won the image classification challenge at ILSVRC 2014. GoogleNet is used in many fields, including computer vision tasks, as well as in medical image classification. Figure 1 illustrates the architecture of the GoogleNet used to classify 5,000 images into five classes from the lower digestive system. The GoogleNet architecture consists of 27 layers, including layers that do not contain parameters. It begins with the input layer, which accepts RGB images with a size of 224 × 224 pixels. The first convolutional layer applies 7 × 7 filters, among the largest in the network, and reduces the size of the input image; it is followed by a max pooling layer with 3 × 3 filters, a convolutional layer with 3 × 3 filters, and then another max pooling layer with 3 × 3 filters. The output is fed into a two-module inception block, followed by a max pooling layer with 3 × 3 filters, then a four-module inception block. This is followed by a max pooling layer with 3 × 3 filters, a two-module inception block, and then another max pooling layer with 3 × 3 filters. The average pooling layer has a size of 7 × 7 pixels. The stride determines the amount by which a filter shifts over the input image. Dropout is used to prevent overfitting; in our work, the dropout rate was set at 40%, which means that 40% of the neurons are deactivated in each iteration, so different parameters are used in each iteration. The fully connected layer received 9,216 features and produced 4,096 features. The softmax layer produces five classes: dyed-lifted polyps, normal cecum, normal pylorus, polyps, and ulcerative colitis.

2.2.2. ResNet-50

ResNet-50 is a residual CNN model consisting of 177 layers. ResNet-50 won the image classification challenge in 2015. ResNet-50 is the backbone of many computer vision tasks. Figure 2 illustrates the architecture of the ResNet-50 used to classify 5,000 images divided into five classes from the lower digestive system. The ResNet-50 architecture consists of 16 blocks containing 177 layers, starting with the input layer, which accepts RGB images with a size of 224 × 224 pixels, followed by 49 convolutional layers that use different types of filters [39]. The convolutional layers extract deep features from the input images and store them in deep feature maps, and there is one max pooling layer and one average pooling layer; these two layers reduce the feature map dimensions. Batch normalisation then helps the network to choose the learning rate correctly. The Rectified Linear Unit (ReLU) activation function that follows the convolutional layers passes only positive outputs and converts negative values to zero. The fully connected layer receives 9,216 features and produces 4,096 features, and the second connected layer produces 1,000 features. The softmax layer produces the five classes: dyed-lifted polyps, normal cecum, normal pylorus, polyps, and ulcerative colitis.

2.2.3. AlexNet

AlexNet is a CNN model consisting of 25 layers. AlexNet won the ImageNet classification competition in 2012, with a top-5 error rate of 15.3% [40]. Figure 3 illustrates the architecture of the AlexNet used to classify 5,000 images divided into five classes from the lower digestive system. The architecture of AlexNet consists of 25 layers, starting with the input layer, which accepts RGB images with a size of 227 × 227 pixels, followed by five convolutional layers that use different types of filters. The convolutional layers extract deep features from the input images. Three max pooling layers reduce the dimensions of the feature maps. Two cross-channel normalisation layers normalise the activations, helping in the choice of an appropriate learning rate. Seven ReLU layers follow the convolutional layers; ReLU outputs only the positive values, while converting the negative values to zero. Three fully connected layers operate in series. The first connected layer receives 9,216 features and produces 4,096 features, the second connected layer produces 4,096 features, and the third fully connected layer produces 1,000 neurons (features). A softmax layer produces the five classes: dyed-lifted polyps, normal cecum, normal pylorus, polyps, and ulcerative colitis.

3. Materials and Methods

The computer-aided automatic detection of gastrointestinal diseases is an important research field. In this section, we describe the GoogleNet, ResNet-50, and AlexNet CNN models for early and accurate diagnosis of lower gastrointestinal disease. The general structure of the gastrointestinal detection system used in this work is shown in Figure 4. Preprocessing improves the images and removes noise and artifacts, while the image augmentation technique improves the training process. The convolutional layers extract the deepest and most important features from each image. The fully connected layers diagnose and classify the gastrointestinal images.

3.1. Dataset

The dataset was collected by the Vestre Viken Health Trust (VV) in Norway, from the gastroenterology department at Baerum Hospital, using endoscopic equipment. All images were described by experts from VV and the Cancer Registry of Norway (CRN). The CRN is the national body at Oslo University Hospital in charge of screening and early detection of cancer to prevent its spread. The Kvasir dataset consists of images interpreted by experts, including classes covering endoscopic procedures in the gastrointestinal tract and anatomical landmarks. The dataset contains hundreds of images per class, which is sufficient for use in deep learning and transfer learning. The dataset is in RGB colour space and consists of images with resolutions from 720 × 576 up to 1920 × 1072 pixels. In our work, the dataset contains 5,000 images equally divided into five classes: dyed-lifted polyps, normal cecum, normal pylorus, polyps, and ulcerative colitis. Figure 5 shows samples from the Kvasir dataset. The data are available at the following link: https://datasets.simula.no/kvasir/#download.

3.2. Preprocessing and Augmentation Techniques

Noise and artefacts arise from light reflections, photographic angles, and the mucous membranes surrounding the internal organs, and they reduce the performance of CNNs by increasing the complexity of feature extraction. Therefore, optimisation processes to improve image quality have been of interest to researchers. In this paper, gastrointestinal images were preprocessed before being input into the CNN models. First, each image was scaled for colour constancy, and the image sizes were changed to 224 × 224 pixels for the GoogleNet and ResNet-50 models and to 227 × 227 pixels for the AlexNet model. The mean of the three RGB channels was then calculated for the gastrointestinal images. Finally, the enhancement process was conducted with an average filter, which replaces each pixel with the average of the pixel and its neighbours; this process continues for all pixels of the image [41, 42].
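The paper does not include code for these steps; the following is a minimal NumPy sketch of the average-filter smoothing and per-channel mean computation described above (the 3 × 3 neighbourhood size and edge-replication padding are assumptions, as the paper does not specify them; resizing and colour-constancy scaling would be handled by an image library beforehand):

```python
import numpy as np

def average_filter(img, k=3):
    """Replace each pixel by the mean of its k x k neighbourhood.
    Edges are handled by replicating the border values (an assumption)."""
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(img, dtype=np.float64)
    # Sum the k*k shifted copies of the image, then divide by the window size.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def channel_means(img):
    """Mean of each of the three RGB channels, as computed before enhancement."""
    return img.reshape(-1, 3).mean(axis=0)
```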

CNN techniques depend mainly on the volume of data: a larger set of training data generates more promising results for the model. Because of the lack of medical images, data augmentation techniques improve CNN models for accurate classification [37, 43]. Data augmentation also works to balance the dataset when the number of images differs between classes. In this paper, the images of the training data were augmented through flipping, zooming, shifting, and rotation in both directions [44].
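As an illustration, the flip, rotation, and shift operations could be sketched as follows (the rotation angles and shift magnitude are illustrative assumptions, and zooming is omitted for brevity; the paper does not state the exact parameters):

```python
import numpy as np

def augment(img, rng):
    """Randomly flip, rotate (by multiples of 90 degrees), and shift a
    square image. Each call draws fresh random transformations."""
    if rng.random() < 0.5:
        img = np.fliplr(img)                              # horizontal flip
    img = np.rot90(img, k=int(rng.integers(0, 4)))        # rotation
    img = np.roll(img, int(rng.integers(-10, 11)), axis=1)  # horizontal shift
    return img
```

Applying `augment` several times to each training image multiplies the effective training-set size without collecting new endoscopy data.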

3.3. Convolutional Layers

The gastrointestinal dataset contains many features, such as shape, texture, and colour. The manual extraction of features requires substantial experience, especially when extracting images from a video, where many images do not include the disease and the disease features appear in only a few images that may be missed by the radiologist and the specialists. CNN algorithms work by extracting representative features of each disease through convolutional layers. GoogleNet contains many convolutional layers and nine inception modules, ResNet-50 has 49 convolutional layers, and AlexNet has five convolutional layers. These layers apply a set of filters and adjust the weights during the training phase to extract the deep features and pass them on to the next layer. The average and max pooling layers also reduce the size of the feature maps, representing each group of pixels either by its average or by its maximum value. The convolutional layers extract representative features of each image, for a total of 9,216 features per image, and represent them in feature maps to feed the classification layers.
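The block-wise reduction performed by a pooling layer can be sketched as below (non-overlapping 2 × 2 windows are assumed for simplicity; the networks above use various window sizes and strides):

```python
import numpy as np

def pool2d(fmap, size=2, mode="max"):
    """Reduce a 2-D feature map by taking the max (or average) of each
    non-overlapping size x size block."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    # Group rows and columns into blocks, then reduce over the block axes.
    blocks = fmap[:h * size, :w * size].reshape(h, size, w, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))
```

For the 2 × 2 max variant, a 4 × 4 map reduces to 2 × 2, keeping the largest activation in each block.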

3.4. Normalisation of the Image

Normalisation is one of the techniques used in training deep neural networks to normalise images for the purpose of accelerating the training process. Normalisation aids the appropriate choice of the learning rate by helping gradient descent converge; without normalisation, choosing the learning rate is more difficult and training takes longer. In our work, image normalisation was done by subtracting the mean of the complete training set from each pixel. The variance of the dataset was then calculated, and every pixel was divided by its square root (the standard deviation), centring the data and making the variance of each feature equal to one.
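A minimal sketch of this standardisation step, applied to a flattened training set (the small epsilon guarding against division by zero is an implementation assumption):

```python
import numpy as np

def normalise(train_images):
    """Subtract the training-set mean from every pixel, then divide by the
    standard deviation (the square root of the variance), so that the data
    are centred and each feature has unit variance."""
    mean = train_images.mean(axis=0)
    std = np.sqrt(train_images.var(axis=0) + 1e-8)  # epsilon avoids division by zero
    return (train_images - mean) / std
```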

3.5. Dropout Technology

CNNs have millions of parameters, which can lead to overfitting. For this reason, CNNs apply a Dropout technique to reduce overfitting. In our work, the Dropout technique was applied to GoogleNet, ResNet-50, and AlexNet at a rate of 40%, which means that 40% of the neurons are deactivated in each iteration. Therefore, the networks use different parameters in each iteration. However, the Dropout technique doubles the training time.
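The mechanism can be sketched as inverted dropout; the rescaling by 1/(1 − rate), which keeps the expected activation unchanged between training and inference, is the common implementation and an assumption here:

```python
import numpy as np

def dropout(activations, rate=0.4, rng=None, training=True):
    """Inverted dropout: during training, randomly zero a `rate` fraction of
    the neurons and rescale the survivors so the expected activation is
    unchanged. At inference time the activations pass through untouched."""
    if not training:
        return activations
    rng = rng or np.random.default_rng()
    keep = rng.random(activations.shape) >= rate   # True for surviving neurons
    return activations * keep / (1.0 - rate)
```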

3.6. Transfer Learning

Transfer learning is one of the most important steps in CNNs [45]. In our study, transfer learning and fine-tuning were applied to GoogleNet, ResNet-50, and AlexNet networks pretrained on the ImageNet dataset [46]. Transfer learning is based on training a model on one dataset to solve a specific problem and then transferring that learning to solve a related problem on another dataset [47, 48]. Transfer learning works by choosing a pretrained model appropriate to the size of the problem and using what has been learned to generalise to another task. Transfer learning also helps avoid overfitting. In this work, transfer learning was applied to GoogleNet, ResNet-50, and AlexNet, and the weights were fine-tuned. The GoogleNet, ResNet-50, and AlexNet models were trained on the ImageNet dataset, and the learning was then transferred to the gastrointestinal dataset. The last three layers of the models were removed and replaced with fully connected layers. The first connected layer received 9,216 neurons and output 4,096 neurons, while the second connected layer received 4,096 neurons and output 4,096 neurons. The softmax layer produced the five classes: dyed-lifted polyps, normal cecum, normal pylorus, polyps, and ulcerative colitis.
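The head-replacement idea can be sketched with a toy NumPy model: a frozen "pretrained" feature extractor feeds a freshly initialised fully connected softmax head, and gradient descent updates only the head. All dimensions and data here are toy stand-ins for the real 9,216-feature, five-class setting:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions stand in for the real ones (9,216-dim features, 5 classes).
N_FEAT, N_CLASSES, N_PIXELS = 32, 5, 48

# "Pretrained" feature extractor: its weights stay frozen during fine-tuning.
W_frozen = rng.standard_normal((N_FEAT, N_PIXELS)) * 0.1

def extract_features(x):
    """Frozen ReLU features, mimicking the transferred convolutional stack."""
    return np.maximum(x @ W_frozen.T, 0.0)

# Replacement head: a fresh fully connected layer trained on the new task.
W_head = np.zeros((N_CLASSES, N_FEAT))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_step(x, y_onehot, lr=0.1):
    """One gradient-descent step that updates only the new head."""
    global W_head
    f = extract_features(x)
    p = softmax(f @ W_head.T)
    grad = (p - y_onehot).T @ f / len(x)
    W_head -= lr * grad
```

In practice the same pattern is applied by loading the pretrained network, keeping or fine-tuning its convolutional weights, and training the replacement layers on the Kvasir images.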

3.7. Optimizers (Adam)

Optimizers are used to change and tune the parameters of neural networks, such as the weights, biases, and learning rate, in order to reduce the loss. Optimizer methods improve the deep learning classifier and help speed up the convergence of models. Adaptive Moment Estimation (Adam) is one of the best deep learning optimizers. Adam is a combination of RMSProp and momentum [49]. Adam calculates an adaptive learning rate for each parameter: like momentum, it keeps a decaying average of past gradients, $m_t$, and like RMSProp, it keeps a decaying average of past squared gradients, $v_t$. The following equations describe how Adam tunes the parameters:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$

where $m_t$ refers to the first moment of the gradient $g_t$, $v_t$ refers to the second moment, $\beta_1$ and $\beta_2$ indicate the decay rates, $\eta$ is the learning rate, and $\theta_t$ denotes the parameters at step $t$.
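A direct transcription of these update rules (the default hyperparameter values are the standard ones from [49]):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: decaying averages of the gradient (m) and its
    square (v) are bias-corrected, then used to scale the parameter step."""
    m = b1 * m + (1 - b1) * grad           # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```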

4. Experimental Results

Weights and parameters were adjusted for the GoogleNet, ResNet-50, and AlexNet CNNs in the training phase to evaluate the dataset of gastrointestinal diseases. Table 1 shows the training options for the three networks and the execution time in the MATLAB environment. The hardware was a 6th-generation Intel Core i5 CPU with a 4 GB NVIDIA GPU. In this paper, three experiments were conducted to evaluate a gastroenterology dataset containing 5,000 images divided equally among five classes. The same dataset was applied in all three experiments. The dataset was divided into 80% for training and 20% for testing and validation.

Figure 6 shows the confusion matrix and AUC obtained from GoogleNet, ResNet-50, and AlexNet. The confusion matrix records all test images that are correctly classified (true negative (TN) and true positive (TP)) and incorrectly classified (false positive (FP) and false negative (FN)). The AUC measures the area under the curve of the TP rate versus the FP rate. Table 2 and Figure 6 show the evaluation of the dataset for the three CNN models. Accuracy, sensitivity, specificity, and AUC are calculated according to equations (3)–(6). All networks showed promising results, as indicated in Table 2. The systems achieved an accuracy of 96.7%, 95%, and 97%; a sensitivity of 96.60%, 94.80%, and 96.80%; a specificity of 99%, 98.80%, and 99.20%; a validation accuracy of 96.70%, 97%, and 94.88%; and an AUC of 99.99%, 99.69%, and 99.98% for GoogleNet, ResNet-50, and AlexNet, respectively. Note that the results in all experiments were approximately equal.
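Equations (3)–(6) are not reproduced in this version of the text; assuming the standard definitions in terms of the TP, TN, FP, and FN counts defined below, they would read:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\%$$
$$\text{Sensitivity} = \frac{TP}{TP + FN} \times 100\%$$
$$\text{Specificity} = \frac{TN}{TN + FP} \times 100\%$$

with the AUC obtained as the area under the ROC curve of the TP rate plotted against the FP rate.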

TP represents the number of positive samples correctly classified. TN represents the number of negative samples correctly classified. FP represents the number of benign samples classified as malignant. FN represents the number of malignant samples classified as benign.
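For a multiclass confusion matrix such as the one in Figure 6, these measures can be computed in a one-versus-rest fashion and then averaged over the classes; a small sketch follows (the macro-averaging convention is an assumption, as the paper does not state how the per-class values are combined):

```python
import numpy as np

def metrics(cm):
    """Accuracy, sensitivity, and specificity from a confusion matrix
    (rows = true class, columns = predicted class), macro-averaged over
    the one-versus-rest counts of each class."""
    tp = np.diag(cm).astype(float)
    fn = cm.sum(axis=1) - tp          # true class missed
    fp = cm.sum(axis=0) - tp          # wrongly assigned to this class
    tn = cm.sum() - tp - fn - fp      # everything else
    accuracy = tp.sum() / cm.sum()
    sensitivity = (tp / (tp + fn)).mean()
    specificity = (tn / (tn + fp)).mean()
    return accuracy, sensitivity, specificity
```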

Table 3 and Figure 7 show the classification performance of GoogleNet, ResNet-50, and AlexNet at the level of each disease. ResNet-50 and AlexNet achieved 99% accuracy in classifying dyed-lifted polyps, while ResNet-50 and AlexNet achieved 95% accuracy in classifying normal cecum disease. GoogleNet achieved the best classification (100%) for normal pylorus disease. Polyps were classified at 98% by ResNet-50, while GoogleNet achieved the best performance for classifying ulcerative colitis (96%).

The proposed CNNs were evaluated through several measures previously examined in the literature, as shown in Table 4. All relevant literature reported accuracy between 70.40% and 90.20%, while our proposed system reached an accuracy of 97%. Related previous studies achieved sensitivity between 70.40% and 95.16%, while our proposed system achieved a sensitivity of 96.80%. The specificity in previous studies ranged between 70.90% and 93%, while our proposed system reached a specificity of 99%. The proposed system outperformed all previous studies with regard to AUC, as our proposed system reached an AUC of 99.99%. The comparison of the proposed system against the existing models is presented in Figure 7.

5. Conclusion

This work provides a robust framework for classifying the gastrointestinal tract diseases in the Kvasir dataset. Deep learning techniques can reduce the probability of developing malignant diseases by aiding in early detection, while also reducing the unnecessary removal of benign tumors. Video endoscopy is the most widely used method for diagnosing gastrointestinal polyps, but many human factors lead to improper diagnosis of gastrointestinal diseases. This paper presents three deep learning models, GoogleNet, ResNet-50, and AlexNet, that can direct the doctor's focus to the most important regions that may have been missed. The dataset was divided into 80% for training and 20% for testing and validation. The images were optimised to remove noise and artifacts. The data augmentation technique multiplies the number of images during the training phase to achieve high accuracy. Convolutional layers extract features of shape, colour, and texture. Overall, 9,216 features were extracted and passed into the fully connected layers, which produced 1,000 neurons. The softmax layer produces five classes, classifying each image into one of the five types of gastrointestinal findings. All three models achieved equally promising results. More advanced deep learning algorithms will be applied in future work.

Data Availability

The data are available in the following link: https://datasets.simula.no/kvasir/#download.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors extend their appreciation to the Deanship of Scientific Research at King Faisal University for funding this research work through the project number NA00095.