Research Article | Open Access
Sen Wang, Yuxiang Xing, Li Zhang, Hewei Gao, Hao Zhang, "Deep Convolutional Neural Network for Ulcer Recognition in Wireless Capsule Endoscopy: Experimental Feasibility and Optimization", Computational and Mathematical Methods in Medicine, vol. 2019, Article ID 7546215, 14 pages, 2019. https://doi.org/10.1155/2019/7546215
Deep Convolutional Neural Network for Ulcer Recognition in Wireless Capsule Endoscopy: Experimental Feasibility and Optimization
Wireless capsule endoscopy (WCE) has developed rapidly over the last several years and now enables physicians to examine the gastrointestinal tract without surgical operation. However, a large number of images must be analyzed to obtain a diagnosis. Deep convolutional neural networks (CNNs) have demonstrated impressive performance in different computer vision tasks. Thus, in this work, we aim to explore the feasibility of deep learning for ulcer recognition and optimize a CNN-based ulcer recognition architecture for WCE images. By analyzing the ulcer recognition task and characteristics of classic deep learning networks, we propose a HAnet architecture that uses ResNet-34 as the base network and fuses hyper features from the shallow layers with deep features in deeper layers to provide final diagnostic decisions. In total, 1,416 independent WCE videos are collected for this study. The overall test accuracy of our HAnet is 92.05%, and its sensitivity and specificity are 91.64% and 92.42%, respectively. According to our comparisons of F1, F2, and ROC-AUC, the proposed method performs better than several off-the-shelf CNN models, including VGG, DenseNet, and Inception-ResNet-v2, and classical machine learning methods with handcrafted features for WCE image classification. Overall, this study demonstrates that recognizing ulcers in WCE images via the deep CNN method is feasible and could help reduce the tedious image reading work of physicians. Moreover, our HAnet architecture, tailored for this problem, offers a sound option for network design.
Gastrointestinal (GI) diseases pose great threats to human health. Gastric cancer, for example, ranks fourth among the most common types of cancer globally and is the second most common cause of death from cancer worldwide. Conventional gastroscopy can provide accurate localization of lesions and is one of the most popular diagnostic modalities for gastric diseases. However, conventional gastroscopy is painful and invasive and cannot effectively detect lesions in the small intestine.
The emergence of wireless capsule endoscopy (WCE) has revolutionized the task of imaging GI issues; this technology offers a noninvasive alternative to the conventional method and allows exploration of the GI tract with direct visualization. WCE has been proven to have great value in evaluating focal lesions, such as those related to GI bleeding and ulcers, in the digestive tract.
WCE was first introduced in 2000 by Given Imaging and approved for use by the U.S. Food and Drug Administration in 2001. In the examination phase, a capsule is swallowed by a patient and propelled by peristalsis or magnetic fields to travel along the GI tract [3, 4]. While travelling, the WCE takes color pictures of the GI tract for hours at a frame rate of 2–4 photographs per second and transmits them to a data-recording device. The recorded images are viewed by physicians to arrive at a diagnosis. Figure 1 illustrates a wireless capsule.
Examination of WCE images is a time-consuming and tedious endeavor for doctors because a single scan for a patient may include up to tens of thousands of images of the GI tract. Experienced physicians may spend hours reviewing each case. Furthermore, abnormal frames may occupy only a tiny portion of all of the images obtained. Thus, physicians may miss the actual issue due to fatigue or oversight.
These difficulties have motivated researchers to turn to computer-aided systems for tasks such as ulcer detection, polyp recognition, and bleeding area segmentation [3, 5–15], to reduce the burden on physicians and help ensure diagnostic precision.
Ulcers are one of the most common lesions in the GI tract; an estimated 1 out of every 10 persons is believed to suffer from ulcers. An ulcer is defined as an area of tissue destroyed by gastric juice and showing a discontinuity or break in a bodily membrane [9, 11]. The color and texture of the ulcerated area are different from those of a normal GI tract. Some representative ulcer frames in WCE videos are shown in Figure 2. Ulcer recognition requires classification of each image in a WCE video as ulcerated or not, similar to the classification work in computer vision tasks.
Deep learning methods based on the convolutional neural network (CNN) have seen several breakthroughs in classification tasks in recent years. Considering the difficulty in mathematically describing the great variation in the shapes and features of abnormal regions in WCE images and the fact that deep learning is powerful in extracting information from data, we propose the application of deep learning methods to ulcer recognition using a large WCE dataset that provides adequate diversity. In this paper, we carefully analyze the problem of ulcer frame classification and propose a deep learning framework based on a multiscale feature concatenated CNN, hereinafter referred to as HAnet, to assist physicians in the WCE video examination task. Our network is verified to be effective on a large dataset containing WCE videos of 1,416 patients.
Our main contributions can be summarized in terms of the following three aspects: (1) The proposed architecture adopts state-of-the-art CNN models to efficiently extract features for ulcer recognition. It incorporates a special design that fuses hyper features from shallow layers and deep features from deep layers to improve the recognition of ulcers at vastly different scales. (2) To the best of our knowledge, this work is the first experimental study to include a large dataset consisting of over 1,400 WCE videos from ulcer patients to explore the feasibility of deep CNN for ulcer diagnosis. Some representative datasets presented in published works are listed in Table 1. The 92.05% accuracy and 0.9726 ROC-AUC of our proposed model demonstrate its great potential for practical clinical applications. (3) An extensive comparison with different state-of-the-art CNN network structures is provided to evaluate the most promising network for ulcer recognition.
2. Materials and Methods
2.1. Abnormality Recognition in WCE Videos
Prior related methods for abnormality recognition in WCE videos can be roughly divided into two classes: conventional machine learning techniques with handcrafted features and deep learning methods.
Conventional machine learning techniques are usually based on manually selected handcrafted features followed by application of some classifier. Features commonly employed in conventional techniques include color and textural features.
Lesion areas are usually of a different color from the surrounding normal areas; for example, bleeding areas may present as red and ulcerated areas may present as yellow or white. Fu et al. proposed a rapid bleeding detection method that extracts color features in the RGB color space. Besides the RGB color space, other color spaces, such as HSI/HSV and YCbCr, are also commonly used to extract features.
Texture is another type of feature commonly used for pattern recognition. Texture features include local binary patterns (LBP) and filter-based features. An LBP descriptor is based on a simple binary coding scheme that compares each pixel with its neighbors. The LBP descriptor, as well as its extended versions, such as uniform LBP and monogenic LBP, has been adopted in various WCE recognition tasks. Filter-based features, such as Gabor filters and wavelet transforms, are widely used in WCE image recognition tasks for their ability to describe images in multistage space. In addition, different textural features can be combined for better recognition performance. As previously demonstrated, the combination of wavelet transformation and uniform LBP can achieve automatic polyp detection with good accuracy.
CNN-based deep learning methods are known to show impressive performance. The error rate in computer vision challenges (e.g., ImageNet, COCO) has decreased rapidly with the emergence of various deep CNN architectures, such as AlexNet, VGGNet, GoogLeNet, and ResNet [21–28].
Many researchers have realized that handcrafted features merely encode partial information in WCE images and that deep learning methods are capable of extracting powerful feature representations that can be used in WCE lesion recognition and depth estimation [6, 7, 15, 18, 30–34].
A framework for hookworm detection has been proposed; this framework consists of an edge extraction network and a hookworm classification network. Inception modules are used to capture multiscale features and spatial correlations. The robustness and effectiveness of this method were verified on a dataset containing 440,000 WCE images of 11 patients. Yuan and Meng proposed an autoencoder-based neural network model that introduces an image manifold constraint to a traditional sparse autoencoder to recognize polyps in WCE images. The manifold constraint effectively forces images within the same category to share similar features and keeps images in different categories far apart, i.e., it preserves large intervariance and small intravariance among images. The proposed method was evaluated using 3,000 normal WCE images and 1,000 WCE images with polyps extracted from 35 patient videos. Utilizing temporal information of WCE videos with 3D convolution has also been explored for polyp detection. Deep learning methods have also been adopted for ulcer recognition [18, 35]; off-the-shelf CNN models are trained and evaluated in these studies. Experimental results and comparisons in these studies clearly demonstrate the superiority of deep learning methods over conventional machine learning techniques.
From the ulcer size analysis of our dataset, we find that most of the ulcers occupy only a tiny area in the whole image. Deep CNNs can inherently compute feature hierarchies layer by layer. Hyper features from shallow layers have high resolution but lack representation capacity; by contrast, deep features from deep layers are semantically strong but have poor resolution [36–38]. These observations motivate us to propose a framework that fuses hyper and deep features to achieve ulcer recognition at vastly different scales. We give a detailed description of the ulcer size analysis and the proposed method in Sections 2.2 and 2.3.
2.2. Ulcer Dataset
Our dataset is collected using a WCE system provided by Ankon Technologies Co., Ltd. (Wuhan, Shanghai, China). The WCE system consists of an endoscopic capsule, a guidance magnet robot, a data recorder, and a computer workstation with software for real-time viewing and controlling. The capsule is 28 mm × 12 mm in size and contains a permanent magnet in its dome. Images are recorded and transferred at a speed of 2 frames/s. The resolution of the WCE image is 480 × 480 pixels.
The dataset used in this work to evaluate the performance of the proposed framework contains 1,416 WCE videos from 1,416 patients (male 73%, female 27%), i.e., one video per patient. The WCE videos are collected from more than 30 hospitals and 100 medical examination centers through the Ankon WCE system. Each video is independently annotated by at least two gastroenterologists. If the difference between annotation bounding boxes of the same ulcer is larger than 10%, an expert gastroenterologist reviews the annotation and provides a final decision. The age distribution of patients is illustrated in Figure 3. The entire dataset consists of 1,157 ulcer videos and 259 normal videos. In total, 24,839 frames are annotated as ulcers by gastroenterologists. To balance the volume of each class, 24,225 normal frames are randomly extracted from normal videos for this study to match the 24,839 representative ulcer frames. A mask with a diameter of 420 pixels is used to crop the center area of each image in preprocessing. This preprocessing does not change the image size.
We plot the distribution of ulcer size in our dataset in Figure 4. The vertical and horizontal axes denote the number of images and the ratio of the ulcerated area to the whole image size, respectively. Despite the inspiring success of CNNs in the ImageNet competition, ulcer recognition presents challenges beyond those of the ImageNet classification task because lesions normally occupy only a small area of WCE images and the structures of lesions are rather subtle. In Figure 4, about 25% of the ulcers occupy less than 1% of the area of the whole image and more than 80% of the ulcers occupy less than 5% of the area of the image. Hence, a specific network design is proposed to account for the small ulcer problem and achieve good sensitivity.
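The masking step can be illustrated with a minimal sketch. It assumes the mask is circular and centered on the 480 × 480 frame, which the text does not state explicitly, and it treats the image as a single-channel nested list for simplicity. The function names `circular_mask` and `apply_mask` are hypothetical.

```python
def circular_mask(size=480, diameter=420):
    """Return a size x size grid of 0/1 flags; 1 marks pixels inside the circle.

    Assumption: the mask is a centered circle (the shape is not given in the text).
    """
    center = (size - 1) / 2.0            # geometric center of the pixel grid
    radius_sq = (diameter / 2.0) ** 2
    return [
        [1 if (r - center) ** 2 + (c - center) ** 2 <= radius_sq else 0
         for c in range(size)]
        for r in range(size)
    ]


def apply_mask(image, mask, fill=0):
    """Blank out pixels outside the mask; the image keeps its original size."""
    return [
        [pixel if keep else fill for pixel, keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]
```

Because masked pixels are overwritten rather than removed, the output has the same height and width as the input, in line with the statement that preprocessing does not change the image size.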
2.3. HAnet-Based Ulcer Recognition Network with Fused Hyper and Deep Features
In this section, we introduce our design and the proposed architecture of our ulcer recognition network.
Inspired by the design concept of previous works that deal with object recognition at vastly different scales [36–38], we propose an ulcer recognition network with a hyperconnection architecture (HAnet). The overall pipeline of this network is illustrated in Figure 5. Fundamentally, HAnet fuses hyper and deep features. Here, we use ResNet-34 as the base feature-extraction network because, according to our experiments (demonstrated in Section 3), it provides the best results. Global average pooling (GAP) is used to generate features for each layer. GAP averages each feature map over its spatial dimensions, so that it reduces tensors with dimensions h × w × d to 1 × 1 × d. Hyper features can be extracted from multiple intermediate layers (layers 2 and 3 in this case) of the base network; they are concatenated with the features of the last feature-extraction layer (layer 4 in this case) to make the final decision.
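The pooling and fusion steps described above can be sketched in a few lines. This is an illustrative stand-in written with plain nested lists rather than the PyTorch implementation used in the study; the function names are hypothetical, and the feature maps are assumed to be stored in h × w × d order.

```python
def global_average_pool(feature_map):
    """Reduce an h x w x d feature map (nested lists) to a length-d vector."""
    h, w, d = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    pooled = [0.0] * d
    for row in feature_map:
        for pixel in row:
            for k, value in enumerate(pixel):
                pooled[k] += value
    return [total / (h * w) for total in pooled]


def fuse_features(layer2_map, layer3_map, layer4_map):
    """Fusion in the style of hyper(l23): concatenate the pooled vectors."""
    return (global_average_pool(layer2_map)
            + global_average_pool(layer3_map)
            + global_average_pool(layer4_map))
```

In the standard ResNet-34 definition, layers 2, 3, and 4 output 128, 256, and 512 channels, so a fused vector built this way would have 896 entries before the final fully connected layer, assuming the modified base network keeps those channel widths.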
Our WCE system outputs color images with a resolution of 480 × 480 pixels. Experiments by the computer vision community [36, 40] have shown that high-resolution input images are helpful to the performance of CNN networks. To fully utilize the output images from the WCE system, we modify the base network to receive input images with a size of 480 × 480 × 3 without cropping or rescaling.
2.4. Loss of Weighted Cross Entropy
Cross-entropy (CE) loss is a common choice for classification tasks. For binary classification, CE is defined as
$$\mathrm{CE}(p, y) = -y \log(p) - (1 - y) \log(1 - p), \tag{1}$$
where $y \in \{0, 1\}$ denotes the ground-truth label of the sample and $p \in [0, 1]$ is the estimated probability of a sample belonging to the class with label 1. Mathematically, the minimization process of CE is to enlarge the probabilities of samples with label $y = 1$ and suppress the probabilities of samples with label $y = 0$.
To deal with possible imbalance between classes, a weighting factor can be applied to different classes, which can be called weighted cross-entropy (wCE) loss:
$$\mathrm{wCE}(p, y, w) = -w \cdot y \log(p) - (1 - y) \log(1 - p), \tag{2}$$
where $w$ denotes the weighting factor to balance the loss of different classes. Considering the overall small and variable size distribution of ulcers, as well as possible imbalance in the large dataset, we set wCE as our loss function.
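A per-sample version of the weighted loss can be written as a short sketch. It assumes the weighting factor multiplies only the term for label 1 (the ulcer class), which is consistent with the later observation that sensitivity rises as the factor increases. The function name and the clipping constant `eps` are illustrative choices, not part of the original method.

```python
import math


def weighted_cross_entropy(p, y, w=1.0, eps=1e-12):
    """Weighted binary cross-entropy for one sample.

    p: estimated probability of label 1, y: ground-truth label (0 or 1),
    w: weighting factor applied to the label-1 term (assumption).
    """
    p = min(max(p, eps), 1.0 - eps)      # clip to avoid log(0)
    return -w * y * math.log(p) - (1 - y) * math.log(1.0 - p)
```

With `w = 1.0` this reduces to plain cross-entropy; with `w = 4.0`, a missed ulcer frame costs four times as much as an equally confident mistake on a normal frame.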
2.5. Evaluation Criteria
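To evaluate the performance of classification, accuracy (AC), sensitivity (SE), and specificity (SP) are used as metrics:
$$\mathrm{AC} = \frac{TP + TN}{N}, \qquad \mathrm{SE} = \frac{TP}{TP + FN}, \qquad \mathrm{SP} = \frac{TN}{TN + FP}.$$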
Here, N is the total number of test images and TP, FP, TN, and FN are the number of correctly classified images containing ulcers, the number of normal images falsely classified as ulcer frames, the number of correctly classified images without ulcers, and the number of images with ulcers falsely classified as normal images, respectively.
AC gives an overall assessment of the performance of the model, SE denotes the model’s ability to detect ulcer images, and SP denotes its ability to distinguish normal images. Ideally, we expect both high SE and SP, although some trade-offs between these metrics exist. Considering that further manual inspection by the doctor of ulcer images detected by computer-aided systems is compulsory, SE should be as high as possible with no negative impact on overall AC.
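The three metrics follow directly from the confusion counts defined above. The sketch below uses label 1 for ulcer frames and label 0 for normal frames; the function name is hypothetical, and the code assumes the test set contains at least one frame of each class.

```python
def classification_metrics(y_true, y_pred):
    """Return AC, SE, and SP for binary labels (1 = ulcer, 0 = normal)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    n = tp + fn + tn + fp
    return {
        "AC": (tp + tn) / n,       # overall accuracy
        "SE": tp / (tp + fn),      # fraction of ulcer frames detected
        "SP": tn / (tn + fp),      # fraction of normal frames kept as normal
    }
```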
We use a 5-fold cross-validation strategy at the case level to evaluate the performances of different architectures; this strategy splits the total number of cases evenly into five subsets. Here, one subset is used for testing, and the four other subsets are used for training and validation. Figure 6 illustrates the cross-validation operation. In the present study, the training, validation, and test sets account for about 70%, 10%, and 20% of the cases, respectively. Normal or ulcer frames are then extracted from each case to form the training/validation/testing dataset. We perform case-level splitting because adjacent frames in the same case are likely to share similar details. We avoid frame-level splitting because near-identical frames from one case could then appear in both the training and test sets, which would reward overfitting and inflate the test results. The validation dataset is used to select the best model in each training process, i.e., the model with the best validation accuracy during the training iterations is saved as the final model.
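Case-level splitting can be sketched as follows. The shuffling seed, the helper names, and the way the validation cases are carved out of the four non-test folds are assumptions made for illustration; the exact procedure is given only in Figure 6.

```python
import random


def case_level_folds(case_ids, k=5, seed=0):
    """Shuffle unique case IDs and deal them into k roughly equal folds."""
    ids = sorted(set(case_ids))
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]


def split_for_fold(folds, test_index, val_fraction=0.125):
    """Use one fold for testing; split the remaining cases into train/validation.

    val_fraction = 0.125 of the remaining 80% gives about 10% of all cases.
    """
    test = list(folds[test_index])
    rest = [c for i, fold in enumerate(folds) if i != test_index for c in fold]
    n_val = max(1, round(len(rest) * val_fraction))
    return rest[n_val:], rest[:n_val], test
```

Frames would then be drawn per case, so every frame of a given patient lands in exactly one of the three sets.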
In this section, the implementation process of the proposed method is introduced, and its performance is evaluated by comparison with several other related methods, including state-of-the-art CNN methods and some representative WCE recognition methods based on conventional machine learning techniques.
3.1. Network Architectures and Training Configurations
The proposed HAnet connects hyper features to the final feature vector with the aim of enhancing the recognition of ulcers of different sizes. The HAnet models are distinguished by their architecture and training settings, which include three architectures and three training configurations in total. We illustrate these different architectures and configurations in Table 2.
Three different architectures can be obtained when hyper features from layers 2 and 3 are used for decision in combination with features from layer 4 of our ResNet backbone: in Figure 7(a), hyper(l2), which fuses the hyper features from layer 2 with the deep features of layer 4, in Figure 7(b), hyper(l3), which fuses features from layers 3 and 4, and in Figure 7(c), hyper(l23), which fuses features from layers 2, 3, and 4 to form the third HAnet. Figure 7 provides comprehensive diagrams of these different HAnet architectures.
Each HAnet can be trained with three configurations. Figure 8 illustrates these configurations.
(1) The whole HAnet is trained using pretrained ResNet weights from ImageNet for initialization (denoted as ImageNet in Table 2). The total training process lasts for 40 epochs, and the batch size is fixed to 16 samples. The learning rate is initialized to $10^{-3}$ and decayed by a factor of 10 every 20 epochs. The momentum parameter is set to 0.9. Experimental results show that 40 epochs are adequate for training to converge. Weighted cross-entropy loss is used as the optimization criterion. The best model is selected based on validation results.
(2) ResNet(480) is first fine-tuned on our dataset using pretrained ResNet weights from ImageNet for initialization. The training settings are identical to those in (1). Convergence is achieved during training, and the best model is selected based on validation results. We then train the whole HAnet using the fine-tuned ResNet(480) models for initialization and update all weights in HAnet (denoted as all-update in Table 2). Training lasts for 40 epochs. The learning rate is set to $10^{-4}$, momentum is set to 0.9, and the best model is selected based on validation results.
(3) The weights of the fine-tuned ResNet(480) model are used, and only the last fully connected (FC) layer is updated in HAnet (denoted as FC-only in Table 2). The best model is selected based on validation results. Training lasts for 10 epochs, the learning rate is set to $10^{-4}$, and momentum is set to 0.9.
For example, the first HAnet in Table 2, hyper(l2) FC-only, refers to the architecture fusing the features from layer 2 and the final layer 4; it uses ResNet (480) weights as the feature extractor and only the final FC layer is updated during HAnet training.
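The hyperparameters stated for the three training configurations, together with the step decay used when training from ImageNet weights, can be collected in a small sketch. The dictionary keys and the function name are hypothetical; only values given in the text are included, so settings that are not reported (for example, the batch size for the all-update and FC-only runs) are left out.

```python
# Settings as stated in the text; key names are illustrative.
TRAINING_CONFIGS = {
    "ImageNet":   {"epochs": 40, "base_lr": 1e-3, "momentum": 0.9, "batch_size": 16},
    "all-update": {"epochs": 40, "base_lr": 1e-4, "momentum": 0.9},
    "FC-only":    {"epochs": 10, "base_lr": 1e-4, "momentum": 0.9},
}


def step_decay_lr(epoch, base_lr=1e-3, decay_factor=10.0, step_epochs=20):
    """Learning rate at a given epoch: divided by decay_factor every step_epochs."""
    return base_lr / (decay_factor ** (epoch // step_epochs))
```

Under this schedule, the ImageNet configuration would train at a rate of 0.001 for epochs 0–19 and at 0.0001 for epochs 20–39.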
To achieve better generalizability, data augmentation was applied online in the training procedure. The images are randomly rotated between 0° and 90° and flipped with a probability of 50%. Our network is implemented using PyTorch.
3.2. Refinement of the Weighting Factor for Weighted Cross-Entropy
To demonstrate the impact of the weighting factor in equation (2), we examine the cross-validation results of model recognition accuracy with different weighting factors. The AC, SE, and SP curves are shown in Figure 9.
AC varies with changes in weighting factor. In general, SE improves while SP is degraded as the weighting factor increases. Detailed AC, SE, and SP values are listed in Table 3. ResNet-18(480) refers to experiments on a ResNet-18 network with a WCE full-resolution image input of 480 × 480 × 3. A possible explanation for the observed effect of the weighting factor is that the ulcer dataset contains many consecutive frames of the same ulcer, and these frames may share marked similarities. Thus, while the frame numbers of the ulcer and normal dataset are comparable, the information contained by each dataset remains unbalanced. The weighting factor corrects or compensates for this imbalance.
In the following experiments, 4.0 is used as the weighting factor as it outperforms other choices and simultaneously achieves good balance between SE and SP.
3.3. Selection of Hyper Architectures
We tested 10 models in total, as listed in Table 4, including a ResNet-18(480) model and nine HAnet models based on ResNet-18(480). The resolution of input images is 480 × 480 × 3 for all models.
According to the results in Table 4, the FC-only and all-update hyper models consistently outperform the ResNet-18(480) model in terms of the AC criterion, which demonstrates the effectiveness of HAnet architectures. Moreover, FC-only models generally perform better than all-update models, thus implying that ResNet-18(480) extracts features well and that further updates may corrupt these features.
The hyper ImageNet models, including hyper(l2) ImageNet, hyper(l3) ImageNet, and hyper(l23) ImageNet, seem to give weak performance. Hyper ImageNet models and the other hyper models share the same architectures. The difference between these types of models is that the hyper ImageNet models are trained with the pretrained ImageNet ResNet-18 weights while the other models use ResNet-18(480) weights that have been fine-tuned on the WCE dataset. This finding reveals that a straightforward base net such as ResNet-18(480) shows great power in extracting features. The complicated connections of HAnet may prohibit the network from reaching good convergence points.
To fully utilize the advantages of hyper architectures, we recommend a two-stage training process: (1) Train a ResNet-18(480) model based on the ImageNet-pretrained weights and then (2) use the fine-tuned ResNet-18(480) model as a backbone feature extractor to train the hyper models. We denote the best model in all hyper architectures as HAnet-18(480), i.e., a hyper(l23) FC-only model.
Additionally, the preceding exploration is based on ResNet-18, and the results indicate that a hyper(l23) FC-only architecture based on a ResNet backbone feature extractor fine-tuned on WCE images may be expected to improve the recognition of lesions in WCE videos. To optimize our network, we examine the performance of various ResNet series members to determine an appropriate backbone. The corresponding results are listed in Table 5; ResNet-34(480) has better performance than ResNet-18(480) and ResNet-50(480). Thus, we take ResNet-34(480) as our backbone to train HAnet-34(480). The training settings are described in Section 3.1. Figure 10 gives the overall progression of HAnet-34(480).
3.4. Comparison with Other Methods
To evaluate the performance of HAnet, we compared the proposed method with several other methods, including several off-the-shelf CNN models [26–28] and two representative handcrafted-feature based methods for WCE recognition [3, 42]. The off-the-shelf CNN models, including VGG, DenseNet, and Inception-ResNet-v2, are trained to convergence with the same settings as ResNet-34(480), and the best model is selected based on the validation results. For handcrafted-feature based methods, grid searches are carried out to optimize the hyperparameters.
We performed repeated 2 × 5-fold cross-validation to provide sufficient measurements for statistical tests. Table 6 compares the detailed results of HAnet-34(480) with those of other methods. On average, HAnet-34(480) performs better in terms of AC, SE, and SP than the other methods. Figure 11(a) gives the location of each model, considering its inference time and accuracy. Figure 11(b) shows the results of the paired t-tests.
Among the models tested, HAnet-34(480) yields the best performance with good efficiency and accuracy. The number in each grid cell of Figure 11(b) denotes the p value of the paired test between the two models in the corresponding row and column. The improvement of HAnet-34(480) is statistically significant at the 0.01 level compared with the other methods.
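The statistic behind a paired t-test on cross-validation results can be sketched as below. The function name is hypothetical, and the sketch stops at the t statistic and degrees of freedom; converting these to a p value requires the t distribution, for which a statistics package would normally be used.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for paired per-fold scores of two models."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)    # sample standard deviation (n - 1)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1
```

With repeated 2 × 5-fold cross-validation, each model would contribute ten paired scores, giving nine degrees of freedom for each comparison, assuming one score per fold is used.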
Table 7 gives more evaluation results based on several criteria, including precision (PRE), recall (RECALL), F1 and F2 scores, and ROC-AUC. HAnet-34 outperforms all other models based on these criteria.
In this section, the recognition capability of the proposed method for small lesions is demonstrated and discussed. Recognition results are also visualized via the class activation map (CAM) method, which indicates the localization potential of CNN networks for clinical diagnosis.
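For reference, the F1 and F2 scores are special cases of the general F-beta score, and ROC-AUC equals the probability that a randomly chosen ulcer frame is scored higher than a randomly chosen normal frame. A minimal sketch under those standard definitions follows; the function names are hypothetical, and the pairwise AUC loop is quadratic in the number of frames, so it is meant for illustration rather than for large test sets.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta = 1 gives F1, beta = 2 weights recall more heavily."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return 0.0 if denom == 0 else (1 + b2) * precision * recall / denom


def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

F2 suits the screening setting described in Section 2.5 because it penalizes missed ulcer frames (low recall) more than false alarms.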
4.1. Enhanced Recognition of Small Lesions by HAnet
To analyze the recognition capacity of the proposed model, the sensitivities of ulcers of different sizes are studied, and the results of ResNet-34(480) and the best hyper model, HAnet-34(480), are listed in Table 8.
Based on the results of each row in Table 8, most of the errors noticeably occur in the small size range for both models. In general, the larger the ulcer, the easier its recognition. In the vertical comparison, the ulcer recognition of HAnet-34(480) outperforms that of ResNet-34(480) at all size ranges including small lesions.
4.2. Visualization of Recognition Results
To better understand our network, we use a CAM generated from GAP to visualize the behavior of HAnet by highlighting the relatively important parts of an image and providing object location information. CAM is the weighted linear sum of the activation maps in the last convolutional layer. The image regions most relevant to a particular category can be obtained simply by upsampling the CAM. Using CAM, we can verify what the network has actually learned. Six representative results are displayed in Figure 12. For each pair of images, the left image shows the original frame, while the right image shows the CAM result.
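The weighted-sum step can be illustrated with a small sketch. It assumes the activation maps are given as d separate h × w grids and that the weights are those connecting each pooled channel to the class of interest in the final fully connected layer. Nearest-neighbor upsampling is used here only for simplicity; the interpolation used in the study is not specified. Both function names are hypothetical.

```python
def class_activation_map(activations, class_weights):
    """Weighted linear sum of d activation maps (each h x w) for one class."""
    h, w = len(activations[0]), len(activations[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for fmap, weight in zip(activations, class_weights):
        for i in range(h):
            for j in range(w):
                cam[i][j] += weight * fmap[i][j]
    return cam


def upsample_nearest(cam, factor):
    """Enlarge the map by an integer factor using nearest-neighbor repetition."""
    return [[v for v in row for _ in range(factor)]
            for row in cam for _ in range(factor)]
```

Overlaying the upsampled map on the input frame highlights the regions that contributed most to the ulcer score.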
These results displayed in Figure 12 demonstrate the potential use of HAnet for locating ulcers and easing the work of clinical physicians.
In this work, we proposed a CNN architecture for ulcer detection that uses a state-of-the-art CNN architecture (ResNet-34) as the feature extractor and fuses hyper and deep features to enhance the recognition of ulcers of various sizes. A large ulcer dataset containing WCE videos from 1,416 patients was used for this study. The proposed network was extensively evaluated and compared with other methods using overall AC, SE, SP, F1, F2, and ROC-AUC as metrics.
Experimental results demonstrate that the proposed architecture outperforms off-the-shelf CNN architectures, especially for the recognition of small ulcers. Visualization with CAM further demonstrates the potential of the proposed architecture to locate a suspicious area accurately in a WCE image. Taken together, the results suggest a potential method for the automatic diagnosis of ulcers from WCE videos.
Additionally, we conducted experiments to investigate the effect of the number of cases. We used the split 0 datasets from the cross-validation experiment: 990 cases for training, 142 cases for validation, and 283 cases for testing. We constructed different training datasets from the 990 cases while keeping the validation and test datasets fixed. First, we experimented with different numbers of training cases, randomly selecting 659, 423, and 283 cases from the 990 cases. Then, we conducted another experiment using a similar number of frames as in the previous experiment but distributed across all 990 cases. The results demonstrate that, when a similar number of frames is used for training, test accuracy is better for training datasets with more cases. This can be attributed to the richer diversity introduced by more cases. We therefore recommend using as many cases as possible to train the model.
While the performance of HAnet is very encouraging, improving its SE and SP further is necessary. For example, the fusion strategy in the proposed architecture involves concatenation of features from shallow layers after GAP. Semantic information in hyper features may not be as strong as that in deep features, i.e., falsely activated neural units due to the relatively limited receptive field in the shallow layers may add unnecessary noise to the concatenated feature vector when GAP is utilized. An attention mechanism that can focus on the suspicious area may help address this issue. Temporal information from adjacent frames could also be used to provide external guidance during recognition of the current frame. In future work, we will explore more techniques in network design to improve ulcer recognition.
The architectures of members of the ResNet series, including ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152, are illustrated in Table 9. These members share a similar pipeline consisting of seven components (i.e., conv1, maxpool, layer 1, layer 2, layer 3, layer 4, and avgpool + fc). Each layer in layers 1–4 has different subcomponents denoted as $[\,\cdot\,] \times n$, where $[\,\cdot\,]$ represents a specific basic module consisting of 2-3 convolutional layers and $n$ is the number of times the basic module is repeated. Each row in $[\,\cdot\,]$ describes the convolution layer setting, e.g., a row of the form $k \times k, 64$ means a convolution layer with a kernel size and channel of $k \times k$ and 64, respectively.
The WCE datasets used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Hao Zhang is currently an employee of Ankon Technologies Co., Ltd. (Wuhan, Shanghai, China). He helped in providing WCE data and organizing annotation. The authors would like to thank Ankon Technologies Co., Ltd. (Wuhan, Shanghai, China) for providing the WCE data (ankoninc.com.cn). They would also like to thank the participating engineers of Ankon Technologies Co., Ltd., for their thoughtful support and cooperation. This work was supported by a Research Project from Ankon Technologies Co. Ltd. (Wuhan, Shanghai, China), the National Key Scientific Instrument and Equipment Development Project under Grant no. 2013YQ160439, and the Zhangjiang National Innovation Demonstration Zone Special Development Fund under Grant no. ZJ2017-ZD-001.
Copyright © 2019 Sen Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.