Diabetic retinopathy (DR) is a complication of long-standing diabetes that is hard to detect in its early stage because it shows only a few symptoms. Nowadays, the diagnosis of DR usually requires taking digital fundus images, as well as images using optical coherence tomography (OCT). Since OCT equipment is very expensive, it will benefit both the patients and the ophthalmologists if an accurate diagnosis can be made based solely on reading digital fundus images. In this paper, we present a novel algorithm based on a deep convolutional neural network (DCNN). Unlike the traditional DCNN approach, we replace the commonly used max-pooling layers with fractional max-pooling. Two of these DCNNs with a different number of layers are trained to derive more discriminative features for classification. After combining features from the image metadata and the DCNNs, we train a support vector machine (SVM) classifier to learn the underlying boundary of the distributions of each class. For the experiments, we used the publicly available DR detection database provided by Kaggle. We used 34,124 training images and 1,000 validation images to build our model and tested it with 53,572 testing images. The proposed DR classifier classifies the stages of DR into five categories, labeled with an integer ranging between zero and four. The experimental results show that the proposed method can achieve a recognition rate of up to 86.17%, which is higher than previously reported in the literature. In addition to designing a machine learning algorithm, we also develop an app called “Deep Retina.” Equipped with a handheld ophthalmoscope, the average person can take fundus images by themselves and obtain an immediate result, calculated by our algorithm. It is beneficial for home care, remote medical care, and self-examination.

1. Introduction

The global cost of treating adult diabetes and its induced chronic complications was USD 850 billion in 2017. Diabetic retinopathy (DR) is one of the most common and serious complications of diabetes mellitus and is a leading cause of low vision and blindness in working-age adults [1, 2]. The International Diabetes Federation (IDF) estimated that the global population with diabetes in 2017 was 451 million and that over one-third of this population had DR [3], representing a tremendous population at risk of visual impairment or blindness. By 2045, the worldwide prevalence of diabetes is expected to increase to 693 million people [3]. In addition, almost half (49.7%) of all people living with diabetes remain undiagnosed for years because of silent symptoms [3]. However, long-term high blood sugar levels ultimately destroy blood vessels and nerves, leading to complications such as cardiovascular disease and blindness. Detection and treatment of DR in the early stage will prevent its development or progression.

The diagnosis and severity of DR are based on retinal examination. Clinically, DR can be divided into two categories: (1) nonproliferative diabetic retinopathy (NPDR), with exudation and ischemia of varying severity but without retinal neovascularization, and (2) proliferative diabetic retinopathy (PDR), which is characterized by neovascularization with or without its complications of tractional retinal detachment and the initial appearance of vitreous hemorrhage. Microvascular diseases of NPDR include microaneurysms, retinal dot and blot hemorrhages, lipid exudates, venous beading change, and intraretinal microvascular abnormalities (IRMA). Based on the degree and extent of these lesions, NPDR can be divided into three levels: mild NPDR presents with microaneurysms or few retinal hemorrhages; moderate NPDR shows more severe microaneurysms, hemorrhage, or soft exudate, but does not reach the level of severe NPDR, which is associated with marked retinal hemorrhage in 4 quadrants, venous beading in at least 2 quadrants, and IRMA in at least 1 quadrant. Table 1 summarizes the DR categories with their manifestations.

Manual grading by ophthalmologists has been the mainstay of DR screening in the past decades. However, due to the expanding population with diabetes and the recent advances in technology, automated detection of DR offers the potential to provide an efficient and cost-effective approach to screening. Current commercialized automated retinal image analysis systems (ARIAs), such as iGradingM, Retmarker, and EyeArt, focus on differentiating diseased/no disease, or detection of referable DR [5, 6]. Nonetheless, ARIAs are currently not sufficiently sophisticated to classify different levels of DR, which means that identifying the subtle change between levels is still a challenging task for the technique of medical image analysis. Figure 1 shows example fundus images for each lesion.

In addition to the accuracy of medical image processing, the mobility and portability of medical examination equipment are of equal importance. Currently, the acquisition of digital fundus images requires a cooperative patient to sit in front of the fundus camera in a room with ambient lighting minimized or turned off. The patient looks forward into the camera at a fixation light while infrared fundus imaging is used to focus on the area of interest. Many nonmydriatic cameras have software that automatically detects the posterior pole of the eye and takes a picture once the back of the eye is in focus. The RGB image sensor still requires a flash to capture images in the visible light spectrum. However, the digital fundus imagers most commonly used in clinics are bulky and expensive, as shown in Figure 2, which limits their suitability for large-scale screening.

One of the major goals of this study, besides increasing the classification accuracy using artificial intelligence, is to come up with a new system framework for DR screening. The new framework combines the advantages of mobile computing, cloud computing, big data, and artificial intelligence. The components of the proposed framework can be described as follows:
(i) Mobile block: The fundus image acquisition is achieved using a hand-held fundus imager, coupled with a self-developed iPhone app. The imager is small and lightweight, so it can be carried inside a backpack and deployed very conveniently. This portability can benefit medical services in remote rural areas.
(ii) Cloud block: The proposed system does not sacrifice its computational performance for its portability. Thanks to the architecture of cloud computing, the core of the computational resources is moved to the cloud and can be scaled up flexibly as the number of requests increases. We developed a highly efficient deep learning algorithm that runs on a cloud server and is able to respond to a diagnosis request within 10 seconds.
(iii) Big data block: The cloud-based architecture also helps to collect big data. As more and more end devices (hand-held fundus cameras) are used, the number of fundus images passed into the cloud will increase accordingly. By storing all of these fundus images, we are able to make good use of such a big dataset, for example for machine learning model retraining, new feature exploration, or cross-domain data mining for different types of ophthalmological diseases.

In summary, in this paper, we propose a new system framework for DR screening, based on artificial intelligence, mobile computing, cloud computing, and big data analytics. Figure 3 shows an illustration of the proposed system. Such a system is a new paradigm for telemedical service and will benefit rural areas where medical resources are insufficient.

In the following sections, we present our ideas and the experimental results. Section 2 reviews the literature on retinal vessel segmentation, which is the foundation of many DR classification algorithms. Section 3 reviews the literature on DR detection. Section 4 describes the proposed deep learning and machine learning algorithms in detail. We show the experimental results in Section 5 and discuss some important findings in Section 6. Finally, a conclusion is given in Section 7.

2. Literature Review of Retinal Vessel Segmentation

In the process of identifying DR, it is pivotal to locate the retinal vessels. If the positions of the vessels are known accurately, we can determine whether the patient is suffering from DR based on information about the precise location and thickness of the vessels. However, vessel tracking is a complex process because fundus images contain many other structures besides vessels. Numerous vessel segmentation methods have been proposed, which can be broadly divided into five categories: vascular tracking, matched filtering, morphological processing, deformation models, and machine learning.

2.1. Methods of Vascular Tracking

Methods of vascular tracking are based on the continuous structure of vessels: they start at an initial point and follow the vessels until no further vessels are found. The critical factor in this procedure is the setting of the initial point, as this affects the accuracy of vessel segmentation. Currently, the initial point can be set either manually or automatically.

The earliest adaptive vascular tracking method was proposed by Liu and Sun [7] in 1993, which extracts the vasculature from X-ray angiograms. First, given an initial point and direction within a vessel, the authors apply an “extrapolation-update” scheme that involves estimating local vessel trajectories. Once a vessel fragment has been tracked, it is removed from the image. This procedure is repeated until the vascular tree has been extracted. The drawbacks of this strategy are that due to the algorithm used, the user must set the vessel starting points and that the approach does not seem adaptable to three-dimensional extraction. In 1999, an automatic vascular tracking method was developed by Can et al. [8]. This strategy mainly collects pixel wide vascular local minimum points (usually in the middle of a vessel) to perform tracking. Vlachos and Dermatas [9] suggested a multiscale line tracking method with morphological postprocessing. Yin et al. [10] proposed a retinal vascular tree extraction, based on iterative tracking and Bayesian method.

The advantage of vascular tracking is that it can provide local information about characteristics, such as the diameter/width and direction of vessels. However, the vascular tracking performance can be easily affected by crossing or branching of vessels, which reduces the identification efficiency.

2.2. Methods of Matched Filtering

Matched filtering methods employ multiple matched filters for extraction, so designing proper filters is essential to detect vessels. Since the gray-scale profile across a fundus vessel approximately follows a Gaussian, an intuitive method is to use the maximum response of the image after filtering to find vessel points. As the diameter/width of vessels varies, a multiscale Gaussian filter method is often used for vessel tracking.

In 1989, Chaudhuri et al. [11] pioneered the application of Gaussian filters in vessel tracking by using some vascular characteristics, such as the fact that vessels are darker than the background, the width of the vessels ranges from 2 to 10 pixels, and the vessels grow from the optic disc in a radial pattern. Based on these characteristics, Chaudhuri et al. [11] designed two-dimensional Gaussian filters that can detect vessels in 12 different directions. However, this method is computationally expensive, and some dark lesions have characteristics similar to those of vessels, causing tracking errors. Hoover et al. [12] described an improved method that considers local and regional characteristics of vessels to separate blood vessels in retinal images and iteratively determines whether the current point is a vessel point.
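The oriented matched filtering idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation of [11]: the kernel size, `sigma`, and segment `length` are assumed values, and the function names `matched_filter_kernel`, `filter_bank`, and `max_response` are hypothetical.

```python
import math


def matched_filter_kernel(theta_deg, sigma=2.0, length=9, size=15):
    """Zero-mean Gaussian matched-filter kernel for a dark vessel at angle theta_deg.

    sigma, length, and size are illustrative values, not those of [11].
    """
    theta = math.radians(theta_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    half = size // 2
    kernel = [[0.0] * size for _ in range(size)]
    support = []
    for row in range(size):
        for col in range(size):
            x, y = col - half, row - half
            along = x * cos_t + y * sin_t     # coordinate along the vessel
            across = -x * sin_t + y * cos_t   # coordinate across the vessel
            if abs(along) <= length / 2 and abs(across) <= 3 * sigma:
                # Inverted Gaussian profile: vessels are darker than the background.
                kernel[row][col] = -math.exp(-across ** 2 / (2 * sigma ** 2))
                support.append((row, col))
    # Subtract the mean over the support so a flat background gives zero response.
    mean = sum(kernel[r][c] for r, c in support) / len(support)
    for r, c in support:
        kernel[r][c] -= mean
    return kernel


def filter_bank(n_orientations=12):
    """Kernels covering 180 degrees in equal steps (15 degrees for 12 kernels)."""
    step = 180 / n_orientations
    return [matched_filter_kernel(i * step) for i in range(n_orientations)]


def max_response(image, row, col, bank):
    """Maximum correlation over all orientations at one interior pixel."""
    half = len(bank[0]) // 2
    best = float("-inf")
    for kernel in bank:
        acc = 0.0
        for i, kernel_row in enumerate(kernel):
            for j, weight in enumerate(kernel_row):
                acc += weight * image[row + i - half][col + j - half]
        best = max(best, acc)
    return best
```

A pixel would then be marked as a vessel candidate when its maximum response exceeds a threshold. The nested loops also make the computational cost noted above visible: every pixel is correlated with all 12 kernels.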

Following these improvements, a large number of refined filters have been developed. Jiang and Mojon [13] proposed a generalized threshold method based on multithreshold detection. Zhang et al. [14] improved the matched filtering method by applying a local vessel cross section analysis, using local bilateral thresholding. Li et al. [15] suggested a multiscale production of the matched filter to enhance the extraction of tiny vessels.

2.3. Methods of Morphological Processing

Morphological processing facilitates the segmentation and identification of target objects by analyzing and processing structural elements in a binary image. Thus, linear and circular elements of blood vessels can be selected, isolating the desired structure from the background image. In addition, morphological processing can smooth and fill the image contour, which makes it robust to noise. However, this method relies heavily on structural elements and does not make good use of the characteristics of vessels.

According to vessel characteristics, Zana and Klein [16] introduced a mathematical morphology-based algorithm that allows separating the vessels from all possible undesirable patterns. Building on this approach, Ayala et al. [17] proposed using different average fuzzy sets. In Miri and Mahloojifar [18], fundus images were analyzed by the use of curvelet transform and morphological reconstruction of multistructural elements to enhance the boundaries and determine the vascular ridge. Karthika and Marimuthu [19] combined curvelet transform and morphological reconstruction of multistructural elements, with strongly connected component analysis (SCCA) to segment and identify vessels.

2.4. Methods of Deformation Models

First introduced by Kass et al. [20] in 1988, deformation models have the key benefit of being able to produce smooth parametric curves or surfaces. Two categories of deformation models are identified: parametric deformation and geometric deformation. Parametric deformation models are also called active contour or snake models (a set of points, each with an associated energy). Through the external and internal forces acting on the snake, the snake model can change its shape and smoothness toward the desired structure. In 2007, Espona et al. [21] applied a parametric deformation model to fundus images and proposed an improved method with morphological segmentation. With the assistance of morphological vessel segmentation, the snake model expands to the contour of the obtained vessels until the local energy function is minimal. Another deformation model method called ribbon of twins (ROT), which combines ribbon snakes and double snakes, was proposed by Al-Diri and Hunter [22]. Each twin consists of two snakes, one inside and one outside the vessel edges. The double snake model then attempts to integrate the pairs of twins on the vessel borders into a single ribbon and calculate the vessel width.

There are several shortcomings in parametric deformation models. For instance, the segmentation results depend on the initial contour, and difficulties arise when extending from low to high dimensions and when segmenting complex objects. Geometric deformation models address these problems. They are based on curve evolution theory and have no strict requirement on the position of the initial contour, which increases the robustness of the method and allows it to be extended to high dimensions. Zhang et al. [23] proposed an automatic vessel segmentation method, which uses nonlinear orthogonal projection to capture the characteristics of retinal blood vessels and derives an adaptive local thresholding algorithm for blood vessel segmentation. Zhao et al. [24] suggested a retinal vessel segmentation method that employs a region-based active contour model with a level set implementation and a region growing model.

2.5. Methods of Machine Learning

Machine learning refers to algorithms that enable computers to learn to achieve goals automatically by building generative or discriminative models from accumulated datasets. Machine learning can be divided into supervised learning and unsupervised learning. Supervised learning methods learn from ground truth, which means that the data used to train the model come with a “label” that the machine learning algorithm can use to differentiate the data. Applied to vessel segmentation for DR, supervised learning requires all of the pixels belonging to vessels to be marked in advance, whereas unsupervised learning does not need them to be marked beforehand.

For supervised learning, Cesar and Jelinek [30] and Leandro et al. [31] proposed a supervised classification with a two-dimensional Gabor wavelet. Each pixel has a feature vector that consists of the gray-scale feature and the responses of two-dimensional Gabor wavelets of distinct sizes. Ricci and Perfetti [25] proposed a segmentation method for retinal vessels based on line operators and support vector classification. Since the features are extracted with two orthogonal lines, the method reduces the number of features and training samples needed for supervised learning. A supervised method using a neural network with one input layer, three hidden layers, and one output layer was proposed by Marin et al. [26]. Each pixel in the image is represented by a seven-dimensional feature vector to train the network. Shanmugam and Banu [27] used an extreme learning machine (ELM) to detect retinal vessels by creating a seven-dimensional feature vector based on gray-scale features and invariant moments and using the ELM to segment vessels. In 2015, Wang et al. [28] proposed a new hierarchical retinal vascular segmentation with three steps: preprocessing, hierarchical feature extraction, and integration classification. It involves using simple linear iterative clustering (SLIC) to perform super-pixel segmentation and randomly selecting a pixel to represent the entire super-pixel, as an easier and more efficient means of extracting features.

For unsupervised learning, in 1998, Tolias and Panas [32] created an automatic and unsupervised segmentation method based on blurred fundus images, which used fuzzy C-means (FCM) to find initial candidate points. Xie and Nie [33] proposed a segmentation method based on a genetic algorithm and FCM. Salazar-Gonzalez et al. [29] used methods of vector flow to segment retinal vessels.

Table 2 summarizes the performance of the existing methods.

3. Literature Review of Diabetic Retinopathy Detection

Although extracting vessels before detecting DR with machine learning can achieve high accuracy, it is time-consuming to create the marked ground truth for retinal vessels. Another paradigm is to train the computer to automatically learn how to distinguish levels of DR by reading retinal images directly, without performing vessel segmentation. In 2000, Ege et al. [34] proposed an automatic analysis of DR by different statistical classifiers, including Bayesian, Mahalanobis, and k-nearest neighbor. Silberman et al. [35] introduced an automatic detection system for DR and reported an equal error rate of 87%. Karegowda et al. [36] detected exudates in retinal images using back-propagation neural networks (BPN). Their features were selected by two methods: decision trees and genetic algorithms with correlation-based feature selection (GA-CFS). In their experiment, the best BPN performance showed 98.45% accuracy. Kavitha and Duraiswamy [37] studied the automatic detection of hard and soft exudates in fundus images, using color histogram thresholding to classify exudates. Their experiments showed 99.07% accuracy, 89% sensitivity, and 99% specificity. In 2014, de la Calleja et al. [38] used local binary patterns (LBP) to extract local features and artificial neural networks, random forest (RF), and support vector machines (SVM) for detection. Using a dataset containing 71 images, their best result achieved 97.46% accuracy with RF.

4. Material and Methods

We propose an automatic DR detection algorithm, based on DCNN, fractional max-pooling [39], SVM [40], and teaching-learning-based optimization (TLBO) [41]. Specifically, we train two DCNN networks with fractional max-pooling, combining their prediction results using SVM and optimizing the SVM parameters with TLBO. The reason for training two distinct networks is that different network architectures may have their unique advantages in feature space representation. By training two DCNNs and combining their features, the prediction accuracy can be further enhanced. Another important factor impacting the recognition rate is the parameter of classifiers. We propose to optimize the SVM parameters using TLBO. We illustrate the image preprocessing methods in Section 4.1 and present the fractional max-pooling, SVM, and TLBO, in Sections 4.2, 4.3, and 4.4, respectively.

4.1. Preprocessing

Given the vessels in the original fundus images are mostly not very clear, and the size of each fundus image may differ, it is essential to preprocess images so that they have the same size and the visibility of the vessels is improved. There are three steps in preprocessing. The first is to rescale images to the same size. Since the fundus images are circular, we rescale the input images so that the diameter of the fundus images becomes 540 pixels. After rescaling, the local average color value is subtracted from the rescaled images, and another transformation is performed so that the local average is mapped to 50% gray-scale in order to remove the color divergence caused by different ophthalmoscopes. Last but not least, because boundary effects may occur in some images, we remove the periphery by clipping 10% from the border of the images. Figure 4 shows the original fundus image and the image after preprocessing.
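The preprocessing steps above can be sketched on a small gray-scale image stored as a list of rows. This is a simplified illustration rather than the code used in this work: the box-shaped local average, the `gain` of 4, and the circular mask used to clip the periphery are assumptions, and the actual resampling to a 540-pixel diameter is left to an image library.

```python
import math


def scale_factor(diameter_px, target_diameter=540):
    """Resize factor that maps the fundus disc diameter to the target diameter."""
    return target_diameter / diameter_px


def box_mean(img, radius):
    """Local average over a (2 * radius + 1) square window, cropped at the borders."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            r0, r1 = max(0, r - radius), min(h, r + radius + 1)
            c0, c1 = max(0, c - radius), min(w, c + radius + 1)
            total = sum(img[i][j] for i in range(r0, r1) for j in range(c0, c1))
            out[r][c] = total / ((r1 - r0) * (c1 - c0))
    return out


def normalize_and_clip(img, radius, gain=4.0, keep=0.9):
    """Subtract the local average, map it to 50% gray (128), and clip the periphery.

    gain and keep (fraction of the radius that is kept) are assumed values.
    """
    h, w = len(img), len(img[0])
    mean = box_mean(img, radius)
    cy, cx = (h - 1) / 2, (w - 1) / 2
    max_r = keep * min(h, w) / 2
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            if math.hypot(r - cy, c - cx) > max_r:
                row.append(128)  # outside the kept disc: neutral gray
            else:
                value = gain * (img[r][c] - mean[r][c]) + 128
                row.append(int(min(255, max(0, round(value)))))
        out.append(row)
    return out
```

On a uniform image every pixel maps to 128, while a pixel darker than its neighborhood falls below 128, which is what makes dark vessels stand out independently of the overall color cast of the camera.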

4.2. Fractional Max-Pooling

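Pooling is a procedure that turns the input matrix of size $N_{\text{in}} \times N_{\text{in}}$ into a smaller output matrix of size $N_{\text{out}} \times N_{\text{out}}$. The purpose is to divide the input matrix into multiple pooling regions ($P_{i,j}$):

$$P_{i,j} \subset \{1, 2, \ldots, N_{\text{in}}\}^2 \quad \text{for each } (i, j) \in \{1, \ldots, N_{\text{out}}\}^2. \tag{1}$$

The pooling results are computed according to pooling type:

$$\text{Output}_{i,j} = \operatorname{Oper}_{(k,l) \in P_{i,j}} \text{Input}_{k,l}. \tag{2}$$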

In equation (2), “Oper” refers to a particular mathematical operation. For example, if max-pooling is used, the operation will be to take the maximum of the input region. For average pooling, the average of the input region is taken. For such a network that requires tremendous learning, it is preferable to use as many hidden layers as possible. In this work, the pooling layer used in our networks is fractional max-pooling instead of general max-pooling.

Fractional max-pooling is a pooling scheme that makes the size of the output matrix equivalent to fractional times that of the input matrix after pooling, i.e., . To describe the general pooling regions, let and be two increasing integer sequences starting with one and ending with or . These two sequences are used in pooling steps, as described in Figures 5 and 6.

The constants, and in equation (3), stand for the overlapping length and the width of the pooling window, respectively. Figure 5 is a simple example of overlapping pooling. Figure 6 illustrates different pooling region types.

After fractional max-pooling, the pooling window size is still integers, but the global pooling size will change. Namely, fractional max-pooling does not directly change the pooling window into a fractional scale. Instead, it uses windows of variable size to achieve fractional pooling. The generation of and sequences can be random or pseudorandom. Pseudorandom sequences generate more stable pooling regions than random sequences and can also achieve higher accuracy [39].
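A minimal sketch of pseudorandom fractional max-pooling on a square matrix is given below. It follows our reading of the pseudorandom rule in [39], in which region boundaries are generated as $\lceil \alpha (i + u) \rceil$ with $\alpha = N_{\text{in}} / N_{\text{out}} \in (1, 2)$ and $u \in (0, 1)$. The 0-based indexing, the shift of the sequence so that it starts at zero, and the names `pseudorandom_boundaries` and `fractional_max_pool` are assumptions made for this illustration.

```python
import math
from fractions import Fraction


def pseudorandom_boundaries(n_in, n_out, u):
    """Increasing 0-based boundaries from 0 to n_in with increments of 1 or 2."""
    if not n_out < n_in < 2 * n_out:
        raise ValueError("need 1 < n_in / n_out < 2")
    alpha = Fraction(n_in, n_out)  # exact arithmetic avoids rounding issues in ceil
    u = Fraction(u)
    raw = [math.ceil(alpha * (i + u)) for i in range(n_out + 1)]
    return [value - raw[0] for value in raw]


def fractional_max_pool(matrix, n_out, u_rows, u_cols, overlap=False):
    """Max-pool a square matrix down to n_out x n_out with variable-size windows."""
    n_in = len(matrix)
    a = pseudorandom_boundaries(n_in, n_out, u_rows)
    b = pseudorandom_boundaries(n_in, n_out, u_cols)
    extra = 1 if overlap else 0  # overlapping regions share one row/column
    out = []
    for i in range(n_out):
        row = []
        for j in range(n_out):
            r_end = min(a[i + 1] + extra, n_in)
            c_end = min(b[j + 1] + extra, n_in)
            row.append(max(matrix[r][c]
                           for r in range(a[i], r_end)
                           for c in range(b[j], c_end)))
        out.append(row)
    return out
```

For example, with a 6 × 6 input, a 4 × 4 output, and $u = 0.3$, the boundaries are 0, 1, 3, 4, 6, so the window sizes alternate between 1 and 2. Each window has an integer size, yet the overall reduction ratio is 1.5.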

4.3. Support Vector Machine (SVM)

SVM is a supervised learning method used for classification and regression analysis. SVM can find the hyperplane or decision boundary defined by the solution vector $\mathbf{w}$, which not only separates the training vectors but also works well with unseen test data. To improve its generalization ability, SVM selects decision boundaries based on maximizing margins between classes.

Figure 7 illustrates the idea. Suppose there are $n$ points in a binary dataset:

$$(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n), \tag{4}$$

where $y_i$ is the data label, which can be 1 or −1, indicating the class to which $\mathbf{x}_i$ belongs. We need to find the optimized hyperplane, such that the distance between the hyperplane and its nearest point is maximized. A hyperplane can be written as equation (5) based on $\mathbf{x}$:

$$\mathbf{w} \cdot \mathbf{x} - b = 0, \tag{5}$$

where $\mathbf{w}$ is the normal vector of the hyperplane, and the value of $b / \|\mathbf{w}\|$ decides the offset of the hyperplane from the origin along the normal vector $\mathbf{w}$.

For $y_i$, whose value is 1, the data must satisfy $\mathbf{w} \cdot \mathbf{x}_i - b \geq 1$, and for $y_i$, whose value is −1, $\mathbf{w} \cdot \mathbf{x}_i - b \leq -1$ has to be satisfied. Combining these two conditions, we get

$$y_i (\mathbf{w} \cdot \mathbf{x}_i - b) \geq 1, \quad 1 \leq i \leq n. \tag{6}$$

The goal is to maximize the margin $2 / \|\mathbf{w}\|$ subject to the constraint of equation (6) in order to derive the optimized decision hyperplane for classification.

Sometimes, the training data cannot be perfectly separated using linear boundaries. Therefore, in the SVM formulation, we need to introduce the error metric $\xi_i$ and the cost parameter $C$, as shown in equation (7). The goal now becomes to minimize

$$\frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i \tag{7}$$

subject to

$$y_i (\mathbf{w} \cdot \mathbf{x}_i - b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \quad 1 \leq i \leq n. \tag{8}$$

The performance of SVM is influenced by two main parameters. The first one is $C$, which is a tunable parameter in equation (7). The other one is $\gamma$, which is used in the radial basis function (RBF) kernel to map data into a higher dimensional space before training and classification. The RBF kernel can be defined as

$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2\right), \tag{9}$$

where $\gamma$ controls the width of the Gaussian envelope in a high-dimensional feature space (a larger $\gamma$ gives a narrower envelope).
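To make the roles of the two tunable parameters concrete, the following sketch evaluates an RBF kernel and a soft-margin objective for a given linear decision function. It is an illustration only, not a solver, and it assumes the common conventions in which the hyperplane is written as $\mathbf{w} \cdot \mathbf{x} - b = 0$ and the kernel as $\exp(-\gamma \|\mathbf{x} - \mathbf{z}\|^2)$; the function names are hypothetical.

```python
import math


def rbf_kernel(x, z, gamma):
    """RBF kernel exp(-gamma * ||x - z||^2) between two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def soft_margin_objective(w, b, points, labels, cost):
    """0.5 * ||w||^2 + cost * sum of slacks for the hyperplane w . x - b = 0.

    The slack of a point is max(0, 1 - y * (w . x - b)): zero when the point
    lies on the correct side of the margin, positive otherwise.
    """
    slacks = []
    for x, y in zip(points, labels):
        decision = sum(wi * xi for wi, xi in zip(w, x)) - b
        slacks.append(max(0.0, 1 - y * decision))
    return 0.5 * sum(wi * wi for wi in w) + cost * sum(slacks)
```

A large cost parameter penalizes margin violations heavily and fits the training data more tightly, while a large `gamma` makes the kernel decay quickly so that each training point influences only its immediate neighborhood. Both settings increase the risk of overfitting, which is why the two parameters are tuned jointly.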

4.4. Teaching-Learning-Based Optimization (TLBO)

TLBO, an evolution-based optimization algorithm, was proposed by Rao et al. [41], in 2011. The concept of TLBO is inspired by the evolution of the learning process when a group or a class of learners learn a target task. There are two ways of learning in groups or classes: (1) learning from the guidance of the teacher and (2) learning from other learners. The procedure of TLBO can be divided into two phases, as described below in Sections 4.4.1 and 4.4.2.

4.4.1. Teacher Phase

In the whole population, the teacher ($X_{\text{teacher}}$) can be considered as the best solution. Namely, learners learn from the teacher in the teacher phase. In this phase, the teacher strives to enhance the results of the other individuals ($X_i$) by increasing the mean result of the classroom. This can be described as adjusting $X_{\text{mean}}$ to approximate $X_{\text{teacher}}$. In order to maintain a stochastic nature during the optimization process, two randomly generated parameters, $r$ and $T_F$, are applied in each iteration for the solution $X_i$ as

$$X_{\text{new}} = X_i + r \left(X_{\text{teacher}} - T_F X_{\text{mean}}\right), \tag{10}$$

$$T_F = \operatorname{round}\left[1 + \operatorname{rand}(0, 1)\right]. \tag{11}$$

In equation (10), $r$ is a randomly selected number in the range of 0 and 1. Moreover, $X_{\text{new}}$ and $X_i$ are the new and existing solutions at iteration $i$, respectively. $T_F$ in equation (11) is a teaching factor which can be either 1 or 2.

4.4.2. Learner Phase

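The learners gain their knowledge by interacting with each other. Therefore, an individual learns new information if other individuals have more knowledge than him or her. In this phase, the student $X_i$ interacts randomly with another student ($X_j$) in order to enhance his or her knowledge. Equation (12) shows that if $X_j$ is better than $X_i$ (i.e., $f(X_j) < f(X_i)$ for minimization problems, where $f$ is the cost function), $X_i$ is moved toward $X_j$. Otherwise, it is moved away from $X_j$:

$$X_{\text{new}} = \begin{cases} X_i + r \left(X_j - X_i\right), & f(X_j) < f(X_i), \\ X_i + r \left(X_i - X_j\right), & \text{otherwise}, \end{cases} \tag{12}$$

where $r$ is a randomly selected number in the range of 0 and 1.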

If the new solution to the problem is better than the old ones, the new solution will be recorded as the best solution. After updating the status of each learner, a new iteration begins. A stop criterion, based on the iteration number or the difference of the cost function, can be set to stop the iteration properly. The flowchart of TLBO is shown in Figure 8.
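The two phases can be put together in a compact sketch of TLBO for a minimization problem. This is a generic illustration of the standard algorithm under assumed settings (a scalar lower and upper bound shared by all dimensions, greedy acceptance of improved solutions, a fixed iteration count as the stop criterion); the function `tlbo` and its parameters are hypothetical names, not the code used in our experiments.

```python
import random


def tlbo(cost, lower, upper, dim, n_learners=20, n_iter=50, seed=0):
    """Minimize cost over [lower, upper]^dim with teaching-learning-based optimization."""
    rng = random.Random(seed)

    def clip(value):
        return min(upper, max(lower, value))

    pop = [[rng.uniform(lower, upper) for _ in range(dim)] for _ in range(n_learners)]
    costs = [cost(x) for x in pop]

    for _ in range(n_iter):
        # Teacher phase: move every learner using the best solution and the class mean.
        teacher = pop[costs.index(min(costs))][:]
        mean = [sum(x[d] for x in pop) / n_learners for d in range(dim)]
        for i in range(n_learners):
            teaching_factor = rng.randint(1, 2)  # either 1 or 2
            candidate = [clip(pop[i][d] + rng.random() * (teacher[d] - teaching_factor * mean[d]))
                         for d in range(dim)]
            candidate_cost = cost(candidate)
            if candidate_cost < costs[i]:  # keep the move only if it improves
                pop[i], costs[i] = candidate, candidate_cost

        # Learner phase: interact with one randomly chosen other learner.
        for i in range(n_learners):
            j = rng.choice([k for k in range(n_learners) if k != i])
            # Move toward the partner if it is better, away from it otherwise.
            sign = 1.0 if costs[j] < costs[i] else -1.0
            candidate = [clip(pop[i][d] + sign * rng.random() * (pop[j][d] - pop[i][d]))
                         for d in range(dim)]
            candidate_cost = cost(candidate)
            if candidate_cost < costs[i]:
                pop[i], costs[i] = candidate, candidate_cost

    best = costs.index(min(costs))
    return pop[best], costs[best]
```

For SVM parameter tuning, `cost` would be, for instance, the validation error of an SVM trained with the candidate parameters, and `dim` would be 2 for the pair of SVM parameters. Apart from the population size and the number of iterations, TLBO has no algorithm-specific parameters to tune.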

5. Result

Our fundus image data come from the database provided by one of the Kaggle contests, entitled “Identify signs of diabetic retinopathy in eye images” [42]. In this database, there are about 90,000 images. We separate 1,000 images from the training dataset to form the validation dataset. The detailed information of each dataset is shown in Table 3, and our two network architectures are shown in Figure 9.

Our proposed method uses two DCNNs with fractional max-pooling layers. For every input fundus image, the two DCNNs output a vector of size 1 × 5, representing the probability distribution of the prediction for each lesion (category). The probability distribution, together with other values, forms a feature vector with dimensionality 24. The features are described as follows:
(i) DCNN probabilities of each lesion, respectively (5 features)
(ii) Averages of R, G, and B channel values within the 50% × 50% center-cropped image (3 features)
(iii) Width and height of the 50% × 50% center-cropped image (2 features)
(iv) Overall standard deviation of the original image and 50% × 50% center-cropped image, Laplacian-filtered image (2 features)
In total, there are 12 features for one fundus image. We then append another 12 features from the fundus image of the other eye of the same subject. Therefore, the overall length of the feature vector is 24 for one fundus image. These feature vectors of dimensionality 24 are used as input vectors of the SVM.
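The assembly of the 24-dimensional input vector can be sketched as follows. The helper names `eye_features` and `subject_feature` and all numeric values are hypothetical; the sketch only shows how the 12 per-eye features are ordered and concatenated with those of the other eye.

```python
def eye_features(probabilities, rgb_means, crop_size, std_devs):
    """Concatenate the 12 per-eye features: 5 + 3 + 2 + 2."""
    groups = [(probabilities, 5), (rgb_means, 3), (crop_size, 2), (std_devs, 2)]
    features = []
    for values, expected in groups:
        if len(values) != expected:
            raise ValueError(f"expected {expected} values, got {len(values)}")
        features.extend(values)
    return features


def subject_feature(this_eye, other_eye):
    """24-dimensional vector: features of the graded eye followed by the other eye."""
    return list(this_eye) + list(other_eye)


# Hypothetical values for the two eyes of one subject.
left = eye_features([0.70, 0.10, 0.15, 0.03, 0.02], [120.5, 80.2, 40.1],
                    [270, 270], [35.2, 6.1])
right = eye_features([0.60, 0.20, 0.15, 0.03, 0.02], [118.0, 79.5, 41.3],
                     [270, 270], [33.8, 5.7])
left_input = subject_feature(left, right)    # vector used to grade the left eye
right_input = subject_feature(right, left)   # vector used to grade the right eye
```

Including the features of the other eye lets the classifier exploit the fact that the two eyes of the same subject often show a similar stage of DR.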

The 24-dimensional vector is used to train a multiclass SVM (five classes), whose parameters are optimized using the TLBO method. We implemented the method described in [39] and used it as the baseline. The baseline system uses similar features with a scheme of ensemble classifier (RF).

We used the validation set data to optimize the parameter set in SVM using TLBO. The upper and lower bounds of the parameter are set within [0, 100]. We ran 50 iterations with 50 students.

Our final accuracy for the five-class classification task of DR is 86.17%, and the accuracy for the binary classification task is 91.05%. Labels for five-class classification are normal, NPDR level 1, NPDR level 2, NPDR level 3, and PDR, while labels for binary classification are normal and abnormal. For binary classification, the sensitivity is 0.8930 and the specificity is 0.9089. In addition to computing accuracy, we also performed a t-test on our binary classification results. The t-test, also called Student’s t-test, is a statistical hypothesis test in which the test statistic follows a Student’s t-distribution. Usually, the t-test is used to assess whether there is a significant difference between two groups of data. A paired-samples t-test comparing the results of binary classification with random judgment gives 1 for the hypothesis test result, zero for the p value, and [0.3934, 0.4033] for the confidence interval, under the null hypothesis at the 5% significance level.

The hypothesis test result is an index that tells whether two sets of data come from the same distribution or not. If the data come from the same distribution, the hypothesis test result will be close to 0. On the contrary, if the data sources are distinct, the result will be close to 1, which means there is a difference between the data. The p value is the probability that the assumption of a difference between the two sets of data is wrong. The smaller the p value, the stronger the evidence that there is a disparity between the data.
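The evaluation measures used here can be computed as in the following sketch. The counts and sample values in the usage note are hypothetical and do not correspond to our experimental data; only the paired t statistic and its degrees of freedom are computed, since turning them into a p value requires the t-distribution, which is not part of the Python standard library.

```python
import math
import statistics


def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity from binary confusion counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),   # fraction of abnormal cases detected
        "specificity": tn / (tn + fp),   # fraction of normal cases recognized
    }


def paired_t_statistic(a, b):
    """t statistic and degrees of freedom of a paired-samples t-test."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    return mean_diff / (std_diff / math.sqrt(n)), n - 1
```

For instance, `binary_metrics(90, 10, 80, 20)` gives an accuracy of 0.85, a sensitivity of 0.9, and a specificity of 0.8. For the t-test, `a` and `b` could hold per-image correctness indicators (1 for a correct decision, 0 otherwise) of the classifier and of random judgment on the same images.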

We also designed an app called “Deep Retina,” which provides personal examination, remote medical care, and early screening. Figure 10 shows our app interfaces. After the user chooses a fundus image to check, the app sends the image to our server, where it is analyzed with our machine learning method. It takes about 10 s (depending on network speed) to get the result, which is presented as the probability of each lesion. With a handheld device, individuals can do the initial examination at the district office or even at home. More importantly, it can benefit remote areas that lack medical resources.

6. Discussion

6.1. Accuracy Improvement

Table 4 shows the accuracy comparison when using different classifiers and parameter optimization methods on each dataset. Using the default parameters with SVM (without optimization), accuracies in both validation and test sets are higher than that of the RF [39]. If we optimize the parameters using the default parameter searching method provided in the LIBSVM software package, though it achieves very high accuracy in the five-fold cross validation experiment, the validation and test accuracies are even lower than the default one. From this result, we believe that overfitting arises when optimizing parameters in SVM.

Table 5 shows the confusion matrix of the classification results from the two DCNN networks (before performing SVM classification). Network 1 is the architecture shown on the left side of Figure 9, and network 2 is the one on the right side. Table 5 shows that lesions 0 and 2 are classified better than the other categories. For lesion 1, most of the predictions are incorrect. For lesions 3 and 4, the majority of the images are misclassified as lesion 2.
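The kind of reading applied to the confusion matrix above can be sketched in a few lines. The example below assumes that rows index the true class and columns the predicted class; the matrix values are hypothetical, chosen only to mimic the qualitative pattern described, and are not the values in Table 5.

```python
def per_class_accuracy(cm):
    """cm[i][j] = number of images of true class i predicted as class j.

    Returns, for each class, the fraction of its images predicted correctly.
    """
    return [row[i] / sum(row) if sum(row) else float("nan")
            for i, row in enumerate(cm)]


def top_confusion(cm, i):
    """Return (wrong_class, count) for the most frequent error on true class i."""
    count, j = max((c, j) for j, c in enumerate(cm[i]) if j != i)
    return j, count


# Hypothetical confusion matrix (rows: true class 0-4, columns: predicted).
cm = [
    [90, 2, 8, 0, 0],
    [60, 10, 30, 0, 0],
    [15, 3, 80, 1, 1],
    [2, 0, 60, 35, 3],
    [5, 0, 50, 5, 40],
]

for cls, acc in enumerate(per_class_accuracy(cm)):
    wrong_cls, n_wrong = top_confusion(cm, cls)
    print(f"class {cls}: accuracy {acc:.2f}, "
          f"most often confused with class {wrong_cls} ({n_wrong} images)")
```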

Table 6 shows the confusion matrix of the classification results using the full procedure of the proposed method (using SVM with TLBO). Table 7 displays the difference between Tables 5 and 6, which serves as a performance comparison between the two methods (DCNN only and DCNN + SVM + TLBO). According to Table 7, every accuracy in network 1 increases, except for class 0 and the overall accuracy. For network 2, every accuracy except that of class 3 increases. The decline in the accuracy of class 3 is mainly caused by misclassification as class 2. Table 8 shows the confusion matrix of the classification results using the baseline method reported in [39], for comparison purposes.

6.2. Deep Learning vs Traditional Classification Methods

Many traditional classification methods try to solve the problem of DR detection by (1) using image processing to capture symptoms in fundus images and then (2) building a classifier to make decisions based on the detected symptoms (1). The shortcoming of image processing methods is that the manifestations of the symptoms vary randomly across images; therefore, labeling the locations of the symptoms is extremely time-consuming and requires intense effort. Following the philosophy that has emerged with deep learning technology, our proposed method learns to make decisions directly from the image data itself. Unlike the former approaches, our images only need to be labeled with a lesion number, without labeling the symptom locations. Consequently, considerable time is saved during the database preprocessing stage. On top of the classification results from the two DCNN networks, we use an SVM optimized by TLBO to generate an improved outcome, and we achieve 86.17% accuracy. Our result is better than that of the first-place winner in the Kaggle competition, which shows that our result is state of the art.

6.3. Limitation

In our current datasets, the number of images of lesions 3 and 4 is not sufficient to train a network, which is a limitation of the proposed method. Therefore, part of our future work is to develop closer collaboration with hospitals and clinics to acquire more data for lesions 3 and 4. With more data, we believe that the classification accuracy will increase further. In addition, our results show that it is hard to differentiate between images of lesions 0 and 1. Therefore, when we collect new data, it is desirable to collect more images belonging to lesions 0 and 1. We can also attempt to use a different network architecture for this problem.

7. Conclusion

It is feasible to train a deep learning model for automatic diagnosis of DR, as long as enough data are available for statistical model training. Furthermore, the database preparation stage only needs a categorical label for each training image. It does not require detailed annotation for retinal vessel tracking in every image. Hence, it is time-efficient compared with the traditional machine learning-based methods for automatic diagnosis of DR. The final accuracy reaches 86.17% and 91.05% for five-class and binary classification, respectively.

The sensitivity and specificity of binary classification are 0.8930 and 0.9089, respectively, which is a satisfactory result. Furthermore, we developed an automatic inspection app that can be used for both personal examination and remote medical care. With more image data collected, we expect that the accuracy of our system can be improved further.

Data Availability

All experiment data come from a Kaggle contest called “Identify signs of diabetic retinopathy in eye images” (website: https://www.kaggle.com/c/diabetic-retinopathy-detection).

Conflicts of Interest

The authors declare that they have no conflicts of interest.


Acknowledgments

The authors would like to thank Mr. Chia-Ming Hu for his help in preparing computational instruments and performing parts of the experiments in this paper.