Abstract

Diabetes is one of the most alarming health issues of the present era. It is a chronic disease that can cause many health-related problems and comprises a group of syndromes that result in too much sugar in the blood. The chronic hyperglycemia of diabetes has been linked to long-term damage, dysfunction, and failure of organs, notably the eyes, kidneys, nerves, heart, and blood vessels. Machine learning has advanced rapidly and is now used in many facets of medical care. The goal of this research is to create a model with the highest possible accuracy for predicting a patient's chance of developing diabetes. This paper proposes a novel architecture for predicting diabetes using the K-means clustering technique and a support vector machine (SVM). The features extracted by K-means are then classified using an SVM classifier. The approach is tested on a publicly available dataset, the Pima Indians Diabetes Database, on which an accuracy of 98.7% is recorded. On this dataset, the combined method performs better than conventional SVM-based classification. This paper also compares the accuracy, precision, recall, and F1-score of different machine learning techniques for classifying diabetes patients.

1. Introduction

Various forms of diabetes exist. In type 1 diabetes, the pancreas stops producing the hormone insulin, which helps the body use carbohydrates, fats, and proteins. In type 2 diabetes, the body's cells can no longer respond properly to insulin, and over time the production of insulin also declines. As a consequence, various internal organs gradually fail, which can cause death. A third form, gestational diabetes, is associated with pregnancy, in which a woman's blood sugar level increases [14]. Little can be done in advance to prevent type 1 diabetes. For type 2, nutritious food, regular exercise, and weight control are the best preventive measures and can prevent up to 90% of cases [5, 6]. In 2014, clinical examination of 4.21 crore citizens found diabetes in 31 lakh, i.e., about 7.75% of people. Awareness about diabetes is still low: 40% of affected citizens are unable to take proper care of themselves and cannot control their sugar levels. Half of them have never had an eye test, even though they are at high risk of retinopathy.

1.1. Motivation

According to the WHO, 7.30 million adults in India are in the grip of diabetes. Between 10.7 and 14.2% of the urban population is diabetic, compared with 3 to 6.8% in rural areas. According to the central government, a quarter of India's 4.25 crore diabetes patients are in India. As per the National Family Health Survey-4, the number of diabetes patients in the country doubled in the year 2007 compared to 2007. 3.5% of women and 3.5% of men aged 35 to 49 have diabetes. Diabetes is present in 9.7% of Indians over 80, in 13.1% of people aged 60 to 69, and in 13.2% of those aged 70 to 79. The prevalence falls in the oldest group because many patients do not survive to that age. Among those who do survive to 60 years or beyond, diabetes affects about 36%, roughly one-third, compared with people in the 60 to 79 age group. Heart disease, stroke, kidney disease, and blindness are also becoming significant threats to health as the incidence of diabetes rises in the younger generation [7–9].

1.2. Main Contributions

The contributions of this manuscript are as follows:
(i) This paper proposes a novel architecture for the early diagnosis of diabetes based on a small set of clinical parameters.
(ii) This paper uses K-means clustering combined with an SVM algorithm to classify the data.
(iii) For experimental purposes, the Pima Indians Diabetes Database is used. It contains the data of 668 female patients; 80% of these data are used to train the model and 20% to test it on the proposed architecture.

The remainder of the paper is organized as follows. Section 2 describes the related work and compares the work done by researchers on the Pima Indians Diabetes Database using different techniques. Section 3 outlines the proposed approach; it briefly describes our architecture and the algorithms used to improve the accuracy of diabetes prediction. Section 4 contains the experimental evaluation and results, briefly describing the dataset used and all the measured parameters. Section 5 concludes the paper.

2. Related Work

Diabetes prediction using the Pima Indians Diabetes Database has been a topic of interest among researchers during the last few decades. This section highlights some of the methods researchers have used to predict diabetes with the Pima Indians Diabetes Database and the accuracy they achieved.

AlJarullah [10] used the decision tree algorithm to predict type II diabetes: data preprocessing is completed in the first phase, and diabetes prediction in the second phase using the decision tree algorithm. The maximum accuracy achieved in this paper is 78.17%. Anand et al. [11] used a higher-order neural network (HONN) combined with principal component analysis (PCA) to predict type II diabetes; PCA is used to handle the missing data and to scale the data to the same range of values. The maximum accuracy achieved in this paper is 89.47%. Banerjee et al. [12] used a neural network with an evolutionary algorithm-based approach for predicting diabetes and also compared the neural network model with other models. The maximum accuracy achieved in this paper is 93.5%. Barale and Shirke [13] used the K-means clustering algorithm combined with an artificial neural network (ANN), and K-means combined with a logistic regression classifier, to predict diabetes; the K-means clustering technique is applied to uncover hidden patterns in the dataset. The maximum accuracy achieved in this paper is 98%.

Chikh et al. [14] used a modified artificial immune recognition system (AIRS) in which the fuzzy K-nearest neighbor algorithm is used to diagnose diabetes. The maximum accuracy achieved in this paper is 89.1%. Choubey and Paul [15] used a genetic algorithm (GA) combined with a multilayer perceptron neural network for diagnosing diabetes: in the first phase, the GA is used for feature selection, and in the second phase, diabetes classification is completed using the multilayer perceptron neural network. The maximum accuracy achieved in this paper is 79.13%. Christobel and Sivaprakasam [16] used class-wise K-nearest neighbor (CkNN) to classify the diabetes dataset. In the first phase, data preprocessing is done and missing values are replaced with the mean. In the second phase, diabetes classification is completed using the modified KNN. The maximum accuracy achieved in this paper is 78.16%.

Deperlioğlu and Utku [17] used a multilayer feedforward neural network trained with the Bayesian regularization algorithm and the mean square error function; in this study, the ANN was trained ten times. The maximum accuracy achieved in this paper is 95.5%. Gandhi and Prajapathi [18] used the F-score, K-means clustering, Z-score normalization, and SVM: in the first phase, data preprocessing is done using the F-score and K-means, and in the second phase, diabetes classification is completed using SVM. The maximum accuracy achieved in this paper is 98%. Ganji and Abadeh [19] used ant colony optimization (ACO) to predict diabetes; using an ant colony-based classification method, a set of fuzzy rules for diabetes diagnosis can be extracted. The maximum accuracy achieved in this paper is 84.24%. Hayashi and Yukita [20] used Re-RX with J48 graft, combined with sampling selection techniques, for predicting diabetes; as a "white-box" model, the recursive-rule extraction (Re-RX) method delivers highly accurate classification. The maximum accuracy achieved in this paper is 83.83%. Huang and Lu [21] used information gain (IG) along with a DNN for the prediction of diabetes. The maximum accuracy achieved in this paper is 90.16%. Iyer et al. [22] used the J48 decision tree algorithm and the naïve Bayes algorithm for dataset classification and achieved accuracies of 76.9% and 79.5%, respectively. Kahramanli and Allahverdhi [23] used an ANN and a fuzzy neural network to classify the diabetes dataset. The maximum accuracy achieved in this paper is 86.8%.

Karatsiolis and Schizas [24] used an SVM with an RBF kernel and an SVM with a polynomial kernel to classify the diabetes dataset. The dataset is first divided into two subsets; the SVM with an RBF kernel is applied to one subset, and the SVM with a polynomial kernel to the other. The maximum accuracies achieved in this paper are 82.2% and 81%, respectively. Karegowda et al. [25] used K-means clustering along with a GA and CFS for the classification of the diabetes dataset, achieving a classification accuracy of 96.68% in three phases. The K-means clustering algorithm is applied in the first phase to identify and eliminate incorrectly classified instances. In the second phase, a GA and correlation-based feature selection (CFS) are applied to extract relevant features. Finally, in the third phase, classification is done using the K-nearest neighbor (KNN) algorithm. Karegowda et al. [26] used K-means clustering combined with the C4.5 decision tree to classify the diabetes dataset: in the first phase, K-means clustering is used to eliminate incorrect instances, and in the second phase, the C4.5 decision tree algorithm is used to classify the data. The maximum accuracy achieved in this paper is 93.33%. Karegowda et al. [27] used a GA and a back propagation network (BPN) to classify the data. The maximum accuracy achieved in this paper is 77.7%. Kayaer and Yildirim [28] used a GRNN to classify the data and achieved an accuracy of 80.21%.

Kumar Das et al. [29] used random forest and gradient boosting classifiers to classify the diabetes dataset and achieved an accuracy of 90%; data preprocessing is done first, and then the classifiers are applied. Senthil Kumar et al. [30] used covering-based rough set classification, a pattern-based approach, for dataset classification; a maximum accuracy of 79.34% is achieved with this procedure. Kumari and Chitra [31] used an SVM with an RBF kernel to classify the data and achieved an accuracy of 75.5%. Nirmala Devi et al. [32] used an amalgam KNN, consisting of K-means combined with KNN, to classify the data and achieved an accuracy of 97.4%; the K-means algorithm is used to identify missing values, which are replaced by the mean and median. Patil et al. [33] used K-means clustering combined with the C4.5 decision tree and achieved an accuracy of 92.38% in classifying the dataset. Polat [34] used fuzzy C-means combined with SVM and KNN and a weighting method (FCMAW) and achieved accuracies of 91.41% and 84.38%, respectively. Polat et al. [35] used GDA and a least squares support vector machine and achieved an accuracy of 82.05% in classifying the data. Rado et al. [36] used random forest combined with recursive feature elimination, and the accuracy achieved was 73%. Raghavendra et al. [37] used a neural network model with a backward elimination feature selection method and achieved an accuracy of 84.52% in classifying the dataset. Rajni and Amandeep [38] achieved a classification accuracy of 72.9% using the RB-Bayes algorithm, in which the mean is used to handle missing values. Ramana and Boddu [39] used the naïve Bayes classification algorithm, and the accuracy achieved was 76.34%. Balaji et al. [40] used a deep neural network restricted Boltzmann machine, and 80.9% accuracy was achieved.

Vaishali et al. [41] used Goldberg's GA combined with a multi-objective evolutionary fuzzy classifier to classify the type 2 diabetes dataset: in the first stage, essential features are extracted using Goldberg's GA, and in the second stage, the multi-objective evolutionary fuzzy classifier is applied, achieving an accuracy of 83.04%. Vosoulipour et al. [42] used NN and ANFIS structures and achieved an accuracy of 81.3%. Wong and Lease [43] used Cartesian genetic programming and achieved an accuracy of 80.5%. Wu et al. [44] used an improved K-means algorithm and the logistic regression algorithm for dataset classification and achieved an accuracy of 95.42%. Zolfaghari [45] used SVM combined with a neural network and achieved an accuracy of 88.04%, and Bano and Khan [46] used KNN and achieved an accuracy of 82.29% in classifying the dataset.

An automated model for diagnosing diabetes was reported by Lakhvani et al. [47] utilizing a three-layered artificial neural network (ANN); the authors employed a logistic activation function for neuron activation and trained the model using a quasi-Newton approach. Using the Pima Indians Diabetes Dataset, Patil and Ingle [48] offered a comparative analysis of different ML classification algorithms for diabetes prediction. For statistical modeling and accuracy verification, the authors employed KNN, logistic regression (LR), the naïve Bayes probabilistic classifier, SVM with both linear and nonlinear kernels, and a decision tree with an RF classifier, reaching a maximum accuracy of 80.20 percent. Alharan et al. [49] employed LDA and a GA for feature selection to increase the classification accuracy for diabetes; the approach reaches a maximum accuracy of 90.89 percent. Sivaranjani et al. [50] presented a model for diabetes classification using SVM and RF techniques, with PCA also used to reduce the number of dimensions; the maximum accuracy rates are 83 and 81.4 percent, respectively.

In all the techniques used by the researchers, the main challenge is to improve the accuracy of the system for the early diagnosis of diabetes. To address this problem, this paper suggests a fusion technique with two phases: in the first phase, data preprocessing is done using K-means, and in the second phase, diabetes classification is completed using SVM to achieve the maximum accuracy. The techniques used by different researchers and the accuracies achieved are summarized in Table 1.

3. Proposed Methodology

This section describes the proposed model for classifying Pima diabetes patients using K-means clustering and SVM. Figure 1 presents an overview of the proposed model: clusters are first created using K-means clustering, and the SVM is then used for classification.

3.1. K-Means Clustering Algorithm

The K-means algorithm is used to cluster the dataset into different groups. K-means works for multi-dimensional data; an example for two-dimensional data is shown in Figure 2.

The following steps are used in the K-means clustering algorithm [51]:
(1) Choose the number of clusters, K.
(2) Randomly choose K points to serve as the centroids of the K clusters. These points need not come from the dataset; any K points can be selected.
(3) Assign each data point to its nearest centroid, forming the K clusters. Euclidean distance is used to compute the distances.
(4) Recompute the centroid of each cluster.
(5) Reassign each data point to its new nearest centroid. If any reassignment took place, return to step 4; otherwise, stop.

The number of clusters (step 1) is determined using the elbow method. For the dataset used here, the number of clusters is 5.
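For illustration, the snippet below sketches how this clustering step and the elbow heuristic might be carried out with scikit-learn. The file name, the "Outcome" column name, and the range of candidate k values are assumptions made for the sake of the example, not details given in the paper.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Load the dataset; "pima_diabetes.csv" and the column name "Outcome"
# are illustrative assumptions about how the data are stored locally.
df = pd.read_csv("pima_diabetes.csv")
X = df.drop(columns=["Outcome"]).values  # the eight independent attributes

# Elbow method: inspect the within-cluster sum of squares (inertia)
# for a range of candidate k values and pick the "elbow".
inertias = {}
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = round(km.inertia_, 1)
print(inertias)

# Final clustering with the number of clusters reported in the paper (k = 5).
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)
cluster_labels = kmeans.labels_  # one cluster assignment per patient
```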

3.2. SVM Classification Algorithm

SVM was initially developed in the 1960s and further refined in the 1990s. It is very popular in machine learning because it is a robust algorithm that differs considerably from other machine learning methods. SVM finds the decision boundary that best separates the dataset into different classes: it separates the classes with the maximum-margin boundary between the support vectors. For the best boundary, the sum of the distances from the boundary line to the support vectors should be maximal. This boundary is known as the maximum margin hyperplane or maximum margin classifier. The support vector classification (SVC) class is used to implement SVM; among its many parameters, this model uses the linear kernel for linear classification [52].
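As a concrete illustration of the combined pipeline, the sketch below appends each patient's distances to the K-means centroids to the original attributes and then trains a linear-kernel SVC, assuming scikit-learn and the 80/20 train/test split used in this paper. The exact way cluster information is turned into SVM features here is our assumption, not a detail specified by the paper, and the file and column names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative file/column names, as in the earlier sketch.
df = pd.read_csv("pima_diabetes.csv")
X = df.drop(columns=["Outcome"]).values
y = df["Outcome"].values

# 80% training / 20% testing, as described in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Scale features so that distance-based clustering and the SVM behave sensibly.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Fit K-means (k = 5) on the training split only, then use the distances
# to the five centroids as extra features for both splits (an assumption).
km = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X_train_s)
train_feats = np.hstack([X_train_s, km.transform(X_train_s)])
test_feats = np.hstack([X_test_s, km.transform(X_test_s)])

# Linear-kernel SVC, as stated in Section 3.2.
clf = SVC(kernel="linear").fit(train_feats, y_train)
print("Test accuracy:", accuracy_score(y_test, clf.predict(test_feats)))
```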

4. Experimental Evaluation and Results

This section presents the specifics of the dataset used in this investigation. Results are calculated for various classification algorithms and for the proposed method. The details are as follows.

4.1. Dataset Description

This paper uses a publicly available dataset, the Pima Indians Diabetes Dataset [53]. This dataset contains the data of 668 female patients with eight independent parameters, namely, pregnancies, glucose, blood pressure (BP), skin thickness (ST), insulin, BMI, diabetes pedigree function, and age, and one dependent parameter, outcome [53]. The first five records of the dataset are presented in Table 2.

All the parameters of the dataset are as follows:
(i) Pregnancies: the number of times the patient has been pregnant.
(ii) Glucose: plasma glucose concentration at 2 hours in an oral glucose tolerance test.
(iii) Blood Pressure: diastolic blood pressure (mm Hg).
(iv) Skin Thickness: triceps skinfold thickness (mm).
(v) Insulin: 2-hour serum insulin (mu U/ml).
(vi) BMI: body mass index (weight in kg/(height in m)²).
(vii) Diabetes Pedigree Function (DPF): a function that scores the likelihood of diabetes based on family history.
(viii) Age: the age of the person (in years).
(ix) Outcome: the class variable; 0 means nondiabetic, and 1 means diabetic.
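For reference, a quick programmatic look at the dataset might resemble the sketch below; the CSV file name is an assumption about local storage, and the columns are those listed above.

```python
import pandas as pd

# "pima_diabetes.csv" is an assumed local file name for the dataset [53].
df = pd.read_csv("pima_diabetes.csv")

print(df.shape)                      # (number of records, 9 columns)
print(df.head())                     # first records, cf. Table 2
print(df["Outcome"].value_counts())  # class balance: 0 = nondiabetic, 1 = diabetic
```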

4.2. Performance Measure

All the approaches are compared using accuracy, precision, recall, and F1-score as performance measures, computed using (1)–(4) [54]. The confusion matrix is used to calculate all the performance measure parameters; the confusion matrix generated by the proposed method is shown in Figure 3.

The performance measurement parameters are defined with respect to the confusion matrix (Table 3) for two classes (binary classification):
(i) Accuracy: the proportion of correct classifications (true positives and true negatives) out of the total number of cases.
(ii) Precision: the proportion of instances predicted as positive that are actually positive (true positives).
(iii) Recall: the proportion of actual positive instances that are correctly classified as positive (true positives).
(iv) F1-score: the balance between precision and recall (their harmonic mean).
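Written in terms of the confusion-matrix counts, true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), the standard definitions behind (1)–(4) are as follows.

```latex
\begin{align}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
\text{F1-score}  &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\end{align}
```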

4.3. Experimental Results

Table 4 shows the results of the proposed approach on the Pima Indians Diabetes Dataset. An accuracy of 98.7% is recorded using the proposed method, whereas an accuracy of 82.46% is recorded using only the SVM classification algorithm, a relative improvement of 19.69% on the Pima Indians Diabetes Dataset.

The comparison of the various classification methods, namely, decision tree, random forest, kernel SVM, naïve Bayes, KNN, logistic regression, SVM, and the proposed approach, with respect to accuracy, precision, recall, and F1-score, is shown in Figures 4–7.
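For context, a comparison of this kind could be scripted roughly as below, assuming scikit-learn and reusing the scaled train/test split from the pipeline sketch in Section 3.2; default hyperparameters are used purely for illustration and are not the settings behind the figures reported here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# X_train_s, X_test_s, y_train, y_test come from the earlier pipeline sketch.
models = {
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "Random forest": RandomForestClassifier(random_state=42),
    "Kernel SVM (RBF)": SVC(kernel="rbf"),
    "Naive Bayes": GaussianNB(),
    "KNN": KNeighborsClassifier(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "SVM (linear)": SVC(kernel="linear"),
}

for name, model in models.items():
    model.fit(X_train_s, y_train)
    pred = model.predict(X_test_s)
    prec, rec, f1, _ = precision_recall_fscore_support(y_test, pred, average="binary")
    acc = accuracy_score(y_test, pred)
    print(f"{name}: acc={acc:.3f} prec={prec:.3f} rec={rec:.3f} f1={f1:.3f}")
```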

The accuracy, precision, recall, and F1-score of the proposed method are 98.7%, 98.6%, 96.8%, and 97.5%, respectively.

5. Conclusion and Future Scope

This study proposed a novel architecture for diabetes patient classification using K-means clustering and SVM. Clusters of the database are created using the K-means clustering method, and the resulting clusters are then treated as features for classification with SVM. The approach is validated on a publicly accessible dataset, the Pima Indians Diabetes Database, which contains the data of 668 female patients; 80% of these data are used to train the model and 20% to test it on the proposed architecture, with a maximum accuracy of 98.7%. In the future, classification rates may be improved by extracting more reliable features from the database. Additionally, combining techniques such as decision fusion of several classifiers might further help the classification process.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.