Technical advances in the healthcare sector have given rise to many new applications of artificial intelligence. Patterns that identify disease are learned from data, and the onset of many diseases, including diabetes mellitus, fatal heart diseases, and symptomatic cancer, can be predicted. Many algorithms have played a critical role in disease prediction. This paper proposes a machine learning (ML) based approach for diabetes mellitus prediction. Several ML algorithms are compared, and the three classifiers providing the highest accuracy are determined: random forest (RF), gradient boosting machine (GBM), and light gradient boosting machine (LGBM). Prediction accuracy is obtained on two datasets: the Pima Indians dataset and a curated dataset. The LGBM, GBM, and RF classifiers are used to build a predictive model, and the accuracy of each classifier is noted and compared. In addition to this general prediction procedure, data augmentation is applied, and the final prediction accuracy is obtained for LGBM, GBM, and RF. A comparative study of the two datasets with and without augmentation is also presented, with the aim of further improving the accuracy of diabetes prediction.

1. Introduction

Diabetes mellitus (DM) is a multifactorial, progressive, chronic metabolic disorder. It is a persistent ailment triggered by excessive blood sugar levels. In 2019, 463 million people were affected by diabetes mellitus. It is estimated that 578 million people are likely to be affected by 2030 and that the number will rise to 700 million by 2045 [1]. DM is also considered an autoimmune disease; hence, no single primary cause can be identified [2]. Contributing factors include age, family history, insulin production, body mass index, stress, and pregnancy. Some of these factors can be taken as attributes in a dataset, and a study can be conducted to predict the disease from them. These attributes play an important role in a person’s health. By using individual attributes and combinations of them [3], diabetes can be predicted with the machine learning techniques developed so far. Predicting the disease early helps a person avoid the further complications that it can lead to [4].

1.1. Diabetes Is Majorly Classified into 4 Types
1.1.1. Type 1 Diabetes

It is also called juvenile diabetes and is insulin dependent. The immune system damages the insulin-releasing cells, stopping insulin production in the body. It occurs mostly in adolescence. Treatment aims at controlling the blood sugar level through insulin therapy, diet, and exercise. Complications of type 1 include kidney damage, problems in pregnancy, weakened blood flow, and skin problems [5].

1.1.2. Type 2 Diabetes

The body either produces insufficient insulin or resists the action of insulin. It is milder than type 1 and can be treated using insulin therapy, medication, diet, and exercise. Its complications include kidney, nerve, and eye problems; heart disease; and stroke. Diagnosis involves tests such as the A1C test, the fasting plasma glucose (FPG) test, and the random plasma glucose (RPG) test. Other tests include the oral glucose tolerance test (OGTT), the glucose challenge test, the fasting blood sugar (FBG) test, and the random blood sugar (RBG) test [5].

1.1.3. Gestational Diabetes

It is diagnosed during pregnancy. Some of the complications include fluctuating blood pressure, difficulty in breathing, abnormal birth weight, early birth, and diabetes later in life. The tests used are the glucose challenge test and the glucose tolerance test. Treatment includes diet, insulin injections, exercise, and monitoring of blood glucose. Gestational diabetes is the one type for which the chance of the disease persisting after birth is low or nil in most cases. However, without adequate care, it may recur later in the affected person [6].

1.1.4. Prediabetes

It is also called impaired glucose tolerance. It is a condition in which blood sugar is higher than normal but not yet high enough to be classified as type 2 diabetes. It raises the risk of type 2 diabetes, heart disease, and stroke. The fasting glucose level is in the range of 100 to 125 mg/dL, and the HbA1c range is 5.7% to 6.4%. Some of the conditions that occur during prediabetes are high BP, low HDL levels, high blood sugar levels, large waist size, and high triglycerides. The presence of three or more of these conditions is termed metabolic syndrome [6].

2. Literature Review

Many researchers have identified and developed work based on DM disease using ML algorithms and classifiers. Some of the work and techniques used are discussed below.

Kopitar et al. [7] have utilized the concept of artificial intelligence for improving the disease prediction accuracy. It consists of three steps: preprocessing, feature selection, and feature classification. Methods such as HSA (harmony search algorithm), GA (genetic algorithm), and PSO (particle swarm optimization) along with k-means clustering for selection of feature are adopted. KNN is used for classification procedures. The accuracy prediction evaluation metrics used include sensitivity, specificity, recall, and precision. An accuracy of 91.65% is obtained.

Deng et al. [8] have constructed a predictive model for diabetes prognosis using DT classifier and SMOTE. The system proposed contains 2 stages: In stage 1, data imbalance is removed using SMOTE. In stage 2, diagnosis of diabetes is made using DT classifier. The dataset is collected from a diagnostic lab in Kashmir Valley containing 734 entries. An accuracy of 94.7% is obtained.

Espino et al. [9] used transfer learning and data augmentation to overcome the challenges posed by imbalanced datasets and small training datasets. The authors systematically examined 3 neural network architectures, several transfer learning strategies, data augmentation techniques including mixup and generative models, and different loss functions. The same network architecture is developed for type 1 diabetes using the public OhioT1DM dataset. The dataset is collected from Beth Israel Deaconess Medical Center and approved by the Institutional Review Board. An accuracy of 95% is obtained.

Bavkar et al. [10] have designed a pipeline based model using deep learning (DL) techniques to predict diabetes. It includes data augmentation using a variable auto encoder (VAE), feature augmentation using sparse encoder (SAE), and a convolution neural network for classification. The dataset used is Pima dataset taken from UCI Repository. The accuracy obtained is 92.31% by using CNN classifier and training it along with SAE for feature augmentation in comparison with a well-balanced dataset.

Branimir et al. [11] proposed a system to overcome the following two challenges: heterogeneity regarding previous techniques and lack of transparency in features. It used the PRISMA methodology. 18 varieties of model comparison along with algorithms based on trees were performed. The authors concluded that KNN and SVM are majorly used for prediction.

Jyotismita et al. [12] have reviewed the detection and analysis of diabetes along 6 facets, namely, dataset, processing methods, feature extraction, ML identification, and classification and diagnosis of DM, with the aim of overcoming the flaws of classification. A comparison between various supervised, unsupervised, and clustering techniques is made. The authors conclude that different datasets pose different challenges and that significant work is still needed to improve the efficiency of detecting various diabetic diseases.

Zhang et al. [13] have used data augmentation to overcome the problem of insufficient data. The authors propose using a Gaussian mixture model (GMM) to generate additional data when the dataset is numeric. The dataset consists of 1157 samples. Five regressors are used: linear regressor, decision tree regressor, random forest regressor, K-neighbors regressor, and ridge regressor.

Safial et al. [14] have proposed a strategy for diabetes diagnosis through DL network by using 5-fold and 10-fold cross validation for training. The dataset used is Pima Indians dataset. The prediction accuracy is obtained as 98.35% when 10-fold cross validation is used.

Nur et al. [15] have focused mainly on data preprocessing which includes the following processes: removing the missing values, balancing the data process, and performing feature importance and data augmentation process. RF and LR are used for algorithm classification. The result obtained is 20% higher for precision and 24% higher for recall when compared to data without preprocessing.

3. Algorithms for Each Classifier

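Algorithm 1: Principal component analysis (PCA)
Input: n-dimensional data, $X_1 \in \mathbb{R}^{n_1}$, consisting of $N$ samples, and a variance threshold, $T_{variance}$
Output: reduced k-dimensional data, $Y_1 \in \mathbb{R}^{k_1}$
(1) Given $X_1 \in \mathbb{R}^{n_1}$, obtain the mean, $\bar{X}_1 = \frac{1}{N} \sum_{i_1=1}^{N} X_{1 i_1}$, where $\bar{X}_1 \in \mathbb{R}^{n_1}$
(2) Compute the $n_1 \times n_1$ covariance matrix, $C_1 = \frac{1}{N-1} \sum_{i_1=1}^{N} (X_{1 i_1} - \bar{X}_1)(X_{1 i_1} - \bar{X}_1)^{T}$
(3) Eigenvalue decomposition: $C_1 = P_1 D_1 P_1^{-1}$, where $P_1 \in \mathbb{R}^{n_1 \times n_1}$ is the eigenvector matrix and $D_1$ denotes the diagonal matrix of eigenvalues
(4) Sort the eigenvectors in descending order of eigenvalue and select the first $k_1$ eigenvectors, $\tilde{P}_1$, such that the retained variance $\geq T_{variance}$
(5) Project the data $X_1$ into $k_1$ dimensions by $Y_1 = \tilde{P}_1^{T} X_1$, where $Y_1 \in \mathbb{R}^{k_1}$.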
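
The PCA steps above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the eigenvectors are found by power iteration with deflation (an assumption made to avoid external libraries), and the default threshold of 0.95 is only an example value.

```python
import math

def mean_center(rows):
    """Step 1: subtract the per-feature mean from every sample."""
    n, dims = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [[r[j] - means[j] for j in range(dims)] for r in rows]

def covariance(centered):
    """Step 2: sample covariance matrix (divides by N - 1)."""
    n, dims = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(dims)]
            for a in range(dims)]

def top_eigenpair(matrix, iters=500):
    """Dominant eigenpair of a symmetric matrix by power iteration.
    Assumes the start vector is not orthogonal to the dominant eigenvector."""
    dims = len(matrix)
    vec = [1.0 / math.sqrt(dims)] * dims
    for _ in range(iters):
        nxt = [sum(matrix[a][b] * vec[b] for b in range(dims)) for a in range(dims)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break
        vec = [v / norm for v in nxt]
    value = sum(vec[a] * matrix[a][b] * vec[b]
                for a in range(dims) for b in range(dims))
    return value, vec

def pca(rows, variance_threshold=0.95):
    """Steps 3-5: keep leading eigenvectors until the variance threshold is met."""
    centered = mean_center(rows)
    cov = covariance(centered)
    dims = len(cov)
    total = sum(cov[a][a] for a in range(dims))  # trace = total variance
    components, explained = [], 0.0
    while total > 0 and explained / total < variance_threshold and len(components) < dims:
        value, vec = top_eigenpair(cov)
        components.append(vec)
        explained += value
        # Deflation: remove the component just found before the next search.
        cov = [[cov[a][b] - value * vec[a] * vec[b] for b in range(dims)]
               for a in range(dims)]
    projected = [[sum(r[j] * c[j] for j in range(dims)) for c in components]
                 for r in centered]
    ratio = explained / total if total > 0 else 0.0
    return projected, components, ratio
```

For data lying on a single line, such as `[[1, 2], [2, 4], [3, 6], [4, 8]]`, one component already retains essentially all of the variance, so the sketch returns a one-dimensional projection.
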
Algorithm 2: Independent component analysis (ICA)
Input: original n-dimensional data, $X_1 \in \mathbb{R}^{n_1}$
Output: reduced k-dimensional data, $Y_1 \in \mathbb{R}^{k_1}$
(1) Choose a non-quadratic nonlinear function, $G_1$, which is used to approximate negentropy.
(2) Model the data as $W_1 \times H_1 = X_1$, where $W_1$, $H_1$, and $X_1$ are the mixing matrix, the source consisting of multiple components, and the mixed output, respectively.
(3) Apply PCA to $X_1$: $X_1 = \mathrm{PCA}(X_1)$
(4) while $W_1$ changes do
(5) $W_1 = \mathrm{mean}(X_1 \times G_1(W_1 \cdot X_1)) - \mathrm{mean}(G_1'(W_1^{T} \cdot X_1)) \, W_1$
(6) $W_1 = \mathrm{orthogonalize}(W_1)$
(7) Compute $Y_1 = W_1 \cdot X_1$, where $Y_1 \in \mathbb{R}^{k_1}$.

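Algorithm 3: Correlation-based feature selection
Input: original n-dimensional data, $X_1 \in \mathbb{R}^{n_1}$, and expected output (target), $Y_{1T} \in \mathbb{R}$
Output: reduced k-dimensional data, $Y_1 \in \mathbb{R}^{k_1}$
(1) for $i \leq n_1$ do
(2) Compute the correlation between the ith feature of $X_1$ and the target $Y_{1T}$
(3) Sort the correlations in descending order and choose the first $k_1$ features for $Y_1 \in \mathbb{R}^{k_1}$.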
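
A small sketch of this selection step follows. It assumes the Pearson correlation coefficient and ranks features by its absolute value; both choices, along with the function names and the toy values, are illustrative assumptions rather than details given in the text.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear information
    return cov / (sx * sy)

def select_top_k(features, target, k):
    """features maps a feature name to its column; returns the k best names."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Toy values, not taken from either dataset.
features = {"glucose": [90, 100, 150, 160], "noise": [5, 1, 5, 1]}
target = [0, 0, 1, 1]
selected, scores = select_top_k(features, target, k=1)
print(selected)
```
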
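Algorithm 4: k-nearest neighbors (KNN)
Input: n-dimensional data, $X_1 \in \mathbb{R}^{n_1}$, and target output, $Y_1 \in \mathbb{R}$
Output: the posterior probability (pp), $P_1 \in [0, 1]$, of unseen test data, $x_1$, where $\sum_{i_1=1}^{C_1} P_{1 i_1} = 1$, $C_1 = 2$ (diabetes present ($C_{11}$) or not ($C_{12}$))
(1) Calculate the distances, $D_{1 h_1} = \left( \sum_{i_1=1}^{n_1} |X_{1 i_1} - x_{1 i_1}|^{q_1} \right)^{1/q_1}$, for $k_1$ query points, where $X_{1 i_1}$ = current instance, $x_{1 i_1}$ = query instance, and $q_1$ = order
(2) Establish the set $S_1$ with the $k_1$ closest points
(3) Estimate the pp, $P_1$, for each class, $P_{1 i_1}(x_1) = \frac{1}{k_1} \sum_{j \in S_1} \mathbb{1}(y_j = C_{1 i_1})$, and assign the class by $f_1(x_1) = \arg\max_{i_1} P_{1 i_1}(x_1)$, where $f_1(x_1)$ is the class assignment function.
pp means posterior probability.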
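
The three KNN steps can be illustrated as follows. The function names and the toy points are hypothetical; the distance is the Minkowski distance of order `q`, which reduces to the Euclidean distance for `q = 2`.

```python
def minkowski(a, b, q=2):
    """Minkowski distance of order q between two points."""
    return sum(abs(x - y) ** q for x, y in zip(a, b)) ** (1.0 / q)

def knn_posterior(train_x, train_y, query, k=3, q=2, classes=(0, 1)):
    """Fraction of the k nearest neighbours that belong to each class."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: minkowski(pair[0], query, q))
    nearest = [label for _, label in ranked[:k]]
    return {c: nearest.count(c) / k for c in classes}

def knn_classify(train_x, train_y, query, k=3, q=2, classes=(0, 1)):
    """Assign the class with the largest estimated posterior probability."""
    posterior = knn_posterior(train_x, train_y, query, k, q, classes)
    return max(posterior, key=posterior.get)
```
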
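Algorithm 5: Decision tree (DT)
Input: n-dimensional data, $X_1 \in \mathbb{R}^{n_1}$, and target output, $Y_1 \in \mathbb{R}$
Output: the pp, $P_1 \in [0, 1]$, of unseen test data, $x_1$, where $\sum_{i_1=1}^{C_1} P_{1 i_1} = 1$, $C_1 = 2$ (diabetes present ($C_{11}$) or not ($C_{12}$))
(1) Divide the data $Q_1$ at a node using $\theta = (j_1, t_{m_1})$ into the subsets $Q_{1,left}(\theta)$ and $Q_{1,right}(\theta)$; $\theta$ contains the feature, $j_1$, and the threshold, $t_{m_1}$
(2) Calculate the impurity at the node using an impurity function, $H_1$: $G_1(Q_1, \theta) = \frac{n_{left}}{N_{m_1}} H_1(Q_{1,left}(\theta)) + \frac{n_{right}}{N_{m_1}} H_1(Q_{1,right}(\theta))$
(3) Reduce the impurity by selecting the right parameters, $\theta^{*} = \arg\min_{\theta} G_1(Q_1, \theta)$
(4) Repeat the process for the subsets $Q_{1,left}(\theta^{*})$ and $Q_{1,right}(\theta^{*})$ until the maximum depth is reached, $N_{m_1} <$ min samples, or $N_{m_1} = 1$.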
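
One split search of this kind can be sketched as below. The impurity function is taken to be the Gini index for binary labels, which is an assumption (the text only names a generic impurity function), and the helper names and toy rows are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 1.0 - p * p - (1.0 - p) * (1.0 - p)

def split_quality(rows, labels, feature, threshold):
    """Weighted impurity of the two subsets produced by (feature, threshold)."""
    left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split(rows, labels):
    """Exhaustive search for the (feature, threshold) with the lowest impurity."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({r[feature] for r in rows}):
            score = split_quality(rows, labels, feature, threshold)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best

# Toy rows of (glucose, BMI); the values are illustrative only.
rows = [[85, 22.0], [90, 35.0], [150, 24.0], [160, 33.0]]
labels = [0, 0, 1, 1]
print(best_split(rows, labels))
```

A full tree would apply `best_split` recursively to the left and right subsets until one of the stopping conditions in step (4) holds.
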
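Algorithm 6: Adaptive boosting (AdaBoost)
Input: n-dimensional data, $X_1 \in \mathbb{R}^{n_1}$, with $N$ samples and target output, $Y_1$
Output: the pp, $P_1 \in [0, 1]$, of unseen test data, $x_1$, where $\sum_{i_1=1}^{C_1} P_{1 i_1} = 1$, $C_1 = 2$ (diabetes present ($C_{11}$) or not ($C_{12}$))
(1) Initialize the sample weights, $D_1(i_1) = \frac{1}{N}$, $i_1 = 1, 2, \ldots, N$.
(2) for $t_1 \leq T_1$ (n_Classifiers) do
(3) Train a weak learner using the distribution $D_{t_1}$.
(4) Select a weak hypothesis, $h_{t_1}: \mathbb{R}^{n_1} \rightarrow \mathbb{R}$, with low weighted error, $\epsilon_{t_1} = \Pr_{i_1 \sim D_{t_1}}[h_{t_1}(X_{1 i_1}) \neq Y_{1 i_1}]$
(5) Choose $\alpha_{t_1} = \frac{1}{2} \ln\left(\frac{1 - \epsilon_{t_1}}{\epsilon_{t_1}}\right)$ and update $D_{t_1 + 1}(i_1) = \frac{D_{t_1}(i_1) \exp(-\alpha_{t_1} Y_{1 i_1} h_{t_1}(X_{1 i_1}))}{z_{1 t_1}}$, where $i_1 = 1, \ldots, N$ and $z_{1 t_1}$ is the normalization factor.
(6) Output pp: $H_1(x_1) = \mathrm{sign}\left(\sum_{t_1 = 1}^{T_1} \alpha_{t_1} h_{t_1}(x_1)\right)$.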
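
A single boosting round, covering the error, the learner weight, and the sample-weight update, might look like the sketch below. It assumes that labels and predictions are coded as -1 and +1 and that the weighted error lies strictly between 0 and 1; the function name is hypothetical.

```python
import math

def adaboost_round(weights, predictions, labels):
    """One AdaBoost round for labels and predictions coded as -1 / +1.

    Returns the weighted error, the learner weight alpha, and the
    renormalised sample weights for the next round.
    """
    # Weighted error: total weight of the misclassified samples.
    error = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Misclassified samples (y * p = -1) gain weight, correct ones lose weight.
    updated = [w * math.exp(-alpha * y * p)
               for w, p, y in zip(weights, predictions, labels)]
    z = sum(updated)  # normalisation factor
    return error, alpha, [w / z for w in updated]

# Four samples with uniform initial weights; the weak learner misses the last one.
error, alpha, new_weights = adaboost_round(
    [0.25, 0.25, 0.25, 0.25], [1, 1, -1, 1], [1, 1, -1, -1])
print(error, alpha, new_weights)
```

After the update, the single misclassified sample carries half of the total weight, which is what pushes the next weak learner to focus on it.
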
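Algorithm 7: Random forest (RF)
Input: n-dimensional data, $X_1 \in \mathbb{R}^{n_1}$, and target output, $Y_1 \in \mathbb{R}$
Output: the pp, $P_1 \in [0, 1]$, of unseen test data, $x_1$, where $\sum_{i_1=1}^{C_1} P_{1 i_1} = 1$, $C_1 = 2$ (diabetes present ($C_{11}$) or not ($C_{12}$))
(1) for $b_1 = 1$ to $N$ (n_Bagging) do
(2) Draw a bootstrap sample, $(X_{1 b_1}, Y_{1 b_1})$, from the given $X_1 \in \mathbb{R}^{n_1}$, $Y_1 \in \mathbb{R}$
(3) Grow an RF tree, $f_{1 b_1}$, using $X_{1 b_1}$ and $Y_{1 b_1}$ by recursively repeating the splitting procedure.
(4) Output the pp, $P_{1 RF}^{N}(x_1) = \frac{1}{N} \sum_{k=1}^{N} f_{1 k}(x_1)$, where $f_{1 k}(x_1)$ is the prediction of the kth RF tree.
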
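Algorithm 8: Naive Bayes (NB)
Input: n-dimensional data, $X_1 \in \mathbb{R}^{n_1}$, and target output, $Y_1 \in \mathbb{R}$
Output: the pp, $P_1 \in [0, 1]$, of unseen test data, $x_1$, where $\sum_{i_1=1}^{C_1} P_{1 i_1} = 1$, $C_1 = 2$ (diabetes present ($C_{11}$) or not ($C_{12}$))
(1) Assign the prior probabilities for each class, $P_1(C_{11}) = \frac{N_{C_{11}}}{N}$ and $P_1(C_{12}) = \frac{N_{C_{12}}}{N}$, where $N$ is the number of samples
(2) Output the pp of each class for the given predictors (attributes), $P_1(C_{1 i_1} | X_1) = \frac{P_1(X_1 | C_{1 i_1}) \, P_1(C_{1 i_1})}{P_1(X_1)}$, where $P_1(X_1 | C_{1 i_1})$ is the likelihood of the predictor for a given class and $P_1(X_1)$ is the prior probability of the predictor.
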
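Algorithm 9: Gradient boosting (GB)
Input: n-dimensional data, $X_1 \in \mathbb{R}^{n_1}$, and target output, $Y_1 \in \mathbb{R}$
Output: the pp, $P_1 \in [0, 1]$, of unseen test data, $x_1$, where $\sum_{i_1=1}^{C_1} P_{1 i_1} = 1$, $C_1 = 2$ (diabetes present ($C_{11}$) or not ($C_{12}$))
(1) Initialize the model with a constant value, $F_{1,0}(x_1) = \arg\min_{\gamma_1} \sum_{i_1=1}^{N} L_1(Y_{1 i_1}, \gamma_1)$, where $L_1(Y_1, F_1(x_1))$ is the loss function and $N$ denotes the number of samples
(2) for $m = 1$ to $M$ (n_Iterations) do
(3) Calculate the pseudo-residuals, $r_{1 i_1 m} = -\left[\frac{\partial L_1(Y_{1 i_1}, F_1(X_{1 i_1}))}{\partial F_1(X_{1 i_1})}\right]_{F_1 = F_{1,m-1}}$, where $i_1 = 1, 2, \ldots, N$
(4) Fit a base tree, $h_{1 m}(x_1)$, using the training set $(X_{1 i_1}, r_{1 i_1 m})$ for $i_1 = 1, 2, \ldots, N$
(5) Calculate the multiplier $\gamma_{1 m} = \arg\min_{\gamma_1} \sum_{i_1=1}^{N} L_1(Y_{1 i_1}, F_{1,m-1}(X_{1 i_1}) + \gamma_1 h_{1 m}(X_{1 i_1}))$
(6) Update the model by $F_{1,m}(x_1) = F_{1,m-1}(x_1) + \gamma_{1 m} h_{1 m}(x_1)$
(7) $F_{1,M}(x_1)$ is the desired pp, $P_1 \in [0, 1]$.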
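
The boosting loop above can be made concrete with a deliberately small sketch. It assumes a squared-error loss (so the pseudo-residuals are simply the current errors), a single input feature with at least two distinct values, and one-split regression stumps as base trees. These simplifications and all function names are assumptions for illustration; a classifier would normally use a logistic loss instead.

```python
def fit_stump(xs, residuals):
    """Least-squares stump on one feature: (threshold, left_value, right_value)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # needs at least two distinct values
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lv, rv)
    return best[1:]

def stump_predict(stump, x):
    threshold, lv, rv = stump
    return lv if x <= threshold else rv

def gradient_boost(xs, ys, n_iterations=10):
    f0 = sum(ys) / len(ys)  # step 1: the constant minimising squared error
    preds = [f0] * len(ys)
    stages = []
    for _ in range(n_iterations):
        # Step 3: for squared error the pseudo-residual is y - F(x).
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)                 # step 4
        h = [stump_predict(stump, x) for x in xs]
        denom = sum(v * v for v in h)
        # Step 5: closed-form line search for squared error.
        gamma = sum(r * v for r, v in zip(residuals, h)) / denom if denom else 0.0
        preds = [p + gamma * v for p, v in zip(preds, h)]  # step 6
        stages.append((gamma, stump))
    return f0, stages

def predict(model, x):
    f0, stages = model
    return f0 + sum(gamma * stump_predict(stump, x) for gamma, stump in stages)
```
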

4. Implementation

Machine learning is a field in which many classification and regression problems can be solved. ML based data-driven approaches have been used along with the individually recorded data of each person/patient for implementation. The first step in the implementation procedure is to select and identify the dataset. The selected dataset is then input into the predictive model in order to obtain the accuracy percentage. Predictive models are developed using different combinations of the input parameters of the dataset, chosen on the basis of the correlation between the attributes used [16]. To examine these relationships, a correlation matrix can be generated as shown in Figures 1 and 2. The attributes with maximum relevance are chosen and used for further processing. Following the correlation matrix, data preprocessing is done in order to clean the data, handle missing data, correct incomplete data, etc. Data preprocessing is followed by feature extraction and feature selection [17]. The data are divided into training data (TrD) and testing data (TtD). The model is then built using the machine learning algorithms, and the prediction performance is assessed. The major reason for predicting the disease is to maintain a healthcare regime, create awareness, and prevent a person from being further affected by other diseases that can be caused by diabetes mellitus.

The various processes are explained below.

4.1. Dataset

The first step in the prediction process is to identify the dataset. The features and attributes in the dataset play an important role in prediction. The datasets used are Pima Indians dataset [18] and a curated dataset [19]. The dataset consists of attributes such as age, blood pressure, insulin, glucose, pregnancy, diabetes pedigree function, skin thickness, outcome, and BMI. These attributes are considered for disease prediction by comparing them with the different classifiers.

In this study, two types of datasets are used as mentioned above. The first dataset is Pima Indians dataset taken from UCI Repository. It consists of 9 attributes: age (A), glucose (G), insulin (I), blood pressure (BP), diabetic pedigree function (DPF), BMI, pregnancy (P), skin thickness (ST), and outcome (O). The second dataset is taken from a survey conducted independently among known sources containing the same 9 attributes mentioned above. This dataset is called the DMS (Diabetes Mellitus Survey) dataset. The accuracy percentage is calculated using the various classifiers suitable for implementation.

The total number of entries in Pima Indians dataset is 768 with 500 negative instances and 268 positive instances. The number of entries in DMS dataset is 1110, the number of positive instances is 372, and the number of negative instances is 738.

The attributes used in the dataset are described in Figure 1, and the correlation of these attributes is taken into account to find out the attributes that play an important role in the disease prediction.

Figure 2 shows the attributes, and the highest relevance to the disease is taken and measured based on correlation matrix as shown in Figure 3 [32].

4.2. Correlation Matrix

Following the dataset selection procedure, a correlation matrix is presented in a diagrammatic form or table to display the relation between the various variables or attributes used. The correlation can also be determined using the value of information gain calculated between the attributes. The information gain measures the information received, in bits, about the class prediction that needs to be performed [20-22]. It mainly depends on the feature distribution and the corresponding class distribution. To calculate the information gain, the following formula can be used [23]:

$$IG(S_1, x_i) = H(S_1) - \sum_{v \in values(x_i)} \frac{|S_{1, x_i = v}|}{|S_1|} H(S_{1, x_i = v})$$

with entropy

$$H(S_1) = -p_{1+}(S_1) \log_2 p_{1+}(S_1) - p_{1-}(S_1) \log_2 p_{1-}(S_1),$$

where $S_1$ is the set of training examples, $x_i$ is the vector of the ith variable in the set, $|S_{1, x_i = v}| / |S_1|$ is the fraction of examples [19] of the ith variable having value $v$, and $p_{1+}(S_1)$ and $p_{1-}(S_1)$ are the probabilities of a training sample in the set $S_1$ belonging to the positive and negative class [24].
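
The information gain computation can be illustrated with a short sketch. It assumes that the attribute has already been discretized into a small number of values (continuous attributes such as glucose would need binning first); the function names and toy values are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on values."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# A binned attribute that separates the classes perfectly gains one full bit.
print(information_gain(["high", "high", "low", "low"], [1, 1, 0, 0]))
# An attribute unrelated to the class gains nothing.
print(information_gain(["a", "b", "a", "b"], [1, 1, 0, 0]))
```
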

Each value in the correlation matrix, as shown in Figures 3 and 4, indicates how strongly the two corresponding attributes depend on one another.

The information gain calculated for the attributes is shown in Table 1.

Table 1 indicates that glucose has the highest relevance (47%), followed by BMI (29%) and pregnancy (22%). Attributes with a relevance above the 20% threshold are considered the prime attributes; glucose, BMI, and pregnancy are therefore taken as the prime attributes.

4.3. Data Preprocessing

The data preprocessing is one major step during implementation. It is the process of transforming the data before including it in the algorithm that is going to be used. It is a process of converting raw data into clean and processed data [25]. The purpose of data preprocessing is to obtain a more clarified result in a specific format. It can also be used to modify the data in such a way that more than one type of algorithm can be used during processing and implementation. The steps in data preprocessing include the following [26].

4.3.1. Data Cleaning

Data cleaning handles missing values, either by ignoring the affected records or by filling in the missing values. Noisy data are smoothed by binning, regression, and clustering, and outlier values that cannot be grouped into clusters are removed [27].

4.3.2. Data Integration

It is used to combine and merge multiple sources into single storage unit known as the data warehouse. It can include detection and data value conflict resolution, object matching, schema integration, and removing redundant attributes [28].

4.3.3. Data Transformation

It involves the process of grouping and consolidating the quality of data into alternate forms by structuring and formatting the data and changing values by processes such as aggregation, attribute selection, normalization, and generalization [29].

4.3.4. Data Reduction

The process is used to reduce the representation of data. Some of the strategies are numerosity reduction, attribute subset selection, discretization, data cube aggregation, data compression, and dimensionality reduction [30].

The outcome of the processes explained above is shown in Figure 5.

Figure 5 demonstrates the difference in the values before and after preprocessing. The values used in the dataset are cleaned, integrated, transformed, and reduced for further procedure.
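
Two of the preprocessing steps, filling in missing values and normalization, can be sketched as follows. The sketch assumes that a value of 0 marks a missing measurement and that the median of the observed values is used as the fill value; neither choice is stated in the text, and the function names and sample values are hypothetical.

```python
import statistics

def impute_missing(column, missing=0):
    """Replace entries equal to the missing marker with the median of the rest."""
    observed = [v for v in column if v != missing]
    fill = statistics.median(observed)
    return [fill if v == missing else v for v in column]

def min_max_scale(column):
    """Rescale a numeric column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # a constant column has no spread to scale
    return [(v - lo) / (hi - lo) for v in column]

glucose = [100, 0, 120, 140]        # illustrative values; 0 is treated as missing
cleaned = impute_missing(glucose)   # the 0 is replaced by the median, 120
print(min_max_scale(cleaned))
```
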

4.4. Exploratory Data Analysis

Exploratory data analysis is the critical process of performing initial investigations on data to discover patterns and spot anomalies in order to test the hypothesis of summary statistics and graphical representations [31].

It is a method of analyzing the different kinds of data that are being used in the model. Specific statistical functions and techniques you can perform with EDA tools include clustering, univariate visualization, bivariate visualization, and multivariate visualization [32].

Tools such as NumPy, Seaborn, and Matlab are used for data analysis and visualization in the model.

The histogram shown in Figure 6 is a graphing tool that summarizes discrete or continuous data measured on an interval scale [33].

4.5. Algorithms in Machine Learning

There are various algorithms in machine learning that can be used for prediction concepts, including random forest, support vector machine, light gradient boosting machine, gradient boosting machine, decision tree, XGBoost, and logistic regression. In order to choose the highest accuracy producing algorithm, a model is built for the dataset taken, and each algorithm is applied to it. The classifier producing the highest value is then selected based on the results [34].

The model built during simulation for the proposed system generated the results shown in Figure 7.

From Figure 7, it can be seen that the algorithms LGBM, decision tree, and XGB produced the highest results comparatively.

The results obtained when the process was run again are shown in Figure 8.

From Figure 8, it can be seen that the algorithms LGBM, XGB, and decision tree still produced the highest results comparatively even when the process was executed again.

From the above simulation, the LGBM algorithm can be taken into consideration for actual model processing and execution as it produces the highest accuracy percentage.

4.6. Theoretical Description of Concepts Used

Many ML methods and techniques can be tested and used along with classifiers for diabetes disease prediction. For the datasets used, the best suited classifiers are the gradient boosting classifiers (GBM, LGBM, and XGB) and the decision tree, based on the simulation described earlier. In addition, other classifiers such as random forest, naive Bayes, and support vector machine are also considered for the final accuracy analysis [35].

The theoretical concepts of the machine learning classifiers used are explained below.

4.6.1. Gradient Boosting

A gradient boosting classifier is a combination of many weak learners formed into a predictive model typically in the form of decision trees. The number of trees is based on the number of values in the dataset used. It is mainly used when the bias error in the model needs to be decreased. A gradient-descent technique is chosen to obtain values of the coefficients [35].

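In order to obtain the values of the coefficients, the loss function needs to be calculated. It is calculated using $L_1(y_{i_1}, \hat{y}_{i_1})$, where $y_{i_1}$ is the actual value and $\hat{y}_{i_1}$ is the value finally predicted by the model. So $\hat{y}_{i_1}$ is replaced with $F_1(x_{1 i_1})$, which represents the model output for the target [36]. It is mathematically given as

$$\mathcal{L} = \sum_{i_1=1}^{N} L_1\left(y_{i_1}, F_1(x_{1 i_1})\right)$$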

4.6.2. Light Gradient Boosting Machine (LGBM)

The LGBM has high performance and is considered an advanced version of the “gradient boosting framework” based on the decision tree algorithm. It is mainly used for ranking and classification. It splits the tree leaf-wise with the best fit. It can be calculated using many data improvement techniques and can be given by evaluating the variance after dividing the values [26]. It can be given by the following equation:

The value determines the way in which the decision tree algorithm can be used to split the data and implement the values. The equation represents the number of trees that can be used in the model depending on the number of instances used in the dataset. When compared with GB, LGBM is comparatively faster and the parameters used are different, which can further increase or decrease the efficiency [37].
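
As a point of reference, one standard quantity of this kind is the variance gain that the LightGBM formulation uses to score a candidate split of feature $j$ at point $d$. It is shown here as an assumption about what is meant, not as the exact expression used in this work; the symbols $O$, $g_i$, $n_O$, $n_{l|O}^{j}(d)$, and $n_{r|O}^{j}(d)$ are introduced only for this illustration.

```latex
V_{j|O}(d) = \frac{1}{n_O} \left(
  \frac{\left( \sum_{x_i \in O :\, x_{ij} \le d} g_i \right)^2}{n_{l|O}^{j}(d)}
  +
  \frac{\left( \sum_{x_i \in O :\, x_{ij} > d} g_i \right)^2}{n_{r|O}^{j}(d)}
\right)
```

Here $O$ is the set of training instances at the node, $g_i$ is the gradient of the loss for instance $x_i$, $n_O$ is the number of instances in $O$, and $n_{l|O}^{j}(d)$ and $n_{r|O}^{j}(d)$ are the numbers of instances falling to the left and right of the split. The split with the largest gain is chosen.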

4.6.3. XGB

XGBoost is used for supervised regression models. It is used to infer the details about the validity of the objective function and base learners. The concept of ensemble learning is used to combine the results into a single prediction by training and combining individual models. XGBoost is an ensemble learning method. The objective function of XGBoost is given as follows:

$$obj = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{j=1}^{K} \Omega(f_j), \qquad \hat{y}_i = \sum_{j=1}^{K} f_j(x_i),$$

where $f_j$ denotes the prediction from the jth tree, $l$ is the training loss, and $\Omega$ is the regularization term. The MSE (mean squared error) is given as follows [38]:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

4.6.4. Decision Tree

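The decision tree entropy is generated as follows: A node is taken, and class labels are identified. The value of $j$ ranges from 1 to $c$, the number of class labels. It is given mathematically as follows:

$$H_1 = -\sum_{j=1}^{c} p_j \log_2 p_j,$$

where $p_j$ is the proportion of samples at the node that carry class label $j$.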

LGBM uses two concepts, namely, GBDT (gradient boosting decision tree) and GOSS (gradient based one-sided sampling). It is used to separate the tree in a leaf-wise manner that provides best fit whereas other boosting algorithms are used to divide it depth-wise. The accuracy results are better when compared with the other existing boosting algorithms [39].

4.6.5. Random Forest (RF)

The RF consolidates the outputs or outcomes of a number of decision trees together in order to obtain a single result. The DTs considered are taken as a base row sampling technique as well as column sampling technique. The number of base learners is improved depending on the inputs, and the variance is reduced to increase the accuracy. It is taken into account as one of the important bagging methodologies [40].

4.6.6. Naive Bayes (NB)

NB is dependent on classification methods that divide the data using the conditional probability values. Naive Bayes is an algorithm that is used for detecting the behavior of the different patients involved. It is a combination of classification logistic regression for classifying the patients into different groups. It is an algorithm that works swiftly for all the classification problems. It is good for predictions involving real time, multiple classes, recommendation system, text based classification, and sentiment analysis. It is easy to implement for large datasets [2].

The Bayesian formula for calculating naive Bayes algorithm is as follows:

$$P(A|B) = \frac{P(B|A) \, P(A)}{P(B)},$$

where $P(A|B)$ is the posterior probability, $P(B|A)$ is the likelihood probability, $P(A)$ is the class prior probability, and $P(B)$ is the predictor prior probability [41].
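
A worked example of Bayes' rule for a two-class problem is sketched below. The prior is the fraction of positive instances in the Pima Indians dataset reported earlier (268 of 768); the two likelihood values are purely illustrative assumptions, and the function name is hypothetical.

```python
def binary_posterior(p_b_given_a, p_a, p_b_given_not_a):
    """P(A|B) by Bayes' rule, expanding P(B) over A and not-A."""
    evidence = p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)
    return p_b_given_a * p_a / evidence

prior = 268 / 768  # fraction of positive instances in the Pima Indians dataset
# Hypothetical likelihoods of observing some evidence B (for example, a high
# glucose reading) in the positive and in the negative class.
print(binary_posterior(0.8, prior, 0.2))
```

With these illustrative likelihoods the posterior comes out well above the prior, showing how evidence that is more common in the positive class shifts the estimate.
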

4.6.7. Support Vector Machine

Support vector machine belongs to the concept of supervised learning algorithm that can be used for problems of regression and classification. It is used to generate the decision boundary or best line for dividing the n-dimensional space into many classes that are different from one another in order to place the data point in the right category for future purposes. The hyperplane is known as the best decision boundary. To create the hyperplane, SVM selects vectors and extreme points. This introduces the concept of support vectors that further gives rise to the algorithm called support vector machine [42].

SVM uses the Lagrangian formulation mentioned below for classifying the testing samples:

$$d(X^{T}) = \sum_{i=1}^{l} y_i \alpha_i X_i \cdot X^{T} + b_0,$$

where $y_i$ is the class label of support vector $X_i$, $X^{T}$ is a test tuple, and $\alpha_i$ and $b_0$ are numeric parameters [43].
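
Evaluating this decision function for a test tuple can be sketched as follows. The sketch assumes a linear kernel (a plain dot product) and class labels coded as -1 and +1; the support vectors, multipliers, and bias are toy values chosen for illustration, and the function names are hypothetical.

```python
def svm_decision(support_vectors, labels, alphas, bias, x):
    """Sum of alpha_i * y_i * (X_i . x) over the support vectors, plus the bias."""
    total = 0.0
    for sv, y, alpha in zip(support_vectors, labels, alphas):
        dot = sum(s * v for s, v in zip(sv, x))
        total += alpha * y * dot
    return total + bias

def svm_classify(support_vectors, labels, alphas, bias, x):
    """The sign of the decision value gives the predicted class (+1 or -1)."""
    return 1 if svm_decision(support_vectors, labels, alphas, bias, x) >= 0 else -1

# Toy model with two support vectors on opposite sides of the origin.
svs, ys, alphas, b0 = [[1, 1], [-1, -1]], [1, -1], [0.25, 0.25], 0.0
print(svm_decision(svs, ys, alphas, b0, [2, 0]))
print(svm_classify(svs, ys, alphas, b0, [-3, -1]))
```
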

4.7. Data Augmentation

In some cases, there are small datasets. In order to overcome the problem of small datasets or data imbalance, the concept of data augmentation can be applied [44]. Data augmentation in data analysis is a technique that can be utilized for increasing the quantity of data by slightly modifying copies of the already existing data or creating synthetic data from the already existing data [45]. It is useful for enhancing the performance and outcomes of ML models by forming new and different examples to train the datasets. The flow of the process is given in Figure 9.

For text data, techniques such as sampling, tokenization of documents or texts, shuffling sentences, rejoining statements, word replacement, and syntax tree manipulation can be carried out [46]. Some libraries used for data augmentation are Augmentor, Albumentations, Imgaug, and AutoAugment (Deep Augment).

These libraries have to be used along with frameworks for implementation. Some of the libraries already have a predefined or preexisting synergy with specific framework. For example, Albumentations uses PyTorch [47].

4.7.1. Technique

The technique used is oversampling, which randomly duplicates examples in the dataset. Examples are drawn from the minority class with replacement and added to the existing training dataset. This process is repeated until the minority class and the majority class contain the same number of examples [48].

4.8. The Algorithm Used for Oversampling
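Inputs: Class_0: majority class (negative instances); Class_1: minority class (positive instances).
Parameters: the size of Class_0, used as the target size when oversampling the minority class Class_1.
(1)Class_1_over: draw a sample of Class_1, with replacement, of the same size as Class_0 and store it in Class_1_over.
(2)Test_over: concatenate Class_0 and Class_1_over.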
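
Random oversampling of this kind can be sketched in plain Python. The sketch keeps every original example and adds randomly chosen duplicates of the smaller class until the class counts match; the function name and the fixed seed are assumptions made so that the example is reproducible.

```python
import random

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority examples until all classes are equal."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sampling with replacement: the same example may be picked repeatedly.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Class sizes of the Pima Indians dataset: 500 negative and 268 positive instances.
rows = [[i] for i in range(768)]
labels = [0] * 500 + [1] * 268
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(balanced_labels.count(0), balanced_labels.count(1))
```
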

Figures 10 and 11 show the count of the dataset before augmentation and after augmentation. In Figure 10, there is a mismatch in the count of the dataset, whereas in Figure 11, the data count has been balanced.

The metrics used for accuracy prediction include [49] F1-score, precision, recall, sensitivity, and specificity. In the formulas below, $TP$, $TN$, $FP$, and $FN$ denote the numbers of true positives, true negatives, false positives, and false negatives. The metrics can be calculated as follows:

F1-Score: It is a metric used to calculate accuracy. It is used in classification models. It is calculated mathematically as follows:

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

Recall: It is mainly used to identify relevant data among a lot of available data. It is calculated mathematically as follows:

$$Recall = \frac{TP}{TP + FN}$$

Precision: It determines the quality of the positive prediction made by the model. It is calculated mathematically as follows:

$$Precision = \frac{TP}{TP + FP}$$

Specificity: It determines the proportion of actual negatives which are true negatives in the model. It is calculated mathematically as follows:

$$Specificity = \frac{TN}{TN + FP}$$

Sensitivity: It determines the value of $TP$ in proportion to the added values of $TP$ and $FN$. It is calculated mathematically as follows [50]:

$$Sensitivity = \frac{TP}{TP + FN}$$
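
These metrics follow directly from the four confusion-matrix counts, as the sketch below shows for binary labels coded as 0 and 1. The function names are hypothetical, and a ratio with a zero denominator is reported as 0.0 by convention.

```python
def confusion_counts(y_true, y_pred):
    """Counts of true/false positives and negatives for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def safe_div(a, b):
    return a / b if b else 0.0

def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "sensitivity": recall,  # sensitivity and recall are the same ratio
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(classification_metrics(y_true, y_pred))
```
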

5. Architecture

The architecture consists of the working flow of the procedure as shown in Figure 12. Initially, candidate data are gathered from the various databases available, and the dataset is selected. The chosen data are preprocessed, and feature selection and extraction are carried out. The data are further examined using EDA until all of the defects are rectified. The dataset is then clean and suitable for training and testing procedures. The dataset is divided into training data, testing data, and validation data. The various classifiers are then compared, and the best suited classifier for the dataset is chosen and applied. The prediction model is then built based on the classifiers taken, i.e., GB, LGBM, and RF.

After obtaining the prediction accuracy, data augmentation is further applied to the dataset, and this improves the already existing accuracy percentage. This is finally obtained as the best prediction accuracy percentage of the model. In addition to the above methods, the voting classifier is also used to predict the best possible outcome for the disease prediction among the classifiers LGBM, GB, and RF.

Voting Classifier: It is an ML model that trains on an ensemble of several models and predicts the output class that receives the highest probability or the most votes among them. Voting classifiers are divided into soft voting and hard voting.

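Hard Voting: Based on the higher number of votes $N_c(y_t)$, the prediction of the class label happens via majority voting of each classifier. Hard voting is mainly used to predict class labels. It is given mathematically as follows:

$$\hat{y} = \arg\max_{c} N_c(y_t),$$

where $N_c(y_t)$ is the number of classifiers whose predicted label $y_t$ equals class $c$.

Soft Voting: The probability vectors for each predicted classifier are summed up, and the average is obtained. The highest value is taken as the winning class. Soft voting is mainly used to predict class membership probabilities. It is given mathematically as follows:

$$\hat{y} = \arg\max_{c} \frac{1}{n} \sum_{k=1}^{n} p_k(c),$$

where $n$ is the number of classifiers and $p_k(c)$ is the probability that the kth classifier assigns to class $c$.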
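
The difference between the two schemes is easiest to see on a small example. In the sketch below, the function names and the probability values are hypothetical; the three probability vectors stand in for the outputs of three classifiers on one sample.

```python
from collections import Counter

def hard_vote(predicted_labels):
    """Return the class label predicted by the largest number of classifiers."""
    return Counter(predicted_labels).most_common(1)[0][0]

def soft_vote(probability_vectors):
    """Average the per-class probabilities and return (winning class, averages)."""
    n = len(probability_vectors)
    averaged = [sum(column) / n for column in zip(*probability_vectors)]
    winner = max(range(len(averaged)), key=averaged.__getitem__)
    return winner, averaged

# Two classifiers lean weakly towards class 0, one is confident in class 1.
probabilities = [[0.55, 0.45], [0.55, 0.45], [0.10, 0.90]]
labels = [0 if p[0] >= p[1] else 1 for p in probabilities]
print(hard_vote(labels))         # majority of the individual labels
print(soft_vote(probabilities))  # class with the highest average probability
```

In this example the two schemes disagree: hard voting follows the two weak votes for class 0, while soft voting selects class 1 because the confident classifier dominates the average.
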

6. Results and Discussion

The prediction for diabetes mellitus is done by the model built using the dataset Pima initially, and then the highest accuracy producing algorithms are chosen and further incorporated in the DMS dataset used.

The accuracy values obtained for the various classifiers are given in Table 2. The abbreviations for the classifiers are as follows: LR (logistic regression), XGB (extreme gradient boosting), GB (gradient boosting), DT (decision tree), ET (extra trees), RF (random forest), and LGBM (light gradient boosting machine).

Figure 13 indicates the bar graph of the accuracy percentage obtained while using classifiers LR, XGB, GBM, DT, ET, RF, and LGBM.

Table 2 shows that LGBM and RF produce the highest accuracy. However, from Figures 7 and 8, XGB also produces high accuracy. Therefore, a further prediction experiment with the classifiers RF, LGBM, and GBM is conducted on the Pima and DMS datasets.

Diabetes mellitus is predicted using the predictive model built, and the accuracy for the 2 datasets used varies with each machine learning algorithm used. The result obtained after performing data augmentation for both datasets is given in Table 3.

From Table 3, it can be seen that the LGBM algorithm produces the highest accuracy (89.5%) for the Pima dataset and the same accuracy is also obtained when the RF classifier is used without data preprocessing. When the data is preprocessed, the accuracy obtained for Pima dataset is highest (92.5%) when LGBM algorithm is used. For the DMS dataset, the accuracy obtained is highest (95.27%) when RF algorithm is used without preprocessing, and the accuracy obtained after preprocessing is highest (98.99%) for LGBM.

Figure 14 demonstrates the bar graph of the highest accuracy percentage obtained while using classifiers LGBM, RF, and GBM.

7. Conclusion and Future Scope

From the above study, it can be concluded that the LGBM algorithm provides the highest accuracy when compared with RF and GB classifiers. Therefore, the LGBM algorithm is well suited for the Pima dataset and the DMS dataset used in the study.

The LGBM algorithm differs from RF and GB in the following ways: The parameters used in LGBM are different from those in GB and RF. The parameter tuning varies with each algorithm, and the model is built based on the classifier used. Therefore, in this paper, a predictive model is built using LGBM algorithm, and the accuracy is obtained as shown in Table 3 for the datasets used.

The diabetes mellitus disease prediction can further be improved by enhancing the dataset using other advanced methodologies like transformer based learning. The attributes used can also be employed in different combinations for identification. The classifiers used can be fine-tuned more to predict the disease with higher accuracy, and the probability of occurrence of the disease can be calculated. This will further improve the accuracy percentage and deliver a more profound model to predict diabetes mellitus disease among affected people.

Data Availability

The data used to support the findings of the study are available at https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database.

Conflicts of Interest

The authors declare that they have no conflicts of interest.