Efficient Automated Disease Diagnosis Using Machine Learning Models
Many researchers have recently designed automated diagnosis models using supervised learning. Early diagnosis may reduce the death rate associated with a disease. In this paper, an efficient automated disease diagnosis model is designed using machine learning models for three critical diseases: coronavirus, heart disease, and diabetes. In the proposed model, the data are entered into an Android app, the analysis is performed in a real-time database using a pretrained machine learning model that was trained on the same dataset and deployed in Firebase, and the disease detection result is then shown in the Android app. Logistic regression is used to carry out the computation for prediction. Early detection can help in identifying the risk of coronavirus, heart disease, and diabetes. Comparative analysis indicates that the proposed model can help doctors to give timely medications for treatment.
1. Introduction
Machine learning is used in many areas, including education and healthcare. Greater computing power and the availability of datasets in open-source repositories have further increased its use. The healthcare sector produces large amounts of data, such as images and patient records, that help to identify patterns and make predictions, and machine learning is used there to solve various problems [1–3]. The presence and extent of heart disease vary from person to person. Thus, training a machine learning model on a dataset and then entering an individual patient’s details can help in prediction: the result depends on the data entered and is therefore specific to that individual. Type-2 diabetes can be prevented by controlling weight, lifestyle, and similar factors. Coronavirus disease 2019 (COVID-19) originated in China. Different treatments are being tried for it, but there are no clearly defined steps for treatment.
Artificial intelligence (AI) aims to mimic human cognitive functions. It is bringing a paradigm shift to healthcare, powered by the increasing availability of healthcare data and rapid progress in analytics techniques. Many models have been developed for the automated diagnosis of diseases such as cancer, COVID-19, and diabetes. Researchers have also started using machine learning models for real-time diagnosis by developing mobile apps. Some mobile apps can predict the risk of a certain disease and recommend a diagnosis to a person based on that person’s health conditions. However, efficient early-stage diagnosis is still an ill-posed problem. More recently, many researchers have started using deep-learning models to obtain significantly better performance than conventional machine learning models [6, 7].
In this study, machine learning models are applied to coronavirus, heart disease, and diabetes datasets to predict the risk of these diseases in an individual. An end-to-end process is used in which people enter their details in the mobile application and submit the data. Processing takes place in real time, and the risk is predicted within a few seconds. The mobile application uses the Firebase database as its real-time database on the cloud. The trained parameters of the model are stored in the database, and prediction is done in real time. The user is also notified of the accuracy of the model. In addition, news articles from trusted sources are shared in the app in real time, and the source of each article is mentioned in the app.
The main contributions are as follows:
(i) An efficient automated disease diagnosis model is designed using machine learning models.
(ii) Three critical diseases are selected: coronavirus, heart disease, and diabetes.
(iii) In the proposed model, the data are entered into an Android app, the analysis is performed in a real-time database using a pretrained machine learning model that was trained on the same dataset and deployed in Firebase, and the disease detection result is then shown in the Android app.
(iv) Logistic regression is used to carry out the computation for prediction.
The remainder of the paper is organized as follows. Section 2 presents related work. Section 3 presents the research methodology. Section 4 discusses the experimental results and analysis. Section 5 concludes the paper.
2. Related Work
Machine learning is widely used today because of increasing computation power and the availability of large datasets through open-source tools. Quality of transmission (QoT) can give some insights into the model, and an attempt has been made to monitor the QoT to determine the physical condition of the model. In another study, ML is used in intrusion detection systems [8–10]. Song et al. proposed a modified optimization method for the extreme learning machine. In another study, a supervised deep-learning technique is used to diagnose faults in the induction machine system. ML is also used in demodulation methods for visible light communication systems. In wireless networking, ML has been applied to resource management to obtain optimal performance.
In another study, an algorithm is proposed to achieve local updates and global updates, which are critical for the learning process. ML is also used to solve wireless network problems. Chen et al. showed how artificial neural networks can be used to solve various problems in wireless networks. Nawaz et al. gave a detailed study of how different models are used in 5G technology. In another study, detailed research is presented on how neuromorphic photonics systems are used in solving ML-based problems. Artificial neural networks have been used to detect falls and daily activity. An attempt has even been made to diagnose ML models. ML is also used to detect malware in Android software. Tang et al. provided insights into the use of ML in vehicular 6G networks. ML is also used to predict flight delays. In another attempt, ML is used to determine protein dynamics. In another study, AI is applied to wireless communication, and the resulting research area is called edge learning.
The use of machine learning models in healthcare has increased. Their ability to extract meaning from data and to make predictions is used for the early prediction of diseases. In heart disease research, machine learning is applied to complex problems. For instance, data mining techniques are applied to heart disease data to determine patterns and help in the prediction of heart disease. In another study, a hybrid of machine learning models is proposed for the diagnosis of heart disease. Khourdifi and Bahaj used different machine learning models for heart disease prediction and applied various optimizations, including particle swarm optimization (PSO) combined with ant colony optimization (ACO). In another study, an ensemble technique is used for the prediction of heart disease; the research showed that the ensemble technique increased the accuracy of weak classifiers. Several studies have tried to link heart disease with coronavirus to determine whether there is any relation between them [30–35]. Several attempts have been made to detect heart disease and prevent it before it causes serious harm [36–40].
Machine learning models are also applied to coronavirus disease to solve problems in the medical domain using data. For instance, machine learning techniques and mathematical models are used to estimate the number of infected people and the probable time when coronavirus will be over in China. A study conducted by Bullock explores various applications, tools, and datasets through which artificial intelligence is used against coronavirus. Artificial intelligence and machine learning have a necessary role in fighting coronavirus because they help in its early prediction. In the coronavirus situation, it is also necessary to characterize the propagation of information on social media. Patients who have hypertension and diabetes or who are of old age are at higher risk from coronavirus. An attempt has been made to determine the relation between coronavirus and diabetes [48–51].
Diabetes has been present in society for a very long time. It depends on an individual’s body, diet, and way of living. Machine learning models are applied to diabetes problems to enable early prediction. For instance, Alghamdi et al. developed an ensemble-based model for predicting incident diabetes; the database used in that research covers 32,555 patients. In another study, prediabetes is predicted using machine learning models on the Korean population. Sneha and Gangil analyzed the dataset to select the optimal features for the early prediction of diabetes. Zou et al. used a dataset from Luzhou, China, to predict and diagnose diabetes. An attempt has been made to identify diabetes in developing countries. Diabetes is more prevalent in older people than in younger generations, so an attempt has been made to list the clinical procedure for handling diabetes in older adults. Table 1 presents the comparative analysis among the existing techniques.
The objective of the methodology is to predict the risk of coronavirus, heart disease, and diabetes in an individual from answers to a few questions, using machine learning models in an end-to-end process. The research is carried out on a system with the following configuration: Python 3 and Java 10.0.2 are used with Jupyter Notebook 5.5.0 and Android Studio 3.1.0, respectively, on an Intel(R) Core(TM) i3-2310M CPU @ 2.10 GHz with 8 GB RAM.
A block diagram of the basic steps adopted for each machine learning model is shown in Figure 1. First, data cleaning is performed to convert the raw data into a usable form. After data cleaning, data analysis is done to determine the importance of features. In data analysis, the features are identified, and the data are converted into a form on which machine learning models can be applied. These steps are used for each of the model predictions: (a) COVID-19, (b) heart disease, and (c) diabetes.
3.1. COVID-19
The COVID-19 data collected from [18, 58] include many features, which are listed in Table 2. The dataset is raw and cannot be used directly. It contains 13,174 data points, but most of them have missing values. For instance, 11,825 age values, 12,681 symptoms values, and 12,416 travel history locations are missing. The relevant features selected are age, sex, symptoms, country, and travel_history_location. These values are essential to carry out predictions, so data points in which they are missing are dropped. After removing the null values, the dataset consists of 260 rows. A view of the dataset after cleaning is shown in Table 3.
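The cleaning step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `drop_incomplete` and the three toy records are hypothetical, and it assumes a missing value appears as `None` or an empty string.

```python
# Features the text names as essential for prediction.
REQUIRED = ["age", "sex", "symptoms", "country", "travel_history_location"]

def drop_incomplete(rows, required=REQUIRED):
    """Keep only rows in which every required feature is present and non-empty."""
    return [
        row for row in rows
        if all(row.get(col) not in (None, "") for col in required)
    ]

# Toy records (hypothetical); None or "" marks a missing value.
raw_rows = [
    {"age": 44, "sex": "male", "symptoms": "fever",
     "country": "China", "travel_history_location": "Wuhan"},
    {"age": None, "sex": "female", "symptoms": "cough",
     "country": "Japan", "travel_history_location": "Wuhan"},
    {"age": 61, "sex": "female", "symptoms": "",
     "country": "Italy", "travel_history_location": None},
]

clean_rows = drop_incomplete(raw_rows)
print(len(raw_rows), "->", len(clean_rows))  # prints "3 -> 1"
```

Applied to the real data, a filter of this kind would account for the reduction from 13,174 data points to 260 rows.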
After data cleaning, data analysis is performed on the remaining 260 rows. The country column includes the following countries: “China,” “France,” “Japan,” “Malaysia,” “Nepal,” “Singapore,” “South Korea,” “Thailand,” “United States,” “Cambodia,” “Vietnam,” “Philippines,” “Italy,” “Lebanon,” “Spain,” “Lithuania.” These countries were segregated into two groups: the first with countries having more than 10,000 cases and the second with countries having fewer than 10,000 cases. Data points in the first group are marked by 1 in the country column, and data points in the second group are marked by 0. Data points that had no travel history were marked by 0, and the rest were marked by 1. For the sex column, male was marked by 0 and female by 1. The complete COVID-19 dataset consists only of the details of patients who tested positive. To apply machine learning models, the dataset must also contain negative cases; otherwise, the model would learn to predict every case as positive. For this reason, 80 new rows were added with a negative result, where a 0 entry in the output column corresponds to a negative result. The age and sex of these 80 data points were the same as those of the first 80 data points of the dataset. The symptoms, country, and travel_history_location columns were set to cover all 8 possible cases, with 10 rows for each case. After completion of the dataset, a heat map was drawn to determine the impact of each feature on the prediction of the output.
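The encoding and the construction of the 80 negative rows can be sketched as below. Several details are assumptions made only for illustration: the set `HIGH_CASE_COUNTRIES` is a placeholder (the text does not list which countries exceeded 10,000 cases), symptoms are treated as a binary flag, and the helper names and toy records are hypothetical.

```python
from collections import Counter
from itertools import product

# Assumption: placeholder for the "more than 10,000 cases" group.
HIGH_CASE_COUNTRIES = {"China", "Italy"}

def encode_positive(row):
    """Encode one cleaned patient record as binary features (result 1 = positive)."""
    return {
        "age": row["age"],
        "sex": 0 if row["sex"] == "male" else 1,  # male -> 0, female -> 1
        "symptoms": 1 if row["symptoms"] else 0,  # assumed binary flag
        "country": 1 if row["country"] in HIGH_CASE_COUNTRIES else 0,
        "travel_history_location": 1 if row["travel_history_location"] else 0,
        "result": 1,
    }

def make_negative_rows(positives, rows_per_case=10):
    """Build negative rows: 8 feature combinations, rows_per_case rows each.

    Age and sex are copied from the first rows of the positive set.
    """
    combos = list(product([0, 1], repeat=3))  # (symptoms, country, travel)
    needed = len(combos) * rows_per_case
    if len(positives) < needed:
        raise ValueError(f"need at least {needed} positive rows")
    negatives = []
    for i in range(needed):
        symptoms, country, travel = combos[i // rows_per_case]
        negatives.append({
            "age": positives[i]["age"],
            "sex": positives[i]["sex"],
            "symptoms": symptoms,
            "country": country,
            "travel_history_location": travel,
            "result": 0,
        })
    return negatives

# Toy positive records standing in for the cleaned dataset.
positives = [
    encode_positive({
        "age": 20 + i % 50,
        "sex": "male" if i % 2 == 0 else "female",
        "symptoms": "fever",
        "country": "China",
        "travel_history_location": "Wuhan",
    })
    for i in range(80)
]
negatives = make_negative_rows(positives)
case_counts = Counter(
    (r["symptoms"], r["country"], r["travel_history_location"]) for r in negatives
)
print(len(negatives), len(case_counts))  # prints "80 8"
```

Each of the 8 combinations appears exactly 10 times, matching the description of 10 rows per case.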
The heat map in Figure 2 shows that symptoms and travel history location have a strong positive impact on a positive COVID-19 result.
3.2. Heart Disease
The heart disease dataset [19, 59] has the features shown in Table 4. The dataset has 70,000 data points. Of the features listed in the table, those used are “age,” “gender,” “height,” “weight,” “cholesterol,” “gluc,” “smoke,” “alco,” “ap_hi,” and “ap_lo.” There were some outliers: systolic blood pressure values above 200 and diastolic blood pressure values above 150 are treated as outliers here. A snapshot of the dataset is shown in Figure 3.
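A possible form of the outlier filter is sketched below. It assumes that outlier rows are simply dropped, which the text does not state explicitly, and that `ap_hi` and `ap_lo` hold the systolic and diastolic readings. The function name and the toy rows are hypothetical.

```python
def remove_bp_outliers(rows, max_systolic=200, max_diastolic=150):
    """Drop rows whose blood pressure exceeds the stated outlier thresholds."""
    return [
        row for row in rows
        if row["ap_hi"] <= max_systolic and row["ap_lo"] <= max_diastolic
    ]

# Toy rows (hypothetical values).
patients = [
    {"age": 50, "ap_hi": 120, "ap_lo": 80},   # kept
    {"age": 63, "ap_hi": 240, "ap_lo": 90},   # systolic above 200 -> dropped
    {"age": 47, "ap_hi": 130, "ap_lo": 160},  # diastolic above 150 -> dropped
    {"age": 58, "ap_hi": 200, "ap_lo": 150},  # exactly at the limits -> kept
]

filtered = remove_bp_outliers(patients)
print([p["age"] for p in filtered])  # prints "[50, 58]"
```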
For analysis of features, a heat map was drawn. According to the heat map (see Figure 3), the most important features in determining heart disease include systolic and diastolic blood pressure, cholesterol, and age (see Table 5).
3.3. Diabetes
The dataset collected from [20, 60] has the features shown in Table 6. The dataset has 768 data points. Of the features listed in Table 6, those used are “pregnancies,” “blood pressure,” “BMI,” and “age.” The aim of the research is not only to build a theoretical prediction model using artificial intelligence but also to make the models practical to use in a real-time application without many restrictions. Features such as skin thickness and diabetes pedigree function cannot be determined by a normal person at home. For this reason, only features that a user can readily supply are used for prediction. For instance, the diabetes pedigree function is a complex function calculated from various factors including parents, siblings, half-aunts, and half-uncles. A view of the dataset is shown in Table 7.
A heat map was drawn to determine the importance of features. According to the heat map, pregnancies, glucose, BMI, and age have the highest impact (greater than 0.2) on predicting diabetes (see Figure 4). Of these, glucose is excluded so that the model remains useful in practice (see Table 7).
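The feature screening can be illustrated with a small sketch. It assumes that the heat map shows Pearson correlation coefficients between each feature and the outcome, which the text does not confirm. The helper names and the toy columns are hypothetical, and taking the absolute value of the coefficient is a choice made here.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)

def select_features(columns, target, threshold=0.2):
    """Return names of columns whose |correlation| with the target exceeds threshold."""
    return [
        name for name, values in columns.items()
        if abs(pearson(values, target)) > threshold
    ]

# Toy columns (hypothetical values, not taken from the diabetes dataset).
columns = {
    "age": [1, 2, 3, 4],
    "noise": [1, -1, -1, 1],
}
outcome = [0, 0, 1, 1]

print(select_features(columns, outcome))  # prints "['age']"
```

In this toy data, "age" correlates with the outcome at about 0.89 and is kept, while "noise" has zero correlation and is dropped.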
After cleaning and analyzing all the datasets, machine learning models were applied. The logistic regression model is used for all the datasets. To make predictions, the coefficients and intercept of all three logistic regression models are stored in a Firebase real-time database. Because the coefficients and intercept are held in the Firebase real-time database, any update to them is reflected in the application in real time. This allows the parameters to be tuned further as the dataset grows and training improves. To make the model useful in practice, an Android application is developed, in which a prediction can be made in real time by answering a few questions. To make the application more useful, the latest news and trends on COVID-19, diabetes, and heart disease are shown in the app and can be updated in real time.
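The prediction step amounts to evaluating the logistic function on a weighted sum of the user's answers. The sketch below is illustrative only: the dictionary `stored_params` stands in for values fetched from the database, its numbers are placeholders rather than the trained coefficients, and no Firebase call is shown.

```python
from math import exp

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)

def predict_risk(features, params):
    """Return the predicted probability for one user.

    params holds an "intercept" and a "coefficients" mapping keyed by
    feature name, mirroring what a trained logistic regression exposes.
    """
    z = params["intercept"] + sum(
        weight * features[name]
        for name, weight in params["coefficients"].items()
    )
    return sigmoid(z)

# Placeholder parameters (hypothetical, not the values from the paper).
stored_params = {
    "intercept": -1.0,
    "coefficients": {"symptoms": 2.0, "travel_history_location": 1.0},
}
answers = {"symptoms": 1, "travel_history_location": 0}

risk = predict_risk(answers, stored_params)
print(f"{risk:.4f}")  # prints "0.7311"
```

Because only the intercept and coefficients are needed at prediction time, updating those stored values changes the app's output without shipping a new model file.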
4. Results and Discussion
Figure 5 shows an example of prediction in the Android app. The screenshots in Figure 5 are taken from the production Android app, which is named Disease Prediction using Artificial Intelligence (DPAI). From the main navigation menu, the user can choose to predict diseases or to look at news and trends of diseases, as shown in Step 1. The user enters some details, which are the features fed to the model for prediction. For prediction, the values of the coefficients and intercept are fetched from the Firebase database in real time, and the calculation is performed to give the output. Along with the output, the accuracy of the model is displayed to users for more transparency.
In this paper, 65% of the dataset is used for training, 10% for cross-validation, and 25% for testing. Figures 6–8 show the accuracy analysis of the proposed and the existing machine learning models on the diabetes, heart disease, and COVID-19 binary datasets. These figures show that the proposed model outperforms the existing models by 1.2746%. Figures 9–11 show the F-measure analysis of the proposed and the existing machine learning models on the same datasets. These figures show that the proposed model outperforms the existing models by 1.3926%.
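For readers who want to reproduce this kind of evaluation, the following sketch shows a 65/10/25 split and the computation of accuracy and F-measure. It is an assumption-laden illustration: the shuffling seed, the function names, and the toy labels are hypothetical, and the F-measure is taken to be the usual F1 score (harmonic mean of precision and recall).

```python
import random

def split_dataset(rows, train=0.65, val=0.10, seed=0):
    """Shuffle and split rows into train / validation / test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )

def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)

train_rows, val_rows, test_rows = split_dataset(range(100))
print(len(train_rows), len(val_rows), len(test_rows))  # prints "65 10 25"

# Toy labels (hypothetical): 2 true positives, 1 false positive, 1 false negative.
acc, f1 = accuracy_and_f1([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(f"{acc:.2f} {f1:.4f}")  # prints "0.60 0.6667"
```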
The comparative analysis shows that the proposed model outperforms the competitive models in terms of various performance measures. The proposed model also provides consistently good performance with a lower degree of uncertainty, especially compared with LR, J48, KNN, ANN, RF, GB, and ANFIS.
5. Conclusion
This study provides insights into using machine learning models to predict the risk of COVID-19, heart disease, and diabetes in an individual from answers to a few questions about factors such as travel history, age, gender, and blood pressure. Logistic regression is used for prediction. Extensive experimental results reveal that the proposed model outperforms the competitive machine learning models in terms of accuracy and F-measure by 1.4765% and 1.2782, respectively, for the COVID-19 dataset; by 1.8274% and 1.7264, respectively, for the diabetes dataset; and by 1.7362% and 1.3821, respectively, for the heart disease dataset.
The findings of this research can be helpful in the early screening of potential COVID-19, diabetes, and heart disease patients, because the first screening can be performed in the comfort of home. If a high risk of disease is predicted for a patient, it can be followed by clinical tests for confirmation. In the near future, the proposed model may be applied to other applications such as handwriting recognition, image filtering, cancer classification, and medical image segmentation, and various meta-heuristic techniques may be used to tune the initial parameters of the proposed machine learning models.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
References
M. Z. Ali, M. N. S. K. Shabbir, X. Liang, Y. Zhang, and T. Hu, “Machine learning-based fault diagnosis for single- and multi-faults in induction motors using measured stator currents and vibration signals,” IEEE Transactions on Industry Applications, vol. 55, no. 3, pp. 2378–2391, 2019.
T. F. Lima, H. Peng, A. N. Tait et al., “Machine learning with neuromorphic photonics,” Journal of Lightwave Technology, vol. 37, pp. 1515–1534, 2019.
F. Noé, G. D. Fabritiis, and C. Clementi, “Machine learning for protein folding and dynamics,” Current Opinion in Structural Biology, vol. 60, pp. 77–84, 2019.
J. Patel and U. Tejal, “Heart disease prediction using machine learning and data mining technique,” 2016.
Y. Khourdifi and M. Bahaj, “Heart disease prediction and classification using machine learning algorithms optimized by particle swarm optimization and ant colony optimization,” International Journal of Intelligent Engineering and Systems, vol. 12, no. 1, pp. 242–252, 2019.
Y. Kokubo, M. Watanabe, A. Higashiyama, and K. Honda-Kohmo, “Small-dense low-density lipoprotein cholesterol: a subclinical marker for the primary prevention of coronary heart disease,” Journal of Atherosclerosis and Thrombosis, vol. 27, no. 7, pp. 641–643, 2020.
J. Suls, J. N. Mogavero, L. Falzon, L. S. Pescatello, E. A. Hennessy, and K. W. Davidson, “Health behaviour change in cardiovascular disease prevention and management: meta-review of behaviour change techniques to affect self-regulation,” Health Psychology Review, vol. 14, no. 1, pp. 43–65, 2019.
C. F. Team, “Severe outcomes among patients with coronavirus disease 2019 (COVID-19) - United States,” MMWR. Morbidity and Mortality Weekly Report, vol. 69, no. 12, pp. 343–346, 2020.
J. Smith Walter, “Using the ADAP learning algorithm to forecast the onset of diabetes mellitus,” Johns Hopkins Apl Technical Digest, vol. 10, 1988.
B. Gupta, M. Tiwari, and S. Singh Lamba, “Visibility improvement and mass segmentation of mammogram images using quantile separated histogram equalisation with local contrast enhancement,” CAAI Transactions on Intelligence Technology, vol. 4, no. 2, pp. 73–79, 2019.