Abstract

The coronavirus disease 2019 (Covid-19) pandemic has affected most countries of the world. The detection of Covid-19 positive cases is an important step in fighting the pandemic and saving human lives. The polymerase chain reaction test is the most widely used method to detect Covid-19 positive cases, and various molecular and serological methods have also been explored. Machine learning algorithms have been applied to various kinds of datasets to predict Covid-19 positive cases. In this paper, machine learning algorithms are applied to a Covid-19 dataset based on commonly taken laboratory tests, which are easy to collect, to predict Covid-19 positive cases. The paper investigates the application of decision tree ensembles, which are accurate and robust to the selection of parameters. As there is an imbalance between the number of positive cases and the number of negative cases, decision tree ensembles developed for imbalanced datasets are also applied. F-measure, precision, recall, area under the precision-recall curve, and area under the receiver operating characteristic curve are used to compare the different decision tree ensembles. These performance measures suggest that decision tree ensembles developed for imbalanced datasets perform better. The results also suggest that including age as a variable can improve the performance of the various ensembles of decision trees.

1. Introduction

SARS-CoV-2 is a new strain of coronavirus causing Covid-19, which was first identified in Wuhan city, Hubei province of China [1, 2]. The outbreak was declared a pandemic by the World Health Organization (WHO) [1, 2]. As of 24 November 2020, more than 59.2 million confirmed cases of Covid-19 and more than 1.3 million deaths had been reported to WHO [3].

Covid-19 is a respiratory infection, and severe cases may develop Covid-19 pneumonia [4]. The reverse transcription polymerase chain reaction (RT-PCR) test is considered the gold standard for Covid-19 diagnosis [5]. It requires at least 24 hours to produce a result, so clinicians use chest imaging tests to diagnose Covid-19 while awaiting RT-PCR results.

Machine learning algorithms are very effective for prediction [6]. Chest X-ray images and computed tomography (CT) scans have been used to train machine learning models [7–9], and the trained models are then used to predict Covid-19 positive cases. Using point-of-care ultrasound, machine learning methods can accurately predict from the initial lung scans which Covid-19 patients are at greater risk of death [10]. Machine learning methods can also help distinguish Covid-19 infection from community-acquired pulmonary infection [11].

Various studies have investigated whether Covid-19 positive cases can be predicted using commonly taken laboratory tests such as hematocrit, hemoglobin, platelets, red blood cells, lymphocytes, and leukocytes [12–14].

Some of these datasets are imbalanced [12, 14]; in other words, the number of positive cases is very small compared to the number of negative cases. Various machine learning methods have been specifically developed for such datasets [15, 16]. The performance measures for these datasets should be carefully selected because, in some domains such as medicine, the penalty for a false negative is higher than the penalty for a false positive [17].

An ensemble is a combination of many classifiers [18]. An ensemble performs better than an individual classifier if the member classifiers are accurate and diverse. Classifier ensembles have been used in many applications [19, 20]. The decision tree is a popular machine learning algorithm [6]. Random forests [21], eXtreme Gradient Boosting (XGBoost) [22], and bagged decision trees [23] are examples of decision tree ensembles. Ensembles of decision trees are accurate and quite robust to the selection of parameters. Decision tree ensembles have also been developed for imbalanced datasets [24, 25]. Various Covid-19 datasets based on laboratory tests [12–14] are imbalanced; however, they have not been studied using machine learning algorithms developed for imbalanced datasets.

This paper investigates the application of decision tree ensembles to a Covid-19 dataset based on commonly taken laboratory tests. The dataset is imbalanced, as the ratio of positive class to negative class data points is 1 to 6.5. Therefore, decision tree ensembles that are specifically developed for imbalanced datasets are also applied, and undersampling and oversampling techniques [15, 16] are used to address the data imbalance problem. As the data is imbalanced, classification accuracy is not an appropriate performance measure to compare different classifiers; the F-measure, precision, recall, area under the precision-recall curve, and area under the receiver operating characteristic curve [17] are employed to compare the performances of the different decision tree ensembles.

The paper is organized in the following way. Section 2 discusses the research works that apply machine learning algorithms to various kinds of datasets to predict Covid-19 positive cases. Information on the Covid-19 dataset, the decision tree ensembles, and the performance measures used in the experiments is presented in Section 3. Section 4 presents the experiments and discussion. The paper ends with the conclusion and future work section.

2. Related Work

Machine learning algorithms have been applied to various types of datasets such as X-ray images, CT scans, and point-of-care ultrasound to predict Covid-19 positive cases [7, 10]. However, in this section, we concentrate on research works that apply machine learning algorithms to predict Covid-19 positive cases using only routinely collected laboratory findings of the patients [12–14].

Batista et al. [13] collected data from 235 adult patients at the Hospital Israelita Albert Einstein in São Paulo, Brazil, from 17 to 30 March 2020. Fifteen variables, comprising age and gender in addition to 13 laboratory tests such as hemoglobin, platelets, and red blood cells, were used to create a prediction model. Five machine learning algorithms (neural networks, random forests, XGBoost, logistic regression, and support vector machines (SVM)) were tested on this dataset, and the best predictive performance was obtained by the SVM algorithm. Schwab et al. [14] used anonymised data from a cohort of 5644 patients seen at the Hospital Israelita Albert Einstein in São Paulo, Brazil, in the early months of 2020. In this dataset, the rate of positive patients was around 10%. They used 97 routine clinical, laboratory, and demographic measurements as features and applied the same five machine learning algorithms as Batista et al. In their experiments, XGBoost performed the best. They also carried out experiments for predicting hospitalisation and ICU admission: random forests performed best for predicting hospital admission of Covid-19 positive patients, whereas SVM performed best for predicting ICU admission. Alakus and Turkoglu [12] used a modified version of the data used by Schwab et al. [14]. The new dataset has 18 laboratory findings of 600 patients, and the rate of positive patients is around 13%. They carried out a comparative study of deep learning approaches: Artificial Neural Networks (ANN), Convolutional Neural Networks (CNN), Long Short-Term Memory (LSTM), Recurrent Neural Networks (RNN), CNN-LSTM, and CNN-RNN were applied to this dataset. LSTM performed best under the 10-fold cross-validation approach. The datasets used by Schwab et al. [14] and Alakus and Turkoglu [12] are imbalanced; therefore, it is interesting to apply algorithms that have been developed for imbalanced datasets.

3. Data and Classifiers Used in the Experiments

This section discusses the Covid-19 dataset, the decision tree ensembles, and the performance measures used in our experiments.

3.1. Data

111 laboratory findings of 5644 patients seen at the Hospital Israelita Albert Einstein in São Paulo, Brazil, were collected in the early months of 2020 to detect Covid-19 [14]. Alakus and Turkoglu [12] selected the 18 laboratory findings that play the most important role in Covid-19 and removed the data points that had missing values for these 18 findings. The final dataset has 600 data points: 520 are Covid-19 negative and 80 are Covid-19 positive. The ratio of negative (majority class) to positive (minority class) data points is 6.5 to 1, which makes the dataset imbalanced. This dataset is used in the experiments.
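For illustration, the short sketch below (not the authors' code) shows how such a dataset could be checked for class imbalance with pandas; the file name covid_lab_findings.csv and the outcome column sars_cov_2_result are hypothetical placeholders, as the paper does not specify them.

```python
# Hypothetical sketch: verify the class distribution of the 600-sample dataset.
# The file and column names are placeholders, not the ones used in the paper.
import pandas as pd

df = pd.read_csv("covid_lab_findings.csv")   # 18 lab findings + outcome (assumed layout)
df = df.dropna()                             # keep rows with all 18 findings present

counts = df["sars_cov_2_result"].value_counts()
print(counts)                                                      # expected: 520 negative, 80 positive
print("majority:minority ratio =", counts.max() / counts.min())    # expected: 6.5
```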

3.2. Decision Tree Ensembles

A decision tree is a very successful classifier that has been applied in many domains [6]. Decision trees are built using a recursive partitioning process in which the data points at each node are split using a selected split criterion. A path from the root node to a leaf forms a rule that is used for prediction. An ensemble consists of many classifiers [18], and its final decision is a combination of the decisions of all member classifiers. An ensemble generally performs better than its individual members if those members are accurate and diverse. Decision tree ensembles are quite robust to the selection of parameters and perform well. In the experiments, several decision tree-based ensembles are used. As the data is imbalanced, decision tree-based ensembles developed for imbalanced datasets are also used. We discuss these ensembles and their implementations below.

3.2.1. C4.5 Decision Tree

Various split criteria have been proposed for decision trees. C4.5 uses the information gain ratio split criterion, which reduces the bias towards multivalued attributes [26]. For all the ensembles, we use C4.5 or a variant of it as the base classifier.
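As an illustration of this split criterion, the sketch below computes the information gain ratio of a categorical attribute with NumPy; it is a simplified didactic example, not the Weka J48 implementation used in the experiments.

```python
# Information gain ratio of a categorical attribute (didactic sketch, not Weka J48).
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(attribute, labels):
    attribute, labels = np.asarray(attribute), np.asarray(labels)
    n = len(labels)
    gain, split_info = entropy(labels), 0.0
    for value in np.unique(attribute):
        subset = labels[attribute == value]
        weight = len(subset) / n
        gain -= weight * entropy(subset)          # information gain term
        split_info -= weight * np.log2(weight)    # penalises multivalued attributes
    return gain / split_info if split_info > 0 else 0.0

# Example: a binary attribute against Covid-19 test outcomes.
print(gain_ratio(["low", "low", "high", "high"], ["neg", "pos", "pos", "pos"]))
```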

3.2.2. Decision Tree Ensembles

Different kinds of ensemble methods have been proposed. Some are general methods, such as bagging [23] and AdaBoost [27], which can be used with any classifier, whereas others are specific to decision trees, such as random forests [21]. XGBoost [22] is a scalable gradient tree boosting method that has produced excellent results in many domains.
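Assuming that scikit-learn and the xgboost Python package are acceptable stand-ins for the implementations listed in Table 1 (their trees are CART-based rather than C4.5), the standard ensembles could be instantiated roughly as follows.

```python
# A sketch of the standard ensembles with 50 base trees each.
# Note: scikit-learn >= 1.2 uses the "estimator" argument; older releases use "base_estimator".
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from xgboost import XGBClassifier

standard_ensembles = {
    "bagging": BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=50),
    "adaboost": AdaBoostClassifier(estimator=DecisionTreeClassifier(), n_estimators=50),
    "random_forest": RandomForestClassifier(n_estimators=50),
    "xgboost": XGBClassifier(n_estimators=50, eval_metric="logloss"),
}
```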

3.2.3. Decision Tree Ensembles for Imbalanced Datasets

Many approaches have been proposed to handle imbalanced datasets. Undersampling of the majority class and oversampling of the minority class are two important approaches to reduce the imbalance of a dataset [15, 16]. Random undersampling (RUS) [15, 16] selects some data points from the majority class and combines them with the minority class to reduce the imbalance. The Synthetic Minority Oversampling Technique (SMOTE) [28] generates synthetic minority data points which are combined with the original dataset to reduce the imbalance. SMOTEBoost [29] combines the SMOTE algorithm with the boosting procedure, whereas SMOTEBagging [30] combines SMOTE with the bagging algorithm. RUSBoost [31] combines random undersampling and boosting, and RUSBagging [32] combines random undersampling and bagging. Balanced random forests [25] use undersampling of the majority class to create balanced data for each tree of the random forest.
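A minimal sketch of the imbalance-aware ensembles, assuming the imbalanced-learn (imblearn) package as a stand-in for the ebmc/Weka implementations cited above: balanced random forests and RUSBoost are available directly, while SMOTEBagging- and RUSBagging-style models are approximated by placing a sampler before bagged trees in an imblearn pipeline.

```python
# Imbalance-aware decision tree ensembles (approximate, imblearn-based sketch).
from imblearn.ensemble import BalancedRandomForestClassifier, RUSBoostClassifier
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import make_pipeline
from sklearn.ensemble import BaggingClassifier

imbalance_aware_ensembles = {
    # undersamples the majority class inside each bootstrap sample
    "balanced_random_forest": BalancedRandomForestClassifier(n_estimators=50),
    # random undersampling combined with boosting
    "rusboost": RUSBoostClassifier(n_estimators=50),
    # SMOTE oversampling followed by bagged trees (SMOTEBagging-style approximation)
    "smote_bagging": make_pipeline(SMOTE(), BaggingClassifier(n_estimators=50)),
    # random undersampling followed by bagged trees (RUSBagging-style approximation)
    "rus_bagging": make_pipeline(RandomUnderSampler(), BaggingClassifier(n_estimators=50)),
}
```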

Different packages such as Weka [33], imblearn [32], ebmc [34, 35], and XGBoost [36] are used in the experiments. Table 1 shows the classifiers and the related packages used in the experiments.

The default parameter values were used for all the classifiers; however, the ensemble size was fixed at 50. A 10-fold cross-validation procedure was used in the experiments, and the average results are presented in Tables 2–8.
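A sketch of this evaluation protocol is given below; a synthetic dataset with the same size and class ratio stands in for the real data, and the classifier is only an example.

```python
# Stratified 10-fold cross-validation with ensemble size 50, scores averaged over folds.
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in: 600 samples, 18 features, roughly 520 negative / 80 positive.
X, y = make_classification(n_samples=600, n_features=18, weights=[0.867], random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_validate(
    BalancedRandomForestClassifier(n_estimators=50, random_state=0),
    X, y, cv=cv,
    scoring=["accuracy", "precision", "recall", "f1", "roc_auc", "average_precision"],
)
print({name: vals.mean() for name, vals in scores.items() if name.startswith("test_")})
```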

3.3. Performance Measures

There are different measures to compute the performance of classifiers. Accuracy is one of the most commonly used performance measures; however, for imbalanced datasets, the accuracy measure is not very useful [17]. Accuracy, precision, recall, and F1-measure are used in the experiments [17] and are discussed in detail in the appendix. The area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) are also used to compare the performance of the decision tree ensembles [17].

(i) AUROC: the ROC is the curve of TPR against FPR at different decision threshold settings, and the area under this curve is used as a performance measure. AUROC values range from 0 to 1, and the baseline (random classifier) is 0.5 [37].

(ii) AUPRC: the precision-recall curve is the curve of precision against recall at different decision threshold settings, and the area under this curve is used as a performance measure. AUPRC values range from 0 to 1. The baseline for AUPRC is not constant; it depends on the ratio of positive and negative samples and is equal to positive/(positive + negative) [37]. For the dataset used in the experiments, it is equal to 80/600 ≈ 0.13. It has been demonstrated [37] that, for imbalanced datasets, AUPRC is a better performance measure than AUROC.
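A short sketch of how AUROC and AUPRC can be computed from predicted probabilities with scikit-learn; average_precision_score is used here as the common approximation of the area under the precision-recall curve.

```python
# AUROC and AUPRC from predicted positive-class probabilities.
from sklearn.metrics import average_precision_score, roc_auc_score

def imbalance_aware_scores(y_true, y_prob):
    """y_prob is the predicted probability of the positive (minority) class."""
    return {
        "AUROC": roc_auc_score(y_true, y_prob),            # baseline: 0.5
        "AUPRC": average_precision_score(y_true, y_prob),  # baseline: P / (P + N)
    }

# Toy usage with hand-made labels and probabilities.
print(imbalance_aware_scores([0, 0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8, 0.7]))
```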

4. Results and Discussion

Different experiments were carried out to study the performance of different decision tree ensembles. This section will present the results of those experiments.

4.1. The Comparative Study of Various Decision Tree Ensembles

Different types of decision tree ensembles are compared using various performance measures, and the results are presented in Table 2. For two (accuracy and precision) of the six performance measures, the standard classifiers performed best, whereas for the other four (F1-measure, recall, AUROC, and AUPRC) the decision tree ensembles for imbalanced datasets performed best. Random forests performed best for two performance measures (accuracy and precision). Balanced random forest (RUS) performed best for three performance measures: recall, F1-measure, and AUPRC. RUSBagging performed best for AUROC. AUROC and AUPRC are widely used performance measures for imbalanced datasets; RUSBagging gave the best AUROC with a value of 0.881, whereas balanced random forest (RUS) gave the best AUPRC with a value of 0.561. The study demonstrates that decision tree ensembles for imbalanced datasets perform better on this Covid-19 dataset.

4.2. Standard Decision Tree Ensembles with Different Sampling Techniques

Sampling is an approach to overcome the imbalance of datasets. We used the standard decision tree ensembles with two sampling approaches: SMOTE and RUS. As AUROC and AUPRC are the most widely used performance measures for imbalanced datasets, only these two measures were used. Results with the SMOTE oversampling method are presented in Table 3. We carried out experiments with different ratios of minority class to majority class data points; results on the original data are also presented. The best AUROC, 0.872, was obtained by random forest on the original dataset, whereas the best AUPRC, 0.648, was obtained by bagging on the original dataset, so the results for bagging and random forest did not improve with oversampling. Except for the single decision tree, the SMOTE oversampling method had a negative effect on the performance of the ensembles. The presence of noisy minority points may be the reason for the poor performance of the SMOTE oversampling method [38].
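A hedged sketch of this kind of oversampling experiment: SMOTE raises the minority-to-majority ratio inside a pipeline and the cross-validated AUPRC is recorded for each ratio, again on a synthetic stand-in dataset with the original class ratio.

```python
# SMOTE oversampling at different minority/majority ratios before bagged trees.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=600, n_features=18, weights=[0.867], random_state=0)

for ratio in (0.25, 0.5, 0.75, 1.0):
    model = make_pipeline(SMOTE(sampling_strategy=ratio, random_state=0),
                          BaggingClassifier(n_estimators=50, random_state=0))
    auprc = cross_val_score(model, X, y, cv=10, scoring="average_precision").mean()
    print(f"minority/majority ratio {ratio:.2f}: AUPRC = {auprc:.3f}")
```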

RUS was applied to the majority class to change the ratio of minority class to majority class data points; the results are presented in Table 4. The best results for both AUROC and AUPRC were obtained with XGBoost: the best AUROC was 0.873, obtained on the data with a ratio of 0.5, and the best AUPRC was 0.554. XGBoost and AdaBoost performed best on the dataset with a ratio of 0.5, whereas random forest performed best on the original data. The results suggest that the sampling methods did not have similar effects on all the ensembles. However, the best results among all the classifiers were obtained with XGBoost on the data with a ratio of 0.5 created with RUS.
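The best-performing configuration reported above can be sketched in the same way, reusing X and y from the previous sketch: random undersampling of the majority class to a ratio of 0.5, followed by XGBoost.

```python
# RUS to a minority/majority ratio of 0.5, followed by XGBoost (50 trees).
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

rus_xgb = make_pipeline(
    RandomUnderSampler(sampling_strategy=0.5, random_state=0),
    XGBClassifier(n_estimators=50, eval_metric="logloss"),
)
print("AUROC:", cross_val_score(rus_xgb, X, y, cv=10, scoring="roc_auc").mean())
print("AUPRC:", cross_val_score(rus_xgb, X, y, cv=10, scoring="average_precision").mean())
```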

4.3. Effects of the Ensemble Size

An ensemble is a combination of classifiers, and the number of classifiers in an ensemble is its size. An experiment was carried out to study the effect of ensemble size on different classifier ensembles. For this study, the ensembles that performed better with the default values were selected, and the AUROC and AUPRC performance measures were used. Results for AUROC and AUPRC are presented in Tables 5 and 6, respectively. The results suggest that the performance generally improves slightly or remains constant as the ensemble size grows. This is consistent with the theory of classifier ensembles: most of the performance improvement comes from the first few classifiers in an ensemble, and adding more classifiers may not be very useful [39].
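A sketch of the ensemble-size experiment, again reusing X and y from the sampling sketches above: the same classifier is trained with an increasing number of trees and the cross-validated AUPRC is tracked.

```python
# Effect of ensemble size on cross-validated AUPRC (illustrative sketch).
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.model_selection import cross_val_score

for size in (10, 25, 50, 100, 200):
    clf = BalancedRandomForestClassifier(n_estimators=size, random_state=0)
    auprc = cross_val_score(clf, X, y, cv=10, scoring="average_precision").mean()
    print(f"{size:>3} trees: AUPRC = {auprc:.3f}")
```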

4.4. Effects of the Age Variable

The data has 18 laboratory findings and also records the age of each patient. The previous experiments [12] used the 18 laboratory tests for Covid-19 prediction but did not use age. The age of a patient plays a very important part in the severity of Covid-19; therefore, it is important to understand the effect of age on Covid-19 prediction. An experiment was carried out with a dataset of 19 attributes (18 laboratory findings + the age variable), and the results were compared with those of the 18-attribute dataset. The experimental settings were the same as those discussed for the 18-attribute dataset. Two sets of experiments were done, one with an ensemble size of 50 and the other with an ensemble size of 100. Results are presented in Tables 7 and 8. For the AUROC performance measure with an ensemble size of 50, seven out of nine ensembles performed better on the 19-attribute dataset; with an ensemble size of 100, six out of nine ensembles performed better. For the AUPRC performance measure, eight out of nine ensembles performed better on the 19-attribute dataset for both ensemble sizes (50 and 100). The results suggest that including the age variable in the dataset can improve the prediction performance.
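A sketch of this comparison, assuming the DataFrame df from the data-loading sketch in Section 3.1 and the hypothetical column names age and sars_cov_2_result: the cross-validated AUPRC is computed with and without the age column.

```python
# Compare 18 lab findings against 18 lab findings + age (hypothetical column names).
from imblearn.ensemble import BalancedRandomForestClassifier
from sklearn.model_selection import cross_val_score

y = (df["sars_cov_2_result"] == "positive").astype(int)
X_18 = df.drop(columns=["sars_cov_2_result", "age"])   # 18 lab findings only
X_19 = df.drop(columns=["sars_cov_2_result"])          # 18 lab findings + age

clf = BalancedRandomForestClassifier(n_estimators=50, random_state=0)
for name, features in (("18 attributes", X_18), ("19 attributes", X_19)):
    auprc = cross_val_score(clf, features, y, cv=10, scoring="average_precision").mean()
    print(f"{name}: AUPRC = {auprc:.3f}")
```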

The findings of the experiments are as follows:

(I) The Covid-19 dataset based on laboratory tests [12] is an imbalanced dataset; however, it had not been studied using machine learning algorithms developed for imbalanced datasets. We studied it using decision tree ensembles for imbalanced datasets. The study demonstrates that decision tree ensembles developed for imbalanced datasets perform better on this Covid-19 dataset.

(II) Experiments with different sampling techniques suggest that the sampling methods did not have similar effects on all the general decision tree ensembles. However, the best results among all the classifiers were obtained with XGBoost on the data with a ratio of 0.5 created with RUS.

(III) The previous experiments [12] did not use the AUPRC performance measure, which is a better performance measure than AUROC for imbalanced data [37]. The results could be misleading with inappropriate performance measures: as shown in Table 2, balanced random forest (RUS) performs best for AUPRC, whereas RUSBagging performs best for AUROC. Hence, the classifier should be selected on the basis of the AUPRC performance measure.

(IV) The dataset has 19 attributes (18 lab tests + age). The previous experiments [12] used the 18 lab tests for Covid-19 prediction. We also studied the effect of the age attribute and found that including age with the 18 lab tests can improve Covid-19 prediction performance.

5. Conclusion and Future Work

The prediction of Covid-19 positive cases is an important step for managing Covid-19 positive patients, and machine learning algorithms can be useful for this classification task. Various kinds of decision tree ensembles were applied to a Covid-19 dataset consisting of commonly taken laboratory tests; such datasets are easy to collect. The dataset has a class imbalance problem. The results demonstrate that decision tree ensembles developed for imbalanced datasets perform better than standard decision tree ensembles. This suggests that the selection of classification methods should be based on the properties of the data: if the data is imbalanced, classifiers developed for imbalanced datasets should be used. Similarly, appropriate performance measures should be used for a given classification problem; otherwise, the results could be misleading. The results also suggest that combining the age variable with the other laboratory tests can improve the prediction performance. In the future, we will compare the performance of decision tree ensembles with other types of classifiers, such as SVM and deep learning classifiers. We will also study the combination of laboratory tests with X-ray data for the prediction of Covid-19 positive cases, and the prediction of the severity of Covid-19 from datasets with laboratory tests will be investigated.

Appendix

We discuss the various performance measures in detail. We first define the terms used in these measures for a binary dataset. The dataset has two classes: positive and negative. In our experiments, the minority class is taken as the positive class.

True positive (TP): correctly predicted positive class data points.

False positive (FP): negative class data points predicted as the positive class.

True negative (TN): correctly predicted negative class data points.

False negative (FN): positive class data points predicted as the negative class.

P: total positive class data points.

N: total negative class data points.

True positive rate (TPR) = TP/P.

False positive rate (FPR) = FP/N.

Performance measures are defined using the following equations:

Accuracy = (TP + TN)/(P + N),

Precision = TP/(TP + FP),

Recall = TP/(TP + FN) = TPR,

F1-measure = 2 × Precision × Recall/(Precision + Recall).
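For completeness, a small Python sketch computes these measures directly from the confusion-matrix counts; it should agree with standard library implementations on the same predictions.

```python
# Accuracy, precision, recall, and F1 from TP, FP, TN, FN (binary labels: 1 = positive).
def binary_scores(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "F1": f1}

print(binary_scores([1, 0, 0, 1, 0, 1], [1, 0, 1, 1, 0, 0]))
```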

Data Availability

The data used in this study are publicly available.

Conflicts of Interest

The authors declare no conflicts of interest.