Abstract

Cervical cancer is a frequently deadly disease that is common in females. However, early diagnosis of cervical cancer can reduce the mortality rate and other associated complications, and cervical cancer risk factors can aid such early diagnosis. For better diagnostic accuracy, we propose a study for early diagnosis of cervical cancer using a reduced risk feature set and three ensemble-based classification techniques, i.e., eXtreme Gradient Boosting (XGBoost), AdaBoost, and Random Forest (RF), along with the Firefly algorithm for optimization. The Synthetic Minority Oversampling Technique (SMOTE) was used to alleviate the data imbalance problem. The Cervical Cancer Risk Factors data set, containing 32 risk factors and four targets (Hinselmann, Schiller, Cytology, and Biopsy), is used in the study; the four targets are widely used diagnosis tests for cervical cancer. The effectiveness of the proposed study is evaluated in terms of accuracy, sensitivity, specificity, positive predictive accuracy (PPA), and negative predictive accuracy (NPA). Moreover, the Firefly feature selection technique was used to achieve better results with a reduced number of features. Experimental results reveal the significance of the proposed model, which achieved the highest outcome for the Hinselmann test when compared with the other three diagnostic tests. Furthermore, the reduction in the number of features enhanced the outcomes. Additionally, the performance of the proposed models is notable in terms of accuracy when compared with other benchmark studies on cervical cancer diagnosis using the same risk factors data set.

1. Introduction

Cervical cancer is one of the most commonly occurring types of cancer in females and mostly develops during their midlives (35–44 years) [1]. This type of cancer can be fatal as it does not show clear symptoms in its early stages. Symptoms usually appear in late stages, when the cancer could have spread to other organs such as the bones, liver, lymph nodes, and lungs. One of the early signs of cervical cancer is a blockage of the ureter, the tube that carries urine from the kidney. Other late symptoms that can appear are vaginal bleeding, pelvic pain, weight loss, and leg pain [2].

The risk factors that lead to the development of cervical cancer include hormone-containing medicines, birth control pills, smoking, and the number of pregnancies. However, human papilloma virus (HPV) is believed to be the major factor in developing cervical cancer [2]. HPV is a common sexually transmitted infection; it is usually harmless, but it may sometimes lead to cancer [3]. Individuals with an HPV infection are at a higher risk of getting cervical cancer. Furthermore, the probability of getting cervical cancer increases if one possesses more than one risk factor. As the cancer does not show signs in its early stages, regular checkups are required, especially for those who have the risk factors. In developing countries, lack of medical equipment and the cost of conducting checkups can also be a burden. With the advent and advancement of machine learning, it has become possible to find robust solutions for early diagnosis of cancer cases using data-driven approaches.

Various studies have contributed to the field of cervical cancer diagnosis using several classification techniques applied to different types of data, such as clinical, image, and genetic data. In our study, we used clinical cervical risk factor data. Two similar studies were conducted by Wu and Zhou [4] and Abdoh et al. [5]; they performed a comparative analysis of two feature selection techniques, namely, recursive feature elimination (RFE) and Principal Component Analysis (PCA). The first study used a Support Vector Machine (SVM), and the other used Random Forest (RF). Both studies used the same number of features. As the data suffered from imbalance, oversampling was applied in [4] and SMOTE in [5]. Both studies identified two risk factors to be removed, namely, the time since the first and last diagnosis of STDs (sexually transmitted diseases), due to the large number of missing entries. Furthermore, the study [4] found that lower computational cost was an advantage of both SVM-PCA and SVM-RFE, whereas high computational cost was a limitation of the plain SVM model. Moreover, STDs, intrauterine device (IUD), hormonal contraceptives, and first sexual intercourse were identified as the most relevant features [5]. Overall, the outcomes of both studies showed that using 30 features produced the best results. Furthermore, it was found that the SMOTE-RF model performed well for all targets.

Similarly, Lu et al. [6] and Karim and Neehal [7] used ensemble models to estimate the risk of cervical cancer. Both studies performed data cleaning to replace missing values. The former study used an ensemble classifier with a voting strategy on a combination of a private and a public data set. The private data set contains 472 records taken from a Chinese hospital and was collected using a questionnaire; the public data set was obtained from the UCI repository, and 14 features were used. The results revealed that the voting ensemble classifier produced better results when compared to Linear Regression, Decision Tree (DT), Multilayer Perceptron (MLP), SVM, and K-NN classifiers. On the other hand, Karim and Neehal's study used DT, MLP, SVM with Sequential Minimal Optimization (SMO), and K-nearest neighbor (KNN) techniques. Experiments showed that SMO had the best performance in terms of accuracy, precision, recall, and F-measure. Similarly, Ul-Islam et al. [8] used DT, RF, Logistic Model Tree, and ANN for cervical cancer detection. The Apriori algorithm was used to identify features that strongly relate to cancer. The study found that age, number of sexual partners, hormonal contraceptives, number of pregnancies, and first sexual intercourse are significant risk factors. Results indicated that RF produced the best outcome when compared to the other models.

Al-Wesabi et al. [9] conducted a comparison between different machine learning classifiers, namely, Gaussian Naïve Bayes (GNB), KNN, DT, LR, and SVM. The outcome of the classifiers was not satisfactory due to the data imbalance. To resolve this problem, undersampling, oversampling, and SMOTETomek were applied; oversampling produced the best result among the three methods. Moreover, a Sequential Feature Selector was applied in both its forward and backward versions. Both the Sequential Forward Feature Selector (SFS) and the Sequential Backward Feature Selector (SBS) enhanced the prediction performance, with an accuracy of 95%. After selecting the features common between DT and KNN, the accuracy exceeded 97% for the DT. The results revealed that age, first sexual intercourse, number of pregnancies, smoking, hormonal contraceptives, and STDs: genital herpes were the main predictive features.

Several studies have also applied deep learning and transfer learning to cervical cancer diagnosis. Fernandes et al. [10] and Adem et al. [11] used deep learning and showed significant outcomes in terms of diagnostic accuracy. The study [10] used a loss function that provides joint supervised optimization of dimensionality reduction and classification models. The study indicated that the approach can be useful for examining patient records when the Biopsy and perhaps other test results are absent, and that it is capable of successfully classifying whether patients have cervical cancer or not. On the other hand, the researchers in [11] used a deep neural network model with a softmax function to classify the data sets. The performance of the softmax function with a stacked autoencoder was compared with other machine learning methods (DT, KNN, SVM, Feed Forward NN, and Rotation Forest models). It was found that the softmax function with a stacked autoencoder produced a better classification rate of 97.8%.

Similarly, Fernandes et al. [12] applied transfer learning with partial observability for cancer screening. A limitation of the study was that several patients declined to answer some questions over privacy concerns. Challenges were also faced in defining quality, as there are multiple readings and the process relies on human preference. Therefore, instead of an ordinal scale, a simple binary scheme was used. Nevertheless, the model performance was considerable.

Conclusively, the finding drawn from the above-mentioned literature is that the data set found in the UCI repository has many missing values; therefore, previous studies removed at least two features. The missing values were due to patients' concerns regarding their privacy. After removing the two features with the largest numbers of missing values, SVM-PCA provided satisfactory performance; however, SMO and SMOTE-RF were among the best performing models. Another approach to dealing with the imbalance in the UCI cervical risk factor data set was oversampling. Deep learning proved to be effective, especially when the Biopsy and possibly other screening results are absent. Age, first sexual intercourse, number of pregnancies, smoking, hormonal contraceptives, IUD, STDs, STDs: genital warts, and HPV infections were identified as the top key features. The significant outcomes achieved by machine learning classifiers motivate further investigation and enhancement of the outcomes for the prediction of cervical cancer.

In this study, three ensemble-based classifiers, eXtreme Gradient Boosting, AdaBoost, and RF, are used to classify cervical cancer. The Cervical Cancer Risk Factors data set from the UCI machine learning repository was collected at "Hospital Universitario de Caracas" in Caracas, Venezuela [13]. In addition to the importance of correctly classifying cancerous and noncancerous cases, it is also essential to identify the key risk factors that contribute to developing cancer. The nature-inspired Firefly algorithm was applied for feature selection and optimization. Furthermore, the Synthetic Minority Oversampling Technique (SMOTE) is used to balance the classes of the data, as the data set suffers greatly from class imbalance.

The paper is organized as follows: Section 2 presents the material and methods. Section 3 contains the experimental setup and results. The comparison of the proposed model with existing studies using the same data set is discussed in Section 4. Finally, Section 5 contains the conclusion.

2. Material and Method

2.1. Dataset Description

The cervical cancer risk factors data set used in the study was collected at "Hospital Universitario de Caracas" in Caracas, Venezuela and is available in the UCI Machine Learning repository [13]. It consists of 858 records, with some missing values, as several patients did not answer some of the questions due to privacy concerns. The data set contains 32 risk factors and 4 targets, i.e., the diagnosis tests used for cervical cancer. It contains different categories of features, such as habits, demographic information, medical history, and genomic medical records. Features such as age, Dx: Cancer, Dx: CIN, Dx: HPV, and Dx contain no missing values. Dx: CIN denotes a change in the walls of the cervix, commonly due to an HPV infection, which may sometimes lead to cancer if it is not treated properly. The Dx: Cancer variable indicates whether the patient has other types of cancer, as a patient may have more than one type of cancer. In the data set, some patients do not have cervical cancer but still have a true Dx: Cancer value; therefore, it is not used as a target variable.
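
To illustrate, a minimal loading sketch is shown below. The file name and the missing-value convention are assumptions based on the UCI repository page, where unanswered questions are encoded as "?" and the Cytology target column is spelled "Citology".

```python
# Minimal loading sketch (file name assumed from the UCI repository page).
# In the raw CSV, unanswered questions are encoded as "?", converted to NaN
# here; note the raw file spells the Cytology target as "Citology".
import pandas as pd

df = pd.read_csv("risk_factors_cervical_cancer.csv", na_values="?")
targets = ["Hinselmann", "Schiller", "Citology", "Biopsy"]

print(df.shape)           # (858, 36): 32 risk factors + 4 targets
print(df[targets].sum())  # positive cases per diagnosis test
```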

Table 1 presents a brief description of each feature along with its type. Cervical cancer diagnosis usually requires several tests, and this data set contains the widely used diagnosis tests as targets: Hinselmann, Schiller, Cytology, and Biopsy. Hinselmann, or colposcopy, is a test that examines the inside of the vagina and cervix using a tool that magnifies the tissues to detect any anomalies [3]. Schiller is a test in which iodine is applied to the cervix, where it stains healthy cells brown and leaves abnormal cells uncolored. Cytology is a test that examines body cells from the uterine cervix for any cancerous cells or other diseases. Finally, Biopsy refers to the test in which a small part of cervical tissue is examined under a microscope; most Biopsy tests can provide a definitive diagnosis.

2.2. Dataset Preprocessing

The data set suffers from a huge number of missing values; 24 of the 32 features contain missing values. Initially, the features with the largest percentages of missing values were removed: STDs: Time since first diagnosis and STDs: Time since last diagnosis were dropped, since each has 787 missing values (see Table 2), which is more than half of the data. Data imputation was performed for the features with fewer missing values, using the most frequent value of each feature. Additionally, the data set suffers from severe class imbalance: out of the 858 records, there are only 35 positive records for Hinselmann, 74 for Schiller, 44 for Cytology, and 55 for Biopsy, as shown in Figure 1. SMOTE was used to deal with the class imbalance. SMOTE oversamples the minority class by generating new synthetic instances from the nearest neighbors of minority instances, using the Euclidean distance between data points [14]. Figure 1 shows the number of records per class label in the data set.
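
A minimal preprocessing sketch along these lines is shown below, assuming the column names of the UCI file and the `df` and `targets` variables from the loading sketch; the SMOTE random state is an arbitrary choice.

```python
# Preprocessing sketch: drop the two sparsest features, impute the rest with
# the most frequent value, then oversample the minority class with SMOTE.
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE

df = df.drop(columns=["STDs: Time since first diagnosis",
                      "STDs: Time since last diagnosis"])

X = df.drop(columns=targets).to_numpy()
y = df["Hinselmann"].to_numpy()  # repeated for each of the four targets

X_imp = SimpleImputer(strategy="most_frequent").fit_transform(X)
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_imp, y)
```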

2.3. Firefly Feature Selection

Dimensionality reduction is one of the effective ways to select features that improve the performance of a supervised learning model. In this study, we adopted the nature-inspired Firefly algorithm to select the features that best formulate the problem. Firefly was proposed by Yang [15], initially for optimization. The metaheuristic Firefly algorithm is inspired by the flashing behavior of fireflies. It is a population-based optimization algorithm that searches for the optimal value or parameters of a target function. In this technique, each fly is attracted by the glow intensity of nearby flies; if the intensity of the gleam is extremely low at some point, the attraction declines. Firefly uses three rules: (a) all flies are of the same gender; (b) attractiveness depends upon the intensity of the glow; (c) the target function generates the gleam of each firefly. Flies with a dimmer glow move towards flies with a brighter glow, and the brightness can be adjusted using the objective function. The same idea is applied here to search for the optimal features that best fit the training model. Firefly is computationally economical and has produced better outcomes in feature selection when compared with other metaheuristic techniques like genetic algorithms and particle swarm optimization [16]. The time complexity of Firefly is O(n²t), where n is the population size and t is the number of iterations [17]. It uses the light intensity to select the features: highly relevant features are represented as features with high-intensity light.

For feature selection, a set of fireflies is first generated, and each fly randomly assigns weights to all features. In our study, we generated 50 flies (n = 50). The dimension of the data set is 30. Furthermore, the lower bound was set to −50, while the upper bound was set to 50. The maximum number of generations was 500. Additionally, α (alpha) was initially set to 0.5, and in every subsequent iteration, equations (1) and (2) were used to update the α value.

The gamma parameter was set to 1. The number of features selected using Firefly was 15 for Hinselmann, 13 for Schiller, 11 for Cytology, and 11 for Biopsy, respectively.
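
Since the paper does not include its implementation, the following is a simplified sketch of Firefly-based feature selection under the stated settings (n = 50, bounds ±50, γ = 1, initial α = 0.5, 500 generations). The positive-weight selection rule, the Random Forest fitness function, and the geometric α decay (standing in for equations (1) and (2), which are not reproduced in the text) are all assumptions, and a full run at these settings is computationally heavy, so fewer generations may be used in practice.

```python
# Simplified Firefly feature-selection sketch (not the authors' exact code).
# Each firefly carries one real-valued weight per feature; features with a
# positive weight are treated as selected, and brightness is the
# cross-validated accuracy of a classifier on that subset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def firefly_select(X, y, n=50, max_gen=500, gamma=1.0, alpha=0.5,
                   beta0=1.0, lb=-50.0, ub=50.0, seed=42):
    rng = np.random.default_rng(seed)
    dim = X.shape[1]
    flies = rng.uniform(lb, ub, size=(n, dim))

    def brightness(w):
        mask = w > 0                      # positive weight => feature kept
        if not mask.any():
            return 0.0
        clf = RandomForestClassifier(n_estimators=50, random_state=seed)
        return cross_val_score(clf, X[:, mask], y, cv=3).mean()

    light = np.array([brightness(w) for w in flies])
    for _ in range(max_gen):
        for i in range(n):
            for j in range(n):
                if light[j] > light[i]:   # move i toward the brighter fly j
                    r2 = np.sum((flies[i] - flies[j]) ** 2)
                    beta = beta0 * np.exp(-gamma * r2)
                    flies[i] += (beta * (flies[j] - flies[i])
                                 + alpha * (rng.random(dim) - 0.5))
                    flies[i] = np.clip(flies[i], lb, ub)
                    light[i] = brightness(flies[i])
        alpha *= 0.97                     # assumed stand-in for (1) and (2)
    return flies[np.argmax(light)] > 0    # boolean mask of selected features
```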

2.4. Ensemble-Based Classification Methods

Three ensemble-based classification techniques, namely, Random Forest, eXtreme Gradient Boosting, and AdaBoost, were used to train the models. These techniques are described in the sections below.

2.5. Random Forest

Random Forest (RF) was first proposed by Breiman in 2001 [18]. Random Forest is an ensemble model that uses decision trees as the individual models and bagging as the ensemble method. It improves on a single decision tree by combining many trees, which reduces overfitting. RF can be used for both classification and regression. RF builds a forest of decision trees, obtains a prediction from each one, and then selects the class with the maximum votes [19].
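
A minimal sketch of this voting scheme with scikit-learn is given below; the hyperparameter values are placeholders (the tuned values appear in Table 3), and `X_bal`, `y_bal` are the SMOTE-balanced arrays from the preprocessing sketch.

```python
# Minimal Random Forest sketch: bagged decision trees with majority voting.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, criterion="gini",
                            random_state=42)
rf.fit(X_bal, y_bal)
print(rf.predict(X_bal[:5]))  # each prediction is a majority vote over trees
```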

When training a tree, it is important to measure how much each feature decreases the impurity, as the decrease in impurity indicates the significance of the feature. The classification result of a tree depends on the impurity measure used: for classification, the impurity measures are Gini impurity and information gain, while for regression, the impurity measure is the variance. Training a decision tree consists of iteratively splitting the data. Gini impurity decides the best split of the data using the formula

$G = 1 - \sum_{i=1}^{C} p(i)^2$,

where $p(i)$ is the probability of selecting a data point of class $i$ and $C$ is the number of classes. Information gain (IG) is another measure to decide the best split of the data, based on the gain of each feature. The information gain of splitting a set $T$ on attribute $a$ is given in the following equation:

$IG(T, a) = H(T) - \sum_{v} \frac{|T_v|}{|T|} H(T_v)$,

where $H(T) = -\sum_{i} p(i) \log_2 p(i)$ is the entropy of $T$ and $T_v$ is the subset of $T$ in which attribute $a$ takes the value $v$.
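
The two criteria can be checked numerically; the helper functions below are illustrative, not the paper's code.

```python
# Numeric check of the two split criteria above.
import numpy as np

def gini(labels):
    # G = 1 - sum_i p(i)^2
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # H = -sum_i p(i) log2 p(i)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # IG = H(parent) - weighted average entropy of the children
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

node = [1, 1, 1, 0, 0, 0]
print(gini(node))                                    # 0.5 for a 50/50 node
print(information_gain(node, [1, 1, 1], [0, 0, 0]))  # 1.0 for a pure split
```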

2.6. Extreme Gradient Boosting

eXtreme Gradient Boosting (XGBoost) is a tree-based ensemble technique [20]. XGBoost can be used for classification, regression, and ranking problems, and is a type of gradient boosting. Gradient Boosting (GB) is a boosting ensemble technique that builds predictors sequentially instead of independently; it produces a strong classifier by combining weak classifiers [21]. The goal of GB is to build an iterative model that optimizes a loss function. It pinpoints the failings of weak learners by using gradients of the loss function [21]:

$y = F(x) + e$,

where $e$ denotes the error term. The loss function measures how well the model fits the underlying data. The choice of loss function depends on the optimization goal: for regression, it is a measure of the error between the true and predicted values, whereas for classification, it measures how well the model classifies cases correctly [21]. This technique takes less time and fewer iterations, since each predictor learns from the past mistakes of the earlier predictors. GB works by teaching a model $C$ to predict values of the form $\hat{y} = C(x)$ by minimizing a loss function, e.g., the mean squared error (MSE):

$MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$,

where $i$ iterates over a training set of size $n$, $y_i$ are the true values of the target variable, $\hat{y}_i = C(x_i)$ are the estimated values, and $n$ is the number of instances.

Considering a GB model with $M$ stages, and $m$ ($1 \le m \le M$) being a single stage, to improve a deficient model $F_m$, a new estimator $h_m(x)$ is added:

$F_{m+1}(x) = F_m(x) + h_m(x)$.

The estimator $h_m$ is fitted to $y - F_m(x)$, which is the difference between the true value and the predicted value, i.e., the residual. Thus, we attempt to correct the errors of the previous model $F_m$ [22].
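
The residual-fitting loop can be sketched in a few lines; this is a didactic implementation with squared loss and regression trees, not the paper's XGBoost configuration.

```python
# Didactic gradient boosting: each new regression tree is fitted to the
# residuals y - F_m(x) of the current ensemble.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, M=100, lr=0.1):
    F = np.full(len(y), y.mean())     # F_0: constant initial model
    trees = []
    for _ in range(M):
        residual = y - F              # negative gradient of the MSE loss
        h = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        F += lr * h.predict(X)        # F_{m+1} = F_m + lr * h_m
        trees.append(h)
    return trees
```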

XGBoost is better than AdaBoost in terms of speed and performance. It is highly scalable and runs up to ten times faster than existing solutions on a single machine. XGBoost handles sparse data and implements several optimization and regularization techniques. Moreover, it also uses parallel and distributed computing.

2.7. AdaBoost

Adaptive Boosting (AdaBoost) is a meta-learner originally designed for binary classification, proposed by Freund and Schapire [23]. It is an ensemble technique that builds a meta-classifier by combining several weak classifiers through progressive learning.

AdaBoost uses boosting as its data sampling concept; adaptive sampling is used to assign high weights to misclassified events. The misclassified samples are more likely to be selected in the next iteration to better train the model, and the final prediction is made using weighted voting. AdaBoost reduces the error rate and has a better effect on prediction compared to bagging [24], and it typically uses decision tree stumps as weak learners. Initially, all the samples in the data set have equal weights. Let x be the samples in the data set and y be the target, a binary class represented by 0 and 1. The first decision tree stump uses some records from the data set and makes predictions. After the initial prediction, the sample weights are updated: more weight is assigned to the samples that were misclassified, and the samples with high weights are selected in the next iteration. The process continues until the error rate is sufficiently reduced or a certain target level is achieved.

AdaBoost consists of two main steps, combination and step-forward, following a sequential iterative approach. All the instances in the training set have equal weights in the first iteration; in subsequent iterations, the weights are changed based on the error rates, and instances with errors receive increased weights. A binary classification problem containing $T$ training samples is represented in the following equation:

$\{(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)\}$, with $y_i \in \{0, 1\}$.

Let $C$ be the linear combination of weak classifiers. The combination of the classifiers is represented as

$C(x) = \sum_{n=1}^{N} \alpha_n c_n(x)$,

where $N$ is the number of weak classifiers, $\alpha_n$ represents the weights, and $c_n(x)$ represents the weak classifiers. In each iteration, the classifier is trained based on the performance of the classifier in the previous iteration:

$C_t(x) = C_{t-1}(x) + \alpha_t c_t(x)$,

where $c_t$ represents the classifier at iteration $t$ and $C_{t-1}$ is the ensemble built up to iteration $t - 1$.

The weights can be calculated using the following equation:

$\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$,

where $\epsilon_t$ represents the error rate of the weak classifier.
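
As a worked example, a weak classifier with an error rate of 0.30 receives a weight of about 0.42, while one at chance level (ε = 0.5) receives a weight of 0:

```python
# Classifier weight from its error rate: alpha = 0.5 * ln((1 - e) / e).
import math

def adaboost_alpha(error_rate):
    return 0.5 * math.log((1.0 - error_rate) / error_rate)

print(adaboost_alpha(0.30))  # ~0.424: a fairly useful weak learner
print(adaboost_alpha(0.50))  # 0.0: a chance-level learner gets no vote
```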

2.8. Optimization Strategy

This section discusses the optimization strategy used to find the hyperparameter combination that produces the best targeted outcomes. The Firefly optimization algorithm was used for parameter tuning; the details of Firefly are discussed in Section 2.3. Table 3 presents the hyperparameter values of Random Forest for all four targets; for RF, the "gini" index criterion was used. Table 4 presents the parameters used for XGBoost: the gbtree booster was used with a random state of 42 and a learning rate of 0.05. Similarly, Table 5 presents the optimal parameter values for AdaBoost. Furthermore, Figures 2–4 show the Grid Search optimization graphs for the Random Forest, eXtreme Gradient Boosting, and AdaBoost classifiers.
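
A sketch of the XGBoost setup described above follows; only the booster, learning rate, and random state are taken from the text (Table 4), and the remaining settings are library defaults.

```python
# Sketch of the tuned XGBoost configuration; X_bal, y_bal come from the
# preprocessing sketch. Other hyperparameters are library defaults.
from xgboost import XGBClassifier

xgb = XGBClassifier(booster="gbtree", learning_rate=0.05, random_state=42)
xgb.fit(X_bal, y_bal)
```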

3. Experimental Setup and Results

The model was implemented in Python 3.8.0 using the Jupyter Notebook environment. The scikit-learn library was used for the classifiers along with other needed built-in tools, while a separate library (xgboost 1.2.0) was used for the XGBoost ensemble. K-fold cross-validation with K = 10 was used for partitioning the data into training and testing sets. Five evaluation measures were used: accuracy, sensitivity (recall), specificity, positive predictive accuracy (PPA), and negative predictive accuracy (NPA). Sensitivity and specificity receive more focus in this study due to the application of the proposed model. Accuracy denotes the percentage of correctly classified cases, sensitivity measures the percentage of positive cases that were classified as positive, and specificity refers to the percentage of negative cases that were classified as negative. Moreover, the criteria for the selection of the performance evaluation measures depend upon the measures used in the benchmark studies. Two sets of experiments were conducted for each target: one using the features selected by the Firefly feature selection algorithm and one using all 30 features. The SMOTE technique was applied to generate synthetic data. The results of the models are presented in the sections below.
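
The five measures can be computed from the pooled confusion matrix of the 10-fold cross-validation, as in the illustrative sketch below; variable names continue from the earlier sketches and are assumptions.

```python
# Illustrative computation of the five measures from the pooled confusion
# matrix of 10-fold cross-validation.
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix

y_pred = cross_val_predict(xgb, X_bal, y_bal, cv=10)
tn, fp, fn, tp = confusion_matrix(y_bal, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # positives classified as positive
specificity = tn / (tn + fp)   # negatives classified as negative
ppa         = tp / (tp + fp)   # positive predictive accuracy
npa         = tn / (tn + fn)   # negative predictive accuracy
```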

3.1. Hinselmann

Table 6 presents the accuracy, sensitivity, specificity, PPA, and NPA for the RF, AdaBoost, and XGBoost models, respectively, using SMOTE for the Hinselmann test target class. The number of selected features for Hinselmann was 15. XGBoost outperformed the other classifiers for both feature sets; moreover, the performance of XGBoost with the selected features is better than with all 30 features. The model achieves an accuracy of 98.83%, sensitivity of 97.5%, specificity of 99.2%, PPA of 99.17%, and NPA of 97.63%, respectively.

3.2. Schiller

Table 7 presents the outcomes for the Schiller test. As with the Hinselmann target, XGBoost with the selected features outperformed the other classifiers for Schiller. However, the outcomes achieved for Schiller are lower than those for the Hinselmann target class. The performance of RF and XGBoost with the selected features for Schiller is similar, with only a minor difference. The number of features selected by Firefly for Schiller was 13.

3.3. Cytology

Table 8 presents the outcomes of all the classifiers for the Cytology diagnosis test. As with the Hinselmann and Schiller diagnostic tests, XGBoost with selected features outperformed the other classifiers for the Cytology test as well. For specificity and accuracy, similar outcomes were achieved using the 30 features and the selected features. Similarly, the performance of RF is comparable for both the 30 features and the selected features. The number of features selected by the Firefly feature selector for Cytology was 11.

3.4. Biopsy

For Biopsy, the performance was likewise not drastically different, yet using all the features resulted in a higher accuracy than using SMOTE with the selected features, as shown in Table 9. XGBoost obtained the highest accuracy of 97.1% with all features; however, for the other measures, the performance of XGBoost is better with the selected features. Similar performance was achieved for all measures when classifying with RF for both the 30-feature and selected-feature sets. The number of selected features used for the Biopsy target class was 11.

Overall, after comparing all four diagnostic tests, the Hinselmann test achieved the best outcome and can be used for the diagnosis of cervical cancer, as shown in Table 10. As per the outcomes achieved in the proposed study, the Hinselmann diagnosis test performs better than the other cervical cancer diagnosis tests, i.e., Schiller, Biopsy, and Cytology. Similar findings were made in the studies of Abdoh et al. [5] and Wu and Zhou [4].

4. Comparison with Existing Studies

The study used three ensemble techniques: AdaBoost, eXtreme Gradient Boosting, and Random Forest. Furthermore, the proposed study is the pioneer in using a bioinspired algorithm for feature selection and optimization for cervical cancer diagnosis. To explore the significance of our proposed study, its outcomes were compared with benchmark studies. The criterion for selecting the benchmark studies was the use of the same data set for the diagnosis of cervical cancer. Table 11 contains the comparison of the proposed technique with the benchmark studies in the literature. The best outcomes in the benchmark studies were achieved using 30 features; however, some of the outcomes in the previous studies were achieved with reduced features. The number in brackets next to some of the outcomes represents the number of features.

Based on Table 11, the proposed study outperforms the two benchmark studies in terms of accuracy with reduced risk factors. However, the achieved sensitivity and NPA are lower than those of Wu and Zhou [4] but higher than those of Abdoh et al. [5]. The number of features in Wu and Zhou's study is 30, while the proposed study used a reduced set of risk factors. The specificity and PPA of the proposed study are higher than those of the benchmark studies, except for the Schiller diagnosis test.

In a nutshell, the main contributions of the current study are the application of a bioinspired algorithm for feature selection and model optimization on cervical cancer risk factors. The proposed model enhanced the outcomes when compared with previous studies on the cervical cancer risk factors data set. Despite the above-mentioned advantages, the study suffers from some limitations: the data set suffers from a huge imbalance, and augmented data was generated using SMOTE. Moreover, the current study was based on an open-source data set, and further testing on other real and open-source data sets is required.

To alleviate the above-mentioned limitations, the model needs to be validated on real data sets obtained from hospitals.

5. Conclusion

This study presents an investigation of several ensemble techniques, namely, Random Forest, AdaBoost, and eXtreme Gradient Boosting, for diagnosing cervical cancer. The data set was obtained from the UCI machine learning repository and contains 858 records, 32 features, and 4 target variables; the target variables are the diagnosis tests used for cervical cancer. Experiments were conducted for each target class separately. Data preprocessing included imputing missing values and class balancing using SMOTE. Moreover, the bioinspired Firefly algorithm was used to optimize the models and to identify the key features. To compare the performance of the models, the experiments were conducted with 30 features and with the selected features using the SMOTE-balanced data. eXtreme Gradient Boosting outperformed the other two models for all four target variables. For future work, the model will be validated on multiple data sets, and other models that handle outliers and unbalanced data differently should be investigated.

Data Availability

The study used open-source data set available at https://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.