Abstract

Breast cancer (BC) is among the most common and fastest-spreading diseases across the globe. If identified early, the disease can be treated effectively, which eventually reduces the death rate. Machine learning (ML) is one of the most frequently applied technologies in this area of research, and cancer patients can benefit from early detection and diagnosis. Using machine learning approaches, this research proposes an improved way of detecting breast cancer. To deal with class imbalance and noise in the data, the Synthetic Minority Oversampling Technique (SMOTE) has been used. The proposed work consists of two phases. In the first phase, SMOTE is utilized to reduce the influence of the data imbalance problem; in the second phase, the data is classified using the Naive Bayes classifier, decision tree classifiers, Random Forest, and their ensembles. According to the experimental analysis, the XGBoost-Random Forest ensemble classifier outperforms the others, with 98.20% accuracy in the early detection of breast cancer.

1. Introduction

Breast cancer (BC) is one of the most common as well as dangerous diseases existing on the planet. There are two types of BC: invasive and noninvasive. The first type, invasive cancer, is malignant and spreads to other organs. The second type, noninvasive cancer, is precancerous and does not spread beyond the native organ, although it can eventually progress to invasive BC. Breast cancer can arise in the glands and in the milk ducts that convey the milk, and it frequently spreads to other organs, becoming aggressive. BC can be categorized into four categories: the first is a prestage breast cancer called carcinoma in situ. The second is the most common, accounting for 70-80 percent of all diagnoses. Inflammatory BC is another type that develops quickly and aggressively; its cells invade the skin and lymph vessels of the breast. The last type is metastatic BC, which spreads to other regions of the body.

Disease diagnosis is a difficult and time-consuming task in medicine. A great amount of medical diagnostic data can be found in diagnostic institutions, hospitals, research organizations, and websites. To automate and speed up disease diagnosis, however, this data must be categorized. As per the American Cancer Society [1], BC affects more women than any other malignancy. According to estimates, 252,710 women in the USA were diagnosed with invasive BC in 2017 and 63,410 women were diagnosed with in situ BC.

Avoiding BC altogether is quite difficult; however, if it is identified early, proper diagnosis and treatment can be provided to cure the disease, which also reduces treatment expenses. Since symptoms of cancer can be uncommon at times, early detection can be challenging. Mammograms and breast self-examinations are therefore essential for detecting any anomalies before the malignancy progresses [2].

BC outcomes are classified using a number of methods, and this disease can be classified and predicted using a variety of approaches. The XGBoost ensemble method developed in this paper can be used to classify breast tumors. For the proposed XGBoost ensemble model, we used Naive Bayes (NB) and decision tree (DT) classifiers as base learners. The performance of the proposed models is evaluated using the Wisconsin BC dataset from Kaggle and the UCI ML Repository. The aim of this research work is to increase prediction accuracy by detecting and categorizing malignant and benign cases.

2. Related Work

This section summarizes related research that has already been completed. The model recommended in [3] uses a hybrid machine learning method. It applied the MRMR feature selection approach together with four different classifiers to determine the optimal results. SVM, Naive Bayes, End Meta, and Function Tree were the four classifiers compared by the authors, and SVM was found to be an effective classifier. In the SVM classifier technique of [4], RFE and SVM are combined. RNNs are a type of neural network (NN) [57] that has a large number of layers in the sequential dimension and has been widely used in the modeling of time sequences. Unlike regular NNs, RNNs can analyze data objects where the activation at each step depends on the previous step. CNNs rely on "discrete convolution" since they make use of spatial relationships [8] among picture pixels; as a result, the image is typically assumed to be grayscale.

In [9], another hybrid model based on ML was proposed. Through experimental results, the authors claimed that SVM was a good classifier with higher accuracy than the others; they compared SVM with KNN, ANN, and DT algorithms on blood and image datasets. The authors of [10] suggested a machine learning model with different classifiers: Extreme Learning Machine, SVM, KNN, and ANN. The classifiers had to be tuned slightly to obtain better results, and Extreme Learning Machine produced the best results.

In the study [11], different ML techniques were compared. WEKA was used to perform the comparison on the Wisconsin BC dataset, and according to their findings, SVM produced better performance metrics. Deep learning (DL) methods originated after ML to overcome its limitations. A DL-based CNN paradigm was proposed in [12]; the authors employed a variety of CNN models and, after comparison, concluded that Inception V3 provided better accuracy than the others.

In healthcare, machine learning and its associated approaches have been identified as crucial in improving patient outcomes and wellbeing. An accuracy of 96.4 percent was reported using logistic regression [13]. SVM and KNN were employed to classify breast cancer in [14], and the accuracy was 96.85 percent. RF was used in [15], and the accuracy was 92.2 percent. To determine the optimal classifier on the BC dataset [16], researchers compared the performance of NB, SVM with an RBF kernel, DT, basic CART classifiers, and RBF neural networks. AdaBoost was also used and achieved 97.5 percent accuracy, outperforming Random Forest. Ensemble methods were used in [17] to achieve 96.25 percent accuracy, compared to 96.2 percent accuracy in an earlier study [18] using the back propagation strategy. Using SVM, KNN, RF, NB, and ANN as classification algorithms, the results showed 96.84% accuracy on the Wisconsin BC dataset.

On the acquired BC dataset, we used XGBoost ensemble learning employing NB and DT algorithms as base learners, and a significant boost in accuracy and recall was found. Ensemble learning improves the performance of classic models by integrating multiple models; the goal of generating numerous models is to apply ensemble learning and its basic techniques. Compared with individual classifiers, the ensemble learning method provides markedly more accurate results. Our methodology therefore employs the ensemble method to obtain good accuracy and to address the issues and restrictions identified in the studies above.

3. Methodology

Medical treatment as well as the accuracy of the diagnosis has a significant impact on the likelihood of survival and cancer recurrence. In this experiment, randomly extracted data was employed, with a 70:30 split between training and testing data. The training set was used to train the model, and its effectiveness was assessed on the test data. The dataset has 143 instances and contains 10 attributes whose values indicate whether or not a person is likely to have breast cancer. The output, or target, variable is binary: malignant or benign. The dataset taken from Kaggle consists of the following independent variables: sample code number, clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses, plus one dependent (output) variable. The first feature, sample code number, is not considered for processing as it has no predictive significance. Figure 1 represents the different stages of the procedure.
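The following minimal sketch illustrates this preparation step in Python with pandas and scikit-learn; the file name, column names, and label values are assumptions for illustration and are not taken from the paper.

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file and column names; adjust to the actual Kaggle CSV.
columns = ["sample_code_number", "clump_thickness", "uniformity_cell_size",
           "uniformity_cell_shape", "marginal_adhesion", "single_epithelial_cell_size",
           "bare_nuclei", "bland_chromatin", "normal_nucleoli", "mitoses", "class"]
data = pd.read_csv("breast_cancer.csv", names=columns)

# Drop the identifier column, which carries no predictive information.
X = data.drop(columns=["sample_code_number", "class"])
y = data["class"]  # assumed labels: malignant vs. benign

# 70:30 split between training and testing data, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)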

The authors [19] describe A-SMOTE, an advanced strategy to deal with the data imbalance problem. The steps are described below.

Step 1. The A-SMOTE method first generates synthetic objects using equation (1), which is expressed in terms of the number of majority class samples, the number of minority class samples, and the number of newly generated synthetic instances.

The synthetic objects obtained by SMOTE are then either accepted or rejected, depending on two criteria evaluated over the set of newly generated synthetic instances and the feature values of each instance.

Let one collection contain the minority samples and the other the majority samples. For each new synthetic instance, the distances to the minority samples and to the majority samples are computed using equations (2) and (3).

From the distances given by equations (2) and (3), two distance arrays are computed using equations (4) and (5).

Then, the minimum of the distances to the minority samples and the minimum of the distances to the majority samples are determined. If the former is less than the latter, the new sample is accepted; otherwise, it is rejected.

Step 2. The steps to remove the noise are listed below.

For each new synthetic minority instance obtained in Step 1, the distance to every original minority sample is computed using equation (6), where the sample collection that includes all minority instances is obtained using the equation that follows it.

Step 3. Calculate the distance between each new synthetic minority instance and every original majority sample using equation (9), where the sample collection that includes all majority instances is obtained using the corresponding equation.
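As a hedged illustration of the oversampling phase, the following sketch applies the standard SMOTE implementation from the imbalanced-learn library to the training split produced earlier; the A-SMOTE acceptance and noise-removal steps described above are specific to [19] and are not part of this library routine.

from collections import Counter
from imblearn.over_sampling import SMOTE

# Oversample only the training data so that the test set stays untouched.
smote = SMOTE(random_state=42)
X_train_bal, y_train_bal = smote.fit_resample(X_train, y_train)

print("Before:", Counter(y_train))
print("After: ", Counter(y_train_bal))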

3.1. Decision Tree (DT) Classifier

A DT is a basic diagram for categorizing samples. In a DT, the data is repeatedly split based on a parameter [20]. DTs are a well-known group of supervised classification algorithms. They perform well on classification tasks, their decision process is easy to understand, and the algorithm for creating (training) them is quick and simple. It is one of the most well-known modeling strategies because it is among the first classification and regression methods individuals learn when studying predictive modeling.
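A minimal decision tree sketch on the prepared split using scikit-learn follows; the hyperparameters are illustrative and are not the settings used in the paper.

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Train a CART-style decision tree on the balanced training data.
dt = DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=42)
dt.fit(X_train_bal, y_train_bal)

print("Decision tree accuracy:", accuracy_score(y_test, dt.predict(X_test)))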

3.2. Alternating Decision Tree (AltDT)

An AltDT consists of a sequence of decision nodes and prediction nodes; both the root and the leaves of an AltDT are always prediction nodes. An AltDT classifies an instance by following every path for which all decision nodes are true and summing the prediction nodes traversed along those paths [21].
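To make this classification rule concrete, the toy sketch below scores an instance by summing the prediction nodes along every branch whose decision test holds; the tree structure, feature names, and node values are invented for illustration and do not come from a trained AltDT.

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Splitter:
    condition: Callable[[dict], bool]  # decision node test
    yes: "PredictionNode"
    no: "PredictionNode"

@dataclass
class PredictionNode:
    value: float                       # contribution to the overall score
    splitters: List[Splitter] = field(default_factory=list)

def adt_score(node: PredictionNode, x: dict) -> float:
    # Add this prediction node, then follow every attached decision node
    # along the branch that its test selects.
    total = node.value
    for s in node.splitters:
        total += adt_score(s.yes if s.condition(x) else s.no, x)
    return total

# Toy alternating tree: the sign of the score gives the predicted class.
root = PredictionNode(0.2, [
    Splitter(lambda x: x["clump_thickness"] > 5,
             PredictionNode(0.7), PredictionNode(-0.4)),
    Splitter(lambda x: x["bare_nuclei"] > 3,
             PredictionNode(0.5), PredictionNode(-0.3)),
])
sample = {"clump_thickness": 7, "bare_nuclei": 1}
print("malignant" if adt_score(root, sample) > 0 else "benign")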

3.3. Reduced Error Pruning Tree (RedEPT)

RedEPT is a fast DT learning algorithm that constructs a DT based on information gain or by minimizing variance. Its basic pruning method is reduced error pruning (REP) with backfitting [22]. It sorts numerical attribute values only once and handles missing values by splitting instances into fractional instances, an approach adopted from C4.5, which is an extension of Quinlan's earlier ID3 algorithm. Training, validation, and test sets are used, and pruning continues until further trimming would be harmful; this is an effective strategy when a substantial amount of data is available.
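scikit-learn does not ship a reduced error pruning tree, so the sketch below substitutes its built-in cost-complexity pruning, selecting the pruning strength on a held-out validation set in the same spirit of pruning until further trimming hurts held-out performance; this is an assumed stand-in, not the REPTree algorithm itself.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hold out part of the training data as a validation set for pruning decisions.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train_bal, y_train_bal, test_size=0.25, random_state=42)

# Candidate pruning strengths from the cost-complexity pruning path.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(X_tr, y_tr)

best_alpha, best_score = 0.0, -np.inf
for alpha in np.unique(np.clip(path.ccp_alphas, 0.0, None)):
    tree = DecisionTreeClassifier(random_state=42, ccp_alpha=alpha).fit(X_tr, y_tr)
    score = tree.score(X_val, y_val)
    if score > best_score:
        best_alpha, best_score = alpha, score

# Refit on the full balanced training set with the selected pruning strength.
pruned_tree = DecisionTreeClassifier(random_state=42, ccp_alpha=best_alpha)
pruned_tree.fit(X_train_bal, y_train_bal)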

3.4. Random Forest (RF) Classifier

Random Forest is a supervised ML technique. The RF classifier is made up of numerous DTs, each built on a different subset of the data, and it aggregates the predictions of the individual trees to improve predictive accuracy. Rather than depending on a single decision tree, RF takes the majority vote over all trees and then predicts the result [23]. Every node in a decision tree answers a question about the instance.

For a candidate (nominal) split attribute A with V possible levels a_1, ..., a_V, the Gini index for this feature is computed as the weighted sum of the impurities of the partitions its levels induce, using the following equation: Gini_A(D) = Σ_{v=1}^{V} (|D_v| / |D|) × Gini(D_v), where D_v is the subset of D taking level a_v and Gini(D_v) = 1 − Σ_i p_{i,v}², with p_{i,v} the proportion of class i within D_v.
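A minimal Random Forest sketch with scikit-learn follows; the number of trees and other hyperparameters are illustrative rather than the paper's exact settings.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Each tree is grown on a bootstrap sample; Gini impurity drives the splits,
# and the forest predicts by majority vote over the trees.
rf = RandomForestClassifier(n_estimators=100, criterion="gini", random_state=42)
rf.fit(X_train_bal, y_train_bal)

print("Random Forest accuracy:", accuracy_score(y_test, rf.predict(X_test)))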

3.5. Naïve Bayes (NB) Classifier

The Bayes’ theorem is a straightforward formula for estimating conditional probabilities. The formula is given as follows: P(A | B) = P(B | A) P(A) / P(B), where A and B are events, P(A | B) is the probability of A given that B is true, P(B | A) is the probability of B given that A is true, and P(A) and P(B) are the probabilities of A and B, respectively.
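A short Naive Bayes sketch using scikit-learn's Gaussian variant follows; the specific NB variant is an assumption, since the paper does not state which one was used.

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Naive Bayes applies Bayes' theorem under the assumption that features are
# conditionally independent given the class.
nb = GaussianNB()
nb.fit(X_train_bal, y_train_bal)

print("Naive Bayes accuracy:", accuracy_score(y_test, nb.predict(X_test)))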

3.6. XGBoost

XGBoost is a highly scalable DT ensemble based on gradient boosting. Like gradient boosting, XGBoost minimizes a loss function to produce an additive expansion of the objective function. Because XGBoost uses only DTs as base classifiers, the complexity of the trees is controlled with a regularized variant of the loss function, as shown in equations (14) and (15), whose penalty term depends on the number of leaves of each tree and on the leaf output scores. This penalty can be incorporated into the decision trees' split criterion, resulting in a prepruning approach: trees with higher penalty values are simpler, and the penalty parameter determines how much loss reduction gain is required to split an internal node. Shrinkage is an additional regularization parameter in XGBoost that reduces the step size of the additive expansion. Finally, other tactics such as limiting tree depth can be used to restrict the complexity of the trees. As a side effect of reducing tree complexity, the models train faster and need less storage space.
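The sketch below uses the xgboost library's scikit-learn interface as a minimal illustration. Note that stock XGBoost boosts regression trees only, so the paper's NB- and DT-based ensemble variants are not reproduced here, and the shown hyperparameters (gamma, learning_rate, max_depth) are illustrative assumptions.

from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# XGBoost expects numeric class labels, so encode malignant/benign as 1/0.
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train_bal)
y_test_enc = le.transform(y_test)

# gamma: minimum loss reduction required to split an internal node;
# learning_rate: shrinkage applied to each additive step.
xgb = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1,
                    gamma=0.1, eval_metric="logloss", random_state=42)
xgb.fit(X_train_bal, y_train_enc)

print("XGBoost accuracy:", accuracy_score(y_test_enc, xgb.predict(X_test)))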

4. Performance Evaluation

Several ML methods, such as Naive Bayes, AltDT, RedEPT, and RF, are used as independent classifiers on the dataset. The implementation was done in Python. Their performance is compared using several metrics, which are detailed below.

Different performance metrics have been used to evaluate the suggested model. Precision is the percentage of accurately classified events among those classified as positive [24]; it indicates what proportion of the predicted positives is genuinely positive. The precision is computed using the following equation: Precision = TP / (TP + FP).

Recall indicates what proportion of the actual positives is predicted to be positive. The proportion of TPs to the sum of TPs and FNs is known as recall [25]; the true positive rate and recall are the same thing. Out of all possible positive instances, recall quantifies how many correct positive predictions were made. The recall is computed using the following equation: Recall = TP / (TP + FN).

For a classifier to be a good one, both precision and recall should equal one, which means the numbers of FPs and FNs must be zero. A statistic is therefore needed that takes both precision and recall into account. F1 is calculated as the harmonic mean of the two, using the following equation: F1 = 2 × (Precision × Recall) / (Precision + Recall).

The accuracy is computed using the following equation [26]: Accuracy = (TP + TN) / (TP + TN + FP + FN).
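These four metrics can be computed directly from the predictions with scikit-learn; the snippet below uses the Random Forest predictions from the earlier sketch, and the positive-label value is an assumption that should match whichever label denotes the malignant class.

from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

y_pred = rf.predict(X_test)
positive = "malignant"  # assumed label for the malignant class

print("Precision:", precision_score(y_test, y_pred, pos_label=positive))
print("Recall:   ", recall_score(y_test, y_pred, pos_label=positive))
print("F1 score: ", f1_score(y_test, y_pred, pos_label=positive))
print("Accuracy: ", accuracy_score(y_test, y_pred))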

Table 1 shows that RF is the best model in terms of time taken for model building (TTMB), requiring only 2.26 seconds, whereas AltDT required 60.38 seconds.

AltDT provides the best accuracy of 95.6%. RF provides 94.5% accuracy, RedEPT provides 89.23%, and prediction of NB classifier is the least with 88.50% accuracy. The accuracy prediction of various classifiers is shown in Figure 2.

To calculate the error rates in the predicted values, let D represent a set of test data of the form (x_1, y_1), (x_2, y_2), ..., (x_d, y_d), where each x_i is an n-dimensional test tuple with known response value y_i, ŷ_i is the corresponding predicted value, and d is the number of tuples in D.

The mean absolute error (MAE) is computed using the following equation: MAE = (1/d) Σ_{i=1}^{d} |y_i − ŷ_i|.

In RMSE, the errors are squared before being averaged, which means that RMSE gives larger errors a higher weight. This makes RMSE more useful when substantial errors exist and have a significant impact on the model’s performance. The squaring also avoids taking the absolute value of the error, which is convenient in many mathematical calculations. For this metric as well, the lower the value, the better the model’s performance. RMSE is calculated using the following equation: RMSE = sqrt((1/d) Σ_{i=1}^{d} (y_i − ŷ_i)²).

The relative absolute error (RAE) is used to evaluate a prediction model’s performance by comparing the total absolute error with that of the simple mean predictor. RAE is computed using the following equation: RAE = Σ_{i=1}^{d} |y_i − ŷ_i| / Σ_{i=1}^{d} |y_i − ȳ|, where ȳ is the mean of the observed values.

The RRSE (root relative squared error) is one of the measures used to assess how well the ML model fits the data; the model does not fit the data well if there is a substantial discrepancy between the predicted and observed values. It is calculated using the following equation: RRSE = sqrt(Σ_{i=1}^{d} (y_i − ŷ_i)² / Σ_{i=1}^{d} (y_i − ȳ)²).
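The four error measures can be computed with a short numpy routine, shown below; it treats the encoded class labels as numeric values, which is an assumption, since the paper does not state whether errors were computed on labels or on class probabilities.

import numpy as np

def error_measures(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    baseline = y_true - y_true.mean()  # errors of the trivial mean predictor

    mae = np.mean(np.abs(residual))
    rmse = np.sqrt(np.mean(residual ** 2))
    rae = np.sum(np.abs(residual)) / np.sum(np.abs(baseline))
    rrse = np.sqrt(np.sum(residual ** 2) / np.sum(baseline ** 2))
    return mae, rmse, rae, rrse

# Example with the encoded XGBoost predictions from the earlier sketch.
mae, rmse, rae, rrse = error_measures(y_test_enc, xgb.predict(X_test))
print(f"MAE={mae:.2f} RMSE={rmse:.2f} RAE={rae:.2f} RRSE={rrse:.2f}")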

Table 2 compares the error rates of the single classifiers. The error rates of RF are lower than those of the other classifiers.

Figure 3 shows the error rates of the different individual classifiers. The MAE values of NB, AltDT, RF, and RedEPT are 0.60, 0.38, 0.25, and 0.25, respectively, and their RMSE values are 0.75, 0.38, 0.36, and 0.40.

Table 3 shows that XGBoost-RF can be recommended, since it takes only 10.34 seconds for model building, whereas XGBoost-RedEPT is the least recommended model, taking 60.25 seconds for model building.

XGBoost-RF provides the best accuracy of 98.20%. XGBoost-AltDT provides 96.50% accuracy, XGBoost-RedEPT provides 82.25%, and the prediction of XGBoost-NB ensemble classifier is the least with 81.55% accuracy. The accuracy prediction of different classifiers is shown in Figure 4.

Table 4 compares the error rates of the XGBoost ensemble classifiers. The error rates of XGBoost-RF are lower than those of all other ensemble classifiers.

Figure 5 shows the error rates of the different ensemble classifiers. The XGBoost-RF ensemble classifier is the best one, with an MAE of 0.12 and an RMSE of 0.27. XGBoost-NB has the highest error rates, 0.44 for MAE and 0.56 for RMSE.

5. Conclusion

This paper proposes an XGBoost ensemble technique for breast cancer prediction based on known feature patterns and compares it with traditional data mining methods for disease diagnosis. During the feature extraction process, ensemble classification techniques replace the traditional techniques of retrieving useful information. The SMOTE technique has been employed to deal with the problem of data imbalance. According to the experimental results, the best model building time among the XGBoost ensemble classifiers is 10.34 seconds for XGBoost-RF, while XGBoost-RedEPT takes the worst time of 60.25 seconds. XGBoost-RF shows an error rate of 0.12 for MAE and 0.27 for RMSE. The results show that XGBoost-RF outperforms the other ensemble classifiers, with 98.20% accuracy.

Data Availability

The Irvine ML Repository data used to support the findings of this study are available at https://archive.ics.uci.edu/ml/datasets.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.