Abstract

Breast cancer is one of the most dangerous cancers, accounting for a large number of fatalities each year, and it is the leading cause of cancer mortality among women globally. It has attracted considerable interest in the scientific community because of its life-threatening nature. As a consequence, many machine learning methods (MLMs) have been adapted to provide the best results for early diagnosis of this malignancy. MLMs offer several benefits in breast cancer care, including early prognosis, detection, and diagnosis. Compared to traditional statistical analysis, MLMs can improve the analysis of varied health data, including unstructured, complicated, and noisy data. Given the rising prevalence of breast cancer and the arrival of "data reformation," it is imperative to consider the ethical consequences of machine learning (ML) for society and cancer care. ML offers powerful tools, smart methods, and efficient algorithms that can help in the prognosis of breast cancer. The focus of this review is on supervised techniques, such as classification and regression, that may be implemented for breast cancer data analysis. Supervised learning methods, including Naive Bayes, AdaBoost, and the support vector machine, are presented in this work for the early identification of breast cancer. These algorithms are analyzed for accuracy and efficiency using various assessment metrics and methods.

1. Introduction

Breast cancer is among the most significant threats to women worldwide and, according to the World Health Organization (WHO), the second biggest cause of death among females [1, 2]. Breast cancer claimed the lives of 70,218 Indian women in 2012. The disease develops through a stepwise process involving a range of cell types, and its prevention remains a significant challenge all over the world [2]. Huge methodological revolutions have occurred in this field over the years; however, a variety of challenges remain.

For timely identification, several image processing methodologies have been applied to minimize the yearly probability of death, and a variety of machine learning techniques are applied to boost precision in the detection of breast cancer. The machine learning techniques developed in recent decades can be divided into three stages: (a) preprocessing, (b) feature extraction, and (c) classification.

Preprocessing mammography images improves the visibility of peripheral regions and the brightness distribution, making the data easier to interpret. In certain industrialized nations, breast cancer survivors have an overall 5-year survival rate of more than 80% due to early identification and care. Studies of breast cancer stem cells have revealed the genesis and mechanisms of tumor-resistant strains, and various breast cancer genes have been identified. People now have a plethora of pharmaceutical options for breast cancer prevention and treatment, and biological prevention has since been developed to provide even greater advantages in terms of quality of life [3–5].

2. Breast Cancer

Breast cancer has been defined as abnormal tissue development in the breast that results in malignant tumors. Non-invasive breast cancer refers to malignant cells that remain confined within the breast. Invasive malignancies are harder to cure and might spread to other regions of the body; these deadly malignancies can spread to other organs, resulting in metastatic cancer [6–9].

2.1. Categories of Breast Cancer

There are two categories of breast cancer: invasive and non-invasive. Ductal carcinoma in situ is a non-invasive variety, whereas the rest are invasive [10–13].

In Figure 1, various types of breast cancer are given, and they are defined in Table 1.

2.2. Breast Cancer Diagnosis

Diagnosing breast cancer requires multiple investigations. The first stage is a clinical examination: probing of the breast to discover cancer indicators. The next stage is medical imaging, which allows the identification of tumor masses and complements the data collected during the clinical examination. The imaging tests include mammography, ultrasonography, and magnetic resonance imaging (MRI) [14–17]. One of these strategies is chosen based on the patient's circumstances, as shown in Figure 2.

Machine learning is now in such high demand that it is being offered as a service. It is a high-barrier area that frequently necessitates specialist knowledge: designing an efficient machine learning model requires a set of abilities spanning the preprocessing, feature selection, and classification steps. The main steps of breast cancer diagnosis are depicted in Figure 2.

3. Materials and Methods

In machine learning [18], models are built automatically from previous records. The trained model can then assess new data and generate future predictions [19].

To begin this investigation, the Wisconsin Diagnostic Breast Cancer (WDBC) dataset was obtained from the UCI Machine Learning Repository. The records are classified into two types: malignant and benign. The data points are derived from digitized fine-needle aspirate images of breast masses and describe characteristics of the cell nuclei present in each image. The WDBC database comprises 212 malignant and 357 benign cases among 569 Wisconsin hospital patients [20–22].
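As a sketch of the dataset described above, the WDBC data can be loaded from the copy that ships with scikit-learn (assuming that library is available); the class counts should match the 212 malignant and 357 benign cases cited in the text.

```python
# Load the Wisconsin Diagnostic Breast Cancer (WDBC) dataset from
# scikit-learn's bundled copy and check the class balance.
from collections import Counter

from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

# In scikit-learn's encoding, target 0 = malignant and 1 = benign.
counts = Counter(data.target_names[label] for label in y)
print(X.shape)                                # (569, 30)
print(counts["malignant"], counts["benign"])  # 212 357
```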

The following subcategories outline the process flow:
(i) Dataset definition
(ii) Dataset assessment
(iii) Training and testing

3.1. Dataset Definition

The Wisconsin Breast Cancer Dataset (WBCD), separated by benign and malignant status, is shown in Figure 3. The dataset contains 357 benign and 212 malignant breast cancer cases.

All the breast cancer attributes are listed in Figure 3 and Table 2 with real-value characteristics.

3.2. Dataset Assessment

The entire dataset has been examined for data patterns. The mean radius feature of the dataset was plotted, showing that patients expected to be cancer-free have a scaled mean radius of about 1, whereas probable cancer-bearing patients have a mean radius greater than 1.
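The class-wise difference in this feature can be checked directly, assuming the scikit-learn copy of the dataset. On the raw (unscaled) values, benign tumors show a clearly smaller average mean radius than malignant ones; the threshold of 1 quoted above presumably refers to scaled values.

```python
# Compare the raw "mean radius" feature between the two classes.
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
mean_radius = data.data[:, 0]              # feature 0 is "mean radius"
malignant = mean_radius[data.target == 0]  # target 0 = malignant
benign = mean_radius[data.target == 1]     # target 1 = benign

print(round(malignant.mean(), 2))  # roughly 17.5
print(round(benign.mean(), 2))     # roughly 12.1
```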

3.3. Training and Testing

The dataset values are examined based on their properties before being used in the next stage. The dataset was then randomly split into training and testing sets in a 75:25 ratio.

Patients regularly face a variety of interventions, including ultrasound, biopsy, and mammography, depending on the severity of their breast cancer signs. The most common of these is biopsy, which entails the removal of a piece of tissue or cells for investigation. A fine-needle aspiration (FNA) procedure is used to remove a tissue sample from the breast, which is sent to a pathology laboratory for analysis under a microscope. Numerical attributes can be ascertained through investigation of the tissue at the microscopic level. The evidence obtained from fine-needle aspiration is evaluated in conjunction with other imaging analyses to determine the possibility of the patient having a cancerous lump.

Six classification models were implemented: ANN, logistic regression, AdaBoost, Naive Bayes, support vector machine, and K-nearest neighbor. The findings obtained from the algorithms are then assessed in order to find the most efficient model. The workflow for the given dataset is illustrated in Figure 4.

4. Machine Learning Algorithms for Breast Cancer Prediction

4.1. K-Nearest Neighbor (KNN)

KNN (K-nearest neighbor) [23] is a supervised method that categorizes data depending on how its neighbors are categorized. KNN keeps track of all existing instances and classifies new ones using a similarity score: the method searches the dataset for the records most similar to the new instance, and these records are known as the instance's nearest neighbors [24, 25].

For each new sample, the Euclidean distance between its feature vector and those of the stored training samples is used to determine its K nearest neighbors, and the majority class among those neighbors is assigned to the sample.
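The neighbor-voting idea can be sketched from scratch in a few lines; the toy 2-D points and labels below are made up for illustration, not drawn from the WDBC features.

```python
# Minimal KNN: classify a query point by majority vote among the K
# training points closest to it in Euclidean distance.
from collections import Counter
import math

def knn_predict(train, labels, query, k=3):
    # Sort training points by Euclidean distance to the query.
    dists = sorted((math.dist(x, query), y) for x, y in zip(train, labels))
    # Majority vote among the k nearest neighbors.
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

train = [(1, 1), (2, 1), (8, 8), (9, 9)]
labels = ["benign", "benign", "malignant", "malignant"]
print(knn_predict(train, labels, (1.5, 1.2)))  # benign
```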

4.2. Support Vector Machine (SVM)

Vapnik first introduced the support vector machine in 1979 [26] and proposed it again for classification and regression problems in 1995 [27]. Support vector machines, like multilayer perceptrons and radial-basis function networks, can be utilized for pattern classification. The SVM is a technique that brings maximum-margin classification into statistical learning theory [28]. SVM techniques can categorize both linear and non-linear data; non-linear data are handled by projecting the training samples into a higher-dimensional space. The SVM carries out the classification task by maximizing the margin separating the two classes while reducing classification errors. Although the SVM is used to solve a range of problems, such as regression, data categorization is the most popular. Figure 2 depicts the main concept. The goal is to identify a hyperplane that separates the positive and negative data points by the greatest possible margin.
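A maximum-margin SVM on the WDBC data can be sketched as below, assuming the scikit-learn dataset copy. Feature scaling is added here because SVMs are sensitive to feature ranges (an implementation choice not stated in the text), and the exact accuracy, near the roughly 98% this paper reports, depends on the particular random split.

```python
# RBF-kernel SVM on the WDBC dataset with standardized features.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print(round(acc, 3))
```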

4.3. Artificial Neural Network (ANN)

This algorithm is capable of performing a variety of tasks, such as prediction, curve fitting, and regression. It formulates the output using a transfer function [29, 30]. Each input is multiplied by a weight, which acts as the link between the input, the neuron, and its layers. In the last stage, the neuron applies a transfer function to obtain the result. ANN approaches offer the advantage of requiring no prior knowledge of the mathematical relationships among attributes; they also demand relatively little processing effort and can handle multivariable problems [31].
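The single-neuron computation described above can be sketched as follows: each input is multiplied by a weight, the products are summed with a bias, and a transfer function produces the output. The sigmoid transfer function and the numeric values are assumptions for illustration.

```python
# One artificial neuron: weighted sum of inputs passed through a
# sigmoid transfer function.
import math

def neuron(inputs, weights, bias):
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid transfer function

out = neuron([0.5, 0.2], [0.4, -0.6], bias=0.1)
print(round(out, 3))  # 0.545
```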

4.4. Logistic Regression (LR)

Logistic regression is one of the few machine learning methods that is not a black-box approach: black-box techniques are difficult to interpret, whereas logistic regression makes its inner workings visible. Logistic regressions are classified into three types: binary, multinomial, and ordinal [32]. Despite its name, logistic regression is a classification technique rather than a linear regression. Scikit-learn solves multiclass classification problems with an appropriate logistic regression implementation [33–35].
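A logistic regression run on the WDBC data can be sketched as below (scaling and the random split are assumptions of this sketch). The interpretability point above is concrete here: the fitted model exposes one coefficient per input feature.

```python
# Logistic regression on the WDBC dataset; inspect its coefficients.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
acc = model.score(X_test, y_test)
print(round(acc, 3))

# Unlike a black box, the learned weights are directly inspectable.
coef = model.named_steps["logisticregression"].coef_
print(coef.shape)  # (1, 30): one weight per feature
```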

4.5. Naive Bayes

A Naive Bayes classifier is a probabilistic learner most commonly utilized in clinical settings. One of its significant benefits is that it requires only a small training dataset for computation. According to the Naive Bayes assumption, the occurrence of a particular feature in a class has no bearing on the inclusion or exclusion of any other feature [36]. A Naive Bayes classifier considers only the semantic content of the messages it classifies; operational filters additionally take into account details such as the existence of a suspicious header, which provides extra properties for message interpretation [37, 38].

Consider a set of experiments in which each statement is represented as a vector (Y1, Y2, …, Yn), where Y1, Y2, …, Yn are the given attributes. Each attribute contains information about a single message token. In the simplest case, all the attributes are Boolean [39].

Bayes' theorem gives the likelihood that a message with vector Y = (Y1, Y2, …, Yn) belongs to category c:

P(c | Y) = P(c) · P(Y | c) / P(Y),

where, under the naive independence assumption, P(Y | c) = P(Y1 | c) · P(Y2 | c) · … · P(Yn | c).
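The rule above can be worked through on a toy example with Boolean attributes; the priors and per-attribute likelihoods below are made up purely for illustration.

```python
# Naive Bayes on Boolean attributes: the unnormalized score of a class
# is P(c) times the product of per-attribute likelihoods P(Y_i | c).
def naive_bayes_score(prior, likelihoods, features):
    score = prior
    for p_true, observed in zip(likelihoods, features):
        # likelihoods hold P(Y_i = True | c); use the complement otherwise.
        score *= p_true if observed else (1.0 - p_true)
    return score

features = [True, False, True]  # observed Boolean vector Y
score_pos = naive_bayes_score(0.4, [0.8, 0.3, 0.9], features)
score_neg = naive_bayes_score(0.6, [0.2, 0.5, 0.4], features)

# Normalizing the scores yields the posterior P(c | Y).
posterior_pos = score_pos / (score_pos + score_neg)
print(round(posterior_pos, 3))  # 0.894
```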

4.6. AdaBoost

AdaBoost is a classification boosting technique. It focuses on data points, combining the predictions of numerous weak classifiers into a single strong classifier [40–44]. The method centers on classifier "training" and "weighting": it assigns a weight to every instance in the training dataset and then applies the method iteratively. Each iteration raises the weights of incorrectly classified instances while lowering the weights of correctly classified ones. This iteration is carried out to improve the weak classifiers' accuracy [45–47]. The AdaBoost algorithm reduces the exponential loss as follows.

Step 1. Take a training set (x1, y1), …, (xN, yN), where each label yn ∈ {−1, +1}.

Step 2. Assign the initial weight D1(n) = 1/N for n = 1, …, N.
Train the weak learner on the weighted data.

Step 3. Examine the weak hypothesis ht and its weighted error: εt = Σn Dt(n) · [ht(xn) ≠ yn].

Step 4. Find the vote weight αt = (1/2) · ln((1 − εt)/εt).

Step 5. Update the distribution: Dt+1(n) = Dt(n) · exp(−αt · yn · ht(xn)) / Zt, where Zt is a normalization factor.

Step 6. The final hypothesis will be H(x) = sign(Σt αt · ht(x)).
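One round of Steps 2 through 5 can be traced on a toy sample (the labels and weak-hypothesis outputs below are made up): compute the weighted error, the vote weight alpha, and the reweighted distribution, and observe that the misclassified point gains weight.

```python
# One AdaBoost round on a toy sample with labels in {-1, +1}.
import math

y = [+1, +1, -1, -1]  # true labels
h = [+1, -1, -1, -1]  # weak hypothesis: one mistake (index 1)
N = len(y)
D = [1.0 / N] * N     # Step 2: uniform initial weights

# Step 3: weighted error of the weak hypothesis.
eps = sum(d for d, yi, hi in zip(D, y, h) if yi != hi)

# Step 4: vote weight alpha = 0.5 * ln((1 - eps) / eps).
alpha = 0.5 * math.log((1 - eps) / eps)

# Step 5: reweight and renormalize the distribution.
D = [d * math.exp(-alpha * yi * hi) for d, yi, hi in zip(D, y, h)]
Z = sum(D)
D = [d / Z for d in D]

print(round(eps, 3), round(alpha, 3))
print([round(d, 3) for d in D])  # the misclassified point gains weight
```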

5. Measurement and Experimental Result

5.1. Confusion Matrix for Different ML Algorithms

Various machine learning methods [48, 49] are applied to the given breast cancer dataset. The performance of a classification technique is shown and summarized using a confusion matrix [50]. The anticipated and actual classifications are shown in a confusion matrix of dimension n × n associated with a classifier, where n is the number of possible classes. A confusion matrix is used in this work [51] with the following interpretations for the entries.

A confusion matrix is frequently used to show the performance of a classifier based on four values: TP, FP, TN, and FN [52–54]. The prediction accuracy and classification error [55] can be obtained from the confusion matrix.

5.2. Results

The confusion matrix is used to assess each method's classification performance, and the results are then compared. Four terms in this matrix are used for evaluating performance: TP (true positive), FP (false positive), TN (true negative), and FN (false negative) [56–59].
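Building such a matrix and reading off the four counts can be sketched with scikit-learn; the toy labels below (1 = malignant) are made up for illustration.

```python
# Build a 2x2 confusion matrix and extract TN, FP, FN, TP.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

# With labels=[0, 1], ravel() returns TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```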

Accuracy [60, 61] denotes the proportion of patients, whether positive or negative, that are correctly classified:

Accuracy = (TP + TN) / (TP + TN + FP + FN).

Precision is a metric used to determine how many of the positive predictions are correct (true positives) [62, 63]. The formula is as follows:

Precision = TP / (TP + FP).

Recall measures how many of all the positive examples in the data the classifier correctly predicted [64]. It is occasionally referred to as sensitivity [65]. The formula is as follows:

Recall = TP / (TP + FN).

The F1 score [66] is a statistic that combines precision and recall; it is often described as the harmonic mean of the two [67]. The harmonic mean is a way of computing an "average" that is considered superior to the arithmetic mean for ratios such as precision and recall [68]. The F1 score [69] formula is as follows:

F1 = 2 · (Precision · Recall) / (Precision + Recall).
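The four metrics above can be computed directly from confusion-matrix counts; the counts in this sketch are made up for illustration.

```python
# Accuracy, precision, recall, and F1 from confusion-matrix counts.
def metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = metrics(tp=45, fp=5, tn=40, fn=10)
print(round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
```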

Six supervised learning algorithms were examined for their prognosis and detection capacities in the diagnosis of breast cancer [70, 71]. A maximum of 569 instances of the dataset were utilized [72], each with eight input parameters and one output parameter. Figures 5–9 and Table 3 show the anticipated outcomes for the six supervised learning algorithms (KNN, SVM, Naive Bayes, ANN, logistic regression, and AdaBoost) based on the confusion matrix. These figures compare the various approaches and depict all of the expected findings [73–75]. The anticipated results clearly show that, compared to the other algorithms, the support vector machine predicts breast cancer with the highest degree of accuracy [56, 76].

As seen in the comparison table and graph, the support vector machine (SVM) gives the highest accuracy on the test dataset, 97.9%. SVM has a higher accuracy percentage than the other supervised learning algorithms and is therefore the optimal algorithm for prediction, according to this research.

6. Conclusion

Breast cancer patients with a good prognosis and early discovery have a better chance of being cured. In the current study, several of the most prominent machine learning methods were utilized to classify breast cancer based on ten criteria. The machine learning methods applied to this dataset were KNN, SVM, Naive Bayes, ANN, logistic regression, and AdaBoost, with respective accuracies of 92.10%, 97.90%, 97.26%, 97.36%, 96.49%, and 97.54%. These methods were applied to the prognosis of breast cancer, and the support vector machine (SVM) proved the most proficient algorithm among all the supervised learning methods.

In the future, supervised learning models can be combined with other techniques, such as genetic algorithms, ACO (ant colony optimization), and PSO (particle swarm optimization), so that early detection can be performed with even greater accuracy.

Data Availability

The data used to support the findings of this study are available from the author Vandana Rawat upon request ([email protected]).

Conflicts of Interest

The authors declare that they have no conflicts of interest.