Abstract

Cancer is a deadly disease that occurs due to rapid and uncontrolled cell growth. In this article, a machine learning (ML) algorithm is proposed to diagnose different cancer diseases from big data. The algorithm comprises a two-stage hybrid feature selection. In the first stage, an overall ranker combines the results of three filter-based feature evaluation methods, namely, chi-squared, F-statistic, and mutual information (MI), and the features are then ordered according to this combination. In the second stage, a modified wrapper-based sequential forward selection is utilized to discover the optimal feature subset, using ML models such as support vector machine (SVM), decision tree (DT), random forest (RF), and K-nearest neighbor (KNN) classifiers. To examine the proposed algorithm, extensive tests were carried out on four cancerous microarray datasets, employing 10-fold cross-validation and hyperparameter tuning. The performance of the algorithm is evaluated by computing the diagnostic accuracy. The results indicate that for the leukemia dataset, both SVM and KNN models register the highest accuracy at 100% using only 5 features. For the ovarian cancer dataset, the SVM model achieves the highest accuracy at 100% using only 6 features. For the small round blue cell tumor (SRBCT) dataset, the SVM model also achieves the highest accuracy at 100% using only 8 features. For the lung cancer dataset, the SVM model achieves the highest accuracy at 99.57% using 19 features. Compared with other algorithms, the proposed algorithm yields superior results in terms of the number of selected features and diagnostic accuracy.

1. Introduction

DNA microarray is a modern biological research technology for gene expression analysis. It has the ability to measure the expression levels of thousands of genes during important biological processes [1]. Therefore, this technology has become an important tool, used by researchers to identify the genes that cause cancer. In addition, it has enabled researchers to diagnose different gene-related cancer diseases [2]. As a result, numerous applications of DNA microarray technology have been implemented, which has led to a huge amount of genomic microarray data [3].

Microarray data have specific characteristics, namely, high dimensionality and a small number of samples. As such, the analysis of microarray data is considered a difficult task [4]. Since microarray data include many dimensions, making them big data, dimensionality reduction (DR) is an essential preprocessing step in the classification process. The presence of many dimensions causes three main problems for the classification task: a delayed learning process, increased computational cost, and decreased classification accuracy [5].

DR techniques can be classified into two main approaches: feature extraction and feature selection. The feature extraction approach aims to project the features into a new feature space of lower dimensionality; the newly constructed features are usually combinations of the original ones. Examples of feature extraction techniques include linear discriminant analysis (LDA), principal component analysis (PCA), and canonical correlation analysis (CCA). On the other hand, the feature selection approach uses the original dataset to select an optimal subset of informative features by eliminating the redundant and irrelevant ones [6]. Generally, feature selection methods are categorized into four groups: filter, wrapper, embedded, and hybrid methods.

In filter methods, the most relevant features are selected through the data itself; i.e., the features are evaluated according to the intrinsic and statistical properties of the data, without using any machine learning (ML) algorithm to guide the search for relevant features [7]. Hence, these methods are distinguished by their low computational cost and scalability. Examples include information gain (IG), correlation-based feature selection (CFS), Fisher score, ReliefF, chi-squared, mutual information (MI), and minimum redundancy maximum relevance (mRMR) [8]. In wrapper methods, different feature subsets are evaluated according to the performance of a specific ML model so that the best subset is identified [9]. Although wrapper methods are more accurate than filter methods, they are more complex and slower. The most common examples of wrapper methods are forward feature selection, backward feature elimination, and recursive feature elimination, which are explained next.
(i) Forward Feature Selection. This is an iterative approach. In the beginning, there is a null model; then, the model is fitted with each individual feature one at a time, and the feature with the highest classification accuracy is determined. Thereafter, a model is fitted with two features by trying combinations of the earlier selected feature with all other remaining features, and the combination that achieves the maximum classification accuracy is determined. This process is repeated until a subset of features outperforms all other determined subsets in terms of classification accuracy [10].
(ii) Backward Feature Elimination. In this approach, all features are initially added to the model, and in each iteration, the least significant feature is removed based on some evaluation criterion. This process continues until no progress is detected by eliminating features [11].
(iii) Recursive Feature Elimination. This is an optimization algorithm that aims to find the finest feature subset. Unlike the previous approaches, it continually produces a new model [12].
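To make the wrapper idea concrete, the following is a minimal sketch of forward feature selection; it is an illustrative reading of the procedure in (i) rather than code from any cited work, and the feature matrix X, label vector y, and classifier choice are assumptions.

```python
# Minimal sketch of wrapper-based forward feature selection (illustrative only).
# Assumes a NumPy feature matrix X (n_samples x n_features) and a label vector y.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def forward_selection(X, y, model=None, cv=10):
    model = model if model is not None else SVC(kernel="linear")
    remaining = list(range(X.shape[1]))
    selected, best_score = [], 0.0
    while remaining:
        # Fit the model with each candidate feature added and keep the best one.
        scores = [(np.mean(cross_val_score(model, X[:, selected + [f]], y, cv=cv)), f)
                  for f in remaining]
        score, best_f = max(scores)
        if score <= best_score:  # stop once no candidate improves accuracy
            break
        best_score = score
        selected.append(best_f)
        remaining.remove(best_f)
    return selected, best_score
```

A call such as selected, acc = forward_selection(X, y) returns the indices of the chosen features together with the best cross-validated accuracy.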

In embedded methods, ML models are used with their own built-in feature selection methods [13]. Examples of embedded methods are L1 (LASSO) regularization and decision tree (DT) [14]. In hybrid methods, the advantages of the filter and the wrapper methods are merged: one or more filter-based methods are used first, and then, a wrapper method is used to select the optimal feature subset [15]. In some cases, hybrid methods give better results than stand-alone ones [16]. In this article, a modified wrapper-based sequential forward selection technique is proposed.

In recent years, each of the ingredients of the proposed system has been the topic of much research work. As far as ML models are concerned, numerous studies have focused on employing them for cancer diagnosis. In [17], the authors present a review of 48 articles on the role of ML in disease prediction, concluding that the support vector machine (SVM) classifier is applied most frequently, followed by naive Bayes (NB). Regarding accuracy, they find that the random forest (RF) model is the best. This view of RF is shared by the authors of [18], who test five ML models, namely, SVM, DT, RF, NB, and gradient boosting (GB), to classify samples into cancerous and noncancerous, and report that RF achieves the best performance. The same view is also shared by the authors of [19], who use ten models for classifying cancer patients and report that RF with the Wilcoxon signed rank-sum (WCSRS) test gives more accurate predictions than LDA, quadratic discriminant analysis (QDA), NB, Gaussian process classification (GPC), SVM, artificial neural network (ANN), logistic regression (LR), DT, and AdaBoost (AB). Another view is held by the authors of [20], who report that SVM provides better classification based on their experiments with SVM and NB. In [21], the authors compare the performance of three ML models, namely, nearest neighbors (NN), SVM, and NB, for the prediction of cancer among other diseases. They report that the NN model outperforms the other two models. In [22], the authors evaluate the performance of ML models for the purpose of biomarker prediction and report that DT yields higher performance than LDA and NB. In [23], the authors use a deep learning (DL)-based multimodel ensemble method, based on five ML models: NN, SVM, DT, RF, and GB, for cancer prediction. They show that the ensemble technique achieves better results than the individual base models. In [24], the authors present three ML models, namely, SVM, ANN, and DT, to classify five tumor types. They report that both SVM and ANN can be used efficiently for this classification task; DT can also be used but is less efficient than the others.

Some more relevant studies in the context of disease diagnosis using ML are in order. In [25], the authors propose an ensemble learning framework to solve positive-unlabeled learning problems in predicting miRNA-disease associations. The framework consists of a semi-supervised K-means method and a sub-aging method, combined with an effective random vector functional link network as a prediction model. In [26], the authors develop a hybrid learning framework to forecast multistep-ahead meningitis cases; the proposed framework combines signal decomposition with a weighted integration strategy. In [27], an ML pipeline is suggested for the accurate prediction of heart disease. It includes preprocessing and entropy-based feature engineering. Performance analysis is carried out on LR, DT, RF, NB, KNN, SVM, AB, and XGBoost. In [28], the authors utilize an ensemble ML technique in hybrid integrations to predict dengue disease with high accuracy. In [29], ML approaches such as Bayesian regression neural network, cubist regression, KNN, quantile random forest, and support vector regression are used stand-alone and coupled with variational mode decomposition for predicting COVID-19 cases.

To overcome the dimensionality problem, a set of useful feature selection methods has been proposed to analyze gene profiles and select highly distinguishing genes, called biomarkers. In [30], the authors propose a gene selection programming (GSP) method for selecting relevant genes to effectively classify cancer. SVM with a linear kernel is used as the classifier of the GSP. The proposed method is tested on ten microarray datasets. The experiments demonstrate that GSP is the most effective for removing irrelevant and redundant genes from microarray datasets. In addition, the authors demonstrate that the subset of genes selected by GSP achieves the highest classification accuracy with the lowest processing time. In [31], the authors present a two-stage gene selection method, called mRMR-COA-HS. In the first stage, the number of genes is reduced by mRMR. In the second stage, a combination of the cuckoo optimization algorithm (COA) and harmony search (HS) with the SVM classifier is used. This method is evaluated on four microarray datasets. The authors report that the mRMR-COA-HS method is significantly superior to other methods. In [32], the authors propose a feature selection algorithm based on relevance, redundancy, and complementarity (FS-RRC). To illustrate its performance, FS-RRC is compared with eleven effective feature selection methods on fifteen public biological datasets and two synthetic datasets. The experimental results demonstrate the superiority of FS-RRC. In [33], the authors develop a novel hybrid wrapper approach called BTLBOGSA for gene selection. This approach is based on integrating the characteristics of the teaching-learning-based optimization algorithm (TLBO) and the gravitational search algorithm (GSA). The proposed method employs an NB classifier as a fitness function to select the extremely important genes that can help classify cancer accurately. The effectiveness of this method is tested on ten biological datasets. Experimental results show that it clearly outperforms other available filter and wrapper methods.

In [34], the authors propose a customized similarity measure using a fuzzy rough quick reduct algorithm for feature selection; this method is evaluated on the leukemia, lung, and ovarian cancer gene expression datasets with the RF classifier. The authors conclude that the proposed method shows promising results compared with other methods. In [35], the authors present a two-stage gene selection method, called MI-GA. In the first stage, MI-based gene selection is used; in the second stage, genetic algorithm (GA)-based gene selection is used. The efficiency of the proposed method is verified using the SVM classifier in five variations, each using a different kernel function. This method is applied to colon, lung, and ovarian cancer datasets. The results show that the proposed MI-GA gene selection method gives better results than the existing methods and produces maximum classification accuracy. In [36], the authors introduce a distributed feature selection (DFS) strategy using symmetric uncertainty (SU), CFS, and multilayer perceptron (MLP) through distribution across multiple clusters. Well-known classifiers are applied to the selected features, including RIDOR, SVM, NN, and simple cart (SC). The experimental implementation of this strategy achieves about a 57% success rate and an 18% competitive rate compared with traditional methods when applied to seven high-dimensional microarray datasets and one lower-dimensional dataset. In [37], the authors use a MapReduce (MR)-based approach to present a novel distributed method. The presented algorithm consists of MR-based Fisher score (mrFScore), MR-based ReliefF (mrReliefF), and MR-based probabilistic neural network (mrPNN) using the weighted chaotic grey wolf optimization technique (WCGWO). The authors report that WCGWO-mrPNN outperforms the other methods when tested on seven well-known high-dimensional microarray classification datasets.

In [38], the Jaya optimization algorithm is exploited to introduce a novel feature selection approach called FSJaya. To evaluate the efficiency of the FSJaya approach, four classifiers, namely, NB, KNN, LDA, and rep tree (RT), are used on several datasets of different dimensions. The authors show that the proposed approach efficiently removes redundant features and clearly outperforms feature selection using a genetic algorithm (FSGA), feature selection using differential evolution (FSDE), and feature selection using a particle swarm optimization algorithm (FSPSO). In [39], the authors propose the G-Forest algorithm, which is tested on two datasets of two types of cancer, leukemia and diffuse large B-cell lymphoma (DLBCL). The results report that G-Forest enhances accuracy by up to 14% and reduces costs by up to 56% on average compared with other methods. In [40], an optimization algorithm called the elephant search algorithm (ESA) is suggested to select the best gene expressions. Firefly search (FFS) is also employed to find out the efficiency of this method in the feature selection process. In addition, a stochastic gradient descent-based deep neural network (DL) with a softmax activation function is applied to the reduced features to improve the classification. The experiments are performed on ten common cancer microarray datasets, obtained from the UCI machine learning repository. The authors state that the proposed method performs on par with the best methods presented in the literature.

In [41], the authors present a hybrid algorithm called SARA, implemented with simulated annealing (SA) and the Rao algorithm (RA), for selecting the optimal subset of genes and classifying cancer. The presented method consists of two stages: the first stage uses mRMR for feature preselection, while the second stage uses SARA as a wrapper method. Furthermore, a log-sigmoidal function is introduced as an encoding scheme to convert the continuous version of the simulated annealing-Rao algorithm (SARA) into a discrete optimization algorithm. The proposed method is implemented on three binary-class and four multi-class datasets. The authors report that this method selects highly discriminating genes with high classification accuracy. In particular, for the small round blue cell tumor (SRBCT) dataset, it achieves a high classification accuracy of 99.81% using only five informative genes. In [42], the authors propose a cuckoo search method guided by a memory-based mechanism that stores the most informative features determined by the best solutions. The proposed algorithm is compared with the original algorithm on twelve microarray datasets. The experimental results indicate that the proposed algorithm outperforms the original and contemporary algorithms. In [43], the authors provide a feature selection method based on the artificial electric field algorithm (AEFA), called FSAEFA. The presented method is evaluated and compared with some other feature selection methods, namely, FSDE, FAGA, and FSPSO, on ten datasets. The authors report that the proposed method is superior to the other methods.

Based on the mentioned studies, it can be seen that there is no agreement on which ML model is best for predicting cancer. Obviously, this depends on several factors, such as the training dataset, applied methodology, selected features, and model parameters. The above studies also show that no single feature selection approach is best in all circumstances. Thus, one has to experiment with the prediction situation at hand, and that is what is done in this article. In particular, extensive experiments are conducted to determine which ML model achieves the best accuracy in predicting cancer using the fewest possible features. Therefore, a brief look at each model used in this article is in order.

The SVM model is used for both classification and regression problems [44]. SVM creates a decision boundary (hyperplane) in an N-dimensional space (where N is the number of features) to separate data from different classes. The main goal is to maximize the distance between this hyperplane and the data examples closest to it (the support vectors) [45]. SVM is frequently applied in bioinformatics and medical analysis, especially for gene classification [46]. The DT model creates a training path to predict classes by deducing learned decision rules from the training dataset; it offers a simple visualization of results [47]. The RF model is categorized as an ensemble ML model, as it consists of a combination of DT models. Each DT is created from a random vector sampled independently from the input vectors, casting at the end a vote for the most likely class the input vector belongs to [48]. The KNN model is the simplest supervised ML model. It is utilized for both classification and regression problems. It depends on the value of K, the number of predefined nearest neighbors. To classify a test object, the distances to neighboring objects are measured, and then, the majority class among the K nearest neighbors is assigned to the test object [49].
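As a quick, hedged illustration (not the authors' exact configuration), the four models can be instantiated in scikit-learn as follows; all hyperparameter values shown are library defaults or placeholders that grid search would later tune.

```python
# Illustrative instantiation of the four classifiers discussed above.
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

models = {
    "SVM": SVC(kernel="linear", C=1.0),              # maximum-margin hyperplane
    "DT": DecisionTreeClassifier(),                  # learned decision rules
    "RF": RandomForestClassifier(n_estimators=100),  # ensemble of DTs, majority vote
    "KNN": KNeighborsClassifier(n_neighbors=5),      # majority class among K neighbors
}
```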

In this article, a new two-stage hybrid feature selection algorithm is proposed. In the first stage, a robust overall ranker is constructed to combine the results of three different filter methods, namely, chi-squared, F-statistic, and MI, as a preprocessing stage to improve the feature selection procedure. In the second stage, the feature selection procedure is implemented using a modified wrapper-based sequential forward selection technique to select the most predictive and informative genes that can help classify cancer accurately. SVM, DT, RF, and KNN classifiers are utilized in the selection of the optimal feature subset. Extensive experiments are conducted on four different cancerous microarray datasets, namely, leukemia, ovarian cancer, SRBCT, and lung cancer, to demonstrate the effectiveness and efficiency of the proposed method. The proposed system outperforms state-of-the-art systems in terms of the number of selected genes and classification accuracy.

The rest of the article is structured as follows. Section 2 describes the proposed cancer prediction system. Section 3 details the experimental conditions, results obtained, and comparisons with other state-of-the-art methods. Finally, Section 4 presents the conclusions and future work.

2. Materials and Methods

This section explains the conceptual structure of the proposed cancer prediction system. As shown in Figure 1, the system is composed of two successive phases: the data preprocessing phase and the feature selection and classification phase. In the data preprocessing phase, the feature values are normalized and the features are ranked according to their importance to make them suitable for the feature selection procedure. In the feature selection and classification phase, the models are trained and tested to identify the fewest number of features that achieve the highest accuracy. Moreover, features that reduce the performance of the ML model are excluded.

2.1. Data Preprocessing Phase

The data preprocessing phase is essential for cleaning the data and making it suitable for building the ML model, and this will increase the accuracy and efficiency of the model. The data preprocessing phase includes the following two processes.

2.1.1. Data Normalization

Each feature value $x$ of a feature column $X$ is normalized using the min-max technique. Consequently, each feature value is scaled according to the following equation to a value $x'$:

$$x' = \frac{x - \min(X)}{\max(X) - \min(X)}, \tag{1}$$

where $\min(X)$ and $\max(X)$ are the minimum and maximum values of the feature column $X$.
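A short NumPy sketch of equation (1), assuming the data are held in a 2D array with one column per feature:

```python
import numpy as np

def min_max_normalize(X):
    """Scale each feature column of X to [0, 1] using min-max normalization."""
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    # Guard against constant columns, where max(X) equals min(X).
    ranges = np.where(maxs - mins == 0, 1, maxs - mins)
    return (X - mins) / ranges
```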

2.1.2. Feature Ranking

The main goal of this step is to order the features according to their importance. So, filter-based feature evaluation methods are employed to evaluate the significance of each feature. In particular, three filter methods are applied: chi-squared, F-statistic, and MI [50].

(i) Filter Methods. Chi-Squared ($\chi^2$): this statistic examines the dependence between two random variables, in our case a feature and the target (decision) variable. To calculate the chi-squared statistic, the first step is to create from the dataset a contingency table, having $r$ rows, where $r$ is the number of distinct values of the feature, and $c$ columns, where $c$ is the number of distinct classes of the target. At each entry $(i, j)$ of the table, we place both the observed frequency $O_{ij}$ and the expected frequency $E_{ij}$ for feature value $i$ and class $j$. The observed frequency $O_{ij}$ is the number of times value $i$ appears with class $j$ in the dataset. The expected frequency $E_{ij}$ is the fraction of times value $i$ appears as a value for the feature, multiplied by the number of cases of class $j$. Now, the chi-squared statistic can be computed as follows [51]:

$$\chi^2 = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{\left(O_{ij} - E_{ij}\right)^2}{E_{ij}}. \tag{2}$$

A zero chi-squared value means that the two variables are entirely independent.

F-statistic: The F-statistic or F-test is a family of statistical tests that calculates the ratio between variances. A larger value means the feature is more discriminative. For a dataset of two classes, positive and negative, the F-statistic of the $j$th feature can be calculated using the following equation [52]:

$$F_j = \frac{n^+\left(\bar{x}_j^{+} - \bar{x}_j\right)^2 + n^-\left(\bar{x}_j^{-} - \bar{x}_j\right)^2}{\frac{1}{n-2}\left[\sum_{i=1}^{n^+}\left(x_{i,j}^{+} - \bar{x}_j^{+}\right)^2 + \sum_{i=1}^{n^-}\left(x_{i,j}^{-} - \bar{x}_j^{-}\right)^2\right]}, \tag{3}$$

where $n$ is the total number of cases, $n^+$ is the number of positive cases, $n^-$ is the number of negative cases, $\bar{x}_j$ is the average of the values of the $j$th feature, $\bar{x}_j^{+}$ is the average of the values of the $j$th feature for the positive cases, $\bar{x}_j^{-}$ is the average of the values of the $j$th feature for the negative cases, $x_{i,j}^{+}$ is the value of the $j$th feature of the $i$th positive case, and $x_{i,j}^{-}$ is the value of the $j$th feature of the $i$th negative case. We can see in the above equation that the numerator measures how far the feature average for each class is from the feature average for the dataset as a whole, whereas the denominator captures the variances within both classes. Clearly, the fraction gets bigger as the numerator gets bigger and the denominator gets smaller.

Mutual Information (MI): The mutual information, $I(X;Y)$, is calculated between two random variables, $X$ and $Y$, and represents the information they share, or more specifically, the reduction in uncertainty for one given a known value of the other. The MI between discrete random variables $X$ and $Y$, with values over spaces $\mathcal{X}$ and $\mathcal{Y}$, respectively, can be calculated as follows [53]:

$$I(X;Y) = \sum_{x \in \mathcal{X}}\sum_{y \in \mathcal{Y}} p(x, y)\log\frac{p(x, y)}{p(x)\,p(y)}, \tag{4}$$

where $p(x, y)$ is the joint probability distribution of $X$ and $Y$, and $p(x)$ and $p(y)$ are the marginal probability distributions of $X$ and $Y$, respectively. If the log is taken to base 2, the units are bits. A zero MI means that the variables are completely unrelated, because if $X$ and $Y$ are independent, then $p(x, y) = p(x)p(y)$, so that $\frac{p(x, y)}{p(x)p(y)} = 1$, whose log is 0.
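As a sketch under the same assumptions, the three filter scores can be computed per feature with scikit-learn; note that chi2 requires nonnegative inputs, which the min-max normalization of Section 2.1.1 guarantees.

```python
# Scoring every feature with the three filter methods (illustrative sketch).
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

def filter_scores(X, y):
    chi2_scores, _ = chi2(X, y)            # chi-squared statistic per feature
    f_scores, _ = f_classif(X, y)          # ANOVA F-statistic per feature
    mi_scores = mutual_info_classif(X, y)  # estimated mutual information per feature
    return chi2_scores, f_scores, mi_scores
```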

(ii) Overall Ranking Algorithm. In the proposed work, the feature ranking process is performed by gathering the separate results of the mentioned filters together. The complete feature ranking process is shown in Figure 2 and is carried out through five detailed steps as follows.
(1) A feature score table (FST) of $m$ rows and 4 columns is constructed, where $m$ is the number of dataset features. The first column is assigned to the feature names, and the next three columns are assigned to their evaluation values by the three filters: chi-squared, F-statistic, and MI.
(2) A rank table (RT) with $m$ rows and 4 columns is created. The first column is assigned to the feature names. Each value in the next three columns of the RT is deduced from its corresponding value in the FST as follows. The score value of each feature in the FST is replaced by a corresponding rank value in the RT. The value 1 represents the highest rank and is assigned to the feature with the highest score in each of the filter columns of the FST. The rank value is increased by 1 for the feature score directly below the previous score in each of the filter columns of the FST. The previous step is repeated until reaching the lowest rank, with value $m$.
(3) In the RT, outlier (extreme) rank values are detected as follows. In the row of each feature, the highest rank value of the three filters is examined. If the highest rank value is less than or equal to twice the sum of the other two, all rank values remain the same. Otherwise, if one of the rank values is greater than twice the sum of the other two, it is an outlier and needs moderation. The required moderation is performed by replacing the outlier value with twice the sum of the other two rank values. For example, if the largest rank value in a feature's row is 8 and the other two rank values sum to, say, 3, then the 8 is considered an outlier, since 8 > 2 × 3 = 6; the row is therefore modified by replacing the 8 with 6.
(4) An overall rank table (ORT) with 5 columns is constructed, and the first column is filled with the feature names. Next, the following procedures are performed. The next three columns are filled with the rank values of the 3 filters after moderation. For each feature, the overall rank (OR) value is deduced by summing the three rank values of the feature's row into a single value in the fifth column.
(5) The ORT is sorted ascendingly, using the OR values of the fifth column as a key. The features are thus ordered from the most important, at the top, to the least important, at the bottom.

Algorithm 1 illustrates the pseudo-code of the overall ranking algorithm.

Input: feature set F = {f1, f2, ..., fm} //m is the number of features
Output: ranked features F_ranked.
(1) Create a Feature Score Table (FST) of m rows and 4 columns, holding the evaluation scores given by the filter-based methods (chi-squared, F-statistic, and MI) to each feature
(2) Create from the FST a Rank Table (RT), replacing each score in the FST by its rank among the other scores
(3) Moderate the outliers in the RT as follows: if one row entry is larger than twice the sum of the other two, replace it by twice the sum of the other two; if it is less than or equal to twice the sum of the other two, keep it the same
(4) Create an Overall Rank Table (ORT) from the RT, appending an Overall Rank (OR) column
(5) Add up the entries of each row and place the sum in the OR column
(6) Sort the ORT ascendingly, using the OR column as a key
(7) Return the ranked features F_ranked
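A compact NumPy rendering of Algorithm 1 might look as follows; scipy.stats.rankdata performs the score-to-rank conversion (rank 1 for the highest score), and the moderation loop implements step 3 above. This is an illustrative sketch, not the article's actual implementation.

```python
import numpy as np
from scipy.stats import rankdata

def overall_ranking(chi2_scores, f_scores, mi_scores):
    # FST columns -> RT columns: rank 1 is assigned to the highest score.
    rt = np.column_stack([rankdata(-np.asarray(s), method="ordinal")
                          for s in (chi2_scores, f_scores, mi_scores)])
    # Moderate outliers: any rank greater than twice the sum of the other two
    # values in its row is replaced by twice that sum.
    for row in rt:
        for k in range(3):
            others = row.sum() - row[k]
            if row[k] > 2 * others:
                row[k] = 2 * others
    overall = rt.sum(axis=1)    # the OR column of the ORT
    return np.argsort(overall)  # feature indices, most important first
```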
2.2. Feature Selection and Classification Phase

In this section, the two processes of feature selection and classification are explained.

2.2.1. Feature Selection

Feature selection is a very crucial step because the inclusion of inconsequential and redundant features significantly degrades model performance. By selecting relevant features from the raw dataset, the learning model is improved in several ways: (i) avoiding learning from noise and overfitting, (ii) improving accuracy, and (iii) reducing training time. In addition, working with more informative features contributes to early diagnosis. As mentioned in Section 1, there are four types of feature selection methods, namely, filter-based, wrapper-based, embedded-based, and hybrid-based methods.

In this article, a modified wrapper-based sequential forward selection technique is presented. The selection technique starts by adding the feature with the highest overall rank to an empty subset and then measures the model's performance. Next, a set of successive iterations is performed. In each iteration, only one feature is added to the subset, and the performance is measured. If the newly added feature improves the performance of the model, it remains in the subset; otherwise, it is removed. Likewise, the remaining features are added and evaluated, one by one, against the features kept in the subset. After the last iteration, the features kept in the subset are those that optimize the classification accuracy.

2.2.2. Classification

The classification technique is applied to categorize data into a set of classes using supervised ML techniques. There is a variety of classification techniques for classifying microarray datasets. Based on the recent literature on cancer prediction (as summarized in Section 1), the present work implements four prediction models, namely, SVM, DT, RF, and KNN.

To optimize and refine the performance of the proposed models, hyperparameter tuning is implemented by passing various parameters into the model using the grid search method, which takes a set of possible values for each hyperparameter, evaluates the performance for each combination of them, and finally selects the combination that achieves the best performance.

The k-fold cross-validation approach is also utilized to get the best performance out of the models. In the present work, k = 10 is used, so the dataset is split into 10 folds of approximately the same size. Then, nine folds are utilized for training and only one fold for testing. This process is repeated until each of the 10 folds has been used as a testing set, ensuring that every case in the dataset has been classified by the model. For each fold, the performance of the model is calculated, and eventually, the average performance over the 10 folds is obtained.

In addition, accuracy is used as the key metric for evaluating the performance of the ML models. The accuracy is deduced as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \tag{5}$$

where TP (true positive) is the number of cases belonging to the class and correctly labeled as such, FP (false positive) is the number of cases not belonging to the class but incorrectly labeled as belonging to it, TN (true negative) is the number of cases not belonging to the class and correctly labeled as such, and FN (false negative) is the number of cases belonging to the class but incorrectly labeled as not.
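The tuning-plus-validation procedure just described can be sketched with scikit-learn as follows; the parameter grid is a placeholder, not the grid actually searched in the article.

```python
# Hyperparameter tuning (grid search) nested inside 10-fold cross-validation.
import numpy as np
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}  # placeholder grid
tuned_model = GridSearchCV(SVC(), param_grid, scoring="accuracy", cv=10)

def evaluate(X_subset, y):
    """Average 10-fold accuracy of the tuned model on a candidate feature subset."""
    return np.mean(cross_val_score(tuned_model, X_subset, y, cv=10, scoring="accuracy"))
```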

(i) Feature Selection and Classification Algorithm. After ordering the features from the most significant to the least significant based on the OR values in the ORT, the feature selection procedure is performed. The complete feature selection and classification process is shown in Figure 3 and is carried out through the following steps.
(1) The most important feature, found in the first row of the ORT, is added to an empty feature subset.
(2) 10-fold cross-validation is applied to the feature subset, and the hyperparameters are tuned using the grid search technique.
(3) An ML model is built using this subset of features.
(4) The accuracy of the ML model is calculated and recorded as the previous accuracy.
(5) The feature in the next row of the ORT is appended to the feature subset.
(6) 10-fold cross-validation is applied to the feature subset, and the hyperparameters are tuned using the grid search technique.
(7) An ML model is built using this subset of features.
(8) The accuracy of the ML model is calculated and recorded as the current accuracy.
(9) The current accuracy is compared with the previous accuracy as follows. If the current accuracy is less than or equal to the previous accuracy, the last added feature is excluded from the feature subset. Otherwise, if the current accuracy is greater than the previous accuracy, the previous accuracy is set equal to the current accuracy.
(10) The steps starting from step 5 are repeated until reaching the end of the ORT.
(11) The optimum feature subset and its accuracy (the previous accuracy) are returned.

Algorithm 2 illustrates the pseudo-code of the feature selection and the classification process.

Input: C (set of classifiers), F_ranked = {f1, f2, ..., fm} (ranked features).
Output: S (selected feature subset), Acc0 (best accuracy).
(1) For each classifier c in C do
(2)  Set S = {f1}
(3)  Use 10-fold cross-validation and tune hyperparameters
(4)  Build an ML model using the feature subset S
(5)  Calculate the accuracy Acc0 of the model
(6)  For i = 2 to m do
(7)   Append the feature fi to S
(8)   Use 10-fold cross-validation and tune hyperparameters
(9)   Build an ML model using the feature subset S
(10)   Calculate the accuracy Acc1 of the model
(11)   If Acc1 ≤ Acc0 then
(12)    Exclude the feature fi from S
(13)   Else
(14)    Set Acc0 = Acc1
(15)   End if
(16)  End for
(17)  Return S, Acc0
(18) End for
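For concreteness, a minimal Python rendering of Algorithm 2 for a single classifier is given below; it reuses the hypothetical evaluate helper sketched in Section 2.2.2 and assumes ranked_features holds the feature indices in ORT order.

```python
def select_features(evaluate, X, y, ranked_features):
    """Sketch of Algorithm 2: greedy selection over pre-ranked features."""
    subset = [ranked_features[0]]         # start with the top-ranked feature
    acc0 = evaluate(X[:, subset], y)      # the previous accuracy
    for f in ranked_features[1:]:
        subset.append(f)
        acc1 = evaluate(X[:, subset], y)  # the current accuracy
        if acc1 <= acc0:
            subset.pop()                  # the feature did not help; drop it
        else:
            acc0 = acc1                   # keep the feature and update accuracy
    return subset, acc0
```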

3. Results and Discussion

The proposed system was tested by performing extensive experiments on four publicly available microarray datasets [54] shown in Table 1. The system, based on Apache Spark, was written in Python. Some API libraries integrated with Spark were used, such as Spark's MLlib, to implement the feature selection and classification algorithm; Python libraries were used to implement the feature ranking algorithm. The proposed system was deployed on a Spark cluster consisting of one master node and two slave nodes. Every node had the same physical environment, i.e., an Intel(R) Core(TM) i7-4510U CPU @ 2.00 GHz, 2.60 GHz, and 8 GB of memory.

It should be noted that Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. It can work with structured data such as CSV files and unstructured data such as JSON files [55]. Spark provides high-level APIs in Scala, Java, Python, and R for libraries such as MLlib (Machine Learning Library) for ML, Spark Streaming for stream processing, GraphX for graph analysis, and Spark SQL for structured data processing [56]. MLlib implements ML prediction models, hyperparameter tuning, and cross-validation. It is divided into two main packages: spark.mllib and spark.ml. spark.mllib is built on top of RDDs, and spark.ml is built on top of DataFrames. Both packages come with a variety of common ML tasks such as featurization, transformations, model training, model evaluation, and optimization. In the present work, we use the spark.ml package because it provides the pipeline API for building, debugging, and tuning ML pipelines, whereas spark.mllib includes packages for linear algebra, statistics, and other basic utilities for ML. DataFrames can automatically distinguish between numerical and categorical features and can also automatically optimize both storage and computation [57].
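As a hedged illustration of the spark.ml APIs mentioned above (not the article's actual code), a pipeline with grid search and 10-fold cross-validation can be expressed as follows; the DataFrame df, its column names, and the parameter grid are assumptions.

```python
# Illustrative spark.ml pipeline with grid search and 10-fold cross-validation.
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[assembler, rf])

grid = ParamGridBuilder().addGrid(rf.numTrees, [50, 100]).build()
evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=10)
# model = cv.fit(df)  # df is an assumed DataFrame with columns f1, f2, f3, label
```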

The methods outlined in Section 2 were followed to build the model. First, the filter-based feature evaluation methods were used to order the features according to their importance; then, the ML models were trained and tested. Finally, the highest-performing model using the fewest features, obtained through the modified wrapper-based sequential forward selection technique, was selected. The results of the feature ranking, feature selection, and classification processes are described in this section, together with a comparison of the proposed method against twelve other methods in terms of the number of selected features and classification accuracy.

3.1. Feature Ranking Results

Tables 2 and 3 display the metric scores obtained for only twenty features of two microarray datasets: leukemia and ovarian cancer, respectively. Scores were obtained by the three metrics, chi-squared, F-statistic, and MI, applying equations (2), (3), and (4), respectively. It can be observed that the same feature is ranked differently by each metric. For example, for the leukemia dataset, as shown in Table 2, the chi-squared method sees "M27891_at" as the most important feature, a view not shared by the F-statistic and MI: the F-statistic sees "X95735_at" as the most important feature, whereas MI sees "M23197_at" as the most important. For the ovarian dataset, as shown in Table 3, both the chi-squared method and the F-statistic see "MZ245.24466" as the most important feature, while MI sees "MZ244.95245" as the most important; the situation is similar for the SRBCT and lung cancer datasets. Because of this variation, the approach described in Section 2 is used to find an overall rank for each feature based on the collective view of the three metrics.

For the feature ranking process, after creating a feature score table (FST) for each dataset, a rank table (RT) is created for each of them, as shown in Tables 4 and 5. These tables show the ranks of only twenty features of the leukemia and ovarian cancer datasets, respectively; here, each metric value is replaced by its rank among its peers. For the leukemia and ovarian cancer datasets, the overall ranks of the top twenty features, after moderating the outliers, are shown in Tables 6 and 7, respectively. In these tables, a moderated outlier is set in bold. The same holds for the SRBCT and lung cancer datasets.

3.2. Feature Selection and Classification Results

Four ML models were explored for cancer prediction, namely, SVM, DT, RF, and KNN. These models in particular were chosen based on a review of recent research on cancer prediction, as summarized in Section 1. For each dataset, to evaluate the performance of the four candidate ML models, several experiments were carried out: one using all features and the others using the features ranked by their overall rank (the proposed approach), to determine which feature subset achieves the best accuracy. 10-fold cross-validation was used to evaluate each of the four ML models; this means that, of all cases of the dataset, 90% were used for training and 10% for testing. From the test results, for each model and for each fold, the accuracy metric was calculated using equation (5). The accuracy results were then averaged over the 10 folds to give a single number for each model, indicating its performance. The performance of the models using the full feature set and the features selected by the proposed wrapper method is presented in the following subsections.

3.2.1. Performance of the Models Using All Features

In this experiment, for each dataset, the performance of the four ML models when trained and tested on all features was measured.

As can be seen in Table 8, for the leukemia dataset, both RF and SVM models achieve the best average accuracy at 98.57%. For the ovarian cancer dataset, the SVM model outperforms the other models by achieving the highest average accuracy at 100%. For the SRBCT dataset, both RF and SVM models register the best average accuracy at 100%. While for the lung cancer dataset, the RF model achieves the best average accuracy at 99.57%.

3.2.2. Performance of the Models Using the Features Selected by the Proposed Wrapper-Based Sequential Forward Selection Method

In this experiment, for each dataset, the performance for the four ML models was measured when they were trained and tested on subsets of features that were selected according to the proposed wrapper method.

For each dataset, the best subset of features that achieves the best accuracy is shown in Table 9.

As can be seen in Table 8, for the leukemia dataset, both SVM and KNN models register the best average accuracy at 100% using only 5 features. For the ovarian cancer dataset, the SVM model outperforms the other models by achieving the best average accuracy at 100% using only 6 features. For the SRBCT dataset, the SVM model also outperforms the other models by achieving the best average accuracy at 100% using only 8 features. For the lung cancer dataset, the SVM model achieves the highest average accuracy at 99.57% using 19 features.

3.3. Comparison with Other Algorithms

Table 10 reports the comparative results of the four microarray datasets introduced above. In particular, the results of the proposed algorithm are compared with those of twelve algorithms in the literature. From the comparison, it can be easily realized that the proposed algorithm is promising in terms of classification accuracy and number of selected features for all used datasets. In particular, an accuracy of at least 99.57% is obtained throughout.

4. Conclusions

This article presents a robust machine learning (ML)-based algorithm to diagnose different cancer diseases using microarray datasets. The algorithm effectively eliminates irrelevant and redundant genes, and its output has high stability and classification accuracy. When compared with similar algorithms, the proposed algorithm shows clear superiority: it selects fewer genes and yields a higher level of accuracy. Furthermore, the time and storage costs of the algorithm are very appealing, making it well suited for big data.

An interesting future extension would be to adapt and verify the proposed algorithm on more realistic and benchmark microarray datasets of bigger sizes. Also, implementation using Hadoop/MapReduce platforms could be explored. In particular, to make the algorithm faster and more efficient when dealing with high-dimensional data, we intend to develop a parallel version to be run on cluster/cloud computing facilities.

Data Availability

The microarray datasets used to support the findings of this study can be accessed at https://csse.szu.edu.cn/staff/zhuzx/Datasets.html.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.