Research Article | Open Access
mRMR-ABC: A Hybrid Gene Selection Algorithm for Cancer Classification Using Microarray Gene Expression Profiling
Artificial bee colony (ABC) is a relatively recent swarm intelligence optimization approach. In this paper, we present the first attempt at applying the ABC algorithm to the analysis of microarray gene expression profiles. In addition, we combine a feature selection algorithm, minimum redundancy maximum relevance (mRMR), with the ABC algorithm to form a hybrid, mRMR-ABC, that selects informative genes from a microarray profile. The new approach uses a support vector machine (SVM) to measure the classification accuracy of the selected genes. We evaluate the performance of the proposed mRMR-ABC algorithm by conducting extensive experiments on six binary and multiclass gene expression microarray datasets. Furthermore, we compare our proposed mRMR-ABC algorithm with previously known techniques. We reimplemented two of these techniques for the sake of a fair comparison using the same parameters. These two techniques are mRMR combined with a genetic algorithm (mRMR-GA) and mRMR combined with a particle swarm optimization algorithm (mRMR-PSO). The experimental results show that the proposed mRMR-ABC algorithm achieves accurate classification performance using a small number of predictive genes on both binary and multiclass datasets when compared to previously suggested methods. This suggests that mRMR-ABC is a promising approach for solving gene selection and cancer classification problems.
Microarray gene expression profiling is applied to compare and determine gene expression levels and patterns for different cell types or tissue samples in a single experiment. Thus, we can determine the more informative genes that are responsible for causing a specific disease or cancer [1, 2]. Gene expression measures the degree of activity of a gene in a given tissue of the body. Individual genes can be active (switched on) or inactive (switched off) depending on the needs and circumstances of the body's cells under a particular condition. Therefore, abnormalities in gene expression levels may cause cell death or uncontrolled growth, as in cancer [3, 4].
However, microarray datasets suffer from the curse of dimensionality, a limited number of samples, and irrelevant and noisy genes, all of which make the classification task for a given sample more challenging [1, 5, 6]. The gene selection process aims to select the minimum number of relevant and meaningful genes that are most predictive in the classification process. This maximizes the classifier's ability to classify samples accurately. Lin et al. showed that feature selection is effective and can significantly improve the overall classification performance. The optimal feature selection problem is considered NP-hard. Therefore, heuristic approaches such as bioinspired evolutionary algorithms are a practical way to solve this problem.
The artificial bee colony algorithm, introduced in 2005 by Karaboga, is one of the bioinspired evolutionary techniques that have been employed to identify an optimal solution in different optimization problems. ABC is inspired by the behaviour of bees when looking for a good food (honey) source. In the current literature, there are many comparative studies of the performance of ABC algorithms against other bioinspired evolutionary algorithms [10–12], and the experimental results showed that the ABC algorithm is competitive. In addition, ABC has been used to solve many numerical optimization problems because it is a simple technique, is easy to implement, and has few parameters. Therefore, in this paper, we propose the application of the ABC algorithm to select the predictive and informative genes from microarray gene expression profiles.
Bioinspired evolutionary techniques are more applicable and accurate than the wrapper gene selection method because they are able to search for and find optimal or near-optimal solutions in high-dimensional solution spaces. Furthermore, they allow searching the solution space by considering more than one attribute at the same time. However, like other evolutionary approaches, ABC faces some challenging issues, especially in computational efficiency, when it is applied to complex and high-dimensional data such as microarray datasets. Therefore, to improve the performance of the ABC algorithm on high-dimensional datasets, we propose adding a feature selection algorithm, minimum redundancy maximum relevance (mRMR), as a preprocessing stage. We combine it with the ABC algorithm, as mRMR-ABC, in order to select informative genes from cancer microarray profiles. This hybrid gene selection offers a good balance between filter and wrapper gene selection methods: it is computationally efficient, as filter methods are, and it models feature dependencies, as wrapper methods do.
In this paper, we measure the efficiency of gene selection techniques using a support vector machine (SVM) as a classifier. SVMs have displayed substantial benefits when compared to other classification approaches. It is often difficult to find a linear classifier that separates the different classes in a dataset. An SVM addresses this difficulty by mapping the input space into a high-dimensional space, where it then finds a linear classification model that separates the input data with a maximum margin hyperplane. Compared with other machine learning classification methods, SVM is more effective in high-dimensional spaces.
In the literature, there are several algorithms for gene selection and cancer classification that use microarray data. However, to our knowledge, this is the first attempt at applying an ABC-based algorithm as a gene selection method for cancer classification problems using a microarray gene expression profile. The proposed algorithm is tested using six binary and multiclass gene expression microarray datasets and is also compared with the original ABC, mRMR combined with a genetic algorithm (mRMR-GA), and mRMR combined with a particle swarm optimization (mRMR-PSO) algorithm. In addition, we compared it with other related algorithms that have been published recently. The experimental results show improvements in both the number of selected informative genes and cancer classification accuracy.
The rest of this paper is organized as follows. Section 2 provides a brief description of the different approaches used in our work, namely, the ABC, mRMR, and SVM algorithms. We also describe how the ABC and mRMR algorithms are applied to gene selection in microarray data profiles. The proposed hybrid mRMR-ABC algorithm is explained in Section 3. Section 4 outlines the experimental setup and provides results. Finally, Section 5 concludes our paper.
In this section, first we briefly introduce both artificial bee colony (ABC) and minimum redundancy maximum relevance (mRMR) filter algorithms and show how we can apply each of them for gene selection in microarray gene expression profiles. Then, general background about support vector machine (SVM) is presented.
2.1. Artificial Bee Colony (ABC) Algorithm for Microarray Gene Expression Profile
ABC is a recent swarm intelligence algorithm that was invented in 2005 by Karaboga. It was inspired by the social life of bees and is used to solve optimization problems. The ABC algorithm has been widely applied in different optimization problems such as protein tertiary structures and artificial neural networks. To the best of our knowledge, the ABC algorithm has not been applied before to select the predictive and informative genes from cancer microarray gene profiles. The ABC algorithm is a metaheuristic evolutionary algorithm that simulates the search for food in a group of bees. The bees can be distributed at different distances (some of them quite far) in order to utilize the food resources. The ABC model consists of three main components: food sources, employed bees, and unemployed bees. All concepts are defined below along with the proposed variations when applied to gene selection in microarray data analysis.
(i) Food Sources. The forager bee measures the quality of several food sources in order to select the best ones. The quality of a food source is evaluated by its distance to the hive, its energy, the taste of its nectar, and the ease or difficulty of extracting the energy.
(ii) Employed Bees. The number of employed bees, the honeybees that have found a food source, is equal to the number of food sources. An employed bee exploits a specific food source and collects information about it. Subsequently, she shares information such as the distance, the direction, and the quality of the food source with the other bees waiting in the hive. If the food source is exhausted, the employed bee becomes a scout bee.
(iii) Unemployed Bees. A bee that does not yet exploit a food source is called an unemployed bee. There are two types of unemployed bees: scouts, who conduct random searches of the environment, and onlookers, who stay in the nest waiting for the information shared by the employed bees. The most important process for knowledge collection in the ABC algorithm is the exchange of information between employed and unemployed bees.
In this paper, we made some changes to the ABC algorithm representation in order to use it to solve the microarray gene selection problem. The representation of the solution space (foods) for the ABC algorithm when applied to a microarray dataset is illustrated in Figure 1. The ABC algorithm first produces an initial random population of SN solutions, where SN refers to the total number of food sources. When applying an ABC algorithm to gene selection for microarray data analysis, as illustrated in Figure 1, each solution represents a group of numbers. These numbers are the indices of genes in the microarray gene expression profile (i.e., the position of the food source). This is shown as $x_i$, where $i$ represents a particular solution ($i = 1, 2, \ldots, SN$), and each solution is a $D$-dimensional vector $x_i = (x_{i1}, x_{i2}, \ldots, x_{iD})$, where $D$ represents the number of informative genes to be optimized in each solution. Each cell, $x_{ij}$, represents the corresponding gene index.
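As a rough illustration of this representation, the following sketch builds SN random food sources, each holding D gene indices. The function name, the parameter values, and the choice to keep indices distinct within a solution are assumptions made for the example, not details taken from the paper.

```python
import random

def init_food_sources(n_genes, sn, d, seed=None):
    """Return SN random solutions, each a list of D gene indices."""
    rng = random.Random(seed)
    # Assumption: a gene index appears at most once per solution,
    # so each solution is drawn without replacement.
    return [rng.sample(range(n_genes), d) for _ in range(sn)]

# Illustrative sizes only: 2000 genes, 40 food sources, 10 genes per solution.
foods = init_food_sources(n_genes=2000, sn=40, d=10, seed=0)
```

Each row of `foods` plays the role of one solution $x_i$, and each entry plays the role of a cell $x_{ij}$.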
After the initialization of random solutions (populations), the ABC algorithm starts searching for the optimal solution. In the ABC algorithm, each cycle of the search consists of three phases: the employed bees phase, in which employed bees are sent to their food sources to evaluate the amount of nectar contained in each source; the onlookers phase, in which, after receiving the nectar information for the food sources, the onlookers select the food source regions and evaluate the amount of nectar in the food sources; and the scout bees phase, in which the scout bees are designated as such. The employed bees make up the first half of the colony, while the onlookers make up the second half. Each food source has only one employed bee. During the search of the solution space, the onlookers and employed bees handle the exploitation process, while the scout bees focus on the exploration process. In the next subsections, we describe each of the phases and how we can apply them to microarray gene expression analysis.
2.1.1. Employed Bee Phase
In this phase, the employed bees search around the solutions (food sources) at $x_i$ and look for better gene indices at the new location $v_i$. The new gene index is identified by the following equation:

$$v_{ij} = x_{ij} + R_{ij}\,(x_{ij} - x_{kj}),$$

where $v_i = (v_{i1}, \ldots, v_{iD})$ is the new gene indices (location vector of the bees), $x_i = (x_{i1}, \ldots, x_{iD})$ is the current gene indices (location vector of the $i$th bee), $k$ ($k \neq i$) is a random index in $[1, SN]$, and SN is the number of solutions (artificial bees). $R_{ij}$ is a random number uniformly distributed in $[-1, 1]$. The random selection of numbers from the microarray gene index is done by the following equation:

$$x_{ij} = L_j + \mathrm{rand}(0, 1) \times (U_j - L_j),$$

where $U_j$ and $L_j$ are the upper limit and the lower limit of the $x_{ij}$ variable, respectively, $j = 1, 2, \ldots, D$, and $\mathrm{rand}(0, 1)$ is a random number function in $(0, 1)$. When the new gene index is identified, its quality must be calculated based on the fitness function. In our problem, the fitness value is determined according to the classification accuracy of the solution using an SVM classifier. When the new fitness value is better than the old fitness value, the bee changes its solution to the new solution; otherwise it stays with its current solution.
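A minimal sketch of this phase is given below, assuming the standard ABC neighbour update $v_{ij} = x_{ij} + R_{ij}(x_{ij} - x_{kj})$ applied to one randomly chosen position per move. Rounding the result to an integer and clamping it to the valid index range are assumptions added here so that the result is still a gene index; the function names are hypothetical.

```python
import random

def neighbor_solution(foods, i, n_genes, rng):
    """Perturb one position j of solution i: v_ij = x_ij + R_ij * (x_ij - x_kj)."""
    sn, d = len(foods), len(foods[i])
    j = rng.randrange(d)                               # position to perturb
    k = rng.choice([s for s in range(sn) if s != i])   # partner solution, k != i
    r = rng.uniform(-1.0, 1.0)                         # R_ij in [-1, 1]
    value = foods[i][j] + r * (foods[i][j] - foods[k][j])
    candidate = list(foods[i])
    # Assumption: round to the nearest integer and clamp to [0, n_genes - 1].
    candidate[j] = min(max(int(round(value)), 0), n_genes - 1)
    return candidate

def greedy_select(current, current_fit, candidate, candidate_fit):
    """Keep the candidate only if its fitness is strictly better."""
    if candidate_fit > current_fit:
        return candidate, candidate_fit
    return current, current_fit
```

In the setting described above, the fitness passed to `greedy_select` would be the SVM classification accuracy of each gene subset.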
2.1.2. Onlooker Bee Phase
After all employed bees complete the search for the best solutions, the information is shared with the onlooker bees. An onlooker bee selects genes according to their probability value, following the roulette wheel selection strategy used in genetic algorithms (GA). The probability $P_i$ that the onlooker bees select a particular solution (food source) $x_i$ is calculated using the following equation:

$$P_i = \frac{fit_i}{\sum_{n=1}^{SN} fit_n},$$

where $fit_i$ is the fitness value of solution $i$ and SN is the number of food sources.
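The roulette wheel step can be sketched as follows, assuming that each solution's selection probability is its fitness divided by the sum of all fitness values. The function names and the uniform fallback for an all-zero fitness list are choices made for this example.

```python
import random

def selection_probabilities(fitness):
    """Return each solution's share of the total fitness."""
    total = sum(fitness)
    if total == 0:
        # Assumption: fall back to a uniform choice when every fitness is zero.
        return [1.0 / len(fitness)] * len(fitness)
    return [f / total for f in fitness]

def roulette_pick(probs, rng):
    """Pick an index with probability proportional to probs."""
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1  # guard against floating-point round-off

probs = selection_probabilities([0.9, 0.6, 0.5])
chosen = roulette_pick(probs, random.Random(0))
```

Solutions with higher accuracy are visited by onlookers more often, but weaker solutions still keep a nonzero chance of being refined.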
2.1.3. Scout Bee Phase
Every bee (employed or onlooker) looks for predictive genes for a specific and limited number of cycles; when the fitness value does not improve, the employed bee becomes a scout bee. In other words, a solution that could not be improved within a given number of trials (the limit) is abandoned, and its bee becomes a scout. A scout bee randomly generates a new set of gene indices in the solution search space.
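The scout step might look like the sketch below. It assumes a per-solution trial counter that other phases increment on every failed improvement, and it replaces every solution whose counter exceeds the limit; whether one or several solutions are replaced per cycle is not stated in the text, so that choice is an assumption.

```python
import random

def scout_phase(foods, trials, limit, n_genes, rng):
    """Replace each exhausted solution with a fresh random set of gene indices."""
    for i in range(len(foods)):
        if trials[i] > limit:
            # The abandoned solution is regenerated at random (exploration).
            foods[i] = rng.sample(range(n_genes), len(foods[i]))
            trials[i] = 0
    return foods, trials

foods = [[0, 1], [2, 3]]
trials = [6, 2]
foods, trials = scout_phase(foods, trials, limit=5, n_genes=100, rng=random.Random(0))
```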
It is worth mentioning that the ABC algorithm faces some challenging problems, especially in computational efficiency, when it is applied on complex and high-dimensional data such as a microarray dataset. This motivates us to solve these difficulties and further improve the performance of the ABC algorithm by proposing a hybrid technique between the ABC algorithm and mRMR filter gene selection approach, namely, the mRMR-ABC algorithm. In the following subsection, we explain the mRMR algorithm when applied to our problem.
2.2. Minimum Redundancy Maximum Relevance (mRMR) Filter Algorithm
The gene selection process is very important for accurate classification prediction, and the mRMR method can significantly improve the classification accuracy. In a high-dimensional microarray dataset, because there are thousands of genes, it is inefficient to apply an evolutionary algorithm such as an artificial bee colony directly to the dataset. In addition, it is difficult for a classifier to be trained accurately. Alternative techniques should be adopted to address this difficulty. Therefore, as a first step, mRMR is employed to reduce noisy and irrelevant genes. The mRMR approach was proposed by Peng et al. in 2005. It is a heuristic technique that can be used for continuous and discrete datasets in order to measure the relevancy and redundancy of features and determine the promising features. In that paper, the authors performed a comparative study of mRMR with the maximum relevance technique (MaxRel) and employed them with different machine learning classifiers using four different datasets (handwritten digits, arrhythmia (irregular heartbeat), NCI, and lymphoma cancer). The experimental results indicate that mRMR is an effective method for improving the performance of feature selection. Features selected by mRMR are more predictive and achieve more accurate classification results than those selected by MaxRel.
Furthermore, Ding and Peng applied the mRMR feature selection method to microarray gene expression profiling. Genes selected using mRMR achieve a more balanced coverage of the solution space, and they significantly improved the classification accuracy as well. The experimental study on five gene expression datasets using four classification methods shows that the classification performance is consistently improved. In addition, Liu et al. proposed an approach that combines an mRMR filter method based on mutual information with SVM-RFE to reduce the redundancy in the selected genes. Experimental results, based on four benchmark cancer microarray datasets, show that the mRMR filter method is more effective when it is applied with SVM-RFE. It is also shown that mRMR can be effectively combined with other feature selectors, such as wrappers. This can be done to identify a very compact subset from candidate features at lower expense. Amine et al. reported a novel gene selection method based on a hybrid approach combining an mRMR filter method with a GA wrapper method. In this study, the authors conducted a comparative study of mRMR-GA, mRMR, and GA when applied with SVM as a classifier on five different binary and multiclass cancer microarray datasets. The results from this study demonstrated the effectiveness of the integration of mRMR and GA, and it was concluded that the mRMR-GA method achieved better performance when compared to the mRMR filter and GASVM wrapper algorithms in all datasets. Meanwhile, with the same number of selected genes in this experimental result, the gene set obtained by the mRMR-GA selection was more representative of the specific class. Recently, Abdi et al. introduced a novel hybrid gene selection method based on the particle swarm optimization (PSO) algorithm and the mRMR filter method. The numerical experimental results for colon cancer and leukemia microarray datasets show that the mRMR-PSO algorithm achieves better classification accuracy than previously reported algorithms.
In addition, mRMR has been successfully applied to other bioinformatics problems. For instance, Yun et al. explored the effectiveness of the mRMR method with GA and PSO algorithms in selecting predictive features in audiology and dermatology datasets. They concluded that using the mRMR method produces better performance in terms of the classification accuracy and the relevancy of selected features. Moreover, Huang et al. successfully applied the mRMR feature selection method to choose an appropriate subset of informative and relevant features that are important for the detection of deleterious SNPs.
In our problem, we use the mRMR gene selection method to identify the predictive genes that have minimum redundancy with other genes in the microarray dataset and maximum relevancy for specific cancer classes. To this end, the mRMR method uses two mutual information (MI) operations: one between the cancer classes and each gene, to measure relevancy, and a second between every pair of genes, to calculate redundancy. Figure 2 presents the mRMR dataset, which contains the ordered indices of the selected genes. The first row represents the maximum relevant and the minimum redundant genes.
$S$ denotes the set of selected genes, and Rl measures the relevancy of a group of selected genes, which can be defined as follows:

$$Rl = \frac{1}{|S|} \sum_{G_i \in S} I(G_i, c),$$

where $I(G_i, c)$ represents the value of mutual information between an individual gene $G_i$ that belongs to $S$ and the cancer class $c = \{c_1, c_2\}$, where $c_1$ and $c_2$ denote the normal and tumor classes.
When the selected genes have the maximum relevance Rl value, it is possible to have high dependency (i.e., redundancy) between these genes. Hence, the redundancy Rd of a group of selected genes is defined as

$$Rd = \frac{1}{|S|^2} \sum_{G_i, G_j \in S} I(G_i, G_j),$$

where $I(G_i, G_j)$ is the mutual information between the $i$th and $j$th genes that measures the mutual dependency of these two genes.
The main purpose of applying the mRMR gene selection method is to find a subset $S$ of $m$ genes, $\{G_1, \ldots, G_m\}$, that either jointly have the largest dependency on the target class $c$ or have the minimal redundancy within the selected gene subset $S$. Thus, Peng et al. recommend searching for balanced solutions through a composite objective. This criterion combines the two criteria, which are the maximal relevance criterion and the minimal redundancy criterion, as follows:

$$\max \Phi(Rl, Rd), \quad \Phi = Rl - Rd.$$
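To make the relevance and redundancy trade-off concrete, here is a small greedy ranking sketch. It assumes discretized expression values, a plug-in mutual information estimate from observed frequencies, and the difference form of the criterion (relevance minus mean redundancy with the genes already chosen). The function names are hypothetical, and the incremental greedy search is one common way to apply the criterion rather than a detail confirmed by the text.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def mrmr_rank(genes, labels, m):
    """Greedily order up to m genes by relevance minus mean redundancy."""
    relevance = [mutual_information(g, labels) for g in genes]
    selected = [max(range(len(genes)), key=lambda i: relevance[i])]
    while len(selected) < min(m, len(genes)):
        def score(i):
            redundancy = sum(mutual_information(genes[i], genes[s])
                             for s in selected) / len(selected)
            return relevance[i] - redundancy
        remaining = [i for i in range(len(genes)) if i not in selected]
        selected.append(max(remaining, key=score))
    return selected

# Toy data: gene 1 duplicates gene 0, gene 2 is unrelated to the class.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
genes = [
    [0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 1, 0, 1],
]
order = mrmr_rank(genes, labels, 3)
```

On this toy input the exact duplicate of the first gene is ranked last, because its redundancy penalty outweighs its relevance.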
Our goal is to increase the prediction accuracy and reduce the number of selected genes. Hence, we applied the mRMR method as a preprocessing step to the ABC algorithm to improve the speed and performance of the search.
2.3. Support Vector Machine (SVM)
The SVM algorithm is a machine learning approach based on statistical learning theory, which was proposed in 1998 by Vapnik. An SVM is a powerful classification algorithm that has shown good performance in a variety of bioinformatics classification tasks. In addition, SVMs are very effective classification techniques for microarray data, and they have significantly improved classification accuracy. One of the main advantages of SVM models in cancer classification using high-dimensional data such as microarray datasets is that they can be fitted using all genes while maintaining stable performance on the full set of genes [26–29]. The aim of an SVM is to find the hyperplane that separates the classes with the largest margin (the distance between the hyperplane and the closest samples from each class). Generally, a good SVM classifier seeks a balance between increasing the margin and reducing the number of errors. In our recent comparative study, we showed that machine learning classification methods produce accurate results with a minimum number of genes. Many machine learning techniques have been applied to classify microarray datasets, including SVM, K nearest neighbor (KNN), random forest (RF), artificial neural network (ANN), and naive Bayes (NB). Therefore, we compared the classification performance achieved by each machine learning classification algorithm proposed in the state of the art for cancer microarray datasets, as shown in Table 1. The table also compares the classification performances for four microarray datasets. The best performance of each cancer classification approach for each microarray dataset is indicated in bold. According to the available literature, SVM has superior classification accuracy when applied to microarray data.
Since the early stages of SVM research, researchers have applied the linear, polynomial, and RBF kernels to classification problems. It is worth pointing out that the polynomial and RBF kernels are nonlinear, and cancer classification using microarray datasets is a nonlinear classification task. Nahar et al. observed from their experiments on nine microarray datasets that the polynomial kernel is the best choice for classifying microarray datasets. Therefore, we used the polynomial kernel for the SVM classifier. In addition, we apply leave-one-out cross validation (LOOCV) in order to evaluate the performance of our proposed algorithm and the existing methods in the literature. LOOCV is well suited to our problem because it helps guard against the “overfitting” problem. In LOOCV, one sample from the original dataset is used as the testing dataset, and the remaining samples are used as the training dataset. This is repeated such that each sample in the microarray dataset is used once as the testing dataset.
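The LOOCV procedure can be sketched in a few lines. To keep the example dependency-free, a nearest-centroid classifier stands in for the polynomial-kernel SVM; this substitution is an assumption of the example only, and any classifier exposed through the same hypothetical `train` and `predict` callables could be plugged in instead. Accuracy is reported as the percentage of held-out samples classified correctly.

```python
def nearest_centroid_train(X, y):
    """Stand-in classifier (NOT an SVM): store the mean vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def nearest_centroid_predict(model, row):
    """Predict the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(row, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))

def loocv_accuracy(X, y, train, predict):
    """Hold out each sample once, train on the rest, and return accuracy in percent."""
    correct = 0
    for i in range(len(X)):
        model = train(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        correct += predict(model, X[i]) == y[i]
    return 100.0 * correct / len(X)

# Toy expression values for six samples and two genes.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
y = ["normal"] * 3 + ["tumor"] * 3
acc = loocv_accuracy(X, y, nearest_centroid_train, nearest_centroid_predict)
```

Because the held-out sample never takes part in training, each prediction is made on data the model has not seen.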
3. Proposed mRMR-ABC Algorithm
In this section, we introduce the proposed mRMR-ABC algorithm to select the predictive genes from the cancer microarray gene expression profile. The aim of this algorithm is to select the more informative genes in order to improve the accuracy of the SVM classifier. It does so by preselecting the relevant and informative genes using the mRMR method and then estimating the best predictive genes by applying the ABC algorithm as a wrapper gene selection technique with the SVM classifier. Figure 3 shows the solution representation for the proposed algorithm. The food sources represent the population of solutions. Each row of the foods matrix is a particular solution holding the gene indices that are to be optimized and that are selected from the mRMR dataset shown in Figure 2. In our proposed mRMR-ABC algorithm, we select the genes from a small dataset (the mRMR dataset) that contains the informative genes. Consequently, the optimization process is improved compared with the original ABC algorithm, which selects the genes directly from the initial microarray dataset.
As illustrated in Figure 4, our proposed algorithm consists of three phases: preprocessing phase, gene selection phase, and classification phase.
(i) Preprocessing Phase (Figure 5). The initial microarray gene expression profile is filtered and preprocessed using the mRMR gene selection method. Each gene is evaluated and sorted using the mRMR mutual information (MI) operations as explained in Section 2.2. The most relevant genes that give 100% classification accuracy with an SVM classifier are identified to form a new subset named the mRMR dataset, as shown in Figure 2. The mRMR dataset contains the more relevant and less redundant genes as selected by the mRMR approach. mRMR is applied in order to filter irrelevant and noisy genes and to reduce the computational load for the ABC algorithm and the SVM classifier.
(ii) Gene Selection Phase (Figure 6). An ABC algorithm is developed as described in Section 2.1 to select, from the mRMR dataset, the most informative and predictive genes that give the highest classification accuracy with an SVM classifier. Figure 3 illustrates the representation of the food source or the solution space for the proposed mRMR-ABC algorithm. Each solution is represented as a group of gene indices that are selected from the mRMR dataset. In a gene selection problem, each solution (i.e., subset of selected genes) is associated with a fitness value, which is the classification accuracy using an SVM classifier.
(iii) Classification Phase (Figure 7). The informative and predictive genes that are generated by the ABC algorithm in the second phase are used to train the SVM classifier. The SVM is then applied again to classify the testing microarray dataset and return the classification accuracy.
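One way to read the preprocessing phase is as a search for the smallest prefix of the mRMR-ranked gene list that reaches the target accuracy. The sketch below assumes the prefix grows in fixed steps and that a hypothetical `evaluate` callable returns the classifier accuracy (in percent) for a gene subset; both the step size and the callable are assumptions for illustration.

```python
def smallest_top_k(ranked_genes, evaluate, step=50, target=100.0):
    """Return the shortest ranked prefix whose accuracy reaches the target."""
    for k in range(step, len(ranked_genes) + step, step):
        subset = ranked_genes[:k]
        if evaluate(subset) >= target:
            return subset
    # Assumption: fall back to the full ranked list if the target is never met.
    return ranked_genes

# Toy evaluator: pretends that 150 or more genes are enough for 100% accuracy.
ranked = list(range(400))
mrmr_dataset = smallest_top_k(ranked, lambda s: 100.0 if len(s) >= 150 else 90.0)
```

The resulting subset then serves as the reduced search space for the gene selection phase.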
The main steps of the proposed algorithm (mRMR-ABC) are as follows.
(1) Preprocess the microarray dataset using the mRMR filtering method.
(2) Initialize the population.
Repeat
(3) Place the employed bees on their food sources.
(4) Place the onlooker bees on the food sources depending on their nectar amounts.
(5) Send the scouts to the search area to discover new food sources.
(6) Memorize the best food source found so far.
Until requirements are met
(7) Classify the microarray dataset using the SVM classifier.
In addition, the pseudocode for the proposed mRMR-ABC algorithm is presented as follows.
(1) Select the highly relevant gene subset using the mRMR filter method that gives 100% classification accuracy with the SVM classifier.
(2) Set the parameters: max cycles, colony size, and limit.
(3) Initialize the food sources.
(4) Evaluate the food sources by calculating the fitness, which is the classification accuracy using the SVM classifier.
(5) cycle = 1.
(6) While cycle < max cycles Do.
(7) Generate new solutions using employed bees.
(8) Evaluate the new solutions by calculating the fitness (SVM classification accuracy).
(9) Apply greedy selection operation.
(10) Calculate the probability values using the fitness function.
(11) Generate new solutions using onlooker bees based on the probability of each food source.
(12) Evaluate the new solutions by calculating the fitness (SVM classification accuracy).
(13) Apply greedy selection operation.
(14) Determine abandoned solutions and generate new solutions randomly using scouts.
(15) Memorize the best solution found so far.
(16) cycle = cycle + 1.
(17) End While.
(18) Return the best solution (predictive and informative genes).
(19) Train the SVM classifier using the selected genes.
(20) Classify the microarray dataset using SVM.
(21) Return the classification accuracy.
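The search loop in the pseudocode can be sketched compactly as follows. This is a simplified illustration under several assumptions: the fitness is any callable returning a higher-is-better score (the paper uses SVM accuracy; a toy score is used here), the neighbour move swaps one gene for a gene taken from a random partner solution instead of using the arithmetic update, and the names and default parameter values are hypothetical.

```python
import random

def abc_select(candidates, d, fitness, sn=10, max_cycles=20, limit=5, seed=0):
    """Search for a d-gene subset of `candidates` that maximizes `fitness`."""
    rng = random.Random(seed)
    foods = [rng.sample(candidates, d) for _ in range(sn)]
    fits = [fitness(f) for f in foods]
    trials = [0] * sn
    best_fit = max(fits)
    best = list(foods[fits.index(best_fit)])

    def try_neighbor(i):
        # Simplified discrete move: copy one unused gene from a partner k != i.
        k = rng.choice([s for s in range(sn) if s != i])
        cand = list(foods[i])
        pool = [g for g in foods[k] if g not in cand]
        if pool:
            cand[rng.randrange(d)] = rng.choice(pool)
        cand_fit = fitness(cand)
        if cand_fit > fits[i]:                 # greedy selection
            foods[i], fits[i], trials[i] = cand, cand_fit, 0
        else:
            trials[i] += 1

    for _ in range(max_cycles):
        for i in range(sn):                    # employed bees
            try_neighbor(i)
        total = sum(fits)
        weights = [f / total for f in fits] if total > 0 else None
        for _ in range(sn):                    # onlookers: roulette wheel
            try_neighbor(rng.choices(range(sn), weights=weights)[0])
        for i in range(sn):                    # scouts
            if trials[i] > limit:
                foods[i] = rng.sample(candidates, d)
                fits[i], trials[i] = fitness(foods[i]), 0
        cycle_best = max(fits)
        if cycle_best > best_fit:              # memorize the best solution so far
            best_fit, best = cycle_best, list(foods[fits.index(cycle_best)])
    return best, best_fit

# Toy fitness: fraction of three "informative" genes present in the subset.
informative = {3, 7, 11}
def toy_fitness(subset):
    return len(informative & set(subset)) / len(informative)

best, best_fit = abc_select(list(range(30)), d=3, fitness=toy_fitness)
```

In the setting described above, `candidates` would be the gene indices of the mRMR dataset, and `fitness` would wrap the SVM accuracy estimate.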
4. Experimental Setup and Results
4.1. Experimental Setup
In this section, we evaluate the overall performance of gene selection methods using six popular binary and multiclass microarray cancer datasets, which were downloaded from http://www.gems-system.org/. These datasets have been widely used to benchmark the performance of gene selection methods in the bioinformatics field. The binary-class microarray datasets are colon, leukemia1 [32, 33], and lung, while the multiclass microarray datasets are SRBCT, lymphoma, and leukemia2. In Table 2, we present a detailed description of these six benchmark microarray gene expression datasets with respect to the number of classes, number of samples, number of genes, and a brief description of each dataset's construction.
Table 3 shows the control parameters for the mRMR-ABC algorithm that were used in our experiments. The first control parameter is the bee colony size or population, with a value of 80. The second control parameter is the maximum cycle, which is equal to the maximum number of generations. A value of 100 is used for this parameter. Another control parameter is the number of runs, which was used as a stopping criterion, and we used a value of 30 in our experiments, which has been shown to be acceptable. The last control parameter is the limit, which represents the maximum number of iterations allowed when the food source is not improved (exhausted). If a food source exceeds this limit, it is abandoned and replaced by a scout bee. A value of 5 iterations is used for this parameter.
In this study, we tested the performance of the proposed mRMR-ABC algorithm by comparing it with other standard bioinspired algorithms, including ABC, GA, and PSO. We compared the performance of each gene selection approach based on two parameters: the classification accuracy and the number of predictive genes that have been used for cancer classification. Classification accuracy is the overall correctness of the classifier and is calculated as the number of correct cancer classifications divided by the total number of classifications. It is computed by the expression shown below:

$$\text{Classification Accuracy} = \frac{CC}{N} \times 100,$$

where $N$ is the total number of instances in the initial microarray dataset and CC refers to the number of correctly classified instances.
We apply leave-one-out cross validation (LOOCV) in order to evaluate the performance of our proposed algorithm and the existing methods in the literature. LOOCV is well suited to our problem because it helps guard against the “overfitting” problem. It also provides a nearly unbiased estimate of the generalization error for stable classifiers such as the SVM classifier. In LOOCV, one sample from the original dataset is used as the testing dataset, and the remaining samples are used as the training dataset. This is repeated such that each sample in the microarray dataset is used once as the testing dataset. We implement the GA, the PSO algorithm, and the SVM using the Waikato Environment for Knowledge Analysis (WEKA version 3.6.10), an open source data mining tool. Furthermore, in order to make the experiments more statistically valid, we conduct each experiment 30 times on each dataset. In addition, the best, worst, and average classification accuracies of the 30 independent runs are calculated in order to evaluate the performance of our proposed algorithm.
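Summarizing repeated runs in this way takes only a small helper. The sketch below is illustrative: the function name is hypothetical, and the accuracy values are made-up placeholders rather than results from the paper.

```python
from statistics import mean

def summarize_runs(accuracies):
    """Return the best, worst, and average accuracy over independent runs."""
    return {
        "best": max(accuracies),
        "worst": min(accuracies),
        "average": mean(accuracies),
    }

# Placeholder accuracies (percent) from three hypothetical runs.
summary = summarize_runs([90.0, 100.0, 95.0])
```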
4.2. Experimental Results
In this section, we present and analyze the results obtained by our algorithm. As a first step, we employed the mRMR method to identify the top relevant genes that give 100% accuracy with an SVM classifier. From Table 4 and Figure 8, we can see that the top 150 genes in the leukemia1 dataset generate 100% classification accuracy, while in the colon dataset we can get 100% accuracy using 350 genes. For the lung dataset, we achieved 100% accuracy using 200 genes, and for the SRBCT dataset, 250 genes were needed to reach the same classification accuracy. In addition, using 150 highly relevant genes from the lymphoma dataset and 250 genes from the leukemia2 dataset, we achieved 100% classification accuracy. We then used these highly relevant genes as input to the ABC algorithm to determine the most predictive and informative genes.
We compared the performance of the proposed mRMR-ABC algorithm with that of the original ABC algorithm, using SVM as the classifier and the same number of selected genes, for all six benchmark microarray datasets. The comparison results for the binary-class microarray datasets (colon, leukemia1, and lung) are shown in Tables 5, 6, and 7, respectively, while Tables 8, 9, and 10 present the comparison results for the multiclass microarray datasets (SRBCT, lymphoma, and leukemia2), respectively. From these tables, it is clear that our proposed mRMR-ABC algorithm performs better than the original ABC algorithm in every single case (i.e., on all datasets and for every number of selected genes tested).
In this research, we reimplement mRMR with particle swarm optimization (mRMR-PSO) and mRMR with a genetic algorithm (mRMR-GA) in order to compare them with the mRMR-ABC algorithm under the same parameters. We also compare mRMR-ABC with published results for recent gene selection algorithms. Notably, all of these algorithms use the SVM as the classification approach.
Table 11 shows the numerical comparative results of the mRMR-ABC algorithm and related algorithms proposed in the literature. Compared with the mRMR-ABC algorithm, the mAnt method proposed by Yu et al. (2009) selected fewer genes on the colon dataset. The mAnt algorithm selected 8 genes and achieved 91.5% classification accuracy. In contrast, the mRMR-ABC algorithm selects 15 genes and achieves 96.77% classification accuracy. For the leukemia1 dataset, the mRMR-ABC algorithm achieves 100% classification accuracy with 14 selected genes. In comparison, the mRMR-PSO algorithm proposed by Abdi et al. (2012) also achieved 100% classification accuracy; however, it selected more genes. For the lung dataset, the mRMR-ABC algorithm selected 8 genes to achieve 100% classification accuracy. The mRMR-GA algorithm proposed by Amine et al. (2009) selected 15 genes in order to achieve 100% accuracy on the same dataset.
For the SRBCT dataset, the MLHD-GA algorithm proposed by Huang et al. (2007) achieved 100% classification accuracy using 11 selected genes. By contrast, the mRMR-ABC algorithm selects 10 genes and achieves 100% classification accuracy. Although many existing algorithms achieve 100% accuracy on the lymphoma dataset, the mRMR-ABC algorithm selected a smaller number of predictive genes: it needed only five genes to achieve 100% classification accuracy on this dataset. Finally, for the leukemia2 dataset, the mRMR-ABC method selected 20 genes to achieve 100% classification accuracy. It exceeds all the other methods in performance except the MLHD-GA algorithm proposed by Huang et al. (2007), which selected 9 genes to achieve 100% classification accuracy.
Generally, most related algorithms aim to identify the minimum number of selected genes with high classification accuracy. In comparison, the mRMR-ABC algorithm selects fewer genes than most of the algorithms under comparison while retaining relatively high classification accuracy. On the other hand, the algorithms that select fewer genes than the mRMR-ABC algorithm achieve lower classification accuracy than mRMR-ABC.
Table 12 describes the best predictive and most frequently selected genes that give the highest classification accuracy with the mRMR-ABC algorithm for all microarray datasets.
It is worth mentioning that the ABC algorithm combined with the mRMR filter method generally achieves higher classification accuracy than the ABC algorithm without mRMR. Thus, mRMR is a promising method for identifying the relevant genes and omitting the redundant and noisy genes. We can conclude that the proposed mRMR-ABC algorithm achieves accurate classification performance with a small number of selected genes on all tested datasets, as compared to the original ABC algorithm under the same cross validation approach. Therefore, the mRMR-ABC algorithm is a promising approach for solving gene selection and cancer classification problems.
In this research paper, we proposed applying the ABC algorithm to microarray gene expression profiles. In addition, we proposed a new artificial bee colony-based algorithm, the mRMR-ABC hybrid gene selection approach, which is combined with SVM as a classifier. It can be used to solve classification problems that deal with high-dimensional datasets, especially microarray gene expression profiles. To the best of our knowledge, the ABC algorithm has not previously been applied as a gene selection technique for microarray datasets, so this is the first attempt. Our proposed mRMR-ABC algorithm is a three-phase method. First, the mRMR filter technique is adopted to identify the relevant and informative gene subset from the candidate microarray dataset. Then the ABC algorithm is employed to select the predictive genes from the mRMR gene subset. Finally, the SVM classifier is trained and tested using the selected genes and returns the classification accuracy. Extensive experiments were conducted using six binary and multiclass microarray datasets. The results showed that the proposed algorithm compares favorably with previously proposed algorithms, achieving high classification accuracy with a small number of selected genes.
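As a rough illustration of the second phase, the sketch below runs a simplified, binary artificial bee colony search over gene subsets. It is not the implementation evaluated in this paper: the bit-flip neighbourhood, the parameter defaults, and the function name `abc_select` are assumptions made for the example, and `fitness` stands for any non-negative score to maximize, such as the LOOCV accuracy of a classifier restricted to the selected genes.

```python
import random


def abc_select(n_genes, fitness, colony_size=10, limit=5, max_cycles=50, seed=0):
    """Simplified binary ABC search (illustrative assumption, not the paper's code).

    A food source is a list of 0/1 flags over the candidate genes (1 = selected).
    fitness(mask) must return a non-negative score to maximize.
    """
    rng = random.Random(seed)

    def random_source():
        mask = [rng.randint(0, 1) for _ in range(n_genes)]
        if not any(mask):  # keep at least one gene selected
            mask[rng.randrange(n_genes)] = 1
        return mask

    def neighbor(mask):
        new = mask[:]
        j = rng.randrange(n_genes)
        new[j] = 1 - new[j]  # flip one gene in or out of the subset
        if not any(new):
            new[j] = 1
        return new

    sources = [random_source() for _ in range(colony_size)]
    scores = [fitness(s) for s in sources]
    trials = [0] * colony_size
    best_i = max(range(colony_size), key=lambda i: scores[i])
    best_mask, best_score = sources[best_i][:], scores[best_i]

    def try_improve(i):
        nonlocal best_mask, best_score
        cand = neighbor(sources[i])
        cand_score = fitness(cand)
        if cand_score > scores[i]:  # greedy replacement of the food source
            sources[i], scores[i], trials[i] = cand, cand_score, 0
            if cand_score > best_score:
                best_mask, best_score = cand[:], cand_score
        else:
            trials[i] += 1

    for _ in range(max_cycles):
        for i in range(colony_size):  # employed bees: one per source
            try_improve(i)
        # onlooker bees: pick sources with probability proportional to fitness
        weights = [s + 1e-12 for s in scores]
        for i in rng.choices(range(colony_size), weights=weights, k=colony_size):
            try_improve(i)
        for i in range(colony_size):  # scout bees: abandon exhausted sources
            if trials[i] > limit:
                sources[i] = random_source()
                scores[i] = fitness(sources[i])
                trials[i] = 0
                if scores[i] > best_score:
                    best_mask, best_score = sources[i][:], scores[i]
    return best_mask, best_score
```

In a full pipeline of this kind, `n_genes` would be the size of the mRMR-filtered subset and the returned mask would identify the genes passed to the final classifier. Penalizing large subsets inside `fitness` is one possible design choice for favouring small gene sets, but it is not something this sketch prescribes.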
Recently, Lin et al. proposed a new hybrid approach called LibD3C; this approach is a hybrid model of ensemble pruning that is based on k-means clustering and the framework of dynamic selection and circulating in combination with a sequential search method. Their empirical study shows that LibD3C exhibits competitive accuracy against other high-performance methods and verifies the feasibility of multiclass classification. Therefore, in the future, we intend to use LibD3C for microarray cancer classification, to which, to the best of our knowledge, it has not yet been applied. In addition, we would like to conduct more experiments on more real and benchmark datasets to verify and extend the proposed algorithm. Moreover, the mRMR-ABC algorithm can be considered a general framework that can be used to solve various optimization problems.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
This research project was supported by a grant from the Research Center of the Center for Female Scientific and Medical Colleges, Deanship of Scientific Research, King Saud University.
- H. M. Alshamlan, G. H. Badr, and Y. Alohali, “A study of cancer microarray gene expression profile: objectives and approaches,” in Proceedings of the World Congress on Engineering, vol. 2, 2013.
- E. Alba, J. García-Nieto, L. Jourdan, and E.-G. Talbi, “Gene selection in cancer classification using PSO/SVM and GA/SVM hybrid algorithms,” in Proceedings of the IEEE Congress on Evolutionary Computation (CEC '07), pp. 284–290, IEEE, Singapore, September 2007.
- R. Simon, “Analysis of DNA microarray expression data,” Best Practice and Research Clinical Haematology, vol. 22, no. 2, pp. 271–282, 2009.
- A. Osareh and B. Shadgar, “Microarray data analysis for cancer classification,” in Proceedings of the 5th International Symposium on Health Informatics and Bioinformatics (HIBIT '10), pp. 125–132, April 2010.
- S. Ghorai, A. Mukherjee, S. Sengupta, and P. K. Dutta, “Multicategory cancer classification from gene expression data by multiclass NPPC ensemble,” in Proceedings of the International Conference on Systems in Medicine and Biology (ICSMB '10), pp. 41–48, December 2010.
- S.-B. Guo, M. R. Lyu, and T.-M. Lok, “Gene selection based on mutual information for the classification of multi-class cancer,” in Computational Intelligence and Bioinformatics: International Conference on Intelligent Computing, ICIC 2006, Kunming, China, August 16–19, 2006. Proceedings, Part III, vol. 4115 of Lecture Notes in Computer Science, pp. 454–463, Springer, Berlin, Germany, 2006.
- C. Lin, Y. Zou, J. Qin et al., “Hierarchical classification of protein folds using a novel ensemble classifier,” PLoS ONE, vol. 8, no. 2, Article ID e56499, 2013.
- N. Patrenahalli and K. Fukunaga, “A branch and bound algorithm for feature subset selection,” IEEE Transactions on Computers, vol. 26, no. 9, pp. 917–922, 1977.
- D. Karaboga, “An idea based on honey bee swarm for numerical optimization,” Tech. Rep., Technical Erciyes University, Engineering Faculty, Computer Engineering Department, 2005.
- D. Karaboga and B. Akay, “A comparative study of artificial Bee colony algorithm,” Applied Mathematics and Computation, vol. 214, no. 1, pp. 108–132, 2009.
- D. Karaboga and B. Basturk, “A powerful and efficient algorithm for numerical function optimization: artificial bee colony (ABC) algorithm,” Journal of Global Optimization, vol. 39, no. 3, pp. 459–471, 2007.
- D. Karaboga and B. Basturk, “On the performance of artificial bee colony (ABC) algorithm,” Applied Soft Computing Journal, vol. 8, no. 1, pp. 687–697, 2008.
- H. M. Alshamlan, G. H. Badr, and Y. A. Alohali, “The performance of bio-inspired evolutionary gene selection methods for cancer classification using microarray dataset,” International Journal of Bioscience, Biochemistry and Bioinformatics, vol. 4, no. 3, pp. 166–170, 2014.
- H. Alshamlan, G. Badr, and Y. Alohali, “A comparative study of cancer classification methods using microarray gene expression profile,” in Proceedings of the First International Conference on Advanced Data and Information Engineering (DaEng-2013), vol. 285 of Lecture Notes in Electrical Engineering, pp. 389–398, Springer, Singapore, 2014.
- X. Wang and O. Gotoh, “Microarray-based cancer prediction using soft computing approach,” Cancer Informatics, vol. 7, pp. 123–139, 2009.
- H. Li, K. Liu, and X. Li, “A comparative study of artificial bee colony, bees algorithms and differential evolution on numerical benchmark problems,” Computational Intelligence and Intelligent Systems, vol. 107, pp. 198–207, 2010.
- W.-L. Xiang and M.-Q. An, “An efficient and robust artificial bee colony algorithm for numerical optimization,” Computers & Operations Research, vol. 40, no. 5, pp. 1256–1265, 2013.
- C. Ding and H. Peng, “Minimum redundancy feature selection from microarray gene expression data,” Journal of Bioinformatics and Computational Biology, vol. 3, no. 2, pp. 185–205, 2005.
- H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.
- H. Liu, L. Liu, and H. Zhang, “Ensemble gene selection by grouping for microarray data classification,” Journal of Biomedical Informatics, vol. 43, no. 1, pp. 81–87, 2010.
- A. Amine, A. El Akadi, A. El Ouardighi, and D. Aboutajdine, “A new gene selection approach based on minimum redundancy-maximum relevance (mrmr) and genetic algorithm (ga),” in Proceedings of the IEEE/ACS International Conference on Computer Systems and Applications (AICCSA '09), pp. 69–75, 2009.
- M. J. Abdi, S. M. Hosseini, and M. Rezghi, “A novel weighted support vector machine based on particle swarm optimization for gene selection and tumor classification,” Computational and Mathematical Methods in Medicine, vol. 2012, Article ID 320698, 7 pages, 2012.
- C. Yun, B. Oh, J. Yang, and J. Nang, “Feature subset selection based on bio-inspired algorithms,” Journal of Information Science and Engineering, vol. 27, no. 5, pp. 1667–1686, 2011.
- T. Huang, P. Wang, Z. Ye et al., “Prediction of deleterious non-synonymous SNPs based on protein interaction network and hybrid properties,” PLoS ONE, vol. 5, no. 7, Article ID e11900, 2010.
- V. N. Vapnik, Statistical Learning Theory, Adaptive and Learning Systems for Signal Processing, Communications, and Control, John Wiley & Sons, 1998.
- C. J. Alonso-González, Q. I. Moro-Sancho, A. Simon-Hurtado, and R. Varela-Arrabal, “Microarray gene expression classification with few genes: criteria to combine attribute selection and classification methods,” Expert Systems with Applications, vol. 39, no. 8, pp. 7270–7280, 2012.
- C.-P. Lee and Y. Leu, “A novel hybrid feature selection method for microarray data analysis,” Applied Soft Computing Journal, vol. 11, no. 1, pp. 208–213, 2011.
- E. Huerta, B. Duval, and J. K. Hao, “A hybrid GA/SVM approach for gene selection and classification of microarray data,” in Applications of Evolutionary Computing: Proceedings of the EvoWorkshops 2006: EvoBIO, EvoCOMNET, EvoHOT, EvoIASP, EvoINTERACTION, EvoMUSART, and EvoSTOC, Budapest, Hungary, April 10–12, 2006, vol. 3907 of Lecture Notes in Computer Science, pp. 34–44, Springer, Berlin, Germany, 2006.
- S. Mukherjee, “Chapter 9. Classifying microarray data using support vector machines,” in Of scientists from the University of Pennsylvania School of Medicine and the School of Engineering and Applied Science, Kluwer Academic, 2003.
- J. Nahar, S. Ali, and Y.-P. P. Chen, “Microarray data classification using automatic SVM kernel selection,” DNA and Cell Biology, vol. 26, no. 10, pp. 707–712, 2007.
- A. Y. Ng, “Preventing ‘overfitting’ of cross-validation data,” in Proceedings of the 14th International Conference on Machine Learning (ICML '97), pp. 245–253, 1997.
- U. Alon, N. Barka, D. A. Notterman et al., “Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays,” Proceedings of the National Academy of Sciences of the United States of America, vol. 96, no. 12, pp. 6745–6750, 1999.
- T. R. Golub, D. K. Slonim, P. Tamayo et al., “Molecular classification of cancer: class discovery and class prediction by gene expression monitoring,” Science, vol. 286, no. 5439, pp. 531–537, 1999.
- D. G. Beer, S. L. R. Kardia, C.-C. Huang et al., “Gene-expression profiles predict survival of patients with lung adenocarcinoma,” Nature Medicine, vol. 8, no. 8, pp. 816–824, 2002.
- J. Khan, J. S. Wei, M. Ringnér et al., “Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks,” Nature Medicine, vol. 7, no. 6, pp. 673–679, 2001.
- A. A. Alizadeh, M. B. Eisen, R. E. Davis et al., “Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling,” Nature, vol. 403, no. 6769, pp. 503–511, 2000.
- S. A. Armstrong, J. E. Staunton, L. B. Silverman et al., “MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia,” Nature Genetics, vol. 30, no. 1, pp. 41–47, 2001.
- New Zealand University of Waikato, “Waikato environment for knowledge analysis,” http://www.cs.waikato.ac.nz/ml/weka/downloading.html.
- H. Yu, G. Gu, H. Liu, J. Shen, and J. Zhao, “A modified ant colony optimization algorithm for tumor marker gene selection,” Genomics, Proteomics & Bioinformatics, vol. 7, no. 4, pp. 200–208, 2009.
- H.-L. Huang, C.-C. Lee, and S.-Y. Ho, “Selecting a minimal number of relevant genes from microarray data to design accurate tissue classifiers,” BioSystems, vol. 90, no. 1, pp. 78–86, 2007.
- C. Lin, W. Chen, C. Qiu, Y. Wu, S. Krishnan, and Q. Zou, “LibD3C: ensemble classifiers with a clustering and dynamic selection strategy,” Neurocomputing, vol. 123, pp. 424–435, 2014.
- R. Zhang, G.-B. Huang, N. Sundararajan, and P. Saratchandran, “Multicategory classification using an extreme learning machine for microarray gene expression cancer diagnosis,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 4, no. 3, pp. 485–494, 2007.
- S. Peng, X. Zeng, X. Li, X. Peng, and L. Chen, “Multi-class cancer classification through gene expression profiles: microRNA versus mRNA,” Journal of Genetics and Genomics, vol. 36, no. 7, pp. 409–416, 2009.
- C.-P. Lee, W.-S. Lin, Y.-M. Chen, and B.-J. Kuo, “Gene selection and sample classification on microarray data based on adaptive genetic algorithm/k-nearest neighbor method,” Expert Systems with Applications, vol. 38, no. 5, pp. 4661–4667, 2011.
- R. Díaz-Uriarte and S. A. de Andrés, “Gene selection and classification of microarray data using random forest,” BMC Bioinformatics, vol. 7, no. 1, article 3, 2006.
- F. Chu and L. Wang, “Applications of support vector machines to cancer classification with microarray data,” International Journal of Neural Systems, vol. 15, no. 6, pp. 475–484, 2005.
- P. A. Mundra and J. C. Rajapakse, “Gene and sample selection for cancer classification with support vectors based t-statistic,” Neurocomputing, vol. 73, no. 13-15, pp. 2353–2362, 2010.
- Q. Shen, W.-M. Shi, W. Kong, and B.-X. Ye, “A combination of modified particle swarm optimization algorithm and support vector machine for gene selection and tumor classification,” Talanta, vol. 71, no. 4, pp. 1679–1683, 2007.
- M. S. Mohamad, S. Omatu, M. Yoshioka, and S. Deris, “An approach using hybrid methods to select informative genes from microarray data for cancer classification,” in Proceedings of the 2nd Asia International Conference on Modelling and Simulation (AMS '08), pp. 603–608, May 2008.
- C.-S. Yang, L.-Y. Chuang, C.-H. Ke, and C.-H. Yang, “A hybrid feature selection method for microarray classification,” International Journal of Computer Science, vol. 35, pp. 285–290, 2008.
- Q. Shen, W. M. Shi, W. Kong, and B. X. Ye, “A combination of modified particle swarm optimization algorithm and support vector machine for gene selection and tumor classification,” Talanta, vol. 71, no. 4, pp. 1679–1683, 2007.
- M. J. Abdi and D. Giveki, “Automatic detection of erythemato-squamous diseases using PSO–SVM based on association rules,” Engineering Applications of Artificial Intelligence, vol. 26, no. 1, pp. 603–608, 2013.
- H.-L. Huang and F.-L. Chang, “ESVM: evolutionary support vector machine for automatic feature selection and classification of microarray data,” BioSystems, vol. 90, no. 2, pp. 516–528, 2007.
- S. Peng, Q. Xu, X. B. Ling, X. Peng, W. Du, and L. Chen, “Molecular classification of cancer types from microarray data using the combination of genetic algorithms and support vector machines,” FEBS Letters, vol. 555, no. 2, pp. 358–362, 2003.
Copyright © 2015 Hala Alshamlan et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.