A Dual Level Analysis with Evolutionary Computing and Swarm Models for Classification of Leukemia
Cancer is one of the leading causes of human mortality, and doctors must be able to identify and treat it. Leukemia is a group of blood cancers that usually originates in the bone marrow and results in a very high number of abnormal cells. Microarray data is an important clinical tool for cancer diagnosis and a great aid to the medical community. Because the dimensionality of microarray data is very high, selecting suitable genes is an important step towards improving classification. Selecting the most informative genes is therefore essential for the prediction and diagnosis of cancer. In this work, Minimum Redundancy Maximum Relevance (MRMR), Signal to Noise Ratio (SNR), multivariate Error-Weighted Uncorrelated Shrunken Centroid (EWUSC), and multivariate correlation-based feature selection (CFS) are chosen as initial feature selection techniques. Then, to select the most informative genes, five evolutionary and swarm optimization techniques are applied: African Buffalo Optimization (ABO), Artificial Bee Colony Optimization (ABCO), Cockroach Swarm Optimization (CSO), Imperialist Competitive Optimization (ICO), and Social Spider Optimization (SSO). Finally, the optimized values are fed to the classifiers. The best result, a classification accuracy of 95.70%, is obtained when multivariate CFS with SSO is classified with a Probabilistic Neural Network (PNN).
Cancer is one of the deadliest human diseases. There are various types of cancer; all of them cause cells to divide uncontrollably, which can result in tumors, breakdown of the immune system, and impairment of vital organs. Some kinds of cancer cause rapid cell growth, while others cause cells to grow slowly. Some forms of cancer result in visible growths called tumors, while others, such as leukemia, do not. Leukemia is one of the three forms of blood cancer; lymphoma and myeloma are the other two. Leukemia produces an abnormal number of immature white blood cells, which overwhelm the bone marrow and prevent the production of the healthy blood cells required for a balanced immune system. Acute leukemia has a rapid onset and progresses very quickly, so patients need urgent treatment. Leukemia thus belongs to a broad group of cancers commonly termed hematological malignancies. Two types are considered here: Acute Myeloid Leukemia (AML) and Acute Lymphoblastic Leukemia (ALL).
AML: This kind of leukemia is the most prevalent type in older people but can affect younger people too. Due to the excess accumulation of immature hematopoietic cells in the blood and bone marrow, the malignancy occurs. Various genetic factors are responsible for such conditions.
ALL: This kind of disease is prevalent in children who are suffering from leukemia. When immature lymphoid cells excessively accumulate in the bone marrow and peripheral blood, this disease occurs.
Leukemia cells have traditionally been categorized by their morphological appearance. Identifying the innate differences between tumor cells requires highly skilled technological resources, and the process can be expensive, time consuming, and tedious. Cells that appear morphologically similar can respond quite differently to drugs and therapy. Traditional techniques therefore have serious limitations, and other parameters are needed so that cells can be categorized reliably. Gene expression data provides a large amount of useful information for subclassification studies, and microarrays have played an important role in handling the expression data of thousands of genes simultaneously. Over the previous decade, microarray technology has been the most commonly used gene quantification method, and it is still in use because it is inexpensive. With microarray techniques, the expression levels of tens of thousands of genes can be measured easily, giving scientists information about the functional relationships between genes and the physiological and cellular processes of biological organisms. Because microarray data is very large and contains a great deal of noise and other disturbances, it suffers from the curse of dimensionality, so gene selection is needed to pass only the best genes to the classifier. Some of the most important works on microarray-based leukemia classification are as follows:
Aghamaleki et al. implemented an Artificial Neural Network (ANN) for the diagnosis of chronic lymphocytic leukemia. Binet et al. derived a novel prognostic classification of chronic lymphocytic leukemia from a multivariate survival analysis. Afshar et al. used an ANN to recognize and predict leukemia. Wisesty et al. performed microarray-based leukemia classification using momentum backpropagation with genetic algorithms as the feature selection technique. Vogado et al. diagnosed leukemia using transfer learning in Convolutional Neural Networks (CNNs). Kumar et al. used an effective MapReduce-based KNN classifier to analyze microarray leukemia data. Alrefai proposed ensemble machine learning for leukemia diagnosis based on microarray datasets. Dwivedi presented a framework to detect and discriminate ALL and AML from microarray gene expression profiles using supervised machine learning. Cho explored various features and classifiers for classifying gene expression profiles of acute leukemia. Nasser et al. proposed an enhanced leukemia cancer classifier algorithm. Huang et al. applied a Probabilistic Neural Network (PNN) to the class prediction of leukemia. Nguyen et al. classified acute leukemia from DNA microarray gene expression using the Partial Least Squares (PLS) method. Golub et al. used an SNR approach to discriminate AML from ALL. Sewak et al. performed gene expression-based leukemia subclassification using a committee neural network. Castillo et al. performed a multiclass leukemia assessment and classification by integrating microarray and RNA-sequencing technologies at the gene expression level.
Optimization algorithms have played a major role in gene selection. Dagliyan et al. performed optimization-based tumor classification from microarray gene expression data; related work includes random cuckoo search for autism gene selection and a stellar mass black hole algorithm for engineering optimization. Antonov et al. proposed optimization models for cancer classification that extract gene interaction information from microarray expression data. Other optimization methods reported in the literature for cancer gene selection include a modified genetic algorithm with Levy flight, simplified swarm optimization, the chronological grasshopper optimization algorithm, hybrid optimization algorithms, adaptive ant colony optimization, biogeography-based optimization, nondominated sorting GA, filter-based optimization, Particle Swarm Optimization (PSO), Grey Wolf optimization, and a hybrid of the Grey Wolf and Crow search algorithms. In this work, a two-level feature selection is carried out, first with statistical tests and then with optimization techniques, and the selected genes are classified with suitable classifiers. The work is organized as follows. Section 2 discusses the experimental procedure along with the gene/feature selection techniques. Section 3 describes the optimization techniques, and Section 4 describes the classifiers. Section 5 presents the results and discussion, and Section 6 concludes the paper.
2. Materials and Methods
For the leukemia classification, a publicly available dataset was used. It contains two types of leukemia: 25 samples of acute myeloblastic leukemia (AML) and 47 samples of acute lymphoblastic leukemia (ALL). The details of the dataset are tabulated in Table 1.
The illustration of the work is shown in Figure 1.
2.1. Techniques to Select the Genes
The gene selection techniques utilized in this paper are as follows. The intention of this procedure is to shortlist the best 2000 genes from 7129 genes.
2.1.1. Minimum Redundancy–Maximum Relevance (MRMR)
MRMR selects features that have maximum relevance to the class while minimizing the redundancy among the selected features. For discrete datasets, MRMR measures relevance with a mutual information criterion.
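For a feature $g_i$, the F-test value is expressed by

$$F(g_i, c) = \frac{\sum_{k=1}^{K} n_k \left(\bar{g}_{i,k} - \bar{g}_i\right)^2 / (K-1)}{\sigma^2},$$

where $c = \{c_1, \ldots, c_K\}$ is the class set, $\bar{g}_i$ represents the mean of $g_i$, $\bar{g}_{i,k}$ expresses the mean of $g_i$ for class $c_k$, and $\sigma^2 = \sum_{k} (n_k - 1)\sigma_k^2 / (n - K)$ represents the pooled variance for given size $n_k$ and variance $\sigma_k^2$ of class $c_k$, with $n$ the total number of samples. For feature subset $S$, the maximum relevance criterion is expressed by

$$\max_{S} V_F, \qquad V_F = \frac{1}{|S|} \sum_{g_i \in S} F(g_i, c).$$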
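The first feature is selected by this criterion, and the remaining features are selected by a linear incremental search based on an optimization function. For continuous variables, two popular linear search schemes are MRMR-FDM (F-test distance multiplicative) and MRMR-FSQ (F-test similarity quotient).
For MRMR-FDM, the optimization condition is expressed as

$$\max_{g_i \in \Omega_S} \left[ F(g_i, c) \cdot \frac{1}{|S|} \sum_{g_j \in S} d(g_i, g_j) \right],$$

where $F(g_i, c)$ is the F-test value of feature $g_i$, $S$ is the set of features already selected, $\Omega_S$ is the set of features not yet selected, and $d(g_i, g_j)$ is the Euclidean distance between features $g_i$ and $g_j$.
For MRMR-FSQ optimization,

$$\max_{g_i \in \Omega_S} \left[ F(g_i, c) \Big/ \frac{1}{|S|} \sum_{g_j \in S} \frac{1}{d(g_i, g_j)} \right].$$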
2.1.2. Signal to Noise Ratio
The Pearson Correlation Coefficient (PCC) is an important measure of gene significance. It is modified to give the SNR, which indicates how useful a gene is as a predictor and measures the predictor strength of a particular gene. For a gene $g$, the SNR is calculated as

$$\mathrm{SNR}(g) = \frac{\mu_1(g) - \mu_2(g)}{\sigma_1(g) + \sigma_2(g)}.$$

The mean of the normal samples is expressed by $\mu_1(g)$, and the mean of the tumor samples is expressed by $\mu_2(g)$. $\sigma_1(g)$ and $\sigma_2(g)$ are the standard deviations of the normal and tumor samples, respectively. This value expresses the difference between the classes relative to the standard deviation within the classes. Larger values of $|\mathrm{SNR}(g)|$ indicate a strong correlation between the class distinction and the gene expression. A positive or a negative value of $\mathrm{SNR}(g)$ corresponds to the gene being more highly expressed in class 1 or in class 2, respectively. Genes with a very large SNR value are highly informative and are therefore selected for cancer classification.
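As a minimal sketch of this ranking step, the snippet below scores each gene by the difference of the class means divided by the sum of the class standard deviations and keeps the top-scoring genes. The function names, the toy expression matrix, and the choice to rank by absolute value are illustrative assumptions, not the authors' implementation.

```python
from statistics import mean, stdev

def snr(class1, class2):
    """SNR of one gene: (mean1 - mean2) / (std1 + std2)."""
    spread = stdev(class1) + stdev(class2)
    if spread == 0:
        return 0.0  # assumption: a gene with no variation carries no signal
    return (mean(class1) - mean(class2)) / spread

def rank_genes_by_snr(expr, labels, top_k):
    """expr[g][s] is the expression of gene g in sample s; labels[s] is 0 or 1.

    Ranking by |SNR| is an assumption made here so that genes that are
    highly expressed in either class are both kept.
    """
    scores = []
    for g, row in enumerate(expr):
        c1 = [v for v, y in zip(row, labels) if y == 0]
        c2 = [v for v, y in zip(row, labels) if y == 1]
        scores.append((abs(snr(c1, c2)), g))
    scores.sort(reverse=True)
    return [g for _, g in scores[:top_k]]

# Toy data (made up): 3 genes, 6 samples, the first three samples in class 0.
expr = [
    [1.0, 1.2, 0.9, 5.0, 5.1, 4.8],  # strongly differential
    [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # uninformative
    [3.0, 3.5, 2.5, 4.0, 4.5, 3.5],  # weakly differential
]
labels = [0, 0, 0, 1, 1, 1]
top_genes = rank_genes_by_snr(expr, labels, top_k=2)
```

In the setting of this paper the same ranking would be applied to all 7129 genes with `top_k=2000`.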
2.1.3. Multivariate Error-Weighted Uncorrelated Shrunken Centroid (EWUSC)
This technique was developed from the Shrunken Centroid (SC) and Uncorrelated Shrunken Centroid (USC) methods. The shrunken centroid is found by dividing the average expression of each gene in every class by the standard deviation of that gene in the same class. Genes whose expression is similar across the samples of the same class are assigned a higher weight. New samples are assigned, using squared distance, to the label with the nearest average pattern. The USC approach removes redundant features by tracing the genes that are highly correlated within the set of genes found by SC. EWUSC uses both of these steps and adds error weights, so that redundant genes are removed and noisy genes are downweighted.
2.1.4. Multivariate Correlation-Based Feature Selection (CFS)
A good feature subset contains features that are highly correlated with the class but uncorrelated with each other. CFS evaluates a subset by considering the predictive ability of every individual feature along with the degree of redundancy among the features. The main advantage of this technique is that a “heuristic merit” is computed for a feature subset rather than for individual features. The search can therefore be guided by this merit, choosing at each step the option that maximizes it.
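The merit of a subset can be illustrated with a short sketch. It assumes the merit commonly used for CFS, $k\,\bar{r}_{cf} / \sqrt{k + k(k-1)\,\bar{r}_{ff}}$, where $\bar{r}_{cf}$ is the mean feature-class correlation and $\bar{r}_{ff}$ the mean feature-feature correlation, and it uses the absolute Pearson correlation for both. The formula, the function names, and the toy data are assumptions; the text above does not state which correlation measure was used.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0

def cfs_merit(features, target):
    """Heuristic merit of a feature subset (assumed CFS form, see lead-in)."""
    k = len(features)
    r_cf = sum(abs(pearson(f, target)) for f in features) / k
    if k == 1:
        return r_cf
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    r_ff = sum(abs(pearson(features[i], features[j])) for i, j in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)

# Toy data (made up): a redundant copy adds nothing, a complementary gene helps.
target = [0, 0, 0, 1, 1, 1]
gene_a = [1, 2, 1, 5, 6, 5]
gene_b = [2, 0, 1, 4, 3, 5]
merit_redundant = cfs_merit([gene_a, gene_a], target)
merit_complementary = cfs_merit([gene_a, gene_b], target)
```

Adding an exact copy of a gene leaves the merit equal to that gene's own class correlation, whereas adding a less correlated but still relevant gene raises it, which is the behaviour the paragraph describes.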
3. Optimization Techniques
The 2000 shortlisted genes then undergo a second level of feature selection by means of optimization techniques, so that the best 50, 100, and 200 genes are finally retained; this is what is referred to as dual level analysis in this work. The feature selection is performed with the following five optimization algorithms.
3.1. African Buffalo Optimization Algorithm
ABO is utilized to find the best solution in the search space. The buffaloes are initialized within the herd population. They then search for the global optimum by updating their locations as they follow the current best buffalo in the herd. Each buffalo keeps track of the coordinates in the problem space at which it achieved its best fitness value. The location of the best buffalo of the herd with respect to the optimal solution is termed $bg_{\max}$, and the best location found by an individual buffalo $k$ is termed $bp_{\max,k}$. The location of every buffalo is updated dynamically as it progresses towards $bg_{\max}$ and $bp_{\max,k}$, depending on which of the two is given more importance. The learning parameters have a great effect on the speed of each animal.
The ABO algorithm steps are as follows:
(1) Initialization: the buffaloes are randomly placed at different nodes of the solution space.
(2) Buffalo fitness value update: the fitness value is updated as

$$m_k' = m_k + lp_1\,(bg_{\max} - w_k) + lp_2\,(bp_{\max,k} - w_k), \tag{6}$$

where $w_k$ and $m_k$ indicate the exploration and exploitation moves of buffalo $k$, $lp_1$ and $lp_2$ are learning factors, $bg_{\max}$ is the best fitness of the herd, and $bp_{\max,k}$ denotes the best location found by the individual buffalo $k$.
(3) The location of the buffalo is updated using

$$w_k' = \frac{w_k + m_k}{\lambda}, \tag{7}$$

where $\lambda$ is a scaling constant.
(4) If $bg_{\max}$ is being updated, proceed to step (5); otherwise go to step (2).
(5) Check the stopping criteria. If they are not met, go back to step (3); otherwise go to step (6).
(6) The best solution is taken as the output.
The update equation (6) of the buffalo has three parts. The first part represents the memory of the buffalo's past location; the buffalo has a good memory that lets it recall the places it has been before. This ability is important because it helps the search for the best solutions by avoiding areas that gave poor results. The memory of each buffalo provides a list of solutions that serve as alternatives to the present local maximum location. The second part represents the cooperative, social nature of the buffaloes, such as guarding each other, sharing information, and sensing danger. The third part represents the intelligence of the buffaloes. Equation (6) therefore represents the memory, socialization, and intelligence of a buffalo together. Equation (7) moves the buffaloes in search of a better environment when the present environment has been fully explored and exploited or has become unfavorable.
The main strengths of the ABO algorithm are its very fast convergence rate and its small number of parameters. The best buffalo can be easily identified in each iteration. Adequate exploration is ensured by tracking the location and phase of the best buffalo, and exploiting the areas found by other buffaloes improves the exploration further.
3.1.1. Initialization and Updation of Speed and Location
In the initialization phase, the buffaloes are placed randomly in the solution space. Prior knowledge of the problem can help the algorithm converge in fewer iterations. In each iteration, the location of every buffalo is updated based on its previous best location and on information gathered from the exploits of the neighboring buffaloes. With this model, the algorithm can track the movement of the buffaloes towards an optimal solution.
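The update cycle described in this section can be sketched on a continuous toy problem. The code below is a simplified illustration rather than the gene selection pipeline of this work: it assumes a move update that pulls each buffalo towards the herd best and its own best with learning factors `lp1` and `lp2`, a location update that adds the move to the current location and divides by a constant `lam`, and it omits the re-initialization of the herd when the herd best stops improving. All names and parameter values are hypothetical.

```python
import random

def abo_minimize(f, dim, lo, hi, n_buffalo=20, iters=100,
                 lp1=0.6, lp2=0.4, lam=1.0, seed=0):
    """Minimise f over the box [lo, hi]^dim with a simplified ABO loop."""
    rng = random.Random(seed)
    # w: buffalo locations (exploration), m: move values (exploitation)
    w = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_buffalo)]
    m = [[0.0] * dim for _ in range(n_buffalo)]
    bp = [row[:] for row in w]            # best location of each buffalo
    bp_fit = [f(row) for row in w]
    g = min(range(n_buffalo), key=lambda k: bp_fit[k])
    bg, bg_fit = bp[g][:], bp_fit[g]      # best location of the herd
    for _ in range(iters):
        for k in range(n_buffalo):
            for j in range(dim):
                # move update: memory + cooperation + intelligence terms
                m[k][j] += lp1 * (bg[j] - w[k][j]) + lp2 * (bp[k][j] - w[k][j])
                # location update, clipped to the search box (clipping is an assumption)
                w[k][j] = min(hi, max(lo, (w[k][j] + m[k][j]) / lam))
            fit = f(w[k])
            if fit < bp_fit[k]:
                bp[k], bp_fit[k] = w[k][:], fit
            if fit < bg_fit:
                bg, bg_fit = w[k][:], fit
    return bg, bg_fit

def sphere(x):
    return sum(v * v for v in x)

best, best_fit = abo_minimize(sphere, dim=3, lo=-5.0, hi=5.0)
```

For gene selection, the continuous location vector would have to be mapped to a subset of genes and `f` replaced by a subset quality measure; that mapping is not specified in this section.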
3.2. Artificial Bee Colony Algorithm
This optimization procedure combines global and local search in a multidimensional space and is based on the foraging activity of bees for nectar; its steps are explained in Algorithm 1. Food sources span the entire variable space, and each food source is regarded as a point in that space. The ABC method maximizes the objective function over these points, just as bees locate the food source with the highest nectar content. In the ABC optimization problem, artificial bees wander through an artificial multidimensional space to trace the source that yields the most nectar, which corresponds to the optimal solution of the objective function. The search is carried out by simulating the food foraging procedure of a bee colony in an artificial computing environment. A random population of initial food sources $x_i$, $i = 1, \ldots, N$, where $N$ indicates the colony size, is generated in the entire variable space as

$$x_{ij} = x_j^{\min} + r\,(x_j^{\max} - x_j^{\min}), \tag{8}$$

where $r$ is a random number in the range [0,1], and $x_j^{\min}$ and $x_j^{\max}$ are the lower and upper bounds of the $j$th variable. Three types of bees are used, each performing a different task. Each employed bee considers a food source from its memory and then seeks a new food source in its neighbourhood. Any neighbourhood operator can be used for this purpose. Here, a food source uniformly distributed around the present memory location $x_i$ is generated as

$$v_{ij} = x_{ij} + \phi\,(x_{ij} - x_{kj}), \tag{9}$$

where the randomly selected food source is expressed as $x_k$ with $k \neq i$, and $\phi$ is a random number in $[-1, 1]$. The newly created food source $v_i$ is then compared with $x_i$, and the better of the two is placed in the memory of the employed bee. In our experiment, the total number of employed bees is set to 60% of the total food sources. The employed bees share the food source information stored in their memories with the onlooker bees, which wait in the hive and observe the foraging of the employed bees. An onlooker bee chooses a food source located by an employed bee probabilistically, in proportion to the nectar content of the food source: the higher the nectar content, the higher the probability of being chosen. The selected food source is then modified within its neighbourhood in the same way as in equation (9), and the onlooker bee keeps in memory only the better of the two food sources. The number of onlooker bees is generally set to half of the food sources. Finally, scout bees, the third kind of bee, choose a food source location randomly using equation (8) and act as global overseers. If an employed bee cannot improve its memory location within a predefined number of trials, it becomes a scout bee, and its memory location is reinitialized randomly in the variable space. The number of scout bees is assumed to be 1 in our experiment, and the algorithm runs for a maximum number of generations. Each food source is associated with exactly one employed or onlooker bee, so that each bee holds a single food source. The algorithm is also used for other types of optimization, such as combinatorial optimization, multiobjective optimization, and integer programming.
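The three bee roles can be sketched as follows on a continuous toy problem. This is a simplified illustration, not the configuration used in the experiments: it uses one employed bee and one onlooker draw per food source (instead of the 60% and 50% proportions mentioned above), a single scout per cycle, a nectar value of `1 / (1 + cost)` that assumes a non-negative cost, and a hypothetical trial limit `limit`.

```python
import random

def abc_minimize(f, dim, lo, hi, n_sources=10, limit=20, iters=100, seed=0):
    """Minimise a non-negative cost f over [lo, hi]^dim with a basic ABC loop."""
    rng = random.Random(seed)

    def new_source():
        # random initial food source: lower bound + rand * (upper - lower)
        return [lo + rng.random() * (hi - lo) for _ in range(dim)]

    xs = [new_source() for _ in range(n_sources)]
    costs = [f(x) for x in xs]
    trials = [0] * n_sources
    best_cost = min(costs)
    best = xs[costs.index(best_cost)][:]

    def try_neighbour(i):
        nonlocal best, best_cost
        k = rng.choice([s for s in range(n_sources) if s != i])
        j = rng.randrange(dim)
        cand = xs[i][:]
        # neighbourhood move relative to a randomly selected other source
        cand[j] += rng.uniform(-1.0, 1.0) * (xs[i][j] - xs[k][j])
        cand[j] = min(hi, max(lo, cand[j]))
        c = f(cand)
        if c < costs[i]:                      # greedy selection
            xs[i], costs[i], trials[i] = cand, c, 0
            if c < best_cost:
                best, best_cost = cand[:], c
        else:
            trials[i] += 1

    for _ in range(iters):
        for i in range(n_sources):            # employed bees
            try_neighbour(i)
        nectar = [1.0 / (1.0 + c) for c in costs]
        for _ in range(n_sources):            # onlooker bees, nectar-proportional choice
            try_neighbour(rng.choices(range(n_sources), weights=nectar)[0])
        stale = max(range(n_sources), key=lambda s: trials[s])
        if trials[stale] > limit:             # one scout bee re-initialises a stale source
            xs[stale] = new_source()
            costs[stale], trials[stale] = f(xs[stale]), 0
            if costs[stale] < best_cost:
                best, best_cost = xs[stale][:], costs[stale]
    return best, best_cost

def sphere(x):
    return sum(v * v for v in x)

abc_best, abc_cost = abc_minimize(sphere, dim=3, lo=-5.0, hi=5.0)
```

The best solution is stored separately from the food sources so that a scout re-initialization cannot discard it.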
3.3. Cockroach Swarm Optimization Algorithm
CSO was inspired by the behaviour of cockroaches searching for food, such as moving in swarms, dispersing, and escaping from light. The CSO algorithm models the collective behaviour of cockroaches with a set of rules. Its first step creates a set of feasible solutions, which are generated randomly in the search space. To solve optimization problems, the CSO algorithm uses three procedures: (i) chase-swarming, (ii) dispersing, and (iii) ruthless behaviour.
3.3.1. Chase-Swarming Phase
In this phase, the strongest cockroaches carry the local best solutions and form small swarms, which then move towards the global optimum. Each individual moves towards the local optimum within its range of visibility. While the cockroaches move in small groups, a particular cockroach can become the strongest by finding a better solution. A lonely cockroach is the local optimum within its own visibility scope, and it moves towards the global best solution.
3.3.2. Dispersion of Individual Phase
Dispersion is performed from time to time to preserve the diversity of the cockroaches. In this phase, each cockroach takes a random step in the search space.
3.3.3. Ruthless Behaviour Phase
Here, the current best individual replaces a randomly chosen individual. This models the situation in which food is scarce and the stronger cockroaches eat the weaker ones, and so it is termed ruthless behaviour.
The steps are as follows:
(Step 1) A population of $N$ individuals is generated, and the algorithm parameters are initialized (step, space dimension $D$, visual scope, and stopping criterion).
(Step 2) The local optimum $p_i$ within the visual scope of each individual and the global optimum $p_g$ are searched for.
(Step 3) Chase-swarming behaviour is implemented, and $p_g$ is updated at the end. If a cockroach $x_i$ is itself the local optimum, it moves towards $p_g$ according to $x_i \leftarrow x_i + \mathrm{step} \cdot r \cdot (p_g - x_i)$, where $r$ is a random number within [0,1]. Otherwise, the cockroach moves towards $p_i$ according to

$$x_i \leftarrow x_i + \mathrm{step} \cdot r \cdot (p_i - x_i),$$

where $p_i$ lies within its own visibility range.
(Step 4) The dispersing procedure is implemented, and $p_g$ is updated.
(Step 5) The ruthless procedure is implemented: a randomly chosen individual $x_k$, where $k$ is a random integer in $[1, N]$, is replaced by the current best individual.
(Step 6) Steps 2–5 are repeated until a termination criterion is satisfied, and the final results are output. The stopping criteria include the computation time, reaching a minimum solution error, and a maximum number of iterations.
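The six steps can be sketched as follows on a continuous toy problem. The sketch assumes that a cockroach moves a random fraction of `step` towards its target, that the dispersing step adds a uniform random value in [-1, 1] to every coordinate, and that the ruthless step overwrites one random individual with the global best. These choices, the parameter values, and all names are illustrative assumptions.

```python
import math
import random

def cso_minimize(f, dim, lo, hi, n=20, step=1.0, visual=3.0, iters=100, seed=0):
    """Minimise f over [lo, hi]^dim with a simplified cockroach swarm loop."""
    rng = random.Random(seed)

    def clip(v):
        return min(hi, max(lo, v))

    xs = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n)]
    pg = min(xs, key=f)[:]                     # global best found so far
    for _ in range(iters):
        # chase-swarming: follow the local best, or the global best if already local best
        for i in range(n):
            visible = [x for x in xs if math.dist(x, xs[i]) <= visual]
            p_local = min(visible, key=f)      # always includes xs[i] itself
            target = pg if p_local is xs[i] else p_local
            r = rng.random()
            xs[i] = [clip(a + step * r * (t - a)) for a, t in zip(xs[i], target)]
            if f(xs[i]) < f(pg):
                pg = xs[i][:]
        # dispersing: every individual takes a random step
        for i in range(n):
            xs[i] = [clip(a + rng.uniform(-1.0, 1.0)) for a in xs[i]]
            if f(xs[i]) < f(pg):
                pg = xs[i][:]
        # ruthless behaviour: a random individual is replaced by the current best
        xs[rng.randrange(n)] = pg[:]
    return pg, f(pg)

def sphere(x):
    return sum(v * v for v in x)

cso_best, cso_cost = cso_minimize(sphere, dim=3, lo=-5.0, hi=5.0)
```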
3.4. Imperialist Competitive Algorithm
ICA is a widely used population-based metaheuristic. Each individual in the population represents a country, and in the initialization process some of the best countries are selected as imperialists. The imperialists and their colonies form the initial empires, and new solutions are generated by the assimilation and revolution of colonies, competition among the imperialists, and the exchange of positions between colonies and imperialists.
The procedure is as follows:
(1) Initialization: an initial population $P$ is generated.
(2) Initial empire construction: the cost is computed for every individual, and all the solutions in $P$ are sorted in descending order. The $N_{imp}$ best solutions in $P$ are selected as imperialists, and the remaining countries are assigned to the imperialists.
(3) Assimilation of colonies is executed for every empire, and then the revolution of some colonies is performed. Where possible, the positions of colony and imperialist are exchanged.
(4) Imperialist competition is carried out.
(5) Any empire without colonies is eliminated.
(6) If the termination criteria are not met, the procedure goes back to step (3).
(7) If the termination criteria are met, the search is stopped.
The cost of a country is calculated from the objective function; the better the solution, the lower the cost. The $N_{imp}$ best solutions with the least cost are considered as imperialists, and the rest of the countries form the colonies. There are $N_{col}$ colonies in total, with $N_{col} = N - N_{imp}$, where $N$ is the population size.
The initial empires are formed by assigning colonies to the imperialists in proportion to the imperialist power, which is given by

$$p_n = \left| \frac{C_n}{\sum_{i=1}^{N_{imp}} C_i} \right|, \qquad C_n = c_n - \max_i \{c_i\},$$

where $p_n$ denotes the power of imperialist $n$ and $C_n$ denotes its normalized cost, and here, $c_n$ specifies the cost of imperialist $n$. The number of initial colonies managed by imperialist $n$ is calculated as $NC_n = \mathrm{round}(p_n \cdot N_{col})$, where $N_{col}$ is the total number of colonies and the function round returns the nearest integer of a fractional number.
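The initial empire construction can be illustrated with a small numeric sketch. It assumes the usual ICA convention in which the normalized cost of an imperialist is its cost minus the largest imperialist cost, the power is the absolute share of that normalized cost, and the colony count is the rounded product of power and number of colonies. The function name, the handling of rounding drift, and the cost values are hypothetical.

```python
def initial_empires(costs, n_imp):
    """Return imperialist indices, their powers, and their initial colony counts."""
    order = sorted(range(len(costs)), key=lambda i: costs[i])  # lowest cost first
    imperialists = order[:n_imp]
    n_col = len(costs) - n_imp
    worst = max(costs[i] for i in imperialists)
    norm = [costs[i] - worst for i in imperialists]            # normalized costs (<= 0)
    total = sum(norm)
    if total == 0:
        power = [1.0 / n_imp] * n_imp                          # all imperialists equal
    else:
        power = [abs(c / total) for c in norm]
    counts = [round(p * n_col) for p in power]
    # assumption: any rounding drift is absorbed by the strongest imperialist
    counts[0] += n_col - sum(counts)
    return imperialists, power, counts

# Ten countries with made-up costs, three imperialists.
country_costs = [1.0, 2.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0]
imps, power, counts = initial_empires(country_costs, n_imp=3)
```

With this convention the weakest imperialist has zero normalized cost and therefore starts without colonies, which is why step (5) of the procedure removes empty empires.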
The total number of colonies of imperialist $n$ is expressed by $NC_n$. In the process of assimilation, each colony in an empire moves along the direction towards its imperialist. The moving distance $x$ is a random number uniformly distributed in the interval $[0, \beta d]$, where $\beta > 1$ and $d$ is the distance between the imperialist and the colony. The colony moves towards the imperialist if $\beta > 1$. However, the colonies are not absorbed by the imperialist in a direct movement, so a deviation from the direct line occurs. The deviation is represented by $\theta$, which follows a uniform distribution in $[-\gamma, \gamma]$, where $\gamma$ is an arbitrary parameter. Revolution is a change in the position of some colonies caused by unexpected changes in their characteristics. For instance, a change in the language or religion of a colony changes its characteristics and hence its position. Revolution in ICA is similar to mutation in the genetic algorithm: it increases exploration and prevents early convergence to local optima. Once assimilation and revolution have been carried out in an empire, the cost of each colony is compared with that of its imperialist. If a colony has a lower cost than the imperialist, the colony and the imperialist are swapped. Imperialist competition is determined by the total power of each empire. Let $TC_n$ be the total cost of empire $n$; for each empire $n$, $TC_n$ is initially calculated as

$$TC_n = \mathrm{cost}(\mathrm{imperialist}_n) + \xi \cdot \mathrm{mean}\{\mathrm{cost}(\mathrm{colonies\ of\ empire\ } n)\},$$

where $\xi$ represents a positive number between 0 and 1, and it is close to 0. For empire $n$, the normalized total cost $NTC_n$ and the power $PP_n$ are computed as

$$NTC_n = TC_n - \max_i \{TC_i\}, \qquad PP_n = \left| \frac{NTC_n}{\sum_{i=1}^{N_{imp}} NTC_i} \right|.$$

After a vector $D = [PP_1 - r_1, PP_2 - r_2, \ldots, PP_{N_{imp}} - r_{N_{imp}}]$ is defined, the weakest colony of the weakest empire is assigned to the empire with the largest entry in $D$, where $r_i$ represents a random number with uniform distribution in the range [0,1].
3.5. Social Spider Optimization Algorithm
SSO is a recent metaheuristic algorithm that has attracted considerable attention. In this algorithm, the search space is assumed to be a communal spider web, and each candidate solution in the population represents a spider. Each spider receives a weight based on its fitness value. The cooperative behaviours in the colony are simulated by two different sets of evolutionary search operators. The algorithm is designed to solve a nonlinear global optimization problem with box constraints of the form

$$\text{optimize } f(x), \quad x = (x^1, \ldots, x^n) \in \mathbb{R}^n, \quad \text{subject to } x \in X,$$

where $f: \mathbb{R}^n \to \mathbb{R}$ is a nonlinear function, and $X = \{x \in \mathbb{R}^n \mid l_h \le x^h \le u_h,\ h = 1, \ldots, n\}$ is a feasible space bounded by the lower limits $l_h$ and the upper limits $u_h$. To solve this optimization problem, SSO uses a population $S$ of $N$ candidate solutions. Each solution represents a spider position, whereas the general web represents the search space. The population is divided into two groups of search agents: $M$ represents the males and $F$ represents the females. Because the aim is to simulate a real spider colony, the number of females $N_f$ is selected randomly in the range of 60-70% of the entire population $N$, and the rest are considered as male individuals, $N_m = N - N_f$. Under this constraint, the group of female individuals is formed as $F = \{f_1, f_2, \ldots, f_{N_f}\}$ and the group of male individuals as $M = \{m_1, m_2, \ldots, m_{N_m}\}$. Each spider has a weight based on its solution fitness, calculated as

$$w_i = \frac{J(s_i) - \mathrm{worst}_S}{\mathrm{best}_S - \mathrm{worst}_S},$$

where $J(s_i)$ represents the fitness value of the spider solution $s_i$, $i = 1, \ldots, N$, $\mathrm{best}_S$ indicates the best fitness value, and $\mathrm{worst}_S$ indicates the worst fitness value of the whole population $S$. The main mechanism of SSO is the exchange of information during the optimization process, which is simulated through the vibrations of the web. The vibration that spider $i$ receives from spider $j$ is modelled as

$$Vib_{i,j} = w_j\, e^{-d_{i,j}^2},$$

where the weight of spider $j$ is $w_j$, and the distance between the two spiders is $d_{i,j}$. Each spider can perceive three types of vibrations: $Vibc_i$, $Vibb_i$, and $Vibf_i$. $Vibc_i$ is the vibration produced by the nearest spider with a higher weight. $Vibf_i$ is produced by the closest female spider and is applied only if $i$ is a male spider. $Vibb_i$ is produced by the best spider in the population $S$. SSO initially operates a population of $N$ spiders over a total number of iterations. Different sets of evolutionary operators are assigned to each individual based on its gender.
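The weight and vibration computations can be sketched in a few lines. The sketch assumes that the weight is the fitness rescaled linearly so that the worst spider gets 0 and the best gets 1, and that a vibration is the sender's weight attenuated by the exponential of the negative squared Euclidean distance. The function names and the `maximize` switch are hypothetical.

```python
import math

def spider_weights(fitness, maximize=True):
    """Rescale fitness values to weights in [0, 1] (worst -> 0, best -> 1)."""
    best = max(fitness) if maximize else min(fitness)
    worst = min(fitness) if maximize else max(fitness)
    if best == worst:
        return [1.0] * len(fitness)  # assumption: identical spiders share full weight
    return [(j - worst) / (best - worst) for j in fitness]

def vibration(weight_j, pos_i, pos_j):
    """Vibration that spider i perceives from spider j."""
    d = math.dist(pos_i, pos_j)
    return weight_j * math.exp(-d * d)

weights = spider_weights([1.0, 3.0, 5.0])
near = vibration(weights[2], [0.0, 0.0], [0.5, 0.0])
far = vibration(weights[2], [0.0, 0.0], [2.0, 0.0])
```

A heavy spider that is far away can therefore be perceived more weakly than a lighter spider nearby, which is what lets the nearest-heavier and global-best vibrations pull in different directions.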
For female spiders, the new position $f_i^{k+1}$ is obtained by modifying the current position $f_i^k$ of spider $i$. The modification is controlled randomly by a probability factor $PF$, and the movement is produced in relation to other spiders according to the vibrations they transmit through the search space:

$$f_i^{k+1} = f_i^k \pm \alpha\, Vibc_i\,(s_c - f_i^k) \pm \beta\, Vibb_i\,(s_b - f_i^k) + \delta\,(\mathrm{rand} - \tfrac{1}{2}),$$

where the plus signs are taken with probability $PF$ and the minus signs with probability $1 - PF$, $\alpha$, $\beta$, $\delta$, and rand are random numbers in the range [0,1], and $k$ denotes the iteration number. $s_c$ and $s_b$ are the nearest spider and the best spider, respectively, and $Vibc_i$ and $Vibb_i$ are the vibrations perceived from them. Male spiders are classified into two types, dominant ($D$) and nondominant ($ND$). Mating takes place only between dominant males and female individuals within a specific range $r$, and the new individual $s_{new}$ is defined according to the weight of each spider involved: heavier spiders have a higher probability of influencing the new individual. Once the new spider is generated, it is compared with the rest of the population. If the new spider has a better fitness value than the worst member of the population, the worst spider is replaced by $s_{new}$; otherwise the new spider is discarded.
4. Classification Procedures
The optimized values are then classified with the following classifiers.
The main assumption of the Naive Bayes (NB) classifier is that each feature is independent of the other features. The optimized genes therefore contribute independently to the probability of belonging to a specific class. These classifiers require only a small number of training samples to estimate the parameters needed for classification. NB is a fast and efficient classifier for supervised learning problems.
The main idea of the Support Vector Machine (SVM) is to select the hyperplane that is equidistant from the classes, so that the classes are separated with maximum margin. The support vectors are the training samples that lie on the margin boundary once the hyperplane is defined. The tolerance of the classifier to classification errors is controlled by hyperparameters, which thereby control the generalization capability of the model. For two-class classification, a new sample is classified according to the side of the hyperplane on which it falls. For multiclass classification, SVM builds several binary classifiers, their number depending on the number of classes, and a voting system among them assigns each new sample to the most voted class.
The Random Forest (RF) algorithm builds a forest of classification trees by growing many individual trees. To classify an input vector, the vector is presented to each tree of the forest. Each tree produces a classification, and the class with the largest number of votes over all the trees is chosen by a standard voting scheme.
The Probabilistic Neural Network (PNN) is an implementation of a statistical algorithm called kernel discriminant analysis, in which the operations are organized into a multilayered feedforward network. PNN needs only one epoch of training. Its main drawback is that storing the training samples consumes a lot of memory, so the recall computation slows down as more samples are stored.
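The single-pass nature of PNN can be seen in a minimal sketch: training only stores the samples, and prediction sums a Gaussian kernel over the stored samples of each class and picks the class with the largest average. The Gaussian kernel, the smoothing parameter `sigma`, the function name, and the toy data are assumptions for illustration.

```python
import math

def pnn_predict(train_x, train_y, x, sigma=1.0):
    """Classify x by the largest class-wise average of Gaussian kernel values."""
    sums, counts = {}, {}
    for xi, yi in zip(train_x, train_y):
        d2 = sum((a - b) ** 2 for a, b in zip(x, xi))          # squared distance to pattern
        sums[yi] = sums.get(yi, 0.0) + math.exp(-d2 / (2.0 * sigma ** 2))
        counts[yi] = counts.get(yi, 0) + 1
    # every stored sample is visited once per prediction, hence the memory and recall cost
    return max(sums, key=lambda c: sums[c] / counts[c])

# Toy two-gene samples (made up) labelled with the two leukemia types.
train_x = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
train_y = ["ALL", "ALL", "AML", "AML"]
label_a = pnn_predict(train_x, train_y, [0.2, 0.3])
label_b = pnn_predict(train_x, train_y, [5.5, 5.2])
```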
5. Results and Discussion
Classification is performed with 10-fold cross-validation, and the performance is shown in the tables below. The formulae for the Performance Index (PI), Sensitivity, Specificity, and Accuracy follow the literature and are used to compute the reported values. In the expressions below, PC is Perfect Classification, MC is Missed Classification, and FA is False Alarm.
The sensitivity is expressed as
Specificity is expressed as
Accuracy is expressed as
Performance Index (PI) is expressed as
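As a minimal sketch, the four measures can be computed from the PC, MC, and FA counts in Python. The formulas used in the code are an assumption: they follow a convention commonly used with these three counts (Sensitivity = PC / (PC + FA) × 100, Specificity = PC / (PC + MC) × 100, Accuracy = (Sensitivity + Specificity) / 2, PI = (PC − MC − FA) / PC × 100) and are not reproduced from the expressions of this work. The function name `performance_metrics` and the example counts are hypothetical.

```python
def performance_metrics(pc, mc, fa):
    """Return sensitivity, specificity, accuracy and PI in percent.

    pc: perfect classifications, mc: missed classifications, fa: false alarms.
    The formulas below are an assumed, commonly used convention.
    """
    if pc <= 0:
        raise ValueError("pc must be positive for these ratios to be defined")
    sensitivity = 100.0 * pc / (pc + fa)
    specificity = 100.0 * pc / (pc + mc)
    accuracy = (sensitivity + specificity) / 2.0
    performance_index = 100.0 * (pc - mc - fa) / pc
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "pi": performance_index,
    }


# Illustrative counts only; they are not results from this study.
metrics = performance_metrics(pc=80, mc=20, fa=0)
print(metrics)
```

Under these assumed definitions, PI falls faster than accuracy as MC and FA grow.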
Table 2 shows the average performance of the classifiers, in terms of classification accuracy, with ABO for the different gene selection techniques using 50–200 selected genes. As depicted in Table 2, the PNN classifier with 50 genes selected by SNR and the PNN classifier with 200 genes selected by multivariate EWUSC both attained a high accuracy of 92.97%. The SVM classifier with 100 genes selected by multivariate CFS reached a low accuracy of 75.96%. This low accuracy is due to the high false alarm rate of the SVM classifier.
Table 3 shows the average performance of the classifiers, in terms of classification accuracy, with ABCO for the different gene selection techniques using 50–200 selected genes. As shown in Table 3, the PNN classifier with 200 genes selected by SNR exhibits a high accuracy of 91.47%. The SVM classifier with 200 genes selected by multivariate EWUSC drops to a low accuracy of 75.7581%.
Table 4 shows the average performance of the classifiers, in terms of classification accuracy, with CSO for the different gene selection techniques using 50–200 selected genes. As seen in Table 4, the NBC classifier with 50 genes selected by SNR demonstrates a high accuracy of 92.19%. The PNN classifier with 50 genes selected by multivariate EWUSC reaches a low accuracy of 75.75%.
Table 5 shows the average performance of the classifiers, in terms of classification accuracy, with ICO for the different gene selection techniques using 50–200 selected genes. Table 5 reports that the RF classifier with 50 genes selected by multivariate CFS attained a high accuracy of 92.45%. The PNN classifier with 50 genes selected by multivariate EWUSC reaches a low accuracy of 75.625%.
Table 6 shows the average performance of the classifiers, in terms of classification accuracy, with SSO for the different gene selection techniques using 50–200 selected genes. Table 6 shows that the PNN classifier with 200 genes selected by multivariate CFS attained the highest accuracy of 95.705%. The RF classifier with 50 genes selected by multivariate CFS achieved a lower accuracy of 75.875%.
Figure 2 shows the averaged Performance Index (PI) of the four classifiers for the five different optimization methods. For the NBC classifier, the higher PI with 50 selected genes is 56.94% with CSO, with 100 genes it is 59.03% with ABO, and with 200 genes it peaks at the highest PI of 74.33%, also with ABO. For the SVM classifier, ABCO gives a high PI of 56.33% with 50 genes, CSO gives a PI of 55.93% with 100 genes, and SSO reports a high PI of 63.64% with 200 genes. For the RF classifier, ABCO attains a high PI of 54.68% with 50 genes, ABO provides the higher PI of 51.56% with 100 genes, and SSO exhibits a high PI of 57.41% with 200 genes. For the PNN classifier, ABO arrives at a high PI of 51.68% with 50 genes and at a higher PI of 69.32% with 100 genes, while ABCO reaches a good PI of 57.32% with 200 genes. Because the values are averaged across the four features, the classifiers show better and smoother PI values. ISO algorithms demonstrate the smoothing effect across the classifiers.
6. Conclusion and Future Work
Microarray-based classification is very useful for the diagnosis, analysis, and treatment of cancer. In recent years, the microarray technique has had great impact and utility in determining the most informative genes that can cause cancer. The curse of dimensionality is a major drawback in microarray data analysis: it causes computational instability and prevents useful information from being extracted from a dataset. Thus, in analyzing cancer microarray datasets, an imperative task is the selection and extraction of relevant features so that effective classification is achieved. In this work, four initial feature selection techniques were applied, and their output was further optimized with five optimization techniques before classification. The best results are obtained when multivariate CFS feature selection with SSO is utilized and classified with Probabilistic Neural Network (PNN), giving a high classification accuracy of 95.70%. Future work will analyze a wider range of other optimization and machine learning techniques for a better analysis of microarray-based leukemia classification.
Data Availability
The program code will be made available to researchers upon request to the corresponding author.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health and Welfare, Republic of Korea (grant number: HR21C0198), and partly supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2021R1F1A1061814).
E. Alba, J. Garcia-Nieto, L. Jourdan, and E. G. Talbi, “Gene selection in cancer classification using PSO/SVM and GA/SVM hybrid algorithms,” in Proceedings of the IEEE Congress on Evolutionary Computation (CEC '07), pp. 284–290, Singapore, September 2007.
P. G. Kumar, T. A. Victoire, P. Renukadevi, and D. Devaraj, “Design of fuzzy expert system for microarray data classification using a novel Genetic Swarm Algorithm,” Expert Systems with Applications, vol. 39, pp. 1811–1821, 2012.
S. Afshar, F. Abdolrahmani, F. V. Tanha, M. Z. Seif, and K. Taheri, “Recognition and prediction of leukemia with artificial neural network (ANN),” Medical Journal of the Islamic Republic of Iran, vol. 25, pp. 35–39, 2011.
U. N. Wisesty, R. S. Warastro, and S. Y. Puspitasari, “Leukemia and colon tumour detection based on microarray data classification using momentum backpropagation and genetic algorithm as a feature selection method,” Journal of Physics: Conference Series, vol. 971, article 012018, 2018.
L. H. S. Vogado, R. M. S. Veras, F. H. D. Araujo, R. R. V. Silva, and K. R. T. Aires, “Leukemia diagnosis in blood slides using transfer learning in CNNs and SVM for classification,” Engineering Applications of Artificial Intelligence, vol. 72, pp. 415–422, 2018.
N. Alrefai, “Ensemble machine learning for leukemia cancer diagnosis based on microarray datasets,” International Journal of Applied Engineering Research, vol. 14, no. 21, pp. 4077–4084, 2019.
P. Tumuluru and B. Ravi, “Chronological grasshopper optimization algorithm-based gene selection and cancer classification,” Journal of Advanced Research in Dynamical and Control Systems, vol. 10, no. 3, pp. 80–94, 2018.
A. Cuzzocrea, S. L. Francis, and M. M. Gaber, “An information-theoretic approach for setting the optimal number of decision trees in random forests,” in Proceedings of the 2013 IEEE International Conference on Systems, Man, and Cybernetics, pp. 1013–1019, Manchester, UK, October 2013.