Abstract

Feature selection in cancer classification is a central area of research in the field of bioinformatics and used to select the informative genes from thousands of genes of the microarray. The genes are ranked based on T-statistics, signal-to-noise ratio (SNR), and F-test values. The swarm intelligence (SI) technique finds the informative genes from the top-m ranked genes. These selected genes are used for classification. In this paper the shuffled frog leaping with Lévy flight (SFLLF) is proposed for feature selection. In SFLLF, the Lévy flight is included to avoid premature convergence of shuffled frog leaping (SFL) algorithm. The SI techniques such as particle swarm optimization (PSO), cuckoo search (CS), SFL, and SFLLF are used for feature selection which identifies informative genes for classification. The k-nearest neighbour (k-NN) technique is used to classify the samples. The proposed work is applied on 10 different benchmark datasets and examined with SI techniques. The experimental results show that the results obtained from k-NN classifier through SFLLF feature selection method outperform PSO, CS, and SFL.

1. Introduction

Abundant methods and techniques have been proposed for cancer classification using microarray gene expression data. Rapid and recent advances in microarray gene expression technology have facilitated the simultaneous measurement of the expression levels of tens of thousands of genes in a single experiment at a reasonable cost. Gene expression profiling by microarray method has appeared as a capable technique for classification and diagnostic prediction of cancer.

The raw microarray data are images that are transformed into gene expression matrices. The rows in the matrix correspond to genes, and the columns represent samples or trial conditions. The number in each cell signifies the expression level of a particular gene in a particular sample or condition. Expression levels can be absolute or relative. If two rows are similar, it implies that the respective genes are coregulated and perhaps functionally related. By comparing samples, differentially expressed genes can be identified. The major limitation of the gene expression data is its high dimension which contains more numbers of genes and very few samples. A number of gene selection methods have been introduced to select the informative genes for cancer prediction and diagnosis. Feature or gene selection methods remove irrelevant and redundant features to improve classification accuracy. From the microarray data, the informative genes are identified based on their -statistics, SNR, and -test values.

PSO is one of the SI techniques proposed by Kennedy and Eberhart [1] that simulate the behaviour of bird flocking. Yang and Deb [2] proposed the CS inspired by the breeding behaviour of cuckoo. SFL is a memetic metaheuristic that is the combination of two search techniques: the local search of PSO and the competitiveness mixing of the shuffled complex evolution [3]. The randomness in SFL sometimes will not cover an effective area of the search space or it will reflect the same worst solution. To avoid this, the proposed work adopts Lévy flight for position change. The SI techniques such as PSO, CS, SFL, and SFLLF are used for feature selection.

1.1. Related Work

In this section the works related to gene selection and cancer classification using microarray gene expression data are discussed. An evolutionary algorithm is used by Jirapech-Umpai and Stuart [4] to identify the near-optimal set of predictive genes that classify the data. Vanichayobon et al. [5] used self-organizing map for clustering cancer data composed of important gene selection step. Rough set concept with dependent degrees was proposed by Wang and Gotoh [6]. In this method they screened a small number of informative single gene and gene pairs on the basis of their dependent degrees.

A swarm intelligence feature selection algorithm was proposed based on the initialization and update of only a subset of particles in the swarm by Martinez et al. [7]. Gene doublets concept was introduced by Chopra et al. [8] based on the gene pair combinations. A new ensemble gene selection method was applied by Liu et al. [9] to choose multiple gene subsets for classification purpose, where the significant degree of gene was measured by conditional mutual information or its normalized form.

A hybrid method was proposed by Chuang et al. [10], which consists of correlation-based feature selection and the Taguchi chaotic binary PSO. Dagliyan et al. [11] proposed a hyperbox enclosure (HBE) method based on mixed integer programming for the classification of some cancer types with a minimal set of predictor genes. The use of single gene was explored to construct classification model by Wang and Simon [12]. This method first identified the genes with the most powerful univariate class discrimination ability and constructed simple classification rules for class prediction using the single gene.

An efficient feature selection approach based on statistically defined effective range of features for every class termed as effective range based gene selection (ERGS) was proposed by Chandra and Gupta [13]. Biomarker identifier (BMI), which identified features with the ability to distinguish between two data groups of interest, was suggested by Lee et al. [14]. Margin influence analysis (MIA) was an approach designed to work with SVM for selecting informative genes by Li et al. [15]. A model for feature selection using signal-to-noise ratio (SNR) ranking was proposed by Mishra and Sahu [16].

Huang et al. [17] presented an improved semisupervised local Fisher discriminant (iSELF) analysis for gene expression data classification. Alonso-González et al. [18] proposed a method that relaxed the maximum accuracy criterion to select the combination of attribute selection and classification algorithm. A quantitative measure based on mutual information that incorporates the information of sample categories to measure the similarity between attributes was proposed by Maji [19]. A feature selection algorithm which divides the genes into subsets to find the informative genes was proposed by Sharma et al. [20].

2. Materials and Methods

2.1. Gene Selection Methods
2.1.1. -Statistics

Genes, which have considerably different expressions involving normal and tumor tissues, are entrants for selection. A simple -statistic measure given in (1) is used by Yendrapalli et al. [21] to find the degree of gene expression difference between normal and tumor tissues. The top- genes with the largest -statistic are selected for inclusion in the discriminant analysis. Consider Here : mean of normal samples, : mean of tumor samples, : normal sample size, : tumor sample size, : variance of normal samples, and : variance of tumor samples.

2.1.2. Signal-to-Noise Ratio

An important measure used to find the significance of genes is the Pearson correlation coefficient. According to Golub et al. [22] it is changed to emphasize the “signal-to-noise ratio” in using a gene as a predictor. This predictor is shaped with the purpose of finding the prediction strength of a particular gene by Xiong et al. [23]. The signal-to-noise ratio PS of a gene “” is calculated by Here : mean of normal samples, : mean of tumor samples, : standard deviation of normal samples, and : standard deviation of tumor samples.

This value is used to reveal the difference between the classes relative to the standard deviation within the classes. Large values of indicate a strong correlation between the gene expression and the class distinction, while the sign of being positive or negative corresponds to being more highly expressed in class 1 or class 2. Genes with large SNR value are informative and are selected for cancer classification.

2.1.3. -Test

-test is the ratio of the variances of the given two sets of values which is used to test if the standard deviations of two populations are equal or if the standard deviation from one population is less than that of another population. In this work two-tailed -test value is used to check the variances of normal samples and tumor samples. Formula to calculate the -test value of a gene is given in (3). Top-m genes with the smallest -test value are selected for inclusion in the further analysis. Consider Here : variance of normal samples and : variance of tumor samples.

2.2. Swarm Intelligence Techniques
2.2.1. Particle Swarm Optimization

PSO is one of the SI techniques that simulate the behavior of bird flocking. It is a population-based optimization tool, which could be implemented and applied easily to solve various function optimization problems.

In PSO, each single solution is like a “bird” in the search space, which is called a “particle.” All particles have fitness values which are evaluated by the fitness function to be optimized and have velocities which direct the flying of the particles. The particles fly through the problem space by following the particles with the best solutions so far.

The original PSO formulae define each particle as potential solution to a problem in -dimensional space. The position of particle is represented as .Each particle also maintains a memory of its previous best position, represented as . A particle in a swarm is moving; hence, it has a velocity, which can be represented as .

Each particle knows its best value so far (best) and the best value so far in the group (best) among bests. This information is useful to know how the other particles around them have performed. Each particle tries to modify its position using the following information:(i)the distance between the current position and best,(ii)the distance between the current position and best.This modification can be represented by the concept of velocity. Velocity of each agent can be modified by (4). The inclusion of an inertia weight in the PSO algorithm was first reported by Eberhart and Shi in the literature [24]. Consider where : index of the particle, , : population size, : dimension, , : uniformly distributed random variable between 0 and 1, : velocity of particle on dimension , : current position of particle on dimension , determines the relative influence of the cognitive component, self-confidence factor, determines the relative influence of the social component, swarm confidence factor, : personal best or best of particle , : global best or best of the group, and : inertia weight.

The current position that is the searching point in the solution space can be modified by the following equation:

All swarm particles tend to move towards better positions; hence, the best position (i.e., optimum solution) can eventually be obtained through the combined effort of the whole population. The PSO algorithm is simple, easy to implement, and computationally efficient.

2.2.2. Cuckoo Search

Cuckoo search is an optimization technique developed by Yang and Deb in 2009 based on the brood parasitism of cuckoo species by laying their eggs in the nests of other host birds. If a host bird discovers the eggs which are not their own, it will either throw these foreign eggs away or simply abandon its nest and build a new nest elsewhere. Each egg in a nest represents a solution, and a cuckoo egg represents a new solution. The better new solution (cuckoo) is replaced with a solution which is not so good in the nest. In the simplest form, each nest has one egg. A new solution is generated by Lévy flight. The rules for CS are as follows:(i)each cuckoo lays one egg at a time and dumps it in a randomly chosen nest;(ii)the best nests with high quality of eggs will carry over to the next generations;(iii)the number of available host nests is fixed, and a host can discover a foreign egg with a probability . In this case, the host bird can either throw the egg away or abandon the nest so as to build a completely new nest in a new location.

When generating new solutions for a cuckoo , a Lévy flight is performed using the following equation:

The symbol is an entrywise multiplication. Basically Lévy flights provide a random walk while their random steps are drawn from a Lévy distribution for large steps given in This has an infinite variance with an infinite mean. Here the consecutive jumps of a cuckoo essentially form a random walk process which obeys a power-law step-length distribution with a heavy tail.

2.2.3. Shuffled Frog Leaping

SFL is swarm intelligence based subheuristic computation optimization algorithm proposed by Eusuff and Lansey [25] to solve discrete combinatorial optimization problem. A group of frogs leaping in a swamp is considered and the swamp has a number of stones at distinct locations on to which the frogs can leap to find the stone that has the maximum amount of available food. The frogs are allowed to communicate with each other so that they can improve their memes using other’s information. An individual frog’s position is altered by changing the leaping steps of each frog which improves a meme results.

The search begins with a randomly selected population of frogs covering the entire swamp. The population is partitioned into several parallel groups (memeplexes) that are permitted to evolve independently, to search the space in different directions. Within each memeplex, the frogs are infected by other frog’s ideas; hence they experience a memetic evolution.

Memetic evolution progresses the quality of the meme of an individual and enhances the individual frog’s performance towards a goal. To ensure that the infection process is competitive, it is required that frogs with better memes (ideas) contribute more to the development of new ideas than frogs with poor ideas. Selecting frogs using a triangular probability distribution provides a competitive advantage to better ideas. During the evolution, the frogs may change their memes using the information from the memeplex best or the best of the entire population. Incremental changes in memotype(s) correspond to a leaping step size and the new meme corresponds to the frog’s new position. After an individual frog has improved its position, it is returned to the community. The information gained from a change in position is immediately available to be further improved upon.

After a certain number of memetic evolution time loops, the memeplexes are forced to mix and new memeplexes are formed through a shuffling process. This shuffling enhances the quality of the memes after being infected by frogs from different regions of the swamp. Migration of frogs accelerates the searching procedure sharing their experience in the form of infection and it ensures that the cultural evolution towards any particular interest is free from regional bias.

Here, the population consists of a set of frogs (solutions) that is partitioned into subsets referred to as memeplexes. The different memeplexes are considered to be different cultures of frogs, each performing a local search. Within each memeplex, the individual frogs hold ideas that can be influenced by the ideas of other frogs and evolve through a process of memetic evolution. After a defined number of memetic evolution steps, ideas are passed among memeplexes in a shuffling process. The local search and the shuffling processes continue until defined convergence criteria are satisfied. An initial population of frogs is created randomly. For -dimensional problems ( variables), a frog is represented as . Afterwards, the frogs are sorted in a descending order according to their fitness. Then, the entire population is divided into memeplexes, each containing frogs (). In this process, the first frog goes to the first memeplex, the second frog goes to the second memeplex, frog goes to the th memeplex, frog goes back to the first memeplex, and so forth. Within each memeplex, the frogs with the best and the worst fitnesses are identified as and , respectively. Also, the frog with the global best fitness is identified as . Then, a process similar to PSO is applied to improve only the frog with the worst fitness (not all frogs) in each cycle.

2.2.4. Shuffled Frog Leaping with Lévy Flight

A Lévy flight is a random walk in which the steps are defined in terms of the step lengths, which have a certain probability distribution, with the directions of the steps being isotropic and random. Lévy flights model activities that involve a lot of small steps scattered with occasional very large trips. Foraging paths of some deer and albatross are examples for Lévy flights. In the case of foraging paths, this result is sensible because the stopping points of a Lévy flight are fractal and in complex ecosystems the distribution of food is fractal. To avoid spending too much time in such unproductive areas, animals need to develop search strategies that generate a fractal distribution of stopping points. Lévy flights have this property. To improve the searching strategy of frogs and performance of classification in SFL, an additional parameter LF is added. The Pseudocodes 1, 2, 3, and 4 represent the pseudocodes of PSO, CS, SFL and SFLLF.

For each particle
 Initialize particle
END
Do
 For each particle
  Calculate fitness value
  If the fitness value is better than its personal best
   Set current value as the new
 End
Choose the particle with the best fitness value of all as
 For each particle
  Calculate particle velocity according (4)
  Update particle position according (5)
 End
While maximum iterations or minimum error criteria is not attained

Generate an initial population of host nests;
while ( < MaxGeneration) or (stop criterion)
Get a cuckoo randomly (say, ) and replace its solution by performing Levy flights;
Evaluate its fitness
Choose a nest among (say, ) randomly;
if ( )
  Replace by the new solution;
end if
A fraction ( ) of the worse nests is abandoned and new ones are built;
Keep the best solutions/nests;
Rank the solutions/nests and find the current best;
Pass the current best to the next generation;
end while

Generate random population of solutions (frogs);
   Calculate fitness function value of each frog;
    Repeat for specific number of times
     Sort the population in descending order of their fitness;
     Divide into memeplexes;
     Repeat for specific number of iterations
     For each memeplex determine the best and worst frogs and ;
     Identify the best frog for the entire population ;
     Improve the worst frog position using
     If
         
         if
         generate the random solution for
     end;
     Combine the evolved memeplexes;
    end;
   Present the best frog
   end;

Generate random population of solutions (frogs);
   Calculate fitness function value of each frog;
    Repeat for specific number of times
     Sort the population in descending order of their fitness;
     Divide into memeplexes;
     Repeat for specific number of iterations
     For each memeplex determine the best and worst frogs and ;
     Identify the best frog for the entire population ;
     Improve the worst frog position using
     If
         
         if
             and
     end;
     Combine the evolved memeplexes;
    end;
   Present the best frog
   end;
The symbol is an entrywise multiplication. Basically Lévy flights provide a random walk while their random steps are drawn
from a Lévy distribution for large steps.

3. Feature Selection Based on Swarm Intelligence Techniques

The statistical measures are used to identify top-m genes and these genes are further used for feature selection in PSO, CS, SFL, and SFLLF. Figure 1 gives the schematic representation of the proposed method.

3.1. Candidate Solution Representation

Figure 2 shows the candidate solution representation of particle position for PSO, egg for CS, and frog for SFL and SFLLF using top-m informative genes which are obtained from statistical techniques. The most used way of encoding the feature selection is a binary string, but the above optimization techniques work well for continuous optimization problem. The random values are generated for gene position. The genes are considered when the value in its position is greater than 0.5; otherwise it is ignored.

3.2. Fitness Function

The accuracy of -NN classifier is used as the fitness function [26, 27] for SI techniques. The fitness function is defined as

is the test accuracy of testing data in the -NN classifier which is built with the feature subset selection of training data. The classification accuracy of -NN is given by where : samples that are classified correctly in test data by -NN technique and : total number of samples in test data.

4. Experimental Setup

In order to assess the performance of the proposed work, ten benchmark datasets are used. Table 1 shows the datasets collected from Kent Ridge Biomedical Data Repository. The number of samples present in each class is given within parenthesis. The parameters and their values of PSO, CS, SFL, and SFLLF are shown in Table 2.

From the microarray data the discriminative genes are identified and ranked based on -statistics, signal-to-noise ratio, and -test values. The top-m genes are used to represent the candidate solutions of the SI techniques. The values 10, 50, and 100 are assigned to m for testing purpose. The SI technique identifies the features (genes) for classification. The -NN method is used for classification. By empirical analysis the value of is assigned to be 5. The classification accuracy is obtained from 5-fold cross-validation.

5. Experimental Results and Discussion

Figures 3, 4, 5, and 6 show the results obtained from -NN classifier through the feature selection methods PSO, CS, SFL, and SFLLF, respectively, for top-10, top-50, and top-100 genes obtained from -statistics, SNR, and -test. These results show that for Colon Tumor and Prostate Cancer the 100% accuracy is not achieved by any method. The SFLLF algorithm gives 100% accuracy for Lung Cancer Michigan for all different statistical tests and different numbers of top-m genes. From the results it is inferred that the m value does not influence the accuracy of the classifier. So the value of m should be identified through empirical analysis.

Table 3 compares the maximum classification accuracies obtained from the SI techniques with different statistical measures.

Tables 4, 5, 6, 7, 8, 9, 10, 11, 12, and 13 give the comparison of the proposed work with existing methods. Experimental results show that SFLLF outperforms the existing methods.

6. Conclusions

Cancer classification using gene expression data is an important task for addressing the problem of cancer prediction and diagnosis. For an effective and precise classification, investigations of feature selection methods are essential. The swarm intelligence techniques based feature selection methods are simple and can be easily combined with other statistical feature selection methods. It is a simple model based on statistical measures and swarm intelligence techniques that perform two levels of feature selection to get the most informative genes for classification process. -statistics, signal-to-noise ratio, and -test are used to select the important genes that are the reason for cancer. The SI techniques such as PSO, CS, SFL, and SFLLF are applied on the selected top- genes for feature selection. The -NN is used as a classifier. The experiment results are demonstrated on well-known gene expression benchmark datasets and the performance of SFLLF is compared with PSO, CS, SFL, and the existing works in the literature. The experimental results show that SFLLF outperforms PSO, CS, and SFL. SFLLF feature selection method gives 100% accuracy for 8 datasets out of 10 datasets with -NN classifier.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.