Abstract

This paper focuses on the feature gene selection for cancer classification, which employs an optimization algorithm to select a subset of the genes. We propose a binary quantum-behaved particle swarm optimization (BQPSO) for cancer feature gene selection, coupling support vector machine (SVM) for cancer classification. First, the proposed BQPSO algorithm is described, which is a discretized version of original QPSO for binary 0-1 optimization problems. Then, we present the principle and procedure for cancer feature gene selection and cancer classification based on BQPSO and SVM with leave-one-out cross validation (LOOCV). Finally, the BQPSO coupling SVM (BQPSO/SVM), binary PSO coupling SVM (BPSO/SVM), and genetic algorithm coupling SVM (GA/SVM) are tested for feature gene selection and cancer classification on five microarray data sets, namely, Leukemia, Prostate, Colon, Lung, and Lymphoma. The experimental results show that BQPSO/SVM has significant advantages in accuracy, robustness, and the number of feature genes selected compared with the other two algorithms.

1. Introduction

Cancer is now one of the most common causes of death in humans. Missed or mistaken diagnoses sometimes cost patients the best chance for appropriate treatment. Therefore, more auxiliary measures are needed to improve the accuracy of cancer diagnosis and clinical testing in combination with medical methods [1–4]. With the rapid development of information science and molecular biology, gene microarray technology provides large amounts of high-throughput gene expression profiles, which are widely used in cancer diagnosis, clinical inspection, and other applications. However, microarray expression data are highly redundant and noisy, and most genes are uninformative with respect to the studied classes, as only a fraction of genes may present distinct profiles for different classes of samples. As such, effective methods of selecting feature genes for cancer are critically necessary. These methods should be able to robustly identify a subset of informative genes embedded in a large data set that is contaminated with high-dimensional noise.

It was Golub et al. who first employed gene expression data for cancer classification [5]. They proposed to use gene expression data of acute leukemia for cancer classification by adopting “SNR” index to calibrate the contribution of genes to the cancer classification and by using a weighted voting mechanism to distinguish cancer types [5]. This study demonstrated that the use of gene expression data to determine cancer types for the auxiliary medical diagnosis is an effective measure. Afterwards, an increasing number of researchers in the fields of biology and information sciences have proposed many effective feature gene selection methods, so that the research in this discipline is becoming one of the hotspots in bioinformatics.

Currently, there are two categories of methods for obtaining feature genes for cancer classification based on gene expression data, namely, feature transformation methods and feature selection methods. By definition, feature transformation refers to transforming the original feature attributes into a new set of features that represent the original features to the greatest extent while reducing the dimension as much as possible. This means that the new features are low-dimensional features with similar classification abilities. Feature transformation methods for cancer classification using gene expression data include principal component analysis (PCA) [6], kernel PCA [7], independent component analysis (ICA) [8], locally linear embedding (LLE) [9], partial least squares (PLS) [10], the maximum margin criterion (MMC) [11], and linear discriminant analysis (LDA) [12]. Conde et al. [13, 14] proposed a feature transformation method based on clustering. This approach uses a self-organizing tree algorithm to carry out gene clustering and calculates the average gene expression level for each cluster, which is then accepted as a new feature to establish the cancer classification model. Kan et al. [15] employed PCA to transform the gene expression data of childhood small round blue cell tumors and then used an artificial neural network for classification.

Feature transformation methods can indeed reduce the dimension of gene expression data and can mitigate the “curse of dimensionality” caused by the large number of redundant genes, so that they can help to establish effective cancer classification models. However, the new features obtained by feature transformation no longer have the original biological meaning; that is, the methods destroy the biological information of the original gene expression data, which makes it impossible to determine the target genes associated with the cancer. For this reason, feature gene selection methods have attracted more attention.

Feature gene selection uses an optimization algorithm to select, from the original gene microarray data, the subset of genes that carries the most classification information. The most commonly used feature gene selection methods can be divided into filter, wrapper, and embedded ones. A filter algorithm is independent of the subsequent learning algorithm and instead uses some criteria for scoring gene subsets, which measure the contribution of the genes to classification. Such methods generally use SNR [5], the t-test [16], the correlation coefficient [17], mutual information [18], relief [19], information gain [20], or Fisher discrimination [21]. Filter methods have advantages such as simplicity, fast calculation, and independence of classification algorithms. However, they evaluate each gene individually with some criterion and ignore the correlation between genes, which results in a large amount of redundant information in the candidate genes.

Different from filter methods, wrapper methods combine gene selection and classification method and use training accuracy of the learning algorithm to assess the subset of features to guide gene selection. Such methods include the sequential random search heuristics [22], random forest method [23], and PKLR [24]. In the cancer feature gene selection, a typical wrapper feature selection method combines support vector machine (SVM) and a recursive feature selection method [25]. In this method, support vector machines are used to classify the data set, then each gene is excluded in turn, and the performance change of the SVM after exclusion of the gene is calculated, and afterwards, the gene with the least absolute value of the association weight is removed from the training set until the training set is empty. The gene sets deleted together in the last step are the optimal subset. Li et al. [26] adopted genetic algorithm (GA) to select feature genes of cancer. Zhang et al. [27] coupled a binary particle swarm optimization (BPSO) and the SVM for classification of Colon data set.

Embedded methods are an extension of wrapper approaches and undertake feature selection in the process of classifier training, without dividing the data set into a training set and a validation set. Typical embedded algorithms include decision trees [28] and artificial neural networks [29].

In this work, we propose a new method, which couples a binary quantum-behaved particle swarm optimization with the SVM approach, to select feature gene subsets from cancer microarray data. In order to demonstrate the advantages of BQPSO/SVM, we also implement two other algorithms, BPSO/SVM and GA/SVM. The BPSO and GA used in this work are both the original versions. These two algorithms, or improved versions of them, were used for this task by other scholars earlier [30–32]. All three approaches are experimentally assessed on five well-known cancer data sets (Leukemia, Colon, Prostate, Lung, and Lymphoma).

This paper is structured as follows. In Section 2, we review the BQPSO algorithm, and in Section 3 the SVM technique is described and our BQPSO/SVM method is proposed. In Section 4, the five microarray data sets used in this work are described. Experimental results are presented in Section 5, including biological descriptions of several obtained genes. Finally, the paper is concluded in Section 6.

2. Binary Encoded Quantum-Behaved Particle Swarm Optimization (BQPSO)

The PSO algorithm is a population-based evolutionary search technique, which was first proposed in [33]. The social behavior of animals, such as bird flocking and fish schooling, together with swarm theory is the underlying motivation for the development of PSO. Inspired by quantum theory, Sun et al. [34] developed a novel variant of PSO called Quantum-behaved Particle Swarm Optimization (QPSO), in which a strategy based on a quantum potential well is employed to sample around the personal best points, and then introduced the mean best position into the algorithm [35–37].

Based on our previous work in [38], in this paper we further propose a discrete binary version of QPSO (BQPSO) as a search algorithm coupled with SVM for gene selection based on cancer gene expression data. In the proposed BQPSO, the position of a particle is represented as a binary string. For instance, in Figure 1, $X_1$ is the first particle and $X_2$ is the second one; both have two substrings (two decision variables), and the distance between them is defined as the Hamming distance between the two binary strings; namely,

$$|X_1 - X_2| = d_H(X_1, X_2), \tag{1}$$

where $d_H(\cdot,\cdot)$ is the function that gets the Hamming distance between $X_1$ and $X_2$, which is the count of bits that differ between the two strings; the distance is seven in Figure 1. In the BQPSO, the dimension is defined as the number of decision variables, so that a particle can have more than one decision variable. For example, particle $X_i$ is represented as $X_i = (X_{i1}, X_{i2}, \ldots, X_{iD})$, and it has $D$ decision variables, and $X_{ij}$ refers to the $j$th substring ($j$th decision variable) of the position of the $i$th particle. Given that the lengths of $X_i$ and $X_{ij}$ are $l$ and $l_j$, respectively, we get the following equation:

$$l = \sum_{j=1}^{D} l_j. \tag{2}$$

In the BQPSO, the mean best position $mbest$ of all particles is determined by the states of the bits of all particles’ $pbest$. In detail, for the $j$th bit of $mbest$, if 1 appears more often than 0 at the $j$th bit of all $pbest$, the $j$th bit of $mbest$ will be 1; otherwise the bit will be 0. However, if 1 and 0 have the same frequency of occurrence, the $j$th bit of $mbest$ will be set randomly to 1 or 0, with probability 0.5 for either state. The function for obtaining $mbest$ is called Get_mbest. The pseudocode of the function for obtaining $mbest$ is given in Pseudocode 1.

Get_mbest(pbest)
for j = 1 to l (the length of binary string)
  sum = 0;
  for each particle i
    sum = sum + pbest[i][j];
  endfor
  avg = sum/M; (M is the number of particles)
  if avg > 0.5 mbest[j] = 1; endif
  if avg < 0.5 mbest[j] = 0; endif
  if avg = 0.5
    if rand() < 0.5 mbest[j] = 0;
    else mbest[j] = 1;
    endif
  endif
endfor
Return mbest
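
The majority vote of Pseudocode 1 can be sketched in Python as follows. This is a minimal illustration rather than the original implementation (which was written in MATLAB); it assumes each pbest is a list of 0/1 integers, and the name `get_mbest` and the `rng` parameter are choices made only for this sketch.

```python
import random

def get_mbest(pbest, rng=random):
    """Bitwise majority vote over the personal best strings (lists of 0/1)."""
    n_particles = len(pbest)
    length = len(pbest[0])
    mbest = []
    for j in range(length):
        ones = sum(p[j] for p in pbest)       # how many particles have bit j set
        if 2 * ones > n_particles:            # 1 is the majority
            mbest.append(1)
        elif 2 * ones < n_particles:          # 0 is the majority
            mbest.append(0)
        else:                                 # tie: 0 or 1 with probability 0.5 each
            mbest.append(0 if rng.random() < 0.5 else 1)
    return mbest

pbest = [[1, 0, 1, 1],
         [1, 0, 0, 1],
         [0, 0, 1, 1]]
print(get_mbest(pbest))  # [1, 0, 1, 1]
```

Comparing `2 * ones` with the particle count avoids the floating-point test `avg = 0.5` while expressing the same three cases.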

$p_i$ is the coordinate of the local attractor for particle $i$. In the continuous version of QPSO, the coordinate of $p_i$ lies between $pbest_i$ and $gbest$. In the BQPSO, the point $p_i$ is generated through a one-point or multipoint crossover operation on $pbest_i$ and $gbest$, like that used in the genetic algorithm (GA), and this guarantees that $p_i$ lies between $pbest_i$ and $gbest$ as well. The function getting $p_i$ in BQPSO is called Get_P.
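
A one-point crossover of this kind might look like the sketch below. It rests on a few assumptions not stated in the text: the personal best and global best are lists of 0/1 of equal length (at least 2), only one of the two possible offspring is kept, and which parent supplies the head is chosen at random. The name `get_p` is illustrative.

```python
import random

def get_p(pbest_i, gbest, rng=random):
    """One-point crossover of a personal best and the global best (lists of 0/1)."""
    length = len(pbest_i)
    cut = rng.randint(1, length - 1)          # crossover point strictly inside the string
    if rng.random() < 0.5:                    # assumption: keep one offspring at random
        return pbest_i[:cut] + gbest[cut:]
    return gbest[:cut] + pbest_i[cut:]

attractor = get_p([1, 1, 1, 1, 0, 0], [0, 0, 1, 1, 0, 1])
```

Because every bit of the result is copied from one of the two parents, the Hamming distances from the result to the two parents always add up to the Hamming distance between the parents, which is the binary analogue of lying between them.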

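The update equation of the particle position in the original QPSO is given by

$$X_{ij}(t+1) = p_{ij}(t) \pm \alpha \, |mbest_j(t) - X_{ij}(t)| \, \ln\frac{1}{u}, \tag{4}$$

where $\alpha$ is the contraction-expansion coefficient and $u$ is a random number uniformly distributed on $(0, 1)$. In the BQPSO, (4) can be written again as follows:

$$d_H(X_{ij}, p_{ij}) = \lceil b \rceil,$$

where

$$b = \alpha \cdot d_H(X_{ij}, mbest_j) \cdot \ln\frac{1}{u}.$$

Because $d_H$ is a Hamming distance, $d_H(X_{ij}, p_{ij})$ must be an integer, which is the reason for the use of the ceiling function $\lceil \cdot \rceil$. The new string $X_{ij}$ is obtained by the mutation of $p_{ij}$ with the probability computed by

$$P_r = \begin{cases} d_H(X_{ij}, p_{ij}) / l_j, & \text{if } d_H(X_{ij}, p_{ij}) / l_j < 1, \\ 1, & \text{otherwise}. \end{cases}$$

Here, as in [35], $l_j$ is the length of substring $X_{ij}$. The function getting $X_{ij}$ is denoted as Transf. The transformation of $p_{ij}$ is described in Pseudocode 2.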

Transf(p_ij, Pr)
for each bit in the substring p_ij
  if rand() < Pr
    if the state of the bit is 1
      Set its state to 0;
    else set its state to 1;
    endif
  endif
endfor
X_ij = p_ij;
Return X_ij
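
The bit-flip mutation of Pseudocode 2, together with the probability that drives it, can be sketched in Python. The exact form of the probability is an assumption here: the sketch takes the ceiling of alpha times the Hamming distance to mbest times ln(1/u), divides by the substring length, and caps the result at 1. All function and parameter names are illustrative.

```python
import math
import random

def hamming(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))

def mutation_probability(x_ij, mbest_j, alpha, rng=random):
    """Assumed form: min(ceil(alpha * d_H(x, mbest) * ln(1/u)) / l, 1)."""
    u = 1.0 - rng.random()                    # u in (0, 1], so ln(1/u) is finite
    b = alpha * hamming(x_ij, mbest_j) * math.log(1.0 / u)
    return min(math.ceil(b) / len(x_ij), 1.0)

def transf(p_ij, pr, rng=random):
    """Flip each bit of the local attractor independently with probability pr."""
    return [1 - bit if rng.random() < pr else bit for bit in p_ij]

x = [1, 0, 1, 1, 0, 0, 1, 0]
mbest = [1, 1, 1, 0, 0, 0, 1, 1]
p = [1, 0, 1, 0, 0, 0, 1, 1]
new_x = transf(p, mutation_probability(x, mbest, alpha=0.8))
```

When a particle already coincides with mbest, the Hamming distance is zero, the probability is zero, and the new position is simply the local attractor; the further a particle is from mbest, the more bits of the attractor tend to be flipped.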

The BQPSO can be summarized by three functions: Get_mbest, Get_P, and Transf.

3. Gene Selection and Classification by BQPSO/SVM

3.1. The SVM Classifier

The support vector machine proposed in [39] is a technique derived from statistical learning theory. It is widely used to classify points by assigning them to one of two disjoint half spaces [40, 41]. That is to say, SVM mainly carries out 2-class classification. For linearly separable data, SVM finds the hyperplane that maximizes the margin between the training samples and the class boundary. For nonlinearly separable cases, samples are mapped to a high-dimensional space in which such a separating hyperplane can be found. This mapping is performed by means of a mechanism called the kernel function.

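Theoretically, SVM is able to correctly classify any linearly separable data. Consider data with two classes, which can be expressed as

$$(x_i, y_i), \quad i = 1, \ldots, n, \quad x_i \in \mathbb{R}^m, \quad y_i \in \{+1, -1\},$$

and then the hyperplane that separates the two classes of the data is given by

$$w \cdot x + b = 0.$$

In order to guarantee that the data can be correctly classified and the distance between the classes is as large as possible, the hyperplane must satisfy

$$y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \ldots, n, \tag{9}$$

by which the distance is obtained as $2 / \|w\|$, so that the problem of constructing the hyperplane is converted to the following optimization problem:

$$\min_{w, b} \; \frac{1}{2} \|w\|^2, \tag{10}$$

with (9) being the constraint. Problem (10) is solved by introducing the following Lagrange function:

$$L(w, b, \alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right], \tag{11}$$

where $\alpha_i \ge 0$ is known as the Lagrange coefficient. Solving the Lagrangian dual of the problem, one obtains a simplified problem:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j), \quad \text{s.t.} \; \sum_{i=1}^{n} \alpha_i y_i = 0, \; \alpha_i \ge 0. \tag{12}$$

Solving the problem in (12), we can get

$$\alpha^* = (\alpha_1^*, \ldots, \alpha_n^*)^T, \tag{13}$$

by which the hyperplane is obtained as

$$w^* \cdot x + b^* = 0, \quad w^* = \sum_{i=1}^{n} \alpha_i^* y_i x_i, \quad b^* = y_k - w^* \cdot x_k \text{ for any support vector } x_k, \tag{14}$$

and the optimal classification function is

$$f(x) = \operatorname{sgn}(w^* \cdot x + b^*) = \operatorname{sgn}\left( \sum_{i=1}^{n} \alpha_i^* y_i (x_i \cdot x) + b^* \right). \tag{15}$$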
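
The step from the Lagrange function to the simplified dual problem can be made explicit. The following is a brief sketch in standard hard-margin notation, which is assumed here: $w$ is the normal vector of the hyperplane, $b$ its offset, $(x_i, y_i)$ are the training pairs with $y_i \in \{+1, -1\}$, and $\alpha_i \ge 0$ are the Lagrange coefficients.

```latex
\begin{align*}
L(w, b, \alpha) &= \tfrac{1}{2}\|w\|^2
    - \sum_{i=1}^{n} \alpha_i \bigl[\, y_i (w \cdot x_i + b) - 1 \,\bigr], \\
\frac{\partial L}{\partial w} = 0
    &\;\Longrightarrow\; w = \sum_{i=1}^{n} \alpha_i y_i x_i, \\
\frac{\partial L}{\partial b} = 0
    &\;\Longrightarrow\; \sum_{i=1}^{n} \alpha_i y_i = 0, \\
\text{substituting back:}\quad
W(\alpha) &= \sum_{i=1}^{n} \alpha_i
    - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
      \alpha_i \alpha_j \, y_i y_j \, (x_i \cdot x_j).
\end{align*}
```

The dual depends on the samples only through the inner products $x_i \cdot x_j$, which is what allows a kernel function to replace them in the nonlinearly separable case.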

3.2. The Proposed BQPSO/SVM Approach

In many bioinformatics problems the number of features is significantly larger than the number of samples. In order to improve the classification or to help to recognize interesting features in noisy environments, tools for reducing the number of features are indispensable. The hybrid BQPSO/SVM approach proposed in the following contributes especially in this sense.

First of all, the data should be preprocessed. The data must be normalized so as to eliminate the impact of differing scales on the classification. Then we apply the traditional t-test to the data, order the genes by p value in ascending order, and take the 50 top-ranked genes. After this step, most of the noisy data have been removed. These 50 genes comprise the whole search space of the BQPSO algorithm for gene selection.
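
The preprocessing step can be sketched with NumPy as below. Several details are assumptions, since the text does not specify them: normalization is taken to be a per-gene z-score, the t-test is the pooled-variance two-sample test, and the function name and arguments are hypothetical. Because the pooled test has the same degrees of freedom for every gene, sorting by descending |t| gives the same order as sorting by ascending p value, so no p values need to be computed.

```python
import numpy as np

def top_genes_by_t_test(X, y, k=50):
    """X: samples x genes matrix, y: 0/1 class labels. Returns indices of the k top-ranked genes."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Assumed normalization: z-score each gene (column)
    std = X.std(axis=0)
    std[std == 0] = 1.0                       # constant genes stay at zero
    Z = (X - X.mean(axis=0)) / std
    a, b = Z[y == 0], Z[y == 1]
    na, nb = len(a), len(b)
    # Pooled-variance two-sample t statistic, computed per gene
    pooled = ((na - 1) * a.var(axis=0, ddof=1)
              + (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
    se = np.maximum(np.sqrt(pooled * (1.0 / na + 1.0 / nb)), 1e-12)
    t = (a.mean(axis=0) - b.mean(axis=0)) / se
    # Descending |t| is equivalent to ascending p value here
    return np.argsort(-np.abs(t))[:k]
```

The returned indices would define the 50-bit search space, with bit j of a particle referring to the j-th retained gene.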

In this work, the swarm sizes for BQPSO and BPSO were set to 20, and the population size for GA was also 20. Each particle has just one decision variable, and thus the dimension of the particle is one. The length of the particle is 50, so every particle is a binary string of length 50, in which 1 indicates that the corresponding gene is chosen and 0 that it is not. Feature gene selection and cancer classification based on the hybrid BQPSO/SVM algorithm can be described by the procedure in Pseudocode 3.

Preprocess the data set;
Initialize the current positions X_i and the pbest positions pbest_i of all particles, which are binary strings with each bit representing whether the corresponding gene is selected or not;
do
  Determine the mean best position among the particles by mbest = Get_mbest(pbest), select a suitable value for α;
  for i = 1 to population size M
    Call the LIBSVM tool box to construct the SVM classifier and get the classification accuracy for the data;
    With the classification accuracy and the number of selected genes (i.e., the number of features given by the number of bits with value 1), evaluate the objective function value f(X_i) according to Section 3.3;
    Update pbest_i and gbest, it means
      if f(X_i) > f(pbest_i) then pbest_i = X_i;
      and gbest = pbest_g, g = arg max f(pbest_i);
    then get a stochastic position by p_i = Get_P(pbest_i, gbest)
    for j = 1 to dimensionality D
      Compute the mutation probability Pr;
      Generate the new substring by X_ij = Transf(p_ij, Pr);
    endfor
    and get the new position X_i by combining all new substrings X_ij (j = 1, ..., D)
  endfor
until termination criterion is met;
Output the best solution which has been found (gbest)
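
The overall loop can be sketched end to end in Python on a toy problem. This is an illustration under stated assumptions, not the authors' MATLAB code: the SVM evaluation is replaced by a hypothetical `toy_fitness`, each particle has a single substring (dimension one, as in this work), the mutation probability takes the assumed form ceil(b)/l capped at 1, gbest is refreshed once per iteration rather than after every particle, and `alpha`, `swarm`, and `iters` are placeholder values.

```python
import math
import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def get_mbest(pbest, rng):
    n = len(pbest)
    out = []
    for j in range(len(pbest[0])):
        ones = sum(p[j] for p in pbest)
        out.append(1 if 2 * ones > n else 0 if 2 * ones < n else rng.randint(0, 1))
    return out

def get_p(pbest_i, gbest, rng):
    cut = rng.randint(1, len(gbest) - 1)      # one-point crossover
    return pbest_i[:cut] + gbest[cut:]

def transf(p, pr, rng):
    return [1 - bit if rng.random() < pr else bit for bit in p]

def bqpso(fitness, length, swarm=20, iters=100, alpha=1.0, seed=0):
    rng = random.Random(seed)
    X = [[rng.randint(0, 1) for _ in range(length)] for _ in range(swarm)]
    pbest = [x[:] for x in X]
    pfit = [fitness(x) for x in X]
    for _ in range(iters):
        mbest = get_mbest(pbest, rng)
        gbest = pbest[max(range(swarm), key=pfit.__getitem__)]
        for i in range(swarm):
            p = get_p(pbest[i], gbest, rng)   # local attractor
            u = 1.0 - rng.random()            # u in (0, 1]
            b = alpha * hamming(X[i], mbest) * math.log(1.0 / u)
            pr = min(math.ceil(b) / length, 1.0)
            X[i] = transf(p, pr, rng)         # new position
            f = fitness(X[i])
            if f > pfit[i]:                   # maximization
                pbest[i], pfit[i] = X[i][:], f
    g = max(range(swarm), key=pfit.__getitem__)
    return pbest[g], pfit[g]

def toy_fitness(bits):
    # Hypothetical stand-in for the SVM-based evaluation: reward three
    # "informative" positions and penalize large subsets.
    informative = (0, 3, 7)
    hits = sum(bits[k] for k in informative)
    return 0.6 * hits / len(informative) + 0.4 * (1 - sum(bits) / len(bits))

best, fit = bqpso(toy_fitness, length=12)
```

Since a personal best is only replaced by a strictly better position, the returned fitness can never be worse than that of the best particle in the initial random swarm.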

3.3. Evaluation Function

Since a particle is a binary string representing a gene subset in BQPSO/SVM, each particle is evaluated by the SVM classifier to assess the quality of the represented gene subset. The fitness of a particle is calculated by using leave-one-out cross validation (LOOCV) to obtain the accuracy of the SVM trained with this subset. In LOOCV, one sample is held out as test data while all the others are used as training data, and this is repeated until every sample has been used as test data once. If the data set has $n$ samples, the LOOCV classification accuracy is the average accuracy over the $n$ classifications. The evaluation function is a weighted sum of two factors, the first based on the classification accuracy and the second on the number of selected genes; the two weights are set to 0.6 and 0.4, respectively, so that the accuracy takes precedence over the subset size, since high accuracy is preferred when leading the search process. The target here consists of maximizing the accuracy and minimizing the number of genes. For convenience (so that the fitness is only maximized), the second factor is presented in a form that increases as the number of selected genes decreases.
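
A LOOCV-based evaluation of this kind might be sketched as follows. Two parts are assumptions and are marked as such in the code: a nearest-centroid rule stands in for the LIBSVM classifier so that the sketch needs only NumPy, and the size factor is taken to be the fraction of genes left out, which is one possible form that grows as fewer genes are selected. Only the weights 0.6 and 0.4 come from the text.

```python
import numpy as np

def loocv_accuracy(X, y, classify):
    """Leave-one-out accuracy; classify(train_X, train_y, test_x) returns a label."""
    n = len(y)
    hits = 0
    for i in range(n):
        mask = np.arange(n) != i              # hold out sample i
        hits += int(classify(X[mask], y[mask], X[i]) == y[i])
    return hits / n

def nearest_centroid(train_X, train_y, test_x):
    # ASSUMPTION: stand-in classifier, NOT the SVM used in the paper
    labels = np.unique(train_y)
    dists = [np.linalg.norm(test_x - train_X[train_y == c].mean(axis=0)) for c in labels]
    return labels[int(np.argmin(dists))]

def fitness(bits, X, y, w_acc=0.6, w_size=0.4, classify=nearest_centroid):
    bits = np.asarray(bits, dtype=bool)
    if not bits.any():
        return 0.0                            # an empty gene subset cannot be classified
    acc = loocv_accuracy(X[:, bits], y, classify)
    # ASSUMPTION: size factor = fraction of genes NOT selected
    size_term = 1.0 - bits.sum() / bits.size
    return w_acc * acc + w_size * size_term

X = np.array([[0.0, 3.0], [0.1, 1.0], [0.2, 2.0],
              [5.0, 2.5], [5.1, 1.5], [5.2, 3.5]])
y = np.array([0, 0, 0, 1, 1, 1])
score = fitness([1, 0], X, y)                 # uses only the first gene
```

To plug in a real SVM, `classify` would be replaced by a function that trains the classifier on the training fold and predicts the held-out sample.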

4. The Data Sets

There are several DNA microarray data sets from published cancer gene expression studies. Five of them were used in this paper: the Leukemia, Prostate, Colon, Lung, and Lymphoma data sets. All of them were taken from the BRB-ArrayTools in [42] with URL http://linus.nci.nih.gov/~brb/DataArchive_New.html. More details of these five data sets are shown in Tables 1 and 2. The value in parentheses in Table 2 is the number of examples of class 1 or class 2 in that data set.

5. Experimental Results and Performance Comparison

The BQPSO/SVM approach was implemented in MATLAB, along with BPSO/SVM and GA/SVM. The SVM classifier used in these three approaches is based on the LIBSVM library [43]. Since we were considering the performance of the search algorithm rather than the influence of the SVM parameters on classification, we used the default parameters of LIBSVM, with the radial basis function as the kernel function. The fitness function in this work is based on the classification accuracy of leave-one-out cross validation (LOOCV).

All experiments were carried out on a PC with Windows OS, a Pentium Dual-Core 2.60 GHz CPU, and 2 GB of RAM. Because the three algorithms are stochastic search methods, BQPSO/SVM, BPSO/SVM, and GA/SVM were each executed independently 25 times on each of the five cancer-related microarray data sets, in order to reach statistically meaningful conclusions.

5.1. Parameter Settings

The parameters used in BQPSO, BPSO, and GA algorithms are shown in Table 3. These parameters were selected after several test evaluations of each algorithm and data set instance until reaching the best configuration in terms of the overall quality of solutions.

5.2. Discussion and Analysis

Based on the results of the experiments, we analyzed the performance and robustness of the algorithms, as well as the quality of the obtained solutions, and provide a biological description of the most significant ones. We conducted the experiments for BPSO/SVM and GA/SVM in order to demonstrate the advantage of the proposed BQPSO/SVM without any confounding factors, since in our work all three algorithms were run in exactly the same hardware and software environment and with the same data sets and parameters.

5.2.1. Performance Analysis

Next, we compare BQPSO/SVM with BPSO/SVM and GA/SVM. Since the three algorithms are run in the same environment and with the same parameters and data sets, the results are directly comparable. Table 4 lists the highest LOOCV accuracy in 25 independent executions of each method for each data set. The mean columns contain the average of the LOOCV accuracy obtained from the 25 independent executions.

The performance comparison shows that BQPSO/SVM has an obvious advantage over BPSO/SVM and GA/SVM. In terms of classification accuracy, the search capability of BQPSO/SVM is stronger than that of the other two competitors.

The purpose of feature selection in our work is to find small subsets with high classification accuracy. In Figure 2, the number of genes is the mean size of subsets from 25 executions. Obviously, the proposed BQPSO/SVM provided smaller subsets of genes than the other two methods.

5.2.2. Algorithm Robustness

Besides the quality of the solutions, the ability of an algorithm to generate similar or identical results when executed several times is also important. Robustness is one of the most important criteria in assessing any proposed algorithm, particularly for metaheuristics such as those employed in this work. The standard deviation (std. dev.) in Table 5 denotes the standard deviation of accuracy over 25 independent executions. As can be seen from the standard deviation, the robustness of the proposed algorithm is significantly better than that of GA/SVM. Compared with BPSO/SVM, our proposed algorithm obtained a smaller standard deviation on the Prostate and Colon data sets; on the other data sets it found much better solutions, which led to a larger standard deviation. Overall, Table 5 shows that BQPSO/SVM has an obvious advantage over the other two approaches in terms of robustness.

5.2.3. Brief Biological Analysis of Selected Genes

Finally, the best subsets of genes were found for each data set. We pooled all subsets having the highest accuracy and listed the selected genes. For the Colon data set, the top 5 genes with the highest selection frequency are presented in Table 6.
(i) Among the genes listed in Table 6, two were also selected by [44]. The first gene is the uroguanylin precursor Z50753. It was shown in [45, 46] that a reduction of uroguanylin might be an indication of colon tumors; these studies also reported that treatment with uroguanylin has positive therapeutic significance for the reduction of precancerous colon polyps.
(ii) The second selected gene of the Colon data set is R87126 (myosin heavy chain, nonmuscle). The isoform B of R87126 serves as a tumor suppressor and is well known as a component of the cytoskeletal network [47].

6. Conclusion

In this paper, a hybrid technique for gene selection and classification of high-dimensional DNA microarray data was presented and compared. The technique uses the metaheuristic algorithm BQPSO for feature selection, with the SVM classifier identifying potentially good gene subsets, and it was compared with BPSO and GA. In addition, the selected genes are validated by leave-one-out cross validation to improve the actual classification.

All three approaches were experimentally assessed on five well-known cancer data sets. A classification rate of 100% with fewer than 11 genes on average was obtained in most of our executions. The preprocessing method has shown a great influence on the performance of the proposed algorithm, since it supplies an early set of acceptable solutions to the evolution process. Continuing this line of work, we are interested in optimizing BQPSO/SVM in order to discover new and better subsets of genes using specific microarray data sets.

Competing Interests

The authors declare that they have no financial and personal relationships with other people or organizations that can inappropriately influence their work; there is no professional or other personal interest of any nature or kind in any product, service, and/or company that could be construed as influencing the position presented in, or the review of, this paper.

Acknowledgments

The research work was supported by the National Natural Science Foundation of China (Projects nos. 61373055 and 61300150), Natural Science Foundation for College and Universities in Jiangsu Province (Project no. 16KJB520051), and the Qing Lan Project of Jiangsu and Wuxi Institute of Technology.