Research Article  Open Access
A Robust Hybrid Approach Based on Estimation of Distribution Algorithm and Support Vector Machine for Hunting Candidate Disease Genes
Abstract
Microarray data are high dimension with high noise ratio and relatively small sample size, which makes it a challenge to use microarray data to identify candidate disease genes. Here, we have presented a hybrid method that combines estimation of distribution algorithm with support vector machine for selection of key feature genes. We have benchmarked the method using the microarray data of both diffuse B cell lymphoma and colon cancer to demonstrate its performance for identifying key features from the profile data of highdimension gene expression. The method was compared with a probabilistic model based on genetic algorithm and another hybrid method based on both genetics algorithm and support vector machine. The results showed that the proposed method provides new computational strategy for hunting candidate disease genes from the profile data of disease gene expression. The selected candidate disease genes may help to improve the diagnosis and treatment for diseases.
1. Introduction
Complex diseases are frequently accompanied by changes in gene expression patterns which can serve as secondary endpoints or biomarkers [1]. Microarray technology, which allows researchers to simultaneously measure expression levels of thousands or tens of thousands of genes in a single experiment, has been widely used to explore the gene expression pattern of complex diseases [2]. Typically, there are only a small number of genes associated with diseases. Thus, the selection of feature genes that possess discriminatory power for disease phenotypes is a common task for mining microarray data that are usually high dimension (with thousands of genes) and have small sample size (with usually a few dozens of samples) [3].
The method of gene selection generally falls into one of the following three categories: the filter, wrapper, and embedded approaches. The filter approach collects the intrinsic characteristics of genes in discriminating the targeted phenotype class and usually employs statistical methods, such as mutual information, statistical tests (test, test), and Wilcoxon’s rank test, to directly select feature genes [4, 5]. This approach is easily implemented, but ignores the complex interaction between genes. The “wrapper” approach [6] aims at selecting a subset of feature genes, typically with an induction algorithm to search for an initial gene subset which can then be used for further evaluating new feature gene subsets. The wrapper method is usually superior to the filter one since it involves intercorrelation of individual genes in a multivariate manner. The wrapper method can automatically determine the optimal number of feature genes for a particular classifier. The embedded method is similar to the wrapper method, while multiple algorithms can be combined in the embedded method to perform feature subset selection [6, 7]. In the embedded method, genetic algorithms (GAs) [8, 9] are generally used as the search engine for feature subset, while other classification methods, such as KNN/GA (K nearest neighbors/genetic algorithms) [10], GASVM (genetic algorithmssupport vector machine) [11], and so forth, are used to select feature subset. Estimation of distribution algorithm (EDA) [12] is a general framework of GA. Compared to traditional GA that employs crossover and mutation operators to create new population, EDA creates new populations by using a statistical approach to estimate the probability distribution of all promising individual solutions for the previous generation. EDA can also explicitly take into account specific interactions among the variables. When EDA is used to search for feature subsets, classification methods, such as Support vector machine (SVM) [13–19], which can deal with the highdimension data in a limited sample space, can be used to select feature subsets.
In this study, we have developed a hybrid approach that combines both EDA and SVM (termed EDASVM) for selecting key feature genes. Here, EDA acts as the search engine, while SVM serves as the classifier, namely, the evaluator. We have applied EDASVM to two wellknown microarray datasets: a colon data [20] and a diffuse large B cell lymphoma data [3]. Our results have shown that EDASVM can be used to identify a smaller number of informative genes with better accuracy in comparison to GASVM [11] and an estimation of distribution algorithm named PMBGA [21].
2. Materials and Methods
2.1. Description of DLBCL Datasets
We have applied the EDASVM method to the two following data sets: the diffuse large B cell lymphoma (DLBCL) data [3], available at http://llmpp.nih.gov/lymphoma/data.shtml, and the colon data [20], available at http://microarray.princeton.edu/oncology/affydata/index.html. The colon data set consists of 62 tissue samples including 40 tumors and 22 normal tissues, which cover 2000 human gene expression. The DLBCL data set harbors preprocessed expression profile of 4026 genes in tissues derived from 21 activated Blike DLBCL (ABlike DLBCL) samples and 21 germinal center Blike DLBCL (GCBlike DLBCL) samples.
2.2. Data Preprocessing
In DLBCL dataset, among 4026 genes, 6% genes have missing values and are imputed by the KNN Impute algorithm [22] prior to the EDASVM analysis. The KNN Impute algorithm uses the expression profiles of nearest neighbors (here ) to impute the missing values for the target gene. Therefore, in colon data is a matrix with 62 rows and 2000 columns. In DLBCL data, is a matrix with 42 rows and 4026 columns.
2.3. EDASVM
Figure 1 shows the main flowchart of the EDASVM. EDA acts as the search engine, while SVM serves as the classifier, namely, the evaluator. The computational procedures are described in Algorithm 1. The major elements of the EDA include feature subset coding, population initialization, fitness computation, estimation probability distribution, generation of offspring and control of parameter assignment. At the beginning, we randomly generated the N fixedlength binary strings (individuals) to build up the initial population. Then, we calculated the fitness for each feature subset. Classification accuracy acted as the fitness index (fitness) that was evaluated using a linear SVM. The algorithm is an iterative process in which each successive generation is produced by estimating the probability distribution model of the selected individuals (parents) in the current generation and sampling the probability distribution to generate new offsprings. In this manner, reasonable subsets are developed successively until the terminal condition is fulfilled. In two data sets, lr is a learning rate and is assigned 0.08. Population size is set as 40 and the maximal generations of 50 are determined, such that the solution space can be sufficiently searched while the best minimal subset can be obtained within the evolution time.

For each gene expression submatrix , we classify the microarray samples with genes contained in individual using a linear SVM. The classifier, [18], is then, the accuracy of classification is where is the number of test samples and
The weight of each feature in individual is where is a test sample vector and is the learning sample vector. is the number of learning samples. is a class indicator (for a twoclass application, +1 for the first class, −1 for the second class), and is a nonnegative Lagrange multiplier associated with and for support vectors. sgn( ) is the sign function and is the kernel function: linear kernel (, i.e., their inner product).
In this study, a fivefold crossvalidation (CV) resampling approach is used to construct the learning and test sets. First, the twoclass samples are randomly divided into 5 nonoverlapping subsets of roughly equal size, respectively. A random combination of the subsets for the two classes constitutes a test set, and the rest of subsets is totally used as the learning set. The 5fold CV resampling produces 25 pairs of learning and test sets. Individual is evaluated by the averaged value over the 25 pairs, that is, where is the replicate number and is the classification accuracy for the th replicate.
In the EDASVM algorithm, the optimization of the feature gene subset(s) is realized via survival competitions. For each generation, we retain 50% of the highvalued individuals that will directly enter next generation in order to keep these optimal solutions unchanged. On the other hand, in order to avoid the loss of the putative important feature genes, we initially contained about half of genes in each individual or preserving informative gene. Then, we adopt a stepwise data reduction procedure to minimize the feature subsets with more reliable classification accuracy. These gene expression matrices from the optimal individuals serve as the data on which the new round of iteration is performed. The data reduction process is completed once a stable gene subset is obtained.
2.4. GASVM
GASVM was previously developed [11] by us as a feature selection method. In GASVM, better feature subsets have a greater chance of being selected to form a new subset through crossover or mutation. Mutation changes some of the values (thus adding or deleting features) in a subset randomly. Crossover combines different features from a pair of subsets into a new subset. The algorithm is an iterative process in which each successive generation is produced by applying genetic operators to the members of the current generation. In this manner, good subsets are “evolved” over time until the stopping criteria are met. Thus, coding feature subset, population initialization, fitness computation, genetic operation, and control parameter assignment (population size, the maximal number of generations, and the selection probability) are the major elements of the GASVM method.
2.5. PMBGA
PMBGA can be applied for selection of a smaller size gene subset that would classify patient samples more accurately [21]. PMGBA generates initial population and builds a probability model and then selects individuals from the population. Probability distribution can be estimated based on the collection of selected individuals, and probability model can accordingly be amended so that a population is generated by sampling from the model. Instead of applying crossover and mutation operators in the process of generating new possible solutions (offspring), population can be updated in whole or in part relied on probability model.
3. Results
3.1. Benchmark EDASVM
The EDASVM method was applied firstly to the DLBCL data set. We started analysis with all 4026 genes and progressively reduced the dimension of the feature genes successively for 8 iterations after convergence. The accuracy of EDASVM increased from 0.9339 initially to 0.9982 at convergence (Figure 2(a)), while the number of feature genes at the successive generations is 4026, 460, 66, 17, 11, 7, 6, and 6, respectively (Figure 2(b)). For the colon data set, EDASVM reached accuracy of 1.0 after 7 iterations, and the final gene subset includes only 5 genes (Figure 3).
(a)
(b)
(a)
(b)
We compared the performance of EDASVM with two alternative methods: GASVM and PMBGA (Figures 2 and 3). The convergence speed of EDASVM is the fastest among the three methods. EDASVM converged after 8 and 7 iterations for the DLBCL and colon datasets, respectively. In contrast, it took 13 and 10 iterations for GASVM to converge, and 10 and 10 iterations for PMBGA to converge. Moreover, both the accuracy and the stability of EDASVM also show advantages among the three methods. EDASVM quickly reaches high accuracy after only a couple of iterations, while both the other two methods took more iteration to reach high accuracy. In addition, the accuracy of the other two methods had large variation during the iteration, while the accuracy of EDASVM kept stable during the iteration after it reached the high accuracy.
3.2. Biological Analysis of the Selected Genes in the DLBCL Data
To understand the biological significance of the selected genes, we have analyzed the annotations of selected genes according to Gene Ontology (GO) (http://www.geneontology.org/) [23] and KEGG (http://www.genome.jp/kegg/kegg2.html) [24, 25] database. We selected six genes in the DLBCL data, which are SPIB, IRF8, NFKB2, LMO2, FCGRT, and BCL7B. The GO annotations of these six genes are shown in Table 1. Literature reviews of these six genes suggested that they are highly related to DLBCL. SPIB is an oncogene involved in the pathogenesis of ABlike DLBCL [26]. NFKB2 is a subunit of NFκB whose signaling pathway might contribute to the biological and clinical differences between the GCBlike and the ABlike DLBCL [27]. LMO2 was found to be located in the most frequent regime of chromosomal translocation in childhood T cell acute lymphoblastic leukemia. It was reported that LMO2 expressed at high level in germinal center B cell lymphocytes and at low level in ABlike DLBCL, respectively [3]. LMO2 is also one of the six genes in a multivariate model previously developed for prolonged survival in the diffusive large bcell lymphoma [28]. BCL7B was found to be directly involved in a threeway gene translocation together with Myc and IgH in a Burkitt lymphoma cell line, and the disruption of the Nterminal region of BCL7B was thought to be related to the pathogenesis of a subset of highgrade B cell nonHodgkin lymphoma [29]. BCL2 contributes to the pathogenesis in ABlike DLBCL [10] and is the common target gene of miR21 and miR221, both of which are overexpressed in ABlike than GCBlike cell lines [30]. Based on the above evidences, EDASVM successfully identified genes that may play role in the pathogenesis of DLBCL.

4. Discussions and Conclusions
In this study, we have developed a hybrid method, EDASVM, which combines the estimation of distribution algorithms (EDA) with support vector machine (SVM) for selecting key feature genes from microarray data. Although similar combination strategies have been explored previously [21], EDASVM shows unique advantages compared with the alternative methods, GASVM or PMBGA. For example, EDAVSVM not only converged more quickly, but also achieved higher accuracy with stable performance than the other two methods did. Both EDASVM and PMBGA [21] use EDA as the search engine, and SVM acts as evaluation classifier in feature selection procedure. However, there are several key differences between the two methods. First, EDASVM weights each feature using “”, so that the contribution of each feature was fully considered during the update of each generation. In contrast, PMBGA assigns only a small random number to each feature. Second, for selecting minimal feature genes, EDASVM reduced the feature number step by step, while PMBGA did so by tuning the learning rate. Finally, the way to create the next generation in GA is also different between the two methods. As for the differences between EDASVM and GASVM, GASVM employs the traditional GA, while EDASVM generates new possible solutions (individuals) by sampling the probability distribution calculated from the selected solutions of previous generation.
The structure of genes in a microarray data can be described by a Bayesian network. However, microarray data usually contains the expression of thousands or tens thousands of genes, making it virtually impossible to build a Bayesian network with so many genes. In this study, we have shown with EDASVM that proper combination of machine learning algorithms can overcome the highdimension problem, and quickly converge to a small set of feature genes strongly related to target phenotype. The success of EDASVM thus made it readily applicable for hunting disease genes in microarray data.
List of Abbreviations
DLBCL:  Diffuse large Bcell lymphoma 
EDASVM:  Estimation for distribution algorithmsupport vector machine 
GO:  GeneOntology 
KEGG:  Kyoto Encyclopedia of Genes and Genomes 
GAs:  Genetic algorithms 
EDA:  Estimation of distribution algorithm 
ABlike DLBCL:  Activated Blike DLBCL 
GCBlike DLBCL:  Germinal center Blike DLBCL 
PMBGA:  Probabilistic Model Building Genetic Algorithm 
GASVM:  Genetic algorithmsupport vector machine. 
Acknowledgments
This work is supported in part by National Natural Science Foundation of China (30971621, 81270231, and 31170791), the National Basic Research Program of China (973 Program) (2012CB9668003 and 2010CB945500), International Science and Technology Cooperation Program of China (2011DFB30010), the Fundamental Research Funds for the Central Universities to L. Li, and Shanghai Municipal Health Bureau Project to L. Li. We thank Dr. Weidong Tian for critical review of the paper.
References
 W. Yang, D. Ying, and Y. L. Lau, “Indepth cDNA library sequencing provides quantitative gene expression profiling in cancer biomarker discovery,” Genomics, Proteomics and Bioinformatics, vol. 7, no. 12, pp. 1–12, 2009. View at: Publisher Site  Google Scholar
 S. S. ShenOrr, R. Tibshirani, P. Khatri et al., “Cell typespecific gene expression differences in complex tissues,” Nature Methods, vol. 7, no. 4, pp. 287–289, 2010. View at: Publisher Site  Google Scholar
 A. A. Alizadeh, M. B. Elsen, R. E. Davis et al., “Distinct types of diffuse large Bcell lymphoma identified by gene expression profiling,” Nature, vol. 403, no. 6769, pp. 503–511, 2000. View at: Publisher Site  Google Scholar
 P. J. Park, M. Pagano, and M. Bonetti, “A nonparametric scoring algorithm for identifying informative genes from microarray data,” Pacific Symposium on Biocomputing, pp. 52–63, 2001. View at: Google Scholar
 Y. Su, T. M. Murali, V. Pavlovic, M. Schaffer, and S. Kasif, “RankGene: Identification of diagnostic genes based on expression data,” Bioinformatics, vol. 19, no. 12, pp. 1578–1579, 2003. View at: Publisher Site  Google Scholar
 R. Kahavi and G. H. John, “Wrapper for feature subset selection,” Artificial Intelligence, vol. 97, pp. 273–324, 1997. View at: Google Scholar
 X. Li, S. Rao, Y. Wang, and B. Gong, “Gene mining: a novel and powerful ensemble decision approach to hunting for disease genes using microarray expression profiling,” Nucleic Acids Research, vol. 32, no. 9, pp. 2685–2694, 2004. View at: Publisher Site  Google Scholar
 S. J. Cho and M. A. Hermsmeier, “Genetic algorithm guided selection: variable selection and subset selection,” Journal of Chemical Information and Computer Sciences, vol. 42, no. 4, pp. 927–936, 2002. View at: Publisher Site  Google Scholar
 X. M. Zhao, Y. M. Cheung, and D. S. Huang, “A novel approach to extracting features from motif content and protein composition for protein sequence classification,” Neural Networks, vol. 18, no. 8, pp. 1019–1028, 2005. View at: Publisher Site  Google Scholar
 L. Li, T. A. Darden, C. R. Weinberg, A. J. Levine, and L. G. Pedersen, “Gene assessment and sample classification for gene expression data using a genetic algorithm/knearest neighbor method,” Combinatorial Chemistry and High Throughput Screening, vol. 4, no. 8, pp. 727–739, 2001. View at: Google Scholar
 L. Li, W. Jiang, X. Li et al., “A robust hybrid between genetic algorithm and support vector machine for extracting an optimal feature gene subset,” Genomics, vol. 85, no. 1, pp. 16–23, 2005. View at: Publisher Site  Google Scholar
 Y. Saeys, S. Degroeve, D. Aeyels, P. Rouzé, and Y. Van de Peer, “Feature selection for splice site prediction: a new method using EDAbased feature ranking,” BMC Bioinformatics, vol. 5, p. 64, 2004. View at: Publisher Site  Google Scholar
 M. P. S. Brown, W. N. Grundy, D. Lin et al., “Knowledgebased analysis of microarray gene expression data by using support vector machines,” Proceedings of the National Academy of Sciences of the United States of America, vol. 97, no. 1, pp. 262–267, 2000. View at: Publisher Site  Google Scholar
 J. H. Oh and J. Gao, “A kernelbased approach for detecting outliers of highdimensional biological data,” BMC Bioinformatics, vol. 10, supplement 4, p. S7, 2009. View at: Publisher Site  Google Scholar
 S. Hua and Z. Sun, “A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach,” Journal of Molecular Biology, vol. 308, no. 2, pp. 397–407, 2001. View at: Publisher Site  Google Scholar
 Y. Zhu, X. Shen, and W. Pan, “Networkbased support vector machine for classification of microarray samples,” BMC Bioinformatics, vol. 10, supplement 1, p. S21, 2009. View at: Publisher Site  Google Scholar
 L. Evers and C. M. Messow, “Sparse kernel methods for highdimensional survival data,” Bioinformatics, vol. 24, no. 14, pp. 1632–1638, 2008. View at: Publisher Site  Google Scholar
 V. Vapnik, Statistical Learning Theory, Wiley, New York, NY, USA, 1998.
 C. J. C. Burges, “A tutorial on support vector machines for pattern recognition,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, 1998. View at: Google Scholar
 U. Alon, N. Barka, D. A. Notterman et al., “Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays,” Proceedings of the National Academy of Sciences of the United States of America, vol. 96, no. 12, pp. 6745–6750, 1999. View at: Publisher Site  Google Scholar
 T. K. Paul and H. Iba, “Gene selection for classification of cancers using probabilistic model building genetic algorithm,” BioSystems, vol. 82, no. 3, pp. 208–225, 2005. View at: Publisher Site  Google Scholar
 O. Troyanskaya, M. Cantor, G. Sherlock et al., “Missing value estimation methods for DNA microarrays,” Bioinformatics, vol. 17, no. 6, pp. 520–525, 2001. View at: Google Scholar
 M. Ashburner, C. A. Ball, J. A. Blake et al., “Gene ontology: tool for the unification of biology. The Gene Ontology Consortium,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000. View at: Publisher Site  Google Scholar
 M. Kanehisa, S. Goto, S. Kawashima, Y. Okuno, and M. Hattori, “The KEGG resource for deciphering the genome,” Nucleic Acids Research, vol. 32, pp. D277–D280, 2004. View at: Google Scholar
 M. Kanehisa, S. Goto, S. Kawashima, and A. Nakaya, “Thed KEGG databases at GenomeNet,” Nucleic Acids Research, vol. 30, no. 1, pp. 42–46, 2002. View at: Google Scholar
 G. Lenz, G. W. Wright, N. C. T. Emre et al., “Molecular subtypes of diffuse large Bcell lymphoma arise by distinct genetic pathways,” Proceedings of the National Academy of Sciences of the United States of America, vol. 105, no. 36, pp. 13520–13525, 2008. View at: Publisher Site  Google Scholar
 R. E. Davis, K. D. Brown, U. Siebenlist, and L. M. Staudt, “Constitutive nuclear factor kappaB activity is required for survival of activated B celllike diffuse large B cell lymphoma cells,” The Journal of Experimental Medicine, vol. 194, pp. 1861–1874, 2001. View at: Google Scholar
 I. S. Lossos, D. K. Czerwinski, A. A. Alizadeh et al., “Prediction of survival in diffuse largeBcell lymphoma based on the expression of six genes,” The New England Journal of Medicine, vol. 350, no. 18, pp. 1828–1837, 2004. View at: Publisher Site  Google Scholar
 S. Amenta, M. Moschovi, C. Sofocleous, S. Kostaridou, A. Mavrou, and H. Fryssira, “NonHodgkin lymphoma in a child with Williams syndrome,” Cancer Genetics and Cytogenetics, vol. 154, no. 1, pp. 86–88, 2004. View at: Publisher Site  Google Scholar
 C. H. Lawrie, S. Soneji, T. Marafioti et al., “MicroRNA expression distinguishes between germinal center B celllike and activated B celllike subtypes of diffuse large B cell lymphoma,” International Journal of Cancer, vol. 121, no. 5, pp. 1156–1161, 2007. View at: Publisher Site  Google Scholar
Copyright
Copyright © 2013 Li Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.