Scientific Programming

Volume 2016, Article ID 2739621, 10 pages

http://dx.doi.org/10.1155/2016/2739621

## A Parallel Genetic Algorithm Based Feature Selection and Parameter Optimization for Support Vector Machine

College of Computer Science, Sichuan University, Chengdu, Sichuan 610065, China

Received 27 October 2015; Revised 30 May 2016; Accepted 8 June 2016

Academic Editor: Tomàs Margalef

Copyright © 2016 Zhi Chen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The extensive application of support vector machines (SVMs) requires an efficient method of constructing an SVM classifier with high classification ability. The performance of an SVM crucially depends on whether the optimal feature subset and SVM parameters can be obtained efficiently. In this paper, a coarse-grained parallel genetic algorithm (CGPGA) is used to simultaneously optimize the feature subset and parameters for SVM. The distributed topology and migration policy of CGPGA help find the optimal feature subset and parameters for SVM in significantly shorter time and thereby increase the quality of the solution found. In addition, a new fitness function, which combines the classification accuracy obtained from the bootstrap method, the number of chosen features, and the number of support vectors, is proposed to lead the search of CGPGA in the direction of optimal generalization error. Experimental results on 12 benchmark datasets show that our proposed approach outperforms the genetic algorithm (GA) based method and the grid search method in terms of classification accuracy, number of chosen features, number of support vectors, and running time.

#### 1. Introduction

The overwhelming amount of data currently available in every field gives researchers great opportunities to obtain knowledge that was previously out of reach. However, this volume of data also requires the ability to efficiently extract the essential knowledge from existing data and to generalize it to future, unseen data. Support vector machines (SVMs), proposed by Vapnik [1], have become a reference method for many classification problems because of their flexibility, computational efficiency, and capability of handling high dimensional data. Despite the promising results that SVMs have provided, it remains a challenge to efficiently construct an SVM classifier that gives accurate predictions on unseen samples. This so-called generalization ability crucially depends on two tasks, namely, feature selection and parameter optimization [2–4].

Feature selection is used to identify a subset of available features which is most essential for classification. Feature selection is important for a variety of reasons, including generalization performance, computational efficiency, feature interpretability, and learning convergence [5–7]. Classification problems typically involve a number of features. However, not all of these features are equally important for a specific task. By extracting the essential information from a given dataset while using the smallest number of features, one can save significant computation time and build classifiers that have better generalization ability.

Along with feature selection, parameter optimization is another key factor that affects the generalization ability of SVMs. A proper parameter setting not only improves the classification ability of a learned SVM model but also leads to efficient classification of unseen samples. The parameters that need to be optimized include the error penalty parameter $C$ and the kernel function parameter, such as the parameter $\gamma$ of the Gaussian kernel function. The performance of an SVM largely depends on the choice of these parameters. Thus, parameter selection is an important research topic in the study of SVMs [8–13].

Both the feature selection result and the parameter setting have a significant impact on the accuracy and efficiency of SVMs. Moreover, the choice of feature subset and the parameter setting influence each other, and performing the two tasks independently might result in a loss of classification ability [2, 4]. Motivated by these views, the trend in recent years is to turn the two tasks into a multiobjective optimization problem so that global search algorithms, such as genetic algorithm (GA) [2, 14, 15], particle swarm optimization (PSO) [3], and ant colony optimization (ACO) [4], can be used to perform them jointly. However, performing the two tasks jointly greatly expands the solution space, and strong search ability is required to find the optimal feature subset and parameters for SVMs. In addition, because training an SVM even once requires a great deal of computation, applying these global search algorithms in practice becomes computationally infeasible as the number of training samples increases.

The aim of this paper is to present an efficient and effective method of constructing an SVM classifier, so that SVMs can be applied to a wider range of practical problems and provide promising results. In this paper, a coarse-grained parallel genetic algorithm (CGPGA) is used to jointly select the feature subset and optimize the parameters for SVMs. The key idea of CGPGA is to divide the whole GA population into several separate subpopulations, each of which searches the whole solution space in parallel. After every certain number of generations, the best individual of each subpopulation migrates to other subpopulations. The distributed topology and the migration policy can significantly accelerate the process of feature selection and parameter optimization and thereby increase the classification accuracy of the SVM.

Another key issue addressed in this paper is the design of a proper fitness function that can be used to assess the true generalization ability of a learned SVM and direct the search of CGPGA toward optimal generalization error. An essential part of the model selection process (i.e., choosing one classifier over another) is to evaluate the performance of classifiers and choose the best one. However, the performance estimate derived from the training data is often overoptimistic, due to overspecialization of the learning algorithm to the data [16]. In this paper, a new fitness function, which combines the classification accuracy obtained from the *k*-fold bootstrap method, the number of chosen features, and the number of support vectors, is proposed to measure the generalization ability of a learned SVM. Experiments on 12 benchmark datasets show that our proposed method not only achieves higher classification accuracy, a smaller feature subset, and a smaller number of support vectors, but also takes significantly shorter processing time.

The remainder of this paper is organized as follows. A brief introduction to the SVM is given in Section 2. Section 3 introduces the basic concepts of parallel genetic algorithms. Section 4 gives a detailed description of our proposed approach. The results of our evaluation are given in Section 5. Section 6 concludes this paper.

#### 2. Support Vector Machines

##### 2.1. Linear SVM

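First, we briefly describe the SVM formulation. SVM is designed for binary-classification problems. Suppose we are given the training data $(x_i, y_i)$, $i = 1, \dots, n$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$, where $\mathbb{R}^d$ is the input space, $x_i$ is the sample vector, and $y_i$ is the class label of $x_i$. A hyperplane in the feature space can be described as $w \cdot x + b = 0$, where $w$ is normal to the hyperplane and $b$ is a scalar. The distance from a point $x_i$ in the feature space to the hyperplane is

$$d(x_i) = \frac{|w \cdot x_i + b|}{\|w\|}. \tag{1}$$

When the training samples are linearly separable, the SVM finds an optimal separating hyperplane that maximizes the minimum value of $d(x_i)$, by solving the following optimization problem:

$$\min_{w,\,b} \ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1, \quad i = 1, \dots, n. \tag{2}$$

In the linearly nonseparable case, there is no hyperplane that is able to classify every training sample correctly. In order to relax the separable case to the nonseparable one, the slack variables $\xi_i$ are introduced into the optimization problem:

$$\min_{w,\,b,\,\xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n, \tag{3}$$

where parameter $C$ is the tuning parameter used to balance the margin and the training error. Optimization problem (3) can be solved by introducing the Lagrange multipliers $\alpha_i$, which transform it to the dual form:

$$\max_{\alpha} \ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \,(x_i \cdot x_j) \quad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C. \tag{4}$$

In the classification phase, a sample $x$ in the feature space is assigned a label $f(x)$ according to the following equation:

$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{n} \alpha_i y_i \,(x_i \cdot x) + b \right). \tag{5}$$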
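The quantities in this subsection can be checked numerically with a small sketch. The code below is only an illustration under stated assumptions: the function names, the toy data, and the values of `w` and `b` are hypothetical, and no SVM is trained. The decision rule uses the primal weight vector $w$, which in the dual form equals $\sum_i \alpha_i y_i x_i$.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def distance_to_hyperplane(w, b, x):
    """Distance from point x to the hyperplane w.x + b = 0."""
    return abs(dot(w, x) + b) / math.sqrt(dot(w, w))

def soft_margin_objective(w, b, samples, labels, C):
    """Soft-margin primal objective 0.5*||w||^2 + C * sum of slacks.

    Each slack is set to its smallest feasible value,
    max(0, 1 - y_i * (w.x_i + b)).
    """
    slacks = [max(0.0, 1.0 - y * (dot(w, x) + b))
              for x, y in zip(samples, labels)]
    return 0.5 * dot(w, w) + C * sum(slacks)

def predict(w, b, x):
    """Assign label +1 or -1 from the sign of w.x + b."""
    return 1 if dot(w, x) + b >= 0 else -1

# Toy 2-D data and hyperplane (hypothetical values, for illustration only).
samples = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (1.0, 0.0)]
labels = [1, 1, -1, -1]
w, b = (1.0, 1.0), -3.0

if __name__ == "__main__":
    print(distance_to_hyperplane(w, b, samples[0]))
    print(soft_margin_objective(w, b, samples, labels, C=1.0))
    print([predict(w, b, x) for x in samples])
```

With these toy values every sample satisfies the margin constraint, so all slacks are zero and the objective reduces to the margin term. Adding a sample on the wrong side of the margin makes the $C$-weighted penalty visible.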

##### 2.2. Kernel

When linear SVM cannot provide satisfactory performance, nonlinear SVM is suggested. The basic idea is to map $x$ by a nonlinear mapping function $\phi(x)$ to a higher dimensional feature space, in which the data are sparse and possibly more separable. Based on the observation that only the inner product of two vectors is needed in (4) and (5), the mapping is often not explicitly given. Instead, a kernel function $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$ is incorporated to simplify the computation of the inner product value. The kernel function gives the inner product value of $x_i$ and $x_j$ in the feature space. Choosing a kernel function is therefore choosing a feature space, and the decision function (5) becomes

$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{n} \alpha_i y_i \, K(x_i, x) + b \right). \tag{6}$$

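Among the variety of kernel functions available, the generally used kernel functions include the linear, polynomial, and Gaussian kernels:

$$\begin{aligned} \text{linear:} \quad & K(x_i, x_j) = x_i \cdot x_j, \\ \text{polynomial:} \quad & K(x_i, x_j) = (\gamma \, x_i \cdot x_j + r)^{d}, \\ \text{Gaussian:} \quad & K(x_i, x_j) = \exp\left(-\gamma \, \|x_i - x_j\|^2\right), \end{aligned} \tag{7}$$

where $\gamma > 0$, $r$, and $d$ are kernel parameters.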
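To make the kernel substitution concrete, the following sketch implements three widely used kernels (linear, polynomial, and Gaussian) and a kernelized decision rule of the form $\operatorname{sgn}(\sum_i \alpha_i y_i K(x_i, x) + b)$. The choice of these three kernels, the parameter names `gamma`, `r`, and `degree`, and the support vectors and multipliers in the example are assumptions made for illustration; they are not the result of training.

```python
import math

def linear_kernel(x, z):
    """K(x, z) = x . z"""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, gamma=1.0, r=1.0, degree=2):
    """K(x, z) = (gamma * x . z + r) ** degree"""
    return (gamma * linear_kernel(x, z) + r) ** degree

def gaussian_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def kernel_decision(x, support_vectors, labels, alphas, b, kernel):
    """Return +1 or -1 from the sign of sum_i alpha_i*y_i*K(x_i, x) + b."""
    score = b + sum(a * y * kernel(sv, x)
                    for sv, y, a in zip(support_vectors, labels, alphas))
    return 1 if score >= 0 else -1

# Hypothetical support vectors and multipliers (not obtained by training).
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
sv_labels = [-1, 1]
alphas = [1.0, 1.0]

if __name__ == "__main__":
    for point in [(1.9, 2.0), (0.1, 0.0)]:
        print(point, kernel_decision(point, support_vectors, sv_labels,
                                     alphas, 0.0, gaussian_kernel))
```

Because the decision rule only calls `kernel`, switching feature spaces amounts to passing a different function, which is why the mapping $\phi$ never has to be computed explicitly.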

#### 3. Parallel Genetic Algorithms

GAs are stochastic search algorithms based on principles of natural selection and recombination. They attempt to find the optimal solution to the problem at hand by manipulating a population of candidate solutions. The population is evaluated, and the best solutions are selected to reproduce and mate to form the next generation. After a number of generations, good traits dominate the population, resulting in an increase in the quality of the solutions. In most cases, GAs are efficient enough to find acceptable solutions. However, when applied to more complex problems, they risk premature convergence to local optima [17] and a large increase in the time required to find adequate solutions.
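The evaluate, select, recombine, and mutate loop described above can be sketched on a toy problem. The example below is a minimal serial GA, not the configuration used in this paper: the OneMax fitness (count of 1 bits), tournament selection, one-point crossover, elitism, and all parameter values are assumptions chosen to keep the sketch short.

```python
import random

def one_max(bits):
    """Toy fitness: number of 1 bits (stands in for a real objective)."""
    return sum(bits)

def tournament(population, fitness, rng, k=2):
    """Select the fittest of k randomly drawn individuals."""
    return max(rng.sample(population, k), key=fitness)

def crossover(a, b, rng):
    """One-point crossover producing a single child."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(bits, rng, rate):
    """Flip each bit independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in bits]

def evolve(population, fitness, rng, mutation_rate=0.02):
    """Build the next generation, carrying over the current best (elitism)."""
    children = [max(population, key=fitness)]
    while len(children) < len(population):
        child = crossover(tournament(population, fitness, rng),
                          tournament(population, fitness, rng), rng)
        children.append(mutate(child, rng, mutation_rate))
    return children

def run_ga(n_bits=20, pop_size=30, generations=40, seed=0):
    """Run the GA and return (best individual, best fitness per generation)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population = evolve(population, one_max, rng)
        history.append(max(map(one_max, population)))
    return max(population, key=one_max), history

if __name__ == "__main__":
    best, history = run_ga()
    print(one_max(best), history)
```

Because the best individual is always copied into the next generation, the best fitness recorded in `history` never decreases. Without a mechanism that maintains diversity, however, such a loop can still stall at a local optimum.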

There have been multiple efforts [18, 19] to make GAs faster, and one of the most promising choices is to use parallel implementations, known as parallel genetic algorithms (PGAs). The basic idea behind most PGAs is to divide the whole population into several subpopulations and evolve all the subpopulations simultaneously using multiple processors. A PGA basically consists of several GAs, each of which processes a part of the population or an independent subpopulation, with or without communication between them. Therefore, PGAs can increase the diversity of the population and significantly reduce computation time.

There are three main types of PGAs: (1) master-slave type, (2) fine-grained type, and (3) coarse-grained type [18]. A master-slave PGA behaves like a serial GA; parallelization does not change the behavior of the algorithm. This model uses a single global population, and the fitness evaluation is distributed among the available processors or cores. Because selection and crossover in this type of PGA consider the entire population, it is also known as global PGA. In the fine-grained algorithm, the population is divided into a large number of very small subpopulations, which are maintained by different processors. In the ideal case, each processor is allocated only one individual. This method is rarely used because it requires a large number of processors and incurs high communication cost in every generation.
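The master-slave model can be illustrated by distributing only the fitness evaluation while keeping a single population. The sketch below is an assumption-laden illustration: the toy `fitness` function and the worker count are hypothetical, and a thread pool from Python's standard library is used only to keep the example portable. In standard CPython, a CPU-bound fitness function such as SVM training would typically be evaluated in separate processes instead.

```python
from concurrent.futures import ThreadPoolExecutor

def fitness(individual):
    """Toy fitness (number of 1 bits); a stand-in for an expensive evaluation."""
    return sum(individual)

def evaluate_serial(population):
    """Evaluate every individual one after another."""
    return [fitness(individual) for individual in population]

def evaluate_master_slave(population, workers=4):
    """The master hands individuals to workers; map() keeps population order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fitness, population))

population = [[0, 1, 1, 0], [1, 1, 1, 1], [0, 0, 0, 0], [1, 0, 0, 1]]
scores = evaluate_master_slave(population)

if __name__ == "__main__":
    print(scores)
```

The parallel and serial evaluations return the same scores in the same order, which reflects the point made above: the master-slave model changes how fast fitness values are obtained, not how the GA searches.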

The coarse-grained type, also known as distributed GA or the island model, divides the whole population into a few large subpopulations. Genetic operators are applied within each subpopulation. After several generations, individuals are exchanged between subpopulations, forming new subpopulations for further evolution. This exchange process is called “migration”; it is the essential part of CGPGA, diversifying the population and helping to prevent premature convergence. The schematic of CGPGA is given in Figure 1.
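A rough sketch of the island model with periodic migration is given below. It is not the implementation evaluated in this paper. The ring topology, the policy of replacing the worst individual with the incoming migrant, the toy fitness, and all parameter values are assumptions, and the islands are evolved one after another in a single process, whereas a real CGPGA would run each island on its own processor.

```python
import random

def fitness(bits):
    """Toy fitness: number of 1 bits."""
    return sum(bits)

def next_generation(island, rng, mutation_rate=0.05):
    """One generation inside an island: elitism, binary tournament,
    one-point crossover, and bit-flip mutation."""
    def pick():
        return max(rng.sample(island, 2), key=fitness)
    children = [max(island, key=fitness)]
    while len(children) < len(island):
        a, b = pick(), pick()
        cut = rng.randrange(1, len(a))
        child = a[:cut] + b[cut:]
        children.append([1 - g if rng.random() < mutation_rate else g
                         for g in child])
    return children

def migrate_ring(islands):
    """Copy each island's best individual to the next island in the ring,
    where it replaces that island's worst individual (modifies in place)."""
    bests = [max(island, key=fitness) for island in islands]
    for i, island in enumerate(islands):
        incoming = bests[i - 1]  # island i receives from island i-1 (wraps)
        worst = min(range(len(island)), key=lambda j: fitness(island[j]))
        island[worst] = list(incoming)

def run_islands(n_islands=4, island_size=10, n_bits=24,
                generations=30, interval=5, seed=1):
    """Evolve all islands, migrating every `interval` generations."""
    rng = random.Random(seed)
    islands = [[[rng.randint(0, 1) for _ in range(n_bits)]
                for _ in range(island_size)] for _ in range(n_islands)]
    for generation in range(1, generations + 1):
        islands = [next_generation(island, rng) for island in islands]
        if generation % interval == 0:
            migrate_ring(islands)
    return islands

if __name__ == "__main__":
    final = run_islands()
    print([max(map(fitness, island)) for island in final])
```

The `interval` parameter controls the trade-off discussed above: frequent migration makes the islands behave like one large population, while rare migration keeps them diverse but slows the spread of good solutions.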