Abstract

The extensive applications of support vector machines (SVMs) require an efficient method of constructing an SVM classifier with high classification ability. The performance of an SVM crucially depends on whether the optimal feature subset and parameters of the SVM can be obtained efficiently. In this paper, a coarse-grained parallel genetic algorithm (CGPGA) is used to simultaneously optimize the feature subset and the parameters of the SVM. The distributed topology and migration policy of the CGPGA help find the optimal feature subset and parameters in significantly shorter time and increase the quality of the solutions found. In addition, a new fitness function, which combines the classification accuracy obtained from the bootstrap method, the number of chosen features, and the number of support vectors, is proposed to lead the search of the CGPGA toward optimal generalization error. Experimental results on 12 benchmark datasets show that our proposed approach outperforms the genetic algorithm (GA) based method and the grid search method in terms of classification accuracy, number of chosen features, number of support vectors, and running time.

1. Introduction

The overwhelming amount of data that is currently available in any field provides great opportunities for researchers to obtain knowledge that was impossible to obtain before. However, this enormous amount of data also requires the ability to efficiently extract the essential knowledge from existing data and generalize the obtained knowledge to future unseen data. Support vector machines (SVMs), proposed by Vapnik [1], have become a reference method for many classification problems because of their flexibility, computational efficiency, and capability of handling high dimensional data. Despite all the promising results that SVMs have provided, it is still a challenge to efficiently construct an SVM classifier that can provide accurate predictions on unseen new samples. This so-called generalization ability crucially depends on two tasks, namely, feature selection and parameter optimization [2–4].

Feature selection is used to identify the subset of available features that is most essential for classification. Feature selection is important for a variety of reasons, including generalization performance, computational efficiency, feature interpretability, and learning convergence [5–7]. Classification problems typically involve a large number of features. However, not all of these features are equally important for a specific task. By extracting the essential information from a given dataset while using the smallest number of features, one can save significant computation time and build classifiers that have better generalization ability.

Along with feature selection, parameter optimization is another key factor that affects the generalization ability of SVMs. A proper parameter setting can not only improve the classification ability of a learned SVM model, but also lead to efficient classification of unseen new samples. The parameters that need to be optimized include the error penalty parameter C and the kernel function parameter, such as the parameter γ of the Gaussian kernel function. Since the performance of an SVM largely depends on the choice of these parameters, parameter selection is an important research topic in the study of SVMs [8–13].

Both the feature selection result and the parameter setting have a significant impact on the accuracy and efficiency of SVMs. Moreover, the choice of features and the setting of parameters influence each other, and performing these two tasks independently might result in a loss of classification ability [2, 4]. Motivated by these observations, the trend in recent years has been to turn these two tasks into a multiobjective optimization problem so that global search algorithms, such as the genetic algorithm (GA) [2, 14, 15], particle swarm optimization (PSO) [3], and ant colony optimization (ACO) [4], can be used to perform them jointly. However, jointly performing these two tasks leads to a greatly expanded solution space, and strong search ability is required to find the optimal feature subset and parameters for SVMs. Moreover, given that training an SVM even once requires a great deal of computation, applying these global search algorithms in practice becomes computationally infeasible as the number of training samples increases.

The aim of this paper is to present an efficient and effective method of constructing SVM classifiers, so that SVMs can be applied to a wider range of practical problems and provide promising results. In this paper, a coarse-grained parallel genetic algorithm (CGPGA) is used to jointly select the feature subset and optimize the parameters of the SVM. The key idea of the CGPGA is to divide the whole GA population into several separate subpopulations, each of which searches the whole solution space in parallel. After every certain number of generations, the best individual of each subpopulation migrates to the other subpopulations. The distributed topology and the migration policy can significantly accelerate the process of feature selection and parameter optimization and thereby increase the classification accuracy of the SVM.

Another key issue addressed in this paper is the design of a proper fitness function that can be used to assess the true generalization ability of the learned SVM and direct the search of the CGPGA toward optimal generalization error. An essential part of the model selection process (i.e., choosing one classifier over another) is to evaluate the performance of the classifiers and choose the best one. However, the classifier derived from the training data is often overoptimistic, due to overspecialization of the learning algorithm to the data [16]. In this paper, a new fitness function, which combines the classification accuracy obtained from the k-time bootstrap method, the number of chosen features, and the number of support vectors, is proposed to measure the generalization ability of the learned SVM. Experiments on 12 benchmark datasets show that our proposed method not only achieves higher classification accuracy, a smaller feature subset, and a smaller number of support vectors, but also takes significantly shorter processing time.

The remainder of this paper is organized as follows. A brief introduction to the SVM is given in Section 2. Section 3 introduces the basic concepts of parallel genetic algorithms. Section 4 gives a detailed description of our proposed approach. The results of our evaluation are given in Section 5. Section 6 concludes this paper.

2. Support Vector Machines

2.1. Linear SVM

First, we briefly describe the SVM formulation. The SVM is designed for binary classification problems. Given the training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$, with $\mathbf{x}_i \in X \subseteq \mathbb{R}^{n}$ and $y_i \in \{-1, +1\}$, where $X$ is the input space, $\mathbf{x}_i$ is the sample vector, and $y_i$ is the class label of $\mathbf{x}_i$, a hyperplane in the feature space can be described as $\mathbf{w}^{T}\mathbf{x} + b = 0$, where $\mathbf{w}$ is the normal to the hyperplane and $b$ is a scalar. The distance from a point $\mathbf{x}$ in the feature space to the hyperplane is
$$d(\mathbf{x}) = \frac{\left|\mathbf{w}^{T}\mathbf{x} + b\right|}{\|\mathbf{w}\|}. \tag{1}$$

When the training samples are linearly separable, the SVM finds an optimal separating hyperplane that maximizes the minimum value of $d(\mathbf{x}_i)$ by solving the following optimization problem:
$$\min_{\mathbf{w},\, b}\ \frac{1}{2}\|\mathbf{w}\|^{2} \quad \text{s.t.}\quad y_i\left(\mathbf{w}^{T}\mathbf{x}_i + b\right) \geq 1,\quad i = 1, \dots, l. \tag{2}$$

For the linearly nonseparable case, there is no hyperplane that is able to classify every training sample correctly. In order to relax the separable case to the nonseparable one, the slack variables $\xi_i \geq 0$ are introduced into the optimization problem:
$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}}\ \frac{1}{2}\|\mathbf{w}\|^{2} + C\sum_{i=1}^{l}\xi_i \quad \text{s.t.}\quad y_i\left(\mathbf{w}^{T}\mathbf{x}_i + b\right) \geq 1 - \xi_i,\ \xi_i \geq 0,\ i = 1, \dots, l, \tag{3}$$
where the parameter $C$ is the tuning parameter used to balance the margin and the training error. Optimization problem (3) can be solved by introducing the Lagrange multipliers $\alpha_i$, which transform it to the dual form:
$$\max_{\boldsymbol{\alpha}}\ \sum_{i=1}^{l}\alpha_i - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}\alpha_i\alpha_j y_i y_j\, \mathbf{x}_i^{T}\mathbf{x}_j \quad \text{s.t.}\quad \sum_{i=1}^{l}\alpha_i y_i = 0,\ 0 \leq \alpha_i \leq C. \tag{4}$$

In the classification phase, a sample $\mathbf{x}$ in the feature space is assigned a label according to the following equation:
$$f(\mathbf{x}) = \operatorname{sgn}\left(\sum_{i=1}^{l}\alpha_i y_i\, \mathbf{x}_i^{T}\mathbf{x} + b\right). \tag{5}$$

2.2. Kernel

When the linear SVM cannot provide satisfactory performance, the nonlinear SVM is suggested. The basic idea is to map $\mathbf{x}$ by a nonlinear mapping function $\phi(\mathbf{x})$ to a higher dimensional feature space, in which the data are sparse and possibly more separable. Based on the observation that only the inner product of two vectors is needed in (4) and (5), the mapping $\phi$ is often not given explicitly. Instead, a kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$ is incorporated to simplify the computation of the inner product value. The kernel function gives the inner product value $\phi(\mathbf{x}_i)^{T}\phi(\mathbf{x}_j)$ of $\mathbf{x}_i$ and $\mathbf{x}_j$ in the feature space. Choosing a kernel function is therefore choosing a feature space, and the decision function (5) becomes
$$f(\mathbf{x}) = \operatorname{sgn}\left(\sum_{i=1}^{l}\alpha_i y_i\, K(\mathbf{x}_i, \mathbf{x}) + b\right). \tag{6}$$

Among the variety of kernel functions available, the generally used kernel functions include
$$\begin{aligned}
&\text{linear:} && K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^{T}\mathbf{x}_j, \\
&\text{polynomial:} && K(\mathbf{x}_i, \mathbf{x}_j) = \left(\gamma\,\mathbf{x}_i^{T}\mathbf{x}_j + r\right)^{d}, \\
&\text{Gaussian (RBF):} && K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma\left\|\mathbf{x}_i - \mathbf{x}_j\right\|^{2}\right), \\
&\text{sigmoid:} && K(\mathbf{x}_i, \mathbf{x}_j) = \tanh\left(\gamma\,\mathbf{x}_i^{T}\mathbf{x}_j + r\right).
\end{aligned} \tag{7}$$
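
For concreteness, the following minimal NumPy sketch (not part of the original implementation, which used LIBSVM under MATLAB) shows how the Gaussian kernel matrix of (7) can be computed for two sets of sample vectors; the function name and arguments are illustrative.

```python
import numpy as np

def rbf_kernel(X1, X2, gamma):
    """Gaussian (RBF) kernel matrix: K[i, j] = exp(-gamma * ||x1_i - x2_j||^2)."""
    # Pairwise squared Euclidean distances between the rows of X1 and X2.
    sq_dists = (np.sum(X1 ** 2, axis=1)[:, None]
                + np.sum(X2 ** 2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return np.exp(-gamma * sq_dists)
```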

3. Parallel Genetic Algorithms

GAs are stochastic search algorithms based on the principles of natural selection and recombination. They attempt to find the optimal solution to the problem at hand by manipulating a population of candidate solutions. The population is evaluated, and the best solutions are selected to reproduce and mate to form the next generation. After a number of generations, good traits dominate the population, resulting in an increase in the quality of the solutions. In most cases, GAs are efficient enough to find acceptable solutions. However, when applied to more complex problems, they suffer the risk of premature convergence to local optima [17] and a large increase in the time required to find adequate solutions.

There have been multiple efforts [18, 19] to make GAs faster, and one of the most promising choices is to use parallel implementations. The basic idea behind most PGAs is to divide the whole population into several subpopulations and evolve all the subpopulations simultaneously using multiple processors. A PGA basically consists of several GAs, each of which processes a part of the population or an independent subpopulation, with or without communication between them. Therefore, PGAs can increase the diversity of the population and significantly reduce computation time.

There are three main types of PGAs: (1) the master-slave type, (2) the fine-grained type, and (3) the coarse-grained type [18]. A master-slave PGA behaves like a standard GA and does not affect the behavior of the algorithm. This model uses a single global population, and the fitness evaluation is distributed among the available processors or cores. Since, in this type of PGA, selection and crossover consider the entire population, it is also known as a global PGA. In the fine-grained type, the population is divided into a large number of very small subpopulations, which are maintained by different processors. In the ideal case, each processor is allocated only one individual. This method is rarely used because it requires a very large number of processors and incurs high communication cost in each generation.

The coarse-grained type is also known as the distributed GA or island model, which divides the whole population into a few large subpopulations. Genetic operators are carried out within each subpopulation. After several generations, individuals from different subpopulations are exchanged and form the new subpopulations for further evolution. This exchange process, named "migration," is the essential part of the CGPGA that diversifies the population and prevents premature convergence. The schematic of the CGPGA is given in Figure 1.

In this paper, the CGPGA is applied to simultaneously select the feature subset and optimize the parameters of the SVM. The whole population is divided into several subpopulations. Each of these subpopulations evolves independently, and different evolutionary strategies are applied to them. After every certain number of generations, the best individual of each subpopulation migrates to every other subpopulation and replaces the worst individual there, as sketched below.
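
The following toy Python sketch illustrates the island-model loop described above. It is not the authors' MATLAB implementation: the operators (tournament selection, uniform crossover, bit-flip mutation) are illustrative assumptions, while the two subpopulations of 60 individuals, the 10-generation migration interval, the per-island mutation rates, and the best-replaces-worst policy follow the settings reported in Sections 4.3 and 5.1.

```python
import random

def island_ga(fitness, chrom_len, n_islands=2, pop_size=60, n_generations=100,
              migration_interval=10, mutation_rates=(0.05, 0.02)):
    """Toy coarse-grained (island-model) GA over binary chromosomes."""
    rng = random.Random(0)
    islands = [[[rng.randint(0, 1) for _ in range(chrom_len)]
                for _ in range(pop_size)] for _ in range(n_islands)]
    for gen in range(1, n_generations + 1):
        # Each subpopulation evolves independently with its own mutation rate.
        for idx, pop in enumerate(islands):
            rate = mutation_rates[idx % len(mutation_rates)]
            new_pop = []
            while len(new_pop) < pop_size:
                # Tournament selection, uniform crossover, bit-flip mutation.
                p1 = max(rng.sample(pop, 3), key=fitness)
                p2 = max(rng.sample(pop, 3), key=fitness)
                child = [a if rng.random() < 0.5 else b for a, b in zip(p1, p2)]
                child = [1 - g if rng.random() < rate else g for g in child]
                new_pop.append(child)
            islands[idx] = new_pop
        # Migration: the best of each island replaces the worst of every other island.
        if gen % migration_interval == 0:
            bests = [max(pop, key=fitness) for pop in islands]
            for i, pop in enumerate(islands):
                for j, best in enumerate(bests):
                    if i != j:
                        pop[pop.index(min(pop, key=fitness))] = list(best)
    return max((ind for pop in islands for ind in pop), key=fitness)
```

For example, `island_ga(fitness=sum, chrom_len=20)` evolves 20-bit strings toward all ones; in the actual method the chromosome encodes C, γ, and the feature mask (Section 4.1), and the fitness is given by (9).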

4. Method

The chromosome design, fitness function, and system architecture of the proposed CGPGA are described as follows.

4.1. Chromosome Design

The design of the chromosome is an important step in the proposed CGPGA method. In this step, the Gaussian kernel is chosen as the kernel function of the SVM classifier. Each chromosome comprises three parts: parameter C, parameter γ, and the feature subset. Binary code is used to represent the chromosome. Figure 2 shows the binary chromosome representation of our design.

In Figure 2, the first part is the binary code of parameter C and the second part is the binary code of parameter γ, with $n_C$ bits representing parameter C and $n_\gamma$ bits representing parameter γ. Note that $n_C$ and $n_\gamma$ are chosen according to the required computational precision. The binary code representing the genotype of a parameter is transformed into its phenotype by
$$p = \mathrm{min}_p + \frac{\mathrm{max}_p - \mathrm{min}_p}{2^{n} - 1} \times d, \tag{8}$$
where $p$ is the phenotype (true value) of the bit string, $\mathrm{min}_p$ is the minimum value of the parameter, $\mathrm{max}_p$ is the maximum value of the parameter, $d$ is the decimal value of the bit string, and $n$ is the length of the bit string.
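
A minimal sketch of the genotype-to-phenotype decoding of (8); the function name is illustrative, and the bit ordering (most significant bit first) is an assumption.

```python
def decode_parameter(bits, min_p, max_p):
    """Decode a binary genotype (list of 0/1, MSB first) into its real-valued phenotype, per (8)."""
    d = int("".join(str(b) for b in bits), 2)          # decimal value of the bit string
    n = len(bits)                                      # length of the bit string
    return min_p + (max_p - min_p) * d / (2 ** n - 1)
```

For instance, a bit string of all ones decodes to max_p and a string of all zeros decodes to min_p.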

In the coding of the feature subset, "1" indicates that the corresponding feature is selected and "0" indicates that it is not. The length of this part equals the number of features in the original dataset.

4.2. Fitness Function

The fitness function is an essential part of the CGPGA. It evaluates the performance of each individual in the population and predicts which one has the best generalization ability. The classification accuracy obtained from the k-time bootstrap method, the number of selected features, and the number of support vectors in an SVM model are used to construct the fitness function. All of these measurements have been shown to be good indicators of generalization ability [5, 20, 21]. A high fitness value is assigned to an individual with high classification accuracy, a small number of chosen features, and a small number of support vectors. The fitness function is
$$\text{fitness} = w_a \times \text{Accuracy} + w_f \times F + w_v \times V, \qquad F = 1 - \frac{\sum_{i=1}^{n_f} f_i}{n_f}, \qquad V = 1 - \frac{\sum_{j=1}^{l} sv_j}{l}, \tag{9}$$
where $w_a$ is the classification accuracy weight, Accuracy is the average prediction accuracy of the 5-time bootstrap, $w_f$ is the weight of the feature score, $F$ is the score of the chosen feature subset, $f_i = 1$ indicates that the $i$th feature is selected and $f_i = 0$ indicates that it is not, $n_f$ is the number of original features, $w_v$ is the weight of the support vector score and is set to 0.05 in our experiment, $V$ is the score of the number of support vectors, $sv_j = 1$ indicates that the $j$th sample is a support vector, and $l$ is the number of samples in the training set.
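
A small sketch of how (9) can be evaluated for one individual; the weights w_a and w_f are illustrative placeholders (only w_v = 0.05 is specified in the paper), and the function and argument names are assumptions.

```python
def fitness_value(accuracy, feature_mask, sv_mask, w_a=0.90, w_f=0.05, w_v=0.05):
    """Compute the fitness of one individual according to (9).

    accuracy: average bootstrap accuracy of the SVM trained with this individual.
    feature_mask: list of 0/1, one entry per original feature (1 = selected).
    sv_mask: list of 0/1, one entry per training sample (1 = support vector).
    """
    feature_score = 1.0 - sum(feature_mask) / len(feature_mask)   # F in (9)
    sv_score = 1.0 - sum(sv_mask) / len(sv_mask)                  # V in (9)
    return w_a * accuracy + w_f * feature_score + w_v * sv_score
```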

4.3. System Architectures of the Proposed CGPGA-SVM

The system architecture of the CGPGA-based feature selection and parameter optimization method is shown in Figure 3. The main steps are described as follows.

(1) Input Dataset. The input dataset includes all the labeled samples. It is randomly split into a training set and a testing set using the bootstrap method. The training set is used to construct the SVM model, while the testing set is used to test the generalization ability of the learned SVM.

(2) Preprocess the Data. Data preprocessing is important for a variety of reasons. It can prevent attributes with greater numeric ranges from dominating those with smaller numeric ranges and increase SVM accuracy [2]. Each feature of the dataset can be linearly scaled to the range [0, 1] by
$$x' = \frac{x - \min}{\max - \min}, \tag{10}$$
where $x$ is the original value, $x'$ is the scaled value, max is the upper bound of the feature value, and min is the lower bound.
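
A short sketch of the scaling in (10); computing the bounds on the training set only and reusing them for the test set is a common safeguard and an assumption here, not a detail stated in the paper.

```python
import numpy as np

def min_max_scale(X_train, X_test):
    """Linearly scale every feature to [0, 1] per (10), using training-set bounds."""
    col_min, col_max = X_train.min(axis=0), X_train.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)  # avoid division by zero
    return (X_train - col_min) / span, (X_test - col_min) / span
```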

(3) Initialize the Population. Generally, the original population is randomly generated. However, in our experience, it is useful to randomly generate the genotypes of parameters C and γ but select all the features. This makes the first generation of the CGPGA run like a grid search procedure.

(4) Decide the Topology and Migration Strategy. The important characteristics of the CGPGA are the use of a few relatively large subpopulations and migration. By dividing the whole population into several separate subpopulations, one can apply different search strategies (i.e., different crossover rates and mutation rates) to different subpopulations.

Specifically, the whole population is divided into 2 subpopulations, and each subpopulation has 60 individuals. After every 10 generations, the best individual in each subpopulation is sent to the other subpopulation and replaces its worst individual. The purpose of this periodic communication is to ensure a good mixing of individuals.

(5) Apply Genetic Operations. Genetic operations, such as selection, crossover, and mutation, are applied to generate better solutions. However, in a CGPGA, genetic operations are carried out within each subpopulation, which means that different subpopulations use different crossover rates and mutation rates.

(6) Get the Parameters. This step converts each parameter from its genotype into its phenotype. The conversion can be done using (8).

(7) Select the Feature Subset. According to the binary code of the feature subset in each chromosome, the relevant features are chosen and the irrelevant features are discarded. After the irrelevant features have been removed from the training and testing datasets, the datasets can be used to construct the SVM model and test the generalization ability of the learned SVM.

(8) Evaluate the Individuals Using Bootstrap. Each individual in the population corresponds to a certain pair of parameters and a certain choice of features. To obtain a reliable performance estimate of this individual, the bootstrap method is used k times. During each round of the bootstrap, 50% of the samples in the input dataset are randomly chosen as the training set, while the rest of the samples form the testing set. The training set is used to construct the SVM model with the chosen parameters and feature subset, and the testing set is used to estimate the classification ability of the learned SVM. The classification accuracies from the k rounds of the bootstrap are then averaged to produce a single classification accuracy, after which the fitness value can be calculated by (9). It must be mentioned that the evaluation of an individual is independent of the rest of the population, and there is no need to communicate during this phase. Thus, the evaluation of individuals is parallelized by assigning a fraction of the population to each of the available processors.
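
A sketch of this evaluation step using scikit-learn's SVC as a stand-in for LIBSVM (the paper's implementation is in MATLAB); the 50/50 random splits follow the description above, and the function and argument names are illustrative. The returned accuracy and support-vector count feed directly into (9).

```python
import numpy as np
from sklearn.svm import SVC

def evaluate_individual(X, y, feature_mask, C, gamma, k=5, seed=0):
    """Estimate an individual's classification accuracy with k rounds of 50/50 random splits."""
    rng = np.random.RandomState(seed)
    cols = np.flatnonzero(feature_mask)              # indices of the selected features
    accuracies, n_sv = [], 0
    for _ in range(k):
        train_idx = rng.choice(len(X), size=len(X) // 2, replace=False)
        test_idx = np.setdiff1d(np.arange(len(X)), train_idx)
        model = SVC(C=C, gamma=gamma, kernel="rbf")
        model.fit(X[np.ix_(train_idx, cols)], y[train_idx])
        accuracies.append(model.score(X[np.ix_(test_idx, cols)], y[test_idx]))
        n_sv = model.support_vectors_.shape[0]       # support-vector count of the last model
    return float(np.mean(accuracies)), n_sv
```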

(9) Termination Criteria. In our approach, the process stops when the generation number reaches 100 or the highest fitness value of the whole population has not improved during the last 30 generations.

5. Experiments

5.1. Experiment Settings

The experiments were run on an Intel Core i7-3770 CPU (3.4 GHz, 4 cores) with 4 GB RAM under the Windows 7 operating system. The development environment is MATLAB (2012a), and the SVM software is LIBSVM (3.1) [22]. In our experiment, the search range of parameter C is set to [0.01, 2048], while the search range of parameter γ is set to [0.0001, 10]. The whole population is divided into 2 subpopulations, and each subpopulation has 60 individuals. After every 10 generations, the best individual in each subpopulation is sent to the other subpopulation and replaces its worst individual. The crossover rate and mutation rate of the first subpopulation are 0.8 and 0.05; those of the second subpopulation are 0.7 and 0.02.

To evaluate the classification ability of the proposed approach on different classification tasks, 12 real-world datasets from the UCI database [23] have been adopted. Their numbers of classes, samples, and original features are shown in Table 1.

To show the effectiveness of our proposed method, we compare it with two other methods: the grid search method and GA-SVM [2]. Grid search is a widely used method of parameter optimization and, in most cases, produces a satisfactory result. GA-SVM, proposed by Huang and Wang [2], is one of the most widely used feature selection and parameter optimization methods for SVMs. It can deal with feature selection and parameter optimization simultaneously and provides promising results.

To guarantee that the result obtained by the proposed method is valid, we adopt k-fold cross validation. The dataset is randomly partitioned into k independent subsets of approximately equal size. The process is then repeated k times, with each of the k subsets being used exactly once as the testing set, while the remaining k − 1 subsets are used as the training set. The training set is used as the input dataset of our proposed approach, and the performance of the obtained parameters and feature subset is tested on the testing set. The results from the k folds are then averaged to produce a single estimate. In our experiment, k is set to 10. The evaluation procedure of our proposed approach using 10-fold cross validation is shown in Figure 4. Taking the Australian dataset as an example, the best pairs of parameters (C, γ), the classification accuracy, the number of chosen features, and the number of support vectors for each fold obtained by our proposed approach and the grid search method are shown in Table 4.
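
The outer evaluation protocol can be summarized by the sketch below; `optimize` stands in for the whole CGPGA search on one training fold and is a placeholder, and the stratified splitting is an assumption rather than a detail stated in the paper.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def outer_evaluation(X, y, optimize, k=10, seed=0):
    """k-fold protocol: run the search on each training fold, score on the held-out fold.

    `optimize` must return a fitted SVM model and the indices of the selected features.
    """
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        model, cols = optimize(X[train_idx], y[train_idx])
        scores.append(model.score(X[np.ix_(test_idx, cols)], y[test_idx]))
    return float(np.mean(scores)), float(np.std(scores))
```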

5.2. Classification Results

Table 2 gives the comparison of our proposed approach CGPGA-SVM and GA-SVM [2]. Tenfold cross validation is used to estimate the classification accuracy of each approach, and the obtained classification accuracy is reported in the form of "average ± standard deviation." As shown in Table 2, our proposed approach achieves higher classification accuracy on 11 of the 12 datasets, the exception being "Svmguide1."

Table 3 gives the experimental results of our proposed approach and grid search. Tenfold cross validation is used to estimate the classification accuracy of each approach, and the obtained classification accuracy is reported in the form of "average ± standard deviation." As we can see, the proposed approach selects a smaller number of features, whereas grid search uses all the original features. In addition, the proposed approach achieves higher classification accuracy. To validate whether this higher classification accuracy actually indicates stronger classification ability, we used the nonparametric Wilcoxon signed-rank test on all 12 datasets. As shown in Table 4, the p values for Splice, Svmguide1, and Mushrooms are larger than the prescribed statistical significance level of 0.05, but all other p values are smaller than 0.05. Overall, compared with grid search, the proposed approach shows higher classification ability and selects fewer features.

5.3. Number of Support Vectors

The time taken for an SVM to compute the class label of a new pattern is proportional to the number of support vectors. A large number of support vectors indicates slow classification of new patterns, whereas a small number of support vectors can extend the application of SVMs to fields where classification has to be done at high speed. We compared the number of support vectors produced by CGPGA-SVM, GA-SVM, and the grid search method. Since 10-fold cross validation was used in our experiments, 10 SVM models were constructed for each dataset. We averaged the number of support vectors over the 10 SVM models and compare them in Figure 5. Two conclusions can be drawn from Figure 5. (1) In most cases, our proposed method produces an SVM with fewer support vectors than GA-SVM and the grid search method (except on the Svmguide1 dataset). (2) The number of support vectors does not scale with the number of training samples. In fact, the proportion of support vectors in the models produced by CGPGA-SVM remained at a low level. This enables practitioners to apply SVMs to fields where classification has to be done at high speed, for example, online applications.

5.4. Computational Efficiency

A serious limitation of global search methods is their high computational complexity. By using a parallelization strategy, our proposed approach can significantly reduce the running time while still obtaining an adequate solution. We ran our proposed approach and GA-SVM on 6 of the datasets and recorded the average time taken by one generation of evolution. To make the comparison fair, both our approach and GA-SVM have 120 individuals in their populations. Table 5 gives the results. As we can see, on a commonly used 4-core CPU, our approach takes significantly shorter time than GA-SVM.

6. Limitations and Conclusions

The overwhelming amount of data that is currently available in any field poses new challenges for machine learning techniques. To extract essential knowledge from these enormous data and generalize the obtained knowledge to future unseen data, two problems must be efficiently addressed for SVMs, namely, feature selection and parameter optimization. The number of input features of a classifier should be limited without losing its predictive power. With a smaller feature set, the classification decision is more easily explained and can be made in shorter time. Parameter optimization is another important factor that affects the generalization ability of SVMs. With a proper setting of parameters, the classification accuracy on unseen new patterns can be ensured.

This work investigated a hybrid CGPGA-based model that combines the coarse-grained parallel genetic algorithm and support vector machines to maintain the classification accuracy with a small and suitable feature subset. The distributed topology and migration policy of the CGPGA enable us to search the solution space with different search strategies in parallel, thereby providing strong search ability and high efficiency.

Experimental results obtained on several real-world datasets from the UCI database showed promising performance in terms of 10-fold accuracy, the size of the selected feature subset, the number of support vectors, and training time. They revealed that the proposed approach not only optimizes the SVM's model parameters, but also correctly identifies the discriminating feature subset in an efficient way. Moreover, the proportion of support vectors in the models produced by our method remained at a low level. This results in faster classification of unseen new patterns and extends the applications of SVMs to fields where classification has to be done at high speed.

Despite all of the promising results, our proposed work also has its limitations. Training an SVM is a computation-intensive task, and our proposed work reduces neither the number of SVM models constructed during the optimization procedure nor the time needed to train a single SVM. The acceleration is achieved by assigning multiple SVM learning processes to each of the available processors, so that the SVM learning processes can be carried out in parallel and the computational resources can be used in a cost-efficient way. When the problem becomes large enough, our proposed method may therefore reach its limits.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

The research is partly supported by Science and Technology Supporting Program, Sichuan Province, China (nos. 2013GZX0138 and 2014GZ0154), and Scientific Research Foundation for Young Teachers, Sichuan University (no. 2015SCU11050).