An Optimal SVM with Feature Selection Using Multiobjective PSO
The support vector machine (SVM) is a classifier based on the structural risk minimization principle. Its performance depends on parameters such as the penalty factor, $C$, and the kernel parameter, $\gamma$. Choosing an appropriate kernel function can also improve the recognition score and lower the amount of computation. Furthermore, selecting the useful features among the many features in a dataset not only increases the performance of the SVM but also reduces computational time and complexity. Together these choices form an optimization problem that can be solved by a heuristic algorithm. In some cases the reliability of the classifier’s output is important in addition to the recognition score, and a multiobjective optimization algorithm is then needed. In this paper we use the MOPSO algorithm to optimize the parameters of the SVM, choose an appropriate kernel function, and select the best feature subset simultaneously, in order to optimize the recognition score and the reliability of the SVM concurrently. Nine different datasets from the UCI machine learning repository are used to evaluate the power and effectiveness of the proposed method (MOPSO-SVM). The results of the proposed method are compared with those achieved by a single SVM and by RBF and MLP neural networks.
1. Introduction
A pattern recognition system consists of different parts. One of the most important is classification, which is performed by a classifier at the end of the process. Because the output accuracy of the system is highly affected by the accuracy of the classifier, a pattern recognition system that is to be used in different applications needs a high-performance classifier. One of the powerful classification techniques is the support vector machine (SVM). SVM is a supervised learning method that constructs a classification model using training data. It minimizes the generalization error and maximizes the geometric margin between two classes. This classifier uses a kernel function to map the input data into a high-dimensional feature space in order to find an optimal hyperplane that separates the two-class data. The performance of the SVM depends on the value of the kernel parameter, $\gamma$, and the value of the penalty factor, $C$. Choosing an appropriate kernel function is also important. Furthermore, selecting the useful features among the features in the training dataset plays an important role in improving the performance of the SVM and reducing its complexity. So, before training the SVM, the user should select a suitable kernel function, optimal values for the kernel parameter and penalty factor, and a good feature subset. To solve this problem, different methods based on heuristic algorithms have been proposed. For example, Huang and Wang used GA to optimize the SVM’s parameters and perform feature selection simultaneously in order to increase the classification accuracy. They used the RBF kernel in all experiments. Samanta et al. proposed a GA-SVM method for bearing fault detection in rotating machines.
They used a genetic algorithm to optimize the parameters of the SVM and to perform feature selection, improving the SVM’s ability to recognize the vibration signals. Wu et al. proposed a method based on GA and SVM for predicting bankruptcy. They used GA only to optimize the classifier’s parameters, without feature selection. Like GA, other optimization algorithms such as PSO and SA have been used to improve the SVM’s performance in practical fields such as biomedicine [5–7] and face recognition. An important point that is not considered in the studies mentioned above is the reliability of the classifier, that is, the validity of the classifier’s output for an unknown sample. This point is critical when selecting a classifier for applications such as military and medicine. All of these studies used only one fitness function to evaluate their methods. In addition to the recognition score, however, the reliability of the classifier’s output is a good way to evaluate the performance of the classifier. In some problems, although the recognition score of a class is high, the corresponding reliability of that class may be low, and vice versa. Figure 1 shows this concept. According to Figure 1, the recognition score of the hollow circles is 100%, but the corresponding reliability is 5/6 (83%). These numbers for the dark circles are 80% and 100%, respectively.
In this study the multiobjective form of PSO is used to find optimal hyperplanes for two objective functions: recognition score and reliability. The remainder of this paper is organized as follows. In Section 2, SVM is briefly introduced. In Section 3, the PSO and MOPSO algorithms are reviewed. In Section 4, two forms of artificial neural networks are reviewed as powerful classification methods. In Section 5, the proposed method is introduced. Section 6 shows the experimental results, and the final section is devoted to the conclusion.
2. Support Vector Machine
SVM is a two-class classifier described as follows. Let $(x_i, y_i)$, $i = 1, \dots, N$, indicate a set of data containing $N$ training samples. Each sample must conform to the criteria $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, +1\}$, where $y_i$ denotes the class of the corresponding sample, $x_i$, and $d$ indicates the number of dimensions of the input data. The separating hyperplane can be written as

$$w \cdot x + b = 0. \tag{1}$$

If such a hyperplane exists, then linear separation is obtained. The samples nearest to the separating hyperplane are called support vectors. At the boundaries (support vectors), (1) becomes

$$w \cdot x + b = \pm 1. \tag{2}$$

According to (2), (3) holds for each sample:

$$y_i (w \cdot x_i + b) \geq 1, \quad i = 1, \dots, N. \tag{3}$$

So the problem is finding $w$ and $b$. There are numerous hyperplanes that can separate the two-class data, but SVM produces the optimal hyperplane, as indicated in Figure 2. This hyperplane has the maximum distance to the support vectors. The margin of a separating hyperplane is $2/\|w\|$, so to find the optimal hyperplane we should minimize $\|w\|$. For simplicity we can substitute $\|w\|$ with $\frac{1}{2}\|w\|^2$. We are therefore dealing with an optimization problem: minimize $\frac{1}{2}\|w\|^2$ subject to (3).
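The separability condition and the margin described above can be checked numerically. The following is a minimal sketch, assuming the standard notation of a weight vector `w`, a bias `b`, and labels in {-1, +1}; the toy samples and the function names are illustrative and are not taken from the paper.

```python
import math

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def satisfies_margin(w, b, samples, labels):
    """True if every sample satisfies y * (w . x + b) >= 1."""
    return all(y * (dot(w, x) + b) >= 1 for x, y in zip(samples, labels))

def margin_width(w):
    """Distance between the two margin boundaries, 2 / ||w||."""
    return 2.0 / math.sqrt(dot(w, w))

# Hypothetical two-class toy data.
samples = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]
w, b = (0.5, 0.5), -1.0

print(satisfies_margin(w, b, samples, labels))  # every constraint holds
print(margin_width(w))                          # 2 / ||w||
```

Shrinking `w` widens the margin but eventually violates the constraint for the samples closest to the hyperplane, which is the tradeoff the optimization resolves.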
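In Figure 2 the samples are linearly separable, but in most cases they cannot be separated as easily. For such problems positive slack variables, $\xi_i$, are introduced, and the problem becomes

$$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{subject to} \quad y_i (w \cdot x_i + b) \geq 1 - \xi_i, \; \xi_i \geq 0. \tag{4}$$

In (4), $C$ is called the penalty factor. It controls the tradeoff between margin maximization and error minimization. This problem can be solved by means of Lagrange multipliers. The classification decision function then becomes

$$f(x) = \operatorname{sign}\Big(\sum_{i=1}^{N_s} \alpha_i y_i K(x_i, x) + b\Big), \tag{5}$$

where $\alpha_i$ is the Lagrange multiplier and $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$ is the kernel function induced by a mapping function, $\phi$ (in the linear case, $\phi(x) = x$). A QP solver is used to find $\alpha_i$. After that, $w$ and $b$ can be obtained by

$$w = \sum_{i=1}^{N_s} \alpha_i y_i \phi(x_i), \tag{6}$$

$$b = \frac{1}{N_s} \sum_{s=1}^{N_s} \big(y_s - w \cdot \phi(x_s)\big). \tag{7}$$

In (7), $N_s$ is the number of support vectors; in (5), $x$ is the unknown input sample.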
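As an illustration of how the decision function is evaluated once the multipliers are known, the sketch below computes the sign of the kernel expansion over the support vectors. It assumes the usual form of the SVM decision rule; the multipliers, bias, and support vectors are hand-picked toy values rather than the output of a QP solver, and all names are hypothetical.

```python
def linear_kernel(u, v):
    """Plain inner product, used here as the simplest kernel."""
    return sum(a * b for a, b in zip(u, v))

def decide(x, support_vectors, sv_labels, alphas, bias, kernel=linear_kernel):
    """Return +1 or -1 from sign(sum_i alpha_i * y_i * K(x_i, x) + bias)."""
    score = sum(a * y * kernel(sv, x)
                for sv, y, a in zip(support_vectors, sv_labels, alphas))
    return 1 if score + bias >= 0 else -1

# Toy values chosen by hand (assumption): two support vectors, one per class.
support_vectors = [(2.0, 2.0), (0.0, 0.0)]
sv_labels = [1, -1]
alphas = [0.25, 0.25]
bias = -1.0

print(decide((3.0, 3.0), support_vectors, sv_labels, alphas, bias))
print(decide((0.0, 1.0), support_vectors, sv_labels, alphas, bias))
```

Swapping `kernel` for a nonlinear function changes the shape of the decision boundary without changing this evaluation code.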
Some common kernel functions are the linear, polynomial, RBF, and quadratic kernels. For these functions the kernel parameter, $\gamma$, should be optimally tuned together with $C$.
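For orientation, the four kernel families named above can be sketched as follows. The exact parameterizations used in the paper are not recoverable from the text, so the forms below are common textbook choices and should be read as assumptions: the polynomial kernel takes its degree as the tunable parameter, the RBF kernel takes a width-like `gamma`, and the quadratic kernel is the degree-2 polynomial.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(u, v):
    """K(u, v) = u . v (no tunable parameter)."""
    return dot(u, v)

def polynomial_kernel(u, v, degree):
    """Assumed form: K(u, v) = (u . v + 1) ** degree."""
    return (dot(u, v) + 1.0) ** degree

def rbf_kernel(u, v, gamma):
    """Assumed form: K(u, v) = exp(-gamma * ||u - v||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def quadratic_kernel(u, v):
    """Assumed form: K(u, v) = (u . v + 1) ** 2."""
    return (dot(u, v) + 1.0) ** 2

u, v = (1.0, 2.0), (3.0, 4.0)
print(linear_kernel(u, v), polynomial_kernel(u, v, 3), quadratic_kernel(u, v))
print(rbf_kernel(u, u, 0.5))  # a sample is maximally similar to itself
```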
3. Particle Swarm Optimization Method
3.1. Single-Objective PSO
The particle swarm optimization (PSO) algorithm was first suggested by Kennedy and Eberhart in 1995. It is inspired by the flocking of birds and the schooling of fish: a group of particles searches the solution space for the best solution. Each particle has a position, a velocity, and a memory that stores the best position it has found since the beginning of the process. In each iteration the particle with the best position is regarded as the leader, and the other particles tend to move toward its position. Their movement is therefore affected by two factors: their own best position from the first iteration to the current one, and the leader’s position. Equations (8) and (9) describe how particles move through iterations:

$$v_{id}^{t+1} = \omega v_{id}^{t} + c_1 \, \mathrm{rand} \, (pbest_{id} - x_{id}^{t}) + c_2 \, \mathrm{rand} \, (gbest_{d} - x_{id}^{t}), \tag{8}$$

$$x_{id}^{t+1} = x_{id}^{t} + v_{id}^{t+1}. \tag{9}$$

In the above equations, $v_{id}$ is the $d$th dimension of the velocity of the $i$th particle, $x_{id}$ denotes the position of the particle, $t$ is the iteration number, $c_1$ and $c_2$ are learning factors, rand is a random number between 0 and 1 drawn from a uniform distribution, $\omega$ is the inertia weight coefficient, $pbest$ is the best position of the particle from the beginning to the current iteration, and $gbest$ is the position of the leader in each iteration.
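The update rule can be illustrated with a compact single-objective PSO. This is a minimal sketch rather than the authors' implementation: the inertia weight, learning factors, swarm size, bounds, and the sphere test function are assumed values, random numbers are drawn uniformly from [0, 1), and positions are clipped to the search bounds as an added safeguard.

```python
import random

def pso(objective, dim, n_particles=20, iterations=100,
        inertia=0.7, c1=1.5, c2=1.5, bounds=(-5.0, 5.0), seed=0):
    """Minimize `objective` with a basic global-best PSO."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                    # each particle's memory
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]   # the leader
    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(dim):
                # Velocity update: inertia + pull to own best + pull to leader.
                vel[i][d] = (inertia * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                # Position update, clipped to the search bounds.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def sphere(x):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)

best, best_val = pso(sphere, dim=2)
print(best, best_val)
```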
3.2. Multiobjective PSO
In a multiobjective optimization problem there is more than one objective function to be optimized, so the problem can be defined as follows:

$$\text{optimize } F(x) = [f_1(x), f_2(x), \dots, f_k(x)] \quad \text{subject to} \quad g_i(x) \leq 0, \; i = 1, \dots, m, \quad h_j(x) = 0, \; j = 1, \dots, p,$$

where $x$ is a solution, $f_i(x)$, $i = 1, \dots, k$, are the objective functions, and $g_i(x)$, $h_j(x)$ are the constraints of the problem. Contrary to the single-objective case, here we cannot find a single solution that is the best for all objective functions. Instead we look for a set of solutions, because there is a tradeoff between the different objective functions. The definition of optimality is therefore different in this case. We call $x^*$ an optimal solution if no other solution, $x$, dominates it. Such a solution is a member of the Pareto optimal front. We say that $x_1$ is dominated by $x_2$ if $x_2$ is no worse than $x_1$ in every objective function and strictly better in at least one. If $x_1$ is better than $x_2$ in even one objective function, it is not dominated by $x_2$. So in the multiobjective form we have a set of solutions that contains nondominated particles; the members of this set cannot dominate each other. Figure 3 shows the Pareto optimal front for a problem with two objective functions. According to this figure, the solutions in the Pareto front dominate the other solutions but cannot dominate each other. In MOPSO each particle has a set of leaders and has to select one of them through a mechanism. This set is usually called the External Archive [13, 14]. The External Archive contains the nondominated particles found since the first iteration.
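A small sketch of the dominance test and of extracting the nondominated set may help here. It uses the usual Pareto-dominance rule and assumes both objectives are maximized, as recognition score and reliability are; the objective values are made up for illustration.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def nondominated(points):
    """Keep the points that no other point dominates (the Pareto front)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (recognition score, reliability) pairs.
points = [(0.90, 0.70), (0.80, 0.80), (0.70, 0.90), (0.70, 0.70), (0.85, 0.60)]
front = nondominated(points)
print(front)
```

The first three points trade one objective against the other, so none of them dominates another; the last two are each beaten on both objectives by a member of the front.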
In effect, the External Archive holds the outputs of the algorithm. Different versions of MOPSO have been introduced so far. In this study we have used a hypercube-based version because of its speed and rapid convergence. In this version, to select a leader for each particle, the solution space is divided into numerous hypercubes, and the solutions in the External Archive are placed in these hypercubes according to their coordinates, which are given by the objective functions. Each hypercube is assigned a fitness by dividing a constant number by the number of solutions it contains, so sparsely populated hypercubes receive a higher fitness. A roulette wheel mechanism then selects one of the hypercubes, and finally a solution placed in the selected hypercube is chosen randomly as the leader for the particle. The MOPSO process is as follows:
(1) Initializing the position and the velocity of each particle.
(2) Evaluating the particles.
(3) Saving nondominated particles in a repository.
(4) Producing hypercubes to cover the solution space.
(5) Initializing the memory of each particle.
(6) Main loop:
    (a) Calculating the velocity of each particle by (8), with the single leader’s position replaced by the position of the leader selected from the repository.
    (b) Updating the position of the particles through (9).
    (c) Evaluating the particles.
    (d) Updating the repository.
    (e) Updating the memory (best position) of each particle.
(7) End of the main loop.
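The leader-selection step can be sketched as below. This is an illustrative reading of the hypercube mechanism, not the authors' code: it assumes two objectives scaled to [0, 1], a fixed grid of `n_divisions` cells per axis, and a hypothetical constant of 10 for the hypercube fitness.

```python
import random
from collections import defaultdict

def select_leader(archive, n_divisions=10, constant=10.0, rng=random):
    """Pick a leader from `archive`, a list of objective tuples in [0, 1]."""
    cubes = defaultdict(list)
    for sol in archive:
        # Grid coordinates of the hypercube that contains this solution.
        key = tuple(min(int(f * n_divisions), n_divisions - 1) for f in sol)
        cubes[key].append(sol)
    keys = list(cubes)
    # Fewer solutions in a hypercube means a higher fitness.
    fitness = [constant / len(cubes[k]) for k in keys]
    chosen = rng.choices(keys, weights=fitness, k=1)[0]  # roulette wheel
    return rng.choice(cubes[chosen])                     # random member

# Hypothetical archive: three crowded solutions and one isolated solution.
archive = [(0.95, 0.55), (0.96, 0.56), (0.97, 0.57), (0.60, 0.90)]
leader = select_leader(archive)
print(leader)
```

Because the isolated solution sits alone in its hypercube, it is chosen far more often than any single crowded solution, which pushes the swarm toward less explored parts of the front.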
4. Artificial Neural Networks
The artificial neural network was introduced in 1974. The aim of this network is to extract logical results from received information by simulating the activity of the brain with a similar structure. Artificial neural networks are organized so that the relationships between inputs and outputs (which can be complex or nonlinear) are stored in the network structure, which makes them capable of assigning the related output to each input. After the structural components of the network are determined, its parameters are modified based on repeated comparisons between the output of the network and the desired output, so that the difference between these two values approaches zero over consecutive comparisons. In this sense, a neural network can be considered a black-box model that performs a mapping (not necessarily linear) from the input (vector) space to the output (vector) space. In this paper we have used two of the most widely used artificial neural networks, the multilayer perceptron (MLP) neural network and the radial basis function (RBF) neural network, and compared their capability with that of the optimal support vector machine.
4.1. The Multilayer Perceptron Neural Network (MLP)
The simplest multilayer perceptron consists of three layers (input, hidden, and output), as shown in Figure 4. The number of neurons in each layer is determined by trial and error. The initial weights of the network are set randomly. The error backpropagation algorithm is used for training: the weights of the network change in a supervised manner based on the difference between the network output and the desired output, so that the network generates the appropriate output for every input. The input and output patterns are first normalized by a normalizing factor in order to equalize their effect on the weight changes during training. For the $p$th input pattern, the squared error over all output neurons is calculated using the following equation:

$$E_p = \sum_{j} (d_{pj} - o_{pj})^2,$$

where $d_{pj}$ and $o_{pj}$ are, respectively, the desired output and the calculated output of the $j$th neuron for pattern $p$. The total squared error for all patterns can also be calculated using the following equation:

$$E = \sum_{p} E_p.$$

In the following equations, $w(t)$ represents the current weight, $w(t-1)$ represents the previous weight, $\eta$ represents the learning coefficient, and $\alpha$ represents the momentum coefficient:

$$\Delta w(t) = -\eta \frac{\partial E}{\partial w(t)} + \alpha \big(w(t) - w(t-1)\big),$$

$$w(t+1) = w(t) + \Delta w(t).$$

In this method the weights are updated repeatedly for all training patterns. The training process stops when the total error for all patterns falls below a predetermined threshold or when the maximum number of training epochs is reached. It is noteworthy that the training method described here is error backpropagation with a momentum term, which lowers the possibility of being trapped in local minima compared with plain error backpropagation.
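The effect of the momentum term can be seen on a single weight. The sketch below assumes the common form of the rule, in which the new weight is the current weight minus a gradient step plus a fraction of the previous weight change; the learning rate, momentum value, and the one-dimensional error function are illustrative assumptions.

```python
def momentum_step(w_current, w_previous, gradient,
                  learning_rate=0.1, momentum=0.9):
    """Return the next weight: gradient step plus part of the last change."""
    delta = -learning_rate * gradient + momentum * (w_current - w_previous)
    return w_current + delta

# Toy error surface E(w) = (w - 3)^2 with gradient 2 * (w - 3).
w_prev, w = 0.0, 0.0
for _ in range(200):
    w_prev, w = w, momentum_step(w, w_prev, 2.0 * (w - 3.0))

print(w)  # approaches the minimizer w = 3
```

Because part of the previous change is carried forward, the weight keeps moving through flat or shallow regions where the gradient alone would stall.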
4.2. Radial Basis Function (RBF) Neural Network
RBF is a popular supervised neural network learning algorithm. It is a specific kind of MLP network. The RBF network consists of only the following three layers, as shown in Figure 5:
Input Layer. It broadcasts the inputs without distortion.
RBF Layer. This hidden layer contains the radial basis functions.
Output Layer. This simple layer contains a linear function.
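Basis functions normally take the form $\phi_j(\|x - \mu_j\|)$. The function depends on the distance (usually taken to be Euclidean) between the input vector $x$ and a vector $\mu_j$. The most common form of basis function is the Gaussian function

$$\phi_j(x) = \exp\Big(-\frac{\|x - \mu_j\|^2}{2\sigma_j^2}\Big),$$

where $\mu_j$ determines the center of the basis function and $\sigma_j$ is a width parameter that controls how the curve is spread. Generally, these centers are selected by using fuzzy or nonfuzzy clustering algorithms. In this work, we have used the $k$-means algorithm to select the initial cluster centers in the first stage, and these centers are then fine-tuned by using a point symmetry distance measure. The number of neurons in the output layer is equal to the number of classes of the classification problem. Each output layer neuron computes a linear weighted sum of the outputs of the $M$ hidden layer neurons as follows:

$$y_k(x) = \sum_{j=1}^{M} w_{kj} \phi_j(x).$$

The weight vectors are determined by minimizing the squared differences between the classifier outputs and the target values, $t_k^n$, summed over the $N$ training patterns $x^n$:

$$E = \frac{1}{2} \sum_{n=1}^{N} \sum_{k} \big(y_k(x^n) - t_k^n\big)^2.$$

The parameters are given by setting the derivative of $E$ with respect to each weight to zero,

$$\sum_{n=1}^{N} \Big(\sum_{j'=1}^{M} w_{kj'} \phi_{j'}(x^n) - t_k^n\Big) \phi_j(x^n) = 0,$$

or, in matrix form,

$$\Phi^{T} \Phi W^{T} = \Phi^{T} T,$$

where $\Phi_{nj} = \phi_j(x^n)$, $W_{kj} = w_{kj}$, and $T_{nk} = t_k^n$. After computation, we obtain

$$W^{T} = (\Phi^{T} \Phi)^{-1} \Phi^{T} T.$$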
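The forward pass of such a network is short enough to sketch directly. The example assumes the Gaussian basis function with a `2 * width ** 2` denominator and takes the centers, widths, and output weights as given; here they are hand-picked toy values rather than the result of clustering and least-squares fitting, and the function names are hypothetical.

```python
import math

def gaussian_basis(x, center, width):
    """exp(-||x - center||^2 / (2 * width^2))."""
    sq_dist = sum((a - c) ** 2 for a, c in zip(x, center))
    return math.exp(-sq_dist / (2.0 * width ** 2))

def rbf_outputs(x, centers, widths, weights):
    """Linear weighted sums of the hidden outputs; weights[k][j] links unit j to output k."""
    hidden = [gaussian_basis(x, c, s) for c, s in zip(centers, widths)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]

def predict(x, centers, widths, weights):
    """Index of the output neuron with the largest response."""
    out = rbf_outputs(x, centers, widths, weights)
    return max(range(len(out)), key=out.__getitem__)

# Toy two-class network: one hidden unit per class (assumed values).
centers = [(0.0, 0.0), (4.0, 4.0)]
widths = [1.0, 1.0]
weights = [[1.0, 0.0], [0.0, 1.0]]

print(predict((0.2, -0.1), centers, widths, weights))
print(predict((3.8, 4.1), centers, widths, weights))
```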
5. Proposed Method
In this paper we have used MOPSO to optimize the penalty factor, choose an adequate kernel function, tune the selected kernel’s parameter, and perform feature selection for two objective functions, recognition score and reliability. Its performance is compared with that of RBF and MLP neural networks. The construction of the particles is indicated in Figure 6.
The first variable of the particle tunes the penalty factor. The second variable selects the kernel function: its value can be 1, 2, 3, or 4, choosing one of the four kernels introduced in Section 2. The third variable sets the parameter of the selected kernel (except for the linear kernel). The rest of the particle is for feature selection. For a dataset with $n$ features, the remaining $n$ variables lie between 0 and 1. If a value is less than or equal to 0.5, the corresponding feature is not selected; if it is greater than 0.5, the corresponding feature is selected.
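Decoding a particle into SVM settings can be sketched as follows. The layout follows the description above; the mapping of the values 1 to 4 onto kernel names in the order they are listed in Section 2, and the rounding of the kernel variable to the nearest integer, are assumptions made for illustration.

```python
# Assumed order of the four kernels (hypothetical mapping).
KERNELS = {1: "linear", 2: "polynomial", 3: "rbf", 4: "quadratic"}

def decode_particle(particle):
    """Split a particle into (penalty, kernel name, kernel parameter, feature indices)."""
    penalty = particle[0]
    kernel_index = min(4, max(1, int(round(particle[1]))))
    kernel_param = particle[2]
    # A feature is selected only if its variable is strictly greater than 0.5.
    selected = [i for i, v in enumerate(particle[3:]) if v > 0.5]
    return penalty, KERNELS[kernel_index], kernel_param, selected

# Hypothetical particle for a dataset with four features.
particle = [10.0, 3.2, 0.5, 0.9, 0.5, 0.1, 0.51]
print(decode_particle(particle))
```

The second feature variable is exactly 0.5, so that feature is left out, matching the threshold rule in the text.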
If we consider the two classes as “positive” and “negative,” then the predicted test samples can be divided into four groups:
(1) Samples which are “positive” and correctly predicted as “positive” (TP).
(2) Samples which are “positive” but classified as “negative” (FN).
(3) Samples which are “negative” and correctly classified as “negative” (TN).
(4) Samples which are “negative” but predicted as “positive” (FP).
According to this categorization, the recognition score is calculated by (23) and the reliability for each class by (24). The termination criterion is that the iteration number reaches 200. To calculate the fitness functions for each particle, the SVM is trained with the determined parameters, kernel function, and selected features, and then the recognition score and reliability for each class are obtained from (23) and (24). For multiclass classification we have used the one-versus-all method. In this method, for each class of the dataset we find the optimal hyperplane that separates that class from the others. The input sample is then labeled according to the decisions of the obtained hyperplanes about that sample. Figure 7 shows this method for a 3-class dataset.
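One reading of the two measures that reproduces the Figure 1 numbers is sketched below. It is an assumption, not a quotation of the paper's formulas: the recognition score of a class is taken as the share of that class's samples that are recognized, TP / (TP + FN), and the reliability as the share of samples assigned to the class that truly belong to it, TP / (TP + FP). The counts are chosen to be consistent with the percentages quoted for Figure 1.

```python
def recognition_score(tp, fn):
    """Assumed per-class definition: TP / (TP + FN)."""
    return tp / (tp + fn)

def reliability(tp, fp):
    """Assumed per-class definition: TP / (TP + FP)."""
    return tp / (tp + fp)

# Hollow circles: all 5 recognized, but 1 dark circle is also labeled hollow.
hollow = (recognition_score(tp=5, fn=0), reliability(tp=5, fp=1))
# Dark circles: 4 of 5 recognized, and nothing else is labeled dark.
dark = (recognition_score(tp=4, fn=1), reliability(tp=4, fp=0))

print(hollow)  # recognition 100%, reliability 5/6
print(dark)    # recognition 80%, reliability 100%
```

For the negative class the same functions apply with TN in place of TP and the roles of FN and FP swapped.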
6. Experimental Results
The suggested method was applied to nine different datasets from the UCI machine learning repository. Table 1 shows the characteristics of these datasets. Table 2 shows the experimental results on these datasets, Table 3 contains the learning time for the different methods, and Table 4 shows the results of the proposed method in classifying the datasets with and without feature selection.
According to Table 2, MOPSO-SVM gives results that are comparable to or better than those of the MLP and RBF neural networks for the Glass, Iris, Wine, Ionosphere, Hepatitis, and Vowel datasets. An important point demonstrated in Table 2 is the reliability obtained for the different datasets. As indicated in Table 2, the proposed method gives high reliability for most of the datasets, meaning that the output of the proposed classifier is strongly reliable.
Because the hyperplanes obtained by MOPSO make some errors in classifying the test samples (unknown samples), there are some samples that more than one hyperplane assigns to its own class. There may also be some samples that none of the hyperplanes assigns to its class. Such samples are considered error samples, whose classes cannot be distinguished. Figure 8 illustrates this concept. Another point that is clearly seen from Table 2 is that MOPSO-SVM outperforms the original SVM in most of the experiments. This means that the proposed method is an expert classifier that automatically finds the optimal SVM parameters and the best feature subset for classifying different datasets. It should be noted that in all the experiments different kernel functions were chosen for the single SVM, and the recognition score and reliability reported for the single SVM in Table 2 are the averages over SVMs with different kernel functions.
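The labeling rule for one-versus-all outputs, including the error samples described above, can be sketched in a few lines. The function name and the use of `None` to mark an error sample are illustrative choices, not part of the paper.

```python
def one_vs_all_label(votes):
    """votes[k] is True if the hyperplane of class k claims the sample.

    Returns the class index when exactly one hyperplane claims the sample,
    and None (an error sample) when several or none do.
    """
    claimed = [k for k, v in enumerate(votes) if v]
    return claimed[0] if len(claimed) == 1 else None

print(one_vs_all_label([False, True, False]))   # claimed only by class 1
print(one_vs_all_label([True, True, False]))    # claimed twice: error sample
print(one_vs_all_label([False, False, False]))  # claimed by none: error sample
```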
Analyzing the numbers in Table 2, we can conclude that MOPSO-SVM is a powerful and effective classifier, given the reliabilities and recognition scores it achieves on the different datasets. These numbers show that MOPSO-SVM is a reliable classifier, which means that it is well suited to special applications such as military and medicine that strongly require a highly reliable classifier. Table 3 contains the learning time for the different algorithms. Compared with the single SVM, MOPSO-SVM requires less learning time in most experiments. This is the result of feature selection: removing redundant features from the datasets reduces the learning time. The proposed method also has less learning time than the MLP and RBF neural networks. Table 4 shows the results of the proposed method with and without feature selection. According to this table, the feature selection process has improved the recognition score and reliability for most of the datasets. Feature selection is therefore an efficient preprocessing technique that not only reduces the learning time of the classifier but can also improve its performance. This is an important issue, especially in classifying or clustering high-dimensional data. From the reported results, it is clear that using a heuristic algorithm to enhance the performance of the SVM for two objective functions is a successful idea, because finding the optimal parameters of the SVM for different datasets while also reducing the dimension of the dataset is a hard task. For example, for the Sonar samples, which have 60 features, there exist $2^{60}$ feature subsets, so it is very difficult to find the best feature subset. Furthermore, finding the optimal values of the parameters in order to improve the performance of the SVM is a difficult task. In fact, finding an optimal SVM with an optimal feature subset is an NP-hard problem, which can be tackled with a heuristic algorithm.
According to the reported results, MOPSO searches the solution space very effectively.
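The size of the feature-selection search space for the Sonar dataset can be made concrete with a short calculation. Each of the 60 features is either kept or dropped, which gives two choices per feature; the evaluation rate used for the time estimate is a purely hypothetical figure.

```python
n_features = 60
n_subsets = 2 ** n_features           # one keep/drop choice per feature
print(n_subsets)                      # 1152921504606846976

# Hypothetical rate: one million SVM evaluations per second.
evaluations_per_second = 1_000_000
seconds_per_year = 365 * 24 * 3600
years = n_subsets / evaluations_per_second / seconds_per_year
print(round(years))                   # tens of thousands of years
```

Even at that optimistic rate, exhaustive search over the feature subsets alone is out of reach, before the kernel choice and the continuous parameters are considered.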
7. Conclusion
In this study multiobjective PSO has been used to tune the parameters of the SVM and to perform feature selection for two objective functions, and the performance of the proposed method (MOPSO-SVM) has been compared with that of a single SVM and of RBF and MLP neural networks. According to the reported results, the proposed method gives reliabilities and recognition scores comparable with those of the RBF and MLP neural networks, which have shown their effectiveness in classifying overlapped datasets, and in some cases it gives better reliabilities and/or recognition scores than RBF and MLP, for example, for the Glass, Iris, Wine, Ionosphere, Hepatitis, and Vowel datasets. The proposed method also has less learning time in most of the experiments. Furthermore, according to Tables 3 and 4, feature selection is an important preprocessing method that has a positive effect on both the learning time and the accuracy of the classifier.
The results shown in the previous section indicate that using a heuristic algorithm to convert the SVM from a normal classifier into an expert one was successful. Furthermore, according to the obtained results, optimizing the SVM with a multiobjective heuristic algorithm in order to increase its reliability as well as its accuracy is a successful idea. The reported results also show the power and effectiveness of MOPSO in searching the solution space. In other words, MOPSO is a powerful algorithm that can act very effectively in solving multiobjective optimization problems.
The authors declare that they have no competing interests.
J. Kennedy and R. C. Eberhart, “Particle swarm optimization,” in Proceedings of the IEEE International Conference on Neural Networks, vol. 4, pp. 1942–1948, IEEE Service Center, Perth, Australia, 1995.
A. Abraham and L. Jain, “Evolutionary multiobjective optimization,” in Evolutionary Multiobjective Optimization: Theoretical Advances and Applications, A. Abraham and R. Goldberg, Eds., Advanced Information and Knowledge Processing, pp. 1–6, Springer, London, UK, 2005.
V. Pareto, Cours d'Economie Politique, Volume I and II, F. Rouge, Lausanne, Switzerland, 1896.
M. Bhuvaneswari, Application of Evolutionary Algorithms for Multi-Objective Optimization in VLSI and Embedded Systems, Springer, Berlin, Germany, 2015.
S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan College, New York, NY, USA, 1996.
C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.