Abstract

We propose an optimized Support Vector Machine classifier, named PMSVM, in which System Normalization, PCA, and Multilevel Grid Search methods are comprehensively considered for data preprocessing and parameter optimization, respectively. The main goals of this study are to improve the classification efficiency and accuracy of SVM. Sensitivity, Specificity, Precision, the ROC curve, and other criteria are adopted to appraise the performance of PMSVM. Experimental results show that PMSVM achieves comparatively better accuracy and remarkably higher efficiency than traditional SVM algorithms.

1. Introduction

The swift development of machine learning technologies gives us a good chance to process and analyse data from a brand-new perspective. Machine learning, also known as knowledge discovery, is one of the most important branches of computer science; it aims to find useful patterns in data and is quite different from traditional statistical methods. As a comparatively new machine learning algorithm, the Support Vector Machine (SVM) has attracted much attention recently and has been successfully used in various application domains [1–6]. In this study, we focus on constructing an optimized SVM model and applying it to heart disease data classification, aiming to improve the classification efficiency and accuracy of SVM.

Much of the literature has addressed the use of Support Vector Machines for data analysis. Muthu Rama Krishnan et al. designed an SVM-based classifier, which was applied to two UCI mammogram datasets for breast cancer detection and reached accuracies of 99.385% and 93.726%, respectively [7]. Xie and Wang integrated a hybrid feature selection method with SVM for erythemato-squamous disease diagnosis, which reached an accuracy of 98.61% [8].

Feature selection is the basis of machine learning algorithms; an appropriate feature selection strategy can obviously improve the performance of machine learning methods. Deisy et al. proposed a novel information theory based feature selection algorithm to improve the classification accuracy of SVM classifiers [9–12]. Other feature selection methods, such as mutual information measurement [13], kernel F-score feature selection, and the explicit margin-based feature elimination method, are often adopted to obtain better classification results for SVM or other machine learning algorithms [14–17].

Most machine learning algorithms have parameters, and proper measures should be taken to decide their optimal values. The Genetic Algorithm, Particle Swarm Optimization, the Artificial Immune System Algorithm, and the Grid Search Method are among the most frequently used parameter optimization algorithms [18–23]. Generally, data feature selection methods and parameter optimization strategies are considered together. Lin et al. developed a Simulated Annealing approach for parameter determination and feature selection in SVM, and experiments showed its good performance [24]. Tan et al. proposed a new hybrid approach, in which the Genetic Algorithm and Support Vector Machine are integrated effectively based on a wrapper strategy [25], which performed well on the UCI chromosome dataset. Literature [26] presented a hybrid approach based on feature selection, fuzzy weighted preprocessing, and an artificial immune recognition system, which was applied to the UCI heart disease and hepatitis disease datasets; the obtained accuracies were 92.39% and 81.82%, respectively.

Besides feature selection and parameter optimization, the kernel function is another factor that should be considered for kernel based machine learning algorithms like SVM. Khemchandani et al. adopted an optimal kernel selection technique in Twin Support Vector Machines; its efficiency was verified on several UCI machine learning benchmark datasets [12, 27]. Abibullaev et al. introduced a linear programming SVM with a multikernel function for brain signal data classification and obtained good performance [15, 28–30].

Artificial Neural Networks [23], Extreme Learning Machines [31], k-Nearest Neighbor analysis, Fuzzy Logic based methods [32–34], Ensemble Learning algorithms [34–39], and so forth, are often used, alone or in hybrid form, to accomplish data classification tasks and usually obtain good classification results. As a new version of the Support Vector Machine, the Least Squares SVM involves equality constraints instead of inequality constraints and works with a least squares cost function. An obvious drawback of the Least Squares SVM is that sparseness is lost [22, 40, 41]. Yang et al. developed an adaptive pruning algorithm based on a bottom-to-top strategy, in which incremental and decremental learning procedures were used, and solved this drawback of the traditional Least Squares SVM [42–45].

Through the investigation of the existing literature, we noticed that the main focus of studies using Support Vector Machines for classification is modifying and combining typical classification algorithms to acquire better classification performance. In general, the main process includes three procedures: (1) Data Preprocessing (feature selection, normalization, dimension reduction, etc.); (2) Constructing Optimized Classification Models (including parameter optimization); (3) Classification Accuracy and Efficiency Demonstration.

Although many efforts have been made on SVM and its applications, its performance is still not ideal and needs further optimization.

The remaining parts of this paper are arranged as follows: The Mathematical Derivation of Support Vector Machine part shows the mathematical nature of SVM; the part Process of Principal Component Analysis gives the detail of Principal Component Analysis; the proposed System Normalization, Stratified Cross Validation, and Multilevel Grid Search based SVM algorithm are described in the Proposed Methods part; corresponding experiments are shown in the Experimental Results part; the following part gives the conclusions of this study.

2. Mathematical Derivation of Support Vector Machine

The Support Vector Machine (SVM), one of the most effective machine learning algorithms for classification and regression problems, was first proposed by Vapnik and his colleagues in 1995, and its history can be traced back to the basic works of Statistical Learning Theory since the 1960s [1–4]. SVM is good at processing nonlinear, high-dimensional, and small-sample machine learning problems.

SVM is built on the basis of the VC Dimension (Vapnik-Chervonenkis Dimension) Theory and the Structural Risk Minimization Theory, which are the core contents of Statistical Learning Theory [2]. SVM has both a solid theoretical foundation and ideal generalization ability [6]. Presently, SVM has been used in many domains and occasions, such as handwriting recognition, biological character recognition (e.g., face recognition), credit card fraud detection, image segmentation, bioinformatics, function fitting, and medical data analysis [6].

As mentioned above, SVM can be used to solve classification and regression problems; SVM in these two different settings is called SVC and SVR, respectively. In this paper, only SVC is involved, and it is uniformly referred to as “SVM.” Intuitively (e.g., in 2-dimensional space), classification problems can be divided into Linearly Separable tasks (the corresponding data consist of linearly separable samples) and Linearly Inseparable tasks (the data consist of linearly inseparable samples, also called nonlinearly separable data, or nonlinear data for short). Figure 1 shows these two cases. Such situations can be extended into high dimensional space.

2.1. Linearly Separable Case

For the sake of simplicity, we only consider two-class classification situations.

Given a dataset $D = \{(x_i, y_i)\}$, $i = 1, 2, \ldots, N$, where $x_i \in \mathbb{R}^{n}$, $y_i \in \{+1, -1\}$. Essentially, dataset $D$ is a set of binary groups $(x_i, y_i)$; here $x_i$ stands for a data sample, and $y_i$ is the corresponding class label of $x_i$. Particularly, we use $x$ to represent an arbitrary data sample, and its corresponding class label is represented with $y$. In such situations, SVM searches for the hyperplane with the largest separation “Margin” [1–4], that is, the Maximum Marginal Hyperplane (MMH). A separating hyperplane can be written as
$$w^{T} x + b = 0, \tag{1}$$
where $w$ is the weight vector and $b$ is called the bias. Let $g(x) = w^{T} x + b$; the geometrical distance from a sample $x$ to the optimal hyperplane can be expressed as
$$d = \frac{|g(x)|}{\|w\|}; \tag{2}$$
here, $g(x)$ is referred to as the discriminant function. The purpose of SVM is to find the parameters $w$ and $b$ so as to maximize the margin of separation ($\rho$ in (5)). Without loss of generality, the function margin can be fixed to be equal to 1. Thus, given a training set $\{(x_i, y_i)\}$, we get
$$y_i \left( w^{T} x_i + b \right) \geq 1, \quad i = 1, 2, \ldots, N. \tag{3}$$
The particular samples that satisfy the equalities of (3) are referred to as Support Vectors; that is, they are exactly the closest samples to the optimal separating hyperplane. Accordingly, the geometrical distance from a Support Vector to the optimal separating hyperplane is
$$d_{SV} = \frac{1}{\|w\|}. \tag{4}$$

Obviously, the margin of separation is
$$\rho = \frac{2}{\|w\|}. \tag{5}$$

To get the maximum margin hyperplane is to maximize $\rho$ with respect to $w$ and $b$:
$$\max_{w, b} \; \frac{2}{\|w\|} \quad \text{s.t.} \quad y_i \left( w^{T} x_i + b \right) \geq 1, \; i = 1, \ldots, N. \tag{6}$$

Equivalently,
$$\min_{w, b} \; \frac{1}{2} \|w\|^{2} \quad \text{s.t.} \quad y_i \left( w^{T} x_i + b \right) \geq 1, \; i = 1, \ldots, N. \tag{7}$$

The constrained optimization problem in (7) is known as the Primal Problem. Through the method of Lagrange Multipliers, we construct the following Lagrange Function:
$$L(w, b, \alpha) = \frac{1}{2} \|w\|^{2} - \sum_{i=1}^{N} \alpha_i \left[ y_i \left( w^{T} x_i + b \right) - 1 \right], \tag{8}$$
where $\alpha_i \geq 0$ is the Lagrange Multiplier with respect to the $i$th inequality. We can get the following conditions of optimality by differentiating $L(w, b, \alpha)$ with respect to $w$ and $b$ and setting the results equal to 0:
$$\frac{\partial L(w, b, \alpha)}{\partial w} = 0, \qquad \frac{\partial L(w, b, \alpha)}{\partial b} = 0; \tag{9}$$
thus, we obtain
$$w = \sum_{i=1}^{N} \alpha_i y_i x_i, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0. \tag{10}$$
The corresponding Dual Problem can be inferred by means of substituting (10) into (8):
$$\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j x_i^{T} x_j \quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \; \alpha_i \geq 0, \; i = 1, \ldots, N. \tag{11}$$

The following equation (12) gives the Karush-Kuhn-Tucker (KKT) complementary condition:
$$\alpha_i \left[ y_i \left( w^{T} x_i + b \right) - 1 \right] = 0, \quad i = 1, \ldots, N. \tag{12}$$

As a result, only the support vectors (the closest samples to the optimal separating hyperplane, which determine the maximal margin) correspond to nonzero $\alpha_i$ (all the other $\alpha_i$ equal zero). Equation (11), which describes the Dual Problem, is a typical Convex Quadratic Programming Optimization Problem. In most cases, the Convex Quadratic Programming Optimization Problem can efficiently converge to the global optimum by adopting appropriate optimization techniques. We can acquire the optimal weight vector $w^{*}$ with (13) after determining the optimal Lagrange multipliers $\alpha_i^{*}$:
$$w^{*} = \sum_{i=1}^{N} \alpha_i^{*} y_i x_i; \tag{13}$$
therefore, the corresponding optimal bias $b^{*}$ can be expressed as follows:
$$b^{*} = y_j - w^{*T} x_j, \quad \text{for any support vector } x_j. \tag{14}$$
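For illustration, the following minimal sketch (our own, using scikit-learn's SVC, which wraps LIBSVM; the toy data and the large value of C that approximates a hard margin are purely illustrative) shows how the fitted dual coefficients $\alpha_i^{*} y_i$ and the support vectors can be combined to recover the weight vector of (13) and the bias of (14) in the linear, separable case:

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.5], [3.0, 3.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C approximates the hard margin
# dual_coef_ stores alpha_i * y_i for the support vectors (cf. (13)).
w_star = clf.dual_coef_ @ clf.support_vectors_
b_star = clf.intercept_
print(w_star, b_star)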

2.2. Linearly Inseparable Cases
2.2.1. Soft Margin SVM

Soft margin SVM aims to extend the maximal separating margin SVM (the so-called hard margin SVM) so that the hyperplane allows a few noisy data points to exist. On this occasion, a slack variable $\xi_i \geq 0$, named the Slack factor, is introduced to account for the amount of violation of the classification constraints. The classification problem in such cases can be described as
$$w^{T} x_i + b \geq +1 - \xi_i \;\; \text{for } y_i = +1, \qquad w^{T} x_i + b \leq -1 + \xi_i \;\; \text{for } y_i = -1; \tag{15}$$
that is,
$$y_i \left( w^{T} x_i + b \right) \geq 1 - \xi_i, \quad \xi_i \geq 0, \; i = 1, \ldots, N. \tag{16}$$

The Primal Problem can be described as
$$\min_{w, b, \xi} \; \frac{1}{2} \|w\|^{2} + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad y_i \left( w^{T} x_i + b \right) \geq 1 - \xi_i, \; \xi_i \geq 0, \; i = 1, \ldots, N, \tag{17}$$
where $C > 0$ is the punishment (penalty) parameter. The corresponding Dual Problem Function (Dual Function) of the soft margin SVM is formulated as
$$\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j x_i^{T} x_j \quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \; 0 \leq \alpha_i \leq C, \; i = 1, \ldots, N. \tag{18}$$

The KKT complementary conditions in such an inseparable case are
$$\alpha_i \left[ y_i \left( w^{T} x_i + b \right) - 1 + \xi_i \right] = 0, \quad i = 1, \ldots, N, \tag{19}$$
$$\mu_i \xi_i = 0, \quad i = 1, \ldots, N, \tag{20}$$
where the $\mu_i$'s are the Lagrange Multipliers corresponding to $\xi_i$ that have been introduced to enforce the nonnegativity of $\xi_i$. At the saddle point, at which the derivative of the Lagrange function for the primal problem with respect to $\xi_i$ is zero, the evaluation of the derivative yields
$$\alpha_i + \mu_i = C. \tag{21}$$

Simultaneously considering (20) and (21), we acquire
$$\xi_i = 0 \quad \text{if } \alpha_i < C. \tag{22}$$

Consequently, the optimal weight can be expressed as follows:
$$w^{*} = \sum_{i=1}^{N} \alpha_i^{*} y_i x_i. \tag{23}$$

The optimal bias can be obtained by taking any sample $x_i$ in the training set for which $0 < \alpha_i^{*} < C$ and the corresponding $\xi_i = 0$, and using that sample in (19):
$$b^{*} = y_i - w^{*T} x_i. \tag{24}$$

2.2.2. Kernel Trick Based SVM

For linearly inseparable cases, the kernel trick is another commonly used technique. An appropriate kernel function, which is based on the inner product between the given samples, is defined as a nonlinear transformation of samples from the original space to a feature space with higher or even infinite dimension, so as to make the problems linearly separable. That is, a complicated classification problem cast nonlinearly into a high-dimensional space is more likely to be linearly separable than in a low-dimensional space. Actually, we can adopt a nonlinear mapping function $\phi(\cdot)$ to map data in the original (or primal) space into a higher (even infinite) dimensional feature space, such that the mapped data in the new feature space are more likely to be linearly separable. Thus we can extend the separating hyperplane function into the following form:
$$g(x) = w^{T} \phi(x) + b; \tag{25}$$
that is,
$$g(x) = \sum_{i=1}^{N} \alpha_i y_i \phi(x_i)^{T} \phi(x) + b. \tag{26}$$
The separating hyperplane is
$$w^{T} \phi(x) + b = 0. \tag{27}$$

The primal problem in such a case is
$$\min_{w, b} \; \frac{1}{2} \|w\|^{2} \tag{28}$$
$$\text{s.t.} \quad y_i \left( w^{T} \phi(x_i) + b \right) \geq 1, \quad i = 1, \ldots, N. \tag{29}$$

Using the same mathematical trick as in the linearly separable SVM, we get the corresponding Dual Function
$$\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \phi(x_i)^{T} \phi(x_j) \quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \; \alpha_i \geq 0. \tag{30}$$

The KKT complementary condition is
$$\alpha_i \left[ y_i \left( w^{T} \phi(x_i) + b \right) - 1 \right] = 0, \quad i = 1, \ldots, N. \tag{31}$$

After getting the optimal Lagrange Multipliers $\alpha_i^{*}$, we acquire the optimal weight vector and the corresponding bias, which are described with function (32) and function (33), respectively:
$$w^{*} = \sum_{i=1}^{N} \alpha_i^{*} y_i \phi(x_i), \tag{32}$$
$$b^{*} = y_j - \sum_{i=1}^{N} \alpha_i^{*} y_i \phi(x_i)^{T} \phi(x_j), \quad \text{for any support vector } x_j. \tag{33}$$

Fortunately, an inner product of the form $\phi(x_i)^{T} \phi(x_j)$ can be substituted with a kernel function $K(x_i, x_j)$, that is,
$$K(x_i, x_j) = \phi(x_i)^{T} \phi(x_j); \tag{34}$$
thus, we only need to calculate the inner product in the original space, not in the feature space, which obviously reduces the calculation complexity; meanwhile, we are relieved from searching for the proper nonlinear mapping function $\phi(\cdot)$.

In fact, the kernel trick cannot always guarantee that the mapped problems are absolutely linearly separable, so the soft margin SVM and the kernel based SVM are integrated to exert their respective advantages and thus solve linearly inseparable problems more effectively. The corresponding Dual form of the constrained optimization problem in the kernel soft margin SVM is listed as follows:
$$\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad \sum_{i=1}^{N} \alpha_i y_i = 0, \; 0 \leq \alpha_i \leq C. \tag{35}$$

Accordingly, the optimal classifier is
$$f(x) = \operatorname{sgn} \left( \sum_{i=1}^{N} \alpha_i^{*} y_i K(x_i, x) + b^{*} \right), \tag{36}$$
where
$$b^{*} = y_j - \sum_{i=1}^{N} \alpha_i^{*} y_i K(x_i, x_j), \quad \text{for any } x_j \text{ with } 0 < \alpha_j^{*} < C. \tag{37}$$

The most frequently used kernel functions are
(i) the linear kernel: $K(x_i, x_j) = x_i^{T} x_j$;
(ii) the polynomial kernel: $K(x_i, x_j) = \left( \gamma x_i^{T} x_j + r \right)^{d}$, $\gamma > 0$;
(iii) the RBF kernel: $K(x_i, x_j) = \exp \left( -\gamma \|x_i - x_j\|^{2} \right)$, $\gamma > 0$;
(iv) the sigmoid kernel: $K(x_i, x_j) = \tanh \left( \gamma x_i^{T} x_j + r \right)$.

The third one is the RBF kernel function, which is used in our experiments.
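As a concrete illustration, the following minimal sketch (our own, in Python with NumPy; the value of gamma is purely illustrative) computes the RBF kernel value of (iii) for a pair of samples:

import numpy as np

def rbf_kernel(x_i, x_j, gamma=0.5):
    # K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

# Example: kernel value between two 3-dimensional samples.
print(rbf_kernel([1.0, 0.0, 2.0], [0.5, 1.0, 2.0]))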

3. Process of Principal Component Analysis

Principal Component Analysis (PCA), first introduced by Pearson in 1901, is a methodology that can be used to reduce the number of explicative variables of a dataset. In this paper, PCA is used for dimension reduction.

The main procedure of PCA is introduced as follows.

Given a dataset $X = \{x_1, x_2, \ldots, x_N\}$, where $x_i \in \mathbb{R}^{n}$, $i = 1, \ldots, N$. Essentially, dataset $X$ is an $N \times n$ matrix. Firstly, $X$ should be normalized to get a matrix with normalized $n$-dimensional column vectors, so that attributes with large domains do not dominate those with smaller ones (the following procedures are exerted on these column vectors). Then, $k$ orthonormal vectors will be calculated to act as a basis for the normalized input data, which are referred to as the principal components. Let $u_1, u_2, \ldots, u_k$ denote these mutually perpendicular unit vectors; they should satisfy the following requirements:
$$u_i^{T} u_j = \begin{cases} 1, & i = j, \\ 0, & i \neq j. \end{cases} \tag{38}$$
Thirdly, these principal components are sorted descendingly according to their “significance.” Finally, the first several components are chosen to reconstruct a good approximation of the original data. Thus the dimension of the data is reduced; that is, PCA can be used to reduce data dimension.
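A minimal sketch of this procedure is given below (our own illustration in Python with NumPy, not the authors' implementation); the 85% cumulative-variance threshold is the value used later in Section 5:

import numpy as np

def pca_reduce(X, threshold=0.85):
    # Normalize columns to zero mean and unit variance.
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    # Eigendecomposition of the covariance matrix.
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]                   # sort by "significance"
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Keep the first k components reaching the cumulative variance threshold.
    ratio = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(ratio, threshold) + 1)
    return Xc @ eigvecs[:, :k]

For the 270 x 13 heart disease matrix used later, such a reduction keeps only the leading principal components while preserving most of the variance.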

4. The Proposed Methods

In this paper, we select the Statlog Heart Disease Dataset from the UCI (University of California at Irvine) Machine Learning Repository as our experimental data. The dataset contains 270 tuples; each tuple includes thirteen data attributes and one class attribute. The details of the dataset are described in Table 1.

Because the value ranges of different attributes vary greatly, the tuples in the Statlog Heart Disease Dataset are preprocessed as follows. Firstly, to facilitate the following experiments, we change the values of the attribute “class” into 1 and −1, respectively; that is, 1 represents “presence,” and −1 represents “absence.” Then the tuples in the dataset are normalized by column. In the normalization process, only the first thirteen data attributes are considered. We name such an attribute data normalization process System Normalization. The actual normalization procedure for each column vector $A_j$, $j = 1, \ldots, 13$, in dataset $D$ is illustrated with (39).

Let $a_{ij}$ represent the $i$th element of $A_j$.
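Since the exact form of (39) is given by the paper's System Normalization definition, the following Python snippet is only an assumed stand-in that scales each attribute column to [0, 1]; it illustrates the intent of column-wise normalization rather than reproducing (39) itself:

import numpy as np

def normalize_columns(X):
    # Column-wise min-max scaling; an assumed placeholder for (39).
    X = np.asarray(X, dtype=float)
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (X - col_min) / span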

To evaluate the performance of an SVM classifier, we often adopt the cross validation method to get the accuracy of the classification model. Traditionally, when using k-fold cross validation, the input dataset is randomly divided into k subsets, so that all the subsets have almost the same number of samples. In the training phase of the classification model, each subset acts as the testing subset only once and acts as a training subset k − 1 times. In other words, when using k-fold cross validation, k classification models will be built, each of them using k − 1 subsets as the training set and the remaining subset as the testing set. The final classification performance is appraised by using the average result of the k classification models.

In this paper, the stratified sampling technique is adopted to generate the folds (subsets); thus each subset will have nearly the same number of positive samples and the same number of negative samples, and the ratio between positive and negative samples in each subset is the same as that of the whole dataset. That is, the acquired subsets keep the statistical distribution of the original dataset. Algorithm 1 describes the main process of the Stratified Cross Validation.

Require:  Dataset D; the number of folds k
Ensure:  The average accuracy rate ACC
(1) Calculate the row size (N) of dataset D;
(2) Count the number of positive samples (P) and negative samples (Q), respectively;
(3) p ← P/k;
(4) q ← Q/k;
(5) D_pos ← the positive samples of D;
(6) D_neg ← the negative samples of D;
(7) for  i = 1 to k  do
(8)   S_i ← ∅;
(9)   Sample p positive samples from D_pos;
(10)  Sample q negative samples from D_neg;
(11)  Add those sampled samples into subset S_i;
(12)  D_pos ← D_pos minus the sampled positive samples;
(13)  D_neg ← D_neg minus the sampled negative samples;
(14) end  for
(15) ACC ← 0;
(16) i ← 1;
(17) for  i = 1 to k  do
(18)  using subset S_i as testing set, the remaining k − 1 subsets as training set;
(19)  using the training set to create the classification model of SVM;
(20)  using the testing set to generate the accuracy rate acc_i;
(21)  ACC ← ACC + acc_i;
(22) end  for
(23) ACC ← ACC/k;
(24) return  the average accuracy rate ACC;
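For readers who prefer working code, the following compact sketch (our own, assuming scikit-learn, whose SVC wraps LIBSVM) performs the same stratified k-fold evaluation; the parameter values are illustrative only:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def stratified_cv_accuracy(X, y, k=5, C=0.1, gamma=0.1):
    # Average accuracy of an RBF-kernel C-SVC over k stratified folds.
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    accs = []
    for train_idx, test_idx in skf.split(X, y):
        clf = SVC(C=C, kernel="rbf", gamma=gamma)
        clf.fit(X[train_idx], y[train_idx])
        accs.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(accs))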

4.1. Multilevel Grid Search

In this paper, we select C-SVC as our SVM classifier and the RBF kernel as the kernel function. Thus we need to decide the values of the punishment parameter C of C-SVC and the parameter γ of the RBF kernel function. Generally, the Grid Search technique is the most frequently used approach for deciding the optimal values of these two parameters. Given initial value ranges and search steps for the two parameters, the Grid Search algorithm iteratively checks every value pair of C and γ to try to obtain the optimal pair. Different settings of the initial value ranges and steps for these two parameters greatly influence the optimization results. Meanwhile, because the Grid Search method is a typical exhaustive search technique, the search process is time consuming.
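As a point of reference, a conventional single-level grid search over (C, γ) can be sketched as follows (our own illustration, assuming scikit-learn's GridSearchCV; the exponentially spaced ranges are illustrative, not the paper's settings):

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {
    "C": 2.0 ** np.arange(-5, 16, 2),       # illustrative range for C
    "gamma": 2.0 ** np.arange(-15, 4, 2),   # illustrative range for gamma
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
# search.fit(X_train, y_train)   # X_train, y_train: the preprocessed heart disease data
# print(search.best_params_, search.best_score_)

Every (C, γ) pair in the grid is evaluated with cross validation, which is why the cost grows quickly as the steps become finer.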

Here, we present an adapted grid search method named Multilevel Grid Search (MGS), which effectively saves search time while obtaining the same search result as the traditional grid search process.

Algorithm 2 reveals the details of our Multilevel Grid Search method.

Require:  The training dataset; the initial value ranges and search steps of C and γ; the minimum (finest) search steps
Ensure:  The optimal C and γ, and the corresponding best accuracy ACC
(1) INPUT: the initial value ranges of C and γ;
(2) INPUT: the initial (coarse) search steps step_C and step_γ, and the minimum steps minStep_C and minStep_γ;
(3) GridSearch(value ranges of C and γ, step_C, step_γ);
(4) level ← 1;
(5) while  step_C > minStep_C or step_γ > minStep_γ  do
(6)  Record the best ACC and the corresponding C, γ;
(7)  Narrow the value range of C to a neighborhood of the current best C;
(8)  Narrow the value range of γ to a neighborhood of the current best γ;
(9)  if  step_C > minStep_C  then
(10)   Reduce step_C;
(11)  else
(12)   step_C ← minStep_C;
(13)  end if
(14)  if  step_γ > minStep_γ  then
(15)   Reduce step_γ;
(16)  else
(17)   step_γ ← minStep_γ;
(18)  end if
(19)  GridSearch(narrowed value ranges of C and γ, step_C, step_γ);
(20) end while
(21)  return  the best C, γ, and ACC.
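The following coarse-to-fine sketch (our own reading of the multilevel idea, in Python; it is not the authors' exact MGS procedure, and the ranges, halving factor, and evaluate helper are assumptions) shows how the search region can be repeatedly narrowed around the current best (C, γ) pair while the step is refined:

import numpy as np

def multilevel_grid_search(evaluate, c_range=(-5, 15), g_range=(-15, 3),
                           step=4.0, min_step=0.25):
    # Coarse-to-fine search over log2(C) and log2(gamma).
    # `evaluate(C, gamma)` should return a cross-validation accuracy.
    best = (-np.inf, None, None)
    while step >= min_step:
        for lc in np.arange(c_range[0], c_range[1] + 1e-9, step):
            for lg in np.arange(g_range[0], g_range[1] + 1e-9, step):
                acc = evaluate(2.0 ** lc, 2.0 ** lg)
                if acc > best[0]:
                    best = (acc, lc, lg)
        # Narrow the ranges around the current best point and refine the step.
        _, lc, lg = best
        c_range, g_range = (lc - step, lc + step), (lg - step, lg + step)
        step /= 2.0
    return {"ACC": best[0], "C": 2.0 ** best[1], "gamma": 2.0 ** best[2]}

Because each level evaluates only a small grid, the total number of (C, γ) pairs examined is far smaller than in a single fine-grained exhaustive search.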

Figures 2 and 3 show the parameter optimization results of the MGS method. Figure 2(a) gives the Grid Search results on S-DATA3; Figure 2(b) demonstrates the Grid Search results on PCA and System Normalization based S-DATA3; Figures 2(c)-2(d) demonstrate the Multilevel Grid Search process on PCA and System Normalization based S-DATA3. The situation in Figure 3 is about the same as in Figure 2, but its experimental data is R-DATA6.

Figures 2(a) and 3(a) show that PCA has little influence on the classification accuracy of SVM; on average, the experimental results in our study show that, when a proper PCA threshold is adopted, PCA based SVM has classification accuracy similar to that of traditional SVM algorithms.

5. Experimental Results

In this paper, the UCI Statlog heart disease dataset is used to test our method. The main purpose of our study is to create a more efficient and available SVM model. The original dataset is processed based on the holdout method; that is, the given data are partitioned into two independent sets, one as the training set and the other as the test set. Two-thirds of the original data are allocated to the training set; the remaining one-third is allocated to the test set. Firstly, we repeat the stratified subsampling based holdout method 10 times to generate 10 stratified datasets S-DATA1 to S-DATA10; then, the random subsampling based holdout method is repeated 10 times to generate R-DATA1 to R-DATA10. In this paper, the experiments are mainly based on these S-DATA and R-DATA datasets.

To study the necessity of our System Normalization process, extensive experiments were performed on our datasets, and the results show that System Normalization exerts a great influence on the classification result of SVM; that is, it can effectively improve the classification accuracy of SVM. For simplicity, only part of the results are demonstrated. Figure 4 shows the influence of normalization on the classification result using the traditional SVM classifier.

SVM is a promising machine learning algorithm with a solid theoretical foundation and good generalization ability, but the training of even the fastest SVMs can be extremely slow. In this paper, PCA is adopted to reduce the dimension of our data. Here, the threshold of PCA is 85%.

Figure 5 shows the PCA results on the Statlog Heart Disease Dataset. As shown in Figure 5, the System Normalization process has no obvious effect on PCA. Further study shows that, when a proper threshold is adopted (in this study, it is 85%), PCA has no direct influence on the classification accuracy of SVM. Figure 6 shows the ROC curves of the SVM classifier on two differently sampled Heart Disease Datasets: one randomly sampled and the other stratified sampled. It is obvious that PCA based SVM has classification accuracy similar to that of traditional SVM, but it can reduce the time complexity to some extent (about a 25% reduction).

We name the classifier based on the PCA and MGS methods PMSVM, and we call the classifier based on the PCA strategy PLSVM. All experimental results are compared with those of the well-known LIBSVM algorithm. To compare the performances of PMSVM and PLSVM with that of LIBSVM, the Confusion Matrix, Sensitivity, Specificity, Precision, ROC curve, and AUC are used as the main evaluation criteria for classification accuracy; meanwhile, the time overhead is checked to measure the efficiency of our method. All SVM algorithms are based on the C-SVC classification model and the RBF kernel function.

The Confusion Matrix is a useful tool for describing a classifier's results for different classes. Given $m$ classes, the Confusion Matrix is an $m \times m$ matrix whose element $CM_{i,j}$ describes the number of samples of class $i$ that the classifier classifies into class $j$. For an ideal classifier with 100% classification accuracy, all the samples should be described by the elements on the main diagonal of the Confusion Matrix. That is, let $|D|$ represent the total number of samples in dataset $D$; we can get the following:
$$\sum_{i=1}^{m} CM_{i,i} = |D|. \tag{40}$$
Table 2 gives an example for a 2-class classifier, where TP represents the number of positive samples that are correctly classified as positive samples; FN represents the number of positive samples that are wrongly classified as negative samples; TN represents the number of negative samples that are correctly classified as negative samples; FP represents the number of negative samples that are wrongly classified as positive samples.

The following results for Sensitivity, Specificity, Precision, and so forth, rely closely on the contents of the Confusion Matrix. Let P represent the total number of positive samples, and let N represent the total number of negative samples; it is easy to get the following relations:
$$P = TP + FN, \qquad N = TN + FP. \tag{41}$$

“Sensitivity” is the ratio between TP and P; that is, it is the ratio of truly classified positive samples. Sensitivity can be described as
$$\text{Sensitivity} = \frac{TP}{P} = \frac{TP}{TP + FN}. \tag{42}$$

“Specificity” is the ratio between TN and N; in other words, it is the ratio of truly classified negative samples. Specificity can be defined as
$$\text{Specificity} = \frac{TN}{N} = \frac{TN}{TN + FP}. \tag{43}$$

Precision gives the ratio between the number of truly classified positive samples and the total number of samples that are classified as positive. Precision can be described as
$$\text{Precision} = \frac{TP}{TP + FP}. \tag{44}$$
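These three measures can be computed directly from the 2-class Confusion Matrix; the short sketch below (our own, assuming scikit-learn and the −1/1 class labels defined in Section 4) makes the correspondence explicit:

from sklearn.metrics import confusion_matrix

def diagnostic_metrics(y_true, y_pred):
    # Rows/columns ordered as [-1 (absence), 1 (presence)].
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[-1, 1]).ravel()
    return {
        "Sensitivity": tp / (tp + fn),
        "Specificity": tn / (tn + fp),
        "Precision":   tp / (tp + fp),
    }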

The ROC curve is a useful visual tool for comparing the performances of different classification models. ROC is the abbreviation of Receiver Operating Characteristic. The ROC curve displays the trade-off between the true positive rate and the false positive rate of a given classifier. Here, AUC means the area under the (ROC) curve, and it can express the accuracy of a given classification model. Figure 7 demonstrates the ROC curves of standard LIBSVM on 10 stratified sampled datasets using the 5-fold and 10-fold cross validation methods, respectively. As we can see from the classification results shown in Figure 7, there are no obvious differences between the results of the 5-fold cross validation method and those of the 10-fold cross validation method. So, we adopt 5-fold cross validation as our default cross validation scheme in the following experiments, and the cross validation technique we use is referred to as Stratified Cross Validation (SCV), as mentioned in Algorithm 1.
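For completeness, the ROC points and AUC reported here can be obtained from the SVM decision values as in the following sketch (our own, assuming scikit-learn; pos_label = 1 matches the “presence” class defined in Section 4):

from sklearn.metrics import roc_curve, auc

def roc_auc(y_test, decision_scores):
    # decision_scores: signed distances returned by SVC.decision_function
    fpr, tpr, _ = roc_curve(y_test, decision_scores, pos_label=1)
    return fpr, tpr, auc(fpr, tpr)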

Figure 8 demonstrates the ROC curves of PMSVM, PLSVM, and LIBSVM. Here, both parameters, C and γ, of PLSVM and LIBSVM are set to the constant empirical value 0.1. As shown in Figure 8, the classification results of PLSVM and LIBSVM are quite similar; the difference between them is that PLSVM can effectively improve the classification efficiency by about 25%, because of the PCA process adopted. Generally, the classification performance of PMSVM, which is based on our PCA and MGS strategies, has obvious advantages; its classification accuracy is higher than that of PLSVM and LIBSVM; meanwhile, the running time of PMSVM is significantly shorter than theirs; that is, the time consumption of our PMSVM is reduced by about 75% compared with that of traditional LIBSVM.

Tables 3 and 4 show the overall comparison between PMSVM and LIBSVM (here, parameters C and γ of PMSVM and LIBSVM are acquired through Multilevel Grid Search and Grid Search, respectively). Figure 9 demonstrates the time overhead of PMSVM, PLSVM, and LIBSVM.

In the whole experiment, the Stratified Sampling and Random Sampling methods show no direct influence on the classification results of PMSVM, PLSVM, and LIBSVM, but Stratified Cross Validation provides more reliable results. The influence of the Stratified Sampling technique on the classification results of imbalanced data is what we will study further in our future work.

6. Conclusions

In this paper, an optimized Support Vector Machine classifier, named PMSVM, is proposed, in which System Normalization, PCA, and MGS methods are used to improve the overall performance of SVM. Experimental results show that System Normalization can effectively assure the classification accuracy (Figure 4). PCA (Principal Component Analysis) plays the role of dimensionality reduction; as shown in our experiments, when a proper threshold value is chosen, PCA can both economize classification time and assure classification accuracy. The most prominent characteristic of our study is that, when our PCA and Multilevel Grid Search (MGS) methods are adopted, the time consumption of our PMSVM algorithm can be reduced by about 75% on average, while obtaining similar or better classification accuracy than the classical LIBSVM algorithm.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors would like to thank Professor Xiaoyun Chen for her meticulous guidance and Dr. C.-J. Lin for supplying the LIBSVM software that is used in this research. This study is supported by the Fundamental Research Funds for the Central Universities (lzujbky-2013-229, lzujbky-2014-47).