Abstract

The Support Vector Machine (SVM) proposed by Vapnik is a generalized linear classifier that performs binary classification via supervised learning. SVM has developed rapidly and has spawned a series of improved and extended algorithms, which have been applied in pattern recognition, image recognition, and other fields. Among these improvements, the technique of setting the ratio of the two penalty parameters according to the ratio of the sample sizes of the two classes has been widely accepted. However, the technique has never been verified by rigorous mathematical proof. The experiments in this study, based on the USPS sets, were designed to test the accuracy of the theory. The optimal parameters of the USPS sets were found through grid scanning, and the results show that the theory does not hold in general: there is no linear relationship between the ratio of the penalty parameters and the ratio of the sample sizes.

1. Introduction

In the mid-1990s, the research team led by Vapnik proposed the Support Vector Machine (SVM) [13]. By using a nonlinear mapping from a lower-dimensional space to a higher-dimensional space, the SVM seeks the hyperplane with the best classifying performance. Grounded in statistical learning theory and structural risk minimization, and solving its optimization problem through duality theory, SVM has become a valuable algorithm in the field of artificial intelligence.

The original SVM had only one penalty parameter. Cortes and Vapnik [3] proposed a variant of SVM with two penalty parameters, C+ and C−. Chew et al. [4, 5] put forward the idea that adjusting C+ and C− according to the numbers of samples in the two classes gives SVM better classification accuracy, and this idea has been widely accepted. The theory, however, has never been proved mathematically; it was derived from experience and described as "a rule of thumb" [4]. A number of experiments were designed in this paper to test the theory. The experiments were conducted on the USPS dataset, a standard handwriting database from the United States Postal Service. USPS contains ten categories of samples, the digits 0 through 9, and is often used for testing in the field of machine learning.

Through grid scanning over the parameters, the relationships among the optimal parameters were revealed when SVM achieved its best performance. The experimental results did not show that the optimal C+ and C− have the relationship with the class sizes that was proposed and applied by Chew et al. [4, 5].

2. Support Vector Machine Algorithm

The initial SVM solves the following quadratic programming problem:

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^{2}+C\sum_{i=1}^{n}\xi_{i}\qquad \text{s.t.}\quad y_{i}(w\cdot x_{i}+b)\geq 1-\xi_{i},\quad \xi_{i}\geq 0,\quad i=1,\dots,n,\tag{1}$$

where C is the penalty parameter, ξi is the slack variable for the i-th data vector, and w and b are the normal vector and the bias of the hyperplane, respectively. Figure 1 shows an example of SVM in two dimensions.

The kernel function k(xi, xj) is introduced into formula (1) to replace the inner product operation, and the dual form is obtained as follows [1]:

$$\max_{\alpha}\ \sum_{i=1}^{n}\alpha_{i}-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_{i}\alpha_{j}y_{i}y_{j}k(x_{i},x_{j})\qquad \text{s.t.}\quad \sum_{i=1}^{n}\alpha_{i}y_{i}=0,\quad 0\leq\alpha_{i}\leq C,\tag{2}$$

where the αi are the Lagrange multipliers. An SVM with high classification accuracy should have most samples correctly classified with αi = 0 and as few samples as possible with αi > 0.

All the training samples in Figure 1 can be divided into three cases:

(1) Nonsupport vectors (NSVs), which are correctly classified and lie outside H1 and H2, satisfy

$$\alpha_{i}=0,\qquad y_{i}(w\cdot x_{i}+b)>1.\tag{3}$$

(2) Support vectors (SVs), which are correctly classified and located exactly on H1 and H2, satisfy

$$0<\alpha_{i}<C,\qquad y_{i}(w\cdot x_{i}+b)=1.\tag{4}$$

(3) Bounded support vectors (BSVs), which are misclassified, satisfy

$$\alpha_{i}=C,\qquad y_{i}(w\cdot x_{i}+b)<1.\tag{5}$$
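The three cases can be read off directly from a trained classifier. The following is a minimal sketch (not the authors' code) that sorts training points into NSVs, margin SVs, and BSVs by inspecting the dual coefficients of a fitted scikit-learn SVC; the synthetic dataset is purely illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

C = 1.0
clf = SVC(kernel="poly", degree=3, C=C).fit(X, y)

# dual_coef_ stores y_i * alpha_i for the support vectors only
alpha = np.abs(clf.dual_coef_.ravel())
tol = 1e-6

bsv = clf.support_[alpha >= C - tol]   # BSVs: alpha_i = C (margin violators)
sv = clf.support_[alpha < C - tol]     # margin SVs: 0 < alpha_i < C (on H1/H2)
nsv = np.setdiff1d(np.arange(len(X)), clf.support_)  # NSVs: alpha_i = 0

print(len(nsv), len(sv), len(bsv))
```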

3. Improved SVM Algorithm

3.1. SVM with Double Penalty Parameters

Two penalty parameters, namely, C+ and C−, were introduced by Osuna et al. [6]. The optimization problem to be minimized takes the form

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^{2}+C_{+}\sum_{i:\,y_{i}=+1}\xi_{i}+C_{-}\sum_{i:\,y_{i}=-1}\xi_{i}\qquad \text{s.t.}\quad y_{i}(w\cdot x_{i}+b)\geq 1-\xi_{i},\quad \xi_{i}\geq 0,\tag{6}$$

where C+ and C− are the error penalties for the positive (yi = +1) and the negative (yi = −1) vectors, respectively. The dual form of (6) is

$$\max_{\alpha}\ \sum_{i=1}^{n}\alpha_{i}-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_{i}\alpha_{j}y_{i}y_{j}k(x_{i},x_{j})\qquad \text{s.t.}\quad \sum_{i=1}^{n}\alpha_{i}y_{i}=0,\quad 0\leq\alpha_{i}\leq C_{+}\ (y_{i}=+1),\quad 0\leq\alpha_{i}\leq C_{-}\ (y_{i}=-1).\tag{7}$$
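In practice, the two-penalty formulation (6) is available through class weights. A sketch (assumed setup, not from the paper): scikit-learn's SVC multiplies the base C by class_weight[y_i], so class_weight = {+1: C+/C, −1: C−/C} yields the per-class penalties of problem (6):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # toy labels in {+1, -1}

C, C_plus, C_minus = 1.0, 400.0, 100.0        # illustrative values only
clf = SVC(kernel="poly", degree=3, C=C,
          class_weight={1: C_plus / C, -1: C_minus / C}).fit(X, y)
```

Note that class_weight='balanced' in scikit-learn sets the weights inversely proportional to the class frequencies, which is exactly the N−/N+ rule questioned in this paper.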

Chew et al. raised the point [4] that in order to avoid overlearning of the SVM, the BSVs should greatly outnumber the SVs. That is to say, the misclassified samples should far exceed the samples on the boundary lines. In other words, in the SVM with the best performance, there are enough BSVs to dominate, and the SVs can be ignored because of their comparatively tiny number.

In the view of Chew et al. [4], the error rates of the two classes of samples can be expressed as B+/N+ and B−/N−, where B+ and B− are the numbers of misclassified samples (BSVs) of the two classes and N+ and N− are the numbers of samples of the two classes, respectively. Since the BSVs take the bound values αi = C+ or αi = C− and the SVs are neglected, the equality constraint in formula (7) can be rewritten as

$$C_{+}B_{+}\approx C_{-}B_{-}.\tag{8}$$

Setting the error rates of the positive class and the negative class to be equal,

$$\frac{B_{+}/N_{+}}{B_{-}/N_{-}}=1.\tag{9}$$

When the following equation is true, the SVM is claimed to have the best classification performance:

$$\frac{C_{+}}{C_{-}}=\frac{N_{-}}{N_{+}}.\tag{10}$$
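For completeness, the reasoning behind equation (10) can be written out in one chain. This is a reconstruction of Chew et al.'s heuristic argument from (8) and (9), not a rigorous proof:

```latex
% From the equality constraint of the dual (7), with the margin SVs neglected,
% the BSVs take the bound values, so
\sum_{y_i=+1}\alpha_i=\sum_{y_i=-1}\alpha_i
\;\Longrightarrow\; C_+ B_+ \approx C_- B_- . % formula (8)
% Requiring equal error rates for the two classes, formula (9):
\frac{B_+}{N_+}=\frac{B_-}{N_-}
\;\Longrightarrow\; \frac{B_-}{B_+}=\frac{N_-}{N_+}
\;\Longrightarrow\; \frac{C_+}{C_-}=\frac{B_-}{B_+}=\frac{N_-}{N_+} . % formula (10)
```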

3.2. ν-SVM

Schölkopf et al. [7] put forward the ν-SVM algorithm, in which the parameter ν replaces C and ξ′ of the original SVM. The ν-SVM is as follows:

$$\min_{w,\,b,\,\xi,\,\rho}\ \frac{1}{2}\|w\|^{2}-\nu\rho+\frac{1}{n}\sum_{i=1}^{n}\xi_{i}\qquad \text{s.t.}\quad y_{i}(w\cdot x_{i}+b)\geq\rho-\xi_{i},\quad \xi_{i}\geq 0,\quad \rho\geq 0,\tag{11}$$

where ρ determines the location of the margin and 2ρ/‖w‖ is its width. When the classification error is minimized, the maximum width of the margin can be obtained. Chew et al. improved the ν-SVM [5] and drew the same conclusion that the optimal parameters of ν-SVM satisfy the relationship of equation (10).
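A brief sketch of the ν-SVM using scikit-learn's NuSVC, again on synthetic data for illustration; ν bounds the fraction of margin errors from above and the fraction of support vectors from below:

```python
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(int)   # roughly balanced toy labels

# nu = 0.1: at most 10% margin errors, at least 10% support vectors
clf = NuSVC(nu=0.1, kernel="poly", degree=3).fit(X, y)
print(clf.n_support_)                     # number of SVs per class
```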

Equation (10) has been widely recognized and put into use on two-class sample sets, and the results have been reported to show optimal classification performance [8–10].

4. Hypothesis Testing

There is a lack of rigorous mathematical derivation and proof for formula (10). Furthermore, the SVs that are correctly classified (located on H1 and H2 in Figure 1) are ignored, which is why only the symbol "≈", rather than "=", can be used in formula (8). In addition, formula (10) only constrains the ratio C+/C−, not the absolute values of C+ and C−, which actually need to be specified.

In the following experiment, it will be verified whether the optimal parameters of SVM satisfy equation (10).

4.1. Methods and Steps of the Testing

The experiment consisted of two main steps. In the first step, ten two-class data sets were constructed; in the second step, they were tested with different parameters.

4.1.1. Establishment of Two-Class Data Sets

The USPS handwritten digit dataset is often used for algorithm testing in the fields of pattern recognition and machine learning. There are 10 classes of samples in USPS, the digits 0 through 9, and each sample has 256 attributes. The sizes of the training set and the testing set are 7291 and 2007, respectively [11].

The USPS was transformed into 10 two-class data sets, denoted USPS-0, USPS-1, USPS-2, …, USPS-9. The USPS-0 set contains two classes of samples: the digit 0 and all the non-0 digits (1 through 9). The USPS-1 set contains the digit 1 and all the non-1 digits (0 and 2 through 9), and so on, up to the USPS-9 set, which contains the digit 9 and all the non-9 digits (0 through 8).
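A sketch of how the ten two-class label sets can be derived from the USPS labels (the loading of the raw data is assumed and omitted; USPS is commonly distributed with integer labels 0-9):

```python
import numpy as np

def make_one_vs_rest(y, digit):
    """Label `digit` as +1 and the remaining nine digits as -1."""
    return np.where(y == digit, 1, -1)

# Tiny demonstration with hypothetical labels:
y_demo = np.array([0, 3, 9, 0, 7])
print(make_one_vs_rest(y_demo, 0))   # [ 1 -1 -1  1 -1]

# y_train = ...  # the 7291 USPS training labels, loaded elsewhere
# binary_sets = {d: make_one_vs_rest(y_train, d) for d in range(10)}
```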

4.1.2. Testing Procedure

Grid scanning over the parameters was adopted to find the optimal parameters of the ten two-class datasets. The polynomial kernel function was employed, and the LIBSVM software package [12] was used to carry out the experiment.

The steps are as follows (a condensed code sketch follows the tie-breaking rules below):

(1) The order of the polynomial d ∈ {2, 3, 4, 5, 6} was specified; in other words, there were five major cycles, one for each order of the polynomial.

(2) The penalty parameters C+ and C− ranged over [100, 10000] with a step width of 100; that is, C+ took the values 100, 200, 300, …, 9900, 10000 in order, and so did C−.

(3) Three nested loops scanned the three parameters d, C+, and C−, so each of the ten two-class datasets was tested with 5 × 100 × 100 = 50,000 different combinations of the three parameters.

The parameters achieving the highest accuracy were taken as the optimal parameters. If multiple parameter combinations were tied, the following two rules were applied to choose among them:

(1) choosing the combination at the inflection point of the parameters;

(2) if several equal combinations still remained, choosing the one with the lowest order of the polynomial.
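The following is a condensed sketch of the grid scan (scikit-learn's SVC in place of raw LIBSVM; X_train, y_train, X_test, and y_test are assumed to be one of the USPS two-class sets with labels in {+1, −1}). Ties on accuracy are broken in favor of the lowest polynomial order, mirroring rule (2); running all 50,000 fits per dataset is expensive:

```python
from sklearn.svm import SVC

def grid_scan(X_train, y_train, X_test, y_test):
    best = (0.0, None)                       # (accuracy, (d, C_plus, C_minus))
    for d in (2, 3, 4, 5, 6):                # order of the polynomial kernel
        for c_plus in range(100, 10001, 100):
            for c_minus in range(100, 10001, 100):
                # class_weight * C gives the per-class penalties C+ and C-
                clf = SVC(kernel="poly", degree=d, C=1.0,
                          class_weight={1: c_plus, -1: c_minus})
                acc = clf.fit(X_train, y_train).score(X_test, y_test)
                if acc > best[0]:            # strict ">" keeps the lowest d on ties
                    best = (acc, (d, c_plus, c_minus))
    return best
```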

4.2. Testing Result

After grid scanning the ten two-class data sets, the optimal parameters obtained are shown in Table 1. The first column of the table is the category identifier of the 10 data sets. In the header row, d is the order of the polynomial; N+ and N− are the sample sizes of the two classes, respectively; and C+ and C− are the penalty parameters of the two classes.

In order to check more intuitively whether the optimal parameters in Table 1 satisfy formula (10), the histogram of C+/C− against N−/N+ is shown in Figure 2.
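A sketch of the Figure 2 comparison (matplotlib assumed; the arrays c_ratio and n_ratio would hold the C+/C− and N−/N+ values from Table 1, which are not reproduced here):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_ratio_comparison(labels, c_ratio, n_ratio):
    """Side-by-side bars of C+/C- and N-/N+ for each two-class dataset."""
    x = np.arange(len(labels))
    plt.bar(x - 0.2, c_ratio, width=0.4, label="C+/C-")
    plt.bar(x + 0.2, n_ratio, width=0.4, label="N-/N+")
    plt.xticks(x, labels, rotation=45)
    plt.legend()
    plt.show()
```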

In Figure 2, nine of the datasets satisfy C+/C− ≠ N−/N+, and only the USPS-7 dataset comes close to C+/C− ≈ N−/N+. The experiment therefore leads to the conclusion that the relationship between the sample sizes and the optimal parameters does not satisfy formula (10); in other words, the following formula holds in general:

$$\frac{C_{+}}{C_{-}}\neq\frac{N_{-}}{N_{+}}.\tag{12}$$

Formula (10) has not been proved mathematically in a rigorous way; it is a kind of reasoning based on experience. Moreover, formula (10) is too simple to reveal the true relationship between the sample sizes and the optimal parameters.

The core of SVM is the SVs [13], which are correctly classified and located on H1 and H2 in Figure 1; the classification hyperplane is derived mainly from the attributes of the SVs. Whether the number of SVs is large or small, they cannot be ignored [4, 5] under any circumstances.

In fact, the number of correctly classified support vectors (SVs) should exceed the number of bounded support vectors (BSVs) in an SVM with preferable performance. As can be seen from Figure 1, the more points on H1 and H2, the better the SVM performs, while the misclassified points should be as few as possible.

4.3. Analysis of the Means and Standard Deviation

In the grid scanning of Section 4.1.2, many parameter combinations attained the same optimal value; that is, the optimal accuracy was reached at many points. To further examine the relationship between the penalty parameters and the sample sizes, the parameter combinations sharing the same optimal classification accuracy were analyzed statistically.

The means and the standard deviations of the C+/C− values sharing the same optimal accuracy were calculated and are shown in Table 2. The second column from the left is the number of tied optimal values, denoted N(opt. C+/C−). The means and the standard deviations of the optimal C+/C− values are given as E(opt. C+/C−) and σ in the third and fourth columns.
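A sketch of the statistics of Table 2: given all parameter combinations that tie for the best accuracy on one dataset, compute N(opt. C+/C−), E(opt. C+/C−), and σ (the actual tied combinations from the experiment are not reproduced here):

```python
import numpy as np

def tie_statistics(tied_params):
    """tied_params: list of (d, C_plus, C_minus) tuples with equal best accuracy."""
    ratios = np.array([cp / cm for _, cp, cm in tied_params])
    return len(ratios), ratios.mean(), ratios.std()
```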

Based on N− > N+ in Table 1, the following inequality is true:

$$\frac{N_{-}}{N_{+}}>1.\tag{13}$$

If formula (10) were true, then combined with formula (13) the following would hold:

$$E\!\left(\text{opt. }\frac{C_{+}}{C_{-}}\right)=\frac{N_{-}}{N_{+}}>1.\tag{14}$$

However, four of the E(opt. C+/C−) values in Table 2, namely, those of USPS-1, USPS-2, USPS-5, and USPS-6, satisfy the following inequality:

$$E\!\left(\text{opt. }\frac{C_{+}}{C_{-}}\right)<1.\tag{15}$$

Formulas (14) and (15) contradict each other, which is further evidence that formula (10) does not hold in general.

5. Conclusion

Grid scanning over the parameters was employed to find the optimal values, with the aim of revealing the relationship between the optimal parameters and the sample sizes. Since the parameter space is infinite, it is impossible to test every possibility; instead, the optimal parameters were searched over a very wide range, which required considerable time.

From the results of the study, it is concluded that the optimal parameters C+ and C− by no means depend simply on the sample sizes; more exactly, there is no linear relationship between the ratio of the penalty parameters and the ratio of the sample sizes.

At present, parameter optimization in machine learning is generally local optimization, and the study in this paper is no exception. Optimization algorithms such as gradient descent, Newton's method, and quasi-Newton methods could be used to find the optimal parameters of SVM; this is an iterative process and certainly takes a lot of time. Therefore, finding the optimal parameters within a limited and acceptable time is highly valuable and is a new research direction worth exploring.

Data Availability

All data included in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the Professor Foundation of Anqing Medical College under the grant of Feng Guang.