Abstract

We propose a refined gradient ascent method with heuristically determined parameters for solving the dual problem of nonlinear SVMs. Aiming at a better tuning to the particular training sequence, the proposed refinement consists in the use of heuristically established weights that correct the search direction at each step of the learning algorithm, which evolves in the feature space. We propose three variants for computing the correcting weights, their effectiveness being analyzed experimentally in the final part of the paper. The tests pointed out good convergence properties, and moreover, the proposed modified variants yielded higher convergence rates than Platt's SMO algorithm. The experimental analysis aimed to derive conclusions on the recognition rate as well as on the generalization capacities. The learning phase of the SVM involved linearly separable samples randomly generated from Gaussian distributions and the WINE and WDBC datasets. The generalization capacities in the case of artificial data were evaluated by several tests performed on new linearly/nonlinearly separable data coming from the same classes. The tests pointed out high recognition rates (about 97%) on the artificial datasets and even higher recognition rates in the case of the WDBC dataset.

1. Introduction

According to the theory of SVMs, while traditional techniques for pattern recognition attempt to optimize performance in terms of the empirical risk, SVMs minimize the structural risk, that is, the probability of misclassifying yet-to-be-seen patterns for a fixed but unknown probability distribution of the data [14]. The most distinctive and attractive features of this classification paradigm are the ability to condense the information contained in the training set and the use of families of decision surfaces of relatively low Vapnik-Chervonenkis dimension.

SVM approaches to classification lead to convex optimization problems, typically quadratic problems in a number of variables equal to the number of examples, and these optimization problems become challenging when the number of data points exceeds a few thousand.

To make SVMs more practical, several algorithms have been developed, such as Vapnik's chunking and Osuna's decomposition [1, 5]. They make the training of SVMs feasible by breaking the large QP problem into a series of smaller QP problems and optimizing only a subset of the training patterns at each step. Because the subset of training patterns optimized at each step is called the working set, these approaches are referred to as working set methods.

Recently, a series of works on developing specializations, for instance, reduced support vector machines (RSVM) [6] and smooth support vector machines (SSVM) [7], as well as parallel implementations of SVM training, have been proposed [8]. Methods have also been proposed to solve the least squares SVM formulations [7–10], together with software packages such as [11], mysvm [12], and many others [3, 11, 13–15]. It is worth mentioning that a series of developments aimed to improve the accuracy of the resulting SVM classifier by combining it with boosting-type techniques [16, 17].

Assume that the data are represented by a finite set of labeled examples coming from two pattern classes and that a vector-valued function represents the filter extracting information from the current input. This function is usually referred to as the feature extractor, and its range is thought of as the feature space. Briefly, from a mathematical point of view, the problem of determining the parameters of an optimal margin classifier reduces to a quadratic programming (QP) problem [3].
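In the standard hard-margin formulation (written here in generic notation, with $\varphi$ denoting the feature extractor, $w$ the weight vector, and $b$ the bias, and assuming linearly separable training data, as in the experiments reported later), this primal problem reads as follows.

```latex
\min_{w,\,b}\ \frac{1}{2}\,\lVert w\rVert^{2}
\quad\text{subject to}\quad
y_i\bigl(\langle w,\varphi(x_i)\rangle + b\bigr)\ \ge\ 1,
\qquad i=1,\dots,N.
```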

If $(w^{*}, b^{*})$ is a solution of (1), then the SVM classifier corresponds to the decision rule: IF $\langle w^{*},\varphi(x)\rangle + b^{*} \ge 1$ THEN $x$ is assigned to the class labeled $+1$; IF $\langle w^{*},\varphi(x)\rangle + b^{*} \le -1$ THEN $x$ is assigned to the class labeled $-1$; IF $-1 < \langle w^{*},\varphi(x)\rangle + b^{*} < 1$ THEN the output is $0$, and $0$ means "unclassifiable."

Using the Lagrange multiplier method, the QP problem (1) reduces to its dual QP problem (2), stated in terms of the Lagrange multipliers.

The bias parameter cannot be computed explicitly by solving the dual problem, a convenient choice of its value being derived in terms of the support vectors. Usually, a suitable value of the bias is selected such that the margin constraints hold with equality on the support vectors, for instance as in [3].
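A common choice, written in generic notation with $w^{*}=\sum_i \alpha_i^{*} y_i \varphi(x_i)$ the optimal weight vector, places the separating hyperplane midway between the closest examples of the two classes:

```latex
b^{*} \;=\; -\,\frac{1}{2}\Bigl(\max_{i\,:\,y_i=-1}\,\langle w^{*},\varphi(x_i)\rangle
\;+\;\min_{i\,:\,y_i=+1}\,\langle w^{*},\varphi(x_i)\rangle\Bigr).
```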

In our work, we prefer to use a value of the bias computed on a heuristic basis, aiming to take into account the available information about the variability of the subsamples coming from the two classes [18].

The performance of the resulting classifier is essentially determined by the quality of the feature extractor, so the main problem becomes the design of a particularly informative feature extractor. One way of overcoming this difficulty is the "kernel trick." Basically, the method consists in selecting a suitable kernel that, on the one hand, "hides" the explicit expression of the feature extractor and, on the other hand, allows working in a feature space of possibly very high dimension without increasing the computational complexity [3]. Usually, a particular functional expression is assumed for the kernel, which "hides" both the dimension of the feature space and the explicit expression of the feature extractor.

If we assume that the kernel is given by $K(x, x') = \langle \varphi(x), \varphi(x')\rangle$ for a certain feature extractor $\varphi$, then $K$ is a positive semidefinite Mercer kernel [4]. In this case, the problem of determining the parameters of an optimal margin classifier corresponds to solving the optimization problem (2), with the objective function expressed entirely in terms of the kernel.
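For reference, in the standard dual form (generic notation, hard-margin case), the kernelized objective and constraints read as follows.

```latex
\max_{\alpha}\ W(\alpha)=\sum_{i=1}^{N}\alpha_i
-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\,\alpha_j\,y_i\,y_j\,K(x_i,x_j)
\quad\text{subject to}\quad
\alpha_i\ge 0,\qquad \sum_{i=1}^{N}\alpha_i\,y_i=0.
```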

In our work, we use exponential-type (RBF) kernels and develop a modified gradient ascent method for solving the QP problem (2). The refinement considered in our developments comes from the use of weights in determining the direction of the search displacement at each step, in order to obtain a better tuning to the particular training sequence. We propose three variants, partially heuristic, for determining the weights, and their corresponding performance is experimentally analyzed in the final section of the paper. In our developments, we implemented an SVM classifier of the type described in [11].

2. Modified Gradient Ascent Method for Nonlinear SVM Learning

In [19], we proposed a modified learning rule of gradient ascent type for linear SVMs that can be extended to the nonlinear case as follows. Assume that the training set consists of labeled examples coming from the two classes; conventionally, the first examples come from the class labeled by $+1$, the remaining ones being labeled by $-1$. By straightforward computation, one obtains the expressions of the gradient and of the Hessian matrix of the objective function; note that the Hessian is a negative semidefinite matrix.
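With the dual written in the kernelized form recalled above, these expressions take the familiar shape (again in generic notation):

```latex
\nabla W(\alpha) = \mathbf{1} - Q\,\alpha,
\qquad
\nabla^{2} W(\alpha) = -\,Q,
\qquad
Q_{ij} = y_i\,y_j\,K(x_i,x_j).
```

Since $Q$ is the Gram matrix of the vectors $y_i\,\varphi(x_i)$, it is positive semidefinite, so the Hessian is indeed negative semidefinite.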

For a given learning rate, if the current value of the parameter vector is given, the updating rule (6) of a gradient-ascent-type learning algorithm consists in a step along the gradient of the objective function, scaled by the learning rate.

However, given the constraints of problem (2), the updating rule (6) has to be modified so as to ensure that the new parameter vector still belongs to the set of feasible solutions. Our method can be briefly described as follows. Assume that two components of the current parameter vector are selected for being updated at the current step. If a weighting parameter expresses the relative "influence" of the corresponding gradient components on the direction of the updating displacement, then the updated pair of components is chosen so that the equality constraint remains satisfied, the relative sizes of the two displacements being controlled by the weighting parameter. The pair of indices involved in the updating step should be selected so as to ensure the local maximization of the objective function; using first-order approximations, the pair of indices should satisfy the conditions (8) [19]. Therefore, one has to pick a pair satisfying (8) for which the first-order increase of the objective is maximized.
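As an illustration only, the following Python sketch implements one possible instantiation of such a weighted pair update; the names (alpha, Q, eta, lam), the particular mixing of the two gradient components, the clamping to nonnegative multipliers, and the pair-selection heuristic standing in for conditions (8) are assumptions of the sketch, not the exact expressions used in the paper.

```python
import numpy as np

def dual_gradient(alpha, Q):
    """Gradient of the dual objective W(alpha) = 1'alpha - 0.5 * alpha' Q alpha."""
    return 1.0 - Q @ alpha

def select_pair(alpha, y, g):
    """Heuristic pair selection (standing in for conditions (8)): a maximally
    'violating' pair with respect to the constraints alpha_k >= 0 and
    sum_k alpha_k * y_k = 0, in the spirit of SMO-type working-set rules."""
    F = y * g
    up = (alpha > 0) | (y > 0)      # indices whose multiplier may feasibly grow
    low = (alpha > 0) | (y < 0)     # indices whose multiplier may feasibly shrink
    i = int(np.flatnonzero(up)[np.argmax(F[up])])
    j = int(np.flatnonzero(low)[np.argmin(F[low])])
    return i, j, F[i] - F[j]

def weighted_pair_step(alpha, y, g, i, j, eta, lam):
    """Hypothetical weighted update of the pair (alpha_i, alpha_j): the step size
    mixes the two label-scaled gradient components with weight lam, and the pair
    is moved along a direction that keeps sum_k alpha_k * y_k = 0 unchanged."""
    Fi, Fj = y[i] * g[i], y[j] * g[j]
    t = eta * (lam * Fi - (1.0 - lam) * Fj)
    if t <= 0.0:                    # fall back to the unweighted ascent step
        t = 0.5 * eta * (Fi - Fj)
    if y[i] < 0:                    # keep alpha_i nonnegative
        t = min(t, alpha[i])
    if y[j] > 0:                    # keep alpha_j nonnegative
        t = min(t, alpha[j])
    alpha = alpha.copy()
    alpha[i] += y[i] * t
    alpha[j] -= y[j] * t            # the equality constraint is preserved exactly
    return alpha

def train(X, y, kernel, eta=0.1, lam=0.5, eps=1e-4, max_iter=10_000):
    """Toy driver for the sketched method on labels y in {-1, +1}."""
    K = kernel(X, X)
    Q = np.outer(y, y) * K
    alpha = np.zeros(len(y))
    for _ in range(max_iter):
        g = dual_gradient(alpha, Q)
        i, j, gap = select_pair(alpha, y, g)
        if gap < eps:               # approximate optimality reached
            break
        alpha = weighted_pair_step(alpha, y, g, i, j, eta, lam)
    return alpha
```

For instance, kernel could be an RBF Gram-matrix function such as `lambda A, B: np.exp(-((A[:, None, :] - B[None, :, :]) ** 2).sum(-1) / (2 * sigma ** 2))` for some assumed bandwidth sigma.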

The stopping condition for the search process is controlled by a given threshold, the search being halted when at least one of the conditions (9) is satisfied.

The values of the learning rate and of the weight parameter should be chosen so as to ensure good performance from the point of view of both accuracy and efficiency. In our tests, we used.

Our research focused on several ways to compute the weight, all of them expressed in terms of first- and second-order statistics computed in the feature space. Let the sample means and sample covariance matrices be computed on the basis of the examples labeled by $+1$ and $-1$ in the feature space induced by a particular feature extractor $\varphi$, and denote by $K$ the kernel generated by $\varphi$, that is, $K(x, x') = \langle \varphi(x), \varphi(x')\rangle$. Since the first examples were assumed to come from the first class and the remaining ones from the second class, these statistics can be expressed directly in terms of the kernel values computed on the training sample.

Concerning the choice of the weight parameter, we have to take into account that its particular expression should be justified by evidence or by mathematical arguments, and moreover its value should be computable in the feature space without increasing the computational complexity. We propose three variants for the expression of the weight parameter, estimated exclusively from data in terms of first- and second-order sample statistics, namely, the expressions (11), (12), and (13).

The expression (11) is mostly heuristic, being justified on geometric grounds, while the significance of (12) and (13) is supported by standard arguments from mathematical statistics (in terms of the eigenvalues of the sample covariance matrices and of the Fisher information, respectively). Note that the weight coefficients (11), (12), and (13) can be evaluated in the feature space using exclusively the values of the kernel on the available sample: by straightforward computation, each coefficient can be expressed in terms of the kernel values, the required norms being evaluated via the kernel as well. Usually, the kernels are normalized, that is, $K(x, x) = 1$ for any $x$; in the case of a normalized kernel, straightforward computations lead to kernel-based expressions of the three weight coefficients, given by (15), (20), and (24). Note that the Fisher-type weight coefficient is the extension of the Fisher coefficient to multidimensional distributions.
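As an illustration of how such a coefficient can be obtained from kernel values alone, the following sketch computes a Fisher-type ratio in the feature space, namely the squared distance between the feature-space class means divided by the sum of the within-class scatters; this is one plausible instantiation of a Fisher-type weight, not necessarily identical to the expressions (15), (20), or (24).

```python
import numpy as np

def fisher_type_weight(K, y):
    """Hypothetical Fisher-type weight computed from the Gram matrix K alone.

    The feature-space class means m_+ and m_- and the within-class scatters are
    never formed explicitly: ||m_+ - m_-||^2 and the mean squared deviations are
    expanded into averages of kernel values (a standard kernel-trick identity)."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == -1)
    Kpp = K[np.ix_(pos, pos)]
    Knn = K[np.ix_(neg, neg)]
    Kpn = K[np.ix_(pos, neg)]
    mp_mp = Kpp.mean()                      # ||m_+||^2
    mn_mn = Knn.mean()                      # ||m_-||^2
    mp_mn = Kpn.mean()                      # <m_+, m_->
    between = mp_mp - 2.0 * mp_mn + mn_mn   # ||m_+ - m_-||^2
    scatter_pos = np.diag(Kpp).mean() - mp_mp   # mean ||phi(x) - m_+||^2
    scatter_neg = np.diag(Knn).mean() - mn_mn   # mean ||phi(x) - m_-||^2
    return between / (scatter_pos + scatter_neg)
```

For a normalized kernel, the diagonal entries of the Gram matrix equal 1, so each scatter term reduces to one minus the mean of the corresponding within-class block.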

4. Experimental Performance Analysis of the Proposed Variants of the Weight Coefficient

In order to develop a comparative analysis of the proposed variants of the modified gradient-like algorithm for solving the QP problem (2), we performed a long series of tests on simulated data coming from Gaussian distributions and on the public WINE and Wisconsin Diagnostic Breast Cancer (WDBC) databases [20], all tests involving the feature extractor corresponding to an RBF kernel. Note that the generated data used for training were linearly separable. The comparative analysis aimed to establish conclusions about the performance of the modified gradient ascent algorithm operating in the feature space with the proposed weight coefficients, as compared to Platt's SMO method and to the standard gradient ascent algorithm in the initial space.

We also aimed to evaluate:
(i) the influence of different values of the kernel parameter on the number of iterations needed to obtain significant accuracy;
(ii) the dependency of the number of iterations required to obtain significant accuracy on the distance between the classes the samples come from and on the sample variability;
(iii) the influence of different values of the kernel parameter on the class separability index.

The variability of a sample coming from a certain class can be expressed in many ways. To quantitatively express the within-sample variability, we considered the indicator given by the mean distance between the feature vectors representing the examples of that class. If we consider the subset of the training set containing the examples coming from one of the labeled classes, the resulting measure of its variability (26) can be expressed in terms of the kernel function [18].

The class separability index (27) is evaluated in terms of the kernel values as well.
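The following sketch shows how such indices can be computed from the Gram matrix alone; the variability index follows the description above (mean pairwise feature-space distance within a class), while the separability index shown here (mean feature-space distance between examples of different classes) is an assumption of the sketch, not necessarily formula (27).

```python
import numpy as np

def feature_space_distances(K):
    """Pairwise distances in the feature space, obtained from the Gram matrix:
    ||phi(x_i) - phi(x_j)||^2 = K_ii - 2 K_ij + K_jj."""
    d = np.diag(K)
    sq = d[:, None] - 2.0 * K + d[None, :]
    return np.sqrt(np.maximum(sq, 0.0))

def variability_index(K_class):
    """Mean feature-space distance between the examples of one class,
    computed from that class's Gram matrix (the indicator described above)."""
    D = feature_space_distances(K_class)
    n = D.shape[0]
    return D[np.triu_indices(n, k=1)].mean()

def separability_index(K, y):
    """One plausible separability index (an assumption of this sketch): the
    mean feature-space distance between examples belonging to different classes."""
    D = feature_space_distances(K)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == -1)
    return D[np.ix_(pos, neg)].mean()
```

For a normalized kernel, the between-class distances approach $\sqrt{2}\approx 1.41$ as the cross-class kernel values vanish, which would be consistent with the saturation reported below for the WDBC data if the index indeed has this form.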

Test 1. We aimed to derive conclusions of the previously mentioned types in the case of simulated data coming from multidimensional normal classes. The degree of closeness between the resulting datasets can be evaluated in many ways; one way is to express it in terms of the Mahalanobis distance. We consider two model-free indices expressing the degree of closeness using only the datasets, given by sample Mahalanobis distances computed with the sample means and sample covariance matrices of the two generated samples.
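In generic notation, with $\hat{\mu}_1$, $\hat{\mu}_2$ and $\hat{\Sigma}_1$, $\hat{\Sigma}_2$ the sample means and sample covariance matrices of the two generated datasets, such sample Mahalanobis-type indices can be written, for instance, as

```latex
d_1=\sqrt{(\hat{\mu}_1-\hat{\mu}_2)^{\mathsf T}\,\hat{\Sigma}_1^{-1}\,(\hat{\mu}_1-\hat{\mu}_2)},
\qquad
d_2=\sqrt{(\hat{\mu}_1-\hat{\mu}_2)^{\mathsf T}\,\hat{\Sigma}_2^{-1}\,(\hat{\mu}_1-\hat{\mu}_2)}.
```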

For instance, the conclusions derived on the basis of samples generated from Gaussian distributions are summarized in Figures 1–3 and Table 1.

The samples proved to have comparable values of the variability index for each value of the kernel parameter, and the variability indices depend increasingly on it (Figure 1).

The results obtained in solving the QP problem (2) using Platt's SMO algorithm and the variants of the gradient algorithm with the weight parameters given by (15), (20), and (24), respectively, are summarized in Table 1 and shown in Figure 2. Each entry a/b of Table 1 reports the number of iterations together with the resulting value of the objective (4), corresponding to a particular value of the kernel parameter and to one of the proposed variants of the algorithms.

According to the experimental results, we can conclude that, in order to obtain the same accuracy, the variants of the gradient ascent algorithm using the weight parameters given by (15), (20), and (24) require far fewer iterations. Moreover, as the kernel parameter increases, the number of iterations decreases dramatically as compared to Platt's SMO algorithm.

In order to evaluate the recognition rate of the resulting SVM classifier, we used linearly and nonlinearly separable datasets coming from the same distributions. In the case of this example, the mean recognition rate was around 94%.

Similar results were obtained when "closer" or "farther" normal distributions were used to generate the datasets.

Concerning the variants (15), (20), and (24) of the weighting coefficient, the tests proved almost the same efficiency, their mean values being around 0.62, 0.54, and 0.64, respectively. It seems that variant (24) behaves better than (15) and (20) for some values of the kernel parameter.

Test 2. We aim to develop a comparative analysis of the performance of Platt's SMO and of the variants of the gradient ascent algorithm presented in Section 2 on the WINE dataset [20]. The data in the WINE dataset are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars, the analysis determining the quantities of 13 constituents found in each of the three types of wines. The dataset consists of labeled examples of sizes 59, 71, and 48 coming from three pairwise linearly separable classes. In processing the WINE dataset, we used the RBF kernel for different values of its parameter.

In order to implement the previously mentioned algorithms, we used the variability and separability indices (26) and (27) to find a suitable criterion for planning a two-class classification of these three subsamples. Unfortunately, the three subsamples proved to have very close values of the interclass separability index (27) for all values of the kernel parameter; therefore, from this point of view, all two-class classifications seemed almost equivalent. The variation of the interclass separability index (27) with respect to the kernel parameter is presented in Figure 4.

Fortunately, the three subsamples proved to have quite different variability in the sense of the variability index (26) for all values of the kernel parameter, enabling us to formulate a sequential plan consisting of two two-class discriminations.

The results obtained in solving the QP problem (2) using Platt's SMO algorithm and the variants of the gradient algorithm with the weight parameters given by (15), (20), and (24), respectively, are summarized in Tables 2 and 3 and shown in Figures 5 and 6. Each entry a/b of the tables reports the number of iterations together with the resulting value of the objective (4), corresponding to a particular value of the kernel parameter and to one of the proposed variants of the algorithms.

In the case of this example, the recognition rate was 100% for all values of the kernel parameter. Given the missing information about the measured features and the relatively small sizes of the subsamples coming from the three categories of wines, we could not test the generalization capacity of the resulting classifiers either on simulated data or by splitting the subsamples into design and test data.

Test 3. We performed a series of tests on the Wisconsin Diagnostic Breast Cancer (WDBC) dataset [20], aiming to develop a comparative analysis of the performance of the variants of the gradient ascent algorithm presented in Section 2 and of Platt's SMO. We used the RBF kernel for different values of its parameter and the weighting coefficients given by (15), (20), and (24).

The examples in the WDBC dataset are 30-dimensional vectors representing features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass and describing characteristics of the cell nuclei present in the image, a confirmed diagnosis, either benign (B) or malignant (M), being supplied for each example. It is stated that the dataset is linearly separable and relevant for the design of classifiers with good generalization capacities. The sizes of the subsamples labeled by B and M are 357 and 212, respectively.

The variation of the variability and separability indices (26) and (27) with respect to the kernel parameter is presented in Figures 7 and 8. The separability index (27) proves to be quite sensitive to the variation of the kernel parameter, pointing out an increasing dependency on it; moreover, for large values of the parameter, the values of the separability index (27) stabilize around 1.41. Using the variability index (26), the tests revealed that the variability of the M-class seems to be insensitive to variations of the kernel parameter, while the variability index of the B-class depends increasingly on it, its values stabilizing around 0.705 for large values of the parameter.

In the first series of tests, all examples were used in both the design and test phases. The results are summarized in Table 4 and Figure 9. Concerning the use of the weight coefficients (15), (20), and (24), accurate estimates of the maximum value of (4) were computed in far fewer iterations than required by Platt's SMO algorithm. A slightly better performance resulted for all values of the kernel parameter when using (24), its mean value being around 0.99, while the mean values of the other two coefficients were around 0.5 and 0.4, respectively.

From the point of view of the performance in discriminating between the B-class and the M-class, the recognition rate was 100% for all values of the kernel parameter.

Given that the size of the WDBC dataset is relatively large, we used it to develop a comparative analysis of the generalization capacity of the proposed variants. In order to derive a suitable strategy for splitting the available data into design and test datasets, we took into consideration the relative relevance of the examples with respect to the class from which they come. We computed a prototype (barycenter) for each class by averaging the examples belonging to it, and the relative relevance of each example was expressed in terms of its Euclidean distance to the corresponding prototype.

In order to establish suitable partitions of each class into design and test subsets, we used several strategies, the differences among them being given by the sizes of the subsets and by the way in which the available examples were allotted to the design and test samples. On the one hand, in order to extract the most information concerning the classes, the design sample should include some of the most representative examples; on the other hand, the less representative examples provide information concerning the class variability. Following this idea, we arrived at the conclusion that the design and test samples should contain both kinds of examples in different proportions. We aimed to derive conclusions on an experimental basis concerning the effects of the design and test samples on the recognition rate and generalization capacity. In a series of tests, we considered the following experimental plan (a sketch of which is given below).
(1) Compute the barycenter of each class (given by the mean of the examples belonging to the class), and sort the examples in each class in increasing order of their Euclidean distance to the barycenter.
(2) Select the first 10 examples from each class and include them in the test sample; include the 10 examples from each class corresponding to the largest distances to the barycenter in the design sample. The remaining examples are placed alternately in the design and test datasets.
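A minimal sketch of this plan, applied per class, could look as follows; whether the alternation of the remaining examples starts with the design or the test sample is an assumption of the sketch.

```python
import numpy as np

def split_by_relevance(X, n_near_test=10, n_far_design=10):
    """Split the examples X of one class according to the plan above: the
    n_near_test examples closest to the class barycenter go to the test sample,
    the n_far_design farthest ones go to the design sample, and the remaining
    examples are allotted alternately to the two samples."""
    barycenter = X.mean(axis=0)
    order = np.argsort(np.linalg.norm(X - barycenter, axis=1))  # increasing distance
    test_idx = list(order[:n_near_test])
    design_idx = list(order[-n_far_design:])
    for k, idx in enumerate(order[n_near_test:-n_far_design]):
        (design_idx if k % 2 == 0 else test_idx).append(int(idx))
    return X[design_idx], X[test_idx]
```

With the class sizes of the WDBC data, this reproduces the design/test sizes reported below.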

By applying this plan, the whole dataset was split into a design dataset consisting of 179 and 106 examples and a test dataset consisting of 178 and 106 examples coming from the B-class and the M-class, respectively.

The results are summarized in Table 5 and Figure 10. Concerning the usefulness of the proposed weight coefficients, the variants (15), (20), and (24) proved almost equal efficiency, the number of iterations required to obtain an accurate estimate of the maximum value of (4) being far smaller than in the case of Platt's SMO algorithm. By submitting the test sample to the resulting classifiers, we obtained correct recognition rates in the range [94.01%, 95.07%], the maximum value 95.07% being obtained with the weight coefficient (20).

Several tests were performed using the same strategy for different sizes of the design and test datasets. For instance, Table 6 and Figure 11 present the results of a test in which the design and test datasets were obtained by including, from each class, the 10 most relevant examples in the test set and the 30 least relevant examples in the design sample. This way, the learning phase was carried out on a design dataset containing 189 and 116 examples coming from the B-class and the M-class, respectively. The test phase was performed on a dataset containing 168 and 96 examples coming from the B-class and the M-class, respectively, the resulting recognition rates being in the range [95.08%, 96.59%].

As expected, higher recognition rates are obtained when the design dataset is enlarged to contain more of the less relevant examples. For instance:
(a) when the design dataset consists of the 60 and 40 least relevant examples from the B-class and the M-class, respectively, the rest of the examples being allotted alternately to the design and test samples (the test and design datasets containing 41.12% and 58.88% of the examples), the resulting recognition rate is 96.15%;
(b) for the test based on the design dataset consisting of the 75 and 60 least relevant examples from the B-class and the M-class, respectively, the rest of the examples being allotted alternately to the design and test samples (38.14% of the examples in the test set), the recognition rate increases to 98.18%;
(c) finally, a 100% recognition rate results using the design dataset consisting of the 100 and 70 least relevant examples from the B-class and the M-class, respectively (33.56% of the examples in the test set).

5. Conclusions and Suggestions for Further Work

In this paper, we propose a modified gradient ascent method for solving the dual problem of nonlinear SVM. Basically, the refinement proposed here consists in using weight parameters to tune the direction of the search to the particular training sequence. The work was based on the use of variability and separability indices expressed in terms of the exponential RBF kernel. Part of the comparative analysis aimed to evaluate how the expected number of iterations required to reach reasonable accuracy of the criterion function depends on the kernel parameter.

The proposed variants of the gradient ascent learning algorithm are to some extent heuristically justified, in the sense that there is no mathematically founded proof of their convergence properties. Therefore, several tests were performed in order to derive conclusions on an experimental basis. The tests pointed out good convergence properties of the modified variants, and their convergence rates were significantly higher than that of Platt's SMO algorithm. The experimental analysis aimed to derive conclusions on the recognition rate as well as on the generalization capacities. All linear classifiers proved almost equal recognition rates and generalization capacities, the difference being given by the number of iterations required for learning the separating hyperplanes.

The learning phase of the SVM involved linearly separable samples randomly generated from Gaussian distributions and the WINE and WDBC datasets. In order to evaluate the generalization capacities in the case of samples randomly generated from Gaussian classes, several tests were also performed on new linearly/nonlinearly separable data coming from the same classes. The WINE and WDBC datasets are both linearly separable and no information concerning the generative model is supplied; therefore, an additional strategy for splitting them into design and test samples was required. In the case of the tests performed on the WINE dataset, given its relatively small size, the performance was analyzed using all samples in both the design and test phases, while the significantly larger size of the WDBC dataset allowed us to develop different experimental plans by splitting the available data into design and test samples of different sizes.

Given the optimality of SVMs from the point of view of generalization capacities, we obtained, as expected, high recognition rates on new test data in most cases (around 97%). In the case of the WDBC dataset, higher recognition rates were obtained when the design dataset was enlarged to contain more of the less relevant examples; for instance, a 100% recognition rate resulted when 66.44% of the examples were used in the design set.

The tests pointed out that the variation of the recognition rates also depends on the inner structure of the classes from which the learning data come, as well as on the interclass separability degree. Consequently, we consider the results encouraging, and they entail future work toward extending these refinements to multiclass classification problems and to approaches in a fuzzy-based framework.