Abstract

Support vector machine (SVM) is a popular machine learning method owing to its high generalization ability. Finding an adaptive kernel function is a key problem for SVM, from theory to practical applications. This paper proposes a support vector classifier based on a vague sigmoid kernel and its similarity measure. The proposed method exploits the characteristics of vague sets and replaces the traditional inner product with a vague similarity measure between training samples. The experimental results show that the proposed method can reduce the CPU time while maintaining the classification accuracy.

1. Introduction

Support vector machine (SVM) constructs a hyperplane or set of hyperplanes in a high-dimensional feature space, which can be used for classification, regression, or other tasks. It was first introduced by Vapnik [1] for classification and has been successfully applied in many fields. SVM is based on the structural risk minimization principle, which incorporates capacity control to prevent overfitting and is thus a partial solution to the bias-variance trade-off dilemma. The basic idea of SVM classification is to find a separating hyperplane that corresponds to the largest possible margin between the points of different classes.

Finding an adaptive kernel function is a key problem for SVM, from theory to practical applications. Several kernel functions are in common use, for example, the linear kernel, the polynomial kernel, and the RBF kernel. These kernel functions are positive semidefinite (PSD). However, some non-PSD matrices are also used in practice. An important one is the sigmoid kernel, which is related to neural networks. It was first pointed out by Vapnik [1] that the sigmoid kernel matrix might not be PSD for certain values of the parameters $a$ and $r$. However, the sigmoid kernel matrix is conditionally positive definite for certain parameters and is thus a valid kernel.

Meanwhile, real-life datasets are usually noisy, and a classifier trained on noisy data cannot classify some samples correctly. Fuzzy theory has therefore been introduced into support vector machines by many researchers to address this problem, in two main ways. The first is the fuzzy support vector machine (FSVM) [2–4]. FSVM takes the noise in the training set into account and associates a fuzzy membership with every sample, which accounts for the uncertainty in the class to which the sample belongs. It uses the membership function to express the grade to which a sample belongs to the positive or negative class. The second is to combine fuzzy theory with the kernel functions of SVM and propose a fuzzy kernel-based SVM. A fuzzy kernel is apt to construct a robust classifier and to handle classification and regression of uncertain or fuzzy data. Soria-Olivas et al. [5] propose a fuzzy-based activation function for artificial neural networks. Camps-Valls et al. [6] extend the fuzzy-based activation function and propose a support vector classifier based on a fuzzy sigmoid kernel. The fuzzy sigmoid function allows a lower computational cost and a higher rate of positive eigenvalues of the kernel matrix than the standard sigmoid kernel [6]. Yang et al. [7] develop a kernel fuzzy c-means clustering-based fuzzy SVM algorithm to deal with classification problems with outliers or noise.

Among fuzzy theories, vague set theory [8] is one of the methods used to deal with uncertain information and has gradually become popular for handling decision-making problems. Since vague sets can provide more information than fuzzy sets, they are considered superior for the mathematical analysis of uncertain information. This paper combines vague sets with the sigmoid kernel and proposes a novel support vector classifier based on a vague sigmoid kernel.

The rest of this paper is organized as follows: Section 2 reviews the related research and briefly describes the vague set theory and support vector machine. We present a novel support vector classifier based on vague sigmoid kernel and its similarity measure in Section 3. Section 4 presents the experimental results obtained on benchmark data sets and analyzes the performance of the proposed algorithm. Section 5 concludes the paper with some final remarks.

2. Vague Set and Support Vector Machine

2.1. The Conception of Vague Set

Fuzzy set theory was first proposed by Zadeh [9]. It is an important mathematical approach to uncertain and fuzzy data analysis and has successfully been applied in the areas of fuzzy control, fuzzy decision making, and so on.

Introduced by Gau and Buehrer [8], the vague set is a generalization of the concept of a fuzzy set. Note that the vague set is essentially the same as the intuitionistic fuzzy set according to some research work [10]. The major advantage of vague sets over fuzzy sets is that the former make descriptions of the objective world more realistic, practical, and accurate. Presently, many scholars have taken an interest in the theory and made further studies. Vague sets have been widely applied in medical diagnosis, decision making, pattern recognition, uncertain knowledge acquisition, and so forth [11–14].

Definition 1 (vague sets [8]). Let $U$ be the universe of discourse, $U = \{u_1, u_2, \ldots, u_n\}$, with a generic element of $U$ denoted by $u$. A vague set $A$ in $U$ is characterized by a truth-membership function $t_A$ and a false-membership function $f_A$, where $t_A(u)$ is a lower bound on the grade of membership of $u$ derived from the evidence for $u$, $f_A(u)$ is a lower bound on the negation of $u$ derived from the evidence against $u$, and $t_A(u) + f_A(u) \le 1$. It is clear that the grade of membership of $u$ in the vague set $A$ is restricted to a subinterval $[t_A(u), 1 - f_A(u)]$ of $[0, 1]$. The subinterval $[t_A(u), 1 - f_A(u)]$ is called the vague value of $u$ in the vague set $A$.

The vague value $[t_A(u), 1 - f_A(u)]$ indicates that the exact grade of membership $\mu_A(u)$ of $u$ may be unknown but is bounded by $t_A(u) \le \mu_A(u) \le 1 - f_A(u)$.

When the universe of discourse $U$ is continuous, a vague set $A$ can be written as
$$A = \int_U \left[ t_A(u), 1 - f_A(u) \right] / u \, \mathrm{d}u, \quad u \in U.$$

When the universe of discourse $U$ is discrete, a vague set $A$ can be written as
$$A = \sum_{i=1}^{n} \left[ t_A(u_i), 1 - f_A(u_i) \right] / u_i, \quad u_i \in U.$$

Let $\pi_A(u) = 1 - t_A(u) - f_A(u)$ be the uncertainty degree of $u$ in the vague set $A$, $0 \le \pi_A(u) \le 1$. $\pi_A(u)$ characterizes the precision of our knowledge about $u$. If $\pi_A(u)$ is small, our knowledge about $u$ is relatively precise; if it is large, we know correspondingly little. If $t_A(u)$ is equal to $1 - f_A(u)$, our knowledge about $u$ is exact, and the theory reverts back to that of fuzzy sets. If both $t_A(u)$ and $1 - f_A(u)$ are equal to 1 or 0, our knowledge about $u$ is very exact, and the theory reverts back to that of ordinary sets.

For example, let $A$ be a vague set with truth-membership function $t_A$ and false-membership function $f_A$, respectively. If the vague value of $u$ is $[0.5, 0.8]$, then according to Definition 1 we can see that $t_A(u) = 0.5$, $f_A(u) = 0.2$, and $\pi_A(u) = 0.3$. It can be interpreted as follows: “assume that the total number of votes is 10; for a resolution, 5 votes are in favor, 2 are against, and 3 are abstentions.” Obviously, a fuzzy set cannot exactly denote and process this type of obscure information.
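Mapping the vote interpretation to a vague value is a one-liner in Python (a toy illustration; the variable names are ours, not the paper's):

```python
# Vote-model illustration of a vague value [t, 1 - f].
# 10 votes: 5 in favor, 2 against, 3 abstentions.
votes_for, votes_against, total = 5, 2, 10

t = votes_for / total          # truth membership t_A(u) = 0.5
f = votes_against / total      # false membership f_A(u) = 0.2
vague_value = (t, 1 - f)       # the interval [0.5, 0.8]
uncertainty = 1 - t - f        # abstention share pi_A(u) = 0.3

print(vague_value, uncertainty)  # (0.5, 0.8) 0.3
```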

Many similarity measures have been proposed in the literature for measuring the degree of similarity between vague sets. Chen [15, 16] proposed the concept of similarity measures between vague sets and defined the measure as $M(x, y) = 1 - |S(x) - S(y)|/2$, where $x = [t_x, 1 - f_x]$ and $y = [t_y, 1 - f_y]$ are two vague values, $S(x) = t_x - f_x$, and $S(y) = t_y - f_y$. It is obvious that the larger the value of $M(x, y)$, the greater the similarity between the vague values $x$ and $y$. Hung and Yang [11] presented three new similarity measures between intuitionistic fuzzy sets based on the Hausdorff distance. Li et al. [17] analyzed and summarized several similarity measures between vague sets. Dou et al. [18] developed a new similarity measure of vague sets and defined a new relative degree of similarity to solve the fuzzy shortest path problem.
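A minimal sketch of Chen's measure as stated above (the function name is ours):

```python
def chen_similarity(x, y):
    """Chen's similarity between two vague values x = [t_x, 1 - f_x]
    and y = [t_y, 1 - f_y], each given as a (t, 1 - f) pair."""
    s_x = x[0] - (1 - x[1])   # S(x) = t_x - f_x
    s_y = y[0] - (1 - y[1])   # S(y) = t_y - f_y
    return 1 - abs(s_x - s_y) / 2

print(chen_similarity((0.5, 0.8), (0.5, 0.8)))  # identical values -> 1.0
print(chen_similarity((1.0, 1.0), (0.0, 0.0)))  # opposite values  -> 0.0
```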

2.2. Support Vector Machine and Its Kernel

In this section, we briefly review the learning algorithm of the support vector machine (SVM) initially proposed in [1]. Given a binary classification problem represented by a dataset $\{(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)\}$, where $x_i \in \mathbb{R}^n$ represents an $n$-dimensional data sample and $y_i \in \{-1, +1\}$ represents the class of that data sample, for $i = 1, \ldots, l$, the goal of the SVM learning algorithm is to find an optimal hyperplane that separates these data samples into two classes. In order to find a better separation of classes, the data are first transformed into a higher-dimensional feature space by a mapping function $\phi$. A possible separating hyperplane residing in that higher-dimensional feature space can then be represented by
$$w^T \phi(x) + b = 0.$$

The support vector technique requires the solution of the following optimization problem:
$$\min_{w, b, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{l} \xi_i$$
subject to the constraints
$$y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \; i = 1, \ldots, l,$$
where the training vectors $x_i$ are mapped into a higher-dimensional space by the function $\phi$. The parameter $C$ is a user-specified positive parameter that controls the trade-off between maximizing the margin and minimizing the training error term. The slack variables $\xi_i$ hold for misclassified samples, and therefore $\sum_{i} \xi_i$ can be thought of as a measure of the amount of misclassification. This quadratic optimization problem can be solved by constructing a Lagrangian representation and transforming it into the following dual problem:
$$\max_{\alpha} \; \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j \phi(x_i)^T \phi(x_j) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad (7)$$
subject to the constraints
$$\sum_{i=1}^{l} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \; i = 1, \ldots, l,$$
where $\alpha_i$ is the Lagrangian parameter. Note that the kernel trick is used in the last equality in (7). The Karush-Kuhn-Tucker conditions of SVM are defined by
$$\alpha_i \left[ y_i \left( w^T \phi(x_i) + b \right) - 1 + \xi_i \right] = 0, \quad i = 1, \ldots, l. \quad (9)$$

The sample $x_i$ with the corresponding nonzero $\alpha_i$ is called a support vector. The optimal value of the weight vector is obtained by $w^* = \sum_{i=1}^{N_s} \alpha_i y_i \phi(x_i)$, where $N_s$ is the number of support vectors. The optimal value of the bias $b^*$ can be computed from the Karush-Kuhn-Tucker conditions (9); namely, $b^* = y_s - \sum_{i=1}^{N_s} \alpha_i y_i K(x_i, x_s)$ for a random support vector sample $x_s$. Once the optimal pair $(w^*, b^*)$ is determined, the SVM decision function is then given by
$$f(x) = \operatorname{sgn} \left( \sum_{i=1}^{N_s} \alpha_i y_i K(x_i, x) + b^* \right), \quad (10)$$
where $K(x_i, x_j)$ is called the kernel function:
$$K(x_i, x_j) = \phi(x_i)^T \phi(x_j). \quad (11)$$

Several typical kernel functions are the linear kernel $K(x_i, x_j) = x_i^T x_j$, the polynomial kernel $K(x_i, x_j) = (\gamma x_i^T x_j + r)^d$, and the RBF kernel $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$.
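For reference, the three kernels written out in NumPy (a minimal sketch; `gamma`, `r`, and `d` denote the user-chosen kernel parameters):

```python
import numpy as np

def linear_kernel(X, Y):
    # K_ij = x_i . y_j for all pairs of rows
    return X @ Y.T

def polynomial_kernel(X, Y, gamma=1.0, r=0.0, d=3):
    return (gamma * (X @ Y.T) + r) ** d

def rbf_kernel(X, Y, gamma=1.0):
    # ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y, computed for all pairs
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq)
```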

The kernel functions above must satisfy Mercer's condition; namely, the kernel matrix must be symmetric and positive semidefinite (PSD). Nevertheless, some non-PSD matrices are used in SVM in practice [19]. The sigmoid kernel $K(x_i, x_j) = \tanh(a\, x_i^T x_j + r)$ is an available non-PSD kernel function. It is also known as the hyperbolic tangent kernel and as the multilayer perceptron (MLP) kernel, which comes from the neural networks field.

It was first pointed out by Vapnik [1] that its kernel matrix might be non-PSD for certain values of the parameters $a$ and $r$. When the kernel function is non-PSD, (11) cannot be satisfied. H. T. Lin and C. J. Lin [19] also studied non-PSD kernel functions and their applications to SVM and showed that the sigmoid kernel matrix is conditionally positive definite (CPD). When $a > 0$ and $r < 0$, the sigmoid kernel is suitable as a valid kernel. The sigmoid kernel has been used in several practical cases, such as support vector machine classification [6, 19], decision rule extraction [20], and chaotic time series prediction [21].
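The non-PSD behaviour is easy to observe numerically. The following sketch (synthetic data, arbitrary parameter values) inspects the smallest eigenvalue of a sigmoid Gram matrix; a negative value shows the matrix is not PSD, and note that even in the CPD regime $a > 0$, $r < 0$ the matrix need not be PSD:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))   # 50 random 5-dimensional samples

def sigmoid_gram(X, a=1.0, r=0.0):
    # K_ij = tanh(a * x_i . x_j + r)
    return np.tanh(a * (X @ X.T) + r)

for a, r in [(1.0, 1.0), (0.5, -1.0)]:
    lam_min = np.linalg.eigvalsh(sigmoid_gram(X, a, r)).min()
    print(f"a={a}, r={r}: smallest eigenvalue = {lam_min:.4f}")
```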

3. The Proposed Support Vector Classifier Based on Vague Similarity Measure

This section presents the proposed vague sigmoid kernel-based support vector classifier. It first gives a brief introduction to fuzzy kernels and then focuses on the proposed algorithm.

3.1. Fuzzy Kernel

Several researchers have studied fuzzy kernels. Kwan [22] proposes a simple sigmoid-like nonlinear activation function more suitable for digital hardware implementation:
$$h(x) = \begin{cases} 1, & x \ge L, \\ \dfrac{x}{L} \left( 2 - \dfrac{|x|}{L} \right), & -L < x < L, \\ -1, & x \le -L, \end{cases} \quad (12)$$
where $L$ is the width of the transition region.
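A vectorized sketch of this activation (assuming the piecewise form reconstructed above; the function name is ours):

```python
import numpy as np

def kwan_activation(x, L=2.0):
    """Piecewise sigmoid-like activation (12); L is the width of the
    transition region. Outside [-L, L] the output saturates at +/-1."""
    x = np.asarray(x, dtype=float)
    core = (x / L) * (2 - np.abs(x) / L)          # quadratic transition
    return np.where(x >= L, 1.0, np.where(x <= -L, -1.0, core))

print(kwan_activation([-3.0, -1.0, 0.0, 1.0, 3.0]))
# [-1.   -0.75  0.    0.75  1.  ]
```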

Inspired by the work of Kwan, Soria-Olivas et al. [5] argue that the activation function (12) can be obtained in a more natural way by defining the classical activation function by means of fuzzy logic methodology, and they propose a fuzzy-based activation function for artificial neural networks that uses triangular membership functions due to their simplicity. The fuzzy-based sigmoid function models the hyperbolic tangent function by means of linguistic variables. Camps-Valls et al. [6] extend the fuzzy-based activation function and propose a support vector classifier based on a fuzzy sigmoid kernel. The fuzzy sigmoid kernel allows a lower computational cost and a higher rate of positive eigenvalues of the kernel matrix, which alleviates current limitations of the sigmoid kernel.

Although fuzzy set theory can characterize fuzziness well, it has an obvious shortcoming: it uses a single-valued membership to represent the degree of membership and thus lacks consideration of some nondeterministic factors among samples. For this reason, we propose a sigmoid kernel support vector classifier based on vague set theory.

3.2. Vague Value Computation

The proposed algorithm considers two-class classification for simplicity. We first determine the class center of each class, namely, $c_A$ and $c_B$ in Figures 1–3, and then compute the vague values of the samples.

For a sample $x_i$ in the training set, if $x_i$ belongs to class $A$ but does not belong to the intersection area between class $A$ and class $B$, we define the vague value of $x_i$ by (13), where $r_A$ is the radius of class $A$. This case is shown in Figure 1. Similarly, letting $r_B$ be the radius of class $B$, we get $d(x_i, c_B)/r_B > 1$.

If $x_i$ belongs to class $B$ but does not belong to the intersection area between class $A$ and class $B$, we define the vague value of $x_i$ by (14), where $r_B$ is the radius of class $B$. This case is shown in Figure 2. We get $d(x_i, c_A)/r_A > 1$.

If the sample $x_i$ belongs to the intersection area, we label its distances to the two class centers $c_A$ and $c_B$ as $d_A$ and $d_B$ and define the vague value of $x_i$ by (15).

This case is shown in Figure 3.

Through a detailed analysis of the samples in class $A$ and class $B$, we find that $t(x_i) = 1 - f(x_i)$ for the samples shown in Figures 1 and 2, so the vague set reverts back to a fuzzy set in these two cases. For the samples in the intersection area shown in Figure 3, we reduce the classification effect of these samples.
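For illustration, the geometric quantities that (13)–(15) depend on can be sketched as follows. This is a hypothetical sketch: here each center is taken as the class mean and each radius as the maximum center-to-sample distance, whereas the paper obtains the centers with FCM (Section 3.4); the membership formulas themselves are those of (13)–(15) and are not reproduced here.

```python
import numpy as np

def class_geometry(X_a, X_b):
    """Class centers and radii used by (13)-(15). Assumption for
    illustration: center = class mean, radius = maximum distance
    from the center to the samples of that class."""
    c_a, c_b = X_a.mean(axis=0), X_b.mean(axis=0)
    r_a = np.linalg.norm(X_a - c_a, axis=1).max()
    r_b = np.linalg.norm(X_b - c_b, axis=1).max()
    return c_a, r_a, c_b, r_b

def in_intersection(x, c_a, r_a, c_b, r_b):
    """A sample lies in the intersection area (Figure 3) when it
    falls inside both class circles."""
    return (np.linalg.norm(x - c_a) <= r_a) and (np.linalg.norm(x - c_b) <= r_b)
```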

3.3. Similarity Measure of Vague Value and Its Kernel

For samples $x_i$ and $x_j$, their inner product is typically obtained by computing the Euclidean distance between the two samples. In this paper, after introducing vague memberships, we replace the Euclidean distance with a similarity measure between vague sets. A similarity measure is used for estimating the degree of similarity between two sets. The main idea is as follows. We first compute the vague values of the samples and then represent these vague values as points in a spatial coordinate system. Finally, we compute similarity measures between the points.

Definition 2 (see [23]). Let $[t_i, 1 - f_i]$ be the vague value of sample $x_i$ computed by the above method, with $\pi_i = 1 - t_i - f_i$. The corresponding point in the spatial coordinate system is represented as $P_i$. We also denote it by a 3-tuple $(\alpha_i, \beta_i, \gamma_i)$ for simplicity, where $\alpha_i = t_i + \pi_i t_i$, $\beta_i = \pi_i \pi_i$, and $\gamma_i = f_i + \pi_i f_i$.

As shown in Definition 2, $P_i$ in the spatial coordinate system includes three parts, $\alpha_i$, $\beta_i$, and $\gamma_i$, respectively. The meanings of $t_i$, $\pi_i$, and $f_i$ are given in Definitions 1 and 2. Analyzing from the vote model, we consider that some abstentions are likely prone to be in favor, others are likely prone to be against, and the rest are likely to remain abstentions. So, we further divide the abstention part $\pi_i$ into three parts: $\pi_i t_i$, $\pi_i f_i$, and $\pi_i \pi_i$, which represent the cases of being in favor, against, and abstention among all abstentions, respectively. We can thus use a point in three-dimensional space to depict the membership degree of a training sample.

Obviously, $\pi_i t_i + \pi_i f_i + \pi_i \pi_i = \pi_i$. Using Definition 2, we can get a point $P_i$ in three-dimensional space for each sample. For the three parts in $P_i$, $\alpha_i + \beta_i + \gamma_i = 1$.

Definition 3. Let $P_i$ and $P_j$ be two points defined as in Definition 2; their similarity measure $M(P_i, P_j)$ is then defined by (16).

Based on the vague value and similarity measure above, we give a computation method for the vague sigmoid kernel function. Expression (12) can be readily rewritten as a function of $P_i$ and $P_j$ by replacing the inner product with the similarity measure, yielding the vague sigmoid kernel (17), where $M(P_i, P_j)$ is the similarity measure of Definition 3.
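Putting the pieces together, a hedged sketch of the vague sigmoid Gram matrix. The similarity function below is only a placeholder for the measure (16) (here, one minus half the L1 distance between two 3-tuples, an assumption of ours); any vague similarity measure mapping pairs of 3-tuples to $[0, 1]$ could be plugged in:

```python
import numpy as np

def kwan(x, L=2.0):
    # Scalar piecewise sigmoid-like function (12).
    core = (x / L) * (2 - abs(x) / L)
    return 1.0 if x >= L else (-1.0 if x <= -L else core)

def tuple_similarity(p, q):
    # Placeholder for the measure (16): one minus half the L1
    # distance between the two 3-tuples (an assumption).
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 1 - 0.5 * np.abs(p - q).sum()

def vague_sigmoid_gram(P, Q, a=1.0, r=0.0, L=2.0):
    """Gram matrix of the vague sigmoid kernel: the inner product of
    the sigmoid kernel is replaced by the vague similarity measure."""
    K = np.empty((len(P), len(Q)))
    for i, p in enumerate(P):
        for j, q in enumerate(Q):
            K[i, j] = kwan(a * tuple_similarity(p, q) + r, L)
    return K
```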

3.4. The Proposed Vague Sigmoid-Based Support Vector Classifier

In order to compute the vague values of the training samples, we first determine the class center of each class. In this paper, we select the fuzzy c-means (FCM) method [24, 25] for this task, since FCM has become one of the most popular fuzzy clustering techniques.

The FCM algorithm starts with an initial guess for the cluster centers, which is intended to mark the mean location of each cluster [24, 25]. The initial guess for these cluster centers is most likely incorrect. Additionally, FCM assigns every data point a membership grade for each cluster. By iteratively updating the cluster centers and the membership grades for each data point, FCM moves the cluster centers to the “right” location within the data set. This iteration is based on minimizing an objective function that represents the distance from any given data point to a cluster center weighted by that data point's membership grade; namely,
$$J_m = \sum_{i=1}^{c} \sum_{j=1}^{n} u_{ij}^m d_{ij}^2, \quad (18)$$
where $c$ is the number of clusters, selected as a specified value in this paper, $n$ is the number of data points, $u_{ij}$ denotes the degree to which the sample $x_j$ belongs to the $i$th cluster, $m$ is the fuzzy parameter controlling the speed and achievement of clustering, $d_{ij} = \|x_j - v_i\|$ denotes the distance between point $x_j$ and the cluster center $v_i$, and $V = \{v_1, v_2, \ldots, v_c\}$ is the set of cluster centers or prototypes ($v_i \in \mathbb{R}^n$). When the objects change clusters, the membership values are recalculated according to the following formula:
$$u_{ij} = \frac{1}{\sum_{k=1}^{c} \left( d_{ij} / d_{kj} \right)^{2/(m-1)}}. \quad (20)$$

Each cluster center is then calculated by
$$v_i = \frac{\sum_{j=1}^{n} u_{ij}^m x_j}{\sum_{j=1}^{n} u_{ij}^m}. \quad (21)$$
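Since (18), (20), and (21) are the standard fuzzy c-means updates, the iteration can be sketched compactly in NumPy (a minimal sketch, not the authors' implementation; variable names are ours):

```python
import numpy as np

def fcm(X, c=2, m=2.0, max_iter=100, eps=1e-5, seed=0):
    """Fuzzy c-means: alternate the center update (21) and the
    membership update (20) until the objective (18) stops improving."""
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.random((c, n))
    U /= U.sum(axis=0)                     # memberships of each point sum to 1
    J_old = np.inf
    for _ in range(max_iter):
        V = (U**m @ X) / (U**m).sum(axis=1, keepdims=True)           # (21)
        D = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)    # d_ij
        D = np.maximum(D, 1e-12)           # avoid division by zero
        p = 2.0 / (m - 1.0)
        U = 1.0 / (D**p * (1.0 / D**p).sum(axis=0))                  # (20)
        J = ((U**m) * D**2).sum()                                    # (18)
        if abs(J_old - J) < eps:
            break
        J_old = J
    return U, V
```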

After getting the cluster centers, the algorithm can compute the vague values of the training samples and measure the similarity of the vague values.

According to the analysis above, we propose a novel vague sigmoid-based support vector classifier. The algorithm is described as follows.

Step 1. Preprocess the data set and split it into a training set and a testing set.

Step 2. Use the FCM algorithm to compute the membership of each training sample to each class and to obtain the cluster centers.
Step 2.1. Select the number of clusters $c$, the maximal iteration count $T$, the fuzziness parameter $m$, and the convergence error $\varepsilon$.
Step 2.2. Initialize the membership matrix $U^{(0)}$ subject to the constraint condition $\sum_{i=1}^{c} u_{ij} = 1$ for each sample $x_j$.
Step 2.3. For $t = 1, 2, \ldots, T$: calculate the membership matrix $U^{(t)}$ according to (20); calculate the cluster centers $V^{(t)}$ according to (21); calculate the objective function $J^{(t)}$ according to (18); when $|J^{(t)} - J^{(t-1)}| < \varepsilon$ or $t = T$, stop the iteration and return the membership matrix $U$ and the cluster centers $V$.

Step 3. Compute vague values of training samples using (13)–(15).

Step 4. Use the SVM based on the vague sigmoid kernel to train on the samples with vague values.

The key steps of the proposed algorithm are computing the vague values and computing the vague sigmoid kernel $K(x_i, x_j)$.
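To make the flow of Steps 1–4 concrete, here is a hedged skeleton in Python using scikit-learn's precomputed-kernel interface (LIBSVM offers the same through its precomputed kernel type). The helpers `fcm` and `vague_sigmoid_gram` refer to the sketches above, and `vague_values` is a hypothetical stand-in for (13)–(15), which we do not reproduce:

```python
from sklearn.svm import SVC

# Assumed in scope (hypothetical helpers; see the sketches above):
#   fcm(X, c)              -> membership matrix U, cluster centers V
#   vague_values(X, y, V)  -> one 3-tuple (alpha, beta, gamma) per sample,
#                             standing in for (13)-(15)
#   vague_sigmoid_gram(P, Q) -> vague sigmoid kernel matrix between two
#                             lists of 3-tuples

def train_vague_svc(X_train, y_train, C=1.0):
    U, V = fcm(X_train, c=2)                 # Step 2: class centers via FCM
    P = vague_values(X_train, y_train, V)    # Step 3: vague values
    K = vague_sigmoid_gram(P, P)             # Step 4: precomputed Gram matrix
    clf = SVC(C=C, kernel="precomputed").fit(K, y_train)
    return clf, P
```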

4. Experimental Analysis

Six data sets from the University of California at Irvine (UCI) machine learning repository [26] are used in our experiments: Ionosphere, Sonar, Pima-diabetes, Wdbc, Iris, and Vehicle. Iris and Vehicle are multiclass problems; the other data sets are binary classification problems. The characteristics of these data sets are shown in Table 1. We tested the proposed vague sigmoid kernel method and compared it to the sigmoid-based SVM and the fuzzy sigmoid-based SVM. The results of these methods depend on the values of the kernel parameters $a$ and $r$ and the penalization parameter $C$.

In order to test the proposed methods, SVM models were trained using LIBSVM [27]. The parameter $a$ was fixed to $1/n$, where $n$ is the input dimension of the data set, and the other parameters of all methods were optimized using a grid-based 5-fold cross-validation method. For each dataset, the training and testing sets were the same for all methods, and 5-fold cross-validation was used to estimate the accuracy of the classifiers. We compare the classification accuracy and CPU time (s) of the sigmoid, fuzzy sigmoid, and vague sigmoid kernels. The experimental results are shown in Table 2.
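As a sketch of this tuning protocol (illustrative grids and synthetic stand-in data; in scikit-learn's SVC the sigmoid kernel's $a$ corresponds to `gamma` and $r$ to `coef0`):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data: 100 samples, 8 features, labels +/-1.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 8))
y = np.r_[np.ones(50), -np.ones(50)]

svc = SVC(kernel="sigmoid", gamma=1.0 / X.shape[1])          # a fixed to 1/n
grid = {"C": [0.1, 1, 10, 100], "coef0": [-1.0, 0.0, 1.0]}   # C and r tuned
search = GridSearchCV(svc, grid, cv=5).fit(X, y)             # 5-fold CV
print(search.best_params_, search.best_score_)
```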

From Table 2, we can see that the sigmoid-based method achieves better accuracy than the fuzzy sigmoid and vague sigmoid methods on the Ionosphere, Sonar, Iris, and Vehicle data sets, while the fuzzy sigmoid and vague sigmoid methods are more accurate on the Pima-diabetes and Wdbc data sets. However, the sigmoid-based method also requires more CPU time than the other methods. Notice that the vague sigmoid method achieves the same or slightly better accuracy than the fuzzy sigmoid method.

If we weigh accuracy against CPU time, we may prefer the solution found with the vague sigmoid method. The average CPU time over all six data sets is 22.32 s for the sigmoid method versus 19.77 s for the vague sigmoid method, while the classification accuracy does not decrease remarkably.

5. Conclusions

Support vector machine is a popular machine learning method that has been applied to many application fields successfully. The sigmoid kernel has been quite popular for support vector machines due to its origin in neural networks. In this paper, we propose a vague sigmoid kernel-based support vector classifier. The proposed method is combined with the vague set methodology, which simplifies the computation of the SVM. In the vague sigmoid kernel, we replace the inner product computation based on the Euclidean distance between two samples with a vague similarity measure. The experiments were conducted on 6 data sets from the UCI machine learning repository, and the classification results were evaluated and compared in terms of accuracy and CPU time. The results indicate that the proposed method can reduce the CPU time while maintaining the classification accuracy.

Acknowledgments

This work is supported by China Postdoctoral Science Foundation (no. 20110491530), the Science Research Plan of Liaoning Education Bureau (no. L2011186), and the Dalian Science and Technology Planning Project of China (no. 2010J21DW019).