Recent Advances in Information Technology
The Generalization Error Bound for the Multiclass Analytical Center Classifier
This paper presents a multiclass classifier based on the analytical center of the feasible space (MACM). The classifier is formulated as a quadratically constrained linear optimization problem and does not need to repeatedly construct classifiers to separate a single class from all the others. Its generalization error upper bound is proved theoretically, and experiments on benchmark datasets validate the generalization performance of MACM.
Multiclass classification is an important and ongoing research subject in machine learning. Its applications are immense, including machine vision [1, 2], text and speech categorization [3, 4], natural language processing, and disease diagnosis [6, 7]. Two kinds of approaches have been proposed to solve the multiclass classification problem. The first approach extends a binary classifier to handle the multiclass case directly; this includes neural networks, decision trees, support vector machines, naive Bayes, and k-nearest neighbors. The second approach decomposes the multiclass classification problem into several binary classification tasks. Several methods are used for this decomposition: one-versus-all, all-versus-all, and error-correcting output coding.
The one-versus-all approach reduces the problem of classifying among k classes to k binary problems, where each problem discriminates a given class from the other k − 1 classes.
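As a concrete illustration, the one-versus-all scheme can be sketched with a simple perceptron as the underlying binary learner; the perceptron, the data layout, and all function names here are illustrative stand-ins, not part of the method studied in this paper.

```python
import numpy as np

def train_one_vs_all(X, y, classes, epochs=100, lr=0.1):
    """Train one binary perceptron per class (class c versus the rest)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # augment with a bias term
    models = {}
    for c in classes:
        t = np.where(y == c, 1.0, -1.0)          # +1 for class c, -1 for the rest
        w = np.zeros(Xb.shape[1])
        for _ in range(epochs):
            for xi, ti in zip(Xb, t):
                if ti * (w @ xi) <= 0:           # misclassified: perceptron update
                    w += lr * ti * xi
        models[c] = w
    return models

def predict_one_vs_all(models, X):
    """Assign each point to the class whose binary scorer is most confident."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    classes = list(models)
    S = np.stack([Xb @ models[c] for c in classes], axis=1)
    return np.array([classes[i] for i in S.argmax(axis=1)])
```

On linearly separable data each per-class perceptron converges, and the arg max over the k decision values recovers the training labels.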
For the all-versus-all method, a binary classifier is built to discriminate between each pair of classes, while discarding the rest of the classes. This requires building k(k − 1)/2 binary classifiers for a k-class problem. When testing a new example, voting is performed among the classifiers and the class with the maximum number of votes wins.
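The pairwise scheme can be sketched similarly; here a nearest-centroid rule stands in for an arbitrary binary learner (an illustrative choice, not the classifier studied in this paper):

```python
import numpy as np
from itertools import combinations

def train_all_vs_all(X, y, classes):
    """One binary model per pair of classes; the remaining classes are
    discarded. A nearest-centroid rule stands in for the binary learner."""
    models = {}
    for a, b in combinations(classes, 2):
        models[(a, b)] = (X[y == a].mean(axis=0), X[y == b].mean(axis=0))
    return models

def predict_all_vs_all(models, X, classes):
    """Each pairwise model votes; the class with the most votes wins."""
    index = {c: i for i, c in enumerate(classes)}
    votes = np.zeros((len(X), len(classes)), dtype=int)
    for (a, b), (ca, cb) in models.items():
        da = np.linalg.norm(X - ca, axis=1)   # distance to centroid of a
        db = np.linalg.norm(X - cb, axis=1)   # distance to centroid of b
        for row, w in enumerate(np.where(da <= db, a, b)):
            votes[row, index[w]] += 1
    return np.array([classes[i] for i in votes.argmax(axis=1)])
```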
Error-correcting output coding works by training binary classifiers to distinguish among the k classes. Each class is given a codeword of length l according to a binary matrix M, each row of which corresponds to a certain class.
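The decoding step can be sketched as follows; the 4-class code matrix below is a hypothetical example chosen for illustration, and a new example is assigned to the class whose codeword is nearest in Hamming distance to the classifiers' predicted bits.

```python
import numpy as np

# Hypothetical code matrix M for 4 classes: one row (codeword) per class,
# one column per binary classifier.
M = np.array([[0, 0, 1, 1, 1],
              [0, 1, 0, 1, 0],
              [1, 0, 0, 0, 1],
              [1, 1, 1, 0, 0]])

def ecoc_decode(bits, M):
    """Map predicted bits to the class whose codeword is nearest in
    Hamming distance (this is the error-correcting step)."""
    dists = (M != np.asarray(bits)).sum(axis=1)
    return int(dists.argmin())
```

Because the codewords are spread out in Hamming distance, a single wrong binary prediction can still decode to the correct class.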
The above multiclass classification algorithms must repeatedly construct binary classifiers to separate a single class from all the others in a k-class problem, which leads to daunting computation and low classification efficiency. The multiclass support vector machine (MSVM) corresponds to a single quadratic optimization and does not require repeatedly constructing binary classifiers. However, the support vector machine corresponds to the center of the largest inscribed hypersphere of the feasible space. When the feasible space, that is, the space of hypotheses consistent with the training data, is elongated or asymmetric, the support vector machine is not effective. To address this problem, a multiclass classifier based on the analytical center of the feasible space (MACM) is proposed. To validate its generalization performance theoretically, its generalization error upper bound is formulated and proved, and experiments on benchmark datasets validate the generalization performance of MACM.
2. Multiclass Analytical Center Classifier
To facilitate the discussion of the multiclass analytical center classifier, the following definitions are introduced.
Definition 1 (chunk). A vector v ∈ R^{kn} is broken into k chunks (v^1, …, v^k), where the i-th chunk v^i = (v_{(i−1)n+1}, …, v_{in}) ∈ R^n.
Definition 2 (expansion). Let Θ_i(x) ∈ R^{kn} be the vector obtained by embedding x ∈ R^n in the i-th chunk; 0_m denotes the zero vector of length m. Then Θ_i(x) can be written formally as the concatenation of three vectors, Θ_i(x) = (0_{(i−1)n}, x, 0_{(k−i)n}). And define Θ_{i,j}(x) as the vector in R^{kn} in which x is embedded in the i-th chunk and −x is embedded in the j-th chunk.
Definition 3. Given the sample (x, i) with class label i, its expansion is defined as E(x, i) = {Θ_{i,j}(x) : j ≠ i}; the expansion of the whole sample set S is defined as E(S) = ∪_{(x,i)∈S} E(x, i).
Definition 4 (piecewise linear separability). The point sets A_i ⊂ R^n, i = 1, …, k (i represents the class label and m_i the number of samples belonging to the i-th class), are piecewise linear separable if there exist w^i ∈ R^n, b_i ∈ R, i = 1, …, k, where n represents the dimension of a point, such that w^i · x + b_i > w^j · x + b_j, for all x ∈ A_i, j ≠ i. (1)
Definition 5 (piecewise linear classifier). Assume f_i(x) = w^i · x + b_i, where i = 1, …, k. Given a new point x, a piecewise linear classifier is the function f(x) = arg max_{i=1,…,k} f_i(x), (2) where arg max returns the class label corresponding to the maximum value.
To simplify the notation for the formulation of the multiclass analytical center classifier, we consider an augmented weight space as follows.
Let x̃ = (x, 1) ∈ R^{n+1} and u^i = (w^i, b_i) ∈ R^{n+1}, i = 1, …, k; (3) then, inequality (1) can be rewritten as u^i · x̃ > u^j · x̃, for all x ∈ A_i, j ≠ i. (4) Let u = (u^1, …, u^k) ∈ R^{k(n+1)}. (5) According to Definition 2, embedding x̃ into the space R^{k(n+1)}, inequality (4) has the following form: Θ_i(x̃) · u > Θ_j(x̃) · u, for all x ∈ A_i, j ≠ i. (6) Consider that Θ_i(x̃) · u − Θ_j(x̃) · u = Θ_{i,j}(x̃) · u. Thus, inequality (6) can be rewritten as follows: Θ_{i,j}(x̃) · u > 0, for all x ∈ A_i, j ≠ i. (7)
Inequality (7) represents the feasible space of u in the higher-dimensional space R^{k(n+1)}. Similar to binary classification based on the analytical center of version space, we define the slack variable s_{x,i,j} = Θ_{i,j}(x̃) · u for each sample x ∈ A_i and each j ≠ i and then have the following minimization problem, whose solution corresponds to the analytical center of the higher-dimensional feasible space: min_u −Σ_{x,i,j} log s_{x,i,j}, s.t. s_{x,i,j} = Θ_{i,j}(x̃) · u, ‖u‖² = 1. (8)
In order to further simplify the formulation of the multiclass analytical center classifier, we introduce some notation as follows: let A denote the matrix whose rows are the expanded samples Θ_{i,j}(x̃), p = 1, …, N, and let a_p represent the p-th row vector of A. Then, the optimization problem (8) can be rewritten as follows: min_u −Σ_{p=1}^{N} log(a_p · u), s.t. ‖u‖² = 1. (9)
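The analytical center itself can be illustrated numerically. The sketch below finds the analytical center of a simple bounded polytope by minimizing the logarithmic barrier over the constraint slacks; the unit box and the unconstrained solver used here are illustrative simplifications and do not reproduce the sphere-constrained problem (9).

```python
import numpy as np
from scipy.optimize import minimize

# Feasible region {w : A w <= b}: the unit box [0, 1]^2 here, standing in
# for a feasible weight space.
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.array([1.0, 0.0, 1.0, 0.0])

def barrier(w):
    """Logarithmic barrier; its minimizer is the analytical center."""
    s = b - A @ w                  # slack of each constraint
    if np.any(s <= 0):
        return np.inf              # outside the feasible region
    return -np.sum(np.log(s))

center = minimize(barrier, x0=np.array([0.3, 0.7]), method="Nelder-Mead").x
```

For the unit box the analytical center is its midpoint (0.5, 0.5); for an elongated or asymmetric region it moves away from the largest inscribed sphere's center, which is the motivation for MACM.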
After solving the optimization problem (9) to get the optimal weight u* = (u*^1, …, u*^k), we have a piecewise linear classifier computed in the following way: f(x) = arg max_{i=1,…,k} u*^i · x̃, (10) where arg max returns the class label corresponding to the maximum value.
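The resulting decision rule is simply an arg max over per-class linear functions; a minimal sketch, with illustrative names for the per-class weights and biases:

```python
import numpy as np

def predict(W, B, X):
    """Piecewise linear classifier: label = argmax_i (w_i . x + b_i).
    W holds one weight row per class, B one bias per class."""
    scores = X @ W.T + B           # one decision value per class
    return scores.argmax(axis=1)   # index of the maximal linear function
```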
If the dataset is not piecewise linear separable, a kernel function is used to map the data into a high-dimensional space in which they are piecewise linear separable.
3. Generalization Error Bound of Multiclass Analytical Center Classifier
In order to analyze the generalization error bound theoretically, we introduce the definitions of classification margin and data radius and then deduce the margin-based generalization error bound of MACM.
Definition 6 (classification margin). Given the linear classifier f(x) = arg max_{i} u^i · x̃, the classification margin of the sample (x, i) is defined as follows: γ(x, i) = min_{j≠i} Θ_{i,j}(x̃) · u / ‖u‖. (11) For the whole training set S, the minimal margin is as follows: γ = min_{(x,i)∈S} γ(x, i). (12)
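Numerically, this is the smallest normalized gap between the true class's linear score and every other class's score; a small sketch under that standard multiclass-margin reading, with illustrative per-class weight rows W and biases B:

```python
import numpy as np

def multiclass_margin(W, B, x, label):
    """Margin of one sample: smallest gap between the true class's score
    and any other class's score, normalized by the norm of the stacked
    (weights, biases) vector u."""
    u = np.concatenate([np.append(w, bi) for w, bi in zip(W, B)])
    scores = W @ x + B
    gaps = scores[label] - np.delete(scores, label)
    return gaps.min() / np.linalg.norm(u)
```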
Definition 7 (data radius). Given the dataset S = {(x_1, i_1), …, (x_m, i_m)}, the data radius is defined as follows: R = max_{p=1,…,m} ‖x̃_p‖. (13)
Theorem 8. Define the data radius of dataset S as R and the data radius of the expanded dataset E(S) as R_E; if ‖x̃‖ ≤ R for all (x, i) ∈ S, then R_E ≤ √2 R.
Proof. Consider the following: ‖Θ_{i,j}(x̃)‖² = ‖x̃‖² + ‖−x̃‖² = 2‖x̃‖². (14) Because ‖x̃‖ ≤ R, ‖Θ_{i,j}(x̃)‖ ≤ √2 R. This ends the proof of Theorem 8.
Theorem 9 (see Cristianini and Shawe-Taylor). Consider thresholding of a real-valued function f with unit weight vector on the inner product space X and fix margin γ > 0. For any probability distribution D on X × {−1, 1} with support in a ball of radius R around the origin, with probability 1 − δ over m random samples S, any hypothesis f with margin γ on S has error no more than ε(m, γ, δ) = (2/m)((64R²/γ²) log(emγ/(8R²)) log(32m/γ²) + log(4/δ)), (15) provided m > 2/ε and 64R²/γ² < m.
From Definition 3 and inequality (7), it is shown that, to correctly classify the sample (x, i), Θ_{i,j}(x̃) · u > 0 must be satisfied for every j ≠ i. Here, one introduces the samples' pairs (Θ_{i,j}(x̃), 1) and (Θ_{j,i}(x̃), 0), where 1 and 0 denote the corresponding binary class labels. So, one can construct the new sample set E(S) = {Θ_{i,j}(x̃) : (x, i) ∈ S, j ≠ i}.
Theorem 10. The binary classification of the sample set E(S) by the analytical center classifier is equivalent to the multiclass classification of the sample set S by the multiclass analytical center classifier.
Proof. Assume that the pairs (Θ_{i,j}(x̃), 1) and (Θ_{j,i}(x̃), 0) are the binary training samples; then, binary classification is to solve the following feasibility problem: u · Θ_{i,j}(x̃) > 0, u · Θ_{j,i}(x̃) < 0, for all (x, i) ∈ S, j ≠ i. (16) Suppose the bias equals 0, because Θ_{i,j}(x̃) and Θ_{j,i}(x̃) are symmetric about the origin. Since Θ_{j,i}(x̃) = −Θ_{i,j}(x̃), the feasible constraints can be rewritten as follows: u · Θ_{i,j}(x̃) > 0, for all (x, i) ∈ S, j ≠ i. (17) The feasible constraints (17) define the feasible space of the weight vector u; the binary classification by the analytical center classifier can be formulated as follows: min_u −Σ log(u · Θ_{i,j}(x̃)), s.t. ‖u‖² = 1. (18) Because the constraints of (18) coincide with inequality (7) and the objective coincides with that of (8), problem (18) is equivalent to problem (8). This ends the proof of Theorem 10.
Theorem 11. Consider the classifiers' set from Definition 5 with unit weight vector u on the inner product space R^{k(n+1)}, where ‖u‖ = 1, and fix margin γ > 0. For any probability distribution D on R^n × {1, …, k} with support in a ball of radius R around the origin, with probability 1 − δ over m random samples S, any hypothesis f with margin γ on S has error no more than ε = (k − 1)(2/m)((128R²/γ²) log(emγ/(16R²)) log(32m/γ²) + log(4/δ)), (19) provided 128R²/γ² < m.
Proof. Because the samples in E(S) are not independent, the generalization error bound cannot be attained directly from Theorem 9. Theorem 9 is independent of the sample distribution, so we can construct a new sample distribution D′. According to the new distribution and dataset E(S), generate the independent sample set S′ with m samples; that is, for every (x_p, i_p) ∈ S, define z_p as the point sampled uniformly and randomly from E(x_p, i_p) according to the distribution D′; then, we have S′ = {z_1, …, z_m}. From Theorem 8, the data radius of S′ satisfies R_E ≤ √2 R. The generalization error of hypothesis u over S′ from Theorem 9 can be calculated as follows: ε′ = (2/m)((128R²/γ²) log(emγ/(16R²)) log(32m/γ²) + log(4/δ)). (20)
Let event A denote that a sample in S is wrongly classified and let event B denote that a misclassification occurs in S′. The misclassification of any sample in S is caused by the misclassification of at least one point of its expansion, so by the union bound the probability of event A satisfies the following inequality: P(A) ≤ Σ_{j≠i} P(Θ_{i,j}(x̃) · u ≤ 0). (21) Because the cardinality of E(x, i) equals k − 1 and z_p is sampled uniformly from E(x_p, i_p), the probability of point misclassification in S′ is written as follows: P(B) = (1/(k − 1)) Σ_{j≠i} P(Θ_{i,j}(x̃) · u ≤ 0). (22) Combining (21) and (22), we have the following inequality: P(A) ≤ (k − 1) P(B) ≤ (k − 1) ε′. (23) So the generalization error of hypothesis u over S is ε = (k − 1) ε′, which is the bound (19). This ends the proof of Theorem 11.
4. Computational Experiments
In this section, we present computational results comparing the multiclass analytical center classifier (MACM) and the multiclass support vector machine (MSVM). A description of each of the datasets follows this paragraph. The kernel function for the piecewise nonlinear MACM and MSVM methods is K(x, y) = (x · y + 1)^d, where d is the degree of the desired polynomial.
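A minimal sketch of this polynomial kernel, assuming the standard inhomogeneous form K(x, y) = (x · y + 1)^d:

```python
import numpy as np

def poly_kernel(X, Y, d=2):
    """Polynomial kernel K(x, y) = (x . y + 1)^d, computed as a Gram
    matrix between the rows of X and the rows of Y."""
    return (X @ Y.T + 1.0) ** d

# Gram matrix on a toy sample
X = np.array([[1.0, 0.0], [0.0, 1.0]])
G = poly_kernel(X, X, d=2)
```

Replacing inner products with such kernel evaluations maps the data implicitly into a higher-dimensional feature space, as used in the piecewise nonlinear experiments.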
Wine Recognition Data. The wine dataset uses the chemical analysis of wine to determine the cultivar. There are 178 points with 13 features. This is a three class dataset distributed as follows: 59 points in class 1, 71 points in class 2, and 48 points in class 3.
Glass Identification Database. The Glass dataset is used to identify the origin of a sample of glass through chemical analysis. This dataset is comprised of six classes of 214 points with 9 features. The distribution of points by class is as follows: 70 float processed building windows, 17 float processed vehicle windows, 76 nonfloat processed building windows, 13 containers, 9 tableware, and 29 headlamps.
Table 1 contains the results for MACM and MSVM on the wine and glass datasets. As anticipated, MACM produces better testing generalization than MSVM.
In this paper, a multiclass classifier based on the analytical center of the feasible space, which corresponds to a simple quadratically constrained linear optimization, is proposed. To validate its generalization performance theoretically, its generalization error upper bound is formulated and proved. Experiments on the wine recognition and glass identification datasets show that the multiclass analytical center classifier outperforms the multiclass support vector machine in generalization error.
This work was supported in part by the National Natural Science Foundation of China under Grant nos. 61370096 and 61173012 and the Key Project of Natural Science Foundation of Hunan Province under Grant no. 12JJA005.
J. S. Prakash, K. A. Vignesh, C. Ashok, and R. Adithyan, "Multi-class support vector machines classifier for machine vision application," in Proceedings of the International Conference on Machine Vision and Image Processing, pp. 197–199, 2013.
S. Ai-Xiang, L. Ming-Hui, H. Shun-Liang, and Z. Jun, "A new hypersphere multi-class support vector machine applied in text classification," in Proceedings of the IEEE 3rd International Conference on Communication Software and Networks (ICCSN '11), pp. 478–481, Xi'an, China, May 2011.
M. Aly, Survey on Multiclass Classification Methods, 2005.
A. O. Hatch and A. Stolcke, "Generalized linear kernels for one-versus-all classification: application to speaker recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '06), Toulouse, France, May 2006.
M. A. Bagheri, G. A. Montazer, and S. Escalera, "Error correcting output codes for multiclass classification: application to two image vision problems," in Proceedings of the 16th CSI International Symposium on Artificial Intelligence and Signal Processing (AISP '12), pp. 508–513, Shiraz, Iran, May 2012.
E. J. Bredensteiner and K. P. Bennett, "Multicategory classification by support vector machines," Computational Optimization and Applications, vol. 12, no. 1–3, pp. 53–79, 1999.
N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, New York, NY, USA, 2000.