Computational Intelligence and Neuroscience

Volume 2015, Article ID 423581, 8 pages

http://dx.doi.org/10.1155/2015/423581

## Combining MLC and SVM Classifiers for Learning Based Decision Making: Analysis and Evaluations

^{1}School of Computer Software, Tianjin University, Tianjin 300072, China

^{2}Centre for Excellence in Signal and Image Processing, University of Strathclyde, Glasgow G1 1XW, UK

^{3}School of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China

Received 24 March 2015; Revised 8 May 2015; Accepted 11 May 2015

Academic Editor: Pietro Aricò

Copyright © 2015 Yi Zhang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The maximum likelihood classifier (MLC) and support vector machines (SVM) are two commonly used approaches in machine learning. MLC is based on Bayesian theory and estimates the parameters of a probabilistic model, whilst SVM is an optimization based nonparametric method. It has recently been found that SVM is, in some cases, equivalent to MLC in probabilistically modeling the learning process. In this paper, MLC and SVM are combined in learning and classification, which helps to yield probabilistic output for SVM and to facilitate soft decision making. In total, four groups of data are used for evaluation, covering sonar, vehicle, breast cancer, and DNA sequences. The data samples are characterized as Gaussian or non-Gaussian distributed and as balanced or unbalanced, and these characteristics are then used in comparing the performance of SVM and the combined SVM-MLC classifier. Results are reported to indicate how the combined classifier may work under various conditions.

#### 1. Introduction

Maximum likelihood classification (MLC) is one of the most commonly used approaches in signal classification and identification. It has been successfully applied in a wide range of engineering applications, including classification of digital amplitude-phase modulations [1], remote sensing [2], gene selection for tissue classification [3], nonnative speech recognition [4], chemical analysis in archaeological applications [5], and speaker recognition [6]. On the other hand, support vector machines (SVM) have attracted increasing attention and can be found in almost all areas where prediction and classification of signals are required, such as scour prediction on grade-control structures [7], fault diagnosis [8], EEG signal classification [9], and fire detection [10] as well as road sign detection and recognition [11].

Based on the principles of Bayesian statistics, MLC provides a parametric approach to decision making, where the model parameters need to be estimated before they are applied for classification. In contrast, SVM is a nonparametric approach whose theoretical background is supervised machine learning. Owing to these differences, the two classifiers can perform quite differently. Taking remote sensing as an example, Pal and Mather [12] and Huang et al. [13] found that SVM outperforms MLC and several other classifiers. In Waske and Benediktsson [14], SVM produces better results from SAR images, yet in most cases it generates worse results than MLC from TM images. In Szuster et al. [15], SVM yields only slightly better results than MLC for land cover analysis. As a result, a detailed assessment of the conditions under which SVM outperforms or is inferior to MLC is worth further investigation.

Furthermore, there is a trend to combine the principle underlying MLC, Bayesian theory, with SVM for improved classification. In Ren [16], Bayesian minimum error classification is applied to the predicted outputs of SVM for error-reduced optimal decision making. Similarly, in Vong et al. [17], Bayesian decision theory is applied in SVM for imbalance measurement and feature optimization for improved performance. In Vega et al. [18], Bayesian statistics are combined with SVM for parameter optimization. In Hsu et al. [19], Bayesian inference is applied to estimate the hyperparameters used in SVM learning to speed up the training process. In Foody [20], the relevance vector machine (RVM), a Bayesian extension of SVM, is used, which provides an estimate of the posterior probability of class membership where conventional SVM fails to do so. Consequently, in-depth analysis of the two classifiers is desirable to discover their pros and cons in machine learning.

In this paper, analysis and evaluation of SVM and MLC are emphasized, using data from various applications. Since the selected data satisfy certain conditions in terms of specific sample distributions, we aim to find out how the performance of the classifiers is connected to the particular data distributions. The work and the results shown in the paper therefore help to explain how these classifiers work, which can in turn provide guidance on how to select and combine them in real applications.

The remaining parts of the paper are organized as follows. Section 2 introduces the principles of the two classifiers. Section 3 describes the data and methods that have been used, and experimental results and evaluations are analyzed and discussed in Section 4. Concluding remarks are given in Section 5.

#### 2. MLC and SVM Revisited

In this section, the principles of the two classifiers, SVM and MLC, are discussed. By comparing their theoretic background and implementation details, the two classifiers are characterized in terms of their performances during the training and testing processes. This in turn has motivated our work in the following sections.

##### 2.1. The Maximum Likelihood Classifier (MLC)

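Let $\mathbf{x}_i$, $i \in [1, M]$, be a group of *N*-dimensional features, derived from $M$ observed samples, and let $y_i$ denote the class label associated with $\mathbf{x}_i$; in total we have $K$ classes denoted as $C_k$, $k \in [1, K]$. The basic assumption of MLC is that, for each class of data, the feature space satisfies a specified distribution, usually Gaussian, and that the samples are independent of each other. To this end, the likelihood (probability) for a sample $\mathbf{x}$ within the *k*th class, $C_k$, is given as follows:

$$p(\mathbf{x} \mid C_k) = \frac{1}{(2\pi)^{N/2} |\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^{T} \Sigma_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k)\right), \tag{1}$$

where $\boldsymbol{\mu}_k$ and $\Sigma_k$, respectively, denote the mean vector and covariance matrix of all samples within $C_k$, which can be determined using maximum likelihood estimation as

$$\boldsymbol{\mu}_k = \frac{1}{M_k} \sum_{\mathbf{x}_i \in C_k} \mathbf{x}_i, \qquad \Sigma_k = \frac{1}{M_k} \sum_{\mathbf{x}_i \in C_k} (\mathbf{x}_i - \boldsymbol{\mu}_k)(\mathbf{x}_i - \boldsymbol{\mu}_k)^{T}, \tag{2}$$

with $M_k$ the number of samples in $C_k$.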

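For a given sample $\mathbf{x}$, the probability that it belongs to class $C_k$ can be denoted as $p(C_k \mid \mathbf{x})$. The class $\hat{y}$ to which $\mathbf{x}$ is assigned is then decided by

$$\hat{y} = \arg\max_{k} \, p(C_k \mid \mathbf{x}). \tag{3}$$

Based on Bayesian theory, we have

$$p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\, p(C_k)}{p(\mathbf{x})}. \tag{4}$$

Since $p(\mathbf{x})$ is a constant in (4) when $\mathbf{x}$ is given, (3) can be rewritten as

$$\hat{y} = \arg\max_{k} \, p(\mathbf{x} \mid C_k)\, p(C_k). \tag{5}$$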

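Applying the logarithm to the right side of (5), and letting $g_k(\mathbf{x})$ be the discriminating function, (5) becomes

$$\hat{y} = \arg\max_{k} \, g_k(\mathbf{x}), \tag{6}$$

$$g_k(\mathbf{x}) = \ln p(C_k) - \frac{N}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma_k| - \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_k)^{T} \Sigma_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k). \tag{7}$$

Again we can ignore the constant $-(N/2) \ln 2\pi$ in (7) and simplify the discriminating function as

$$g_k(\mathbf{x}) = \mathbf{x}^{T} W_k \mathbf{x} + \mathbf{w}_k^{T} \mathbf{x} + w_{k0}, \tag{8}$$

where $W_k = -\frac{1}{2} \Sigma_k^{-1}$, $\mathbf{w}_k = \Sigma_k^{-1} \boldsymbol{\mu}_k$, and $w_{k0} = -\frac{1}{2} \boldsymbol{\mu}_k^{T} \Sigma_k^{-1} \boldsymbol{\mu}_k - \frac{1}{2} \ln |\Sigma_k| + \ln p(C_k)$.

As can be seen, $g_k(\mathbf{x})$ is now a quadratic function of $\mathbf{x}$ depending on three parameters, that is, $W_k$, $\mathbf{w}_k$, and $w_{k0}$. When the class is specified, these parameters are determined; hence the quadratic function only depends on the class $k$ and the input sample $\mathbf{x}$. It is also worth noting that the third term, $w_{k0}$, is a constant for a given class, as it does not depend on $\mathbf{x}$.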
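The estimation and decision steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the paper: it assumes two-dimensional features so that the covariance inverse and determinant can be written out by hand, and the function names (`fit_class`, `discriminant`, `classify`) and the toy samples are hypothetical.

```python
import math

def fit_class(samples):
    """Maximum likelihood mean and covariance for one class of 2-D samples."""
    m = len(samples)
    mu = [sum(s[d] for s in samples) / m for d in range(2)]
    cov = [[sum((s[a] - mu[a]) * (s[b] - mu[b]) for s in samples) / m
            for b in range(2)] for a in range(2)]
    return mu, cov

def discriminant(x, mu, cov, prior):
    """Log-prior minus half the log-determinant minus half the Mahalanobis
    distance; the class-independent constant -(N/2) ln(2*pi) is dropped."""
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [[cov[1][1] / det, -cov[0][1] / det],
           [-cov[1][0] / det, cov[0][0] / det]]
    d = [x[0] - mu[0], x[1] - mu[1]]
    maha = sum(d[a] * inv[a][b] * d[b] for a in range(2) for b in range(2))
    return math.log(prior) - 0.5 * math.log(det) - 0.5 * maha

def classify(x, models):
    """Return the label whose discriminant value is largest."""
    return max(models, key=lambda k: discriminant(x, *models[k]))

# Toy training data (illustrative only), with equal priors assumed.
class_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
class_b = [(4.0, 4.0), (5.0, 4.0), (4.0, 5.0), (5.0, 5.0)]
models = {
    "A": (*fit_class(class_a), 0.5),
    "B": (*fit_class(class_b), 0.5),
}

label_near_a = classify((0.2, 0.3), models)
label_near_b = classify((4.8, 4.1), models)
```

With more than two feature dimensions, a linear algebra library would normally replace the hand-written 2×2 inverse; the decision rule itself is unchanged.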

In the particular case when $p(C_k)$ is a constant for all $k$, that is, when a sample is a priori equally likely to belong to any of the classes, the term $\ln p(C_k)$ in (8) can be ignored; hence the discriminating function is rewritten as

$$g_k(\mathbf{x}) = -\ln |\Sigma_k| - (\mathbf{x} - \boldsymbol{\mu}_k)^{T} \Sigma_k^{-1} (\mathbf{x} - \boldsymbol{\mu}_k), \tag{9}$$

where the scalar 1/2 is also dropped as it makes no difference when (6) is applied for decision making. However, such simplification cannot be made unless the samples are known to be equally distributed over the classes.

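Based on (9), the decision function can be further simplified if the total number of classes is reduced to two, where the two classes are denoted as −1 and 1 and the sign function $\operatorname{sgn}(\cdot)$ is introduced for simplicity:

$$f(\mathbf{x}) = \operatorname{sgn}\left( \ln \frac{|\Sigma_{-1}|}{|\Sigma_{1}|} + (\mathbf{x} - \boldsymbol{\mu}_{-1})^{T} \Sigma_{-1}^{-1} (\mathbf{x} - \boldsymbol{\mu}_{-1}) - (\mathbf{x} - \boldsymbol{\mu}_{1})^{T} \Sigma_{1}^{-1} (\mathbf{x} - \boldsymbol{\mu}_{1}) \right). \tag{10}$$

Moreover, in the special case when $\Sigma_{1} = \Sigma_{-1} = \Sigma$, the quadratic terms in $\mathbf{x}$ cancel and the quadratic decision function in (10) becomes a linear one as

$$f(\mathbf{x}) = \operatorname{sgn}\left( \mathbf{w}^{T} \mathbf{x} + b \right), \qquad \mathbf{w} = 2 \Sigma^{-1} (\boldsymbol{\mu}_{1} - \boldsymbol{\mu}_{-1}), \quad b = \boldsymbol{\mu}_{-1}^{T} \Sigma^{-1} \boldsymbol{\mu}_{-1} - \boldsymbol{\mu}_{1}^{T} \Sigma^{-1} \boldsymbol{\mu}_{1}. \tag{11}$$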
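The reduction from a quadratic to a linear decision function under a shared covariance can be checked numerically. The sketch below assumes two-dimensional features, equal priors, and hypothetical values for the class means and the shared covariance; the helper names are illustrative only. It compares the difference of the two Mahalanobis terms with the linear form whose weight vector is twice the inverse covariance times the difference of the means.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def quad(v, p):
    """Quadratic form v^T P v for a 2-vector v."""
    return sum(v[a] * p[a][b] * v[b] for a in range(2) for b in range(2))

def matvec(p, v):
    """Matrix-vector product P v for a 2x2 matrix."""
    return [p[0][0] * v[0] + p[0][1] * v[1],
            p[1][0] * v[0] + p[1][1] * v[1]]

# Hypothetical class means and shared covariance (illustrative values).
mu_pos = [2.0, 1.0]
mu_neg = [-1.0, 0.0]
sigma = [[2.0, 0.5], [0.5, 1.0]]
P = inv2(sigma)

def quadratic_score(x):
    """Difference of the two class discriminants; the log-determinant
    terms cancel because both classes share the same covariance."""
    d_pos = [x[0] - mu_pos[0], x[1] - mu_pos[1]]
    d_neg = [x[0] - mu_neg[0], x[1] - mu_neg[1]]
    return quad(d_neg, P) - quad(d_pos, P)

# Linear form: w = 2 * Sigma^{-1} (mu_pos - mu_neg), b from the mean terms.
w = [2.0 * c for c in matvec(P, [mu_pos[0] - mu_neg[0], mu_pos[1] - mu_neg[1]])]
b = quad(mu_neg, P) - quad(mu_pos, P)

def linear_score(x):
    return w[0] * x[0] + w[1] * x[1] + b
```

Both scores agree for any input, and the decision boundary passes through the midpoint of the two means, as expected when the priors are equal.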

##### 2.2. The Support Vector Machine (SVM)

SVM was originally developed for the classification of two-class problems. In Cortes and Vapnik [21], the principles of SVM are comprehensively discussed. Let the two classes be denoted as 1 and −1, similar to the decision function for MLC in (10); the decision function for linear SVM is given by

$$\begin{cases} \mathbf{w} \cdot \mathbf{x}_i + b \ge 1, & y_i = 1, \\ \mathbf{w} \cdot \mathbf{x}_i + b \le -1, & y_i = -1, \end{cases} \tag{12}$$

where $y_i$ denotes the labeled value for the input sample $\mathbf{x}_i$; $\mathbf{w}$ and $b$ are parameters to be determined in the training process.

Note that the decision function in (12) is equivalent to the one in (10) if we rescale $\mathbf{w}$ and $b$, yet (12) is more feasible as it has increased the decision margin between the two classes from near zero to $2 / \|\mathbf{w}\|$. By multiplying both sides of the discriminating function $\mathbf{w} \cdot \mathbf{x}_i + b$ by $y_i$, the two constraints can be merged into a single one, that is,

$$y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1. \tag{13}$$

Hence, the optimal hyperplane to separate the training data with a maximal margin is defined by

$$\mathbf{w}_0 \cdot \mathbf{x} + b_0 = 0, \tag{14}$$

where $\mathbf{w}_0$ and $b_0$ are the determined parameters, and the maximal distance becomes $2 / \|\mathbf{w}_0\|$.

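To determine this optimal hyperplane, we need to maximize $2 / \|\mathbf{w}\|$, or equivalently to minimize $\frac{1}{2} \|\mathbf{w}\|^2$, subject to $y_i (\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$, $i \in [1, M]$. Using the Lagrangian multipliers $\alpha_i \ge 0$, this optimization problem can be solved by

$$L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2} \|\mathbf{w}\|^2 - \sum_{i=1}^{M} \alpha_i \left[ y_i (\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \right]. \tag{15}$$

Eventually, the parameters $\mathbf{w}_0$ and $b_0$ are decided as

$$\mathbf{w}_0 = \sum_{i=1}^{M} \alpha_i y_i \mathbf{x}_i, \qquad b_0 = y_s - \mathbf{w}_0 \cdot \mathbf{x}_s \quad \text{for any } \mathbf{x}_s \text{ with } \alpha_s > 0. \tag{16}$$

For any nonzero $\alpha_i$, the corresponding $\mathbf{x}_i$ is called a support vector, which naturally satisfies $y_i (\mathbf{w}_0 \cdot \mathbf{x}_i + b_0) = 1$. Therefore, $\mathbf{w}_0$ is a linear combination of all support vectors. Also we have $\sum_{i=1}^{M} \alpha_i y_i = 0$.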
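As a brief sketch of the step between the Lagrangian and the resulting parameters, the standard hard-margin argument is reproduced here. The notation is an assumption where the text leaves it open: $\alpha_i \ge 0$ are the multipliers, $M$ is the number of training samples, and $L$ is the Lagrangian $\frac{1}{2}\|\mathbf{w}\|^2 - \sum_i \alpha_i [y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1]$. Setting its derivatives with respect to $\mathbf{w}$ and $b$ to zero gives

$$\frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^{M} \alpha_i y_i \mathbf{x}_i = 0, \qquad \frac{\partial L}{\partial b} = -\sum_{i=1}^{M} \alpha_i y_i = 0,$$

and substituting both conditions back into $L$ leaves a problem in the multipliers alone:

$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{M} \alpha_i - \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{M} \alpha_i \alpha_j y_i y_j \, (\mathbf{x}_i \cdot \mathbf{x}_j) \quad \text{subject to } \alpha_i \ge 0, \; \sum_{i=1}^{M} \alpha_i y_i = 0.$$

The first condition explains why the weight vector is a weighted sum of training samples, and the second is the equality constraint on the multipliers. The training samples enter the dual only through inner products, which is what later allows kernels to be substituted.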

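Eventually, if we combine (16) with (12), the discrimination function for any test sample $\mathbf{x}$ becomes

$$f(\mathbf{x}) = \operatorname{sgn}\left( \sum_{i=1}^{M} \alpha_i y_i \, (\mathbf{x}_i \cdot \mathbf{x}) + b_0 \right), \tag{17}$$

which solely relies on the inner products of the support vectors and the test sample.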
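A toy example may make the role of the support vectors concrete. The sketch below assumes a two-sample training set whose multipliers can be solved by hand (both equal 0.25 for this particular data); the data, the variable names, and the `decide` helper are illustrative and not taken from the paper.

```python
import math

# Two training samples, one per class (toy data chosen for illustration).
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1, -1]
# Multipliers solved by hand for this toy problem: maximizing
# 2*a - 4*a**2 under alpha_1 = alpha_2 = a gives a = 0.25.
alpha = [0.25, 0.25]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# The weight vector is the alpha-weighted combination of the support vectors.
w0 = [sum(a * yi * xi[d] for a, yi, xi in zip(alpha, y, X)) for d in range(2)]
# The offset follows from y_i * (w0 . x_i + b0) = 1 at any support vector.
b0 = y[0] - dot(w0, X[0])

def decide(x):
    """Sign of the support-vector expansion, using inner products only."""
    s = sum(a * yi * dot(xi, x) for a, yi, xi in zip(alpha, y, X)) + b0
    return 1 if s >= 0 else -1

# Width of the margin between the two classes, 2 / ||w0||.
margin = 2.0 / math.sqrt(dot(w0, w0))
```

Here both samples are support vectors, and the margin equals the distance between them, which is the largest separation any hyperplane could achieve for this data.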

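For nonlinear problems which are not linearly separable, the discrimination function is extended as

$$f(\mathbf{x}) = \operatorname{sgn}\left( \sum_{i=1}^{M} \alpha_i y_i \, \big(\phi(\mathbf{x}_i) \cdot \phi(\mathbf{x})\big) + b_0 \right), \tag{18}$$

where $\phi(\cdot)$ maps the input samples to another space, in which they become linearly separable.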

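Another important step is to introduce the *kernel trick* to calculate the inner product of mapped samples, that is, $K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$, which avoids the difficulty in determining the mapping function and also the cost of calculating the mapped samples and their inner product. Several typical kernels including linear, polynomial, and radial basis function (RBF) are summarized as follows:

$$\begin{aligned} \text{Linear:} \quad & K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i \cdot \mathbf{x}_j, \\ \text{Polynomial:} \quad & K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + 1)^{d}, \\ \text{RBF:} \quad & K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left( -\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2 \right), \end{aligned} \tag{19}$$

where optimal values for the associated parameters $d$ and $\gamma$ are determined automatically during the training process.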
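The three kernels named above can be written directly as functions of two samples. This is a minimal sketch under assumptions: the polynomial kernel is taken in the common form with a unit offset and degree `d`, the RBF kernel is parameterized by a width `gamma`, and the default values are arbitrary; the function names are hypothetical.

```python
import math

def linear_kernel(u, v):
    """Plain inner product of the two samples."""
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(u, v, d=2):
    """Inner product shifted by one and raised to the degree d
    (the unit offset is an assumed convention)."""
    return (linear_kernel(u, v) + 1.0) ** d

def rbf_kernel(u, v, gamma=0.5):
    """Gaussian of the squared Euclidean distance, with width gamma."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)

def kernel_decision(x, support_vectors, labels, alphas, b0, kernel):
    """Kernelized decision: the inner product in the linear decision
    function is replaced by a kernel evaluation."""
    s = sum(a * yi * kernel(sv, x)
            for a, yi, sv in zip(alphas, labels, support_vectors)) + b0
    return 1 if s >= 0 else -1
```

Because only `kernel(sv, x)` is ever evaluated, the mapping into the higher-dimensional space never has to be computed explicitly.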

Though SVM was initially developed for two-class problems, it has been extended to deal with multiclass classification, based either on combining decision results from multiple two-class classifications or on optimization over multiclass based learning. Some useful further readings can be found in [22–24].

##### 2.3. Analysis and Comparisons

MLC and SVM are two useful tools for classification problems, where both of them rely on supervised learning in determining the model and parameters. However, they are different in several ways as summarized below.

Firstly, MLC is a parametric approach which has a basic assumption that the data satisfy a Gaussian distribution. In contrast, SVM is a nonparametric approach and has no requirement on the prior distribution of the data, yet various kernels can be empirically selected to deal with different problems.

Secondly, for MLC the model parameters, $\boldsymbol{\mu}_k$ and $\Sigma_k$, can be directly estimated using the training data before they are applied for testing and prediction. However, SVM relies on supervised machine learning, in an iterative way, to determine a large number of parameters including $\mathbf{w}_0$, $b_0$, all nonzero $\alpha_i$, and their corresponding support vectors.

Thirdly, MLC can be applied straightforwardly to two-class and multiclass problems, yet an additional extension is needed for SVM to deal with multiclass problems as it was initially developed for two-class classification.

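Finally, a posterior class probabilistic output for the predicted results can be intuitively generated from MLC, which is a valuable indicator for classification to show how likely a sample belongs to a given class. For SVM, however, this is not an easy task, though some extensions have been introduced to provide such an output based on the predicted value from SVM. In Platt [25], a posterior class probability is estimated by a sigmoid function as follows:

$$P(y = 1 \mid f) = \frac{1}{1 + \exp(A f + B)}, \tag{20}$$

where $f$ is the unthresholded output of the SVM for the sample.

The parameters $A$ and $B$ are determined by solving a regularized maximum likelihood problem as follows:

$$\min_{A, B} \; -\sum_{i} \left[ t_i \log p_i + (1 - t_i) \log (1 - p_i) \right], \qquad p_i = \frac{1}{1 + \exp(A f_i + B)}, \qquad t_i = \begin{cases} \dfrac{N_{+} + 1}{N_{+} + 2}, & y_i = 1, \\[2mm] \dfrac{1}{N_{-} + 2}, & y_i = -1, \end{cases} \tag{21}$$

where $N_{+}$ and $N_{-}$ denote the number of training samples labeled in classes 1 and −1, respectively.

In addition, in Lin et al. [26] Platt’s approach is further improved to avoid any numerical difficulty, that is, overflow or underflow, in determining $p_i$ in case $A f_i + B$ is either too large or too small:

$$-\left[ t_i \log p_i + (1 - t_i) \log (1 - p_i) \right] = \begin{cases} t_i (A f_i + B) + \log\left(1 + \exp(-A f_i - B)\right), & A f_i + B \ge 0, \\ (t_i - 1)(A f_i + B) + \log\left(1 + \exp(A f_i + B)\right), & A f_i + B < 0. \end{cases} \tag{22}$$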
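The sigmoid calibration and its numerically safe objective can be sketched as follows. This is an illustration under assumptions rather than the procedure used in the paper: the regularized targets follow Platt's smoothing, with the class counts taken over the calibration samples, the per-sample loss is rewritten so that the exponential is only ever evaluated at a non-positive argument, and the function names are hypothetical. The optimizer that searches for the two sigmoid parameters is left out.

```python
import math

def platt_targets(labels):
    """Smoothed targets: slightly below 1 for the positive class and
    slightly above 0 for the negative class."""
    n_pos = sum(1 for v in labels if v == 1)
    n_neg = len(labels) - n_pos
    high = (n_pos + 1.0) / (n_pos + 2.0)
    low = 1.0 / (n_neg + 2.0)
    return [high if v == 1 else low for v in labels]

def stable_loss(A, B, scores, targets):
    """Negative log-likelihood of the sigmoid model, evaluated so that
    exp() never receives a large positive argument."""
    total = 0.0
    for f, t in zip(scores, targets):
        z = A * f + B
        if z >= 0:
            total += t * z + math.log1p(math.exp(-z))
        else:
            total += (t - 1.0) * z + math.log1p(math.exp(z))
    return total

def platt_probability(f, A, B):
    """Posterior estimate 1 / (1 + exp(A*f + B)), computed without overflow."""
    z = A * f + B
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))
```

A fitted `A` is typically negative, so that larger SVM outputs map to probabilities closer to one; any general-purpose minimizer can be applied to `stable_loss` to find `A` and `B`.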

Although there are significant differences between SVM and MLC, the probabilistic model above uncovers a connection between these two classifiers. In fact, in Franc et al. [27] MLC and SVM are found to be equivalent to each other in linear cases, which is also suggested by the similar decision functions in (10) and (12).

#### 3. Data and Methods

##### 3.1. The Datasets

In our experiments, four different datasets, SamplesNew, svmguide3, sonar, and splice, are used. Among these four datasets, SamplesNew is a dataset of suspicious microcalcification clusters extracted from [16] and svmguide3 is a demo dataset from a practical SVM guide [28], whilst the sonar and splice datasets come from the UCI repository of machine learning databases [29]. Two criteria are applied in selecting these datasets: the first is how evenly the samples are distributed over the two classes, and the second is whether the feature distributions are Gaussian-like. The first two datasets are severely imbalanced, especially the first one, as there are far more data samples in one class than in the other. On the other hand, the last two datasets are quite balanced. Regarding feature distributions, SamplesNew and svmguide3 are clearly non-Gaussian, yet the other two, sonar and splice, show approximately Gaussian characteristics when the variables are observed separately. This is also validated by Pearson’s moment coefficient of skewness [30],

$$\gamma_j = E\left[ \left( \frac{x_j - \mu_j}{\sigma_j} \right)^{3} \right], \tag{23}$$

where $\mu_j$ and $\sigma_j$ are the mean and standard deviation for the $j$th dimension of the dataset and $E$ refers to mathematical expectation. After the skewness coefficients are determined for each data dimension, the maximum, the minimum, and the average skewness coefficients are obtained and shown in Table 1 for comparison.
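The per-dimension skewness summary described above can be sketched as follows. The sketch assumes the population form of the coefficient (mean and standard deviation computed over all samples, without bias correction), and the function names and toy rows are hypothetical; it is not the code used to produce Table 1.

```python
import math

def skewness(values):
    """Third standardized moment of one feature dimension.
    Assumes the values are not all identical (nonzero standard deviation)."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return sum(((v - mu) / sigma) ** 3 for v in values) / n

def skewness_summary(rows):
    """Maximum, minimum, and average skewness over all feature dimensions,
    where each row is one sample."""
    columns = list(zip(*rows))
    coeffs = [skewness(col) for col in columns]
    return max(coeffs), min(coeffs), sum(coeffs) / len(coeffs)

# Toy data: the first dimension is symmetric, the second is right-skewed.
rows = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (2.0, 4.0)]
largest, smallest, average = skewness_summary(rows)
```

A coefficient near zero indicates a roughly symmetric dimension, as expected for Gaussian-like features, whereas large positive or negative values indicate a long tail on one side.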