Abstract

The robustness problem of the classical proximal support vector machine for regression estimation (PSVR) when trained on samples contaminated by outliers is addressed in this paper. Correntropy is a local similarity measure between two arbitrary random variables and has been proven insensitive to noise and outliers. Based on the maximum correntropy criterion (MCC), a correntropy-based robust PSVR framework, named RPSVR-MCC, is proposed. The half-quadratic optimization method is employed to solve the resulting optimization problem, yielding an iterative algorithm in which each iteration reduces to a linear system of equations that can be solved easily by standard techniques. Experimental results on synthetic datasets and real-world benchmark datasets demonstrate the effectiveness of the proposed method. Moreover, the superiority of the proposed algorithm is more evident in noisy environments, especially in the presence of outliers.

1. Introduction

Regression estimation from data is a basic subject in the field of machine learning. Given training samples consisting of input vectors and their corresponding targets, the goal is to find a function that best expresses the relationship between the input vectors and their targets. In real-world applications, many factors, such as sampling errors, modeling errors, and instrument errors, corrupt the training samples with noise and outliers, i.e., samples lying extremely far from the others [1]. Training on contaminated samples requires a robust model, one able to cope with a large number of outliers [2, 3]. Therefore, a robust method should reject outliers and construct the regression using only uncontaminated samples.

The support vector machine (SVM) was proposed by Vapnik et al. as a successful and powerful machine learning tool based on the structural risk minimization principle [4–6]. It strikes a balance between the empirical risk and the model complexity and can be derived by solving a convex quadratic programming problem. One efficient variant of SVM is the least squares SVM (LSSVM) introduced by Suykens et al. [7, 8], which replaces the inequality constraints with equality ones so that only a linear system of equations needs to be solved, which tremendously accelerates training. LSSVM has been widely applied in a variety of real-world tasks. Different from LSSVM, the proximal support vector machine (PSVM) was proposed by Fung and Mangasarian as another effective variant of SVM; it adds a bias term to the objective function, which leads to a strongly convex objective [9]. It can be considered a special case of regularized LSSVM that yields a simpler optimal solution as well as very fast computation [9]. LSSVM and PSVM both attempt to minimize the mean squared error (MSE) on the training samples while avoiding overfitting [7]. However, many researchers have shown that when the training samples contain noise and outliers, LSSVM and PSVM are sensitive to them and yield poor generalization performance and robustness, since outliers tend to incur large losses under the squared error criterion [8].

Many papers have addressed improving the robustness of LSSVM and PSVM. A direct approach is to eliminate outliers using advanced techniques [10, 11]. Wen et al. employed a criterion based on least trimmed squares to recursively eliminate outliers [10]. In [11], samples whose slack variables exceeded three standard deviations were discarded in the training phase. An alternative approach develops weighting strategies to mitigate the influence of outliers; the key question is how to set the weights of the samples. Suykens et al. assigned different weights according to the error variables, so that less important samples, such as outliers, received small weights [8]. In [12, 13], researchers provided another weighting rule in which smaller weights were given to samples far from the other samples, in order to reduce their impact. Brabanter et al. [14] discussed four types of weighting functions, namely Huber, Hampel, Logistic, and Myriad, and concluded that in most cases the Logistic and Myriad weighting functions ranked in the top two. However, it remains theoretically unclear whether these weighting strategies are appropriate in the presence of noise and outliers. Some researchers have turned to nonconvex loss functions to obtain substantial improvements in generalization performance and robustness [15–20]. Nonconvex loss functions directly bound the maximal loss value, which restricts the influence of outliers and makes the resulting models much less sensitive to them. Recently, correntropy, derived from information-theoretic learning, has emerged as a generalized, local similarity measure between two arbitrary random variables [21, 22]. It has been proven a robust measure appropriate for various noises, including non-Gaussian and nonzero-mean noise, and has shown its superiority in robust learning for classification and regression [22, 23]. The maximum correntropy criterion (MCC) has been introduced as a loss function into LSSVM for regression estimation to enhance its robustness, yielding RLSSVR-MCC [24].

Inspired by the advantages of the maximum correntropy criterion, in this paper we introduce MCC into PSVM for regression estimation (PSVR) and derive a novel robust PSVR (RPSVR-MCC) that aims to suppress the negative influence of outliers and enhance the robustness of PSVR. The proposed model integrates the maximum correntropy criterion, the regularization technique, and the kernel method. RPSVR-MCC cannot be solved directly by classical optimization methods, so the half-quadratic optimization technique is employed to derive an iterative algorithm for the corresponding optimization problem.

The contributions of the proposed RPSVR-MCC can be summarized as follows:
(1) A robust PSVM for the regression problem based on the maximum correntropy criterion is proposed. It reduces the negative influence of outliers when minimizing the objective function and can be interpreted as a regularized version of RLSSVR-MCC. Adding the bias term yields a simpler expression for the optimal solution and a wider choice of kernel functions.
(2) The resulting optimization problem of RPSVR-MCC can be transformed into a half-quadratic optimization. Furthermore, an iterative algorithm is developed that only needs to solve a linear system of equations in each iteration.
(3) The proposed RPSVR-MCC is evaluated on substantial datasets, including six synthetic datasets and eleven real-world benchmark datasets, both without and with outliers. The experimental results show that RPSVR-MCC not only achieves excellent estimation accuracy but also maintains stable performance in noisy environments, especially in the presence of outliers.

The remainder of this paper is organized as follows. Section 2 briefly reviews PSVR and the maximum correntropy criterion. Section 3 proposes a robust PSVR in the kernel space and solves the proposed RPSVR-MCC by an iterative algorithm. The proposed method is evaluated by numerical experiments on synthetic and benchmark datasets in Section 4. Section 5 concludes the paper.

2. Background

2.1. PSVR

In this section, we concisely present the basic principles of the classical PSVR; for more detail, the reader can refer to [9]. Consider a regression problem with a training dataset $\{(x_i, y_i)\}_{i=1}^{N}$, where the input vector $x_i \in \mathbb{R}^n$ and the corresponding target $y_i \in \mathbb{R}$. The essence of the regression problem is to find a function $f(x) = w^{T}\phi(x) + b$ that best expresses the relationship between the input vectors and their targets, where $\phi(\cdot)$ maps the input into a high-dimensional feature space. PSVR is formulated as [9]

$$\min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + \frac{1}{2}b^2 + \frac{C}{2}\sum_{i=1}^{N}\xi_i^2, \tag{1}$$
$$\text{s.t.} \quad y_i = w^{T}\phi(x_i) + b + \xi_i, \quad i = 1, \ldots, N, \tag{2}$$

where $\xi_i$ is the training error and $C > 0$ is a pre-given parameter that provides a tradeoff between the model complexity and the empirical risk. To solve the optimization problem (1) and (2), we construct the following Lagrangian function:

$$L(w,b,\xi,\alpha) = \frac{1}{2}\|w\|^2 + \frac{1}{2}b^2 + \frac{C}{2}\sum_{i=1}^{N}\xi_i^2 - \sum_{i=1}^{N}\alpha_i\left(w^{T}\phi(x_i) + b + \xi_i - y_i\right), \tag{3}$$

where $\alpha_i$ are the Lagrange multipliers.

Utilizing the Karush-Kuhn-Tucker (KKT) conditions, training PSVR is equivalent to solving the dual problem in the form of a linear system of equations,

$$\left(\Omega + \mathbf{1}\mathbf{1}^{T} + \frac{I}{C}\right)\alpha = Y, \tag{4}$$
$$b = \mathbf{1}^{T}\alpha, \tag{5}$$

where $I$ denotes the identity matrix, $Y = (y_1, \ldots, y_N)^{T}$, $\alpha = (\alpha_1, \ldots, \alpha_N)^{T}$, and $\mathbf{1} = (1, \ldots, 1)^{T}$. $\Omega$ denotes the kernel matrix with elements $\Omega_{ij} = K(x_i, x_j)$, and $K(\cdot,\cdot)$ is the kernel function, which can be expressed as the inner product calculation in the high-dimensional feature space. Among all kernel functions, the most popular choice is the Gaussian kernel defined by $K(x_i, x_j) = \exp\left(-\|x_i - x_j\|^2 / (2\sigma^2)\right)$, where $\sigma$ is the kernel parameter.

The decision function of PSVR is

$$f(x) = \sum_{i=1}^{N}\alpha_i K(x, x_i) + b, \tag{6}$$

where $\alpha$ and b are the solutions of (4) and (5).
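As an illustration, the following is a minimal NumPy sketch of PSVR training and prediction via (4)–(6); the function names and code organization are ours, not part of [9].

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma):
    # Pairwise Gaussian kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))

def psvr_fit(X, y, C, sigma):
    # Solve the linear system (Omega + 1 1^T + I/C) alpha = Y of (4),
    # then recover the bias b = 1^T alpha as in (5).
    N = X.shape[0]
    Omega = gaussian_kernel(X, X, sigma)
    A = Omega + np.ones((N, N)) + np.eye(N) / C
    alpha = np.linalg.solve(A, y)
    b = alpha.sum()
    return alpha, b

def psvr_predict(X_train, alpha, b, X_test, sigma):
    # Decision function (6): f(x) = sum_i alpha_i K(x, x_i) + b.
    K = gaussian_kernel(X_test, X_train, sigma)
    return K @ alpha + b
```

Because (4) is a single dense N × N linear system, training amounts to one solve, which is the source of the speed advantage of PSVR noted above.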

Different from LSSVM, there is a bias term in the objective function of PSVM, and this formulation brings about the strong convexity of the objective function. The strong convexity leads to a simpler form of the optimal solution and implies that the kernel function of PSVM is not required to satisfy Mercer's theorem, so it can be selected from a wider range. PSVM can also be interpreted as a regularized version of LSSVM. The classical PSVR is based on the MSE measure, which treats all samples equally and is known to be sensitive to outliers, seriously affecting reliability. Outliers pose a significant challenge mainly due to the unpredictable nature of their errors, which may be arbitrarily large in magnitude and cannot be ignored or treated with methods devised for small noise [23].

2.2. Maximum Correntropy Criterion

Correntropy is a recently developed information-theoretic measure for dealing with error distributions with non-Gaussian characteristics [21, 22]. MCC expresses the similarity between the predicted output and the real sample in the correntropy sense. Given two arbitrary random variables A and B, their correntropy can be defined by

$$V(A, B) = E\left[\kappa(A, B)\right], \tag{7}$$

where $\kappa(\cdot,\cdot)$ is a kernel function satisfying Mercer's theorem and $E[\cdot]$ denotes the mathematical expectation. Correntropy has a clear theoretical foundation and is symmetric, positive, and bounded [22, 23]. In practice, the joint distribution of A and B is commonly unknown and only a finite number of samples $\{(a_i, b_i)\}_{i=1}^{N}$ are given. Thus, correntropy can be estimated by the following expression:

$$\hat{V}(A, B) = \frac{1}{N}\sum_{i=1}^{N}\kappa_{\gamma}(a_i - b_i), \tag{8}$$

where $\kappa_{\gamma}(\cdot)$ is the Gaussian kernel with bandwidth γ. Therefore, we rewrite (8) as

$$\hat{V}(A, B) = \frac{1}{N}\sum_{i=1}^{N}\exp\left(-\frac{(a_i - b_i)^2}{2\gamma^2}\right). \tag{9}$$
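For concreteness, here is a direct implementation of the estimator (9); the comparison with MSE at the end is merely illustrative and uses made-up numbers.

```python
import numpy as np

def correntropy(a, b, gamma):
    # Empirical correntropy (9): average Gaussian kernel of the errors a_i - b_i.
    e = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return np.mean(np.exp(-e ** 2 / (2.0 * gamma ** 2)))

# A single gross outlier barely moves correntropy (each term is bounded by 1),
# whereas it dominates the MSE.
a = np.array([0.0, 0.1, -0.1, 10.0])   # last entry simulates an outlier error
print(correntropy(a, np.zeros(4), gamma=1.0))   # roughly 0.75
print(np.mean(a ** 2))                          # about 25, blown up by the outlier
```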

Maximizing the correntropy of an error in (7) is called the maximum correntropy criterion (MCC) [21–23]. Different from the global similarity measure MSE, MCC is local, since the value of correntropy mainly depends on the kernel function along the line $A = B$. Liu et al. proved that correntropy is a robust function for linear and nonlinear regression and that the kernel size, which controls all the properties of correntropy, is one of its main merits [22].

3. Robust PSVR Based on Maximum Correntropy Criterion

As mentioned in Section 2.1, the classical PSVR model employs the MSE measure, which is known to be sensitive to outliers. In order to address this issue, we introduce correntropy into robust regression estimation. Replacing the third term of the PSVR objective function in (1) with the maximization of the correntropy between the true target $y_i$ and the predicted value $w^{T}\phi(x_i) + b$, a new criterion can be derived as follows:

$$\max \ \frac{1}{N}\sum_{i=1}^{N}\exp\left(-\frac{\left(y_i - w^{T}\phi(x_i) - b\right)^2}{2\gamma^2}\right). \tag{10}$$

Different from the MSE measure, the MCC in (10) treats the samples differently and puts more emphasis on small errors, which implies that if a target is contaminated or corrupted into an outlier, it contributes little to the objective function. We then obtain the following maximum correntropy problem for training PSVR:

$$\max_{w,b} \ C\sum_{i=1}^{N}\exp\left(-\frac{\left(y_i - w^{T}\phi(x_i) - b\right)^2}{2\gamma^2}\right) - \frac{1}{2}\|w\|^2 - \frac{1}{2}b^2. \tag{11}$$

It combines the maximum correntropy criterion, the regularization technique, and the kernel trick. However, the maximum correntropy objective in (11) is nonlinear and difficult to optimize directly. Many researchers have devoted effort to the half-quadratic technique [25], the expectation-maximization method [26], and the conjugate gradient algorithm [27] for solving this kind of optimization problem. In this paper, we use the half-quadratic technique to solve the optimization problem (11). By introducing an auxiliary variable, this method transforms the original objective function into a half-quadratic objective function. According to the property of convex conjugated functions [27], the following proposition [25] holds; the resulting procedure is summarized in Algorithm 1.

Algorithm 1: Half-quadratic optimization for RPSVR-MCC.
Input: Training set $\{(x_i, y_i)\}_{i=1}^{N}$, parameters C, σ, and γ, the kernel function K, an integer T (the maximum number of iterations), and a small real ε > 0. Initialize $p^{1}$ (e.g., $p_i^{1} = -1/(2\gamma^2)$, corresponding to $\xi_i = 0$ in (27)), build the diagonal matrix $P^{1}$ with $P_{ii}^{1} = -1/(2Cp_i^{1})$, and let $t = 1$.
Step 1. Solve (25) and (26) to obtain $\alpha^{t}$ and $b^{t}$.
Step 2. Obtain the error variables $\xi_i$ by (22) and determine $p^{t+1}$ by (27).
Step 3. Build the diagonal matrix $P^{t+1}$ and then solve (25) and (26) to obtain $\alpha^{t+1}$ and $b^{t+1}$.
Step 4. If $\|\alpha^{t+1} - \alpha^{t}\| < \varepsilon$ or $t \geq T$, go to Step 5; otherwise, let $t = t + 1$ and go to Step 2.
Step 5. Derive the final regression estimation by (6).

Proposition 1. There exists a convex conjugated function $\psi$ of $g(x) = \exp\left(-x^2/(2\gamma^2)\right)$, such that

$$g(x) = \exp\left(-\frac{x^2}{2\gamma^2}\right) = \max_{p<0}\left(p\,x^2 - \psi(p)\right), \tag{12}$$

and, for a fixed x, the maximum is reached at $p = -\frac{1}{2\gamma^2}\exp\left(-\frac{x^2}{2\gamma^2}\right)$ [25].

Introducing (12) into the objective function of (11), the optimization (11) can be transformed into the following half-quadratic optimization according to Proposition 1:

$$\max_{w,b,p} \ J(w,b,p) = C\sum_{i=1}^{N}\left(p_i\,\xi_i^2 - \psi(p_i)\right) - \frac{1}{2}\|w\|^2 - \frac{1}{2}b^2, \tag{13}$$

where $\xi_i = y_i - w^{T}\phi(x_i) - b$ and $p = (p_1, \ldots, p_N)^{T}$ stands for the auxiliary variables appearing in the half-quadratic optimization. According to Proposition 1, for a fixed $(w, b)$, the following equation holds:

$$\max_{p} \ J(w,b,p) = C\sum_{i=1}^{N}\exp\left(-\frac{\xi_i^2}{2\gamma^2}\right) - \frac{1}{2}\|w\|^2 - \frac{1}{2}b^2. \tag{14}$$

Furthermore, we get

$$\max_{w,b}\left\{C\sum_{i=1}^{N}\exp\left(-\frac{\xi_i^2}{2\gamma^2}\right) - \frac{1}{2}\|w\|^2 - \frac{1}{2}b^2\right\} = \max_{w,b,p} \ J(w,b,p), \tag{15}$$

so maximizing (11) is equivalent to maximizing the augmented objective in (13).

Obviously, (13) can be solved by alternately optimizing with respect to $(w, b)$ and p while holding the other fixed. First, if p is fixed, (13) reduces to the following problem with respect to $(w, b)$:

$$\max_{w,b} \ C\sum_{i=1}^{N}p_i\left(y_i - w^{T}\phi(x_i) - b\right)^2 - \frac{1}{2}\|w\|^2 - \frac{1}{2}b^2. \tag{16}$$

Since every $p_i < 0$, the unconstrained optimization problem (16) can be expressed as the following constrained convex problem with respect to $(w, b, \xi)$:

$$\min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + \frac{1}{2}b^2 - C\sum_{i=1}^{N}p_i\,\xi_i^2, \tag{17}$$
$$\text{s.t.} \quad y_i = w^{T}\phi(x_i) + b + \xi_i, \quad i = 1, \ldots, N, \tag{18}$$

where $\xi = (\xi_1, \ldots, \xi_N)^{T}$. The optimization problem (17)-(18) can be solved by constructing the following Lagrangian function:

$$L(w,b,\xi,\alpha) = \frac{1}{2}\|w\|^2 + \frac{1}{2}b^2 - C\sum_{i=1}^{N}p_i\,\xi_i^2 - \sum_{i=1}^{N}\alpha_i\left(w^{T}\phi(x_i) + b + \xi_i - y_i\right), \tag{19}$$

where $\alpha = (\alpha_1, \ldots, \alpha_N)^{T}$ are the Lagrange multipliers.

From the KKT conditions, we obtain

$$\frac{\partial L}{\partial w} = 0 \ \Rightarrow \ w = \sum_{i=1}^{N}\alpha_i\,\phi(x_i), \tag{20}$$
$$\frac{\partial L}{\partial b} = 0 \ \Rightarrow \ b = \sum_{i=1}^{N}\alpha_i, \tag{21}$$
$$\frac{\partial L}{\partial \xi_i} = 0 \ \Rightarrow \ \xi_i = -\frac{\alpha_i}{2Cp_i}, \tag{22}$$
$$\frac{\partial L}{\partial \alpha_i} = 0 \ \Rightarrow \ w^{T}\phi(x_i) + b + \xi_i - y_i = 0. \tag{23}$$

Substituting (20)–(22) into (23), the conditions (20)–(23) can be written as a linear system of equations,

$$\left(\Omega + \mathbf{1}\mathbf{1}^{T} + P\right)\alpha = Y, \tag{24}$$

where P is a diagonal matrix with $P_{ii} = -\frac{1}{2Cp_i}$ and $\Omega_{ij} = K(x_i, x_j)$. From (24) and (21), we get

$$\alpha = \left(\Omega + \mathbf{1}\mathbf{1}^{T} + P\right)^{-1}Y, \tag{25}$$
$$b = \mathbf{1}^{T}\alpha. \tag{26}$$

Second, for fixed $(w, b)$, the optimal p is obtained directly from Proposition 1 by

$$p_i = -\frac{1}{2\gamma^2}\exp\left(-\frac{\xi_i^2}{2\gamma^2}\right), \quad i = 1, \ldots, N. \tag{27}$$

Note that $P_{ii} = -1/(2Cp_i) = (\gamma^2/C)\exp\left(\xi_i^2/(2\gamma^2)\right)$ grows with the error magnitude, so samples with large errors are automatically down-weighted in (24).

The implementation of the half-quadratic optimization for RPSVR-MCC is summarized in Algorithm 1.
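To make the procedure concrete, the following is a compact NumPy sketch of Algorithm 1 under the reconstruction of (22)–(27) above; `gaussian_kernel` is the helper from the PSVR sketch in Section 2.1, and the initialization and stopping rule follow the algorithm box, with all names being ours rather than the paper's.

```python
import numpy as np

def rpsvr_mcc_fit(X, y, C, sigma, gamma, max_iter=50, tol=1e-4):
    # Half-quadratic iterations for RPSVR-MCC (Algorithm 1).
    N = X.shape[0]
    Omega = gaussian_kernel(X, X, sigma)              # kernel matrix, Omega_ij = K(x_i, x_j)
    ones = np.ones((N, N))
    p = np.full(N, -1.0 / (2.0 * gamma ** 2))         # assumed initialization: xi = 0 in (27)
    alpha_old = np.zeros(N)
    for t in range(max_iter):
        P = np.diag(-1.0 / (2.0 * C * p))             # P_ii = -1/(2 C p_i), cf. (24)
        alpha = np.linalg.solve(Omega + ones + P, y)  # linear system (25)
        b = alpha.sum()                               # bias, (26)
        xi = -alpha / (2.0 * C * p)                   # error variables, (22)
        p = -np.exp(-xi ** 2 / (2.0 * gamma ** 2)) / (2.0 * gamma ** 2)  # p-update, (27)
        if np.linalg.norm(alpha - alpha_old) < tol:   # stopping rule of Step 4
            break
        alpha_old = alpha
    return alpha, b
```

Each iteration costs one dense N × N solve, the same as training a single PSVR, and the diagonal entry $P_{ii}$ grows exponentially with the current error, which is precisely what suppresses outliers; prediction reuses `psvr_predict` from Section 2.1.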

4. Experiments

In this section, we present experiments on six synthetic datasets and eleven benchmark datasets to validate the performance of the proposed RPSVR-MCC. To that end, we compare it with least squares support vector regression (LSSVR) [7], PSVR [9], weighted LSSVR with the Hampel weight function (WLSSVR-H) [8], weighted LSSVR with the Logistic weight function (WLSSVR-L) [14], and RLSSVR-MCC [24]. The Gaussian kernel is adopted for all models in the experiments. The performance of these models depends on the parameter choices. All models share the two common parameters C and σ, whose optimal values are selected for each algorithm from a pre-specified candidate set; the correntropy parameter γ in RPSVR-MCC is likewise searched over its own candidate set. In this paper, the optimal parameters of all algorithms are chosen by classical grid search so as to achieve the best performance on the test samples.
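The sketch below illustrates this grid search, reusing `rpsvr_mcc_fit` and `psvr_predict` from the earlier sketches; the candidate grids shown in the commented-out usage are placeholders, since the paper's exact candidate sets are not reproduced here.

```python
import numpy as np
from itertools import product

def grid_search_rpsvr(X_tr, y_tr, X_te, y_te, Cs, sigmas, gammas):
    # Exhaustive grid search over (C, sigma, gamma), scored by test RMSE
    # as in the experimental protocol above.
    best_params, best_rmse = None, np.inf
    for C, sigma, gamma in product(Cs, sigmas, gammas):
        alpha, b = rpsvr_mcc_fit(X_tr, y_tr, C, sigma, gamma)
        pred = psvr_predict(X_tr, alpha, b, X_te, sigma)
        rmse = np.sqrt(np.mean((y_te - pred) ** 2))
        if rmse < best_rmse:
            best_params, best_rmse = (C, sigma, gamma), rmse
    return best_params, best_rmse

# Example with placeholder grids (not the paper's):
# best, _ = grid_search_rpsvr(X_tr, y_tr, X_te, y_te,
#                             Cs=2.0 ** np.arange(-4, 5),
#                             sigmas=2.0 ** np.arange(-3, 4),
#                             gammas=[0.1, 0.5, 1.0, 2.0])
```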

In order to evaluate the performance of the algorithms, we employ the following four regression accuracy measures [28], defined as follows.
(1) The root mean squared error (RMSE) and the mean absolute error (MAE):

$$\text{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2}, \qquad \text{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|y_i - \hat{y}_i\right|,$$

where $y_i$ is the real target, $\hat{y}_i$ is the corresponding prediction, m is the number of test samples, and $\bar{y}$ denotes the average of $y_1, \ldots, y_m$. The smaller the RMSE, the better the fit; MAE is another popular measure of the deviation between the real and predicted values.
(2) The ratio between the sum of squared errors (SSE) and the sum of squared deviations of the test samples (SST):

$$\frac{\text{SSE}}{\text{SST}} = \frac{\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{m}\left(y_i - \bar{y}\right)^2}.$$

(3) The ratio between the interpretable sum of squared deviations (SSR) and SST:

$$\frac{\text{SSR}}{\text{SST}} = \frac{\sum_{i=1}^{m}\left(\hat{y}_i - \bar{y}\right)^2}{\sum_{i=1}^{m}\left(y_i - \bar{y}\right)^2}.$$

In most situations, a small SSE/SST indicates good agreement between the real and predicted values, and a smaller SSE/SST is commonly accompanied by an increase in SSR/SST. Nevertheless, an extremely small SSE/SST is in fact undesirable, for it probably indicates overfitting of the regressor. Therefore, a good estimator should strike a balance between SSE/SST and SSR/SST.
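For reference, a direct implementation of the four measures defined above (function and key names are ours):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    # RMSE, MAE, SSE/SST, and SSR/SST as defined in this section.
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    m = y_true.size
    y_bar = y_true.mean()
    sse = np.sum((y_true - y_pred) ** 2)
    sst = np.sum((y_true - y_bar) ** 2)
    ssr = np.sum((y_pred - y_bar) ** 2)
    return {
        "RMSE": np.sqrt(sse / m),
        "MAE": np.mean(np.abs(y_true - y_pred)),
        "SSE/SST": sse / sst,
        "SSR/SST": ssr / sst,
    }
```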

4.1. Synthetic Datasets

In the synthetic experiments, we consider the approximation of the sinc function, a popular choice in regression estimation [28, 29]. We generate the synthetic datasets as

$$y_i = \operatorname{sinc}(x_i) + \zeta_i = \frac{\sin x_i}{x_i} + \zeta_i,$$

where $\zeta_i$ represents one of several types of noise used to contaminate the datasets. In the experiments, the first and second kinds of noise follow zero-mean Gaussian distributions with two different variances, denoted as Type A and Type B. The third and fourth kinds follow uniform distributions over two different intervals, denoted as Type C and Type D. The fifth and sixth kinds follow Student's t distributions with two different degrees of freedom, denoted as Type E and Type F, where $t(c)$ denotes a Student's t random variable with c degrees of freedom.
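A sketch of this data-generation protocol follows: sample the sinc curve, add noise, and corrupt 20% of the targets with large noise to simulate outliers. The sampling interval, noise level, and outlier magnitude are placeholder values, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_sinc_data(n, noise_std=0.1, outlier_frac=0.2, outlier_scale=2.0):
    # Sample y = sin(x)/x on an interval and add Gaussian noise (Type A/B style);
    # all numeric settings here are illustrative assumptions.
    x = rng.uniform(-4 * np.pi, 4 * np.pi, size=n)
    y = np.sinc(x / np.pi)                  # np.sinc(t) = sin(pi t)/(pi t), so this is sin(x)/x
    y += rng.normal(0.0, noise_std, size=n)
    # Corrupt a random 20% of the targets with large noise to simulate outliers.
    idx = rng.choice(n, size=int(outlier_frac * n), replace=False)
    y[idx] += outlier_scale * rng.standard_normal(idx.size)
    return x.reshape(-1, 1), y
```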

The goal of this evaluation is to measure the robustness of the different algorithms against outliers. In order to avoid biased comparisons, we randomly generate ten independent groups of 850 samples according to the sinc function, with 350 training samples and 500 test samples in each group. For each kind of noise, we first add it to the training samples for contamination purposes and then randomly select 20% of the training samples and add large noise to their targets to simulate outliers. The test samples are clean, sampled from the sinc function without any noise. Table 1 reports the average performance of the six algorithms over ten independent runs under the different noise distributions; the number in parentheses after each criterion (RMSE, MAE, SSE/SST, and SSR/SST) indicates the rank of the algorithm among the six. The optimal parameters are also listed in Table 1. Table 2 shows the average ranks of the algorithms under the different types of noise. The best results are highlighted in bold.

From Tables 1 and 2, we can draw the following conclusions:
(1) The classical LSSVR and PSVR obtain the least satisfactory results, reflected by their much higher RMSE, MAE, and SSE/SST values. This shows that LSSVR and PSVR are sensitive to outliers. The weighted models (WLSSVR-H, WLSSVR-L) improve on LSSVR and PSVR to a certain extent, yet they are less accurate than RLSSVR-MCC and RPSVR-MCC.
(2) The proposed RPSVR-MCC performs best, as reflected by the smallest RMSE, MAE, and SSE/SST for most noise types. In particular, RPSVR-MCC ranks first in all four criteria for Type E noise. In terms of SSR/SST alone, however, RPSVR-MCC is not satisfactory.
(3) For the RMSE, MAE, and SSE/SST indexes in Table 2, the average rank of RPSVR-MCC is better than those of the other algorithms. For the SSR/SST index, no obvious difference is found among the algorithms.

Figures 1–3 show the regression estimates obtained in one run of these algorithms under Type A, Type C, and Type E noise. The curves derived by LSSVR and PSVR are severely distorted by outliers, regardless of the noise type. Although WLSSVR-H and WLSSVR-L perform better than LSSVR and PSVR, they still deviate considerably at some samples, especially outliers. We notice that the curve of RPSVR-MCC follows the original curve more closely for most of the test samples. On average, our proposed RPSVR-MCC achieves higher robustness and more satisfactory results in the presence of outliers than the other algorithms.

4.2. Benchmark Datasets

For the real-world examples, we conduct experiments on eleven benchmark datasets to further evaluate the proposed RPSVR-MCC. They include Pyrimidines (Pyrim), Triazines, AutoMPG, Boston Housing (BH), Servo, and Abalone, downloaded from the well-known UCI machine learning repository (https://archive.ics.uci.edu/ml/index.php); Bodyfat, Pollution, and Concrete Compressive Strength (Concrete) from StatLib (http://lib.stat.cmu.edu/datasets/); and Machine CPU (MCPU) and Diabetes from https://www.dcc.fc.up.pt/∼ltorgo/Regression/DataSets.html. These datasets are widely used for evaluating regression algorithms.

The first column of Tables 3–5 displays detailed information about these datasets, including the numbers of samples and attributes and the numbers of training and test samples. For each dataset, we randomly divide it into training and test parts according to the numbers in Tables 3–5. Since this paper concerns robustness to outliers, training samples are contaminated by large noise to simulate outliers during the training process. Concretely, the outliers are generated by randomly choosing 20% of the training samples and adding noise to their targets scaled according to the average value of the regression targets. The test samples are clean, uncontaminated by any noise. Each regression algorithm is repeated ten times with different independent partitions into training and test sets, and the ten accuracy measures are averaged to produce a single estimate. The performances and optimal parameters of the algorithms on the eleven datasets are reported in Tables 3–5.

In this subsection, we compare in detail the performance of RPSVR-MCC with that of the other five algorithms. For the RMSE index, one can see from Tables 3–5 the following:
(1) In the case without outliers, RPSVR-MCC is superior to LSSVR and PSVR on nine datasets, with Concrete an exception, and performs better than WLSSVR-H and WLSSVR-L on eight datasets, with AutoMPG and Servo the exceptions. Meanwhile, it is more accurate than RLSSVR-MCC on seven datasets, with Concrete an exception, and gives comparable results on Pyrim and Bodyfat.
(2) In the case with outliers, RPSVR-MCC is superior to LSSVR and PSVR on all the datasets and performs better than WLSSVR-H and WLSSVR-L on eight datasets, with Pyrim and Servo the exceptions. It obtains more satisfactory results than RLSSVR-MCC on six datasets, with Pollution, Servo, and Concrete the exceptions, and gives comparable performance on Bodyfat.
For the MAE and SSE/SST indexes, similar conclusions hold: on average, RPSVR-MCC outperforms the other five algorithms in all cases.

In order to analyze the performance of these algorithms more precisely, we summarize the average ranks of the algorithms on the benchmark datasets without outliers in Table 6 and with outliers in Table 7. From Tables 6 and 7, in terms of the RMSE, MAE, and SSE/SST indexes, both without and with outliers, RPSVR-MCC always ranks first and RLSSVR-MCC always ranks second among the six algorithms. The main reason is that RPSVR-MCC and RLSSVR-MCC both employ a robust loss function (the maximum correntropy criterion) to limit the influence of outliers during the training phase. In addition, RPSVR-MCC outperforms RLSSVR-MCC on seven of the eleven datasets in the case with outliers. For the SSR/SST index, the proposed RPSVR-MCC performs worse, ranking third among the algorithms; it is, however, still superior to RLSSVR-MCC. On average, RPSVR-MCC is more robust to outliers and has better generalization performance than the other algorithms.

5. Conclusion

In this paper, we introduce the maximum correntropy criterion into the classical PSVR and propose a novel robust PSVR, namely RPSVR-MCC, for regression estimation in noisy environments, especially in the presence of outliers. The proposed method integrates the maximum correntropy criterion, the regularization technique, and the kernel method. The half-quadratic optimization technique is adopted to derive an iterative algorithm for solving the corresponding optimization problem. Two groups of experiments, on synthetic datasets and on benchmark datasets with outliers, are conducted to test the robustness of the proposed algorithm. Compared with the classical LSSVR and PSVR, the proposed RPSVR-MCC achieves better generalization, especially in the presence of outliers; the likely reason is that MCC is a local similarity criterion appropriate for samples containing large outliers. Compared with the other robust SVR algorithms (WLSSVR-H, WLSSVR-L), the proposed RPSVR-MCC performs better on eight of the eleven datasets, and it is superior to the correntropy-based RLSSVR-MCC on seven of the eleven datasets. In conclusion, the developed algorithm not only achieves higher robustness and better estimation accuracy but also maintains stable performance in dealing with outliers.

This paper addresses outlier robustness with respect to target noise only. However, datasets in practical applications are commonly contaminated by noise in both the features (or attributes) and the targets. Our future work will focus on the pinball loss [30] to yield a more useful and flexible method. In addition, the proposed approach can be extended to the one-class problem for imbalanced classification [31], which we also plan to address in future work.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The work was supported by the National Natural Science Foundation of China (Nos. 11626186, 11861060, and 11171346) and the Scientific Research Program Funded by Shaanxi Provincial Education Department (No. 18JK0623).