Mathematical Problems in Engineering

Volume 2017 (2017), Article ID 9650769, 11 pages

https://doi.org/10.1155/2017/9650769

## Probability Distribution and Deviation Information Fusion Driven Support Vector Regression Model and Its Application

Key Laboratory of Advanced Control and Optimization for Chemical Processes of Ministry of Education, East China University of Science and Technology, MeiLong Road No. 130, Shanghai 200237, China

Correspondence should be addressed to Xuefeng Yan; xfyan@ecust.edu.cn

Received 29 June 2017; Revised 25 August 2017; Accepted 30 August 2017; Published 12 October 2017

Academic Editor: Xinkai Chen

Copyright © 2017 Changhao Fan and Xuefeng Yan. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

In conventional modeling, only the deviation between the output of the support vector regression (SVR) model and the training sample is considered, whereas other prior information about the training sample, such as probability distribution information, is ignored. Probability distribution information describes the overall distribution of the sample data in a training set that contains different degrees of noise and potential outliers, and it helps in developing a high-accuracy model. To mine and use the probability distribution information of a training sample, a new SVR model weighted by probability distribution information (PDISVR) is proposed. In the PDISVR model, the probability distribution of each sample is considered as the weight and is then introduced into the error coefficient and slack variables of SVR. Thus, the deviation and probability distribution information of the training sample are both used in the PDISVR model to eliminate the influence of noise and outliers in the training sample and to improve predictive performance. Furthermore, examples with different degrees of noise were employed to demonstrate the performance of PDISVR, which was then compared with that of three SVR-based methods. The results showed that PDISVR performs better than the three other methods.

#### 1. Introduction

Since its proposal by Vapnik, the support vector machine (SVM) has been used in many areas, including both pattern recognition and regression estimation [1, 2]. The original SVM obtains its parameters as the solution to a quadratic programming problem. SVM has some advantages, such as low standard deviation and easy generation, as well as some disadvantages, such as the redundancy of the regression function and the low efficiency of support vector selection. To address these disadvantages, various improvements to the support vector algorithm and its kernel function have been proposed. Suykens proposed least-squares support vector regression (LS-SVR) for regression modeling problems [3, 4]. By converting inequality constraints into equality constraints, LS-SVR simplifies the solution of the quadratic programming problem [5]. In the field of regression, Smola proposed the linear programming support vector regression (LP-SVR) model [6, 7]. LP-SVR has numerous strengths, such as the use of more general kernel functions and fast learning ability. LP-SVR can control the accuracy and sparseness of the original SVR by using a linear combination of kernels as the solution approach. In addition, a new kernel function, the multikernel function (MK), has been introduced into the standard SVM model. MK provides lower error and requires a shorter training period than the original kernel function. Multiple-kernel SVR (MKSVR) is very popular in some systems. Yeh et al. [8] developed MKSVR for stock market forecasts. Lin and Jhuo [9] discovered a method to generate MKSVR parameters for integration into a system that converts the pixels of a checkpoint into the brightness value. Zhong and Carr [10] used the MKSVR model to estimate the minimum miscibility pressure (MMP) of pure and impure carbon dioxide-oil systems in a CO$_2$ enhanced oil recovery process.

The SVR model has also been improved by using prior knowledge [11, 12]. There are numerous types of prior knowledge, including the average value and monotonicity of the sample data. To appropriately use prior knowledge, three types of methods are utilized in SVR [13]. Our team previously worked on prior knowledge about the monotonicity of sample data. This monotonicity knowledge is described by first-order difference inequality constraints on the kernel expansion and by additive kernels [14]. The constraints are added directly to the kernel formulation to obtain a convex optimization problem. For additive kernels, SVMs are constructed by adding separate kernels for every input dimension. These operations confer higher accuracy to the SVR model in support vector (SV) selection.

Inevitably, even small noise can degrade the accuracy of the model. Furthermore, in some situations, part of the noisy data may be ten to even dozens of times larger than the normal data. These outliers introduce bias and inaccuracies to SVR. Nevertheless, the probability distribution of the sample data is a good indicator of noise. From the perspective of the probability distribution of sample data, normal data and data that contain the least amount of noise have the highest probability in the sample data. By contrast, data that contain a large amount of noise have relatively small probability. Thus, outliers in the sample data will have the smallest probability. Therefore, the probability distribution is prior knowledge that helps weaken the influence of noise and outliers in the sample data. We use this information to modify our SVR model.

This article is structured as follows: Section 2 introduces standard SVR algorithms. Section 3 describes the proposed algorithm that integrates probability distribution information into the SVR framework. Section 4 provides some experimental results that were obtained from comparing the proposed algorithm with other algorithms. Finally, Section 5 presents some conclusions about the proposed algorithm.

#### 2. Review of SVR

To better describe the proposed algorithm, the basic concepts of SVR and the usage of deviation information are first stated mathematically.

##### 2.1. Support Vector Regression (SVR)

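SVR was originally used to solve linear regression problems. For given training samples $\{(x_i, y_i)\}_{i=1}^{n}$, fitting aims to find the dependency between the independent variable $x$ and the dependent variable $y$. Specifically, it aims to identify an optimal function $f(x, w)$ that minimizes the expected risk

$$R(w) = \int L\bigl(y, f(x, w)\bigr)\, dP(x, y), \tag{1}$$

where $\{f(x, w)\}$ is the predictive function set, $w$ is the generalized parameter vector of the function, $L\bigl(y, f(x, w)\bigr)$ is the loss function, and $f(x) = w^{T}x + b$ is the fitting function [15]. Thus, the solution of the optimal linear function for SVR is expressed as the following constrained optimization problem:

$$\min_{w,\,b,\,\xi,\,\xi^{*}} \ \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n}\left(\xi_i + \xi_i^{*}\right) \quad \text{s.t.} \quad \begin{cases} y_i - w^{T}x_i - b \le \varepsilon + \xi_i, \\ w^{T}x_i + b - y_i \le \varepsilon + \xi_i^{*}, \\ \xi_i,\ \xi_i^{*} \ge 0, \quad i = 1, \dots, n, \end{cases} \tag{2}$$

where the penalty coefficient $C$, which determines the trade-off between the accuracy of the function fitting and the penalty on errors greater than $\varepsilon$, is given in advance. The parameter $\varepsilon$ controls the size of the fitting error, the number of support vectors, and the generalization capability. To allow for fitting errors larger than $\varepsilon$, the slack variables $\xi_i$ and $\xi_i^{*}$ are introduced. A figure in reference [10] illustrates this linear fitting problem.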
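The role of the slack variables and the penalty coefficient can be made concrete with a small calculation. The following is a minimal sketch, not the authors' code: it assumes the standard linear ε-SVR objective (half the squared norm of the weights plus the penalty coefficient times the total slack), and the function name `svr_primal_objective` and the toy data are hypothetical.

```python
def svr_primal_objective(w, b, xs, ys, eps, C):
    """Evaluate the assumed linear SVR objective for a fixed model f(x) = w.x + b.

    Returns (objective, xi, xi_star), where xi / xi_star are the slacks of
    samples lying above / below the epsilon-tube.
    """
    xi, xi_star = [], []
    for x, y in zip(xs, ys):
        f = sum(wj * xj for wj, xj in zip(w, x)) + b
        xi.append(max(0.0, y - f - eps))       # target above the tube
        xi_star.append(max(0.0, f - y - eps))  # target below the tube
    structural = 0.5 * sum(wj * wj for wj in w)   # 0.5 * ||w||^2
    empirical = C * (sum(xi) + sum(xi_star))      # C * total slack
    return structural + empirical, xi, xi_star


# Toy data: the last sample deviates strongly from the line y = x.
xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [0.1, 0.9, 2.0, 4.0]
obj, xi, xi_star = svr_primal_objective([1.0], 0.0, xs, ys, eps=0.2, C=10.0)
print(obj, xi, xi_star)
```

Only the last sample falls outside the tube, so it alone contributes a nonzero slack. Increasing `C` makes that single deviation dominate the objective, which is why noisy samples can distort the fit.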

However, the previous solution applies only to linear regression problems. Nonlinear regression requires a kernel function in the SVR model [16]. The kernel function can be expressed as follows:

$$K(x_i, x_j) = \varphi(x_i)^{T}\varphi(x_j), \tag{3}$$

where $\varphi$ is the mapping from a low-dimensional space to a high-dimensional space. The independent variable $x$ is a vector that is mapped to a feature space so that a nonlinear problem can be changed into a linear problem. After introducing the kernel function, the new fitting function becomes

$$f(x) = w^{T}\varphi(x) + b, \tag{4}$$

where the symbol $w^{T}$ indicates the transpose of $w$.
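As an illustration of how a kernel replaces the explicit feature mapping, the sketch below builds a Gram matrix and evaluates a kernel-expansion predictor. It is a sketch under stated assumptions: the Gaussian (RBF) kernel, the parameter `gamma`, the coefficient vector `beta`, and all function names are illustrative choices and are not taken from the paper.

```python
import math


def rbf_kernel(x, z, gamma=0.5):
    """Gaussian kernel K(x, z) = exp(-gamma * ||x - z||^2) (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def gram_matrix(xs, kernel):
    """n-by-n matrix of pairwise kernel values over the training inputs."""
    return [[kernel(xi, xj) for xj in xs] for xi in xs]


def predict(x, xs, beta, b, kernel):
    """Kernel-expansion predictor f(x) = sum_j beta_j * K(x, x_j) + b."""
    return sum(bj * kernel(x, xj) for bj, xj in zip(beta, xs)) + b


xs = [[0.0], [1.0], [2.0]]
K = gram_matrix(xs, rbf_kernel)
f0 = predict([0.0], xs, beta=[1.0, 0.0, 0.0], b=0.5, kernel=rbf_kernel)
```

The predictor only ever touches the data through kernel values, so the high-dimensional mapping never has to be computed explicitly.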

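Changing the fitting function leads to the following constrained optimization problem, in which $w$ is expanded over the training samples as $w = \sum_{j=1}^{n}\beta_j\varphi(x_j)$:

$$\min_{\beta,\,b,\,\xi,\,\xi^{*}} \ \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n}\left(\xi_i + \xi_i^{*}\right) \quad \text{s.t.} \quad \begin{cases} y_i - \sum_{j=1}^{n}\beta_j K(x_i, x_j) - b \le \varepsilon + \xi_i, \\ \sum_{j=1}^{n}\beta_j K(x_i, x_j) + b - y_i \le \varepsilon + \xi_i^{*}, \\ \xi_i,\ \xi_i^{*} \ge 0, \quad i = 1, \dots, n. \end{cases} \tag{5}$$

In this constrained optimization problem, the length of $\xi$ and $\xi^{*}$ is $n$. The notation $K(x_i, x_j)$ denotes the kernel function, which fulfills Mercer's conditions.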

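The standard SVR is a compromise between structural risk minimization and empirical risk minimization. In particular, for the support vector regression learning algorithm, the structural risk term is $\frac{1}{2}\|w\|^{2}$ and the empirical risk term is $C\sum_{i=1}^{n}(\xi_i + \xi_i^{*})$. However, calculating the structural risk term requires considerable time and resources [17]. Researchers found that minimizing the 1-norm of the coefficient vector $\beta$ of the kernel expansion $f(x) = \sum_{j=1}^{n}\beta_j K(x, x_j) + b$ instead reduces the time and resources spent on calculation. The optimization formula then takes the following form:

$$\min_{\beta,\,b,\,\xi,\,\xi^{*}} \ \|\beta\|_{1} + C\sum_{i=1}^{n}\left(\xi_i + \xi_i^{*}\right) \quad \text{s.t.} \quad \begin{cases} y_i - \sum_{j=1}^{n}\beta_j K(x_i, x_j) - b \le \varepsilon + \xi_i, \\ \sum_{j=1}^{n}\beta_j K(x_i, x_j) + b - y_i \le \varepsilon + \xi_i^{*}, \\ \xi_i,\ \xi_i^{*} \ge 0, \quad i = 1, \dots, n. \end{cases} \tag{6}$$

Although the time and resources spent on modeling are reduced, there is no considerable difference in the final accuracy.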

##### 2.2. Support Vector Regression with Deviation Information as a Consideration

Traditional SVR does not possess a special method for addressing noise in sample data. An efficient way to weaken noise is to adjust parameters in the SVR model. These parameters are called hyperparameters in SVRs. Hyperparameters exert a considerable impact on algorithm performance. The general way to test the performance of hyperparameters is via the deviation between the model output and the sample data [18]. The obtained deviation is then compared with other deviations to select the minimum deviation as the final result. The parameters that correspond to the minimum deviation are the best parameters in the optimization process. Usually, this process is conducted using an intelligent optimization algorithm, such as particle swarm optimization (PSO) [19] and genetic algorithm (GA) [20]. The deviation is set as the fitness function in an intelligent optimization algorithm. In this section, we refer to this method as deviation-minimized SVR (DM-SVR).
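The selection loop described above can be sketched as follows. This is only an illustration under explicit assumptions: a plain grid search stands in for PSO or GA, a Gaussian kernel smoother stands in for a trained SVR model, and the names `smoother_predict`, `select_by_deviation`, and `gamma` are hypothetical. The deviation (here MSE on held-out samples) plays the role of the fitness function.

```python
import math


def mse(y_hat, y):
    """Mean square error between predictions and targets."""
    return sum((a - b) ** 2 for a, b in zip(y_hat, y)) / len(y)


def smoother_predict(x, xs, ys, gamma):
    """Stand-in model (NOT an SVR): Gaussian-weighted average of training targets."""
    weights = [math.exp(-gamma * (x - xi) ** 2) for xi in xs]
    return sum(w * yi for w, yi in zip(weights, ys)) / sum(weights)


def select_by_deviation(candidates, train, valid):
    """Return the candidate hyperparameter with the smallest validation MSE."""
    train_x, train_y = train
    valid_x, valid_y = valid
    scores = {}
    for gamma in candidates:
        preds = [smoother_predict(x, train_x, train_y, gamma) for x in valid_x]
        scores[gamma] = mse(preds, valid_y)  # deviation used as fitness
    best = min(scores, key=scores.get)
    return best, scores


train = ([0.0, 1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 4.0, 9.0, 16.0])
valid = ([0.5, 1.5, 2.5, 3.5], [0.25, 2.25, 6.25, 12.25])
best_gamma, scores = select_by_deviation([0.1, 1.0, 10.0], train, valid)
```

Every candidate requires a full model fit and evaluation, so the cost of this procedure grows quickly with the number of hyperparameters being searched.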

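In most circumstances, the deviation between the model output and sample data is represented by the correlation coefficient or the mean square error (MSE). Given vector $\hat{y} = (\hat{y}_1, \dots, \hat{y}_n)$ as the model output and vector $y = (y_1, \dots, y_n)$ as the sample output, with means $\bar{\hat{y}}$ and $\bar{y}$, the correlation coefficient $r$ can be expressed as

$$r = \frac{\sum_{i=1}^{n}\left(\hat{y}_i - \bar{\hat{y}}\right)\left(y_i - \bar{y}\right)}{\sqrt{\sum_{i=1}^{n}\left(\hat{y}_i - \bar{\hat{y}}\right)^{2}\,\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^{2}}}. \tag{7}$$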

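The formula for the mean square error (MSE) between the model output $\hat{y}_i$ and the sample output $y_i$ over $n$ samples is as follows:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^{2}. \tag{8}$$

In short, the group of parameters for which the MSE is closest to zero and $r$ is closest to one produces the best performance.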
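Both deviation measures are straightforward to compute. The sketch below uses the usual Pearson correlation coefficient and mean square error; the function names are illustrative, and the code assumes that neither vector is constant (otherwise the correlation is undefined).

```python
import math


def correlation_coefficient(y_hat, y):
    """Pearson correlation between model output y_hat and sample output y."""
    n = len(y)
    mean_hat = sum(y_hat) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_hat) * (b - mean_y) for a, b in zip(y_hat, y))
    var_hat = sum((a - mean_hat) ** 2 for a in y_hat)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_hat * var_y)


def mean_square_error(y_hat, y):
    """Average squared difference between model output and sample output."""
    return sum((a - b) ** 2 for a, b in zip(y_hat, y)) / len(y)


y_hat = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
r = correlation_coefficient(y_hat, y)
err = mean_square_error(y_hat, y)
```

In this toy case the output is perfectly correlated with the target while the MSE is far from zero, which shows why the two measures are best read together.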

#### 3. Probability Distribution Information Weighted Support Vector Regression

Although DM-SVR can reduce the influence of noise, it also has some weaknesses. Its main disadvantage is the time spent on training. SVR already has many parameters that need to be optimized, and every extra parameter makes the training process less efficient. To resolve the uncertainty in the error parameter $\varepsilon$, we introduce probability distribution information (PDI) into SVR and designate the resulting model as PDISVR.

##### 3.1. Probability Distribution of the Output

The probability distribution information is given by the probability density function, which describes the likelihood that the output value of a continuous random variable is near a certain point. Integrating the probability density function is the proper way to calculate the probability that the random variable falls in a certain region. From the sample data, we can count the frequency with which the output takes different values. We denote this frequency by $f(y)$, where $y$ is the output value vector. Let $P$ be the probability of the sample's output. The relationship between $f(y)$ and $P$ can then be expressed as

$$P = \int_{D} f(y)\, dy, \tag{9}$$

where $D$ is the range of $y$. Then, we can easily obtain the probability distribution function. The next step is the identification of the probability of every point.
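One simple way to obtain a probability value for every sample output is a histogram estimate. The sketch below is an assumption and not necessarily the estimator used by the authors: it bins the outputs into `n_bins` equal-width intervals (a hypothetical parameter) and assigns each sample the relative frequency of its bin.

```python
def output_probabilities(ys, n_bins=5):
    """Assign each output the relative frequency of its equal-width histogram bin."""
    lo, hi = min(ys), max(ys)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all outputs are equal
    counts = [0] * n_bins
    bin_index = []
    for y in ys:
        k = min(int((y - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        bin_index.append(k)
        counts[k] += 1
    n = len(ys)
    return [counts[k] / n for k in bin_index]


# Five outputs cluster near 1.0; the last one is an outlier.
ys = [1.0, 1.1, 0.9, 1.05, 0.95, 10.0]
probs = output_probabilities(ys, n_bins=3)
```

The clustered outputs share a high probability while the outlier receives the smallest one, which is the property the weighting scheme relies on.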

##### 3.2. Optimization Formula with Probability Distribution Information Weight

Once we have obtained the probability distribution of the output, it should be integrated into the basic SVR model. In the basic SVR model, the error parameter $\varepsilon$ sets the accuracy of model fitting by defining a region in which deviations incur no loss in the objective function. However, some sample data contain excessive noise. If the same error parameter is adopted for every sample, the performance of the model is reduced. To prevent this situation, SVR should be adjusted in accordance with the noise information. We propose representing the noise information through the probability distribution of the output. Samples in regions with low probability contain a relatively large proportion of noise. For this reason, in modeling, a region with higher probability should have a smaller error parameter than a region with lower probability. Thus, the probability distribution function increases the accuracy of the SVR model in the area where the output has high probability.

Define the $\varepsilon$-insensitive loss function as

$$L_{\varepsilon}\bigl(y, f(x)\bigr) = \max\bigl(0,\ |y - f(x)| - \varepsilon\bigr), \tag{10}$$

where $f(x)$ is a regression estimation function constructed by learning the sample and $y$ is the target output value that corresponds to $x$. By defining the $\varepsilon$-insensitive loss function, the SVR model specifies the error requirement of the estimated function on the sample data. This requirement is the same for every sample point. To avoid this situation, the artificially set error parameter $\varepsilon$ is divided by the probability distribution vector $P = (P_1, \dots, P_n)$, where $P_i$ is the probability of the output of the $i$-th sample. Figure 1 illustrates the change from a constant $\varepsilon$ to a vector $\varepsilon / P$. The distance between the two hyperplanes is modified wherever the density of the points changes. In a high-density area, the model has a smaller error parameter. By contrast, in a low-density area, the model has a larger error parameter. The density of the output points is directly related to the probability of the sample's output $P$. Therefore, dividing by the PDI makes the SVR model emphasize the areas with a high density of points. This technique can improve overall accuracy despite sacrificing accuracy in low-density areas. According to (9) and (10), the PDISVR can be expressed as

$$\min_{w,\,b,\,\xi,\,\xi^{*}} \ \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n}\left(\xi_i + \xi_i^{*}\right) \quad \text{s.t.} \quad \begin{cases} y_i - f(x_i) \le \dfrac{\varepsilon}{P_i} + \xi_i, \\ f(x_i) - y_i \le \dfrac{\varepsilon}{P_i} + \xi_i^{*}, \\ \xi_i,\ \xi_i^{*} \ge 0, \quad i = 1, \dots, n, \end{cases} \tag{11}$$

where $f(x_i) = w^{T}\varphi(x_i) + b$. By comparing (11) with the standard form of SVR, we can see that the error parameter changes in accordance with $P$. The PDISVR model therefore has a low error tolerance where the density of points is high.
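The effect of dividing the error parameter by the output probability can be checked numerically. The following is a minimal sketch that assumes the per-sample tube width is the base error parameter divided by the sample's probability, as the text describes; the function names and the toy values are hypothetical.

```python
def eps_insensitive_loss(y, f, eps):
    """Loss is zero inside the tube of half-width eps, linear outside it."""
    return max(0.0, abs(y - f) - eps)


def pdi_losses(ys, fs, probs, eps):
    """Per-sample loss with tube half-width eps / p_i (assumed PDI weighting)."""
    return [eps_insensitive_loss(y, f, eps / p) for y, f, p in zip(ys, fs, probs)]


# Two samples with the same residual but different output probabilities.
ys = [1.0, 1.0]
fs = [1.3, 1.3]
probs = [0.9, 0.1]  # first sample lies in a dense region, second in a sparse one
losses = pdi_losses(ys, fs, probs, eps=0.1)
```

The sample from the dense region is penalized because its tube is narrow, whereas the same residual in the sparse region falls inside a much wider tube and contributes no loss.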