Mathematical Problems in Engineering

Volume 2017 (2017), Article ID 4191789, 12 pages

https://doi.org/10.1155/2017/4191789

## Reconstruct the Support Vectors to Improve LSSVM Sparseness for Mill Load Prediction

State Key Laboratory of Electrical Insulation and Power Equipment, School of Electrical Engineering, Xi’an Jiaotong University, Xi’an, China

Correspondence should be addressed to Jianquan Shi

Received 24 October 2016; Revised 14 May 2017; Accepted 25 May 2017; Published 5 July 2017

Academic Editor: Erik Cuevas

Copyright © 2017 Gangquan Si et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The sparse strategy plays a significant role in the application of the least squares support vector machine (LSSVM), because it alleviates the lack of sparseness and robustness in the LSSVM solution. In this paper, a sparse method using reconstructed support vectors is proposed and successfully applied to mill load prediction. Unlike other sparse algorithms, it no longer selects the support vectors from the training data set according to their ranked contributions to the LSSVM optimization. Instead, the reconstructed data are first obtained from the initial model trained on all training data. Then, support vectors are selected from the reconstructed data set according to the location information of density clustering in the training data set, and the selection terminates after the whole training data set has been traversed. Finally, the training model is built on the optimal reconstructed support vectors, and the hyperparameters are tuned subsequently. In addition, the paper puts forward a supplemental algorithm to remove the redundant support vectors of the previous model. Experiments on synthetic data sets, benchmark data sets, and mill load data sets are carried out, and the results illustrate the effectiveness of the proposed sparse method for LSSVM.

#### 1. Introduction

The ball mill pulverizing system is widely used in large and medium scale power plants in China, most of which use tubular ball mills to pulverize the coal. Many design parameters and unmeasured operating parameters affect the pulverizing circuits. Mill load is an important unmeasurable parameter that is closely related to the energy efficiency of the pulverizing system. Much research on measuring the mill load has been presented in the past decades [1, 2]. The soft sensing technique is a new and low cost method, which selects measurable information to estimate the mill load by building models. The most common measurable parameters are the noise and vibration data [3, 4]. However, how to build the forecasting model, which plays a central role in the soft sensing technique, remains a significant problem. So far, researchers have come up with many methods to build the soft sensing model, such as neural networks [5], support vector machines (SVM) [6], partial least squares [7], and the least squares support vector machine (LSSVM) [8]; each of these methods is aimed at specific problems. In this paper, we mainly study the mathematical problems of LSSVM for building the soft sensing model.

LSSVM, proposed by Suykens, was introduced to reduce the computational complexity of SVM. In LSSVM, the inequality constraints of the quadratic programming problem are replaced with equality constraints, so training is faster than for SVM. However, LSSVM has two main drawbacks: its solution lacks sparseness and robustness [9]. These problems increase the training time and reduce the prediction accuracy of the model on real industrial data sets, which have troublesome characteristics such as imbalanced distribution, heteroscedasticity, and the explosion of data. Therefore, this paper focuses on the mathematical problem of how to improve LSSVM sparseness and robustness for real industrial data sets and applies the result to mill load prediction.

Many efforts have been made to mitigate these shortcomings. For example, Suykens et al. introduced a pruning algorithm based on the sorted support value spectrum and proposed a sparse least squares support vector classifier, which gradually removes the training samples with the smallest absolute support values and retrains the reduced network. Later, this method was extended to least squares support vector regression [10]. Meanwhile, weighted LSSVM was presented to improve the robustness of the LSSVM solution [9]. Since then, sparse algorithms based on pruning strategies have multiplied. Kruif and Vries presented a more sophisticated mechanism for selecting support vectors [11], in which the training sample that introduces the smallest approximation error when omitted is pruned. Hoegaerts et al. [12] provided a comparison among these LSSVM pruning algorithms and concluded that pruning schemes can be divided into QR decomposition and searching feature vector. Instead of determining the pruning points by errors, Zeng and Chen [13] introduced the sequential minimal optimization method to omit the datum that leads to the minimum change in a dual objective function. Based on kernel partial least squares identification, Song and Gui [14] presented a method to obtain base vectors by reducing the kernel matrix through Schmidt orthogonalization.

Generally speaking, the algorithms mentioned above all follow a backward pruning strategy. Correspondingly, forward methods that select support vectors iteratively have recently been used to achieve sparseness. Yu et al. [15] provided a sparse LSSVM based on active learning, which greedily selects the datum with the biggest absolute approximation error. Jiao et al. [16] introduced a fast sparse approximation scheme for LSSVM, which picks the sample that makes the largest contribution to the objective function. Later, an improved method [17] based on a partial reduction strategy was extended to LSSVR. Subsequently, a recursive reduced least squares support vector machine (RRLSSVM) and an improved RRLSSVM algorithm (IRRLSSVM) were proposed [18, 19]. Both choose the support vector that leads to the largest reduction in the objective function. The difference between them is that IRRLSSVM updates the weights of the selected support vectors during the selection process, whereas RRLSSVM does not. Additionally, RRLSSVM has been applied to online sparse LSSVM [20].

Backward algorithms have higher computational complexity, since the full-order matrix is gradually decomposed into the submatrix that leads to the minimal increment of the objective function [21]. In contrast, forward algorithms need less computation and less memory, but their convergence has not been proved. In addition, there are methods that sparsify LSSVM from other perspectives, for example, based on genetic algorithms [22, 23] and compressive sampling theory [24].

In the aforementioned sparse algorithms, almost all of which employ the radial basis function (RBF) kernel, the hyperparameters optimized on the original data set remain the same during greedy learning. In other words, greedy learning can be regarded as a process in which the parameters select the support vectors, because the RBF kernel has a local characteristic that measures the Euclidean distance between data points. This suggests another perspective on sparseness. Whatever the algorithm, the initial model with all training data is obtained in advance, and what we finally want is to reconstruct the model with the fewest support vectors while keeping nearly the same approximation accuracy as the initial model. Hence, we attempt to realize the sparseness of LSSVM by reconstructing the support vectors to restore the initial model, and the reconstruction strategy corresponds to the parameters of the RBF kernel. The resulting reconstructed least squares support vector machine for regression problems is abbreviated RCLSSVR. Moreover, the most noticeable innovation is to analyze the features of industrial data sets and introduce RCLSSVR to improve sparseness and robustness simultaneously. Industrial data sets have features such as imbalanced data distribution and heteroscedasticity. These lead to large differences in the sparsification process depending on the position of the datum that is removed or added, which causes iterative algorithms to choose more support vectors in order to gain robustness. By reconstructing the support vectors according to the initial model and the location of the original data, the problems of robustness and sparseness are solved simultaneously.

This paper is organized as follows. In Section 2, the preliminary knowledge is briefly introduced, including the fundamentals of LSSVM and the principle of reduced LSSVM. The characteristics of real industrial conditions and our proposed algorithm are developed in Section 3. Simulations on function approximation problems and benchmark data sets are presented in Section 4. Finally, the paper is concluded in Section 5. To facilitate reading, the necessary abbreviations are listed in the table of Abbreviations.

#### 2. Normal LSSVM and the Reduced LSSVM

Given a training data set $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^{d}$ is the $d$-dimensional input and $y_i \in \mathbb{R}$ is its corresponding target, the goal of function approximation is to find the underlying relation between the input and the target value. Once this relation is found, the outputs corresponding to inputs that are not contained in the training set can be approximated.

In the LSSVM, the relation underlying the data set is represented as a function of the following form:

$$f(x) = w^{T}\varphi(x) + b, \tag{1}$$

where $\varphi(\cdot)$ is a mapping of the vector $x$ to a high dimensional feature space, $b$ is the bias, and $w$ is a weight vector of the same dimension as the feature space. The mapping is commonly nonlinear and makes it possible to approximate nonlinear functions. Mappings that are often used result in an approximation by a radial basis function, by polynomial functions, or by linear functions [25].

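The approximation error for sample $i$ is defined as follows:

$$e_i = y_i - w^{T}\varphi(x_i) - b. \tag{2}$$

In LSSVM, the optimization problem is to search for those weights that give the smallest summed quadratic error of the training samples. The minimization of the error together with the regularization is given as

$$\min_{w,b,e}\; J(w, e) = \frac{1}{2} w^{T} w + \frac{\gamma}{2} \sum_{i=1}^{N} e_i^{2} \tag{3}$$

with the equality constraint

$$y_i = w^{T}\varphi(x_i) + b + e_i, \quad i = 1, \dots, N; \tag{4}$$

here $\gamma$ is the regularization parameter. This constrained optimization problem is solved by introducing an unconstrained Lagrangian function:

$$L(w, b, e; \alpha) = J(w, e) - \sum_{i=1}^{N} \alpha_i \left( w^{T}\varphi(x_i) + b + e_i - y_i \right), \tag{5}$$

where $\alpha_i$ is the Lagrange multiplier of $x_i$. The optimum can be found by setting the derivatives equal to zero:

$$\begin{aligned}
\frac{\partial L}{\partial w} &= 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i \varphi(x_i), \\
\frac{\partial L}{\partial b} &= 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i = 0, \\
\frac{\partial L}{\partial e_i} &= 0 \;\Rightarrow\; \alpha_i = \gamma e_i, \\
\frac{\partial L}{\partial \alpha_i} &= 0 \;\Rightarrow\; w^{T}\varphi(x_i) + b + e_i - y_i = 0.
\end{aligned} \tag{6}$$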

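Eliminating the variables $w$ and $e$, we can get the following linear equations:

$$\begin{bmatrix} 0 & \mathbf{1}^{T} \\ \mathbf{1} & \Omega + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix}, \tag{7}$$

where $y = [y_1, \dots, y_N]^{T}$, $\mathbf{1} = [1, \dots, 1]^{T}$, $\alpha = [\alpha_1, \dots, \alpha_N]^{T}$, $\Omega$ is a square matrix, $\Omega_{ij} = \varphi(x_i)^{T}\varphi(x_j) = K(x_i, x_j)$, $i, j = 1, \dots, N$, and $K(\cdot, \cdot)$ is the kernel function, which satisfies Mercer's condition. It is used to evaluate inner products of the mapping in the input space instead of the feature space. In LSSVM, we can use linear, polynomial, and RBF kernels and others that satisfy Mercer's condition. In this paper we focus on the RBF kernel as follows:

$$K(x_i, x_j) = \exp\left( -\frac{\lVert x_i - x_j \rVert^{2}}{2\sigma^{2}} \right), \tag{8}$$

where $\sigma$ is the kernel width parameter.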

The solution of this set of equations results in a vector of Lagrangian multipliers $\alpha$ and a bias $b$. The output of the approximation can be calculated for new input values of $x$ with $\alpha$ and $b$. The predicted value is derived as follows:

$$\hat{y}(x) = \sum_{i=1}^{N} \alpha_i K(x, x_i) + b. \tag{9}$$
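The training and prediction steps of (7) and (9) can be sketched in a few lines of NumPy. This is a minimal illustration, not the LS-SVMlab implementation used in the experiments: the function names `rbf_kernel`, `lssvm_fit`, and `lssvm_predict` are hypothetical, and the RBF kernel is assumed to take the form $\exp(-\lVert x_i - x_j\rVert^2 / (2\sigma^2))$.

```python
import numpy as np


def rbf_kernel(A, B, sigma):
    """RBF kernel matrix between rows of A and rows of B (assumed 2*sigma^2 scaling)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))


def lssvm_fit(X, y, gamma, sigma):
    """Solve the bordered linear system of LSSVM for the bias b and multipliers alpha."""
    n = len(X)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0                      # first row: sum of multipliers equals zero
    A[1:, 0] = 1.0                      # first column: bias term
    A[1:, 1:] = rbf_kernel(X, X, sigma) + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]              # b, alpha


def lssvm_predict(X_train, alpha, b, X_new, sigma):
    """Kernel expansion over all training samples plus the bias."""
    return rbf_kernel(X_new, X_train, sigma) @ alpha + b
```

In this sketch every training sample keeps a multiplier proportional to its residual, so in general none of them vanishes; this is the lack of sparseness that the reduced formulation addresses.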

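For the sparse LSSVR, the pruning or reduction strategy is applied to the solution by letting $w = \sum_{i \in S} \alpha_i \varphi(x_i)$, where $S$ is an index subset of $\{1, \dots, N\}$. Substituting this into (3) gives the following equation:

$$\min_{b, \alpha_S}\; L(b, \alpha_S) = \frac{1}{2} \alpha_S^{T} \Omega_{SS}\, \alpha_S + \frac{\gamma}{2} \sum_{i=1}^{N} \Big( y_i - \sum_{j \in S} \alpha_j K(x_j, x_i) - b \Big)^{2}, \tag{10}$$

where $\Omega_{SS}$ is the matrix with entries $K(x_i, x_j)$, $i, j \in S$, and $\alpha_S$ is the subset of $\alpha$ with the index subset $S$. Dividing (10) by $\gamma$ and writing it in matrix form, we obtain

$$L(b, \alpha_S) = \frac{1}{2} \begin{bmatrix} b \\ \alpha_S \end{bmatrix}^{T} R \begin{bmatrix} b \\ \alpha_S \end{bmatrix} + \frac{1}{2} \left\lVert y - Z^{T} \begin{bmatrix} b \\ \alpha_S \end{bmatrix} \right\rVert^{2}; \tag{11}$$

here

$$R = \begin{bmatrix} 0 & \mathbf{0}^{T} \\ \mathbf{0} & \Omega_{SS} / \gamma \end{bmatrix}, \qquad Z = \begin{bmatrix} \mathbf{1}^{T} \\ \Omega_{S\cdot} \end{bmatrix},$$

$\Omega_{S\cdot}$ is the $|S| \times N$ matrix with entries $K(x_i, x_j)$, $i \in S$, $j = 1, \dots, N$, and $\mathbf{0}$ is a zero vector of appropriate size. The solution is given by setting the derivatives of (11) with respect to $b$ and $\alpha_S$ to zero:

$$\left( R + Z Z^{T} \right) \begin{bmatrix} b \\ \alpha_S \end{bmatrix} = Z y. \tag{12}$$

So the reduced LSSVR is found:

$$\hat{y}(x) = \sum_{i \in S} \alpha_i K(x, x_i) + b. \tag{13}$$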

Comparing (9) and (13), we can see that only a subset of the training samples contributes to the model, rather than every training sample. Therefore, the computational complexity and the operation time of prediction are decreased.
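A reduced model of the kind described above can be sketched as follows, assuming the index subset has already been chosen. The sketch minimizes the regularized squared error over the bias and the multipliers of the selected samples through the normal equations. The names `reduced_lssvr_fit`, `reduced_lssvr_predict`, and `support_idx`, as well as the $2\sigma^2$ kernel scaling, are assumptions made for illustration and not the authors' implementation.

```python
import numpy as np


def rbf_kernel(A, B, sigma):
    """RBF kernel matrix between rows of A and rows of B (assumed 2*sigma^2 scaling)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))


def reduced_lssvr_fit(X, y, support_idx, gamma, sigma):
    """Fit bias and multipliers for the samples in support_idx only.

    Minimizes (1/(2*gamma)) * a^T K_SS a + 0.5 * ||y - K_NS a - b||^2,
    where the error term still runs over all N training samples.
    """
    K_SN = rbf_kernel(X[support_idx], X, sigma)       # |S| x N
    Z = np.vstack([np.ones(len(X)), K_SN])            # (|S|+1) x N
    R = np.zeros((len(support_idx) + 1,) * 2)
    R[1:, 1:] = K_SN[:, support_idx] / gamma          # regularizer on multipliers only
    theta = np.linalg.solve(R + Z @ Z.T, Z @ y)       # normal equations
    return theta[0], theta[1:]                        # b, alpha_S


def reduced_lssvr_predict(X_support, alpha_S, b, X_new, sigma):
    """Kernel expansion over the selected support vectors only."""
    return rbf_kernel(X_new, X_support, sigma) @ alpha_S + b
```

Prediction with this sketch costs one kernel evaluation per selected sample instead of one per training sample, which is where the saving in operation time comes from.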

#### 3. Reconstruct Least Square Support Vector Regression

##### 3.1. The Characteristic of Industrial Data Set

In real industrial processes, various causes result in imbalanced data distribution and heteroscedasticity, so both features are briefly introduced here. The first feature is the imbalanced distribution, which commonly exists in classification domains, such as spotting unreliable telecommunication customers, text classification, and detection of fraudulent telephone calls [26, 27]. Certain solutions at the data and algorithmic levels have been proposed for the class-imbalance problem [28–30]. It is also a problem in sparse approximation: when a support vector is added to or deleted from different areas, the change in estimation performance differs, and the samples in rare areas are easily cut off during sparsification. The second feature is heteroscedasticity, that is, nonconstant error variance, which has severe consequences: the variance of the parameter estimates is no longer minimal, and the prediction precision decreases [31]. The solutions fall into two categories: data transformation methods and models of heteroscedastic data [32, 33]. In function estimation, the prediction precision is more sensitive to the samples in regions with high variance, and the number of support vectors in high variance regions is usually larger. Finally, the explosion of data, especially of imbalanced and heteroscedastic data sets, has been plaguing machine learning.

Both forward and backward algorithms have certain shortcomings on industrial data sets. Firstly, for imbalanced data sets, since the objective function balances the upper bound of the maximum points to the hyperplane against the minimum mean square error, the rare samples contribute little to the objective function. Hence, data in regions with more samples are more easily selected than data in regions with rare samples. Secondly, for heteroscedastic data sets, relatively more data are selected or retained in areas of large variance than in areas of small variance. Finally, for excessively large data sets, the main problem is the sheer size: the training time is unacceptable for backward greedy methods, and the convergence problem exists in forward greedy algorithms.

##### 3.2. RCLSSVR and DRCLSSVR

In order to solve the aforementioned problems, we propose an efficient method that reconstructs support vectors to restore the original model. From (8), we can see that the common kernel function has a local characteristic, equivalent to normalizing the Euclidean distance between data points by the kernel parameter. Therefore, it is possible to restore the original model by evenly choosing support vectors near the hyperplane and adjusting the kernel parameter $\sigma$. However, convergence is not always stable because the data distribution is unknown; this can be solved by selecting the support vectors directly on the original model instead of from the training data set. Thus, according to the position information of the training data set and the original model, we can rebuild the selected data set $\{(x_i, \hat{y}_i)\}_{i=1}^{N}$, where $\hat{y}_i$ is the estimated value calculated by (9). Then the reconstructed samples in this data set can be selected as the support vectors.

The sparse strategy differs from the forward and backward greedy methods, and the process has two steps. The first step is to select uniformly distributed samples by passing through the entire data set . In the beginning, arbitrarily pick one point from the original data set as the density center and calculate the Euclidean distance vector by (14). Then find all samples within the neighborhood radius of the density center and update the density center based on (15), where is the index set of and also the subset of , and represents the absolute error between the predicted value and the true one. is the ratio of maximum distance, deciding how many support samples to select. is the maximum distance between two samples of the original data set. This means that the datum far away from the density center and near the hyperplane is selected as the new density center. The selecting process terminates once the original data set has been traversed. The second step is to select the support vector from data set that corresponds to the density center in each iteration. Based on the selected support vectors, the regularization parameter and the kernel parameter are optimized again by the leave-one-out method. The flow of RCLSSVR is given in Algorithm 1.

*Algorithm 1 (RCLSSVR).* Input: a training data set , the radius coefficient , the unlabeled data set , the support vectors , and the neighborhood data set .

Output: the support vectors .

(1) Train the original model by solving (7) with the training data set, where the hyperparameter is found by 10-fold cross validation;

(2) obtain the function estimation by (9), and get the reconstructed data set ;

(3) calculate the Euclidean distance matrix and the radius ;

(4) randomly select a density center , and add from to ;

(5) update the unlabeled data set , where and are updated in each iteration;

(6) update the set , and if , choose the next density center according to (15) and add the corresponding datum to ; else, select the sample as the next density center and add the corresponding datum to ;

(7) if , then go to the next step; else go to step ;

(8) optimize the parameter again via leave-one-out cross validation on data set , and rebuild the model.
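The traversal in steps (3) to (7) can be illustrated with a greedy radius-cover selection. The following is only a sketch under stated assumptions and not the authors' procedure: the update rule (15) is not reproduced here, so the score used to pick the next density center (distance to the current center divided by the absolute error) is a hypothetical stand-in that merely favors samples far from the center and close to the fitted model, and the next center is always taken from the samples not yet visited. The names `select_density_centers`, `abs_err`, and `ratio` are likewise hypothetical.

```python
import numpy as np


def select_density_centers(X, abs_err, ratio, seed=0):
    """Greedy radius-cover selection of density centers.

    X       : (N, d) normalized inputs
    abs_err : (N,) absolute error of the initial model on each sample
    ratio   : neighborhood radius as a fraction of the largest pairwise distance
    Returns the indices of the selected density centers.
    """
    X = np.asarray(X, dtype=float)
    abs_err = np.asarray(abs_err, dtype=float)
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    radius = ratio * D.max()

    unlabeled = np.ones(len(X), dtype=bool)
    center = int(rng.integers(len(X)))        # random first density center
    selected = []
    while True:
        selected.append(center)
        unlabeled &= D[center] > radius       # drop the center and its neighborhood
        if not unlabeled.any():               # whole data set traversed
            return selected
        candidates = np.flatnonzero(unlabeled)
        # Hypothetical stand-in for (15): prefer far-away, low-error samples.
        score = D[center, candidates] / (abs_err[candidates] + 1e-12)
        center = int(candidates[np.argmax(score)])
```

With the returned indices `idx` and the initial-model predictions `y_hat`, the reconstructed support vectors would be the pairs `(X[idx], y_hat[idx])`. In this sketch, selected centers are always more than one radius apart and every training sample lies within one radius of some center, so a smaller `ratio` yields more support vectors.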

Generally speaking, the rate of convergence and the performance of RCLSSVR are bound up with the radius coefficient. A smaller value avoids the underfitting problem but also multiplies the number of support vectors, whereas a bigger value reduces the prediction accuracy. Two situations are worth considering after RCLSSVR ends: one is that the prediction precision does not meet the demands; the other is that the data set can still be pruned. Therefore, RCLSSVR takes remedial measures in two ways. The first is to improve the prediction precision by gradually adding samples from data set to the support vectors, where the datum that gives the smallest approximation error when added is selected. The second is to prune the redundant samples while still meeting the precision requirement. We define density indicators to decide which sample is omitted, where is the index of the support vector ; the value of is the same as the parameter . In the process of pruning, the sample with the biggest value in the is pruned. The remedial method is named DRCLSSVR, which is described in Algorithm 2.

*Algorithm 2 (DRCLSSVR).* (1) Compare the performance with the set value; if it is less, go to ; else go to ;

(2) calculate the density of all support values and sort them; remove the sample with the biggest value in the ;

(3) retrain the LSSVM based on the reduced support vectors;

(4) go to , unless the performance degrades below the set value;

(5) determine the training sample from and add the datum that increases the performance the most after selection;

(6) retrain the LSSVM based on the added support vectors;

(7) go to , unless the performance reaches the set value.

#### 4. Experimental Results

In order to verify the performance of the proposed RCLSSVR and DRCLSSVR, several kinds of experiments are performed. In Section 4.1, the influence of the radius coefficient on sparse performance is studied. In Section 4.2, two backward algorithms and one forward algorithm are selected for comparative experiments on synthetic data sets and benchmark data sets. In Section 4.3, RCLSSVR is applied to the mill load data set. For comparison purposes, all experiments are run on an Intel Core i5-4460 CPU @ 3.20 GHz with 4.00 GB RAM under Windows 7, in a Matlab 2014a environment with the LS-SVMlab v1.8 toolbox from http://www.esat.kuleuven.be/sista/lssvmlab/. The comparison algorithms are normal LSSVM, Suykens pruning algorithm (SLSSVM) [14], the classic backward algorithm (PLSSVM) [16], and IRRLSSVM [23]. Among them, PLSSVM is expected to perform best, but it is an extremely expensive algorithm. The RBF kernel is used in all of the experiments, and the parameters are optimized by the leave-one-out cross validation strategy [22]. In addition, two performance indexes, that is, the root mean squared error (RMSE) and , are defined to evaluate these algorithms, where is the RMSE of normal LSSVM.

##### 4.1. Experiment 1: The Performance with Different

In this subsection, we use the sinc function to investigate the performance of the proposed algorithm with different values of the radius coefficient. Since DRCLSSVR is a supplement to RCLSSVR, the experiment only explores the performance of RCLSSVR with different parameters. This shows the influence of the parameter in situations where more than one pass is made through the data set. In other words, until a certain number of vectors is reached, we will let when . The relation between the inputs and outputs of the sinc function is described as follows: where , sampled at equal intervals, 300 data points in total, and is Gaussian noise with mean 0 and variance 0.5. For this data set, we randomly select 200 samples as the training data set and use the others as the testing data set.
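A data set of this kind can be generated with the short sketch below. Several details are assumptions made only for illustration: the input interval $[-10, 10]$, the unnormalized form $\sin(x)/x$ of the sinc function, the random seed, and the helper name `make_sinc_data`. The sample counts and the noise variance follow the description above.

```python
import numpy as np


def make_sinc_data(n_total=300, n_train=200, x_range=(-10.0, 10.0),
                   noise_var=0.5, seed=0):
    """Noisy sinc samples on an equally spaced grid, split at random.

    x_range is an assumed interval; np.sinc(x / pi) equals sin(x) / x.
    """
    rng = np.random.default_rng(seed)
    x = np.linspace(x_range[0], x_range[1], n_total)
    noise = rng.normal(0.0, np.sqrt(noise_var), n_total)   # zero-mean Gaussian noise
    y = np.sinc(x / np.pi) + noise
    perm = rng.permutation(n_total)                         # random train/test split
    train, test = perm[:n_train], perm[n_train:]
    return x[train], y[train], x[test], y[test]
```

Calling `make_sinc_data()` returns 200 training pairs and 100 testing pairs drawn from the same grid without overlap.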

In order to improve the generalization ability, it is necessary to normalize the attributes of the training data sets into the closed interval when calculating the Euclidean distance vector . The radius coefficient reflects the number of support vectors as a proportion of the training data set and ranges from 0.1 to 0.3 at intervals of 0.05. The simulation results of RCLSSVR are plotted in Figure 1.
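The preprocessing described above can be sketched as follows. The target interval $[0, 1]$ is an assumption of this sketch, as are the helper names `minmax_scale` and `pairwise_distances`; the loop at the end simply enumerates the coefficient values from 0.1 to 0.3 in steps of 0.05 and the neighborhood radius each one implies.

```python
import numpy as np


def minmax_scale(X):
    """Scale every attribute (column) to [0, 1]; constant columns map to 0."""
    X = np.asarray(X, dtype=float)
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0               # avoid division by zero
    return (X - lo) / span


def pairwise_distances(X):
    """Full Euclidean distance matrix between the rows of X."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)


if __name__ == "__main__":
    X = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
    D = pairwise_distances(minmax_scale(X))
    for ratio in np.arange(0.10, 0.301, 0.05):
        print(f"ratio={ratio:.2f}  radius={ratio * D.max():.4f}")
```

Scaling before measuring distances keeps attributes with large numeric ranges from dominating the neighborhood radius.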