Abstract
The development of sparse techniques for complex nonlinear high-dimensional data remains a major challenge. In this paper, we propose a novel feature selection method for nonlinear support vector regression, called FSNSVR, which is the first attempt to solve the nonlinear feature selection problem in the regression setting. FSNSVR preserves the representative features selected in a complex nonlinear system because it applies a feature selection matrix in the original space. FSNSVR is a challenging mixed-integer programming problem that we solve efficiently with an alternate iterative greedy algorithm. Experimental results on three artificial datasets and five real-world datasets confirm that FSNSVR effectively selects representative features and discards redundant features in a nonlinear system. FSNSVR outperforms L_{1}-norm support vector regression, L_{1}-norm least squares support vector regression, and L_{p}-norm support vector regression in both feature selection ability and regression performance.
1. Introduction
High-dimensional data commonly arise in diverse fields, such as finance [1], economics [2], biology [3], and medicine [4]. Complex nonlinear relationships between features may exist in high-dimensional datasets [5]. For example, most economic and financial time series exhibit nonlinear behavior [6]. Ling et al. explored the nonlinear relationship between globalization, natural resources, financial development, and carbon emissions [6]. Complex nonlinear high-dimensional data also arise in medicine, where medical costs have a sophisticated relation with features [7], and in biology, where nonlinear relations between features can depict biological relationships more precisely and reflect critical patterns in biological systems [8].
Complex nonlinear high-dimensional data may include irrelevant and redundant features, which can reduce the effectiveness of data mining and detract from the quality of the results [9–11]. Thus, complex nonlinear high-dimensional data call for sparse techniques. Feature selection, a useful sparse technique, focuses attention on a subset of useful features and ignores the rest [12–17]. In general, feature selection methods are classified as filter, wrapper, and embedded methods [16–18]. Embedded methods are popular because they conduct feature selection and the learning task simultaneously [19].
Sparse support vector regression, a branch of the sparse support vector machine [20–23], is a computationally powerful feature selection method. It adopts a sparse regularization term to realize feature selection and regression simultaneously and is therefore an embedded feature selection method [24–26]. L_{1}-norm support vector regression (L_{1}-SVR) [27] and L_{1}-norm least squares support vector regression (L_{1}-LSSVR) [28] use an L_{1}-norm regularization term to shrink some coefficients of the regression estimators towards 0. From the regression estimators, the contribution of each feature to the final decision function can be judged; the useful features are then selected, while the irrelevant and redundant features are discarded. To improve the sparseness of L_{1}-SVR, Zhang et al. [29] proposed L_{p}-norm support vector regression (L_{p}-SVR) (0 < p < 1). The L_{p}-norm regularization term shrinks more coefficients of the regression estimators towards 0, some of them to exactly 0, so that more irrelevant and redundant features are discarded. However, L_{1}-SVR, L_{1}-LSSVR, and L_{p}-SVR only solve the linear feature selection problem, which is not always suitable for complex nonlinear cases.
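The shrinkage effect of L_{1}-norm regularization can be illustrated with a minimal sketch. This is not the L_{1}-SVR model itself but the closely related L_{1}-regularized least squares problem, solved here by iterative soft-thresholding (ISTA); all function names and data are illustrative assumptions. The point it demonstrates is the one made above: coefficients of irrelevant features are driven exactly to zero, which is what makes such an estimator an embedded feature selector.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding, the proximal operator of the L1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def l1_least_squares(X, y, lam, n_iter=500):
    """ISTA for min_w 0.5*||Xw - y||^2 + lam*||w||_1.
    The L1 term shrinks coefficients of irrelevant features exactly to zero."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y)
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Toy data: only the first two of five features influence the response.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.01 * rng.standard_normal(100)
w = l1_least_squares(X, y, lam=5.0)
selected = np.flatnonzero(np.abs(w) > 1e-6)
```

On this toy problem the three irrelevant coefficients end up exactly at zero, while the two informative ones are retained with a small shrinkage bias.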
To solve the feature selection problem for complex nonlinear high-dimensional data, we follow the spirit of nonlinear support vector machine-based feature selection [9, 30, 31] and propose a novel feature selection method for nonlinear support vector regression, called FSNSVR. We bring a feature selection matrix, a diagonal matrix with elements of either 1 or 0, into nonlinear support vector regression. As a result, FSNSVR becomes a mixed-integer programming problem (MPP). To solve FSNSVR efficiently, we employ an alternate iterative greedy algorithm to find a local optimum [32], in which we iteratively solve the standard SVR problem and a smaller nonconvex feature selection problem. In addition, a feature-ranking strategy is adopted [33], which ranks the features according to their contributions to the objective of the MPP. The experimental results show that FSNSVR selects the most appropriate representative features under highly complex nonlinear relationships, with smaller estimation errors than those produced by L_{1}-SVR, L_{1}-LSSVR, and L_{p}-SVR. This means that FSNSVR not only selects the representative features but also achieves good regression performance. The contributions of this paper are summarized as follows:
(1) Bringing a feature selection matrix into nonlinear support vector regression, we propose a novel feature selection method that identifies complex nonlinear relationships between features in their original space. The proposed model is the first attempt to solve the nonlinear feature selection problem in the regression setting.
(2) The proposed model is a complex mixed-integer programming problem. To ensure the efficiency of the learning process, we employ an alternate iterative greedy algorithm to find a local optimum of the proposed model. The algorithm, which transforms the complex mixed-integer problem into a min-max optimization problem, effectively reduces the computational cost.
(3) Experimental results on both artificial and real-world datasets indicate that the proposed model preserves the representative features selected in a complex nonlinear system and outperforms three linear feature selection methods in both feature selection and regression. The training speed of the proposed method confirms the efficiency of the alternate iterative greedy algorithm.
The remainder of this paper is organized as follows: Section 2 briefly reviews support vector regression. In Section 3, we propose feature selection for nonlinear support vector regression. Section 4 reports experiments on artificial and real-world datasets, and Section 5 concludes the paper.
2. Background
Starting with notation, we consider a regression problem in the n-dimensional real vector space R^n. Suppose that y = (y_1, ..., y_m)^T ∈ R^m is the response vector, X = (x_1, ..., x_m)^T ∈ R^{m×n} is a known design matrix of covariates, and (x_i, y_i), i = 1, ..., m, are the n-dimensional training samples. Next, we briefly review support vector regression (SVR) [26], which is closely related to FSNSVR.
The optimal nonlinear regression function of SVR is constructed as follows:

f(x) = w^T φ(x) + b, (1)

where w and b are the unknown parameters and φ(·) is the feature map induced by an appropriately chosen kernel K(x, x') = φ(x)^T φ(x'). The parameters in function (1) are estimated by solving the following optimization problem:

min_{w, b, ξ, ξ*}  (1/2) w^T w + C Σ_{i=1}^{m} (ξ_i + ξ_i*)
s.t.  y_i − w^T φ(x_i) − b ≤ ε + ξ_i,
      w^T φ(x_i) + b − y_i ≤ ε + ξ_i*,
      ξ_i ≥ 0, ξ_i* ≥ 0, i = 1, ..., m, (2)

where ξ_i and ξ_i* are the slack variables and C > 0 is a parameter determining the trade-off between the empirical risk and the regularization term. To derive the dual formulation of SVR, we first introduce the Lagrangian function for problem (2), which is

L = (1/2) w^T w + C Σ_i (ξ_i + ξ_i*) − Σ_i α_i (ε + ξ_i − y_i + w^T φ(x_i) + b) − Σ_i α_i* (ε + ξ_i* + y_i − w^T φ(x_i) − b) − Σ_i (η_i ξ_i + η_i* ξ_i*), (3)

where α, α*, η, and η* are the nonnegative Lagrangian multiplier vectors. The Karush–Kuhn–Tucker (KKT) necessary and sufficient optimality conditions for problem (2) are given by

∂L/∂w = w − Σ_i (α_i − α_i*) φ(x_i) = 0,
∂L/∂b = Σ_i (α_i* − α_i) = 0,
∂L/∂ξ_i = C − α_i − η_i = 0,
∂L/∂ξ_i* = C − α_i* − η_i* = 0. (4)

According to the previously mentioned KKT conditions, we obtain the dual formulation of problem (2) as follows:

max_{α, α*}  −(1/2) Σ_{i,j} (α_i − α_i*)(α_j − α_j*) K(x_i, x_j) − ε Σ_i (α_i + α_i*) + Σ_i y_i (α_i − α_i*)
s.t.  Σ_i (α_i − α_i*) = 0,  0 ≤ α_i, α_i* ≤ C, i = 1, ..., m. (5)

Then, w can be obtained from the solution α and α* of (5) by

w = Σ_i (α_i − α_i*) φ(x_i). (6)

For any solution α, α* to (5) and w given by (6), the bias b of problem (2) can be obtained in the following way:
(1) For any nonzero component α_i ∈ (0, C),

b = y_i − Σ_j (α_j − α_j*) K(x_j, x_i) − ε. (7)

(2) For any nonzero component α_i* ∈ (0, C),

b = y_i − Σ_j (α_j − α_j*) K(x_j, x_i) + ε. (8)

The final decision function can be constructed as

f(x) = Σ_i (α_i − α_i*) K(x_i, x) + b. (9)
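As an illustration of how the kernel matrix and the dual variables combine in the decision function, the following sketch fits the dual in the single variable β_i = α_i − α_i*, dropping the bias term b so that the equality constraint disappears. The proximal projected-gradient solver and all names here are illustrative assumptions, not the solver used in the paper.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def svr_dual_no_bias(K, y, C, eps, n_iter=2000):
    """Proximal projected-gradient ascent on the bias-free eps-SVR dual in
    beta_i = alpha_i - alpha_i*:
        max  -0.5 * beta' K beta + y' beta - eps * ||beta||_1,  |beta_i| <= C."""
    step = 1.0 / (np.linalg.eigvalsh(K)[-1] + 1e-12)
    beta = np.zeros(len(y))
    for _ in range(n_iter):
        z = beta + step * (y - K @ beta)                          # ascend smooth part
        z = np.sign(z) * np.maximum(np.abs(z) - step * eps, 0.0)  # prox of eps*L1
        beta = np.clip(z, -C, C)                                  # project onto box
    return beta

# Fit y = sin(x) on a 1-D toy problem; decision values are f(x_i) = (K beta)_i.
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(-3.0, 3.0, (60, 1)), axis=0)
y = np.sin(X[:, 0])
K = gaussian_kernel(X, X, gamma=1.0)
beta = svr_dual_no_bias(K, y, C=10.0, eps=0.01)
y_hat = K @ beta
```

The box constraint keeps every |β_i| ≤ C, and the ε-insensitive L1 term leaves many β_i at exactly zero, which is the source of the sparse support-vector expansion in (9).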
3. Feature Selection for Nonlinear Support Vector Regression
3.1. Problem Formulation
In this section, we propose feature selection for nonlinear support vector regression. Let D = diag(d_1, ..., d_n) be an n × n feature selection matrix whose diagonal elements d_j ∈ {0, 1} indicate whether the j-th feature is selected. We consider the following nonlinear regression function:

f(x) = w^T φ(Dx) + b, (10)

where w and b are the unknown parameters that need to be estimated and φ(·) is the feature map induced by an appropriately chosen kernel K(x, x') = φ(x)^T φ(x'). The optimal feature selection matrix D also needs to be searched simultaneously.

The estimator of the regression function (10) can be defined as the solution to the FSNSVR optimization problem:

min_{w, b, ξ, ξ*, D}  (1/2) w^T w + C Σ_{i=1}^{m} (ξ_i + ξ_i*)
s.t.  y_i − w^T φ(Dx_i) − b ≤ ε + ξ_i,
      w^T φ(Dx_i) + b − y_i ≤ ε + ξ_i*,
      ξ_i ≥ 0, ξ_i* ≥ 0, i = 1, ..., m,
      D = diag(d_1, ..., d_n), d_j ∈ {0, 1}, (11)

where ξ_i and ξ_i* are slack variables and C > 0 is a parameter determining the trade-off between the empirical risk and the regularization term. In fact, the feature selection matrix D defines a subspace spanned by the selected features. Minimizing the regularization term in objective (11) over this subspace has the beneficial effect of suppressing variables and producing a sparse set of nonzero feature weights. Therefore, FSNSVR has nonlinear feature selection ability.
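The effect of D inside the kernel can be checked numerically: for a Gaussian kernel, evaluating K(Dx_i, Dx_j) is the same as computing the kernel on the selected columns only, since the zeroed coordinates contribute nothing to the squared distance. A small sketch (names are illustrative):

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 4))
d = np.array([1, 0, 1, 0])   # diagonal of the feature selection matrix D
D = np.diag(d)

# Kernel evaluated on the masked samples Dx_i ...
K_masked = gaussian_kernel(X @ D, X @ D, gamma=0.5)
# ... equals the kernel evaluated on the selected columns only.
K_sub = gaussian_kernel(X[:, d == 1], X[:, d == 1], gamma=0.5)
```

This equivalence is why fixing D reduces the inner problem to a standard SVR on the reduced feature subspace.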
3.2. Problem Solution
Obviously, the FSNSVR optimization problem is a mixed-integer programming problem. We reformulate problem (11) as follows:

min_{D = diag(d), d_j ∈ {0,1}}  min_{w, b, ξ, ξ*}  (1/2) w^T w + C Σ_{i=1}^{m} (ξ_i + ξ_i*)
s.t.  y_i − w^T φ(Dx_i) − b ≤ ε + ξ_i,
      w^T φ(Dx_i) + b − y_i ≤ ε + ξ_i*,
      ξ_i ≥ 0, ξ_i* ≥ 0, i = 1, ..., m. (12)

Solving problem (12) to global optimality is highly challenging and impractical [24]. We employ an alternate iterative greedy algorithm to find a local optimum. First, we fix the integer part D and obtain the solution to (12), which amounts to solving the problem in the same manner as SVR. Following the deduction process of nonlinear SVR in Section 2, we obtain the dual formulation of the inner minimization problem. Then, problem (12) can be rewritten as

min_{D}  max_{α, α*}  −(1/2) Σ_{i,j} (α_i − α_i*)(α_j − α_j*) K(Dx_i, Dx_j) − ε Σ_i (α_i + α_i*) + Σ_i y_i (α_i − α_i*)
s.t.  Σ_i (α_i − α_i*) = 0,  0 ≤ α_i, α_i* ≤ C, i = 1, ..., m,
      D = diag(d_1, ..., d_n), d_j ∈ {0, 1}. (13)

Obviously, problem (13) is a challenging min-max optimization problem. Fixing the optimal solution (α, α*) of the inner maximization problem, we obtain the outer minimization integer problem, which would require exhaustively computing the objective for all 2^n possible choices of D.
To make the greedy algorithm work, we follow the strategy in [33] to initialize D so that the algorithm is more stable. After solving the SVR subproblem, a value c_j of the j-th feature is computed according to equation (14).

The score s_j of the j-th feature, which reflects its importance among all the features, is computed by equation (15).

The greedy algorithm starts from an initial D generated by (15): if the score s_j is below a given threshold, then d_j = 0; otherwise, d_j = 1. We then fix D and solve problem (13) to obtain (α, α*). We calculate c_j and s_j according to (14) and (15), respectively. D is updated, i.e., replaced by the new candidate, if objective (13) decreases by more than the tolerance. After updating D, (α, α*) is computed again. The algorithm terminates when objective (13) decreases by less than the tolerance. We summarize this greedy approach in Algorithm 1 as the feature selection method for nonlinear support vector regression. The proof of convergence of the greedy approach in Algorithm 1 can be obtained from Mangasarian and Kou [32].
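The alternating scheme described above can be sketched as follows. This is a structural sketch only: `solve_svr` and `feature_score` are placeholders for the SVR dual solver and the scoring rule of equations (14)–(15), and the toy stand-ins below (a masked least-squares fit scored by absolute coefficients) are illustrative assumptions, not the paper's formulas.

```python
import numpy as np

def alternate_greedy(X, y, solve_svr, feature_score, n_keep, tol=1e-4, max_iter=20):
    """Alternate between fixing the 0/1 mask d and solving the regression
    subproblem; accept a new mask only if the objective drops by > tol."""
    n = X.shape[1]
    # Initialization: solve on all features, then keep the top-scoring ones.
    beta, obj = solve_svr(X, y, np.ones(n))
    d = np.zeros(n)
    d[np.argsort(feature_score(X, y, beta))[-n_keep:]] = 1.0
    beta, obj = solve_svr(X, y, d)
    for _ in range(max_iter):
        d_new = np.zeros(n)
        d_new[np.argsort(feature_score(X, y, beta))[-n_keep:]] = 1.0
        beta_new, obj_new = solve_svr(X, y, d_new)
        if obj - obj_new > tol:        # sufficient decrease: accept the update
            d, beta, obj = d_new, beta_new, obj_new
        else:                          # otherwise terminate
            break
    return d, beta

# Toy stand-ins: least-squares fit on masked features, scored by |coefficient|.
def toy_solver(X, y, d):
    Xd = X * d
    w, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    r = Xd @ w - y
    return w, float(r @ r)

def toy_score(X, y, w):
    return np.abs(w)

rng = np.random.default_rng(3)
X = rng.standard_normal((80, 6))
y = X[:, 0] - 2.0 * X[:, 2]   # only features 0 and 2 are relevant
d, w = alternate_greedy(X, y, toy_solver, toy_score, n_keep=2)
```

On this noiseless toy problem the loop settles on the mask selecting exactly features 0 and 2 and then stops, since the objective no longer decreases.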

After solving problem (13), w can be obtained by

w = Σ_i (α_i − α_i*) φ(Dx_i). (16)

For any solution α, α* to (13) and w given by (16), the bias b of problem (12) can be obtained in the following way:
(1) For any nonzero component α_i ∈ (0, C),

b = y_i − Σ_j (α_j − α_j*) K(Dx_j, Dx_i) − ε. (17)

(2) For any nonzero component α_i* ∈ (0, C),

b = y_i − Σ_j (α_j − α_j*) K(Dx_j, Dx_i) + ε. (18)

The final decision function can be constructed as

f(x) = Σ_i (α_i − α_i*) K(Dx_i, Dx) + b. (19)
3.3. Computational Complexity
Concerning the computational complexity of FSNSVR, the algorithm consists of two parts: one repeatedly solves the inner maximization problem of (13), and the other repeatedly computes (14) and (15). The first part requires solving one quadratic programming problem per iteration, with a time complexity of approximately O(m^3). The second part is easy to compute with (α, α*) fixed, and its cost per iteration is at most linear in the number of features n.
4. Experimental Results
To test the feature selection and regression effectiveness of the proposed FSNSVR, we compare it with L_{1}-SVR [27], L_{1}-LSSVR [28], and L_{p}-SVR [29] on three artificial datasets and five real-world datasets. L_{1}-SVR, L_{p}-SVR, and L_{1}-LSSVR are embedded linear feature selection methods. All of these methods are implemented in the MATLAB R2019b environment on a PC running a 64-bit Windows OS with a 1.6 GHz Intel(R) processor and 16 GB of RAM.
For the feature selection of nonlinear support vector regression, we employ a Gaussian kernel whose kernel parameter is selected from a prescribed candidate set; parameter C is selected from the same set. The insensitive parameter ε is fixed at 0.01. The optimal parameter values in the experiments are obtained by the grid search method.
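The grid search method referred to above amounts to an exhaustive sweep over all (kernel parameter, C) pairs, keeping the pair with the smallest validation error. A minimal sketch follows; the `fit`/`score` stand-ins and the power-of-two candidate grid are illustrative assumptions, not the paper's actual setup.

```python
from itertools import product
import math

def grid_search(fit, score, train, valid, gammas, Cs):
    """Exhaustive grid search: fit a model for every (gamma, C) pair on the
    training data and keep the pair with the smallest validation error."""
    best_params, best_err = None, math.inf
    for g, c in product(gammas, Cs):
        model = fit(train, g, c)
        err = score(model, valid)
        if err < best_err:
            best_params, best_err = (g, c), err
    return best_params, best_err

# Toy stand-ins (illustrative): the "model" is just the parameter pair and the
# validation "error" is smallest at gamma = 1, C = 4 by construction.
fit = lambda train, g, c: (g, c)
score = lambda model, valid: (model[0] - 1.0) ** 2 + (model[1] - 4.0) ** 2
best_params, best_err = grid_search(fit, score, None, None,
                                    gammas=[2**k for k in range(-3, 4)],
                                    Cs=[2**k for k in range(-3, 4)])
```

In the real experiments, `fit` would train the regressor on the training split and `score` would return a validation criterion such as NMSE.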
Let m' be the number of test samples, y_i the i-th test response, ŷ_i the predicted value of y_i, and ȳ the average value of the test responses. We use the following evaluation criteria to evaluate the variable selection and regression results:
P_1: the proportion of simulation runs in which the truly nonzero coefficients are selected.
R^2: the coefficient of determination.
NMSE: the normalized mean squared error.
RMSE: the root mean squared error.
Thus, the smaller the values of NMSE and RMSE are, the more statistical information is captured from the selected variables.
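The three regression criteria can be computed as in the sketch below. The R^2 and RMSE formulas are the standard ones; writing NMSE as the residual sum of squares normalized by the total sum of squares (i.e., 1 − R^2) is an assumed but common convention, since the paper's exact display formula is not preserved here.

```python
import numpy as np

def regression_metrics(y, y_hat):
    """R^2, NMSE, and RMSE. NMSE is taken as the residual sum of squares
    divided by the total sum of squares (= 1 - R^2), an assumed convention."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    nmse = ss_res / ss_tot
    rmse = np.sqrt(np.mean((y - y_hat) ** 2))
    return r2, nmse, rmse

# Tiny worked example: one prediction off by 1.
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.0, 2.0, 3.0, 5.0])
r2, nmse, rmse = regression_metrics(y, y_hat)
```

Under this convention, a smaller NMSE corresponds directly to a larger R^2, consistent with how the two criteria move together in the tables.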
4.1. Artificial Datasets
To test the nonlinear feature selection performance of FSNSVR, we generate three artificial datasets (Type A, Type B, and Type C) from three nonlinear regression models. The specifications of these artificial datasets are listed in Table 1.
To evaluate the feature selection results, we adopt the following criteria:

precision = TP / (TP + FP),  recall = TP / (TP + FN),

where TP denotes true positives, FP false positives, FN false negatives, and TN true negatives. Precision and recall are commonly used to present results for binary decision problems in machine learning, since they give an accurate evaluation of an algorithm's performance. Here, we use precision and recall to evaluate the feature selection results of FSNSVR, L_{1}-SVR, L_{p}-SVR, and L_{1}-LSSVR.
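Applied to feature selection, a "positive" is a feature flagged as relevant, so the two criteria compare the set of selected feature indices against the set of truly relevant ones. A minimal sketch (names are illustrative):

```python
def selection_precision_recall(selected, true_features):
    """Precision = TP/(TP+FP) and Recall = TP/(TP+FN) over feature indices."""
    sel, true = set(selected), set(true_features)
    tp = len(sel & true)   # correctly selected features
    fp = len(sel - true)   # selected but actually irrelevant
    fn = len(true - sel)   # relevant but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Features 0 and 2 are correctly selected; 5 is a false positive; 1 is missed.
p, r = selection_precision_recall([0, 2, 5], [0, 1, 2])
```

In this example both precision and recall equal 2/3: two of the three selected features are truly relevant, and two of the three relevant features are recovered.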
The best parameters of FSNSVR, L_{1}-SVR, L_{p}-SVR, and L_{1}-LSSVR on the artificial datasets are shown in Table 2. The feature selection and regression results on the three artificial datasets are shown in Table 3. From Table 3, we find that FSNSVR achieves higher precision and recall than L_{1}-SVR, L_{p}-SVR, and L_{1}-LSSVR. Meanwhile, FSNSVR obtains a larger R^2 and a smaller NMSE than L_{1}-SVR, L_{p}-SVR, and L_{1}-LSSVR. It is clear that FSNSVR is able to select the representative features and discard the redundant features in a nonlinear system. Therefore, FSNSVR is suitable for solving the nonlinear feature selection problem, while L_{1}-SVR, L_{p}-SVR, and L_{1}-LSSVR are not suitable for complex nonlinear high-dimensional data. In terms of running time, although FSNSVR trains more slowly than L_{1}-LSSVR, it is significantly faster than L_{1}-SVR and L_{p}-SVR.
4.2. Parameters and Nonlinear Feature Selection Analysis
In this part, we analyze the effects of the kernel parameter and C on the nonlinear feature selection results. To test the influence of the kernel parameter on NMSE, R^2, precision, and recall, we first fix parameter C at the optimal value used in the experiments on the artificial datasets. Figures 1–3 illustrate the influence of the kernel parameter on the nonlinear feature selection results for Type A, Type B, and Type C, respectively. From Figures 1–3, we find that as the kernel parameter increases, the NMSE value first decreases and then increases, while R^2 and precision first increase and then decrease, which means that the kernel parameter has a strong influence on the feature selection ability of FSNSVR. When the kernel parameter is set to its optimal value, precision and recall reach their maximum values, which means that FSNSVR selects the representative features and discards the irrelevant ones. Although the number of selected features is small, FSNSVR selects the representative features in the datasets.
To further test the influence of parameter C on the FSNSVR feature selection results, we fix the kernel parameter at the optimal value used in the experiments on the artificial datasets. Figures 4–6 show the influence of C on NMSE, R^2, precision, and recall. From Figures 4–6, we observe that as parameter C increases, NMSE decreases and then remains constant. When C = 2, precision reaches its maximum value and then remains constant, which means that FSNSVR selects the representative features and discards the irrelevant ones.
4.3. RealWorld Datasets
To further test the feature selection and regression performance of FSNSVR, we consider five real-world datasets from the UC Irvine (UCI) Machine Learning Repository [34]. Table 1 displays the dataset information, including the numbers of training samples, test samples, and features. The best parameters of FSNSVR, L_{1}-SVR, L_{p}-SVR, and L_{1}-LSSVR on the real-world datasets are shown in Table 4.
Table 5 lists the feature selection and regression results of FSNSVR, L_{1}-SVR, L_{p}-SVR, and L_{1}-LSSVR. One can easily observe that FSNSVR selects fewer features than L_{1}-SVR, L_{p}-SVR, and L_{1}-LSSVR while obtaining small NMSE and RMSE values comparable with those of the other methods. FSNSVR selects very few useful features yet captures the nonlinear statistical information in the test datasets, realizing feature selection and regression simultaneously due to its inherent feature selection property. Faced with complex nonlinear high-dimensional datasets, L_{1}-SVR, L_{p}-SVR, and L_{1}-LSSVR struggle, since they can only solve the linear version of the feature selection problem. Regarding running time, FSNSVR is significantly faster than L_{1}-SVR and L_{p}-SVR.
5. Conclusion
Our paper focused on the nonlinear feature selection problem posed by high-dimensional data, especially when complex nonlinear relationships between features exist. To solve this problem, we brought a feature selection matrix in the original space into nonlinear support vector regression and proposed a novel feature selection method for nonlinear support vector regression (FSNSVR). FSNSVR is a mixed-integer programming problem (MPP). To solve FSNSVR efficiently, we employed an alternate iterative greedy algorithm to find a local optimum. The feature selection matrix and the supervised selection process of FSNSVR ensure that the representative features are selected and the redundant features are discarded automatically. The feature selection and regression performance of FSNSVR on the artificial and real-world datasets confirmed its sparseness and effectiveness.
The proposed method also has limitations that should be addressed in future studies. First, FSNSVR is not suitable for the heterogeneity problem of nonlinear high-dimensional data; the spirit of quantile regression [35–37] could be brought into the nonlinear feature selection framework. Second, more efficient methods for solving FSNSVR are needed, since the training speed of the current method is not fast enough for large-scale datasets. Third, from an application perspective, how to use FSNSVR to deal with real-world nonlinear feature selection problems remains a question for future work.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Nos. 11871183, 61866010, 61603338, and 12101552), the National Social Science Foundation of China (No. 21BJY256), the Philosophy and Social Sciences Leading Talent Training Project of Zhejiang Province (No. 21YJRC071YB), the Natural Science Foundation of Zhejiang Province (No. LY21F030013), and the Natural Science Foundation of Hainan Province (No. 120RC449).