Feature Selection Method Based on Partial Least Squares and Analysis of Traditional Chinese Medicine Data
The partial least squares method has many advantages in multivariable linear regression, but it does not include the function of feature selection. This method cannot screen for the best feature subset (referred to in this study as the “Gold Standard”) or optimize the model, although contrarily using the L1 norm can achieve the sparse representation of parameters, leading to feature selection. In this study, a feature selection method based on partial least squares is proposed. In the new method, exploiting partial least squares allows extraction of the latent variables required for performing multivariable linear regression, and this method applies the L1 regular term constraint to the sum of the absolute values of the regression coefficients. This technique is then combined with the coordinate descent method to perform multiple iterations to select a better feature subset. Analyzing traditional Chinese medicine data and University of California, Irvine (UCI), datasets with the model, the experimental results show that the feature selection method based on partial least squares exhibits preferable adaptability for traditional Chinese medicine data and UCI datasets.
In the era of rapid information technology development, data have become increasingly important. As one of the key techniques of data mining, statistical analysis methods have received extensive attention in the fields of biomedicine, physical chemistry, and traditional Chinese medicine [1–3]. One target variable, however, is often affected by other features, which exert different degrees of influence. Traditional Chinese medicine data with multicollinearity, meanwhile, include somewhat irrelevant as well as redundant information that not only increases the time and space complexity of the model but also seriously affects its accuracy and operational efficiency. In conventional statistical analysis methods, calculating the regression coefficients has the partial advantage of reflecting the relationships between features when dealing with such data [4, 5], whereas for data containing irrelevant and redundant features, feature selection may be achieved only minimally; in which case, we lose the chance to achieve model optimization and improved regression accuracy. Therefore, given the multicollinearity of traditional Chinese medicine data as well as the problem of irrelevant and redundant content, there is an urgent need to find a method of data analysis that can remove irrelevant and redundant features from the original dataset, thereby overcoming multicollinearity and screening out the “Gold Standard” feature subset in order to construct a robust model.
The remainder of this manuscript is organized as follows. Related research is introduced in Section 2. In Section 3, the new model is described in detail. In Section 4, 3 traditional Chinese medicine datasets and 3 public UCI datasets are used in the new model and subjected to experimental analysis. The new model is compared with several existing algorithms in order to further verify its feasibility and effectiveness. The final section concludes this study with a brief summary and discussion.
2. Related Work
Feature selection, as an effective dimension reduction method, involves choosing a subset of features from the original set that has an acceptable distinguishing capability based on a particular standard [6, 7], thereby retaining the features that are most effective and favorable for regression (or classification) while decreasing the complexity of the algorithm. This method has attracted the attention of numerous researchers. In the medical field, for example, Peng et al.  have developed a multimodal feature selection method based on a hypergraph for multitask feature selection in order to choose effective brain region data; Ye et al.  have proposed an informative gene selection method based on symmetric uncertainty and support vector machine (SVM) recursive feature elimination that can effectively remove irrelevant genes; and Zhang et al.  proposed a hybrid feature selection algorithm that can select genetic subsets with strong classification ability. In addition, feature selection methods have successfully been applied in other fields. Hu et al.  proposed a feature selection algorithm for joint spectral clustering and neighborhood mutual information that can remove features unrelated to markers; Huang et al.  proposed a feature selection algorithm based on multilabel ReliefF that can remove irrelevant features while opting for features having strong correlations with categories. In terms of our research questions, most of the experimental data from traditional Chinese medicine present characteristics having multielement, multitarget, and strong multicollinearity problems . Although the feature selection method can eliminate the irrelevant features in order to achieve an improved dimensionality reduction effect, this will not be a good solution to the multicollinearity problem. Therefore, we are in the process of pursuing further improvements and continuing our exploration.
As a nonparametric multivariate statistical analysis method, partial least squares (PLS) provides a regression modeling method for multidependent variables to multi-independent variables in order to effectively solve the problem of multicollinearity [14, 15] when there is a high correlation between independent variables. Based on preponderance, some researchers have proposed a series of improved models. You et al.  have proposed a feature selection method (PLSRFE) by combining PLS with the recursive feature elimination method (RFE). The PLSRFE can remove irrelevant features and select a small number of features, although it lacks reliability due to the default parameters applied. Shang et al.  presented a robust feature selection and classification algorithm based on partial least squares regression to solve the problem of multicollinearity and redundancy in features, yet the follow-up question as to whether the selection of parameters in the model is sufficient remains to be answered. Nagaraja et al.  connected partial least squares regression and optimization experimental design in order to select features by analyzing the model parameters while ignoring the interference of noise samples, which weakened the robustness of the model. In view of the shortcomings of the methods proposed above, by only using partial least squares for all data in the regression model without considering the advantages of the feature selection method, we create a model with poor interpretability and issues such as overfitting, shortcomings that make it impossible to achieve “Gold Standard” screening, and model optimization. Therefore, the goal of this study is to determine how to effectively perform feature selection on multicollinear traditional Chinese medicine (TCM) data and establish a regression model that is as simple and accurate as possible.
In a feature selection study, higher-quality feature selection methods should exhibit the following characteristics : (1) interpretability, meaning that the features selected in the model have scientific significance; (2) acceptable model stability; (3) avoidance of deviations in the hypothesis test; and (4) model calculation complexity within a manageable range. Traditional feature selection methods such as stepwise regression, ridge regression, and principal component regression [20, 21] only satisfy some of the above characteristics. Therefore, effectively overcoming such problems and achieving better feature selection results have become a research point for regression and classification. In order to determine a feasible solution to this problem, Tibshirani has presented a feature selection method called “lasso”  and applied it successfully by taking inspiration from both the ridge regression algorithm proposed by You et al.  and the nonnegative garrote algorithm proposed by Breiman . The lasso method compresses the regression coefficients by using the absolute value function of the model coefficients as a penalty to achieve the selection of significant features as well as the estimation of corresponding parameters. The lasso method preferably overcomes the insufficiencies of the traditional selection methods in the feature selection model.
Given the above, this study proposes a feature selection method (LAPLS) that unites the concepts of the lasso method and partial least squares (PLS). This method utilizes PLS to perform the regression and applies the L1 regular term constraint to the sum of the absolute values of the regression coefficients. In this way, the regression coefficients of insignificant features are compressed to 0, helping in achieving the goal of feature selection. This technique is then combined with the coordinate descent method to perform multiple iterations in order to select a higher-quality feature subset, which can then be used to screen the target for the “Gold Standard.” The new model algorithm not only can effectively overcome the issues of multicollinearity but can also eliminate insignificant features, making it suitable for the data analysis of traditional Chinese medicine.
3. Feature Selection Method Based on Partial Least Squares (LAPLS)
The lasso method uses L1 norm penalty regression to find the optimal solution [25, 26], in which a sparse weight matrix is generated that can be applied to feature selection. The basic idea is that under the constraint that the sum of the absolute values of the regression coefficients is less than or equal to a threshold s (i.e., ), the sum of the residual sums is minimized, so that the regression coefficients with smaller absolute values are compressed to 0. Following this, feature selection and corresponding parameter estimation are simultaneously realized, thus achieving an improved data dimensionality reduction effect.
The partial least squares method is a regression technique for solving multiple independent variables and multiple dependent variables. Compared with traditional regression analysis, PLS overcomes the problems of multicollinearity, small sample size, and variable number limit [27, 28]. For achieving the effective modeling purposes of partial least squares, firstly, extracting the latent variables and from the original independent variable X and the dependent variable Y, respectively, and simultaneously meet the criteria that and should represent the datasets as well as possible, and has a strong explanatory power for [29, 30]. If the accuracy condition is not met, the residual information is used to continue to carry out the second latent variables extraction until the accuracy condition is satisfied . In the process of regression, however, the PLS method does not have the function of feature selection and cannot achieve an improved dimensionality reduction effect. Exploiting the lasso method, however, can effectively remove irrelevant and partially redundant features, thereby selecting better feature subsets. This study uses the lasso algorithm to optimize the partial least squares method, not only to achieve the feature selection functionality of partial least squares but also to achieve the screening of the “Gold Standard” via iteration. At the same time, this research helps us to overcome the problem of multicollinearity, which is an issue with the traditional lasso method.
The LAPLS method first uses the latent variables extracted by principal component analysis and canonical correlation analysis as the input for the multiple linear regression in partial least squares and then makes the sum of the absolute values of the coefficients less than or equal to a constant when performing the regression. That is to say, the L1 regular term constraint regression coefficient is added to the objective function, and, simultaneously, the sum of the squares of the residuals is minimized so that some regression coefficients strictly equal to 0 can be generated, thereby eliminating the irrelevant and partially redundant features. Combination with the coordinate descent method (CDM)  at last yields the “Gold Standard” feature subset. The construction process is shown in Figure 1.
The specific construction process is as follows: Step 1. Standardize the data (z-score): . Step 2. Extract the latent variables: the first latent variables are, respectively, extracted from , with the goal of deriving the largest mutation information,; the largest degree of correlation, ; and the largest comprehensive covariance of both sides, , in which is the first unit vector of . can be obtained by solving the maximum value of , that is, using the Lagrangian algorithm; is the eigenvector corresponding to the largest eigenvalue of . In this way, calculation of and the residual information matrix can be obtained, wherein , , , , , and . The residual information matrix is then substituted for , after which and the second latent variables are then calculated according to the above steps. Step 3. Judge whether satisfactory accuracy has been reached: according to the definition of cross-validity (1), if the currently extracted latent variables makes the square of less than 0.0975 , then the added latent variables has no significant effect on the prediction deviation of the reduction equation. Therefore, the previously extracted latent variable is sufficient to achieve satisfactory accuracy, and the algorithm can be terminated. If, however, the currently extracted latent variable makes the square of greater than 0.0975, and it can be assumed that increasing the latent variables will improve the prediction accuracy. Therefore, the next latent variables are then extracted, and a judgment is made as to whether it is beneficial for reducing the prediction deviation of the equation. This process is cyclically looped until the satisfactory accuracy is no longer improved, and the algorithm can be terminated: where is the number of samples; is the latent variables number ; and is the rank of the matrix ; and is the fitting value of at the sample point , and the solution process is as follows: firstly, divide all sample points into two parts (one part including and another part including one), and then use sample points and the latent variables for doing a regression equation, and finally the one sample point is substituted into the equation to obtain the fitted value . In addition, all sample points are used to fit the regression equation containing latent variables, and then the predicted value of the th sample point can be obtained. Step 4. Acquire the regression coefficients: assuming that the number of the latent variables extracted is m () when the satisfactory accuracy is achieved, the regression equation for and the inverse normalized equation are as shown in equation (2), thereby obtaining the regression coefficient that will be processed: where is the column of the residual matrix ; ; is the number of dependent variables. Step 5. Construct the objective function: after generating the coefficients using the above PLS regression, the function can be constructed by combining the regression coefficient with the L1 regularization term of the lasso algorithm, in which the function satisfies the constraint that the sum of the absolute values of the regression coefficients is less than or equal to a threshold. Under this condition, the sum of the squared residuals is minimized: Minimization of the residual sum of squares: where is the number of samples; is the threshold; and the parameter value is . Step 6. Solve the function: since the new model imposes the L1 regular term constraint on the regression coefficients generated by PLS, the resulting constructed function has absolute values, which makes the function underivable at the zero point. Therefore, this study uses the coordinate descent method  to solve the problem. It is worth noting that the new algorithm (LAPLS) is a reimplementation of the standard coordinate descent algorithm for lasso regression with an initial solution generated using PLS (Algorithm 1). First, the function is divided into 2 parts, and . Next, the partial derivative is calculated separately so that the overall partial derivative can be obtained: where , . Let , and then determine the regression coefficient under the regular term constraint: Step 7. Perform multiple iterations through Step 6, in which some regression coefficients are strictly compressed to 0 at a certain iteration so that the best feature subset (i.e., the “Gold Standard”) can be selected. The selected subset of features is then regressed to construct a simpler optimization model than PLS regression:
4. Experimental Design
4.1. Experimental Data Description
In this study, the 6 experimental datasets included the traditional Chinese medicine data (WYHXB, NYWZ, and DCQT) from the Key Laboratory of Modern Chinese Medicine Preparations Ministry of Education as well as Communities and Crime (CCrime), Breast Cancer Wisconsin (Prognostic) (BreastData), and Residential Building Dataset (RBuild) from the UCI Machine Learning Repository. The basic information for each dataset is listed in Table 1. There are 798 features in WYHXB, 1 dependent variable, and 54 samples; 10283 features in NYWZ, 1 dependent variable, and 54 samples; 9 features in DCQT, 1 dependent variable, and 10 samples; CCrime describes community crime and includes 127 features, 1 dependent variable, and 1994 samples; BreastData describes breast cancer cases and includes 34 features, 1 dependent variable, and 198 samples; and RBuild describes residential buildings and includes 103 features, 1 dependent variable, and 372 samples. Since the UCI datasets obtained from the UCI Machine Learning Repository generally had numerous missing values, the mean filling method was used for data preprocessing during the experiment. The reason this study adopted CCrime, BreastData, and RBuild from the UCI datasets for the experiment was to compare the regression effects of the new model on a public dataset consisting of diversified data, in order to validate the reliability and robustness of the new model.
WYHXB and NYWZ are the basic experimental data of Shenfu injections used to treat cardiogenic shock, which utilizes the left anterior descending coronary artery near the edge of the heart to replicate the middle-end cardiogenic shock rat model. In seven groups of 6 shock model rats, each were injected with 0.1, 0.33, 1.0, 3.3, 10, 15, or 20 mL·kg−1 of Shenfu. There was also a model setting group and a blank group. Sixty minutes after receiving the Shenfu, the red blood cell flow rate (μ m/s) pharmacodynamic indicator was collected. The material information contained in the Shenfu injection is called the exogenous substance (i.e., the WYHXB data, as shown in Table 2), and the material information of the experimental individual is called the endogenous substance (i.e., the NYWZ data, as shown in Table 3). In the 2 datasets, the material information is the independent variable (i.e., the feature), and the red blood cell flow rate is the dependent variable.
The experimental data of the traditional Chinese medicine (DCQT) were mainly used to study the factors affecting the physiological index (d-lactic acid content) under the influence of the active ingredients in Chinese medicinal rhubarb. The contents (characteristics) of the active ingredients of the rhubarb include aloe emodin, emodin, rhein and chrysophanol, emodin methyl ether, magnolol, as well as honokiol, hesperidin, and synephrine. The dependent variable is D(−)-lactic acid content. The partial experimental data are listed in Table 4.
4.2. Results and Discussion
4.2.1. Experimental Parameters
Using a strategy of selecting and optimizing the parameters corresponding to each dataset with the goal of ensuring model result reliability is significant given that the experimental data themselves have different characteristics and the corresponding model parameters are inconsistent. First, the model parameters were initialized and set to and , where is the model threshold, is the model parameter value, and iter is the number of iterations. Next, based on the initialization value, a comparison strategy was used for analysis. Specifically, with the maintenance condition of , the value was gradually increased or decreased until the best value was determined based on the R-squared evaluation index (as shown in Table 5). Finally, by keeping the selected value for each dataset fixed while increasing or decreasing the threshold , a set of model parameters could preferentially be selected (as shown in Table 6).
In Table 5, keeping fixed, the results are as follows: for the WYHXB data, the result of the homologous R-squared evaluation indicates that the best result occurs when k = 13 (i.e., ); for the NYWZ data, the best result occurs when k = 12 (i.e., ); for the DCQT data, the best result occurs when k = 10 (i.e., ); for the CCrime data, the best result occurs when k = 12 (i.e., ); for the BreastData data, the best result occurs when k = 12 (i.e., ); and for the RBuild data, the best result occurs when k = 13 (i.e., ). By comparing the results of Table 6, one group of optimal parameters for each dataset can be selected: for the WYHXB data, we should choose , ; for the NYWZ data, we should choose , ; for the DCQT data, we should choose , ; for the CCrime data, we should choose , ; for the BreastData data, we should choose , ; and for the RBuild data, we should choose , .
At the same time, in order to verify the feasibility and availability of the LAPLS method, we should consider the feature analyses of each dataset (with the model parameters utilizing the above results) during our experimentation. Specifically, by determining the iteration at a particular time (the number of iterations ranged from 1 to 25), the model achieving the best feature subset (“Gold Standard”) and the best homologous R-squared value can be selected. The results of this analysis are shown in Figures 2–8. It can be seen in Figure 2 that the number of features in the 6 groups of experimental data decreased as the number of iterations increased, indicating the trend towards the goal of eliminating irrelevant and partially redundant features. Similarly, Figures 3–8 show the trend of the corresponding R-squared values during the iterative process of each dataset (i.e., the stages as the number of features change). This does not indicate, however, that the fewer the number of features, the better the calculated result. The specific results were as follows: for the WYHXB data, after 10 iterations, 425 significant features could be selected (373 features had been eliminated), and the homologous R-squared value was optimal; for the NYWZ data, after 12 iterations, 1247 significant features could be selected (9036 features had been eliminated) and the homologous R-squared value was optimal; for the DCQT data, after 11 iterations, 5 significant features could be selected (4 features had been eliminated) and the homologous R-squared value was optimal; for the CCrime data, after 11 iterations, 82 significant features could be selected (45 features had been eliminated) and the homologous R-squared value was optimal; for the BreastData data, after 13 iterations, 22 significant features could be selected (12 features had been eliminated) and the homologous R-squared value was optimal; for the RBuild data, after 14 iterations, 39 significant features could be selected (64 features had been eliminated) and the homologous R-squared value was optimal.
From the above experiments, we could determine the respective parameters and the homologous iteration times for each of the 6 datasets and also obtain the dimensionality reduction effect of each feature selection by the new model (i.e., the degree of irrelevant and redundant feature elimination), as shown in Figure 9. It is worth noting, however, that the number of eliminated features cannot be 0, given the meaning of feature selection.
4.2.2. Comparison of the LAPLS with Other Methods
Further analysis of the new model consisted of randomly dividing each dataset into a training set and a test set (ratio = 7 : 3) and then utilizing traditional partial least squares (PLS), lasso, PLSRFE, and the improved algorithm (LAPLS) for training and learning. The test set was subjected to a regression experiment (with parameters and number of iterations consistent with the previously listed optimal values), in which the R-squared (R2) and root-mean-square error (RMSE) values were used as the model evaluation indicators. Meanwhile, in order to ensure the reliability of the experimental results, 10 tests were performed for each set of experimental data, and ultimately, the respective average values were chosen as the final experimental results, as shown in Table 7. Via the above experimental design, the verification of the new model could be analyzed from two perspectives: (1) the comparison between the LAPLS and the traditional methods (PLS, lasso) and (2) the comparison between the LAPLS and the same type of feature selection method (PLSRFE).
Table 7 lists the experimental results of LAPLS regression on the test sets of 6 sets of original data (WYHXB, NYWZ, DCQT, CCrime, BreastData, and RBuild). The R2 values were 0.6558, 0.7326, 0.9384, 0.6703, 0.7064, and 0.9831, respectively, and the RMSE were 412.7325, 140.5172, 0.0117, 0.1516, 3.1468, and 202.5260, respectively. Compared with PLS and lasso, the LAPLS had a slightly inferior RMSE for the CCrime (larger sample size) data (0.0128 greater error than PLS and 0.0212 greater error than lasso), but for the results of the remaining experimental datasets, the new method performed better than the traditional methods. Compared with the PLSRFE, the LAPLS was slightly inferior for the CCrime data and the LAPLS RMSE was slightly higher than the PLSRFE RMSE for the RBuild data, although the LAPLS results were better than those of the PLSRFE for the remaining experimental data. Overall, the results of the improved algorithm were better than those of the other existing algorithms, indicating that the new model has the effect of eliminating irrelevant and redundant features. In addition, as shown by the experimental results, the new model proved to be relatively adaptable, demonstrating effectiveness not only for multifeature data but also for data with fewer features.
In order to observe the experimental results more intuitively, the trend graphs were portrayed separately (Figures 10 and 11) in order to reflect the fluctuations of the R2 and RMSE values. It can be seen that the R2 and RMSE values of the new model for the 6 sets of experimental data are basically superior to those of the other algorithms, indicating that the regression results of the new model are better and that this approach effectively eliminates the irrelevant and partially redundant features. In summary, the improved algorithm not only performs feature selection and screens out the “Gold Standard” feature subset for general high-dimensional data but also could well suit the traditional Chinese medicine data.
The traditional partial least squares method has no feature selection function, and the goal of obtaining of a higher-quality feature subset cannot be achieved for experimental data from traditional Chinese medicine. Given this, we proposed a feature selection method based on partial least squares. This method made sufficient use of the advantages of the lasso algorithm, namely, imposing constraints on the sum of the absolute values of the regression coefficients and carrying out feature selection, while simultaneously combining this technique with the partial least squares method, which can solve the multicollinearity problem in order to perform regression analysis. In this way, both data dimensionality reduction and the screening of the “Gold Standard” feature subset were realized. Through the experimental comparison of TCM data and UCI datasets, it was clearly demonstrated that the improved algorithm significantly strengthens the interpretation degree and prediction accuracy of the model, and it is a suitable analytical method for TCM data. The improved algorithm, however, has the disadvantage of only eliminating the partially redundant features from high-dimensional data. Going forward, we will continue to improve the algorithm in order to boost its efficiency. In addition, ensuring that reasonable relevant parameters are set during model creation also requires further study.
The TCM data used in this study can be obtained by contacting the first author, and other data sets can be obtained through the UCI Machine Learning Repository.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This research was funded by the National Natural Science Foundation of China (nos. 61762051 and 61562045) and the Jiangxi Province Major Projects Fund (no. 20171ACE50021).
W. Li, H. Han, S. Cheng, Y. Zhang, S. Liu, and H. Qu, “A feasibility research on the monitoring of traditional Chinese medicine production process using NIR-based multivariate process trajectories,” Sensors and Actuators B: Chemical, vol. 231, pp. 313–323, 2016.View at: Publisher Site | Google Scholar
B. Andrea, J. Rahnenführer, and M. Lang, “A multicriteria approach to find predictive and sparse models with stable feature selection for high-dimensional data,” Computational and Mathematical Methods in Medicine, vol. 2017, Article ID 7907163, 18 pages, 2017.View at: Publisher Site | Google Scholar
H. Xuan, “Research and development of feature dimensionality reduction,” Computer Science, vol. 45, no. S1, 2018.View at: Google Scholar
Q. Ye, Y. Gao, R. Wu et al., “Informative gene selection method based on symmetric uncertainty and SVM recursive feature elimination,” International Journal of Pattern Recognition and Artificial Intelligence, vol. 30, no. 5, pp. 429–438, 2017.View at: Google Scholar
J. Zhang, X. G. Hu, P. P. Li et al., “Informative gene selection for tumor classification based on iterative lasso,” Pattern Recognition & Artificial Intelligence, vol. 27, no. 1, pp. 49–59, 2014.View at: Google Scholar
M. Hu, L. Zheng, L. Tang et al., “Feature selection algorithm based on joint spectral clustering and neighborhood mutual information,” Pattern Recognition & Artificial Intelligence, vol. 30, no. 12, pp. 1121–1129, 2017.View at: Google Scholar
Z. Hao, J. Du, B. Nie, F. Yu, R. Yu, and W. Xiong, “Random forest regression based on partial least squares connect partial least squares and random forest,” in Proceedings of the International Conference on Artificial Intelligence: Technologies and Applications, Bangkok, Thailand, January 2016.View at: Publisher Site | Google Scholar
S. Wanfeng and H. U. Xuegang, “K-part lasso based on feature selection algorithm for high-dimensional data,” Computer Engineering and Applications, vol. 48, no. 1, pp. 157–161, 2012.View at: Google Scholar
R. Muthukrishnan and R. Rohini, “LASSO: a feature selection technique in predictive modeling for machine learning,” in Proceedings of the IEEE International Conference on Advances in Computer Applications (ICACA), pp. 18–20, IEEE, London, UK, December 2016.View at: Google Scholar
H. Wang, Linear and Nonlinear Methods for Partial Least Squares Regression, National Defense Industry Press, Beijing, China, 2006.
Y. Zhu and H. Wang, “A simplified algorithm of PLS regression,” Journal of Systems Science and Systems Engineering, vol. 9, no. 4, pp. 31–36, 2000.View at: Google Scholar
E. Grant, K. Lange, and T. T. Wu, “Coordinate descent algorithms for L1 and L2 regression,” Journal of the American College of Cardiology, vol. 25, no. 2, pp. 491–499, 1995.View at: Google Scholar