Abstract

The basic experimental data of traditional Chinese medicine are generally obtained by high-performance liquid chromatography and mass spectrometry. These data are typically high dimensional with few samples and contain many irrelevant and redundant features, which complicates the in-depth exploration of the material information of Chinese medicine. This paper proposes a hybrid feature selection method based on an iterative approximate Markov blanket (CI_AMB). The method first uses the maximum information coefficient to measure the correlation between each feature and the target variable and filters out irrelevant features according to an evaluation criterion; it then applies an iterative approximate Markov blanket strategy to analyze the redundancy between features, eliminate redundant features, and finally select an effective feature subset. Comparative experiments on basic experimental data of traditional Chinese medicine materials and several public UCI datasets show that, compared with Lasso, XGBoost, and the classic approximate Markov blanket method, the new method is better at selecting a small number of highly explanatory features.

1. Introduction

At present, with the rapid development of science and technology, information acquisition and storage capabilities have improved greatly, and the data obtained carry richer information while growing ever larger in scale. In basic research on the materials of traditional Chinese medicine, high-performance liquid chromatography (Waters H-Class) and mass spectrometry (Synapt G2-Si) are usually used to obtain experimental data. These data often involve thousands of substances, so they are high dimensional and can easily cause the curse of dimensionality. At the same time, because the number of experiments is limited, the data also have few samples, which easily leads to problems such as overfitting. Conventional statistical analysis methods, such as multiple linear regression, principal component regression, and ridge regression, choose regression coefficients to reflect the relationships between variables [1–3]; however, they cannot effectively delete irrelevant and redundant features and therefore cannot screen the important substances in the high-dimensional, small-sample basic data of traditional Chinese medicine. Likewise, traditional feature selection methods such as Lasso and K-split Lasso [4] can only delete irrelevant and redundant features to some extent and cannot meet the processing requirements of high-dimensional, small-sample data. Therefore, given that the high-dimensional, small-sample data of Chinese medicine contain much irrelevant and redundant information, it is urgent to find an analytical model that can select effective features from such data and improve the accuracy and efficiency of the model, so as to provide technical support for researchers.

The rest of this article is organized as follows. Section 2 introduces the related work, and Section 3 elaborates the new method. In Section 4, two basic datasets on TCM materials and three public UCI datasets are analyzed with the new method, which is also compared with several existing algorithms to verify its feasibility and effectiveness. Finally, Section 5 summarizes the paper.

2. Related Work

Feature selection is an effective way to combat the curse of dimensionality and reduce feature dimensionality. By analyzing the intrinsic relationships between features and the target variable and among the features themselves [5, 6], it preserves the features most beneficial to regression (or classification) and eliminates features that are redundant or unrelated to the target variable, thereby reducing the complexity and improving the accuracy of the algorithm. According to how they are combined with machine learning, feature selection methods can be divided into filter, wrapper, embedded, and ensemble methods [7]. Filter methods are independent of any specific machine learning model; they generally obtain feature subsets through feature ranking and feature space search, with typical measures including mutual information, symmetric uncertainty, and the maximum information coefficient [8–10]. Wrapper methods integrate the learning algorithm into the feature selection process, that is, the classification algorithm is treated as a black box to evaluate the performance of a feature subset so as to maximize classification accuracy. Embedded methods incorporate feature selection as part of the learning algorithm and are used to avoid the high cost of rebuilding wrapper models for different datasets. Ensemble methods first obtain results from multiple feature selection methods separately and then integrate them with a certain rule; they outperform any single feature selection method and are suitable for addressing the instability of feature selection.

Feature selection has attracted the attention of many domestic and foreign scholars. For example, in the field of biomedicine, Yao et al. [11] proposed a hypergraph-based multimodal feature selection method for multitask feature selection, which ultimately selected effective brain region information; Sun et al. [12] proposed a hybrid feature selection algorithm based on Lasso, which can select a subset of informative genes with strong classification ability; Mingquan et al. [13] proposed an informative gene selection method based on symmetric uncertainty and support vector machine (SVM) recursive feature elimination, which can effectively eliminate genes unrelated to the categories. At the same time, feature selection methods have also been applied successfully in other fields. Nagaraja [14] used partial least squares regression and an optimized experimental design to select features strongly correlated with the categories; Hu et al. [15] proposed a feature selection algorithm combining spectral clustering and neighborhood mutual information, which can remove irrelevant features.

However, the methods mentioned in the above literature can only remove irrelevant features or eliminate redundant features to a certain extent and cannot meet the data processing needs of the high-dimensional, small-sample problems of traditional Chinese medicine. Therefore, some researchers have carried out a two-stage analysis of feature correlation and redundancy and introduced the approximate Markov blanket (AMB) into the feature selection process in order to screen a small number of effective features [16]. Among them, the literature [17] proposed a method of approximating the Markov blanket using cross entropy: the Pearson coefficient is first used to calculate the correlation between features and remove the irrelevant ones, and the approximate Markov blanket is then used to delete redundant features. The paper [18] proposed a maximum relevance minimum redundancy feature selection algorithm using approximate Markov blankets: features are first ranked by the maximum relevance minimum redundancy criterion, and mutual information is then combined with the approximate Markov blanket to remove irrelevant and redundant features. The literature [19] proposed a feature selection method based on the maximum information coefficient and the approximate Markov blanket (FCBF-MIC), which first measures the correlation between features and categories by symmetric uncertainty to delete features that are unrelated or only weakly related to the categories, and then approximates the Markov blanket with the maximum information coefficient to delete redundant features. However, experimental analysis shows that the definition of the approximate Markov blanket used in the above methods is too strict, which makes it impossible to select a small number of highly explanatory features from the high-dimensional, small-sample data of Chinese medicine, so further research on analysis methods for Chinese medicine data is still needed.

In feature selection research, a high-quality feature selection method should exhibit the following characteristics [20]: (1) interpretability, meaning that the selected features have scientific significance; (2) acceptable model stability; (3) avoidance of bias in hypothesis testing; and (4) computational complexity within a manageable range. In addition, the literature [21] proposes a criterion for optimal feature subsets that divides features into four categories: irrelevant features, weakly correlated redundant features, weakly correlated nonredundant features, and strongly correlated features; in this paper, the optimal feature subset is considered to contain the latter two. A large number of experimental comparisons have shown that this criterion leads to lower time complexity and better feature selection results [22, 23].

In view of this, this paper proposes a hybrid feature selection method based on an iterative approximate Markov blanket (CI_AMB), which consists of two phases. In the first phase, the maximum information coefficient is used to measure the correlation between each feature and the target variable, and irrelevant features are filtered out according to the evaluation criteria to obtain a candidate feature subset. In the second phase, the candidate feature subset is sorted and divided into K subsets, and redundant features are then iteratively eliminated using an approximate Markov blanket based on the maximum information coefficient, so that weakly correlated nonredundant features and strongly correlated features are obtained. The algorithm not only effectively filters out irrelevant features and eliminates redundant features, but also reduces the time complexity of the model and improves its interpretability. It is a new model suitable for analyzing the high-dimensional, small-sample data of traditional Chinese medicine.

3. Research on a Hybrid Feature Selection Method Based on an Iterative Approximate Markov Blanket (CI_AMB)

The maximum information coefficient (MIC) is an information-based metric proposed by Reshef et al. [24] in 2011. It not only reflects the correlation between features and the target variable and among the features themselves, but also overcomes the problems that metrics such as mutual information cannot be normalized and are sensitive to discretization, and that metrics such as information gain and symmetric uncertainty cannot effectively measure nonfunctional dependence between features. Many experimental analyses have also shown that the maximum information coefficient has good stability and can effectively measure the relationships among features [25–27].
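To make the metric concrete, the following short Python sketch scores every feature of a dataset against the target variable with MIC. It relies on the third-party minepy package; the paper does not state which MIC implementation was used, so both the package choice and the helper name mic_scores are illustrative assumptions.

import numpy as np
from minepy import MINE   # third-party MIC implementation (an assumption, not stated in the paper)

def mic_scores(X, y, alpha=0.6, c=15):
    """Return the MIC of each column of X against the target y; values lie in [0, 1].
    alpha=0.6 matches the grid-size limit B(n) = n^0.6 mentioned later in the text."""
    mine = MINE(alpha=alpha, c=c)
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        mine.compute_score(X[:, j], y)
        scores[j] = mine.mic()
    return scores

A feature whose score is close to 1 is strongly related to the target variable, and one whose score is close to 0 is nearly irrelevant, which is the filtering criterion used in the first phase of the new method.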

The Markov blanket is a minimal subset of features that preserves the maximum amount of information about the target variable, such that, given this subset, the remaining features are conditionally independent of the target variable [19, 28]. Although the Markov blanket can achieve feature dimensionality reduction, its independence conditions are too strict and discovering it is an NP-hard problem, so feature selection methods often adopt an approximate Markov blanket strategy. Therefore, taking advantage of the maximum information coefficient, this paper uses MIC to approximate the Markov blanket (see Definition 1) in order to better eliminate redundant features, so that optimal feature subset screening and model optimization are achieved.

Definition 1. (approximate Markov blanket). Assume that F_i and F_j (i ≠ j) are two different features in the feature set F and that Y is the target variable. If MIC(F_i, Y) ≥ MIC(F_j, Y) and MIC(F_i, F_j) ≥ MIC(F_j, Y), then F_i is considered to be an approximate Markov blanket of F_j; that is, F_i is retained, while F_j is a redundant feature and is removed from the feature set.

Definition 2. (weakly correlated nonredundant features and strongly correlated features). A feature F_i is a weakly correlated nonredundant feature or a strongly correlated feature only if no other feature forms an approximate Markov blanket for F_i, namely, F_i ∈ F − (F_I ∪ F_R), where F is the complete feature set and F_I and F_R are the irrelevant feature set and the redundant feature set, respectively.
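The two definitions can be illustrated with a short sketch of the AMB test and of redundancy elimination inside one group of features. The helper names (is_amb, amb_filter) are hypothetical, and mic_to_y / mic_between stand for any MIC estimates, for example those produced by the mic_scores helper sketched above.

def is_amb(mic_i_y, mic_j_y, mic_i_j):
    # Feature i is an approximate Markov blanket of feature j (Definition 1) when i is
    # at least as relevant to the target as j and shares at least as much information
    # with j as j shares with the target.
    return mic_i_y >= mic_j_y and mic_i_j >= mic_j_y

def amb_filter(features, mic_to_y, mic_between):
    # Drop every feature j for which some already retained feature i forms an AMB of j.
    # `features` is assumed sorted in descending order of MIC score with the target,
    # `mic_to_y` maps feature index -> MIC(feature, target), and
    # `mic_between(i, j)` returns MIC(feature_i, feature_j).
    kept = []
    for j in features:
        if not any(is_amb(mic_to_y[i], mic_to_y[j], mic_between(i, j)) for i in kept):
            kept.append(j)
    return kept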
The CI_AMB method is mainly divided into two stages. In the first stage, the MIC method is used to measure the correlation between each feature and the target variable, and irrelevant features are filtered out according to the evaluation criteria to obtain a candidate feature subset. The features selected by the MIC method are usually highly correlated with the target variable but are accompanied by redundant features; a larger number of redundant features not only increases the time and space complexity of the model but also reduces its interpretability. Therefore, in the second stage, the new method further analyzes feature redundancy: according to the feature scores obtained by the MIC method, the features of the candidate subset are arranged in ascending order and divided equally into K parts, and the approximate Markov blanket (AMB) is then used to iteratively eliminate redundant features, so that weakly correlated nonredundant features and strongly correlated features are selected (Algorithm 1). The flow of the algorithm is shown in Figure 1.
The specific construction process of the model is as follows:

Phase 1. Filtering irrelevant features

Step 1. MIC calculation: the maximum information coefficient between each of the m features of the original data and the target variable Y is calculated by formula (2), which gives a score sequence for all features with values in [0, 1]. The closer a feature's score is to 1, the stronger its correlation with the target variable, and the closer the score is to 0, the weaker the correlation:

MIC(F_i, Y) = max_{x·y < B(n)} I*(D, x, y) / log2 min(x, y),  (2)

where I*(D, x, y) refers to the maximum mutual information of D under an x-by-y mesh partition [19, 29], D is the set of ordered sample pairs (F_i, Y), x means dividing the value range of feature F_i into x segments, y means dividing the value range of the dependent variable Y into y segments, and B(n) is the upper limit of the mesh size. Generally, B(n) = n^0.6, where n is the sample size.

Step 2. Determining the candidate feature subset: the score sequence obtained by the MIC calculation is arranged in descending order and truncated at a certain ratio, and the top-ranked features form the current feature subset. If this subset achieves the best value of the evaluation index RMSE, it is selected directly as the candidate feature subset T (of dimension m′, m′ < m); otherwise, the filtering and evaluation are repeated.

Step 3. Data division and initialization: the candidate feature subset is rearranged in ascending order of feature score, which ensures that the features most strongly correlated with the regression target are retained as long as possible in the later processing; the ordered candidate feature subset is then divided equally into K groups T_1, T_2, ..., T_K, and the optimal feature set T_best is initialized to be empty.

Phase 2. Eliminating redundant features

Step 4. Feature redundancy analysis: first, the redundant features are removed from the first feature subset T_1 using the AMB method (i.e., Definition 1), and the nonredundant features are placed into T_best. Then, T_best is merged with the second feature subset T_2 to form the current feature subset, the AMB method is applied again to delete redundant features, and T_best is updated. Iterating in this way through the K-th feature subset T_K yields the optimal feature subset T_best of dimension m″ (m″ < m′).

Step 5. Model evaluation: the optimal feature subset T_best obtained in the above steps, consisting of weakly correlated nonredundant and strongly correlated features, is compared with other strategies and evaluated.

Input: dataset D (n samples, m features), target variable Y,
   K //the number of feature subsets after division
Output: optimal feature subset T_best (n samples, m″ features)
Begin
 Phase 1: filtering irrelevant features
 For i = 1 to m: //MIC calculation
  Standardize F_i and Y;
  Calculate the MIC score of feature F_i with respect to Y;
 End
 According to the evaluation index RMSE, determine the filtered candidate feature subset T (m′ features) and arrange it in ascending order of MIC score;
 Divide the ordered candidate feature subset: T → T_1, T_2, ..., T_K; //Divided into K shares
 T_best = ∅; //Initialize the optimal feature subset to be empty
 Phase 2: eliminating redundant features
 Perform redundancy analysis on the first feature subset T_1 with the AMB method and add the nonredundant features to T_best;
 For i = 2 to K: //Iterative AMB
  T_i = T_best ∪ T_i; //Add the current optimal feature subset to the next partition subset
  T_best = AMB(T_i); //Update the list of optimal features using the AMB method
 End
 Construct a regression model on T_best and verify and evaluate the validity and reliability of the model;
End
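As a compact illustration of Algorithm 1, the following Python sketch strings the two phases together. It assumes the mic_scores and amb_filter helpers sketched earlier, computes pairwise MIC with minepy, and replaces the RMSE-driven ratio search of Step 2 with a fixed keep_ratio argument; the function name ci_amb and all parameter defaults are illustrative, not values prescribed by the paper.

import numpy as np
from minepy import MINE

def mic_pair(a, b, alpha=0.6, c=15):
    # Pairwise MIC between two feature columns (used by the AMB redundancy test).
    m = MINE(alpha=alpha, c=c)
    m.compute_score(a, b)
    return m.mic()

def ci_amb(X, y, keep_ratio=0.8, K=5):
    # Phase 1: filter irrelevant features by their MIC score with the target.
    scores = mic_scores(X, y)
    order = np.argsort(scores)[::-1]                 # features in descending relevance
    m_prime = int(round(keep_ratio * X.shape[1]))
    candidate = order[:m_prime]                      # candidate feature subset T

    # Phase 2: arrange the candidates in ascending score, split into K groups, and
    # iteratively remove redundant features with the approximate Markov blanket.
    groups = np.array_split(candidate[::-1], K)
    mic_to_y = {int(i): float(scores[i]) for i in candidate}
    pair = lambda i, j: mic_pair(X[:, i], X[:, j])

    t_best = []
    for group in groups:
        pool = t_best + [int(i) for i in group]
        pool.sort(key=lambda i: mic_to_y[i], reverse=True)  # process stronger features first
        t_best = amb_filter(pool, mic_to_y, pair)
    return t_best                                    # indices of the selected features

Calling ci_amb(X, y, K=5) on a feature matrix would return the indices of the retained features, which can then be fed to the regression models used for evaluation.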

4. Experimental Design

4.1. Experimental Data Description

Five experimental datasets were used in this paper: the traditional Chinese medicine material basic experimental data (WYHXB and NYWZ) of the Modern Chinese Medicine Preparation laboratory of the Ministry of Education, and the Residential Building dataset (RBuild), the Communities and Crime dataset (CCrime), and the BlogFeedback dataset (BlogData for short) from the UCI repository; the basic information of each dataset is described in Table 1. The WYHXB data contain 798 features, 1 dependent variable, and 54 samples, and the NYWZ data contain 10283 features, 1 dependent variable, and 54 samples; BlogData describes blog posts and includes 280 features, 1 dependent variable, and 60021 samples; RBuild describes residential buildings and includes 103 features, 1 dependent variable, and 372 samples; CCrime describes community crime and includes 127 features, 1 dependent variable, and 1994 samples. It is worth noting that the datasets obtained from the UCI Machine Learning Repository generally contain missing values; therefore, mean filling was used during data preprocessing. The BlogData, RBuild, and CCrime datasets are used to compare the regression performance of the new model on public data and to verify its reliability and generalization ability.
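As a small illustration of the mean-filling step mentioned above, the snippet below replaces missing values with column means using pandas; the file name is hypothetical, and the paper does not state which tool was used for this preprocessing.

import pandas as pd

# Hypothetical file name; '?' marks missing entries in several UCI datasets.
df = pd.read_csv("communities_and_crime.csv", na_values="?")
df = df.fillna(df.mean(numeric_only=True))   # fill each missing value with its column mean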

Both WYHXB and NYWZ are basic experimental data on Shenfu injection in the treatment of cardiogenic shock. The experimenters used the left anterior descending coronary artery near the cardiac apex to replicate a metaphase cardiogenic shock rat model and administered Shenfu injection (unit: ml·kg−1) to the shock model rats, which were divided into 7 dose groups (0.1, 0.33, 1.0, 3.3, 10, 15, and 20, respectively) with 6 rats in each group; a model group and a blank group were also set up in the experiment. After 60 minutes of administration, the pharmacodynamic indicator, red blood cell flow rate (m/s), was collected. The substance information contained in the Shenfu injection itself is called the exogenous substances (i.e., the WYHXB data, as shown in Table 2), and the substance information of the experimental individuals is called the endogenous substances (i.e., the NYWZ data, as shown in Table 3). In both datasets, the substance information constitutes the features, and the red blood cell flow rate is the dependent variable.

4.2. Results and Discussion

The programming tool used in this experiment is Python 3.6, the operating system is Windows 10, the memory is 8 GB, and the CPU is Intel (R) Core (TM) i5-3230M.

4.2.1. Filtering of Irrelevant Features

To ensure the reliability of the new model, the RMSE (root mean square error) of two regression models, GBDT [30] and XGBoost [31], was adopted as a comprehensive evaluation index, that is, the average of the two models' RMSE values. The features of the original dataset were then filtered gradually at a given ratio (if the resulting number of features was not an integer, it was rounded), so that the ratio giving the best corresponding RMSE value could be identified. This makes it possible to judge how many irrelevant features should be deleted to achieve effective filtering of the irrelevant features; the experimental results are shown in Table 4.
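The comprehensive evaluation index can be sketched as follows: the RMSE of GBDT and of XGBoost is computed on a held-out split and the two values are averaged. The helper name combined_rmse, the split seed, and the default hyperparameters are assumptions, since the paper does not list them.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

def combined_rmse(X, y, test_size=0.4, seed=0):
    # Average of the GBDT RMSE and the XGBoost RMSE on one random split.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=test_size, random_state=seed)
    rmses = []
    for model in (GradientBoostingRegressor(random_state=seed), XGBRegressor(random_state=seed)):
        model.fit(X_tr, y_tr)
        rmses.append(np.sqrt(mean_squared_error(y_te, model.predict(X_te))))
    return float(np.mean(rmses))

Evaluating combined_rmse on the feature subsets kept at each filtering ratio and choosing the ratio with the smallest value corresponds to the search summarized in Table 4.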

According to the experimental results in Table 4, for the WYHXB data, at the ratio with the best average RMSE, 120 irrelevant features are filtered out (from 798 original features); for the NYWZ data, at its best ratio, 2057 irrelevant features are filtered out (from 10283 original features); for the BlogData data, 140 irrelevant features are filtered out (from 280 original features); for the RBuild data, 31 irrelevant features are filtered out (from 103 original features); and for the CCrime data, 7 irrelevant features are filtered out (from 127 original features). As a result, after filtering the irrelevant features with the MIC method, a candidate feature subset is obtained for each of the five experimental datasets. Further analysis of the candidate feature subsets shows that their RMSE differs little from that of the original data (the experimental results are shown in Table 5); therefore, the features deleted in this experiment have little effect on the accuracy of the model, and the process filters out irrelevant features while preserving the features associated with the target variables.

4.2.2. Elimination of Redundant Features

Through the above experiments, the filtering of irrelevant features is achieved and candidate feature subsets are obtained. However, according to the construction of the new model, the candidate feature subset (in ascending order) must be divided equally during the experiment, and different partitioning strategies affect the final results, so the parameter K needs further discussion and analysis (its value range is set to 1 to 15) to determine the optimal value and ensure the reliability of the model results. At the same time, to avoid experimental contingency as much as possible, the experiment still adopts the average RMSE of GBDT and XGBoost as the comprehensive evaluation index. The experimental analysis (results shown in Figures 2–6) shows that the best RMSE value is obtained at K = 5 for the WYHXB data, K = 6 for the NYWZ data, K = 5 for the BlogData data, K = 3 for the RBuild data, and K = 14 for the CCrime data. After the candidate feature subsets are divided, the redundancy of the features can be analyzed in the later experiments so as to select the optimal feature subsets.

For further analysis of the model, each dataset was randomly divided into a training set and a test set at a ratio of 6 : 4, and XGBoost [31], Lasso [32], FCBF-MIC [19], and the improved algorithm (CI_AMB) were used for training and learning; regression experiments were conducted on the test set (the model parameters were consistent with the above experimental results), and RMSE was used as the model index. At the same time, to ensure the reliability of the results, each test was repeated 10 times and the average value was taken as the final experimental result. To verify the effect and effectiveness of the feature selection, the original data were also evaluated with the GBDT and XGBoost regression models. The experimental results are shown in Tables 6 and 7.
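A minimal sketch of this evaluation protocol, assuming the combined_rmse helper given above: ten random 6 : 4 splits are drawn and the resulting RMSE values are averaged.

import numpy as np

def averaged_rmse(X, y, n_runs=10):
    # Mean comprehensive RMSE over ten random 6:4 train/test splits.
    return float(np.mean([combined_rmse(X, y, test_size=0.4, seed=run) for run in range(n_runs)]))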

The experimental results in the above tables show how the CI_AMB method selects features on the test sets of the five raw datasets. The WYHXB data have 798 original features; after the redundant features are removed, the final optimal feature subset contains 80 features, including 19 strongly correlated features and 61 weakly correlated nonredundant features. The NYWZ data have 10283 original features; after the redundant features are eliminated, the optimal feature subset contains 220 features, including 59 strongly correlated features and 161 weakly correlated nonredundant features. The BlogData data have 280 original features; after the redundant features are eliminated, the optimal feature subset contains 48 features, including 5 strongly correlated features and 43 weakly correlated nonredundant features. The RBuild data have 103 original features; after the redundant features are eliminated, the optimal feature subset contains 35 features, including 16 strongly correlated features and 19 weakly correlated nonredundant features. The CCrime data have 127 original features; after the redundancy analysis, the optimal feature subset contains 37 features, including 3 strongly correlated features and 34 weakly correlated nonredundant features. It is worth noting that, after filtering the irrelevant features and eliminating the redundant features, the strongly correlated features and weakly correlated nonredundant features are distinguished according to the degree of correlation between the features and the target variable: a feature with an MIC score greater than 0.6 is a strongly correlated feature; otherwise, it is a weakly correlated nonredundant feature.

After the CI_AMB feature selection, it can be found that (1) compared with the original data (without feature selection), the new method gives a slightly inferior result on the CCrime data (an error 0.0024 greater than that of the original data, using the RMSE of GBDT as the evaluation index, Table 6), but on the other datasets the results are better than those of the original data (see Tables 6 and 7); (2) compared with XGBoost, Lasso, and FCBF-MIC, with a similar number of features, the RMSE values of the evaluation models under the CI_AMB method are better than those of the other methods. At the same time, to observe and compare the experimental results more intuitively, trend graphs of the two evaluation indicators (GBDT and XGBoost) were plotted (Figures 7 and 8) to reflect the overall fluctuation of the RMSE. Combining the tables with the results in Figures 7 and 8, it can be observed that the improved algorithm is generally superior to the other algorithms, indicating that the new model is effective in removing the effects of irrelevant and redundant features. In summary, the improved algorithm not only better filters out the strongly correlated features and weakly correlated nonredundant features but also improves the regression accuracy of the model to some extent.

5. Conclusions

Aiming at the problem that the basic experimental data of TCM are high dimensional with few samples and contain much irrelevant and redundant information, a hybrid feature selection method based on an iterative approximate Markov blanket is proposed. The method performs a two-stage feature analysis using the maximum information coefficient and the iterative approximate Markov blanket, respectively, to filter unrelated features and cull redundant features, thereby screening the optimal feature subset. Experimental comparisons on the basic data of Chinese medicine and the UCI datasets show that the improved algorithm significantly reduces the feature dimension and improves the interpretability of the model; it is an analysis method suitable for the high-dimensional, small-sample data of traditional Chinese medicine. In future work, we will continue to optimize the algorithm and further study how to set the relevant parameters reasonably when building the model.

Data Availability

The traditional Chinese medicine data used in this study can be obtained by contacting the first author. The UCI datasets can be obtained through the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html). It should be noted that the UCI datasets are commonly used standard test datasets proposed by the University of California, Irvine, for machine learning.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was funded by the National Natural Science Foundation of China (nos. 61762051 and 61562045) and the Jiangxi Province Major Projects Fund (20171ACE50021, 20171BBG70108, and YC2018-S281).