#### Abstract

*Background*. Hepatocellular carcinoma (HCC) is a highly aggressive malignancy. Traditional Chinese Medicine (TCM), with the characteristics of syndrome differentiation, plays an important role in the comprehensive treatment of HCC. This study aims to develop a nonnegative matrix factorization- (NMF-) based feature selection approach (NMFBFS) to identify potential clinical symptoms for HCC patient stratification.
*Methods*. The NMFBFS approach consisted of three major steps. Firstly, statistics-based preliminary feature screening was designed to detect and remove irrelevant symptoms. Secondly, NMF was employed to infer redundant symptoms. Based on NMF-derived basis matrix, we defined a novel *similarity measurement of intersymptoms*. Finally, we converted each group of redundant symptoms to a new single feature so that the dimension was further reduced.
*Results*. Based on a clinical dataset consisting of 407 HCC patient samples with 57 symptoms, the NMFBFS approach detected 8 irrelevant symptoms and then identified 16 redundant symptoms within 6 groups. Finally, an optimal feature subset with 39 clinical features was generated after compressing the redundant symptoms by groups. Validation of classification performance shows that these 39 features clearly improve the prediction accuracy for HCC patients.
*Conclusions*. Compared with other methods, NMFBFS has clear advantages in identifying important clinical features of HCC.

#### 1. Introduction

Hepatocellular carcinoma (HCC) is the third most common cause of cancer-related death worldwide and the leading cause of death in patients with the complication of cirrhosis [1, 2]. The onset of HCC is insidious and lacks specific symptoms [3, 4]. Its diagnosis depends on biopsy, imaging examinations such as Doppler ultrasound, computed tomography, and magnetic resonance imaging, and blood tests [5, 6]. By the time patients with HCC see a doctor, the disease has often entered its late stage and the chance of resection has been lost. Hence, seeking simple methods to predict HCC and its clinical stage is meaningful and helpful for improving the diagnosis of HCC.

As one of the most popular complementary and alternative medicine modalities, Traditional Chinese Medicine (TCM) plays an active role in the treatment of malignant tumors, including HCC, in China and some other East Asian countries [7, 8]. Unlike modern medicine, diagnosis and treatment in TCM depend on the analysis of symptoms and signs collected by inspection, auscultation and olfaction, inquiry, and pulse taking and palpation [8]. TCM regards a specific combination of symptoms and signs as a TCM syndrome, which is the main basis for treatment and can also be used to guide the clinical diagnosis of HCC. Our previous work proposed a hierarchical feature selection (PSOHFS) model to quickly identify potential HCC syndromes from a TCM clinical dataset [9]. In that model, all the original symptoms were classified into several groups according to the categories of clinical observations, and each symptom group was then converted into a syndrome signature to reduce the search space of feature selection. The limitation of this method is that interactions among symptoms belonging to different categories (aspects) were ignored. Therefore, the current challenge is to design an efficient feature selection approach for high-dimensional TCM data that takes clinical significance into account.

In this study, a nonnegative matrix factorization- (NMF- [10]) based feature selection (NMFBFS) method is proposed to select pivotal clinical symptoms for HCC diagnosis. A TCM clinical dataset was used in this work, which consisted of 407 HCC patients with 57 clinical symptoms. Each patient sample is labeled with a clinical stage, which indicates the severity of that patient's condition. Firstly, a preliminary screening with a statistical method was designed to detect irrelevant symptoms in the full symptom set. Secondly, NMF was applied after eliminating the irrelevant symptoms. Based on the NMF-derived basis matrix, we defined a similarity measure to infer redundant symptoms by calculating the distance and correlation among the symptoms. Finally, a secondary dimension reduction was implemented based on the inferred groups of redundant symptoms. We converted each symptom group to a new feature (named “mixed feature”) if its symptoms show similar distribution patterns on the sample space. The experimental results show that the 39 features inferred by NMFBFS clearly improve the accuracy of diagnosis on HCC clinical samples. Moreover, the 39 NMFBFS-derived optimal clinical features include some well-known common symptoms of HCC patients. Compared with three representative feature selection methods (ReliefF [11], mRMR [12], and Elastic Net [13]), our proposed approach showed the best performance in identifying optimal clinical features for HCC patients.

#### 2. Materials and Methods

##### 2.1. Experimental Data

###### 2.1.1. Description

In this work, the questionnaire survey dataset of HCC includes 407 samples collected within two years, and each patient was observed on 57 clinical symptoms (Table 1). Each patient sample is labeled with a clinical stage, which is related to the TCM syndrome pattern and indicates the severity of HCC. According to the international staging system [14], there are three stages in this dataset, each with two substages. The aim of our work is to identify the symptom signatures related to the three clinical stages: phases I, II, and III. Within our dataset, all the original symptoms are described by two types of data: binary (0 or 1) or integer (0, 1, 2, 3, …). For example, the symptom “tinnitus” is binary (0 or 1), with two possible states: occurrence (positive) or nonoccurrence (nonpositive). Another example is “sleeplessness,” whose value can be 0, 1, 2, or 3. The larger the value, the stronger the positive state. A symptom is not positive if its value equals zero.

###### 2.1.2. Data Preprocessing

*Refinement of Feature Set*. Our original dataset consists of 407 HCC patient samples (Table 1). The first step of preprocessing is to remove useless features, because they provide no information for the subsequent classification. A feature is considered useless if it is constant over all the observed samples. In our dataset, some symptoms, such as “pale tongue” and “slow pulse,” were removed because no observed patient was positive for them. After removing features of this kind, a refined clinical dataset with 407 samples and 57 symptoms was obtained.

*Simplification of Clinical Staging*. The clinical staging of HCC patients in our original dataset was marked with the labels “IA,” “IB,” “IIA,” “IIB,” “IIIA,” and “IIIB.” To identify the symptom signatures related to the three clinical stages, all the samples were relabeled as three classes. We assigned class label “1” to the samples labeled “IA” and “IB.” In a similar way, class label “2” is used for “IIA” and “IIB,” and “3” for “IIIA” and “IIIB.” Finally, all 407 clinical samples are distributed over three categories: 82 samples in phase I, 195 in phase II, and 130 in phase III. The details of the refined dataset are described in Table 1.
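The relabeling described above can be sketched in a few lines of Python. This is only an illustration: the name `STAGE_TO_CLASS` and the split of samples between substages in `example` are hypothetical, since the text reports only the per-phase totals (82, 195, and 130).

```python
from collections import Counter

# Six substage labels collapsed to three class labels, as described in the text.
STAGE_TO_CLASS = {"IA": 1, "IB": 1, "IIA": 2, "IIB": 2, "IIIA": 3, "IIIB": 3}

def relabel(substages):
    """Map each substage label to its class label (1, 2, or 3)."""
    return [STAGE_TO_CLASS[s] for s in substages]

# Hypothetical substage split; only the phase totals 82/195/130 come from the paper.
example = (["IA"] * 40 + ["IB"] * 42 + ["IIA"] * 100 + ["IIB"] * 95
           + ["IIIA"] * 60 + ["IIIB"] * 70)
counts = Counter(relabel(example))
```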

##### 2.2. Feature Selection

Feature selection can be organized into three categories, depending on how they interact with the construction of model. Filter methods employ a criterion to evaluate each feature individually and are independent of the model [15]. Among them, feature ranking is a common method which involves ranking all the features based on a certain measurement and selecting a feature subset which contains high ranked features [16]. However, one of the drawbacks of ranking methods is that the selected subset might not be optimal in that a redundant subset might be obtained. Wrapper methods involve combination searches through the feature space, guided by the predicting performance of a model [17]. Heuristic search is widely used in wrapper methods as searching strategy which can produce good results and is computationally feasible; however, they often yield local optimum results. For an embedded method, the feature search process is embedded into classification algorithm, so that the learning process and the feature selection process cannot be separated [18].

##### 2.3. Nonnegative Matrix Factorization

Nonnegative matrix factorization (NMF) aims to obtain a linear representation of multivariate data under nonnegativity constraints. These constraints lead to a part-based representation because only additive, not subtractive, combinations of the original data are allowed [19]. In general, NMF can be used to describe hundreds to thousands of features in a dataset in terms of a small number of metafeatures, particularly in gene expression profiles analysis [20–22].

Let $X$ be an $n \times p$ nonnegative matrix; that is, each element $x_{ij} \ge 0$. Nonnegative matrix factorization (NMF) consists in finding an approximation

$$X \approx WH, \tag{1}$$

where the *basis matrix* $W$ and the *mixture coefficient matrix* $H$ are $n \times r$ and $r \times p$ nonnegative matrices, respectively, where $r > 0$ and $r \ll \min(n, p)$. The objective behind the small value of $r$ is to summarize and split the information contained in $X$ into $r$ factors (also called “basis” or “metafeature”). The matrix $H$ has the same number of samples as matrix $X$ but a much smaller number of features. Therefore, the metafeature expression patterns in $H$ usually provide a robust clustering of samples [22].

The main approach to NMF is to estimate matrices $W$ and $H$ as a local minimum:

$$\min_{W, H \ge 0} \left[ D(X, WH) + R(W, H) \right], \tag{2}$$

where $D$ is a loss function that measures the quality of the approximation, usually based on either the Frobenius distance or the Kullback-Leibler divergence [19]. $R$ is an optional regularization function, defined to enforce desirable properties on matrices $W$ and $H$, such as smoothness or sparsity [23, 24].

In our study, the loss function $D$ in NMF is based on the Kullback-Leibler divergence [25]. The regularization term of formula (2) is defined in formula (3) through two regularization functions, one for $W$ and one for $H$. We applied Tikhonov smoothness regularization [26] in formula (4), weighted by a constant that is positive or zero, and sparsity-enforcing regularization [26] in formula (5), which is defined on the rows of the regularized matrix through their $L_1$-norm and $L_2$-norm. The algorithm proposed by Lee is a well-established method to solve the NMF optimization [27].
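As an illustration of the factorization, the following NumPy sketch implements the standard multiplicative updates for the Kullback-Leibler loss. It is a simplified stand-in, not the procedure used in this study: it omits the regularization terms entirely, and the function names `nmf_kl` and `kl_divergence`, the iteration count, and the toy matrix are all assumptions made for the example.

```python
import numpy as np

def nmf_kl(X, rank, n_iter=200, seed=0, eps=1e-10):
    """Factor X (features x samples) as W @ H using multiplicative KL updates.

    Unregularized sketch; returns W (features x rank) and H (rank x samples).
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, p)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        # H <- H * (W^T (X / WH)) / (column sums of W)
        H *= (W.T @ (X / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        # W <- W * ((X / WH) H^T) / (row sums of H)
        W *= ((X / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H

def kl_divergence(X, WH, eps=1e-10):
    """Generalized Kullback-Leibler divergence D(X || WH)."""
    X = np.asarray(X, dtype=float)
    return float(np.sum(X * np.log((X + eps) / (WH + eps)) - X + WH))

# Toy data: 12 symptoms x 30 samples with integer grades 0..3 (hypothetical).
X_demo = np.random.default_rng(1).integers(0, 4, size=(12, 30)).astype(float)
W_demo, H_demo = nmf_kl(X_demo, rank=3)
```

Because the updates only multiply by nonnegative ratios, $W$ and $H$ stay nonnegative throughout, which is what yields the additive, part-based representation described above.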

##### 2.4. NMF-Based Feature Selection

In this study, our proposed NMF-based feature selection (NMFBFS) approach can be seen as a two-stage filter method. In the first stage, preliminary screening is implemented to detect irrelevant symptoms and remove them from the whole feature set. In the second stage, NMF clusters the redundant symptoms which potentially have similar patterns into different groups, and each group is then transformed into new single features to reduce the dimension. Obviously, the process of NMFBFS is independent of classifier and can quickly infer the optimal feature subset even in the high-dimensional dataset. The flowchart of NMFBFS is shown in Figure 1.

###### 2.4.1. Removing the Irrelevant Symptoms

In our questionnaire, all the symptoms were defined by clinical doctors and covered many aspects of the patients. However, the relevance of each feature for distinguishing samples among the clinical stages had not been quantitatively studied. In machine learning, irrelevant features provide no useful information in any context and contribute little to patient stratification [28]. If the sample size is large, irrelevant symptoms can be detected quickly by calculating the frequencies of positive observations. Here, we calculated the ratio (frequency) of presence (positive) of each symptom over the samples in every clinical stage. If the frequencies of a certain symptom are very low in all the clinical stages, the symptom is rarely positive in most patients and is therefore considered an irrelevant symptom. After removing the irrelevant symptoms from the original dataset, the remaining symptoms are considered relevant features, which are potentially related to at least one class of patients (or one clinical stage).
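A minimal sketch of this screening step is given below. It assumes a symptoms-by-samples matrix, treats any nonzero value as positive, and uses the 10% threshold reported in the Results. The function names and the toy data are hypothetical.

```python
import numpy as np

def positive_frequencies(X, stages):
    """Per-stage frequency of positive (nonzero) values for each symptom.

    X is symptoms x samples; returns (stage labels, stages x symptoms array).
    """
    X = np.asarray(X)
    stages = np.asarray(stages)
    positive = X > 0
    labels = np.unique(stages)
    freqs = np.vstack([positive[:, stages == s].mean(axis=1) for s in labels])
    return labels, freqs

def irrelevant_mask(freqs, threshold=0.10):
    """True for symptoms whose positive frequency is below threshold in every stage."""
    return (freqs < threshold).all(axis=0)

# Toy example: 3 symptoms, 6 samples, two stages (hypothetical values).
X_toy = np.array([
    [0, 0, 0, 0, 0, 0],   # never positive
    [1, 2, 0, 1, 3, 0],   # often positive in both stages
    [0, 0, 1, 0, 0, 0],   # positive only in stage 1
])
stages_toy = np.array([1, 1, 1, 2, 2, 2])
labels, freqs = positive_frequencies(X_toy, stages_toy)
mask = irrelevant_mask(freqs)
```

In this toy case only the first symptom is flagged: the third is rare overall but exceeds the threshold in one stage, so it is kept as potentially stage-specific.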

###### 2.4.2. Identifying Redundant Symptoms Based on NMF

After the irrelevant symptoms had been removed, nonnegative matrix factorization was applied to the dataset $X$. For a given rank $r$, the matrix $X$ can be decomposed into a *basis matrix* $W$ and a *coefficient matrix* $H$. Usually, the value of rank $r$ is much smaller than the number of features ($n$) and the number of samples ($p$), so that at least one dimension of both $W$ and $H$ is very small. The widespread application of NMF in biclustering further indicates that the basis matrix $W$ can be used for feature clustering and the coefficient matrix $H$ for sample clustering [20, 21]. In our study, the number of samples is much larger than the dimensionality; hence, directly calculating distance or correlation between original features (symptoms) over all the samples will lead to biases, because some features might show locally similar patterns on only a part of the samples. Fortunately, the basis matrix $W$ represents the compressed sample space of matrix $X$, which facilitates uncovering the difference between features. Here, we take two features ($i$ and $j$) of the original dataset as an example to clarify the basic idea of this step. According to the definition of NMF,

$$x_i \approx w_i H, \qquad x_j \approx w_j H, \tag{6}$$

where $x_i$ and $x_j$ are the $i$th and $j$th rows of matrix $X$, and $w_i$ and $w_j$ are the $i$th and $j$th rows of matrix $W$. The following can be easily found: (1) if $w_i = w_j$, then $x_i \approx x_j$; (2) if $w_i = c\,w_j$, then $x_i \approx c\,x_j$, where $c$ is a constant. Furthermore, if the $i$th row of matrix $W$ is very close to $w_j$, feature $i$ might have a similar pattern to feature $j$ on all the samples. Therefore, we defined a novel *similarity measurement* in formula (7) to approximately evaluate the redundancy between two original symptoms via matrix $W$. It combines two terms: formula (8) uses *distance-based similarity*, which indicates how close two corresponding features are to each other, and formula (9) adopts *correlation-based similarity*, which describes similar patterns of two original features. Hence, our similarity measurement considers distance and correlation between features at the same time. The normalizing constant in formula (8) is the maximal distance value over all pairs of rows ($w_i$, $w_j$). Based on the above definition of similarity, we further calculated the similarity matrix $S$ using all the basis rows in $W$, where element $s_{ij}$ denotes the similarity between original features $i$ and $j$. Given a threshold, we can screen all the redundant features by groups, keeping the pairs whose similarity reaches the threshold.
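The sketch below shows one way such a similarity could be computed from the rows of the basis matrix. It is not the exact definition of formulas (7)–(9). It assumes a distance term of the form 1 − d/d_max with Euclidean distance, a Pearson correlation term, their minimum as the combined score, and connected components of the thresholded pairs as groups. All function names and the toy matrix are hypothetical.

```python
import numpy as np

def similarity_matrix(W):
    """Assumed combination: min(1 - d/d_max, Pearson correlation) over rows of W."""
    W = np.asarray(W, dtype=float)
    diff = W[:, None, :] - W[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    d_max = dist.max()
    s_dist = 1.0 - dist / d_max if d_max > 0 else np.ones_like(dist)
    centered = W - W.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1)
    denom = np.outer(norms, norms)
    # Rows with zero variance get correlation 0 instead of NaN.
    s_corr = np.divide(centered @ centered.T, denom,
                       out=np.zeros_like(denom), where=denom > 0)
    return np.minimum(s_dist, s_corr)

def redundant_groups(S, theta=0.95):
    """Group features linked by similarity >= theta (connected components)."""
    n = S.shape[0]
    adj = S >= theta
    np.fill_diagonal(adj, False)
    seen, groups = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            i = stack.pop()
            comp.append(i)
            for j in np.flatnonzero(adj[i]):
                j = int(j)
                if j not in seen:
                    seen.add(j)
                    stack.append(j)
        if len(comp) > 1:
            groups.append(sorted(comp))
    return groups

# Toy basis matrix with rank 3: rows 0 and 1 are nearly identical (hypothetical).
W_toy = np.array([[1.0, 0.10, 0.0],
                  [1.0, 0.11, 0.0],
                  [0.0, 0.10, 1.0],
                  [0.0, 1.00, 0.2]])
S_toy = similarity_matrix(W_toy)
groups_toy = redundant_groups(S_toy, theta=0.95)
```

Taking the minimum requires a pair to be both close and correlated. Other combinations, such as a weighted sum, would fit the verbal description equally well.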

###### 2.4.3. Transformation of Redundant Symptoms by Group

In the above section, all the redundant symptoms were screened out and were organized into different groups. For each symptom group, a new mixed feature was extracted as the representation of the whole group and replaced all the original features within this group. Therefore, NMFBFS-inferred optimal feature subset includes two parts: nonredundant original features and new generated mixed features (see Figure 1). There are two strategies that can be used to transform the redundant symptom groups to mixed features.

(1) Calculate the mean vector of all the redundant symptoms as in

$$f_{\text{new}} = \frac{1}{k} \sum_{i=1}^{k} f_i, \tag{10}$$

where $f_1, f_2, \ldots, f_k$ are the feature vectors of the original dataset that were determined to be redundant symptoms in one group, and $k$ denotes the number of inferred redundant symptoms in that group. The vector of the new single feature is the average over that group.

(2) Randomly select the vector of one of the redundant symptoms as

$$f_{\text{new}} = f_i, \tag{11}$$

where $i$ is chosen at random among the $k$ redundant symptoms of the group.

In our study, we transformed the groups of redundant symptoms to new mixed features by using formula (10). After this step, the feature space of the clinical dataset was further reduced so that the optimal feature subset rarely included redundant features.
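The mean-based transformation of formula (10) can be sketched as follows, assuming a symptoms-by-samples matrix. The function name `compress_groups` is hypothetical, and the index groups in the example are invented purely to reproduce the counts reported in this paper (49 relevant symptoms, 16 of them redundant within 6 groups, giving 33 + 6 = 39 features).

```python
import numpy as np

def compress_groups(X, groups):
    """Replace each group of redundant rows by its mean row.

    X is symptoms x samples. Returns the reduced matrix (nonredundant rows
    first, then one mixed feature per group) and the indices of kept rows.
    """
    X = np.asarray(X, dtype=float)
    grouped = {i for g in groups for i in g}
    kept = [i for i in range(X.shape[0]) if i not in grouped]
    mixed = [X[list(g)].mean(axis=0) for g in groups]
    return np.vstack([X[kept]] + mixed), kept

# Hypothetical grouping: only the totals (16 symptoms, 6 groups) follow the paper.
groups_demo = [[0, 1, 2, 3, 4], [5, 6, 7], [8, 9], [10, 11], [12, 13], [14, 15]]
X_rel = np.arange(49 * 4, dtype=float).reshape(49, 4)   # 49 symptoms, 4 toy samples
X_opt, kept_demo = compress_groups(X_rel, groups_demo)
```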

#### 3. Simulation Design

Firstly, we calculated the frequencies of each original symptom appearing positive at each clinical stage and then removed the irrelevant symptoms if their frequency values were very low.

Secondly, a representative sample set was screened out for NMF analysis. In our dataset, the number of samples in the three phases of HCC varies considerably (82, 195, and 130 for phases I, II, and III). If the whole dataset is used, a class imbalance problem arises [29–31]. In addition, the sex ratio of patients is also seriously unbalanced in the original dataset (Table 1). To avoid the bias caused by sample imbalance, we selected 40 samples from each clinical phase with an equal proportion of males and females (20 : 20) to construct a representative clinical dataset (120 samples in total) for the following NMF analysis. Since each original sample has a class label that corresponds to the clinical stage of that patient, the 407 original samples are already partitioned into three clusters, which can also be considered as a trained $k$-nearest neighbor (KNN) model [32]. We then defined the center of each cluster as the mean vector of all the samples in that cluster. Given a large value of $k$, we input each cluster center into the above KNN model and kept the output consistent with the class label of the center. Based on the $k$-nearest neighbors, we finally screened out 40 representative samples (20 males and 20 females) of each clinical stage according to Euclidean distance.
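A simplified reading of this selection step is sketched below: for each clinical stage, the samples nearest to the stage centroid in Euclidean distance are kept, separately for each sex. This is an assumption about how the KNN-based screening behaves, not a reproduction of it. The function name and the synthetic data are hypothetical.

```python
import numpy as np

def representative_samples(X, stages, sex, per_sex=20):
    """Pick, per stage and sex, the samples closest to the stage centroid.

    X is symptoms x samples; returns sorted column indices of selected samples.
    """
    X = np.asarray(X, dtype=float)
    stages = np.asarray(stages)
    sex = np.asarray(sex)
    chosen = []
    for s in np.unique(stages):
        in_stage = stages == s
        center = X[:, in_stage].mean(axis=1, keepdims=True)
        dist = np.linalg.norm(X - center, axis=0)   # distance of every sample to this center
        for g in np.unique(sex):
            idx = np.flatnonzero(in_stage & (sex == g))
            nearest = idx[np.argsort(dist[idx])[:per_sex]]
            chosen.extend(nearest.tolist())
    return np.array(sorted(chosen))

# Synthetic data: 10 symptoms, 180 samples, 3 stages, alternating sex labels.
rng = np.random.default_rng(0)
X_syn = rng.integers(0, 4, size=(10, 180)).astype(float)
stages_syn = np.repeat([1, 2, 3], 60)
sex_syn = np.tile(np.array(["M", "F"]), 90)
selected = representative_samples(X_syn, stages_syn, sex_syn, per_sex=20)
```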

Finally, several redundant symptom groups were identified. Then we transformed each redundant symptom group into a new mixed feature. Combining all the nonredundant original features with new generated mixed features, we obtained an optimal clinical symptom subset of HCC. At last, the classification performance of this feature subset was further validated by least squares support vector machines (LSSVM) [33, 34].

*Experimental Parameters*. At first, we set a frequency threshold to identify the irrelevant symptoms. The NMF package [35] was then employed as a computational framework for nonnegative matrix factorization algorithms in R. For this method, the optimal rank $r$ should be determined first. Several approaches have been proposed to determine the optimal value of $r$ [36, 37]. In our study, two methods, the cophenetic coefficient [36] and the RSS curve [37], were adopted to determine the optimal rank in the range from 2 to 7. After obtaining the NMF results with the optimal $r$, we calculated the similarity matrix with all the basis rows and inferred the redundant symptoms with a threshold applied to the similarity terms of formulas (7)–(9). Finally, an LSSVM classifier was implemented to validate the classification performance of the inferred optimal symptom subset. In the LSSVM multiclass model, a Gaussian RBF kernel was employed, and the two kernel parameters were determined by grid search [38]. In our grid search, each parameter was set as a function of a grid variable: the first variable changes from −1 to 5 with step 0.25, and the second changes from −1 to 4 with step 0.2. In total, there are 24 levels for the first parameter and 25 levels for the second; in other words, 600 parameter pairs are tested when training an LSSVM classifier. To find the optimal pair, we used 5-fold cross-validation to evaluate the classification accuracy of the LSSVM model.
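The level counts can be checked with a short sketch. The reported 24 and 25 levels match grids that exclude the upper endpoint, which is an assumption made here; with both endpoints included the grids would have 25 and 26 levels. The variable names are hypothetical, and the mapping from grid variables to the actual parameter values is not modeled.

```python
import itertools
import numpy as np

# Half-open grids (upper endpoint excluded), assumed to match the reported counts.
first_grid = -1 + 0.25 * np.arange(24)    # -1, -0.75, ..., 4.75
second_grid = -1 + 0.2 * np.arange(25)    # -1, -0.8, ..., about 3.8

# Every combination of the two grid variables is one candidate parameter pair.
pairs = list(itertools.product(first_grid, second_grid))
n_pairs = len(pairs)
```

With 5-fold cross-validation, each of these pairs requires five model fits, which is why the size of the grid matters for running time.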

#### 4. Results and Discussion

Firstly, we calculated the frequencies of positive observations for all 57 original symptoms at each clinical stage (see Supplementary Table S1 available online at http://dx.doi.org/10.1155/2015/846942). Eight symptoms were judged to be irrelevant features (threshold: 10%). Table 2 shows that these symptoms appeared in few patients (less than 10% in each clinical stage) in the clinical observation, and they were therefore considered noisy features in the process of diagnosis. Because the total number of samples is large (407), we consider the eight irrelevant symptoms identified by statistical analysis to be reliable. Some of the symptoms shown in Table 2 are supported by previous studies. For example, Lai et al. concluded that no association is detected between “emotional depression” and the risk of hepatocellular carcinoma in older people in Taiwan [39, 40]. In addition, Peng et al. studied 169 Chinese patients with HCC, of whom only three presented with hydrothorax, which also indicates that this symptom is not a key symptom in the development of liver cancer [41, 42]. By contrast, “edema in lower extremities” is a well-known clinical symptom of HCC patients [43]; however, it was considered an irrelevant symptom in this study because it rarely appeared in any of the three stages of our data. Increasing the number of observed samples or lowering the threshold would make it a candidate symptom.

Secondly, NMF was computed after removing all the detected irrelevant symptoms. As described in “Simulation Design,” NMF was applied to the representative matrix $X$ with 120 HCC samples, which uniformly covered the three clinical phases. Figure 2(a) shows that $X$ is a sparse matrix, in which a large proportion of elements is zero (not positive), as for the symptoms shown in Figure 2(b). However, some symptoms were positive in many patients, as for the symptoms shown in Figure 2(c). Matrix $X$ does not show obvious subtypes or patterns; hence, it is hard to compare the similarity between symptoms directly from the row vectors of $X$, since the number of samples is still very large. In this study, we used NMF to compress the representative matrix $X$ and to reveal the distribution patterns of features (symptoms) on fewer samples. Before computing NMF, a critical parameter must be determined: the value of the factorization rank $r$. According to Brunet’s method, the first value of $r$ for which the cophenetic coefficient starts decreasing is the optimal one [36]. Frigyesi and Höglund suggested choosing the first value where the RSS curve presents an inflection point [37]. Based on these two methods, we determined that 3 is a reasonable value of rank $r$ for the clinical data matrix $X$. The curves shown in Figure 3 also confirm this conclusion. Nonnegative matrix factorization was then applied to the matrix $X$ with rank 3, which means that the number of metafeatures (bases) equals 3.

Figure 4 presents the final results of NMF, which include the basis matrix ($W$) and the mixture coefficient matrix ($H$). Each row of matrix $W$ uses a compressed pattern to approximately represent the distribution of a symptom over all the original samples. Compared with matrix $X$ shown in Figure 2, the obvious difference in matrix $W$ is that several groups of features reveal similar patterns in the compressed sample space (Figure 4). According to Figure 2(a), the vectors of such symptoms in $X$ are also close in distance; furthermore, their compressed patterns in matrix $W$ in Figure 4 make it easier to identify redundant features that have very similar distribution patterns.

The matrix $H$ has the same number of samples as the original matrix $X$ but a much smaller number of metafeatures (bases) [36]. Therefore, the metafeature expression patterns in $H$ usually provide a robust clustering of samples. Given the $j$th column of $H$ as $h_j$, we placed the $j$th clinical sample into the $k$th cluster if $h_{kj}$ is the largest entry of $h_j$, where $k \in \{1, 2, 3\}$. Hence, we used matrix $H$ to group all the samples into 3 clusters, which correspond to the 3 bases (metafeatures). Figure 5 shows that there is a large overlap between the clinical-staging markers (a priori knowledge of class labels) and the indexes of basis components (metafeatures) on the 120 original clinical samples included in dataset $X$.

In matrix $W$, each column also corresponds to a metafeature or basis (see Figure 4). Entry $w_{ik}$ in matrix $W$ is the coefficient of original feature $i$ in metafeature (basis) $k$ [36]. Therefore, an original feature $i$ relates to basis $k$ if $w_{ik}$ is the largest entry in row $i$ of matrix $W$. Figure 4 shows that the original symptom features participating in the same basis have expression patterns more similar to each other than to those in other bases. Table 3 lists the symptoms related to each basis component. Combining Figure 5 and Table 3 further indicates that the “basis 1” related symptoms are closely related to the clinical samples of phase II, and the “basis 2” and “basis 3” related symptoms are closely related to phase I and phase III, respectively. This finding contributes to identifying *clinical phase-specific* important symptoms via NMF. Moreover, the partition of the 49 clinical symptoms shown in Table 3 is well supported by related studies. For example, *nausea* is observed as a common adverse effect in HCC patients in phase I [44]. The symptoms *ascites*, *anorexia*, *fever*, and *jaundice* often occur in phase II [43, 45–48]. The symptoms “*yellow complexion*” and “*yellow skin and eye*” shown in Table 3 are obvious manifestations of *jaundice*. For phase III, *pain* is the most obvious characteristic of HCC patients [49]. There are three pain-related symptoms in Table 3: “*pain in shoulder and back*,” “*chest pain*,” and “*distending pain in hypochondrium*.” Moreover, *fatigue* and *weakness* are also common in HCC patients [43]. Together, these findings suggest that NMF with an optimal rank can reveal the latent associations between the potential symptom features and clinical phases.
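The two assignment rules, samples to clusters via the largest entry in each column of the coefficient matrix and features to bases via the largest entry in each row of the basis matrix, reduce to `argmax` calls. The sketch below also builds a small contingency table of stage labels against cluster indexes, similar in spirit to the overlap shown in Figure 5. The function names and toy matrices are hypothetical, and cluster indexes are zero-based.

```python
import numpy as np

def assign_samples(H):
    """Cluster index of each sample: row of the largest entry in its column of H."""
    return np.argmax(H, axis=0)

def assign_features(W):
    """Basis index of each feature: column of the largest entry in its row of W."""
    return np.argmax(W, axis=1)

def contingency(stage_labels, clusters, n_clusters):
    """Count samples for every (clinical stage, cluster) combination."""
    stage_labels = np.asarray(stage_labels)
    stages = np.unique(stage_labels)
    table = np.zeros((len(stages), n_clusters), dtype=int)
    for row, s in enumerate(stages):
        for k in range(n_clusters):
            table[row, k] = np.sum((stage_labels == s) & (clusters == k))
    return stages, table

# Toy factor matrices with rank 3 (hypothetical values).
H_toy = np.array([[0.90, 0.1, 0.2, 0.1],
                  [0.05, 0.8, 0.1, 0.7],
                  [0.05, 0.1, 0.7, 0.2]])
W_small = np.array([[0.2, 0.7, 0.1],
                    [0.6, 0.3, 0.1],
                    [0.1, 0.1, 0.8]])
sample_clusters = assign_samples(H_toy)
feature_bases = assign_features(W_small)
stages_found, table = contingency([2, 1, 3, 1], sample_clusters, n_clusters=3)
```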

As mentioned above in “Simulation Design,” several groups of redundant features were then screened out according to a given threshold (Table 4). We obtained two redundant symptom groups from each basis component, which indicates that the redundant symptoms included in the same group might also have similar patterns in the original sample space. Here, we take Figures 2(b)-2(c) as examples to corroborate the effectiveness of our method. Figure 2(b) presents the distribution of positive values of five symptoms in the dataset $X$. These five symptoms were identified as basis 2 related features, and they most likely belong to phase I (Table 4). Although the row vectors in Figure 2(b) are not identical, they all show a relatively low frequency of positive values, and their local distribution patterns are similar to some extent. Comparing the corresponding rows of these five symptoms in matrix $W$ in Figure 4, we found that their compressed patterns are very similar. Similarly, three further symptoms are potentially related to basis 3: the frequency of positive values for each is over 50%, and the mean positive value for these three symptoms is 1.77, which further indicates that they might be related to patients whose conditions are very serious. Although these symptoms were not identified as redundant symptoms with the given threshold (0.95), their compressed patterns in matrix $W$ in Figure 4 also suggest that their patterns are very close. In summary, matrix $W$ facilitates evaluating the difference among symptoms, and matrix $H$ can validate the high degree of correlation between the class labels of samples and the basis indexes. After inferring the redundant symptoms with the given threshold, we combined each symptom group and converted it into a new feature (named mixed feature).

Finally, we obtained 39 clinical features of HCC as the optimal feature subset, which consists of two parts: 33 original symptom features and 6 new mixed features (Table 5). Based on the analysis of the NMF results, the feature space of the original dataset was thus further reduced.

To evaluate the potential of the NMFBFS-inferred optimal feature subset, we first tested the classification accuracy of three candidate feature subsets on the training set (120 representative samples): the 39 optimal features, the 33 nonredundant original features, and the 49 original symptom features in the dataset $X$. The first two subsets were generated by feature selection with the threshold 0.95. Table 6 indicates that the 39 optimal features, which cover 33 original symptom features and 6 new mixed features, result in the best classification accuracy on the training samples. The performance of the 33 original features was much better than that of all 49 features; however, it was still worse than that of the 39 optimal features, because the new mixed features also make important contributions to classification.

We then compared the performance of NMFBFS with three well-known feature selection methods (ReliefF [11], mRMR [12], and Elastic Net [13]). ReliefF was implemented using a MATLAB function. The “mRMRe” and “elasticnet” packages were applied for mRMR and Elastic Net based feature selection, respectively. Supplementary Figure S1 shows the ReliefF-based feature ranking. Supplementary Figure S2 shows the Elastic Net solution paths for feature selection. For each method, we selected the Top 20 and Top 40 features as two candidate feature subsets to evaluate their classification performance, giving two subsets each from ReliefF, mRMR, and Elastic Net. Table 7 reports the classification performance of these six candidate feature subsets and the NMFBFS-derived optimal feature subset on the training set (120 representative samples). The results indicate that the NMFBFS-inferred feature subset has the best classification accuracy on the training samples.

Apart from the 120 representative training samples that were screened out for the NMF analysis, the remaining samples can be used to test the classification accuracy of the optimal feature subset. We randomly selected 40 samples (10 : 20 : 10 for the three clinical stages) from the remaining samples and then evaluated the classification accuracy of the feature subset inferred by each method (NMFBFS, ReliefF, mRMR, and Elastic Net). Table 8 shows the differences among these methods; the optimal feature subset inferred by our proposed method has the best generalization performance.

Finally, and importantly, the choice of threshold determines how many groups of redundant symptoms are screened out. Here, we further discuss the effect of the threshold on the optimal feature subset and its classification performance. Table 9 shows the differences among three optimal feature subsets inferred by the proposed approach with different threshold values. Table 9 shows that a larger threshold screens redundant symptoms more strictly, so fewer similar symptoms are obtained. With a smaller threshold, many more symptoms are placed in the same groups; hence, the original feature space is sharply reduced by our approach. Table 9 also indicates that, as the threshold decreases, the optimal feature subset becomes smaller but the classification accuracy also decreases. These results suggest that a larger threshold yields fewer redundant symptoms and therefore a larger optimal feature subset; conversely, a smaller threshold yields more redundant symptoms and sharply reduces the feature dimension. In the extreme case where the threshold equals 0, we get one mixed feature for each basis, and the size of the optimal feature subset equals the number of bases. In short, the choice of threshold depends on the desired size of the optimal feature subset and its corresponding classification performance.

#### 5. Conclusions

In this study, we developed the NMFBFS approach to efficiently extract the important clinical symptoms of HCC from clinical observation data. NMFBFS is a two-stage filter method for feature selection as follows. (1) In the first stage, preliminary screening is implemented to detect and remove the irrelevant features; (2) in the second stage, NMF was applied to identify the redundant features by groups which might represent similar distribution patterns. Each redundant symptom group was then transformed into a new mixed feature so that the dimension of dataset was further reduced.

The application of NMFBFS on a clinical dataset of HCC proved the effectiveness of this approach. The optimal clinical features derived from NMFBFS approach contained many well-recognized symptoms of HCC patients. Moreover, this study also provides a general computational framework of a novel feature selection approach to efficiently extract the optimal feature subset from a high-dimensional dataset.

#### Abbreviations

HCC: Hepatocellular carcinoma

TCM: Traditional Chinese Medicine

NMF: Nonnegative matrix factorization

LSSVM: Least squares support vector machines

KNN: $k$-nearest neighbor.

#### Conflict of Interests

The authors declare that they have no competing interests.

#### Authors’ Contribution

Zhiwei Ji and Guanmin Meng contributed equally to this work.

#### Acknowledgments

This work was supported by the National Science Foundation of China (nos. 61472282 and 61133010). The data in this work was collected by the Changhai Hospital, Shanghai, China.

#### Supplementary Materials

Supplementary information includes two Figures and one Table.

Fig S1: denotes the results of feature ranking with ReliefF.

Fig S2: denotes the results of feature selection with Elastic Net.

Table S1: represents the frequencies of each symptom feature appearing positive over the samples in all the clinical stages.