Abstract
To select more effective feature genes, many existing algorithms concentrate on choosing and studying evaluation methods for feature genes while ignoring the accurate mapping of the original information during data processing. To address this problem, this paper proposes a new model, the rough uncertainty metric model. First, the fuzzy neighborhood granule of a sample is constructed by combining the fuzzy similarity relation with the neighborhood radius in the rough set, and the rough decision is defined by using the fuzzy similarity relation and the decision equivalence class. Then, the fuzzy neighborhood granule and the rough decision are introduced into the conditional entropy to obtain the rough uncertainty metric model; a measure of the significance of feature genes is defined, and some related theorems are proved. To make the model tolerant of noise in the data, a variable precision model is introduced and the selection of its parameters is discussed. Finally, based on the rough uncertainty metric model, a feature genes selection algorithm is designed and compared with some existing similar algorithms. The experimental results show that the proposed algorithm selects a smaller feature genes subset with higher classification accuracy, which supports the effectiveness of the proposed model.
1. Introduction
Nowadays, with the continuous changes in human lifestyle and environment, the incidence and mortality of cancers are rising. Therefore, improving the analysis, identification, and treatment of tumors has become a research hotspot [1]. Gene expression profiling data can approximately reflect the expression information of the entire genome of biological cells. With the development of gene chip technology, the accurate acquisition of gene expression profiling data has become possible, which provides an important basis for clinical tumor diagnosis and research on tumor pathogenesis [2–4]. However, among the large number of genes included in gene expression profiling data, only a few important genes can be used as information genes to track diseases [5, 6]. Therefore, the processing of gene expression profiling data is usually cast as an information genes selection problem, that is, feature genes selection, whose purpose is to reduce noise and redundant data in gene expression profiles and to obtain feature genes subsets with strong disease-recognition ability [5].
Gene expression profiling data contain complex and specific information. If the original information of the data can be applied to the calculation as accurately as possible, the result of feature genes selection will improve considerably. The classical rough set proposed by Pawlak [7] has been extensively developed and studied [8, 9]. Its theory is based on the equivalence relation and can only deal with discrete data. Continuous data must be discretized before processing, which causes problems such as information loss. To solve this problem, the neighborhood rough set [10–13] and the fuzzy rough set [14–18] were successively proposed as two important models. The neighborhood rough set can directly process continuous data, which overcomes this shortcoming of the classical rough set, but it cannot accurately describe the fuzziness of samples under a fuzzy background. In the fuzzy rough set, a sample is usually described by its relationship with neighboring samples, so data noise makes the calculation result less reliable and increases the classification error rate [19]. To address this problem, the concept of the fuzzy neighborhood was proposed in [20], which overcomes the above deficiencies to some extent and is used to construct a fitting feature selection model based on the fuzzy rough set. Finding good feature evaluation functions is crucial in the feature selection process; examples that have already been proposed include dependence [19, 21, 22] and information entropy [23–25]. As a knowledge acquisition tool, rough set theory is gradually being used in the analysis of gene expression profiling data, where the dependency function evaluates the classification ability of a feature genes subset, and it has achieved good results. However, the dependency function mainly depends on the positive domain and the boundary domain, which leads to inaccurate measurement.
At present, some scholars have introduced information entropy into rough sets, for example, rough entropy [26] and conditional entropy [27, 28]. As noted in [29, 30], the algebraic definition of attribute importance focuses on the influence of attributes on the certain subset of categories, while the information-theoretic definition considers the influence on the uncertain subset of categories, so the two are highly complementary. Therefore, combining them makes the measurement mechanism more comprehensive.
To make the measurement mechanism more comprehensive and to reduce the loss of original data information during calculation, this paper combines the fuzzy and neighborhood concepts in the data characterization stage to redefine the fuzzy neighborhood granule of [19] and uses the fuzzy similarity between samples and the decision equivalence class to define the rough decision. With these concepts, the original information of the data is preserved as faithfully as possible during sample characterization. Conditional entropy is then adopted as the basis of the feature evaluation function, and a new feature genes selection model is proposed: the rough uncertainty metric model (RUM). The evaluation function for the significance of feature genes is defined, and some related theorems are proved. Finally, based on the rough uncertainty metric model, a feature genes selection algorithm is designed and compared with other existing similar algorithms to verify the validity of the new model.
The remainder of the paper is organized as follows. In Section 2, some basic concepts about rough set theory are reviewed. In Section 3, the rough uncertainty metric model is proposed and a heuristic feature genes selection algorithm is presented for this model. In Section 4, the validity and feasibility of the proposed model are verified by comparative experiments. Section 5 concludes the paper.
2. Related Theoretical Knowledge
This section mainly reviews some basic conceptual knowledge of neighborhood rough set theory and fuzzy rough set theory, including neighborhood relation, fuzzy relation, and the combination of fuzzy and neighborhood concepts.
2.1. Neighborhood Relation
In the data processing stage, the neighborhood rough set mainly uses the neighborhood radius to divide the universe. The radius controls the size of the sample neighborhood and allows continuous numerical data to be processed through a distance measure between samples. The basic neighborhood concepts proposed by Hu et al. [10] are as follows:
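Let the universe of discourse $U$ be an $N$-dimensional real-valued space and $L: \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$. Then, $L$ is a metric on $\mathbb{R}^N$ if it satisfies the following conditions: (1) $L(x_1, x_2) \ge 0$, and the equality holds if and only if $x_1 = x_2$, $\forall x_1, x_2 \in \mathbb{R}^N$; (2) $L(x_1, x_2) = L(x_2, x_1)$, $\forall x_1, x_2 \in \mathbb{R}^N$; (3) $L(x_1, x_3) \le L(x_1, x_2) + L(x_2, x_3)$, $\forall x_1, x_2, x_3 \in \mathbb{R}^N$.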
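For $x \in U$, its neighborhood is expressed as $\delta(x) = \{y \in U \mid L(x, y) \le \delta\}$ and $\delta \ge 0$. According to the nature of the metric, (1) $\delta(x) \ne \emptyset$, since $x \in \delta(x)$; (2) $y \in \delta(x) \Leftrightarrow x \in \delta(y)$, $\forall x, y \in U$; (3) $\bigcup_{x \in U} \delta(x) = U$.
It can be seen from (3) that the neighborhoods of all samples form a covering of $U$.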
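The neighborhood construction can be illustrated with a short sketch. It assumes the Euclidean distance as the metric and a small hypothetical sample set; the function name `neighborhood` and the radius value are chosen for illustration only and do not come from [10].

```python
import math


def neighborhood(samples, i, radius):
    """Return the indices of all samples within `radius` of sample i.

    The Euclidean distance is an assumption; any metric could be used.
    """
    return [j for j, y in enumerate(samples)
            if math.dist(samples[i], y) <= radius]


# Hypothetical two-dimensional samples, already scaled to [0, 1].
samples = [(0.10, 0.20), (0.15, 0.25), (0.80, 0.90)]
hoods = [neighborhood(samples, i, 0.1) for i in range(len(samples))]
```

In this toy case each sample lies in its own neighborhood, membership is symmetric, and the neighborhoods together cover the whole sample set.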
2.2. Fuzzy Relation
The fuzzy relation is described approximately by the membership degree of a sample on a set about an attribute. This expression is a good mathematical representation of clear and unclear concepts. Dubois and Prade [14] mentioned the basic fuzzy relations.
Let and a mapping be on the universe of discourse U, then S is a fuzzy set. For , is the membership of x on S, and is recorded as a fuzzy set on U. B is an attribute set of the sample, which can induce a fuzzy binary relationship on U.
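If $S$ and $T$ are two fuzzy sets on $U$, their intersection, union, and complement are computed as follows: (1) $(S \cap T)(x) = \min\{S(x), T(x)\}$; (2) $(S \cup T)(x) = \max\{S(x), T(x)\}$; (3) $S^{c}(x) = 1 - S(x)$.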
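If the fuzzy binary relation $R_B$ induced by $B$ satisfies (1) reflexivity: $R_B(x, x) = 1$, $\forall x \in U$, and (2) symmetry: $R_B(x, y) = R_B(y, x)$, $\forall x, y \in U$, then $R_B$ is the fuzzy similarity relation on $U$. For $x \in U$, the fuzzy neighborhood of $x$ about $R_B$ can be expressed as $[x]_B(y) = R_B(x, y)$, $y \in U$, which is a fuzzy set on $U$.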
2.3. Fuzzy Neighborhood Granule
The fuzzy relation can accurately describe the relationship between samples, but the strength of association differs from one pair of samples to another. To reduce the influence of redundancy and noise in the data, a neighborhood radius can be set to filter out weakly correlated samples, which also improves computational efficiency. Wang et al. [20] proposed the concept of fuzzy neighborhood:
Let be a decision information system. For , , , is the fuzzy similarity relation on U. For , the fuzzy neighborhood of x with respect to is defined aswhere and and , respectively, represent the value of the corresponding attribute b. For , the fuzzy neighborhood granule of x with respect to is defined aswhere is the fuzzy neighborhood radius, and when , .
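The idea of a fuzzy neighborhood granule can be sketched as follows. Because the defining formulas are not legible in this text, the sketch rests on stated assumptions: the per-attribute similarity is taken as 1 - |b(x) - b(y)| on data scaled to [0, 1], similarities over an attribute subset are combined with the minimum (fuzzy intersection), and similarities below a cutoff are set to zero. How the cutoff relates to the radius α is not recoverable here, so it is passed in directly. The names `similarity`, `granule`, and `cutoff` are hypothetical.

```python
def similarity(x, y, attrs):
    """Assumed fuzzy similarity of two samples over the attribute subset.

    Uses 1 - |b(x) - b(y)| per attribute and combines attributes with min.
    """
    return min(1.0 - abs(x[b] - y[b]) for b in attrs)


def granule(data, i, attrs, cutoff):
    """Fuzzy neighborhood granule of sample i: weak similarities are zeroed."""
    row = [similarity(data[i], y, attrs) for y in data]
    return [r if r >= cutoff else 0.0 for r in row]


# Hypothetical samples with two attributes, scaled to [0, 1].
data = [(0.1, 0.2), (0.2, 0.4), (0.9, 0.8)]
g0 = granule(data, 0, attrs=(0, 1), cutoff=0.7)
```

Here the third sample is only weakly similar to the first one, so its membership in the granule of the first sample is filtered to zero, while the sample itself keeps membership 1.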
3. Rough Uncertainty Metric
In this section, the paper introduces the rough decision by using the fuzzy similarity relation between samples to further determine the inclusion of equivalence classes accurately. Combining the rough decision with the fuzzy neighborhood granule of the sample, the definition of the conditional entropy with respect to the attribute set and the proof of its related theorems are given. Then, the rough uncertainty metric model is proposed, and the corresponding feature genes selection algorithm is designed according to the proposed model.
3.1. Rough Decision
Generally, the decision equivalence classes of the general domain are divided by decision attributes, but there is always a correlation between samples. For example, a fuzzy neighborhood granule of a sample contains samples with different equivalent decision classes. Therefore, a sample cannot be completely categorized into a decision equivalence class. In this paper, the fuzzy similarity relation is used to calculate the membership degree of each sample for different decision equivalence classes, and a more accurate rough decision is proposed.
Definition 1. Let be a decision information system. ; , is the fuzzy similarity relation on U, and the rough decision is defined as follows:where , , and .
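To make the notion of class membership concrete, the sketch below computes one possible rough decision from a fuzzy similarity matrix. The formula of Definition 1 is not legible in this text, so the sketch assumes a normalized form in which the membership of a sample in a decision class is the summed similarity to the samples of that class divided by the summed similarity to all samples. This assumption may differ from the definition used in the paper; the function name `rough_decision` and the toy matrix are hypothetical.

```python
def rough_decision(sim, labels):
    """Membership of every sample in every decision class (assumed form).

    sim    : reflexive fuzzy similarity matrix as a list of rows
    labels : decision label of each sample
    Returns the sorted class list and one membership row per sample.
    """
    classes = sorted(set(labels))
    table = []
    for row in sim:
        total = sum(row)  # at least 1 because the relation is reflexive
        table.append([
            sum(r for r, lab in zip(row, labels) if lab == c) / total
            for c in classes
        ])
    return classes, table


# Hypothetical similarity matrix for three samples and their labels.
sim = [[1.0, 0.8, 0.2],
       [0.8, 1.0, 0.4],
       [0.2, 0.4, 1.0]]
classes, table = rough_decision(sim, ["a", "a", "b"])
```

Under this assumed form, each row of `table` sums to 1, and a sample that resembles members of another class receives a nonzero membership in that class instead of being assigned to a single equivalence class.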
3.2. Rough Uncertainty Metric Model
In the data characterization stage, the original information of the data is restored as much as possible. Next, fuzzy neighborhood granule and rough decision are used to realize the uncertainty metric of feature genes, and the evaluation method of the significance of feature genes is given.
Definition 2. Let be a decision information system. are two conditional attribute subsets, and is the fuzzy neighborhood granule of x with radius α. The rough entropy with respect to B is defined asThe joint entropy with respect to B, C is defined aswhere , represents the number of nonzero values in the fuzzy neighborhood granule of the object , and represents the probability of an element in the fuzzy neighborhood granule .
Definition 3. Let S and T be two fuzzy sets. is defined as the number of nonzero values of objects whose membership degree in S is not greater than that of T.
Example 1. Given a sample set , S and T are two fuzzy sets on U, the membership of the samples on them are assumed as follows: represents the number of nonzero values of objects whose membership degree in S is not greater than that of T. Thus,Similarly, represents the number of nonzero values of objects whose membership degree in T is not greater than that of S:
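The counting operation of Definition 3 can be sketched in a few lines. The sketch reads the definition as counting the objects whose membership in the first fuzzy set is nonzero and does not exceed their membership in the second; this reading is an assumption, and the membership values below are hypothetical rather than those of Example 1.

```python
def count_not_greater(S, T):
    """Count objects with nonzero membership in S that does not exceed T."""
    return sum(1 for s, t in zip(S, T) if s != 0 and s <= t)


# Hypothetical memberships of four samples in two fuzzy sets.
S = [0.0, 0.3, 0.7, 0.5]
T = [0.2, 0.6, 0.4, 0.9]

s_in_t = count_not_greater(S, T)
t_in_s = count_not_greater(T, S)
```

The operation is not symmetric: with these values the count of S with respect to T differs from the count of T with respect to S.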
Definition 4. Let be a decision information system. , is a conditional attribute subset, fuzzy neighborhood granule , and rough decision are two fuzzy matrices. Then, the conditional entropy of the decision attribute set D with respect to B on U is defined aswhere represents the rough decision corresponding to the equivalence class to which the sample belongs.
Example 2. Given a decision table , where , , , and , is the fuzzy similarity relation matrix with respect to attribute set A andAccording to the given conditions and Definition 1,Then, .
Similarly,Then, .
So, the rough decision is as follows:Here, the neighborhood radius (the specific experiment will discuss the parameters according to different data sets). According to Definition 2, represents the number of nonzero values in the fuzzy neighborhood granule of the object , and then,where represents the rough decision corresponding to the equivalence class to which the sample belongs, where . Since and ,According to Definition 3, is similar to S and is similar to T in it. Then,The remaining samples are equally available. Therefore, the conditional entropy of the decision attribute set D with respect to A on U is
Theorem 1. Let be a decision information system, , is a conditional attribute subset, is a rough decision defined on decision attribute set D. Then,
Proof. According to Definitions 4 and 2, the derivation process is as follows:Hence, .
Theorem 2. Let be a decision information system, , .
Proof. Assume equivalent to , that is, , obviously not established. So , .
Theorem 3. Let be a decision information system, are two conditional attribute subsets. If , then ; the equation is true if and only if .
Corollary 1. If , then .
Proof. According to the definition of the fuzzy neighborhood granule, if , , combined with Definition 3, , , that is, .
Theorem 4. Let be a decision information system, . , if , then the attribute b is unnecessary.
Proof. Assume is necessary and satisfies . According to the definition of fuzzy neighborhood granule, if the attribute b is necessary, then . And , then by Theorem 3 and Corollary 1, , obviously not in line with the assumption. So , if , then the attribute b is unnecessary.
Generally, given a decision information system , is conditional attribute subset. Let B be a reduction of A, if B satisfies the following conditions:(1)(2), .Due to the inconsistency and noises in the data sets, it becomes difficult to find the smallest accurate reduction [31]. Therefore, this paper employs the variable precision model to tolerate the error between the conditional entropy of the reduction attribute subset and the conditional entropy of the original attribute set and set the parameter β as constraint. That is, if satisfies the condition , can be used as a reduction of A.
Definition 5. Let be a decision information system, , the significance of condition attribute a relative to A is defined as, the significance of condition attribute r relative to is defined as
3.3. Feature Genes Selection Algorithm Based on Rough Uncertainty Metric Model
For the above theory, this paper designs a feature genes selection algorithm based on the rough uncertainty metric model. As shown in Algorithm 1, the application of the new model in feature genes selection is realized.
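The selection procedure can be sketched as a greedy forward search. This is a minimal sketch and not a reproduction of Algorithm 1: it assumes that a lower conditional entropy is better, that the search stops once the entropy of the selected subset is within β of the entropy of the full attribute set, and that the entropy is supplied by the caller. The names `select_features`, `entropy`, and the toy gains are hypothetical.

```python
def select_features(attrs, entropy, beta):
    """Greedy forward selection with a tolerance `beta`.

    attrs   : iterable of attribute names
    entropy : callable mapping a collection of attributes to a number,
              assumed not to increase when attributes are added
    beta    : tolerated gap to the entropy of the full attribute set
    """
    target = entropy(frozenset(attrs))
    selected = frozenset()
    remaining = set(attrs)
    while remaining and entropy(selected) - target > beta:
        # Most significant attribute: the one that lowers the entropy most.
        best = min(sorted(remaining), key=lambda a: entropy(selected | {a}))
        selected = selected | {best}
        remaining.remove(best)
    return sorted(selected)


# Toy evaluation function: each gene removes a fixed share of uncertainty.
gain = {"g1": 60, "g2": 30, "g3": 8, "g4": 2}


def toy_entropy(subset):
    return (100 - sum(gain[a] for a in subset)) / 100


chosen = select_features(gain, toy_entropy, beta=0.15)
```

With the toy gains, a tolerance of 0.15 stops the search after the two most informative genes, whereas a tolerance of 0 forces the search to keep adding genes until the full-set entropy is matched.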

4. Experimental Results and Analysis
Besides a sound theory, a good model needs to perform well in practice. Therefore, comparative experiments are set up in this section. Under the same conditions, the same gene data are used to compare our algorithm with other existing similar algorithms, and the experimental results are used to illustrate the advantages of the proposed model.
4.1. Experiment Preparation
In order to verify the validity of the proposed model, four data sets are selected from the public data sources as experimental objects. The specific information is shown in Table 1. The data sets WPBC, WDBC, and HeartCle are selected from the UCI Machine Learning Repository, and Colon is selected from http://datam.i2r.astar.edu.sg/datasets/krbd/. At the beginning of the data processing stage, in order to eliminate the influence of value dimension inconsistency among features, all numerical experimental data will be normalized and mapped to [0, 1] by the formula . In addition, the fuzzy similarity relation of samples and with respect to an attribute is calculated as
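The normalization step can be sketched as below. The exact formula is not legible in this text; the sketch assumes the usual min-max scaling, which maps each feature column to [0, 1]. The function name `min_max_normalize` and the handling of constant columns are assumptions.

```python
def min_max_normalize(column):
    """Scale one feature column to [0, 1] by min-max scaling (assumed)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to 0 by convention.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Hypothetical expression values of one gene across three samples.
scaled = min_max_normalize([2.0, 4.0, 6.0])
```

After scaling, differences between feature values are comparable across features, which is what a similarity computed from those differences requires.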
4.2. Parameter Discussion
To reduce the influence of noise and redundancy in the experimental data, this paper uses the parameter α as the neighborhood radius when calculating the fuzzy neighborhood granule of a sample. The radius filters out data that are only weakly related to the sample and would otherwise consume computation time, which improves the efficiency of the experiment to some extent. In practice, noise and redundant data can only be minimized and cannot be completely avoided, so the experimental results need to tolerate the errors these data cause. The parameter β is therefore introduced to control the size of the tolerated error, so that a smaller feature genes subset with higher classification accuracy can be selected. Since different data sets have different correlation strengths, the parameters need to be set separately for each data set.
The parameters α and β are each varied from 0 to 0.5 in steps of 0.05. For each data set, the experiment compares the number of feature genes and the corresponding classification accuracy obtained with different parameter values. Two classifiers, a linear support vector machine (linear SVM) and K-nearest neighbor (KNN, K = 3), are used to evaluate the classification accuracy of a feature genes subset by 10-fold cross-validation. The comparison process is shown in Figures 1–4 (the classification accuracy in the figures is based on the linear SVM classifier; the results under the 3-NN classifier are the same). Finally, the selected feature genes subsets and the corresponding suitable parameters for the different data sets are shown in Table 2.
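The parameter search described above can be sketched as a plain grid search. The grid follows the stated range and step; the rule for choosing a pair (highest accuracy, ties broken by fewer selected genes) is an assumption, and the accuracy values are placeholders rather than experimental results. The names `parameter_grid` and `pick_best` are hypothetical, and the cross-validated classifiers themselves are not implemented here.

```python
def parameter_grid(start=0.0, stop=0.5, step=0.05):
    """All (alpha, beta) pairs on the grid, rounded to avoid float drift."""
    n = round((stop - start) / step)
    values = [round(start + k * step, 2) for k in range(n + 1)]
    return [(a, b) for a in values for b in values]


def pick_best(results):
    """Choose the pair with the highest accuracy, then the fewest genes.

    results maps (alpha, beta) -> (accuracy, number_of_selected_genes).
    """
    return max(results, key=lambda p: (results[p][0], -results[p][1]))


grid = parameter_grid()

# Placeholder outcomes for three grid points (not measured values).
results = {
    (0.1, 0.05): (0.90, 8),
    (0.2, 0.05): (0.92, 6),
    (0.3, 0.1): (0.92, 4),
}
best = pick_best(results)
```

The grid contains 11 values per parameter, so 121 pairs are evaluated for each data set.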
4.3. Experimental Comparison
In this section, in order to verify the validity of the proposed model, the designed algorithm is compared with the existing similar algorithms. The experimental objects include original data without algorithm processing (raw data), feature selection algorithm based on fuzzy entropy (FISEN) [32], feature selection algorithm based on fuzzy neighborhood rough set (FNRS) [19], and algorithm proposed in this paper (RUM). The comparison includes two aspects: the number of selected feature genes and the classification accuracy.
A good feature genes selection algorithm aims to find a small subset of feature genes that makes classification more accurate. First, the number of genes in the original data and the number of feature genes selected by the different algorithms are shown in Table 3. The table shows that the original data contain many feature genes and that every algorithm achieves a significant reduction. In terms of the average number of feature genes, the FNRS algorithm selects fewer than the FISEN algorithm, and the RUM algorithm selects the fewest of all the compared algorithms. On each individual data set, the RUM algorithm also selects the fewest feature genes. Therefore, the RUM algorithm proposed in this paper is superior in terms of the number of selected feature genes.
Of course, a small number of selected feature genes alone is not enough to demonstrate the advantages of an algorithm. If equal or even higher classification accuracy is achieved on a relatively small subset of feature genes, the algorithm can be considered effective. Table 4 shows the classification accuracy of the original data and of the data obtained from the different algorithms under the linear SVM classifier. Overall, the RUM algorithm has the highest average classification accuracy among the four experimental objects. On individual data sets, the RUM algorithm and the FNRS algorithm obtain the same classification accuracy on the WDBC data set, which is higher than that of the others, and the same accuracy on the Colon data set. On the remaining data sets, the classification accuracy obtained by the RUM algorithm is the highest, especially on the WPBC data set, which is about 78 percent higher than the FNRS algorithm.
Comparing classification accuracy on a single classifier may not be sufficiently persuasive. Therefore, the 3-NN classifier is added to further verify the advantages of the proposed RUM algorithm, and the results are shown in Table 5. First, the average classification accuracy obtained by the RUM algorithm is the highest. Admittedly, the classification accuracy of the RUM algorithm on the WDBC data set is slightly lower than that of the original data and the FNRS algorithm, but this does not show that the RUM algorithm performs poorly. Table 3 shows that the number of feature genes obtained by the RUM algorithm on the WDBC data set is the smallest: less than one-third of the number selected by the FNRS and FISEN algorithms and only one-sixth of the original data. Meanwhile, the classification accuracy achieved by the RUM algorithm is less than 1 percent lower than that of the original data and the FNRS algorithm, and it is higher than that of the FISEN algorithm. In addition, on the other three data sets, the data selected by the RUM algorithm have the highest classification accuracy, especially on the WPBC and Colon data sets, where it is about 10 percent higher than that of the other algorithms.
From the above analysis, the advantage of the rough uncertainty metric model in feature genes selection is supported both by the number of selected feature genes and by the classification accuracy on the two classifiers.
5. Conclusions and Future Works
A novel model is established in this paper. In this model, the fuzzy neighborhood granule of a sample is constructed by combining the fuzzy concept with the neighborhood concept, and the decision equivalence class is expressed more accurately as a rough decision by using the fuzzy similarity relation between samples. The original information of the data is preserved as faithfully as possible during the data characterization phase. Then the fuzzy neighborhood granule and the rough decision are introduced into the conditional entropy, and a metric is proposed to evaluate the significance of feature genes. Based on the properties of this model, which are established by the four theorems in the paper, a feature genes selection algorithm is designed. Finally, the proposed algorithm is compared with existing similar algorithms on common data sets. The experimental results show that the proposed algorithm obtains a relatively small subset of feature genes and achieves better classification results, which verifies the validity of the proposed model. However, the model still has a limitation: the parameters are selected separately for each data set and do not generalize, which requires further study. In the next step, our study will focus on the parameter selection problem and on how to set appropriate, generalized parameters adaptively.
Data Availability
Four tumor microarray data sets used to support the findings of this study have been deposited in the public data sources, which include UCI Machine Learning Repository and http://datam.i2r.astar.edu.sg/datasets/krbd/.
Conflicts of Interest
The authors declare that there are no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (Nos. 61772176, 61370169, and 61402153), the Plan for Scientific Innovation Talent of Henan Province (No. 184100510003), the Science and Technology Department, Henan Province (Nos. 182102210362 and 162102210261), the Young Scholar Program of Henan Province (No. 2017GGJS041), and the Key Scientific and Technological Project of Xinxiang City of China (No. CXGG17002).