Abstract

Mining useful knowledge from high-dimensional data is a hot research topic. Efficient and effective sample classification and feature selection are challenging tasks due to the high dimensionality and small sample size of microarray data. Feature selection is necessary when constructing the model to reduce time and space consumption. Therefore, a feature selection model based on prior knowledge and rough set is proposed. Pathway knowledge is used to select feature subsets, and a rough set based on the intersection neighborhood is then used to select important features in each subset, since it can select nonredundant features and deal with numerical features directly. In order to improve the diversity among base classifiers and the efficiency of classification, it is necessary to select a subset of the base classifiers. Classifiers are grouped into several clusters by k-means clustering using the proposed combined distance of Kappa-based diversity and accuracy. The base classifier with the best classification performance in each cluster is selected to generate the final ensemble model. Experimental results on three Arabidopsis thaliana stress response datasets show that the proposed method achieves better classification performance than existing ensemble models.

1. Introduction

The development of high-throughput sequencing technology has provided researchers with a large amount of microarray data, and extracting valuable information from it has become a hot research topic in bioinformatics [1, 2]. Plants encounter various stresses at different growth stages throughout their lives, which may lead to growth inhibition, leaf injury, and even plant death. Predicting these stresses plays a very important role in the development of forestry and agriculture. To eliminate the negative influences of these stresses before symptoms appear, microarray data can be used to diagnose and recognize the type of plant stress.

Microarray data is characterized by high dimensionality, small sample size, and high redundancy. Traditional classification algorithms for microarray data suffer from problems such as poor classification stability and low accuracy. Therefore, the analysis of such data requires a classification model with strong processing capability. For high-dimensional microarray data, feature selection is a crucial step towards effective and efficient classification [3, 4]. Hence, high-performance methods for feature selection and sample classification have become increasingly important.

Feature selection is an important process in the analysis of high-dimensional data [5, 6]. Rough set theory is a mathematical tool that deals with imprecise, inconsistent, and incomplete problems [7]. Classical rough set theory is employed in attribute reduction problems, but it requires an equivalence relation among samples, which can seem arbitrary and restrictive for numerical data. To deal with this issue, the generalized rough set theory, which substitutes a general binary relation for the original equivalence relation, was proposed [8]. A new binary tolerance relation, the intersection neighborhood, was put forward for processing numerical data and was employed to select features in microarray data [9]; it is more flexible for data with complex structure. The proposed model uses a rough set model based on the intersection neighborhood to select features in each dataset. Existing feature selection models for microarray data are usually designed based on a single data source. Because of the biological interaction between genes, fusing existing biological knowledge into the classification model can improve its classification performance.

This paper first uses pathway knowledge to make a preliminary selection of features in stress response microarray data and then selects important features using the intersection neighborhood rough set. Each feature subset contains the genes in one pathway knowledge unit. Because of the high redundancy of microarray data, only a very small number of genes are related to classification [10], so a gene selection model based on the intersection neighborhood rough set (INRS) is applied to each feature subset to select important and nonredundant features for the subsequent classification work.

An ensemble classification model is constructed by exploiting the information complementarity among base classifiers; thus, it has more stable and accurate classification performance [11, 12]. These methods have become increasingly important because they perform better than single classifiers in many areas, especially for classification problems with complex data structures [9, 13]. Many ensemble learning methods, including Boosting [14], Stacking [15], Bagging [16], and the Random Subspace Method [17], have been proposed. For the classification of microarray data, Meng et al. proposed an ensemble classification model using a neighborhood system and rough set theory [18]. Their approach aims at averaging the results of different classifiers [19]. To speed up the feature selection process, Meng et al. proposed a parallel feature selection method using MapReduce [20].

When there are a large number of base classifiers in an ensemble model, some of them will be redundant, resulting in poor overall diversity. In order to improve the performance of ensemble classification, it is necessary to select base classifiers. Ensemble pruning methods can be roughly divided into four categories: iterative optimization methods, ranking methods, clustering methods, and pattern mining methods. Among clustering-based techniques, Lin et al. proposed a dynamic base classifier selection strategy based on k-means clustering and a cyclic sequence [21]. Zhang and Cao proposed an ensemble pruning method based on spectral clustering [22]. Krawczyk used a cluster-based pruning method in weighted Bagging ensemble classification [23]. However, when calculating the distance between base classifiers, these methods do not consider the diversity among base classifiers and the classification performance of the classifiers at the same time.

In this paper, we propose an ensemble classification method for high-dimensional data using neighborhood rough set (ECHDNRS). Since the feature distributions of the subsets generated by the neighborhood rough set and pathway knowledge differ significantly, there is obvious diversity among the base classifiers trained on them. This method is not only a way of generating different feature-based training sets for base classifiers, but it can also be regarded as an improvement of the traditional Random Subspace Method. To improve the classification performance of the ensemble model and reduce space and time consumption, k-means clustering is used to select base classifiers; using k-means instead of passing data to all base classifiers reduces time and space. A new function which combines Kappa-based diversity and accuracy is proposed for calculating the distance between two base classifiers. All base classifiers are grouped into clusters, and the base classifier with the best classification accuracy in each cluster is selected to generate the ensemble classification model. Our contributions can be summarized in four aspects:
(1) Use pathway to preselect features. Pathway is used as a feature preselection tool for ensemble classification. Each pathway contains a group of features that regulate a biological process; thus, the biological significance of features can be assessed. We associate each pathway with its corresponding features in the microarray data to form different feature subsets.
(2) Employ neighborhood rough set to select important features in each pathway unit. During feature selection, the neighborhood rough set keeps the classification ability of the pathway unit unchanged. Moreover, it reduces the training time of the ensemble classification model.
(3) Combine Kappa-based diversity and accuracy to calculate classifier distance. We consider the diversity among base classifiers and their classification performance at the same time when calculating the distance between base classifiers, and we propose a new distance function that combines Kappa-based diversity and accuracy.
(4) Demonstrate that ECHDNRS can achieve good classification performance.

Extensive experiments demonstrate the good classification performance of ECHDNRS compared with existing schemes.

The rest of the paper is organized as follows: Section 2 describes the framework of ensemble classification. Feature selection based on pathway and INRS, as well as our proposed ensemble pruning method based on k-means, is described in Section 3. Experiment results and analysis are discussed in Section 4. Finally, conclusions and future work are given in Section 5.

2. Framework of Ensemble Classification

An ensemble classification model generates many classification models for a certain classification problem and results of these models are comprehensively considered [12]. Generally, there are three ways to generate different base classifiers: (1) training base classifiers on training sets with different samples, such as Bagging; (2) training base classifiers on training sets with different features, such as Random Subspace; (3) using different classification algorithms to train base classifiers based on the same training set [14]. For microarray data with high dimensionality and small sample size, in order to generate base classifiers with significant diversity, it is suitable to employ a model that trains base classifiers on training sets with different features. In this paper, pathway knowledge is used to preselect features; therefore, for the same microarray dataset, different training sets are generated from which we obtain different base classifiers. The procedure of ECHDNRS is shown in Figure 1.

Our ECHDNRS model consists of five steps:
(i) Step 1: Integrate pathway knowledge to generate different feature subsets. Each pathway lists the features contained in a specific path. ECHDNRS combines microarray data with the corresponding pathway knowledge to preselect features and form feature subsets Pi (i = 1, 2, ..., m). Since pathway knowledge is limited, many features in the microarray data have no associated biological knowledge, and they are used to generate unit P0.
(ii) Step 2: Select important features using the intersection neighborhood rough set model (INRS). It is employed to select significant features in each feature unit Pi (i = 0, 1, 2, ..., m); then the reduced unit Pi′ (i = 0, 1, 2, ..., m) without redundant features is obtained.
(iii) Step 3: All samples are partitioned into training samples, pruning samples, and test samples, as shown in Figure 2. Each training set Ti is the sample set in which all training samples only contain the features in Pi′. Then the training sets Ti are used to train base classifiers, and SVM is selected as the classification algorithm.
(iv) Step 4: Prune base classifiers using the k-means algorithm. The base classifiers are used to classify the pruning samples so as to validate their classification performance. Then k-means clustering groups the base classifiers into several clusters based on their classification results on the pruning samples. The base classifier with the best classification performance in each cluster is selected to generate the final ensemble model.
(v) Step 5: Integrate the classification results of the selected base classifiers. Each test sample is classified by all selected base classifiers; then the model integrates the results of the different classifiers by the majority vote method.

These steps of the ECHDNRS are described in detail in the following subsections.
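For concreteness, the following Python sketch shows how the five steps could be wired together. It is only an illustrative skeleton under the assumption of binary 0/1 labels; the helper functions inrs_reduce and prune_with_kmeans are our own names for procedures sketched in the following sections, not identifiers from the original paper.

```python
# Illustrative skeleton of the ECHDNRS pipeline (hypothetical helper names).
import numpy as np
from sklearn.svm import SVC

def echdnrs(X_tr, y_tr, X_pr, y_pr, X_te, pathway_units, delta, k, alpha):
    """pathway_units: list of column-index arrays, one per unit P0..Pm."""
    clfs, feats = [], []
    for cols in pathway_units:                                   # Steps 1-2: one unit per pathway
        kept = inrs_reduce(X_tr, y_tr, delta, cols)              # INRS reduction of the unit
        clfs.append(SVC(kernel="rbf").fit(X_tr[:, kept], y_tr))  # Step 3: one SVM per reduced unit
        feats.append(kept)
    correct = np.array([c.predict(X_pr[:, f]) == y_pr            # Step 4: results on pruning samples
                        for c, f in zip(clfs, feats)])
    chosen = prune_with_kmeans(correct, k, alpha)                # k-means based pruning
    votes = np.array([clfs[i].predict(X_te[:, feats[i]]) for i in chosen])
    return (votes.mean(axis=0) >= 0.5).astype(int)               # Step 5: majority vote (0/1 labels)
```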

3. Feature Selection

The performance of a classifier depends on the interrelationships among the number of samples, the dimension of the features, and the complexity of the classifier [13]. If the number of samples in the training set is much smaller than the number of features, classification performance will be poor due to overfitting of the classifier on the training set [24]. This behavior is referred to as the peaking phenomenon [25, 26].

In practice, the number of samples is very small relative to the feature dimension, which is usually in the tens of thousands for microarray data. To improve the classification performance, feature selection is essential. Therefore, pathway knowledge is used for preselection to generate a number of feature subsets, and INRS is then employed to select important features in each unit.

3.1. Combining High-Dimensional Data with Prior Knowledge

We use high-dimensional microarray data combined with pathway knowledge to generate different subsets. The pathway biological knowledge is downloaded from https://www.arabidopsis.org/biocyc and is derived from KEGG by Kanehisa [27]. The KEGG pathway database integrates current knowledge on molecular interaction networks, including graphical representations of cellular biochemical processes such as metabolism, cell cycle, signal transduction, and membrane transport, as well as conserved sub-pathway information. It is a collection of manually drawn pathway maps covering the following categories of molecular interaction and reaction networks: (1) metabolism; (2) genetic information processing; (3) environmental information processing; (4) cellular processes; (5) organismal systems; (6) human diseases; (7) drug development.

For the classification of microarray data, traditional classification models are usually designed based on a single data source. Because of the biological interaction between genes, fusing existing biological knowledge into the classification model can improve its classification performance. Gene Ontology (GO) knowledge was first applied to cancer prediction, and related experiments show that combining biological knowledge can improve the accuracy of prediction results and enhance their biological interpretability and credibility [28]. Subsequently, prediction models combined with pathway knowledge were also applied to cancer prediction [29]. In recent years, a pathway-level classification model combined with the superbox principle was applied to disease classification [30].

The proposed ECHDNRS model eliminates the randomness of the traditional Random Subspace method, which, without using prior knowledge, randomly extracts features to form feature subsets. Feature selection integrated with biological knowledge of plant stress response improves the biological interpretation of the results [18]. Three examples of pathways are shown in Figure 3, where pij represents the jth feature contained in pathway i. Across pathway units, the number of features ranges from 1 to more than 200.

There exist features that are contained in the microarray data but have no corresponding pathway. The Wilcoxon rank sum test is employed to preselect features that are not associated with any pathway annotation, and the 200 top-ranked features are used to generate unit P0. The Wilcoxon rank sum test is suitable for samples that do not follow a specific probability distribution such as the Gaussian distribution, and it is suitable for binary classification problems.

For each feature, the expression quantity of every sample in the microarray data is viewed as an observation in the statistical test. Thus, each feature has two groups of observations depending on the class labels of the samples, denoted by X = {xi | i = 1, 2, …, n1}, the expression quantities of samples belonging to class 1, and Y = {yj | j = 1, 2, …, n2}, the expression quantities of samples belonging to class 2, where n1 and n2 are the numbers of samples in class 1 and class 2, respectively. All of the (n1 + n2) observations are ranked in ascending order of expression quantity. Since several samples may have the same expression quantity, tied observations are all assigned the same rank, namely the average of the ranks they would otherwise occupy. For the two groups of observations X and Y, when n1, n2 > 10, the test statistic of the Wilcoxon rank sum test for each feature is defined as follows:

Z = (U − n1n2/2) / sqrt(n1n2(n1 + n2 + 1)/12), where U is the smaller of U1 = n1n2 + n1(n1 + 1)/2 − T1 and U2 = n1n2 + n2(n2 + 1)/2 − T2, and T1 and T2 are the rank sums of class 1 and class 2, respectively. Under the null hypothesis, the test statistic Z approximately follows the standard normal distribution with mean 0 and variance 1. The P-value of a feature is p = P(|Z| > |z|), where z is the observed value of the Wilcoxon rank sum statistic Z for that feature, and P(|Z| > |z|) is the probability that |Z| exceeds |z|. The smaller p is, the larger the difference of the feature between the two classes. Finally, all features are ranked by p-value, and the top-N features with the smallest p-values are selected as the output of feature preselection.
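A minimal sketch of this preselection step, assuming a samples-by-features matrix and 0/1 class labels; the two-sided normal approximation follows the statistic above, and top_n = 200 mirrors the size of unit P0.

```python
# Wilcoxon rank-sum preselection sketch (normal approximation, two-sided).
import numpy as np
from scipy.stats import rankdata, norm

def wilcoxon_preselect(X, y, top_n=200):
    """Return the indices of the top_n features with the smallest p-values."""
    n1, n2 = int(np.sum(y == 0)), int(np.sum(y == 1))
    pvals = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        ranks = rankdata(X[:, j])                     # tied values receive the average rank
        T1, T2 = ranks[y == 0].sum(), ranks[y == 1].sum()
        U = min(n1 * n2 + n1 * (n1 + 1) / 2 - T1,
                n1 * n2 + n2 * (n2 + 1) / 2 - T2)
        z = (U - n1 * n2 / 2) / np.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
        pvals[j] = 2 * norm.sf(abs(z))                # P(|Z| > |z|)
    return np.argsort(pvals)[:top_n]
```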

By combining microarray data with pathway knowledge, an information table is obtained, as shown in Table 1. In the table, P0 contains features which are not associated with any pathway annotation and are preselected by the Wilcoxon rank sum test. Pi (i = 1, 2, ..., m) is a biological knowledge unit including the features of a pathway corresponding to the microarray data, and m is the number of pathways selected as prior knowledge. For a feature pij ∈ Pi (j = 1, 2, ..., |Pi|) and a sample sk (k = 1, 2, ..., n), the corresponding value in the table is the expression quantity of feature pij in sample sk. Some pathways contain features that are redundant for classification, so further feature selection is needed. After forming these units, the feature selection model based on INRS is employed to delete redundant features in each unit.

3.2. Feature Selection Based on Intersection Neighborhood Rough Set Model

For GEDT = {S, Pi ∪ D, V, f}, S = {s1, s2, …, sn} is the microarray sample set, Pi represents the unit of pathway knowledge which contains related features, and D = {d} is the class label. For each feature pij ∈ Pi and sample sk ∈ S, f(sk, pij) is the expression value of feature pij for sample sk, and dk is the class label of sk.

Let R be a binary relation on the data space U. For objects x and y, y is included in the neighborhood of x if xRy holds, and the neighborhood of x is defined as follows [31]:

n(x) = {y ∈ U | xRy}.

Since all the gene expression microarray data is numerical, we focus on the tolerance relation defined for numerical features. We use the intersection neighborhood as the binary relation. For an object x ∈ U, the intersection neighborhood based on Pi is defined in [7, 18] as

nPi(x) = ∩pij∈Pi {y ∈ U | |f(x, pij) − f(y, pij)| ≤ δj},

that is, the intersection of the neighborhoods induced by the single features in Pi, where δj is the neighborhood threshold of feature pij.

In order to simplify the process of feature selection, we assume that each feature has the same threshold δ. For objects x and y, when the distance between x and y on each feature pij ∈ Pi is less than or equal to δ, y is in the neighborhood of x based on Pi.

Based on the above binary relation, for a subset of objects X ⊆ U, the lower and upper approximations in the extended rough set theory are defined in (3) and (4), respectively [31]:

NPi(X) = {x ∈ U | nPi(x) ⊆ X},  (3)

N̄Pi(X) = {x ∈ U | nPi(x) ∩ X ≠ ∅}.  (4)

The definitions of the positive, negative, and boundary regions based on the intersection neighborhood relation are as follows [31]:

POSPi(X) = NPi(X), NEGPi(X) = U − N̄Pi(X), BNDPi(X) = N̄Pi(X) − NPi(X).
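The following sketch shows how the intersection neighborhood and the resulting positive region can be computed for one feature unit, assuming rows are samples, numpy label arrays, and a single threshold δ as in the simplification above.

```python
# Intersection neighborhood and positive region for one feature unit (sketch).
import numpy as np

def intersection_neighborhoods(X, delta):
    """Boolean matrix: entry (x, y) is True iff y lies in the neighborhood of x,
    i.e. every feature of the unit differs by at most delta."""
    diff = np.abs(X[:, None, :] - X[None, :, :])   # pairwise per-feature distances
    return diff.max(axis=2) <= delta

def positive_region(X, y, delta):
    """Indices of samples whose whole neighborhood lies in their own decision class
    (the union of the lower approximations of the decision classes)."""
    nbr = intersection_neighborhoods(X, delta)
    return [k for k in range(len(y)) if np.all(y[nbr[k]] == y[k])]
```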

The principle of the feature selection model based on the rough set is to keep the classification ability unchanged. The classification ability in rough set theory is measured by the number of training samples included in the positive region of the decision feature; for each unit Pi, it is represented by POSPi(D), the union of the lower approximations of the decision classes in S/IND(D). Finding an optimal subset of features is an NP-hard problem. Features in each unit Pi are ranked using the Wilcoxon rank sum test, and the proposed feature selection model takes a backward strategy in which this ranking is employed as heuristic information. Each feature in the unit is evaluated in rank order: if deleting the feature from the feature set leaves the classification ability unchanged, it is removed from the unit; otherwise, it is kept. This method retains features with better classification ability and keeps the selected feature subsets small. The feature selection algorithm based on the INRS is described in Algorithm 1.

Input: GEDT = {S, Pi ∪ D, V, f}
δ // the intersection neighborhood threshold
Output: RED // a set of features which is a reduction of unit Pi
Step 1: For each sample sk ∈ S, calculate the intersection neighborhood nPi(sk) based on the threshold δ.
Step 2: Divide the sample set S based on the class label D = {d} to obtain the equivalence classes, represented as S/IND(D) (samples with the same class label are contained in one equivalence class).
Step 3: Calculate the positive region POSPi(D) defined on the intersection neighborhood based on all of the features in Pi.
Step 4: Start with RED = Pi.
Step 5: As in Step 1, for each sample sk ∈ S, calculate the intersection neighborhood nRED−{pij}(sk). Then, as in Step 3, calculate the positive region POSRED−{pij}(D) based on RED − {pij};
If POSRED−{pij}(D) = POSPi(D), then let RED = RED − {pij}.
Step 6: Repeat Step 5 until all the features pij in the subset Pi have been evaluated; then use the final RED as a reduction of Pi, marked as Pi′ = RED.

The computational complexity of the intersection rough set-based algorithm is O(|Pi|²|S|²), where |Pi| is the number of features contained in Pi and |S| is the number of samples. This feature selection method takes full advantage of the global information of each feature subset Pi.
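A compact sketch of Algorithm 1 under the same assumptions, reusing positive_region from the previous sketch; candidate features are tried in the supplied rank order, which is how we read the backward strategy.

```python
# Backward feature elimination guided by the Wilcoxon ranking (Algorithm 1 sketch).
def inrs_reduce(X, y, delta, ranked_cols):
    """ranked_cols: column indices of one unit, ordered by the Wilcoxon ranking."""
    red = list(ranked_cols)                                  # Step 4: start with all features
    target = set(positive_region(X[:, red], y, delta))       # Step 3: classification ability of Pi
    for col in ranked_cols:                                  # Steps 5-6: try deleting each feature
        trial = [c for c in red if c != col]
        if trial and set(positive_region(X[:, trial], y, delta)) == target:
            red = trial                                      # deletion keeps the positive region
    return red
```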

3.3. Ensemble Pruning Based on Clustering

For ensemble classification, many base classifiers are generated for the same problem, so a large amount of memory and considerable computational cost are needed [32]. Therefore, classifier pruning is essential to the ensemble model. Additionally, Zhou revealed that the ensemble of a proper subset of base classifiers sometimes outperforms the original ensemble [33, 34].

3.3.1. Classifier Distance Based on Diversity and Accuracy

In order to improve the performance of the ensemble classifier, classifiers with significant diversity are selected. Diversity can be viewed as a measure of dependence, complement, or orthogonality among classifiers [35]. Diverse classifier ensembles are preferred. There exist many approaches to measure diversity among binary classifier output including the Q statistic, the correlation, the disagreement, the double fault, the entropy of the votes, the difficulty index, the Kohavi-Wolpert variance, the interrater agreement, and the generalized diversity [36].

Cohen proposed the Kappa statistic as an index for consistency judgment. In practice, it measures the consistency of diagnoses well; therefore, it has been widely used in clinical trials. Kappa is also used to evaluate the classification performance of base classifiers. This index compensates for agreement that may be due to chance. It is considered a standard, statistically robust metric for measuring accuracy in multiclass problems [37].

In our method, diversity is measured based on the Kappa coefficient. The proposed method does not compute the Kappa index as a global performance measure for each candidate classifier; instead, we calculate a pairwise Kappa value for the similarity between two candidate classifiers (kappa by similarity). The Kappa coefficient of the outputs of two base classifiers is calculated as follows [38]:

kappa = (θ1 − θ2) / (1 − θ2), where θ1 = (Ctt + Cff)/N is the observed agreement between the two classifiers and θ2 = [(Ctt + Ctf)(Ctt + Cft) + (Cff + Cft)(Cff + Ctf)]/N² is the agreement expected by chance.

There are two base classifiers hi and hj which are the i-th and the j-th classifiers, respectively. N is the total number of samples. Ctt represents the number of samples which were correctly classified by hi and hj; on the contrary, Cff stands for the number of samples that were incorrectly classified by hi and hj; Ctf is the number of samples that were correctly classified by hi but were incorrectly classified by hj, and Cft is the number of samples that were incorrectly classified by hi but were correctly classified by hj.

The value of Kappa ranges from −1 to 1, and kappa denotes the value of the Kappa coefficient. When kappa < 0, the consistency is worse than would be expected by chance; when kappa > 0, the bigger the value, the better the consistency. We define the diversity distance between two base classifiers as follows: when kappa ≤ 0, the distance D(d)ij is 1; when kappa > 0, the distance is D(d)ij = 1 − kappa. The diversity distance is symmetric for each pair of classifiers, since D(d)ij = D(d)ji.
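A sketch of the Kappa-based diversity distance; the chance-agreement term θ2 is written as the standard pairwise kappa, which is our reading of the formula cited above.

```python
# Pairwise Kappa and the diversity distance derived from it (sketch).
import numpy as np

def kappa_diversity_distance(correct_i, correct_j):
    """correct_i, correct_j: boolean arrays marking where each classifier is right."""
    n = len(correct_i)
    ctt = np.sum(correct_i & correct_j)          # both correct
    cff = np.sum(~correct_i & ~correct_j)        # both incorrect
    ctf = np.sum(correct_i & ~correct_j)         # only classifier i correct
    cft = np.sum(~correct_i & correct_j)         # only classifier j correct
    theta1 = (ctt + cff) / n                                                    # observed agreement
    theta2 = ((ctt + ctf) * (ctt + cft) + (cff + cft) * (cff + ctf)) / n ** 2   # chance agreement
    kappa = (theta1 - theta2) / (1 - theta2) if theta2 < 1 else 1.0
    return 1.0 if kappa <= 0 else 1.0 - kappa    # diversity distance as defined in the text
```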

Giacinto and Roli stated that ensemble classifiers should be accurate and diverse [39]. Thus, when considering the diversity between two base classifiers, it is also essential to consider their classification accuracy. Suppose that there are m samples in the pruning set; cik denotes the output indicator of the i-th classifier on the k-th sample: cik = 0 means the k-th sample is correctly classified by the i-th classifier; otherwise, the sample is incorrectly classified. If cik = 0 as well as cjk = 0, then cik ⊙ cjk = 1; otherwise, cik ⊙ cjk = 0. The accuracy distance between the two classifiers is then defined as D(a)ij = 1 − (1/m) Σk cik ⊙ cjk.

We consider both the diversity distance and the accuracy distance in one distance function and define the distance between base classifiers hi and hj as Dij = α·D(d)ij + (1 − α)·D(a)ij, where α ∈ [0, 1] is the weight of the diversity distance. Dij ∈ [0, 1] is also symmetric and nonnegative, and Dii equals 0 for each i; therefore, it meets the requirements of a distance definition.
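Continuing the sketch, the accuracy distance and the combined distance can be written as follows; treating the combination as a weighted sum with weight α, and the accuracy distance as one minus the fraction of jointly correct samples, is our interpretation of the description above.

```python
# Accuracy distance and combined classifier distance (sketch, weighted-sum assumption).
import numpy as np

def accuracy_distance(correct_i, correct_j):
    """1 minus the fraction of pruning samples both classifiers classify correctly."""
    return 1.0 - float(np.mean(correct_i & correct_j))

def combined_distance(correct_i, correct_j, alpha=0.5):
    """alpha weights the Kappa-based diversity distance, (1 - alpha) the accuracy distance."""
    return (alpha * kappa_diversity_distance(correct_i, correct_j)
            + (1.0 - alpha) * accuracy_distance(correct_i, correct_j))
```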

3.4. Pruning Base Classifiers Using K-Means

An ensemble pruning method based on k-means clustering is proposed which considers both Kappa-based diversity and accuracy of classifiers. The objective of clustering is to partition base classifiers into many homogeneous clusters in which classifiers within a cluster are more similar to each other than those belonging to different clusters. This means that classifiers belonging to different clusters are more diverse. Then an exemplar is selected from each cluster to participate in the ensemble model.

An improved k-means clustering is used to partition the set of base classifiers H = {h1, h2, ..., hN} into k clusters based on the distance we defined. Firstly, k cluster centroids are randomly selected from all the base classifiers. Secondly, the distance between each classifier and every centroid is calculated, and the classifier is assigned to the cluster of the nearest centroid. Thirdly, in standard k-means clustering the centroid of a cluster is updated to the average of all its members; however, in this situation such calculated centroids may not correspond to real base classifiers. The original method is therefore improved: for each calculated centroid, the classifier with the smallest distance to it is selected as the new centroid. Finally, the above operations are repeated until the maximum number of iterations is reached or the centroids no longer change. To obtain the optimal number of clusters, k is increased gradually until the minimized sum of within-cluster distances starts to deteriorate. When the optimal number k of clusters is obtained, according to the assumption in [40], the agreement among the classifiers from the same cluster is large, so the majority of the classifiers can be removed. Then the classifier with the best classification performance in each cluster is selected as the exemplar to participate in the ensemble model.
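A sketch of the modified k-means pruning described above. Because only pairwise distances between classifiers are available here, the "classifier closest to the calculated centroid" update is approximated by the cluster medoid, which is an assumption of this sketch; the exemplar of each cluster is the member with the best pruning-set accuracy.

```python
# k-means style pruning over classifiers, with centroids snapped to real classifiers (sketch).
import numpy as np

def prune_with_kmeans(correct, k, alpha=0.5, max_iter=100, seed=0):
    """correct: classifiers x pruning-samples boolean matrix of correct predictions."""
    rng = np.random.default_rng(seed)
    n = correct.shape[0]
    dist = np.array([[combined_distance(correct[i], correct[j], alpha)
                      for j in range(n)] for i in range(n)])
    centroids = list(rng.choice(n, size=k, replace=False))
    for _ in range(max_iter):
        assign = np.argmin(dist[:, centroids], axis=1)        # nearest-centroid assignment
        new_centroids = []
        for c in range(k):
            members = np.where(assign == c)[0]
            if len(members) == 0:
                new_centroids.append(centroids[c])            # keep an empty cluster's centroid
                continue
            within = dist[np.ix_(members, members)].mean(axis=1)
            new_centroids.append(int(members[np.argmin(within)]))  # medoid as the new centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    assign = np.argmin(dist[:, centroids], axis=1)
    acc = correct.mean(axis=1)                                # pruning-set accuracy per classifier
    exemplars = []
    for c in range(k):
        members = np.where(assign == c)[0]
        if len(members):
            exemplars.append(int(members[np.argmax(acc[members])]))
    return exemplars
```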

We use a group of selected classifiers to classify test samples and integrate the results by majority vote method. Since among the selected base classifiers there exists significant diversity, if most of the base classifiers are consistent, then the result will have a higher credibility. If the differences among the base classifiers are not obvious, in some cases it is possible that most of the classifiers misclassify the samples, and then the ensemble classifier will also incorrectly classify the samples.

Training time includes feature selection time, base classifier generation time, and classifier pruning time. Feature selection based on the intersection neighborhood rough set is very time consuming, since it needs to calculate the positive region for each feature unit. Base classifier generation time is related to the classification algorithm. Classifier pruning reduces the classification time; thus, it is necessary for ensemble classification. After the training process, the inference time depends on the number of base classifiers selected by the ensemble in our proposed model. Therefore, ensemble pruning based on k-means can reduce the inference time.

4. Experimental Results and Discussion

4.1. Dataset and Experiment Settings

Arabidopsis thaliana is usually used to study the responses of plants to different types of stress [41], because of its abundant biological experiment data and information encoded in gene annotations. In this paper, three plant stress response datasets about Arabidopsis thaliana and corresponding pathway knowledge are applied in the experiment to test the performance of the proposed ECHDNRS model. The datasets are Arabidopsis-Drought, Arabidopsis-Oxygen, and Arabidopsis-TEV, which are responses to drought, oxygen, and Potyvirus (TEV) stress, respectively. All the three datasets can be downloaded from GEO (Gene Expression Omnibus) website (http://www.ncbi.nlm.nih.gov/geo/). Each dataset has two classes. Detailed information about the three Arabidopsis thaliana datasets is shown in Table 2, and the experimental group and the control group are denoted by Class A and Class B, respectively.

The data values are normalized into the range of [−1, 1] before the intersection neighborhoods of objects are constructed, to eliminate the deviation of the classification result caused by different attribute ranges. In order to observe the impact of the intersection neighborhood threshold δ on classification performance, δ is set from 0.05 to 0.95 with a step of 0.1. The weight α of the Kappa-based diversity distance in the ensemble pruning process ranges from 0.1 to 0.9 with a step of 0.1. Of all samples, 60% are used as training samples, 20% as pruning samples, and the rest as testing samples. Since the number of samples in microarray data is limited, in order to eliminate chance effects caused by random sampling, the average classification performance over 10 samplings is used as the final result. In this paper, SVM is used to classify the microarray data. We set the kernel function of the SVM classifier to the RBF function K(x, y) = exp(−γ||x − y||²), which has a strong ability to adapt to different datasets, and use libSVM to implement SVM.
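As a small illustration of these settings (not code from the paper), the normalization, the 60/20/20 split, and the RBF-kernel SVM could be set up as follows; scikit-learn's SVC wraps libSVM, and γ is left at the library default since the paper does not report its value.

```python
# Experiment setup sketch: [-1, 1] scaling, 60/20/20 split, RBF-kernel SVM (libSVM via scikit-learn).
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

def prepare_split_and_train(X, y, seed=0):
    X = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)
    X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, train_size=0.6,
                                                  stratify=y, random_state=seed)
    X_pr, X_te, y_pr, y_te = train_test_split(X_rest, y_rest, test_size=0.5,
                                              stratify=y_rest, random_state=seed)
    clf = SVC(kernel="rbf").fit(X_tr, y_tr)   # K(x, y) = exp(-gamma * ||x - y||^2)
    return clf, (X_tr, y_tr), (X_pr, y_pr), (X_te, y_te)
```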

4.2. Experiment Result Analysis

The classification accuracies of the ECHDNRS method are shown in Figures 4–6. For the three datasets, when α is set to 0.1, 0.2, 0.8, or 0.9, the ensemble model achieves better performance. For the Arabidopsis-Drought, Arabidopsis-Oxygen, and Arabidopsis-TEV datasets, the ensemble model achieves the best performance when δ is set to 0.65, 0.55, and 0.55, respectively. When δ is smaller than the peak value, the performance is more stable and better than when δ is larger than the peak value.

The times for feature selection, training, classifier pruning, ensemble classification without pruning, and pruned ensemble classification are shown in Table 3. The feature selection process is very time consuming; it takes at least 1820 s. On these datasets, classifier pruning reduces the classification time and improves classification performance; thus, it is necessary for ensemble classification.

ECHDNRS selects many feature units. The average p-values of the selected features on the three datasets are shown in Table 4. ECHDNRS enables the ensemble model to obtain good classification performance, even though some of the selected features do not perform very well in terms of p-value.

The sum of distances between all base classifiers and their cluster centroids is shown in Figure 7. When k is bigger than 7, it begins to deteriorate. Therefore, we set k to 7.

The evaluation of binary classification is based on four simple counts: True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN). In this paper, four evaluation criteria are used to evaluate the comparison results: accuracy (ACC), sensitivity (SN), specificity (SP), and geometric mean (G-mean). They are defined as follows: ACC = (TP + TN)/(TP + TN + FP + FN), SN = TP/(TP + FN), SP = TN/(TN + FP), and G-mean = sqrt(SN × SP).

ACC evaluates the classification accuracy of all the samples. SN and SP measure the classification accuracies of the samples belonging to positive and negative classes, respectively. G-mean comprehensively evaluates the classification ability of positive and negative classes.
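A direct implementation of the four criteria, assuming class 1 is the positive class:

```python
# ACC, SN, SP, and G-mean from a pair of label vectors (0 = negative, 1 = positive).
import numpy as np

def evaluate(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)                       # sensitivity
    sp = tn / (tn + fp)                       # specificity
    return {"ACC": acc, "SN": sn, "SP": sp, "G-mean": np.sqrt(sn * sp)}
```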

The comparison methods include four classical ensemble models and one single model: Random Subspace, Bagging, AdaboostM1, Stacking, and SVM, all implemented in Weka [42]. Two single-distance variants, which cluster base classifiers using only the Kappa-based diversity distance or only the accuracy distance, are also compared with ECHDNRS; they are named DECHDNRS and AECHDNRS, respectively. The accuracy distance is a general formula for the distance between base classifiers, and the Kappa coefficient is a common index for evaluating classifier performance; therefore, ensemble pruning methods based on them are also compared with ECHDNRS. SVM is used as the base classifier for all ensemble models. All of the methods based on ensemble pruning employ the same method as ECHDNRS to generate base classifiers. Since the intersection neighborhood threshold δ affects the performance of different ensemble models differently, the average classification performance over different δ values is used for comparison with other methods. The comparison results of ECHDNRS with the other ensemble methods are shown in Tables 5–7.

For each dataset, the nine classification methods are ranked according to their classification accuracies: the method with the best accuracy is ranked 1st, and the method with the worst accuracy is ranked 9th. Then, for each classification method, the average ranking over the three datasets is calculated. The average rankings over the three datasets are listed in Table 8.

From the performance of all methods on the three datasets and their rankings, ECHDNRS obtains the best performance on SP and G-mean, and it is similar to DECHDNRS on accuracy. On SN, ECHDNRS is worse than the other methods, but it balances the classification ability of the positive and negative classes in a better way: it classifies the samples of the negative class well, whereas the other methods perform much worse on them. Therefore, it obtains better performance on G-mean, which comprehensively evaluates the classification ability of the positive and negative classes. Generally, the methods based on ensemble pruning perform better than the classical ensemble models.

5. Conclusions

An ensemble classification method, ECHDNRS, for plant stress response is proposed in this paper. Microarray data is combined with pathway knowledge to eliminate the randomness of the traditional Random Subspace method, and then the feature selection model based on the intersection neighborhood rough set removes redundant features from each feature unit. Furthermore, in order to improve the classification performance of the ensemble model, a hybrid approach of classification and clustering is employed to select base classifiers: a k-means clustering algorithm that employs the proposed distance function combining Kappa-based diversity and accuracy groups all base classifiers into several clusters, and the base classifier with the best classification accuracy in each cluster is selected. Experimental results on three Arabidopsis thaliana stress-related datasets show that the proposed method obtains better results than classical ensemble methods including Random Subspace, Bagging, AdaboostM1, and Stacking, and it also performs better than traditional Kappa pruning and clustering pruning methods based on a single distance. How to reduce the time consumption of feature selection based on the intersection neighborhood rough set is a topic of our future work.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61872071) and the Fundamental Research Funds for the Central Universities (N2116010).