Abstract

Most model-free feature screening approaches focus on individual predictors and therefore cannot incorporate structured predictors such as grouped variables. In this article, we propose a group screening procedure for classification models based on the information gain ratio; it is a direct extension of the original sure independence screening procedure and is likewise model-free. The proposed method yields better screening performance and classification accuracy. We show that it possesses the sure screening and ranking consistency properties under certain regularity conditions, and we demonstrate its finite-sample performance through simulation studies and a real-data analysis.

1. Introduction

Ultrahigh-dimensional data arise in a wide range of scientific research and applications, and feature screening plays an essential role in their analysis. Fan and Lv [1] first proposed sure independence screening (SIS) in their seminal paper and showed that this method, based on Pearson correlation learning, possesses the sure screening property for linear regressions: all relevant predictors are selected with probability tending to one even when the number of predictors $p$ grows much faster than the number of observations $n$, with $\log p = O(n^{\xi})$ for some $\xi > 0$, as discussed by Fan et al. [2].

To address ultrahigh-dimensional feature screening in classification problems, Mai and Zou [3] utilized the Kolmogorov filter for ultrahigh-dimensional binary classification, and Cui et al. [4] utilized empirical conditional distribution functions in a fused mean-variance-based screening approach. All of the above classification screening approaches assume that the covariates are continuous. For categorical covariates, Huang et al. [5] constructed a model-free discrete feature screening method based on Pearson chi-square statistics; their method possesses the sure screening property in the sense of Fan and Lv [1] when all covariates are binary. For multiclass classification, Ni et al. [6] further proposed a weighted adjusted Pearson chi-square feature screening procedure, and Ni and Fang [7] used information entropy to construct a model-free screening procedure for ultrahigh-dimensional multiclass classification. However, covariates, particularly categorical and discrete ones, often come in groups, as is common in microarray, genomic, quantitative-measurement, and brain-imaging data. A good number of methods for selecting grouped variables originate from individual variable selection and produce a sparse solution at the group level or even within-group level; see, for example, group Lasso [8], group SCAD [9], group MCP [10], group hierarchical Lasso [11], group bridge [12], and group exponential Lasso [13]. Some grouped variable selection methods may not converge when the number of groups grows much faster than the sample size $n$. This is especially true when the regularization parameter is set for a nonsparse estimation, which frequently causes nonidentifiability or near-singularity issues, and even when the algorithm converges in the "large $p$, small $n$" case, the estimated coefficients may not be globally optimal. For these reasons, we believe new screening techniques are required that can reduce the number of groups before selecting important groups and the variables within them. Niu et al. [14] studied ultrahigh-dimensional data with grouping structure and proposed a group screening method based on working independence in linear models. Song and Xie [15] proposed a group screening method based on the F-test statistic, improving on marginal methods by reducing the burden of multiple testing and aggregating individual effects. For ultrahigh-dimensional linear models, Qiu and Ahn [16] proposed group sure independence screening (gSIS), group-wise adjusted R-squared screening (gAR2), and the group high-dimensional ordinary least-squares projector (gHOLP). He and Deng [17] applied joint information entropy to screen important grouped covariates.

In this paper, we propose a new grouped feature screening approach named Group Information Gain Ratio Sure Independence Screening (GIGR-SIS). It is based on the information gain ratio and is a direct extension of the original sure independence screening procedure to grouped covariates. The proposed method retains the covariate groups whose information gain ratios exceed a given threshold. Similar to Ni and Fang [7], continuous covariates are sliced at standard normal quantiles. Using the information gain ratio to assess the importance of grouped covariates, we show that GIGR-SIS possesses the sure screening property under certain regularity conditions.

This paper is organized as follows. Section 2 describes the proposed GIGR-SIS method in detail. Then we establish its sure screening property. In Section 3, we assess the performance of our method via numerical examples, including simulations and a real data application. Some concluding remarks are given in Section 4, and all the proofs are given in the Appendix.

2. Theory and Method

We first introduce entropy and the information gain ratio, then propose the screening procedure based on the information gain ratio, and finally establish its group feature screening properties.

2.1. Information Entropy and Information Gain Ratio

For grouped covariates, each group can be treated as a whole. Suppose $Y$ is a categorical response with $R$ classes $\{y_1, \dots, y_R\}$ and $X$ is a multivariate covariate matrix of dimension $p$ with $G$ grouped covariates, written as $X = (\mathbf{X}_1, \dots, \mathbf{X}_G)$, where $\mathbf{X}_g$ represents the $g$-th covariate group and $p_g$ represents the dimension of the covariates belonging to the $g$-th group, so that $p = \sum_{g=1}^{G} p_g$. To introduce the concepts of entropy and information gain ratio, assume that every covariate component of $X$ is classified into $J$ categories $\{1, \dots, J\}$. Since each of the $p_g$ components of $\mathbf{X}_g$ takes one of $J$ values, $J^{p_g}$ combinations of covariate categories are formed; we write them as $c^{(g)}_1, \dots, c^{(g)}_{J^{p_g}}$, where $c^{(g)}_1$ is the first covariate category combination and $c^{(g)}_{J^{p_g}}$ is the last.

Let $p_r = P(Y = y_r)$ represent the probability function of the response, $q^{(g)}_j = P(\mathbf{X}_g = c^{(g)}_j)$ the probability function of the $g$-th covariate group, and $p^{(g)}_{rj} = P(Y = y_r \mid \mathbf{X}_g = c^{(g)}_j)$ the conditional probability function of the response given the group, where $r = 1, \dots, R$ and $j = 1, \dots, J^{p_g}$. Let $K_g = J^{p_g}$. The marginal entropy of $Y$ and the marginal entropy of $\mathbf{X}_g$, respectively, are defined as
$$H(Y) = -\sum_{r=1}^{R} p_r \log p_r, \qquad H(\mathbf{X}_g) = -\sum_{j=1}^{K_g} q^{(g)}_j \log q^{(g)}_j. \tag{1}$$

Conditional entropy is defined as
$$H(Y \mid \mathbf{X}_g) = -\sum_{j=1}^{K_g} q^{(g)}_j \sum_{r=1}^{R} p^{(g)}_{rj} \log p^{(g)}_{rj}. \tag{2}$$

The information gain is defined as
$$\mathrm{IG}(Y \mid \mathbf{X}_g) = H(Y) - H(Y \mid \mathbf{X}_g). \tag{3}$$

The information gain ratio is defined as
$$\mathrm{IGR}(Y \mid \mathbf{X}_g) = \frac{\mathrm{IG}(Y \mid \mathbf{X}_g)}{H(\mathbf{X}_g)} = \frac{H(Y) - H(Y \mid \mathbf{X}_g)}{H(\mathbf{X}_g)}. \tag{4}$$

In equation (3), $\mathrm{IG}(Y \mid \mathbf{X}_g)$ is non-negative by Jensen's inequality and achieves its maximum $H(Y)$ if and only if $H(Y \mid \mathbf{X}_g) = 0$, where the term in equation (2) is the conditional entropy of $Y$ given $\mathbf{X}_g$. Following Ni and Fang [7] and He and Deng [17], further support is provided by the following proposition.

Proposition 1. When $\mathbf{X}_g$ is a categorical group, we have $0 \le \mathrm{IGR}(Y \mid \mathbf{X}_g) \le 1$, and $\mathrm{IGR}(Y \mid \mathbf{X}_g) = 0$ if and only if $Y$ and $\mathbf{X}_g$ are statistically independent.
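
To make the definitions concrete, the following is a minimal sketch (our illustration, not the authors' implementation) that estimates the quantities in equations (1)-(4) from a sample, encoding a categorical group as the joint category combinations $c^{(g)}_j$ described above:

```python
import numpy as np

def entropy(labels):
    """Empirical entropy, as in equation (1)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def info_gain_ratio(y, Xg):
    """Empirical information gain ratio of y given one covariate group Xg.

    y  : (n,) array of class labels.
    Xg : (n, p_g) array of categorical covariates forming one group.
    """
    # Encode each row of the group as one joint category combination c_j.
    _, joint = np.unique(Xg, axis=0, return_inverse=True)
    # Conditional entropy H(Y | Xg), as in equation (2).
    h_cond = sum((joint == j).mean() * entropy(y[joint == j])
                 for j in np.unique(joint))
    ig = entropy(y) - h_cond              # information gain, equation (3)
    h_g = entropy(joint)
    return ig / h_g if h_g > 0 else 0.0   # gain ratio, equation (4)
```

For instance, `info_gain_ratio(y, X[:, group_idx])` scores one group; groups with larger values are ranked as more informative.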

When $\mathbf{X}_g$ is continuous, the conditional entropy can only be determined by slicing $\mathbf{X}_g$ into multiple categories. For a fixed integer $J$, let $q^{(k)}_{j/J}$ be the $(j/J)$-th percentile of the $k$-th component of $\mathbf{X}_g$, $j = 1, \dots, J-1$, $k = 1, \dots, p_g$, and discretize each component into the categories $\{1, \dots, J\}$ according to the intervals between successive percentiles. The sliced group, denoted $\mathbf{X}^{(J)}_g$, then takes $J^{p_g}$ category combinations, and we replace $\mathbf{X}_g$ in equation (2) by $\mathbf{X}^{(J)}_g$.

Based on continuous covariates, we define conditional entropy as follows:
$$H_J(Y \mid \mathbf{X}_g) = -\sum_{j=1}^{J^{p_g}} P\bigl(\mathbf{X}^{(J)}_g = c^{(g)}_j\bigr) \sum_{r=1}^{R} P\bigl(Y = y_r \mid \mathbf{X}^{(J)}_g = c^{(g)}_j\bigr) \log P\bigl(Y = y_r \mid \mathbf{X}^{(J)}_g = c^{(g)}_j\bigr). \tag{5}$$

Proposition 2. When $\mathbf{X}_g$ is a continuous group, we have $0 \le \mathrm{IGR}_J(Y \mid \mathbf{X}_g) \le 1$, and $\mathrm{IGR}_J(Y \mid \mathbf{X}_g) = 0$ for all $J$ if and only if $Y$ and $\mathbf{X}_g$ are statistically independent.
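
The slicing step itself can be sketched as follows (again our illustration; the cut points are sample percentiles, and `info_gain_ratio` is the helper defined after Proposition 1):

```python
import numpy as np

def slice_group(Xg, J):
    """Discretize each column of a continuous group into J slices at its
    (j/J)-th sample percentiles, j = 1, ..., J-1."""
    Xg = np.asarray(Xg, dtype=float)
    sliced = np.empty(Xg.shape, dtype=int)
    for k in range(Xg.shape[1]):
        cuts = np.quantile(Xg[:, k], np.arange(1, J) / J)
        sliced[:, k] = np.searchsorted(cuts, Xg[:, k])
    return sliced

# A continuous group is then scored exactly like a categorical one:
# info_gain_ratio(y, slice_group(Xg, J=4))
```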

2.2. Grouped Feature Screening Procedure Based on Information Gain Ratio

Let $X = (X_{ik})_{n \times p}$ be the covariate matrix and $Y_i$ a categorical response with $R$ classes $\{y_1, \dots, y_R\}$, $i = 1, \dots, n$, where $p = \sum_{g=1}^{G} p_g$ and $n$ is the sample size. Then $\mathcal{A} = \{g : P(Y \mid X) \text{ functionally depends on } \mathbf{X}_g \text{ for some } g\}$ denotes the active covariate subset.

A modified information gain ratio is used to measure the relationship between $Y$ and $\mathbf{X}_g$, because we must select a reduced model of moderate scale that almost surely contains $\mathcal{A}$. For each pair $(Y, \mathbf{X}_g)$, the information gain ratio is
$$e_g = \frac{H(Y) - H(Y \mid \mathbf{X}_g)}{H(\mathbf{X}_g)},$$
where $H(Y) = -\sum_{r=1}^{R} p_r \log p_r$ and $H(Y \mid \mathbf{X}_g) = -\sum_{j=1}^{K_g} q^{(g)}_j \sum_{r=1}^{R} p^{(g)}_{rj} \log p^{(g)}_{rj}$ when $\mathbf{X}_g$ is a categorical group, with $K_g = J^{p_g}$.

When $\mathbf{X}_g$ is a continuous group, $e_g$ is defined in the same way with $\mathbf{X}_g$ replaced by its sliced version $\mathbf{X}^{(J_n)}_g$, where the slice boundaries are the $(j/J_n)$-th percentiles of each component, $j = 1, \dots, J_n - 1$, and $J_n$ is the number of slices applied to each component of $\mathbf{X}_g$.

When information gain is used to select features, it tends to favour features that take many distinct values. The information gain ratio remedies this by dividing the information gain by the entropy of the feature itself, which acts as a penalty: when a feature has many categories its entropy is large and the penalty is strong, and when it has few categories the penalty is weak. The information gain ratio thus compensates for the bias of information gain toward features with many values.
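
The following toy comparison (our illustration, not from the paper) shows the effect. Both features below are pure noise, but the empirical information gain of the many-category feature is inflated by roughly $(K-1)(R-1)/(2n)$ nats of sampling bias, while dividing by the feature's own entropy damps the inflation:

```python
import numpy as np

def H(z):
    """Empirical entropy of a discrete sample."""
    _, c = np.unique(z, return_counts=True)
    p = c / c.sum()
    return -np.sum(p * np.log(p))

def ig_igr(y, x):
    """Empirical information gain and gain ratio of y given x."""
    h_cond = sum((x == v).mean() * H(y[x == v]) for v in np.unique(x))
    ig = H(y) - h_cond
    return ig, ig / H(x)

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, n)         # binary response
noise2 = rng.integers(0, 2, n)    # irrelevant feature, 2 categories
noise50 = rng.integers(0, 50, n)  # irrelevant feature, 50 categories

# noise50 receives a much larger raw information gain than noise2 despite
# both being irrelevant; the gain ratio's large H(x) penalty for noise50
# shrinks both scores toward zero.
print(ig_igr(y, noise2))
print(ig_igr(y, noise50))
```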

With the sample group data $\{(\mathbf{X}_{g,i}, Y_i) : i = 1, \dots, n\}$, $g = 1, \dots, G$, $e_g$ can easily be estimated by replacing the probabilities with their empirical counterparts
$$\hat p_r = \frac{1}{n}\sum_{i=1}^{n} I(Y_i = y_r), \quad \hat q^{(g)}_j = \frac{1}{n}\sum_{i=1}^{n} I\bigl(\mathbf{X}_{g,i} = c^{(g)}_j\bigr), \quad \hat p^{(g)}_{rj} = \frac{\sum_{i=1}^{n} I\bigl(Y_i = y_r, \mathbf{X}_{g,i} = c^{(g)}_j\bigr)}{\sum_{i=1}^{n} I\bigl(\mathbf{X}_{g,i} = c^{(g)}_j\bigr)}.$$

When $\mathbf{X}_g$ is categorical,
$$\hat e_g = \frac{-\sum_{r=1}^{R}\hat p_r \log \hat p_r + \sum_{j=1}^{K_g}\hat q^{(g)}_j \sum_{r=1}^{R}\hat p^{(g)}_{rj}\log \hat p^{(g)}_{rj}}{-\sum_{j=1}^{K_g}\hat q^{(g)}_j \log \hat q^{(g)}_j}.$$

When $\mathbf{X}_g$ is continuous, $\hat e_g$ is computed in the same way after slicing each component of $\mathbf{X}_g$ at its sample percentiles $\hat q^{(k)}_{j/J_n}$, $j = 1, \dots, J_n - 1$, where $\hat q^{(k)}_{j/J_n}$ is the sample $(j/J_n)$-th percentile of the $k$-th component. In either case, $0 \le \hat e_g \le 1$.

We suggest selecting the submodel $\hat{\mathcal{A}} = \{g : \hat e_g \ge c n^{-\tau}, 1 \le g \le G\}$, where the predetermined thresholds $c$ and $\tau$ are specified by condition (C2) in Subsection 2.3. In practice, we can select the model
$$\hat{\mathcal{A}} = \{g : \hat e_g \text{ is among the largest } d_n \text{ of all } \hat e_1, \dots, \hat e_G\},$$
where $d_n = \lfloor n / \log n \rfloor$.
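
Putting the pieces together, here is a compact sketch of the whole screening procedure under the assumptions above (our illustration; the rule used to detect continuous columns and the default cutoff $d_n = \lfloor n/\log n\rfloor$ are choices left to the user):

```python
import numpy as np

def gigr_sis(y, X, groups, J=4, d=None):
    """Rank covariate groups by the empirical information gain ratio and
    return the indices of the top d groups (the selected submodel).

    y      : (n,) class labels.
    X      : (n, p) covariate matrix; columns may be discrete or continuous.
    groups : list of column-index arrays, one per group.
    J      : number of slices for continuous columns.
    d      : submodel size; defaults to floor(n / log n).
    """
    def H(z):
        _, c = np.unique(z, return_counts=True)
        p = c / c.sum()
        return -np.sum(p * np.log(p))

    n = len(y)
    d = d if d is not None else int(n / np.log(n))
    scores = np.empty(len(groups))
    for g, idx in enumerate(groups):
        Xg = np.asarray(X[:, idx], dtype=float)
        for k in range(Xg.shape[1]):
            col = Xg[:, k]
            if len(np.unique(col)) > J:   # heuristic: treat as continuous
                cuts = np.quantile(col, np.arange(1, J) / J)
                Xg[:, k] = np.searchsorted(cuts, col)
        # One joint category combination per row of the group.
        _, joint = np.unique(Xg, axis=0, return_inverse=True)
        h_cond = sum((joint == j).mean() * H(y[joint == j])
                     for j in np.unique(joint))
        h_g = H(joint)
        scores[g] = (H(y) - h_cond) / h_g if h_g > 0 else 0.0
    return np.argsort(scores)[::-1][:d], scores
```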

2.3. Group Feature Screening Property

In this subsection, we establish the sure screening property of GIGR-SIS. Sure independence screening (SIS), which provided a statistical theoretical foundation for ultrahigh-dimensional feature screening techniques, was first proposed by Fan and Lv [1], and IG-SIS [7] and GIG-SIS [17] were shown to satisfy sure independence screening. We assume the following conditions based on these theories.
(C1) There exist two positive constants $c_1$ and $c_2$ such that $c_1/R \le p_r \le c_2/R$ and $c_1/K_g \le q^{(g)}_j \le c_2/K_g$ for every $r = 1, \dots, R$, $j = 1, \dots, K_g$, and $g = 1, \dots, G$.
(C2) There exist a positive constant $c$ and a constant $0 \le \tau < 1/2$ such that $\min_{g \in \mathcal{A}} e_g \ge 2c n^{-\tau}$.
(C3) $R = O(n^{\kappa})$ and $K_{\max} = \max_{1\le g\le G} K_g = O(n^{\eta})$, where $\kappa \ge 0$ and $\eta \ge 0$.
(C4) There is a positive constant $c_3$ such that $f_{g,r}(x) \le c_3$ for any $x$ and $y_r$ in the domains of $\mathbf{X}_g$ and $Y$, where $f_{g,r}$ is the Lebesgue density function of $\mathbf{X}_g$ conditional on $Y = y_r$.
(C5) There exist a positive constant $c_4$ and a constant $\rho \ge 0$ such that $f_g(x) \ge c_4 n^{-\rho}$ for any $x$ in the domain of $\mathbf{X}_g$, where $f_g$ is the Lebesgue density function of $\mathbf{X}_g$. Furthermore, $f_g$ is continuous in the domain of $\mathbf{X}_g$.
(C6) $J_n = O(n^{\beta})$ and $R = O(n^{\kappa})$, where $\beta \ge 0$ and $\kappa \ge 0$.
(C7) $\liminf_{n\to\infty}\bigl\{\min_{g\in\mathcal{A}} e_g - \max_{g\notin\mathcal{A}} e_g\bigr\} \ge \delta$, where $\delta > 0$ is a constant.

Condition (C1) ensures that no class occurs with a proportion that is either too small or too large; a similar assumption is made by Huang et al. [5] and by Cui et al. [4]. Condition (C2) permits the minimum true signal to vanish to zero at the order $n^{-\tau}$ as the sample size tends to infinity. Conditions (C3) and (C6) allow the numbers of classes of the response and of the covariates to diverge at particular rates. Condition (C4) excludes the extreme case in which some covariate places heavy mass in a narrow range, ensuring that the sample percentiles are close to the population percentiles. Condition (C5) requires the density to have a lower bound of order $n^{-\rho}$. Condition (C7), following Cui et al. [4] and Zhu et al. [18], guarantees that the active and inactive covariate subsets are well separated.

Theorem 1. Under conditions (C1) to (C3), if all the covariates are categorical, we have
$$P(\mathcal{A} \subseteq \hat{\mathcal{A}}) \ge 1 - O\left(G \exp\left\{-\frac{b n^{1-2\tau}}{(R K_{\max})^{2}} + \log(R K_{\max})\right\}\right),$$
where $b$ is a positive constant. If $\log G = o\bigl(n^{1-2\tau}/(R K_{\max})^{2}\bigr)$ and $n^{1-2\tau}/(R K_{\max})^{2} \to \infty$, GIGR-SIS has the sure screening property.

Theorem 2. Under conditions (C4) to (C6), if the covariates consist of continuous and categorical variables, we have
$$P(\mathcal{A} \subseteq \hat{\mathcal{A}}) \ge 1 - O\left(G \exp\left\{-\frac{b' n^{1-2\tau}}{(R K_{\max})^{2}} + \log(R K_{\max})\right\}\right),$$
where $b'$ is a positive constant and $K_{\max}$ is as in Theorem 1 with $K_g = J_n^{p_g}$ for continuous groups. If $\log G = o\bigl(n^{1-2\tau}/(R K_{\max})^{2}\bigr)$ and $n^{1-2\tau}/(R K_{\max})^{2} \to \infty$, GIGR-SIS has the sure screening property.
The proposed screening index can effectively distinguish between active and inactive covariates at the sample level, as shown by Theorem 3.

Theorem 3. Under conditions (C1), (C4), (C5), and (C7), if $\log G = o(n^{1-2(\kappa+\eta)})$ and $\kappa + \eta < 1/2$, then
$$\liminf_{n\to\infty}\left\{\min_{g\in\mathcal{A}}\hat e_g - \max_{g\notin\mathcal{A}}\hat e_g\right\} > 0 \quad \text{almost surely.}$$

3. Numerical Studies

3.1. Simulation Results

In this subsection, we carry out five simulation studies to demonstrate the finite-sample performance of our group screening method described in Section 2. We compare the performance of GIGR-SIS with IG-SIS, GIG-SIS, gSIS, and gHOLP.

There are five indicators for assessing method performance in Models 1 through 5. MMS, the minimum model size, is the size of the smallest model that includes all active covariates; the closer it is to the true number of active groups, the better. We report the 5%, 25%, 50%, 75%, and 95% quantiles of MMS across replications. CP1, CP2, and CP3 denote the coverage probabilities that all active covariates are included in models of size $d_n$, $2d_n$, and $3d_n$, respectively, where $d_n = \lfloor n/\log n \rfloor$. CPa, the total coverage probability, indicates whether the selected model incorporates all active covariates.
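
These indicators can be computed from a single screening ranking as in the following sketch (our illustration; the size thresholds $d_n$, $2d_n$, and $3d_n$ follow the convention above):

```python
import numpy as np

def mms_and_cps(scores, active, n):
    """Minimum model size and coverage probabilities for one replication.

    scores : (G,) screening statistics, larger = more important.
    active : indices of the truly active groups.
    n      : sample size, used for the d_n = floor(n / log n) thresholds.
    """
    order = np.argsort(scores)[::-1]        # groups ranked by score
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(scores) + 1)
    mms = ranks[list(active)].max()         # smallest size covering the active set
    d = int(n / np.log(n))
    cps = [mms <= k * d for k in (1, 2, 3)]  # CP1, CP2, CP3
    return mms, cps
```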

3.1.1. Model 1: Binary Response

To begin, we consider a straightforward model in which all covariates are categorical and the response is binary, so $R = 2$, as proposed by Ni and Fang [7] and He and Deng [17]. We consider two distributions for $Y$: (1) balanced, $p_1 = p_2 = 1/2$; (2) unbalanced, with unequal class probabilities $p_r$ as specified in [7, 17].

The true model is defined by the active set $\mathcal{A}$, with group size 5; the number and positions of the active groups follow Ni and Fang [7] and He and Deng [17]. We generated a latent variable $Z$ from $Y$ and then constructed the active covariates by discretizing $Z$ into categories according to threshold rules on its value.

Finally, we generated the irrelevant covariates using quantiles of the standard normal distribution: odd-indexed covariates were discretized into two categories and even-indexed covariates into five categories, with the cut points taken at the corresponding standard normal percentiles.

As a result, half of all $p$ covariates are two-category and the remaining half are five-category. We consider sample sizes $n = 80$, 120, and 160 with covariate dimension $p = 1500$. A schematic data-generation sketch is given below.
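
Since the fine details of the generating rules are given in [7, 17], the following is only a schematic sketch of this design; the latent-variable rule, the active-set size, and the thresholds are placeholders, not the paper's exact values:

```python
from statistics import NormalDist
import numpy as np

def generate_model1(n, p=1500, group_size=5, n_active_groups=4, seed=0):
    """Schematic Model 1 generator: binary Y, categorical covariates.
    Placeholder active-set size and latent rule; see [7, 17] for the
    exact design."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, n)            # balanced binary response
    z = y + rng.normal(size=n)           # latent variable driven by Y
    X = np.empty((n, p), dtype=int)
    n_active = n_active_groups * group_size
    for k in range(p):
        cats = 2 if k % 2 == 0 else 5    # half 2-category, half 5-category
        # Cut points at standard normal quantiles j/cats.
        cuts = [NormalDist().inv_cdf(j / cats) for j in range(1, cats)]
        src = z if k < n_active else rng.normal(size=n)
        X[:, k] = np.searchsorted(cuts, src)
    groups = [np.arange(g * group_size, (g + 1) * group_size)
              for g in range(p // group_size)]
    return y, X, groups
```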

The indicators for assessing method performance over 100 simulations are presented in Table 1.

Evaluation of Various Sample Sizes. The MMS of GIGR-SIS, GIG-SIS, gSIS, and gHOLP all approach the true number of active groups as the sample size increases, and the coverage probability indexes all reach 1. However, at the smaller sample sizes, the four coverage probability indexes of IG-SIS perform worse than those of GIGR-SIS, and the MMS of GIGR-SIS is superior to that of IG-SIS. Therefore, in Model 1, GIGR-SIS has better finite-sample performance than IG-SIS.

Comparison of Various Response Structures. The five indexes perform comparably for the balanced and unbalanced responses. Owing to the small fluctuation range of its MMS, GIGR-SIS is also more robust than the other grouped screening methods.

3.1.2. Model 2: Multiclass Response

More covariate categories are taken into account, and the response is multiclass. We consider two distributions for $Y$: (1) balanced, $p_r = 1/R$ for all $r$; (2) unbalanced, with unequal class probabilities as specified in [7, 17].

The true model is defined by the active set $\mathcal{A}$, with group size 5. For the covariates, we generated a latent variable from quantile functions of the standard normal distribution and constructed the active covariates via threshold rules on this latent variable, following [7, 17].

Finally, we generated the remaining covariates at standard normal quantiles, discretizing them into two, four, six, eight, or ten categories in equal proportions.

One-fifth of the covariates are two-category, one-fifth are four-category, one-fifth are six-category, one-fifth are eight-category, and the remaining one-fifth are ten-category.

The indicators for assessing method performance over 100 simulations for Model 2 are presented in Table 2. The performance of GIGR-SIS is significantly superior to that of the other screening methods in this more complex example; in particular, the MMS of GIGR-SIS remains relatively small even at small sample sizes.

Evaluation of Various Sample Sizes. The MMS of GIGR-SIS approaches the true number of active groups and its coverage probability indexes all reach 1 as the sample size increases. The coverage probability indexes of IG-SIS, gSIS, and gHOLP are all worse than those of GIGR-SIS, and the MMS of GIGR-SIS is superior to that of the other grouped screening methods. The poor screening of gSIS can be attributed to the fact that it is designed to work well when the predictors are weakly correlated, and that of gHOLP to the fact that it is designed to work well when the predictors are correlated; neither design matches this setting. Additionally, owing to their model limitations, the results of gSIS and gHOLP do not improve as the sample size increases. Therefore, in Model 2, GIGR-SIS has better finite-sample performance than the other grouped screening methods.

Comparison of Various Response Structures. The five indexes perform better for the unbalanced response than for the balanced one. Owing to the small fluctuation range of its MMS, GIGR-SIS is again more robust than the other screening methods.

3.1.3. Model 3: Continuous and Categorical Covariates

Next, we investigated a more complicated example with both continuous and categorical covariates and a multiclass response. We considered two distributions for $Y$: (1) balanced, $p_r = 1/R$ for all $r$; (2) unbalanced, with unequal class probabilities as specified in [7, 17].

The true model is defined by the active set $\mathcal{A}$, with 12 active covariates and group size 4. For the covariates, we generated a latent multivariate normal variable whose mean structure depends on $Y$; following Ni and Fang [7] and He and Deng [17], this mean structure is given in Table 3. The active covariates were then obtained from the latent variable, the categorical ones by discretizing it at quantile cut points and the continuous ones by taking it directly.

Four-category covariates account for one-fifth of all covariates, ten-category covariates for another one-fifth, and the remaining covariates are continuous. Continuous covariates make up half of the 12 active covariates, while four-category and ten-category categorical covariates make up the rest.

For grouped covariates, He and Deng [17] proposed a grouped feature screening method that uses joint information entropy to screen important grouped covariates. We apply each method with 4, 8, and 10 slices for the continuous covariates, denoted, for example, GIG-SIS-4, GIG-SIS-8, and GIG-SIS-10. The simulation results for 100 replications in the balanced and unbalanced cases are presented in Tables 4 and 5, respectively.

Evaluation of Various Sample Sizes. The MMS of GIGR-SIS approaches the true number of active groups and its coverage probability indexes all reach 1 as the sample size increases. When the sample size is 180, the coverage probability indices of gSIS and gHOLP are inferior to those of GIGR-SIS, and GIGR-SIS outperforms IG-SIS, gSIS, and gHOLP in terms of MMS. Thus, in Model 3, GIGR-SIS outperforms IG-SIS, gSIS, and gHOLP with finite samples. Additionally, the coverage probability indices of GIGR-SIS are comparable to those of GIG-SIS, demonstrating that GIGR-SIS possesses the group feature screening property.

Comparison of Various Response Structures. The indexes perform better for the unbalanced response than for the balanced one. Owing to their model limitations, gSIS and gHOLP perform poorly for both response structures, whereas the MMS of GIGR-SIS and GIG-SIS is robust to the response structure.

Comparison of Various Slice Counts. In terms of MMS and the coverage probability indexes, GIGR-SIS performs well regardless of the number of slices applied to the continuous covariates; the slice count has little effect on the performance of either GIGR-SIS or GIG-SIS.

3.1.4. Model 4: Continuous Covariates

The true model is defined by the active set $\mathcal{A}$, with group size 4. Conditional on $Y$, we generated a latent multivariate normal variable whose mean structure, following Ni and Fang [7] and He and Deng [17], is given in Table 3, and the continuous covariates were taken directly from the latent variable.

All covariates, including the active ones, are continuous. We use a slice count of 8 for the continuous covariates; the corresponding methods are denoted GIGR-SIS-8, IG-SIS-8, GIG-SIS-8, gSIS-8, and gHOLP-8.

Table 6 shows that the GIGR-SIS screening method proposed in this paper works well for continuous data as the sample size increases, whereas gSIS and gHOLP are limited by their model assumptions and screen continuous data poorly. The MMS of GIGR-SIS approaches the true number of active groups as the sample size increases, and its coverage probability indexes reach 1. Additionally, the coverage probability indices of GIGR-SIS are comparable to those of GIG-SIS, again demonstrating that GIGR-SIS possesses the group feature screening property.

3.1.5. Model 5: Computational Time Complexity Analysis

This model is the same as Model 1, but for the distribution of $Y$ we consider only balanced data, that is, $p_1 = p_2 = 1/2$. The true model is defined by the active set $\mathcal{A}$ with group size 5, and the active and irrelevant covariates are generated in the same way as in Model 1. As before, half of the $p$-dimensional covariates are two-category and the other half are five-category.

Model 5 fixes the sample size at 150 and lets the covariate dimension range from 1500 to 10500 in equal steps of 1000. For each dimension, the running time of each of the five methods is recorded, and the median running time over 100 replicate experiments is taken as the timing index. We then compare how the five methods' running times grow as the covariate dimension increases, so as to compare their computational complexity. A sketch of such a timing harness is given below.
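
A minimal timing harness for this experiment might look as follows (our sketch; `generate_model1` and `gigr_sis` refer to the illustrative functions above, and the replication count is reduced for brevity):

```python
import time
import numpy as np

def median_runtime(method, datasets):
    """Median wall-clock time of `method` over replicated datasets."""
    times = []
    for y, X, groups in datasets:
        t0 = time.perf_counter()
        method(y, X, groups)
        times.append(time.perf_counter() - t0)
    return float(np.median(times))

n = 150
for p in range(1500, 10501, 1000):
    datasets = [generate_model1(n, p=p, seed=s) for s in range(10)]
    print(p, median_runtime(gigr_sis, datasets))
```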

The median running times in Table 7 grow linearly for all five methods as the covariate dimension increases. The running time of GIGR-SIS differs little from that of GIG-SIS, while the running times of gSIS and gHOLP are about half that of GIGR-SIS. This is because gSIS and gHOLP are limited by their model settings: although their running times are short, their screening performance is poor; see Table 1 for details.

3.2. Real Data

In this subsection, we analyse a real dataset on the mutational status of p53 in cell lines, from the gene expression studies reported in Subramanian et al. [19] and Zeng and Breheny [20]. The p53 study aims to identify pathways that correlate with the mutational status of the gene p53, which regulates gene expression in response to various signals of cellular stress [20]. The dataset consists of cell-line samples, each classified as normal or as carrying a mutation in the p53 gene, together with their gene expression features. In the context of biological genetics, these genes do not act alone but act together within the same gene pathway.

According to the analyses in Subramanian et al. [19] and Zeng and Breheny [20], the gene variables were grouped into gene pathways containing varying numbers of genes. We applied the GIGR-SIS-4, GIGR-SIS-8, and GIGR-SIS-10 methods to select important gene pathways, that is, group variables. Because the group variables are large, with many genes per gene pathway, we report only the indices of the finally selected groups; the specific screening results are shown in Table 8. The group variables selected by the three slice counts differ little from one another.

4. Conclusions

In this article, we have proposed GIGR-SIS, a grouped feature screening method based on the information gain ratio for categorical covariates in classification models. In contrast to most of the existing literature, we handle variables that are naturally grouped. The GIGR-SIS method is model-free, and we establish the sure screening property of this group screening approach. Simulation results show that GIGR-SIS outperforms GIG-SIS, gSIS, and gHOLP, and the study of the p53 data also shows the advantages of our screening method.

Grouped feature screening encounters difficulties when data are missing. In future work, we plan to develop a feature screening approach that handles missing covariates or responses in classification models.

Appendix

Propositions 1 and 2 are proved in He and Deng [17]; hence, their proofs are omitted.

Lemma A.1 (Bernstein inequality). If $X_1, \dots, X_n$ are independent random variables with mean zero and bounded support $[-M, M]$, then
$$P\left(\left|\sum_{i=1}^{n} X_i\right| > t\right) \le 2\exp\left\{-\frac{t^2/2}{V + Mt/3}\right\},$$
where $V \ge \operatorname{Var}\left(\sum_{i=1}^{n} X_i\right)$.
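
To illustrate how Lemma A.1 enters the arguments below, consider its standard application to the empirical class frequency $\hat p_r = n^{-1}\sum_{i=1}^{n} I(Y_i = y_r)$ (a generic computation, not taken verbatim from the paper). The summands $I(Y_i = y_r) - p_r$ are independent, have mean zero, are bounded by $M = 1$, and have total variance $n p_r(1-p_r) \le n/4$. Lemma A.1 with $t = n\varepsilon$ and $V = n/4$ then gives, for any $\varepsilon > 0$,
$$P\left(|\hat p_r - p_r| > \varepsilon\right) = P\left(\left|\sum_{i=1}^{n}\{I(Y_i = y_r) - p_r\}\right| > n\varepsilon\right) \le 2\exp\left\{-\frac{n\varepsilon^2/2}{1/4 + \varepsilon/3}\right\},$$
which is exactly the type of exponential bound collected in Lemmas A.2 and A.3.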

Lemma A.2. For discrete group covariates $\mathbf{X}_g$ and discrete response $Y$, we have the following three inequalities for any $\varepsilon > 0$:
(1) $P\bigl(|\hat p_r - p_r| > \varepsilon\bigr) \le 2\exp\{-c_5 n \varepsilon^2\}$;
(2) $P\bigl(|\hat q^{(g)}_j - q^{(g)}_j| > \varepsilon\bigr) \le 2\exp\{-c_5 n \varepsilon^2\}$;
(3) $P\bigl(|\hat p^{(g)}_{rj} - p^{(g)}_{rj}| > \varepsilon\bigr) \le c_6\exp\{-c_5 n \varepsilon^2\}$,
where $c_5$ and $c_6$ are positive constants.

Proof of Lemma A.2. The proofs of inequalities (1) and (2) have been provided in Ni [21] and He and Deng [17] and are comparable, so we give only the proof of inequality (3). Writing $\hat p^{(g)}_{rj}$ as a ratio of empirical frequencies, its numerator and denominator are sums of independent, bounded, mean-zero deviations, so the expectation of each summand is zero. With $\hat q^{(g)}_j$ bounded away from zero by condition (C1), the stated bound is established by the Bernstein inequality of Lemma A.1.

Lemma A.3. Under condition (C1), for discrete response $Y$ and discrete group covariates $\mathbf{X}_g$, we have
$$P\left(|\hat e_g - e_g| \ge c n^{-\tau}\right) \le O\left(\exp\left\{-\frac{c_7 n^{1-2\tau}}{(R K_g)^2} + \log(R K_g)\right\}\right),$$
where $c_7$ is a positive constant.

Proof of Lemma A.3. From the definitions of $\hat e_g$ and $e_g$ in Subsection 2.2, the difference $\hat e_g - e_g$ can be decomposed into terms involving $\hat p_r - p_r$, $\hat q^{(g)}_j - q^{(g)}_j$, and $\hat p^{(g)}_{rj} - p^{(g)}_{rj}$. Bounding each term by Lemma A.2 and collecting the bounds under condition (C1) yields the stated inequality, where the constant depends only on $c_1$ and $c_2$.

Proof of Theorem 1. According to conditions (C1) to (C3) and Lemma A.3, we have
$$P(\mathcal{A} \subseteq \hat{\mathcal{A}}) \ge 1 - P\left(\max_{g\in\mathcal{A}} |\hat e_g - e_g| > c n^{-\tau}\right) \ge 1 - O\left(G \exp\left\{-\frac{b n^{1-2\tau}}{(R K_{\max})^2} + \log(R K_{\max})\right\}\right),$$
where $b$ is a positive constant.

Lemma A.4 (Lemma A.2 of [21]). For any continuous covariate satisfying conditions (C4) and (C5) and any $\varepsilon > 0$, we have
$$P\left(\left|\hat q^{(k)}_{j/J_n} - q^{(k)}_{j/J_n}\right| > \varepsilon\right) \le c_8 \exp\{-c_9 n \varepsilon^2\},$$
where $c_8$ and $c_9$ are two positive constants.

Lemma A.5 (Lemma A.5 of [17]). For continuous $\mathbf{X}_g$, under conditions (C1), (C4), and (C5), for any $\varepsilon > 0$ we have
$$P\left(|\hat e_g - e_g| > \varepsilon\right) \le O\left(\exp\left\{-\frac{c_{10} n \varepsilon^2}{(R K_g)^2} + \log(R K_g)\right\}\right),$$
where $c_{10}$ is a positive constant.

Proof of Theorem 2. Based on Lemma A.5, the proof is identical to that of Theorem 1 and is therefore omitted.

Proof of Theorem 3. Under conditions (C1), (C4), (C5), and (C7) and by Lemmas A.3 and A.5, the proof is similar to that of Ni and Fang [7]. We have
$$P\left(\min_{g\in\mathcal{A}}\hat e_g - \max_{g\notin\mathcal{A}}\hat e_g \le \frac{\delta}{2}\right) \le P\left(\max_{1\le g\le G}|\hat e_g - e_g| \ge \frac{\delta}{4}\right) \le O\left(G\exp\left\{-c_{11} n^{1-2(\kappa+\eta)}\right\}\right),$$
where $c_{11}$ is a positive constant. Since $\log G = o(n^{1-2(\kappa+\eta)})$, there exists a positive constant $\nu$ such that the right-hand side is bounded by $\exp\{-\nu n^{1-2(\kappa+\eta)}\}$ for large $n$. Also, $\kappa + \eta < 1/2$ implies that $n^{1-2(\kappa+\eta)} \to \infty$, so there exists an $N$ such that $\sum_{n > N} P\left(\min_{g\in\mathcal{A}}\hat e_g - \max_{g\notin\mathcal{A}}\hat e_g \le \delta/2\right) < \infty$. By the Borel-Cantelli lemma, we have
$$\liminf_{n\to\infty}\left\{\min_{g\in\mathcal{A}}\hat e_g - \max_{g\notin\mathcal{A}}\hat e_g\right\} \ge \frac{\delta}{2} > 0 \quad \text{almost surely.}$$

Data Availability

The p53 in cell lines data used in the real data analysis was obtained from the R package grpregOverlap (https://github.com/YaohuiZeng/grpregOverlap.git).

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this article.

Acknowledgments

The work was supported by National Natural Science Foundation of China (grant no. 71963008).