Abstract

Variable selection plays an important role in data mining. It is crucial to screen for useful variables and extract useful information in a high-dimensional setup, where the number of predictor variables is much larger than the sample size. Statistical inference becomes more precise after irrelevant variables are screened out. This article proposes an orthogonal matching pursuit (OMP) algorithm for variable screening in this high-dimensional setup. The proposed method demonstrates good performance in variable screening. In particular, if the dimension of the true model is finite, OMP can discover all relevant predictors within a finite number of steps. Through theoretical analysis and simulations, it is confirmed that the OMP algorithm identifies all relevant predictors and thereby ensures screening consistency in variable selection. Given the sure screening property, a BIC-type criterion can be used to practically select the best candidate from the models generated by the OMP algorithm. Compared with the traditional orthogonal matching pursuit method, the resulting model improves prediction accuracy and reduces computational cost by screening out the irrelevant variables.

1. Introduction

Variable screening is an important technique in data mining. It captures informative variables by reducing the dimension in a high-dimensional setup, where the number of predictor variables is much larger than the sample size. However, statistical inference is computationally difficult in ultra-high dimensional linear models before variable screening, so it is necessary to remove the irrelevant variables from the model before conducting inference. The core idea is to screen for the informative variables with the aim of building a relevant model for future prediction. By removing most irrelevant and redundant variables from the data, variable selection helps improve the performance of learning models in terms of obtaining higher estimation accuracy [1]. The AIC [2] or BIC [3] can then be applied to further guarantee the accuracy of the resulting model.

The focus of this article is on ultra-high dimensional linear models, in which the number of predictor variables is much larger than the sample size. In particular, the number of covariates may increase at an exponential rate. Such linear models have gained a lot of attention in practical areas such as sentiment analysis and finance. Existing techniques in the literature include forward selection [1], the least absolute shrinkage and selection operator (Lasso) [4], the smoothly clipped absolute deviation penalty (SCAD) [5], etc. These efforts have been devoted to the challenging ultra-high dimensionality problem, which is motivated by contemporary applications such as bioinformatics, genomics, and finance. In other words, investigating complex relationships and dependencies in such data, with the aim of building a relevant model for inference, has become a major issue. A practically attractive approach is to first use a quick screening procedure to reduce the dimensionality of the covariates to a reasonable scale, for example below the sample size, and then apply variable selection techniques such as Lasso and SCAD in a second stage.

Motivated by current studies on variable screening approaches in ultra-high dimensional linear models, our interest lies in establishing the screening consistency property of OMP under certain conditions, by imposing the technical conditions stated in Wang [6], and hence in selecting a subset of predictors that includes all relevant predictors so as to guarantee the screening result.

The rest of this article is organized as follows: Section 2 provides a literature review of current variable screening methods. Section 3 presents a variable screening algorithm based on OMP, and the asymptotic properties of the resulting estimators are studied. Section 4 shows via simulation that the proposed technique exhibits the desired finite sample properties and can be useful in practical applications. Finally, Section 5 concludes the article and provides some future research directions. The proofs of the asymptotic results and lemmas can be found in the Appendix.

2. Literature Review

In the context of variable selection, screening approaches have gained a lot of attention besides penalty approaches such as Lasso [4] and SCAD [5]. When the predictor dimension is much larger than the sample size, the story changes drastically in the sense that the conditions required by most Lasso-type algorithms cannot be satisfied. Therefore, to conduct model selection in the high-dimensional setup, variable screening is a reasonable solution.

Sure independence screening (SIS), proposed by Fan and Lv [7], has gained popularity in the setting where the number of predictor variables is much larger than the sample size. Sure screening refers to the property that all the important variables are retained after applying a variable screening procedure, with probability tending to 1. It is desirable to have a dimensionality reduction method with this sure screening property. There are three reasons, clearly stated in Fan and Lv [7], why sure screening is of great importance when the dimension is larger than the sample size. First of all, the design matrix is rectangular, having many more columns than rows; in this case, the matrix is huge and singular. The maximum spurious correlation between a covariate and the response can be large because of the dimensionality, and an unimportant predictor can be highly correlated with the response variable owing to the presence of important predictors that are correlated with it. In addition, the population covariance matrix may become ill conditioned as the dimension grows, which makes variable selection difficult. Third, the minimum nonzero absolute coefficient may decay with the sample size and fall close to the noise level. Hence, in general, it becomes challenging to estimate the sparse parameter vector accurately when the dimension exceeds the sample size.

To address the abovementioned difficulties in variable selection, Fan and Lv [7] proposed a simple sure screening method that uses componentwise regression, or equivalently correlation learning, to reduce the dimensionality from high to a moderate scale below the sample size. The SIS method is described below.

Let ω = (ω_1, …, ω_d)ᵀ be a d-vector obtained by componentwise regression, that is,

ω = Xᵀy,

where the n × d data matrix X is first standardized columnwise. For any given γ ∈ (0, 1), we sort the componentwise magnitudes of ω in descending order and define a submodel

M_γ = {1 ≤ i ≤ d : |ω_i| is among the first [γn] largest of all},

where [γn] denotes the integer part of γn. This shrinks the full model down to a submodel M_γ with size [γn] smaller than the sample size n. This correlation learning ranks the importance of features according to their marginal correlations with the response variable. Moreover, it is called independence screening because each feature is used independently as a predictor to decide its usefulness for predicting the response variable. The computational cost of SIS is of order O(nd).
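As a concrete illustration (the article's own simulations were run in MATLAB, but Python/NumPy is used for the sketches here), below is a minimal version of this correlation screening step; the function name sis_screen and the synthetic data are ours, not from Fan and Lv [7].

```python
# A minimal sketch of SIS-style correlation screening, assuming a numpy design
# matrix X (n x d) and response y (n,); names and defaults are illustrative.
import numpy as np

def sis_screen(X, y, gamma=0.5):
    """Return indices of the [gamma * n] columns with the largest |omega_i|."""
    n, d = X.shape
    # Columnwise standardization of the design matrix.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    ys = y - y.mean()
    # Componentwise regression: omega = Xs^T y (a 1/n factor would not change the ranking).
    omega = Xs.T @ ys
    keep = int(gamma * n)                 # integer part of gamma * n
    order = np.argsort(-np.abs(omega))    # sort |omega_i| in descending order
    return np.sort(order[:keep])

# Example usage on synthetic data (sparse beta chosen for illustration only):
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 1000))
beta = np.zeros(1000)
beta[[0, 3, 6]] = [3.0, 1.5, 2.0]
y = X @ beta + rng.standard_normal(100)
print(sis_screen(X, y, gamma=0.3))
```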

With the dimension accurately reduced from high to below the sample size, variable selection improves in both speed and accuracy and can then be accomplished by a well-developed method such as SCAD, Lasso, or adaptive Lasso [8, 9], with the combined procedures denoted by SIS-SCAD, SIS-Lasso, and SIS-AdapLasso, respectively. Moreover, the sure screening property of SIS has been proven in Fan and Lv [7]. Intuitively, the core idea of SIS is to select the variables in two stages. In the first stage, an easy-to-implement method is used to remove the least important variables. In the second stage, a more sophisticated and accurate method is applied to reduce the variables further.

Though SIS enjoys the sure screening property and is easy to apply, it has several potential problems. First of all, if an important predictor is jointly correlated but marginally uncorrelated with the response variable, it is not selected by SIS and thus cannot be included in the estimated model. Second, similar to Lasso, SIS cannot handle the collinearity problem between predictors in terms of variable selection. Third, when some unimportant predictors are highly correlated with the important predictors, these unimportant predictors can have a higher chance of being selected by SIS than other important predictors that are relatively weakly related to the response variable. These three potential issues are carefully treated in several extensions of SIS. In particular, iterative SIS (ISIS) is designed to overcome these weaknesses of SIS.

ISIS works in two steps. In the first step, a subset of variables A_1 is selected by an SIS-based model selection method such as SIS-SCAD or SIS-Lasso, and an n-vector of residuals is obtained by regressing the response over the variables in A_1. In the second step, the residuals are treated as the new response variable and the previous step is repeated on the remaining variables, which returns a second subset of variables A_2. Fitting the residuals from the previous step on the remaining variables can significantly weaken the prior advantage of those unimportant variables that are highly correlated with the response only through their relations with the variables in A_1. In addition, the second step also makes it possible to select important variables that were missed in the first step. The second step is iterated until disjoint subsets are obtained whose union has a size smaller than the sample size n.
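As a rough illustration of this residual-based iteration, the sketch below screens a fixed number of columns per round against the current residuals and refits by ordinary least squares. It simplifies the original ISIS of Fan and Lv [7] (for instance, no penalized selection is used within each round), and names such as per_round and max_size are ours.

```python
# A simplified ISIS-style loop, assuming numpy arrays X (n x d) and y (n,).
import numpy as np

def isis_screen(X, y, per_round=10, max_size=50):
    n, d = X.shape
    selected = []                      # union of the disjoint subsets A_1, A_2, ...
    remaining = np.arange(d)
    resid = y - y.mean()
    while len(selected) < max_size and remaining.size > 0:
        # Step 1: screen the remaining columns against the current residuals.
        scores = np.abs(X[:, remaining].T @ resid)
        new = remaining[np.argsort(-scores)[:per_round]]
        selected = sorted(set(selected) | set(int(j) for j in new))
        remaining = np.setdiff1d(remaining, selected)
        # Step 2: refit on all selected columns and update the residuals.
        XA = X[:, selected]
        coef, *_ = np.linalg.lstsq(XA, y, rcond=None)
        resid = y - XA @ coef
    return np.array(selected)
```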

If SIS is used to select only one variable at each iteration, that is, if each subset contains a single variable, then ISIS is equivalent to orthogonal matching pursuit (OMP) [10], a greedy algorithm for variable selection. This connection is discussed in the study by Barron and Cohen [11].

Kim [12] proposed a filter ranking method using the elastic net penalty with sure independence screening (SIS) based on a resampling technique to overcome overfitting and high-performance computing issues. It is demonstrated via extensive simulation studies that SIS-LASSO, SIS-MCP, and SIS-SCAD with the proposed filtering method achieve superior performance in terms of not only accuracy but also true positive detection compared with those using the marginal maximum likelihood ranking (MMLR) method.

Another very popular yet classical variable screening method is forward regression (FR). As an important type of greedy algorithm, FR's theoretical properties have been investigated in the studies by Barron and Cohen [11], Donoho and Stodden [13], and Wang [6]. In particular, Wang [6] investigated FR's screening consistency property under an ultra-high dimensional setup by introducing four technical conditions.

There are a few comments on the four technical conditions introduced in the study by Wang [6]. First of all, the normality assumption has been popularly used in the past literature for theory development. Second, the smallest and largest eigenvalues of the covariance matrix need to be properly bounded. This boundedness condition, together with the normality assumption, ensures the sparse Riesz condition (SRC) defined in the study by Zhang and Huang [14]. Third, the standard L2 norm of the regression coefficients is bounded above by some proper constant, which guarantees that the signal-to-noise ratio is convergent. Moreover, the minimum absolute value of the nonzero regression coefficients needs to be bounded below. This constraint on the minimal size of the nonzero regression coefficients ensures that relevant predictors can be correctly selected; otherwise, if some of the nonzero coefficients converge to zero too fast, they cannot be selected consistently. Last but not least, the logarithm of the predictor dimension is bounded above by a power of the sample size with a small exponent. This condition allows the predictor dimension to diverge to infinity at an exponentially fast speed, which implies that the predictor dimension can be substantially larger than the sample size.

Under the assumption that the true model exists, Wang [6] introduced the FR algorithm with the aim of discovering all relevant predictors consistently. The main step of the FR algorithm is the iterative forward selection part. Consider the case where k relevant predictors have already been selected. The next step is to construct, for each predictor in the full set excluding the selected ones, a candidate model that includes this one additional predictor and to calculate the residual sum of squares of that candidate model. This step is repeated for every such predictor, and all the residual sums of squares are recorded. The minimum of the recorded residual sums of squares is then found, and the (k + 1)th selected predictor is determined by the index attaining this minimum. A detailed algorithm can be found in the study by Wang [6].
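A minimal Python sketch of this forward search, assuming plain least squares fits, is given below; the name forward_regression and the n_steps argument are illustrative rather than taken from Wang [6].

```python
# Forward regression sketch: at each step, add the predictor whose inclusion
# gives the smallest residual sum of squares.
import numpy as np

def forward_regression(X, y, n_steps):
    n, d = X.shape
    selected = []
    for _ in range(n_steps):
        best_j, best_rss = None, np.inf
        for j in range(d):
            if j in selected:
                continue
            XM = X[:, selected + [j]]
            coef, *_ = np.linalg.lstsq(XM, y, rcond=None)
            rss = np.sum((y - XM @ coef) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected
```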

Wang [6] showed theoretically that FR can identify all relevant predictors consistently, even if the predictor dimension is considerably larger than the sample size. In particular, if the dimension of the true model is finite, FR can discover all relevant predictors within a finite number of steps. In other words, the sure screening property is guaranteed under the four technical conditions. Given the sure screening property, the BIC of Chen and Chen [3] can be used to practically select the best candidate from the models generated by the FR algorithm. The resulting model is good in the sense that many existing variable selection methods, such as adaptive Lasso and SCAD, can be applied to it directly to increase the estimation accuracy.

The extended Bayes information criterion (EBIC) proposed by Chen and Chen [3] is suitable for large model spaces. In the linear regression setting, it takes the form

BIC(M) = log(σ̂²_M) + n⁻¹ |M| (log n + 2 log d),

where M is an arbitrary candidate model with size |M|, σ̂²_M = n⁻¹ RSS(M) is its residual variance estimate, and d is the predictor dimension. We then select the best model as the candidate minimizing this criterion.
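For illustration, a small Python helper that evaluates a criterion of this form is sketched below, assuming the BIC expression displayed above; the name hd_bic is hypothetical.

```python
# Evaluate log(sigma_hat^2_M) + |M| (log n + 2 log d) / n for a candidate
# column set M, with RSS obtained from an ordinary least squares fit.
import numpy as np

def hd_bic(X, y, model):
    """model: list/array of column indices of the candidate model M."""
    n, d = X.shape
    XM = X[:, list(model)]
    coef, *_ = np.linalg.lstsq(XM, y, rcond=None)
    sigma2_hat = np.sum((y - XM @ coef) ** 2) / n
    return np.log(sigma2_hat) + len(model) * (np.log(n) + 2 * np.log(d)) / n
```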

EBIC, which includes the original BIC as a special case, accounts for both the number of unknown parameters and the complexity of the model space. A model in Chen and Chen [3] is defined to be identifiable if no model of comparable size other than the true submodel can predict the response almost equally well. It has been shown that EBIC is selection consistent under some mild conditions and that it handles heavy collinearity among the covariates. Furthermore, EBIC is easy to implement because the extended BIC family does not require a data-adaptive tuning parameter procedure.

Other screening approaches include tournament screening (TS) [15], sequential Lasso [16], quantile-adaptive model-free variable screening [17], and conditional screening [18]. When the dimension is much larger than the sample size, tournament screening possesses the sure screening property and reduces spurious correlation. The asymptotic properties of sequential Lasso for feature selection in linear regression models with ultra-high dimensional feature spaces have also been investigated; an advantage of sequential Lasso is that it is not restricted by the dimensionality of the feature space. Quantile-adaptive model-free variable screening has two distinctive features: it allows the set of active variables to vary across quantiles, and it overcomes the difficulty of specifying the form of a statistical model in a high-dimensional space. Baranowski [19] proposed a workflow representation for scheduling, provenance, or visualization to resolve variable and method dependencies and evaluated the performance of screening properties. Samudrala [20] proposed a parallel framework for dimensionality reduction of large-scale data by identifying key components underlying spectral dimensionality reduction techniques and implementing them efficiently in parallel; it shows better performance for dimension reduction compared with existing methods. Chen [21] proposed a model-free feature screening method for settings where a censored response and error-prone covariates both exist, together with an iterative algorithm that improves the accuracy of selecting all important covariates. Choudalakis [22] proposed appropriate numerical methods for parameter estimation under the high-dimensional setup, with a thorough comparison among existing methods for both coefficient estimation and variable selection in supersaturated designs. Xu et al. [23–25] proposed several multi-objective robust optimization models for MDVRPLS in refined oil distribution. Ren [26] proposed an asymmetric learning to hash with variable bit encoding algorithm (AVBH) to address the high-dimensional data problem and applied it to real data to examine its finite sample performance.

3. Main Results

Orthogonal matching pursuit (OMP) is an iterative greedy algorithm that, at each step, selects the column most correlated with the current residuals. The selected column is then added to the set of selected columns. Inspired by the idea of the FR algorithm in Wang [6], we show that under some proper conditions, OMP enjoys the sure screening property in the linear model setup.

3.1. Model Setup and Technical Conditions

Let (Y_i, X_i) be the observation collected from the ith subject (1 ≤ i ≤ n), where Y_i ∈ R is the response and X_i = (X_i1, …, X_id)ᵀ ∈ R^d is the high-dimensional predictor with mean zero and covariance matrix Σ = cov(X_i), where the dimension d can be much larger than n. Moreover, β = (β_1, …, β_d)ᵀ ∈ R^d is the regression coefficient vector. In matrix representation, the design matrix is X = (X_1, …, X_n)ᵀ ∈ R^{n×d} and the response vector is y = (Y_1, …, Y_n)ᵀ. Consider the linear regression model

Y_i = X_iᵀβ + ε_i,  i = 1, …, n,  or, in matrix form, y = Xβ + ε.

Without loss of generality, it is assumed that the data are centered and the columns of X are normalized, that is, each column of X has mean zero and unit norm, and that the responses are conditionally independent given the design matrix X. Moreover, the error terms ε_i are independently and identically distributed with mean zero and finite variance σ². A model fitting procedure produces the vector of estimated coefficients β̂ = (β̂_1, …, β̂_d)ᵀ.

Before the main result for the screening property of OMP is presented, four technical conditions are needed as follows:

3.1.1. Assumption 1. Technical Conditions
(C1) Normality assumption. Assume that the predictor vector X = (X_1, …, X_d)ᵀ follows a multivariate normal distribution with mean zero and covariance matrix Σ.
(C2) Covariance matrix. Let λ_min(A) and λ_max(A) represent, respectively, the smallest and largest eigenvalues of an arbitrary positive definite matrix A. We assume that there exists a positive constant τ such that τ⁻¹ ≤ λ_min(Σ) ≤ λ_max(Σ) ≤ τ.
(C3) Regression coefficients. We assume that ‖β‖ ≤ C_β for some constant C_β > 0, where ‖·‖ denotes the standard L2 norm. Also assume that min_{j∈T} |β_j| ≥ ν_β n^(−ξ_min) for some constants ν_β > 0 and ξ_min > 0, with ξ_min constrained jointly with the constants in (C4).
(C4) Divergence speed of d and |T|. We assume that log d grows at most polynomially in n and that the true model size |T| is allowed to diverge slowly. In other words, there exist constants ν > 0, ξ > 0, and ξ_0 > 0 such that log d ≤ ν n^ξ, |T| ≤ ν n^(ξ_0), and ξ, ξ_0, and ξ_min jointly satisfy the rate constraint imposed in Wang [6].
3.2. OMP Algorithm

Under the assumption that the true sparse model T = {j : β_j ≠ 0} exists, our main objective is to discover all relevant predictors consistently. To this end, we consider the following OMP algorithm (Algorithm 1):

Step 1 (Initialization). Set S^(0) = ∅ and set the residual r^(0) = y.
Step 2 (Forward selection).
(i) (2.1) Evaluation. In the (k + 1)th step (k ≥ 0), we are given the current model S^(k). Then, for every j ∉ S^(k), we compute
a_j^(k) = |X_jᵀ (I_n − H^(k)) y|,
where H^(k) = X_{S^(k)} (X_{S^(k)}ᵀ X_{S^(k)})⁻¹ X_{S^(k)}ᵀ is the projection onto the linear space spanned by the elements of S^(k) and I_n is the n × n identity matrix.
(ii) (2.2) Screening. We then find
j_{k+1} = argmax_{j ∉ S^(k)} a_j^(k),
and update S^(k+1) = S^(k) ∪ {j_{k+1}} accordingly.
Step 3 (Solution path). Iterating Step 2 for K steps leads to a total of K nested candidate models. We then collect those models into a solution path 𝕊 = {S^(k) : 1 ≤ k ≤ K}.
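To make these steps concrete, here is a minimal Python sketch of the OMP screening loop described above, assuming the columns of X are centered and standardized as in Section 3.1; omp_path and its arguments are illustrative names rather than part of the formal algorithm.

```python
# OMP screening sketch: returns the nested solution path S^(1), ..., S^(K).
import numpy as np

def omp_path(X, y, K):
    n, d = X.shape
    selected = []
    resid = y.copy()
    path = []
    for _ in range(K):
        # (2.1) Evaluation: correlation of every unselected column with the
        # current (orthogonalized) residual.
        scores = np.abs(X.T @ resid)
        if selected:
            scores[selected] = -np.inf     # exclude already selected columns
        # (2.2) Screening: pick the most correlated column and update S^(k+1).
        j = int(np.argmax(scores))
        selected.append(j)
        path.append(list(selected))
        # Orthogonalization: project y onto the span of the selected columns.
        XS = X[:, selected]
        coef, *_ = np.linalg.lstsq(XS, y, rcond=None)
        resid = y - XS @ coef
    return path
```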
3.3. Theoretical Properties

To prove Theorem 1, the following lemmas are needed. For convenience, we define the sample covariance matrix Σ̂ = n⁻¹XᵀX. Moreover, for an index set M, Σ̂_M and Σ_M denote the submatrices of Σ̂ and Σ (corresponding to the variables in M), respectively.

Lemma 1. If the eigenvalues of Σ are bounded as in condition (C2), then the eigenvalues of any submatrix Σ_M of Σ are also bounded. Suppose M contains m variables, so that Σ_M is an m × m matrix. Moreover, with probability tending to 1, the eigenvalues of the corresponding sample submatrix Σ̂_M are bounded away from zero and infinity, uniformly over such M, as long as m diverges slowly enough relative to n.

The proof of Lemma 1 can be found in the Appendix.

Lemma 2. Assume conditions (C1), (C2), and (C4), together with an appropriate bound on the size of the models under consideration. Then, with probability tending to 1, the uniform bound used in the proof of Theorem 1 holds.

Note that the proof of Lemma 2 can be found in the study by Wang [6], with only slight changes.

Before the theorems are established, we follow Wang [6]'s idea of screening consistency of a solution path and call the solution path 𝕊 screening consistent if
P(T ⊆ S^(k) for some S^(k) ∈ 𝕊) → 1 as n → ∞.

Then OMP’s screening consistency can be formally established by the following theorem.

Theorem 1. Under the linear model of Section 3.1 and conditions (C1)–(C4), we have P(T ⊆ S^(K)) → 1 as n → ∞, where K is the smallest integer no less than a bound determined by the constants defined in conditions (C2)–(C4).

Theorem 1 proves that, within K steps, all relevant predictors will be identified by the OMP algorithm. This number of steps is much smaller than the sample size n under condition (C4). In particular, if the dimension of the true model is finite, with |T| fixed, only a finite number of steps is needed to discover the entire relevant variable set.

Furthermore, Theorem 1 provides a theoretical basis for OMP, which enables us to empirically select the best model from the solution path 𝕊. On the other hand, the solution path contains a total of K nested models. To further select relevant variables from the solution path 𝕊, the following BIC (Chen and Chen [3]) is considered:

BIC(S^(k)) = log(σ̂²_{S^(k)}) + n⁻¹ |S^(k)| (log n + 2 log d),

where S^(k) is an arbitrary candidate model on the path, σ̂²_{S^(k)} = n⁻¹ RSS(S^(k)) is its residual variance estimate, and |S^(k)| is its size. We then select the best model S^(k̂), where k̂ = argmin_{1≤k≤K} BIC(S^(k)). We typically do not expect S^(k̂) to be selection consistent (i.e., P(S^(k̂) = T) → 1). However, we are able to show that S^(k̂) is indeed screening consistent.
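A brief sketch of this selection step is given below; it reuses the hypothetical helpers omp_path (Section 3.2) and hd_bic (Section 2) and simply evaluates the criterion along the nested path and returns the minimizer.

```python
# Run OMP for K steps and pick the model on the solution path with the
# smallest BIC value (omp_path and hd_bic are the sketches defined earlier).
import numpy as np

def omp_bic_select(X, y, K):
    path = omp_path(X, y, K)
    bic_values = [hd_bic(X, y, model) for model in path]
    best = int(np.argmin(bic_values))
    return path[best], bic_values
```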

Theorem 2. Under the linear model of Section 3.1 and conditions (C1)–(C4), the selected model S^(k̂) is screening consistent; that is, P(T ⊆ S^(k̂)) → 1 as n → ∞.

Define k_min = min{k : T ⊆ S^(k)}, the first step at which the path covers the true model. By Theorem 1, we know that k_min ≤ K with probability tending to 1. Therefore, our aim is to prove that P(T ⊆ S^(k̂)) → 1 as n → ∞; the theorem conclusion then follows. Equivalently, it suffices to show that, with probability tending to 1, every model on the path that misses a relevant predictor has a larger BIC value than some model on the path that contains all relevant predictors.

A detailed proof can be found in the Appendix.

4. Numerical Analysis

4.1. Simulation Setup

For a reliable numerical comparison, the following three simulation examples for the OMP algorithm are presented to examine the finite sample performance of the screening consistency property of OMP. For each parameter setup, a total of 100 simulation replications are conducted.

Let Ŝ_r be the model selected in the rth simulation replication, and let the average model size be the mean of |Ŝ_r| over the replications. Recall that T represents the true model; the coverage probability, defined as the proportion of replications in which T ⊆ Ŝ_r, measures how likely it is that all relevant variables are discovered by the method. This coverage probability characterizes the screening property of a particular method.

To characterize the capability of a method in producing sparse solutions, we define the percentage of correct zeros, that is, the average proportion of irrelevant predictors that are correctly excluded from the selected model.

To characterize the method's underfitting effect, we further define the percentage of incorrect zeros, that is, the average proportion of relevant predictors that are mistakenly excluded from the selected model.

If zeros are correctly produced for all irrelevant predictors and no zero is mistakenly produced for any relevant variable, the true model is perfectly identified, that is, Ŝ_r = T. To measure such a performance, we define the percentage of correctly fitted models (%), the proportion of replications in which the selected model equals the true model, which characterizes the selection consistency property of a particular method.
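The following Python sketch computes these summary measures for a collection of selected index sets, one per replication; all names are illustrative.

```python
# Summary metrics for variable screening simulations.
import numpy as np

def screening_metrics(selected_models, true_model, d):
    true_set = set(true_model)
    coverage = np.mean([true_set <= set(m) for m in selected_models])
    correct_fit = np.mean([set(m) == true_set for m in selected_models])
    avg_size = np.mean([len(m) for m in selected_models])
    # Percentage of correct zeros: irrelevant predictors correctly excluded.
    n_irrelevant = d - len(true_set)
    correct_zeros = np.mean([(n_irrelevant - len(set(m) - true_set)) / n_irrelevant
                             for m in selected_models])
    # Percentage of incorrect zeros: relevant predictors wrongly excluded.
    incorrect_zeros = np.mean([len(true_set - set(m)) / len(true_set)
                               for m in selected_models])
    return dict(coverage=coverage, correct_fit=correct_fit, avg_size=avg_size,
                correct_zeros=correct_zeros, incorrect_zeros=incorrect_zeros)
```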

As we need to know which variables are truly relevant or irrelevant, we create sparse regression vectors by setting β_j = 0 for all j except for a chosen set of coefficients, whose values are defined in advance for each example. Moreover, the entries of the noise vector are chosen i.i.d. normal. Note that all the simulation runs are conducted in MATLAB.

Example 1 (independent predictors). This is an example borrowed from Fan and Lv [7]. Each predictor is generated independently according to a standard normal distribution; thus, different predictors are mutually independent. The nonzero regression coefficients are generated randomly, each being a fixed constant plus the absolute value of a standard normal random variable, with the sign determined by an independent binary random variable.

Example 2 (autoregressive correlation). The predictor vector is generated from a multivariate normal distribution with mean 0 and an autoregressive-type correlation structure, in which the correlation between two predictors decays geometrically with the distance between their indices. Such a correlation structure can be useful if a natural order exists among the predictors; as a consequence, predictors with large distances in that order are expected to be approximately independent. This is an example from Tibshirani [4]. In addition, the first, fourth, and seventh components of β are set to be 3, 1.5, and 2, respectively.
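For illustration, a possible data-generating sketch for this autoregressive example is given below; the correlation parameter rho = 0.5, the dimensions n and d, and the standard normal noise are assumptions made here, since the exact values are not reproduced above.

```python
# Generate an AR(1)-correlated design with Sigma_ij = rho^{|i-j|} (assumed rho).
import numpy as np

def make_ar1_example(n=200, d=1000, rho=0.5, seed=0):
    rng = np.random.default_rng(seed)
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    beta = np.zeros(d)
    beta[[0, 3, 6]] = [3.0, 1.5, 2.0]   # first, fourth, and seventh components
    y = X @ beta + rng.standard_normal(n)
    return X, y, beta
```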

Example 3 (grouped variables). The predictors are generated by the following rule: the predictors in the first group share a common latent variable plus independent noise, the predictors in the second group share another common latent variable plus independent noise, and the remaining predictors are generated independently, where the latent variables and the noise terms are mutually independent. This creates two different within-group correlation levels. The example presents an interesting scenario where a group of significant variables is mildly correlated and, simultaneously, a group of insignificant variables is strongly correlated. The other settings are similar to those in Example 2. In addition, the three nonzero components of β are set to be 3, 1.5, and 2, respectively.

4.2. Simulation Results for OMP Screening Consistent Property

The finite sample performance of the OMP screening consistency property is investigated based on the abovementioned three examples of Section 4.1. Simulation results are presented in Table 1.

First of all, the simulation results for the independent predictor example show good performance in terms of screening and selection consistency for OMP. In other words, we have 100% coverage probability, which means that all relevant variables are discovered by the OMP method. In addition, the 94% of correctly fitted models indicates that BIC selects the true set of variables correctly 94 times out of 100 simulation replications. This result is not surprising, since Zhang [14] pointed out that OMP can select features or variables consistently under a certain irrepresentable condition. Furthermore, the percentage of correct zeros and the percentage of incorrect zeros are 99.9% and 1.6%, respectively. Last but not least, the average model size is 7.94, which is slightly below the true model size.

Furthermore, the simulation results for the autoregressive correlation example show very good performance in terms of screening and selection consistency for OMP. Both the coverage probability and the percentage of correctly fitted models are 100%. In particular, the 100% of correctly fitted models indicates that BIC selects the true set of variables correctly in all 100 simulation replications. This is good news, since the number of nonzero coefficients is only 3, which is a very sparse representation given the large predictor dimension. On top of that, the percentage of correct zeros and the percentage of incorrect zeros are 100% and 0%, respectively. Last but not least, the average model size is 3. Therefore, the OMP algorithm works very well under this autoregressive correlation setup with a sparse representation of β.

Third, the simulation results for the grouped variables example show the worst performance among the three examples in terms of screening and selection consistency for OMP, although the performance itself is not bad. The coverage probability is 96%, meaning that in some of the simulation replications not all relevant predictors are discovered by the OMP algorithm. In addition, the 34% of correctly fitted models indicates that BIC selects the true set of variables correctly only 34 times out of 100 simulation replications. On top of that, the percentage of correct zeros and the percentage of incorrect zeros are 95.9% and 2%, respectively. Last but not least, the average model size is 3.84.

Besides the summary of the simulation results for the OMP algorithm, three plots are presented in Figure 1. For each of the three examples, one particular plot of BIC versus the number of variables included in the final model is extracted for reference. These graphs are not representative as a whole; however, they do show the trend of BIC case by case. Take the BIC of Example 1 as an example (see Figure 1(a)). BIC decreases as the number of variables included in the model increases and reaches a local minimum when the number of included variables hits the true model size. Thereafter, BIC increases as more variables are included, until the number of variables gets close to the sample size. This is not surprising, since beyond the true model the complexity penalty dominates the gain in fit. Similar trends can be observed for Example 2 and Example 3 (see Figures 1(b) and 1(c)).

One possible suggestion for the OMP algorithm is that, after the screening process produces the candidate models on the solution path, only the few candidate models with the smallest BIC values are compared. By doing so, computational time can be saved without loss of the screening consistency property of OMP.

In conclusion, the finite sample performance in terms of screening and selection consistency for OMP is good under all three examples. These results support the theory proposed in Section 3.3.

5. Conclusion and Future Research

To conclude, this article provides a theoretical proof that OMP can identify all relevant predictors consistently, even if the predictor dimension is considerably larger than the sample size. In particular, if the dimension of the true model is finite, OMP can discover all relevant predictors within a finite number of steps. In other words, the sure screening property is guaranteed under the four technical conditions. Given the sure screening property, the BIC of Chen and Chen [3] can be used to practically select the best candidate from the models generated by the OMP algorithm. The resulting model is good in the sense that many existing variable selection methods, such as adaptive Lasso and SCAD, can be applied to it directly to increase the estimation accuracy. Compared with the traditional orthogonal matching pursuit method, the resulting model improves prediction accuracy and reduces computational cost by screening out the irrelevant variables.

The abovementioned variable selection procedure only considers the fixed effects in linear models. However, in real life, much existing data involve both fixed effects and random effects. For example, in clinical trials, several observations are taken over a period of time for each patient. After collecting the data for all patients, it is natural to include a random effect for each individual patient in the model, since a common error term for all observations is not sufficient to capture the individual randomness. Future research may include random effects in the model by imposing a penalized hierarchical likelihood algorithm for accurate variable selection.

Appendix

Proofs of Lemmas and Theorems

Proof of Lemma 1

Let v be an arbitrary d-dimensional vector and let v_M be the subvector corresponding to the index set M. By condition (C2), we know immediately that τ⁻¹‖v‖² ≤ vᵀΣv ≤ τ‖v‖². Taking v to be zero outside M shows that the same bounds hold for v_MᵀΣ_M v_M, so the eigenvalues of Σ_M are bounded.

The argument is presented as follows. Suppose that the matrix Σ is positive definite and is partitioned into 2 × 2 blocks. Then the inverse of Σ has the standard partitioned form involving the Schur complement of the leading block; this formula can be derived from the usual block factorization identity for Σ and elementary matrix algebra.
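For reference, a version of the partitioned-inverse (Schur complement) formula is displayed below in LaTeX; the block labels are chosen here for illustration and need not match the original notation.

```latex
\Sigma^{-1}
=
\begin{pmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22}\end{pmatrix}^{-1}
=
\begin{pmatrix}
\Sigma_{11}^{-1} + \Sigma_{11}^{-1}\Sigma_{12}\,S^{-1}\,\Sigma_{21}\Sigma_{11}^{-1} &
-\Sigma_{11}^{-1}\Sigma_{12}\,S^{-1}\\
-S^{-1}\Sigma_{21}\Sigma_{11}^{-1} & S^{-1}
\end{pmatrix},
\qquad
S=\Sigma_{22}-\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}.
```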

Moreover, the largest singular value of a matrix coincides with the operator norm of the corresponding linear operator on a Hilbert space. If A is a matrix with complex entries, then its singular values are defined as the square roots of the eigenvalues of the nonnegative definite Hermitian matrix A*A. If A is Hermitian, then let its eigenvalues be denoted in decreasing order.

Stage 1. To prove the bound on the largest eigenvalue.
After a suitable change of variables, the claim reduces to controlling the extreme eigenvalues of a standard sample covariance matrix; this part has been proven on page 334 of the book Spectral Analysis of Large Dimensional Random Matrices.

Stage 2. To prove the bound on the smallest eigenvalue.
Σ is positive definite, and so is its submatrix Σ_M. Using the block partition introduced above, the corresponding Schur complement is also positive definite, since the associated quadratic form is positive for any nonzero vector. Now, the desired conclusion of Lemma 1 is implied by a uniform bound on the deviation between the sample submatrix Σ̂_M and the population submatrix Σ_M, where the deviation level is an arbitrary positive number. The left-hand side of (A.9) is bounded through a union bound over all index sets containing the prescribed number of variables. By Lemma A3 in Bickel and Levina [28], there exist constants such that each individual deviation probability decays exponentially, and by (C4), the number of such index sets grows slowly enough. Therefore, the overall probability tends to zero, which completes the proof of Lemma 1.

Proof of Theorem 1

Proof. For every step of the algorithm, the reduction of the residual sum of squares admits a decomposition into two terms, where the newly added index is, by construction, the solution of the maximization problem in Step (2.2) of the algorithm. In addition, by Lemma 1, the eigenvalues of the relevant sample submatrices are bounded with probability tending to 1. Therefore, the decomposition can be bounded accordingly with probability tending to 1.
Suppose now that the currently selected model does not yet contain all relevant predictors. Then the key quantity can be decomposed into the two terms appearing in (A.18), and in what follows the two terms involved in (A.18) will be carefully evaluated separately.

Step 1 (the first term in (A.18)). For convenience, we introduce an auxiliary quantity and bound the first term in (A.18) via (A.19) and (A.20). Applying (A.20) back to (A.19), and then using conditions (C1)–(C3) and Lemma 2, we obtain a bound that holds with probability tending to 1. Applying this result back to (A.21) and also noting the technical conditions (C3) and (C4), we arrive at (A.23).

Step 2 (the second term in (A.18)). The second term involves a normal random variable with mean 0 and an explicitly computable variance. Thus, the right-hand side of (A.24) can be bounded further in terms of a chi-square random variable with one degree of freedom. By conditions (C1) and (C2) and Lemma 1, the variance factor is bounded with probability tending to 1. On the other hand, the total number of combinations of candidate models and candidate indices involved is limited, so we can proceed by using Bonferroni's inequality to obtain a uniform tail bound, which implies that the chi-square terms are uniformly controlled with probability tending to 1 as n → ∞. Then, in conjunction with (C4), we obtain (A.27) with probability tending to 1. Combining (A.23) and (A.27) and putting them back into (A.18), we can show (A.28) uniformly for every step considered. Under condition (C4), the cumulative decrease of the residual sum of squares implied by (A.28) would eventually exceed the total amount available. In contrast, under the assumption that some relevant predictor is still missing at every step, we obtain a bound that contradicts the result of (A.29). Hence, it is impossible for the algorithm to keep missing a relevant predictor at every step. Consequently, with probability tending to 1, all relevant predictors are recovered within a total of K steps. This completes the proof.

Proof of Theorem 2

Proof. It suffices to show that the stated BIC comparison holds. Note that the comparison can be rewritten as a difference of BIC values under the assumption that the candidate model does not cover the true model. By definition, k̂ minimizes the BIC along the solution path. Then, with probability tending to 1, the right-hand side of (A.14) can be bounded from below by the definition of the BIC. In addition, we use an elementary logarithmic inequality valid for any admissible argument. Then the right-hand side of (A.32) can be bounded further according to (A.29). Moreover, the right-hand side of (A.33) is independent of the particular candidate model; hence, it is a uniform lower bound over all models on the path that miss a relevant predictor. Thus, it suffices to show that the right-hand side of (A.33) is positive with probability tending to 1. To this end, we first note that the penalty term is asymptotically negligible relative to the gain in fit under condition (C4). Therefore, the lower bound is positive with probability tending to 1 under condition (C4). This completes the proof.

Data Availability

The data used to support the findings of the study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

A preprint has previously been published as a part of the Ph.D. thesis of the first author [27]: https://scholarbank.nus.edu.sg/bitstream/10635/43427/1/thesis.pdf. This work was funded by the Shanghai Planning Project of Philosophy and Social Science (Grant no. 2021EGL004). The authors gratefully acknowledge the Shanghai Planning Office of Philosophy and Social Science for the technical and financial support.