Weaker Regularity Conditions and Sparse Recovery in High-Dimensional Regression
Regularity conditions play a pivotal role in sparse recovery for high-dimensional regression. In this paper, we present a weaker regularity condition and discuss its relationship with other regularity conditions, such as the restricted eigenvalue condition. We study the behavior of our new condition for design matrices with independent random columns drawn uniformly from the unit sphere. Moreover, we show that, under a sparsity scenario, the Lasso estimator and the Dantzig selector exhibit similar behavior. Based on both methods, we derive, in parallel, more precise bounds for the estimation loss and the prediction risk in the linear regression model when the number of variables can be much larger than the sample size.
In recent years, problems of statistical inference in the high-dimensional setting, in which the dimension $p$ of the data exceeds the sample size $n$, have attracted a great deal of attention. One concrete instance of a high-dimensional inference problem concerns the standard linear regression model $y = X\beta + \varepsilon$ (1), where $X \in \mathbb{R}^{n \times p}$ is called the design matrix, $\beta \in \mathbb{R}^{p}$ is an unknown target vector, and $\varepsilon \in \mathbb{R}^{n}$ is a stochastic error term; the goal is to estimate the vector $\beta$ based on the response $y \in \mathbb{R}^{n}$ and the covariates $X$. In the setting $p > n$, the classical linear regression model is unidentifiable, so that it is not meaningful to estimate the parameter vector $\beta$.
However, many high-dimensional regression problems exhibit special structure that can lead to an identifiable model. In particular, sparsity in the regression vector $\beta$ is an archetypal example of such structure: only a few components of $\beta$ are different from zero, say $s$ of them, and $\beta$ is then said to be $s$-sparse. There has been great interest in the study of this problem recently. The use of the $\ell_1$-norm penalty to enforce sparsity has been very successful, and several methods have been proposed, such as the Lasso or basis pursuit and the Dantzig selector. Sparsity has also been exploited in a number of other questions, for instance, instrumental variable regression in the presence of endogeneity.
There is now a well-developed theory on what conditions are required on the design matrix $X$ for such $\ell_1$-based relaxations to reliably estimate $\beta$; for example, see [5–16]. The restricted eigenvalue (RE) condition due to Bickel et al. is one of the weaker conditions among these. Wang and Su [7, 13] presented conditions equivalent to it, and there is also a large body of work in the high-dimensional setting, for example, [3, 6, 12, 17–19], which establishes a uniform uncertainty principle (UUP), a condition that is stronger than the RE condition; see [10, 20]. In this paper, we consider a restricted eigenvalue condition that is weaker than the RE conditions in [7, 10, 13] under a certain setting.
Thus, in the setting of high-dimensional linear regression, the question of interest is how to accurately estimate the regression vector $\beta$ and the response $X\beta$ from few, corrupted observations. In the standard form, under assumptions on the matrix $X$ and with high probability, the estimation bounds are of the form (e.g., see [7, 8, 13, 21]), and the prediction errors are bounded by (e.g., see [1, 7, 21]), where is a positive constant.
The main contribution of this paper is the following: we present a restricted eigenvalue assumption that is weaker than the RE conditions in previous papers under a certain setting. Using the $\ell_1$-norm penalty, we obtain bounds that are more precise than the existing ones. Whether there is a weaker assumption that yields better results in all circumstances remains an open question.
The remainder of this paper is organized as follows. We begin in Section 2 with some notations and definitions. In Section 3, we introduce some assumptions and discuss the relation between our assumptions and the existing ones. Section 4 contains our main results, and we also show the approximate equivalence between the Lasso and the Dantzig selector. We give three lemmas and the proofs of the theorems in Section 5.
In this section, we introduce some notation and definitions.
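Let $\delta = (\delta_1, \ldots, \delta_p)^T \in \mathbb{R}^p$ be a vector. We denote by $M(\delta) = \sum_{j=1}^{p} I_{\{\delta_j \neq 0\}} = |J(\delta)|$ the number of nonzero coordinates of $\delta$, where $I_{\{\cdot\}}$ denotes the indicator function, $J(\delta) = \{j \in \{1, \ldots, p\} : \delta_j \neq 0\}$, and $|J|$ the cardinality of $J$. We use the standard notation $\|\delta\|_q$ to stand for the $\ell_q$-norm of the vector $\delta$. Moreover, a vector $\delta$ is said to be $s$-sparse if $M(\delta) \le s$; that is, it has at most $s$ nonzero entries. For a vector $\delta \in \mathbb{R}^p$ and a subset $J \subseteq \{1, \ldots, p\}$, we denote by $\delta_J$ the vector in $\mathbb{R}^p$ that has the same coordinates as $\delta$ on $J$ and zero coordinates on the complement $J^c$ of $J$.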
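The notation above can be made concrete with a short NumPy sketch. The function names (`support`, `num_nonzero`, `lq_norm`, `restrict`, `is_sparse`) are hypothetical helpers introduced here for illustration only; they are not part of the paper.

```python
import numpy as np

def support(delta):
    """Index set of the nonzero coordinates of delta."""
    return np.flatnonzero(delta != 0)

def num_nonzero(delta):
    """Number of nonzero coordinates (cardinality of the support)."""
    return int(support(delta).size)

def lq_norm(delta, q):
    """l_q norm of a vector; q may be 1, 2, np.inf, and so on."""
    return float(np.linalg.norm(delta, ord=q))

def restrict(delta, J):
    """Vector with the same coordinates as delta on J and zeros elsewhere."""
    out = np.zeros_like(delta)
    idx = np.asarray(list(J), dtype=int)
    out[idx] = delta[idx]
    return out

def is_sparse(delta, s):
    """True if delta has at most s nonzero entries."""
    return num_nonzero(delta) <= s

# Small illustrative vector (values chosen arbitrarily).
delta = np.array([0.0, 3.0, 0.0, -4.0, 0.0])
delta_J = restrict(delta, [1])
```

Here `delta` has two nonzero entries, so it is 2-sparse but not 1-sparse, and `restrict(delta, [1])` keeps only the coordinate with index 1.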
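For linear regression model (1), regularized estimation with the $\ell_1$-norm penalty, also known as the Lasso or the basis pursuit, refers to the following convex optimization problem: $\hat{\beta}_L = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \frac{1}{n}\|y - X\beta\|_2^2 + 2\lambda\|\beta\|_1 \right\}$ (5), where $\lambda > 0$ is a penalization parameter. The Dantzig selector was introduced by Candes and Tao as $\hat{\beta}_D = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \|\beta\|_1 : \left\|\frac{1}{n}X^T(y - X\beta)\right\|_\infty \le \lambda \right\}$ (6), where $\lambda > 0$ is a tuning parameter. The Dantzig selector can be recast as a linear program and is therefore also computationally tractable.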
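The linear-programming form of the Dantzig selector can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the constraint is scaled as $\|X^T(y - X\beta)\|_\infty / n \le \lambda$, splits $\beta = u - v$ with $u, v \ge 0$, and relies on `scipy.optimize.linprog`. The function name `dantzig_selector`, the parameter `lam`, and the simulated data are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, lam):
    """Minimise ||b||_1 subject to ||X^T (y - X b)||_inf / n <= lam.

    Writing b = u - v with u, v >= 0 turns the problem into a linear program
    in the 2p variables (u, v) with objective sum(u) + sum(v).
    """
    n, p = X.shape
    G = X.T @ X / n          # Gram matrix
    c = X.T @ y / n          # scaled correlations with the response
    cost = np.ones(2 * p)
    #  c - G(u - v) <= lam   and   -(c - G(u - v)) <= lam
    A_ub = np.block([[-G, G], [G, -G]])
    b_ub = np.concatenate([lam - c, lam + c])
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:p] - res.x[p:]

# Simulated sparse problem with p > n (all values are illustrative).
rng = np.random.default_rng(0)
n, p = 30, 60
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + 0.1 * rng.standard_normal(n)

lam = 0.1
beta_hat = dantzig_selector(X, y, lam)
max_corr = np.max(np.abs(X.T @ (y - X @ beta_hat))) / n
```

The quantity `max_corr` is the left-hand side of the Dantzig constraint evaluated at the solution, so it should not exceed `lam` beyond solver tolerance.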
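For an integer $1 \le s \le p$ and an $s$-sparse vector $\beta$, let $\beta_J$ be the subvector of $\beta$ confined to an index set $J$. One of the common properties of the Lasso and the Dantzig selector is that, for an appropriately chosen $\lambda$ and the vector $\delta = \hat{\beta} - \beta$, where $\hat{\beta}$ is the solution from either the Lasso or the Dantzig selector, the following holds with high probability (cf. Lemmas 11 and 12): $\|\delta_{J_0^c}\|_1 \le c_0 \|\delta_{J_0}\|_1$ (7), with $c_0 = 1$ for the Dantzig selector by Candes and Tao and with $c_0 = 3$ for the Lasso by Bickel et al., where $J_0 = J(\beta)$ is the set of nonzero coefficients of the true parameter $\beta$ of the model and $J_0^c$ is its complement.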
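To see the cone-type property numerically, one can solve the Lasso on simulated data and compare the $\ell_1$ mass of the error off and on the true support. The sketch below is illustrative only: it assumes the objective $\frac{1}{n}\|y - X b\|_2^2 + 2\lambda\|b\|_1$, uses plain proximal gradient descent (ISTA) rather than any solver named in the paper, and the names `lasso_ista`, `cone_ratio`, and `lam` are hypothetical. The ratio is only reported, since the property holds with high probability for a suitably chosen parameter and not for every sample.

```python
import numpy as np

def soft_threshold(z, t):
    """Coordinatewise soft-thresholding, the proximal map of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_objective(X, y, b, lam):
    n = X.shape[0]
    return float(np.sum((y - X @ b) ** 2) / n + 2.0 * lam * np.sum(np.abs(b)))

def lasso_ista(X, y, lam, n_iter=2000):
    """Approximate Lasso solution by proximal gradient steps of size 1/L."""
    n, p = X.shape
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n   # Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (y - X @ b) / n
        b = soft_threshold(b - grad / L, 2.0 * lam / L)
    return b

def cone_ratio(b_hat, beta):
    """l1 norm of the error off the true support divided by that on it."""
    delta = b_hat - beta
    on_support = beta != 0
    on = np.sum(np.abs(delta[on_support]))
    off = np.sum(np.abs(delta[~on_support]))
    return float(off / on) if on > 0 else float("inf")

# Simulated sparse problem with p > n (all values are illustrative).
rng = np.random.default_rng(0)
n, p = 40, 80
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + 0.1 * rng.standard_normal(n)

lam = 0.2
b_hat = lasso_ista(X, y, lam)
ratio = cone_ratio(b_hat, beta)
```

A value of `ratio` below the constant in the cone condition indicates that the error vector lies in the restricted set on this particular draw.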
Finally, for any $n \ge 1$, $p \ge 2$, we consider the Gram matrix $\Psi_n = \frac{1}{n} X^T X$, where $X$ is the design matrix in model (1) and $X^T$ denotes the transpose of $X$.
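As a quick numerical illustration of the Gram matrix, the sketch below forms $X^T X / n$ for a simulated design with more columns than rows and inspects its rank and smallest eigenvalue. The scaling by $1/n$, the helper name `gram_matrix`, and the Gaussian design are assumptions of this example.

```python
import numpy as np

def gram_matrix(X):
    """Gram matrix X^T X / n of an n-by-p design matrix."""
    n = X.shape[0]
    return X.T @ X / n

rng = np.random.default_rng(1)
n, p = 20, 50                      # more variables than observations
X = rng.standard_normal((n, p))

Psi = gram_matrix(X)
rank = int(np.linalg.matrix_rank(Psi))
min_eig = float(np.linalg.eigvalsh(Psi)[0])   # eigvalsh returns ascending order
```

Because the rank of a $p \times p$ Gram matrix cannot exceed $n$, its smallest eigenvalue is zero up to rounding whenever $p > n$, which is why conditions restricted to a subset of vectors are needed in this regime.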
3. Discussion of the Assumption
Under the sparsity scenario, we are typically interested in the case where $p > n$, and even $p \gg n$. Here, sparsity specifies that the high-dimensional vector $\beta$ has coefficients that are mostly 0. Clearly, the Gram matrix $X^T X / n$ is then degenerate, and ordinary least squares does not work in this case, since it requires positive definiteness of $X^T X / n$. That is, $\min_{\delta \in \mathbb{R}^p : \delta \neq 0} \frac{\|X\delta\|_2}{\sqrt{n}\,\|\delta\|_2} > 0$ (10). It turns out that the Lasso and the Dantzig selector require much weaker assumptions. The idea of Bickel et al. is that the minimum in (10) be replaced by the minimum over a restricted set of vectors and the norm $\|\delta\|_2$ in the denominator of the condition be replaced by the $\ell_2$-norm of only a part of $\delta$. Note that the role of (7) is to restrict the set of vectors to
Assumption 1 (RE (Bickel et al. )). For some integer such that and a positive number , the following condition holds:
Bickel et al.  showed that the bounds of estimation error and prediction error are and , respectively, for both the Lasso and Dantzig selector, where is a positive constant and is the sparsity level. Next, we describe the RE assumption presented by Wang and Su , which is obtained by replacing by its upper bound in (10).
Assumption 2 (RE (Wang and Su )). For some integer such that and a positive number , the following condition holds:
The two conditions are very similar. The only difference is the $\ell_2$- versus $\ell_1$-norm of a part of $\delta$ in the denominator. The RE condition is equivalent to RE; see [7, 13] for the discussion on equivalence. The bounds for estimation and prediction in [7, 13] are more precise than those derived in Bickel et al. and do not depend on the sparsity level $s$.
To obtain our regularity condition, we decompose $J_0^c$ into disjoint subsets $J_1, J_2, \ldots, J_K$, such that $J_1$ corresponds to the locations of the $s$ largest coefficients of $\delta_{J_0^c}$ in absolute value, $J_2$ corresponds to the locations of the next $s$ largest coefficients of $\delta_{J_0^c}$ in absolute value, and so on. Hence, we have $J_0^c = \bigcup_{k=1}^{K} J_k$, where $K \ge 1$, $|J_k| = s$ for all $k = 1, \ldots, K - 1$, and $|J_K| \le s$.
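The block decomposition described above can be sketched in a few lines. This is an illustration under stated assumptions: the block size is passed explicitly as `block_size`, indices are zero-based, and the function name `decompose_by_magnitude` and the example vector are hypothetical.

```python
import numpy as np

def decompose_by_magnitude(delta, J0, block_size):
    """Split the complement of J0 into blocks ordered by decreasing |delta_j|.

    The first block holds the indices of the block_size largest entries (in
    absolute value) outside J0, the second block the next block_size largest,
    and so on; only the last block may be smaller.
    """
    p = delta.size
    J0 = np.asarray(sorted(J0), dtype=int)
    rest = np.setdiff1d(np.arange(p), J0)
    order = rest[np.argsort(-np.abs(delta[rest]), kind="stable")]
    return [order[i:i + block_size] for i in range(0, order.size, block_size)]

# Illustrative vector; index 0 plays the role of the support set.
delta = np.array([5.0, 0.1, -3.0, 0.5, 2.0, -0.2, 1.0])
blocks = decompose_by_magnitude(delta, J0=[0], block_size=2)

# Largest entry of each block versus the average magnitude of the block before it.
checks = [
    np.max(np.abs(delta[blocks[k + 1]])) <= np.sum(np.abs(delta[blocks[k]])) / 2
    for k in range(len(blocks) - 1)
]
```

For this vector the blocks are `[2, 4]`, `[6, 3]`, and `[5, 1]`. A standard consequence of the ordering, checked in `checks`, is that every entry of a block is at most the average absolute value of the entries in the preceding full block.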
Now for each , we have where vector represents the largest entry in absolute value in the vector, and hence
Replacing by in (13), we get the following assumption.
Assumption 3 (LR). For some integer such that and a positive number , the following condition holds:
The inequality immediately implies that the assumption LR is weaker than the assumptions RE and RE. The norm in the denominator of (16) makes the proofs more complicated, so for simplicity we work with an equivalent form of LR, obtained in the same way as the equivalence discussed in [7, 13].
Assumption 4 (LR). For some integer such that and a positive number , the following condition holds:
Either of the two conditions above can be used to derive sparse recovery results in high-dimensional regression. For technical reasons, we give the results only when LR is satisfied.
4. Main Results of Sparse Recovery for Regression Model
To provide performance guarantees for the $\ell_1$-norm penalty applied to sparse linear models, it is sufficient to assume that the regularity conditions are satisfied. In this section, we state our main results when LR is satisfied. For convenience, we assume that all the diagonal elements of the Gram matrix $X^T X / n$ are equal to 1.
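The normalization of the diagonal can be obtained by rescaling the columns of the design matrix. The sketch below assumes the Gram matrix is $X^T X / n$, so each column is scaled to Euclidean norm $\sqrt{n}$; the helper name `normalize_columns` and the simulated design are assumptions of the example.

```python
import numpy as np

def normalize_columns(X):
    """Rescale columns so that every diagonal entry of X^T X / n equals 1."""
    n = X.shape[0]
    norms = np.linalg.norm(X, axis=0)
    if np.any(norms == 0):
        raise ValueError("cannot normalise a zero column")
    return X * (np.sqrt(n) / norms)

rng = np.random.default_rng(2)
n, p = 25, 40
X = rng.standard_normal((n, p)) * rng.uniform(0.5, 3.0, size=p)  # uneven column scales

Xn = normalize_columns(X)
diag = np.diag(Xn.T @ Xn / n)
```

When the entries of `X` are independent standard Gaussians, the rescaled columns divided by $\sqrt{n}$ are independent and uniformly distributed on the unit sphere, which is the kind of random design mentioned in the abstract.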
We first prove a type of approximate equivalence between the Lasso and the Dantzig selector. Similar results on equivalence can be found in [7, 10, 13]. It is expressed as closeness of the prediction losses of the two estimators when the number of nonzero components of the Lasso or the Dantzig selector is small compared to the sample size.
Theorem 5. For linear model (1), let be independent random variables with . Consider the Lasso estimator and Dantzig estimator defined by (5) and (6) with the same . If LR is satisfied, where , then, with probability of at least , one has
Next, we derive bounds on the rate of convergence of the Lasso and the Dantzig selector.
Theorem 6. For linear model (1), let be independent random variables with . Consider the Lasso estimator defined by (5) with . If LR is satisfied, where , then, with probability of at least , one has where .
Theorem 7. For linear model (1), let be independent random variables with . Consider the Dantzig selector defined by (6) with . If LR is satisfied, where , then, with probability of at least , one has where .
Remark 8. We have no conditions on the parameter . As in , we can rewrite in terms of another parameter in order to clarify the notation:
Then, the results of Theorems 5–7 are as follows:
The results of Theorems 7.1 and 7.2 in Bickel et al.  are
Comparing the results above, our bounds improve on those in Bickel et al.
Additionally, the similar results for Lasso can be found in Wang and Su . They are
Our results are thus more precise than the existing ones, for example, those in [7, 10].
Remark 9. The assumptions LR and LR are weaker than assumptions RE and RE, since . Note that the inequality holds under the setting discussed in Section 3. That is, our assumptions are weaker only under that setting; they should not be regarded as weaker than those in previous papers in every situation.
5. Lemmas and the Proofs of the Results
In this section, we give three lemmas and the proofs of the theorems.
Lemma 10. Let be independent random variables with . Then, for any ,
Proof. Since , it immediately follows that where .
Lemma 11. Let be independent random variables with . Let be the Lasso estimator defined by (5). Then, with probability of at least , one has, simultaneously for all and ,
Proof. By the definition of ,
for all , which is equivalent to
From Lemma 10, we have that
holds with probability of at least .
Adding the term to both sides of this inequality, it yields that Now, note that since . So, we get that
To prove (32), it suffices to note that, from Lemma 10 and , we have that Then
Lemma 12. Let satisfy the Dantzig constraint and set , . Then Further, let the assumptions of Lemma 11 be satisfied. Then, with probability of at least , one has, for ,
Proof of Theorem 5. Set . We start the calculation by simple matrix equality:
where the last inequality holds with probability of at least from (43).
By assumption LR and (42), we get that From (32), a nearly identical argument yields that
This theorem follows from (46) and (49).
Proof of Theorem 6. Set . Using (31) with probability of at least ,
From (48), we have
By assumption LR, we obtain that
where . Thus,
From (50), we have that
Inequalities (55) and (57) coincide with (19) and (21), respectively.
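Finally, to prove (20), we decompose $J_0^c$ into disjoint subsets $J_1, J_2, \ldots, J_K$, such that $J_1$ corresponds to the locations of the $s$ largest coefficients of $\delta_{J_0^c}$ in absolute value, $J_2$ corresponds to the locations of the next $s$ largest coefficients of $\delta_{J_0^c}$ in absolute value, and so on. Hence, we have $J_0^c = \bigcup_{k=1}^{K} J_k$, where $K \ge 1$, $|J_k| = s$ for all $k = 1, \ldots, K - 1$, and $|J_K| \le s$.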
It immediately follows that On the other hand, from (54), we have that Therefore, and the theorem follows.
Proof of Theorem 7. Set . Using (42) and (43), with probability of at least , we have that From assumption LR, we get that where . This and (61) yield that The first inequality in (63) implies (24). Next, (22) is straightforward in view of the second inequality in (63) and of relation (42). The proof of (23) follows from (20) in Theorem 6. From (22) and (58), we get that Then where the second inequality holds from the second inequality in (63) and the inequality .
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
E. Gautier and A. B. Tsybakov, “High-dimensional instrumental variables regression and confidence sets,” Working Paper, 2011.
S. van de Geer, The Deterministic Lasso, Seminar für Statistik, Eidgenössische Technische Hochschule (ETH), Zürich, Switzerland, 2007.
S. Q. Wang and L. M. Su, “Simultaneous Lasso and Dantzig selector in high dimensional nonparametric regression,” International Journal of Applied Mathematics and Statistics, vol. 42, no. 12, pp. 103–118, 2013.
S. Q. Wang and L. M. Su, “New bounds of mutual incoherence property on sparse signals recovery,” International Journal of Applied Mathematics and Statistics, vol. 47, no. 17, pp. 462–477, 2013.