Abstract

During the last few years, a great deal of attention has been focused on the Lasso and the Dantzig selector in high-dimensional linear regression, where the number of variables can be much larger than the sample size. Under a sparsity scenario, many authors (see, e.g., Bickel et al., 2009; Bunea et al., 2007; Candès and Tao, 2007; Donoho et al., 2006; Koltchinskii, 2009; Meinshausen and Yu, 2009; Rosenbaum and Tsybakov, 2010; Tsybakov, 2006; van de Geer, 2008; and Zhang and Huang, 2008) have discussed the relations between the Lasso and the Dantzig selector and derived sparsity oracle inequalities for the prediction risk as well as bounds on the estimation loss. In this paper, we point out that some of these authors overemphasize the role of a certain sparsity condition, and that assumptions built on this condition may lead to unnecessarily loose bounds. We give alternative assumptions and arguments that use this sparsity condition as little as possible. In comparison with the results of Bickel et al. (2009), more precise oracle inequalities for the prediction risk and sharper bounds on the estimation loss are derived when the number of variables can be much larger than the sample size.

1. Introduction

During the last few years, a great deal of attention has been focused on the $\ell_1$-penalized least squares (Lasso) estimator of parameters in high-dimensional linear regression, where the number of variables can be much larger than the sample size (see, e.g., [1–12]). Quite recently, Candès and Tao [13] proposed the Dantzig selector for such linear models, and other authors [1, 6, 14–22] have studied this estimator and established its properties under a sparsity scenario, that is, when the number of nonzero components of the true vector of parameters is small.

Lasso estimators have also been studied in the nonparametric regression setup (see [23–26]). In particular, Bunea et al. [23, 24] obtain sparsity oracle inequalities for the prediction loss in this context and point out the implications for minimax estimation in classical nonparametric regression settings, as well as for the problem of aggregation of estimators. Modified versions of Lasso estimators (nonquadratic terms and/or penalties slightly different from the $\ell_1$ penalty) for nonparametric regression with random design are suggested and studied under prediction loss in Koltchinskii [27] and van de Geer [28]. Sparsity oracle inequalities for the Dantzig selector with random design are obtained by Koltchinskii [29]. In linear fixed-design regression, Meinshausen and Yu [7] establish a bound on the $\ell_2$ loss for the coefficients of the Lasso that is quite different from the bound on the same loss for the Dantzig selector proved in Candès and Tao [13]. Bickel et al. [15] show that, under a sparsity scenario, the Lasso and the Dantzig selector exhibit similar behavior, both for the linear regression and for the nonparametric regression model, for the prediction loss and for the $\ell_p$ loss in the coefficients with $1 \le p \le 2$. In the nonparametric regression model, they prove sparsity oracle inequalities for the Lasso and the Dantzig selector. Moreover, the Lasso and the Dantzig selector are shown to be approximately equivalent in terms of the prediction loss. They develop geometrical assumptions that are considerably weaker than those of Candès and Tao [13] for the Dantzig selector and of Bunea et al. [23] for the Lasso.

We give assumptions equivalent to those of Bickel et al. [15] and derive oracle inequalities for the prediction risk in the general nonparametric regression model, as well as bounds on the estimation loss in the linear model, that are more precise than those of Bickel et al. [15] when the number of variables can be much larger than the sample size. We begin, in the next section, by defining the Lasso and Dantzig procedures and the notation. In Section 3, we present our three key assumptions and discuss their relations with the assumptions of Bickel et al. [15]. In Section 4, we give some equivalence results and sparsity oracle inequalities for the Lasso and Dantzig estimators in the general nonparametric regression model, improving the corresponding results of Bickel et al. [15]. Concluding remarks are given in Section 5.

2. Definitions and Notations

Unless stated otherwise, all of our notation, definitions, and terminology follow Bickel et al. [15]. Let $(Z_1, Y_1), \dots, (Z_n, Y_n)$ be a sample of independent random pairs with
$$Y_i = f(Z_i) + W_i, \qquad i = 1, \dots, n,$$
where $f : \mathcal{Z} \to \mathbb{R}$ is an unknown regression function to be estimated, $\mathcal{Z}$ is a Borel subset of $\mathbb{R}^d$, the $Z_i$'s are fixed elements in $\mathcal{Z}$, and the regression errors $W_i$ are Gaussian. Let $\mathcal{F}_M = \{f_1, \dots, f_M\}$ be a finite dictionary of functions $f_j : \mathcal{Z} \to \mathbb{R}$, $j = 1, \dots, M$. We assume throughout that $M \ge 2$.

Consider the $n \times M$ matrix $X = (f_j(Z_i))_{i,j}$, $i = 1, \dots, n$, $j = 1, \dots, M$, and the vectors $y = (Y_1, \dots, Y_n)^T$, $f = (f(Z_1), \dots, f(Z_n))^T$, and $w = (W_1, \dots, W_n)^T$. With this notation, we will write $|x|_p$ for the $\ell_p$ norm of $x \in \mathbb{R}^M$, $1 \le p \le \infty$. The notation $\|g\|_n = \big( \frac{1}{n}\sum_{i=1}^n g^2(Z_i) \big)^{1/2}$ stands for the empirical norm of any $g : \mathcal{Z} \to \mathbb{R}$. We suppose that $\|f_j\|_n \neq 0$, $j = 1, \dots, M$. Set

For any $\beta = (\beta_1, \dots, \beta_M) \in \mathbb{R}^M$ and $z \in \mathcal{Z}$, define $f_\beta(z) = \sum_{j=1}^M \beta_j f_j(z)$. The estimators we consider are all of the form $f_{\hat\beta}(\cdot)$, where $\hat\beta$ is data determined. Since we consider mainly sparse vectors $\beta$, it will be convenient to define the following. Let
$$M(\beta) = \sum_{j=1}^M I\{\beta_j \neq 0\} = |J(\beta)|$$
denote the number of nonzero coordinates of $\beta$, where $I\{\cdot\}$ denotes the indicator function, $J(\beta) = \{j \in \{1, \dots, M\} : \beta_j \neq 0\}$, and $|J|$ denotes the cardinality of a set $J$. For a vector $\delta \in \mathbb{R}^M$ and a subset $J \subseteq \{1, \dots, M\}$, we denote by $\delta_J$ the vector in $\mathbb{R}^M$ that has the same coordinates as $\delta$ on $J$ and zero coordinates on the complement $J^c$ of $J$.
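For concreteness, here is a minimal sketch of these sparsity-related quantities, assuming vectors are stored as NumPy arrays; the helper names (sparsity, support, restrict_to) are ours and do not come from [15].

```python
import numpy as np

def sparsity(beta):
    """M(beta): the number of nonzero coordinates of beta."""
    return int(np.count_nonzero(beta))

def support(beta):
    """J(beta): the indices of the nonzero coordinates of beta."""
    return set(np.flatnonzero(beta))

def restrict_to(delta, J):
    """delta_J: the vector with the same coordinates as delta on J and zeros on J^c."""
    out = np.zeros_like(delta, dtype=float)
    idx = np.fromiter(J, dtype=int)
    if idx.size:
        out[idx] = delta[idx]
    return out
```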

Define the Lasso solution $\hat\beta_L$ by
$$\hat\beta_L = \arg\min_{\beta \in \mathbb{R}^M} \Big\{ \frac{1}{n} \sum_{i=1}^n \big(Y_i - f_\beta(Z_i)\big)^2 + 2r \sum_{j=1}^M \|f_j\|_n |\beta_j| \Big\}, \qquad (6)$$
where $r > 0$ is some tuning constant, and introduce the corresponding Lasso estimator
$$\hat f_L = f_{\hat\beta_L}. \qquad (7)$$
The Dantzig selector $\hat\beta_D$ is defined by
$$\hat\beta_D = \arg\min \Big\{ |\beta|_1 : \Big| \frac{1}{n} D^{-1/2} X^T (y - X\beta) \Big|_\infty \le r \Big\}, \qquad (8)$$
where $D$ is the diagonal matrix
$$D = \mathrm{diag}\{\|f_1\|_n^2, \dots, \|f_M\|_n^2\}. \qquad (9)$$
The Dantzig estimator is defined by
$$\hat f_D = f_{\hat\beta_D}, \qquad (10)$$
where $\hat\beta_D$ is the Dantzig selector.
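The two procedures can be sketched numerically as follows. This is only an illustrative sketch under simplifying assumptions (it is not the implementation of any of the cited papers): we take the dictionary already normalized so that $\|f_j\|_n = 1$ for all $j$, so that $D$ is the identity; we use sklearn for the Lasso, whose alpha parameter plays the role of the tuning constant $r$; and we cast the Dantzig selector as a linear program solved with scipy.

```python
import numpy as np
from scipy.optimize import linprog
from sklearn.linear_model import Lasso

def lasso_estimate(X, y, r):
    """Lasso coefficients for a column-normalized design; sklearn minimizes
    (1/(2n))|y - X b|_2^2 + alpha |b|_1, so alpha plays the role of r."""
    return Lasso(alpha=r, fit_intercept=False, max_iter=50000).fit(X, y).coef_

def dantzig_estimate(X, y, r):
    """Dantzig selector: minimize |b|_1 subject to |(1/n) X^T (y - X b)|_inf <= r,
    written as a linear program in (u, v) with b = u - v, u >= 0, v >= 0."""
    n, M = X.shape
    A = X.T @ X / n                      # Gram matrix
    b = X.T @ y / n
    # Constraints:  A(u - v) - b <= r   and   -(A(u - v) - b) <= r
    A_ub = np.vstack([np.hstack([A, -A]), np.hstack([-A, A])])
    b_ub = np.concatenate([r + b, r - b])
    c = np.ones(2 * M)                   # objective: sum(u) + sum(v) = |b|_1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * M), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    u, v = res.x[:M], res.x[M:]
    return u - v
```

With these conventions, lasso_estimate(X, y, r) and dantzig_estimate(X, y, r) return candidate coefficient vectors for a design matrix X of shape (n, M) and response y.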

We refer to Bickel et al. [15] for a detailed discussion of the Dantzig constraint and of the analogous constraint satisfied by the Lasso solution.

Finally, for any integer $m$ with $1 \le m \le M$, we consider the Gram matrix $\Psi_n = \frac{1}{n} X^T X$ and let $\phi_{\max}(m)$ denote the maximal $m$-sparse eigenvalue of $\Psi_n$, that is, the maximum of $\delta^T \Psi_n \delta / |\delta|_2^2$ over all $\delta \neq 0$ with $M(\delta) \le m$.
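A small sketch of these spectral quantities, assuming $\phi_{\max}(m)$ is the maximal $m$-sparse eigenvalue of $\Psi_n$ as described above (the brute-force search over supports is exponential in $m$, so this is only usable for small problems; the function names are ours):

```python
import numpy as np
from itertools import combinations

def gram_matrix(X):
    """Psi_n = X^T X / n for an (n, M) design matrix X."""
    return X.T @ X / X.shape[0]

def phi_max(X, m):
    """Maximal m-sparse eigenvalue of Psi_n: the largest eigenvalue over all
    m x m principal submatrices of Psi_n (brute force)."""
    Psi = gram_matrix(X)
    M = Psi.shape[0]
    best = 0.0
    for J in combinations(range(M), m):
        idx = np.array(J)
        best = max(best, float(np.linalg.eigvalsh(Psi[np.ix_(idx, idx)]).max()))
    return best
```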

3. Discussion of the Assumptions

Under the sparsity scenario, we are typically interested in the case where $M > n$, and even $M \gg n$. Here, sparsity means that the high-dimensional vector $\beta$ has coefficients that are mostly 0. Clearly, the matrix $\Psi_n$ is degenerate when $M > n$, and ordinary least squares does not work in this case, since it requires positive definiteness of $\Psi_n$, that is,
$$\min_{\delta \in \mathbb{R}^M : \delta \neq 0} \frac{|X\delta|_2}{\sqrt{n}\,|\delta|_2} > 0. \qquad (12)$$
It turns out that the Lasso and the Dantzig selector require much weaker assumptions. The idea of Bickel et al. [15] is that the minimum in (12) be replaced by the minimum over a restricted set of vectors, and that the norm $|\delta|_2$ in the denominator of the condition be replaced by the $\ell_2$ norm of only a part of $\delta$. This is feasible because, for the linear regression model, the residuals $\delta = \hat\beta_D - \beta^*$ and $\delta = \hat\beta_L - \beta^*$ satisfy
$$|\delta_{J_0^c}|_1 \le c_0 |\delta_{J_0}|_1, \qquad (13)$$
with $c_0 = 1$ by Candès and Tao [13] and $c_0 = 3$ by Bickel et al. [15], respectively, where $\beta^*$ is the true parameter of the model and $J_0 = J(\beta^*)$ is the set of its nonzero coefficients; therefore, positive definiteness is required only along the directions $\delta$ satisfying (13). Thus, we have a kind of "restricted" positive definiteness, which can hold even for $M > n$ if $|J_0|$ is small enough. This results in the following restricted eigenvalue (RE) assumption.

Assumption RE($s$, $c_0$) (Bickel et al. [15]). For some integer $s$ such that $1 \le s \le M$ and a positive number $c_0$, the following condition holds:
$$\kappa(s, c_0) := \min_{J_0 \subseteq \{1, \dots, M\},\, |J_0| \le s} \;\; \min_{\delta \neq 0,\, |\delta_{J_0^c}|_1 \le c_0 |\delta_{J_0}|_1} \frac{|X\delta|_2}{\sqrt{n}\,|\delta_{J_0}|_2} > 0.$$
The purpose of this assumption may be to facilitate the use of (13), since Bickel et al. [15] frequently use it in the proofs of their theorems, and so do Candès and Tao [13].
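The constant $\kappa(s, c_0)$ cannot be computed exactly for large $M$, since the minimization over supports $J_0$ is combinatorial. A crude random search over the cone defined by (13) yields an upper bound on $\kappa(s, c_0)$ and can serve as a sanity check for a given design; this is our own illustrative sketch (the function name re_upper_bound is ours), not a procedure from [15].

```python
import numpy as np

def re_upper_bound(X, s, c0, n_draws=20000, seed=0):
    """Random search over the cone |delta_{J0^c}|_1 <= c0 |delta_{J0}|_1.
    Returns an upper bound on kappa(s, c0) = min |X delta|_2 / (sqrt(n) |delta_{J0}|_2)."""
    rng = np.random.default_rng(seed)
    n, M = X.shape
    best = np.inf
    for _ in range(n_draws):
        J0 = rng.choice(M, size=s, replace=False)
        delta = np.zeros(M)
        delta[J0] = rng.standard_normal(s)                       # free part on J0
        if M > s:
            off = np.setdiff1d(np.arange(M), J0)
            u = rng.standard_normal(M - s)
            u *= c0 * np.abs(delta[J0]).sum() / np.abs(u).sum()  # cone constraint (13) tight
            delta[off] = u
        val = np.linalg.norm(X @ delta) / (np.sqrt(n) * np.linalg.norm(delta[J0]))
        best = min(best, val)
    return best
```

Since the search visits only finitely many points of the cone, the returned value only upper-bounds $\kappa(s, c_0)$: a value close to zero suggests the assumption is questionable for that design, while a large value is not by itself a guarantee.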

Note that the role of (13) is only to restrict the set of vectors $\delta$ over which the minimum is taken; that is, (13) restricts $\delta$ to the cone $\{\delta : |\delta_{J_0^c}|_1 \le c_0 |\delta_{J_0}|_1\}$. Therefore, it is not necessary that the norm $|\delta|_2$ in the denominator of (12) be replaced by the $\ell_2$ norm of only a part of $\delta$. We give the following assumptions.

Assumption RE. For some integer such that and a positive number , the following condition holds:

Assumption RE. For some integer such that and a positive number , the following condition holds:

Assumption RE. For some integer such that and a positive number , the following condition holds:

Note that and since and . Moreover, it is easy to see that, for fixed $s$ and $c_0$, the four assumptions are equivalent, and Assumptions 1–5 of Bickel et al. [15] are all sufficient conditions for assumptions RE, RE, and RE.

In Section 4, we will see that RE and RE are better than assumption RE of Bickel et al. [15], since they use (13) as little as possible. Therefore, the inequalities we obtain are more precise.

4. Comparisons with the Results by Bickel et al.

In the following, we give a bound on the prediction losses when the number of nonzero components of the Lasso or the Dantzig selector is small compared with the sample size.

Theorem 1. Let $W_1, \dots, W_n$ be independent $\mathcal{N}(0, \sigma^2)$ random variables with $\sigma^2 > 0$. Fix , . Let assumption RE or RE be satisfied with , where , and let , . Consider the Lasso estimator $\hat f_L$ defined by (6)-(7) with where , and consider the Dantzig estimator $\hat f_D$ defined by (10) with the same tuning constant $r$. If , then, with probability at least , one has
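For orientation, in the corresponding statements of Bickel et al. [15] the tuning constant is of the order $\sigma \sqrt{(\log M)/n}$; a sketch of that choice (the exact lower bound required on the numerical constant $A$ is omitted here) is
$$r = A \, \sigma \, \sqrt{\frac{\log M}{n}}, \qquad A \ \text{a sufficiently large numerical constant}.$$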

Proof. Set . We apply by Bickel et al. [15] with , which yields that, with probability at least , where . From by Bickel et al. [15], we have Then,

Corollary 2. Let the conditions of Theorem 1 hold, but with RE in place of RE. If , then, with probability at least , one has

This corollary greatly improves the corresponding theorem of Bickel et al. [15]. The right-hand side of the inequality in that theorem is

A general discussion of sparsity oracle inequalities can be found in Tsybakov [30]. Here, we prove a sparsity oracle inequality for the prediction loss of the Lasso estimators. Such inequalities have recently been obtained for Lasso-type estimators in a number of settings; see [15, 23, 24, 27, 28].

Theorem 3. Let $W_1, \dots, W_n$ be independent $\mathcal{N}(0, \sigma^2)$ random variables with $\sigma^2 > 0$. Fix integers , , . Let assumption RE or RE be satisfied, where . Consider the Lasso estimator $\hat f_L$ defined by (6)-(7) with for some . Then, with probability at least , one has

Proof. Fix an arbitrary with . Set , , where . On the event defined in Bickel et al. [15, page 1723], we get, from the first line of the corresponding display in Bickel et al. [15], that Since then
From (8) and by Bickel et al. [15], we have Thus, From (28) and (32), we have

Corollary 4. Let the conditions of Theorem 3 hold, but with RE in place of RE. Then, with probability at least , one has

This corollary greatly improves the corresponding theorem of Bickel et al. [15]. The right-hand side of the inequality in that theorem is where .

In the following, we assume that the vector of observations $y = (Y_1, \dots, Y_n)^T$ is of the form
$$y = X\beta^* + w, \qquad (36)$$
where $X$ is an $n \times M$ deterministic matrix, $\beta^* \in \mathbb{R}^M$, and $w = (W_1, \dots, W_n)^T$. We consider dimension $M$ that can be of order $n$ and even much larger. Then $\beta^*$ is, in general, not uniquely defined. For $M > n$, if (36) is satisfied for $\beta^* = \beta_0$, then there exists an affine space $\{\beta^* : X\beta^* = X\beta_0\}$ of vectors satisfying (36). The Lasso estimator of $\beta^*$ in (36) is defined by
$$\hat\beta_L = \arg\min_{\beta \in \mathbb{R}^M} \Big\{ \frac{1}{n} |y - X\beta|_2^2 + 2r |\beta|_1 \Big\}. \qquad (37)$$
The correspondence between the notation here and that of the previous sections is
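As an illustration of this linear model (our own simulation sketch, not taken from [15]), the following generates a sparse $\beta^*$ with $M \gg n$, normalizes the columns of $X$ so that the diagonal elements of $\frac{1}{n}X^T X$ equal 1, and fits the Lasso with a tuning of order $\sigma\sqrt{(\log M)/n}$. sklearn minimizes $\frac{1}{2n}|y - X\beta|_2^2 + \alpha|\beta|_1$, so its alpha corresponds to $r$ in (37); the chosen numerical constant 2.0 is arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, M, s, sigma = 100, 500, 5, 1.0            # M >> n, M(beta*) = s

X = rng.standard_normal((n, M))
X /= np.sqrt((X ** 2).mean(axis=0))           # diagonal of X^T X / n equal to 1
beta_star = np.zeros(M)
beta_star[rng.choice(M, size=s, replace=False)] = rng.standard_normal(s)
y = X @ beta_star + sigma * rng.standard_normal(n)

r = 2.0 * sigma * np.sqrt(np.log(M) / n)      # tuning of order sigma * sqrt(log M / n)
beta_hat = Lasso(alpha=r, fit_intercept=False, max_iter=50000).fit(X, y).coef_

print("M(beta_hat)                 :", int(np.count_nonzero(beta_hat)))
print("|beta_hat - beta*|_1        :", float(np.abs(beta_hat - beta_star).sum()))
print("|X(beta_hat - beta*)|_2^2/n :", float(np.sum((X @ (beta_hat - beta_star)) ** 2) / n))
```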

Theorem 5. Let $W_1, \dots, W_n$ be independent $\mathcal{N}(0, \sigma^2)$ random variables with $\sigma^2 > 0$. Let all the diagonal elements of the matrix $\frac{1}{n} X^T X$ be equal to 1, and let $M(\beta^*) \le s$, where , , . Let assumption RE or RE be satisfied, where . Consider the Lasso estimator $\hat\beta_L$ defined by (37) with and . Then, with probability at least , one has

Proof. Set and . Using and by Bickel et al. [15], where we put , , and , we get that, on the event (i.e., with probability at least ), From by Bickel et al. [15], we have Then, By assumption RE or RE, we obtain that, on , Thus, From (43), we have Thus, The inequalities (49) and (47) coincide with (40) and (41), respectively. Next, (42) follows immediately from in Bickel et al. [15] and (41).

Corollary 6. Let the conditions of Theorem 5 hold, but with RE in place of RE. Then, with probability at least , one has

This corollary improves the corresponding theorem of Bickel et al. [15]: the right-hand sides of the inequalities in that theorem are, respectively, 4, 25/9, and 25/9 times as large as those in (50)–(52).

5. Conclusions

We point out that (13), with $c_0 = 1$ by Candès and Tao [13] and $c_0 = 3$ by Bickel et al. [15], is only a sufficient condition used in the analysis of the Lasso and the Dantzig selector. Its role should not be overemphasized. That is, (13) should not be invoked indiscriminately whenever an inequality has to be bounded; we should use it as little as possible when proving inequalities.

In fact, the corresponding bounds are already enlarged by the use of (13) when solving the problems of the Lasso and the Dantzig selector. When proving sparsity oracle inequalities for the prediction loss and bounds on the estimation loss, using (13) again necessarily enlarges the inequalities further and results in reduced accuracy.

We have seen that RE and RE are much better than the original assumption RE, since in RE and RE the use of (13) is kept to a minimum. Therefore, the inequalities given here are more precise.