Abstract

Variable selection is fundamental to high-dimensional statistical modeling. Many variable selection techniques may be implemented by penalized least squares using various penalty functions. In this paper, an arctangent type penalty which very closely resembles the $\ell_0$ penalty is proposed; we call it the Atan penalty. The Atan-penalized least squares procedure is shown to consistently select the correct model, and the resulting estimator is asymptotically normal, provided the number of variables grows more slowly than the number of observations. The Atan procedure is efficiently implemented using an iteratively reweighted Lasso algorithm. Simulation results and a real data example show that the Atan procedure with a BIC-type criterion performs very well in a variety of settings.

1. Introduction

High-dimensional data arise frequently in modern applications in biology, economics, chemometrics, neuroscience, and other scientific fields. To facilitate the analysis, it is often reasonable and useful to assume that only a small number of covariates are relevant for modeling the response variable. Under this sparsity assumption, a widely used approach for analyzing high-dimensional data is regularized or penalized regression. This approach estimates the unknown regression coefficients by solving the penalized regression problem
$$\hat\beta = \arg\min_{\beta} \Bigl\{ \tfrac{1}{2}\lVert y - X\beta \rVert_2^2 + \sum_{j=1}^{p} p_\lambda(|\beta_j|) \Bigr\},$$
where $X$ is the $n \times p$ design matrix, $y$ is an $n$-dimensional response vector, $\beta = (\beta_1, \ldots, \beta_p)^T$ is the vector of unknown regression coefficients, $\lVert \cdot \rVert_2$ denotes the $\ell_2$ norm (Euclidean norm), and $p_\lambda(\cdot)$ is a penalty function which depends on a tuning parameter $\lambda > 0$.

In the above regularization framework, various penalty functions are used to perform variable selection by putting relatively large penalties on small coefficients; examples include best subset selection, $\ell_1$-penalized regression or Lasso [1], Bridge regression [2], SCAD [3], MCP [4], SICA [5], SELO [6], the Dantzig selector [7], and Bayesian variable selection methods [8]. The mainstream methods are the Lasso, the Dantzig selector, and the folded concave penalization [9] such as SCAD and MCP. Best subset selection, namely, the $\ell_0$ penalty, along with traditional model selection criteria such as AIC, BIC, and RIC [10-12], is attractive for variable selection since it directly penalizes the number of nonzero coefficients. However, one drawback of the $\ell_0$-penalized least squares (PLS) procedure is the instability of the resulting estimators [13]. This results from the fact that the $\ell_0$ penalty is not continuous at zero. Another, perhaps more significant, drawback of the $\ell_0$ penalty is that implementing $\ell_0$-PLS procedures is NP-hard and may involve an exhaustive search over all possible models. Thus, implementing these procedures is computationally infeasible when the number of potential predictors is even moderately large, let alone for high-dimensional data.

The Lasso penalized regression is computationally attractive and enjoys great performance in prediction. However, the Lasso may not consistently select the correct model and is not necessarily asymptotically normal [14, 15]; a strong irrepresentable condition is necessary for the Lasso to be selection consistent [15, 16]. The folded concave penalization, unlike the Lasso, does not require the irrepresentable condition to achieve variable selection consistency and can correct the intrinsic estimation bias of the Lasso. Fan and Li [3] first systematically studied nonconvex penalized likelihood for fixed finite dimension $p$. They recommended the SCAD penalty for variable selection, which enjoys the oracle property (a variable selection and estimation procedure is said to have the oracle property if it selects the true model with probability tending to one and if the estimated coefficients are asymptotically normal, with the same asymptotic variance as the least squares estimator based on the true model). Fan and Peng [17] extended these results by allowing $p$ to grow with $n$. Lv and Fan [5] introduced the weak oracle property, which means that the estimator enjoys the same sparsity as the oracle estimator with asymptotic probability one and is consistent, and established regularity conditions under which the PLS estimator given by folded concave penalties has a nonasymptotic weak oracle property when the dimensionality $p$ can grow nonpolynomially with the sample size $n$. The theoretical properties enjoyed by SELO [6] estimators allow the number of predictors $p$ to tend to infinity along with the number of observations $n$, provided $p$ grows more slowly than $n$. For high-dimensional nonconvex penalized regression with $p > n$, Kim et al. [18] proved that the oracle estimator itself is a local minimum of SCAD-penalized least squares regression under very relaxed conditions; Zhang [4] proposed the MCP and devised a novel PLUS algorithm, which when used together can achieve the oracle property under certain regularity conditions. Recently, Fan et al. [9] have shown that folded concave penalization methods enjoy the strong oracle property for high-dimensional sparse estimation. Important insight has also been gained through recent work on theoretical analysis of the global solution [19-21].

The practical performance of PLS procedures depends heavily on the choice of the tuning parameter. The theoretically optimal tuning parameter does not have an explicit representation and depends on unknown factors such as the variance of the unobserved random noise. Cross-validation is commonly adopted in practice to select the tuning parameter but is observed to often result in overfitting. In the case of fixed $p$, Wang et al. [22] proposed selecting the tuning parameter by minimizing a generalized BIC criterion. Wang et al. [23] extended those results to the setting of a diverging number of parameters. Recently, Dicker et al. [6] proposed a BIC-like tuning parameter selector. Wang et al. [21] extended the work of [24, 25] on BIC for high-dimensional least squares regression; they proposed a high-dimensional BIC for a nonconvex penalized solution path.

In this paper, we propose an arctangent type (Atan) penalty function which very closely approximates the $\ell_0$ penalty. Because the Atan penalty is continuous, the Atan estimator may be more stable than estimators obtained through $\ell_0$-penalized methods. The Atan penalty is a smooth function on $(0, \infty)$, and we exploit this in an iteratively reweighted Lasso algorithm. We formally establish the model selection oracle property enjoyed by the Atan estimator; in particular, the asymptotic normality of the Atan estimator is formally established. Our asymptotic framework allows the number of predictors, $p$, to tend to infinity along with the number of observations, $n$, provided $p$ grows more slowly than $n$. Furthermore, a BIC-like tuning parameter selection procedure is implemented for Atan.

This paper is organized in the following way. In Section 2, we introduce PLS estimators and give a brief overview of existing nonconvex penalty terms, and the Atan penalty is then presented. Then, we discuss some of its theoretical properties in Section 3. In Section 4, we describe a simple and efficient algorithm for obtaining Atan estimator. Simulation studies and an application of the proposed methodology are presented in Section 5. Conclusions are given in Section 6. The proofs are relegated to the Appendix.

2. The Atan-Penalized Least Squares Method

2.1. Linear Model and Penalized Least Squares

Suppose that $\{(x_i, y_i),\ i = 1, \ldots, n\}$ is a random sample from the linear regression model $y_i = x_i^T \beta^* + \varepsilon_i$, or in matrix form
$$y = X\beta^* + \varepsilon,$$
where $X$ is the $n \times p$ design matrix, $y$ is an $n$-dimensional response vector, and $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^T$ is an $n$-dimensional noise vector whose entries are iid random errors with mean 0 and variance $\sigma^2$.

When discussing variable selection, it is convenient to have concise notation. Denote the columns of $X$ by $x_1, \ldots, x_p$ and the rows of $X$ by $x^{(1)}, \ldots, x^{(n)}$. Let $\mathcal{A} = \{j : \beta_j^* \neq 0\}$ be the true model and suppose that $p_0 = |\mathcal{A}|$ is the size of the true model, where $|\mathcal{A}|$ denotes the cardinality of $\mathcal{A}$. In addition, for $S \subseteq \{1, \ldots, p\}$, let $\beta_S$ be the $|S|$-dimensional subvector of $\beta$ containing the entries indexed by $S$, and let $X_S$ be the $n \times |S|$ matrix obtained from $X$ by extracting the columns corresponding to $S$. Given a matrix $M$ and subsets $S_1, S_2$, let $M_{S_1 S_2}$ be the submatrix of $M$ with rows determined by $S_1$ and columns determined by $S_2$.

Various penalty functions have been used in the variable selection literature for the linear regression model. Commonly used penalty functions include the $\ell_0$ and $\ell_1$ penalties, the nonnegative garrotte [26], the elastic net [27, 28], SCAD [3], and MCP [4]. In particular, the $\ell_1$-penalized least squares procedure is called the Lasso. However, Lasso estimates may be biased and inconsistent for model selection [3, 15]. This implies that the Lasso does not have the oracle property. The adaptive Lasso is a weighted version of the Lasso which does have the oracle property [15]. Slightly abusing notation, the adaptive Lasso penalty is defined by $p_\lambda(|\beta_j|) = \lambda w_j |\beta_j|$, where $w_j$ is a data-dependent weight.

Fan and Li [3] proposed a continuously differentiable penalty function called the SCAD penalty, which is defined by
$$p_\lambda(\beta) = \begin{cases} \lambda|\beta|, & |\beta| \le \lambda, \\[2pt] -\dfrac{|\beta|^2 - 2a\lambda|\beta| + \lambda^2}{2(a-1)}, & \lambda < |\beta| \le a\lambda, \\[2pt] \dfrac{(a+1)\lambda^2}{2}, & |\beta| > a\lambda, \end{cases}$$
for some $a > 2$. The authors of [3] suggested using $a = 3.7$ from a Bayesian perspective. The minimax concave penalty (MCP) [4] translates the flat part of the derivative of the SCAD to the origin and is given by
$$p_\lambda(\beta) = \begin{cases} \lambda|\beta| - \dfrac{\beta^2}{2\gamma}, & |\beta| \le \gamma\lambda, \\[2pt] \dfrac{\gamma\lambda^2}{2}, & |\beta| > \gamma\lambda, \end{cases}$$
with $\gamma > 1$, which minimizes the maximum concavity. Zhang [4] proved that the MCP procedure may select the correct model with probability tending to 1 and that the MCP estimator has good properties in terms of $\ell_2$-loss, provided $\lambda$ and $\gamma$ satisfy certain conditions. Zhang's results in fact allow for $p \gg n$.
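To make the two piecewise definitions concrete, the short sketch below evaluates them on a grid of coefficient values; the function names and the MCP default $\gamma = 3$ are illustrative choices, while $a = 3.7$ is the value suggested in [3].

```python
import numpy as np

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001), evaluated elementwise."""
    b = np.abs(beta)
    small = lam * b
    middle = -(b**2 - 2 * a * lam * b + lam**2) / (2 * (a - 1))
    large = (a + 1) * lam**2 / 2
    return np.where(b <= lam, small, np.where(b <= a * lam, middle, large))

def mcp_penalty(beta, lam, gamma=3.0):
    """Minimax concave penalty of Zhang (2010), evaluated elementwise."""
    b = np.abs(beta)
    inner = lam * b - b**2 / (2 * gamma)
    flat = gamma * lam**2 / 2
    return np.where(b <= gamma * lam, inner, flat)

if __name__ == "__main__":
    grid = np.linspace(-3, 3, 7)
    print(scad_penalty(grid, lam=1.0))
    print(mcp_penalty(grid, lam=1.0))
```

Both functions are continuous, equal $\lambda|\beta|$ near the origin, and become constant for large $|\beta|$, which is what removes the estimation bias of the Lasso for large coefficients.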

2.2. Atan Penalty

In this section we first propose a novel nonconvex penalty that we call Atan penalty. We then study its applications in sparse modeling.

The Atan penalty $p_{\lambda,\gamma}(\beta)$ is defined in (5) for $\gamma > 0$ and $\lambda \ge 0$. It is clear that this penalty is concave in $|\beta|$. Moreover, we can establish its relationship with the $\ell_0$ and $\ell_1$ penalties. In particular, we have the following propositions.

Proposition 1. Let $p_{\lambda,\gamma}(\beta)$ be given in (5); then $\lim_{\gamma \to 0^{+}} p_{\lambda,\gamma}(\beta) = \lambda\,\mathbf{1}\{\beta \neq 0\}$ and $\lim_{\gamma \to \infty} p_{\lambda,\gamma}(\beta) = \lambda|\beta|$. The proposition shows that the limits of the Atan penalty as $\gamma \to 0^{+}$ and $\gamma \to \infty$ are the $\ell_0$ penalty and the $\ell_1$ penalty, respectively. The first-order derivative of $p_{\lambda,\gamma}$ with respect to $|\beta|$ is given below. The Atan penalty function is plotted in Figure 1(a), along with the SCAD, Lasso, MCP, and $\ell_0$ penalties. Figure 1(b) depicts the Atan penalty for different values of $\gamma$.
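A minimal sketch of an arctangent-type penalty that has exactly the limits stated in Proposition 1 is the following; this form and its derivative are assumptions used in the algorithmic sketches below, not verbatim quotations of the original displays (5) and (7).

```latex
% Assumed arctangent-type form with the limits of Proposition 1 (a sketch, not the original display)
p_{\lambda,\gamma}(\beta) = \lambda \left(\gamma + \frac{2}{\pi}\right) \arctan\!\left(\frac{|\beta|}{\gamma}\right),
\qquad \gamma > 0,\ \lambda \ge 0 .
% Its limits and first-order derivative with respect to |beta| are
\lim_{\gamma \to 0^{+}} p_{\lambda,\gamma}(\beta) = \lambda\, \mathbf{1}\{\beta \neq 0\},
\qquad
\lim_{\gamma \to \infty} p_{\lambda,\gamma}(\beta) = \lambda\,|\beta|,
\qquad
p_{\lambda,\gamma}'(\beta) = \lambda \left(\gamma + \frac{2}{\pi}\right) \frac{\gamma}{\gamma^{2} + \beta^{2}} .
```

The $\ell_0$ limit follows because $\arctan(|\beta|/\gamma) \to \pi/2$ for any fixed $\beta \neq 0$ as $\gamma \to 0^{+}$, while the $\ell_1$ limit follows from $\arctan(x) \approx x$ for small $x$.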

3. Theoretical Properties

In this section we study the theoretical properties of the Atan estimator proposed in Section 2 in the situation where the number of parameters $p$ tends to infinity with increasing sample size $n$. We discuss some conditions on the penalty and loss functions in Section 3.1. Our main results are presented in Section 3.2.

3.1. Regularity Conditions

We need to place the following regularity conditions on the model and the penalty:
(A) a growth condition limiting how $p$ and $p_0$ may increase with $n$;
(B) a lower bound on the size of the smallest nonzero entry of $\beta^*$;
(C) rate restrictions on the tuning parameters $\lambda_n$ and $\gamma_n$;
(D) there exist constants $0 < b_0 \le b_1 < \infty$ such that $b_0 \le \lambda_{\min}(n^{-1}X^T X) \le \lambda_{\max}(n^{-1}X^T X) \le b_1$, where $\lambda_{\min}(\cdot)$ and $\lambda_{\max}(\cdot)$ denote the smallest and largest eigenvalues, respectively;
(E)-(F) moment conditions on the covariates and errors of the Lindeberg type.

Since $p$ may vary with $n$, it is implicit that $\beta^*$ may vary with $n$. Additionally, we allow the true model $\mathcal{A}$ and the distribution of the data to change with $n$. Condition (A) limits how $p$ and $p_0$ may grow with $n$. This condition is substantially weaker than that required in [17], slightly weaker than that required in [28], and the same as that required in [6]. As mentioned in Section 1, other authors have studied PLS methods in settings where $p > n$; that is, their growth condition on $p$ is weaker than condition (A) [18]. Condition (B) gives a lower bound on the size of the smallest nonzero entry of $\beta^*$. Notice that the smallest nonzero entry of $\beta^*$ is allowed to vanish asymptotically, provided it does not do so too fast. Similar conditions are found in [17]. Condition (C) restricts the rates of the tuning parameters $\lambda_n$ and $\gamma_n$. Note that condition (C) does not constrain the minimum size of $\gamma_n$; indeed, no such constraint is required for our asymptotic results about the Atan estimator. Since the Atan penalty approaches the $\ell_0$ penalty as $\gamma \to 0^{+}$, this suggests that the Atan and $\ell_0$-penalized least squares estimators have similar asymptotic properties. On the other hand, in practice, we have found that one should not take $\gamma$ too small, in order to preserve the stability of the Atan estimator. Condition (D) is an identifiability condition. Conditions (E) and (F) are used to prove asymptotic normality of Atan estimators and are related to the Lindeberg condition of the Lindeberg-Feller central limit theorem. Conditions (A)-(F) imply that Atan has the oracle property and may correctly identify the true model, as we will see in Theorem 4.

3.2. Oracle Properties

Let $Q_n(\beta)$ denote the Atan-penalized least squares objective function in (8), in which $p_{\lambda,\gamma}$ is the Atan penalty function.

Theorem 2. Suppose that conditions (A)-(D) hold; then, for every $\epsilon > 0$, there exists a constant $C_\epsilon > 0$ such that
$$P\Bigl(\inf_{\lVert u \rVert_2 = C_\epsilon} Q_n\bigl(\beta^* + \sqrt{p/n}\, u\bigr) > Q_n(\beta^*)\Bigr) \ge 1 - \epsilon$$
whenever $n$ is sufficiently large. Consequently, there exists a sequence of local minimizers $\hat\beta$ of $Q_n(\beta)$ such that $\lVert \hat\beta - \beta^* \rVert_2 = O_P(\sqrt{p/n})$.

Lemma 3. Assume that conditions (A)-(D) hold and fix $\epsilon > 0$; then, with probability at least $1 - \epsilon$ for all $n$ sufficiently large, the local minimizer $\hat\beta$ from Theorem 2 satisfies $\hat\beta_{\mathcal{A}^c} = 0$, where $\mathcal{A}^c$ is the complement of $\mathcal{A}$ in $\{1, \ldots, p\}$.

Theorem 4 (oracle properties). Suppose that conditions (A)-(F) hold; then there exists a sequence of $\sqrt{n/p}$-consistent local minima $\hat\beta$ of the Atan objective such that (i) $P(\hat\beta_{\mathcal{A}^c} = 0) \to 1$; (ii) $\sqrt{n}\, A_n \Sigma_n^{1/2}\bigl(\hat\beta_{\mathcal{A}} - \beta^*_{\mathcal{A}}\bigr) \to N(0, G)$ in distribution, where $\Sigma_n = \sigma^{-2} n^{-1} X_{\mathcal{A}}^T X_{\mathcal{A}}$, $A_n$ is any arbitrary $q \times p_0$ matrix such that $A_n A_n^T \to G$, and $G$ is a $q \times q$ nonnegative symmetric matrix.

4. Implementation

4.1. Iteratively Reweighted Lasso Algorithm

The computation for the Atan-penalized method is more involved, because the resulting optimization problem is nonconvex and may have multiple local minimizers. Several algorithms have been developed for computing folded concave penalized estimators, such as the local quadratic approximation (LQA) algorithm and the local linear approximation (LLA) algorithm [3, 29]. Both LQA and LLA are related to the majorization-minimization (MM) principle [30]. Recently, the coordinate descent algorithm was applied to solve the folded concave penalized least squares problem [31, 32]. Reference [4] devised the PLUS algorithm for solving the PLS problem with the MCP. Zhang [33] analyzed the capped-$\ell_1$ penalty for solving the PLS problem.

In this section, we present an iteratively reweighted Lasso (IRL) algorithm to solve the Atan-penalized least squares problem (11). We show that the solution of this problem can be obtained as the limit of a sequence of convex weighted Lasso problems, to which existing Lasso algorithms can be efficiently applied.

Now, taking a first-order Taylor-series approximation of the Atan penalty about the current value $\beta^{(k)}$, we obtain the overapproximation (12) of (11). Note that the linear approximation in (12) is analogous to the one proposed by Zou and Li [29], who argued strongly in favor of a one-step approximation. Instead, we offer the following algorithm for Atan-penalized least squares. Assume all the covariates have been standardized to have mean zero and unit variance. Without loss of generality, the regularization parameters span a decreasing grid from $\lambda_{\max}$ down to $\lambda_{\min}$ [34]. Here, we summarize the details of the IRL algorithm in Algorithm 1, where the active set is defined as the set of indices of the currently nonzero coefficients.
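Before giving the algorithm, here is a sketch of what the linearization step produces, written in the generic notation of Section 2 and assuming the arctangent form sketched after Proposition 1; the exact scaling of the penalty term in display (12) may differ.

```latex
% LLA surrogate at the current iterate beta^{(k)} (a sketch; scaling may differ from (12))
Q_n^{(k)}(\beta)
  = \frac{1}{2}\,\bigl\lVert y - X\beta \bigr\rVert_2^{2}
  + n \sum_{j=1}^{p}
      \Bigl[ p_{\lambda,\gamma}\bigl(|\beta_j^{(k)}|\bigr)
           + p_{\lambda,\gamma}'\bigl(|\beta_j^{(k)}|\bigr)\bigl(|\beta_j| - |\beta_j^{(k)}|\bigr) \Bigr] .
```

Minimizing this surrogate over $\beta$ is a weighted Lasso problem with weights $w_j = p_{\lambda,\gamma}'(|\beta_j^{(k)}|)$; because the Atan penalty is concave in $|\beta|$, the surrogate majorizes the objective, which is the MM justification referred to in Remark 5.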

(1)Start with . Set and .
 Outer loop:
(2)Set and ;
(3)Increment and ;
(4) Update the weights $w_j = p_{\lambda,\gamma}'(|\beta_j^{(k)}|)$, for $j = 1, \ldots, p$;
Inner loop:
Solve the Karush-Kuhn-Tucker (KKT) conditions for fixed weights:
$x_j^T(y - X\beta)/n = w_j \operatorname{sign}(\beta_j)$, if $\beta_j \neq 0$,
$|x_j^T(y - X\beta)/n| \le w_j$, if $\beta_j = 0$,
(5) Go to Step (3);
(6) Repeat Steps (3)-(5) until convergence.
The estimate is the limit point of the outer loop;
(7) Decrement the regularization parameter and return to Step (2), using the current estimate as a warm start.

In the situation with $p < n$, we set $\lambda_{\min} = 0$. However, when $p \ge n$, one attains the saturated model long before the regularization parameter reaches zero. In this case, we suggest taking $\lambda_{\min}$ to be a small fraction of $\lambda_{\max}$ [34]. In practice, we have found that if the columns of $X$ are standardized so that $\lVert x_j \rVert_2^2 = n$, for $j = 1, \ldots, p$, then taking $\lambda_{\max} = \max_j |x_j^T y|/n$ works well.
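As a concrete illustration of the outer loop of Algorithm 1, here is a minimal sketch that assumes the arctangent form given after Proposition 1 and solves each weighted Lasso subproblem with scikit-learn via the usual column-rescaling trick; the function names, the least squares initialization, the value of $\gamma$, and the convergence tolerance are illustrative choices rather than the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def atan_deriv(beta, lam, gamma):
    """Derivative of the assumed Atan penalty p(b) = lam*(gamma + 2/pi)*arctan(|b|/gamma)."""
    return lam * (gamma + 2.0 / np.pi) * gamma / (gamma**2 + beta**2)

def weighted_lasso(X, y, weights, eps=1e-8):
    """Solve min (1/(2n))||y - X b||^2 + sum_j w_j |b_j| by column rescaling.

    Rescaling x_j -> x_j / w_j turns the weighted Lasso into a plain Lasso with
    unit penalty; the fitted coefficients are then divided back by w_j.
    """
    w = np.maximum(weights, eps)              # guard against zero weights
    Xs = X / w                                # scale each column by 1/w_j
    fit = Lasso(alpha=1.0, fit_intercept=False, max_iter=10000).fit(Xs, y)
    return fit.coef_ / w

def atan_irl(X, y, lam, gamma=0.01, max_outer=50, tol=1e-6):
    """Iteratively reweighted Lasso for an Atan-type penalty (illustrative sketch)."""
    # Least squares start (for p < n); a ridge or Lasso fit could be used when p >= n.
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(max_outer):
        w = atan_deriv(beta, lam, gamma)      # LLA weights at the current iterate
        beta_new = weighted_lasso(X, y, w)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

In practice one would run atan_irl over a decreasing grid of regularization parameters, warm-starting each fit from the previous solution, exactly as the outer loop of Algorithm 1 does.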

Remark 5. Algorithm 1 is very much like MM algorithms. Hunter and Li [30] were the first to advocate MM algorithms for variable selection when . However, we should notice that Algorithm 1 differs from the algorithm described in [30]. The use of MM algorithms in [30] is to justify a quadratic overapproximation to penalty functions with singularities at the origin. In the inner loop of Algorithm 1, we avoid such approximation by solving the KKT conditions precisely and efficiently [35]. Our use of MM algorithms in Algorithm 1 is to justify the local linear approximation to Atan penalty in the outer loop.

Remark 6. LLA to SCAD penalty were also proposed by Zou and Li [29]. Algorithm 1 differs from that proposed in [29] in that it constructs the entire coefficient path, even when , whereas the procedure in [29] computes the coefficient estimates for fixed regularization parameter starting with a root- consistent estimator of true coefficient at the initial step. Moreover, even with the same initial estimate, the limit point from Algorithm 1 will differ from [29] after one iteration.

Remark 7. In the more general case where , we used coordinate-wise optimization to compute the entire regularized coefficient path via Algorithm 1. Our experience is that coordinate-wise optimization works well in practice and converges very quickly.
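For completeness, a minimal coordinate-wise solver for the weighted Lasso inner problem (the same objective as in the sketch above) looks as follows; it is a generic soft-thresholding update, not the authors' code.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def weighted_lasso_cd(X, y, weights, max_iter=1000, tol=1e-8):
    """Coordinate descent for min (1/(2n))||y - X b||^2 + sum_j w_j |b_j|.

    The column norms ||x_j||^2 / n are tracked explicitly, so the columns of X
    do not need to be standardized for the update to be valid.
    """
    n, p = X.shape
    col_sq = (X**2).sum(axis=0) / n            # (1/n) * ||x_j||^2
    beta = np.zeros(p)
    resid = y - X @ beta
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(p):
            old = beta[j]
            rho = X[:, j] @ resid / n + col_sq[j] * old   # partial-residual correlation
            beta[j] = soft_threshold(rho, weights[j]) / col_sq[j]
            if beta[j] != old:
                resid -= X[:, j] * (beta[j] - old)
                max_change = max(max_change, abs(beta[j] - old))
        if max_change < tol:
            break
    return beta
```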

4.2. Regularity Parameter Selection

Tuning parameter selection is an important issue in most PLS procedures, yet there are relatively few studies on the choice of the penalty parameter. Traditional model selection criteria, such as AIC [10] and BIC [11], suffer from a number of limitations. Their major drawback is that parameter estimation and model selection are treated as two separate processes, which can result in instability [13] and complicated stochastic properties. To overcome this deficiency, Fan and Li [3] proposed the SCAD method, which estimates parameters while simultaneously selecting important variables. They selected the tuning parameter by minimizing GCV [1, 3, 26].

However, it is well known that GCV- and AIC-based methods are not consistent for model selection, in the sense that, as $n \to \infty$, they may select irrelevant predictors with nonvanishing probability [22, 36]. On the other hand, BIC-based tuning parameter selection roughly corresponds to maximizing the posterior probability of selecting the true model in an appropriate Bayesian formulation and has been shown to be consistent for model selection in several settings [22, 37, 38]. The BIC tuning parameter selector of [22] is defined by
$$\mathrm{BIC}_\lambda = \log\Bigl(\frac{\lVert y - X\hat\beta_\lambda \rVert_2^2}{n}\Bigr) + \widehat{\mathrm{df}}_\lambda \,\frac{\log n}{n},$$
where $\widehat{\mathrm{df}}_\lambda$ is the estimate of the degrees of freedom given by $\widehat{\mathrm{df}}_\lambda = \operatorname{tr}\{X(X^T X + n\Sigma_\lambda)^{-1}X^T\}$ and $\Sigma_\lambda$ is a diagonal matrix. The diagonal elements of $\Sigma_\lambda$ are the coefficients of the quadratic terms in the local quadratic approximation to the SCAD penalty function [3].

Dicker et al. [6] proposed BIC-like procedures implemented by minimizing the criterion in (15). To estimate the residual variance, they use $\hat\sigma^2 = \lVert y - X\hat\beta_{\mathrm{OLS}} \rVert_2^2/(n - p)$, based on the full-model least squares fit. This differs from other estimates of the residual variance used in PLS methods, where the denominator $n - p$ is replaced by $n$ [22]; here, $n - p$ is used to account for the degrees of freedom lost to estimation.

Further work on high-dimensional BIC for least squares regression and on tuning parameter selection for nonconvex penalized regression can be found in [24, 25, 39]. In this paper we use the BIC statistic (15).
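A sketch of tuning parameter selection on a grid is given below. Because the exact criterion (15) is not reproduced here, the score used is a generic BIC-type quantity, $\log(\mathrm{RSS}/n) + \mathrm{df}\cdot\log(n)/n$, with the number of nonzero coefficients standing in for the degrees of freedom; both choices are illustrative assumptions, and atan_irl refers to the sketch in Section 4.1.

```python
import numpy as np

def bic_score(X, y, beta):
    """Generic BIC-type score: log(RSS / n) + df * log(n) / n."""
    n = X.shape[0]
    rss = np.sum((y - X @ beta) ** 2)
    df = np.count_nonzero(beta)
    return np.log(rss / n) + df * np.log(n) / n

def select_lambda(X, y, lam_grid, gamma=0.01):
    """Pick the lambda on lam_grid minimizing the BIC-type score for the Atan fit."""
    best = (np.inf, None, None)
    for lam in lam_grid:
        beta = atan_irl(X, y, lam, gamma=gamma)   # sketch from Section 4.1
        score = bic_score(X, y, beta)
        if score < best[0]:
            best = (score, lam, beta)
    return best[1], best[2]
```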

4.3. A Standard Error Formula

The standard errors for the estimated parameters can be obtained directly because we are estimating parameters and selecting variables at the same time. Let $\hat\beta$ be a local minimizer of the Atan objective and let $\hat{\mathcal{A}} = \{j : \hat\beta_j \neq 0\}$. Following [3, 17], standard errors of $\hat\beta_{\hat{\mathcal{A}}}$ may be estimated by using quadratic approximations to the Atan penalty. Indeed, the local quadratic approximation (16) suggests that the Atan objective may be replaced by the quadratic minimization problem (17), at least for the purposes of obtaining standard errors. Using this expression, we obtain a sandwich formula (18) for the estimated standard error of $\hat\beta_{\hat{\mathcal{A}}}$, in which $\hat\sigma^2$ is the residual variance estimate and $|\hat{\mathcal{A}}|$ is the number of elements in $\hat{\mathcal{A}}$. Under the conditions of Theorem 4, this sandwich estimator is consistent for the asymptotic covariance of $\hat\beta_{\hat{\mathcal{A}}}$.
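The following records the Fan-Li-type local quadratic approximation and sandwich formula that this paragraph describes, written in the notation of Section 2; it is a reconstruction under that assumption, not a verbatim copy of displays (16)-(18).

```latex
% Local quadratic approximation of the penalty near a nonzero value (assumed Fan-Li form)
p_{\lambda,\gamma}(|\beta_j|) \approx p_{\lambda,\gamma}(|\beta_{j0}|)
  + \frac{p_{\lambda,\gamma}'(|\beta_{j0}|)}{2\,|\beta_{j0}|} \bigl( \beta_j^{2} - \beta_{j0}^{2} \bigr),
\qquad \beta_j \approx \beta_{j0} \neq 0 .
% Resulting sandwich covariance estimate for the selected coefficients
\widehat{\operatorname{cov}}\bigl(\hat\beta_{\hat{\mathcal A}}\bigr)
  = \hat\sigma^{2}
    \bigl\{ X_{\hat{\mathcal A}}^{T} X_{\hat{\mathcal A}} + n\,\Sigma_{\lambda}(\hat\beta_{\hat{\mathcal A}}) \bigr\}^{-1}
    X_{\hat{\mathcal A}}^{T} X_{\hat{\mathcal A}}
    \bigl\{ X_{\hat{\mathcal A}}^{T} X_{\hat{\mathcal A}} + n\,\Sigma_{\lambda}(\hat\beta_{\hat{\mathcal A}}) \bigr\}^{-1},
\qquad
\Sigma_{\lambda}(\hat\beta_{\hat{\mathcal A}})
  = \operatorname{diag}\Bigl\{ \frac{p_{\lambda,\gamma}'(|\hat\beta_j|)}{|\hat\beta_j|} : j \in \hat{\mathcal A} \Bigr\},
% with hat(sigma)^2 = ||y - X hat(beta)||_2^2 / (n - |hat(A)|).
```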

5. Numerical Studies

5.1. Simulation Studies

We now investigate the sparsity recovery and estimation properties of the proposed estimator via numerical simulations. We compare the following estimators: the Lasso estimator (implemented using the R package glmnet [40]); the adaptive Lasso estimator (denoted by Alasso [15]); the SCAD estimator from the CD algorithm without calibration [31]; the MCP estimator from the CD algorithm [31]; the Dantzig selector [7]; and the Bayesian variable selection method [8]. For the proposed Atan estimator, we take a fixed value of $\gamma$, and the BIC statistic (15) is used to select the tuning parameter $\lambda$. In the following, we report simulation results from four examples.

Example 8. In this example, simulation data are generated from the linear regression model $y = x^T\beta + \sigma\epsilon$, where the covariate vector $x$ follows a multivariate normal distribution with zero mean and covariance $\rho^{|i-j|}$ between its $i$th and $j$th components, and $\epsilon$ is standard normal noise. In our simulation, the sample size $n$ is set to be 100 and 200. For each case, we repeated the simulation 200 times.

For the linear model, the model error of an estimator $\hat\beta$ is $\mathrm{ME}(\hat\beta) = (\hat\beta - \beta)^T E(xx^T)(\hat\beta - \beta)$. Simulation results are summarized in Table 1, in which MRME stands for the median of the ratios of the ME of a selected model to that of the unpenalized least squares estimate under the full model. The columns "C" and "IC" are measures of model complexity: column "C" shows the average number of nonzero coefficients correctly estimated to be nonzero, and column "IC" presents the average number of zero coefficients incorrectly estimated to be nonzero. In the column labeled "underfit," we present the proportion of replications excluding any nonzero coefficients over the 200 replications. Likewise, we report the probability of selecting the exact subset model and the probability of including all three significant variables together with some noise variables in the columns "correct-fit" and "overfit," respectively.
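The following sketch simulates one dataset in the spirit of Example 8 and computes C, IC, and the model error; the coefficient values, the correlation level $\rho = 0.5$, the dimension $p = 8$, and the noise level are placeholders rather than the paper's exact settings, and atan_irl is the sketch from Section 4.1.

```python
import numpy as np

def simulate(n=100, p=8, rho=0.5, sigma=1.0, beta_true=None, rng=None):
    """One dataset in the style of Example 8: AR(1)-correlated Gaussian covariates."""
    rng = np.random.default_rng(rng)
    if beta_true is None:
        beta_true = np.zeros(p)
        beta_true[:3] = [3.0, 1.5, 2.0]        # placeholder nonzero coefficients
    cov = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    y = X @ beta_true + sigma * rng.standard_normal(n)
    return X, y, beta_true, cov

def selection_summary(beta_hat, beta_true, cov):
    """C, IC, and model error (beta_hat - beta)^T E(xx^T) (beta_hat - beta)."""
    nz_true = beta_true != 0
    nz_hat = beta_hat != 0
    C = int(np.sum(nz_hat & nz_true))          # correctly selected nonzeros
    IC = int(np.sum(nz_hat & ~nz_true))        # incorrectly selected zeros
    diff = beta_hat - beta_true
    return C, IC, float(diff @ cov @ diff)

X, y, beta_true, cov = simulate(rng=0)
beta_hat = atan_irl(X, y, lam=0.5)             # lam=0.5 is an arbitrary illustrative value
print(selection_summary(beta_hat, beta_true, cov))
```

Averaging these quantities over 200 replications, and taking the median of the ME ratios against the full-model least squares fit, reproduces the structure of the columns reported in Table 1.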

As can be seen from Table 1, all variable selection procedures dramatically reduce model error. Atan has the smallest model error among all competitors, followed by Alasso, MCP, SCAD, the Bayesian method, and the Dantzig selector. In terms of sparsity, Atan also has the highest probability of correct-fit and performs better than the other penalties. Atan also shows some advantages when the dimension is high, as can be seen in Example 9.

We now test the accuracy of our standard error formula (18). The median absolute deviation divided by 0.6745, denoted by SD in Table 2, of the 3 estimated nonzero coefficients over the 200 simulations can be regarded as the true standard error. The median of the 200 estimated standard errors and the median absolute deviation of the 200 estimated standard errors divided by 0.6745 gauge the overall performance of standard error formula (18). Table 2 presents the results for the nonzero coefficients at one of the two sample sizes; the results for the other case are similar. Table 2 suggests that the sandwich formula performs surprisingly well.

Example 9. This example is from Wang et al. [23]. Here the predictor dimension $p$ grows with the sample size $n$ (with $[x]$ standing for the largest integer no larger than $x$), but the dimension of the true model is fixed to be 5. Results from the simulation study are found in Table 3, and a conclusion similar to that of Example 8 can be drawn.

Example 10. In the simulation study presented here, we examined the performance of the various PLS methods for $n$, $p$, and $p_0$ substantially larger than in the previous studies; the nonzero part of the true coefficient vector is built from vectors with all entries equal to 1. We simulated 200 independent datasets in this study and, for each dataset, we computed estimates of $\beta$. Results from this simulation study are found in Table 4.

Perhaps the most striking aspect of the results presented in Table 4 is that hardly any method ever selected the correct model in this simulation study. However, given that $n$, $p$, and $p_0$ are substantially larger in this study than in the previous simulation studies, this may not be too surprising. Notice that, on average, Atan selects the most parsimonious models of all the methods and has the smallest model error. Atan's nearest competitor in terms of model error is Alasso, but the average size of the models selected by Alasso, 103.5250, is larger than that of Atan. Since the average selected model sizes fall short of $p_0$, it is clear that Atan underfits in some instances. In fact, all of the methods in this study underfit to some extent. This may be due to the fact that many of the nonzero entries in $\beta$ are small relative to the noise level $\sigma$.

Example 11. As an extension of this method, we consider the problem of simultaneous variable selection and estimation in the partially linear model
$$y = x^T\beta + g(t) + \varepsilon,$$
where $y$ is a scalar response variate, $x$ is a $p$-dimensional covariate vector, $t$ is a scalar covariate taking values in a compact interval (for simplicity, we assume this interval to be $[0, 1]$), $\beta$ is a $p \times 1$ column vector of unknown regression parameters, the function $g(\cdot)$ is unknown, and the model error $\varepsilon$ is independent of $(x, t)$ with mean 0. Traditionally, it has generally been assumed that $\beta$ is finite-dimensional; several standard approaches, such as the kernel method, the spline method [41], and local linear estimation [17], have been proposed.

In this study, we simulate the points $t_i$, $i = 1, \ldots, n$, from the uniform distribution on $[0, 1]$. For each $i$, the covariate vector $x_i$ is simulated from a normal distribution with an autocorrelated covariance structure, so that the correlation between components decays with their distance. We investigate a scenario in which only a few components of $\beta$ are nonzero and the rest are 0; for each case, we repeated the simulation 200 times. The nonparametric function $g$ is a fixed smooth function of $t$. For comparison, we apply the different penalties to the parametric component, with a B-spline basis used for the nonparametric component (see the sketch below).
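As a rough illustration of how the B-spline component can be combined with the penalized parametric part, the sketch below residualizes a B-spline basis in $t$ (profile least squares) and then applies the iteratively reweighted Lasso sketch from Section 4.1 to the parametric part; the basis settings are arbitrary, and this is a schematic of the general approach rather than the authors' implementation.

```python
import numpy as np
from sklearn.preprocessing import SplineTransformer

def profile_out_spline(X, y, t, n_knots=10, degree=3):
    """Project X and y onto the orthogonal complement of a B-spline basis in t."""
    B = SplineTransformer(n_knots=n_knots, degree=degree,
                          include_bias=True).fit_transform(t.reshape(-1, 1))
    # Hat matrix of the spline basis; residualizing removes the nonparametric part.
    H = B @ np.linalg.pinv(B.T @ B) @ B.T
    return X - H @ X, y - H @ y, B

def fit_partially_linear(X, y, t, lam, gamma=0.01):
    """Penalized profile least squares: Atan-type penalty on beta, B-spline for g(t)."""
    X_res, y_res, B = profile_out_spline(X, y, t)
    beta = atan_irl(X_res, y_res, lam, gamma=gamma)        # parametric component
    # Back out the spline coefficients for g given the fitted beta.
    alpha = np.linalg.pinv(B.T @ B) @ B.T @ (y - X @ beta)
    g_hat = B @ alpha                                      # fitted g(t_i)
    return beta, g_hat
```

The fitted values g_hat evaluated at the observed $t_i$ are what enter the RASE criterion defined below.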

The results are summarized in Table 5. Columns 2-5 of Table 5 give the averages of the estimates of the nonzero coefficients. Column 6 is the number of zero coefficients whose estimates are exactly 0, averaged over the 200 simulations, and their medians are given in column 7. Model errors are computed for the parametric component $\hat\beta$; their medians are listed in the 8th column, followed by the standard deviations of the model errors in parentheses.

The performance of the estimator $\hat g$ of the nonparametric component is assessed by the square root of average squared errors (RASE),
$$\mathrm{RASE}(\hat g) = \Bigl\{\frac{1}{n}\sum_{i=1}^{n}\bigl(\hat g(t_i) - g(t_i)\bigr)^2\Bigr\}^{1/2},$$
where the $t_i$ are the observed data points at which the function $g$ is estimated. Column 9 summarizes the RASE values for the different situations.

We can see the following from Table 5. (1) Compared with the Lasso and Dantzig estimators, the SCAD and Atan estimators have smaller model errors; this is because SCAD and Atan are nearly unbiased, while the Lasso and the Dantzig selector are biased, especially for the larger coefficients. Moreover, the Atan and SCAD estimators are more stable. (2) Each method is able to select the important variables, but it is obvious that the Atan estimator achieves slightly stronger sparsity. (3) For the nonparametric component, the Atan and SCAD estimators have smaller RASE values.

5.2. Real Data Analysis

In this section, we apply the Atan regularization scheme to a prostate cancer example. The dataset in this example is derived from a study of prostate cancer in [42]. The dataset consists of the medical records of 97 patients who were about to receive a radical prostatectomy. The predictors are eight clinical measures: log(cancer volume) (lcavol), log(prostate weight) (lweight), age, the logarithm of the amount of benign prostatic hyperplasia (lbph), seminal vesicle invasion (svi), log(capsular penetration) (lcp), Gleason score (gleason), and percentage of Gleason scores 4 or 5 (pgg45). The response is the logarithm of the prostate-specific antigen level (lpsa). One of the main aims here is to identify which predictors are more important in predicting the response.

The Lasso, Alasso, SCAD, MCP, and Atan are all applied to the data, and we also compute the OLS estimate for comparison. Results are summarized in Table 6. The OLS estimator does not perform variable selection. The Lasso selects five variables in the final model; SCAD selects lcavol, lweight, lbph, and svi; and Alasso, MCP, and Atan select lcavol, lweight, and svi. Thus, Atan selects a substantially simpler model than the Lasso and SCAD. Furthermore, as indicated by the goodness-of-fit columns in Table 6 ($R^2$ is equal to one minus the residual sum of squares divided by the total sum of squares), the Atan estimator describes more of the variability in the data than Alasso and MCP and nearly as much as the OLS estimator.

6. Conclusion and Discussion

In this paper, a new Atan penalty which very closely resembles the $\ell_0$ penalty is proposed. First, we establish the theory of the Atan estimator under mild conditions; the theory indicates that the Atan-penalized least squares procedure enjoys oracle properties even when the number of variables grows with the number of observations, provided it grows more slowly. Second, the iteratively reweighted Lasso algorithm makes the proposed estimator fast to compute. Third, we suggest a BIC-like tuning parameter selector to identify the true model consistently. Numerical studies further support our theoretical results and the advantage of the Atan estimator for model selection.

We do not address the situation where $p \ge n$ in this paper. In fact, the proposed Atan method can be extended to variable selection with $p \ge n$. Also, as shown in Example 11, the Atan method can be applied to semiparametric and nonparametric models [43, 44]. Furthermore, a recent line of applications of variable selection is the search for impact points in functional data analysis [45, 46]; a combination of the Atan estimator with the questions in [45, 46] would be interesting. These problems are beyond the scope of this paper and will be interesting topics for future research.

Appendix

Proof of Theorem 2. Let and fix . To prove the theorem, it suffices to show that if is large enough, then holds for all sufficiently large, with probability at least . Define and note that The fact that is concave on implies that when is sufficiently large.
Condition (B) implies that Thus, for big enough,By (D),On the other hand (D) impliesFurthermore, (C) and (B) implyFrom (A.5)–(A.8), we conclude that if is large enough, then holds for all sufficiently large, with probability at least . This proves Theorem 2.

Proof of Lemma 3. Suppose that and that . Define by and . Similar to the proof of Theorem 2, letwhere is defined in (8). ThenOn the other hand, since the Atan penalty is concave on ,for . Thus,By (C), It follows from (A.10)–(A.12) that there is constant such that Since , the result follows.

Proof of Theorem 4. Taken together, Theorem 2 and Lemma 3 imply that there exists a sequence of local minima of (8) such that and . Part (i) of the theorem follows immediately.
To prove part (ii), observe that, on the event , we must have where . It follows that whenever . Now note that conditions (B)–(D) imply and, thus, To complete the proof of (ii), we use the Lindeberg-Feller central limit theorem to show thatin distribution. Observe that where .
Fix and let . Then Since and since (E) implies we must have Thus, the Lindeberg condition is satisfied and (A.19) holds.

Competing Interests

The authors declare that there are no competing interests regarding the publication of this paper.

Acknowledgments

This work was supported by the Tian Yuan Special Funds of the National Natural Science Foundation of China (Grant no. 11426192) and the Project of Education of Zhejiang Province (no. Y201533324).