Abstract

Partial linear models, a popular family of semiparametric models, provide an interpretable and flexible framework for modelling complex data. A challenging question in partial linear models is structure identification, that is, deciding which components are linear and which are nonlinear, especially for high dimensional data. This paper considers the structure identification problem in general partial linear single-index models, where the link function is unknown. We propose two penalized methods based on a modern dimension reduction technique. Under certain regularity conditions, we show that the second estimator identifies the underlying true model structure correctly. The convergence rate of the new estimator is established as well.

1. Introduction

Partially linear models (PLMs), containing both linear and nonlinear additive components, provide a useful class of tools for modelling complex data. PLMs have wide applications in practice due to their flexibility and interpretability.

Given the data set , where is the response and and are vectors of covariates, the basic PLM has the following form: where is the intercept, is a vector of unknown parameters for linear terms, each is an unknown nonlinear function from to , and ’s are i.i.d. random errors with zero conditional mean and a finite variance.

When the linear and nonlinear terms are known in advance, estimation and inference for various PLMs have been well studied in the literature, such as [1–3]. Under the same setting, motivated by sparse models and the Lasso idea that has developed rapidly in recent years, there is a substantial body of work on variable selection and estimation for sparse PLMs, including [4–6] and others. However, faced with complicated data-generating processes, a key and challenging question in using PLMs is how to decide which features should be treated as linear and which as nonlinear, especially for high dimensional problems such as text categorization and biology. Furthermore, the difficulty of model search increases dramatically as the data dimension grows, owing to the curse of dimensionality. In the last several years, a number of methods have been proposed to address various aspects of this problem. Two commonly used approaches are screening and hypothesis testing. Screening is sometimes useful in practice but lacks theoretical justification, while, for hypothesis testing, it is often hard to construct proper test statistics, and the tests may perform poorly when the number of covariates is large. To handle estimation and component identification efficiently for PLMs, Zhang et al. [7] first proposed a penalized method called LAND for automatically identifying linear and nonlinear terms. This method is based on the orthogonal decomposition of the Sobolev space of order , consisting of a linear subspace and nonlinear subspaces. Another main stream of research is based on spline approximation of the nonparametric functions and penalized likelihood. For example, Huang et al. [8] start with a nonparametric additive model but use spline series expansions to approximate the component functions. This spline series approximation allows them to prove selection consistency for more general functions. Lian et al. [9] proposed a doubly SCAD-penalized approach for partial linear structure identification problems of nonpolynomial (NP) dimensionality. The methods above extend easily to generalized partial linear models, in which the link function is known in advance. If the link function is unknown, a standard technique in the literature is two-step estimation: first, with the component functions held fixed, a rough estimate of the link function is obtained; then, with this link function held fixed, the component functions are re-estimated; the process is repeated until convergence. However, this method is unstable and quite sensitive to the choice of penalty parameters. Moreover, it involves a nonconvex optimization, which cannot guarantee satisfactory numerical results.

In this paper we consider a general single-index model with the partial linear form , where the error term is assumed to be independent of the covariates, in addition to satisfying the standard moment conditions mentioned above. The link function is typically unknown. Note that an additive structure of the unknown link function and the error is not assumed in model (2). This model is quite general; it includes the classical generalized partial linear model of the form , as well as the partial linear transformation model . Clearly, when the link function is not specified, the partial linear term is identifiable only up to a multiplicative scalar, because any location-scale change in can be absorbed into the link function.

By making use of cubic spline approximations to smooth functions and a regularization technique, we develop a new penalized method to identify the model structure for the general partial linear model (2). The proposed method is formulated via the theory of sufficient dimension reduction and two different variational forms of eigenvalue problems. Specifically, we use a modern sufficient dimension reduction technique to estimate the linear and nonlinear terms, the gradient-based kernel dimension reduction (gKDR) proposed by Fukumizu and Leng [10]. This method has some specific advantages over the above-mentioned techniques; for example, gKDR can handle any type of variable, including multivariate or nonvectorial ones, in the same way. Moreover, the linearity condition and the constant variance condition are avoided when gKDR is adopted.

Based on the gKDR and two variational forms of the eigenvalue problem, we design two different penalized approaches by adding twofold Lasso-type penalties. The first algorithm is a nonconvex optimization problem with an equality constraint, which can be solved efficiently by the alternating direction method of multipliers (ADMM). To obtain better statistical properties, a convex relaxation of the eigenvalue problem is used to formulate our second method, a semidefinite programme. Using this convex approach, we establish estimation consistency and support recovery for the general partial linear model. The convex optimization problem can also be solved in polynomial time by ADMM.

The rest of the article is organized as follows. In Section 2 we transform model (2) into a surrogate model via the cubic spline approximation. Next we introduce the gKDR from the dimension reduction literature. We then propose two different penalized approaches based on two variational forms of the eigenvalue problem. Statistical properties of the new estimators, including convergence rates and selection consistency, are established in Section 3. All proofs are relegated to the Appendix.

2. Problem Formulation

Notation. We collect here some standard notation used throughout the paper. For matrices of the same dimension, is the Frobenius inner product, and is the squared Frobenius norm. is the usual norm on Euclidean spaces. is the -norm, defined to be the norm of the vector of rowwise norms of . We write , meaning that is positive semidefinite.

We mainly consider the problem of identifying the linear components of the general single-index model (2) and estimating which variables are relevant to the response. For the nonparametric components in model (2), one often uses spline or wavelet approximations as finite representations, because of their good approximation capabilities for functions in various nonparametric classes and their computational advantages. In our specific setting, we adopt the cubic spline approximation, because two properties of cubic splines are critical for our analysis. One property is that all of its basis functions are linearly independent of each other. The other is that the linear function is contained in the cubic spline basis. We refer the reader to [11] for a detailed description of spline functions. Thus, owing to these two properties of cubic splines, each function in this spline space has a unique finite representation, and the coefficient of the basis function corresponds to the linear part.
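To make these two properties concrete, here is a minimal sketch in Python, assuming a truncated power cubic spline basis on [0, 1] with hypothetical interior knots and with the constant function dropped; the particular basis used in the paper may differ.

```python
import numpy as np

def cubic_spline_basis(x, knots):
    """Truncated power cubic spline basis evaluated at points x.

    The constant term is omitted, so the first column is the linear
    function x itself; the remaining columns are x^2, x^3 and the
    truncated cubics (x - t)_+^3 for each interior knot t.
    """
    x = np.asarray(x, dtype=float)
    cols = [x, x**2, x**3]
    cols += [np.clip(x - t, 0.0, None)**3 for t in knots]
    return np.column_stack(cols)

# Illustration on [0, 1] with hypothetical equally spaced interior knots.
x = np.linspace(0.0, 1.0, 200)
B = cubic_spline_basis(x, knots=np.linspace(0.2, 0.8, 4))

# Property 1: the basis functions are linearly independent
# (the design matrix has full column rank on a fine grid).
assert np.linalg.matrix_rank(B) == B.shape[1]

# Property 2: the linear function lies in the span; it is exactly the
# first basis function, so a component is linear precisely when all of
# its spline coefficients except the first one vanish.
```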

Let be the total covariate defined on , and let be the response. For each covariate , let be an -dimensional basis expansion, where . For any vector , we define as the cubic spline-based function we consider. With this finite spline representation, each in our model (2) can be approximated well by the form above. Let be the vector-valued basis function for any . To represent our model in a unified form, we denote by the coefficient representation of the th linear-form variable. Let and , where is the coefficient corresponding to the th true nonlinear component. Thus, letting denote the approximating coefficients, the original model (2) is transformed into the following one: This semiparametric model (3) indicates essentially that the response depends on the generalized predictor vector solely through a linear combination , or, equivalently, is independent of the original covariates once is given. In the literature, the theory of sufficient dimension reduction provides an effective starting point for estimating without loss of regression information of on and without assuming a specific link function. Popular methods for sufficient dimension reduction include sliced inverse regression [12], sliced average variance estimation [13], contour regression, and directional regression [14], among others.
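Continuing the hypothetical sketch above, the surrogate model stacks the per-covariate spline expansions into one generalized predictor vector; the blocking convention below is an illustrative assumption, not the paper's exact notation.

```python
# Expand each of the p covariates with the same cubic spline basis and
# concatenate, so that row i of Z contains the generalized predictor
# vector for observation i.
n, p = 100, 5
rng = np.random.default_rng(0)
X = rng.uniform(size=(n, p))
knots = np.linspace(0.2, 0.8, 4)

Z = np.hstack([cubic_spline_basis(X[:, j], knots) for j in range(p)])
m = Z.shape[1] // p   # number of basis functions per covariate

# The coefficient vector is grouped as (beta_1, ..., beta_p), one block
# of length m per covariate; the first entry of block j weights the
# linear basis function of covariate j.
groups = [np.arange(j * m, (j + 1) * m) for j in range(p)]
```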

In this article, we use the gKDR, which is formulated with reproducing kernel techniques. This method has some specific advantages over the above-mentioned techniques; for example, gKDR can handle any type of variable, including multivariate or nonvectorial ones, in the same way. Moreover, the nonparametric nature of the kernel method avoids strong assumptions on the distribution of the covariates, the response, or the conditional probability, which are usually required by many classical dimension reduction methods such as SIR, pHd, and contour regression.

2.1. gKDR Method

To describe the gKDR method, we need some technical notation. For a compact set in some metric space, we denote by a positive-definite kernel on , that is, a symmetric function satisfying the following positive semidefiniteness condition: for any in and . It is known [15] that a positive-definite kernel on is uniquely associated with a Hilbert space consisting of functions on , called the reproducing kernel Hilbert space (RKHS) associated with the kernel .
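As a concrete example (the Gaussian kernel is one common positive-definite kernel, not necessarily the one used later in the paper), the sketch below builds its Gram matrix and checks the positive semidefiniteness condition numerically; the bandwidth is an arbitrary choice.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
K = gaussian_gram(X)

# Positive semidefiniteness: c^T K c >= 0 for any coefficient vector c
# (up to numerical round-off).
c = rng.normal(size=50)
assert c @ K @ c >= -1e-8
```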

For brevity, we denote by a subset of . Let and be measurable spaces and let be a random variable on with probability . Let and be positive-definite kernels on and , respectively, with corresponding RKHSs and . It is assumed that and are finite. We now introduce two integral operators, so that an appropriate variant of the problem of estimating in (3) can be established. The cross-covariance operator is defined as the operator : Similarly, denotes the self-covariance operator on ; that is, Note that these definitions are natural extensions of ordinary covariance matrices on Euclidean spaces. Although and depend on the kernels, we omit this dependence for notational simplicity.

With these definitions, we now define the population matrix by Reference [10] showed that the eigenvector corresponding to the largest eigenvalue of is equal to the true vector for any , provided that the maximal eigenvalue is simple. To be precise, the following equality holds: where denotes the eigenvector associated with the maximal eigenvalue of the matrix . Thus, the problem of estimating in (3) is equivalent to the eigenvalue problem of solving .

We now turn to the empirical version of , so as to estimate from the available sample. Denote the Gram matrices and by and , respectively. Let . Then, gKDR computes the eigenvectors of the symmetric matrix: where is a regularization parameter of Tikhonov type. We then define as an empirical estimator of . In the next two subsections, we introduce two variational forms of the eigenvalue problem, so that our proposed Lasso-based methods can be formulated naturally.
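The sketch below follows the general form of the gKDR estimate with Gaussian kernels, whose gradients are available in closed form; the bandwidths, the Tikhonov regularization parameter, and the exact normalization are illustrative assumptions rather than the paper's prescription.

```python
import numpy as np

def gaussian_gram(X, sigma):
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def gkdr_matrix(X, Y, sigma_x=1.0, sigma_y=1.0, eps=1e-3):
    """Empirical gKDR-type matrix (a sketch in the spirit of [10]).

    X : (n, m) generalized predictors, Y : (n,) or (n, q) responses.
    Returns an (m, m) symmetric positive semidefinite matrix whose
    maximal eigenvector estimates the index vector in (3).
    """
    n = X.shape[0]
    Y = np.asarray(Y).reshape(n, -1)
    Gx = gaussian_gram(X, sigma_x)
    Gy = gaussian_gram(Y, sigma_y)
    # Tikhonov-regularized inverse (G_X + n * eps * I)^{-1}
    R = np.linalg.solve(Gx + n * eps * np.eye(n), np.eye(n))
    M = np.zeros((X.shape[1], X.shape[1]))
    for i in range(n):
        # D[j, a] = d/dx_a k_X(X_j, x) evaluated at x = X_i
        D = Gx[:, i:i + 1] * (X - X[i]) / sigma_x**2
        F = R @ D
        M += F.T @ Gy @ F
    return M / n

# The maximal eigenvector of gkdr_matrix(Z, Y) plays the role of the
# estimated coefficient vector in the surrogate model (3).
```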

2.2. A Nonconvex Variational Algorithm

In order to employ the popular Lasso idea, we need to transform the eigenvalue problem (7) to an optimization scheme. Recall the following Courant-Fischer variational representation of the maximal eigenvalue and eigenvector:
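In our own notation, with a symmetric matrix (for instance the empirical gKDR matrix of the previous subsection) and ranging over unit vectors, this representation can be written as the following sketch:

```latex
\lambda_{\max}\bigl(\widehat{M}_n\bigr)
  \;=\; \max_{\|b\|_2 = 1} b^{\top} \widehat{M}_n\, b ,
\qquad
\widehat{\beta} \;\in\; \operatorname*{arg\,max}_{\|b\|_2 = 1} b^{\top} \widehat{M}_n\, b .
```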

To identify the linear and nonlinear terms in model (3), we define an estimator by adding twofold convex penalties to (10) so that sparse estimates can be produced. To be precise, let , where each . The penalized estimation method is defined as follows: where is the first element of and is the last component of . Note that is the penalty parameter and is a tuning parameter, typically for normalization. Our penalized method is motivated by the following fact. For any given function , by the cubic spline representation, as mentioned earlier, the first spline basis is the linear function and all the basis functions are linearly independent of each other. Consequently, and with if and only if is a nonlinear function. Based on this fact and the lasso constraint in (11), the proposed method can identify both the linear and nonlinear terms and, of course, also find the relevant features simultaneously.
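As a sketch only (not necessarily the paper's exact weighting), one plausible form of such a twofold penalized criterion, written in our own notation with the coefficient block of the th covariate, the coefficient of its linear basis function, the remaining spline coefficients, the penalty parameter, and a tuning parameter, is:

```latex
\widehat{\beta} \;\in\; \operatorname*{arg\,max}_{\|b\|_2 = 1}
   \; b^{\top} \widehat{M}_n\, b
   \;-\; \lambda \sum_{j=1}^{p} \Bigl( |b_{j1}| \;+\; \tau \,\|b_{j,-1}\|_2 \Bigr).
```

Under this form, an estimate with zero nonlinear block flags the corresponding component as linear, while an entirely zero block removes that covariate.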

We observe that optimizing (11) is complicated, since it belongs to a class of constrained nonconvex programmes. In this article, we consider an alternative formulation of (11); precisely, we use the ADMM [16]. ADMM is an algorithm intended to blend the decomposability of dual ascent with the superior convergence properties of the method of multipliers. Let , and, by introducing an additional constraint, we can reformulate problem (11) as where is the indicator function of the unit -ball surface in and we adopt the convention . As in the method of multipliers, we form the augmented Lagrangian with a parameter : where is the Lagrange multiplier. Defining the residual , we have where is called the scaled dual variable. Using the scaled dual variable, we can express ADMM as where denotes Euclidean projection onto the unit -ball surface.
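A minimal sketch of the resulting scaled-form ADMM loop is given below; here prox and project stand for the x-update and z-update routines (sketched after this paragraph and after Algorithm 1, respectively), and rho, the initialization, and the stopping rule are illustrative choices rather than the paper's.

```python
import numpy as np

def admm_sphere_lasso(M, prox, project, rho=1.0, max_iter=500, tol=1e-6):
    """Scaled-form ADMM for a sphere-constrained penalized eigenvector
    problem in the spirit of (11)-(16).

    prox    : callable(v, rho) -> x-update (the penalized proximal step).
    project : callable(v) -> z-update (projection onto the unit sphere).
    """
    m = M.shape[0]
    x = np.ones(m) / np.sqrt(m)       # primal variable
    z = x.copy()                      # copy constrained to the sphere
    u = np.zeros(m)                   # scaled dual variable
    for _ in range(max_iter):
        x = prox(z - u, rho)          # x-update
        z = project(x + u)            # z-update
        u = u + (x - z)               # dual update on the residual
        if np.linalg.norm(x - z) < tol:
            break
    return z
```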

Many convergence results for ADMM have been discussed in the literature. ADMM consists of iteratively minimizing with respect to , minimizing with respect to , and then updating the dual variable . The -update involves iterative evaluation of the proximal operator associated with . To this end, we use block coordinate descent (BCD) [17] for -regularization to compute in (15). Specifically, at the th iteration, the BCD solves a problem of the form where is the subvector of with group , the groups forming a partition of , and equals or approximates . Note that is taken to be or according to the specified group. Clearly, BCD can be viewed as a generalization of classical coordinate descent for -regularization. In our specific setting, a direct computation on yields where we write and is the th column of ; is the identity and is the submatrix of with index set . Since in (21) is usually not a multiple of the identity, the minimizer of (19) has no closed form. We therefore use a simple but efficient replacement; that is, , where is the largest eigenvalue of , which can be computed efficiently, since is only a submatrix indexed by and independent of . In this case, the iterative solution of (19) is obtained by group soft thresholding of the Newton step: where is the soft thresholding operator associated with , defined as for any . Under mild conditions, converges to some vector as goes to infinity, denoted by . It is worth noting that, by (20), the iteration in (22) can be computed in parallel across groups.
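A minimal sketch of the group soft-thresholding operator and of a BCD-style x-update built on it is given below; the smooth part of the objective is passed in generically through its gradient and per-group Lipschitz constants, so this is a sketch of the step (22) within the x-update (15) rather than a verbatim implementation.

```python
import numpy as np

def group_soft_threshold(v, t):
    """Shrink the l2 norm of v by t; return zero when ||v|| <= t."""
    nrm = np.linalg.norm(v)
    if nrm <= t:
        return np.zeros_like(v)
    return (1.0 - t / nrm) * v

def bcd_group_lasso(grad, lipschitz, groups, lam, x0, n_iter=100):
    """Block coordinate descent for a smooth loss plus a group lasso
    penalty (a sketch of the x-update).

    grad      : callable(x) -> gradient of the smooth part.
    lipschitz : list with one constant L_g per group (e.g. the largest
                eigenvalue of the group's Hessian block), used in place
                of the exact block Hessian so each step is closed form.
    groups    : list of index arrays partitioning the coordinates.
    """
    x = x0.copy()
    for _ in range(n_iter):
        for k, g in enumerate(groups):
            L = lipschitz[k]
            # Newton-type step on the group, then group soft thresholding.
            step = x[g] - grad(x)[g] / L
            x[g] = group_soft_threshold(step, lam / L)
    return x
```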

Regarding the projection step (16), it is known that if ; otherwise, is set to any point on the unit -ball surface. Thus an efficient implementation of ADMM is obtained; the overall procedure is stated in Algorithm 1.

given: parameters , , , , the group indexes, and the kernels , .
initialize: , , , .
repeat
  repeat
    for each group , update by the group soft-thresholding step (22)
  until the inner BCD iterates converge to some
  compute , the projection update in (16), and the scaled dual update
until the outer ADMM iterates converge to .
return .
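To complement Algorithm 1, a sketch of the sphere projection used in the z-update, together with hypothetical glue code tying the three sketches together, is:

```python
import numpy as np

def project_sphere(v):
    """Euclidean projection onto the unit l2-sphere: v / ||v|| whenever v
    is nonzero; for v = 0 any unit vector is a valid projection, and we
    arbitrarily return the first standard basis vector."""
    nrm = np.linalg.norm(v)
    if nrm > 0.0:
        return v / nrm
    e = np.zeros_like(v)
    e[0] = 1.0
    return e

# Hypothetical glue code: with M_hat the gKDR matrix, `prox` a closure
# around bcd_group_lasso for the quadratic -b^T M_hat b + (rho/2)||b - v||^2,
# and project_sphere as above, the estimate would be obtained as
#   beta_hat = admm_sphere_lasso(M_hat, prox, project_sphere).
```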
2.3. A Semidefinite Programming Relaxation

This section introduces a convex relaxation of (10). Since estimates of individual eigenvectors of in (7) are unstable when the gap between eigenvalues is small, it is reasonable to instead focus on their span, that is, the principal subspace of variation. Let be the projection onto the subspace spanned by ; a lesser-known but equivalent variational representation of (10) is the convex program: where . The optimization problem (23) is a semidefinite programme (SDP), which can be solved exactly in polynomial time. If the maximal eigenvalue is simple, the optimum is always achieved at a rank-one matrix, denoted by , which corresponds to the maximal eigenvector; that is, . Motivated by this conclusion, we can estimate the sparse vector by adding -type regularization and computing the maximal eigenvector .
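As a sketch in our own notation (the exact penalty in (24) may be weighted differently), with the unit-trace Fantope and the elementwise sum of absolute entries, the convex program (23) and its penalized counterpart take the form:

```latex
\widehat{\Pi} \;\in\; \operatorname*{arg\,max}_{\Pi \in \mathcal{F}^1}
  \bigl\langle \widehat{M}_n, \Pi \bigr\rangle,
\qquad
\widehat{\Pi}_{\lambda} \;\in\; \operatorname*{arg\,max}_{\Pi \in \mathcal{F}^1}
  \Bigl\{ \bigl\langle \widehat{M}_n, \Pi \bigr\rangle \;-\; \lambda \|\Pi\|_{1} \Bigr\},
\qquad
\mathcal{F}^1 = \{\Pi : 0 \preceq \Pi \preceq I,\ \operatorname{tr}(\Pi) = 1\}.
```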

Using ADMM again, we rewrite (24) as the equivalent equality-constrained problem where and is the indicator function of . The augmented Lagrangian associated with (25) has the form and the updates mainly involve computing a projection operator and a proximal operator, respectively; that is, where is the elementwise soft thresholding operator defined as On the other hand, is the Euclidean projection onto and has the closed form given below.

Lemma 1 (Fantope projection). If   is a spectral decomposition of , then , where and satisfies the equation .

We note that computing involves an eigendecomposition of followed by a modification of the eigenvalues obtained by solving a monotone, piecewise-linear equation.
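A minimal numerical sketch of this projection for the unit-trace Fantope used here is given below; the root of the monotone, piecewise-linear equation is found by bisection purely for simplicity.

```python
import numpy as np

def fantope_projection(A, d=1, tol=1e-10):
    """Project a symmetric matrix A onto {P : 0 <= P <= I, trace(P) = d}
    in the sense of Lemma 1.

    The projection keeps the eigenvectors of A and replaces each
    eigenvalue lam_i by gamma_i = min(max(lam_i - theta, 0), 1), where
    theta solves the monotone, piecewise-linear equation
    sum_i gamma_i(theta) = d.
    """
    lam, U = np.linalg.eigh(A)

    def trace_at(theta):
        return np.sum(np.clip(lam - theta, 0.0, 1.0))

    # Bisection: trace_at is nonincreasing in theta.
    lo, hi = lam.min() - 1.0, lam.max()
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if trace_at(mid) > d:
            lo = mid
        else:
            hi = mid
    gamma = np.clip(lam - 0.5 * (lo + hi), 0.0, 1.0)
    return (U * gamma) @ U.T
```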

3. Statistical Properties

Before giving statistical results of the estimator (24) in terms of consistency and model selection properties, we recall the existing error bound [10] for in Frobenius norm for any .

Recall that the following technical conditions are required to derive the gradient-based estimator. For shorthand, let .
(i) and are separable.
(ii) and are measurable, and , .
(iii) is continuously differentiable and for .
(iv) for any .
(v) Define for any given function . Suppose that is differentiable with respect to and that the linear functional is continuous for any and .
(vi) There are and such that some satisfies and
(vii) Let , where is an i.i.d. copy of . Then .

Theorem 2. Under Assumptions (i)–(vii), for the choice , we have for every as . If, additionally, , then converges in probability to at a rate of order in Frobenius norm.

For simplicity, we write . We denote the second error bound of Theorem 2 by .

Theorem 3. Under the assumptions of Theorem 2, let and let be chosen to satisfy ; then

The proof is shown in the Appendix. The following theorem provides a sufficient condition for support recovery by diagonal thresholding .

Theorem 4. Under the assumptions of Theorem 3, for any , Therefore, the feature selection procedure succeeds with high probability if .
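As an illustration only, with an estimated projection-type matrix from (24), a threshold chosen as in Theorem 4, and the covariate grouping treated as a hypothetical convention, the diagonal thresholding selection rule can be coded as the following sketch.

```python
import numpy as np

def select_support(Pi_hat, groups, threshold):
    """Diagonal thresholding: keep a coordinate when its diagonal entry
    of the estimated matrix exceeds the threshold, and select a covariate
    when any coordinate in its group is kept."""
    keep = np.diag(Pi_hat) > threshold
    return [j for j, g in enumerate(groups) if np.any(keep[g])]
```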

The above theorem follows immediately from the conclusion of Theorem 3.

Appendix

The following lemma, shown in [18], is critical for our analysis. It establishes a close relationship between the matrix and the subspace spanned by the eigenvectors corresponding to its largest eigenvalues.

Lemma A.1. Let A be a symmetric matrix and be the projection onto the subspace spanned by the eigenvectors of corresponding to its largest eigenvalues . If , then for all satisfying and .

Proof of Theorem 3

Proof. Since , recall that is the projection onto the subspace spanned by . Letting and in Lemma A.1, if , we have where the first inequality follows from Lemma A.1, the second from the Cauchy-Schwarz inequality, and the last from the definition of . Furthermore, with the choice of , it is easy to verify that which implies that Plugging the conclusion of Theorem 2 into (A.3) completes the proof of Theorem 3.

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

The first author’s research is partially supported by the National Natural Science Foundation of China (Grant no. 11301421) and the Fundamental Research Funds for the Central Universities of China (Grants nos. JBK120509 and JBK140507), as well as KLAS-130026507.