Research Article  Open Access
Remodeling and Estimation for Sparse Partially Linear Regression Models
Abstract
When the dimension of covariates in the regression model is high, one usually uses a submodel as a working model that contains significant variables. But it may be highly biased and the resulting estimator of the parameter of interest may be very poor when the coefficients of removed variables are not exactly zero. In this paper, based on the selected submodel, we introduce a twostage remodeling method to get the consistent estimator for the parameter of interest. More precisely, in the first stage, by a multistep adjustment, we reconstruct an unbiased model based on the correlation information between the covariates; in the second stage, we further reduce the adjusted model by a semiparametric variable selection method and get a new estimator of the parameter of interest simultaneously. Its convergence rate and asymptotic normality are also obtained. The simulation results further illustrate that the new estimator outperforms those obtained by the submodel and the full model in the sense of mean square errors of point estimation and mean square prediction errors of model prediction.
1. Introduction
Consider the following partially linear regression model: where is a scalar response variable, and are, respectively, dimensional and dimensional continuousvalued covariates with being finite and , is the parameter vector of interest and is the nuisance parameter vector which is supposed to be sparse in the sense that is small, is an unknown function satisfying for identification, is the random error satisfying . For simplicity, we assume that is univariate. Let , , be i.i.d. observations of obtained from the above model.
A feature of the model is that the parametric part contains both the parameter vector of interest and nuisance parameter vector. The reason for this coefficient separation is as follows. In practice we often use such a model to distinguish the main treatment variables of interest from the state variables. For instance, in a clinical trial, consists of treatment variables and can be easily controlled, is a vector of many clinical variables, such as patient ages and body weights. The variables in may have an impact on but are not of primary interest and the effects may be small. In order to make up for potentially nonnegligible effects on the response , the nuisance covariate are introduced into model (1); see Shen et al. [1]. Model (1) contains all relevant covariates and in this paper we call it full model.
The purpose of this paper is to estimate , the parameter of interest, when is removed from the model. The main idea is remodeling based on the following working model: As is known, is a nonzero function if , which relies on two elements, one is , related with the correlation between the covariates of and , the other is , determined by the nuisance parameter in the removed part. Thus the least squares estimator based on model (2) may be inconsistent. In the following, we will make use of the above two elements. Specifically, in the first stage, we shall construct a remodeled model by a multistepadjustment to correct the submodel bias based on the correlation information between the covariates. This adjustment is motivated by Gai et al. [2]. In the paper, they proposed a nonparametric adjustment by adding a univariate nonparametric estimation to the working model (2), and it can dramatically reduce the bias of the working model. But this only holds in a subset of the covariates, although the subset may be fairly large. In order to obtain a globally unbiased working model for linear regression model, Zeng et al. [3] adjusted the working model by multiple steps. Because only those variables in correlated with may have impact on estimation of , in each step a univariate nonparametric part was added to the working model and consequently a globally unbiased working model was obtained.
However, when many components of are correlated with , the number of nonparametric functions added in the above working model is large. Such a model is improper in practice. Thus, in the second stage, we further simplify the above adjusted model by a semiparametric variable selection procedure proposed by Zhao and Xue [4]. Their method can select significant parametric and nonparametric components simultaneously under sparsity condition for semiparametric varying coefficient partially linear models. The relevant papers include Fan and Li [5], Wang et al. [6, 7], among others. After twostage remodeling, the final model is conditionally unbiased. Based on this model, the estimation and model prediction are significantly improved.
The rest of this paper is organized as follows. In Section 2, a multistep adjustment and remodeled models are firstly proposed, then the models are further simplified via the semiparametric SCAD variable selection procedure. A new estimator of the parameter of interest based on the simplified model is derived, its convergence rate and asymptotic normality are also obtained. Simulations are given in Section 3. A short conclusion and some remarks are contained in Section 4. Some regular conditions and theoretical proofs are presented in the appendix.
2. New Estimator for the Parameter of Interest
In this paper, we suppose that covariate has zero mean, is finite and , and . We also assume that covariates and and parameter are prespecified, so that the submodel (2) is a fixed model.
2.1. MultistepAdjustment by Correlation
In this subsection, we first adjust the submodel to be conditionally unbiased by a multistepadjustment.
When is normally distributed, the principal component analysis (PCA) method will be used. Let be the covariance matrix of , then there exists an orthogonal matrix such that , where is the diagonal matrix with being eigenvalues of . Denote and .
When is centered but nonnormally distributed, we shall apply independent component analysis (ICA) method. Assume that is generated by a nonlinear combination of independent components , that is , where is an unknown nonlinear mapping from to , is an unknown random vector with independent components. By imposing some constraints on the nonlinear mixing mapping or the independent components , the independent components can be properly estimated. See Simas Filho and Seixas [8] for an overview of the main statistical principles and some algorithms for estimating the independent components. For simplicity, in this paper we suppose that with , , and are scalar functions.
In the above two cases, 's are independent of each other. Set to be the size of set . Without loss of generality, let .
We construct the following adjusted model: where , and . The model (3) is based on 's population and depends on the distributions of , and . It is easy to see that model (3) is conditionally unbiased, that is, .
The adjusted model (3) is an additive partially linear model, in which is the parametric part, and , , are the nonparametric parts and is the random error. Compared with the submodel (2), the nonparametric parts , , may be regarded as biascorrected terms for the random error . For centered , , , the nonparametric components can be properly identified. In fact, centered can be relaxed to any such that satisfies .
When is centered and normally distributed, the nonparametric parts , . So the multistep adjusted model (3) is really a partially linear model with and . Specially, when , the full model is a linear model, the multistep adjusted model is also a linear model But when the variables in are not jointly normal, the nonparametric parts can be highly nonlinear, which are similar to the results of marginal regression; see Fan et al. [9].
2.2. Model Simplification
When the most of the features in the full model are correlated, then is very large and even is close to . In this case, the adjusted model (3) is improper in practice, so we shall use the group SCAD regression procedure, proposed by Wang et al. [6], and the semiparametric variable selection procedure, proposed by Zhao and Xue [4], to further simplify the model.
Let with , and assume that the model (3) is sparse, that is, is small. We define the semiparametric penalized least squares as where , and is the SCAD penalty function with being a tuning parameter defined as with , and . In (6), denotes the set . Because are nonparametric functions, thus they cannot be directly applied for minimization. Here we will replace and by basis function approximations. For , let be orthogonal basis functions satisfying where is the density function of . Similarly, let be orthogonal basis functions satisfying the above condition which is only replaced by the support and density function of . Denote , . Then and can be approximated by Denote , invoking that the identity matrix, we get where , , .
Denote by , and the least squares estimators based on the penalized function (10), that is . Let and , then is an estimator of , is an estimator of .
Let and . For simplicity, we assume that and . So we get the following simplified working model where , and . Under the assumption of sparsity, the model (11) contains all of significant nonparametric functions and fully utilizes both the correlation information of covariates and the model sparsity on nuisance covariate.
If is centered and normally distributed with covariance matrix the identity matrix, then , , where denotes the unit vector with at position , and is sparse with . So the model (4) is sparse. For model (5), the special case of model (4), we can apply the SCAD penalty method proposed by Fan and Li [5] to select variables in and estimate parameters and simultaneously. The selected covariate and the corresponding parameter are denoted by and , the resulting parameter estimators are denoted by and , respectively. Finally, we can use the simplified model for model prediction. Under the condition of sparsity, its model size is much smaller than those of the multistep adjusted model (5) and the full model (1).
2.3. Asymptotic Property of Point Estimator
Let , , , and , be the true values of , , , and , , respectively, in model (3). Without loss of generality, we assume that , , and , , are all nonzero components.
We suppose that , can be expressed as and can be expressed as , and belong to the Sobolev ellipsoid .
The following theorem gives the consistency of the penalized SCAD estimators.
Theorem 1. Suppose that the regularity conditions (C1)–(C5) in the appendix hold and the number of terms . Then,(i),(ii),(iii),
where .
From the last paragraph of Section 2.2 we know that, for linear regression model and normally distributed , the multistep adjusted model (5) is a linear model. By orthogonal basis functions, such as power series, we have , then , implying the estimator has the same convergence rate as that of the SCAD estimator in Fan and Li [5].
Theorem 2. Suppose that the regularity conditions (C1)–(C6) in the appendix hold and the number of terms . Let and . If and as , then, with probability tending to 1, , .
Remark 3. By Remark 1 of Fan and Li [5], we have that, if as , then . Hence from Theorems 1 and 2, by choosing proper tuning parameters, the variable selection method is consistent and the estimators of nonparametric components achieve the optimal convergence rate as if the subset of true zero coefficients was already known; see Stone [10].
Let be the nonzero components of , corresponding covariates are denoted by . In addition, let where for homoscedastic case, .
Theorem 4. Suppose that the regularity conditions (C1)–(C6) in the appendix hold and the number of terms . If is invertible, then where “” denotes the convergence in distribution.
Remark 5. From Theorems 1 and 4, it can be found that the penalized estimators have the oracle property. Furthermore, the estimator of the parameter of interest has the same asymptotic distribution as that based on the correct submodel.
2.4. Some Issues on Implementation
In the adjusted model (4), are used. When the population distribution is not available, they need to be approximated by estimators. When is normally distributed and eigenvalues of the covariance matrix are different from each other, then is asymptotically with , where is the th eigenvector of with ; see Anderson [11]. For the case when the population size is large and comparable with the sample size, if the covariance matrix is sparse, we can use the method in Rütimann and Bühlmann [12] or Cai and Liu [13] to estimate the covariance matrix. So we can use to approximate . When in model (4) are replaced by these consistent estimators, one can see that the approximation error can be neglected without changing the asymptotic property.
The nonparametric parts in the adjusted model depend on the univariate variable , for . So it needs to choose the steps firstly. In real implementation, we compute all the multiple correlation coefficients of () with and . Then we choose the components for given small number , where denotes the multicorrelation coefficient between and and can be approximated by its sample form; see Anderson [11].
There are some tuning parameters needing to choose in order to implement the twostage remodeling procedure. Fan and Li [5] showed that the SCAD penalty with performs well in a variety of situations. Hence, we use their suggestion throughout this paper. We still need to choose the positive integer for basis functions and the tuning parameter of the penalty functions. Similar to the adaptive lasso of Zou [14], we suggest taking , where is initial estimator of by using ordinary least squares method based on the first term in (10). So the two remaining parameters and can be selected simultaneously using the leaveoneout CV or GCV method; see Zhao and Xue [4] for more details.
3. Simulation Studies
In this section, we investigate the behavior of the newly proposed method by simulation studies.
3.1. Linear Model with Normally Distributed Covariates
The dimensions of the full model (1) and the submodel (2) are chosen to be and , respectively. We set and , where , a 30dimensional uniform distribution on , and is chosen in the following ways:
Case (I). .
Case (II). .
We assume that , where with or . The error term is assumed to be normally distributed as .
Here we denote the submodel (2) as model (I), the multistep adjusted linear model (5) as model (II), the twostage model (12) as model (III), and the full model (1) as model (IV). We compare mean square errors (MSEs) of the new twostage estimator based on model (III) with the estimator based on model (I), the multistep estimator based on model (II), the SCAD estimator and the least squares estimator based on model (IV). We also compare mean square prediction errors (MSPEs) of the above mentioned models with corresponding estimators.
The data are simulated from the full model (1) with sample size and simulation times . We use the samplebased PCA approximations to substitute 's. The parameter in the SCAD penalty function is set to be and is selected by leaveoneout CV method.
Table 1 reports the MSEs of point estimators on the parameter and the MSPEs of model predictions. From the table, we have the following findings: (1) has the largest MSEs and takes the second place, nearly all the new estimator has the smallest MSEs. (2) When , the MSEs of are smaller than those of , while when they are larger than those of . These show that if the correlation between the covariates is strong, the MSEs of are larger than those of , the multistepadjustment is necessary, so the estimations and model predictions based on twostage model are significantly improved. (3) In case (I) and (II) the simulation results have the similar performance. (4) Similar to the trend of the MSEs of the five estimators, the MSPE of the twostage adjusted model is the smallest among the mentioned five models.

In summary, Table 1 indicates that the twostage adjusted linear model (12) performs much better than the full model, and better than the submodel, the SCADpenalized model and the multistep adjusted model.
3.2. Partially Linear Model with Nonnormally Distributed Covariates
The dimensions of the linear part in the full model (1) and the submodel (2) are chosen to be and , respectively. We set , , , where , , a 10dimensional uniform distribution on .
We assume that the covariates are distributed in the following two ways.
Case (I). , a 51dimensional student distribution with degree of freedom , where
Case (II). , with , , , , , where , , , uniform distributions on and constant . All , , , , , and are independent.
The error term is assumed to be normally distributed as .
Here we denote the submodel (2) as model (I)′, the multistep adjusted additive partially linear model (3) as model (II)′, the twostage model (11) as model (III)′ and the full model (1) as model (IV)′. We compare mean square errors (MSEs) of the new twostage estimator based on model (III)′ with the estimator based on model (I)′, the estimator based on model (II)′ and the least squares estimator based on model (IV)′. We also compare the mean average square errors (MASEs) of the nonparametric estimators of and the mean square prediction errors (MSPEs) of different models with corresponding estimators.
The data are simulated from the full model (1) with sample size and simulation times . We use the samplebased approximations of ICA, see Hyvärinen and Oja [15]. The parameter in the SCAD penalty function is set to be , the number and the parameter is selected by GCV method. We use the standard Fourier orthogonal basis as the basis functions.
Table 2 reports the MSEs of point estimators on the parameter , the MASEs of and the MSPEs of model predictions. From the table, we have the following results: (1) has the largest MSEs, its MSEs are much larger than the MSEs of the other estimators, and the new estimator always has the smallest MSEs. (2) The MASEs of have similar trend to the MSEs of the four estimators, while the differences are not very noticeable. (3) Similar to the MSEs of the estimators, the MSPEs of the twostage adjusted model are the smallest among the four models. (4) In Case (II), the simulation results of models (I)′, (II)′ and (III)′, perform a little better than those in Case (I) because of the correlation structure among the covariates.
 
^{1}“—” denotes the algorithm collapsed and returned no value. 
In summary, Table 2 indicates that the twostage adjusted model (11) performs much better than the full model and the multistep adjusted model, and better than the submodel.
4. Some Remarks
In this paper, the main objective is to consistently estimate the parameter of interest . When estimating the parameter of interest, its bias is mainly determined by the relevant variables, and its variance may be impacted by other variables. Because variable selection much relies on the sparsity of the parameter, when we directly consider the partially linear model, some irrelevant variables with nonzero coefficients may be selected in the final model. This may affect the estimation of the parameter on its efficiency and stability. Thus based on the prespecified submodel, a twostage remodeling method is proposed. In the new remodeling procedure, the correlation among the covariates and the sparsity of the regression structure are fully used. So the final model is sufficiently simplified and conditionally unbiased. Based on the simplified model, the estimation and model prediction are significantly improved. Generally, after the first stage the adjusted model is an additive partially linear model. Therefore, the remodeling method can be applied to partially linear regression model with linear regression model as a special case.
From the remodeling procedure, we can see that it can be directly applied to additive partially linear model, in which the nonparametric function has componentwise additive form. As for general partially linear model with multivariate nonparametric function, we should resort to multivariate nonparametric estimation method. If the dimension of covariate is high, it may be faced with “the curse of dimensionality”.
In the procedure of model simplification, orthogonal series estimation method is used. This is only for technical convenience, because the semiparametric penalized least squares (6) can be easily transformed into parametric penalized least squares (10) and then the theoretic results are obtained. Although other nonparametric methods such as kernel and spline can be used without any essential difficulty, they can not directly achieve this goal. Compared with kernel method, it is somewhat difficult for series method to establish the asymptotic normality result for the nonparametric component under primitive conditions.
Appendix
A. Some Conditions and Proofs
A.1. Regularity Conditions (C1)–(C6)
(C1) has finite nondegenerate compact support, denoted as .(C2) The density function of and of satisfies on its support for for some constants and , and it is continuously differentiable.(C3) and are continuous. For given and , is positive definite, and its eigenvalues are bounded.(C4), , the first two derivatives of are Lipschitz continuous of order one.(C5) as .(C6) for where satisfies for ; for .Conditions (C1)–(C3) are some regular constraints on the covariates and condition (C4) is some constraints on the regression structure as those in Härdle et al. [16]. Conditions (C5)(C6) are assumptions on the penalty function which are similar to those used in Fan and Li [5] and Wang et al. [7].
A.2. Proof for Theorem 1
Let , , , and . Firstly, we shall prove that, , , .
Denote , then we have where with , and .
By the conditions (C1) and (C2), the maximal squared bias of is equal to so . Similarly, . Then, Noticing that , by Zhao and Xue [4], we have So
Similarly, we have By properly choosing a sufficiently large , dominates uniformly in .
Using Taylor expansion, By simple calculations, we have that where and are some positive constants. We can find that is also dominated by uniformly in , and under the condition (C5), we have Hence, by choosing a sufficiently large , , which implies that with probability at least there exists a local minimum of in the ball . Denote the local minimizer as , then With the same argument as above, there exists a local minimum in the ball , and the local minimizer satisfies that For the nonparametric component , noticing that it is known that , so Thus, we get
Similarly, there exists a local minimizer satisfies that . Then we can get .
A.3. Proof for Theorem 2
When , for large by the form of . Then by Theorem 1, it is sufficient to show that: with probability tending to as , for any , it satisfies , satisfies with , and satisfies , for some small , So the minimizer of is obtained at .
In fact,
Under the conditions and , then . So the sign of the derivative is determined by .
So with probability tending to , , . Then under , , .
A.4. Proof for Theorem 4
By Theorems 1 and 2, we know that, as , with probability tending to , attains the local minimum value at and and . Let , and , then From (A.17), it yields that where . Applying the Taylor expansion, we get Furthermore, condition (C5) implies that , and noting that as , then . So from (A.18), it yields
Let , and , then we have