Penalized Quadratic Inference Function-Based Variable Selection for Generalized Partially Linear Varying Coefficient Models with Longitudinal Data
Semiparametric generalized varying coefficient partially linear models with longitudinal data arise in contemporary biology, medicine, and life science. In this paper, we consider a variable selection procedure based on the combination of the basis function approximations and quadratic inference functions with SCAD penalty. The proposed procedure simultaneously selects significant variables in the parametric components and the nonparametric components. With appropriate selection of the tuning parameters, we establish the consistency, sparsity, and asymptotic normality of the resulting estimators. The finite sample performance of the proposed methods is evaluated through extensive simulation studies and a real data analysis.
Identifying the significant variables is of great significance in all regression analysis. In practice, a number of variables are available for an initial analysis, but many of them may not be significant and should be excluded from the final model in order to increase the accuracy of prediction. Various procedures and criteria, such as stepwise selection and subset selection with Akaike information criterion (AIC), Mallows Cp, and Bayesian information criterion (BIC), have been developed. Nevertheless, these selection methods suffer from expensive computational costs. Many shrinkage methods have been developed for the purpose of computational efficiency, e.g., the nonnegative garrote , the LASSO , the bridge regression , the SCAD , and the one-step sparse estimator . Among those, the SCAD possesses the virtues of continuity, unbiasedness, and sparsity. There are a number of works on the SCAD estimation methods in various regression models, e.g., [6–9]. Zhao and Xue  proposed a variable selection method to select significant variables in the parametric components and the nonparametric components simultaneously for the varying coefficient partially linear models (VCPLMs).
On the other hand, longitudinal data occurs frequently in biology, medicine, and life science, in which it is often necessary to make repeated measurements of subjects over time. The responses from different subjects are independent, but the responses from the same subject are very likely to be correlated. This feature is called “within-cluster correlation”. Qu et al.  proposed a method of quadratic inference functions (QIFs) to treat the longitudinal data. The QIF can efficiently take the within-cluster correlation into account and is more efficient than the generalized estimating equation (GEE)  approach when the working correlation is misspecified. The QIF approach has been applied to many models, including varying coefficient models (VCM) [12, 13], partially linear models (PLM) , varying coefficient partially linear models (VCPLMs) , and generalized partially linear models (GPLM) . Wang et al.  proposed a group SCAD procedure for variable selection of VCM with longitudinal data. More recently, Tian et al.  proposed a QIF-based SCAD penalty for the variable selection for VCPLM with longitudinal data.
As introduced in Li and Liang , the generalized partially linear varying coefficient model (GPLVCM) possesses the great flexibility of a nonparametric regression model and provides the explanatory power of a generalized linear regression model, which arises naturally due to categorical covariates. Many models are the special case of GPLVCM, e.g., VCM, VCPLM, PLM, and GLM. Li and Liang  studied variable selection for GPLVCM, where the parametric components are identified via the SCAD but the nonparametric components are selected via a generalized likelihood ratio test instead of shrinkage. In this paper, we extend the QIF-based group SCAD variable selection procedure to GPLVCM with longitudinal data, and the B-spline methods are adopted to approximate the nonparametric component in the model. With suitable chosen tuning parameters, the proposed variable selection procedure is consistent, and the estimators of regression coefficients have oracle property, i.e., the estimators of the nonparametric components achieve the optimal convergence rate, and the estimators of the parametric components have the same asymptotic distribution as that based on the correct submodel.
The rest of this paper is organized as follows. In Section 2, we propose a variable selection procedure for the GPLVCM with longitudinal data. Asymptotic properties of the resulting estimators and an iteration algorithm are presented in Section 3. In Section 4, we carry out simulation studies to assess the finite sample performance of the method. A real data analysis is given in Section 5 to illustrate the proposed methodology. The details of proofs are provided in the appendix.
2.1. GPLVCM with Longitudinal Data
In this article, we consider a longitudinal study with subjects and observations over time for the th subject () for a total of observations. Each observation consists of a response variable and the predicator variables , where and is a scalar. We assume that the observations from different subjects are independent, but those within the same subject are dependent. The generalized varying coefficient partially linear model (GPLVM) with longitudinal data takes the form where is the expectation of when , , and are given, is an unknown regression coefficient vector, is a known smooth link function, and is a unknown monotonic smooth function vector. Without loss of generality, we assume .
We approximate by B-spline basis functions with the order of , where and is the number of interior knots, i.e., where is a vector of unknown regression coefficients. Accordingly, is approximated by where and “” is the Kronecker product. We use the B-spline basis functions because they are numerically stable and have bounded support . The spline approach also treats a nonparametric function as a linear function with the basis functions as pseudodesign variables, and thus, any computational algorithm for the generalized linear models can be used for the GPLVCMs.
To incorporate the within-cluster correlation, we apply the QIFs to estimate and , respectively. Denote , we define the extended score as follows: where , is the marginal variance matrix of subject , and are the base matrices to represent the inverse of the working correlation matrix in GEE approach. Following Qu et al. , we define the quadratic inference functions to be where . Note that depends on . The QIF estimate is then given by
2.2. Penalized QIF
In real data analysis, the true regression model is always unknown. An overfitted model lowers the efficiency of estimation while an underfitted one leads to a biased estimator. A popular approach to identify the relevant predictors while estimating the nonzero parameters and functions in model (1) simultaneously is to exert some kind of “penalty” on the original objective function. Here, we choose the smoothly clipped absolute deviation (SCAD) penalty because it has several advantages such as unbiasedness, sparsity, and continuity. The SCAD-penalized quadratic inference function (PQIF) is defined as follows: where and is the SCAD penalty function, where the derivative is defined as where ; here, we choose as in .
Note that This group-wised penalization ensures that the spline coefficient vector of the same nonparametric component is treated as an entire group in model selection.
Denote to be the penalized estimator obtained by minimizing the penalized objective function of (7). Then, is the estimator of the parameter and the estimator of the nonparametric function is calculated by , where .
3. Asymptotic Properties
3.1. Oracle Property
We next establish the asymptotic properties of the resulting penalized QIF estimators. We first introduce some notations. Let and denote the true values of and . In addition, is the spline coefficient vector from the spline approximation to . Without loss of generality, we assume that and , i.e., only the first component of is nonzero. Similarly, we assume that and , i.e., only the first component of is nonzero. For convenience and simplicity, let denote a positive constant that may have different values at each appearance throughout this paper and denote the modulus of the largest singular value of matrix or vector . Before the proof of our main theorems, we list some regularity conditions used in this paper.
Assumption (A1). The spline regression parameter is identifiable, that is, is the spline coefficient vector from the spline approximation to . In addition, there is a unique satisfying , where is the parameter space.
Assumption (A2). The weight matrix converges almost surely to a constant matrix , where is invertible.
Assumption (A3). The covariate matrices and , , satisfy and .
Assumption (A4). The error satisfies , and there exists a positive constant such that .
Assumption (A5). All marginal variances and .
Assumption (A6). is a bounded sequence of positive integers.
Assumption (A7). is th continuous differentiable on , where .
Assumption (A8). The inner knots satisfy where .
Assumption (A9). The link function is th continuous differentiable and for some .
Assumption (A10). as , where
Theorem 1 indicates that the estimator of nonparametric components achieve the optimal convergence rate.
Theorem 1. Assume that Assumptions (A.1)–(A.10) hold and the number of knots , then Furthermore, under suitable condition, Theorem 1 shows that the penalized QIF estimator has the sparsity property.
Theorem 2. Assume that the conditions in Theorem 1 hold and as , with probability approaching 1,
Theorems 1 and 2 indicate that with the tune parameter being suitably chosen, the proposed selection method possesses model selection consistency. Next, we establish the asymptotic property for the estimator of the nonzero parametric components. Let and let and denote their true value, respectively. In addition, let and denote the spline coefficient vector of and , respectively, and let and denote their correspondent covariate. Let , and where , is a block matrix with its block taking the form Theorem 3 states that is asymptotically normally distributed.
Theorem 3. Suppose that Assumptions (A.1)–(A.9) hold and the number of knots , then where and represents the convergence in distribution.
3.2. Selection of Tuning Parameters
Theorems 1–3 imply that the proposed variable selection procedure possessed the oracle property. However, this attractive feature relies on the choice of tuning parameters . The popular criteria to choose include cross-validation, generalized cross-validation, AIC, and BIC. Wang et al.  suggested using BIC for the SCAD estimator in linear models and partially linear models and proved its model selection consistency property, i.e., the optimal parameter chosen by BIC can identify the true model with probability tending to one. Tian proved that for partially linear models. Hence, we adopt BIC to choose the optimal . Following [19–21], we simplify the tuning parameters as where and are the unpenalized QIF estimates. Consequently, the original two-dimensional problem becomes a univariate problem about , which can be selected according to the following BIC-type criterion: where is the regression coefficient estimated by minimizing the penalized QIF in (2.8) for a given and is the number of nonzero coefficients of and . Thus, the tuning parameter is obtained by
From Theorem 4 of Tian et al. , the BIC tuning parameter selector enables us to select the true model consistently.
3.3. An Algorithm Using Local Quadratic Approximation
Based on Fan and Li’s local quadratic approximating approach , we propose an iterative algorithm to minimize the PQIF (7). Similar with Tian et al. , we choose the unpenalized QIF estimator as the initial estimator. Let be the value of at the th iteration. If (or ) is close to 0 (or ), i.e., (or ) with some small threshold value , then we set (or ). We consider in our simulations.
Suppose , for , and , for , and , where are the nonzero parametric components and . Similarly, let , where and correspond to zero functions and zero functions, respectively. Let denote a vector which has the same length and same partition with .
For the parametric term, if , the penalty function at is approximated by
Similarly, to the nonparametric component, if , the penalty function at is approximated by where is the first-order derivative of the penalty function . This leads to the local approximation of the PQIF by a quadratic function: where , , with , and
Minimizing the quadratic function (22), we obtain . The Newton-Raphson method then iterates the following process to convergence:
4. Simulation Studies
4.1. Assessing Rule
In this section, we conduct a simulation study to assess the finite sample performance of the proposed procedures. Following , the performance of estimator will be assessed by the generalized mean square error (GMSE), which is defined as
The performance of estimator will be assessed by the square root of average square errors (RASE) where are the grid points where the function is evaluated. In our simulation, is used.
To assess the performance of the variable selection, we use “” to denote the average number of zero regression coefficients that are correctly estimated as zero and use “IC” to denote the average number of nonzero regression coefficients that are erroneously set to zero. The more closer the value of “” to the number of true zero coefficient in the model and the more closer the value of “IC” to zero, the better the performance of the variable selection procedure is.
In our simulations, we use the sample quantiles of as knots and take the number of internal knots to be 3, that is, . This particular choice is consistent with the asymptotic theory in Section 3 and performs well in the simulations. For each simulated dataset, the proposed estimation procedures for finding out penalized QIF estimators with SCAD and LASSO penalty functions are considered. The tuning parameters for the penalty functions are chosen by BIC from 50 equispaced grid points in . For each of these methods, the average of zero coefficients over the 500 simulated datasets is reported.
4.2. Study 1 (Partial Penalty)
Consider a Bernoulli response where , and are drawn independently from . Response variable with compound symmetry correlation structure (CS) is generated according to Oman . In our simulation study, we consider and 0.75, representing weak and strong correlations, respectively. In some situations, we prefer not to shrink some certain components in the variable selection procedure when some kind of prior information is available. Partial penalty arises naturally for such case. In this example, we only exert penalty on the parametric component, i.e., coefficient . In this situation, the PQIF (7) becomes
Tables 1 and 2 show that the performance of the proposed variable selection approach improves as increases, e.g., the number of correctly recognized zero coefficient increases to the number of true zero coefficient in the model and the GMSE of decreases as increases. In addition, the RASE of also decreases as increases, which means the estimated curve of fits better to the true line of when the sample size increases. Moreover, the SCAD penalty method outperforms the LASSO penalty ones in the sense of correct variable selection rate, which significantly reduces the model uncertainty and complexity.
4.3. Study 2 (Fixed-Dimensional Setup)
In this example, we generate data from the following model: where and with , and come from a multivariate normal distribution with mean zero, marginal variance 1 and correlation coefficient , and . Response variable with compound symmetry correlation structure (CS) is generated by the same method as study 1 and we also consider and 0.75, representing weak and strong correlations, respectively. We generated 500 datasets for each pair of . The results are also reported in Tables 3 and 4.
Table 3 reports the variable selection for the parametric components; it shows that the performances become better and better as increases, e.g., the number of correctly recognized zero coefficients, which is denoted as values in the column labeled “,” becomes more and more closer to the true number of zero regression coefficients in the model. At the same time, the GMSE decreases steadily as increases. Table 4 shows that, for the nonparametric components, the performances of the proposed variable selection method are similar to those of the method for the parametric components. As increases, the RASE of the estimated nonparametric function also becomes smaller and smaller. This reflects that the estimate curves fit better to the corresponding true line as the sample size increases. Moreover, the SCAD penalty method outperforms the LASSO penalty ones in the sense of correct variable selection rate, which significantly reduces the model uncertainty and complexity.
To study the influence of misspecified correlation structure to the proposed approach, we perform variable selection when the working correlation structure is specified to be CS and first-order autoregressive (AR-1), respectively. The result is listed in Table 5. It is known that the QIF estimator is insensitive to misspecification in correlation structure. Table 5 shows that the proposed variable selection procedure gives similar results even when the correlation structure is misspecified. This indicates that our method is robust.
4.4. Study 3 (High-Dimensional Setup)
In this example, we discuss how the proposed variable selection procedure can be applied to the “large , diverging ” setup for longitudinal models. We consider the high-dimensional setup of study 2. In this simulation, we take . The true coefficient vector is , where and are defined in study 2. The other settings are the same with study 2. The results are reported in Table 6. It is easy to see that the proposed variable selection procedure is able to correctly identify the true model and works well in the “large , diverging ” setup.
5. Application to Infectious Disease Data
We apply the proposed method to analyze an infectious disease data (indon.data), which has been well analyzed by many authors, such as [16, 23–27]. In this study, a total of 275 preschool children were examined every three months for 18 months. The response is the presence of respiratory infection (, ). The primary interest is in studying the relationship between the risk of respiratory infection and vitamin A deficiency (, ).
In our study, we consider the following GPLVCM model where is age, is vitamin A deficiency, are the seasonal cosine and seasonal sine variables, respectively, which indicate the season when those examinations took place, is gender (, ), is height, is stunting status (, ), and is the square of height. The with-cluster correlation structure is assumed to be exchangeable, i.e., compound symmetric. This structure is also used in [16, 26, 27].
We apply the proposed QIF-based group SCAD variable selection procedure to the above model and recognize five nonzero coefficients and one nonzero function , where , , , , and . The results are generally consistent with those previous studies, but our results show that the height has no significant impact on the infectious rate and can be removed from the model. Figure 1 reports the curve of baseline age function estimated by QIF-based group SCAD that is estimated by QIF and that is estimated by QIF-based SCAD partial penalty to in , where the GPLM without the varying coefficient term is used. Figure 1 implies that the probability of having respiratory infection increases at the very early stage, then decreases steadily, and declines dramatically when the age is over 5.5 years old. This also coincides with previous results [16, 26, 27].
6. Conclusion and Discussion
We proposed a QIF-based group SCAD variable selection procedure for the generalized partially linear varying coefficient models with longitudinal data. This procedure can select significant variables in the parametric components and nonparametric components simultaneously. Under mild conditions, the estimators of regression coefficients have oracle property. Simulation studies indicate that the proposed procedure is very effective in selecting significant variables and estimating the regression coefficients.
In this paper, we assume that the dimensions of the covariates and are fixed. Study 3 in simulations shows that the proposed approach still have desired results when the dimensions and go to infinity as . However, when in ultrahigh-dimensional case, the proposed variable selection procedure may not work well anymore. As a future research topic, it is interesting to consider the variable selection for the generalized partially linear varying coefficient models with ultrahigh-dimensional covariates.
A. Proofs of the Main Results
For convenience and simplicity, let denote a positive constant that may have different values at each appearance throughout this paper and denote the modulus of the largest singular value of matrix or vector .
Let , then . Let , , and .
Similarly, let , and ; then, , and .
Let , then . Let Then,
Proof of Theorem 1. Let , and . We first show that for any given , there exists a large constant such that Note that , for all , and , for all , together with Assumption (A1) and , we have By Taylor expansion and Assumption (A4), we have Invoking the proof of Theorem 2 in Zhang and Xue , By choosing a sufficient large , dominates . Similarly, dominates for a sufficient large . Thus (A.3) holds, i.e., with probability at least , there exists a local minimizer that satisfies . Therefore, and . Let and denote the spline coefficient vector from the spline approximation to . From Assumptions (A7) and (A8) and Theorem 12.7 in , we get that . Therefore, Thus, we complete the proof of Theorem 1.
Proof of Theorem 2. According to Theorem 2, in order to prove the first part of Theorem 2, we need only to prove that, for any satisfying and for any satisfying , there exists a certain that satisfies, as , with probability tending to 1:
These imply that the PQIF reaches its minimum at .
Following Lemmas 3 and 4 of , we have According to (8), the expression of the derivative of SCAD-penalized function, it is easy to see that . Together with Assumption (A10), , it is clear that the sign of (A.10) is decided by that of . This implies (A.8) and (A.9) hold. Thus, we complete the proof of the first part.
Similarly, we can prove that with probability tending to 1, . Note that and ; the second part of Theorem 2 is proved. Thus, we complete the proof of Theorem 2.
Proof of Theorem 3. Let and let denote the covariates corresponding to . Denote and to be the first derivatives of the PQIF with respect to and , respectively, i.e., By Theorems 1 and 2, and satisfies that By the Taylor expansion, we have where is between and . Apply the Taylor expansion to , we obtain By Assumption (A10), . Note that as ; therefore, by Lemma 4 of  and through some calculation, we have where is the block of and Similarly, we have where . Hence, Following the proof of Theorem 2 in , we prove (16). Thus, we complete the proof of Theorem 3.
The data can be downloaded from https://content.sph.harvard.edu/xlin/dat/indon.dat.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
The research is funded by the National Natural Science Foundation of China (11571025) and the Beijing Natural Science Foundation (1182008). This support is greatly appreciated.
The R code presented in Word format for the real data analysis is included in the supplementary file. (Supplementary Materials)