Abstract

In biomedical research, one major objective is to identify risk factors and study their risk impacts, as this identification can help clinicians to both properly make a decision and increase efficiency of treatments and resource allocation. A two-step penalized-based procedure is proposed to select linear regression coefficients for linear components and to identify significant nonparametric varying-coefficient functions for semiparametric varying-coefficient partially linear Cox models. It is shown that the penalized-based resulting estimators of the linear regression coefficients are asymptotically normal and have oracle properties, and the resulting estimators of the varying-coefficient functions have optimal convergence rates. A simulation study and an empirical example are presented for illustration.

1. Introduction

To balance model flexibility and specificity, a variety of semiparametric models have been proposed in survival analysis. For example, Huang [1] studied partially linear Cox models and Cai et al. [2] and Fan et al. [3] studied the Cox proportional hazard model with varying coefficients. Tian et al. [4] proposed an estimation procedure for the Cox model with time-varying coefficients. Perhaps the most appropriate model is the semiparametric varying-coefficient model; its merit includes easy interpretation, flexible structure, potential interaction between covariates, and dimension reduction of nonparametric components models. Recently, a lot of effort has been made in this direction. Cai et al. [2] considered the Cox proportional hazard model with a semiparametric varying-coefficient structure. Yin et al. [5] proposed the semiparametric additive hazard model with varying coefficients.

In biomedical research and clinical trials, one major objective is to identify risk factors and study their risk impacts, as this identification can help clinicians to properly make a decision and increase efficiency of treatment and resource allocation. This is essentially a variable selection procedure in spirit. When the outcomes are censored, selection of significant risk factors becomes more complicated and challenging than with the case where all outcomes are complete. However, important progress has been made in recent literature. For instance, Fan and Li [6] extended the nonconcave penalized likelihood approach proposed by Fan and Li [7] to the Cox proportional hazards model and the Cox proportional hazards frailty model. Johnson et al. [8, 9] proposed procedures for selecting variables in semiparametric linear regression models for censored data. However, these papers did not consider variable selection with functional coefficients; while Wang et al. [10] extended the application of penalized likelihood to the varying-coefficient setting and studied the asymptotic distributions of the estimators, they only focused on the SCAD penalty for completely observed data. Du et al. [11] proposed a penalized variable selection procedure for Cox models with a semiparametric relative risk.

In this paper, we study variable selection for both regression parameters and varying coefficients in the semiparametric varying-coefficient additive hazards model: where is a vector of covariates with a linear effect on the logarithm of the hazard function , is the main exposure variable of interest whose effect on the logarithm of the hazard might be nonlinear, and is a vector of covariates that may interact with the exposure covariate . is the baseline hazard ratio function and is an unspecified smooth function. is a -vector of constant parameters. Let to assure that the model is identifiable. This model covers generally used basic semiparametric model settings. For instance, if and , model (1) reduces to Lin and Ying’s additive hazard model [12] and Aalen’s additive hazard model when the baseline hazard function also equals to zero [13, 14]. Furthermore, when the exposure covariate is time , then model (1) reduces to the partly parametric additive hazard model if [15, 16] and reduces to the time-varying coefficient additive hazard model [17].

To select significant elements of and coefficient functions , we require an estimation procedure for the regression parameters and . However, this poses a challenge because it is impossible to simultaneously get the root- consistent penalized estimators for the nonzero components of by penalizing and . To achieve this goal, we propose the following two-step estimation procedure for selecting the relevant variables. In Step 1, we estimate by using the profile penalized least square function after locally approximating the nonparametric functions and . In Step 2, given the penalized estimator of in Step 1, we develop a penalized least square function for by using the basis expansion. We will demonstrate that the proposed estimation procedures have oracle properties, and all penalized estimators of nonzero components can achieve their optimal convergence rate.

This paper is organized as follows. Section 2 introduces the SCAD-based variable selection procedure for the parametric components and establishes the asymptotic normality for the resulting penalized estimators. Section 3 discusses a variable selection procedure for the coefficient functions by approximating these functions by using the B-spline approach. The simulation studies and an application of the proposed methods in a real data example are included in Sections 4 and 5, respectively. The technical lemmas and proofs of the main results are given in the appendix.

2. Penalized Least Squares Based Variable Selection for Parametric Components

Let denote the potential failure time, let denote the potential censoring time, and let denote the observed time for the th individual, . Let be an indicator which equals 1 if is a failure time and 0 otherwise. Let represent the failure, censoring, and covariate information up to time for the th individual. The observed data structure is . Assume that and are conditionally independent given covariates and that the observation period is , where is a constant denoting the time for the end of the study.

Let denote the counting process corresponding to and let . Let the filtration be the history up to time ; that is, ,  . Write . Then is a martingale with respect to . For ease of presentation, we drop the dependence of covariates on time. The methods and proofs in this paper are applicable to external time-dependent covariates [18].

Here we use the techniques of local linear fitting [19]. Suppose that and are smooth enough to allow Taylor’s expansion as follows: for each which is the support of and for in a neighborhood of , Then we can approximate the hazard ratio function given in (1) by where , , and .

Let be a diagonal matrix with the first diagonal elements 1 and the rest . If is given, then by using the counting-process notation, similarly to [5], we can obtain an estimator of at each by solving the estimating equations whose objective function is given as follows: where with being a kernel function and being a bandwidth, and . To avoid the technicality of tail problems, only data up to a finite time point are frequently used.

Denote the solution of by , which can be expressed as follows: Here and below, for any vector and for .

As a consequence, we can estimate and presuming that is known:

2.1. Variable Selection

Notice that if is given, then from (7) we can estimate by Given the observed data and based on (7) and (8), , , and the cumulative function can be approximately expressed as functions of . These expressions inspire us to estimate by minimizing the following objective function subject to : Accordingly, our penalized objective function is defined as follows: Thus, minimizing with respect to results in a penalized least squares estimator of .

Let and . Now we establish the oracle properties for the resulting estimators. Let be the true value of , the first subvector of length , containing all nonzero elements of and .

We introduce an additional notation as follows. Let . For , let ,    ,   . Denote where is the density of . Denote   and , where For , , let , and ,   for any vector .

Theorem 1. If Assumptions (A.i)–(A.vi) are satisfied and as , then there exist local minimizers of such that .

Remark 2. Theorem 1 indicates that by choosing a proper , which leads to , there exists a root- consistent penalized least squares estimator of .

Theorem 3. Assume that Under Assumptions (A.i)–(A.vi), if and as , then the local root-n consistent local minimizer in Theorem 1 with probability tending to 1 satisfies(1) ; and(2)asymptotic normality where and are the first submatrix of n and , respectively, and

3. Variable Selection for the Varying Coefficients

When some variables of are not relevant in the additive hazard regression model, the corresponding functional coefficients are zero. In this section, by using basis function expansion techniques, we try to estimate those zero functional coefficients as identically zero via nonconcave penalty functions.

For a simplified expression, we denote as and . Hence, we can rewrite the additive hazard regression model (1) as where and . Assume is a given root- consistent estimator of . From the arguments in Section 2, such an is available.

3.1. Penalized Function

We approximate each element of by the basis expansion; that is, for . This approximation indicates that the hazard function given in (16) can be approximated as follows: Let and . Write and . Then, by using similar arguments as in Section 2, we estimate the basic hazard function by and estimate by minimizing the following least square function with respect to : Suppose that ,   , as mentioned before. We are trying to correctly identify these zero functional coefficients via nonconcave penalty functions, that is, via minimizing the following penalized least square function with respect to : where and with .

Let be the minimizer of penalized least square function (21) and define .

Remark 4. Various basis systems, including Fourier basis, polynomial basis, and B-spline basis, can be used in the basis expansion. We focus on the B-spline basis and examine asymptotic properties of .

3.2. Asymptotic Properties

Define and let be the minimizer of in (20). Hence, we can obtain by resolving is the derivative of with respect to , the usual score function of without the penalty.

For any square integrable functions and on , let denote the norm of on , and define as the distance between and . Let denote all functions that have the form ,   for B-spline base. and . Denote .

Let and .

Theorem 5. Suppose that (13) and Assumptions (A.i) and (A.vii)–(A.iv) hold. If and as , then one has(a) ,   , with probability approaching ; and(b) ,   .

Remark 6. Let be a space of splines with a degree no less than 1 and with equally spaced interior knots, where ,   . Note that [20, Theorem 6.21]. Hence, if , Theorem 5 implies that ,   .

Remark 7. Suppose that ,   . Because , which implies that for the hard thresholding and SCAD penalty function, the convergent rate of penalized least square estimator is . This is the optimal rate for nonparametric regression [21] if . However, for the penalty, ; hence is required to lead to the optimal rate. On the other hand, the oracle property in part (b) of Theorem 5 requires , which contradicts . As a consequence, for the penalty, the oracle property does not hold.

Next, we demonstrate the asymptotic normality of . First, we analyze , that is, the penalized estimator of . Let , , and denote the selected columns of , , and , respectively, corresponding to . Similarly, let denote selected diagonal blocks of ,   . Define and . Then by part (a) of Theorem 5 and (21), is the local solution of Recall that is root- consistent. It follows from (23) that

Thus we can obtain the following theorem.

Theorem 8. Suppose Assumptions (A.i) and (A.vii)–(A.iv) hold, , if and as . For any , let and be the conditional means of and given . We then have where , , and .

4. Simulation Studies

We carried out two sets of simulation studies to examine the finite-sample properties of the proposed methods. was generated from a uniform distribution over and was selected via cross-validation [7]. We set and reported sampling properties based on 200 replications. We consider two scenarios: (I) the performance of the regression coefficient estimates. Without loss of generality, we set . We generated failure times from the partially linear additive hazards model (1) with , , , and . Covariate was generated from a uniform distribution over , was a Bernoulli random variable taking a value of 0 or 1 with a probability of , were generated from uniform distributions over and , respectively, and was generated from a normal distribution with mean zero, variance 0.25, and covariance 0.25. We generated the censoring time from a uniform distribution over . We took to yield an approximate censoring rate of and to yield an approximate censoring rate of . We used sample size . We selected the optimal bandwidth by using the method of [22] and found that the value is approximately . We present the average number of zero coefficients in Table 1, in which the column labeled “correct” presents the average restricted only to the true zero coefficients, while the column labeled “incorrect” depicts the average of coefficients erroneously set to 0. We also report the standard deviations (SD), the average standard errors (SE), and the coverage probability (CP) of the 95% confidence interval for the nonzero parameters in Table 2.

As shown in Tables 1 and 2, SCAD, HARD, and perform well and select about the same correct number of significant variables. The coverage probability (CP) based on SCAD and HARD, however, is better than that based on .

In scenario (II), we focused on functional coefficients . We set , , and . We generated failure times from the partially linear additive hazards model (1) with , , and . Both covariates and were generated from a uniform distribution over , respectively, was generated from a normal distribution with a mean of , variance and covariance 0, was a Bernoulli random variable taking a value of 0 or 1 with a probability of 0.5, and was uniform over . In simulation II, we took to yield an approximate censoring rate of and to yield an approximate censoring rate of . We used sample size . The average number of zero functional coefficients based on SCAD and HARD is reported in Table 3, and fitted curves of nonzero functional coefficients are presented in Figures 1 and 2 based on SCAD and HARD, respectively.

Notice that penalized estimates of functional coefficients are based on the results in Step 1. Because the oracle property does not hold for the penalty according to Remark 7, we did not take penalty into account.

We can see from Table 3 and Figures 1 and 2 that the penalized spline estimation of the functional coefficient that we proposed performed well. Furthermore, for the varying coefficient part, the “correct” numbers of significant variables selected based on SCAD and HARD penalties are close. In comparing the results between different censoring rates, it is also shown that our method is extremely robust against the censoring rate.

5. Real Data Example

In this section, we apply the proposed method to a SUPPORT (Study to Understand Prognoses Preferences Outcomes and Risks of Treatment) dataset [23]. This study was a multicenter study designed to examine outcomes and clinical decision making for seriously ill-hospitalized patients. One thousand patients suffering from one of eight different diseases, including acute respiratory failure (ARF), chronic obstructive pulmonary disease (COPD), congestive heart failure (CHF), cirrhosis, coma, colon cancer, lung cancer, and multiple organ system failure (MOSF) were followed up to 5.56 years. 672 observations were collected, including age, sex, race, coma score (scoma), number of comorbidities (num.co), charges, average of therapeutic intervention scoring system (avtiss), white blood cell count (wbc), heart rate (hrt), respiratory rate (resp), temperature (temp), serum bilirubin level (bili), creatinine (crea), serum sodium (sod), and imputed ADL calibrated to surrogate (adlsc). For more details about this study, refer to [23].

We suppose that binary and multinary risk factors have constant coefficients, which yields a 12-dimensional parameter after transforming multinary variables into binary. Conversely, other risk factors are supposed to have varying coefficients, that is, a 13-dimension functional coefficient. Here the main exposure variable corresponds to age. Censoring rate of 672 patients is 32.7%. The model we considered is where ,   , denote binary variables of the th patient and , denote the others of the th patient. Here, baseline hazard function is the failure risk of patients who are white females suffering from MOSF with other variables all equal to zero.

By fitting model (26) with a SCAD penalty, the parameter of “other race” is identified as zero, and functional coefficients of num.co, charges, wbc, hrt, temp, crea, sod, adlsc, and are identified as zero functions. Identified significant risk factors and results of their parametric or functional coefficients are shown in Figure 3 and Table 4.

The identified zero parameter of “other race” shows that, besides Asian, black, and Hispanic, “other race” has no significant different impact on risk in contrast with white people. Table 4 demonstrates that Asians have the lowest failure probability, as do people who are suffering from MOSF. Conversely, coma is the most dangerous among these eight different diseases. Figure 3 shows the fitted coefficients and their 95% pointwise confidence intervals of the five risk factors which were identified as significant.

Appendices

In this appendix, we list assumptions and outline the proofs of the main results. The following assumptions are imposed.

A. Assumptions

(A.i)The density of is continuous, has compact support , and satisfies .(A.ii) and have continuous bounded second derivatives on .(A.iii)The density function is bounded and symmetric and has a compact bounded support. .(A.iv) , and as .(A.v)The conditional probability is equicontinuous in the arguments on the product space .(A.vi) and are bounded away from zero on the product space , and are continuous on , is nonsingular for all , and are positive-definite.(A.vii)The eigenvalues of the matrix are uniformly bounded away from 0.(A.viii) .(A.ix) .

B. Technical Lemmas

Write . Before we prove the theorems, we present several lemmas for the proofs of the main results. Lemma B.1 will be used for the proofs of Theorems 1 and 3. Lemma B.5 establishes the convergence rate for the spline approximation of nonparametric functions. Its proof can be finished in a similar way as in [24].

Write for functions and .

Lemma B.1. Assume that is equicontinuous in its arguments and , is equicontinuous in its argument , and that is equicontinuous in the argument . Under Assumptions (A.i)–(A.v), one has for each , where for .

Its proof can be finished using the similar arguments in Lemma 1 of [25].

Lemma B.2. Let . Assume that (13) and Assumptions (A.i)–(A.vi) hold. We then have where .

Proof. Because , we have with . By a similar discussion in [5], we can prove that the biases of and are the order of at each point , respectively.
Furthermore, recall that with and , where with for and .
It follows from Lemma B.1 that for each . This implies that uniformly hold on . Hence it follows from Assumption (A.iv), the arguments similar to the proof of Theorem  3 in [5] and the martingale central limit that A further simplification indicates that can be expressed as Thus (B.3) holds.

Lemma B.3. Assume that (13) and Assumptions (A.i)–(A.vi) hold. If , as , then with probability tending to 1, for any given satisfying , and any given constant ,

Proof. It suffices to prove that, for any given satisfying , any given constant , and each ,   , hold with probability tending to 1. Notice that . Then for each in a neighborhood of from Lemma B.2 we have By similar algebra with the proof of Lemma B.2, we know that is positive-definite with probability tending to 1 from Assumption (A.vi) and Lemma B.1. Hence for , by the fact of , we have and then As a result, it follows from (13) and , as that for each , which indicates that (B.11) holds with probability tending to 1.

Lemma B.4. Suppose Assumptions (A.i) and (A.vii)–(A.ix) hold, then there are positive constants and such that, except in an event whose probability tends to zero, all the eigenvalues of between and , and consequently , are invertible.

Proof. According to Lemmas and of [24], except in an event whose probability tends to zero, . Note that by Assumption (A.ii). Hence, Lemma B.4 holds.

In the rest of this paper, we denote by . Let and . Then and are the mean of and conditioning on , respectively.

Lemma B.5. Suppose Assumptions (A.ii), (A.viii), and (A.ix) hold. We then have , as , , , and .

Proof. By Assumptions (A.ii) and (A.viii), [20, Theorem 6.27]. Hence, it follows from Assumption (A.iv) that and as . By the triangle inequality, . Note that Thus, according to , which can be similarly proved by Lemma in [24], we have . On the other hand, by the properties of B-spline basis functions, we can prove that by the similar argument of Lemma in [24]. Thus, Lemma B.5 holds.

Lemma B.6. Suppose (13) and Assumptions (A.i) and (A.vii)–(A.iv) hold. If and , as , we then have .

Proof. Using the properties of B-spline basis functions (Section of [24]), we have which we sum over to obtain Let . It suffices to show that, for any given , there exists a large enough such that which implies can reach a local minimum in the ball with a probability of at least . Thus, there exists a local minimizer satisfying . Let ; then by the Taylor expansion and the definition of we have It follows from Lemmas B.4 and B.5 that We first examine the second term on the right-hand side of (B.23). A direct calculation with the Taylor expansion and using yields that Here, . Thus, combining that with (B.23) yields
Notice that and as . Then by choosing a sufficiently large , the first term dominates the second and the third terms on the right-hand side of (B.25). On the other hand, according to (13), we have . Hence, by choosing a sufficiently large , (B.21) holds. This completes the proof of Lemma B.6.

C. Proof of Theorem 1

Let . It is sufficient to prove that for any given , there exists a large enough constant such that which means can reach a local minimum in the ball with a probability of at least . Thus, there exists a local minimizer satisfying .

Now we are going to prove (C.1). Note that Because , it similarly follows from (A.iv) that Hence, we can obtain that Notice that , . Hence, by choosing a large enough constant , uniformly in . Using the discussion similar to that in [7], the last two terms on the right-hand side of (C.4) are bounded by . It then follows from that when is sufficiently large, uniformly holds in . Hence, (C.1) holds.

D. Proof of Theorem 3

It follows from Lemma B.3 that part (1) holds. Now we are going to prove part (2). From part (1) we see that the local minimizer of has the form of and satisfies . It follows from Taylor’s expansion that Then by some regular calculations we can obtain It follows from the proofs of Lemmas B.2 and B.3 that where for ,   . Lemma B.1 and Assumptions (A.v) and (A.vi) yield Next we consider . Using the martingale central limits theorem, we can prove that is asymptotically normally distributed with a mean of zero and covariance . Thus we obtain that Hence, by using Slutsky’s Theorem, it follows from (D.2)–(D.5) that

E. Proof of Theorem 5

To prove part (a) of Theorem 5, we use proof by contradiction. Suppose that for an sufficiently large there exists a constant , such that with a probability of at least there exists a such that . Then . Let be a vector constructed by replacing with 0 in . Then by the definition of , It then follows from Lemmas B.4B.6 that the first term on the right-hand side of (E.1) is an order of . Notice that . Hence, Because and as , (13) implies that with probability tending to 1. This contradicts the fact that . Thus, part (a) holds.

Lemmas B.5 and B.6 yield the proof of part (b).

F. Proof of Theorem 8

Note that is the conditional mean of (24) given all observed covariates and ,   . It follows from (24) that the expression given in (F.1) equals zero. Hence, together with (24), we have By using similar quadratic approximation in [10], we have Finally, by using the arguments similar to the proof of Theorem 4.1 of [26] and the proof of [24], (F.2) and the martingale central theory, we know that holds for any vector with dimension and components not all 0. Then for any -dimension vector whose components are not all 0 and for any given in , choosing yields which implies (25) holds.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

Ma’s research was supported by the National Natural Science Foundation of China (NSFC) Grant no. 11301424 and the Fundamental Research Funds for the Central Universities Grant no. JBK120405.