Additive Hazard Regression for the Analysis of Clustered Survival Data from Case-Cohort Studies
The case-cohort design is an effective and economical sampling method for large cohort studies, especially when the disease rate is low. Most of the existing literature on case-cohort designs addresses univariate failure time data, yet multivariate failure time data are commonly encountered in biomedical research. In this paper, we propose estimating-equation-based methods for case-cohort designs with clustered survival data. By introducing the observed-event probability, we construct three different weight functions, which lead to three estimating equations and the corresponding parameter estimators. We then establish consistency and asymptotic normality of the proposed estimators. Finally, simulation results show that the proposed estimation procedures have reasonable finite-sample behavior.
In many failure time studies, there is a natural clustering of study subjects, so failure times within the same cluster may be correlated. For example, in a clinical trial the failure times of patients within a family may be correlated because family members share genetic characteristics and environmental factors. Another example is the time to onset of blindness in the left and right eyes of patients with diabetic retinopathy. In such settings, the focus is on the potential clustering in the data: observations from different clusters are independent, while observations within a cluster may be correlated. Interested readers can refer to [1–4] and the references therein.
In epidemiological research, the collection of detailed follow-up information is costly and time consuming, especially when the incidence of disease is low. One may then consider the case-cohort design, proposed by Prentice, which is widely used in large cohort studies. It entails collecting covariate data for all subjects who experience the event of interest in the full cohort and for a random sample from the entire cohort, while serving the same goal of studying risk factors as collecting the full data. Consequently, the case-cohort design has been studied extensively; for example, it has been applied to the Cox proportional hazards model, the additive hazards regression model, the semiparametric transformation model, the accelerated failure time model, and the additive-multiplicative hazards model [10, 11].
However, there is little research on clustered failure time data under the case-cohort design. Lu and Shih applied the marginal proportional hazards model to clustered failure time data under case-cohort sampling and proposed three designs for selecting the subcohort. They constructed estimating equations under the different sampling mechanisms and established the large-sample properties of the resulting estimators. However, the risk set considered by Lu and Shih contains only the individuals in the subcohort. Zhang et al. subsequently incorporated the failure information outside the subcohort into the proportional hazards model and proposed three different estimating equations and parameter estimators; their numerical results show that the proposed approach is effective. Note that the literature above studies clustered case-cohort data through marginal proportional hazards models. In some applications, however, the covariate effects may be additive, and the proportional hazards model is then no longer appropriate. Therefore, in this paper we apply an additive hazards model to clustered data under the case-cohort design: we construct weighted estimating equations, propose estimators of the regression parameters, and prove the asymptotic properties of the estimators.
The remainder of the paper is organized as follows. In Section 2, we describe the proposed estimation procedures. Then, we establish the consistency and asymptotic normality of the resulting estimator in Section 3. Numerical evidence in support of the theory is presented in Section 4. Finally in Section 5, some conclusions are drawn.
2. Model and Estimation Procedures
Suppose the full cohort consists of $n$ independent clusters, and the $i$-th cluster contains $n_i$ correlated subjects. We assume that subjects within the same cluster are exchangeable. Before follow-up begins, a random sample of the entire cohort, called the subcohort, is selected; covariate data are then collected from individuals in the subcohort as well as from those observed to fail in the entire cohort. We consider the following three designs for obtaining the subcohort. Design A: sample individuals from each cluster by Bernoulli sampling; that is, each individual in each cluster has an independent fixed probability of being selected into the subcohort. Design B: sample whole clusters from the full cohort by Bernoulli sampling. Design C: sample clusters from the full cohort by Bernoulli sampling, and then sample subjects by Bernoulli sampling from the selected clusters.
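The three sampling schemes can be sketched in a few lines. The following is an illustrative sketch only (the function name `sample_subcohort` and the probability arguments `p`, `q` are ours, not the paper's); Design A and Design B fall out as the special cases noted below.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_subcohort(cluster_sizes, p=1.0, q=1.0, rng=rng):
    """Per-subject subcohort indicators for Design C: select clusters
    with probability p, then subjects within selected clusters with
    probability q. Design A is p = 1; Design B is q = 1."""
    xi = rng.random(len(cluster_sizes)) < p          # cluster indicators
    return [(rng.random(m) < q) & xi[i]              # subject indicators
            for i, m in enumerate(cluster_sizes)]

# Design A: every cluster kept, subjects sampled with probability 0.2
design_a = sample_subcohort([5, 3, 4], p=1.0, q=0.2)
# Design B: whole clusters sampled with probability 0.2
design_b = sample_subcohort([5, 3, 4], p=0.2, q=1.0)
# Design C: clusters with probability 0.4, then subjects with probability 0.5
design_c = sample_subcohort([5, 3, 4], p=0.4, q=0.5)
```

Under Design B every subject in a cluster shares the cluster's indicator, so selection is all-or-nothing within clusters.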
Note that Design A and Design B are special cases of Design C.
Let $T_{ij}$ and $C_{ij}$ denote the potential failure time and the potential censoring time for the $j$-th subject in the $i$-th cluster, respectively, and let $Z_{ij}(t)$ be a possibly time-dependent covariate vector, restricted to be external. We assume that $T_{ij}$ and $C_{ij}$ are independent conditional on $Z_{ij}$. The observed time is $X_{ij} = \min(T_{ij}, C_{ij})$. Let $\Delta_{ij} = I(T_{ij} \le C_{ij})$ be the failure indicator, $N_{ij}(t) = \Delta_{ij} I(X_{ij} \le t)$ the counting process, and $Y_{ij}(t) = I(X_{ij} \ge t)$ the at-risk indicator.
Let $\xi_i$ denote the indicator for the $i$-th cluster being selected into the subcohort, and let $\eta_{ij}$ denote the indicator for the $j$-th subject in the $i$-th cluster being selected into the sample. Both $\xi_i$ and $\eta_{ij}$ are independent Bernoulli variables, with $P(\xi_i = 1) = p$ and $P(\eta_{ij} = 1) = q$. Under Design A, $\xi_i \equiv 1$ for all $i$; under Design B, $\eta_{ij} \equiv 1$ for all $i, j$.
Suppose that the marginal hazard function of $T_{ij}$ is associated with $Z_{ij}$ through the additive hazards model
$$\lambda_{ij}(t) = \lambda_0(t) + \beta_0^{T} Z_{ij}(t), \qquad (1)$$
where $\lambda_0(t)$ is an unspecified baseline hazard function and $\beta_0$ is a vector of regression parameters.
With full-cohort data, $\beta_0$ in (1) can be estimated by $\hat{\beta}$, the solution to the estimating equation $U(\beta) = 0$, where
$$U(\beta) = \sum_{i=1}^{n} \sum_{j=1}^{n_i} \int_0^{\tau} \{Z_{ij}(t) - \bar{Z}(t)\} \{\mathrm{d}N_{ij}(t) - Y_{ij}(t)\, \beta^{T} Z_{ij}(t)\,\mathrm{d}t\},$$
with $\bar{Z}(t) = \sum_{i,j} Y_{ij}(t) Z_{ij}(t) / \sum_{i,j} Y_{ij}(t)$ and $\tau$ the end of the study period.
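For intuition, the full-cohort additive-hazards estimating equation has a closed-form solution when the covariate is scalar and time-invariant. The sketch below illustrates that special case only (the function name `lin_ying_beta` is ours), not the weighted case-cohort estimator developed in this paper.

```python
import numpy as np

def lin_ying_beta(X, delta, Z):
    """Closed-form solution of the full-cohort additive-hazards
    estimating equation for a scalar, time-invariant covariate.
    X: observed times, delta: failure indicators, Z: covariates."""
    X, delta, Z = map(np.asarray, (X, delta, Z))
    order = np.argsort(X)
    X, delta, Z = X[order], delta[order], Z[order]
    num, den, t_prev = 0.0, 0.0, 0.0
    for k in range(len(X)):
        R = Z[k:]                      # covariates of subjects still at risk
        Zbar = R.mean()
        # integral term: sum over the risk set of (Z - Zbar) * Z on (t_prev, X[k]]
        den += np.sum((R - Zbar) * R) * (X[k] - t_prev)
        if delta[k]:                   # jump term at an observed failure
            num += Z[k] - Zbar
        t_prev = X[k]
    return num / den
```

Because the model is additive, the baseline-hazard contribution cancels in the numerator (the risk-set average of $Z - \bar{Z}$ is zero by construction), which is why the estimator has a closed form.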
For clustered failure time data from case-cohort studies, we consider the observed-event probability $P(\Delta_{ij} = 1)$, which we denote by $\rho$. We propose three procedures to estimate $\beta_0$, based on the same three designs proposed by Zhang et al., except that we consider the additive hazards regression model. We then develop a weighted estimating equation, in which each subject's contribution to the full-cohort estimating equation is multiplied by a weight $w_{ij}$ constructed from the sampling indicators and $\rho$.
By solving the weighted estimating equation, we obtain the corresponding estimator of $\beta_0$.
However, $\rho$ is usually unknown. When the study cohort is well defined, we can estimate $\rho$ by $\hat{\rho}_1$, the full-cohort case proportion, i.e., the number of observed failures divided by the total cohort size. When the study cohort is less well defined, $\hat{\rho}_2$, the subcohort case proportion, is a suitable alternative. The corresponding estimators of $\beta_0$ are written as $\hat{\beta}_1$ and $\hat{\beta}_2$.
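As a concrete illustration of the two case proportions (a sketch with hypothetical names: `delta` is a list of per-cluster failure-indicator arrays, and `in_sub` the matching subcohort-membership arrays):

```python
import numpy as np

def rho_full(delta):
    """Full-cohort case proportion: observed failures / cohort size."""
    d = np.concatenate([np.asarray(x) for x in delta])
    return d.mean()

def rho_sub(delta, in_sub):
    """Subcohort case proportion: failures in the subcohort / subcohort size."""
    d = np.concatenate([np.asarray(x) for x in delta])
    s = np.concatenate([np.asarray(x) for x in in_sub]).astype(bool)
    return d[s].mean()
```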
The cumulative baseline hazard function $\Lambda_0(t) = \int_0^t \lambda_0(u)\,\mathrm{d}u$ can then be consistently estimated by a weighted Aalen–Breslow-type estimator,
$$\hat{\Lambda}_0(t; \hat{\beta}) = \int_0^t \frac{\sum_{i,j} w_{ij}\{\mathrm{d}N_{ij}(u) - Y_{ij}(u)\, \hat{\beta}^{T} Z_{ij}(u)\,\mathrm{d}u\}}{\sum_{i,j} w_{ij} Y_{ij}(u)},$$
where $w_{ij}$ denotes the case-cohort weight.
3. Asymptotic Properties
To derive the asymptotic properties of the proposed estimators, we impose the following regularity conditions: (C1) the cluster-level observations $\{(X_{ij}, \Delta_{ij}, Z_{ij}(\cdot)),\ j = 1, \ldots, n_i\}$, $i = 1, \ldots, n$, are independent and identically distributed across clusters; (C2) $\Lambda_0(\tau) < \infty$ and $P(Y_{ij}(\tau) = 1) > 0$; (C3) the limiting matrix $A$ of the scaled derivative of the estimating function exists and is positive definite; (C4) there exists a constant $K$ such that $\|Z_{ij}(t)\| \le K$ holds almost everywhere.
Theorem 1. Under the regularity conditions (C1)–(C4), $n^{-1/2} U(\beta_0)$ converges in distribution to a mean-zero normal distribution with covariance matrix $\Sigma$.
Theorem 2. Under conditions (C1)–(C4), $\hat{\beta}$ converges in probability to $\beta_0$, and $\sqrt{n}(\hat{\beta} - \beta_0)$ converges in distribution to a mean-zero normal distribution with covariance matrix $A^{-1} \Sigma A^{-1}$.
Theorem 3. Under conditions (C1)–(C4), both $\hat{\beta}_1$ and $\hat{\beta}_2$ (the estimators based on the full-cohort and subcohort case proportions, respectively) converge in probability to $\beta_0$, and each of $\sqrt{n}(\hat{\beta}_1 - \beta_0)$ and $\sqrt{n}(\hat{\beta}_2 - \beta_0)$ converges in distribution to a mean-zero normal distribution with covariance matrices $\Sigma_1$ and $\Sigma_2$, respectively, which account for the extra variation from estimating $\rho$.
Theorem 4. Under the regularity conditions (C1)–(C4), $\hat{\Lambda}_0(t; \hat{\beta})$ converges in probability to $\Lambda_0(t)$ uniformly in $t \in [0, \tau]$. In addition, $\sqrt{n}\{\hat{\Lambda}_0(t; \hat{\beta}) - \Lambda_0(t)\}$ converges weakly to a Gaussian process with zero mean.
4. Numerical Studies
Simulation studies are conducted to examine the finite-sample properties of the proposed estimators. We generate clustered failure time data from $n = 100$ clusters, where the cluster sizes $n_i$ are simulated from a binomial$(50, 0.8)$ distribution, so the expected total number of subjects in the full cohort is about 4000. The covariate takes the values 1 and 0 with probability 0.5 each. The failure times in the $i$-th cluster are simulated from the Clayton-type joint survival function
$$S(t_{i1}, \ldots, t_{in_i}) = \Big\{ \sum_{j=1}^{n_i} S(t_{ij})^{-1/\theta} - (n_i - 1) \Big\}^{-\theta},$$
where $\theta > 0$ is a parameter representing the degree of dependence among the failure times within a cluster; a smaller value of $\theta$ represents stronger correlation.
In our experiments, we choose $\theta = 0.5$ or 2 for the dependence parameter and set the true regression coefficient to a nonzero value or 0. The censoring times are constant and equal to 1, and the resulting observed-event probabilities $\rho$ are 0.45 or 0.63.
For each generated dataset, under Design A, individuals within each cluster are selected into the subcohort by Bernoulli sampling with equal probability 0.2 or 0.15. Under Design B, we select whole clusters by Bernoulli sampling with probability 0.2 or 0.15. Under Design C, we first sample clusters by Bernoulli sampling with probability 0.4 or 0.3 and then sample individuals from the selected clusters by Bernoulli sampling with probability 0.5. Therefore, for each design, we expect approximately 800 or 600 individuals in the subcohort.
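One standard way to simulate from a Clayton-type joint survival function is through its shared gamma-frailty representation, under which a smaller dependence parameter gives stronger within-cluster correlation. The sketch below is our own illustration (the function name `gen_cluster` and the frailty construction are assumptions, not the paper's code); it produces correlated times whose marginal hazard is the additive form `lam0 + beta * Z`, censored at $t = 1$ as in the study.

```python
import numpy as np

def gen_cluster(n_i, theta, beta, lam0, rng):
    """One cluster of correlated failure times with marginal additive
    hazard lam0 + beta*Z and Clayton-type (gamma-frailty) dependence."""
    w = rng.gamma(shape=theta, scale=1.0)       # shared frailty; small theta => strong dependence
    Z = rng.integers(0, 2, n_i).astype(float)   # binary covariate, P(Z=1) = 0.5
    E = rng.exponential(1.0, n_i)               # unit exponentials
    # Inverting the conditional survival given w recovers the marginal
    # survival exp(-(lam0 + beta*Z) * t) after integrating out the frailty.
    T = theta * np.log1p(E / w) / (lam0 + beta * Z)
    X = np.minimum(T, 1.0)                      # administrative censoring at t = 1
    delta = (T <= 1.0).astype(int)
    return X, delta, Z
```

Averaging the conditional survival $\exp\{-w(e^{\lambda t/\theta} - 1)\}$ over the Gamma$(\theta, 1)$ frailty gives exactly $e^{-\lambda t}$, so the marginals match the additive model.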
To assess the performance of the proposed estimators, we also calculate an estimated standard error based on the bootstrap method. Because the failure clusters inside and outside the subcohort have different structures, we conduct the bootstrap sampling as follows. Suppose the failure clusters in the subcohort contain M failures and Q nonfailures, whereas the failure clusters outside the subcohort contain D failures only. Each bootstrap sample is then obtained by sampling, with replacement, D failures from the original D failures outside the subcohort, and M failures and Q nonfailures from the original M failures and Q nonfailures in the subcohort, respectively. The simulation results are based on 1000 replications.
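The stratified resampling scheme just described can be sketched as follows (function and argument names are ours): each of the three strata is resampled with replacement within itself, so every bootstrap sample preserves the D/M/Q structure.

```python
import numpy as np

def stratified_bootstrap(out_fail, in_fail, in_nonfail, rng):
    """One bootstrap replicate: resample each stratum with replacement,
    keeping the stratum sizes D, M, and Q fixed."""
    return (rng.choice(out_fail, size=len(out_fail), replace=True),
            rng.choice(in_fail, size=len(in_fail), replace=True),
            rng.choice(in_nonfail, size=len(in_nonfail), replace=True))
```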
Tables 1 and 2 report the simulation results for the proposed estimators. The notation Bias denotes the difference between the average of the estimates and the true value, SE denotes the empirical standard error of the estimates, SEE denotes the average of the estimated asymptotic standard errors, BSE denotes the estimated standard error based on the bootstrap method, and CP denotes the empirical coverage probability of the nominal 95% confidence interval.
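These summary quantities can be computed directly from the replicate estimates and their estimated standard errors, as in the illustrative helper below (the name `summarize` is ours); BSE would be computed the same way as SEE but from bootstrap standard errors.

```python
import numpy as np

def summarize(estimates, ses, truth):
    """Bias, empirical SE, average estimated SE (SEE), and 95% coverage (CP)."""
    est, ses = np.asarray(estimates), np.asarray(ses)
    bias = est.mean() - truth
    se = est.std(ddof=1)                  # empirical standard error
    see = ses.mean()                      # mean of estimated standard errors
    lo, hi = est - 1.96 * ses, est + 1.96 * ses
    cp = np.mean((lo <= truth) & (truth <= hi))
    return bias, se, see, cp
```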
It can be seen from Tables 1 and 2 that the estimates of both regression parameters are essentially unbiased, and the estimated and empirical standard errors are close to each other. The coverage probabilities of the nominal 95% confidence intervals are reasonable. For Design B in Tables 1 and 2, the values of SEE and CP are slightly underestimated, which is due to the small number of clusters in the subcohort. In addition, as the subcohort size increases to about 800, the performance of the estimators improves markedly. Among the three designs, Design A and Design C are more efficient than Design B, which is consistent with the earlier findings. Comparing SE, SEE, and BSE, the three agree well with one another, and the bootstrap standard errors are similar to the simulated standard errors. The proposed semiparametric estimators work well, so one may use the bootstrap variance estimate for statistical inference in practice.
5. Conclusions
This paper has proposed methods for fitting additive hazards regression models to clustered survival data from case-cohort studies. Risk differences are valuable in medical research: specifically, they quantify the reduction in the number of cases of a disease attributable to a decrease in a particular exposure, and an advantage of additive hazards models is that risk differences between exposure groups can be read directly from the model coefficients. To handle the differing observed-event probabilities, we proposed three parameter-estimation procedures and established the consistency and asymptotic normality of the proposed estimators. The simulation results show that Design A yields more efficient estimators than Design C, and Design C is more efficient than Design B when the subcohort sizes are approximately equal; this can be attributed to differences in the number of sampled clusters in the subcohort.
Appendix. Proofs of Theorems
Proof of Theorem 1. Evaluated at the true value $\beta_0$, one can write the estimating function as the decomposition (A.1).
Then, we have the expansion (A.2). The second term on the right-hand side of (A.2) can be shown to converge to zero; in particular, this holds for each fixed t. The third term on the right-hand side of (A.2) can be written as (A.3). By a functional Taylor expansion, the first term of (A.3) can be written as (A.4); similarly, the second term of (A.3) can be written as (A.5). Combining (A.4) and (A.5), it follows that (A.6) holds, with the component processes defined accordingly. Since the leading term is a zero-mean process, we obtain (A.8) for each component.
Under the regularity conditions, the summands are independent and identically distributed with mean zero and finite variance. It then follows from the multivariate central limit theorem that $n^{-1/2} U(\beta_0)$ converges in distribution to a mean-zero normal distribution, which completes the proof.
Proof of Theorem 2. To prove the consistency of $\hat{\beta}$, we use the inverse function theorem after verifying the following conditions: (i) the derivative of the estimating function exists and is continuous in an open neighborhood of $\beta_0$; (ii) it is positive definite with probability tending to one as $n \to \infty$; (iii) it converges in probability to its limit, uniformly in an open neighborhood of $\beta_0$; (iv) the estimating functions are asymptotically unbiased. One can write the derivative as in (A.9). By the continuity of each component of (A.9), condition (i) is clearly satisfied. Next, (A.9) can be decomposed as in (A.10). Note that the sample average converges uniformly to its limit on $[0, \tau]$; it then follows that the first term on the right-hand side of (A.10) converges in probability as $n \to \infty$, and each of the remaining terms on the right-hand side of (A.10) can be shown to converge to zero as $n \to \infty$. Hence, the remaining conditions hold.
Proof of Theorem 3. We prove the results for $\hat{\beta}_1$ only, since the results for $\hat{\beta}_2$ can be proved similarly. By two Taylor expansions, the score function can be decomposed as in (A.12), where the intermediate points lie on the line segments between the estimators and the corresponding true values, and the remainder terms are defined accordingly.
By the law of large numbers and the consistency of the case-proportion estimator, the remainder terms in the expansion are asymptotically negligible, and by the continuous mapping theorem, the relevant sample averages converge to their limits. Therefore, using (A.13) and (A.15)–(A.19), we obtain an asymptotic representation of the score as a normalized sum of independent mean-zero terms, and the multivariate central limit theorem yields the asymptotic normality in (A.21).
Combining (A.12) and (A.21) with the convergence of the derivative matrix, and applying Slutsky's theorem, the asymptotic normality of $\sqrt{n}(\hat{\beta}_1 - \beta_0)$ follows.
Proof of Theorem 4. One can make the decomposition (A.24). By a Taylor expansion, the first term of (A.24) can be written as (A.25), with the quantities defined in (A.26). Since $\hat{\beta}$ converges in probability to $\beta_0$, we obtain (A.27). The remaining terms of (A.24) can be written as (A.28), whose first term converges uniformly on $[0, \tau]$; the second term of (A.28) follows immediately, and for the last term of (A.28), Theorem 1 yields (A.30). Combining (A.28)–(A.30), we obtain the representation (A.31), with the terms defined in (A.32). It follows from the multivariate central limit theorem that the finite-dimensional distributions of the process converge to those of a zero-mean multivariate normal distribution, and the process is tight.
Hence, by the functional central limit theorem, $\sqrt{n}\{\hat{\Lambda}_0(t; \hat{\beta}) - \Lambda_0(t)\}$ converges weakly to a zero-mean Gaussian process, which completes the proof.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
Liu’s work presented in this paper was supported by the Natural Science Research Project of Universities of Anhui Province (no. KJ2018A0390), and Zhang’s work was supported by the Xulun Scholar Program of Shanghai Lixin University of Accounting and Finance.
References
D. Zeng and D. Y. Lin, “Efficient estimation for the accelerated failure time model,” Journal of the American Statistical Association, vol. 102, no. 480, pp. 1387–1396, 2007.
D. Zeng, D. Y. Lin, and X. Lin, “Semiparametric transformation models with random effects for clustered failure time data,” Statistica Sinica, vol. 18, no. 1, pp. 355–377, 2008.