Abstract

This paper provides new insight into an economical and effective sampling strategy based on the outcome-dependent sampling (ODS) design for large-scale cohort research. First, the importance and originality of this paper lie in exploring how to fit the covariate-adjusted additive hazards model under the ODS design. Second, the paper estimates the distortion function through nonparametric regression, requiring only observation of the confounding factor that drives the distortion. The contaminated covariates are then calibrated, and estimators of the parameters based on the calibrated covariates are proposed. Finally, the paper establishes the consistency and asymptotic normality of the proposed estimators and conducts extensive simulations to evaluate their finite-sample performance. Results from both artificial and real data verify the good performance and practicality of the proposed ODS method.

1. Introduction

Generally, the major cost of studying a large cohort lies in collecting expensive exposure variables, which burdens researchers operating on limited budgets. The simple random sampling design, being costly and time consuming, has therefore been losing ground. To achieve cost-effectiveness, many alternative strategies have been proposed. In the early 1980s, Prentice [1] proposed the notion of the case-cohort design, in which the exposure variables are measured on a simple random sample, called a subcohort, as well as on all the cases that experienced the events of interest. Since then, applications of case-cohort sampling in survival analysis have been reported by Self and Prentice [2], Tsai [3], and Kim et al. [4]. The case-cohort design is an economical and effective sampling technique for rare events. When the censoring rate is relatively low or moderate, the generalized case-cohort design has been developed to further lower research costs: in addition to randomly selecting a subcohort from the entire cohort, the relevant covariate information is collected for only a subset of the failure individuals (e.g., Chen [5], Cai and Zeng [6], and Kang and Cai [7]).

As a matter of fact, the failure-time outcome-dependent sampling (ODS) design is another economical and effective alternative to simple random sampling. Exposure variables are measured on samples from two components, a subcohort and additional supplemental samples (see Chatterjee et al. [8], Zhou et al. [9], and Weaver and Zhou [10]). A wealth of literature has grown around ODS data. For completely observed data, the studies by Zhou et al. [11, 12] and Qin and Zhou [13] offer comprehensive inference methods based on partially linear regression models for data from the ODS design. A key systematic study of fitting the generalized linear model to data obtained from a two-stage ODS design was reported by Yan et al. [14]. For censored data, a detailed examination of the ODS design by Ding et al. [15] estimated the impact of environmental pollutants on women’s subfertility, and a significant discussion of the estimating-equation method with an ODS sampling scheme under Cox’s proportional hazards model was presented by Yu et al. [16].

In clinical trials and biomedical studies, instead of being directly observed, the covariates are sometimes observed only after multiplication by unknown functions of an observable confounder. Regression with such contaminated covariates was originally studied by Şentürk and Müller [17, 18], who showed that the contamination of model covariates cannot be ignored; otherwise, the estimator will be biased and the statistical inference may be misleading. Since then, numerous extensions in various directions have been developed by Şentürk and Müller [19], Cui et al. [20], and Li et al. [21]. For completely observed data, Cui [22] proposed using nonparametric kernel estimation to calibrate the contaminated variables and then estimating the parameters of the covariate-adjusted linear model, and further research by Zhang et al. [23] extended the method to nonlinear models with contaminated variables. A key study by Delaigle et al. [24] derived several nonparametric covariate-adjusted estimators of the conditional mean function. A nonparametric test for covariate-adjusted models was developed by Zhao and Xie [25], who found that the proposed test statistic has the same limiting distribution as if the response and predictors were observed directly. For survival data with censoring, very few studies have investigated survival models with contaminated covariates, and even fewer address the challenge under the ODS design considered here.

We study the following covariate-adjusted additive hazards model under the ODS design in this paper:

λ(t ∣ X, Z) = λ₀(t) + βᵀW = λ₀(t) + αᵀX + γZ,  Z̃ = ψ(U)Z,

where λ₀(t) is the unknown baseline hazard function, β = (αᵀ, γ)ᵀ is the unknown parameter of p dimensions, α and γ are (p − 1)-dimensional and 1-dimensional parameters, respectively, W = (Xᵀ, Z)ᵀ is the p-dimensional covariate, X is the observed (p − 1)-dimensional covariate, Z is the unobservable 1-dimensional covariate, Z̃ is the actually observed 1-dimensional covariable, and ψ(·) is the unknown distorting function of the observable confounding variable U. We focus on nonparametric kernel estimation to obtain an estimator of the distortion function and thereby calibrate the covariate Z. Meanwhile, we weight the contributions of the subcohort and the supplemental sample differently, which yields a weighted estimating equation based on the calibrated covariates. Owing to the ODS design and the covariate-adjustment step, the theoretical development is challenging. To overcome this, we work with an approximation to the weighted estimating equation, which serves as the main basis for deriving the theoretical properties of the proposed estimator.

The rest of the paper is organized as follows. Section 2 describes how data from the ODS design are fitted to the additive hazards model with covariate adjustment. Section 3 establishes the large-sample properties of the proposed estimator, and Section 4 evaluates the finite-sample performance of the proposed method through numerical studies. In Section 5, an empirical analysis of a dataset from a pulmonary exacerbations study shows that the proposed method is practical. Finally, conclusions and directions for future work are summarized in Section 6.

2. Estimation Setup

Suppose that a cohort contains n independent subjects. For the ith (i = 1, …, n) subject, Tᵢ is the failure time and Cᵢ is the censoring time. Then Vᵢ = min(Tᵢ, Cᵢ) is the observed time and Δᵢ = I(Tᵢ ≤ Cᵢ) is the indicator of right censoring. Denote Yᵢ(t) = I(Vᵢ ≥ t), Nᵢ(t) = Δᵢ I(Vᵢ ≤ t), and Wᵢ to be the at-risk process, the counting process, and the time-independent p-dimensional exposure variable, respectively. Denote τ to be the study end time.

The additive hazards model proposed by Lin and Ying [26] is as follows:

λ(t ∣ Wᵢ) = λ₀(t) + βᵀWᵢ, (2)

where λ₀(t) is the unknown baseline hazard function and β is the parameter of p dimensions. If we had access to every subject’s exposure information, the following estimating function would commonly be used for inference on β:

U(β) = Σ_{i=1}^{n} ∫₀^τ {Wᵢ − W̄(t)} {dNᵢ(t) − Yᵢ(t)βᵀWᵢ dt}, (3)

where W̄(t) = Σ_{j=1}^{n} Yⱼ(t)Wⱼ / Σ_{j=1}^{n} Yⱼ(t).
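To make the estimating function concrete, here is a minimal sketch of the closed-form Lin–Ying estimator for right-censored data. The function name lin_ying and the discrete integration over the ordered observed times are our own illustration under the model above, not the paper’s code.

```python
import numpy as np

def lin_ying(time, status, Z):
    """Closed-form Lin-Ying estimator: beta = A^{-1} b, where
    A integrates Y_i(t){Z_i - Zbar(t)}^{x2} over time and
    b sums {Z_i - Zbar(t)} over the jumps of the counting processes."""
    n, p = Z.shape
    order = np.argsort(time)          # process subjects in time order
    A = np.zeros((p, p))
    b = np.zeros(p)
    at_risk = np.ones(n, dtype=bool)  # Y_i(0) = 1 for everyone
    prev = 0.0
    for idx in order:
        t = time[idx]
        Zr = Z[at_risk]               # covariates of the current risk set
        Zbar = Zr.mean(axis=0)        # Zbar(t), constant on (prev, t]
        D = Zr - Zbar
        A += (t - prev) * D.T @ D     # integrate over the interval (prev, t]
        if status[idx] == 1:          # a jump of N_i at an event time
            b += Z[idx] - Zbar
        at_risk[idx] = False          # subject idx leaves the risk set
        prev = t
    return np.linalg.solve(A, b)
```

With data simulated from a constant-baseline additive hazards model, the estimate should be close to the true regression coefficient.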

Under the ODS design, the range of the failure times of the cases is divided into disjoint strata by positive constants satisfying . We first draw an SRS of individuals from the cohort, and let be the indicator taking value 1 if the ith subject is selected into the SRS and 0 otherwise. Denote . We select stratum from set , and then additional samples are drawn from the members who experience failure and are not in the SRS but lie in stratum . Denote to be the indicator of whether the ith individual from is sampled into the additional samples. Denote , where and are the numbers of cohort failure individuals and SRS failure individuals falling into . The SRS sample and the additional sample together make up the ODS sample.

Denote to be the set of SRS individuals and to be the supplemental sample from . Denote to be the set of individuals outside the ODS sample. Then, we can summarize the observed datasets obtained by the ODS design as follows:(i)The ODS sample (ii)The nonvalidation sample:
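As an illustration of the sampling scheme just described, the sketch below draws an SRS subcohort and then supplemental failures from chosen strata. The function name ods_sample, the dictionary of supplemental sizes, and the quantile cutpoints are assumptions made for the example, not the paper’s specification.

```python
import numpy as np

def ods_sample(time, status, n0, cutpoints, n_extra, rng=None):
    """Draw an ODS sample: an SRS subcohort of size n0 plus supplemental
    failures from the strata defined by cutpoints on the failure time.
    n_extra maps a stratum index (0..len(cutpoints)) to a sample size."""
    rng = np.random.default_rng(rng)
    n = len(time)
    srs = rng.choice(n, size=n0, replace=False)      # the subcohort
    in_srs = np.zeros(n, dtype=bool)
    in_srs[srs] = True
    stratum = np.searchsorted(cutpoints, time)       # stratum of each subject
    extra = []
    for k, m in n_extra.items():
        # eligible pool: failures outside the SRS lying in stratum k
        pool = np.where((status == 1) & (~in_srs) & (stratum == k))[0]
        extra.append(rng.choice(pool, size=min(m, len(pool)), replace=False))
    return srs, np.concatenate(extra) if extra else np.array([], dtype=int)
```

The supplemental indices are, by construction, failures outside the subcohort, so the two parts of the ODS sample never overlap.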

For the ODS design, we observe the covariate only for the selected subjects. The regression parameters can be estimated by solving , whereand , with , , and for a vector ; and the weight is defined bywhere and .

Note that the weight of the nonvalidation samples is 0, whereas that of the censored subcohort individuals is . The weight of a subcohort case is 1 if its failure time belongs to and is otherwise. A selected case outside the subcohort is weighted by when its failure time belongs to . The estimator defined by (4) takes the explicit form as follows:where

In practice, some covariates may be contaminated by distorting factors. In this paper, we assume W = (Xᵀ, Z)ᵀ, where X is the observed (p − 1)-dimensional covariate and Z is the unobservable 1-dimensional covariate satisfying

Z̃ = ψ(U)Z, (8)

where Z̃ is the actually observable 1-dimensional variable, U is the known confounder covariate, and ψ(·) is the unknown distorting function of the observable variable U. At this point, for the ODS design, the available data have the form:(i)The ODS sample (ii)The nonvalidation sample:

Combining model (2) and equation (8), we assume in this paper that the failure time is generated from the covariate-adjusted additive hazards model

λ(t ∣ X, Z) = λ₀(t) + αᵀX + γZ, (9)

where α and γ are the (p − 1)-dimensional and 1-dimensional regression parameters of primary interest, respectively. Following Şentürk and Müller [18], two conditions are imposed on model (9):(C1) E{ψ(U)} = 1(C2) Z and U are mutually independent

Note that condition (C1) ensures that the mean distorting effect vanishes. Based on conditions (C1) and (C2), we obtain E(Z̃) = E(Z). Owing to the presence of distortion, the covariate Z is unobservable, and the estimating function (4) can no longer be used for inference on the parameters. If we use Z̃ directly in place of Z, the statistical inference may be inaccurate. Therefore, we should calibrate the covariate Z based on the observed covariate Z̃ and the confounder U. From (8) and condition (C2), we obtain

E(Z̃ ∣ U = u) = ψ(u)E(Z). (10)

Define g(u) = E(Z̃ ∣ U = u), and we adopt the kernel method to estimate g:

ĝ(u) = Σ_{i=1}^{n} K_h(u − Uᵢ) Z̃ᵢ / Σ_{i=1}^{n} K_h(u − Uᵢ), (11)

where K(·) is a kernel function, K_h(·) = K(·/h)/h, and h is a bandwidth. It is easy to show that ĝ(u) converges almost surely to g(u). By E(Z̃) = E(Z) and equations (10) and (11), the distorting function can be estimated by

ψ̂(u) = ĝ(u) / (n⁻¹ Σ_{i=1}^{n} Z̃ᵢ), (12)

and the covariate can be calibrated by

Ẑᵢ = Z̃ᵢ / ψ̂(Uᵢ). (13)
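A minimal sketch of this calibration step, assuming the distortion model Z̃ = ψ(U)Z with E{ψ(U)} = 1 and an Epanechnikov kernel (the function names are illustrative): the Nadaraya–Watson estimate of E(Z̃ | U = u) is divided by the sample mean of Z̃ to estimate ψ, and each Z̃ᵢ is then divided by ψ̂(Uᵢ).

```python
import numpy as np

def epanechnikov(t):
    # Epanechnikov kernel with support [-1, 1]
    return 0.75 * (1 - t**2) * (np.abs(t) <= 1)

def calibrate(Z_tilde, U, h):
    """Calibrate a multiplicatively distorted covariate Z_tilde = psi(U) * Z."""
    # Nadaraya-Watson estimate of E[Z_tilde | U = u] at each observed U_i
    W = epanechnikov((U[:, None] - U[None, :]) / h)  # W[i, j] = K((U_i - U_j)/h)
    g_hat = W @ Z_tilde / W.sum(axis=1)              # g_hat(U_i)
    psi_hat = g_hat / Z_tilde.mean()                 # psi_hat(U_i) = g_hat(U_i) / mean(Z_tilde)
    return Z_tilde / psi_hat                         # calibrated covariate
```

With a distortion such as ψ(u) = u + 0.5 on U ~ Uniform(0, 1), the calibrated values track the unobserved Z closely away from a small kernel-smoothing error.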

Denote Ŵᵢ = (Xᵢᵀ, Ẑᵢ)ᵀ, and the proposed estimator for model (9) can be defined as the solution of the estimating function:where and . After some simple calculation, the proposed estimator takes the explicit form:where

3. Main Results

In this section, we establish the asymptotic properties of the estimator in (15).

Firstly, we give the following definition.

Definition 1. Definewhere is a locally square integrable martingale (Lin and Ying [26]) and is the true parameter value.
Secondly, the following additional regularity conditions are needed:
(C3) .
(C4) .
(C5) .
(C6) The matrix appearing in (18) is finite and positive definite.
(C7) The matrix is nonsingular.
(C8) As , and for .
(C9) , , and are differentiable, and the third-order derivatives of and satisfy the following condition: there exist and a neighborhood of the origin such that, if falls into the neighborhood, we have and , where is a density function of .
(C10) The kernel function appearing in (11) satisfies: (i) the support of is the interval ; (ii) is symmetric about zero; (iii) .
(C11) As , the bandwidth falls from to .
(C12) is bounded away from 0 and .
The above conditions are mild and hold in many circumstances. Conditions (C3)–(C8) are regularity conditions on the regression parameters similar to those of Yu et al. [27], whereas conditions (C9)–(C12) can be traced to Cui et al. [20].

Theorem 1. If conditions (C1)–(C12) hold, as , we have .

Theorem 2. If conditions (C1)–(C12) hold, as , we have

where . In other words,where and is the -element of matrix .

To prove Theorem 1, the following definition and lemmas are needed.

Definition 2. DefineTo prove the results conveniently, we define the partitioned matrices:where

Lemma 1. If conditions (C3)–(C8) hold, then

Proof. Applying the Glivenko–Cantelli theorem, we obtain thatwhere . By Corollary III.2 of Andersen and Gill [28], the uniform convergence of and can be shown similarly.

Lemma 2 (see Cui et al. [20]). If conditions (C1), (C2), and (C9)–(C12) hold, thenwhere is a function of , , and satisfying , and

Lemma 3. If conditions (C1)–(C12) hold, then

Proof. By Lemma 2 with and , we obtainwhere . By Lemma 2 with , we havewhere . Applying Lemma 2 with , we obtainwhere . Similar to the proof of Proposition 1 in Cui et al. [20], we haveTherefore,By equation (33) and the uniform convergence of to , we obtainFrom the partitioned matrices defined above, we obtainThen, by equation (33) and Lemma 2 with , we obtainwhere . By equations (39) and (40), we haveBy equations (33) and (34), we havewhere appeared in (21). By equations (37) and (38), we havewhere appeared in (22). Then, by equations (39), (42), and (43), we have

Proof of Theorem 1. Applying Lemma 3 and the definitions of and , we obtainBy equation (45), condition (C7), and Slutsky’s lemma, we can show that is nonsingular in probability. Thus, it follows thatSimilar to Yu et al. [27], we have ; then, by equations (45) and (46), Lemma 3, and Slutsky’s lemma, it holds thatTherefore, Theorem 1 is proved.

To prove Theorem 2, the following lemma is needed.

Lemma 4. If conditions (C1)–(C12) hold, thenwhere , and appeared in (21) and (22), respectively.

Proof. By equations (4) and (14), we obtainBy equation (41), we haveBy equations (42) and (43), we haveSimilar to Yu et al. [27], we havewhere appeared in (17), which is a square integrable martingale. Then, performing a simple calculation, we obtain thatBy and , we haveThen, we have , that is,By equations (50) and (55), we haveTherefore, by equations (51), (53), and (56), we haveTheorem 2 can be proven by using the properties of Lemma 4.

Proof of Theorem 2. By Lemma 4, we haveBy the law of iterated expectations, we haveBy Yu et al. [27], we havewhere and appeared in (20) and (21), respectively. By a simple calculation, we obtainBy conditions (C1) and (C2), we haveCombining equations (61) and (62), we obtainThus, by (58)–(60) and (63), we haveThen, we obtainwhere is a point on the line segment between and , andwhere is a -dimensional matrix whose elements are all zero except for the -element, which is 1. Therefore, we conclude thatwhere is the -element of the matrix .
This completes the proof of Theorem 2, which establishes the asymptotic normality of the proposed estimator from a viewpoint different from that of previous research.

4. Numerical Approach

In this section, we carry out simulation studies. The underlying additive hazards model is as follows:where the baseline function is set to and , respectively. The true parameters are and . The covariates are and . The censoring time is , where the constant is chosen to produce approximately three censoring rates of , , and . The confounding variable is and the distortion function is , where the constant is chosen so that the distorting function satisfies the identifiability constraint . The observed covariate is . We choose a high-order kernel function and use leave-one-out cross-validation to select the bandwidth.
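As a sketch of such a data-generating process (with illustrative constants in place of the paper’s settings), one can simulate an additive hazard with a constant baseline, a multiplicative distortion satisfying E{ψ(U)} = 1, and independent exponential censoring:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Illustrative values (the paper's exact constants are not reproduced here):
X = rng.binomial(1, 0.5, n).astype(float)   # observed covariate
Z = rng.uniform(1, 3, n)                    # true but unobserved covariate
U = rng.uniform(0, 1, n)                    # observable confounder
psi = U + 0.5                               # distorting function, E[psi(U)] = 1
Z_tilde = psi * Z                           # contaminated covariate we observe
# Additive hazard with a constant baseline: lambda = 0.5 + 0.5*X + 0.3*Z,
# so the failure time is exponential with that (subject-specific) rate.
rate = 0.5 + 0.5 * X + 0.3 * Z
T = rng.exponential(1 / rate)
C = rng.exponential(1 / 0.6, n)             # tune this rate to hit a target censoring level
time = np.minimum(T, C)
status = (T <= C).astype(int)
```

The analyst then sees only (time, status, X, Z_tilde, U); the oracle estimator would additionally use Z, and the naive estimator would treat Z_tilde as if it were Z.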

Under the ODS design, we sample subcohort individuals without replacement from . Then, we partition the observed failure times into three strata by quantiles of the observed failure time. To study the influence of different cutpoints, we consider the 0.2 and 0.8 quantiles and the 0.3 and 0.7 quantiles, respectively. We draw additional samples of sizes and from the first stratum and the third stratum. In addition, we compare our proposed covariate-adjusted estimator (Proposed) with two alternatives: the oracle estimator (Oracle), which is calculated from the true covariate , and the naive estimator (Naive), which is computed from the contaminated covariate . Note that the oracle estimator uses observations of , which are not available in real data, while the naive method simply treats the contaminated covariate as the true covariate . Under each configuration, the results presented in Tables 1 and 2 are obtained from 1000 independently generated datasets and include the bias of the estimates (Bias), the sample standard deviation (SD), the estimated standard error (SE), and the coverage probability of the normal confidence interval (CP).

By comparison, the oracle estimator is, as expected, the best of the three. To be specific, the proposed estimators of both and are essentially unbiased, their statistical performance is competitive with that of the oracle estimator, and the coverage of the normal confidence intervals is reasonable. In contrast, the naive estimator of is biased, which illustrates the need for the covariate-adjustment process and is consistent with the main results in Section 3. In addition, the efficiency gains are higher when the cutpoints are (0.2, 0.8) than when they are (0.3, 0.7).

Additionally, we conduct simulation studies to evaluate the behavior of the proposed method when the censoring time depends on the covariates. The setup is the same as in Tables 1 and 2, except that the censoring time is taken as and , respectively. The results, reported in Table 3 for and in Table 4 for , show that the proposed method still performs satisfactorily in these cases.

5. Empirical Analysis

We now illustrate the proposed method with a real-world analysis. Our study data contain 641 patients. In patients with cystic fibrosis, the accumulation of extracellular DNA in the lung during bacterial infection can bring about progressive deterioration of lung function and aggravation of respiratory symptoms. The dependent variable of interest is therefore the time to relapse, and its censoring rate is approximately . Under the ODS design, we sample 200 individuals as the subcohort. We partition the failure times of the cases not in the subcohort into three strata, choosing two kinds of cutpoints as in the simulations. Supplemental samples of sizes and are selected from the first stratum and the third stratum.

Two variables related to potential confounding are considered: vital capacity and the patient’s type of treatment (Type), divided into placebo and rhDNase. In this study, forced expiratory volume was measured twice, abbreviated and , respectively, and we regard as a disorder index of vital capacity. On average, the confounding factor approximately follows a uniform distribution over [0, 1]. We fit the following additive hazards model to assess the effects of Type and FEV on the failure time:

Kaplan–Meier survival curves are drawn by treatment type and by the level of FEV (adjusted FEV) of the patients. In the plots, FEV (adjusted FEV) is coded as 1 when it is greater than or equal to the median of FEV (the median of adjusted FEV) and as 0 otherwise. As shown in Figure 1, the distortion does affect the relation between FEV and survival probability, and patients on placebo or with lower FEV (lower adjusted FEV) tend to have lower survival probabilities.
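For reference, the Kaplan–Meier estimate underlying such curves can be computed directly; the following small function (our own illustration, not the paper’s code) returns the survival estimates at the distinct event times.

```python
import numpy as np

def kaplan_meier(time, status):
    """Return (distinct event times, Kaplan-Meier survival estimates)."""
    order = np.argsort(time)
    t, d = time[order], status[order]
    surv, out_t, out_s = 1.0, [], []
    n_at_risk = len(t)
    i = 0
    while i < len(t):
        # group tied observation times and count the deaths among them
        j, deaths = i, 0
        while j < len(t) and t[j] == t[i]:
            deaths += d[j]
            j += 1
        if deaths:
            surv *= 1 - deaths / n_at_risk   # multiplicative KM update
            out_t.append(t[i])
            out_s.append(surv)
        n_at_risk -= (j - i)                 # everyone at this time leaves the risk set
        i = j
    return np.array(out_t), np.array(out_s)
```

Plotting the survival estimates as a step function of the event times, separately by treatment group or by the dichotomized FEV, reproduces curves of the kind shown in Figure 1.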

We derive the coefficients in model (69) with the proposed covariate-adjusted approach and summarize the estimated coefficients in the column Est. Repeating the estimation process 1000 times, we report the SE and the Bias, the latter calculated as the average of the parameter estimates minus the corresponding Est. We also use the contaminated covariate to compute the naive estimator. In Table 5, the results based on our method are listed in the columns under Proposed, and those based on the contaminated covariate are listed in the columns under Naive. From these results, we can see that, as the amount of FEV increases, the risk of relapse with pulmonary exacerbations decreases. The rhDNase treatment for pulmonary exacerbations also decreases the risk, which is consistent with Figure 1. Moreover, owing to the covariate-adjustment process, the bias of the proposed method is smaller than that of the naive method, and the sample standard error for the cutpoints (0.2, 0.8) is smaller than that for (0.3, 0.7), in accord with the simulation results.

6. Conclusion

In summary, the ODS design helps lower the cost of measuring expensive exposure variables and improves efficiency in large-scale cohort studies, while adjusting for contaminated covariates avoids miscalculation and improves interpretation. First, this paper illustrates how to fit the covariate-adjusted additive hazards model to data from the ODS design. Second, to address the problems caused by contaminated covariates and the biased sampling scheme, we use nonparametric kernel estimation to calibrate the contaminated covariates and construct a weighted estimating function by inverse probability weighting based on the calibrated covariates. We then establish the theoretical properties of the estimator obtained from the proposed weighted estimating function. Numerical simulation studies show that the proposed estimator performs well in finite samples, and a real-data analysis demonstrates the feasibility of implementing the method.

Regarding future research, first, building on this paper’s treatment of covariate adjustment for the additive hazards model, it would be natural to study covariate adjustment in ODS designs under other models: for example, the accelerated failure time model discussed by Lin et al. [29], the accelerated hazard model studied by Chen and Wang [30], the nonautonomous SIRS model discussed by Lv and Meng [31], and the dynamic model studied in [32]. Second, to better promote the design and keep the proposed method practical, the determination of sample sizes and the optimal sample allocation, as proposed by Yu et al. [16], may also be interesting topics in the future. Furthermore, it is hoped that future research will contribute to the further development of survival models with multiple disease outcomes, as mentioned by Kang and Cai [7].

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (NSFC) (nos. 11901175 and 71901222), the Humanities and Social Science Foundation of the Ministry of Education of China (no. 17YJC630236), and Fundamental Research Funds for the Hubei Key Laboratory of Applied Mathematics, Hubei University (no. HBAM201907).