Abstract

Semiparametric joint models of longitudinal and competing risk data are computationally costly, and their current implementations do not scale well to massive biobank data. This paper identifies and addresses key computational barriers in a semiparametric joint model for longitudinal and competing risk survival data. By developing and implementing customized linear scan algorithms, we reduce the computational complexities from $O(n^2)$ or $O(n^3)$ to $O(n)$ in various steps, including numerical integration, risk set calculation, and standard error estimation, where $n$ is the number of subjects. Using both simulated and real-world biobank data, we demonstrate that these linear scan algorithms can speed up existing methods by a factor of up to hundreds of thousands when $n \geq 10^4$, often reducing the runtime from days to minutes. We have developed an R package, FastJM, based on the proposed algorithms for joint modeling of longitudinal and competing risk time-to-event data and made it publicly available on the Comprehensive R Archive Network (CRAN).

1. Introduction

In clinical research and other longitudinal studies, it is common to collect both longitudinal and time-to-event data on each participant, and these two endpoints are often correlated. Joint models of longitudinal and survival data have been widely used to mitigate biased estimation and invalid statistical inference associated with separate analyses of the two endpoints [1, 2]. For instance, in longitudinal data analysis, joint models are often used to handle nonignorable missing data due to a terminal event, which cannot be properly accounted for by a standard mixed effects model or generalized estimating equation (GEE) method that relies on the ignorable missing-at-random or missing-completely-at-random assumption [3-5]. Joint models are also widely employed in survival analysis to study the effects of a time-dependent covariate that is measured intermittently or subject to measurement error [6-9] and for dynamic prediction of an event outcome from the past history of a biomarker [10-14]. Comprehensive reviews of joint models for longitudinal and time-to-event data and their applications can be found in Elashoff et al. [1], Hickey et al. [15], Rizopoulos [16], Sudell et al. [17] and the references therein.

Despite the explosive growth of the literature on joint models for longitudinal and time-to-event data during the past three decades, efficient implementation of joint models has lagged behind, which limits their application to small to moderate studies. Recently, massive data sets collected from electronic health records (EHRs) and insurance claims databases have presented great opportunities to conduct clinical studies in real-world settings. For example, the UK Biobank is a prospective cohort study with approximately 500,000 individuals, aged 37-73 years, from the general population between 2006 and 2010 in the United Kingdom [18, 19]. Aggregated and quality-controlled EHR data purchasable from Optum (https://www.optum.com/EHR) include 80+ million patients with longitudinal lab measures. The Million Veteran Program [20] and the IBM MarketScan Database [21] are among many other big biobank databases that contain rich yet complex longitudinal and time-to-event data on $10^5$-$10^8$ patients. However, current implementations of many joint models are inefficient and not scalable to the size of biobank data, as demonstrated later in Sections 3 and 4. There is a pressing need to develop efficient implementations of joint models to enable the analysis of these rich data sources.

The purpose of this paper is to develop and implement efficient algorithms for a semiparametric joint model of longitudinal and competing risk time-to-event data. As specified in Section 2.1, the joint model consists of a linear mixed effects submodel for a longitudinal outcome and a semiparametric proportional cause-specific hazards submodel for a competing risk survival outcome. These two submodels are linked together by shared random effects or features of an individual's longitudinal biomarker trajectory. In Section 2, we identify key computational bottlenecks in the semiparametric maximum likelihood inference procedure for the joint model. Specifically, we point out that in a standard implementation, the computational complexities for numerical integration, risk set calculation, and standard error estimation are of the order $O(n^2)$, $O(n^2)$, and $O(n^3)$, respectively, where $n$ is the number of subjects. Consequently, current implementations grind to a halt as $n$ becomes large (for example, $n \geq 10^4$). We further develop tailored linear scan algorithms to reduce the computational complexity to $O(n)$ in each of the aforementioned components. We illustrate by simulation and real-world data that the linearization algorithms can result in a drastic speed-up by a factor of many thousands when $n \geq 10^4$, reducing the runtime from days to minutes for big data. Finally, we have developed a user-friendly R package, FastJM, to fit the shared parameter semiparametric joint model using the proposed efficient algorithms and made it publicly available on the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=FastJM.

The rest of the paper is organized as follows. Section 2.1 outlines the semiparametric shared random effect joint model framework and reviews a customized expectation-maximization (EM) algorithm for semiparametric maximum likelihood estimation as well as a standard error estimation method. Section 2.2 develops various linear scan algorithms to address the key computational bottlenecks in the EM algorithm and standard error estimation for large data. Section 3 presents simulation studies to illustrate the computational efficiency of the proposed linear scan algorithms. Section 4 demonstrates the improved computational performance of our FastJM R package over some established joint model R packages on two moderate to large real-world data sets. Concluding remarks are provided in Section 5.

2. Efficient Algorithms for a Semiparametric Joint Model of Longitudinal and Competing Risk Data

2.1. Notations and Preliminaries

Let $Y_i(t)$ denote the longitudinal outcome at time $t$ for subject $i$, $i = 1, \ldots, n$, where $n$ is the total number of subjects. Suppose the longitudinal outcome is observed at time points $t_{i1} < t_{i2} < \cdots < t_{in_i}$, and denote $Y_i = \{Y_i(t_{i1}), \ldots, Y_i(t_{in_i})\}$. Let $(T_i, D_i)$ denote the competing risk data on subject $i$, where $T_i$ is the observed time to either failure or censoring and $D_i$ takes values in $\{0, 1, \ldots, K\}$, with $D_i = 0$ indicating a censored event and $D_i = k$ indicating that the $k$th type of failure is observed on subject $i$, $k = 1, \ldots, K$. The censoring mechanism is assumed to be independent of the failure time.

2.1.1. Model

Consider the following joint model in which the longitudinal outcome is characterized by a linear mixed effects submodel,
\[ Y_i(t) = m_i(t) + \epsilon_i(t) = X_i(t)^\top \beta + Z_i(t)^\top b_i + \epsilon_i(t), \tag{1} \]
and the competing risk survival outcome is modeled by a proportional cause-specific hazards submodel,
\[ \lambda_k(t \mid W_i, b_i) = \lambda_{0k}(t) \exp\left( \gamma_k^\top W_i + \nu_k^\top b_i \right), \quad k = 1, \ldots, K. \tag{2} \]

In the longitudinal submodel (1), $m_i(t) = X_i(t)^\top \beta + Z_i(t)^\top b_i$ is the mean of the longitudinal outcome at time $t$, $X_i(t)$ and $Z_i(t)$ are column vectors of possibly time-varying covariates associated with the longitudinal outcome $Y_i(t)$, $\beta$ represents a vector of fixed effects of $X_i(t)$, $b_i$ denotes a vector of random effects for $Z_i(t)$, $\epsilon_i(t)$ is the measurement error independent of $b_i$, and $\epsilon_i(t)$ is independent of $\epsilon_i(t')$ for any $t \neq t'$. In the competing risk survival submodel (2), $\lambda_k(t \mid W_i, b_i)$ is the conditional cause-specific hazard rate for type $k$ failure at time $t$, given time-invariant covariates $W_i$ and the shared random effects $b_i$, and $\lambda_{0k}(t)$ is a completely unspecified baseline cause-specific hazard function for type $k$ failure. The two submodels are linked together by the shared random effects $b_i$, and the strength of association is quantified by the association parameters $\nu_k$, $k = 1, \ldots, K$.

2.1.2. Semiparametric Maximum Likelihood Estimation via an EM Algorithm

Let $\Psi = (\theta, \Lambda_{01}, \ldots, \Lambda_{0K})$ denote all unknown parameters for the joint models (1) and (2), where $\theta = (\beta, \sigma^2, \Sigma_b, \gamma_1, \ldots, \gamma_K, \nu_1, \ldots, \nu_K)$ collects the parametric components and $\Lambda_{0k}$ denotes the baseline cumulative hazard for type $k$ failure. Let $\Delta_{ik} = I(D_i = k)$ be the competing event indicator for type $k$ failure, which takes the value 1 if the condition $D_i = k$ is satisfied and 0 otherwise. The observed-data likelihood for $\Psi$ is then given by
\[ L(\Psi; Y, T, D) = \prod_{i=1}^{n} \int f(Y_i \mid b_i; \Psi)\, f(T_i, D_i \mid b_i; \Psi)\, f(b_i; \Psi)\, db_i, \tag{3} \]
which follows from the assumption that $Y_i$ and $(T_i, D_i)$ are independent conditional on the covariates and the random effects $b_i$.

Because $L(\Psi; Y, T, D)$ involves the unknown baseline hazard functions, finding the maximum likelihood estimate of $\Psi$ by maximizing the above observed-data likelihood is nontrivial. However, a customized EM algorithm can be derived to compute the maximum likelihood estimate of $\Psi$ by regarding the latent random effects $b_i$ as missing data [3]. The complete-data likelihood based on $\{(Y_i, T_i, D_i, b_i),\ i = 1, \ldots, n\}$ is
\[ L_c(\Psi; Y, T, D, b) = \prod_{i=1}^{n} \left[ \prod_{j=1}^{n_i} f\{Y_i(t_{ij}) \mid b_i; \Psi\} \right] \left[ \prod_{k=1}^{K} \left\{ \lambda_{0k}(T_i)\, e^{\gamma_k^\top W_i + \nu_k^\top b_i} \right\}^{\Delta_{ik}} \right] \exp\left\{ -\sum_{k=1}^{K} \Lambda_{0k}(T_i)\, e^{\gamma_k^\top W_i + \nu_k^\top b_i} \right\} f(b_i; \Psi), \tag{4} \]
where $\Lambda_{0k}(t) = \int_0^t \lambda_{0k}(s)\, ds$ is the cumulative baseline hazard function for type $k$ failure and $f(b_i; \Psi)$ is the density of the $N(0, \Sigma_b)$ distribution of the random effects.

The EM algorithm iterates between an expectation step (E-step),
\[ Q(\Psi; \Psi^{(m)}) = E\left\{ \log L_c(\Psi; Y, T, D, b) \mid Y, T, D; \Psi^{(m)} \right\}, \tag{5} \]
and a maximization step (M-step),
\[ \Psi^{(m+1)} = \arg\max_{\Psi} Q(\Psi; \Psi^{(m)}), \tag{6} \]
until the algorithm converges, where $\Psi^{(m)}$ is the estimate of $\Psi$ from the $m$th iteration. The E-step (5) involves calculating expected values of several functions of the random effects across all subjects, such as $E_i(b_i)$, $E_i(b_i b_i^\top)$, and $E_i\{\exp(\nu_k^\top b_i)\}$, where, for any function $h(\cdot)$,
\[ E_i\{h(b_i)\} = E\left\{ h(b_i) \mid Y_i, T_i, D_i; \Psi^{(m)} \right\} = \frac{\int h(b)\, f(Y_i, T_i, D_i, b; \Psi^{(m)})\, db}{\int f(Y_i, T_i, D_i, b; \Psi^{(m)})\, db}, \tag{7} \]
and $f(Y_i, T_i, D_i, b; \Psi^{(m)})$ denotes the $i$th factor of the complete-data likelihood (4) evaluated at $b_i = b$ and $\Psi = \Psi^{(m)}$. Furthermore, it can be shown that the M-step (6) has closed-form solutions for the parameters $\beta$, $\sigma^2$, $\Sigma_b$, and $\Lambda_{0k}$, and that the other parameters $\gamma_k$ and $\nu_k$ can be updated using the one-step Newton-Raphson method. Details are provided in equations (7.1)-(7.8) of the supplementary materials.

2.1.3. Standard Error Estimation

As discussed in Elashoff et al. [22] (Section 4.1, p. 72), standard errors of the parametric components of the semiparametric maximum likelihood estimate can be estimated by profiled likelihood, the observed information matrix, or the bootstrap. All three methods can be computationally intensive when $n$ is large. Here, we focus on the profiled-likelihood-based method and show that its computation can be linearized with respect to $n$.

Let $\theta$ denote the parametric component of $\Psi$ and $\hat{\theta}$ its maximum likelihood estimate. The variance-covariance matrix of $\hat{\theta}$ can be estimated by inverting the following approximate empirical Fisher information [23-25]:
\[ I(\hat{\theta}) = \sum_{i=1}^{n} U_i(\hat{\theta})\, U_i(\hat{\theta})^{\top}, \tag{8} \]
where $U_i(\hat{\theta})$ is the observed score vector from the profiled likelihood of $\theta$ for the $i$th subject, obtained by profiling out the baseline hazards. Details of the observed score vector for each parametric component are provided in equations (7.9)-(7.13) of the supplementary materials.

2.2. Efficient Algorithms and Implementation of the EM Algorithm and Standard Error Estimation

With a naive implementation, multiple quantities in the above E-step, M-step, and standard error estimation involve $O(n^2)$ or $O(n^3)$ operations, which becomes computationally prohibitive at large sample size $n$. Below, we identify these bottlenecks and develop appropriate linear scan algorithms to reduce their computational complexities to $O(n)$.

2.2.1. Efficient Implementation of the E-Step

At each EM iteration, the E-step (5) requires calculating integral (7) across all subjects. Below we discuss how to accelerate two commonly used numerical integration methods for evaluating these integrals.

(1) Standard Gauss-Hermite Quadrature Rule for Numerical Integration. A commonly used method for numerical evaluation of integral (7) is based on the standard Gauss-Hermite quadrature rule [26]:
\[ E_i\{h(b_i)\} \approx \frac{\sum_{t_1=1}^{n_q} \cdots \sum_{t_p=1}^{n_q} h(\tilde{b}_t)\, f(Y_i, T_i, D_i, \tilde{b}_t; \Psi^{(m)}) \prod_{l=1}^{p} w_{t_l} \exp(b_{t_l}^2)}{\sum_{t_1=1}^{n_q} \cdots \sum_{t_p=1}^{n_q} f(Y_i, T_i, D_i, \tilde{b}_t; \Psi^{(m)}) \prod_{l=1}^{p} w_{t_l} \exp(b_{t_l}^2)}, \tag{9} \]
where
\[ f(Y_i, T_i, D_i, b; \Psi^{(m)}) = \left[ \prod_{j=1}^{n_i} f\{Y_i(t_{ij}) \mid b; \Psi^{(m)}\} \right] \left[ \prod_{k=1}^{K} \left\{ \lambda_{0k}^{(m)}(T_i)\, e^{\gamma_k^{(m)\top} W_i + \nu_k^{(m)\top} b} \right\}^{\Delta_{ik}} \right] \exp\left\{ -\sum_{k=1}^{K} \Lambda_{0k}^{(m)}(T_i)\, e^{\gamma_k^{(m)\top} W_i + \nu_k^{(m)\top} b} \right\} f(b; \Psi^{(m)}), \tag{10} \]
with $\Lambda_{0k}^{(m)}$ the right-continuous and nondecreasing cumulative baseline hazard function for type $k$ failure at the $m$th EM iteration as defined in Section 7.1 (equation (7.4)) of the supplementary material, $\lambda_{0k}^{(m)}(\cdot)$ the jump size of $\Lambda_{0k}^{(m)}$ at the corresponding event time, $p$ the dimension of the random effects vector, $t = (t_1, \ldots, t_p)$ the multi-index running over the Cartesian product of quadrature points, $n_q$ the number of quadrature points, $b_t = (b_{t_1}, \ldots, b_{t_p})^\top$ the abscissas with corresponding weights $w_{t_1}, \ldots, w_{t_p}$, $\tilde{b}_t = \sqrt{2}\, \hat{L}^{(m)} b_t$ the rescaled alternative abscissas, and $\hat{L}^{(m)}$ the square root of $\Sigma_b^{(m)}$ [3]. However, this method is computationally intensive due to multiple factors. First, it usually requires a large number of quadrature points to approximate an integral with sufficient accuracy because the mode of the integrand is often located in a region away from zero. Second, the computational cost increases exponentially with $p$ because the Cartesian product of the abscissas is used to evaluate the integrand with respect to each random effect. Lastly, the alternative abscissas $\tilde{b}_t$ need to be recalculated at every EM iteration.
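To make the Cartesian-product construction concrete, the following self-contained R sketch (a generic illustration, not code from FastJM) computes standard Gauss-Hermite nodes and weights by the Golub-Welsch eigenvalue method for the weight function $e^{-x^2}$ and expands them into a $p$-dimensional grid; all function and variable names are ours.

```r
# Standard Gauss-Hermite nodes/weights (weight function exp(-x^2)) via the
# Golub-Welsch eigenvalue decomposition of the Jacobi matrix.
gauss_hermite <- function(n_q) {
  off_diag <- sqrt(seq_len(n_q - 1) / 2)        # Jacobi matrix off-diagonal
  J <- matrix(0, n_q, n_q)
  J[cbind(1:(n_q - 1), 2:n_q)] <- off_diag
  J[cbind(2:n_q, 1:(n_q - 1))] <- off_diag
  eig <- eigen(J, symmetric = TRUE)
  nodes <- eig$values
  weights <- sqrt(pi) * eig$vectors[1, ]^2      # mu_0 = integral of exp(-x^2)
  list(nodes = nodes, weights = weights)
}

# Cartesian product of abscissas for a p-dimensional random effect:
# the grid has n_q^p rows, so the cost of evaluating the integrand over the
# full grid grows exponentially in p.
gh <- gauss_hermite(10)
p  <- 2
grid <- expand.grid(rep(list(gh$nodes), p))     # n_q^p abscissa vectors
w    <- apply(expand.grid(rep(list(gh$weights), p)), 1, prod)
nrow(grid)                                      # 100 grid points for p = 2
```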

(2) Pseudo-Adaptive Gauss-Hermite Quadrature Rule for Numerical Integration. When the number of longitudinal measurements per subject is relatively large, Rizopoulos [27] introduced a pseudo-adaptive Gauss-Hermite quadrature rule for numerical approximation of integral (7), which achieves good approximation accuracy with only a small number of quadrature points and is thus computationally more efficient. The pseudo-adaptive Gauss-Hermite quadrature rule proceeds as follows. First, fit the linear mixed effects model (1) alone to extract the empirical Bayes estimates of the random effects,
\[ \hat{b}_i = \hat{\Sigma}_b Z_i^\top \hat{V}_i^{-1} (Y_i - X_i \hat{\beta}), \tag{11} \]
and their covariance matrices,
\[ \hat{\Sigma}_i = \hat{\Sigma}_b - \hat{\Sigma}_b Z_i^\top \hat{V}_i^{-1} Z_i \hat{\Sigma}_b + \hat{\Sigma}_b Z_i^\top \hat{V}_i^{-1} X_i \left( \sum_{j=1}^{n} X_j^\top \hat{V}_j^{-1} X_j \right)^{-1} X_i^\top \hat{V}_i^{-1} Z_i \hat{\Sigma}_b, \tag{12} \]
where $\hat{V}_i = Z_i \hat{\Sigma}_b Z_i^\top + \hat{\sigma}^2 I_{n_i}$, $X_i = \{X_i(t_{i1}), \ldots, X_i(t_{in_i})\}^\top$, and $Z_i = \{Z_i(t_{i1}), \ldots, Z_i(t_{in_i})\}^\top$. Then, define the alternative abscissas $\tilde{b}_{it} = \hat{b}_i + \sqrt{2}\, \hat{B}_i b_t$, where $\hat{B}_i$ is the Cholesky factor of $\hat{\Sigma}_i$ (i.e., $\hat{\Sigma}_i = \hat{B}_i \hat{B}_i^\top$), and approximate $E_i\{h(b_i)\}$ by
\[ E_i\{h(b_i)\} \approx \frac{\sum_{t_1=1}^{n_q} \cdots \sum_{t_p=1}^{n_q} h(\tilde{b}_{it})\, f_{it} \prod_{l=1}^{p} w_{t_l} \exp(b_{t_l}^2)}{\sum_{t_1=1}^{n_q} \cdots \sum_{t_p=1}^{n_q} f_{it} \prod_{l=1}^{p} w_{t_l} \exp(b_{t_l}^2)}, \tag{13} \]
where
\[ f_{it} = f(Y_i, T_i, D_i, \tilde{b}_{it}; \Psi^{(m)}) \tag{14} \]
is the joint density in Equation (10) evaluated at $b = \tilde{b}_{it}$, with the other notations defined similarly to Equation (10).

A derivation of (13) is provided in Section 7.2 of the supplementary materials. The pseudo-adaptive Gauss-Hermite quadrature rule is computationally appealing because the rescaled quadrature points $\tilde{b}_{it}$ are computed only once before the EM algorithm and do not need to be updated within the EM iterations. Additionally, the pseudo-adaptive Gauss-Hermite quadrature rule requires fewer quadrature points than the standard Gauss-Hermite quadrature rule to achieve the same numerical approximation accuracy [27]. For example, our simulation results in the supplementary materials (Section A.4, Table A.1) illustrate that the pseudo-adaptive Gauss-Hermite quadrature rule with a small number of quadrature points produces almost identical results to the standard Gauss-Hermite quadrature rule with a considerably larger number of quadrature points.
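Continuing the sketch above, and under the same reconstruction assumptions as Equations (11)-(13), the subject-specific rescaled abscissas can be formed once from $\hat{b}_i$, the Cholesky factor of $\hat{\Sigma}_i$, and the standard Gauss-Hermite grid; `grid` below refers to the grid built in the previous sketch, and the toy values are illustrative.

```r
# Rescale the standard Gauss-Hermite grid around subject i's empirical Bayes
# estimate b_hat_i with covariance Sigma_hat_i (a sketch of the rule in (13);
# b_hat_i and Sigma_hat_i are assumed to come from fitting model (1) beforehand).
rescale_abscissas <- function(b_hat_i, Sigma_hat_i, grid) {
  B_i <- t(chol(Sigma_hat_i))          # lower-triangular factor, Sigma = B B'
  # each row of 'grid' is one p-dimensional standard abscissa b_t
  t(apply(as.matrix(grid), 1, function(b_t) b_hat_i + sqrt(2) * B_i %*% b_t))
}

# Example with p = 2 (toy values):
b_hat_i     <- c(0.3, -0.1)
Sigma_hat_i <- matrix(c(0.20, 0.05, 0.05, 0.10), 2, 2)
tilde_b     <- rescale_abscissas(b_hat_i, Sigma_hat_i, grid)  # n_q^p x p matrix
```

Because these abscissas depend only on the preliminary linear mixed effects fit, they are computed once and reused at every EM iteration, in contrast to the standard rule above.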

Remark 1 (Linear calculation of the $\hat{\Sigma}_i$'s across all subjects). At first sight, calculating (12) across all subjects would involve $O(n^2)$ operations, since each $\hat{\Sigma}_i$ involves a summation over all $n$ subjects. However, because the same quantity $\sum_{j=1}^{n} X_j^\top \hat{V}_j^{-1} X_j$ appears in every $\hat{\Sigma}_i$, one can precompute this quantity once and then use the cached value to calculate $\hat{\Sigma}_i$ for each subject. This way, one can compute the $\hat{\Sigma}_i$'s across all subjects in $O(n)$ operations. Our simulation study in the supplementary material (Section A.5, Figure A.1) shows that applying this simple linearization algorithm can yield a speed-up by a factor of 10 to 10,000 as $n$ grows from 10 to $10^5$. Our implementation is also significantly faster than the popular R package lme4 [28], with a speed-up by a factor of 10 to 500 as $n$ grows from 10 to $10^5$.
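The caching idea in Remark 1 can be sketched in R as follows, assuming the reconstructed form of (12) above (i.e., that each $\hat{\Sigma}_i$ shares the common term $\sum_{j} X_j^\top \hat{V}_j^{-1} X_j$); the inputs `X_list`, `Z_list`, `Sigma_b`, and `sigma2` are hypothetical per-subject design matrices and fitted variance components, and the code is not taken from FastJM or lme4.

```r
# Sketch of the O(n) computation of the empirical Bayes covariance matrices
# under the reconstructed formula (12): the common term A = sum_j X_j' V_j^{-1} X_j
# is computed once and cached instead of being recomputed inside every Sigma_i.
eb_covariances <- function(X_list, Z_list, Sigma_b, sigma2) {
  n <- length(X_list)
  Vinv_list <- lapply(seq_len(n), function(i) {
    Zi <- Z_list[[i]]
    solve(Zi %*% Sigma_b %*% t(Zi) + sigma2 * diag(nrow(Zi)))   # V_i^{-1}
  })
  # Cached common term: one O(n) pass over subjects.
  A <- Reduce(`+`, lapply(seq_len(n), function(i) {
    t(X_list[[i]]) %*% Vinv_list[[i]] %*% X_list[[i]]
  }))
  A_inv <- solve(A)
  # Second O(n) pass: each Sigma_i reuses the cached A_inv.
  lapply(seq_len(n), function(i) {
    Xi <- X_list[[i]]; Zi <- Z_list[[i]]; Vi_inv <- Vinv_list[[i]]
    SZ <- Sigma_b %*% t(Zi) %*% Vi_inv
    Sigma_b - SZ %*% Zi %*% Sigma_b + SZ %*% Xi %*% A_inv %*% t(Xi) %*% t(SZ)
  })
}
```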

(3) Linear Scan Algorithm for Calculating $\Lambda_{0k}^{(m)}(T_i)$ across All Subjects. Both the standard Gauss-Hermite quadrature rule (9) and the pseudo-adaptive Gauss-Hermite quadrature rule (13) require evaluating the integrand at their prespecified abscissas across all subjects (see Equations (10) and (14)). Hence, one needs to evaluate $\Lambda_{0k}^{(m)}(T_i)$ across all subjects $i = 1, \ldots, n$ for each failure type $k$. We observe from equation (7.4) that, for each $k$, the values $\Lambda_{0k}^{(m)}(t_{k1}), \ldots, \Lambda_{0k}^{(m)}(t_{kq_k})$ have already been calculated from the $m$th EM iteration, where $t_{k1} > t_{k2} > \cdots > t_{kq_k}$ are the distinct observed type $k$ event times. For each $i$, calculating $\Lambda_{0k}^{(m)}(T_i)$ would involve $O(q_k)$ operations if a global search is performed to find the interval of two adjacent type $k$ event times containing $T_i$. Consequently, calculating $\{\Lambda_{0k}^{(m)}(T_i)\}_{i=1}^{n}$ would require $O(n q_k)$, and thus up to $O(n^2)$, operations. However, by taking advantage of the fact that $\Lambda_{0k}^{(m)}(\cdot)$ is a right-continuous and nondecreasing step function, one can obtain $\{\Lambda_{0k}^{(m)}(T_i)\}_{i=1}^{n}$ from $\{\Lambda_{0k}^{(m)}(t_{kj})\}_{j=1}^{q_k}$ in $O(n)$ operations using the following linear scan algorithm. First, sort the observation times $T_i$, $i = 1, \ldots, n$, in descending order, and denote by $i_{(r)}$ the index of the subject with the $r$th largest observation time. Then, define the mapping
\[ \Lambda_{0k}^{(m)}(T_{i_{(r)}}) = \begin{cases} \Lambda_{0k}^{(m)}(t_{k1}), & T_{i_{(r)}} \geq t_{k1}, \\ \Lambda_{0k}^{(m)}(t_{kj}), & t_{kj} \leq T_{i_{(r)}} < t_{k(j-1)},\ j = 2, \ldots, q_k, \\ 0, & T_{i_{(r)}} < t_{kq_k}, \end{cases} \tag{15} \]
where the event times $t_{kj}$ are scanned forward from the largest to the smallest, and for each $t_{kj}$, only a subset of the ranked observation times $T_{i_{(r)}}$ is scanned forward to calculate the corresponding $\Lambda_{0k}^{(m)}(T_{i_{(r)}})$'s, as follows:

Specifically, start with $t_{k1}$ and scan through the first set of observation times $\{T_{i_{(1)}}, \ldots, T_{i_{(r_1)}}\}$, where $T_{i_{(r)}} \geq t_{k1}$ for all $r \leq r_1$; the corresponding $\Lambda_{0k}^{(m)}(T_{i_{(r)}})$'s take the value $\Lambda_{0k}^{(m)}(t_{k1})$. Next, move forward to $t_{k2}$ and scan through the second set $\{T_{i_{(r_1+1)}}, \ldots, T_{i_{(r_2)}}\}$, where $t_{k2} \leq T_{i_{(r)}} < t_{k1}$ for $r_1 < r \leq r_2$; the corresponding $\Lambda_{0k}^{(m)}(T_{i_{(r)}})$'s take the value of $\Lambda_{0k}^{(m)}$ evaluated at $t_{k2}$. Repeat the same process until $t_{kq_k}$ is scanned; any remaining observation times smaller than $t_{kq_k}$ receive the value 0. Because the scanned subsets of observation times for different $t_{kj}$'s do not overlap, the entire algorithm costs only $O(n)$ operations.
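The linear scan can be sketched in R as follows (a generic illustration rather than the FastJM implementation); for clarity it scans both sequences in increasing order, which is equivalent to the descending scan described above.

```r
# Evaluate a right-continuous, nondecreasing step function Lambda at all
# observation times with a single merged pass (linear in n and q once the
# times are sorted), instead of an O(n * q) global search.
#   event_times: distinct jump times t_1 < ... < t_q (increasing)
#   Lambda_vals: Lambda evaluated at the jump times (nondecreasing)
#   obs_times:   observation times T_1, ..., T_n (any order)
step_eval_linear_scan <- function(event_times, Lambda_vals, obs_times) {
  ord <- order(obs_times)                # sort observation times once
  out <- numeric(length(obs_times))
  j <- 0                                 # index of the last jump time <= current T
  for (r in seq_along(ord)) {
    Ti <- obs_times[ord[r]]
    while (j < length(event_times) && event_times[j + 1] <= Ti) j <- j + 1
    out[ord[r]] <- if (j == 0) 0 else Lambda_vals[j]   # 0 before the first jump
  }
  out
}

# Small check against a direct lookup:
et <- c(1, 2.5, 4); Lv <- cumsum(c(0.2, 0.3, 0.1))
Ti <- c(0.5, 2.5, 3, 10)
step_eval_linear_scan(et, Lv, Ti)        # 0.0 0.5 0.5 0.6
```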

2.2.2. Linear Risk Set Scan for the M-Step

In the M-step, multiple quantities in equations (A.4)-(A.8), such as the cumulative baseline hazard functions $\Lambda_{0k}^{(m+1)}$ and the Hessian matrices and score vectors for $\gamma_k$ and $\nu_k$ ($k = 1, \ldots, K$), involve aggregating information over the risk set $R(t_{kj}) = \{l : T_l \geq t_{kj}\}$ at each uncensored event time $t_{kj}$. These quantities are further aggregated across all $t_{kj}$'s. If all $n$ subjects are scanned to determine the risk set at each $t_{kj}$, then aggregating information over the risk sets for all uncensored event times would obviously require $O(n q_k)$, or up to $O(n^2)$, operations. Below we explain how to linearize the number of operations for risk set calculations across all uncensored event times by taking advantage of the fact that the risk set is decreasing over time for right-censored data.

First, to calculate $\Lambda_{0k}^{(m+1)}(t_{kj})$, $j = 1, \ldots, q_k$, one needs to compute risk-set sums of the form $\sum_{l \in R(t_{kj})} a_l$, $j = 1, \ldots, q_k$, where $a_l$ denotes a subject-level quantity such as $E_l\{\exp(\gamma_k^\top W_l + \nu_k^\top b_l)\}$. Because the distinct uncensored event times $t_{k1} > t_{k2} > \cdots > t_{kq_k}$ are arranged in decreasing order, the risk set $R(t_{k(j+1)})$ can be decomposed into two disjoint sets,
\[ R(t_{k(j+1)}) = R(t_{kj}) \cup \{l : t_{k(j+1)} \leq T_l < t_{kj}\}, \tag{16} \]
and consequently,
\[ \sum_{l \in R(t_{k(j+1)})} a_l = \sum_{l \in R(t_{kj})} a_l + \sum_{l:\, t_{k(j+1)} \leq T_l < t_{kj}} a_l \tag{17} \]
for any sequence of real numbers $a_1, \ldots, a_n$. It follows from the recursive formula (17), and the fact that the subjects in $R(t_{kj})$ do not need to be rescanned to calculate the second term of (17), that one can calculate $\sum_{l \in R(t_{kj})} a_l$, $j = 1, \ldots, q_k$, in $O(n)$ operations when the $t_{kj}$'s are scanned backward in time.
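The backward risk set scan implied by (16) and (17) can be sketched in R as follows (again a generic illustration with our own function names): each subject is added to a running sum exactly once as the distinct event times are scanned from the largest to the smallest.

```r
# Compute risk-set sums S_j = sum_{l: T_l >= t_j} a_l for all distinct
# uncensored event times t_j using recursion (17): one pass over subjects
# after sorting, rather than rescanning all n subjects at every t_j.
risk_set_sums <- function(obs_times, a, event_times) {
  t_desc <- sort(unique(event_times), decreasing = TRUE)   # t_{k1} > t_{k2} > ...
  ord <- order(obs_times, decreasing = TRUE)               # subjects by T_l, descending
  S <- numeric(length(t_desc))
  running <- 0; r <- 1
  for (j in seq_along(t_desc)) {
    # add subjects with T_l >= t_desc[j] that have not been added yet
    while (r <= length(ord) && obs_times[ord[r]] >= t_desc[j]) {
      running <- running + a[ord[r]]; r <- r + 1
    }
    S[j] <- running
  }
  list(event_times = t_desc, sums = S)
}

# Toy example: a_l plays the role of exp(gamma' W_l)-type contributions.
obs <- c(5, 3, 8, 2, 7)
a   <- exp(0.2 * c(1, 0, 1, 1, 0))
risk_set_sums(obs, a, event_times = c(3, 7))
# risk set at t = 7 contains the subjects with T >= 7; at t = 3 two more enter
```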

Next, to calculate the Hessian matrix for $\gamma_k$ and $\nu_k$ in (A.5), we first rewrite it as a sum over the distinct uncensored event times $t_{kj}$ of terms that aggregate over the risk set $R(t_{kj})$, which allows one to linearize its calculation based on (17), with $a_l$ taken to be the corresponding subject-level terms, similar to the linear scan algorithm for the $\Lambda_{0k}^{(m+1)}(t_{kj})$'s.

Finally, the above linear risk set scan algorithm can be adapted to calculate the remaining Hessian matrices and score vectors for $\gamma_k$ and $\nu_k$ in equations (A.6)-(A.8) in $O(n)$ operations in a similar fashion.

2.2.3. Linear Scan Algorithm for Standard Error Estimation

The standard error estimation formula in (8) relies on the observed score vector from the profiled likelihood in which the baseline hazards are profiled out. However, for each subject $i$, two components of the score vector $U_i(\hat{\theta})$, namely those for $\gamma_k$ and $\nu_k$ as given in equations (A.12) and (A.13), involve aggregating information either over the distinct event times $t_{kj} \leq T_i$ or over both these event times and their associated risk sets $R(t_{kj})$. If implemented naively, the aggregation can take either $O(n)$ or $O(n^2)$ operations per subject, respectively. As a result, computing the observed information matrix (8) can take up to $O(n^3)$ operations because it requires summing up the information across all $n$ subjects. Below we describe a sequential linear scan algorithm to reduce the computational complexity from $O(n^3)$ to $O(n)$.

Our algorithm can be easily explained by considering the calculation of the second term of the corresponding score component in (7.12), which, for each subject $i$, accumulates a risk-set-dependent quantity over all distinct type $k$ event times $t_{kj} \leq T_i$.

In other words, we need to compute $A_i = \sum_{j:\, t_{kj} \leq T_i} c_{kj}$ for $i = 1, \ldots, n$, where each term $c_{kj}$ is attached to the distinct type $k$ event time $t_{kj}$ and is itself an aggregate over the risk set $R(t_{kj})$ (for example, a ratio of risk-set sums weighted by the jump size of the baseline hazard).

Before going further, we recall that the distinct uncensored event times $t_{k1} > t_{k2} > \cdots > t_{kq_k}$ are in descending order and that the subjects are sorted so that the observation times $T_{i_{(1)}} \geq T_{i_{(2)}} \geq \cdots \geq T_{i_{(n)}}$ are in descending order.

First of all, because the risk set is decreasing over time for right-censored data, it follows from Equation (17) that all the $c_{kj}$'s can be computed in $O(n)$ operations as one scans through the $t_{kj}$'s backward in time. Second, analogous to (15), the following linear scan algorithm can be used to calculate the $A_i$'s from the $c_{kj}$'s: the event times $t_{kj}$ are scanned forward from the largest to the smallest, and for each $t_{kj}$, only a subset of the ranked observation times $T_{i_{(r)}}$'s is scanned forward to accumulate the corresponding $A_{i_{(r)}}$'s.

The details are essentially the same as those discussed following Equation (15) and thus omitted here.
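Under the notation introduced above (the $A_i$'s and $c_{kj}$'s are our shorthand for the quantities in (7.12)), the forward pass can be sketched in R by combining prefix sums of the $c_{kj}$'s with the step-function scan from Section 2.2.1; the sketch below reuses the `step_eval_linear_scan` helper defined earlier and assumes the $c_{kj}$'s have already been produced by the backward risk set scan.

```r
# Given per-event-time quantities c_j attached to distinct event times
# t_1 < ... < t_q (each c_j typically a risk-set aggregate computed by the
# backward scan), compute A_i = sum_{j: t_j <= T_i} c_j for all subjects
# via prefix sums and a single merged scan.
per_subject_cumulants <- function(event_times, c_vals, obs_times) {
  ord_t <- order(event_times)
  csum  <- cumsum(c_vals[ord_t])          # prefix sums of c_j over increasing t_j
  step_eval_linear_scan(event_times[ord_t], csum, obs_times)
}

# Toy usage (reusing step_eval_linear_scan from Section 2.2.1):
per_subject_cumulants(event_times = c(2, 5, 9),
                      c_vals      = c(0.4, 0.1, 0.3),
                      obs_times   = c(1, 6, 10))
# -> 0.0 0.5 0.8
```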

3. Simulation Studies

We present a simulation study to illustrate the computational speed-up rendered by the proposed linear scan algorithms as the sample size $n$ grows from 100 to 1,000,000. All simulations were run on a MacBook Pro with a 6-core Intel Core i7 processor (2.6 GHz) and 16 GB of RAM running macOS.

We generated longitudinal measurements from a linear mixed effects model of the form (1) and competing risk event times from a proportional cause-specific hazards model of the form (2) with two failure types, where the two submodels are linked together through the shared random effects $b_i$. In this joint model, the $t_{ij}$'s represent scheduled visit times, one covariate is continuous and the other is binary, the random effects $b_i$ follow a bivariate normal distribution, the measurement errors $\epsilon_i(t_{ij})$ are iid normal and independent of $b_i$, and the baseline cause-specific hazards $\lambda_{01}(t)$ and $\lambda_{02}(t)$ are the constants 0.05 and 0.1, respectively. We simulated noninformative censoring times from an exponential distribution, Exp(20), and let $T_i$ denote the observed (possibly censored) time for subject $i$. The longitudinal measurements for subject $i$ are assumed missing after $T_i$.
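For concreteness, the data-generating mechanism can be sketched in R as follows. The parameter values below are illustrative placeholders rather than the true values used in the paper's simulation, and interpreting Exp(20) as an exponential censoring distribution with mean 20 is an assumption; with constant baseline cause-specific hazards, the event time given the covariates and random effects is exponential with rate equal to the sum of the two cause-specific hazards, and the cause is drawn with probability proportional to each hazard.

```r
# Hedged sketch of the simulation design (placeholder parameter values).
set.seed(1)
simulate_joint_data <- function(n, visit_times = 0:5,
                                beta = c(2, -0.5, 0.5, 1),      # placeholder fixed effects
                                sigma = 0.5,                    # placeholder residual SD
                                Sigma_b = matrix(c(0.5, 0.1, 0.1, 0.2), 2),
                                gamma1 = c(0.5, -0.2), gamma2 = c(-0.3, 0.4),
                                nu1 = c(0.5, 0.5), nu2 = c(0.3, -0.3),
                                lambda01 = 0.05, lambda02 = 0.1,
                                cens_mean = 20) {
  x1 <- rnorm(n); x2 <- rbinom(n, 1, 0.5)            # one continuous, one binary covariate
  b  <- MASS::mvrnorm(n, mu = c(0, 0), Sigma = Sigma_b)
  W  <- cbind(x1, x2)
  # constant cause-specific hazards given (W_i, b_i)
  h1 <- lambda01 * exp(W %*% gamma1 + b %*% nu1)
  h2 <- lambda02 * exp(W %*% gamma2 + b %*% nu2)
  event_time <- rexp(n, rate = h1 + h2)              # time to the first event of either cause
  cause      <- 1 + rbinom(n, 1, prob = h2 / (h1 + h2))
  cens       <- rexp(n, rate = 1 / cens_mean)        # noninformative censoring (mean 20 assumed)
  T_obs <- pmin(event_time, cens)
  D     <- ifelse(event_time <= cens, cause, 0)
  # longitudinal measurements at scheduled visits, missing after T_obs
  long <- do.call(rbind, lapply(seq_len(n), function(i) {
    tt <- visit_times[visit_times <= T_obs[i]]
    if (!length(tt)) return(NULL)
    y  <- beta[1] + beta[2] * tt + beta[3] * x1[i] + beta[4] * x2[i] +
          b[i, 1] + b[i, 2] * tt + rnorm(length(tt), sd = sigma)
    data.frame(id = i, time = tt, y = y)
  }))
  list(long = long, surv = data.frame(id = seq_len(n), T_obs = T_obs, D = D))
}
dat <- simulate_joint_data(n = 500)
```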

We first compared the runtimes of three different implementations of the EM algorithm for fitting the joint models (1) and (2) as described in Section 2.2.

(1) Method 1. A standard implementation of the EM algorithm using the standard Gauss-Hermite quadrature rule in the E-step (Equation (9)), without any linear computation.

(2) Method 2. A standard implementation of the EM algorithm using the pseudo-adaptive quadrature rule in the E-step (Equation (13)), with the linear calculation of the $\hat{\Sigma}_i$'s described in Remark 1 but without any other linear computation.

(3) Method 3. Method 2 plus the linear scan for calculating the $\Lambda_{0k}^{(m)}(T_i)$'s and the linear risk set scan for the M-step, as described in Section 2.2.

The number of quadrature points for methods 1 and 2 was determined by first trying different values, {10, 20, 30} for method 1 and {6, 9, 12, 15} for method 2, and then choosing the smallest value for which the estimation results stabilized and were similar between the two implementations. For comparison purposes, we also included the runtime of an established joint model R package, joineR, which uses a similar EM algorithm for parameter estimation to fit a semiparametric joint model with a slightly different latent association structure in the competing risk submodel [29]. The results are depicted in Figure 1.

It is seen from Figure 1(a) that the runtime of method 3 increases linearly with the sample size, while the runtimes of the other three methods grow much faster than linearly. For moderate sample sizes, method 2 is computationally more efficient than method 1 because it requires fewer quadrature points for numerical integration. However, its computational advantage diminishes as the sample size increases because of the rapidly growing cost of calculating the $\Lambda_{0k}^{(m)}(T_i)$'s and of the risk set calculations in the M-step. By further linearizing the computation of these key components, method 3 yielded a more than 100-fold speed-up over method 2 at the largest sample sizes considered, and the speed-up is expected to grow further as $n$ increases (Figure 1(b)). Furthermore, method 3 demonstrated a more than 30-fold speed-up over joineR at the sample sizes where joineR could run. We also note that joineR failed to run at the largest sample sizes due to memory overload.

We also compared the runtime of two implementations of the standard error estimation, with and without linear scan as described in Section 2.2.3, and the bootstrap method employed by the joineR package [29]. The results are shown in Figure 2.

It is seen from Figure 2(a) that the implementation with linear scan easily scales to a million subjects, taking only minutes to finish, while the naive implementation without linear scan grinds to a halt when the sample size reaches 10,000 or more. Figure 2(b) shows that the linear scan can generate a speed-up by a factor of greater than 100,000 when $n$ reaches $10^4$. Similarly, in comparison with joineR, which used 100 bootstrap samples for standard error estimation, our standard error estimation method with linear scan also generated a speed-up by a factor of greater than 100,000.

Finally, Figures 1 and 2 in this section have focused on contrasting the computational efficiency of different implementations of parameter estimation and standard error estimation in terms of runtime. We have also compared their parameter estimates and standard errors in Section A.6 of the supplementary materials. As one would expect, our three implementations (methods 1-3) yielded almost identical estimation results, whereas joineR produced similar estimation results for the longitudinal submodel but slightly different results for the competing risk submodel due to its different latent association structure.

4. Real Data Examples

We have developed an R package, FastJM, to implement the efficient algorithms described in Section 2. Below we illustrate the improved computational performance of FastJM in comparison to existing joint model R packages on a Lung Health Study (LHS) data set with 5,887 subjects and a UK Biobank data set with 193,287 participants.

4.1. Lung Health Study

The Lung Health Study (LHS) data were collected from a ten-center randomized clinical trial on 5,887 middle-aged smokers with mild to moderate chronic obstructive pulmonary disease (COPD) [30]. Patients were randomized into three arms: usual care, smoking intervention and placebo inhaler (SIP), and smoking intervention and active bronchodilator inhaler (SIA). An important objective of the study was to determine whether the intervention program, combining intensive smoking cessation counseling with an inhaled anticholinergic bronchodilator, could slow the decline in forced expired volume in 1 s (FEV$_1$) during a 5-year follow-up period. Patients' FEV$_1$ values were collected annually upon recruitment into the study. FEV$_1$ was chosen as the primary outcome because its trajectory is an indicator of a patient's natural loss of lung function during the progression of COPD. Because not all patients completed the whole study period, about 9.47% of the longitudinal FEV$_1$ measurements were missing. One possible reason for dropout is that treatment was not effective, and hence missing longitudinal measurements after dropout are potentially nonignorable.

Joint modeling of FEV$_1$ together with the possibly informative dropout time provides an attractive approach to dealing with nonignorable missing longitudinal data due to dropout. Based on previous findings, we considered the following covariates when characterizing the trajectory of FEV$_1$: time (year), sex, age, body mass index (BMI), baseline number of cigarettes smoked per day, and the logarithm of the two-point methacholine concentration O'Connor slope (logslope) [31]. We also included two interaction terms between the treatment indicators SIP and SIA and time, so that the difference in the slope of FEV$_1$ between SIP (or SIA) and usual care can be evaluated by testing whether the corresponding interaction coefficients are zero. Specifically, we considered a linear mixed effects model of the form (1) with the above covariates and interaction terms as fixed effects and subject-level random effects $b_i$. The random error term $\epsilon_i(t)$ and the random effects $b_i$ are assumed normally distributed with zero mean and with variance $\sigma^2$ and covariance matrix $\Sigma_b$, respectively. For the dropout time (possibly censored at the end of the study), we assume a Cox proportional hazards submodel of the form (2) with a single event type, in which $\lambda_0(t)$ denotes the baseline hazard function and $\nu^\top b_i$ is the latent association structure that links the two submodels.

Table 1 compares the runtime of FastJM with those of some existing joint model R packages, including joineR [29], different versions of JM [2], JMbayes [32], and JSM [33], under various specifications of the baseline hazard function and the latent association structure.

Among the semiparametric joint models (FastJM, joineR, and the semiparametric versions of JM and JSM), FastJM finished in 0.3 minutes, while the other methods took 20.4 to 111 minutes. As a matter of fact, the runtime of FastJM was even shorter than those of some of the parametric joint models in Table 1. We also observed that JMbayes, which is based on a Bayesian MCMC framework, is considerably slower than its frequentist counterpart JM. Finally, the parameter estimates and inference results for the longitudinal outcome were almost identical across all packages but slightly different for the survival submodel because of the packages' slightly different latent association structures. Detailed analysis results are summarized in Section A.7 of the supplementary materials.

4.2. UK Biobank Primary Care (UKB-PC) Study

The UK Biobank (UKB) is a prospective cohort study with deep genetic and phenotypic data collected on approximately 500,000 individuals, aged 37-73 years, from the general population between 2006 and 2010 in the United Kingdom [18, 19]. Participants attended an assessment at their closest clinic center, where they completed questionnaires, took physical measurements, and provided biological samples (blood, urine, and saliva) as a baseline assessment visit. Hospital admission records were available until February 2018 for the full UKB cohort, whereas linkage to primary care records was available for 45% of the UKB cohort (approximately 230,000 participants) until May 2017 for Scotland, September 2017 for Wales, and August 2017 for England. The detailed linkage procedures relating to primary care records are available online (https://biobank.ndph.ox.ac.uk/showcase/showcase/docs/primary_care_data.pdf).

In this example, we consider a joint model of longitudinal systolic blood pressure (SBP) measurements and a competing risk event time, with age-at-onset of type 2 diabetes (T2D) as the first risk and age-at-onset of stroke, myocardial infarction (MI), or all-cause death, whichever occurred first, as the second risk. Age-at-onset of outcomes was based on participants' primary care or hospital records, whichever occurred first. Follow-up was censored at the primary care data end date for the relevant country or at the date of the outcome, if this occurred earlier. SBP measures were extracted from either the baseline assessment visit or the primary care data. Covariates include sex, ethnicity, and BMI measured during the baseline visit. However, considering the imbalanced racial distribution in this case study, we only considered white vs. non-white ethnicity groups. Specifically, the joint model consists of a linear mixed effects model (Model-L) for the longitudinal outcome (SBP), which corresponds to model (1) with the above covariates, and a proportional cause-specific hazards model (Model-PCH) of the form (2) for the competing risk event outcome, $k = 1, 2$. In Model-L, the random error term and the random effects are assumed normally distributed with zero mean and with variance $\sigma^2$ and covariance matrix $\Sigma_b$, respectively. In Model-PCH, $k = 1$ denotes type 2 diabetes and $k = 2$ denotes the composite of stroke, MI, or all-cause death, $\lambda_{0k}(t)$ denotes the baseline cause-specific hazard function for cause $k$, and $\nu_k^\top b_i$ denotes the latent association structure of SBP with the cause $k$ risk.

To our knowledge, besides our FastJM package, joineR and JM are two other current joint model R packages that are capable of handling competing risk event outcomes. However, because JM encountered convergence issues, we will focus on FastJM and joineR in this case study. Table 2 compares the runtime of FastJM and joineR on a subset of 5,000 and 20,000 participants randomly selected from the UKB-PC data and the full UKB-PC data with 193,287 participants.

Table 2 shows that for the UKB-PC subset of 5,000 participants, FastJM finished within 1 minute, while joineR took 3.3 hours to finish. For the UKB-PC subset of 20,000 participants, FastJM finished within 5 minutes, while joineR took 33 hours to run. For the UKB-PC full data with 193,287 participants, FastJM finished within 1 hour, whereas joineR encountered a computational failure.

Finally, the analysis results produced by FastJM and joineR are similar for the longitudinal submodel on the UKB-PC subsets of 5,000 and 20,000 participants and on the full UKB-PC data. For the survival submodel, the results are also similar for most parameters, except for the association parameters, owing to the different latent association structures of the two packages. Detailed analysis results are provided in Section A.8 of the supplementary materials.

5. Discussion

We have developed customized linear scan algorithms to reduce the computational complexity from $O(n^2)$ or $O(n^3)$ to $O(n)$ within each iteration of the EM algorithm and in the standard error estimation step for a semiparametric joint model of a longitudinal biomarker and a competing risk event time outcome. Through simulation and case studies, we have demonstrated that the efficient implementation can generate a speed-up by a factor of up to hundreds of thousands and oftentimes reduces the runtime from days to minutes when the sample size is large (e.g., $n \geq 10^4$), making it feasible to fit a joint model to large data sets in real time.

The ideas and techniques of this paper can potentially be adapted to improve computational efficiency for other joint models. For instance, the linear computational algorithm in Remark 1 for computing the variance-covariance matrices of the empirical Bayes estimates of the random effects is not specific to the joint model considered in this paper and can be used in any procedure that employs the pseudo-adaptive quadrature rule. Also, although we have focused on joint modeling of a single biomarker with a time-to-event outcome, our methodology can be easily extended to handle multiple biomarkers in a similar fashion. It is also important to note that the linear risk set scan algorithm is limited to the shared random effects joint model in which the Cox submodel (2) involves only time-independent covariates. If the Cox submodel contains time-dependent covariates, such as the current value of the longitudinal marker, then one may have to impose more restrictive assumptions, such as a parametric baseline hazard, in order to linearize the computational cost with respect to the sample size.

This paper has focused on linearizing the computation with respect to the sample size within the framework of the classical EM algorithm coupled with the pseudo-adaptive quadrature rule for numerical integration in the E-step. It would be interesting to investigate whether coupling our algorithms with other numerical integration methods, such as the quasi-Monte Carlo method [34], or with other variations of the EM algorithm, such as the stochastic EM algorithm (stEM) [35] or Turbo-EM [36], could further enhance the computational efficiency, especially when there are three or more random effects in the model. Finally, current joint model implementations are generally not scalable as the number of longitudinal measurements per subject grows large, rendering it infeasible to fit dense longitudinal data such as those generated by modern wearable devices for dynamic prediction of a health outcome. Future research is warranted to develop novel joint modeling procedures that are scalable to large numbers of subjects, random effects, and longitudinal measurements.

6. Software

A user-friendly R package FastJM [37] has been developed to fit the shared parameter joint model using the efficient algorithms developed in this paper and is publicly available on the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=FastJM.
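For reference, a typical call has roughly the following shape. The argument names (`ydata`, `cdata`, `long.formula`, `surv.formula`, `random`) reflect our reading of the package interface and should be checked against the current CRAN documentation (?jmcs); the data frames `long_df` and `surv_df` are hypothetical.

```r
# Illustrative call to FastJM (argument names as we recall them from the
# package documentation; check ?FastJM::jmcs for the authoritative interface).
# 'long_df' holds the longitudinal records; 'surv_df' has one row per subject
# with the observed time and the failure type (0 = censored).
library(FastJM)
library(survival)

fit <- jmcs(ydata = long_df,
            cdata = surv_df,
            long.formula = y ~ time + x1 + x2,
            surv.formula = Surv(T_obs, D) ~ x1 + x2,
            random = ~ time | id)
summary(fit)
```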

Data Availability

The lung health study (LHS) data used to support the findings in Section 4.1 have not been made available because the authors did not have permission for data sharing from the data provider. The data that support the findings in Section 4.2 are available from UK Biobank repositories. The UK Biobank data are retrieved under Project ID: 48152. Data are available at https://www.ukbiobank.ac.uk with the permission of UK Biobank.

Disclosure

This paper has been posted as a preprint at https://arxiv.org/abs/2110.14822 [38].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was partially supported by the National Institutes of Health (P30 CA-16042, UL1TR000124-02, and P01AT003960, GL), the National Institute of General Medical Sciences (R35GM141798, HZ), the National Human Genome Research Institute (R01HG006139, HZ and JJZ), the National Science Foundation (DMS-2054253, HZ and JJZ), the National Institute of Diabetes and Digestive and Kidney Disease (K01DK106116, JJZ), and the National Heart, Lung, and Blood Institute (R21HL150374, JJZ).

Supplementary Materials

Derivations of formulas, additional simulation results, and analysis results of the two real data examples are provided in the supplementary materials.