Abstract
Factor discovery in public health surveillance data is a crucial and scientifically challenging problem with wide applications in research studies. The main focus of this study is to introduce an improved survival regression technique for use in the presence of multicollinearity; to this end, the partial least squares spline modeling approach is proposed. The proposed method is compared with the benchmark partial least squares Cox regression model in terms of accuracy based on the Akaike information criterion. Further, the optimal model is applied to a real data set of infant mortality obtained from the Pakistan Demographic and Health Survey to assess the significant risk factors of infant mortality. The recommended features contain key information about infant survival and could be useful in public health surveillance-related research.
1. Introduction
The survival approach is a common regression modeling method used for prognostic analysis, as it examines the relationship between the covariates, the response, and the time until the occurrence of an event. The framework for survival analysis is commonly based on the Cox proportional hazard (PH) model because the hazard ratio (HR) can be computed without estimating the baseline hazard function. The Cox PH model maximizes the partial likelihood function, which estimates the regression parameters but not the baseline hazard function. Consequently, the survival probability and the hazard rates can be estimated only at the event times and not for long-term evaluations [1].
Parametric survival models specify a probability distribution to estimate the absolute measure of effect in a time-to-event response. A common specification in these models is the Weibull distribution, which is used to estimate the baseline hazard $h_0(t)$. A parametric survival model with a scale parameter ($\lambda$), a shape parameter ($\gamma$), and time ($t$) is defined as $h(t) = \lambda \gamma t^{\gamma - 1}$. For the absolute measure of effect, the Weibull distribution can generally give accurate predictions for constant, monotonically decreasing, or monotonically increasing hazards. However, for more complex hazard functions, a parametric survival model specifying a Weibull function will lead to inaccurate predictions [2].
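The three hazard shapes that a Weibull model can represent are easy to see numerically. The following is a minimal sketch, assuming the proportional-hazards parameterization $h(t) = \lambda \gamma t^{\gamma - 1}$; the function names and the parameter values are illustrative and are not taken from the study.

```python
import math

def weibull_hazard(t, lam, gamma):
    """Hazard h(t) = lam * gamma * t**(gamma - 1) for t > 0."""
    return lam * gamma * t ** (gamma - 1)

def weibull_cumulative_hazard(t, lam, gamma):
    """Cumulative hazard H(t) = lam * t**gamma."""
    return lam * t ** gamma

def weibull_survival(t, lam, gamma):
    """Survival probability S(t) = exp(-H(t))."""
    return math.exp(-weibull_cumulative_hazard(t, lam, gamma))

# gamma < 1: decreasing hazard; gamma == 1: constant; gamma > 1: increasing.
for gamma in (0.5, 1.0, 2.0):
    print(gamma, [round(weibull_hazard(t, 0.1, gamma), 4) for t in (1.0, 2.0, 4.0)])
```

Because the shape parameter only allows these monotone patterns, a hazard that rises and then falls cannot be reproduced by any choice of the two parameters.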
The Royston and Parmar model is an advanced type of flexible parametric survival model featuring a restricted cubic spline to model more complex hazard shapes and to estimate a continuous function [3]. This model considers the baseline log cumulative hazard function on the log timescale. For the Weibull distribution, this function is $\ln H(t \mid x) = \ln \lambda + \gamma \ln t + x\beta$, where $\lambda$ and $\gamma$ represent the baseline hazard with respect to log time and $x$ denotes the vector of predictors. This function can be generalized as $\ln H(t \mid x) = s(\ln t) + x\beta$, where $s(\ln t)$ describes a general baseline log cumulative hazard function. Royston and Parmar used a restricted cubic spline to model the baseline log cumulative hazard function on the log timescale. A restricted (or natural) cubic spline carries the additional restriction that the first and last subfunctions, beyond the boundary knots, are linear instead of cubic. A restricted cubic spline can be mathematically expressed as [15] $s(\ln t) = \gamma_0 + \gamma_1 z_1 + \gamma_2 z_2 + \cdots + \gamma_{K-1} z_{K-1}$, where $K$ denotes the number of knots, $z_j$ represents the derived variables, and $\gamma_j$ describes the coefficients for these variables. This spline can fit complex shapes of baseline log cumulative hazard functions while improving the stability of the function [4].
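The derived variables of a restricted cubic spline can be computed directly. The sketch below assumes the basis commonly used for Royston-Parmar models: the first derived variable is $\ln t$ itself, and each interior knot $k_j$ adds the term $(x - k_j)_+^3 - \lambda_j (x - k_{\min})_+^3 - (1 - \lambda_j)(x - k_{\max})_+^3$ with $\lambda_j = (k_{\max} - k_j)/(k_{\max} - k_{\min})$. The function names and the knot-indexing convention are illustrative assumptions, not the authors' code.

```python
def rcs_basis(log_t, interior_knots, k_min, k_max):
    """Derived variables at x = log_t: x itself, then one term per interior knot."""
    def pos_cube(u):
        # truncated power function (u)_+^3
        return max(u, 0.0) ** 3

    basis = [log_t]
    for k_j in interior_knots:
        lam_j = (k_max - k_j) / (k_max - k_min)
        basis.append(pos_cube(log_t - k_j)
                     - lam_j * pos_cube(log_t - k_min)
                     - (1.0 - lam_j) * pos_cube(log_t - k_max))
    return basis

def spline_value(log_t, gammas, interior_knots, k_min, k_max):
    """s(x) = gamma_0 + sum_j gamma_j * z_j(x), with gammas = [gamma_0, gamma_1, ...]."""
    z = rcs_basis(log_t, interior_knots, k_min, k_max)
    return gammas[0] + sum(g * z_j for g, z_j in zip(gammas[1:], z))
```

With an empty list of interior knots the basis contains only $\ln t$, so the spline is a straight line in log time; beyond the boundary knots the cubic and quadratic terms cancel, which is the linearity restriction described above.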
Multivariate survival regression models assume that there is no multicollinearity among covariates. Most survival methods are not appropriate for modeling large data sets with correlated covariates. Partial least squares (PLS) regression is considered a good alternative to traditional regression methods in the presence of multicollinearity [5, 6].
Therefore, the partial least squares-Cox (PLS-Cox) regression model was developed to analyze survival systems in the presence of multicollinearity [7]. Owing to several limitations of the PLS-Cox regression model, the PLS flexible parametric (PLS-FP) survival regression model was proposed to estimate smooth hazard ratios of predictors and the corresponding cumulative hazard functions and to extrapolate the survival model [2]. However, the major limitation of the PLS-FP model is that it is not appropriate for all complex shapes of the hazard function. The motivation of this research was to develop a survival model that can capture complex hazard shapes in the presence of multicollinearity. The proposed method integrates partial least squares with the Royston and Parmar restricted cubic spline model and is hence named the partial least squares spline (PLS-spline) model. This model can fit more complex shapes of baseline log cumulative hazard functions. The efficiency of the PLS-spline model is tested on simulated data by examining its performance on different scales with various spline knots. The proposed model is then applied to a real data set of infant mortality to estimate the hazard function and regression coefficients. The analyses based on different scales using the simulated and real data sets reveal the efficiency of these models in estimating baseline log cumulative hazard functions in the presence of multicollinearity.
2. Materials and Methods
2.1. The Cox Proportional Hazard Model
For the occurrence of an event at time $t$, the Cox model assumes the hazard function in the presence of censoring
$$h(t \mid X) = h_0(t) \exp(X\beta),$$
where $h_0(t)$ is the baseline hazard function, $\beta$ is the vector of coefficients, and $X$ is a matrix of covariates. In this model, the baseline hazard function is unspecified.
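The baseline hazard can stay unspecified because it cancels out of the partial likelihood. As a hedged illustration, the sketch below evaluates the Cox log partial likelihood for given coefficients, assuming no tied event times; the function name and argument layout are hypothetical and do not come from the study.

```python
import math

def cox_log_partial_likelihood(times, events, X, beta):
    """Cox log partial likelihood; events[i] is 1 for an event, 0 for censoring."""
    # linear predictor x_i * beta for every subject
    eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
    loglik = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored subjects contribute only through the risk sets
        # risk set: everyone still under observation at time t_i
        risk = sum(math.exp(eta[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        loglik += eta[i] - math.log(risk)
    return loglik
```

Each event contributes the ratio of its own relative hazard to the sum over its risk set, so any common factor $h_0(t)$ drops out of every term.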
2.2. The Partial Least Squares-Cox (PLS-Cox) Regression Model
The partial least squares-Cox (PLS-Cox) regression model is used as a benchmark model in this study. Let represent the survival time and . The partial least squares model computes latent components for correlated covariates; then, the Cox model assumes the hazard function as
$$h(t \mid C) = h_0(t) \exp(C\beta),$$
where $h_0(t)$ is the unspecified baseline hazard function, $\beta$ is the vector of coefficients, and $C$ is a matrix of components. The model parameters are estimated by the maximum likelihood method.
2.3. The Royston-Parmar Spline Model
In the context of the PH model, the Royston-Parmar (RP) model can be expressed as $\ln H(t \mid x) = s(\ln t) + x\beta$, where $s(\ln t)$ describes a restricted cubic spline that is a function of the derived variables $z_j$ and the number of knots $K$. Generally, one of three scales (hazard, odds, or normal) is used to fit the RP spline model. When no knots are specified, the restricted cubic spline reduces to the Weibull distribution if the scale is hazard. For the odds and normal scales, no knots give the log-logistic and lognormal models, respectively.
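The reduction to the Weibull model on the hazard scale can be checked in a few lines. The derivation below assumes that a spline without knots is linear in log time, $s(\ln t) = \gamma_0 + \gamma_1 \ln t$; the symbols $\gamma_0$ and $\gamma_1$ are notation chosen for this sketch.

```latex
\begin{align*}
\ln H(t \mid x) &= \gamma_0 + \gamma_1 \ln t + x\beta \\
H(t \mid x) &= e^{\gamma_0}\, t^{\gamma_1} \exp(x\beta) \\
h(t \mid x) &= \frac{\partial H(t \mid x)}{\partial t}
             = e^{\gamma_0}\, \gamma_1\, t^{\gamma_1 - 1} \exp(x\beta)
\end{align*}
```

The last line is a Weibull proportional-hazards model with scale $\lambda = e^{\gamma_0}$ and shape $\gamma = \gamma_1$, so each added knot extends the Weibull baseline rather than replacing it.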
2.4. Partial Least Squares Spline (PLS-Spline) Survival Regression Algorithm
Let $X$ denote the matrix of correlated covariates for a sample of size $n$. The algorithm executes the FP model based on the components (as ) of PLSR computed with time as a response variable and $X$ as a matrix of covariates for . The pseudocode for the proposed PLS-spline model is expressed as follows.
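The component-extraction stage described above can be illustrated in Python. This is a hedged sketch that assumes the standard NIPALS form of PLS1 with survival time as the single response; it is not the authors' pseudocode, the function name is hypothetical, and the second stage (fitting the spline survival model on the extracted components) is not shown.

```python
import math

def pls1_components(X, y, n_components):
    """Return the n x A score matrix of PLS1 (NIPALS) for predictors X and response y."""
    n, p = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    E = [[row[j] - col_means[j] for j in range(p)] for row in X]  # centered X
    y_mean = sum(y) / n
    f = [v - y_mean for v in y]                                   # centered y
    scores = []
    for _ in range(n_components):
        # weight vector: direction of maximal covariance between residual X and y
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
        # score vector (latent component) and its loading
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]
        tt = sum(v * v for v in t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        # deflate X so the next component is orthogonal to this one
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        scores.append(t)
    return [[scores[a][i] for a in range(n_components)] for i in range(n)]
```

The returned scores are mutually orthogonal, which is what removes the multicollinearity before the survival model is fitted. Note that this stage, as described in the text, treats time as an ordinary response and does not use the censoring indicator.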

2.5. Data Simulation
Simulated data are generated using the simsurv R package to evaluate the efficiency of the existing and proposed survival models. The simulated data set is generated from the Weibull distribution for the scale parameter () and shape parameter () over 5 years of censoring. The correlation structure between 200 covariates ranged from 0 to 0.9 over 100 samples.
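The study itself uses the simsurv R package. As a rough Python stand-in, not a reproduction of simsurv or of the study's settings, Weibull survival times under proportional hazards can be drawn by inverse-transform sampling, $T = \left(-\ln U / (\lambda e^{x\beta})\right)^{1/\gamma}$. In the sketch below, the single shared factor that induces an equal pairwise correlation `rho`, the function name, and every argument value are assumptions made for illustration.

```python
import math
import random

def simulate_weibull_ph(n, p, rho, lam, gamma, beta, max_time, seed=0):
    """Return n tuples (observed_time, event, covariates) with censoring at max_time."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # a shared factor gives every pair of covariates correlation rho
        shared = rng.gauss(0.0, 1.0)
        x = [math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
             for _ in range(p)]
        eta = sum(b * v for b, v in zip(beta, x))
        u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
        # invert S(t | x) = exp(-lam * t**gamma * exp(eta)) at S = u
        t = (-math.log(u) / (lam * math.exp(eta))) ** (1.0 / gamma)
        event = t <= max_time
        data.append((min(t, max_time), int(event), x))
    return data
```

A call such as `simulate_weibull_ph(100, 200, 0.9, 0.1, 1.5, [0.0] * 200, 5.0)` would give 100 samples with 200 strongly correlated covariates and censoring at 5 time units; the scale, shape, and coefficient values in that call are placeholders.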
2.6. Real Data Set
This study used publicly available secondary data from the Demographic and Health Survey (DHS), collected during 2012-13 in Pakistan with the support of the United States Agency for International Development and ICF International. Therefore, there are no ethical concerns involved in this work, and no ethics review is required for this study [8]. The secondary data on infants from birth to 12 months of age, born to ever-married women aged 15-49 years in Pakistan, are used in this study. The outcome of interest was infant survival within 12 months after the first month of birth. The sample consists of 80 infants from Pakistan, and 86 covariates are included.
3. Results
3.1. Simulation-Based Results
Using the Weibull distribution, a high-dimensional simulated data set with multicollinearity is generated. The constructed data are then split into training and test sets in a 70 : 30 ratio to train and evaluate the benchmark and proposed methods. The hazard, odds, and normal scales are each modeled with zero and one knot.
The PLS-spline model with different knots measured on different scales is fitted to the simulated data set generated from the Weibull distribution to assess the performance of the models based on the Akaike information criterion (AIC) and Bayesian information criterion (BIC). Figure 1 shows the comparison between the standard PLS-Cox regression model and six PLS-spline models with different knots based on various scales. The proposed PLS-spline models based on the hazard scale with zero knots and one knot are symbolized as and , respectively. Similarly, and stand for odds and normal scales accordingly. Figure 1 shows that the PLS-spline model with one knot has the highest performance on all three scales compared to the PLS-Cox model and the PLS-spline models with zero knots. It is also clear from Figure 1 that the PLS-spline model with zero knots is more efficient than the benchmark PLS-Cox method. Figure 2 shows the efficiency comparison based on the BIC, which supports the AIC-based results.
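Both criteria penalize the maximized log-likelihood by the number of estimated parameters, so adding a knot is favored only when the fit improves enough to offset the extra parameter. The sketch below uses the standard definitions; taking `n_obs` as the sample size is an assumption, since some survival analyses use the number of events instead.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood
```

For example, with hypothetical log-likelihoods of -100 for a three-parameter model and -98 for a four-parameter model, `aic` returns 206 and 204, so the larger model would be preferred on AIC.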
3.2. Application
3.2.1. Infant Survival Time Data Set
A cluster heat map, presented in Figure 3, shows the magnitudes of correlation among covariates. Negative correlations are shown in blue, and positive correlations are shown in red. Higher color intensity indicates higher correlation between the corresponding variables. Only 36 covariates are selected for examining multicollinearity so that the visualization remains comprehensible. Figure 3 clearly portrays the correlation between covariates through its intense colors.
The presence of multicollinearity is evident in the heat map. Hence, the existence of multicollinearity among covariates in the high-dimensional survival data is detected visually.
The high-dimensional infant survival data set with multicollinearity is used for comparison of models and identification of risk factors of infant mortality. The sample data are split into training and test sets in a 70 : 30 ratio to evaluate the efficiency of the PLS survival methods.
The PLS-spline models with zero and one knot are fitted to the real data set to assess the performance of models based on different scales using AIC and BIC. Figure 4 shows that, based on AIC, all proposed methods are more efficient than PLS-Cox. The highest performance in Figure 4 is observed for the odds-scale model with one knot. This result shows that the proposed PLS-spline model based on the odds scale with one knot is the optimal model for the observed data.
Figure 5 shows the comparison of models based on BIC. The visual representation shows that the PLS-spline models based on the odds scale with zero and one knot have nearly the same efficiency. On the basis of both model assessment criteria, we may conclude that the PLS-spline model based on the odds scale is the best-fitting model for the observed data. For identification of significant risk factors, the PLS-spline model based on the hazard scale with one knot is executed as being best fitted.
Table 1 presents the selected influential risk factors of infant mortality obtained using the optimal model. After analysis, 27 influential factors are found to be significantly associated with infant mortality in Pakistan. A positive association with infant mortality is found for mother's age, type of place of region, de facto place of residence, relationship of mother to household head, type of cooking fuel, number of births in the last five years, distance, transport, and accompaniment to the health facility, mother's occupation, person who usually decides on respondent's health care, person who usually decides on visits to family or relatives, person who usually decides what to do with money the husband earns, succeeding birth interval, and blood relation with husband. Furthermore, a negative association is observed for region, selection for domestic violence, household has motorcycle/scooter, reading newspaper or magazine, watching television, wealth index, awareness of tuberculosis and hepatitis, beating justified if wife neglects the children or argues with husband or if wife burns the food, and preceding birth interval.
Figure 6 shows the estimates of the baseline cumulative hazards from the PLS-spline model measured on the hazard, normal, and odds scales with zero and one knot for the infant survival data set. All six PLS-spline models produce smooth estimates of the baseline cumulative hazards extrapolated to a time of 12 months, showing consistent estimates. The PLS-spline model based on the odds scale with one knot is represented by the red line in Figure 6; it shows the lowest cumulative hazard for the first 4 months after birth, a moderate increase in the fifth month, and a maximum at the sixth month.
4. Discussion
Alongside advances in statistical techniques, several modifications have been suggested for survival analysis to improve model efficiency. Yang et al. [9] introduced DeepCoxPH, an estimation strategy based on deep learning and the Cox model that aims to improve risk stratification for overall survival analysis. Rueda et al. [10] used discrete-time Markov chain theory and Cox regression to predict the survival function. The authors also employed a parametric analysis for comparison and variable selection. Another study developed an algorithm that combines the parametric model and partial least squares in the presence of extreme observations to enhance model performance [2]. In this study, the PLS-spline model is proposed to treat a survival response with collinear predictors using the spline strategy based on different scales with various knots, with the aim of better model performance and superior interpretation potential. To examine the hazard function with higher accuracy, the PLS-spline model is constructed by integrating PLS and the Royston and Parmar spline model in the presence of multicollinearity. The proposed model is compared with the PLS-Cox model using simulated and real data sets for efficiency comparison. The PLS-spline model with one knot over the hazard, odds, and normal scales turns out to be the best model for estimating cumulative hazards based on AIC and BIC over simulated data generated from the Weibull distribution. More importantly, for known simulated data, the PLS-spline model showed better performance than the PLS-Cox model. For the real data set of infant mortality, the PLS-spline model with one knot over the odds scale is observed to be the optimal model. The finally selected model is used to identify the influential risk factors of infant mortality in Pakistan. Maternal age, occupation, and place of residence are found to be significant predictors of infant mortality in the present study.
Previous studies observed that younger and older maternal ages are significantly associated with infant mortality [11]. Another study reported that the region of residence and working status of the mother are statistically significant risk factors for stunted, underweight, and wasted children [12]. Consistent with the literature, domestic violence is found to be significantly associated with infant mortality [13]. The present study observed that an increase in media awareness (watching television and reading newspapers) and wealth level could decrease the ratio of infant mortality. The literature describes that media exposure and income level are associated with maternal outcomes [12, 14]. Availability and utilization of health facilities is found to be an important risk factor for the mortality rate among infants. Several earlier studies verified that health expenditure potentially reduces maternal and infant mortality across different countries [15, 16]. In close agreement with previous literature, birth interval and consanguineous marriage showed a significant association with infant mortality [17, 18]. The proposed algorithm substantially enhances model performance when the covariates are collinear. This efficiency suggests that the survival function, hazard function, cumulative hazard function, and distribution parameters for survival time data with an unknown distribution can be estimated more efficiently in terms of smooth lines. The PLS-spline model is viewed as a useful addition, alongside the widely used PLS-Cox model, to the toolbox for estimation and prediction of survival time responses.
5. Conclusion
The proposed PLS-spline model based on different scales with various knots is shown to be a better choice in terms of model performance and interpretation potential. Using the PLS-spline model based on the odds scale with one knot, the influential factors identified as important predictors of infant mortality are in agreement with other studies. The PLS-spline model therefore has potential as a multivariate survival technique in scientific research for treating high-dimensional correlated survival time data more efficiently.
Data Availability
Data are freely available at http://www.dhs.org.
Conflicts of Interest
The authors declare that they have no conflicts of interest.