Abstract

Both the frequentist and Bayesian schools of statistics have developed increasingly refined tools and model choices for collected data or measurements. Model selection approaches have advanced because of the difficulty of comparing complicated hierarchical models in which linear predictors vary by grouping variable and the number of model parameters is not clearly defined. Many regression model selection criteria rely on the maximum likelihood (ML) point estimate of the parameters and the logarithm of the likelihood of the dataset. This paper applies the information complexity criterion (ICOMP), the Bayesian deviance information criterion (DIC), and the widely applicable information criterion (WAIC) of the BRMS package to hierarchical linear models fitted to repeated measures, using one simulation and two real data examples. The Fisher information matrix for the Bayesian hierarchical model, covering both fixed and random parameters under maximum a posteriori estimation, is derived. Using Gibbs sampling and hybrid Hamiltonian Monte Carlo approaches, six different models were fitted to each of three application datasets, and the best-fitting candidate models were identified under both MCMC approaches. In this case, the Bayesian hierarchical (mixed-effects) linear model with random intercepts and random slopes, estimated with the Hamiltonian Monte Carlo method, best fits the two application datasets. Information complexity (ICOMP) is a better indicator of the best-fitting models than DIC and WAIC. In addition, the information complexity criterion showed that hierarchical models estimated with gradient-based Hamiltonian Monte Carlo fit best and show superior convergence relative to the gradient-free Gibbs sampling method.

1. Introduction

We consider the challenge of comparing complex hierarchical models in which linear predictors vary by grouping variables and the number of model parameters is not clearly distinct [1]. By their natural structure, experimental data in cognitive science, education, public health, and social follow-up studies contain “clusters.” These nested structures comprise experimental measurements that are much more correlated within groups than between them. In clinical trials and experimental designs, such clusters are subjects and experimental units (e.g., words, pictures, and measurements presented to the subjects). These clusters arise because we have multiple (repeated) observations for each subject and item [2]. Incorporating this grouping structure into data analysis leads to the necessary use of a hierarchical model (also called a multilevel or mixed-effects model). This grouping structure and this type of hierarchical modeling are closely connected to the concept of exchangeability [3]. The exchangeability concept in hierarchical models is the Bayesian counterpart of the “independent and identically distributed” assumption one often encounters in classical statistical thinking.

The rapid advancement of complex statistical modeling and computing to fit real-world data structures requires sound model selection criteria [4]. In statistical modeling, when attention is paid to model convergence and interpretability, there is often no compelling reason to select a single best model according to some criterion [5]. Instead, it makes more sense to “deselect” poor models that overfit or underfit, keeping a subset of models for further inference. Often, this subset contains a single model, but sometimes it does not.

Model selection is a procedure for finding the best-fitting model from a set of candidate models relating the predictor variables and covariates to the outcome/dependent variable [6]. Model selection approaches seek the best trade-off between goodness of fit to the data and model complexity [7, 8]. Based on how they interpret model complexity, these techniques can be categorized by their goals: high predictive density, low predictive error, high model probability, or minimal loss of information. These goals can be pursued with or without guarantees of consistent model selection and with or without the integration of Bayesian prior information [9].

Hierarchical (i.e., mixed-effects) linear models include random intercepts, random slopes for all within-subject factors and experimental units, and correlations between the random-effects components [10]. Simulation results have shown that complex models with more parameters, including Bayesian parameter priors, exhibit weaker convergence [11]. Although various model selection tools indicate a best-fitting model, proper model selection depends on the random-effects structure of the Bayesian hierarchical model. That is why model selection should be built on goodness of fit while also considering model complexity [12]. Fitted models with high complexity attempt to capture every deviation in the data points. Such models tend to have high-variance structures, leading to overfitting the data [13].

Model selection based merely on the fit to the training dataset leads to choosing an unnecessarily complex model that overfits the data and thus generalizes poorly. Model selection techniques must appropriately guard against the consequences of overfitting [9, 11, 12, 14].

Many information criteria have been implemented for different types of models and are found in different software packages. These include the Akaike Information Criterion (AIC), the corrected Akaike Information Criterion (AICc), the Schwarz Bayesian Criterion (SBC, SBIC), and the Bayesian Information Criterion (BIC) [15]. All these information criteria are based on maximum likelihood (ML) point estimation and can be expressed as the sum of a deviance term and a penalty term [16]. These criteria select the “best-fitting” model with sufficient goodness of fit and few parameters, penalizing overparameterization for which no other appropriate measure is available [17]. Bayesian model selection methods instead use the posterior to decide whether the fitted model maximizes the expected utility of the posterior distribution of the data and parameters [5, 18].

The model averaging approach uses flat priors over the range of plausible values of the model parameters [19]. Another popular model selection method is the Bayes factor, which requires one model nominated a priori by the statistician or researcher out of all available fitted models [20]. Nevertheless, the researcher’s (or scientist’s) choice might be wrong, making the Bayes factor relevant only in limited situations. In addition, for information criteria computed from the deviance term of the maximum likelihood (ML) point estimate, the penalty terms rely only on the sample size and the number of parameters. In contrast, non-point estimation of the full posterior (the expected posterior estimator) takes the variance-covariance matrices into account [13, 17, 21].

In Bayesian model comparison, the deviance information criterion (DIC) available in MCMCglmm and the widely applicable information criterion (WAIC) of BRMS in Stan are attempts to find the model with the best predictive performance [22]. Like DIC, WAIC estimates the effective number of parameters to adjust for overfitting in the model. WAIC is a fully Bayesian, pointwise version of the AIC and is asymptotically equivalent to the deviance information criterion (DIC) [20].

On the other hand, a novel model selection criterion known as information complexity (ICOMP) is calculated from the information-based covariance complexity index of the set of random parameter vectors of a general multivariate linear or nonlinear model, that is, from the structural complexity of the estimated covariance matrix of the parameter vectors. ICOMP is thus a real-valued measure of model complexity. As with model selection criteria in general, the best model minimizes the criterion [23–25]. The main objective of this study is to compare popular model selection criteria based on fixed parameterizations with ICOMP criteria based on the covariance matrix of the parameter vectors, across different setups of hierarchical modeling with prior distributions. Therefore, this paper compares fitted Bayesian hierarchical models using the covariance complexity-based ICOMP and the effective-parameter-based DIC and WAIC under two popular MCMC approaches in three application datasets.

2. Methodology

Classical statistical linear models are often fitted by maximum likelihood estimation, which yields parameter point estimates. This paper presents the theoretical and practical derivation of the information complexity criterion for Bayesian hierarchical linear models, which are fitted by maximum a posteriori (MAP) estimation. The Bayesian hierarchical model considers a full posterior distribution supported by parameter prior information, and it has additional random terms besides the usual residual term in the standard linear model. In machine learning, model complexity frequently refers to the number of features or variables incorporated in a specified predictive model. Based on the researcher’s needs and the model structure, there are several freely available R-based Bayesian hierarchical modeling packages applicable to data in any field [26]. Here, we applied two freely available R packages for Bayesian hierarchical linear modeling: one gradient-free (Gibbs sampling in MCMCglmm) and one gradient-based (Hamiltonian Monte Carlo in BRMS/Stan). The hybrid Hamiltonian Monte Carlo sampler in Stan is a pioneering and more effective Monte Carlo method than other Markov chain Monte Carlo approaches [27]. Six different Bayesian hierarchical linear models, of which three used the gradient-free Gibbs sampling approach and the other three used gradient-based Hamiltonian Monte Carlo estimation, were fitted to each of the three application datasets.

2.1. The General Form of the Fisher Information Matrix for a Linear Model

For a given vector of measurements $y = (y_1, \ldots, y_n)^{T}$, a general linear model can be written as follows:

$$y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^{2} I_{n}).$$

The likelihood is a function of all model parameters $\theta = (\beta, \sigma^{2})$, defined as $L(\theta \mid y) = p(y \mid \theta)$. For a general linear regression model, the log-likelihood can be written as follows:

$$\ell(\theta \mid y) = -\frac{n}{2}\log(2\pi\sigma^{2}) - \frac{1}{2\sigma^{2}}(y - X\beta)^{T}(y - X\beta).$$

The partial derivative of the log-likelihood with respect to the parameter $\theta$ is called the score. Under the general regularity conditions, the expected value of the score is 0 [28]. Indeed, it is easy to show that

$$E\left[\left.\frac{\partial \ell(\theta \mid y)}{\partial \theta}\right|_{\theta = \theta^{*}}\right] = 0,$$

where $\theta^{*}$ is the “true” unknown value of $\theta$ such that the observations were generated with model $p(y \mid \theta^{*})$. The variance of the score is called the Fisher information matrix (FIM).

Moreover, when the log-likelihood is twice differentiable with respect to the parameter $\theta$, the FIM is given by the following equation:

$$I(\theta) = -E\left[\frac{\partial^{2} \ell(\theta \mid y)}{\partial \theta \, \partial \theta^{T}}\right].$$

The variance of $\hat{\theta}$ is estimated by the inverse of the FIM as $\widehat{\operatorname{Var}}(\hat{\theta}) = I(\hat{\theta})^{-1}$, and the standard errors (SE) of $\hat{\theta}$ become $\operatorname{SE}(\hat{\theta}) = \sqrt{\operatorname{diag}\, I(\hat{\theta})^{-1}}$.

In this case, the estimated inverse Fisher information matrix (IFIM) (i.e., the Cramér–Rao lower bound matrix) is as follows:

$$\widehat{F}^{-1}(\hat{\beta}, \hat{\sigma}^{2}) = \begin{bmatrix} \hat{\sigma}^{2}(X^{T}X)^{-1} & 0 \\ 0 & \dfrac{2\hat{\sigma}^{4}}{n} \end{bmatrix}.$$
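The relationship above can be sketched numerically. The following is an illustrative example only (the function names and data are our own, not from the paper): for the linear model $y = X\beta + \varepsilon$ with known $\sigma^2$, the Fisher information for $\beta$ is $X^{T}X/\sigma^2$, and the square roots of the diagonal of its inverse give the Cramér–Rao standard errors.

```python
import numpy as np

def fisher_information(X, sigma2):
    """FIM for the regression coefficients of a linear model with known sigma^2."""
    return X.T @ X / sigma2

def crlb_standard_errors(X, sigma2):
    """Standard errors from the inverse Fisher information matrix (Cramér–Rao bound)."""
    ifim = np.linalg.inv(fisher_information(X, sigma2))
    return np.sqrt(np.diag(ifim))

# Toy design matrix: intercept plus one standardized covariate
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.normal(size=100)])
se = crlb_standard_errors(X, sigma2=4.0)
```

Larger samples shrink these standard errors at the familiar $1/\sqrt{n}$ rate, since $X^{T}X$ grows linearly in $n$.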

2.2. Bayesian Hierarchical Linear Modeling for Repeated Measures Data

Suppose the target variable $y_{ij}$ is the $j$th repeated observation taken on the $i$th individual (subject), out of $N$ total response measurements. With the model parameters defined over the set of explanatory variables (fixed effects) $X$ and the random-term design matrix $Z$ as its complement, the hierarchical linear model for group (subject) $i$ is an extension of equation (1) and can be written as follows:

$$y_{i} = X_{i}\beta + Z_{i}u_{i} + \varepsilon_{i},$$

where $u_{i}$ denotes the random effects of group $i$ and $\varepsilon_{i}$ the residual errors.

Since the random effects vary by group $i$, $u_{i} \sim N(0, G)$ and $\varepsilon_{i} \sim N(0, R_{i})$. Therefore, the variances and covariances among the random (group) effects are the diagonal and off-diagonal elements of $G$, respectively, while the residual error variances and the covariances between errors within group $i$ are the corresponding elements of $R_{i}$ [29–31]. In general form,

$$y = X\beta + Zu + \varepsilon,$$

where $y$ denotes the vector of the outcome variable, $\beta$ denotes a vector of fixed-effects parameters, $u$ denotes a vector of associated random effects, $X$ is a matrix of covariates (explanatory variables), $Z$ denotes a block-diagonal matrix of covariates for the random effects, as a complement of $X$, comprising $m$ blocks in which each block has its own dimension, and $\varepsilon$ denotes a column vector of residuals. We assumed that the random effects $u \sim N(0, G)$ and the residuals $\varepsilon \sim N(0, R)$, where $u$ and $\varepsilon$ are independently distributed. Based on the unknown matrices $G$ and $R$, the marginal covariance of $y$ can be written as $V = ZGZ^{T} + R$ [32].
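The general form above can be made concrete with a small simulation. This is a hedged sketch with our own names and parameter values (three subjects, four repeated measures each, a single random intercept per subject): $Z$ is block diagonal with one column of ones per subject, and the marginal covariance of $y$ is $V = ZGZ^{T} + \sigma^{2}I$.

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(7)
m, n_i = 3, 4                                   # subjects, repeats per subject
N = m * n_i
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta = np.array([1.0, 0.5])                      # assumed fixed effects
Z = block_diag(*[np.ones((n_i, 1))] * m)         # N x m random-intercept design
tau2, sigma2 = 0.8, 0.3                          # Var(u), Var(eps), assumed

u = rng.normal(0, np.sqrt(tau2), size=m)         # random intercepts
y = X @ beta + Z @ u + rng.normal(0, np.sqrt(sigma2), size=N)

# Marginal covariance of y: within-subject entries share tau2, diagonal adds sigma2
V = tau2 * (Z @ Z.T) + sigma2 * np.eye(N)
```

The block structure of `V` is what induces the within-cluster correlation the paper describes: entries within a subject's block equal `tau2`, while entries across subjects are zero.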

As the population parameters of the hierarchical linear model, the model parameters are the vector of fixed effects $\beta$, the variance-covariance matrix $G$ of the random effects, and the variance $\sigma^{2}$ of the residual errors. Here, let $\theta = (\beta, G, \sigma^{2})$ be the set of fixed-effect and random-effect model parameters.

For a given set of parameter estimates $\hat{\theta}$, the covariance matrix of $y$ can be computed as $\widehat{V} = Z\widehat{G}Z^{T} + \widehat{R}$, and the estimate of the covariance matrix is a positive definite matrix [33].

The Bayesian hierarchical linear model with fixed and random effect parameters can be denoted as $p(y \mid \beta, u, \sigma^{2})\,p(u \mid G)$. This parameterization implies that, for $u \sim N(0, G)$, the marginal likelihood can be written as follows:

$$L(\beta, G, \sigma^{2} \mid y) = \int p(y \mid \beta, u, \sigma^{2})\, p(u \mid G)\, du.$$

The Best Linear Unbiased Estimator (BLUE) of the fixed effects and the Best Linear Unbiased Predictor (BLUP) of the random effects can be expressed as follows:

$$\hat{\beta} = (X^{T}V^{-1}X)^{-1}X^{T}V^{-1}y, \qquad \hat{u} = GZ^{T}V^{-1}(y - X\hat{\beta}).$$

2.3. Maximum a Posteriori (MAP) Estimator of Model Parameters

The maximum likelihood (ML) estimators of $\theta$ maximize the log-likelihood function defined as follows:

$$\ell(\theta \mid y) = -\frac{1}{2}\log|V| - \frac{1}{2}(y - X\beta)^{T}V^{-1}(y - X\beta) - \frac{N}{2}\log(2\pi).$$

Due to the lack of a simple analytical solution to this optimization problem, various numerical methods, such as the Newton–Raphson and Expectation-Maximization (EM) algorithms, can be used to maximize $\ell(\theta \mid y)$. Among the likelihood methods, unbiased estimates of the variance and covariance parameters can be computed using the restricted maximum likelihood (REML) approach, in contrast to the maximum likelihood (ML) method. The maximum likelihood estimates of the fixed effect parameters are given by the following equation:

$$\hat{\beta} = (X^{T}X)^{-1}X^{T}y,$$

where $X^{T}X$ is nonsingular. Bayesian inference introduces prior distributions over all model parameters and is built entirely on posterior distributions of the model parameters, that is, conditional distributions of a parameter given the data and the other known quantities in the model.
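The closed-form ML estimate of the fixed effects can be verified directly. This is a minimal sketch with our own simulated data: $\hat{\beta} = (X^{T}X)^{-1}X^{T}y$, computed via a linear solve rather than an explicit inverse for numerical stability.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + one covariate
beta_true = np.array([2.0, -1.0])                       # assumed true coefficients
y = X @ beta_true + rng.normal(0, 0.1, size=n)          # small noise

# ML (OLS) estimate: solve (X'X) beta = X'y instead of inverting X'X
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With low noise and a moderate sample, `beta_hat` recovers the assumed coefficients closely, illustrating why the nonsingularity of $X^{T}X$ is required.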

In the Bayesian generalized hierarchical linear model, the posterior distribution of the full set of parameters $\theta$ can be articulated as follows:

$$p(\theta \mid y) = \frac{p(y \mid \theta)\,p(\theta)}{p(y)} \propto p(y \mid \theta)\,p(\theta),$$

where the normalizing constant $p(y)$ is independent of the parameter components in $\theta$. Estimates of the parameters can be derived from the full posterior distribution, $p(\theta \mid y)$, only through specified location measures of the posterior, such as the posterior mode, mean, or median. If the prior for $\theta$, $p(\theta)$, is a constant, then the posterior in equation (14) is effectively proportional to the likelihood function in equation (10), and hence the posterior mode is numerically identical to the maximum likelihood estimate. Thus, flat or noninformative priors for the fixed effects $\beta$ and the random effects $u$ are usually preferred. In this case, the random effects are assumed to be normally distributed [34].

It is computationally true that adopting uniform prior distributions makes the posterior distribution in equation (14) effectively proportional to the likelihood function. It provides identical results for the maximum likelihood estimate and the Bayesian maximum a posteriori (MAP) estimate of the posterior mode [35]. For this reason, flat or noninformative priors for the fixed effect and random effect parameters are typically chosen. Moreover, because of the dimensional complexity of Bayesian hierarchical modeling, the usual numerical approximation techniques cannot provide the maximization solution. A recent and influential method for handling such complex statistical integration is the Markov chain Monte Carlo (MCMC) algorithm. With the noninformative prior specification, the joint posterior distribution can then be written as follows:

$$p(\beta, u, G, \sigma^{2} \mid y) \propto p(y \mid \beta, u, \sigma^{2})\, p(u \mid G)\, p(G)\, p(\sigma^{2}).$$

The theoretical derivation of the Bayesian maximum a posteriori (MAP) estimator considers a Bayesian hierarchical linear model in which all parameters are random at the individual level. Based on the marginal likelihood function of equation (10), one can derive separate scores for each component of the parameter vector [35]. Therefore, the score for the variance parameter, obtained from the gradient for the pointwise entry, is as follows:

To obtain the scores from every observed gradient, we need to use the trace operator together with component-wise multiplication, known as the Hadamard product. The score function for each observation for the parameter vector is as follows:
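The trace/Hadamard interplay used above rests on the identity $\operatorname{tr}(AB^{T}) = \sum_{ij} (A \circ B)_{ij}$, which lets pointwise score contributions be read off element by element. The following is a small numerical check of that identity (matrices are our own toy example):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
B = rng.normal(size=(4, 4))

# trace of the matrix product A B'  ...
lhs = np.trace(A @ B.T)
# ... equals the sum over the Hadamard (element-wise) product A * B
rhs = np.sum(A * B)
```

Because the right-hand side never forms the full matrix product, this identity is also the cheaper way to evaluate such traces in practice.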

Equation (17) gives the score vector from the gradient of the parameter (a scalar).

Similarly, the score vector for the fixed effect parameter can be obtained from the gradient.

Again replacing the matrix multiplication by the Hadamard product,

After the derivation of the gradients for the fixed and random effects, the complete set of scores can then be combined into a matrix whose columns consist of the separate score vectors from equations (17) and (19) [28].

2.4. Derivation of the Fisher Information Matrix for Hierarchical Linear Modeling

To derive the Fisher information matrix for the Bayesian hierarchical linear models (BHLM) in equation (7) or (9), we need to recall the concept of the Fisher information matrix for general regression and consider that the design matrix can be partitioned into submatrices, with an identity matrix of matching dimension for each submatrix. The covariance matrix of the random effects is then a block diagonal matrix. Under the standard distributional assumptions on the random effects and residuals, the covariance of the random effects in equation (9) can be rewritten as follows:

Here, the unknown model parameter vector comprises the fixed effect parameters and the random effect variance scalars. Thus, as an alternative to estimating the full covariance matrix, we need only estimate these scalars. Note that, besides the variance terms, covariance terms also need to be estimated.

The fixed portion of the model in equation (9) is analogous to the general linear model, whose regression coefficients are estimated by maximizing the log-likelihood. For the random part of the model (6), we assume that the random effects have the stated variance-covariance matrix and are orthogonal to the residuals, so that

Here, considering which encompasses the variance-covariances , and for fixed effect , random effect , and the residual term , respectively, the full variance-covariance matrix for can be written as follows:

Suppose $F$ denotes the Fisher information matrix for the hierarchical model in equation (9); without considering the constant term, the log-likelihood function for model (9) can be rewritten as follows:

Now, our primary target is finding this information matrix.

Thus, the Fisher information matrix for the Bayesian hierarchical linear model (BHLM) is a block diagonal matrix that combines the covariance terms of the fixed and random parts:

Then, from equations (23) and (24), we need to invert the information matrix to get the full variance-covariance matrix concerning all parameters in the model [28].
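The practical payoff of the block-diagonal structure is that the inverse can be taken block by block. The following is a hedged numeric sketch with invented block values (a 2×2 fixed-effect block and a diagonal variance-component block), checking that inverting the full matrix agrees with inverting each block separately:

```python
import numpy as np
from scipy.linalg import block_diag

F_beta = np.array([[50.0, 10.0],
                   [10.0, 40.0]])   # hypothetical fixed-effect information block
F_theta = np.diag([8.0, 12.0])      # hypothetical variance-component block
F = block_diag(F_beta, F_theta)     # full block-diagonal Fisher information

# Full variance-covariance of the parameter estimates
cov_full = np.linalg.inv(F)
# Block-diagonal structure => equals block-wise inversion
cov_blocks = block_diag(np.linalg.inv(F_beta), np.linalg.inv(F_theta))
```

This is why the fixed-effect and variance-component standard errors can be obtained independently once the FIM has the block-diagonal form derived above.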

2.5. Bayesian Deviance Information and Widely Applicable Information Criterion

Many researchers have scrutinized model selection from frequentist and Bayesian perspectives, and many tools for selecting the “best-fitted model” have been suggested [5]. The Bayesian deviance information criterion (DIC) developed by [1] is a Bayesian version of the Akaike Information Criterion that substitutes the maximized log-likelihood with the log-likelihood evaluated at the Bayes estimate. Nevertheless, it is not fully Bayesian in logic because it reduces the posterior probability distribution to point estimates. The Bayesian deviance information criterion (DIC) can be computed as follows:

$$\mathrm{DIC} = D(\bar{\theta}) + 2p_{D},$$

where $\bar{\theta}$ is the maximum a posteriori (MAP) estimate that replaces the ML point estimate in AIC. The new measure of Bayesian predictive accuracy, expressed as the expected log pointwise predictive density, uses the difference between the posterior mean of the deviance and the deviance at the posterior means, where $\bar{D} = E_{\theta \mid y}[D(\theta)]$ is the posterior mean of the deviance and $D(\bar{\theta})$ is a point estimate of the deviance obtained by substituting the posterior means $\bar{\theta}$. This gives the effective number of parameters, $p_{D}$, as $p_{D} = \bar{D} - D(\bar{\theta})$.
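The DIC quantities can be sketched for a toy posterior. This is an illustration only (the model, data, and stand-in "posterior draws" are our own): for a normal-mean model with known unit variance, the deviance is $D(\mu) = -2\log p(y \mid \mu)$, and $p_D = \bar{D} - D(\bar{\mu})$ should land near 1, the single free parameter.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(1.0, 1.0, size=50)

# Stand-in posterior draws of the mean (flat prior, known sigma = 1):
# posterior is N(ybar, 1/n), so we sample from it directly instead of running MCMC
draws = rng.normal(y.mean(), 1.0 / np.sqrt(len(y)), size=4000)

def deviance(mu):
    """D(mu) = -2 * log-likelihood of y under N(mu, 1)."""
    return -2.0 * np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (y - mu) ** 2)

D_bar = np.mean([deviance(m) for m in draws])  # posterior mean of the deviance
D_hat = deviance(draws.mean())                 # deviance at the posterior mean
p_D = D_bar - D_hat                            # effective number of parameters
DIC = D_bar + p_D
```

With one free parameter, `p_D` comes out close to 1, matching the interpretation of $p_D$ as an effective parameter count.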

The widely applicable information criterion (WAIC), also known as the Watanabe–Akaike information criterion [36], can be viewed as an improvement over the deviance information criterion (DIC) for Bayesian models [25]. The widely applicable information criterion's penalty term is purely Bayesian and is computed pointwise as follows:

$$p_{\mathrm{WAIC}} = \sum_{i=1}^{N} \operatorname{Var}_{\theta \mid y}\left[\log p(y_{i} \mid \theta)\right].$$

Here, the penalty term $p_{\mathrm{WAIC}}$ is the variance of the log-predictive-density terms aggregated over the $N$ data points. Thus, the widely applicable information criterion can be calculated as follows:

$$\mathrm{WAIC} = -2\sum_{i=1}^{N} \log E_{\theta \mid y}\left[p(y_{i} \mid \theta)\right] + 2p_{\mathrm{WAIC}},$$

where the expectation is taken over the joint posterior distribution of the fitted model parameters and $p_{\mathrm{WAIC}}$ is an estimate of the number of effective parameters computed as in equation (26). In the general case, WAIC uses the logarithm of the pointwise predictive density and sums it over all observations. This construction leads to the asymptotic equivalence of WAIC and DIC in Bayesian modeling; however, the penalization term for overfitting estimates the number of effective parameters [20].
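The pointwise computation can be sketched from a matrix of log-likelihood values, one row per posterior draw and one column per observation. This is a toy illustration with our own normal-mean model and stand-in posterior draws, not the paper's fitted models:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
y = rng.normal(0.0, 1.0, size=40)
# Stand-in posterior draws of the mean (known sigma = 1)
draws = rng.normal(y.mean(), 1.0 / np.sqrt(len(y)), size=2000)

# log p(y_i | theta_s): shape (n_draws, n_obs) via broadcasting
log_lik = norm.logpdf(y[None, :], loc=draws[:, None], scale=1.0)

# lppd: log pointwise predictive density, summed over observations
lppd = np.sum(np.log(np.mean(np.exp(log_lik), axis=0)))
# penalty: pointwise posterior variance of the log-density
p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))
waic = -2.0 * (lppd - p_waic)
```

As with $p_D$ in DIC, `p_waic` approximates the number of free parameters (here roughly 1), but it is computed pointwise from the full posterior rather than from a single plug-in estimate.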

2.6. The Information Complexity (ICOMP) Criterion for the Hierarchical Linear Model

Deviance can be defined as the likelihood difference between the fitted and perfect models. It is used to measure the deviance of the fitted binary logistic model relative to the saturated model. The deviance is always greater than or equal to zero, equaling zero only if the fit is perfect [35]. The deviance of the worst (null) model, the one fitted without any predictor, relative to the ideal model can be written analogously.
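The definition above can be checked with a few lines of code. This is a minimal sketch (function name and data are ours): the binary deviance $D = -2(\ell_{\text{fitted}} - \ell_{\text{saturated}})$ is nonnegative and vanishes only when the fitted probabilities reproduce the data exactly.

```python
import numpy as np

def binary_deviance(y, p_hat):
    """Deviance of fitted probabilities p_hat relative to the saturated model."""
    eps = 1e-12
    p_hat = np.clip(p_hat, eps, 1 - eps)        # guard against log(0)
    loglik_fit = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
    loglik_sat = 0.0                            # saturated model: p_i = y_i, log-lik = 0
    return -2.0 * (loglik_fit - loglik_sat)

y = np.array([1, 0, 1, 1])
```

A perfect fit (`p_hat` equal to `y`) gives deviance essentially zero, while a constant `p_hat = 0.5` gives the strictly positive deviance of an uninformative model.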

As background on model selection criteria, the Akaike Information Criterion (AIC) [21, 37] is extensively used as a selection criterion for different statistical models and is given by the following equation:

$$\mathrm{AIC} = -2\log L(\hat{\theta} \mid y) + 2k,$$

where $\log L(\hat{\theta} \mid y)$ is the log-likelihood function of the parameter $\theta$ in a probability distribution and $k$ is the number of parameters in the model. ICOMP (I, information; COMP, complexity) is a criterion developed by [38] for selecting multivariate linear and nonlinear models. Whereas the log-likelihood-based AIC merely strikes a balance between the lack-of-fit and penalty terms, ICOMP seeks this balance by taking into account a measure of complexity that assesses how the model's parameters interact with one another. As a result, it penalizes the covariance complexity of the model rather than simply the number of parameters [39]. The information complexity (ICOMP) criterion is given by the following equation:

$$\mathrm{ICOMP} = -2\log L(\hat{\theta} \mid y) + 2C_{1}(\widehat{\Sigma}).$$

The second part of equation (29) is the measure of complexity of the model in equation (9), which can be computed as follows:

$$C_{1}(\widehat{\Sigma}) = \frac{s}{2}\log\left(\frac{\operatorname{tr}(\widehat{\Sigma})}{s}\right) - \frac{1}{2}\log\left|\widehat{\Sigma}\right|,$$

where $|\widehat{\Sigma}|$ represents the determinant of $\widehat{\Sigma}$ and $s$ is the dimension of $\widehat{\Sigma}$. As seen, $C_{1}$ combines the two most straightforward measures of multivariate scatter, the determinant and the trace, in a single function [40]. Depending on the model structure, the ICOMP criterion can be computed by substituting the inverse Fisher information matrix (IFIM) to measure the covariance complexity. Information complexity (ICOMP) has various forms [41–43].
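The behavior of this complexity measure can be demonstrated directly. In the sketch below (our own toy covariance matrices), $C_{1}$ is zero for an identity covariance, where the parameter estimates are uncorrelated with equal variances, and grows as correlation among the estimates increases:

```python
import numpy as np

def c1_complexity(Sigma):
    """C1(Sigma) = (s/2) log(tr(Sigma)/s) - (1/2) log|Sigma|."""
    s = Sigma.shape[0]
    sign, logdet = np.linalg.slogdet(Sigma)     # stable log-determinant
    return 0.5 * s * np.log(np.trace(Sigma) / s) - 0.5 * logdet

Sigma_indep = np.eye(3)                          # uncorrelated, equal variances
Sigma_corr = np.array([[1.0, 0.8, 0.0],
                       [0.8, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])         # strong correlation in one pair
```

This is precisely the sense in which ICOMP penalizes parameter interdependence rather than a raw parameter count: both matrices have three parameters, but only the correlated one incurs a positive complexity penalty.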

AIC, DIC, BIC, and ICOMP, among other model fit criteria, combine the goodness-of-fit term of the model, $-2\log L(y)$, with a measure of the model's complexity, commonly referred to as the penalty term. All likelihood-based ML point estimation and maximum a posteriori estimation measures of model fit do likewise. However, many criteria use the number of parameters as the penalty for complexity; the penalty term of AIC is $2k$, two times the number of estimated parameters. In contrast, the penalty term of the ICOMP criterion is a measure of the covariance complexity of the fitted model [41]. Because the structured covariance matrix of the estimated parameters is unknown in closed form and may be nonidentifiable, we apply the estimated inverse Fisher information matrix to assess the complexity of the Bayesian hierarchical linear model [44]. The estimated inverse Fisher information matrix is computed by substituting the parameter estimates into the information matrix. Thus,

Combining equation (23), with its constant, with equation (29), the inverse Fisher information matrix (IFIM)-based information complexity (ICOMP) [42] is computed as follows:

$$\mathrm{ICOMP}(\mathrm{IFIM}) = -2\log L(\hat{\theta} \mid y) + 2C_{1}\left(\widehat{F}^{-1}(\hat{\theta})\right).$$

When the ICOMP criterion is applied to the fitted models, the model with the minimum ICOMP value is selected as the best model.

3. Application of Model Fitting to Categorical Outcome Repeated Measures Data

This article demonstrates the fitting of a categorical-outcome Bayesian generalized hierarchical linear model for a trait with repeated observations using two different R packages (MCMCglmm and BRMS) on two real datasets and one simulated dataset.

3.1. Bayesian Hierarchical Linear Model Specifications

Suppose the general Bayesian hierarchical model in equation (1) holds for each subject (group), with measurements repeated over occasions, under the usual normality assumptions.

Every part of the hierarchical model can be seen in the graph as a node, as depicted in Figure 1. The solid arrows reflect stochastic (random) connections, whereas the dotted arrows indicate deterministic (fixed) relationships across the parameters.

3.1.1. Model 1: The Null or Unconditional Random Intercepts Model

In hierarchical modeling, the null model is the model with only the grouping (clustering-level) variables as determinants of the intercept of the dependent variable [45]. This is an “unconditional” random intercept model, since it predicts the outcome variable's level-one intercept without any predictor variable at any level. Furthermore, this model provides information on intraclass correlations, which helps to decide whether hierarchical models are necessary for the data in the first place. Here, the model includes the overall intercept together with random intercepts, and their variance (standard deviation) components, for each individual across all measurements and for the repeated measures, respectively. In a Bayesian hierarchical model, the BRMS-Stan documentation in R recommends using a half-Cauchy prior, which is automatically constrained to be positive, to lessen the likelihood of unreasonably large standard deviation (SD) values [45]. For the intercept, the default prior is a normal distribution.

3.1.2. Model 2: The Conditional Random Intercepts Model

A conditional random intercept model incorporates a cluster as a random effect, and only the intercept of the outcome variable is adjusted for the random effects. It is also conditional because predictor variables are added alongside the clustering variables [46]. A random intercept model is one in which the intercepts are allowed to vary. As a result, the intercept that fluctuates across groups predicts the scores on the dependent variable for each unique observation [47].

However, the slopes in this model are assumed to be fixed (the same across different contexts). A random intercept logistic regression model is then defined as follows, where the overall intercept is accompanied by random intercepts, and their variance (standard deviation) components, for each individual across all measurements and for the repeated measures, respectively. In a Bayesian hierarchical model, the BRMS-Stan documentation in R recommends using a half-Cauchy prior, which is automatically constrained to be positive, to lessen the likelihood of unreasonably large standard deviation (SD) values at all levels [45]. The default prior for the intercepts and slopes is normal, with a mean of 0 and a standard deviation of 10.

The mathematical notation of the variance-covariance matrix for the fixed effects, considering the overall intercept and slopes, can be expressed as follows, where the variance and standard deviation components of the intercept and coefficients appear on the diagonal and the correlation between intercepts and coefficients appears off the diagonal.

Similarly, the covariance matrix for the grouping (random) effects can be expressed as follows, where the variance components at the individual subject and repeated-measurement levels appear on the diagonal and the correlation between measurements within a subject appears off the diagonal. Hierarchical (multilevel) models adjust parameter estimates of the intercepts (means) of one or more dependent variables at level 1 based on the grouping variables defining the higher levels. They also adjust the slopes (coefficients) of one or more predictors (regressors) at any level, with an association between intercepts and coefficients at the subject level.
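Such a covariance matrix is assembled from the standard deviations and their correlation. The following is a minimal sketch with our own illustrative values (intercept SD, slope SD, and correlation $\rho$), building the 2×2 matrix with $\rho\,\sigma_0\sigma_1$ off the diagonal:

```python
import numpy as np

def re_covariance(sd_intercept, sd_slope, rho):
    """2x2 covariance from intercept SD, slope SD, and their correlation rho."""
    cov = rho * sd_intercept * sd_slope
    return np.array([[sd_intercept ** 2, cov],
                     [cov, sd_slope ** 2]])

# Hypothetical values: SD(intercept)=2, SD(slope)=0.5, correlation -0.4
Sigma_u = re_covariance(2.0, 0.5, rho=-0.4)
```

Any $\rho$ strictly between -1 and 1 with positive SDs yields a positive definite matrix, which is the admissibility condition the sampler must respect when these components are estimated.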

3.1.3. Model 3: The Conditional Random Coefficients Model

The random coefficients model, also called the conditional random coefficient model, is the model with all effects varying: it adjusts the intercept (mean) of the outcome variable as well as the coefficients (slopes) of the predictors for the random effects at any clustering level of the hierarchical data. The coefficient (slope) terminology of the random coefficients model should not obscure the fact that a random slope model estimates the intercept (mean) as well as the slopes (regression coefficients) at the appropriate level of the hierarchy [48].

A random slope model is one in which the slopes are permitted to change, resulting in slopes that differ between groups. The most realistic model contains both random intercepts and random slopes, but it can also be the most complex. Both intercepts and slopes can change among groups in this paradigm, implying distinct effects in different situations [49]. Thus, in Bayesian hierarchical modeling, the parameters of the prior distributions appear as hyperparameters, and the distributions of those hyperparameters appear as hyperprior distributions.

It can be assumed that the hyperparameters of both the random intercepts and the random coefficients/slopes have uniform hyperprior distributions, with assumptions suitable for the parameters. A more common way to write the model, addressing the correlation between parameters, is to let the regression coefficients vary by the grouping variables, where the overall intercept is accompanied by random intercepts, and their variance (standard deviation) components, for each individual across all measurements and for the repeated measures, respectively. A half-Cauchy distribution is used as the prior for the standard deviations (SD) at the individual subject and measurement levels. The default prior for the intercepts and slopes is normal, with a mean of 0 and a standard deviation of 10.

Considering the intercepts (means) for the groups, the covariance matrix for the random and fixed effects of the model can be expressed as follows, where the variance components of the random and fixed effects appear on the diagonal and the relationship between the random and fixed effects appears off the diagonal.

3.2. Real Data Application One: Arterial Occlusive Disease Data

In this application, we consider a dataset related to surgical planning, specifically arterial occlusive disease data. Two popular invasive methods, ultrasound imaging and reduced cuff pressure measurement, classify each leg as either healthy (0) or diseased (1). The variables are the patient's health status, the patient's ultrasound measurement, and the patient's reduced cuff pressure (RCP) measurement, each taken on a given side of a given leg. Measurements are indexed by individual patient and by leg side: right leg upper side, right leg lower side, left leg upper side, and left leg lower side. In the body, the branches of the arteries that come out of the aorta and carry oxygenated blood to the arms, legs, head, and organs, together with the branches of the veins that return deoxygenated blood from them to the heart, are called the peripheral vessels [50].

Arterial occlusive disease is the occlusion or narrowing of the arteries in the legs (or, rarely, in the arms), usually caused by atherosclerosis, which results in reduced blood flow. The data were collected at Broadgreen Hospital in Liverpool, England, in 1988/89 [51]. Healthy peripheral arteries have a smooth lining that prevents coagulation and promotes constant blood flow. Peripheral artery disease can affect all arteries but is most commonly seen in the legs. The data included 16 patients whose features were measured at 4 points, on the lower and upper sides of their right and left legs. Of all the patient characteristics measured during data collection, we considered only the patient's health status, ultrasound measurements, and reduction in cuff pressure measurements, excluding variables measured at a single point and missing values. The patient's health status was the outcome variable, while the ultrasound imaging score and the reduction in cuff pressure measurement were the independent variables. The data structure for the variables in each measurement is shown in Table 1.

3.3. Real Data Application Two: The National Longitudinal Survey of Youth 1979

The National Longitudinal Survey of Youth 1979 is a longitudinal project that follows a sample of Americans on various aspects of life, with data collected from 1979 to 2012. The dataset covers multiple characteristics, mainly socioeconomic status, employment, education, and marriage. For this modeling application, we considered 895 Black Americans by year level; the variables, namely poverty status, family size, and overall income, were extracted after removing observations with missing measurements on the selected variables for two years. Bayesian hierarchical linear models were demonstrated by modeling the poverty status of young people as the response variable, with family size and overall income, varying over youth identification (YID), as the predictor variables. The data can be accessed from https://dasil.sites.grinnell.edu/downloadable-data/.

3.4. Application Three: Simulation of Repeated COVID-19 PCR Test Results Scenario

This section considers the variation across three consecutive PCR tests for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) in relation to patient age and testing site (environment). The scenario can easily be simulated for a sample of 50 patients in R, as shown in the R code of the data analysis.
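Although the study's simulation is written in R, the data-generating structure can be sketched compactly in any language. The following Python analogue uses hypothetical effect sizes (not the study's actual parameter values) and a patient-level random intercept to induce within-patient correlation, illustrating how such repeated binary test results arise:

```python
import numpy as np

rng = np.random.default_rng(42)

n_patients, n_times = 50, 3                  # 50 patients, 3 repeated PCR tests
age = rng.integers(18, 80, size=n_patients)  # hypothetical patient ages
site = rng.integers(0, 2, size=n_patients)   # test site/environment (0 or 1)

# Patient-level random intercepts induce within-patient correlation
u = rng.normal(0.0, 1.0, size=n_patients)

# Hypothetical fixed effects: positive for time point and age, negative for site
records = []
for i in range(n_patients):
    for t in range(n_times):
        eta = -3.0 + 0.4 * t + 0.05 * age[i] - 0.5 * site[i] + u[i]
        p = 1.0 / (1.0 + np.exp(-eta))       # logistic link
        y = rng.binomial(1, p)               # PCR result: 1 = positive
        records.append((i, t, age[i], site[i], y))

print(len(records))  # 150 patient-by-time observations
```

The two-level structure (repeated tests nested within patients) is exactly what the random intercept and random coefficient models in the paper are designed to capture.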

4. Results and Discussion

4.1. Results of Fitted Model Estimates and Model Selection in Each Application Case

After fitting a set of models for each dataset, we need to know which model is more accurate and should be used for inference and conclusions. Choosing the model with the best absolute fit to the observed dataset can be misleading, as this model does not necessarily perform well on new data. Instead, we may want to choose the model with the best predictive capability, that is, the model that performs best at predicting data that have not yet been observed. We call this ability the out-of-sample prediction performance of the model [3].

Model convergence in a complex parameter configuration was assessed using the R-hat statistic, sometimes referred to as the potential scale reduction factor (PSRF), and effective sample sizes (ESS). Consistency of an ensemble of Markov chains was indicated by an effective sample size greater than 100 and an R-hat close to 1.00 but not greater than 1.10 [52]. As model convergence diagnostics, we report two effective sample size measures for hierarchical models: the bulk effective sample size (Bulk_ESS) and the tail effective sample size (Tail_ESS).
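As a point of reference, the split R-hat statistic can be computed directly from the posterior draws. This is a minimal Python sketch of the classic split-chain formula (Stan's current implementation additionally rank-normalizes the draws, which this sketch omits):

```python
import numpy as np

def split_rhat(chains):
    """Split R-hat (potential scale reduction factor) for one parameter.

    `chains` is an (n_chains, n_draws) array; each chain is split in half,
    then the classic Gelman-Rubin statistic is computed on the halves.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    half = n // 2
    splits = np.concatenate([chains[:, :half], chains[:, half:2 * half]], axis=0)
    n2 = splits.shape[1]
    chain_means = splits.mean(axis=1)
    w = splits.var(axis=1, ddof=1).mean()      # within-chain variance W
    b = n2 * chain_means.var(ddof=1)           # between-chain variance B
    var_plus = (n2 - 1) / n2 * w + b / n2      # marginal variance estimate
    return np.sqrt(var_plus / w)

# Well-mixed chains targeting the same distribution give R-hat close to 1
rng = np.random.default_rng(0)
good = rng.normal(0, 1, size=(4, 1000))
assert split_rhat(good) < 1.10
```

Chains whose means have not yet agreed inflate the between-chain variance and push R-hat above the 1.10 cutoff used in the paper.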

In the first case, based on the arterial occlusive disease real-data application, the results in Table 1 showed that, among the six fitted models, the random coefficient model under the Hamiltonian Monte Carlo algorithm fits best. Moreover, the Bayesian information complexity criterion yields smaller values than the Bayesian deviance or widely applicable information criteria for each fitted model. Thus, the random coefficient model fitted by the Hamiltonian Monte Carlo (HMC) algorithm, having the lowest ICOMP value, was identified as the best, or most accurate, model for the first application dataset.

In the second case, based on the National Longitudinal Survey of Youth 1979 real-data application, the results in Table 1 showed that, among the six fitted models, the random intercept model under the Hamiltonian Monte Carlo algorithm fits best. Moreover, the Bayesian information complexity criterion yields smaller values than the Bayesian deviance or widely applicable information criteria for each model. Thus, the random intercept (fixed slopes) model fitted by the Hamiltonian Monte Carlo (HMC) algorithm, having the lowest ICOMP value, was identified as the best, or most accurate, model for the second application dataset. The results also showed that allowing the coefficients to vary by group provides no additional information for these data.

In the third case, based on the simulated dataset of repeated COVID-19 PCR test result scenarios, the results in Table 2 showed that, among the six fitted models, the random coefficient model under the Hamiltonian Monte Carlo algorithm fits best. Thus, the random coefficient model fitted by the Hamiltonian Monte Carlo (HMC) algorithm, having the lowest ICOMP value, was identified as the best, or most accurate, model for the third (simulation) application dataset. All three cases showed that Hamiltonian Monte Carlo (HMC) estimation outperforms the Gibbs sampler for complex Bayesian hierarchical models. The ICOMP criterion yields smaller values and is a better model assessment tool than the Bayesian deviance or widely applicable information criteria for each fitted model of two-level repeated measures data [53].
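For readers unfamiliar with ICOMP, Bozdogan's criterion replaces the parameter-count penalty of AIC-type criteria with the information complexity C1 of the estimated covariance matrix of the parameter estimates (the inverse Fisher information). A minimal Python sketch of the standard C1 form is shown below; in practice the covariance matrix would come from the fitted hierarchical model, whereas here it is a placeholder argument:

```python
import numpy as np

def c1_complexity(sigma):
    """Bozdogan's maximal information complexity C1 of a covariance matrix.

    C1(Sigma) = (s/2) * log(tr(Sigma)/s) - (1/2) * log det(Sigma),
    which is zero when all eigenvalues are equal and grows as the
    eigenvalues (parameter-estimate variances) become more unbalanced.
    """
    sigma = np.asarray(sigma, dtype=float)
    s = sigma.shape[0]
    eigvals = np.linalg.eigvalsh(sigma)
    return 0.5 * s * np.log(eigvals.mean()) - 0.5 * np.log(eigvals).sum()

def icomp(neg2_loglik, cov_of_estimates):
    """ICOMP = -2 log-likelihood + 2 * C1(covariance of the estimator)."""
    return neg2_loglik + 2.0 * c1_complexity(cov_of_estimates)

# A spherical covariance (all eigenvalues equal) incurs no complexity penalty
assert abs(c1_complexity(np.eye(3))) < 1e-12
```

By the arithmetic-geometric mean inequality, C1 is always nonnegative, so correlated or unevenly scaled parameter estimates are penalized even when the nominal parameter count is the same.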

In Tables 3–5, the results for the application datasets show the estimates of the posterior mean (intercept), the coefficient estimates, and the standard error, given as the standard deviation (SD), for each parameter. Model convergence was adequate in all cases, since both the bulk effective sample size (Bulk_ESS) and the tail effective sample size (Tail_ESS) for the 95% credible intervals were sufficient [25]. Generally, each parameter is summarized by the posterior mean (“Estimate”) and the standard deviation of the population parameter (“Std. Err.”). Moreover, a 95% credible interval can be reported with lower and upper bounds based on posterior quantiles.
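These tabled summaries are simple functions of the posterior draws. As a minimal Python sketch on synthetic draws (the fitted models' actual draws are not reproduced here):

```python
import numpy as np

def summarize(draws):
    """Posterior mean, SD, and 95% credible interval from MCMC draws."""
    draws = np.asarray(draws, dtype=float)
    lower, upper = np.quantile(draws, [0.025, 0.975])  # posterior quantiles
    return {"Estimate": draws.mean(),
            "Std. Err.": draws.std(ddof=1),
            "l-95% CI": lower,
            "u-95% CI": upper}

# Synthetic draws standing in for one parameter's posterior sample
rng = np.random.default_rng(1)
summary = summarize(rng.normal(2.0, 0.5, size=4000))
# Mean near 2.0, SD near 0.5, interval roughly (1.0, 3.0)
```

The equal-tailed interval based on the 2.5% and 97.5% quantiles matches the convention used in the BRMS output tables.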

4.2. Assessment of Convergence and Conditional Effects in Selected Models

The information complexity (ICOMP) criterion identified the best-fitted model, as supported by the figures. The hierarchical models with gradient-based Hamiltonian Monte Carlo estimation provided the best fit and superconvergence relative to the gradient-free Gibbs sampling approach. Although there is no single best statistical package for a particular statistical model and dataset, complex hierarchical models can be estimated well using Bayesian Regression Models using Stan (BRMS) compared with MCMC generalized linear mixed/hierarchical models (MCMCglmm) in the R software [27, 49, 52, 54].

For Application 1, the arterial occlusive disease data, the model convergence diagnostics, paired plots, and marginal effects are visualized in Figures 2–4 for the models fitted by MCMCglmm and BRMS in Stan. The convergence plots for the models selected above showed that the random coefficient model under the HMC approach fitted best and that the Bayesian hierarchical models converged well under the Hamiltonian Monte Carlo approach. The marginal effects plot of the best-fitted model (Figure 2) showed that leg side and ultrasound imaging had a positive effect, while the reduced cuff pressure (RCP) measure had no effect in either the fixed or the random part. Leg side positively affected the probability of a patient being diseased (Figure 3). There is more variation in disease occurrence between the legs within a patient than between patients. Figures 3 and 4 show that ultrasound measurements had a significant positive effect on patients' arterial occlusive disease status, varying across the four leg sides, whereas the reduced cuff pressure (RCP) measure showed no variation.

For Application 2, the National Longitudinal Survey of Youth 1979 data, the model convergence diagnostics, paired plots, and marginal effects are visualized in Figures 5 and 6 for the models fitted by MCMCglmm and BRMS in Stan. The convergence plots showed that the selected random intercept model under the HMC approach fitted best and that the Bayesian hierarchical models converged well under the Hamiltonian Monte Carlo approach. The marginal effects of year, family size, and income on the poverty status of youths in the best-fitted model, visualized in Figures 6 and 7, showed that year and family size had a positive effect. In contrast, income had almost no impact on poverty status.

For Application 3, the simulation data, the model convergence diagnostics, paired plots, and marginal effects are visualized in Figures 8–10 for the models fitted by MCMCglmm and BRMS in Stan. The convergence plots for the models selected above showed that the random coefficient model under the HMC approach fitted best and that the Bayesian hierarchical models converged well under the Hamiltonian Monte Carlo approach. The marginal effects plot of the best-fitted model (Figure 8) showed that the test time points and age had a positive effect, while the test site had a slightly negative, near-null impact in both the fixed and random parts. The probability of the event (i.e., a positive SARS-CoV-2 PCR test), shown in Figures 3 and 10, increases with a patient's age across the three measurement time points but decreases with test site variation.

5. Conclusion

Complex hierarchical models require Bayesian computation (MCMC) that integrates the three components of the posterior: the prior, the likelihood, and the evidence (marginal likelihood). This study demonstrates two Bayesian computation (MCMC) methodologies for fitting three distinct application datasets to three Bayesian hierarchical models: the null, random intercept, and random coefficient models. All models estimated by the Gibbs sampler and Hamiltonian Monte Carlo (HMC) approaches were compared to select the best-fitted model in each application case. In all cases, models fitted with the Hamiltonian Monte Carlo (HMC) method fitted better than models fitted with the Gibbs sampler. Hamiltonian Monte Carlo (HMC) in the BRMS package and the Gibbs sampler in the MCMCglmm R package were used in this context.

Model convergence diagnostics, based on the effective sample size cutoffs for hierarchical models, namely the bulk effective sample size (Bulk_ESS) and the tail effective sample size (Tail_ESS) for the 95% credible intervals, were adequate for each parameter estimate.

Moreover, model comparisons and selections were made using Bozdogan's information complexity measure, the Bayesian deviance information criterion (DIC, under MCMCglmm), and the widely applicable information criterion (WAIC) of BRMS in Stan [55]. Among the fitted candidate models, the information complexity (ICOMP) criterion showed the lowest values in all cases; a better-fitting model has a smaller selection criterion value. The model convergence assessment graphs depicted "hairy caterpillar" trace plots, indicating well-mixed chains and models well fitted to the applied data. The performance of ICOMP-type criteria against DIC and/or WAIC was validated using Bayesian hierarchical linear models. We conclude that, rather than DIC and/or WAIC alone, ICOMP deserves more exploration as a tool for model assessment and comparison across various thematic areas of two-level repeated measures data.

Data Availability

The labeled real datasets (Applications 1 and 2) used to support the findings of this study, together with the R code for the data analysis, including the simulation scenario (Application 3) and all fitted model applications, are available upon request from the principal investigator.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

EA Ebrahim was engaged at every research stage: conceptualization, manuscript writing, data management, and analysis, and wrote the initial draft of the manuscript. EA Ebrahim and MA Cengiz performed material preparation, data management, and analysis. MA Cengiz participated in editing and commenting on the manuscript. Erol Terzi participated in proofreading and revising the manuscript. All authors had substantial direct and intellectual involvement in the manuscript and approved it for publication.