Abstract

Continuous advancements in biotechnology are generating new knowledge and data sources that might be of interest for the insurance industry. A paradigmatic example of these advancements is genetic information, which can reliably indicate the future appearance of certain diseases, making it an element of great interest for insurers. However, regulators place this information at the highest level of confidentiality and protect it from disclosure. Recent investigations have shown that the microbiome can be correlated with several health conditions. In this paper, we examine the potential use of microbiome information as a tool for cardiovascular diagnosis. Using a recent dataset, we analyze the relation between some variables associated with coronary illnesses and several components of the microbiome by means of a new copula-based multivariate regression model with compositional data in the predictor. Our findings show that the coabundance group associated with Ruminococcaceae-Bifidobacteriaceae has a negative impact on the age for nonsedentary individuals. However, one should be cautious with this conclusion since environmental conditions also influence the baseline microbiome.

1. Introduction

In recent years, advances in biomedical sciences and biotechnology have enabled an unprecedented leap forward in the amount and variety of data and information available for research and other purposes. This has translated into new applications and the development of new approaches such as personalised or precision medicine, where the aim is to use these new data and information sources (mostly related to genetic/genomic information) for diagnostic and therapeutic purposes and to tailor them to individuals or groups of patients (see Ginsburg and Phillips [1]). However, genomic information is placed at the highest level of confidentiality and protected from disclosure. For these reasons, the use of genomic data has sparked a debate around the ethics and limits associated with the use of this knowledge and information in different sectors and how it could eventually be regulated. In an attempt to overcome this potential limitation, in this paper we propose to explore new avenues and information sources, in particular the use of the microbiome as a potential tool for disease diagnosis.

The microbiome is defined as the set of microorganisms that live inside or on the organism, and its analysis has attracted great interest in the biomedical domain, particularly since the development of new technologies that have facilitated access to this information and reduced its cost. The microbiome is extremely dynamic: it changes with time, with environmental conditions and other external factors such as diet, geographical location, or physical activity, and even with interactions among microbes and between microbes and the host. Advances in biotechnology allow researchers to measure the dynamic behavior of the microbiota at a large scale (see [2]). Recent studies have shown that differences in microbiome composition are correlated with an increasing number of conditions, ranging from cardiovascular, autoimmune, and metabolic diseases to neurological disorders and mental health aspects [3–7]. Another interesting reference is Walker et al. [8], which links a well-established cohort in the biomedical domain (the Framingham Study) with changes in the microbiome for multiple relevant health parameters such as cardiovascular risk, metabolic syndrome, and diabetes. BMI and physical activity have also been studied in the context of the microbiome, and relationships with different microbiome compositions have been found [9–12]. Quite often, these relationships have been studied under the umbrella of either different ages or other health conditions. For a recent review highlighting that physical activity has an impact on the gut microbiota and that physical exercise could be used to control obesity and health, see [13]. Both BMI and physical activity are important factors considered in insurance underwriting. For example, a higher BMI is predominantly related to blood pressure and lipids, which is consistent with results found in the literature (see [14] or [15]). There is also an association of obesity and higher BMI with all-cause mortality (see [16]).
Besides, BMI is connected to increased cancer risk, as recently described by Bhaskaran et al. [17]. On the other hand, microbiome data present a singular challenge due to their inherently high-dimensional and sparse structure. To handle the high dimensionality and compositional nature of the data, Wang et al. [18] proposed a sparse microbial causal mediation model; also, Zhang et al. [19] used an isometric log-ratio transformation of the relative abundances as the mediator variables between treatment and outcome. A statistical approach that enables the inclusion of all daily activity behaviors, based on the principles of compositional data analysis, was described by Dumuid et al. [20].

In our cross-sectional analysis, we use a sample of eligible individuals with unique microorganisms across the Indian microbiome population. Owing to the large number of operational taxonomic units available in the gut microbiome across the sample, a clustering analysis was initially carried out to reduce the dimensionality of the dataset. Then, the resulting proportions of each of the bacterial coabundance groups are combined into compositional data predictors that are used to jointly explain the relationship of age, BMI, and level of physical activity by means of a new bivariate regression model with compositional data in the predictor. In this paper, the margins are a beta regression model and a mixture of logistic regression models. As the age in years of the individuals is restricted to the interval 18–65, a beta regression model indexed by mean and dispersion parameters is considered. This regression family is useful in situations where the dependent variable is continuous and restricted to a bounded interval. On the other hand, regardless of gender and physical activity level, the empirical distribution of BMI in humans is bimodal. Therefore, choosing suitable parametric models that can capture this feature is crucial; for that reason, a mixture of logistic regression models has been chosen due to its flexibility and simplicity. These margins are linked via a $t$-copula. This elliptical copula is particularly well suited for this purpose, as it allows not only separate modeling of the univariate marginal distributions and the dependence structure but also covariate adjustment in the margins and uncertainty quantification of the dependence estimates. In addition, it can specify different levels of correlation between the marginals. The compositional data included in the linear predictor are rewritten as logarithms of ratios.
Then, we perform estimation via the inference for margins method to explain age, BMI, and level of physical activity, using as margins a beta regression and a mixture of logistic regression models. Although copula models have been widely applied to model joint distributions with mixed margins, copula models with the margins proposed in this work and with compositional data in the predictor have not been extensively studied in the literature.

The rest of the paper is structured as follows. In Section 2, a human microbiome dataset is examined. Here, a cluster analysis to classify the proportion of the most significant bacterial coabundance groups in the sample is completed. Furthermore, an approach to deal with the implementation of microbiome data as compositional data in the predictor is presented. The relationship of age, body mass index, and physical activity level with the microbiome is analyzed in Section 3. Here, we first consider the marginal relation of age, given a level of physical activity, with the microbiome through a beta regression model. Next, we examine the connection of BMI with the microbiome, given the level of physical activity, by using a mixture of logistic regression models. Later, the joint relationship of these variables is examined by using a $t$-copula. Finally, discussion and extensions conclude the paper.

2. Analysis of a Human Microbiome Dataset

In our analyses we use a dataset available in Dubey et al.’s [21] LogMPIE study. This dataset is freely accessible, and it may be downloaded from the European Nucleotide Archive (ENA) portal of the European Bioinformatics Institute (https://www.ebi.ac.uk/ena/data/view/PRJEB25642). As portrayed in the original description of the dataset, this study identifies and maps the Indian gut microbiome. It was carried out in fourteen geographical locations. Individuals were uniformly selected across geographical regions, and some variables associated with changes in the structure of the microbiome were also considered in the study design: BMI, age in years of the individual (restricted to the interval 18–65), level of physical activity (sedentary-nonsedentary), and gender (male-female). In addition, a subject is classified as obese if his/her BMI is greater than 30. This study recorded data from 1004 eligible individuals and reported 993 unique microorganisms across the Indian microbiome population. Unfortunately, this dataset provides neither longitudinal observations of individuals across time nor information on changes in the composition of the microbiome in old subjects.

2.1. Cluster Analysis

In general, the empirical distribution of microbiome data includes a high proportion of zero observations and a truncation point mass to account for high values that are too sparse to model; for that reason, models that give accurate estimates of the true proportion of zeros have been considered in the literature (see [22] and [23]). In addition, given the dynamic character of the microbiome, other techniques such as functional response regression on correlated longitudinal microbiome sequencing data have recently been considered in the literature [24]. In this work, in order to facilitate further analyses and reduce the dimensionality of our dataset, we started by carrying out a cluster analysis, a main task of exploratory data mining, to group a set of bacterial coabundance collections in such a way that objects in the same group or cluster are more similar to each other than to those in other clusters. We performed our clustering based on the coabundance of genus-like groups at a taxonomic level of species within a sample of 1004 subjects. A total of 993 bacterial genera were identified. The core microbiota analysis was completed by using the Hierarchical Ordered Partitioning and Collapsing Hybrid (HOPACH) package in R, which can be downloaded from the Bioconductor website http://www.bioconductor.org/. This package includes the HOPACH clustering algorithm, which assembles a hierarchical tree of clusters by recursively partitioning the whole dataset while ordering and collapsing clusters at each level. In our analysis, we have discarded the genera with more than 70% of zeros in the sample of 1004 individuals, i.e., those detected in fewer than 30% of the subjects. The algorithm uses the MSS (Mean/Median Split Silhouette) criterion to identify the level of the tree with maximally homogeneous clusters. The correlation distance (cor) was the metric selected for clustering the microbiome species by calculating dissimilarities between variables.
We have also used a nonparametric bootstrap to estimate the probability that each species belongs to each cluster and to better understand the variability of each cluster. To this end, we employed the “boothopach” function with 1000 bootstrap resampled datasets to obtain a suitable balance between precision and speed. As a result, we were able to group the microbiome into five groups containing different numbers of genera (see supplementary tables in Table 1). The five different groups of bacteria (classes) identified from the cluster analyses could be associated with different taxonomic groups according to the most abundant or representative genus of each identified cluster. Groups 1 and 4 are the two largest groups in terms of number of taxonomic elements. In Group 1, a majority of members come from the Bacteroidales-Bacteroidaceae group, which represents almost 2/3 of the species contained in this cluster (17 out of 27 members), so this group could be related to a Bacteroidales-Bacteroidaceae cluster. Group 4 is associated with Lachnospiraceae, which represent almost 1/3 of the total in this group (7 out of 23 members). The other three groups (2, 3, and 5) were assigned to the Ruminococcaceae-Bifidobacteriaceae group (5 out of 19 members), the Negativicutes group (4 out of 19 members), and the Pasteurellaceae group (3 out of 15 members), respectively. The results and relationships between the different elements of each cluster are presented in Figure 1. Here, species close to each other in the tree are shown in a similar way. The ordered distance matrix shows the clustering structure. Similar clusters appear as blocks on the diagonal of this heatmap. Darker colours represent small distances whereas lighter colours represent large distances. The identified clusters have different sizes and compositions, with two large coabundance clusters grouping the majority of the genera analyzed.
It is important to note that we have combined under the name Group 0 all the discarded operational taxonomic units, that is, all the species with more than 70% of zeros in the sample.
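The filtering and clustering steps described above can be sketched in Python. This is only an illustrative sketch under stated assumptions: the study itself used the HOPACH package in R, whereas the code below swaps in ordinary average-linkage hierarchical clustering from SciPy, takes the correlation distance to be one minus the Pearson correlation, and fixes the number of clusters by hand instead of selecting it with the MSS criterion. The function name `filter_and_cluster` and its arguments are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def filter_and_cluster(abund, max_zero_frac=0.70, n_clusters=5):
    """Drop sparse taxa, then cluster the remaining taxa by correlation distance.

    abund: array of shape (n_subjects, n_taxa) with relative abundances.
    Returns the boolean mask of kept taxa, a cluster label per kept taxon,
    and the pooled abundance of the discarded taxa ("Group 0").
    Assumes every kept taxon varies across subjects (no constant columns).
    """
    abund = np.asarray(abund, dtype=float)

    # Fraction of subjects in which each taxon is absent.
    zero_frac = (abund == 0).mean(axis=0)
    keep = zero_frac <= max_zero_frac
    kept = abund[:, keep]

    # Assumed correlation distance between taxa: 1 - Pearson correlation.
    dist = 1.0 - np.corrcoef(kept, rowvar=False)
    dist = np.clip((dist + dist.T) / 2.0, 0.0, 2.0)  # remove rounding noise
    np.fill_diagonal(dist, 0.0)

    # Plain average-linkage tree (a stand-in for HOPACH, not HOPACH itself).
    tree = linkage(squareform(dist, checks=False), method="average")
    labels = fcluster(tree, t=n_clusters, criterion="maxclust")

    # Discarded taxa are pooled into a single residual proportion.
    group0 = abund[:, ~keep].sum(axis=1)
    return keep, labels, group0
```

Summing the abundances of the taxa that share a label, subject by subject, would then give one proportion per group, which together with the pooled residual group forms a composition for each individual.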

Table 2 displays the mean, median, and standard deviation for each of the coabundance groups. It is noticeable that the proportion of bacteria belonging to Group 1 is higher on average than the proportion in the other groups. The variability is also larger for the first coabundance group.

In Figure 2, some ternary plots for different combinations of the bacterial groups are displayed. In particular, we have compared the coabundance Group 1 with Group 2 (top left), Group 3 (top right), Group 4 (bottom left), and Group 5 (bottom right). In order to ensure that the total sum is one, we have combined the coabundance proportions of the rest of the groups in each graph as Others. Group 1 is always located at the top of each triangle. The proportion of coabundance of Group 1 is measured in terms of the horizontal lines, i.e., 0% of coabundance corresponds to the base of the triangle (farthest from the vertex Group 1). The group compared to Group 1 is represented at the lower left apex of each triangle. The right side of the triangle becomes the baseline for the percentage of the group located at this vertex. Finally, the combined groups are located at the lower right apex of the triangle.

The rate of coabundance for the combined groups is measured from the left side of the triangle (0% abundance) to the lower right corner (100% abundance). It can be observed that the data are concentrated in a region with high coabundance of Group 1 and Group 2 and low coabundance of the third, fourth, and fifth groups (top left graph). From the rest of the graphs, it can be inferred that when Group 1 is compared to the other groups, the coabundance of these groups is lower than in the former graph. Also, as Group 2 has been included in the category Others, the corresponding coabundance of the combined group is higher than in the top left graph.

2.1.1. Compositional Data Predictor

Compositional data can be defined as arrays of strictly positive numbers for which ratios between them are important without any further requirement [25]. Microbiome data are compositional, that is, the distance between component values is only meaningful proportionally (see [26]). The elements of the composition are non-negative and sum to unity. An important issue in microbiome data is the large presence of zeros; however, the issue of zero values in some components is not addressed in most papers, especially in regression tasks. In general, in compositional research problems, most of the basic statistical analysis tools are incorrect unless the variables are rewritten in terms of logarithms of ratios, as proposed in the log-ratio methodology for compositional data. After computing these log-ratios, standard regression methods can be used since the relative character of the information is considered when analyzing the results, as one group or variable can only increase in relative terms if some other group or groups decrease. In this work we focus on the case of compositional data being included in the predictor variables. The effect of increasing one of the variables in relative terms in the predictor therefore depends on which other variables are decreased when this occurs. In log-ratio parlance, the effect of increasing one log-ratio is interpreted while keeping all other log-ratios constant, and the same log-ratio may have a different meaning depending on the way that the other log-ratios in the model are assembled. Thus, the interpretation of log-ratios as explanatory variables is usually different from other approaches. Several different approaches to building and interpreting the log-ratios have been considered in the literature, often leading to the same predictions and residuals [27]. Among the different parametrizations, in this work we have chosen centred log-ratios [28].
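In our analysis, we consider a vector $\mathbf{x}_i = (x_{i0}, x_{i1}, \ldots, x_{i5})$ in 6-dimensional real space that carries information on the relative importance of its components, where $x_{ij} > 0$ and $\sum_{j=0}^{5} x_{ij} = 1$. Note that the explanatory variable $x_{ij}$ represents the bacterial coabundance proportion of Group $j$ in individual $i$. Centred log-ratios are calculated as the logarithm of the quotient between each variable and the geometric mean of all components (see [28]),

$$\mathrm{clr}(x_{ij}) = \log_2 \frac{x_{ij}}{\left(\prod_{k=0}^{5} x_{ik}\right)^{1/6}}, \qquad j = 0, 1, \ldots, 5.$$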

The fact that we are using logarithms to base 2 means that a unit increase in this logarithm corresponds to doubling the original magnitude. In order to avoid perfect collinearity, one centred log-ratio must be deleted from the regression equation. Since all six centred log-ratios add up to zero, increasing a given centred log-ratio while keeping the other four log-ratios remaining in the regression equation (with regression coefficients $\beta_k$, $k \neq j$) constant implies increasing the given centred log-ratio whilst reducing the omitted centred log-ratio by the same amount. In this regard, a positive statistically significant regression coefficient $\beta_j$ indicates that increasing the value of the covariate $x_j$ at the expense of decreasing the amount of the omitted component has a significant positive effect on the expected value of the response variable. This is equivalent to saying that, in terms of logarithms to base 2, $\beta_j$ is interpreted as the expected change in the response variable when the ratio between $x_j$ and the omitted explanatory variable is multiplied by six. Finally, in order to obtain the estimates and their corresponding $p$ values for all possible pair combinations, the model needs to be fitted six different times, omitting a different centred log-ratio each time.
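As a minimal sketch of the transformation just described, the following Python code computes base-2 centred log-ratios for a matrix of compositions and builds the six alternative predictors obtained by omitting one log-ratio at a time. The function names `clr_log2` and `design_matrices` are hypothetical. The sketch assumes that all parts are strictly positive; how zeros were handled before taking logarithms is not stated in the text, so the code simply refuses zero parts.

```python
import numpy as np


def clr_log2(comp):
    """Base-2 centred log-ratios of each row of a composition matrix.

    comp: array of shape (n_individuals, n_parts) with strictly positive parts.
    """
    comp = np.asarray(comp, dtype=float)
    if np.any(comp <= 0):
        raise ValueError("centred log-ratios need strictly positive parts")
    logs = np.log2(comp)
    # log2(x_j / geometric mean) equals log2(x_j) minus the row mean of log2(x).
    return logs - logs.mean(axis=1, keepdims=True)


def design_matrices(comp):
    """One regression design per omitted part, to avoid perfect collinearity.

    Returns a dict mapping the index of the omitted part to the matrix of
    the remaining centred log-ratios.
    """
    clr = clr_log2(comp)
    return {j: np.delete(clr, j, axis=1) for j in range(clr.shape[1])}


if __name__ == "__main__":
    # Toy composition with six parts that sum to one (illustrative values only).
    comp = np.array([[0.4, 0.2, 0.1, 0.1, 0.1, 0.1]])
    clr = clr_log2(comp)
    print(clr.sum(axis=1))       # each row of centred log-ratios sums to zero
    print(clr[0, 0] - clr[0, 1])  # log2(0.4 / 0.2)
```

Because each row of centred log-ratios sums to zero, any one column is a linear combination of the others, which is why each design drops exactly one column.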

3. Relation of Age, BMI, and Level of Physical Activity with Microbiome

In this section, we first consider the marginal relation of age, given a level of physical activity, with the microbiome through a beta regression model. Next, we examine the connection of BMI with the microbiome, given the level of physical activity, by using a mixture of logistic regression models. Finally, the joint relationship of these variables is examined via a $t$-copula.

3.1. Relation of Age and Level of Physical Activity with Microbiome

It is our interest to model the relationship between the age of the subject and the proportion of each coabundance genus-like group at the taxonomic level of species via a beta regression model (see Ferrari and Cribari-Neto [29]). This model assumes that the response variable is beta distributed, using a parametrization of the beta law that is indexed by mean and dispersion parameters. This regression family is useful for modeling rates and proportions, that is, in situations where the dependent variable of interest is continuous and restricted to a bounded interval $(a, b)$, where $a$ and $b$ are known scalars with $a < b$. This model is related to other variables through a regression structure. Our goal is to explain a continuous response variable $y$ which, once rescaled from $(a, b)$ to the unit interval, satisfies $0 < y < 1$. The density of $y$ is defined as follows:

$$f(y; \mu, \phi) = \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma((1-\mu)\phi)}\, y^{\mu\phi - 1} (1 - y)^{(1-\mu)\phi - 1}, \qquad 0 < y < 1,$$

with $0 < \mu < 1$ and $\phi > 0$, and with $E(y) = \mu$. This parametrization allows us to obtain a regression structure for the mean of the response along with a dispersion parameter $\phi$. Here, $n$ is the sample size and $i = 1, \ldots, n$. The variance of the response variable can be easily expressed in terms of its mean by the following expression: $\mathrm{var}(y) = \mu(1-\mu)/(1+\phi)$. The variance decreases with the value of the dispersion parameter.

Let us now consider that a random variable $Y_i$ denoting the age of the $i$th individual in the sample is related to a compositional data predictor $\mathbf{x}_i$ built from the coabundance groups, whose components are chosen among all combinations without repetition from the vector of six centred log-ratios, taking 5 components at a time. Then, by using the logit link (i.e., $g(\mu) = \log\{\mu/(1-\mu)\}$), we have that $g(\mu_i) = \mathbf{x}_i^{\top}\boldsymbol{\beta}$, where $\boldsymbol{\beta}$ is a vector of regression coefficients. Other choices of link function for the response model are feasible.
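A beta regression with a logit link for the mean can be fitted by direct maximization of the log-likelihood. The sketch below is a minimal illustration under stated assumptions, not the implementation used in the study: it assumes the mean and dispersion parametrization of Ferrari and Cribari-Neto (shape parameters equal to the mean times the dispersion and one minus the mean times the dispersion), a response already rescaled to the open unit interval, a constant dispersion parameter, and a design matrix `X` supplied by the user. The function names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, gammaln


def beta_negloglik(params, X, y):
    """Negative log-likelihood of a beta regression with logit link.

    params: regression coefficients followed by the log of the dispersion.
    X: design matrix (n, p); y: responses strictly inside (0, 1).
    """
    beta, log_phi = params[:-1], params[-1]
    # Clip the mean away from 0 and 1 to keep the gamma functions finite.
    mu = np.clip(expit(X @ beta), 1e-10, 1.0 - 1e-10)
    phi = np.exp(log_phi)
    a, b = mu * phi, (1.0 - mu) * phi
    loglik = (gammaln(phi) - gammaln(a) - gammaln(b)
              + (a - 1.0) * np.log(y) + (b - 1.0) * np.log1p(-y))
    return -loglik.sum()


def fit_beta_regression(X, y):
    """Maximum likelihood fit; returns (coefficients, dispersion)."""
    start = np.zeros(X.shape[1] + 1)  # all coefficients 0, dispersion 1
    result = minimize(beta_negloglik, start, args=(X, y), method="BFGS")
    return result.x[:-1], float(np.exp(result.x[-1]))


if __name__ == "__main__":
    # Simulated data with made-up coefficients, for illustration only.
    rng = np.random.default_rng(1)
    n = 2000
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    mu = expit(X @ np.array([0.3, -0.5]))
    y = rng.beta(mu * 20.0, (1.0 - mu) * 20.0)
    print(fit_beta_regression(X, y))
```

Optimizing over the logarithm of the dispersion keeps that parameter positive without constrained optimization; this is a design choice of the sketch and not something stated in the text.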

We have fitted this beta regression model to this dataset to explain the response variable Age by considering two levels of physical activity, sedentary and nonsedentary, with the bounds $a$ and $b$ of the interval fixed in advance. Table 3 shows the estimates and $p$ values associated with the six predictors for individuals classified as sedentary, for each microbiome coabundance group’s proportion, obtained under the regression model (1). Similarly, Table 4 shows the estimates and $p$ values for each predictor for nonsedentary subjects. From these tables, it is discernible that for the first predictor the regression coefficient associated with Group 5 for sedentary individuals is statistically significant at the 5% level whereas it is not for nonsedentary subjects. Its value, −0.1260, is interpreted as follows: an increase in the covariate for Group 5 (i.e., Pasteurellaceae) at the expense of decreasing the amount of the omitted component has a significant negative effect on the transformation of the expected value of the mean of the model, while keeping the remaining four log-ratios in the equation constant. In a similar fashion, for the third predictor and the nonsedentary subjects, four explanatory variables are significant at the same level while they are not for the sedentary individuals (see Tables 3 and 4). For all these covariates, the sign of the regression coefficients is positive; therefore, an increase in these covariates at the expense of decreasing the value of the omitted component has a significant positive effect on the transformation of the expected value of the mean of the model. For the fourth predictor, the regression coefficient associated with the second group is only significant for the nonsedentary group. Similarly, for the fifth predictor, one covariate is significant for the sedentary individuals whereas the regression coefficient associated with Group 2, i.e., Ruminococcaceae-Bifidobacteriaceae, is significant for the nonsedentary individuals. A similar situation is also observed for the sixth predictor. On the other hand, for the sedentary subjects, one covariate has a positive and significant coefficient.

In Figure 3, we have plotted the histograms of the empirical distribution of the response variable Age for the nonsedentary (top left panel) and sedentary (bottom left panel) subjects. For both histograms, we have superimposed the probability density function of the beta distribution. It can be observed that this distribution provides a better fit to the empirical data for the nonsedentary group than for the sedentary group. Furthermore, we have performed a diagnostic analysis to check the goodness-of-fit of the estimated model by providing a global measure of explained variation and graphical tools based on QQ-plots to detect departures from the given model and influential observations. Residuals are used to check the appropriateness of a chosen model and to identify outliers. For that reason, randomized quantile residuals (Dunn and Smyth [30]) are used, since other types of residuals, i.e., Pearson’s and deviance residuals, are far from normality when the parameters of the model are known and fail to provide useful information on the inadequacy of the model. The $i$th randomized quantile residual for a continuous response variable is defined as $r_i = \Phi^{-1}\{F(y_i; \hat{\boldsymbol{\theta}})\}$, where $\Phi^{-1}$ is the quantile function of the standard normal distribution and $F$ is the cumulative distribution function associated with the beta regression model evaluated at the estimated parameters $\hat{\boldsymbol{\theta}}$, for $i = 1, \ldots, n$. The right panels of Figure 3 show the QQ-plots of the randomized quantile residuals of the beta regression models when predictor 1 is considered for the nonsedentary (top right) and sedentary (bottom right) subjects. Each dot on the plots represents an empirical residual. A perfect alignment with the line implies that the residuals are normally distributed. In general, it can be observed that the residuals for the nonsedentary group adhere more closely to the line over the whole distribution.
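The residual diagnostic described above can be sketched as follows. This is an illustrative sketch only: it assumes a continuous response in the open unit interval and a beta distribution with shape parameters equal to the fitted mean times the dispersion and one minus the fitted mean times the dispersion. For a continuous response no randomization step is needed, since the fitted cumulative distribution function is continuous. The function name `quantile_residuals` is hypothetical.

```python
import numpy as np
from scipy import stats


def quantile_residuals(y, mu_hat, phi_hat):
    """Quantile residuals of a fitted beta model.

    y: responses in (0, 1); mu_hat: fitted means; phi_hat: fitted dispersion.
    Each residual is the standard normal quantile of the fitted beta cdf
    evaluated at the observation.
    """
    u = stats.beta.cdf(y, mu_hat * phi_hat, (1.0 - mu_hat) * phi_hat)
    return stats.norm.ppf(u)


if __name__ == "__main__":
    # Simulated example: when the model is correct, the residuals should
    # look like a standard normal sample.
    rng = np.random.default_rng(2)
    mu = rng.uniform(0.2, 0.8, size=5000)
    y = rng.beta(mu * 15.0, (1.0 - mu) * 15.0)
    r = quantile_residuals(y, mu, 15.0)
    print(r.mean(), r.std())
    # Theoretical and ordered sample quantiles for a normal QQ-plot.
    (theoretical, ordered), _ = stats.probplot(r, dist="norm")
```

Plotting `ordered` against `theoretical` gives the kind of QQ-plot used to judge whether the residuals follow the reference line.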

3.2. Relation of BMI and Level of Physical Activity with Microbiome

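Regardless of gender, the empirical distribution of BMI in humans is bimodal. Finding appropriate statistical models that have the capacity to explain bimodal datasets is therefore an issue of vital importance. In this work, we use a mixture of two logistic distributions with different location and scale parameters. We have chosen this family for its flexibility and simplicity. It is now our interest to explain the BMI in the population in terms of a random variable $Y$. The probability density function of this random variable is

$$f(y) = \omega\, \frac{e^{-(y-\mu_1)/\sigma_1}}{\sigma_1 \left(1 + e^{-(y-\mu_1)/\sigma_1}\right)^{2}} + (1-\omega)\, \frac{e^{-(y-\mu_2)/\sigma_2}}{\sigma_2 \left(1 + e^{-(y-\mu_2)/\sigma_2}\right)^{2}},$$

where $\mu_1$, $\mu_2$ are location parameters and $\sigma_1$, $\sigma_2$ are scale parameters. In this model, it will also be assumed that the weight parameter $\omega_i$ for each individual in the sample is again expressed as a function of the same group of covariates, $g(\omega_i) = \mathbf{x}_i^{\top}\boldsymbol{\gamma}$, where $g$ is a link function and $\boldsymbol{\gamma}$ is a vector of regression coefficients. Other choices of link function for the response model are also possible. This parametrization enables us to obtain a regression structure for the mean of the response in the following way: $E(Y_i) = \omega_i \mu_1 + (1-\omega_i)\mu_2$. The variance of the response variable is written in terms of a linear combination of the squared scale parameters plus a term that depends on the distance between the two locations:

$$\mathrm{var}(Y_i) = \frac{\pi^{2}}{3}\left\{\omega_i \sigma_1^{2} + (1-\omega_i)\sigma_2^{2}\right\} + \omega_i (1-\omega_i)(\mu_1 - \mu_2)^{2}.$$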
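The two-component logistic mixture can be explored numerically with a short Python sketch. The parameter values in the example are invented for illustration and are not estimates from the study; the function names are hypothetical. The variance formula used here is the standard one for a two-component mixture, combined with the fact that a logistic distribution with scale s has variance equal to pi squared times s squared over three.

```python
import numpy as np
from scipy import stats


def mixture_pdf(y, w, mu1, s1, mu2, s2):
    """Density of a mixture of two logistic laws with weight w on the first."""
    return (w * stats.logistic.pdf(y, loc=mu1, scale=s1)
            + (1.0 - w) * stats.logistic.pdf(y, loc=mu2, scale=s2))


def mixture_mean(w, mu1, mu2):
    """Mean of the mixture: weighted average of the two locations."""
    return w * mu1 + (1.0 - w) * mu2


def mixture_var(w, mu1, s1, mu2, s2):
    """Variance: weighted component variances plus a between-component term."""
    within = (np.pi ** 2 / 3.0) * (w * s1 ** 2 + (1.0 - w) * s2 ** 2)
    between = w * (1.0 - w) * (mu1 - mu2) ** 2
    return within + between


if __name__ == "__main__":
    # Made-up parameters giving two modes, one per component.
    grid = np.arange(10.0, 45.0, 0.01)
    dens = mixture_pdf(grid, 0.6, 22.0, 1.5, 31.0, 2.0)
    print(grid[np.argmax(dens)])  # location of the highest mode on the grid
    print(mixture_mean(0.6, 22.0, 31.0), mixture_var(0.6, 22.0, 1.5, 31.0, 2.0))
```

When the weight depends on covariates through a link function, the same functions can be evaluated individual by individual with an individual-specific `w`.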

We have now fitted the mixture of logistic regression model given by (5) to this dataset to explain the dependent variable BMI, again considering two levels of physical activity: sedentary and nonsedentary. Table 5 shows the estimates and $p$ values associated with the six predictors for individuals classified as sedentary, for each microbiome coabundance group’s proportion, obtained under this mixture of logistic regression model (5). In a similar way, Table 6 shows the estimates and $p$ values for each predictor for nonsedentary subjects. From these tables, it is apparent that for the first predictor the regression coefficient associated with Group 1 for sedentary individuals is statistically significant at the 5% level whereas it is not for nonsedentary subjects. The estimated value is −0.3631, which is interpreted as follows: an increase in the covariate for Group 1 (i.e., Bacteroidales-Bacteroidaceae) at the expense of decreasing the amount of the omitted component has a significant negative effect on the transformation of the expected value of the mean of the model, while keeping the remaining four log-ratios in the equation of the predictor constant. Similarly, for the second predictor and the sedentary subjects, one explanatory variable is statistically significant at the 10% significance level while it is not for the nonsedentary individuals. The sign of this regression coefficient is positive; therefore, an increase in this covariate at the expense of decreasing the value of the omitted component, while keeping the other four log-ratios in the equation constant, has a significant effect on the transformation of the expected value of the mean of the model. Finally, for predictor 3, the regression coefficient associated with the first group is only significant at the 10% significance level for the sedentary individuals. The sign of this coefficient is negative.

In Figure 4, we have plotted the histograms of the empirical distribution of the response variable BMI for the nonsedentary (top left panel) and sedentary (bottom left panel) individuals. For both histograms, we have superimposed the density function of the mixture of logistic distributions. It can be seen that this distribution is able to reproduce the two modes of the empirical distribution for both cohorts. Note that for the group of sedentary individuals, the second mode, located around the BMI value of 31, is clearly more pronounced. Once again, we have plotted the QQ-plots of the randomized quantile residuals of this mixture of logistic regression model when the first predictor is considered for the nonsedentary (top right) and sedentary (bottom right) individuals. In general, it can be observed that the residuals for the nonsedentary group adhere more closely to the line over the whole distribution, although the model underestimates the upper part of the distribution of residuals.

3.3. Joint Relation of Age, BMI, and Level of Physical Activity with Microbiome

The degree of association between the two variables age and BMI in the sample for the different levels of physical activity can be summarized in terms of some measures of correlation for bivariate data. In Table 7, Pearson’s, Spearman’s, and Kendall’s measures of correlation for these continuous random variables are displayed. It is noticeable that there is a weak positive correlation between these two variables. The degree of association is less intense for the nonsedentary individuals.
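The three measures of association mentioned above are all available in SciPy. The following sketch computes them for two paired samples; the helper name `association_measures` and the toy data are hypothetical and are not taken from the dataset.

```python
from scipy import stats


def association_measures(x, y):
    """Pearson, Spearman, and Kendall coefficients for two paired samples."""
    return {
        "pearson": float(stats.pearsonr(x, y)[0]),
        "spearman": float(stats.spearmanr(x, y)[0]),
        "kendall": float(stats.kendalltau(x, y)[0]),
    }


if __name__ == "__main__":
    # Toy paired data for illustration only.
    x = [1, 2, 3, 4, 5]
    y = [2, 1, 4, 3, 5]
    print(association_measures(x, y))
```

In practice the function would be applied separately to the sedentary and nonsedentary subsamples, so that the strength of association can be compared between the two levels of physical activity.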

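We model the joint dependence of age and BMI for different levels of physical activity and their relationship with the proportion of each coabundance genus-like group at the taxonomic level of species via a $t$-copula with degrees of freedom (df) parameter $\nu$, with marginal distributions given by the beta regression model given in (1) and the mixture of logistic distributions provided in (5). The density of this bivariate distribution is defined as

$$h(y_1, y_2) = \frac{\Gamma\!\left(\frac{\nu+2}{2}\right)\Gamma\!\left(\frac{\nu}{2}\right)}{|R|^{1/2}\,\Gamma\!\left(\frac{\nu+1}{2}\right)^{2}}\; \frac{\left(1 + \mathbf{z}^{\top} R^{-1} \mathbf{z}/\nu\right)^{-(\nu+2)/2}}{\prod_{j=1}^{2}\left(1 + z_j^{2}/\nu\right)^{-(\nu+1)/2}}\; \prod_{j=1}^{2} f_j(y_j),$$

where $f_1$ and $f_2$ denote the marginal densities, $\nu > 0$, and $-1 < \rho < 1$. Here, $R$ is a symmetric and positive definite scatter matrix of dimension $2 \times 2$ with unit diagonal entries and off-diagonal entry $\rho$, $|\cdot|$ denotes the determinant of a matrix, and $\Gamma(\cdot)$ is the complete gamma function. Also, $\mathbf{z} = (z_1, z_2)^{\top}$ with $z_j = t_{\nu}^{-1}(u_j)$ and $u_j = F_j(y_j)$, where $t_{\nu}^{-1}$ is the quantile function of the univariate $t$-distribution with $\nu$ df and $F_j$ is the cdf associated with the regression models presented above, with $j = 1, 2$.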
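The dependence part of the model can be illustrated with a short Python sketch of the bivariate t-copula log-density. It assumes the standard form of this copula: the ratio of a bivariate t density with unit-diagonal scatter matrix to the product of two univariate t densities, evaluated at the t quantiles of the uniform arguments. The function name and the numerical values in the example are hypothetical.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln


def t_copula_logdensity(u1, u2, rho, nu):
    """Log-density of the bivariate t-copula.

    u1, u2: values in (0, 1), e.g. marginal cdfs evaluated at the data.
    rho: off-diagonal entry of the scatter matrix, in (-1, 1).
    nu: degrees of freedom, positive.
    """
    z1 = stats.t.ppf(u1, df=nu)
    z2 = stats.t.ppf(u2, df=nu)
    # Quadratic form z' R^{-1} z for a 2 x 2 matrix with unit diagonal.
    quad = (z1 ** 2 - 2.0 * rho * z1 * z2 + z2 ** 2) / (1.0 - rho ** 2)
    log_const = (gammaln((nu + 2.0) / 2.0) + gammaln(nu / 2.0)
                 - 2.0 * gammaln((nu + 1.0) / 2.0)
                 - 0.5 * np.log(1.0 - rho ** 2))
    log_joint = -(nu + 2.0) / 2.0 * np.log1p(quad / nu)
    log_margins = -(nu + 1.0) / 2.0 * (np.log1p(z1 ** 2 / nu)
                                       + np.log1p(z2 ** 2 / nu))
    return log_const + log_joint - log_margins


if __name__ == "__main__":
    # Illustrative call with made-up arguments.
    print(t_copula_logdensity(0.3, 0.8, rho=0.4, nu=5.0))
```

Adding the logarithms of the two marginal densities to this quantity gives the log-density of the joint model for one individual, which is the building block of the log-likelihood.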

The corresponding log-likelihood function, given a sample $\{(y_{1i}, y_{2i})\}_{i=1}^{n}$, is provided by

$$\ell(\boldsymbol{\theta}) = \ell_c(\boldsymbol{\theta}) + \ell_m(\boldsymbol{\theta}),$$

where $\ell_c$ and $\ell_m$ are the log-likelihood functions of the copula and marginal model, respectively. Maximum likelihood estimation can be used to estimate the parameters of expression (8) via an adaptive maximization by parts (MBP) algorithm as described in [31], using initial estimates generated by the inference for margins algorithm. At step $k$ of this algorithm, for $k = 1, 2, \ldots$, the parameter estimates are updated.

The algorithm stops when a terminating condition between two consecutive iterations is reached.

Finally, we have fitted the bivariate distribution given in (5) to the bivariate dataset. Once again, the two levels of physical activity have been considered. Results are shown in Table 8 for the sedentary case and Table 9 for the nonsedentary case. When using the first predictor, one covariate for the variable age and one for the variable BMI are statistically significant at the 5% level for sedentary individuals, while they are not for the nonsedentary group. Also, the regression coefficients associated with two further covariates for BMI are significant at the same level for the sedentary individuals whereas they are not for the nonsedentary group. The sign of these regression coefficients is negative. Conversely, for the third predictor and the variable Age, the regression coefficients of two covariates are positive and significant only for the nonsedentary subjects. In addition, one covariate is positive and significant for the variable BMI for the same level of physical activity. The regression coefficients associated with one covariate in the fourth and fifth predictors are negative and significant for the response variable age. Finally, in the sixth predictor, the regression coefficient linked to one covariate is positive and significant at the 5% level for the sedentary subjects and the response variable Age. Similarly, for this sixth predictor, the regression coefficient associated with another covariate for the same response variable is negative and significant for individuals classified as nonsedentary. The specific covariates involved in each case can be read from Tables 8 and 9.

4. Discussion and Extensions

Although genetic information can reliably inform about the future appearance of certain diseases and is an element of great interest for different stakeholders, it is classified at the highest level of confidentiality and protected from disclosure. In the insurance industry, a particularity of genomic information is that it provides information and knowledge not only about the individual taking out the insurance but also about their ancestors and descendants. As a consequence of these limitations, international regulators, e.g., the Council of Europe, encourage insurers to update their actuarial bases according to relevant new scientific knowledge, and this may open the gates to exploring new avenues, data types, and information sources. As part of this new vision, we have examined the potential use of microbiome information in relation to some variables associated with insurance underwriting. Recent investigations have shown that changes in the gut microbiome are associated with the risk of certain pathologies and could be a potential proximal predictor of disease onset.

Recently, an unpublished work applying text mining techniques to the life insurance literature and to microbiome research found a significant overlap between certain diseases and health conditions and other elements that are considered in insurance underwriting. One of these elements is the body mass index (BMI), which is one of the variables considered in the standard health declaration. Traditionally, this declaration is the first step of risk assessment in health insurance underwriting practice. Depending on the level of insured capital and the age of the policyholder, an additional medical examination will be required regardless of the outcome of the standard health declaration. However, medical examinations are expensive, burdensome for the applicant, and time-consuming in the underwriting process (see [32]). The importance of BMI stems from its link to obesity, which can lead to a large number of chronic diseases and consequently increase health expenditures and claims costs. Therefore, early detection of obesity is crucial to safeguard the financial structure of the health insurance provider. Similar conclusions can be drawn about the early detection of cardiovascular, mental, metabolic, or immune diseases. It is therefore extremely important for private health insurers to monitor their policyholders’ health status in order to reduce future claims costs.

The main findings of our analysis show that the second bacterial coabundance group, associated with Ruminococcaceae-Bifidobacteriaceae, has a significant negative effect on the expected value of the response variable age for the nonsedentary individuals. This holds not only for the marginal model associated with the response variable age but also for the joint regression model. Concerning the fifth coabundance group, related to Pasteurellaceae, a positive impact on the expected value of age was observed for the sedentary individuals in the marginal model; in contrast, for some predictors, a negative impact on the mean of the response variable age is observed under the bivariate regression model. This is somewhat consistent with the recent work of Jollet et al. [33], which describes a randomized clinical trial showing that two of the bacterial groups of the Pasteurellaceae group are related to the level of physical activity. However, the degree to which this conclusion is valid is arguable since, in general, it appears that the majority of people have a unique baseline microbiome that is influenced by environmental conditions. The standard microbiome composition is affected not only by the level of physical activity but also by other factors such as age, diet, and medication. In addition, we have only analyzed a single observation of the gut microbiome per individual, limited to the age interval 18–65. It would be highly advisable to perform a longitudinal analysis to relate microbiome information to mortality risk factors and to the probability of developing certain pathologies. For example, to deal with the high proportion of zeroes in the operational taxonomic units observed, a two-part zero-inflated regression model with random effects could be used [26].
In this regard, the American Gut Project (a citizen science project containing more than 10,000 samples not only from the USA but also from several other countries around the world, including Australia, the UK, and Spain) represents an opportunity to access microbiome data for a variety of age groups. This source of information, together with the Human Mortality Database available at https://www.mortality.org, could be used to analyze how changes in the gut microbiome are related to human longevity in different countries.

Data Availability

In our analyses, we use a dataset available in Dubey et al.’s [21] LogMPIE study. This dataset is freely accessible, and it may be downloaded from the European Nucleotide Archive (ENA) portal of the European Bioinformatics Institute (https://www.ebi.ac.uk/ena/data/view/PRJEB25642). Datasets are also included in the submission (Abundance.txt and Metadata.txt).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

E.C.O. and G.L.C. acknowledge Fundación MAPFRE for the “Ignacio H. de Larramendi 2017” research grant that partially funded this work.

Supplementary Materials

Metadata: demographic information of the individuals. Table S1: microbiome abundance for each individual.