Abstract

Gut microbiome plays a significant role in defining the health status of subjects, and recent studies highlight the importance of using time series strategies to analyse microbiome dynamics. In this paper, we develop a Bayesian model for microbiota longitudinal data, based on Dirichlet distribution with time-varying parameters, that take into account the compositional paradigm and consider principal balances. The proposed model can be effective for predicting the future dynamics of a microbial community in the short term and for analysing the microbial interactions using the value of the estimated parameters. The usefulness of the proposed model is illustrated with six different datasets, and a comparison study with four alternative models is provided.

1. Introduction

The microbiome is defined in [1] as “the ecological community of commensal, symbiotic, and pathogenic microorganisms that literally share our body space.” Recent studies suggest that gut microbiome plays a significant role in defining the health status of subjects. In fact, human gut microbiome influences brain function and behavior [2] and is related to several diseases such as cancer [3], HIV [4], Alzheimer [5], type 2 diabetes [6], cardiovascular disease [7], irritable bowel syndrome (IBS) [8], Crohn’s disease [9], chronic fatigue syndrome (CFS) [10], obesity [11], and malnutrition [12], among others. What is more, in [13], the paramount interest of gut microbiome, pointing out the importance of microbes in regulating most if not all human bodily functions, has been underscored. As a result, monitoring human microbiome will probably lead us to improve human health, as has been mentioned in [14].

Given the importance of the microbiome, a concerted effort is being made to design models that enable us to extract information on bacterial dynamics. In [15], Martí et al. explain that temporal microbiota stability is linked to health status. In [16], Dinleyici et al. observed that a disruption in the stable state of the microbiome is related to a disease, but the stable state is reached again when the subject is cured. These facts highlight the importance of using time series strategies to study the microbiome. In that respect, longitudinal studies are an appropriate approach to tackle microbiome modeling, as they can help to gain a better understanding of microbiota regularities, bacterial interactions, and microbiome response to perturbations. Furthermore, this longitudinal approach allows us to predict future microbiome dynamics and analyse not only microbiome stability but also the time it needs to recover and reach a new stable state, if possible.

At present, there are studies where the microbiome is manipulated in humans in order to treat certain illnesses [17]. This manipulation is carried out by fecal microbiota transplantation, probiotic supplementation, or a combination of both strategies [17]. Nowadays, there are also nutrition-based approaches to treatment [18]. For instance, the work detailed in [19] points out that the administration of Lactobacillus delbrueckii and antibiotics to treat bacterial vaginosis can provide a cure, [20, 21] indicate that fecal microbiota transplantation is a promising therapy for C. difficile infection, and [22] reports that fecal microbiota transplantation in HIV patients attenuates HIV-associated dysbiosis. Taking this research approach into account, modeling bacterial dynamics should be a research priority because these models will help us to design better interventions, as has been mentioned in [17].

One line of action in longitudinal analysis to model the abundance of each microbial entity at each time point is to use generalized Lotka–Volterra equations [2326]. We must mention that these models have usually been used in a deterministic way; however, [27] takes stochastic processes into account, using multiplicative noise with normal distribution. These approaches consider pair-wise interactions, and consequently, they do not take into account the effect that a third microbe may have on an interacting pair of microbes. In [28], more limitations of using the Lotka–Volterra structure have been pointed out. In [29, 30], a nonparametric approach with an additive structure has been proposed. The advantage of these models is the absence of microbial relationship specifications, while the drawback is the fact that the additivity of relationships, allowed in this approach, may not be realistic for complex microbial communities. Other proposals consist of using state space models [31] or linear mixed models [32, 33]. In [34], and references therein, more proposal to analyse microbiota longitudinal data is detailed.

Recent investigations consider compositional data analysis [35] is also a valid approach to analyse microbiome high-throughput sequencing data, and therefore, microbiome datasets can be converted to relative abundance values, or normalized counts, prior to analysis (see, for instance, [36] or Chapter 3 in [34]). Fundamentally, this approach uses the log-ratio transformation techniques to convert microbiome data, thus removing the compositional constraints to make the standard techniques suitable for analysis. This transformation focuses on removing the compositional constraint: all microbial relative abundances sum to one. Microbiome data can be considered as a compositional dataset due to limitations of the techniques used to collect the data [36]. These limitations are based on the fact that datasets collected by 16S rRNA gene amplimers have an arbitrary total imposed by the instrument. This total is known as sequencing depth or library size. This implies that an increase in the abundance of one taxon results in a decrease of some of the other taxa because the sum of all of the raw abundances cannot exceed the sequencing depth. As the increase of a taxon produces a decrease in some of the other taxa, the correlation between the taxa is bogus. The work presented in [37] specified that the correlations between compositional components are spurious. This fact was described under various headings: the constant-sum problem, the closure problem, the negative bias problem, or the null correlation difficulty [38]. These characteristics of compositional data require that the analysis of this type of data must be conducted using appropriate methodology.

There are approaches that analyse microbiome taxa taking its compositional nature into account. In [39], Kynčlová et al. model the isometric log-ratio transformation of the data using Gaussian distribution and a VAR structure, while in [40] phylogenetic isometric log-ratio transformation, to allow off-the-shelf statistical tools to be optimally applied to microbiota datasets, is introduced. This approach takes into account phylogenies to construct the ILR transform and redefine data considering coordinates termed balances. Rivera-Pinto et al. in [41], take also into account balances. In this case, to identify microbial signatures and consider them as predictive covariates of a phenotype of interest, Joseph et al. describe microbial dynamics using generalized Lotka–Volterra (gLV) equations and compositional data analysis (CODA) [42]. In [43, 44], Bayesian CODA methods for microbiome time series which model the counting process and perform inference and apply log-ratio transformations to the model parameters are detailed.

Our proposal is based on modeling a normalized transformation of the observed counts. We assume that normalized transformation of the observed counts can be explained by an autoregressive structure considering Dirichlet distribution with time-varying parameters that take into account principal balances. Principal balances are a compositional tool that enables us to reduce the number of parameters of the model by selecting the balances that maximize the variance [45]. As balances with more variability will better explain the variability of the data, this formulation allows us to encapsulate information and decrease the number of parameters to be estimated. Our proposal is in line with [46], though this one is not used with microbiome datasets. We must mention that the Dirichlet distribution has been considered inappropriate for modeling compositional data due to its strong implied independence structure, which can be deduced from its definition by a set of independent, gamma-distributed, random variables, with equal scale parameter. This fact makes it inappropriate to consider this probability distribution for modeling compositional data [35]. However, it has also been found useful when used as a conditional distribution (see, for instance, [47, 48]). Our approach is presented as an alternative for predicting the future dynamics of a microbial community in the short term and for analysing the microbial interactions using the value of the estimated parameters.

2. Some Basic Principles of CODa

The first steps towards developing a suitable methodology to analyse compositional data focused on projecting the sample space of compositional data (the simplex) to real space using log-ratio transformations [35]. The additive log-ratio transformation (alr) and centered log-ratio transformation (clr) were defined in [35] and are defined by the following expressions:where denotes a D-composition and denotes the geometric mean. Note that when projecting the simplex to the real space, the clr transformation preserves the distances [49]; see equation (2), where denotes the Aitchison distance between two vectors and and denotes the Euclidean distance.

2.1. Balances

Balances are defined in [50]. Using the notation applied in [45], they are expressed as follows:where and are two nonoverlapping groups of parts of a D-composition, and are the number of parts in and , respectively, and denotes the geometric mean.

Balances are a compositional tool that makes it possible to study the relationship between the components of the numerator and the components of the denominator of the balance. Balances that are near zero indicate similarity between the two groups; however, the greater the absolute value of the balance, the lager is the dissimilarity between them.

Depending on which components we put in and which we put in we will have different balances. In this study, we want to choose the components of and the components of that maximize the variance of the balance because balances with more variability will better explain the variability of the microbiome data. Principal balances are the strategy selected to tackle this issue.

2.2. Principal Balances

Principal balances act as the Principal Component Analysis (PCA) for compositional data. In other words, they allow us to reduce the dimension with a minimum loss of information. In addition, the numerator and the denominator of the principal balances are selected so as to maximize the variance of the balance [45].

In order to obtain the principal balances, Ward’s clustering method [51] is used (see details in [52]). We apply this method because it enhances the proportionality between groups of parts and reduces the computation time [52].

3. Model

3.1. Selecting Principal Balances

Let be the relative abundances at time , so that and . Note that denotes the relative abundance of taxon at time point . Let be the number of time points available in the data. In order to implement Ward’s clustering method, we denote as the vector that contains the clr transformation of the relative abundance of species at all time points available in the data. As a result,

First of all, we calculate the matrix . This matrix is proportional to the variation matrix due to the following equation defined by [52].

In this equation, denotes the vector that contains the relative abundance of species at all time points available in the data. Before going into further detail, it is important to point out that Ward’s method distance between two clusters B and C is defined by equation (6), where and denote the centroids (i.e., barycenters) of and , respectively.

Second, we apply Ward’s method, because according to [52], the variance of a balance is proportional to Ward’s method distance between the numerator and the denominator of the balance. Ward’s method is hierarchical, and it detects the smallest entry in the matrix (which functions as the variation matrix), and the corresponding parts are merged to form a group. The matrix is recalculated at each iteration using equation (6) and, the final stage of the process, the last two remaining groups are fused into one, which gives the balance with the largest variance. We have implemented this process using the hclust function included in the fastcluster package in R [53].

During the hierarchical process, we generate a dendrogram that plots the groups created in each iteration of the method. As a result, the groups merged in the dendrogram give us the numerator and the denominator of the principal balances. In conclusion, through this process, we have applied the new approach to Ward’s clustering method in order to obtain the principal balances, which are denoted as .

Finally, we select the principal balances for which the sum of the percentage of variance is higher than . We choose this percentage because we consider that taking account of the balances that describe at least of the variability is enough for our model to capture the variability of the data [54]. We denote the selected principal balances as , where denotes the total amount of SPBal. We must mention that a detailed example of the process described in this section to obtain the selected principal balances is provided in Section 5.

3.2. Proposed Model (PM)

We assume that follows a Dirichlet distribution with nonnegative parameters . Reference [46] presents the ability of the Dirichlet distribution with an autoregressive structure that uses the alr or the clr transformation as a link function and imposes a parameter that is equal to the sum of the Dirichlet parameters. Our proposal is in line with this because it uses the Dirichlet distribution and an autoregressive structure; however, we use a Bayesian framework with the logarithm as a link function, we introduce the principal balances in the autoregressive structure, and we do not need to estimate an extra parameter that is equal to the sum of the Dirichlet parameters.

We model each transformed parameter as a linear function of the selected principal balances:

Note that are calculated using the values of , and the parameters of the model are . The Gaussian distribution with zero mean and standard deviation is considered as a prior distribution of . As [55] suggests, is assigned the uniform distribution in the interval . Note that and then can be estimated by

As we have mentioned before, the numerator and the denominator of the principal balances are selected so as to maximize the variance of the balance. As balances with more variability will better explain the variability of the data, principal balances enable us to encapsulate information and decrease the number of parameters to be estimated. Balances provide information about the similarity between the components of the numerator and the elements of the denominator: balances that are approximately zero indicate that the two groups are similar; however, the larger the absolute value of the balances, the greater the dissimilarity between them.

4. Datasets

We demonstrated the ability of our model by using both publicly available and simulated datasets. We studied the gut microbiome time series data of two healthy subjects, one female and one male. In both subjects, the samples were collected daily. The data of the female subject were extracted from [56] and were analysed on different taxonomic levels. We used Subject C in [57] as the male subject, taking into account that this subject is obese class I according to his body mass index.

We simulated three databases with fifteen, twenty, and sixty microbial taxa, respectively. Following the scheme given by [58], we generated the interaction matrix using the algorithm proposed by [59], and we generated the initial abundances using the Poisson distribution. With these two tools, we simulated the data using the generalized Lotka–Volterra structure. We carried out the simulation using the R package seqtime [58]. Focusing on technical details, to generate the interaction matrix we set the clique size at 10, the diagonal values at , the interaction connectance at 0.01, the positive edge percentage at , and the maximal absolute off-diagonal interaction strength at 1.

Microbial datasets are zero-inflated [60]; nevertheless, zeros in a compositional data set are below detection limit values [61]. For this reason, in order to analyse the publicly available datasets, we selected the bacterial taxa that did not have zeros at any time point, and we gathered the information of the nonselected taxa in the Other collection.

5. Results

First of all, we will explain in detail how we applied Ward’s clustering method to obtain the principal balances and present a comparative study on selecting the principal balances. Second, we analyse the information that estimated parameters provide on bacterial dynamics. Third, we offer a study on the presence of zeros at credible intervals. After this, we examine the graphic results and test the results when modeling datasets with a large number of microbial taxa. Finally, we compare the proposed model with other approaches.

5.1. Applying Ward’s Clustering Method and Selecting the Principal Balances

In order to introduce the results in an accessible way, we first present the outcomes obtained with the publicly available datasets. Moreover, we start by analysing the female dataset at family level, because it has aspects in common with the other two datasets and, consequently, it is easily comparable.

By applying Ward’s clustering method as it is described in Section 3.1, we obtain the dendrogram shown in Figure 1. As it is presented by [62], the horizontal bars of the dendrogram describe the groups of parts defined at each iteration of the process; for this reason, the groups merged in the dendrogram give us the numerator and the denominator of the principal balances. Moreover, the sum of all vertical bars represents the total variance of the sample; therefore, a long vertical bar implies a balance that explains a good deal of the cc variance and a short vertical bar means that balance has a little variability in the sample. In Figure 1, we can also see the principal balances described by the dendogram and their percentage of variance (Table1).

As explained in Section 3.1, we select the principal balances that accounted for at least of the variability. Consequently, we pick , , , , and because they explain a cumulative variance of (see Table 2). Note that we have written the principal balances in descending order by percentage of variance and we denote them as , respectively. For comparative purposes, we have estimated the proposed model using different selected principal balances to compare the results when extracting different percentages of variance of the data. In Table 2, we can observe that the formulation with is the one with smallest value of 2002 DIC, 2004 DIC, and RMSD when fitting; consequently, this formulation is the most suitable option. In fact, it is underscored in Section 5.2 that is particularly interesting for explaining the dynamics of the microbial community.

5.2. Estimated Parameters

Once we have selected the principal balances, we can estimate the proposed model. We are going to do so using Bayesian analysis because it allows us to simulate the posterior distribution of the model parameters using Markov Chain Monte Carlo (MCMC), as implemented in the free statistical software JAGS [65]. We fixed a burn-in period of 10000 iterations and an adaptation period of 1000 iterations. We kept one posterior sample in 100 iterations until a set of 20000 iterations was obtained, except for the first and second simulation datasets where just 2000 iterations were needed. We implemented this process using the runjags package in R [66]. Note that in Table S1 (see supplementary material), we can find the potential scale reduction factor defined by [67] and obtained using the gelman.diag function of the Coda package [68]. Table S1 shows that the potential scale reduction factor belongs to the interval and is smaller than its upper confidence limit.

In Table 3 we can find the estimated parameters, denoted in Section 3.2 as . Note that in each equation in the model (see equation (7)), we need to estimate six parameters, the intercept, and the five parameters associated with the selected principal balances. The estimated value of represents the abundance of the family in the community; in fact, tiny intercepts are related to bacterial families of low abundance, whereas large intercepts indicate taxa of high abundance. In addition, the absolute value of the estimated represents the weight that the balance has in describing the relative abundance of the family . As a result, we can say that the absolute value of represents the influence of the balance on the relative abundance of the family .

Analysing the estimated parameters allows us to extract some information about the bacterial dynamics. Focusing on intercepts described in Table 3, we observe that the families o_Clostridiales; f_, f_Desulfovibrionaceae, and f_Erysipelotrichaceae have a low intercept, while the intercepts of f_Bacteroidaceae and Other are meaningful. A negative intercept indicates families of low abundance, whereas large positive intercepts indicate abundant ones. We define the percentage of sway that each family has on the bacterial community as . This percentage of sway provides information about the abundance of a family taking into account the other families; consequently, it enables us to compare the abundance of the different families. When the intercept is positive, a high percentage indicates high abundance and vice versa. However, when the intercept is negative, high percentage is related to low abundance and vice versa. In Table 3, the percentage of sway is written in bold, and we can observe that f_Bacteroidaceae and Other, which have positive intercepts and a percentage of sway of 33.91% and 17.50%, respectively, are the two most abundant families. In addition, o_Clostridiales; f_ has a negative intercept and a percentage of 13.75%; it is therefore the least abundant family. The percentage of sway can help to gain a better understanding of the different dynamics of microbes because low percentage indicates small families, and high percentage is related to the extreme families: it indicates the most abundant if the intercept is positive or the least abundant if the percentage is negative.

As we have mentioned before, the absolute value of the estimated parameters represents the weight that the balance has in describing the relative abundance of the family, and they therefore provide information about the influence that each balance has on the family. Centering our attention on the absolute value of the estimated , we can observe that is the one that most influences f_Ruminococcaceae, f_Lachnospiraceae, f_Veillonellaceae, and f_Bacteroidaceae. Consequently, the families that belong to the are particularly interesting to explain the dynamics of these families. Note that ; therefore, this balance explains the relationship between the components and (which are f_Erysipelotrichaceae and f_Ruminococcaceae respectively) at time point . A balance value close to zero can be interpreted as a high similarity between the numerator elements and the denominator components; however, the larger the absolute value of the balance, the greater the dissimilarity between them. Furthermore, in Table 3, we can also observe that the coefficient of has a significant absolute value in all families. As a result, , which is the selected principal balance with the highest percentage of variance, is also the one that has a significant influence on most of the bacterial families.

Families with low abundances will have low parameter values; as a result, in order to compare the influence of one in different families, it is interesting to study the percentage of influence . Observing Table 3, it seems that does not have an special influence on f_Erysipelotrichaceae because , while in the rest of the families , however, if we observe the value of all at f_Erysipelotrichaceae, we can observe that all the values are small and, in comparison, has a significant influence. This is the reason why it is interesting to study the percentage of influence present in Table 3. In f_Erysipelotrichaceae, the has a percentage of influence although the selected principal balance that most affects this family is with an influence of . It would be a mistake to think that has more influence on f_Ruminococcaceae than on f_Erysipelotrichaceae because the absolute value of the estimated parameter in f_Ruminococcaceae is greater than the value in f_Erysipelotrichaceae; in fact, the percentage of influence of on f_Ruminococcaceae is , which is smaller than the percentage of influence of the selected principal balance on f_Erysipelotrichaceae.

In Table S2 (see supplementary material), we present the estimated parameters obtained with the female dataset at genus level as it is the publicly available dataset with most parameters. Focusing on intercepts, we observe that g_Bacteroides and Other are the two genera that most influence the bacterial community, and g_Dorea is the genus with the smallest intercept. Studying the percentage of influence that each selected principal balance has on each genus, we can observe in Table S2 that is the one that most influences eight genera, and is the one that most influences six genera. This reflects the importance of and , which are the balances with the least percentage of variance. Consequently, taking into account the principal balances that accumulate at least of the variability is paramount in order to collect enough information. In addition, in Table S3, we can observe the estimated parameters obtained by analysing the male dataset, where both the selected principal balances with the lowest percentage of variance are also essential for describing the microbiome dynamics of the male subject.

5.3. Credible Intervals

The credible intervals present in Tables 3, S2, and S3 (see supplementary material) are noteworthy because most of them contain zero. In order to study this fact, we analyse the percentage of zeros included at the credible intervals and the number of parameters present in each equation of the model for all databases. We scrutinise these two aspects with the proposed model and the four approaches used for comparison; see details in Section 5.5.

Table 4 collects this analysis and points out two overall trends: (a) as the number of microbial entities present in the data grows, the number of parameters that contain zero in his credible interval rises; (b) the model with the fewest parameters in each equation of the model is the one with the fewest parameters containing the zero in its credible interval.

In all models, the expected value of each microbial entity depends on its corresponding autoregressive expression. In Table 4, we have denoted the number of parameters in each autoregressive expression as Param/equation. As the autoregressive expressions have the same number of parameters for all bacteria, the estimate parameters of the less abundant microbial entities must necessarily be small. In conclusion, small families will inevitably have small parameters, and as a result of working with small parameters, the credible intervals contain zero. As we are working with compositional data, the more microbial entities we have, the smaller these microbial entities are. As a result, the number of small parameters increases in datasets with a lot of microbial entities. To sum up, a zero present in a credible interval does not mean that the parameter is irrelevant; the zeros appear due to working with compositional data.

5.4. Graphic Results and Modeling Datasets with a Large Number of Microbial Entities

Graphic results show the ability of our model to both explain and predict the dynamics of microbiome communities. In order to predict, we estimate the model, and we denote as the value of the parameter at the iteration of the Markov Chain, with . In order to obtain predictions using the model, we apply the following reasoning:

Finally, denotes the predictions obtained with the model. In Figures 2 and 3 we can observe how many time points we have used to estimate the model and to make predictions for each dataset.

First, we analyse the graphic results obtained with the female dataset at family taxonomic level because it is straightforwardly comparable with the male dataset and the genus study, as mentioned in Section 5.1. Figure 2(a) indicates that the two families with the highest intercept values (f_Bacteroidaceae and Other, mentioned in Section 5.2) are also the families with the most relative abundance and variability. What is more, o_Clostridiales, f_Desulfovibrionaceae, and f_Erysipelotrichaceae, which are the families with the lowest intercepts, are the families with the lowest abundance. In order to measure the goodness of fit of the estimating and predicting part, we can observe the RMSD presented in Table 5.

In Figure 2 and Table 5, we can compare the results of the female and male datasets. Analysing Figure 2, we conclude that the families of the male have less variability than the families of the female. This leads to a decrease in the RMSD value obtained with the male dataset not only in the fitting part but also in the predicting part. Note that this drop is illustrated at Table 5.

Hitherto, we have analysed the datasets at family taxonomic level. In Figure 2, we can see that both female and male datasets have families. When we increase the number , we increase the number of equation (7) that our model has (see Section 3.2). In order to test our model with a higher value of , we analyse the female dataset at genus taxonomic level. In this case, and the dimension of the parameter estimation matrix is , as we can observe at Table S2. We must mention that the dimensions of with the female and male dataset at family taxonomic level are and , respectively (see Tables 5 and S3).

In Figure 3 and Table 5, we can observe the results of the estimation and prediction when analysing the data at the genus taxonomic level. Figure 3 shows that g_Bacteroides, g_Akkermansia, and Other are the genera with the most variability. Indeed, we can observe in Table S2 that g_Bacteroides and Other are the two genera with the highest intercept values, and g_Akkermansia is especially influenced by and .

Using simulated data, we have also managed to test our model when and . In Table 4, we can see the results obtained, and we can conclude that the model achieves modeling and predicting in short-term microbiome dynamics with higher values of . In fact, Table 5 shows that the RMSD of the fitting part of the data from the third simulation (with ) is smaller than the RMSD of the data from the second simulation (with ).

5.5. Comparison

For comparative purposes, we also show the results obtained with alternative models that differ from the proposed model due to the autoregressive component and the reparameterization of the Dirichlet parameters.

Reference [46] highlights the ability of the Dirichlet distribution with an autoregressive structure that uses the alr or the clr transformation as a link function and imposes a parameter that is equal to the sum of the Dirichlet parameters. On the one hand, we proposed the AR(1)-alr and AR(1)-clr models that take this construction and use it in a Bayesian framework while writing in the autoregressive structured an AR expression with the alr or clr transformation, respectively. On the other hand, [41] emphasizes the importance of balances to identify microbial signatures; for this reason, we present alr-Bal and alr-alr-Bal models, which are two alternative that maintain the general construction of [46] but introduce balances in the autoregressive structure and use it in a Bayesian framework. The following alternatives have been implemented: a modification of the alr-alr-Bal model using a balance whose numerator does not include the microbial entity , the alr-alr-Bal model but using the clr transformation to accompany the parameter and a balance that compares all microbial entities except with all the microbial entities, and an alteration of this last model applying the clr to transform the Dirichlet parameters. However, these alternatives did not lead us to obtain relevant fitting results, and consequently, they are not included in this article.

5.5.1. Comparison with AR(1)-alr Model

In this case, in order to link with , we propose:with

Note that is the concentration parameter of the Dirichlet distribution and must also be estimated.

5.5.2. Comparison with AR(1)-clr Model

This consideration is similar to the AR(1)-alr model. In this case, the centered log-ratio transformation is used. Note that this transformation scales each component vector by the geometric mean, .

In this alternative formulation, must also be estimated.

5.5.3. Comparison with alr-Bal Model

This model has the same structure as the AR(1)-alr model, but in this case:

Note that is a balance expression and must also be estimated.

5.5.4. Comparison with alr-alr-Bal Model

Following the structure of AR(1)-alr, this model presents an alternative formulation of the autoregressive component, combining the alr transformation and the balance used in the alr-Bal model.

Note that is a balance expression and must also be estimated.

We compare the proposed model with these four alternatives to assess the ability of principal balances compared to AR models or other balance options. In fact, analysing Table 5, we can thoroughly compare the five methods. The proposed model is the one that achieves the lowest DIC values not only at the 2004 DIC but also at the 2002 DIC. If we focus on the 2004 DIC, we observe that the proposed model has by far the smallest DIC value; in addition, alr-Bal and alr-alr-Bal models obtain similar results, but they enhance the values obtained with the AR methods (except in female datasets). In spite of the fact that AR methods describe similar behaviours, in general, we can observe an improvement in the AR(1)-alr model. This dynamic is also matched at the 2002 DIC values; however, the values of the 2002 DIC are smaller than the values of the 2004 DIC in all models.

We can find the smallest 2002 DIC and 2004 DIC in the female dataset analysed at genus taxonomic level. The proposed model minimizes the 2002 and 2004 DICs with the values of and , respectively. On the other hand, we observe the highest values of both DICs in the simulation with . The AR(1)-clr model maximizes the 2002 DIC with a value of , and AR(1)-alr maximizes the 2004 DIC with the value of . Note that in the female dataset analysed at genus taxonomic level, we can find the biggest difference between the DIC values obtained with the proposed model and the DICs achieved with other datasets. We must mention that the AR(1)-alr model did not achieve convergence with the dataset at genus taxonomic level. In conclusion, the difference between the proposed model and the other models is reflected far more clearly in this dataset.

In Table 5, we also can study the RMSD values of the fitting and the predicting parts. Notice that the values of the fitting part are smaller than the values of the predicting part in all models. Focusing on the fitting part, we can observe that alr-Bal and alr-alr-Bal have larger RMSE values than the proposed model in all datasets; however, AR(1)-alr and AR(1)-clr have in general smaller values than the proposed model. Despite this, the proposed model has advantages over these models: it reduces the number of parameters and this enhances the estimation of large datasets. This is reflected in the simulated dataset with where the proposed model has the smallest RMSE value, improving on the AR(1)-alr and AR(1)-clr RMSE values. Shifting the focus from fitting to predicting, we can observe that the AR(1)-alr or the AR(1)-clr has a larger RMSE value in all the datasets with except for the simulated dataset with , where the difference between the RMSE values of the proposed model and the AR(1)-alr or AR(1)-clr is small. As a result, in this dataset, the models with AR structure and the proposed model have a similar prediction capacity; however, the proposed model describes the bacterial dynamics better, as we have seen before with the RMSE values of the fitting part. The same is true of the alr-Bal and alr-alr-Bal models; they have a smaller RMSE value in the predicting part than the proposed model, but they have the worst RSME value in the fitting part, so these models are the ones that worst describe the bacterial dynamics.

In conclusion, the proposed model is the one with the smallest DIC values, and this leads to us conclude that the proposed model (PM) is the most suitable approach to model microbiome dynamics. We must take into account that the proposed model does not have the smallest RMSE value in general; in these cases, however, the proposed model still has an advantage over the other models: it better describes the bacterial dynamics of large datasets.

6. Conclusion

In this paper, we develop a compositional autoregressive model in Bayesian framework that is able to describe microbiome time series dynamics. In order to model the relative abundance of microbiome longitudinal data, we use the Dirichlet distribution with time-varying parameters. We propose that the logarithm of the Dirichlet parameters can be explained by an autoregressive structure, which contains the principal balances that account for at least of the variability. A comparison study selecting principal balances that account for a lower percentage of variance is provided in this article. Even though we have developed the model for the analysis of microbiome longitudinal data, it can be adapted for other scenarios defined by longitudinal data with compositional nature.

Principal balances are a compositional tool that enables us to extract the balances that maximize the variance. As balances with more variability will better explain the variability of the data, this formulation allows us to reduce the number of parameters to be estimated without removing or grouping microbial taxa. A direct consequence of reducing the number of parameters is being able to model datasets with a higher number of microbial taxa. In fact, we demonstrated the ability of our model with six datasets, three of which are publicly available and present the data of a healthy female and a healthy male, while the rest are simulated. Two of the publicly available datasets have 9 families to model and the other has 18 genera. The simulated datasets have 15, 20, and 60 microbial taxa to model, respectively.

In conclusion, we have tested the ability of our model with datasets that have different numbers of microbial taxa, different taxonomic levels, healthy people of different sexes, and datasets that are publicly available and simulated. This paper presents the results obtained by analysing these datasets using the proposed model and four alternative models with which we compare the proposed model. These alternative models differ from the proposed model because they change the reparameterization of the Dirichlet parameters using the alr and the clr transformations, and they modify the autoregressive component using an AR model structure or other balance options. After comparing them, we conclude that models with balances are the ones that worst describe the bacterial dynamics but obtain the best prediction results. Models with AR structure describe the data better than the proposed model, except for big datasets, where they also present the worst prediction results. Finally, the proposed model better describes the bacterial dynamics of big datasets due to reducing the number of parameters and obtains the lowest DIC values. In addition, it can be effective for short-term prediction of the future dynamics of a microbial community. This approach can be an interesting alternative to predict the future dynamics of a microbial community.

Additionally, the proposed model allows us to analyse the microbial interactions using the value of the estimated parameters. Indeed, the estimated value of the intercept represents the abundance of the microbial taxa in the community; as a matter of fact, small intercepts are related to bacterial families of low abundance, whereas large intercepts indicate taxa of high abundance. In addition, the absolute value of the rest of the estimated parameters represents the weight that the selected principal balance has in describing the relative abundance of the family; as a result, we can say that the absolute value of represents the influence of the balance at the relative abundance of the family . We must mention that balances provide information about the similarity between their numerator and denominator components; in fact, balances that are near zero indicate that the two groups are similar, and the higher the absolute value of the balance, the larger is the dissimilarity between them.

Nevertheless, our proposal also has some limitations. Even if we choose the method for obtaining the principal balances that requires least computation time, estimating the models in big datasets still requires a significant amount of time. As future development, in order to improve the usefulness of the model, we propose to extend it to consider external covariates. The incorporation of covariates affecting microbiota dynamics, such as antibiotics or diet variations, may enable us to study the effectiveness of some medical treatments. Moreover, including random effects will allow us to study different subjects at the same time, such as healthy and unhealthy individuals, for example.

Data Availability

The data can be found at https://drive.google.com/drive/folders/1xIbHcjT0P-TJ6MQ5P91FwoEbu3gRkM2n?usp=sharing.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by grants from the Spanish Ministry of Economy and Competitiveness (projects MTM2017-83850-P, SAF2012-31187, SAF2013-49788-EXP, and SAF2015-65878-R), Carlos III Institute of Health (projects PIE14/00045 and AC15/00022), Generalitat Valenciana (projects PrometeoII/2014/065 and Prometeo/2018/A/133), and Asociación Española Contra el Cancer (project AECC 2017-1485) and was co-financed by the European Regional Development Fund (ERDF).

Supplementary Materials

We present the Tables S1–S3 as supplementary material. Table S1: potential scale reduction factor (labelled psrf) and their upper 95% confidence limits (labelled upper C.I.) of all the estimated parameter aij. Table S2: estimated parameters (aij) and credible interval obtained applying the Proposed Model to the female dataset at genus taxonomic level. The table also contains the percentage of influence that each SPBalj has on each genus and, in bold, it is represented the percentage of sway that each genus has on the bacterial community. Table S3: estimated parameters (aij) and credible interval obtained applying the Proposed Model to the Male dataset. The table also contains the percentage of influence that each SPBalj has on each family and, in bold, it is represented the percentage of influence that each family has on the bacterial community. (Supplementary Materials)