A Mixture of Regular Vines for Multiple Dependencies
To uncover complex hidden dependency structures among variables, researchers have used mixtures of vine copula constructions. To date, these have been limited to a subclass of regular vine models, the so-called drawable vine, fitting only one type of bivariate copula to all variable pairs. However, in many real datasets, the hidden dependence is likely to vary from one pair of variables to another, and a single bivariate copula type cannot capture such variation. In addition, the regular vine copula model is more capable and flexible than its subclasses. Hence, to describe complex hidden dependency structures more fully and to add further flexibility to mixture vine models, this paper proposes a mixture of regular vine models with a mixed choice of bivariate copulas. The model was applied to simulated and real data to illustrate its performance. The proposed model shows a clear improvement in fit over the mixture of R-vine densities with a single copula family fitted to all pairs.
Real data often exhibit complex multivariate mixture dependency structures among variables. These dependencies may vary from one pair of variables to another. The variation in the dependency structures adds extra complexity for modelling and capturing these types of relationships. The Gaussian mixture model is commonly used to model data with complex dependency structures due to its ease of implementation. However, the main limitation of this model is that mixture Gaussian dependencies are assumed between all variables. Furthermore, all univariate marginal distributions are assumed to be Gaussian, which may not be the case for many real datasets. Hence, this type of model may not provide an adequate fit to the data, resulting in inaccurate estimates of quantities derived from the fitted model. Copula theory, however, provides two main strengths that overcome these limitations. First, in copula models, due to Sklar’s theorem (Sklar ), margins are modelled separately from the dependency structures. Hence, margins do not need to be from the same type of distribution. Second, there is a wide range of copula families to describe various types of dependencies, including heavy tails. These two main advantages make copula a popular model in many areas. In a mixture context, for example, Zhang and Shi  introduced a Bayesian network with a copula model to study cancer data. Gunawan et al.  studied Bayesian estimation of copulas in high-dimensional cases of discrete and mixed margins. In finance, for example, a mixture of three copulas, Gaussian, Gumbel, and Gumbel survival, was studied by Hu . Khaled and Kohn  investigated the properties of the mixture copula using different copula classes. Marbac et al.  introduced a mixture of Gaussian copulas.
Regardless of the flexibility of the copula model, it imposes the same dependency structures among variables. Furthermore, multivariate copula types are mostly limited to elliptical copula families. Also, extending copula functions to high-dimensional cases is known as a difficult problem .
To overcome the limitations of copula models, Aas et al.  introduced the so-called pair-copula construction (PCC) model. PCC decomposes a multivariate density into a cascade of bivariate copula functions. Therefore, only two variables are modelled at a time using bivariate copula families, possibly belonging to different types. These flexibilities have led PCC to receive great interest in the literature of different fields, for example, in geostatistics (Gräler and Pebesma ; Erhardt et al. ) and in finance (Min and Czado [10, 11]).
In the mixture context, several researchers have incorporated pair-copula models with a finite mixture model. For example, Kim et al.  formed a mixture of drawable vine (D-vine) density models using multiple D-vine constructions that provide a high-flexibility mixture model and facilitate the comprehensive study of complex multivariate models. The authors illustrated the improvement of the model fit to capture hidden dependency structures. Similarly, a mixture of D-vine copulas was introduced by Zheng et al.  for chemical process monitoring. In their models, only one type of copula family was fitted to all pairs of variables.
A mixture canonical vine (C-vine) model was introduced by Sun et al. . They showed that the performance of the C-vine copula mixture model was superior to other methods, including K-means and Gaussian mixture models (GMMs). Similarly, Qiu et al.  introduced C-vine and D-vine mixture copula models to analyze the dependency structure of the multiwind power output. Evkaya et al.  incorporated C-vine into the D-vine copula model (CD-vine) using fixed pair-copula for each density to model the dependency pattern among variables based on their temporal order.
Despite the strengths of the mixture PCC models, only C- and D-vine mixture-based models have been introduced in the literature due to their ease of computation. The regular vine (R-vine) model, however, is a general class that is formed as a mix of C- and D-vines. R-vine gives more flexibility for tree structures than C- and D-vines since it does not impose a specific rule for ordering the variables. Dißmann et al.  stated that R-vine copula-based models provide much more modelling capabilities. In addition, the authors provided a comparison example with several scenarios, including a mixed R-vine (different copula types for each pair of variables). They concluded that, overall, their example demonstrated the usefulness of the R-vine model with a different choice of copula for each copula term.
Nowadays, R-vine models have been receiving increasing interest in the literature. For example, Dißmann et al.  provided a novel method for selecting and estimating R-vine copula models sequentially (tree by tree) using array representation. The array representation goes back to Morales Napoles et al. , who aimed to count the number of R-vine tree structures. The spatial R-vine model was introduced by Erhardt et al. . In finance, two Bayesian R-vine models were developed by Gruber and Czado [20, 21]. In chemical process monitoring, Zhou and Li  applied the R-vine copula model to describe high-dimensional interdependence among complex variables.
In a mixture context, however, R-vine has not been investigated in the literature. Introducing mixture vine copulas into conventional vine pair-copula construction is a challenging task that opens up areas for future research regarding the selection of such models . From the previous studies of the mixture PCC models and the R-vine models, a mixture of R-vine models may provide further flexibility for modelling highly complex hidden dependencies among the variables. Furthermore, specifying different bivariate copula types for each copula term improves the model fit for capturing multivariate dependencies among the variables and hence reduces the misspecification of the dependency structures. Therefore, a mixture of R-vine density models with mixed copula types will introduce further improvement and flexibility to the model. Although Kim et al.  mentioned, as a valuable extension of their work, fitting different bivariate copula types for all copula terms of a mixture of D-vines, this has not been investigated in the literature.
Therefore, inspired by and building on the works of Dißmann et al.  and Kim et al. , this paper develops the first mixture of R-vine density models in the literature with a mixed choice of bivariate copulas (the copula family is specified individually for each pair in each density). The proposed model aims to introduce higher flexibility to mixture vine copula models, providing a way to fully capture mixed hidden complex correlations among variables.
The rest of the paper is organised as follows: Section 2 reviews a brief theoretical background of the copula and the regular vine copula. Section 3 discusses the selection strategy of the mixture of R-vines, describes the expectation-maximisation algorithm (the method used to estimate the model parameters), and provides an example of a two-component mixture R-vine density model stored in an array representation. Section 4 illustrates the performance of the proposed model on simulated and real datasets. Moreover, the new mixture model is compared with a mixture R-vine model where a single copula family is fitted to all pairs of the variables.
2. Theoretical Background
Generally, a copula function can be defined as a function that “links or joins” multivariate distribution functions to their univariate uniform marginal distribution functions .
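Theorem 1. (see Sklar  in Joe ). If $F$ is an $n$-variate distribution function with univariate margins $F_1, \ldots, F_n$, then there exists an $n$-copula function $C$ such that
$$F(x_1, \ldots, x_n) = C\big(F_1(x_1), \ldots, F_n(x_n)\big). \tag{1}$$
If the margins are continuous, then the copula
$$C(u_1, \ldots, u_n) = F\big(F_1^{-1}(u_1), \ldots, F_n^{-1}(u_n)\big) \tag{2}$$
is unique, where $F_i^{-1}$ is the inverse function of the marginal function $F_i$ and $u_i = F_i(x_i)$, $i = 1, \ldots, n$.
The joint density function can be obtained by taking the partial derivatives of (2):
$$f(x_1, \ldots, x_n) = c\big(F_1(x_1), \ldots, F_n(x_n)\big) \prod_{i=1}^{n} f_i(x_i), \tag{4}$$
where $c$ is a copula density function and $f_i$ is the density of the margin $F_i$.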
From (4), the joint density function can be factorised into its dependence structures and univariate margins. Hence, from Sklar’s theorem (Sklar ), margins can be modelled separately from the dependence structures. This forms the main advantage of copula models. Another advantage of copulas is that they are invariant under strictly increasing transformations of the random variables, while the margins may be changed . In addition to these advantages, there are various copula families, such as elliptical copulas (Gaussian and Student's $t$) and Archimedean copulas (e.g., Frank, Clayton, and Gumbel), which can model a wide range of non-Gaussian dependence structures including tail dependencies. For more families, see Joe . These advantages make copula-based models widely used in different areas such as finance (e.g., Embrechts et al.  and Cherubini et al. ), spatial statistics (e.g., Bárdossy  and Kazianka and Pilz [30, 31]), hydrogen production (e.g., Qiu et al. ), and climate study (e.g., Khan et al. ).
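The factorisation into a copula density and univariate margins can be illustrated numerically. The following is a minimal sketch, not taken from the paper: it assumes a bivariate Clayton copula combined with one exponential and one standard normal margin, and all function names are hypothetical.

```python
import math

def clayton_density(u, v, theta):
    """Clayton copula density c(u, v; theta), valid for theta > 0."""
    s = u ** (-theta) + v ** (-theta) - 1.0
    return (1.0 + theta) * (u * v) ** (-theta - 1.0) * s ** (-2.0 - 1.0 / theta)

def exp_pdf(x, rate):
    """Density of an exponential margin."""
    return rate * math.exp(-rate * x)

def exp_cdf(x, rate):
    """Distribution function of an exponential margin."""
    return 1.0 - math.exp(-rate * x)

def norm_pdf(x):
    """Density of a standard normal margin."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    """Distribution function of a standard normal margin."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def joint_density(x1, x2, theta, rate=1.0):
    """f(x1, x2) = c(F1(x1), F2(x2)) * f1(x1) * f2(x2)."""
    u, v = exp_cdf(x1, rate), norm_cdf(x2)
    return clayton_density(u, v, theta) * exp_pdf(x1, rate) * norm_pdf(x2)

# Two margins of different types share one dependence structure.
print(joint_density(0.7, 0.2, theta=2.0))
```

As the Clayton parameter approaches zero the copula density tends to one, so the joint density reduces to the product of the two marginal densities.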
Despite the advantages of the copula-based model, it imposes the same dependency structures between all variables in high-dimensional cases. Also, constructing a higher-dimensional copula is known as a difficult problem (Aas et al. ). To overcome these limitations, a novel multivariate model using only bivariate copula families has been developed in the literature. This new model is known as a vine copula or a pair-copula-based model.
The following section introduces a brief theoretical background of the vine copula model. For more details, the interested readers are referred to Aas et al.  and Dißmann .
A vine copula model decomposes multivariate copulas into bivariate copulas (pair-copula) to build high-dimension models using only bivariate copulas. In doing so, the pair-copula models provide a flexible way to model multiple complex high-order dependence structures using only bivariate copulas, possibly belonging to different copula families. The pair-copula model originated with Joe  and was later investigated and named as a regular vine model by Bedford and Cooke [35, 36]. Kurowicka and Cooke  further developed this model. Later, the full inference of the C-vine and D-vine models was introduced by Aas et al. .
In C-vine, all variables are modelled with respect to a particular variable, while in D-vine, variables are ordered sequentially.
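As a small worked illustration of the pair-copula idea, the well-known three-variable decomposition (as given, for example, in Aas et al.) can be written as below. It is included here only as a sketch: the choice of variable 2 as the conditioning variable is one of three possible orderings, and in three dimensions the C-vine and D-vine decompositions coincide.

```latex
\begin{aligned}
f(x_1, x_2, x_3) ={}& f_1(x_1)\, f_2(x_2)\, f_3(x_3) \\
&\times c_{12}\big(F_1(x_1), F_2(x_2)\big)\,
        c_{23}\big(F_2(x_2), F_3(x_3)\big) \\
&\times c_{13|2}\big(F(x_1 \mid x_2), F(x_3 \mid x_2)\big)
\end{aligned}
```

The first two pair-copulas are unconditional and belong to the first tree, while $c_{13|2}$ belongs to the second tree and takes conditional distribution functions as its arguments.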
Definition 2. (tree; see Bedford and Cooke ). $T = \{N, E\}$ is a tree (an acyclic graph) with nodes $N$ and edges $E$ (each edge connects a pair of nodes in $N$).
The degree of the node is the total number of edges connected to this node.
Definition 3. (vine, regular vine; see Ch. 4 of Kurowicka and Cooke ). $\mathcal{V}$ is a vine on $n$ elements if
(i) $\mathcal{V} = (T_1, \ldots, T_{n-1})$, where $T_1$ indicates the first tree of the vine, and so on.
(ii) $T_1$ is a connected tree with nodes $N_1 = \{1, \ldots, n\}$ and edges $E_1$.
(iii) For $i = 2, \ldots, n-1$, $T_i$ is a connected tree with nodes $N_i = E_{i-1}$.
In addition, $\mathcal{V}$ becomes a regular vine on $n$ elements if
(iv) For $i = 2, \ldots, n-1$, if $a = \{a_1, a_2\}$ and $b = \{b_1, b_2\}$ in $N_i$ are two nodes connected by an edge in $T_i$, then exactly one of $a_1, a_2$ is equal to one of $b_1, b_2$. This condition is known as the proximity condition.
Under the proximity condition, two nodes in tree $T_i$ are only connected by an edge if they share a common node in the previous tree ($T_{i-1}$). Kurowicka and Cooke  defined the D- and C-vine models as follows:
(i) If every node in the first tree of a regular vine is connected to at most two other nodes, then the regular vine is called a D-vine.
(ii) If in each tree of a regular vine there is one particular node that is connected to all other nodes, then the regular vine is called a C-vine. In the first tree, this node is called a root node.
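Definition 4. (regular vine (R-vine) specification; see Bedford and Cooke ). ($F$, $\mathcal{V}$, $B$) is a regular vine copula (R-vine copula) specification if
(i) $F = (F_1, \ldots, F_n)$ is a vector of continuous invertible distribution functions
(ii) $\mathcal{V}$ is an $n$-dimensional regular vine (R-vine)
(iii) $B = \{B_e : e \in E_i,\ i = 1, \ldots, n-1\}$ is a set of bivariate copulas
Let $X = (X_1, \ldots, X_n)$ be a vector of random variables, $e$ an edge with conditioned nodes $j(e)$ and $k(e)$, and $D(e)$ the conditioning set of edge $e$. Bedford and Cooke  defined regular vine dependence as follows.
Definition 5. (regular vine (R-vine) dependence). A joint distribution function $F$ on $X$ is said to realise a regular vine copula specification ($F$, $\mathcal{V}$, $B$) or exhibit regular vine dependence if for each $e \in E_i$, the bivariate copula of $X_{j(e)}$ and $X_{k(e)}$ given $X_{D(e)}$ is a member of the bivariate copula $B_e$. The marginal distribution of $X_i$ is $F_i$, for $i = 1, \ldots, n$.
The bivariate copula of $X_{j(e)}$ and $X_{k(e)}$ given $X_{D(e)}$ is a conditional bivariate copula which is assumed to be independent of the conditioning variables (see Aas et al.  and Haff et al. ).
Theorem 2. (see ). Let ($F$, $\mathcal{V}$, $B$) be an $n$-dimensional regular vine specification. Then, there is a unique distribution function that realizes ($F$, $\mathcal{V}$, $B$). Its density is
$$f(x_1, \ldots, x_n) = \prod_{k=1}^{n} f_k(x_k) \prod_{i=1}^{n-1} \prod_{e \in E_i} c_{j(e),k(e) \mid D(e)}\big(F(x_{j(e)} \mid x_{D(e)}),\ F(x_{k(e)} \mid x_{D(e)})\big),$$
where $x_{D(e)}$ denotes the conditioning variables in the conditioning set $D(e)$, i.e., $x_{D(e)} = \{x_i : i \in D(e)\}$, and $f_k$ is the density of $F_k$, $k = 1, \ldots, n$. Moreover, $c_{j(e),k(e) \mid D(e)}$ stands for the density function of the bivariate copula associated with edge $e$.
Continuing with the last theorem, let $e \in E_i$ be the edge that joins two nodes $a$ and $b$ of tree $T_i$. Joe  showed that the conditional marginal distributions, $F(x_{j(e)} \mid x_{D(e)})$ and $F(x_{k(e)} \mid x_{D(e)})$, can be obtained as follows. For a conditioning vector $v$ with an arbitrary component $v_l$, and with $v_{-l}$ denoting $v$ without that component,
$$F(x \mid v) = \frac{\partial\, C_{x, v_l \mid v_{-l}}\big(F(x \mid v_{-l}),\ F(v_l \mid v_{-l})\big)}{\partial F(v_l \mid v_{-l})}.$$
The resulting values of $F(x_{j(e)} \mid x_{D(e)})$ and $F(x_{k(e)} \mid x_{D(e)})$ are then called transformed variables (see Aas et al.  and Dißmann et al. ).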
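For a bivariate copula, the conditional distribution above reduces to the partial derivative of the copula with respect to the conditioning argument, often called an h-function. The sketch below assumes a Clayton copula, which is chosen here only because its derivative has a simple closed form; the function names are hypothetical.

```python
def clayton_cdf(u, v, theta):
    """Clayton copula C(u, v; theta), valid for theta > 0."""
    return (u ** (-theta) + v ** (-theta) - 1.0) ** (-1.0 / theta)

def clayton_h(u, v, theta):
    """h(u | v) = dC(u, v)/dv, the conditional distribution of U given V = v."""
    s = u ** (-theta) + v ** (-theta) - 1.0
    return v ** (-theta - 1.0) * s ** (-1.0 - 1.0 / theta)

# A transformed variable that would be passed on to the next tree.
print(clayton_h(0.3, 0.6, theta=2.0))
```

The closed form can be checked against a central finite difference of `clayton_cdf` in its second argument, which is a convenient sanity test when adding further copula families.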
Regular vine is a general case of vine copula that includes both C- and D-vines. The number of R-vine tree structures is considerably larger than the number of C- and D-vine structures. Morales Napoles et al.  showed that, for $n$ variables, there are $\frac{n!}{2} \cdot 2^{\binom{n-2}{2}}$ possible R-vine tree structures. The following presents one possible example of a 5-dimensional R-vine copula model (following the details given in Aas et al. ).
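To give a sense of how quickly the number of candidate structures grows, the count can be evaluated directly. This is a small sketch with hypothetical function names; the C- and D-vine counts of $n!/2$ each are the figures reported by Aas et al. and are included only for comparison.

```python
from math import comb, factorial

def n_rvine_structures(n):
    """Number of R-vine tree structures on n variables: n!/2 * 2^C(n-2, 2)."""
    if n < 2:
        raise ValueError("a vine needs at least two variables")
    return factorial(n) // 2 * 2 ** comb(n - 2, 2)

def n_cvine_or_dvine_structures(n):
    """Number of C-vine (equally, D-vine) structures on n variables: n!/2."""
    if n < 2:
        raise ValueError("a vine needs at least two variables")
    return factorial(n) // 2

for n in (3, 4, 5, 8):
    print(n, n_rvine_structures(n), n_cvine_or_dvine_structures(n))
```

For five variables this gives 480 R-vine structures against 60 C-vines and 60 D-vines, which is why exhaustive search over structures is impractical beyond very small dimensions.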
2.2.1. Example of a 5-Dimensional R-Vine Copula
In this example, we have 5 variables, 4 trees, and 10 edges. At the first tree, all pairs of variables are modelled via unconditional bivariate copula families. For the upper levels, the dependencies are captured via conditional bivariate copulas. In this case, there are two sets of variables: the conditioning and conditioned variables. For example, $c_{13|2}$ is the density of the conditional bivariate copula family that models the dependency structure between the first and third variables given the second variable. In this case, the first and third variables are called conditioned variables, while the second variable is called a conditioning variable. In general, the variables that appear before the bar "|" are the conditioned variables, while the variables after it are the conditioning variables. The joint density of this 5-dimensional R-vine is as follows:
3. A Mixture of R-Vine Densities
Mixture models provide a flexible way to capture multivariate hidden dependency structures among variables. A finite mixture model is a model applied to data that are assumed to be generated from a finite number of unknown distributions. The density in the finite mixture model is a weighted sum of a finite number of densities.
Kim et al.  formed a mixture of D-vine models with a single copula family fitted to all copula pairs. This section extends the work of Kim et al.  from a mixture of D-vines to a mixture of R-vine model with an individual choice of copula types for each pair in each density of the model.
3.1. Finite Mixture Model
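Suppose that an $n$-dimensional random vector $X = (X_1, \ldots, X_n)$ is generated from a $k$-component mixture of R-vine density model. Hence, the density of $X$ is given by
$$f(x; \Theta) = \sum_{j=1}^{k} \pi_j f_j(x; \theta_j), \tag{8}$$
where $f_j$ is the R-vine density of the $j$th component and $\pi_j$ is an unknown parameter (known as a mixture coefficient or weight) of the $j$th component that satisfies the following:
$$0 \le \pi_j \le 1, \qquad \sum_{j=1}^{k} \pi_j = 1.$$
$\Theta = (\pi_1, \ldots, \pi_k, \theta_1, \ldots, \theta_k)$ is the set of all model parameters, while $\theta_j$ is the set of all the parameters of the $j$th component. In mixture models, the expectation-maximisation algorithm (EM algorithm) is a commonly used method to estimate the model parameters. Further details of this method will be introduced in the next section.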
3.1.1. EM Algorithm
Expectation-maximisation algorithm (EM) (Dempster et al. ) is an estimation method with two steps: the so-called expectation step (E-step) and the maximisation step (M-step).
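Suppose that a dataset $x = (x_1, \ldots, x_N)$ of size $N$ is drawn independently from the $k$-component mixture of R-vine density model given in equation (8). Suppose further that the data are converted into uniform distribution using an empirical cumulative distribution function, giving $u_1, \ldots, u_N$. Then, the log pseudo-likelihood function of $\Theta$ is given as follows:
$$\ell(\Theta) = \sum_{i=1}^{N} \log\Big[\sum_{j=1}^{k} \pi_j f_j(u_i; \theta_j)\Big],$$
where $\Theta$ is the set of all model parameters and $\theta_j$ is the set of all the parameters of the $j$th component. The EM algorithm introduces a latent variable $z_i = (z_{i1}, \ldots, z_{ik})$, where $z_{ij} = 1$ if $u_i$ is drawn from the $j$th component (the $j$th component of the mixture model) and $z_{ij} = 0$ otherwise. In other words, $z_i$ indicates from which mixture component each observation was drawn. These latent variables are assumed to be independent and identically distributed from the multinomial distribution such that
$$P(z_{ij} = 1) = \pi_j, \qquad j = 1, \ldots, k.$$
Consequently, we now have the complete data:
$$\{(u_i, z_i) : i = 1, \ldots, N\}.$$
Then, the complete-data log likelihood function, $\ell_c(\Theta)$, is given as follows:
$$\ell_c(\Theta) = \sum_{i=1}^{N} \sum_{j=1}^{k} z_{ij}\big[\log \pi_j + \log f_j(u_i; \theta_j)\big]. \tag{14}$$
The EM algorithm starts with initial values of the unknown parameters, $\Theta^{(0)}$, and the two steps (E and M) are repeated until the change between successive iterations (e.g., $|\ell(\Theta^{(t+1)}) - \ell(\Theta^{(t)})|$) is smaller than a prespecified tolerance.
E-step: calculate the conditional expectation of the complete-data log likelihood, $\ell_c(\Theta)$ in equation (14), given the observed data and using the current estimate of the parameters $\Theta^{(t)}$.
Suppose that we are at iteration $t$. Then, the conditional expectation of $\ell_c(\Theta)$ is calculated as follows:
$$Q(\Theta; \Theta^{(t)}) = \sum_{i=1}^{N} \sum_{j=1}^{k} \tau_{ij}^{(t)}\big[\log \pi_j + \log f_j(u_i; \theta_j)\big], \qquad \tau_{ij}^{(t)} = \frac{\pi_j^{(t)} f_j(u_i; \theta_j^{(t)})}{\sum_{l=1}^{k} \pi_l^{(t)} f_l(u_i; \theta_l^{(t)})}.$$
M-step: maximise the expected complete-data log likelihood, $Q(\Theta; \Theta^{(t)})$ (from the E-step), with respect to ($\Theta$) in order to produce a new estimate of the model parameters ($\Theta^{(t+1)}$). In this step, the estimation of each component parameter is computed independently, i.e., $\pi_j^{(t+1)}$ and $\theta_j^{(t+1)}$.
The new estimate of $\pi_j$ can be obtained as follows:
$$\pi_j^{(t+1)} = \frac{1}{N} \sum_{i=1}^{N} \tau_{ij}^{(t)},$$
while the updated $\theta_j$ can be obtained by maximising the following equation using a numerical maximisation method:
$$\theta_j^{(t+1)} = \arg\max_{\theta_j} \sum_{i=1}^{N} \tau_{ij}^{(t)} \log f_j(u_i; \theta_j).$$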
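The two steps can be sketched in code for a much simpler setting than the full model. The example below is an assumption-laden illustration, not the paper's implementation: it fits a two-component mixture of bivariate copulas from different one-parameter families (Clayton and Frank) instead of full R-vine densities, and it uses a golden-section search as a stand-in for the numerical maximisation in the M-step. All function names, bounds, and defaults are hypothetical.

```python
import math

def clayton_pdf(u, v, th):
    """Clayton copula density, th > 0."""
    s = u ** (-th) + v ** (-th) - 1.0
    return (1.0 + th) * (u * v) ** (-th - 1.0) * s ** (-2.0 - 1.0 / th)

def frank_pdf(u, v, th):
    """Frank copula density, th != 0."""
    a = 1.0 - math.exp(-th)
    d = a - (1.0 - math.exp(-th * u)) * (1.0 - math.exp(-th * v))
    return th * a * math.exp(-th * (u + v)) / (d * d)

def golden_max(fn, lo, hi, iters=40):
    """Golden-section search for a maximiser of fn on [lo, hi]."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - g * (b - a), a + g * (b - a)
    fc, fd = fn(c), fn(d)
    for _ in range(iters):
        if fc < fd:                      # maximiser lies in [c, b]
            a, c, fc = c, d, fd
            d = a + g * (b - a)
            fd = fn(d)
        else:                            # maximiser lies in [a, d]
            b, d, fd = d, c, fc
            c = b - g * (b - a)
            fc = fn(c)
    return (a + b) / 2.0

def log_lik(data, weights, pdfs, thetas):
    """Observed-data log likelihood of the mixture."""
    return sum(math.log(sum(w * f(u, v, t)
                            for w, f, t in zip(weights, pdfs, thetas)))
               for u, v in data)

def em_mixture(data, pdfs, bounds, thetas, weights, tol=1e-6, max_iter=100):
    """EM for a finite mixture of one-parameter bivariate copulas."""
    k, n = len(pdfs), len(data)
    thetas, weights = list(thetas), list(weights)
    ll = log_lik(data, weights, pdfs, thetas)
    history = [ll]
    for _ in range(max_iter):
        # E-step: posterior probability that point i came from component j.
        resp = []
        for u, v in data:
            num = [weights[j] * pdfs[j](u, v, thetas[j]) for j in range(k)]
            tot = sum(num)
            resp.append([x / tot for x in num])
        # M-step: closed-form weights, numerical update of each parameter.
        weights = [sum(r[j] for r in resp) / n for j in range(k)]
        for j in range(k):
            def q(t, j=j):
                return sum(r[j] * math.log(pdfs[j](u, v, t))
                           for r, (u, v) in zip(resp, data))
            cand = golden_max(q, *bounds[j])
            if q(cand) >= q(thetas[j]):  # accept only if Q does not decrease
                thetas[j] = cand
        new_ll = log_lik(data, weights, pdfs, thetas)
        history.append(new_ll)
        if abs(new_ll - ll) < tol:
            break
        ll = new_ll
    return weights, thetas, history
```

A call such as `em_mixture(data, [clayton_pdf, frank_pdf], [(0.05, 15.0), (-30.0, -0.1)], [1.0, -2.0], [0.5, 0.5])` returns the estimated weights, the estimated copula parameters, and the log-likelihood trace. Accepting a candidate only when it does not lower the expected complete-data log likelihood keeps the trace non-decreasing, which is a useful check on any EM implementation.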
3.2. Model Selection of the Mixture of R-Vine Densities
Statistical inference algorithms for computing the log-likelihood function of, and for simulating from, an R-vine distribution using the lower triangular arrays of R-vine copula models were introduced by Dißmann et al. . Algorithm 2.1 from Dißmann et al.  computes the density of a given R-vine specification. In addition, the authors showed how to calculate the log likelihood of the evaluated density to be used for estimating the pair-copula parameters using, for example, maximum likelihood. Their method introduced high flexibility for R-vine copula models.
Inspired by and based on Algorithm 2.1 from Dißmann et al.  and the method of Kim et al. , the proposed mixture of R-vine density model with mixed copula families is presented in this section. I termed it a Mixture of R-vine Density Model with Mixed Families (MRDMMF). An example of constructing a two-component mixture of R-vine model, in an array representation, is given in Section 3.3.
3.2.1. Full Inferences and the Selection Strategy of the R-Vine Mixture Model
The full inference of pair-copula models (see Aas et al. ), in general, requires the following four main steps:
(i) Obtain the normalised ranks of the original data
(ii) Order the variables (tree structure)
(iii) Select the appropriate bivariate copula type for each copula term (possibly belonging to different copula types)
(iv) Estimate the parameters of the bivariate copula families
For the first step and to avoid misspecifying the margins, it is useful to transform the margins nonparametrically using the empirical cumulative distribution function.
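The rank transformation in the first step can be sketched as follows. This assumes the common convention of dividing the rank by $N + 1$ so that the values stay strictly inside the unit interval, with tied values sharing their average rank; the function name is hypothetical.

```python
def pseudo_obs(column):
    """Normalised ranks rank/(N + 1) of one variable, averaging ranks of ties."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the block of tied values starting at sorted position i.
        j = i
        while j + 1 < n and column[order[j + 1]] == column[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0        # average 1-based rank of the block
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return [r / (n + 1.0) for r in ranks]

print(pseudo_obs([10.0, 30.0, 20.0]))
```

Each variable is transformed separately, so a dataset with several columns is converted by applying the function column by column.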
A comparison study between semiparametric and parametric methods for estimating copulas was carried out by Kim et al. . The authors illustrated, by simulation studies, that the pseudo-maximum likelihood (PML) method (see Genest et al.  and Ch. 5 of Cherubini et al. ) for estimating copula parameters performs better than the full maximum likelihood (ML) and the inference function for margins (IFM) (Joe [42, 43]) methods when the marginal distributions are unknown, which is almost always the case in practice.
For the second step, it is well known that there are different possible ways to order the variables, and one needs to select the most appropriate way. The only way that guarantees, with no doubt, that the chosen order is the best is by testing all these possible structures, which is infeasible, especially in high-dimensional cases. One existing method of selecting the tree structure is based on the largest values of Kendall’s tau (see, for example, Aas and Berg  and Kim et al. ).
Dißmann et al.  introduced a tree-by-tree selection strategy, or a sequential estimation, using the maximum spanning tree. In their model, at the first tree, the empirical Kendall’s tau is computed for all possible pairs of variables. Then, the tree that maximises the sum of absolute empirical Kendall’s tau is selected. Finally, the pair-copula families are chosen for each pair, and corresponding parameters are estimated. Aas et al.  ordered the variables based on the strongest tail dependencies.
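The first-tree selection described above can be sketched with a pairwise Kendall's tau and a maximum spanning tree. This is an illustrative simplification under stated assumptions: it uses the tau-a estimator without a tie correction, builds the tree with Prim's algorithm, and covers the first tree only; the function names are hypothetical.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Empirical Kendall's tau (tau-a, no tie correction)."""
    n = len(x)
    s = 0
    for i, j in combinations(range(n), 2):
        prod = (x[i] - x[j]) * (y[i] - y[j])
        s += (prod > 0) - (prod < 0)     # +1 concordant, -1 discordant
    return 2.0 * s / (n * (n - 1))

def first_tree(columns):
    """Spanning tree over the variables that maximises the sum of |tau|."""
    d = len(columns)
    w = {(a, b): abs(kendall_tau(columns[a], columns[b]))
         for a, b in combinations(range(d), 2)}
    in_tree, edges = {0}, []
    while len(in_tree) < d:
        # Heaviest edge with exactly one endpoint already in the tree.
        a, b = max(((a, b) for (a, b) in w
                    if (a in in_tree) != (b in in_tree)), key=w.get)
        edges.append((a, b))
        in_tree |= {a, b}
    return edges

cols = [[1, 2, 3, 4, 5], [1, 2, 3, 5, 4], [2, 1, 4, 3, 5]]
print(first_tree(cols))
```

Higher trees would be selected in the same way on the transformed variables, restricted to edges that satisfy the proximity condition.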
Having selected the appropriate order of the variables, identifying the bivariate copula type that best fits the data is the most crucial part of the pair-copula models. In the literature, several methods have been introduced to solve this challenging issue. For example, Aas et al.  used a scatter plot to determine the shape of each copula family. Other methods involve selection criteria, such as the Akaike information criterion (AIC) from Akaike  (see, for example, Dißmann et al. ). After that, the final step of the pair-copula models is estimating the copula parameters.
The finite mixture of the R-vine density model, however, requires estimating not only the pair-copula types but also the number of mixture components. For a $k$-component mixture of R-vine models on $n$ variables, there are $k \times n(n-1)/2$ bivariate copula families to be specified and estimated. These families do not have to be of the same type. Thus, constructing mixture R-vine density models with mixed copula types is challenging. The difficulty lies in specifying the appropriate bivariate copula type for each pair in each density (component). Testing all possible bivariate copula types in each mixture component is infeasible in practice. Although, in the mixture context, a scatter plot rarely gives exact information on the pair-copula types involved (because of the mixture dependencies), it can still indicate the possible types of dependence structure. In addition, Dißmann et al.  noted that the bivariate copula types determined at the first tree have a significant influence on the model fit. From these two points, a method for specifying bivariate copula types in the mixture context is introduced based on the scatter plot method of Aas et al. . Following the structure of Aas et al. , the first step is plotting the original data. Then, based on the information extracted from the scatter plot of each pair, several possible mixture models are constructed (first tree only). After that, the selection criteria are computed to determine the most appropriate model. For this step, three commonly used selection criteria are employed, namely, the Akaike information criterion (AIC) of Akaike ; the Bayesian information criterion (BIC) of Schwarz et al. ; and the consistent Akaike information criterion (CAIC) of Bozdogan . The formulas of these criteria are given as follows:
$$\mathrm{AIC} = -2\,\ell(\hat{\Theta}) + 2P, \qquad \mathrm{BIC} = -2\,\ell(\hat{\Theta}) + P \log N, \qquad \mathrm{CAIC} = -2\,\ell(\hat{\Theta}) + P(\log N + 1),$$
where $\ell(\hat{\Theta})$ is the log likelihood evaluated at $\hat{\Theta}$, the estimated value of the parameters, $P$ is the number of estimated parameters in the model, and $N$ is the number of observations.
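The three criteria are straightforward to compute once a model has been fitted. The sketch below uses the standard definitions of AIC, BIC, and CAIC; the function names and the illustrative log-likelihood values are hypothetical and are not results from the paper.

```python
import math

def selection_criteria(log_lik, n_params, n_obs):
    """AIC, BIC and CAIC for a fitted model (smaller is better)."""
    aic = -2.0 * log_lik + 2.0 * n_params
    bic = -2.0 * log_lik + n_params * math.log(n_obs)
    caic = -2.0 * log_lik + n_params * (math.log(n_obs) + 1.0)
    return {"AIC": aic, "BIC": bic, "CAIC": caic}

def best_model(candidates, n_obs, criterion="BIC"):
    """Name of the candidate with the smallest value of the chosen criterion.

    candidates maps a model name to (log likelihood, number of parameters).
    """
    return min(candidates,
               key=lambda m: selection_criteria(*candidates[m], n_obs)[criterion])

# Made-up values for two candidate first-tree specifications.
candidates = {"model A": (-100.0, 5), "model B": (-98.0, 9)}
print(best_model(candidates, n_obs=300))
```

Because BIC and CAIC penalise each parameter more heavily than AIC once the sample is moderately large, the three criteria can disagree, which is one reason for reporting all of them.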
After that, the remaining trees of the selected model are determined following the same step as Aas et al.’s  method. However, instead of estimating the model parameters sequentially, the model parameters are estimated jointly using the EM algorithm.
To end this section, the main idea of specifying the bivariate copula type in the mixture R-vine density model can be summarised as follows:
(1) Plot the original data. From the plot, construct different possible mixtures of R-vine density models. In this step, only the first tree of each density of each model is constructed.
(2) Fit all the constructed mixtures of R-vine density models to the data.
(3) Compute the values of AIC, BIC, and CAIC. Then, the model with the smallest values is selected as the most appropriate model.
(4) Then, for the second tree and based on the selected families of the first tree of each density of the selected model, the method of Aas et al.  is used to determine the bivariate copula types for each pair.
(5) After that, the model parameters are estimated jointly using the EM algorithm.
(6) For the remaining trees ($T_3, \ldots, T_{n-1}$), Steps 4 and 5 are repeated until the last tree is reached.
The following steps summarise the full inference of the MRDMMF:
(i) Obtain the normalised ranks of the observed data
(ii) Construct different mixtures of R-vine density models with mixed copula families
(iii) Estimate copula parameters using the EM algorithm
(iv) Fit all models to the data, and compute the AIC, BIC, and CAIC
(v) The model with the smallest values of the selection criteria is selected as the best-fit model to the data
Since the primary focus of this paper is to introduce a new mixture model, the types of bivariate copula families used in the simulation studies are prespecified as the most commonly used copula families. In addition, the number of mixture components in both the simulation and the real datasets is fixed at two.
3.3. A Mixture of R-Vine Densities in an Array Representation
This section explains by example the representation array of the mixture of R-vine density model. In this model, each array of the pair-copula families is treated as a mixture component.
Example 1. (example of a two-component mixture of R-vine densities in an array representation).
To store a two-component mixture of R-vine density model of one-parameter copula families in an array representation, one needs five arrays: one array for the tree structure, two arrays for storing the involved bivariate copula types (each array stands for a single mixture component), and another two arrays for their corresponding parameters. In this example, the structure array stands for the R-vine tree structure (numbers refer to the variables), while the family arrays and the parameter arrays hold the mixture components (numbers refer to the bivariate copula) and the corresponding parameters, respectively. Figures 2 and 3 show the construction and the contour plots of the given mixture of R-vine model, respectively (these plots are generated via the contour.RVineMatrix and plot(RVM) functions from the VineCopula package  in R ).
4. Numerical Application
This section aims to illustrate the quality of the model fit of the MRDMMF. The simulation studies were therefore designed with two main scenarios. The first scenario tests the performance of the proposed model in order to evaluate its ability to estimate the true complex multivariate dependency structures. The second scenario contains two main parts. The first part compares the performance of MRDMMF with a mixture of R-vine model in which a single copula family is fitted to all pairs of variables. I termed the latter model the Single-Family Mixture of R-vine Density Model (SFMRDM). This comparison aims to demonstrate the flexibility and the usefulness of MRDMMF over the SFMRDM. The second part aims to study the effect of misidentifying the number of mixture components on the model fit when MRDMMF is fitted to a nonmixture dataset (a similar scenario was applied by Kim et al. ).
For both scenarios, two samples of size 300 and 1500 were simulated using the RVineSim function from the VineCopula package in R (Schepsmeier et al. ). These datasets were repeated 100 times each. The simulated data for the first scenario are based on the mixture of R-vine density model given in Example 1 (the mixture weight is ), while the data for the second scenario are a nonmixture dataset based on the first density of the mixture model.
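Simulating from a finite mixture amounts to first drawing the component label according to the mixture weights and then simulating from that component. The sketch below illustrates this with bivariate components only and is not the RVineSim procedure used in the paper: it assumes a Clayton copula sampled by conditional inversion and its 180-degree rotation as the two components, and the weights and parameter values are arbitrary.

```python
import random

def sample_clayton(theta, rng):
    """One draw (u, v) from a Clayton copula by conditional inversion."""
    u = 1.0 - rng.random()               # value in (0, 1], avoids u == 0
    w = 1.0 - rng.random()
    v = ((w ** (-theta / (1.0 + theta)) - 1.0) * u ** (-theta) + 1.0) ** (-1.0 / theta)
    return u, v

def sample_survival_clayton(theta, rng):
    """One draw from the 180-degree rotated (survival) Clayton copula."""
    u, v = sample_clayton(theta, rng)
    return 1.0 - u, 1.0 - v

def sample_mixture(n, weights, samplers, rng):
    """Draw n points: pick a component by its weight, then sample from it."""
    data, labels = [], []
    for _ in range(n):
        j = rng.choices(range(len(weights)), weights=weights)[0]
        data.append(samplers[j](rng))
        labels.append(j)
    return data, labels

rng = random.Random(1)
samplers = [lambda r: sample_clayton(2.0, r),
            lambda r: sample_survival_clayton(2.0, r)]
data, labels = sample_mixture(300, [0.3, 0.7], samplers, rng)
print(len(data), sum(1 for j in labels if j == 0))
```

Setting one weight to one reproduces the nonmixture case, in which every observation comes from a single component.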
4.1. Simulated Data Application
4.1.1. First Scenario
In this scenario, the two-component MRDMMF model, given in Example 1, is fitted to the simulated datasets. To evaluate its performance, the difference between the true parameter values and the mean of the estimated parameter values over the simulated datasets is computed. Figure 4 shows the scatter and contour pair plots of the simulated dataset (size 300). Table 1 gives a full description of the simulated mixture of the R-vine density model. This description includes the type of the pair-copula family fitted for each pair, its parameters, and the corresponding Kendall’s tau ($\tau$) values. For convenience, the short names of the bivariate copula types are used. Hence, the involved bivariate copula families, with their short names, are Frank (F), Clayton (C), Gumbel (G), and Joe (J). The estimation result is summarised in Table 2.
Comparing the true simulation model (provided in Table 1) and the estimated model (provided in Table 2) shows that the estimated correlation parameters and the corresponding Kendall’s tau values are very close to the true values. Hence, the performance of the EM algorithm is satisfactory, and the underlying dependencies are well captured.
4.1.2. Second Scenario (First Part)
This section aims to demonstrate the improvement in model fit of the MRDMMF over the SFMRDM. For the single-family mixture model, the Frank copula is specified as the pair-copula type for all pairs in the mixture of R-vine densities. After that, all models are fitted to the data, and the values of different selection criteria (AIC, BIC, and CAIC) are computed for each model. Their values are shown in Table 3.
From Table 3, the values of all selection criteria for both datasets support the MRDMMF over the SFMRDM. The results also show a very poor fit of the SFMRDM. Hence, MRDMMF improves the model fit for complex high-dimensional dependence structures.
4.1.3. Second Scenario (Second Part)
The previous section demonstrated the usefulness of the mixture of R-vine model with a mixed choice of pair-copula type in each density over the single-family model. In this part, the effect of misspecifying the number of mixture components is investigated. For this case, the mixture of R-vine model is fitted to the nonmixture simulated datasets. The performance of the model is shown in Tables 4 and 5.
From the results, it is evident that the mixture weight of the first component of the model is very high (almost equal to one), while the weight of the second component is minimal (almost zero). This is expected since the data are nonmixture. For the estimation of the parameters, the results show that the estimated parameters are very close to the true values. Hence, the results support the performance of the proposed model.
4.2. Real Data Application
4.2.1. Real Datasets
This section aims to illustrate the performance of the proposed model with real datasets. For (only) an evaluation test, I applied the newly proposed mixture of R-vine density model (MRDMMF) to two different datasets, namely, Newthyroid and Glass datasets, from the  repository. These datasets consist of 215 and 214 observations, respectively. Figure 5 shows the scatter pair plots of both original data.
Following the steps of the mixture of R-vine model, the observations of both datasets are converted into pseudo-observations in order to obtain the copula data. For this step, the pobs function from the R package copula (Hofert et al. ; Yan ; Kojadinovic and Yan ; Hofert and Mächler ) is used. After that, the most appropriate order of the variables is selected based on the highest absolute values of Kendall’s tau. The tree structures of both datasets are shown in Figures 6 and 7.
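The two preprocessing steps above can be sketched as follows. This is an illustrative Python sketch, not the R code used in the study: pseudo-observations are taken to be ranks divided by n + 1 (with average ranks for ties), and Kendall's tau is computed in its simple tau-a form without a tie correction. The function names and the small data vectors are hypothetical.

```python
def pseudo_observations(column):
    """Map one variable to (0, 1) via rank / (n + 1), using average ranks for ties."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied values.
        while j + 1 < n and column[order[j + 1]] == column[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return [r / (n + 1) for r in ranks]


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            score += (s > 0) - (s < 0)
    return score / (n * (n - 1) / 2)


# Hypothetical toy data for two variables.
x = [2.0, 5.0, 1.0, 4.0]
y = [10.0, 40.0, 20.0, 30.0]
u = pseudo_observations(x)
v = pseudo_observations(y)
tau_raw = kendall_tau(x, y)
tau_copula = kendall_tau(u, v)
```

Because Kendall's tau depends only on ranks, it takes the same value on the original data and on the pseudo-observations, which is why the variable order can be chosen on either scale.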
After defining the R-vine tree structure and following the R-vine representation array (see Dißmann ), one can define the R-vine matrix corresponding to the selected R-vine structure. Note that the R-vine matrix is not unique, as the same R-vine structure can be represented by different R-vine matrices (Dißmann et al. ). The RVineStructureSelect function from the VineCopula package (Schepsmeier et al. ) was used to find the structure of the Glass dataset (to save effort and time) using Kendall’s tau as the edge weight, while the structure of the Newthyroid dataset was built manually based on the highest values of Kendall’s tau.
Having determined the order of the variables, the next step is to construct different candidate mixtures of R-vine density models (first tree only) based on the information available from the scatter plots of the original data.
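As an illustration of selecting a tree from pairwise Kendall's tau values, the sketch below builds a maximum spanning tree over absolute tau values with Prim's algorithm. This is a common way to choose the first tree of an R-vine and is offered here only as an assumption about the general idea; it is not the internal code of RVineStructureSelect, and the matrix values and function name are hypothetical.

```python
def first_tree_edges(abs_tau):
    """Prim's algorithm for a maximum spanning tree on a symmetric weight matrix."""
    d = len(abs_tau)
    in_tree = {0}
    edges = []
    while len(in_tree) < d:
        # Pick the heaviest edge that joins the current tree to a new variable.
        best = max(
            ((i, j) for i in in_tree for j in range(d) if j not in in_tree),
            key=lambda edge: abs_tau[edge[0]][edge[1]],
        )
        edges.append(best)
        in_tree.add(best[1])
    return edges


# Hypothetical |tau| matrix for four variables (illustrative values only).
abs_tau = [
    [1.00, 0.60, 0.10, 0.20],
    [0.60, 1.00, 0.50, 0.30],
    [0.10, 0.50, 1.00, 0.40],
    [0.20, 0.30, 0.40, 1.00],
]
tree = first_tree_edges(abs_tau)
total_weight = sum(abs_tau[i][j] for i, j in tree)
```

Here the selected edges link variables 0-1, 1-2, and 2-3, the three strongest dependencies that still form a tree. Higher trees would be built in the same way on the conditional pairs, subject to the proximity condition, which this sketch does not cover.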
Figure 8 shows contour and scatter pair plots of the real datasets. For the first dataset, some of the bivariate dependencies reflect, almost but not perfectly (due to hidden mixture dependencies), one type of copula family. For example, the dependency structure of one of the pairs shows upper tail dependence. For the second dataset, the situation is less clear, but some information can still be extracted. For example, the plot shows that some pairs of variables exhibit tail dependencies. Again, this does not identify the exact type of the appropriate bivariate copula family. However, extending the number of components of the model may improve the fit, since additional components give extra flexibility to fit several bivariate copula types.
Following the selection strategy given in Section 3, the mixture of R-vine model is constructed with mixed copula types. For comparison, the four most commonly used copula types, namely, the Frank, Gaussian, Gumbel, and Joe copulas, are fitted to the real datasets as single-family mixture of R-vine density models. Then, for each dataset, the best-fitting model is selected based on the smallest values of the selection criteria. The results are given in Table 6.
Table 6 shows the values of the selection criteria and the estimated mixture weights. Note that several starting values were used for the EM algorithm (the first step), and the starting value that returned the largest log-likelihood was chosen. The estimates from the chosen model were then used as starting values for the same model (the second step) in order to check for any further improvement. Two observations were made in this case. First, the second step needed fewer than half as many iterations as the first. Second, the log-likelihood values barely increased.
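The two-step starting-value strategy can be sketched on a toy problem. The example below swaps in a much simpler model, a two-component univariate Gaussian mixture with unit variances, in place of the mixture of R-vine densities; all function names, the simulated data, the number of starts, and the tolerance are assumptions made for illustration only.

```python
import math
import random


def component_terms(x, weight, mu1, mu2):
    """Weighted unit-variance normal kernels of the two components at x."""
    p1 = weight * math.exp(-0.5 * (x - mu1) ** 2)
    p2 = (1.0 - weight) * math.exp(-0.5 * (x - mu2) ** 2)
    return p1, p2


def log_likelihood(data, params):
    const = 0.5 * math.log(2.0 * math.pi)
    return sum(math.log(sum(component_terms(x, *params))) - const for x in data)


def run_em(data, start, tol=1e-8, max_iter=500):
    """Return (estimates, log-likelihood, iterations used)."""
    params = start
    current = log_likelihood(data, params)
    iterations = 0
    for iterations in range(1, max_iter + 1):
        # E-step: posterior probability that each point belongs to component 1.
        resp = []
        for x in data:
            p1, p2 = component_terms(x, *params)
            resp.append(p1 / (p1 + p2))
        # M-step: update the mixture weight and the two means.
        total = sum(resp)
        n = len(data)
        params = (
            total / n,
            sum(r * x for r, x in zip(resp, data)) / total,
            sum((1.0 - r) * x for r, x in zip(resp, data)) / (n - total),
        )
        updated = log_likelihood(data, params)
        improvement = updated - current
        current = updated
        if improvement < tol:
            break
    return params, current, iterations


random.seed(1)
data = [random.gauss(-2.0, 1.0) for _ in range(100)]
data += [random.gauss(2.0, 1.0) for _ in range(100)]

# First step: several random starting values; keep the largest log-likelihood.
starts = [
    (random.uniform(0.2, 0.8), random.uniform(-3.0, 3.0), random.uniform(-3.0, 3.0))
    for _ in range(5)
]
first_params, first_ll, first_iters = max(
    (run_em(data, s) for s in starts), key=lambda result: result[1]
)

# Second step: restart from the chosen estimates to check for further improvement.
second_params, second_ll, second_iters = run_em(data, first_params)
```

Since each EM update cannot decrease the log-likelihood, the second step can only confirm or slightly improve the first-step solution; a large gain at this stage would suggest that the first step had stopped too early.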
From Table 6, all the selection criteria select the MRDMMF for both datasets (the values are shown in bold text). The SFMRDM fits were all poor, with SFMRDM (Gumbel) and SFMRDM (Joe) the worst. In these two models, almost half of the pairs of variables are effectively assumed to be independent, because the estimated copula parameters for these pairs were either 1 (which corresponds to independence for these two families) or 0 (outside the parameter range). This provides evidence that a mixture based on one type of copula family is unable to capture highly complex mixture dependency structures that may vary from one pair of variables to another. This comparison demonstrates the improved performance of the mixture of R-vine density model with mixed bivariate copula types over the single-family mixture of R-vine density model.
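The remark that a parameter value of 1 means independence for the Gumbel and Joe families can be checked numerically. The sketch below uses the standard closed forms of the two bivariate copulas; the function names and the evaluation point are chosen only for illustration.

```python
import math


def gumbel_cdf(u, v, theta):
    """Gumbel copula C(u, v), defined for theta >= 1."""
    a = (-math.log(u)) ** theta + (-math.log(v)) ** theta
    return math.exp(-(a ** (1.0 / theta)))


def joe_cdf(u, v, theta):
    """Joe copula C(u, v), defined for theta >= 1."""
    a = (1.0 - u) ** theta
    b = (1.0 - v) ** theta
    return 1.0 - (a + b - a * b) ** (1.0 / theta)


u, v = 0.3, 0.7
independence = u * v  # the independence copula is C(u, v) = u * v

# At theta = 1 both families reduce to the independence copula.
gumbel_gap = abs(gumbel_cdf(u, v, 1.0) - independence)
joe_gap = abs(joe_cdf(u, v, 1.0) - independence)
```

Both gaps are zero up to floating-point error, whereas for theta greater than 1 each copula lies above the independence copula, reflecting positive dependence. A value of 0 lies outside the admissible range of either family.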
In this paper, I have introduced a new mixture of R-vine density model with a mixed choice of pair-copula families. This new method provides much more flexibility for modelling complex mixture dependency structures among variables in high-dimensional cases. As with general pair-copula models, the most challenging part of the proposed model is specifying the bivariate copula types. For this, the tree-by-tree scatter plot method was used to construct the mixture components of the model. To illustrate the model performance, the proposed model was fitted to simulated and real datasets. The simulation studies showed good estimation of the true model parameters. In addition, in both the simulation and the real application studies, the proposed model clearly outperformed the single-family mixture of R-vine density model. In this study, the number of mixture components is fixed, which limits the number and the types of copula families that can be used. Hence, estimating the number of mixture components is expected to improve the model fit. This will be done in future work.
All the real data are publicly available through an online database.
This work was conducted at the School of Mathematical Sciences, Queensland University of Technology, Brisbane, Australia.
Conflicts of Interest
The author declares that there are no conflicts of interest.
The author acknowledges Dr. Christopher Drovandi and D.Prof. Kerrie Mengersen for their supervision of this research.
A. Sklar, “Fonctions de répartition à n dimensions et leurs marges,” Publications de l’Institut de Statistique de l’Université de Paris, vol. 8, pp. 229–231, 1959.
D. Gunawan, M.-N. Tran, K. Suzuki, J. Dick, and R. Kohn, “Computationally efficient Bayesian estimation of high dimensional copulas with discrete and mixed margins,” 2016.
M. A. Khaled and R. Kohn, “The approximation properties of copulas by mixtures,” 2017.
M. Sun, I. Konstantelos, and G. Strbac, “C-vine copula mixture model for clustering of residential electrical load pattern data,” IEEE Transactions on Power Systems, vol. 23, 2016.
O. O. Evkaya, C. Yozgatlıgil, and A. S. Selcuk-Kestel, “CD-vine model for capturing complex dependence,” Journal of Applied Statistics, vol. 10, pp. 1–15, 2020.
O. Morales Napoles, R. M. Cooke, and D. Kurowicka, “About the number of vines and regular vines on n nodes,” 2010.
L. Gruber and C. Czado, “Bayesian model selection of regular vine copulas,” 2015.
R. B. Nelsen, An Introduction to Copulas, Springer, New York, NY, USA, 2nd edition, 2006.
B. Schweizer and A. Sklar, Probabilistic Metric Spaces, Courier Corporation, London, UK, 2011.
H. Joe, Dependence Modeling with Copulas, CRC Press, London, UK, 2014.
B. Schweizer and E. F. Wolff, “On nonparametric measures of dependence for random variables,” 1981.
U. Cherubini, E. Luciano, and W. Vecchiato, Copula Methods in Finance, John Wiley & Sons, London, UK, 2004.
J. Dißmann, Statistical Inference for Regular Vines and Application, Technische Universität München, Munich, Germany, 2010.
D. Kurowicka and R. M. Cooke, Uncertainty Analysis with High Dimensional Dependence Modelling, John Wiley & Sons, London, UK, 2006.
I. H. Haff, K. Aas, and A. Frigessi, “On the simplified pair-copula construction-simply useful or too simplistic?” Journal of Multivariate Analysis, vol. 101, no. 5, pp. 1296–1310, 2010.
H. Joe, Multivariate Models and Dependence Concepts, Chapman & Hall, New York, NY, USA, 1997.
H. Akaike, “Information theory and an extension of the likelihood ratio principle,” 1973.
U. Schepsmeier, J. Stoeber, E. C. Brechmann, B. Graeler, T. Nagler, and T. Erhardt, “VineCopula: Statistical Inference of Vine Copulas,” 2017.
R Development Core Team, “R: A Language and Environment for Statistical Computing,” 2008.
J. Alcalá-Fdez, A. Fernández, J. Luengo et al., “KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework,” Journal of Multiple-Valued Logic & Soft Computing, vol. 17, pp. 255–287, 2011.
M. Hofert, I. Kojadinovic, M. Maechler, and J. Yan, “copula: Multivariate Dependence with Copulas,” 2017.