Abstract

This paper proposes a panel data clustering model based on D-vine and C-vine copulas and provides semiparametric estimation of its parameters. The estimation methods include the two-step inference function for margins, two-step semiparametric estimation, and stepwise semiparametric estimation. For similarity measurement, similarity coefficients are constructed from a multivariate Hierarchical Nested Archimedean Copula (HNAC) model and from two compound Pair-Copula Construction (PCC) models: the HNAC and D-vine compound model and the HNAC and C-vine compound model. Estimation procedures and model evaluation are given for these models. In the case study, the clustering results of the two compound models are reported, and the effect of different copula families on the clustering results is discussed. The results show that the models are effective and useful.

1. Introduction

When clustering panel data, it is difficult to use the spatial correlation and the cross-correlation along the time dimension simultaneously, and there is very little literature on the problem. This paper discusses how to build a multivariate copula model for panel data analysis that reflects the hierarchical relationship between time and index, in order to support comprehensive evaluation and dynamic clustering.

1.1. Literature Review

In the traditional data mining literature, clustering methods are divided into five groups: partition-based, hierarchy-based, density-based, grid-based, and model-based methods. Frey et al. [1] also proposed an affinity propagation algorithm based on cluster centers.

Currently, hierarchical clustering methods are used for panel data analysis: Ren et al. [2], Juarez [3], Nie [4], Zhu et al. [5], Zheng et al. [6], Li et al. [7], Yang et al. [8], Zheng et al. [9], and Xie et al. [10].

Some authors have also studied model-based methods. For example, Bartolucci [11] analyzed a binary panel data model; Zheng et al. [9] applied traditional clustering methods to panel data by constructing a panel data matrix. Ren et al. [2] proposed a clustering method for multi-index panel data by redefining the Ward function using the extended Frobenius norm. Juárez [12] proposed a non-Gaussian model-based clustering method for panel data built on the skew t distribution, which assigns objects to clusters according to their dynamic behavior, equilibrium level, and the effect of covariates in the regression. Nie [4] proposed a new clustering measure in which the weights can be set flexibly: the more attention the user pays to the most recent data, the more weight those data receive.

Longitudinal (vertical) data, a special case of panel data, have also been discussed by several authors. For example, Chiou et al. [13] proposed a cluster analysis of longitudinal data by establishing a nonparametric random-effects model based on the Karhunen-Loeve expansion and a nonparametric iterative mean-variance model.

De la Cruz-Mesia [14] proposed a model-based clustering method for individuals measured over a period of time. Taking the change in the data as the main feature, the method extends a nonlinear hierarchical model to a mixture of hierarchical models. MCMC sampling is used to explore the posterior distribution of the target, and maximum likelihood estimation of the model with the EM algorithm is also discussed.

Nielsen et al. [15] assumed that the count data follow a nonhomogeneous Poisson process whose intensity is a time-homogeneous, adaptive spline function. The smoothness and covariance of the spline function change over time, and the spline is also used to control the shape of the curve. On this basis, an adaptive semiparametric clustering method for longitudinal count data was proposed. Jackknife, bootstrap, and pseudo-quasi methods were used for parameter estimation.

Shaikh et al. [16] applied model-based clustering to longitudinal (vertical) data with missing values. They used a mixture of normal distributions, decomposed the covariance structure with a modified Cholesky decomposition, and then estimated the model parameters with the EM algorithm.

Yang et al. [17] proposed a panel data clustering analysis based on density. Xie et al. [18] proposed panel data clustering with affinity propagation and gave an agriculture risk regionalization analysis case. Guan et al. [19] gave MRI data analysis of affinity propagation clustering based on similarity matrix reduction.

The longitudinal (vertical) data clustering literature above is model-based: distributional assumptions and mixture models are applied to the cluster analysis. The merits of this approach are that it makes full use of the information and features of the longitudinal data and that it rests on rigorous assumptions and statistical inference. Its demerit is that, because specific assumptions are made about the type of data, such clustering methods are not widely applicable.

The literature above extracts some features of the panel data to calculate a similarity coefficient or distance, but it does not use the overall characteristics of the panel data when computing the similarity measure. Clustering of panel data therefore calls for a study of how to measure its characteristics comprehensively and for effective, feasible methods. This paper proposes a method that considers the multiple indexes of panel data and presents a similarity coefficient, together with clustering based on density and affinity (near-neighbor) propagation, which were not used in the above literature. Building on model-based clustering, this paper also presents a composite PCC clustering method that exploits the flexibility of the copula approach (Genest et al. [20], Joe [21], and Bedford et al. [22, 23]).

1.2. The Innovation and Structure of the Study

According to the literature, the purposes of panel data clustering mainly include the following: to classify individuals into clusters or classes; to classify indicators into clusters or classes; to find outliers or noise; and to classify individuals by their shape characteristics, numerical characteristics, surface features, or other criteria. Different clustering purposes therefore call for different clustering methods. For example, the purpose may be to capture the overall characteristics of the data, to reflect the hierarchical structure of the indicators, or to reflect the dynamic development of the data. Related state-of-the-art methods model the correlations in the data with a few indicators, and such metrics are too simple to extract the complex dependency structures hidden in panel data. It is therefore necessary to propose a clustering method that can use comprehensive information and adapt to different clustering purposes. The copula method is a good approach.

This paper discusses Pair-Copula Construction (PCC) through three types of models: the HNAC model, the composite model of D-vine and HNAC, and the composite model of C-vine and HNAC. The third part of the paper discusses the statistical inference of composite PCC panel data clustering.

A general expression of the composite PCC method is presented first, followed by the estimation of its parameters, including the maximum likelihood method, the two-stage inference function for margins, semiparametric two-stage estimation, and semiparametric stepwise estimation; the goodness of fit and its test are then discussed. The fourth part applies the compound PCC method to panel data clustering and analyzes the clustering results. The final part summarizes the paper.

2. Model Building

2.1. The Construction of Multiple Copula

The construction of multivariate copulas has mainly relied on the Exchangeable Archimedean Copula (EAC) and the Nested Archimedean Copula (NAC) (Joe [21]).

Joe [21] first proposed the PCC structure, a cascade (waterfall) structure similar to EAC and NAC. The difference is that PCC decomposes a complex multivariate joint probability density into relatively simple bivariate copulas and marginal probability densities; the bivariate copulas are not limited to Archimedean copulas but may come from any copula family, and several families may even be mixed. Bedford and Cooke [22, 23] provided a likelihood estimation method for PCC. Kurowicka et al. [24] modeled linear dependence by using partial correlation coefficients to determine the correlation matrix. Aas et al. [25] proposed a maximum pseudo-likelihood estimation method.
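As a small worked illustration of the decomposition described above, a three-variable density can be factored into three margins and three bivariate copula densities. The ordering of the variables and the simplifying assumption that the conditional copula $c_{13|2}$ does not depend on the value of $x_2$ are choices made for this sketch, not statements about the models used later in the paper.

```latex
f(x_1, x_2, x_3)
  = f_1(x_1)\, f_2(x_2)\, f_3(x_3)
    \cdot c_{12}\bigl(F_1(x_1), F_2(x_2)\bigr)
    \cdot c_{23}\bigl(F_2(x_2), F_3(x_3)\bigr)
    \cdot c_{13|2}\bigl(F(x_1 \mid x_2), F(x_3 \mid x_2)\bigr)
```

Each of the three pair-copulas may be taken from a different family, which is the source of the flexibility discussed in this section.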

Berg and Aas [25] compared EAC, HNAC, and PCC in terms of the maximum number of copulas, the parameter constraints, and the choice of copula family. Their conclusions are as follows: EAC and NAC can only use Archimedean copulas, their parameters must satisfy constraint conditions, and their scope of application is therefore restricted. The main advantage of PCC is that it is more flexible than NAC and is not restricted in the copulas it can use, whereas the degrees of freedom of HNAC decrease as the number of nesting levels increases; therefore, PCC fits real data better than HNAC. Berg et al. [26] also compared the computational efficiency of HNAC and PCC and found PCC to be more efficient than HNAC. See Table 1.

Aas et al. [25] pointed out an obvious deficiency of the PCC method: its structure is not unique. Bedford and Cooke [22, 23] proposed a solution, a graphical model called the Regular Vine (R-vine), but the structure of the R-vine is still complex. Kurowicka et al. [24] presented two special forms of R-vine: Canonical Vines (C-vine) and Drawable Vines (D-vine). Aas et al. [25] first presented the application of vine copulas to financial data; the theory of vine copulas was further developed by Kurowicka et al. [24], Haff et al. [27], and Czado [28].

The R-vine method addresses the nonuniqueness of the PCC structure; however, an R-vine itself admits many structures. C-vine and D-vine are two special forms of R-vine. Aas et al. [25] pointed out that, in dimension $d$, there are $d!/2$ distinct C-vine structures and $d!/2$ distinct D-vine structures, and they gave the C-vine and D-vine structures and the joint probability density functions for four variables. Bedford et al. [23] proposed an R-vine structure for five variables. Yang et al. [8] proposed the general formula of the probability density function for the general-dimensional case.

Brechmann et al. [29] analyzed the method of pruning at the top of the tree and discussed how to choose the independence copula. Smith et al. [30] chose between the independence copula and a given copula function using MCMC and model indicators. Yang et al. [8] proposed a sequential tree-by-tree method to select the copula function.

2.2. The Dependency Structure of the Composite PCC Metric Panel Data

Pair-copulas are suitable for cross-sectional data; see Joe [21], Smith et al. [30], and Sun et al. [31]. Pair-copulas have been used for time series but rarely for panel data and longitudinal (vertical) data.

To reflect the dependency structure between the indexes of panel data, this paper builds a composite PCC method. The basic idea is that the upper (outer) level uses an HNAC structure and a PCC structure is nested in the lower (inner) level. This gives two models: the composite model of D-vine and HNAC and the composite model of C-vine and HNAC.

2.2.1. Notation

In this paper, the CDF of a copula is denoted by $C(\cdot)$ and its pdf by $c(\cdot)$; $F(\cdot)$ is the CDF of a joint or marginal distribution, and $f(\cdot)$ is the corresponding pdf. The panel data are organized by unit (for example, Beijing) and, within each unit, by index (1 = health index, 2 = education index, 3 = life index, and 4 = social index); these indexes form the common set of indicators (dimensions) of the dependency structure. Each index of each unit contains a series of observations over time. For example, Beijing’s education index contains one observation per year from 2004 to 2012, from (Beijing education 2004) to (Beijing education 2012).
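The data layout described in this subsection can be sketched as follows. This is only an illustration under stated assumptions: the container `panel`, the helper `pseudo_observations`, the two-unit subset, and the synthetic Gaussian values are all hypothetical and stand in for the real index values. The rank transform shown is the usual nonparametric way of mapping a series to the unit interval before a copula is fitted.

```python
import random

def pseudo_observations(series):
    """Map a series to (0, 1) by rescaled ranks: rank / (n + 1)."""
    n = len(series)
    order = sorted(range(n), key=lambda t: series[t])
    ranks = [0] * n
    for r, t in enumerate(order, start=1):
        ranks[t] = r
    return [r / (n + 1) for r in ranks]

# Hypothetical layout: one time series per (unit, index) pair.
units = ["Beijing", "Shanghai"]                      # illustrative subset only
indexes = ["health", "education", "life", "social"]  # the four indexes named in the text
years = list(range(2004, 2013))                      # annual observations, 2004 to 2012

rng = random.Random(0)
# Synthetic placeholder values, NOT real index data.
panel = {(u, k): [rng.gauss(0.0, 1.0) for _ in years] for u in units for k in indexes}

u_beijing_edu = pseudo_observations(panel[("Beijing", "education")])
```

Each series in `panel` has one value per year, and `u_beijing_edu` holds the corresponding pseudo-observations for one unit and one index.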

2.2.2. Using D-Vine and HNAC to Measure Dependency Structure

In the D-vine form with HNAC nesting, the D-vine structure is adopted within each subblock, as shown in Figure 1.

The HNAC structure forms the superstructure above the D-vine subblocks, as shown in Figure 2.

2.2.3. Using C-Vine and HNAC to Measure Dependency Structure

In the C-vine form with HNAC nesting, the C-vine structure is adopted within each subblock, as shown in Figure 3.

The HNAC structure is used in the superstructure of C-vine, which is structured by and as shown in Figure 4.

3. Statistical Inference

3.1. Semiparametric Gradual Estimation

When the PCC has many parameters, the estimation methods discussed previously are less efficient. A more efficient estimation method is discussed next.

Assuming that represents the parameter of the marginal distribution; represents the parameter of the PCC copula; are the dependent parameters of HNAC. The joint distribution can be expressed aswhere is the pdf for united distributions as mentioned before. The probability density function of is satisfied:

Defining , , . is the cardinality. And is the data set of all parameters from to i

For D-vine, the joint probability density function can be expressed as
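For reference, the standard $d$-dimensional D-vine density, in the form given by Aas et al. [25], can be written as below, together with the corresponding C-vine density. The index conventions ($k$ for margins, $j$ for the tree level, $i$ for the edge within a tree) are those of that reference and are an assumption here; they may differ from the notation used in the rest of this section.

```latex
% D-vine
f(x_1, \dots, x_d)
  = \prod_{k=1}^{d} f_k(x_k)
    \prod_{j=1}^{d-1} \prod_{i=1}^{d-j}
    c_{i,\,i+j \mid i+1, \dots, i+j-1}
    \bigl( F(x_i \mid x_{i+1}, \dots, x_{i+j-1}),\;
           F(x_{i+j} \mid x_{i+1}, \dots, x_{i+j-1}) \bigr)

% C-vine
f(x_1, \dots, x_d)
  = \prod_{k=1}^{d} f_k(x_k)
    \prod_{j=1}^{d-1} \prod_{i=1}^{d-j}
    c_{j,\,j+i \mid 1, \dots, j-1}
    \bigl( F(x_j \mid x_1, \dots, x_{j-1}),\;
           F(x_{j+i} \mid x_1, \dots, x_{j-1}) \bigr)
```

In both cases the product over $j$ runs over the $d-1$ trees, which is what makes a level-by-level estimation of the dependence parameters possible.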

In the following analysis, the paper assumes that the formulas for the D-vine have analogues for the C-vine and other vines. In semiparametric two-stage estimation, the marginal distributions in the log-likelihood function are replaced with nonparametric estimates. Semiparametric gradual estimation follows a similar idea and estimates the PCC parameters level by level. The log likelihood for a given sample can be expressed as

The pseudo-likelihood function of the dependent part can be written as

By gradually substituting parameters, normal functions can be constructed and solved . The normal equations arewhere .

is plugged into the likelihood function, estimating . To be specific, the steps of gradual estimation are as follows.

Step 1. Estimate , the first level of dependency parameters.

Step 2. Plug into (5). Then obtain , the dependent parameters of the second level, by estimating the maximum likelihood method.

Step 3. Take parameters and in (5), and estimate the dependent parameters .

Step 4. Repeat the above steps, estimating by .

Step 5. Plug into the likelihood function, and estimate and all dependent parameters.
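The level-by-level idea in the steps above can be sketched for a three-variable D-vine. This is a minimal illustration under several assumptions that are not taken from the paper: the pair-copulas are Clayton (chosen only because its density and h-function have simple closed forms, whereas the empirical part uses Gumbel), the margins are replaced by rescaled ranks, the data are synthetic, and the one-dimensional maximization uses a golden-section search that presumes a unimodal pseudo-likelihood. All function names are hypothetical.

```python
import math
import random

def clayton_logpdf(u, v, theta):
    """Log-density of the bivariate Clayton copula (theta > 0)."""
    s = u ** (-theta) + v ** (-theta) - 1.0
    return (math.log1p(theta) - (1.0 + theta) * (math.log(u) + math.log(v))
            - (2.0 + 1.0 / theta) * math.log(s))

def clayton_h(u, v, theta):
    """Conditional CDF F(u | v) under a Clayton copula (the h-function)."""
    s = u ** (-theta) + v ** (-theta) - 1.0
    h = v ** (-theta - 1.0) * s ** (-1.0 - 1.0 / theta)
    return min(max(h, 1e-10), 1.0 - 1e-10)  # keep strictly inside (0, 1)

def fit_theta(us, vs, lo=0.05, hi=20.0, iters=60):
    """Maximize the pseudo-log-likelihood in theta by golden-section search."""
    def ll(theta):
        return sum(clayton_logpdf(u, v, theta) for u, v in zip(us, vs))
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iters):
        c, d = b - g * (b - a), a + g * (b - a)
        if ll(c) > ll(d):
            b = d
        else:
            a = c
    return (a + b) / 2.0

def pseudo_obs(xs):
    """Nonparametric margins: rescaled ranks rank / (n + 1)."""
    n = len(xs)
    order = sorted(range(n), key=lambda t: xs[t])
    out = [0.0] * n
    for r, t in enumerate(order, start=1):
        out[t] = r / (n + 1)
    return out

def sample_clayton_given(v, theta, rng):
    """Draw u such that (u, v) follows Clayton(theta), by inverting the h-function."""
    w = 1.0 - rng.random()  # uniform on (0, 1]
    return ((w ** (-theta / (1.0 + theta)) - 1.0) * v ** (-theta) + 1.0) ** (-1.0 / theta)

# Synthetic data: x1 and x3 each depend on x2 (true theta = 2) and are
# conditionally independent given x2.
rng = random.Random(42)
x2 = [1.0 - rng.random() for _ in range(500)]
x1 = [sample_clayton_given(v, 2.0, rng) for v in x2]
x3 = [sample_clayton_given(v, 2.0, rng) for v in x2]
u1, u2, u3 = pseudo_obs(x1), pseudo_obs(x2), pseudo_obs(x3)

# Level 1: estimate the first-tree dependence parameters.
theta12 = fit_theta(u1, u2)
theta23 = fit_theta(u2, u3)

# Level 2: plug in the level-1 estimates, form conditional pseudo-observations,
# and estimate the second-tree parameter.
u1_given_2 = [clayton_h(a, b, theta12) for a, b in zip(u1, u2)]
u3_given_2 = [clayton_h(c, b, theta23) for c, b in zip(u3, u2)]
theta13_2 = fit_theta(u1_given_2, u3_given_2)
```

With more variables the same pattern repeats tree by tree. The stepwise estimates `theta12`, `theta23`, and `theta13_2` would then serve as starting values for a joint maximization of the full likelihood, which the sketch does not carry out.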

Haff [29] proved that the semiparametric gradual estimator is consistent, asymptotically normal, and robust. In this paper, the good properties of the estimators are as follows:

is the consistent estimation of;

has asymptotic normality:where

are robustness estimates.

In practical analysis, this paper first carries out semiparametric gradual estimation, then uses the resulting parameter estimates as initial values for maximum likelihood estimation, and finally obtains the parameter estimator.

3.2. Model Evaluation

Parameter estimation methods have been proposed above. For the same data set, the results obtained with different marginal distributions or different copulas can differ greatly. The problem is therefore how to compare the fit of different marginal distributions and different copulas. This section discusses model evaluation and goodness-of-fit testing. A Hit test is discussed first; it not only compares figures and numerical values but also provides a formal test. Traditional information criteria are then used to compare the fit of different combinations intuitively, and the goodness-of-fit test is discussed afterwards.

3.2.1. Hit Test

When multiple marginal distributions or different copulas are used, the fitting test differs from the traditional goodness-of-fit test. The original Hit test method is adopted in this paper. First, divide the interval into known regions, as shown in Figure 5. In contrast to the Hit test of Pattern [32], the paper adopts the overall Hit test for the shaded part of Figure 5; different axes represent different marginal distribution functions. For the two-dimensional case, this gives the chessboard form shown in Figure 5. For the three-dimensional case, a cube is obtained, and higher dimensions correspond to a hypercube.

The specific steps are as follows.

Step 1. Divide the empirical data by the above critical values, so that each data point falls in exactly one subtable (a subcube or a hypercube). Count the number of empirical data points in each subtable (a subcube or a hypercube).

Step 2. Estimate the marginal distribution parameters with the method of the previous section and, according to these parameters, count the number of fitted data points falling in each subtable (a subcube or a hypercube).

Step 3. Construct the Hit test statistic from the number of empirical data points and the number of fitted data points, where N is the total number of sample observations and k is the number of estimated parameters.
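The three steps above can be sketched as a cell-counting routine followed by a comparison of the two sets of counts. The form of the statistic used here, a Pearson-type sum of squared differences scaled by the fitted counts, is an assumption made for illustration and should not be read as the exact statistic of this paper; the function names and the equal-width grid with `m` cells per axis are likewise hypothetical.

```python
def cell_counts(points, m):
    """Count points of the unit hypercube that fall in each cell of an m-per-axis grid."""
    counts = {}
    for p in points:
        # Map each coordinate to a cell index 0..m-1 (a coordinate of 1.0 goes to the last cell).
        cell = tuple(min(int(x * m), m - 1) for x in p)
        counts[cell] = counts.get(cell, 0) + 1
    return counts

def hit_statistic(empirical, fitted, m):
    """Pearson-type comparison of empirical and fitted cell counts (assumed form)."""
    a = cell_counts(empirical, m)   # Step 1: empirical counts per cell
    b = cell_counts(fitted, m)      # Step 2: counts of points simulated from the fitted model
    scale = len(empirical) / len(fitted)
    stat = 0.0
    for cell, b_count in b.items():
        expected = b_count * scale
        stat += (a.get(cell, 0) - expected) ** 2 / expected
    # Cells with no fitted points are skipped to avoid division by zero.
    return stat

# Tiny two-dimensional example on a 2 x 2 chessboard.
empirical_points = [(0.1, 0.1), (0.9, 0.9), (0.95, 0.99), (0.2, 0.8)]
fitted_points = [(0.2, 0.3), (0.8, 0.7), (0.6, 0.9), (0.4, 0.6)]
example_counts = cell_counts(empirical_points, 2)
example_stat = hit_statistic(empirical_points, fitted_points, 2)
```

The same functions work unchanged in three or more dimensions, since each point is simply a tuple of coordinates in the unit interval.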

3.2.2. Model Evaluation

Traditional model evaluation indicators include information criteria and entropy criteria. Only information criteria are discussed in this paper; other criteria can be calculated from them.

The corrected AIC (abbreviated as AICC) can be used to balance the goodness of fit against the number of parameters. The paper also uses the Hannan-Quinn information criterion (abbreviated as HQC). For maximum likelihood estimation, HQC is calculated as follows:

For semiparametric gradual estimation, HQC is calculated as

The information criteria constructed above are only numerical values and cannot themselves be tested, which is a significant disadvantage. In practice, a goodness-of-fit test is more effective.
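For orientation, the textbook forms of the criteria named in this subsection can be sketched as follows. The function names are hypothetical, `loglik` stands for the maximized log-likelihood, `k` for the number of estimated parameters, and `n` for the sample size. Applying the same formula to a maximized pseudo-log-likelihood in the semiparametric case is one common convention and is an assumption here, not necessarily the exact definition used in this paper.

```python
import math

def aic(loglik, k):
    """Akaike information criterion: -2 log L + 2k."""
    return -2.0 * loglik + 2.0 * k

def aicc(loglik, k, n):
    """Small-sample corrected AIC: AIC + 2k(k + 1) / (n - k - 1)."""
    return aic(loglik, k) + 2.0 * k * (k + 1) / (n - k - 1)

def hqc(loglik, k, n):
    """Hannan-Quinn criterion: -2 log L + 2k log(log n)."""
    return -2.0 * loglik + 2.0 * k * math.log(math.log(n))

# Illustrative numbers only: log-likelihood -100 with 3 parameters and 104 observations.
example_aic = aic(-100.0, 3)
example_aicc = aicc(-100.0, 3, 104)
example_hqc = hqc(-100.0, 3, 104)
```

For all three criteria a smaller value indicates a better trade-off between fit and the number of parameters.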

4. The Empirical Analysis

4.1. Introduction to the Panel Data Clustering

The China Development Index (RCDI) is compiled by the China Survey and Data Center of Renmin University of China. The index is composed of four indices (health, education, economy, and social environment) and 15 subindexes. It has comprehensively measured the development of the country and its 31 regions since 2004. Based on the similarity matrix of the composite pair-copula, a suitable clustering algorithm is selected, and the rationality of the clustering algorithm is tested and compared.

According to the similarity matrix, the similarity coefficient is transformed into , and a corresponding clustering algorithm is selected, such as Ward's method, a hierarchical clustering method commonly used in multivariate statistics.
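A minimal sketch of this step is given below. It assumes that each similarity coefficient `s` is converted to a distance by `1 - s`, which is a common choice but an assumption here, and it implements Ward's method through the Lance-Williams update on squared distances. The similarity values are made up for illustration, and `ward_merges` is a hypothetical helper name.

```python
import math

def ward_merges(dist):
    """Agglomerative clustering with Ward's criterion.

    dist: symmetric matrix (list of lists) of pairwise distances.
    Returns a list of (sorted member indices, merge height) in merge order.
    """
    n = len(dist)
    size = {i: 1 for i in range(n)}
    members = {i: [i] for i in range(n)}
    # The Lance-Williams update for Ward's method works on squared distances.
    d2 = {(i, j): dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)}
    merges = []
    next_id = n
    while len(size) > 1:
        (i, j), h = min(d2.items(), key=lambda kv: kv[1])
        ni, nj = size[i], size[j]
        for k in size:
            if k == i or k == j:
                continue
            nk = size[k]
            dik = d2.pop((min(i, k), max(i, k)))
            djk = d2.pop((min(j, k), max(j, k)))
            d2[(k, next_id)] = ((ni + nk) * dik + (nj + nk) * djk - nk * h) / (ni + nj + nk)
        del d2[(i, j)]
        members[next_id] = sorted(members.pop(i) + members.pop(j))
        size[next_id] = ni + nj
        del size[i], size[j]
        merges.append((members[next_id], math.sqrt(h)))
        next_id += 1
    return merges

# Made-up similarity coefficients for four units (NOT results from the paper).
similarity = [
    [1.00, 0.90, 0.20, 0.10],
    [0.90, 1.00, 0.15, 0.20],
    [0.20, 0.15, 1.00, 0.80],
    [0.10, 0.20, 0.80, 1.00],
]
distance = [[1.0 - s for s in row] for row in similarity]  # assumed transform
merges = ward_merges(distance)
```

In this toy example units 0 and 1 merge first, then units 2 and 3, and the two pairs are joined last; cutting the merge sequence before the final step gives two clusters.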

4.2. The Empirical Analysis

The following is an empirical analysis of China’s development index panel data.

This paper reports the clustering results of the HNAC model and of the composite PCC method in order to compare their effects on panel data clustering; the composite PCC method covers the C-vine and HNAC composite model as well as the D-vine and HNAC composite model.

To make the results comparable, the Gumbel copula is chosen as the Archimedean copula of the upper HNAC level, and the PCC part also uses the Gumbel copula. This unifies the setup and ensures the comparability of the analysis.

4.2.1. Selection of Different Copula Functions

As mentioned, the Gaussian copula is applicable to data without tail dependence; the Clayton copula is applicable to data with lower-tail dependence; the Gumbel copula applies to data with upper-tail dependence; and the Student t copula applies to data with both lower-tail and upper-tail dependence.
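The tail behavior of the one-parameter Archimedean families mentioned above can be quantified by their tail dependence coefficients, which have simple closed forms. The sketch below uses the standard formulas for the Gumbel and Clayton copulas; the function names are hypothetical, and the Student t case is omitted because it needs the t distribution function.

```python
import math

def gumbel_upper_tail(theta):
    """Upper tail dependence of the Gumbel copula (theta >= 1): 2 - 2^(1/theta)."""
    return 2.0 - 2.0 ** (1.0 / theta)

def clayton_lower_tail(theta):
    """Lower tail dependence of the Clayton copula (theta > 0): 2^(-1/theta)."""
    return 2.0 ** (-1.0 / theta)

# theta = 1 is the independence case for Gumbel, so there is no upper tail dependence.
gumbel_at_independence = gumbel_upper_tail(1.0)
gumbel_at_two = gumbel_upper_tail(2.0)
clayton_at_one = clayton_lower_tail(1.0)
```

The Gumbel copula has no lower tail dependence and the Clayton copula has no upper tail dependence, while the Gaussian copula has neither for correlations strictly below one.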

Since each copula family has a different parameter range, the fitted coefficients will differ and the fitted structure will change, which may influence the clustering result. Therefore, the Hit test and the fit measures need to be combined to select the most suitable copula.

If a different copula is selected, the fit of the three methods will differ. The preceding section discussed model evaluation and the goodness-of-fit test; this paper mainly uses the HQC index and the S test. Considering that the value of S depends on , 2000 Monte Carlo simulations are run to give an approximate estimate of the p-value (adjoint probability).
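The Monte Carlo approximation of a p-value can be sketched generically as follows. The helper name `monte_carlo_pvalue` and its arguments are hypothetical: `simulate_statistic` stands for any routine that draws a sample from the fitted model and returns the test statistic for that sample, and the uniform placeholder used in the example is not the S statistic of this paper. The plus-one correction in the numerator and denominator is a common convention and an assumption here.

```python
import random

def monte_carlo_pvalue(observed, simulate_statistic, n_sim=2000, seed=0):
    """Share of simulated statistics at least as large as the observed one."""
    rng = random.Random(seed)
    exceed = sum(1 for _ in range(n_sim) if simulate_statistic(rng) >= observed)
    return (exceed + 1) / (n_sim + 1)

# Placeholder null distribution: a uniform draw stands in for a simulated statistic.
def placeholder_statistic(rng):
    return rng.random()

p_large = monte_carlo_pvalue(2.0, placeholder_statistic)    # observed value is never exceeded
p_small = monte_carlo_pvalue(-1.0, placeholder_statistic)   # observed value is always exceeded
```

A small p-value indicates that the observed statistic is unusually large relative to what the fitted model produces, which counts as evidence against that copula choice.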

As can be seen from Table 2, the Gumbel copula model works well, so Gumbel copula is selected for the following analysis.

4.2.2. The Clustering Results of HNAC

Taking Beijing and Shanghai as examples, the copula dependency relationship after data processing is shown in Figures 6 and 7.

The distribution is shown in Figure 7.

The HNAC and composite PCC methods depend on parameter estimation.

The dependence parameters are processed unit by unit and then clustered; the clustering results are shown in Figure 8.

As can be seen from Figure 8, Beijing and Shanghai are classified together and are very different from all other provinces. The eastern coastal areas (except Fujian province and Hebei province), including the northeast border area, clearly form a second category; Hainan and Xinjiang form the third category, and some of the western provinces form the fourth category. The correlation between regions is reflected.

4.2.3. C-Vine and HNAC Composite PCC Clustering Results

The dependence parameters are processed unit by unit and then clustered; the clustering results are shown in Figure 9.

As can be seen from Figure 9, Beijing and Shanghai form one class, which is different from all other provinces. The eastern coastal areas (except Fujian province and Hebei province), including the northeast border area, clearly form a second category; Hainan and Xinjiang (excluding Guizhou) form the third category, and some of the western provinces form the fourth category. The correlation between regions is reflected.

A graphical comparison shows that the results of the C-vine and HNAC composite model are similar to those of the D-vine and HNAC composite model. The results of both models reflect the cross-sectional differences among provinces and regions and the economic inertia over time, and they clearly reflect the characteristics of the composite dependency structure of the panel data; for panel data clustering, the effect is very stable.

5. Conclusions

Among the models studied, the composite model of D-vine and HNAC and the composite model of C-vine and HNAC are collectively referred to as the composite PCC method. The composite PCC method is the new structure proposed in this paper; it can reflect the hierarchical structure of panel data indexes.

In PCC, the conditional probability functions used by D-vine, C-vine, and R-vine are different, so their dependence structures differ greatly. D-vine is appropriate when the variables play equivalent roles (similar to an exchangeable order; see Barbe [33]), as with the education, economic, social, and life indexes, which are comparatively equivalent (intuitively, the number of copulas linked to each variable is equal). For the C-vine, the number of copulas linked to each variable differs, so the positions in which the four indexes are placed need to be carefully designed and justified. In an R-vine, the number of copulas linked to each variable also differs, and there is a certain ordering. See Rehman [34] for research on nonparametric abundance estimation.

In this paper, a panel data clustering method based on composite PCC is proposed. The method is summarized below according to common criteria for evaluating clustering algorithms.

(1) The composite PCC method has better scalability. However, when the number of estimated parameters and the data volume are large, the computation slows down.

(2) Some clustering algorithms are sensitive to parameters, and the parameters are difficult to determine for data sets with a large number of units; such clustering algorithms are not practical. In the compound PCC method, there are many parameters in the marginal distributions, conditional distributions, and dependence structure, so parameter estimation and hypothesis testing are difficult.

(3) Some clustering algorithms are sensitive to noisy data, and such clustering methods are of limited use, although a clustering algorithm that can recognize noise is sometimes needed. The sensitivity of the compound PCC method to noisy data depends on the copulas used and on the marginal distributions; depending on the construction, the result may or may not be sensitive to noise.

(4) From the input data order: some clustering algorithms are sensitive to the order of the input data, and such clustering algorithms are not practical. The above clustering method is not sensitive to the input data order. However, the specification of the dependence structure is itself sequential and depends on the analyst’s understanding and grasp of the problem.

(5) From processing high-dimensional data: in genetics and biology, or in e-commerce data sets, the number of observations is often far smaller than the number of indicators (variables or attributes). With the compound PCC method, high-dimensional data easily lead to a complex structure with high computational complexity. The estimation is sensitive to the initial values and may converge to a local optimum or fail to converge.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors hereby declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The paper was financially supported by the National Social Science Fund of China “Research on the Redistribution Function of Social Security System: Research on the National Social Security Fund’s Intervention in Pension Insurance Payment” (18BJY212) and “the Fundamental Research Funds for the Central Universities” in UIBE (CXTD9-04).