Gaussian Mixture Models Based on Principal Components and Applications
Data scientists use various machine learning algorithms to discover patterns in large data sets that can lead to actionable insights. High-dimensional data are commonly reduced to a set of principal components that highlight similarities and differences. In this work, we model the reduced data with a bivariate Gaussian mixture model. We discuss a heuristic for detecting important components by choosing the initial values of the location parameters using two different techniques: cluster means, obtained by k-means and hierarchical clustering, and the default values in the “mixtools” R package. The parameters of the model are estimated via an expectation maximization algorithm. The Bayesian information criterion is evaluated for both techniques, demonstrating that both are efficient with respect to computation capacity. The effectiveness of the discussed techniques is demonstrated through a simulation study and using real data sets from different fields.
1. Introduction
In real data such as engineering data, efficient dimension reduction is required to reveal underlying patterns of information. Dimension reduction can be used to convert data sets containing millions of features into manageable spaces for efficient processing and analysis. Unsupervised learning is the main approach to reducing dimensionality. Conventional dimension reduction approaches can be combined with statistical analysis to improve the performance of big data systems. Many dimension reduction techniques have been developed by statistical and artificial intelligence researchers. Principal component analysis (PCA), introduced in 1901 by Pearson, is one of the most popular of these methods. The main purpose of PCA is to reduce the dimensionality of a data set consisting of a large number of interrelated variables, while retaining as much as possible of the variation present in the data set. PCA appears under several names and closely related formulations: singular value decomposition in numerical analysis, the Karhunen–Loève expansion in electrical engineering, eigenvector analysis and characteristic vector analysis in the physical sciences, and the Hotelling transformation in image analysis, where it is often used for principal component projection.
In recent years, there has been increasing interest in PCA mixture models. Mixture models provide a useful framework for the modelling of complex data with a weighted component distribution. Owing to their high flexibility and efficiency, they are used widely in many fields, including machine learning, image processing, and data mining. However, because the component distributions in a mixture model are commonly formalized as probability density functions, implementations in high-dimensional spaces are constrained by practical considerations.
PCA mixture models are based on a mixture-of-experts technique, which models a nonlinear distribution through a combination of local linear submodels, each with a fairly simple distribution. For model selection, Kim, Kim, and Bang proposed a PCA mixture model that has a more straightforward expectation maximization (EM) calculation, does not require a Gaussian error term for each mixture component, and uses an efficient technique for model order selection. The researchers applied the proposed model to the classification of synthetic data and eye detection.
For multimode processes, the Gaussian mixture model (GMM) was developed to estimate the probability density function of the process data under normal operating conditions. However, in the case of high-dimensional and collinear process variables, learning from process data with a GMM can be difficult or impossible. A novel multimode monitoring approach based on the PCA mixture model was proposed by Xu, Xie, and Wang to address this issue. In this method, first, the PCA technique is applied directly to each Gaussian component’s covariance matrix to reduce the dimension of process variables and to obtain nonsingular covariance matrices. Then, an EM algorithm is used to automatically optimize the number of mixture components. A novel process monitoring scheme for the detection of multimode processes was developed using the resulting PCA mixture model. The monitoring performance of the proposed approach has been evaluated through case studies.
In recent years, hyperspectral imaging has become an important research subject in the field of remote sensing. An important application of hyperspectral imaging is the identification of land cover areas. The rich content of hyperspectral data enables forests, urban areas, crop species, and water supplies to be recognized and classified. In 2016, Kutluk, Kayabol, and Akan proposed a supervised classification and dimensionality reduction method for hyperspectral images, using a mixture of probabilistic PCA (PPCA) models. The proposed mixture model simultaneously allows the reduction of dimensionality and spectral classification of the hyperspectral image. Experimental findings obtained using real hyperspectral data indicate that the proposed approach results in better classification than the state-of-the-art methods.
In the field of face recognition, Ahmadkhani and Adibi proposed a supervised version of the PPCA mixture model. This model provides a number of local linear manifolds underlying the data samples. The underlying manifolds are used for face recognition applications to achieve dimensionality reduction without loss of information.
In this work, we reduce the dimensions of data by applying a PCA technique and then model the reduced data, that is, the principal component scores, using one GMM. Then, we obtain estimates of the parameters using an EM algorithm. Finally, we compare the selection of initial values for location parameters in the mixture model using three different techniques: k-means, hierarchical clustering, and default values in the “mixtools” package.
The rest of this paper is organized as follows. Section 2 briefly defines the concept of PCA. In Section 3, the mixture densities are discussed. Section 4 provides the probability density function of the Gaussian mixture distribution and the EM algorithm that is used to estimate the mixture’s parameters. Section 5 shows a PCA mixture model based on the proposed scenario. In Section 6, the experimental results are presented in three subsections, and the main conclusions are listed in Section 7.
2. Principal Component Analysis
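Suppose we have $p$-dimensional vectors and need to reduce them to a $q$-dimensional subspace. The reduction can be achieved by projecting the original vectors onto $q$ dimensions, the principal components, which span the subspace. Suppose that $x$ is a vector of $p$ random variables. To find the principal components, we compute several linear functions of $x$ with maximum variance; most of the variation in $x$ will be accounted for by $q$ principal components, where $q \ll p$. The PCA method determines the correlation between PCA components and data variables; a high correlation indicates important variables. Let $\Sigma$ be the known covariance matrix for the random variable $x$. For $k = 1, 2, \ldots, p$, the $k$th PC is given by $z_k = \alpha_k^{T}x$, where $\alpha_k$ is an eigenvector of $\Sigma$ corresponding to its $k$th largest eigenvalue $\lambda_k$. If $\alpha_k$ is chosen to have unit length, $\alpha_k^{T}\alpha_k = 1$, then $\operatorname{var}(z_k) = \lambda_k$. To derive the first component, the technique of Lagrange multipliers can be used, by maximizing
$$\alpha_1^{T}\Sigma\alpha_1 - \lambda\left(\alpha_1^{T}\alpha_1 - 1\right) \tag{1}$$
and choosing $\lambda$ to be as large as possible. Setting the derivative with respect to $\alpha_1$ to zero gives $\Sigma\alpha_1 = \lambda\alpha_1$, so $\lambda$ is an eigenvalue of $\Sigma$ and the maximized variance is $\alpha_1^{T}\Sigma\alpha_1 = \lambda$. Here, $\alpha_1$ is the eigenvector corresponding to the largest eigenvalue of $\Sigma$, and $\alpha_1^{T}x$ is the first principal component of $x$. In general, the $k$th principal component of $x$ is $\alpha_k^{T}x$, and $\operatorname{var}(\alpha_k^{T}x) = \lambda_k$. The second principal component, $\alpha_2^{T}x$, maximizes $\alpha_2^{T}\Sigma\alpha_2$, subject to being uncorrelated with $\alpha_1^{T}x$. The uncorrelation constraint can be expressed using any of the following equations:
$$\alpha_1^{T}\Sigma\alpha_2 = 0, \qquad \alpha_2^{T}\Sigma\alpha_1 = 0, \tag{2}$$
$$\alpha_1^{T}\alpha_2 = 0, \qquad \alpha_2^{T}\alpha_1 = 0. \tag{3}$$
If we choose equation (3), we can write a Lagrangian to maximize $\alpha_2^{T}\Sigma\alpha_2$ as follows:
$$\alpha_2^{T}\Sigma\alpha_2 - \lambda\left(\alpha_2^{T}\alpha_2 - 1\right) - \phi\,\alpha_2^{T}\alpha_1. \tag{4}$$
Differentiation of this quantity with respect to $\alpha_2$ gives us
$$2\Sigma\alpha_2 - 2\lambda\alpha_2 - \phi\,\alpha_1 = 0. \tag{5}$$
Next, left multiplying $\alpha_1^{T}$ into this expression, we have
$$2\alpha_1^{T}\Sigma\alpha_2 - 2\lambda\,\alpha_1^{T}\alpha_2 - \phi\,\alpha_1^{T}\alpha_1 = 0, \tag{6}$$
where, as mentioned above, the first two terms equal zero and $\alpha_1^{T}\alpha_1 = 1$, resulting in $\phi = 0$. Therefore, $\Sigma\alpha_2 - \lambda\alpha_2 = 0$, or $(\Sigma - \lambda I_p)\alpha_2 = 0$, is another eigenvalue equation, and we use the same strategy of choosing $\alpha_2$ to be the eigenvector associated with the second largest eigenvalue, which yields the second principal component of $x$, namely, $\alpha_2^{T}x$ [8, 9].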
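The eigenvalue characterization above can be illustrated numerically. The following is a minimal sketch in Python (not the R tooling used in this paper), assuming a small hypothetical covariance matrix `Sigma`; it swaps in power iteration with deflation as a simple stand-in for a full eigen-decomposition. The function names `top_eigenpair`, `principal_axes`, and `scores` are illustrative only.

```python
import math
import random

def top_eigenpair(S, iters=5000, tol=1e-12, seed=0):
    """Power iteration: largest eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    p = len(S)
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]   # random start vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    lam = 0.0
    for _ in range(iters):
        w = [sum(S[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:                            # S v = 0, nothing left to extract
            return 0.0, v
        v = [x / norm for x in w]
        # Rayleigh quotient v' S v estimates the eigenvalue (the variance of the PC)
        new_lam = sum(v[i] * S[i][j] * v[j] for i in range(p) for j in range(p))
        done = abs(new_lam - lam) < tol
        lam = new_lam
        if done:
            break
    return lam, v

def principal_axes(S, q):
    """First q (eigenvalue, eigenvector) pairs, largest eigenvalue first."""
    S = [row[:] for row in S]                      # work on a copy
    p = len(S)
    pairs = []
    for _ in range(q):
        lam, v = top_eigenpair(S)
        pairs.append((lam, v))
        for i in range(p):                         # deflation: remove the found component
            for j in range(p):
                S[i][j] -= lam * v[i] * v[j]
    return pairs

def scores(x, pairs):
    """Principal component scores alpha_k' x for one (centred) observation x."""
    return [sum(a * xi for a, xi in zip(v, x)) for _, v in pairs]

Sigma = [[2.0, 1.0], [1.0, 2.0]]                   # hypothetical covariance matrix
pairs = principal_axes(Sigma, 2)
```

For this matrix the two eigenvalues are 3 and 1, so the first component carries three quarters of the total variance. In practice a library eigen-solver would normally be preferred over power iteration.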
3. Mixture Densities
A mixture density is defined as a weighted sum of $K$ component densities [9, 10]. Denote the $k$th component density by $p(x; \theta_k)$, where $\theta_k$ indicates the component parameters. We use $\pi_k$ to denote the weighting factor or “mixing proportion” of the $k$th component in the combination, with the constraints that $\pi_k \ge 0$ and $\sum_{k=1}^{K}\pi_k = 1$; $\pi_k$ represents the probability that a data sample belongs to the $k$th mixture component. A $K$-component mixture density is then defined as
$$p(x) = \sum_{k=1}^{K}\pi_k\, p(x; \theta_k). \tag{7}$$
The mixture model has a vector of parameters, $\theta = \{\theta_1, \ldots, \theta_K, \pi_1, \ldots, \pi_K\}$.
We consider a mixture density to model a process by selecting a “source” $k$ according to the multinomial distribution $\{\pi_1, \ldots, \pi_K\}$ and then drawing a sample from the corresponding component density $p(x; \theta_k)$. Therefore, the probability of selecting source $k$ and datum $x$ is $\pi_k\, p(x; \theta_k)$. Equation (7) gives the marginal probability of selecting datum $x$. We can think of the source that generated a data vector $x$ as “missing information”; that is, given a data point $x$, we want to infer which source it is likely to belong to. Section 4 presents the EM algorithm, which is used to iteratively estimate this missing information [11, 12].
In mixture models, we treat the hidden information as a latent variable, denoted by $z$. It takes values in a discrete set satisfying $z_k \in \{0, 1\}$ and $\sum_{k=1}^{K} z_k = 1$. We define the joint distribution $p(x, z)$ in terms of a marginal distribution $p(z)$ and a conditional distribution $p(x \mid z)$. Generally, in the mixture model, we first choose a sample $z$ from a multinomial distribution and then draw observations $x$ for sample $z$ from a distribution that depends on $z$, i.e.,
$$p(x, z) = p(z)\, p(x \mid z). \tag{8}$$
The marginal distribution over $z$ is specified in terms of the mixing coefficients $\pi_k$, such that
$$p(z_k = 1) = \pi_k. \tag{9}$$
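The two-stage generative process described in this section (select a source, then draw from its component density) can be sketched as follows. This is an illustrative Python example, assuming univariate Gaussian components and hypothetical weights, means, and standard deviations; the name `sample_mixture` is not from any package.

```python
import random

def sample_mixture(n, weights, means, sds, seed=0):
    """Draw n (source, value) pairs: pick a source k with probability pi_k, then sample from it."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        k = rng.choices(range(len(weights)), weights=weights)[0]  # latent source
        draws.append((k, rng.gauss(means[k], sds[k])))            # observation given the source
    return draws

# Hypothetical two-component mixture
draws = sample_mixture(10000, weights=[0.3, 0.7], means=[-2.0, 3.0], sds=[1.0, 0.5])
share_first = sum(1 for k, _ in draws if k == 0) / len(draws)
```

In a real data set only the values are observed; the source labels kept here are exactly the “missing information” that the EM algorithm has to infer.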
4. Mixtures of Gaussians
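The probability density function of a $d$-dimensional Gaussian random vector $x$ is defined as
$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x - \mu)^{T}\Sigma^{-1}(x - \mu)\right\}, \tag{10}$$
where $\mu$ is a $d$-dimensional vector of means, and $\Sigma$ is a $d \times d$ covariance matrix.
The Gaussian mixture distribution can be written as a linear superposition of Gaussians in the form
$$p(x) = \sum_{k=1}^{K}\pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k). \tag{11}$$
Now, the conditional distribution of $x$ given a particular value of $z$ is a Gaussian:
$$p(x \mid z_k = 1) = \mathcal{N}(x \mid \mu_k, \Sigma_k). \tag{12}$$
The marginal distribution of $x$ is obtained by summing the joint distribution over all possible states of $z$ to give
$$p(x) = \sum_{z} p(z)\, p(x \mid z) = \sum_{k=1}^{K}\pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k). \tag{13}$$
An important derived quantity is the “posterior probability” on a mixture component for a given data vector $x$:
$$\gamma(z_k) = p(z_k = 1 \mid x) = \frac{\pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K}\pi_j\, \mathcal{N}(x \mid \mu_j, \Sigma_j)}. \tag{14}$$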
In the example shown in Figure 1, the resulting distribution is bimodal, suggesting that the data come from two different sources. In the figure, the red and green lines indicate two components of the Gaussian mixture distribution.
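To make the posterior probability of a mixture component concrete, the following Python sketch evaluates it for a univariate two-component mixture. The parameter values are hypothetical and the function names are illustrative; the bivariate case used later in the paper would replace `normal_pdf` with the multivariate density.

```python
import math

def normal_pdf(x, mu, sd):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def responsibilities(x, weights, means, sds):
    """Posterior probability of each component given x (Bayes' rule)."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(joint)                    # marginal density p(x)
    return [j / total for j in joint]

# Hypothetical symmetric mixture: a point midway between the means is split evenly
gamma_mid = responsibilities(0.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
# A point far to the left is attributed almost entirely to the first component
gamma_left = responsibilities(-3.0, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

The posterior probabilities always sum to one across components, which is what allows them to act as soft cluster assignments.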
4.1. EM for Gaussian Mixtures
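The EM algorithm is an estimation method used to find maximum likelihood estimators when a data set has missing values or latent variables. In this work, we assume a GMM with a fixed number of components $K$, known a priori, fitted to observations $x_1, \ldots, x_N$.
The EM algorithm proceeds as follows:
(1) Initialize the means $\mu_k$, covariances $\Sigma_k$, and mixing coefficients $\pi_k$ and evaluate the initial value of the log-likelihood.
(2) E step: evaluate the responsibilities using the current parameter values:
$$\gamma(z_{nk}) = \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K}\pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}. \tag{15}$$
(3) M step: re-estimate the parameters using the current responsibilities:
$$\mu_k^{\text{new}} = \frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})\, x_n, \tag{16}$$
$$\Sigma_k^{\text{new}} = \frac{1}{N_k}\sum_{n=1}^{N}\gamma(z_{nk})\,(x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^{T}, \tag{17}$$
$$\pi_k^{\text{new}} = \frac{N_k}{N}, \tag{18}$$
where
$$N_k = \sum_{n=1}^{N}\gamma(z_{nk}). \tag{19}$$
(4) Evaluate the log-likelihood:
$$\ln p(X \mid \mu, \Sigma, \pi) = \sum_{n=1}^{N}\ln\left\{\sum_{k=1}^{K}\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\right\} \tag{20}$$
and check for convergence of either the parameters or the log-likelihood. If the convergence criterion is not satisfied, return to step 2.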
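The four steps above can be sketched in a few lines. The following Python example is a minimal illustration for the univariate case, not the “mixtools” implementation used in this paper; the simulated data, starting values, and the name `em_gmm_1d` are all assumptions made for the example.

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def log_likelihood(data, weights, means, variances):
    return sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in data
    )

def em_gmm_1d(data, means, variances, weights, max_iter=500, tol=1e-8):
    """EM for a univariate Gaussian mixture with a fixed number of components."""
    means, variances, weights = list(means), list(variances), list(weights)
    n, K = len(data), len(means)
    history = [log_likelihood(data, weights, means, variances)]      # step 1
    for _ in range(max_iter):
        # Step 2 (E step): responsibilities under the current parameters
        resp = []
        for x in data:
            joint = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(K)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # Step 3 (M step): weighted means, variances and mixing proportions
        for k in range(K):
            n_k = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / n_k
            variances[k] = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / n_k
            weights[k] = n_k / n
        # Step 4: log-likelihood and convergence check
        history.append(log_likelihood(data, weights, means, variances))
        if abs(history[-1] - history[-2]) < tol:
            break
    return means, variances, weights, history

rng = random.Random(1)                       # hypothetical simulated data
data = [rng.gauss(-2.0, 1.0) for _ in range(200)] + [rng.gauss(3.0, 1.0) for _ in range(200)]
means, variances, weights, history = em_gmm_1d(data, [-1.0, 1.0], [1.0, 1.0], [0.5, 0.5])
```

The `history` list records the log-likelihood after each iteration; since EM never decreases the log-likelihood, plotting it against the iteration number gives the kind of convergence curve discussed in Section 6. A production implementation would also guard against a component's variance collapsing to zero.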
5. PCA of Gaussian Mixture Model
In this section, we present the steps of the proposed method. These steps are also illustrated in Figure 2.
(1) Use the PCA technique to reduce the dimensionality of a $p$-dimensional data set. To find the principal components, we first obtain the eigenvalues and eigenvectors of the covariance matrix. The eigenvectors define the principal components, and the eigenvalues give their variances. The total number of principal components corresponds to the total number of variables in the data set.
(2) Choose the $q$ eigenvectors that correspond to the $q$ largest eigenvalues, where $q$ is the number of dimensions of the new feature subspace ($q \le p$).
(3) Transform the original data set to the lower-dimensional version.
(4) Use the k-means clustering method to partition the data set into $K$ clusters. When using k-means, it is important to determine the correct number of clusters for the data.
(5) The new data are modelled by a mixture of Gaussians with parameters set to assumed initial values.
(6) Use the EM algorithm to estimate the unknown parameters representing the mixing proportions between the Gaussians and the means and covariances of each.
(7) Use the Bayesian information criterion (BIC) to assess the fit of the model; BIC is a criterion for model selection among a finite set of models, where the model with the lowest BIC is preferred.
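As a small illustration of step (7), the sketch below computes the BIC in the “lower is better” form, $-2\ln L + m\ln n$, together with the number of free parameters $m$ of a Gaussian mixture with unrestricted covariance matrices. This is a Python sketch under those assumptions; the log-likelihood value is hypothetical, and some software reports the criterion with the opposite sign, so the convention should be checked before comparing values across packages.

```python
import math

def gmm_free_parameters(K, d):
    """Free parameters of a K-component, d-dimensional GMM with full covariances."""
    mixing = K - 1                       # proportions sum to one
    means = K * d
    covariances = K * d * (d + 1) // 2   # symmetric d x d matrices
    return mixing + means + covariances

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion, lower-is-better convention."""
    return -2.0 * log_lik + n_params * math.log(n_obs)

m = gmm_free_parameters(K=2, d=2)        # two-component bivariate GMM
example_bic = bic(log_lik=-100.0, n_params=m, n_obs=500)   # hypothetical log-likelihood
```

For the two-component bivariate model used in this paper, the count is 1 mixing proportion, 4 mean coordinates, and 6 covariance entries.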
6. Experimental Results
To study the effectiveness of the proposed method, we consider two scenarios. In the first scenario, the mixture model is fitted to the reduced data produced by the PCA method. In the second scenario, the clustering method is applied to the reduced data, and then the mixture model is fitted to the new data by taking the cluster means as initial values for the means in the mixture model. We use different types of data sets. The method is implemented on a scaled data set, and the results are illustrated in the following sections. We use the “stats”, “mclust”, and “mixtools” R packages to implement this method [15–17].
6.1. Simulation Case
We implement the proposed method on simulation data with different sample sizes: 50, 100, and 500. Consider a data set of four variables defined as follows:where , and are scaled variables of , and , respectively. The simulated data consist of four variables, . The mean values for each variable in the simulated data are , , , and .
The implementation includes a graphical plot of the data set as displayed in Figure 3, which shows a pairs plot of the simulation data and its three-dimensional surface. As a first step, we applied the PCA method; the results are summarized in Table 1, which presents the total variance of components. In practice, PCA describes the data in a few variables with little loss of information. As shown in Figure 4(a) and Table 1, two components explained 93% of the total variance. We considered those two components to comprise a new data set denoted RD data, which contained 93% of the information of the original data. The empirical distribution of the RD data is presented in Figure 4(b).
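The share of variance explained by the leading components follows directly from the eigenvalues of the covariance matrix. The Python sketch below shows the calculation; the eigenvalues are hypothetical numbers chosen for illustration and are not the values in Table 1.

```python
def explained_variance(eigenvalues):
    """Individual and cumulative shares of total variance, largest component first."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    shares = [v / total for v in ordered]
    cumulative, running = [], 0.0
    for s in shares:
        running += s
        cumulative.append(running)
    return shares, cumulative

def components_needed(eigenvalues, threshold):
    """Smallest number of components whose cumulative share reaches the threshold."""
    _, cumulative = explained_variance(eigenvalues)
    for q, c in enumerate(cumulative, start=1):
        if c >= threshold:
            return q
    return len(cumulative)

eigenvalues = [2.4, 1.32, 0.2, 0.08]          # hypothetical eigenvalues for four variables
shares, cumulative = explained_variance(eigenvalues)
q = components_needed(eigenvalues, 0.90)
```

A scree plot such as Figure 4(a) displays the same quantities graphically.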
Hence, we fitted a two-component bivariate mixture model to the RD data. Figure 5(a) displays the fit of the two-component GMM to the new data (with 500 data points); the plot specifies each component’s mean and sigma values. In order to estimate the density parameters for each component, we used an EM algorithm for mixtures of bivariate data. Table 2 presents the estimates of model parameters for each component; it also displays the estimates for different sample sizes. We computed the BIC of the model for the three cases presented in Table 2 and observed that the BIC value increased with sample size. Figure 5(b) shows a plot of log-likelihood versus number of iterations; the log-likelihood levelled off as the number of iterations increased, indicating that the EM method reached convergence.
The second scenario involves the selection of the initial values for the means in the mixture model; this was done by applying the k-means method to the reduced data. Then, the centres (or the means) of the clusters were taken as initial values. Thus, the RD data were partitioned into two clusters in the plane of the first two principal components, PC1 and PC2, using the k-means method, as shown in Figure 6(a). The cluster means for PC1 were 1.343257 and −1.210463, whereas those for PC2 were −0.02877005 and 0.02592586. In the next step, the bivariate GMM was fitted to the RD data using the cluster means as initial values. A visualization of the fitted bivariate GMM is presented in Figure 6(b). A summary of the results is given in Table 3. The resulting BIC value was 3389.031, which was the same as the BIC for the GMM, as displayed in Table 2.
Parameter estimates were compared for the mixture and clustering methods. The k-means method assigns each point to the nearest cluster centre using the conventional Euclidean distance, whereas the GMM uses a distance weighted by the variance of each component. One can think of mixture models as generalizing k-means clustering to incorporate information about the covariance structure of the data as well as the centres of the latent Gaussians.
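The idea of using k-means cluster centres as starting values for the component means can be sketched as follows. This is a plain Lloyd-type k-means in Python on hypothetical two-dimensional data, not the R `kmeans` function used for the results in this paper; the resulting centres would then be handed to an EM routine as the initial means.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd's algorithm with Euclidean distance; returns the k cluster centres."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)                 # start from k distinct data points
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])),
            )
            clusters[nearest].append(p)
        new_centres = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centres[c]
            for c, members in enumerate(clusters)
        ]
        if new_centres == centres:                  # assignments no longer change
            break
        centres = new_centres
    return centres

rng = random.Random(42)                             # hypothetical reduced data (PC1, PC2)
points = [(rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)) for _ in range(50)] + \
         [(rng.gauss(5.0, 0.5), rng.gauss(5.0, 0.5)) for _ in range(50)]
initial_means = sorted(kmeans(points, 2))
```

Because k-means ignores the covariance structure, these centres are only starting values; the EM iterations refine them together with the covariances and mixing proportions.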
6.2. Forensic Glass Fragment Data
In this section, the proposed method is implemented using forensic glass fragment (FG) data (available in an R package). The FG data include 214 observations and 10 variables. Consider the four selected variables, magnesium (Mg), aluminium (Al), silicon (Si), and potassium (K), and the types of fragments (WinF, WinNF, Veh, Con, Tabl, and Head). Note that the variable type is used only to classify the data. Figure 7(a) shows a pairwise plot for four measurements of the scaled data set.
As a first step, we applied the PCA method to the data to obtain a linear estimate of dimensionality. The results are summarized in Table 4. As shown in the scree plot in Figure 7(b), three components explained 89% of the total variance and two components explained 70% of it. We therefore consider the first two components in the following.
Hence, the two-component bivariate mixture model was fitted to the FG data, as shown in Figure 8(a). The mean and sigma values for each component are also shown in Figure 8(a). In order to estimate the density parameters for each component, we used the EM algorithm for mixtures of bivariate data. Note that the initial values were chosen using the “mixtools” package. Table 5 presents estimates of the model parameters for each component, as well as the BIC. Figure 8(b) shows a plot of log-likelihood versus number of iterations; the log-likelihood levelled off as the number of iterations increased, indicating that the EM method reached convergence.
Next, we studied the selection of initial values for the location parameters in the mixture model based on the centres of clusters obtained with different clustering methods: k-means and hierarchical clustering. First, the k-means method was applied to the reduced FG data, and then the centres (or the means) of clusters were taken as initial values. A visualization of the resulting data for PC1 and PC2 is given in Figure 9(a). The cluster centres (means) for PC1 were −0.36697 and 1.45934 and those for PC2 were 0.35596 and −1.41558. Then, the bivariate GMM was fitted to the reduced FG data using the cluster means as initial values. The results are summarized in Table 6, and a plot of the fitted mixture model is shown in Figure 9(b). The resulting BIC was 1174.485.
Second, the initial values for the location parameters were selected using hierarchical clustering with the “mclust” package. As a first step, the hierarchical clustering was applied to the reduced FG data. A visualization of the resulting data for PC1 and PC2 is given in Figure 10(a). The resulting centres were 1.49372 and −0.51297 for PC1, while those for PC2 were −0.58542 and 0.20104. In the next step, the bivariate GMM was fitted to the reduced FG data using the cluster means as initial values, as shown in Figure 10(b). Table 7 shows the results for the mixture model, and the resulting BIC was 1111.483.
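Cluster means from a hierarchical procedure can be obtained in a similar way. The Python sketch below uses plain average-linkage agglomerative clustering on hypothetical data as a simple stand-in; it is not the model-based hierarchical clustering provided by the “mclust” package, and the name `agglomerative_means` is illustrative.

```python
import math
import random

def agglomerative_means(points, k):
    """Merge the two closest clusters (average linkage) until k remain; return their means."""
    clusters = [[p] for p in points]                 # every point starts as its own cluster

    def average_link(a, b):
        return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: average_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]                              # j > i, so index i is unaffected
    return [tuple(sum(col) / len(c) for col in zip(*c)) for c in clusters]

rng = random.Random(7)                               # hypothetical reduced data (PC1, PC2)
points = [(rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)) for _ in range(30)] + \
         [(rng.gauss(5.0, 0.5), rng.gauss(5.0, 0.5)) for _ in range(30)]
hc_means = sorted(agglomerative_means(points, 2))
```

This naive version recomputes all pairwise linkages at every merge and is only suitable for small samples; it is meant to show where the initial means come from, not to be efficient.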
We observe that the selection of initial values for location parameters using clustering methods provided good results, similar to those obtained by selection of initial values with the “mixtools” package.
6.3. Applications to Real Data
In this section, the proposed method is implemented on a real data set obtained from the “Knoema” website. Knoema is one of the most comprehensive sources of global decision-making data. The data set for cancer incidence in 100 countries in a specific year (2016) includes 3168 observations and 32 variables and covers different cancer types. In the first step, we used the PCA approach to approximate the dimensionality of the data in a linear manner. The results of the first five principal components are summarized in Table 8. As shown in the scree plot in Figure 11, 93% of the overall variance was described by the first two components. We thus consider the first two components in the following.
The data were thus fitted by the two-component bivariate mixture model, as shown in Figure 12(a), which gives the mean and sigma values for each component. We used the EM algorithm for bivariate data mixtures to determine density parameters for each component. The initial values were as suggested by the “mixtools” package. The estimates of model parameters are provided in Table 9 for each component, as well as the BIC. Figure 12(b) shows a plot of log-likelihood versus number of iterations; the log-likelihood levelled off as the number of iterations increased, indicating that the EM method reached convergence.
Then, we used the centres of clusters obtained with different clustering techniques (k-means and hierarchical clustering) to analyse the choice of initial values for the location parameters of the mixture model. First, the k-means method was applied to the reduced data, and then the centres (or the means) of clusters were taken as initial values. A visualization of the resulting data for PC1 and PC2 is given in Figure 13(a). The cluster centres (means) for PC1 were 14.19303 and −1.0799, while those for PC2 were −3.22915 and 0.24569. Then, the bivariate GMM was fitted to the reduced data using the cluster means as initial values. The results are summarized in Table 10, and a plot of the fitted mixture model is shown in Figure 13(b). The resulting BIC was 556.9627.
Second, a hierarchical clustering procedure was used to specify the initial values of the location parameters; this was achieved using the “mclust” package. The hierarchical clustering of the reduced data was implemented as a first step. Figure 14(a) provides a visualization of the data obtained for PC1 and PC2: the resulting centres were 9.39997 and −0.58877 for PC1, while those for PC2 were 2.25038 and −0.14095. In the next step, the bivariate GMM was fitted on the reduced data using the cluster means as initial values; the fitted data are presented in Figure 14(b). Table 11 shows the results of the mixture model and the BIC of 556.9629.
We observe that the selection of initial values for location parameters using clustering methods provided good results, similar to those obtained when the initial values were selected using the “mixtools” package.
7. Conclusions
This work aimed to study the applications of PCA in mixture models. First, we discussed the use of the well-known PCA technique for dimension reduction and applied it to high-dimensional data sets. Then, in the reduced data (which contained only two variables), we dealt with the two variables together and fitted a two-component bivariate GMM to the data. We used an EM algorithm to estimate the model parameters. This approach is suitable for large data sets with high dimension and can solve the problem of overfitting. We compared three different techniques for the selection of initial values of location parameters in the mixture model: two clustering methods, k-means and hierarchical clustering, and default values from the “mixtools” package. With all three techniques, EM convergence was reached and similar BIC values were obtained.
Data Availability
The data were taken from the Knoema website, which is one of the most comprehensive sources of global decision-making data in the world (world and regional statistics, national data, maps, rankings) (retrieved from https://knoema.com/Atlas).
This paper was a component of a Master’s thesis by the first author under the supervision of the second author.
Disclosure
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
The authors would like to express their special thanks to Dr Jochen Einbeck of the University of Durham for constructive comments that greatly improved the paper. This work was funded by the Deanship of Scientific Research (DSR), King Abdulaziz University, Jeddah, under grant no. DG 010-247-1440. The authors thank DSR for technical and financial support.
References
Z. Jin, F. Davoine, and Z. Lou, “An effective EM algorithm for PCA mixture model,” in Proceedings of the Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pp. 626–634, Springer, Lisbon, Portugal, August 2004.
G. J. McLachlan and D. Peel, Finite Mixture Models, John Wiley & Sons, Inc, Hoboken, NJ, USA, 2000.
J. Verbeek, “Mixture models for clustering and dimension reduction,” Universiteit Van Amsterdam, Amsterdam, Netherlands, 2004, Doctoral dissertation.
C. M. Bishop, Pattern Recognition and Machine Learning, Springer, Berlin, Germany, 2006.
R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2018.
L. Scrucca, M. Fop, T. B. Murphy, and A. E. Raftery, “Mclust 5: clustering, classification and density estimation using Gaussian finite mixture models,” The R Journal, vol. 8, no. 1, pp. 205–233, 2017.