Abstract

Clustering analysis is an important and difficult task in data mining and big data analysis. Although it is a widely used clustering technique, variable clustering has received little attention in previous studies. Inspired by the metaheuristic optimization techniques developed for clustering data items, we try to overcome the main shortcoming of the k-means-based variable clustering algorithm, its sensitivity to the initial centroids, by introducing metaheuristic optimization. A novel memetic algorithm named MCLPSO (Memetic Comprehensive Learning Particle Swarm Optimization), based on CLPSO (Comprehensive Learning Particle Swarm Optimization), was studied under the framework of memetic computing in our previous work. In this work, MCLPSO is used as a metaheuristic approach to improve the k-means-based variable clustering algorithm by adjusting the initial centroids iteratively to maximize the homogeneity of the clustering results. In MCLPSO, a chaotic local search operator is used, and a simulated annealing (SA) based local search strategy is developed by combining the cognition-only PSO model with SA. The adaptive memetic strategy enables stagnant particles that cannot be improved by the comprehensive learning strategy to escape from local optima and enables some elite particles to perform fine-grained local search around promising regions. The experimental results demonstrate the good performance of MCLPSO in optimizing the variable clustering criterion on several datasets compared with the original variable clustering method. Finally, for practical use, we also developed a web-based interactive software platform for the proposed approach and give a practical case study, analyzing the performance of a semiconductor manufacturing system, to demonstrate its usage.

1. Introduction

Clustering analysis, or clustering, is the task of grouping a set of objects in such a way that, according to a certain similarity measure, objects in the same group (called a cluster) are more similar to each other than to objects in different groups (clusters). Clustering analysis is widely used in the data preprocessing and data mining steps of KDD (Knowledge Discovery in Databases) [1] (Figure 1); in particular, it is the main task of exploratory data mining and unsupervised machine learning. Recently, clustering analysis has been pointed out as a powerful metalearning tool for accurately analyzing big data [2]. Clustering analysis also plays an important role in many other fields, including pattern recognition, image analysis, information retrieval, and bioinformatics. Beyond its importance, clustering analysis is also a challenging task because the unsupervised nature of clustering implies that the structural characteristics of the dataset are not known unless some domain knowledge about the dataset is available in advance [3].

Because of the importance and difficulty of clustering analysis, many clustering algorithms have been proposed in the literature. Some popular clustering algorithms, for example, k-means clustering, suffer from the shortcoming of being sensitive to outliers; therefore, metaheuristic methods such as evolutionary algorithms and swarm intelligence algorithms are widely used to improve clustering algorithms from the optimization perspective. Almost all the metaheuristic-based improvements of clustering algorithms in the literature are devoted to clustering data items, but clustering analysis of variables is also a common technique in statistical data analysis for dimension reduction or (unsupervised) feature selection, especially in practical statistical data analysis activities. The most famous implementation is the VARCLUS procedure in SAS, and other versions of variable clustering methods are implemented in R and SPSS. In contrast to its wide application, the research contributions to variable clustering techniques are insufficient. Moreover, k-means-based variable clustering algorithms suffer from a similar shortcoming: they are sensitive to the initial centroids.

We study a metaheuristic approach for the variable clustering algorithm based on our previous work. In our previous research, MCLPSO [4] was studied to improve CLPSO [5] from two aspects: chaotic local search and SA-based local search. First, we integrate the chaotic local search operator into CLPSO to enable stagnant particles to escape from local optima. Second, an SA-based local search operator combined with the "cognition-only" model is developed to enhance the local search ability of the elite members. The experimental results demonstrate that MCLPSO is competitive in optimizing multimodal functions. In this work, MCLPSO is reorganized under a novel metaheuristic paradigm, memetic computing. Furthermore, MCLPSO is used as a metaheuristic approach to optimize the k-means-based variable clustering algorithm. The experimental results demonstrate that MCLPSO can improve the k-means-based variable clustering algorithm effectively. We also developed a web-based interactive software platform to implement this approach and give a practical case study, analyzing the performance of a semiconductor manufacturing system by MCLPSO-based variable clustering.

The main contributions of this work include the following:
(i) A novel memetic algorithm, MCLPSO, proposed in our previous research, is described under a more sound and general theoretical framework, memetic computing.
(ii) To the best of our knowledge, this is the first time a metaheuristic method has been used to improve the results of variable clustering. The improved variable clustering is also used to deal with some complex tasks.
(iii) To facilitate the practical use of the MCLPSO-based variable clustering algorithm, we developed an interactive software system for this approach and give a real-world case study.

The rest of the paper is organized as follows: In Section 2, we review the related work. In Section 3, we describe our previous work, MCLPSO, in detail under the memetic computing framework. In Section 4, MCLPSO is used to optimize the k-means-based variable clustering problem. In Section 5, experimental results on several datasets are presented and discussed. In Section 6, a web-based interactive software system developed for clustering variables is introduced. Finally, we draw conclusions in Section 7.

2. Related Work

As mentioned in Section 1, clustering analysis is an important and difficult task. In the literature, dozens of clustering algorithms have been proposed for a variety of clustering applications. These clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods [6]. Regardless of the category, the main objective of a clustering algorithm is to maximize both the homogeneity within each cluster and the heterogeneity among different clusters [6]. From the optimization perspective, if the homogeneity within each cluster and the heterogeneity among different clusters can be measured by a clustering criterion, then metaheuristic algorithms, including EA (Evolutionary Algorithms) such as GA (Genetic Algorithms) and swarm intelligence algorithms such as PSO (Particle Swarm Optimization), can be applied to improve the clustering results by adjusting the hyperparameters of those clustering algorithms that are sensitive to them.

As one of the best-known and most commonly used clustering techniques, k-means clustering suffers from the deficiency of being sensitive to its k value and initial centroids. In the literature, several metaheuristic-based clustering methods have been proposed to overcome this deficiency. Maulik and Bandyopadhyay proposed a GA-based clustering technique that exploits the search capability of genetic algorithms so that the clustering metric can be optimized by searching for appropriate cluster centroids [7]. Van der Merwe and Engelbrecht introduced two PSO-based clustering algorithms [8]. In the PSO-based clustering algorithm, PSO is used to find the optimal centroids directly. In the hybrid PSO and k-means clustering algorithm, the result of k-means is used to initialize the PSO-based clustering for quick convergence. The proposed approaches were compared with the k-means algorithm, and the conclusion was that they gave better convergence and lower quantization error. Esmin improved the PSO-based clustering algorithm by modifying the evaluation function, and the modification brought good improvements to the clustering results [9]. Ahmadyfard proposed a two-stage clustering algorithm [10] in which PSO is used at the first stage to find optimal centroids directly; these optimized centroids are then used to initialize k-means at the second stage. The combined method has the advantages of both PSO and k-means if the algorithm switches to k-means when PSO gets close to the global optimum.

Recently, memetic algorithms have been used as a novel metaheuristic paradigm to improve clustering algorithms. A memetic algorithm (MA) is an EA that includes one or more local search operators to improve the individuals within its evolution cycles [11]. In MAs, "memes" refer to the local search operators used to enhance the local search ability of EAs [12]. Moscato first introduced the concept of the meme to EAs by combining SA with the crossover operator of the genetic algorithm to solve the TSP (Travelling Salesman Problem) [13]. MA is inspired by the concept of a meme, which represents a unit of cultural evolution that can exhibit local refinement. Population evolution cooperates with individual learning, and the memetic model is a more detailed explanation of adaptation in natural systems than the genetic model [12]. Most EAs can find the regions around the local optima, but some EAs, including PSO, exhibit the deficiency of lacking local search ability. MA was proposed to overcome this deficiency: the promising regions of the search space can be found by global search operators, and the local search operators can give fine-grained search around these regions [12]. The global search cooperates with the local search to find the global optima. Ong extended the notion of MA and defined memetic computation (MC) [14]. The concept of a meme used in MC is more general than that used in MA: in MC, a meme can denote a learning strategy, an operator, or a local search procedure.

Sheng proposed an approach for simultaneous clustering and feature selection using a niching memetic algorithm, NMA_CF [15]. In NMA_CF, both the feature selection and the cluster centers with different numbers of clusters are encoded in the chromosomes; local search operations are introduced to refine the feature selection and cluster centers, and a niching method is integrated to preserve population diversity and prevent premature convergence. The experimental results demonstrated that the simultaneous global clustering and feature subset optimization mechanism is effective for this problem. Recently, Sheng improved NMA_CF by introducing multiple local search operations and an adaptive niching strategy [16]. In our previous research [17], we proposed a novel memetic algorithm, GS-MPSO, and used it to optimize the initial centroids for k-means clustering. In GS-MPSO, the k-means clustering algorithm is integrated into the function evaluation so that the improvement of the clustering results is significant.

Although most clustering algorithms are devoted to clustering data items, variable clustering is also a widely used technique in practical statistical analysis activities. Variable clustering functionality is provided in almost all statistical tools, such as R, SPSS, and SAS. The most famous implementation is the VARCLUS procedure in SAS. In VARCLUS, the similarity of variables is measured by the Pearson correlation, and the centroid is computed as the first principal component of the variables in the cluster; the variables are grouped into hierarchical clusters by hierarchical clustering. In almost all variable clustering algorithms, PCA (Principal Component Analysis) is used to compute the representative of the variables in a cluster. Vigneau proposed a variable clustering algorithm named Clustering around Latent Variables to segment quantitative variables [18, 19]. Chavent proposed ClustOfVar to cluster variables of mixed type [20]. In [20], PCAMIX is used to calculate the centroids, and a k-means-based variable clustering and a hierarchical variable clustering are studied to optimize the homogeneity criterion.

3. Memetic Comprehensive Learning PSO

In many applications, MAs are more competitive in both effectiveness and efficiency than traditional EAs, but designing an MA with good performance is intricate. To design a competitive MA, the local search components should be kept in balance with the global search component to achieve a balance between exploration and exploitation. In some MAs, excessive use of local search can lead to a loss of diversity in the population. If the local search is applied to a candidate that is already a local optimum, or if the local search depth is too high, computing time may be wasted on unnecessary local search. The local search operators should cooperate with the evolutionary operators to find a balance between global search and local search. Therefore, the following design and parameterization issues of MA are considered [21]:
(i) How often should local search be applied?
(ii) On which solutions should local search be used?
(iii) How long should local search be run?

The memetic strategy used in MCLPSO will give answers to the design issues of MA. We propose an adaptive memetic strategy based on the status and quality of particles.

Although some MAs have proved effective, the framework of MA was found too specific to describe some complex hybrid algorithms, so some researchers have tried to develop a more general and formal definition of MA. For example, Nguyen presents a probabilistic memetic framework to model the process of MA [22]. Ong defines memetic computation as a paradigm that uses the notion of meme(s) as units of information encoded in computational representations for problem-solving [14]. An MC algorithm is composed of several interacting memes and uses these memes to solve complex problems. In MC, a meme can denote an operator, a learning strategy, or a local search procedure, so the concept of a meme used in MC is extended. Iacca gave a thorough analysis of MC and introduced the "Ockham's Razor" principle, stated as "Entities should not be multiplied unnecessarily" [23]. Iacca pointed out that, from the perspective of Ockham's Razor, simplicity helps to design an efficient and compact memetic computational algorithm, and summarized four kinds of memes that perform different kinds of exploration in MC:
(i) Stochastic long-distance exploration
(ii) Stochastic moderate-distance exploration
(iii) Deterministic short-distance exploration
(iv) Random long-distance exploration

In our previous work, we developed some novel memetic algorithms under the framework of MC, and these memetic algorithms were applied to data clustering [17] and missing data estimation [24]. In this work, we develop MCLPSO by following the analysis of MC in [14] and design the following "memes" as in [24]:
(i) Stochastic long-distance exploration: comprehensive learning strategy
(ii) Stochastic moderate-distance exploration: chaotic local search
(iii) Deterministic short-distance exploration: SA-based local search

Diversity can benefit from random long-distance exploration, but random long-distance exploration may lower the quality of the swarm when the comprehensive learning strategy is used. Therefore, random long-distance exploration is disabled in MCLPSO to keep the swarm stable.

Based on the above discussion, we now describe the memes used in MCLPSO in detail and propose the memetic strategy for MCLPSO.

3.1. Classification of the Particles

In MCLPSO, CLPSO is responsible for the global search. The chaotic local search operator is applied to stagnant particles to improve them, and the SA-based local search operator performs fine-grained search around the promising regions.

At each iteration of CLPSO, the ith particle's solution x_i is updated by adding a velocity that is calculated by learning from pbest_{f_i(d)}^d at each dimension d. pbest_j is the best solution found by the jth particle so far, and f_i = [f_i(1), f_i(2), …, f_i(D)] defines the ith particle's corresponding learning exemplars at each dimension. Some variables are introduced to classify the particles for the purpose of designing an adaptive memetic strategy. The classification depends on the search status of the particle:
(i) For the ith particle, flag_i records the number of generations for which the ith particle has not improved its pbest_i. If flag_i ≥ m, f_i is reassigned and flag_i is reset to 0. m is the refreshing gap and is set to 7 [5].
(ii) For the ith particle, stagnant_i records the number of reassignments of f_i during which pbest_i has not been improved, i.e., pbest_i has not been improved for at least m · stagnant_i generations. If stagnant_i ≥ stagnant_max, particle i is stagnant.
(iii) For the ith particle, improve_i records the number of generations for which pbest_i has been improved continuously, i.e., pbest_i has been changed continuously for improve_i generations.
(iv) A particle i with the best pbest_i in the population is a promising particle if improve_i ≥ improve_max.

3.2. Stochastic Long-Distance Exploration—Comprehensive Learning Strategy

CLPSO is adapted from the original PSO by using a novel velocity updating equation (1), which is called the comprehensive learning strategy:

V_i^d = w·V_i^d + c·rand_d^i·(pbest_{f_i(d)}^d − x_i^d)  (1)

where w is the inertia weight, c is the weight of comprehensive learning, rand_d^i generates a random number in [0, 1] according to the uniform distribution, and f_i = [f_i(1), …, f_i(D)] defines the ith particle's corresponding learning exemplars at each dimension. At the dth dimension, the ith particle follows pbest_{f_i(d)}^d, which denotes the dth value of particle f_i(d)'s best solution found so far. Pc_i is the probability that the ith particle will learn from other particles' pbest, which is empirically defined as

Pc_i = 0.05 + 0.45·(exp(10(i − 1)/(ps − 1)) − 1)/(exp(10) − 1)  (2)

where ps denotes the population size.

The selection of the learning exemplars of the ith particle can be implemented by the following steps. For each dimension of the ith particle, a random number is generated between 0 and 1 according to the uniform distribution. If this random number is larger than Pc_i, the corresponding dimension learns from the particle's own pbest_i; otherwise, it learns from another particle's pbest: two particles are chosen randomly from the swarm, excluding the ith particle, and the one with the better pbest is selected as the exemplar for particle i at that dimension. This process is summarized in Figure 2 and sketched below. For efficiency, the ith particle is not allowed to refresh its learning exemplars f_i until it has ceased improving for m generations; m is called the refreshing gap.
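To make the exemplar selection and the velocity update concrete, the following is a minimal sketch in Python (the authors' platform is implemented in Java, so this is illustrative only; the names select_exemplars and clpso_velocity are ours). It assumes a minimization problem and pbest stored as a ps × D array.

```python
# A minimal sketch of the comprehensive learning strategy (equations (1)-(2)),
# assuming minimization; select_exemplars and clpso_velocity are our names.
import numpy as np

def learning_probability(i, ps):
    """Pc_i from equation (2), with particles indexed from 0."""
    return 0.05 + 0.45 * (np.exp(10 * i / (ps - 1)) - 1) / (np.exp(10) - 1)

def select_exemplars(i, pbest, pbest_fitness, rng):
    """Pick the exemplar particle f_i(d) for each dimension d of particle i."""
    ps, dim = pbest.shape
    pc = learning_probability(i, ps)
    exemplars = np.full(dim, i)               # default: learn from own pbest
    others = [j for j in range(ps) if j != i]
    for d in range(dim):
        if rng.random() < pc:                 # learn from another particle's pbest
            a, b = rng.choice(others, size=2, replace=False)
            exemplars[d] = a if pbest_fitness[a] < pbest_fitness[b] else b
    if np.all(exemplars == i):                # CLPSO forces at least one foreign exemplar [5]
        exemplars[rng.integers(dim)] = rng.choice(others)
    return exemplars

def clpso_velocity(v, x, pbest, exemplars, w, c, rng):
    """Velocity update of equation (1): v = w*v + c*rand_d*(pbest_{f_i(d)}^d - x^d)."""
    dim = x.shape[0]
    target = pbest[exemplars, np.arange(dim)]  # dth value taken from particle f_i(d)
    return w * v + c * rng.random(dim) * (target - x)
```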

In CLPSO, each particle learns from pbest_{f_i}, which is derived from different particles' historical best positions. The updating strategy (1) has been shown, by an analysis of the search behavior, to yield a larger potential search space than that of the original PSO [5]. The swarm's diversity can be maintained by the comprehensive learning strategy; therefore, the performance on complex multimodal problems is improved. However, this improvement is obtained at the cost of convergence speed, because the effect of the current global best position is weakened. Moreover, if all the particles share a pbest similar to the current global best position, comprehensive learning is not able to help the swarm escape from the local optimum. Like other EAs, CLPSO also lacks local search ability. In this study, CLPSO is investigated under the framework of MC, and two local search operators are introduced to overcome these deficiencies.

3.3. Stochastic Moderate-Distance Exploration—Chaotic Local Search

We study the chaotic local search operator to improve a stagnant particle i that cannot improve its pbest_i by the comprehensive learning strategy. Chaotic_local_search is adapted from the chaotic local search operator in [25]. The logistic equation (3) is used to generate the chaotic sequence:

x_{k+1} = μ·x_k·(1 − x_k)  (3)

where μ is the control factor and x is the chaotic variable. Although (3) is deterministic, it exhibits chaotic dynamics when μ = 4 and x_0 ∈ (0, 1). So, (4) is used to generate the chaotic sequence for the dth dimension of particle i:

cx_id^{k+1} = 4·cx_id^k·(1 − cx_id^k)  (4)

The sequence generated by (4) is sensitive to the initial value: a minute difference in the initial value of the chaotic variable results in a considerable difference in its long-term behavior. Equation (5) is used to normalize the initial value of the chaotic variable in (4):

cx_id^0 = (x_id − x_min,d)/(x_max,d − x_min,d)  (5)

The stagnant particle i is perturbed with probability P_Chaotic by the denormalized value of a chaotic variable. The denormalized value is derived from (6):

x′_id = x_min,d + cx_id^k·(x_max,d − x_min,d)  (6)

In Chaotic_local_search, x_i is reset to pbest_i, and then x_i is normalized between 0 and 1 by (5) to initialize the chaotic vector cx_i. [x_min,d, x_max,d] is the range of the dth dimension of the search space. A chaotic sequence is generated for each dimension by (4), where cx_id is the chaotic variable for the dth value of particle i and k is the iteration number. cx_id evolves by (4) iteratively, and its track during the evolution can travel ergodically over the whole search space. During the evolution of the chaotic variables, the position x_i is perturbed with probability P_Chaotic by x′_id to escape from the local optimum, where x′_id is denormalized from cx_id by (6). The details of Chaotic_local_search are described in Figure 3 and sketched below; Chaotic_ls_length represents the number of iterations.
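A minimal sketch of Chaotic_local_search under the same conventions (minimization; xmin and xmax are arrays of per-dimension bounds); the exact control flow of Figure 3 may differ in the authors' implementation.

```python
# A sketch of Chaotic_local_search: seed a chaotic vector from pbest (eq. (5)),
# iterate the logistic map (eq. (4)), and perturb dimensions with probability
# P_Chaotic using the denormalized chaotic values (eq. (6)).
import numpy as np

def chaotic_local_search(pbest, pbest_f, fitness, xmin, xmax,
                         length=100, p_chaotic=0.1, rng=None):
    rng = rng or np.random.default_rng()
    cx = (pbest - xmin) / (xmax - xmin)          # normalize into (0, 1), eq. (5)
    cx = np.clip(cx, 1e-6, 1 - 1e-6)             # avoid degenerate seeds of the map
    best_x, best_f = pbest.copy(), pbest_f
    for _ in range(length):
        cx = 4.0 * cx * (1.0 - cx)               # logistic map with mu = 4, eq. (4)
        x = best_x.copy()
        mask = rng.random(x.shape[0]) < p_chaotic
        x[mask] = xmin[mask] + cx[mask] * (xmax[mask] - xmin[mask])   # eq. (6)
        f = fitness(x)
        if f < best_f:                           # keep the perturbed point if better
            best_x, best_f = x, f
    return best_x, best_f
```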

3.4. Deterministic Short-Distance Exploration—SA Based Local Search

CLPSO is used as the global search component of MCLPSO because its diversity can be maintained by comprehensive learning. However, the lack of local refinement ability in CLPSO can lead to missing the local optima. To solve this problem, a novel local search operator combining the cognition-only model [26] with SA was developed in our previous work [17] to enhance the local search ability of CLPSO. The details of this SA-based local search operator are described in Figure 4.

In Figure 4, T is the temperature variable and T0 is the initial temperature. SA_ls_length represents the number of iterations. A candidate pbest′_i can be obtained by introducing a Cauchy perturbation to the rth dimension of pbest_i according to (7), in which [A_r, B_r] is the range of the rth parameter and u is generated randomly subject to the uniform distribution between 0 and 1. pbest_i is perturbed with probability P_SA each time for the purpose of "fine-grained" local search around the promising regions. pbest_i is updated in a greedy way, but the new position generated by the cognition-only model is accepted subject to the Metropolis rule (Kirkpatrick, 1983). A local search around the promising region can thus be performed, and the local refinement ability of PSO is enhanced by SA_local_search. A sketch follows.
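Below is a minimal sketch of this operator, again assuming minimization and bounds arrays A and B. For brevity, the cognition-only velocity step of Figure 4 is folded into a direct perturbation of the current point, and the Cauchy step scale (0.1 of the parameter range) and the cooling schedule are our assumptions, since the paper defers them to equation (7) and Figure 4.

```python
# A sketch of SA_local_search: Cauchy perturbation of random dimensions with
# probability P_SA, acceptance by the Metropolis rule, greedy update of pbest.
import numpy as np

def sa_local_search(pbest, pbest_f, fitness, A, B,
                    length=100, t0=10.0, p_sa=0.1, rng=None):
    rng = rng or np.random.default_rng()
    x, fx = pbest.copy(), pbest_f
    best_x, best_f = pbest.copy(), pbest_f
    for k in range(length):
        cand = x.copy()
        for r in range(len(cand)):
            if rng.random() < p_sa:              # perturb dimension r with prob. P_SA
                u = rng.random()
                # Cauchy deviate via inverse CDF; the 0.1 scale is assumed
                cand[r] += 0.1 * (B[r] - A[r]) * np.tan(np.pi * (u - 0.5))
        cand = np.clip(cand, A, B)
        fc = fitness(cand)
        t = t0 / (1 + k)                         # assumed cooling schedule
        # Metropolis rule: accept improvements always, worse moves with prob. exp(-dF/T)
        if fc < fx or rng.random() < np.exp(-(fc - fx) / t):
            x, fx = cand, fc
        if fx < best_f:                          # pbest itself is updated greedily
            best_x, best_f = x.copy(), fx
    return best_x, best_f
```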

3.5. Adaptive Memetic Strategy for CLPSO

MCLPSO can be presented as a combination of the CLPSO with Chaotic_local_search and SA_local_search. The memetic strategy used in MCLPSO can be described as follows:

Adaptive Memetic Strategy 1: SA_local_search is only applied to the promising particle to give fine-grained local search around the promising regions, and the Chaotic_local_search should be applied to the stagnant particle which cannot improve its own pbest by the comprehensive learning strategy to enable the stagnant particles to escape from the local optima.

Although, in some other MAs, local search is applied to all particles, we adopt Adaptive Memetic Strategy 1 in MCLPSO because of the high cost of local search and because frequent application of local search would result in a severe loss of diversity. Under Adaptive Memetic Strategy 1, the swarm evolves along with the local refinement around the promising regions and the chaotic local search of the stagnant particles. A pseudocode for MCLPSO is described in Figure 5.

Adaptive Memetic Strategy 1 answers two of the design and parameterization issues mentioned in the last section: the local search operators are applied adaptively according to the current particle's quality and status. SA_local_search is always applied to promising candidate solutions, and Chaotic_local_search is always applied to a stagnant particle that cannot improve its pbest by the comprehensive learning strategy. For the third question, the depth of Chaotic_local_search and SA_local_search, we believe a moderate value of SA_ls_length is sufficient for SA_local_search to find the local optimum, because the local search is always applied to particles of high quality, and Chaotic_ls_length is set to the same value to balance exploration and exploitation.

In MCLPSO, the velocity of particle i is restrained within [V_min^d, V_max^d], the range of the dth velocity value, by V_i^d = min(V_max^d, max(V_min^d, V_i^d)), and f(x_i) is evaluated only if x_i is inside the search bounds. All the pbest are kept inside the search bounds, and a particle outside the bounds will be attracted back by its learning exemplars. A condensed sketch of the whole MCLPSO cycle follows.
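The sketch below reuses the select_exemplars, clpso_velocity, chaotic_local_search, and sa_local_search sketches above. The counter bookkeeping follows Section 3.1, but the reset rules after each meme and the clamping of positions (instead of skipping out-of-bounds evaluations) are simplifications of ours.

```python
# A condensed sketch of the MCLPSO cycle of Figure 5, under the assumptions above.
import numpy as np

def mclpso(fitness, dim, xmin, xmax, ps=20, gens=100, c=1.49445, m=7,
           stagnant_max=10, improve_max=3, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(xmin, xmax, (ps, dim))
    v = np.zeros((ps, dim))
    pbest = x.copy()
    pbest_f = np.array([fitness(p) for p in x])
    flag = np.zeros(ps, int)                       # generations without improvement
    stagnant = np.zeros(ps, int)                   # f_i reassignments without improvement
    improve = np.zeros(ps, int)                    # consecutive improving generations
    exemplars = [select_exemplars(i, pbest, pbest_f, rng) for i in range(ps)]
    for g in range(gens):
        w = 0.9 - 0.5 * g / gens                   # inertia weight: 0.9 -> 0.4
        for i in range(ps):
            if flag[i] >= m:                       # refreshing gap reached: reassign f_i
                exemplars[i] = select_exemplars(i, pbest, pbest_f, rng)
                flag[i] = 0
                stagnant[i] += 1
            v[i] = clpso_velocity(v[i], x[i], pbest, exemplars[i], w, c, rng)
            x[i] = np.clip(x[i] + v[i], xmin, xmax)
            f = fitness(x[i])
            if f < pbest_f[i]:
                pbest[i], pbest_f[i] = x[i].copy(), f
                flag[i], improve[i], stagnant[i] = 0, improve[i] + 1, 0
            else:
                flag[i], improve[i] = flag[i] + 1, 0
            if stagnant[i] >= stagnant_max:        # stagnant particle: chaotic meme
                pbest[i], pbest_f[i] = chaotic_local_search(
                    pbest[i], pbest_f[i], fitness, xmin, xmax, rng=rng)
                stagnant[i] = 0
        b = int(np.argmin(pbest_f))
        if improve[b] >= improve_max:              # promising particle: SA meme
            pbest[b], pbest_f[b] = sa_local_search(
                pbest[b], pbest_f[b], fitness, xmin, xmax, rng=rng)
            improve[b] = 0
    b = int(np.argmin(pbest_f))
    return pbest[b], pbest_f[b]
```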

4. Clustering of Variables Based on MCLPSO

The main objective of this work is to improve the k-means-based variable clustering algorithm by MCLPSO. As mentioned in Section 1, the k-means clustering method is sensitive to the initial centroids and is easily trapped in local optima, but k-means is still the most popular clustering algorithm because of its effectiveness and efficiency, and some variable clustering algorithms are implemented based on k-means. In this section, we first introduce the k-means-based variable clustering algorithm; then MCLPSO is used to optimize its initial centroids.

Some notations used in variable clustering are defined as follows:
(i) X = (X1, X2, …, XN) is an N-dimensional multivariate random variable, in which each Xi is a continuous random variable
(ii) x = (x1, x2, …, xN) is an observation of X
(iii) S = {x(1), x(2), …, x(M)} is a dataset composed of M observations of X

We consider hard partitioning clustering in this work, so each variable belongs to exactly one cluster. Based on the above notations, we can give a formal description of variable clustering:

Definition 1. A clustering of variables can be defined as a K-partition of the variable set:

Partition_K = {Cluster_1, Cluster_2, …, Cluster_K}  (8)

Partition_K should satisfy the following constraints:

Cluster_k ≠ ∅, 1 ≤ k ≤ K  (9)

Cluster_i ∩ Cluster_j = ∅ for 1 ≤ i ≠ j ≤ K, and ∪_{k=1}^{K} Cluster_k = {X1, X2, …, XN}  (10)

In Sections 5 and 6, we choose datasets generated by the sensors of a complex manufacturing system, so the variables discussed in this section are quantitative. In [20], variable clustering methods are developed to cluster variables of mixed types.

4.1. Principal Component Analysis—PCA

The centroid update rule is critical to k-means-based variable clustering. In almost all the variable clustering algorithms in the literature, PCA is used to compute the first principal component as the centroid of a group of variables in a cluster. In this section, we first give a brief introduction to PCA.

PCA is a widely used dimension reduction method. The essence of PCA is a coordinate transformation: the projection of the data on the new coordinates maximizes the variance. C is the sample covariance matrix of dataset S, defined by (11), where x̄ = (1/M)·Σ_{i=1}^{M} x_i:

C = (1/(M − 1))·Σ_{i=1}^{M} (x_i − x̄)ᵀ(x_i − x̄)  (11)

From (11), C is a real symmetric matrix. By the properties of real symmetric matrices, C has N real eigenvalues (λ1, λ2, …, λN), where λi = λj is possible for 1 ≤ i ≠ j ≤ N, and the eigenvectors of C (U1, U2, …, UN) corresponding to (λ1, λ2, …, λN) are real vectors. Eigenvectors corresponding to different eigenvalues are orthogonal to each other:

C·Uj = λj·Uj  (12)

where λj is an eigenvalue of C and Uj is the eigenvector corresponding to λj. Let λ1 ≥ λ2 ≥ … ≥ λN, so that U1, U2, …, UN are sorted by their corresponding eigenvalues, and let U be the N × N orthogonal matrix whose jth column is Uj. The projection of S on the U1 direction has the largest variance, the projection of S on U2 has the second largest variance, and so on. We can choose the T eigenvectors with the T largest eigenvalues, U′ = (U1, U2, …, UT); U′ is a submatrix of U obtained by deleting some columns from U. PCA transforms x_i to x′_i by projecting x_i onto the new coordinate system U′ as in (13), where the dimension T of x′_i is less than N:

x′_i = x_i·U′  (13)

S′ = SU′ is the reduced dataset, the projection of S on U′.

Based on the discussion above, we can summarize the steps to calculate the FPC (First Principal Component) of S:
(1) Calculate the sample covariance matrix C of S
(2) Calculate the eigenvalues λ1, λ2, …, λN of C by the Jacobi method
(3) Choose the largest eigenvalue λ1 and its corresponding eigenvector U1
(4) Compute the projection of S on U1 to obtain FPC = SU1

The pseudocode of FPC is described in Figure 6, and a compact sketch follows.
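As an illustration, the FPC steps map directly onto a few NumPy calls; this sketch uses NumPy's symmetric eigensolver in place of the Jacobi method named in step (2).

```python
# A sketch of the FPC computation of Figure 6.
import numpy as np

def first_principal_component(S):
    """Return the first principal component of dataset S (M rows, N columns)."""
    C = np.cov(S, rowvar=False)               # step (1): sample covariance, eq. (11)
    eigvals, eigvecs = np.linalg.eigh(C)      # step (2): real eigenpairs of symmetric C
    u1 = eigvecs[:, np.argmax(eigvals)]       # step (3): eigenvector of largest eigenvalue
    return S @ u1                             # step (4): FPC = S U1
```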

4.2. Variable Clustering Based on KMEANSVAR

We use MCLPSO to optimize the k-means-based variable clustering algorithm KMEANSVAR, which is the same as CLV_kmeans in the R package ClustVarLV [19]. In KMEANSVAR, the variables are clustered iteratively, and the key components of KMEANSVAR are defined as follows:

(1) Similarity. In variable clustering, the similarity between variables is usually defined by a correlation coefficient. In KMEANSVAR, the Pearson correlation (14) is used to measure the similarity between variables:

r(Xi, Xj) = cov(Xi, Xj)/(σ_Xi·σ_Xj)  (14)

The more highly correlated two variables are, the closer they are to each other, and vice versa. The similarity between variables is defined by (15):

sim(Xi, Xj) = r²(Xi, Xj)  (15)

(2) Update of centroid. In KMEANSVAR, the centroid of a cluster of variables is always kept as the FPC of the variables in the cluster:

centroid_k = FPC(SCluster_k)  (16)

SCluster_k is the sample composed of the M observations of the random vector (Xk1, Xk2, …, XkP) of the variables in Cluster_k; SCluster_k can be obtained by keeping Xk1, Xk2, …, XkP and deleting X − {Xk1, Xk2, …, XkP} from S. FPC(SCluster_k) is the centroid of Cluster_k.

(3) Clustering criterion. The quality of a clustering result is measured by the clustering criterion; a high-quality clustering of variables maximizes it. In [20], a clustering criterion is proposed for both quantitative and qualitative variables; in this work, we only take quantitative variables into consideration. In KMEANSVAR, the clustering criterion is defined by the homogeneity of the variables in each cluster. H(Cluster_k) denotes the homogeneity of the variable cluster Cluster_k, which is defined by (17), where centroid_k is the centroid of Cluster_k obtained by FPC(SCluster_k):

H(Cluster_k) = Σ_{Xi ∈ Cluster_k} r²(Xi, centroid_k)  (17)

H(Partition_K) denotes the homogeneity of a clustering of variables Partition_K, which is defined by

H(Partition_K) = Σ_{k=1}^{K} H(Cluster_k)  (18)

Based on the discussion above, we can give the steps of KMEANSVAR in detail:
(1) Initialize the K cluster centroids centroid_1, …, centroid_K for the clusters Cluster_1, …, Cluster_K
(2) Clear the clusters Cluster_1, …, Cluster_K
(3) For each Xi ∈ X, find its nearest cluster Cluster_nearest and assign Xi to Cluster_nearest; the distance between a variable and a cluster is defined by the distance between the variable and the centroid of the cluster, as in (19) and (20):

d(Xi, Cluster_k) = d(Xi, centroid_k)  (19)

d(Xi, centroid_k) = 1 − sim(Xi, centroid_k)  (20)

(4) Compute centroid_k = FPC(SCluster_k) as the new centroid of each Cluster_k
(5) Repeat steps (2) to (4) until the maximum number of iterations is reached

The pseudocode of KMEANSVAR is described in Figure 7, and a compact sketch follows.
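The sketch below follows steps (1) through (5), with similarity as the squared Pearson correlation of equations (14)-(15), distances implied by equations (19)-(20), and centroids recomputed with the first_principal_component sketch above; the handling of empty or singleton clusters is our simplification.

```python
# A compact sketch of KMEANSVAR under the assumptions stated above.
import numpy as np

def kmeansvar(S, init_centroids, max_iter=10):
    """S: M x N matrix whose columns are variables; init_centroids: K length-M vectors."""
    centroids = [c.copy() for c in init_centroids]
    K, N = len(centroids), S.shape[1]
    labels = np.zeros(N, int)
    for _ in range(max_iter):
        for j in range(N):                    # step (3): assign to most similar centroid
            sims = [np.corrcoef(S[:, j], c)[0, 1] ** 2 for c in centroids]
            labels[j] = int(np.argmax(sims))
        for k in range(K):                    # step (4): recompute centroid as FPC
            members = S[:, labels == k]
            if members.shape[1] == 1:
                centroids[k] = members[:, 0].copy()
            elif members.shape[1] > 1:
                centroids[k] = first_principal_component(members)
    H = sum(np.corrcoef(S[:, j], centroids[labels[j]])[0, 1] ** 2 for j in range(N))
    return labels, H                          # H is the criterion of eqs. (17)-(18)
```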

4.3. Variable Clustering Based on MCLPSO

Although KMEANSVAR can cluster the variables efficiently, it is as sensitive to the initial centroids as other k-means-based methods: the clustering criterion is easily trapped in local optima, and the quality of the clustering cannot be guaranteed. To overcome this shortcoming, MCLPSO is used to optimize the initial centroids for KMEANSVAR, yielding MCLPSO-KMEANSVAR. In MCLPSO-KMEANSVAR, the solution is coded as the initial centroids for KMEANSVAR, and KMEANSVAR is embedded into the objective function of MCLPSO:

(1) Coding of the solution: particle i's solution is coded as a D-dimensional vector (21), where D = KM, K is the number of clusters, and M is the number of observations:

solution_i = (centroid_i1, centroid_i2, …, centroid_iK)  (21)

The kth component of solution_i, centroid_ik, denotes that the centroid of the kth cluster, centroid_k, is initialized by centroid_ik, so solution_i determines the initial centroids of the clusters. The pos_i and pbest_i of particle i can be denoted as solution_i.

(2) Objective function: in order to improve the quality of the clustering of variables, MCLPSO-KMEANSVAR optimizes H(Partition_K) by optimizing the initial centroids for KMEANSVAR. solution_i can be decomposed into the K initial centroids centroid_i1, …, centroid_iK. The clustering of variables is obtained by calling KMEANSVAR parameterized by these centroids, i.e., Partition_K = KMEANSVAR(centroid_i1, …, centroid_iK), and the clustering criterion H(Partition_K) is obtained by (18). 1/H(Partition_K) is defined as the value of the objective function f. Because KMEANSVAR is embedded into the objective function of MCLPSO, the clustering result of KMEANSVAR can be optimized by adjusting the initial centroids for KMEANSVAR (Figure 8). A sketch of this objective function follows.
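The sketch below decodes the D = K·M position vector of equation (21) into K initial centroids, runs KMEANSVAR for its 10 inner iterations, and returns the reciprocal of the homogeneity so that MCLPSO's minimization maximizes the criterion; make_fitness is our name, not the authors'.

```python
# A sketch of the MCLPSO-KMEANSVAR objective function.
import numpy as np

def make_fitness(S, K):
    M = S.shape[0]
    def fitness(solution):
        centroids = [solution[k * M:(k + 1) * M] for k in range(K)]  # decode eq. (21)
        _, H = kmeansvar(S, centroids, max_iter=10)                  # 10 inner iterations
        return 1.0 / H if H > 0 else np.inf                          # f = 1 / H(Partition_K)
    return fitness
```

A full run would then be fitness = make_fitness(S, K) followed by mclpso(fitness, dim=K * S.shape[0], xmin=..., xmax=...), with the bounds derived from the normalized data.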

5. Experiment

Since the clustering criterion of the clustering results has been defined in Section 4, we give some experimental results in this section. We evaluate the performance of the proposed algorithm MCLPSO-KMEANSVAR and compare it with some other variable clustering methods: CLPSO-KMEANSVAR (KMEANSVAR initialized by CLPSO) and the original version of KMEANSVAR with random initialization. In [18-20], cutting the dendrogram is recommended to initialize the k-means-based variable clustering method, but it is also stated that hierarchical variable clustering lacks scalability as the number of candidate variables increases because of its O(N²) complexity, where N is the number of candidate variables. We therefore choose a more scalable initialization, k-means++ seeding [27], in which the first cluster center is chosen uniformly at random from the data points being clustered, after which each subsequent cluster center is chosen from the remaining data points with probability proportional to its squared distance to the point's closest existing cluster center. It is easy to apply k-means++ initialization to KMEANSVAR and obtain KMEANSVAR++; a sketch of this seeding adapted to variables follows. The variable clustering methods and the experimental settings are listed in Table 1.
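In the sketch below, the "points" are the columns of S (the variables) and the distance to a chosen center is 1 − r² as in equation (20); kmeanspp_seed_variables is our name, and the exact distance used in KMEANSVAR++ may differ.

```python
# A sketch of k-means++-style seeding adapted to variable clustering.
import numpy as np

def kmeanspp_seed_variables(S, K, rng=None):
    rng = rng or np.random.default_rng()
    N = S.shape[1]
    chosen = [int(rng.integers(N))]                 # first center: uniform at random
    for _ in range(K - 1):
        d2 = np.empty(N)
        for j in range(N):
            # squared distance to the closest already-chosen center
            d2[j] = min((1 - np.corrcoef(S[:, j], S[:, c])[0, 1] ** 2) ** 2
                        for c in chosen)
        chosen.append(int(rng.choice(N, p=d2 / d2.sum())))  # proportional to d^2
    return [S[:, c].copy() for c in chosen]
```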

5.1. Datasets

We choose several real-world datasets as benchmarks to test the variable clustering methods in Table 1. Detailed information about the datasets is listed in Table 2. D1 is chosen from the UCI datasets. D2 and D3 are collected from the MES (Manufacturing Execution System) database of a large-scale semiconductor manufacturing system located in Shanghai: D2 is composed of the values of the manufacturing performance variables, and D3 is composed of the values of the manufacturing status variables. D4 is the SECOM dataset described in [28]: a complex modern semiconductor manufacturing process of house line testing is normally under consistent surveillance via the monitoring of signals/variables collected from sensors and/or process measurement points, and SECOM is collected from the database of the FCS (Floor Control System) of the semiconductor manufacturing process. In D1-D4, only continuous variables are considered. The number of clusters for each dataset is set according to the number of variables.

To ensure the validity of the evaluation, D1-D4 are preprocessed before the experiments. First, all the values are normalized to [0, 1] by the Min-Max normalization method. D4, in particular, contains some null values because the FCS is sometimes influenced by sensor drifting, which results in data loss. Therefore, we apply the following rules to clean D4 (a sketch of this preprocessing follows the list):
(i) Remove the variables with unchangeable data
(ii) Remove the variables with more than 50% missing data
(iii) Remove the data items with more than 30% missing data
After cleaning, the complete dataset D4 consists of 1560 instances, each with 440 variables. D4 is a difficult variable clustering problem.
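A sketch of these cleaning rules plus the Min-Max normalization, assuming a pandas DataFrame with data items as rows and variables as columns; the text does not specify how any remaining scattered nulls are filled, so this sketch leaves them untouched.

```python
# A sketch of the D4 cleaning rules (i)-(iii) and Min-Max normalization.
import pandas as pd

def clean_and_normalize(df: pd.DataFrame) -> pd.DataFrame:
    df = df.loc[:, df.nunique(dropna=True) > 1]      # (i) drop unchangeable variables
    df = df.loc[:, df.isna().mean() <= 0.5]          # (ii) drop variables >50% missing
    df = df.loc[df.isna().mean(axis=1) <= 0.3, :]    # (iii) drop items >30% missing
    return (df - df.min()) / (df.max() - df.min())   # Min-Max normalization to [0, 1]
```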

5.2. Parameter Setting

There are many parameters in MCLPSO. According to the "No Free Lunch" theorem [29], there is no single optimal parameterization. We set the parameters by following the empirical rules mentioned in previous studies [30, 31].

For the parameters of the global search component of MCLPSO, the inertia weight w decreases from 0.9 to 0.4 linearly, c = 1.49445, m = 7, the number of generations is set to 100, and the population size is set to 20. For the parameters of SA_local_search, T0 = 10 to give a fine-grained local search. For the parameters of the "cognition-only" model, the inertia weight decreases from 0.9 to 0.4 linearly along the evolution cycles, and c1 = 1.49445. P_SA and P_Chaotic are both set to 0.1. For the other parameters, we found two heuristic rules through tentative experiments. Chaotic_ls_length should be positively correlated with stagnant_max, as stagnant_max determines the degree of stagnation of a particle: a high value of stagnant_max implies that a high value of Chaotic_ls_length is needed to enable the stagnant particle to escape from the local optimum. SA_ls_length is negatively correlated with improve_max, because a high value of improve_max denotes a high quality of a promising particle, and a moderate value of SA_ls_length is enough to detect the local optima. These parameters are set empirically: stagnant_max = 10, improve_max = 3, Chaotic_ls_length = 100, and SA_ls_length = 100. These values are collected below for reference.
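```python
# The parameter values of this section collected in one place for the sketches
# of Section 3; the key names are ours, the values are those stated above.
MCLPSO_PARAMS = {
    "w_range": (0.9, 0.4),        # inertia weight, decreased linearly
    "c": 1.49445,                 # comprehensive learning weight
    "m": 7,                       # refreshing gap
    "generations": 100,
    "population_size": 20,
    "T0": 10,                     # initial temperature of SA_local_search
    "c1": 1.49445,                # cognition-only model weight
    "P_SA": 0.1,
    "P_Chaotic": 0.1,
    "stagnant_max": 10,
    "improve_max": 3,
    "Chaotic_ls_length": 100,
    "SA_ls_length": 100,
}
```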

In CLPSO, the inertia weight decreases from 0.9 to 0.4 linearly, and c is set to 1.49445, as recommended in [5].

In KMEANSVAR, the number of clusters for each dataset is specified in Table 2. When KMEANSVAR is evaluated inside the fitness function, the maximum number of iterations is set to 10.

5.3. Results and Discussion

The mean values and standard deviations are recorded in Table 3, with the best results in bold.

First, we assess the effect of introducing metaheuristic optimization on variable clustering. From Table 3, we can see that the mean values of the clustering criterion obtained by KMEANSVAR on D1-D4 are relatively poor because of the intrinsic deficiency of k-means clustering, its sensitivity to the initial centroids: KMEANSVAR is easily trapped in local optima, which results in relatively poor variable clusterings. KMEANSVAR also shows a large standard deviation, so its performance is not stable. Even on the simplest dataset, D2, with only 11 variables, the clustering criterion values obtained by KMEANSVAR are not satisfactory and the variance values remain large. KMEANSVAR++ improves on KMEANSVAR by choosing centroids with probability proportional to the squared distance to the closest existing cluster center; the improvement is definite but not significant. CLPSO-KMEANSVAR improves the clustering results significantly compared with KMEANSVAR: the mean values of the clustering criterion are improved, and the variance values are also reduced. Therefore, the clustering result can be improved more significantly by introducing metaheuristic optimization than by using k-means++ seeding.

Second, we analyze the effect of introducing the local search operators and the adaptive memetic strategy into the population-based metaheuristic optimization. From Table 3, we can see that on D1-D2 the mean values of the clustering criterion obtained by MCLPSO-KMEANSVAR are similar to those obtained by CLPSO-KMEANSVAR. The number of possible clustering results can be derived from the number of variables and the number of clusters; for example, the numbers of possible clustering results on D2 are C(11, 2), C(11, 3), and C(11, 4) when the number of clusters is 2, 3, and 4, respectively. Therefore, the difference between MCLPSO-KMEANSVAR's results and CLPSO-KMEANSVAR's results on D1-D2 is not significant because of the limited numbers of clusters and variables. When the numbers of variables and clusters increase, the advantage of MCLPSO-KMEANSVAR becomes more significant. On D3, MCLPSO-KMEANSVAR performs better than CLPSO-KMEANSVAR, and the improvement becomes more significant as the number of clusters increases. The advantage of MCLPSO-KMEANSVAR is most significant on D4, a complex real-world industrial dataset with 440 variables. Therefore, the cooperation of the global and local search operators takes effect when dealing with datasets with large numbers of variables.

Furthermore, we analyze the robustness of MCLPSO-KMEANSVAR when dealing with the complex real-world dataset. MCLPSO-KMEANSVAR can generally improve the quality of the clustering of variables by optimizing the clustering criterion, but its variance values on D1-D4 are not reduced. To show the robustness of the above approaches, boxplots of the results of MCLPSO-KMEANSVAR, CLPSO-KMEANSVAR, and KMEANSVAR on D4 are depicted in Figures 9-11. From Figure 9, we can see that KMEANSVAR lacks robustness because of its intrinsic deficiency, while CLPSO-KMEANSVAR's result distributions are flatter; the variable clustering results are improved significantly by introducing metaheuristic optimization. Compared with CLPSO-KMEANSVAR's results, the range and interquartile range of MCLPSO-KMEANSVAR's results are relatively larger, so robustness is not improved by introducing the local search operators and the adaptive memetic strategy. However, Figure 11 shows that MCLPSO-KMEANSVAR can avoid some extremely bad cases, and Figure 10 shows that its probability of finding satisfactory results is also higher.

To verify that the improvement of MCLPSO-KMEANSVAR over CLPSO-KMEANSVAR is definite, nonparametric Wilcoxon rank sum tests are conducted between the MCLPSO results and the CLPSO results. The test results are presented in the last row of Table 3. If h = 1, the performances of the two algorithms are statistically different with 95% certainty; if h = 0, they are not statistically different. From Table 3, we see that MCLPSO-KMEANSVAR and CLPSO-KMEANSVAR become statistically different as K and the number of candidate variables increase. A minimal example of this test follows.
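The rank sum test is available directly in SciPy; in this sketch, a and b would hold the clustering criterion values of repeated MCLPSO-KMEANSVAR and CLPSO-KMEANSVAR runs.

```python
# A minimal example of the nonparametric Wilcoxon rank sum test used here.
from scipy.stats import ranksums

def compare_runs(a, b, alpha=0.05):
    stat, p = ranksums(a, b)     # Wilcoxon rank sum test
    h = int(p < alpha)           # h = 1: statistically different with 95% certainty
    return h, p
```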

5.4. Implementation and Computational Time

The algorithms discussed above are all implemented in Java 8, so we can use multithreading to accelerate the particles' comprehensive learning process. We run the code on an Intel i5-8365U CPU with a parallelism of 8, i.e., 8 particles can perform the comprehensive learning operation simultaneously.

When we use MCLPSO to optimize KMEANSVAR, we run MCLPSO with a maximum number of evaluations (calls to KMEANSVAR) of 2000, as stated in Table 1. As noted in Section 5.2, when KMEANSVAR is called in the fitness function, the number of iterations is restricted to 10. The computational time of MCLPSO-KMEANSVAR is about 40-60 times that of KMEANSVAR.

The computational times on D4 with different K are listed in Table 4. The computational times of MCLPSO-KMEANSVAR and KMEANSVAR show a linear increase with respect to K, but that of KMEANSVAR++ increases dramatically because the initialization of KMEANSVAR++ is sensitive to K. Therefore, MCLPSO-KMEANSVAR is more scalable than KMEANSVAR++ (Table 5).

6. A Web-Based Interactive Software Platform

In Section 5, some datasets were used to evaluate MCLPSO-KMEANSVAR. Except for D1, the datasets D2-D4 are collected from the information system databases of semiconductor manufacturing factories; the relationship between them is explained in Figure 12. Variable clustering analysis of the variables of D2-D4 is important and practical work: it helps to find useful insights into manufacturing systems from different perspectives and to improve operation management through further analysis such as performance analysis, optimal control, and fault diagnosis.

For practical usage, we have also developed a web-based interactive software platform based on MCLPSO-KMEANSVAR. In this section, we introduce the usage of the software platform by demonstrating each step, with the performance analysis of a semiconductor manufacturing system as a case study.

6.1. Performance of Semiconductor Manufacturing System

The semiconductor manufacturing system is a very complicated system, and its performance can be affected by the manufacturing environment, scheduling rules, equipment failure rates, and rush orders. Performance analysis is useful for improving the operation management of a semiconductor manufacturing system. We choose 8 performance variables, in which Y1-Y3 are long-term global performance measures, Y4-Y6 are short-term global performance measures, and Y7-Y8 are short-term local performance measures. The detailed descriptions are presented in Table 3.

6.2. A Web-Based Interactive Variable Clustering System

First, the dataset of the performance history data should be uploaded (Figure 13).

Then some statistics of each variable can be found in Figure 14. The user can also choose some thresholds to smooth the outliers for each variable before variable clustering.

The result of MCLPSO-KMEANSVAR is presented in Figure 15. We can reduce the number of optimization objectives by variable clustering.

7. Conclusion

In this work, MCLPSO, a novel memetic algorithm presented in our previous research, is introduced as a metaheuristic approach to improve k-means-based variable clustering. The experimental results show that MCLPSO-KMEANSVAR outperforms KMEANSVAR significantly. We also developed a web-based interactive software platform implementing MCLPSO-KMEANSVAR and gave a case study of performance analysis for a semiconductor manufacturing system. In future research, we will further study the practical use of MCLPSO-KMEANSVAR in other problems and develop a distributed MCLPSO-KMEANSVAR for analyzing big data.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

In this work, the design, implementation, experiments, and part of the case study were finished by JiaCheng Ni and Li Li at Tongji University. The paper revisions and part of the case study were finished by JiaCheng Ni at DELL EMC.

Acknowledgments

The authors would like to thank Prof. P. N. Suganthan for providing the codes of his research group. The authors would also like to thank Zhen Jia, Qiang Chen, and Jinpeng Liu for discussing the usage of machine learning in Internet of Things (IoT) applications. This work was supported by the Key Research and Development Project of the National Ministry of Science and Technology under grant no. 2018YFB1305304 and the National Natural Science Foundation of China under grant no. 61873191.