Weighted k-Prototypes Clustering Algorithm Based on the Hybrid Dissimilarity Coefficient
The k-prototypes algorithm is a hybrid clustering algorithm that can process Categorical Data and Numerical Data. In this study, the method of initial Cluster Center selection was improved and a new Hybrid Dissimilarity Coefficient was proposed. Based on this coefficient, a weighted k-prototypes clustering algorithm (WKPCA) was proposed. The WKPCA algorithm not only improves the selection of initial Cluster Centers, but also puts forward a new method to calculate the dissimilarity between data objects and Cluster Centers. Real datasets from the UCI repository were used to test the WKPCA algorithm. Experimental results show that the WKPCA algorithm is more effective and robust than other k-prototypes algorithms.
Cluster analysis belongs to unsupervised learning and is an important research direction in the field of machine learning. As an important data analysis tool, clustering analysis can divide data objects into different subclusters by calculating the dissimilarity of data objects, without labeled samples. The goal is for data objects in the same cluster to have low dissimilarity and for data objects in different clusters to have high dissimilarity.
Clustering aims to find the correlation between subclusters in datasets and to evaluate the dissimilarity among data objects in these subclusters. In the field of Categorical Data clustering, the classical k-modes algorithm uses the modes vector to represent the Cluster Centers. The modes vector is composed of the most frequent eigenvalue of each feature in the subcluster. The dissimilarity between the data objects to be clustered and the cluster is calculated by the simple Hamming distance, and only Categorical Data can be processed. In the field of Numerical Data clustering, the classical k-means algorithm uses the means vector to represent the Cluster Centers, and the means vector is the average value of each feature in the subcluster. The Euclidean distance is used to calculate the dissimilarity between the cluster and the data objects to be clustered, and only Numerical Data can be processed. Both the k-modes algorithm and the k-means algorithm can handle only a single type of data.
In practical clustering tasks, people need to deal not only with Categorical Data and Numerical Data, but also with a large number of mixed-type datasets composed of both. Because there is a big difference between Categorical Data and Numerical Data, and mixed-type datasets are usually high dimensional, it is very complicated to deal with mixed-type data in cluster analysis. A simple method to deal with mixed-type data is data preprocessing, which directly converts each Categorical Feature into a Numerical Feature. In other words, the mixed-type data are converted into Numerical Data, and then a Numerical Data clustering algorithm is applied. For example, each Categorical Feature is converted to a binary string, and then an algorithm based on Numerical Data is used for clustering. However, there are four disadvantages to using binary encoding for data preprocessing: (1) the original structure of the Categorical Data is destroyed, and the binary features obtained after conversion are meaningless; (2) the implicit information of dissimilarity is ignored, so the structure of the dataset cannot be truly reflected; (3) if the range of eigenvalues is large, the converted binary features will have a larger dimension; and (4) maintenance is difficult: if new eigenvalues are added for a Categorical Feature, then all data objects will change.
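Disadvantages (2) to (4) can be illustrated with a minimal sketch. The helper `one_hot` and the toy column `colors` are hypothetical names introduced here for illustration only; the sketch assumes that "binary string" means one binary column per distinct eigenvalue.

```python
import numpy as np

def one_hot(values):
    """Binary-encode one categorical column; returns (matrix, sorted categories)."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)), dtype=int)
    for row, v in enumerate(values):
        encoded[row, index[v]] = 1
    return encoded, categories

colors = ["red", "green", "blue", "green"]  # toy Categorical Feature
encoded, categories = one_hot(colors)

# One binary column per distinct eigenvalue: the dimension grows with the value range.
print(encoded.shape)  # (4, 3)

# Any two different eigenvalues end up at the same squared distance (2),
# so any finer notion of dissimilarity between eigenvalues is lost.
print(((encoded[0] - encoded[1]) ** 2).sum(), ((encoded[0] - encoded[2]) ** 2).sum())

# Adding a new eigenvalue widens the encoding of every existing data object.
encoded2, categories2 = one_hot(colors + ["yellow"])
print(encoded2.shape)  # (5, 4)
```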
To solve these problems, researchers have carried out a series of exploratory studies. The k-prototypes algorithm and its variants are mixed-type data clustering algorithms that take into account the Dissimilarity Coefficient of the Categorical Features and the Numerical Features at the same time. Such algorithms can process Categorical Data and Numerical Data together, but clustering parameters need to be set manually. The OCIL algorithm is a parameter-free hybrid clustering algorithm that gives a uniform Dissimilarity Coefficient based on entropy. Like k-prototypes and its variants, the OCIL algorithm follows the k-means paradigm to process mixed-type data. It is an iterative algorithm, sensitive to initialization, and more suitable for spherically distributed data. Ji et al. improved the k-prototypes algorithm by considering the influence of feature weight on the clustering process and proposed a Dissimilarity Coefficient based on feature importance. Renato et al. further improved the algorithm of Ji, endowed different features with different weights, and used the Lp Distance Function as the new Dissimilarity Coefficient. Yao et al. proposed an anonymous algorithm for hierarchical clustering based on k-prototypes (short name KLS). The KLS algorithm improves the formula of the Dissimilarity Coefficient and unifies the weight setting of the Categorical Features and Numerical Features, but the weight needs to be specified in advance by experts. The DPC-KNN-PCA algorithm introduced the density peak algorithm into the k-prototypes algorithm to determine the initial Cluster Centers and improved the local neighborhood density through the nearest neighbor algorithm. However, the selection of the nearest neighbor value is easily affected by the dataset distribution. Dongwei et al. proposed a k-prototypes algorithm based on the adaptive determination of the initial centroids (short name KP-ADIC).
The KP-ADIC algorithm can determine the initial Cluster Centers adaptively, but its Dissimilarity Coefficient cannot fully capture the dissimilarity between the data. Sangam et al. proposed an equi-biased k-prototypes algorithm for clustering mixed-type data (short name EKACMD) in 2018. The EKACMD algorithm is a variant of k-prototypes that improves the Dissimilarity Coefficient by considering the relative frequency and distribution of each Categorical Feature. The EKACMD algorithm can fully consider the structural characteristics of data in some cases and improve the clustering accuracy, but it is still not applicable when the occurrence frequencies of the eigenvalues of the Categorical Data are equal. Cui et al. applied rough sets to the k-prototypes algorithm and proposed the RS-KP algorithm, which uses rough sets to calculate the dissimilarity between eigenvalues. Although the RS-KP algorithm can deal with outliers in the clustering of mixed-type data, it has difficulty clustering the discretized data when the eigenvalue ranges overlap; that is, the clustering results of the RS-KP algorithm are easily affected by the discretization.
In this paper, the k-prototypes algorithm and its variants are analyzed and compared, the automatic determination method of the initial Cluster Centers is improved, and a new Hybrid Dissimilarity Coefficient is proposed. The value of these improvements lies in the following: (1) the randomness in the selection of the initial Cluster Centers is avoided; (2) the clustering method is more suitable for the characteristics of mixed-type data; (3) there is no need to manually set parameters in the clustering process, such as the number of clusters and the weight parameter γ; and (4) there is no limitation on the types of data to be clustered. Categorical Data, Numerical Data, and mixed-type data can all be processed, which not only improves the clustering results, but also provides a new idea for the analysis and mining of real-world data.
The remainder of this paper is organized as follows: Section 2 introduces the symbols used in this paper. Section 3 introduces the k-prototypes algorithm. Section 4 details the design of the WKPCA algorithm, and Section 5 gives the experimental results and analysis. Finally, Section 6 summarizes the paper.
Table 1 shows the symbols associated with this article.
3. The k-Prototypes Algorithm
Huang proposed the k-prototypes algorithm for clustering mixed-type data, which combines the ideas of the k-means algorithm and the k-modes algorithm. The k-prototypes algorithm divides the dataset into different subclusters so as to minimize the value of the Cost Function. The Cost Function is shown in the following formula:
The k-prototypes algorithm combines the “means” of the numerical part and the “modes” of the categorical part to build a new hybrid Cluster Center, the “prototype”. On the basis of the “prototype,” it builds a Dissimilarity Coefficient formula and a Cost Function applicable to mixed-type data. The parameter γ is introduced to control the influence of the Categorical Features and the Numerical Features on the clustering process. It is assumed that the mixed-type dataset has Numerical Feature and Categorical Feature. For any , the definition of the Dissimilarity Coefficient of k-prototypes is shown in the following formula:
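For reference, the Cost Function and Dissimilarity Coefficient of the classical k-prototypes algorithm are commonly written in the following form. The notation below is an assumption and may differ from the symbols of Table 1: x_i is a data object, Q_l is the prototype of cluster l, the first p features are numerical (superscript r), the remaining m − p features are categorical (superscript c), and u_{il} ∈ {0, 1} is the membership of x_i in cluster l.

```latex
% Assumed notation; the equation numbers follow the references to formulas (1) and (2) in the text.
F(U, Q) = \sum_{l=1}^{k} \sum_{i=1}^{n} u_{il}\, d(x_i, Q_l) \tag{1}

d(x_i, Q_l) = \sum_{j=1}^{p} \left(x_{ij}^{r} - q_{lj}^{r}\right)^{2}
            + \gamma \sum_{j=p+1}^{m} \delta\!\left(x_{ij}^{c}, q_{lj}^{c}\right) \tag{2}

\delta(a, b) =
\begin{cases}
0, & a = b \\
1, & a \neq b
\end{cases}
```

The first sum in formula (2) is the squared Euclidean distance on the numerical part, and the second sum is the simple Hamming distance on the categorical part, scaled by γ.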
The k-prototypes algorithm divides the Dissimilarity Coefficient of the mixed-type data into two parts that are calculated separately. The categorical part adopts the simple Hamming distance, and the numerical part adopts the square of the Euclidean distance. The proportion of the two types of data in the Dissimilarity Coefficient is adjusted by parameter γ, which is an important adjustable parameter of the k-prototypes algorithm. The purpose of introducing parameter γ is to prevent the clustering result from being biased toward either the Categorical Features or the Numerical Features and to control the relative weight of the dissimilarity between Categorical Data and Numerical Data. In the Hamming distance, the contribution of a Categorical Feature is 0 when the two eigenvalues are equal and 1 when they differ. The basic steps of the k-prototypes algorithm are described as follows:
Step 1: k data objects are randomly selected from the dataset as the initial Cluster Centers.
Step 2: formula (2) is used to calculate the dissimilarity between each data object and each Cluster Center. According to the calculation result, each data object is allocated to the nearest cluster.
Step 3: according to the current Cluster Centers, the dissimilarity of each data object is recalculated, and the data objects are reassigned to the nearest subcluster. The Cluster Centers are then updated: the categorical part takes the most frequent value of each feature, and the numerical part takes the average value.
Step 4: repeat Steps 2 and 3 until the Cost Function no longer changes, at which point the algorithm ends.
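The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `kproto_dissimilarity` and `k_prototypes`, the `seed` and `max_iter` arguments, and the toy data are all introduced here for illustration, and the numerical and categorical parts are assumed to be stored in two separate arrays.

```python
import numpy as np

def kproto_dissimilarity(x_num, x_cat, c_num, c_cat, gamma):
    """Squared Euclidean distance (numerical part) + gamma * Hamming distance (categorical part)."""
    numeric = float(np.sum((x_num - c_num) ** 2))
    categorical = int(np.sum(x_cat != c_cat))
    return numeric + gamma * categorical

def k_prototypes(X_num, X_cat, k, gamma, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = X_num.shape[0]
    # Step 1: pick k distinct data objects at random as the initial Cluster Centers.
    idx = rng.choice(n, size=k, replace=False)
    C_num, C_cat = X_num[idx].astype(float).copy(), X_cat[idx].copy()
    prev_cost = None
    for _ in range(max_iter):
        # Step 2: assign every data object to its nearest Cluster Center.
        D = np.array([[kproto_dissimilarity(X_num[i], X_cat[i], C_num[l], C_cat[l], gamma)
                       for l in range(k)] for i in range(n)])
        labels = D.argmin(axis=1)
        cost = float(D[np.arange(n), labels].sum())
        # Step 4: stop once the Cost Function no longer changes.
        if prev_cost is not None and abs(cost - prev_cost) < 1e-12:
            break
        prev_cost = cost
        # Step 3: update centers (mean for numerical part, mode for categorical part).
        for l in range(k):
            members = labels == l
            if not members.any():
                continue  # keep the old center if a cluster becomes empty
            C_num[l] = X_num[members].mean(axis=0)
            for j in range(X_cat.shape[1]):
                vals, counts = np.unique(X_cat[members, j], return_counts=True)
                C_cat[l, j] = vals[counts.argmax()]
    return labels, C_num, C_cat, cost

# Toy mixed-type data: two numerical features and one categorical feature.
X_num = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
X_cat = np.array([["a"], ["a"], ["a"], ["b"], ["b"], ["b"]])
labels, C_num, C_cat, cost = k_prototypes(X_num, X_cat, k=2, gamma=1.0)
print(labels, cost)
```

Because Step 1 is random, different values of `seed` can lead to different partitions on less well-separated data, which is the sensitivity discussed in Section 3.1.2.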
3.1. Description of Problem
The k-prototypes algorithm can cluster mixed-type data, and its principle is simple and easy to apply, but there are still some shortcomings in the clustering process: (1) the random selection of the initial Cluster Centers makes the clustering results uncertain and random, and the number of clusters must be determined manually; (2) the simple Hamming distance is used to calculate the dissimilarity between the Categorical Data and the Cluster Centers, which loses information and cannot objectively reflect the real relationship between the data objects and the clusters, leading to inaccurate clustering results; (3) parameter γ, which adjusts the proportion between Categorical Data and Numerical Data, needs to be determined manually; and (4) the structural characteristics of Categorical Data and Numerical Data and the overall distribution of the dataset are not fully considered.
3.1.1. Problem with the Dissimilarity Coefficient
With the help of the artificial dataset shown in Table 2, the disadvantages of the user-specified parameter γ in the clustering process are discussed. The dataset contains 27 data objects, and each data object is described by two Numerical Features and one Categorical Feature. The Categorical Feature has three eigenvalues; the values of the two Numerical Features are in the range of 0–80.
For the Categorical Feature, a solid triangle, a solid circle, and a solid square are used to represent the data objects with each of the three eigenvalues. When parameter γ = 0, the clustering result depends only on the two Numerical Features. The clustering results are shown in Figure 1. The dataset has three clusters. To facilitate observation, the three clusters are separated by dotted lines. When γ > 0, a data object near a cluster boundary can be moved to a neighboring cluster, because most of the data objects in that cluster share its categorical eigenvalue. When the value of the parameter γ changes, the clustering results of the data objects x7 and x10 change accordingly: they may be assigned to cluster C1 or cluster C2. Other data objects may remain in their original cluster because they are too far away from the other clusters nearby, even if they have the same eigenvalues as the data objects in those clusters. In summary, it is important to define the parameter γ on the same scale as the features. For related discussion, see the literature.
3.1.2. Problems in Initial Cluster Center Selection
The classical k-prototypes algorithm is very sensitive to the initial Cluster Centers, which are selected by random initialization method or manual setting method, both of which lead to the instability of clustering results to a certain extent. The initial Cluster Centers with different locations and values will produce different clustering results. As shown in Figure 2, the actual cluster number of this dataset is . Figure 2 shows the clustering results generated by different initial Cluster Centers when the initial cluster number is set to , , and (the contents described in Figure 2 from left to right are random selection of initial Cluster Centers, clustering iteration process, and final clustering result). Therefore, it is very important for the clustering algorithm to find a suitable initial Cluster Center.
4. Weighted k-Prototype Clustering Algorithm Based on Hybrid Dissimilarity Coefficient (WKPCA)
The motivation for the proposed algorithm is (1) to provide an effective method for the expression of Dissimilarity Coefficient of mixed-type data clustering and (2) to avoid the uncertainty caused by random selection of initial Cluster Centers.
To solve the problem of measuring information quantitatively, Shannon borrowed the concept of entropy from thermodynamics in 1948 and proposed the concept of “Information Entropy”. Information Entropy is defined in terms of the occurrence probabilities of discrete random events. The amount of information carried by an event is related to its probability: the smaller the probability of an event, the more information its occurrence conveys. For example, the event “Heavy rain in a place where it does not rain frequently” carries a large amount of information. Conversely, the larger the probability of an event, the less information its occurrence conveys. For example, the event “The sun rises in the east” will definitely happen, so it carries very little information. Shannon's Information Entropy formula is defined as follows:
Here, the probability term in the formula represents the probability of a random event, and each event corresponds to a subset into which the dataset is divided. When all events are equally probable, the Information Entropy takes its maximum value; when one event occurs with probability 1, it takes the minimum value of 0. Information Entropy has the following basic properties:
Nonnegativity: every probability lies between 0 and 1, so its logarithm is nonpositive, and the negative sign in the Information Entropy formula makes the entropy nonnegative. The entropy value represents the amount of uncertainty that is reduced or eliminated once the information is obtained.
Symmetry: all variables of the function are interchangeable without affecting the value of the function.
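The extreme values and the two properties can be checked numerically. The sketch below assumes the standard form of Shannon's formula with a base-2 logarithm (the base is an assumption; it only changes the unit), and `shannon_entropy` is a name introduced here for illustration.

```python
import math

def shannon_entropy(probabilities):
    """Shannon entropy in bits; zero-probability events contribute nothing."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

uniform = [0.25, 0.25, 0.25, 0.25]   # all events equally probable
certain = [1.0, 0.0, 0.0, 0.0]       # one event is certain
skewed = [0.7, 0.2, 0.1]

print(shannon_entropy(uniform))      # maximum for 4 events: log2(4)
print(shannon_entropy(certain))      # minimum: no uncertainty
# Symmetry: reordering the probabilities does not change the entropy.
print(shannon_entropy(skewed), shannon_entropy(list(reversed(skewed))))
```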
4.1. Dissimilarity Coefficient of Categorical Based on Entropy Weight
Information Entropy can be used to calculate the discreteness of data and assign an appropriate weight to each feature to improve the clustering effect. In the clustering process, the importance of a Categorical Feature is inversely proportional to its dissimilarity. To some extent, the Information Entropy of each Categorical Feature reflects the weight of each Categorical Feature. Therefore, according to the uncertainty of the value of each Categorical Feature, this paper uses Information Entropy to calculate the importance of each Categorical Feature in the clustering process and assigns weight to the Dissimilarity Coefficient.
Definition 1. (the intracluster relative frequency of eigenvalues). Suppose that most of the data objects in cluster contain the same Categorical Eigenvalue which means that the eigenvalue appears frequently in cluster , so the intracluster dissimilarity of the eigenvalue will be low. The intracluster relative frequency of the eigenvalues is defined as follows:
Definition 2. (the intercluster distribution frequency of eigenvalues). The intercluster distribution frequency of the eigenvalue refers to the occurrence frequency of the eigenvalue relative to the total frequency of the eigenvalue in all clusters. The intercluster distribution frequency of the eigenvalue is defined as follows:
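Definitions 1 and 2 can be sketched directly from their verbal descriptions. The function names and the toy clusters below are hypothetical; the sketch assumes that the intracluster relative frequency divides by the cluster size and that the intercluster distribution frequency divides by the total count of the eigenvalue over all clusters.

```python
from collections import Counter

def intracluster_relative_frequency(value, cluster_values):
    """Share of the data objects in one cluster that take `value` on a given feature."""
    return Counter(cluster_values)[value] / len(cluster_values)

def intercluster_distribution_frequency(value, cluster_values, all_clusters_values):
    """Occurrences of `value` in one cluster relative to its occurrences in all clusters."""
    total = sum(Counter(vals)[value] for vals in all_clusters_values)
    return Counter(cluster_values)[value] / total if total else 0.0

# Values of one Categorical Feature in two toy clusters.
c1 = ["a", "a", "b"]
c2 = ["a", "c", "c", "c"]

print(intracluster_relative_frequency("a", c1))                 # 2 of 3 objects in c1
print(intercluster_distribution_frequency("a", c1, [c1, c2]))   # 2 of the 3 "a" overall
print(intercluster_distribution_frequency("a", c2, [c1, c2]))   # 1 of the 3 "a" overall
```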
Definition 3. (Dissimilarity Coefficient of categorical). Let represent the dissimilarity of the Categorical Data portion of the mixed-type dataset, and the definition is shown in the following formula:
Definition 4. (the entropy of Categorical Feature). From the perspective of information theory, the importance of feature can be seen as the dissimilarity of the dataset relative to the feature. Basak  mentions that if the information content of a feature is high, the dissimilarity of the feature is also high. Let be a discrete random variable belonging to the finite dataset , and is a probability function of the discrete random variable . Because the eigenvalue domain of Categorical Data is certain, the eigenvalues in the eigenvalue domain can be regarded as discrete and independent. Suppose there are different eigenvalues of a certain Categorical Feature , and the probability of the occurrence of each eigenvalue is , then the importance of Categorical Feature can be calculated by formula (7) . is the intracluster relative frequency of the eigenvalue mentioned in Definition 1:
Theorem 1. (). The larger the value of , the larger the proportion of the eigenvalues in the feature . Then, the intracluster dissimilarity between data objects to be clustered with the eigenvalues and the cluster is smaller.
Observing formula (7), we can find that the more possible values a Categorical Feature has, the larger its entropy tends to be. In practice, however, it is not the case that the larger the value domain of a feature, the higher its importance. Considering the actual situation, the more different values the data objects take on a feature, the less influence this feature has on clustering. In order to reduce the influence of Categorical Features with too many different values on clustering, formula (7) is further modified as follows.
Definition 5. (quantified entropy). When defining the entropy of the feature , we divide by the number of possible values of feature . The definition of entropy of Categorical Feature after quantization is shown as follows:
Definition 6. (weight on the -dimensional feature of Categorical Feature). The eigenvalue distribution of each dimension feature is different, and differently distributed eigenvalues give the Categorical Features of different dimensions different importance. In order to better discover all or part of the “prototypes” hidden in the dataset, the weight of each dimension should be taken into account when defining the Dissimilarity Coefficient. Let the weight of each dimension feature be and , , the weighted data object is , where . The weight is defined as shown in the following formula:
The weight of the Categorical Feature per dimension is defined as the ratio of the redundancy of the dimension feature to the sum of the overall redundancy. The calculation method of entropy redundancy is . All quantized entropy values in the dataset were normalized to obtain the entropy weight of each Categorical Feature. The definition of is shown as follows:
Theorem 2. (). is the sum of the weights of all features. The larger the value , the larger the weight of the feature .
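The entropy-weight idea of Definitions 4 to 6 can be sketched as follows. This sketch rests on two loudly stated assumptions that are not confirmed by the text: the entropy redundancy is taken to be 1 minus the quantified entropy, and the eigenvalue probabilities are estimated from the overall frequencies in a feature column rather than from the intracluster relative frequency of Definition 1. All function names are introduced here for illustration.

```python
import math
from collections import Counter

def feature_entropy(values):
    """Shannon entropy (bits) of one Categorical Feature, from its value frequencies."""
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(values).values())

def quantified_entropy(values):
    """Entropy divided by the number of distinct eigenvalues (Definition 5)."""
    return feature_entropy(values) / len(set(values))

def entropy_weights(columns):
    """ASSUMPTION: redundancy = 1 - quantified entropy; weights are normalized redundancies."""
    redundancy = [1.0 - quantified_entropy(col) for col in columns]
    total = sum(redundancy)
    return [r / total for r in redundancy]

feature_a = ["x", "x", "x", "y"]   # concentrated on one eigenvalue
feature_b = ["p", "q", "r", "s"]   # every object has a different eigenvalue
weights = entropy_weights([feature_a, feature_b])
print(weights)  # the weights sum to 1; feature_a receives the larger weight
```

Under these assumptions the weights sum to 1, and a feature whose values are spread over many distinct eigenvalues receives a smaller weight than a concentrated one.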
Definition 7. (Dissimilarity Coefficient of categorical based on entropy weight). For any , the Dissimilarity Coefficient of categorical based on entropy weight between and is defined as shown in the following formula:
The proposed Dissimilarity Coefficient is demonstrated by using the artificial dataset which is shown in Table 3. The dataset is described by three Categorical Features, , , and ; there are three clusters ,, and corresponding to the Cluster Centers , , and . Suppose clustering division is needed for the new data object .
The dissimilarity computing result of the k-prototypes algorithm is , and . The dissimilarity computing result of the EKACMD algorithm is, , and . The dissimilarity computing result of the WKPCA algorithm is and . According to the above calculation, it can be seen that the correct clustering division of can be carried out by using WKPCA algorithm.
4.2. Quantized Numerical Dissimilarity Coefficient
Definition 8. (quantitative numerical Dissimilarity Coefficient). The classical k-prototypes algorithm uses the Euclidean Distance to calculate the dissimilarity of the numerical part. Direct calculation of data of different orders of magnitude will not only increase the difficulty of calculation, but also cause a large error between the calculated results and the real situation. Therefore, Numerical Data should be dimensionless before calculation. The paper adopts Max-Min Standardization, and the processing method is shown in formula (12). The quantified numerical Dissimilarity Coefficient is defined as follows:
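The effect of Max-Min Standardization can be sketched as follows. The sketch assumes the usual form (x − min) / (max − min) applied per feature, and assumes that the dissimilarity of the numerical part is then the squared Euclidean distance on the standardized values; the function names and the toy matrix are hypothetical.

```python
import numpy as np

def min_max_normalize(X):
    """Rescale every Numerical Feature (column) to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # constant columns map to 0 instead of dividing by 0
    return (X - lo) / span

def numerical_dissimilarity(x, y):
    """ASSUMPTION: squared Euclidean distance on already standardized values."""
    return float(np.sum((np.asarray(x, dtype=float) - np.asarray(y, dtype=float)) ** 2))

# Two features on very different orders of magnitude.
X = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 5000.0]])
Z = min_max_normalize(X)
print(Z)

# Before standardization the second feature dominates; afterwards both contribute equally.
print(numerical_dissimilarity(X[0], X[1]), numerical_dissimilarity(Z[0], Z[1]))
```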
4.3. Weighted Hybrid Dissimilarity Coefficient
Definition 9. (weighted Hybrid Dissimilarity Coefficient). Suppose the mixed-type dataset has -dimension features (the front -dimension features are Categorical Features, and the latter -dimension features are Numerical Features). Numerical Features are treated as a whole (a vector), while the Categorical Features are treated as -dimensional vectors. Take the example of a data object which has a Categorical Feature of -dimension and has a Numerical Feature of -dimension. In the process of calculating dissimilarity, and -dimensional categorical vectors are chosen for calculation. That is, there are -dimensional vectors involved in the calculation of Dissimilarity Coefficient. Therefore, for arbitrary data object in the dataset , the Dissimilarity Coefficient between them is defined as follows:
4.4. Determination of Initial Cluster Centers
The classical k-prototypes algorithm is very sensitive to the selection of Cluster Centers. The appropriate initial Cluster Centers and cluster number are particularly important for k-prototypes algorithm.
Definition 10. (the average distance). The average distance between two data objects and is defined as follows:
Definition 11. (local neighborhood density). The local neighborhood density is defined as shown in the following formula:
where is a piecewise function, when , ; otherwise, cutoff distance is a critical value that limits the search scope.
Definition 12. (distance threshold). The distance threshold between arbitrary data objects in the dataset is defined as follows:
The Cluster Centers generally satisfy the following two assumptions. First, the local neighborhood density of the center point of a cluster is higher than that of the surrounding noncentral points of the cluster. Second, the relative distance between the center points of different clusters is large. Based on the above assumptions, this section presents the specific process of self-determining the initial Cluster Centers:
Step 1: formula (2) is used to calculate the distance matrix of the data.
Step 2: formula (16) is used to calculate the local neighborhood density value.
Step 3: formula (17) is used to calculate the distance threshold.
Step 4: sort the data objects in the dataset in descending order of local neighborhood density to obtain the sort sequence. The first data object in the sequence is taken as the initial Cluster Center and stored in the Cluster Centers set.
Step 5: for each subsequent data object in the sequence, determine whether it satisfies the distance threshold condition with respect to the Cluster Centers already selected. If it does, it is taken as the next Cluster Center and put into the Cluster Centers set. Otherwise, proceed to the next data object.
Step 6: determine whether all the data objects in the sequence have been accessed. If not, return to Step 5 to continue execution. Otherwise, the elements in the Cluster Centers set are the initial Cluster Centers, and the number of elements is the clustering number k.
Step 7: end.
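The selection procedure can be sketched as follows. Because the exact forms of formulas (16) and (17) are not restated here, the sketch makes explicit assumptions: the local neighborhood density is the number of other data objects within the cutoff distance, the `cutoff` and `threshold` values are supplied by the caller rather than computed, and a data object becomes a new Cluster Center only if it is farther than `threshold` from every center already chosen. The function names and the one-dimensional toy data are hypothetical.

```python
import numpy as np

def local_density(dist, cutoff):
    """Count, for each object, the other objects closer than the (positive) cutoff distance."""
    return (dist < cutoff).sum(axis=1) - 1  # subtract the object itself (distance 0)

def select_initial_centers(dist, cutoff, threshold):
    density = local_density(dist, cutoff)
    # Step 4: visit objects in descending order of density; the densest is the first center.
    order = np.argsort(-density, kind="stable")
    centers = [int(order[0])]
    # Steps 5 and 6: accept an object only if it is far from all centers chosen so far.
    for i in order[1:]:
        if all(dist[i, c] > threshold for c in centers):
            centers.append(int(i))
    return centers  # len(centers) is the resulting number of clusters k

# Toy data: two well-separated groups on a line.
points = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 5.3])
dist = np.abs(points[:, None] - points[None, :])
centers = select_initial_centers(dist, cutoff=0.5, threshold=2.0)
print(centers, len(centers))
```

In this sketch the number of clusters falls out of the procedure as the size of the returned list, which mirrors Step 6; the quality of the result depends on how `cutoff` and `threshold` are chosen.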
4.5. Cost Function considering Weights
Definition 13. (Cost Function considering weights). The WKPCA algorithm finds k subclusters that minimize the Cost Function shown in formula (18):
where the membership degree of a data object to a cluster is an element of the membership degree matrix. A membership degree of 1 indicates that the data object belongs to the cluster, and a membership degree of 0 indicates that it does not. The cost of a cluster is the sum of the dissimilarities of all data objects in the cluster to its Cluster Center. When the Cost Function reaches its minimum value subject to the constraint conditions, the clustering process ends. The WKPCA algorithm steps are described as follows:
Input: the dataset of data objects to be clustered
Output: the subclusters after clustering
Step 1: initialization procedure. Formula (2) is used to calculate the dissimilarity between the data objects.
Step 2: according to the automatic selection method of initial Cluster Centers in Section 4.4, data objects are selected from the dataset as the initial Cluster Centers.
Step 3: iterative process. Formula (11) is used to calculate the dissimilarity between each data object and each Cluster Center, and each data object is assigned to the nearest cluster according to the calculation results.
Step 4: according to the current Cluster Centers, the dissimilarity of each data object is recalculated. Update the Cluster Centers.
Step 5: repeat Steps 3 and 4 until the Cost Function no longer changes, at which point the algorithm ends.
The flowchart of the WKPCA algorithm is shown in Figure 3.
The time complexity of the algorithm in this paper is higher than that of the classical k-prototypes algorithm; the extra time is mainly consumed in selecting the initial Cluster Centers. However, once suitable initial Cluster Centers are determined, the number of iterations is reduced and satisfactory clustering results are obtained, which makes up for the higher time complexity to some extent.
5. Experimental Results and Analysis
5.1. Experimental Environment
The simulation experiments in this article are implemented in Python, and all experiments are run on an Intel(R) Core(TM) i7-8700K CPU under the Windows 10 operating system. For experimental verification, five mixed-type datasets, Bank Marketing (short name Bank), Zoo, Heart Disease (short name Heart), Lymphography (short name Lym), and Australian Credit Approval (short name ACA), were selected from the UCI machine learning database (UCI datasets: http://archive.ics.uci.edu/ml/datasets.html.2011). The details of the selected datasets are shown in Table 4.
Some of the datasets used in this article, such as the ACA dataset, contain missing values. Deleting missing data directly from the dataset does not affect the clustering results. Therefore, before the experiment, all the data objects with missing values were deleted to ensure the integrity of the dataset and the accuracy of the clustering results. The complete version of the ACA dataset has 690 data objects, and the paper selects the 623 data objects with complete eigenvalues to form a cleaned dataset. In addition, the Numerical Features are normalized by using the Max-Min Normalization method.
5.2. Performance Index
In order to evaluate the quality of clustering, the index AC (clustering accuracy) shown in formula (19) was used as the evaluation criterion. The indicator represents the ratio of the number of data objects correctly divided into the cluster to the total number of data objects. The closer the clustering result is to the real clustering partitioning result of the dataset, the larger the index value is, the better the clustering result is, that is, the better the clustering algorithm is. indicates the number of data objects correctly assigned to :
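The AC index can be sketched as the number of correctly assigned data objects summed over the clusters and divided by the total number of data objects. How each cluster is matched to a true class is not specified in the text; the sketch below assumes the common convention of matching each cluster to its majority class, and `clustering_accuracy` is a name introduced here for illustration.

```python
from collections import Counter

def clustering_accuracy(true_labels, cluster_labels):
    """AC = (sum over clusters of correctly assigned objects) / (total number of objects).

    ASSUMPTION: an object counts as correctly assigned when its true class is the
    majority class of the cluster it was placed in.
    """
    n = len(true_labels)
    correct = 0
    for cluster in set(cluster_labels):
        members = [t for t, c in zip(true_labels, cluster_labels) if c == cluster]
        correct += Counter(members).most_common(1)[0][1]
    return correct / n

true_labels = [0, 0, 0, 1, 1, 1]
cluster_labels = [1, 1, 0, 0, 0, 0]   # cluster ids need not match class ids
print(clustering_accuracy(true_labels, cluster_labels))  # 5 of 6 objects are correct
```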
5.3. Analysis of Experimental Results
In order to verify the universality of the WKPCA algorithm and eliminate the accidental results of a single experiment, we experimented with multiple real UCI datasets. The experiment compared the performance of the WKPCA algorithm with the k-prototypes algorithm proposed by Huang and the EKACMD algorithm proposed by Sangam on mixed-type datasets. Each algorithm was executed 30 times on each dataset to take the average value, and the statistics of clustering results are summarized in Tables 5–9. For k-prototypes algorithm and EKACMD algorithm, five different clustering parameters , , , , and were set in the paper for experiments. Since neither EKACMD nor WKPCA needed to set clustering parameter , the paper only sets cluster parameters of k-prototypes separately.
The Bank dataset has 41,188 data objects, 10 Categorical Features, 10 Numerical Features, and 2 clusters. The Bank dataset was sampled at a sampling rate of 4.8%. Table 5 shows that when , the values of WKPCA are 9.43% and 5% higher than those of k-prototypes and EKACMD, respectively.
The Zoo dataset has 101 data objects, 15 Categorical Features, 1 Numerical Feature, and 7 clusters. Table 6 shows that when , the values of WKPCA were 11.88% and 4.95% higher than those of k-prototypes and EKACMD, respectively.
The Heart Disease dataset has 303 data objects, 7 Categorical Features, 6 Numerical Features, and 2 clusters. Table 7 shows that when , the values of WKPCA were 10.35% and 2.7% higher than those of k-prototypes and EKACMD, respectively.
The Lym dataset has 148 data objects, 16 Categorical Features, 2 Numerical Features, and 4 clusters. Table 8 shows that when , the values of WKPCA are 9.62% and 6.45% higher than those of k-prototypes and EKACMD, respectively.
The ACA dataset has 690 data objects, 9 Categorical Features, 5 Numerical Features, and 2 clusters. Table 9 shows that when k = 2, the values of WKPCA are 5.02% and 2.99% higher than those of k-prototypes and EKACMD, respectively.
The results in Tables 5–9 show that, in terms of clustering accuracy, the proposed algorithm achieves better clustering results than the other algorithms. The Hybrid Dissimilarity Coefficient considers the importance of each feature in the clustering process and can automatically calculate the weights of different features. These properties enable the algorithm in this paper to obtain better clustering results.
Figure 5 shows the clustering accuracy of WKPCA, k-prototypes, and EKACMD on the Zoo dataset with different parameters . It can be seen that the curve of WKPCA is higher than that of k-prototypes and EKACMD.
Figure 6 shows the clustering accuracy of WKPCA, k-prototypes, and EKACMD on the Heart dataset with different parameters . It can be seen that when and , the clustering precision of k-prototypes and EKACMD is relatively close. The true cluster number of the Heart dataset was , and the clustering accuracy of WKPCA at is much better than that of EKACMD.
Figure 7 shows the clustering accuracy of WKPCA, k-prototypes, and EKACMD on the Lym dataset with different parameters . It can be seen that the effect of WKPCA algorithm is better than that of k-prototypes and EKACMD overall.
Figure 8 shows the clustering accuracy of WKPCA, k-prototypes, and EKACMD on the ACA dataset with different parameters . When , the clustering accuracy is WKPCA > EKACMD > k-prototypes.
From Figures 4–8, it can be seen that, in the case of random initialization, the proposed parameter-free WKPCA algorithm is superior to the k-prototypes algorithm and the EKACMD algorithm in clustering accuracy. As the dataset details in Table 4 show, the ratio of Categorical Features to Numerical Features differs across most of the selected datasets. For example, the Zoo dataset has 1 Numerical Feature and 15 Categorical Features, and the Lym dataset has 2 Numerical Features and 16 Categorical Features. Although the two feature types are very unevenly represented in these datasets, the WKPCA algorithm still achieves satisfactory clustering results. This indicates that the proposed Hybrid Dissimilarity Coefficient is applicable to a variety of complex datasets and that no parameter has to be set manually to adjust the weights of the Categorical Features and the Numerical Features.
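The idea of deriving feature weights from the data rather than from a manually tuned parameter can be illustrated with a small sketch. It assumes that the weight of each Categorical Feature is proportional to the Shannon entropy of its value distribution and that the categorical dissimilarity is a weighted mismatch count. Both are illustrative assumptions; the exact entropy-weight formula of the Hybrid Dissimilarity Coefficient is the one defined earlier in the paper, and the names `entropy_weights` and `weighted_mismatch` are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (natural logarithm) of a sequence of categorical values."""
    n = len(values)
    return sum((c / n) * math.log(n / c) for c in Counter(values).values())


def entropy_weights(rows):
    """Return one weight per categorical feature; the weights sum to 1.

    Assumption: a feature's weight is proportional to its entropy.
    """
    if not rows:
        raise ValueError("at least one data object is required")
    columns = list(zip(*rows))
    entropies = [shannon_entropy(col) for col in columns]
    total = sum(entropies)
    if total == 0:
        # Every feature is constant: fall back to equal weights.
        return [1.0 / len(columns)] * len(columns)
    return [h / total for h in entropies]


def weighted_mismatch(obj, center, weights):
    """Weighted Hamming-style dissimilarity between an object and a center."""
    return sum(w for a, b, w in zip(obj, center, weights) if a != b)


# Toy data: the first feature is constant, the second one varies.
toy_rows = [("a", "x"), ("a", "y"), ("a", "x"), ("a", "y")]
toy_weights = entropy_weights(toy_rows)
```

In this toy example the constant first feature receives zero weight, so only a mismatch on the second feature contributes to the dissimilarity, and no weighting parameter has to be chosen by hand.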
The weighted k-prototypes clustering algorithm based on the Hybrid Dissimilarity Coefficient (WKPCA) is an extension of the classical k-prototypes clustering algorithm. The automatic selection of initial Cluster Centers is improved by means of average distance, local neighborhood density, and relative distance. Because the spatial distribution of the data is taken into account, the selected Cluster Centers are more in line with the actual situation, and the clustering uncertainty caused by different initial Cluster Center selections is avoided.

For Categorical Data, a categorical Dissimilarity Coefficient based on entropy weight is used. For Numerical Data, the different numerical values are standardized by a quantized numerical Dissimilarity Coefficient. For mixed-type data, a weighted Hybrid Dissimilarity Coefficient is used. The proposed Hybrid Dissimilarity Coefficient not only retains the characteristics of the different data types but also effectively improves clustering accuracy and clustering effectiveness, and its robustness is better than that of other k-prototypes clustering algorithms.

Finally, the WKPCA algorithm is proposed to realize mixed-type data clustering. In Step 1, the WKPCA algorithm automatically determines the initial Cluster Centers by calculating the average distance and the local neighborhood density. Compared with other k-prototypes algorithms, this step takes more time, but more accurate Cluster Centers can be selected in the initial stage of clustering. It ensures that the Cluster Centers are located in the regions with the highest sample density and that the distances between them are as large as possible, which reduces the number of algorithm iterations. The proposed algorithm thus improves clustering accuracy at the cost of time performance, so future work will focus on reducing its time complexity. In summary, although the time performance of the proposed algorithm needs to be improved, its clustering accuracy and clustering effectiveness are significantly improved.
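The center selection summarized above (average distance, local neighborhood density, relative distance) can be sketched as follows. This is a simplified illustration in the style of density-peak selection, not the paper's exact procedure. It assumes purely numerical points with Euclidean distance in place of the Hybrid Dissimilarity Coefficient, uses the average pairwise distance as the neighborhood radius, and scores each point by density times relative distance. The function name `select_initial_centers` and the tie-breaking rule are hypothetical.

```python
import math


def select_initial_centers(points, k):
    """Pick k initial centers that are dense and far from denser points."""
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")

    dist = [[math.dist(p, q) for q in points] for p in points]

    # Average pairwise distance, used here as the neighborhood radius.
    pairs = n * (n - 1) / 2
    radius = (
        sum(dist[i][j] for i in range(n) for j in range(i + 1, n)) / pairs
        if n > 1 else 0.0
    )

    # Local neighborhood density: number of other points within the radius.
    density = [
        sum(1 for j in range(n) if j != i and dist[i][j] <= radius)
        for i in range(n)
    ]

    # Relative distance: distance to the nearest point of higher density
    # (ties are broken by index, an arbitrary choice made for this sketch).
    max_dist = max(max(row) for row in dist)
    relative = []
    for i in range(n):
        to_denser = [
            dist[i][j] for j in range(n)
            if density[j] > density[i] or (density[j] == density[i] and j < i)
        ]
        relative.append(min(to_denser) if to_denser else max_dist)

    # Points that are both dense and far from denser points score highest.
    order = sorted(range(n), key=lambda i: density[i] * relative[i], reverse=True)
    return [points[i] for i in order[:k]]


# Toy data: two well-separated groups of three points each.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_centers = select_initial_centers(toy_points, 2)
```

The sketch computes all pairwise distances, so its cost grows quadratically with the number of data objects, which is consistent with the observation that this initialization step takes more time than random initialization.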
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
This work was supported by the National Natural Science Foundation of China (grant no. 61762030), the Innovation-Driven Development Special Fund of Guangxi (grant no. AA17204017), and the Scientific Research and Technology Development Program of Guangxi (grant nos. AB19110050 and AB18126094).