Abstract

Most existing clustering algorithms are based on the Euclidean distance measure. However, Euclidean distance alone may not be sufficient to partition a dataset with different structures. It is therefore necessary to combine multiple distance measures in clustering, but the weights for different distance measures are hard to set. Accordingly, it appears natural to keep multiple distance measures separate and to optimize them simultaneously with a multiobjective optimization technique. Recently, a clustering algorithm called ‘multiobjective evolutionary clustering based on combining multiple distance measures’ (MOECDM) was proposed to integrate the Euclidean and Path distance measures for partitioning datasets with different structures. However, it is time-consuming owing to its large-sized genes. This paper proposes a fast multiobjective fuzzy clustering algorithm for partitioning datasets with different structures. In this algorithm, a real encoding scheme is adopted to represent each individual. Two fuzzy clustering objective functions, based on the Euclidean and Path distance measures respectively, evaluate the goodness of each individual. An improved evolutionary operator is introduced to increase the convergence speed and the diversity of the population. In the final generation, a set of nondominated solutions is obtained, from which the best solution and the best distance measure are selected by a semisupervised method. An updated algorithm is also designed to detect the optimal cluster number automatically. The proposed algorithms are applied to many datasets with different structures; results on eight artificial and six real-life datasets are reported. Experimental results show that the proposed algorithms not only successfully partition datasets with different structures but also reduce the computational cost.

1. Introduction

With the rapid development of informatization, large amounts of data are generated every day in all walks of life, and the quantity continues to grow rapidly. Powerful and intelligent data analysis tools are needed to extract valuable knowledge from these data. Clustering analysis is such a tool: it partitions a dataset into subgroups such that objects within a group are as similar as possible while objects in different groups are as dissimilar as possible. Many clustering algorithms have been developed in recent years in various fields, such as image segmentation [1], data mining [2], pattern recognition, and computer vision [3]. Generally, the most popular clustering techniques can be divided into partitioning clustering [4], hierarchical clustering [5], and density clustering [6]. Among these, partitioning clustering is the most popular and has attracted increasing attention in recent years.

Although partitioning clustering algorithms are developing rapidly, they still have a limitation in partitioning datasets with different structures. The reason is that these algorithms usually employ Euclidean distance to design their clustering criterion [7]. The Euclidean distance measure is suitable for detecting spherical structures, whereas it often fails on irregular structures [8]. Many algorithms apply other strategies to measure the similarity between clusters. Hierarchical clustering and density clustering are two kinds of traditional methods that are effective for partitioning datasets with different structures: hierarchical clustering [5] uses single link, average link, and complete link to measure the similarity between clusters, and density clustering [6] applies a density function to measure the similarity between two objects. Apart from these, many algorithms employ non-Euclidean distance measures, such as the Path distance [9], Manifold distance [10], and Kernel distance [11, 12], to partition datasets with different structures. However, it is hard to select an appropriate distance measure for a given dataset. A possible solution is to assign different weights to different distance measures, but the appropriate weights are also difficult to set. Francisco proposed two ‘dynamic hard clustering algorithms based on multiple dissimilarity matrices estimated locally and globally’ (MRDCA-RWG and MRDCA-RWL) [13], which assign a relevance weight to each dissimilarity matrix. The relevance weights change at each iteration and can be either the same for all clusters or different from one cluster to another.
Shortly afterwards, Francisco proposed two ‘partitioning fuzzy K-medoids clustering algorithms based on multiple dissimilarity matrices estimated locally and globally’ (MFCMdd-RWG and MFCMdd-RWL) [14], which are fuzzy clustering algorithms with a relevance weight for each dissimilarity matrix. The relevance weights are obtained by the same strategy as in MRDCA-RWG and MRDCA-RWL. Nevertheless, similar to weighting k-means (Wkmeans) [15], these algorithms tend to assign a larger weight to the distance measure with a smaller sum of within-cluster distances and a smaller weight to the distance measure with a larger sum of within-cluster distances.

In addition, many algorithms that automatically detect the optimal cluster number are also designed based on the Euclidean distance measure. Validity indices [16] are the most widely used approach to detect the cluster number automatically. Many validity indices, such as the XB index [17], OS index [18], and LZZ index [19], are designed based on Euclidean distance. Hence, they usually work well on datasets with spherical structures and fail on datasets with irregular structures. Some other algorithms, such as the gap algorithm [20], affinity propagation [21], the two-phase method [22], and the L-method [23], can detect the optimal cluster number for datasets with different structures by using different distance measures. However, the distance measure must be set in advance. Mok proposed ‘robust adaptive clustering with kmeans’ (RAC-Kmeans) [24], which detects the optimal cluster number by a specific graph-partitioning process and is not sensitive to the distance measure. However, this algorithm is computationally expensive because a breadth-first search must be run for each subgraph matrix.

Recently, as an effective tool for finding globally optimal solutions, evolutionary algorithms (EAs) have been increasingly used for cluster analysis. Bandyopadhyay proposed ‘genetic clustering for unknown K’ (GCUK) [25], which takes the DB index as the clustering criterion and can detect the optimal cluster number automatically. Maulik and Saha proposed ‘modified differential evolution based automatic fuzzy clustering’ (MoDEAFC) [26], which employs a validity index as the objective function, optimized by a modified differential evolution. In recent years, as a new trend, multiobjective optimization has also been used in cluster analysis [27]. Handl proposed ‘multiobjective clustering with automatic k-determination’ (MOCK) [28], which takes compactness and connectedness as two clustering objective functions, optimized simultaneously by the ‘Pareto envelope-based selection algorithm version 2’ (PESA-II) [29]. Prakash proposed a ‘two-stage diversity mechanism in multiobjective particle swarm optimization’ (TSMPSO) [30], which employs the same objective functions as MOCK, optimized by an improved multiobjective particle swarm optimization. Saha proposed ‘multiobjective modified differential evolution based fuzzy clustering’ (MOMoDEFC) [31], which uses the mean squared error and a validity index as two clustering objective functions, optimized by a modified differential evolution. Zhao developed a ‘multiobjective spatial fuzzy clustering algorithm’ (MSFCA) [32], which partitions an image by optimizing the global fuzzy compactness with spatial information and the fuzzy separation among clusters. Hu proposed a ‘multiobjective evolutionary approach based soft subspace clustering’ (MOEASSC) [33], which takes the within-cluster dispersion as the first objective function and combines the negative weight entropy and the separation between clusters into the second objective function.
This algorithm is suitable for partitioning high-dimensional datasets. Although the above algorithms achieve satisfactory clustering results on datasets with spherical structures, they fail on datasets with irregular structures because most of them are based on the Euclidean distance measure.

In this paper, integrating multiple distance measures into clustering is studied. Recently, we proposed ‘multiobjective evolutionary clustering based on combining multiple distance measures’ (MOECDM) to partition datasets with different structures, and an updated approach, ‘multiobjective evolutionary automatic clustering based on combining multiple distance measures’ (MOEACDM), to detect the optimal cluster number [34]. Although both approaches obtain promising results, they have two limitations: the individual is designed with a label encoding scheme, which is time-consuming due to the large number of genes; and they are hard clustering algorithms, which may not reflect the structure of datasets with overlapping clusters. Based on the above analysis, in this study we propose a fast multiobjective fuzzy clustering algorithm based on multiple distance measures (F-MFCMDM) to partition datasets with different structures. Moreover, we also propose an updated algorithm, F-MFCMDM with unknown k (F-MFCMDM-UK), to detect the optimal cluster number automatically. The contributions of the paper can be summarized as follows.
(i) We propose F-MFCMDM to partition datasets with different structures.
(ii) An improved evolutionary operator is proposed to increase the convergence speed.
(iii) We propose F-MFCMDM-UK to detect the optimal cluster number of datasets with different structures.
(iv) F-MFCMDM and F-MFCMDM-UK provide better accuracy than several other approaches, not only in partitioning datasets with different structures but also in convergence speed.

The rest of this paper is arranged as follows. Section 2 introduces related work concerning fuzzy clustering and the multiobjective evolutionary technique. Section 3 presents the proposed algorithms, F-MFCMDM and F-MFCMDM-UK, in detail. The experimental results are shown in Section 4. Finally, Section 5 provides the conclusion.

2. Related Works

2.1. Fuzzy Clustering Problem

In this subsection, we briefly introduce fuzzy clustering through a widely used algorithm, fuzzy c-means (FCM) [35]. First, the main notations used in clustering are summarized as follows. Let $X=\{x_1,x_2,\dots,x_n\}$ be a set of $n$ unlabeled data objects, where $x_i$ is a $d$-dimensional point in the continuous feature space. $C=\{C_1,C_2,\dots,C_k\}$ represents $k$ clusters, where $k$ is the cluster number. $V=\{v_1,v_2,\dots,v_k\}$ represents the cluster centers, where $v_j$ is the center of cluster $C_j$. $m>1$ is the fuzzy factor. $U=[u_{ij}]_{n\times k}$ is the fuzzy membership matrix, where $u_{ij}$ represents the membership of $x_i$ to cluster $C_j$; it satisfies $u_{ij}\in[0,1]$ and $\sum_{j=1}^{k}u_{ij}=1$ for each $i$. The objective function of the standard FCM algorithm can be written as

$$J_m(U,V)=\sum_{i=1}^{n}\sum_{j=1}^{k}u_{ij}^{m}\,\|x_i-v_j\|^{2}.\qquad(1)$$

By minimizing the objective function in (1) using the Lagrange multiplier method, the updating equations of the fuzzy memberships and the cluster centers are obtained as shown in (2) and (3), respectively:

$$u_{ij}=\frac{1}{\sum_{l=1}^{k}\left(\|x_i-v_j\|/\|x_i-v_l\|\right)^{2/(m-1)}},\qquad(2)$$

$$v_j=\frac{\sum_{i=1}^{n}u_{ij}^{m}\,x_i}{\sum_{i=1}^{n}u_{ij}^{m}}.\qquad(3)$$
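The alternating updates (2) and (3) can be sketched in a few lines. This is an illustrative, minimal implementation; the function name, the fixed iteration budget, and the small constant guarding division by zero are our own choices, not the paper's:

```python
import numpy as np

def fcm(X, k, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means: alternately update centers (Eq. (3))
    and memberships (Eq. (2)) for a fixed number of iterations."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, k))
    U /= U.sum(axis=1, keepdims=True)          # each row sums to 1
    for _ in range(n_iter):
        W = U ** m
        V = W.T @ X / W.sum(axis=0)[:, None]   # weighted centers, Eq. (3)
        # distances to centers; tiny epsilon avoids division by zero
        D = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
        # membership update, Eq. (2)
        U = 1.0 / ((D[:, :, None] / D[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
    return U, V
```

Calling `U, V = fcm(X, k)` returns memberships whose rows sum to 1 and the final cluster centers.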

Based on the analysis in Section 1, traditional fuzzy clustering methods have two obvious limitations. The first is that they are usually based on a single distance measure; although some fuzzy clustering methods, such as MFCMdd-RWG and MFCMdd-RWL [14], combine multiple distance measures, the appropriate weights for the measures are difficult to set. The second is that they are quite sensitive to initialization and may get stuck at a suboptimal solution. In this paper, we address these two limitations.

2.2. Multiobjective Optimization Problem

In many real-world applications, a problem must be solved by optimizing several objective functions simultaneously; this is a multiobjective optimization problem (MOP). Two strategies can be used to solve an MOP. The first is to convert the MOP into a single-objective optimization problem by combining all objectives with a weighted formula. However, the objectives in many real-world problems conflict with each other, and the best weights are difficult to set. The second is to employ the Pareto approach, in which the dominance relation is used to evaluate intermediate and final solutions. From a mathematical viewpoint, a general MOP can be formulated as

$$\min_{x\in\Omega}\ F(x)=\left(f_1(x),f_2(x),\dots,f_M(x)\right),$$

where $x=(x_1,\dots,x_D)$ is a $D$-dimensional decision vector in the search space $\Omega$ and $F(x)$ is an objective vector consisting of $M$ objective functions. All the objective functions are optimized simultaneously, and several solutions may be superior to the rest of the solutions in the search space. The winning solutions are called nondominated solutions and the losing solutions are called dominated solutions. All nondominated solutions can be considered acceptable because none of them is absolutely better than the others. For a minimization problem, a solution $x^{*}$ is nondominated if and only if there is no solution $x$ such that $f_i(x)\le f_i(x^{*})$ for all $i\in\{1,\dots,M\}$ with $f_j(x)<f_j(x^{*})$ for at least one $j$.
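The dominance relation for a minimization problem translates directly into code. This small sketch (function names are ours) checks dominance and filters a set of objective vectors down to its nondominated subset by brute force:

```python
def dominates(f_a, f_b):
    """True if objective vector f_a Pareto-dominates f_b (minimization):
    no worse in every objective and strictly better in at least one."""
    return (all(a <= b for a, b in zip(f_a, f_b))
            and any(a < b for a, b in zip(f_a, f_b)))

def nondominated(front):
    """Keep only the solutions not dominated by any other solution."""
    return [f for f in front
            if not any(dominates(g, f) for g in front if g is not f)]
```

For example, in `[(1, 3), (2, 2), (3, 1), (2, 3)]` the vector `(2, 3)` is dominated by `(2, 2)` and is filtered out, while the other three form the Pareto front.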

3. Fast Multiobjective Fuzzy Clustering Based on Multiple Distance Measures

In this section, F-MFCMDM and F-MFCMDM-UK are introduced. Both algorithms employ the ‘regularity model based multiobjective estimation of distribution algorithm’ (RMMEDA) [36] as the optimizer, with a series of changes to improve its performance. The details are introduced below.

3.1. Distance Measures Definition

F-MFCMDM and F-MFCMDM-UK are designed based on two distance measures, which we introduce in this subsection. We first need to consider which two distance measures to select. It is well known that the Euclidean distance measure is suitable for datasets with spherical structures, while the Path distance measure [9] is suitable for datasets with irregular structures. Hence, in this paper we employ the Euclidean and Path distance measures; the same two measures are used in [34]. Since computing the Path distance requires examining the paths between objects, it is time-consuming. Therefore, we first compute two distance matrices, $D^{E}=[d^{E}_{ij}]_{n\times n}$ and $D^{P}=[d^{P}_{ij}]_{n\times n}$, in a single run; they are defined as

$$d^{E}_{ij}=\|x_i-x_j\|,\qquad d^{P}_{ij}=\min_{p\in P_{ij}}\ \max_{1\le h<|p|}\ d^{E}_{p_h p_{h+1}},$$

where $d^{E}_{ij}$ and $d^{P}_{ij}$ represent the Euclidean distance and the Path distance between $x_i$ and $x_j$, respectively, $1\le i,j\le n$, and $P_{ij}$ denotes the set of all paths $p=(p_1,\dots,p_{|p|})$ from $x_i$ to $x_j$ over the data objects.
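If the Path distance of [9] is taken as the minimax-edge definition above, the whole matrix $D^P$ can be obtained from $D^E$ with a Floyd–Warshall-style recurrence over the complete graph on the data objects. The following is a sketch under that assumption (the function name is ours):

```python
import numpy as np

def path_distance_matrix(DE):
    """Minimax 'path' distance: d_P(i,j) is the smallest possible value of
    the largest Euclidean edge along any path from i to j. Computed with a
    Floyd-Warshall-style recurrence:
        DP[i][j] = min(DP[i][j], max(DP[i][h], DP[h][j]))."""
    DP = DE.copy()
    n = DP.shape[0]
    for h in range(n):  # allow object h as an intermediate hop
        DP = np.minimum(DP, np.maximum(DP[:, h:h+1], DP[h:h+1, :]))
    return DP
```

For three collinear points at 0, 1, and 2, the Euclidean distance between the endpoints is 2, but the path through the middle point has largest edge 1, so the Path distance is 1.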

3.2. Objective Function Computation

The objective functions used in many existing multiobjective evolutionary clustering algorithms are usually based on the Euclidean distance measure, which is not equally appropriate for datasets with different structures. Here, we define two new objective functions based on the Euclidean distance measure and the Path distance measure, respectively:

$$f_1(U,V)=\sum_{i=1}^{n}\sum_{j=1}^{k}u_{ij}^{m}\,\bigl(d^{E}(x_i,v_j)\bigr)^{2},\qquad f_2(U,V)=\sum_{i=1}^{n}\sum_{j=1}^{k}u_{ij}^{m}\,\bigl(d^{P}(x_i,v_j)\bigr)^{2},$$

where $u_{ij}$ and $m$ are the same as in (1), and $d^{E}$ and $d^{P}$ denote the distances between $x_i$ and $v_j$ under the two distance measures. $U$ and $V$ are the two kinds of parameters that could serve as tradeoff variables between the two objective functions $f_1$ and $f_2$. However, this formulation is hard to implement due to two problems. The first problem is that cluster centers become invalid in many non-Euclidean distance spaces; in other words, $v_j$ is meaningless in the Path distance space. A good solution is to replace the cluster centers with cluster medoids; unlike a cluster center, a cluster medoid is one of the real objects in the dataset [37]. The two objective functions are modified as

$$f_1(U,M)=\sum_{i=1}^{n}\sum_{j=1}^{k}u_{ij}^{m}\,\bigl(d^{E}(x_i,m_j)\bigr)^{2},\qquad f_2(U,M)=\sum_{i=1}^{n}\sum_{j=1}^{k}u_{ij}^{m}\,\bigl(d^{P}(x_i,m_j)\bigr)^{2},$$

where $M=\{m_1,m_2,\dots,m_k\}\subset X$ represents the cluster medoids and $m_j$ is the medoid of cluster $C_j$. The second problem is that the distance measure in $f_1$ differs from that in $f_2$, so a single membership matrix $U$ cannot serve both. To solve this problem, we use only the medoids $M$ as the tradeoff variable between $f_1$ and $f_2$, maintaining a separate membership matrix for each measure:

$$f_1(M)=\sum_{i=1}^{n}\sum_{j=1}^{k}\bigl(u^{E}_{ij}\bigr)^{m}\bigl(d^{E}(x_i,m_j)\bigr)^{2},\qquad f_2(M)=\sum_{i=1}^{n}\sum_{j=1}^{k}\bigl(u^{P}_{ij}\bigr)^{m}\bigl(d^{P}(x_i,m_j)\bigr)^{2}.$$

Note that $U^{E}$ is different from $U^{P}$. By minimizing $f_1$ and $f_2$, the two memberships $u^{E}_{ij}$ and $u^{P}_{ij}$ are obtained as

$$u^{E}_{ij}=\frac{1}{\sum_{l=1}^{k}\bigl(d^{E}(x_i,m_j)/d^{E}(x_i,m_l)\bigr)^{2/(m-1)}},\qquad u^{P}_{ij}=\frac{1}{\sum_{l=1}^{k}\bigl(d^{P}(x_i,m_j)/d^{P}(x_i,m_l)\bigr)^{2/(m-1)}}.$$

Next, we look for the set of tradeoff cluster medoids by means of evolutionary operators.
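Given a precomputed distance matrix and a set of medoid indices, the objective value and memberships follow from the same FCM-style update. A hedged sketch (the function name and the zero-distance guard are ours):

```python
import numpy as np

def fuzzy_objective(D, medoids, m=2.0):
    """Given an n x n distance matrix D and a list of medoid indices,
    compute the fuzzy memberships (FCM-style update with medoids in
    place of centers) and the objective value sum_ij u_ij^m * d_ij^2."""
    Dm = D[:, medoids] + 1e-12  # distances to the k medoids (guard zeros)
    U = 1.0 / ((Dm[:, :, None] / Dm[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
    return (U ** m * Dm ** 2).sum(), U
```

Calling it once with $D^E$ and once with $D^P$ on the same medoid set yields $f_1(M)$ and $f_2(M)$.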

3.3. Encoding and Initialization Schemes

The encoding and initialization schemes are introduced in this subsection. The main notations used in the multiobjective evolutionary algorithm are summarized as follows. $N$ is the population size. $P(t)$ is the population of the $t$th generation and $P(t)=\{s_1,s_2,\dots,s_N\}$, where $s_i$ is the $i$th individual. An encoding scheme must be selected to map $k$ cluster medoids into an individual, where $k$ is the cluster number. Many encoding schemes have been proposed in the literature [4]. In F-MFCMDM, we adopt a real encoding scheme to represent the individual: each individual is a real vector of $k$ genes, and each gene corresponds to a cluster medoid. For example, the $i$th individual in $P(t)$ can be written as

$$s_i=(g_{i,1},g_{i,2},\dots,g_{i,k}),$$

where each gene $g_{i,j}$ is a real number within a fixed range and $n$ is the number of data objects. Each $g_{i,j}$ is then mapped into an integer $l_{i,j}\in\{1,2,\dots,n\}$, which represents the index of a data object; that is, $x_{l_{i,j}}$ is used as the $j$th cluster medoid for individual $s_i$. All the cluster medoids can be obtained as

$$M_i=\{x_{l_{i,1}},x_{l_{i,2}},\dots,x_{l_{i,k}}\},\qquad(11)$$

where $l_{i,j}$ is the decoded index of the $j$th gene, $j=1,\dots,k$.
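A possible decoding rule, assuming for illustration that each gene lies in $[0,1)$ and is scaled and truncated to an index (the paper's exact range and rounding rule may differ):

```python
def decode(individual, n):
    """Map each real gene g in [0, 1) to a medoid index in {0, ..., n-1}.
    The scaling/truncation rule here is an assumption for illustration."""
    return [min(int(g * n), n - 1) for g in individual]
```

For instance, with `n = 10` data objects, the genes `[0.0, 0.5, 0.999]` decode to the medoid indices `[0, 5, 9]`.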

To give the population a higher probability of falling into promising regions of the solution space, the initial population $P(0)$ is generated by a random strategy and a preclustering strategy. A parameter $\alpha$ determines the proportion of individuals generated by the random strategy: $\lceil\alpha N\rceil$ and $N-\lceil\alpha N\rceil$ individuals are generated by the random strategy and the preclustering strategy, respectively. In our algorithm, each individual is a tradeoff between the Euclidean and Path distance measures; that is to say, each individual implies a combination of the two measures. Hence, the preclustering strategy is performed by running a preclustering algorithm under several different distance combinations. By adjusting the weights of the Euclidean and Path distances, many different distance combinations can be obtained. For simplicity, we use four distance combinations: only the Euclidean distance (ED), only the Path distance (PD), the Euclidean and Path distances combined directly (EPD), and the Euclidean and Path distances normalized to a common range and then combined (NEPD). The Kmedoids algorithm is used as the preclustering algorithm. Hence, the preclustered individuals are generated by four preclustering algorithms: Kmedoids based on ED (KM(ED)), on PD (KM(PD)), on EPD (KM(EPD)), and on NEPD (KM(NEPD)). The individuals are distributed evenly among the four preclustering algorithms. The initialization procedure of F-MFCMDM is shown in Algorithm 1.

Input: Dataset $X$, cluster number $k$, population size $N$, proportion parameter $\alpha$,
   four distance combinations ED, PD, EPD, and NEPD.
Output: Initial population $P(0)$.
for each of the first $\lceil\alpha N\rceil$ individuals do
  Set each gene = a random real number from its valid range.
end
while the remaining $N-\lceil\alpha N\rceil$ individuals are not all generated do
  Set the next individual by KM(ED, $k$).
  Set the next individual by KM(PD, $k$).
  Set the next individual by KM(EPD, $k$).
  Set the next individual by KM(NEPD, $k$).
end
3.4. Evolutionary Operators

In this subsection, the $(t+1)$th generation population $P(t+1)$ is generated by applying some evolutionary operators to the $t$th generation population $P(t)$, where $t\ge 0$. Next, we introduce the details of each evolutionary operator in turn.

3.4.1. RMMEDA Operator

RMMEDA employs estimation of distribution algorithms (EDA) to generate the offspring population. In RMMEDA, the Pareto set of a continuous multiobjective optimization problem (MOP) is assumed to be an $(M-1)$-dimensional piecewise continuous manifold, where $M$ is the number of objectives; this is a regularity property of continuous MOPs [36]. Based on this assumption, a probability model is built by the local principal component analysis (PCA) algorithm, and new offspring solutions are sampled from this model. We call this the RMMEDA operator; it consists of two steps, a modeling operator and a reproduction operator. The generation of the offspring population by the RMMEDA operator is shown in Algorithm 2.

Input: The $t$th generation population, $P(t)$.
Output: The offspring population, $Q_1(t)$.
 (a) Step 1: Modeling operator: learn the distribution of the solutions in $P(t)$ and build the
   probability model.
   (i) Partition $P(t)$ into a set of disjoint clusters by the local PCA algorithm.
   (ii) For each cluster, find the first principal subspaces.
   (iii) Extend each of the directions in each cluster to cover the whole Pareto front.
 (b) Step 2: Reproduction operator: sample $N$ individuals from the probability model to generate
   the offspring population $Q_1(t)$.
3.4.2. Differential Evolution Operator

The RMMEDA operator works very well when the regularity of the solutions is obvious. However, in the initial generations the solutions have no obvious regularity, which makes the convergence of the population very slow. To overcome this limitation, we apply a differential evolution operator (DE operator) to generate another offspring population, $Q_2(t)$. Note that the DE operator works on $P(t)$. The trial vector $\tilde{s}_i$ for the $i$th individual in $P(t)$ is generated by one of two operators. The first operator is used to increase the diversity of the population:

$$\tilde{g}_{i,j}=\begin{cases} g_{r_1,j}+F\,\bigl(g_{r_2,j}-g_{r_3,j}\bigr), & \text{if } rand_j\le CR \text{ or } j=j_{rand},\\ g_{i,j}, & \text{otherwise}, \end{cases}\qquad(12)$$

where $r_1$, $r_2$, and $r_3$ are three distinct integers selected randomly from $\{1,\dots,N\}\setminus\{i\}$; $g_{i,j}$ is the $j$th gene of individual $s_i$; $F$ is a scaling factor that controls the weights of the differences between individuals; $rand_j$ is a uniformly distributed random number from $[0,1]$, generated for each gene $j$; $CR$ is the crossover control parameter; and $j_{rand}$ is a randomly chosen integer from $\{1,\dots,k\}$ that ensures at least one gene of $\tilde{s}_i$ differs from that of $s_i$.
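The first operator matches the classic DE/rand/1/bin scheme; a sketch under that reading (the function name and parameter defaults are ours, not the paper's settings):

```python
import numpy as np

def de_rand_1_bin(pop, i, F=0.5, CR=0.9, rng=None):
    """DE/rand/1/bin trial vector for individual i, Eq. (12) style:
    mutate with three distinct random individuals (none equal to i),
    then apply binomial crossover against pop[i]."""
    if rng is None:
        rng = np.random.default_rng()
    N, L = pop.shape
    r1, r2, r3 = rng.choice([j for j in range(N) if j != i], 3, replace=False)
    mutant = pop[r1] + F * (pop[r2] - pop[r3])
    trial = pop[i].copy()
    j_rand = rng.integers(L)          # guarantees >= 1 gene from the mutant
    mask = rng.random(L) <= CR
    mask[j_rand] = True
    trial[mask] = mutant[mask]
    return trial
```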

The second operator is used to push the population quickly towards the global optima:

$$\tilde{g}_{i,j}=\begin{cases} g_{best,j}+F\,\bigl(g_{r_1,j}-g_{r_2,j}\bigr), & \text{if } rand_j\le CR \text{ or } j=j_{rand},\\ g_{i,j}, & \text{otherwise}, \end{cases}\qquad(13)$$

where $r_1$ and $r_2$ play the same roles as in (12) and $s_{best}$ is the best individual in the Pareto front of $P(t)$ for $s_i$.

Generally speaking, a Pareto front contains more than one solution, but for the $i$th individual $s_i$ only one solution from the Pareto front can serve as $s_{best}$ in (13). In this paper, $s_{best}$ is selected from the Pareto front by the nearest criterion. First, we calculate the Euclidean distance in objective space between $s_i$ and each nondominated solution in the Pareto front. Second, the nondominated solution with the minimum distance is taken as $s_{best}$. For example, as shown in Figure 1, the Pareto front contains four nondominated solutions; by computing the Euclidean distance from each of them to $s_i$ in objective space, the nearest one is selected as $s_{best}$ for $s_i$.
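The nearest criterion itself is an argmin over objective-space distances; a short sketch with our own names:

```python
import numpy as np

def nearest_best(front_objs, target_obj):
    """Return the index of the nondominated solution whose objective
    vector is closest (Euclidean, in objective space) to target_obj."""
    front_objs = np.asarray(front_objs, dtype=float)
    d = np.linalg.norm(front_objs - np.asarray(target_obj, dtype=float), axis=1)
    return int(np.argmin(d))
```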

The offspring population $Q_2(t)$ generated by the DE operator is shown in Algorithm 3. Note that only one of the two operators is used to generate each $\tilde{s}_i$; which one is selected depends on a mutation parameter $p_m$, computed by (14). The value of $p_m$ decreases as the number of generations increases. In the initial stage, the first operator is selected with higher probability; in the later stage, the second operator is selected with higher probability. $rand_i$ is a uniformly distributed random number from $[0,1]$, generated for the $i$th individual.

Input: The $t$th generation population, $P(t)$.
Output: The offspring population, $Q_2(t)$.
Compute $p_m$ by using Eq. (14).
for each individual $s_i$ in $P(t)$ do
 if $rand_i \le p_m$ then
   Generate $\tilde{s}_i$ by using Eq. (12).
 else
   Generate $\tilde{s}_i$ by using Eq. (13).
 end
end
3.4.3. Selection Operator

After the RMMEDA and DE operators, we have three populations: the current population $P(t)$, the offspring population generated by the RMMEDA operator, and the offspring population generated by the DE operator. To keep the population size equal to $N$ and to preserve the favorable solutions, a selection operator preserves the individuals with higher fitness and eliminates those with lower fitness. In F-MFCMDM, the crowded binary tournament selection method proposed in [38] is employed to select the better individuals: all individuals in the union of the three populations are ranked by nondominated sorting and assigned a crowding distance, and the individuals with lower rank and higher crowding distance are selected as the new population $P(t+1)$.
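The ranking-plus-crowding selection can be sketched in the NSGA-II style of [38]. This is an illustrative brute-force version (quadratic nondominated sorting), not the paper's implementation:

```python
import numpy as np

def dominates(a, b):
    """Minimization dominance between two objective vectors."""
    return np.all(a <= b) and np.any(a < b)

def fast_nondominated_sort(F):
    """Rank objective vectors into fronts 0, 1, ... (lower is better)."""
    F = np.asarray(F, dtype=float)
    rank = np.full(len(F), -1)
    remaining, r = set(range(len(F))), 0
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(F[j], F[i]) for j in remaining if j != i)]
        for i in front:
            rank[i] = r
        remaining -= set(front)
        r += 1
    return rank

def crowding_distance(F):
    """Per-solution crowding distance within one front."""
    F = np.asarray(F, dtype=float)
    n, M = F.shape
    d = np.zeros(n)
    for m in range(M):
        order = np.argsort(F[:, m])
        d[order[0]] = d[order[-1]] = np.inf     # boundary solutions kept
        span = F[order[-1], m] - F[order[0], m] or 1.0
        d[order[1:-1]] += (F[order[2:], m] - F[order[:-2], m]) / span
    return d

def environmental_selection(F, N):
    """Select N indices by (rank ascending, crowding descending)."""
    F = np.asarray(F, dtype=float)
    rank = fast_nondominated_sort(F)
    crowd = np.zeros(len(F))
    for r in np.unique(rank):
        idx = np.where(rank == r)[0]
        crowd[idx] = crowding_distance(F[idx])
    return sorted(range(len(F)), key=lambda i: (rank[i], -crowd[i]))[:N]
```

Here a dominated solution such as `(3, 3)` among `[(1, 4), (2, 3), (3, 2), (4, 1), (3, 3)]` is ranked into a later front and dropped first.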

3.5. Outputting the Best Clustering Result

In the final generation of F-MFCMDM, the Pareto front is obtained. All the solutions in the Pareto front are equally important, and each solution corresponds to a set of cluster medoids. In most real-world problems, a single solution must be selected from these solutions. There are two widely used strategies for selecting the best solution: the first employs an additional validity index [32], and the second uses a semisupervised method [39]. Since many additional validity indices are themselves based on Euclidean distance, we use the semisupervised method in this paper.

In the semisupervised method, some data objects whose class labels are known are chosen as test patterns. First, each individual in the Pareto front is decoded into cluster medoids by using (11). Second, the class label of each test pattern is assigned according to the nearest-medoid criterion: a data object $x_i$ is assigned to the $j$th cluster if

$$j=\arg\min_{1\le l\le k}\ d(x_i,m_l),$$

where $d$ is the distance measure used for decoding. However, the distance measure in F-MFCMDM is not unique, and each individual must be decoded with an appropriate distance measure. As introduced in Section 3.3, each individual implies a combination of the Euclidean and Path distances. Here, we take ED, PD, EPD, and NEPD as four candidate distance combinations for each individual. Accordingly, each individual in the Pareto front can be decoded into four clustering results, one per distance combination. If the number of individuals in the Pareto front is $L$, we obtain $4L$ clustering results for the test patterns. Third, the Minkowski score is used to measure the amount of misclassification. The Minkowski score (MS) is defined as

$$MS=\sqrt{\frac{n_{01}+n_{10}}{n_{11}+n_{10}}},\qquad(16)$$

where $n_{11}$ is the number of object pairs having the same class labels and belonging to the same clusters, $n_{01}$ is the number of object pairs having different class labels but belonging to the same clusters, and $n_{10}$ is the number of object pairs having the same class labels but belonging to different clusters. A lower score indicates a better solution and a better distance combination. Finally, after the scores of all clustering results have been calculated, the solution and the distance combination with the minimum Minkowski score are chosen as the best solution and the best distance combination, respectively; they are then used to partition the remaining objects.
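The pair-counting Minkowski score in (16) can be computed directly; an illustrative sketch (names ours, quadratic in the number of test patterns):

```python
from itertools import combinations

def minkowski_score(true_labels, pred_labels):
    """MS = sqrt((n01 + n10) / (n11 + n10)) over all object pairs:
    n11: same class & same cluster; n01: different class & same cluster;
    n10: same class & different cluster. Lower is better (0 = perfect)."""
    n11 = n01 = n10 = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_clust = pred_labels[i] == pred_labels[j]
        if same_class and same_clust:
            n11 += 1
        elif not same_class and same_clust:
            n01 += 1
        elif same_class and not same_clust:
            n10 += 1
    return ((n01 + n10) / (n11 + n10)) ** 0.5
```

Note that a perfect clustering scores 0 even if the cluster labels are permuted, since only pair co-membership matters.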

3.6. F-MFCMDM-UK

As mentioned in the previous section, the cluster number $k$ must be provided by the user in advance. However, in most real-world applications this parameter is difficult to obtain. In this subsection, we detect it automatically by designing two appropriate objective functions. We call the resulting algorithm ‘fast multiobjective fuzzy clustering based on multiple distance measures with unknown k’ (F-MFCMDM-UK). Many operators in F-MFCMDM-UK are similar to those in F-MFCMDM, so we mainly introduce the operators that differ: the objective function computation and the initialization scheme.

3.6.1. Objective Function Computation

A validity index is a widely used approach to detect the optimal cluster number automatically; it works by seeking a balance between the compactness of each cluster and the separation between each pair of clusters. Most existing validity indices are inapplicable here because, as analyzed above, most of them are designed based on the Euclidean distance. Therefore, we construct two new objective functions, one for each distance space. Unlike in F-MFCMDM, each objective function in F-MFCMDM-UK must consider the balance between compactness and separation in order to detect the optimal cluster number automatically. Based on this analysis, the two objective functions are rewritten as

$$f_1(M)=\frac{\sum_{i=1}^{n}\sum_{j=1}^{k}\bigl(u^{E}_{ij}\bigr)^{m}\bigl(d^{E}(x_i,m_j)\bigr)^{2}}{\min_{p\ne q} d^{E}(m_p,m_q)},\qquad f_2(M)=\frac{\sum_{i=1}^{n}\sum_{j=1}^{k}\bigl(u^{P}_{ij}\bigr)^{m}\bigl(d^{P}(x_i,m_j)\bigr)^{2}}{\min_{p\ne q} d^{P}(m_p,m_q)},$$

where $p,q\in\{1,\dots,k\}$. Let us comment on the two objective functions. $f_1$ and $f_2$ are ratios of compactness to separation based on the Euclidean and Path distances, respectively. On the one hand, the two compactness terms in the numerators tend to decrease as the cluster number increases. On the other hand, the two separation terms in the denominators, which measure the minimum separation between each pair of clusters, also tend to decrease as the cluster number increases. Each individual is thus a tradeoff not only between the Euclidean and Path distances but also between compactness and separation. By optimizing $f_1$ and $f_2$ simultaneously, the optimal cluster number can be detected automatically.

3.6.2. Encoding and Initialization Schemes

The optimal cluster number in F-MFCMDM-UK is unknown and is assumed to lie in the range $[k_{min},k_{max}]$, where $k_{min}$ and $k_{max}$ are the acceptable lower and upper bounds of the cluster number, respectively. The length of an individual in F-MFCMDM-UK is $k_{max}+1$: the first $k_{max}$ genes represent candidate cluster medoids, and the last gene is an additional label denoting how many medoids are used for clustering. The $i$th individual can be written as

$$s_i=(g_{i,1},g_{i,2},\dots,g_{i,k_{max}},g_{i,k_{max}+1}),$$

where $g_{i,1},\dots,g_{i,k_{max}}$ are the first $k_{max}$ genes, encoded as in F-MFCMDM, and $g_{i,k_{max}+1}$ is the label gene, a real number whose rounded value gives the active cluster number $k_i\in[k_{min},k_{max}]$. In F-MFCMDM-UK, only the first $k_i$ genes are valid for the objective function calculation, so the valid part of the individual can be expressed as

$$\hat{s}_i=(g_{i,1},g_{i,2},\dots,g_{i,k_i}).$$

The medoids can then be obtained by using (11).
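Decoding a variable-k individual then reads off the label gene first; an illustrative sketch (the gene ranges and the rounding rule are assumptions, as in the fixed-k case):

```python
def decode_uk(individual, n, k_min, k_max):
    """Decode an F-MFCMDM-UK-style individual: the last gene selects the
    active cluster number k (rounded and clamped to [k_min, k_max]); the
    first k genes, assumed to lie in [0, 1), map to medoid indices."""
    k = max(k_min, min(k_max, round(individual[-1])))
    return [min(int(g * n), n - 1) for g in individual[:k]]
```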

In F-MFCMDM-UK, the random strategy and the preclustering strategy are also used to generate the initial population $P(0)$. Similar to Section 3.3, $\lceil\alpha N\rceil$ and $N-\lceil\alpha N\rceil$ individuals are generated by the random strategy and the preclustering strategy, respectively. Since the cluster number is unknown in F-MFCMDM-UK, the preclustered individuals are generated not only with different distance combinations but also with different cluster numbers. There are four distance combinations and $(k_{max}-k_{min}+1)$ possible cluster numbers; in other words, the individuals are generated by $4(k_{max}-k_{min}+1)$ preclustering algorithms. In addition, the label gene of each individual corresponds to its cluster number: if an individual is generated with cluster number $k$, its label gene is set to a value that rounds to $k$. The initialization procedure of F-MFCMDM-UK is shown in Algorithm 4.

Input: Dataset $X$, population size $N$, proportion parameter $\alpha$, the lower and
   upper bounds of the cluster number $k_{min}$ and $k_{max}$, four distance combinations
   ED, PD, EPD, and NEPD.
Output: Initial population $P(0)$.
for each of the first $\lceil\alpha N\rceil$ individuals do
 Set the first $k_{max}$ genes = random real numbers from their valid range
 and set the label gene = a random real number from $[k_{min}, k_{max}]$.
end
Set $k = k_{min}$.
while the remaining $N-\lceil\alpha N\rceil$ individuals are not all generated do
 if $k > k_{max}$ then
   Set $k = k_{min}$.
 end
 Set the next individual by KM(ED, $k$), with its label gene set according to $k$.
 Set the next individual by KM(PD, $k$), with its label gene set according to $k$.
 Set the next individual by KM(EPD, $k$), with its label gene set according to $k$.
 Set the next individual by KM(NEPD, $k$), with its label gene set according to $k$.
 Set $k = k + 1$.
end
3.7. Procedure of F-MFCMDM and F-MFCMDM-UK

The framework of the proposed algorithm is shown in Figure 2.

4. Experimental Results and Analysis

In this section, four groups of experiments are set up to evaluate the performance of F-MFCMDM and F-MFCMDM-UK. The first experiment shows that F-MFCMDM can partition datasets with different structures by combining the Euclidean and Path distances. The second tests the efficiency of the new evolutionary operator that combines the DE and RMMEDA operators. The third shows that F-MFCMDM-UK can not only partition datasets with different structures but also detect the optimal cluster number automatically. The fourth shows that the computational cost of F-MFCMDM and F-MFCMDM-UK is lower than that of the related works.

Fourteen datasets are used to evaluate the proposed algorithms. They fall into three groups: spherical-type datasets, shown in Figure 3(a); irregular-type datasets, shown in Figure 3(b); and real-life datasets. The spherical-type group contains four artificial datasets in which each cluster has a spherical structure: the clusters in Separated1 and Separated2 are separated from each other, while the clusters in Connected1 and Connected2 overlap. These four datasets test the validity of F-MFCMDM for the Euclidean distance. In the irregular-type group, Spiral and Flame are widely used in the literature [40, 41], and the other two datasets, Circle and Rect, are generated artificially; all four have clusters with irregular structures and test the validity of F-MFCMDM for the Path distance. The real-life group contains six widely used UCI datasets, named Iris, Glass, Soybean, Seeds, Wine, and Liver, which test the validity of F-MFCMDM on real-world data. The details of these datasets, in terms of the number of objects, dimensions, and clusters, are summarized in Table 1.

4.1. Experiments for F-MFCMDM

This subsection will test the validity of F-MFCMDM. The parameters of F-MFCMDM are as follows. We set to to ensure that the individuals generated at random have a large proportion. Since the remaining individuals are generated by four preclustering algorithms, the should satisfy where is a parameter that can avoid the influence arising from each preclustering algorithm if it runs only once, and it should satisfy . Taking into (20), the should satisfy Meanwhile, the nondominated solutions in the final generation are in fact less than . Therefore, we set to and it is sufficient to store all nondominated solutions for all the test datasets in the following experiments. It is clear that the accuracy of the best clustering result may be improved by increasing the . Nevertheless, the Pareto front had not changed in less iterations because some initial individuals are generated by using four preclustering algorithms in Algorithm 1. Hence, we set to in this experiment. The and are all set to and the and are set to and , respectively. The parameters of F-MFCMDM are listed in Table 2.

Three metrics, Rand index (), F-measure (), and Minkowski score () (as shown in (16)), are used to evaluate the performance of the algorithms quantitatively; and are defined in (22). Note that the class label of each object must be known beforehand for all three metrics. In (22), and ; , , and have been defined in Section 3.5, and is the number of object pairs having different class labels and belonging to different clusters. The higher the value of is, the better the clustering.
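As a minimal sketch of one of these metrics, the Rand index can be computed by pair counting: it is the fraction of object pairs on which the true labeling and the clustering agree (both place the pair together, or both place it apart). This is the standard definition; the paper's exact formula in (22) is not reproduced here.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Rand index: fraction of object pairs on which the two
    labelings agree (same-same or different-different)."""
    agree = 0
    pairs = list(combinations(range(len(labels_true)), 2))
    for i, j in pairs:
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true == same_pred:
            agree += 1
    return agree / len(pairs)
```

Note that the index is invariant to a relabeling of clusters, e.g. `rand_index([0,0,1,1], [1,1,0,0])` is 1.0.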

Firstly, we analyze the Pareto front obtained by using F-MFCMDM in a single run. Connected1 and Circle are chosen as the test datasets because Connected1 has three spherical clusters and Circle has three irregular clusters. The Pareto front of Connected1 is presented in Figure 4(a). There are nondominated solutions in the Pareto front. From left to right in Figure 4(a), increases as decreases. We select three meaningful nondominated solutions to analyze the Pareto front. Each of the three nondominated solutions is decoded into four clustering results based on the four candidate distance combinations. Figures 4(b)–4(d) correspond to the nondominated solutions marked with red, green, and blue color in the Pareto front, respectively. The first subfigure in Figure 4(c) is the best clustering result, and it is obtained by the nondominated solution marked with green color based on . Thus the nondominated solution marked with green color corresponds to the best solution, and is the best distance combination for Connected1.

The Pareto front of Circle is presented in Figure 5(a). There are nondominated solutions in the Pareto front. Figures 5(b)–5(d) correspond to the nondominated solutions marked with red, green, and blue color in the Pareto front, respectively. Among the twelve subfigures, three are the best clustering results: the second subfigure in Figure 5(c) and the second and fourth subfigures in Figure 5(d). The second subfigure in Figure 5(c) corresponds to the nondominated solution marked with green color in the Pareto front based on . The second and fourth subfigures in Figure 5(d) correspond to the nondominated solution marked with blue color in the Pareto front based on and , respectively. For this dataset, Path distance is more dominant than Euclidean distance. The two experiments show that Euclidean distance is suitable for datasets with spherical structures and Path distance is suitable for datasets with irregular structures.
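The Path distance behind these results can be sketched as the minimax (bottleneck) distance: for each pair of points, the smallest possible value of the longest Euclidean edge over all paths connecting the pair. A simple way to compute it densely is a Floyd-Warshall-style relaxation over the Euclidean distance matrix (an O(n^3) sketch, not the paper's exact implementation):

```python
import numpy as np

def euclidean_matrix(X):
    """Pairwise Euclidean distances for the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def path_distance_matrix(X):
    """Minimax ("Path") distance: for each pair, the minimum over all
    connecting paths of the maximum edge length along the path."""
    P = euclidean_matrix(X)
    for k in range(len(X)):
        # relax every pair through intermediate point k
        P = np.minimum(P, np.maximum(P[:, k][:, None], P[k, :][None, :]))
    return P
```

On a chain of points, the Path distance between the endpoints collapses to the largest single gap, which is why points along one spiral arm stay mutually close under this measure even when their Euclidean distance is large.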

Secondly, we compare F-MFCMDM with some other fuzzy clustering algorithms based on different distance combination strategies. The first compared method is SFCMdd, which is the fuzzy Kmedoids clustering algorithm based on a single distance measure. SFCMdd(ED) and SFCMdd(PD) represent the SFCMdd based on and , respectively. The second compared method is MFCMdd, which is the fuzzy Kmedoids clustering algorithm based on multiple distance measures. MFCMdd(EPD) and MFCMdd(NEPD) represent the MFCMdd based on and , respectively. The third compared method is MFCMdd-RWL, which is the fuzzy Kmedoids clustering algorithm with relevant weights for multiple distance matrices estimated locally. The fourth compared method is MFCMdd-RWG, which is the fuzzy Kmedoids clustering algorithm with relevant weights for multiple distance matrices estimated globally. Both MFCMdd-RWL and MFCMdd-RWG are executed based on and . All the above four algorithms are from [14] and they apply the cluster medoid to represent a cluster. The fifth compared method is FCM, which applies the cluster center to represent a cluster. The fuzzy factor is set to for all algorithms and the weight factor is set to for MFCMdd-RWL and MFCMdd-RWG.
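For context, a fuzzy K-medoids membership update of the kind used by the SFCMdd family can be sketched as follows. This is the standard FCM-style update applied to a precomputed distance matrix (so any measure, Euclidean or Path, can be plugged in); the exact SFCMdd/MFCMdd update rules and weighting schemes from [14] are not reproduced here, and the exponent is an assumption for fuzzifier m.

```python
import numpy as np

def fuzzy_memberships(D, medoids, m=2.0):
    """FCM-style membership sketch for fuzzy K-medoids.
    D: n x n distance matrix (any measure); medoids: list of object
    indices acting as cluster representatives; m: fuzzifier (> 1)."""
    d = D[:, medoids] + 1e-12            # n x k distances to the medoids
    # u[i, c] = 1 / sum_h (d[i, c] / d[i, h]) ** (2 / (m - 1))
    ratio = d[:, :, None] / d[:, None, :]
    U = 1.0 / (ratio ** (2.0 / (m - 1.0))).sum(axis=2)
    return U
```

Each row of the returned matrix sums to one, and a point equidistant from two medoids receives membership 0.5 in each cluster.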

Table 3 presents the comparative results of the aforementioned algorithms for the fourteen datasets in independent runs. Each compared algorithm produces one clustering result per run. For F-MFCMDM, a single run yields a Pareto front, from which the best clustering result is selected by the semisupervised method introduced in Section 3.5. Hence, we obtain clustering results for each algorithm on each dataset over runs. The average and best values of , , and of the clustering results are listed in this table. As can be seen, F-MFCMDM provides the best performance for eleven datasets in both the average and the best values of the three measures. For Iris, although the best values of the three measures obtained by F-MFCMDM are worse than those obtained by SFCMdd(ED), MFCMdd(NEPD), MFCMdd-RWL, and MFCMdd-RWG, the average values obtained by F-MFCMDM are still the best. For Wine, the best values of and obtained by F-MFCMDM are worse than those obtained by SFCMdd(PD), but the average values of the two measures are better. All the algorithms obtain the best values of the three measures for Separated1, since this dataset has three clearly separated clusters. It is observed from the experiments that the algorithms based on two distance measures perform better than those based on a single distance measure. F-MFCMDM outperforms the fuzzy Kmedoids clustering algorithms with the same four distance combinations (SFCMdd(ED), SFCMdd(PD), MFCMdd(EPD), and MFCMdd(NEPD)). The reasons might be that the nondominated solutions in the Pareto front cover not only the four distance combinations but also other distance combinations, and that F-MFCMDM is based on an evolutionary algorithm, which can find the global solution with a higher probability.
Interestingly, both MFCMdd-RWL and MFCMdd-RWG perform well on the three measures for the four irregular-type datasets. The reason might be that the compactness of each cluster under Path distance is smaller than under Euclidean distance, so both algorithms assign greater weight to Path distance. This explanation was confirmed by inspecting the final weights for all the datasets.

4.2. Experiment for the New Evolutionary Operator

In this paper, we proposed a new evolutionary operator (), which combines the RMMEDA operator () and the DE operator () to increase the convergence speed of F-MFCMDM. In this experiment, the new operator () is compared with the traditional RMMEDA operator () using the same encoding scheme and objective functions as in F-MFCMDM. The two operators are executed on all fourteen datasets over runs each. Two widely used metrics, coverage () and diversity () [30], are used to assess the obtained Pareto fronts. For each dataset, a single run yields two Pareto fronts, and , produced by and , respectively. The coverage metric compares the relative quality of the two Pareto fronts based on the domination relationship: evaluates the proportion of solutions in that are dominated by solutions in . It is defined as shown in the following, where is a truth indicator function, is a solution in Pareto front , is a solution in Pareto front , denotes that solution is dominated by solution , and is the number of nondominated solutions in Pareto front . The range of is . A higher value of indicates that Pareto front is better than Pareto front ; means that every solution of is dominated by at least one solution of , whereas means that no solution of is dominated by any solution of .
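The coverage computation described above can be sketched directly (for two minimization objectives, as in F-MFCMDM; solutions are represented as objective-value tuples):

```python
def dominates(a, b):
    """a dominates b (minimization): no worse in every objective
    and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def coverage(A, B):
    """C(A, B): proportion of solutions in front B that are
    dominated by at least one solution in front A."""
    return sum(any(dominates(a, b) for a in A) for b in B) / len(B)
```

Note that coverage is asymmetric: `coverage(A, B)` and `coverage(B, A)` must be reported together, which is why Table 4 lists both directions.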

The distribution metric () assesses the distribution of a Pareto front. It is obtained from a relative distance measure over all nondominated solutions in the Pareto front. Given a Pareto front , its distribution can be described as follows, where denotes the number of nondominated solutions in the final Pareto front, is the sum of the differences in objective function values between the th and ()th solutions in Pareto front , and is the mean of . Since F-MFCMDM has two objective functions, the value of is . Note that smaller values of indicate more uniformly distributed solutions.
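Since the paper's exact formula is not reproduced above, the following is a sketch of one common spacing-style variant of such a metric, assuming the gaps are summarized by their standard deviation normalized by their mean (smaller means more uniform spacing):

```python
import numpy as np

def distribution_metric(front):
    """Spacing-style distribution sketch: sort the front by the first
    objective, take the L1 gap between consecutive solutions, and
    report std(gaps) / mean(gaps). A perfectly even front scores 0."""
    F = np.array(sorted(front))
    gaps = np.abs(np.diff(F, axis=0)).sum(axis=1)  # d_i between neighbors
    return gaps.std() / gaps.mean()
```

For example, a front whose consecutive solutions are evenly spaced scores exactly 0, while uneven spacing pushes the score up.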

Table 4 presents the comparative results of the average/best coverage metric and distribution metric obtained by using and over runs. Firstly, the values of and are shown in the second and third columns, respectively. The values of are larger than for most of the datasets, which shows that the Pareto fronts obtained by are superior to those obtained by . In particular, the best values of are for Separated2, Connected2, Circle, Rect, and Glass over runs, showing that every solution produced by is dominated by at least one solution produced by . On the other hand, the values of are for most of the datasets, which indicates that no solution obtained by is dominated by any solution obtained by . The values of and are for Spiral, which indicates that and obtain similar Pareto fronts in all runs. Secondly, the values of and are shown in the fourth and fifth columns, respectively. The Pareto fronts obtained by have smaller average/best distribution values than those obtained by for almost all test datasets except Iris over runs, which indicates that the Pareto fronts produced by are more uniform than those produced by .

4.3. Experiments for F-MFCMDM-UK

This subsection will test the validity of the updated algorithm F-MFCMDM-UK, which can detect the optimal cluster number for datasets with different structures. The acceptable lower bound of the cluster number is set to . The acceptable upper bound of the cluster number is difficult to set because it determines the length of an individual. On the one hand, too large a value wastes memory and time; on the other hand, too small a value may exclude the optimal cluster number from the search range. For simplicity, we set to . The , , , , and are the same as those in F-MFCMDM. Since individuals are generated by preclustering algorithms, the population size should satisfy . Substituting , , and into (25), we obtain the condition in . In addition, the length of an individual in F-MFCMDM-UK is larger than that in F-MFCMDM; hence, we set to for F-MFCMDM-UK. Similarly, the of F-MFCMDM-UK also needs to be larger than that of F-MFCMDM, and we set it to . The parameter settings of F-MFCMDM-UK are listed in Table 5.

In this subsection, we also analyze the Pareto front obtained by using F-MFCMDM-UK in a single run. Separated2 and Flame are chosen as the test datasets. The Pareto front of Separated2 is presented in Figure 6(a), which contains nondominated solutions. Three nondominated solutions are again selected to analyze the Pareto front. Each selected nondominated solution is decoded into four clustering results based on the four distance measure combinations. Apart from , , and , the optimal cluster number () is also recorded. Figures 6(b)–6(d) correspond to the nondominated solutions marked with red, green, and blue color in the Pareto front, respectively. Four subfigures achieve both the best clustering result and the optimal cluster number: the first and third subfigures in Figure 6(b) and the first and third subfigures in Figure 6(c). It is noted that the nondominated solutions marked with red and green color are very close. Furthermore, as increases and decreases, the values of , , , and all deteriorate. We can conclude that the nondominated solution marked with red or green color corresponds to the best solution, and either or can be regarded as the best distance combination. The Pareto front of Flame is presented in Figure 7(a), which contains nondominated solutions. Figures 7(b)–7(d) correspond to the nondominated solutions marked with red, green, and blue color in the Pareto front, respectively. The nondominated solutions marked with red and green color have the same optimal cluster number (=3), while the nondominated solution marked with blue color yields another view (=2). By looking for the minimum , the second subfigure in Figure 7(c) is the best clustering result despite the incorrect optimal cluster number.

F-MFCMDM-UK is compared with several other methods. Firstly, we consider validity indices, the most widely used approach to determine the optimal cluster number automatically; the and indices are used in our experiment. Secondly, some evolutionary clustering algorithms that can automatically detect the optimal cluster number are also used. The first is GCUK [25], a single-objective clustering algorithm that finds the optimal cluster number by optimizing the index. The second is MoMODEFC [31], a multiobjective clustering algorithm that finds the optimal cluster number by optimizing and simultaneously. These four compared methods are all based on Euclidean distance and are not suitable for irregular-type datasets. Hence, RAC-Kmeans [24] is also used as a compared method; it detects the optimal cluster number via a specific graph-partitioning process and is not sensitive to the choice of distance measure.

The parameters of the compared methods need to be set in advance. For , , GCUK, and MoMODEFC, and are set to and [42], respectively. For RAC-Kmeans, the two parameters are set to and , respectively, and Kmeans is used as the underlying clustering technique. For GCUK and MoMODEFC, the and are set to and , respectively.

Table 6 summarizes the clustering results obtained by F-MFCMDM-UK and the compared algorithms on several test datasets over runs. denotes that the optimal cluster number is obtained times in runs; , , and denote the average values of , , and , respectively, over runs. F-MFCMDM-UK obtains the best , , , and for Separated1, Separated2, Spiral, Circle, and Iris in all runs, indicating that F-MFCMDM-UK can detect the optimal cluster number for datasets with different structures. For Connected1, all of the algorithms detect the best ; however, the values of , , and obtained by F-MFCMDM-UK are worse than those obtained by , , GCUK, and MoMODEFC. The reason might be that the separation term in (17) influences the clustering results. Although F-MFCMDM-UK does not detect the best for Flame, its values of , , and are the best. For Iris, all the algorithms indicate the best optimal cluster number, yet even among them the values of , , and obtained by F-MFCMDM-UK are the best; the reason is that Path distance is the best choice for this dataset with . Most of the compared algorithms except RAC-Kmeans obtain acceptable clustering results for Separated1, Separated2, Connected1, Iris, and Soybean, but worse results for Spiral, Circle, and Flame, because these algorithms are based on Euclidean distance. RAC-Kmeans performs well on Separated1, Spiral, and Circle but fails on the remaining datasets, because it is suitable for datasets with separated clusters and unsuitable for datasets with connected clusters. As this experiment shows, F-MFCMDM-UK can detect the optimal cluster number not only for datasets with spherical structures but also for datasets with irregular structures.

4.4. Experiment for the Computational Cost

According to our motivation, F-MFCMDM and F-MFCMDM-UK should have a smaller computational cost than the previous algorithms MOECDM and MOEACDM. In this subsection, we report the computational cost of the four algorithms on several test datasets and analyze the reasons for the results. Note that the cluster numbers of F-MFCMDM and MOECDM are predefined in advance, while those of F-MFCMDM-UK and MOEACDM are unknown. Hence, F-MFCMDM and F-MFCMDM-UK are compared with MOECDM and MOEACDM, respectively. Each algorithm is executed over runs. For fairness, the proposed methods and the compared methods use the same parameters (, ). All algorithms are run on a personal computer with a 2.5 GHz Intel Core central processing unit and 4 GB of memory, with the code implemented in MATLAB R2014b. The average running times of the four algorithms are recorded in Table 7. The running time of F-MFCMDM is less than that of MOECDM on all test datasets, and the running time of F-MFCMDM-UK is likewise less than that of MOEACDM on all test datasets.

Let us discuss the reasons by analyzing the differences between the proposed algorithms and the compared algorithms. There are four main aspects in each algorithm, as shown in Table 8. For the encoding scheme, MOECDM and MOEACDM apply label encoding to construct an individual, while F-MFCMDM and F-MFCMDM-UK apply real encoding. The lengths of an individual in MOECDM, MOEACDM, F-MFCMDM, and F-MFCMDM-UK are , , , and , respectively. For the evolutionary operator, MOECDM and MOEACDM apply crossover, mutation, and selection operators to generate the new population, while F-MFCMDM and F-MFCMDM-UK apply and selection operators. For the objective functions, MOECDM uses the compactness between objects; MOEACDM uses the compactness of MOECDM plus the separation between clusters; F-MFCMDM uses the compactness between objects and medoids; and F-MFCMDM-UK uses the compactness of F-MFCMDM plus the separation between medoids.

According to these differences, the length of the individual greatly influences the computational cost. In MOECDM and MOEACDM, the individual is an integer vector of genes, where is the number of data objects and the th gene represents the th data object. In F-MFCMDM, the individual is a real vector of genes, where is the number of clusters and the th gene represents the th cluster medoid. Generally, for a dataset, is much smaller than , so the running time of F-MFCMDM is less than that of MOECDM. In F-MFCMDM-UK, the individual is a real vector of genes; in this paper, is also much smaller than , so the running time of F-MFCMDM-UK is likewise less than that of MOEACDM.

5. Conclusion and Future Work

Distance measure plays an important role in clustering because it measures the relationship between each pair of points. Most popular clustering algorithms depend too heavily on Euclidean distance, which causes incorrect evaluation of datasets with different structures. In this paper, a multiobjective fuzzy clustering algorithm called F-MFCMDM is proposed to partition datasets with different structures and to reduce the computational cost. In contrast to most clustering algorithms, which rely on Euclidean distance alone, F-MFCMDM is based on two distance measures, i.e., Euclidean distance and Path distance. Therefore, it can partition datasets not only with spherical structures but also with irregular structures. The two objective functions are optimized simultaneously by a multiobjective evolutionary algorithm with some modified operators. A real encoding scheme is used to construct the individual, which consists of cluster medoids, and an improved evolutionary operator is proposed to increase the diversity of the population and the convergence speed of F-MFCMDM. In the final generation, we obtain a set of nondominated solutions, each of which can be regarded as a tradeoff between Euclidean distance and Path distance. Each solution is then mapped into four clustering results based on the four distance measure combinations, and a semisupervised method is used to select the best nondominated solution and the best candidate distance combination. Moreover, we also extend the algorithm to detect the optimal cluster number automatically by seeking a balance between compactness and separation. The work reported in this paper is preliminary, and there are several possible ways to improve our approach: more than two distance measures could be incorporated into the model to partition datasets with even more structures, and the best nondominated solution could be selected by an unsupervised method.

Data Availability

In this manuscript, we use 14 datasets to support our experiments. The six datasets Separated1, Separated2, Connected1, Connected2, Circle, and Rect used to support the findings of this study are available upon request from the corresponding author ([email protected]). The two datasets Spiral and Flame can be obtained from "Adaptive k-means algorithm for overlapped graph clustering. International Journal of Neural Systems, 22(05), 133-297," or by contacting the corresponding author ([email protected]). The six datasets Iris, Glass, Soybean, Seeds, Wine, and Liver used to support the findings of this study are available from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (Nos. 61703278 and 61772342) and the National Project Foster Fund of the University of Shanghai for Science and Technology.