Abstract
This paper investigates three-way clustering involving fuzzy covering, threshold acquisition, and boundary region processing. First, a valid fuzzy covering of the universe is constructed on the basis of an appropriate fuzzy similarity relation, which helps capture the structural information and the internal connections of the dataset from a global perspective. Because of these advantages, we use the valid fuzzy covering instead of the raw dataset for RFCM algorithm-based three-way clustering. Subsequently, from the perspective of a semantic interpretation that balances the uncertainty changes in fuzzy sets, a method of acquiring partition thresholds that combines linear and nonlinear fuzzy entropy is proposed. Furthermore, boundary regions in three-way clustering correspond to abstaining decisions and generate uncertain rules. To improve classification accuracy, the k-nearest neighbor (kNN) algorithm is used to reduce the number of objects in the boundary regions. The experimental results show that the proposed three-way clustering based on fuzzy covering and the kNN-FRFCM algorithm outperforms the compared algorithms in most cases.
1. Introduction
Three-way decisions (3WD), proposed by Yao [1, 2], have been a hot topic in various fields in recent years. Since the idea was put forward, tripartition has attracted the attention of many scholars. Recently, great progress has been made in the theoretical research and model building of three-way decisions based on rough sets. For example, Liang and Liu et al. [3–6] proposed fuzzy three-way decision models and stochastic three-way decision models to deal with real-valued or linguistic-valued decision-making problems. Qian et al. [7] established a multigranulation decision-theoretic rough set model based on granular computing theory. Hu [8, 9] introduced the concept of three-way decision space and established a three-way decision model based on partially ordered sets. Qi et al. [10] investigated the 3WD model in the framework of lattice theory. Li et al. [11] constructed a cost-sensitive sequential three-way decision model to simulate the decision-making process from coarse granularity (high cost) to fine granularity (low cost); see [12–14] for further generalizations and applications of this model. Yao et al. [15] constructed an optimization-based framework for three-way approximations of fuzzy sets. Meanwhile, for dynamic objects and attributes, some algorithms and incremental 3WD models have been designed for the classification of dynamic data [16, 17]. From the viewpoint of application, three-way decisions have been widely used in research fields such as pattern recognition [18, 19], artificial intelligence [20–22], engineering, management [23], and social communities [24].
Based on the above background and work on three-way decisions, a novel method for three-way clustering based on fuzzy covering is discussed. First, a fuzzy covering of the dataset is constructed according to a reasonable fuzzy similarity relation. The fuzzy covering of the universe requires that the more similar the objects in the universe are, the more similar the corresponding fuzzy classes are. A fuzzy covering established in this way better reflects the intrinsic relationships between objects in the universe. Therefore, clustering on a valid fuzzy covering can be expected to yield more accurate results. One of the inevitable problems of clustering is threshold calculation. As is well known, for most of the three-way decision models mentioned above, we first need to obtain the pair of partition thresholds α and β. Different thresholds lead to different decision results. Appropriate partition thresholds make the decision more accurate, whereas inappropriate thresholds distort the decision. Traditionally, the partition thresholds are selected in advance according to experts' experience [25–27]. Based on the loss function, Yao et al. [1] proposed a method to determine the thresholds by Bayesian risk decision theory. Using Shannon entropy as a measure of uncertainty, Deng et al. [28] presented an information-theoretic approach to explain and calculate the thresholds. Zhou et al. [29] explored the shadowed set to automatically obtain the partition thresholds of three-way decisions but could not give a theoretically reasonable semantic explanation. To address this issue, inspired by the idea of balancing the uncertainty change of fuzzy sets, a threshold calculation method combining linear fuzzy entropy with nonlinear fuzzy entropy is proposed. This method provides a new scientific explanation for the generation of thresholds.
Finally, the boundary regions of three-way clustering are processed by the kNN algorithm to reduce uncertainty and improve decision accuracy.
The rest of this paper is structured as follows. Section 2 briefly introduces the necessary notions of three-way decisions. Section 3 focuses on constructing the fuzzy covering of the raw dataset according to the fuzzy similarity relation and some necessary conditions and discusses its related properties. In Section 4, a novel rough fuzzy C-means (FRFCM) algorithm based on valid fuzzy covering is established. Then, we investigate the partition thresholds by combining the linear and nonlinear fuzzy entropy. Furthermore, the framework for processing the boundary region of three-way clustering using the kNN algorithm is introduced. In Section 5, the validity and practicability of the algorithm are evaluated experimentally. Concluding remarks are given in Section 6.
2. Preliminaries
The basic concepts of three-way decisions are briefly reviewed in this section.
An information system is defined as a 4-tuple , where denotes a finite nonempty universe, is a nonempty finite set of condition attributes, is a nonempty finite set of decision attributes, and , where is a domain of attribute ; is an information function such that for every . If is a membership function value, then the value of object under attribute can be expressed as .
The trisecting-and-acting framework of three-way decisions extends binary decision in order to overcome some of its shortcomings. The traditional binary decision model has only acceptance and rejection options, which can easily lead to errors when the available information is insufficient to make an accurate judgment. Sometimes, the cost of wrong decisions is very high. Therefore, a deferment decision is necessary, which allows decision makers to collect more information and make a more accurate judgment. This is a strategy that people often adopt in the decision-making process, and the deferment decision is consistent with human cognition. A three-way decision model based on an evaluation function and a pair of thresholds is shown as follows.
Definition 1. (see [30]). Let U be a finite nonempty universe, be an evaluation function, and a pair of thresholds, , then the positive, negative, and boundary regions of any subset are defined as follows:
The evaluation function is the key to the decision. Different evaluation functions lead to different decision results. There are various evaluation functions that can be adopted. If a fuzzy membership function is used as an evaluation function, then the induced three regions are defined by the following equations [31]:
The three-valued approximations of a fuzzy set are described by Zadeh [32] as follows: (1) , if ; (2) does not belong to , if ; (3) and has an indeterminate status relative to , if . These three cases correspond to the three-way decisions of the above fuzzy set. When and , we obtain the qualitative three-way decisions of a fuzzy set. However, the qualitative decision model of a fuzzy set is very restrictive, and we generally do not select these two thresholds.
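As an illustration of the trisection induced by a fuzzy membership function, the following minimal sketch assumes the common convention that objects with membership at least α are accepted, objects with membership at most β are rejected, and all remaining objects are deferred. The function name `trisect` and the sample memberships are hypothetical and serve only as an example.

```python
def trisect(membership, alpha, beta):
    """Split objects into positive, boundary, and negative regions.

    membership: dict mapping object id -> membership degree in [0, 1].
    alpha, beta: thresholds, assumed to satisfy 0 <= beta < alpha <= 1.
    """
    if not 0 <= beta < alpha <= 1:
        raise ValueError("expected 0 <= beta < alpha <= 1")
    pos = {x for x, mu in membership.items() if mu >= alpha}  # accept
    neg = {x for x, mu in membership.items() if mu <= beta}   # reject
    bnd = set(membership) - pos - neg                         # defer: beta < mu < alpha
    return pos, bnd, neg


# Hypothetical membership degrees of four objects in one fuzzy set.
mu = {"x1": 0.95, "x2": 0.60, "x3": 0.10, "x4": 0.80}
pos, bnd, neg = trisect(mu, alpha=0.8, beta=0.2)
print(sorted(pos), sorted(bnd), sorted(neg))
```

With α = 1 and β = 0, the same function yields the qualitative model mentioned above, in which only fully certain objects leave the boundary region.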
3. Fuzzy Covering and Its Validity
This section focuses on constructing a valid fuzzy covering of the raw data and discusses the properties of the fuzzy covering. Let us first recall some concepts that help us to better understand fuzzy covering.
Definition 2. (see [33, 34]). Let be a finite universe and be the fuzzy power set of . For each , we call with , a fuzzy covering of , if for each . is called a fuzzy covering approximation space. If for each , then is called a fuzzy covering of U. is called a fuzzy covering approximation space. for each , then is called a fuzzy partition of U. We call a fuzzy partition approximation space.
Definition 3. (see [35]). Let be a mapping . is called the degree of similarity between fuzzy sets and , if satisfies the following properties:
(1)
(2) =
(3) if , then
Some similarity measures are listed as follows:
The fuzzy set in this paper is constructed by a fuzzy similarity relation which satisfies the following properties. For any ,
(1)
(2)
For a fuzzy similarity relation , , and , the membership of belonging to fuzzy set is denoted as
Obviously, if , it means that certainly belongs to . Conversely, if , it indicates that certainly does not belong to . is also called a fuzzy similarity class associated with on . Therefore, the set of fuzzy similarity classes constructed by relation is a fuzzy covering of universe .
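The construction of fuzzy similarity classes can be sketched as follows. The similarity relation used here, one minus the Euclidean distance scaled by the largest pairwise distance, is only an assumption chosen because it is reflexive and symmetric; it is not claimed to be the relation used in this paper. The helper names `similarity_matrix` and `is_fuzzy_covering` are hypothetical.

```python
import math


def similarity_matrix(data):
    """R[i][j] = 1 - d(x_i, x_j) / d_max (assumed relation; reflexive, symmetric)."""
    n = len(data)
    dist = [[math.dist(data[i], data[j]) for j in range(n)] for i in range(n)]
    d_max = max(max(row) for row in dist) or 1.0  # guard against identical objects
    return [[1.0 - dist[i][j] / d_max for j in range(n)] for i in range(n)]


def is_fuzzy_covering(classes):
    """Each object must reach membership 1 in at least one fuzzy class."""
    n = len(classes[0])
    return all(max(c[j] for c in classes) == 1.0 for j in range(n))


# Three hypothetical objects in the plane.
data = [(0.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
R = similarity_matrix(data)
classes = R  # row i is the fuzzy similarity class generated by object i
print(is_fuzzy_covering(classes))
```

Because the relation is reflexive, every object has membership 1 in its own class, so the family of rows covers the universe.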
In the following, we investigate the validity and related properties of the fuzzy covering of the raw dataset.
Definition 4. Let be a universe. is the fuzzy similarity relation on , and is the similarity relation on . is a fuzzy covering of constructed by fuzzy similarity relation R. For any , is the set of objects similar to . is defined as a valid fuzzy covering of with respect to , if the following condition holds:
where .
It is easy to see that the value of depends on and and the choice of . is generally assigned a value no less than 0.8. The closer is to 1, the better the relation expresses the structure of the sample space. If is less than 0.5, the fuzzy covering of the universe is invalid. A valid fuzzy covering ensures that similar objects in have correspondingly similar fuzzy classes, so the fuzzy covering more fully reflects the original distribution of objects in .
Proposition 1. Let , then .
Proof. It can be easily verified by the definition.
Remark 1. Let and be two valid fuzzy coverings of with respect to the same . We choose the fuzzy covering with the larger validity index as the research data.
4. Three-Way Clustering
4.1. Rough Fuzzy C-Means Algorithm Based on Fuzzy Covering
In this section, we discuss the rough fuzzy C-means algorithm with fuzzy covering. The reason for clustering with fuzzy covering is that each fuzzy similarity class reflects its relationship with the whole dataset, which avoids the excessive loss of clustering information incurred with raw data.
The combination of fuzzy sets and rough sets provides an important direction for uncertain reasoning. Lingras [36] developed rough C-means (RCM) by combining the C-means clustering algorithm with rough set theory. In RCM, the new clustering center depends only on the positive region and the boundary region, unlike fuzzy C-means (FCM) [37], in which it depends on all objects. Since no membership is involved, RCM cannot effectively deal with the uncertainty caused by overlapping boundaries. To address this, Mitra et al. [25] proposed a rough fuzzy C-means (RFCM) algorithm, which combines the advantages of both fuzzy sets and rough sets in the framework of the C-means clustering algorithm. The innovation of rough fuzzy C-means is that, when dividing objects into approximation regions, it replaces the absolute distance with a fuzzy membership. This adjustment makes the clustering more robust to overlapping clusters. Maji et al. [26] modified the calculation of the new clustering center in the RFCM model by assuming that the objects in the lower approximation have definite weights and the objects in the boundary have fuzzy weights. In what follows, we discuss the rough fuzzy C-means of fuzzy covering (FRFCM) algorithm, which is an RFCM algorithm based on the fuzzy covering of the universe.
Suppose is a valid fuzzy covering of . The cluster centers are denoted as . In the FRFCM algorithm, is divided into clusters . The membership of to the cluster is
where is the distance between and , , and . The parameter is the fuzzifier, which is greater than 1.
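The membership computation can be sketched as follows, assuming the standard fuzzy C-means form in which the membership of an object in a cluster is the reciprocal of the sum of distance ratios raised to the power 2/(m − 1). In FRFCM the objects would be fuzzy similarity classes represented as vectors; the function name `fcm_memberships` and the sample points are hypothetical.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Standard FCM memberships of one object in every cluster (fuzzifier m > 1)."""
    if m <= 1:
        raise ValueError("fuzzifier m must be greater than 1")
    d = [math.dist(point, v) for v in centers]
    if 0.0 in d:
        # The object coincides with one or more centers: share full membership.
        hits = d.count(0.0)
        return [1.0 / hits if di == 0.0 else 0.0 for di in d]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dk) ** power for dk in d) for di in d]


# One object at distance 1 from the first center and 3 from the second.
u = fcm_memberships((1.0, 0.0), [(0.0, 0.0), (4.0, 0.0)], m=2.0)
print(u)
```

The memberships of one object always sum to 1, which is the constraint the fuzzy partition imposes on each object.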
A two-category dataset is taken to explain the influence of different parameters on classification. The membership degree of each object belonging to each cluster can be considered as a function of the relative distances and the fuzzifier parameter. Then, formula (6) translates to the following form:
where denotes the relative distance of an object with respect to one of the clusters.
The uncertainty caused by different values of the fuzzifier parameter is illustrated in Figure 1.
It is easy to see that if the value of tends to 1, the memberships become nearly crisp and the uncertainty of the system is reduced, which is suitable for three-way clustering. In this circumstance, only objects that are approximately the same distance from each cluster center are divided into boundary regions. In addition, the parameter cannot be assigned a very large value because, as the value increases, only objects around the center of a cluster retain memberships close to 1 and most other objects are divided into the boundary region, which increases the uncertainty of the system and the error rate of decision-making. Furthermore, the positive region of a cluster may become empty.
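The influence of the fuzzifier in the two-cluster case can be explored with a short sketch. It assumes the standard two-cluster FCM membership written in terms of the relative distance r = d1/d2, namely 1 / (1 + r^(2/(m − 1))); the function name `two_cluster_membership` and the sampled values of r and m are illustrative choices.

```python
def two_cluster_membership(r, m):
    """Membership in cluster 1 for relative distance r = d1 / d2 and fuzzifier m > 1."""
    return 1.0 / (1.0 + r ** (2.0 / (m - 1.0)))


# Print memberships for a few relative distances under three fuzzifier values.
for m in (1.1, 2.0, 5.0):
    row = [round(two_cluster_membership(r, m), 3) for r in (0.5, 0.9, 1.0, 1.1, 2.0)]
    print(m, row)
```

For a fixed relative distance below 1, the membership moves toward 0.5 as m grows, so more objects end up with nearly equal memberships in both clusters and fall into the boundary region.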
The center vectors are updated as follows:
where and can be considered as the contributions to the center by the fuzzy lower region and fuzzy boundary region, respectively. denotes the boundary region of cluster , where and are the lower and upper approximations of cluster with respect to relation R, respectively. The weights and usually satisfy and . In this paper, we take and .
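A center update of this kind can be sketched as follows, assuming the RFCM-style rule in which lower-region objects contribute with crisp weight 1, boundary objects contribute with their memberships raised to the fuzzifier, and the two contributions are mixed with weights that sum to 1. The function name `update_center` is hypothetical, and the default weight 0.7 is an illustrative value, not the value used in this paper.

```python
def update_center(lower, boundary, m=2.0, w_low=0.7):
    """One RFCM-style center update for a single cluster.

    lower: list of vectors in the lower approximation (crisp weight 1).
    boundary: list of (vector, membership) pairs in the boundary region.
    w_low: weight of the lower region; the boundary receives 1 - w_low.
           The default 0.7 is illustrative only.
    """
    def mean(vectors):
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    def fuzzy_mean(pairs):
        # Assumes at least one boundary membership is positive.
        total = sum(u ** m for _, u in pairs)
        dim = len(pairs[0][0])
        return [sum((u ** m) * x[k] for x, u in pairs) / total for k in range(dim)]

    if lower and boundary:
        a, b = mean(lower), fuzzy_mean(boundary)
        return [w_low * ak + (1.0 - w_low) * bk for ak, bk in zip(a, b)]
    if lower:
        return mean(lower)      # empty boundary: plain mean of the lower region
    if boundary:
        return fuzzy_mean(boundary)  # empty lower region: fuzzy mean only
    raise ValueError("cluster has no objects")


center = update_center([(0.0, 0.0), (2.0, 0.0)], [((4.0, 0.0), 0.5)])
print(center)
```

Giving the lower region the larger weight keeps the center anchored to objects that certainly belong to the cluster while still letting boundary objects pull it slightly.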
The approximation regions are determined by the FRFCM algorithm according to the following principles: if , where and , then , which also means . In this case, cannot be assigned to the positive region of any cluster. Otherwise, and . Owing to the particular structure of the fuzzy covering of , the results of clustering the fuzzy covering with the above FRFCM algorithm reflect the clustering results of the raw dataset well.
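The assignment of objects to approximation regions can be sketched as follows, assuming the usual RFCM rule: when the gap between the highest and second-highest memberships of an object does not exceed a threshold, the object is placed in the boundary regions of both clusters; otherwise it is placed in the lower approximation of the cluster with the highest membership. The function name `assign_regions` and the parameter name `delta` are hypothetical.

```python
def assign_regions(memberships, delta):
    """memberships: one list of cluster memberships per object (at least 2 clusters).

    Returns (lower, boundary), each a list with one set of object indices per cluster.
    """
    c = len(memberships[0])
    lower = [set() for _ in range(c)]
    boundary = [set() for _ in range(c)]
    for j, u in enumerate(memberships):
        order = sorted(range(c), key=lambda i: u[i], reverse=True)
        first, second = order[0], order[1]
        if u[first] - u[second] <= delta:
            # Too close to call: the object belongs to both upper approximations.
            boundary[first].add(j)
            boundary[second].add(j)
        else:
            lower[first].add(j)
    return lower, boundary


U = [[0.9, 0.1], [0.52, 0.48], [0.2, 0.8]]
lower, boundary = assign_regions(U, delta=0.1)
print(lower, boundary)
```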
4.2. Acquisition of Thresholds for Three-Way Clustering
In this section, we first review the shadowed set model for computing thresholds. Then, a novel method of calculating thresholds is proposed by combining the linear and nonlinear fuzzy entropy.
The FRFCM algorithm is an important tool for dealing with imprecise, incomplete, and inconsistent data. The thresholds in FRFCM, which determine the formation of the approximation regions, should be carefully selected. Unreasonable thresholds may distort the partition of the approximation regions, and clustering centers may deviate from the expected locations. Therefore, we should compute the partition thresholds scientifically according to some principles.
There are many methods to obtain the thresholds, and the most popular is the shadowed set [38]. The shadowed set elevates and reduces membership degrees, which divides the domain of a fuzzy set into three regions. The corresponding membership function is as follows:
where is the membership function of fuzzy set .
In the following study, only discrete fuzzy systems are considered; similar models and conclusions can be obtained for continuous fuzzy systems. According to shadowed set theory, the optimal thresholds α and β are obtained by minimizing the following formula:
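As an illustration of the shadowed-set criterion for a discrete fuzzy set, the sketch below assumes the usual balance objective: the membership removed by reducing low values to 0, plus the membership added by elevating high values to 1, should match the number of objects left in the shadow. The grid search, the constraint β < 0.5 < α, and the function name `shadowed_set_thresholds` are assumptions made for this example.

```python
def shadowed_set_thresholds(mu, step=0.01):
    """Grid search for (alpha, beta) with beta < 0.5 < alpha that minimizes
    |reduced membership + elevated membership - size of the shadow|."""
    best = None
    n = round(0.5 / step)
    for a in range(1, n):
        for b in range(1, n):
            beta, alpha = b * step, 1.0 - a * step
            reduced = sum(x for x in mu if x <= beta)          # lowered to 0
            elevated = sum(1.0 - x for x in mu if x >= alpha)  # raised to 1
            shadow = sum(1 for x in mu if beta < x < alpha)    # left undecided
            v = abs(reduced + elevated - shadow)
            if best is None or v < best[0]:
                best = (v, alpha, beta)
    return best[1], best[2]


sample = [0.05, 0.1, 0.45, 0.5, 0.55, 0.9, 0.95]
alpha, beta = shadowed_set_thresholds(sample)
print(alpha, beta)
```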
However, the semantic interpretation of the threshold pairs obtained by the above method is not very clear. Because the shadowed set model cannot reasonably explain the relationship between the obtained shadowed set and the fuzziness of the raw fuzzy set, further research is needed. Various methods for measuring uncertainty are described in the literature [39]. Fuzzy entropy is an important tool for measuring the uncertainty of a fuzzy set and meets the following requirements.
Definition 5. (see [40]). Let be a fuzzy set on the universe of discourse . The fuzzy entropy of fuzzy set is the mapping , which satisfies the following four conditions:
(1) if
(2)
(3) , if or , then
(4)
It is easy to verify that when, for any , or , the value of the corresponding entropy function is 0, and the fuzzy entropy of the fuzzy set then equals 0; i.e., the uncertainty of the fuzzy set is minimal. When holds for any , the value of the corresponding entropy function is 1, and the fuzzy set has maximum uncertainty. The commonly used linear and nonlinear fuzzy entropy functions are listed as follows [41–43]:
With the above fuzzy entropy functions, the corresponding fuzzy entropy of the fuzzy set can be easily obtained as follows:
The basic idea of calculating the thresholds by fuzzy entropy is to reduce to 0 the uncertainty of the memberships of the objects that undergo the elevating or reducing operation in the shadowed set, while the memberships of the objects corresponding to the middle part of the shadowed set are adjusted to maximal uncertainty; i.e., the fuzzy degree increases to 1. In what follows, we propose a flexible fuzzy entropy method that combines the linear fuzzy entropy function and the nonlinear fuzzy entropy function to obtain the clustering thresholds. The calculation model is as follows:
where is a parameter adjusting the impacts of linear entropy and nonlinear entropy.
In equation (13), when , only the linear fuzzy entropy function is used to calculate the thresholds. If , only the nonlinear fuzzy entropy function is used. The smaller the value of , the greater the influence of the linear fuzzy entropy, and vice versa. In the subsequent experiments of this study, we assign .
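The idea of balancing uncertainty changes with a flexible fuzzy entropy can be sketched as follows. Several choices here are assumptions rather than the functions of this paper: the linear entropy is taken as the piecewise-linear tent function, the nonlinear entropy as the quadratic 4μ(1 − μ), the mixing parameter `lam` weights the nonlinear part, and the objective balances the uncertainty removed outside (β, α) against the uncertainty added inside it. All function names are hypothetical, and the default `lam=0.5` is illustrative.

```python
def linear_entropy(mu):
    """Tent-shaped entropy: 0 at mu = 0 or 1, and 1 at mu = 0.5."""
    return 2.0 * mu if mu <= 0.5 else 2.0 * (1.0 - mu)


def quadratic_entropy(mu):
    """An assumed nonlinear entropy with the same end points and maximum."""
    return 4.0 * mu * (1.0 - mu)


def flexible_entropy(mu, lam):
    """lam = 0 gives the linear function, lam = 1 the nonlinear one."""
    return (1.0 - lam) * linear_entropy(mu) + lam * quadratic_entropy(mu)


def entropy_thresholds(memberships, lam=0.5, step=0.01):
    """Grid search for (alpha, beta) balancing removed and added uncertainty."""
    best = None
    n = round(0.5 / step)
    for a in range(1, n):
        for b in range(1, n):
            beta, alpha = b * step, 1.0 - a * step
            removed = sum(flexible_entropy(x, lam)
                          for x in memberships if x >= alpha or x <= beta)
            added = sum(1.0 - flexible_entropy(x, lam)
                        for x in memberships if beta < x < alpha)
            v = abs(removed - added)
            if best is None or v < best[0]:
                best = (v, alpha, beta)
    return best[1], best[2]


sample = [0.05, 0.1, 0.45, 0.5, 0.55, 0.9, 0.95]
alpha, beta = entropy_thresholds(sample)
print(alpha, beta)
```

Because both component functions agree at 0, 0.5, and 1, the mixed function satisfies the same boundary requirements for any `lam` between 0 and 1.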
Figure 2 illustrates the increase and decrease in fuzzy degree of the fuzzy entropy function by taking the linear fuzzy entropy function , the nonlinear fuzzy entropy function , and the fuzzy entropy function which is combined by and with equal weight as examples.
It can be seen from Figure 2 that the curve of the flexible fuzzy entropy function lies between the curves of the linear and nonlinear entropy functions. Using flexible fuzzy entropy to obtain the thresholds prevents the uncertainty of a fuzzy set measured by linear or nonlinear fuzzy entropy alone from being too small or too large, which would make the partition thresholds unreasonable.
Thresholds used in RFCM and its related algorithms are usually user-defined. In contrast, the thresholds calculated by the above model can not only be interpreted in terms of the change in the fuzzy degree of the fuzzy set but also be adjusted and optimized automatically.
According to α and β, the positive, boundary, and negative regions of each cluster can be expressed as
where is the membership degree of the object belonging to the class.
4.3. Boundary Region Processing of Three-Way Clustering Based on kNN Algorithm
Following the above discussion on automatically selecting the optimal partition thresholds based on fuzzy entropy theory, this section presents the processing of objects in the boundary regions of three-way clustering.
In three-way clustering, the boundary region objects are rarely processed further. The k-nearest neighbor (kNN) algorithm [44] is a well-known nonparametric classifier, which is considered one of the simplest methods in data mining and pattern recognition. The principle of the kNN algorithm is to find the k nearest neighbors of a query in the dataset and then predict the class of the query as the majority class among those k nearest neighbors. In this paper, the kNN algorithm is utilized to process the objects in the boundary regions. If an object does not find a positive region, it remains in the boundary region. The uncertainty of the boundary region decreases as the number of objects in it decreases, so reclassifying the objects in the boundary region can improve the accuracy of the three-way clustering.
The details of updating the boundary region with the kNN algorithm are as follows.
Because the kNN algorithm mainly relies on a limited number of adjacent objects for classification, it is more suitable than other methods when class domains overlap or when the objects to be classified lie in the boundary region. Therefore, Algorithm 1 can handle the uncertainty arising from the boundary region. Of course, dealing with the boundary region with the k-nearest neighbor algorithm adds extra computational burden and also carries the risk of misclassifying objects.
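The boundary update can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: each boundary object looks at its k nearest neighbors by Euclidean distance, takes a majority vote among those neighbors that already lie in a positive region, and stays in the boundary region when none of them does. The function name `reassign_boundary` and the convention of marking boundary objects with `None` are hypothetical.

```python
import math
from collections import Counter


def reassign_boundary(data, labels, k=7):
    """labels[i] is a cluster index for positive-region objects, None for boundary ones.

    Returns a new label list; votes use the original labels only, so the
    result does not depend on the order in which boundary objects are visited.
    """
    new_labels = list(labels)
    for i, label in enumerate(labels):
        if label is not None:
            continue
        others = sorted((j for j in range(len(data)) if j != i),
                        key=lambda j: math.dist(data[i], data[j]))
        votes = Counter(labels[j] for j in others[:k] if labels[j] is not None)
        if votes:
            new_labels[i] = votes.most_common(1)[0][0]
        # Otherwise no neighbor lies in a positive region: keep it in the boundary.
    return new_labels


data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (1, 1)]
labels = [0, 0, 0, 1, 1, None]
print(reassign_boundary(data, labels, k=3))
```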

In what follows, based on the valid fuzzy covering and the FRFCM and kNN algorithms, we propose a three-way clustering algorithm, called the kNN-FRFCM algorithm, as shown in Algorithm 2.

Thus, according to Algorithm 2, we obtain three-way clustering results of the original dataset by using the valid fuzzy covering.
5. Experiment Analysis
The three-way clustering method based on fuzzy covering proposed in this paper is suitable for datasets with few objects and dimensions or datasets in which the number of objects is similar to the number of dimensions. Otherwise, clustering with a fuzzy covering constructed from a dataset with many objects and few dimensions will suffer from the curse of dimensionality. In this paper, six datasets from the UCI Machine Learning Repository [45] are used for the empirical study: Iris, Breast Cancer Wisconsin (Original) (BCWO) with the missing data removed, New thyroid, Seeds, Forest type mapping (FTM), and CT. On these datasets and their corresponding fuzzy coverings, the results of the clustering methods FCM, RCM, RFCM, kNN-RCM, and kNN-RFCM are compared. To distinguish the results on the raw dataset from those on the fuzzy covering with the same algorithm, the clustering algorithms applied to the fuzzy covering are denoted as FFCM, FRCM, FRFCM, kNN-FRCM, and kNN-FRFCM, respectively. Details of the six datasets are described in Table 1.
The partition threshold related to RCM and its related algorithms is set to 0.001. and involved in fuzzy covering are set to 0.8 and 0.9, respectively. The value of k in the kNN algorithm is set to 7, and evaluation indexes such as the normalized mutual information (NMI) [47], accuracy (ACC) [48], and Rand index (RI) [49] are used to investigate the validity of the algorithm. Furthermore, the reasonable values of the fuzzifier involved in all comparison algorithms are greater than 1. and are selected, and the experimental comparison results are listed in Tables 2–7.
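Two of the evaluation indexes can be computed with a short sketch. The Rand index counts the object pairs on which the true and predicted partitions agree. The accuracy is taken here as the best agreement over all one-to-one mappings from predicted to true labels, found by brute force, which is only practical for a handful of clusters. The function names and the sample labelings are hypothetical, and true labels are assumed never to be `None`.

```python
from itertools import combinations, permutations


def rand_index(truth, pred):
    """Fraction of object pairs treated consistently by both partitions."""
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum((truth[i] == truth[j]) == (pred[i] == pred[j]) for i, j in pairs)
    return agree / len(pairs)


def clustering_accuracy(truth, pred):
    """Best accuracy over one-to-one maps from predicted to true labels."""
    t_labels, p_labels = sorted(set(truth)), sorted(set(pred))
    # Pad with None so that surplus predicted clusters map to no true class.
    t_labels += [None] * max(0, len(p_labels) - len(t_labels))
    best = 0
    for perm in permutations(t_labels, len(p_labels)):
        mapping = dict(zip(p_labels, perm))
        hits = sum(mapping[p] == t for t, p in zip(truth, pred))
        best = max(best, hits)
    return best / len(truth)


truth = [0, 0, 1, 1, 2, 2]
pred = [1, 1, 0, 0, 2, 0]
print(rand_index(truth, pred), clustering_accuracy(truth, pred))
```

Both indexes are invariant to how the clusters are numbered, which matters because clustering algorithms assign cluster indices arbitrarily.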
From Tables 2–7, it can be concluded that the selected fuzzy parameters have a significant impact on the performance of all comparison algorithms when dealing with the same dataset. Since the boundary region is the main cause of system uncertainty, overly large boundary regions are undesirable for three-way clustering, and we need to pay attention to the uncertainty caused by the fuzzifier in the implementation of the algorithms. Moreover, the clustering results show that the kNN-FRFCM algorithm performs better than the other algorithms in most cases. This is mainly because it reduces the uncertainty of the system by reprocessing the objects in the boundary regions. The clustering results also show that clustering based on fuzzy covering is mostly better than clustering with raw data. Therefore, the valid fuzzy covering can replace the raw dataset for clustering and generally yields better clustering results. The premise for this replacement is the selection of an appropriate fuzzy similarity relation [46].
6. Conclusions
In this paper, a valid fuzzy covering of the raw dataset is constructed according to certain principles. Because the similarity between fuzzy similarity classes in the valid fuzzy covering can be used to measure the similarity between objects in the raw dataset, and each fuzzy similarity class reflects the connection with the whole dataset, using the valid fuzzy covering instead of the raw data for clustering can improve the precision of clustering. From the perspective of a semantic explanation of uncertainty change in fuzzy sets, we investigate the method of combining linear fuzzy entropy with nonlinear fuzzy entropy to obtain decision threshold pairs. The proposed threshold calculation method obtains the classification thresholds objectively from the intrinsic relations of the objects, uses a formula that is simple and easy to understand, and avoids inappropriate subjective assignment. Additionally, the objects in the boundary region obtained by the FRFCM algorithm are reprocessed by the kNN algorithm to reduce the uncertainty of the system.
In future work, we will continue to investigate methods of threshold acquisition and of boundary region processing for three-way clustering following the idea of this paper. Three-way clustering in incremental information systems is another future research direction.
Data Availability
The experimental data supporting the findings of this study are available on the website provided in this article.
Conflicts of Interest
The author declares that there are no conflicts of interest.
Acknowledgments
This work was supported by the Science Research Project of Inner Mongolia University for Nationalities with the title “Research on three-way clustering methods of preference linguistic data” (no. NMDYB18030) and Natural Science Foundation of Inner Mongolia Autonomous Region (nos. 2018MS01008 and 2020MS07008).