Abstract

Data clustering algorithms experience challenges in identifying data points that are either noise or outliers. Hence, this paper proposes an enhanced connectivity measure based on an outlier detection approach for multi-objective data clustering problems. The proposed algorithm aims to improve solution quality by utilising the local outlier factor (LOF) method together with the connectivity validity measure. This modification adjusts the mechanism for selecting neighbour data points so that such outliers are eliminated. The performance of the proposed approach is assessed by applying the multi-objective algorithms to eight real-life and seven synthetic two-dimensional datasets. The external validity is evaluated using the F-measure, while performance assessment metrics such as coverage and overall non-dominated vector generation are employed to assess the quality of the Pareto-optimal sets. Our experimental results show that the proposed outlier detection method enhances the performance of the multi-objective data clustering algorithms.

1. Introduction

Data clustering intends to arrange collections of data points using similarity functions that can then be employed to understand the data. A diversity of applications utilise data clustering algorithms to recognise the structures embedded within the data, to analyse a precise collection of clusters to be further investigated, and to recognise the features of each cluster [1, 2]. Consequently, the quality of the clusters can be assessed by utilising internal validity/similarity measures, such as connectedness, compactness, and isolation. These validity measures play an important part in the development of clustering algorithms built on distance measures, such as the k-means partitioning algorithm. In general, partitioning algorithms aim to identify spherically shaped clusters, but they are inefficient at recognising arbitrarily shaped clusters, such as the non-convex or interlaced clusters studied in several applications. Moreover, partitioning algorithms experience challenges in recognising data points that are either outliers or noise [3]. Unlike other validity measures, cluster connectivity works independently of the shape of the clusters [4]; it determines the degree to which the neighbours of a data point are located in the same cluster. However, the robustness of the connectivity measure depends on the associated L-nearest neighbours [5, 6]. The neighbours involved in quantifying the connectivity measure can contain outliers, which can strongly influence the accuracy of the connectedness because it is then computed from non-reliable data points that can be a form of outliers [7]. Therefore, the mechanism for choosing neighbour data points can be adjusted to eliminate such outliers and thus enhance the performance of the connectivity measure. Data clustering and outlier detection share a complementary relationship, in which a data point is recognised as either a cluster member or an outlier.
Data clustering algorithms commonly incorporate a mechanism for managing outliers that eliminates these data points from the clusters. Applicability across different problem fields is one significant challenge for outlier analysis [7–10]. Also, the effectiveness of an outlier analysis algorithm is quantified by its performance under different thresholds for the outlier score.

Local distance methods have been applied in several outlier detection approaches [7, 11]. The primary assumption of these methods is that normal data points reside within dense neighbourhoods, whereas outliers lie far from their nearest neighbours. One of the most common local distance outlier detection algorithms is the local outlier factor (LOF) algorithm, which is used in several applications [7]. LOF was introduced by [12] and is recognised as one of the most widely applied local outlier detection algorithms; it relates the local density of a point to that of the surrounding neighbourhood points [7, 13]. Although the geometric intuition of LOF applies to low-dimensional data, the LOF algorithm can be implemented with different dissimilarity functions [14]. The LOF algorithm has outperformed competitor algorithms in several disciplines, such as fault detection [15] and network intrusion detection [16]. LOF variants can be generalised and implemented in various applications, such as detecting outliers in big data [17], machine learning [18], and data streams [19]. Additionally, the LOF algorithm can be employed for different cluster shapes with different dissimilarity functions, whereas other local distance methods are more specialised: the connectivity-based outlier factor (COF) deals with outliers deviating from spherical density-based shapes, such as lines; the influenced outlierness (INFLO) method handles clusters that reside near each other; and the local outlier probability (LoOP) method expresses the outlierness of data points as a probability, allowing comparisons across data points and datasets. To address these concerns, this paper tackles multi-objective data clustering problems using an outlier detection approach.
The contribution of the paper is twofold:
(1) We introduce a modified connectivity validity measure based on the outlier detection approach (coded as Conn_LOF) for multi-objective data clustering problems.
(2) We develop an algorithm that intends to enhance the quality of the solutions generated by the multi-objective metaheuristic approach by utilising the LOF with the connectivity validity measure.

This paper is organised as follows: The related works of multi-objective metaheuristic clustering are briefly reviewed in Section 2. Section 3 discusses the theoretical background and concepts such as the data clustering problem, outlier detection methods, and the LOF method. In Section 4, the description of the modified Conn_LOF approach is presented. Section 5 presents the experimental design of the modified Conn_LOF approach, and in Section 6 the experimental results of the introduced method are explained. Finally, Section 7 presents the paper’s conclusions and future works.

2. Related Work

Several multi-objective metaheuristic approaches have been introduced to solve data clustering problems [20–26]. The multi-objective data clustering approach was initially offered by [27], who proposed a multi-objective data clustering algorithm based on one or more cluster quality measures. Their algorithm used the Pareto envelope-based selection algorithm (PESA-II), a multi-objective algorithm, to optimise the deviation and connectivity cluster quality measures. Their research was extended in [28], where they investigated the performance of four different pairs of criteria (cluster quality measures) in multi-objective clustering. Reference [29] introduced a new dynamic multi-objective evolutionary algorithm (MOEA) for data clustering, which applies a variable-length chromosome scheme to search for the optimal cluster number and cluster centres. Reference [30] proposed a multi-objective optimisation algorithm for solving the categorical data clustering problem (MOGA). Reference [31] offered a multi-objective evolutionary ensemble algorithm for addressing texture image segmentation (MECEA). Reference [32] introduced an enhanced multi-objective evolutionary approach for data clustering (EMCOC), which aims to handle datasets with overlapping complex-shaped clusters. Reference [33] offered a multi-objective genetic fuzzy clustering algorithm (MOVGA) for the segmentation of multispectral magnetic resonance imaging (MRI). Reference [34] proposed a multi-objective clustering algorithm (MOCA) for data clustering.

Recently, [35] proposed a multi-objective algorithm based on the artificial bee colony optimisation algorithm and the non-dominated sorting (NSABC) to solve the data clustering problems. Reference [21] offered a particle swarm optimisation using the multi-objective approach (MOPSO) to increase the diversity of the solutions. Later, [36] presented an improved binary gravitational search algorithm using the multi-objective approach for feature selection (IMBGSAFS). The Pareto-based approach is used in the algorithm to obtain better solutions diversity, by optimising the silhouette index and feature cardinality validity measures. Reference [37] introduced the multi-objective clustering algorithm based on a reduced-length representation. Reference [23] proposed a kernel-based, attribute-weighted multi-objective optimisation data clustering algorithm, in which they used the compactness and the separation cluster quality measures to find an optimal clustering solution.

Table 1 demonstrates that most of the offered multi-objective clustering approaches were based on the NSGA-II multi-objective algorithm, which is widely used to achieve high-quality solutions. Several multi-objective clustering algorithms employ more than one validity measure to be optimised simultaneously, typically minimising two validity measures such as cluster connectivity (Conn) and overall cluster deviation (Dev).

According to the related studies of data clustering algorithms based on multi-objective metaheuristics, further enhancements are required to tackle the rapid growth of data complexity while preserving the accuracy of the clustering algorithm [7]. Although the majority of clustering algorithms attempt to detect outliers during the clustering analysis stage [7], few algorithms offer validity measures that can tackle the detection of these outliers [38]. The cluster connectivity measure, which is commonly used in most multi-objective clustering algorithms, measures the degree of connectedness of the neighbour data objects located in the same cluster [6, 35] and may therefore quantify connectedness based on non-reliable data objects that can be a form of outliers [7]. Therefore, the mechanism for selecting suitable neighbour data objects can be modified to exclude such outliers, and consequently improve the performance of the connectivity measure.

3. Background

This section introduces the concepts of the data clustering problems, the outlier detection methods, and the LOF method.

3.1. Data Clustering Problems

Data clustering is an essential task of data mining that intends to group N data objects X = {x1, x2, …, xN} into a set of clusters C = {C1, C2, …, CK}, where all data objects in the same cluster are similar according to a specified similarity measure. The clustering methods must ensure the following hard constraints [39]:
(i) Each cluster should not be empty and must hold at least one data object: Ck ≠ ∅ for k = 1, …, K.
(ii) Different clusters should not share data objects: Ci ∩ Cj = ∅ for i ≠ j.
(iii) Every data object should be included in a cluster: C1 ∪ C2 ∪ … ∪ CK = X.
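As an illustration, the three hard constraints can be verified with a short sketch (a hypothetical helper for exposition, not part of the proposed algorithm):

```python
def is_valid_partition(clusters, data_points):
    """Check the three hard constraints on a clustering C of the data objects X."""
    # (i) no cluster is empty
    if any(len(c) == 0 for c in clusters):
        return False
    # (ii) clusters are pairwise disjoint
    seen = set()
    for c in clusters:
        if seen & set(c):
            return False
        seen |= set(c)
    # (iii) every data object belongs to some cluster
    return seen == set(data_points)
```

For example, `is_valid_partition([{1, 2}, {3, 4, 5}], {1, 2, 3, 4, 5})` holds, whereas an overlapping, incomplete, or empty-cluster partition violates a constraint.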

The mathematical representation of a multi-objective data clustering problem with M objectives is given in equation (4) [40]:

minimise/maximise F(X, C) = (f1(X, C), f2(X, C), …, fM(X, C)), subject to gi(X, C) ≤ 0 for i = 1, …, p, and hj(X, C) = 0 for j = 1, …, q.(4)

Each fm(X, C) is an objective function that measures the quality of the partitions produced by the clustering algorithm, where the objective function can be minimised or maximised depending on the similarity/dissimilarity measure employed. The gi(X, C) terms denote the p inequality constraints, and hj(X, C) denotes the q equality constraints.

3.2. Connectivity of the Cluster

Connectivity of the cluster [27, 35] is an objective function that measures the degree to which neighbouring data points are placed in the same cluster, and it should be minimised. The mathematical formulation of the cluster connectivity is shown in equations (5) and (6):

Conn(C) = Σ_{i=1}^{N} Σ_{j=1}^{M} x_{i, nn_{ij}},(5)

x_{r, s} = 1/j if there is no cluster Ck with r ∈ Ck and s ∈ Ck, and x_{r, s} = 0 otherwise,(6)

where N is the number of data points, nn_{ij} denotes the j-th nearest neighbour of data point i, and the parameter M represents the number of neighbour data points considered when measuring the connectivity.
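The measure can be sketched in Python as follows (a minimal illustration assuming Euclidean distance; the function name and structure are ours, not the authors' implementation):

```python
import numpy as np

def connectivity(X, labels, M):
    """Cluster connectivity (as in Handl and Knowles [27]): each of the M nearest
    neighbours of a point that falls in a different cluster adds a penalty of 1/j,
    where j is the neighbour's rank. Lower values mean better-connected clusters."""
    n = len(X)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # pairwise distances
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbour
    conn = 0.0
    for i in range(n):
        for j, nb in enumerate(np.argsort(d[i])[:M], start=1):
            if labels[i] != labels[nb]:  # neighbour split into another cluster
                conn += 1.0 / j
    return conn
```

A partition that keeps each point's nearest neighbours in its own cluster scores 0, which is why the measure is minimised.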

3.3. Outlier Detection Methods

The outlier detection methods are applied to overcome the influence of outliers when creating descriptive or predictive models, and are also adopted in the pre-processing stage of several data mining applications. The common outlier detection techniques are classified into distance-based, density-based, distribution-based, clustering-based, and probabilistic-based methods. Besides, the outlier detection approaches are divided into local and global methods: global methods give each data point an anomaly score depending on the entire dataset, whereas local distance methods assign an anomaly score to each data point depending on the surrounding neighbourhood. Many variants of the local distance methods have been introduced to produce a simple anomaly score presentation and to identify outliers hidden from the global methods. The variants of the local distance methods include the following:
(1) Local Outlier Factor (LOF) [12]: recognised as the most broadly adopted local method, it relates the local density of a data object to the average distance of its k-nearest-neighbour objects. The anomaly score of the LOF algorithm is defined as the ratio of the data point’s local density to the average local density of its neighbourhood points.
(2) Connectivity-based Outlier Factor (COF) [41]: detects outliers of other density-based shapes, such as lines.
(3) Influenced Outlierness (INFLO) [42]: introduced to produce more reliable results when clusters of different densities exist near each other.
(4) Local Outlier Probability (LoOP) [43]: uses statistical methods to define the anomaly score as a probability, allowing the scores of data points in one dataset to be compared with those of other datasets.

Local distance methods have been utilised in several outlier detection approaches [7, 44–46]. The primary assumption of these methods is that normal data points exist inside dense neighbourhoods, whereas outliers lie far from their nearest neighbours. The nearest neighbour methods need a distance metric to quantify the distance separating two data points [7]. One of the popular local distance outlier detection algorithms is the LOF algorithm, which is applied in several applications [7].

3.4. Local Outlier Factor (LOF)

LOF is one of the commonly used local outlier detection algorithms and was introduced by [12]; it relates the local density of a point to that of the surrounding neighbourhood points [7, 13]. The outlier factor is local in the sense that only a restricted neighbourhood of each point is considered. The local reachability density of a point p is defined as the inverse of the average reachability distance based on the minPts-nearest neighbours of p. Thus, minPts is a primary parameter of the LOF algorithm that indicates the number of nearest neighbours employed in discovering the local neighbourhood of each point. The local reachability density (lrd) is defined by equation (7), and the reachability distance is defined by equation (8) [12], where minPts denotes a positive integer, D denotes the dataset points, and o, p ∈ D. The distminPts(p, o) is defined as the distance between point p and point o. Given the minPts-distance of p, the minPts-distance neighbourhood of p contains every point whose distance from p is not greater than the minPts-distance. The outlier factor of point p represents the degree to which point p is considered an outlier, and is defined in equation (9) [12].
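For reference, the standard definitions from [12] can be written as follows (notation as in the original LOF paper):

```latex
% reachability distance of p with respect to o -- equation (8)
\mathrm{reach\mbox{-}dist}_{minPts}(p, o) = \max\{\, minPts\mbox{-}distance(o),\; dist(p, o) \,\}

% local reachability density of p -- equation (7)
\mathrm{lrd}_{minPts}(p) = \left( \frac{\sum_{o \in N_{minPts}(p)} \mathrm{reach\mbox{-}dist}_{minPts}(p, o)}{|N_{minPts}(p)|} \right)^{-1}

% local outlier factor of p -- equation (9)
\mathrm{LOF}_{minPts}(p) = \frac{1}{|N_{minPts}(p)|} \sum_{o \in N_{minPts}(p)} \frac{\mathrm{lrd}_{minPts}(o)}{\mathrm{lrd}_{minPts}(p)}
```

Here N_{minPts}(p) is the minPts-distance neighbourhood of p, consistent with the description above.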

The use of distance ratios ensures that the local density is properly assessed. Therefore, LOFminPts for points in dense regions is close to 1 (LOF ≃ 1). In contrast, the LOFminPts of outlier points will be much higher (LOF ≫ 1) because they are measured as ratios to the average neighbour reachability distances. Essentially, the maximum value of LOFminPts over a range of minPts values is employed as the outlier score to identify the optimal neighbourhood size.

3.5. The Proposed Outlier Detection Approach

The proposed outlier detection approach for the connectivity measure (named Conn_LOF) is discussed in this section. The flowchart of the introduced approach is shown in Figure 1, which includes the following stages:
(i) Stage 1. Pre-processing: includes gathering and cleaning the needed datasets and then converting them into the related nearest neighbours matrix and other matrices, which are utilised throughout the generation of solutions.
(ii) Stage 2. Computation of L-distances: computes the L-distances of each data point to its L-nearest neighbourhood data points based on the Euclidean distance [7].
(iii) Stage 3. Computation of reachability of L-distance neighbourhoods: computes the local reachability densities along with the reachability distances of all L-distance neighbourhoods using equation (7).
(iv) Stage 4. LOF computation: labels the outliers among the L-distance neighbourhoods based on the outlier factor and the chosen threshold value λ of LOF.
(v) Stage 5. Computation of the connectivity measure: computes the connectivity validity measure using equation (5), excluding the neighbourhood points labelled as outliers.
(vi) Stage 6. Execution of the multi-objective clustering algorithm: executes a multi-objective clustering algorithm such as the non-dominated sorting genetic algorithm (NSGA-II) [47] or the strength Pareto evolutionary algorithm (SPEA-II) [48].

The algorithmic steps of the proposed method are shown in Algorithm 1, where λ denotes the threshold value used in the LOF algorithm and is set to 1: the LOF value of each neighbourhood point is approximated and then compared with the λ threshold value. The Clabel matrix stores the labels of the neighbourhood points.

// Inputs:
//   C – the nearest neighbours matrix generated in Stage (1)
//   L – number of nearest neighbours (minPts) in the LOF algorithm
//   λ – the threshold used in the LOF algorithm
// Output:
//   Clabel – the labels matrix generated by LOF
for each Cj in C do
      // Stage (2)
      Compute the L-distance neighbourhood points of Cj;
      // Stage (3)
      Compute the reachability distance for the neighbourhood points of Cj;
      // Stage (4)
      Compute the LOF of the neighbourhood points of Cj;
      for each neighbourhood point Pi of Cj do
            if LOF of Pi ≥ λ then
                  Label Pi as an outlier and store it in Clabel;
            end if
      end for
end for
// Stage (5)
Compute the connectivity of C by excluding the outliers in Clabel;
// Stage (6)
Execute the multi-objective clustering algorithm;
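Stages 2–5 can be sketched in Python as below (a minimal illustration, not the authors' implementation: function names are ours, the LOF computation follows the standard definitions of [12] with Euclidean distance, and λ is exposed as a parameter — the paper fixes λ = 1, while values slightly above 1 are also common since LOF ≈ 1 for inliers):

```python
import numpy as np

def lof_scores(X, k):
    """LOF score of every row of X (per Breunig et al. [12]), Euclidean distance."""
    n = len(X)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)             # a point is not its own neighbour
    knn = np.argsort(d, axis=1)[:, :k]      # indices of the k nearest neighbours
    k_dist = d[np.arange(n), knn[:, -1]]    # distance to the k-th neighbour
    # reachability distance: reach(p, o) = max(k_dist(o), d(p, o))  -- eq. (8)
    reach = np.maximum(k_dist[knn], d[np.arange(n)[:, None], knn])
    lrd = 1.0 / reach.mean(axis=1)          # local reachability density -- eq. (7)
    return lrd[knn].mean(axis=1) / lrd      # eq. (9): avg neighbour lrd / own lrd

def conn_lof(X, labels, L, k, lam):
    """Connectivity computed only over non-outlier neighbours (Stages 2-5)."""
    outlier = lof_scores(X, k) >= lam       # Stage 4: label outliers
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)
    conn = 0.0
    for i in range(len(X)):                 # Stage 5: penalise split neighbours
        nn = [j for j in np.argsort(d[i]) if not outlier[j]][:L]
        for rank, j in enumerate(nn, start=1):
            if labels[i] != labels[j]:
                conn += 1.0 / rank
    return conn
```

Because the outlier labels depend only on the dataset, they can be computed once before the multi-objective algorithm starts (Stage 6) and reused for every candidate solution.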

4. Experimental Design

The performance of the proposed Conn_LOF outlier detection method is examined using eight real-life datasets with a variety of complexity, obtained from the UCI repository of the machine learning databases [49], and seven synthetic two-dimensional datasets [5], as shown in Table 2.

Since most of the state-of-the-art multi-objective clustering algorithms are based on NSGA-II (as shown in Table 1), the NSGA-II and SPEA-II algorithms are used to demonstrate the contribution of this paper. Other multi-objective algorithms are not used, since the proposed Conn_LOF method is performed before running the multi-objective clustering algorithm (as shown in Figure 1) and does not affect the algorithmic steps of any given algorithm.

To evaluate the performance and effectiveness of the proposed Conn_LOF method, the NSGA-II algorithm [47] is modified to employ two conflicting objectives, the intra-cluster distance [50] and the proposed Conn_LOF measure (the resulting algorithm is named eNSGA-II), and is compared with the standard NSGA-II algorithm, whose pair of conflicting objectives comprises the intra-cluster distance [50] and the standard cluster connectivity [27]. Similarly, SPEA-II [48] is modified to employ the intra-cluster distance and the Conn_LOF measure (named eSPEA-II) and is then compared with the standard SPEA-II, whose pair of conflicting objectives likewise comprises the intra-cluster distance [50] and the standard cluster connectivity [27].

The data clustering solutions are represented using a label-based representation, a one-dimensional array in which a solution holds one cluster label for each of the N data objects. Figure 2 demonstrates a solution representation example of eight data objects and three clusters. The solutions are generated randomly: each data object is randomly assigned to a cluster.
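A random initial solution under this representation can be generated as follows (a minimal sketch; the example values in the comment are illustrative, not taken from Figure 2):

```python
import random

N, K = 8, 3                                         # eight data objects, three clusters
solution = [random.randrange(K) for _ in range(N)]  # one cluster label per data object
# e.g. solution = [2, 0, 1, 1, 0, 2, 2, 1]:
# data object 0 is assigned to cluster 2, object 1 to cluster 0, and so on.
```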

The algorithms’ external validity is evaluated using the F-measure [51]. The running time of the algorithms is not investigated, since the Conn_LOF method runs before the execution of the multi-objective clustering algorithm (as shown in Figure 1) and therefore does not affect the running time of the competing algorithms; for a particular dataset, its cost depends only on the number of attributes and instances and is the same for all algorithms.

Also, performance assessment indices (PI) are utilised to assess the quality of the Pareto-optimal sets and to compare the performance of different multi-objective algorithms. Hence, to assess the multi-objective metaheuristic clustering algorithms, we followed the performance indices used in recent data clustering research [36, 52], including the Overall Non-dominated Vector Generation (ONVG) [53] and coverage [54]. The details of these indices are given below:

1. Coverage of Two Sets (C) [54]: Coverage is employed to compare two solution sets based on domination. Assuming that S1 and S2 are two Pareto-fronts/sets, C(S1, S2) indicates the portion of set S2 that is dominated by the solutions in set S1. The mathematical formulation of the coverage is shown in equation (10):

C(S1, S2) = |{s2 ∈ S2 : ∃ s1 ∈ S1 such that s1 dominates or equals s2}| / |S2|,(10)

where higher values of C denote better dominance, and C lies within the range [0, 1].

2. Overall Non-dominated Vector Generation (ONVG) [53] represents the number of solutions in the Pareto-front set S; the mathematical formulation of the ONVG is shown in equation (11):

ONVG(S) = |S|.(11)
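The two indices can be sketched as follows (a minimal illustration assuming minimisation of all objectives and weak dominance, as in Zitzler's coverage definition; function names are ours):

```python
def dominates(a, b):
    """a weakly dominates b (minimisation): a is no worse in every objective."""
    return all(x <= y for x, y in zip(a, b))

def coverage(S1, S2):
    """C(S1, S2): fraction of set S2 covered by at least one solution of S1."""
    return sum(any(dominates(a, b) for a in S1) for b in S2) / len(S2)

def onvg(S):
    """ONVG: the number of solutions in the Pareto-front set S."""
    return len(S)
```

Note that coverage is not symmetric, which is why both C(S1, S2) and C(S2, S1) are reported in the results below.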

To evaluate the performance of the multi-objective methods using the PI indices, a Pareto-front pool is generated from the whole Pareto-fronts of the competing multi-objective algorithms: the non-dominated solutions from the N runs of every algorithm are combined, since some PIs, such as the coverage measure, require such a pool.

With the following parameter settings, each competing algorithm was independently run 31 times on each of the 15 datasets; the average value and the standard deviation of the F-measure were then computed. The population size is set to 20 and the maximum number of iterations is set to 1000. The number of nearest neighbours L is set to 21. Lastly, the algorithms were implemented in Java 1.8 and run on a personal computer with an Intel Core i7 CPU (2.6 GHz) equipped with 4 GB of memory.

5. Experimental Results and Discussion

Table 3 shows the results of the coverage (C), where A, B, C, and D symbolise eNSGA-II, NSGA-II, eSPEA-II, and SPEA-II, respectively. The C(A, B) values obtained better coverage than the C(B, A) values for the datasets 2d-20c-no0, CMC, Ecoli, engytime, Flame, Seeds, Sizes5, Sonar, Soybean-small, and Thyroid, which means that every solution in the NSGA-II pool is dominated by at least one solution in the eNSGA-II pool. On the other hand, the C(B, A) values mostly obtained better coverage than the C(A, B) values for the datasets Elly-2d10c13s, Ionosphere, Iris, Spherical_5_2, and Square1. The C(C, D) values obtained better coverage than the C(D, C) values for the datasets 2d-20c-no0, CMC, Ecoli, Elly-2d10c13s, engytime, Flame, Seeds, Sizes5, and Spherical_5_2, which means that every solution in the SPEA-II pool is dominated by at least one solution in the eSPEA-II pool. In contrast, the C(D, C) values mostly obtained better coverage than the C(C, D) values for the datasets Ionosphere, Iris, Sonar, and Soybean-small.

Generally, this shows that the solutions in the pools of the modified algorithms with the Conn_LOF method dominated the solutions of the standard algorithms at a considerably high ratio. In conclusion, the modified algorithms with the Conn_LOF method attained better performance than the standard algorithms based on the coverage PI.

Table 4 reveals the F-measure results obtained on the Pareto-fronts produced by the competing algorithms. The eNSGA-II algorithm achieves higher F-measure results than the NSGA-II algorithm for most of the datasets, except the 2d-20c-no0, CMC, Sizes5, and Soybean-small datasets. The eSPEA-II provides higher F-measure results than SPEA-II for most of the datasets, excluding the CMC, Iris, and Soybean-small datasets. The results verify that the average F-measure of eNSGA-II and eSPEA-II is enhanced by adopting the Conn_LOF method compared with the corresponding standard NSGA-II and SPEA-II.

Additionally, the impact of adopting the Conn_LOF method is observed in the ONVG metric, as shown in Table 5, in which the eNSGA-II algorithm achieves higher ONVG results than the NSGA-II algorithm for most of the datasets except 2d-20c-no0, Ecoli, and Seeds. The eSPEA-II provides higher ONVG results than SPEA-II for most of the datasets except 2d-20c-no0, Sizes5, and Soybean-small. The table also shows the weaker performance of the standard algorithms with respect to the ONVG metric. Hence, the modified eNSGA-II and eSPEA-II achieve better ONVG performance.

The results shown in Table 4 are further analysed using Friedman’s test ranking on the F-measure. As presented in Table 6, Friedman’s test shows that eNSGA-II achieved the best F-measure rank, NSGA-II the second rank, and eSPEA-II the third rank. Finally, SPEA-II obtained the worst rank.

In general, by adopting the Conn_LOF outlier detection method, eNSGA-II and eSPEA-II are shown to be reliable choices for multi-objective data clustering, providing Pareto-front solutions with efficient clustering measures for datasets with varying characteristics and complexity.

6. Conclusions and Future Work

In this paper, an enhanced connectivity measure based on the LOF outlier detection method (Conn_LOF) is offered to enhance the performance of the connectivity measure by eliminating outliers. To examine the efficiency of the proposed Conn_LOF method, it is employed within the competing algorithms and tested on eight real-life datasets of varying complexity obtained from the UCI machine learning repository, as well as on seven synthetic two-dimensional datasets with different cluster shapes and characteristics. The experimental results show that the performance of the modified eNSGA-II and eSPEA-II is enhanced by adopting the Conn_LOF method with respect to the average and standard deviation of the F-measure. In addition, the multi-objective performance assessment metrics, including coverage and overall non-dominated vector generation, are used to evaluate the quality of the Pareto-optimal sets. Furthermore, the Conn_LOF outlier detection method is proven effective when combined with the clustering algorithms, providing better Pareto-front solutions with efficient clustering measures for datasets with varying characteristics and complexity.

Data Availability

The real-life datasets used to support the findings of this study have been deposited in the UCI Data repository (URLs: https://archive.ics.uci.edu/ml/datasets/Contraceptive+Method+Choice; https://archive.ics.uci.edu/ml/datasets/ecoli; https://archive.ics.uci.edu/ml/datasets/ionosphere; https://archive.ics.uci.edu/ml/datasets/Iris; https://archive.ics.uci.edu/ml/datasets/seeds; https://archive.ics.uci.edu/ml/datasets/connectionist+bench+(sonar,+mines+vs.+rocks); https://archive.ics.uci.edu/ml/datasets/soybean+(small); https://archive.ics.uci.edu/ml/datasets/thyroid+disease). Additional synthetic datasets (such as 2d-20c-no0, Elly-2d10c13s, Engytime, Flame, Sizes5, Spherical_5_2, and Square1) were used to support this study and are available at [doi: 10.1109/TEVC.2006.877146]. These prior datasets are cited at relevant places within the text as references [5].

Conflicts of Interest

The authors declare no conflicts of interest regarding this paper.

Acknowledgments

This research was funded by a research grant from Universiti Kebangsaan Malaysia (Ref. No: DIP-2019-013).