Complexity

Volume 2018, Article ID 7698274, 16 pages

https://doi.org/10.1155/2018/7698274

## Self-Adaptive K-Means Based on a Covering Algorithm

^{1}School of Computer Science and Technology, Anhui University, Hefei 230601, China

^{2}School of Software and Electrical Engineering, Swinburne University of Technology, Melbourne, VIC 3122, Australia

^{3}School of Information Technology, Deakin University, Melbourne, VIC 3125, Australia

Correspondence should be addressed to Xing Guo; guoxingahu@qq.com

Received 29 December 2017; Accepted 26 March 2018; Published 1 August 2018

Academic Editor: Xiuzhen Zhang

Copyright © 2018 Yiwen Zhang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

The K-means algorithm is one of the ten classic algorithms in the area of data mining and has been studied by researchers in numerous fields for a long time. However, the value of the clustering number K in the K-means algorithm is not always easy to determine, and the selection of the initial centers is vulnerable to outliers. This paper proposes an improved K-means clustering algorithm called the covering K-means algorithm (C-K-means). The C-K-means algorithm can not only acquire efficient and accurate clustering results but also self-adaptively provide a reasonable number of clusters based on the data features. It includes two phases: the initialization of the covering algorithm (CA) and the Lloyd iteration of the K-means. The first phase executes the CA. The CA self-organizes and recognizes the number of clusters based on the similarities in the data, and it requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. Therefore, it has a “blind” feature; that is, K is not preselected. The second phase performs the Lloyd iteration based on the results of the first phase. The C-K-means algorithm combines the advantages of the CA and K-means. Experiments are carried out on the Spark platform, and the results verify the good scalability of the C-K-means algorithm. This algorithm can effectively solve the problem of large-scale data clustering. Extensive experiments on real data sets show that the C-K-means algorithm outperforms the existing algorithms in both accuracy and efficiency under sequential and parallel conditions.

#### 1. Introduction

The development of big data technologies and cloud computing and the proliferation of data sources (social networks, the Internet of Things, e-commerce, mobile apps, biological sequence databases, etc.) enable machines to handle more input data than humans could. Due to this dramatic increase in data, business organizations and researchers have become aware of the tremendous value the data contain. Researchers in the field of information technology have also recognized the enormous challenges these data bring, and new technologies to handle these data, called big data, are required. Therefore, it is vital for researchers to choose suitable approaches to deal with big data and obtain valuable information from them. Recognizing valuable information in data requires ideas from machine learning algorithms; thus, big data analysis must combine the techniques of data mining with those of machine learning. Clustering is one such method used in both fields. Clustering is a classic data mining method whose goal is to divide a dataset into multiple classes so as to maximize the similarity of the data points within each class and minimize the similarity between classes. Cluster analysis has been widely used in many fields of science and technology, such as modern statistics, bioinformatics, and social media analytics [1–5]. For example, clustering algorithms can be applied to big data from social events to determine people's opinions, such as predicting the winner of an election.

Based on the characteristics of different fields, researchers have proposed a variety of clustering types, which can be divided into several general categories, including hierarchical clustering, density-based clustering, graph theory-based clustering, grid-based clustering, model-based clustering, and partitional clustering [1]. Each clustering type has its own style and optimization approaches. We focus on partitional clustering algorithms. The most popular is K-means [2, 3, 6, 7], one of the top ten clustering algorithms in data mining. The advantages of the K-means algorithm are its easy implementation and understanding, whereas its disadvantages are that the number of clusters K cannot be easily determined and that the selection of the initial centers is easily disturbed by outliers, which has a significant impact on the final results [6]. Due to the simple iteration of the K-means algorithm, it has good scalability when dealing with big data and is easy to parallelize [8–10]. Researchers have proposed improved K-means algorithms to address these drawbacks, and most of the improvements optimize the selection of the initial K-means centers [11–13]. Good initial centers can significantly affect the quality and convergence of the Lloyd iterations and eventually help the K-means algorithm obtain nearly optimal clustering results.
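For readers unfamiliar with the Lloyd iteration referred to throughout this paper, the following is a minimal NumPy sketch of its assignment/update loop (the function name, tolerance, and iteration cap are illustrative choices, not taken from the paper):

```python
import numpy as np

def lloyd_kmeans(X, centers, max_iter=100, tol=1e-6):
    """Minimal sketch of the Lloyd iteration: alternately assign each
    point to its nearest center, then move each center to the mean of
    the points assigned to it, until the centers stop moving."""
    for _ in range(max_iter):
        # Assignment step: distance of every point to every center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each center; keep it if its cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels
```

Each iteration costs O(nKd) for n points in d dimensions, which is the simple, data-parallel structure that makes K-means attractive for large-scale settings.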

However, K-means and its improved algorithms still need to ascertain the number of clusters K in advance and then determine the best data partitioning based on this parameter, yet the obtained results do not always represent the best partitioning of the data. To address these problems, this paper proposes a K-means clustering algorithm combined with an improved covering algorithm, called the C-K-means algorithm. Our improved covering-based initialization algorithm has a “blind” feature: without determining the number of clusters in advance, the algorithm can automatically identify the number of clusters based on the characteristics of the data and is independent of the initial centers. The C-K-means algorithm combines the advantages of the CA and K-means algorithms; it has both the “blind” characteristic of the CA and the fast, efficient, and accurate clustering of high-dimensional data offered by the K-means algorithm. Moreover, the CA is easy to parallelize and has good scalability. We implemented the parallel C-K-means clustering algorithm and the baseline algorithms in the Spark environment. The experimental results show that the proposed algorithm is suitable for solving large-scale, high-dimensional data clustering problems.

In particular, the major contributions of this paper are as follows:

(1) We propose a covering-based initialization algorithm based on the quotient space theory with a “blind” feature. The initialization algorithm requires neither the number of clusters to be prespecified nor the initial centers to be manually selected. The CA determines the appropriate number of clusters K and the corresponding K initial centers quickly and adaptively.

(2) The number of Lloyd iterations required for convergence in the C-K-means clustering algorithm is much smaller than in the baseline algorithms.

(3) The parallel implementation of C-K-means is much faster than the parallel baseline algorithms.

(4) Extensive experiments on real datasets show that the proposed C-K-means algorithm outperforms existing algorithms in both accuracy and efficiency under sequential and parallel conditions.

The remainder of this paper is organized as follows. Section 2 provides an overview of the related work. Section 3 introduces the baseline algorithms and details the C-K-means algorithm under both sequential and parallel conditions. Section 4 presents the experimental results and analysis, and Section 5 concludes the paper and identifies future work.

#### 2. Related Work

As a classic clustering algorithm, the K-means algorithm is widely used in the fields of databases and data anomaly detection. Ordonez [14] implemented efficient K-means clustering algorithms on top of a relational database management system (DBMS) using efficient SQL. They also implemented an efficient disk-based K-means application that takes into account the needs of the relational DBMS [15]. Efficient parallel clustering algorithms and implementation techniques are key to meeting the scalability and performance requirements of scientific data analysis. Therefore, other researchers have proposed parallel implementations and applications of the K-means algorithm. Dhillon and Modha [16] proposed a parallel K-means clustering algorithm based on a message-passing model, which exploits the inherent data parallelism of the K-means algorithm: as the amount of data increases, the speedup and extensibility of the algorithm improve. Zhao et al. [8] implemented a K-means clustering algorithm based on MapReduce, which significantly improved the efficiency of the K-means algorithm. Jiang et al. [17] proposed a two-stage clustering algorithm to detect outliers: the first stage clusters the data using an improved K-means, and the second stage searches the clustering results of the first stage to identify the final outliers. Malkomes et al. [18] used the K-center clustering variant to handle noisy data, and the algorithms used are highly parallel. However, the selection of the initial centers of the K-means algorithm is easily disturbed by outliers, which has a significant impact on the final results, and efficient methods to resolve this sensitivity to the initial centers had not yet been proposed.

Recently, scholars have focused on the issue that the selection of the initial centers of the K-means algorithm is easily disturbed by outliers and have proposed several improved algorithms to help the K-means algorithm select the initial centers. The most classic improved algorithms are the K-means++ algorithm and the K-means|| algorithm. The K-means++ algorithm, proposed by Arthur and Vassilvitskii [12], helps the K-means algorithm obtain the initial centers prior to the Lloyd iteration. It randomly selects a data point as the first cluster center and then selects each subsequent center with a probability that depends on the previously selected cluster centers, until K initial centers have been chosen. However, due to the inherently sequential execution of K-means++, the algorithm must traverse the dataset K times, and the calculation of each cluster center depends on all of the previously obtained cluster centers, which makes the K-means++ initialization difficult to parallelize. Inspired by the K-means++ algorithm, Bahmani et al. [13] proposed the K-means|| algorithm to improve the performance of the parallelization and initialization phases. The K-means|| initialization algorithm introduces an oversampling factor, obtains a set of initial centers much larger than K after a constant number of iterations, and assigns weights to these center points. It then reclusters the weighted center points using a known clustering algorithm to obtain the final K initial centers. K-means|| initialization retains the advantages of the K-means++ algorithm and also addresses the drawback that K-means++ is difficult to scale. In follow-up research, researchers have proposed further improved K-means algorithms, most of which are compared against these two classic improvements. Cui et al. [10] proposed a new method of optimizing K-means based on MapReduce to process large-scale data, which eliminated the iterative dependence and reduced the computational complexity. Wei [19] improved the K-means++ algorithm by selecting the cluster centers using the sampling method of K-means++ and then producing K centers whose clustering result is within an approximately constant factor of the optimum. Newling and Fleuret [20] used CLARANS to help K-means solve the problem of selecting the initial centers.
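The distance-weighted seeding performed by K-means++ can be sketched as follows. This is a minimal NumPy illustration under the standard description of the algorithm; the function name and random-generator handling are our own choices, not from the paper:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=None):
    """Sketch of K-means++ seeding: pick the first center uniformly at
    random, then pick each subsequent center with probability proportional
    to its squared distance to the nearest already-chosen center."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen center.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        # D^2 weighting: far-away points are more likely to be picked.
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)
```

The sequential dependence criticized above is visible in the loop: each pass recomputes distances against all previously chosen centers, which is exactly what K-means|| relaxes by oversampling in each round.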

However, the number of clusters K in the K-means algorithm and its variants must be known in advance, and the best data division is then defined based on this parameter. A data division defined in this way is actually based on an assumed model and is not necessarily the best division of the data. In addition, the final clustering result is based on clustering under a hypothetical parameter, without considering the actual structural relationships in the data.

In response to the problems described above, this paper presents a novel clustering algorithm called C-K-means that has both the “blind” feature of the CA and the fast, efficient clustering of the K-means algorithm. It can be applied to high-dimensional data clustering with strong scalability. We implement the parallelized C-K-means algorithm on the Spark cloud platform. Extensive experimental results show that the C-K-means clustering algorithm is more accurate and efficient than the baseline algorithms.

#### 3. The Algorithms

In this section, we first introduce the K-means, K-means++, and K-means|| clustering algorithms. We then present the motivation for using the CA as the initialization algorithm of the C-K-means clustering algorithm and explain why the CA initialization can obtain approximately optimal clustering results. Finally, we describe the parallel implementation of the C-K-means algorithm. Before addressing these questions, we summarize the notation used throughout this paper in Table 1.