Abstract

Clustering is one of the most important unsupervised machine learning tasks and is widely used in information retrieval, social network analysis, image processing, and other fields. With the explosive growth of data, classical clustering algorithms can no longer meet the requirements of clustering big data. Spark is one of the most popular parallel processing platforms for big data, and many parallel clustering algorithms based on Spark have been proposed. In this paper, the existing parallel clustering algorithms based on Spark are classified and summarized, the parallel design framework of each kind of algorithm is discussed, and, after comparing the different kinds of algorithms, directions for future research are discussed.

1. Introduction

Clustering is one of the most important unsupervised machine learning tasks. Its purpose is to divide data points into groups or clusters so that data points in the same cluster are similar to each other but are very different from data points in other clusters. Clustering is widely used in text mining, information retrieval, social network analysis, image and video analysis, and other fields. In the past decades, researchers have put forward many clustering algorithms, such as K-Means [1], K-Medoids [2], DBSCAN [3], BIRCH [4], OPTIGRID [5], FCM [6], PCM [7], CURE [8], CHAMELEON [9], DENCLUE [10], OPTICS [11], WaveCluster [12], STING [13], CLIQUE [14], FADE [15], CLARA [16], CLARANS [17], and ORCLUS [18], which have achieved good results on small-scale datasets. Some clustering algorithms such as Possibilistic Fuzzy C-Means realize the fuzzy segmentation of data points based on probability and are applied to image segmentation and other fields [19–22].

With the rapid development of information technology such as sensors, computers, and communication, the data generated by people and various devices is growing explosively, and we have entered the era of big data. Big data can be generally defined and described with the 5Vs [23]: Volume, Variety, Value, Velocity, and Veracity. Owing to these characteristics, traditional clustering algorithms cannot meet the needs of big data clustering, so parallel clustering algorithms are needed to meet the challenge.

Apache Spark is a new-generation parallel processing platform for big data, with merits such as ease of use, versatility, and automatic fault tolerance. Specifically, Spark uses Resilient Distributed Datasets (RDDs) to store data in memory, which can significantly improve the performance of machine learning tasks requiring multiple iterations. Compared with Hadoop, the classic big data parallel processing platform, Spark has an order-of-magnitude advantage in performance [24] and is particularly suitable for clustering, which requires multiple iterations. Many parallel clustering algorithms based on Spark have been proposed in recent research, which significantly improve the efficiency and accuracy of big data clustering.

Some researchers have reviewed clustering algorithms for big data. In [25], the characteristics of big data are discussed in detail, and classification and clustering algorithms based on MapReduce are summarized and discussed. The article [26] discusses the characteristics of different types of clustering algorithms and the main challenges in dealing with big data and makes a comparative analysis of the major clustering algorithms. The paper [27] discusses the applications, opportunities, and challenges of big data and briefly describes the latest technologies used to process big data. The paper [28] classifies the existing nonparallel clustering algorithms and compares the accuracy, scalability, and performance of different types of algorithms in dealing with big data through experiments. However, although Spark is currently the most popular big data parallel processing platform and more and more parallel clustering algorithms based on Spark are being proposed, there is no dedicated overview and discussion of parallel clustering algorithms based on Spark.

This paper provides an overview of parallel clustering algorithms based on Spark using literature survey and classification as research methods: it classifies the parallel clustering algorithms based on Spark proposed in the literature, summarizes the parallel implementation framework of each type of algorithm, and compares the different types of algorithms.

The main contributions of this paper are as follows:
(1) The proposed parallel clustering algorithms based on Spark are studied and organized into a taxonomy.
(2) The design framework and characteristics of each kind of parallel clustering algorithm based on Spark are discussed.
(3) Different kinds of algorithms are compared and discussed, and future research directions are outlined.

This paper is organized as follows. In Section 2, we summarize the classical clustering algorithms and introduce the preliminaries. In Section 3, the main parallel clustering algorithms based on Spark are classified, and the design framework and characteristics of each kind of algorithm are discussed. In Section 4, different kinds of algorithms are compared, and the prospects for future research are discussed.

2. Preliminaries

2.1. Problem Definition

Let x_i be a data point described by d attributes, x_i = (x_{i1}, x_{i2}, ..., x_{id}), and let N = {x_1, x_2, ..., x_n} be a collection of n data points. The set of clusters is expressed as C = {C_1, C_2, ..., C_m}, where m ≤ n. A cluster can be represented by a special data point in this cluster or a statistical value of the data points in this cluster. This data point or statistical value is called the centroid of the cluster, which is expressed as c_j. The aim of clustering is to allocate the data points in N to the m clusters. In the following, dist(x_i, x_j) is used to represent the distance (or similarity) between two data points, dist(C_i, C_j) is used to represent the distance (or similarity) between two clusters, and dist(x_i, C_j) is used to represent the distance (or similarity) between a data point and a cluster.

2.2. Classical Clustering Algorithm

Clustering is a traditional machine learning task. In the past decades, researchers have proposed many nonparallel clustering algorithms. Classical clustering algorithms can be divided into five categories.

2.2.1. Partitioning-Based

This is the most commonly used kind of clustering algorithm, which divides data points into multiple mutually exclusive clusters. In the process of partitioning, each data point is usually assigned to the nearest cluster according to the distance between the data point and the cluster. The most famous algorithm in this category is K-means [1], which randomly selects m data points as the initial centroids, assigns the remaining data points to the clusters closest to them, updates the centroid of each cluster with the average value of the data points in the cluster after the assignment, and iterates the above process until the partition is stable or the iteration stop condition is met (generally the maximum number of iterations or the best objective function value). This kind of algorithm has four key aspects: the selection of the initial centroids, the measurement of the distance between data points and clusters, the update method of the centroids, and the design of the objective function. Other algorithms in this category are optimized in the above four aspects. For example, Intelligent K-means [29] uses abnormal data points as the initial centroids; K-Medoids [2] uses the center of the cluster as the updated centroid; FCM [6] and PCM [7] use different objective functions. The partition-based clustering algorithm is easy to understand and implement, but it has several obvious disadvantages: firstly, the selection of the initial centroids has a significant impact on the clustering speed and accuracy; secondly, the partition is based on the distance between the data point and the centroid of the cluster, which is only suitable for finding spherical clusters; thirdly, the clustering quality is significantly affected by outliers; fourthly, it needs the number of clusters to be specified in advance, and sometimes this parameter is difficult to determine a priori.
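
For concreteness, the sum-of-squared-error objective that K-means minimizes can be written as follows, using the notation of Section 2.1; individual variants of this family replace the distance measure, the centroid update, or the weighting in this objective:

\[ J = \sum_{j=1}^{m} \sum_{x_i \in C_j} dist(x_i, c_j)^2 \]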

The partition-based clustering algorithms depend heavily on the selection of the initial centroids. If the initial centroids are not selected properly, the quality of the clustering results can be severely affected. From another point of view, clustering can be seen as a process of continuously optimizing the selection of cluster centroids, searching for the best centroids heuristically. Many heuristic search methods can be used for this continuous optimization, including artificial bee colony [30] and particle swarm optimization (PSO) [31]. These methods simulate swarm intelligence or the process of natural evolution. Generally, they converge quickly and achieve good results.

2.2.2. Hierarchical-Based

This kind of algorithm organizes data points into a tree of clusters with a hierarchical structure, in which each leaf is a data point, the root node is the set of all data points, and every other node in the tree represents a cluster. A parent node represents the cluster obtained by agglomerating the clusters represented by its child nodes, and a child node represents a cluster obtained by dividing the cluster represented by its parent node. Therefore, hierarchical clustering can be carried out top-down by division or bottom-up by agglomeration. In the former, all data points are first regarded as one cluster, and then the cluster is divided into several subclusters recursively. In the latter, each data point is first regarded as a cluster, and then two or more clusters are agglomerated recursively. The iterative process ends when the stop condition is reached (generally a target number of clusters), using the distance or similarity between clusters as the basis for cluster division or agglomeration. BIRCH [4], CURE [8], and CHAMELEON [9] are the representative algorithms of this kind. This kind of algorithm has an obvious disadvantage: the operation of division or agglomeration is irreversible, and improper operations can lead to low-quality clusters.

2.2.3. Density-Based

The clustering algorithms based on partitioning and hierarchy can only find spherical clusters, and it is difficult for them to find clusters of arbitrary shapes. The density-based clustering algorithm regards a cluster as a dense area separated by sparse areas in the data space. If a data point belongs to a cluster, it must be in a dense area; that is to say, it has at least a predefined threshold number of neighbors within a specified radius. The discovery of clusters can start from a data point and extend in any direction according to the density. Therefore, clusters of any shape can be found, and outliers are naturally filtered out. DBSCAN [3] is a typical representative algorithm of this kind. It identifies dense areas and core points by counting the number of neighbors within the neighborhood of each point. Multiple dense areas can be connected to form a cluster according to density reachability between core points. Data points that cannot be included in any cluster (i.e., data points that are not in any dense area) are identified as outliers. Well-known algorithms of this kind also include DENCLUE [10] and OPTICS [11].

2.2.4. Model-Based

This kind of clustering algorithm assumes that data points are generated according to a certain probability distribution model, and the clustering process fits all data points to some predefined mathematical models. Therefore, this kind of algorithm can automatically identify the number of clusters and the outliers in the data points according to the selected mathematical model. Commonly used probability distribution models include the Gaussian mixture model (GMM) [32] and the mixture model for cluster analysis [33], and a typical algorithm is EM [34].

2.2.5. Grid-Based

The four kinds of clustering algorithms discussed above are all data-driven, directly partitioning or identifying data points, while the grid-based clustering algorithm is space-driven. This kind of algorithm divides the data space into a fixed number or size of grid units, and clustering is carried out on grid units instead of data points. Because the number of grid units is far less than the number of data points and the data points only need to be scanned once to obtain the statistical information of each unit, the grid-based clustering algorithm is faster, and its performance is largely independent of the number of data points. WaveCluster [12] and STING [13] are representative algorithms of this kind. Although the grid-based clustering algorithm is faster, it needs the number or size of grid units to be predefined. If the distribution of data points is irregular, using a fixed number or uniform size of grid units to divide the data points may lead to poor-quality clusters and long clustering time, and this approach is not suitable for processing high-density data.

2.3. Spark

Apache Spark [35] is one of the most popular big data parallel processing platforms at present. Because it uses memory-based RDDs to store input data and intermediate results, it avoids a large number of I/O operations compared with Hadoop [36], which makes it especially suitable for machine learning tasks that require multiple iterations [37].

A typical Spark cluster consists of one master node and several worker nodes. The master node is responsible for managing the resources of the worker nodes and assigning tasks to the worker nodes, while the worker nodes perform the corresponding distributed tasks. The working model of Spark is shown in Figure 1.

Spark distributes RDDs to the worker nodes in the cluster to realize distributed storage and provides a set of functions for performing parallel operations on RDDs across worker nodes. Commonly used functions are as follows:
map(f): use a user-defined function f to convert each record in the RDD to a new record, and return an RDD containing the new records
mapPartitions(f): use a user-defined function f to convert the records of each local RDD partition into new records, and return an RDD containing the new records
filter(f): use a user-defined function f to filter the records in the RDD and return an RDD containing the filtering results
reduce(f): use a user-defined function f to aggregate the data in the RDD
reduceByKey(f): use a user-defined function f to aggregate the data with the same key in a pair RDD
takeSample(s): use a user-defined generator seed s to obtain a sample of RDD records
collect(): return the records in the RDD to the master node as an array of objects
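
To make these operations concrete, the following minimal Scala sketch shows how several of the functions listed above are typically combined. The file path, the data layout (one comma-separated numeric point per line), and the variable names are illustrative assumptions, not taken from any cited algorithm:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RddBasics")
    val sc = new SparkContext(conf)

    // Load points as an RDD of Double arrays (one point per line, comma-separated).
    val points = sc.textFile("hdfs:///data/points.csv") // hypothetical path
      .map(line => line.split(",").map(_.toDouble))

    // filter: keep only points whose first attribute is non-negative.
    val filtered = points.filter(p => p(0) >= 0.0)

    // takeSample: draw 5 points without replacement (seed 42), e.g., as candidate centroids.
    val sample = filtered.takeSample(withReplacement = false, num = 5, seed = 42L)
    println(s"sampled ${sample.length} candidate centroids")

    // map + reduceByKey: count points per (integer) key derived from the first attribute.
    val counts = filtered.map(p => (p(0).toInt, 1)).reduceByKey(_ + _)

    // collect: bring the small aggregated result back to the master node.
    counts.collect().foreach { case (k, c) => println(s"key=$k count=$c") }

    sc.stop()
  }
}
```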

3. Taxonomy and Framework of Parallel Clustering Algorithm Based on Spark

The algorithms that have been proposed basically transplant the classical clustering algorithms to the Spark platform. They use RDDs to store datasets in a distributed way and use the functions provided by Spark to parallelize the key steps, so as to achieve parallel clustering. The proposed algorithms can be roughly divided into four categories.

3.1. Parallel K-Means Clustering Algorithm

K-means is the most famous and commonly used clustering algorithm. It has three key steps: the selection of the initial centroids, the assignment of every data point to the nearest cluster by calculating the distance between the data point and each cluster centroid, and the update of each cluster centroid. Many K-means variant algorithms take different strategies in these three key steps. The implementation framework of this kind of algorithm based on Spark is shown in Figure 2.

As can be seen from Figure 2, the three key steps of the K-means algorithm can be executed in parallel using the functions provided by Spark. After loading the data points into an RDD, this kind of algorithm can use the takeSample() function to randomly select the initial centroids or use the filter() function with customized filtering rules. The distance between every data point and all centroids can be calculated by map() or mapPartitions(). Each data point is assigned to the cluster represented by the nearest centroid, and the result of this assignment is stored in a pair RDD, where the key is the cluster ID and the value is the data point. Algorithms of this kind use the reduceByKey() function to update the centroids of the clusters. If the iteration end condition has been met (generally the number of iterations or the proportion of data points that changed clusters), the iteration stops, and collect() is used to gather the clustering results from each worker node. Otherwise, the data points are redistributed and the centroids are updated again.
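
The following simplified Scala sketch illustrates the framework of Figure 2; it is not the implementation of any cited algorithm. The squared Euclidean distance, the fixed number of iterations, and the input path are placeholder assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ParallelKMeansSketch {
  // Squared Euclidean distance between two points.
  def dist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the nearest centroid.
  def closest(p: Array[Double], centroids: Array[Array[Double]]): Int =
    centroids.indices.minBy(i => dist(p, centroids(i)))

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ParallelKMeansSketch"))
    val k = 3
    val maxIter = 20

    val points: RDD[Array[Double]] = sc.textFile("hdfs:///data/points.csv") // hypothetical path
      .map(_.split(",").map(_.toDouble)).cache()

    // Step 1: select initial centroids with takeSample.
    val centroids = points.takeSample(withReplacement = false, num = k, seed = 1L)

    for (_ <- 1 to maxIter) {
      val bc = sc.broadcast(centroids)
      // Step 2: assign each point to the nearest centroid -> pair RDD (clusterId, (point, 1)).
      val assigned = points.map(p => (closest(p, bc.value), (p, 1L)))
      // Step 3: update centroids with reduceByKey (per-cluster vector sum and count).
      val newCentroids = assigned
        .reduceByKey { case ((s1, c1), (s2, c2)) =>
          (s1.zip(s2).map { case (x, y) => x + y }, c1 + c2)
        }
        .mapValues { case (sum, count) => sum.map(_ / count) }
        .collect()
      newCentroids.foreach { case (id, c) => centroids(id) = c }
    }

    centroids.foreach(c => println(c.mkString(",")))
    sc.stop()
  }
}
```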

The framework of the parallel K-means algorithm in MLlib [38], a Spark-based machine learning algorithm library, is exactly the same as that in Figure 2. It randomly selects the initial centroids, uses the Euclidean distance (1) to measure the distance between data points, and uses the average value of the data points in each cluster to update the centroid of the cluster (2). This algorithm is easy to understand and use, but the Euclidean distance measurement becomes ineffective when the dimension of the data points is high, and because the mean of the data points in a cluster is used to update the centroid, fuzzy clustering is not supported. So, this algorithm is only suitable for deterministic clustering of low-dimensional data. The parallel K-means proposed in [39, 40] is similar to the parallel K-means in MLlib, which is based on the classic K-means algorithm implemented on Spark.
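
The displayed equations are not reproduced in this version of the text; in their standard forms, and using the notation of Section 2.1, the Euclidean distance (1) and the mean-based centroid update (2) referred to above are

\[ dist(x_i, x_j) = \sqrt{\sum_{l=1}^{d} (x_{il} - x_{jl})^2} \qquad (1) \]

\[ c_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i \qquad (2) \]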

In [41], Wang et al. proposed a series of optimization strategies for the parallel K-means algorithm in MLlib. This algorithm can use a simple and fast random selection method to choose the initial centroids, use the method proposed in [42] to select high-quality initial centroids according to the probability distribution, or use the distributed centroid selection method [43] to speed up the selection of initial centroids from big data. If the dimension of the data points is high, this algorithm uses the cosine distance [44] (3) or the KL distance [45] (4) to measure the distance between data points.
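
In their standard forms (the exact formulations in [41] may differ slightly), the cosine distance (3) and the KL divergence (4) between two data points can be written as follows, where, for the KL distance, the attribute values of each point are assumed to be normalized like a probability distribution:

\[ dist_{cos}(x_i, x_j) = 1 - \frac{\sum_{l=1}^{d} x_{il}\, x_{jl}}{\sqrt{\sum_{l=1}^{d} x_{il}^2}\; \sqrt{\sum_{l=1}^{d} x_{jl}^2}} \qquad (3) \]

\[ dist_{KL}(x_i \parallel x_j) = \sum_{l=1}^{d} x_{il} \log \frac{x_{il}}{x_{jl}} \qquad (4) \]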

The selection of initial centroids has a significant impact on K-means-type algorithms. Improper selection of initial centroids may lead to long execution time or poor clustering quality. Therefore, much research has explored the optimized selection of initial centroids. In [46], a parallel intelligent K-means based on Spark is proposed. The difference between this algorithm and K-means lies in the generation method of the initial centroids. Intelligent K-means [29] finds the center of gravity (average) of the data points in advance and initializes the centroids with the data points farthest from the center of gravity. This method of selecting the initial centroids takes the distribution of data points into account but is greatly affected by outliers. Similar algorithms include [47], which uses the bat algorithm [48] and the firefly algorithm [49] to optimize the selection of initial centroids.

Lu et al. proposed a parallel clustering algorithm that uses a tabu search strategy [50, 51] to optimize the updating of centroids [52]. Given a cluster C_i and its centroid c_i, a neighborhood domain (5) with a radius r can be created, and the new centroid can only be selected from this domain. The value of r can be determined by the average (6) of the distances between the data points in the cluster. If the centroid of a cluster is updated from c_i to a new centroid in an iteration, c_i is added to the tabu list, and it is no longer an alternative centroid within the specified t rounds of iteration. Although this method of centroid updating is more complex than that of classical K-means, the centroid update of each cluster can be performed in parallel on Spark, so it can still quickly generate new centroids of higher quality, making each cluster more compact and making the total distance over all data points, DistSum (7), as small as possible.
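
The paper's exact formulas are not reproduced here; under the notation above, the radius (6) and the overall objective DistSum (7) can plausibly be written in the following standard forms:

\[ r_i = \frac{1}{|C_i|} \sum_{x \in C_i} dist(x, c_i) \qquad (6) \]

\[ DistSum = \sum_{i=1}^{m} \sum_{x \in C_i} dist(x, c_i) \qquad (7) \]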

Another way to update the centroids is to use a heuristic search method, also called evolutionary computation. Its characteristic is to simulate biological evolution or natural selection, choosing the best solution among randomly generated new solutions. The representative algorithms of this kind are artificial bee colony (ABC) [30], particle swarm optimization (PSO) [31], etc. Yan et al. proposed a parallel ABC algorithm based on Spark [53]. The process of clustering simulates bees' search for high-quality food sources. The ABC algorithm divides all data points into three categories: food sources (centroids), employed bees (assigned data points), and unemployed bees (unassigned data points). In each iteration, a random partition solution is generated first, and then the partition probability of each data point is calculated; whether to update the solution is determined by comparing it with the previous partition solution, until the iteration stops. Similarly, the KMPSO [54] proposed by Matthew et al. implements a parallel PSO algorithm based on Spark.

Fuzzy C-means (FCM), proposed by Bezdek [6], is the most classical fuzzy clustering algorithm. The clustering result of this algorithm is not a hard assignment of each data point to a cluster but the probability that each data point belongs to each cluster. The difference between FCM and K-means mainly lies in the method of centroid updating and the design of the objective function. It uses equation (8) to update the centroids and equation (9) as the objective function. The objective of the iteration is to minimize the membership-weighted sum of distances between all data points and the clusters they belong to.
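
In their standard forms, with u_{ij} denoting the membership degree of data point x_i to cluster C_j and f > 1 the fuzzification exponent, the centroid update (8) and the objective function (9) of FCM are usually written as

\[ c_j = \frac{\sum_{i=1}^{n} u_{ij}^{f}\, x_i}{\sum_{i=1}^{n} u_{ij}^{f}} \qquad (8) \]

\[ J = \sum_{i=1}^{n} \sum_{j=1}^{m} u_{ij}^{f}\, dist(x_i, c_j)^2 \qquad (9) \]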

Neha et al. discussed the design ideas of three famous variants of the FCM algorithm, namely LFCM [55], resFCM [55], and RSIO-FCM [56], and proposed a parallel clustering algorithm SRSIO-FCM [57] based on Spark. SRSIO-FCM randomly divides the set of data points N into multiple subsets N_1, N_2, ..., N_s and clusters the first subset N_1 with the classical K-means algorithm to generate a set of centroids V_1. In the following iterations, when dealing with N_i, the centroids of the previously processed subsets are calculated according to the membership degrees to generate the recommended centroid set V_i of N_i, which is used as the initial centroids to cluster N_i, until all subsets are clustered.

Spark-PCM [7] proposed by Zhang et al. is a parallel version of the Possibilistic C-Means (PCM) [7] based on Spark. PCM uses the possibility model instead of the probability model of FCM, which is less affected by outliers. Because the possibility model matrix needs to be updated repeatedly, Spark-PCM uses the distributed matrix operation library Marlin [58] to accelerate the matrix update operation.

In order to improve on the traditional distance measurement methods that only support numerical attributes (such as equations (1), (3), and (4)), Mohamed et al. [59] implemented a Spark-based K-Prototypes [60]. The difference between K-Prototypes and K-means is the method for measuring the distance between data points and centroids. It uses equation (10), so that it can support data points with categorical attributes.
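
In its standard form (the weighting used in [59, 60] may differ), the K-Prototypes distance (10) combines a numerical term and a categorical mismatch term; assuming, without loss of generality, that the first d_r attributes are numerical and the remaining d_c attributes are categorical, with γ a weighting factor and δ(a, b) = 0 if a = b and 1 otherwise:

\[ dist(x_i, c_j) = \sum_{l=1}^{d_r} (x_{il} - c_{jl})^2 \;+\; \gamma \sum_{l=d_r+1}^{d_r+d_c} \delta(x_{il}, c_{jl}) \qquad (10) \]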

3.2. Parallel Hierarchical Clustering Algorithm

The hierarchical clustering algorithm organizes all data points into a tree structure, which can be built by agglomerating data points in the bottom-up direction or by dividing the set of data points in the top-down direction. Obviously, the agglomeration-based method is more suitable for parallel execution. Single linkage hierarchical clustering (SHC) [61] is a representative agglomeration-based clustering algorithm. It needs multiple iterations, and each iteration merges the nearest data points or clusters, until the end condition of the iteration is met (generally the number of clusters or the number of iterations).

In order to realize the agglomeration of contiguous data points or clusters, the set of data points needs to be divided according to the data space in advance. The framework of the parallel hierarchical clustering algorithm based on Spark is shown in Figure 3.

Jin et al. proposed a parallel SHC algorithm based on Spark named SHAS [62]. The framework of SHAS is the same as that in Figure 3, which mainly includes three stages: data point division, local clustering, and cluster merging. In the local clustering stage, the method proposed in [63] is introduced to transform clustering into the problem of finding a minimum spanning tree (MST) of a complete graph, in which the vertices are data points and the weight of an edge is the distance between the two data points. This method improves the efficiency of local clustering and cluster merging. Firstly, SHAS divides the set of data points into p roughly equal partitions and constructs a complete graph for every partition. Then, a bipartite graph is constructed for each pair of data partitions (p(p − 1)/2 in total). After the MSTs of the complete graphs and bipartite graphs are constructed, they are merged iteratively until only one MST is left.

MLlib, the machine learning algorithm library based on Spark, also contains a parallel hierarchical clustering algorithm, which is based on bisecting K-means [64]; the design idea comes from the paper [65]. Because bisecting K-means is a divisive hierarchical clustering algorithm working in the top-down direction, this algorithm can only start from a single node containing all data points, and it uses K-means to further divide clusters in each iteration. The degree of parallelism of this algorithm is low, and it is only suitable for training data of small size [66].

3.3. Parallel Density Clustering Algorithm

The clustering algorithms based on partitioning or hierarchy can only find spherical clusters, and their clustering quality is greatly affected by outliers, while density-based clustering algorithms overcome these two shortcomings. Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [3] is one of the most representative density-based clustering algorithms, which can identify clusters of any shape and outliers efficiently. DBSCAN uses the number of data points contained in the neighborhood with radius r of a data point p (Figure 4) as the density of p and divides all data points into three categories according to their density: core points, boundary points, and outliers. A core point is a data point whose density is greater than the specified threshold MinPts. A boundary point is a data point included in the neighborhood of a core point but is not itself a core point; an outlier is a data point not in the neighborhood of any core point. The neighborhoods of all density-connected core points form a cluster (Figure 5), and data points that do not belong to any cluster are outliers. Therefore, the density-based clustering algorithm mainly consists of two key steps: finding the core points by calculating the density of data points and merging the neighborhoods of density-connected core points (each representing a subcluster) into clusters.
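
In standard DBSCAN notation (the original displayed equation is not reproduced here), the neighborhood and density of a data point p used above can be written as

\[ N_r(p) = \{\, q \in N : dist(p, q) \le r \,\}, \qquad density(p) = |N_r(p)| \]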

The framework of a parallel density clustering algorithm based on Spark is shown in Figure 6.

Fang et al. proposed a parallel density clustering algorithm named Parallel DBSCAN [67] based on Spark and classic DBSCAN. According to their analysis of DBSCAN, more than 90% of its execution time is used to calculate the density of data points and find the core points; therefore, parallelizing this step can significantly improve the performance of DBSCAN. Parallel DBSCAN mainly consists of three stages. In the first stage, data points are redistributed over the data space according to the strategy of grid and secondary expansion partitioning [68]. Each grid cell is called a local area, the adjacent parts between local areas are called boundary areas, and the data points in a boundary area are called boundary points (Figure 7). Because the boundary points may be density-connected with data points in multiple local areas, the algorithm enlarges each grid cell appropriately to include the boundary points (Figure 8).

In the second stage, classical DBSCAN is executed in each cell in parallel to generate local clusters. Finally, in the third stage, all local clusters are merged and relabeled to generate global clusters. The authors in [67] also discuss the optimization of Parallel DBSCAN in terms of data transmission, serialization, parameter configuration, etc. The experimental results show that Parallel DBSCAN achieves good speedups over classic DBSCAN under various Spark operating environments.

Amar et al. proposed a parallel clustering algorithm SparkSNN [69] based on Spark and the Shared Nearest Neighbor (SNN) algorithm [70]. Different from Parallel DBSCAN, which uses the Euclidean distance (equation (1)) to measure the distance between data points, SparkSNN uses the number of data points in the intersection of the nearest-neighbor lists of two data points as the similarity measure (equation (12)); the density of a data point is defined as the sum of the similarities between the point and the data points in its nearest-neighbor list (equation (13)), and a core point (which is also the centroid of a subcluster) is a point whose density is greater than the specified threshold MinPts. SparkSNN, like Parallel DBSCAN, also includes three main stages: redistributing data points, local clustering, and global merging. The difference is that the local clustering stage uses SNN. Compared with the DBSCAN algorithm, the SNN algorithm not only retains the advantages of discovering clusters of arbitrary shapes and recognizing outliers but also performs better on datasets with high dimensionality or uneven density.
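
In standard SNN form (with NN(p) denoting the set of nearest neighbors of p; the exact formulation in [69, 70] may differ), the similarity (12) and density (13) described above can be written as

\[ sim(p, q) = |\, NN(p) \cap NN(q) \,| \qquad (12) \]

\[ density(p) = \sum_{q \in NN(p)} sim(p, q) \qquad (13) \]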

Liu et al. proposed a parallel clustering algorithm, Parallel DP [71], based on Spark GraphX and Density Peaks [72]. The main difference between this algorithm and Parallel DBSCAN is the selection of the cluster centroids. Parallel DBSCAN selects the first found core point as the centroid of a cluster, which can only ensure that the density of the centroid is not less than the predefined threshold MinPts. Parallel DP calculates, for each data point, its density and its minimum distance to other data points with higher density (equation (14)) and selects the data points for which both values are largest as the centroids of clusters. In other words, the densest data point in a cluster is selected as its centroid, which makes the cluster compact and more conducive to merging local clusters into global clusters.
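
In the usual Density Peaks notation, with ρ_i denoting the density of data point x_i, the minimum distance to higher-density points (14) is defined as follows (for the point with the highest density, δ_i is conventionally set to its maximum distance to any other point):

\[ \delta_i = \min_{j:\, \rho_j > \rho_i} dist(x_i, x_j) \qquad (14) \]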

The framework of the parallel clustering algorithm [73] proposed by Liang et al. is consistent with Figure 6, but special methods are adopted in the three stages. In the data point redistribution stage, locality-sensitive hashing (LSH) functions [74] are used to achieve better load balancing and spatial density estimation. In the local clustering stage, classic kernel density estimation and high-density nearest neighbors (KNN) [75] are used, and a Gaussian model is used to represent each local cluster to reduce the amount of data transmitted over the network and the amount of computation in the merging stage. In the stage of merging local clusters into global clusters, density connection based on the Gaussian models is used to merge multiple models (local clusters).

3.4. Parallel Model Clustering Algorithm

The model-based clustering algorithm is based on the assumption that the data points come from a data source that contains multiple subpopulations. The data points in each subpopulation conform to a certain probability distribution, and the dataset is a mixture of multiple subpopulations. The most commonly used model is the Gaussian Mixture Model (GMM) [32], which regards each cluster as a Gaussian distribution, so that the set of data points is a mixture of multiple Gaussian distributions with different parameters (equation (15)). The process of clustering is to assign the data points to the Gaussian distributions, which directly generates the clusters. Therefore, the model-based clustering algorithm has an advantage in running speed over other types of algorithms.
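
In its standard form, the Gaussian mixture density (15) referred to above, with mixing weights π_k, means μ_k, and covariance matrices Σ_k of the m components, is

\[ p(x) = \sum_{k=1}^{m} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{m} \pi_k = 1 \qquad (15) \]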

MLlib provides a parallel clustering algorithm based on GMM. It uses expectation maximization (EM) [34] on sampled data points to fit a mixture of one or more multivariate Gaussian distributions, and through multiple iterations generates means and standard deviations of each Gaussian distribution that are closer to the real situation. After training, the Gaussian mixture model is used to classify all data points and obtain a predefined number of clusters. In addition to the GMM model, MLlib also provides a parallel Latent Dirichlet Allocation (LDA) [76] algorithm, which is mainly used for text clustering. LDA is a probabilistic model of a corpus. It assumes that a text is a random mixture of multiple latent topics, and each latent topic can be identified by a probability distribution over words.
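
As an illustration of how the MLlib GMM clustering described above is typically invoked, the following minimal Scala sketch uses the DataFrame-based spark.ml API; the input path, data format, and parameter values are assumptions for illustration, not taken from the cited work:

```scala
import org.apache.spark.ml.clustering.GaussianMixture
import org.apache.spark.sql.SparkSession

object GmmExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("GmmExample").getOrCreate()

    // Hypothetical input in libsvm format; any DataFrame with a "features" vector column works.
    val data = spark.read.format("libsvm").load("hdfs:///data/points.libsvm")

    // Fit a Gaussian mixture model with a predefined number of clusters.
    val gmm = new GaussianMixture().setK(3).setMaxIter(100).setSeed(1L)
    val model = gmm.fit(data)

    // Each component's weight, mean, and covariance describe one cluster.
    model.weights.zip(model.gaussians).foreach { case (w, g) =>
      println(s"weight=$w mean=${g.mean}\ncov=\n${g.cov}")
    }

    // Assign every data point to the most probable Gaussian component.
    model.transform(data).select("prediction").show(5)
    spark.stop()
  }
}
```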

The above two parallel model clustering algorithms provided by MLlib both include three main stages: data sampling, training probability model, and classification. These algorithms firstly get some samples from all data points, then use the samples to train the probability model, and finally use the trained probability model to classify all data points in parallel.

3.5. Parallel Clustering Algorithm for Special Dataset

A high-dimensional dataset refers to a set of data points described by a large number of attributes. High dimensionality brings serious challenges to clustering algorithms based on distance measurement, affecting not only the performance but also the accuracy of clustering; this is called "the curse of dimensionality" [77]. There are two main methods for clustering a high-dimensional dataset: one is subspace clustering, which clusters on subsets of the attributes rather than on all attributes; the other is dimensionality reduction, which combines or transforms the original attributes to form new attributes, so as to reduce the dimension.

Zhu et al. proposed a parallel subspace clustering algorithm named CLUS [78] based on Spark, which parallelizes the classic subspace clustering algorithm SUBCLU [79]. If the dimension of the data points is d, there are up to 2^d − 1 subspaces, and the subspaces with the same dimension can be clustered in parallel. CLUS starts from the subspaces with a single dimension and generates all clusters in these subspaces. In the k-th iteration, any two (k−1)-dimensional candidate clusters are merged, and the merged results are pruned monotonically to generate the k-dimensional subspaces. The DBSCAN algorithm is used to cluster the newly generated k-dimensional subspaces, which requires at most logP iterations. In the above process, each worker node needs to process all the data points in a subspace, and some data points need to be copied repeatedly between multiple nodes. It is therefore difficult to redistribute the dataset, and the amount of data transmitted between nodes is large.

Spectral clustering is a dimensionality reduction method. Zhu et al. put forward a parallel spectral clustering algorithm based on Spark named SCoS [80]. This algorithm parallelizes the four main steps of spectral clustering: building the similarity matrix, building and normalizing the Laplacian matrix, computing the eigenvectors, and clustering the eigenvector matrix in parallel. Several performance optimizations are used in SCoS: parallel computation based on multiround iteration is adopted to speed up the construction of the similarity matrix; sparse representation and storage are adopted to further reduce the data transmission between storage and nodes; and ScaLAPACK, a numerical linear algebra computing library, is used to accelerate the parallel solution of the eigenvectors.
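
For reference, in standard spectral clustering notation (not taken from [80]): given a similarity matrix W and the diagonal degree matrix D with D_{ii} = Σ_j W_{ij}, the Laplacian and its commonly used normalized form are

\[ L = D - W, \qquad L_{sym} = D^{-1/2} L\, D^{-1/2} = I - D^{-1/2} W D^{-1/2} \]

and the eigenvectors corresponding to the smallest eigenvalues of L_{sym} form the low-dimensional representation that is then clustered, for example with K-means.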

In addition to high-dimensional data, there are also parallel clustering methods [81, 82] for streaming data realized with Spark Streaming and parallel clustering methods [83–86] for graph data realized with Spark GraphX, which expand the types of datasets that can be clustered in parallel based on Spark.

4. Conclusion and Prospection

4.1. Summary of Proposed Algorithms

To sum up, many researchers have proposed parallel clustering algorithms based on Spark, and their main common features are as follows:
(1) These parallel clustering algorithms are all based on classical clustering algorithms, with no significant change in algorithm design; they all improve the clustering efficiency by parallelizing the main steps of the classical algorithms.
(2) These parallel clustering algorithms use the RDDs provided by Spark to store data points, but most of them need to redistribute the data points according to the characteristics of the data or the algorithm.
(3) These algorithms use the functions provided by Spark to operate on each data partition in parallel and make full use of Spark's memory-based computing to improve the efficiency of multiple iterations.

The comparison of key features of various algorithms is shown in Table 1.

4.2. Prospection of Future Works

Big data can be generally defined and described with the 5Vs [23]: Volume, Variety, Value, Velocity, and Veracity. From the above discussion, we can see that the proposed parallel clustering algorithms based on Spark mainly address the problem of the large scale of data. In addition to the huge Volume of data, big data also has the following features.

4.2.1. Variety

The data types are very diverse. In addition to the most commonly used structured two-dimensional quantitative data, they also include categorical data, high-dimensional data, and three-dimensional data such as time-series data.

4.2.2. Value

Because of the low value density of data, a data mining algorithm is needed to find the important information contained in the data.

4.2.3. Velocity

The processing of big data requires not only fast batch processing speed but also real-time data processing.

4.2.4. Veracity

Generally, big data contains some erroneous or noisy data.

According to the above characteristics of big data, research on parallel clustering algorithms based on Spark can be carried out in five directions:
(1) Research on parallel clustering algorithms that can deal with more types of data: categorical data is different from quantitative data; its values are discrete, there is no natural order between different categorical values, and there is no distance measurement information. Therefore, it is necessary to further study parallel clustering algorithms that can process categorical data. Generally, big data has a large number of attributes. Because "the curse of dimensionality" causes classical distance measurement methods to fail or become meaningless, it is necessary to further study parallel clustering algorithms based on subspace or dimensionality reduction techniques to process high-dimensional data. Three-dimensional data are usually time- or location-related data, such as gene-sample-time series in bioinformatics or item-time-location data in market analysis. It is necessary to further study parallel clustering algorithms that can process three-dimensional data to find clusters across time or location.
(2) Research on parallel clustering algorithms that can find the significant clusters in the data: big data with a large number of data points and attributes leads to a great increase in the number of clusters generated by clustering, making the clustering results redundant and poorly interpretable. The concepts of maximal clusters or significant clusters can be used to study parallel clustering algorithms for the important clusters in the data, to improve the interpretability of the clustering results and reveal the important information in the data.
(3) Research on parallel clustering algorithms with better performance: the grid-based clustering algorithm transforms the clustering of data objects into the clustering of grid units, which yields better performance. The grid-based parallel clustering algorithm supported by Spark can be studied further. Because the performance of the grid-based clustering algorithm is closely related to the data space and the grid size, and the number of data points has little effect on its performance, it is well suited to big data clustering.
(4) Research on parallel clustering algorithms that can deal with streaming data: in addition to batch clustering of stored data, parallel clustering algorithms that can process streaming data in real time also need to be studied, to further improve the timeliness and responsiveness of clustering.
(5) Research on parallel clustering algorithms that can deal with noise: noisy data generally contains erroneous, missing, or unknown values, which often appear in big data. Therefore, further research on parallel clustering algorithms that can deal with noise is also an important research direction.

Conflicts of Interest

The authors declare that there are no conflicts of interest in connection with the work submitted.