Abstract

The density-based spatial clustering algorithm DBSCAN discovers clusters of arbitrary shape in noisy data sets by connecting high-density regions and is widely used. However, it suffers from slow computation caused by large-scale disk I/O, clustering bias on clusters of uneven density, and poor parameter search ability. To address these problems, a parallel density clustering algorithm based on an improved fruit fly optimization algorithm and Spark memory iteration is proposed. The proposed algorithm first divides the data grid using an irregular dynamic density region partitioning strategy. Then, a hybrid fruit fly particle swarm algorithm based on a genetic optimization mechanism is proposed to dynamically optimize the parameters of local clustering and thus improve its clustering quality. Finally, local clustering results are merged by a custom strategy that compares samples in the irregularly bounded grid cells of each partition. Experiments show that the improved algorithm generalizes well to clusters of different shapes and to larger-scale data, with clear gains in accuracy and parallel efficiency.

1. Introduction

The age of big data is an era of information explosion. Data in financial transactions, electronic information, social networks, biomedicine, and other fields are generated at high speed and presented in different structured forms. As a result, data in the age of big data are characterized by large volume, high velocity, variety, high value, and accurate information [1, 2]. Big data spans many fields, such as human lifestyle, behavior habits, biomedicine, and software data, and its rich, valuable information attracts close attention from enterprises and even national departments [3, 4]. Obtaining and mining the valuable information contained in big data to provide services to individuals, enterprises, and even countries has become an important issue of the era [5]. In addition, labeling the raw data used in image information fusion is very time-consuming and labor-intensive, so clustering the raw data first becomes a better choice and lays a foundation for improving the performance of subsequent labeling, recognition, classification, and other algorithms.

Data mining is a technique for extracting potentially useful information and knowledge from large amounts of incomplete, noisy, fuzzy, and random data [6]. However, traditional data mining algorithms only apply to small-scale data and cannot be applied directly to big data. Researchers have therefore combined traditional data mining algorithms with big data distributed computing frameworks to process and analyze massive, high-speed, and diverse data, meeting the demand for rapid mining of valuable information in the big data era [7]. Big data mining tasks fall into four main categories: clustering analysis, classification analysis, association rules, and regression [8–11]. As an unsupervised learning algorithm in data mining, the clustering algorithm can partition data sets into classes of similar objects based on the relevant characteristics of physical or abstract objects and thus identify potential distributions in the data. Clustering analysis is therefore an important task for big data analysis and mining, and it is also a fundamental tool for information granulation and information compression in big data [12]. The massive scale and high dimensionality of big data pose serious challenges to clustering analysis, from the models and methods of the algorithms themselves to their application [13]. Efficient clustering analysis is important for improving the accuracy of object detection and classification algorithms [14–17]. Thus, within research on clustering analysis methods for big data, parallel density clustering algorithms have become a current research hotspot.

The parallel density clustering algorithm has been committed to improving the optimization ability of clustering parameters. In recent years, with the development of swarm intelligence optimization, algorithms that seek optimal solutions by simulating biological behavior have been widely applied to the parameter optimization problem of clustering algorithms. However, parallel density clustering based on swarm intelligence optimization also suffers from a complex optimization process that is not easy to implement and easily falls into local optima. Considering the above issues, the following research is carried out in this study:

(1) We design an irregular dynamic density region partitioning strategy for dynamically partitioning data grid cells.

(2) A hybrid fruit fly particle swarm optimization algorithm based on a genetic optimization mechanism is proposed, which dynamically optimizes the Eps and MinPts parameters in local clustering to improve its clustering effect.

(3) For the Spark model, a user-defined cluster merging strategy is designed for the local merging of samples in irregularly bounded grid cells under each partition; this strategy effectively improves the overall parallel efficiency of the algorithm.

(4) The performance of the proposed method in big data clustering is verified on different data sets, especially data for image information fusion.

2. State of the Art

As an unsupervised learning algorithm, the traditional density clustering algorithm requires high time and space complexity when processing data. Traditional density clustering models and methods (such as Density-Based Spatial Clustering of Applications with Noise, DBSCAN) are greatly challenged by massive multisource heterogeneous big data, especially in image information fusion. Earlier big data density clustering algorithms only added preprocessing steps before algorithm execution, such as sampling-based, partition-based, and compression-based methods [18, 19]. However, the Google MapReduce computing framework, favored for its high availability, high scalability, and good fault tolerance, changed this picture, and many researchers began to study density clustering algorithms based on parallel computing frameworks such as MapReduce and Spark [20].

Parallel density clustering algorithms based on MapReduce, Spark, and similar models improve the traditional density clustering algorithm by dividing the clustering task into several independent, simultaneously executed subtasks and then combining their results to obtain the final clustering. The first distributed parallel density clustering algorithm was proposed by Li et al. on the MapReduce framework. It obtains local clusters by executing the DBSCAN algorithm in parallel on partitioned data and then gradually merges the local clusters into global clustering results. However, the algorithm does not provide a feasible method for data partitioning, the efficiency of obtaining global clusters is not high, and its parameter selection has a large impact on the clustering effect [21]. To solve the data partitioning problem and improve the efficiency of obtaining global clusters, Wang et al. [22] proposed an incremental parallelized fast clustering algorithm, which uses dichotomous partitioning of the spatial grid based on the number of data points and combines the greedy algorithm with elastic reorganization partitioning to partition the data reasonably. The algorithm uses a tree-based index structure to merge the results of local clustering and improve the efficiency of obtaining global clusters. In addition, Jing et al. [23] applied a new data partitioning and merging method to implement the DBSCAN algorithm for partitioning and clustering on the Spark framework; the algorithm introduces KD trees in the data partitioning phase and reduces the number of visits to the dataset through neighbor queries. To reduce the time overhead of local cluster merging, it separately extracts feature points of local clusters as the basis for merging. Since these algorithms still do not apply the distributed idea to the acquisition of global clusters, their overall parallel performance needs improvement, and they remain sensitive to the grid edge-length threshold and to the parameter selection of DBSCAN. To further address the sensitivity of clustering parameter selection, Song and Xu [24] proposed the H-DBSCAN algorithm based on k-dist graphs in the Spark framework, which calculates the Eps neighborhood values of density clustering using k-dist graphs after partitioning the data. Xia et al. [25] proposed the parallel SP-DBSCAN algorithm in the Spark framework, optimizing the traditional DBSCAN algorithm by silhouette coefficients and boarding rates to solve the parameter sensitivity problem. Experiments demonstrated that both algorithms improve the accuracy of the clustering effect and effectively mitigate parameter selection sensitivity. However, these algorithms perform poorly in global parameter search and have some limitations.

In recent years, swarm intelligence optimization algorithms have developed rapidly. By simulating the behavior of animal groups in nature, such algorithms obtain optimal solutions through interaction among a finite number of individuals, relying on information exchange and cooperation within the group [26]. Swarm intelligence optimization algorithms have good global search behavior and are easy to implement, so they are widely used in the parameter search problem of clustering algorithms [27]. Therefore, density clustering based on swarm intelligence optimization under big data has gradually become a new research hotspot. Hu et al. [28] used genetic algorithms to iteratively calculate the optimal values of Eps and MinPts in the parallel DBSCAN algorithm. Considering the slow search speed of the genetic algorithm and its strong dependence on initial population selection, Sherar and Zulkernine [29] implemented a parallel clustering algorithm with particle swarm optimization in the Apache Spark framework to improve the accuracy of data partitioning. Deng [30] proposed a parallel density clustering algorithm under MapReduce, which partitions data using the particle swarm optimization algorithm and finds local clustering parameters with the k-dist graph. In addition, Ashish et al. [31] proposed a parallel clustering method using the bat algorithm, which achieves fast and efficient operation by dividing large data sets into small blocks and clustering these smaller blocks in parallel. Lai et al. [32] selected and improved a special variable update method with excellent optimization performance for the parameter optimization problem of DBSCAN; the algorithm can not only quickly find the highest clustering accuracy of DBSCAN but also find the Eps interval corresponding to that accuracy. Zhou et al. [33] and Liu et al. [34] used the fruit fly swarm optimization algorithm and an improved fruit fly swarm optimization algorithm, respectively, to dynamically optimize the parameters in parallel density clustering and improve the clustering effect of local clusters. Although these works successfully applied swarm optimization to clustering parameter optimization, swarm intelligence algorithms still suffer from shortcomings such as complex calculation, prematurity, and falling into local optima. Therefore, improving the parameter search capability of swarm intelligence algorithms is an urgent problem for current parallel density clustering algorithms.

3. Preliminaries

3.1. DBSCAN Algorithm

The DBSCAN algorithm is a typical density-based clustering algorithm, which defines a cluster as the largest set of density-connected points. The algorithm is able to classify high-density data regions into different clusters and identify clusters of arbitrary shape in a "noisy" data set. In the n-dimensional space, given the threshold radius Eps and the threshold size MinPts, the algorithm discovers arbitrarily shaped clusters of sample points by iterative computation, filters noisy points out of the sample data set, and obtains the density clustering results.

The basic idea of the DBSCAN algorithm is as follows: first, a point p is randomly selected from a given set of data objects D, and the cluster is grown within the given radius Eps around that point. If the Eps neighborhood of p contains at least MinPts objects, a new cluster is created with p as a core object; the data objects that are directly density-reachable from these core objects are then found repeatedly, and this process may involve merging density-reachable clusters. The execution flowchart of the DBSCAN clustering algorithm is shown in Figure 1.
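To make this flow concrete, the following minimal Python sketch implements the procedure just described; the brute-force neighborhood query and the helper names (region_query, dbscan) are illustrative choices, not the paper's implementation.

```python
import numpy as np

def region_query(data, idx, eps):
    """Indices of all points within Eps of point idx (including itself)."""
    dists = np.linalg.norm(data - data[idx], axis=1)
    return np.where(dists <= eps)[0].tolist()

def dbscan(data, eps, min_pts):
    labels = [None] * len(data)            # None = unvisited, -1 = noise
    cluster_id = 0
    for p in range(len(data)):
        if labels[p] is not None:
            continue
        neighbors = region_query(data, p, eps)
        if len(neighbors) < min_pts:       # p is not a core point
            labels[p] = -1
            continue
        labels[p] = cluster_id             # start a new cluster at core point p
        seeds = list(neighbors)
        while seeds:                       # grow via direct density reachability
            q = seeds.pop()
            if labels[q] == -1:            # noise becomes a border point
                labels[q] = cluster_id
            if labels[q] is not None:
                continue
            labels[q] = cluster_id
            q_neighbors = region_query(data, q, eps)
            if len(q_neighbors) >= min_pts:    # q is also a core point
                seeds.extend(q_neighbors)
        cluster_id += 1
    return labels
```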

3.2. Spark Framework

Spark is an in-memory distributed computing engine that builds on the MapReduce computing model by introducing the RDD (Resilient Distributed Dataset) abstraction. When Spark executes, data are mapped to RDD structures, exploiting the in-memory computing architecture by repeatedly caching data in memory during iterative calculations. Unlike MapReduce, which spills intermediate data to hard disk I/O, Spark can improve computing speed by ten or even a hundred times. In terms of RDD dependencies, operations on RDDs are divided into lazily evaluated transformation operations and action operations that actually run; transformations are not executed immediately and wait until an action triggers them. The operators used for parallel computation, such as map, reduce, join, and foreach, play a key role in algorithm optimization. The overall architecture of Spark is shown in Figure 2.
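As a brief illustration of this execution model, the PySpark sketch below (the HDFS path and the toy centroid computation are illustrative) shows how transformations stay lazy, how cache() keeps a parsed RDD in memory, and how actions trigger execution.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

# Transformations (textFile, map) only build the RDD lineage; nothing runs yet.
points = sc.textFile("hdfs:///data/points.csv") \
           .map(lambda line: [float(v) for v in line.split(",")])

points.cache()  # keep the parsed RDD in memory across iterative computations

# Actions (count, reduce) trigger actual execution; with cache() the data is
# read and parsed once, then served from memory on subsequent actions.
total = points.count()
dim_sums = points.reduce(lambda a, b: [x + y for x, y in zip(a, b)])
centroid = [s / total for s in dim_sums]
print(centroid)

sc.stop()
```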

4. Methodology

To address the iterative computation problem of the DBSCAN clustering algorithm, a parallel density clustering algorithm combining Spark and the improved fruit fly optimization algorithm is proposed, taking the Spark parallel mechanism into comprehensive consideration. The method is divided into three steps: (1) the irregular dynamic density region partitioning strategy divides the data space into grid cells; (2) in local clustering, an improved fruit fly swarm optimization algorithm dynamically adjusts the optimal parameters of local clustering; (3) a custom strategy merges local clustering results into global clusters while accelerating the convergence of local cluster merging. A parallelized local cluster merging algorithm is designed in combination with the Spark parallel mechanism. The aim of the parallelized algorithm is to improve the accuracy of clustering large-scale data sets in the Spark cluster environment, shorten the clustering running time, and improve clustering efficiency. The overall flow of the proposed algorithm is shown in Figure 3.

4.1. Data Division

When dividing big data grids, two problems arise: (1) sensitive selection of the grid division threshold, i.e., it is difficult to choose the threshold for dividing the data grid, and this choice affects the clustering results; (2) uneven data grid density, i.e., when the data are not partitioned according to the actual distribution of the big data, the resulting grid density is uneven.

To solve the above problems, a strategy of irregular dynamic density region division is proposed. The improved algorithm divides the grid cells according to the number of data points and the index width difference. Then, data sets with similar density and dispersion that fall within the merging threshold are combined by a user-defined calculation method. The hierarchical merging and averaging step reduces density fluctuation during the formation of each irregular region, which helps average the density dispersion index of each divided region.

Let there exist a d-dimensional data set H = {h1, h2, …, hd}, where the data of the i-th dimension lie in the bounded interval [f_i, l_i), i = 1, 2, …, d; then S = [f_1, l_1) × [f_2, l_2) × ⋯ × [f_d, l_d) is the d-dimensional data space, a disjoint data set whose dimensions denote data attributes. Dividing the data space along each dimension forms d-dimensional grid cells. The grid width of each dimension in the d-dimensional space is calculated as

$$w = c \cdot \min_{1 \le i \le d} \frac{l_i - f_i}{N},$$

where N is the number of samples.

Here, c is a scaling coefficient that appropriately shrinks the grid width to obtain a suitable initial cell grid: the ratio of the boundary difference to the number of samples is computed for each dimension, and the smallest ratio is taken as the grid width benchmark. After the grid width is obtained, the number of grids and the grid index in each data dimension are further determined as

$$m_i = \left\lceil \frac{l_i - f_i}{w} \right\rceil, \qquad g_i(h) = \left\lfloor \frac{h_i - f_i}{w} \right\rfloor,$$

where m_i is the number of grids in the i-th dimension and g_i(h) is the grid index of sample h in that dimension.

The grid merging scheme is designed after initializing the grid cells. In each dimension of the d-dimensional space, every grid cell contains the sample data falling into it, and the size of each disjoint grid data set is used as the cell density, denoted Den(f). The improved algorithm merges grid cells by judging the relative difference ratio of the grid densities of adjacent cells; combined with the greedy algorithm, the problem is decomposed into several subproblems, the unmarked index grids are traversed, and the cell of maximum density is taken as the core grid cell. The relative difference between two adjacent grid cells is calculated as

$$DD(f, f_0) = \frac{|Den(f) - Den(f_0)|}{\max\{Den(f), Den(f_0)\}},$$

where f_0 is the core grid cell and f an adjacent cell.

Using this ratio as the grid-merging condition, the relative difference between the initial core grid cell and each adjacent grid cell is computed in turn, and adjacent cells satisfying DD(f, f_0) < ε are merged with the core cell. The grid merging strategy is shown in Figure 4.

When the above condition is satisfied, the adjacent grid and the core grid are marked as one class. To attenuate the fluctuation of the density difference ratio caused by the density calculation, a hierarchical merging mean algorithm is used to unify the average density value, taking the merge layer as the base:

$$\overline{Den}_k = \frac{1}{|L_k|} \sum_{f \in L_k} Den(f),$$

where L_k denotes the set of grid cells merged up to layer k.

The process of dividing a search area into a grid can be achieved by employing the breadth-first search (BFS) algorithm. Subsequently, grids that satisfy the specified criteria are merged, while those that do not fulfill the discriminatory conditions remain unmerged. Additionally, adjacent cell grids that delineate the boundaries are distinctly marked as boundary grids. All the grids that have participated in the calculation are marked as searched until the end of this merging. If there are still cell grids in the unsearched state, the subproblem continues to be solved until the end of the regional division.
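The following simplified Python sketch illustrates this division-and-merge process under stated assumptions: cells of a fixed width are hashed from the points, and BFS grows a region from the densest unsearched cell while the relative density difference stays below a threshold. Boundary marking and the layered density averaging are omitted, and all names are illustrative.

```python
from collections import defaultdict, deque
import itertools

def build_grid(points, width):
    """Map each point to the index of its grid cell (cell size = density)."""
    cells = defaultdict(list)
    for p in points:
        cells[tuple(int(x // width) for x in p)].append(p)
    return cells

def dd(den_a, den_b):
    """Relative density difference ratio between two adjacent cells."""
    return abs(den_a - den_b) / max(den_a, den_b)

def merge_regions(cells, threshold):
    """BFS from the densest unsearched cell, absorbing similar neighbours."""
    unsearched = set(cells)
    regions = []
    while unsearched:
        core = max(unsearched, key=lambda c: len(cells[c]))  # densest cell first
        region, queue = {core}, deque([core])
        unsearched.discard(core)
        while queue:
            cur = queue.popleft()
            for off in itertools.product((-1, 0, 1), repeat=len(cur)):
                nb = tuple(c + o for c, o in zip(cur, off))
                if nb in unsearched and dd(len(cells[cur]), len(cells[nb])) < threshold:
                    region.add(nb)
                    unsearched.discard(nb)
                    queue.append(nb)
        regions.append(region)
    return regions

# Example with 2-D points and illustrative width/threshold values.
pts = [(0.1, 0.2), (0.15, 0.25), (0.9, 0.9), (5.0, 5.0)]
print(merge_regions(build_grid(pts, width=0.5), threshold=0.6))
```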

4.2. Local Clustering Analysis Based on Improved Fruit Fly Optimization

After region partitioning, the Spark parallelization structure is designed to carry out iterative clustering computation: the irregular region data sets are loaded as RDDs from the HDFS distributed file system and transformed into different partitions. Iterative clustering is then performed in parallel across the partitions, and the corresponding threshold radius Eps and threshold size MinPts are determined dynamically for each RDD partition based on the improved fruit fly optimization algorithm.

4.2.1. Fruit Fly Optimization Algorithm

The fruit fly swarm optimization algorithm (FOA) is an optimization algorithm that simulates the foraging behavior of fruit fly swarms. In FOA, the swarm searches for food by continuously updating the swarm position. The algorithm has a simple structure, and its parameters are easy to adjust. Let the number of fruit flies be N_f and the swarm position be (X_axis, Y_axis). The basic update iteration of FOA is

$$X_j^{t_f} = X_{axis}^{t_f} + r^{t_f} \cdot \mathit{rand}, \qquad Y_j^{t_f} = Y_{axis}^{t_f} + r^{t_f} \cdot \mathit{rand},$$

where j is the fruit fly index, t_f is the fruit fly dimension, rand is a random number, and r^{t_f} is the search radius of the t_f-th dimension.

Since the food location is unknown, the distance Dist_j between the current position of fruit fly j and the origin is calculated first, and the taste concentration judgment value Sm_j is then obtained as its reciprocal:

$$Dist_j = \sqrt{X_j^2 + Y_j^2}, \qquad Sm_j = \frac{1}{Dist_j}.$$

The fruit fly optimization algorithm evaluates individuals by their taste concentration value, calculated as

$$Smell_j = f_s(Sm_j),$$

where Smell_j is the taste concentration function value of the j-th individual fruit fly and f_s is the taste concentration (fitness) function.
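Putting the three formulas together, a compact sketch of the basic FOA loop might look as follows; the fitness function fs and the search radius r are placeholders supplied by the caller, and the small guard on the distance is an implementation convenience rather than part of the algorithm.

```python
import math
import random

def foa(fs, n_flies=30, n_iter=100, r=1.0):
    """Basic fruit fly optimization; fs maps a taste value Sm to a fitness."""
    x_axis, y_axis = random.uniform(-r, r), random.uniform(-r, r)
    best_smell = -math.inf
    for _ in range(n_iter):
        for _ in range(n_flies):
            # random flight around the current swarm position (update rule above)
            xj = x_axis + r * random.uniform(-1, 1)
            yj = y_axis + r * random.uniform(-1, 1)
            dist = math.hypot(xj, yj)          # distance to the origin
            sm = 1.0 / max(dist, 1e-12)        # taste concentration judgment value
            smell = fs(sm)                     # taste concentration (fitness)
            if smell > best_smell:             # vision: swarm moves to the best fly
                best_smell, x_axis, y_axis = smell, xj, yj
    return best_smell, (x_axis, y_axis)

# Example: reward flies whose distance to the origin is close to 2 (toy target).
print(foa(lambda sm: -abs(1.0 / sm - 2.0)))
```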

4.2.2. Improved Fruit Fly Optimization Algorithm

The FOA algorithm generally uses randomly generated data as the initial population when solving function optimization problems, which ensures a fairly uniform distribution of initial positions but makes it difficult to preserve population diversity, weakening the search. To improve the performance of FOA, a combination of particle swarm optimization (PSO) and fruit fly optimization is proposed to maintain population diversity. In addition, a genetic evolution mechanism is introduced to counter the hybrid algorithm's reliance on the best-individual search mode, which makes it prone to local optima. The individual update iteration of the hybrid fruit fly particle swarm algorithm based on the genetic evolution mechanism (FOAPSO-GA) is

$$X_j^{d} = r_{pso}\, \tilde{X}_{j}^{d} + (1 - r_{pso})\,\bigl(\hat{X}_{j}^{d} + r_d \cdot \mathit{rand}\bigr),$$

where r_pso is the update weight of the particle swarm algorithm, 1 − r_pso is the update weight of the fruit fly component, r_d is the search radius of the d-th dimension, \tilde{X}_j^d is the individual update result obtained by the basic particle swarm algorithm, \hat{X}_j^d is the bootstrap (guiding) individual given by the particle swarm fruit fly hybrid algorithm, and X_j^d is the final individual update result.

According to the above update formula, the new individual X_j^d is the weighted sum of the PSO update result and the FOA-guided update term. The hybrid fruit fly particle swarm algorithm therefore considers the updating methods of both the PSO and the FOA, which to some extent reduces the loss of population diversity caused by a single updating mode, in which a large number of individuals in the population tend to become identical in late iterations. However, the hybrid algorithm is still prone to local minima. To overcome this problem and effectively address the tendency to fall into local optima, this study introduces a genetic evolution mechanism into the update process.

(1) Crossover Mechanism. To enrich the diversity of the population, the crossover mechanism of the genetic algorithm is introduced. This study uses single-point crossover, in which parents cross over pairwise to produce offspring; an elite strategy retains the dominant individuals, and tournament selection is applied among parents and offspring. The crossover in matrix encoding form is shown in Figure 5.

(2) Mutation Mechanism. The mutation operation enlarges the search space to prevent the algorithm from falling into local optima. Mutation is applied with a small probability, generally in the range 0.001–0.1; in this study, the mutation probability is set to 0.05. After mutation, the new fitness is calculated, and the current best fitness and best individual are recorded. The mutation in matrix encoding form is shown in Figure 6.
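As an illustration of these two operators, the sketch below applies single-point crossover and bounded random mutation to individuals encoded as flat parameter lists; the encoding and helper names are illustrative simplifications of the matrix encoding in Figures 5 and 6.

```python
import random

def crossover(parent_a, parent_b):
    """Single-point crossover: the two children swap tails of the parents."""
    point = random.randint(1, len(parent_a) - 1)
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate(individual, bounds, p_mut=0.05):
    """With probability p_mut per gene, resample the gene within its bounds."""
    return [random.uniform(*bounds[i]) if random.random() < p_mut else gene
            for i, gene in enumerate(individual)]

# Example on two (Eps, MinPts) individuals with hypothetical bounds.
a, b = crossover([0.4, 12.0], [0.9, 5.0])
print(mutate(a, bounds=[(0.01, 5.0), (2, 50)]))
```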

Figure 7 shows the algorithm flowchart of FOAPSO-GA.

4.2.3. FOAPSO-GA-Based Parameter Optimization Search

According to Figure 7 and the clustering principle of maximum intra-cluster similarity and minimum inter-cluster similarity, this study uses the cluster fitness judgment function CCF defined in [35] as the fitness function of FOAPSO-GA, where c represents the number of clusters, I_i refers to the i-th cluster, and sim(I_i, I_j) represents the similarity between clusters. WIE is the weighted information entropy,

$$\mathrm{WIE} = \sum_{i=1}^{c} w_i\, H(i),$$

where H(i) represents the information entropy of the i-th cluster and w_i is the weight assigned to that cluster.

FOAPSO-GA is used to find the optimal clustering parameters for each data partition separately, as sketched below. First, the value ranges of the parameters Eps and MinPts are set, and k individuals of the population are initialized within the solution range, the position of each individual being a value of Eps or MinPts. Second, the fitness values of the individuals are calculated and ranked, and the individual extrema and the global extremum are selected. Next, crossover and mutation operations are applied to the population individuals with certain probabilities according to the genetic evolution mechanism. Then, the individuals are updated according to the update formula of the particle swarm fruit fly hybrid algorithm. If a new solution is better than the old one, the position of the best individual is updated and saved. If no better solution is found and the maximum number of iterations is reached, the algorithm stops, and the position of the best individual gives the optimal parameters Eps and MinPts. The relationship between the density clustering algorithm and FOAPSO-GA in the implementation process is shown in Table 1.
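The following high-level sketch outlines this search procedure for (Eps, MinPts). The fitness argument stands in for the CCF value of a local clustering run with those parameters, the hybrid update condenses the PSO and FOA terms into one line, and elite retention plus tournament selection are simplified to a pairwise better-of-two rule, so this is an outline under assumptions rather than the paper's exact procedure; MinPts, being an integer, should be rounded before use.

```python
import random

def clip(v, lo, hi):
    return max(lo, min(hi, v))

def foapso_ga(fitness, bounds, k=20, n_iter=50, r_pso=0.5, radius=0.5, p_mut=0.05):
    dim = len(bounds)
    pop = [[random.uniform(*bounds[d]) for d in range(dim)] for _ in range(k)]
    best = max(pop, key=fitness)
    for _ in range(n_iter):
        new_pop = []
        for ind in pop:
            upd = []
            for d in range(dim):
                # hybrid update: weighted sum of a PSO-style pull toward the
                # best individual and an FOA-style random flight around it
                x_pso = ind[d] + random.random() * (best[d] - ind[d])
                x_foa = best[d] + radius * random.uniform(-1, 1)
                upd.append(clip(r_pso * x_pso + (1 - r_pso) * x_foa, *bounds[d]))
            new_pop.append(upd)
        # genetic mechanism: single-point crossover of shuffled pairs, then mutation
        random.shuffle(new_pop)
        for i in range(0, k - 1, 2):
            cut = random.randint(1, dim - 1)
            new_pop[i][cut:], new_pop[i + 1][cut:] = new_pop[i + 1][cut:], new_pop[i][cut:]
        for ind in new_pop:
            for d in range(dim):
                if random.random() < p_mut:
                    ind[d] = random.uniform(*bounds[d])
        # keep the better of each old/new pair and refresh the global best
        pop = [max(pair, key=fitness) for pair in zip(pop, new_pop)]
        best = max(pop + [best], key=fitness)
    return best

# Example with a toy fitness peaking at Eps = 0.5, MinPts = 10 (made-up target).
print(foapso_ga(lambda p: -(p[0] - 0.5) ** 2 - (p[1] - 10) ** 2,
                bounds=[(0.01, 5.0), (2, 50)]))
```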

After determining the Eps and MinPts threshold parameters, each RDD partition in the Spark computational model carries out parallel computation with the adaptively determined parameter settings to obtain parallelized clustering results. The local clustering parallelization process is shown in Figure 8.
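A sketch of how this stage could be wired up in PySpark is shown below. Here partitioned_rdd is assumed to come from the region-division stage, dbscan and foapso_ga refer to the earlier sketches, and noise_penalty_fitness is a crude stand-in for the CCF measure, used only to keep the example self-contained.

```python
import numpy as np

def noise_penalty_fitness(pts, params):
    # Stand-in fitness: reward (Eps, MinPts) pairs that leave few noise points.
    eps, min_pts = params[0], max(int(round(params[1])), 2)
    return -dbscan(np.array(pts), eps, min_pts).count(-1)

def cluster_partition(region_id, points_iter):
    """Run the parameter search plus DBSCAN locally on one partition."""
    pts = list(points_iter)
    eps, min_pts = foapso_ga(lambda ind: noise_penalty_fitness(pts, ind),
                             bounds=[(0.01, 5.0), (2, 50)])
    labels = dbscan(np.array(pts), eps, max(int(round(min_pts)), 2))
    for p, lab in zip(pts, labels):
        yield (region_id, lab, p)  # (partition, local cluster id, sample)

local_clusters = partitioned_rdd.mapPartitionsWithIndex(cluster_partition)
```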

4.3. Analysis of Customized Clustering Merging Strategies

In this study, a custom cluster merging strategy is designed to merge local clustering results into global clusters. The strategy judges the relationship between sample data sets in the irregularly bounded grid cells of each partition and expands the local merge accordingly. The specific merging rules are as follows:

(1) Let samples P1 and P2 lie on the boundary grids of two partitions C1 and C2, respectively. If P1 and P2 are both core points under the Eps of their corresponding partitions, C1 and C2 are merged into the same cluster when

$$dist(P_1, P_2) \le \min\{Eps_1, Eps_2\}.$$

(2) A noise point in an irregular boundary grid cell of one density region may be a boundary point of a cluster in another density region. Let there exist a noise point P1 in partition C1. If there exists a data sample P2 in the boundary grid cell of another partition C2 such that

$$dist(P_1, P_2) \le Eps_2,$$

then P1 in partition C1 is assigned to the cluster of sample point P2 in partition C2.

(3) For noise points of irregularly bounded grid cells in a density region where one side of the boundary has relatively few samples or relatively low density, the global merging condition is considered. If there exist two clusters C1 and C2 in the density region with noise point data sets Q1 and Q2, respectively, and there exists a sample P satisfying

$$dist(P, P_i) \le Eps, \quad P_i \in Q_1 \cup Q_2,$$

then P_i denotes a sample that is directly density-reachable from P under this neighborhood condition. The number of such samples must satisfy

$$\bigl|\{P_i \mid dist(P, P_i) \le Eps\}\bigr| \ge MinPts.$$

When this condition holds, the noise point P is defined as the core point of a new cluster, and the sample points P_i are marked as boundary points of the new cluster, as sketched in the code below.
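One natural way to realize these rules is to emit an edge for every pair of local clusters that a rule declares mergeable and collapse the edges with a union-find structure, as in the hypothetical sketch below; the edge-generation step over boundary grid cells is assumed to have already run.

```python
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# edges: pairs of (partition id, local cluster id) keys judged mergeable by
# rules (1)-(3); the values here are hypothetical examples.
edges = [(("C1", 0), ("C2", 3)), (("C2", 3), ("C3", 1))]
uf = UnionFind()
for a, b in edges:
    uf.union(a, b)

# Map every local cluster to its representative global cluster id.
global_id = {key: uf.find(key) for key in list(uf.parent)}
print(global_id)
```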

5. Result Analysis and Discussion

5.1. Experimental Environment and Experimental Data

In order to verify the effectiveness of the proposed parallel density clustering algorithm, experiments are conducted in this study. The experimental environment consists of one Master node and three Slave nodes, connected through a 300 Mbps network and identically configured: 1 TB hard disk, 16 GB memory, Intel Core i7-9750H CPU, and CentOS 6.5. The programming environment is Python 3.5.2 and Spark version 2.4.3.

The experimental data used for the proposed algorithm are three popular datasets from the UCI public database and two image datasets, selected according to dimension and size criteria. The UCI datasets are HTRU2, BitcoinHeist, and healthnews in Twitter. BitcoinHeist is a dataset of daily bitcoin transactions on the network, with 2,916,697 records, characterized by a large and homogeneous data volume. Healthnews in Twitter is a collection of health news from more than 15 major health news organizations, collected using the Twitter API; it contains 58,000 records and 25,000 attributes, featuring large data volume and high dimensionality. The details of each dataset are shown in Table 2.

The two image datasets are the COIL-20 image dataset and the Caltech101 image dataset. The COIL-20 dataset contains 20 objects, each captured in 72 poses over a full 360-degree rotation; each captured image is a 128 × 128 grayscale image covering the object and background. The Caltech101 dataset is composed of 101 object categories, with 40 to 800 color pictures per category at about 300 × 200 pixels. The test set in the experiment includes 2,386 images of 20 objects. The details of the image datasets are shown in Table 3.

5.2. Evaluation Indicators

In this study, two metrics, F-measure and speedup, are used to evaluate the performance of the proposed algorithm. The F-measure evaluates the clustering accuracy of the algorithm on a dataset. It is the harmonic mean of precision and recall, calculated as

$$F = \frac{2 \times \mathit{Precision} \times \mathit{Recall}}{\mathit{Precision} + \mathit{Recall}}.$$

The higher the value of F-measure, the more accurate and reasonable the clustering results are.

To verify the ability of the proposed algorithm to process large data sets in parallel, speedup is used to measure the performance of parallel computation. The speedup ratio is the performance improvement obtained by reducing running time through parallel computing, defined as

$$\mathit{Speedup} = \frac{T_s}{T_p},$$

where T_s denotes the running time of the algorithm on a single node and T_p denotes the running time of the parallel computation. A larger speedup indicates that the parallel computation takes less relative time and the parallelization of the algorithm is more efficient.
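For reference, both metrics reduce to a few lines of Python; the counts and timings in the usage lines are made-up example values.

```python
def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall from raw match counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def speedup(t_single, t_parallel):
    """Ratio of single-node running time to parallel running time."""
    return t_single / t_parallel

print(f_measure(tp=90, fp=10, fn=20))            # ~0.857
print(speedup(t_single=120.0, t_parallel=35.0))  # ~3.43
```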

5.3. Performance Evaluation of FOAPSO-GA

In order to verify the effectiveness of the proposed FOAPSO-GA, this study compares the hybrid fruit fly particle swarm algorithm based on the genetic evolution mechanism with the standard fruit fly optimization algorithm FOA [34], the improved fruit fly optimization algorithm IFOA [34], and the hybrid fruit fly particle swarm algorithm FOAPSO [36] on three classical test functions, in order to evaluate the global and local search ability of FOAPSO-GA. The test functions comprise a continuous unimodal function and nonlinear multimodal functions, which are used to evaluate the convergence accuracy and the optimization-seeking ability of the algorithms, respectively.

In the experiments, the proposed method and the comparison methods are each run 60 times on each of the three functions, and the algorithm performance is evaluated by the mean and variance obtained on each function. Table 4 shows the results of the compared methods on the test functions.

As shown in Table 4, the proposed FOAPSO-GA achieves the best results on all three functions, indicating that FOAPSO-GA is superior to the other methods in convergence accuracy and stability. The reason is that combining the fruit fly optimization algorithm with the particle swarm algorithm improves population diversity and effectively suppresses premature convergence of individuals. In addition, by introducing the genetic evolution mechanism, the hybrid algorithm overcomes the tendency to fall into local optima and improves the global optimization-seeking ability.

5.4. Analysis of Clustering Results
5.4.1. Comparative Analysis of the Accuracy of the Algorithms

In order to analyze the clustering accuracy of the proposed algorithm, this study conducts experiments based on three datasets, HTRU2, BitcoinHeist, and healthnews in Twitter. Based on the F-measure of the experimental results, the accuracy is compared and analyzed with that of DBSCAN-PSM [25], SP-DBSCAN [25], S-DBSCAN [37], and DBSCAN-KD [38], respectively. The experimental results are shown in Figure 9.

As can be seen in Figure 9, the proposed algorithm achieves good accuracy on large datasets. On the healthnews in Twitter dataset, the accuracy of the proposed algorithm is on average 3.1%, 6.7%, 6.5%, and 5.2% higher than that of DBSCAN-PSM, SP-DBSCAN, S-DBSCAN, and DBSCAN-KD, respectively. On the HTRU2 dataset, it is on average 1.7%, 2.9%, 2.8%, and 3.2% higher, and on the BitcoinHeist dataset on average 2.4%, 4.0%, 3.7%, and 3.1% higher than the same four algorithms. The comparison shows that the accuracy improvement of the proposed algorithm is not especially significant on the lower-dimensional datasets HTRU2 and BitcoinHeist, whereas the improvement is more pronounced on the high-dimensional dataset healthnews in Twitter. This is mainly because the proposed algorithm divides the data with the irregular dynamic density region partitioning strategy, which effectively balances the load of each computational node and makes clustering more stable. At the same time, the FOAPSO-GA algorithm is applied to the parameter search problem of local clustering, improving the accuracy of the clustering results. The experimental results therefore show that the proposed algorithm is more accurate than the comparison algorithms in most cases on large data sets, with a clearer advantage on higher-dimensional data.

Figure 10 shows the comparison of clustering accuracy of different algorithms in image datasets. It can be seen from the figure that the proposed algorithm can effectively deal with the big data classification problem in image fusion. Compared with other comparison algorithms, the proposed algorithm has the best accuracy and can obtain the key feature information of data objects for classification.

5.4.2. Comparative Analysis of Algorithm Speedup Ratios

To evaluate the parallelization performance of the proposed algorithm, this study conducts experiments on three datasets: HTRU2, BitcoinHeist, and healthnews in Twitter. The speedup ratios obtained are compared with those of DBSCAN-PSM, SP-DBSCAN, S-DBSCAN, and DBSCAN-KD. The experimental results are shown in Figure 11.

It can be seen from Figure 11 that the proposed algorithm has good parallel performance when processing large data sets. On the HTRU2 dataset, the speedup ratios of all algorithms improve as the number of nodes increases. On the BitcoinHeist dataset, which has a large data volume, the speedup ratios of the proposed algorithm, DBSCAN-PSM, SP-DBSCAN, S-DBSCAN, and DBSCAN-KD are 3.8, 3.3, 2.5, 2.2, and 2.9, respectively. The speedup of the proposed algorithm grows nearly linearly with the number of nodes and, with four nodes, exceeds DBSCAN-PSM, SP-DBSCAN, S-DBSCAN, and DBSCAN-KD by 0.6, 1.8, 1.7, and 0.9, respectively. As shown in Figure 11(c), with four nodes the speedup of the proposed algorithm exceeds the same four algorithms by 0.7, 1.6, 1.8, and 0.8, respectively. The analysis shows that the speedup ratios of SP-DBSCAN and S-DBSCAN rise slowly as the number of nodes increases and their performance becomes bottlenecked, because these algorithms are difficult to scale to large data sets. In contrast, the speedup ratios of the algorithm in this study, DBSCAN-PSM, and DBSCAN-KD improve more significantly; DBSCAN-PSM and DBSCAN-KD introduce KD-tree partitioning to improve speedup during parallel processing. The algorithm in this study uses the irregular dynamic density region partitioning strategy to process large data sets efficiently, and its shorter running time allows the speedup to keep improving as the number of nodes grows. It therefore achieves a higher speedup than DBSCAN-PSM and DBSCAN-KD, and the experimental data reflect better parallel performance and efficiency at larger node counts.

6. Conclusion

A parallel density clustering algorithm based on an improved fruit fly optimization algorithm and Spark memory iteration is proposed to address the shortcomings of the DBSCAN algorithm, such as slow serial computation, poor parameter search ability, and uneven data density reducing accuracy. The proposed algorithm improves the accuracy and parallel efficiency of the DBSCAN algorithm through irregular dynamic region partitioning of data, the FOAPSO-GA algorithm to achieve dynamic optimization of Eps and MinPts parameters, and a customized clustering merging strategy.

Although the proposed algorithm has made progress in clustering effect, there is still room for improvement: the parameter initialization problem of the improved fruit fly optimization algorithm has not yet been solved, and the parallel performance of the algorithm also needs to be enhanced. Subsequent work will focus on testing and optimization on additional datasets, including optimization of the partition strategy selection, of the Spark parallel fusion design, and of the Spark cache strategy and memory management, to further improve the performance of the parallel density clustering algorithm under the Spark model.

Data Availability

The labeled data set used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest

The author declares that there are no conflicts of interest.

Acknowledgments

This work was supported by the Jiaozuo Normal College.