Computational Intelligence and Neuroscience

Volume 2017 (2017), Article ID 2519782, 12 pages

https://doi.org/10.1155/2017/2519782

## On the Accuracy and Parallelism of GPGPU-Powered Incremental Clustering Algorithms

^{1}School of Computer Engineering, Weifang University, Weifang, Shandong 261061, China^{2}School of Electromechanical Engineering, Guangdong University of Technology, Guangzhou 510006, China^{3}School of Automation, Northwestern Polytechnical University, Xi’an, China^{4}School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA

Correspondence should be addressed to Chunlei Chen

Received 29 January 2017; Revised 17 July 2017; Accepted 31 July 2017; Published 11 October 2017

Academic Editor: Amparo Alonso-Betanzos

Copyright © 2017 Chunlei Chen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Incremental clustering algorithms play a vital role in various applications such as massive data analysis and real-time data processing. Typical application scenarios of incremental clustering raise high demand on computing power of the hardware platform. Parallel computing is a common solution to meet this demand. Moreover, General Purpose Graphic Processing Unit (GPGPU) is a promising parallel computing device. Nevertheless, the incremental clustering algorithm is facing a dilemma between clustering accuracy and parallelism when they are powered by GPGPU. We formally analyzed the cause of this dilemma. First, we formalized concepts relevant to incremental clustering like evolving granularity. Second, we formally proved two theorems. The first theorem proves the relation between clustering accuracy and evolving granularity. Additionally, this theorem analyzes the upper and lower bounds of different-to-same mis-affiliation. Fewer occurrences of such mis-affiliation mean higher accuracy. The second theorem reveals the relation between parallelism and evolving granularity. Smaller work-depth means superior parallelism. Through the proofs, we conclude that accuracy of an incremental clustering algorithm is negatively related to evolving granularity while parallelism is positively related to the granularity. Thus the contradictory relations cause the dilemma. Finally, we validated the relations through a demo algorithm. Experiment results verified theoretical conclusions.

#### 1. Introduction

##### 1.1. Background

Due to the exciting advancements in digital sensors, advanced computing, communication, and massive storage, tremendous amounts of data are being produced constantly in the modern world. The continuously growing data definitely imply great business value. However, data are useless by themselves; analytical solutions are demanded to pull meaningful insight from the data, such that effective decisions can be achieved. Clustering is an indispensable and fundamental data analysis method. The traditional clustering algorithm is executed in a batch-mode; namely, all data points are necessarily loaded into memory of the host machine. In addition, every data point can be accessed unlimited times during the algorithm execution. Nevertheless, the batch-mode clustering algorithm can not adjust the clustering result in an evolving manner. For instance, it is necessary to incrementally cluster the evolving temporal data such that the underlying structure can be detected [1]. In stream data mining, a preprocessing task like data reduction needs the support of incremental clustering [2, 3]. In addition, incremental clustering can significantly contribute to massive data searching [4]. To sum up, the evolving capability of incremental clustering is indispensable under certain scenarios, such as memory-limited applications, time-limited applications, and redundancy detection. A practical application may possess arbitrary combination of the three aforementioned characteristics. The incremental clustering algorithm proceeds in an evolving manner, namely, processing the input data step by step. In each step, the algorithm receives a newly arrived subset of the input and obtains the new knowledge of this subset. Afterwards, historic clusters of the previous step are updated with the new knowledge. Subsequently, the updated clusters serve as input of the next step. With regard to the first step, there is no updating operation. The new knowledge obtained in the first step serves as input of the second step. Application scenarios of incremental clustering generally raise high requirements on computing capacity of the hardware platform. General Purpose Graphic Processing Unit (GPGPU) is a promising parallel computing device. GPGPU has vast development prospects due to following superiorities: much rapider growing computing power than CPU, high efficiency-cost ratio, and usability.

##### 1.2. Motivation and Related Works

Our previous work revealed that the existing incremental clustering algorithms are confronted with an accuracy-parallelism dilemma [5, 6]. In this predicament, the governing factor is the evolving granularity of the incremental clustering algorithm. For instance, the point-wise algorithms proceed in fine granularity. In each step, the algorithms only receive a single data point (serving as the new knowledge); this point will either be assigned to an existing historic cluster or induce an independent cluster in the historic clusters. Such algorithms generally achieve favorable clustering accuracy. However, they sacrifice parallelism due to strong data dependency. Namely, the next data point cannot be processed until the current one is completely processed. Modern GPGPUs can contain thousands of processing cores, while the number of existing historic clusters needs to increase progressively even if the number eventually reach the magnitude of thousand. GPGPU can fully leverage its computing power only if it runs abundant threads in a SIMD (Single Instruction Multiple Data) manner (commonly twice the number of processing cores or even more). In addition, the work-depth is inevitably no less than the amount of input data points under point-wise setting. Consequently, the computing power is ineluctably underutilized. Moreover, more kernel launches (GPGPU code execution) are required if work-depth is larger. Time overhead of kernel launch is generally high. Some representative point-wise incremental clustering algorithms were elaborated in [7–19]. Ganti et al. used block-evolving pattern to detect changes in stream data [20]. Song et al. adopted the block-wise pattern and proposed an incremental clustering algorithm of GMM (Gaussian Mixture Model) [21]. The algorithm of [21] proceeds in coarse granularity. Each step contains three substeps. First, obtain the new knowledge by running the standard EM (Expectation Maximum) algorithm on a newly received data block. Second, identify the statistically equivalent cluster pairs between the historic clusters and the new clusters. Finally, merge the equivalent cluster pairs separately. The standard EM algorithm of GMM is inherently GPGPU-friendly [22]. The algorithm of [21] maintains the inherent parallelism. However, its clustering accuracy exhibits degradation by order of magnitude, compared to its batch-mode counterpart (the standard EM algorithm of GMM) [5]. Moreover, we qualitatively analyzed the reason why block-wise pattern tends to induce accuracy degradation in our previous work [6]. Algorithms of [23, 24] are also block-wise. D-Stream is ostensibly point-wise [25]. Nevertheless, D-Stream is essentially block-wise due to the fact that mapping data points into grids can be parallelized in a SIMD manner. As far as we know, most existing works focus on clustering accuracy. However, existing algorithms, even the block-wise ones, do not explicitly consider algorithm parallelism on SIMD many-core processing devices like GPGPU. Some recent works formally analyzed issues of clustering or machine learning algorithms. Ackerman and Dasgupta pointed out a limitation that incremental clustering cannot detect certain types of cluster structure [26]. They formally analyzed the cause of this limitation and proposed conquering this limitation by allowing extra clusters. Our work is similar to that of [26] in the sense that we also formally analyzed the reason why incremental clustering is inefficient under certain conditions. In contrast, we elaborated our work under the context of GPGPU-acceleration. Ackerman and Moore also formally analyzed perturbation robustness of batch-mode clustering algorithm [27, 28]. Nevertheless, works of [27, 28] only concentrated on classical batch-mode clustering algorithms. Gepperth and Hammer qualitatively analyzed challenges that incremental learning is facing [29]. They pointed out a dilemma between stability and plasticity. However, we focus on the dilemma between accuracy and parallelism.

##### 1.3. Main Contribution

In this paper, we extend our previous works of [6, 30] in the following ways. First, some vital concepts (such as incremental clustering and evolving granularity) are formally defined. Second, we formally proved how evolving granularity exerts influence on accuracy and parallelism of incremental clustering algorithms. In this way, the starting point of our previous works can be formally validated. Finally, we demonstrated the theoretical conclusions through a demo algorithm. The conclusions will be the footstone of our future work.

#### 2. Formal Definition of Terminologies

##### 2.1. Terminologies on Incremental Clustering

*Definition 1 (incremental clustering). * is a series of data points (, ). The data points are partitioned into sets: . This partition satisfied the following conditions: (1).(2)If , , then .A data analysis task adopts discrete time system and time stamps are labeled as .* This task is incremental clustering if and only if the following applies:*(1)When , the task receives . In time interval , the task partitions into clusters and is the set of these clusters. The entire input to the task is .(2)When , the task receives . In time interval , the task resolves the set of clusters such that can find its affiliated cluster in . Entire inputs to the task are and .

*Definition 2 (the th step and historic clustering result of the th step). *An algorithm for incremental clustering is an incremental clustering algorithm. The time interval is the th step of an incremental clustering algorithm (or step of an incremental clustering algorithm). is the historic clustering result of the th step.

*Definition 3 (micro-cluster). *Let* batchAlgorithm* represent a batch-mode clustering algorithm. is the parameter of* batchAlgorithm*.* batchAlgorithm* can partition into subsets . is a constant vector. is a -micro-cluster produced by if and only if the following applies:(1).(2), .(3) forms a data cloud in -dimensional space. The hypervolume of this data cloud is positively related to . contains and only contains one data point if .(4) are all preset constant values.

*Definition 4 (batch-mode part and incremental part of step ). *Some incremental clustering algorithms divide step () into two parts [21, 23–25]: In the first part, is partitioned into* new clusters (or new micro-clusters)* pursuant to certain similarity metrics. is the set of these clusters (or micro-clusters); in the second part, is resolved based on and (if any). The number of clusters (or micro-clusters) in is denoted as . The first part can be accomplished by a batch-mode clustering algorithm; this part is* the batch-mode part of step *. The second part is* the incremental part of step *.

*Definition 5 (benchmark algorithm, benchmark clustering result, and benchmark cluster). *Denote (pursuant to Definition 1) as . Incremental clustering is applied to , and is the historic clustering result of the th step (); let . If could be entirely loaded into memory of the host machine and were processed by a certain batch-mode clustering algorithm, then the resulting clusters were in a set of clusters denoted as . The batch-mode algorithm is called* benchmark algorithm of the incremental clustering algorithm. * is the benchmark clustering result up to step . An arbitrary cluster of is a* benchmark cluster*.

*Definition 6 (local benchmark cluster). * is the benchmark clustering result up to the th step; represent the benchmark clusters in . is the newly received data set of step . All data points of are labeled such that points with the same label are affiliated to the same benchmark cluster. Partition into nonempty subsets, noted as . These subsets satisfy the following conditions: (1) and , .(2)If then and possess the same label.(3); if , , then and have different labels. are called the* local benchmark clusters of step * or* local benchmark clusters* for short. We abbreviate local benchmark cluster to LBC.

Definitions 1–4 actually provide terminologies to formally interpret the concept of incremental clustering as well as execution mechanism of incremental clustering. Definitions 5 and 6 furnish a benchmark to evaluate the accuracy of an incremental clustering algorithm.

##### 2.2. Terminologies on Evolving Granularity, Clustering Accuracy, and Parallelism

*Definition 7 (containing hypersurface of a data set). *Let represent a set of data points. HS is a hypersurface in the -dimensional space. HS is the* containing hypersurface* of , if and only if HS is a close hypersurface and an arbitrary point of is within the interior of HS.

*Definition 8 (envelope hypersurface, envelop body, and envelop hypervolume of a data set). *Let represent a set of data points. is the set of containing hypersurfaces of . Let represent the hypervolume encapsulated by . Let be a hypersurface. is the envelope hypersurface of , if and only if . Let represent the* envelope hypersurface* of ; the region encapsulated by is the envelope body of ; the hypervolume of this envelope body is the* envelope hypervolume* of .

*Definition 9 (core hypersphere, margin hypersphere, core hypervolume, and margin hypervolume of a data set). *Let be a data set. is the envelope hypersurface of ; represents the geometric center of the envelope body of . represents the distance between and an arbitrary point on . ; .

A hypersphere is the* core hypersphere* of if and only if this hypershpere is centered at and its radius is , noted as . The hypervolume encapsulated by is the* core hypervolume* of , noted as .

A hypersphere is the margin hypersphere of if and only if it is centered at and its radius is , noted as . The hypervolume encapsulated by is the margin hypervolume of , noted as .

*Definition 10 (core evolving granularity, margin evolving granularity, and average evolving granularity). *In the th step, the incremental clustering algorithm receives data set . is partitioned into nonempty subsets pursuant to certain metrics: such that , , and . Let be the core hypervolume of . Then* in the **th step, the core evolving granularity* of the algorithm is .*Up to the ** th step, the core evolving granularity* of the algorithm is .

Let be the margin hypervolume of . Then* in the **th step, the margin evolving granularity* of the algorithm is .*Up to the ** th step, the margin evolving granularity* of the algorithm is . Let be the envelope hypersurface of , and ; is the number of data points within ; represents the geometric center of the envelope body of . represents the distance between and . Let .

is the hypervolume of the hypersphere whose center is at and radius is .* In the **th step, the average evolving granularity* of the algorithm is .*Up to the ** th step, the average evolving granularity* of the algorithm is .

*Definition 11 (different-to-same mis-affiliation). *Different-to-same mis-affiliation is the phenomenon that, in step , data points from different benchmark clusters are affiliated to the same cluster of or .

*Definition 12 (same-to-different mis-affiliation). *Same-to-different mis-affiliation is the phenomenon that, in step , data points from the same benchmark cluster are affiliated to the different clusters of or .

We adopt Rand Index [31] to measure clustering accuracy. Larger Rand Index means higher clustering accuracy.

There are numerous criterions of cluster separation measurement in the existing literatures. We select Rand Index due to the fact that this criterion directly reflects our intent: measuring the clustering accuracy by counting occurrences of data point mis-affiliations.

*Definition 13 (serial shrink rate (SSR)). *Let incAlgorithm represent an incremental clustering algorithm. In step , the batch-mode part generates micro-clusters. Suppose incAlgorithm totally clustered data points up to step . Up to step , the serial shrink rate of incAlgorithm is Lower SSR means that less computation inevitably runs in a non-GPGPU-friendly manner. Consequently, smaller SSR means improved parallelism. Work-depth [32] of the algorithm can shrink if SSR is smaller. Hence, more computation can be parallelized.

#### 3. Theorems of Evolving Granularity

##### 3.1. Further Explanation on the Motivation

GPGPU-accelerated incremental clustering algorithms are facing a dilemma between clustering accuracy and parallelism. We endeavor to explain the cause of this dilemma through formal proofs. The purpose of our explanation is that the formal proofs reveal a possible solution to seek balance between accuracy and parallelism. This basic idea of this solution is discussed as follows.

In the batch-mode part of the th step, data points from different local benchmark clusters may be mis-affiliated to the same micro-cluster of . Theorem 14 points out that the upper and lower bounds of the mis-affiliation probability are negatively related to margin evolving granularity and core evolving granularity, respectively. The proof of this theorem demonstrates that larger evolving granularity results in more occurrences of different-to-same mis-affiliation.

The batch-mode part should evolve in fine granularity to produce as many homogeneous micro-clusters as possible. Only in this context, the operations of incremental part are sensible. Namely, the incremental part cannot eliminate different-to-same mis-affiliation induced by the batch-mode part. The incremental part absorbs advantages of point-wise algorithms by processing micro-clusters sequentially. This part should endeavor to avoid both same-to-different and different-to-same mis-affiliations on the micro-cluster-level.

Nevertheless, Theorem 15 proves that parallelism is positively related to evolving granularity. Thus, the contrary relations cause the dilemma.

However, we can adopt GPGPU-friendly batch-mode clustering algorithm in the batch-mode part. Moreover, the total number of micro-clusters is much smaller than that of data points up to a certain step. Consequently, the work-depth can be dramatically smaller than that of a point-wise incremental clustering algorithm.

##### 3.2. Theorem of Different-to-Same Mis-Affiliation

Theorem 14. *Let represent the probability of different-to-same mis-affiliations induced by the batch-mode part of the th step. and are the core evolving granularity and margin evolving granularity up to the th step, respectively. The upper bound of is negatively related to and the lower bound of is negatively related to .**Suppose contains local benchmark clusters (LBC). is a set containing these LBCs. Between any two adjacent LBCs there exists a boundary curve segment. The boundary curve segment between LBC and LBC ) is noted as . Obviously, and represent the same curve segment. We define a probability function to represent :whereand is the hypervolume enclosed by envelope surface of .** reaches the upper bound if equals the radius corresponding to the margin evolving granularity (Definition 10) in step ; reaches the lower bound if equals the radius corresponding to the core evolving granularity (Definition 10) in step .*

*Proof. **In order to more intuitively interpret this theorem, we discuss the upper and lower bounds in two-dimensional space. The following proof can be generalized to higher-dimensional space*.*In two-dimensional space, the envelop hypersurface (Definition 8) degenerates to an envelope curve. The core hypersphere and margin hypersphere (Definition 9) degenerate to a core circle and a margin circle, respectively. The envelop body degenerates to the region enclosed by the envelope curve. The envelope hypervolume degenerates to the area enclosed by the envelope curve*. Let incAlgorithm represent an incremental clustering algorithm. Each step of incAlgorithm includes batch-mode part and incremental part.*(**1) Partition Data Points Pursuant to Local Benchmark Clusters*. incAlgorithm receives in the th step. Partition into local benchmark clusters (Definition 6, LBC for short): . Let . is the envelope curve of . represents the area enclosed by . Assume that are convex sets. (We can partition into a set of convex sets if it is not a convex set.)*(**2) Partition the Boundary Curve between Two Local Benchmark Clusters into Convex Curve Segments.* Let be the set of boundary curves between any two adjacent LBCs. where is the boundary curve segment between two adjacent LBCs. Consider an arbitrary LBC . Suppose that there are totally LBCs adjacent to . The boundary curves segments are . These boundary curve segments can be consecutively connected to form a closed curve such that only data points of are within the enclosed region. Further partition into a set of curve segments such that are all convex curve segments. can be viewed as a set of data points.*(3) Construct Auxiliary Curves.* Figure 1(a) illustrates an example of adjacent LBCs and a boundary curve. The black, gray, and white small circles represent three distinct LBCs (For clarity of the figure, we use small circles to represent data points). The black bold curve is the boundary curve between the black and white LBCs. We cut out a convex curve segment from this boundary curve, noted as . Figure 1(b) magnifies . Assume that the analysis formula of is .

Let be a circle centered at point . The radius of is* threshold *. Place to the right of . Roll along such that is always tangent to . Let be a circle centered at point The radius of is also* threshold*. Place to the left of . Roll along such that all points of are always to the left of , except the points of tangency between and . The trajectories of points , form two curves, and , respectively. Adjust the starting and ending points of and such that the definition domains of both curves are .*(4) Characteristics of New Clusters.* In step , is partitioned into new clusters (or new micro-clusters) (Definition 3) pursuant to certain methods. Let GR be a set containing data points of an arbitrary new cluster. Without loss of generality, let envelope curve of GR be a circle centered at , noted as ENGR. The radius of this circle is noted as* radius*. We can view as a random variable. This random variable represents the possible coordinates of GR’s center. Let represent a set of all vectors enclosed by envelope curve of in -dimensional () space (including vectors on the envelope curve). Let . The statistical characteristic of is unknown before is processed. Consequently, it is reasonable to assume that obeys the uniform distribution on . Namely, we assume that every point within is possible to be GR’s center and the probabilities of every point are equal.*(5) Criterion of Different-to-Same Mis-Affiliation.* Assume that we can neglect distance between the boundary curve and the right side of the black LBC’s envelope curve in Figure 1. Similarly, assume that we can neglect distance between the boundary curve and the left side of the white LBC’s envelope curve. The criterion of different-to-same mis-affiliation is as follows.

If , then GR contains data points of at least two distinct local benchmark clusters.

Smaller distance between and means higher probability of different-to-same mis-affiliation induced by the batch-mode part; the larger the radius is, the higher the probability is.

In Figure 1(b), two lines and two curves form an open domain, noted as . Let Set represent the set of threshold values that make the following causal relationship hold:

and .

Part (3) of this proof explained the meaning of* threshold *. GR’s* threshold with regard to * is . . Different-to-same mis-affiliation can still occur as long as* radius* is sufficiently large even if .*(6) Three Typical Situations of Different-to-Same Mis-Affiliation Induced by Batch-Mode Part.* The value of is dramatically influenced by the following factors: first: shape of LBC’s envelop curve and second: the relative positions of and GR. Shape of LBC’s envelope curve is generally irregular. We simplify the proof without loss of generality. As illustrated by Figure 2, assume that three sectors are consecutively connected to form the envelope curve. In addition, three sectors’ centers overlapped on point . Radiuses of sectors 1, 2, and 3 are , , and , respectively. equals to the radius of the margin envelope circle. equals to the radius of the core envelope circle. is between and . Generally, distances between point and points on the envelope curve are between and . Sector 2 can represent these ordinary points.

Let GR rotate around . Figures 2(a), 2(b), and 2(c) illustrate three characteristic positions of GR during the rotation. In Figure 2(a), threshold is . covers the largest area. In Figure 2(c), threshold is . covers the smallest area.*(7) Probability of Different-to-Same Mis-Affiliation Induced by Batch-Mode Part: Lower Bound.* As aforemetioned, a boundary curve between two LBCs can be partitioned into curve segments. Data points on both sides of a curve segment are affiliated to the same cluster if different-to-same mis-affiliation occurs (this mis-affiliation occurs in the batch-mode part of a certain step). Let represent a curve segment from a boundary curve. Let be the probability that data points on both sides of are affiliated to the same cluster. Considering all boundary curves in , represents the total probability of different-to-same mis-affiliation in the batch-mode part of step .

Let be the radius of the hypersphere corresponding to core evolving granularity up to step . Assume that auxiliary curves and are constructed under . On the basis of the previous parts of proof, we can draw the following inequalities:*(8) Probability of Different-to-Same Mis-Affiliation Induced by Batch-Mode Part: Upper Bound.* Let be the radius of the hypersphere corresponding to margin evolving granularity. Assume that auxiliary curves and are constructed under . On the basis of the previous parts of proof, we can draw the following inequalities: