Abstract
Data streams are continuously generated over time by Internet of Things (IoT) devices. The faster this data is analyzed, its hidden trends and patterns discovered, and new strategies created, the faster action can be taken, creating greater value for organizations. Density-based methods are a prominent class of data stream clustering techniques. They can detect clusters of arbitrary shape, handle outliers, and do not need the number of clusters in advance. Therefore, a density-based clustering algorithm is a suitable choice for clustering IoT streams. Recently, several density-based algorithms have been proposed for clustering data streams. However, density-based clustering in limited time is still a challenging issue. In this paper, we propose a density-based clustering algorithm for IoT streams. The method has a fast processing time, which makes it applicable to real-time applications of IoT devices. Experimental results show that the proposed approach obtains high-quality results with low computation time on real and synthetic datasets.
1. Introduction
Using RFID and conventional sensors as the basis of the data collection mechanisms in the Internet of Things (IoT) makes the volume of collected data extremely large. In many cases, communication and data transfer between the objects are required to enable smart analytics. Such communication and transfer consume both bandwidth and energy, which are usually limited resources in real scenarios. Furthermore, the analytics required for such applications is often real-time, and therefore it requires the design of methods which can provide real-time insights [1–3]. Data mining techniques are very useful for this kind of analytics. However, since the generated data is treated as a stream, we modify the multilayer data mining model for the Internet of Things (IoT) from [4] into a multilayer data stream mining model for IoT. The model is illustrated in Figure 1.
Data stream mining is a relatively new area of research in the data mining community. It has become prominent in many applications such as monitoring environmental sensors, social network analysis, real-time detection of anomalies in computer network traffic, and web searches [5, 6].
Clustering is an important task in data stream mining [6]. However, data stream clustering must meet several requirements that follow from the characteristics of data streams: clustering in limited memory and time, with a single pass over the evolving data stream, while also handling noisy data [7–9].
There are different methods for clustering data streams. In clustering methods, data are categorized based on the similarities among objects. The similarity is determined based on distance or density [5]. The distance-based method [10] forms only spherical clusters. In contrast, the density-based method [11] can detect clusters of any shape and is useful for identifying noise.
In the last few years, many proposals to extend density-based clustering to data streams have been presented [12]. Density-based data stream clustering algorithms are mainly grouped into density grid-based methods and density microclustering methods.
Density grid-based clustering [13] quantizes the data space into a number of density grids that form a grid structure on which all of the clustering operations are performed. The main advantage of this approach is its fast processing time, which is independent of the number of data points and depends only on the number of cells. However, this speed may come at the cost of lower cluster quality and accuracy [5]. Examples of density grid-based clustering algorithms are DStream [14], MRStream [9], and ExCC [15].
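The quantization step can be illustrated with a minimal sketch. It assumes a bounded data space and the same number of equal-width partitions in every dimension; the function name `grid_cell` and its parameters are hypothetical and are not taken from any of the cited algorithms.

```python
from collections import Counter
from typing import Sequence, Tuple


def grid_cell(point: Sequence[float], lows: Sequence[float],
              highs: Sequence[float], partitions: int) -> Tuple[int, ...]:
    """Map a point to the index tuple of the grid cell that contains it."""
    cell = []
    for x, lo, hi in zip(point, lows, highs):
        width = (hi - lo) / partitions
        idx = int((x - lo) / width)
        # Clamp so that x == hi falls into the last cell instead of past it.
        cell.append(min(max(idx, 0), partitions - 1))
    return tuple(cell)


# Per-cell counts: the work per point is constant, and later clustering
# operates on the (few) occupied cells rather than on the raw points.
points = [(0.1, 0.1), (0.15, 0.12), (0.9, 0.95), (1.0, 1.0)]
counts = Counter(grid_cell(p, (0.0, 0.0), (1.0, 1.0), 4) for p in points)
```

Because only occupied cells are stored, the cost of the later clustering step depends on the number of cells rather than on the number of points, which is the property the paragraph above describes.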
On the other hand, in density-based microclustering [16], microclusters keep summary information about the data, and clustering is performed on this synopsis information. A microcluster [10] is a temporal extension of the cluster feature (CF), that is, a summarization triple maintained about a cluster. Density-based microclustering methods keep a summary of clusters in microclusters and form the final clusters from them. They have better quality than grid-based methods but need more computation time. Some of the density microclustering algorithms include DenStream [14], FlockStream [17], and SOStream [18].
To mitigate the problem of density microclustering methods, we propose a hybrid density-based method for clustering evolving data streams. Our proposed method uses the advantages of both density grid-based and microclustering methods. We refer to our algorithm as HDCStream (hybrid density-based clustering for data stream). HDCStream has three steps. In the first step, the new data point is either mapped to the grid or merged into an existing minicluster. A minicluster is a concept similar to a microcluster, but it is formed from a grid cell. The second step prunes miniclusters and grids at each pruning time. The last step forms the final clusters from the pruned miniclusters using a modified DBSCAN algorithm.
The main contributions of HDCStream are summarized as follows.
(1) Instead of searching the list of outlier microclusters to find a suitable one, HDCStream maps the new data point into a grid cell, which saves computation time: the search over the outlier microclusters is replaced by a single mapping operation.
(2) Instead of forming a new microcluster for a new data point that does not fit into any existing microcluster and may be the seed of an outlier, HDCStream maps the new data point to the grid and keeps it there until the grid density reaches a predefined threshold. At that point, the grid cell is converted to a minicluster.
(3) The experimental results show that HDCStream outperforms two well-known existing density microclustering and density grid-based clustering methods in terms of quality and execution time. Furthermore, they show that HDCStream obtains high-quality clusters even when noise is present.
The remainder of this paper is organized as follows: Section 2 surveys related work. Section 3 introduces basic definitions. In Section 4, we explain in detail the HDCStream algorithm. We analyze the HDCStream algorithm using synthetic and real datasets in Section 5. Section 6 discusses the advantages of the proposed method. We conclude the paper in Section 7.
2. Related Work
Clustering is an important task in data stream mining. Recently, plenty of clustering algorithms have been developed for data streams. These algorithms can generally be grouped into the following four main categories [5].
A partitioning-based clustering algorithm tries to find the best partitioning of the data points, in which intraclass similarity is maximal and interclass similarity is minimal. Two well-known extensions of k-means [19, 20] to data streams are STREAM [7] and CluStream [10]. Hierarchical clustering algorithms work by decomposing data objects into a tree of clusters. BIRCH [10] and ClusTree [8] are examples of the hierarchical clustering family. Grid-based clustering is independent of the distribution of data objects. It partitions the data space into a number of cells, which form the grid. Grid-based clustering has a fast processing time since it does not depend on the number of data objects. DStream [14], MRStream [9], and ExCC [15] are grid-based clustering algorithms for data streams.
Density-based clustering algorithms have been developed to discover clusters with arbitrary shapes. They find clusters based on the dense areas of the data space. If two points are close enough and the region around them is dense, then these two data points join and contribute to the construction of a cluster. DBSCAN [21], OPTICS [22], and DENCLUE [23] are examples of this approach.
Due to the characteristics of data streams, traditional density-based clustering is not directly applicable. Recently, many density-based clustering algorithms have been extended to data streams. The main idea in these algorithms is to use a density-based method in the clustering process while overcoming the constraints imposed by the nature of data streams. Density-based clustering algorithms for data streams are categorized into two broad groups, called density microclustering and density grid-based clustering algorithms. A comprehensive survey of density-based clustering algorithms for data streams is presented in [12].
DenStream [24] is a density microclustering algorithm for evolving data streams. The algorithm extends the microcluster [10] concept and introduces outlier and potential microclusters to distinguish between outliers and real data. It has online and offline phases. In the online phase, the microclusters are formed, and the offline phase performs macroclustering on the microclusters. FlockStream [17] is an extension of DenStream using a bio-inspired model. It is based on the flocking model [25], in which agents are microclusters that work independently but form clusters together. It considers an agent for each data point, which is mapped into the virtual space. Agents move within their predefined visibility range for a fixed time. If they meet another agent, they join to form a cluster if they are similar to each other. FlockStream merges the online and offline phases, since the agents form the clusters at any time. In FlockStream, searching for similar agents is a time-consuming process. SOStream (self-organizing density-based clustering over data stream) [18] detects structures within fast evolving data streams by automatically adapting the threshold for density-based clustering. SOStream dynamically creates, merges, and removes clusters in an online manner. It uses competitive learning as introduced for SOMs (self-organizing maps) [26], which is a time-consuming method for clustering data streams. Density microclustering methods are effective in terms of quality, and they can capture the evolution of clusters effectively. However, they have a high computation time for finding suitable microclusters.
The other important category is the density grid-based method. DStream [27] is a density grid-based clustering algorithm in which the data points are mapped to the corresponding grids and the grids are clustered based on their density. It adjusts the clusters in real time, captures the evolving behavior of data streams, and has techniques for handling outliers. MRStream [9] is another clustering algorithm, which can cluster data streams at multiple resolutions. The algorithm partitions the data space into cells and uses a tree-like data structure to keep the space partitioning. The tree data structure keeps the data clustering at different resolutions. Each node has summary information about its parent and children. The algorithm improves the performance of clustering by determining the right time to generate the clusters. DStream and MRStream cannot work properly for high-dimensional data streams [12]. ExCC (exclusive and complete clustering) [15] is a density grid-based clustering algorithm for heterogeneous data streams. The algorithm maps the numerical attributes to the grid, and the categorical attributes are assigned granularities according to the distinct values in their respective domain sets. ExCC introduces fast and slow streams based on the average arrival time of the data points in the data stream. The algorithm detects noise in the offline phase using a wait-and-watch policy. To detect real outliers, it keeps the data points in a hold queue, which is kept separately for each dimension. The hold queue strategy needs more memory and processing time since it is defined for each dimension. Density grid-based clustering has lower quality since it depends on the granularity of the grid. On the other hand, it can handle outliers effectively. Its computation time is high for high-dimensional data.
3. Basic Definitions of HDCStream
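Definition 1 (neighborhood of a point). The neighborhood of a point $p$ within a radius $\varepsilon$ is denoted by $N_{\varepsilon}(p)$: $$N_{\varepsilon}(p) = \{\, q \mid \mathrm{dist}(p, q) \le \varepsilon \,\},$$ where $\mathrm{dist}(p, q)$ is the Euclidean distance between $p$ and $q$.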
Definition 2 (MinPts). MinPts is the minimum number of data points around a data point in the neighborhood of radius $\varepsilon$.
Definition 3 (data point weight value). For each data point in the data stream, we consider a weight which decreases over time. The initial value of data point is 1. The weight of data point (with dimensions) in time is defined based on the weight in as follows (): where function is a fading function. The fading function [28] that we use in HDCStream is defined as , where .
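The following sketch illustrates how a fading weight of this kind can be computed. The concrete form f(t) = 2^(-λt) with λ > 0 is an assumption of this example (it is the exponential fading function commonly used in density-based stream clustering), and the names `fading`, `decayed_weight`, and `lam` are hypothetical.

```python
def fading(elapsed: float, lam: float) -> float:
    """Assumed fading function f(t) = 2 ** (-lam * t), with lam > 0."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    return 2.0 ** (-lam * elapsed)


def decayed_weight(prev_weight: float, t_prev: float, t_now: float,
                   lam: float) -> float:
    """Weight at t_now, given the weight that was valid at t_prev."""
    return prev_weight * fading(t_now - t_prev, lam)


# A point starts with weight 1 and loses half of it every 1 / lam time units.
w_after_4 = decayed_weight(1.0, 0, 4, lam=0.25)   # one half-life
w_after_8 = decayed_weight(1.0, 0, 8, lam=0.25)   # two half-lives
```

With this form the update is composable: decaying from time 0 to 3 and then from 3 to 8 gives the same weight as decaying once from 0 to 8, which is what allows weights to be updated lazily from the last stored value.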
Definition 4 (grid weight). For a grid at current time , the grid weight is defined based on sum of data points’ weights which are mapped to it:
According to the work presented in [27], we update the grid weight in with the last updated value as follows:
The total weight of all the grids in data space is which is less than . Moreover, we have
It means that sum of all data points’ weights has an upper bound of . The number of grids equals , which is , and every th dimension is divided into partitions. Therefore, the average density of each grid is .
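One way to see where an upper bound of this kind comes from is the following derivation. It is a sketch under two assumptions that are not stated in the surviving text: the fading function has the form $2^{-\lambda t}$ with $\lambda > 0$, and exactly one data point arrives per time unit. The symbols $W$, $t_c$, and $N$ are introduced here only for illustration.

```latex
W(t_c) \;=\; \sum_{t=0}^{t_c} 2^{-\lambda (t_c - t)}
       \;=\; \frac{1 - 2^{-\lambda (t_c + 1)}}{1 - 2^{-\lambda}}
       \;<\; \frac{1}{1 - 2^{-\lambda}}
```

Under the same assumptions, dividing this bound by the number of grid cells $N$ (the product of the partition counts over all dimensions) gives an average weight per cell of at most $1 / \bigl(N (1 - 2^{-\lambda})\bigr)$.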
Definition 5 (core point). It is defined as an object for which its overall weight of all neighborhood data points is at least a value .
Definition 6 (dense grid). At time , for a grid , we call it a dense grid if .
Definition 7 (sparse grid). At time , for a grid , we call it a sparse grid if .
Because the overall weight cannot be more than , is a controlling threshold.
Definition 8 (minicluster (MIC)). A at time is defined as for a group of very close data points with timestamps as follows: where is an Euclidean distance between the center of minicluster and the data points in that grid cell.
Definition 9 (grid synopsis). Is a tuple where is the number of data points, is the last timestamp and is the grid weight.
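A grid synopsis of this kind can be sketched as a small record that is updated lazily, that is, only when a point arrives in the cell. The class and field names are hypothetical, and the decay form 2^(-λt) is an assumption of this example rather than a formula recovered from the text.

```python
from dataclasses import dataclass


@dataclass
class GridSynopsis:
    count: int = 0          # number of data points mapped to the cell
    last_time: float = 0.0  # timestamp of the last update
    weight: float = 0.0     # grid weight as of last_time

    def add_point(self, now: float, lam: float) -> None:
        """Decay the stored weight to `now`, then add a new point of weight 1."""
        self.weight = self.weight * 2.0 ** (-lam * (now - self.last_time)) + 1.0
        self.count += 1
        self.last_time = now

    def weight_at(self, now: float, lam: float) -> float:
        """Grid weight at `now` without modifying the synopsis."""
        return self.weight * 2.0 ** (-lam * (now - self.last_time))


g = GridSynopsis()
g.add_point(now=0, lam=0.25)
g.add_point(now=4, lam=0.25)   # old weight 1.0 halves to 0.5, plus 1.0
```

Storing only the count, the last timestamp, and the weight means that no individual point has to be revisited when the weight is refreshed.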
Definition 10 (outlier weight threshold (OWT)). This threshold is considered for the sparse grids which do not receive any data for long. In fact, these grids do not have any chance to be converted to dense grids and consequently to . If the grid weight is less than this threshold, it can safely be deleted from the grid list (in the outlier buffer) [14]. If the last updated time of grid is , then, at current time , the outlier weight threshold is defined as follows ():
Definition 11 (pruning time). We check all MICs’ weights as well as the weights of all grid cells in a time we call it . is the minimum time for a in timestamp to be converted to an outlier in () which is described as follows:
Lemma 12.
Proof.
4. HDCStream Algorithm
HDCStream is a hybrid density-based clustering algorithm for evolving data streams. The overall architecture of the HDCStream algorithm is outlined in Algorithm 1. It has an online and an offline component. For a data stream, at each timestamp, the online component of HDCStream continuously reads a new data record and either adds it to an existing minicluster or maps it to the grid. At each pruning time, HDCStream removes real outliers. The offline component generates the final clusters on demand by the user. The procedure adopted in this algorithm is divided into three steps as follows. The steps are also illustrated in Figure 2.
(1) Merging or mapping (MMStep): the new data point is added to an existing minicluster or mapped to the grid (lines 5–18 of Algorithm 1).
(2) Pruning grids and miniclusters (PGMStep): the weights of the grid cells and of the miniclusters are periodically checked at pruning time. The periods are defined based on the minimum time for a minicluster to be converted to an outlier. The grids and the miniclusters with weights less than a threshold are discarded, and the memory space is released (lines 19–33 of Algorithm 1).
(3) Forming final clusters (FFCStep): final clusters are formed from the miniclusters that remain after pruning. Each minicluster is clustered as a virtual point using a modified DBSCAN (lines 34–36 of Algorithm 1).

The steps are explained as follows.
4.1. MMStep of HDCStream
When a new data point arrives (Figure 3), the following steps are performed.
(i) HDCStream finds the MIC nearest to the new data point.
(ii) If the distance from the new data point to the nearest MIC is below the distance threshold, the point is added to that MIC.
(iii) Otherwise, the data point is mapped into the grid in the outlier buffer.
(a) If the number of data points in the grid cell reaches the required minimum, the grid weight is checked.
(1) If the grid weight is higher than the dense grid threshold, a new MIC is formed from the data points in this grid cell.
(2) The grid cell of the new MIC is then discarded from the grid list.
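The merge-or-map logic of this step can be sketched as follows. This is a simplified illustration, not the authors' Algorithm 1: it ignores weight decay, so a plain point count stands in for the grid weight check, and the names `MiniCluster`, `State`, `mm_step`, `radius`, `cell_width`, and `min_points` are all hypothetical.

```python
import math  # math.dist requires Python 3.8+
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Point = Tuple[float, ...]


@dataclass
class MiniCluster:
    center: Point
    n: int = 1

    def absorb(self, p: Point) -> None:
        """Add a point and update the center as a running mean."""
        self.n += 1
        self.center = tuple(c + (x - c) / self.n for c, x in zip(self.center, p))


@dataclass
class State:
    mics: List[MiniCluster] = field(default_factory=list)
    grid: Dict[Tuple[int, ...], List[Point]] = field(default_factory=dict)


def mm_step(state: State, p: Point, radius: float, cell_width: float,
            min_points: int) -> str:
    # (i)-(ii): try to merge the point into the nearest minicluster.
    if state.mics:
        nearest = min(state.mics, key=lambda m: math.dist(m.center, p))
        if math.dist(nearest.center, p) < radius:
            nearest.absorb(p)
            return "merged"
    # (iii): otherwise map the point to its grid cell (the outlier buffer).
    cell = tuple(math.floor(x / cell_width) for x in p)
    bucket = state.grid.setdefault(cell, [])
    bucket.append(p)
    # Promote a sufficiently populated cell to a minicluster and drop the cell.
    if len(bucket) >= min_points:
        center = tuple(sum(xs) / len(bucket) for xs in zip(*bucket))
        state.mics.append(MiniCluster(center, len(bucket)))
        del state.grid[cell]
        return "promoted"
    return "mapped"
```

The point of the design is visible in the control flow: a point that fits no minicluster costs one dictionary lookup instead of a scan over a second list of outlier candidates.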
4.2. PGMStep of HDCStream
For each MIC, if no new point is added, its weight will gradually decay. Furthermore, there are some grids which do not receive data points for a long time and become sporadic. Such MICs and grid cells should be removed from the minicluster list and the grid list, respectively. The decision to remove grids and miniclusters is made by comparing their weights with a specified threshold. PGMStep is therefore performed at each pruning time, which is defined in Definition 11.
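A pruning pass of this kind can be sketched as below. The sketch uses a single threshold for all entries and assumes the decay form 2^(-λt); the paper defines separate thresholds for grids and miniclusters, which are not reproduced here. The function name `prune` and its parameters are hypothetical.

```python
from typing import Dict, Hashable


def prune(weights: Dict[Hashable, float], last_update: Dict[Hashable, float],
          now: float, lam: float, threshold: float) -> Dict[Hashable, float]:
    """Return the entries whose decayed weight at `now` is at least `threshold`."""
    kept = {}
    for key, w in weights.items():
        # Bring the stored weight forward from its last update time to `now`.
        current = w * 2.0 ** (-lam * (now - last_update[key]))
        if current >= threshold:
            kept[key] = current
    return kept


# "a" was heavy enough to survive four time units of decay, "b" was not.
survivors = prune({"a": 4.0, "b": 1.0}, {"a": 0, "b": 0},
                  now=4, lam=0.25, threshold=1.0)
```

The same routine can be applied to the grid list and to the minicluster list, each with its own threshold.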
4.3. FFCStep of HDCStream
When a clustering request arrives, a variant of DBSCAN algorithm is applied on the set of the online maintained miniclusters to get the clustering result. Each minicluster is considered as a virtual point located at the center of with the weight . We adopt the concept of density connectivity from [21], in order to determine the final clusters. All the densityconnected MICs form a cluster. The variant of DBSCAN algorithm includes two parameters: and .
Definition 13 (directly densityreachable). A is directly densityreachable from a with respect to and if . is the Euclidean distance between the centers of and .
Definition 14 (densityreachable). A is densityreachable from a with respect to and if there is a chain of miniclusters , such that and ( is directly density reachable from ).
Definition 15 (densityconnected). A is densityconnected to a with respect to and if there is a minicluster such that both and are densityreachable from with respect to and .
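The idea of grouping density-connected miniclusters can be sketched with a simplified weighted variant of DBSCAN that treats each minicluster as a virtual point at its center. The exact reachability condition of the paper is not reproduced; here a minicluster is treated as a core if its weight is at least `min_weight`, and two miniclusters are considered neighbors if their centers are within `eps`. All names are hypothetical.

```python
import math  # math.dist requires Python 3.8+
from typing import List, Sequence


def cluster_mics(centers: Sequence[Sequence[float]], weights: Sequence[float],
                 eps: float, min_weight: float) -> List[int]:
    """Label each minicluster with a cluster id, or -1 for noise."""
    n = len(centers)
    labels = [-1] * n
    is_core = [w >= min_weight for w in weights]
    cluster = 0
    for start in range(n):
        if not is_core[start] or labels[start] != -1:
            continue
        labels[start] = cluster
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and math.dist(centers[i], centers[j]) <= eps:
                    labels[j] = cluster
                    if is_core[j]:
                        stack.append(j)  # only core miniclusters extend the chain
        cluster += 1
    return labels


labels = cluster_mics(
    centers=[(0, 0), (1, 0), (2, 0), (10, 10), (20, 20)],
    weights=[5, 5, 1, 5, 1],
    eps=1.5,
    min_weight=3,
)
```

In this example the light minicluster at (2, 0) joins the first cluster as a border member, while the light and isolated one at (20, 20) is left as noise.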
5. Experimental Evaluation
In this section, we present the evaluation of HDCStream with respect to two existing well-known methods, DenStream and DStream. We implemented HDCStream as well as the comparative methods in Java. All experiments were conducted on a 2.5 GHz machine with 4 GB memory, running Mac OS X. We first describe the datasets and then the evaluation measures used for the evaluation of the HDCStream algorithm. Detailed experiments on real and synthetic datasets are discussed as well.
5.1. Datasets
We evaluate the clustering quality, scalability, and sensitivity of the HDCStream algorithm on both real and synthetic datasets. We generated three synthetic datasets, DS1, DS2, and DS3, which are depicted in Figures 4(a), 4(b), and 4(c), respectively. DS1 has 10000 data points with 5% noise. DS2 has 10000 data points with 4% noise, and DS3 has 10000 data points with 5% noise. We then generated an evolving data stream (EDS) by randomly selecting one of the datasets (DS1, DS2, and DS3) 10 times. In each iteration, the chosen dataset forms a 10000-point part of the data stream, so the total length of the evolving data stream is 100000.
[Figure 4: (a) Dataset DS1, 10000 data points, 3% noise; (b) Dataset DS2, 10000 data points, 4% noise; (c) Dataset DS3, 10000 data points, 5% noise.]
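The construction of the evolving data stream described above can be sketched as follows. The fixed random seed and the tiny placeholder datasets are assumptions made so that the example is reproducible and small; the function name `make_evolving_stream` is hypothetical.

```python
import random
from typing import List, Sequence


def make_evolving_stream(datasets: Sequence[Sequence], rounds: int = 10,
                         seed: int = 0) -> List:
    """Concatenate `rounds` randomly chosen datasets into one stream."""
    rng = random.Random(seed)
    stream: List = []
    for _ in range(rounds):
        # Each round appends one whole dataset, chosen uniformly at random.
        stream.extend(rng.choice(datasets))
    return stream


# Placeholder datasets of 5 labeled points each instead of 10000.
ds1 = [("DS1", i) for i in range(5)]
ds2 = [("DS2", i) for i in range(5)]
ds3 = [("DS3", i) for i in range(5)]
eds = make_evolving_stream([ds1, ds2, ds3], rounds=10)
```

With datasets of 10000 points each and 10 rounds, the same construction yields a stream of length 100000.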
The real dataset used is the KDD CUP99 Network Intrusion Detection dataset (all 34 continuous attributes out of the total 42 available attributes are used) [29]. The dataset comes from the 1998 DARPA Intrusion Detection. It contains training data consisting of 7 weeks of network-based intrusions inserted in the normal data and 2 weeks of network-based intrusions and normal data, for a total of 4,999,000 connection records described by 42 characteristics. KDD CUP99 has been used in [14, 17, 24, 27], and it is converted into a data stream by taking the data input order as the order of streaming.
5.2. Evaluation Metrics
Cluster validity is an important issue in cluster analysis. Its objective is to assess the clustering results of the proposed algorithm by comparing them with those of existing well-known clustering algorithms. In the following, we adopt two popular measures, purity and normalized mutual information (NMI), to evaluate the quality of HDCStream.
5.2.1. Purity
The clustering quality is evaluated by the average purity of clusters, which is defined as follows: $$\text{purity} = \frac{\sum_{i=1}^{K} \frac{|C_i^{d}|}{|C_i|}}{K} \times 100\%,$$ where $K$ is the number of clusters, $|C_i^{d}|$ is the number of points with the dominant class label in cluster $i$, and $|C_i|$ is the number of points in cluster $i$. The purity is calculated only for the points arriving in a predefined window (the horizon), since the weight of points diminishes continuously.
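A minimal sketch of this measure is given below, assuming that average purity is the mean, over clusters, of the fraction of points that carry the cluster's dominant class label, expressed as a percentage. The function name `average_purity` and the input layout (one list of true labels per cluster) are hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def average_purity(clusters: Sequence[Sequence[Hashable]]) -> float:
    """Average purity in percent; each inner sequence holds the true labels
    of the points assigned to one cluster. Empty clusters are skipped."""
    ratios: List[float] = []
    for labels in clusters:
        if not labels:
            continue
        dominant_count = Counter(labels).most_common(1)[0][1]
        ratios.append(dominant_count / len(labels))
    if not ratios:
        raise ValueError("at least one non-empty cluster is required")
    return 100.0 * sum(ratios) / len(ratios)


# First cluster: 3 of 4 points share the dominant label; second: 2 of 2.
score = average_purity([["a", "a", "b", "a"], ["b", "b"]])
```

Note that every cluster contributes equally to the average regardless of its size, so a small impure cluster lowers the score as much as a large one.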
5.2.2. Normalized Mutual Information (NMI)
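The normalized mutual information (NMI) is a well-known information-theoretic measure that assesses how similar two clusterings are. Given the true clustering $A$ and the grouping $B$ obtained by a clustering method, let $C$ be the confusion matrix whose element $C_{ij}$ is the number of records of cluster $i$ of $A$ that are also in cluster $j$ of $B$. The normalized mutual information, $\mathrm{NMI}(A, B)$, is defined as $$\mathrm{NMI}(A, B) = \frac{-2 \sum_{i=1}^{c_A} \sum_{j=1}^{c_B} C_{ij} \log\left( \frac{C_{ij} N}{C_{i\cdot} C_{\cdot j}} \right)}{\sum_{i=1}^{c_A} C_{i\cdot} \log\left( \frac{C_{i\cdot}}{N} \right) + \sum_{j=1}^{c_B} C_{\cdot j} \log\left( \frac{C_{\cdot j}}{N} \right)},$$ where $c_A$ ($c_B$) is the number of groups in the partition $A$ ($B$), $C_{i\cdot}$ ($C_{\cdot j}$) is the sum of the elements of $C$ in row $i$ (column $j$), and $N$ is the number of data points. If $A = B$, $\mathrm{NMI}(A, B) = 1$, and if $A$ and $B$ are completely different, $\mathrm{NMI}(A, B) = 0$.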
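The computation can be sketched directly from a confusion matrix, assuming the standard confusion-matrix form of NMI (minus twice the mutual-information sum, divided by the sum of the row and column entropy terms). The function name `nmi` is hypothetical, and returning 1.0 when both partitions consist of a single group is a convention chosen for this example.

```python
import math
from typing import Sequence


def nmi(confusion: Sequence[Sequence[int]]) -> float:
    """Normalized mutual information from a confusion matrix of counts."""
    n = sum(sum(row) for row in confusion)
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(col) for col in zip(*confusion)]
    num = 0.0
    for i, row in enumerate(confusion):
        for j, c in enumerate(row):
            if c > 0:  # empty entries contribute nothing to the sum
                num += c * math.log(c * n / (row_sums[i] * col_sums[j]))
    den = sum(r * math.log(r / n) for r in row_sums if r > 0)
    den += sum(c * math.log(c / n) for c in col_sums if c > 0)
    if den == 0:
        return 1.0  # both partitions have a single group (assumed convention)
    return -2.0 * num / den


same = nmi([[5, 0], [0, 5]])         # identical partitions
independent = nmi([[5, 5], [5, 5]])  # partitions carry no shared information
```

The two example matrices correspond to the limiting cases named in the text: identical partitions and completely unrelated partitions.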
The parameters of HDCStream adopt the following settings: decay factor , minimum number of points , and . The parameters for DenStream and DStream are chosen to be the same as those adopted in [24] and [14], respectively.
5.3. Evaluation of HDCStream on Synthetic Datasets
Figure 5 shows the purity results of HDCStream compared to DenStream and DStream on EDS data stream. In Figure 5(a), the stream speed is set to 2000 points per time unit and horizon . HDCStream shows a good clustering quality. Its clustering purity is higher than 97%. We also set the stream speed at 2000 points per time unit and horizon for EDS. Figure 5(b) shows similar results too. We conclude that HDCStream achieves much higher clustering quality than DenStream and DStream in two different horizons. For example, in horizon , time unit 50, HDCStream has 98% while DenStream and DStream have purity values as 82% and 78%, respectively.
The same is observed for the normalized mutual information. Figure 6 shows the NMI values obtained by the three methods. We repeated the experiments with the same horizon and stream speed (Figures 6(a) and 6(b)). The results show a noticeably high NMI score for HDCStream; its value approaches 1 for both horizons. They also show that DenStream has better NMI than DStream.
We noted very good clustering quality of HDCStream, DStream, and DenStream when no noise is present in the dataset. In fact, purity values are always higher than 98% and all methods are insensitive to the horizon length.
5.4. Evaluation of HDCStream for Real Datasets
The comparison results among HDCStream and both DenStream and DStream on the Network Intrusion dataset are shown in Figure 7. The evaluation is defined based on the selected time units when the attacks happen on horizons 2 and 5, whereas the stream speed is 1000. For instance, in horizon and stream speed 1000, there are 99 teardrop attacks, 182 ipsweep attacks, 618 neptune attacks, and 4097 normal connections. HDCStream clearly outperforms DenStream and specifically DStream. The purity of HDCStream is always above 91%. For example, at time 55, the purity of HDCStream is about 95% which is higher than both DenStream (86%) and DStream (76%).
We show the normalized mutual information results on Network Intrusion Detection dataset in Figure 8. The results have been determined by setting the horizon to 1 and 5, whereas the stream speed is 1000 (Figures 8(a) and 8(b)). The values of normalized mutual information for HDCStream approach 1 for both horizons. It reveals that HDCStream detects the true class labels of data more accurately than DenStream and DStream do.
5.5. Scalability Results
5.5.1. Execution Time
The execution time of HDCStream is influenced by the number of data points processed at each time unit, that is, the stream speed. Figure 9 shows the execution time in seconds on the Network Intrusion Detection dataset for HDCStream compared to DenStream and DStream, when the stream speed increases from 1000 to 10,000 data items.
DenStream has higher processing time due to its merging task which is time consuming. HDCStream has lower execution time compared to the others. The execution time of other methods increases linearly with respect to the stream speed.
5.5.2. Memory Usage
Memory usage of HDCStream is which is the total number of miniclusters and grids.
5.6. Sensitivity Analysis
An important parameter of HDCStream is . It controls the importance of historical data. We test the quality of clustering on different values of ranging from 0.0078 to 1 (Figure 10). When is too small or too large, the clustering quality becomes poor. For example, when , the purity is about 75%, and, when , the points decay soon after their arrival, and only a small number of recent points contribute to the final results. So the result is not very good. However, the quality of HDCStream is still higher than that of DenStream and DStream. It is proved that if varies from 0.0625 to 0.25, the clustering quality is quite good, stable, and always above 96%.
6. Discussion
We proposed a hybrid method for clustering evolving data streams which has high quality and low computation time compared to existing methods. The algorithm clusters data streams in three distinct steps. In existing methods such as DenStream, when a new data point arrives, it takes time to search two lists of microclusters, the potential and the outlier microclusters, in order to find a suitable microcluster. If it is unable to find one, DenStream forms a new microcluster for that data point, which may be the seed of an outlier, hence leading to a low clustering quality. HDCStream, however, only searches the potential list, and if it cannot find a suitable microcluster, the data point is mapped to the grid, which serves as the outlier buffer. We reduced the time complexity of the clustering algorithm using grid-based clustering. The grid-based method allows us to decrease merging time complexity from to . We implemented the grid list in a 2-3-4 tree data structure, which makes search and update faster. The size of the grid list is and the time required for search and update in the grid list is . Consider
We reduced the number of comparisons; therefore, time complexity for merging to minicluster list is ; in which the number of MICs is less than the number of microclusters in DenStream, since that algorithm keeps two lists, one for potential and one for outlier microclusters. Furthermore, we increased the clustering quality by forming miniclusters from data points that are surely not outliers. When the grid density reaches the specified threshold, the data points inside that grid form a minicluster. Therefore, we do not need to form a minicluster for a newly arrived data point if it cannot be placed in any minicluster. The quality is also increased since miniclusters are never formed from an outlier.
Finally, the evaluation results show that using a hybrid method for clustering evolving data streams improves the clustering quality and reduces the computation time.
7. Conclusion
In this paper, we proposed a hybrid density-based clustering algorithm for Internet of Things (IoT) streams. Our hybrid algorithm has three steps, in which the new data point is either mapped to the grid or merged into an existing minicluster, the outliers are removed, and finally clusters of arbitrary shape are formed from the miniclusters by a modified DBSCAN. Our method is a hybrid one, which uses density grid-based clustering and density microclustering to improve the computation time and quality. The evaluation results on synthetic and real datasets show that it has high quality with low computation time for merging. However, HDCStream is not suitable for use in distributed environments.
Our future work will focus on the improvement of HDCStream as a distributed density-based data stream clustering algorithm.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgment
This research is supported by High Impact Research (HIR) Grant, University of Malaya, no. UM.C/625/HIR/MOHE/SC/13/2 from Ministry of Higher Education.