Abstract

Cyber-physical systems have grown exponentially and have been attracting a lot of attention over the last few years. Retrieving and mining useful information from massive sensor data streams with spatial, temporal, and other multidimensional information has become an active research area. Moreover, recent research has shown that clusters of streams change over time, so real applications call for a comprehensive spatial-temporal viewpoint. In this paper, we propose a spatial-temporal clustering algorithm (STClu) based on nonnegative matrix trifactorization that utilizes time-series observational data streams and geospatial relationships to cluster multiple sensor data streams. Instead of directly clustering multiple data streams periodically, STClu incorporates the spatial relationship between sensors in proximity and takes historical information into consideration. Furthermore, we develop an iterative updating optimization scheme for STClu. The effectiveness and efficiency of STClu are both demonstrated in experiments on real and synthetic data sets. The results show that the proposed STClu algorithm outperforms existing methods for clustering sensor data streams.

1. Introduction

Cyber-physical systems (CPS) have grown exponentially and have been attracting a lot of attention over the last few years [1]. Emerging applications include transportation, energy, manufacturing, environment monitoring, and many parts of our social infrastructure. A CPS consists of a large number of sensors that monitor physical or environmental conditions [2, 3]. The incoming information in a CPS needs to be treated as a stream rather than as an ever-growing data set [4, 5].

To obtain interesting relationships and useful information from multiple data streams, a variety of methods have been developed over the past decades. Beringer and Hüllermeier [6] used the discrete Fourier transform (DFT) to summarize data streams and presented an online version of the classic k-means clustering algorithm to cluster parallel data streams. To avoid recalculating the DFT coefficients and minimize processing time, an incremental update mechanism for clustering a large number of data streams was proposed in [7]. To discover cross relationships among multiple data streams, a clustering-on-demand framework was presented in [8]. Yeh et al. [9] proposed a framework for clustering over multiple evolving streams by correlations and events, which monitors the distribution of clusters over multiple data streams based on their correlation. To find communities in a large set of interacting entities, Aggarwal and Yu [10] proposed an online analytical processing framework for community detection in data streams. Evolutionary clustering is a relatively new topic, formulated by Chakrabarti et al. [11] in 2006. To discover communities and capture their evolution with temporal smoothness, incremental and evolutionary clustering technologies have been designed to handle dynamic data streams [10, 12–14]. Wang et al. [15] proposed the ECKF framework for evolutionary clustering of large-scale data based on low-rank kernel matrix factorization. Chi et al. [16] proposed two algorithms that incorporate temporal smoothness in evolutionary spectral clustering. Ning et al. [14] presented an incremental approach to spectral clustering that handles not only insertion/deletion of data points but also similarity changes between existing points.

The values of a sensor in the temporal domain usually exhibit spatial correlation, meaning that a sensor's readings are influenced by nearby sensors. In transportation systems, for example, congestion swiftly expands along a street and influences nearby sensors. A serious congestion usually lasts for a few hours and covers hundreds of sensors at its full size. As time passes, it shrinks slowly, gradually reduces its coverage, and finally disappears. As shown in Figure 1, the records belonging to an atypical event are spatially close and temporally relevant. To retrieve useful information from massive sensor data streams, Tang et al. [5] proposed a model of atypical clusters to describe atypical events and retrieve them by their spatial and temporal attributes. To explore the spatial correlation between data sampled by different sensors in sensor networks, a local spatial clustering algorithm was proposed in [17]. To explore the temporal locality of sensor data, Wei and Peng [18] proposed an incremental clustering algorithm that clusters objects by inducing clustering results from previous time windows.

In this paper, we propose a spatial-temporal clustering algorithm (STClu) based on nonnegative matrix trifactorization that utilizes time-series observational data streams and geospatial relationships to cluster multiple sensor data streams. Our proposal was motivated by recent progress in nonnegative matrix factorization and its applications [19–25], with the aim of discovering clusters of objects with similar behavior and, consequently, discovering potential anomalies and atypical events that may be revealed by relationships evolving over time. We assume that the spatial feature is the summary of an atypical event in the temporal dimension, and the temporal feature is the summary of the event in the spatial dimension. STClu simultaneously incorporates the spatial correlation between sensors in proximity and takes historical information into consideration. To maintain a consistent clustering result, there is a trade-off, embedded in STClu, between the cost of being spatially close and temporally relevant and the benefit from current observations and spatially close historical results. Thus, the clustering results at a given time step are determined by the current snapshot, prior knowledge, and the spatial correlation between sensors in proximity.

The rest of this paper is organized as follows. Related research work is introduced in Section 2. The preliminaries of STClu are given in Section 3. Section 4 presents the proposed algorithm in detail, while convergence and correctness proofs of the update rules are provided in the Appendix. Experimental results on synthetic and real world data sets are presented in Section 5. Finally, Section 6 concludes the paper.

2. A Brief Review of Matrix Factorization and Its Applications

Graphs are used in a wide range of applications, such as transportation networks, sensor networks, and social networks, and various problems are accordingly studied under graph mining. Finding clusters in graphs becomes a new challenge when the graph evolves over time. A matrix is the usual representation of a graph; the relationships between matrix factorization and k-means clustering have been explored in [26].

Singular value decomposition (SVD) has served as a building block for many important applications, such as PCA and LSI [27]. SVD factorizes a matrix $A$ in the general form $A = U \Sigma V^{\top}$, where $U$ is a unitary basis consisting of left-singular vectors of $A$, $V$ is a unitary basis consisting of right-singular vectors of $A$, and $\Sigma$ is a diagonal matrix with the singular values on the diagonal. In SVD, the matrices $U$ and $V$ are allowed to have negative entries, so the projected data might have negative values even when the original data are positive. This can prevent the clustering results from being intuitive for applications such as documents or images, where the data input is positive.
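As a quick illustration of this sign issue (a NumPy sketch added here for clarity, not part of the original experiments), the factors of even an elementwise nonnegative matrix typically contain negative entries:

```python
# Illustrative sketch: SVD of a nonnegative matrix still yields
# singular-vector bases U and V with negative entries.
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))                       # nonnegative data matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, U @ np.diag(s) @ Vt))   # True: A = U @ Sigma @ V^T
print((U < 0).any(), (Vt < 0).any())         # typically True, True
```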

Nonnegative matrix factorization (NMF) [28, 29] is a linear, nonnegative, approximate data representation technique. NMF focuses on the analysis of data matrices whose elements are nonnegative, a common occurrence in data sets derived from text and images. The nonnegative data matrix $X$ is factorized into matrices $W$ and $H$ as $X \approx WH$, with the constraints that $W$ and $H$ are nonnegative.
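For concreteness, here is a minimal sketch of the classic multiplicative update rules for this factorization, minimizing $\|X - WH\|_F^2$; the constant eps is our own guard against division by zero:

```python
# Sketch of Lee-Seung multiplicative updates for NMF: X ~ W @ H,
# with W and H kept elementwise nonnegative throughout.
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-9):
    rng = np.random.default_rng(0)
    n, m = X.shape
    W = rng.random((n, k))
    H = rng.random((k, m))
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # update H with W fixed
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # update W with H fixed
    return W, H
```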

In particular, the result of a k-means clustering run can be written as a matrix factorization $X \approx F G^{\top}$, where $X$ is the data matrix, $F$ contains the cluster centroids, and $G$ contains the cluster membership indicators; NMF, which factorizes nonnegative data matrices, can thus be viewed as a relaxed form of this factorization [28].

To extend the applicable range of NMF methods to an unconstrained data matrix, Semi-NMF was motivated from the perspective of clustering [29]: in the factorization $X \approx F G^{\top}$, it restricts $G$ to be nonnegative while placing no restriction on the signs of $F$. For reasons of interpretability, Convex-NMF constrains the basis vectors of $F$ to lie within the column space of $X$: $F = XW$, so that $X \approx X W G^{\top}$. It can be applied to both nonnegative and mixed-sign data matrices. This constraint allows the columns of $F$ to be interpreted as weighted sums of certain data points, so these columns capture a notion of centroids.

To summarize, Semi-NMF and Convex-NMF were both proposed in [29] to expand the range of application of NMF. Semi-NMF offers a low-dimensional representation of data points that lends itself to a convenient clustering interpretation. Convex-NMF restricts the columns of $F$ to be convex combinations of the data points in $X$; the advantage of this constraint is that the columns can be interpreted as weighted sums of certain data points. Based on NMF, a clustering algorithm was also proposed in [29].

However, the traditional NMF, Semi-NMF, and Convex-NMF are linear models, so they may fail to discover nonlinearities in data streams. In the real world, data streams often have nonlinear structure. The kernel method is a powerful technique for extracting useful information from nonlinear correlations: to handle nonlinearities, it maps the data nonlinearly into a kernel feature space. The Convex-NMF method can then be carried out in the kernel feature space to process nonlinear data.
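As a sketch of the kernelization step (the RBF kernel and the bandwidth sigma are illustrative choices, not prescribed by the paper), the data are mapped implicitly through a kernel matrix on which a kernelized Convex-NMF can then operate:

```python
# Sketch: build an RBF kernel matrix K over n samples (rows of X);
# a kernelized Convex-NMF would factorize structures derived from K.
import numpy as np

def rbf_kernel_matrix(X, sigma=1.0):
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # squared distances
    return np.exp(-d2 / (2.0 * sigma**2))
```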

Indeed, several important applications can be modeled as large sparse graphs, such as transportation network analysis and social network analysis. Low-rank approximation of the matrix of a graph is essential for finding patterns and detecting anomalies; it can extract correlations and remove noise from matrix-structured data. This has led to the proposal of methods such as CUR [30], CMD [31], and the Colibri family [32].

3. Preliminaries

To monitor the traffic status along a freeway, sensors are deployed and used to collect readings, such as vehicle speed and traffic volume. It is assumed that data records arrive synchronously, which means that all data streams are updated simultaneously. The sensors are the nodes of the graph, and an edge is formed between each pair of sensors; these data records can therefore be used to construct graphs of the transportation network periodically.

Given two sets of spatial objects, the spatial relationships are depicted in Figure 2. Let $G_t = (V_t, E_t)$ be the graph associated with the sensor network of the transportation system at timestamp $t$, where $V_t$ represents the set of sensors associated with graph $G_t$ and $E_t$ represents the set of edges between the sensors in $V_t$. A similar definition holds for $G_{t-1} = (V_{t-1}, E_{t-1})$, associated with the sensor network of the transportation system at timestamp $t-1$.

Without loss of generality, we use the adjacency matrices $A_t$ and $A_{t-1}$ to represent the two graphs $G_t$ and $G_{t-1}$, respectively. $A_t(i, j)$ is the element at the $i$th row and $j$th column of the matrix $A_t$; $A_t(:, j)$ is the $j$th column of $A_t$. Every row or column in $A_t$ corresponds to a node in $G_t$. If there is an edge from node $v_i$ to node $v_j$ with similarity $s(v_i, v_j)$, we set the value of $A_t(i, j)$ to $s(v_i, v_j)$. Otherwise, we set it to zero. The similarity is computed as follows:

$$ s(v_i, v_j) = \exp\!\left(-\frac{d(v_i, v_j)^2}{\sigma^2}\right), \tag{1} $$

where $d(v_i, v_j)$ is the distance between $v_i$ and $v_j$ and $\sigma$ is a scaling parameter controlling how rapidly the similarity decreases with the distance between $v_i$ and $v_j$.
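A small sketch of constructing $A_t$ from (1); here the distance $d$ is taken over the sensors' windowed readings, which is one plausible choice rather than the paper's prescription:

```python
# Sketch: similarity matrix A_t via Gaussian decay of pairwise distances.
import numpy as np

def similarity_matrix(readings, sigma=1.0):
    # readings: one row per sensor, columns are observations in the window
    d = np.linalg.norm(readings[:, None, :] - readings[None, :, :], axis=2)
    A = np.exp(-(d**2) / sigma**2)
    np.fill_diagonal(A, 0.0)        # no self-loops
    return A
```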

The spatial relationships between the two graphs are denoted as a set of edges connecting the nodes of $G_{t-1}$ and $G_t$. We use the adjacency matrix $B$ to represent these spatial relationships. Given two objects $o_i \in V_t$ and $o_j \in V_{t-1}$, the locations of $o_i$ and $o_j$ are denoted as $l_i$ and $l_j$, respectively. The spatial relationship between $o_i$ and $o_j$ is defined as follows:

$$ B(i, j) = \begin{cases} 1, & d(l_i, l_j) \le \varepsilon, \\ 0, & \text{otherwise}, \end{cases} \tag{2} $$

where $\varepsilon$ is a proximity threshold.
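Under the thresholded definition in (2) (recall that the threshold $\varepsilon$ is an assumed parameter), the matrix $B$ can be built directly from sensor coordinates:

```python
# Sketch: spatial-relationship matrix B linking nodes of G_{t-1} and G_t
# whenever their locations are within the proximity threshold eps.
import numpy as np

def spatial_matrix(loc_cur, loc_prev, eps=0.5):
    # loc_cur: (n_t, 2) coordinates, loc_prev: (n_{t-1}, 2) coordinates
    d = np.linalg.norm(loc_cur[:, None, :] - loc_prev[None, :, :], axis=2)
    return (d <= eps).astype(float)  # B has shape (n_t, n_{t-1})
```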

At timestamp $t$, the objective of STClu is to create a set of partitions $C = \{C_1, C_2, \ldots, C_k\}$, where each $C_i$, $1 \le i \le k$, must satisfy $C_i \cap C_j = \emptyset$ for $i \ne j$ and $\bigcup_{i=1}^{k} C_i = V_t$. The community $C_i$ is a set of similar data streams such that $\sum_{i=1}^{k} |C_i| = N$, where $N$ is the total number of streams in $V_t$.

4. Our Approach

4.1. Objective Function of STClu

By utilizing time-series observational data streams and geospatial relationships for clustering multiple sensor data streams, the proposed method takes into consideration the spatial correlation between sensors in proximity as well as historical information. The cost of being spatially close and temporally relevant is embedded in the objective function. The cluster partitions are obtained by decomposing the adjacency-matrix representation of the graph into a product of its latent factors. The square of the Euclidean distance and the generalized Kullback-Leibler divergence are common metrics for the approximation error [24]. In STClu, we seek to minimize the distance function between two nonnegative matrices $P$ and $Q$, defined as

$$ D(P \,\|\, Q) = \| P - Q \|_F^2, \tag{3} $$

where $\| \cdot \|_F$ is the Frobenius norm. This is lower bounded by zero and clearly vanishes if, and only if, $P = Q$.

The objective function is given in (4), where $X_t$ and $X_{t-1}$ are the representative column matrices of $A_t$ and $A_{t-1}$, respectively, $U$ and $V$ are the corresponding weight matrices, $H_t$ and $H_{t-1}$ are the cluster membership matrices for the graphs $G_t$ and $G_{t-1}$, respectively, and $S$ reflects the correspondence between the subgroups derived from the two time steps.
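One concrete form of (4) consistent with this description is sketched below; the trade-off weights $\alpha$ and $\beta$ are assumptions introduced here to make the spatial-temporal trade-off explicit, and the exact weighting in the original may differ:

$$ J = \| A_t - X_t U H_t^{\top} \|_F^2 + \alpha \, \| A_{t-1} - X_{t-1} V H_{t-1}^{\top} \|_F^2 + \beta \, \| H_t - B H_{t-1} S \|_F^2, \qquad U, V, H_t, S \ge 0, $$

where the first two terms measure the trifactorization error at the current and previous time steps and the third term smooths the current memberships toward the spatially mapped historical ones.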

Specifically, assume cluster numbers $k_t$ and $k_{t-1}$ at time steps $t$ and $t-1$, respectively. If $k_t > k_{t-1}$, then the extra clusters at time $t$ must be computed using only the current data, and $S$ is padded accordingly. Otherwise, $k_t \le k_{t-1}$, and we simply remove the deleted clusters' weight vectors from the previous factors.

4.2. Optimization of STClu

As can be seen, the objective function in (4) is minimized with respect to $U$, $V$, $H_t$, and $S$, and it is unrealistic to expect an algorithm to find the global minimum. In the following, we introduce an alternating scheme to optimize the objective function, which achieves a local minimum.

We optimize the objective function with respect to one variable while fixing the others. This iterative procedure is repeated until convergence. Using the matrix properties $\|M\|_F^2 = \operatorname{tr}(M^{\top} M)$ and $\operatorname{tr}(MN) = \operatorname{tr}(NM)$, the objective function in (4) can be rewritten in trace form as (5).

To alternately update the entries of $U$, $V$, $H_t$, and $S$, we resort to a Lagrangian function. Let $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ be the Lagrange multipliers for the constraints $U \ge 0$, $V \ge 0$, $H_t \ge 0$, and $S \ge 0$, respectively. The Lagrange function is then expressed as (6).

Since the derivatives of the Lagrange function with respect to $U$, $V$, $H_t$, and $S$ must be zero, we obtain the stationarity conditions in (7).

Using the Karush-Kuhn-Tucker complementarity conditions [29, 33], $(\lambda_1)_{ij} U_{ij} = 0$, $(\lambda_2)_{ij} V_{ij} = 0$, $(\lambda_3)_{ij} (H_t)_{ij} = 0$, and $(\lambda_4)_{ij} S_{ij} = 0$, we obtain the equations in (8) for $U$, $V$, $H_t$, and $S$.

According to (8), we derive the multiplicative updating rules (9), (10), (11), and (12) for $U$, $V$, $H_t$, and $S$, respectively.
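Each of (9)–(12) follows the standard multiplicative pattern for nonnegatively constrained objectives, stated here as a sketch rather than the exact rules: for a factor $Z \in \{U, V, H_t, S\}$, write the gradient as a difference of two nonnegative parts, $\nabla_Z J = [\nabla_Z J]^{+} - [\nabla_Z J]^{-}$; the complementarity conditions then yield the elementwise update

$$ Z_{ij} \leftarrow Z_{ij} \, \frac{[\nabla_Z J]^{-}_{ij}}{[\nabla_Z J]^{+}_{ij}}, $$

which preserves nonnegativity and whose fixed points satisfy the conditions in (8).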

Proofs of the convergence of update rules (9), (10), (11), and (12) are given in the Appendix.

4.3. Algorithm Description

To eliminate linearly dependent columns and construct the subspace used for the low-rank approximation of the data matrix $A_t$, a unique and independent subspace $X_t$ is constructed using Colibri-S and Colibri-D [32]. Algorithm 1 summarizes the details of the iterative updating algorithm for (4). It is assumed that the adjacency matrices $A_t$, $A_{t-1}$, and $B$, together with the factor matrices from the previous time step, are given. To improve clustering efficiency, the variable factors are initialized with the previous clustering results instead of random values [29]. Applying the update rules (9), (10), (11), and (12) until convergence, we obtain new $U$, $V$, $H_t$, and $S$.

Input: matrices $A_t$, $A_{t-1}$, and $B$, the factor matrices from the previous time step, cluster number $k_t$, and maximum iteration MaxIter
Output: matrices $U$, $V$, $H_t$, and $S$
(1)   use the family of Colibri methods to get $X_t$
(2)   determine the cluster number $k_t$
(3)   initialize $U$, $V$, and $H_t$ using the factor matrices from the previous time step
(4)   if we need to form a new partition
(5)     go to (8)
(6)   else
(7)     keep the previous factor matrices as the current ones, and return
(8)   while not converged and iteration count ≤ MaxIter do
(9)     update $U$ by (9)
(10)    update $V$ by (10)
(11)    update $H_t$ by (11)
(12)    update $S$ by (12)
(13)  end while
(14)  return $U$, $V$, $H_t$, and $S$
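To make the inner loop concrete, the following runnable sketch implements a simplified version of steps (8)–(13): a plain nonnegative trifactorization $A_t \approx X_t U H_t^{\top}$ with multiplicative updates and a warm start from the previous membership matrix. The temporal and spatial coupling terms of STClu are omitted for brevity, and the node set is assumed unchanged between steps.

```python
# Simplified sketch of Algorithm 1's update loop: minimize
# ||A - X U H^T||_F^2 over nonnegative U and H, warm-starting H
# from the previous time step's membership matrix H_prev.
import numpy as np

def trifactor_step(A, X, H_prev, n_iter=100, eps=1e-9):
    k = H_prev.shape[1]
    rng = np.random.default_rng(0)
    U = rng.random((X.shape[1], k))
    H = H_prev + eps                      # warm start, strictly positive
    for _ in range(n_iter):
        XU = X @ U
        H *= (A.T @ XU) / (H @ (XU.T @ XU) + eps)             # membership update
        U *= (X.T @ A @ H) / (X.T @ X @ U @ (H.T @ H) + eps)  # weight update
    return U, H
```

In the full algorithm, the update for $H_t$ would additionally pull toward the spatially mapped historical memberships, and $V$ and $S$ would be refreshed in the same loop.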

The computational cost of the proposed algorithm is discussed as follows. To compute the complexity of the algorithm, we need to inspect its main operations, namely, the matrix decomposition and the multiplicative updates. In the decomposition step, the computation is actually done using the three small factor matrices $X_t$, $U$, and $H_t$. Each factor matrix is multiplied by an update factor, which involves several matrix multiplications. Assume that $k_t$ is the cluster number at time step $t$, $c_t$ and $c_{t-1}$ are the sizes of the data subspaces at time steps $t$ and $t-1$, respectively, and $n_t$ and $n_{t-1}$ are the sizes of the input matrices at time steps $t$ and $t-1$, respectively; the multiplicative update stops after $r$ iterations, so the cost of STClu grows linearly in $r$ and depends only on these small dimensions rather than on the full data size.

5. Experiments and Results

In this section, we use several synthetic and real world data sets to evaluate the effectiveness and efficiency of the STClu algorithm.

5.1. Baseline Algorithms and Evaluation Methods

To demonstrate the efficiency of STClu, we compare it with three popular clustering algorithms. The first is k-means clustering in the principal component analysis (PCA) subspace. The other two are the typical spectral clustering algorithm normalized cut (Ncut) [34] and NMF-based clustering [19].

To evaluate clustering quality, all our comparisons are based on clustering accuracy (ACC) and normalized mutual information (NMI) measurements.

Clustering accuracy (ACC) discovers the one-to-one relationship between clusters and classes and measures the extent to which each cluster contains data points from the corresponding class. It is defined as [18, 19]

$$ \mathrm{ACC} = \frac{\sum_{i=1}^{n} \delta\big(y_i, \mathrm{map}(c_i)\big)}{n}, \tag{13} $$

where $c_i$ denotes the cluster label, $y_i$ denotes the true class label, $n$ is the total number of documents, $\delta(x, y)$ is the delta function that equals one if $x = y$ and zero otherwise, and $\mathrm{map}(c_i)$ is the permutation mapping function that maps each cluster label $c_i$ to the equivalent label from the data set.

Between two random variables $X$ (category label) and $Y$ (cluster label), NMI is defined as [35]

$$ \mathrm{NMI}(X, Y) = \frac{\sum_{i=1}^{k} \sum_{j=1}^{\hat{k}} n_{i,j} \log \dfrac{n \, n_{i,j}}{n_i \hat{n}_j}}{\sqrt{\left( \sum_{i=1}^{k} n_i \log \dfrac{n_i}{n} \right) \left( \sum_{j=1}^{\hat{k}} \hat{n}_j \log \dfrac{\hat{n}_j}{n} \right)}}, \tag{14} $$

where $n$ is the number of documents, $n_i$ and $\hat{n}_j$ denote the numbers of documents in category $i$ and cluster $j$, respectively, $n_{i,j}$ denotes the number of documents in category $i$ as well as in cluster $j$, and $k$ and $\hat{k}$ are the cluster numbers of the true and predicted clusterings. The NMI score is 1 if the clustering results perfectly match the category labels, whereas the score is 0 if the data are randomly partitioned. The higher the NMI score, the better the clustering quality.
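Both measures can be computed with standard tools; the following sketch uses SciPy's Hungarian solver for the optimal label mapping and scikit-learn for NMI:

```python
# Sketch: clustering accuracy via the optimal one-to-one label mapping
# (Hungarian algorithm on the negated contingency table) and NMI.
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    cont = np.array([[np.sum((y_pred == cl) & (y_true == cs))
                      for cs in classes] for cl in clusters])
    rows, cols = linear_sum_assignment(-cont)   # maximize matched counts
    return cont[rows, cols].sum() / len(y_true)

# nmi = normalized_mutual_info_score(y_true, y_pred)
```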

Since there are no predefined categories in our data, we design an alternative way to carry out the evaluation. In this paper, we use the k-means clustering result on the combined feature data as the target labels for feature clustering. Similarly, the k-means clustering result on the combined user data is used as the target labels for user clustering. To randomize the experiments, we conduct the evaluation with different cluster numbers. Note that all experiments ran on the same machine. At each time step, for each given cluster number, 30 test runs are carried out on different randomly chosen clusters. For each experiment, we run it 10 times and report the average.

5.2. Synthetic Data Sets

In this section, we use synthetic data sets to demonstrate that STClu can effectively cluster multiple evolving data streams over time. The number of nodes and the number of clusters at each time step are given. The synthetic data sets are generated in two steps [18]. The first step generates the objects' geographic information, and the second step generates the values of the objects' nongeographic attributes. The nongeographic attributes are generated with the prototype scheme used in [6]: a prototype is a stochastic process built from independent random variables uniformly distributed in an interval, and each data stream is then defined by combining the prototype with two further stochastic processes generated in the same way. A smoothness constant determines the smoothness of a data stream and can differ across the three processes, for example, 0.05, 0.1, and 0.2, respectively. To examine the quality of STClu for clustering multiple data streams, we generate random data sets with the number of streams ranging from 100 to 1000. Each stream has a length of 2,000 points.
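An illustrative generator for such smooth streams (a sketch in the spirit of the scheme above; the smoothness constant c and the increment interval are assumed parameters, not the exact values of [6]):

```python
# Sketch: generate a smooth synthetic stream as an exponentially
# smoothed random walk; smaller c yields a smoother stream.
import numpy as np

def smooth_stream(length=2000, c=0.1, lo=-1.0, hi=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = np.empty(length)
    x[0] = rng.uniform(lo, hi)
    for t in range(1, length):
        x[t] = (1 - c) * x[t - 1] + c * rng.uniform(lo, hi)
    return x

# e.g., streams with different smoothness: c = 0.05, 0.1, 0.2
```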

We show the performance of the four methods as the time window evolves and the number of clusters in the data varies from 2 to 20. The stream number is fixed at 400, and each stream contains 1,000 points. Table 1 reports the average ACC and NMI of the four methods for different cluster numbers on the synthetic data set. It can be clearly seen that STClu outperforms all other approaches and attains the highest NMI at the evolving time steps. The reason is that it incorporates the previous clustering results of the upstream sensors and the spatial relationship between the upstream and downstream sensors.

The average ACC and NMI values of the proposed algorithm as the number of clusters or data streams varies are shown in Figures 3(a) and 3(b), respectively. As Figure 3 shows, the performance of the proposed algorithm is very robust to the number of clusters and the number of data streams.

Finally, we evaluate the average processing time for a round of clustering multiple data streams. There are different factors affecting the execution time. In this experiment, we evaluate the effect of the window size and the number of data streams on the response time for the compared algorithms.

The following set of experiments evaluates the effect of window size on the execution time of these algorithms. Figure 4(a) shows that as the window size increases, the processing time of these algorithms increases; the $y$-axis shows the execution time, while the $x$-axis represents the window size, varied from 60 to 600. In Figure 4(b), the $y$-axis shows the execution time, while the $x$-axis represents the number of data streams, varied from 100 to 1000; the other parameters are fixed. Note that as the number of streams increases, the execution time of all methods increases. However, the times of STClu and NMF increase much more slowly than those of the others, owing to the execution efficiency of STClu and NMF for clustering multiple data streams in the approximate subspace. Although STClu requires more time than NMF, the experiments show that it is able to process extremely large sensor data streams with spatial, temporal, and other multidimensional information.

5.3. Real World Data Sets

The real world data sets were obtained from the PeMS (http://pems.dot.ca.gov/) database. With sensor devices installed on road networks, a system monitors traffic flows on major US highways 24 hours a day, 7 days a week, thus acquiring huge volumes of data. We collected data from 963 sensors positioned along the freeway. We selected 15 months of daily data records giving the occupancy rates (between 0 and 1) of different car lanes on freeways in the San Francisco bay area over time. The data set covered records sampled every 10 min from 1 January 2008 to 30 March 2009. Data for public holidays were removed from the data set, as well as data collected on two days with anomalies (8 and 9 March 2008), where all sensors were muted between 2:00 and 3:00 am.

First, we discuss the experiments on the PeMS data set with the number of clusters varying from 2 to 20. For each given cluster number $k$, the clustering results of all algorithms on the PeMS data set are shown in Table 2; Ncut, NMF, and the proposed STClu algorithm achieve better performance than the k-means algorithm, since these three algorithms consider the geometric structure information contained in the data.

Next, we discuss the experiments on the PeMS data set at different time steps. For each time step, the clustering results of all algorithms on the PeMS data set are shown in Table 3. We observe that the STClu algorithm performs better than the other three algorithms at most time steps. The reason is that STClu integrates more historical knowledge as well as the spatial relationship between sensors in proximity.

Finally, we report the experimental results on the PeMS data set over different numbers of time steps with varying numbers of clusters. The clustering results of all algorithms on the PeMS data set are shown in Table 4. STClu outperforms the other three algorithms in terms of both ACC and NMI, because it simultaneously considers the geometric structure and incorporates prior knowledge and the spatial correlation between sensors in proximity.

6. Conclusions

Clustering multiple sensor data streams in CPS has been extensively studied in various applications, including transportation systems, sensor networks, and social networks. To extract and retain meaningful information from multiple sensor data streams, we assume that the spatial feature is the summary of an atypical event in the temporal dimension, and the temporal feature is the summary of the event in the spatial dimension. In this work, we have proposed a spatial-temporal clustering algorithm, called STClu, which utilizes time-series observational data streams and geospatial relationships to cluster multiple sensor data streams. It aims to discover clusters of objects with similar behavior and, in turn, potential anomalies that may be revealed by relationships evolving over time. STClu simultaneously incorporates the spatial relationship between sensors in proximity and takes historical information into consideration. Experimental results on synthetic and real world data sets show that STClu achieves better performance for clustering multiple sensor data streams evolving over time.

Appendix

In this section, we investigate the convergence and correctness of the update rules (9), (10), (11), and (12) for the objective function in (4). Regarding these four update rules, we have the following theorem.

Theorem A.1. For nonnegative matrices $A_t$, $A_{t-1}$, $B$, $X_t$, and $X_{t-1}$, the update formulas for $U$, $V$, $H_t$, and $S$ given in (9), (10), (11), and (12), respectively, monotonically decrease the objective function in (4).

The proofs are provided with the aid of auxiliary functions. Since, with the remaining factors fixed, the first term of the objective function in (4) is the only one related to $U$, the second term is the only one related to $V$, and the third term is the only one related to $S$, we have exactly the same updating rules for $U$, $V$, and $S$ as in [19], respectively. Thus, we can use the convergence proof in [19] to show that the objective is monotonically decreasing under the updating rules in (9), (10), and (12), respectively; see [19, 36] for details. Next, we prove convergence under the update rule for $H_t$ in (11).

Definition A.2. $Z(h, h')$ is an auxiliary function for $F(h)$ if the following two conditions are satisfied:

$$ Z(h, h') \ge F(h), \qquad Z(h, h) = F(h). \tag{A.1} $$

Lemma A.3. For any $h$ and $h'$, if $Z$ is an auxiliary function of $F$, then $F$ is monotonically decreasing under the update formula

$$ h^{(t+1)} = \arg\min_{h} Z\big(h, h^{(t)}\big). \tag{A.2} $$

Proof. Consider

$$ F\big(h^{(t+1)}\big) \le Z\big(h^{(t+1)}, h^{(t)}\big) \le Z\big(h^{(t)}, h^{(t)}\big) = F\big(h^{(t)}\big). $$

Thus, $F(h^{(t+1)}) \le F(h^{(t)})$, and $F$ is monotonically decreasing, which completes the proof.

Next, we show that the update rule for $H_t$ in (11) is exactly the update in (A.2) with a proper auxiliary function. Considering any element $h_{ab}$ of $H_t$, we let $F_{ab}(h)$ denote the part of the objective function in (4) that is relevant to $h_{ab}$. We then compute its first and second derivatives, $F'_{ab}$ and $F''_{ab}$.

Since our update rule operates elementwise, it is sufficient to show that each $F_{ab}$ is nonincreasing under the update formula in (11).

Lemma A.4. The function $Z(h, h^{(t)})$ defined in (A.6) is an auxiliary function for $F_{ab}(h)$.

Proof. We first take the Taylor series expansion of $F_{ab}(h)$ at $h^{(t)}$. Comparing it with (A.6), the condition $Z(h, h^{(t)}) \ge F_{ab}(h)$ is equivalent to the inequality in (A.8), which can be verified term by term. Thus, (A.8) holds and $Z(h, h^{(t)}) \ge F_{ab}(h)$. Furthermore, $Z(h, h) = F_{ab}(h)$ is obvious.

Proof of Theorem A.1. Replacing $Z(h, h^{(t)})$ in (A.2) by (A.6), we obtain the update rule in (11). Since (A.6) is an auxiliary function for $F_{ab}$, $F_{ab}$ is nonincreasing under the update rule in (11).

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

The authors gratefully acknowledge the support provided for this research by the Research Fund for the Doctoral Program of Higher Education of China (Grant no. 20120191110047), Natural Science Foundation Project of CQ CSTC of China (Grant no. CSTC2012JJB40002), Engineering Center Research Program of Chongqing of China (Grant no. 2011pt-gc30005), Fundamental Research Funds for the Central Universities of China (Grant no. 106112014CDJZR178801), and Fundamental Research Funds for the Central Universities of China (Grant no. CDJXS10170004).