Abstract

The rapid spread of positioning devices has led to the generation of massive spatiotemporal trajectory data. In some scenarios, spatiotemporal data are received as a stream. Clustering stream data benefits applications such as traffic management and weather forecasting. In this article, an algorithm for Continuous Clustering of Trajectory Stream Data Based on Micro Cluster Life is proposed. The algorithm consists of two phases. In the online phase, temporal micro clusters store summarized spatiotemporal information for each group of similar segments. Clustering in the online phase is based on temporal micro cluster lifetime rather than on the time window technique, which divides stream data into time bins and clusters each bin separately. In the offline phase, a density based clustering approach generates macro clusters from the temporal micro clusters. Evaluation on real data sets shows the efficiency and effectiveness of the proposed algorithm and indicates that it is an efficient alternative to the time window technique.

1. Introduction

Recently, moving objects such as vehicles and animals have been equipped with GPS devices, which leave digital traces (latitude and longitude positions) at each moment. The low price of GPS devices has led to an exponential growth of trajectory data. Analysis of trajectory data extracts crucial information that helps researchers find solutions to challenges such as traffic congestion [1]. One of the most important analysis tools is clustering, which aims to aggregate data into clusters such that the similarity among members of the same cluster is high and the similarity between members of different clusters is very low [2, 3]. Clustering stream data is more complex than clustering classical data, since it faces a set of challenges: (i) single pass processing due to the continuous arrival of data, (ii) unbounded data stream size with limited memory space and time, and (iii) evolving data, where the model underlying the data stream may change over time, so the clustering algorithm should be able to detect such changes [3, 4]. Many data stream clustering algorithms follow the object based paradigm, which consists of two phases: an online phase and an offline phase. The online phase stores summarized information about the data stream in micro clusters, which act as representatives of the raw data stream. When the size of the micro clusters exceeds the memory limit, similar micro clusters are merged to reduce memory use. The offline phase is invoked on user demand; it applies a density based clustering approach to the representative lines of the micro clusters to present the current results for the stream data.

Problem Statement. Many existing algorithms such as TCMM and ConTraClu use the time window technique to incrementally cluster trajectory stream data. The time window technique partitions trajectory stream data into equal temporal periods (time bins or time stamps) and clusters each period separately, as illustrated in Figure 1. Starting clustering from scratch in each time bin has the following consequences. (i) Clustering quality is disturbed in the border area between two adjacent time bins, especially if that area is dense with trajectory segments, since the time window technique creates new micro clusters for some segments at the start of each time bin even though these segments are very close (within the distance threshold) to micro clusters at the end of the previous time bin. (ii) The TCMM algorithm does combine some of these micro clusters during the merging stage to reduce memory space, but that is time consuming. In addition, the TCMM framework merges similar micro clusters when their size exceeds a given memory space, and the merging process does not maintain temporal information. This is inconsistent with the behavior of a free moving object, which may visit the same spatial area many times in different periods of time, as illustrated in Figure 2.

In this article, the Continuous Clustering of Trajectory Stream Data Based on Micro Cluster Life (CC_TRS) algorithm is proposed. The algorithm assigns a lifetime to each newly created micro cluster, and newly arriving segments affect only the nonexpired clusters; for example, any segment at the current time (dashed line in Figure 3) affects temporal micro clusters B and C and ignores A, since A has expired. Micro Cluster Life is a continuous clustering technique; therefore there is no need to divide stream data into time bins and start clustering from scratch, as in the time window technique. The CC_TRS algorithm consists of two phases: micro clustering (online phase) and macro clustering (offline phase). In the online phase, the temporal micro cluster (TMC) data structure is defined to store summarized information for each group of similar segments; TMC is similar to the micro cluster (MC) data structure in the TCMM framework [5], except that it has additional temporal features that describe the temporal existence of the TMC. When the size of the TMCs exceeds a given memory space, similar TMCs are merged (spatiotemporally) to reduce memory space. In the offline phase, any density based approach can be used to cluster the temporal micro clusters into macro clusters in response to a user request. Note that CC_TRS is intended for clustering the trajectories of free moving objects.

The main contributions of this paper are as follows:
(i) Propose the concept of temporal micro cluster (TMC), which means that a TMC exists for a period of time starting from its creation time. The clustering task takes a TMC into account as long as it exists.
(ii) Define a new data structure for TMC.
(iii) Evaluation of the proposed algorithm on real data sets shows its ability to maximize cluster quality and minimize execution time compared with existing algorithms.

The rest of this article is organized as follows. Section 2 presents related work on data stream clustering algorithms. Section 3 presents the problem definitions. Section 4 describes the proposed CC_TRS algorithm. Section 5 presents the performance evaluation. Finally, Section 6 concludes the article.

2. Related Work

Trajectory stream clustering aims to find representative paths and common tendencies that are shared by a group of moving objects. Numerous clustering methods have been presented for static trajectory data sets; these methods can be classified into five main categories [6, 7]: spatial based clustering [8], time dependent clustering [9], partition and group based clustering [10], uncertain trajectory clustering [11], and semantic trajectory clustering [12].

Much research has been conducted on stream data clustering. Aggarwal et al. [13] proposed the CluStream framework for clustering evolving data streams; CluStream uses the concept of a pyramidal time frame in conjunction with a micro clustering approach. However, the CluStream framework does not handle trajectory stream data. Elnekave et al. [14] presented an incremental clustering algorithm for finding evolving groups of similar mobile objects in spatiotemporal data. In this algorithm, each trajectory is represented by a set of minimal bounding boxes (MBBs), and the overlap between the MBBs of two trajectories represents the similarity between them. The algorithm uses a modified version of the incremental k-means algorithm to cluster moving object trajectories. Jensen et al. [15] presented a disk-based algorithm for continuous clustering of moving objects. The algorithm employs a clustering feature structure that can be updated incrementally. A moving object may be deleted from or inserted into a moving cluster during a period of time. The approach then merges and splits the clusters by monitoring their compactness. Li et al. [5] suggested the TCMM framework for incremental trajectory clustering, which consists of two parts: online micro clustering and offline macro clustering. Online micro clustering stores statistical information about similar trajectory segments in a cluster feature (CF) data structure and updates the CFs when a new batch of segments is added. Similar CFs are merged to address the memory limitation. Offline clustering is performed on the set of micro clusters using density based clustering when the user requests the clustering results. Some studies use optimization strategies such as indexes or pruning to minimize search and enhance the efficiency of clustering. Yu et al. [16, 17] proposed the ConTraClu algorithm to cluster continuous high speed trajectory data streams and discover moving patterns such as flocks. The algorithm consists of online clustering of trajectory segments based on a density based approach and an updating process for closed clusters that depends on a bi-Tree index. Da Silva et al. [18, 19] proposed an incremental algorithm for trajectory data streams. The algorithm uses a micro group structure to track a moving object and its evolution over consecutive time windows. A micro group describes the relationship among moving objects and evolves (merges or splits) in the next time period. Mao et al. [2] produced the two-stage framework TSCluWin over the sliding window model. During the first stage, sufficient summarized information about the micro clusters is stored and maintained continuously in the EF data structure. During the second stage, a small number of macro clusters are produced from the micro clusters. Some other, different approaches also deserve mention. Costa et al. [20] interpret a trajectory as a discrete time signal and use the Fourier transform to measure the similarity between two trajectories.

3. Problem Definition

In this section, we will define some notations.

Definition 1 (temporal micro cluster (TMC)). A vector that summarizes the spatiotemporal features of a set of similar directed segments. A TMC is of the form ($LS_{cen}$, $SS_{cen}$, $LS_{\theta}$, $SS_{\theta}$, $LS_{len}$, $SS_{len}$, time, Expire, and $N$):
$LS_{cen}$: the linear sum of the line segments' center points.
$SS_{cen}$: the square sum of the line segments' center points.
$LS_{\theta}$: the linear sum of the line segments' angles.
$SS_{\theta}$: the square sum of the line segments' angles.
$LS_{len}$: the linear sum of the line segments' lengths.
$SS_{len}$: the square sum of the line segments' lengths.
time: TMC creation time.
Expire: true if the TMC is expired, otherwise false.
$N$: number of line segments in the temporal micro cluster (TMC).
Note that the lifespan of a TMC starts at TMC.time and ends at TMC.time + threshold, where threshold is a predefined time threshold.
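
As a minimal sketch of Definition 1, the TMC summary can be held in a small record whose sums are updated incrementally as segments arrive. The class name, field names, and the `is_alive` helper below are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class TMC:
    """Temporal micro cluster summary (all names are hypothetical)."""
    ls_center: tuple = (0.0, 0.0)  # linear sum of segment center points (x, y)
    ss_center: tuple = (0.0, 0.0)  # square sum of segment center points (x, y)
    ls_angle: float = 0.0          # linear sum of segment angles
    ss_angle: float = 0.0          # square sum of segment angles
    ls_length: float = 0.0         # linear sum of segment lengths
    ss_length: float = 0.0         # square sum of segment lengths
    created: float = 0.0           # creation time
    expired: bool = False          # expiry flag
    n: int = 0                     # number of summarized segments

    def add(self, center, angle, length):
        """Fold one segment into the running sums."""
        cx, cy = center
        self.ls_center = (self.ls_center[0] + cx, self.ls_center[1] + cy)
        self.ss_center = (self.ss_center[0] + cx * cx, self.ss_center[1] + cy * cy)
        self.ls_angle += angle
        self.ss_angle += angle * angle
        self.ls_length += length
        self.ss_length += length * length
        self.n += 1

    def is_alive(self, now, life):
        """The lifespan runs from `created` to `created + life`."""
        return not self.expired and (now - self.created) < life


tmc = TMC(created=100.0)
tmc.add(center=(1.0, 2.0), angle=0.5, length=4.0)
tmc.add(center=(3.0, 4.0), angle=0.7, length=6.0)
```

Because only sums are stored, adding a segment costs constant time and the raw segments never need to be kept in memory.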

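Definition 2 (representative line of TMC). It represents the spatial average of the TMC members. The start point $s$ and end point $e$ of the representative line of a TMC can be calculated from its features:
$$s = \left(cen_x - \frac{len}{2}\cos\theta,\; cen_y - \frac{len}{2}\sin\theta\right), \qquad e = \left(cen_x + \frac{len}{2}\cos\theta,\; cen_y + \frac{len}{2}\sin\theta\right),$$
where $cen_x = LS_{cen_x}/N$, $cen_y = LS_{cen_y}/N$, $len = LS_{len}/N$, and $\theta = LS_{\theta}/N$ are the averages of the center coordinates, lengths, and angles over the $N$ segments of the TMC.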
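
The representative line can be sketched in a few lines of code. The version below assumes the averaged center, angle, and length are combined in the same way as in the TCMM framework [5] (center plus or minus half the average length along the average direction); the function and parameter names are hypothetical.

```python
import math


def representative_line(ls_center, ls_angle, ls_length, n):
    """Return (start, end) of the average segment of a micro cluster.

    Assumption: the line is centered at the mean center point, has the
    mean length, and points along the mean angle (in radians).
    """
    cx, cy = ls_center[0] / n, ls_center[1] / n
    theta = ls_angle / n
    half = (ls_length / n) / 2.0
    dx, dy = half * math.cos(theta), half * math.sin(theta)
    return (cx - dx, cy - dy), (cx + dx, cy + dy)


# Two horizontal segments with centers (1, 2) and (3, 4), lengths 4 and 6.
start, end = representative_line(ls_center=(4.0, 6.0), ls_angle=0.0,
                                 ls_length=10.0, n=2)
```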

Definition 3. The distance between the representative line $L_j$ of $TMC_j$ and a trajectory segment $L_i$ is equal to the sum of three components, the center distance $d_{c}$, the angle distance $d_{\theta}$, and the parallel distance $d_{\parallel}$, as illustrated in Figure 4:
$$dist(L_i, L_j) = d_{c}(L_i, L_j) + d_{\theta}(L_i, L_j) + d_{\parallel}(L_i, L_j),$$
where $d_{c}$ is the Euclidean distance between the centers of the two segments. The angle distance $d_{\theta}$ depends on the lengths $\|L_i\|$ and $\|L_j\|$ of the two segments and on $\theta$, the smallest angle between $L_i$ and $L_j$. The parallel distance $d_{\parallel}$ is derived from the Euclidean distances between the end points of the two line segments.
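
The following sketch shows one way the three components could be combined. The center distance follows the definition directly. The angle and parallel formulas are assumptions borrowed from the usual TRACLUS/TCMM-style line segment distance (shorter length times the sine of the smallest angle, and the smaller of the start-to-start and end-to-end distances); they are not confirmed by the text above.

```python
import math


def midpoint(seg):
    (x1, y1), (x2, y2) = seg
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def length(seg):
    (x1, y1), (x2, y2) = seg
    return math.hypot(x2 - x1, y2 - y1)


def smallest_angle(a, b):
    """Smallest angle between the two lines, in [0, pi/2] (assumed range)."""
    ta = math.atan2(a[1][1] - a[0][1], a[1][0] - a[0][0])
    tb = math.atan2(b[1][1] - b[0][1], b[1][0] - b[0][0])
    diff = abs(ta - tb) % math.pi
    return min(diff, math.pi - diff)


def segment_distance(a, b):
    """Sum of center, angle, and parallel distances between two segments."""
    ca, cb = midpoint(a), midpoint(b)
    d_center = math.hypot(ca[0] - cb[0], ca[1] - cb[1])
    # Assumed form: shorter length scaled by the sine of the smallest angle.
    d_angle = min(length(a), length(b)) * math.sin(smallest_angle(a, b))
    # Assumed form: smaller of the start-to-start and end-to-end distances.
    d_start = math.hypot(a[0][0] - b[0][0], a[0][1] - b[0][1])
    d_end = math.hypot(a[1][0] - b[1][0], a[1][1] - b[1][1])
    d_parallel = min(d_start, d_end)
    return d_center + d_angle + d_parallel


a = ((0.0, 0.0), (4.0, 0.0))
b = ((0.0, 3.0), (4.0, 3.0))   # parallel to a, shifted up by 3
c = ((0.0, 0.0), (0.0, 2.0))   # perpendicular to a, sharing its start point
d_ab = segment_distance(a, b)
d_ac = segment_distance(a, c)
```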

Definition 4 (temporal micro cluster extent (δ)). The TMC extent is an indicator of its spatial tightness. The extent of a TMC comprises three components, $\delta_{cen}$, $\delta_{len}$, and $\delta_{\theta}$, since the representation of a TMC contains these three parts, as illustrated in Figure 5. The extents are standard deviations, which can be computed from the corresponding LS and SS as
$$\delta_{\beta} = \sqrt{\frac{SS_{\beta}}{N} - \left(\frac{LS_{\beta}}{N}\right)^{2}},$$
where $N$ is the number of line segments in the temporal micro cluster and β stands for center, length, or angle.
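
Since the extent is a standard deviation recovered from a linear sum and a square sum, it can be sketched as below for a scalar feature such as length or angle. The function name is hypothetical, and the clamp to zero is an added safeguard against small negative values caused by floating-point rounding.

```python
import math


def extent(ls, ss, n):
    """Standard deviation of a scalar feature from its linear and square sums."""
    variance = ss / n - (ls / n) ** 2
    return math.sqrt(max(variance, 0.0))  # clamp guards against rounding error


# Lengths 4 and 6: LS = 10, SS = 16 + 36 = 52, so the deviation around 5 is 1.
length_extent = extent(ls=10.0, ss=52.0, n=2)
single_extent = extent(ls=4.0, ss=16.0, n=1)
```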

4. CC_TRS Algorithm

The core idea of the CC_TRS algorithm is to specify a lifetime (a predefined threshold) for each newly created TMC; TMCs take part in the clustering task only during their lifetime. When a new trajectory segment $L_i$ arrives, the CC_TRS algorithm finds the closest nonexpired $TMC_j$ for it, since expired TMCs are already far, spatially or temporally or both, from newly arriving segments. If the distance between $L_i$ and $TMC_j$ is less than a user-defined distance threshold, $L_i$ is inserted into $TMC_j$ and the information of $TMC_j$ is updated. Otherwise, a new temporal micro cluster is created for $L_i$.

Basically, the algorithm maintains two arrays, called TMC and E_TMC, to control execution time; the data structure of each array entry is given in Definition 1. The CC_TRS algorithm continuously inserts each newly created $TMC_j$ into the TMC array with its status (existence) and creation time. After a while, the TMC array grows and some $TMC_j$ become expired, which degrades the efficiency of the algorithm, since the most time-consuming part is finding the closest $TMC_j$ in the TMC array. Therefore, all expired $TMC_j$ are transferred periodically to the E_TMC array to minimize searching time in the TMC array. Eventually, if the combined size of TMC and E_TMC exceeds the memory constraint, some TMCs are merged based on their spatiotemporal information. Algorithm 1 lists the steps of the CC_TRS algorithm.

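Algorithm 1: CC_TRS algorithm.
Input: Trajectories
(1) Output: Temporal micro clusters
(2) Parameters: distance threshold, life threshold, Φ
(3) for every trajectory do
(4)   for every segment L_i of the trajectory do
(5)     Min_dist = inf;
(6)     for every TMC_j in TMC
(7)       if TMC_j.Expire = true then next
(8)       if L_i.time - TMC_j.time < life threshold   % check the validity of TMC_j %
(9)         dist = distance(L_i, TMC_j);
(10)        if dist < Min_dist then Min_dist = dist; index = j
(11)      else
(12)        TMC_j.Expire = true;
(13)    end;
(14)    if Min_dist < distance threshold then
(15)      Add L_i into TMC_index and update its information
(16)    else
(17)      Create a new micro cluster TMC_new for L_i
(18)      TMC_new.Expire = false; TMC_new.time = L_i.time;
(19)    end;
(20)  end;
(21)  transfer expired TMCs to list E_TMC every Φ seconds
(22)  if size of (TMC, E_TMC) is larger than memory limit then
(23)    Merge similar spatiotemporal micro clusters in E_TMC
(24) end;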

Algorithm 1 performs the creation and updating of temporal micro clusters. In lines (6–13), after the arrival of a new segment $L_i$, the algorithm finds the nearest distance between $L_i$ and all nonexpired TMCs. $TMC_j$ is nonexpired when the difference between the time of $L_i$ and the creation time of $TMC_j$ is less than the threshold. In lines (14–19), if the distance between $L_i$ and the nearest $TMC_j$ is less than the distance threshold, $L_i$ is added to $TMC_j$ and the information of $TMC_j$ is updated; otherwise a new TMC is created for $L_i$ and its status and creation time are set. Line (21) transfers all expired TMCs to the E_TMC array every Φ seconds. In lines (22-23), if the combined size of TMC and E_TMC exceeds the memory constraint, some TMCs are merged based on spatiotemporal information.
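
The online phase can be sketched as follows. To stay short, this illustration reduces each segment to its center point and uses the plain Euclidean distance between centers instead of the full three-part distance; the names `MicroCluster`, `online_step`, and `transfer_expired` are hypothetical, and the memory-triggered merge is left out.

```python
import math
from dataclasses import dataclass


@dataclass
class MicroCluster:
    created: float        # creation time
    sum_x: float = 0.0    # linear sum of member center x coordinates
    sum_y: float = 0.0    # linear sum of member center y coordinates
    n: int = 0
    expired: bool = False

    def add(self, x, y):
        self.sum_x += x
        self.sum_y += y
        self.n += 1

    def center(self):
        return (self.sum_x / self.n, self.sum_y / self.n)


def online_step(clusters, seg_center, seg_time, dist_threshold, life):
    """Assign one segment to the nearest live cluster or open a new one."""
    best, best_dist = None, math.inf
    for c in clusters:
        if c.expired:
            continue
        if seg_time - c.created >= life:   # lifetime elapsed: mark and skip
            c.expired = True
            continue
        cx, cy = c.center()
        d = math.hypot(seg_center[0] - cx, seg_center[1] - cy)
        if d < best_dist:
            best, best_dist = c, d
    if best is None or best_dist >= dist_threshold:
        best = MicroCluster(created=seg_time)
        clusters.append(best)
    best.add(*seg_center)
    return best


def transfer_expired(clusters, archive):
    """Move expired clusters out of the searched list (the E_TMC idea)."""
    archive.extend(c for c in clusters if c.expired)
    clusters[:] = [c for c in clusters if not c.expired]


live, archive = [], []
stream = [((0.0, 0.0), 0.0), ((1.0, 0.0), 5.0),
          ((50.0, 50.0), 6.0), ((0.5, 0.0), 20.0)]
for center, t in stream:
    online_step(live, center, t, dist_threshold=10.0, life=10.0)
transfer_expired(live, archive)
```

In this run the last segment arrives after both earlier clusters have outlived their lifetime, so it opens a new cluster even though it is spatially close to the first one. This is the behaviour that separates repeated visits to the same area at different times.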

4.1. TMC Merging Algorithm

Spatially, CC_TRS adopts the TCMM [5] technique to merge two TMCs. TCMM suggests taking the tightness of micro clusters into consideration when merging them. Furthermore, when the distances between representative lines are equal, the TCMM framework gives priority to merging two loose micro clusters rather than two tight micro clusters, since merging very tight micro clusters breaks their tightness, as in Figure 6(a), while merging two loose micro clusters does not hurt their already loose tightness, as illustrated in Figure 6(b).

The spatial distance between $TMC_i$ and $TMC_j$ is equivalent to the distance between their representative lines, taking the extent of each TMC into account. Note that the extent δ is used to strengthen the similarity of two loose micro clusters in order to give them priority to merge, as illustrated in Figure 7. The distance between $TMC_i$ and $TMC_j$ contains three parts, each computed with extent: the center distance, the angle distance, and the parallel distance.

When the size of the TMCs exceeds the memory limit, TMCs must be merged to satisfy the memory constraint. The merging algorithm for a given set of TMCs is shown in Algorithm 2. Note that CC_TRS maintains both temporal and spatial information during the merging process, while TCMM maintains only spatial features.

Algorithm 2: TMC merging algorithm.
Input: set of temporal micro clusters, arranged temporally
Calculate the spatial similarity between every two consecutive TMCs
Arrange the pairs from the most similar to the least similar TMCs
Merge the most similar TMCs until the size of the TMCs satisfies the memory limit
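
A minimal sketch of the merging idea is given below. It makes several simplifying assumptions that are not taken from the text: clusters are dictionaries holding only center sums, a count, and a creation time; similarity is the Euclidean distance between cluster centers without extent; the most similar consecutive pair is recomputed after every merge instead of sorting once; and the merged cluster keeps the earlier creation time.

```python
import math


def merge_until_fits(tmcs, max_count):
    """Greedily merge temporally adjacent clusters until max_count remain."""
    tmcs = [dict(t) for t in tmcs]  # work on copies, input stays untouched

    def center(t):
        return (t["sum_x"] / t["n"], t["sum_y"] / t["n"])

    while len(tmcs) > max_count and len(tmcs) > 1:
        # Spatial gap between every two consecutive clusters.
        gaps = []
        for i in range(len(tmcs) - 1):
            (ax, ay), (bx, by) = center(tmcs[i]), center(tmcs[i + 1])
            gaps.append(math.hypot(ax - bx, ay - by))
        i = gaps.index(min(gaps))   # most similar adjacent pair
        a, b = tmcs[i], tmcs[i + 1]
        merged = {
            "sum_x": a["sum_x"] + b["sum_x"],
            "sum_y": a["sum_y"] + b["sum_y"],
            "n": a["n"] + b["n"],
            "created": min(a["created"], b["created"]),  # assumed rule
        }
        tmcs[i:i + 2] = [merged]
    return tmcs


clusters = [
    {"sum_x": 0.0, "sum_y": 0.0, "n": 1, "created": 0.0},
    {"sum_x": 1.0, "sum_y": 0.0, "n": 1, "created": 10.0},
    {"sum_x": 20.0, "sum_y": 0.0, "n": 1, "created": 20.0},
    {"sum_x": 21.0, "sum_y": 1.0, "n": 1, "created": 30.0},
]
result = merge_until_fits(clusters, max_count=2)
```

Restricting candidates to consecutive clusters in temporal order is what keeps temporally distant visits to the same area from being merged together.
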

4.2. Trajectory Macro Clustering

Macro clustering is invoked when the user requests the overall results. Any density based clustering algorithm can be used for macro clustering by replacing the distance between spatial points with the distance between temporal micro clusters, as depicted in Figure 8. The distance between temporal micro clusters is defined in (6).
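
To illustrate how a density based method accepts a custom distance, here is a compact DBSCAN written against an arbitrary `distance` callable. DBSCAN is chosen only as a familiar example of a density based algorithm; the parameter names `eps` and `min_pts` and the use of cluster centers in the demo are assumptions for illustration.

```python
import math


def dbscan(items, distance, eps, min_pts):
    """Plain DBSCAN over arbitrary items; returns a label per item (-1 = noise)."""
    labels = [None] * len(items)

    def neighbors(i):
        return [j for j in range(len(items))
                if distance(items[i], items[j]) <= eps]

    cluster = -1
    for i in range(len(items)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # provisional noise, may become a border
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbors(j)
            if len(reach) >= min_pts:
                queue.extend(reach)  # j is a core point, keep expanding
    return labels


def center_distance(a, b):
    # Stand-in for the distance between two temporal micro clusters.
    return math.hypot(a[0] - b[0], a[1] - b[1])


micro_centers = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11), (50, 50)]
macro_labels = dbscan(micro_centers, center_distance, eps=1.5, min_pts=3)
```

Swapping `center_distance` for a function that compares representative lines is the only change needed to cluster micro clusters rather than points.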

5. Performance Evaluation

In this section, the performance of the CC_TRS algorithm is tested and compared with the TCMM framework and ConTraClu. Two real data sets called Elk1993 and Deer1995 are used; Elk1993 (ELK 93) has 33 trajectories and 47204 points, while Deer1995 (Deer 95) has 32 trajectories and 20065 points. Any trajectory segmentation algorithm, such as those in [10, 21], can be used to divide each trajectory into a set of segments. Matlab R2012b and Excel 2013 were used to implement the algorithm and plot the charts.

5.1. Clustering Quality Evaluation

The sum of square distances (SSQ) was used to compare the clustering quality of CC_TRS with that of TCMM. The SSQ of the trajectory segments is equal to the sum of the squared distances between each segment and its closest TMC representative line, as given in (10) and (11); the value of the distance threshold is set to 600. Figure 9 shows an improvement in the clustering quality of CC_TRS compared with TCMM on the Deer 95 and ELK 93 data sets. For the Deer 95 data set, the maximum improvement is 4.7% when the number of time bins is 100, while the minimum improvement is 3.75% when the number of time bins is 200, as illustrated in Figure 9(a). For the ELK 93 data set, the maximum improvement is 3.8% when the number of time bins is 200, while the minimum improvement is 3.5% when the number of time bins is 100, as depicted in Figure 9(b). Therefore, the improvements in clustering quality lie within 3.5–5% (the smaller the SSQ, the better the clustering quality).
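
The SSQ measure described above can be sketched as follows. For brevity the example measures distance between center points rather than between a segment and a representative line, and the helper `improvement_percent` is a hypothetical convenience for expressing one SSQ relative to a baseline; the numbers are toy values, not the paper's data.

```python
import math


def ssq(segment_centers, cluster_centers):
    """Sum of squared distances from each segment to its closest cluster."""
    total = 0.0
    for sx, sy in segment_centers:
        nearest = min(math.hypot(sx - cx, sy - cy)
                      for cx, cy in cluster_centers)
        total += nearest ** 2
    return total


def improvement_percent(baseline, candidate):
    """Relative SSQ reduction of `candidate` with respect to `baseline`."""
    return 100.0 * (baseline - candidate) / baseline


segments = [(0.0, 0.0), (3.0, 4.0), (10.0, 0.0)]
clusters = [(0.0, 0.0), (10.0, 3.0)]
score = ssq(segments, clusters)          # 0 + 25 + 9
gain = improvement_percent(100.0, 95.3)  # toy values
```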

5.2. Efficiency Evaluation (Time and Memory Space)

CC_TRS was compared with the TCMM and ConTraClu algorithms to evaluate its efficiency in terms of execution time and space requirements. TCMM and ConTraClu divide stream data into a set of time bins and cluster each time bin separately, while CC_TRS allocates a lifespan to each newly created cluster. To make the comparison fair, we set the cluster life equal to the time bin, as in Figures 1 and 3. Equation (12) can be used to calculate the cluster life (time bin):
$$\text{cluster life} = \frac{T_{last} - T_{first}}{NOB}, \qquad (12)$$
where $T_{first}$ and $T_{last}$ are the time stamps of the first and last segments in the data stream and NOB is the number of time bins, which is specified by the user. The time stamps of the first and last segments are $4.6896 \times 10^{4}$ and $4.9320 \times 10^{4}$ minutes for the ELK 93 data set and $6.3553 \times 10^{4}$ and $6.6840 \times 10^{4}$ minutes for the Deer 95 data set. The algorithms were run several times for different values of NOB (100, 200, 300, and 400); Figures 10(a) and 10(b) show that the execution time of CC_TRS is less than that of TCMM and ConTraClu.
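
As a worked example of the cluster life calculation (the time span of the stream divided by NOB), the snippet below applies it to the time stamps quoted above. The function name is hypothetical; the results are in minutes because the stamps are given in minutes.

```python
def cluster_life(first_stamp, last_stamp, nob):
    """Length of one time bin, used as the cluster lifetime."""
    return (last_stamp - first_stamp) / nob


NOB_VALUES = (100, 200, 300, 400)
elk_life = {nob: cluster_life(4.6896e4, 4.9320e4, nob) for nob in NOB_VALUES}
deer_life = {nob: cluster_life(6.3553e4, 6.6840e4, nob) for nob in NOB_VALUES}
```

For ELK 93 the stream spans 2424 minutes, so NOB = 100 gives a lifetime of about 24.24 minutes; for Deer 95 the span is 3287 minutes, giving about 32.87 minutes at NOB = 100.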

On the other hand, the tests show that CC_TRS needs 10–15% more memory space than TCMM, as illustrated in Figures 11(a) and 11(b). The size of a micro cluster in TCMM is 28 bytes, since the micro cluster has 7 fields and each field is declared as a Matlab uint32 variable (4 bytes). The TMC size is 32 bytes and one bit, since it has two extra fields: the first saves the creation time of the TMC, and the second is a logical variable that saves the status of the TMC (expired or nonexpired). The TMC size is therefore about 1.15 times the micro cluster size, so most of the extra memory required by CC_TRS comes from the additional temporal fields in the TMC data structure. Note that we compare the memory requirements of CC_TRS only with TCMM, since both algorithms have similar data structures.
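
The size ratio quoted above can be checked with a short calculation. The bit-level accounting follows the text (one bit for the status flag); how a logical value is actually stored in memory may differ in practice, so treat the figures as the paper's nominal sizes.

```python
FIELD_BYTES = 4                      # one uint32 field

mc_bits = 7 * FIELD_BYTES * 8        # TCMM micro cluster: 7 fields = 28 bytes
tmc_bits = 8 * FIELD_BYTES * 8 + 1   # TMC: 8 uint32 fields plus a 1-bit flag
ratio = tmc_bits / mc_bits           # nominal TMC size relative to MC size
```

The ratio comes out at roughly 1.147, which rounds to the factor of 1.15 stated in the text.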

5.3. Parameter Φ Effect

To explain the effect of the Φ parameter on the execution time of CC_TRS, the algorithm was run several times with different values of Φ in the range 1000–3500 seconds. We set NOB to 400 and the distance threshold to 600, since these values give the minimum execution time for the Deer 95 data set, as shown in Figure 10(a). Figure 12 shows that the minimum execution time is achieved when Φ is equal to 2000 seconds.

5.4. Distance Threshold Parameter Effect

In this section, we describe the effects of the distance threshold on clustering quality and running time using the Deer 95 data set. The smaller the threshold, the better the clustering quality, but the longer the execution time. On the other hand, the larger the threshold, the faster the running time, but more information is lost in micro clustering. For example, the clustering quality for the smaller threshold value (red line) is better than for the larger value (blue line), as illustrated in Figure 13(a), while the running time is higher for the smaller value, as illustrated in Figure 13(b). As a consequence, a trade-off between clustering quality and running time is required to get the best results.

6. Conclusion

In this article, the CC_TRS algorithm is proposed for clustering trajectory stream data of free moving objects. The algorithm consists of two phases: an online phase and an offline phase. In the online phase, the CC_TRS algorithm uses a lifespan technique, which assigns a lifetime to each newly created temporal micro cluster, instead of the time window technique, which divides stream data into time bins and clusters each time bin separately. In the offline phase, a density based clustering approach is used to present the clustering results on user demand. In tests on the two data sets Deer 95 and ELK 93, CC_TRS reduces running time by 50% and 20% compared with TCMM and ConTraClu, respectively. In addition, the clustering quality is improved by 3.5–5% compared with TCMM, based on the sum of square distances (SSQ). On the other hand, the CC_TRS algorithm needs 10–15% more memory space than the TCMM framework, since the data structure of the temporal micro cluster has extra temporal fields.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.