An Asynchronous Periodic Sequential Pattern Mining Algorithm with Multiple Minimum Item Supports for Ad Hoc Networking

Yu, Xiangzhan; Zhang, Zhaoxin; Yu, Haining; Jiang, Feng; Ji, Wen

doi:https://doi.org/10.1155/2015/461659

Journal of Sensors

Special Issue

Bioinspired Mechanisms in Wireless Ad Hoc and Sensor Networks

Research Article | Open Access

Volume 2015 | Article ID 461659 | https://doi.org/10.1155/2015/461659

Xiangzhan Yu,¹ Zhaoxin Zhang,¹ Haining Yu,¹ Feng Jiang,¹ and Wen Ji²

Academic Editor: Daniel Bo-Wei Chen

Received 09 Jan 2015

Accepted 28 Feb 2015

Published 04 Aug 2015

Abstract

The original sequential pattern mining model only considers occurrence frequencies of sequential patterns, disregarding their occurrence periodicity. We propose an asynchronous periodic sequential pattern mining model to discover the sequential patterns that not only occur frequently but also appear periodically. For this mining model, we propose a pattern-growth mining algorithm to mine asynchronous periodic sequential patterns with multiple minimum item supports. This algorithm employs a divide-and-conquer strategy to mine asynchronous periodic sequential patterns in a depth-first manner recursively. We describe the process of algorithm realization and demonstrate the efficiency and stability of the algorithm through experimental results.

1. Introduction

Ad hoc networking has recently attracted wide attention as networks grow to include large numbers of nodes. Heterogeneous devices and heterogeneous environments make ad hoc networking more difficult. The sequential pattern mining model is used to discover all sequential patterns that occur frequently in a sequence database; the target database is shown in Table 1. The spatiotemporal sequence mining model is used to discover all frequent spatiotemporal sequential patterns in a spatiotemporal sequence database; the target database is depicted in Table 2. The periodic pattern mining model is used to discover all periodic patterns in a temporal sequence database; the target database is shown in Table 3.

According to the three mining models, the time property of a sequence can be divided into two levels. The fine-grained time property annotates each item set of each sequence, such as the subscript of every item set in Table 4. The coarse-grained time property annotates each sequence, such as tid in Table 4. The fine-grained time property is a further division of the coarse-grained time property. The sequential pattern mining model uses the fine-grained time property only to sort items and ignores the coarse-grained time property of a sequence. The spatiotemporal sequence mining model also uses the fine-grained time property: it takes the temporal annotation as a criterion for judging whether two sequences are equal, and it tolerates deviations of the annotation. Nevertheless, both models concentrate only on the frequency of sequences and ignore how the sequences are distributed in a database. The periodic pattern mining model takes the fine-grained time property as a criterion to discover periodic patterns, and it uses periodic patterns within the fine-grained time property to calculate the patterns within the coarse-grained time property. However, its restrictions on the temporal annotations of item sets are too strict (the fine-grained time property must be divided into unit time intervals), and frequent deviations of the annotations of item sets are not permitted while mining. In addition, when the time series is sparse, the periodic pattern mining model generates a myriad of sparse periodic patterns with little meaning.

The original sequential pattern mining model only considers the occurrence frequencies of sequential patterns and disregards their occurrence periodicity. Therefore, we propose an asynchronous periodic sequential pattern mining model, and, for this model, we propose a pattern-growth mining algorithm to mine asynchronous periodic sequential patterns with multiple minimum supports. The algorithm is a recursive algorithm that uses a divide-and-conquer strategy to mine the patterns, and the search is depth-first.

The remainder of this paper is organized as follows. Section 2 gives a brief overview of recent related research on sequential pattern mining and periodic sequential pattern mining. In Section 3, definitions of related concepts are introduced. In Section 4, the asynchronous and synchronous periodic sequential pattern mining models are presented. Section 5 proposes a pattern-growth mining algorithm to mine asynchronous periodic sequential patterns with multiple minimum item supports. In Section 6, the experimental results show the efficiency and stability of the algorithm. We present the conclusions of our study in Section 7.

2. Related Work

Traditional sequential pattern mining algorithms, such as FreeSpan, PrefixSpan [1], and SPADE [2], discover frequent subsequences as patterns in a sequence database. These traditional algorithms are described in detail in [3]. A spatiotemporal sequence mining algorithm is a special type of sequence mining algorithm in which deviations of the temporal annotation and the spatial location of a sequence must be considered. In [4], a spatiotemporal sequence mining algorithm based on PrefixSpan is proposed. In this algorithm, temporal annotations are used not only to sort the locations or statuses but also to mine the spatiotemporal sequence directly. However, the complexity of the algorithm increases dramatically as the spatial dimension grows.

Traditional periodic pattern mining algorithms discover periodic patterns in time-related databases, and these algorithms can be divided into several categories by different characteristics:
(1) The full periodic pattern mining method and the partial periodic pattern mining method: full periodic patterns [5] are those in which all items of the time series take part in the periodic behaviour, and such patterns almost never occur because the constraints are fairly strict; in partial periodic patterns [6], only a portion of the items of the time series reflects the periodicity. Compared with full periodic patterns, partial periodic patterns are looser and more realistic.
(2) The synchronous periodic pattern mining method and the asynchronous periodic pattern mining method: synchronous periodic patterns [6] are those in which, if a pattern appears at time $t$, it definitely appears at $t + k \cdot \mathrm{period}$ ($k = 1, 2, \ldots$), where period represents the length of the period, and occurrences that do not happen at such fixed times are treated as irrelevant; asynchronous periodic patterns [7] are those in which, if a pattern appears at time $t$, instead of taking place only at $t + k \cdot \mathrm{period}$ ($k = 1, 2, \ldots$), the pattern may appear at any time of the time series. Synchronous periodic patterns are a special case of asynchronous periodic patterns, and the latter are more realistic.

The full periodic pattern mining method has been widely studied in the field of signal analysis. Fast Fourier transforms and Wavelet analysis are often used to find the full periodic patterns in time series data. In [8], a partial periodic pattern mining algorithm based on the downward closure property is proposed. To improve efficiency, this algorithm builds a max-subpattern tree to separate partial periodic patterns. The authors in [9] proposed a convolution-based algorithm employing the improved fast Fourier transform to mine the partial periodic patterns and discover all possible synchronous periodic patterns. Time series are divided into intensive intervals in [10], and then the synchronous periodic patterns are mined. [7, 11] propose an asynchronous periodic sequential pattern mining algorithm called LSI to find the longest periodic subsequence, in which some sequences whose lengths are less than the threshold value can exist. However, the algorithm is not suitable for the condition of multievents, and only the longest subsequence of the asynchronous periodic patterns can be found, with other subsequences being ignored. Aiming to address these shortcomings, an improved algorithm, SMCA, based on a hash table and enumeration, is proposed by [12]. This algorithm not only implements all functions of LSI but also corrects the defects and improves the efficiency. Reference [13] proposes an algorithm for mining periodic patterns that utilizes the method of [3] to preprocess trajectory sequences and uses the max-subpattern tree proposed by [8] to discover periodic patterns. Reference [14] proposes a method to mine periodic-frequent item sets with approximate periodicity using an interval transaction-ids list tree, and it is extended by [15] to fulfill the requirement of mining periodic-frequent patterns in transactional databases. 
Reference [16] proposes the HACE theorem, which characterizes the features of the Big Data revolution, and a Big Data processing model from the data mining perspective. This data-driven model involves demand-driven aggregation of information sources, mining and analysis, user interest modeling, and security and privacy considerations. Reference [17] proposes a trajectory prediction approach for mobile objects that combines semantic features through pattern mining in users' geographic trajectory data. Reference [18] focuses on dynamic networks and studies the related pattern mining method and its applications in biological and social networks. Reference [19] proposes a pattern mining algorithm called Multiconstraint Closure Conditional Tree. The algorithm sets different limiting conditions to solve the combinatorial explosion problem and the rare item problem. In addition, to guard against noise and other uncertainties, that work introduces a similarity-based pattern matching method that makes the pattern mining more robust.

3. Basic Concept and Definition

Before presenting the asynchronous periodic sequential pattern mining algorithm, we define some basic concepts.

Let $I = \{i_1, i_2, \ldots, i_n\}$ be the set of all items; each item set is a nonempty subset of $I$. A sequence $s$ is an ordered list of item sets, and it can be expressed as $s = \langle e_1 e_2 \cdots e_m \rangle$, where each $e_j$ is an item set and also an element of $s$, which can be expressed as $e_j = (x_1 x_2 \cdots x_k)$, where each $x_l$ is an item. To simplify the expression, the parentheses can be omitted if an element consists of only one item; for example, "$(x)$" can be expressed as "$x$." An item can appear at most once in one element of a sequence, but it can appear in several different elements of a sequence. In general, the items within an element are sorted lexicographically, and the number of elements is the length of the sequence. A sequence of length $l$ is called an $l$-sequence.

The target database of asynchronous periodic sequential pattern mining is defined as follows.

Definition 1 (temporal symbolic sequence database). The temporal symbolic sequence database is a set of tuples; that is, TDB = $\{T_1, T_2, \ldots, T_N\}$, where $N$ is the number of tuples and $T_i$ represents the tuple $(\mathrm{tid}_i, s_i)$, where $\mathrm{tid}_i$ is the time at which the symbolic sequence $s_i$ takes place. In general, the tuples of TDB are sorted in ascending order of tid, and the tids are evenly spaced. Figure 1 shows a temporal symbolic sequence database that consists of 23 tuples.

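Definition 2 (subsequence and supersequence). Sequence $\alpha = \langle a_1 a_2 \cdots a_n \rangle$ is a subsequence of sequence $\beta = \langle b_1 b_2 \cdots b_m \rangle$, and $\beta$ is a supersequence of $\alpha$, which can be expressed as $\alpha \sqsubseteq \beta$, if there exists a set of integers $j_1, j_2, \ldots, j_n$, where $1 \le j_1 < j_2 < \cdots < j_n \le m$ and $a_1 \subseteq b_{j_1}, a_2 \subseteq b_{j_2}, \ldots, a_n \subseteq b_{j_n}$.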
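The containment test of Definition 2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `is_subsequence` and `parse` are hypothetical, and a sequence is assumed to be a list of frozensets, one per element. The sample sequence a(abc)(ac)d(cf) is the one quoted for tid 0 in the text.

```python
from typing import FrozenSet, List, Sequence

ItemSet = FrozenSet[str]


def is_subsequence(alpha: Sequence[ItemSet], beta: Sequence[ItemSet]) -> bool:
    """Return True if alpha is a subsequence of beta.

    Each element of alpha is matched greedily to the earliest remaining
    element of beta that contains it as a subset.
    """
    j = 0
    for a in alpha:
        # Skip elements of beta that do not contain the current element.
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1  # the next element of alpha must match strictly later
    return True


def parse(elements: List[str]) -> List[ItemSet]:
    """Hypothetical helper: turn ["a", "abc"] into [{a}, {a, b, c}]."""
    return [frozenset(e) for e in elements]


s = parse(["a", "abc", "ac", "d", "cf"])  # a(abc)(ac)d(cf)
print(is_subsequence(parse(["a", "bc", "d"]), s))  # a(bc)d is contained
print(is_subsequence(parse(["d", "a"]), s))        # no "a" after "d"
```

Greedy earliest matching is sufficient here because matching an element as early as possible never rules out a later match.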

Definition 3 (containment and appearance). Let $T = (\mathrm{tid}, s)$ be one of the tuples of TDB, and let $\alpha$ be a sequence. We say that $T$ contains $\alpha$, or that $\alpha$ appears at $T$, only if $\alpha$ is a subsequence of $s$.

For instance, as is shown in Figure 1, appears at the tuple (0, a(abc)(ac)d(cf)).

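Definition 4 (prefix). Sequence $\beta = \langle e'_1 e'_2 \cdots e'_m \rangle$ is a prefix of sequence $\alpha = \langle e_1 e_2 \cdots e_n \rangle$ ($m \le n$) only if these three requirements are all met: (i) $e'_i = e_i$, where $i \le m - 1$. (ii) $e'_m \subseteq e_m$. (iii) All items of $(e_m - e'_m)$ are after those of $e'_m$ in lexicographic order.

Definition 5 (postfix). Let sequence $\beta = \langle e'_1 e'_2 \cdots e'_m \rangle$ be a prefix of sequence $\alpha = \langle e_1 e_2 \cdots e_n \rangle$ ($m \le n$). Sequence $\gamma = \langle e''_m e_{m+1} \cdots e_n \rangle$ is called the postfix of $\alpha$ with regard to prefix $\beta$, denoted as $\gamma = \alpha / \beta$ or $\alpha = \beta \cdot \gamma$, where $e''_m = (e_m - e'_m)$.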

Definition 6 (periodic projected database). Construct a temporal symbolic sequence database TDB, and specify a preset parameter of maximum period max_per, max_per (), where tid_f is the first time stamp of TDB and tid_l is the last time stamp. The periodic projected data of TDB that starts at time with a period is composed by the tuple (, ) (), where , 1 ≤ p ≤ max_per, and , and these data can be expressed as .

For instance, as is shown in Figure 1, the tuples of prj_3,1(TDB) are as follows: 1: , 4: , 7: , 10: , 13: , 16: , and 19: .
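One reading of the periodic projection in Definition 6 can be sketched as follows. This is an illustration under stated assumptions: the function name `periodic_projection` is hypothetical, the projection is assumed to keep exactly the tuples whose tid equals the start time plus a whole number of periods, and the toy database (21 tuples with placeholder sequences) stands in for Figure 1, which is not reproduced here.

```python
from typing import List, Tuple

# A tuple of the database is (tid, sequence); the sequence type is left
# opaque because the projection only looks at the tid.
Tuple_ = Tuple[int, object]


def periodic_projection(tdb: List[Tuple_], p: int, t: int) -> List[Tuple_]:
    """Keep the tuples whose tid is t, t + p, t + 2p, ... (assumed reading)."""
    if p <= 0:
        raise ValueError("period must be positive")
    return [(tid, seq) for tid, seq in tdb if tid >= t and (tid - t) % p == 0]


# Toy database with tids 0..20 and placeholder sequences (an assumption).
toy_tdb = [(tid, f"s{tid}") for tid in range(21)]
projected_tids = [tid for tid, _ in periodic_projection(toy_tdb, p=3, t=1)]
print(projected_tids)
```

With period 3 and start time 1, the toy database yields the tids 1, 4, 7, 10, 13, 16, and 19, the same tids listed in the example above.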

Definition 7 (prefix projected databases). If sequence $\alpha$ is an asynchronous periodic sequential pattern in TDB, then the $\alpha$-projected database is the subdatabase that is composed of the postfixes of the sequences in TDB with regard to the prefix $\alpha$, denoted as $\mathrm{TDB}|_{\alpha}$.

Definition 8 (contained segments). A contained segment L of a sequence $\alpha$ with a period of $p$ in TDB is a time list of consecutive tuples of $\mathrm{prj}_{p,t}(\mathrm{TDB})$ that contain $\alpha$, where the offset $t$ is the remainder of the division of the starting position by $p$. The contained segment is described by a 4-tuple ($\alpha$, $p$, rep, begin), where $\alpha$ is the target sequence, $p$ is the period, rep is the number of times that $\alpha$ appears, and begin is the starting position at which the repeated appearance of $\alpha$ starts. L is called a maximum contained segment if it cannot be extended in either direction, that is, $\alpha$ appears neither one period before L begins nor one period after L ends.

As is shown in Figure 1, with $\alpha$ denoting the example sequence, the maximum contained segment $L_2$ is expressed as ($\alpha$, 3, 5, 7), which means that $\alpha$ starts at tid 7 and appears repeatedly a total of 5 times with a period of 3, and the tids at which $\alpha$ appears are 7, 10, 13, 16, and 19. Similarly, $L_1$ is expressed as ($\alpha$, 3, 3, 0), $L_3$ is expressed as ($\alpha$, 3, 3, 14), and both of them are maximum contained segments with a period of 3.

Definition 9 (valid contained segments). A maximum contained segment L is valid only if its repeat count is not less than min_rep; that is, L.rep ≥ min_rep.

As is shown in Figure 1, we assume that min_rep = 3; then, the maximum contained segments $L_1$, $L_2$, and $L_3$ are valid contained segments of $\alpha$ with a period of 3 because their repeat counts are 3, 5, and 3, respectively, which are not less than 3.

Definition 10 (distance between contained segments). Given two contained segments $L$ and $L'$ with the same period $p$ ($L.\mathrm{begin} < L'.\mathrm{begin}$), the distance, dis, between them is the difference between the starting time of $L'$ and the ending time of $L$; that is, $\mathrm{dis} = L'.\mathrm{begin} - [L.\mathrm{begin} + (L.\mathrm{rep} - 1) \times p]$. If dis < 0, we say that $L$ and $L'$ intersect; otherwise, they do not.

As is shown in Figure 1, the distance between $L_1$ and $L_2$ is 1 ($7 - (0 + 2 \times 3)$), the distance between $L_1$ and $L_3$ is 8 ($14 - (0 + 2 \times 3)$), and the distance between $L_2$ and $L_3$ is −5 ($14 - (7 + 4 \times 3)$). Therefore, ($L_1$, $L_2$) and ($L_1$, $L_3$) are not intersecting, while ($L_2$, $L_3$) is intersecting.

Definition 11 (sequences of contained segments). A sequence of contained segments is a sequence that meets the following requirements: (i) the contained segments come from the same database and are of the same sequence with the same period; (ii) the contained segments must be valid; (iii) the contained segments are sorted by increasing starting time; (iv) no two contained segments intersect. The sequence of contained segments of sequence $\alpha$ with a period of $p$ can be expressed as $Q = \langle L_1, L_2, \ldots, L_k \rangle$, where $L_i$ is a valid segment ($1 \le i \le k$), and the distance $\mathrm{dis}_i$ between two consecutive segments can be expressed by the following equation: $\mathrm{dis}_i = L_{i+1}.\mathrm{begin} - [L_i.\mathrm{begin} + (L_i.\mathrm{rep} - 1) \times p]$.

Definition 12 (valid sequences of contained segments). The sequence of contained segments is valid when $\mathrm{dis}_i \le \mathrm{max\_dis} \times p$ and $\sum_{i=1}^{k} L_i.\mathrm{rep} \ge \mathrm{min\_sup}$, where max_dis indicates the preset maximum distance coefficient and min_sup is the preset minimum support of the asynchronous period.

Figure 1 shows that, in the symbolic database, if min_rep = 3, max_dis = 3, and min_sup = 6, then ($L_1$, $L_2$) and ($L_1$, $L_3$) are valid sequences of the contained segments of $\alpha$ with a period of 3.
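The segment arithmetic of Definitions 8 to 12 can be checked with a short Python sketch. The names `Segment`, `distance`, and `is_valid_sequence` are hypothetical. Two points are assumptions rather than statements of the source: the ending time of a segment is taken as begin + (rep − 1) × period, and the support condition of Definition 12 is read as "the repeat counts of the segments sum to at least min_sup". Both readings agree with the numbers of the running example, whose three segments have (period, rep, begin) equal to (3, 3, 0), (3, 5, 7), and (3, 3, 14).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    period: int
    rep: int    # number of repeated appearances
    begin: int  # tid of the first appearance

    @property
    def end(self) -> int:
        """tid of the last appearance (assumed: begin + (rep - 1) * period)."""
        return self.begin + (self.rep - 1) * self.period


def distance(earlier: Segment, later: Segment) -> int:
    """Start of the later segment minus end of the earlier one."""
    return later.begin - earlier.end


def is_valid_sequence(segs: List[Segment], min_rep: int,
                      max_dis: int, min_sup: int) -> bool:
    """Check Definitions 11 and 12 for segments listed in start order."""
    if not segs:
        return False
    p = segs[0].period
    # Same period everywhere, and every segment must itself be valid.
    if any(s.period != p or s.rep < min_rep for s in segs):
        return False
    for a, b in zip(segs, segs[1:]):
        d = distance(a, b)
        # Negative distance means the segments intersect.
        if a.begin >= b.begin or d < 0 or d > max_dis * p:
            return False
    # Assumed reading of the support condition: total repeat count.
    return sum(s.rep for s in segs) >= min_sup


seg_a = Segment(period=3, rep=3, begin=0)
seg_b = Segment(period=3, rep=5, begin=7)
seg_c = Segment(period=3, rep=3, begin=14)
print(distance(seg_a, seg_b), distance(seg_a, seg_c), distance(seg_b, seg_c))
```

Under min_rep = 3, max_dis = 3, and min_sup = 6, the pairs (seg_a, seg_b) and (seg_a, seg_c) pass the check, while (seg_b, seg_c) fails because the two segments intersect.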

4. Periodic Sequential Pattern Mining Model

4.1. Synchronous Periodic Sequential Pattern Mining Model

In a temporal symbolic sequence database TDB, the synchronous periodic support of a sequence $\alpha$ with a period $p$ starting at tid $t$ is the number of tuples of $\mathrm{prj}_{p,t}(\mathrm{TDB})$ that contain $\alpha$. This is shown as follows:

$$\mathrm{sup}_{p,t}(\alpha) = \left| \{\, T \in \mathrm{prj}_{p,t}(\mathrm{TDB}) : T \text{ contains } \alpha \,\} \right|$$

Given a minimum support min_sup (0 < min_sup ≤ 1) and a maximum period max_per (positive and much smaller than the size of TDB), a sequence $\alpha$ in TDB that starts at time $t$ is frequent with the period $p$ if its synchronous periodic support meets min_sup and $0 < p \le$ max_per. A synchronous periodic-frequent sequence is called a synchronous periodic sequential pattern. The goal of a synchronous periodic sequential pattern mining algorithm is to discover all synchronous periodic sequential patterns whose periods are from 1 to max_per.

4.2. Asynchronous Periodic Sequential Pattern Mining Model

In many applications, the periodicity of a sequence is usually not perfect and precise, and there may exist noise between segments of a sequence. However, only the noise of a “sequential deficiency,” not the noise of a “sequential offset,” can be recognized by this model; as a result, many sequential patterns with high value cannot be discovered. To some degree, the interruption of noise can be tolerated by the periodic sequential mining algorithm. In addition, “system behaviour” occurs repeatedly, and, then, it disappears or changes. The sequential patterns with such uncertainty only appear periodically in a portion of TDB.

Based on the analysis above, this paper proposes an asynchronous periodic sequential pattern mining model that can discover the patterns that appear periodically in various time periods of the TDB; also, certain noise can be tolerated in this model. The main idea of the model is as follows. At first, to judge if a sequence is a potential asynchronous periodic sequential pattern, the sequence must appear repeatedly, which means that this sequence has significant periodic trends. Then, the time interval within which the sequence appears periodically would be examined to determine whether the time interval is “random noise” or “the change of system behaviour.” Finally, on the premise that the noise is tolerated, the time periods at which the sequence appears periodically would be linked to obtain the maximum periodic time range.

Concretely, an asynchronous periodic sequential pattern and its mining model are defined as follows.

Definition 13 (asynchronous periodic sequential pattern). Sequence $\alpha$ is an asynchronous periodic sequential pattern in the temporal symbolic database TDB if there exist one or more valid sequences of contained segments relating to $\alpha$; the period of such sequences is then a period of $\alpha$. One asynchronous periodic sequential pattern may therefore have several periods, and each period may involve several valid sequences of contained segments.

Definition 14 (asynchronous periodic sequential pattern mining model). Given the minimum support, the maximum distance coefficients, the minimum repeat count, and the maximum period, the purpose of asynchronous periodic sequence mining is to discover all asynchronous periodic sequential patterns and their valid sequences of contained segments in TDB.

5. AP-PrefixspanM Mining Algorithm

Using a single minimum support requires the assumption that all the items in the database have the same properties and a similar frequency of occurrence. However, this assumption conflicts with actual applications, and, as a result, the sequential patterns that contain rare but important items are omitted. A practical asynchronous periodic sequential pattern mining algorithm should support multiple minimum item supports, which means that users can assign a minimum support to each item, so that sequences with different items can meet different minimum support requirements. With multiple minimum item supports, not only can we prevent the generation of a myriad of meaningless asynchronous periodic sequential patterns, but we can also discover the sequential patterns that contain rare items.

In this paper, we propose a pattern-growth mining algorithm to mine asynchronous periodic sequential patterns with multiple minimum item supports; this algorithm is called AP-PrefixspanM (asynchronous periodic prefix projected sequential patterns with multiple minimum item supports).

5.1. Relationship between Algorithm Parameters

Let MinIS($x$) represent the minimum support of item $x$. The minimum support of the sequential pattern $\alpha$ is the lowest minimum support of all items of $\alpha$; for example, if the item set of $\alpha$ is $\{x_1, x_2, \ldots, x_k\}$, then the minimum support of $\alpha$ is

$$\mathrm{min\_sup}(\alpha) = \min\{\mathrm{MinIS}(x_1), \mathrm{MinIS}(x_2), \ldots, \mathrm{MinIS}(x_k)\}$$

Let MinREP($x$) be the minimum repeat count of item $x$; the minimum repeat count of $\alpha$ is the lowest minimum repeat count of all its items; that is

$$\mathrm{min\_rep}(\alpha) = \min\{\mathrm{MinREP}(x_1), \mathrm{MinREP}(x_2), \ldots, \mathrm{MinREP}(x_k)\}$$

Additionally, let MaxDis($x$) be the maximum interference distance of item $x$, which represents a reasonable bound of disturbance between two valid sequences of contained segments; the maximum interference distance of $\alpha$ is the largest maximum interference distance of all its items; that is

$$\mathrm{max\_dis}(\alpha) = \max\{\mathrm{MaxDis}(x_1), \mathrm{MaxDis}(x_2), \ldots, \mathrm{MaxDis}(x_k)\}$$

To reduce the workload of user settings, we assume that the maximum interference distance and the minimum repeat count are both functions of the minimum support. Specifically, the minimum repeat count of item $x$, MinREP($x$), increases linearly with the minimum support of item $x$, MinIS($x$).

The maximum interference distance of item $x$, MaxDis($x$), decreases exponentially with the minimum support of item $x$, MinIS($x$); the relationship involves a preset constant of interference distance and max(MinIS), the largest minimum support in MinIS.

If MinIS($x_1$) ≤ MinIS($x_2$) ≤ ⋯ ≤ MinIS($x_n$), then MinREP($x_1$) ≤ MinREP($x_2$) ≤ ⋯ ≤ MinREP($x_n$) and MaxDis($x_1$) ≥ MaxDis($x_2$) ≥ ⋯ ≥ MaxDis($x_n$).
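The way per-item parameters combine into pattern-level thresholds can be sketched as follows. This is a minimal illustration: the function name `pattern_thresholds` and the numeric values are made up for the example, and the per-item vectors are simply given as dictionaries rather than derived from the linear and exponential relationships, whose exact form is not reproduced here.

```python
from typing import Dict, Iterable, Tuple


def pattern_thresholds(items: Iterable[str],
                       min_is: Dict[str, float],
                       min_rep: Dict[str, int],
                       max_dis: Dict[str, int]) -> Tuple[float, int, int]:
    """Combine per-item parameters into the thresholds of one pattern.

    The pattern takes the lowest minimum support, the lowest minimum
    repeat count, and the largest maximum interference distance among
    its items.
    """
    items = list(items)
    if not items:
        raise ValueError("a pattern needs at least one item")
    return (
        min(min_is[x] for x in items),
        min(min_rep[x] for x in items),
        max(max_dis[x] for x in items),
    )


# Illustrative values only: a rare item "r" and a common item "c".
MIN_IS = {"r": 0.02, "c": 0.10}
MIN_REP = {"r": 2, "c": 5}
MAX_DIS = {"r": 6, "c": 2}

print(pattern_thresholds(["r", "c"], MIN_IS, MIN_REP, MAX_DIS))
```

A pattern that contains the rare item inherits that item's looser thresholds, which is what allows patterns with rare items to survive.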

5.2. The Main Idea and Steps of the Algorithm

For the purpose of reducing the search space of asynchronous periodic symbolic sequential pattern mining, the downward closure property is used in the asynchronous periodic sequential pattern mining model with single minimum support.

Property (Downward Closure Property). If the sequence is an asynchronous periodic sequential pattern and its set of valid sequences of contained segments is QSet, then we can say that all the nonempty subsequences of are also asynchronous periodic sequential patterns, and the set of valid sequences of contained segments of these subsequences is the superset of QSet.

However, because the minimum item support of an asynchronous periodic sequential pattern may be less than its subsequence’s with multiple minimum item supports, the downward closure property cannot be used directly to prune the search space.

The main idea of the AP-PrefixspanM algorithm is based on the divide-and-conquer strategy. The problem of asynchronous periodic sequential pattern mining is divided progressively into a series of subproblems that are not intersecting. The framework of database division in the AP-PrefixspanM algorithm is shown in Figure 2. Two dividing methods are adopted:
(1) Prune the search space with the downward closure property: at level 1, the database is divided into a series of subdatabases by the minimum support vector.
(2) Generate asynchronous periodic sequential patterns with pattern-growth: from the second level down, the prefix projected database is generated recursively by the prefix.

Firstly, the AP-PrefixspanM algorithm divides the database based on minimum support vectors. Then, a series of subdatabases are generated and mined by the single minimum support. Each division is based on a frequent item called a key item. The single minimum support utilized is called the minimum support of the key item. The downward closure property will not be destroyed while mining, and the specific splitting steps of data are described as follows.

Step 1. Scan the database TDB and obtain each item $x$ whose real support is at least MinIS($x$). Such items are called frequent items, and they are placed in ascending order according to their MinIS to obtain $\langle x_1, x_2, \ldots, x_n \rangle$ (MinIS($x_1$) ≤ MinIS($x_2$) ≤ ⋯ ≤ MinIS($x_n$)).

Step 2. The complete set of asynchronous periodic sequential patterns in TDB can be divided into the following mutually disjoint subsets, among which some subsets may be empty:
(1) sequential patterns that contain $x_1$;
(2) sequential patterns that contain $x_2$ but do not contain $x_1$;
⋮
(i) sequential patterns that contain $x_i$ but do not contain $x_1, \ldots, x_{i-1}$;
⋮
(n) sequential patterns that only contain $x_n$.
Divide TDB by these subsets, and then mine the subdatabases to obtain the subsets of these asynchronous periodic sequential patterns:
TDB₁ (key item $x_1$): delete the infrequent items of tuples and the tuples that do not contain $x_1$;
TDB₂ (key item $x_2$): delete the infrequent items of tuples, $x_1$, and the tuples that do not contain $x_2$;
⋮
TDB_i (key item $x_i$): delete the infrequent items of tuples, $x_1, \ldots, x_{i-1}$, and the tuples that do not contain $x_i$;
⋮
TDB_n (key item $x_n$): delete all the items of tuples except for $x_n$ and the tuples that do not contain $x_n$.
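The division of Step 2 can be sketched in Python. This is an illustrative reading, not the authors' code: the function name `split_by_key_item` is hypothetical, a database is assumed to be a list of (tid, sequence) pairs with each sequence a list of sets, and elements that become empty after item deletion are assumed to be dropped.

```python
from typing import Dict, List, Set, Tuple

Sequence_ = List[Set[str]]
Database = List[Tuple[int, Sequence_]]


def split_by_key_item(tdb: Database, sorted_frequent: List[str]) -> Dict[str, Database]:
    """Build one subdatabase per key item.

    sorted_frequent lists the frequent items in ascending order of their
    minimum item support. The subdatabase of the i-th item keeps only the
    items from position i onward and only the tuples that still contain
    the key item.
    """
    subdbs: Dict[str, Database] = {}
    for i, key in enumerate(sorted_frequent):
        allowed = set(sorted_frequent[i:])  # drops infrequent and earlier items
        sub: Database = []
        for tid, seq in tdb:
            filtered = [e & allowed for e in seq]
            filtered = [e for e in filtered if e]  # assumed: drop empty elements
            if any(key in e for e in filtered):
                sub.append((tid, filtered))
        subdbs[key] = sub
    return subdbs


# Toy data: "z" is infrequent; a, b, c are frequent with ascending MinIS.
toy = [
    (0, [{"a", "b"}, {"c"}]),
    (1, [{"b"}, {"z"}]),
    (2, [{"c"}]),
]
parts = split_by_key_item(toy, ["a", "b", "c"])
for key, sub in parts.items():
    print(key, sub)
```

Because every pattern found in the subdatabase of a key item contains that item and none of the earlier ones, a single minimum support (that of the key item) applies inside each subdatabase.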

Mine the asynchronous periodic sequential patterns in TDB_i with the minimum support MinIS($x_i$). This problem can be divided into a series of mutually disjoint subproblems.
(1) Let $\{\langle y_1 \rangle, \ldots, \langle y_m \rangle\}$ be the complete set of length-1 asynchronous periodic sequential patterns in TDB_i. The complete set of patterns in TDB_i can be divided into $m$ mutually disjoint subsets. The $j$th subset ($1 \le j \le m$) is the set of asynchronous periodic sequential patterns whose prefix is $\langle y_j \rangle$.
(2) Let $\alpha$ be a length-$l$ asynchronous periodic sequential pattern, and let $\{\beta_1, \ldots, \beta_q\}$ be the set of asynchronous periodic sequential patterns whose prefix is $\alpha$ and whose length is $l + 1$. Apart from $\alpha$ itself, the complete set of asynchronous periodic sequential patterns whose prefix is $\alpha$ can be divided into $q$ mutually disjoint subsets. The $k$th subset ($1 \le k \le q$) is the subset of asynchronous periodic sequential patterns whose prefix is $\beta_k$, as is shown in Figure 3.

Based on the descriptions above, the problem of asynchronous periodic sequential pattern mining can be divided recursively, which suggests that each subset can be further divided. Thus, a divide-and-conquer framework is composed. To mine the subsets of asynchronous periodic sequential patterns, we can construct corresponding prefix projected databases and mine each database recursively. The main step is shown as follows:

Step 1 (generate parameters). Regarding MinIS as the minimum support vector, generate the minimum repeat count vector, MinREP, and the maximum interference distance vector, MaxDis.

Step 2 (find the length-1 sequential patterns). Scan the database once and generate all length-1 asynchronous periodic sequential patterns. The length of all patterns will be increased by 1 when the prefix is extended by those length-1 sequential patterns.

Step 3 (divide the search space). Divide the complete set of asynchronous periodic sequential patterns into the subsets whose prefixes are the patterns whose lengths have been increased by 1.

Step 4. Construct the prefix projected databases, and, then, discover the subsets of each asynchronous periodic sequential pattern recursively. Specifically, repeat the steps above until no length-1 asynchronous periodic sequential patterns can be generated.

In the mining process described above, only the asynchronous periodic sequential patterns will be added to the mining result. Other patterns will be deleted after mining.

5.3. Implementation of the Algorithm

The pseudocode of AP-PrefixspanM is described in Algorithm 1. Firstly, regarding MinIS as the minimum support vector, generate the minimum repeat count vector MinREP and the maximum interference distance vector MaxDis (as is described in steps (1) and (2) of Algorithm 1).

Algorithm 1: AP-PrefixspanM.
Input: target database TDB, minimum support vector MinIS, and the maximum period max_per.
Output: all the asynchronous periodic sequential patterns and their valid contained segment queues.
(1) MinREP ← generateREP(MinIS);
(2) MaxDis ← generateDIS(MinIS);
(3) ← init_pass(M, TDB);
(4) ← ;
(5) SPF₁ ← sort(frequent items, ASCEND);
(6) FOR each x_i in SPF₁ DO
(7)   // x_{i−1} is the 1-pattern before x_i in SPF₁.
(8)   TDB_i ← delete the infrequent items, x_1, …, x_{i−1}, and the tuples that do not contain x_i;
(9)   MAPPrefixSpan(null, TDB_i, x_i, MinIS(x_i), MinREP(x_i), MaxDis(x_i));
(10) ENDFOR

Then, scan the TDB once and find all the frequent items $x$ for which the real support is more than MinIS($x$). Such items are the potential asynchronous periodic sequential patterns whose lengths are 1 (as is described in steps (3) and (4) of Algorithm 1). Obtain the list SPF₁ by sorting the frequent items in ascending order based on the minimum support vector MinIS (as is described in step (5) of Algorithm 1). Finally, obtain the subdatabases divided according to SPF₁, and, for each subdatabase, call the function MAPPrefixSpan to mine the asynchronous periodic sequential patterns (as is described in steps (6)–(10) of Algorithm 1).

The function MAPPrefixSpan discovers all asynchronous periodic sequential patterns by depth-first search (as is described in Algorithm 2). First, we define a hash map hash_item〈item_id, (count, tids)〉 to record each frequent item's frequency and its time slots (as is described in step (1) of Algorithm 2), where item_id is the key of the hash map, the two-tuple (count, tids) is the value, count is the frequency of item_id, and tids is the time slot queue of item_id. Each item x can be extended to the prefix in two ways to obtain a new sequential pattern: it can join the last item set of the prefix, in which case its item_id is expressed as "_x," or it can join the prefix as a new item set, in which case its item_id is expressed as "x." Scan the projected database TDB, and record in the hash map the frequency and the time slot queue of each of the two kinds of extension (as is described in steps (2)–(8) of Algorithm 2). Delete from the hash map the items whose frequency is less than min_sup (as is described in steps (9)–(11) of Algorithm 2). At this point, each frequent item of hash_item may be extended to the current prefix to generate a new asynchronous periodic sequential pattern. For each item, call the function ASPDetector to determine whether the item can be extended to the prefix to generate a new pattern. If it can, the set of all valid contained segment queues of the new pattern is generated (as is described in step (13) of Algorithm 2). If a frequent item can be extended to the prefix, then generate the new prefix after the extension (as is described in step (15) of Algorithm 2); meanwhile, call the function Projectdatabase to generate the prefix projected database of the current item. If c_item is not included in the prefix, delete from the projected database the tuples that do not contain the key item c_item (as is described in steps (16)–(18) of Algorithm 2).
If the number of tuples in the filtered projected database is not less than min_sup, then call the function MAPPrefixSpan recursively to discover the items that can be extended to the prefix in the smaller prefix projected database, and so discover the progressively growing asynchronous periodic sequential patterns (as is described in steps (19)-(20) of Algorithm 2). In the process of this recursion, all the asynchronous periodic sequential patterns and their valid contained segment queues can be discovered.

Algorithm 2: MAPPrefixSpan.
Input: prefix prefix, temporal symbolic sequence database TDB, key item c_item, minimum support min_sup, minimum repeat count min_rep, maximum period max_per, maximum interference distance max_dis.
(1) hash_item〈item_id, (count, tids)〉;
(2) FOR each tuple T in TDB DO
(3)   IF item x can join prefix as a new item set in T, THEN
(4)     hash_item[x].count++; hash_item[x].tids ← T.tid;
(5)   IF item x can join prefix.litemset in T, THEN
(6)     //prefix.litemset is the last item set of prefix
(7)     hash_item[_x].count++; hash_item[_x].tids ← T.tid;
(8) ENDFOR
(9) FOR each key c in hash_item DO
(10)   IF hash_item[c].count < min_sup THEN delete hash_item[c];
(11) ENDFOR
(12) FOR each key c in hash_item DO
(13)   ASPDetector(c, hash_item[c], c_item, prefix, min_sup, min_rep, max_per, max_dis);
(14)   IF c can be extended to prefix THEN
(15)     newprefix ← extend(c, prefix);
(16)     TDB′ ← Projectdatabase(c, TDB);
(17)     IF prefix notcontain c_item THEN
(18)       delete the tuples of TDB′ that do not contain c_item;
(19)     IF min_sup ≤ |TDB′| THEN
(20)       MAPPrefixSpan(newprefix, TDB′, c_item, min_sup, min_rep, max_per, max_dis);
(21) ENDFOR

The function ASPDetector is used to judge whether the current item can be extended to the prefix, obtain a new growing asynchronous periodic sequential pattern (a new pattern for short), and discover the new pattern’s valid contained segment queue. The judgment of pattern growth consists of three stages: potential period detection (PPD), valid segment detection (VSD), and valid segment mergence (VSM).

The PPD stage (steps (3)–(9) of Algorithm 3) detects all the possible periods of a new pattern. The hash map hash_period〈p, count〉 (with count initialized to 1) records the frequency of each period p. Scan the time slot queue of the parameter occur. Starting with the first time slot, establish a sliding window of length max_per, and calculate the time intervals between the first time slot in the sliding window and the time slots behind it. That is, at an arbitrary time slot tid_i, calculate the interval p = tid_j − tid_i for each later time slot tid_j (j > i). If p ≤ max_per, increase the count of period p in hash_period by one (steps (4) and (5) of Algorithm 3). Otherwise, stop calculating, slide the window to the time slot tid_(i+1) after tid_i, and repeat the calculation. A period p can generate a new pattern only if its frequency is not less than min_rep, that is, hash_period[p].count ≥ min_rep.
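
As an illustration of the PPD stage, the following Python sketch counts the candidate periods of one time slot queue. It is a sketch under stated assumptions rather than the authors' code: tids is assumed to be sorted in ascending order, the counter of each period starts at 1 as in the description above, and the function name potential_periods is hypothetical.

```python
def potential_periods(tids, max_per, min_rep):
    """Return the periods whose count reaches min_rep.

    tids: ascending list of time slots at which the new pattern occurs.
    """
    hash_period = {}
    for i, tid_i in enumerate(tids):
        for j in range(i + 1, len(tids)):
            p = tids[j] - tid_i
            if p > max_per:
                # Later slots are even farther away, so slide the window.
                break
            # The count starts at 1 and is incremented per observed interval.
            hash_period[p] = hash_period.get(p, 1) + 1
    return sorted(p for p, count in hash_period.items() if count >= min_rep)


if __name__ == "__main__":
    print(potential_periods([1, 4, 7, 10, 12], max_per=5, min_rep=3))
```

For the queue 1, 4, 7, 10, 12 with max_per = 5, the interval 3 is observed three times, so only period 3 reaches min_rep = 3 in this sketch.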

Algorithm 3: The function ASPDetector.
Input: frequent item fitem, time slot table occur, key item c_item, prefix prefix, minimum support min_sup, minimum repeat count min_rep, maximum period max_per, maximum interference distance max_dis
(1) hash_period〈p, count ← 1〉;
(2) FOR each tid_i in occur.tids DO
(3)   FOR each tid_j in occur.tids, j > i DO
(4)     IF tid_j − tid_i ≤ max_per THEN
(5)       hash_period[tid_j − tid_i].count++;
(6)     ELSE break;
(7)   ENDFOR
(8) ENDFOR
(9) FOR each key p that hash_period[p].count ≥ min_rep DO
(10)   hash_segment〈pos, (rep, last)〉;
(11)   vs_set (pattern, period, rep, start) ← ⌀;
(12)   FOR each tid in occur.tids DO
(13)     pos ← tid mod p;
(14)     IF (tid − hash_segment[pos].last) == p THEN
(15)       hash_segment[pos].rep++;
(16)       hash_segment[pos].last ← tid;
(17)       continue;
(18)     IF hash_segment[pos].rep ≥ min_rep THEN
(19)       pattern ← extend(prefix, fitem);
(20)       vs_set ← vs_set ∪ (pattern, p, hash_segment[pos].rep, hash_segment[pos].last − (hash_segment[pos].rep − 1) × p);
(21)     hash_segment[pos].rep ← 1;
(22)     hash_segment[pos].last ← tid;
(23)   ENDFOR
(24)   FOR each key pos that hash_segment[pos].rep ≥ min_rep DO
(25)     pattern ← extend(prefix, fitem);
(26)     vs_set ← vs_set ∪ (pattern, p, hash_segment[pos].rep, hash_segment[pos].last − (hash_segment[pos].rep − 1) × p);
(27)   ENDFOR
(28)   svs_set ← sort(vs_set);
(29)   FOR each vs in svs_set DO
(30)     IF (vs.rep ≥ min_sup) && (c_item ∈ pattern) THEN
(31)       Output(vs);
(32)     MergeSeg(vs, svs_set, vs.rep, min_sup, max_dis);
(33)   ENDFOR
(34) ENDFOR

The VSD stage discovers all the valid contained segments of the new pattern. First, for each potential period p, define a hash map hash_segment〈pos, (rep, last)〉 to record the contained segments (step (10) of Algorithm 3), where the key pos is the pattern’s offset within the period, and the value (rep, last) records the repeat count and the latest time slot of the contained segment. At an arbitrary time slot tid_i, tid_i − hash_segment[pos_i].last (with pos_i = tid_i % p) is the time interval between two successive occurrences at the same offset. If this interval is equal to p, the two occurrences are consecutive in the same contained segment. hash_segment[pos_i].rep records the repeat count of a contained segment, and the current contained segment with offset pos_i is valid when hash_segment[pos_i].rep ≥ min_rep. Specifically, for each period p, the time slot queue is scanned once, and, for each time slot tid_i, the offset pos_i = tid_i % p is calculated (step (13) of Algorithm 3). If tid_i − hash_segment[pos_i].last is equal to p, the current contained segment is growing: increment hash_segment[pos_i].rep by one, update hash_segment[pos_i].last to tid_i (steps (15) and (16) of Algorithm 3), and then process the next time slot. Otherwise, the current contained segment is interrupted; if, at that time, hash_segment[pos_i].rep ≥ min_rep, the segment is valid and is recorded as a four-tuple (pattern, period, rep, start) in vs_set (steps (18)–(20) of Algorithm 3). After the interruption, hash_segment[pos_i].rep is reset to 1, and hash_segment[pos_i].last is updated to tid_i (steps (21) and (22) of Algorithm 3). After the whole time slot queue has been scanned, hash_segment is checked once more for valid contained segments that were never interrupted (steps (24)–(27) of Algorithm 3).
Sort all valid contained segments with period p in ascending order of their start time slot (step (28) of Algorithm 3). Then, for each contained segment in turn, call the function MergeSeg, which starts with that segment, merges the contained segments that can follow it, and generates the valid contained segment queues with period p (steps (28)–(31) of Algorithm 3); the valid contained segment queues whose pattern includes the key item are printed.
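
The VSD stage for a single period can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: it omits the pattern field of the four-tuple, it assumes that the start of a segment is the time slot of its first occurrence (last − (rep − 1) × period), and the function name valid_segments is hypothetical.

```python
def valid_segments(tids, period, min_rep):
    """Return the valid contained segments as (period, rep, start) tuples,
    sorted by start time slot.

    tids: ascending list of time slots at which the new pattern occurs.
    """
    hash_segment = {}  # offset pos -> [rep, last]
    vs_set = []
    for tid in tids:
        pos = tid % period
        seg = hash_segment.get(pos)
        if seg is not None and tid - seg[1] == period:
            # The segment at this offset keeps growing.
            seg[0] += 1
            seg[1] = tid
            continue
        # The segment is interrupted: record it if it is long enough.
        if seg is not None and seg[0] >= min_rep:
            vs_set.append((period, seg[0], seg[1] - (seg[0] - 1) * period))
        hash_segment[pos] = [1, tid]
    # Segments still open at the end of the scan.
    for rep, last in hash_segment.values():
        if rep >= min_rep:
            vs_set.append((period, rep, last - (rep - 1) * period))
    return sorted(vs_set, key=lambda s: s[2])


if __name__ == "__main__":
    print(valid_segments([1, 4, 7, 10, 12, 19, 22, 25], period=3, min_rep=3))
```

In this example, the occurrences 1, 4, 7, 10 form one segment with four repetitions, the gap between 10 and 19 interrupts it, and 19, 22, 25 form a second segment with three repetitions.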

The VSM stage generates the valid contained segment queues. The function MergeSeg uses depth-first enumeration to merge all possible contained segments into queues. Different initial contained segments yield different contained segment queues: for each initial segment, the divide-and-conquer strategy finds the segments that can be merged with it, and each of these segments is then regarded as the new initial segment in order to find, recursively, further segments that can be merged. This process repeats until no more segments can be merged. Specifically, if the repeat count of an initial segment is not less than min_sup, print this segment (steps (29) and (30) of Algorithm 3). For any initial segment, scan svs_set once and find the set of segments that can be merged with it directly (steps (2)–(7) of Algorithm 4). For each segment of this set, call the function MergeSeg recursively (step (14) of Algorithm 4) on a smaller segment set (step (11) of Algorithm 4), so that progressively longer segment queues are generated (step (9) of Algorithm 4). Whenever, during merging, the sum of the segments’ repeat counts is not less than min_sup and the key item is included in the pattern, print that segment queue (steps (12) and (13) of Algorithm 4).

Algorithm 4: The function MergeSeg.
Input: prefix segments prefix_segs, contained segment set svs_set, repeat count of prefix segments prefix_sum, minimum support min_sup, maximum interference distance max_dis
(1) VSQ ← ⌀;
(2) FOR each vs in svs_set DO
(3)   tail ← end time slot of the last segment in prefix_segs, computed from its start, period, and rep;
(4)   IF (vs.start − tail) > max_dis THEN break;
(5)   ELSE IF (tail > vs.start) THEN continue;
(6)   ELSE VSQ ← VSQ ∪ vs;
(7) ENDFOR
(8) FOR each vs in VSQ DO
(9)   newprefix_segs ← prefix_segs ∪ vs;
(10)   newprefix_sum ← prefix_sum + vs.rep;
(11)   newsvs_set ← svs_set delete segments before vs;
(12)   IF (newprefix_sum ≥ min_sup) && (c_item ∈ pattern) THEN
(13)     Output(newprefix_segs);
(14)   MergeSeg(newprefix_segs, newsvs_set, newprefix_sum, min_sup, max_dis);
(15) ENDFOR
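
The recursive merging can be sketched in Python as shown below. This is a hedged illustration, not the authors' code. The Segment type, the helper merge_all, and the out list that collects results in place of printing are hypothetical. The sketch assumes that the tail of a segment is the time slot of its last occurrence (start + period × (rep − 1)), that svs_set is sorted by start, and it omits the key item check.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    period: int
    rep: int
    start: int


def merge_seg(prefix_segs, svs_set, prefix_sum, min_sup, max_dis, out):
    """Depth-first merge of the segments that can follow prefix_segs."""
    last = prefix_segs[-1]
    tail = last.start + last.period * (last.rep - 1)  # assumed definition
    for idx, seg in enumerate(svs_set):
        gap = seg.start - tail
        if gap > max_dis:
            break      # sorted by start: later segments are farther away
        if gap < 0:
            continue   # overlaps the current tail
        new_segs = prefix_segs + [seg]
        new_sum = prefix_sum + seg.rep
        if new_sum >= min_sup:
            out.append(new_segs)
        # Recurse on the segments after seg with the updated repeat sum.
        merge_seg(new_segs, svs_set[idx + 1:], new_sum, min_sup, max_dis, out)


def merge_all(svs_set: List[Segment], min_sup: int, max_dis: int):
    """Try every segment as the initial segment of a queue."""
    out = []
    for i, seg in enumerate(svs_set):
        if seg.rep >= min_sup:
            out.append([seg])
        merge_seg([seg], svs_set[i + 1:], seg.rep, min_sup, max_dis, out)
    return out


if __name__ == "__main__":
    segs = [Segment(3, 4, 1), Segment(3, 3, 19), Segment(3, 3, 40)]
    print(merge_all(segs, min_sup=7, max_dis=10))
```

With these three segments, only the first two can be merged: the disturbance between time slots 10 and 19 is within max_dis = 10, whereas the third segment starts too far after either tail.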

6. Experimental Analysis

This paper first proposes an asynchronous periodic sequential pattern mining model, and, to confirm the validity and stability of the algorithm, we also design a synthetic dataset generation algorithm, which is shown in Algorithm 5.

Algorithm 5: Synthetic dataset generation.
Input: parameters and their meaning are shown in Table 5
Output: simulation data set
FOR i = 1 to C DO
  p.period = GaussianDist(P, MAXPER);
  p.pattern = GenPattern(p.period, , );
  position = UniformDist();
  WHILE () DO
    p.local = GeometricDist(, MAX = );
    IF (-position < p.local * p.period) THEN break;
    FOR to p.local DO
      position+=p.period;
      insert p.pattern into TIDS[position];
    ENDFOR
    p.disturbance = p.period * GeometricDist(, MAXDIS);
    position+=p.disturbance;
  ENDWHILE
ENDFOR
FOR to DO
  = PoissonDist(, MAX = );
  = ;
  random generate symbols with ;
ENDFOR

First, the algorithm generates C possible asynchronous periodic sequential patterns, whose periodic lengths are drawn from a normal distribution with expectation P. The average number of items is drawn from a normal distribution with expectation T. The appearance probability has three levels, High, Medium, and Low, and the starting position of each pattern is drawn from a normal distribution with expectation N/5. Then, for each asynchronous periodic sequential pattern, contained segments are generated until the end of the time slots is reached, and the pattern is inserted into the corresponding time slots according to these contained segments. The repeat counts of the contained segments are drawn from a normal distribution with expectation R. Finally, the time slots are scanned and additional items are supplemented into the sequences; the total number of items is drawn from a Poisson distribution.

A synthetic dataset is generated by the dataset generation algorithm with the parameters shown in Table 5. The AP-PrefixspanM algorithm is then used to mine the asynchronous periodic sequential patterns of the synthetic dataset. The maximum period is 20, is 0.2, and is 10.

The mining results of the AP-PrefixspanM algorithm are shown in Table 6. When the minimum support is between 0.2 and 0.4, the mining results contain 5 asynchronous periodic sequential patterns that are preset in the dataset generation algorithm. The insert mode of the largest valid contained segment queue corresponds to these 5 sequential patterns, which means that the AP-PrefixspanM algorithm can precisely discover asynchronous periodic sequential patterns and their valid contained segment queue.

Figure 4 shows the distribution of the lengths of asynchronous periodic sequential patterns under different minimum supports. The results show that the scale and number of sequential patterns decrease as the minimum support increases: only one length-1 asynchronous periodic sequential pattern remains when the minimum support reaches 5%. When the minimum supports of high frequency items and medium frequency items are set very high, such as 5%, but the minimum support of low frequency items is set very low, such as 1%, a large number of length-1 or length-2 asynchronous periodic sequential patterns appear, which shows that even low frequency items are taken into account. Many meaningless sequential patterns would still appear if the minimum support were set very low. Overall, the results indicate that the AP-PrefixspanM algorithm can effectively mine asynchronous periodic sequential patterns with sparse items while limiting the generation of meaningless patterns.

Figure 5 shows the distribution of the periodic lengths of asynchronous periodic sequential patterns under different minimum supports. The results show that, except when the minimum supports of high, medium, and low frequency items are set to 5%, 5%, and 1%, the periodic lengths of asynchronous periodic sequential patterns are all close to 5, which corresponds to the preset parameter P. When the minimum supports of high, medium, and low frequency items are set to 5%, 5%, and 1%, however, the periodic lengths are random, which means that a great number of meaningless sequential patterns are present.

Figure 6 shows the distribution of lengths of valid contained segment queues of asynchronous periodic sequential patterns with different minimum supports. The results show that, except for when the minimum supports of high, medium, and low items are set as 5%, 5%, and 1%, the lengths of most valid contained segment queues are 3 or 4. Although the preset parameter of the dataset generation algorithm is 25, because of the independent insertion and the random supplement of sequential patterns, the repeat counts of many short sequential patterns are larger than 25; therefore, the lengths of most valid contained segment queues are 3 or 4.

These experiments show that the AP-PrefixspanM algorithm is stable and efficient for mining asynchronous periodic sequential patterns.

7. Conclusions

In this paper, we proposed an asynchronous periodic sequential pattern mining model to discover sequential patterns that not only occur frequently but also appear periodically and to recognize the time range of their occurrences. Based on this mining model, we further proposed a pattern-growth mining algorithm, the AP-PrefixspanM algorithm, to mine asynchronous periodic sequential patterns with multiple minimum item supports. This algorithm applies a divide-and-conquer strategy to progressively divide the problem of mining asynchronous periodic sequential patterns into a series of mutually disjoint subproblems and then mines the patterns in the corresponding subdatabases. While the database is being divided, growing asynchronous periodic sequential patterns and their valid contained segment queues are generated, which is exactly what the algorithm targets. Experimental results show the efficiency and stability of the algorithm. The algorithm is applicable to data in which events recur regularly and frequently, such as entity movement trajectory data: it can mine the regular patterns of such trajectory data and predict future movement.

Future work will extend the AP-PrefixspanM algorithm so that it can mine asynchronous periodic spatiotemporal sequential patterns in spatiotemporal sequence databases with multiple minimum item supports.

Conflict of Interests

The authors declare that no conflict of interests exists.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grants 61173144, 61370215, and 61370211 and the National Key Technology R&D Program under Grants 2012BAH37B01 and 2012BAH45B01.

References

J. Pei, J. Han, B. Mortazavi-Asl et al., “Mining sequential patterns by pattern-growth: the prefixspan approach,” IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11, pp. 1424–1440, 2004.
M. J. Zaki, “SPADE: an efficient algorithm for mining frequent sequences,” Machine Learning, vol. 42, no. 1-2, pp. 31–60, 2001.
J. Han, H. Cheng, D. Xin, and X. Yan, “Frequent pattern mining: current status and future directions,” Data Mining and Knowledge Discovery, vol. 15, no. 1, pp. 55–86, 2007.
F. Giannotti, M. Nanni, and D. Pedreschi, “Efficient mining of temporally annotated sequences,” in Proceedings of the 6th SIAM International Conference on Data Mining, pp. 348–359, Bethesda, Md, USA, April 2006.
B. Ozden, S. Ramaswamy, and A. Silberschatz, “Cyclic association rules,” in Proceedings of the 14th International Conference on Data Engineering (ICDE '98), pp. 412–421, February 1998.
J. Han, W. Gong, and Y. Yin, “Mining segment-wise periodic pattern in time related databases,” in Proceedings of the International Conference on Knowledge Discovery and Data Mining, pp. 214–218, 1998.
J. Yang, W. Wang, and P. S. Yu, “Mining asynchronous periodic patterns in time series data,” IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 3, pp. 613–628, 2003.
J. Han, G. Dong, and Y. Yin, “Efficient mining of partial periodic patterns in time series database,” in Proceedings of the 15th International Conference on Data Engineering (ICDE '99), pp. 106–115, March 1999.
M. G. Elfeky, W. G. Aref, and A. K. Elmagarmid, “Periodicity detection in time series databases,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 7, pp. 875–887, 2005.
C. Sheng, W. H. Mong, and L. Lee, “Mining dense periodic patterns in time series data,” in Proceedings of the 22nd International Conference on Data Engineering (ICDE '06), pp. 1–3, April 2006.
W. Wang, J. Yang, and P. S. Yu, “Mining patterns in long sequential data with noise,” ACM SIGKDD Explorations Newsletter, vol. 2, no. 2, pp. 28–33, 2000.
K.-Y. Huang and C.-H. Chang, “SMCA: a general model for mining asynchronous periodic patterns in temporal databases,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 6, pp. 774–785, 2005.
H. Cao, N. Mamoulis, and D. W. Cheung, “Discovery of periodic patterns in spatiotemporal sequences,” IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 4, pp. 453–467, 2007.
K. Amphawan, A. Surarerks, and P. Lenca, “Mining periodic-frequent itemsets with approximate periodicity using interval transaction-ids list tree,” in Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (WKDD '10), pp. 245–248, January 2010.
R. U. Kiran and P. K. Reddy, “Towards efficient mining of periodic-frequent patterns in transactional databases,” in Proceedings of the 21st International Conference on Database and Expert Systems Applications: Part II, pp. 194–208, 2010.
X. Wu, X. Zhu, G.-Q. Wu, and W. Ding, “Data mining with big data,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 1, pp. 97–107, 2014.
J. Huang, P. Zhang, X. Huangfu, and H. Sun, “A trajectory prediction approach for mobile objects by combining semantic features,” Journal of Computer Research and Development, vol. 51, no. 1, pp. 76–87, 2014.
L. Gao, J.-Y. Yang, and G.-M. Qin, “Methods for pattern mining in dynamic networks and applications,” Journal of Software, vol. 24, no. 9, pp. 2042–2061, 2013.
S. Xu and D. Pi, “Mining periodic-frequent patterns of moving objects,” Journal of Chinese Computer Systems, vol. 35, no. 8, pp. 1705–1710, 2014.

Copyright

Copyright © 2015 Xiangzhan Yu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
