Abstract
High-utility sequential pattern mining (HUSPM) is an emerging topic in data mining, where utility is used to measure the importance or weight of a sequence. However, HUSPM ignores the informative knowledge carried by the hierarchical relation between different items, which prevents it from extracting more interesting patterns. In this paper, we incorporate the hierarchical relation of items into HUSPM and propose a two-phase algorithm, MHUH, the first algorithm for high-utility hierarchical sequential pattern mining (HUHSPM). In the first phase, named Extension, we use the existing algorithm FHUSpan, which we proposed earlier, to efficiently mine the general high-utility sequences (g-sequences); in the second phase, named Replacement, we mine the special high-utility sequences with the hierarchical relation (s-sequences) as high-utility hierarchical sequential patterns from the g-sequences. To further improve efficiency, MHUH adopts several strategies, namely Reduction, FGS, and PBS, together with a novel upper bound, TSWU, which greatly reduce the search space. Substantial experiments were conducted on both real and synthetic datasets to assess the performance of MHUH in terms of runtime, number of patterns, and scalability. The experiments show that MHUH efficiently extracts more interesting patterns carrying the underlying informative knowledge in HUHSPM.
1. Introduction
Sequential pattern mining (SPM) [1–3] is an interesting and critical research area in data mining. According to the problem definition [4], a large database of customer transactions has three fields, i.e., customer-id, transaction-time, and the items bought. Each transaction corresponds to an itemset, and all the transactions from a customer are ordered by increasing transaction-time to form a sequence called a customer sequence. The support of a sequence is the number of customer sequences that contain it. If the support of a sequence is larger than a user-specified minimum support, we call it a frequent sequence. In short, the purpose of sequential pattern mining is to discover all frequent sequences, called sequential patterns, which reflect the potential connections between items, from a sequence database under a given minimum support. An example of such a sequential pattern is that customers typically buy a phone, then a phone shell, and then a phone charger. Customers who buy some other commodities in between also support this sequential pattern. In the past decades, many algorithms [1, 5] have been proposed for sequential pattern mining, which has led to its wide application in many realistic scenarios (e.g., consumer behavior analysis [6] and web usage mining [7]). However, sequential pattern mining has two apparent limitations.
Firstly, frequency does not fully reveal the importance (i.e., interest) of a pattern in many situations [8–12]. In fact, many rare but important patterns may be missed under the frequency-based framework. For example, in retail, selling a phone usually brings more profit than selling a bottle of milk, while the quantity of phones sold is much lower than that of milk [9], and the decision-maker tends to emphasize the sequences consisting of high-profit commodities instead of the frequent commodity sequences. This issue led to the emergence of high-utility sequential pattern mining (HUSPM) [8, 12–15]. To represent the relative importance of patterns, each item in the database is associated with a value called external utility (e.g., indicating the unit profit of the item purchased by a customer). In addition, each occurrence of the item is associated with a quantity called internal utility (e.g., indicating the number of units of the item purchased by a customer in a transaction). The utility of a sequence is computed by applying a utility function on all sequences in the database where it appears. The task of high-utility sequential pattern mining is to discover all high-utility sequential patterns (HUSPs, the sequences with high utility) from the quantitative sequence database with a predefined minimum utility threshold. Many high-utility sequential pattern mining algorithms have been proposed in the past decades [13, 16–20], and a series of novel data structures and pruning strategies allow high-utility sequential patterns to be extracted more efficiently. In addition, high-utility sequential pattern mining has many practical applications, including web log data [21], mobile commerce environments [22], and gene regulation data.
Secondly, in sequential pattern mining, the hierarchical relation (e.g., product relation and semantic relation) between different items is ignored, so some underlying knowledge may be missed. In general, the individual items of the input sequences are naturally arranged in a hierarchy [23]. For example, suppose both the sequence ⟨phone, mobile power pack⟩ and the sequence ⟨phone, bluetooth headset⟩ are infrequent; then it seems that there is no association between the three commodities. However, we may find that the sequence ⟨phone, phone accessory⟩ is frequent from the perspective of the product hierarchy, indicating that customers usually buy a phone first and then buy a phone accessory (including “mobile power pack” and “bluetooth headset”). That is to say, products in sequences of customer transactions can be arranged in a product hierarchy, where mobile power pack and bluetooth headset can be generalized to phone accessory. Another example is that the individual words in a text can form a semantic hierarchy. The words drives and driven can be generalized to their common lemma drive, which in turn can be generalized to its part-of-speech tag verb. The concept of hierarchy (i.e., a taxonomy) provides decision-makers with a different perspective from which to analyze sequential patterns. More informative patterns can be extracted through the hierarchy-based methodology. Besides, although the information revealed from the hierarchical perspective may be relatively fuzzy, it reduces the loss of underlying knowledge to a certain extent. In particular, the hierarchical relation between different items is sometimes inherent to the application (e.g., hierarchies of directories or web pages) and is sometimes constructed in a manual or automatic way (e.g., product relation) [23]. Figure 1 shows a simple example of a taxonomy of biology in a real application.
Sequential pattern mining with hierarchical relation can be traced back to the article [6], where hierarchy management was incorporated into sequential pattern mining and the GSP algorithm was proposed to extract sequential patterns according to different levels of hierarchy. Later, sequential pattern mining with hierarchical relation was studied extensively in the literature [24, 25]. Efficient algorithms were proposed in a wide range of real-world applications, such as customer behavior analysis [6, 26] and information extraction [27].
However, to the best of our knowledge, there is no related work that takes both limitations into consideration. In this paper, given a quantitative sequence database (with an external utility table), a user-defined minimum utility threshold, and a series of taxonomies denoting the hierarchical relation, we are committed to finding all high-utility sequences consisting of items with the hierarchical relation (i.e., high-utility hierarchical sequential patterns). In fact, mining such patterns is more complicated than both high-utility sequential pattern mining and sequential pattern mining with hierarchical relation. Firstly, compared with high-utility sequential pattern mining, the introduction of the hierarchical relation leads to large memory consumption and long execution times due to the combinatorial explosion of the search space. Secondly, the methods of sequential pattern mining with hierarchical relation cannot be directly applied, because the downward closure property (also known as the Apriori property) [28] does not hold under the utility-based framework.
To address the above issues, we propose a new algorithm called MHUH (mining high-utility hierarchical sequential patterns) to mine high-utility hierarchical sequential patterns (to be defined later) by adopting several strategies. The major contributions of this paper are as follows.
Firstly, we introduce the concept of hierarchical relation into high-utility sequential pattern mining and formulate the problem of high-utility hierarchical sequential pattern mining (HUHSPM). In particular, the important concepts and components of HUHSPM are defined.
Secondly, we propose a two-phase algorithm named MHUH, the first algorithm for high-utility hierarchical sequential pattern mining. To ensure that the underlying informative knowledge of the hierarchical relation between different items is not missed and to improve the efficiency of extracting HUHSPs, several strategies (i.e., FGS, PBS, and Reduction) and a novel upper bound, TSWU, are proposed.
Thirdly, substantial experiments were conducted on both real and synthetic datasets to assess the performance of the two-phase algorithm MHUH in terms of runtime, number of patterns, and scalability. In particular, the experimental results demonstrate that MHUH can efficiently extract more interesting patterns with underlying informative knowledge in HUHSPM.
The rest of this paper is organized as follows. Related work is briefly reviewed in Section 2. We describe the related definitions and problem statement of HUHSPM in Section 3. The proposed algorithm is presented in Section 4, and an experimental evaluation assessing the performance of the proposed algorithm is shown in Section 5. Finally, a conclusion is drawn and future work is discussed in Section 6.
2. Related Work
In this section, related work is discussed. The section briefly reviews (1) the main approaches for sequential pattern mining, (2) the previous work on high-utility sequential pattern mining, and (3) state-of-the-art algorithms for sequential pattern mining with hierarchical relation.
2.1. Sequential Pattern Mining
Agrawal et al. [28] first presented a novel algorithm, Apriori, which relies on the downward closure property, for association rule mining. The Apriori algorithm is based on a candidate generation approach that repeatedly scans the database to generate and count candidate sequential patterns and prunes those that are infrequent. They then defined the problem of sequential pattern mining over a large database of customer transactions and proposed the efficient algorithms AprioriSome and AprioriAll [4]. Srikant and Agrawal then proposed GSP, which is similar to AprioriAll in its execution process but greatly improves on its performance. As Apriori’s successor, adopting several technologies including time constraints, sliding time windows, and taxonomies, GSP uses a multiple-pass, candidate generation-and-test method to find sequential patterns [6]. Zaki [1] proposed the efficient SPADE algorithm, which needs only three database scans. SPADE utilizes combinatorial properties to decompose the original problem into smaller subproblems, which can be solved independently in main memory using efficient lattice search techniques and simple join operations. Later, SPAM was proposed by Ayres et al. [29]; it is suited to databases whose sequential patterns are very long. To deal with the problems of large search spaces and ineffectiveness in handling dense datasets, Yang et al. [30] proposed a novel algorithm, LAPIN, built on the simple idea that the last position is the key to judging whether to extend a candidate sequential pattern. They then developed the LAPIN-SPAM algorithm by combining LAPIN with SPAM, which outperforms SPAM by up to three times on all kinds of datasets in their experiments. Notably, the property used in SPAM, LAPIN, LAPIN-SPAM, and SPADE, namely that the support of a superpattern is always less than or equal to the support of its subpatterns, is different from the Apriori property used in GSP.
All of the algorithms mentioned above are Apriori-based algorithms [2, 3].
It is known that database scans are time-consuming when discovering sequential patterns. For this reason, a set of pattern growth sequential pattern mining algorithms that avoid recursively scanning the input data were proposed. For example, Han et al. [31] proposed a novel, efficient algorithm, FreeSpan, that uses projected sequential databases to confine the search and growth of subsequences. The projected database can greatly reduce the size of a database. Pei et al. [7] designed a novel data structure called the web access pattern tree, or WAP-tree for short, in their algorithm. The WAP-tree stores highly compressed and critical information, which makes mining access patterns from web logs efficient. Then, Han et al. [32] proposed PrefixSpan with two kinds of database projections: level-by-level projection and bi-level projection. PrefixSpan projects only the corresponding postfix subsequences into the projected database, so it runs considerably faster than both GSP and FreeSpan. Using a pre-order linked, position-coded version of the WAP-tree and providing a position code mechanism, the PLWAP algorithm was proposed by Ezeife et al. based on the WAP-tree [33]. Recently, Sequence-Growth, the parallelized version of the PrefixSpan algorithm, was proposed by Liang and Wu [34]; it adopts a lexicographical order to generate candidate sequences, which avoids an exhaustive search over the transaction databases.
Pattern growth sequential pattern mining algorithms have some drawbacks. In particular, it is time-consuming to build projected databases. Consequently, some algorithms with early pruning strategies were developed to improve efficiency. Chiu et al. [35] designed an efficient algorithm called DISC-all. Unlike previous algorithms, DISC-all adopts the DISC strategy to prune the nonfrequent sequences according to other sequences of the same length instead of frequent sequences of shorter lengths. Recently, a faster algorithm called CloFAST was proposed for mining closed sequential patterns using sparse and vertical id-lists. CloFAST combines a new data representation of the dataset, whose theoretical properties are studied in order to count the support of sequential patterns quickly, with a novel one-step technique both to check sequence closure and to prune the search space [36]. It is more efficient than previous approaches. More details about the background of sequential pattern mining can be found in [2, 3].
2.2. High-Utility Sequential Pattern Mining
To address the problem that frequency does not fully reveal importance in many situations, utility-oriented pattern mining frameworks, for example, high-utility itemset mining (HUIM), have been proposed and extensively studied [12, 37]. Although HUIM algorithms can extract interesting patterns in many real-life applications, they are not able to handle a sequence database where a timestamp is embedded in each item. Many high-utility sequential pattern mining algorithms have been proposed in the past decades [9, 13, 16, 18, 38], and a series of novel data structures and pruning strategies allow high-utility sequential patterns to be extracted more efficiently. Ahmed et al. [13] first defined the problem of mining high-utility sequential patterns and proposed a novel framework for it. They presented two new algorithms, UL and US, to find all high-utility sequential patterns. The UL algorithm, which is simpler and more straightforward, follows the candidate generation approach (based on breadth-first search), while the US algorithm follows the pattern growth approach (based on depth-first search). Both can be regarded as two-phase algorithms. In the first phase, they find a set of high-SWU sequences. In the second phase, they calculate the utility of the high-SWU sequences by scanning the sequence database and output only those whose utility is no less than the threshold minutil.
The two-phase algorithms mentioned above have two important limitations, especially for low minutil values [8]. One limitation is that the set of high-SWU sequences discovered in the first phase needs a considerable amount of memory. The other is that computing the utility of candidate sequences by scanning the sequence database can be very time-consuming. Instead of dividing the algorithm into two phases, Shie et al. [22] proposed a one-phase algorithm named UM-Span for high-utility sequential pattern mining. It improves efficiency by using a projected database-based approach to avoid additional scans of the database to check the actual utilities of patterns. Similarly, a one-phase algorithm named PHUS was proposed by Lan and Hong [39], which adopts an effective upper bound model and an effective projection-based pruning strategy. Furthermore, an indexing strategy was also developed to quickly find the relevant sequences for prefixes in mining, so that unnecessary search time can be reduced.
Yin et al. then enriched the related definitions and concepts of high-utility sequential pattern mining. Two algorithms, USpan [9] and TUS [17], were proposed by Yin et al. for mining high-utility sequential patterns and top-k high-utility sequential patterns, respectively. In USpan, they introduced the lexicographic quantitative sequence tree to represent the search space and designed concatenation mechanisms for calculating the utility of a node and its children with two effective pruning strategies. The width pruning strategy avoids constructing unpromising patterns into the LPTree, while the depth pruning strategy stops USpan from going deeper by identifying the leaf nodes in the tree. Based on USpan, Alkan and Karagoz and Wang et al., respectively, proposed HuspExt [16] and HUS-Span [38] to increase the efficiency of the mining process. Zhang et al. [18] proposed an efficient algorithm named FHUSpan (named HUS-UT in the paper), which adopts a novel data structure named Utility-Table to store the sequence database in memory and the TRSU strategy to reduce the search space. Recently, Gan et al. proposed two efficient algorithms named ProUM [40] and HUSP-ULL [41], respectively, to improve mining efficiency. The former utilizes the projection technique in generating the utility array, while the latter adopts a lexicographic q-sequence (LQS) tree and a utility-linked-list structure to quickly discover HUSPs. More recent developments in HUSPM are covered in the literature reviews [8, 14].
2.3. Sequential Pattern Mining with Hierarchical Relation
Sequential pattern mining with hierarchical relation can be traced back to the article [6], where hierarchies were incorporated into the mining process and the GSP algorithm was proposed to extract sequential patterns according to different levels of hierarchy. There are two key strategies to improve efficiency in GSP. The first is precomputing the ancestors of each item and dropping ancestors that are not in any of the candidates before making a pass over the data. The second is to not count sequential patterns with an element containing both an item and its ancestor. However, the depth of the hierarchy limits the efficiency of the algorithm because it increases the size of the sequence database. To represent the relationships among items in a more complete and natural manner, Chen and Huang [25] sketched the idea of fuzzy multilevel sequential patterns and presented the FMSM algorithm and the CROSS-FMSM algorithm based on GSP. In their paper, each item in a hierarchy can have more than one parent, with different degrees of confidence.
Plantevit et al. [24] incorporated the concept of hierarchy into a multidimensional database and proposed the two-phase algorithm HYPE, extending their preceding approach, to extract multidimensional generalized sequential patterns. Firstly, the maximally specific items are extracted. Secondly, the multidimensional generalized sequences are mined in a further step. As the successor of HYPE, they then proposed an approach [42] to extract multidimensional and multilevel sequential patterns. These approaches are not complete; in other words, they do not mine all frequent sequences. Similarly applying fuzzy concepts to the hierarchy, Huang [43] later presented a divide-and-conquer strategy based on the pattern growth approach to mine such fuzzy multilevel patterns. Egho then presented MMISP to extract heterogeneous multidimensional sequential patterns with hierarchical relation and applied it to analyze the trajectories of care for colorectal cancer. Beedkar et al. [23], who were inspired by MG-FSM, designed the first parallel algorithm, named LASH, for efficiently mining sequential patterns with hierarchical relation. MG-FSM first partitions the data and subsequently mines each partition independently and in parallel. Drawing on the basic strategy of MG-FSM, LASH adopts a novel, hierarchy-aware variant of item-based partitioning, optimized partition construction techniques, and an efficient special-purpose algorithm called pivot sequence miner (PSM) for mining each partition. A sequence database contains not only rich features (e.g., occurrence quantity, risk, and profit) but also multidimensional auxiliary information, which is partly associated with the concept of hierarchy. Recently, Gan et al. [44] proposed a novel framework named MDUS to extract multidimensional utility-oriented sequential useful patterns.
There are also several hierarchical frequent itemset mining algorithms, which are more or less similar to sequential pattern mining with hierarchical relation. For example, Kiran et al. [45] proposed a hierarchical clustering algorithm using closed frequent itemsets that uses Wikipedia as external knowledge to enhance the document representation. In Prajapati and Garg’s research [46], a transactional dataset is generated from a big sales dataset; then, the distributed multilevel frequent pattern mining algorithm (DMFPM) is implemented to generate level-crossing frequent itemsets using the Hadoop MapReduce framework. The multilevel association rules are then generated from the frequent itemsets.
3. Preliminaries and Problem Formulation
3.1. Definitions
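Let $I = \{i_1, i_2, \ldots, i_N\}$ be a set of items. A nonempty subset $X \subseteq I$ is called an itemset, and the symbol $|X|$ denotes the size of $X$. A sequence $S = \langle X_1, X_2, \ldots, X_n \rangle$ is an ordered list of itemsets, where $X_k \subseteq I$ ($1 \le k \le n$). The length of $S$ is $\sum_{k=1}^{n} |X_k|$, and the size of $S$ is $n$. A sequence with a length of $l$ is called an $l$-sequence. A sequence $T = \langle Y_1, Y_2, \ldots, Y_m \rangle$ is a subsequence of $S$ if there exist integers $1 \le k_1 < k_2 < \cdots < k_m \le n$ such that $Y_v \subseteq X_{k_v}$ for all $1 \le v \le m$. For example, $\langle \{a\}, \{b\} \rangle$ is a subsequence of $\langle \{a, c\}, \{b\} \rangle$.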
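The subsequence relation defined above can be sketched in a few lines of Python. This is only an illustrative sketch: the names `is_subsequence`, `length`, and `size` are hypothetical helpers, and the sample sequence is made up rather than taken from Figure 2.

```python
from typing import FrozenSet, Sequence

Itemset = FrozenSet[str]


def is_subsequence(t: Sequence[Itemset], s: Sequence[Itemset]) -> bool:
    """Return True if the itemsets of t embed, in order, into distinct itemsets of s."""
    pos = 0
    for y in t:
        # Greedily take the earliest remaining itemset of s that contains y.
        while pos < len(s) and not y <= s[pos]:
            pos += 1
        if pos == len(s):
            return False
        pos += 1  # the next itemset of t must be matched strictly later
    return True


def length(s: Sequence[Itemset]) -> int:
    """Total number of items in the sequence."""
    return sum(len(x) for x in s)


def size(s: Sequence[Itemset]) -> int:
    """Number of itemsets in the sequence."""
    return len(s)


# Made-up example sequence (not from the paper's figures).
s = [frozenset({"a", "c"}), frozenset({"b"}), frozenset({"a", "d"})]
print(is_subsequence([frozenset({"a"}), frozenset({"d"})], s))
print(is_subsequence([frozenset({"b"}), frozenset({"c"})], s))
print(length(s), size(s))
```

Greedy earliest matching is sufficient here because matching an itemset as early as possible never rules out a later match.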
A q-item (quantitative item) is an ordered tuple $(i, q)$, where $i \in I$ and $q$ is a positive real number representing the quantity of $i$. A q-itemset with $n$ q-items is denoted as $\{(i_1, q_1), (i_2, q_2), \ldots, (i_n, q_n)\}$. A q-sequence, denoted as $Q = \langle Y_1, Y_2, \ldots, Y_m \rangle$, is an ordered list of q-itemsets. A q-sequence database (e.g., Figure 2(a)) consists of a collection of tuples $(ID, Q)$, where ID is the identifier and $Q$ is a q-sequence.
Figure 2: (a) Quantitative sequence database; (b) taxonomies; (c) external utility table.
The hierarchical relation of different items is represented in the form of a taxonomy, which is a tree consisting of items at different abstraction levels. We assume that each item is associated with only one taxonomy. Figure 2(b) shows a simple example of taxonomies. In a taxonomy, if an item is an ancestor of item , we say that is more general than / is more specific than , denoted as . We distinguish three different types of items: leaf items (most specific, no descendants), root items (most general, no ancestors), and intermediate items. The complete set consisting of descendants of item is denoted as . For example, in Figure 2(b), is a root item, is an intermediate item, is a leaf item, and . In this paper, we assume that different items belonging to the same itemset/q-itemset belong to different taxonomies.
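As a rough illustration of these taxonomy notions, the sketch below stores a taxonomy as a child-to-parent map. The taxonomy itself and the helper names (`ancestors`, `descendants`, `is_more_general`, `kind`) are assumptions made for this example; they are not the taxonomies of Figure 2(b).

```python
from typing import Dict, List, Set

# Hypothetical taxonomy (child -> parent); not the one shown in Figure 2(b).
PARENT: Dict[str, str] = {
    "accessory": "electronics",
    "phone": "electronics",
    "power_pack": "accessory",
    "headset": "accessory",
}


def ancestors(item: str, parent: Dict[str, str] = PARENT) -> List[str]:
    """Ancestors of item, from its parent up to the root."""
    chain = []
    while item in parent:
        item = parent[item]
        chain.append(item)
    return chain


def descendants(item: str, parent: Dict[str, str] = PARENT) -> Set[str]:
    """Complete set of descendants of item."""
    return {c for c in parent if item in ancestors(c, parent)}


def is_more_general(a: str, b: str, parent: Dict[str, str] = PARENT) -> bool:
    """True if a is an ancestor of b, i.e., a is more general than b."""
    return a in ancestors(b, parent)


def kind(item: str, parent: Dict[str, str] = PARENT) -> str:
    """Classify an item as 'root', 'leaf', or 'intermediate'."""
    if item not in parent:
        return "root"  # no ancestors (an isolated item is treated as a root here)
    if not descendants(item, parent):
        return "leaf"
    return "intermediate"


print(ancestors("headset"), sorted(descendants("accessory")), kind("accessory"))
```

A child-to-parent map is enough because each item is assumed to belong to exactly one taxonomy, so every item has at most one parent.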
Given two itemsets and , we say that is more specific than or equal to / is more general than or equal to (denoted as ), and so that or . For example, in Figure 2(b), ; . Similarly, given two sequences with the size of , , and , we say that is more specific than or equal to / is more general than or equal to (denoted as ); if , we have , where is the th itemset of and is the th itemset of . In particular, if and , we say that is more specific than / is more general than , denoted as . For example, in Figure 2(b), .
3.2. Utility Calculation
Each item is associated with an external utility (represented as ) which is a positive real number representing the weight of . For a non-leaf item , it should meet the condition that . The external utility of each item is recorded in an external utility table (e.g., Figure 2(c)).
The utility of a q-item is defined as . The utility of a q-itemset/q-sequence/q-sequence database is the sum of the utilities of the q-items/q-itemsets/q-sequences it contains. For example, in Figure 2, the utility of in the 1st q-itemset of is 6 (); the utility of the 1st q-itemset of is 16 (); the utility of , represented as , is 44 (); and the utility of , represented as , is 228 ().
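The additive utility definitions above can be sketched as follows, assuming (as is usual in HUSPM) that the utility of a q-item is its external utility multiplied by its quantity. The external utilities and q-sequences below are made-up values, not those of Figure 2, and all function names are hypothetical.

```python
from typing import Dict, List, Tuple

QItem = Tuple[str, int]        # (item, quantity)
QItemset = List[QItem]
QSequence = List[QItemset]

# Made-up external utilities (e.g., unit profits); not the table in Figure 2(c).
EXTERNAL: Dict[str, int] = {"a": 3, "b": 4, "c": 1}


def u_qitem(qitem: QItem, ext: Dict[str, int] = EXTERNAL) -> int:
    """Assumed definition: external utility times quantity."""
    item, quantity = qitem
    return ext[item] * quantity


def u_qitemset(qitemset: QItemset, ext: Dict[str, int] = EXTERNAL) -> int:
    return sum(u_qitem(x, ext) for x in qitemset)


def u_qsequence(q: QSequence, ext: Dict[str, int] = EXTERNAL) -> int:
    return sum(u_qitemset(x, ext) for x in q)


def u_database(db: List[QSequence], ext: Dict[str, int] = EXTERNAL) -> int:
    return sum(u_qsequence(q, ext) for q in db)


q1: QSequence = [[("a", 1), ("b", 2)], [("c", 4)]]
q2: QSequence = [[("b", 1)]]
print(u_qitemset(q1[0]), u_qsequence(q1), u_database([q1, q2]))
```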
Given an itemset and a q-itemset , we say that occurs in , denoted as , iff there exist distinct integers: so that or . The utility of in is defined as if ; otherwise, . For example, in Figure 2, let , ; ; ; ; ; and .
Given a sequence and a q-sequence where , we make the following definitions. We say that occurs in (denoted as ) at position : iff there exist integers: so that . The utility of in at , denoted as , is defined as . For example, in Figure 2, occurs in at position ; .
Obviously, may occur in many times. The utility of in , denoted as , is defined as , where the symbol denotes the complete set containing all positions of in . The utility of in a sequence database , denoted as , is defined as . For example, in Figure 2, occurs in three times; ; ; . More details about the methods of utility calculation can be found in [8].
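Because a sequence may occur in a q-sequence at several positions, its utility has to aggregate over all occurrences. The sketch below assumes the convention that is standard in HUSPM, namely taking the maximum over all occurrences, and for brevity it matches items exactly (the hierarchical "ancestor" case of the occurrence definition is left out). The function names and data are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

QSequence = List[List[Tuple[str, int]]]  # q-itemsets of (item, quantity)


def u_itemset_in(x: Sequence[str], qitemset, ext: Dict[str, int]) -> Optional[int]:
    """Utility of itemset x inside one q-itemset, or None if x does not occur there."""
    quantity = dict(qitemset)
    if not all(i in quantity for i in x):
        return None
    return sum(ext[i] * quantity[i] for i in x)


def u_seq_in(s: Sequence[Sequence[str]], q: QSequence, ext: Dict[str, int]) -> float:
    """Maximum utility over all occurrences of s in q (assumed convention); 0 if none."""
    neg = float("-inf")
    # best[k]: best utility of matching the first k itemsets of s so far.
    best = [0.0] + [neg] * len(s)
    for qitemset in q:
        # Go downward in k so one q-itemset is used at most once per occurrence.
        for k in range(len(s), 0, -1):
            if best[k - 1] == neg:
                continue
            u = u_itemset_in(s[k - 1], qitemset, ext)
            if u is not None:
                best[k] = max(best[k], best[k - 1] + u)
    return best[len(s)] if best[len(s)] != neg else 0.0


def u_seq_db(s, db: List[QSequence], ext: Dict[str, int]) -> float:
    """Utility of s in a database: sum of its utilities in each q-sequence."""
    return sum(u_seq_in(s, q, ext) for q in db)


ext = {"a": 3, "b": 4, "c": 1}  # made-up external utilities
q = [[("a", 1), ("b", 2)], [("a", 5)], [("b", 1), ("c", 2)]]
print(u_seq_in([["a"], ["b"]], q, ext))
```

The dynamic program avoids enumerating every occurrence explicitly while still returning the maximum over all of them.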
Given a minimum utility , we say that sequence is high-utility if . In particular, is the most specific pattern, denoted as an s-sequence, iff and , . Similarly, sequence is the most general pattern, denoted as a g-sequence, iff and . The s-sequences contain the underlying informative knowledge of the hierarchical relations between different items and carry less meaningless information than highly generalized sequences. Therefore, we define the s-sequences as the high-utility hierarchical sequential patterns (HUHSPs) to be extracted.
3.2.1. Problem Statement
Given a minimum utility , a utility hierarchical sequence database including a quantitative sequence database , a set of taxonomies, and an external utility table, the utility-driven mining problem of high-utility hierarchical sequential pattern mining (HUHSPM) consists of enumerating all HUHSPs whose overall utility values in this database are no less than the prespecified minimum utility threshold .
4. Proposed HUHSPM Algorithm: MHUH
In this section, we present the proposed algorithm MHUH for HUHSPM. We incorporate the hierarchical relation of items into high-utility sequential pattern mining, which enables MHUH to find the underlying informative knowledge of the hierarchical relation between different items that is ignored in high-utility sequential pattern mining. In other words, MHUH can extract more interesting patterns. The mining process of MHUH includes two phases, named Extension and Replacement. In the first phase, MHUH finds high-utility sequences with the existing algorithm FHUSpan (also named HUS-UT), which we proposed earlier based on the prefix-extension approach. In the second phase, for a g-sequence , we generate all sequences that are more specific than by progressive replacement and store the s-sequences in a collection . The names of the two phases reflect the work done in each. The two-phase mining process ensures that the underlying informative knowledge of the hierarchical relation between different items is not missed. At the same time, it increases efficiency when discovering HUHSPs.
Without loss of generality, in this section, we formalize the theorems in the context of a minimum utility and a utility hierarchical sequence database (which includes a q-sequence database , taxonomies, and an external utility table).
4.1. Reduction: Remove Useless Items
Before mining the sequential patterns, MHUH adopts the Reduction strategy in the data preprocessing procedure, which removes useless items to reduce the search space in advance. It consists of two steps: removing the unpromising items from the q-sequence database and removing the redundant items from the taxonomies.
An item is unpromising if no sequence containing this item is high-utility. Here, we propose a novel upper bound, TSWU (Taxonomy Sequence-Weighted Utility), based on SWU [13], to filter out the unpromising items.
Definition 1. Given an item , we define as , where is the root item of the taxonomy containing .
For example, in Figure 2, ;; .
Theorem 2. Given a sequence , two sequences and , where , .
Proof. Let . We have . For a sequence , , , so . Further, .
Theorem 3. For any sequence that contains item , if .
Proof. From Theorem 2, we know that , so . If , .
For a given , we can remove items satisfying safely according to the above theorem. For example, in Figure 2, when , and can be safely removed from Figure 2(a).
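A minimal sketch of this filtering step follows. The exact TSWU formula is not reproduced here, so the code works under a stated assumption: TSWU of an item is read as the SWU of its root item, i.e., the sum of the utilities of all q-sequences that contain at least one item from the taxonomy of that root. The function names and the data are hypothetical.

```python
from typing import Dict, List, Tuple

QSequence = List[List[Tuple[str, int]]]  # q-itemsets of (item, quantity)


def root_of(item: str, parent: Dict[str, str]) -> str:
    """Follow child -> parent links up to the root of the item's taxonomy."""
    while item in parent:
        item = parent[item]
    return item


def u_qsequence(q: QSequence, ext: Dict[str, int]) -> int:
    return sum(ext[i] * n for qitemset in q for i, n in qitemset)


def tswu(item: str, db: List[QSequence], parent, ext) -> int:
    """Assumed reading of TSWU: total utility of q-sequences touching root(item)'s taxonomy."""
    r = root_of(item, parent)
    return sum(
        u_qsequence(q, ext)
        for q in db
        if any(root_of(i, parent) == r for qitemset in q for i, _ in qitemset)
    )


def remove_unpromising(db: List[QSequence], parent, ext, min_util) -> List[QSequence]:
    """Drop every q-item whose TSWU is below min_util; drop emptied q-itemsets/q-sequences."""
    scores = {i: tswu(i, db, parent, ext) for q in db for qs in q for i, _ in qs}
    reduced = []
    for q in db:
        kept = [[(i, n) for i, n in qs if scores[i] >= min_util] for qs in q]
        kept = [qs for qs in kept if qs]
        if kept:
            reduced.append(kept)
    return reduced


parent = {"a1": "A", "a2": "A", "b1": "B"}   # made-up taxonomies
ext = {"a1": 2, "a2": 3, "b1": 1}            # made-up external utilities
db = [[[("a1", 1), ("b1", 2)]], [[("a2", 2)]]]
print(tswu("a1", db, parent, ext), tswu("b1", db, parent, ext))
print(remove_unpromising(db, parent, ext, min_util=5))
```

Under this reading, all items of one taxonomy share the same TSWU, so the bound can be computed once per root item.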
We say that an item is redundant if it (1) appears in a taxonomy but does not appear in the q-sequence database and (2) has at most one child in the taxonomy. For example, in Figure 2, and are redundant items. In terms of utility, removing these items has no effect on correctness, which will be proved in Subsection 4.3. Thus, we can safely remove these items.
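The removal of redundant items can be sketched as a small taxonomy rewrite. How a removed node is spliced out is not spelled out above, so this sketch assumes that the single child (if any) is re-attached to the removed node's parent, and that the check is repeated until no redundant item remains. `remove_redundant` and the sample taxonomy are hypothetical.

```python
from typing import Dict, Set


def remove_redundant(parent: Dict[str, str], db_items: Set[str]) -> Dict[str, str]:
    """Remove taxonomy items that are absent from the database and have at most one child.

    parent maps child -> parent. Assumption: the single child of a removed node is
    re-attached to the removed node's parent (or becomes a root).
    """
    parent = dict(parent)
    changed = True
    while changed:
        changed = False
        nodes = set(parent) | set(parent.values())
        for node in sorted(nodes):
            if node in db_items:
                continue
            children = [c for c, p in parent.items() if p == node]
            if len(children) > 1:
                continue
            grand = parent.pop(node, None)  # None when node is a root
            for c in children:
                if grand is None:
                    del parent[c]           # the child becomes a root
                else:
                    parent[c] = grand       # splice the child up one level
            changed = True
            break                           # recompute the node set after each removal
    return parent


taxonomy = {"B": "A", "b1": "B", "b2": "B", "C": "A", "c1": "C", "d1": "D"}
items_in_db = {"b1", "b2", "c1", "d1"}
print(remove_redundant(taxonomy, items_in_db))
```

Here `C` and `D` are removed because neither occurs in the database and each has a single child, while `A` and `B` are kept because they have two children.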
4.2. Extension (Phase I): Find g-Sequences
In the first phase, named Extension, we use the existing algorithm FHUSpan [18], which we proposed earlier, to efficiently mine the general high-utility sequences (g-sequences). The main tasks of this phase are to greatly improve the efficiency of MHUH and to extract the g-sequences in preparation for the next phase.
In fact, no s-sequences will be missed under the FGS (From General to Special) strategy. To prove the correctness of this conclusion, we need to prove two points: (1) there does not exist an s-sequence that cannot be discovered by the FGS strategy and (2) the algorithm that finds s-sequences from a given g-sequence is correct. Here, we prove (1); the proof of (2) is given in the next subsection.
Theorem 4. Given two sequences and where , .
Proof. We first prove that . From Section 4.2 () and Theorem 3, . Then, we prove . We have if , where . .
Corollary 5. Given a sequence , .
Theorem 4 and Corollary 5 reveal the correctness of (1). We assume that is a sequence that cannot be discovered by the FGS strategy. In fact, we can always find (by replacing an item with the item’s ancestor) the sequence where and . Because is not a g-sequence, or . So, . We then arrive at a contradiction that . Therefore, the assumption does not hold, which ensures the correctness of (1).
Theorem 6. All items contained in a g-sequence are root items.
Proof. Given a g-sequence , we assume that the th item of is not a root item. Then, we can find a sequence where . We then arrive at a contradiction that is not a g-sequence. Therefore, the theorem holds.
We then introduce how to find g-sequences. Theorem 6 shows that we only need to consider the root items in the process of finding g-sequences. Thus, we can transform the q-sequences into another form so that we can ignore the hierarchical relation in this phase. We illustrate this transformation through an example. Consider in Figure 2; we transform it into , where the value in the bracket is the utility (). Obviously, with this transformation, mining g-sequences is equivalent to mining high-utility sequences. So, we use the existing high-utility sequential pattern mining algorithm FHUSpan [18], which we proposed earlier, to find g-sequences.
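The transformation described above can be sketched as follows. The sketch assumes that each q-item is replaced by the root item of its taxonomy and annotated with the utility of the original q-item (external utility times quantity); the helper names and the data are hypothetical and do not reproduce the example from Figure 2.

```python
from typing import Dict, List, Tuple

QSequence = List[List[Tuple[str, int]]]  # q-itemsets of (item, quantity)


def root_of(item: str, parent: Dict[str, str]) -> str:
    """Follow child -> parent links up to the root item."""
    while item in parent:
        item = parent[item]
    return item


def to_root_form(q: QSequence, parent: Dict[str, str], ext: Dict[str, int]):
    """Rewrite every (item, quantity) as (root item, utility of the original q-item)."""
    return [
        [(root_of(item, parent), ext[item] * quantity) for item, quantity in qitemset]
        for qitemset in q
    ]


parent = {"a1": "A", "b1": "B"}  # made-up taxonomies
ext = {"a1": 2, "b1": 3}         # made-up external utilities
q = [[("a1", 2), ("b1", 1)], [("a1", 1)]]
print(to_root_form(q, parent, ext))
```

Because items in the same q-itemset are assumed to belong to different taxonomies, no two q-items of one q-itemset collapse onto the same root item.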
Here, we briefly introduce the mining process of FHUSpan, which finds high-utility sequences based on the prefix-extension approach. It first finds all appropriate items (only a sequence starting with one of these items may be high-utility). Then, for each appropriate item, it constructs a sequence containing only this item and extends the sequence recursively until all sequences starting with the item are checked. In particular, two extension approaches are used: S-Extension (appending an itemset containing only one item to the end of the current sequence) and I-Extension (appending an item to the last itemset of the current sequence). FHUSpan is based on the algorithm HUS-Span, which uses two pruning strategies, the PEU (Prefix Extension Utility) strategy and the RSU (Reduced Sequence Utility) strategy, to reduce the search space. In FHUSpan, the novel data structure named Utility-Table and the TRSU pruning strategy are used to terminate extension early so that high-utility sequences can be discovered efficiently.
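The two extension operations can be illustrated with a short sketch. It shows only candidate generation, not FHUSpan itself (no Utility-Table and no pruning). The restriction that an item added to the last itemset must be lexicographically larger than that itemset's last item is a common convention for avoiding duplicate candidates and is an assumption here; all names are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

Seq = Tuple[Tuple[str, ...], ...]  # a sequence as a tuple of sorted itemsets


def s_extension(seq: Seq, item: str) -> Seq:
    """Append a new itemset containing only `item` to the end of the sequence."""
    return seq + ((item,),)


def i_extension(seq: Seq, item: str) -> Seq:
    """Append `item` to the last itemset of the sequence."""
    if not seq:
        raise ValueError("cannot extend the last itemset of an empty sequence")
    last = seq[-1]
    if item <= last[-1]:
        # Assumed convention: keep itemsets sorted so each candidate is generated once.
        raise ValueError("item must be larger than the last item of the last itemset")
    return seq[:-1] + (last + (item,),)


def extensions(seq: Seq, items: Iterable[str]) -> Iterator[Seq]:
    """Enumerate all one-step extensions of `seq` over the given items."""
    for item in sorted(items):
        yield s_extension(seq, item)
        if seq and item > seq[-1][-1]:
            yield i_extension(seq, item)


start: Seq = (("a",),)
for candidate in extensions(start, ["a", "b"]):
    print(candidate)
```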
4.3. Replacement (Phase II): Find s-Sequence
In the second phase, named Replacement, we mine the special high-utility sequences with the hierarchical relation (s-sequences) from g-sequences using the PBS strategy. The main task of this phase is to extract s-sequences efficiently.
For a sequence , we then generate all sequences that are more specific than by progressive replacement and store sequences with a collection . In particular, for each replacement, we replace the th item of with a child item. For example, in Figure 2, we replace the first item of with the child items of , and one specific sequence is .
Algorithm 1 shows the progressive replacement starting from the th item of a sequence, which is based on DFS. Firstly, it checks if the current sequence has been visited to avoid repeated utility calculation (line 1). If , we have according to Theorem 4, so we terminate search (lines 2-3). Otherwise, it adds into and removes the sequences that are more general than from (line 5). Then, it generates the more specific sequences based on , which follows the order from top to bottom (line 9), left to right (lines 10-12). In detail, it first finds which is the set containing all child items for replacement. For each , it replaces the th item of with to generate . It then checks the sequences that are more specific than (line 9, from top to bottom). After that, it checks the sequences that are more specific than from left to right (lines 10-12), where is the length of .
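The idea of progressive replacement can be sketched as a simplified depth-first search. This is not a transcription of Algorithm 1. It assumes that each itemset holds a single item (so a sequence is a tuple of items), that the utility function is supplied by the caller, that a more specific sequence never has a higher utility than a more general one (the property used for early termination), and that the patterns kept are the most specific sequences that reach the threshold. All names and the toy weights are hypothetical.

```python
def is_more_general(general, specific, parent):
    """True if the sequences differ and each item of `general` equals, or is an
    ancestor of, the item at the same position in `specific`."""
    if general == specific or len(general) != len(specific):
        return False
    for g, s in zip(general, specific):
        node = s
        while node != g and node in parent:
            node = parent[node]
        if node != g:
            return False
    return True


def replace_dfs(seq, start, children, utility, threshold, found, visited):
    """Specialise `seq` by replacing items at positions >= `start` with children."""
    if seq in visited:                 # guard against repeated utility calculation
        return
    visited.add(seq)
    if utility(seq) < threshold:       # assumed pruning rule: stop specialising
        return
    found.add(seq)
    for j in range(start, len(seq)):            # left to right
        for child in children.get(seq[j], ()):  # top to bottom
            specific = seq[:j] + (child,) + seq[j + 1:]
            replace_dfs(specific, j, children, utility, threshold, found, visited)


def most_specific(found, parent):
    """Drop every sequence that is more general than another found sequence."""
    return {s for s in found
            if not any(is_more_general(s, t, parent) for t in found)}


# Toy taxonomy and utilities (hypothetical): a child never outweighs its parent.
children = {"A": ["a1", "a2"], "B": ["b1"]}
parent = {c: p for p, cs in children.items() for c in cs}
weights = {"A": 10, "a1": 6, "a2": 3, "B": 5, "b1": 4}


def utility(seq):
    return sum(weights[item] for item in seq)


found, visited = set(), set()
replace_dfs(("A", "B"), 0, children, utility, 10, found, visited)
result = most_specific(found, parent)
```

Restricting replacements to positions at or after the current index means each candidate is reached along exactly one path, so the visited set acts only as a safeguard here. Filtering general sequences once at the end, rather than on every insertion, is a simplification chosen for readability.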

We also use a strategy, PBS (Pruning Before Search), to reduce search space before Algorithm 1. The main idea behind this strategy is considering only the items in the current index. In other words, we generate and check the more specific sequences in one direction (from top to bottom) to reduce the size of taxonomies.
We illustrate this strategy through an example under the context of Figure 2. Let , the sequence is a sequence (). We construct copies of taxonomy, denoted as , and , for the th () item of . Then, we reduce the size of the three taxonomies. For , we have ( is a redundant item and was removed). Then, we generate by replacing the first with , and . So, we retain . Because , we then consider and generate . We also retain , for . Then, we continue to check the child items of . Such a procedure will continue until all items belonging to have been checked. Finally, we remove from , for . We then continue the above procedure for and , and the processed taxonomies are shown in the right of Figure 3. In addition, note that in Algorithm 1, is obtained from the processed taxonomies instead of the original taxonomies.
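A minimal sketch of this per-position pruning is given below. It assumes that an item is retained when the sequence obtained by replacing only that position still reaches the minimum utility threshold, and that the children of a removed item need not be examined. The removal of redundant items is omitted, and the function name prune_taxonomies, the single-item itemsets, and the toy weights are hypothetical.

```python
def prune_taxonomies(seq, children, utility, threshold):
    """Return one pruned parent-to-children map per position of `seq`."""
    pruned = []
    for k, root in enumerate(seq):
        kept = {}
        stack = [root]
        while stack:
            node = stack.pop()
            for child in children.get(node, ()):
                # Replace only position k; all other positions keep their items.
                candidate = seq[:k] + (child,) + seq[k + 1:]
                if utility(candidate) >= threshold:
                    kept.setdefault(node, []).append(child)
                    stack.append(child)  # descend only below retained items
        pruned.append(kept)
    return pruned


# Toy taxonomy and utilities (hypothetical).
children = {"A": ["a1", "a2"], "a1": ["a11", "a12"], "B": ["b1"]}
weights = {"A": 10, "a1": 6, "a2": 3, "a11": 5, "a12": 1, "B": 5, "b1": 4}


def utility(seq):
    return sum(weights[item] for item in seq)


taxonomies = prune_taxonomies(("A", "B"), children, utility, 10)
```

In this toy run the first position keeps only the chain A, a1, a11 and the second keeps B, b1, so a subsequent search draws its child items from much smaller taxonomies.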
In the above example, the max count of sequences that are more specific than reduces from 107 () to 11 (). In fact, for a sequence , this count reduces from to , where and are the sizes of in the original and processed taxonomies, respectively, and is the th item of .
In the rest of this subsection, we prove the claims left open earlier. We first prove that removing redundant items has no effect on correctness.
Proof. For a sequence , we assume that the th item of , , is a redundant item. Firstly, if is a leaf item, we can safely remove it, because for each sequence that contains , we have . Secondly, if has one child, we generate sequence by replacing with its child. Then, we have according to the related utility definition (the utility of in ). Therefore, removing redundant items does not change the utility of related sequences, which means that it has no effect on the correctness.
Then, we prove the correctness of the algorithm that finds s-sequences from a given g-sequence.
Proof. Firstly, the PBS strategy does not ignore the underlying sequences. Suppose we cannot find a sequence from the taxonomies processed by PBS strategy, then we have , which violates Theorem 4. So, the assumption does not hold. Secondly, Algorithm 1 does not ignore any sequences. Algorithm 1 is based on the DFS framework, which ensures the completeness of the algorithm. Besides, Algorithm 1 terminates search in advance based on Theorem 4, so it does not ignore any sequences. In summary, the conclusion holds.
5. Experiments
We performed experiments to evaluate the proposed MHUH algorithm, which was implemented in Java. All experiments were carried out on a computer with a 3.2 GHz Intel Core i7 CPU, 8 GB of memory, and Windows 10.
5.1. Datasets
Five datasets, including three real datasets and two synthetic datasets, were used in the experiments. DS1 is a conversion of the Bible in which each word is an item. DS2 is a conversion of the classic novel Leviathan. DS3 is a clickstream dataset called BMSWebView2. These three datasets can be obtained from the SPMF website [47]. DS4 and DS5 are two synthetic datasets. The characteristics of the five datasets are summarized in Table 1. The values of parameters in Table 1 are as follows: is the number of sequences, is the number of distinct items, is the max length of sequences, and is the average length of sequences.
Note that these datasets do not contain taxonomies. So, for each dataset, we generated taxonomies based on the items it contains. The max depth and the max degree of these taxonomies are both 3, so a taxonomy contains at most 3 × 3 × 3 = 27 leaf items. The datasets and source code will be released on the authors’ GitHub after acceptance for publication.
5.2. Performance Evaluation
We evaluated the performance of the proposed algorithm on different datasets when varying the minimum utility threshold. For simplicity, we calculate the threshold as δ times the utility of the sequence database, where δ is a decimal between 0 and 1 (see the concepts in Subsection 3.2). In addition, we tested the effect of the PBS strategy; the modified MHUH algorithm that does not use the PBS strategy is denoted as MHUH_base.
The execution times of MHUH and MHUH_base on DS1 to DS3 are shown in Figure 4. When the threshold increases, both algorithms take less execution time because the search space shrinks. The results show that the PBS strategy effectively decreases the execution time, for it greatly reduces the search space on these datasets. The results also show that the MHUH algorithm can efficiently extract s-sequences under a low threshold.
Figure 5 shows the distribution of patterns discovered by MHUH on DS1 to DS3. The number of patterns per length increases as the threshold decreases. In particular, it is interesting that some longer patterns may disappear as the threshold increases, which indicates that shorter patterns may have higher utility.
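As a small illustration of this setting, the sketch below derives the threshold from δ. It assumes that the threshold is the product of δ and the database utility and that the database utility is the sum of the utilities of its sequences; the function name and the numbers are hypothetical.

```python
def min_utility_threshold(sequence_utilities, delta):
    """Return delta times the total utility of the sequence database.

    Assumption of this sketch: the database utility is the sum of the
    utilities of all sequences in the database.
    """
    if not 0 < delta < 1:
        raise ValueError("delta must lie strictly between 0 and 1")
    return delta * sum(sequence_utilities)


# Hypothetical database with four sequences whose utilities sum to 400.
threshold = min_utility_threshold([100, 150, 50, 100], 0.25)
```

Expressing the threshold relative to the database utility makes settings comparable across datasets whose total utilities differ widely.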
5.3. Utility Comparison with High-Utility Sequential Pattern Mining
We conducted this experiment to evaluate the utility difference between the patterns discovered by MHUH and those discovered by the existing algorithm FHUSpan [18], which we proposed earlier.
Figure 6 shows the sum utility of the top # patterns (ranked by utility) discovered by FHUSpan and MHUH from three datasets. The x-axis refers to the value of #, and the y-axis represents the sum utility of the top # patterns. For example, on DS1, the sum utility of the top 1000 patterns extracted by MHUH is higher than that of the top 1000 patterns discovered by FHUSpan. Figure 7 shows the average utility per length of the top # patterns on DS1 to DS3 (# is set to 1000, 700, and 600, respectively). The x-axis refers to the length of patterns, and the y-axis denotes the average utility of patterns with the same length. For example, on DS1, among the top 1000 patterns, the average utility of patterns of length 8 discovered by MHUH is higher than that of the patterns of the same length discovered by FHUSpan. These two figures show that MHUH can discover patterns with higher utility than FHUSpan, indicating that more informative knowledge can be found by MHUH.
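The two measures used in this comparison can be computed as in the following sketch. It assumes that each mined pattern is available as a (sequence, utility) pair and that the length of a pattern is its number of items; the function names and the sample patterns are hypothetical and are not experimental data.

```python
from collections import defaultdict


def top_k(patterns, k):
    """Return the k patterns with the highest utility."""
    return sorted(patterns, key=lambda p: p[1], reverse=True)[:k]


def sum_utility(patterns, k):
    """Sum of the utilities of the top-k patterns."""
    return sum(utility for _, utility in top_k(patterns, k))


def average_utility_per_length(patterns, k):
    """Average utility of the top-k patterns, grouped by pattern length."""
    groups = defaultdict(list)
    for seq, utility in top_k(patterns, k):
        length = sum(len(itemset) for itemset in seq)  # assumed: number of items
        groups[length].append(utility)
    return {length: sum(us) / len(us) for length, us in groups.items()}


# Hypothetical patterns: (sequence as a list of itemsets, utility).
patterns = [
    ([("a",), ("b",)], 30),
    ([("a", "c")], 20),
    ([("b",)], 10),
    ([("c",), ("a",), ("b",)], 40),
]
total = sum_utility(patterns, 3)
per_length = average_utility_per_length(patterns, 3)
```

Applying the same two functions to the outputs of both algorithms yields comparable curves, because the ranking and grouping rules are identical for each.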
5.4. Scalability
We conducted experiments to evaluate MHUH’s performance on large-scale datasets. For each dataset, we increased its data size through duplication and performed the MHUH algorithm with different . Figure 8 shows the experimental results. The figure shows that the MHUH algorithm scales well on the two datasets, for the execution time is almost linear in the data size. For example, the execution time of MHUH () on DS4 increases almost linearly when the data size (the number of sequences it contains) changes from 10K to 50K. It also shows that MHUH can efficiently identify the desired patterns from a large-scale dataset with a low . For example, in terms of DS5, MHUH costs 300 s when the data size is 300K and .
6. Conclusion and Future Work
In this paper, we incorporate the hierarchical relation of items into high-utility sequential pattern mining and propose a two-phase algorithm MHUH, the first algorithm for high-utility hierarchical sequential pattern mining (HUHSPM). In the first phase, named Extension, we use the existing algorithm FHUSpan, which we proposed earlier, to efficiently mine the general high-utility sequences (g-sequences); in the second phase, named Replacement, we mine the special high-utility sequences with the hierarchical relation (s-sequences) from g-sequences. The proposed MHUH algorithm adopts several novel strategies (e.g., Reduction, FGS, and PBS) and a new upper bound, TSWU, which greatly reduce the search space and allow the desired patterns (HUHSPs) to be discovered efficiently. The experiments show that MHUH efficiently extracts more interesting patterns with underlying informative knowledge in HUHSPM.
In the future, we will generalize the proposed algorithm based on more complete concepts. In addition, several extensions of the proposed MHUH algorithm can be considered, such as improving its efficiency through better pruning strategies, efficient data structures [40, 41], and multithreading [2].
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors have declared that no conflict of interest exists.
Acknowledgments
This work was supported by the Natural Science Foundation of Guangdong Province, China (Grant No. 2020A1515010970) and Shenzhen Research Council (Grant No. GJHZ20180928155209705).