Mathematical Problems in Engineering
Volume 2017 (2017), Article ID 8576829, 13 pages
https://doi.org/10.1155/2017/8576829
Research Article

PHUIMUS: A Potential High Utility Itemsets Mining Algorithm Based on Stream Data with Uncertainty

Air Force Engineering University, Xi’an, China

Correspondence should be addressed to Ju Wang; yonglingjuke@outlook.com

Received 8 October 2016; Revised 16 February 2017; Accepted 23 February 2017; Published 16 March 2017

Academic Editor: Haipeng Peng

Copyright © 2017 Ju Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

High utility itemsets (HUIs) mining has recently been a hot topic; it mines profitable itemsets by considering both quantity and profit. HUIs mining over uncertain datasets and HUIs mining over data streams have so far been studied separately. However, to the best of our knowledge, HUIs mining over uncertain data streams is seldom studied. In this paper, the PHUIMUS (potential high utility itemsets mining over uncertain data stream) algorithm is proposed to mine potential high utility itemsets (PHUIs), that is, itemsets with high utilities and high existential probabilities, over uncertain data streams based on sliding windows. To realize the algorithm, a potential utility list over uncertain data stream (PUS-list) is designed to mine PHUIs without rescanning the analyzed uncertain data stream, and a transaction weighted probability and utility tree over uncertain data stream (TWPUS-tree) is designed to decrease the number of candidate itemsets generated by the PHUIMUS algorithm. Substantial experiments are conducted in terms of run-time, number of discovered PHUIs, memory consumption, and scalability on real-life and synthetic databases. The results show that the proposed algorithm is reasonable and acceptable for mining meaningful PHUIs from uncertain data streams.

1. Introduction

Knowledge discovery in databases (KDD) is an emerging field, since important, implicit, previously unknown, and potentially useful information can be found in huge databases [1, 2]. Frequent itemsets mining (FIM), which finds the itemsets whose occurrence frequencies are no less than a minimum support threshold, is one of the most important and common tasks of data mining [3]. Apriori [4], based on breadth-first search, and FP-growth [2], based on depth-first search, are well-known fundamental FIM algorithms. However, these traditional FIM algorithms assume that the profit of every item is the same and that the frequency value of every item in a transaction is 0 or 1. In real-life applications, the itemsets that bring high profit to retailers and managers are the useful ones [5], not the most frequent itemsets. Thus, factors such as quantity, price, and profit need to be included in FIM.

To deal with the limitations of FIM, Chan et al. [6] first proposed a high utility itemsets mining algorithm over nonbinary databases with different profit values of items. The goal of HUIs mining is to discover itemsets that bring considerable profit to users, even if they are not frequent itemsets. Level-wise approaches [6, 7], pattern growth approaches [8–10], and list based approaches [11, 12] are the three main frameworks for dealing with the lack of a downward closure property and the resulting combinatorial explosion.

These traditional HUIs mining algorithms are designed for static databases and ignore the timeliness of itemsets. Therefore, Tseng et al. first proposed THUI-Mine [13] to mine HUIs from data streams according to the two-phase model based on sliding windows. Afterward, many improved algorithms [14–18] were proposed to handle this problem more efficiently. However, the above algorithms can only deal with precise data streams; they cannot deal with uncertainty.

In real-life applications, uncertainty may be introduced when data is collected from noisy data sources. But most HUIs mining algorithms are developed to handle precise databases and ignore the existential probability of itemsets. In fact, for uncertain databases, itemsets with both high utility and high existential probability are useful to users, not itemsets with only one of them. To the best of our knowledge, PHUI-UP (based on the two-phase model) and PHUI-List (based on a list structure), proposed by Lin et al. [19], and UHUI-apriori (based on Apriori), proposed by Lan et al. [20], are the only algorithms that solve the HUIs mining problem over uncertain databases. However, these algorithms can only handle static data with uncertainty; they cannot deal with uncertain data streams.

Uncertain data streams, to which transaction data are added constantly, are continuous, unbounded, and uncertain. They play an important role in real-life applications because they exist everywhere, for example in wireless sensors, GPS, Wi-Fi systems, and RFID. But the issue of HUIs mining over uncertain data streams is seldom studied. Note that HUIs mining over an uncertain data stream has to satisfy the following requirements. (1) The analyzed uncertain data stream can be scanned only once. (2) Memory usage for the mining process should be kept within an acceptable range. (3) All the data must be processed as fast as possible. (4) Itemsets with high utility and high existential probability can be output whenever users want the results.

In this paper, to deal with the new issue of HUIs mining over uncertain data streams, the PHUIMUS algorithm is proposed to mine PHUIs over uncertain data streams based on sliding windows. To realize PHUIMUS, the PUS-list is designed to keep the exact potential utility of items and transactions, and the TWPUS-tree is developed to maintain batch-by-batch information inside its nodes. The major contributions of this paper are summarized as follows:
(1) Previous works on HUIs mining mainly focus on mining HUIs efficiently in static and precise databases. To the best of our knowledge, little research has addressed mining HUIs over uncertain data streams, which takes both uncertainty and timeliness into account.
(2) As HUIs mining over uncertain data streams brings existential probability and sliding windows into consideration, the calculation of item utility, itemset utility, transaction utility, and transaction weighted utility changes. In this paper, new definitions of these quantities are given, and a novel type of itemset named PHUI is defined.
(3) The PHUIMUS algorithm is proposed to mine PHUIs over uncertain data streams based on the developed PUS-list and TWPUS-tree in the current window; it can efficiently prune unpromising itemsets and obtain PHUIs without rescanning the analyzed uncertain data stream.
(4) Substantial experiments have been conducted on real-life and synthetic databases. The results show that the designed algorithm can effectively discover PHUIs over uncertain data streams and performs well in terms of run-time, number of discovered PHUIs, memory consumption, and scalability.

The remainder of this paper is organized as follows. In Section 2, we describe the related work. In Section 3, we present our new definitions and the problem of HUIs mining over uncertain data streams. In Section 4, we develop the proposed PUS-list and TWPUS-tree and design the PHUIMUS algorithm for the stated problem. In Section 5, our experimental results are presented and analyzed. Finally, in Section 6, conclusions are drawn.

2. Related Work

In this section, related work on HUIs mining over data streams and over uncertain databases is briefly reviewed in turn.

2.1. HUIs Mining over Data Stream

As an extension of FIM, HUIs mining focuses on finding itemsets whose utilities are not lower than a minimum utility threshold. It has been widely studied recently and can be used in various areas, such as web click analysis, biological gene analysis, and retail marketing [18]. Its goal is to discover items or itemsets in transactions that are valuable to users, not the most frequent ones. Several typical algorithms have been proposed to deal with the lack of a downward closure property and the resulting combinatorial explosion, such as the two-phase model [7], HUP-growth [9], HUI-Miner [11], HUPEumu-GRAM [21], and HUIs mining-BPSO [22].

In contrast to algorithms that discover HUIs from static databases, THUI-Mine [13] is the first algorithm for mining HUIs from data streams. It follows the two-phase model [7] based on sliding windows and thus suffers from the problem of level-wise candidate generation. Afterward, to reduce the number of candidate itemsets generated by THUI-Mine, Li et al. [14, 15] proposed MHUI-BT and MHUI-TID, which use bit vectors and TID-lists for each distinct item. The experiments show that MHUI-TID is an outstanding algorithm for mining HUIs from data streams, since the TID-list is an efficient data structure that sharply reduces the number of candidate itemsets.

Moreover, Shie et al. [16] proposed efficient algorithms for mining maximal HUIs from data streams with different models. Ahmed et al. [17] designed an interactive mining algorithm for high utility patterns over data streams. Zihayat and An [18] proposed an algorithm for mining top-k high utility patterns over data streams. However, the above algorithms can only deal with precise data streams; they cannot deal with uncertainty.

2.2. HUIs Mining over Uncertain Database

Most HUIs mining algorithms assume that the information stored in the databases is precise and ignore the existential probability of itemsets. Thus, traditional HUIs mining algorithms are insufficient to process transactions with uncertainty in real-life applications. In fact, for an uncertain database, itemsets with both high utility and high existential probability are useful to users, not itemsets with only one of them. To the best of our knowledge, PHUI-UP (based on the two-phase model) and PHUI-List (based on a list structure), proposed by Lin et al. [19], and UHUI-apriori (based on Apriori), proposed by Lan et al. [20], are the only algorithms that solve the HUIs mining problem over uncertain databases. However, these algorithms can only handle static databases with uncertainty; they cannot deal with uncertain data streams.

In this paper, the concepts of HUIs mining over data streams and HUIs mining over uncertain databases are combined to discover HUIs from uncertain data streams. To the best of our knowledge, the proposed PHUIMUS algorithm is the first work that discovers HUIs over uncertain data streams.

3. New Definitions and Problem Statement

In this section, some definitions of HUIs mining are extended from the precise and static databases to the uncertain data streams. New definitions and problem statement related to HUIs mining over uncertain data stream are given.

3.1. New Definitions

Let $I = \{i_1, i_2, \ldots, i_m\}$ be a finite set of items; an uncertain data stream $UDS = \{T_1, T_2, \ldots\}$ is a continuous sequence of transactions. A transaction $T_d$ in $UDS$ contains a number of items, and each item $i_p$ in $T_d$ is associated with an internal utility $iu(i_p, T_d)$ and an existential probability $p(i_p, T_d)$, which indicate the quantity of $i_p$ in $T_d$ and the likelihood of $i_p$ being present in $T_d$, respectively [23]. In addition, the external utility of item $i_p$ in $UDS$ is represented by $eu(i_p)$.

In an uncertain data stream, the data cannot be stored completely because its volume is unbounded, and the storage structure must be adjusted dynamically to reflect the evolution of itemset utility. So a sliding window is needed, which consists of the $w$ most recent batches, represented by $SW_k = \{B_k, B_{k+1}, \ldots, B_{k+w-1}\}$. A batch consists of a certain number of consecutive transactions in a time period; that is, $B_j = \{T_1, T_2, \ldots, T_n\}$. HUIs mining over an uncertain data stream based on sliding windows mines PHUIs from every new window, which is formed once the oldest batch is removed from the window and the newest batch is inserted into it.

For example, Figure 1 shows part of an uncertain data stream together with its profit table. Assume that each batch includes two transactions and each sliding window includes three batches. There are four batches in the stream, $B_1$, $B_2$, $B_3$, and $B_4$, and the first three batches form the first sliding window: $SW_1 = \{B_1, B_2, B_3\}$. When the fourth batch is filled up with transactions, the second sliding window is formed: $SW_2 = \{B_2, B_3, B_4\}$.
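The batching and window formation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sliding_windows`, its parameters, and the transaction ids `T1`..`T8` are hypothetical and are not taken from the paper.

```python
from collections import deque

def sliding_windows(transactions, batch_size, win_size):
    """Yield every full window as a list of batches (each batch is a list of transactions)."""
    window = deque(maxlen=win_size)   # appending to a full deque drops the oldest batch
    batch = []
    for t in transactions:
        batch.append(t)
        if len(batch) == batch_size:  # the batch is filled up with transactions
            window.append(batch)
            batch = []
            if len(window) == win_size:
                yield list(window)

# Hypothetical stream of eight transactions: two per batch, three batches per window.
stream = [f"T{d}" for d in range(1, 9)]
windows = list(sliding_windows(stream, batch_size=2, win_size=3))
```

With four batches in the stream, the generator yields two windows, matching the two sliding windows in the example: the first holds batches 1 to 3 and the second holds batches 2 to 4.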

Figure 1: Partial of an uncertain data stream with its profit table.

What is different in PHUIs mining compared to traditional HUIs mining over data stream is that the items in the transactions have existential probability, which brings a change in the calculation of itemsets utility, transaction utility, and transaction weighted utilization. New definitions about them are presented below.

Definition 1 (potential utility of items). In a transaction $T_d$, the potential utility of an item $i_p$, $pu(i_p, T_d)$, is the product of its internal utility, existential probability, and external utility, denoted as
$$pu(i_p, T_d) = iu(i_p, T_d) \times p(i_p, T_d) \times eu(i_p).$$
For example, in Figure 1, the potential utility of item in is .

Definition 2 (potential utility of itemsets in a batch). The potential utility of an itemset $X$ in the batch $B_j$ is defined as the sum of the potential utilities of $X$ in all transactions of $B_j$ that include $X$, denoted as
$$pu(X, B_j) = \sum_{T_d \in B_j,\; X \subseteq T_d} pu(X, T_d).$$
For example, in Figure 1, .

Definition 3 (potential utility of itemsets in a window). The potential utility of an itemset $X$ in the window $SW_k$ is defined as the sum of the potential utilities of $X$ in all batches of $SW_k$ that include $X$, denoted as
$$pu(X, SW_k) = \sum_{B_j \in SW_k} pu(X, B_j).$$
For example, in Figure 1, .

Definition 4 (potential utility of transactions). The potential utility of a transaction $T_d$ (PTU) is defined as the sum of the potential utilities of the items in $T_d$, represented as
$$PTU(T_d) = \sum_{i_p \in T_d} pu(i_p, T_d).$$
For example, in Figure 1, .
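Definitions 1 to 4 can be sketched as small Python functions. The sketch rests on several assumptions that are not in the paper: the data below is hypothetical (the values of Figure 1 are not reproduced), all function and variable names are illustrative, and the potential utility of an itemset in a single transaction is assumed to be the sum of the potential utilities of its items.

```python
# Hypothetical profit table: external utility per item.
EU = {"a": 3.0, "b": 5.0, "c": 2.0}

# Hypothetical transactions: item -> (internal utility, existential probability).
T1 = {"a": (2, 0.9), "c": (1, 0.5)}
T2 = {"a": (1, 0.6), "b": (3, 0.8), "c": (2, 1.0)}
T3 = {"b": (1, 0.7)}
window = [[T1, T2], [T3]]            # a window is a list of batches

def pu_item(item, t):
    """Definition 1: internal utility * existential probability * external utility."""
    quantity, prob = t[item]
    return quantity * prob * EU[item]

def ptu(t):
    """Definition 4: sum of the potential utilities of all items in the transaction."""
    return sum(pu_item(i, t) for i in t)

def pu_itemset(X, t):
    """Assumption: potential utility of X in one transaction is the sum over its items."""
    return sum(pu_item(i, t) for i in X)

def pu_batch(X, batch):
    """Definition 2: sum over the transactions of the batch that contain X."""
    return sum(pu_itemset(X, t) for t in batch if set(X).issubset(t))

def pu_window(X, win):
    """Definition 3: sum over the batches of the window."""
    return sum(pu_batch(X, b) for b in win)
```

For instance, `ptu(T1)` adds 2 × 0.9 × 3.0 and 1 × 0.5 × 2.0, and `pu_window({"a", "c"}, window)` only counts `T1` and `T2`, because `T3` does not contain both items.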

Definition 5 (see [20] (minimum potential utility value in a window)). Given the minimum utility threshold $\delta$, the minimum potential utility value in the window $SW_k$, $minPU(SW_k)$, is defined as For example, in Figure 1, when set , .

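Definition 6 (see [20] (PHUIs in a window)). An itemset $X$ is a PHUI in the window $SW_k$ if its potential utility is no less than the minimum potential utility value of the window, that is, $pu(X, SW_k) \geq minPU(SW_k)$. Finding the PHUIs in window $SW_k$ means finding all the itemsets that satisfy this criterion.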
For example, in Figure 1, when set , , as , is a PHUI.

It can be seen that the potential utility of itemsets has no downward closure property [4]; that is, the potential utility constraint is neither monotone nor antimonotone. Hence, unlike in FIM, the potential utility of an itemset cannot be used to prune the search space. To deal with this problem, an overestimated utility of the itemset, the transaction weighted probability and utility (TWPU), is used in the PHUIs mining process to prune the search space.

Definition 7 (TWPU in a batch). The TWPU of an itemset $X$ in the batch $B_j$, $TWPU(X, B_j)$, is defined by
$$TWPU(X, B_j) = \sum_{T_d \in B_j,\; X \subseteq T_d} PTU(T_d).$$
For example, in Figure 1, .

Definition 8 (TWPU in a window). The TWPU of an itemset $X$ in the window $SW_k$, $TWPU(X, SW_k)$, is defined by
$$TWPU(X, SW_k) = \sum_{B_j \in SW_k} TWPU(X, B_j).$$
For example, in Figure 1, .

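Definition 9 (high transaction weighted probabilistic and utilization itemsets in a window). An itemset $X$ is a high transaction weighted probabilistic and utilization itemset (HTWPUI) in the window $SW_k$ if its TWPU is no less than the minimum potential utility value of the window, that is, $TWPU(X, SW_k) \geq minPU(SW_k)$.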
For example, in Figure 1, when set , , as , is a HTWPUI.
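A minimal Python sketch of Definitions 7 to 9 follows, assuming TWPU accumulates the PTU of every transaction that contains the itemset. The window contents, the PTU values, and the function names are hypothetical, and the minimum potential utility value is passed in as a plain number rather than derived from the threshold.

```python
# Hypothetical window: two batches of (set of items, PTU of the transaction).
window = [
    [({"a", "c"}, 6.4), ({"a", "b", "c"}, 17.8)],
    [({"b"}, 3.5), ({"a", "b"}, 10.0)],
]

def twpu_batch(X, batch):
    """Definition 7: sum of the PTU of the transactions in the batch that contain X."""
    return sum(ptu for items, ptu in batch if X <= items)

def twpu_window(X, win):
    """Definition 8: sum of the batch-level TWPU values over the window."""
    return sum(twpu_batch(X, b) for b in win)

def is_htwpui(X, win, min_pu):
    """Definition 9: X is a HTWPUI if its window TWPU reaches the given minimum value."""
    return twpu_window(X, win) >= min_pu
```

Because a superset can only be contained in fewer transactions, `twpu_window({"a", "b", "c"}, window)` can never exceed `twpu_window({"a", "b"}, window)`, which is the property that the pruning relies on.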

Lemma 10. The TWPU value of an itemset $X$ in the window $SW_k$ maintains the downward closure property.

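Proof. Let $X$ be an itemset that is contained in $SW_k$ and let $Y$ be a superset of itemset $X$. Then, if $X$ is absent from a transaction, $Y$ cannot be present in that transaction. So according to Definition 8, the TWPU value of $Y$ is no larger than that of $X$, denoted as $TWPU(Y, SW_k) \leq TWPU(X, SW_k)$, and if $TWPU(X, SW_k)$ is less than $minPU(SW_k)$, $Y$ cannot be a HTWPUI.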
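The same argument can be written as a short chain of relations. The symbols $X$, $Y$, $SW_k$, and $PTU$ are notational assumptions, and the chain assumes that TWPU sums the PTU of every transaction containing the itemset and that every PTU value is nonnegative (utilities and probabilities are nonnegative).

```latex
\[
X \subseteq Y \;\Rightarrow\;
\{T_d \in SW_k : Y \subseteq T_d\} \subseteq \{T_d \in SW_k : X \subseteq T_d\}
\]
\[
TWPU(Y, SW_k) = \sum_{T_d \in SW_k,\; Y \subseteq T_d} PTU(T_d)
\;\le\; \sum_{T_d \in SW_k,\; X \subseteq T_d} PTU(T_d) = TWPU(X, SW_k)
\]
```

The inequality holds because the second sum ranges over a superset of the transactions in the first sum and adds only nonnegative terms.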

Lemma 11. For a window $SW_k$ and a minimum utility threshold $\delta$, the set of PHUIs is a subset of the set of HTWPUIs.

Proof. Let $X$ be a PHUI in $SW_k$. According to Definitions 3 and 8, $pu(X, SW_k)$ must be less than or equal to $TWPU(X, SW_k)$. So, if $X$ is a PHUI, it must also be a HTWPUI in $SW_k$. Hence $X$ is a member of the set of HTWPUIs, and the set of PHUIs is a subset of the set of HTWPUIs.
Since $TWPU(X, SW_k)$ is an overestimate of $pu(X, SW_k)$, no PHUI will be missed. But the true potential utility of a generated HTWPUI may be lower than the minimum utility threshold. So in our algorithm, the HTWPUIs in $SW_k$ are found with the TWPUS-tree, and the PHUIs among them are then calculated with the PUS-list.
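The overestimate-then-verify idea can be illustrated with a brute-force Python sketch. It is not the tree-based procedure of the paper: it enumerates every itemset directly, uses hypothetical data and a hypothetical minimum value `min_pu`, and assumes that the potential utility of an itemset in a transaction is the sum of the potential utilities of its items.

```python
from itertools import combinations

# Hypothetical window: each transaction maps item -> potential utility (Definition 1).
window = [
    {"a": 5.4, "c": 1.0},
    {"a": 1.8, "b": 12.0, "c": 4.0},
    {"b": 3.5},
]
min_pu = 10.0   # hypothetical minimum potential utility value of the window

def twpu(X):
    """Overestimate: PTU summed over the transactions that contain X."""
    return sum(sum(t.values()) for t in window if X.issubset(t))

def pu(X):
    """Exact potential utility (assumption: per-transaction sum over the items of X)."""
    return sum(sum(t[i] for i in X) for t in window if X.issubset(t))

items = sorted({i for t in window for i in t})
all_itemsets = [frozenset(c) for r in range(1, len(items) + 1)
                for c in combinations(items, r)]

htwpuis = [X for X in all_itemsets if twpu(X) >= min_pu]   # phase 1: candidates
phuis = [X for X in htwpuis if pu(X) >= min_pu]            # phase 2: exact check
```

In this toy window every itemset passes the TWPU test, but `{a}` and `{c}` fail the exact check while `{a, c}` passes it. This shows both why the exact potential utility cannot be used for pruning and why the candidates must be verified afterward.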

3.2. Problem Statement

Given a continuous uncertain data stream, a predefined profit table, and a user-specified minimum utility threshold $\delta$, the problem of HUIs mining over an uncertain data stream is to find, in each window, the PHUIs whose potential utilities are no lower than the minimum potential utility value of that window.

4. The Proposed Algorithm for Mining HUIs over Uncertain Data Stream

In this section, we first develop the proposed PUS-list and TWPUS-tree. Then the PHUIMUS algorithm for HUIs mining over uncertain data streams based on sliding windows is designed. Lastly, the proposed algorithm is thoroughly described and analyzed.

4.1. Construction of PUS-List and TWPUS-Tree over Uncertain Data Stream

This subsection describes the construction of the proposed PUS-list and TWPUS-tree, which are used for HUIs mining over uncertain data streams. The PUS-list stores the potential utility of the items and transactions in the current window, and it is used to calculate the potential utility of the candidates generated by mining the TWPUS-tree without rescanning the analyzed uncertain data stream. Its number of rows is equal to the number of transactions in the current window, and its number of columns is equal to the number of items in the utility table. The items in the TWPUS-tree and its header table are arranged in lexicographic order. The header table maintains the item-id and TWPU value of each item in order to obtain HTWPUIs. Each node in the TWPUS-tree maintains the item-id and batch-by-batch TWPU information to support window sliding. Adjacent links are also maintained in the tree structure to facilitate tree traversals.

For the example of the uncertain data stream in Figure 1, when $B_1$ arrives, the items are sorted in lexicographic order, the current window $SW_1$ is formed, and SW1_$j$ represents the $j$th transaction in $SW_1$. The potential utility of each item and transaction in $B_1$ is calculated while scanning the uncertain data stream, and the PUS-list in Figure 2(a) is obtained. Subsequently, as the first transaction has a PTU value of 21.26 and includes three items “,” “,” and “,” “” is inserted into the TWPUS-tree by creating a node with a TWPU value of 21.26, and items “” and “” are then inserted with TWPU values of 21.26. At the same time, their TWPU values are also inserted into the header table in lexicographic order. After inserting transactions and into the TWPUS-tree, Figure 2(b) shows the TWPUS-tree constructed for $B_1$.

Figure 2: Construction of PUS-list and TWPUS-tree for $B_1$.

Similarly, $B_2$ and $B_3$ are inserted into the PUS-list and TWPUS-tree. Since $SW_1$ contains the first three batches, Figures 3(a) and 3(b) are the final PUS-list and TWPUS-tree for it, respectively.

Figure 3: Construction of PUS-list and TWPUS-tree for $SW_1$.

When $B_4$ arrives, the information of $B_4$ needs to be inserted into the PUS-list and TWPUS-tree, and the information of $B_1$ needs to be deleted from them. In detail, the PUS-list removes the transactions of $B_1$ from the top of the list, inserts the transactions of $B_4$ at the bottom of the list, and changes SW1_ to SW2_ correspondingly. The TWPU counters of the nodes in the TWPUS-tree are shifted one position left to remove the TWPU information of $B_1$, and the TWPU information of $B_4$ is inserted subsequently. Figures 4 and 5 show the deleting and inserting processes, respectively. In Figure 4(b), as “” contains information for $B_1$, $B_2$, and $B_3$, its new information is now . On the other hand, since its child “” does not include any information of $B_2$ and $B_3$, it becomes after the shift operation and is deleted from the tree. The same operations are performed on the other nodes in Figure 4(b). Subsequently, $B_4$ is inserted into the tree, and the result is shown in Figure 5.
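The shift-and-prune step on the batch-by-batch TWPU counters can be sketched as follows. This is a hedged illustration, not the authors' implementation: the `Node` class, the `slide` function, the item names, and every value except the PTU of 21.26 quoted above are hypothetical.

```python
class Node:
    """Tree node holding one TWPU counter per batch of the current window."""
    def __init__(self, item, win_size):
        self.item = item
        self.twpu = [0.0] * win_size
        self.children = {}

def slide(node):
    """Shift every counter array one position left and prune all-zero children."""
    node.twpu = node.twpu[1:] + [0.0]       # drop the oldest batch, open a new slot
    for item in list(node.children):
        child = node.children[item]
        slide(child)
        if not any(child.twpu):             # no information left for any batch
            del node.children[item]

# Hypothetical tree: "a" was seen in the first and third batch, "c" only in the first.
root = Node(None, 3)
a = root.children.setdefault("a", Node("a", 3))
a.twpu = [21.26, 0.0, 8.5]
c = a.children.setdefault("c", Node("c", 3))
c.twpu = [21.26, 0.0, 0.0]

slide(root)            # the oldest batch leaves the window
a.twpu[-1] += 4.0      # the newest batch contributes to the rightmost counter
```

After the shift, node `a` keeps only the counter of the batch that is still in the window, and node `c` is removed because all of its counters are zero. Pruning a child with all-zero counters is safe in a prefix tree, since every descendant's counters are bounded by those of its ancestor.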

Figure 4: Construction of PUS-list and TWPUS-tree for deleting $B_1$.
Figure 5: Construction of PUS-list and TWPUS-tree for inserting $B_4$.

4.2. Mining Process of PHUIMUS

This section describes the mining procedure of the proposed PHUIMUS algorithm, which combines a pattern growth approach with a list based approach. In the proposed algorithm, a prefix tree is created for the bottommost item by taking all the branches that prefix that item, together with their TWPU values. For convenience, the TWPU values of each node in the prefix tree are summed into a single value. Subsequently, a conditional tree is built from the prefix tree by removing the nodes whose TWPU value is too low for that particular item. Lastly, the potential utility of the candidates generated by mining the TWPUS-tree is calculated from the current PUS-list, which avoids rescanning the uncertain data stream.

For the example in Figure 1, mining the recent PHUIs means that all the PHUIs in $SW_2$ must be found. Let and , the prefix tree of item “,” which is the bottom item, is shown in Figure 6(a). It demonstrates that items “” and “” cannot form any candidate itemsets with item “” as their TWPU values are lower than the minimum potential utility value. Hence, by deleting all the nodes that contain items “” and “” from the prefix tree of “,” the conditional tree of item “” is constructed and shown in Figure 6(b). Candidate itemsets , , , , , , , and are generated here.

Figure 6: Mining process.

Next, judge whether the TWPU value of each of the other items is less than the minimum potential utility value. If so, no superset of the item can be a candidate itemset or a PHUI according to the downward closure property, so its prefix and conditional trees need not be created. If not, candidate itemsets are generated for the item in the same way as for item “”. All the candidate itemsets are added to a global candidate list, which is reset to NULL when the sliding window changes. The exact potential utility in $SW_2$ of each candidate itemset is calculated directly from the PUS-list in Figure 5(b). For example, for the candidate itemset , it exists in SW2_ and SW2_, so we can calculate that + from the PUS-list in Figure 5(b) without rescanning the analyzed uncertain data stream. And as , is a PHUI. Performing the same calculation for the other candidate itemsets in the global candidate list yields the PHUIs for the current window.
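The lookup of exact potential utilities in a PUS-list can be sketched in Python as a small matrix query, with one row per transaction of the current window and one column per item. The row labels follow the SW2_ naming used above, but the items, the values, and the function `exact_pu` are hypothetical. The sketch also assumes that a stored value of 0.0 means the item is absent and that the utility of an itemset in a row is the sum of its columns.

```python
ITEMS = ["a", "b", "c"]          # column order of the hypothetical PUS-list

# One row per transaction in the current window; 0.0 marks an absent item.
PUS_LIST = {
    "SW2_1": [5.4, 0.0, 1.0],
    "SW2_2": [1.8, 12.0, 4.0],
    "SW2_3": [0.0, 3.5, 0.0],
}

def exact_pu(candidate):
    """Sum the candidate's columns over every row that contains all of its items."""
    cols = [ITEMS.index(i) for i in candidate]
    total = 0.0
    for row in PUS_LIST.values():
        if all(row[c] > 0 for c in cols):
            total += sum(row[c] for c in cols)
    return total
```

Here `exact_pu(["a", "c"])` reads only the rows SW2_1 and SW2_2, so the candidate is verified from the list alone without touching the stream again.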

4.3. Algorithm Description and Analysis

In this section, PHUIMUS algorithm is described and analyzed. At first, the description of PHUIMUS algorithm is shown in Algorithm 1.

Algorithm 1: Description of PHUIMUS algorithm.

In Algorithm 1, Steps to are used to initialize a global header table H that keeps all the items in lexicographic order, a TWPUS-tree that is initialized as NULL, and a global PUS-list that keeps the potential utility of items and transactions. Step to Step are the construction process of the TWPUS-tree. When a new batch $B_j$ arrives, Step sorts the items of each transaction in the batch, Step calculates the potential utility of items and transactions and inserts them into the PUS-list, and Step updates the header table H. If the current batch number $j$ is no more than $w$, $B_j$ must be included in the first sliding window $SW_1$, and Step only changes the $j$th position of the TWPU counter arrays, whose size is $w$, whether or not the items already exist in the TWPUS-tree. Otherwise, when the current batch number is larger than $w$, Step first performs a one-position left shift on all the TWPU counter arrays to remove the oldest batch. Then the transactions of the oldest batch are removed from the PUS-list and the transactions of $B_j$ are inserted. Subsequently, it updates the header table H and deletes the nodes whose TWPU counter arrays contain only zeros. Lastly, Step changes the rightmost position of the TWPU counter arrays and keeps the size of the arrays as $w$, whether or not the items already exist in the TWPUS-tree. Step to Step are the mining process of PHUIs from the current window. From each bottom item of H, a prefix tree with its header table is created. According to the user-specified threshold, the items whose TWPU values are less than the minimum potential utility value are deleted from the prefix tree and its header table, and the conditional tree and its header table are created. Subsequently, all the candidate itemsets are mined from the conditional tree and added to the global candidate list. Furthermore, the PHUIs are calculated from the global candidate list with the PUS-list. Finally, the current bottom item of H is deleted, and when H becomes NULL, the loop exits.

Moreover, it is not difficult to see that the proposed algorithm satisfies the following properties.

Lemma 12. The number of candidate itemsets generated by the PHUIMUS algorithm is no larger than the number generated by the level-wise based algorithms.

Proof. In the existing level-wise based algorithms, an itemset $X$ becomes a candidate itemset when all of its subsets are candidate itemsets (HTWPUIs). Therefore, $X$ may have a low TWPU value that prevents it from being a true candidate, or it may not appear in the current window at all. In the PHUIMUS algorithm, if $X$ does not appear in the current window, it cannot appear in any branch of the tree and so cannot become a candidate. What is more, once $X$ is determined not to be a HTWPUI, it is pruned. So the candidate set of PHUIMUS contains only the true HTWPUIs, and it cannot be larger than the candidate set of the level-wise based algorithms.

Lemma 13. The PHUIMUS algorithm can find out PHUIs without rescanning the analyzed uncertain data stream.

Proof. As the proposed PHUIMUS algorithm combines pattern growth approach with list based approach, exact potential utility can be calculated by PUS-list from the global candidate itemsets that are generated by TWPUS-tree directly, without rescanning the analyzed uncertain data stream.

5. Experimental Results

In this section, four experiments evaluate the performance of the proposed algorithm over uncertain data streams in terms of run-time, memory consumption, number of discovered PHUIs, and scalability. Because, to our knowledge, this is the first work on HUIs mining over uncertain data streams, and because MHUI-TID is an outstanding algorithm for mining HUIs from data streams, the performance of the designed PHUIMUS algorithm is compared only with MHUI-TID. The comparison between PHUIs and HUIs is made to evaluate whether the proposed algorithm is acceptable.

All algorithms are implemented in Matlab, and the experiments are run on a PC with an Intel Core i5-4590 dual core processor, 4 GB RAM, and a 32-bit Microsoft Windows operating system. The experimental results and discussion follow.

5.1. Datasets

Experiments were performed on the synthetic database T10I4D100K and the real-life databases mushroom, connect, and accidents, which are widely used in HUIs mining research. The parameters and characteristics of these databases are shown in Tables 1 and 2, respectively.

Table 1: Parameters of used databases.
Table 2: Characteristics of the databases.

As these databases do not provide the external utility, internal utility, and existential probability of each item, a simulation model [7] is employed. The model generates random numbers that obey a log-normal distribution in the interval and interval, which correspond to internal and external utility, respectively. In addition, because the items in each transaction are uncertain, their existential probability obeys a uniform distribution in the interval. What is more, as the proposed algorithm targets uncertain data streams, these databases are divided into windows containing a fixed number of batches, and the Batchsize () and Winsize () of each database are also shown in Table 2.
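A simulation of this kind can be sketched with Python's standard `random` module. The sketch is illustrative only: the intervals used as defaults (`iu_range`, `eu_range`, `p_range`) are hypothetical placeholders, the log-normal parameters are arbitrary, and clipping the log-normal draws into the interval is an assumption about how the values are bounded.

```python
import random

def simulate(transactions, iu_range=(1, 5), eu_range=(1.0, 1000.0),
             p_range=(0.5, 1.0), seed=0):
    """Attach simulated utilities and probabilities to plain item transactions."""
    rng = random.Random(seed)

    def clipped_lognormal(lo, hi):
        # Assumption: draw from a log-normal distribution and clip into [lo, hi].
        return min(max(rng.lognormvariate(0.0, 1.0), lo), hi)

    items = sorted({i for t in transactions for i in t})
    # External utility (profit) is drawn once per item.
    eu = {i: clipped_lognormal(*eu_range) for i in items}
    # Internal utility (quantity) and existential probability are drawn per occurrence.
    uncertain = [
        {i: (round(clipped_lognormal(*iu_range)), rng.uniform(*p_range))
         for i in sorted(t)}
        for t in transactions
    ]
    return eu, uncertain

eu, data = simulate([{"a", "b"}, {"b", "c"}])
```

Each simulated transaction maps an item to an (internal utility, existential probability) pair, which is the input shape needed to compute potential utilities.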

5.2. Run-Time

The minimum utility threshold is set to the same value in all the windows. The run-time of the proposed algorithm is compared with that of MHUI-TID on the above four databases for various thresholds, and the result is shown in Figure 7. Notice that the databases processed by MHUI-TID do not contain probability values, so MHUI-TID uses only the precise versions of the four databases.

Figure 7: Run-time of the compared algorithms for various minimum utility thresholds.

Figure 7 indicates that the proposed algorithm has a shorter run-time than MHUI-TID. This result is reasonable, since the proposed algorithm discovers PHUIs directly from the TWPUS-tree and PUS-list and thus avoids spending time on database scans. It also indicates that the combination of the pattern growth approach and the list based approach performs well for HUIs mining over uncertain data streams.

5.3. Number of Discovered PHUIs

In an uncertain data stream, when the existential probability of every item is set to 1, the stream degrades to a precise data stream. In this case, PHUIs mining algorithms obtain the whole set of HUIs, the same result as traditional HUIs mining algorithms. As no algorithm had previously been developed for discovering HUIs over uncertain data streams, the result of the designed algorithm is compared with that of the MHUI-TID algorithm, which ignores the probability values of the uncertain data stream. This comparison between PHUIs and HUIs is used to evaluate whether the proposed algorithm is acceptable. The number of discovered HUIs and PHUIs under various thresholds is shown in Figure 8.

Figure 8: Number of HUIs and PHUIs under various minimum utility thresholds.

Figure 8 shows that, for the various thresholds on the four databases, the number of PHUIs is usually no larger than that of HUIs. This is because the proposed algorithm considers both probability and utility, whereas MHUI-TID only considers utility. Besides, the numbers of both HUIs and PHUIs decrease as the threshold increases. This result also shows that few PHUIs are produced from the numerous discovered HUIs when the probability constraint is considered. So, in real-life applications, many HUIs may not be the itemsets that users need for making effective decisions, especially when the threshold is set high. Therefore, PHUIs are fewer and more valuable than HUIs, as PHUIs carry probability values.

5.4. Memory Consumption

The peak memory consumption of the proposed algorithm and the MHUI-TID algorithm is compared, and the results under various thresholds are shown in Figure 9.

Figure 9: Memory consumption under various minimum utility thresholds.

Figure 9 shows that, for the various thresholds on the four databases, the proposed algorithm consumes slightly less memory than the MHUI-TID algorithm. This result is reasonable, since the proposed PHUIMUS algorithm discovers PHUIs by taking both the probability and utility constraints into consideration through the designed TWPUS-tree and PUS-list, so more efficient pruning strategies can be applied.

5.5. Efficiency of PHUIMUS with Window Size Variation

In terms of run-time and memory consumption, stream data mining algorithms based on sliding windows are greatly influenced by the window size. The window size depends on the number of transactions per batch and the number of batches in a window. Therefore, for a fixed threshold, we vary both of these parameters and compare the run-time of the proposed algorithm with that of the existing MHUI-TID algorithm, as shown in Figure 10.

Figure 10: Effect of window size variation on run-time.

Figure 10 shows that the proposed algorithm runs faster than the existing one as the window size changes. The advantage of the proposed algorithm is more prominent when the window size is larger or the number of distinct items increases.

Figure 11 shows the memory consumption of the proposed algorithm and the MHUI-TID algorithm under various window sizes.

Figure 11: Effect of window size variation on memory consumption.

Figure 11 shows that the proposed algorithm consumes less memory than the existing MHUI-TID algorithm under different window sizes. The main reason for this result is that the proposed TWPUS-tree represents all the useful information in a compressed form. More importantly, with the PUS-list the algorithm can discover PHUIs without rescanning the analyzed uncertain data stream.

6. Conclusions

In the precise data stream, several algorithms have been proposed to mine HUIs, such as THUI-Mine, MHUI-BT, and MHUI-TID. Moreover, some extended areas of HUIs mining, such as maximal high utility itemsets mining, interactive mining, and top-k high utility itemsets mining, have been studied recently.

In uncertain databases, itemsets with both high utility and high existential probability are useful to users, whereas itemsets that satisfy only one of the two are not. Lin et al. proposed PHUI-UP, based on the two-phase model, and PHUI-List, based on a list structure, and Lan et al. proposed UHUI-apriori, based on Apriori. To the best of our knowledge, these are the only algorithms that address the HUIs mining problem over uncertain databases.

Building on the above research, this paper provides an efficient method for HUIs mining over uncertain data stream. New definitions of item utility, itemset utility, transaction utility, and transaction weighted utility are given. A novel tree structure (TWPUS-tree), a list structure (PUS-list), and a new algorithm (PHUIMUS) are proposed. In detail, the TWPUS-tree maintains a fixed sort order and batch-by-batch information, which makes it easy to construct and maintain with a sliding window. The PUS-list yields the exact potential utility of the candidate itemsets generated by the TWPUS-tree without rescanning the analyzed uncertain data stream. By using the TWPUS-tree and the PUS-list, the PHUIMUS algorithm can adaptively capture recent changes of information in an uncertain data stream. Experimental results show that our algorithm outperforms the existing algorithm in run-time, number of discovered PHUIs, memory usage, and scalability.
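To illustrate how a list structure can give exact values for a candidate without rescanning the stream, the following is a simplified sketch in the spirit of utility-list joins. It is not the PUS-list itself, whose exact fields are defined earlier in the paper. The `Entry` fields, the `join` and `totals` helpers, the toy numbers, and the assumption that the existential probability is attached to the whole transaction are all illustrative.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    tid: int     # transaction id inside the current window
    prob: float  # existential probability (assumed transaction-level)
    util: float  # utility of the item or itemset in that transaction


def join(list_a, list_b):
    """Intersect two single-item lists on tid to obtain the list of the pair."""
    by_tid = {e.tid: e for e in list_b}
    return [
        Entry(e.tid, e.prob, e.util + by_tid[e.tid].util)
        for e in list_a
        if e.tid in by_tid
    ]


def totals(entries):
    """Return (summed utility, summed probability) over all entries."""
    return sum(e.util for e in entries), sum(e.prob for e in entries)


# Hypothetical per-item lists built during a single pass over the window.
list_a = [Entry(1, 0.9, 6.0), Entry(2, 0.5, 3.0), Entry(4, 0.8, 9.0)]
list_b = [Entry(2, 0.5, 4.0), Entry(3, 0.7, 2.0), Entry(4, 0.8, 1.0)]

list_ab = join(list_a, list_b)
utility_ab, probability_ab = totals(list_ab)
print(list_ab, utility_ab, probability_ab)
```

In this sketch the totals for the pair are obtained from the two stored lists alone. Whether the pair qualifies as a PHUI would then be decided by comparing these totals with the utility and probability thresholds defined in the paper.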

To the best of our knowledge, this is the first algorithm for finding HUIs over uncertain data stream. By combining a pattern growth approach with a list-based approach, the proposed algorithm significantly reduces the number of candidate itemsets as well as the overall run-time. In addition, by keeping the recent information compactly in the TWPUS-tree and PUS-list, the algorithm saves considerable memory space. Future work will focus on further improving the efficiency of discovering HUIs over uncertain data stream.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

  1. R. Agrawal, T. Imielinski, and A. Swami, “Database mining: a performance perspective,” IEEE Transactions on Knowledge and Data Engineering, vol. 5, no. 6, pp. 914–925, 1993.
  2. J. Han, J. Pei, Y. Yin, and R. Mao, “Mining frequent patterns without candidate generation: a frequent-pattern tree approach,” Data Mining and Knowledge Discovery, vol. 8, no. 1, pp. 53–87, 2004.
  3. R. Agrawal, T. Imielinski, and A. Swami, “Mining association rules between sets of items in large database,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 207–216, Washington, DC, USA, 1993.
  4. R. Agrawal and R. Srikant, “Fast algorithms for mining association rules,” in Proceedings of the 20th International Conference on Very Large Data Bases, pp. 487–499, Santiago, Chile, 1994.
  5. H. Yao, H. J. Hamilton, and C. J. Butz, “A foundational approach to mining itemset utilities from databases,” in Proceedings of the SIAM International Conference on Data Mining, pp. 211–225, 2004.
  6. R. Chan, Q. Yang, and Y.-D. Shen, “Mining high utility itemsets,” in Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM '03), pp. 19–26, Melbourne, Fla, USA, November 2003.
  7. Y. Liu, W. K. Liao, and A. Choudhary, “A two-phase algorithm for fast discovery of high utility itemsets,” in Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 689–695, Hanoi, Vietnam, May 2005.
  8. C. F. Ahmed, S. K. Tanbeer, B.-S. Jeong, and Y.-K. Lee, “Efficient tree structures for high utility pattern mining in incremental databases,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 12, pp. 1708–1721, 2009.
  9. C.-W. Lin, T.-P. Hong, and W.-H. Lu, “An effective tree structure for mining high utility itemsets,” Expert Systems with Applications, vol. 38, no. 6, pp. 7419–7424, 2011.
  10. V. S. Tseng, B.-E. Shie, C.-W. Wu, and P. S. Yu, “Efficient algorithms for mining high utility itemsets from transactional databases,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 8, pp. 1772–1786, 2013.
  11. M. Liu and J. Qu, “Mining high utility itemsets without candidate generation,” in Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM 12), pp. 55–64, November 2012.
  12. J. C.-W. Lin, W. Gan, T.-P. Hong, and V. S. Tseng, “Efficient algorithms for mining up-to-date high-utility patterns,” Advanced Engineering Informatics, vol. 29, no. 3, pp. 648–661, 2015.
  13. V. S. Tseng, C. J. Chu, and T. Liang, “Efficient mining of temporal high utility itemsets from data streams,” in Proceedings of the ACM KDD Workshop on Utility-Based Data Mining Workshop, pp. 18–27, Chicago, Ill, USA, 2006.
  14. H.-F. Li, H.-Y. Huang, Y.-C. Chen, Y.-J. Liu, and S.-Y. Lee, “Fast and memory efficient mining of high utility itemsets in data streams,” in Proceedings of the 8th IEEE International Conference on Data Mining (ICDM '08), pp. 881–886, IEEE, Pisa, Italy, December 2008.
  15. H.-F. Li, H.-Y. Huang, and S.-Y. Lee, “Fast and memory efficient mining of high-utility itemsets from data streams: with and without negative item profits,” Knowledge and Information Systems, vol. 28, no. 3, pp. 495–522, 2011.
  16. B.-E. Shie, P. S. Yu, and V. S. Tseng, “Efficient algorithms for mining maximal high utility itemsets from data streams with different models,” Expert Systems with Applications, vol. 39, no. 17, pp. 12947–12960, 2012.
  17. C. F. Ahmed, S. K. Tanbeer, B.-S. Jeong, and H.-J. Choi, “Interactive mining of high utility patterns over data streams,” Expert Systems with Applications, vol. 39, no. 15, pp. 11979–11991, 2012.
  18. M. Zihayat and A. An, “Mining top-k high utility patterns over data streams,” Information Sciences, vol. 285, pp. 138–161, 2014.
  19. J. C.-W. Lin, W. Gan, P. Fournier-Viger, T.-P. Hong, and V. S. Tseng, “Efficient algorithms for mining high-utility itemsets in uncertain databases,” Knowledge-Based Systems, vol. 96, pp. 171–187, 2016.
  20. Y. Q. Lan, Y. Wang, Y. Wang, S. W. Yi, and D. Yu, “Mining high utility itemsets over uncertain databases,” in Proceedings of the 7th International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC '15), pp. 235–238, IEEE, Xi'an, China, September 2015.
  21. S. Kannimuthu and K. Premalatha, “Discovery of high utility itemsets using genetic algorithm with ranked mutation,” Applied Artificial Intelligence, vol. 28, no. 4, pp. 337–359, 2014.
  22. J. C.-W. Lin, L. Yang, P. Fournier-Viger, T.-P. Hong, and M. Voznak, “A binary PSO approach to mine high-utility itemsets,” Soft Computing, vol. 3, pp. 1107–1135, 2016.
  23. C. K. Chui, B. Kao, and E. Hung, “Mining frequent itemsets from uncertain data,” in Proceedings of the Pacific-Asia Conference Advances in Knowledge Discovery and Data Mining, pp. 47–58, Nanjing, China, 2007.