Discovering Approximate and Significant High-Utility Patterns from Transactional Datasets

Tang, Huijun; Wang, Le; Liu, Yangguang; Qian, Jiangbo

doi:https://doi.org/10.1155/2022/6975130

Journal of Mathematics

Research Article | Open Access

Volume 2022 | Article ID 6975130 | https://doi.org/10.1155/2022/6975130

Huijun Tang,^1,2 Le Wang,^2 Yangguang Liu,^2 and Jiangbo Qian^1

Academic Editor: Feng Feng

Received 04 Aug 2022

Revised 07 Oct 2022

Accepted 05 Nov 2022

Published 16 Nov 2022

Abstract

Mining high-utility patterns (HUPs) from transactional datasets has been widely studied, and various algorithms have been introduced to solve this problem. However, the time and space efficiency of these algorithms is still limited, and the mining system cannot provide timely feedback on relevant information. In addition, when mining HUPs from taxonomy transactional datasets, a large portion of the quantitative results are merely accidental responses to the user-defined utility constraints and may have no statistical significance. To address these two problems, we propose two corresponding approaches named Sampling HUP-Miner and Significant HUP-Miner. Sampling HUP-Miner derives a sample size for a transactional dataset based on a theoretical guarantee; the mining results on a sample of that size are an effective approximation of the results on the whole dataset. Significant HUP-Miner introduces the concept of testable support, and significant HUPs can be found in a timely manner under the testable support constraint. Experiments show that the two algorithms discover approximate and significant HUPs smoothly and perform well in terms of runtime, number of patterns, memory usage, and average utility.

1. Introduction

HUP mining is quite different from traditional frequent pattern mining as it focuses on the quantitative calculation of itemsets rather than the number of patterns. By considering the profit of the itemsets, HUP mining has been applied to many fields [1–4], including shopping basket analysis and financial stock extraction.

The task of HUP mining is more complex because the downward closure property [5, 6] does not hold for utility. As a result, the search space is larger, since no effective pruning strategy is directly available. To mine HUPs more efficiently, many HUP mining approaches [7, 8] have been introduced. Note that all these algorithms output the same results for a given utility threshold; they differ only in how they produce the HUPs with fewer computations.

The recent approach PTM [6] can output HUPs smoothly thanks to its effective list data structure, but it becomes inefficient as the dataset grows. Pattern sampling is a candidate technology for such problems because it returns approximate results with low memory and time consumption. Sampling strategies have been used for drawing graph patterns [9] and frequent patterns [10]. Algorithm CSSAMPLING [11] is a typical method that samples a sequential pattern in proportion to its frequency, and Diop [12] proposed HAI Sampler to sample one HUP according to the average utility proportion of the transactions. The patterns returned by such approaches are drawn according to the frequency or utility proportion in the dataset, and the methods often rely on a two-phase procedure [10]: first, draw one transaction randomly according to the proportion of frequency [11] or utility [12]; then, from the transaction just drawn, return a pattern randomly in proportion to its frequency or utility. Such sampling approaches cannot be applied to obtain approximate HUPs as they only output one pattern for each interaction. Moreover, since a pattern with a large proportion has a greater chance of being extracted, the same pattern may be extracted each time, and thus the sampling system may lack diversity in the extracted patterns. We do not draw patterns from one random transaction as CSSAMPLING [11] and HAI Sampler [12] do. Instead, we return approximate HUPs from a sample of the whole dataset, so that the accuracy of the mining results on the sample can be controlled under a rigorous probability guarantee. Recently, Valiullin et al. [13] first used such a sampling strategy to efficiently obtain approximate frequent itemsets with a probability guarantee, but to the best of our knowledge, there are few methods for the efficient approximate extraction of HUPs.

In addition, when the transactions carry a label feature, traditional HUP mining methods may return HUPs that have no statistical significance. That is to say, if we consider the mined data as a sample of the population, some HUPs may occur with equal probability under each label; such HUPs are insignificant and should be disregarded, and more attention should be paid to HUPs whose probability is biased toward a certain label. Statistically significant pattern mining (SSPM) [14, 15] can solve such a problem based on hypothesis testing; it can find HUPs that are more significant with respect to the label statistics. Nevertheless, such an approach has not yet been applied to HUP mining, and thus proposing an efficient algorithm is both important and challenging.

To address these two problems, this paper serves to propose two corresponding algorithms: Sampling HUP-Miner and Significant HUP-Miner.

Sampling HUP-Miner proposes a sampling strategy under a strict theoretical guarantee. It looks for a sample size that is less than the size of the original dataset and then outputs the approximate results, which are obtained with better time and memory efficiency. It is important to note that Sampling HUP-Miner controls such sampling strategy with a strict probability guarantee.

Significant HUP-Miner mines significant high-utility patterns (SHUPs) from taxonomy transactional datasets. Traditional SSPM algorithms solve such problems via hypothesis tests under the error control of FWER [14] or FDR [15], and they are often two-phase methods that first produce significant pattern candidates and then test their significance. In fact, by deriving a testable support value, the testing step can take part in the candidate mining process, so that insignificant patterns are removed early and the search space is reduced.

The main contributions of our paper are as follows:
(1) We propose Sampling HUP-Miner, which derives a sample size and an approximate set of HUPs under a probability guarantee and obtains them with better efficiency.
(2) We introduce the concept of a testable support value for the frequently used Fisher's test; insignificant candidates can be removed early by using the testable support requirement so as to reduce the search space, increase the test level, and find more significant HUPs.
(3) We propose Significant HUP-Miner, an efficient algorithm that mines significant HUPs with testable support requirements by a pattern-growth method; the results can be obtained efficiently under the control of FDR.
(4) We run several experiments to demonstrate the effectiveness of Sampling HUP-Miner and Significant HUP-Miner. Our methods achieve discovery results with higher efficiency than their counterparts.

The rest of the paper is organized as follows: Section 2 reviews the related works. Section 3 defines relevant terms. Sections 4 and 5 present Sampling HUP-Miner and Significant HUP-Miner, respectively. Section 6 reports the experimental results of our algorithms, and Section 7 concludes this work.

2. Related Works

HUP mining has become a hot research area, and many algorithms have been proposed to solve this problem, including UP-Growth and UP-Growth+ [17], MU-Growth [18], IHUP [1], and Two-Phase [16]. The UP-Growth algorithm creates the HUP-Tree by mapping the transactional set to the nodes in the tree. According to the existing paths, it generates all the possible item combinations and calculates their utility values. The drawback of the algorithm is that it produces a large number of item combinations, and the HUP-Tree needs more space to store the itemsets and utility values. Different from UP-Growth, the IHUP algorithm [1] uses the total utility of the item currently processed on the IHUP-Tree as an overestimated utility to determine whether an itemset is a candidate. The MU-Growth [18] algorithm improves on IHUP by rebuilding the tree structure with important leaf nodes; experiments on the datasets showed that the number of candidates could be reduced and the runtime efficiency was greatly improved. The abovementioned algorithms may produce a huge number of candidates when the dataset is large, so algorithms that do not produce candidates have been proposed, such as EFIM [19], FHM [20], and D2HUP [21]. Although their running time efficiency is greatly improved, they may take more memory to achieve this goal.

Pattern mining from a small sample size can be used to improve mining efficiency in terms of running time, memory, and so on. Mannila et al. [22] first used a sampling strategy to efficiently obtain association rules. The approximate frequency of itemset mining can also be found in [23]. Recently, Bashir and Lai [24] proposed a sampling strategy for mining frequent itemsets with inequality control. Chen and An [25] proposed a distributed parallel and sampling strategy for mining HUPs; it is built based on Hoeffding’s inequality and Chebyshev’s inequality and can be known as an effective method for sampling HUPs.

Hämäläinen and Webb [14] proposed the SSPM model first and regard significant pattern mining as a multiple-hypothesis testing problem. Webb [26] controls the error rate by introducing FWER and FDR in significant pattern discovery. Bonferroni’s control [27] is used as a correction method for test level for multiple testing under the control of FWER. The test statistic is an important part of SSPM. Fisher’s conditional test statistic is a popular approach to measuring the significance; the LAMP [28] strategy is used to reduce the calculation time as it puts forward testable support requirements for the testing patterns based on Fisher’s test statistic. Barnard’s test statistic is known to be another effective method for calculating the p value. Pellegrina et al. [29] proposed a novel, unconditional statistical test method for evaluating the significance, and it gives the testable support requirements based on Barnard’s test statistic. Based on FWER control, Riondato and Upfal [30] proposed a significant pattern mining algorithm based on progressive sampling pattern testing. Tran et al. [31] applied the mining results of the significant pattern test to utility dataset analysis. Cheng et al. [32] proposed algorithm LTC to look for significant patterns in the data stream that are not only frequent but also persistent.

FWER may make the significance threshold too strict, so that few hypotheses have the opportunity to be rejected [33]. Compared with FWER for testing m hypotheses, FDR may be more tolerant. BH [34] is a typical method to control FDR ≤ α, where α is the test level, but it is conservative; in fact, some research works proved that the test level can be expanded to mα/m₀ for more rejections [35] based on the number of true hypotheses m₀. Recently, literature [36] has proposed a dynamic adaptive procedure RB20 that controls FDR effectively and with better accuracy; it is more powerful than the traditional estimation method.

3. Preliminaries

A transactional database TD is a set of transactions {T₁, T₂, …, T_n}, where each transaction T_d ∈ TD has a unique identifier d. There are m distinct items, denoted as I = {x₁, x₂, …, x_m}. q(x_j, T_d) denotes the internal utility of x_j in T_d; e.g., in Table 1, q(B, T₂) = 3, q(D, T₂) = 3, and q(E, T₂) = 4. p(x_j) denotes the profit of x_j, e.g., p(C) = 9 in Table 2.

Definition 1. U(x_j, T_d) denotes the utility of x_j in T_d, and

U(x_j, T_d) = p(x_j) × q(x_j, T_d).

For the data in Tables 1 and 2, U(A, T₁) = 4 × 4 = 16, U(C, T₁) = 9 × 3 = 27, and U(F, T₁) = 2 × 1 = 2.

Definition 2. U(X, T_d) denotes the utility of itemset X in T_d, and

U(X, T_d) = Σ_{x_j ∈ X} U(x_j, T_d).

For the data in Tables 1 and 2, U({BD}, T₂) = 15, and U({BE}, T₂) = 10.

Definition 3. TU(T_d) denotes the utility of transaction T_d, and

TU(T_d) = Σ_{x_j ∈ T_d} U(x_j, T_d).

For the data in Tables 1 and 2, TU(T₅) = U(B, T₅) + U(C, T₅) + U(D, T₅) = 6 + 18 + 6 = 30.

Definition 4. MTU_TD denotes the maximum transaction utility value in TD, and

MTU_TD = max{TU(T_d) | T_d ∈ TD}.

For the data in Tables 1 and 2, MTU_TD = max({45, 35, 27, 26, 30, 32}) = 45.

Definition 5. U_TD(X) denotes the utility value of itemset X in TD, and

U_TD(X) = Σ_{T_d ∈ TD, X ⊆ T_d} U(X, T_d).

For the data in Tables 1 and 2, U_TD({CD}) = U({CD}, T₅) + U({CD}, T₆) = 24 + 21 = 45.

Definition 6. U_TD denotes the utility value of TD, and

U_TD = Σ_{T_d ∈ TD} TU(T_d).

For the data in Tables 1 and 2, U_TD = 45 + 35 + 27 + 26 + 30 + 32 = 195. The average utility of TD is denoted as AvgTD = U_TD/|TD|, and in this example, AvgTD = 195/6 = 32.5.

Definition 7. TWU(X) denotes the transaction-weighted utility value of itemset X, and

TWU(X) = Σ_{T_d ∈ TD, X ⊆ T_d} TU(T_d).

For the data in Tables 1 and 2, TWU({AC}) = 45 + 32 = 77.

Definition 8. For a threshold θ, Min U denotes the minimum utility value, and

Min U = θ × U_TD.

Definition 9. An itemset is considered a HUP if its utility value is more than Min U.
In this study, algorithms for approximate and significant patterns are proposed. The notations of this study are given in Table 3.
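
Definitions 1–9 translate directly into a few lines of Python. The sketch below is illustrative only: Tables 1 and 2 are not reproduced here, so the profits and the two transactions T₁ and T₅ are chosen to match the utilities quoted in the examples above, the function names are hypothetical, and the threshold is applied to this two-transaction toy dataset rather than to the full dataset.

```python
# Toy data chosen to reproduce the example utilities quoted in the text
# (an assumption, not a copy of Tables 1 and 2).
PROFIT = {"A": 4, "B": 2, "C": 9, "D": 3, "F": 2}
TD = {
    "T1": {"A": 4, "C": 3, "F": 1},  # item -> internal utility q
    "T5": {"B": 3, "C": 2, "D": 2},
}

def item_utility(item, tid):
    """Definition 1: U(x, T_d) = p(x) * q(x, T_d)."""
    return PROFIT[item] * TD[tid][item]

def itemset_utility(itemset, tid):
    """Definition 2: utility of an itemset inside one transaction."""
    return sum(item_utility(x, tid) for x in itemset)

def transaction_utility(tid):
    """Definition 3: TU(T_d), the utility of all items in T_d."""
    return itemset_utility(TD[tid], tid)

def database_utility():
    """Definition 6: U_TD, the sum of all transaction utilities."""
    return sum(transaction_utility(t) for t in TD)

def utility(itemset):
    """Definition 5: utility of an itemset over the transactions containing it."""
    return sum(itemset_utility(itemset, t) for t in TD
               if set(itemset).issubset(TD[t]))

def twu(itemset):
    """Definition 7: sum of TU over the transactions containing the itemset."""
    return sum(transaction_utility(t) for t in TD
               if set(itemset).issubset(TD[t]))

def is_hup(itemset, threshold):
    """Definitions 8 and 9: utility above Min U = threshold * U_TD."""
    return utility(itemset) > threshold * database_utility()
```

On this toy data, `transaction_utility("T1")` gives 45 and `transaction_utility("T5")` gives 30, matching the values used in Definitions 3 and 4.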

4. Sampling HUP-Miner: Approximate HUP Mining

In this section, we propose Sampling HUP-Miner, an approximate HUP mining method that returns a sample size based on a sampling strategy under a strict theoretical guarantee. The strategy is similar to that of [25], but unlike [25], we apply McDiarmid's inequality twice to establish the sampling theorem.

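Lemma 1 ((McDiarmid’s inequality) [14]). Let x₁, …, x_m ∈ 𝒳 be independent random variables, and let f: 𝒳^m → ℝ be a function that satisfies the bounded differences condition that, for all i ∈ {1, …, m} and all x₁, …, x_i, …, x_m, x_i′ ∈ 𝒳:

|f(x₁, …, x_i, …, x_m) − f(x₁, …, x_i′, …, x_m)| ≤ d_i.

Then, for ε > 0,

Pr(f(x₁, …, x_m) − E[f(x₁, …, x_m)] ≥ ε) ≤ exp(−2ε² / Σ_{i=1}^{m} d_i²).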

Lemma 2 (See [25]). Let X₁, …, X_m be m independent random variables, each with the same probability distribution with mean E(X). Then, the average X̄ of the m variables has a distribution with mean E(X̄) = E(X).

Sampling HUP-Miner is going to output a sample size that is less than the size of TD according to Theorem 1, and approximate HUPs can be obtained with low memory-time consumption.
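
To make the role of McDiarmid's inequality concrete, the following sketch shows how a bounded-differences bound turns into a sample size for the mean of m independent variables that each lie in [0, c]. This is a generic textbook illustration, not the sample size formula of Theorem 1; the names `eps`, `delta`, and `c` are hypothetical (c plays the role of a bound such as MTU_TD, and delta is an assumed failure probability).

```python
import math

def mcdiarmid_tail(eps, diffs):
    """Upper bound on P(f - E[f] >= eps) given bounded differences d_i."""
    return math.exp(-2.0 * eps ** 2 / sum(d * d for d in diffs))

def mean_tail(eps, m, c):
    """Bound for the mean of m independent variables in [0, c].

    Changing one variable moves the mean by at most c / m, so d_i = c / m
    and the bound simplifies to exp(-2 * m * eps^2 / c^2).
    """
    return mcdiarmid_tail(eps, [c / m] * m)

def sample_size_for_mean(eps, delta, c):
    """Smallest m for which exp(-2 * m * eps^2 / c^2) <= delta."""
    return math.ceil(c ** 2 * math.log(1.0 / delta) / (2.0 * eps ** 2))
```

For example, with `eps = 0.1`, `delta = 0.1`, and `c = 1.0`, the required size is 116, and the bound indeed drops below 0.1 between m = 115 and m = 116.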

Theorem 1. Given a transactional dataset TD with n transactions, three pregiven parameters > 0, > 0, and probability parameter , let TS be a random sample of TD, m is the size of TS, when m ≥ and the value of satisfies , then for any pattern X in the dataset TS, the utility proportion , with probability at least .

Proof. Let TS be a random sample of TD with m transactions T_s1, T_s2, …, T_sm, which can be considered an independent random variable. For any pattern X in TS,Based on Lemma 1 and similar to [25], formulas (12)–(17) could be easily known, first d_i ≤ MTU_TD, thus,Based on Lemma 2,Formula (12) can be changed asHence,Thus,So,By using Lemma 2 again, it could knowSo,Hence,Thus,Since,We could knowLet = , which is a monotonic decreasing function, and when a pregiven parameter is given, when and m ≥ , with probability at least .
In Definition 7, for simplicity, we denote . Similar to [25], we can obtain the accuracy of the sampling method by Theorem 2.

Theorem 2. If pattern X is an HUP in TD with a minimum utility threshold , pattern X could have utility at least U_TS in TS and at least U_TD in TD with probability at least .

Proof. Since X is an HUP in TD, soBased on Theorem 1,with probability at least .
Hence,Also, based on Theorem 1,So,Hence,

Theorem 1 reduces the size of the dataset, so an approximate HUP set can be mined by any of the existing HUP mining methods; because the number of transactions is reduced, the running time and memory consumption are greatly reduced as well. Theorem 2 shows that the accuracy of the results can be guaranteed.

5. Significant HUP-Miner: Significant HUP Mining

We are going to mine the significant high-utility pattern (SHUP) based on the frequently used Fisher’s test in SSPM. Transactions may be associated with label features G⁰ or G¹, as shown in Table 4 based on Table 1.

Definition 10. For an itemset X and a transaction T_d, define

s(X, T_d) = 1 if X ⊆ T_d, and s(X, T_d) = 0 otherwise.

The support [10] is then

S(X) = Σ_{T_d ∈ TD} s(X, T_d).

For the data in Table 4, pattern {AC} is in T₁ and T₆, so S({AC}) = 1 + 1 = 2.
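
As a small illustration of the support in Definition 10 split by label, the sketch below counts S(X) together with the supports within G¹ and G⁰. The labeled transactions are hypothetical (Table 4 is not reproduced here), and the function name is an assumption.

```python
def supports(itemset, transactions):
    """Return (S, S1, S0): total support and support within labels 1 and 0.

    `transactions` is a list of (set_of_items, label) pairs with label 0 or 1.
    """
    items = set(itemset)
    s1 = sum(1 for t_items, label in transactions
             if label == 1 and items <= t_items)
    s0 = sum(1 for t_items, label in transactions
             if label == 0 and items <= t_items)
    return s1 + s0, s1, s0

# Hypothetical labeled transactions (not the paper's Table 4).
LABELED = [
    ({"A", "C", "F"}, 1),
    ({"B", "D", "E"}, 0),
    ({"A", "C"}, 1),
    ({"B", "C", "D"}, 0),
]
```

The pair (S₁(X), S₀(X)) returned here is what fills the first row of a 2 × 2 contingency table for pattern X.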

Definition 11. A HUP X is flagged as a SHUP if the p value of the testing pattern is less than a test level threshold.
is the probability that X has the label Gⁱ. H₀: , for a given test level threshold . If the p value of a HUP is less than , it will be flagged as a SHUP. There may exist a large number of patterns waiting for testing, and we are going to mine significant HUP efficiently in a novel way by introducing the testable support requirement in the candidate mining process, which is different from the traditional two-phase procedure.

5.1. Testable Support Value

Fisher’s test is used to calculate the p value of a HUP. The calculation is based on the 2 × 2 contingency table shown in Table 5.

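For a HUP X, S₁(X) and S₀(X) are the supports of X among the transactions labeled G¹ and G⁰, respectively, and n₁ is the number of transactions labeled G¹ among the n transactions. Take each integer x in the interval [max(0, n₁ − (n − S(X))), min(S(X), n₁)] as a variable; P_F(x) is calculated as follows:

P_F(x) = C(n₁, x) · C(n − n₁, S(X) − x) / C(n, S(X)),

where C(a, b) denotes the binomial coefficient.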

The final test statistic value for X was known asthe minimum value of P_F(x) is achieved at x_min = min(S(X), n₁) and described as

In most cases, S(X) ≤ n₁, and the following Proposition 1 confirms the monotonicity of P_F(x), which is required to calculate our testable support value. On the other hand, when S(X) > n₁, the property is not satisfied. Therefore, we define a function that draws the testable support value.

For a given pattern X, y = S(X), we define the monotonically decreasing function as

Proposition 1. If y ≤ n₁, P_F(y) ≤ P_F(y − 1) holds.

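Proof. For y ≤ n₁, the minimal p value is P_F(y) = C(n₁, y)/C(n, y). Similarly, P_F(y − 1) = C(n₁, y − 1)/C(n, y − 1). Thus, P_F(y)/P_F(y − 1) = (n₁ − y + 1)/(n − y + 1). Since n₁ < n, this ratio is at most 1; hence, P_F(y) ≤ P_F(y − 1), and Proposition 1 holds.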
From Proposition 1, when S(X) ≤ n₁, which is commonly satisfied, the minimal p value is monotonically decreasing in the support; if the test level is , y_s is taken as the maximum integer satisfying the following inequality:

HUPs whose support is less than y_s are removed, and the number of tested patterns is greatly reduced.

5.2. Mining SHUP Candidates with Testable Support Requirement

For SHUP candidate mining, motivated by the good mining performance of tree structures [1, 7, 8, 17, 18], we build a utility tree and mine all SHUP candidates from the tree under testable support control; the mining process is given in Algorithm 1.

Algorithm 1: Significant HUP-Miner (mining SHUP candidates).
	Input: D: dataset TS or TD
	θ: a minimum utility threshold
	y_s: testable minimal support
	Output: SHUP candidates
	// create a tree T and a header table H
(1)	Create Tree (D, θ, y_s)
	// find SHUP candidates
(2)	SHUP_Candidates (y_s, T, H, base-itemset, θ)

Propositions 2 and 3 are used to improve the mining efficiency; they are implemented in the Create Tree and SHUP_Candidates steps of Algorithm 1, respectively.

Proposition 2. For a pattern X, if TWU(X) < Min U, X is not a candidate.

Proof. By Definitions 5 and 7, U_TD(X) ≤ TWU(X). That is to say, if the sum of the utilities of the transactions containing X is less than Min U, then U_TD(X) ≤ TWU(X) < Min U, and X will not be a candidate.

Proposition 3. For pattern X, if support S(X) < y_s, its extensions are not regarded as candidates.

Proof. Let EX be a superset of X; thus, S(EX) ≤ S(X). When S(X) < y_s, X cannot be a candidate, nor can EX.
With these two pruning strategies, consider the data in Tables 1 and 2 with the utility threshold set to 0.5. First, the transactional dataset is mapped to a tree, as shown in Figure 1.
(1) Calculate the testable support value y_s = 3 based on formula (42). Remove the unpromising items whose support is less than 3; therefore, item G is deleted. Min U = 195 × 0.5 = 97.5.
(2) The header table in Figure 1(a) has three parts: TWU, support, and the link pointer. With one scan, create the header table H, since the TWU of each remaining item is more than 97.5. Add T₁ to the tree as in Figure 1(b), and save the utilities of the path in the leaf node, e.g., “F, [2, 16, 27]”, which means that F is the leaf node and the list holds the utilities of the items on this path.
(3) Add each remaining transaction. Figure 1(c) shows the result after the second transaction T₂ is added, and Figure 1(d) shows the tree after adding the third transaction. If the path of a transaction is already present in the tree, only the corresponding utilities need to be added. The result after adding all transactions is shown in Figure 1(e).
The process of tree creation is given in Algorithm 2. We first calculate the testable support value, remove the items whose support does not satisfy it, and also remove the items whose TWU is less than Min U (lines 1–8); we then insert the transactions without the removed items and add the utilities to the leaf nodes (lines 9–14).
The abovementioned process constructs the global tree and maintains the data. The algorithm SHUP_Candidates uses the pattern-growth method to mine all SHUP candidates, processing the items of the header table from the bottom to the top. Here, we demonstrate the procedure by mining SHUP candidates with items F and B. The process is shown in Figure 2.

Algorithm 2: Create Tree.
	Input: D: dataset TS or TD
	θ: a minimum utility threshold
	y_s: testable minimal support
	Output: a tree T and a header table H
(1)	For each transaction T_d of D do
(2)	  For each item X in T_d do
(3)	    Calculate H.X.support
(4)	    Calculate H.X.TWU
(5)	  End For
(6)	End For
(7)	Delete unpromising items whose support is less than y_s
(8)	Delete unpromising items whose TWU is less than Min U
(9)	Initialize a tree T with an empty root node, and initialize a table H in TWU descending order
(10)	For each transaction T_d of D do
(11)	  Insert T_d into T with the same item order as in H
(12)	  Add utility to the leaf node
(13)	  Add the links
(14)	End For
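
The two passes of Algorithm 2 can be sketched in Python as follows. This is a simplified illustration under stated assumptions: transactions are dictionaries mapping items to internal utilities, the node links of line 13 are omitted, ties in TWU are broken alphabetically, and the names `Node`, `build_header`, and `build_tree` are hypothetical.

```python
from collections import defaultdict

class Node:
    """Tree node; `utils` holds accumulated path utilities if a path ends here."""
    def __init__(self, item=None):
        self.item = item
        self.children = {}
        self.utils = None

def build_header(transactions, profit, min_util, y_s):
    """Lines 1-9: count support and TWU, prune, and order items by TWU descending."""
    support, twu = defaultdict(int), defaultdict(int)
    for t in transactions:
        tu = sum(profit[i] * q for i, q in t.items())
        for i in t:
            support[i] += 1
            twu[i] += tu
    kept = [i for i in support if support[i] >= y_s and twu[i] >= min_util]
    return sorted(kept, key=lambda i: (-twu[i], i))

def build_tree(transactions, profit, order):
    """Lines 10-14 (without links): insert each pruned, reordered transaction."""
    rank = {item: r for r, item in enumerate(order)}
    root = Node()
    for t in transactions:
        path = sorted((i for i in t if i in rank), key=rank.get)
        if not path:
            continue
        node = root
        for item in path:
            node = node.children.setdefault(item, Node(item))
        utils = [profit[i] * t[i] for i in path]
        # If the same path was inserted before, only the utilities are added.
        node.utils = utils if node.utils is None else [a + b for a, b in zip(node.utils, utils)]
    return root
```

Storing one utility list per path end mirrors the leaf entries such as “F, [2, 16, 27]” in the example, so repeated paths cost no extra nodes.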

Following the node pointers, analyze the paths that contain F. The path <A, C> is obtained from <A, C, F> with leaf node F, as are <D, A, E, C> and <D, E, B>, which are shown in Figure 2(a). Calculate the support and TWU of the items that occur with F; the resulting subheader table is shown in Figure 2(b). However, since the TWU of the remaining items is less than 97.5, the search for item F stops here. The candidates for the other items in H are searched by the same steps; the tree after processing item B is shown in Figure 2(c), and the tree with base {B} is shown in Figure 2(d). Patterns whose last item is B are also removed because their utility does not satisfy the threshold, and the algorithm continues to look for candidates.
The specific process of SHUP_Candidates is shown in Algorithm 3. For each item in H, if its TWU and support satisfy the requirements, determine the SHUP candidates (lines 1–5). A subheader table and a subtree are established, and significant HUP candidates are searched recursively (lines 6–9). The item and its utility are then removed from the leaf node, and the utility is transferred to its parent (lines 10–11), after which the next item in H is processed.

Algorithm 3: SHUP_Candidates.
	Input: T: a tree
	H: a header table
	θ: a minimum utility threshold
	y_s: testable minimal support
	base-item
	Output: SHUP candidates
(1)	For each item Q in H (in bottom-up order) do
	// generate subtree and subtable
(2)	  If H.Q.support ≥ y_s and H.Q.TWU ≥ Min U then
(3)	    base-item = Q ∪ base-item
(4)	    If Q.utility ≥ Min U then
(5)	      Copy base-item to Candidates
(6)	      Create a subtree subT and a header table subH for base-item
(7)	      SHUP_Candidates (y_s, subT, subH, base-item, θ)
(8)	    End If
(9)	  End If
(10)	  Remove Q from base-item
(11)	  Remove and transfer the utility value of Q
(12)	End For
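
The pruning logic of Propositions 2 and 3 can be illustrated with a much simpler depth-first search. The sketch below is not the tree-based SHUP_Candidates procedure: it replaces the subtree and subheader table with plain lists of transaction indices, extends patterns in alphabetical item order, and uses hypothetical names throughout. It only shows how the TWU and support checks cut off a pattern together with all of its extensions.

```python
def mine_candidates(transactions, profit, min_util, y_s):
    """Return {pattern: utility} for patterns passing both pruning checks."""
    utils = [{i: profit[i] * q for i, q in t.items()} for t in transactions]
    tus = [sum(u.values()) for u in utils]
    items = sorted({i for u in utils for i in u})
    results = {}

    def grow(base, tids, start):
        for k in range(start, len(items)):
            item = items[k]
            new_tids = [d for d in tids if item in utils[d]]
            support = len(new_tids)
            twu = sum(tus[d] for d in new_tids)
            # Propositions 2 and 3: skip the pattern and all of its extensions.
            if support < y_s or twu < min_util:
                continue
            pattern = base + (item,)
            utility = sum(utils[d][i] for d in new_tids for i in pattern)
            if utility >= min_util:
                results[pattern] = utility
            grow(pattern, new_tids, k + 1)

    grow((), list(range(len(utils))), 0)
    return results
```

Because both support and TWU can only decrease when a pattern is extended, skipping a failing pattern never loses a candidate.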

5.3. Testing SHUP Candidates

FWER controls the type I error rate, and compared with FWER, FDR is more tolerant. FDR is defined as the expected proportion of false positives among the positive test results, which relaxes the restriction on the number of false positive patterns. Here, the recent control method RB20 [36] mentioned previously is used to test the significance of SHUP candidates, and it proceeds as follows:
(1) If there are m SHUP candidates generated by the algorithm Significant HUP-Miner above, calculate their p values by formula (32) and sort them so that P₁ ≤ P₂ ≤ … ≤ P_m; the corresponding hypotheses are H₁, H₂, …, H_m, the test level .
(2) Set ,
(3) If the minimum set is empty, choose j = k, .
(4) Extract the largest parameter h as
(5) Reject H₁, H₂, …, H_h
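
The adaptive thresholds of RB20 are not reproduced here. As a simpler illustration of a step-up FDR procedure with the same overall shape (sort the p values, find the largest index h that passes its threshold, reject H₁, …, H_h), the sketch below implements the classical BH procedure [34] instead. The function name and the example p values are hypothetical.

```python
def benjamini_hochberg(p_values, alpha):
    """Return the indices (into p_values) rejected by BH at level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    h = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up rule: remember the largest rank whose p value passes.
        if p_values[idx] <= rank * alpha / m:
            h = rank
    return sorted(order[:h])
```

An adaptive procedure such as RB20 would replace the fixed thresholds `rank * alpha / m` with data-dependent ones, but the sort, scan, and reject steps keep this structure.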

It is important to note that Significant HUP-Miner mines the significant high-utility patterns from taxonomy transactional datasets. In fact, if we do not consider the test process, it can be used as a HUP mining method which can be implemented on the sampling dataset returned by Sampling HUP-Miner in Section 4, but we will carry out the two algorithms together in the experiment section.

6. Experiments

We evaluate the performance of the two methods, Sampling HUP-Miner and Significant HUP-Miner, against three algorithms. The first is the recent HUP mining method PTM [6], which mines all HUPs from transactions without a label constraint. The second, which we call PTMFisher, performs SHUP mining based on Fisher’s test but without the testable support constraint. The last one is HAI Sampler [12], which returns HUPs proportionally. Sampling HUP-Miner first applies the sampling strategy and then mines HUPs with PTM. Significant HUP-Miner and PTMFisher are both run under the FDR control described in Section 5.3.

The code used for the evaluation was developed in Python, and the platform was configured as follows: Windows 10, 2 GB of memory, and an Intel(R) Core(TM) i3-2310 CPU @ 2.10 GHz.

6.1. Dataset and Parameters

We tested the algorithms on six datasets. Mushrooms and proteins are from libSVM [37]; chess, connect, pumsb, and accidents are obtained from SPMF [38] and are labeled as in [27]. They all have two classes. Table 6 shows the details of the datasets, where |IT| represents the size of the alphabet, avgLength is the average length of the transactions, and |D| represents the number of transactions. There are three parameters in Theorem 1, the parameter is initialized as = 0.1 for a 90% probability guarantee, and = 0.1, = 0.2 for relatively high accuracy. Because HAI Sampler draws patterns randomly, it necessarily runs with less time and memory, so we do not compare time and memory efficiency with HAI Sampler.

6.2. Efficiency Evaluation

First, we compare the running times of the methods. Figure 3 shows the running times on the six datasets with different utility thresholds. Sampling HUP-Miner spent less time than PTM because our sampling strategy works on a smaller sample. On accidents, when the threshold was 0.005, Sampling HUP-Miner spent 987.4 seconds, while PTM took 1984.3 seconds. Significant HUP-Miner takes less time than PTMFisher because of the testable support constraint: the reduced search space for mining SHUP candidates saves a lot of time. For example, on connect with a support threshold of 0.3, Significant HUP-Miner spent 162.3 seconds, while PTMFisher took 324.5 seconds. The running time of Significant HUP-Miner is considerably longer than that of Sampling HUP-Miner, since the testing process takes a lot of time. As Figure 3 shows, Sampling HUP-Miner and Significant HUP-Miner always run faster than the existing algorithms they are compared with.

Figure 4 shows the memory consumption of the four algorithms. The proposed methods differ clearly from the others in memory use on the six datasets. With a small transaction size, Significant HUP-Miner has the lowest memory consumption because the testable support requirement removes many insignificant patterns early in the candidate mining phase. As Figure 4 shows, Significant HUP-Miner and Sampling HUP-Miner are relatively stable compared with the other algorithms, and our algorithms have a certain advantage in memory utilization.

Figure 5 shows recall, precision, and f-measure with a Sampling HUP-Miner. We can see that the recall is always close to 1, which shows that HUP by PTM or other HUP algorithms could be found by the sampling strategy, although the probability of finding HUP is identified as at least . The precision is always above 0.9 and approaches 1 as decreasing. The f-measure also grows as decreasing, but it is always more than precision. According to our experiments, we can know that the HUP mining method based on sampling strategy can mine most patterns with a high probability, but their running time and space consumption have been greatly improved. Our sampling strategy can be effectively applied to HUP mining.
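
For reference, the three accuracy measures can be computed from the exact and approximate pattern sets as in the sketch below; the function name and the example patterns are hypothetical. The f-measure is the harmonic mean of precision and recall, so it exceeds precision whenever recall is the larger of the two, which is the situation described above.

```python
def precision_recall_f(exact, approx):
    """Compare an approximate pattern set against the exact one."""
    exact, approx = set(exact), set(approx)
    hits = len(exact & approx)
    precision = hits / len(approx) if approx else 0.0
    recall = hits / len(exact) if exact else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_measure
```

For instance, if the sample-based miner returns every exact HUP plus one extra pattern out of five, recall is 1.0, precision is 0.8, and the f-measure is about 0.89.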

Figure 6 shows the utility distribution returned by Sampling HUP-Miner and HAI sampler on three typical datasets, chess, connect, and accidents, under different values of the probability parameter . For each probability parameter value, the sample size m that meets the accuracy requirement can be obtained based on Theorem 1. According to the obtained sample size m, the maximum length is set as 10 for the two algorithms. The HAI Sampler is applied in a loop to randomly extract m transactions, and then, the pattern is extracted proportionally. We calculate the average utility of the returned patterns using the two algorithms. For any probability parameter value, the average utility returned by sampling HUP-Miner is larger than that of the HAI Sampler on most datasets.

In Theorem 1, it is worth noting that the transactions are extracted with equal probability according to the required sample size. Following the idea of the HAI Sampler, however, the transactions should be drawn with a probability proportional to their utility, so that the HUPs can be found quickly. Based on Definitions 3 and 6, the probability of a transaction T_d can be defined as follows:

P(T_d) = TU(T_d) / U_TD.

The transactions are extracted based on the probability of the transactions; Figure 7 shows the required size by Sampling HUP-Miner with equal and different probabilities, and we could see that when the transactions had a different probability, the required size returned by Theorem 1 was smaller than that with an equal probability. By such a strategy, our sampling strategy can achieve the goal of mining approximate HUPs under high space-time efficiency.
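
A minimal sketch of utility-proportional transaction sampling follows. It assumes that the probability of a transaction is its transaction utility divided by the total utility of the dataset (the natural reading of Definitions 3 and 6) and that transactions are drawn with replacement; the function names are hypothetical.

```python
import random

def transaction_probabilities(tus):
    """Probability of each transaction: TU(T_d) / U_TD."""
    total = sum(tus)
    return [tu / total for tu in tus]

def sample_transactions(tus, m, seed=None):
    """Draw m transaction indices with probability proportional to TU."""
    rng = random.Random(seed)
    return rng.choices(range(len(tus)), weights=tus, k=m)
```

With the transaction utilities 45, 35, 27, 26, 30, and 32 from Definition 4, the first transaction is drawn with probability 45/195, roughly 0.23, compared with 1/6 under equal-probability sampling.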

Figure 8 shows the comparison of the number of SHUPs. Significant HUP-Miner can always find more patterns than PTMFisher; PTMFisher returns patterns with a low number, and on several datasets, the number is always 0, with its huge candidates number resulting in a very small p value. For example, on the dataset pumsb, Significant HUP-Miner can always produce patterns of more than 30, while PTMFisher returns 0. The number of SHUP obtained by all algorithms shows a downward trend when the threshold increases. Significant HUP-Miner can always mine more patterns than PTMFisher by benefiting from the advantage of the testable support requirement in candidate mining.

Figure 9 shows the p values of the patterns tested by Significant HUP-Miner, which removes insignificant patterns early. Figure 9(a) shows the p value distribution on chess, Figure 9(b) shows the number of tested patterns on connect for the two methods, and Figure 9(c) shows the p value variance on proteins. These results confirm that Significant HUP-Miner can smoothly return SHUPs from taxonomy transactional datasets.

[Figure 9: panels (a) to (c)]

7. Conclusion

We introduce two efficient algorithms for mining approximate and significant high-utility patterns from two types of transactional datasets. The two proposed methods can be used either separately or together. For mining approximate patterns, our sampling strategy derives a sample size for a quantitative transactional dataset, which has not been considered in existing sampling strategies for high-utility pattern mining. For mining statistically significant patterns from transactional datasets with labels, we introduce the concept of testable support and integrate the test requirement into candidate mining, which differs substantially from traditional two-phase methods. Insignificant candidates can be removed early, and the significant HUPs can be obtained using an advanced tree structure under the FDR framework. The experimental results verify the feasibility and correctness of our algorithms.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Social Science Foundation of China (21BGL088), the Project of Zhejiang Provincial Public Welfare Technology Application and Research (LGF21H180010), and the Natural Science Foundation of Zhejiang Province of China (LZ20F020001).

Copyright

Copyright © 2022 Huijun Tang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
