Abstract
Mining high-utility patterns (HUPs) on transactional datasets has been widely discussed, and various algorithms have been introduced to solve this problem. However, the time-space efficiency of these algorithms is still limited, and the mining system cannot provide timely feedback on relevant information. In addition, when mining HUPs from taxonomy transactional datasets, a large portion of the quantitative results are merely accidental responses to the user-defined utility constraints and may have no statistical significance. To address these two problems, we propose two corresponding approaches named Sampling HUPMiner and Significant HUPMiner. Sampling HUPMiner pursues a sample size of a transactional dataset based on a theoretical guarantee; the mining results based on such a sample size are an effective approximation to the results on the whole dataset. Significant HUPMiner proposes the concept of testable support, and significant HUPs can be drawn in a timely manner under the constraint of testable support. Experiments show that the two designed algorithms discover approximate and significant HUPs smoothly and perform well in terms of runtime, pattern numbers, memory usage, and average utility.
1. Introduction
HUP mining is quite different from traditional frequent pattern mining, as it focuses on the quantitative calculation of itemsets rather than their mere frequency. By considering the profit of the itemsets, HUP mining has been applied to many fields [1–4], including shopping basket analysis and financial stock extraction.
The task of HUP mining is more complex: the downward closure property [5, 6] cannot play its effective role in HUP mining. This means that more search space is required to obtain the results, since there is no equally effective pruning strategy. To mine HUPs more efficiently, many HUP mining approaches [7, 8] have been introduced, but it is important to note that all these algorithms output the same results when given a certain utility threshold; the difference lies in how they produce the HUPs while reducing the computations.
The recent approach PTM [6] can output HUPs smoothly through its effective list data structure, but it encounters efficiency difficulties when the dataset becomes larger. Pattern sampling is a candidate technology for such problems because it returns approximate results with low memory-time consumption. Sampling strategies have been used for drawing graph patterns [9] and frequent patterns [10]. The algorithm CSSAMPLING [11] is a typical method that samples a sequential pattern based on the frequency proportion, and Diop [12] proposed the HAI Sampler to sample one HUP according to the average utility proportion of the transactions. The patterns returned by such approaches follow the probabilities of frequency or utility proportion in the dataset, and the methods often rely on a two-phase procedure [10]: first, draw one transaction randomly according to the proportion of frequency [11] or utility [12], and then randomly return a pattern proportional to its frequency or utility in the extracted transaction. Such sampling approaches cannot be applied to obtain approximate HUPs, as they only output one pattern per iteration. Since a pattern with a large proportion has a greater chance of being extracted, the same pattern may be extracted each time, and the sampling system may thus lack diversity in the extracted patterns. We do not draw patterns from one random transaction like CSSAMPLING [11] and the HAI Sampler [12]. Since we return approximate HUPs mined from a sample of the whole dataset, the accuracy of the mining results on the sample can be controlled under rigorous probability assurance. Recently, Valiullin et al. [13] first used such a sampling strategy to efficiently obtain approximate frequent itemsets with a probability guarantee, but to the best of our knowledge, there are few methods for efficient approximate extraction of HUPs.
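To make the two-phase procedure concrete, here is a minimal Python sketch. The transaction representation and the weighting scheme are illustrative assumptions only; this is not the exact CS-SAMPLING or HAI Sampler implementation.

```python
import random

def two_phase_sample(transactions, rng=random):
    """Draw one pattern following the two-phase scheme of [10-12]:
    pick a transaction with probability proportional to its weight
    (frequency or utility), then pick a random sub-itemset of it.
    A simplified, illustrative sketch of the general idea."""
    # transactions: list of (itemset, weight) pairs, e.g. (["A", "B"], 45)
    weights = [w for _, w in transactions]
    items, _ = rng.choices(transactions, weights=weights, k=1)[0]
    # phase 2: return a random non-empty subset of the drawn transaction
    k = rng.randint(1, len(items))
    return sorted(rng.sample(items, k))
```

Because phase 1 favors heavy transactions, repeated calls tend to revisit the same high-weight region, which is the diversity problem noted above.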
In addition, when the transactions carry a label feature, traditional HUP mining methods may return HUPs that have no statistical significance. That is to say, if we regard the mining data as a sample of the population, some HUPs may be equally probable under each label; such HUPs are insignificant and should not be of concern, and we should pay more attention to HUPs whose probability is biased toward a certain label. Statistically significant pattern mining (SSPM) [14, 15] can solve such a problem based on hypothesis testing; it can find HUPs that are statistically significant with respect to the label feature. Nevertheless, such an approach has not been applied to HUP drawing, and thus proposing an efficient algorithm is both important and challenging.
To address these two problems, this paper serves to propose two corresponding algorithms: Sampling HUPMiner and Significant HUPMiner.
Sampling HUPMiner proposes a sampling strategy under a strict theoretical guarantee. It looks for a sample size that is smaller than the size of the original dataset and then outputs approximate results with better time and memory efficiency. It is important to note that Sampling HUPMiner controls this sampling strategy with a strict probability guarantee.
Significant HUPMiner mines significant high-utility patterns (SHUPs) from taxonomy transactional datasets. Traditional SSPM algorithms solve such problems via hypothesis tests under the error control of FWER [14] or FDR [15], and they are often used as a two-phase method, first producing significant pattern candidates and then testing their significance. In fact, by deriving a testable support value, the testing process can partake in the candidate mining process, removing insignificant patterns in a timely manner so as to reduce the search space.
The main contributions of our paper can be listed as follows:
(1) We propose Sampling HUPMiner, which obtains a sample size and an approximate set of HUPs under a probability guarantee, with better efficiency performance.
(2) We introduce the concept of the testable support value for the frequently used Fisher's test; insignificant candidates can be removed in a timely manner using the testable support requirement so as to reduce the search space, increase the test level, and find more significant HUPs.
(3) We propose Significant HUPMiner, an efficient algorithm to mine significant HUPs under testable support requirements by a pattern-growth method; the results can be obtained efficiently under the control of FDR.
(4) We run several experiments to prove the effectiveness of Sampling HUPMiner and Significant HUPMiner. Our methods achieve the discovery results with higher efficiency compared with their counterparts.
The rest of the sections are organized as follows: Section 2 reviews the related works. Section 3 defines relevant terms. Sections 4 and 5 give our method of Sampling HUPMiner and Significant HUPMiner, respectively. Section 6 shows the effective results of our algorithm, and Section 7 concludes this work.
2. Related Works
HUP mining has become a hot research area, and there are many algorithms to solve this problem, including UP-Growth and UP-Growth+ [17], MU-Growth [18], IHUP [1], and Two-Phase [16]. The UP-Growth algorithm creates the HUP-Tree by mapping the transactional set to the nodes of the tree; according to the existing paths, it generates all possible item combinations and calculates their utility values. The drawback of the algorithm is that it produces a large number of item combinations, and the HUP-Tree needs more space to store the itemsets and utility values. Different from UP-Growth, the IHUP algorithm [1] uses the total utility of the currently processed item on the IHUP-Tree as an overestimated utility to determine whether a candidate is kept or not. The MU-Growth algorithm [18] improves on IHUP by rebuilding the tree structure around important leaf nodes; in experiments on the datasets, the number of candidates was reduced and the runtime efficiency was greatly improved. The abovementioned algorithms may produce a huge number of candidates when the dataset is large, so some algorithms that do not produce candidates have been proposed, such as EFIM [19], FHM [20], and D2HUP [21]. However, they may take more memory to achieve this goal, although the running time efficiency has been greatly improved.
Pattern mining from a small sample can improve mining efficiency in terms of running time, memory, and so on. Mannila et al. [22] first used a sampling strategy to efficiently obtain association rules. Approximate frequent itemset mining can also be found in [23]. Recently, Bashir and Lai [24] proposed a sampling strategy for mining frequent itemsets with inequality control. Chen and An [25] proposed a distributed parallel sampling strategy for mining HUPs; it is built on Hoeffding's inequality and Chebyshev's inequality and can be regarded as an effective method for sampling HUPs.
Hämäläinen and Webb [14] first proposed the SSPM model and regarded significant pattern mining as a multiple-hypothesis testing problem. Webb [26] controls the error rate by introducing FWER and FDR into significant pattern discovery. Bonferroni's correction [27] adjusts the test level for multiple testing under the control of FWER. The test statistic is an important part of SSPM. Fisher's conditional test statistic is a popular approach to measuring significance; the LAMP [28] strategy reduces the calculation time by putting forward testable support requirements for the tested patterns based on Fisher's test statistic. Barnard's test statistic is another effective method for calculating the p value. Pellegrina et al. [29] proposed a novel unconditional statistical test method for evaluating significance, which gives testable support requirements based on Barnard's test statistic. Under FWER control, Riondato and Upfal [30] proposed a significant pattern mining algorithm based on progressive sampling pattern testing. Tran et al. [31] applied the mining results of the significant pattern test to utility dataset analysis. Cheng et al. [32] proposed the algorithm LTC to look for significant patterns in data streams that are not only frequent but also persistent.
FWER may make the significance threshold too strict, so few patterns have the opportunity to be rejected [33]. Compared with FWER for testing m hypotheses, FDR may be more tolerant. BH [34] is a typical method to control FDR at the test level α, but it is conservative; in fact, some research works proved that the test level can be expanded to mα/m_{0} for more rejections [35], where m_{0} is the number of true null hypotheses. Recently, literature [36] proposed a dynamic adaptive procedure, RB20, that controls FDR effectively and with better accuracy; it is more powerful than the traditional estimation method.
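As a concrete reference point, the classic BH step-up rule can be sketched in a few lines of Python. This is a textbook implementation of BH [34], not the adaptive RB20 procedure of [36].

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """BH step-up procedure [34]: with m p-values sorted ascending,
    reject the h smallest, where h is the largest rank i such that
    p_(i) <= i * alpha / m.  Controls FDR at alpha * m0 / m, which is
    why adaptive variants can enlarge the level to m * alpha / m0."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    h = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            h = rank
    return {order[k] for k in range(h)}  # indices of rejected hypotheses
```

For example, with p-values [0.001, 0.02, 0.03, 0.9] at α = 0.05, the first three hypotheses are rejected.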
3. Preliminaries
A transactional database TD can be written as {T_{1}, T_{2}, …, T_{n}}, where each T_{d} ∈ TD has a unique identifier d. There are m distinct literals, denoted I = {x_{1}, x_{2}, …, x_{m}}. q(x_{j}, T_{d}) denotes the internal utility of x_{j} in T_{d}; e.g., in Table 1, q(B, T_{2}) = 3, q(D, T_{2}) = 3, and q(E, T_{2}) = 4. p(x_{j}) denotes the profit of x_{j}; e.g., p(C) = 9 in Table 2.
Definition 1. U(x_{j}, T_{d}) denotes the utility of x_{j} in T_{d}, and U(x_{j}, T_{d}) = q(x_{j}, T_{d}) × p(x_{j}). For the data in Tables 1 and 2, U(A, T_{1}) = 4 × 4 = 16, U(C, T_{1}) = 9 × 3 = 27, and U(F, T_{1}) = 2 × 1 = 2.
Definition 2. U(X, T_{d}) denotes the utility of itemset X in T_{d}, and U(X, T_{d}) = ∑_{x_{j} ∈ X} U(x_{j}, T_{d}). For the data in Tables 1 and 2, U({BD}, T_{2}) = 15, and U({BE}, T_{2}) = 10.
Definition 3. TU(T_{d}) denotes the utility of transaction T_{d}, and TU(T_{d}) = ∑_{x_{j} ∈ T_{d}} U(x_{j}, T_{d}). For the data in Tables 1 and 2, TU(T_{5}) = U(B, T_{5}) + U(C, T_{5}) + U(D, T_{5}) = 6 + 18 + 6 = 30.
Definition 4. MTU_{TD} denotes the maximum transaction utility value in TD, and MTU_{TD} = max{TU(T_{d}) | T_{d} ∈ TD}. For the data in Tables 1 and 2, MTU_{TD} = max({45, 35, 27, 26, 30, 32}) = 45.
Definition 5. U_{TD}(X) denotes the utility value of itemset X in TD, and U_{TD}(X) = ∑_{X ⊆ T_{d}, T_{d} ∈ TD} U(X, T_{d}). For the data in Tables 1 and 2, U_{TD}({CD}) = U({CD}, T_{5}) + U({CD}, T_{6}) = 24 + 21 = 45.
Definition 6. U_{TD} denotes the utility value of TD, and U_{TD} = ∑_{T_{d} ∈ TD} TU(T_{d}). For the data in Tables 1 and 2, U_{TD} = 45 + 35 + 27 + 26 + 30 + 32 = 195. The average utility of TD can be denoted as Avg_{TD} = U_{TD}/|TD|; in this example, Avg_{TD} = 195/6 = 32.5.
Definition 7. TWU(X) denotes the transaction-weighted utility value of itemset X, and TWU(X) = ∑_{X ⊆ T_{d}, T_{d} ∈ TD} TU(T_{d}). For the data in Tables 1 and 2, TWU({AC}) = 45 + 32 = 77.
Definition 8. For a threshold θ, Min U denotes the minimum utility value, and Min U = θ × U_{TD}.
Definition 9. An itemset is considered a HUP if its utility value is more than Min U.
In this study, algorithms for approximate and significant patterns are proposed. The notations of this study are given in Table 3.
4. Sampling HUPMiner: Approximate HUP Mining
In this section, we propose Sampling HUPMiner, an approximate HUP mining method that returns a sample size based on a sampling strategy under a strict theoretical guarantee similar to [25]; different from [25], we use McDiarmid's inequality twice to establish the sampling theorem.
Lemma 1 (McDiarmid's inequality [14]). Let x_{1}, …, x_{m} ∈ X, and let f: X^{m} → R be a function that satisfies the bounded differences condition: for all i ∈ {1, …, m} and all x_{1}, …, x_{i}, …, x_{m}, x_{i}′ ∈ X,
|f(x_{1}, …, x_{i}, …, x_{m}) − f(x_{1}, …, x_{i}′, …, x_{m})| ≤ c_{i}.
Then, for any ε > 0,
P(|f(x_{1}, …, x_{m}) − E[f(x_{1}, …, x_{m})]| ≥ ε) ≤ 2 exp(−2ε²/∑_{i=1}^{m} c_{i}²).
Lemma 2 (see [25]). Let X_{1}, …, X_{m} be m independent random variables, where each X_{i} has the same probability distribution with mean E(X). Then, the average X̄ = (1/m)∑_{i=1}^{m} X_{i} has a distribution with mean E(X̄) = E(X).
Sampling HUPMiner outputs a sample size that is smaller than the size of TD according to Theorem 1, and approximate HUPs can be obtained with low memory-time consumption.
Theorem 1. Given a transactional dataset TD with n transactions and three pre-given parameters ε > 0, η > 0, and probability parameter δ, let TS be a random sample of TD, and let m be the size of TS. When m is no less than the bound determined by ε, η, and δ, then for any pattern X in the dataset TS, the utility proportion of X in TS is within ε of its proportion in TD, with probability at least 1 − δ.
Proof. Let TS be a random sample of TD with m transactions T_{s1}, T_{s2}, …, T_{sm}, which can be considered independent random variables. For any pattern X in TS, based on Lemma 1 and similarly to [25], formulas (12)–(17) follow; first, each bounded difference satisfies c_{i} ≤ MTU_{TD}. Based on Lemma 2, formula (12) can be rewritten in terms of the sample average, and applying Lemma 2 a second time yields the corresponding bound on the expectation. Combining the two applications of McDiarmid's inequality gives a failure probability g(m), which is a monotonically decreasing function of m; hence, for the pre-given probability parameter δ, when m is no less than the value at which g(m) = δ, the claimed approximation holds with probability at least 1 − δ.
In Definition 7, for simplicity, we introduced the corresponding shorthand notation. Similar to [25], we can obtain the accuracy of the sampling method by Theorem 2.
Theorem 2. If pattern X is a HUP in TD with a minimum utility threshold, then X has utility at least a corresponding fraction of U_{TS} in TS and at least a corresponding fraction of U_{TD} in TD, with probability at least that guaranteed by Theorem 1.
Proof. Since X is a HUP in TD, its utility satisfies the minimum utility constraint. Based on Theorem 1, the utility proportion of X in TS deviates from that in TD by at most ε, with the stated probability; hence, the bound on the utility of X in TS follows. Applying Theorem 1 again in the other direction, the bound on the utility of X in TD follows as well.
Theorem 1 reduces the size of the dataset, so an approximate HUP set can be mined by various existing HUP mining methods; the running time and memory consumption can be greatly reduced since the transaction number is reduced. Theorem 2 proves that the accuracy of the results can be guaranteed.
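The overall pipeline implied by Theorems 1 and 2 can be sketched as follows. Here `mine_hups` stands in for any exact miner such as PTM, and the uniform sampling without replacement is a simplified reading of the strategy.

```python
import random

def sample_then_mine(db, m, mine_hups, seed=None):
    """Sketch of the Sampling HUPMiner pipeline: draw a random sample TS
    of size m (with m chosen per Theorem 1) and hand it to any existing
    exact HUP miner.  `mine_hups` is a placeholder for PTM or another
    exact algorithm applied to the sampled transactions."""
    rng = random.Random(seed)
    ts = rng.sample(list(db), m)     # uniform sample without replacement
    return mine_hups(ts)
```

Any gain in runtime and memory then comes purely from the miner seeing m transactions instead of n, while Theorem 2 bounds the loss in accuracy.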
5. Significant HUPMiner: Significant HUP Mining
We mine significant high-utility patterns (SHUPs) based on the frequently used Fisher's test in SSPM. Transactions may be associated with label features G^{0} or G^{1}, as shown in Table 4 (based on Table 1).
Definition 10. For an itemset X in transaction T_{d}, define φ(X, T_{d}) = 1 if X ⊆ T_{d} and φ(X, T_{d}) = 0 otherwise. The support [10] can then be written as S(X) = ∑_{T_{d} ∈ TD} φ(X, T_{d}). For the data in Table 4, pattern {AC} is in T_{1} and T_{6}, so S({AC}) = 1 + 1 = 2.
Definition 11. A HUP X is flagged as a SHUP if the p value of the tested pattern is less than a test level threshold.
Let p_{i} be the probability that X has the label G^{i}. The null hypothesis is H_{0}: p_{0} = p_{1}, for a given test level threshold α. If the p value of a HUP is less than α, it is flagged as a SHUP. There may be a large number of patterns waiting for testing, so we mine significant HUPs efficiently in a novel way by introducing the testable support requirement into the candidate mining process, which is different from the traditional two-phase procedure.
5.1. Testable Support Value
Fisher's test is used to calculate the p value of a HUP. Its calculation is based on the 2 × 2 contingency table shown in Table 5.
For a HUP X, S_{1}(X) and S_{0}(X) are the supports of X with labels G^{1} and G^{0}, respectively. Take each integer value in the interval [max(0, n_{1} − (n − S(X))), min(S(X), n_{1})] as a variable x, and P_{F}(x) is calculated as follows:
P_{F}(x) = C(n_{1}, x) · C(n − n_{1}, S(X) − x) / C(n, S(X)).
The final test statistic value for X is obtained by summing P_{F}(x) over the outcomes at least as extreme as the observed one. The minimum value of P_{F}(x) is achieved at x_{min} = min(S(X), n_{1}) and can be written as P_{F}(x_{min}) = C(n_{1}, x_{min}) · C(n − n_{1}, S(X) − x_{min}) / C(n, S(X)).
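The hypergeometric probability P_{F}(x) and a one-sided tail p value can be computed directly with Python's `math.comb`. The tail aggregation below is a standard one-sided sketch and may differ in detail from the paper's exact statistic.

```python
from math import comb

def hypergeom_pmf(x, n, n1, s):
    """P_F(x): probability that x of the s occurrences of a pattern
    fall in class G1, under the 2x2 contingency table of Table 5."""
    return comb(n1, x) * comb(n - n1, s - x) / comb(n, s)

def fisher_p_upper(x_obs, n, n1, s):
    """One-sided Fisher p-value: probability of an association at least
    as extreme as the observed count x_obs = S_1(X).  A simplified
    sketch; two-sided variants aggregate both tails."""
    hi = min(s, n1)
    return sum(hypergeom_pmf(x, n, n1, s) for x in range(x_obs, hi + 1))
```

For instance, with n = 6, n_1 = 3 and S(X) = 2, observing both occurrences in G^1 gives P_F(2) = 3/15 = 0.2.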
In most cases, S(X) ≤ n_{1}, and the following Proposition 1 confirms the monotonicity of P_{F}, which is required to calculate our testable support value. On the other hand, when S(X) > n_{1}, the property is not satisfied. Therefore, we define a function that yields the testable support value.
For a given pattern X with y = S(X) ≤ n_{1}, we define the monotonically decreasing function as P_{F}(y) = C(n_{1}, y)/C(n, y).
Proposition 1. If y ≤ n_{1}, P_{F}(y) ≤ P_{F}(y − 1) holds.
Proof. By the definition above, P_{F}(y)/P_{F}(y − 1) = [C(n_{1}, y)/C(n, y)] / [C(n_{1}, y − 1)/C(n, y − 1)] = (n_{1} − y + 1)/(n − y + 1). Thus, since n_{1} < n, the ratio is at most 1; hence, P_{F}(y) ≤ P_{F}(y − 1), and Proposition 1 holds.
From Proposition 1, when S(X) ≤ n_{1} (which is commonly satisfied), the minimal p value is monotonically decreasing in the support. If the test level is α, y_{s} is the maximum integer satisfying the inequality P_{F}(y_{s} − 1) > α. HUPs whose support is less than y_{s} are removed, and the number of tested patterns is greatly reduced.
5.2. Mining SHUP Candidates with Testable Support Requirement
In the process of SHUP candidate mining, leveraging the excellent mining performance provided by tree structures [1, 7, 8, 17, 18], we establish a utility tree and mine all SHUP candidates from the tree under testable support control; the mining process is given in Algorithm 1.

Propositions 2 and 3 are used to improve the mining efficiency; they are implemented in the tree-creation and SHUP_Candidates processes of Algorithm 1, respectively.
Proposition 2. For a pattern X, if TWU(X) < Min U, X is not a candidate.
Proof. Since U_{TD}(X) ≤ TWU(X), if the sum of the utilities of the transactions containing X is less than Min U, then the utility of X is also less than Min U, so X cannot be a candidate.
Proposition 3. For pattern X, if support S(X) < y_{s}, its extensions are not regarded as candidates.
Proof. Let EX be a superset of X; then S(EX) ≤ S(X). When S(X) < y_{s}, X cannot be a candidate, nor can EX.
Based on these two efficient pruning strategies, consider the data in Tables 1 and 2 with the utility threshold set at 0.5. First, the transactional dataset is mapped to an efficient tree, as shown in Figure 1.
(1) Calculate the testable support value y_{s} = 3 based on formula (42), and remove the unpromising items whose support is less than 3; therefore, delete the item G. Min U = 195 × 0.5 = 97.5.
(2) There are three parts in the header table of Figure 1(a): TWU, support, and the link pointer. In one scan, create the header table H, since the TWU of the remaining items is more than 97.5; add T_{1} to the tree in Figure 1(b), and save the utility of the path in a leaf node such as "F, [2, 16, 27]," which means that F is the leaf node and the list shows the utilities of the items on this path.
(3) Add each transaction; Figure 1(c) shows the result after the second transaction T_{2} is added, and Figure 1(d) shows the tree after adding the third transaction. If, during insertion, the transaction path is already present in the tree, only the corresponding utilities need to be added. The result after adding all transactions can be seen in Figure 1(e).
The process of tree creation is given in Algorithm 2. We first calculate the testable support value, remove the items whose support does not satisfy it, and also remove the items whose TWU is less than Min U (lines 1–8); we then insert the transactions after removing the insignificant items and add the utilities to the leaf nodes (lines 8–14).
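The filtering step that precedes tree construction (dropping items with support below y_s and TWU below Min U, per Propositions 2 and 3) can be sketched as follows; the dictionary-based data layout is an illustrative assumption, not the paper's exact data structure.

```python
from collections import defaultdict

def build_header(db, profit, y_s, min_u):
    """Sketch of the filtering pass before tree construction (lines 1-8
    of Algorithm 2): compute TU per transaction, accumulate each item's
    support and TWU, then keep only items passing both pruning rules."""
    # db: transaction id -> {item: internal utility}; profit: item -> p(x)
    tu = {td: sum(q * profit[x] for x, q in items.items())
          for td, items in db.items()}
    support, twu = defaultdict(int), defaultdict(float)
    for td, items in db.items():
        for x in items:
            support[x] += 1
            twu[x] += tu[td]
    # Proposition 3: support >= y_s; Proposition 2: TWU >= Min U
    return {x: (twu[x], support[x]) for x in support
            if support[x] >= y_s and twu[x] >= min_u}
```

Items surviving this pass form the header table H; everything else is excluded before any path is inserted into the tree.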
The abovementioned process efficiently constructs the global tree and maintains the data. The algorithm SHUP_Candidates uses the pattern-growth method to mine all SHUP candidates, processing nodes from bottom to top. Here, we demonstrate the example by mining SHUP candidates for items F and B; the process can be seen in Figure 2.

The specific process of SHUP_Candidates is shown in Algorithm 3. For each item in H, if the TWU and support values satisfy the requirements, determine the SHUP candidates (lines 1–5). A subheader table and a subtree are established, and significant HUP candidates are sought recursively (lines 6–9). Remove the item and its utility from the leaf node, transfer the utility to its parents (lines 10–11), and then look for candidates for the next item in H.

5.3. Testing SHUP Candidates
FWER controls the familywise type I error rate; compared with FWER, FDR is more tolerant. FDR is defined as the expected proportion of false discoveries among the positive test results, which relaxes the quantitative restriction on the existence of false positive patterns. Here, the recent effective control method RB20 [36], mentioned previously, is used to test the significance of the SHUP candidates, as follows:
(1) For the m SHUP candidates generated by Significant HUPMiner above, calculate the p values by formula (32) and sort them, P_{1} ≤ P_{2} ≤ … ≤ P_{m}, with corresponding hypotheses H_{1}, H_{2}, …, H_{m} and test level α.
(2) Compute the adaptive threshold set prescribed by RB20.
(3) If the minimum set is empty, choose j = k.
(4) Extract the largest parameter h accordingly.
(5) Reject H_{1}, H_{2}, …, H_{h}.
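Since the exact RB20 steps are not reproduced here, the following hedged Python sketch only illustrates the general shape of an adaptive FDR procedure: estimate the number of true nulls m0 (Storey-style, with tuning parameter lam) and enlarge the BH level accordingly. This is an illustration of why adaptivity yields more rejections, not the RB20 algorithm of [36].

```python
def adaptive_fdr_reject(pvalues, alpha, lam=0.5):
    """Adaptive FDR sketch: estimate m0 from the p-values above lam
    (Storey-style), raise the level to alpha * m / m0, then apply a
    BH-type step-up rule.  NOT the exact RB20 procedure of [36]."""
    m = len(pvalues)
    # conservative estimate of the number of true null hypotheses
    m0 = min(m, (sum(p > lam for p in pvalues) + 1) / (1 - lam))
    level = alpha * m / m0
    ps = sorted(pvalues)
    h = max((i for i in range(1, m + 1) if ps[i - 1] <= i * level / m),
            default=0)
    return h  # number of rejected hypotheses H_1..H_h (sorted order)
```

When few p-values are large, m0 is estimated to be small, the working level grows toward mα/m₀, and more hypotheses are rejected than under plain BH.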
It is important to note that Significant HUPMiner mines the significant high-utility patterns from taxonomy transactional datasets. In fact, if the test process is not considered, it can be used as a HUP mining method implemented on the sampled dataset returned by Sampling HUPMiner in Section 4; we carry out the two algorithms together in the experiment section.
6. Experiments
We evaluate the performance of the two methods, Sampling HUPMiner and Significant HUPMiner, against three algorithms. The first is the recent advanced HUP mining method PTM [6], which is set up to mine all HUPs from transactions without the label constraint. The second, which we call PTMFisher, focuses on SHUP mining based on Fisher's test but without the testable support constraint. The last is the HAI Sampler [12], which returns HUPs proportionally. Sampling HUPMiner first applies the sampling strategy and then mines HUPs by PTM. Significant HUPMiner and PTMFisher are implemented under the FDR control of Section 5.3.
The code used for the evaluation was developed in Python, and the platform is configured as follows: Windows 10, 2 GB memory, Intel(R) Core(TM) i3-2310 CPU @ 2.10 GHz.
6.1. Dataset and Parameters
We tested the algorithms on six datasets: mushrooms and proteins are from libSVM [37]; chess, connect, pumsb, and accidents are obtained from SPMF [38] and are labeled as in [27]. They all have two classes. Table 6 shows the detailed information of the datasets, where IT represents the size of the alphabet, avgLength is the average length of the transactions, and D represents the number of transactions contained. There are three parameters in Theorem 1: the probability parameter is initialized to 0.1 for a 90% probability guarantee, and the other two parameters are set to 0.1 and 0.2 for relatively high accuracy. Due to its randomness, the HAI Sampler necessarily runs with less time and memory, so we do not compare time-memory efficiency with the HAI Sampler.
6.2. Efficiency Evaluation
First, we compare the running times of the methods. Figure 3 shows the comparison of running times on the six datasets with different utility thresholds. Sampling HUPMiner spent less time than PTM by using our effective sampling strategy with a smaller sample size. On accidents, when the threshold was 0.005, Sampling HUPMiner spent 987.4 seconds, while PTM took 1984.3 seconds. Significant HUPMiner takes less time than PTMFisher thanks to the testable support constraint; benefiting from the reduced search space for mining SHUP candidates, Significant HUPMiner saves a lot of time. For example, on connect with a support threshold of 0.3, Significant HUPMiner spent 162.3 seconds while PTMFisher took 324.5 seconds. The running time of Significant HUPMiner is significantly longer than that of Sampling HUPMiner, since the testing process takes a lot of time. As Figure 3 shows, the running times of Sampling HUPMiner and Significant HUPMiner always have computational advantages over the existing algorithms.
Figure 4 shows the memory consumption of the four algorithms. The proposed methods can be clearly distinguished by their memory results on the six datasets. With a small transaction size, Significant HUPMiner has the best memory consumption owing to the testable support requirement, as many insignificant patterns are removed in a timely manner in the candidate mining phase. From Figure 4, Significant HUPMiner and Sampling HUPMiner are relatively stable compared with the other algorithms, and our algorithms have a clear advantage in memory utilization.
Figure 5 shows the recall, precision, and f-measure of Sampling HUPMiner. We can see that the recall is always close to 1, which shows that the HUPs found by PTM or other HUP algorithms can also be found by the sampling strategy, although a HUP is only guaranteed to be found with the stated probability. The precision is always above 0.9 and approaches 1 as the accuracy parameter decreases. The f-measure also grows as the parameter decreases, but it is always higher than the precision. According to our experiments, the HUP mining method based on the sampling strategy can mine most patterns with a high probability, while its running time and space consumption are greatly reduced. Our sampling strategy can thus be effectively applied to HUP mining.
Figure 6 shows the utility distribution returned by Sampling HUPMiner and the HAI Sampler on three typical datasets (chess, connect, and accidents) under different values of the probability parameter. For each probability parameter value, the sample size m that meets the accuracy requirement is obtained from Theorem 1. Given the obtained sample size m, the maximum pattern length is set to 10 for both algorithms. The HAI Sampler is applied in a loop to randomly extract m transactions, and then a pattern is extracted proportionally from each. We calculate the average utility of the patterns returned by the two algorithms. For every probability parameter value, the average utility returned by Sampling HUPMiner is larger than that of the HAI Sampler on most datasets.
In Theorem 1, it is worth noting that the transactions are extracted with equal probability according to the required sample size, but following the idea of the HAI Sampler, the transactions can instead be drawn with unequal probabilities so that HUPs are found more quickly. Based on Definitions 3 and 6, the probability of a transaction T_{d} can be defined as p(T_{d}) = TU(T_{d})/U_{TD}.
When the transactions are extracted based on these probabilities, Figure 7 shows the sample size required by Sampling HUPMiner with equal and unequal probabilities; we can see that when the transactions have unequal probabilities, the required size returned by Theorem 1 is smaller than that with equal probability. With such a strategy, our sampling method can achieve the goal of mining approximate HUPs with high space-time efficiency.
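Drawing transactions with probability p(T_d) = TU(T_d)/U_TD can be done with Python's weighted sampling. A minimal sketch, assuming the transaction utilities are precomputed:

```python
import random

def weighted_sample(tus, m, seed=None):
    """Draw m transaction ids with probability p(T_d) = TU(T_d) / U_TD
    (Definitions 3 and 6); a sketch of the biased-sampling variant
    compared in Figure 7.  Sampling is with replacement."""
    rng = random.Random(seed)
    ids, weights = zip(*tus.items())   # tus: id -> transaction utility
    return rng.choices(ids, weights=weights, k=m)
```

`random.choices` normalizes the weights internally, so the raw TU values can be passed without dividing by U_TD first.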
Figure 8 shows the comparison of the numbers of SHUPs. Significant HUPMiner always finds more patterns than PTMFisher; PTMFisher returns few patterns, and on several datasets the number is always 0, since its huge number of candidates results in a very small per-hypothesis test level. For example, on the dataset pumsb, Significant HUPMiner always produces more than 30 patterns, while PTMFisher returns 0. The number of SHUPs obtained by all algorithms shows a downward trend as the threshold increases. Significant HUPMiner always mines more patterns than PTMFisher, benefiting from the testable support requirement in candidate mining.
Figure 9 shows the p values of the patterns tested by Significant HUPMiner, which removes insignificant patterns early. Figure 9(a) shows the p value distribution on chess, Figure 9(b) shows the number of tested patterns on connect for the two methods, and Figure 9(c) shows the p value variance on proteins, confirming that Significant HUPMiner can smoothly return SHUPs from taxonomy transactional datasets.
7. Conclusion
We introduced two efficient algorithms to mine approximate and significant high-utility patterns from two types of transactional datasets. The two proposed methods can be used either separately or together. For mining approximate patterns, our sampling strategy pursues a sample size for a quantitative transactional dataset, which has not been considered in existing sampling strategies for high-utility pattern mining. For mining statistically significant patterns from labeled transactional datasets, we introduced the concept of the testable support value and integrated the test requirement into candidate mining, which is significantly different from traditional two-phase methods. Insignificant candidates can be removed in a timely manner, and significant HUPs can be obtained based on an advanced tree structure under the FDR framework. The experimental results verify the feasibility and correctness of our algorithms.
Data Availability
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Social Science Foundation of China (21BGL088), the Project of Zhejiang Provincial Public Welfare Technology Application and Research (LGF21H180010), and the Natural Science Foundation of Zhejiang Province of China (LZ20F020001).