Abstract

High Utility Itemset Mining (HUIM) is one of the most investigated tasks of data mining. It has broad applications in domains such as product recommendation, market basket analysis, e-learning, text mining, bioinformatics, and web click stream analysis. Insights from such pattern analysis provide numerous benefits, including cost cutting, improved competitive advantage, and increased revenue. However, HUIM methods may discover misleading patterns as they do not evaluate the correlation of extracted patterns. As a consequence, a number of algorithms have been proposed to mine correlated HUIs. These algorithms still suffer from the issue of the computational cost in terms of both time and memory consumption. This paper presents an algorithm, named Efficient Correlated High Utility Pattern Mining (ECoHUPM), to efficiently mine the high utility patterns having strong correlation items. A new data structure based on utility tree (UTtree) named CoUTlist is proposed to store sufficient information for mining the desired patterns. Three pruning properties are introduced to reduce the search space and improve the mining performance. Experiments on sparse, very sparse, dense, and very dense datasets indicate that the proposed ECoHUPM algorithm is efficient as compared to the state-of-the-art CoHUIM and CoHUI-Miner algorithms in terms of both time and memory consumption.

1. Introduction

We live in a data age where a huge amount of data is generated from different devices every day. It is expected that 463 exabytes of data will be generated on a daily basis by 2025 [1]. Data mining has received a great deal of attention in order to transform data into useful information, due to the exponentially explosive growth of data [2]. Pattern mining is a type of unsupervised data mining approach, which aims to find useful, interesting, and meaningful patterns that can be used to support decision-making [3, 4]. Different pattern mining techniques are used to mine different types of patterns, including frequent patterns [5], high utility patterns [6], sequential patterns, trends, outliers, and graph structures [2, 6].

Frequent itemset mining (FIM) aims to extract patterns containing items that frequently appear in a transactional database [7]. This task has been extensively studied and remains a very active research area, as it has applications in domains such as market basket analysis, product recommendation, text mining, e-learning, bioinformatics, and web click stream analysis [3, 8, 9]. Although mining frequent patterns is useful, it assumes that all items in the dataset are equally important (e.g., in weight or profit). This assumption does not hold in several real-life applications [6, 10]. For instance, a pattern in a transaction database may be extremely frequent yet uninteresting because it produces a low profit. Conversely, numerous patterns may yield a high profit even though they are not frequent [11]. To overcome this limitation of FIM, High Utility Itemset Mining (HUIM) has emerged as a research area that aims to find high utility (important) patterns [2, 6].

HUIM takes into account the weight of items in the database and their quantities in each transaction. The goal of HUIM is to find all patterns whose utility is not less than a minimum utility threshold. Recently, HUIM has become a very active research area as it generalizes the problem of FIM and has the same wide applications [12–15].

The algorithms of HUIM are divided into two main categories. The first category is called Two-Phase algorithms [11, 16]. These algorithms generate candidates in the first phase and then, in the second phase, calculate the utility of each candidate in order to derive the HUIs. However, due to the huge number of candidates generated in the first phase, these algorithms may suffer from high time and memory consumption. The second category is One-Phase algorithms [6, 10, 17]. The algorithms of this category try to overcome the above issue by utilizing different data structures that store sufficient information for mining the desired patterns without candidate generation, and by applying various pruning properties to reduce the search space.

One critical downside of High Utility Itemset Mining methods is that they generally extract patterns with a high utility, but the items that make up these patterns may be weakly correlated. For marketing decisions, such patterns are either useless or misleading [18–23]. For instance, in a market basket analysis application, current High Utility Pattern Mining algorithms may find that buying a pen and a 60-inch plasma TV is a high utility itemset, since these items generally create a high profit when purchased together. However, these items are weakly correlated and rarely sold together. Hence, it would be a mistake to use this pattern to promote TVs to customers who buy pens [11, 21].

To address the above-stated issue, a small number of algorithms have been developed to mine Correlated High Utility Itemsets, such as HUIPM [19], FDHUP [21], FCHMbond [22], FCHMall-confidence [22], CoHUIM [20], and CoHUI-Miner [24]. These algorithms differ from each other in the measures used to evaluate the interestingness of the extracted patterns, and in the data structures and pruning properties that they use to reduce the search space and improve the mining performance. In [20, 24], a projected database has been utilized to reduce the database and improve the efficiency of correlated HUI mining. The projected database is effective, but it incurs a high computational cost in terms of running time and memory consumption.

In order to address this issue in mining Correlated High Utility Itemsets, this study proposes a new algorithm named Efficient Correlated High Utility Pattern Mining (ECoHUPM). In the proposed algorithm, new efficient data structures and pruning properties are introduced to mine the desired patterns in an efficient manner. The main contributions of this paper are summarized as follows:
(i) It proposes a novel algorithm, ECoHUPM, which adopts the divide-and-conquer approach and employs the UTtree structure, which is an extended form of FP-tree [25].
(ii) A new data structure based on UTtree, named CoUTlist, is proposed to store sufficient information for mining the desired patterns in one phase without candidate generation.
(iii) The proposed algorithm introduces several pruning properties to reduce the search space and improve the mining performance.
(iv) An experimental performance evaluation of the proposed algorithm is conducted on sparse, very sparse, dense, and very dense datasets. The performance of the proposed ECoHUPM algorithm is compared with the CoHUIM and CoHUI-Miner algorithms for Correlated High Utility Itemset Mining. Experimental results show that the proposed ECoHUPM algorithm is better than the state-of-the-art CoHUIM and CoHUI-Miner algorithms in terms of both time and memory consumption.

The rest of this paper is organized as follows: In Section 2, we review the literature associated with HUIM and CoHUIM. Next, we introduce the mathematical preliminaries and state the problem in Section 3. In Section 4, we explain the proposed algorithm in detail. Section 5 gives details of the experimental setup and analyzes the results. Section 6 concludes the work of this paper.

2. Related Work

This section reviews the literature on HUIM and CoHUIM.

2.1. High Utility Itemset Mining (HUIM)

Yao and Hamilton defined the problem of HUI mining in 2004 [26]. They developed the UMining algorithm for mining itemsets having high utility. UMining is an approximate algorithm and may fail to extract all HUIs. Hence, in order to extract the complete set of HUIs, Liu et al. [16] developed a Two-Phase algorithm. In the Two-Phase algorithm, a novel upper bound pruning property named TWU (Transaction Weighted Utilization) has been proposed to reduce the search space. The Two-Phase algorithm mines the HUIs in two phases. In the first phase, it generates the candidate HUIs whose TWU is not less than the minimum utility threshold. Then, in the second phase, it calculates the utility of each candidate by scanning the database again to derive the HUIs. However, the Two-Phase algorithm suffers from poor time and memory efficiency. The main reason is that a huge number of candidates may be generated in the first phase.

In [27], a new method based on a tree structure called HUP-tree is proposed to mine HUIs. It integrates the Two-Phase procedure and the FP-tree concept to construct a compressed tree structure that utilizes the TWU property. This approach mines HUIs in three steps: (1) it constructs the tree, (2) it generates the candidate patterns, and (3) it identifies the HUIs from the list of candidates. The mining performance of this algorithm is affected by the number of conditional trees constructed during the whole mining process and the traversal cost of each conditional tree. Hence, this algorithm suffers from high time and memory consumption due to the generation of a huge number of conditional trees and candidate patterns [28].

In order to improve the efficiency of HUIs mining, several algorithms have been developed. To extract HUIs without candidates generation, Liu and Qu proposed HUI-Miner algorithm [29]. HUI-Miner utilizes utility-list structure to store sufficient information for mining the HUIs in one phase. Then Fournier-Viger et al. developed an algorithm named FHM [30], which introduced EUCS (Estimated Utility Cooccurrence Structure) and EUCP (Estimated Utility Cooccurrence Pruning) to improve the HUIs mining performance. HUP-Miner [31] extended the HUI-Miner to speed up utility list by utilizing a look-ahead strategy and pruning the search space by database partitioning. Chen and An [32] proposed PHU-Miner which is a parallel version of HUI-Miner. A novel algorithm named ULB-Miner was developed [33], in which improved utility list has been proposed, called utility-list buffer, for speeding up the utility-list join operation and reducing the memory consumption. A new projection-based algorithm, named MAHI [34], has been proposed to speed up the discovery of HUIs by utilizing a MAprun (Matrix-based pruning strategy).

For mining HUIs without the need to set the minimum utility threshold, Tseng et al. [35] developed two types of efficient algorithms named TKU (mining Top-K Utility itemsets) and TKO (mining Top-K utility itemsets in One phase) to extract top-K high utility itemsets. However, they remain expensive in terms of both runtime and memory usage. Hence, Duong et al. [12] designed a novel algorithm named kHMC to extract the top-K HUIs more effectively. The kHMC utilizes three strategies called COV, RIU, and CUD to reduce the search space and thus improves the mining performance. Recently, Gunawan et al. [36] developed an algorithm based on binary particle swarm optimization for optimizing the search for HUIs without setting the minimum utility threshold beforehand. Instead, the minimum utility threshold is determined as a postprocessing step.

Although High Utility Pattern Mining has several applications, it has some limitations. As a consequence, many extensions of High Utility Pattern Mining have appeared in the literature, such as Incremental Utility Mining [37, 38], which aims to extract HUPs from dynamic databases, On-Shelf High Utility Pattern Mining [39–41], in which the shelf time of items is considered, and Concise Representations of High Utility Patterns (e.g., Maximal Itemsets [42, 43] and Closed High Utility Itemsets [44–47]), which aim to extract a small list of meaningful HUPs.

2.2. Correlated High Utility Itemset Mining (CoHUIM)

A number of correlation measures have been suggested in the data mining literature which are used for association analysis, such as bond, all-confidence, any-confidence [48, 49], coherence [50] and Kulczynsky [51]. As the traditional algorithms of High Utility Itemset Mining do not consider the correlation of the extracted patterns, they may lead to noninteresting or misleading patterns. In such a case, they usually discover itemsets having high utility, but these itemsets may contain weakly correlated items.

In order to extract more interesting patterns and to avoid misleading patterns resulting from the traditional methods of HUI mining, a number of algorithms have been proposed to mine Correlated High Utility Itemsets by utilizing both utility and correlation measures. Ahmed et al. [19] first proposed an algorithm named High Utility Interesting Pattern Mining (HUIPM) with strong frequency affinity for mining interesting patterns among high utility itemsets, in which the relation among items is meaningful. The HUIPM algorithm introduced a new data structure named Utility Tree based on Frequency Affinity (UTFA) to store the information required for mining the desired patterns. Although a new pruning property named Knowledge Weighted Utilization (KWU) has been proposed in this algorithm to reduce the search space, the HUIPM algorithm recursively creates a number of conditional trees to generate candidates and then derive interesting patterns. This procedure is time-consuming. Thus, Lin et al. [21] developed a new algorithm named fast algorithm for mining discriminative high utility patterns (FDHUP) to improve HUIPM. In the FDHUP algorithm, two data structures called Element Information table (EI table) and Frequency Utility table (FU table) have been proposed to store the information required for mining the DHUPs efficiently. A new pruning property based on the summation of the affinitive utility and the remaining affinitive utility has been introduced to reduce the search space.

Fournier-Viger et al. [22] developed Fast Correlated High Utility Itemset Miner (FCHM) algorithm for integrating the concept of correlation in High Utility Itemset Mining in order to extract profitable patterns that are highly correlated. Two versions of the algorithm have been proposed, FCHMbond and FCHMall-confidence, which are based on bond and all-confidence measures that are already used for measuring frequent correlated patterns [48, 50, 52]. The FCHM algorithm is based on HUI-Miner [29], in which the utility-list structure has been utilized, while TWU and strategy based on summation of initial and remaining utility have been used as pruning properties to reduce the search space. Moreover, FCHMbond and FCHMall-confidence utilize the antimonotonicity property of the bond and all-confidence measures, respectively, for further improving the mining performance.

Gan et al. [18, 20] proposed two algorithms to extract correlated purchase behaviors by considering the correlation and utility measures. The first algorithm [20] is named Correlated High Utility Itemset Mining (CoHUIM), while the second one [18] is Correlated high Utility Pattern Miner (CoUPM). Both algorithms use the Kulczynsky (abbreviated as Kulc) measure [51] in conjunction with utility measure to evaluate the interestingness of the desired patterns. The CoUPM utilizes the utility-list structure which is introduced in [29] as a data structure to store information required to mine the desired patterns. Meanwhile, an efficient projection mechanism and a sorted downward closure property are developed in CoHUIM to reduce the database size.

Vo et al. [24] suggested an algorithm, called CoHUI-Miner, to efficiently extract Correlated High Utility Itemset. The CoHUI-Miner applies the database projection mechanism to reduce the database size. Furthermore, it proposes a new concept called the prefix utility of projected transactions to directly calculate the utility of itemset.

Table 1 shows a summary of the Correlated High Utility Itemset Mining algorithms and their features.

3. Fundamental Concepts

This section presents preliminary concepts related to the problem of Correlated High Utility Itemset Mining. We adopted the definitions presented in previous work [53].

Definition 1 (quantitative database). Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of items. Each item $i \in I$ has a unit profit (external utility) denoted as $p(i)$, and in each transaction $T_d$ each item is associated with an internal utility (quantity) denoted as $q(i, T_d)$. A quantitative database $D = \{T_1, T_2, \ldots, T_n\}$ contains a set of transactions.
Table 2 shows the transactional database, while Table 3 shows external utilities for the items in Table 2.

Definition 2. The utility of an item $i$ in a transaction $T_d$ is denoted by $u(i, T_d)$ and is defined as $u(i, T_d) = p(i) \times q(i, T_d)$, where $p(i)$ is the external utility of item $i$ and $q(i, T_d)$ is the quantity of item $i$ in transaction $T_d$.
For example, .

Definition 3. The utility of an itemset $X$ in a transaction $T_d$ with $X \subseteq T_d$ is denoted by $u(X, T_d)$ and is defined as $u(X, T_d) = \sum_{i \in X} u(i, T_d)$, that is, the sum of the utilities of all items inside pattern $X$ in transaction $T_d$.
For example, .

Definition 4. The utility of an itemset $X$ in the quantitative database $D$ is denoted by $u(X)$ and is defined as $u(X) = \sum_{T_d \in D,\, X \subseteq T_d} u(X, T_d)$, that is, the sum of the utilities of itemset $X$ in all transactions containing it.
For example,  =   = 32 + 32+ 24 + 20 = 108.

Definition 5. An itemset $X$ is called a high utility itemset if $u(X) \geq minUtil$, where minUtil is the minimum utility threshold.
For example, for the data presented in Table 2 with minUtil = 90, {bc} is high utility itemset.
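
Definitions 2 to 5 can be illustrated with a minimal Python sketch. The data below is a hypothetical toy database, not the paper's Table 2 and Table 3, and the names `profit`, `database`, `item_utility`, `itemset_utility_in`, `itemset_utility`, and `is_high_utility` are illustrative assumptions.

```python
# Hypothetical toy data (NOT the paper's Table 2 / Table 3).
profit = {"a": 5, "b": 2, "c": 1}           # external utility p(i)
database = {                                 # tid -> {item: quantity q(i, T_d)}
    "T1": {"a": 1, "b": 2},
    "T2": {"b": 3, "c": 4},
    "T3": {"a": 2, "b": 1, "c": 1},
}


def item_utility(item, tid):
    """Definition 2: u(i, T_d) = p(i) * q(i, T_d)."""
    return profit[item] * database[tid][item]


def itemset_utility_in(itemset, tid):
    """Definition 3: sum of the item utilities of X inside one transaction."""
    return sum(item_utility(i, tid) for i in itemset)


def itemset_utility(itemset):
    """Definition 4: sum of u(X, T_d) over all transactions containing X."""
    return sum(
        itemset_utility_in(itemset, tid)
        for tid, transaction in database.items()
        if set(itemset) <= set(transaction)
    )


def is_high_utility(itemset, min_util):
    """Definition 5: X is a HUI if u(X) >= minUtil."""
    return itemset_utility(itemset) >= min_util
```

On this toy data, `itemset_utility({"a", "b"})` sums 9 from T1 and 12 from T3, so {ab} is a high utility itemset for a threshold of 20, whereas {bc} (utility 13) is not.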

Definition 6. The utility of a transaction $T_d$ is denoted by $TU(T_d)$ and is defined as the sum of the utilities of all items inside the transaction: $TU(T_d) = \sum_{i \in T_d} u(i, T_d)$.
For example, the utility of transaction is calculated as .

Definition 7. The Transaction Weighted Utilization (TWU) of an itemset $X$ in database $D$ is defined as $TWU(X) = \sum_{T_d \in D,\, X \subseteq T_d} TU(T_d)$.
For example, .

Definition 8. An itemset $X$ is called a High Transaction-Weighted Utilization Itemset (HTWUI) if $TWU(X) \geq minUtil$, where minUtil is the minimum utility threshold.
For example, with minUtil = 90, an itemset (bc) is HTWUI.
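
The transaction utility and TWU of Definitions 6 to 8 can be sketched in the same way. Again, the data is a hypothetical toy database rather than Table 2, and the function names are assumptions made for illustration only.

```python
# Hypothetical toy data (NOT the paper's Table 2 / Table 3).
profit = {"a": 5, "b": 2, "c": 1}           # external utility of each item
database = {                                 # tid -> {item: quantity}
    "T1": {"a": 1, "b": 2},
    "T2": {"b": 3, "c": 4},
    "T3": {"a": 2, "b": 1, "c": 1},
}


def transaction_utility(tid):
    """Definition 6: TU(T_d) is the sum of all item utilities in T_d."""
    return sum(profit[i] * q for i, q in database[tid].items())


def twu(itemset):
    """Definition 7: sum of TU(T_d) over the transactions containing X."""
    return sum(
        transaction_utility(tid)
        for tid, transaction in database.items()
        if set(itemset) <= set(transaction)
    )


def is_htwui(itemset, min_util):
    """Definition 8: X is an HTWUI if TWU(X) >= minUtil."""
    return twu(itemset) >= min_util
```

Because every transaction utility counts all items of the transaction, TWU(X) can never be smaller than u(X), which is what makes it usable as an upper bound for pruning.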
Different measures have been used to evaluate the interestingness of the HUIs, such as frequency affinity, bond, all-confidence, and Kulczynsky. Kulczynsky measure was recommended in [2] and has been used in [18, 20, 24]. Kulczynsky (abbreviated as Kulc) is a null-invariant measure; it is not influenced by the null transactions and is used to evaluate the inherent correlation of patterns [48, 51].

Definition 9 (support). The support of an itemset $X$ in the transactional database $D$ is denoted by $sup(X)$ and is defined as the proportion of transactions in the database that contain $X$: $sup(X) = |\{T_d \in D : X \subseteq T_d\}| / n$, where $n$ is the total number of transactions in the database.
For example, for the data in Table 2, .

Definition 10. The correlation between items inside an itemset $X$ based on the Kulc measure is defined as the mean of the conditional probabilities of items: $Kulc(X) = \frac{1}{k} \sum_{i \in X} \frac{sup(X)}{sup(i)}$, where $k$ is the number of items inside $X$.
For example, for the data in Table 2, .

Definition 11 (Correlated High Utility Itemset). For a given quantitative database $D$ with minimum utility threshold (minUtil) and minimum correlation threshold (minCor), a Correlated High Utility Itemset is an itemset $X$ such that $u(X) \geq minUtil$ and $Kulc(X) \geq minCor$.
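
As a hedged illustration of Definitions 9 to 11, the following Python sketch computes the support and the Kulc value on a hypothetical list of transactions (not Table 2). Exact fractions are used to avoid rounding; the utility is passed in as a plain number, and all names are illustrative assumptions.

```python
from fractions import Fraction

# Hypothetical toy transactions (NOT the paper's Table 2); quantities omitted.
transactions = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}]


def support(itemset):
    """Definition 9: proportion of transactions that contain the itemset."""
    matched = sum(1 for t in transactions if set(itemset) <= t)
    return Fraction(matched, len(transactions))


def kulc(itemset):
    """Definition 10: mean of sup(X) / sup(i) over the items i of X.

    Assumes every item of X occurs in at least one transaction.
    """
    items = list(itemset)
    sup_x = support(items)
    return sum(sup_x / support([i]) for i in items) / len(items)


def is_correlated_hui(itemset, utility, min_util, min_cor):
    """Definition 11: u(X) >= minUtil and Kulc(X) >= minCor."""
    return utility >= min_util and kulc(itemset) >= min_cor
```

Here {ab} has sup = 2/3 with sup(a) = 2/3 and sup(b) = 1, so Kulc(ab) = (1 + 2/3) / 2 = 5/6, while {ac} occurs in only one transaction and obtains Kulc(ac) = 1/2.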

4. Proposed Algorithm

In order to address the need for more efficient algorithm for mining Correlated High Utility Itemsets, we propose a new algorithm named Efficient Correlated High Utility Pattern Mining (ECoHUPM). This section presents the proposed ECoHUPM algorithm in detail, the data structures that it utilizes to store sufficient information for mining the desired patterns, and the pruning properties that are used to reduce the search space and improve the mining performance.

4.1. Database Revising

The proposed ECoHUPM algorithm revises the input database in its first step. First, Property 1 [16] is used to remove all 1-itemsets whose TWU is less than the minimum utility threshold; for the data presented in Table 2 with minUtil = 90, the item whose TWU is below 90 is removed. Second, in each transaction, the utility of each item is computed as quantity × profit, as stated in Definition 2. Third, the items in each transaction are sorted in descending order of their support, and the total utility is assigned to each transaction. Table 4 shows the items in descending order of their support, while Table 5 shows the revised database.
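
The three revising steps can be sketched as follows. This is a minimal sketch on hypothetical data, under two assumptions that the text does not settle: ties in support are broken alphabetically, and the total utility assigned to a transaction is computed over the items that remain after pruning. The function name `revise_database` is also an assumption.

```python
def revise_database(database, profit, min_util):
    """Drop low-TWU items, compute item utilities, sort by support (descending)."""
    # Transaction utilities of the original database.
    tu = {
        tid: sum(profit[i] * q for i, q in t.items())
        for tid, t in database.items()
    }
    twu, support = {}, {}
    for tid, t in database.items():
        for item in t:
            twu[item] = twu.get(item, 0) + tu[tid]
            support[item] = support.get(item, 0) + 1

    # Step 1: keep only the 1-itemsets whose TWU reaches minUtil.
    keep = {item for item in twu if twu[item] >= min_util}

    revised = {}
    for tid, t in database.items():
        # Step 3 ordering (assumption: alphabetical tie-break).
        items = sorted(
            (i for i in t if i in keep), key=lambda i: (-support[i], i)
        )
        # Step 2: utility = quantity * profit.
        utilities = [(i, profit[i] * t[i]) for i in items]
        # Assumption: total utility over the remaining items only.
        revised[tid] = (utilities, sum(u for _, u in utilities))
    return revised


# Hypothetical toy data (NOT the paper's Table 2 / Table 3).
profit = {"a": 5, "b": 2, "c": 1, "d": 1}
database = {
    "T1": {"a": 1, "b": 2},
    "T2": {"b": 3, "c": 4, "d": 1},
    "T3": {"a": 2, "b": 1, "c": 1},
}
revised = revise_database(database, profit, min_util=20)
```

With a threshold of 20, item d (TWU = 11) is removed, and the remaining items are ordered b, a, c according to their supports 3, 2, and 2.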

4.2. Search Space

The proposed ECoHUPM algorithm utilizes a set-enumeration tree as a search space, whose efficiency has been verified in pattern mining [29]. Reversed depth-first search traversal is adopted as shown in Figure 1 to facilitate the search tree. Note that the ECoHUPM uses the support descending order to revise database and then to construct the UTtree. Hence, with reversed depth-first search, the mining order for the running example is .

Definition 12. Given a set-enumeration tree and itemset represented by a node , a set of nodes with their ancestors are called the extensions (supersets) of .
For the -itemset (itemset containing items), we denote its extensions containing items as -extension of the itemset. By adopting reverse depth-first traversal, any extension of itemset is a combination of with the item(s) before X.
For instance, in the set-enumeration tree represented in Figure 1, itemset is 2-extension of , while itemset is 3-extension of .

4.3. Utility Tree and Correlation Utility-List Structures

Once the database is revised, the proposed ECoHUPM algorithm constructs the utility tree (UTtree). A UTtree is a concise structure that stores sufficient information for mining Correlated High Utility Itemsets in a single phase. It is an extended form of FP-tree [25], where each node consists of four fields: the item's label, a link that points to the next node of the same item, a pointer to the parent node, and a dictionary that stores transaction IDs as keys and the item's utility in each transaction as values.

A UTtree is constructed with only one scan of the revised database, as shown in Algorithm 1. First, the tree is initialized by creating the Root node. Then the transactions are processed one by one, as shown in lines 1 and 2. The information of each transaction is inserted into the tree by calling the function insertTree, as shown in line 3. The function is executed as follows: First, the first item of the current transaction is taken as the current item. Then, if the current node has a child whose label is the current item, the dictionary of that child is updated by adding the transaction ID as a key with the item's utility as its value; otherwise, a new node is created whose label is the current item, whose parent pointer points to the current node, whose link points to the previous node of the same label, and whose dictionary holds the transaction ID with the item's utility. Thereafter, while the list of remaining items is not empty, the function is called recursively. Finally, after inserting the information of the last transaction, the final UTtree is returned (line 5).
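
The construction described above can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the class and attribute names (`Node`, `label`, `parent`, `link`, `utilities`), the extra `children` map, and the `header` dictionary standing in for the header table are assumptions, and the input is a hypothetical revised database.

```python
class Node:
    """One UTtree node; attribute names are illustrative assumptions."""

    def __init__(self, label, parent):
        self.label = label        # item label
        self.parent = parent      # pointer to the parent node
        self.link = None          # previous node carrying the same label
        self.utilities = {}       # transaction ID -> utility of the item
        self.children = {}        # label -> child node (implementation helper)


def insert_tree(items, node, tid, header):
    """Insert the remaining (item, utility) pairs of one transaction."""
    if not items:
        return
    (item, utility), rest = items[0], items[1:]
    child = node.children.get(item)
    if child is None:
        child = Node(item, node)
        child.link = header.get(item)   # chain to the previous same-label node
        header[item] = child
        node.children[item] = child
    child.utilities[tid] = utility
    insert_tree(rest, child, tid, header)


def build_uttree(revised):
    """Build the tree with a single scan of the revised database."""
    root, header = Node(None, None), {}
    for tid, items in revised.items():
        insert_tree(items, root, tid, header)
    return root, header


# Hypothetical revised database: items already sorted by descending support.
revised = {
    "T1": [("b", 4), ("a", 5)],
    "T2": [("b", 6), ("c", 4)],
    "T3": [("b", 2), ("a", 10), ("c", 1)],
}
root, header = build_uttree(revised)
```

In this toy tree, the three transactions share a single node for b, and item c ends up in two nodes (one below b and one below a) that are chained through their `link` fields.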

For the revised database presented in Table 4, Figure 2 shows the UTtree after inserting the first transaction. First, the tree is initialized by the node, and then node is created with , , and . Then, node is created with , , and . Node is created with , , and . Node is created with , , and . Node is created with , , and . As the current transaction is the first transaction, of all nodes points to the items holding the same label in the header table. Similarly, the second and third transactions are inserted into the UTtree as shown in Figure 3. The final UTtree after inserting the last transaction is shown in Figure 4.

Besides adopting the UTtree, a new condensed data structure named CoUTlist is proposed to store sufficient information for mining the superset patterns without needing to scan the UTtree multiple times.

Definition 13. The Correlation Utility list (CoUTlist) of an itemset X contains a set of elements, where each element represents a node, called a suffix, in which itemset X appears. In the CoUTlist, each element has four fields:
(i) An identifier, which is a unique number for each node and is used as a sequence number, for example, (1, 2, …, n).
(ii) A support, which stores the total number of transactions in which itemset X occurs in the current node.
(iii) A utility, which stores the total utility of itemset X in the current node.
(iv) A dictionary of prefix utilities, where the keys are the labels of the parent nodes of the current node, and the values are the summation of the utilities of each parent node in the current path, taken over the transaction IDs stored as keys in the current node.

4.3.1. The CoUTlist of 1-Itemset

Given a UTtree and item , we denote the set of nodes labeled as and the set of the parent nodes in the current path as . The CoUTlist of 1-itemset is denoted as CoUTlist and is constructed as follows: , if , are the parents of ; then construct a quadruple (, , , ), and append it to CoUTlist.

Figure 5 shows how the CoUTlist of an item denoted as is constructed. Based on the UTtree shown in Figure 3, there are two nodes: for those whose , we denote the first node as suffix1 and the second node as suffix2. According to Definition 13, the first element of the CoUTlist(a) represents suffix1, with  = 1, , , and  =  = . The second element of the CoUTlist(a) represents suffix2, with  = 2, , , and  =  = . Subsequently, the elements of suffix1 and suffix2 formalized CoUTlist(a) as shown in Figure 6.

In the same manner, the CoUTlist of the remaining 1-itemsets are constructed as shown in Figure 7.

4.3.2. The CoUTlist of -Itemset

Let itemset  =  be an extension of itemset  = . We denote the element in CoUTlist() as and the element in CoUTlist as .

Definition 14. Item is after itemset if is after all items in . Here the keys {represent items} in of Ynode are sorted according to the mining order which is the ascending order on their supports and are denoted as .
For example, we have the first node of the CoUTlist of f itemset, .

Definition 15. Given an itemset Y, the CoUTlist of its superset X, obtained by adding an item x to Y, is constructed as follows: for each element Ynode in CoUTlist(Y), if x is a key of the prefix-utility dictionary of Ynode, an element is added to CoUTlist(X) such that
(i) its identifier is the identifier of Ynode;
(ii) its support is the support of Ynode;
(iii) its utility is the utility of Ynode plus the value stored for x in the prefix-utility dictionary of Ynode;
(iv) its prefix-utility dictionary keeps only those entries of the dictionary of Ynode whose keys are after X in the sense of Definition 14.

Figure 8 shows how the CoUTlist of itemset {da} and the CoUTlist of itemset {ca} are constructed from the CoUTlist of itemset {a}. It further shows how the CoUTlist of itemset {cda} is constructed from the CoUTlist of itemset {da}. The CoUTlist(a) is [[1, 6, 51, {'d': 15, 'c': 72}], [2, 2, 24, {'c': 28}]]. As item d appears only in the first element of CoUTlist(a), only one element is added to CoUTlist(da), with identifier 1, support 6, utility 51 + 15 = 66, and prefix utilities {'c': 72}. On the other hand, item c appears in both elements of CoUTlist(a); hence, two elements are added to CoUTlist(ca). The first element has identifier 1, support 6, utility 51 + 72 = 123, and prefix utilities { }, while the second element has identifier 2, support 2, utility 24 + 28 = 52, and prefix utilities { }. Similarly, CoUTlist(cda) is constructed from CoUTlist(da). As item c appears in the element of CoUTlist(da), this element is added to CoUTlist(cda) with identifier 1, support 6, utility 66 + 72 = 138, and prefix utilities { }.
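
The extension step can be reproduced on the CoUTlist(a) values quoted above with a short Python sketch. The tuple layout (identifier, support, utility, prefix utilities) and the function name `extend` are assumptions; the sketch also assumes that each prefix dictionary lists the nearest parent first, so that the entries kept after adding an item are exactly those listed after it.

```python
def extend(coutlist, item):
    """Build the CoUTlist of the itemset extended with `item`.

    Each element is (identifier, support, utility, prefix_utilities).
    Assumption: prefix keys are ordered from the nearest parent upwards.
    """
    result = []
    for node_id, support, utility, prefix in coutlist:
        if item not in prefix:
            continue                      # this node has no such ancestor
        keys = list(prefix)
        remaining = {k: prefix[k] for k in keys[keys.index(item) + 1:]}
        result.append((node_id, support, utility + prefix[item], remaining))
    return result


# CoUTlist(a) as given in the running example.
coutlist_a = [
    (1, 6, 51, {"d": 15, "c": 72}),
    (2, 2, 24, {"c": 28}),
]
coutlist_da = extend(coutlist_a, "d")
coutlist_ca = extend(coutlist_a, "c")
coutlist_cda = extend(coutlist_da, "c")
```

Under these assumptions the sketch yields the same utilities as the running example: 66 for {da}, 123 and 52 for the two elements of {ca}, and 138 for {cda}.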

4.4. Pruning Properties

Alongside adopting Property 1 (TWU) [16], three new pruning properties are introduced in the proposed ECoHUPM algorithm and applied to reduce the search space.

Property 1. Transaction-Weighted Upper Bound Property (TWU) is already introduced in [16]. The proposed algorithm utilizes the TWU property [16] to remove all 1-itemsets having TWU less than minUtil threshold.
Let X be a k-itemset, and let Y be a (k − 1)-itemset such that $Y \subset X$. If X is an HTWUI, Y is an HTWUI as well. This means that if an itemset is a Low Transaction-Weighted Utilization Itemset (LTWUI), all its supersets will be LTWUIs as well. Hence, this property can be used to reduce the search space by removing LTWUIs together with their supersets from the search space.

Property 2. The first proposed pruning property is the Upper Bound property based on the summation of the Utility and the Path Utilities (UBUPU).
Given an itemset X, let $PU(X)$ denote its path utility. If $u(X) + PU(X)$ is less than the minimum utility threshold minUtil, then X and any superset of X are not CoHUIs.

Proof. Let Y be a superset of X; we know the following.
The utility of X is calculated as the sum of the nodes' utilities in CoUTlist(X); for example, $u(a) = 51 + 24 = 75$.
The path utility of itemset X is calculated as the sum of the prefix utilities stored in the dictionaries of CoUTlist(X); for example, $PU(a) = (15 + 72) + 28 = 115$.
Since $u(Y) \leq u(X) + PU(X)$, no superset can reach minUtil when this sum is below it; for example, $u(da) = 66$, $u(ca) = 175$, and $u(cda) = 138$. All values are at most $u(a) + PU(a) = 75 + 115 = 190$.
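
The bound can be checked numerically on the CoUTlist(a) values of the running example. The sketch below is illustrative only; the tuple layout (identifier, support, utility, prefix utilities) and the function names are assumptions.

```python
def utility(coutlist):
    """Utility of an itemset: sum of the node utilities in its CoUTlist."""
    return sum(node_utility for _, _, node_utility, _ in coutlist)


def path_utility(coutlist):
    """Path utility: sum of all prefix utilities stored in the CoUTlist."""
    return sum(sum(prefix.values()) for _, _, _, prefix in coutlist)


def can_prune(coutlist, min_util):
    """UBUPU check: prune the itemset and its supersets if the bound < minUtil."""
    return utility(coutlist) + path_utility(coutlist) < min_util


# CoUTlist(a) as given in the running example.
coutlist_a = [
    (1, 6, 51, {"d": 15, "c": 72}),
    (2, 2, 24, {"c": 28}),
]
bound_a = utility(coutlist_a) + path_utility(coutlist_a)
```

With minUtil = 90 the bound of 190 does not allow pruning, so the supersets of {a} still have to be explored; a threshold above 190 would remove {a} and all its supersets at once.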

Property 3. The second proposed pruning property is the Lower Bound property based on the Node Utility (LBNU). The CoUTlist of each itemset X is a list of elements (nodes), where each element represents a set of transactions containing X, in contrast to the utility list [18, 22] or the projected database [20, 24], where each element represents a single transaction. With the CoUTlist, the utility of some itemsets may therefore reach minUtil within a single element of their CoUTlists, and thus the following lower bound property based on the node utility is formed.
Consider an itemset X: for each Xnode ∈ CoUTlist(X), if the utility of Xnode is equal to or greater than minUtil, then all possible combinations of X with the items in the current path are considered high utility itemsets (Figure 9).

Proof. Let CoUTlist be the correlation utility list of itemset , and Xnode CoUTlist, we know the following:The set of parent nodes in the current path is denoted as . =   + .   Hence, , where ).For example, for the running example, with m = 90, as shown in Figure 9, in the CoUTlist(ae), the of the second element = 96 . Hence, all possible combinations of itemsets in the current path , , and are considered high utility itemsets.
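
A hedged sketch of how the LBNU property could be applied to a single CoUTlist element is given below. The element is hypothetical: only its utility of 96 mirrors the running example, while the path items and their prefix utilities are invented for illustration, as are the tuple layout and the function name `lbnu_supersets`.

```python
from itertools import combinations


def lbnu_supersets(itemset, element, min_util):
    """Return the supersets on the current path that LBNU marks as HUIs.

    `element` is (identifier, support, utility, prefix_utilities).
    If the node utility is below min_util, the property does not apply.
    """
    _, _, node_utility, prefix = element
    if node_utility < min_util:
        return []
    path_items = list(prefix)
    supersets = []
    for size in range(1, len(path_items) + 1):
        for combo in combinations(path_items, size):
            supersets.append(frozenset(itemset) | frozenset(combo))
    return supersets


# Hypothetical element: utility 96 as in the example, path items invented.
element = (2, 3, 96, {"d": 10, "c": 30})
found = lbnu_supersets({"a", "e"}, element, min_util=90)
```

Since adding path items can only increase the utility inside the transactions of this node, each returned superset already satisfies the utility condition, and only its Kulc value remains to be checked.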

Property 4. The third proposed pruning property is Sorted-Reversing Downward Closure (SRDC) property based on Kulc measure, which is used as a correlation measure in the proposed ECoHUPM algorithm. By adopting reverse depth-first traversal, each -itemset is in this form , and because these items are sorted based on their support descending order, the sorted-reversing property based on Kulc measure is formed as

Proof. Let  =  be a superset of  = ; we know thatHence,Note that the proposed SRDC property is similar to the sorted downward closure (SDC) property which was employed in [20, 24]. However, SDC cannot be applied directly in the proposed ECoHUPM algorithm, because, in [20, 24], the items are sorted in the ascending order of their supports. Meanwhile, in ECoHUPM, the items are sorted in the descending order of their supports.
The proposed UBUPU and SRDC properties are employed to reduce the search space by removing all supersets of each itemset X if the sum of its utility and path utilities is less than minUtil or $Kulc(X) < minCor$. On the other hand, the proposed LBNU property is employed to improve the searching efficiency as follows: for each element in the CoUTlist of itemset X, if the node utility is equal to or greater than minUtil, all possible supersets of X in the current path are considered HUIs. Hence, ECoHUPM only needs to calculate the correlation of each superset in the current path, without needing to check whether its utility, or the sum of its utility and path utilities, reaches minUtil.
Figure 10 shows how these three properties help significantly in reducing the search space and thus improve the mining performance. For example, with  = 0.4 and  = 90, since  = 0.29 < , and all its supersets are removed from the search space based on the proposed SRDC property. Similarly, the itemsets , , and are removed along with their supersets from the search space according to the proposed UBUPU and SRDC properties. As of the second element of is greater than , all its supersets in the current path , , and are considered as stated in the proposed LBNU property, and thus ECoHUPM only needs to calculate their values to examine whether they are Correlated or not (Algorithm 1).

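Input: The revised database RD
Output: The UTtree
Initialization: Create Root node to initialize the tree
(1)for each transaction T in RD do
(2) items ← Items of T
(3) Call insertTree(items, Root)
(4)end for
(5)return UTtree
Function: insertTree(items, node)
(1)item ← the first item of items
(2)if node has a child N such that the label of N is item then
(3) Update the utility dictionary of N with the transaction ID and the utility of item
(4)else
(5) Create new node N for the item
(6) The parent pointer of N points to node
(7) The link of N points to the nodes whose label is item
(8) The utility dictionary of N stores the transaction ID and the utility of item
(9)end if
(10)if the list of remaining items is not null then
(11) Call insertTree(remaining items, N)
(12)end if
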
4.5. ECoHUPM Algorithm

Using the UTtree and the CoUTlist, we developed an efficient algorithm named ECoHUPM for mining Correlated High Utility Itemsets. ECoHUPM adopts reverse depth-first traversal for set-enumeration tree searching and employs the pruning properties to reduce the search space. The pseudocode of ECoHUPM is shown in Algorithm 2.

Input: Database D, minUtil, and minCor.
Output: All Correlated High Utility Itemsets.
(1)Scan D to obtain the revised database RD;
(2)I ← the set of unique items in RD sorted on their support descending order;
(3)Execute the Algorithm 1 to construct UTtree for the RD;
(4)while I not NULL do
(5) x ← the last item in I;
(6) from the header table of the UTtree follow the links of the nodes labeled x to construct the CoUTlist(x);
(7) if u(x) ≥ minUtil then
(8)  CoHUIs ← CoHUIs ∪ {x}
(9) end if
(10) if u(x) + path utility of x ≥ minUtil then
(11)  Call search (x, CoUTlist(x), minUtil, minCor)
(12) end if
(13) I ← I − {x}
(14)end while
(15)return CoHUIs

The input to ECoHUPM is a transactional database with external utilities, together with a given minimum utility threshold and a given minimum correlation threshold. In line 1, ECoHUPM preprocesses the database to obtain the revised database RD; it then stores the set of unique items, sorted in descending order of their support, in an item list (line 2). It then runs Algorithm 1 to construct the UTtree with one scan of RD (line 3).

Lines 4 to 15 state the procedure for extracting the Correlated High Utility Itemsets. In each iteration of the loop starting at line 4, ECoHUPM finds all Correlated HUIs that are supersets of the current item. Lines 5 and 6 construct the CoUTlist of the 1-itemset with the help of the header table and the links of the nodes labeled with that item in the UTtree, as illustrated in Section 4.3.1. Because the correlation value of each 1-itemset is 1, lines 7 to 9 add the itemset to the list of Correlated HUIs if its utility is equal to or greater than the minimum utility threshold. Line 10 applies the proposed UBUPU property by examining the sum of the utility and the path utilities of the itemset. If this sum is less than the minimum utility threshold, all its supersets are pruned; otherwise, the search function is called to explore its supersets (line 11). This procedure is performed for every 1-itemset to discover the Correlated High Utility Itemsets. The search function is given in Algorithm 3.
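
The statement that every 1-itemset has a correlation value of 1 can be checked with a small Python sketch. It assumes the common definition of the Kulc (Kulczynski) measure as the average, over the items of an itemset, of the itemset's support divided by that item's support; the function names and the toy transactions are hypothetical.

```python
def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def kulc(itemset, transactions):
    """Average of sup(X) / sup({x}) over the items x of X (assumed definition).

    Assumes every item of `itemset` occurs in at least one transaction.
    """
    items = frozenset(itemset)
    sup_x = support(items, transactions)
    ratios = [sup_x / support(frozenset([i]), transactions) for i in items]
    return sum(ratios) / len(ratios)


transactions = [
    frozenset(t)
    for t in (["a", "b", "c"], ["a", "b"], ["a", "c"], ["b"])
]

kulc_a = kulc({"a"}, transactions)        # single item: sup/sup = 1
kulc_ab = kulc({"a", "b"}, transactions)  # (2/3 + 2/3) / 2
```

For a single item the only ratio is its support divided by itself, so the value is always 1, which is why only the utility needs to be tested for 1-itemsets.
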

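Function: search
(1) List of high utility prefixes ← empty
(2) List of prefixes ← empty
(3) for each node in the CoUTlist of the itemset do
(4)  for each prefix item of the node do
(5)   if the utility of the node ≥ the minimum utility threshold then
(6)    Add the prefix item to the list of high utility prefixes
(7)   end if
(8)   if the prefix item is not yet in the list of prefixes then
(9)    Add the prefix item to the list of prefixes
(10)   end if
(11)  end for
(12) end for
(13) for each prefix item in the list of prefixes do
(14)  Extended itemset ← itemset + prefix item;
(15)  Scan the CoUTlist of the itemset to construct the CoUTlist of the extended itemset
(16)  if the Kulc value of the extended itemset ≥ the minimum correlation threshold then
(17)   if the prefix item is in the list of high utility prefixes then
(18)    Add the extended itemset to the Correlated HUIs
(19)    Call search for the extended itemset
(20)   else
(21)    if the utility of the extended itemset ≥ the minimum utility threshold then
(22)     Add the extended itemset to the Correlated HUIs
(23)    end if
(24)    if the utility of the extended itemset plus its path utilities ≥ the minimum utility threshold then
(25)     Call search for the extended itemset
(26)    end if
(27)   end if
(28)  end if
(29) end for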

The search function explores the whole list of extensions of an itemset in order to discover all its correlated high utility supersets. It scans the CoUTlist of the itemset node by node to find all possible prefixes (lines 1–12). If the utility of the current node is equal to or greater than the minimum utility threshold, all prefix items of the current node are added to the list of high utility prefixes (lines 5 to 7). All unique prefixes are added to the list of prefixes (lines 8 to 10). The procedure in lines 13 to 29 is then performed for each prefix in that list. First, a 1-extension of the itemset is formed by adding the prefix to it (line 14), and its CoUTlist is constructed (line 15). Line 16 implements the proposed SRDC property, removing the extended itemset and all its supersets from the search space if its Kulc value is less than the minimum correlation threshold. Lines 17 to 19 employ the proposed LBNU property: if the current prefix is in the list of high utility prefixes, the extended itemset is added to the list of Correlated HUIs and the search function is called to explore all its supersets. Otherwise, lines 21 to 23 add the extended itemset to the list of Correlated HUIs if its utility is equal to or greater than the minimum utility threshold, while lines 24 to 26 employ the proposed UBUPU property: the itemset and all its supersets are removed if its utility plus path utilities is less than the minimum utility threshold; otherwise, the search function is called to explore its supersets.
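
The order in which the three properties are applied to one extended itemset (lines 16 to 26) can be summarized in a short Python sketch. It only models the decision, not the CoUTlist construction; the names `Decision` and `decide` and all parameter names are hypothetical, and the Kulc value, utility, and path utilities are assumed to have been computed already.

```python
from typing import NamedTuple


class Decision(NamedTuple):
    output: bool  # report the itemset as a Correlated HUI
    extend: bool  # keep searching its supersets


def decide(kulc_value, utility, path_utilities, prefix_is_high_utility,
           min_util, min_corr):
    """Apply SRDC, then LBNU, then the utility test and UBUPU."""
    # SRDC: a weakly correlated itemset is dropped with all its supersets.
    if kulc_value < min_corr:
        return Decision(output=False, extend=False)
    # LBNU: the prefix came from a node whose utility already met the
    # threshold, so the itemset is accepted without a utility check.
    if prefix_is_high_utility:
        return Decision(output=True, extend=True)
    # Otherwise test the utility directly, and use UBUPU to decide
    # whether any superset can still reach the threshold.
    return Decision(
        output=utility >= min_util,
        extend=utility + path_utilities >= min_util,
    )


srdc_pruned = decide(0.29, 120, 0, False, min_util=90, min_corr=0.4)
lbnu_accepted = decide(0.50, 10, 0, True, min_util=90, min_corr=0.4)
only_extended = decide(0.50, 60, 40, False, min_util=90, min_corr=0.4)
ubupu_pruned = decide(0.50, 50, 30, False, min_util=90, min_corr=0.4)
```

The first call mirrors the example with thresholds 0.4 and 90: a Kulc value of 0.29 ends the search for that branch regardless of utility.
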

5. Experiment Design

In this section, we present the design of the experiments for performance evaluation. Experiments were performed on a computer with an Intel® Core™ i7-6600U CPU @ 2.60 GHz (4 CPUs), 2.8 GHz, and 8 GB of memory, running 64-bit Windows 10 Pro. The performance of the proposed ECoHUPM algorithm was compared to that of the CoHUIM and CoHUI-Miner algorithms in terms of both runtime and memory consumption. All algorithms were implemented using Python 3 and Jupyter Notebook.

5.1. Datasets Used

We used five standard datasets downloaded from the SPMF library [54]: two real-life datasets with real utility values (Foodmart and Ecommerce) and three real-life datasets with synthetic utility values (BMS, Chess, and Mushroom). Characteristics of the considered datasets are shown in Table 6.

The minimum correlation threshold (minCorr) is set to three values on each sparse dataset and to four values on each dense dataset to evaluate the efficiency of the proposed ECoHUPM algorithm; the resulting runs are denoted ECoHUPM-minCorr1, ECoHUPM-minCorr2, ECoHUPM-minCorr3, and ECoHUPM-minCorr4, respectively. The thresholds are set as follows: (1) Foodmart: 0.1, 0.2, 0.3; (2) Ecommerce: 0.1, 0.2, 0.3; (3) BMS: 0.5, 0.6, 0.7; (4) Mushroom: 0.4, 0.45, 0.5, 0.55; and (5) Chess: 0.7, 0.75, 0.8, 0.85. Because all algorithms use the same measures to evaluate the interestingness of the extracted patterns, they all produce the same number of patterns in every experiment. Tables 7–11 show the total number of Correlated HUIs as the minimum utility and minimum correlation thresholds are varied in each dataset.

5.2. Runtime

The runtime of the proposed ECoHUPM algorithm was compared with those of two state-of-the-art Correlated HUI mining algorithms: CoHUIM and CoHUI-Miner. For each dataset, the threshold was adjusted, and the runtime of each algorithm was measured. Figures 11 and 12 show the results for the sparse and dense datasets, respectively. ECoHUPM is faster than both CoHUIM and CoHUI-Miner. In particular, ECoHUPM is significantly faster than CoHUIM on sparse datasets such as Foodmart (up to 12.8 times) and Ecommerce (up to 15.6 times) and on very sparse datasets such as BMS (up to 8.7 times). On the dense datasets, CoHUIM in most cases fails to discover the Correlated HUIs when the minimum correlation (minCorr) and minimum utility thresholds are set to low values, as shown on the Mushroom dataset with minCorr in [0.35–0.45] and a minimum utility threshold in [0.005–0.009]. When the thresholds are set to high values, such as minCorr = 0.5 and a minimum utility threshold in [0.05–0.09], ECoHUPM is up to four times faster than CoHUIM. Similarly, on very dense datasets such as Chess, CoHUIM fails to discover the Correlated HUIs with minCorr in [0.7–0.8] and a minimum utility threshold in [0.005–0.009]. When the thresholds are set to high values, such as minCorr = 0.85 and a minimum utility threshold in [0.08–0.28], ECoHUPM is up to 4.6 times faster than CoHUIM.

On the other hand, ECoHUPM is faster than CoHUI-Miner on sparse datasets such as Foodmart and Ecommerce (up to 5 and 2.2 times, respectively), and it is slightly faster on very sparse datasets such as BMS (up to 1.3 times). For dense datasets with low threshold values, ECoHUPM is significantly faster than CoHUI-Miner on the Mushroom dataset (up to 3.2 times) and on very dense datasets such as Chess (up to 4.3 times). With high threshold values, ECoHUPM is still faster than CoHUI-Miner on the Mushroom dataset (up to 2.1 times) and on the Chess dataset (up to 1.4 times).

The main reason why the ECoHUPM algorithm is always faster than the CoHUIM and CoHUI-Miner algorithms is that the novel CoUTlist structure is highly effective in reducing the database size compared with the projection mechanism used in CoHUIM and CoHUI-Miner. In the CoUTlist, each element represents a set of transactions in which the itemset occurs in the same path, whereas in a projected database each element represents a single transaction in which the itemset occurs. Moreover, the proposed pruning properties help reduce the search space.
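
The size difference between the two representations can be illustrated with a toy count in Python. This is a simplified sketch, not the CoUTlist itself: it ignores utilities, assumes every transaction lists its items in the same global order without repeats, and treats the items up to and including the target item as the tree path. The function names are hypothetical.

```python
def projected_size(db, item):
    """One entry per transaction containing `item` (projection-style)."""
    return sum(1 for t in db if item in t)


def path_count(db, item):
    """One entry per distinct path ending at `item` (tree-style)."""
    return len({tuple(t[: t.index(item) + 1]) for t in db if item in t})


db = [
    ["a", "b", "c"],
    ["a", "b", "c"],
    ["a", "b", "c", "d"],
    ["a", "c"],
    ["b", "c"],
]

n_projected = projected_size(db, "c")  # five transactions contain c
n_paths = path_count(db, "c")          # (a,b,c), (a,c), (b,c)
```

The more transactions share a path, as in dense datasets, the larger the gap between the two counts; when almost every transaction has its own path, as in very sparse datasets, the counts are nearly equal.
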

The CoHUIM algorithm performs two phases. It first generates the candidate itemsets whose correlation is equal to or greater than the threshold, and then it calculates the utility of each candidate. Hence, in all datasets, CoHUIM is much slower than the CoHUI-Miner and the proposed ECoHUPM.

The size of the projected database of an itemset increases with the density of the dataset, and so does the cost of building the projected databases of its supersets. As a result, CoHUIM could not find the Correlated HUIs when it was run on the Mushroom and Chess datasets with low minimum utility and minimum correlation thresholds, because it suffers from excessive dataset scanning in the second phase.

CoHUI-Miner is a one-phase algorithm. However, because the projected database of each itemset is large compared with the CoUTlist, especially in dense and very dense datasets, the proposed ECoHUPM is significantly faster than CoHUI-Miner on the Mushroom and Chess datasets.

In very sparse datasets, the size of the CoUTlist of each itemset is only slightly smaller than the size of the projected database. Hence, the proposed ECoHUPM is only slightly faster than CoHUI-Miner on the BMS dataset.

5.3. Memory Usage

The memory usage of the proposed ECoHUPM is compared against that of CoHUIM and CoHUI-Miner in Figure 13. In this figure, the Y-axis represents memory usage, measured with a Python module.
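
Peak memory of a Python function can be measured in several ways. The sketch below uses the standard-library `tracemalloc` module purely as an illustration; this choice is an assumption and not necessarily the module used for Figure 13, and the helper name `peak_memory_mib` is hypothetical.

```python
import tracemalloc


def peak_memory_mib(func, *args, **kwargs):
    """Run `func` and return (its result, peak traced memory in MiB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / (1024 * 1024)


# Example workload standing in for a mining run.
squares, peak_mib = peak_memory_mib(lambda n: [i * i for i in range(n)], 100_000)
```

`tracemalloc` only tracks memory allocated through Python's own allocator, so its figures can differ from process-level measurements taken by other tools.
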

The proposed ECoHUPM algorithm consumes less memory than CoHUIM and CoHUI-Miner on all datasets. More specifically, on sparse datasets such as Foodmart and Ecommerce, CoHUIM uses 1.5 and 3 times the memory of the proposed ECoHUPM, respectively. On very sparse datasets such as BMS, CoHUIM uses 2.2 times the memory of the proposed ECoHUPM.

Likewise, on sparse and very sparse datasets such as Foodmart, Ecommerce, and BMS, the CoHUI-Miner occupies 1.07, 1.8, and 1.3 times the memory of the proposed ECoHUPM, respectively. Meanwhile, on dense and very dense datasets such as Mushroom and Chess, the CoHUI-Miner occupies 1.7 and 2.2 times the memory of the proposed ECoHUPM, respectively.

6. Conclusion

This paper proposed an efficient algorithm named ECoHUPM for mining Correlated HUIs. The ECoHUPM algorithm adopts a divide-and-conquer approach and employs the UTtree structure, which is an extended form of the FP-tree. A novel data structure based on the UTtree, named CoUTlist, is proposed in ECoHUPM to store sufficient information for mining the desired patterns efficiently. Three new pruning properties have been introduced and applied to reduce the search space and improve the mining performance. The first is the Upper Bound property based on the summation of the Utility and the Path Utilities (UBUPU), the second is the Lower Bound property based on the Node Utility (LBNU), and the third is the Sorted-Reversing Downward Closure (SRDC) property based on the Kulc measure.

An extensive experimental evaluation was conducted on five datasets including sparse, very sparse, dense, and very dense datasets. The experimental results show that the proposed ECoHUPM algorithm is efficient as compared to the state-of-the-art CoHUIM and CoHUI-Miner algorithms in terms of both time and memory consumption.

Data Availability

The data used in the experiments of this paper are available at SPMF library [54]: https://www.philippe-fournier-viger.com/spmf/.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.