Abstract

Erasable itemset (EI) mining is an attractive field within frequent pattern mining, a widely used tool in decision support systems, and was proposed to analyze and resolve economic problems. Many approaches have been proposed recently, but the high complexity of the problem makes them time-consuming and demanding of system resources. Therefore, this study proposes an effective method for mining EIs on multicore processors (pMEI) that reduces execution time and thereby improves the user experience. The method also addresses some limitations of parallel computing approaches in communication, data transfer, and synchronization. A dynamic mechanism is used to resolve the load balancing issue among processors. We compared the execution time and memory usage of pMEI with those of an existing EI mining method (MEI) to demonstrate the effectiveness of the proposed algorithm. The experiments show that pMEI is faster than MEI while the memory usage of both methods is the same.

1. Introduction

Data mining is an interesting field that has attracted many experts because of the huge amounts of data collected every day and the need to transform such data into useful information for intelligent systems such as recommendation systems, decision-making systems, and expert systems. Data mining has been widely used in market basket analysis, manufacturing engineering, financial banking, bioinformatics, future healthcare, and so on. Frequent pattern (FP) mining has a vital position in many data mining fields including association rule mining [1], clustering [2], and text mining [3]. FP mining finds all patterns whose frequency satisfies a user-given threshold. Many methods [4, 5] for mining FPs have been proposed in recent years. In addition, several problems related to FP mining have been studied, such as maximal frequent patterns [6], top-k co-occurrence items with sequential pattern [7], weight-based patterns [8], periodic-frequent patterns [9], and their applications [10, 11].

Erasable itemset mining was first introduced in 2009 [12] as an interesting variation of pattern mining that arises from production planning. For example, a factory needs to produce several new products, each of which requires some amount of raw materials. However, the factory does not have enough budget to purchase all materials. Therefore, the factory managers need to determine which materials can be left out while the profit is not significantly affected. The main problem is how to efficiently find the materials without which the loss of profit is less than a given threshold. These sets of materials are called erasable itemsets. Based on these erasable itemsets, the consulting team can give the managers several suggestions about production plans in the near future. The problem has attracted a lot of research and become an active topic in recent years. Many approaches, summarized in [13], have been proposed to mine such patterns, including META [12], MERIT [14], MEI [15], and EIFDD [16]. Several related problems have also been studied: mining erasable closed itemsets [17], top-rank-k erasable itemsets [18], erasable itemsets with constraints [19], and weighted erasable itemsets [20, 21]. The erasable closed itemset [17], a condensed representation of EIs without information loss, was proposed to reduce the computational cost. Top-rank-k erasable itemset mining [18] merges the mining and ranking phases into one phase that returns only a small number of EIs for use in intelligent systems. Erasable itemset mining with constraints [19] is another approach that produces only a small number of EIs, namely those that satisfy a specific requirement. In addition, weighted erasable itemset mining [20, 21] is a framework for mining erasable itemsets with weight conditions for each item.

The existing algorithms for mining EIs have high computational complexity, which leads to very long execution times, especially on huge datasets, and makes the intelligent systems that rely on them inefficient. Therefore, this study proposes a parallel approach named parallel mining erasable itemsets (pMEI) that uses a multicore processor platform to reduce the execution time of mining EIs. The major contributions of this paper are as follows:
(i) A parallel method for mining EIs, namely pMEI, splits the work into several tasks to reduce the running cost.
(ii) The difference pidset (dPidset) structure is applied to quickly determine the information of EIs.
(iii) A dynamic mechanism is used to balance the workload among cores when some processes are free.

The experimental results show that the pMEI algorithm is faster than MEI on most of the experimental datasets.

The remainder of the paper is arranged as follows. Section 2 presents the preliminaries and related works, comprising the erasable itemset mining problem, several methods for mining EIs, and the multicore processor architecture. Section 3 presents the proposed pMEI algorithm. Section 4 reports the experiments on the runtime and memory usage of the pMEI and MEI methods for mining EIs. Finally, Section 5 summarizes the results and proposes some potential topics for future study.

2.1. Preliminaries

Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of $m$ distinct items, which are abstract descriptions of the components of products made in a factory. For example, the items may be sugar, milk, cheese, cafe, snack, and wine. A product dataset $DB$ consists of $n$ products, $\{P_1, P_2, \ldots, P_n\}$, where each $P_k$ is a product given as a pair (items, value). In this pair, items are the items that compose product $P_k$, and value is the profit value of product $P_k$. Table 1 shows the example dataset, with its items and its set of products.

Definition 1 (itemset). An itemset $X = \{i_1, i_2, \ldots, i_k\}$, such that $i_j \in I$ for $1 \le j \le k$, is called a $k$-itemset.

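Definition 2 (profit of itemset). Given an itemset $X \subseteq I$, the profit of itemset $X$ (denoted $\mathrm{pro}(X)$) is the sum of the values of all products in the dataset $DB$ that contain at least one item of $X$:

$$\mathrm{pro}(X) = \sum_{P_k \in DB,\ X \cap P_k.\mathrm{items} \neq \emptyset} P_k.\mathrm{value}$$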

For example, let the set of products that contains , , or from Table 1 which are . Therefore,  USD.

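An itemset $X$ in a product dataset $DB$ is said to be erasable iff its profit satisfies $\mathrm{pro}(X) \le T \times \delta$, where $\delta$ is the minimum threshold defined by the user and $T$ is the total value of $DB$, determined by the formula $T = \sum_{P_k \in DB} P_k.\mathrm{value}$.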

Definition 3 (EI mining problem). The EI mining problem is to discover all itemsets $X$ ($X \subseteq I$) whose profit $\mathrm{pro}(X)$ is not greater than $T \times \delta$.

For the example dataset in Table 1 and  = 20%, we have by summing the value of all products. The itemset is an erasable itemset due to .
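The definitions above can be illustrated with a minimal Python sketch. It assumes the usual reading of the profit of an itemset (the summed value of every product containing at least one of its items) and checks erasability against the total profit times the threshold. The dataset `PRODUCTS` and all function names are hypothetical and are not the paper's Table 1; the brute-force enumeration is exponential and is shown only to make the problem statement concrete.

```python
from itertools import combinations

# Hypothetical product dataset: (items, profit value) per product.
# These values are illustrative and are NOT the paper's Table 1.
PRODUCTS = [
    ({"a", "b"}, 100),
    ({"b", "c"}, 200),
    ({"c", "d"}, 300),
    ({"d", "e"}, 400),
]


def profit(itemset, products):
    """Sum the values of all products containing at least one item of itemset."""
    wanted = set(itemset)
    return sum(value for items, value in products if items & wanted)


def total_profit(products):
    """Total value of the dataset."""
    return sum(value for _, value in products)


def is_erasable(itemset, products, threshold):
    """An itemset is erasable iff its profit does not exceed total * threshold."""
    return profit(itemset, products) <= total_profit(products) * threshold


def erasable_itemsets_bruteforce(products, threshold):
    """Enumerate every non-empty itemset; exponential, for illustration only."""
    all_items = sorted(set().union(*(items for items, _ in products)))
    result = []
    for k in range(1, len(all_items) + 1):
        for combo in combinations(all_items, k):
            if is_erasable(combo, products, threshold):
                result.append(combo)
    return result


if __name__ == "__main__":
    print(erasable_itemsets_bruteforce(PRODUCTS, 0.35))
```

With this hypothetical data the total profit is 1000, so a threshold of 0.35 admits only itemsets whose profit is at most 350.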

2.2. Erasable Itemset Mining

The erasable itemset (EI) mining problem was first introduced in 2009 [12] and arises from production planning in manufacturing. It helps managers decide on a manufacturing strategy that safeguards the development of the company: they can determine which new products are suitable for the factory without affecting the company's profit. For example, consider a company that makes many types of goods, each of which yields a profit value. To create all products, the factory has to buy all the required materials. Suppose the company currently does not have enough budget to purchase all materials. The managers should then reconsider their manufacturing strategy to ensure the stability of the company. The challenge is to find the itemsets that can be excluded without significantly changing the company's profit.

2.3. Methods for Erasable Itemset Mining

Many methods have been proposed to mine EIs, such as META [12], MERIT [14], MEI [15], and EIFDD [16]. Firstly, META, an Apriori-based algorithm, generates candidate itemsets using a level-wise approach. Let $E_{k-1}$ be the set of erasable $(k-1)$-itemsets. Each itemset $X \in E_{k-1}$ is joined with other itemsets in $E_{k-1}$ to produce candidate erasable $k$-itemsets; however, only the itemsets in $E_{k-1}$ that have the same prefix as $X$ are joined. Later, MERIT, which is based on the NC_sets structure, reduced memory usage; this is its main improvement. The performance of MERIT is significantly better than that of META. However, storing the NC_sets structure still leads to high computational cost in both memory usage and runtime. Therefore, MEI uses a divide-and-conquer approach combined with the difference of pidsets (dPidset) for mining EIs to improve memory usage and runtime. It scans the dataset only once to determine the total profit ($T$), the index of gain, and the erasable 1-itemsets with their pidsets. Although its runtime and memory consumption are better than those of META and MERIT, MEI performs rather poorly on several dense datasets. To resolve this drawback of MEI on dense datasets, EIFDD was proposed, which uses the subsume concept. This concept helps to quickly determine the information of EIs without the generation cost. In brief, EIFDD is typically applied to mine EIs in dense datasets, while MEI is applied to the other kinds of datasets.

Although existing methods have improved the computation of mining EIs, they still take a long time on large datasets or with large thresholds. Hence, in this paper, we develop a new technique to reduce the computational cost of mining EIs.

2.4. dPidset Structure

The dPidset structure was proposed by Le and Vo [15] to reduce memory consumption by using an index of profit for effectively mining EIs. This structure is summarized as follows.

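Definition 4. Given an itemset $X$ and an item $A \in X$, the pidset of itemset $X$ is denoted as $\mathrm{pidset}(X) = \bigcup_{A \in X} \mathrm{pidset}(A)$, where $\mathrm{pidset}(A)$ is the pidset of item $A$, that is, the set of product identifiers (IDs) which contain item $A$.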

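Definition 5. Let $\mathrm{pro}(X)$ be the profit of itemset $X$, which is computed from its pidset as follows:

$$\mathrm{pro}(X) = \sum_{P_k \in \mathrm{pidset}(X)} P_k.\mathrm{value}$$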

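Theorem 1 [15]. Two given itemsets $XA$ and $XB$ have the same prefix, which is $X$. $\mathrm{pidset}(XA)$ and $\mathrm{pidset}(XB)$ are the pidsets of $XA$ and $XB$, respectively. $\mathrm{pidset}(XAB)$ is determined as follows:

$$\mathrm{pidset}(XAB) = \mathrm{pidset}(XA) \cup \mathrm{pidset}(XB)$$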

Example 1. Consider the illustration datasets in Table 1, and . We have .

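Definition 6. The dPidset of itemset $XAB$, denoted by $\mathrm{dPidset}(XAB)$, is defined as follows:

$$\mathrm{dPidset}(XAB) = \mathrm{pidset}(XB) \setminus \mathrm{pidset}(XA)$$

where $\mathrm{pidset}(XB) \setminus \mathrm{pidset}(XA)$ is the list of product IDs that exist only in $\mathrm{pidset}(XB)$.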

Example 2. We have and , so .

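Theorem 2 [15]. Two given itemsets $XA$ and $XB$ have the dPidsets $\mathrm{dPidset}(XA)$ and $\mathrm{dPidset}(XB)$. $\mathrm{dPidset}(XAB)$ is determined as follows:

$$\mathrm{dPidset}(XAB) = \mathrm{dPidset}(XB) \setminus \mathrm{dPidset}(XA)$$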

Example 3. We have , , and . As in Definition 6, and . As in Theorem 2, .

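Theorem 3 [15]. The profit of $XAB$, denoted by $\mathrm{pro}(XAB)$, is calculated from $\mathrm{dPidset}(XAB)$ as follows:

$$\mathrm{pro}(XAB) = \mathrm{pro}(XA) + \sum_{P_k \in \mathrm{dPidset}(XAB)} P_k.\mathrm{value}$$

where $\mathrm{pro}(XA)$ is the profit of $XA$ and $P_k.\mathrm{value}$ is the profit of product $P_k$.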

Example 4. We have and , and thus, , , and . According to Example 3, and , and thus, and . , so .
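The dPidset bookkeeping can be traced with a short Python sketch. It assumes the set-based reading used by MEI-style algorithms: the pidset of an item is the set of product IDs containing it, the dPidset of a combined itemset keeps only the product IDs contributed by the newly added part, and the profit of the combined itemset is the profit of its left parent plus the value of the products in that dPidset. The dataset and every name below are hypothetical and are not taken from Table 1.

```python
# Hypothetical data: product ID -> (items, value); NOT the paper's Table 1.
PRODUCTS = {
    1: ({"a", "b"}, 100),
    2: ({"b", "c"}, 200),
    3: ({"c", "d"}, 300),
    4: ({"a", "d"}, 400),
}


def pidset(item):
    """Set of product IDs whose item list contains the given item."""
    return {pid for pid, (items, _) in PRODUCTS.items() if item in items}


def value_of(pids):
    """Summed profit value of the given product IDs."""
    return sum(PRODUCTS[pid][1] for pid in pids)


# Level 1: plain pidsets of single items.
p_a, p_b, p_c = pidset("a"), pidset("b"), pidset("c")

# Level 2: the dPidset of {a, x} holds the products that contain x but not a.
dp_ab = p_b - p_a
dp_ac = p_c - p_a
pro_a = value_of(p_a)
pro_ab = pro_a + value_of(dp_ab)
pro_ac = pro_a + value_of(dp_ac)

# Level 3: combine two dPidsets that share the prefix {a}.
dp_abc = dp_ac - dp_ab
pro_abc = pro_ab + value_of(dp_abc)

# Cross-check against the direct definition via the union of pidsets.
direct_abc = value_of(p_a | p_b | p_c)

if __name__ == "__main__":
    print(dp_ab, dp_ac, dp_abc)
    print(pro_ab, pro_ac, pro_abc, direct_abc)
```

The point of the structure is that each level only touches the small difference set rather than the full pidset, while the incremental profit still agrees with the value computed directly from the union.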

2.5. Multicore Processor Platform

A multicore processor (MCP) is a physical chip that includes several separate cores in the same circuit [22]. MCPs can execute multiple tasks concurrently to increase the performance of applications. In a multicore processor, each core has its own L1 cache and execution unit and shares an L2 cache with the rest of the processor. This makes good use of the resources and optimizes the communication between cores. If several tasks run on separate cores of the same chip and share data that fits in the cache, then the shared last-level cache between cores reduces data replication, which makes communication more efficient. The major advantages of multicore processors are that they reduce the heat generated by the CPU and substantially increase processing speed at a lower cost than distributed systems, so they are widely applied in many fields such as computer vision, social networks, image processing, and embedded systems.

In data mining, many studies use this architecture to enhance performance. Nguyen et al. [23] implemented a technique for parallel mining of class association rules on a multicore processor platform, which uses the SCR-tree (single constraint rule-tree) structure and a task parallel mechanism on the .NET Framework. Huynh et al. [24] utilized multicore processors for mining frequent sequential patterns and frequent closed sequential patterns, using the DBV-tree (dynamic bit vector-tree) structure and a data parallel strategy based on the Task Parallel Library (TPL). Laurent et al. [25] proposed PGP-mc, a parallel gradual pattern mining method that uses the POSIX thread library. Flouri et al. [26] implemented GapMis-OMP, a tool for pairwise short read alignment that uses the OpenMP application programming interface, and Sánchez et al. [27] proposed SW (Smith-Waterman), a method comparing sequence lengths based on the CellBE hardware. Recently, Kan et al. [28] proposed the parallelization and acceleration of the shuffled complex evolution utilizing the multicore CPU and many-core GPU. In 2018, Le et al. [29] suggested a parallel approach for mining intersequence patterns with constraints, which uses the DBV-PatternList structure and a task parallel approach on a multicore architecture.

3. A Parallel Method for Erasable Itemset Mining

Our proposed approach (Algorithm 1) first determines the total profit of the dataset and the erasable 1-itemsets with their pidsets (line 2). Then, pMEI sorts the erasable 1-itemsets by the size of their pidsets in descending order (line 3) and adds them as child nodes of the root node (line 4). Finally, for each child node of the root, pMEI starts a new task (line 6) and calls the pMEI_Ext procedure to process this node within the created task, in parallel with the other tasks. A task is a lightweight object in the .NET Framework that represents a unit of work to be executed in parallel. The work is spread across multiple processors to maximize the performance of the computer, and tasks are designed to take advantage of multicore processors.

Input: product database DB and a minimum given threshold δ
Output: EIs, the list of EIs
1  root = NULL
2  Scan DB to calculate the whole profit of DB (T), the index of profit, and
     erasable 1-itemsets with their pidsets (E1)
3  Sort E1 by the size of their pidsets in descending order
4  Add each Ec in E1 to root, where Ec is a child node
5  For each (Ec in root) do
6   Start new task t
7   pMEI_Ext (Ec, t)
8  End for

The pMEI_Ext procedure combines each node with the nodes that follow it at the same level to create the next level of candidates. This strategy is repeated until no new EI can be created. Each pMEI_Ext call runs inside its task, in parallel with the other tasks, to achieve good performance.

Input: node Ev, the task t
for i ← 1 to |Ev| do
  Enext ← ∅
  for j ← i + 1 to |Ev| do
    E.Items = Ev[i].Items ∪ Ev[j].Items
    (E.pidset, pro) ← Ev[j].pidset \ Ev[i].pidset
    E.pro = Ev[i].pro + pro
    if E.pro ≤ T × δ then
      Enext ← Enext ∪ {E}
      EIs ← EIs ∪ {E}
  if |Enext| > 1 then
    pMEI_Ext (Enext, t)
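As a rough illustration, the extension step can be written as a sequential Python sketch. This is not the authors' C# implementation: the `Node` class, the function names, and the small dataset are hypothetical. It also makes two assumptions explicit. First, the difference set is taken as the later node's set minus the earlier node's set, so that the added profit covers only products not yet counted in the earlier node's profit. Second, the threshold test uses "not greater than", following Definition 3.

```python
from dataclasses import dataclass


@dataclass
class Node:
    items: tuple        # items of the itemset, in extension order
    pidset: frozenset   # pidset at level 1, dPidset at deeper levels
    pro: int            # profit of the itemset


# Hypothetical data: product ID -> (items, value); NOT the paper's Table 1.
PRODUCTS = {
    1: ({"a", "b"}, 100),
    2: ({"b", "c"}, 200),
    3: ({"c", "d"}, 300),
    4: ({"a", "d"}, 400),
}


def extend(nodes, value, limit, results):
    """Depth-first extension of a list of sibling nodes sharing a prefix."""
    for i in range(len(nodes)):
        next_level = []
        for j in range(i + 1, len(nodes)):
            # Products brought in by nodes[j] that nodes[i] does not cover.
            diff = nodes[j].pidset - nodes[i].pidset
            pro = nodes[i].pro + sum(value[pid] for pid in diff)
            if pro <= limit:
                child = Node(nodes[i].items + nodes[j].items[-1:], diff, pro)
                next_level.append(child)
                results.append(child)
        if len(next_level) > 1:
            extend(next_level, value, limit, results)


def mine_erasable(products, threshold):
    """Return all erasable itemsets as Node objects (sequential sketch)."""
    value = {pid: v for pid, (_, v) in products.items()}
    limit = sum(value.values()) * threshold
    pidsets = {}
    for pid, (items, _) in products.items():
        for item in items:
            pidsets.setdefault(item, set()).add(pid)
    level1 = []
    for item, pids in pidsets.items():
        pro = sum(value[pid] for pid in pids)
        if pro <= limit:
            level1.append(Node((item,), frozenset(pids), pro))
    # Sort by pidset size in descending order; ties broken by item name.
    level1.sort(key=lambda n: (-len(n.pidset), n.items))
    results = list(level1)
    extend(level1, value, limit, results)
    return results


if __name__ == "__main__":
    for node in mine_erasable(PRODUCTS, 0.75):
        print(node.items, node.pro)
```

In a parallel variant, each top-level iteration of the outer loop would become an independent job, since the subtrees rooted at different first-level nodes do not share intermediate data.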

A notable feature of the pMEI algorithm is its dynamic load balancing mechanism. pMEI uses a queue to store jobs (a list of work items to perform). Whenever the queue is not empty and some task is idle, that task is assigned a job to execute, and the job is removed from the queue. A task that is busy runs its current job to completion; as soon as it finishes, it is immediately assigned a new job from the queue. The process continues until the queue is empty. This mechanism helps to avoid idle tasks and to keep the workloads balanced.
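The queue-based mechanism described above follows a general work-queue pattern, which the following Python sketch illustrates. It is a generic illustration, not the authors' .NET implementation: `run_dynamic`, `uneven_job`, and the job sizes are hypothetical. Note also that CPython threads do not run CPU-bound code in parallel because of the global interpreter lock, so the sketch demonstrates the scheduling behavior only, not the speedup.

```python
import queue
import threading


def run_dynamic(jobs, worker_count, handler):
    """Run handler(job) for every job, letting idle workers pull the next job."""
    job_queue = queue.Queue()
    for job in jobs:
        job_queue.put(job)

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                # An idle worker takes the next job; the job leaves the queue.
                job = job_queue.get_nowait()
            except queue.Empty:
                return  # queue is empty: nothing left to do
            outcome = handler(job)  # run the job to completion
            with results_lock:
                results.append((job, outcome))

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def uneven_job(size):
    """Hypothetical job whose cost grows with its size."""
    return sum(range(size))


if __name__ == "__main__":
    # One large job and several small ones: the worker that picks the large
    # job stays busy while the other workers drain the rest of the queue.
    for job, outcome in sorted(run_dynamic([200000, 10, 20, 30, 40], 3, uneven_job)):
        print(job, outcome)
```

Because workers pull jobs on demand rather than receiving a fixed share up front, a worker that finishes a small job early simply takes another one, which is what keeps the load balanced when job costs are uneven.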

In addition, one of the differences between pMEI and the parallel methods in [24, 29] is the sorting strategy used to balance the search space. When mining frequent patterns, patterns can be sorted according to their supports. When mining EIs, the support of itemsets is not computed, so pMEI sorts itemsets according to the size of their pidsets to balance the search space.

The illustration of the mining process of MEI and pMEI is shown in Figure 1.

pMEI executes a depth-first search and writes the results (EIs) to global memory; thus, it does not need a separate merge or synthesis step. In addition, pMEI uses a global queue for parent tasks and a local queue for child tasks; both are accessed in LIFO order. Synchronization between tasks is not necessary because a task that uses a local queue does not involve any shared data.

4. Experimental Results

This section presents the results of the experiments, which were executed on a PC with an Intel Core i5-6200U processor (2.30 GHz, 4 threads) and 4 GB of RAM. The algorithms were implemented in C# using Visual Studio 2015.

The experiments were performed on the Accidents, Chess, Connect, Mushroom, Pumsb, and T10I4D100K datasets, which were downloaded from the Frequent Itemset Mining Dataset Repository (http://fimi.ua.ac.be/data/). Because these datasets do not contain product values, we added a new column to hold the value associated with each product. The values were drawn from a normal distribution. The characteristics of these datasets are shown in Table 2, and the datasets are accessible at http://sdrv.ms/14eshVm.
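The value column can be generated along the following lines. This is only a sketch: the mean, standard deviation, seed, and the lower bound applied to the sampled values are hypothetical placeholders, not the settings used in the experiments, and `attach_values` is an illustrative name.

```python
import random


def attach_values(transactions, mean=100.0, std_dev=50.0, seed=0, minimum=1):
    """Pair each transaction with a value sampled from a normal distribution.

    mean, std_dev, seed, and minimum are hypothetical placeholders.
    Sampled values are rounded and clamped so every product value is positive.
    """
    rng = random.Random(seed)
    products = []
    for items in transactions:
        value = max(minimum, round(rng.gauss(mean, std_dev)))
        products.append((set(items), value))
    return products


if __name__ == "__main__":
    sample = [["1", "3", "9"], ["2", "3"], ["1", "4", "7", "9"]]
    for items, value in attach_values(sample):
        print(sorted(items), value)
```

Fixing the seed makes the generated dataset reproducible, which matters when runtimes are compared across algorithms on the same data.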

In this section, we compare the memory usage and running time of the proposed algorithm with those of the MEI algorithm [15] to show the effectiveness of pMEI.

4.1. Running Time

We compare the execution times of the MEI and pMEI algorithms on the six experimental datasets (Figures 2–7). Note that the running times are averaged across five runs.

For dense datasets, such as Chess, Connect, and Mushroom, the execution time of pMEI is much better than that of MEI (Figures 3–5). In detail, for the Chess dataset, pMEI requires only 0.203 s at one threshold while MEI requires 0.31 s, and at another threshold pMEI requires only 0.389 s while MEI requires 0.654 s. This gap widens as the threshold increases. Similarly, for the Connect dataset, the gap between the execution times of pMEI and MEI is 3.49 s at one threshold and 7.07 s at another.

For sparse datasets, such as Accidents, Pumsb, and T10I4D100K, the time gaps between pMEI and MEI are small (Figures 2, 6, and 7). Overall, pMEI outperforms MEI in terms of execution time on all experimental datasets, especially on dense datasets.

4.2. Memory Usage

For all experimental datasets, pMEI and MEI have the same memory usage (see Figures 8–13). In summary, pMEI improves the execution time of mining EIs on all experimental datasets while keeping the same memory usage as the MEI algorithm.

To evaluate the effectiveness of multicore systems, we executed the proposed method with various numbers of cores (Figure 14). The speedup is nearly two times on 2 cores and nearly four times on 4 cores. The average speedup is 1.94, 1.95, 1.95, 1.98, 1.99, and 2.06 on 2 cores and 3.82, 3.87, 3.82, 3.80, 3.82, and 3.93 on 4 cores for the Accidents, Chess, Connect, Mushroom, Pumsb, and T10I4D100K datasets, respectively. In these experiments, the speedup is therefore roughly proportional to the number of cores.
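One way to read these figures is through parallel efficiency, that is, speedup divided by the number of cores, where a value of 1.0 corresponds to ideal linear scaling. The short Python sketch below computes it from the average speedups reported above; the dictionary layout and function name are illustrative choices.

```python
# Average speedups reported in the text, keyed by dataset and core count.
SPEEDUP = {
    "Accidents": {2: 1.94, 4: 3.82},
    "Chess": {2: 1.95, 4: 3.87},
    "Connect": {2: 1.95, 4: 3.82},
    "Mushroom": {2: 1.98, 4: 3.80},
    "Pumsb": {2: 1.99, 4: 3.82},
    "T10I4D100K": {2: 2.06, 4: 3.93},
}


def efficiency(speedup, cores):
    """Parallel efficiency: the fraction of ideal linear speedup achieved."""
    return speedup / cores


if __name__ == "__main__":
    for dataset, by_cores in SPEEDUP.items():
        cells = ", ".join(
            f"{cores} cores: {efficiency(s, cores):.3f}"
            for cores, s in sorted(by_cores.items())
        )
        print(f"{dataset}: {cells}")
```

Efficiencies close to 1.0 on both 2 and 4 cores indicate that little time is lost to coordination between tasks on these datasets.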

5. Conclusions

This study proposed an efficient technique for mining EIs, namely pMEI, which runs on multicore computers to enhance performance. The method overcomes several drawbacks of parallel computing approaches, including communication cost, synchronization, and data duplication. A dynamic mechanism for balancing the workload among processors was also used. Experiments show that pMEI is better than MEI in execution time for mining EIs.

In the future, we will study the mining of top-rank-k erasable closed itemsets and maximal erasable itemsets. In addition, we will extend the study of EI mining to several kinds of item constraints. Approaches for mining such patterns on distributed computing systems will also be developed.

Data Availability

The product data used to support the findings of this study have been deposited in the Frequent Itemset Mining Dataset Repository (http://fimi.ua.ac.be/data/).

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors wish to thank Tuong Le for his valuable comments and suggestions.