Abstract

Multilevel association rules mining is an important technique for discovering interesting relations between data elements at multiple levels of abstraction. Most existing algorithms for this problem are based on exhaustive search methods such as Apriori and FP-growth. However, when applied to big data applications, these methods incur extreme computational cost in searching for association rules. To expedite multilevel association rules searching and avoid excessive computation, in this paper we propose a novel genetic-based method with three key innovations. First, we use a category tree to describe the multilevel application datasets as domain knowledge. Then, we put forward a special tree encoding schema based on the category tree to build a heuristic multilevel association mining algorithm. Finally, we propose a genetic algorithm based on the tree encoding schema that greatly reduces the association rule search space. The method is especially useful for mining multilevel association rules in big data applications. We test the proposed method on several big datasets, and the experimental results demonstrate its effectiveness and efficiency in processing big data. Moreover, our results show that the algorithm converges quickly with a limited termination threshold.

1. Introduction

Exploring knowledge in big data is an appealing topic in state-of-the-art data mining research [1]. Because of its high volume and complexity, big data contains rich domain knowledge and hidden patterns that are potentially useful for human decision support [2]. This is especially the case for multilevel association rules mining, which discovers interesting relations among data elements at multiple levels of abstraction. Successful applications include spatial data analysis [3], emergency event analysis [4], sensor network data mining [5], and gene ontology mining. However, most existing multilevel association rules mining algorithms rely on exhaustive scans of the database to find frequent patterns across different abstraction levels, such as the renowned Apriori algorithm [6] and the Frequent Pattern tree algorithm (FP-tree) [7]. When the dataset scales up, those algorithms suffer from excessive computation cost, and the system slows down because of heavy scans of the large database. When the algorithms are used in big data applications, the bottleneck becomes more prominent. For example, in the gene ontology domain, the annotations had rapidly grown to more than 80 million by 2012. When the complicated relationships between gene items at various hierarchical levels are considered, the complexity of mining multilevel association rules has been classified as NEXP-complete [8]. Therefore, a fast multilevel association rules mining algorithm for big datasets that is scalable and can run in a parallel computation environment becomes imperative.

In this paper, we make an initial effort toward this issue by building a genetic algorithm (GA) based heuristic method for effective multilevel association rules mining in big datasets. By taking advantage of the genetic algorithm, which can efficiently find multiple solutions concurrently in a large multidimensional problem without performing exhaustive searches, our proposed method improves mining performance while keeping a desired accuracy and avoiding exhaustive enumeration of association rule candidates. In summary, there are three major contributions.

First, to make our GA-based approach possible, we design a new tree-like encoding schema to model the genetic candidates of multilevel association rules so that a feasible implementation of genetic operators can be built. This representation model is based on the application’s domain knowledge, where the attributes (items) to be mined can be briefly illustrated as a catalog tree. Each valid multilevel association rule can then be modeled as a subtree of the catalog tree, in which each leaf node is assigned a binary number to indicate whether it is an antecedent or a consequent in the rule.

Next, based on the encoding schema, we build dedicated GA operators to make multilevel association rules mining possible: an individual initialization function that builds subtrees representing valid multilevel association rules; a crossover function that allows two subtrees to cross over at one of their common nodes to produce new generations; and a selection function, based on our designed fitness function, that selects stronger association rules.

As the third contribution, based on our analysis of the fitness function design, we have found that our GA-based method is adaptive and robust, as its termination threshold can be reached quickly within a fixed time budget. We have run experiments on different big datasets, and the results demonstrate the performance of our design as well as its fast convergence with a limited termination threshold.

The remainder of the paper is organized as follows: Section 2 reviews the related work; the problem of multilevel association rules mining is formally described in Section 3; in Section 4, the genetic algorithm based multilevel association rules mining is presented; in Section 5, the performance of the proposed method is evaluated on several big datasets; the conclusions are drawn in Section 6.

2. State of the Art

Many researchers have focused on multilevel association rules mining. The first branch consists of Apriori [6] based methods. To mine multilevel association rules, these methods either add all the ancestors of frequent items to the corresponding transaction database, for example, Cumulate [9], or exhaustively find all frequent items at every concept level, for example, the ML-T2L1 algorithm [10] and the Level-Crossing algorithm [11]. Another branch consists of FP-growth [7] based methods, such as those proposed in [12, 13]. Specifically, Cao et al. [12] and Tang et al. [13] expanded the FP-tree with the ancestors of items. Wan built an approach by grouping and merging the single-level association rules generated by FP-growth. Compared with Apriori based methods, the FP-tree based methods inherit the merit of the FP-tree algorithm, which needs fewer scans of the dataset to find the multilevel frequent items. However, when they are used to analyze big data, the computational and memory costs increase exponentially, which leads to a prominent bottleneck in big data analysis.

In addition, there are some other approaches to improve the efficiency of multilevel association rules mining. Vejdani et al. proposed a method that extracts multilevel membership functions with the Ant Colony System algorithm without specifying the actual minimum support [14]. To enhance computing efficiency, Mahmoudi et al. optimized Vejdani’s method by fixing the functions for each item and then computing minimum supports [15]. Wang et al. took advantage of OLAP and data mining technology in multilevel association rules mining, which brought efficiency and flexibility [16]. Besides, mining association rules with genetic algorithm (GA) [17] based methods has also been explored. The GA-based methods are able to quickly scan an association rule candidate set that contains a large number of candidates. According to previous research [18], the GA-based methods can discover high-level prediction rules. This is because the GA-based methods perform a global search on association rules and can handle data with attribute interactions better than greedy rule induction algorithms. Previous research has thoroughly explored single-level association rules mining with GA, such as mining single-objective rules [19] and mining multiobjective rules [20]. However, in the big data analysis context, strong association rules are usually in multilevel forms, and mining multilevel association rules in big data needs more efficient methods. The GA-based multilevel association rules mining method proposed in this paper is one attempt to efficiently find multilevel association rules in big data.

3. Problem Description

The multilevel association rules mining problem can be described as follows. Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of items, and let $G$ be a catalog tree that briefly defines the multilevel categorizing relationships between items as the domain knowledge. An item $i_p$ is a parent of $i_c$, and $i_c$ is a child of $i_p$, if there is an edge in $G$ from $i_p$ to $i_c$. We denote $\hat{i}$ as an ancestor of $i$ and $i$ as a descendant of $\hat{i}$ if there is a path from $\hat{i}$ to $i$ in $G$. Only leaf nodes are present in the database. An illustration of a catalog tree in a supermarket domain is shown in Figure 1.

$D$ is a database of transactions, where each transaction $T$ in $D$ is a set of items such that $T \subseteq I$. Each transaction is associated with an identifier $TID$. Items in $T$ are expected to be leaves in $G$. Note that a transaction $T$ supports an item $x \in I$ if $x$ is in $T$ or $x$ is an ancestor of some item in $T$. In addition, a transaction $T$ supports an item set $X \subseteq I$ if $T$ supports every item in $X$.

A multilevel association rule is an implication of the form $X \Rightarrow Y$, where $X \subset I$, $Y \subset I$, and $X \cap Y = \emptyset$. No item in $Y$ is an ancestor of any item in $X$; that is, $Y \cap \mathrm{ancestors}(X) = \emptyset$. This is because a rule of the form “$x \Rightarrow \mathrm{ancestor}(x)$” is trivially true with 100% confidence, which is redundant. Both $X$ and $Y$ can contain items from any level of $G$.

The rule $X \Rightarrow Y$ holds in transaction set $D$ with support $s$, where $s$ is the percentage of transactions in $D$ that support $X \cup Y$; $s$ indicates the probability $P(X \cup Y)$. The rule has confidence $c$ in transaction set $D$, where $c$ is the percentage of the transactions supporting $X$ in $D$ that also support $Y$. This can be represented as the conditional probability $P(Y \mid X)$. Then,

$$\mathrm{support}(X \Rightarrow Y) = P(X \cup Y), \qquad \mathrm{confidence}(X \Rightarrow Y) = P(Y \mid X) = \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)}.$$

Example. Let $G$ be the catalog tree shown in Figure 1. The Minsupp and Minconf are shown in Table 1, and two of the rules on item sets are shown in Table 2. Note that the rule “Computer ⇒ Office” satisfies the minimum support (5%) and the minimum confidence (50%), but the rule “HP printer ⇒ Canon Camera” does not satisfy the minimum support (1%). Therefore, the rule “Computer ⇒ Office” is considered a valid multilevel association rule.
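The support and confidence definitions above can be illustrated with a minimal Python sketch. The toy catalog, the four transactions, and the helper names (`ancestors`, `extend`, `support`, `confidence`) are assumptions made only for illustration; they are not the data of Figure 1 or Tables 1 and 2.

```python
from typing import Dict, FrozenSet, Iterable, List, Set

# Hypothetical toy catalog: child -> parent (top-level categories have no entry).
PARENT: Dict[str, str] = {
    "Laptop": "Computer",
    "Desktop": "Computer",
    "Office": "Software",
    "Antivirus": "Software",
}


def ancestors(item: str) -> Set[str]:
    """Return every proper ancestor of `item` in the catalog."""
    found: Set[str] = set()
    while item in PARENT:
        item = PARENT[item]
        found.add(item)
    return found


def extend(transaction: Iterable[str]) -> FrozenSet[str]:
    """Add all ancestors, so that a transaction supports an item
    exactly when the item belongs to the extended set."""
    items = set(transaction)
    for leaf in list(items):
        items |= ancestors(leaf)
    return frozenset(items)


def support(itemset: Iterable[str], database: List[Set[str]]) -> float:
    """Fraction of transactions that support every item in `itemset`."""
    wanted = set(itemset)
    hits = sum(1 for t in database if wanted <= extend(t))
    return hits / len(database)


def confidence(x: Iterable[str], y: Iterable[str], database: List[Set[str]]) -> float:
    """support(X u Y) / support(X), or 0.0 when X is never supported."""
    sx = support(x, database)
    return support(set(x) | set(y), database) / sx if sx else 0.0


# Hypothetical transactions; only leaf items appear, as in the problem statement.
DB = [
    {"Laptop", "Office"},
    {"Laptop", "Antivirus"},
    {"Desktop", "Office"},
    {"Desktop"},
]

print(support({"Computer", "Office"}, DB))       # 0.5
print(confidence({"Computer"}, {"Office"}, DB))  # 0.5
```

In this toy database, the cross-level rule “Computer ⇒ Office” is supported by two of the four transactions, and every transaction supports “Computer”, so both its support and its confidence are 0.5.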

To evaluate the rules discovered from the multilevel abstraction, we prefer the following.

(i) Support and Confidence. Rules with larger support and higher confidence are preferred, where larger support reflects that the rules are more general, and higher confidence reflects the certainty of the discovered rules in the domain statistics.

(ii) Interest. A rule at a proper level of the catalog tree is preferred. Mining association rules at low levels may lead to uninteresting rules that are too trivial, for example, “IBM_ThinkPaD_R40/P4M ⇒ Symantec_Norton_Anti-virus_2003.” However, mining association rules at high levels usually leads to common sense, for example, “Computer ⇒ Software.”

Mining multilevel associations involves items at different levels of abstraction, and its exhaustive computation complexity has been classified as NEXP-complete. A dataset that contains $n$ items at the primitive level can potentially generate up to $2^n - 1$ primitive and nonempty frequent item sets. Particularly in the big data context, as the number of items in the catalog and the number of transactions increase rapidly, the computational and memory consumption of the traditional methods grows exponentially. It is worth noting that the FP-tree algorithm enhances the efficiency of mining association rules, but it can hardly mine multilevel association rules, especially cross-level association rules. Thus, in the big data analysis context, a novel heuristic method is imperative to mine multilevel association rules.

4. GA-Based Approach

The genetic algorithm is a heuristic search approach that mimics the process of natural evolution and generates solutions to optimization problems using techniques inspired by natural evolution, such as inheritance, mutation, selection, and crossover. In essence, it is an efficient, parallel, and global search method that can automatically obtain and accumulate knowledge about the search space and control the search in order to approach the optimal solution adaptively. In the traditional multilevel association rules mining algorithms, we have to generate almost all candidate items and test them against the entire database. However, most of this mining process is in vain and leads to heavy computational cost. The genetic algorithm offers a novel way to solve these problems. By testing the most likely candidate items first, the GA-based method can control the search space and approach the optimal solution adaptively during the association rules search. Therefore, the association rules search space is greatly reduced, and the performance of the mining method can be dramatically improved.

4.1. Encoding Scheme

When GA is applied to mine multilevel association rules, a key step is to encode and automatically generate candidates of the association rules in a GA-based form. Because the classic GA encoding schema is not feasible for mining multilevel association rules, we propose a new category tree based encoding scheme to represent the association rule candidates. Each valid multilevel association rule can be modeled as a subtree of the catalog tree, in which each leaf node is assigned a binary number to indicate whether it is an antecedent or a consequent in the rule. The goal of the algorithm is to find valid candidates through the evolution of subtrees. The structure of the GA-based encoding subtree can be defined as follows:

In this representation, every leaf represents a commodity and is assigned a value of 0, 1, or −1, as shown in Figure 2. The antecedents of the association rule are expressed by the commodities assigned 0, and the consequents are expressed by the ones assigned 1. The commodities assigned −1 do not take part in the association rule. By randomly pruning the catalog tree and assigning the values of the leaves, we can initialize the subtrees that form the first generation of association rules. For example, the items in the red rectangle of Figure 2(a) are regarded as the children of the association rule tree. The multilevel association rules can then be generated and polished by the roll-up, mutation, crossover, and selection operators.

Example. As shown in Figure 2(b), the leaves assigned 0 represent the antecedents of the association rule and the leaves assigned 1 represent the consequents, so the association rule can be represented as (Sony, MS OS, Logitech) ⇒ (MS Office). The rule implies that people who buy Sony laptops, Logitech mice, and MS OS are likely to buy MS Office software.
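As a minimal sketch of this encoding, an individual can be stored as a mapping from the leaves of its subtree to their labels and then decoded into a rule. The dictionary representation, the extra `Printer` leaf, and the function names `decode` and `is_valid` are assumptions for illustration; the tree structure itself is left implicit here.

```python
from typing import Dict, List, Tuple

ANTECEDENT, CONSEQUENT, UNUSED = 0, 1, -1

# Hypothetical individual: each leaf of the pruned subtree -> its label.
individual: Dict[str, int] = {
    "Sony": ANTECEDENT,
    "Logitech": ANTECEDENT,
    "MS OS": ANTECEDENT,
    "MS Office": CONSEQUENT,
    "Printer": UNUSED,  # labeled -1, so it does not take part in the rule
}


def decode(leaves: Dict[str, int]) -> Tuple[List[str], List[str]]:
    """Split the labeled leaves into (antecedent items, consequent items)."""
    antecedent = sorted(n for n, v in leaves.items() if v == ANTECEDENT)
    consequent = sorted(n for n, v in leaves.items() if v == CONSEQUENT)
    return antecedent, consequent


def is_valid(leaves: Dict[str, int]) -> bool:
    """A rule needs at least one antecedent and at least one consequent."""
    antecedent, consequent = decode(leaves)
    return bool(antecedent) and bool(consequent)


print(decode(individual))    # (['Logitech', 'MS OS', 'Sony'], ['MS Office'])
print(is_valid(individual))  # True
```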

4.2. Genetic Operators

Initially, according to the given catalog tree, we randomly prune the catalog tree to obtain subtrees that serve as association rule trees. Then we randomly assign −1, 0, or 1 to the leaves of each association rule tree and make sure that each tree contains at least one leaf labeled 0 and at least one leaf labeled 1. In the same way, we generate the required number of individuals for the initial population.
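The initialization step might look like the following sketch. It assumes one particular pruning policy (stop descending at a nonroot node with probability `stop_prob`) and a toy catalog; both, along with the function names, are hypothetical choices rather than the procedure used in the experiments.

```python
import random
from typing import Dict, List

# Hypothetical toy catalog: parent -> children.
CHILDREN: Dict[str, List[str]] = {
    "Root": ["Computer", "Software"],
    "Computer": ["Laptop", "Desktop"],
    "Software": ["Office", "Antivirus"],
}


def random_prune(node: str, rng: random.Random,
                 stop_prob: float = 0.4, is_root: bool = True) -> List[str]:
    """Return the leaves of a randomly pruned subtree rooted at `node`.
    A nonroot node is cut off (and becomes a leaf) with probability `stop_prob`."""
    kids = CHILDREN.get(node, [])
    if not kids or (not is_root and rng.random() < stop_prob):
        return [node]
    leaves: List[str] = []
    for kid in kids:
        leaves.extend(random_prune(kid, rng, stop_prob, is_root=False))
    return leaves


def random_individual(rng: random.Random, root: str = "Root") -> Dict[str, int]:
    """Prune the catalog, then label leaves until both 0 and 1 are present."""
    leaves = random_prune(root, rng)
    while True:
        labels = {leaf: rng.choice((-1, 0, 1)) for leaf in leaves}
        values = set(labels.values())
        if 0 in values and 1 in values:
            return labels


rng = random.Random(0)
population = [random_individual(rng) for _ in range(5)]
for individual in population:
    print(individual)
```

Re-drawing the labels until both 0 and 1 appear is a simple rejection step; it terminates here because every pruned subtree of this catalog has at least two leaves.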

The selection operator defines how to choose the individuals that will create the offspring for the next generation. It is based on the fitness function, so that individuals with high fitness have higher probabilities of being selected. In this paper, we use “roulette wheel” [21] selection: the higher the fitness of an individual is, the more likely it is to be selected to reproduce.
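Roulette wheel selection can be sketched as follows, assuming nonnegative fitness values. The function name `roulette_select` and the fallback to uniform sampling when all fitness values are zero are assumptions for illustration.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def roulette_select(population: Sequence[T], fitnesses: Sequence[float],
                    k: int, rng: random.Random) -> List[T]:
    """Draw k individuals with probability proportional to fitness."""
    total = sum(fitnesses)
    if total <= 0:
        # Degenerate wheel: fall back to uniform sampling (an assumption).
        return [rng.choice(population) for _ in range(k)]
    chosen: List[T] = []
    for _ in range(k):
        spin = rng.uniform(0, total)  # where the wheel stops
        running = 0.0
        for individual, fit in zip(population, fitnesses):
            running += fit
            if spin < running:
                chosen.append(individual)
                break
        else:
            chosen.append(population[-1])  # guard against rounding at the edge
    return chosen


rng = random.Random(1)
picks = roulette_select(["rule_a", "rule_b"], [1.0, 9.0], 1000, rng)
print(picks.count("rule_a"), picks.count("rule_b"))
```

With fitness values 1.0 and 9.0, `rule_b` occupies nine tenths of the wheel and is therefore drawn far more often than `rule_a`.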

After the high-fitness individuals are selected, the crossover operator can be applied. This operator allows a pair of selected subtrees to cross over at one of their randomly chosen common nodes to produce new individuals; restricting crossover to common nodes avoids generating invalid rules. In particular, only attribute values are exchanged if the crossover point is a leaf node in both subtrees. Crossover at the root node is prohibited because it would only swap the two whole trees and produce no new rules. An example of the crossover process in a real domain is illustrated in Figure 3.
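A possible realization of this crossover is sketched below. It assumes each individual is stored as a mapping from subtree leaves to labels, so that swapping the branch below a common node amounts to exchanging the leaves located at or below that node. The toy catalog and the helper names are hypothetical.

```python
import random
from typing import Dict, List, Set, Tuple

# Hypothetical toy catalog: child -> parent ("Root" has no entry).
PARENT: Dict[str, str] = {
    "Computer": "Root", "Software": "Root",
    "Laptop": "Computer", "Desktop": "Computer",
    "Office": "Software", "Antivirus": "Software",
}


def path_to_root(node: str) -> List[str]:
    """The node itself followed by all of its ancestors."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def nodes_of(ind: Dict[str, int]) -> Set[str]:
    """All catalog nodes contained in the subtree whose leaves are ind's keys."""
    return {n for leaf in ind for n in path_to_root(leaf)}


def crossover(a: Dict[str, int], b: Dict[str, int],
              rng: random.Random) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Swap the branches of a and b below one randomly chosen common nonroot node."""
    common = sorted(n for n in nodes_of(a) & nodes_of(b) if n in PARENT)
    if not common:
        return dict(a), dict(b)
    point = rng.choice(common)

    def under(leaf: str) -> bool:
        return point in path_to_root(leaf)

    child_a = {l: v for l, v in a.items() if not under(l)}
    child_a.update({l: v for l, v in b.items() if under(l)})
    child_b = {l: v for l, v in b.items() if not under(l)}
    child_b.update({l: v for l, v in a.items() if under(l)})
    return child_a, child_b


parent_a = {"Laptop": 0, "Desktop": -1, "Software": 1}
parent_b = {"Computer": 0, "Office": 1, "Antivirus": -1}
print(crossover(parent_a, parent_b, random.Random(0)))
```

An offspring may end up without any leaf labeled 0 or without any leaf labeled 1, so a validity check like the one used at initialization would still be needed after crossover.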

The mutation operator plays an important role in maintaining the diversity of the population across generations. In our schema, we define three types of mutation operators:
(1) randomly choose a leaf node and assign it an alternative attribute value;
(2) randomly choose a nonroot node and prune its subtree;
(3) randomly choose a nonroot node and add a subtree to the node.

An illustration of the mutation operators is shown in Figure 4, where the node with a thick border is the one chosen to be mutated.
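The three mutation types can be sketched as below, again assuming that an individual is a mapping from subtree leaves to labels. Here pruning is interpreted as collapsing the leaves under a chosen internal node into that node, and adding a subtree as replacing a leaf by its catalog children; these interpretations, the toy catalog, and the function names are assumptions.

```python
import random
from typing import Dict, List

# Hypothetical toy catalog: parent -> children.
CHILDREN: Dict[str, List[str]] = {
    "Root": ["Computer", "Software"],
    "Computer": ["Laptop", "Desktop"],
    "Software": ["Office", "Antivirus"],
}
PARENT = {kid: par for par, kids in CHILDREN.items() for kid in kids}
LABELS = (-1, 0, 1)


def ancestors(node: str) -> List[str]:
    """Proper ancestors of `node`, nearest first."""
    out: List[str] = []
    while node in PARENT:
        node = PARENT[node]
        out.append(node)
    return out


def mutate_label(ind: Dict[str, int], rng: random.Random) -> Dict[str, int]:
    """Type 1: give one random leaf a different label."""
    out = dict(ind)
    leaf = rng.choice(sorted(out))
    out[leaf] = rng.choice([v for v in LABELS if v != out[leaf]])
    return out


def mutate_prune(ind: Dict[str, int], rng: random.Random) -> Dict[str, int]:
    """Type 2: cut the subtree below a random internal nonroot node,
    which then becomes a leaf with a random label."""
    inner = sorted({a for leaf in ind for a in ancestors(leaf) if a in PARENT})
    if not inner:
        return dict(ind)
    node = rng.choice(inner)
    out = {l: v for l, v in ind.items() if node not in ancestors(l)}
    out[node] = rng.choice(LABELS)
    return out


def mutate_grow(ind: Dict[str, int], rng: random.Random) -> Dict[str, int]:
    """Type 3: replace a random expandable leaf by its catalog children."""
    growable = sorted(l for l in ind if l in CHILDREN)
    if not growable:
        return dict(ind)
    node = rng.choice(growable)
    out = {l: v for l, v in ind.items() if l != node}
    out.update({kid: rng.choice(LABELS) for kid in CHILDREN[node]})
    return out


rng = random.Random(0)
rule = {"Computer": 0, "Office": 1, "Antivirus": -1}
print(mutate_label(rule, rng))
print(mutate_prune(rule, rng))
print(mutate_grow(rule, rng))
```

As with crossover, a mutated individual can lose its last antecedent or consequent, so it would be checked for validity before being evaluated.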

4.3. Fitness Function

The fitness function plays an important role in our GA schema. It is used both to evaluate the offspring that will be selected into the next generation and to act as the termination condition, which is met when enough association rules with fitness values above the predefined threshold have been found. To build the fitness function, we combine the support and confidence attributes, which are necessary to describe an association rule in the domain. Therefore, the fitness of an association rule is defined as a weighted combination of its support and confidence.

Two weighting parameters balance the contributions of support and confidence in the fitness function. To mine valid association rules from a big database with our GA approach, the threshold of the fitness function has to be predefined. As the threshold depends on the support and confidence attributes, we set the minimum support min_sup and the minimum confidence min_conf for the algorithm. In our approach, rather than uniformly using the same thresholds for all levels, we use different min_sup values for different levels of association rules. The deeper the level is, the smaller the corresponding threshold will be. Furthermore, the more leaf nodes an ancestor has, the higher the min_sup of that ancestor will be [13].
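As a rough illustration of such a fitness function and of level-dependent thresholds, the sketch below assumes a simple weighted sum of support and confidence and a geometrically shrinking min_sup. The weighted-sum form, the names `alpha`, `beta`, `base`, and `decay`, and all default values are assumptions for illustration and are not necessarily the definitions used in this paper.

```python
def fitness(support: float, confidence: float,
            alpha: float = 0.5, beta: float = 0.5) -> float:
    """Assumed form: a weighted sum of support and confidence."""
    return alpha * support + beta * confidence


def min_sup_for_level(level: int, base: float = 0.05, decay: float = 0.5) -> float:
    """Hypothetical schedule: the threshold shrinks as the level gets deeper
    (level 1 is the most general level below the root)."""
    return base * decay ** (level - 1)


def is_valid_rule(support: float, confidence: float, level: int,
                  min_conf: float = 0.5) -> bool:
    """A rule is kept when it meets the thresholds of its own level."""
    return support >= min_sup_for_level(level) and confidence >= min_conf


print(fitness(0.2, 0.8))
print([min_sup_for_level(level) for level in (1, 2, 3)])
print(is_valid_rule(0.03, 0.6, level=1), is_valid_rule(0.03, 0.6, level=2))
```

Under this schedule, a rule with support 0.03 fails the level-1 threshold of 0.05 but passes the level-2 threshold of 0.025, which mirrors the idea that deeper levels use smaller minimum supports.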

5. Experimental Results

In this section, we run various experiments to analyze the performance of our design. We use two different transaction databases to mine the multilevel association rules: “Dataset 1” from University of Regina (http://www2.cs.uregina.ca/~dbd/cs831/notes/itemsets/datasets.php/) and “Dataset 2” from California State University Los Angeles (http://www.calstatela.edu/centers/hipic/contents/researchy/cloudComputing/files/market/).

We also use the classic Apriori algorithm as the benchmark to compare with our GA-based algorithm. All the experiments run on a PC with a Core i5 CPU and 4 GB of RAM. For the default settings, we set the population size in our GA-based algorithm to 200 for dataset 1 and to 120 for dataset 2. The initial min_sup is 0.01 and min_conf is 0.5.

In the first experiment, we test whether valid association rules can be mined with a fixed initial population size and within a limited period of time. In dataset 1, the initial population size varies from 40 to 1000, and, in dataset 2, it varies from 50 to 160. The results are shown in Figures 5 and 6. We can conclude that if the population is too small, the performance of the GA-based algorithm is similar to that of a random algorithm. If the population is too large, we can obtain enough association rules quickly, but the computational complexity grows fast. However, in both datasets there is a good balance point at which, with a limited population and a limited time period, most valid association rules have been mined. Therefore, we select 200 and 120 as the default population sizes for dataset 1 and dataset 2, respectively, which perform well in our algorithm.

In the rest of this section, we compare the efficiency of our GA-based algorithm with the Apriori algorithm on the two datasets. In Figures 7 and 8, we fix 4000 transactions in each dataset. As shown, as the time steps progress, the GA-based approach finds valid association rules much more quickly in both datasets. As an exhaustive approach, the Apriori algorithm may be able to find a few more valid rules than ours if there is no time limit. However, in most big data applications, the system response time is a critical performance criterion, and our approach is more valuable because it obtains most association rules in a short period of time. In Table 3, we present more detailed results for 1000 and 2000 transaction records picked from each dataset. As time goes by, we reach a consistent conclusion that our GA-based approach is more capable of finding valid association rules than the exhaustive approach.

When we change the thresholds of the minimum support and confidence for valid association rules, the outputs of both the GA-based algorithm and the Apriori algorithm change. As shown in Figure 9, in dataset 1, we reduce min_sup to 0.005 and min_conf to 0.25. In Figure 10, we increase min_sup to 0.02 and min_conf to 0.75 for dataset 2. In both experiments, the GA-based algorithm finds more association rules in a short period of time than the Apriori algorithm, regardless of the multilevel association thresholds.

In the next group of experiments, we test how valuable the association rules mined by the GA-based algorithm and by Apriori are. To measure their value, we use the same formula as the fitness function, with fixed values of its weighting parameters. The experiment results on the two datasets with 4000 transactions are shown in Figures 11 and 12, and the results on datasets with different numbers of transactions are shown in Table 4.

Consistent with the conclusions above, as time progresses, the GA-based approach finds highly valuable association rules much more quickly than Apriori in both datasets.

6. Conclusion and Future Work

In this paper, we have presented a novel genetic-based algorithm to mine multilevel association rules in big datasets. By utilizing the application domain knowledge, which can be briefly described as a catalog tree, we introduce a special subtree-based encoding schema to make the GA-based algorithm possible. In addition, we customized the initialization, crossover, and mutation functions for the tree-based genetic operators. Based on our simulations and the experiment results, we can see that by building the dynamic fitness function from the multilevel support and confidence thresholds, this algorithm is adaptive and convergent. Moreover, we tested its performance on different databases, and the algorithm performs better than the classic Apriori algorithm, mining high-quality multilevel association rules faster within a limited time.

Although we have shown that our GA-based approach is capable of dealing with some key challenges of multilevel association rule mining in big databases, we leave many others for future work. First, our approach is suitable only for domains in which the items in the association rules can be organized as a catalog tree; when it is applied to other domains with unstructured item sets, our basic design does not fit. Second, our GA-based approach should be implemented in a distributed and parallel computation environment so as to optimize its performance. Moreover, deployment in a real domain is the key to evaluating our approach and polishing the algorithm for better performance.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research has been sponsored in part by National Natural Science Foundation of China nos. 61370151 and 61202211, National Science and Technology Support Program of China 2012BAI22B05, and Huawei Research Foundation YB2013120141.