Mathematical Problems in Engineering

Volume 2015, Article ID 623240, 14 pages

http://dx.doi.org/10.1155/2015/623240

## FI-FG: Frequent Item Sets Mining from Datasets with High Number of Transactions by Granular Computing and Fuzzy Set Theory

College of Mechatronics Engineering and Automation, National University of Defense Technology, Changsha 410073, China

Received 3 July 2015; Revised 7 November 2015; Accepted 23 November 2015

Academic Editor: Anna M. Gil-Lafuente

Copyright © 2015 Zhong-jie Zhang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

Mining frequent item sets (FIs) is an important issue in data mining. Considering the limitations of exact algorithms and sampling methods, a novel FI mining algorithm based on granular computing and fuzzy set theory (FI-GF) is proposed, which mines datasets with large numbers of transactions more efficiently. First, the granularity is applied, which compresses the transactions into granules to reduce the scanning cost. During the granularity, each granule is represented by a fuzzy set, and the number of transactions represented by each granule is optimized. Then, fuzzy set theory is used to compute the supports of item sets from those granules, which handles the uncertainty brought by the granularity and preserves the accuracy of the final results. Finally, Apriori is applied to the granules, with the new way of computing supports, to obtain the FIs. On five datasets, FI-GF is compared with the original Apriori to demonstrate its reliability and efficiency, and with a representative progressive sampling method, RC-SS, to demonstrate the advantage of the granularity over sampling. Results show that FI-GF not only saves the cost of scanning transactions but also maintains high reliability. Meanwhile, the granularity has advantages over progressive sampling methods.

#### 1. Introduction

Frequent item sets (FIs) contain items which always appear together in a dataset with a frequency over a specified minimum support [1, 2]. For instance, given a dataset which contains the records of a store, if two products A and B are always bought together by customers, and the frequency of this phenomenon is over the minimum support, {A, B} can be seen as a FI. FIs can be classified into quantitative FIs (QFIs) and binary FIs (BFIs). A QFI is mined from a quantitative dataset, where every item has a range of values. A BFI is mined from a binary dataset, where every item has only two states, presence or absence. For example, given a dataset which has four items, A, B, C, and D, a QFI may look like {A: 0~1, B: −1~0.5, C: −2~1.5, D: −1~3.5}, and a BFI may look like {A, B} or {C, D}. However, in most cases, quantitative datasets are first transformed into binary datasets by discretization before FIs are mined, so in this paper we mainly focus on FI mining from binary datasets.

FI mining is the cornerstone of association rule mining and many other fields. Many algorithms have been presented in this realm, among which Apriori and FP-growth are the two most famous. Apriori is the most widely applied algorithm [3], which mines all the FIs from a dataset in several loops. In the k-th loop, Apriori mines FIs of length k: the FIs obtained in the (k−1)-th loop are first combined to build the candidate FIs of the k-th loop, and then Apriori scans the dataset to check every candidate FI and remove the false candidates. The most serious problem of Apriori is the repeated scanning of the dataset: if the number of transactions in a dataset is large, the mining speed becomes low. Therefore, another algorithm, FP-growth, was proposed to solve this problem [4]. In FP-growth, the whole dataset is first transformed into a complex data structure, and the algorithm scans only this data structure rather than the original dataset, which saves much time. However, problems remain: FP-growth consumes too much memory to store this big data structure, and the transformation from dataset to data structure is also very complex.
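The level-wise loop of Apriori described above can be sketched as follows. This is a minimal illustration written by us, not the optimized implementation from [3]; it makes the repeated-scanning problem visible, since every candidate check walks the whole transaction list again.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Minimal Apriori sketch.

    transactions: list of sets of items; min_sup: support fraction.
    An item set is kept when its support is strictly larger than
    min_sup, matching the paper's definition of a FI.
    """
    n = len(transactions)
    # Loop 1: frequent 1-item-sets, found by a full scan.
    items = {i for t in transactions for i in t}
    current = [frozenset([i]) for i in items
               if sum(i in t for t in transactions) / n > min_sup]
    frequent = list(current)
    k = 2
    while current:
        # Combine (k-1)-item-sets to build candidate k-item-sets.
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == k}
        # Scan the whole dataset again to remove false candidates --
        # this repeated scan is the cost FI-GF aims to cut.
        current = [c for c in candidates
                   if sum(c <= t for t in transactions) / n > min_sup]
        frequent.extend(current)
        k += 1
    return frequent
```

For example, on the four transactions `[{'a','b'}, {'a','b','c'}, {'a','c'}, {'b','c'}]` with `min_sup = 0.4`, all three items and all three pairs are frequent (support 0.75 and 0.5), while `{'a','b','c'}` (support 0.25) is pruned.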

Furthermore, many other algorithms have been proposed in this field [1, 5]. These algorithms improve Apriori and FP-growth by removing the minimum support, using dynamically allocated memory, improving the data structure, and so on, but none of them solves well the problem brought by a large number of transactions.

By reviewing the above algorithms, we can find that all of them are exact algorithms, whose aim is to mine the exact results. However, in some applications of FI mining, accuracy is much less important than speed. Therefore, there is a feasible idea for solving the problem brought by scanning: sacrifice some accuracy of the results to gain a faster algorithm [6].

According to this idea, some sampling methods have been presented [7–11], in which the original dataset is first sampled randomly, and the algorithm then mines the results only from those samples rather than the whole dataset, which can greatly cut the cost. However, this solution brings another problem. Although the core idea of those sampling methods is to gain speed by sacrificing some accuracy, the loss of accuracy should still be controlled within a reasonable range. Generally, there are three kinds of ways to achieve this goal: setting a bound on the sample size, estimating the accuracy of the results, and progressive sampling.

The bound on the sample size is the smallest or largest sample size that delimits the safe range of the sample. Such bounds are usually derived from mathematical inference and hypotheses, such as a certain distribution of the item values. Some research has been done in this direction; for example, Zaki and Chakaravarthy both studied the bound on the sample size with their teams [7, 8]. However, in view of the differences among datasets, the mathematical inferences and hypotheses used to derive the bound do not always hold, so the risk of setting a wrong bound is very high.

Unlike setting a bound on the sample size, another kind of method estimates the probability that every FI mined from the sample is also a FI of the original dataset. For example, Toivonen studied a method which builds a candidate set of FIs with these probabilities based on the sample size [9]. Nevertheless, estimating the accuracy only offers a reference to the user; it does not give a method to reduce the error probability.

Different from the above two approaches, progressive sampling not only provides a reliable way to control the loss of accuracy but also avoids many mathematical hypotheses: the sample size keeps changing until a stop condition is satisfied. For example, Parthasarathy presented a progressive method which keeps increasing or decreasing the sample size until the stop conditions are satisfied [10]. Chen and his colleagues proposed a two-phase method which progressively builds an appropriate sample [11]. But the difficulty of these progressive sampling methods is also apparent: it is hard to build an appropriate stop condition. A simple stop condition cannot ensure the accuracy of the final results well, and a complex stop condition may consume too much time and too many resources.

Considering the limits of sampling, it is necessary to come up with a new method which not only reduces the cost of scanning the dataset but also has an efficient mechanism to ensure the accuracy of the results, where efficiency means that the computation of this mechanism is simple, its quality is high, and it is suitable for different datasets.

Therefore, this paper proposes a novel FI mining algorithm, called FI-GF, in which the number of transactions is cut by the granularity and the supports are computed by fuzzy set theory.

The definitions of granular computing are varied [12]. Generally, it is a way to solve problems at different levels through different kinds of information granules. In most cases, a granule is a collection of data, which can be built by clustering, partition, and so on, and can be denoted by ordinary sets, fuzzy sets, rough sets, and so on [13, 14]. In FI-GF, granules are built by partitioning the set of transactions and are represented by fuzzy sets.

A fuzzy set describes uncertainty: every element has a membership degree representing the extent to which it is contained in the set [15]. Fuzzy set theory is widely applied to FI mining from datasets with uncertainty [16–19]. Considering that the granules are denoted by fuzzy sets and that the granularity may introduce uncertainty, fuzzy set theory is incorporated into FI-GF.

In short, the newly proposed algorithm, FI-GF, has the following innovations and advantages:

(1) The granularity is used to cut the number of transactions by compressing transactions into granules; compared with progressive sampling, it is more efficient.

(2) Fuzzy sets are used to denote granules, and a method to calculate the supports of item sets from those granules is designed, which helps the algorithm deal well with the uncertainty brought by the transaction reduction.

#### 2. Basic Concepts

##### 2.1. Frequent Item Sets Mining

A formal specification of the problem is presented as follows. Let I = {i₁, i₂, …, iₘ} be a set of distinct items. A dataset D is a set of transactions, in which each transaction T ⊆ I. A set X ⊆ I is called an item set. |X|, called the length of X, is the number of items in X. Item sets whose length is k are called k-item-sets. The support of X, denoted by sup(X), is the fraction of transactions in D containing X.

X is a FI if and only if sup(X) is larger than a specified minimum support, denoted by min_sup, which indicates that the presence of X is significant. Given min_sup, the goal of FI mining is to enumerate all the item sets whose supports are higher than min_sup [20–22].
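The two definitions above can be stated directly in code; this is our own small illustration of sup(X) and the FI test, with transactions modeled as Python sets.

```python
def support(itemset, dataset):
    """sup(X): fraction of transactions in the dataset containing X."""
    return sum(itemset <= t for t in dataset) / len(dataset)

def is_frequent(itemset, dataset, min_sup):
    """X is a FI iff sup(X) is larger than min_sup."""
    return support(itemset, dataset) > min_sup
```

For instance, with D = [{'a','b'}, {'a','c'}, {'a','b','c'}], sup({'a','b'}) = 2/3, so {'a','b'} is a FI for min_sup = 0.5 but not for min_sup = 0.7.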

##### 2.2. Granular Computing

Given a problem or a system P, where x is the basic element of P, granular computing can be described by

P → (granularity) → {g₁, g₂, …, gₙ} → (processing) → R,

where the granularity can be the partition, clustering, or another process over the elements x of P, whose kernel is to abstract or define the original problem or system at a different level and to study the problem at this new level. g is the granule, which is the basic element of the problem or system at the new level. In every computation, g is the smallest unit, which means that the details inside g are ignored after the granularity. In most cases, a granule g represents a collection of elements x, and a granule can be defined as an ordinary set, a fuzzy set, a rough set, and so on. The processing based on the granules varies according to the different definitions of g. R is the set of results or the set of answers to P [23–25].

##### 2.3. Fuzzy Set

Fuzzy set theory is a widely applied tool, proposed to represent uncertain concepts. A fuzzy set A of a reference set U is identified by a membership function, denoted by μ_A [15]. For x ∈ U, μ_A(x) shows the degree to which x is contained in A, and U is the universal set. Generally, 0 ≤ μ_A(x) ≤ 1. An ordinary set can be regarded as a fuzzy set whose membership function takes only the values 0 and 1. The length of a fuzzy set A is

|A| = Σ_{x∈U} μ_A(x).

To operate on fuzzy sets in a formal way, fuzzy set theory offers some operators, of which two of the most popular are the standard union and intersection [15, 16]:

μ_{A∪B}(x) = max(μ_A(x), μ_B(x)),
μ_{A∩B}(x) = min(μ_A(x), μ_B(x)),

where A and B are fuzzy sets of U and x ∈ U.
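The standard max/min operators and the length of a fuzzy set can be sketched as below, with a fuzzy set represented as a dict from elements to membership degrees (our own representation, chosen for brevity; elements absent from the dict have membership 0).

```python
def fuzzy_union(mu_a, mu_b):
    """Standard fuzzy union: pointwise maximum of the memberships."""
    return {x: max(mu_a.get(x, 0.0), mu_b.get(x, 0.0))
            for x in set(mu_a) | set(mu_b)}

def fuzzy_intersection(mu_a, mu_b):
    """Standard fuzzy intersection: pointwise minimum of the memberships."""
    return {x: min(mu_a.get(x, 0.0), mu_b.get(x, 0.0))
            for x in set(mu_a) | set(mu_b)}

def length(mu_a):
    """Length of a fuzzy set: the sum of all membership degrees."""
    return sum(mu_a.values())
```

For example, with A = {x: 0.7, y: 0.2} and B = {x: 0.4, z: 1.0}, the union gives x a degree of 0.7 and the intersection gives x a degree of 0.4, while |A| = 0.9.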

#### 3. The Proposed Algorithm FI-GF

Given a dataset D, Figure 1 is a simple working diagram of FI-GF. The whole process can be divided into two main parts: the granularity part and the mining part. In the granularity part, the dataset is scanned and partitioned from beginning to end, and transactions which are neighboring and similar are collected and represented by granules. Every granule in FI-GF is represented by a fuzzy set, and the reference set of these fuzzy sets, denoted by I, is the universal set of items of the dataset D. In the mining part, the Apriori algorithm is used to mine FIs from those granules, and the support of every item set is computed through fuzzy set theory.
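A rough sketch of the two parts is given below. This is our own simplification, not the paper's exact formulas (FI-GF's actual membership function and support computation are defined in the following subsections): here a block of neighboring transactions is compressed into a fuzzy granule whose membership for each item is the fraction of the block's transactions containing it, and the support of an item set is estimated from the granules via the fuzzy intersection (pointwise minimum).

```python
def build_granule(block):
    """Compress a block of neighboring transactions into a fuzzy granule.

    Membership of each item = fraction of the block's transactions
    containing it (an illustrative choice, not FI-GF's exact formula).
    Returns the granule and the number of transactions it represents.
    """
    n = len(block)
    items = {i for t in block for i in t}
    return {i: sum(i in t for t in block) / n for i in items}, n

def granular_support(itemset, granules, total):
    """Estimate sup(X) from granules: each granule contributes the
    minimum membership over X (fuzzy intersection) times its size."""
    est = sum(min(mu.get(i, 0.0) for i in itemset) * size
              for mu, size in granules)
    return est / total
```

For example, partitioning four transactions into the blocks [{'a','b'}, {'a','b'}] and [{'a'}, {'b'}] yields two granules; the estimated support of {'a','b'} is then (min(1,1)·2 + min(0.5,0.5)·2)/4 = 0.75, showing how the granule-level estimate can deviate from the exact support (0.5), which is exactly the uncertainty the fuzzy support computation has to manage.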