FI-FG: Frequent Item Sets Mining from Datasets with High Number of Transactions by Granular Computing and Fuzzy Set Theory

Zhang, Zhong-jie; Huang, Jian; Wei, Ying

doi:https://doi.org/10.1155/2015/623240

Mathematical Problems in Engineering


Research Article | Open Access

Volume 2015 | Article ID 623240 | https://doi.org/10.1155/2015/623240

Zhong-jie Zhang,¹ Jian Huang,¹ and Ying Wei¹

Academic Editor: Anna M. Gil-Lafuente

Received 03 Jul 2015

Revised 07 Nov 2015

Accepted 23 Nov 2015

Published 10 Dec 2015

Abstract

Mining frequent item sets (FIs) is an important issue in data mining. Considering the limitations of exact algorithms and sampling methods, a novel FI mining algorithm based on granular computing and fuzzy set theory (FI-GF) is proposed, which mines datasets with a high number of transactions more efficiently. Firstly, the granularity is applied, which compresses the transactions into granules to reduce the scanning cost. During the granularity, each granule is represented by a fuzzy set, and the transaction scale represented by a granule is optimized. Then, fuzzy set theory is used to compute the supports of item sets based on those granules, which handles the uncertainty brought by the granularity and ensures the accuracy of the final results. Finally, Apriori is applied to obtain the FIs based on those granules and the new way of computing supports. On five datasets, FI-GF is compared with the original Apriori to demonstrate its reliability and efficiency, and with a representative progressive sampling method, RC-SS, to demonstrate the advantage of the granularity over sampling. The results show that FI-GF not only saves the time spent scanning transactions but also has high reliability. Meanwhile, the granularity has advantages over progressive sampling methods.

1. Introduction

Frequent item sets (FIs) contain items that appear together in a dataset with a frequency above a specified minimum support [1, 2]. For instance, given a dataset that contains the sales records of a store, if two products are often bought together by customers, and the frequency of this co-occurrence exceeds the minimum support, the pair of products can be seen as an FI. FIs can be classified into quantitative FIs (QFIs) and binary FIs (BFIs). A QFI is mined from a quantitative dataset, where every item takes a range of values. A BFI is mined from a binary dataset, where every item has only two states, presence or absence. For example, given a dataset with four items, a QFI may assign a value range to each item, such as 0~1, −1~0.5, −2~1.5, and −1~3.5, whereas a BFI is simply a set of items that are present together. However, in most cases, quantitative datasets are first transformed into binary datasets by discretization before FIs are mined, so this paper mainly focuses on FI mining from binary datasets.

FI mining is the cornerstone of association rule mining and many other fields. Many algorithms have been presented in this area, of which Apriori and FP-growth are the two most famous. Apriori is the most widely applied algorithm [3]; it mines all the FIs from a dataset in several loops. In the $k$th loop, Apriori mines FIs of length $k$: the FIs obtained in the $(k-1)$th loop are first combined to form the candidate FIs of the $k$th loop, and Apriori then scans the dataset to check every candidate and remove the infrequent ones. The most serious problem of Apriori is the repeated scanning of the dataset. If the number of transactions in a dataset is large, mining becomes slow. Therefore, another algorithm, FP-growth, was proposed to solve this problem [4]. In FP-growth, the whole dataset is first transformed into a complex data structure, and the algorithm scans only this data structure rather than the original dataset, which saves much time. However, problems remain. FP-growth consumes a great deal of memory to store this large data structure, and the transformation from dataset to data structure is also very complex.

Furthermore, many other algorithms have been proposed in this field [1, 5]. These algorithms improve Apriori and FP-growth by removing the minimum support, using dynamically allocated memory, improving the data structure, and so on, but none of them solves the problem caused by a large number of transactions well.

A review of the above algorithms shows that all of them are exact algorithms, whose aim is to mine exact results. However, in some applications of FI mining, accuracy is much less important than speed. Therefore, a feasible way to address the scanning cost is to sacrifice some accuracy of the results in exchange for a faster algorithm [6].

Following this idea, some sampling methods have been presented [7–11], in which the original dataset is first sampled randomly and the algorithm then mines the results from those samples rather than from the whole dataset, which can greatly cut the cost. However, this solution brings another problem. Although the core idea of sampling is to gain speed by sacrificing some accuracy, the loss of accuracy should still be kept within a reasonable range. Generally, there are three ways to achieve this goal: setting a bound on the sample size, estimating the accuracy of the results, and progressive sampling.

The bound on the sample size is the smallest or the largest sample size, which defines the safe range of the sample size. Such bounds are usually derived by mathematical inference under certain hypotheses, such as a particular distribution of item values. For example, Zaki and Chakaravarthy, with their respective teams, both studied bounds on the sample size [7, 8]. However, because datasets differ, the inferences and hypotheses used to derive the bound do not always apply, so the risk of setting a wrong bound is high.

Instead of bounding the sample size, another approach estimates the probability that each FI mined from the sample is also an FI of the original dataset. For example, Toivonen studied a method that builds a set of candidate FIs with these probabilities based on the sample size [9]. Nevertheless, the accuracy estimate only offers a reference to the user; it does not provide a way to reduce the error probability.

Unlike the above two approaches, progressive sampling, in which the sample size keeps changing until a stop condition is satisfied, not only provides a reliable way to control the loss of accuracy but also avoids many mathematical hypotheses. For example, Parthasarathy presented a progressive method that keeps increasing or decreasing the sample size until the stopping conditions are satisfied [10]. Chen and his colleagues proposed a two-phase method that progressively builds an appropriate sample [11]. However, the difficulty of progressive sampling is also apparent: it is hard to build an appropriate stop condition. A simple stop condition cannot ensure the accuracy of the final results well, and a complex stop condition may consume too much time and too many resources.

Considering the limits of sampling, a new method is needed that not only reduces the cost of scanning the dataset but also has an efficient mechanism to ensure the accuracy of the results. Here, efficiency means that the mechanism is simple to compute, is of high quality, and is suitable for different datasets.

Therefore, a novel FI mining algorithm called FI-GF, in which the number of transactions is cut by the granularity and the support is computed by fuzzy set theory, is proposed in this paper.

Definitions of granular computing vary [12]. Generally, it is a way to solve problems at different levels through different kinds of information granules. In most cases, a granule is a collection of data, which can be built by clustering, partition, and so on and can be denoted by ordinary sets, fuzzy sets, rough sets, and so on [13, 14]. In FI-GF, a granule is built by partitioning the set of transactions and is represented by a fuzzy set.

A fuzzy set describes uncertainty: every element has a membership degree that represents the degree to which it is contained in the set [15]. Fuzzy set theory is widely applied to FI mining from datasets with uncertainty [16–19]. Because granules are denoted by fuzzy sets and the granularity may introduce uncertainty, fuzzy set theory is incorporated into FI-GF.

In short, the newly proposed algorithm, FI-GF, has the following innovations and advantages:
(1) The granularity is used to cut the transaction scale by compressing transactions into granules, and compared with progressive sampling, it is more efficient.
(2) Fuzzy sets are used to denote granules, and a method to calculate the supports of item sets based on those granules is designed, which helps the algorithm deal well with the uncertainty brought by transaction reduction.

2. Basic Concepts

2.1. Frequent Item Sets Mining

A formal specification of the problem is as follows. Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of distinct items. A dataset $D = \{t_1, t_2, \ldots, t_n\}$ is a set of transactions, in which $t_j \subseteq I$. A set $X \subseteq I$ is called an item set. $|X|$, called the length of $X$, is the number of items in $X$. Item sets whose length is $k$ are called $k$-item-sets. The support of $X$, denoted by $\mathrm{sup}(X)$, is the fraction of transactions in $D$ containing $X$.

$X$ is an FI if and only if $\mathrm{sup}(X)$ is larger than a specified minimum support, denoted by min_sup, which indicates that the presence of $X$ is significant. Given min_sup, the goal of FI mining is to enumerate all the item sets whose supports are higher than min_sup [20–22].
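The definitions above can be illustrated with a minimal Python sketch. It assumes that transactions are stored as `frozenset` objects; the function names, the `max_len` cutoff, and the toy dataset are hypothetical and are not part of the paper. The enumeration is brute force and serves only to make the definition of support concrete.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def frequent_itemsets(transactions, min_sup, max_len=2):
    """Brute-force enumeration of item sets with support above min_sup.

    Illustration only: the number of candidates grows exponentially,
    which is exactly what Apriori-style pruning tries to avoid.
    """
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, max_len + 1):
        for cand in combinations(items, k):
            s = support(cand, transactions)
            if s > min_sup:  # strictly larger, as in the definition
                result[frozenset(cand)] = s
    return result


# Hypothetical toy dataset with four transactions over items a, b, c.
toy = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("b")]
fis = frequent_itemsets(toy, min_sup=0.4)
```

With this toy data, `{b, c}` appears in only one of the four transactions, so it is not reported, while `{a, b}` and `{a, c}` each appear in two.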

2.2. Granular Computing

Given a problem or a system made up of basic elements, granular computing can be described as a chain of three steps: the granularity turns the basic elements into granules, a process operates on the granules, and the process yields a set of results. The granularity can be a partition, a clustering, or another process applied to the basic elements; its core is to abstract or define the original problem or system at a different level and to study the problem at this new level. A granule is the basic element of the problem or system at the new level. In every computation, the granule is the smallest unit, which means that the details inside a granule are ignored after the granularity. In most cases, a granule represents a collection of basic elements, and a granule can be defined as an ordinary set, a fuzzy set, a rough set, and so on. The process applied to the granules varies with the definition of the granule. Its output is the set of results or answers to the original problem [23–25].

2.3. Fuzzy Set

The fuzzy set is a widely applied tool and theory, proposed to represent uncertain concepts. A fuzzy set $A$ of a reference set $U$ is identified by a membership function, denoted by $\mu_A$ [15]. For $x \in U$, $\mu_A(x)$ shows the degree to which $x$ is contained in $A$, and $U$ is the universal set. Generally, $\mu_A(x) \in [0, 1]$. An ordinary set can be regarded as a fuzzy set with the membership function $\mu_A(x) \in \{0, 1\}$. The length of a fuzzy set is
$$|A| = \sum_{x \in U} \mu_A(x).$$

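To operate on fuzzy sets in a formal way, fuzzy set theory offers some operators, of which two of the most popular are [15, 16]
$$\mu_{A \cap B}(x) = \min\bigl(\mu_A(x), \mu_B(x)\bigr), \qquad \mu_{A \cup B}(x) = \max\bigl(\mu_A(x), \mu_B(x)\bigr),$$
where $x \in U$ and $A$ and $B$ are fuzzy sets of $U$.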
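As a small illustration of these notions, the sketch below stores a fuzzy set as a Python dictionary that maps each element to its membership degree, with missing elements read as degree 0. This representation, the function names, and the example sets are assumptions made for illustration; the operators are the standard minimum and maximum operators of Zadeh.

```python
def fuzzy_length(a):
    """Length of a fuzzy set: the sum of its membership degrees."""
    return sum(a.values())


def fuzzy_intersection(a, b):
    """Pointwise minimum of the two membership functions."""
    return {x: min(a.get(x, 0.0), b.get(x, 0.0)) for x in a.keys() | b.keys()}


def fuzzy_union(a, b):
    """Pointwise maximum of the two membership functions."""
    return {x: max(a.get(x, 0.0), b.get(x, 0.0)) for x in a.keys() | b.keys()}


# Hypothetical fuzzy sets over the reference set {x1, x2, x3}.
A = {"x1": 0.25, "x2": 0.75}
B = {"x2": 0.5, "x3": 1.0}

both = fuzzy_intersection(A, B)   # x2 keeps the smaller degree, 0.5
either = fuzzy_union(A, B)        # x2 keeps the larger degree, 0.75
```

An ordinary set fits the same representation when every stored degree is 1.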

3. The Proposed Algorithm FI-GF

Given a dataset $D$, Figure 1 shows a simple working diagram of FI-GF. The whole process can be divided into two main parts, the granularity part and the mining part. In the granularity part, the dataset is scanned and partitioned from beginning to end, and neighboring, similar transactions are collected and represented by granules. Every granule in FI-GF is represented by a fuzzy set, and the reference set of these fuzzy sets, denoted by $I$, is the universal set of items of the dataset $D$. In the mining part, the Apriori algorithm is used to mine FIs from those granules, and the support of every item set is computed through fuzzy set theory.

3.1. The Collection of Transactions

As Figure 2 shows, the core idea of the granularity is to compress the transaction scale. During the granularity, the dataset is scanned, and similar, neighboring transactions are collected and represented by a fuzzy set.

Therefore, the first problem to solve is how to partition the dataset and collect transactions.

Given a collection of transactions $C$, which is used to build a granule $G$, the evaluation of $C$ can be based on two factors of $G$ [14].

(1) The coverage of granule $G$, denoted by $\mathrm{cov}(G)$, is the capability of $G$ to contain information. Generally, it is an increasing function of the data scale represented by $G$. In FI-GF, it is defined by (4) as an increasing function of $|C|$, the number of transactions in $C$. This factor can be easily understood. The purpose of the granularity is to save the time spent scanning transactions, so the more transactions a granule can represent, the more time and resources are saved.

(2) The specificity of granule $G$, denoted by $\mathrm{sp}(G)$, is the capability of $G$ to precisely represent information. The higher $\mathrm{sp}(G)$ is, the more precisely $G$ can represent the information in $C$. In this paper, it is defined by (5) in terms of the difference among the transactions in $C$, which is calculated by (6) from the Hamming distances between pairs of transactions in $C$. Obviously, $\mathrm{sp}(G)$ is a decreasing function of the data scale in $C$.

The second factor is also important. Since the granularity compresses the transactions, it inevitably destroys or ignores some of their original information. The more transactions a granule has, the more differences among transactions are ignored. This factor evaluates the degree to which the granule can preserve the original information.

However, in most cases, these two factors constrain each other, and an appropriate granule should both cover more data and represent information more precisely. Therefore, the collection of transactions can be formulated as an optimization problem that seeks the collection with the best balance of coverage and specificity, measured by Col.

Furthermore, the weights of coverage and specificity are not always the same, so two parameters are introduced to change the optimization problem into (8); one controls the importance of coverage and the other controls the importance of specificity. If the coverage parameter is higher, the granule tends to contain more transactions at the expense of its capability to represent information precisely, and vice versa.

These two parameters have another purpose, which is to control the average transaction scale in a granule. Although the transaction scales of different granules are determined by the balance of coverage and specificity, the average of those scales can be steered to an expected value. Suppose the average Hamming distance between two transactions in a dataset is AHD; then (8) can be approximated by a function of the transaction scale alone. The expected transaction scale at which Col reaches its peak can then be estimated from this approximation, and AHD can be estimated from a sample of transactions. Therefore, by controlling the values of the two parameters, the average transaction scale contained in a granule can also be controlled. Generally, if both parameters become smaller, the average transaction scale in a granule becomes larger, and if both become larger, the average scale becomes smaller, which can be deduced from the approximate function.
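The estimation of AHD from a sample can be sketched as follows. The sketch assumes that each binary transaction is stored as a `frozenset` of the items it contains, so that the Hamming distance between two transactions equals the size of their symmetric difference. The function names, the sample size, and the fixed random seed are hypothetical choices for illustration.

```python
import random
from itertools import combinations


def hamming(t1, t2):
    """Hamming distance of two binary transactions stored as item sets."""
    return len(t1 ^ t2)


def estimate_ahd(transactions, sample_size, seed=0):
    """Average pairwise Hamming distance over a random sample of transactions."""
    rng = random.Random(seed)
    sample = rng.sample(transactions, min(sample_size, len(transactions)))
    pairs = list(combinations(sample, 2))
    if not pairs:
        return 0.0
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy dataset; here the sample covers all three transactions.
toy = [frozenset("ab"), frozenset("bc"), frozenset("abc")]
ahd = estimate_ahd(toy, sample_size=3)
```

The pairwise loop is quadratic in the sample size, so the sample should stay small relative to the dataset.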

Generally, as the transaction scale of a collection grows, the curves of cov, sp, and Col look like those in Figure 3. At first, the granule contains little information and the differences among its transactions are relatively few, so sp is high and cov is low. Then, as the transaction scale grows, the granule contains more information but also more differences, so cov increases and sp decreases. As for Col, at the beginning the capability to contain more information and the capability to represent the information precisely are not balanced, so Col is initially low. As the transaction scale increases, the balance between cov and sp improves, so Col increases too. After the peak of Col, however, the granule contains enough information, but the differences among its transactions make it hard to represent the information precisely, so the balance between cov and sp worsens and Col starts to decrease. This behavior is verified in Experiment 1 on five datasets.

To sum up, the stop condition of a collection is whether the current value of Col is near its optimal value. Because Col tends to decrease after it passes the peak, the stop condition is designed as an inequality on the growth rate of Col, which compares Col of the current collection with Col of the collection from which the most recently added transactions have been removed. A threshold parameter helps decide whether Col is near the peak: when the growth rate falls below the threshold, the growth rate of Col is low enough and Col is probably near the peak, so the collection can be stopped. A second parameter protects the calculation of the growth rate from fluctuation, since the growth rate of Col is not a smooth curve. If this parameter is high, the effect of fluctuation is reduced to a low level, but the algorithm responds more slowly to the arrival of the peak, and vice versa.

3.2. Representing the Collection by a Granule

After the dataset is partitioned, a way is needed to represent the collections of transactions. As shown in Figure 4, FI-GF uses a fuzzy set for this purpose, whose form, following Zadeh's notation [15], is
$$G = \frac{\mu_G(i_1)}{i_1} + \frac{\mu_G(i_2)}{i_2} + \cdots + \frac{\mu_G(i_m)}{i_m},$$
in which $I = \{i_1, i_2, \ldots, i_m\}$ is the universal set of items of the dataset and $\mu_G(i_k)$ is the membership degree of the item $i_k$ in $G$. In effect, $G$ is a simplification of the collection $C$, and it is used to reduce the cost of scanning the dataset.

The problem that emerges here is how to determine $\mu_G(i)$ for every item $i$ in $I$. The fuzzy statistical test is used for this purpose in FI-GF [26].

If the granule $G$ is regarded as a concept, a transaction $t \in C$ can be seen as a sample of $G$. Then, the membership degree of an item $i$ in $G$ can be determined by
$$\mu_G(i) = \frac{1}{|C|} \sum_{t \in C} f(i, t), \tag{14}$$
where $f(i, t) = 1$ if $i \in t$ and $f(i, t) = 0$ otherwise.

For example, given a granule , where is the collection of transactions which will be represented by , if , we can know thatso , and so on. Therefore,
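As a further illustration with hypothetical data (the collection below is invented for this purpose and does not come from the paper), the fuzzy statistical test assigns each item the fraction of transactions in the collection that contain it:

```latex
\begin{align*}
C &= \{t_1, t_2, t_3, t_4\}, \quad
t_1 = \{a, b\},\; t_2 = \{a, c\},\; t_3 = \{a, b, c\},\; t_4 = \{b\}, \\
\mu_G(a) &= \tfrac{1}{4}(1 + 1 + 1 + 0) = 0.75, \\
\mu_G(b) &= \tfrac{1}{4}(1 + 0 + 1 + 1) = 0.75, \\
\mu_G(c) &= \tfrac{1}{4}(0 + 1 + 1 + 0) = 0.5, \\
G &= \frac{0.75}{a} + \frac{0.75}{b} + \frac{0.5}{c}.
\end{align*}
```

The four transactions are thus replaced by a single fuzzy set, and the transaction scale represented by this granule is 4.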

To sum up, the pseudocode of the granularity is shown as Algorithm 1.

Input , , , and q.
Output , the set of granules, and , the set of the scale of data represented by every granule.
, , .
For
.
If
Using (14) to transform to .
, , .
End If
End For
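The loop structure of the granularity can be sketched in Python under several loudly stated assumptions. The scoring function `col` is supplied by the caller, because the exact forms of cov, sp, and Col are not reproduced in this sketch. The stop rule (growth of Col over the last `window` transactions falling below `threshold`) is one plausible reading of the stop condition in Section 3.1, not a confirmed transcription of it. Flushing the last unfinished collection is also an assumption. All names are hypothetical.

```python
from collections import Counter


def to_granule(collection):
    """Fuzzy-set granule: item -> fraction of transactions containing it."""
    counts = Counter(item for t in collection for item in t)
    n = len(collection)
    return {item: c / n for item, c in counts.items()}


def granulate(transactions, col, threshold, window):
    """Single scan that groups neighboring transactions into granules.

    col       -- caller-supplied score of a collection (assumed interface)
    threshold -- growth rate below which a collection is closed (assumed rule)
    window    -- number of latest transactions used to measure growth
    """
    granules, scales = [], []
    current = []
    for t in transactions:
        current.append(t)
        if len(current) > window:
            growth = (col(current) - col(current[:-window])) / window
            if growth < threshold:
                granules.append(to_granule(current))
                scales.append(len(current))
                current = []
    if current:  # assumption: keep the trailing, unfinished collection
        granules.append(to_granule(current))
        scales.append(len(current))
    return granules, scales


def toy_col(collection):
    """Hypothetical score that rises and then falls with collection size."""
    n = len(collection)
    return n * 0.5 ** n


txs = [frozenset("ab"), frozenset("a"), frozenset("c"),
       frozenset("cd"), frozenset("e")]
granules, scales = granulate(txs, toy_col, threshold=0.01, window=1)
```

With the toy score, every collection is closed after two transactions, so the five transactions are compressed into three granules. A larger `window` smooths the growth estimate at the cost of a later stop.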

3.3. The Calculation of Supports Based on Granules

After the granularity, the transactions are compressed into a set of granules, and much of the time needed to scan the transactions is saved. Next, how to calculate the supports of item sets from those granules should be discussed.

Because a granule in FI-GF is a fuzzy set, the calculation of supports from the theory of fuzzy association rules can be used [19], in which, given a set of fuzzy sets $GS$ and an item set $X$, the support of $X$ is defined as
$$\mathrm{sup}(X) = \frac{1}{|GS|} \sum_{G \in GS} \min_{i \in X} \mu_G(i), \tag{17}$$
where $X \subseteq I$.

However, if (17) is applied in FI-GF, two defects emerge. Firstly, the contribution of each fuzzy set depends only on the item of $X$ with the minimum membership degree, and the effects of the other items are ignored. Secondly, in (17), the contributions of all the fuzzy sets are the same, whereas the transaction scales represented by different granules vary. Therefore, another method should be proposed to calculate supports from the granules.
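The conventional fuzzy support and its two defects can be seen in a short sketch. It assumes that each granule is a dictionary from item to membership degree, with missing items read as degree 0; the function name and the two granules are hypothetical. This is the standard minimum-based fuzzy support, not the support measure that FI-GF finally adopts.

```python
def fuzzy_support(itemset, granules):
    """Mean over granules of the smallest membership degree among the items."""
    total = 0.0
    for g in granules:
        total += min(g.get(i, 0.0) for i in itemset)
    return total / len(granules)


# Hypothetical granules: imagine g1 summarizes 100 transactions and g2 only 2.
g1 = {"a": 1.0, "b": 0.5}
g2 = {"a": 0.5, "b": 0.25}

s_ab = fuzzy_support({"a", "b"}, [g1, g2])  # (0.5 + 0.25) / 2
s_a = fuzzy_support({"a"}, [g1, g2])        # (1.0 + 0.5) / 2
```

In `s_ab`, the degrees of item `a` play no role because `b` has the smaller degree in both granules, and `g1` and `g2` weigh equally even though they are imagined to stand for very different numbers of transactions.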

Before the granularity, the contribution of a transaction to the support of an item set $X$ depends on whether the transaction covers $X$, so the contribution of a granule to the support of $X$ should be the degree to which the granule covers $X$. Following this idea, we first define the degree to which a granule covers a fuzzy set over the same reference set, the universal set of items. The contribution of a granule to the support of $X$ is then this degree evaluated for $X$, with $X$ treated as a fuzzy set over the items whose membership degree is 1 for an item in $X$ and 0 otherwise. A parameter of this degree determines how strictly the coverage is evaluated: the larger the parameter, the stricter the evaluation, and vice versa.

For example, there is a granule which isFor an item set ;

If , we can get that

However, if , we can get thatObviously, raises when becomes smaller. The meaning of is shown as the blue area in Figure 5.

Finally, the support of an item set based on the granules, Gsup, is defined by (22), which takes into account the transaction scale represented by each granule.

For example, a dataset is given which only has 2 granules, and , where and , andFor an item set , we can get that when , and . So, the support of which is calculated based on and is

3.4. The Dynamical Minimum Support

After the granularity, Apriori is used to mine FIs based on those granules and (22). To reduce the number of candidates, a dynamical minimum support is designed to replace the original one; it depends on the length of the item sets to be mined, the current loop number of Apriori, and a parameter that controls its rate of change.

Firstly, if the target FIs are shorter, the dynamical minimum support is relatively large, because shorter item sets are usually more frequent than longer ones. Then, if the loop number of Apriori is higher, the dynamical minimum support also becomes larger, because a higher loop number leads to longer candidates and more complex computation, so more useless candidates should be removed. In these ways, the dynamical minimum support can efficiently reduce the cost of Apriori.

3.5. The Complete Pseudocode of FI-GF

In summary, the pseudocode of FI-GF, based on the granularity, Gsup, Apriori, and the dynamical minimum support, is shown as Algorithm 2.

Input , , , , , , , and min_sup.
Output The set of frequent -item-sets FI.
= granularity //Algorithm 1

For


For each
If

End if
End for
End for

Procedure apriori_gen()
For each
If

If each subset of satisfies that

End if
End if
End for
Return
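The candidate generation step can be sketched in Python as the textbook join-and-prune procedure of Apriori. This is the standard technique and is offered as an illustration of what a procedure such as apriori_gen typically does; it is not a transcription of Algorithm 2. The sketch assumes that all input item sets are `frozenset` objects of the same length, and the example data are hypothetical.

```python
from itertools import combinations


def apriori_gen(frequent_prev):
    """Build length-k candidates from frequent item sets of length k-1.

    Join: two sets whose union has exactly one more item form a candidate.
    Prune: a candidate is kept only if all of its (k-1)-subsets are frequent.
    """
    prev = set(frequent_prev)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) != len(a) + 1:
                continue
            subsets = combinations(union, len(a))
            if all(frozenset(s) in prev for s in subsets):
                candidates.add(union)
    return candidates


# Hypothetical frequent 2-item-sets.
l2 = {frozenset("ab"), frozenset("ac"), frozenset("bc"), frozenset("bd")}
c3 = apriori_gen(l2)
```

Here `{a, b, d}` and `{b, c, d}` are pruned because `{a, d}` and `{c, d}` are not frequent, so only `{a, b, c}` survives as a candidate.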

4. The Experiments and Discussions

Several datasets are used to evaluate FI-GF. They are kosarak, T10I4D100K, retail, connect, and mushroom, which were downloaded from http://fimi.ua.ac.be/data/ and are shown in Table 1. All experiments were run in MATLAB under Windows XP on a computer with a 2.4 GHz CPU and 2.92 GB of RAM. All algorithms were coded in the MATLAB language.

Experiment 1. In Experiment 1, the method of the granularity in FI-GF, which is Algorithm 1, is applied to the datasets in Table 1. The purpose of this experiment is to display the working principle of the granularity and to help us understand what happens when the granules are being generated.
Table 2 shows the parameter values of Algorithm 1 in Experiment 1, and Figures 6 and 7 exhibit the details of the granularity.
Figure 6 describes how the curves of cov, sp, and Col change when the transaction scales represented by those first granules in the datasets of Table 1 are growing. Cov, sp, and Col, respectively, describe the capability of a granule to contain more information, the capability of a granule to precisely represent information, and the balance of cov and sp. For every subfigure in Figure 6, the horizontal axis is the transaction scale represented by the first granule, and the vertical axis is the value of cov, sp, and Col.
Figure 7 describes the transaction scale of all the granules in the datasets of Table 1 after the granularity. For every subfigure, the horizontal axis represents the sequence numbers of granules, and the vertical axis represents the transaction scale of every granule.
By analyzing Figures 6 and 7, several phenomena can be observed and explained.
(1) According to Figure 6, when the transaction scale represented by a granule increases, cov increases and sp decreases. Qualitatively, the more transactions a granule represents, the more information it contains, so cov increases with the transaction scale. At the same time, because of the differences among transactions, the more transactions a granule represents, the more difficult it is for the granule to represent the information precisely, so sp decreases as the transaction scale goes up. Quantitatively, the explanation follows from the definitions of cov and sp in (4) and (5): cov is an increasing function, and sp a decreasing function, of the transaction scale represented by the granule.
(2) Furthermore, according to Figure 6, every curve of Col goes up at the beginning and decreases after the peak. At the beginning, when the transaction scale is very low, the granule can represent its transactions precisely, but it contains too little information. Therefore, when the transaction scale increases, Col goes up. After the transaction scale grows to a certain level, the amount of information is sufficient, while the differences among transactions become significant, so Col begins to go down.
(3) According to Figures 6 and 7, the curves of cov, sp, and Col differ across datasets, and so do the distributions of the transaction scales represented by granules. This is caused by the different degrees of difference among transactions in different datasets. Larger differences among transactions make sp fall faster and Col reach its peak earlier, and vice versa. On the other hand, the two parameters that decide the weights of cov and sp also affect the shapes of the curves. When the coverage parameter increases, the capability to contain more information becomes more important and cov grows faster, and vice versa. Meanwhile, if the specificity parameter increases, sp drops faster, and Col reaches its peak earlier.
(4) According to Figure 7, for every dataset, the transaction scales represented by granules vary. This is caused by the variation of the differences among transactions in different parts of a dataset. Large differences result in a faster fall of sp and an earlier peak of Col, which limits the scale of transactions, and vice versa.
To sum up, Figures 6 and 7 show that, for a granule, the capability to contain more information and the capability to represent the information precisely constrain each other. Meanwhile, when the granularity is applied to different datasets or to different parts of the same dataset, the changes of cov, sp, and Col and the transaction scales of the granules are all different. This is caused not only by the different degrees of difference among transactions in different datasets and in different parts of the same dataset but also by the settings in Table 2 of the two parameters that decide the weights of cov and sp.

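Figure 6: Curves of cov, sp, and Col against the transaction scale represented by the first granule. Panels: (a) Kosarak, (b) T10I4D100K, (c) Retail, (d) Connect, (e) Mushroom.

Figure 7: Transaction scale of every granule after the granularity, by granule sequence number. Panels: (a) Kosarak, (b) T10I4D100K, (c) Retail, (d) Connect, (e) Mushroom.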

Experiment 2. In Experiment 2, FI-GF is used to mine FIs from those datasets in Table 1, where FI-GF is compared with the original Apriori to test its reliability and efficiency. To be fair, the dynamical minimum support, , is also applied to Apriori. The parameters of the granularity in Experiment 2 are the same as Table 2. In addition to this, , and the other parameters are shown in Table 3.
In Figure 8, the time consumed by Apriori and by FI-GF under different settings of the strictness parameter is recorded in a histogram. This parameter represents how stringently the degree to which a granule contains an item set is evaluated.
Table 3 describes the results mined by Apriori and FI-GF, together with the base number of the dynamical minimum support and the parameter controlling its rate of change. In Table 3, four groups of results are recorded: the results generated by FI-GF when the strictness parameter is set to 0.9, 0.95, and 1, respectively, and the results generated by the original Apriori.
From Figure 8 and Table 3, the following can be observed.
(1) According to Figure 8, FI-GF saves a lot of time compared with Apriori. For almost every dataset, FI-GF is at least twice as fast as Apriori. This advantage is mainly due to the granularity in FI-GF. To mine FIs from a dataset, Apriori has to scan the dataset repeatedly. If the transaction scale of a dataset is very large, too much time is spent. FI-GF, however, first puts all the transactions into granules, and the algorithm only needs to scan the granules rather than the original transactions, so FI-GF is more efficient.
(2) According to Table 3, the FIs mined by the original Apriori can always be mined by FI-GF too. This shows that the uncertainty brought by the granularity and the calculation of Gsup has been handled successfully by FI-GF. The novel algorithm can ensure the reliability of the final results.
(3) Sometimes, the number of results mined by FI-GF is larger than the number of results mined by Apriori. This can be explained as follows. Firstly, the contribution of a granule to the support of an item set is defined by the degree to which the granule covers the item set. This formula describes the degree to which a granule contains the item set rather than whether it contains it, so the criterion is lowered and more results are mined. Furthermore, how strictly the containment is evaluated can be adjusted by the strictness parameter, which also affects the number of final results. In summary, the extra FIs in FI-GF are caused by the methods that handle the uncertainty brought by the granularity and ensure the appearance of the real FIs.
(4) Finally, a comparison of the results generated by FI-GF under the three settings of the strictness parameter shows that the higher the parameter is, the fewer results are mined. Because this parameter controls how strictly the coverage of an item set by a granule is evaluated, the higher it is, the fewer results can satisfy the criterion, and vice versa.

Experiment 3. In Experiment 3, the granularity, Algorithm 1, in FI-GF is compared with a classical and widely applied progressive sampling algorithm, RC-SS [10], to prove its advantage.
RC-SS keeps increasing the sample size until the similarity between two consecutive samples reaches a high level. The similarity is evaluated through a representative set. A representative set consists of item sets, each of which contains the most frequent FI mined by Apriori from the sample. Given two representative sets mined from two samples, if the representative sets are similar, the two samples can also be regarded as similar. The similarity between two representative sets is computed by (25) from the supports of their item sets, where each support is computed on the corresponding sample. RC-SS also needs a minimum support, which is used by Apriori to generate the representative sets. Moreover, a sampling step, which decides the growth rate of the sample size, is needed, and a lowest sample size, which ensures that the final sample will not be too small, is required too.
In this experiment, the representative set of a sample is defined as those FIs of the sample that contain the most frequent item, the minimum supports of RC-SS are all set to 0.1, the sampling steps are all set to 100, and the lowest sample sizes are set to 300. If the six latest consecutive similarities are all higher than 0.9, the sampling is stopped. The parameters of the granularity are the same as in Table 2.
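The stopping rule used in this experiment can be sketched as follows. The similarity between two consecutive samples is left as a caller-supplied function, because (25) is not reproduced in this sketch. How the lowest sample size interacts with the stopping rule, the `max_size` safety cap, and all function names are assumptions made for illustration; only the step of 100, the lowest size of 300, the run of six similarities, and the level of 0.9 come from the text.

```python
def should_stop(similarities, sample_size, min_size=300, run=6, level=0.9):
    """True when the sample is large enough and the latest `run` similarities
    are all strictly higher than `level`."""
    if sample_size < min_size or len(similarities) < run:
        return False
    return all(s > level for s in similarities[-run:])


def progressive_sample_size(similarity_at, step=100, min_size=300,
                            max_size=10**6):
    """Grow the sample by `step` until the stop rule fires.

    similarity_at(n_prev, n) -- caller-supplied similarity between the samples
    of sizes n_prev and n (assumed interface).
    """
    sims, size = [], step
    while size + step <= max_size:
        sims.append(similarity_at(size, size + step))
        size += step
        if should_stop(sims, size, min_size):
            break
    return size


# Hypothetical similarity curves: always high, and always low.
stable = progressive_sample_size(lambda a, b: 0.95)
never = progressive_sample_size(lambda a, b: 0.5, max_size=1000)
```

With a constantly high similarity, six comparisons are needed before the rule can fire, so the sample cannot stop below 700 transactions under these settings. A similarity that never stabilizes lets the sample grow until the cap.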
Figure 9 draws the time which is, respectively, consumed by the granularity and RC-SS. Figure 10 shows the transaction scales generated by RC-SS and granule scales generated by the granularity. Figure 11 shows the learning curves of the RC-SS, which describes the changes of when RC-SS is applied to those datasets in Table 1.
Through Figures 9–11, the following can be got.
(1) According to Figure 9, in most cases the granularity costs less time than RC-SS. This can be explained as follows. To begin with, RC-SS has to generate samples repeatedly until the similarity between two consecutive samples becomes stable and high. Secondly, RC-SS has to apply Apriori to mine the representative set from every sample, which means that the sampling process itself still spends considerable time scanning transactions. Furthermore, the complexity of (25), which is the computation of similarity, also puts some burden on RC-SS.
(2) However, Figure 9 also shows that on the dataset retail the granularity costs more time than RC-SS, and on the dataset kosarak the granularity has only a slight efficiency advantage. This is caused by how FI-GF builds its granules. To build a granule, FI-GF first builds a collection of transactions. To ensure that the information in a collection is represented precisely, the differences within the collection are computed by (6), which requires the Hamming distance between two transactions. The longer the two transactions are, the more expensive the Hamming distance is to compute. Since the average transaction lengths of retail and kosarak are both very long, the Hamming distance computations on them are slower, and the granularity therefore takes more time.
(3) According to Figure 10, in most cases the granule scale built by the granularity is smaller than the transaction scale chosen by RC-SS. This can be explained as follows. As mentioned in Section 3.1, the average transaction scale represented by a granule can be controlled by the parameters of the granularity, and the more transactions a single granule represents, the fewer granules are generated. Therefore, if these parameters are set reasonably, the granule scale can be kept at a low level, which is another advantage of the granularity. The granularity not only makes the simplified dataset as precise as possible but also provides a tool to control the granule scale. Progressive sampling, however, can only do the first job, and its final sample size cannot be controlled. For example, Figure 10 shows that, even after the sampling, the final sample size of the dataset connect is still close to 10300, which still costs too much time to scan.
(4) According to Figure 11, the learning curves of RC-SS are not always smooth and convergent. When RC-SS is applied to the datasets kosarak, retail, and mushroom, the learning curves converge well. However, when RC-SS is applied to the datasets T10I4D100K and connect, it does not perform well. The reason can be explained as follows.
The stop condition of RC-SS in this experiment is that 6 consecutive similarities are all higher than 0.9. The more elements two consecutive representative sets have in common, the higher the similarity will be. Meanwhile, the representative set of a sample is constructed from those FIs covering the most frequent item in this sample. If several items have similar and high frequencies, a small change of the sample may change the frequency ranking and, as a result, change the representative set substantially, so the similarity between the original sample and the slightly changed sample is low. Thus, the larger the variance of the item frequencies is, the better RC-SS converges. The different variances of the item frequencies in the datasets in Table 1 therefore cause the different convergence behaviors. This shows another disadvantage of RC-SS compared with FI-GF: RC-SS does not adapt equally well to every dataset.
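The sensitivity to near-tied item frequencies can be illustrated with a small toy example. The sketch below is only an illustration under assumptions: the data are invented, the helper names are hypothetical, only item sets of size 1 and 2 are counted, and a Jaccard index stands in for the similarity of (25).

```python
from collections import Counter
from itertools import combinations


def representative_set(transactions, min_support):
    """Frequent 1- and 2-item sets that contain the most frequent item."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        items = sorted(t)
        for k in (1, 2):
            counts.update(combinations(items, k))
    fis = {s for s, c in counts.items() if c / n >= min_support}
    singles = [s[0] for s in fis if len(s) == 1]
    top = max(singles, key=lambda i: (counts[(i,)], i))
    return {s for s in fis if top in s}


def jaccard(a, b):
    """Placeholder similarity (assumption); it is NOT the paper's (25)."""
    return len(a & b) / len(a | b)


# Near-tied frequencies: item "a" occurs 51 times, item "b" 50 times.
tied = [{"a", "c"}] * 30 + [{"a"}] * 21 + [{"b", "d"}] * 30 + [{"b"}] * 20
# Clear gap: item "a" occurs 80 times, item "b" 20 times.
spread = [{"a", "c"}] * 30 + [{"a"}] * 50 + [{"b", "d"}] * 20

# The same tiny change is applied to both samples: two extra {"b"}.
extra = [{"b"}] * 2

sim_tied = jaccard(representative_set(tied, 0.1),
                   representative_set(tied + extra, 0.1))
sim_spread = jaccard(representative_set(spread, 0.1),
                     representative_set(spread + extra, 0.1))

print(sim_tied)    # 0.0: the most frequent item flips from "a" to "b"
print(sim_spread)  # 1.0: "a" stays on top, the representative set is unchanged
```

In the near-tied sample, two additional transactions are enough to replace the whole representative set, while the same change leaves the sample with a clear frequency gap untouched. This is the mechanism by which a low variance of item frequencies can keep the similarity from settling above the threshold.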

Figure panels, one per dataset: (a) Kosarak, (b) T10I4D100K, (c) Retail, (d) Connect, (e) Mushroom.

5. Conclusions

The following conclusions can be drawn:
(1) FI-GF has both high efficiency and high reliability. Firstly, its granularity not only decreases the scale of transactions but also ensures the precision of every granule. Then, the support computed by fuzzy set theory successfully adapts to the uncertainty brought by the granularity.
(2) In most cases, the granularity costs less time than RC-SS, which is the most representative progressive sampling method. Furthermore, the granularity can control the final size of the simplified dataset to some extent and can adapt to more datasets.
(3) The application of granular computing to FI mining has broad prospects. The next work is to solve the problem of the heavy computation brought by long transactions.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Copyright

Copyright © 2015 Zhong-jie Zhang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
