Abstract

With the growth of human society and the Internet of Things, wireless and mobile networking have spread to every field of scientific research and social production, and security and privacy have become decisive factors; traditional security mechanisms leave openings that criminals can exploit. Association rules are an important topic in data mining with broad application prospects in wireless and mobile networking, as they can discover interesting correlations hidden among items in large volumes of data. Apriori, the most influential association rule mining algorithm, must scan the database many times, so its efficiency is low when the database is huge. To address the security-mechanism problem and improve efficiency, this paper proposes a new algorithm that scans the database only once and whose data scale shrinks as the algorithm runs. Experimental results show that the new algorithm can efficiently discover useful association rules.

1. Introduction

With the rapid development of web technology, the number of choices available to users has become overwhelming, and it takes a long time to filter, prioritize, and efficiently deliver relevant information so as to alleviate the problem of information overload. Recommender systems [1] have grown rapidly because they can meet users' ambiguous requirements. They use statistical methods and knowledge-discovery technology to provide users with personalized content and services by searching through large volumes of dynamically generated information. Various approaches for building recommender systems have been developed, based on collaborative filtering, content-based filtering, or hybrid filtering [2-4]. Among these filtering techniques, collaborative filtering is the most mature and the most commonly implemented. Collaborative filtering techniques can be divided into two classes: model-based filtering and memory-based filtering. Model-based filtering learns a model from the user-item ratings, which can be computed offline; once the model is generated, prediction is easy and fast. Many model-based techniques have been proposed, such as Latent Semantic Indexing (LSI) [5], decision trees [6], Bayesian network models [7], and cluster models [8, 9]. Usually, model-based algorithms have better scalability but lower accuracy than memory-based algorithms. Although collaborative filtering is commonly used, it still faces one crucial unsolved issue, the data sparsity problem [10-13], which leads to nonoptimal nearest neighbors because the core of the collaborative filtering algorithm is to find the k-nearest neighbors [14-18]. For lack of reference rating values, this neighbor-search step causes large inaccuracy.
In the traditional collaborative filtering algorithm, similarity metrics such as cosine, Pearson correlation, and adjusted cosine are used to calculate the similarity between users or items [19-22]. All of them perform poorly when applied to big data with high sparsity. This paper proposes a new algorithm that considers both user similarity and item similarity. Matrix prefilling, a preprocessing method based on association rules, has not previously been applied to similarity measurement. Experimental results on a real dataset show that the proposed model generates more accurate predictions than the traditional ones. The remainder of this paper is organized as follows. Section 2 gives a brief introduction to association rules, whose concepts and algorithms are used in Section 3 to propose a new algorithm. Section 3 focuses on the algorithm for wireless and mobile networking, which is the highlight of this paper. Experimental results and analyses are presented in Section 4. Section 5 concludes the paper.

2. Association Rules

2.1. Related Concepts of Association Rules

A transaction database D is the set of all transactions, and I = {i1, i2, ..., im} is the set of all items in D [23-25]. Every transaction T in D contains a set of items that is a subset of I. An item set is a collection of zero or more items; an item set containing k items is called a k-item set. Support count is an important property of an item set: it indicates the number of transactions that contain the item set. σ(X), the support count of item set X, is defined as follows:

σ(X) = |{T | X ⊆ T, T ∈ D}|,

where |·| denotes the number of elements in a collection.

A rule is defined as an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. Support and confidence are two important measures of association rules. Support indicates the frequency of the rule in a dataset. It is defined as

support(X ⇒ Y) = σ(X ∪ Y) / N,

where N is the total number of transactions.

The confidence of a rule X ⇒ Y is the proportion of transactions containing X that also contain Y. It is defined as

confidence(X ⇒ Y) = σ(X ∪ Y) / σ(X).

Rules with low support may occur only occasionally and are meaningless in most cases, so support is often used to delete such rules. Confidence measures the accuracy of an association rule: if the confidence of X ⇒ Y is high, Y is more likely to appear in the transactions that contain X.
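The two measures above can be illustrated with a small sketch. The transaction database below is a hypothetical toy example (items A-E), not data from the paper's experiments:

```python
# Toy transaction database; each transaction is a set of items.
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C", "D"},
]

def support_count(itemset, db):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in db if itemset <= t)

def support(antecedent, consequent, db):
    """support(X => Y) = sigma(X u Y) / N."""
    return support_count(antecedent | consequent, db) / len(db)

def confidence(antecedent, consequent, db):
    """confidence(X => Y) = sigma(X u Y) / sigma(X)."""
    return support_count(antecedent | consequent, db) / support_count(antecedent, db)

print(support({"A"}, {"B"}, transactions))     # 3/5 = 0.6
print(confidence({"A"}, {"B"}, transactions))  # 3/4 = 0.75
```

For the rule {A} ⇒ {B}, three of the five transactions contain both A and B, so the support is 0.6, and three of the four transactions containing A also contain B, so the confidence is 0.75.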

2.2. Apriori Algorithm

Apriori is a typical algorithm that generates candidate sets. It uses support-based pruning and a level-wise, breadth-first search to discover the frequent item sets, using the two properties below to compress the search space.

Lemma 1. If the item set X is frequent, then all nonempty subsets of X are frequent too.

Lemma 2. If the item set X is infrequent, then all supersets of X are infrequent too.

Candidate item set generation is a very critical step. It should ensure that the candidate item sets are complete while avoiding too many unnecessary candidates. This step consists of two parts.

(1) In the join step, two frequent (k-1)-item sets L1 and L2 are joined to generate a candidate k-item set, provided that the first k-2 items of L1 and L2 are identical. The candidate k-item set is then composed of those first k-2 items together with the last item of L1 and the last item of L2.

(2) In the pruning step, unnecessary candidates are deleted. According to Lemmas 1 and 2, for each candidate k-item set generated, we examine whether all its (k-1)-subsets are frequent; if not, it is removed from the candidate k-item sets.

Apriori algorithm effectively filters the unnecessary candidates. It will get a good data mining result, especially for short pattern data. However, the weakness is that the database needs to be scanned many times. It will produce tremendous I/O cost. Another weakness is that a lot of candidate item sets may be generated. It will cost a lot of time and memory space.
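The join and prune steps described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; item sets are represented as sorted tuples and the input frequent (k-1)-item sets are hypothetical:

```python
from itertools import combinations

def apriori_gen(frequent_km1):
    """Generate candidate k-item sets from frequent (k-1)-item sets:
    join pairs sharing their first k-2 items, then prune any candidate
    that has an infrequent (k-1)-subset (Lemmas 1 and 2)."""
    frequent = set(frequent_km1)                     # sorted tuples
    k = len(next(iter(frequent))) + 1 if frequent else 0
    items = sorted(frequent)
    candidates = set()
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            l1, l2 = items[i], items[j]
            if l1[:-1] == l2[:-1]:                   # join step
                cand = tuple(sorted(set(l1) | set(l2)))
                # prune step: every (k-1)-subset must be frequent
                if all(tuple(s) in frequent for s in combinations(cand, k - 1)):
                    candidates.add(cand)
    return candidates

# Hypothetical frequent 2-item sets over items a, b, c, d:
L2 = {("a", "b"), ("a", "c"), ("b", "c"), ("b", "d")}
print(apriori_gen(L2))  # {('a', 'b', 'c')}
```

Here ("b", "c") and ("b", "d") join to ("b", "c", "d"), but it is pruned because its subset ("c", "d") is not frequent, leaving only ("a", "b", "c").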

2.3. FP-Growth Algorithm

FP-Growth is a classic algorithm that does not generate candidate item sets. It compresses the data into a structure called the FP-tree, and the frequent item sets are discovered by recursively searching the FP-tree.

The process of FP-Growth mainly consists of two steps.

(1) Constructing the FP-Tree. In the first database scan, the items that satisfy the minimum support are selected and placed into a header table, sorted in descending order of support. In the second scan, the items contained in each transaction are sorted according to their order in the header table and inserted into the FP-tree, with identical paths in the tree merged.

(2) Discovering Frequent Item Sets by Searching the FP-Tree. If the FP-tree contains only one path, all possible item sets are enumerated. Otherwise, for each item in the header table, its conditional pattern base is created so as to construct the conditional FP-tree. The recursion stops when the tree is empty.
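Step (1) can be sketched as follows. The node structure and the transactions are hypothetical, chosen only to illustrate the two scans and the header-table ordering:

```python
from collections import Counter, defaultdict

class FPNode:
    """One FP-tree node: item label, prefix count, parent link, children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fptree(transactions, minsup_count):
    # First scan: count items and build the header order (descending support).
    counts = Counter(i for t in transactions for i in t)
    header = [i for i, c in counts.most_common() if c >= minsup_count]
    order = {item: rank for rank, item in enumerate(header)}
    root = FPNode(None, None)
    links = defaultdict(list)            # header table: item -> node links
    # Second scan: sort each transaction by the header order and insert it;
    # shared prefixes merge into the same path.
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in order), key=order.get):
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                links[item].append(child)
            child.count += 1
            node = child
    return root, header, links

db = [["a", "b"], ["b", "c", "d"], ["a", "b", "c"], ["b", "d"]]
root, header, links = build_fptree(db, 2)
print(header)  # ['b', 'a', 'c', 'd']
```

Because every transaction contains b, all four paths share the root child b with count 4, which is exactly the compression the text describes.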

The FP-Growth algorithm scans the database only twice and avoids generating candidate item sets, but its weakness is that when the database is huge, the FP-tree becomes too large and may not even fit in memory, because all the records in the database are compressed into the FP-tree.

3. The Improved Apriori Algorithm Based on Matrix

To avoid the weaknesses of the Apriori algorithm, this paper proposes an improved algorithm on its basis: the transaction database is converted to a Boolean matrix, and unnecessary rows and columns of the matrix are deleted to reduce the scale of the data.

3.1. Related Concepts

Association rules usually focus on transaction databases. If the transaction database is converted to a Boolean matrix, then on the one hand the database need be scanned only once, reducing the I/O cost, and on the other hand memory consumption may fall when the data is stored in the form of 0s and 1s.

Definition 3. Let I = {i1, i2, ..., im} be an item set and D = {T1, T2, ..., Tn} be the set of transactions in the database, each transaction in D having a unique transaction id called TID. The transactions are converted into a Boolean matrix B = (b_jk) with n rows and m columns as follows: for each Tj ∈ D and ik ∈ I,

b_jk = 1 if ik ∈ Tj, and b_jk = 0 otherwise.

An example of a transaction database is in Table 1. The Boolean matrix of the database is in Table 2.

The k-th column vector of the Boolean matrix is defined as Ak = (b_1k, b_2k, ..., b_nk). The support count of item ik is

σ(ik) = Σ_{j=1}^{n} b_jk.

For a k-item set X = {i_{x1}, i_{x2}, ..., i_{xk}}, its support count is

σ(X) = Σ_{j=1}^{n} (b_{j,x1} ∧ b_{j,x2} ∧ ... ∧ b_{j,xk}),

where ∧ is the "and" operation: only when b_{j,x1}, ..., b_{j,xk} are simultaneously 1 is the support count incremented by 1.
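The conversion and the column-wise "and" can be illustrated as follows. The five transactions and items i1-i4 are hypothetical, and plain Python lists stand in for the Boolean matrix:

```python
# Hypothetical 5-transaction database over items i1..i4.
transactions = [
    {"i1", "i2"},
    {"i2", "i3"},
    {"i1", "i2", "i3"},
    {"i1", "i4"},
    {"i2", "i3", "i4"},
]
items = ["i1", "i2", "i3", "i4"]

# One scan of the database builds the Boolean matrix:
# B[j][k] = 1 iff transaction j contains item k.
B = [[1 if item in t else 0 for item in items] for t in transactions]

def support_count(cols):
    """sigma of an item set: 'and' the chosen columns row by row and
    count the rows where every selected entry is 1."""
    return sum(1 for row in B if all(row[k] for k in cols))

print(support_count([0]))     # sigma({i1}) = 3
print(support_count([1, 2]))  # sigma({i2, i3}) = 3
```

All later support counts are computed on this matrix, so the database itself is never read again.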

Lemma 4. If the number of "1"s contained in a row of the Boolean matrix is less than k, then that row can be deleted when counting the support of k-item sets.

By the definition of support count, if the number of "1"s contained in a row is less than k, then for any k-item set X there exists an item of X whose entry in that row is 0, so the conjunction over the row is 0. Therefore this row makes no contribution to the support count of any k-item set.

Lemma 5. If an item i appears in fewer than k of the frequent k-item sets, then the column of i can be deleted in the process of frequent (k+1)-item set generation.

Let X be a frequent (k+1)-item set; then all its k-subsets are frequent. Each item i ∈ X appears in exactly k of these k-subsets, so every item of X appears in at least k frequent k-item sets. Hence, if the number of frequent k-item sets in which i appears is less than k, then i cannot be an element of any frequent (k+1)-item set.

3.2. The Search for k-Nearest Neighbors

After the process above, user similarity is taken into account. The similarity sim(u, v) of user u and user v is computed as in (8), where I denotes the set of all items.

For each user u, finding the k-nearest neighbors means finding a user set {v1, v2, ..., vk} such that sim(u, v1) has the highest value, sim(u, v2) the second highest, and so on.

3.3. The Generation of Recommendation

After the step of finding the k-nearest neighbors, the next step is to generate recommendations. Let the set of k-nearest neighbors of user u be N(u) and the rating that user v gives to item i be r_{v,i}; the predicted rating is calculated as follows:
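As a sketch of this step, the code below assumes cosine similarity over co-rated items and a mean-centred weighted average over the neighbors, which are common choices for k-nearest-neighbor collaborative filtering; they are assumptions for illustration, not necessarily the paper's exact equation (8) and prediction rule. The rating matrix is hypothetical:

```python
from math import sqrt

# Hypothetical user-item ratings; an absent entry means "unrated".
ratings = {
    "u1": {"i1": 5, "i2": 3, "i3": 4},
    "u2": {"i1": 4, "i2": 3, "i3": 5},
    "u3": {"i1": 1, "i2": 5},
}

def sim(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    num = sum(ratings[u][i] * ratings[v][i] for i in common)
    den = (sqrt(sum(ratings[u][i] ** 2 for i in common))
           * sqrt(sum(ratings[v][i] ** 2 for i in common)))
    return num / den

def predict(u, item, k=2):
    """Mean-centred weighted average over the k most similar
    neighbours that have rated the item."""
    mean_u = sum(ratings[u].values()) / len(ratings[u])
    neighbours = sorted((v for v in ratings if v != u and item in ratings[v]),
                        key=lambda v: sim(u, v), reverse=True)[:k]
    num = sum(sim(u, v) * (ratings[v][item] - sum(ratings[v].values()) / len(ratings[v]))
              for v in neighbours)
    den = sum(abs(sim(u, v)) for v in neighbours)
    return mean_u if den == 0 else mean_u + num / den

print(round(predict("u3", "i3"), 2))  # 3.53
```

For u3, the neighbours u2 and u1 both rate i3; only u2 deviates from its own mean, so the prediction lands slightly above u3's mean of 3.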

3.4. Description of the Improved Algorithm

The process of the improved algorithm is described in Algorithm 1. First the database is converted to a Boolean matrix; then, according to Lemmas 4 and 5, the unnecessary rows and columns of the matrix are deleted as the algorithm runs.

Input: dataset D, minimum support minsup
Output: all the item sets that satisfy minsup
(1) Scan transaction database D and convert it to a Boolean matrix.
(2) Calculate the support of every 1-item set. Item sets whose support is not less than minsup
   compose the frequent 1-item sets L1. Delete the columns of the infrequent items. Delete the rows
   in which the number of "1"s contained is less than 2.
(3) for (k = 2; L_{k-1} is not empty; k++) do begin
(4)   Combine the items of each column and generate candidate k-item sets C_k.
(5)   Calculate the support of C_k.
(6)   Item sets whose support is not less than minsup compose the frequent k-item sets L_k.
(7)   Delete the columns of the items which are contained only in infrequent item sets or which
   appear in L_k fewer than k times.
(8)   Delete the rows in which the number of "1"s contained is less than k + 1.
(9) end

The improved algorithm based on matrix is shown in Algorithm 1.
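The steps of Algorithm 1 can be sketched as follows. This is a minimal Python rendering (the paper's experiments used R), with the Boolean matrix stored as one item set per row and the transactions hypothetical:

```python
from itertools import combinations

def matrix_apriori(transactions, minsup_count):
    """Matrix-based Apriori: one database scan builds the Boolean matrix,
    then columns (Lemma 5) and rows (Lemma 4) are deleted as k grows."""
    rows = [set(t) for t in transactions]       # Boolean matrix, one scan
    items = sorted({i for r in rows for i in r})
    frequent = [frozenset([i]) for i in items
                if sum(1 for r in rows if i in r) >= minsup_count]
    all_frequent, k = [], 1
    while frequent:
        all_frequent.extend(frequent)
        # Lemma 5: keep only items appearing in >= k frequent k-item sets.
        counts = {}
        for fs in frequent:
            for i in fs:
                counts[i] = counts.get(i, 0) + 1
        kept = {i for i, c in counts.items() if c >= k}
        # Lemma 4: rows with fewer than k+1 ones cannot support a (k+1)-item set.
        rows = [r & kept for r in rows]
        rows = [r for r in rows if len(r) >= k + 1]
        # Join + prune as in Apriori; support counted on the shrunken matrix.
        freq_set = set(frequent)
        candidates = {a | b for a, b in combinations(frequent, 2)
                      if len(a | b) == k + 1
                      and all(frozenset(s) in freq_set
                              for s in combinations(a | b, k))}
        k += 1
        frequent = [c for c in candidates
                    if sum(1 for r in rows if c <= r) >= minsup_count]
    return all_frequent

db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = matrix_apriori(db, 2)
print(len(result))  # 7 frequent item sets: a, b, c, ab, ac, bc, abc
```

Note how the matrix shrinks: by the time the 3-item set {a, b, c} is counted, only the two rows with three or more "1"s are left, which is the efficiency gain the text describes.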

3.5. Evaluation Criteria

Not all association rules are useful, so it is necessary to select the association rules in which we are interested. Support and confidence are two basic criteria for evaluating whether an association rule is useful, but in some cases they may give an unexpected suggestion. This paper therefore also uses the criterion called lift. The lift of a rule X ⇒ Y is defined as

lift(X ⇒ Y) = confidence(X ⇒ Y) / support(Y).

Lift is the ratio of a rule's confidence to the consequent's support. If the value of lift is 1, X and Y are independent; if it is above 1, X and Y are positively correlated; if it is below 1, X and Y are negatively correlated.
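The lift computation can be sketched on a toy database (hypothetical items x, y, z):

```python
# Toy transactions for illustrating lift.
transactions = [{"x", "y"}, {"x", "y"}, {"x"}, {"y"}, {"z"}]

def sup(itemset):
    """Fraction of transactions containing the item set."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def lift(x, y):
    """lift(X => Y) = confidence(X => Y) / support(Y)
                    = sup(X u Y) / (sup(X) * sup(Y))."""
    return sup(x | y) / (sup(x) * sup(y))

print(round(lift({"x"}, {"y"}), 2))  # 0.4 / (0.6 * 0.6) = 1.11
```

Here lift is slightly above 1, so x and y are weakly positively correlated; a rule with high support and confidence but lift equal to 1 would be discarded as uninteresting.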

3.6. Performance Analysis

Compared with the Apriori algorithm, the improved algorithm scans the database only once: it converts the transaction database to a matrix, and the remaining steps operate on the matrix without scanning the database again, which reduces the I/O cost. The other advantage of the improved algorithm is that the scale of the data to be dealt with shrinks as the algorithm runs. In the process of frequent item set generation, the columns of items that cannot be contained in frequent item sets and the rows that make no contribution to the support count are deleted, so the matrix becomes smaller and smaller and the efficiency improves considerably. In addition, when a transaction contains many items, a Boolean matrix occupies less memory space than a transaction list.

4. Results and Analysis

To assess the performance of the improved algorithm, this paper uses both the Apriori algorithm and the improved algorithm to mine frequent item sets from different agricultural databases. The experiments were performed on an Intel i5-2450 processor at 2.5 GHz with 4 GB of memory, running Windows 8. The algorithms were coded in the R language.

Table 3 and Figure 1 show the performance of the two algorithms in the UCI dataset named mushroom. The dataset contains 7847 records and 118 items. The minimum confidence is set to be 0.5 and the minimum support is set, respectively, to be 0.60, 0.65, 0.70, 0.75, and 0.80.

Table 4 and Figure 2 show the performance of the two algorithms in the UCI dataset named soybean. The dataset contains 5264 records and 655 items. The minimum confidence is set to be 0.5 and the minimum support is set, respectively, to be 0.75, 0.76, 0.77, 0.78, 0.79, and 0.80.

The results show that the runtime of the improved algorithm is less than that of the Apriori algorithm; the improved algorithm is more efficient.

The evaluation measure lift is used to optimize the mining result. A subset of the mining result on the mushroom dataset is shown in Algorithm 2: association rules whose support and confidence are high but whose lift is 1. This means the antecedent and consequent are independent, so these association rules are not the rules this paper expects, even though they have high support and confidence.

Algorithm 2: a subset of the rules mined from the mushroom dataset, each with high support and confidence but lift equal to 1.

5. Conclusion

To avoid the weaknesses of the Apriori algorithm, this paper proposes an improved algorithm based on a Boolean matrix and applies it to agricultural datasets. Experimental results show that the improved algorithm can efficiently discover useful association rules, because the database is scanned only once and the data to be dealt with shrinks as the algorithm runs. The improved algorithm is more applicable when the database is huge; when the database is small, it is not as efficient as Apriori, because the data scale is already small while the improved algorithm performs an extra operation to convert the database to a matrix. Further research should focus on optimizing the proposed algorithm so as to further improve its efficiency on big data; algorithm parallelization can be taken into account. Our future work is therefore to improve the algorithm so that it is applicable to more kinds of databases. Besides, new evaluation criteria can be used to optimize the mining result.

Conflicts of Interest

There are no conflicts of interest.