Abstract

Data mining is traditionally adopted to retrieve and analyze knowledge from large amounts of data. Private or confidential data may be sanitized or suppressed before it is shared or published in public. Privacy preserving data mining (PPDM) has thus become an important issue in recent years. The most general way of PPDM is to sanitize the database to hide the sensitive information. In this paper, a novel hiding-missing-artificial utility (HMAU) algorithm is proposed to hide sensitive itemsets through transaction deletion. The transaction with the maximal ratio of sensitive to nonsensitive itemsets is selected to be entirely deleted. Three side effects of hiding failures, missing itemsets, and artificial itemsets are considered to evaluate whether the transactions are required to be deleted for hiding sensitive itemsets. Three weights are also assigned to reflect the importance of the three factors and can be set according to the requirements of users. Experiments are then conducted to show the performance of the proposed algorithm in terms of execution time, number of deleted transactions, and number of side effects.

1. Introduction

With the rapid growth of data mining technologies in recent years, useful information can be easily mined to aid managers or decision-makers in making efficient decisions or strategies. The derived knowledge can be broadly classified into association rules [1–5], sequential patterns [6–8], classification [9, 10], clustering [11, 12], and utility mining [13–16], among others. Among these, association-rule mining is the most commonly used to determine the relationships of purchased items in large datasets.

Traditional data mining techniques analyze databases to find potential relations among items. Some applications require protection against the disclosure of private, confidential, or secure data. Privacy preserving data mining (PPDM) [17] was thus proposed to reduce privacy threats by hiding sensitive information while allowing the required information to be mined from databases. Private information includes personal or confidential business information, such as social security numbers, home addresses, credit card numbers, credit ratings, purchasing behavior, and best-selling commodities. In PPDM, data sanitization is generally used to hide sensitive information with minimal side effects, keeping the original database as authentic as possible. The intuitive way for data sanitization to hide sensitive information is to directly delete the sensitive information from the data. Three side effects of hiding failure, missing cost, and artificial cost are generated in the data sanitization process, but most approaches are designed to evaluate these side effects only partially. Infrequent itemsets are, moreover, not considered in the evaluation process, thus raising the probability that artificial itemsets are caused. Besides, the differences between the minimum support threshold and the frequencies of the itemsets to be hidden are not considered in the above approaches.

In this paper, a hiding-missing-artificial utility (HMAU) algorithm is proposed for evaluating the processed transactions to determine whether they are required to be deleted for hiding sensitive itemsets by considering three dimensions: the hiding failure dimension (HFD), the missing itemset dimension (MID), and the artificial itemset dimension (AID). The weight of each dimension in the evaluation process can be adjusted by users. Experimental results show that the proposed HMAU algorithm has good performance in execution time and the number of deleted transactions. Besides, the proposed algorithm generates fewer side effects of the three factors than the previous transaction-deletion algorithm for hiding sensitive itemsets.

This paper is organized as follows. Some related works are reviewed in Section 2, including the data mining techniques, the privacy preserving data mining, and the evaluation criteria of PPDM. The proposed HMAU algorithm to hide the sensitive itemsets through transaction deletion is stated in Section 3. An illustrated example of the proposed HMAU algorithm is given step by step in Section 4. Experiments are conducted in Section 5. Conclusion and future works are mentioned in Section 6.

2. Related Works

In this section, privacy preserving data mining (PPDM) techniques and the evaluation criteria of PPDM are respectively reviewed.

2.1. Privacy Preserving Data Mining Techniques

Data mining is used to extract useful rules from large amounts of data. Agrawal and Srikant proposed the Apriori algorithm to mine association rules in two phases: the frequent itemsets are first generated, and the association rules are then derived from them [3]. Han et al. then proposed the frequent-pattern-tree (FP-tree) structure for efficiently mining association rules without the generation of candidate itemsets [18]. The FP-tree was used to compress a database into a tree structure that stored only the large items. It was condensed and complete for finding all the frequent patterns. The construction process was executed tuple by tuple, from the first transaction to the last one. After that, a recursive mining procedure called FP-Growth was executed to derive frequent patterns from the FP-tree.

Through various data mining techniques, information can thus be efficiently discovered. The misuse of these techniques may, however, lead to privacy concerns and security problems. Privacy preserving data mining (PPDM) has thus become a critical issue for hiding private, confidential, or secure information. Most commonly, the original database is sanitized for hiding sensitive information [19–21].

In data sanitization, it is intuitive to directly delete sensitive data for hiding sensitive information. O'Leary found that data mining techniques can pose security and privacy threats [22]. Amiri proposed the aggregate, disaggregate, and hybrid approaches to, respectively, determine whether the transactions or the items are to be deleted for hiding sensitive information [23]. The approaches considered the ratio of sensitive itemsets to nonsensitive frequent itemsets to evaluate the side effects of hiding failures and missing itemsets. Oliveira and Zaïane designed the sliding window algorithm (SWA) [24], in which the victim item with the highest frequency in the sensitive rules related to the current sensitive transaction is selected. Victim items are removed from the sensitive transaction until the disclosure threshold equals 0. Hong et al. proposed a lattice-based algorithm to hide the sensitive information through itemset deletion, using a lattice structure to speed up the sanitization process [25]. All the sensitive itemsets are first used to build the lattice structure. The sensitive itemsets are then gradually deleted bottom-up, from the lowest levels to the highest ones, until the frequencies of the sensitive itemsets are lower than the minimum support threshold. Different strategies for hiding sensitive itemsets are still being designed to find better results in terms of the side effects and the dissimilarity of the database [21, 26–30].

2.2. Evaluation Criteria

In data sanitization, the primary goal is to hide the sensitive information with minimal influence on the database. Three side effects of hiding failures, missing itemsets, and artificial itemsets are used to evaluate the performance of data sanitization and data distortion [28, 31, 32] in PPDM. The relationships between the side effects and the itemsets mined from the original database and the sanitized one are shown in Figure 1.

In Figure 1, let F denote the frequent itemsets mined from the original database, F′ the frequent itemsets mined from the sanitized database, and S the sensitive itemsets that should be hidden (the symbols are introduced here for readability). Hiding failures are the sensitive itemsets that fail to be hidden; they form the intersection of S and F′. Missing itemsets are the nonsensitive frequent itemsets that are mistakenly removed; they form the difference (F − S) − F′. Artificial itemsets are the itemsets that are unexpectedly generated; they form the difference F′ − F. In PPDM, it is intuitive to delete transactions with sensitive itemsets in the sanitization process. In this paper, the three side effects, with adjustable weights, are considered to evaluate whether the processed transactions are required to be deleted. Besides the above side effects, the number of deleted transactions or items is also a criterion to evaluate the data distortion [32, 33].
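The three side effects can be computed directly by comparing the itemsets mined before and after sanitization. The following sketch (in Python for illustration; the paper's experiments were coded in C++) assumes each collection is represented as a set of frozensets:

```python
def side_effects(freq_original, freq_sanitized, sensitive):
    """Count hiding failures, missing itemsets, and artificial itemsets.

    freq_original  -- frequent itemsets mined from the original database
    freq_sanitized -- frequent itemsets mined from the sanitized database
    sensitive      -- sensitive itemsets that should be hidden
    """
    hiding_failures = sensitive & freq_sanitized             # still frequent after sanitization
    missing = (freq_original - sensitive) - freq_sanitized   # nonsensitive itemsets lost
    artificial = freq_sanitized - freq_original              # itemsets that newly appear
    return len(hiding_failures), len(missing), len(artificial)
```

For example, a nonsensitive frequent itemset that disappears after sanitization is counted as a missing itemset, while an itemset that is frequent only in the sanitized database is counted as artificial.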

3. Proposed Hiding-Missing-Artificial Utility Algorithm

3.1. Definition of Formulas

Data sanitization is the most common way to protect sensitive knowledge from disclosure in PPDM. To avoid the side effects of hiding failures, missing itemsets, and artificial itemsets, minimal distortion of the database is necessary. In this paper, a hiding-missing-artificial utility (HMAU) algorithm is proposed to hide sensitive itemsets through transaction deletion. Three dimensions, the hiding failure dimension (HFD), the missing itemset dimension (MID), and the artificial itemset dimension (AID), are considered to evaluate whether the transactions are required to be deleted for hiding the sensitive itemsets. The transactions containing any of the sensitive itemsets are first evaluated by the designed algorithm to find the minimal HMAU value among the transactions. The transaction with the minimal HMAU value is directly removed from the database. The procedure is repeated until all sensitive itemsets are hidden. In order to avoid exposing the already hidden sensitive itemsets again, the minimum count is dynamically updated during the deletion procedure.

The value of each dimension lies in the range 0 < value ≤ 1. In the proposed formulas, the differences between the minimum support threshold and the frequencies of the sensitive itemsets are considered to evaluate whether the transactions are required to be deleted, instead of only the presence of the itemsets in the transactions.

First, the HFD is used to evaluate the hiding failures of each processed transaction in the sanitization process. When a processed transaction contains a sensitive itemset, the HFD value of the processed transaction is calculated by formula (1), where the parameter is defined as the percentage of the minimum support threshold, the sensitive itemset is taken from the set of sensitive itemsets HS, MAXHS is the maximal count of the sensitive itemsets in HS, the database size is the number of transactions in the original database, and the count is the occurrence frequency of the sensitive itemset.

Second, the MID is used to evaluate the missing itemsets of each processed transaction in the sanitization process. When a processed transaction contains a frequent itemset, the MID value of the processed transaction is calculated by formula (2), where the frequent itemset is taken from the set of large (frequent) itemsets FI, MAXFI is the maximal count of the large itemsets in FI, and the count is the occurrence frequency of the large itemset.

Third, the AID is used to evaluate the artificial itemsets of each processed transaction in the sanitization process. In AID, only the small 1-itemsets are considered in the sanitization process, since it is a nontrivial task to keep all infrequent itemsets. When a processed transaction contains a small 1-itemset, the AID value of the processed transaction is calculated by formula (3), where the small 1-itemset is taken from the set of small 1-itemsets, the minimal count of the small 1-itemsets in that set is used, and the count is the occurrence frequency of the small 1-itemset.

In this paper, a risky bound is designed to speed up the execution of the proposed HMAU algorithm by avoiding the evaluation of all large itemsets and small 1-itemsets when computing MID and AID. A parameter is set as the percentage used to find the upper and lower boundaries around the minimum support threshold. Only the large itemsets and infrequent 1-itemsets within the boundaries are used to determine whether the processed transactions are required to be deleted. For the large itemsets, the minimum support threshold is set as the lower boundary, and the upper boundary is raised above it by the risky bound, based on the number of transactions in the original database, the minimum support threshold, the risky bound, and the occurrence frequency of the large itemset.

For the small 1-itemsets, the minimum support threshold is set as the upper boundary, and the lower boundary is lowered below it by the risky bound, based on the occurrence frequency of the small 1-itemset.

The flowchart of the proposed HMAU algorithm is depicted in Figure 2.

3.2. Notation

See Table 1.

Details of the proposed HMAU algorithm are illustrated as follows.

Proposed HMAU Algorithm.

Input. This includes an original database D, a minimum support threshold ratio, a risky bound, a set of large (frequent) itemsets FI, a set of small (infrequent) 1-itemsets, and a set of sensitive itemsets HS to be hidden.

Output. This includes a sanitized database with no sensitive information.

Step  1. Select the transactions to form a projected database, where each transaction in the projected database contains at least one of the sensitive itemsets in HS.

Step  2. Process each frequent itemset in the set FI to determine whether its frequency satisfies the boundary condition, where the database size is the number of transactions in the original database D and the count is the occurrence frequency of the large itemset. Put the itemsets that do not satisfy the condition into the set FItmp.

Step  3. Process each small 1-itemset in the set of small 1-itemsets to determine whether its frequency satisfies the boundary condition, where the count is the occurrence frequency of the small 1-itemset. Put the 1-itemsets that do not satisfy the condition into the corresponding temporary set.

Step  4. Calculate the maximal count (MAXHS) of the sensitive itemsets in the set HS, where the count is the occurrence frequency of each sensitive itemset in HS.

Step  5. Calculate the HFD of each transaction in the projected database. Do the following substeps.

Substep  5.1. Calculate the HFD of each sensitive itemset within the transaction.

Substep  5.2. Sum the HFDs of the sensitive itemsets within the transaction.

Substep  5.3. Normalize the HFDs for all transactions in the projected database.

Step  6. Calculate the maximal count (MAXFI) of the large itemsets in the set FI.

Step  7. Calculate the MID of each transaction in the projected database. Do the following substeps.

Substep  7.1. Calculate the MID of each large itemset within the transaction.

Substep  7.2. Sum the MIDs of the large itemsets within the transaction.

Substep  7.3. Normalize the MIDs for all transactions in the projected database.

Step  8. Calculate the minimal count of the small 1-itemsets in the set of small 1-itemsets.

Step  9. Calculate the AID of each transaction in the projected database. Do the following substeps.

Substep  9.1. Calculate the AID of each small 1-itemset within the transaction.

Substep  9.2. Sum the AIDs of the small 1-itemsets within the transaction.

Substep  9.3. Normalize the AIDs for all transactions in the projected database.

Step  10. Calculate the HMAU of each transaction as the weighted sum of its HFD, MID, and AID values, where the three weights are predefined by users.

Step  11. Remove the transaction with the minimal HMAU value.

Step  12. Update the minimum count of the sanitized database.

Step  13. Update the occurrence frequencies of all sensitive itemsets in the sets HS and HStmp. Put a sensitive itemset into the set HStmp if its count is below the minimum count, and put it into the set HS otherwise.

Step  14. Update the occurrence frequencies of all large itemsets in the sets FI and FItmp. Put a large itemset into the set FItmp if its count is below the minimum count, and put it into the set FI otherwise.

Step  15. Update the occurrence frequencies of all small 1-itemsets in the set of small 1-itemsets and its temporary set. Put a 1-itemset into the temporary set if its count is not less than the minimum count, and put it into the set of small 1-itemsets otherwise.

Step  16. Repeat Step  2 to Step  15 until the set HS is empty.
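The control flow of Steps 1–16 can be sketched as follows. The per-transaction scores below are simple stand-ins for the paper's formulas (1)–(3), which define HFD, MID, and AID precisely; only the delete-and-update loop is intended to mirror the algorithm:

```python
import math

def hmau_sanitize(db, sensitive, delta, weights=(0.5, 0.4, 0.1)):
    """Sketch of the HMAU transaction-deletion loop (Steps 1-16).

    db        -- list of transactions, each a set of items
    sensitive -- itemsets (sets of items) to hide
    delta     -- minimum support threshold ratio
    """
    db = [set(t) for t in db]
    sensitive = [frozenset(s) for s in sensitive]

    def count(itemset):
        return sum(1 for t in db if itemset <= t)

    while True:
        # Step 12: the minimum count is updated as the database shrinks
        min_count = math.ceil(len(db) * delta)
        exposed = [s for s in sensitive if count(s) >= min_count]
        if not exposed:            # Step 16: every sensitive itemset is hidden
            return db
        # Step 1: project the transactions containing a sensitive itemset
        candidates = [i for i, t in enumerate(db) if any(s <= t for s in exposed)]
        w_h, w_m, w_a = weights

        def hmau(i):
            t = db[i]
            hfd = sum(1 for s in exposed if s <= t)  # stand-in for HFD (formula (1))
            mid = len(t)                             # stand-in for MID (formula (2))
            aid = 0.0                                # stand-in for AID (formula (3))
            # Step 10: weighted combination; lower values are better candidates
            return w_h * (-hfd) + w_m * mid + w_a * aid

        victim = min(candidates, key=hmau)
        del db[victim]             # Steps 11-15: delete and let counts refresh
```

The stand-in scores favor deleting a short transaction packed with sensitive itemsets, which is the intuition behind the three dimensions, but the real algorithm additionally normalizes each dimension over the projected database.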

4. An Illustrated Example

In this section, an example is used to illustrate the proposed algorithm step by step. Consider a database with 10 transactions (tuples) and 6 items shown in Table 2. Each transaction can be considered as a set of purchased items in a trade. The minimum support threshold is initially set at 40%, and the risky bound is set at 10%. A set of sensitive itemsets, HS, is considered to be hidden by the sanitization process.

Based on an Apriori-like approach [3], the large (frequent) itemsets and small 1-itemsets are mined. The results are, respectively, shown in Tables 3 and 4.
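The mining step can be sketched with a level-wise Apriori-style routine that returns both the large itemsets and the small 1-itemsets needed later for AID (an illustrative Python sketch, not the paper's C++ implementation):

```python
import math

def mine_itemsets(db, delta):
    """Mine large (frequent) itemsets and small (infrequent) 1-itemsets.

    db    -- list of transactions, each a set of items
    delta -- minimum support threshold ratio
    Returns (fi, small_ones) as dicts mapping frozenset -> count.
    """
    min_count = math.ceil(len(db) * delta)
    counts = {}
    for t in db:                                  # count all 1-itemsets
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    fi = {k: c for k, c in counts.items() if c >= min_count}
    small_ones = {k: c for k, c in counts.items() if c < min_count}

    level, k = list(fi), 2
    while level:                                  # level-wise extension
        cands = {a | b for a in level for b in level if len(a | b) == k}
        level = []
        for cand in cands:
            c = sum(1 for t in db if cand <= t)
            if c >= min_count:                    # keep only frequent candidates
                fi[cand] = c
                level.append(cand)
        k += 1
    return fi, small_ones
```

With the settings above (a 40% minimum support threshold), an itemset is large when it appears in at least ⌈0.4 × |D|⌉ transactions; everything rarer at size 1 is collected as a small 1-itemset.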

The proposed algorithm then proceeds as follows to sanitize the database for hiding all sensitive itemsets in HS.

Step 1. The transactions containing any of the sensitive itemsets in HS are selected. In this example, transactions 1, 3, 6, 7, 8, and 10 are selected to form the projected database shown in Table 5.

Step 2. The frequent itemsets in FI are processed to check whether the boundary condition is satisfied. The itemsets that satisfy the condition are kept in FI; the remaining itemsets are put into the set FItmp.

Step 3. The infrequent 1-itemsets are then processed to check whether the boundary condition is satisfied. The itemset that satisfies the condition is kept; the other itemset is put into the corresponding temporary set.

Step 4. The maximal count (MAXHS) among the sensitive itemsets in the set HS is then calculated. In this example, the counts of the two sensitive itemsets are 6 and 4, so the maximal count is calculated as max(6, 4) (= 6).

Step 5. The HFD of each transaction is calculated to evaluate the side effects of hiding failures of the processed transaction. In this example, transaction 7 is used to illustrate the following steps. According to formula (1), the HFD of each sensitive itemset within transaction 7 is calculated, and the HFD of transaction 7 is their sum. The other transactions are processed in the same way. The results are shown in Table 6.
The HFDs for all transactions are then normalized as shown in Table 7.

Step 6. The maximal count (MAXFI) among the large itemsets in the set FI is then calculated. In this example, MAXFI is calculated as the largest of their counts (= 5).

Step 7. The MID of each transaction is calculated to evaluate the side effects of missing itemsets of the processed transaction. A frequent item in transaction 7 is used as an example to illustrate the steps. According to formula (2), the MID of the item is calculated, and the other frequent itemsets in transaction 7 are processed in the same way. The MID of transaction 7 is then calculated as the sum of these values. The other transactions are processed in the same way. The results are shown in Table 8.
The MIDs for all transactions are then normalized as shown in Table 9.

Step 8. The minimal count among the small 1-itemsets is then calculated. In this example, the set of small 1-itemsets contains only one itemset, so the minimal count is its occurrence frequency (= 3).

Step 9. The AID of each transaction is calculated to evaluate the side effects of artificial itemsets of the processed transaction. The small 1-itemset in transaction 7 is used as an example to illustrate the steps. According to formula (3), the AID of the small 1-itemset is calculated; since there is only one itemset in the set of small 1-itemsets, no other calculations are necessary, and the AID of transaction 7 is this value. The other transactions are processed in the same way. The results are shown in Table 10.
The AIDs for all transactions are then normalized as shown in Table 11.

Step 10. The three dimensions for evaluating the selected transactions are then organized as in Table 12. The weights of hiding failures, missing itemsets, and artificial itemsets are, respectively, set to 0.5, 0.4, and 0.1. Note that these values can be defined by users to decide the relative importance of the dimensions. In this example, the HMAU of transaction 7 is calculated as the weighted sum 0.5 × HFD + 0.4 × MID + 0.1 × AID of its normalized values.
The other transactions are processed in the same way. The results are shown in the last column of Table 12.

Step 11. The selected transactions in Table 12 are then evaluated to find a transaction with the minimal HMAU value. In this example, transaction 8 has the minimal value and is directly removed from Table 12.

Step 12. Transaction 8 is deleted from the dataset in this example. The minimum count is updated as ⌈(10 − 1) × 0.4⌉ (= 4).

Step 13. The occurrence frequencies of all sensitive itemsets in the sets HS and HStmp are, respectively, updated. Since transaction 8, which was deleted in Step 11, contained both sensitive itemsets, their counts in the set HS are, respectively, updated as (= 6 − 1) (= 5) and (= 4 − 1) (= 3). In this example, the set HStmp is empty, so nothing more needs to be done in this step. After the updating process, the itemset with count 3 is put into the set HStmp since its count is below the minimum count (3 < 4).

Step 14. The occurrence frequencies of all large itemsets in the sets FI and FItmp are, respectively, updated. Since transaction 8, which was deleted in Step 11, contained the large itemsets, their counts in the sets FI and FItmp are, respectively, updated as (= 5 − 1) (= 4), (= 7 − 1) (= 6), (= 8 − 1) (= 7), (= 5 − 1) (= 4), and (= 4 − 1) (= 3). After the updating process, the itemset with count 3 is put into the set FItmp since its count is below the minimum count (3 < 4).

Step 15. The occurrence frequencies of all small 1-itemsets in the set of small 1-itemsets and its temporary set are, respectively, updated. Since the deleted transaction 8 did not contain any of the small 1-itemsets, nothing is done in this step.

Step 16. In this example, one sensitive itemset is already hidden, but the occurrence frequency of the other sensitive itemset is still larger than the minimum count. Steps 2 to 15 are thus repeated until the set of sensitive itemsets HS is empty. After all steps are processed, the sanitized database is obtained as shown in Table 13.

Comparing the original database and the sanitized one, transactions 1, 3, 6, and 8 are removed from the original database, and the minimum count is updated as ⌈6 × 0.4⌉ (= 3). The updated frequent itemsets of the sanitized database are shown in Table 14.

Compared with the large itemsets in Table 3, the sensitive itemsets are hidden and no artificial itemset is generated. Three itemsets are, however, missing from the sanitized database. In this example, the side effects of hiding failures, missing itemsets, and artificial itemsets are 0, 3, and 0, respectively.

5. Experimental Results

Experiments were conducted to show the performance of the proposed HMAU algorithm compared to that of the aggregate algorithm [23] for hiding sensitive itemsets through transaction deletion. The experiments were coded in C++ and performed on a personal computer with an Intel Core i7-2600 processor at 3.40 GHz and 4 GB of RAM running 64-bit Microsoft Windows 7. The real database BMS-WebView-1 [34] and a synthetic database (T7I7N200D20K) [35] from the IBM data generator, in which T symbolizes the average length of the transactions, I symbolizes the average maximum size of the frequent itemsets, N symbolizes the number of distinct items, and D symbolizes the size of the database, were used in the experiments. The details of the two databases are shown in Table 15.

For the BMS-WebView-1 database, the minimum support thresholds were, respectively, set at 1% and 2% to evaluate the performance of the proposed approach, and the percentages of sensitive itemsets were sequentially set from 5% to 25% of the number of frequent itemsets in 5% increments. In the experiments, the weights of HFD, MID, and AID in the proposed algorithm were, respectively, set at 0.5, 0.4, and 0.1.

For the T7I7N200D20K database, the minimum support thresholds were, respectively, set at 1.5% and 3%, and the percentages of sensitive itemsets were sequentially set from 2.5% to 12.5% of the number of frequent itemsets in 2.5% increments. In the experiments, the weights of HFD, MID, and AID in the proposed algorithm were, respectively, set at 0.5, 0.4, and 0.1.

5.1. Comparisons of Execution Time

Figure 3 shows the execution times of the two algorithms on the BMS-WebView-1 database. The two algorithms are compared under different minimum support thresholds and various percentages of sensitive itemsets among the frequent itemsets.

The proposed HMAU algorithm runs faster than the aggregate algorithm whether the minimum support threshold is set at 1% or 2%. Experiments were then conducted on the T7I7N200D20K database, and the results are shown in Figure 4.

From Figures 3 and 4, it can be seen that the proposed HMAU algorithm is faster than the aggregate method on both databases.

5.2. Comparisons of Number of Deleted Transactions

Experiments were also conducted to evaluate the number of deleted transactions of the proposed algorithm in two different databases. For the BMS-WebView-1 database, the results are shown in Figure 5.

From Figure 5, it can be seen that the proposed HMAU algorithm deletes fewer transactions than the aggregate algorithm on the BMS-WebView-1 database, whether the minimum support threshold is set at 1% or 2%, thus achieving lower data distortion. For the T7I7N200D20K database, the results are shown in Figure 6.

From Figure 6, it can be seen that when the sensitive itemsets were set at 10% of the frequent itemsets with a 1.5% minimum support threshold on the T7I7N200D20K database, the proposed HMAU algorithm deleted more transactions for hiding the sensitive itemsets. Since the proposed HMAU algorithm considers the three dimensions together, the transactions selected for deletion may contain fewer sensitive itemsets, which can require more deletions in such cases.

5.3. Comparisons of Side Effects

Three side effects are then compared to show the performance of the proposed algorithm in two different databases.

The side effects of hiding failures, missing itemsets, and artificial itemsets are reported for each setting. In Table 16, it can be seen that when the minimum support threshold was set at 1%, the proposed HMAU algorithm produces no side effects, whereas the aggregate algorithm produces some artificial itemsets, since the criterion of artificial itemsets is not considered in the aggregate algorithm. Both algorithms produce no side effects when the minimum support threshold was set at 2%. The results for evaluating the side effects of the proposed HMAU algorithm on the T7I7N200D20K database are shown in Table 17.

From Table 17, it can be seen that when the minimum support threshold was set at 1.5%, the proposed HMAU algorithm produces fewer artificial itemsets and missing itemsets than the aggregate algorithm for various percentages of sensitive itemsets among the frequent itemsets. The proposed HMAU algorithm produces no side effects at a 3% minimum support threshold, whereas the aggregate algorithm produces some artificial itemsets.

To summarize the above results for BMS-WebView-1 and T7I7N200D20K databases, the proposed HMAU algorithm outperforms the aggregate algorithm in terms of the execution time, the number of deleted transactions, and the number of side effects.

6. Conclusion and Future Works

In this paper, the HMAU algorithm is proposed for hiding sensitive itemsets in the data sanitization process by reducing the side effects through transaction deletion. The formulas of the three dimensions, HFD, MID, and AID, are defined to evaluate the correlation between the processed transactions and the side effects. The weights of the three evaluation dimensions can be set according to users' interests. In the experiments, both a real dataset and a synthetic dataset are used to evaluate the performance of the proposed algorithm against the aggregate algorithm. Experimental results showed that the proposed HMAU algorithm outperforms the aggregate algorithm in terms of execution time, number of deleted transactions, and number of side effects.

In the future, the approach can be extended from hiding sensitive itemsets to hiding sensitive association rules. More considerations are then necessary to decrease not only the supports of the sensitive itemsets but also the confidences of the sensitive association rules. Other distortion approaches, such as noise addition and data modification, are also important issues for hiding sensitive information in PPDM.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was supported by the National Science Council of the Republic of China under Contract no. NSC-102-2923-E-390-001-MY3, and by the Natural Scientific Research Innovation Foundation in Harbin Institute of Technology under grant HIT.NSRIF.2014100.