Abstract

Data mining techniques are applied to identify hidden patterns in large amounts of patient data. These patterns can assist physicians in making more accurate diagnoses. For different physical conditions of patients, the same physiological index corresponds to a different symptom association probability for each patient, so patient data are uncertain. Data mining technologies based on certain data cannot be directly applied to such data. Patient data are also sensitive: an adversary with sufficient background information can use the patterns mined from uncertain medical data to obtain sensitive information about patients. In this paper, a new algorithm is presented to determine the top-$k$ most frequent itemsets from uncertain medical data while protecting data privacy. Based on traditional algorithms for mining frequent itemsets from uncertain data, our algorithm applies the sparse vector algorithm and the Laplace mechanism to ensure differential privacy for the top-$k$ most frequent itemsets and for the expected supports of these itemsets. We prove that our algorithm guarantees differential privacy. Moreover, we carry out experiments with four real-world datasets and two synthetic datasets. The experimental results demonstrate the performance of our algorithm.

1. Introduction

The Internet of Things (IoT) involves many different base technologies, such as wireless sensors, data management, and cloud computing [1]. Today, IoT technology is successfully applied in the field of eHealth [2–4]. Medical personnel can utilize IoT technology to collect large amounts of patient data that can assist them in providing better medical services to patients [5, 6].

Frequent itemsets mining is applied in fields such as eHealth and bioinformatics. Traditional algorithms for mining frequent itemsets from medical data are based on certain data [7] and can be applied to discover hidden symptom patterns from a huge amount of data on patient symptoms. These patterns can be used by health managers to provide better healthcare for users [8]. For example, in [9, 10], the Apriori algorithm was applied to identify prevalent diseases and analyze medical billing. However, the Apriori algorithm mines frequent itemsets from certain data. In medicine, for different physical conditions of patients, the same physiological index corresponds to a different symptom association probability for each patient. As a result, there is uncertainty in patient data. Therefore, traditional algorithms for mining frequent itemsets from certain data cannot be directly applied to patient data.

Another important factor is that medical records contain sensitive patient information. An adversary with sufficient background information can make use of frequent patterns mined from patient data to obtain the sensitive information of patients. Hence, it is very important to protect patient privacy when mining frequent itemsets from medical data [11].

The set of symptoms that a patient suffers from constitutes the patient’s data. Because of the probabilities associated with these symptoms, there is uncertainty in patient data, and a large amount of patient data constitutes uncertain data. In the field of medicine, there are many studies on symptom association probability. For example, one study monitored oesophageal pH over a 24 h period to obtain the symptom association probability, which was then utilized to evaluate the association between a patient’s symptoms and gastroesophageal reflux [12]. By analyzing large amounts of patient data, Beglinger et al. determined the probability that a patient suffering from Huntington’s disease also had obsessive and compulsive symptoms [13]. By analyzing the data of patients suffering from irritable bowel syndrome, Arsiè et al. determined the probability of association between meal ingestion and abdominal pain symptoms in these patients [14]. In this paper, based on symptom association probabilities obtained by medical technology, we focus on how to mine frequent itemsets from uncertain medical data while also protecting data privacy. In the uncertain medical data, each item corresponds to a symptom of patients.

In this paper, a new algorithm, denoted as U-PrivMining (uncertain medical data differentially private frequent itemsets mining), is proposed to mine the top-$k$ most frequent itemsets from uncertain medical data in a differentially private way. U-PrivMining has two phases. In the first phase, based on traditional algorithms for mining frequent itemsets from uncertain data, the sparse vector algorithm and the Laplace mechanism are applied to ensure differential privacy for all the frequent itemsets mined from uncertain medical data. In the second phase, based on these frequent itemsets, the Laplace mechanism is applied to ensure differential privacy for the top-$k$ most frequent itemsets for uncertain data, as well as for the expected supports of these frequent itemsets. We use the sparse vector algorithm to improve the efficiency of our algorithm. In [15], the sparse vector algorithm was used to mine the top-$k$ most frequent itemsets from certain data with a differential privacy guarantee. One major advantage of the sparse vector algorithm is that information disclosure affecting differential privacy occurs only for count queries above the threshold; negative answers do not count against the “privacy budget” [15]. The sparse vector algorithm is also suitable for guaranteeing differential privacy when mining frequent itemsets from uncertain data. For certain data, the fixed occurrence count of an itemset is used to determine whether the itemset is frequent. For uncertain data mined on the basis of expected support, the expectation of the support of an itemset is used instead [16]. To summarize, our key contributions are the following:

(i) A new algorithm is proposed to mine the top-$k$ most frequent itemsets from uncertain medical data and ensure differential privacy. Traditional algorithms for mining frequent itemsets in differentially private ways are based on certain data and thus cannot be directly applied to process uncertain medical data.

(ii) Through privacy analysis, we prove that U-PrivMining guarantees differential privacy. Our experimental results on four real-world datasets and two synthetic datasets illustrate the efficiency of U-PrivMining.

This paper is organized as follows. Section 2 presents an overview of related work on eHealth, IoT, frequent itemsets mining for uncertain data, and differential privacy. In Section 3, some notations used in this paper are introduced. The U-PrivMining algorithm and the proof that U-PrivMining satisfies differential privacy in theory are presented in Section 4. In Section 5, the performance of U-PrivMining is evaluated with six datasets. In the last section, we conclude our work.

2. Related Work

eHealth applies IoT technology to provide better healthcare services to users. In 2009, Niyato et al. proposed a remote and mobile patient monitoring system that applies heterogeneous wireless access to monitor the biosignals of mobile patients [17]. In 2015, to address the limitations of traditional cellular networks for eHealth services, Yi et al. designed a transmission scheduling mechanism for delay-sensitive medical packets in an eHealth network [18]. eHealth systems based on IoT use monitoring devices to collect large amounts of patient data. Data mining can find hidden patterns in these data, which can assist medical personnel in providing improved medical services to patients. In 2009, Karaolis et al. proposed an algorithm that used association rule mining to assess the risk of coronary events [19]. When traditional data mining technologies are applied to medical data, many useless patterns are discovered. In 2013, Lee et al. proposed a novel association rule mining algorithm to determine the relationship between blood factors and disease history [20]. This algorithm reduced the number of useless patterns mined from medical data. In 2014, Park et al. used association rules mined from medical data to identify risk behaviors in daily life [21].

The phenomenon of data uncertainty is very common. Traditional algorithms for mining frequent itemsets based on certain data cannot be directly applied to mine frequent itemsets from uncertain data. There are two categories of research on mining frequent itemsets from uncertain data [22]. The first category is mining frequent itemsets based on expected support. In 2007, Chui et al. proposed the notion of expected support and proposed the U-Apriori algorithm based on the Apriori algorithm [23]. The second category is probabilistic frequent itemsets mining. In 2012, the characteristics of Poisson binomial distribution were introduced to mine probabilistic frequent itemsets [24]. In 2012, Bernecker et al. proposed an algorithm based on the frequent pattern tree to mine probabilistic frequent itemsets from uncertain data [25].

Protecting the privacy of patient data is a challenge for eHealth, and many studies have been conducted on eHealth security [26–33]. Differential privacy ensures that when one record in the input database of a mechanism is changed, the output of the mechanism is insensitive to the change [34]. In 2006, Dwork et al. proposed the Laplace mechanism to ensure differential privacy for real-valued output [35]. In 2010, Bhaskar et al. proposed an algorithm based on truncated frequencies to ensure differential privacy for the top-$k$ most frequent itemsets for certain data [36]. In 2012, Li et al. introduced the notion of a basis set to ensure differential privacy when mining the top-$k$ most frequent itemsets from certain data [37]. In 2014, Lee et al. applied the sparse vector algorithm and the Laplace mechanism to guarantee differential privacy for the top-$k$ frequent itemsets mined from certain data [15]. In 2015, Su et al. introduced a smart splitting method to mine frequent itemsets from certain data and ensure differential privacy [38].

Although there are many studies on mining frequent itemsets from certain data in differentially private ways, research on mining frequent itemsets from uncertain data in differentially private ways remains scarce. This paper focuses on mining the top-$k$ most frequent itemsets from uncertain data in a differentially private way.

3. Preliminaries

The fundamental notions of mining frequent itemsets from uncertain data [23] and differential privacy [34, 35] will be reviewed in this section. These fundamental notions are used throughout this paper. The terms “item” and “symptom” are used interchangeably; “itemset” and “symptom set” can be swapped.

3.1. Frequent Itemsets Mining for Uncertain Data

Let $I = \{x_1, x_2, \ldots, x_n\}$ be a set of items and let $D$ be uncertain data with $N$ records. Each record $T_i$ ($1 \le i \le N$) is a set of uncertain items. Each item $x \in T_i$ is assigned an existential probability $P_{T_i}(x)$, which indicates the likelihood that $x$ appears in $T_i$. An example of uncertain data, in which each item is a symptom, is shown in Table 1. For instance, a record in Table 1 that contains the items hypotension and eating disorder means that the corresponding user may be suffering from hypotension and eating disorder. If the existential probability of hypotension in that record is equal to 0.3, then the probability of the user suffering from hypotension is equal to 0.3.

A set of possible worlds (possible certain databases), denoted as $W = \{W_1, W_2, \ldots\}$, can be inferred from uncertain data $D$. Each possible world $W_j$ is generated by deciding, for every uncertain item in every record of $D$, whether the item is present or absent; the existential probabilities determine how likely each outcome is. Table 2 shows a set of possible worlds inferred from the uncertain data shown in Table 1. For instance, one possible world in Table 2 describes the case in which one user is suffering from hypotension and another user is suffering from anemia and hypotension.

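We assume that all the records in the uncertain data and all the uncertain items in the same record are mutually independent. The probability of a possible world $W_j$, denoted as $P(W_j)$, can be obtained by the following [23]:

$$P(W_j) = \prod_{i=1}^{N} \left( \prod_{x \in T_{j,i}} P_{T_i}(x) \cdot \prod_{x \in T_i \setminus T_{j,i}} \bigl(1 - P_{T_i}(x)\bigr) \right), \tag{1}$$

where $T_{j,i}$ denotes the set of items contained in record $T_i$ and belonging to $W_j$. The expected support of itemset $X$, denoted as $\mathrm{esup}(X)$, can be obtained by the following [23]:

$$\mathrm{esup}(X) = \sum_{j=1}^{|W|} P(W_j) \cdot S(X, W_j), \tag{2}$$

where $S(X, W_j)$ is the support count of itemset $X$ in possible world $W_j$. For each possible world in Table 2, (1) gives its probability, and the support count of an itemset is the number of records of that world that contain the itemset.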
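The two definitions above can be checked on a toy example. The following is a minimal sketch, not the authors' implementation: the two records and their probabilities are hypothetical, and the function names are chosen here for illustration. It enumerates every possible world to evaluate the expected support as a probability-weighted sum of support counts, and compares the result with the equivalent per-record product form used later in the privacy analysis.

```python
from itertools import product
from math import prod

# Hypothetical uncertain data: each record maps an item (symptom) to its
# existential probability. The values are illustrative only.
records = [
    {"hypotension": 0.3, "eating disorder": 0.8},
    {"anemia": 0.6, "hypotension": 0.5},
]

def expected_support_possible_worlds(records, itemset):
    """Sum over all possible worlds of P(world) * support count in that world."""
    # One slot per (record index, item, probability) triple.
    slots = [(i, item, p) for i, rec in enumerate(records) for item, p in rec.items()]
    total = 0.0
    # Each possible world fixes presence (True) or absence (False) of every slot.
    for world in product([True, False], repeat=len(slots)):
        prob = 1.0
        present = [set() for _ in records]
        for (i, item, p), keep in zip(slots, world):
            prob *= p if keep else (1.0 - p)
            if keep:
                present[i].add(item)
        support = sum(1 for items in present if set(itemset) <= items)
        total += prob * support
    return total

def expected_support_direct(records, itemset):
    """Sum over records of the product of the item probabilities (0 if absent)."""
    return sum(prod(rec.get(item, 0.0) for item in itemset) for rec in records)
```

Enumerating possible worlds is exponential in the number of uncertain items, so the per-record product form is the practical one; the enumeration is shown only to make the definition concrete.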

3.2. Differential Privacy

Differential privacy ensures that the output of an analysis mechanism is insensitive to changes in individual input records. If an analysis mechanism ensures differential privacy, its output is insensitive to the addition or removal of a single record from the input database. As a result, adversaries cannot use the output, together with their background information, to infer a patient’s record [35]. Many studies on privacy protection are based on two assumptions. The first is that the background information of adversaries is already known to the security manager. The second is that the security manager knows which information should be kept private for users. Differential privacy can protect the sensitive information of users without relying on either assumption [34]. Two databases, $D$ and $D'$, are a pair of neighboring databases if and only if they differ by no more than one record.

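Definition 1 ($\epsilon$-differential privacy [34]). Let $\mathrm{Range}(\mathcal{A})$ be the output domain of a randomized algorithm $\mathcal{A}$, and let $D$ and $D'$ be any pair of neighboring datasets. If (3) is satisfied for every $S \subseteq \mathrm{Range}(\mathcal{A})$, then algorithm $\mathcal{A}$ guarantees $\epsilon$-differential privacy:

$$\Pr[\mathcal{A}(D) \in S] \le e^{\epsilon} \cdot \Pr[\mathcal{A}(D') \in S], \tag{3}$$

where $\epsilon$ is the privacy budget of differential privacy and $\epsilon > 0$.
The sensitivity measures the maximal possible difference between the outputs of a function on any pair of neighboring datasets.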

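Definition 2 (sensitivity [34]). Given the function $f$, which maps a dataset to a vector of real numbers, the sensitivity of $f$, denoted as $\Delta f$, can be obtained by

$$\Delta f = \max_{D, D'} \lVert f(D) - f(D') \rVert_1, \tag{4}$$

where $D$ and $D'$ are any pair of neighboring datasets.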

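Definition 3 (the Laplace mechanism [35]). Given dataset $D$, let $Q = \langle q_1, \ldots, q_d \rangle$ be a query sequence and let the sensitivity of $Q$ be $\Delta Q$. Let $\langle \mathrm{Lap}_1, \ldots, \mathrm{Lap}_d \rangle$ be a vector in which $\mathrm{Lap}_1, \ldots, \mathrm{Lap}_d$ are drawn i.i.d. from the Laplace distribution whose scale and mean are $\Delta Q / \epsilon$ and 0, respectively. The algorithm

$$\mathcal{A}(D) = Q(D) + \langle \mathrm{Lap}_1, \ldots, \mathrm{Lap}_d \rangle \tag{5}$$

guarantees $\epsilon$-differential privacy.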
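As an illustration of Definition 3, the sketch below adds Laplace noise with scale equal to the sensitivity divided by the privacy budget to each answer of a query sequence. It is a minimal example rather than a production mechanism: the function names are hypothetical, the noise is sampled as the difference of two exponential variates, and no attention is paid to floating-point side channels.

```python
import random

def laplace_noise(scale, rng=random):
    """Draw one sample from the Laplace distribution with mean 0 and the given scale.

    The difference of two i.i.d. exponential variates with mean `scale`
    follows Laplace(0, scale).
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(answers, sensitivity, epsilon, rng=random):
    """Perturb every true answer with i.i.d. Laplace noise of scale sensitivity/epsilon."""
    scale = sensitivity / epsilon
    return [a + laplace_noise(scale, rng) for a in answers]
```

A larger privacy budget gives a smaller scale and therefore answers closer to the true values, which is the trade-off examined in the experiments.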

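Lemma 4 (composition lemma [34]). Given a sequence of algorithms, denoted as $\mathcal{A}_1, \ldots, \mathcal{A}_n$, if each algorithm $\mathcal{A}_i$ guarantees $\epsilon_i$-differential privacy, then the sequence ensures $\left(\sum_{i=1}^{n} \epsilon_i\right)$-differential privacy.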

4. U-PrivMining Algorithm

This section introduces the U-PrivMining algorithm, which determines the top-$k$ most frequent itemsets from uncertain data, in which each item corresponds to a symptom of patients, in a differentially private way. The process of U-PrivMining consists of two phases. Given a total privacy budget $\epsilon$, the privacy budget assigned to the first phase is equal to $\epsilon_1 = \alpha\epsilon$, and the privacy budget assigned to the second phase is equal to $\epsilon_2 = (1 - \alpha)\epsilon$. The parameter $\alpha$ controls how the privacy budget is divided between the two phases. In this study, we used the same value of $\alpha$ for all uncertain data. However, this choice may not be optimal. It appears that the optimal allocation depends on the characteristics of the uncertain medical data and the value of $k$ [36].

4.1. Description of U-PrivMining

The whole process of U-PrivMining is introduced in this section. U-PrivMining is composed of two phases. In the first phase, we obtain a threshold $\theta_k$ such that the expected supports of the top-$k$ most frequent itemsets are greater than or equal to $\theta_k$. The privacy budget allocated to this phase is equal to $\epsilon_1 = \alpha\epsilon$. On the basis of traditional algorithms for mining frequent itemsets from uncertain data, we apply the sparse vector algorithm [15] and the Laplace mechanism to ensure $\epsilon_1$-differential privacy for this phase. The steps in the first phase of U-PrivMining are as follows.

Step 1. The expected support of the $k$th most frequent itemset, denoted as $\theta_k$, is obtained by utilizing traditional algorithms for mining frequent itemsets based on expected support from uncertain data.

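Step 2. The noisy threshold, denoted as $\hat{\theta}$, can be obtained by

$$\hat{\theta} = \theta_k + \mathrm{Lap}_{\theta}, \tag{6}$$

where $\mathrm{Lap}_{\theta}$ is the noisy data generated by the Laplace distribution, whose mean is 0 and whose scale is $1/\epsilon_{\theta}$; here $\epsilon_{\theta}$ denotes the portion of the first-phase privacy budget spent on the threshold, and 1 is the sensitivity of $\theta_k$.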

Step 3. On the basis of traditional algorithms for mining frequent itemsets from uncertain data, the sparse vector algorithm is applied to obtain all the frequent itemsets whose assessment expected supports are greater than or equal to the noisy threshold $\hat{\theta}$. The assessment expected support of an itemset $X$, denoted as $\mathrm{esup}_a(X)$, can be obtained by

$$\mathrm{esup}_a(X) = \mathrm{esup}(X) + \mathrm{Lap}_X, \tag{7}$$

where $\mathrm{esup}(X)$ is the expected support of itemset $X$ and $\mathrm{Lap}_X$ is the noisy data generated by the Laplace distribution, whose mean is 0 and whose scale is set from the remaining first-phase privacy budget as in the sparse vector algorithm [15].

Step 4. All the frequent itemsets obtained in Step 3 and the expected supports of these itemsets are taken as the output of this phase.
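Steps 1 to 4 of the first phase can be sketched as follows. This is a simplified illustration under several assumptions, not the U-PrivMining implementation: the expected supports of all candidate itemsets are taken as already computed (the paper instead builds on an uncertain-data mining algorithm that generates candidates), the two noise scales are passed in as plain parameters named `threshold_scale` and `query_scale` rather than derived from the privacy budget, and all function names are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponential variates."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def phase_one(expected_supports, k, threshold_scale, query_scale, rng):
    """expected_supports: dict mapping each candidate itemset to its expected support."""
    # Step 1: expected support of the k-th most frequent itemset.
    ranked = sorted(expected_supports.values(), reverse=True)
    theta_k = ranked[min(k, len(ranked)) - 1]
    # Step 2: noisy threshold.
    noisy_threshold = theta_k + laplace_noise(threshold_scale, rng)
    # Step 3: keep itemsets whose noisy (assessment) expected support passes
    # the noisy threshold, as in a sparse-vector style comparison.
    frequent = {}
    for itemset, esup in expected_supports.items():
        assessment = esup + laplace_noise(query_scale, rng)
        if assessment >= noisy_threshold:
            frequent[itemset] = esup
    # Step 4: the selected itemsets and their expected supports go to phase two.
    return frequent
```

Because both sides of the comparison are noisy, the output may contain more or fewer than $k$ itemsets, which is why the second phase distinguishes the two cases.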

In the second phase, according to the output of the first phase, U-PrivMining obtains the top-$k$ most frequent itemsets for uncertain data and the noisy expected supports of these frequent itemsets. The privacy budget allocated to the second phase is equal to $\epsilon_2 = (1 - \alpha)\epsilon$. The privacy budgets allocated to ensure differential privacy for the top-$k$ most frequent itemsets and for the expected supports of these itemsets are equal to $\beta\epsilon_2$ and $(1 - \beta)\epsilon_2$, respectively. The second phase of U-PrivMining is described below.

Let $FI$ be the set of itemsets obtained in the first phase of U-PrivMining. Let $\mathrm{esup}(X_i)$ be the expected support of itemset $X_i \in FI$. The steps in the second phase of U-PrivMining are as follows.

Step 1 (if $|FI|$ is less than or equal to $k$, $\beta$ is equal to 0). All the itemsets in $FI$ belong to the top-$k$ most frequent itemsets for uncertain data. Step 3 is then executed directly.

Step 2 (if $|FI|$ is greater than $k$, $\beta$ is equal to 0.5). The perturbation expected supports of all the itemsets in $FI$ are obtained. The perturbation expected support of itemset $X_i$, denoted as $\mathrm{esup}_p(X_i)$, can be obtained by

$$\mathrm{esup}_p(X_i) = \mathrm{esup}(X_i) + \mathrm{Lap}_i, \tag{8}$$

where the $\mathrm{Lap}_i$ are mutually independent and drawn from the Laplace distribution, whose mean and scale are 0 and $|FI| / (\beta\epsilon_2)$, respectively. The $k$ itemsets in $FI$ with the largest perturbation expected supports are taken as the top-$k$ most frequent itemsets for uncertain data.

Step 3. Let $TK$ be the set of the top-$k$ most frequent itemsets for uncertain data obtained in the above steps. The noisy expected supports of all the itemsets in $TK$ are obtained. The noisy expected support of itemset $X_i \in TK$, denoted as $\mathrm{esup}_n(X_i)$, can be obtained by

$$\mathrm{esup}_n(X_i) = \mathrm{esup}(X_i) + \mathrm{Lap}_i, \tag{9}$$

where the $\mathrm{Lap}_i$ are mutually independent and drawn from the Laplace distribution whose mean and scale are equal to 0 and $k / ((1 - \beta)\epsilon_2)$, respectively.

Step 4. The top-$k$ most frequent itemsets for uncertain data and the noisy expected supports of these itemsets are taken as the output of U-PrivMining.
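The second phase can be sketched in the same spirit. This is a minimal illustration under stated assumptions: the noise scales are taken to be the sensitivity divided by the budget share, with the sensitivities quoted in the proof of Theorem 9 (the number of itemsets from the first phase for selection, and $k$ for the released supports); the names `phase_two`, `epsilon_2`, and `beta` are chosen here for illustration and are not the authors' code.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponential variates."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def phase_two(frequent, k, epsilon_2, rng):
    """frequent: dict mapping each phase-one itemset to its expected support."""
    if len(frequent) <= k:
        # Step 1: nothing to select, so the whole budget goes to the supports.
        beta = 0.0
        top = list(frequent)
    else:
        # Step 2: rank by perturbed expected support and keep the k largest.
        beta = 0.5
        select_scale = len(frequent) / (beta * epsilon_2)
        perturbed = {x: s + laplace_noise(select_scale, rng) for x, s in frequent.items()}
        top = sorted(perturbed, key=perturbed.get, reverse=True)[:k]
    # Step 3: release noisy expected supports of the selected itemsets.
    release_scale = k / ((1.0 - beta) * epsilon_2)
    # Step 4: output the itemsets together with their noisy expected supports.
    return {x: frequent[x] + laplace_noise(release_scale, rng) for x in top}
```

Fresh noise is drawn for the released supports, so the values used for ranking in Step 2 are never published.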

4.2. Privacy Analysis for U-PrivMining

In this section, we prove that U-PrivMining is $\epsilon$-differentially private. To do so, we introduce the notions of count query set and threshold query set.

Definition 5 (count query set [15]). Let $FI$ be a set of itemsets with $m$ itemsets. A count query set is composed of a number of count queries. Let $CQ = \{cq_1, \ldots, cq_m\}$ be the count query set, where each query $cq_i$ asks for the expected support of the $i$th itemset in $FI$.

Definition 6 (threshold query set [15]). Let $FI$ be a set of itemsets with $m$ itemsets. A threshold query set is composed of a number of threshold queries. Let $TQ = \{tq_1, \ldots, tq_m\}$ be the threshold query set, where each $tq_i$ returns 1 if the answer to $cq_i$ is greater than or equal to the threshold; otherwise $tq_i$ returns 0.
According to the definition of count query set, the sensitivity of the count query and count query set can be obtained as follows.

Lemma 7. Let $CQ = \{cq_1, \ldots, cq_m\}$ be a count query set. The sensitivity of $cq_i$ and CQ are equal to 1 and $m$, respectively.

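Proof. According to (2), the expected support of an itemset $X$, denoted as $\mathrm{esup}(X)$, can also be computed as follows [23]:

$$\mathrm{esup}(X) = \sum_{i=1}^{N} \prod_{x \in X} P_{T_i}(x), \tag{10}$$

where $N$ is the number of records in the uncertain data $D$, $T_i$ is a record in $D$, and $P_{T_i}(x) = 0$ if $x \notin T_i$. Let $D$ and $D'$ be a pair of neighboring databases. Let $D_{\cap}$ be the intersection of $D$ and $D'$. Let $N$, $N'$, and $N_{\cap}$ be the total size of $D$, $D'$, and $D_{\cap}$, respectively. Let $\mathrm{esup}_{D}(X)$ and $\mathrm{esup}_{D'}(X)$ be the expected supports of itemset $X$ for $D$ and $D'$, respectively. According to (10), the values of $\mathrm{esup}_{D}(X)$ and $\mathrm{esup}_{D'}(X)$ can be computed as follows:

$$\mathrm{esup}_{D}(X) = \mathrm{esup}_{D_{\cap}}(X) + \sum_{T \in D \setminus D_{\cap}} \prod_{x \in X} P_{T}(x), \qquad \mathrm{esup}_{D'}(X) = \mathrm{esup}_{D_{\cap}}(X) + \sum_{T \in D' \setminus D_{\cap}} \prod_{x \in X} P_{T}(x), \tag{11}$$

where $N - N_{\cap} \le 1$ and $N' - N_{\cap} \le 1$. Each product of existential probabilities lies in $[0, 1]$, so $|\mathrm{esup}_{D}(X) - \mathrm{esup}_{D'}(X)| \le 1$. As a result, the sensitivity of each query $cq_i$ is equal to 1. Since there are $m$ queries in CQ, the sensitivity of CQ is equal to $m$.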
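The bound in Lemma 7 can be illustrated numerically. In the sketch below, which uses hypothetical records and probabilities, a neighboring database is formed by adding one record; the expected support of an itemset then changes by exactly the product of that record's item probabilities, which cannot exceed 1.

```python
from math import prod

def expected_support(records, itemset):
    """Sum over records of the product of item probabilities (0 if an item is absent)."""
    return sum(prod(rec.get(item, 0.0) for item in itemset) for rec in records)

# Hypothetical uncertain data and one extra record that makes a neighboring database.
base = [
    {"hypotension": 0.3, "anemia": 0.4},
    {"anemia": 0.6, "hypotension": 0.5},
]
extra = {"hypotension": 0.9, "anemia": 0.7}
neighbor = base + [extra]

itemset = ("hypotension", "anemia")
# Contribution of the extra record: 0.9 * 0.7 = 0.63, which is at most 1.
difference = expected_support(neighbor, itemset) - expected_support(base, itemset)
```

The same reasoning applies to every count query, so a set of $m$ count queries changes by at most $m$ in total.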

Based on the sensitivity of the count query and count query set for uncertain data, we can conclude that U-PrivMining guarantees $\epsilon$-differential privacy. The proof procedure is outlined below.

Theorem 8. The first phase of U-PrivMining is $\alpha\epsilon$-differentially private.

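Proof. According to Lemma 7, the sensitivity of obtaining $\theta_k$ is equal to 1. As a result, according to the Laplace mechanism, generating the noisy threshold $\hat{\theta}$ is $\epsilon_{\theta}$-differentially private, where $\epsilon_{\theta}$ denotes the portion of the first-phase privacy budget spent on the threshold. Let $\hat{\theta}_{D}$ and $\hat{\theta}_{D'}$ be the noisy threshold for a pair of neighboring databases $D$ and $D'$, respectively. According to Definition 1, (12) is satisfied:

$$\Pr[\hat{\theta}_{D} = t] \le e^{\epsilon_{\theta}} \cdot \Pr[\hat{\theta}_{D'} = t]. \tag{12}$$

Let $X_1, \ldots, X_m$ be the set of itemsets. The threshold query set is applied to model the set of answers as a vector $v = \langle v_1, \ldots, v_m \rangle$, where $v_i = 1$ if the assessment expected support of $X_i$ is greater than or equal to the noisy threshold; otherwise $v_i = 0$. Given any pair of neighboring databases $D$ and $D'$, $P_{D}(v)$ and $P_{D'}(v)$ denote the output distribution on $v$ when $D$ and $D'$ are the input databases, respectively. Then, (13) is satisfied (the details of the proof are shown in [15]):

$$P_{D}(v) \le e^{\alpha\epsilon} \cdot P_{D'}(v). \tag{13}$$

Thus, the first phase of U-PrivMining ensures $\alpha\epsilon$-differential privacy.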

Theorem 9. The second phase of U-PrivMining is $(1 - \alpha)\epsilon$-differentially private.

Proof. Let $\epsilon_2 = (1 - \alpha)\epsilon$ be the privacy budget of the second phase and let $\beta$ be the share of $\epsilon_2$ spent on selecting the top-$k$ itemsets. According to Lemma 7, the sensitivity of obtaining the expected support of an itemset is equal to 1. Therefore, the sensitivity of obtaining the expected supports of all frequent itemsets in $FI$ is equal to $|FI|$. In the case that $|FI|$ is greater than $k$, according to the Laplace mechanism, the scale of the Laplace distribution used to ensure differential privacy for the top-$k$ most frequent itemsets is equal to $|FI| / (\beta\epsilon_2)$. Hence, obtaining the top-$k$ most frequent itemsets ensures $\beta\epsilon_2$-differential privacy. The sensitivity of obtaining the expected supports of the top-$k$ most frequent itemsets is equal to $k$. According to the Laplace mechanism, the noisy data used to obtain the noisy expected supports of the top-$k$ frequent itemsets obeys the Laplace distribution whose scale is equal to $k / ((1 - \beta)\epsilon_2)$. Hence, obtaining the noisy expected supports of the top-$k$ most frequent itemsets for uncertain data ensures $(1 - \beta)\epsilon_2$-differential privacy. As a consequence, according to Lemma 4, the second phase of U-PrivMining guarantees $\epsilon_2 = (1 - \alpha)\epsilon$-differential privacy.

According to the analysis of the two phases of U-PrivMining, we can conclude that the first and second phases are $\alpha\epsilon$-differentially private and $(1 - \alpha)\epsilon$-differentially private, respectively. According to Lemma 4, U-PrivMining is $\epsilon$-differentially private.
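The budget accounting behind this conclusion can be written out as a short check. The sketch below is illustrative only: it writes `alpha` for the parameter that divides the budget between the two phases and `beta` for the split inside the second phase (both symbol names are assumptions made here), and the numeric values in the usage note are hypothetical, not the settings used in the experiments.

```python
def split_budget(epsilon, alpha, beta):
    """Return the budget spent by each part of a two-phase mechanism.

    By sequential composition, the parts must add up to the total budget.
    """
    eps_1 = alpha * epsilon            # first phase
    eps_2 = (1.0 - alpha) * epsilon    # second phase
    return {
        "phase_1": eps_1,
        "phase_2_selection": beta * eps_2,
        "phase_2_release": (1.0 - beta) * eps_2,
    }
```

For example, `split_budget(1.0, 0.5, 0.5)` assigns 0.5 to the first phase and 0.25 to each part of the second phase, which sums to the total budget of 1.0.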

5. Experiments

In our experiments, four real-world datasets and two synthetic datasets, which can be downloaded from [39], were utilized to verify the efficiency of U-PrivMining. The parameters of these public datasets are shown in Table 3, which lists, for each dataset, the number of items, the number of transactions, the maximal length of transactions (max), and the average length of transactions (avg). In order to add uncertainty to these datasets, a random existential probability is assigned to each item in each transaction.
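The conversion from a certain dataset to an uncertain one can be sketched as below. This is a minimal example under an explicit assumption: the probabilities are drawn uniformly from the interval (0, 1], which is chosen here for illustration and may differ from the range used in the experiments. The function name `add_uncertainty` is hypothetical.

```python
import random

def add_uncertainty(transactions, rng):
    """Attach a random existential probability to every item of every transaction.

    ASSUMPTION: probabilities are uniform on (0, 1]; rng.random() lies in [0, 1),
    so 1.0 - rng.random() lies in (0, 1].
    """
    return [{item: 1.0 - rng.random() for item in t} for t in transactions]
```

Passing a seeded `random.Random` instance as `rng` makes the generated uncertain dataset reproducible across runs.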

5.1. Evaluation Metrics

U-PrivMining applies the Laplace mechanism and the sparse vector algorithm to ensure differential privacy for the top-$k$ most frequent itemsets for uncertain data and the expected supports of these frequent itemsets. The Laplace mechanism protects the privacy of U-PrivMining’s output by adding noisy data to the output of mining frequent itemsets from uncertain data. Thus, the F-score and relative error (RE) are applied to evaluate the influence of noisy data on the experimental results.

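Definition 10 (F-score [15]). Let $TK_c$ be the set of the correct top-$k$ most frequent itemsets for uncertain data and $TK_p$ be the set of the frequent itemsets obtained by U-PrivMining. The F-score can be obtained by

$$\text{F-score} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}, \tag{14}$$

where $\text{precision} = |TK_c \cap TK_p| / |TK_p|$ and $\text{recall} = |TK_c \cap TK_p| / |TK_c|$.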

Definition 11 (relative error [15]). Let $\mathrm{esup}(X)$ and $\mathrm{esup}_n(X)$ be the expected support of itemset $X$ for uncertain data and the noisy expected support of itemset $X$ obtained in the second phase of U-PrivMining, respectively. The RE can be obtained by

$$RE = \operatorname*{median}_{X} \frac{|\mathrm{esup}_n(X) - \mathrm{esup}(X)|}{\mathrm{esup}(X)}. \tag{15}$$

As described in Definition 10, the precision is the proportion of the itemsets mined by U-PrivMining that belong to the correct top-$k$ most frequent itemsets for uncertain data. The recall is the proportion of the correct top-$k$ most frequent itemsets that are mined by U-PrivMining. The F-score is the harmonic mean of precision and recall. When the number of frequent itemsets obtained from the first phase of U-PrivMining is greater than or equal to $k$, both $|TK_c|$ and $|TK_p|$ are equal to $k$. As a result, the F-score and the recall are equal to the precision.
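Both metrics are simple to compute once the true and mined results are available. The following sketch is a minimal illustration with hypothetical function names; it treats itemsets as hashable keys and takes the relative error as the median over the mined itemsets, as in the definitions above.

```python
from statistics import median

def f_score(true_top, mined):
    """Harmonic mean of precision and recall between two collections of itemsets."""
    true_top, mined = set(true_top), set(mined)
    hits = len(true_top & mined)
    if hits == 0:
        return 0.0
    precision = hits / len(mined)
    recall = hits / len(true_top)
    return 2 * precision * recall / (precision + recall)

def relative_error(true_esup, noisy_esup):
    """Median over mined itemsets of |noisy - true| / true expected support."""
    return median(
        abs(noisy_esup[x] - true_esup[x]) / true_esup[x] for x in noisy_esup
    )
```

When both collections contain exactly $k$ itemsets, precision and recall share the same numerator and denominator, so the F-score reduces to the precision.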

As described in Definition 11, the RE is utilized to evaluate the influence of the noisy data on the noisy expected supports of the top-$k$ most frequent itemsets for uncertain data. The per-itemset errors may contain extremely large or extremely small values. Unlike the mean, the median is not skewed by such extreme values. Therefore, the median was applied to evaluate the relative error.

5.2. Analysis of Experimental Results

U-PrivMining can identify the top-$k$ most frequent itemsets from uncertain data in a differentially private way. In traditional algorithms for mining the top-$k$ most frequent itemsets from uncertain data and certain data, the $k$ values were predetermined by users or domain experts [7]. In order to evaluate the influence of the privacy budget on the F-score and RE, we conducted four groups of experiments, in which the $k$ values were set as 50, 100, 150, and 200, respectively.

Figure 1 shows the F-score obtained by U-PrivMining running on the six public datasets under different privacy budget values. As can be seen from the figure, when the $k$ value is fixed, the F-score fluctuates but approaches 1 as the privacy budget increases. In the first phase of U-PrivMining, the algorithm draws noisy data to generate the noisy threshold and the assessment expected supports of itemsets. According to (6) and (7), the greater the value of the privacy budget, the smaller the scale of the Laplace distribution used to generate the noisy data in this phase. In the second phase of U-PrivMining, the algorithm obtains the top-$k$ most frequent itemsets by adding noisy data to the expected supports. This noisy data is drawn from a Laplace distribution whose mean is 0 and whose scale is inversely proportional to the privacy budget of the second phase. As a result, the F-score improves and approaches 1 as the privacy budget increases. From Figure 1, we can conclude that the lower the expected supports of the top-$k$ most frequent itemsets for the uncertain data, the slower the convergence of the F-score. For the T10I4D100K dataset, the expected supports of the top-$k$ most frequent itemsets are lower than those of the other datasets. Therefore, U-PrivMining converges more slowly on the T10I4D100K dataset than on the other datasets. U-PrivMining applies the Laplace mechanism to ensure data privacy. Hence, when the noisy data is large relative to the expected supports of the top-$k$ most frequent itemsets, the F-score of U-PrivMining is lower.

Figure 2 shows the RE obtained by U-PrivMining running on the six public datasets under different privacy budget values. When $k$ is fixed, the RE fluctuates but approaches 0 as the privacy budget increases. The noisy expected support of an itemset is obtained by adding noisy data drawn from the Laplace distribution to the expected support of the itemset. As a consequence, when $k$ and the privacy budget are fixed, the lower the expected supports of the top-$k$ most frequent itemsets of a dataset, the higher the RE of U-PrivMining. When the privacy budget is fixed, increasing $k$ lowers the expected supports of the top-$k$ most frequent itemsets, which in turn raises the RE. Thus, for the same dataset and privacy budget, RE values increase with increasing $k$.

5.3. Discussion

In the field of medicine, for different physical conditions of patients, the same physiological index corresponds to a different symptom association probability for each patient. Many medical technologies are available to obtain symptom association probabilities for patients, so there is uncertainty in patient data. However, existing algorithms for mining frequent itemsets from medical data in differentially private ways are all based on certain data and cannot be directly used for uncertain medical data. Therefore, in this paper, we proposed the U-PrivMining algorithm, which can mine the top-$k$ most frequent itemsets from uncertain medical data and ensure differential privacy. The experimental results verified the effectiveness of U-PrivMining.

6. Conclusion

In this paper, we proposed a new algorithm to mine the top-$k$ most frequent itemsets from uncertain medical data, where each item corresponds to a patient symptom, while protecting data privacy. These frequent itemsets can assist physicians in making diagnoses. Through theoretical and experimental analyses, we conclude that U-PrivMining ensures differential privacy and that, with increasing privacy budget, the top-$k$ most frequent itemsets obtained by U-PrivMining and their noisy expected supports approach the true top-$k$ most frequent itemsets and their true expected supports for uncertain data, respectively. However, the privacy budget allocation may not be optimal. The optimization of the privacy budget allocation will be the focus of future research.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the National Science Foundation of China (no. 61672135, no. 61502085, no. 61272527, and no. 61370026), the National High Technology Research and Development Program of China (no. 2015AA016007), China Postdoctoral Science Foundation Funded Project (no. 2015M570775), the Sichuan Science-Technology Support Plan Program (no. 2014GZ0106, no. 2015GZ0095, and no. 2016JZ0020), and the National Science Foundation of China-Guangdong Joint Foundation (no. U1401257).