Abstract

Data are often distributed among different parties. Collecting data from multiple parties for joint analysis and mining can serve people better, but it also brings unprecedented privacy threats to the participants. Safe and reliable data publishing among multiple data owners is therefore an urgent problem. We study the problem of privacy protection in data publishing. For the centralized scenario, we propose the LDA-DP algorithm. First, the within-class mean vectors and the pooled within-class scatter matrix are perturbed with Gaussian noise. Second, the optimal projection direction vector satisfying differential privacy is obtained via the Fisher criterion. Finally, the low-dimensional projection of the original data is released. For the distributed scenario, we propose the Mul-LDA-DP algorithm based on blockchain and differential privacy technology. First, the within-class mean vectors and within-class scatter matrices of the local data are perturbed with Gaussian noise and uploaded to the blockchain network. Second, the projection direction vector is computed in the blockchain network and returned to the data owners. Finally, each data owner uses the projection direction vector to generate the low-dimensional projection of its original data and uploads it to the blockchain network for publishing. Furthermore, for the distributed scenario we propose a correlated noise generation scheme that exploits the additivity of the Gaussian distribution to mitigate the effect of noise and achieves the same noise level as the centralized scenario. We measure the utility of the published data by the SVM misclassification rate and conduct comparative experiments with similar algorithms on several real data sets. The experimental results show that the data released by the two algorithms maintain good utility in SVM classification.

1. Introduction

With the development of science and technology, effective data collection and analysis can help people make better decisions in production. For example, analyzing patient information can help doctors improve diagnostic accuracy and the level of medical services, and analyzing trajectory data can help relieve urban traffic congestion. Such data contain sensitive information and need to be processed for privacy protection before publishing [1, 2]. There have been several lines of research on privacy-preserving data publishing, for example, k-anonymity [3], encryption [4, 5], blockchain technology [6–8], and differential privacy [9–11]. Differential privacy has been widely used for privacy protection in recent years. Its principle is to add random noise to the data so that an attacker cannot distinguish the original input data from the output. Differential privacy can quantitatively measure the degree of privacy protection and can resist attacks from adversaries with background knowledge. Privacy-preserving data publishing based on differential privacy has become a research hot spot [12–15].

However, in the distributed scenario, data are possessed by multiple data owners. Data from a single data owner may not be sufficient for statistical learning, and aggregating all data at a single data owner may not be possible. For example [16], in Table 1 the data are possessed by three data owners. Each row of Table 1 represents the information of an individual, where records 1 to 4 come from data owner 1, records 5 to 8 come from data owner 2, and records 9 to 10 come from data owner 3. Simply integrating and publishing the data from each data owner would cause a serious privacy leakage, so sharing and exchanging data in a distributed environment requires security guarantees. To solve this problem, we make the following contributions:
(1) We propose two algorithms, LDA-DP and Mul-LDA-DP. The LDA-DP algorithm protects privacy for data publishing in the centralized scenario, and the Mul-LDA-DP algorithm protects privacy for data publishing in the distributed scenario.
(2) In the distributed scenario, the data owners cooperate with each other to publish a projection data set that satisfies differential privacy. To improve the utility of the published data in the distributed scenario, we propose a correlated noise generation scheme that uses the additivity of the Gaussian distribution to mitigate the effect of noise and achieves the same noise level as the centralized scenario.
(3) We conduct experiments on several real data sets. The experimental results show that the data released by the LDA-DP and Mul-LDA-DP algorithms maintain good utility in SVM classification.

2. Related Work

In this section, we review the research status of privacy-preserving data publishing in the centralized and distributed scenarios, respectively.

2.1. Privacy Preserving Data Publishing in Centralized Scenario

Blum et al. [17] proposed the sublinear queries (SuLQ) input perturbation framework, which adds noise to the covariance matrix; the framework can only be used for querying the projected subspace. Chaudhuri et al. [18] proposed the PPCA algorithm as an improvement of SuLQ. The PPCA algorithm randomly samples a k-dimensional subspace in a way that ensures differential privacy and is biased toward high utility. Both the SuLQ and PPCA procedures are differentially private approximations of the top-k subspace. Zhang et al. [19] proposed the PrivBayes algorithm: they first constructed a Bayesian network with differential privacy and then used the Bayesian network to generate a data set for publication. Chen et al. [20] presented the JTree algorithm: they first explored the relationships between attributes using sparse vector sampling, then constructed a Markov network that satisfies differential privacy and generated a synthetic data set for publication. Zhang et al. [21] proposed the PrivHD algorithm based on JTree; they used high-pass filtering to speed up the construction of the Markov network and built a better junction tree for generating the synthetic data set for publication. Xu et al. [22] proposed the DPPro algorithm: they first randomly projected the original high-dimensional data into a low-dimensional space, then added noise to the projection vector and the low-dimensional projection data, and finally released the low-dimensional projection data. Zhang et al. [23] presented the PrivMN method: they constructed a Markov model with differential privacy and then used it to generate a synthetic data set for publication. The algorithms mentioned above are mainly used for privacy-preserving data publishing in centralized scenarios.

2.2. Privacy Preserving Data Publishing in Distributed Scenario

There has been less research on privacy protection for horizontally partitioned data publishing. Ge et al. [24] proposed a distributed differentially private principal component analysis (DPS-PCA) algorithm: data owners collaborate to analyze the principal components while protecting private information, and then release low-dimensional subspaces of high-dimensional sparse data. Wang et al. [25] proposed an efficient and scalable protocol for computing principal components in a distributed environment. The data owners first encrypt the shared data and send them to a semitrusted third party; the third party then runs a private aggregation algorithm on the encrypted data and sends the aggregated data to the data user for computing the principal components. Imtiaz et al. [26] presented a distributed differentially private principal component analysis (DPdisPCA) algorithm. Each data owner perturbs the local covariance matrix with Gaussian noise and, with the assistance of a semitrusted third party, the principal components are computed while local data privacy is preserved. Alhadidi et al. [27] proposed a two-party data publishing algorithm with differential privacy. They first presented a two-party protocol for the exponential mechanism that can be used as a subprotocol; the data released by this algorithm are suitable for classification tasks. Cheng et al. [28] proposed a differentially private sequential update of Bayesian network algorithm called DP-SUBN3, in which data owners collaboratively construct the Bayesian network, treating intermediate results as prior knowledge, and then use the Bayesian network to generate a data set for publication. Wang et al. [29] proposed a distributed differentially private anonymization algorithm and guaranteed that each step of the algorithm satisfies the definition of secure two-party computation; this is the first study of differentially private data publishing for arbitrarily partitioned data. In our prior work [16], we proposed the PPCA-DP-MH algorithm. First, the data owners and a semitrusted third party cooperate to reduce the dimension of the high-dimensional data and obtain the top principal components that satisfy differential privacy; then each data owner uses the generative model of probabilistic principal component analysis to generate a data set of the same scale as the original data for publication. Different from the prior work [16], this paper uses linear discriminant analysis to publish projection data with differential privacy. Linear discriminant analysis retains the class information of the data while reducing the dimension, which helps maintain the utility of the published data for classification.

3. Preliminaries

3.1. Linear Discriminant Analysis (LDA)

Linear discriminant analysis, proposed by Fisher, is one of the most widely used and effective methods in dimensionality reduction and pattern recognition. Its typical applications include face recognition, target tracking and detection, credit card fraud detection, and speech recognition. The idea of linear discriminant analysis for binary classification is to choose a projection direction such that, after projection, samples of different classes are as far apart as possible and samples within each class are as clustered as possible. We denote the data set as $D = D_1 \cup D_2$, where $D_c$ ($c = 1, 2$) contains the $n_c$ samples $x \in \mathbb{R}^{d}$ of class $c$ and $n = n_1 + n_2$. The within-class mean vector of class $c$ in the original sample space is as follows:

$$\mu_c = \frac{1}{n_c}\sum_{x \in D_c} x, \quad c = 1, 2. \tag{1}$$

The between-class scatter matrix is as follows:

$$S_b = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^{T}. \tag{2}$$

The within-class scatter matrix of class $c$ is as follows:

$$S_c = \sum_{x \in D_c} (x - \mu_c)(x - \mu_c)^{T}, \quad c = 1, 2. \tag{3}$$

Then, the pooled within-class scatter matrix is as follows:

$$S_w = S_1 + S_2. \tag{4}$$

It can also be expressed as follows:

$$S_w = \sum_{x \in D} x x^{T} - \left(n_1 \mu_1 \mu_1^{T} + n_2 \mu_2 \mu_2^{T}\right). \tag{5}$$

The Fisher criterion is as follows:

$$J(w) = \frac{w^{T} S_b w}{w^{T} S_w w}. \tag{6}$$

Using the Lagrange multiplier method to find the optimal projection direction vector, we obtain the following:

$$w \propto S_w^{-1}(\mu_1 - \mu_2). \tag{7}$$

The result of linear discriminant analysis only gives the optimal projection direction; it does not give an explicit classification rule.
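As a concrete illustration, the following minimal sketch computes the binary-class Fisher direction of equations (1)–(7); the function name `lda_direction` and the synthetic usage data are ours, not part of the paper.

```python
import numpy as np

def lda_direction(X, y):
    """Minimal sketch of binary LDA: returns the Fisher projection direction.

    X: (n, d) data matrix; y: labels in {1, 2}.
    """
    X1, X2 = X[y == 1], X[y == 2]
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)      # within-class means, eq. (1)
    S1 = (X1 - mu1).T @ (X1 - mu1)                   # class-1 scatter, eq. (3)
    S2 = (X2 - mu2).T @ (X2 - mu2)                   # class-2 scatter, eq. (3)
    Sw = S1 + S2                                     # pooled within-class scatter, eq. (4)
    # Optimal direction w proportional to Sw^{-1}(mu1 - mu2), eq. (7); solve instead of inverting.
    w = np.linalg.solve(Sw, mu1 - mu2)
    return w / np.linalg.norm(w)

# Usage: project the data onto the one-dimensional discriminant direction.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(1, 1, (50, 5))])
y = np.repeat([1, 2], 50)
w = lda_direction(X, y)
projection = X @ w
```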

3.2. Differential Privacy

Differential privacy provides rigorous privacy protection for sensitive information, and its guarantee can be quantified by mathematical formulas. The essence of differential privacy is to randomly perturb the output with noise so that it is difficult to distinguish the original input data from the output.

Definition 1 ([30]). A randomized algorithm $M$ is $\epsilon$-indistinguishable if for any two neighboring databases $D$ and $D'$ differing in a single entry, and for all $S \subseteq \mathrm{Range}(M)$:

$$\Pr[M(D) \in S] \le e^{\epsilon}\,\Pr[M(D') \in S], \tag{8}$$

where $\epsilon$ is a small positive real number.
When $\epsilon$ is small, $e^{\epsilon} \approx 1 + \epsilon$, so the two output distributions are nearly identical. The parameter $\epsilon$ controls the ratio of the probabilities that algorithm $M$ produces the same output on two neighboring databases, which reflects the level of privacy protection that $M$ can provide.

Definition 2 ([30]). A randomized algorithm $M$ satisfies $(\epsilon, \delta)$-differential privacy if for any two neighboring databases $D$ and $D'$ differing in a single entry, and for all $S \subseteq \mathrm{Range}(M)$:

$$\Pr[M(D) \in S] \le e^{\epsilon}\,\Pr[M(D') \in S] + \delta, \tag{9}$$

where $\epsilon$ is a small positive real number called the privacy budget and $\delta$ is a small positive real number. This is also called $\delta$-approximate $\epsilon$-indistinguishability.

Definition 3. $(\epsilon, \delta)$-differential privacy is the relaxed version of differential privacy. When $\delta = 0$, it becomes Definition 1, which is the strict version of differential privacy. Formula (9) means that the constraint of formula (8) is allowed to be broken with a small probability $\delta$.

Theorem 1 ([31]). A sufficient condition for the randomized function $M$ to satisfy $(\epsilon, \delta)$-differential privacy is that, for any two neighboring databases $D$ and $D'$,

$$\Pr\!\left[\,\left|\ln \frac{\Pr[M(D) = O]}{\Pr[M(D') = O]}\right| \le \epsilon \right] \ge 1 - \delta. \tag{10}$$
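To make Theorem 1 concrete, the following sketch perturbs a vector-valued query with Gaussian noise using the standard calibration $\sigma = \Delta_2 f \cdot \sqrt{2\ln(1.25/\delta)}/\epsilon$ (valid for $\epsilon \le 1$); the function name and the example sensitivity are ours.

```python
import numpy as np

def gaussian_mechanism(value, l2_sensitivity, epsilon, delta, rng=None):
    """Sketch of the Gaussian mechanism: perturb `value` so that releasing it
    satisfies (epsilon, delta)-differential privacy (for epsilon <= 1)."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return value + rng.normal(0.0, sigma, size=np.shape(value))

# Usage: privately release the mean of unit-norm records (L2 sensitivity 2/n under replacement).
X = np.random.default_rng(0).normal(size=(100, 5))
X = X / np.linalg.norm(X, axis=1, keepdims=True)
noisy_mean = gaussian_mechanism(X.mean(axis=0), 2.0 / len(X), epsilon=0.5, delta=1e-5)
```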

Theorem 2 (Sequential Composition [31]). Let $M_i$ be an $(\epsilon_i, \delta_i)$-differentially private algorithm, $i = 1, \ldots, k$. Then, for the same data set $D$, the combined algorithm $M(D) = (M_1(D), \ldots, M_k(D))$ is $\left(\sum_{i=1}^{k}\epsilon_i, \sum_{i=1}^{k}\delta_i\right)$-differentially private.

Theorem 3 (Parallel Composition [31]). Let $M_i$ be an $(\epsilon_i, \delta_i)$-differentially private algorithm, $i = 1, \ldots, k$, and let $D_1, \ldots, D_k$ be disjoint data sets. Then the combined algorithm $M(D_1, \ldots, D_k) = (M_1(D_1), \ldots, M_k(D_k))$ is $\left(\max_i \epsilon_i, \max_i \delta_i\right)$-differentially private.

Theorem 4 (Post-Processing [31]). Let $M$ be a randomized algorithm that is $(\epsilon, \delta)$-differentially private and let $f$ be an arbitrary mapping. Then $f \circ M$ is $(\epsilon, \delta)$-differentially private.

4. Proposed Methods

In this section, we propose two algorithms, LDA-DP and Mul-LDA-DP. The LDA-DP algorithm is used for privacy protection of data publishing in the centralized scenario, and the Mul-LDA-DP algorithm is used for privacy protection of data publishing in the distributed scenario. Without loss of generality, we assume that all individual data in this paper are normalized to $d$-dimensional unit vectors.

4.1. LDA-DP Algorithm

In this section, we propose the LDA-DP algorithm for centralized data publishing.

4.1.1. Problem Statement and Algorithm Proposed

The data set $D$ contains two classes of individuals, denoted as $D = D_1 \cup D_2$, where each individual is a $d$-dimensional unit vector and $|D_c| = n_c$, $c = 1, 2$. Our goal is to protect the privacy of the original data from being leaked while publishing the projection data of the original data.

To solve this problem, we propose the LDA-DP algorithm, which is mainly divided into two stages. First, we use the Gaussian mechanism of differential privacy to perturb the within-class mean vectors $\mu_1$ and $\mu_2$. Second, we use the Gaussian mechanism to perturb the pooled within-class scatter matrix $S_w$. Finally, we obtain the projection direction vector that satisfies differential privacy and publish the low-dimensional projection data of the original data. The specific details are given in Algorithm 1.

Input: Data sets $D_1$, $D_2$, privacy parameters $(\epsilon_1, \delta_1)$, $(\epsilon_2, \delta_2)$
Output: Projection direction vector $\hat{w}$, projection data $Y$
(1) for $c = 1$ to 2 do
(2)  Set $\sigma_1$ as given in Theorem 5 and generate a $d$-dimensional noise vector $e_c$; each entry is sampled from $N(0, \sigma_1^{2})$
(3)  Compute $\hat{\mu}_c = \mu_c + e_c$
(4) end for
(5) Set $\sigma_2$ as given in Theorem 6 and generate a random matrix $E$: let $E$ be a symmetric matrix whose upper-triangle (including the diagonal) entries are sampled from $N(0, \sigma_2^{2})$ and whose lower-triangle entries equal their symmetric counterparts in the upper triangle
(6) Compute $\hat{S}_w = \sum_{x \in D_1 \cup D_2} x x^{T} - \left(n_1\hat{\mu}_1\hat{\mu}_1^{T} + n_2\hat{\mu}_2\hat{\mu}_2^{T}\right) + E$
(7) Compute $\hat{w} = \hat{S}_w^{-1}(\hat{\mu}_1 - \hat{\mu}_2)$
(8) Compute $Y = \{\hat{w}^{T}x : x \in D_1 \cup D_2\}$
(9) return $\hat{w}$, $Y$
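A minimal Python sketch of Algorithm 1 follows. The function name `lda_dp`, the unit-norm assumption on the rows, and the sensitivity constants ($2/n_c$ for a class mean, $\sqrt{2}$ for the scatter term, as derived in Theorems 5 and 6) should be read as one possible instantiation rather than the paper's exact implementation.

```python
import numpy as np

def lda_dp(X1, X2, eps1, delta1, eps2, delta2, rng=None):
    """Sketch of Algorithm 1 (LDA-DP) for unit-norm rows X1 (class 1) and X2 (class 2)."""
    rng = np.random.default_rng() if rng is None else rng
    d = X1.shape[1]

    def gauss_sigma(sens, eps, delta):
        # Standard Gaussian-mechanism calibration (assumed; valid for eps <= 1).
        return sens * np.sqrt(2.0 * np.log(1.25 / delta)) / eps

    # Steps 1-4: perturb the within-class mean vectors.
    mu_hat = []
    for Xc in (X1, X2):
        sigma1 = gauss_sigma(2.0 / len(Xc), eps1, delta1)
        mu_hat.append(Xc.mean(axis=0) + rng.normal(0.0, sigma1, size=d))
    mu1_hat, mu2_hat = mu_hat

    # Steps 5-6: perturb the pooled within-class scatter matrix (form (5)) with symmetric Gaussian noise.
    A = sum(Xc.T @ Xc for Xc in (X1, X2))            # sum of x x^T over all records
    sigma2 = gauss_sigma(np.sqrt(2.0), eps2, delta2)
    upper = rng.normal(0.0, sigma2, size=(d, d))
    E = np.triu(upper) + np.triu(upper, 1).T         # symmetric noise matrix
    Sw_hat = A - (len(X1) * np.outer(mu1_hat, mu1_hat)
                  + len(X2) * np.outer(mu2_hat, mu2_hat)) + E

    # Steps 7-9: private projection direction and released projection data.
    w_hat = np.linalg.solve(Sw_hat, mu1_hat - mu2_hat)
    Y = np.vstack([X1, X2]) @ w_hat
    return w_hat, Y
```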
4.1.2. Privacy Analysis of LDA-DP Algorithm

Theorem 5. The within-class mean vector $\hat{\mu}_c = \mu_c + e_c$ in Algorithm 1 satisfies $(\epsilon_1, \delta_1)$-differential privacy when each entry of $e_c$ is sampled from $N(0, \sigma_1^{2})$, where $\sigma_1 \ge \dfrac{2\sqrt{2\ln(1.25/\delta_1)}}{n_c\,\epsilon_1}$.

Proof. We denote the two neighboring data sets as $D_c$ and $D_c'$, in which only one individual is different. Without loss of generality, suppose the differing individuals are $x_{n_c} \in D_c$ and $x_{n_c}' \in D_c'$; both are $d$-dimensional unit vectors. We denote $\mu_c = \frac{1}{n_c}\sum_{x \in D_c} x$ and $\mu_c' = \frac{1}{n_c}\sum_{x \in D_c'} x$, and let $\hat{\mu}_c = \mu_c + e_c$ and $\hat{\mu}_c' = \mu_c' + e_c'$, where each entry of $e_c$ and $e_c'$ is sampled from $N(0, \sigma_1^{2})$.

The log ratio of the probability densities of $\hat{\mu}_c$ and $\hat{\mu}_c'$ at a point $O$ is $\ln\left(\Pr[\hat{\mu}_c = O]/\Pr[\hat{\mu}_c' = O]\right)$; the numerator describes the probability of seeing $O$ when the data set is $D_c$, and the denominator corresponds to the probability of seeing this same value when the data set is $D_c'$.

By Theorem 1, we need to find the value of $\sigma_1$ such that the inequality $\left|\ln\left(\Pr[\hat{\mu}_c = O]/\Pr[\hat{\mu}_c' = O]\right)\right| \le \epsilon_1$ holds with probability at least $1 - \delta_1$.

Let $v = \mu_c - \mu_c' = (x_{n_c} - x_{n_c}')/n_c$. Using the Lagrange multiplier method, the maximum value of the objective function $\|v\|$ is $2/n_c$ under the condition $\|x_{n_c}\| = \|x_{n_c}'\| = 1$.

Writing $O = \mu_c + e_c$, the densities of the two Gaussians give

$$\ln\frac{\Pr[\hat{\mu}_c = O]}{\Pr[\hat{\mu}_c' = O]} = \frac{\|e_c + v\|^{2} - \|e_c\|^{2}}{2\sigma_1^{2}} = \frac{2\langle e_c, v\rangle + \|v\|^{2}}{2\sigma_1^{2}}.$$

So the privacy loss is $\left|\left(2\|v\| z + \|v\|^{2}\right)/\left(2\sigma_1^{2}\right)\right|$, where $z \sim N(0, \sigma_1^{2})$ is the component of $e_c$ along the direction of $v$, for all outputs $O$, and $\|v\| \le 2/n_c$.

This quantity is bounded by $\epsilon_1$ whenever $z < \sigma_1^{2}\epsilon_1/\|v\| - \|v\|/2$.

To ensure that the privacy loss is bounded by $\epsilon_1$ with probability at least $1 - \delta_1$, we need to find $\sigma_1$ satisfying $\Pr\left[z \ge \sigma_1^{2}\epsilon_1/\|v\| - \|v\|/2\right] \le \delta_1$; due to symmetry, it suffices to bound the upper tail of $z$.

The Gaussian tail bound is as follows:

$$\Pr[z > t] \le \frac{\sigma_1}{\sqrt{2\pi}\,t}\, e^{-t^{2}/2\sigma_1^{2}}.$$

We let $\sigma_1 = c\|v\|/\epsilon_1$ and $t = \sigma_1^{2}\epsilon_1/\|v\| - \|v\|/2$, so that $t/\sigma_1 = c - \epsilon_1/(2c)$. When $\epsilon_1 \le 1$ and $c \ge 3/2$, the term $\ln(t/\sigma_1)$ in the resulting inequality is non-negative, and the standard analysis of the Gaussian mechanism shows that it suffices to take $c^{2} \ge 2\ln(1.25/\delta_1)$. Then we obtain the following:

$$\sigma_1 \ge \frac{2\sqrt{2\ln(1.25/\delta_1)}}{n_c\,\epsilon_1}.$$

Theorem 6. The pooled within-class scatter matrix $\hat{S}_w = S_w + E$ in Algorithm 1 satisfies $(\epsilon_2, \delta_2)$-differential privacy when each entry of the symmetric random matrix $E$ is sampled from $N(0, \sigma_2^{2})$, where $\sigma_2 \ge \dfrac{\Delta_2\sqrt{2\ln(1.25/\delta_2)}}{\epsilon_2}$ and $\Delta_2$ is the $\ell_2$-sensitivity of $\sum_{x \in D} x x^{T}$.

Proof. The two neighboring data sets are $D$ and $D'$, in which only one individual is different. Without loss of generality, suppose the differing individuals are $x_n \in D$ and $x_n' \in D'$; both are $d$-dimensional unit vectors.

Because $\hat{\mu}_c$ has been proved to satisfy differential privacy in Theorem 5, it can be treated as a constant in (5); therefore, to prove this theorem it is only necessary to prove that the first term $\sum_{x \in D} x x^{T}$ in (5) satisfies differential privacy after adding the random matrix $E$.

We denote $A = \sum_{x \in D} x x^{T}$ and $A' = \sum_{x \in D'} x x^{T}$, and let $\hat{A} = A + E$ and $\hat{A}' = A' + E'$, where $E$ and $E'$ are two independent symmetric random matrices whose upper-triangle (including the diagonal) entries are sampled from $N(0, \sigma_2^{2})$ and whose lower-triangle entries equal their symmetric counterparts in the upper triangle.

The log ratio of the probability densities of $\hat{A}$ and $\hat{A}'$ at a point $O$ is $\ln\left(\Pr[\hat{A} = O]/\Pr[\hat{A}' = O]\right)$.

By Theorem 1, we need to find the value of $\sigma_2$ such that the inequality $\left|\ln\left(\Pr[\hat{A} = O]/\Pr[\hat{A}' = O]\right)\right| \le \epsilon_2$ holds with probability at least $1 - \delta_2$.

By using the Lagrange multiplier method and the inequality in [18], the following inequality holds:

$$\left\|A - A'\right\|_{F}^{2} = \left\|x_n x_n^{T} - x_n' x_n'^{T}\right\|_{F}^{2} = 2 - 2\langle x_n, x_n'\rangle^{2} \le 2,$$

so the $\ell_2$-sensitivity satisfies $\Delta_2 \le \sqrt{2}$.

The rest of the proof proceeds as in Theorem 5, and we obtain the following:

$$\sigma_2 \ge \frac{\Delta_2\sqrt{2\ln(1.25/\delta_2)}}{\epsilon_2}.$$

We have proven that the within-class mean vectors satisfy $(\epsilon_1, \delta_1)$-differential privacy and that the pooled within-class scatter matrix satisfies $(\epsilon_2, \delta_2)$-differential privacy. By the sequential composition property of differential privacy, the projection direction vector $\hat{w}$ in Algorithm 1 satisfies $(\epsilon, \delta)$-differential privacy, where $\epsilon = \epsilon_1 + \epsilon_2$ and $\delta = \delta_1 + \delta_2$. For the published projection data $y = \hat{w}^{T} x$, each published value is an underdetermined system of equations: the number of unknowns (the $d$ entries of $x$) is larger than the number of equations (one), so the system has infinitely many solutions; that is, it is impossible to infer the information of the original data from the published projection data $Y$.

4.2. Mul-LDA-DP Algorithm

In this section, we propose the Mul-LDA-DP algorithm for distributed data publishing. The mathematical notations used in this section are summarized in Table 2.

4.2.1. Problem Statement and Algorithm Proposed

In the distributed scenario, data are stored by multiple data owners rather than a single owner, and the data owners do not trust each other. Data at a single site may not be sufficient for statistical learning. One solution is for each data owner to use the LDA-DP algorithm of Section 4.1 to publish its projection data independently. Another solution is for the data owners to cooperate with each other to publish the projection data of the integrated data. Comparing the two solutions, it is obvious that the latter yields higher utility of the published data. Based on the idea of the second solution and [32], we propose the Mul-LDA-DP algorithm for distributed data publishing. The entities of the model are as follows:

(1) Data owner. Data owner $k$ holds a local data set $D^{(k)}$, $k = 1, \ldots, K$. Each data owner can generate random vectors and matrices to perturb its local within-class mean vectors and within-class scatter matrices.
(2) Data publisher. The data publisher is a data publishing platform based on blockchain. It aggregates the noisy local within-class mean vectors and within-class scatter matrices, obtains the projection vector that satisfies differential privacy, and publishes the projection data of the pooled data.
(3) Random number generator. It generates random vectors and random matrices and sends them secretly to the data owners and the data publisher.

Threat Model. In our setting, we assume that the data owners and the data publisher are honest-but-curious; that is, they follow the protocol but may try to deduce information about other data owners from the received messages. Two types of adversaries are considered: external attackers and internal attackers. An external attacker (an external eavesdropper) may gain access to information such as the data sent by the data owners to the data publisher. Internal adversaries can be the data owners and the data publisher: each data owner may try to extract information it does not own, while the data publisher may try to extract information from each data owner.

Distributed Within-Class Mean Vectors and Pooled Within-Class Scatter Matrix Computation. When the data are owned by $K$ data owners, the within-class mean vectors (1) can be decomposed into the following:

$$\mu_c = \sum_{k=1}^{K} \frac{1}{n_c}\sum_{x \in D_c^{(k)}} x, \quad c = 1, 2,$$

where $D_c^{(k)}$ denotes the set of class-$c$ samples held by data owner $k$ and $n_c = \sum_{k=1}^{K} |D_c^{(k)}|$.

The pooled within-class scatter matrix (5) can be decomposed into the following:

$$S_w = \sum_{k=1}^{K}\sum_{c=1}^{2}\sum_{x \in D_c^{(k)}} (x - \mu_c)(x - \mu_c)^{T},$$

where the inner sum over $D_c^{(k)}$ can be computed locally by data owner $k$ once the global within-class means are known.

The abovementioned results allow each data owner to compute and perturb a partial result locally and simultaneously. Therefore, we use the additivity of the Gaussian distribution to propose a correlated noise generation scheme. We design the noise generation procedure such that (i) the data output by each data owner satisfy differential privacy and (ii) the aggregated data achieve the same noise level as in the pooled (centralized) scenario.

Scheme for Perturbing Shared Data with Correlated Noise. To prevent the data publisher and the other data owners from learning the private local data, each data owner perturbs its local within-class mean vectors and within-class scatter matrices with noise generated by itself together with noise generated by the random number generator. Through this correlated noise design, the data aggregated by the data publisher contain the same level of noise as in the centralized scenario ($\sigma_1$ and $\sigma_2$ below denote the noise scales required by Theorems 5 and 6). The scheme is described below (a sketch in code follows the list):

(1) Initialization stage. The random number generator generates $d$-dimensional random vectors $r_c^{(1)}, \ldots, r_c^{(K)}$ ($c = 1, 2$), each entry sampled from $N(0, (1 - 1/K)\sigma_1^{2})$, and random matrices $R^{(1)}, \ldots, R^{(K)}$, where each $R^{(k)}$ is a symmetric matrix whose upper-triangle (including the diagonal) entries are sampled from $N(0, (1 - 1/K)\sigma_2^{2})$ and whose lower-triangle entries equal their symmetric counterparts. These random vectors and matrices are constructed so that $\sum_{k=1}^{K} r_c^{(k)} = 0$ and $\sum_{k=1}^{K} R^{(k)} = 0$; then $r_c^{(k)}$ and $R^{(k)}$ are sent to data owner $k$ secretly.
(2) Data owner $k$ generates $d$-dimensional random vectors $g_c^{(k)}$, each entry sampled from $N(0, \sigma_1^{2}/K)$, computes its noisy within-class mean shares $\hat{\mu}_c^{(k)} = \frac{1}{n_c}\sum_{x \in D_c^{(k)}} x + r_c^{(k)} + g_c^{(k)}$, and sends them to the data publisher.
(3) The data publisher computes $\hat{\mu}_c = \sum_{k=1}^{K}\hat{\mu}_c^{(k)}$, $c = 1, 2$, and sends the aggregated within-class mean vectors back to each data owner.
(4) Data owner $k$ generates a random matrix $G^{(k)}$: a symmetric matrix whose upper-triangle (including the diagonal) entries are sampled from $N(0, \sigma_2^{2}/K)$ and whose lower-triangle entries equal their symmetric counterparts. Data owner $k$ computes its noisy within-class scatter share $\hat{S}^{(k)} = \sum_{c=1}^{2}\sum_{x \in D_c^{(k)}} (x - \hat{\mu}_c)(x - \hat{\mu}_c)^{T} + R^{(k)} + G^{(k)}$ and sends it to the data publisher.
(5) The data publisher computes $\hat{S}_w = \sum_{k=1}^{K}\hat{S}^{(k)}$ and calculates the projection vector $\hat{w} = \hat{S}_w^{-1}(\hat{\mu}_1 - \hat{\mu}_2)$, which satisfies differential privacy.
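Below is a minimal sketch of the zero-sum correlated noise construction described above, assuming the per-share variances $(1 - 1/K)\sigma^{2}$ and $\sigma^{2}/K$; the function name and variable names are ours.

```python
import numpy as np

def correlated_noise_shares(num_owners, shape, sigma, rng=None):
    """Returns per-owner correlated shares r_k (summing exactly to zero) and independent
    shares g_k such that each owner's total noise r_k + g_k has entrywise variance sigma**2,
    while the aggregate sum_k (r_k + g_k) = sum_k g_k also has entrywise variance sigma**2,
    i.e., the same noise level as the centralized scenario."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.normal(0.0, sigma, size=(num_owners,) + shape)
    r = u - u.mean(axis=0)       # correlated shares: sum to zero, variance (1 - 1/K) * sigma**2
    g = rng.normal(0.0, sigma / np.sqrt(num_owners), size=(num_owners,) + shape)
    return r, g

# Usage: owner k perturbs its local share with r[k] + g[k]; after aggregation only sum_k g[k] remains.
r, g = correlated_noise_shares(num_owners=3, shape=(5,), sigma=0.1)
print(np.allclose(r.sum(axis=0), 0.0))   # True: the correlated parts cancel
```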

The specific details of the Mul-LDA-DP algorithm are given in Algorithm 2. The input random vectors $r_c^{(k)}$ and random matrices $R^{(k)}$ in Algorithm 2 are generated in the initialization stage by the random number generator, $k = 1, \ldots, K$.

Input: Data sets $D^{(1)}, \ldots, D^{(K)}$, privacy parameters $(\epsilon_1, \delta_1)$, $(\epsilon_2, \delta_2)$, random vectors $r_c^{(k)}$ and random matrices $R^{(k)}$ generated in the initialization stage, $k = 1, \ldots, K$
Output: Projection direction vector $\hat{w}$, projection data $Y$
(1) for $k = 1$ to $K$ do
(2)  for $c = 1$ to 2 do
(3)   Set $\sigma_1$ as given in Theorem 5; data owner $k$ generates a $d$-dimensional random vector $g_c^{(k)}$, each entry sampled from $N(0, \sigma_1^{2}/K)$
(4)   Compute $\hat{\mu}_c^{(k)} = \frac{1}{n_c}\sum_{x \in D_c^{(k)}} x + r_c^{(k)} + g_c^{(k)}$
(5)  end for
(6) end for
(7) Compute $\hat{\mu}_c = \sum_{k=1}^{K}\hat{\mu}_c^{(k)}$, $c = 1, 2$
(8) for $k = 1$ to $K$ do
(9)  Set $\sigma_2$ as given in Theorem 6; data owner $k$ generates a symmetric random matrix $G^{(k)}$, each upper-triangle entry sampled from $N(0, \sigma_2^{2}/K)$
(10)  for $c = 1$ to 2 do
(11)   Compute $S_c^{(k)} = \sum_{x \in D_c^{(k)}} (x - \hat{\mu}_c)(x - \hat{\mu}_c)^{T}$
(12)  end for
(13)  Compute $\hat{S}^{(k)} = S_1^{(k)} + S_2^{(k)} + R^{(k)} + G^{(k)}$
(14) end for
(15) Compute $\hat{S}_w = \sum_{k=1}^{K}\hat{S}^{(k)}$
(16) Compute $\hat{w} = \hat{S}_w^{-1}(\hat{\mu}_1 - \hat{\mu}_2)$; each data owner $k$ computes its projection data $Y^{(k)} = \{\hat{w}^{T}x : x \in D^{(k)}\}$
(17) return $\hat{w}$, $Y = Y^{(1)} \cup \cdots \cup Y^{(K)}$
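The following sketch mirrors Algorithm 2 for $K$ data owners, reusing the zero-sum correlated noise construction above; the function and variable names, and the per-share variances $(1 - 1/K)\sigma^{2}$ and $\sigma^{2}/K$, are our assumptions.

```python
import numpy as np

def mul_lda_dp(local_data, eps1, delta1, eps2, delta2, rng=None):
    """Sketch of Algorithm 2 (Mul-LDA-DP). local_data[k] is a pair (X1_k, X2_k) of unit-norm
    class-1/class-2 rows held by owner k; noise calibration follows the LDA-DP sketch."""
    rng = np.random.default_rng() if rng is None else rng
    K = len(local_data)
    d = local_data[0][0].shape[1]
    n = [sum(len(pair[c]) for pair in local_data) for c in range(2)]   # global class sizes

    def sigma(sens, eps, delta):
        return sens * np.sqrt(2.0 * np.log(1.25 / delta)) / eps

    def zero_sum_shares(shape, s):
        u = rng.normal(0.0, s, size=(K,) + shape)
        return u - u.mean(axis=0), rng.normal(0.0, s / np.sqrt(K), size=(K,) + shape)

    def sym(M):
        return np.triu(M) + np.triu(M, 1).T          # symmetrize a noise matrix

    # Steps 1-7: each owner sends a noisy share of each within-class mean; the publisher aggregates.
    mu_hat = []
    for c in range(2):
        r, g = zero_sum_shares((d,), sigma(2.0 / n[c], eps1, delta1))
        shares = [pair[c].sum(axis=0) / n[c] + r[k] + g[k] for k, pair in enumerate(local_data)]
        mu_hat.append(np.sum(shares, axis=0))        # the correlated parts r[k] cancel here

    # Steps 8-15: each owner sends a noisy share of the pooled within-class scatter matrix.
    R, G = zero_sum_shares((d, d), sigma(np.sqrt(2.0), eps2, delta2))
    S_shares = []
    for k, pair in enumerate(local_data):
        Sk = sum((pair[c] - mu_hat[c]).T @ (pair[c] - mu_hat[c]) for c in range(2))
        S_shares.append(Sk + sym(R[k]) + sym(G[k]))
    Sw_hat = np.sum(S_shares, axis=0)

    # Steps 16-17: the publisher computes the private direction; owners publish their projections.
    w_hat = np.linalg.solve(Sw_hat, mu_hat[0] - mu_hat[1])
    Y = [np.vstack(pair) @ w_hat for pair in local_data]
    return w_hat, Y
```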
4.2.2. Privacy Analysis of the Mul-LDA-DP Algorithm

Theorem 7. The within-class mean vector $\hat{\mu}_c$ in Algorithm 2 satisfies $(\epsilon_1, \delta_1)$-differential privacy.

Proof. We have $\hat{\mu}_c^{(k)} = \frac{1}{n_c}\sum_{x \in D_c^{(k)}} x + r_c^{(k)} + g_c^{(k)}$. Because each entry of $r_c^{(k)}$ is sampled from $N(0, (1 - 1/K)\sigma_1^{2})$ and each entry of $g_c^{(k)}$ is sampled from $N(0, \sigma_1^{2}/K)$, each entry of $r_c^{(k)} + g_c^{(k)}$ obeys $N(0, \sigma_1^{2})$. By Theorem 5, $\hat{\mu}_c^{(k)}$ satisfies $(\epsilon_1, \delta_1)$-differential privacy.

Due to the post-processing property of differential privacy, the aggregated within-class mean vector $\hat{\mu}_c = \sum_{k=1}^{K}\hat{\mu}_c^{(k)}$ in Algorithm 2 satisfies $(\epsilon_1, \delta_1)$-differential privacy.

Theorem 8. The pooled within-class scatter matrix $\hat{S}_w$ in Algorithm 2 satisfies $(\epsilon_2, \delta_2)$-differential privacy.

Proof. We have $\hat{S}^{(k)} = S_1^{(k)} + S_2^{(k)} + R^{(k)} + G^{(k)}$, where each entry of the symmetric random matrix $R^{(k)}$ is sampled from $N(0, (1 - 1/K)\sigma_2^{2})$ and each entry of the symmetric random matrix $G^{(k)}$ is sampled from $N(0, \sigma_2^{2}/K)$, so each entry of $R^{(k)} + G^{(k)}$ obeys $N(0, \sigma_2^{2})$. By Theorem 6, $\hat{S}^{(k)}$ satisfies $(\epsilon_2, \delta_2)$-differential privacy. Due to the post-processing property of differential privacy, the pooled within-class scatter matrix $\hat{S}_w = \sum_{k=1}^{K}\hat{S}^{(k)}$ in Algorithm 2 satisfies $(\epsilon_2, \delta_2)$-differential privacy.

We have proven that both $\hat{\mu}_c$ and $\hat{S}_w$ satisfy differential privacy; we now show that the level of noise is the same as in the centralized scenario. In the initialization stage, the noise vectors and matrices generated by the random number generator satisfy $\sum_{k=1}^{K} r_c^{(k)} = 0$ and $\sum_{k=1}^{K} R^{(k)} = 0$.

The within-class mean vector is as follows:

$$\hat{\mu}_c = \sum_{k=1}^{K}\left(\frac{1}{n_c}\sum_{x \in D_c^{(k)}} x + r_c^{(k)} + g_c^{(k)}\right) = \mu_c + \sum_{k=1}^{K} g_c^{(k)}.$$

Each entry of $\sum_{k=1}^{K} g_c^{(k)}$ obeys $N(0, \sigma_1^{2})$.

The pooled within-class scatter matrix is as follows:

$$\hat{S}_w = \sum_{k=1}^{K}\left(S_1^{(k)} + S_2^{(k)} + R^{(k)} + G^{(k)}\right) = \sum_{k=1}^{K}\left(S_1^{(k)} + S_2^{(k)}\right) + \sum_{k=1}^{K} G^{(k)}.$$

Each entry of $\sum_{k=1}^{K} G^{(k)}$ obeys $N(0, \sigma_2^{2})$.

According to Theorems 5 and 6, the within-class mean vectors and the pooled within-class scatter matrix contain the same level of noise as in the centralized scenario, so we improve the utility of the published data while protecting data privacy.
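A quick numerical check of this additivity argument (the constants K = 3 and σ = 0.1 are illustrative):

```python
import numpy as np

# Verify empirically that the aggregate of the per-owner noises has the same variance
# as a single centralized Gaussian perturbation.
rng = np.random.default_rng(1)
K, sigma, trials = 3, 0.1, 200_000
u = rng.normal(0.0, sigma, size=(trials, K))
r = u - u.mean(axis=1, keepdims=True)        # correlated shares: each row sums to zero
g = rng.normal(0.0, sigma / np.sqrt(K), size=(trials, K))
aggregate = (r + g).sum(axis=1)              # noise remaining after aggregation
print(np.var(aggregate), sigma**2)           # both are approximately 0.01
```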

There are three opportunities for attackers to steal the data transmitted between the data owners and the data publisher. The first is when a data owner sends the noisy within-class mean vectors to the data publisher; the second is when a data owner sends the noisy within-class scatter matrices to the data publisher. From Theorems 7 and 8, we know that these vectors and matrices satisfy differential privacy, so the attacker cannot infer information about the original data from the eavesdropped data. The third is when a data owner sends projection data to the data publisher; as analyzed in Section 4.1.2, it is impossible to infer the information of the original data from the published projection data.

5. Experiment

To measure the utility of the LDA-DP and Mul-LDA-DP algorithms proposed in this paper, we conduct experiments on two real data sets, Adult and NLTCS. The Adult data set is extracted from the 1994 US Census; it contains 45222 individuals, each with 15 attributes. The NLTCS data set is extracted from the National Long Term Care Survey and records the daily activities of 21574 disabled persons at different time periods; each individual has 16 attributes. We use the SVM misclassification rate to measure the utility of the published data. For the Adult data set, the tasks are to predict whether a person (1) holds a post-secondary degree and (2) earns more than 50K. For the NLTCS data set, the tasks are to predict whether a person (1) is unable to get outside, (2) is unable to manage money, (3) is unable to travel, and (4) is unable to bathe. In our experiments, we keep δ fixed and let the privacy budget ε take different values, dividing the privacy parameters uniformly into two portions for the two perturbation steps. Each experiment is repeated 50 times, and the mean value is taken as the experimental result. We use "No Privacy" to denote the SVM misclassification rate on the original data set.
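For reference, a sketch of the evaluation procedure we describe, using scikit-learn; the data loading and the released projection data are assumed to come from the algorithms above, and the function name is ours.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def svm_misclassification_rate(features, labels, folds=5):
    """Sketch of the utility metric: mean SVM misclassification rate under cross-validation."""
    features = np.asarray(features).reshape(len(labels), -1)   # a 1-D projection becomes one feature
    accuracy = cross_val_score(LinearSVC(), features, labels, cv=folds, scoring="accuracy")
    return 1.0 - accuracy.mean()

# Hypothetical usage: compare the original data with the released projection data Y.
# rate_no_privacy = svm_misclassification_rate(X, task_labels)   # "No Privacy" baseline
# rate_released = svm_misclassification_rate(Y, task_labels)
```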

5.1. Comparing the Performance of the LDA-DP, PrivBayes, and PrivHD Algorithms under Different Privacy Budgets

The LDA-DP, PrivBayes, and PrivHD algorithms are all suitable for the centralized data publishing scenario, so in this set of experiments we set the number of data owners to 1 and let the privacy budget ε take different values. As can be seen from Figure 1, for both the Adult and NLTCS data sets, the SVM classification utility of the data published by the LDA-DP algorithm outperforms the PrivBayes algorithm. The LDA-DP algorithm also outperforms the PrivHD algorithm on the NLTCS data set; however, its SVM classification utility on the Adult data set is slightly lower than that of the PrivHD algorithm. We can also observe a commonality: for the LDA-DP, PrivBayes, and PrivHD algorithms, the SVM misclassification rate decreases as the privacy budget ε increases. This phenomenon is consistent with the theory that as the privacy budget increases, the privacy protection weakens and the utility of the data increases.

5.2. Comparing the Performance of the Mul-LDA-DP and DP-SUBN3 Algorithms under Different Privacy Budgets

The Mul-LDA-DP algorithm proposed in this paper is suitable for the distributed data publishing scenario, so in this set of experiments we set the number of data owners to 3 and let the privacy budget ε take different values. We train classifiers on the published data sets to compare the efficacy of the Mul-LDA-DP and DP-SUBN3 algorithms. From Figure 2, we can see that the SVM classification utility of the data published by the Mul-LDA-DP algorithm outperforms the DP-SUBN3 algorithm. In particular, on the money classifier of NLTCS and the education classifier of Adult, the misclassification rate of the Mul-LDA-DP algorithm is significantly lower than that of the DP-SUBN3 algorithm.

5.3. Comparing the Performance of the Mul-LDA-DP and DP-SUBN3 Algorithms under Different Numbers of Data Owners

In this experiment, we study the relationship between the SVM misclassification rate and the number of data owners. The number of data owners is set to 2, 4, 6, 8, and 10, and the privacy budget ε is set to 0.2. We train two classifiers, the education classifier and the salary classifier, on the Adult data set. The results in Figure 3 show that the SVM misclassification rate of the Mul-LDA-DP algorithm remains stable as the number of data owners changes. The reason is that we perturb the local shared data with correlated noise generated based on the additivity of the Gaussian distribution; this scheme ensures that the level of Gaussian noise added to the data in the distributed scenario matches the noise level in the centralized scenario. Therefore, as the number of data owners increases, the misclassification rate remains stable. The SVM misclassification rate of the DP-SUBN3 algorithm decreases as the number of data owners increases, because more data owners means more update iterations when constructing the Bayesian network, so the constructed Bayesian network is closer to the distribution of the original data. However, Figure 3 shows that the performance of the Mul-LDA-DP algorithm is still better than that of the DP-SUBN3 algorithm when the number of data owners is no more than 10.

6. Conclusion

In this paper, we propose two algorithms for privacy-preserving data publishing: the LDA-DP algorithm for data publishing in the centralized scenario and the Mul-LDA-DP algorithm for publishing multiparty horizontally partitioned data. We use the additivity of the Gaussian distribution to alleviate the effect of noise and achieve the same noise level as in the centralized scenario. The experimental results show that the projection data released by the two algorithms maintain high utility in SVM classification. However, the research in this paper also has limitations: (1) we only study the privacy protection problem for binary-class data, whereas real data are often multiclass; (2) the data released by the two algorithms are low-dimensional projections of the original data, which limits the analysis and mining of the released data in many aspects. In the future, we will continue to study these issues.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.