Abstract

In this paper, we study the privacy-preserving data publishing problem in a distributed environment. The data contain sensitive information; hence, directly pooling and publishing the local data will lead to privacy leaks. To solve this problem, we propose a multiparty horizontally partitioned data publishing method under differential privacy (HPDP-DP). First, in order to make the noise level of the published data in the distributed scenario the same as in the centralized scenario, we use the infinite divisibility of the Laplace distribution to design a distributed noise addition scheme that perturbs the locally shared data, and we use Paillier encryption to transmit the locally shared data to the semitrusted curator. Then, the semitrusted curator obtains the estimator of the covariance matrix of the aggregated data with Laplace noise, computes the principal components of the aggregated data, and returns them to each data owner. Finally, each data owner utilizes the generative model of probabilistic principal component analysis to generate a synthetic data set for publication. We conducted experiments on different real data sets; the experimental results demonstrate that the synthetic data sets released by the HPDP-DP method can maintain high utility.

1. Introduction

The ability of people to collect and analyze data is gradually improving with the development of artificial intelligence. Sometimes the data are stored by different sites (data owners), and each site holds only a small number of samples. For example, in Figure 1, there are three hospitals; the patients in each hospital are different from each other, but the data features of each patient are the same. In order to better mine the useful information behind the data, a large number of samples is needed. Pooling data in one central location enables efficient data analysis and mining, but the data contain sensitive information, and directly sharing or pooling them will lead to privacy leakage [1, 2], which prevents people from sharing data. That is to say, data face serious privacy leakage risks in the process of data sharing, network transmission, and storage [3]. It is important to protect the privacy of shared data and to weigh the security and availability of data [4, 5]. Therefore, it is desirable to propose an efficient distributed algorithm that can provide utility close to the centralized case while protecting the privacy of the data. In recent years, there has been some research on privacy-preserving data publishing and sharing, for example, anonymity technology [6] and encryption techniques such as lattice-based cryptography [7] and quantum cryptography [8, 9]. Differential privacy [10] has been widely used for privacy-preserving data publishing, and privacy-preserving data publishing based on differential privacy has become a research hot spot [11-15].

However, there are still some challenges when using the differential privacy technique to protect the privacy of published data. One is that the data are stored by different data owners, and directly pooling and publishing the data will lead to privacy leakage. Another is that, when data are stored by multiple data owners and differential privacy is used independently to add noise to the locally shared data, the utility of the published data decreases as the number of data owners increases. In view of this, we propose a horizontally partitioned data publication approach with differential privacy. We make the following contributions:

(1) We propose a method for horizontally partitioned data publication with differential privacy (HPDP-DP). In a distributed environment, data are owned by multiple parties. We use the weighted average of the noised covariance matrices of the local data to estimate the covariance matrix of the pooled data. The data owners and a semitrusted curator collaborate to get the principal components of the pooled data and generate a synthetic data set for publishing.

(2) In the distributed scenario, in order to make the noise level of the aggregated data the same as in the centralized scenario, the HPDP-DP method utilizes the infinite divisibility of the Laplace distribution and Paillier homomorphic encryption to alleviate the effect of noise, achieving the same noise level as the centralized scenario.

(3) We evaluate the performance of the HPDP-DP method through experiments on real data sets, and the experimental results show that the HPDP-DP method can generate synthetic data with high utility.

2. Related Work

In this section, we introduce the research status of privacy-preserving data release based on differential privacy in the centralized and distributed scenarios, respectively.

2.1. Privacy-Preserving Data Publishing in Centralized Environment

In recent years, there has been much research on privacy-preserving data publishing based on differential privacy. Jiang et al. [16] proposed a method that adds Laplace noise to the covariance matrix and the projection matrix and then uses the noisy projection matrix to reconstruct the data and generate the synthetic data set for publishing. Zhang et al. proposed the PrivBayes method in [17]; they used the relationships between the features to build a Bayesian network. They added Laplace noise to the low-dimensional marginal distributions to make the Bayesian network satisfy differential privacy, and then they used the Bayesian network to generate a synthetic data set for publishing. Chen et al. proposed the Jtree method in [18]. First, they proposed a sampling-based testing framework to explore pairwise dependencies while satisfying differential privacy. Then, they applied the junction tree algorithm to construct an inference mechanism to infer the joint data distribution. Finally, they efficiently generated a synthetic data set by using the noisy marginal tables and the inference model. Xu et al. [19] proposed the DPPro scheme, which releases high-dimensional data by random projection. They projected the original high-dimensional data into a randomly selected low-dimensional subspace and added noise to the low-dimensional projected data. They theoretically demonstrated that the data published by the DPPro method have squared Euclidean distances similar to the original data. In order to solve the curse of dimensionality in high-dimensional data publishing, Zhang et al. [20] presented the PrivHD method based on the junction tree. First, they used the exponential mechanism to construct a Markov network; in order to reduce the candidate space, a high-pass filtering technique is used in sampling. Then, they used the maximum spanning tree method to build a better junction tree. Finally, a high-dimensional synthetic data set is generated for publication. Zhang et al. [21] presented the PrivMN method. They first constructed a Markov model to express the relationships among features. Then, they used the Laplace mechanism to add noise to the marginal distributions to generate the noisy marginal distribution tables. Finally, they used the noisy marginal distributions to generate a synthetic data set for publishing. Gu et al. [22] proposed the PPCA-DP method; they first used principal component analysis to reduce the dimensionality of high-dimensional data and then added Laplace noise to the low-dimensional projected data; finally, they used the generative model of probabilistic principal component analysis to generate a synthetic data set for publishing. The above are all studies on privacy-preserving data publishing in centralized scenarios.

2.2. Privacy-Preserving Data Publishing in Distributed Environment

At present, most existing privacy-preserving data publishing works focus on the centralized scenario; there are fewer studies on privacy-preserving data publishing in the distributed scenario. The multiparty data release scenario studied in this paper is one in which each data owner owns a data set and uses differential privacy technology to protect the privacy of the local data set, rather than the scenario in which multiple individuals keep their data locally. The latter typically utilizes local differential privacy [23] techniques to protect the privacy of individual data [24, 25]. In the following, we introduce the research status of privacy-preserving data release in the multiparty setting, where each data owner owns a data set.

Alhadidi et al. [26] proposed the first noninteractive two-party horizontally partitioned data publication method that satisfies differential privacy and secure multiparty computation. The data set published by this method is suitable for classification tasks. Hong et al. [27] constructed a framework (the CELS protocol) that enables distributed parties to securely generate outputs while satisfying differential privacy; the security and differential privacy guarantees of the protocol are proved. Ge et al. [28] presented the DPS-PCA algorithm, in which data owners collaborate to compute the principal components while protecting the privacy of the data. The DPS-PCA algorithm can trade off the accuracy of estimating principal components against the degree of privacy protection, but it only outputs a low-dimensional subspace of high-dimensional sparse data. An efficient and scalable distributed PCA protocol was proposed by Wang et al. [29] for computing the principal components of horizontally partitioned data in a distributed environment. First, the shared data are encrypted and sent to a semitrusted third party. Second, the shared data are aggregated by the semitrusted third party, and the aggregated result is sent to the data consumer. Finally, the data consumer performs principal component analysis and obtains the principal components of the pooled data. Cheng et al. [30] presented the DP-SUBN3 approach, in which the data owners build a Bayesian network with the assistance of a semitrusted curator, and the Bayesian network is then used to generate a synthetic data set. In the DP-SUBN3 approach, the four stages of correlation quantification, structure initialization, structure update, and parameter learning all need to access the local data sets, and each stage satisfies differential privacy, which in turn makes the DP-SUBN3 approach satisfy differential privacy. For the privacy protection of data publishing in arbitrary partitions between two parties, Wang et al. [31] presented the first distributed algorithm that generates anonymous data from two parties. In order to prevent both parties from leaking private information, the anonymization process satisfies both differential privacy and secure two-party computation. Gu et al. [32] presented the PPCA-DP-MH approach, in which the data owners collaborate with a semitrusted curator to reduce the dimensionality of the data, and then the data owners use the generative model of probabilistic principal component analysis to generate a published data set. In the PPCA-DP-MH method, since multiple data owners add noise to the data locally and independently, the utility of the published data gradually decreases as the number of data owners increases. In response to this challenge, we propose the HPDP-DP method in this paper. We design a correlated noise generation and addition scheme so that the utility of the published data does not decrease as the number of data owners increases and may even gradually improve.

3. Preliminaries

3.1. Probabilistic Principal Component Analysis (PPCA)

Principal component analysis is one of the most commonly used dimensionality reduction methods. It is a statistical analysis method that converts multiple variables into a few latent variables through dimensionality reduction techniques. These few low-dimensional and uncorrelated latent variables are also called principal components, and they can reflect most of the information of the original variables. Next, the main process of finding the principal components is introduced. First, compute the covariance matrix $S$ of the data. Then perform eigenvalue decomposition on the covariance matrix $S$, $S = U \Lambda U^T$, where $\Lambda$ is a diagonal matrix whose diagonal elements are the eigenvalues of the matrix $S$, sorted as $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$. The corresponding eigenvectors $u_1, u_2, \dots, u_d$ are called the principal components, and $U$ is an orthogonal matrix consisting of these eigenvectors. Usually, the number $k$ of top principal components retained is determined by the cumulative contribution rate $\eta = \sum_{j=1}^{k} \lambda_j / \sum_{j=1}^{d} \lambda_j$.
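To make this procedure concrete, the following NumPy sketch computes the covariance eigendecomposition and keeps the smallest number $k$ of principal components whose cumulative contribution rate reaches a threshold $\eta$; the function name and the centering step are illustrative assumptions, not part of the HPDP-DP protocol itself.

```python
import numpy as np

def top_principal_components(X, eta=0.95):
    """Eigendecompose the sample covariance matrix and keep the smallest
    number k of components whose cumulative contribution rate reaches eta."""
    Xc = X - X.mean(axis=0)                      # center the data
    S = Xc.T @ Xc / X.shape[0]                   # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)         # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    ratios = np.cumsum(eigvals) / eigvals.sum()  # cumulative contribution
    k = int(np.searchsorted(ratios, eta)) + 1    # smallest k reaching eta
    return eigvals[:k], eigvecs[:, :k]
```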

However, Tipping and Bishop [33] pointed out that principal component analysis (PCA) is a nongenerative model, and they presented a generative counterpart called probabilistic principal component analysis (PPCA). The most common model to associate low-dimensional latent variables with high-dimensional observable variables is the factor analysis model, i.e., $x = Wz + \mu + \varepsilon$, where $x$ is the $d$-dimensional observation vector consisting of the original variables, $z$ is a $k$-dimensional vector consisting of latent variables with $k < d$, $\varepsilon$ is a random noise term, and the $d \times k$ matrix $W$ associates the vector $z$ with the vector $x$. The vector $\mu$ allows the model to have a nonzero mean vector.

Theorem 1 [33]. From Figure 2 and the latent variable model $x = Wz + \mu + \varepsilon$, when $z \sim \mathcal{N}(0, I)$ and $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$, then $x \sim \mathcal{N}(\mu, WW^T + \sigma^2 I)$, where the maximum likelihood estimates of $\mu$, $W$, and $\sigma^2$ are

$\hat{\mu} = \bar{x}, \quad \hat{W} = U_k (\Lambda_k - \hat{\sigma}^2 I)^{1/2} R, \quad \hat{\sigma}^2 = \frac{1}{d - k} \sum_{j = k + 1}^{d} \lambda_j,$

where $\bar{x}$ is the mean vector, the column vectors of $U_k$ are the eigenvectors corresponding to the top $k$ eigenvalues of the covariance matrix, $\Lambda_k$ is the diagonal matrix of these top $k$ eigenvalues, and $R$ is an arbitrary $k \times k$ orthogonal matrix.
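As an illustration of how Theorem 1 is used generatively, the sketch below draws synthetic records from the fitted PPCA model; it assumes the top-$k$ eigenpairs of the covariance matrix are already available (e.g., from the sketch above) and fixes the rotation $R = I$, which Theorem 1 permits.

```python
import numpy as np

def ppca_generate(mu, eigvals_k, U_k, sigma2, n_samples, seed=0):
    """Sample x = W z + mu + eps with z ~ N(0, I), eps ~ N(0, sigma2 * I),
    using the ML estimate W = U_k (Lambda_k - sigma2 I)^{1/2} with R = I."""
    rng = np.random.default_rng(seed)
    d, k = U_k.shape
    W = U_k * np.sqrt(np.maximum(eigvals_k - sigma2, 0.0))  # scale columns
    Z = rng.standard_normal((n_samples, k))                 # latent variables
    eps = np.sqrt(sigma2) * rng.standard_normal((n_samples, d))
    return Z @ W.T + mu + eps

# sigma2 is the ML estimate: the average of the d - k discarded eigenvalues.
```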

3.2. Differential Privacy

Differential privacy is a strong privacy protection model independent of background knowledge. If the output of a privacy-preserving algorithm is insensitive to small changes in the input, the algorithm satisfies differential privacy. The essence of differential privacy is to randomly perturb the query results, so that people cannot infer the original input information based on the query results.

Definition 1 (Differential privacy) [10]. A random algorithm $\mathcal{A}$ satisfies $\epsilon$-differential privacy if for any two neighboring data sets $D$ and $D'$ (only one record differs between the two data sets) and for any possible output set $S \subseteq \mathrm{Range}(\mathcal{A})$, there is

$\Pr[\mathcal{A}(D) \in S] \le e^{\epsilon} \cdot \Pr[\mathcal{A}(D') \in S].$

$\epsilon$ is a small positive real number, which is also called the privacy budget.
In Definition 1, $\epsilon$ is used to control the probability ratio of the random algorithm $\mathcal{A}$ obtaining the same output on the two neighboring data sets $D$ and $D'$; it reflects the level of privacy protection that the algorithm can provide.

Definition 2 (Sensitivity) [10]. Let $f: D \to \mathbb{R}^p$ be a function that maps a data set into a fixed-size vector of real numbers. For any neighboring data sets $D$ and $D'$, the sensitivity of $f$ is defined as follows:

$\Delta f = \max_{D, D'} \| f(D) - f(D') \|_1,$

where $\|\cdot\|_1$ denotes the $L_1$ norm.

Definition 3 (Laplace mechanism) [34]. For any function $f: D \to \mathbb{R}^p$, if the random algorithm $\mathcal{A}$ satisfies the equation

$\mathcal{A}(D) = f(D) + (Y_1, Y_2, \dots, Y_p),$

then the algorithm $\mathcal{A}$ satisfies $\epsilon$-differential privacy, where $Y_1, \dots, Y_p$ are independent $\mathrm{Lap}(\Delta f / \epsilon)$ random variables.
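A minimal sketch of the Laplace mechanism; the function name is our own, and the example query is a simple count, whose $L_1$ sensitivity is 1 because adding or removing one record changes the count by at most 1.

```python
import numpy as np

def laplace_mechanism(f_value, sensitivity, epsilon, seed=None):
    """Return f(D) perturbed with i.i.d. Laplace(sensitivity/epsilon) noise."""
    rng = np.random.default_rng(seed)
    f_value = np.asarray(f_value, dtype=float)
    return f_value + rng.laplace(scale=sensitivity / epsilon,
                                 size=f_value.shape)

# Example: a counting query has L1 sensitivity 1.
noisy_count = laplace_mechanism(1024, sensitivity=1.0, epsilon=0.5)
```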

Theorem 2 [35]. Let $Y \sim \mathrm{Lap}(\lambda)$; then the distribution of $Y$ is infinitely divisible. Furthermore, for every integer $m \ge 1$, $Y = \sum_{i=1}^{m} (G_{1,i} - G_{2,i})$, where $G_{1,i}$ and $G_{2,i}$ are i.i.d. random variables with the Gamma density $g(x) = \frac{(1/\lambda)^{1/m}}{\Gamma(1/m)} x^{1/m - 1} e^{-x/\lambda}$, $x \ge 0$.
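Theorem 2 can be checked empirically. The sketch below, with illustrative parameter choices, compares a sum of $m$ Gamma-difference shares against a single Laplace draw; both empirical variances should be close to $\mathrm{Var}[\mathrm{Lap}(\lambda)] = 2\lambda^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, m, n = 2.0, 5, 200_000   # Laplace scale, number of shares, sample size

# Each of m parties contributes G1 - G2 with G1, G2 ~ Gamma(1/m, lam);
# by Theorem 2 the sum of the m shares is distributed as Lap(lam).
g1 = rng.gamma(shape=1.0 / m, scale=lam, size=(m, n))
g2 = rng.gamma(shape=1.0 / m, scale=lam, size=(m, n))
distributed = (g1 - g2).sum(axis=0)
centralized = rng.laplace(scale=lam, size=n)

# Both empirical variances should be close to 2 * lam**2.
print(distributed.var(), centralized.var(), 2 * lam**2)
```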

Theorem 3 (Sequential composition) [34]. Let $\mathcal{A}_1, \mathcal{A}_2, \dots, \mathcal{A}_t$ be a series of privacy algorithms whose privacy budgets are $\epsilon_1, \epsilon_2, \dots, \epsilon_t$. For the same data set $D$, the combined algorithm $\mathcal{A}(\mathcal{A}_1(D), \dots, \mathcal{A}_t(D))$ provides $(\sum_{i=1}^{t} \epsilon_i)$-differential privacy.

Theorem 4 (Parallel composition) [34]. Let $\mathcal{A}_1, \mathcal{A}_2, \dots, \mathcal{A}_t$ be a series of privacy algorithms whose privacy budgets are $\epsilon_1, \epsilon_2, \dots, \epsilon_t$, and let $D_1, D_2, \dots, D_t$ be disjoint data sets. The combined algorithm $\mathcal{A}(\mathcal{A}_1(D_1), \dots, \mathcal{A}_t(D_t))$ provides $(\max_{i} \epsilon_i)$-differential privacy.

3.3. Paillier Encryption and Decryption

In this paper, we use the Paillier encryption scheme [36] to encrypt the locally shared data before they are aggregated. The Paillier encryption scheme is described as follows:

(1) Key generation: $N = pq$, where $p$ and $q$ are large primes, and $g \in \mathbb{Z}_{N^2}^*$. Compute the Euler function $\varphi(N) = (p-1)(q-1)$ and let $\lambda = \mathrm{lcm}(p-1, q-1)$; $(N, g)$ is the public key and $\lambda$ is the private key.

(2) Encryption: for plaintext $m \in \mathbb{Z}_N$, randomly select $r \in \mathbb{Z}_N^*$; the ciphertext is $c = g^m r^N \bmod N^2$.

(3) Decryption: for ciphertext $c \in \mathbb{Z}_{N^2}^*$, the plaintext is $m = \frac{L(c^{\lambda} \bmod N^2)}{L(g^{\lambda} \bmod N^2)} \bmod N$, where $L(u) = \frac{u - 1}{N}$.

Paillier encryption is additively homomorphic. We use $[[m]]$ to represent the encrypted ciphertext of $m$. Then, $[[m_1]] \cdot [[m_2]] = [[m_1 + m_2]]$, and $[[m]]^{k} = [[k \cdot m]]$.
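A small demonstration of these two homomorphic identities, assuming the third-party python-paillier package (`phe`) is available; that library exposes the ciphertext product modulo $N^2$ through the `+` operator.

```python
from phe import paillier  # third-party python-paillier package

public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

m1, m2, k = 17, 25, 3
c1 = public_key.encrypt(m1)   # [[m1]]
c2 = public_key.encrypt(m2)   # [[m2]]

# These checks correspond to [[m1]] * [[m2]] = [[m1 + m2]]
# and [[m1]]^k = [[k * m1]] on the underlying ciphertexts.
assert private_key.decrypt(c1 + c2) == m1 + m2
assert private_key.decrypt(c1 * k) == k * m1
```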

4. The HPDP-DP Method

4.1. Problem Statement

There exist $m$ data owners; the $i$-th data owner holds a local data set denoted as $D_i$, where $n_i$ is the number of individuals owned by data owner $i$, $i = 1, 2, \dots, m$. Each individual is a $d$-dimensional vector. The data sets $D_1, D_2, \dots, D_m$ can be viewed as a horizontal partition of the integrated data set $D = \bigcup_{i=1}^{m} D_i$ among the $m$ data owners; that is, all the local data sets have the same attributes and do not intersect with each other. Our goal is to design an algorithm that can publish these horizontally partitioned data sets privately; specifically, with the assistance of a semitrusted curator, the data owners and the curator collaborate to publish a synthetic data set $\tilde{D}$, which has the same scale and statistical properties as the data set $D$. Typically, we assume that the data owners and the curator are honest-but-curious; that is, they will follow the protocol but try to find out as much secret information as possible.

Input: Data sets $D_i$, $i = 1, 2, \dots, m$. Private key $\lambda$, public key $(N, g)$. Factors $r_0, r_1, \dots, r_m$. Privacy budget $\epsilon$ and cumulative contribution rate $\eta$
Output: Synthetic data set $\tilde{D} = \tilde{D}_1 \cup \tilde{D}_2 \cup \cdots \cup \tilde{D}_m$
(1)for $i = 1$ to $m$ do
(2)Data owner $i$ generates noise matrices $G_1^{(i)}$ and $G_2^{(i)}$: let $G_1^{(i)}$ and $G_2^{(i)}$ be the symmetric matrices whose upper-triangle (including the diagonal) entries are sampled from Gamma$(1/m, \Delta f/\epsilon)$, and set $Z_i = G_1^{(i)} - G_2^{(i)}$.
(3)Compute: $A_i = X_i^T X_i$
(4)Compute: $\hat{A}_i = A_i + Z_i$
(5)for $u = 1$ to $d$ do
(6)  for $v = 1$ to $d$ do
(7)   Compute: $[[\hat{A}_i(u, v)]] = g^{\hat{A}_i(u, v)} r_i^{N} \bmod N^2$
(8)  end for
(9)end for
(10)end for
(11)return the encrypted matrices $[[\hat{A}_i]]$, $i = 1, \dots, m$, to the curator
(12)Compute the Hadamard product: $[[\hat{A}]] = [[\hat{A}_1]] \circ [[\hat{A}_2]] \circ \cdots \circ [[\hat{A}_m]]$
(13)Decrypt $[[\hat{A}]]$ with $\lambda$ and $r_0$: $\hat{A} = \sum_{i=1}^{m} \hat{A}_i$
(14)Compute: $\hat{S} = \hat{A}/n$, where $n = \sum_{i=1}^{m} n_i$
(15)Eigenvalue decomposition of matrix $\hat{S}$, return eigenvalues in descending order $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d$ and corresponding eigenvectors $u_1, u_2, \dots, u_d$
(16)for $k = 1$ to $d$ do
(17)if $\sum_{j=1}^{k} \lambda_j / \sum_{j=1}^{d} \lambda_j \ge \eta$ then
(18)  $U_k = (u_1, u_2, \dots, u_k)$
(19)  break
(20)end if
(21)end for
(22)return $U_k$ and $\lambda_1, \dots, \lambda_k$ to each data owner
(23)for $i = 1$ to $m$ do
(24) Compute $\hat{\mu}_i$, $\hat{W}_i$, and $\hat{\sigma}_i^2$ as in Theorem 1
(25) Use the model defined in Theorem 1 to generate a synthetic data set $\tilde{D}_i$
(26)end for
(27)return $\tilde{D} = \tilde{D}_1 \cup \tilde{D}_2 \cup \cdots \cup \tilde{D}_m$

In view of the above scenario, we propose a horizontally partitioned data publishing method with differential privacy (HPDP-DP). Algorithm 1 depicts the HPDP-DP algorithm. First, each data owner perturbs its local scatter matrix with random noise that obeys the Gamma distribution, encrypts it, and sends it to the semitrusted curator. Then, the semitrusted curator aggregates all the local scatter matrices to get the noisy estimator of the covariance matrix of the pooled data. The semitrusted curator performs eigenvalue decomposition on the covariance matrix to get the principal components, and the top $k$ principal components are sent to each data owner. At last, each data owner uses the top $k$ principal components and the generative model of probabilistic principal component analysis to generate a synthetic data set.

In order to reduce the impact of noise on the availability of the published data, the HPDP-DP algorithm employs a distributed Laplace mechanism to add noise to the local scatter matrices. According to Theorem 2, the infinite divisibility of the Laplace distribution, we perturb each local scatter matrix with noise that follows a Gamma distribution, which makes the estimator of the covariance matrix of the pooled data contain the same level of noise as in the centralized scenario. Inspired by [37], since the step of perturbing the local scatter matrix with Gamma-distributed noise does not itself satisfy differential privacy, we use the Paillier encryption scheme to encrypt the perturbed scatter matrix to protect the privacy of the local data. The HPDP-DP algorithm mainly consists of the following stages.
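The following plaintext sketch illustrates only the distributed noise-addition step (the encryption of the shares is shown separately below); the concrete sensitivity value `delta_f` is an assumption for illustration rather than the constant stated in this paper.

```python
import numpy as np

def noisy_scatter(X, m, delta_f, epsilon, rng):
    """One owner's share: local scatter matrix plus symmetric Gamma noise.
    Summing m such shares leaves one Lap(delta_f/epsilon) noise per entry
    (Theorem 2), i.e., the same noise level as the centralized scenario."""
    d = X.shape[1]
    A = X.T @ X                               # local scatter matrix
    scale = delta_f / epsilon
    G = rng.gamma(1.0 / m, scale, size=(2, d, d))
    Z = np.triu(G[0] - G[1])                  # upper triangle incl. diagonal
    Z = Z + np.triu(Z, 1).T                   # symmetrize
    return A + Z

rng = np.random.default_rng(1)
owners = [rng.random((100, 4)) for _ in range(3)]   # three local data sets
m, d = len(owners), 4
delta_f = d * (d + 1) / 2   # assumed L1 sensitivity over the upper triangle
A_hat = sum(noisy_scatter(X, m, delta_f, 0.5, rng) for X in owners)
S_hat = A_hat / sum(X.shape[0] for X in owners)     # noisy pooled covariance
```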

Initialization phase: in the initialization phase, the Paillier cryptographic system generates the public key $(N, g)$ and the private key $\lambda$. The system also generates the factors $r_0, r_1, \dots, r_m$. The factor $r_0$ and the private key $\lambda$ are secretly sent to the curator. The public key $(N, g)$ and the factor $r_i$ are secretly sent to data owner $i$, $i = 1, 2, \dots, m$.

Perturbation and encryption phase: each data owner randomly perturbs its local scatter matrix. The scatter matrix of data owner $i$ is given by

$A_i = \sum_{j=1}^{n_i} x_{ij} x_{ij}^T = X_i^T X_i,$

where $x_{ij}$ denotes the $j$-th individual held by data owner $i$.

Data owner $i$ generates two symmetric random matrices $G_1^{(i)}$ and $G_2^{(i)}$; the upper-triangle entries of $G_1^{(i)}$ and $G_2^{(i)}$ are sampled from $\mathrm{Gamma}(1/m, \Delta f/\epsilon)$, $i = 1, 2, \dots, m$. Then, the local noisy scatter matrix is $\hat{A}_i = A_i + G_1^{(i)} - G_2^{(i)}$. Data owner $i$ uses the public key $(N, g)$ and the factor $r_i$ to encrypt each element of $\hat{A}_i$ to get the encrypted matrix $[[\hat{A}_i]]$, which is sent to the curator, $i = 1, 2, \dots, m$.

Aggregation and decryption phase: after receiving the encrypted matrices $[[\hat{A}_1]], \dots, [[\hat{A}_m]]$, the curator computes their Hadamard product. We use the symbol $\circ$ for the Hadamard product of matrices:

$[[\hat{A}_1]] \circ [[\hat{A}_2]] \circ \cdots \circ [[\hat{A}_m]] = \left[\left[\sum_{i=1}^{m} \hat{A}_i\right]\right] = \left[\left[\sum_{i=1}^{m} A_i + Z\right]\right],$

where each entry of $Z = \sum_{i=1}^{m} (G_1^{(i)} - G_2^{(i)})$ follows $\mathrm{Lap}(\Delta f/\epsilon)$, which holds due to Theorem 2. The curator decrypts the above result to get the sum of the local scatter matrices with Laplace noise, $\hat{A} = \sum_{i=1}^{m} A_i + Z$, which is used as an estimation of the scatter matrix of the pooled data; then the estimation of the covariance matrix of the pooled data is $\hat{S} = \hat{A}/n$.
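A sketch of this phase under the same python-paillier assumption as before: entrywise ciphertext aggregation plays the role of the Hadamard product, and only the aggregated, still noisy, matrix is ever decrypted; the helper names are hypothetical.

```python
import numpy as np
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)

def encrypt_matrix(pub, A):
    """Owner side: encrypt a noisy scatter matrix entry by entry."""
    return [[pub.encrypt(v) for v in row] for row in A.tolist()]

def aggregate(encrypted_mats):
    """Curator side: entrywise ciphertext aggregation; the underlying
    operation is the product modulo N^2, i.e., the Hadamard product."""
    agg = encrypted_mats[0]
    for E in encrypted_mats[1:]:
        agg = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(agg, E)]
    return agg

def decrypt_matrix(priv, E):
    """Curator side: only the aggregated noisy matrix is decrypted."""
    return np.array([[priv.decrypt(c) for c in row] for row in E])

# Usage with the noisy_scatter sketch above:
# encrypted = [encrypt_matrix(public_key,
#                             noisy_scatter(X, m, delta_f, 0.5, rng))
#              for X in owners]
# A_hat = decrypt_matrix(private_key, aggregate(encrypted))
```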

In this stage, our idea is to use the weighted average of the local covariance matrices to estimate the covariance matrix of the pooled data. Assuming that the covariance matrix of data owner $i$ is $S_i$, its relationship with the scatter matrix is $A_i = n_i S_i$, and then the estimation of the covariance matrix of the pooled data is $\hat{S} = \frac{1}{n} \sum_{i=1}^{m} n_i S_i + \frac{Z}{n}$.

Principal component analysis phase: the curator performs eigenvalue decomposition on the matrix $\hat{S}$. The curator gets the top $k$ eigenvectors (the top $k$ principal components) and then sends them to each data owner.

Generate synthetic data set phase: each data owner uses the returned top $k$ principal components and the generative model of probabilistic principal component analysis in Theorem 1 to generate a synthetic data set.

4.2. Analysis
4.2.1. Security Analysis

Theorem 5. The data set owned by data owner $i$ is $D_i$ and its corresponding scatter matrix is $A_i$, $i = 1, \dots, m$. Define the query function $f(D_i) = A_i$, whose output is the result intended to be protected. $G_1^{(i)}$ and $G_2^{(i)}$ are symmetric random matrices that will be added to $A_i$, and the upper-triangle entries of $G_1^{(i)}$ and $G_2^{(i)}$ are sampled from $\mathrm{Gamma}(1/m, \Delta f/\epsilon)$, $i = 1, \dots, m$. If the random algorithm $\mathcal{A}$ holds

$\mathcal{A}(D) = \sum_{i=1}^{m} \left( f(D_i) + G_1^{(i)} - G_2^{(i)} \right),$

then the algorithm $\mathcal{A}$ satisfies $\epsilon$-differential privacy.

Proof. According to Theorem 2, each element of $\sum_{i=1}^{m} (G_1^{(i)} - G_2^{(i)})$ obeys $\mathrm{Lap}(\Delta f/\epsilon)$. So, next we will prove that if the algorithm $\mathcal{A}$ satisfies $\mathcal{A}(D) = f(D) + Z$, where $Z$ is a symmetric random matrix whose upper-triangle entries are sampled from $\mathrm{Lap}(\Delta f/\epsilon)$, then the algorithm $\mathcal{A}$ satisfies $\epsilon$-differential privacy.
We denote the two neighboring data sets as $D$ and $D'$; only one individual is different. Without loss of generality, suppose the differing individuals are the $n$-th individuals in $D$ and $D'$, denoted as $x_n$ and $x_n'$. Assume that all individual data have been normalized to the $[0, 1]$ interval. The scatter matrices of $D$ and $D'$ are as follows: $f(D) = \sum_{j=1}^{n-1} x_j x_j^T + x_n x_n^T$ and $f(D') = \sum_{j=1}^{n-1} x_j x_j^T + x_n' x_n'^T$. Let $Z$ and $Z'$ be two independent symmetric random matrices whose upper-triangle entries are sampled from $\mathrm{Lap}(\Delta f/\epsilon)$.
Let $\mathcal{A}(D) = f(D) + Z$ and $\mathcal{A}(D') = f(D') + Z'$; then the log ratio of the probabilities of $\mathcal{A}(D)$ and $\mathcal{A}(D')$ at a point $t$ is given by

$\ln \frac{\Pr[\mathcal{A}(D) = t]}{\Pr[\mathcal{A}(D') = t]} = \frac{\epsilon}{\Delta f} \left( \| t - f(D') \|_1 - \| t - f(D) \|_1 \right) \le \frac{\epsilon \, \| f(D) - f(D') \|_1}{\Delta f}.$

According to the definition of differential privacy (Definition 1), we need to prove that the following inequality holds: $\| f(D) - f(D') \|_1 \le \Delta f$. The mean vectors of $\mathcal{A}(D)$ and $\mathcal{A}(D')$ are $f(D)$ and $f(D')$, respectively, so $f(D) - f(D') = x_n x_n^T - x_n' x_n'^T$. Hence, since all individuals are normalized to $[0, 1]$, we have the following:

$\| f(D) - f(D') \|_1 = \| x_n x_n^T - x_n' x_n'^T \|_1 \le \Delta f.$

Therefore, the following formula holds:

$\ln \frac{\Pr[\mathcal{A}(D) = t]}{\Pr[\mathcal{A}(D') = t]} \le \epsilon.$

So the conclusion of Theorem 5 holds.
Security against external attacks: an external attacker may eavesdrop on the data sent by the local data owners to the curator. According to the semantic security of Paillier encryption against chosen-plaintext attacks, the external attacker is unable to decrypt the data without knowing the private key $\lambda$ and the factors $r_i$. The external attacker may also eavesdrop on the aggregated value $[[\hat{A}]]$; the attacker is unable to decrypt it without knowing the private key $\lambda$. Even if the external attacker obtains the sum of the scatter matrices with noise, $\hat{A}$, the local data are still safe according to Theorem 5, because $\hat{A}$ contains Laplace noise.

Security against internal attacks: the internal adversaries are the data owners and the curator. Data owner $i$ holds $r_i$ secretly; the rest of the data owners and the curator cannot decrypt $[[\hat{A}_i]]$ without the private key and $r_i$ unless the curator colludes with the data owners. The curator can use the private key $\lambda$ and the factor $r_0$ to decrypt the aggregated value, but the curator can only obtain the aggregated value with Laplace noise, so the local data are safe according to Theorem 5.

4.2.2. Complexity Analysis

Computation time cost analysis: the total time complexity of Algorithm 1 is $O((n + d)d^2)$, where $m$ is the number of data owners, $d$ is the number of attributes, $n = \sum_{i=1}^{m} n_i$, and $n_i$ is the number of samples owned by data owner $i$, $i = 1, \dots, m$. This is due to the following facts. The major computational cost of Algorithm 1 is reflected in lines 1-11, lines 16-21, and lines 23-26. Lines 1-11 compute, perturb, and encrypt the scatter matrices of the local data of the data owners, and the time complexity is $O(nd^2)$. Lines 16-21 perform principal component analysis on the aggregated scatter matrix, and the time complexity is $O(d^3)$, where the number $k$ of retained principal components is at most $d$. Lines 23-26 are that each data owner uses Theorem 1 to generate a published data set, and the time complexity is $O(ndk)$. In summary, the time complexity of Algorithm 1 is $O(nd^2 + d^3 + ndk)$, which is $O((n + d)d^2)$.

Communication cost analysis: there exist three stages that incur communication costs. The first stage is that the data owners send the encrypted local scatter matrices to the curator; the size of the message sent by each data owner is $O(d^2)$, so the total size of the messages sent in this stage is $O(md^2)$. The second stage is that the curator sends the top $k$ eigenvalues and their corresponding eigenvectors to each data owner; the total size of the messages sent in this stage is $O(mkd)$. The third stage is that each data owner sends the synthetic data set to the curator; the size of the message sent by data owner $i$ is $O(n_i d)$, so the total size of the messages sent during this stage is $O(nd)$.

5. Experiment

In this section, we experimentally evaluate the performance of the HPDP-DP algorithm by comparing it with the DP-SUBN3 algorithm [30]. We conduct experiments on two real data sets, NLTCS [38] and Adult [39]. The NLTCS data set contains 21574 individuals; each individual has 16 attributes. The Adult data set contains 45222 individuals; each individual has 15 attributes. We use the method in [30] to preprocess the Adult data set; after preprocessing, the number of attributes in the Adult data set is 52. We use SVM classification accuracy to evaluate the performance of the HPDP-DP algorithm, training multiple classifiers on the published synthetic data sets. For the NLTCS data set, we predict whether a person is unable to go outside and whether a person is unable to manage money. For the Adult data set, we predict whether a person holds a postsecondary degree and whether a person earns more than 50K. In each classification task, we use 20% of the individuals as the test set and 80% of the individuals as the training set. Each experiment is run five times, and the average results are reported. The number of retained principal components is determined by the cumulative contribution rate $\eta$, which is set to 0.8 for the NLTCS data set and 0.95 for the Adult data set. In order to measure the performance of the HPDP-DP algorithm more clearly, the same SVM classifier is trained on the original data set; we label the SVM classification accuracy on the original data set with "No Privacy."
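For concreteness, the evaluation protocol can be sketched as follows, assuming scikit-learn; the function and its parameters are our own names, and the SVM hyperparameters are library defaults rather than the settings used in our experiments.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def svm_accuracy(X, y, runs=5, seed=0):
    """80/20 split, train an SVM on the (synthetic) training part, and
    report the mean test accuracy over several runs."""
    accs = []
    for r in range(runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed + r)
        accs.append(SVC().fit(X_tr, y_tr).score(X_te, y_te))
    return float(np.mean(accs))
```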

5.1. The Impact of the Number of Principal Components Retained on the SVM Classification Accuracy

In this section, we train multiple classifiers to study the influence of the number of principal components retained on the SVM classification accuracy. In this set of experiments, the number of data owners is set to 3; the privacy budget is set to 0.5.

For the Adult data set, Figures 3(a) and 3(c) show the cumulative contribution rates and the individual contribution rates of the principal components. Because the Adult data set has many attributes after preprocessing, we only mark the corresponding SVM classification accuracy at selected numbers of retained principal components (up to 40) in Figures 3(b) and 3(d). For the NLTCS data set, it can be seen from Figures 3(e) and 3(g) that the first principal component alone already accounts for a large share of the total contribution and that the cumulative contribution rate of the top seven principal components is already high; correspondingly, it can be seen from Figures 3(f) and 3(h) that the SVM classification accuracy reaches a high level with only a few principal components retained.

The common conclusion is that as the cumulative contribution rate increases (i.e., as the number of retained principal components increases), the SVM classification accuracy increases accordingly. This phenomenon is consistent with the principle of principal component analysis: the principal components are uncorrelated with each other and capture the information of the original data, so the more principal components are retained, the more information of the original data is contained in the published data, and the better the performance of the published data set.

5.2. Performance Comparison of HPDP-DP and DP-SUBN3 with Different Privacy Budgets

In this part of the experiments, we fix the number of data owners to three while varying the privacy budget. Figure 4 shows the impact of the privacy budget on the HPDP-DP and DP-SUBN3 algorithms. Figures 4(a) and 4(b) show the SVM classification accuracy of the HPDP-DP and DP-SUBN3 algorithms on the Adult data set, and Figures 4(c) and 4(d) show the corresponding results on the NLTCS data set. From Figure 4, except for the salary classifier on the Adult data set, the performance of the HPDP-DP algorithm is significantly better than that of the DP-SUBN3 algorithm; even for the salary classifier, the SVM classification accuracy of the HPDP-DP algorithm is still no lower than that of the DP-SUBN3 algorithm. The experimental results also show that the SVM classification accuracy of the synthetic data sets released by both algorithms increases as the privacy budget increases. This is because, according to the definition of differential privacy, when the privacy budget increases, the degree of privacy protection decreases and the availability of the released data increases.

5.3. The Impact of the Number of Data Owners on the SVM Classification Accuracy

In order to study the effect of the number of data owners on the performance of the HPDP-DP algorithm, in this section we set the number of data owners to 2, 4, 6, 8, and 10 and fix the privacy budget to 0.2. The results in Figure 5 show that the performance of the HPDP-DP algorithm is better than that of the DP-SUBN3 algorithm. We can observe that as the number of data owners increases, the SVM classification accuracy of the synthetic data sets released by both algorithms increases accordingly. For the DP-SUBN3 algorithm, the reason is that when the number of data owners increases, the number of update iterations in the algorithm increases, which helps to obtain a better Bayesian network. For the HPDP-DP algorithm, we use the weighted average of the local covariance matrices as an estimate of the covariance matrix of the pooled data, and this estimate improves as the number of data owners increases. At the same time, we use the distributed Laplace mechanism to add noise to the shared data, so even when the number of data owners increases, the aggregated result still contains only one share of random noise (the same level as in the centralized scenario), whose scale is determined only by the privacy budget and the sensitivity. Therefore, the SVM classification accuracy of the synthetic data set released by the HPDP-DP algorithm increases as the number of data owners increases.

6. Conclusion

In this paper, in order to privately publish horizontally partitioned data owned by multiple parties, we present a multiparty horizontally partitioned data publishing method with differential privacy. We use the weighted average of the covariance matrices of the local data to estimate the covariance matrix of the pooled data and then obtain the principal components of the pooled data. In order to protect the privacy of the local data and improve the utility of the published data, we exploit the infinite divisibility of the Laplace distribution to add noise to the locally shared data. The experimental results show that the synthetic data set released by the HPDP-DP algorithm can maintain high utility. However, this paper also has limitations: (1) principal component analysis is only suitable for linear dimensionality reduction, not for nonlinear dimensionality reduction; (2) the HPDP-DP algorithm is only suitable for horizontally partitioned data publishing, not for vertically partitioned data publishing. We will conduct research on these aspects in the future.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.