Abstract

The amount of Internet data is significantly increasing due to the development of network technology, inducing the appearance of big data. Experiments have shown that deep mining and analysis on large datasets would introduce great benefits. Although cloud computing supports data analysis in an outsourced and cost-effective way, it brings serious privacy issues when sending the original data to cloud servers. Meanwhile, the returned analysis result suffers from malicious inference attacks and also discloses user privacy. In this paper, to conquer the above privacy issues, we propose a general framework for Preserving Multiparty Data Privacy (PMDP for short) in cloud computing. The PMDP framework can protect numeric data computing and publishing with the assistance of untrusted cloud servers and achieve delegation of storage simultaneously. Our framework is built upon several cryptography primitives (e.g., secure multiparty computation) and differential privacy mechanism, which guarantees its security against semihonest participants without collusion. We further instantiate PMDP with specific algorithms and demonstrate its security, efficiency, and advantages by presenting security analysis and performance discussion. Moreover, we propose a security enhanced framework sPMDP to resist malicious inside participants and outside adversaries. We illustrate that both PMDP and sPMDP are reliable and scale well and thus are desirable for practical applications.

1. Introduction

With the rapid growth in data size and the development of the corresponding data analysis technology, original data, which is usually large in volume, heterogeneous, and of low quality, has begun to play a very important role in various fields, such as healthcare, advertisement, government decision-making, and transportation. This is mainly because deep mining and analysis over these large datasets (i.e., big data) can reveal hidden and valuable information and thus produce great benefits. On the other hand, owing to the characteristics of big data and the pursuit of better outputs through more complex analysis, big data processing requires more computational overhead and resource expenditure, which challenges the traditional data processing model.

Cloud computing provides a ubiquitous and on-demand way to access a shared pool of configurable computing resources, which can be rapidly provisioned and released with minimal management effort [1]. Therefore, it offers a desirable platform for big data processing and enables users to outsource their computations to cloud servers with computing capabilities sufficient for big data processing. To this end, users need to outsource their original data to cloud servers. However, this brings serious security and privacy issues, especially when the data is sensitive for users. For illustration, we consider the following scenario.

Mobile wearable devices have become very popular in recent years. With a smart band on your wrist, you can not only collect your own health data such as sleep time, heart rate, and motion trail, but also compare your quantity of motion with the average level or other people’s level. In this case, the data containing your private information is collected, aggregated, analyzed, and published by businesses with the assistance of cloud servers, which may result in privacy disclosure of users. Some related reports have been published (https://techcrunch.com/2016/11/03/fitbit-jawbone-garmin-and-mio-fitness-bands-criticized-for-privacy-failings/). Even worse, with the appearance of the sharing economy, user privacy issues have become increasingly prominent. For instance, Uber, which brings convenience by providing car-ride service, has been accused of allowing its employees to look through and gather users’ travel data and device information at will, and its application named God View has been criticized since it can track users even after they get out of the car (https://www.nbcnews.com/tech/tech-news/uber-fined-settlement-ny-over-god-view-tracking-n491706).

To guarantee data confidentiality and preserve user privacy, various security mechanisms have been developed and employed in each phase of the data life cycle, which roughly includes data storage, data processing, and data publishing (there also exist security issues in the process of data acquisition and data destruction, but they are out of the scope of this paper). For example, attribute-based encryption schemes [2, 3] are used to secure data storage in public cloud servers, secure multiparty computation schemes [4, 5] are introduced to protect data aggregation, and differential privacy mechanisms [6] provide a way to quantify the privacy disclosed in data publishing. In addition, several works protect data processing and data publishing by combining differential privacy with other cryptographic primitives. However, few studies address privacy preservation throughout the full life cycle of multiparty data, especially in the context of cloud computing. From a practical viewpoint, once data privacy is exposed in one phase of the data life cycle, the security mechanisms deployed in the other phases become useless. Therefore, it is necessary to address the privacy issues of big data in cloud computing from a global perspective.

1.1. Our Contribution

In this paper we propose a general framework for Preserving Multiparty Data Privacy (PMDP for short) in cloud computing, which provides complete protection throughout the entire life cycle of users’ data and is suitable for securing multiparty data aggregation and publication with the assistance of an untrusted cloud server. Specifically, the contributions of this study can be summarized as follows:

(1) Based on well studied security mechanisms for preserving user privacy in the process of data storage, processing, and publishing, respectively, we combine these techniques in a nontrivial and tight manner and propose the PMDP framework that covers the full lifecycle of multiple users’ data.

(2) We present the principles of choosing the building security mechanisms involved in the PMDP framework and a specific instance. Furthermore, to illustrate the advantage and practicability of the PMDP framework, we evaluate the performance of the instance in terms of efficiency and functionalities by comparing it with other related works.

(3) We formally discuss the security of the PMDP framework. Concretely speaking, we reduce its security to the security of the building mechanisms including fully homomorphic encryption, secure multiparty computation, and differential privacy, which are all with the feature of provable security. Thus, the PMDP framework is also provably secure.

(4) We put forward a reinforced version of the PMDP framework named sPMDP, to provide stronger security and privacy guarantee. In addition, we also show the application scenarios of the PMDP and sPMDP frameworks.

1.2. Outline

The remainder of the paper is organized as follows. In Section 2, we review the related work on techniques for data privacy in different phases. Section 3 introduces some preliminary knowledge used in this paper. The PMDP framework is presented in Section 4. Section 5 illustrates an instantiation of the framework. In Section 6, we evaluate the performance of the framework and discuss its security along with its application scenarios. We propose the reinforced framework sPMDP and analyze its security in Section 7. Finally, some concluding remarks are given in Section 8.

2. Related Work

In this section, in the context of cloud computing, we briefly introduce the security mechanisms used to protect data privacy in each phase of the data life cycle.

Secure Data Storage. When data storage is outsourced to cloud servers, the data owner completely loses access control over his/her data. For privacy reasons, however, the data owner wants the outsourced data to be accessible only to authorized users. A natural solution is to encrypt the data before sending it to cloud servers so that only the users holding the corresponding secret keys can decrypt it. Although traditional public key encryption schemes can guarantee data security, they suffer from limited efficiency. A newer approach is identity-based encryption, but it faces some new challenges [7]. As an extension of identity-based encryption, attribute-based encryption (ABE) [8] enables the data owner to place a fine-grained access policy over the outsourced data and is well suited to securing data storage in cloud computing. For this reason, many ABE schemes [9, 10] with extended functionalities have been proposed. In addition, there are also various privacy-preserving authentication protocols, such as two-factor authentication [11, 12], three-factor authentication [13, 14], and end-to-end authentication [15]. Some recent works focus on practical application fields, such as smart metering [16], Internet of Things [17], and WBAN [18].

Secure Data Processing. The purpose of aggregating and storing data is to analyze it and extract valuable information. However, when the data is outsourced to cloud servers with the above encryption mechanisms, the data analyst has to download and decrypt the data before processing it, which is not convenient enough to meet the demands of big data analysis. Fortunately, fully homomorphic encryption (FHE) [19], which allows cloud servers to evaluate arbitrary functions on encrypted data without decryption, can simultaneously guarantee the security of the data in the storage and processing phases. Owing to its significant advantages for secure data sharing, many FHE schemes have been proposed to improve the security and efficiency of the original one. Besides, some frameworks for efficient and privacy-preserving outsourced computation have been proposed based on FHE, such as EPOM [20], POFD [21], and POCR [22].

Another important security mechanism used to secure data processing is secure multiparty computation (MPC), which enables multiple users to perform the assigned computation on their collected data and obtain the computation result without getting any information about one another’s data. To improve the efficiency of MPC in the context of cloud computing, several variants of MPC have been proposed, such as on-the-fly MPC [23] and cloud-assisted MPC. In addition, motivated by security requirements of practical applications (e.g., outsourced database query, private set intersection, and information retrieval), some efficient and specific cloud-based MPC protocols [24, 25] have been designed.

Secure Data Publishing. Output privacy is especially significant in the context of big data. As data mining technology advances, preserving data privacy becomes more and more difficult, since the sensitive information in original data can be exposed directly or indirectly (via inference) during the mining process. Namely, not only the original data but also the data mining output can lead to the disclosure of sensitive information, so it is necessary to pay attention to output privacy. Anonymization technologies are widely used to preserve data privacy in the process of data publishing. Although classical anonymization methods (e.g., k-anonymity, l-diversity, and t-closeness) have been well studied and there are various corresponding algorithms, they cannot resist structure-based deanonymization attacks [26], which implies that the published data would reveal user privacy. In contrast, differential privacy [27] provides strong theoretical guarantees on the privacy of data by adding noise with specific distributions to raw data. Roughly, the research on differential privacy applied to data publishing comprises two aspects, interactive data publishing and noninteractive data publishing. In the first one, the fundamental methods respond to queries by perturbing the outcome derived from the original dataset, and include the Laplace mechanism and the exponential mechanism [27]. These methods are easy to implement, but the noise required for privacy protection is relatively large. Afterwards, researchers developed techniques [28] that respond to queries according to a noisy histogram generated from raw data, which have low sensitivity and comparatively small noise. As for the noninteractive mode, current research results mostly focus on batch query [29], contingency table publishing [30], grouping and generalization [31], and sanitized dataset publishing [32].

Secure Data in Multiphase. The MPC technique can protect data privacy in the input and computation process, but it is not designed for output privacy and, from a theoretical perspective, cannot guarantee it. With the rapid development of database technology and cloud computing, researchers have begun to design security schemes capable of preserving data privacy in multiple phases of its lifecycle simultaneously. For instance, Pettai and Laud [33] gave a good example of combining MPC and differential privacy with reasonable performance to achieve both computational and output privacy. However, their framework, which we call DPSharemind, is built upon GUPT [34], which is secure under the assumption of the existence of a trusted third party. Consequently, when multiple clients intend to delegate the computation of a joint function on their data to an untrusted cloud, the security of that framework cannot be guaranteed. Bindschaedler et al. [35] showed how to obtain a noisy (differentially private) aggregation result in a star network topology using the Shamir secret sharing scheme and additively homomorphic encryption; we call their work DPStar. They also ensured that the amount of noise in the final result can neither be reduced by colluding entities nor be secretly influenced by a cheating aggregator, which has important practical significance.

3. Preliminaries

In this section we review the concepts of secure multiparty computation and differential privacy, which are building tools of our PMDP framework.

3.1. Secure Multiparty Computation

Our framework is partially built upon the on-the-fly MPC protocol that is constructed from multikey FHE. In this paper we use the FHE derived from the NTRU encryption scheme of Hoffstein et al. with the modifications of Stehlé and Steinfeld [36]. We therefore start from NTRU encryption.

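NTRU Encryption. The NTRU cryptosystem is constructed over the ring $R = \mathbb{Z}[x]/\langle x^n + 1 \rangle$, where $n = 2^k$ for some integer $k$. Let $q$ be an odd prime number and $\chi$ be a $B$-bounded distribution over $R$. Denote the polynomial ring $R/qR$ by $R_q$ and the coefficient-wise reduction modulo $q$ into the set $\{-\lfloor q/2 \rfloor, \ldots, \lfloor q/2 \rfloor\}$ by $[\cdot]_q$. Roughly, given a security parameter $\kappa$, the NTRU cryptosystem is specified as follows.

Keygen($1^\kappa$): the key generation algorithm samples polynomials $f', g \leftarrow \chi$ and lets $f = 2f' + 1$. Particularly, if $f$ is not invertible in $R_q$, then $f'$ is resampled. Denote the inverse of $f$ in $R_q$ by $f^{-1}$; then the public key pk and the secret key sk are calculated as follows:

$$\mathrm{pk} := h = [2 g f^{-1}]_q \in R_q, \qquad \mathrm{sk} := f \in R.$$

Enc(pk, $m$): the encryption algorithm samples polynomials $s, e \leftarrow \chi$ for plaintext $m \in \{0, 1\}$, and outputs the corresponding ciphertext as follows:

$$c := [h s + 2 e + m]_q \in R_q.$$

Dec(sk, $c$): the decryption algorithm decrypts the ciphertext $c$ by computing $m = [f c]_q$ mod 2.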
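
As a short check of why decryption recovers the plaintext, assume the standard form of the scheme in [23, 36]: a secret key $f \equiv 1 \pmod 2$, a public key $h = [2 g f^{-1}]_q$, and a ciphertext $c = [h s + 2 e + m]_q$. This notation is an assumption made for the illustration. Then:

```latex
\begin{aligned}
[f c]_q &= [f h s + 2 f e + f m]_q = [2 g s + 2 f e + f m]_q, \\
[f c]_q \bmod 2 &= f m \bmod 2 = m \qquad (\text{since } f \equiv 1 \pmod 2).
\end{aligned}
```

The second line holds only if the coefficients of $2gs + 2fe + fm$ stay below $q/2$ in absolute value, so that the reduction modulo $q$ does not wrap around; this is why the distribution from which $f'$, $g$, $s$, and $e$ are sampled must be bounded.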

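With the NTRU encryption scheme and the conversion techniques introduced in [23], we can derive a multikey fully homomorphic encryption scheme. Denote by $\{\mathcal{E}^{(N)} = (\mathrm{Keygen}, \mathrm{Enc}, \mathrm{Dec}, \mathrm{Eval})\}_{N > 0}$ the family of the resulting multikey fully homomorphic encryption schemes and by $\mathcal{P} = \{P_1, \ldots, P_U\}$ the collection of parties. The notation Eval means the homomorphic evaluation performed by the cloud (relinearization and squashing involved in the homomorphic multiplication are not detailed due to space limitations). Then, an on-the-fly MPC protocol secure against semimalicious adversaries can be constructed as follows.

Step 1. For each $i \in [U]$, the participant $P_i$ samples a key tuple $(\mathrm{pk}_i, \mathrm{sk}_i, \mathrm{ek}_i) \leftarrow \mathrm{Keygen}(1^\kappa)$ and uses $\mathrm{pk}_i$ to encrypt his/her input $x_i$ into a ciphertext $c_i$, where $\mathrm{ek}_i$ is an evaluation key. Then $(\mathrm{pk}_i, \mathrm{ek}_i, c_i)$ is sent to a cloud server $S$. At this point a function $F$, represented as a circuit $C$, has been selected on $\{x_i\}_{i \in V}$ for some $V \subseteq [U]$. Let $N = |V|$.

Step 2. The cloud server $S$ performs homomorphic evaluation on the ciphertexts,

$$c := \mathrm{Eval}(C, (c_1, \mathrm{pk}_1, \mathrm{ek}_1), \ldots, (c_N, \mathrm{pk}_N, \mathrm{ek}_N)),$$

and broadcasts $c$ to parties $P_1, \ldots, P_N$.

Step 3. By running a general secure MPC protocol $\Pi$, these parties compute the function

$$g_c(\mathrm{sk}_1, \ldots, \mathrm{sk}_N) := \mathrm{Dec}(\mathrm{sk}_1, \ldots, \mathrm{sk}_N, c).$$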

The above on-the-fly MPC protocol can be modified to achieve security against malicious adversaries by adding zero-knowledge proofs and succinct noninteractive arguments of knowledge system, which is used in both of our frameworks to guarantee their security property.

3.2. Differential Privacy

Informally, differential privacy guarantees that the presence or absence of any single record in a dataset has only a limited impact on the outputs of the queries executed on the dataset. The formal definition is as follows.

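Definition 1 ($\epsilon$-differential privacy [27]). An algorithm $\mathcal{A}$ satisfies $\epsilon$-differential privacy ($\epsilon$-DP) if for any pair of neighboring datasets $D$ and $D'$, and any $S \subseteq \mathrm{Range}(\mathcal{A})$, it holds that

$$\Pr[\mathcal{A}(D) \in S] \le e^{\epsilon} \cdot \Pr[\mathcal{A}(D') \in S],$$

where $\mathrm{Range}(\mathcal{A})$ denotes the collection of all possible outputs of the algorithm $\mathcal{A}$.

The datasets $D$ and $D'$ are neighboring provided that they differ by only one tuple. We denote this by $D \simeq D'$. We can see that the change in the probability distribution of the output caused by adding/removing any single tuple is bounded by $e^{\epsilon}$.

As a major $\epsilon$-differential privacy mechanism, the Laplace mechanism perturbs the output of a function $f$ on a dataset $D$ by adding to $f(D)$ a noise randomly sampled from the Laplace distribution. We define the global sensitivity of $f$ as

$$GS_f = \max_{D \simeq D'} |f(D) - f(D')|.$$

Then a Laplace mechanism $\mathcal{A}_f$ is given as follows:

$$\mathcal{A}_f(D) = f(D) + \mathrm{Lap}\left(\frac{GS_f}{\epsilon}\right),$$

where $\Pr[\mathrm{Lap}(\beta) = x] = \frac{1}{2\beta} e^{-|x|/\beta}$.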
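
The Laplace mechanism can be illustrated with a minimal Python sketch. The function names `sample_laplace` and `laplace_mechanism` and the counting-query example are assumptions made for illustration; the noise scale is the global sensitivity divided by the privacy parameter, as described above.

```python
import random


def sample_laplace(scale: float, rng: random.Random) -> float:
    """Draw one sample from the zero-mean Laplace distribution Lap(scale)."""
    if scale < 0:
        raise ValueError("scale must be non-negative")
    if scale == 0:
        return 0.0
    # The difference of two i.i.d. exponential variables with mean `scale`
    # follows a Laplace distribution with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(true_value: float, global_sensitivity: float,
                      epsilon: float, rng: random.Random) -> float:
    """Perturb true_value with Laplace noise of scale sensitivity / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_value + sample_laplace(global_sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(2024)
    # Hypothetical counting query: adding or removing one tuple changes
    # the count by at most 1, so the global sensitivity is 1.
    print(laplace_mechanism(true_value=120.0, global_sensitivity=1.0,
                            epsilon=0.5, rng=rng))
```

A smaller privacy parameter yields a larger noise scale, which is the usual trade-off between privacy and accuracy.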

Sample-and-Aggregate. The Sample-and-Aggregate technique provides a way to lower the global sensitivity and increase the degree of parallelism of an algorithm, and thereby yields a differentially private method of computing a function $f$. The basic mechanism is specified in Algorithm 1.

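Algorithm 1: Sample-and-Aggregate.
Input:
 Dataset $D$, length of the dataset $n$, privacy parameter $\epsilon$,
 clipping range $[\mathit{left}, \mathit{right}]$.
Let $\ell$ be the number of blocks (e.g., $\ell = n^{0.4}$ as in GUPT [34])
Randomly partition $D$ into $\ell$ disjoint blocks $D_1, \ldots, D_\ell$
for $i = 1$ to $\ell$ do
  $O_i \leftarrow$ output of $f$ operating on dataset $D_i$
  If $O_i < \mathit{left}$, then $O_i \leftarrow \mathit{left}$
  If $O_i > \mathit{right}$, then $O_i \leftarrow \mathit{right}$
end for
return $\frac{1}{\ell} \sum_{i=1}^{\ell} O_i + \mathrm{Lap}\left(\frac{\mathit{right} - \mathit{left}}{\ell \cdot \epsilon}\right)$;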
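
The Sample-and-Aggregate procedure can be sketched in plain Python on unencrypted data. This is only an illustration under assumptions: the function name `sample_and_aggregate`, its parameters, and the noise scale (clipping width divided by the number of blocks times the privacy parameter, as in GUPT [34]) are chosen for the example and are not the framework's cryptographic realization.

```python
import random
from statistics import mean
from typing import Callable, Sequence


def sample_and_aggregate(data: Sequence[float],
                         f: Callable[[Sequence[float]], float],
                         epsilon: float, left: float, right: float,
                         num_blocks: int, rng: random.Random) -> float:
    """Differentially private estimate of f(data) via block-wise evaluation."""
    if not 0 < num_blocks <= len(data):
        raise ValueError("num_blocks must be in 1..len(data)")
    if epsilon <= 0 or right <= left:
        raise ValueError("need epsilon > 0 and right > left")

    # Randomly partition the dataset into disjoint blocks of near-equal size.
    shuffled = list(data)
    rng.shuffle(shuffled)
    blocks = [shuffled[i::num_blocks] for i in range(num_blocks)]

    # Evaluate f on every block and clip each output to [left, right].
    outputs = [min(max(f(block), left), right) for block in blocks]

    # One record affects one block, so the average of the clipped outputs
    # changes by at most (right - left) / num_blocks.
    scale = (right - left) / (num_blocks * epsilon)
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return mean(outputs) + noise


if __name__ == "__main__":
    rng = random.Random(1)
    values = [float(v) for v in range(1000)]  # hypothetical input data
    print(sample_and_aggregate(values, mean, epsilon=1.0, left=0.0,
                               right=1000.0, num_blocks=16, rng=rng))
```

Because each block is evaluated independently, the loop over blocks can be parallelized, which matches the motivation given for the technique.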

4. Our Framework

In this section, we first introduce the entities involved in the PMDP framework and then provide an overview of the PMDP framework, followed by its details. Before presenting the details of our framework, we summarize the basic notations used throughout this paper in Notations.

4.1. Involved Entities

The PMDP framework involves the following entities:

(1) Completely trusted authority: the authority takes charge of producing secret keys for all legal system users. Since the authority can decrypt any ciphertext generated by any user, we suppose that it is completely trusted.

(2) Semitrusted cloud server: a cloud server has powerful storage and computation resources that can be easily accessed when required by system users. In the framework it is in charge of performing the evaluation part of FHE in an MPC protocol and is assumed to be semitrusted. Namely, the cloud server will honestly complete the given computation assignments but will try to learn information about the outsourced data.

(3) System users: in the PMDP framework, several system users can compute the value of a public function on their private data. Each participant is associated with a unique identifier and holds the corresponding secret key issued by the authority. Each user may be corrupted by adversaries, which results in the disclosure of the user’s secret key.

(4) Malicious adversaries: malicious adversaries do not participate in the procedure of the PMDP framework, but they do exist when we consider the security of the framework. In this paper, only adversaries that can corrupt any subset of parties are considered. In the privacy analysis of our framework, we assume that adversaries have strong background knowledge.

4.2. Overview of the PMDP Framework

Roughly, the PMDP framework is a nontrivial and tight integration of MPC and DP proposed by using the Sample-and-Aggregate mechanism, and the enhanced framework sPMDP in Section 7 is also nontrivial and even tighter because it is based on PMDP and its noise addition part is conducted on the cloud in the evaluation stage of MPC.

The PMDP framework consists of the following six stages.

Stage 0. Initially, the authority sets up the system by generating public parameters and corresponding secret keys. As in most other frameworks, the setup procedure is essential for the system to work correctly.

Stage 1. Each participant encrypts his/her private data with a multikey fully homomorphic encryption scheme and outsources the resulting ciphertext to the cloud server.

Stage 2. The cloud server identifies the parties involved in the multiparty computation of the corresponding ciphertexts and partitions them into some blocks.

Stage 3. The cloud server operates on the encrypted data with on-the-fly MPC and outputs the calculated results to their corresponding blocks in the form of ciphertext.

Stage 4. The parties in the same block decrypt the returned ciphertext. Furthermore, the decrypted results from the different blocks are aggregated into a final result in accordance with the partition and sample method used in the second stage.

Stage 5. Finally, to ensure output privacy, a designated participant first runs a differential privacy mechanism on the final result and then publishes the designated result.

4.3. Details of the PMDP Framework

Now we present the details of the PMDP framework. As shown in Figure 1, the entire procedure is comprised of the following phases.

4.3.1. System Setup

Initially, the authority makes the framework specific by choosing appropriate algorithms for each part and presetting related parameters. Since our framework is mainly based on the on-the-fly MPC from multikey FHE, the following components are necessary:

(a) A collection of semantically secure multikey fully homomorphic encryption schemes.

(b) A NIZK argument system for the NP relation stating that a ciphertext is well formed.

(c) An adaptively extractable SNARK system for all of NP.

(d) A family of collision-resistant hash functions.

(e) An MPC protocol secure against malicious adversaries corrupting a subset of the parties, for computing the family of decryption functions.

In addition, the method of sampling and partitioning should also be determined in this part, after which the aggregation mechanism becomes explicit. Note that we use differential privacy in the last stage. Since there are several available mechanisms to achieve differential privacy and many variants of the original definition of $\epsilon$-differential privacy, participants should take care to select an appropriate one in light of the function to be computed on the cloud server.

Moreover, the authority specializes the NIZK proof system according to a security parameter and sends the resulting common reference string to the cloud server and all parties. The privacy budget in differential privacy is also initialized.

4.3.2. Data Encryption and Uploading

In this phase all parties encrypt their data and upload them to the cloud server, before deciding which kind of multiparty computation to perform on the outsourced data. Namely, this phase can be completed offline, which reduces the communication delay of the framework. Specifically, each party generates a key tuple (public, secret, and evaluation keys), encrypts his/her data under the public key, and produces a NIZK proof that the resulting ciphertext is well formed.

In addition, each party samples a hash key and computes a hash value of the ciphertext. Moreover, the party generates a pair consisting of a verification reference string and a private verification key. After that, the party sends the resulting tuple to the cloud server and keeps the corresponding secret/private keys and random values secret.

Note that after receiving the tuple from a party, the cloud server checks whether the ciphertext is well formed by verifying the associated proof.

4.3.3. Sampling and Partitioning

In this phase all participants agree on the mechanism of sampling and partitioning. So far, there are a variety of sampling and partitioning mechanisms with different features for differential privacy when dealing with datasets, such as random sampling, uniform (fixed size) sampling, fraction sampling, Bernoulli sampling, random partitioning, and cell-based and kd-tree-based partitioning. Such mechanisms can not only decrease the sensitivity of the function to be computed, but also allow the computation to be parallelized on the cloud server.

Typically, if sampling is not necessary, random partitioning is a common choice due to its simplicity, effectiveness, and practicability. This is the first step of the well-known Sample-and-Aggregate algorithm. By randomly partitioning the original dataset into disjoint subsets of almost the same size, we can perform the secure delegation of multiparty computation on each subset in the next phase. When the original dataset is partitioned, the corresponding collection of parties is also partitioned into several subsets. The parties belonging to the same subset form a group.

The size of the blocks obtained from sampling and partitioning should be kept within a moderate range, since too large a size makes the sampling and partitioning operation meaningless, and too small a size threatens the data privacy of participants.
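
Random partitioning of the parties into groups of almost the same size can be sketched as follows. The function name `partition_parties` and the integer party identifiers are hypothetical; the sketch only shows the grouping step, not the handling of ciphertexts.

```python
import random
from typing import Hashable, List, Sequence


def partition_parties(party_ids: Sequence[Hashable], num_groups: int,
                      rng: random.Random) -> List[List[Hashable]]:
    """Randomly split party identifiers into disjoint, near-equal groups."""
    ids = list(party_ids)
    if not 1 <= num_groups <= len(ids):
        raise ValueError("num_groups must be in 1..len(party_ids)")
    rng.shuffle(ids)
    # Striding over the shuffled list yields group sizes that differ
    # by at most one.
    return [ids[g::num_groups] for g in range(num_groups)]


if __name__ == "__main__":
    rng = random.Random(42)
    for index, group in enumerate(partition_parties(range(1, 24), 5, rng)):
        print(f"group {index}: {sorted(group)}")
```

The same partition can then be applied to the ciphertext dataset, so that each group of parties corresponds to one block.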

4.3.4. Homomorphic Evaluation

In this phase the cloud server first represents the function to be computed as a circuit and then performs the multiparty computation on each subset.

Concretely, for each subset, the cloud server homomorphically evaluates the circuit on the ciphertexts of the corresponding parties and produces a succinct argument that the evaluation was performed correctly.

4.3.5. Decryption and Aggregation

In this phase all participants decrypt the evaluated ciphertext and further aggregate the resulting data.

First, in each group, every participant runs the SNARK verification algorithm to check the validity of the argument. If the verification succeeds for all parties in the same group, then the MPC protocol is performed to compute the decryption of the evaluated ciphertext.

Second, the participants from different groups merge their outputs into a common output by performing the aggregation mechanism determined in the setup phase.

To simplify the above aggregation procedure, the party with the most computing power in each group will be assigned as the agent of the group, and all these agents are designated to accomplish the aggregation work. Moreover, the aggregation procedure can also be outsourced to the cloud server by using a secure delegation protocol, for example, the cloud-assisted MPC from threshold FHE [5].

4.3.6. Privacy-Preserving Result Release

To preserve output privacy, in this phase the noise generated by a differential privacy mechanism is added to the aggregated output. Specifically, the following issues should be jointly considered by the parties before sampling the noise.

First, since there are a few variants of $\epsilon$-differential privacy, it is necessary and important to make a proper choice among them according to the participants’ requirements. Particularly, $(\epsilon, \delta)$-differential privacy is the most widely used one and is adaptable to situations with a less strict restriction on privacy loss. Personalized differential privacy provides more accurate control of the consumption of the privacy budget at an individual level and is usually used in interactive queries. Concentrated differential privacy gives high probability bounds for the cumulative loss of numerical computations and works well for preserving group privacy. All in all, the choice should be based on the computation to be made on the cloud server and the desired privacy-preserving level.

Second, researchers have developed several differential privacy mechanisms, which can achieve the same DP definition in different application scenarios. For example, the Laplace mechanism can achieve $\epsilon$-DP for real valued queries, while the exponential mechanism is an $\epsilon$-DP method of sampling from a discrete set of candidate outputs. Thus, if parties want to publish their average wage in a differentially private way, the Laplace mechanism is available; but if they want to release the result of their secure voting for a new leader, the exponential mechanism is a better option.
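
For the voting example, the exponential mechanism can be sketched as below. This is a minimal illustration under assumptions: the function name `exponential_mechanism`, the vote counts, and the choice of the vote count as the utility score (with sensitivity 1, since one voter changes one count by at most 1) are made up for the example. Each candidate is selected with probability proportional to exp(epsilon * score / (2 * sensitivity)).

```python
import math
import random
from typing import Dict


def exponential_mechanism(scores: Dict[str, float], epsilon: float,
                          sensitivity: float, rng: random.Random) -> str:
    """Pick a candidate with probability proportional to
    exp(epsilon * score / (2 * sensitivity))."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    candidates = list(scores)
    # Subtracting the maximum score avoids overflow and leaves the
    # selection probabilities unchanged.
    top = max(scores.values())
    weights = [math.exp(epsilon * (scores[c] - top) / (2.0 * sensitivity))
               for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(5)
    votes = {"alice": 50.0, "bob": 47.0, "carol": 20.0}  # hypothetical tally
    print(exponential_mechanism(votes, epsilon=0.5, sensitivity=1.0, rng=rng))
```

With a small privacy parameter the runner-up is released with noticeable probability, whereas a large one makes the true winner almost certain.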

Third, by the definition of differential privacy, we know that the sensitivity of the function is an important parameter in the specific realization of differential privacy. Additionally, the computation outsourced to the cloud server and the used aggregation mechanism significantly affect the magnitude of sensitivity. Therefore, it is essential to pay attention to both of them.

The last issue needed to be considered is which party should be responsible for generating, handling, and adding the noise. It seems that any party can take on such a task. But this is built upon the assumption that all parties are at least semihonest without collusion in the result release phase. Otherwise, malicious participants may try to reduce the differential privacy of honest participants through sampling a noise that is smaller than the required magnitude, and semihonest participants may collude to disclose the information of honest participants by sharing their input data and subtracting the noise term from the published result. In our framework we assume that all participants are honest, or at least semihonest without collusion in this phase. As a result, the output privacy can be guaranteed by selecting an agent of all parties to run the differential privacy mechanism on the aggregated result.

5. An Instantiation of the Framework

In this section, to illustrate the effectiveness of the PMDP framework, we present an instance of the PMDP framework and show how to use it to solve the problem in which parties securely compute and publish their average wage with the help of a semitrusted cloud server. Specifically, this instantiated framework consists of the following stages.

Stage 0. The authority chooses the following cryptographic primitives:

(a) The family of multikey FHE schemes proposed by Stehlé and Steinfeld [36].

(b) The NIZK argument system constructed by Groth et al. [38] for the NP relation of well-formed ciphertexts, together with its common reference string.

(c) The adaptively extractable SNARK system presented by Bitansky et al. [39] for all of NP.

(d) The family of cryptographically collision-resistant hash functions HmacSHA256.

(e) The cloud-assisted MPC protocol proposed by Asharov et al. [5] for computing the family of decryption functions.

Stage 1. Each party first runs the key generation, encryption, and proof generation algorithms on his/her wage, using freshly sampled randomness (the additional symbols related to the evaluation key ek are detailed in [23]).

Then, the party sends the resulting tuple to the cloud server, which verifies the correctness of the proof. The secret values are kept locally by the party.
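
The keyed hashing of a ciphertext with HmacSHA256 can be sketched with the Python standard library. The helper names `keyed_hash` and `verify_hash` are hypothetical, and the byte string below merely stands in for a serialized FHE ciphertext, whose actual format is not specified here.

```python
import hashlib
import hmac
import secrets
from typing import Optional, Tuple


def keyed_hash(ciphertext: bytes,
               key: Optional[bytes] = None) -> Tuple[bytes, bytes]:
    """Sample a hash key if none is given and return (key, HMAC-SHA256 digest)."""
    if key is None:
        key = secrets.token_bytes(32)  # fresh 256-bit hash key
    digest = hmac.new(key, ciphertext, hashlib.sha256).digest()
    return key, digest


def verify_hash(ciphertext: bytes, key: bytes, digest: bytes) -> bool:
    """Recompute the digest and compare it in constant time."""
    expected = hmac.new(key, ciphertext, hashlib.sha256).digest()
    return hmac.compare_digest(expected, digest)


if __name__ == "__main__":
    placeholder_ciphertext = b"serialized-ciphertext-bytes"  # stand-in only
    hash_key, tag = keyed_hash(placeholder_ciphertext)
    print(tag.hex(), verify_hash(placeholder_ciphertext, hash_key, tag))
```

Hashing the ciphertext keeps the statement proven by the succinct argument short, since it can refer to the digest rather than to the full ciphertext.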

Stage 2. After gathering the encrypted data from all parties, the cloud server assembles them into a ciphertext dataset. Then, by calling the Sample-and-Aggregate algorithm [37], it randomly partitions this dataset into disjoint subsets of almost the same size. Meanwhile, the collection of all parties is also partitioned into the corresponding groups.

Stage 3. Since the problem is to compute the average wage of all parties, the cloud server needs to evaluate the mean value function, which is represented as a circuit. To this end, for each group, the cloud server homomorphically evaluates the circuit on the ciphertexts of that group and generates a succinct argument that the evaluation was performed correctly.

Stage 4. First, each party belonging to a group runs the SNARK verification algorithm to verify the argument. If verification is successful for all parties in the same group, they perform the cloud-assisted MPC protocol to decrypt the evaluated ciphertext. Then, the cloud server broadcasts the result to all the parties in the same group.

Then, the agent in each group clips the group output to the range bounded by the left and right bounds of the Sample-and-Aggregate algorithm.

Furthermore, all agents calculate the average of the clipped group outputs with the help of the on-the-fly MPC protocol.

Stage 5. All parties vote for a publisher from among the subgroup agents and authorize him/her to sample a noise value and publish the final average wage of all parties, obtained by adding the noise to the aggregated result; the scale of the noise depends on the privacy budget.

6. Performance Discussion and Security Analysis

In this section we first systematically analyze the security of the PMDP framework along with its application scenarios. Then, we roughly discuss the performance of the PMDP framework in terms of computation overhead and security property.

6.1. Security Analysis

To simplify the security analysis of the PMDP framework, we separate participants into two categories, honest and semihonest, as in the security analysis of MPC protocols. The cloud server, on the other hand, is always assumed to be untrusted. In addition, since the framework uses the differential privacy mechanism, we take the background knowledge attack into consideration. Below we present the security analysis of our framework in the honest model and the semihonest model, respectively. Since we extend the security models in traditional MPC protocols to those in our sPMDP framework, in the following security analysis we mainly focus on demonstrating that the output privacy cannot be violated, since the input and computational privacy are already proven to be guaranteed by former works on MPC [23].

6.1.1. Honest Model

We first show that the PMDP framework does preserve the data privacy under the assumption that all participants are honest and do not attempt to get others’ information. That is, each participant will honestly follow the framework procedure as required.

Specifically, in Stage 1 of the framework, each participant’s data is encrypted with FHE and then outsourced to the cloud server. Thus, the data privacy in the storage phase is ensured by the security of the underlying FHE. In Stage 3, all data are processed in ciphertext form with the on-the-fly MPC protocol based on multikey FHE, whose security and correctness have been proved before. Therefore, data privacy in the processing phase is also protected, which means that each participant will only know his/her input and the output of his/her group. The operations of sampling, partitioning, and aggregation in Stages 2 and 4 do not influence the data privacy or the correctness of intermediate results if chosen appropriately, though there may be accuracy loss. The aggregated result is transformed into differentially private form and published in Stage 5, which ensures the output privacy owing to the theoretical guarantee of differential privacy. To sum up, in the honest model, the PMDP framework preserves the data privacy of all participants throughout the data lifecycle, without affecting the data usability.

Although the framework in the honest model is very simple and may seem idealized, it can be deployed in particular applications. A practical example is a set of medical facilities that try to obtain meaningful insights on health by analyzing their clients’ health data. These facilities need to store, process, and release health data. In our framework, they can be regarded as participants whose data is put to good use without any risk of privacy disclosure. In this instance, the medical facilities only care about the features, objective laws, and tendencies of the population as a whole rather than information about individuals, so the setting complies with the definition of the honest model.

6.1.2. Semihonest Model

In the semihonest model, a semihonest participant will follow the framework procedure but will also try to learn information about other parties. Moreover, a semihonest participant may collude with others, including other semihonest participants and external adversaries with background knowledge. In this work we assume that there is at least one honest party in the collection of all parties. We will show that the PMDP framework is secure only when there is no collusion in the semihonest model.

Firstly, we consider the noncolluding situation. A semihonest participant knows his/her own input, the local result of his/her group, the aggregated global result of all groups, the noise, and the output. Besides, he/she knows the method of sampling, partitioning, and aggregation, the function computed on the cloud server, and who is in the same group. If he/she does not collude with anyone, he/she cannot infer anyone’s input from this knowledge, for the following reason. The only relation available to the participant is the equation (36) linking the inputs of his/her group to the group’s local result. As we suggested before, the size of each group is bounded from below. Considering that our framework is designed to meet the challenge of big data privacy, we can assume that the number of participants is large, which implies that every group contains several parties. The participant knows only his/her own input, so there are at least 2 unknown elements in (36), meaning that it is unsolvable and the participant cannot obtain others’ inputs.
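The argument can be illustrated with a toy example. The sketch below assumes, for illustration only, that the group function is the mean (as in the wage instance) and uses hypothetical wage values; it shows two different assignments of the other parties’ inputs that give the observer exactly the same group result.

```python
def group_mean(values):
    """Plaintext stand-in for the function evaluated on one group."""
    return sum(values) / len(values)

own_input = 5000  # the only input the semihonest observer knows

# Two hypothetical worlds that differ in the other two parties' wages.
candidate_a = [own_input, 4000, 6000]
candidate_b = [own_input, 3000, 7000]

# Both worlds produce the same local result, so the observer cannot tell them apart.
consistent = group_mean(candidate_a) == group_mean(candidate_b)
```

With two or more unknown inputs, any change in one of them can be compensated by the other, which is why the single relation does not pin down an individual value.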

Now we consider the colluding situation. We present two possible attacks aiming at the PMDP framework from semihonest participants colluding with others.

Case 1. In each group, there is only one honest party; the others are semihonest and can collude with one another. Then, these semihonest parties can infer the input of the honest party as follows. Recall that, in the equation (37) relating the group’s inputs to its local result, the only unknown variable is the honest party’s input. It is not hard to solve (37) if the function is not too complicated. If there is an external adversary who corrupts all but one party of a group, he/she can conduct a similar attack.

Case 2. In each group, there is only one semihonest party, and the others are honest. Suppose that this party is corrupted by an external adversary with strong background knowledge; for example, the adversary knows the input of each party in the group except one. Then, the adversary can obtain the group’s local result from the corrupted party and launch an attack by solving the equation in Case 1 to get the remaining input.

We note that an entity who wants to attack the framework must possess the inputs of all but one party in a group. This condition is fairly strict, but if the cloud server is corrupted and the sampling and partitioning method is decided by the cloud server, the attack becomes much easier.
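Both cases reduce to solving one equation in one unknown. The sketch below shows this for the mean function of the wage instance; the function name `recover_missing_input` and the wage values are hypothetical, and a more complicated group function would require a correspondingly different solver.

```python
def recover_missing_input(group_mean, group_size, known_inputs):
    """Solve mean * n = sum(known) + x for the single unknown input x."""
    if len(known_inputs) != group_size - 1:
        raise ValueError("the attack needs the inputs of all but one party")
    return group_mean * group_size - sum(known_inputs)

# A hypothetical four-party group; the last wage belongs to the honest party.
wages = [4200, 5100, 3900, 6100]
local_result = sum(wages) / len(wages)  # what every group member learns

# The colluders (or the adversary with background knowledge) hold the first three inputs.
recovered = recover_missing_input(local_result, len(wages), wages[:3])
```

Here `recovered` equals the honest party’s wage, which is exactly the leakage that motivates hiding the intermediate group results.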

The above two cases suggest that the framework may suffer from potential attacks when some participants are semihonest. To avoid these attacks in the semihonest model, the framework needs to be reformed.

6.2. Performance Discussion

Since the PMDP framework involves several general security mechanisms, it is difficult to accurately evaluate its computation and communication costs. Thus, we show its efficiency by discussing the performance of the instance presented in the previous section.

In the instance, each general party, that is, a party who is not assigned to be an agent, needs to generate his/her public/secret keys and perform encryption and NIZK operations in Stage 1. Moreover, the party is also in charge of running the verification and decryption algorithms in Stage 4. Thus, the party’s computation cost is roughly the same as that of each party in an on-the-fly MPC protocol. According to the analysis of López-Alt et al. [23], the computation cost of each general party is at most polylogarithmic in the circuit size and the total size of all inputs and polynomial in the size of his/her own input. An assigned agent has an extra calculation task, namely, the aggregation operation. Mostly, the aggregation complexity is for the agent of group . In addition, since we assume that the cloud server has sufficient computation resources, we omit its computation overhead.

In Table 1 we compare the PMDP framework with other related works from the aspect of security properties, including delegation of storage, privacy in different processes, the reliability of the cloud server, the security model, and whether they support distributed computation. We select three representative works for comparison, namely, the GUPT system [34], the scheme designed by Pettai and Laud [33], which we call DPSharemind, and the protocol from Bindschaedler et al. [35], which we call DPStar. The last row is about the sPMDP framework, which we will introduce later.

From Table 1, we can see that overall PMDP outperforms the works in the first three rows. Firstly, PMDP enjoys an advantage in delegation of storage, which is important in the era of big data. This is because it employs FHE, while GUPT does not consider data storage and both DPSharemind and DPStar employ secret sharing, thus breaking the integrity of data. In addition, all works but GUPT can provide a data privacy guarantee over the whole lifecycle, covering input privacy, computational privacy, and output privacy. All of the works can be cloud-assisted, but only PMDP and DPStar allow the cloud server to be untrusted. The security model mainly concerns the participants of the works. We have shown that PMDP is secure under the noncolluding semihonest model, and this is where it is weaker than DPStar. However, DPStar does not support distributed computing; only PMDP and DPSharemind do, since they use sampling and aggregation. To make up for the shortcoming of PMDP in its security model, we propose the sPMDP framework in the next section.

Before introducing sPMDP, we further illustrate the differences between PMDP and DPStar, which is a recent published work addressing the privacy concerns related to multiparty computation. The fundamental difference is that DPStar uses Shamir secret sharing and homomorphic encryption, whereas PMDP uses on-the-fly MPC from multikey FHE. As a result, the security model of PMDP can be strengthened easily while that of DPStar cannot. Besides, in [35] DPStar is first introduced as a summation protocol and then extended to other queries including count queries, histograms, and linear combinations, meaning that it has to be readjusted and several new operations have to be added when facing different queries. By contrast, PMDP is a framework and all of its details have been explained; thus it can be easily applied to different queries. So the generality and usability of DPStar are not as good as those of PMDP. From the above, we can conclude that DPStar has good security properties and important practical significance, while our PMDP framework has an advantage over it in delegation of storage, distributed computing, generality, and usability. As a reinforced version of PMDP, sPMDP also has an advantage over DPStar in its security model, and we prove this in the next section.

7. Security Enhanced PMDP Framework

The main reason that the PMDP framework suffers from the above attacks is that too much information is accessible to participants. If the intermediate result of each group is unknown to every participant, then some attacks are no longer effective. Furthermore, the aggregated result and the noise should also be unavailable to any participant; otherwise an adversary with strong background knowledge might be able to infer users’ inputs. Motivated by the above observations, we propose a security enhanced PMDP framework (sPMDP for short).

Since the initialization mechanism and cryptographic primitives used in the sPMDP framework are similar to those in PMDP, we describe the framework only briefly. As shown in Figure 2, the sPMDP framework consists of the following stages.

Stage 0. Initialize the system as in the PMDP framework.

Stage 1. All parties encrypt their inputs with multikey FHE and upload the resulting ciphertexts to the cloud server.

Stage 2. The parties decide the method of sampling and partitioning and obtain the groups.

Stage 3. Each party samples a noise value as the agent does in the PMDP framework, and all parties calculate the ciphertext of the average noise of all parties with the help of an on-the-fly MPC protocol. In particular, we use the protocol without the decryption step, and the ciphertext of the average noise will be used in the next stage. The value of the average noise itself is not known to any entity during the whole procedure.

Stage 4. All parties (or an agent of all parties) first merge the original function to be computed on each group and the aggregation function for all groups into one function, which takes all parties’ inputs as its inputs. Moreover, a further function is defined as the sum of this merged function and the average noise. Then, this function is represented as a circuit, and the cloud server evaluates the circuit over the ciphertexts. The cloud server also produces succinct arguments for an NP language similar to the language in the original framework.

Stage 5. Each party runs the verification algorithm to check the validity of the argument. If the verification is successful for all parties in the same group, they run an MPC protocol to compute the decryption function, and the decrypted value is released as the final result.
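The effect of merging the functions can be sketched in plaintext as follows. This is an illustration under our own assumptions: the group function is taken to be the mean, the aggregation is taken to be the average of the group results, and the name `merged_function` is hypothetical. In sPMDP the whole computation is carried out over ciphertexts, so neither the intermediate values nor the average noise are visible to any party.

```python
def merged_function(groups, noises):
    """Plaintext stand-in for the merged circuit: aggregate of group results plus average noise."""
    group_results = [sum(g) / len(g) for g in groups]      # original function on each group
    aggregate = sum(group_results) / len(group_results)    # aggregation over all groups
    average_noise = sum(noises) / len(noises)              # one noise value per party
    # Only this single value leaves the computation.
    return aggregate + average_noise

# Four hypothetical parties in two groups, each contributing one noise value.
final_result = merged_function([[1, 3], [2, 4]], [0.5, -0.5, 1.0, 1.0])
```

Unlike the plaintext pipeline of PMDP, the group results, the aggregate, and the noise never appear as separate outputs here.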

As to the security of sPMDP against malicious adversaries, we argue it from two different perspectives. Firstly, we give a theoretical argument. The difference between PMDP and sPMDP is clear. In sPMDP, we integrate the operations of sampling, aggregation, and noise generation and addition into the on-the-fly MPC scheme, thus letting the on-the-fly MPC scheme provide a global security guarantee. The aim of merging functions is to turn the original evaluation function and the addition of the average noise into one function evaluated by the cloud, thus protecting the amount of noise against malicious participants. Besides enhancing security, merging functions also streamlines the structure of the PMDP framework, reducing the tasks assigned to participants and achieving a better combination of the different techniques. So sPMDP is a tighter combination of the MPC protocol and differential privacy. It has been proved in [23] that the on-the-fly MPC scheme is secure against malicious adversaries, which means that sPMDP is secure under the malicious model. Hence the security of sPMDP is stronger than that of PMDP and DPStar. It is worth noting that the computation of the encrypted average noise also uses the on-the-fly MPC scheme, thus ensuring that no one learns others’ noise or the value of the average noise.

Now we roughly explain the security of the sPMDP framework from another perspective. Note that each party in sPMDP only knows his/her own input, his/her own noise, and the final result. With all these values, each party can only form a single equation (44) relating the final result to the inputs and noises of all parties. Because there is at least one honest party, we assume without loss of generality that one particular party is honest and will not reveal his/her input and noise to others. Therefore there are at least two unknowns in (44) for the other parties, namely, the honest party’s input and noise. An equation with more than one unknown and no further constraints has infinitely many solutions. Therefore the input and the noise of the honest party will always be unknown to others. Again, we see that in sPMDP the privacy of honest parties is preserved.
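A small numerical check makes the counting argument tangible. The sketch assumes, for illustration only, that the released value is the mean of the inputs plus the mean of the noises; the two-party setting and all values are hypothetical.

```python
def released_result(inputs, noises):
    """Plaintext stand-in for the published value: mean input plus mean noise."""
    return sum(inputs) / len(inputs) + sum(noises) / len(noises)

# Party 0 is the observer (input 5000, noise 2.0); party 1 is honest.
# Two hypothetical worlds that differ only in the honest party's input and noise.
world_a = released_result([5000, 4000], [2.0, 1.0])
world_b = released_result([5000, 4002], [2.0, -1.0])

# The observer sees the same final result in both worlds.
indistinguishable = world_a == world_b
```

Any shift in the honest party’s input can be offset by an opposite shift in his/her noise, so the single equation admits infinitely many solutions, as stated above.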

As the cost of stronger security guarantees, the computation overhead of each party in sPMDP is larger than that in PMDP, and the additional part comes from the computation of the encrypted average noise in Stage 3. Overall, its computation cost also follows that of the on-the-fly MPC protocol.

8. Conclusion

In this work we focus on the problem of how to preserve multiparty data privacy throughout its lifecycle in cloud computing and propose a privacy-preserving framework named PMDP. This framework is built upon the on-the-fly MPC protocol, sampling, partitioning, and aggregation mechanisms, and differential privacy, thus guaranteeing input privacy, computational privacy, and output privacy in cloud computing simultaneously, even if the cloud server is untrusted. The security analysis shows that the framework can achieve the intended security goals in the honest model. To counter the potential attacks in the semihonest model, we further present a security enhanced PMDP framework. The performance discussion indicates that the proposed framework has advantages in security guarantees and thus is more desirable for secure multiparty data aggregation and publishing.

Notations

:An odd prime number
:A polynomial ring
:A distribution over a polynomial ring
:Security parameter
:Plaintext and its ciphertext
:The public, secret, and evaluation key
:Some intermediate results in a framework
:The privacy budget
:The sensitivity of
:An algorithm satisfying -differential privacy
:All possible outputs of
:A pair of neighboring datasets
:The differentially private output
:A function on dataset
:Clipping range
:The participant of the framework
:The number of participants
:The input of a participant
:The number of blocks (groups)
:The original dataset and its subsets
:The original dataset and its subsets
:An adaptively extractable SNARK system
:A NIZK argument system
:NP relation
:A family of collision-resistant hash function
:An -party decryption MPC protocol
:Decryption functions
:Hash keys
:The hash digest of the ciphertext
:A NIZK of the ciphertext
:A verification reference string
:A private verification key
:The verification of argument is successful
:Security parameter
:The set of parties and its subset
:The set of all parties’ data and its subset
:An honest participant
:Semihonest participants
:The aggregation function
:Random value
:Polynomials sampled from some distribution
:The function that multiparties want to compute
:The mergence of and aggregation function
:The sum of and average noise
:The circuit of and
:The random noise following some distribution
:Average noise and its ciphertext.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant 61702549, Grant 61502527, and Grant 61379150 and in part by the Open Foundation of State Key Laboratory of Networking and Switching Technology (Beijing University of Posts and Telecommunications) under Grant SKLNST-2016-2-22.