Abstract

With the advent of the big data era, clients who lack computational and storage resources tend to outsource data mining tasks to cloud service providers in order to improve efficiency and reduce costs. It is also increasingly common for clients to perform collaborative mining to maximize profits. However, due to the rise of privacy leakage issues, the data contributed by clients should be encrypted using their own keys. This paper focuses on privacy-preserving k-means clustering over joint datasets encrypted under multiple keys. Unfortunately, existing outsourced k-means protocols are impractical: not only are they restricted to a single-key setting, but they are also inefficient and nonscalable for distributed cloud computing. To address these issues, we propose a set of privacy-preserving building blocks and an outsourced k-means clustering protocol under the Spark framework. Theoretical analysis shows that our scheme protects the confidentiality of the joint database and mining results, as well as access patterns, under the standard semihonest model with relatively small computational overhead. Experimental evaluations on real datasets also demonstrate its efficiency improvements compared with existing approaches.

1. Introduction

With the tremendous amount of data collected each day, it is increasingly difficult for resource-constrained clients to perform computationally intensive tasks locally. There are numerous such cases, for example, mobile phones or sensors in IoT systems with limited battery, and small companies that lack hardware and software infrastructure. Thus, it is a viable option to outsource data mining tasks to a Cloud Service Provider (CSP), which provides massive storage and computation power in a cost-efficient way [1]. By leveraging cloud platforms, many giant IT companies have offered machine learning services to help clients train and deploy their own models, for example, Amazon Machine Learning [2], Google Cloud Machine Learning Engine [3], and IBM Watson [4]. Despite the advantages, privacy issues are a critical concern for cloud users employing these services. Because many data records contain sensitive information, such as health conditions, financial records, and location information, outsourcing them in plain form inevitably reveals personal privacy to the CSP, which may be untrustworthy or even malicious. For instance, it is desirable to improve diagnosis accuracy by applying data mining techniques to medical records gathered from multiple patients [5], while releasing such information to the public directly is prohibited by laws in many countries, for example, HIPAA [6]. Thereby, appropriate mechanisms should be designed to guarantee that outsourced mining tasks are executed in a privacy-preserving manner.

In this paper, we focus on privacy protection techniques for outsourced k-means clustering, which is a widely used data mining algorithm in the fields of image analysis, information retrieval, pattern recognition, and so on. The outsourced datasets are contributed by multiple data owners who are willing to collaborate in outsourced clustering in order to obtain more accurate results. Normally, the data records are encrypted via cryptographic tools to prevent them from being disclosed to other parties. The goal of our solution is to let cloud servers perform clustering over the encrypted data.

Traditional privacy-preserving clustering schemes cannot be directly adopted to address the privacy issues of outsourcing. They aim at computing clusters through interactions among different data holders without revealing respective data to others [7-9], whereas, in our case, the data are stored and processed by the cloud rather than by the clients themselves.

Most existing works on outsourced privacy-preserving clustering require cloud clients to use the same key for data encryption [10-15]. In practice, sharing the same encryption key has some disadvantages: (1) for symmetric encryption schemes, compromised data owners can easily decrypt other owners' encrypted data if they launch eavesdropping attacks, which is the case in [10-14]; (2) for asymmetric encryption, if the datasets are encrypted under the cloud's public key, data owners cannot decrypt their uploaded data because they do not know the private key [15]. One way to solve this is to allow data owners to encrypt their data with their own keys, but this calls for computation over encrypted data under multiple keys (denoted by multikey). The only work [16] concerning multikey clustering was constructed on the multiplicative transformation method proposed in [17], whereas their proposed solutions expose all owners' keys to the query user, which causes security risks if the user is compromised.

Another issue of current research is that the threat models do not withstand higher-level attacks. The essence of the underlying schemes in [10, 14, 16] is to apply a random matrix to encrypt data. These are secure against the known-sample attack [17] (namely, the attacker knows only some plaintext instances). But if the attacker also knows the corresponding encrypted values of some data (i.e., a chosen plaintext attack), the remaining instances may be recovered by setting up enough equations. In addition, the fully homomorphic encryption (FHE) used in [12, 13] is not secure according to [18]. Furthermore, access patterns (assignment of data objects, etc.) are disclosed to cloud servers [14, 16]. These may be used to derive sensitive information regardless of data encryption, as indicated by [19, 20]. Last but not least, few works consider combining their privacy-preserving techniques with large-scale data processing frameworks for boosting efficiency.

To address these challenges, we propose a novel solution for Privacy-Preserving k-means Clustering Outsourcing under Multikeys (PPCOM), which enables distributed cloud servers to perform clustering collaboratively over the aggregated datasets with no privacy leakage. Specifically, the major contributions of this paper are fourfold:
(i) Firstly, our solution allows the cloud to perform arithmetic computations over encrypted data under multiple keys. This is achieved by transforming ciphertexts under different keys into ones under a unified key through a double decryption cryptosystem. Since the encryption scheme is only partially homomorphic, we propose a secure addition subprotocol under the noncolluding two-server model, which reveals nothing about the inputs or output, including the input ratio. Based on these, the cloud can compute Euclidean distances between records and cluster centers.
(ii) Secondly, we propose two secure primitives to evaluate equality and compare ciphertexts, addressing the problem that encrypted data are incomparable because of their probabilistically random distribution. These two are further utilized in other privacy-preserving building blocks, including minimum Euclidean distance computation and encrypted bitmap conversion, as well as cluster center updating.
(iii) Thirdly, on the basis of these privacy-preserving building blocks, we design the PPCOM protocol by integrating the Spark framework to accelerate the outsourcing process, which works in distributed cloud environments. Moreover, PPCOM requires no client participation after the clients upload their datasets encrypted under their own keys.
(iv) Fourthly, theoretical analysis demonstrates that the proposed protocol not only protects the privacy of aggregated data records and cluster centers, but also hides access patterns under the semihonest model. Experimental results on a real dataset show that PPCOM is much more efficient than existing methods in terms of computational overhead.

The rest of the paper is organized as follows. Section 2 reviews the related works. In Section 3, we describe the preliminaries required for understanding our proposed PPCOM. In Section 4, we formalize the system model, threat model, and design objectives. The design details of the proposed privacy-preserving building blocks as well as the outsourcing protocol are presented in Section 5. We provide security analysis in Section 6. Section 7 shows the theoretical and experimental evaluation results. Finally, we conclude the paper and outline future work in Section 8.

2. Related Work

There have been many works on privacy-preserving distributed k-means clustering [7-9]. These works have different security requirements and design goals compared with ours. In the distributed setting, where data are partitioned among multiple parties, the clustering task is undertaken by the data holders instead of centralized servers. Generally, these schemes exploit secure multiparty computation (SMC) techniques to preclude one party's data from being disclosed to the others, except for the final results. In contrast, in clustering outsourcing, data owners intend to transfer the major workloads to cloud servers for the sake of reducing costs and improving efficiency. During the entire outsourcing process, all the inputs and outputs, as well as intermediate results, are supposed to be encrypted to ensure confidentiality.

As for outsourced k-means clustering, Liu et al. [12] first leveraged an FHE technique to perform outsourced clustering. To compare encrypted distances, their approach requires the data owner to provide trapdoor information during each iteration, which entails heavy overhead on clients. To reduce the amount of data owner participation, Almutairi et al. [13] presented an efficient mechanism using the concept of an Updatable Distance Matrix (UDM). Nevertheless, both works reveal partial privacy to cloud servers, such as the size of each cluster and the distance between a data object and a centroid. Moreover, the encryption scheme adopted in [12, 13] is not secure according to [18].

Another line of research for outsourced clustering is to use distance-preserving data perturbation or transformation techniques to encrypt the dataset [21]. Preserving distances after encryption enables the cloud to update clusters independently without the data owner's involvement, which achieves efficiency comparable to clustering over unencrypted data. However, as [17, 22] pointed out, these solutions are weak in security. Specifically, if attackers obtain some original instances (i.e., the known-sample attack (KSA) [17]), the rest may be recovered by identifying the corresponding encrypted ones and setting up enough equations. The work by Lin [11] focused on kernel k-means instead of standard k-means to avoid the distance-preserving vulnerability of random transformation. For outsourced collaborative data mining, Huang et al. [16] utilized the asymmetric scalar-product-preserving encryption (ASPE) proposed in [17], which is resilient to KSA, to compare distances. However, Yao et al. [23] demonstrated that ASPE is vulnerable to the linear analysis attack (LAA) [23]. To accelerate clustering, much research [24, 25] has been done to integrate MapReduce into the k-means algorithm and optimize its performance, but few of these works take privacy protection into account. Most recently, a secure scheme based on MapReduce to support large-scale datasets was proposed by Yuan and Tian [14]. By preserving the sign of the encrypted distance difference like ASPE, their approach enables the cloud to assign a data object to its closest cluster. It is resistant against both KSA and LAA. Unfortunately, none of the previously mentioned encryption schemes were formally proved secure against the chosen plaintext attack (CPA); meanwhile, some sensitive information, such as the assignment of data objects and the size of clusters, is directly disclosed to cloud servers.

To further reinforce security, Rao et al. [15] proposed a semantically secure scheme for outsourced distributed clustering over aggregated encrypted data from multiple users. Based on the Paillier cryptosystem, their solution protects not only the confidentiality of data contents but also access patterns from cloud servers and other users. In addition, user participation is no longer required during outsourcing. Their design objective is similar to ours, except that their protocol supports neither computation under multiple keys nor the Spark framework. Besides, the cost of their secure comparison is heavy, since each input has to be decomposed into encrypted bits by calling the SBD subroutine [26]. This will be demonstrated in the experimental evaluations in Section 7. In regard to computation under the multikey setting, López-Alt et al. [27] proposed a new FHE; however, its efficiency suffers from complex key switching and heavy interactions among users. There are other works that utilize a double decryption mechanism [28] or proxy reencryption technique [29] to convert ciphertext keys, allowing two servers to conduct addition and multiplication operations under multiple keys. Nevertheless, these basic operations still cannot fulfill the need to perform more sophisticated data mining computations, for example, similarity measurement.

3. Preliminaries

In this section, we briefly introduce the typical k-means clustering algorithm and the public key cryptosystem with a double decryption mechanism, which serve as the basis of our solution.

3.1. k-Means Clustering

Given n records T = {t_1, ..., t_n}, the k-means clustering algorithm partitions them into k disjoint clusters, denoted by C_1, ..., C_k. Let μ_j be the centroid value of C_j. A record t_i assigned to C_j has the shortest distance to μ_j compared with its distances to the other centroids, where 1 ≤ i ≤ n and 1 ≤ j ≤ k. Let Λ be the n × k matrix defining the membership of records, in which Λ_{i,j} ∈ {0, 1}, for 1 ≤ i ≤ n, 1 ≤ j ≤ k. Note that the ith record belongs to C_j if Λ_{i,j} = 1; otherwise, Λ_{i,j} = 0.

Initially, k records are selected randomly as cluster centers μ_1, ..., μ_k. Then the algorithm executes in an iterative fashion. For each record t_i, the algorithm computes the Euclidean distance between t_i and every centroid μ_j for 1 ≤ j ≤ k and updates Λ accordingly, that is, assigns t_i to the closest cluster C_j. Later, each centroid μ_j is derived by computing the mean values of the attributes of the records belonging to C_j. With the updated centroids, the clustering algorithm begins the next iteration. Finally, the algorithm terminates if the matrix Λ does not vary any more or if a predefined maximum count of iterations is reached [10].
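For reference, the following is a minimal plaintext sketch of this iteration in Python (NumPy-based and purely illustrative; the privacy-preserving protocol in Section 5 performs the same steps over ciphertexts):

```python
import numpy as np

def kmeans(records, k, max_iter=100, seed=0):
    """Plain (unencrypted) k-means, mirroring the iteration described above."""
    rng = np.random.default_rng(seed)
    centroids = records[rng.choice(len(records), size=k, replace=False)].astype(float)
    membership = np.full(len(records), -1)
    for _ in range(max_iter):
        # Assign each record to its closest centroid (squared Euclidean distance).
        dists = ((records[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_membership = dists.argmin(axis=1)
        if np.array_equal(new_membership, membership):
            break  # membership matrix Lambda unchanged: terminate
        membership = new_membership
        # Recompute each centroid as the mean of its assigned records.
        for j in range(k):
            assigned = records[membership == j]
            if len(assigned) > 0:
                centroids[j] = assigned.mean(axis=0)
    return centroids, membership
```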

3.2. Public Key Cryptosystem with Double Decryption

A public key cryptosystem with a double decryption mechanism (denoted by PKC-DD) allows an authority to decrypt any ciphertext by using the master secret key without the consent of the corresponding owner. In this paper, we use the scheme proposed by Youn et al. [30] as our secure primitive, which is more efficient than the scheme in [31] in that Youn et al.'s approach applies a smaller modulus in cryptographic operations. The major steps are shown in the following.
(i) Key generation (KeyGen): given a security parameter κ, the master authority chooses two large primes p and q and defines the modulus N. Then it chooses a random element g in Z_N* of sufficiently large order. The master secret key msk (the factorization of N) is known only to the authority. The public parameters are (N, g). A cloud user picks a random integer a as the secret key and computes h = g^a mod N as the public key.
(ii) Encryption (Enc): the encryption algorithm takes the message m and h as inputs and outputs the ciphertext C = (c_1, c_2), where c_1 = g^r mod N, c_2 = m · h^r mod N, and r is a random ℓ-bit integer.
(iii) Decryption with user key (uDec): the decryption algorithm takes the ciphertext C and the secret key a as inputs and outputs the message by computing m = c_2 · (c_1^a)^{-1} mod N.
(iv) Decryption with master key (mDec): given C, h, and msk, the authority decrypts with the help of the factorization of N. The secret key a corresponding to h can be obtained via a discrete-logarithm computation that becomes easy given the factorization, using the function L defined as L(x) = (x − 1)/p [30]. Then, m is recovered by computing m = c_2 · (c_1^a)^{-1} mod N.

By applying the general conversion method in [32], the scheme was claimed to be IND-CCA2 secure under the hardness of a Diffie-Hellman-type problem [30]. However, Galindo and Herranz [33] constructed an attack by generating invalid public keys and querying for the master decryption, which may lead to the factorization of N. To counter this, we adopt a slight modification of the scheme that checks the validity of the secret key, as proposed in [33]. If the recovered secret key fails the validity check, the master entity outputs a rejection message; otherwise, the decryption proceeds as usual.
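To make the structure concrete, the following is a toy Python sketch of an ElGamal-style multiplicatively homomorphic scheme with the same interface (key generation, encryption, user decryption, and homomorphic multiplication). It is illustrative only: the parameters are tiny and insecure, and it does not reproduce the exact Youn et al. construction or its master decryption.

```python
import random

# Toy parameters (illustrative and insecure; a real deployment uses a
# ~1024-bit modulus generated from large primes as in Section 3.2).
p, q = 23, 47
N = p * q
g = 2  # stand-in for a suitable base element of Z_N*

def keygen():
    a = random.randrange(2, N)          # user secret key
    h = pow(g, a, N)                    # user public key h = g^a mod N
    return a, h

def enc(m, h):
    r = random.randrange(2, N)          # per-ciphertext randomness
    return (pow(g, r, N), (m * pow(h, r, N)) % N)   # (c1, c2) = (g^r, m*h^r)

def udec(c, a):
    c1, c2 = c
    return (c2 * pow(c1, -a, N)) % N    # m = c2 * (c1^a)^(-1) mod N

def hmul(c, c_prime):
    # Componentwise product of ciphertexts encrypts the product of plaintexts.
    return ((c[0] * c_prime[0]) % N, (c[1] * c_prime[1]) % N)

a, h = keygen()
assert udec(hmul(enc(3, h), enc(5, h)), a) == 15    # E(3) * E(5) decrypts to 15
```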

4. Problem Statement

In this section, we formally describe our system model, threat model, and design objectives.

4.1. System Model

In our system model, depicted in Figure 1, there are two types of entities, that is, Cloud Users and Cloud Service Providers. Cloud Users consist of Data Owners and a Query Client. The Cloud Service Providers can be divided into the Storage and Computation Provider and the Cryptographic Service Provider. The Storage and Computation Provider is composed of dozens of Executing Servers, whereas the Cryptographic Service Provider comprises a Key Authority Server and a group of Assistant Servers. The description of each party is given as follows.
(1) Data owner (DO): a DO is the proprietor of a large dataset. Lacking hardware and software resources, a DO prefers to outsource its data to the cloud for storage and collaborative data mining. There are m DOs in the system, denoted by DO_1, ..., DO_m. DO_i has dataset D_i, which contains d attributes and n_i records, for 1 ≤ i ≤ m. The total number of records is n = n_1 + ... + n_m.
(2) Query client (QC): QC is an authorized party requesting k-means clustering tasks over the federated datasets. QC should not be involved in the outsourced computation and should be able to decrypt the result with its own secret key.
(3) Executing worker (EW): an EW server is a cluster node within the Storage and Computation Provider, which is responsible for storing DOs' datasets and performing computation over them. There are multiple EWs in the system. Together they constitute a parallel Spark cluster, working on the same distributed file system (e.g., HDFS) and providing cloud users with massive storage and computing power.
(4) Key authority (KA): KA belongs to the Cryptographic Service Provider and is charged with the distribution and management of public parameters and public/private keys, as well as the master key of the cryptosystem.
(5) Assistant worker (AW): an AW is also part of the Cryptographic Service Provider. An AW server holds the public/private keys generated by KA, with which it is able to assist EWs in executing a series of privacy-preserving building blocks. We assume that there are multiple AWs. All AWs and KA constitute the cluster of the Cryptographic Service Provider. Note that they offer sufficient computing power for temporary tasks, while they do not store the combined database.

Previous study has shown that it is impossible to implement a noninteractive protocol in the single-server setting under a partially homomorphic encryption scheme [34], so at least two servers are required to complete the outsourced computation [35]. In the design of the system model, we take into account the situation that there are usually a large number of servers in one CSP. Moreover, it is feasible to build secure outsourcing protocols through cooperation between cloud servers from different CSPs. Each DO_i generates its own key pair (sk_i, pk_i) using the parameters produced by KA and encrypts D_i with pk_i before uploading it to the EWs. With the joint datasets as inputs, the distributed cloud servers are scheduled to perform the k-means clustering algorithm in a privacy-preserving manner.

4.2. Threat Model

In our threat model, all cloud servers and clients are assumed to be semihonest, which means that they strictly follow the prescribed protocol but try to infer private information from the messages they receive during protocol execution. This assumption is consistent with existing works [11-16] on privacy-preserving clustering in cloud environments. Furthermore, the cloud servers have some prior knowledge regarding the distribution of owners' datasets, which may be used to launch inference attacks by analyzing access patterns [19]. DOs, QCs, EWs, AWs, and KA are also interested in learning plain data belonging to other parties. Therefore, a CPA adversary A is introduced in the threat model. The target of A is to decrypt the ciphertexts of the challenge DO and challenge QC with the following abilities:
(i) A may compromise all the EWs to guess the plaintexts of the ciphertexts received from DOs and AWs during the execution of the protocol.
(ii) A may compromise all the AWs and KA to guess the plaintext values of the ciphertexts sent from EWs during the protocol interactions.
(iii) A may compromise one or more DOs and QCs, except the challenge DO and the challenge QC, to decrypt the ciphertexts belonging to the challenge party.

Nevertheless, we assume that the adversary A cannot compromise the EWs and the AWs and KA simultaneously; otherwise, A would be able to decrypt any ciphertext stored on the Storage and Computation Provider with the keys from the Cryptographic Service Provider. In other words, there is no collusion between these two cloud providers, whereas servers within one provider may collude with each other. We remark that such assumptions are typical in adversary models used in cryptographic protocols (e.g., [15, 28, 36]), in that cloud providers are mostly competitors and economically driven by different business models.

4.3. Design Objectives

Given the aforementioned system model and threat model, our design should achieve the following objectives:
(i) Correctness: if the cloud users and servers both follow the protocol, the final decrypted result should be the same as in the standard k-means algorithm.
(ii) Data confidentiality: nothing regarding the contents of the datasets D_i and the cluster centers, nor the size of each cluster, should be revealed to the semihonest cloud servers.
(iii) Access pattern hiding: access patterns of the clustering process, such as which records are assigned to which clusters, should not be revealed to the cloud, to prevent any inference attacks [37].
(iv) Efficiency: most computation should be processed by the cloud in an efficient way, while DOs and QCs are not required to be involved in the outsourced clustering.

5. The PPCOM Solution

In this section, we first discuss a set of privacy-preserving building blocks. Then the complete protocol of PPCOM is presented.

Recall that the semihonest but noncolluding cloud servers need to cooperate to perform computation over encrypted data under the PKC-DD scheme. At first, KA takes a security parameter κ as input and generates the public parameters and the master secret key msk by executing KeyGen. Also, KA generates a key pair (sk_Σ, pk_Σ) used to unify the ciphertext encryption key. After that, sk_Σ and pk_Σ are sent to the AWs, while the public parameters are distributed to the DOs and QC, which they use to produce their own key pairs (sk_i, pk_i). The generated public keys are sent back to KA for management. Hereafter, let E_pk(·) denote the underlying encryption under public key pk, and let uDec and mDec denote user-side decryption and master-side decryption, respectively. |x| represents the bit length of x.

5.1. Privacy-Preserving Building Blocks

We present eight privacy-preserving building blocks under multiple keys. Five of them address basic operations over ciphertexts, namely, ciphertext transformation, multiplication, addition, equality testing, and minimum selection, while the rest are especially designed for outsourced clustering.

It can be readily observed that the underlying encryption scheme is multiplicatively homomorphic due to the following equation:

$$E_{pk}(m_1) \otimes E_{pk}(m_2) = \left(g^{r_1 + r_2},\, m_1 \cdot m_2 \cdot h^{r_1 + r_2}\right) = E_{pk}\left(m_1 \cdot m_2 \bmod N\right), \tag{1}$$

where r_1 and r_2 are the random values used in the two encryptions. This property is critical in that multiplication over ciphertexts can be evaluated by one EW server independently, as long as the encryptions are under the same public key. Hereafter, "⊗" denotes multiplication in the encrypted domain while "·" represents multiplication in the plaintext domain.

5.1.1. Secure Ciphertext Transformation (SCT) Protocol

Given that EW holds a ciphertext E_{pk_a}(m) together with the public keys pk_a and pk_b, and AW holds the master secret key msk, the goal of the SCT protocol is to transform the encrypted message under public key pk_a into another ciphertext under public key pk_b. During the execution of SCT, the plaintext m should not be revealed to EW or AW; meanwhile, the output is known only to EW. The complete steps are shown in Algorithm 1.

Require:  EW has E_{pk_a}(m), pk_a, and pk_b; AW has msk, pk_a, and pk_b.
 (1) EW:
   (a) Generate a random number r ∈ Z_N, s.t. gcd(r, N) = 1;
   (b) Compute X ← E_{pk_a}(m) ⊗ E_{pk_a}(r) = E_{pk_a}(m · r);
   (c) Send X to AW;
 (2) AW:
   (a) Decrypt m · r ← mDec(msk, X);
   (b) Encrypt X′ ← E_{pk_b}(m · r);
   (c) Send X′ to EW;
 (3) EW:
   (a) Compute E_{pk_b}(m) ← X′ ⊗ E_{pk_b}(r^{-1} mod N);

To start with, EW generates a number r, which is randomly picked in Z_N. The condition gcd(r, N) = 1 ensures that r is invertible in Z_N. Then we exploit the multiplicative homomorphic property of PKC-DD to blind m from AW, even though AW is able to decrypt X via mDec using msk. Finally, EW removes the randomness by multiplying by the encrypted inverse of r, since m · r · r^{-1} ≡ m (mod N). The SCT protocol is especially useful for converting data under different encryption keys into data under the same key, so that EW can perform homomorphic operations.
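Reusing the toy helpers (enc, udec, hmul, N) from the sketch in Section 3.2, the SCT flow can be illustrated as follows. Here AW's master decryption is simulated by ordinary decryption with the source secret key, which is a simplification of the real mDec:

```python
import random
from math import gcd

def sct(c_a, sk_a, pk_a, pk_b):
    """Toy SCT flow: EW blinds, AW decrypts and re-encrypts, EW unblinds."""
    # EW: pick an invertible blinding factor r and compute E_a(m*r).
    while True:
        r = random.randrange(2, N)
        if gcd(r, N) == 1:
            break
    x = hmul(c_a, enc(r, pk_a))
    # AW: recover m*r (mDec with msk in the real scheme), re-encrypt under pk_b.
    x_prime = enc(udec(x, sk_a), pk_b)
    # EW: remove the blinding by multiplying E_b(r^{-1} mod N).
    return hmul(x_prime, enc(pow(r, -1, N), pk_b))
```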

5.1.2. Secure Addition (SA) Protocol

It takes E_pk(m_1) and E_pk(m_2) held by EW and the secret key sk held by AW as inputs. The output is the encrypted addition of m_1 and m_2, that is, E_pk(m_1 + m_2), which is known only to EW. As the encryption scheme is not additively homomorphic, computing it requires interactions between EW and AW.

To preclude AW from obtaining m_1 and m_2, a straightforward solution for EW is to blind the inputs with a random value r by multiplying each by E_pk(r), where r ∈ Z_N and gcd(r, N) = 1. The encrypted randomized data are then sent to AW. Since AW holds the secret key, it can obtain m_1 · r and m_2 · r through decryption. The randomized addition (denoted by S) is computed as S = m_1 · r + m_2 · r = r · (m_1 + m_2). After that, AW encrypts S and sends it back to EW, which can get the desired output by multiplying by E_pk(r^{-1}). Nevertheless, partial privacy is very likely revealed to AW, because the ratio of the inputs can be calculated via (m_1 · r)/(m_2 · r) = m_1/m_2, which may be utilized to distinguish inputs. As our threat model assumes that the semihonest servers have some background knowledge of the dataset distribution, it is effortless for AW to find correlations between encrypted records and known samples. Therefore, to achieve the privacy-preserving guarantees, disclosing the input ratio must be prohibited during SA execution.
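The following toy sketch (again reusing the helpers from Section 3.2) implements this straightforward variant and makes the leak explicit: the shared blinding factor cancels in the ratio, so AW learns m_1/m_2 even though it never sees m_1 or m_2:

```python
import random
from math import gcd

def naive_sa(c1, c2, pk, sk):
    """Straightforward (leaky) secure addition under a multiplicative scheme."""
    while True:
        r = random.randrange(2, N)
        if gcd(r, N) == 1:
            break
    b1, b2 = hmul(c1, enc(r, pk)), hmul(c2, enc(r, pk))  # E(m1*r), E(m2*r)
    # AW side: decrypt the blinded values...
    v1, v2 = udec(b1, sk), udec(b2, sk)
    # ...and note that r cancels in the ratio:
    # (m1*r) * (m2*r)^{-1} = m1 * m2^{-1} (mod N)  -- the input ratio leaks.
    s = (v1 + v2) % N                                    # r*(m1+m2)
    # EW side: unblind the encrypted sum.
    return hmul(enc(s, pk), enc(pow(r, -1, N), pk))      # E(m1+m2)
```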

We propose an enhanced SA protocol, still under the two-server model, which protects the confidentiality of inputs and outputs as well as intermediate results. There are five steps in SA, the details of which are presented in Algorithm 2.

Require:  EW has and ; AW has the secret key
 () EW:
   (a) Generate random numbers: , s.t., for ;
   (b) Compute , and ;
   (c) Compute , and ;
   (d) Generate random numbers: , s.t., , ;
   (e) Compute , and ;
   (f) Compute , and ;
   (g) Compute , and ;
   (h) Send to AW;
 () AW:
   (a) Receive from EW;
   (b) Decrypt , for ;
   (c) Compute , , and ;
   (d) Encrypt , for ;
   (e) Send to EW;
 () EW:
   (a) Receive from AW;
   (b) Generate a random number: , s.t., ;
   (c) for to do
    (i) Compute ;
    (ii) Compute ;
   (d) Send to AW;
 () AW:
   (a) Receive from EW;
   (b) Decrypt , for ;
   (c) Compute ;
   (d) Encrypt ; Send to EW;
 () EW:
   (a) Receive from AW;
   (b) Compute ;
   (c) Compute ;
   (d) Compute ;

During Step (1), the cloud EW generates a set of random numbers and calculates their corresponding encryptions under pk. By exploiting (1), EW computes several blinded intermediate ciphertexts and sends them to AW. It can be verified that each transmitted ciphertext encrypts the product of an input and one or more blinding factors, so none of them reveals m_1 or m_2.

During Step (2), AW decrypts the received ciphertexts using sk. It then computes the additions of the blinded values and their corresponding encryptions, which it returns to EW. Each of these sums remains blinded by EW's random factors.

During Step (3), the blinding factors introduced in Step (1) are removed by multiplying by the encrypted values of their inverses. Then, EW generates a fresh random number, which is used to blind the partial sums once more. At Step (4), AW decrypts the reblinded ciphertexts and calculates their addition.

At the end of Step (5), EW gets rid of the remaining randomness. After that, EW computes the inverse of the blinding factor modulo N. Note that this inverse exists if p and q are primes of the forms p = 2p′ + 1 and q = 2q′ + 1, where p′ and q′ are also primes. The desired output E_pk(m_1 + m_2) is then calculated by one final homomorphic multiplication, and the correctness of the SA protocol follows from the fact that all blinding factors cancel: S · r^{-1} ≡ m_1 + m_2 (mod N).

Executing Algorithm 2 requires two rounds of interaction between EW and AW, which incurs more computational and communication overhead than the simple solution. However, it reveals no private information to either cloud server. The formal security analysis of SA is given in Section 6.

5.1.3. Secure Equality Test (SET) Protocol

Given that EW holds two encrypted values E_pk(a) and E_pk(b) while AW holds the secret key sk, the goal of SET is to test whether a and b are equal without revealing them to the cloud servers. The detailed steps are presented in Algorithm 3.

Require:  EW holds two encrypted values E_pk(a) and E_pk(b); AW holds the secret key sk.
 (1) EW:
   (a) Compute the componentwise fraction Δ = (c_1/c_1′, c_2/c_2′), where E_pk(a) = (c_1, c_2) and E_pk(b) = (c_1′, c_2′);
   (b) Generate a nonzero random number r ∈ Z_N;
   (c) Compute Δ′ = (Δ_1^r, Δ_2^r); Send Δ′ to AW;
 (2) AW:
   (a) Decrypt u ← uDec(sk, Δ′);
   (b) if u = 1 then f ← true; else f ← false;
 (3) return f;

At the first step, EW computes the componentwise fraction of the two input ciphertexts. Supposing that E_pk(a) = (g^{r_a}, a · h^{r_a}) and E_pk(b) = (g^{r_b}, b · h^{r_b}), it can be verified that the fraction is an encryption of a/b, which EW raises to a random power r. During Step (2), AW decrypts the result using sk and obtains u = (a/b)^r mod N. Since r is randomly selected, u is a random value if and only if a ≠ b. For a = b, it is obvious to infer that u = 1. Thus, if AW obtains u = 1, the returned value f is set to true (denoted by 1); otherwise, f is false (denoted by 0). It is worth noting that neither a and b nor the intermediate ratio a/b is revealed to the cloud during the execution of SET.
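A toy rendering of SET with the helpers from Section 3.2 (illustrative; ciphertext "division" is realized with modular inverses, and plaintexts are assumed invertible mod N):

```python
import random

def set_protocol(c_a, c_b, sk):
    """Toy equality test: EW raises the ciphertext 'fraction' to a random power;
    AW's decryption yields 1 iff the plaintexts are equal."""
    # EW: componentwise fraction of the ciphertexts encrypts a * b^{-1}.
    frac = ((c_a[0] * pow(c_b[0], -1, N)) % N,
            (c_a[1] * pow(c_b[1], -1, N)) % N)
    r = random.randrange(2, N)
    blinded = (pow(frac[0], r, N), pow(frac[1], r, N))  # encrypts (a/b)^r
    # AW: decrypt; the value is 1 iff a == b, otherwise random-looking.
    return udec(blinded, sk) == 1
```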

5.1.4. Secure Squared Euclidean Distance (SSED) Protocol

For the k-means algorithm, we use the squared Euclidean distance to measure the distance between a data record and a cluster centroid, denoted by D(t_i, μ_j), supposing that EW holds the ciphertext of the ith data record t_i and the ciphertext of the jth cluster centroid, while AW holds the secret key sk, for 1 ≤ i ≤ n and 1 ≤ j ≤ k.

Note that μ_j is a vector composed of d attributes, which may be rational numbers. However, the ring Z_N does not support rational division, so a new form of expression is required to represent the cluster center. Let (s_j, |C_j|) denote the new form of the cluster center, where s_j and |C_j| represent the sum vector and the total number of records belonging to C_j, respectively. It is easily observed that μ_j = s_j / |C_j|. Furthermore, Φ_{i,j} is defined as the scaled squared distance between t_i and μ_j, which satisfies Φ_{i,j} = |C_j|² · D(t_i, μ_j)². Thus, Φ_{i,j} can be calculated as follows:

$$\Phi_{i,j} = \sum_{l=1}^{d} \left(|C_j| \cdot t_{i,l} - s_{j,l}\right)^2,$$

where t_{i,l} and s_{j,l} denote the lth components of t_i and s_j, respectively, and d is the dimension size.

Taking the encrypted record E_pk(t_i) and the encrypted center (E_pk(s_j), E_pk(|C_j|)) as inputs, EW and AW jointly execute SSED by invoking SA and output the encryption of the scaled squared Euclidean distance E_pk(Φ_{i,j}). The SSED protocol should reveal neither the contents of t_i and the center nor the Euclidean distance between them to the cloud servers. Since the implementation of SSED via the SA scheme is straightforward, the design details are omitted.
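A quick plaintext check of the scaling identity used by SSED (NumPy; illustrative):

```python
import numpy as np

def scaled_sq_dist(t, s, c_size):
    """sum_l (|C|*t_l - s_l)^2 equals |C|^2 * ||t - s/|C| ||^2,
    avoiding any rational division inside the encrypted domain."""
    return sum((c_size * tl - sl) ** 2 for tl, sl in zip(t, s))

t = np.array([3, 5])
cluster = np.array([[1, 2], [5, 6]])        # records currently in C_j
s, c_size = cluster.sum(axis=0), len(cluster)
assert scaled_sq_dist(t, s, c_size) == c_size ** 2 * ((t - s / c_size) ** 2).sum()
```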

5.1.5. Secure Squared Distance Comparison (SSDC) Protocol

Suppose that EW holds two encrypted scaled squared distances E_pk(Φ_1) and E_pk(Φ_2) and AW holds the secret key sk. Apart from these, EW also has encrypted secrets associated with the distances. The output of SSDC is the encrypted minimum squared distance and its corresponding secret. Since our encryption scheme is probabilistic and does not preserve the order of messages, EW and AW should jointly compute the minimum without revealing Φ_1, Φ_2, the secrets, or which input is the minimum, to either server.

Our key idea is to compute the fraction of the two squared Euclidean distances, based on which AW is able to judge their relationship and return to EW an encrypted identifier that indicates the minimum value. The fraction of the two squared Euclidean distances can be calculated as follows:

$$\beta = \frac{\Phi_1}{\Phi_2}. \tag{6}$$

Since Φ_1 and Φ_2 are integers within Z_N, the ratio β may be a rational value, according to (6). It can be observed that if ⌊β⌋ = 0 (⌊·⌋ truncates the decimal fraction while keeping the integer part), we deduce that Φ_1 < Φ_2; otherwise, Φ_1 ≥ Φ_2. The overall steps of SSDC are given in Algorithm 4.

Require:  EW has two encrypted squared distances along with their corresponding encrypted secrets; AW holds the secret key sk.
 () EW:
   (a) Generate a random number , two non-zero random numbers ;
   (b) Compute ;
   (c) Compute ;
   (d) if then
    (i) Compute ;
    (ii) Compute ;
   (e) else
    (i) Compute ;
    (ii) Compute ;
   (f) Send to AW;
 () AW:
   (a) Decrypt , for ;
   (b) Compute ;
   (c) if then
    (i) Compute ;
   (d) else if then
    (i) Compute ;
   (e) Encrypt ; Send to EW;
 () EW:
   (a) if then
    (i) Compute ;
    (ii) Compute ;
    (iii) Compute ;
   (b) else
    (i) Compute ;
    (ii) Compute ;
    (iii) Compute ;
 () return ;

To begin with, EW computes encryptions of randomized distances, that is, both distances multiplied by the same blinding factor λ, with their order possibly swapped according to a secret coin flip. During Step (2), AW obtains the randomized distances through decryption with sk. Next, it calculates their ratio in the rational field:

$$\beta = \frac{\lambda \cdot \Phi_a}{\lambda \cdot \Phi_b} = \frac{\Phi_a}{\Phi_b}, \tag{7}$$

where (Φ_a, Φ_b) is the (possibly swapped) pair of distances. From (7), it is easy to infer that AW locates the minimum value by ⌊β⌋, for the shared blinding factor does not alter the relationship between the two distances. If ⌊β⌋ = 0, the first randomized distance is smaller and AW assigns the indicator accordingly; otherwise, it assigns the complementary value. One may prefer to use encryptions of "0" and "1" to represent the indicator, which is more straightforward, whereas it is not secure to utilize an encryption of "0", because E_pk(0) = (g^r, 0 · h^r) = (g^r, 0). In that case, EW can obviously infer that the encrypted message is zero by observing the second ciphertext component.

During Step (3), EW takes the received indicator and the encrypted squared distances as well as the secrets as inputs and computes the target minimum values. It invokes a secure subroutine called ComputeMin, as shown in Algorithm 5. The correctness of the SSDC protocol can be verified by expanding the ComputeMin expressions: if the indicator marks the first distance as the minimum, the blinded combination evaluates to the encryption of Φ_1; otherwise, it evaluates to the encryption of Φ_2. The encrypted secret of the minimum is calculated in the same way.

Require:  EW holds two encrypted values , and an encrypted identifier , where , ,
      ; AW has ;
 () EW computes ;
 () EW computes ;
 () EW computes ;
 () EW and AW jointly compute ;
 () EW and AW jointly compute ;
 () EW computes ;
 () EW and AW jointly compute ;
 () return ;
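The comparison logic of SSDC can be illustrated in plaintext as follows (Python; purely illustrative, assuming nonzero distances). The shared blinding factor preserves the ratio while hiding magnitudes, and EW's secret coin flip hides from AW which original input is the minimum:

```python
import random

def ssdc_plain(phi1, phi2):
    """Plaintext view of the SSDC comparison trick (illustrative only)."""
    lam = random.randrange(1, 1000)        # shared blinding factor (EW)
    swap = random.random() < 0.5           # EW's secret coin flip
    a, b = (phi2, phi1) if swap else (phi1, phi2)
    beta = (lam * a) / (lam * b)           # AW computes the ratio as in (7)
    theta = 1 if int(beta) == 0 else 0     # floor(beta) == 0  <=>  a < b
    phi1_is_min = (theta == 1) != swap     # EW undoes the swap (XOR)
    return min(phi1, phi2), phi1_is_min
```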
5.1.6. Secure Minimum among Squared Distances (SMkSD) Protocol

We assume that EW holds a set of k encrypted squared Euclidean distances E_pk(Φ_1), ..., E_pk(Φ_k), and AW holds the secret key sk. Besides, EW also has the encryptions of the secrets corresponding to the distances. The goal of SMkSD is to compute the encryption of the shortest squared distance along with its encrypted secret. To execute SMkSD, we compute the minimum by utilizing SSDC with two inputs at a time in a sequential fashion, which takes k − 1 comparisons. Also, it can be executed in a binary tree hierarchy like SMIN_n in [15], which takes at most ⌈log₂ k⌉ iterations.

5.1.7. Secure Index to Bitmap Conversion (SIBC) Protocol

Given that EW has the encrypted index of the closest cluster, denoted by E_pk(σ) (1 ≤ σ ≤ k), and AW holds the private key sk, the output of SIBC is a bitmap vector B composed of k encrypted elements. During the execution of SIBC, neither the index nor the bitmap should be revealed to either server. If j = σ, SIBC sets the jth bitmap element to an encryption of 1; otherwise, to an encryption of 0. This indicates that the position of the "1" in B is the index of the nearest cluster. The typical form of B is (E_pk(0), ..., E_pk(1), ..., E_pk(0)). The complete steps are presented in Algorithm 6.

Require:  EW holds encrypted index ; AW has secret key .
 () EW:
   (a) for to do
    (i) Generate non-zero random numbers ;
    (ii) Compute ;
    (iii) Compute ;
   (b) Generate a random permutation function ;
   (c) Compute , where ;
   (d) Send to AW;
 () AW:
   (a) for to do
    (i) Decrypt , , where ;
    (ii) Compute ;
    (iii) if then
     (A) Compute ;
    (iv) else
     (A) Compute ;
   (b) Send to EW;
 () EW:
   (a) Compute ;

During Step (1), EW generates nonzero random numbers and uses them to blind the encrypted index σ and each candidate index j, for 1 ≤ j ≤ k. The resulting k blinded ciphertext pairs constitute an ordered set, which is further permutated by a random permutation function π.

During Step (2), AW decrypts the permutated set and computes the fraction for each pair. By construction,

$$u_{\pi(j)} = \begin{cases} 1, & j = \sigma, \\ \text{a random value in } \mathbb{Z}_N, & j \neq \sigma, \end{cases} \tag{10}$$

because the blinding factors randomize every nonmatching value. Note that AW cannot learn relationships among positions, since they are masked by independent blinding factors, and each non-unit ratio is itself random. Thus, as long as π is kept confidential, the index of the closest cluster is not revealed to AW. In the end, EW recovers the true sequence of B by running the inverse permutation π^{-1}.

Furthermore, it should be emphasized that the method based on (10) can also be used in other scenarios where an equality test is required.
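A plaintext illustration of the permutation trick (Python; illustrative only, with blinding and encryption omitted):

```python
import random

def sibc_plain(sigma, k):
    """Plaintext view of SIBC: blinded equality tests under a random
    permutation yield a one-hot bitmap without revealing sigma to AW."""
    perm = list(range(1, k + 1))
    random.shuffle(perm)                     # EW's secret permutation pi
    # AW sees only blinded pairs in permuted order and marks equalities.
    marks = [1 if j == sigma else 0 for j in perm]
    # EW applies the inverse permutation to restore the bitmap order.
    bitmap = [0] * k
    for pos, j in enumerate(perm):
        bitmap[j - 1] = marks[pos]
    return bitmap                            # one-hot at index sigma-1
```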

5.1.8. Secure New Cluster Computation (SNCC) Protocol

Given that EW holds the encrypted assignment membership matrix, the encrypted dataset, and the target cluster index j, the goal of SNCC is to calculate the new encrypted cluster centroid, represented as (E_pk(s_j), E_pk(|C_j|)). During the execution of SNCC, nothing regarding the data records, the sums of attributes, or the cluster size should be disclosed to the cloud servers. The complete steps are shown in Algorithm 7.

Require:  EW holds the assignment matrix , the encrypted dataset , the cluster id , while AW has the secret key ,
      where , , .
 () EW:
   (a) for to do
     (i) for to do
      (A) Compute ;
      (B) Compute with AW;
     (ii) Compute , where ;
     (iii) Compute with AW;
 () EW:
   (a) for to do
    (i) Compute with AW;
   (b) Compute with AW;
 () return ;

During Step (1), EW computes the encryption of the summation for each attribute, that is, E_pk(s_{j,l}) for 1 ≤ l ≤ d. Each product Λ_{i,j} · t_{i,l} is evaluated homomorphically, and the partial sums are accumulated with SA. After n rounds of iterations, the important observation is

$$s_{j,l} = \sum_{i=1}^{n} \Lambda_{i,j} \cdot t_{i,l}. \tag{11}$$

Recall that Λ_{i,j} = 1 means that t_i belongs to the cluster C_j, while Λ_{i,j} = 0 denotes the opposite. Therefore, (11) adds up exactly the records that are assigned to the target cluster. Moreover, each encrypted product needs to be calculated only once during the entire outsourcing period.

During Step (2), EW and AW jointly compute the encryption of the cluster size by invoking the SA subprotocol. It can be verified that

$$|C_j| = \sum_{i=1}^{n} \Lambda_{i,j}. \tag{12}$$

Obviously, (12) sums up those Λ_{i,j} that equal 1 and discards those that equal 0, so the final result is the size of the cluster C_j.
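In plaintext, equations (11) and (12) amount to the following (NumPy; in the protocol every product is a homomorphic multiplication and every accumulation an SA call):

```python
import numpy as np

def sncc_plain(Lambda, T, j):
    """Plaintext view of SNCC: new center of cluster j as (sum vector, size)."""
    s_j = (Lambda[:, j][:, None] * T).sum(axis=0)   # eq. (11): attribute sums
    size_j = Lambda[:, j].sum()                     # eq. (12): cluster size
    return s_j, size_j
```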

5.2. The Complete PPCOM Protocol

In this subsection, we present our proposed PPCOM protocol for the standard k-means algorithm, which works in distributed cloud environments.

The primary goal of PPCOM is to schedule a group of cloud servers to perform the clustering task over the joint datasets encrypted under multiple keys, while no privacy of data records, intermediate results, or final clusters is revealed to the semihonest servers. In order to improve the overall performance, we leverage the large-scale data processing engine Spark [38]. Spark provides a data structure called the resilient distributed dataset (RDD) for data parallelism and fault tolerance, which facilitates iterative algorithms in machine learning. Although it ships with a scalable machine learning library, MLlib, which includes the k-means algorithm [39], MLlib does not take privacy protection into account and cannot process encrypted data either. Therefore, it is essential to combine the building blocks proposed in Section 5.1 with the Spark computing framework in the design of PPCOM.

PPCOM is composed of four stages, that is, Data Uploading, Ciphertext Transformation, Clustering Computation, and Result Retrieval, the details of which are described in the following.

5.2.1. Data Uploading Stage

We assume that all data records have been preprocessed by the data owners. One essential preprocessing step is to normalize data values, because different attributes have different value domains, which may cause attributes with large domains to have greater impact on the accuracy of distance-based clustering. Normalization maps the records into a common range, endowing all attributes with equal weights. In this paper, we adopt Min-Max Normalization [40]. Suppose that attribute A has observed values v_1, ..., v_n, [min_A, max_A] is the range of A, and [new_min_A, new_max_A] is the target range. Min-Max Normalization maps v into v′ in [new_min_A, new_max_A] by calculating the following equation:

$$v' = \frac{v - \min_A}{\max_A - \min_A} \cdot (\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A.$$
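A one-line helper capturing this mapping (Python; illustrative):

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-Max normalization: map v in [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min
```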

After the preprocessing step, each DO_i encrypts its dataset D_i with its public key pk_i attribute by attribute. Let t_{j,l}^{(i)} denote the lth attribute value of the jth record in D_i, for 1 ≤ j ≤ n_i and 1 ≤ l ≤ d, and let E_{pk_i}(t_{j,l}^{(i)}) denote its encryption. After all DOs complete uploading their encrypted datasets to the EWs, the cloud aggregates the distributed datasets into a joint database. Under this circumstance, DO_i is still able to retrieve its data and decrypt it with its private key sk_i, whereas other DOs cannot decrypt D_i without the corresponding sk_i.

5.2.2. Ciphertext Transformation Stage

Upon receiving a clustering request from QC, the EWs initiate the ciphertext transformation procedure, which converts ciphertexts under pk_i into encryptions under the unified key pk_Σ, for 1 ≤ i ≤ m. EW first replicates the joint database to ensure the DOs' continued access to their original datasets. Then the SCT subprotocol is executed to output the converted dataset. This stage is important for two reasons: (1) EW is able to perform homomorphic operations only under a single common public key; (2) user-side decryption is much more efficient than master decryption.

5.2.3. Clustering Computation Stage

With all the converted records encrypted under pk_Σ, the objective of this stage is to compute the cluster centroids and the membership matrix Λ without compromising privacy. The outsourcing process is not only protected by the proposed secure building blocks but also accelerated by the Spark framework. The stage includes four steps, namely, Job Assignment, Map Execution, Reduce Execution, and Update Judgement, as shown in Figure 2.

Step 1 (Job Assignment). Firstly, the CSPs select minimum computing units (denoted by MCUs). Each MCU is composed of one EW server and one AW server. Obviously, an MCU is able to perform the cryptographic building blocks on its own. Since the workload of an AW is relatively light compared to an EW, it is preferable to share one AW among several MCUs to maximize resource usage. We assume that every cloud node possesses sufficient storage and computational power for its assigned job. The set of MCUs is divided into two disjoint sets, that is, the Map set and the Reduce set. Then the converted dataset is divided into uniformly distributed partitions, which are transferred to their corresponding MCU nodes in the Map set. In this paper, we assume that the initial cluster centers are chosen by the DOs in advance and are also encrypted. Every Map MCU is aware of the initial encrypted cluster centroid set.

Step 2 (Map Execution). Given its data partition and the encrypted centroid set as inputs, each Map MCU outputs a key-value table, in which the key is the encrypted record and the value is the encryption of the bitmap that indicates the closest cluster.
As presented in Algorithm 8, each MCU in the Map set executes the following steps in parallel: (1) it computes the encryption of the squared Euclidean distance between each record t_i in its partition and every cluster center, for 1 ≤ j ≤ k; (2) it computes the encrypted index of the minimum among the k distances for each record by calling SMkSD; (3) it converts the index into an encrypted bitmap via the SIBC scheme. The final output of the Map set is a table in which each entry consists of a data record as the key and its corresponding encrypted assignment bitmap as the value.

Require:  Map MCUs have datasets and cluster centroid set .
 ()  , Map concurrently executes:
 ()  for to do
 ()   for to do
 ()    
     , where ;
 ()   end for
 ()   
    , where , for ;
 ()   
      ;
 ()    and ;
 ()   end for
 () return for ;

Step 3 (Reduce Execution). Taking the tables from the Map MCUs and the aggregated dataset as inputs, the Reduce set computes the cluster center for each cluster based on the assignment membership matrix Λ. The major steps are presented in Algorithm 9. Each MCU in the Reduce set concurrently executes the following: (1) it merges the assignment vectors into the complete membership matrix; (2) it computes the new cluster center for its target cluster by invoking SNCC. The final output of the Reduce set is the encrypted centroid of every cluster and the matrix Λ.

Require:  Reduce MCUs have and the aggregated dataset .
 () , Reducer concurrently executes:
 () for to do
 ()  for to do
 ()  
       ;
 ()  end for
 () end for
 ()
    ;
 () return , , where ;

Step 4 (Update Judgement). This step determines whether the predefined termination condition of the k-means algorithm is satisfied. In this paper, we take an unchanged membership matrix as the termination condition. Given that EW holds the previous matrix Λ and the current matrix Λ′ and AW holds the key sk_Σ, our strategy is to test whether the elements of Λ and Λ′ are equal one by one by utilizing the SET subprotocol. Once a mismatch is detected, EW replaces Λ with Λ′ and goes back to Step 2 of the Clustering Computation Stage. Otherwise, the servers continue comparing until the end of the matrix. If every comparison returns true, which means that the assignment of clusters does not vary any more, EW terminates the iteration and activates the Result Retrieval Stage.
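Putting the stage together, the following is a highly simplified PySpark skeleton of one Map/Reduce round. The ppcom module and its secure_* helpers are hypothetical wrappers around the interactive subprotocols (SSED, SMkSD, SIBC, SNCC), each jointly executed by the EW/AW pair of the assigned MCU:

```python
from pyspark import SparkContext

# Hypothetical wrappers around the interactive subprotocols; each call is
# jointly executed by the EW/AW pair of the executing MCU.
from ppcom import secure_distances, secure_argmin_bitmap, secure_new_center

def ppcom_iteration(sc: SparkContext, records_rdd, enc_centers, k):
    """One Map/Reduce round of the Clustering Computation Stage (sketch)."""
    centers_bc = sc.broadcast(enc_centers)

    def assign(enc_record):
        # Map: k SSED invocations, then SMkSD + SIBC for the one-hot bitmap.
        dists = secure_distances(enc_record, centers_bc.value)
        bitmap = secure_argmin_bitmap(dists)
        return (enc_record, bitmap)

    table = records_rdd.map(assign).cache()

    # Reduce: per cluster j, obliviously aggregate (sum vector, size) via SNCC.
    new_centers = [secure_new_center(table, j) for j in range(k)]
    return table, new_centers
```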

5.2.4. Result Retrieval Stage

Since the cluster center set and the assignment matrix Λ are encrypted under pk_Σ, QC cannot decrypt them without sk_Σ. Firstly, the SCT scheme is invoked to transform the encryption key of the centers and Λ from pk_Σ to QC's public key. After that, QC is able to download them and decrypt the final result with its own secret key. The final cluster center of C_j is recovered by computing μ_j = s_j / |C_j|, for 1 ≤ j ≤ k.

6. Security Analysis

We first analyze the security of the privacy-preserving building blocks. Since all parties are semihonest, security in this model can be proven under the "Real-versus-Ideal" framework [41]: there is an ideal model in which all computations are performed by a trusted third party, and a protocol is secure if every adversarial behavior in the real world can be simulated in the ideal world. Considering that the proofs of the proposed building blocks are essentially the same, we give the formal proof of the SA subprotocol as a representative example.

Theorem 1. The SA protocol described in Section 5.1 securely computes the addition over ciphertexts by using the PKC-DD cryptosystem in the presence of two semihonest but noncolluding cloud servers.

Proof. Since this algorithm is collaboratively completed by EW and AW, we need to prove that SA is secure both against an adversary A_EW corrupting EW and against an adversary A_AW corrupting AW.
(1) Security against A_EW: in Step (1), the real-world view of A_EW in SA includes the input ciphertexts, the random blinding values, and the output ciphertexts. In Step (2), the real-world view includes the received ciphertexts and the returned encrypted sums. In Step (3), the real-world view is composed of the input and output ciphertexts and a fixed public value. Without the secret key sk, A_EW is unable to decrypt these ciphertexts. Thus, we can build a simulator S_EW in the ideal world: it generates encryptions of values randomly selected from Z_N and follows the protocol to compute the corresponding outputs. Due to the semantic security of PKC-DD, it is computationally difficult for A_EW to distinguish the views of the real world and the ideal world; that is, the two views are computationally indistinguishable.
(2) Security against A_AW: in Step (2), the real-world view of A_AW includes the received ciphertexts and the decrypted values, which are blinded by EW's random factors. Besides, it is extremely hard for A_AW to guess the correct input ratio: there are 8 unknown blinding factors, whereas A_AW can only set up 6 equations, which is not enough to recover the inputs. As for Step (4), the real-world view includes decrypted values that are again randomized by a fresh blinding factor. Therefore, we can build a simulator S_AW in the ideal world, which generates random numbers as inputs and executes the protocol to produce outputs through decryption, modular addition, and encryption operations. Due to the semantic security of PKC-DD and the randomness of the blinding factors, the views of A_AW in the real world and the ideal world are computationally indistinguishable.

During the outsourcing process, the cloud servers invoke the proposed building blocks as subroutines, and all transmitted data are encrypted at every step. Note that the data records are held by cloud parties without the decryption key. The assistant parties with the key can decrypt the received data, but the decrypted values are randomized. Since PKC-DD is semantically secure and the blinding factors are randomly selected, nothing regarding the data contents or the computed clusters is revealed to the servers. Moreover, the access patterns of which encrypted input is the minimum Euclidean distance (as shown in SSDC, Algorithm 4) and of which record is assigned to which cluster are protected from the cloud due to the encryption of the membership matrix (as shown in SNCC, Algorithm 7). By the Composition Theorem [41], the sequential composition of the four stages of PPCOM is secure under the semihonest model.

Discussions. Note that the order of computing additions via SA cannot be altered in the ComputeMin subroutine called by SSDC, even though the final outcome would be the same. If the cloud servers changed the order so that a difference involving the minimum were computed first, a zero intermediate result would inevitably reveal which operand is the minimum to both servers. Similarly, the order of the computation steps of Algorithm 7 in SNCC ought to be kept unchanged.

7. Performance Analysis

In this section, we analyze the performance of PPCOM protocol from both theoretical and experimental perspectives.

7.1. Theoretical Analysis

Let Exp and Mul denote modular exponentiation and multiplication operations, respectively, and let K represent the key size of the double decryption scheme. Encryption and user-side decryption of the underlying cryptosystem each incur a small constant number of Exp and Mul operations, while authority decryption is considerably more expensive. Recall that d is the dimension size of a record and n is the size of the joint dataset. The computational and communication overheads of the major building blocks and of the clustering algorithm in one iteration are given in Table 1. It can be observed that the addition and comparison operations incur many cryptographic operations, which is the cost of hiding access patterns. The Data Uploading Stage costs each DO the encryption of its dataset plus the corresponding upload bandwidth, and the overhead of the Ciphertext Transformation Stage is linear in the dataset size. We stress that Stage 1 and Stage 2 of the PPCOM protocol are executed only once, so their overheads are amortized over the iterations. Furthermore, the number of MCUs in the Map set is closely related to the Map costs, while the number in the Reduce set affects the performance of Reduce more significantly. It is easy to see that the larger the computing cluster is, the fewer tasks are distributed to each unit, because the k-means jobs are parallelized under Spark. As for the Update Judgement step, the worst case is to compare every element of the membership matrix, whose computational and communication costs are related to n and k.

7.2. Experimental Analysis

The experiments are conducted on our local cluster, in which each server runs CentOS 6.5 on an Intel Xeon E5-2620 @ 2.10 GHz with 12 GB memory. We implemented all the outsourcing protocols using the Crypto++ 5.6.3 library and the Spark framework. In this paper, the key size of the comparable methods (Paillier used in PPODC [15] and BCP encryption in [28]) is 1024 bits, which is commonly considered acceptable. To achieve a comparable security level, the security parameter of PKC-DD is chosen so that its modulus matches the strength of a 1024-bit RSA modulus [42].

To facilitate comparisons, we use the KEGG Metabolic Reaction Network dataset [43]. The dataset includes 65,554 instances and 29 attributes. Before clustering, all records are first normalized into integers in a common range to prevent the impact of attributes with large value domains, as mentioned in the Data Uploading Stage of Section 5.2. Note that the first attribute is excluded from the tests, since it is just the identifier of the pathway. All of the testing records are randomly selected from the KEGG dataset.

7.2.1. Privacy-Preserving Building Blocks’ Performance

We first measure the execution time of each privacy-preserving building block on a single server over 1,000 runs; the average costs are shown in Table 2. It can be seen that the costs of the compound protocols (those made up of basic primitives, e.g., SSED, SMkSD, and SNCC) are relatively high, since they consist of several SA operations, which involve many encryptions and decryptions, as well as rounds of interaction to preserve data privacy. This is consistent with the theoretical analysis.

We then evaluate the performance of the SCT scheme. Table 3 shows the ciphertext transformation time for varying dataset sizes in SCT and in KeyProd from [28]. It can be seen that the cloud running time grows with increasing dataset size. Our scheme executes about 4 times faster than KeyProd because PKC-DD works more efficiently than their underlying cryptosystem. Besides, we remark that the schemes in [28] are designed for basic arithmetic operations under multiple keys rather than complex mining tasks like the k-means algorithm.

We also compare the cloud running time of SSED and SMkSD with their counterpart methods in PPODC [15]. As shown in Figures 3(a) and 3(b), the computation time of both grows with the dataset size, and our schemes outperform PPODC's. Let ℓ denote the bit length of the plaintext message. In Figure 3(b), it is easy to see that, as ℓ grows, the computation time of PPODC grows more rapidly. This is because ciphertexts have to be decomposed into encryptions of bits before comparison in [15].

7.2.2. Factors Affecting PPCOM’s Performance

There are three major factors affecting the outsourced performance: (1) the number of clusters k; (2) the number of parallelized MCUs; (3) the size of the aggregated dataset n.

First, we evaluate the overhead on the cloud servers with varying k and d for a fixed dataset size, in comparison with the optimized PPODC with 8 parallelized server pairs. The results are given in Figures 4(a) and 4(b). It can be seen that the computation costs of both protocols grow almost linearly with k, and the PPCOM protocol outperforms PPODC; for example, at the largest tested setting, the cloud computation time of PPODC is 4.33 times that of our scheme. The efficiency is gained not only from the improved secure primitives but also from the Spark framework. Nevertheless, the communication overhead of PPCOM is relatively high, which is mainly caused by the frequent interactions during the SA process. Furthermore, the growth of the dimension size d also increases the computational and communication overheads of both protocols.

Next, we evaluate the overhead on the cloud servers with a varying number of MCUs for a fixed workload. As shown in Figure 5(a), the computation time decreases as the number of MCUs grows. It can be concluded that (1) scaling out the parallelized servers accelerates the outsourced clustering task, and (2) PPCOM takes less computational cost than PPODC to accomplish the same amount of work. Figure 5(b) shows that the communication cost of both schemes remains unchanged regardless of the number of MCUs, because the total amount of clustering work is fixed. However, PPCOM incurs heavier communication overhead in order to protect privacy and access patterns.

Moreover, we evaluate the impact of the dataset size n on the cloud servers' performance with k and the number of MCUs fixed. From Figure 6(a), we observe that the running time of both protocols increases with n, as more data need to be clustered, and the cost of PPODC grows more sharply than ours. Figure 6(b) shows the computation overheads of the Map stage and the Reduce stage during the execution of PPCOM, respectively. The Map stage takes a larger proportion of the total cost than Reduce, and both scale linearly with n. In addition, doubling the parallelized MCUs saves almost 40% of the Map execution time.

7.3. Comparative Summary

As shown in Table 4, we summarize qualitative comparisons with existing outsourced k-means protocols. All protocols are claimed to protect input data privacy. The encryption schemes of the first three are constructed on random transformations, such as a randomized kernel matrix [11] and random invertible matrices [14, 16]. They were proven to be secure against KSA (known-sample attack), in which the attacker knows a set of plain data objects in the dataset but not the corresponding encrypted values. Yuan and Tian's work [14] can also defeat LAA (linear analysis attack) introduced by [23]. However, these schemes are weak when the attacker obtains both some data objects and their encryptions. The latter four outsourcing protocols are based on homomorphic encryption techniques, which can resist CPA (chosen plaintext attack). References [12, 13] adopt Liu's FHE as the underlying encryption scheme, which may not be secure, as illustrated by [18]. Rao et al.'s [15] and our encryption schemes achieve semantic security, relying on the hardness of Decisional Composite Residuosity [44] and a Diffie-Hellman-type problem [30], respectively.

From this table, it can be seen that only [16] and ours support computation over encrypted data under multiple keys, whereas one drawback of [16] is that it either reveals the data owners' keys to the query user or reveals the query user's key to the owners. Rao et al.'s scheme [15] and ours hide access patterns by executing clustering in an oblivious way, preventing cloud servers from learning the assignment membership of encrypted records, which could otherwise be used to launch inference attacks. Several schemes require data owners to participate in the mining process to update cluster centroids or to assist similarity comparison, except those of [11, 15] and ours. Almost all works allow the cloud to perform comparison operations between encrypted distances, while the approach in [13] uses a plain updatable matrix to compare. What is more, only Yuan and Tian's work [14] and ours consider how to integrate a big data processing framework into privacy-preserving protocols. As a consequence, our solution achieves the most comprehensive security requirements and feasibility for clustering outsourcing compared with current works.

8. Conclusion

In this paper, we proposed an efficient privacy-preserving protocol for outsourced k-means clustering over joint datasets encrypted under multiple data owners' keys. Utilizing a double decryption cryptosystem, we designed a series of privacy-preserving building blocks to transform ciphertexts and to evaluate addition, multiplication, equality testing, and comparison over encrypted data. Our protocol not only protects the privacy of the aggregated database but also hides access patterns under the semihonest model. Another improvement is that the outsourced clustering works under a big data processing framework, so it can be scaled to process big data. Experiments on a real dataset show that our scheme is more efficient than existing approaches. However, the computation and communication costs of PPCOM are still heavy for large datasets. Our future work will focus on improving the efficiency of the outsourced protocols while protecting data privacy against stronger attacks, for example, collusion between the two CSPs.

Conflicts of Interest

The authors declare that they have no conflicts of interest.