Special Issue: Big Data Analytics for Information Security
Research Article | Open Access
Hong Rong, Huimei Wang, Jian Liu, Jialu Hao, Ming Xian, "Privacy-Preserving k-Means Clustering under Multiowner Setting in Distributed Cloud Environments", Security and Communication Networks, vol. 2017, Article ID 3910126, 19 pages, 2017. https://doi.org/10.1155/2017/3910126
Privacy-Preserving k-Means Clustering under Multiowner Setting in Distributed Cloud Environments
With the advent of the big data era, clients who lack computational and storage resources tend to outsource data mining tasks to cloud service providers in order to improve efficiency and reduce costs. It is also increasingly common for clients to perform collaborative mining to maximize profits. However, because of growing concerns over privacy leakage, the data contributed by clients should be encrypted under their own keys. This paper focuses on privacy-preserving k-means clustering over joint datasets encrypted under multiple keys. Unfortunately, existing outsourced k-means protocols are impractical: they are restricted to a single-key setting, and they are inefficient and do not scale in distributed cloud computing. To address these issues, we propose a set of privacy-preserving building blocks and an outsourced k-means clustering protocol under the Spark framework. Theoretical analysis shows that our scheme protects the confidentiality of the joint database and mining results, as well as access patterns, under the standard semihonest model with relatively small computational overhead. Experimental evaluations on real datasets also demonstrate its efficiency improvements compared with existing approaches.
1. Introduction
With the tremendous amount of data collected each day, it is increasingly difficult for resource-constrained clients to perform computationally intensive tasks locally. Examples include mobile phones and IoT sensors with limited battery capacity, and small companies that lack hardware and software infrastructure. Thus, it is a viable option to outsource data mining tasks to a Cloud Service Provider (CSP), which provides massive storage and computation power in a cost-efficient way. By leveraging cloud platforms, many large IT companies offer machine learning services that help clients train and deploy their own models, for example, Amazon Machine Learning, Google Cloud Machine Learning Engine, and IBM Watson. Despite these advantages, privacy is a critical concern for cloud users who employ such services. Because many data records contain sensitive information, such as health conditions, financial records, and location information, outsourcing them in plain form inevitably reveals personal information to the CSP, which may be untrustworthy or even malicious. For instance, it is beneficial to improve diagnosis accuracy by applying data mining techniques to medical records gathered from multiple patients, but releasing such information to the public directly is prohibited by law in many countries, for example, under HIPAA. Therefore, appropriate mechanisms should be designed to guarantee that outsourced mining tasks are executed in a privacy-preserving manner.
In this paper, we focus on privacy protection techniques for outsourced k-means clustering, a data mining algorithm widely used in image analysis, information retrieval, pattern recognition, and other fields. The outsourced datasets are contributed by multiple data owners who are willing to collaborate in outsourced clustering in order to obtain more accurate results. Normally, the data records are encrypted with cryptographic tools to prevent them from being disclosed to other parties. The goal of our solution is to let cloud servers perform clustering over the encrypted data.
Traditional privacy-preserving clustering schemes cannot be directly adopted to address the privacy issues for outsourcing. They aim at computing clusters through interactions among different data holders without revealing respective data to others [7–9], whereas, in our case, the data are stored and processed by the cloud rather than clients themselves.
Most existing works on outsourced privacy-preserving clustering require cloud clients to use the same key for data encryption [10–15]. In practice, sharing the same encryption key has some disadvantages: (1) with a symmetric encryption scheme, compromised data owners can easily decrypt other owners’ encrypted data if they launch eavesdropping attacks, which is the case in [10–14]; (2) with asymmetric encryption, if the datasets are encrypted under the cloud’s public key, data owners cannot decrypt their uploaded data because they do not know the private key. One way to solve this is to allow data owners to encrypt their data under their own keys, but this calls for computation over encrypted data under multiple keys (denoted by multikey). The only work concerning multikey clustering was built on a multiplicative transformation method, and its solution exposes all owners’ keys to the query user, which causes security risks if the user is compromised.
Another issue with current research is that its threat models do not cover stronger attacks. The essence of the underlying schemes in [10, 14, 16] is to apply a random matrix to encrypt data. These are secure against the known-sample attack (namely, the attacker only knows some instances). But if the attacker also knows the corresponding encrypted values of some data (i.e., chosen plaintext attack), the remaining instances may be recovered by setting up enough equations. In addition, the fully homomorphic encryption (FHE) used in [12, 13] has been reported to be insecure. Furthermore, access patterns (assignment of data objects, etc.) are disclosed to cloud servers [14, 16]. They may be used to derive sensitive information regardless of data encryption, as indicated by [19, 20]. Last but not least, few works consider combining their privacy-preserving techniques with large-scale data processing frameworks to boost efficiency.
To address these challenges, we propose a novel solution for Privacy-Preserving k-means Clustering Outsourcing under Multikeys (PPCOM), which enables distributed cloud servers to perform clustering collaboratively over the aggregated datasets with no privacy leakage. Specifically, the major contributions of this paper are fourfold:
(i) Firstly, our solution allows the cloud to perform arithmetic computations over encrypted data under multiple keys. It is achieved by transforming ciphertexts under different keys into ones under a unified key through the double decryption cryptosystem. Since the encryption scheme is only partially homomorphic, we propose a secure addition subprotocol under the noncolluding two-server model, which does not reveal anything about input or output, including the input ratio. Based on these, the cloud can compute Euclidean distances between records and cluster centers.
(ii) Secondly, we propose two secure primitives to evaluate equality tests and compare ciphertexts, addressing the problem that encrypted data are incomparable because encryption is probabilistic. These two are further used in other privacy-preserving building blocks, including minimum Euclidean distance computing and encrypted bitmap converting, as well as cluster center updating.
(iii) Thirdly, on the basis of those privacy-preserving building blocks, we design the PPCOM protocol by integrating the Spark framework to accelerate the outsourcing process, which works in distributed cloud environments. Moreover, PPCOM requires no participation from clients after they upload their datasets encrypted under their own keys.
(iv) Fourthly, theoretical analysis demonstrates that the proposed protocol not only protects the privacy of aggregated data records and clustering centers, but also hides access patterns under the semihonest model. Experimental results on a real dataset show that PPCOM is much more efficient than existing methods in terms of computational overhead.
The rest of the paper is organized as follows. Section 2 reviews the related works. In Section 3, we describe the preliminaries required in the understanding of our proposed PPCOM. In Section 4, we formalize the system model, threat model, and design objectives. The design details of the proposed privacy-preserving building blocks as well as outsourcing protocol are presented in Section 5. We provide security analysis in Section 6. Section 7 shows the theoretical and experimental evaluation results. Finally, we conclude the paper and outline future work in Section 8.
2. Related Work
There have been many works on privacy-preserving distributed k-means clustering [7–9]. These works have different security requirements and design goals compared with ours. In the distributed setting, where data are partitioned among multiple parties, the clustering task is undertaken by the data holders instead of centralized servers. Generally, their schemes exploit secure multiparty computation (SMC) techniques so that nothing about one party’s data is disclosed to the others except the final results. In clustering outsourcing, by contrast, data owners intend to transfer the major workloads to cloud servers in order to reduce costs and improve efficiency. During the entire outsourcing process, all the inputs and outputs, as well as intermediate results, are supposed to be encrypted to ensure confidentiality.
As for outsourced k-means clustering, Liu et al. first leveraged FHE to perform outsourced clustering. To compare encrypted distances, their approach requires the data owner to provide trapdoor information during each iteration, which entails heavy overhead on clients. To reduce the amount of data owner participation, Almutairi et al. presented an efficient mechanism based on the concept of an Updatable Distance Matrix (UDM). Nevertheless, both works reveal partial private information to cloud servers, such as the size of each cluster and the distance between a data object and a centroid. Moreover, the encryption scheme adopted in [12, 13] has been reported to be insecure.
Another line of research for outsourced clustering uses distance-preserving data perturbation or transformation techniques to encrypt the dataset. Preserving distances after encryption enables the cloud to update clusters independently without the data owner’s involvement, which achieves efficiency close to that of clustering unencrypted data. However, as [17, 22] pointed out, these solutions are weak in security. Specifically, if attackers obtain some original instances (i.e., known-sample attack (KSA)), the rest may be recovered by identifying the corresponding encrypted ones and setting up enough equations. The work by Lin focused on kernel k-means instead of standard k-means to avoid the distance-preserving vulnerability of random transformation. For outsourced collaborative data mining, Huang et al. used asymmetric scalar-product-preserving encryption (ASPE), which is resilient to KSA, to compare distances. However, Yao et al. demonstrated that ASPE is vulnerable to the linear analysis attack (LAA). To accelerate clustering, much research [24, 25] has integrated MapReduce into the k-means algorithm and optimized its performance, but few of these works take privacy protection into account. Most recently, a secure scheme based on MapReduce that supports large-scale datasets was proposed by Yuan and Tian. By preserving the sign of the encrypted distance difference, as ASPE does, their approach enables the cloud to assign a data object to its closest cluster. It is resistant against both KSA and LAA. Unfortunately, none of the previously mentioned encryption schemes were formally proved to be secure against chosen plaintext attack (CPA); meanwhile, some sensitive information, such as the assignment of data objects and the size of clusters, is directly disclosed to cloud servers.
To further reinforce security, Rao et al. proposed a semantically secure scheme for outsourced distributed clustering over the aggregated encrypted data from multiple users. Based on the Paillier cryptosystem, their solution protects not only the confidentiality of data contents, but also access patterns from cloud servers and other users. In addition, participation of users is no longer required during outsourcing. Their design objective is similar to ours, except that their protocol supports neither computation under multikeys nor the Spark framework. Besides, the cost of secure comparison is heavy, since each input has to be decomposed into encrypted bits by calling the SBD subroutine. This will be demonstrated in the experimental evaluations in Section 7. In regard to computation under the multikey setting, López et al. proposed a new FHE; however, its efficiency suffers from complex key-switching and heavy interactions among users. There are other works that use a double decryption mechanism or a proxy reencryption technique to convert ciphertext keys, allowing two servers to conduct addition and multiplication operations under multikeys. Nevertheless, these basic operations still cannot fulfill the need to perform more sophisticated data mining computations, for example, similarity measurement.
3. Preliminaries
In this section, we briefly introduce the typical k-means clustering algorithm and the public key cryptosystem with double decryption mechanism, which serve as the basis of our solution.
3.1. k-Means Clustering
Given $n$ records $t_1, \ldots, t_n$, the k-means clustering algorithm partitions them into $k$ disjoint clusters, denoted by $c_1, \ldots, c_k$. Let $\mu_j$ be the centroid value of $c_j$. A record $t_i$ assigned to $c_j$ has the shortest distance to $\mu_j$ compared with its distances to the other centroids, where $1 \le i \le n$ and $1 \le j \le k$. Let $\Lambda$ be the $n \times k$ matrix defining the membership of records, in which $\Lambda_{i,j} \in \{0, 1\}$, for $1 \le i \le n$, $1 \le j \le k$. Note that the $i$th record belongs to $c_j$ if $\Lambda_{i,j} = 1$; otherwise, $\Lambda_{i,j} = 0$.
Initially, $k$ records are selected randomly as cluster centers $\mu_1, \ldots, \mu_k$. Then the algorithm executes in an iterative fashion. For each $t_i$, the algorithm computes the Euclidean distance between $t_i$ and every centroid $\mu_j$ for $1 \le j \le k$ and updates $\Lambda$ accordingly, that is, it assigns $t_i$ to the closest cluster. Next, each centroid $\mu_j$ is derived by computing the mean values of the attributes of the records belonging to $c_j$. With the updated centroids, the clustering algorithm begins the next iteration. Finally, the algorithm terminates if the matrix $\Lambda$ no longer changes or if a predefined maximum number of iterations is reached.
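The iteration described above can be sketched in plain (unencrypted) Python. This is only an illustrative sketch: the function names, the fixed random seed, and the row-per-record layout of the membership matrix are assumptions made here, not details taken from the scheme.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(records, k, max_iter=100, seed=0):
    """Plain k-means; returns (centroids, membership matrix)."""
    rng = random.Random(seed)
    # Pick k distinct records as the initial cluster centers.
    centroids = [list(r) for r in rng.sample(records, k)]
    assignment = None
    for _ in range(max_iter):
        # Assign every record to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(r, centroids[j]))
            for r in records
        ]
        if new_assignment == assignment:
            break  # membership no longer changes
        assignment = new_assignment
        # Recompute each centroid as the attribute-wise mean of its members.
        for j in range(k):
            members = [r for r, a in zip(records, assignment) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    # membership[i][j] == 1 iff record i belongs to cluster j (assumed layout).
    membership = [[1 if a == j else 0 for j in range(k)] for a in assignment]
    return centroids, membership


records = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, membership = kmeans(records, 2)
```

Each row of `membership` contains exactly one 1, which is the access-pattern information that an outsourced protocol would need to keep hidden from the servers.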
3.2. Public Key Cryptosystem with Double Decryption
Public key cryptosystem with double decryption mechanism (denoted by PKC-DD) allows an authority to decrypt any ciphertext by using the master secret key without consent of corresponding owner. In this paper, we use the scheme proposed by Youn et al.  as our secure primitive, which is more efficient than the scheme in  in that Youn et al.’s approach applies smaller modulus in cryptographic operations. The major steps are shown in the following.
(i) Key generation (KeyGen): given a security parameter , the master authority chooses two primes and () and defines . Then it chooses a random number in such that the order of is . The master secret key is known only to the authority. The public parameters are . A cloud user picks a random integer as secret key and computes as public key.
(ii) Encryption (Enc): the encryption algorithm takes the message and as inputs and outputs , where , , and is a random bit integer.
(iii) Decryption with user key (uDec): the decryption algorithm takes ciphertext and as inputs, and outputs the message by computing .
(iv) Decryption with master key (mDec): given , , and , the authority decrypts by factorizing . The secret key of can be obtained by computing , where function is defined as . Then, is recovered by computing .
By applying the general conversion method in , the scheme was claimed to be IND-CCA2 secure under the hardness of solving the -DH Problem . However, Galindo and Herranz  have constructed an attack by generating invalid public keys and querying for the master decryption, which may lead to factorization of . To solve this, we adopt a slight modification of the scheme by checking the validity of the secret key, as proposed in . If , the master entity outputs a rejection message; otherwise, the decryption proceeds as usual.
4. Problem Statement
In this section, we formally describe our system model, threat model, and design objectives.
4.1. System Model
In our system model depicted in Figure 1, there are two types of entities, that is, Cloud Users and Cloud Service Provider. Cloud Users consist of Data Owners and Query Client. The Cloud Service Provider can be divided into Storage and Computation Provider and Cryptographic Service Provider. Storage and Computation Provider is composed of dozens of Executing Servers, whereas Cryptographic Service Provider comprises a Key Authority Server and a group of Assistant Servers. The description of each party is given as follows.
(1) Data owner (DO): DO is the proprietor of a large dataset. Due to lack of hardware and software resources, DO prefers to outsource his data to the cloud for storage and collaborative data mining. There are in the system. has dataset which contains attributes and records, for . The total number of records is .
(2) Query client (QC): QC is an authorized party requesting k-means clustering tasks over the federated datasets. QC should not be involved in outsourced computation and should be able to decrypt the result with his own secret key.
(3) Executing worker (EW): EW server is a cluster node within Storage and Computation Provider, which is responsible for storing DO’s dataset and performing computation over them. There are in the system. They together constitute a parallel Spark cluster, working on the same distributed file system like HDFS and providing cloud users with massive storage and computing power.
(4) Key authority (KA): KA belongs to Cryptographic Service Provider, which is assigned with distribution and management of public parameters and public/private keys, as well as the master key of the cryptosystem.
(5) Assistant worker (AW): AW is also part of Cryptographic Service Provider. AW server holds the public/private keys generated by KA, with which AW is able to assist EW to execute a series of privacy-preserving building blocks. We assume that there are AWs, that is, . All AWs and KA constitute the cluster of Cryptographic Service Provider.
Note that they offer sufficient computing power for temporal tasks, while they do not store the combined database.
Previous study has shown that it is impossible to implement a noninteractive protocol in the single server setting under the partially homomorphic encryption scheme . So at least two servers are required to complete the outsourced computation . In design of the system model, we take into account the situation that there are usually a large number of servers in one CSP. Moreover, it is feasible to propose secure outsourcing protocols through the cooperation between cloud servers from different CSPs. , generates its own key pair using the parameters produced by KA and encrypts with before uploading it to EW. With the joint datasets as inputs, the distributed cloud servers are scheduled to perform -means clustering algorithm in a privacy-preserving manner.
4.2. Threat Model
In our threat model, all cloud servers and clients are assumed to be semihonest, which means that they strictly follow the prescribed protocol but try to infer private information using the messages they receive during the protocol execution. This assumption is consistent with existing works [11–16] on privacy-preserving clustering in cloud environments. Furthermore, the cloud servers have some prior knowledge regarding the distribution of owners’ datasets that may be used to launch inference attacks by analyzing access patterns. DOs, QCs, EWs, AWs, and KA are also interested in learning plain data belonging to other parties. Therefore, a CPA adversary $\mathcal{A}$ is introduced in the threat model. The target of $\mathcal{A}$ is to decrypt the ciphertexts from the challenge DO and challenge QC with the following abilities:
(i) $\mathcal{A}$ may compromise all the EWs to guess the plaintexts of received ciphertexts from DOs and AWs during the execution of the protocol.
(ii) $\mathcal{A}$ may compromise all the AWs and KA to guess the plaintext values of ciphertexts sent from EWs during the protocol interactions.
(iii) $\mathcal{A}$ may compromise one or more DOs and QCs, except the challenge DO and the challenge QC, to decrypt the ciphertexts belonging to the challenge party.
Nevertheless, we assume that the adversary cannot compromise the EWs together with the AWs and KA simultaneously; otherwise, the adversary is able to decrypt any ciphertext stored on Storage and Computation Provider with the keys from Cryptographic Service Provider. In other words, there is no collusion between these two cloud providers, whereas servers within the same provider may collude with each other. We remark that such assumptions are typical in adversary models used in cryptographic protocols (e.g., [15, 28, 36]), in that cloud providers are mostly competitors and economically driven by different business models.
4.3. Design Objectives
Given the aforementioned system model and threat model, our design should achieve the following objectives:
(i) Correctness: if the cloud users and servers both follow the protocol, the final decrypted result should be the same as in the standard k-means algorithm.
(ii) Data confidentiality: nothing regarding the contents of the datasets and cluster centers, as well as the size of each cluster, should be revealed to the semihonest cloud servers.
(iii) Access pattern hiding: access patterns of the clustering process, such as which records are assigned to which clusters, should not be revealed to the cloud, to prevent any inference attacks.
(iv) Efficiency: most computation should be processed by the cloud in an efficient way, while DOs and QCs are not required to be involved in the outsourced clustering.
5. The PPCOM Solution
In this section, we first discuss a set of privacy-preserving building blocks. Then the complete protocol of PPCOM is presented.
Recall that, in Section 3.1, the semihonest but noncolluding cloud servers need to cooperate to perform computation over encrypted data under PKC-DD scheme. At first, KA takes a security parameter as input and generates public parameter and master secret key by executing . Also, KA generates a key pair used to unify ciphertext encryption key. After that, and are sent to AWs, while is distributed to DOs and QC, which are used to produce their own key pair for . Their generated public keys are sent back to KA for management. Hereafter, let denote the underlying encryption and and denote user-side decryption and master-side decryption, respectively. represents the bit length of .
5.1. Privacy-Preserving Building Blocks
We present eight privacy-preserving building blocks under multikeys. Five of them aim at solving basic operations over ciphertexts, including ciphertext transformation, multiplication, addition, equality test, and minimum, while the rest are especially designed for outsourced clustering.
It can be readily observed that the underlying encryption scheme is multiplicatively homomorphic, due to the following equation:
$$E_{pk}(m_1) \otimes E_{pk}(m_2) = E_{pk}(m_1 \times m_2), \tag{1}$$
where $m_1$ and $m_2$ are messages encrypted under the same public key $pk$. This property is critical: multiplication over ciphertexts can be evaluated by one EW server independently, as long as the encryptions are under the same public key. Hereafter, “$\otimes$” denotes the multiplication operation in the encrypted domain, while “$\times$” represents multiplication in the plaintext domain.
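To make the multiplicative property concrete, the following sketch uses a toy ElGamal-style scheme over a small prime field. It is a stand-in chosen for brevity, not the PKC-DD scheme itself: the modulus `P`, base `G`, and all function names are assumptions for illustration, and the parameters are far too small to be secure.

```python
import random

P = 2039  # toy safe prime (2 * 1019 + 1); illustrative only, not secure
G = 4     # assumed base element of the multiplicative group mod P


def keygen(rng):
    """Return a (secret key, public key) pair."""
    sk = rng.randrange(2, P - 1)
    return sk, pow(G, sk, P)


def enc(m, pk, rng):
    """Encrypt m (1 <= m < P) as the pair (G^r, pk^r * m) mod P."""
    r = rng.randrange(2, P - 1)
    return pow(G, r, P), (pow(pk, r, P) * m) % P


def dec(c, sk):
    """Recover m by dividing the second component by (first component)^sk."""
    a_part, b_part = c
    return (b_part * pow(pow(a_part, sk, P), -1, P)) % P


def ct_mul(c1, c2):
    """Component-wise product of two ciphertexts under the same key."""
    return (c1[0] * c2[0]) % P, (c1[1] * c2[1]) % P


rng = random.Random(1)
sk, pk = keygen(rng)
product = dec(ct_mul(enc(6, pk, rng), enc(7, pk, rng)), sk)
```

Here `product` equals 42: whoever holds the two ciphertexts can multiply the underlying plaintexts without decrypting, but only because both were encrypted under the same public key.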
5.1.1. Secure Ciphertext Transformation (SCT) Protocol
Given that EW holds , and AW holds , the goal of the SCT protocol is to transform encrypted message under public key into another ciphertext under public key . During execution of SCT, the plaintext should not be revealed to EW or AW; meanwhile the output is only known to EW. The complete steps are shown in Algorithm 1.
To start with, EW generates a number , which means that is randomly picked in . Note that ensures that is invertible in due to . Then we exploit multiplicative homomorphic property of PKC-DD to blind from AW, even if it is able to decrypt via using . Finally, EW removes the randomness by multiplying the encrypted inverse of due to . SCT protocol is especially useful for converting data under different encryption keys into ones under the same key so that EW can perform homomorphic operation.
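The blind, decrypt, re-encrypt, and unblind flow of SCT can be sketched with a toy ElGamal-style scheme. This is a rough illustration under several assumptions: the toy scheme replaces PKC-DD, AW's ability to decrypt is modelled simply by handing it the source secret key, and every name and parameter below is hypothetical.

```python
import random

P = 2039  # toy prime modulus; illustrative only, not secure
G = 4     # assumed base element


def keygen(rng):
    sk = rng.randrange(2, P - 1)
    return sk, pow(G, sk, P)


def enc(m, pk, rng):
    r = rng.randrange(2, P - 1)
    return pow(G, r, P), (pow(pk, r, P) * m) % P


def dec(c, sk):
    return (c[1] * pow(pow(c[0], sk, P), -1, P)) % P


def ct_mul(c1, c2):
    return (c1[0] * c2[0]) % P, (c1[1] * c2[1]) % P


def sct(c_src, pk_src, sk_src, pk_dst, rng):
    """Turn a ciphertext under pk_src into one under pk_dst."""
    # EW: blind the plaintext with a random factor that is invertible mod P.
    blind = rng.randrange(1, P)
    blinded_ct = ct_mul(c_src, enc(blind, pk_src, rng))
    # AW: decrypts and sees only m * blind mod P, then re-encrypts under pk_dst.
    seen_by_aw = dec(blinded_ct, sk_src)
    re_encrypted = enc(seen_by_aw, pk_dst, rng)
    # EW: remove the blinding by multiplying with the encrypted inverse.
    unblind = enc(pow(blind, -1, P), pk_dst, rng)
    return ct_mul(re_encrypted, unblind)


rng = random.Random(7)
sk_owner, pk_owner = keygen(rng)
sk_unified, pk_unified = keygen(rng)
converted = sct(enc(321, pk_owner, rng), pk_owner, sk_owner, pk_unified, rng)
recovered = dec(converted, sk_unified)
```

After the call, `recovered` is 321 under the destination key, while the value handled by the decrypting party was the blinded product rather than the message itself.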
5.1.2. Secure Addition (SA) Protocol
It takes and held by EW and held by AW as inputs. The output is the encrypted addition of and , that is, , which is only known to EW. As the encryption scheme is not additively homomorphic, it requires interactions between EW and AW.
To preclude AW from obtaining and , a straightforward solution for EW is to blind the inputs with a random value by multiplying , where and . Then the encrypted randomized data are sent to AW. Since AW holds the secret key, it is able to get , through decryption. The randomized addition (denoted by ) is computed by . After that, AW encrypts and sends it back to EW who can get the desired output by running . Nevertheless, it is very possible that partial privacy is revealed to AW. This is because the ratio of inputs can be calculated via , which may be utilized to distinguish inputs. As our threat model assumes the semihonest servers have some background knowledge of dataset distribution, it is effortless for AW to find correlations between encrypted records and known samples. Therefore, to achieve the privacy-preserving guarantees, disclosing input ratio should be prohibited during SA execution.
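The leak in the straightforward approach can be checked with plain modular arithmetic. The sketch below omits encryption entirely and works with the values AW would see after decryption; the modulus and names are illustrative assumptions.

```python
P = 2039  # toy prime modulus; illustrative only


def blind(x, y, r):
    """Values AW sees after decrypting the two blinded inputs."""
    return (x * r) % P, (y * r) % P


def ratio(u, v):
    """Modular ratio u / v, computable by anyone who sees u and v."""
    return (u * pow(v, -1, P)) % P


def naive_add(x, y, r):
    """Naive addition: AW adds the blinded values, EW strips the factor r."""
    bx, by = blind(x, y, r)
    blinded_sum = (bx + by) % P                 # computed by AW
    return (blinded_sum * pow(r, -1, P)) % P    # unblinded by EW


x, y = 12, 30
# Whatever blinding factor EW picks, AW always observes the same ratio x / y.
leaked = {ratio(*blind(x, y, r)) for r in range(1, 50)}
total = naive_add(x, y, 77)
```

The set `leaked` has a single element, `ratio(x, y)`, so the blinding factor hides the magnitudes but not the ratio, which is what an AW with background knowledge could match against known samples.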
We propose an enhanced SA protocol still under two-server model, which protects confidentiality of inputs and outputs as well as intermediate results. There are five steps in SA, the details of which are presented in Algorithm 2.
During Step (1), the cloud EW generates a set of random numbers, namely, , and calculates their corresponding encryptions under . By exploiting (1), EW computes several intermediate results, such as , and outputs . It can be easily verified that , , , , , , , , , .
During Step (2), AW decrypts using , for . It then computes additions: and their corresponding encryptions. It can be verified that , , .
During Step (3), the blinding factors in are removed by multiplying the encrypted values of . Then, EW generates a random number , which is used to blind . At Step (4), AW decrypts as , and calculates their addition . It can be verified that .
In the end of Step (5), EW gets rid of randomness by . So we can verify thatAfter that, EW computes the inverse of modulo . Note that can be obtained due to . We have , if and are primes of the forms and , where and are also primes, which means that exists in the degree domain. The desired output is calculated by . The correctness of SA protocol can be proven by the following equation:
Executing Algorithm 2 requires two rounds of interaction between EW and AW, which incurs more computational and communication overhead than the simple solution. However, it reveals no private information to either cloud server. The formal security analysis of SA is given in Section 6.
5.1.3. Secure Equality Test (SET) Protocol
Given that EW holds two encrypted values and while AW holds the secret key , the goal of SET is to test whether and are equal without revealing them to cloud servers. The detailed steps are presented in Algorithm 3.
At the first step, EW computes the fractions of two input ciphertexts. Supposing that and , it can be verified that , where . During Step (2), AW decrypts using as follows:Since is randomly selected in , is a random value if and only if . For , it is obvious to infer that . Thus, if AW obtains , the returned value is set to be true (denoted by ); otherwise, is false (denoted by ). It is worth noting that neither nor the intermediate result is revealed to the cloud during execution of SET.
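The idea of dividing the two ciphertexts, randomizing the quotient, and letting AW check whether it decrypts to 1 can be sketched as follows. The sketch assumes that the randomization is an exponentiation by a random value, and it uses a toy ElGamal-style scheme over a small safe prime in place of PKC-DD; the restriction to odd exponents below `Q` is a property of this toy group only, and all names are hypothetical.

```python
import random

P = 2039  # toy safe prime, P = 2 * Q + 1; illustrative only, not secure
Q = 1019
G = 4     # assumed base element


def keygen(rng):
    sk = rng.randrange(2, P - 1)
    return sk, pow(G, sk, P)


def enc(m, pk, rng):
    r = rng.randrange(2, P - 1)
    return pow(G, r, P), (pow(pk, r, P) * m) % P


def dec(c, sk):
    return (c[1] * pow(pow(c[0], sk, P), -1, P)) % P


def ct_div(c1, c2):
    """Ciphertext of m1 / m2, computed component-wise."""
    return (c1[0] * pow(c2[0], -1, P)) % P, (c1[1] * pow(c2[1], -1, P)) % P


def ct_pow(c, e):
    """Ciphertext of m^e, computed component-wise."""
    return pow(c[0], e, P), pow(c[1], e, P)


def secure_equality_test(c1, c2, sk, rng):
    # EW: randomize the quotient. In this toy group an odd exponent below Q
    # maps a value to 1 only if that value was already 1.
    e = rng.randrange(1, Q, 2)
    blinded = ct_pow(ct_div(c1, c2), e)
    # AW: decrypts; it sees 1 for equal inputs and a randomized value otherwise.
    return dec(blinded, sk) == 1


rng = random.Random(3)
sk, pk = keygen(rng)
same = secure_equality_test(enc(5, pk, rng), enc(5, pk, rng), sk, rng)
different = secure_equality_test(enc(5, pk, rng), enc(7, pk, rng), sk, rng)
```

In this sketch AW returns the boolean in the clear; the text above describes the returned flag as a value that is itself not revealed to the cloud, so a faithful version would encrypt it.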
5.1.4. Secure Squared Euclidean Distance (SSED) Protocol
For -means algorithm, we use squared Euclidean distance to measure the distance between the data record and cluster centroid, denoted by , supposing that EW holds the ciphertext of th data record , and the ciphertext of th cluster centroid , while AW holds the secret key , for and .
Note that is a vector composed of attributes which may be rational numbers. However, the ring does not support rational division operation, so a new form of expression is required to represent the cluster center. Let denote the new form of cluster center, where and represent the sum vector and the total number of records belonging to , respectively. It is easily observed that and . Furthermore, is defined as the scaled squared distance between and , which satisfies . Thus, can be calculated as follows:where , , and is the dimension size.
Taking encrypted record and encrypted center as inputs, EW and AW jointly execute SSED by invoking SA and output encryption of squared Euclidean distance , in which , , and . The SSED protocol should reveal neither the contents of and nor the Euclidean distance between them to cloud servers. Since implementation of SSED is straightforward by SA scheme, the design details are omitted.
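The integer-only representation of a cluster center can be checked on plaintext values. The sketch below assumes the scaled squared distance is the sum over attributes of (cluster size times record attribute, minus the corresponding entry of the sum vector) squared, so that it equals the true squared distance multiplied by the squared cluster size; the function names are illustrative.

```python
from fractions import Fraction


def scaled_squared_distance(record, cluster_sum, cluster_size):
    """Integer-only distance: sum of (size * t[h] - sum[h])^2 over attributes."""
    return sum((cluster_size * t - s) ** 2 for t, s in zip(record, cluster_sum))


def true_squared_distance(record, cluster_sum, cluster_size):
    """Exact squared distance to the rational centroid sum / size."""
    return sum(
        (t - Fraction(s, cluster_size)) ** 2 for t, s in zip(record, cluster_sum)
    )


record = (3, 4)
cluster_sum = (10, 5)   # attribute-wise sum of the cluster's records
cluster_size = 4        # so the centroid is (2.5, 1.25)

scaled = scaled_squared_distance(record, cluster_sum, cluster_size)
exact = true_squared_distance(record, cluster_sum, cluster_size)
```

Here `scaled` is 125 and `exact` is 125/16, so no division is needed until two distances have to be compared.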
5.1.5. Secure Squared Distance Comparison (SSDC) Protocol
Suppose that EW holds and AW holds , where , , . Apart from these, EW also has encrypted secrets associated with the distances, that is, . The output of SSDC is the encrypted minimum squared distance and its corresponding secret. Since our encryption scheme is probabilistic and does not preserve the order of messages, EW and AW should jointly compute the minimum without revealing , , and , as well as , to both servers.
Our key idea is to compute the fraction value between the two squared Euclidean distances, based on which AW is able to judge its relationship and returns EW as an encrypted identifier that indicates the minimum value. The fraction between the two squared Euclidean distances can be calculated as follows:Since and are integers within , the ratio may be a rational value in , according to (6). It can be observed that if ( truncates the decimal fraction while keeping the integer part), we deduce that ; otherwise, . The overall steps of SSDC are given in Algorithm 4.
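The truncated-ratio test can be illustrated on plaintext values. The sketch assumes each distance is carried as a scaled squared distance together with its cluster size, and that the ratio of the two true distances is formed by cross-multiplying with the squared sizes; encryption and blinding are left out, and the names are hypothetical.

```python
def first_is_smaller(scaled_a, size_a, scaled_b, size_b):
    """True iff distance a is strictly smaller than distance b.

    The true squared distances are scaled_a / size_a^2 and
    scaled_b / size_b^2, so their ratio is
    (scaled_a * size_b^2) / (scaled_b * size_a^2).
    """
    numerator = scaled_a * size_b ** 2
    denominator = scaled_b * size_a ** 2
    if denominator == 0:
        return False  # distance b is zero, so a cannot be strictly smaller
    # A truncated ratio of 0 means the ratio lies in [0, 1).
    return numerator // denominator == 0


# Distance a: 125 / 4^2 = 7.8125; distance b: 18 / 1^2 = 18.
a_wins = first_is_smaller(125, 4, 18, 1)
b_wins = first_is_smaller(18, 1, 125, 4)
tie = first_is_smaller(18, 1, 18, 1)
```

Only the truncated ratio is inspected, so the party making the decision does not need either distance in the clear, provided the ratio itself is suitably hidden as in the full protocol.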