Abstract

As a commonly used algorithm in data mining, clustering has been widely applied in many fields, such as machine learning, information retrieval, and pattern recognition. In practice, the data to be analyzed are often distributed among multiple parties. Moreover, rapidly increasing data volumes put heavy computing pressure on data owners. Thus, data owners tend to outsource their data to cloud servers and obtain analysis results for the federated data. However, existing privacy-preserving outsourced k-means schemes cannot verify whether participants share consistent data. Considering scenarios with multiple data owners and the security of sensitive information in an outsourced environment, we propose a verifiable privacy-preserving federated k-means clustering scheme. In this article, cloud servers and participants run the k-means clustering algorithm over encrypted data without exposing private data or the intermediate results of each iteration. In particular, our scheme can verify the shares from participants when updating the cluster centers, based on secret sharing, a hash function, and blockchain, so that it can resist inconsistent share attacks by malicious participants. Finally, security and experimental analyses show that our scheme can protect private data and obtain high-accuracy clustering results.

1. Introduction

Data mining technology can be used to analyze and extract potentially valuable information from large collections of data. Clustering algorithms are widely used in data mining and play an important role in medical, scientific, and commercial applications. In brief, clustering [1, 2] algorithms divide data items into groups according to their features and attributes, such that the data items in the same group are sufficiently similar. As a well-known clustering algorithm, the k-means clustering [3] algorithm has the advantages of a simple process and good clustering results, and it assigns data to clusters based on their distances from the cluster centers.

Data analysis can extract a lot of useful information, but in the process a large amount of private personal data is collected and analyzed, such as living habits, criminal records, and medical records. Furthermore, breaches of private data often result in financial losses and panic for society and companies. Data privacy has therefore gained more attention than before. Research on privacy preservation can be found in [4–6]. Moreover, many privacy-preserving data mining schemes have been proposed in recent years [7–9].

The traditional privacy-preserving k-means schemes are primarily achieved through the interaction of participants. Vaidya and Clifton [10] first proposed a multi-party privacy-preserving k-means clustering protocol on vertically partitioned data, where secure distance computation and comparison are supported by a secure permutation scheme and homomorphic encryption. Jha et al. [11] introduced a two-party privacy-preserving k-means clustering protocol based on oblivious polynomial evaluation and homomorphic encryption. Bunn and Ostrovsky [12] presented an efficient two-party clustering protocol on arbitrarily partitioned data, where the intermediate values are not disclosed thanks to a division protocol and a random values protocol. Yi and Zhang [13] introduced an equally contributory privacy-preserving k-means clustering protocol based on ElGamal, a plaintext equivalence test protocol, and mix networks. Xing et al. [14] proposed a mutual privacy-preserving k-means scheme for the social setting, where the parties are grouped with the help of a data analyst and the scheme can resist collusion attacks. Zhang et al. [15] combined secure multi-party computation and differential privacy to train a privacy-preserving k-means clustering model. After that, privacy-preserving k-means schemes in the malicious model were proposed in [16, 17].

Recently, the explosive growth of data has posed a challenge for data owners in storage and computation, and a cloud server with high storage capacity and strong computing power is a good solution to the problem. Privacy protection and auditing in the cloud environment have also been studied in [18–22]. Accordingly, there are some privacy-preserving k-means schemes for the outsourced environment. Liu et al. [23], following the framework in [24], presented a privacy-preserving outsourced k-means clustering protocol in which one party outsources the distance computation to a cloud server without revealing either the data or the clustering results to any other party or to the cloud server. Jiang et al. [25] introduced an efficient two-party privacy-preserving k-means clustering protocol, which computes distances securely using the subprotocols in [26] and updates cluster centers using the garbled circuit proposed in [27]. Zou et al. [28] proposed a highly secure privacy-preserving outsourced k-means clustering scheme using BCP homomorphic encryption and AES encryption under multiple keys. Sakellariou and Gounaris [29] introduced a privacy-preserving outsourced k-means scheme with low client-side load based on key switching and Paillier encryption. However, the existing privacy-preserving outsourced k-means schemes cannot verify whether participants share consistent data. In this article, we propose a multi-party verifiable privacy-preserving federated k-means scheme for horizontally partitioned data.

Our main contributions can be summarized as follows:
(1) We propose a privacy-preserving k-means scheme based on the Paillier cryptosystem, secret sharing, a hash function, and blockchain. In the multi-party scenario, we outsource the main computing task to the cloud server and reduce the computing overhead of participants.
(2) Our scheme can protect the participants’ information and avoid leaking the cluster centers to participants in each iteration. Furthermore, a malicious participant can be detected using the hash function and blockchain in the process of updating cluster centers.
(3) We carry out an experimental study to validate our scheme. The experimental results show that the cloud server undertakes the main computing task, with its running time accounting for 77% of the total. Moreover, our scheme can obtain high-accuracy clustering results.

The rest of the article is organized as follows. Section 2 introduces the preliminaries on k-means clustering and the required cryptographic tools. Section 3 presents the framework and the entities in our scheme. The basic secure protocols and our scheme are detailed in Section 4. Security analysis is carried out in Section 5. Performance analysis is presented in Section 6. Finally, we conclude this article in Section 7.

2. Preliminaries

For better elaboration, the notations used in this article and their semantic meanings are presented in Table 1.

2.1. k-Means Clustering Algorithm

The k-means clustering algorithm is one of the best-performing unsupervised clustering algorithms. It proceeds as follows. Assume that there is a set of samples $x_1, x_2, \ldots$ and that each sample is an $m$-dimensional vector. Suppose that the samples need to be grouped into $k$ clusters $S_1, \ldots, S_k$, where the cluster center of the $j$th cluster is denoted by $c_j$. Initially, $k$ samples are randomly selected as the initial cluster centers. The algorithm then iterates, measuring the distances between each sample and the cluster centers. In this article, we adopt the Euclidean distance as the criterion. Sample $x_i$ belongs to cluster $S_j$ if the cluster center $c_j$ is the closest to $x_i$. In each iteration, each sample is reassigned to the nearest cluster and the cluster centers are recomputed as in equation (1):

$$c_j = \frac{1}{|S_j|} \sum_{x_i \in S_j} x_i. \quad (1)$$

The iteration terminates when there is no or little change in the cluster centers. The k-means clustering algorithm is summarized in Algorithm 1.

Algorithm 1: The k-means clustering algorithm.
Randomly select k cluster centers from the dataset
Repeat
(1) Calculate the distances between each sample and the cluster centers
(2) Assign each sample to the closest cluster
(3) Replace each cluster center with the mean of the samples in its cluster
until the cluster centers do not change
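
As an illustration of the procedure in Algorithm 1, the following is a minimal plaintext sketch in Python. It is not part of the proposed scheme: the function names, the tolerance `tol`, the fixed random seed, and the handling of empty clusters are assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(samples, k, max_iter=100, tol=1e-9, seed=0):
    """Plaintext k-means; returns (centers, labels)."""
    rng = random.Random(seed)
    # Randomly select k samples as the initial cluster centers.
    centers = [list(c) for c in rng.sample(samples, k)]
    labels = [0] * len(samples)
    for _ in range(max_iter):
        # Steps (1) and (2): assign each sample to its closest center.
        labels = [
            min(range(k), key=lambda j: squared_distance(s, centers[j]))
            for s in samples
        ]
        # Step (3): replace each center with the mean of its cluster.
        new_centers = []
        for j in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == j]
            if members:
                dim = len(members[0])
                new_centers.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
            else:
                # Assumption of this sketch: an empty cluster keeps its center.
                new_centers.append(centers[j])
        shift = max(squared_distance(c, nc) for c, nc in zip(centers, new_centers))
        centers = new_centers
        # Stop when the centers change by no more than the tolerance.
        if shift <= tol:
            break
    return centers, labels
```

For example, `kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], 2)` separates the three points near the origin from the three points near (10, 10).
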
2.2. Homomorphic Encryption

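Homomorphic encryption allows certain computations to be carried out over encrypted data. The Paillier [30] cryptosystem is a popular homomorphic encryption scheme based on the decisional composite residuosity problem. Furthermore, the Paillier cryptosystem provides fast encryption and decryption, and it is widely used in privacy-preserving data mining. We adopt the Paillier cryptosystem in our scheme. It is briefly introduced as follows:
(i) Key generation: an entity selects two large primes $p$ and $q$ and computes $N = pq$ and $\lambda = \mathrm{lcm}(p-1, q-1)$. It then randomly chooses an integer $g \in \mathbb{Z}_{N^2}^{*}$ and checks whether $\gcd(L(g^{\lambda} \bmod N^2), N) = 1$, where $L(x) = (x-1)/N$. The public key is $pk = (N, g)$ and the secret key is $sk = \lambda$.
(ii) Encryption: let $m \in \mathbb{Z}_N$ be a message and $r \in \mathbb{Z}_N^{*}$ be a random number. The ciphertext of $m$ is computed by
$$c = E_{pk}(m) = g^{m} r^{N} \bmod N^2,$$
where $E_{pk}(\cdot)$ denotes encryption with the public key $pk$.
(iii) Decryption: the ciphertext $c$ of $m$ is decrypted by
$$m = D_{sk}(c) = \frac{L(c^{\lambda} \bmod N^2)}{L(g^{\lambda} \bmod N^2)} \bmod N,$$
where $D_{sk}(\cdot)$ denotes decryption with the secret key $sk$.
(iv) Homomorphism: the Paillier cryptosystem is additively homomorphic, that is, it satisfies the following equations:
$$D_{sk}\big(E_{pk}(m_1) \cdot E_{pk}(m_2) \bmod N^2\big) = m_1 + m_2 \bmod N, \qquad D_{sk}\big(E_{pk}(m)^{a} \bmod N^2\big) = a \cdot m \bmod N.$$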
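
To make the additive homomorphism concrete, the following toy Python sketch implements the key generation, encryption, and decryption steps described above with deliberately small primes. The function names, the choice of the generator as the modulus plus one (one common value that passes the gcd check), and the precomputed inverse `mu` are assumptions of this sketch. It is for illustration only and offers no security at these parameter sizes. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import random


def l_func(x, n):
    """The function L(x) = (x - 1) / n used in Paillier decryption."""
    return (x - 1) // n


def keygen(p, q):
    """Toy key generation from two primes p and q."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    g = n + 1  # assumption: a standard valid generator choice
    # mu is the inverse of L(g^lam mod n^2) modulo n, i.e. the denominator
    # of the decryption formula; it exists exactly when the gcd check passes.
    mu = pow(l_func(pow(g, lam, n * n), n), -1, n)
    return (n, g), (lam, mu)


def encrypt(pk, m):
    """Encrypt m as g^m * r^n mod n^2 with fresh randomness r."""
    n, g = pk
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2


def decrypt(pk, sk, c):
    """Recover m = L(c^lam mod n^2) * mu mod n."""
    n, _ = pk
    lam, mu = sk
    return (l_func(pow(c, lam, n * n), n) * mu) % n


pk, sk = keygen(1009, 1013)
n2 = pk[0] ** 2
c1, c2 = encrypt(pk, 41), encrypt(pk, 1)
sum_plain = decrypt(pk, sk, (c1 * c2) % n2)     # product of ciphertexts adds plaintexts
scaled_plain = decrypt(pk, sk, pow(c1, 3, n2))  # exponentiation scales the plaintext
```

Here `sum_plain` is 42 and `scaled_plain` is 123, matching the two homomorphic properties: multiplying ciphertexts adds the plaintexts, and raising a ciphertext to a constant multiplies the plaintext by that constant.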

2.3. Blockchain

Blockchain is the underlying technology of Bitcoin [31] and is essentially a distributed database. It is a relatively new form of network that relies on cryptography, hash functions, and proof of work (PoW). Blockchain has several attractive features, such as decentralization, tamper resistance, and transparency.
(i) Decentralization: the data on the blockchain are maintained by all nodes in the peer-to-peer network. Moreover, all nodes compete to generate new blocks without relying on a centralized third party to record transactions.
(ii) Tamper resistance: each node in the peer-to-peer network saves a copy of the data on the blockchain, so it is extremely difficult to tamper with the data once they have been recorded on the blockchain.
(iii) Transparency: the records on the blockchain are transparent to all nodes, and anyone can access the data on the blockchain.
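
The tamper-resistance property can be illustrated with a minimal hash-chain sketch in Python. This is only a single-process toy under stated assumptions: there is no peer-to-peer network, consensus, or proof of work, and the block layout and function names are hypothetical. It shows only why changing a recorded value is detectable once each block stores the hash of its predecessor.

```python
import hashlib
import json

GENESIS_PREV = "0" * 64  # placeholder predecessor hash for the first block


def block_hash(index, payload, prev_hash):
    """SHA-256 over a canonical JSON encoding of the block contents."""
    data = json.dumps(
        {"index": index, "payload": payload, "prev": prev_hash}, sort_keys=True
    )
    return hashlib.sha256(data.encode()).hexdigest()


def append_block(chain, payload):
    """Append a block that commits to the hash of the previous block."""
    prev = chain[-1]["hash"] if chain else GENESIS_PREV
    index = len(chain)
    chain.append(
        {
            "index": index,
            "payload": payload,
            "prev": prev,
            "hash": block_hash(index, payload, prev),
        }
    )


def is_valid(chain):
    """Recompute every hash and check the links between blocks."""
    prev = GENESIS_PREV
    for i, block in enumerate(chain):
        expected = block_hash(i, block["payload"], prev)
        if block["index"] != i or block["prev"] != prev or block["hash"] != expected:
            return False
        prev = block["hash"]
    return True
```

Appending a few payloads yields a chain for which `is_valid` returns `True`; editing the payload of any earlier block afterwards makes `is_valid` return `False`.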

3. Scheme Model

The scheme model is illustrated in Figure 1, where the multi-party verifiable privacy-preserving federated k-means model includes three entities. The first entity is the participants, that is, the data owners who provide the original data for the k-means algorithm, such as hospitals, scientific research companies, and government agencies. The second entity is a cloud server with adequate storage and computing resources, which is responsible for storing the encrypted samples of the participants and undertakes the main computational task in privacy-preserving k-means clustering. The last entity is the blockchain, which is used to store the hash values of secret shares. Because of the tamper-resistant nature of blockchain, once the hash values have been uploaded to the blockchain, it can be guaranteed that they will not change.

In the multi-party verifiable privacy-preserving federated k-means scheme, the entities are described as follows.
(i) Participants: the sample data to be analyzed are horizontally distributed among the participants. Each participant holds its own set of samples, and all samples have the same dimension. In Figure 1, participants shown in red are malicious. Malicious participants may not follow the protocol and may share inconsistent data. In our scheme, participants generate their public and secret keys and upload the encrypted samples to the cloud server. In addition, participants need to share data with a secret sharing scheme when updating the cluster centers. In order to verify the secret shares, participants need to compute the hash values of the secret shares and upload them to the blockchain in advance.
(ii) Cloud Server: all encrypted samples are stored on the cloud server. The cloud server is responsible for interacting with participants and for calculating and comparing the distances between encrypted samples and cluster centers. Then, the cloud server assigns samples to different clusters.
(iii) Blockchain: the blockchain is responsible for storing the hash values generated by the participants.

The aim of our scheme is to build a multi-party verifiable federated k-means scheme. In our scheme, participants only get the final k-means clustering results, and the cluster centers in each iteration are known only to the cloud server. Furthermore, participants and the cloud server cannot learn or infer private information about each other. In the process of updating the cluster centers, participants send secret shares to other participants and verify the secret shares received from other participants against the hash values. If a secret share fails verification, a malicious participant has sent inconsistent secret shares, so malicious participants can be detected in time and losses due to incorrect k-means results can be avoided.

4. Our Construction

We construct a multi-party verifiable privacy-preserving federated k-means scheme. In our scheme, we use Paillier encryption to protect the information of participants. Furthermore, secret sharing, a hash function, and blockchain technology are used to verify shares and update the cluster centers.

4.1. Basic Secure Protocol

We present a set of subprotocols that will be used in constructing the multi-party verifiable privacy-preserving federated k-means scheme. Throughout, $C$ denotes the cloud server, $P$ a participant holding the secret key $sk$, $B$ the blockchain, and $E_{pk}(\cdot)$ Paillier encryption under the public key $pk$.
(i) Secure Multiplication (SM) Protocol. In this protocol, the cloud server $C$ holds the inputs $E_{pk}(a)$ and $E_{pk}(b)$, participant $P$ holds $sk$, and $E_{pk}(a \cdot b)$ is output to $C$, where neither $C$ nor $P$ knows $a$ and $b$. Furthermore, information concerning $a$ and $b$ is not leaked to $C$ or $P$.
(ii) Secure Squared Euclidean Distance (SSED) Protocol. There exist two $m$-dimensional vectors denoted by $X = (x_1, \ldots, x_m)$ and $Y = (y_1, \ldots, y_m)$, where $E_{pk}(X)$ and $E_{pk}(Y)$ denote the sets of encrypted components of $X$ and $Y$. The cloud server $C$ with the input $(E_{pk}(X), E_{pk}(Y))$ and participant $P$ with $sk$ securely compute the encrypted value of the squared Euclidean distance between $X$ and $Y$, that is, $E_{pk}(|X - Y|^2)$.
(iii) Secure Bit-Decomposition (SBD) Protocol. In the SBD [32] protocol, the cloud server $C$ with input $E_{pk}(z)$ and participant $P$ with $sk$ securely compute the encrypted values of the individual bits of $z$, denoted by $[z] = (E_{pk}(z_1), \ldots, E_{pk}(z_l))$, where $0 \le z < 2^l$ and $z$ is not known to $C$ and $P$. No information regarding the output $[z]$ is revealed to $P$. Here, $z_1$ and $z_l$ are the most and least significant bits of $z$, respectively.
(iv) Secure Minimum out of 2 Numbers (SMIN2) Protocol. We assume that $u$ and $v$ are two $l$-bit values. Let $[u]$ and $[v]$ represent the encrypted bits of $u$ and $v$, respectively, where $u_i$ and $v_i$ denote the individual bits of $u$ and $v$. In this protocol, the cloud server $C$ with input $([u], [v])$ and participant $P$ with $sk$ securely output the encrypted values of the individual bits of $\min(u, v)$, where the first and last positions hold the most and least significant bits, respectively.
(v) Secure Minimum out of k Numbers (SMINk) Protocol. In this protocol, $d_i$ is a distance and $[d_i]$ denotes the encrypted bits of $d_i$, ordered from the most to the least significant bit. Consider the cloud server $C$ with the encrypted vectors $([d_1], \ldots, [d_k])$ and participant $P$ with $sk$. The goal of this protocol is to compute $[\min(d_1, \ldots, d_k)]$ without revealing any information regarding $d_i$ to $C$ and $P$.
(vi) Secure and Verifiable Vector Share (SVVS) Protocol. Each participant has a dataset whose samples are $m$-dimensional vectors. In this protocol, the participant securely shares its samples using a secret sharing scheme. First, the participant generates polynomials and computes the shares, that is, the values of the polynomials at publicly known points. Then, the participant computes the hash values of the shares and uploads them to the blockchain $B$.
(vii) Verify and Recover Secret (VRS) Protocol. In this protocol, the participants take the clustering results as input. Each participant sends shares to every other participant and receives shares from them. Then, it computes the hash values of the received shares and verifies whether these hash values are on the blockchain. If all hash values exist on the blockchain, the participant computes the sum of the shares it holds, which is not known to the other participants. The participant sends this sum to the cloud server $C$, and $C$ applies Lagrange interpolation to the set of sums to recover the sum of the secret values.

4.2. The Proposed Scheme

This section presents the detailed steps of the multi-party verifiable privacy-preserving federated k-means scheme.

4.2.1. Step 1: Assign Samples to Clusters

This step aims to assign each sample to its nearest cluster, where the cluster centers are initialized by the cloud server $C$.
(i) Step 1.1: participants generate polynomials and compute hash values. For each of its samples, each participant runs the SVVS protocol to generate polynomial values (shares) and computes the hash values of these shares. We use SHA256 in the SVVS protocol. Then, participants upload the hash values to the blockchain $B$.
(ii) Step 1.2: participants encrypt samples and upload ciphertexts to the cloud server. Each participant encrypts its samples with its own public key and uploads the ciphertexts to the cloud server.
(iii) Step 1.3: the cloud server randomly chooses $k$ cluster centers. The cloud server then encrypts the cluster centers using the public key $pk$ of each participant, obtaining one encrypted copy of the centers per participant.
(iv) Step 1.4: the cloud server and participants compute distances. For each participant, the cloud server holds that participant's encrypted samples and the encrypted cluster centers, and the participant holds the secret key $sk$. The cloud server and the participant run the SSED protocol to compute the encrypted distances between each sample and each of the $k$ cluster centers.
(v) Step 1.5: the cloud server assigns samples to $k$ clusters. To compare the encrypted distances, the cloud server and the participant run the SMINk protocol to get the minimum distance for each sample. A sample is assigned to the $j$th cluster if its minimum distance is equal to its distance to the $j$th cluster center. Finally, the cloud server gets the clustering results of each participant's samples, that is, for each participant, the set of its samples in each cluster. Then, the cloud server sends the clustering results to the corresponding participants. In other words, participants only know the clustering results of their own samples, and they do not know the clustering results of other participants' samples.

4.2.2. Step 2: Update Cluster Centers

This step aims to compute the new cluster centers.
(i) Step 2.1: participants send and receive shares of samples. After all samples have been assigned to clusters, participants run the VRS protocol to update the cluster centers. Because each sample's shares have already been calculated by the participants through the SVVS protocol in Step 1, the shares of each sample are readily available. To update the $j$th cluster center, participants send the shares of their samples in the $j$th cluster to the corresponding participants and receive the shares of the other participants' samples in the $j$th cluster. After receiving shares from other participants, participants compute the hash values of the shares and verify whether these hash values exist on the blockchain. If the hash values exist on the blockchain, the other participants have shared consistent values. Then, participants compute the sum of the shares and send it to the cloud server.
(ii) Step 2.2: calculate the new centers. The cloud server receives the sums of shares from the participants and uses Lagrange interpolation to find the sum of the samples in the cluster. Because the cloud server has the clustering results of all samples, it can easily get the number of samples in each cluster. Then, the cloud server computes each new center by dividing the sum of the samples in the cluster by the number of samples in it. Moreover, the new cluster centers are not leaked to the participants.

4.2.3. Step 3: Stop Iteration

This step aims to determine whether to terminate the privacy-preserving federated k-means algorithm. After Step 1 and Step 2, the cloud server compares the new cluster centers with the previous cluster centers. If they are close enough (e.g., the difference is no more than a threshold set in advance), the k-means algorithm ends; the clustering results have already been sent to participants in Step 1.5. Otherwise, the cloud server encrypts the new cluster centers using each participant's public key, and the scheme returns to Step 1.4 for another iteration.

5. Security Analysis

In this section, we discuss the privacy protection capabilities offered by our scheme. Firstly, we define the goals in privacy protection for different entities.

Each participant in the federated k-means scheme should not have access to the following information:
(1) Sample information from other participants
(2) Clustering results for other participants' samples
(3) Cluster centers in each iteration

For the cloud server, it can only get the cluster centers in each iteration and cannot obtain the samples’ information of participants.

The blockchain is only used to store hash values, which is publicly available and does not interact with the participants or the cloud server, so it does not access any private information about participants or the cloud server.

5.1. Privacy-Preserving Analysis for Assigning Samples

In the stage of assigning samples, each participant encrypts samples with the Paillier cryptosystem. Because Paillier’s cryptosystem is semantically secure, other participants and cloud servers cannot decrypt and deduce additional information from the encrypted data.

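According to our scheme, the distances between samples and cluster centers are computed with the SM and SSED protocols. In the SM protocol, the cloud server holds the encrypted values $E_{pk}(a)$ and $E_{pk}(b)$, and the participant holds the secret key $sk$. The aim of the protocol is to return the encryption of the product, $E_{pk}(a \cdot b)$, to the cloud server through interaction between the cloud server and the participant. The idea of the SM protocol is based on the following property, which holds for any $r_a, r_b \in \mathbb{Z}_N$:

$$a \cdot b = (a + r_a)(b + r_b) - a \cdot r_b - b \cdot r_a - r_a \cdot r_b.$$

The detailed SM protocol is shown in Algorithm 2. First, the cloud server chooses two random numbers $r_a, r_b \in \mathbb{Z}_N$, which are known only to the cloud server. Then, the cloud server computes $a' = E_{pk}(a) \cdot E_{pk}(r_a)$ and $b' = E_{pk}(b) \cdot E_{pk}(r_b)$ and sends them to the participant. After receiving $a'$ and $b'$, the participant decrypts them and computes $h = D_{sk}(a') \cdot D_{sk}(b') \bmod N$. Then, the participant sends $h' = E_{pk}(h)$ to the cloud server. Finally, the cloud server removes the extra terms homomorphically to obtain $E_{pk}(a \cdot b)$. Note that $N - x$ is equivalent to $-x$ modulo $N$ for any $x \in \mathbb{Z}_N$. During the protocol, no information about $a$ and $b$ other than $E_{pk}(a \cdot b)$ is revealed to the cloud server.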

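Algorithm 2: Secure multiplication (SM) protocol. All ciphertext products are taken modulo $N^2$.
Require: $C$ has $E_{pk}(a)$ and $E_{pk}(b)$; $P$ has $sk$
(1) $C$:
  (a) Pick two random numbers $r_a, r_b \in \mathbb{Z}_N$
  (b) $a' \leftarrow E_{pk}(a) \cdot E_{pk}(r_a)$, $b' \leftarrow E_{pk}(b) \cdot E_{pk}(r_b)$
  (c) Send $a'$, $b'$ to $P$
(2) $P$:
  (a) $h \leftarrow D_{sk}(a') \cdot D_{sk}(b') \bmod N$, $h' \leftarrow E_{pk}(h)$
  (b) Send $h'$ to $C$
(3) $C$:
  (a) $s \leftarrow h' \cdot E_{pk}(a)^{N - r_b}$, $s' \leftarrow s \cdot E_{pk}(b)^{N - r_a}$
  (b) Compute $E_{pk}(a \cdot b) \leftarrow s' \cdot E_{pk}(r_a \cdot r_b)^{N - 1}$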

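In the SSED protocol shown in Algorithm 3, the cloud server holds the encrypted vectors $E_{pk}(X)$ and $E_{pk}(Y)$, and the participant holds the secret key $sk$. The goal of the protocol is to compute the encrypted squared Euclidean distance between $X = (x_1, \ldots, x_m)$ and $Y = (y_1, \ldots, y_m)$. The cloud server first computes $E_{pk}(x_i - y_i)$ homomorphically, then obtains $E_{pk}((x_i - y_i)^2)$ using the SM protocol, and finally computes $E_{pk}(|X - Y|^2) = \prod_{i=1}^{m} E_{pk}((x_i - y_i)^2)$, which is possible because the Paillier cryptosystem is additively homomorphic. Since the SM protocol does not reveal private information to the cloud server, the SSED protocol does not leak information either.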

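Algorithm 3: Secure squared Euclidean distance (SSED) protocol.
Require: $C$ has $E_{pk}(X)$ and $E_{pk}(Y)$; $P$ has $sk$
(1) $C$:
  for $i = 1$ to $m$ do:
   $E_{pk}(x_i - y_i) \leftarrow E_{pk}(x_i) \cdot E_{pk}(y_i)^{N - 1}$
(2) $C$ and $P$:
  for $i = 1$ to $m$ do:
   Use SM to compute $E_{pk}((x_i - y_i)^2)$ with $E_{pk}(x_i - y_i)$
(3) $C$:
  Compute $E_{pk}(|X - Y|^2) \leftarrow \prod_{i=1}^{m} E_{pk}((x_i - y_i)^2)$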
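
The secure multiplication and squared-distance protocols above can be simulated in a single Python process as follows. This is a sketch under stated assumptions, not the implementation used in the experiments: both roles run in the same program, the Paillier parameters are toy-sized and insecure, the generator is fixed to the modulus plus one, and all names (`enc`, `dec`, `secure_multiply`, `secure_squared_distance`) are hypothetical. Only the code marked as the participant's part calls `dec`.

```python
import math
import random

# Toy Paillier parameters (insecure sizes; g = N + 1 is an assumption).
P_PRIME, Q_PRIME = 1009, 1013
N = P_PRIME * Q_PRIME
N2 = N * N
G = N + 1
LAM = (P_PRIME - 1) * (Q_PRIME - 1) // math.gcd(P_PRIME - 1, Q_PRIME - 1)
MU = pow((pow(G, LAM, N2) - 1) // N, -1, N)


def enc(m):
    """Paillier encryption with the public key (N, G)."""
    while True:
        r = random.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return pow(G, m % N, N2) * pow(r, N, N2) % N2


def dec(c):
    """Paillier decryption; in the protocol only the participant can do this."""
    return (pow(c, LAM, N2) - 1) // N * MU % N


def secure_multiply(ea, eb):
    """Return an encryption of a*b given encryptions of a and b."""
    # Cloud server: blind both ciphertexts with random values.
    ra, rb = random.randrange(N), random.randrange(N)
    a_blind = ea * enc(ra) % N2
    b_blind = eb * enc(rb) % N2
    # Participant: decrypt the blinded values, multiply, re-encrypt.
    h_enc = enc(dec(a_blind) * dec(b_blind) % N)
    # Cloud server: remove a*rb, b*ra and ra*rb homomorphically.
    s = h_enc * pow(ea, N - rb, N2) % N2
    s = s * pow(eb, N - ra, N2) % N2
    return s * pow(enc(ra * rb % N), N - 1, N2) % N2


def secure_squared_distance(ex, ey):
    """Return an encryption of the squared Euclidean distance."""
    total = enc(0)
    for exi, eyi in zip(ex, ey):
        diff = exi * pow(eyi, N - 1, N2) % N2            # E(x_i - y_i)
        total = total * secure_multiply(diff, diff) % N2  # add (x_i - y_i)^2
    return total


ex = [enc(v) for v in [3, 7, 1]]
ey = [enc(v) for v in [6, 2, 5]]
distance = dec(secure_squared_distance(ex, ey))  # (-3)^2 + 5^2 + (-4)^2
```

In this run `distance` decrypts to 50. Negative differences are represented modulo the Paillier modulus, which is why the squared result is still correct as long as it stays below the modulus.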

After computing the distances between samples and each cluster center, each sample is assigned to the closest cluster with the SBD, SMIN2, and SMINk protocols. In the SBD protocol, for an input $z$, the cloud server can only get the encryptions of the individual bits of the binary representation of $z$. During the process, the cloud server has $E_{pk}(z)$ and participants have $sk$, and $z$ is not known to either of them. The detailed SBD protocol is described in [32].

The SMIN2 protocol is described in Algorithm 4; it helps the cloud server get the encrypted individual bits of $\min(u, v)$. The participant cannot directly obtain $u$ or $v$; it only sends an encrypted identifier to the cloud server. Then, the cloud server can obtain $[\min(u, v)]$ without decrypting it or learning $u$ and $v$. Similarly, the SMINk protocol shown in Algorithm 5 is private and secure.

Require: has and , where ; has
(1):(a)Randomly choose the functionality (b)fordo:  ifthen:   , ;   else   , ;    , ; and    , ; (c), . Send and to
(2):(a); for (b)if such that then:else(c); for , Send and to
(3):(a)(b)fordo:ifthen:else(c)Get the
Require: has ; has
(1)  :  ; for ,
(2) and :(a)  for   for    ifthen:      SMIN2 ,     else      SMIN2 , (b)  
(3):  , Get
5.2. Privacy-Preserving Analysis for Updating Cluster Centers

In the step of updating the cluster centers, we mainly use Shamir's secret sharing. The SVVS protocol is described in Algorithm 6. In the SVVS protocol, each participant generates a polynomial for each sample. Participants compute the shares and the hash values of the shares. Then, participants upload the hash values to the blockchain. The VRS protocol is described in Algorithm 7. In the VRS protocol, participants send shares to and receive shares from each other. The secret values of a participant cannot be revealed even if all remaining participants pool their shares, because each participant executes Shamir's secret sharing algorithm with a random polynomial of degree $n - 1$, where $n$ is the number of participants. To compute the coefficients of such a polynomial, at least $n$ values of the polynomial are needed. In the SVVS protocol, each participant actually computes $n$ values of its polynomial but sends only $n - 1$ of them to the other participants, keeping one polynomial value for itself. Thus, as long as the participants do not reveal the polynomial value they hold, the secret value cannot be deduced even if the remaining participants combine their shares. So the protocol can protect the privacy of the participants.

Require: has a random number generate; has
(1):(a)Choose publicly known random numbers (b)Send to
(2):(a)fordo:  Fordo:    chooses a random polynomial , where (b)fordo: computes shares computes and uploads to blockchain
Require: has the clustering results
(1):
 In th cluster, have the set of samples . For ease of expression, we assume here that . In fact, may contains a lot of samples.(a)fordo:   has computed the polynomial values of and hash values in SVVS protocol, where the polynomial values are   fordo:    sends to and receive form .    computes .   while is not on B do:    Call resend .(b)fordo: computes . sends to C.
(2): Solves the set of using Lagrange’s interpolation to find sum of samples , where .
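
The share-and-verify flow of the two protocols above can be sketched in Python for scalar secrets. This is a simplified illustration under stated assumptions: the prime field, the evaluation points 1..n, the way a share is encoded before hashing (owner index, point index, value), the use of a Python set as a stand-in for the blockchain, and all function names are choices of this sketch. A failed check raises an error here, whereas the protocol described above asks the sender to resend.

```python
import hashlib
import random

PRIME = 2**61 - 1  # a Mersenne prime used as the field modulus (assumption)


def make_shares(secret, xs, rng):
    """Evaluate a random polynomial with constant term `secret` at each x."""
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(len(xs) - 1)]
    return [
        sum(c * pow(x, e, PRIME) for e, c in enumerate(coeffs)) % PRIME
        for x in xs
    ]


def share_hash(owner, idx, value):
    """SHA-256 of a share, bound to its owner and evaluation point."""
    return hashlib.sha256(f"{owner}:{idx}:{value}".encode()).hexdigest()


def lagrange_at_zero(points):
    """Interpolate the polynomial through `points` and evaluate it at 0."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        total = (total + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return total


def aggregate(secrets, seed=0, tamper=None):
    """Return the sum of the secrets, verifying every exchanged share."""
    rng = random.Random(seed)
    n = len(secrets)
    xs = list(range(1, n + 1))  # publicly known evaluation points
    # shares[i][j] is participant i's polynomial evaluated at point j.
    shares = [make_shares(s, xs, rng) for s in secrets]
    # Hashes are published in advance; a set stands in for the blockchain.
    ledger = {share_hash(i, j, shares[i][j]) for i in range(n) for j in range(n)}
    if tamper is not None:
        i, j = tamper  # simulate participant i sending a wrong share to j
        shares[i][j] = (shares[i][j] + 1) % PRIME
    sums = []
    for j in range(n):
        for i in range(n):
            if i != j and share_hash(i, j, shares[i][j]) not in ledger:
                raise ValueError(f"share from participant {i} failed verification")
        # Participant j adds up the shares it holds and sends the sum.
        sums.append((xs[j], sum(shares[i][j] for i in range(n)) % PRIME))
    # The cloud server recovers only the sum of the secrets.
    return lagrange_at_zero(sums)
```

For example, `aggregate([5, 11, 26])` returns 42 without any individual secret being reconstructed, while `aggregate([5, 11, 26], tamper=(0, 1))` raises `ValueError` because the altered share no longer matches a published hash.
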
5.3. Resisting Attacks from Malicious Participants

In our multi-party verifiable privacy-preserving federated k-means scheme, we do not assume that all participants are semihonest. Thus, there may be malicious participants in the privacy-preserving k-means scheme. In the following, we show that our scheme can resist inconsistent sharing attacks from malicious participants. We assume that the cloud server is semihonest and will follow the proposed protocol.

A malicious participant may send inconsistent shares (e.g., shares of samples that do not belong to it) to other participants in the process of updating cluster centers. Therefore, when participants receive shares from others, they compute the hash values of the shares and verify whether the hash values are on the blockchain. Furthermore, once information has been agreed on and added to the blockchain, it is recorded by all nodes together and is cryptographically linked to the preceding and following blocks, making tampering very difficult and costly. Therefore, if a participant receives a share that cannot be verified, the sender has sent an inconsistent share. We can conclude that the sender did not follow the proposed protocol, and it is considered to be a malicious participant. In this way, incorrect k-means results due to inconsistent shares from malicious participants can be avoided.

6. Performance Analysis

6.1. Experimental Analysis

We use two datasets of different sizes for the experiments. The first one is a 2-dimensional synthetic dataset S1 [33], which is a clustering benchmark. S1 contains 2000 samples and 7 cluster centers. The second dataset is the HCV [34], which contains laboratory values of blood donors and hepatitis C patients. HCV consists of 588 10-dimensional samples and 4 cluster centers.

To show that our scheme is suitable for multi-party scenarios, we assume that samples are randomly distributed among three participants, and we choose SHA256 as the required hash function. In the process of calculating metadata, the blockchain needs to run a consensus algorithm. The consensus time depends on the specific blockchain system selected. Therefore, this section only conducts experimental analysis on the proposed scheme and does not consider the execution cost of the blockchain itself.

The running times of the experiments on datasets S1 and HCV are shown in Tables 2 and 3, respectively, which report the data size, the number of clusters, and the running times of the cloud server and of the participants. The unit of time is seconds. We record the running time of one iteration under different data sizes, and the running time increases proportionally with the data size.

We calculate the respective proportions of the running time spent by the cloud server and by the participants to judge the performance of our scheme. In our scheme, the cloud server undertakes the main computing tasks, and its running time accounts for 77% of the total.

6.2. Accuracy Analysis

We use the percentage of correctly classified samples as the measure of the accuracy of our scheme. For dataset S1, we compare the result of our privacy-preserving k-means clustering with that of plaintext k-means clustering. The result of privacy-preserving k-means clustering on S1 is shown in Figure 2, and the accuracy on the S1 dataset is 99.25%. In the same way, we calculated the accuracy on the HCV dataset to be 99.14%. So, we can conclude that our scheme obtains high-accuracy clustering results.
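
One way to compute such an agreement percentage between two clusterings is sketched below in Python. The text does not specify how cluster indices are matched between the plaintext and privacy-preserving runs, so this sketch assumes the best one-to-one relabeling of clusters, found by trying all permutations; that is practical only for small numbers of clusters, and the function name is hypothetical.

```python
from itertools import permutations


def clustering_accuracy(reference, predicted, k):
    """Fraction of samples whose predicted cluster matches the reference
    cluster under the best one-to-one relabeling of the k clusters."""
    best = 0
    for perm in permutations(range(k)):
        # perm[p] is the reference label that predicted label p is mapped to.
        matches = sum(1 for r, p in zip(reference, predicted) if perm[p] == r)
        best = max(best, matches)
    return best / len(reference)


reference = [0, 0, 1, 1, 2, 2]
predicted = [1, 1, 2, 2, 0, 1]
accuracy = clustering_accuracy(reference, predicted, 3)
```

In this small example five of the six samples agree after relabeling, so `accuracy` is 5/6.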

6.3. Functional Analysis

We analyze the functionality of our scheme and compare it with the schemes proposed in [13, 14, 25] in terms of several properties, including the number of parties, protection of participants' information, protection of cluster centers, and verifiability. The result is shown in Table 4. In [25], Jiang et al. combined homomorphic encryption and garbled circuits to design an outsourced two-party privacy-preserving k-means clustering scheme. Compared with the scheme in [25], our scheme extends to the multi-party setting. Furthermore, our scheme can verify the shares from other participants using secret sharing, a hash function, and blockchain technology. In the step of updating cluster centers, the shares are verified against hash values calculated in advance, and the blockchain ensures that once a hash value is verified, the participants are guaranteed to have shared consistent data. As Table 4 shows, the main advantage of our scheme over the other three schemes is verifiability.

7. Conclusion

In this article, we propose a multi-party verifiable privacy-preserving federated k-means scheme that provides a validation mechanism in an outsourced environment. By computing the hash values of secret shares in advance, we can detect malicious participants that send inconsistent shares and avoid incorrect k-means clustering results. The security and experimental analyses show that our scheme can protect privacy and obtain high-accuracy clustering results.

Data Availability

The datasets used to support the findings of this study come from https://cs.joensuu.fi/sipu/datasets/ and https://archive-beta.ics.uci.edu/ml/datasets/hcv+data.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this manuscript.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (no. 61702067) and in part by the National Natural Science Foundation of Chongqing (no. cstc2020jcyj-msxmX0343).