Privacy-Preserving Federated Learning Framework with General Aggregation and Multiparty Entity Matching

Zhou Zhou, Youliang Tian, Changgen Peng

Wireless Communications and Mobile Computing, vol. 2021, Article ID 6692061, 14 pages, 2021. https://doi.org/10.1155/2021/6692061

Research Article | Open Access | Special Issue: Federated Learning for Internet of Things and Big Data


Academic Editor: Zhipeng Cai
Received: 03 Jan 2021
Revised: 27 Jan 2021
Accepted: 19 Jun 2021
Published: 28 Jun 2021

Abstract

The requirement for data sharing and privacy has brought increasing attention to federated learning. However, the existing aggregation models are too specialized and rarely address the issue of users' withdrawal. Moreover, protocols for multiparty entity matching are rarely covered. Thus, there is no systematic framework to perform federated learning tasks. In this paper, we systematically propose a privacy-preserving federated learning framework (PFLF) where we first construct a general secure aggregation model in federated learning scenarios by combining Shamir secret sharing with homomorphic cryptography to ensure that the aggregated value can be decrypted correctly only when the number of participants is greater than the threshold $t$. Furthermore, we propose a multiparty entity matching protocol by employing secure multiparty computation to solve the entity alignment problem, and a logistic regression algorithm to achieve privacy-preserving model training and to support the withdrawal of users in vertical federated learning (VFL) scenarios. Finally, the security analyses prove that PFLF preserves data privacy in the honest-but-curious model, and the experimental evaluations show that PFLF attains accuracy consistent with the original model and demonstrates practical feasibility.

1. Introduction

In 2016, AlphaGo used 300,000 recorded Go games as training data and beat the world's top professional Go players. Artificial intelligence (AI) has shown great potential and is expected to excel in many fields and make important contributions [1]. In traditional AI, data processing needs to aggregate a large amount of data for model training. However, due to industry competition, privacy protection requirements, business management, and other issues, the data of various industries forms isolated islands that are difficult to share. Therefore, data quality and availability are among the main constraints on AI development [2]. On the other hand, data privacy and security have become the focus of worldwide attention [3] following the devastating losses caused by data leaks in recent years. The European Union recently introduced a new law, the General Data Protection Regulation (GDPR) [4], which shows that increasingly strict management of user data privacy and security will be the world trend. The enactment of such laws and regulations also brings new challenges to the traditional AI processing mode.

How to solve the problems of data isolation and data fusion on the premise of protecting users' privacy has become an urgent task for the development of AI. The federated learning (FL) framework, first proposed by Google in 2016 [5, 6], meets those requirements well. In the FL model, each participant trains the model on its local data and transmits only the model parameters to an aggregation server, protected by cryptographic techniques; after completing the parameter aggregation, the server returns the aggregated parameters to each participant for updating [7]. In the end, a virtual common model is established, and the model obtained by completing the parameter exchange under encryption is consistent with the optimal model trained from the aggregated data [8] under the traditional mode. Recently, research on federated learning has become a hot topic, and much work on privacy-preserving deep learning has been done. In 2019, Yang et al. systematically introduced the federated learning framework, its applications, and research directions [9], which helps us understand federated learning as a whole. The federated learning framework has been applied and extended to Deep Neural Networks (DNN), eXtreme Gradient Boosting (XGBoost), Random Forest (RF), and other algorithms [2, 10, 11], and the techniques used include secret sharing [12], differential privacy [13], and homomorphic cryptography [14].

At present, privacy-preserving FL still faces the following problems. First, few protocols address multiparty entity matching [15] while ensuring privacy in a concurrent setting. In addition, too much interaction between users is required when encrypted gradients or parameters are passed to the server. Furthermore, existing protocols either assume that all users stay online throughout the training cycle without going offline, or require the assistance of other participants to recover the correct data when one participant goes offline. Finally, the existing schemes are of poor generality and are often designed only for specific machine learning algorithms and application scenarios.

To solve the above problems, we propose a novel PFLF with general aggregation and multiparty entity matching. In this framework, we propose a general aggregation model (GAM) that can be used in many applications where aggregation is required and privacy must be protected. Under our GAM, we construct a multiparty entity matching protocol (MEMP), which can complete the confirmation of the common user data without leaking any disjoint entity information. In addition, we design the vertical federated logistic regression (VFLR) algorithm while keeping the data in the local database. In summary, our contributions can be summarized as follows:
(i) We propose a PFLF that includes the GAM, MEMP, and VFLR to achieve multiscenario data aggregation, multiparty matching, and privacy protection.
(ii) We exploit homomorphic encryption and an improved Shamir secret reconstruction to ensure that only when the aggregator receives messages from at least $t$ participants can it recover the secret and remove the masks to acquire the correct parameters or product. In the GAM, there is little interaction among participants.
(iii) We propose MEMP to confirm common entities based on the GAM and the multiplicative homomorphism of RSA. It enables participants with different data characteristics to determine their common entities without leaking other useful information or allowing it to be inferred.
(iv) We design a privacy-preserving VFLR by using Paillier homomorphic encryption and merging the GAM and LR. It trains a secure VFLR model and supports the withdrawal of participants. In addition, the prediction accuracy of the model is not affected.
(v) We give a comprehensive security analysis of our framework. We claim that attackers will not acquire any useful information even if up to $t-1$ participants collude. Besides, extensive experiments are conducted to confirm that our framework is effective and efficient.

The rest of the paper is organized as follows. In Section 2, we describe the preliminaries and the main techniques. In Section 3, we describe the system architecture, security model, and problem description. In Section 4, we describe the algorithmic details of our GAM with privacy protection and construct the secure MEMP and the VFLR model. In Section 5, we present the security analysis of the framework. The experimental evaluation and related work are discussed in Sections 6 and 7, respectively. Finally, we conclude the paper in Section 8.

2. Preliminary

2.1. Logistic Regression Algorithm

Consider a dataset $\{(x_i, y_i)\}_{i=1}^{m}$ with dimension $d$, in which $x_i \in \mathbb{R}^d$, $y_i \in \{0, 1\}$. The predicted value is mapped between 0 and 1 by the sigmoid function [16] $h_\theta(x) = g(z) = \frac{1}{1 + e^{-z}}$, where $z = \theta^T x$ and $\theta$ is the parameter vector. The objective function is defined as follows:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log h_\theta(x_i) + (1 - y_i) \log\left(1 - h_\theta(x_i)\right) \right].$$

The gradient descent method is used to minimize the value of the objective function, and the model parameters are updated as follows:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right) x_{ij},$$

where $\alpha$ is the learning rate.

When given a new data point $x'$, the predictive value of logistic regression is set to

$$\hat{y} = h_\theta(x') = \frac{1}{1 + e^{-\theta^T x'}}.$$
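For reference, the following minimal NumPy sketch implements the three formulas above in the ordinary centralized setting; the variable names are generic choices rather than notation from the paper.

```python
# Centralized logistic regression via gradient descent (baseline sketch).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, alpha=0.1, iters=1000):
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m   # gradient of J(theta)
        theta -= alpha * grad                        # descent step
    return theta

X = np.array([[0.5, 1.2], [0.1, 0.7], [0.9, 0.3], [1.5, 1.8]])
y = np.array([0, 0, 1, 1])
theta = train_logreg(X, y)
pred = sigmoid(X @ theta)          # predicted values in (0, 1)
```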

2.2. Homomorphic Encryption

The Paillier scheme satisfying additive homomorphism [17] is as follows: $p, q$ are large primes of equal length chosen randomly; $n = pq$ and $\lambda = \operatorname{lcm}(p-1, q-1)$ are calculated. Given a random number $g \in \mathbb{Z}_{n^2}^{*}$, we have the public key $pk = (n, g)$ and the private key $sk = \lambda$. For encryption, given a random number $r$ satisfying $r \in \mathbb{Z}_{n}^{*}$, we have the ciphertext $c = g^{m} \cdot r^{n} \bmod n^2$, where $m$ is the plaintext. In the decryption phase, $m$ is obtained by computing $m = \frac{L(c^{\lambda} \bmod n^2)}{L(g^{\lambda} \bmod n^2)} \bmod n$, where $L(x) = \frac{x-1}{n}$.

We mainly use the following property of Paillier homomorphic encryption. Additivity can be indicated as $E(m_1) \cdot E(m_2) = E(m_1 + m_2)$.
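As a quick sanity check of this property, the snippet below uses the open-source python-paillier library (the `phe` package); it illustrates the homomorphic operations only and is not the paper's implementation.

```python
# Additive homomorphism of Paillier with the `phe` library.
from phe import paillier

pub, priv = paillier.generate_paillier_keypair(n_length=1024)
c1, c2 = pub.encrypt(3), pub.encrypt(4)

assert priv.decrypt(c1 + c2) == 7    # E(m1) * E(m2) decrypts to m1 + m2
assert priv.decrypt(c1 * 5) == 15    # scalar multiplication on a ciphertext
```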

2.3. Secret Sharing

The secret sharing (SS) scheme [18] adopted in our scheme is used to mask the data ciphertexts transmitted by the participants while still allowing the aggregator to recover the ciphertext product; it also helps the scheme support the withdrawal of participants. In a $(t, n)$ SS scheme, the secret $s$ is split into $n$ shares, and $s$ can be recovered only if at least $t$ shares are provided. The share generation algorithm is written as $\mathrm{SS.share}(s, t, n) \rightarrow \{(u, s_u)\}_{u \in U}$, in which $n$ represents the number of participants involved in SS, $U$ is the set of participants, and $s_u$ is the share for user $u$. The secret can be recovered by any subset $V \subseteq U$ of at least $t$ participants using the Lagrange interpolation basis as follows:

$$s = \sum_{u \in V} s_u \cdot \lambda_u, \quad \text{where } \lambda_u = \prod_{v \in V, v \neq u} \frac{v}{v - u}$$

is computed. Here, we use $u$ to represent the identity of the $u$th participant.
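A minimal sketch of such a $(t, n)$ scheme over a prime field follows; the field size and function names are our own illustrative choices.

```python
# Minimal (t, n) Shamir secret sharing over a prime field (demo sketch).
import random

P = 2**127 - 1  # a Mersenne prime, large enough for demo secrets

def share(secret, t, n):
    # Random polynomial of degree t-1 with constant term = secret.
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    return [(u, sum(c * pow(u, k, P) for k, c in enumerate(coeffs)) % P)
            for u in range(1, n + 1)]

def reconstruct(shares):
    # Lagrange interpolation at x = 0.
    s = 0
    for u, su in shares:
        lam = 1
        for v, _ in shares:
            if v != u:
                lam = lam * v % P * pow(v - u, -1, P) % P
        s = (s + su * lam) % P
    return s

shares = share(123456789, t=3, n=5)
assert reconstruct(shares[:3]) == 123456789   # any 3 of 5 shares suffice
```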

3. System Architecture

In this section, we introduce the system architecture, illustrated in Figure 1. Some frequently used notations of the paper are listed in Table 1.


Table 1: Frequently used notations.

Notation | Description
$E_{pk}(\cdot)$ | The homomorphic encryption with public key $pk$
$(e, d)$ | The public key and private key for RSA
$G, g$ | A cyclic group of order $q$ and its primitive element $g$
$s_j$ | The secret share distributed to participant $P_j$
$x_{ij}$ | The $i$th sample of the $j$th participant
$\theta_j$ | The characteristic weight of the $j$th participant
$H(\cdot)$ | A hash function
$h_\theta(\cdot)$ | Function prediction of logistic regression

3.1. System Model

Our framework involves three types of participating entities: a key generation center (KGC), a center server (CS), and a set of average participants (AP). Details are presented as follows.

Key Generation Center. KGC primarily performs key generation and distribution. Its main purpose is to initialize the system, generate public and private keys for homomorphic encryption, generate subsecrets based on Shamir secret sharing, assign corresponding public and private keys to CS, and distribute subsecrets to each general participant. Afterwards, it will go offline.

Center Server. CS is often the initiator of a federated learning mission. It holds the data labels and coordinates the execution of the entire process. It aggregates the parameters uploaded by all online participants. In MEMP, it calculates the user intersection and returns the common entities. In VFLR, it returns the calculated sample error. Throughout the process, CS should be able to infer nothing beyond the uploaded ciphertexts and the final result.

Average Participants. AP refers to the participants who take part in model training without labels. There are multiple average participants, namely, $P_1, P_2, \ldots, P_n$. In the aggregation framework, they typically encrypt values locally and send them for aggregation.

3.2. Security Model

Based on [19], in our scheme, we assume that the interaction channels between CS and AP are secure and not subject to risks such as tampering, and all participants except KGC are considered to be honest-but-curious. KGC is a trusted party that always performs its tasks honestly and does not collude with any entity. CS and AP honestly follow the agreed process but may try to learn all possible information of interest to them from their received messages. We define a threat model with an honest-but-curious adversary $\mathcal{A}$ who can corrupt at most $t-1$ parties and obtain their inputs or other private information. In the entity matching protocol, what $\mathcal{A}$ wants to learn is the users' information and CS's private key. In the model building and prediction phases, $\mathcal{A}$ makes full use of the information it holds to learn about the data of the other honest parties, including their data characteristics and weights. Our model needs to meet the following security requirements.

Data Privacy. CS cannot recognize any private data uploaded by $P_j$, and the other participants $P_k$ ($k \neq j$) cannot infer the private data of others. For example, the mark matrices and model parameters must not be exposed.

Secure Withdrawal. CS and the remaining participants cannot continue to use the information of the exiting participants for subsequent calculations, and the process of recovering the aggregated value cannot reveal the parameters of the exiting participants if any participant drops out in a round. There should also be a safe way to handle delayed transmissions so that they are not mistaken for withdrawals.

3.3. Problem Description

In order to achieve a GAM that enables aggregated messages to be decrypted only if they come from at least $t$ participants, while ensuring that individual parameter ciphertexts are not exposed, we introduce some cryptographic tools. For example, the transformed Shamir secret sharing scheme helps achieve threshold aggregation and covers the homomorphic ciphertext of each participant, and the homomorphic encryption features facilitate obtaining the product or sum of the parameter plaintexts through ciphertext aggregation. Furthermore, for multiple participants with different characteristics in the vertical federated mode, the common entities among the participants need to be determined. Firstly, we design a MEMP with privacy protection, through which multiple participants obtain their overlapping entity IDs without exposing their respective data. Then, we use these common entities' data with different characteristics to train the learning model while keeping local data private. To achieve these two goals, under our aggregation model, we use RSA blind signatures to generate data identity libraries, record the matching results in a token matrix, and use RSA and Paillier as the homomorphic encryption for the specific functional requirements. In particular, the prediction accuracy of the VFLR realized by the framework is unchanged, and it supports the withdrawal of participants.

4. Construction of PFLF

Our PFLF implements a systematic FL process, including three main functions. Firstly, it can realize multiparty data aggregation without data leakage. Secondly, it can find the common set of entities of multiple participants without revealing useful information; in VFL scenarios, subsequent joint training can only be completed once the common entities are identified. When using the logistic regression algorithm in the VFL scenario, secure data aggregation is necessary after entity matching is completed, so, thirdly, we design the VFLR. In particular, the aggregation in our framework is generic: it is not limited to a specific machine learning algorithm but applies to all threshold-based application scenarios. In this section, we present the details of our GAM and its role in constructing MEMP and VFLR.

4.1. A Novel GAM

A common aggregation model suits application scenarios where the aggregation server CS can decrypt and obtain the desired result through homomorphic encryption only when the messages received come from at least $t$ participants. Through this model, the participants' private data is fully protected while the goal of the interaction is achieved according to the protocol, and when some participants go offline, the aggregated message that does not involve the offline participants' information can be recovered quickly. Here, firstly, based on the mentioned SS scheme, we make the following transformation.

Each user $u \in U$ chooses a random number $r$, where $t = |U|$ is the number of participants who reconstruct the secret and $m$ is the number of samples. For the secret reconstruction formula $s = \sum_{u \in U} \lambda_u s_u$, let us multiply both sides of this equation by the random number, and the equation is transformed into the following form:

$$s \cdot r = \sum_{u \in U} \lambda_u s_u r.$$

We can sum both sides of this equation over the $m$ samples and get

$$s \sum_{i=1}^{m} r_i = \sum_{i=1}^{m} \sum_{u \in U} \lambda_u s_u r_i.$$

When used for threshold encryption, it converts to

$$g^{s r} = \prod_{u \in U} g^{\lambda_u s_u r}.$$

We assume the following scenario: each participant $P_u$ has a message $m_u$ and needs CS to help calculate the aggregate of all messages, but it does not want to disclose $m_u$ to others, nor does it want CS to infer any private data from the messages sent, whatever calculations CS may carry out. Figure 2 shows the workflow. The participants can proceed as follows: each $P_u$ first computes the masked ciphertext $E_{pk}(m_u) \cdot g^{\lambda_u s_u r}$ according to the above equation and sends it to CS; the receiver with the private key can decrypt and obtain the aggregate once each participant makes public the value corresponding to its round randomness, because the secret $s$ appears only in the combined mask $g^{sr}$, which can be removed with the public key value $g^{s}$. Besides, $E_{pk}(\cdot)$ satisfies homomorphic encryption, which refers to multiplicative homomorphism in our entity matching protocol and additive homomorphism in our joint model training. How the model supports users' withdrawal is described in detail in Section 4.3. If the model is used for horizontal federated learning, $m_u$ can be gradients or other important parameters. Later, in the VFLR section, we will focus on logistic regression to explain the framework.
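To make this flow concrete, the toy sketch below follows our reading of the equations above: each of the $t$ online participants multiplies its value by $g^{\lambda_u s_u r} \bmod p$, and the product of the masks equals $g^{sr} = (g^s)^r$, which CS divides out using the public $g^s$ and $r$. The group parameters and the plain values standing in for homomorphic ciphertexts are illustrative assumptions.

```python
# Toy threshold-masking sketch (parameters and mask format are assumptions).
import random

p, q, g = 2879, 1439, 4        # p = 2q + 1; g generates the order-q subgroup
t, n = 3, 5
s = random.randrange(1, q)     # KGC's secret, Shamir-shared below

coeffs = [s] + [random.randrange(q) for _ in range(t - 1)]
share = {u: sum(c * pow(u, k, q) for k, c in enumerate(coeffs)) % q
         for u in range(1, n + 1)}
gs_pub = pow(g, s, p)          # public value corresponding to the secret

U = [1, 3, 5]                  # the t participants online this round
r = random.randrange(1, q)     # public per-round randomness

def lam(u):                    # Lagrange coefficient at x = 0 over GF(q)
    out = 1
    for v in U:
        if v != u:
            out = out * v % q * pow(v - u, -1, q) % q
    return out

# Each participant masks its value m_u with g^(lambda_u * s_u * r) mod p.
msgs = {u: random.randrange(2, p) for u in U}
sent = {u: msgs[u] * pow(g, lam(u) * share[u] * r % q, p) % p for u in U}

# CS: the t masks multiply to g^(s*r) = (g^s)^r, which it divides out.
agg = 1
for c in sent.values():
    agg = agg * c % p
unmasked = agg * pow(pow(gs_pub, r, p), -1, p) % p

expected = 1
for m in msgs.values():
    expected = expected * m % p
assert unmasked == expected
```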

4.2. Secure Multiparty Entity Matching

As shown in Algorithm 1, the secure multiparty entity matching protocol completes the confirmation of the common entities of multiple participants under the premise of protecting privacy. In the protocol, there is a CS with its own data sample identities and a set of average participants with their own data samples $u_{ij}$, where $u_{ij}$ represents the identity of the $i$th sample of the $j$th participant. Note that we briefly write the sending of a message $M$ from $A$ to $B$ as $A \rightarrow B: M$. The protocol workflow is shown in Figure 3. The process of the protocol is as follows.

In the initial parameter setting phase, CS sets the penalty term, the coefficient $\alpha$, and the maximum iteration number $T$ of the model. KGC generates a public-private key pair for the homomorphic encryption used in the later model building and prediction. The algorithm introduces RSA encryption with a blind factor to mask confidential information, so the RSA public-private key pair $(e, d)$ is also generated by KGC. In the following, we omit the modulus for RSA operations. In addition, KGC also generates subshares $e_j$ of the public key $e$ and subshares $s_j$ of the secret $s$ for each $P_j$, based on the identities of the participants, using the share generation algorithm.

In the information exchange phase, each participant $P_j$ chooses random numbers $r_{ij}$, computes the blinded hashes $H(u_{ij}) \cdot r_{ij}^{e}$, and sends them to CS. CS chooses a random number $r_c$ to reflect the randomness of the interaction, uses the private key $d$ for signing, and sends the disturbed signatures back to $P_j$. Disturbing the order of the signatures eliminates the correspondence between requests and responses so that $P_j$ cannot lock onto the IDs corresponding to the intersection of $P_j$ and CS in the comparison phase. Furthermore, CS calculates random identifiers of each of its own entities for the different participants. Then, CS sends these identifiers to the corresponding participants for comparison.

In the comparison phase, each participant $P_j$ eliminates the blind factors $r_{ij}$, opens the signature for each $u_{ij}$ to obtain $H(u_{ij})^{d}$, and uses the hash function to compute $H(H(u_{ij})^{d})$. By comparing these with the identifiers received from CS, $P_j$ builds a matrix $F_j$ over CS's entity list: $F_j[i] = 1$ if the $i$th identifier belongs to $P_j$'s set; if not, $F_j[i] = v_{ij}$, where $v_{ij}$ is a random number and $v_{ij} \neq 1$. In this way, each participant creates a comparison matrix.

In the solution phase, each participant chooses a set of random numbers $a_{ij}$ and encrypts its matrix entrywise with RSA.

After that, each participant sends its encrypted matrix to CS for aggregation. At last, CS aggregates the matrix values of the participants by computing their entrywise product and uses its private key to decrypt, obtaining $\prod_j F_j[i]$. CS can decrypt correctly because it can recover the secret once the shares of at least $t$ participants are aggregated; the specific analysis can be found in Secure Model Building.

In the identification phase, CS finds the corresponding entities by the values of $\prod_j F_j[i]$. Because each identifier comes from CS, a product equal to 1 means that every participant has the corresponding entity $id_i$; otherwise, at least one participant does not have $id_i$. Therefore, CS can find the right entities based on these products. In the end, CS broadcasts the common entity IDs to the other participants for the following model training.
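At the heart of MEMP is an RSA blind signature. The two-party sketch below shows that core exchange between one participant and the server; the toy key size, hashing helpers, and set names are our own, and the full protocol additionally randomizes identifiers and masks and aggregates the comparison matrices of $t$ participants.

```python
# Core RSA blind-signature exchange used for private matching (2-party sketch).
import hashlib
import random
from math import gcd

# Demo RSA key (toy size; real use needs >= 2048-bit moduli).
p_, q_ = 1009, 1013
N = p_ * q_
e = 17
phi = (p_ - 1) * (q_ - 1)
d = pow(e, -1, phi)

H = lambda x: int.from_bytes(hashlib.sha256(str(x).encode()).digest(), "big") % N

server_ids = {"alice", "bob", "carol"}
client_ids = {"bob", "dave"}

# Server publishes hashes of H(id)^d for its own records.
server_tags = {hashlib.sha256(str(pow(H(i), d, N)).encode()).hexdigest()
               for i in server_ids}

# Client blinds each hash with r^e, server signs blindly, client unblinds.
matches = set()
for cid in client_ids:
    r = random.randrange(2, N)
    while gcd(r, N) != 1:
        r = random.randrange(2, N)
    blinded = H(cid) * pow(r, e, N) % N
    signed = pow(blinded, d, N)              # server side: (H(id) * r^e)^d
    unblinded = signed * pow(r, -1, N) % N   # = H(id)^d, blind factor removed
    tag = hashlib.sha256(str(unblinded).encode()).hexdigest()
    if tag in server_tags:
        matches.add(cid)

assert matches == {"bob"}
```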

Input: a central server CS, a set of participants $P_1, \ldots, P_n$, and a trusted party KGC.
Output: common entity IDs.
1: CS sets the parameters for model training (the penalty term, the coefficient $\alpha$, and the maximum iteration number $T$).
2: KGC generates a public-private key pair for homomorphic encryption and a public-private key pair $(e, d)$ for RSA encryption for CS; it also generates the average participants' public-private keys and subshares of the public key $e$ and the secret $s$ based on the identities of the participants. Here, each $P_j$ gets a subshare $e_j$ for $e$ and a subshare $s_j$ for $s$.
3: Each participant $P_j$ chooses random numbers $r_{ij}$, computes $H(u_{ij}) \cdot r_{ij}^{e}$, and sends the results to CS.
4: CS chooses a random number $r_c$, uses the private key $d$ for signing, gets each signed value, and returns them to $P_j$ after disturbing their order.
5: for each participant $P_j$ do
6:  for each entity of CS do
7:   CS computes the randomized identifier of the entity for $P_j$.
8: CS sends the identifiers to the corresponding participants $P_j$.
9: for each participant $P_j$ do
10:  $P_j$ eliminates the blind factors $r_{ij}$ from its signed values, obtains $H(u_{ij})^{d}$, and computes their hash values $H(H(u_{ij})^{d})$.
11: for each participant $P_j$ do
12:  $P_j$ generates its own comparison matrix $F_j$ by determining whether each received identifier belongs to its set.
13:  if it does then
14:   $F_j[i] = 1$.
15:  else
16:   $F_j[i] = v_{ij}$, where $v_{ij}$ is a random number and $v_{ij} \neq 1$.
17: Each participant chooses a set of random numbers $a_{ij}$, encrypts its matrix with RSA, and sends it to CS.
18: CS aggregates the matrix values by computing their entrywise product and obtains $\prod_j F_j[i]$ by decrypting.
19: if $\prod_j F_j[i] = 1$ then
20:  CS finds the corresponding entity $id_i$.
21: CS broadcasts the common IDs to the other participants $P_j$.
22: return the common entity IDs.
4.3. Secure Model Building
4.3.1. Secure Training

For logistic regression to find a better model with the gradient descent method, the part that needs to be computed jointly is the prediction error, using the predicted value and the sample label. It is natural for CS to do the aggregation and this calculation, since the sample labels are held by CS. The workflow is shown in Figure 4. To protect the confidentiality of each participant's data, the aggregated data is received in ciphertext; consequently, we choose Paillier homomorphic encryption for the computation. However, each participant's data must not be decryptable separately by CS. For this reason, we apply the Shamir secret sharing scheme, as in Formula (7), to ensure that CS can decrypt only after the aggregation operation is completed. The detailed process is shown in Algorithm 2.

The number of common entities of all participants is $m$, the average participants in the joint modeling are $P_1, \ldots, P_n$, and each average participant $P_j$ secretly keeps a subsecret $s_j$. Now, given a cyclic group and its primitive element $g$, $P_j$ computes the ciphertext $E_{pk}(\theta_j x_{ij})$, in which $x_{ij}$ is the $i$th sample of the $j$th participant $P_j$. With its subsecret and the identities of the online participants, $P_j$ can compute its own Lagrange coefficient $\lambda_j$. In order to keep the subsecret dynamic, for each sample, $P_j$ adds a random factor or the timestamp to calculate the masked ciphertext $E_{pk}(\theta_j x_{ij}) \cdot g^{\lambda_j s_j r}$ and sends it to CS after the other participants release their public round values. At this point, although CS gets the ciphertext of each participant, it cannot decrypt, because it cannot get the subsecrets $s_j$. Only when messages are received from at least $t$ participants can the aggregation be generated, i.e., the masks multiply to $g^{sr}$, where $s$ is the shared secret. Because $g^{s}$ and $r$ are public to CS, CS can decrypt and get $\sum_j \theta_j x_{ij}$ to compute the error $h_\theta(x_i) - y_i$. In the next step, CS broadcasts the error to the current participants. Each participant and CS can then update the weight parameters by computing $\theta_j := \theta_j - \alpha (h_\theta(x_i) - y_i) x_{ij}$. All steps are repeated until the maximum number of iterations is reached.
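The per-round arithmetic can be sketched with the python-paillier library as follows; the threshold masks of Section 4.1 are omitted for brevity, and the data, feature split, and names are illustrative.

```python
# One VFLR round's arithmetic (masks from Section 4.1 omitted; toy data).
import numpy as np
from phe import paillier

pub, priv = paillier.generate_paillier_keypair(n_length=1024)

# Two feature parties hold disjoint feature columns of the same 4 entities.
X1 = np.array([[0.5, 1.2], [0.1, 0.7], [0.9, 0.3], [0.4, 0.8]])
X2 = np.array([[1.0], [0.2], [0.6], [0.9]])
y = np.array([1, 0, 1, 1])          # labels held by CS
w1, w2, alpha = np.zeros(2), np.zeros(1), 0.1

for _ in range(10):
    # Each party encrypts its partial linear score theta_j . x_ij.
    c1 = [pub.encrypt(float(z)) for z in X1 @ w1]
    c2 = [pub.encrypt(float(z)) for z in X2 @ w2]
    # CS adds the ciphertexts homomorphically and decrypts the joint score.
    z = np.array([priv.decrypt(a + b) for a, b in zip(c1, c2)])
    err = 1.0 / (1.0 + np.exp(-z)) - y      # error broadcast to participants
    # Each party updates its own weights locally from the shared error.
    w1 -= alpha * X1.T @ err / len(y)
    w2 -= alpha * X2.T @ err / len(y)
```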

4.3.2. Withdrawal of Participants

Some participants may withdraw from federated learning, for example, because they are unwilling to contribute to the model or because they drop offline. In order to deal with this, we can make a contract that reduces the occurrence of withdrawals. It is assumed that each average participant is paid a certain amount based on its contribution in each iteration, and CS sets the total expenditure and the maximum number of iterations in advance. The contract signed by all participants is as follows: (1) average participants submit a deposit to CS. (2) In the FL, if CS receives all the messages within the maximum allowable period, the errors are returned normally to all participants according to the protocol. (3) Otherwise, CS sends a withdrawal confirmation request to the participants who did not send a message. If they report a delay, the delayed messages can still be used to compute the aggregated value, but these delayed participants are not paid for that round. (4) Once a participant reports withdrawal, its deposit is distributed to the other online participants, including CS; rational participants generally do not withdraw, in order to maximize their own interests. (5) Upon completion of the FL, the deposits are returned to all the online participants. In particular, if some participants withdraw during training, CS requires each remaining participant to resend its message without the identities of the quitters in order to decrypt, as sketched below. Because of the randomness of the per-round factor, CS cannot compare the twice-sent messages to extract useful data; the reason is that CS still cannot get the subsecrets to reconstruct the polynomial.
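The resend-after-dropout step can be sketched with the same toy parameters as in Section 4.1: the remaining participants (still at least $t$ of them) recompute their Lagrange coefficients over the reduced online set and resend under fresh round randomness, so the two transmissions of the same message are not comparable.

```python
# Dropout handling sketch (toy parameters as in the Section 4.1 sketch).
import random

p, q, g = 2879, 1439, 4
t, s = 3, 1234                                 # threshold and demo secret
coeffs = [s, 77, 250]                          # demo degree-(t-1) polynomial
share = {u: (coeffs[0] + coeffs[1]*u + coeffs[2]*u*u) % q for u in range(1, 6)}

def lam(u, online):                            # Lagrange coefficient over GF(q)
    out = 1
    for v in online:
        if v != u:
            out = out * v % q * pow(v - u, -1, q) % q
    return out

def masked_round(online, msgs):
    r = random.randrange(1, q)                 # fresh randomness each resend
    sent = {u: msgs[u] * pow(g, lam(u, online) * share[u] * r % q, p) % p
            for u in online}
    agg = 1
    for c in sent.values():
        agg = agg * c % p
    return agg * pow(pow(g, s * r % q, p), -1, p) % p

msgs = {u: u + 10 for u in range(1, 6)}
# P_4 drops: CS asks the remaining (>= t) participants to resend without id 4.
assert masked_round([1, 2, 3, 5], msgs) == (11 * 12 * 13 * 15) % p
```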

Input: a central server CS, a set of participants $P_1, \ldots, P_n$, the instance space of samples of each participant, the subshares $s_j$, a cyclic group, and its primitive element $g$.
Output: federated logistic regression model.
1: for each iteration do
2:  for each sample $x_{ij}$ do
3:   $P_j$ computes the ciphertext $E_{pk}(\theta_j x_{ij})$.
4:   $P_j$ chooses a random number and makes public its corresponding round value.
5:   $P_j$ uses the others' public values to compute its masked ciphertext and sends it to CS.
6: for each sample do
7:  if someone exits then
8:   CS eliminates the values involving the quitters' information from the aggregation.
9:  CS performs the aggregation and decrypts to get $\sum_j \theta_j x_{ij}$.
10: CS computes the error $h_\theta(x_i) - y_i$.
11: CS broadcasts the error.
12: Each participant and CS update the weight parameters by computing $\theta_j := \theta_j - \alpha (h_\theta(x_i) - y_i) x_{ij}$.
13: Repeat all steps until the termination condition is reached.
14: return the built model.
4.4. Secure Predicting

Secure prediction should ensure that user data privacy is not compromised and that model parameters are not exposed. As described in Algorithm 3, a results inquirer $Q$ intends to provide a set of data for prediction without privacy leakage, where the data characteristics correspond to the participants in the current model. First, $Q$ encrypts the data piece by piece with its own public key; for example, the part belonging to a characteristic of $P_j$ corresponds to $P_j$. Each participant computes the ciphertext of $\theta_j x_j$ through $Q$'s public key and communicates it to CS. The aggregation operation is still done by CS to protect the parameters from being exposed. Then, CS computes the aggregated ciphertext and returns it to $Q$. The joint value $\sum_j \theta_j x_j$ can be obtained by $Q$ with its private key, and $Q$ computes the predicted result using the sigmoid function.

Input: the federated logistic regression model, a results inquirer $Q$, and its instance space.
Output: predicted results.
1: for each participant (including CS) do
2:  $Q$ encrypts the data belonging to the characteristics of the different participants with its public key.
3:  Each participant computes the ciphertext of $\theta_j x_j$ homomorphically and communicates it to CS.
4: CS performs the aggregation operation and returns the result to $Q$.
5: $Q$ decrypts and gets $\sum_j \theta_j x_j$.
6: $Q$ gets the predicted result by applying the sigmoid function.
7: return the result.
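Under our reading of this flow, the inquirer holds the Paillier key pair, so each participant's step is a plaintext-by-ciphertext multiplication. A sketch with python-paillier (data and names illustrative):

```python
# Encrypted prediction sketch: inquirer Q owns the Paillier key pair.
import numpy as np
from phe import paillier

pub_q, priv_q = paillier.generate_paillier_keypair(n_length=1024)

x1, x2 = np.array([0.5, 1.2]), np.array([0.7])   # Q's features per party
w1, w2 = np.array([0.3, -0.1]), np.array([0.8])  # parties' private weights

# Q encrypts each feature under its own public key.
enc1 = [pub_q.encrypt(float(v)) for v in x1]
enc2 = [pub_q.encrypt(float(v)) for v in x2]

# Each party scales the ciphertexts by its weights (plaintext * ciphertext)
# without ever seeing the features; CS adds up the partial scores.
part1 = sum(c * float(w) for c, w in zip(enc1, w1))
part2 = sum(c * float(w) for c, w in zip(enc2, w2))
total = part1 + part2                             # aggregation done at CS

z = priv_q.decrypt(total)                         # only Q can decrypt
pred = 1.0 / (1.0 + np.exp(-z))
assert abs(z - (x1 @ w1 + x2 @ w2)) < 1e-6
```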

5. Security Analysis

In this section, we prove that our scheme is secure based on a simulator under the honest-but-curious setting. Recall that the involved parties are the participants and the CS. We assume an honest-but-curious adversary $\mathcal{A}$ who can corrupt participants, but at most $t-1$ of them. According to [20], we define the security of our framework by comparing the real interaction and the ideal interaction. In the real interaction, there is an environment $\mathcal{Z}$ which chooses the inputs and receives the outputs of the uncorrupted participants. The adversary, who can interact arbitrarily with the environment, forwards all received messages to $\mathcal{Z}$ and acts as instructed by $\mathcal{Z}$. We let $\mathrm{REAL}$ represent the view of $\mathcal{Z}$ in the real interaction. Similarly, we let $\mathrm{IDEAL}$ represent the view of $\mathcal{Z}$ where the adversary and the honest participants interact with the environment running the dummy protocol in the presence of the ideal functionality $\mathcal{F}$.

Definition 1. A protocol is secure if, for every admissible adversary $\mathcal{A}$ attacking the real interaction, there exists a simulator $\mathcal{S}$ attacking the ideal interaction such that the environment $\mathcal{Z}$ cannot distinguish the ideal view of $\mathcal{S}$ from the real view of $\mathcal{A}$.

5.1. Security of GAM

In the GAM, although CS can collude with at most $t-1$ participants to obtain the privacy of the honest participants, they get nothing but the aggregated result. Since each participant's data is encrypted as a masked homomorphic ciphertext, and based on the security of Shamir secret sharing and homomorphic encryption, only the privacy of the masked inputs is discussed here.

Theorem 2. For the secure aggregation framework, there exist PPT simulators that can simulate the ideal views, which are computationally indistinguishable from the real views of the corresponding adversaries. The ideal functionality $\mathcal{F}$ is illustrated in Table 2, where a subset $C$ represents the joint attackers.


Given is the aggregation function computed with the homomorphic encryption function $E_{pk}(\cdot)$, which outputs the aggregate of the participants' inputs when at least $t$ participants contribute.
$\mathcal{F}$'s operation is as follows:
(1) On input (Input, $m_u$) from $P_u$, record $m_u$ and send (Input, $u$) to the adversary.
(2) On input (Compute, $u$) from $P_u$, choose a random number, compute the masked ciphertext of $m_u$, and send it to CS if $P_u$ is honest. If $P_u$ is corrupted, send the recorded value to the adversary.
(3) On input (Aggregate, $U$), compute the aggregation over the set $U$ and send it to CS.

According to whether CS is involved in collusion, the discussion is divided into two situations.

Case 1. Excluding CS from the corrupted set $C$.

Proof. Since CS is not compromised, the view constructed by the simulator is independent of the input of CS. So the simulator can execute a simulation by asking $\mathcal{Z}$ to generate fake data as the inputs of the honest users, but the true inputs for the corrupted honest-but-curious users. When sending the masked ciphertexts, the simulator uses random numbers instead of the true data. After aggregating and decrypting, CS returns an aggregated result that does not indicate which particular participants' inputs were aggregated. Hence, the ideal view simulated by the simulator is indistinguishable from the real view, since it is impossible to determine whether the result is obtained from real data.☐

Case 2. Including CS in the corrupted set $C$.

Proof. To prove the indistinguishability of the views in Case 2, the simulator gradually makes some modifications to the protocol. In our hybrid argument, there exist hybrids hyb1 and hyb2 that apply secure modifications to the original protocol while ensuring the indistinguishability of the changed protocol from the original one.
hyb1: in this hybrid, the simulator generates the masked inputs for the honest participants using a fresh random exponent instead of the true mask. Because the exponent is a random number, the resulting mask is also a random group element, and the DDH assumption ensures that the two are indistinguishable.
hyb2: in this hybrid, the simulator generates the encrypted inputs by replacing each plaintext with a random number. The security of the encryption algorithm ensures the indistinguishability of the two ciphertexts. Therefore, the simulator submits the randomized ciphertexts instead of the real ones. Accordingly, the simulation is complete, since the simulator successfully simulates the real view without acquiring the plaintexts or the subsecrets, and we can infer that the output of this hybrid is indistinguishable from the real one.☐

5.2. Security of MEMP

In the process of confirming the identities of the common entities, no entity information other than the common identities is available to the participants. Thus, not only must the true identity of an entity not be exposed, but its hash value must not be either, because an honest-but-curious participant could compute the hash value of a candidate identifier to determine whether it belongs to another party's set. It is therefore necessary to cover the confidential information with a random factor, as in a blind signature. We prove the following two theorems by constructing two separate simulators to show that the real views and the ideal views are computationally indistinguishable.

Theorem 3. For group entity matching, there exist PPT simulators that can simulate the ideal views, which are computationally indistinguishable from the real views of the corresponding adversaries.

Let us divide the protocol into two parts: the first part covers the process from the beginning to the establishment of the comparison matrices, and the second part covers the process from the encrypted comparison matrices to the end, which is similar to the previous GAM case. The description of the ideal functionality is shown in Table 3.


$\mathcal{F}$'s operation is as follows:
(1) On input (Init, $u_{ij}$) from $P_j$, generate random numbers for the corresponding identities, store them, and send them to the corrupted adversary.
(2) On input (Init, $id_i$) from CS, generate random numbers for CS's identities.
Once an ID of CS and an ID of $P_j$ are equal, set their random values to be equal.
For a corrupted CS, send the stored values to it. Store them.
(3) On input (Signature, $u_{ij}$) from $P_j$, choose a random number and compute the blinded value.
Send it to CS and store it.
(4) On input (Bsignature, $m$) from CS, compute the blind signature.
Send it to $P_j$, where the blinding factors are random numbers.
(5) On input (Open,