Abstract

The requirement for data sharing and privacy has brought increasing attention to federated learning. However, existing aggregation models are too specialized and rarely address the issue of user withdrawal. Moreover, protocols for multiparty entity matching are rarely covered. Thus, there is no systematic framework for performing federated learning tasks. In this paper, we propose a systematic privacy-preserving federated learning framework (PFLF). We first construct a general secure aggregation model for federated learning scenarios by combining Shamir secret sharing with homomorphic encryption, ensuring that the aggregated value can be decrypted correctly only when the number of participants is greater than the threshold $t$. Furthermore, we propose a multiparty entity matching protocol that employs secure multiparty computation to solve the entity alignment problem, and a logistic regression algorithm that achieves privacy-preserving model training and supports the withdrawal of users in vertical federated learning (VFL) scenarios. Finally, the security analyses prove that PFLF preserves data privacy in the honest-but-curious model, and the experimental evaluations show that PFLF attains accuracy consistent with the original model and demonstrates practical feasibility.

1. Introduction

In 2016, AlphaGo, which used some 300,000 recorded games as training data, beat the world's top professional Go players. Artificial intelligence (AI) has shown great potential and is expected to prove itself in many fields and make important contributions [1]. In traditional AI, data processing requires aggregating a large amount of data for model training. However, due to industry competition, privacy protection requirements, business administration, and other issues, the data of various industries form islands that are difficult to share. Data quality and availability are therefore among the main constraints on AI development [2]. On the other hand, data privacy and security have become the focus of worldwide attention [3] following the devastating losses caused by data leaks in recent years. The European Union recently introduced a new law, the General Data Protection Regulation (GDPR) [4], which shows that increasingly strict management of user data privacy and security will be the global trend. The enactment of such laws and regulations thus brings new challenges to the traditional AI processing mode.

How to solve the problems of data isolation and data fusion while protecting users' privacy has become an urgent task for the development of AI. The federated learning (FL) framework, first proposed by Google in 2016 [5, 6], meets these requirements well. In the FL model, each participant trains the model on its local data and transmits only the model parameters to an aggregation server under cryptographic protection; after parameter aggregation is complete, the server returns the aggregated parameters to each participant for updating [7]. The resulting virtual common model, established by exchanging parameters under encryption, is consistent with the optimal model trained on the aggregated data under the traditional paradigm [8]. Recently, federated learning has become a hot research topic, and much work on privacy-preserving deep learning has been done. In 2019, Yang et al. systematically introduced the federated learning framework, its applications, and research directions [9], which helps us understand federated learning as a whole. The federated learning framework has been applied and extended to Deep Neural Networks (DNN), eXtreme Gradient Boosting (XGBoost), Random Forest (RF), and other algorithms [2, 10, 11], using techniques that include secret sharing [12], differential privacy [13], and homomorphic encryption [14].

At present, privacy-preserving FL still faces the following problems. First, few protocols address entity matching among multiple users [15] while preserving their privacy in concurrent mode. In addition, too much interaction between users is required when encrypted gradients or parameters are passed to the server. Furthermore, existing protocols either assume that all users stay online throughout the training cycle or require the assistance of other participants to recover correct data when one participant goes offline. Finally, existing schemes generalize poorly and are often designed only for specific machine learning algorithms and application scenarios.

To solve the above problems, we propose a novel PFLF with general aggregation and multiparty entity matching. In this framework, we propose a general aggregation model (GAM) that can be used in many applications where aggregation is required and privacy must be protected. Under our GAM, we construct a multiparty entity matching protocol (MEMP), which completes the confirmation of common user data without leaking any disjoint entity information. In addition, we design the vertical federated logistic regression (VFLR) algorithm, which keeps the data in the local database. In summary, our contributions are as follows:
(i) We propose a PFLF that includes the GAM, MEMP, and VFLR to achieve multiscenario data aggregation, multiparty matching, and privacy protection.
(ii) We exploit homomorphic encryption and an improved Shamir secret reconstruction to ensure that only when the aggregator receives messages from at least $t$ participants can it recover the secret and remove the mask to acquire the correct parameters or product. In the GAM, there is little interaction among participants.
(iii) We propose MEMP to confirm common entities based on the GAM and the multiplicative homomorphism of RSA. It enables participants with different data characteristics to determine common entities without leaking or allowing inference of other useful information.
(iv) We design a privacy-preserving VFLR by using Paillier homomorphic encryption and merging the GAM with logistic regression. It trains a secure VFLR model and supports the withdrawal of participants. In addition, the prediction accuracy of the model is not affected.
(v) We give a comprehensive security analysis of our framework. We claim that attackers acquire no useful information even if up to $t-1$ participants collude. Besides, extensive experiments confirm that our framework is effective and efficient.

The rest of the paper is organized as follows. In Section 2, we describe the preliminaries and the main technologies. In Section 3, we describe the system architecture, security model, and problem description. In Section 4, we present the details of our GAM with privacy protection and construct the secure MEMP and VFLR model. In Section 5, we demonstrate the security analysis of the framework. The experimental evaluation and related work are discussed in Sections 6 and 7, respectively. Finally, we conclude the paper in Section 8.

2. Preliminary

2.1. Logistic Regression Algorithm

Consider a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$ of dimension $d$, in which $x_i \in \mathbb{R}^{d}$ and $y_i \in \{0, 1\}$. The predicted value is mapped between 0 and 1 by the sigmoid function [16] $h_{\theta}(x) = \frac{1}{1 + e^{-\theta^{T} x}}$, where $\theta \in \mathbb{R}^{d}$ is the weight vector and $\theta^{T} x$ is the linear predictor. The objective function is defined as follows:

$$J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log h_{\theta}(x_i) + (1 - y_i) \log \left( 1 - h_{\theta}(x_i) \right) \right].$$

The gradient descent method is used to minimize the value of the objective function, and the model parameters are updated as follows:

$$\theta := \theta - \frac{\alpha}{n} \sum_{i=1}^{n} \left( h_{\theta}(x_i) - y_i \right) x_i,$$

where $\alpha$ is the learning rate.

When given a new data point $x$, the predictive value of logistic regression is set to $h_{\theta}(x) = 1 / (1 + e^{-\theta^{T} x})$, thresholded at 0.5 for classification.
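For reference, a minimal Python sketch of this plain (nonfederated) logistic regression, implementing exactly the objective and update rule above (the function names are ours, not the paper's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, alpha=0.1, iters=1000):
    # Batch gradient descent on the cross-entropy objective J(theta)
    theta = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / n
        theta -= alpha * grad
    return theta

def predict(X, theta):
    # Predicted value h(x), thresholded at 0.5
    return (sigmoid(X @ theta) >= 0.5).astype(int)
```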

2.2. Homomorphic Encryption

The Paillier scheme satisfying additive homomorphism [17] is as follows: two large primes $p$ and $q$ of equal length are chosen randomly, and $n = pq$ and $\lambda = \operatorname{lcm}(p-1, q-1)$ are calculated. Given the random number $g \in \mathbb{Z}_{n^2}^{*}$, we have the public key $pk = (n, g)$ and the private key $sk = (\lambda, \mu)$. For the encryption, given a random number $r$ satisfying $\gcd(r, n) = 1$, we have the ciphertext $c = g^{m} r^{n} \bmod n^{2}$, where $m$ is the plaintext. In the decryption phase, $m$ is obtained by computing $m = L(c^{\lambda} \bmod n^{2}) \cdot \mu \bmod n$, where $L(x) = (x-1)/n$ and $\mu = \left( L(g^{\lambda} \bmod n^{2}) \right)^{-1} \bmod n$.

We mainly use the following property of Paillier homomorphic encryption. Additivity can be indicated as $E(m_1) \cdot E(m_2) \bmod n^{2} = E(m_1 + m_2 \bmod n)$.
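This property is easy to check with the open-source python-paillier (phe) library; the library choice is ours and is not part of the scheme:

```python
from phe import paillier  # pip install phe

# Additive homomorphism: the ciphertext product (exposed as "+" in phe)
# decrypts to the plaintext sum; scalar multiplication is also supported.
pk, sk = paillier.generate_paillier_keypair(n_length=512)
c1, c2 = pk.encrypt(17), pk.encrypt(25)
assert sk.decrypt(c1 + c2) == 42
assert sk.decrypt(c1 * 3) == 51
```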

2.3. Secret Sharing

The secret sharing (SS) scheme [18] adopted in our scheme is used to mask the data ciphertexts transmitted by participants while still allowing the aggregator to recover the ciphertext product; it also helps the scheme support the withdrawal of participants. In the SS scheme, the secret $s$ is split into $n$ shares, and $s$ can be recovered only if at least $t$ random shares are provided. The share generation algorithm is illustrated as $SS.\mathrm{Share}(s, t, n) \to \{(u_i, s_i)\}_{i=1}^{n}$, in which $n$ represents the number of participants involved in SS, $\mathcal{U}$ is the set of participants, and $s_i$ is the share for each user $P_i$. The secret can be recovered by at least $t$ participants contained in a subset $U \subseteq \mathcal{U}$ using the Lagrange interpolation basis as follows:

$$s = \sum_{i \in U} \lambda_i s_i,$$

where $\lambda_i = \prod_{j \in U, j \neq i} \frac{u_j}{u_j - u_i}$ is computed. Here, $u_i$ represents the identity of the $i$th participant.
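A compact Python sketch of share generation and Lagrange reconstruction at $x = 0$ over a prime field (the field size and helper names are illustrative choices):

```python
import random

PRIME = 2**127 - 1  # toy share field; a production choice is protocol-specific

def share(secret, t, n, prime=PRIME):
    # f(x) = secret + a1*x + ... + a_{t-1}*x^{t-1}; share i is (u_i, f(u_i))
    coeffs = [secret] + [random.randrange(prime) for _ in range(t - 1)]
    return [(x, sum(c * pow(x, k, prime) for k, c in enumerate(coeffs)) % prime)
            for x in range(1, n + 1)]

def reconstruct(shares, prime=PRIME):
    # Lagrange interpolation at x = 0: s = sum_i lambda_i * s_i
    s = 0
    for xi, yi in shares:
        num = den = 1
        for xj, _ in shares:
            if xj != xi:
                num = num * xj % prime
                den = den * (xj - xi) % prime
        s = (s + yi * num * pow(den, -1, prime)) % prime
    return s

shares = share(secret=123456789, t=3, n=5)
assert reconstruct(shares[:3]) == 123456789  # any 3 of the 5 shares suffice
```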

3. System Architecture

In this section, we introduce the system architecture, illustrated in Figure 1. Some frequently used notations of the paper are listed in Table 1.

3.1. System Model

Our framework involves three types of participating entities: a key generation center (KGC), a center server (CS), and a set of average participants (AP). Details are presented as follows.

Key Generation Center. KGC primarily performs key generation and distribution. Its main purpose is to initialize the system: it generates the public and private keys for homomorphic encryption, generates subsecrets based on Shamir secret sharing, assigns the corresponding public and private keys to CS, and distributes the subsecrets to each average participant. Afterwards, it goes offline.

Center Server. CS is often the initiator of a federated learning mission. It holds the data labels and coordinates the execution of the entire process. It aggregates the parameters uploaded by all online participants. In MEMP, it calculates the user intersection and returns the common entities. In VFLR, it returns the calculated sample errors. Throughout the process, CS should learn nothing beyond the uploaded ciphertexts and the final result.

Average Participants. AP refers to the participants who take part in model training without labels. There are multiple average participants, namely, $P_1, \ldots, P_m$. In the aggregation framework, they typically encrypt their values locally and send them for aggregation.

3.2. Security Model

Based on [19], we assume in our scheme that the interaction channels between CS and AP are secure and not subject to risks such as tampering, and that all participants except KGC are honest-but-curious. KGC is a trusted party that always performs its tasks honestly and does not collude with any entity. CS and AP honestly follow the agreed process but may try to learn all possible information of interest to them from the messages they receive. We define a threat model with an honest-but-curious adversary $\mathcal{A}$ who can corrupt at most $t-1$ parties and obtain their inputs or other private information. In the entity matching protocol, what $\mathcal{A}$ wants to learn is the users' information and CS's private key. In the model building and prediction phases, $\mathcal{A}$ makes full use of the information it holds to learn about the data of other honest parties, including data characteristics and weights. Our model needs to meet the following security requirements.

Data Privacy. CS cannot recognize any private data uploaded by $P_i$, and other participants $P_j$ ($j \neq i$) cannot infer the private data of others. For example, the token matrix and model parameters must not be exposed.

Secure Withdrawal. If any participant drops out in a round, CS and the remaining participants cannot continue to use the information of the exiting participant for subsequent calculations, and the process of recovering the aggregated value cannot reveal the parameters of the exiting participant. There should also be a safe way to deal with delayed transmissions so that they are not mistaken for withdrawals.

3.3. Problem Description

In order to achieve a GAM that enables aggregated messages to be decrypted only if they come from at least $t$ participants, while ensuring that individual parameter ciphertexts are not exposed, we introduce some cryptographic tools. For example, a transformed Shamir secret sharing scheme helps achieve threshold aggregation and covers the homomorphic ciphertext of each participant, and homomorphic encryption makes it possible to obtain the product or sum of parameter plaintexts through ciphertext aggregation. Furthermore, in the vertical federated mode, the common entities shared by participants holding different features need to be determined. First, we design a MEMP with privacy protection, through which multiple participants obtain their overlapping entity IDs without exposing their respective data. Then, we use these common entities' data with different characteristics to train the learning model while ensuring local data privacy. To achieve these two goals, under our aggregation model, we use RSA blind signatures to generate data identity libraries, record the matching results in a token matrix, and use RSA and Paillier as the homomorphic encryption for the specific functional requirements. In particular, the prediction accuracy of the VFLR realized by the framework is unchanged, and it supports the withdrawal of participants.

4. Construction of PFLF

Our PFLF implements a systematic FL process, including three main functions. Firstly, it realizes multiparty data aggregation without data leakage. Secondly, it finds the common set of entities of multiple participants without revealing useful information; in VFL scenarios, subsequent joint training can only be completed once the common entities are identified. Thirdly, since secure data aggregation is needed after entity matching when using logistic regression in the VFL scenario, we design the VFLR. In particular, the aggregation in our framework is generic, applying not only to a specific machine learning algorithm but to all threshold-based application scenarios. In this section, we present the details of our GAM and its role in constructing MEMP and VFLR.

4.1. A Novel GAM

The general aggregation model suits application scenarios in which the aggregation server CS can decrypt and obtain the desired result through homomorphic encryption only when the messages received come from at least $t$ participants. Through this model, the participants' private data is fully protected while the interaction achieves its purpose according to the protocol, and when some participants are offline, the aggregated messages that do not involve the offline information can be recovered quickly. Based on the SS scheme above, we first make the following transformation.

Each user chooses the random number $r_j$ for the $j$th sample (e.g., derived from a common timestamp), where $t$ is the number of participants who reconstruct the secret and $l$ is the number of samples. For the secret reconstruction formula $s = \sum_{i=1}^{t} \lambda_i s_i$, let us multiply both sides of this equation by the random number, and the equation is transformed into the following form:

$$r_j s = \sum_{i=1}^{t} r_j \lambda_i s_i.$$

We can sum both sides of this equation over all samples and get

$$\sum_{j=1}^{l} r_j s = \sum_{j=1}^{l} \sum_{i=1}^{t} r_j \lambda_i s_i.$$

When used for threshold encryption, it converts to

$$\prod_{i=1}^{t} E(m_i)\, g^{r_j \lambda_i s_i} = \left( \prod_{i=1}^{t} E(m_i) \right) g^{r_j s}.$$

We assume a scenario in which each participant $P_i$ holds a message $m_i$ and needs CS to help calculate the aggregate of all messages, but does not want to disclose $m_i$ to others and also does not want CS to infer any private data from the messages it sends. Figure 2 shows the workflow. The participants can proceed as follows: $P_i$ first computes the masked ciphertext $c_i = E(m_i)\, g^{r_j \lambda_i s_i}$ and sends it to CS according to the above equation; the receiver holding the private key can decrypt and obtain the aggregate once each participant makes public the value corresponding to its randomness, because the secret $s$ and the public key $y = g^{s}$ exist. Besides, $E(\cdot)$ is a homomorphic encryption scheme, which refers to multiplicative homomorphism (RSA) in our entity matching protocol and additive homomorphism (Paillier) in our joint model training. How the model supports users' withdrawal is described in detail in Section 4.3. If the model is used for horizontal federated learning, $m_i$ can be gradients or other important parameters. Later, in the VFLR section, we will use logistic regression to explain the framework in detail.
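The threshold behavior of the masks can be illustrated with a toy Python sketch (insecure toy parameters; the group choice and names are ours): the Lagrange-weighted masks $g^{\lambda_i s_i}$ of any $t$ online participants multiply out to the public value $y = g^{s}$, so CS can strip the combined mask from an aggregated ciphertext, while fewer than $t$ masks stay random-looking.

```python
import random

# Toy parameters (INSECURE, illustration only): p = 2q + 1 is a safe prime,
# G generates the order-q subgroup, and shares live in Z_q.
P, Q = 2879, 1439
G = pow(2, 2, P)

def share(secret, t, ids):
    coeffs = [secret] + [random.randrange(Q) for _ in range(t - 1)]
    return {x: sum(c * pow(x, k, Q) for k, c in enumerate(coeffs)) % Q
            for x in ids}

def lagrange_at_zero(ids):
    lams = {}
    for xi in ids:
        num = den = 1
        for xj in ids:
            if xj != xi:
                num = num * xj % Q
                den = den * (xj - xi) % Q
        lams[xi] = num * pow(den, -1, Q) % Q
    return lams

# KGC: secret s, public key y = g^s, and subsecrets for participants 1..5
s = random.randrange(1, Q)
y = pow(G, s, P)
subsecrets = share(s, t=3, ids=[1, 2, 3, 4, 5])

# Any t online participants (say 1, 3, 4) compute masks g^(lambda_i * s_i)
online = [1, 3, 4]
lams = lagrange_at_zero(online)
masks = [pow(G, lams[i] * subsecrets[i] % Q, P) for i in online]

# CS: the product of t masks collapses to the public y, so it can be
# divided out of an aggregated ciphertext; any t-1 masks remain random.
agg = 1
for m in masks:
    agg = agg * m % P
assert agg == y
```

Dropout handling falls out of the same property: if a participant quits, the remaining online set recomputes its Lagrange coefficients over the new set and resends, exactly as described in Section 4.3.2.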

4.2. Secure Multiparty Entity Matching

As shown in Algorithm 1, the secure multiparty entity matching protocol completes the confirmation of the common entities of multiple participants under the premise of protecting privacy. In the protocol, there is a CS with its own data sample identities and a set of average participants with their own data sample identities $ID_{ij}$, where $ID_{ij}$ represents the identity of the $j$th sample of the $i$th participant. Note that we briefly denote sending a message from $A$ to $B$ as $A \to B$. The protocol workflow is shown in Figure 3, and the process is as follows.

In the initial parameter setting phase, CS sets the penalty term, coefficient, and maximum iteration number of the model. KGC generates a public-private key pair for the homomorphic encryption used in the later model building and prediction. The algorithm introduces RSA encryption with a blind factor to mask confidential information, so the RSA public-private key pair $(e, d)$ is also generated by KGC; in the following, we omit the RSA modulus. In addition, KGC also generates the subshares $s_i$ of the secret for each participant $P_i$ based on its identity $u_i$.

In the exchange-of-information phase, each participant $P_i$ chooses a random number $r_i$, computes the blinded value $H(ID_{ij}) \cdot r_i^{e}$, and sends it to CS. CS chooses a random number to reflect the randomness of the interaction, signs the blinded values with its private key $d$, and sends the disturbed results back to $P_i$. Disturbing the order of the signatures eliminates the correspondence between the signatures and the original IDs, so that $P_i$ cannot lock onto the IDs corresponding to the intersection of $P_i$ and CS in the comparison phase. Furthermore, CS calculates a random identifier of each of its entities for the different participants and sends these identifiers to the corresponding participants for comparison.

In the comparison phase, each participant eliminates the blind factor $r_i$ to open its signatures, obtaining $H(ID_{ij})^{d}$, and applies the hash function to compute $H(H(ID_{ij})^{d})$. By comparing these with the identifiers received from CS, each participant $P_i$ obtains a token matrix $M_i$ with one entry per CS entity: $M_i[j] = 1$ if the $j$th identifier belongs to $P_i$'s set; if not, $M_i[j] = b_{ij}$, where $b_{ij}$ is a random number and $b_{ij} \neq 1$. In this way, each participant creates a comparison matrix.

In the solution phase, each participant chooses a set of random numbers and encrypts its matrix entries with RSA under CS's public key.

Then, each participant sends its encrypted matrix to CS for aggregation. Finally, CS aggregates the matrix values of at least $t$ participants by multiplying the ciphertexts and uses its private key to decrypt, obtaining the entrywise product $\prod_i M_i[j]$, because CS can remove the combined mask using the public value $g^{s}$. The specific analysis can be found in Secure Model Building (Section 4.3).

In the identification phase, CS finds the common entities by the values of $\prod_i M_i[j]$. Since each compared identifier comes from CS, a product of 1 means that every participant holds the corresponding entity, and any other value means that at least one participant does not. Therefore, CS can identify the common entities from the decrypted products. In the end, CS broadcasts the common entity IDs to the other participants for the following model training.

Input: a central server CS, a set of participants $P_1, \ldots, P_m$, and a trusted party KGC.
Output: common entity IDs.
1: CS sets the parameters for model training.
2: KGC generates a public-private key pair for homomorphic encryption and an RSA public-private key pair $(e, d)$ for CS, and also generates the average participants' public-private keys and the subshares $s_i$ of the secret based on the identities of the participants.
3: Each participant $P_i$ chooses a random number $r_i$, computes the blinded hashes $H(ID_{ij}) \cdot r_i^{e}$ of its IDs, and sends them to CS.
4: CS chooses a random number, uses the private key $d$ to sign each blinded value, and returns the signatures to $P_i$ after disturbing their order.
5: for each entity of CS do
6:  for each participant $P_i$ do
7:   CS computes a random identifier for the entity.
8: CS sends the identifiers to the corresponding participants $P_i$.
9: for each participant $P_i$ do
10:  $P_i$ eliminates the blind factor $r_i$, obtains the opened signatures $H(ID_{ij})^{d}$, and computes their hash values.
11:  for each received identifier do
12:   $P_i$ builds its own token matrix $M_i$ by determining whether the identifier belongs to its set.
13:   if it does then
14:    $M_i[j] = 1$.
15:   else
16:    $M_i[j] = b_{ij}$, where $b_{ij}$ is a random number and $b_{ij} \neq 1$.
17: Each participant chooses a set of random numbers, encrypts its matrix under CS's RSA public key, and sends it to CS.
18: CS aggregates the matrix values by multiplying the ciphertexts and obtains $\prod_i M_i[j]$ by decrypting.
19: if $\prod_i M_i[j] = 1$ then
20:  CS records the corresponding entity as common.
21: CS broadcasts the common entity IDs to the other participants.
22: return common entity IDs.
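The blind-signature comparison at the heart of steps 3, 4, and 10 can be illustrated with a toy Python sketch (insecure toy primes; the helper names and message formats are our assumptions, and the masked token-matrix aggregation of steps 12 to 18 is omitted):

```python
import hashlib
import math
import random

# Toy RSA key (INSECURE parameters, illustration only)
p, q = 1009, 1013
n, e = p * q, 65537
d = pow(e, -1, (p - 1) * (q - 1))

def h(x: str) -> int:
    return int.from_bytes(hashlib.sha256(x.encode()).digest(), "big") % n

def token(sig: int) -> bytes:
    return hashlib.sha256(str(sig).encode()).digest()

ids_p = ["alice", "bob", "carol"]   # a participant's entity IDs
ids_cs = ["bob", "dave", "alice"]   # CS's entity IDs

# Participant blinds H(id) with r^e, so CS signs without seeing H(id)
while True:
    r = random.randrange(2, n)
    if math.gcd(r, n) == 1:
        break
blinded = [h(i) * pow(r, e, n) % n for i in ids_p]

# CS signs blindly: (H(id) * r^e)^d = H(id)^d * r (mod n)
signed = [pow(b, d, n) for b in blinded]

# Participant unblinds and derives comparable tokens
r_inv = pow(r, -1, n)
tok_p = {token(s_ * r_inv % n) for s_ in signed}

# CS derives tokens for its own IDs directly with the signing key
tok_cs = {token(pow(h(i), d, n)): i for i in ids_cs}

print([i for t_, i in tok_cs.items() if t_ in tok_p])  # ['bob', 'alice']
```

In the full protocol, raw tokens are never compared in the clear on CS's side; the comparison results are folded into the masked token matrix so that only the aggregate of at least $t$ participants reveals the intersection.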
4.3. Secure Model Building
4.3.1. Secure Training

For logistic regression with the gradient descent method, the part that needs to be computed jointly is the sample error, which uses the predicted value and the sample label. It is natural for CS to do this aggregation and calculation, since the sample labels are held by CS. The workflow is shown in Figure 4. To protect the confidentiality of each participant's data, the aggregated data is received in ciphertext; consequently, we choose Paillier homomorphic encryption for the computation. However, each participant's data must not be decryptable separately by CS; for this reason, we apply the Shamir secret sharing scheme as in Formula (7) to ensure that CS can decrypt only after the aggregation operation is completed. The detailed process is shown in Algorithm 2.

The number of common entities of all participants is $l$, the average participants in the joint modeling are $P_1, \ldots, P_m$, and each average participant secretly keeps a subsecret $s_i$. Now, given a cyclic group and its primitive root $g$, $P_i$ computes the partial linear predictor $u_{ij} = \theta_i^{T} x_{ij}$, in which $x_{ij}$ is the feature slice of the $j$th sample held by the $i$th participant $P_i$. With its subsecret and the identities of the online set, $P_i$ can compute its own Lagrange coefficient $\lambda_i$. In order to keep the subsecret dynamic, for each sample, $P_i$ adds a random factor or a timestamp to calculate the masked ciphertext $c_{ij} = E(u_{ij})\, g^{r_j \lambda_i s_i}$ and sends it to CS after the other participants release their public random values. At this point, although CS gets the ciphertext of each participant, it cannot decrypt it because it cannot get the subsecret $s_i$. Only when messages are received from at least $t$ participants can the aggregation be generated, i.e., the combined mask becomes $g^{r_j s}$. Because $g$ and $g^{s}$ are public to CS, CS can remove the mask, decrypt, and get $\sum_i u_{ij}$ to compute the error $e_j = h(\sum_i u_{ij}) - y_j$, where $h$ is the sigmoid function. In the next step, CS broadcasts the errors to the current participants. Each participant and CS can then update their weight parameters by computing $\theta_i := \theta_i - \alpha e_j x_{ij}$. All steps are repeated until the maximum number of iterations is reached.
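One training round of this aggregation is easy to prototype with python-paillier; the sketch below shows only the homomorphic part for a single sample (the threshold masks of Section 4.1 and the subsecrets are omitted, and all names are ours):

```python
import numpy as np
from phe import paillier

# One VFLR round for a single sample: three feature holders plus CS.
pk, sk = paillier.generate_paillier_keypair(n_length=512)  # CS's keypair

rng = np.random.default_rng(0)
x_slices = [rng.normal(size=20) for _ in range(3)]  # vertical feature slices
thetas = [np.zeros(20) for _ in range(3)]           # local weight slices
y, alpha = 1.0, 0.1                                 # label held by CS

# Each participant encrypts its partial predictor u_i = theta_i . x_i
cts = [pk.encrypt(float(th @ x)) for th, x in zip(thetas, x_slices)]

# CS aggregates homomorphically, decrypts only the sum, broadcasts the error
z = sk.decrypt(sum(cts, pk.encrypt(0.0)))
err = 1.0 / (1.0 + np.exp(-z)) - y

# Each participant updates its slice locally with the broadcast error
thetas = [th - alpha * err * x for th, x in zip(thetas, x_slices)]
```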

4.3.2. Withdrawal of Participants

Some participants may withdraw from the federated learning, for example, by being unwilling to contribute to the model or by dropping offline. In order to deal with these situations, we can make a contract to reduce the occurrence of withdrawal. It is assumed that each average participant will be paid a certain amount based on its contribution in each iteration; the total expenditure and the maximum number of iterations are set by CS. The contract signed by all participants is as follows: (1) average participants submit a deposit to CS. (2) In the FL, if CS receives all the messages within the maximum allowable period, the errors are returned normally to all participants according to the protocol. (3) Otherwise, CS sends a withdrawal confirmation request to the participants who did not send a message. If they report a delay, the delayed messages can still be used to compute the aggregated value, but these delayed participants will not get paid in this round. (4) Once a participant reports withdrawal, its deposit is distributed among the other online participants, including CS; rational participants therefore generally do not withdraw, in order to maximize their own interests. (5) Upon completion of the FL, the deposits are returned to all the online participants. In particular, if some participants withdraw during training, CS requires each remaining participant to resend its message, computed without the identities of the quitters, in order to decrypt. Because of the randomness of the masks, CS cannot compare the twice-sent messages to extract useful data, since it still cannot get the subsecrets needed to reconstruct the polynomial.

Input: a central server CS, a set of participants $P_1, \ldots, P_m$, the instance space of samples of each participant, the subshares $s_i$, a cyclic group, and its primitive root $g$.
Output: federated logistic regression model.
1: for each iteration do
2:  for each sample $x_j$ do
3:   $P_i$ computes $u_{ij} = \theta_i^{T} x_{ij}$.
4:   $P_i$ chooses a random number and makes public its corresponding public value.
5:   $P_i$ uses the others' public values to compute the masked ciphertext $c_{ij}$ and sends it to CS.
6: for each sample do
7:  if someone exits then
8:   CS eliminates the values involving the quitters' information in the aggregation.
9:  CS performs the aggregation and decrypts to get $\sum_i u_{ij}$.
10: CS computes the error $e_j$.
11: CS broadcasts $e_j$.
12: Each participant and CS update their weight parameters by computing $\theta_i := \theta_i - \alpha e_j x_{ij}$.
13: Repeat until the termination condition is reached.
14: return the built model.
4.4. Secure Predicting

Secure prediction should ensure that user data privacy is not compromised and that model parameters are not exposed. As described in Algorithm 3, a results inquirer $Q$ intends to provide a set of data for prediction without privacy leakage, where the data characteristics correspond to all participants in the current model. First, $Q$ encrypts each feature slice with its own public key; for example, the slice belonging to a characteristic held by $P_i$ is sent to $P_i$. Each participant computes the encrypted partial predictor $E(\theta_i^{T} x_i)$ through $Q$'s public key and communicates it to CS. The aggregation is still done by CS to protect the parameters from being exposed. Then, CS computes the aggregated ciphertext and returns it to $Q$. The joint value can be obtained by $Q$ with its private key, and $Q$ computes the predicted result using the sigmoid function.

Input: federated logistic regression model, a results inquirer $Q$, and its instance space.
Output: predicted results.
1: for each participant (including CS) do
2:  $Q$ encrypts the data belonging to the characteristics of that participant and sends it over.
3:  Each participant computes its encrypted partial predictor and communicates it to CS.
4: CS performs the aggregate operation and returns the result to $Q$.
5: $Q$ decrypts and gets the joint value.
6: $Q$ gets the predicted result by applying the sigmoid function.
7: return the result.
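A sketch of this flow with python-paillier, exploiting the fact that participants only need ciphertext-times-plaintext-scalar operations (the names and sizes are our assumptions):

```python
import numpy as np
from phe import paillier

# The inquirer Q holds the keypair, so only Q can read the joint score.
pk_q, sk_q = paillier.generate_paillier_keypair(n_length=512)

rng = np.random.default_rng(1)
x_slices = [rng.normal(size=4) for _ in range(3)]  # Q's features, per holder
thetas = [rng.normal(size=4) for _ in range(3)]    # participants' weights

# Q encrypts each feature slice and sends it to the matching participant
enc_slices = [[pk_q.encrypt(float(v)) for v in xs] for xs in x_slices]

# Each participant computes Enc(theta_i . x_i) using only scalar operations,
# so neither its weights nor Q's plaintext features are revealed to it
partials = [sum(c * float(w) for c, w in zip(enc, th))
            for enc, th in zip(enc_slices, thetas)]

# CS aggregates the ciphertexts; Q decrypts and applies the sigmoid
agg = partials[0] + partials[1] + partials[2]
pred = 1.0 / (1.0 + np.exp(-sk_q.decrypt(agg)))
```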

5. Security Analysis

In this section, we prove that our scheme is secure based on a simulator under the honest-but-curious setting. Recall that the involved parties are the participants and the CS. We assume an honest-but-curious adversary $\mathcal{A}$ who can corrupt participants, but at most $t-1$ of them. Following [20], we define the security of our framework by comparing the real interaction and the ideal interaction. In the real interaction, there is an environment $\mathcal{Z}$ which chooses the inputs and receives the outputs of the uncorrupted participants. The adversary, who can interact arbitrarily with the environment, forwards all received messages to $\mathcal{Z}$ and acts as instructed by $\mathcal{Z}$. We let REAL denote the view of the real interaction. Similarly, we let IDEAL denote the view of the ideal interaction, in which the adversary and honest participants interact with the environment running a dummy protocol in the presence of the ideal functionality.

Definition 1. A protocol is secure if, for every admissible adversary $\mathcal{A}$ attacking the real interaction, there exists a simulator $\mathcal{S}$ attacking the ideal interaction such that the environment cannot distinguish between the ideal view of $\mathcal{S}$ and the real view of $\mathcal{A}$.

5.1. Security of GAM

In the GAM, although CS can collude with at most $t-1$ participants to obtain the privacy of honest participants, they get nothing but the aggregated results. Since each participant's data is encrypted and masked as $c_i = E(m_i)\, g^{r_j \lambda_i s_i}$, and based on the security of Shamir secret sharing and homomorphic encryption, only the privacy of the masked ciphertexts is discussed here.

Theorem 2. For the secure aggregation framework, there exists a PPT simulator $\mathcal{S}$ that can simulate an ideal view which is computationally indistinguishable from the real view of the corrupted participants or CS. The ideal functionality is illustrated in Table 2, where the subset $C$ represents the joint attackers.

According to whether CS is involved in collusion, the discussion is divided into two situations.

Case 1. Excluding CS from the corrupted set $C$.

Proof. Since CS is not compromised, the view constructed by the simulator is independent of the input of CS. So the simulator can execute the simulation by asking the environment to generate fake data as the inputs of the honest users, while using the true inputs for the honest-but-curious users. When sending the masked ciphertexts, the simulator utilizes random numbers instead of the true data. After aggregating and decrypting, CS returns an aggregated result that does not indicate which particular participants' values were aggregated. Hence, the ideal view simulated by $\mathcal{S}$ is indistinguishable from the real view, since it is impossible to determine whether the result is obtained from real data.☐

Case 2. Including CS in the corrupted set $C$.

Proof. To prove the indistinguishability of the views in Case 2, the simulator gradually makes some improvements to the protocol. In our hybrid argument, there exist hybrids hyb1 and hyb2 that imply secure modifications to the original protocol, ensuring the indistinguishability of the changed protocol from the original protocol.
hyb1: in this hybrid, the simulator generates the masked inputs of the honest participants using uniformly random group elements instead of the true masked values. Because the masking exponent is random, the true masked value is also a random-looking group element, and the DDH assumption ensures that the two are indistinguishable.
hyb2: in this hybrid, the simulator generates the ciphertext by replacing the plaintext with a random number. The security of the encryption algorithm ensures the indistinguishability of the two ciphertexts, so the simulator submits the simulated ciphertext instead of the real one. Accordingly, the simulation is complete, since $\mathcal{S}$ successfully simulates the real view without acquiring the plaintexts or the subsecrets, and we can infer that the output of this hybrid is indistinguishable from the real one.☐

5.2. Security of MEMP

In the process of confirming the identities of the common entities, no entity information other than the common identities is available to the participants. Thus, not only must the true identity of an entity not be exposed, but its hash value must not be exposed either: an honest-but-curious participant could otherwise hash a guessed identifier to determine whether it belongs to another party's set. It is therefore necessary to cover the confidential information with a random factor, as in a blind signature. We prove the following theorem by constructing two separate simulators to show that the real views and the ideal views are computationally indistinguishable.

Theorem 3. For group entity matching, there exists a PPT simulator that can simulate an ideal view which is computationally indistinguishable from the real view of the corrupted participants or CS.

Let us divide the protocol into two parts, $\Pi_1$ and $\Pi_2$. $\Pi_1$ is the process from the beginning to the establishment of the comparison matrix, and $\Pi_2$ is the process from the encrypted comparison matrix to the end, similar to the previous aggregation. The ideal functionality of $\Pi_1$ is shown in Table 3.

Here, we only complete the proof that the ideal view of $\Pi_1$ is computationally indistinguishable from the real view of the corrupted participants or CS. There are again two cases to prove.

Case 1 (excluding CS from the corrupted set). Just consider the average participants that are compromised.

Proof. Suppose a group of participants is corrupted at the beginning. Because the views of the corrupted participants are independent of the inputs of the honest participants, and the values of the honest participants are meaningless to the corrupted participants under the real protocol, the simulator can run the protocol with the true inputs of the corrupted participants while using dummy data as the inputs of the honest participants; namely, it asks the environment to generate random numbers as the inputs of the honest participants. After the blind signature is executed, the value returned from CS incorporates CS's random number, so the corrupted participants cannot distinguish whether the values are generated from real data. Then, a token matrix is generated by comparing the dummy data with the set of comparative values generated by CS. Since each participant's entity data never leaves the local side, the generated matrix is derived from random numbers. Hence, the simulated token matrix is indistinguishable from one produced using true data for comparison.☐

Case 2 (including CS in the corrupted set). Namely, consider the corrupted CS and participants.

Proof. For the corrupted CS and participants, denote the views of CS and of the participants as $V_{CS}$ and $V_{P}$, respectively. Based on the process of the MEMP, the elements of $V_{CS}$ and $V_{P}$ consist of blinded hashes, disturbed signatures, and masked matrix entries, each of which involves a fresh random factor. It can be found that the elements belonging to $V_{CS}$ and $V_{P}$ can therefore be treated as random values. Hence, we can infer that $V_{CS}$ and $V_{P}$ are simulatable, and the simulated views cannot be distinguished computationally by the adversary.

The proof of the second part, $\Pi_2$, is similar to that of Theorem 2, so we do not go into details.

5.3. Security of VFLR

In the model training, CS needs to receive messages from at least $t$ participants to decrypt the correct value, so the data of an individual participant cannot be retrieved, and what CS retrieves can only be aggregated values that reveal nothing else, owing to the combination of homomorphic encryption and the threshold method. The following theorem shows the security of the VFLR model.

Theorem 4. For the secure VFLR model, there exists a simulator that can simulate an ideal view which is computationally indistinguishable from the real view of the corrupted participants or CS.

Our VFLR invokes the previous aggregation framework, and its security rests on the proof of Theorem 2. Since Theorem 2 has been proved, here we only give a brief argument for VFLR.

Proof. Similar to the proof of Theorem 2, for a group of corrupted participants, the simulator can run them on their local data while simulating the honest participants with dummy data. Therefore, the simulator asks the environment to generate random values to replace the masked inputs of the honest participants. In the model building, what the honest-but-curious participants get is the errors, rather than information that reflects real data, and they cannot identify whether the aggregated values used to calculate the errors are based on real data. Therefore, the simulated view is indistinguishable from the real one.☐

Considering a corrupted CS, the simulator generates dummy labels, with which it computes the errors after obtaining the aggregation. Because the local data stay with the honest participants, the outputs cannot reveal their specific information. Therefore, there exists a PPT simulator that can simulate an ideal view which is computationally indistinguishable from the real view of CS.

6. Performance Evaluation

In this section, we present the effectiveness and efficiency of the framework through experiments.

6.1. Experiment Configuration

Four clients $P_1$, $P_2$, $P_3$, and CS are built to simulate the feasibility and performance, in which $P_1$, $P_2$, and $P_3$ refer to the average participants and CS is the aggregator with the sample labels. We carry out our experiments on a device with an i7-6700 CPU at 3.2 GHz and 24 GB of memory. The programs in the experiments are implemented in Python, and the key lengths are set to 512 bits for RSA and Paillier. We adopt a finite field and the standard Shamir secret sharing to generate the shares of the secret.

In the MEMP experiments, we generated different random values between 52252800000000000000 and 52252800000000000550 as the identifications of user samples for $P_1$, $P_2$, $P_3$, and CS, with the size of the intersection set to 60, 120, 180, and 240, respectively. In the VFLR experiments, we selected 300 digits from the handwritten digits dataset (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html), which contains 64 features. $P_1$, $P_2$, and $P_3$ each hold 20 features of these samples, while CS holds 4 features and 1 label. Since the dataset is multiclass, we set digits 0 to 4 as category 1 and digits 5 to 9 as category 2.

6.2. Performance Analysis of the MEMP

For the MEMP, in the simulations, we measured the execution times of the participants $P_i$ and of the CS while increasing the amount of data from 100 to 500. Note that the time to initialize the system is ignored in all experiments, and $|D_i|$ represents the number of samples held by $P_i$, and likewise for CS. Each $P_i$ first calculates the intermediate values for comparison through CS, whose computation is related to the amount of its own data, while the size of the resulting comparison matrix is related to the amount of CS's data. When the number of sample identifications of $P_1$, $P_2$, and $P_3$ increases, the increase in computation time of $P_1$, $P_2$, or $P_3$ mainly comes from generating the sample identification library, and the most time-consuming part for CS is the blind signature. Figure 5 shows the calculation times of the participants and CS when the number of samples in CS is 100 and the samples of $P_1$, $P_2$, and $P_3$ vary from 100 to 500. Figures 6 and 7 show the trend of the running times of the participants and CS as the amount of CS's data increases while the amounts of $P_1$, $P_2$, and $P_3$ are all 100. As the sample amount of CS increases, the encryption time of each participant grows in proportion to CS's data volume. Because the time of aggregation and decryption is related to CS's own data volume, its identification volume affects the size of the identification matrix and thereby the aggregation volume, resulting in a linear increase in the time of aggregation and decryption. However, changes in the size of the intersection of $P_1$, $P_2$, $P_3$, and CS do not affect the respective calculation times, which are related only to the amounts of data of all parties, as shown in Figures 8 and 9.

6.3. Performance Analysis of the VFLR

In the VFLR, no approximation algorithm is applied, so the updated parameters in our VFLR are exactly the same as those in traditional logistic regression, and the training accuracy is likewise consistent. The main observation here is the running time of the VFLR. The extra cost in training comes mainly from the power exponentiations and the homomorphic operations. In the training phase, the time of a complete encryption and the time of an aggregation and decryption for the participants both grow linearly with the sample size. We selected 400 samples and set the number of features at each end to 20. With data amounts of 100, 200, 300, and 400, we captured the time for a single end to complete one iteration and the time for CS to complete aggregation and decryption (see Figures 10 and 11). As the figures show, when we gradually adjust the data volume from 100 to 400, the running time increases approximately linearly.

6.4. Communication Overhead

Since the establishment of the user identification library and the generation of the comparison matrix can be done offline in the MEMP, we discuss the communication overhead of the MEMP from the perspective of the proposed aggregation model; the same holds for the VFLR, because its execution process is completely consistent with the GAM. Let the length of each participant's ciphertext be $l_c$. In the MEMP, only one round is needed to complete the comparison and identify the common entities. After CS sends the entity masks, the communication load of each participant is $l_c$ times the number of CS's entities; for all participants to complete the comparison, CS needs to send the corresponding entity masks to every participant, so the communication load of CS further scales with the number of participants $m$. In the VFLR, we consider the communication load of one iteration: each participant sends one ciphertext of size $l_c$ per sample, and CS just needs to return the errors of the samples. In this way, the total communication of both MEMP and VFLR grows linearly as the number of participants or samples increases.

7. Related Work

Many privacy-preserving models for specific machine learning algorithms have emerged. Two main kinds of technologies are adopted in privacy-preserving training: differential privacy [21] and cryptography-based approaches. Differential privacy applied to FL can prevent clients from trying to reconstruct the private data of other clients by exploiting the global model, as done in [13, 22]. It adds noise to the original dataset or gradient parameters while ensuring the availability of the data, but it lowers accuracy. Cryptographic technologies can provide privacy protection while preserving accuracy; secure multiparty computation, secret sharing, and homomorphic encryption are the common methods. For example, Aono et al. [23] used homomorphic encryption to improve the logistic regression algorithm, ensuring the security of the training and prediction data. Liu et al. [24] proposed a secret sharing-based federated extreme boosting framework to achieve privacy-preserving model training for mobile crowdsensing. Xu et al. [25] proposed a privacy-preserving and verifiable federated learning framework based on homomorphic hash functions, in which clients can verify whether the result returned by the cloud server is correct.

Some previous privacy-preserving works over vertically partitioned data are discussed in [26, 27]; however, they carry potential privacy risks because they reveal the class distribution over the given attributes. Research on VFL was first proposed in [28], where a federated logistic regression scheme is designed through an additively homomorphic scheme. Nock et al. [29] then provided a formal assessment of how errors in entity resolution impact learning. Cheng et al. [30] proposed a novel lossless privacy-preserving tree-boosting system, which conducts the learning process over multiple parties with partially common user samples but different feature sets. Fu et al. [31] combined Lagrange interpolation and the Chinese remainder theorem to realize the secure aggregation of gradients. However, some of these works assume that the sample entities have already been matched, or they only handle two VFL participants. The proposed framework has advantages over those approaches, as it supports multiparticipant VFL while taking into account entity matching and model training with the withdrawal of participants. Table 4 shows the functional comparison between our framework and the two existing main works in terms of the GAM, multiparticipant support (M-p), entity matching (EM), and withdrawal of participants (WP). Table 5 compares the computation and communication costs, mainly for secure aggregation, in terms of the main operations of the participant and the aggregator, the communication rounds (Comm-rounds), and the mask number (i.e., the number of messages received by the aggregator), where SC represents the number of secret reconstructions, CR the number of Chinese remainder theorem calculations, HE the number of homomorphic encryptions, and the remaining symbols the numbers of pseudorandom generator invocations and power exponent calculations. It can be seen from Table 5 that our scheme has advantages over reference [25] in terms of calculation and communication overhead. Compared with [25, 31], although our scheme applies the principle of secret sharing, it does not need to pay the overhead of secret reconstruction.

8. Conclusion

For the privacy protection of data aggregation and joint training in federated learning, as well as entity matching, we designed a PFLF in which we proposed a general aggregation model and a multiparty entity matching protocol that can find the common entities of multiple participants without data disclosure. In addition, our GAM was used to improve the logistic regression algorithm so as to ensure the confidentiality of data samples during training and to support the withdrawal of participants in VFL. The security analysis of the scheme was given based on the simulator, and the performance of the system was tested with experimental data. Future research will focus on optimizing the operating load of the system, considering malicious participants in order to construct a verifiable federated learning framework, and designing incentives to facilitate federated learning.

Data Availability

The data used to support the findings of this study are included in the article.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61662009 and 61772008), Science and Technology Major Support Program of Guizhou Province (Grant No. 20183001), Key Program of the National Natural Science Union Foundation of China (Grant No. U1836205), Science and Technology Program of Guizhou Province (Grant No. [2019]1098), Project of Innovative Group in Guizhou Education Department (Grant No. [2013]09), Project of High-level Innovative Talents of Guizhou Province (Grant No. [2020]6008), and Science and Technology Program of Guiyang (Grant No. [2021]1-5).