Abstract

Collaborative learning is an emerging distributed learning paradigm that enables multiple parties to jointly train a shared machine learning (ML) model without disclosing the raw data of any party. As one of the fundamental collaborative learning algorithms, privacy-preserving collaborative logistic regression, which utilizes cryptographic techniques to securely train joint logistic regression models across data from multiple parties, has recently gained attention from industry and academia. However, existing schemes incur high communication and computational overhead, cannot handle high-dimensional sparse samples, reduce model accuracy, or risk leaking private information. To overcome these issues, considering vertically distributed data, we propose a privacy-preserving vertical collaborative logistic regression (VCLR) scheme based on approximate homomorphic encryption (HE), which enables two parties to jointly train a shared model without a trusted third-party coordinator. Our scheme utilizes the batching method in approximate HE to encrypt multiple data items into a single ciphertext and enables parallel processing in a single instruction multiple data (SIMD) manner. We evaluate our scheme on three publicly available datasets; the experimental results indicate that our scheme outperforms existing schemes in terms of training time and model performance.

1. Introduction

Machine learning (ML) [1] is a method for analyzing large-scale data and is widely used to train predictive models for practical applications. As one of the basic machine learning algorithms, logistic regression (LR) [2] has attracted much attention for its powerful ability to solve classification problems in practical applications, such as disease diagnosis [3] and credit evaluation [4].

In recent years, in order to obtain massive data for training better-performing models [5], there has been growing interest in machine learning over combined data from different institutions [6]. For instance, different hospitals would like to combine health data to jointly train models that facilitate more accurate disease diagnosis; different financial companies want to collaborate to train more effective credit card scoring and fraud detection models. Unfortunately, due to regulatory and competitive reasons, it is difficult or even impossible to directly exchange data between different parties for model training in practice [7]. That is, the data of different organizations are isolated. To eliminate this issue of "data isolation", the idea of collaborative learning [8] was introduced. Its goal is to cooperatively train a shared ML model on distributed data while complying with regulations and protecting privacy; security, privacy, and efficiency remain the main challenges for practical applications. Recently, as a fundamental collaborative learning algorithm, privacy-preserving collaborative logistic regression (PPCLR) [9-24] has received considerable attention. It utilizes cryptographic primitives such as homomorphic encryption (HE) algorithms [25] and multi-party computation (MPC) protocols [26] to securely train a joint logistic regression model across data from multiple parties. However, for the HE-based schemes [9-11], model weights are exposed to all parties at each iterative update of the global model parameters during training, which can be exploited to deduce additional private information [27]; for the MPC-based schemes [12, 14], after applying secret sharing (SS) [28] to the training samples of all parties, even previously sparse samples become dense, so these schemes cannot efficiently handle sparse samples and incur high communication complexity when the training data become large.

To solve the problems mentioned above, in a two-party setting with two vertically distributed training datasets that share the same sample space but have different feature spaces, we construct a privacy-preserving vertical collaborative logistic regression (VCLR) scheme based on the HE for arithmetic of approximate numbers [29]. The main contributions are as follows:
(1) Firstly, we construct a VCLR framework for collaborative learning over vertically distributed features, which securely realizes the joint modeling of both parties without the assistance of a trusted third party (TTP) and hence greatly reduces the system complexity.
(2) Secondly, to improve training efficiency, using the batching technique in HE [29], the proposed scheme packs multiple samples into a single plaintext with multiple slots, encrypts it into a single ciphertext, and enables parallel computing via SIMD.
(3) Finally, we conduct performance evaluations on three datasets [30]; the experimental results indicate that our scheme achieves a significant improvement in efficiency and performance over existing schemes [9, 21]. Specifically, the training time of the model is decreased by almost 32.3%-72.5%, while the accuracy, F1-score, and AUC of the model improve by nearly 0.3%-3.0%, 0.1%-2.7%, and 0-0.03, respectively. Furthermore, the security analysis indicates that the proposed VCLR scheme is secure against semi-honest adversaries, and neither party can learn the other's raw data.

The rest of this work is organized as follows. Works related to our scheme are introduced in Section 2. In Section 3, we review some preliminaries. In Section 4, our scheme is described. In Section 5, the evaluations of our scheme are presented. The security analysis of our scheme is given in Section 6. In Section 7, we conclude this work.

2. Related Work

Several efforts have been made to jointly train an LR model across multiple data owners. In general, a common approach is to implement secure logistic regression using cryptographic primitives such as HE [25] and MPC [26]. The existing works [9-24] can be divided into two categories: PPCLR with a TTP coordinator [9-16] and PPCLR without a TTP coordinator [17-24]. A summary of the existing works [9-24] follows.

As for PPCLR with a TTP coordinator [9-16], Hardy et al. [9] described a privacy-preserving federated LR scheme employing an additively HE scheme [25], which centralizes two vertically distributed training datasets at one TTP coordinator, but its approximation of the non-polynomial function reduces the model accuracy. Yang et al. [10] presented a quasi-Newton method for achieving a vertical federated LR model based on the additively HE scheme [25]. Using an additive HE [31] and an aggregation method [32], Mandal et al. [11] built a privacy-preserving regression analysis protocol over horizontally distributed high-dimensional data. Employing an additive secret sharing technique [33], Zhang et al. [12] proposed a privacy-preserving collaborative learning scheme that protects local training data and model information. Liu et al. [13] introduced a collaborative learning platform that supports multiple institutions in building machine learning models collaboratively over large-scale horizontally and vertically partitioned data. By means of MPC from additive secret sharing [34, 35], Cock et al. [14] proposed a protocol for securely training an LR model over distributed parties, where a TTP initializer assigns relevant random values to two computing servers. Based on multi-key fully HE [36], Wang et al. [15] designed a secure cloud-edge collaborative LR system, which employs the cloud center and edge nodes to collaboratively train an LR model over encrypted data. Zhu et al. [16] proposed a value-blind LR training method in a collaborative setting based on HE [25], where the central server updates model parameters without access to the training data and intermediate values, and model parameters are shared between the central server and the collaborating parties. However, it is inherently difficult to establish a third party trusted by all data owners in real-world scenarios. Moreover, data interactions between the data owners and the TTP increase the risk of leaking the data owners' sensitive data.

To decrease the complexity of training a joint model between any two parties by removing the TTP coordinator, Yang et al. [17] constructed a parallel distributed LR method for vertical federated learning based on HE [25], which allows two parties to jointly train models without the help of a TTP coordinator. Using a secure MPC protocol and a ciphertext domain conversion protocol [37], Chen et al. [18] presented a collaborative learning system for jointly building better models over vertically partitioned data from multiple sources. Based on the HE scheme [29], Li et al. [19] introduced a collaborative learning method on encrypted data, which can securely train LR models over vertically distributed data from both data owners. Based on asynchronous gradient sharing and an HE algorithm [29], Wei et al. [20] designed a two-party collaborative LR protocol that can securely train a joint model on vertically partitioned data. Chen et al. [21] combined HE [25] and secret sharing [38] to securely build an LR model on large-scale sparse, vertically distributed training data. Over horizontally partitioned data, based on a secure MPC protocol, Ghavamipour et al. [22] described two methods to train an LR model in a privacy-preserving manner; however, each data owner must compute multiple shares of its sensitive training data and send them separately to each non-colluding computation party, which leads to heavy communication costs. He et al. [23] constructed a vertical federated LR method based on an HE algorithm [25], which uses a piecewise function to preserve the accuracy of the loss function, but this results in a loss of efficiency. With the HE scheme [25] and a differential privacy algorithm [39], Sun et al. [24] introduced a vertical federated LR solution, which alleviates the constraints on feature dimensions. Nevertheless, the existing PPCLR schemes without a TTP coordinator [17-24] still incur high communication and computational overhead.

3. Preliminaries

3.1. System Architecture

For ease of reading, the definitions of the symbols used in our VCLR scheme are listed in Table 1. As shown in Figure 1, the system architecture of our VCLR includes two semi-trusted entities, party A and party B, which hold the vertically distributed datasets D_A and D_B, respectively. D_A and D_B share the same sample space but have different feature spaces; namely, party A holds one part of the features, while party B holds the remaining features and the label. Party A cooperates with party B to train a shared LR model without disclosing the privacy of the training data. Specifically, party A generates the keys of the approximate HE scheme [29], sends the polynomial degree N, the coefficient modulus q, the scaling factor Δ, the public key pk, the relinearization key rlk, and the Galois key gk to party B, and securely stores the secret key sk. Then, party A encrypts its own data with pk and sends the encrypted data to party B. Finally, party A and party B jointly execute the VCLR algorithm to obtain the training result.
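To make this message flow concrete, the following Python sketch mocks the two parties with a toy stand-in for CKKS: no real encryption happens, and only the communication pattern of Figure 1 is modelled. All names here (ToyCiphertext, PartyA, refresh, and so on) are hypothetical illustrations, not the paper's implementation.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ToyCiphertext:
    slots: np.ndarray  # stands in for the SIMD slots of a CKKS ciphertext

class PartyA:
    """Holds part of the features; generates keys and keeps the secret key."""
    def __init__(self, features):
        self.features = features
        self.sk = object()  # the secret key never leaves party A

    def setup(self):
        # Send the public material and the encrypted dataset to party B.
        return {"pk": object(), "rlk": object(), "gk": object(),
                "enc_features": ToyCiphertext(self.features.copy())}

    def refresh(self, masked_ct):
        # Decrypt the masked intermediate result and re-encrypt it fresh;
        # the additive mask hides the true values from party A.
        return ToyCiphertext(masked_ct.slots.copy())

class PartyB:
    """Holds the remaining features and the label; drives the computation."""
    def __init__(self, features, labels):
        self.features, self.labels = features, labels
        self.rng = np.random.default_rng(0)

    def masked_round_trip(self, ct, party_a):
        # Mask, ship to party A for a decrypt/re-encrypt refresh, unmask.
        mask = self.rng.normal(size=ct.slots.shape)
        fresh = party_a.refresh(ToyCiphertext(ct.slots + mask))
        return ToyCiphertext(fresh.slots - mask)

# Wiring: party A ships keys and encrypted features; party B computes.
a = PartyA(np.ones((4, 2)))
b = PartyB(np.zeros((4, 2)), np.array([0, 1, 0, 1]))
bundle = a.setup()
refreshed = b.masked_round_trip(bundle["enc_features"], a)
```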

3.2. Homomorphic Encryption

HE allows direct operations on ciphertexts without decryption and ensures that the computation on the ciphertexts is consistent with the computation on the plaintexts. Cheon et al. [29] introduced an approximate HE algorithm based on ring learning with errors (RLWE) [40]. Given the parameters (the polynomial degree N, the coefficient modulus q, and the scaling factor Δ), its key generation produces the secret key sk, the public key pk, the relinearization key rlk, and the Galois key gk. Encryption maps a message vector to a ciphertext under pk, and decryption recovers the message vector under sk. The scheme further supports slot-wise homomorphic addition, subtraction, and multiplication, each either between two ciphertexts or between a ciphertext and a plaintext message vector (including an addition over a list of ciphertexts), as well as a left rotation of the ciphertext slots by a rotation value using gk.
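Since a CKKS ciphertext behaves like a vector of slots, the operations above can be previewed in plaintext with numpy arrays. The snippet below is only a plaintext model of the SIMD semantics, not an HE implementation; a real one would use a library such as Microsoft SEAL [41].

```python
import numpy as np

# Plaintext stand-in for CKKS SIMD semantics: a ciphertext is modelled as a
# vector of slots, so homomorphic operations act element-wise on the slots.
a = np.array([1.0, 2.0, 3.0, 4.0])   # "encrypted" message vector (4 slots)
b = np.array([0.5, 0.5, 0.5, 0.5])

added   = a + b           # slot-wise homomorphic addition
product = a * b           # slot-wise homomorphic multiplication
rotated = np.roll(a, -1)  # left rotation by 1 -> [2., 3., 4., 1.]

# Rotate-and-add computes the sum of all slots (the basis of encrypted
# inner products): after log2(len) halving rotations, every slot holds it.
total = a.copy()
for step in (2, 1):
    total = total + np.roll(total, -step)
print(total[0])  # 10.0 == 1 + 2 + 3 + 4
```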

3.3. Logistic Regression

Let a dataset include $n$ samples $(x_i, y_i)$, where an input $x_i \in \mathbb{R}^d$ maps to a binary dependent variable $y_i \in \{0, 1\}$. The goal of binary LR is to compute the weights $w$ that minimize the log-likelihood loss function $J(w) = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \sigma(w^\top x_i) + (1-y_i)\log(1-\sigma(w^\top x_i))\right]$, where $\sigma(z) = 1/(1+e^{-z})$ is the sigmoid function. Assuming that $w^{(t)}$ and $\alpha^{(t)}$ denote the model weights and learning rate at iteration $t$, respectively, gradient descent (GD) can be used to compute the extremum of $J(w)$ by $w^{(t+1)} = w^{(t)} - \alpha^{(t)} \nabla J(w^{(t)})$. Since the HE scheme (CKKS) [29] cannot effectively support non-polynomial arithmetic operations, we use a degree-7 polynomial function $f(x) = a_0 + a_1 x + a_3 x^3 + a_5 x^5 + a_7 x^7$ with fitted constant coefficients $a_0, a_1, a_3, a_5, a_7$ to approximate the sigmoid function over the domain $[-8, 8]$.
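Because the polynomial's coefficients come from a fitting step, a comparable approximation can be recomputed directly. The sketch below performs a degree-7 least-squares fit of the sigmoid on [-8, 8] with numpy; the resulting coefficients are illustrative and are not claimed to match the paper's exact values.

```python
import numpy as np

# Degree-7 least-squares fit of the sigmoid on [-8, 8]. The coefficients
# printed here are illustrative; the paper's exact values may differ.
xs = np.linspace(-8.0, 8.0, 4001)
sig = 1.0 / (1.0 + np.exp(-xs))

coeffs = np.polyfit(xs, sig, 7)      # coefficients, highest degree first
f = lambda z: np.polyval(coeffs, z)  # polynomial surrogate for sigmoid

print("coefficients:", np.round(coeffs, 6))
print("max |f - sigmoid| on [-8, 8]:", np.max(np.abs(f(xs) - sig)))
```

Note that the even-degree coefficients come out near zero, since the sigmoid minus 0.5 is an odd function; this matches the odd-plus-constant form of f above.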

4. Privacy-Preserving Vertical Collaborative Logistic Regression

Over the vertically distributed datasets D_A and D_B, we propose a VCLR scheme based on an approximate HE [29]. Using the batching method in approximate HE, the proposed scheme packs a message vector containing multiple messages into a plaintext with multiple plaintext slots and performs parallel training based on SIMD. For readability, the full algorithm is given in the Appendix. We assume that the samples of D_A and D_B held by party A and party B have been aligned; namely, D_A and D_B consist of n samples of the form x_i^A and (x_i^B, y_i), respectively, where i = 1, ..., n. Each column of D_A denotes a feature; the last column of D_B represents the label, and the other columns of D_B represent features. Party A cooperates with party B to train a shared LR model without revealing data privacy. The proposed VCLR proceeds in three phases (a plaintext sketch of the training logic follows this outline):

Input: D_A and D_B, held by party A and party B, respectively.
Output: the weight blocks w_A and w_B for party A and party B, respectively.

Preprocessing (steps 1-2): Party A generates the keys, packs and encrypts its dataset D_A into ciphertexts, encrypts the initial weights into one ciphertext, and sends the evaluation keys and all ciphertexts to party B. Party B packs its own dataset D_B, the initial weights, and the learning rate into message vectors.

Training (steps 3-31): In each iteration, party B homomorphically evaluates the batched gradient on the combined data, using the polynomial approximation of the sigmoid from Section 3.3. To refresh the intermediate ciphertext, party B chooses a random message vector, adds it to the ciphertext as a mask, and sends the result to party A; party A decrypts it with sk, re-encrypts it, and returns it; party B then removes the mask and updates the encrypted weights.

Reconstructing (steps 32-36): Party B sends the final weight ciphertexts to party A; party A decrypts them and obtains its weight block w_A, then sends the remaining part to party B, which recovers its weight block w_B.
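The following plaintext Python sketch mirrors this training logic under simplifying assumptions: the logit splits additively across the two parties' feature blocks, the degree-7 polynomial of Section 3.3 stands in for the sigmoid, and the masking and refresh round trips are omitted. All names (vclr_step, XA, wA, and so on) are hypothetical; this is not the paper's encrypted implementation.

```python
import numpy as np

# Degree-7 polynomial surrogate for the sigmoid (see Section 3.3).
_xs = np.linspace(-8.0, 8.0, 4001)
_POLY = np.polyfit(_xs, 1.0 / (1.0 + np.exp(-_xs)), 7)

def vclr_step(XA, wA, XB, wB, y, lr):
    """One plaintext gradient-descent step on vertically split features.

    In the actual protocol, party B evaluates these formulas homomorphically:
    party A's block arrives encrypted and batched into ciphertext slots, and
    the polynomial replaces the sigmoid because CKKS supports only polynomial
    arithmetic. Inputs are assumed scaled so logits stay within [-8, 8].
    """
    z = XA @ wA + XB @ wB              # logit splits additively across parties
    err = np.polyval(_POLY, z) - y     # approximate-sigmoid residual
    n = len(y)
    return wA - lr * (XA.T @ err) / n, wB - lr * (XB.T @ err) / n

# Toy run: party A holds 4 features; party B holds 4 features and the label.
rng = np.random.default_rng(1)
XA, XB = rng.normal(size=(200, 4)), rng.normal(size=(200, 4))
y = (XA[:, 0] - XB[:, 0] > 0).astype(float)
wA, wB = np.zeros(4), np.zeros(4)
for _ in range(50):
    wA, wB = vclr_step(XA, wA, XB, wB, y, lr=0.1)
```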

5. Performance Evaluation

We compare the performance of our VCLR scheme with the existing schemes [9, 21]. We perform all experiments on a 64-bit Linux machine with an i7 CPU and 16 GB of memory. For all experiments, the initial weights, the learning rate, and the maximum number of iterations are fixed to the same values across schemes. The schemes [9, 21] use the Paillier cryptosystem [25] to provide the additive HE operations, while the proposed scheme uses the Microsoft SEAL library [41] to instantiate the approximate HE operations [29]. To achieve the same security level, we set the bit lengths of the Paillier primes for the schemes [9, 21] and choose the polynomial degree N, the coefficient modulus q, and the scaling factor Δ for the proposed scheme accordingly. On three publicly available datasets [30], D1 (Umaru Impact Study), D2 (Myocardial Infarction from Edinburgh), and D3 (Nhanes III), we compare the proposed scheme with the schemes [9, 21] in terms of training time, accuracy, F1-score, and AUC. For D1, party A holds the first 4 features of all samples and party B holds the last 4 features and the labels; for D2, party A holds the first 5 features and party B holds the last 4 features and the labels; for D3, party A holds the first 8 features and party B holds the last 7 features and the labels. We validate the experimental results using 5-fold cross-validation, and all results are reported as the average of 10 runs. The performance comparison between the proposed scheme and the schemes [9, 21] is summarized in Table 2, in which "√" denotes "satisfied" and "×" denotes "unsatisfied". From Table 2, we can see that our VCLR scheme outperforms the existing schemes [9, 21] in both training time and model performance, and does not need a TTP coordinator.
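As a reference for how these metrics are computed, the snippet below runs 5-fold cross-validation and reports accuracy, F1-score, and AUC with scikit-learn on synthetic stand-in data (the real experiments use the datasets of [30]); it is an evaluation-harness sketch, not the encrypted training pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import KFold

# Hypothetical stand-in data; the paper evaluates on the Umaru Impact Study,
# Edinburgh Myocardial Infarction, and Nhanes III datasets [30].
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

accs, f1s, aucs = [], [], []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    prob = clf.predict_proba(X[test_idx])[:, 1]   # positive-class probability
    pred = (prob >= 0.5).astype(int)
    accs.append(accuracy_score(y[test_idx], pred))
    f1s.append(f1_score(y[test_idx], pred))
    aucs.append(roc_auc_score(y[test_idx], prob))

print(f"accuracy={np.mean(accs):.3f}  F1={np.mean(f1s):.3f}  AUC={np.mean(aucs):.3f}")
```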

From Figure 2, we can see that the training time of our scheme is 0.86 min on dataset D1, which is nearly 32.3% and 60.6% lower than that of [9] and [21], respectively; 1.49 min on dataset D2, which is almost 32.3% and 56.3% lower than that of [9] and [21], respectively; and 5.87 min on dataset D3, which is nearly 47.3% and 72.5% lower than that of [9] and [21], respectively.

From Figure 3, we can see that the accuracy of our scheme is 74.4% on dataset D1, which is nearly 0.3% and 0.3% higher than that of [9] and [21], respectively; 92.3% on dataset D2, which is almost 1.0% and 1.4% higher than that of [9] and [21], respectively; and 85.7% on dataset D3, which is nearly 3.0% and 3.0% higher than that of [9] and [21], respectively.

From Figure 4, we can see that the F1-score of our scheme is 85.2% on dataset D1, which is nearly 0.1% and 0.1% higher than that of [9] and [21], respectively; 78.0% on dataset D2, which is almost 0.5% and 2.7% higher than that of [9] and [21], respectively; and 61.9% on dataset D3, which is nearly 1.8% and 1.8% higher than that of [9] and [21], respectively.

From Figure 5, we can see that the AUC of our scheme is 0.58 on dataset D1, which is nearly 0.01 and 0.02 higher than that of [9] and [21], respectively; 0.96 on dataset D2, which is the same as that of [9, 21]; and 0.91 on dataset D3, which is nearly 0.03 and 0.02 higher than that of [9] and [21], respectively.

6. Security Analysis

In the semi-honest model [42], both party A and party B know the public parameters (N, q, Δ) and the public keys (pk, rlk, gk), while only party A holds the secret key sk. The proposed VCLR scheme is a secure two-party computation that computes an objective functionality f = (f_A, f_B). For the inputs (x, y), where x is from party A and y is from party B, f_A(x, y) is the output for party A and f_B(x, y) is the output for party B, and neither party can learn more private information than its own output. Following the simulation-based security paradigm [43], we perform a security analysis of our VCLR scheme.

Definition 1. Let f = (f_A, f_B) be a deterministic functionality and π be a two-party protocol for computing f. Given party A's input x, party B's input y, and the security parameter λ, the views of party A and party B in the protocol π are denoted as view_A^π(x, y) = (x, r_A, m_1, ..., m_t) and view_B^π(x, y) = (y, r_B, m'_1, ..., m'_t), where r_A and r_B are the internal random tapes and m_1, ..., m_t and m'_1, ..., m'_t are the messages received by party A and party B, respectively. We say that π securely computes f in the semi-honest model if there exist probabilistic polynomial-time (PPT) simulators S_A and S_B such that

{S_A(x, f_A(x, y))} ≡_c {view_A^π(x, y)} and {S_B(y, f_B(x, y))} ≡_c {view_B^π(x, y)},

where ≡_c denotes computational indistinguishability.

Theorem 1. Assuming that party A and party B do not collude with each other and that the HE scheme (CKKS) [29] satisfies semantic security, our VCLR scheme is secure in the semi-honest model.
Security Proof. The security proof of our VCLR scheme follows the simulation-based security paradigm [43]. We prove that we can build simulators S_A and S_B such that {S_A(x, f_A(x, y))} ≡_c {view_A^π(x, y)} and {S_B(y, f_B(x, y))} ≡_c {view_B^π(x, y)}, where view_A^π and view_B^π denote the views of party A and party B, respectively. Next, we show that these two equations hold for a corrupted party A and a corrupted party B, respectively.
Against a corrupted party A: We build S_A such that, when given party A's input x and party A's output f_A(x, y), S_A can simulate party A's view in the execution of the protocol. To this end, we analyze party A's view in the real execution. The only messages party A receives are the masked intermediate ciphertexts; let c denote such a ciphertext. Therefore, view_A^π(x, y) consists of party A's secret key sk, the masked message vector r recovered by decrypting c, and the ciphertext c itself. Since party B masks the intermediate result with a uniformly random message vector, r is uniformly distributed from party A's perspective. Given x, f_A(x, y), and pk, S_A samples a random message vector r', encrypts r' with pk into a ciphertext c', and outputs (sk, r', c'). Therefore, we obtain {S_A(x, f_A(x, y))} = {(sk, r', c')} ≡_c {(sk, r, c)} = {view_A^π(x, y)}. Through the above analysis, the probability distributions of party A's simulated view and real view are indistinguishable. Therefore, the proposed VCLR scheme is secure against a corrupted party A in the semi-honest model.
Against a corrupted party B: We build S_B such that, when given party B's input y and party B's output f_B(x, y), S_B can simulate party B's view in the execution of the protocol. We analyze party B's view in the real execution. Party B does not receive any plaintext message vectors from party A; every message it receives is a ciphertext under pk, and party B does not hold the secret key sk. Therefore, view_B^π(x, y) consists of party B's input y, its chosen random message vector r, and ciphertexts that, by the semantic security of the CKKS scheme, are computationally indistinguishable from encryptions of random message vectors. Given y, f_B(x, y), and pk, S_B generates a simulation of the view by outputting (y, r') for a random message vector r', together with encryptions of random vectors under pk. Therefore, we obtain {S_B(y, f_B(x, y))} ≡_c {view_B^π(x, y)}. Through the above analysis, the probability distributions of party B's simulated view and real view are indistinguishable. Therefore, the proposed VCLR scheme is secure against a corrupted party B in the semi-honest model.

7. Conclusion

In this paper, to improve the efficiency of collaborative LR, we propose a VCLR scheme based on an approximate HE algorithm over vertically distributed data, which protects both the training data and the model parameters of all parties. We evaluate the proposed scheme on three public datasets, and the evaluation results show that our VCLR scheme outperforms the existing schemes [9, 21] in both joint training time and model performance. Specifically, the training time of the model is decreased by almost 32.3%-72.5%, while the accuracy, F1-score, and AUC of the model improve by nearly 0.3%-3.0%, 0.1%-2.7%, and 0-0.03, respectively. In the future, we will extend our method to support more complex ML models and deploy our scheme in practical applications.

Appendix

(Algorithm listings: the detailed VCLR training procedure of Section 4 and its three looped subroutines.)

Data Availability

Previously reported datasets were used to support this study and are available at https://doi.org/10.1186/s12920-018-0401-7. These prior studies (and datasets) are cited at the relevant places within the text as reference [30].

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

This work is supported by the National Key R&D Program of China (Grant No. 2019YFE0113200), the National Natural Science Foundation of China (Grant Nos. U19B2021 and 61901317), and the Guangdong Basic and Applied Basic Research Foundation (Grant Nos. 2020A1515110496 and 2020A1515110079).