Abstract

Logistic regression is a statistical technique used to predict the probability that an event occurs. In scenarios where the storage capacity and computing resources of the data owner are limited, the data owner wants to train the logistic regression model on a cloud service provider, while the high sensitivity of the training data requires effective privacy protection methods that enable efficient model training without exposing any information about the training data to the untrusted cloud service provider. Recently, several works have used cryptographic techniques to implement privacy-preserving logistic regression in such application scenarios. However, on large-scale training datasets, the existing works still suffer from long model training times and poor model performance. To solve these problems, based on homomorphic encryption (HE), we propose an efficient privacy-preserving outsourced logistic regression (P2OLR) on encrypted training data, which enables data owners to utilize the powerful storage and computing resources of cloud service providers for logistic regression analysis without exposing data privacy. Furthermore, the proposed scheme can pack multiple messages into one ciphertext and perform the same arithmetic evaluations on multiple plaintext slots by using the batching technique and the single instruction multiple data (SIMD) mechanism in HE. On three public training datasets, the experimental results show that, compared with the existing schemes, the proposed scheme performs better in terms of the encryption and decryption time of the data owner, the storage of the encrypted training data, and the training time and accuracy of the model.

1. Introduction

Logistic regression (LR) [1] is a popular classification method that has been used in numerous practical applications, including cancer diagnosis [2], credit scoring [3], genome-wide association studies [4], and more. LR can not only be applied to the problem of predicting the probability of occurrence of various events, but is also competitive with other classification algorithms in terms of prediction accuracy. In some practical application settings, data owners have limited computing and storage resources and thus want to outsource some of the heavy computation in logistic regression model training. Outsourced data analysis [5] has therefore received considerable attention recently, as it enables data owners to train an LR model using the powerful storage capacity and computing resources of cloud service providers [6].

However, the high sensitivity of training data requires effective privacy protection methods [7–10] that enable efficient and secure logistic regression analysis without leaking information about the training data to the untrusted cloud service provider. Recently, to meet such application requirements, several works on privacy-preserving logistic regression (PPLR) [13–22] have been proposed based on cryptographic techniques such as secure multiparty computation (MPC) [11] and homomorphic encryption (HE) [12], which enable data owners to employ the service provider's powerful data storage and computing resources for logistic regression model training without exposing their own data privacy. Specifically, the data owner encrypts its training data and sends the encrypted training data to the service provider. The service provider trains a logistic regression model on the encrypted training data and returns the encrypted training result to the data owner. The data owner then decrypts the encrypted training result to obtain the final training result.

Unfortunately, on large-scale training datasets, the existing PPLR schemes [13–22] still suffer from the bottlenecks of high model training time and low model accuracy. To solve these problems, based on the HE cryptographic technique [23], which has the property that the results of operations on ciphertexts are consistent with those on plaintexts, we design an efficient privacy-preserving outsourced logistic regression (P2OLR). The main contributions are as follows:
(1) Firstly, we propose a method for achieving P2OLR on encrypted data from HE. To speed up the model training, the proposed P2OLR scheme employs the batching technique to pack multiple elements into multiple plaintext slots, encrypts them into one ciphertext, and performs the same arithmetic operations on multiple plaintext slots via the SIMD mechanism.
(2) Secondly, we evaluate the proposed P2OLR on three public datasets [18]. Under the same experimental environment, compared with the related P2OLR schemes [17, 18, 22], the model training time of the proposed P2OLR is reduced by more than 71.7%, and the proposed P2OLR achieves better model performance.

The rest of this paper is organized as follows. We present the related works in Section 2. We review the preliminaries related to our P2OLR in Section 3. In Section 4, our P2OLR is described. The performance evaluation of our P2OLR is presented in Section 5. The security analysis of our P2OLR is given in Section 6. Finally, we conclude in Section 7.

2. Related Works

There have been many works on achieving PPLR using cryptographic techniques. In this paper, we mainly focus on PPLR based on HE. To outsource LR model training to a cloud service provider in a privacy-preserving manner, based on the HE scheme FV [24], Bonte et al. [13] proposed an algorithm to train an LR model on a homomorphically encrypted dataset, which is implemented on top of the FV-NFLlib library [25]. However, the model accuracy is poor due to the use of a quadratic polynomial to approximate the sigmoid function. Furthermore, the training time grows linearly with the number of training samples. Using the HE scheme FV [24] and a 1-bit gradient descent (GD) method, Chen et al. [14] presented a method to train LR over encrypted data, which is implemented with the SEAL library [26] and allows an arbitrary number of iterations by using bootstrapping [27] in FV, but bootstrapping introduces a significant decrease in performance. Focusing on the prediction process of LR, based on the HE scheme BGV [28], Li and Sun [15] proposed a secure protocol to solve the data leakage problem during the LR prediction process, and implemented their scheme with the HElib library [29]. Based on the Chimera framework [30], which allows switching between the HE schemes TFHE [31] and CKKS [23], Carpov et al. [16] proposed a solution to achieve semi-parallel LR on encrypted genomic data, which performs bootstrapping [27] without re-encrypting the genomic data for an arbitrary number of iterations, and is implemented using the TFHE library [32] and the HEAAN library [33].

Adapting the packing and parallelization techniques of the approximate HE scheme CKKS [23], Kim et al. [17] proposed a PPLR, which is implemented using the HEAAN library [33] and uses least squares approximation to improve the accuracy and efficiency of LR model training. However, as the number of iterations increases, the parameters of the CKKS scheme also need to become larger, which makes the training time increase dramatically. Kim et al. [18] applied the HE scheme CKKS [23] to achieve PPLR. Their scheme is implemented using the HEAAN library [33]. Moreover, they devised an encoding method to decrease the storage of the encrypted training data and adapted Nesterov's accelerated GD method to reduce the number of iterations as well as the computational cost. However, their scheme requires the assumption that both the number of training samples and the number of features are powers of two, which makes the scheme unsuitable for practical applications. To reduce the number of iterations, Cheon et al. [19] proposed an ensemble GD method based on the HE scheme CKKS [23] and applied it to PPLR, in which they approximate the sigmoid function using a degree-5 polynomial obtained by least squares approximation. Their scheme is implemented on top of the HEAAN library [34]. To run a genome-wide association study on encrypted data, using the SIMD capabilities of the HE scheme CKKS and Nesterov's accelerated GD, Bergamaschi et al. [20] introduced a method for homomorphic training of LR models, which is implemented on top of the HElib library [29]. To protect the private information of both parties, based on the HE scheme CKKS [23] and a gradient sharing technique, Wei et al. [21] proposed a protocol to train an LR model on vertically distributed data between two parties, which does not require trusted third-party nodes and is implemented with the HElib library [29]. Based on the HE scheme CKKS [23], Fan et al. [22] offered a PPLR algorithm, where they approximate the sigmoid function in LR by Taylor's theorem and use row encoding to encrypt the training samples, but as the number of samples increases, this leads to longer model training times.

3. Preliminaries

3.1. System Model

As can be seen in Figure 1, the system model of the proposed P2OLR considers two entities, namely a data owner (DO) and a service provider (SP). For readability, the definitions of the notations in this paper are shown in Table 1. DO: it has limited computational resources and wants to use SP's data analysis service on encrypted data to train an LR model without revealing the privacy of its own training data. SP: it is a semi-trusted entity with powerful data storage and computing capabilities, and provides data analysis and statistical services on encrypted data for DO. Specifically, DO chooses the poly_modulus_degree and coeff_modulus, and runs the key_generation algorithm to generate the secret_key, public_key, relinearization_key, and galois_key. Next, DO encrypts the training data, the initial weights, and the learning rate into ciphertexts, and sends these ciphertexts, together with the public_key, relinearization_key, galois_key, the encryption parameters, and the iteration number Q, to SP. SP performs the training algorithm and returns the ciphertext result of the Q-th iteration to DO. DO decrypts the ciphertext result to obtain the final result.
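To make the setup step concrete, the following minimal sketch shows how DO could generate these parameters and keys with the Microsoft SEAL C++ API [26] (version 3.6 or later is assumed); the poly_modulus_degree and coeff_modulus bit sizes here are illustrative placeholders rather than the exact parameters of Section 5.

```cpp
#include "seal/seal.h"
#include <cmath>
using namespace seal;

int main() {
    // CKKS parameters: the degree and modulus chain below are placeholders;
    // Section 5 discusses the sizes used in the actual experiments.
    EncryptionParameters parms(scheme_type::ckks);
    size_t poly_modulus_degree = 16384;
    parms.set_poly_modulus_degree(poly_modulus_degree);
    parms.set_coeff_modulus(
        CoeffModulus::Create(poly_modulus_degree, {60, 40, 40, 40, 40, 60}));
    SEALContext context(parms);

    // key_generation: secret_key, public_key, relinearization_key, galois_key.
    KeyGenerator keygen(context);
    SecretKey secret_key = keygen.secret_key();
    PublicKey public_key;
    keygen.create_public_key(public_key);
    RelinKeys relin_keys;
    keygen.create_relin_keys(relin_keys);
    GaloisKeys galois_keys;
    keygen.create_galois_keys(galois_keys);

    // DO keeps secret_key; public_key, relin_keys, and galois_keys go to SP
    // together with the encrypted training data.
    double scale = std::pow(2.0, 40); // scaling factor for CKKS encoding
    return 0;
}
```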

3.2. Homomorphic Encryption

Homomorphic encryption (HE) is a cryptographic technique that allows operations on ciphertexts without decryption and guarantees that the computation results on ciphertexts are consistent with the computation results on plaintexts. We adopt the HE scheme CKKS [23] based on the Ring Learning with Errors (RLWE) problem, which can encrypt multiple elements in one ciphertext and supports single instruction multiple data (SIMD) operations. Suppose $\Phi_M(X)$ denotes the $M$-th cyclotomic polynomial, where $M$ is a power of two and $N = M/2$. $\mathcal{R} = \mathbb{Z}[X]/(\Phi_M(X))$ denotes the cyclotomic ring of polynomials, and $\mathcal{R}_q = \mathcal{R}/q\mathcal{R}$ denotes the residue ring of $\mathcal{R}$ modulo $q$. $\mathbb{H} \subseteq \mathbb{C}^N$ denotes a subring of the complex vector space that is isomorphic to $\mathbb{R}^N$. The canonical embedding $\sigma: \mathcal{R} \rightarrow \mathbb{H}$ transforms a plaintext polynomial $m(X) \in \mathcal{R}$ into a complex vector $\sigma(m) \in \mathbb{H}$, and the natural projection $\pi: \mathbb{H} \rightarrow \mathbb{C}^{N/2}$ transforms a complex vector in $\mathbb{H}$ to a vector in $\mathbb{C}^{N/2}$, so that one ciphertext packs $N/2$ plaintext slots. The CKKS scheme [23] supports the operations listed in the Appendix. For ease of description, we define Algorithms 1–9.

Algorithm 1: Enc.
Input: message vector m, scaling factor Δ
Output: ciphertext ct
(1) encode_double (m, Δ, pt)
(2) encrypt (pt, ct)
(3) return: ct

Algorithm 2: Dec.
Input: ciphertexts ct_1, …, ct_k
Output: message vectors m_1, …, m_k
(1) for (i ← 1 to k) do
(2) decrypt (ct_i, pt_i)
(3) decode_double (pt_i, m_i)
(4) append m_i to the output list
(5) end for
(6) return: m_1, …, m_k

Algorithm 3: Mul.
Input: ciphertexts ct_1 and ct_2, relinearization_key rlk, scaling factor Δ
Output: ciphertext ct_3
(1) mod_switch_to_inplace (ct_2, ct_1.parms_id())
(2) multiply (ct_1, ct_2, ct_3)
(3) relinearize_inplace (ct_3, rlk)
(4) rescale_to_next_inplace (ct_3)
(5) ct_3.set_scale (Δ)
(6) return: ct_3

Algorithm 4: MulPlain.
Input: ciphertext ct_1, message vector m, scaling factor Δ
Output: ciphertext ct_2
(1) encode_double (m, Δ, pt)
(2) mod_switch_to_inplace (pt, ct_1.parms_id())
(3) multiply_plain (ct_1, pt, ct_2)
(4) rescale_to_next_inplace (ct_2)
(5) ct_2.set_scale (Δ)
(6) return: ct_2

Algorithm 5: Add.
Input: ciphertexts ct_1 and ct_2
Output: ciphertext ct_3
(1) mod_switch_to_inplace (ct_2, ct_1.parms_id())
(2) add (ct_1, ct_2, ct_3)
(3) return: ct_3

Algorithm 6: AddPlain.
Input: ciphertext ct_1, message vector m, scaling factor Δ
Output: ciphertext ct_2
(1) encode_double (m, Δ, pt)
(2) mod_switch_to_inplace (pt, ct_1.parms_id())
(3) add_plain (ct_1, pt, ct_2)
(4) return: ct_2

Algorithm 7: Sub.
Input: ciphertexts ct_1 and ct_2
Output: ciphertext ct_3
(1) mod_switch_to_inplace (ct_2, ct_1.parms_id())
(2) sub (ct_1, ct_2, ct_3)
(3) return: ct_3

Algorithm 8: AddInplace.
Input: ciphertexts ct_1 and ct_2
Output: ciphertext ct_1
(1) mod_switch_to_inplace (ct_2, ct_1.parms_id())
(2) add_inplace (ct_1, ct_2)
(3) return: ct_1

Algorithm 9: SumSlots.
Input: ciphertext ct, slot count n, galois_key gk
Output: ciphertext ct with the sum of all slots in every slot
(1) i ← 1
(2) for (; i < n; i ← 2i) do
(3) rotate_vector (ct, i, gk, ct_rot)
(4) add_inplace (ct, ct_rot)
(5) end for
(6) return: ct
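As a concrete illustration of Algorithm 9, the following sketch, written against the Microsoft SEAL C++ API with hypothetical variable names, implements the rotate-and-sum pattern; it assumes galois_keys was generated with the default power-of-two rotation steps.

```cpp
#include "seal/seal.h"
using namespace seal;

// Rotate-and-sum over the packed slots (Algorithm 9, SumSlots).
// After the loop, every slot contains the sum of all original slot values.
void sum_slots(Evaluator &evaluator, const GaloisKeys &galois_keys,
               Ciphertext &ct, size_t slot_count)
{
    for (size_t step = 1; step < slot_count; step <<= 1)
    {
        Ciphertext rotated;
        evaluator.rotate_vector(ct, static_cast<int>(step), galois_keys, rotated);
        evaluator.add_inplace(ct, rotated); // fold the rotated copy back in
    }
}
```

Only ⌈log2(slot_count)⌉ rotations are needed, which is why the batched layout keeps the aggregation step cheap even for thousands of packed samples.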
3.3. Sigmoid Approximation

Since existing HE schemes can only efficiently support polynomial arithmetic, computing the sigmoid function $\sigma(x) = 1/(1 + e^{-x})$ under HE is a barrier to the realization of P2OLR. To find an approximate polynomial of $\sigma(x)$, adapting the least squares method, we consider a degree-7 polynomial $g(x) = c_0 + c_1 x + c_3 x^3 + c_5 x^5 + c_7 x^7$ over a bounded input domain, where the coefficients $c_0$, $c_1$, $c_3$, $c_5$, and $c_7$ are obtained by the least squares fit. As can be seen in Figure 2, the maximum error between $\sigma(x)$ and $g(x)$ is about 0.032. Evaluating $g(x)$ over encrypted data from HE can be achieved by Algorithm 10.

Algorithm 10: Sigmoid.
Input: ciphertext ct_x, coefficients c_0, c_1, c_3, c_5, c_7, scaling factor Δ
Output: ciphertext ct_g encrypting g(x)
(1) ct_x2 ← Mul (ct_x, ct_x) (Algorithm 3)
(2) ct_x4 ← Mul (ct_x2, ct_x2)
(3) ct_x6 ← Mul (ct_x4, ct_x2)
(4) ct_t3 ← MulPlain (ct_x2, c_3) (Algorithm 4)
(5) ct_t5 ← MulPlain (ct_x4, c_5)
(6) ct_t7 ← MulPlain (ct_x6, c_7)
(7) ct_s ← AddPlain (ct_t3, c_1) (Algorithm 6)
(8) ct_s ← Add (ct_s, ct_t5) (Algorithm 5)
(9) ct_s ← Add (ct_s, ct_t7)
(10) ct_g ← Mul (ct_s, ct_x)
(11) ct_g ← AddPlain (ct_g, c_0)
(12) return: ct_g
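The following SEAL C++ sketch mirrors the steps of Algorithm 10, assuming the placeholder coefficients c0, c1, c3, c5, c7 from the least squares fit; the helpers mul and mul_plain encapsulate the relinearize/rescale/set-scale bookkeeping of Algorithms 3 and 4, and all names are ours. It is one plausible evaluation order, not necessarily the one used in our implementation.

```cpp
#include "seal/seal.h"
using namespace seal;

// Ciphertext-ciphertext product with the bookkeeping of Algorithm 3.
// Assumes b sits at a level no deeper than a's.
static Ciphertext mul(Evaluator &ev, const RelinKeys &rk, double scale,
                      const Ciphertext &a, Ciphertext b)
{
    ev.mod_switch_to_inplace(b, a.parms_id()); // align levels
    Ciphertext out;
    ev.multiply(a, b, out);
    ev.relinearize_inplace(out, rk);
    ev.rescale_to_next_inplace(out);
    out.scale() = scale; // re-normalize the scale, as in Algorithm 3
    return out;
}

// Ciphertext-plaintext product with the bookkeeping of Algorithm 4.
static Ciphertext mul_plain(Evaluator &ev, CKKSEncoder &enc, double scale,
                            const Ciphertext &a, double c)
{
    Plaintext p;
    enc.encode(c, scale, p);
    ev.mod_switch_to_inplace(p, a.parms_id());
    Ciphertext out;
    ev.multiply_plain(a, p, out);
    ev.rescale_to_next_inplace(out);
    out.scale() = scale;
    return out;
}

// g(x) = c0 + c1*x + c3*x^3 + c5*x^5 + c7*x^7 on an encrypted x.
Ciphertext sigmoid_approx(Evaluator &ev, CKKSEncoder &enc, const RelinKeys &rk,
                          double scale, const Ciphertext &ct_x,
                          double c0, double c1, double c3, double c5, double c7)
{
    Ciphertext x2 = mul(ev, rk, scale, ct_x, ct_x); // x^2
    Ciphertext x4 = mul(ev, rk, scale, x2, x2);     // x^4
    Ciphertext x6 = mul(ev, rk, scale, x4, x2);     // x^6

    Ciphertext t = mul_plain(ev, enc, scale, x2, c3);  // c3*x^2
    Ciphertext t5 = mul_plain(ev, enc, scale, x4, c5); // c5*x^4
    Ciphertext t7 = mul_plain(ev, enc, scale, x6, c7); // c7*x^6

    // Align all summands to the deepest level before adding.
    ev.mod_switch_to_inplace(t, t7.parms_id());
    ev.mod_switch_to_inplace(t5, t7.parms_id());
    ev.add_inplace(t, t5);
    ev.add_inplace(t, t7);

    Plaintext p1;
    enc.encode(c1, scale, p1); // t += c1
    ev.mod_switch_to_inplace(p1, t.parms_id());
    ev.add_plain_inplace(t, p1);

    // g = x * (c1 + c3*x^2 + c5*x^4 + c7*x^6) + c0
    Ciphertext g = mul(ev, rk, scale, t, ct_x);
    Plaintext p0;
    enc.encode(c0, scale, p0);
    ev.mod_switch_to_inplace(p0, g.parms_id());
    ev.add_plain_inplace(g, p0);
    return g;
}
```

The factored form keeps the multiplicative depth at five rescaling levels, which is what makes a degree-7 approximation affordable without bootstrapping.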
3.4. Logistic Regression

Logistic regression (LR) is a statistical analysis method for predicting the probability of an event. We consider the case where the predicted value is a binary dependent variable. Assuming that a dataset consists of $n$ samples of the form $(x_i, y_i)$ with $x_i \in \mathbb{R}^{d}$ and $y_i \in \{\pm 1\}$, the goal of LR is to find the optimal parameter vector $w$ that minimizes the negative log-likelihood (loss) function $L(w) = \frac{1}{n}\sum_{i=1}^{n}\log(1 + \exp(-y_i\, w^{T} x_i))$, where $\sigma(x) = 1/(1 + e^{-x})$. A common method for minimizing the loss function is the gradient descent (GD) algorithm, which finds a local extremum of a loss function by following the direction of the gradient. The gradient of $L(w)$ with respect to $w$ is calculated by $\nabla L(w) = -\frac{1}{n}\sum_{i=1}^{n}\sigma(-y_i\, w^{T} x_i)\, y_i\, x_i$. Let $w^{(t)}$ be the regression parameters and $\alpha_t$ the learning rate in the $t$-th iteration of the GD algorithm; the GD algorithm then updates $w$ by $w^{(t+1)} = w^{(t)} + \frac{\alpha_t}{n}\sum_{i=1}^{n}\sigma(-y_i\, w^{(t)T} x_i)\, y_i\, x_i$.
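For reference, the update rule above can be written in a few lines of plaintext C++ (a minimal sketch with our own variable names, labels y_i in {-1, +1}); Algorithm 11 in Section 4 reproduces exactly this computation over ciphertexts.

```cpp
#include <vector>
#include <cmath>

// One gradient-descent update for LR:
// w <- w + (alpha/n) * sum_i sigmoid(-y_i * <w, x_i>) * y_i * x_i
void gd_step(std::vector<double> &w,
             const std::vector<std::vector<double>> &X, // n x d, bias folded in
             const std::vector<int> &y,                 // labels in {-1, +1}
             double alpha)
{
    size_t n = X.size(), d = w.size();
    std::vector<double> grad(d, 0.0);
    for (size_t i = 0; i < n; ++i)
    {
        double ip = 0.0;
        for (size_t j = 0; j < d; ++j) ip += w[j] * X[i][j];
        double s = 1.0 / (1.0 + std::exp(y[i] * ip)); // sigmoid(-y_i * w^T x_i)
        for (size_t j = 0; j < d; ++j) grad[j] += s * y[i] * X[i][j];
    }
    for (size_t j = 0; j < d; ++j) w[j] += (alpha / n) * grad[j];
}
```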

4. Privacy-Preserving Outsourced Logistic Regression

Based on the HE scheme, we propose a P2OLR, where we employ the batching method to pack multiple elements into multiple plaintext slots, encrypt them into one ciphertext, and then perform the same arithmetic evaluations on multiple plaintext slots through the SIMD mechanism. To reduce the parameters of the HE scheme (CKKS) as well as improve the performance of P2OLR, the proposed P2OLR allows interaction between DO and SP during the iterative training. Specifically, SP returns the ciphertext training result to DO after a certain number of iterations Q. DO decrypts the ciphertext training result and determines whether the performance of the model has met the requirements. If so, DO stops the training; otherwise, DO sends freshly encrypted weights to SP to continue the training. The training dataset held by DO consists of n samples of the form (x_i, y_i) with x_i in R^d and y_i in {-1, +1}; the first column of the dataset denotes the label, and the other columns denote the features. Since DO has limited computational resources, DO wants to outsource the training to SP without disclosing the privacy of its own training data. The specific description of the proposed P2OLR is as follows (a sketch of the batched encryption in step (1) is given after this list):
(1) DO generates the keys and calls Algorithm 1 to encrypt the training data into ciphertexts ct_Z[i][j], the initial weights into ciphertexts ct_w[j], and the learning rate into one ciphertext ct_α, as sketched below. DO then sends these ciphertexts, together with the public_key, relinearization_key, galois_key, the encryption parameters, and the iteration number Q, to SP.
(2) SP prepares the ciphertext lists, calls Algorithm 11, and returns the ciphertext result of the Q-th iteration to DO.
(3) DO calls Algorithm 2 to decrypt the ciphertext result. Next, DO judges whether the result has met the requirements. If so, DO terminates the training; otherwise, DO calls Algorithm 1 to encrypt the updated weights into ciphertexts and sends them to SP to continue the training.
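As mentioned in step (1), a plausible SEAL C++ sketch of the batched encryption is given below: each ciphertext packs one feature column (the j-th feature of every sample) into the plaintext slots, so that SP's slot-wise operations act on all samples simultaneously. The column-per-ciphertext layout and all names are illustrative assumptions, since the exact encoding is only described at a high level above.

```cpp
#include "seal/seal.h"
#include <vector>
using namespace seal;

// Pack feature j of every sample into the slots of one CKKS ciphertext.
// Requires n <= encoder.slot_count(); shorter vectors are zero-padded.
std::vector<Ciphertext> encrypt_columns(const std::vector<std::vector<double>> &Z,
                                        CKKSEncoder &encoder, Encryptor &encryptor,
                                        double scale)
{
    size_t n = Z.size(), d = Z[0].size();
    std::vector<Ciphertext> cts(d);
    for (size_t j = 0; j < d; ++j)
    {
        std::vector<double> column(n);
        for (size_t i = 0; i < n; ++i)
            column[i] = Z[i][j]; // j-th feature across all n samples
        Plaintext pt;
        encoder.encode(column, scale, pt); // one plaintext holds n slot values
        encryptor.encrypt(pt, cts[j]);     // one ciphertext per feature column
    }
    return cts;
}
```

With this layout, DO uploads only d ciphertexts per block of samples, which is what keeps the encryption time and ciphertext storage in Table 2 small.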

Algorithm 11: Homomorphic LR training (executed by SP).
Input: encrypted training data ct_Z[i][j] (1 ≤ i ≤ m blocks, 1 ≤ j ≤ d features), encrypted weights ct_w[j], encrypted learning rate ct_α, coefficients c_0, c_1, c_3, c_5, c_7, relinearization_key rlk, galois_key gk, iteration number Q, sample count n
Output: ciphertext weights ct_w[1], …, ct_w[d] of the Q-th iteration
(1) for (t ← 1 to Q) do
(2) for (i ← 1 to m) do
(3)  for (j ← 1 to d) do
(4)   ct_p[j] ← Mul (ct_w[j], ct_Z[i][j]) (Algorithm 3)
(5)  end for
(6)  ct_ip ← ct_p[1]
(7)  for (j ← 2 to d) do
(8)   ct_ip ← Add (ct_ip, ct_p[j]) (Algorithm 5)
(9)  end for
(10)  ct_ip ← MulPlain (ct_ip, −1) (Algorithm 4)
(11)  ct_s ← Sigmoid (ct_ip) (Algorithm 10)
(12)  for (j ← 1 to d) do
(13)   ct_g[i][j] ← Mul (ct_s, ct_Z[i][j])
(14)  end for
(15) end for
(16) for (j ← 1 to d) do
(17)  ct_G[j] ← ct_g[1][j]
(18)  for (i ← 2 to m) do
(19)   ct_G[j] ← Add (ct_G[j], ct_g[i][j])
(20)  end for
(21)  ct_G[j] ← SumSlots (ct_G[j], n, gk) (Algorithm 9)
(22)  ct_G[j] ← Mul (ct_G[j], ct_α)
(23)  ct_G[j] ← MulPlain (ct_G[j], 1/n)
(24)  mod_switch_to_inplace (ct_w[j], ct_G[j].parms_id())
(25)  ct_w[j] ← Add (ct_w[j], ct_G[j])
(26) end for
(27) end for
(28) return: ct_w[1], …, ct_w[d]
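To illustrate lines (22)–(25) of Algorithm 11, the following fragment sketches the encrypted weight update in SEAL C++; mul and mul_plain are the hypothetical helpers from the sketch after Algorithm 10, and all variable names are ours.

```cpp
// Encrypted gradient step for one weight (lines (22)-(25) of Algorithm 11).
// mul and mul_plain are the hypothetical helpers sketched in Section 3.3.
void update_weight(Evaluator &ev, CKKSEncoder &enc, const RelinKeys &rk,
                   double scale, size_t n,
                   Ciphertext &ct_w_j, const Ciphertext &ct_G_j,
                   const Ciphertext &ct_alpha)
{
    Ciphertext step = mul(ev, rk, scale, ct_G_j, ct_alpha);  // (22) alpha * G_j
    step = mul_plain(ev, enc, scale, step, 1.0 / double(n)); // (23) average over n
    ev.mod_switch_to_inplace(ct_w_j, step.parms_id());       // (24) align levels
    ev.add_inplace(ct_w_j, step);                            // (25) w_j += step
}
```

Because every multiplication consumes one level of the modulus chain, after Q iterations the weights sit deep in the chain; this is exactly why the protocol lets DO decrypt and re-encrypt the weights periodically instead of enlarging the CKKS parameters.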

5. Performance Evaluation

We implement all experiments on a 32-core Intel Xeon CPU with 256 GB RAM. We compare the performance of the proposed P2OLR with the related P2OLR schemes [17, 18, 22]. We employ 5-fold cross-validation to validate the experimental results. For [17, 18], the implementations are publicly available at [35, 36], respectively, and use the HEAAN library [33] for the HE cryptographic operations. For [22] and the proposed P2OLR, we employ the Microsoft SEAL library [26] for the HE cryptographic operations. For all experiments, we set the same learning rate, random initial weight vector, maximum number of iterations, and scaling factor. To guarantee the desired bit security, the scheme [17] takes a coefficient-modulus of around 2204 to 2406 bits, and the schemes [18, 22] and the proposed P2OLR choose their respective polynomial-modulus-degree and coefficient-modulus accordingly. Using the three datasets of [18], namely the Umaru Impact Study, the Myocardial Infarction Study from Edinburgh, and Nhanes III, we compare the proposed P2OLR with the related P2OLR schemes [17, 18, 22] in terms of the encryption time (E. time) and decryption time (D. time) of DO, the storage of the encrypted training data, and the training time (T. time), accuracy, precision, recall, F1-score, and AUC of the model. All comparison results are reported as the average of 10 experiments. The performance comparisons of the proposed P2OLR and the related P2OLR schemes [17, 18, 22] are shown in Table 2.

From Table 2, we can see that, compared with the related P2OLR schemes [17, 18, 22], the proposed P2OLR performs better. Specifically, as shown in Figure 3, under the first training dataset, the encryption time of DO in the proposed P2OLR is 2.01 s, which is reduced by nearly 71.4%, 7.8%, and 93.3%, respectively, compared with the encryption time of DO in [17, 18, 22]; under the second training dataset, the encryption time of DO in the proposed P2OLR is 2.16 s, which is reduced by nearly 73.6%, 2.3%, and 96.8%, respectively, compared with the encryption time of DO in [17, 18, 22]; under the third training dataset, the encryption time of DO in the proposed P2OLR is 3.49 s, which is reduced by nearly 75.9%, 81.6%, and 75.0%, respectively, compared with the encryption time of DO in [17, 18, 22].

As can be seen in Figure 4, under the first training dataset, the decryption time of DO in the proposed P2OLR is 0.23 s, which is reduced by almost 95.3% and 41.0%, respectively, in comparison to the decryption time of DO in [17, 18]; under the second training dataset, the decryption time of DO in the proposed P2OLR is 0.26 s, which is reduced by almost 95.0% and 36.6%, respectively, in comparison to the decryption time of DO in [17, 18]; under the third training dataset, the decryption time of DO in the proposed P2OLR is 0.45 s, which is reduced by almost 96.1% and 6.1%, respectively, in comparison to the decryption time of DO in [17, 18]. The decryption time of DO in [22] is smaller than that of the proposed P2OLR.

As described in Figure 5, under the first training dataset, the storage of the encrypted training data in the proposed P2OLR is 72.00 MB, which is reduced by nearly 88.9% and 95.0% compared with the storage of the encrypted training data in [17, 22]; under the second training dataset, the storage of the encrypted training data in the proposed P2OLR is 80.00 MB, which is reduced by nearly 89.0% and 97.4% compared with the storage of the encrypted training data in [17, 22]; under the third training dataset, the storage of the encrypted training data in the proposed P2OLR is 128.00 MB, which is reduced by nearly 89.4%, 13.0%, and 99.7%, respectively, compared with the storage of the encrypted training data in [17, 18, 22]. Although the storage of the encrypted training data for the first and second datasets in [18] is smaller than that of the proposed P2OLR, as the number of samples and features increases, the storage of the encrypted training data in the proposed P2OLR for the third dataset becomes smaller than that of [18].

As displayed in Figure 6, under the first training dataset, the training time of the model in the proposed P2OLR is 2.64 min, which is reduced by almost 96.6%, 73.8%, and 90.1%, respectively, compared with the training time of the model in [17, 18, 22]; under the second training dataset, the training time of the model in the proposed P2OLR is 2.91 min, which is reduced by almost 96.5%, 71.7%, and 95.0%, respectively, compared with the training time of the model in [17, 18, 22]; under the third training dataset, the training time of the model in the proposed P2OLR is 4.21 min, which is reduced by almost 96.5%, 79.8%, and 99.4%, respectively, compared with the training time of the model in [17, 18, 22].

As illustrated in Figure 7, under the first training dataset, the average accuracy of the model in the proposed P2OLR is 80.6%, which is nearly 5.8%, 6.2%, and 6.2% higher, respectively, than the average accuracy of the model in [17, 18, 22]; under the second training dataset, the average accuracy of the model in the proposed P2OLR is 90.6%, which is nearly 9.0%, 7.6%, and 7.9% higher, respectively, than the average accuracy of the model in [17, 18, 22]; under the third training dataset, the average accuracy of the model in the proposed P2OLR is 83.7%, which is nearly 4.6%, 4.5%, and 5.8% higher, respectively, than the average accuracy of the model in [17, 18, 22].

As illustrated in Figure 8, under the first training dataset, the average precision of the model in the proposed P2OLR is 95.6%, which is nearly 3.3%, 4.7%, and 4.7% higher, respectively, than the average precision of the model in [17, 18, 22]; under the second training dataset, the average precision of the model in the proposed P2OLR is 95.1%, which is nearly 5.4%, 4.7%, and 4.7% higher, respectively, than the average precision of the model in [17, 18, 22]; under the third training dataset, the average precision of the model in the proposed P2OLR is 60.3%, which is nearly 10.3%, 10.1%, and 7.9% higher, respectively, than the average precision of the model in [17, 18, 22].

As illustrated in Figure 9, under the first training dataset, the average recall of the model in the proposed P2OLR is 77.4%, which is nearly 6.0%, 6.0%, and 6.0% higher, respectively, than the average recall of the model in [17, 18, 22]; under the second training dataset, the average recall of the model in the proposed P2OLR is 90.6%, which is nearly 8.2%, 7.1%, and 7.7% higher, respectively, than the average recall of the model in [17, 18, 22]; under the third training dataset, the average recall of the model in the proposed P2OLR is 64.2%, which is nearly 3.0%, 2.9%, and 2.0% higher, respectively, than the average recall of the model in [17, 18, 22].

As illustrated in Figure 10, under the first training dataset, the average F1-score of the model in the proposed P2OLR is 85.5%, which is nearly 5.0%, 5.5%, and 5.5% higher, respectively, than the average F1-score of the model in [17, 18, 22]; under the second training dataset, the average F1-score of the model in the proposed P2OLR is 92.8%, which is nearly 6.9%, 4.0%, and 4.3% higher, respectively, than the average F1-score of the model in [17, 18, 22]; under the third training dataset, the average F1-score of the model in the proposed P2OLR is 62.2%, which is nearly 7.2%, 7.0%, and 5.3% higher, respectively, than the average F1-score of the model in [17, 18, 22].

As demonstrated in Figure 11, under the first training dataset, the AUC of the model in the proposed P2OLR is 0.73, which is nearly 0.05, 0.08, and 0.07 higher, respectively, than the AUC of the model in [17, 18, 22]; under the second training dataset, the AUC of the model in the proposed P2OLR is 0.88, which is nearly 0.06, 0.02, and 0.02 higher, respectively, than the AUC of the model in [17, 18, 22]; under the third training dataset, the AUC of the model in the proposed P2OLR is 0.85, which is nearly 0.02, 0.14, and 0.14 higher, respectively, than the AUC of the model in [17, 18, 22].

6. Security Analysis

In the semi-honest adversary model, we assume that DO and SP both hold the public_key, relinearization_key, and galois_key, and that only DO holds the secret_key. For our P2OLR, which evaluates a deterministic training function f, following the simulation-based paradigm [37], we consider the following security model for the security analysis: DO encrypts its private data x and sends the ciphertext Enc(x) to SP; SP performs the homomorphic operations on Enc(x), homomorphically evaluates f on Enc(x) to obtain Enc(f(x)), and sends Enc(f(x)) to DO; DO decrypts Enc(f(x)) and obtains f(x).

Theorem 1. Assume that SP is a semi-honest entity and that DO and SP do not collude with each other. Let x be the private data of DO. If the HE scheme [23] provides semantic security, then after SP performs the homomorphic operations on Enc(x) and the evaluation of f on Enc(x), DO learns f(x) but nothing else, and SP learns nothing.
Security Proof. The security proof of the proposed P2OLR follows the simulation-based paradigm [37]. Let the views of DO and SP during the evaluation be view_DO and view_SP, respectively. The view of SP consists of the ciphertext Enc(x). We construct a simulator S as follows: S randomly chooses input data x' and simulates view_SP by Enc(x'). Since the HE scheme [23] provides semantic security by assumption, Enc(x) and Enc(x') are computationally indistinguishable. Therefore, the proposed P2OLR is secure against a semi-honest SP.

7. Conclusion

In this paper, we present a method for achieving P2OLR on encrypted training data, which enables data owners to utilize the powerful storage and computing resources of cloud service providers for logistic regression analysis without exposing the privacy of the training data. We take advantage of the batching technique and the SIMD mechanism in HE to speed up the training process. On the three public datasets, compared with the related P2OLR schemes [17, 18, 22], the model training time of the proposed P2OLR is reduced by more than 71.7%, and the proposed P2OLR achieves improvements of over 4.5%, 3.3%, 2.0%, 4.0%, and 0.02 in terms of the accuracy, precision, recall, F1-score, and AUC of the model, respectively. There are still some limitations in applying our scheme to arbitrary datasets and in performing an arbitrary number of iterations on encrypted training data. In the future, we will extend our scheme to efficiently support P2OLR with an arbitrary number of iterations.

Appendix

(1) key_generation(N, q) → {secret_key, public_key, relinearization_key, galois_key}: given the poly_modulus_degree N and coeff_modulus q, it returns the secret_key, public_key, relinearization_key, and galois_key.
(2) encode_double(m, Δ, pt): given the message vector m and scaling factor Δ, it expands m to π^{-1}(m), scales it by Δ, and outputs the plaintext pt.
(3) decode_double(pt, m): given the plaintext pt, it removes the scaling factor Δ, applies the projection π, and outputs the message vector m.
(4) encrypt(pt, ct): given the plaintext pt, it encrypts pt into a ciphertext and outputs the ciphertext ct.
(5) decrypt(ct, pt): given a ciphertext ct, it decrypts ct into a plaintext and outputs the plaintext pt.
(6) add(ct_1, ct_2, ct_3): given two ciphertexts ct_1 and ct_2, it computes their slot-wise sum and saves the result as a new ciphertext ct_3.
(7) add_inplace(ct_1, ct_2): given two ciphertexts ct_1 and ct_2, it computes their slot-wise sum and saves the result in the ciphertext ct_1.
(8) add_plain(ct_1, pt, ct_2): given a ciphertext ct_1 and a plaintext pt, it computes their slot-wise sum and saves the result as a new ciphertext ct_2.
(9) sub(ct_1, ct_2, ct_3): given two ciphertexts ct_1 and ct_2, it computes their slot-wise difference and saves the result as a new ciphertext ct_3.
(10) multiply(ct_1, ct_2, ct_3): given two ciphertexts ct_1 and ct_2, it computes their slot-wise product and saves the result as a new ciphertext ct_3.
(11) multiply_plain(ct_1, pt, ct_2): given a ciphertext ct_1 and a plaintext pt, it computes their slot-wise product and saves the result as a new ciphertext ct_2.
(12) mod_switch_to_inplace(ct'/pt', ct.parms_id()): given a ciphertext/plaintext ct'/pt' and the level ct.parms_id() of a ciphertext ct, it switches the level of ct'/pt' to ct.parms_id().
(13) relinearize_inplace(ct, relinearization_key): given a ciphertext ct and the relinearization_key, it relinearizes ct and saves the result in the ciphertext ct.
(14) rescale_to_next_inplace(ct): given a ciphertext ct, it switches the modulus of ct to the next level, scales the underlying plaintext down accordingly, and saves the result in the ciphertext ct.
(15) set_scale(ct, Δ): given a scaling factor Δ, it resets the scaling factor of the ciphertext ct by computing ct.set_scale(Δ) and outputs the ciphertext ct.
(16) rotate_vector(ct, r, galois_key, ct_rot): given a ciphertext ct, a rotation amount r, and the galois_key, it rotates the slots of ct left by r positions and saves the result as a new ciphertext ct_rot.
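To show how the above operations fit together, here is a minimal SEAL C++ round trip under the same setup as the sketch in Section 3.1 (context, keys, encoder, encryptor, evaluator, and decryptor are assumed to be in scope); values are illustrative.

```cpp
// encode -> encrypt -> square -> relinearize -> rescale -> decrypt -> decode.
std::vector<double> input{0.1, 0.2, 0.3};
Plaintext pt;
encoder.encode(input, scale, pt);               // (2) encode_double
Ciphertext ct;
encryptor.encrypt(pt, ct);                      // (4) encrypt

evaluator.square_inplace(ct);                   // slot-wise squaring, cf. (10) multiply
evaluator.relinearize_inplace(ct, relin_keys);  // (13) relinearize_inplace
evaluator.rescale_to_next_inplace(ct);          // (14) rescale_to_next_inplace

Plaintext out_pt;
decryptor.decrypt(ct, out_pt);                  // (5) decrypt
std::vector<double> output;
encoder.decode(out_pt, output);                 // (3) decode_double: output[i] ~ input[i]^2
```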

Data Availability

The previously reported Umaru Impact Study, Myocardial Infarction Study from Edinburgh, and Nhanes III datasets were used to support this study and are available at https://doi.org/10.1186/s12920-018-0401-7. These prior studies (and datasets) are cited at the relevant places within the text as reference [18].

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant no. U19B2021.