Abstract

With the widespread application of machine learning (ML), data security has become a serious issue. To eliminate the conflict between data privacy and computability, homomorphic encryption has been extensively researched for its capacity to perform operations over ciphertexts. Considering that the data provided by a single party are not always adequate to derive a competent model via machine learning, we propose a privacy-preserving training method for neural networks over multiple data providers. Moreover, taking the trainer's intellectual property into account, our scheme also achieves the goal of model parameter protection. Thanks to the hardness of the conjugate search problem (CSP) and the discrete logarithm problem (DLP), the confidentiality of training data and the system model can be reduced to well-studied security assumptions. In terms of efficiency, since all messages are coded as low-dimensional matrices, the expansion rates of storage and computation overheads are linear relative to a plaintext implementation, with no loss of accuracy. In fact, our method can be transplanted to any machine learning system involving multiple parties, owing to its capacity for fully homomorphic computation.

1. Introduction

With the continuous development of artificial intelligence, data have become a precious resource due to their value for mining. Nevertheless, a great deal of private information is embodied in data, which may be abused to violate personal privacy, business secrets, or even state secrets. For example, once a patient's medical record is exposed to insurance companies, they may refuse to ever sell him certain kinds of medical insurance [1]. Similarly, many other machine learning applications have also witnessed privacy infringements, such as financial analysis, product customization, and public opinion surveillance [2–4]. On the other hand, any data-driven mechanism relies heavily on the quantity and quality of information, which brings about the conflict between data usability and data confidentiality. Fortunately, secure multiparty computation (SMC) [5–7] and homomorphic encryption (HE) [8, 9] provide us with powerful tools to process data in a concealed manner. Therefore, the remaining problem to address is how to devise a cryptosystem that is applicable to machine learning in consideration of storage and computation overheads.

As a cryptographic technology oriented toward decentralized systems, secure multiparty computation aims at data confidentiality for distributed participants. Despite the privacy concern, the involved parties can still figure out a public output as they wish. Based on such a cryptosystem, F. Ö. Çatak et al. [10] proposed a privacy-preserving learning protocol for classification by virtue of vertically partitioned data from multiple parties. Since the data are merely partially shared rather than concealed, semantic security is unachievable, just as with plain data. The first provably secure ML protocol of this kind was presented by R. Devin et al. [11] for text classification. However, their research only focused on the privacy of data classification and left the learning process unaddressed.

Oriented toward centralized systems, homomorphic encryption is another route to secure machine learning, being capable of performing specific operations over ciphertexts. Research on applying HE to data privacy in machine learning has developed rapidly since the significant innovation [12] appeared in 2016. Y. Aono et al. [13] combined additive homomorphism with deep learning to narrow the gap between system functionality and data security, by applying HE technology to the asynchronous stochastic gradient descent algorithm. F. Bourse et al. [14] improved the FHE structure of Chillotti et al. [15] and proposed a homomorphic neural network evaluation framework, namely, FHE-DiNN. Its complexity is strictly linear in network depth, but the model parameters must be proactively predefined. Based on a multikey variant of two HE schemes [16, 17] with packed ciphertexts, H. Chen et al. [18] provided a suite of interfaces for secure machine learning that also exploits bootstrapping for arbitrary circuit evaluation. As a matter of fact, almost all existing FHE-based machine learning algorithms rest on the algebraic structure of lattices, such as BGV [19–21], CKKS [22–24], and NTRU [25–27]. These methods suffer from a common defect: decryption may fail due to noise growth. Though bootstrapping can be deemed an effective tool for noise control, its extra computational burden is hardly acceptable. Surprisingly, J. Li et al. [28] discovered an alternative tool, namely, the conjugate search problem, to actualize full homomorphism without noise interference. They also applied such a cryptosystem to privacy-preserving data training, which achieves the same accuracy as training over plaintexts.

Though more comprehensible and effective than lattice-based secure machine learning, Li's scheme can only be applied to the scenario of a single data provider. Ordinarily, one party can provide only a small quantity of data, which may incur an overfitted model. To ensure the generalization of machine learning, data from diverse sources should be gathered for a specific learning task. In the circumstances of multiparty secure machine learning, each data provider may conceal their information with an independent key. Therefore, a training framework that operates over heterogeneous (i.e., encrypted by different keys) ciphertexts is desired. Meanwhile, the parameters of the system model should be treated as assets held by the trainer, as in general business operations. Thus, we should also make sure that the model is concealed, even before it is thoroughly trained.

To preserve the privacy of all participants, this paper presents a complete machine learning mechanism by virtue of CSP and DLP hardness. Our contributions are summarized as follows.

1.1. Contributions

(1) We code float-type data as low-dimensional upper triangular matrices that are homomorphic under the operations of addition, subtraction, multiplication, division, and comparison. With the help of CSP, the plain matrices can also be projected to semantically secure ciphertexts that remain homomorphic under the same operations. That is to say, our basic cryptosystem is fully homomorphic, since addition and multiplication are implemented simultaneously. Therefore, we can realize secure training and classification/regression once private data are provided under the same key.

(2) We construct a cyclic group by lifting the plain matrices to a Galois domain. Thereafter, key switching (converting a ciphertext encrypted under one key into one encrypted under another) is made possible via DLP for the purpose of cooperative training.

(3) We combine the two aforementioned technologies and devise a secure machine learning protocol under the semihonest model, which preserves the privacy of multiple data providers as well as that of the trainer.

2. System Model

The neural network (NN) is employed as the engineering background and verification model in this paper due to its extensive application. Nevertheless, it is worth mentioning that our scheme can be applied to most machine learning algorithms whenever privacy is significant to multiple participants.

2.1. Neural Network Model

A typical neural network contains three or more layers and turns into a deep learning model if the hidden layers are multiple [29]. The underlying principle of an NN lies in the fact that numerous neurons can automatically extract features of the inputs layer by layer. Besides the topology of the NN, the most important factors that define it are the weights and biases designated to each link and neuron. As for learning, the essence is how to adjust these parameters by virtue of training data via iterative forward-/back-propagation. Thereafter, to securely implement a neural network model, we should homomorphically evaluate the following functions.

Forward calculation (e.g., sigmoid):
$$y_j = \sigma\left(\mathbf{w}_j \cdot \mathbf{x}_j + b_j\right) = \frac{1}{1 + e^{-\left(\mathbf{w}_j \cdot \mathbf{x}_j + b_j\right)}}, \qquad (1)$$
where $\mathbf{x}_j$ and $\mathbf{w}_j$ are the input and weight vectors corresponding to the proactive links of neuron $j$, while $b_j$ represents its bias.

Backward calculation:
$$\delta^{(l)} = \left(\left(W^{(l+1)}\right)^{T} \delta^{(l+1)}\right) \odot \sigma'\left(z^{(l)}\right). \qquad (2)$$

Loss function (e.g., quadratic loss function):
$$E = \frac{1}{2} \sum_{j} \left(t_j - y_j\right)^2, \qquad (3)$$
where $t_j$ is the target value and $y_j$ is the actual value.

Parameter adjusting (e.g., gradient descent):
$$W^{(l)} \longleftarrow W^{(l)} - \eta\, \delta^{(l+1)} \left(o^{(l)}\right)^{T}, \qquad (4)$$
where $\eta$ is the learning rate, $\delta^{(l+1)}$ is the error vector between the target value and the actual value at the nodes of the next layer, and $\left(o^{(l)}\right)^{T}$ is the transpose of the output of the current layer's nodes.
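To make the four functions concrete, the following NumPy sketch runs one forward/backward step of a tiny network in the clear; the layer sizes, random seed, and learning rate are illustrative assumptions, not parameters taken from the paper.

```python
# One plaintext training step implementing formulae (1)-(4); toy shapes.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)   # 3 inputs -> 4 hidden
W2, b2 = rng.standard_normal((1, 4)), np.zeros(1)   # 4 hidden -> 1 output
x, t = rng.standard_normal(3), np.array([1.0])
eta = 0.1                                           # learning rate

o1 = sigmoid(W1 @ x + b1)                           # forward, formula (1)
y = sigmoid(W2 @ o1 + b2)

E = 0.5 * np.sum((t - y) ** 2)                      # quadratic loss, formula (3)

delta2 = (y - t) * y * (1 - y)                      # backward, formula (2)
delta1 = (W2.T @ delta2) * o1 * (1 - o1)

W2 -= eta * np.outer(delta2, o1)                    # gradient descent, formula (4)
b2 -= eta * delta2
W1 -= eta * np.outer(delta1, x)
b1 -= eta * delta1
```

Apart from the exponential (and the division it entails) inside the sigmoid, every operation above is an addition, subtraction, multiplication, or comparison, which is why Section 3 approximates the activation piecewise.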

2.2. System Model and Security Goal

In our system, a powerful trainer expects to acquire a neural network whose topology is predefined. To ensure the completeness of the resultant model, they may request training data from multiple parties. However, the data providers are concerned about privacy leakage, though they have a strong will to cooperate. Meanwhile, the trainer also worries that the system parameters may be exposed, infringing on their intellectual property. Therefore, we should preserve the privacy of all participants and guarantee the functionality of machine learning at the same time. Moreover, taking the trained neural network as a service, a user may not only desire to delegate a classification/regression task to the server but also be anxious about data abuse.

2.3. Adversary Model

Suppose that the trainer and all data providers are honest but curious during the whole process. That is to say, they will completely follow the protocol to avoid unnecessary disputes but may be interested in the private information contained within the data. Furthermore, it is reasonable to assume that both the trainer and the data owners are endowed with PPT (probabilistic polynomial time) computational power. However, since the trainer is always better equipped than the data providers, the hypothesis that they have access to a quantum machine may also be valid. To define the success of a privacy violation, we exploit the concept of symmetric IND-CPA (indistinguishability under chosen-plaintext attack) as below.

2.4. Symmetric IND-CPA [30]

Define an experiment under a symmetric cryptosystem as
$$\mathrm{Exp}_{\mathcal{A}}^{\mathrm{IND\text{-}CPA}}(\lambda):\ k \leftarrow \mathrm{KeyGen}(1^{\lambda});\ (m_0, m_1) \leftarrow \mathcal{A}^{\mathrm{Enc}_k(\cdot)};\ b \xleftarrow{\$} \{0,1\};\ c \leftarrow \mathrm{Enc}_k(m_b);\ b' \leftarrow \mathcal{A}^{\mathrm{Enc}_k(\cdot)}(c);\ \text{output } 1 \text{ iff } b' = b,$$
for any PPT adversary $\mathcal{A}$ that queries the oracle a polynomial number of times. Thus, the adversary's advantage can be expressed by
$$\mathrm{Adv}_{\mathcal{A}}(\lambda) = \left|\Pr\left[\mathrm{Exp}_{\mathcal{A}}^{\mathrm{IND\text{-}CPA}}(\lambda) = 1\right] - \frac{1}{2}\right|.$$

Then the cryptosystem is IND-CPA-secure if $\mathrm{Adv}_{\mathcal{A}}(\lambda) \leq \varepsilon(\lambda)$, where $\varepsilon(\lambda)$ stands for a negligible function in the security parameter $\lambda$.
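The experiment can be phrased operationally as a small game harness; this is a generic sketch in which keygen, enc, and the two adversary callbacks are placeholders rather than this paper's concrete algorithms.

```python
# Generic symmetric IND-CPA game; all four callables are placeholders.
import random

def ind_cpa_experiment(keygen, enc, choose, guess):
    """Return 1 iff the adversary's guess b' equals the hidden bit b."""
    k = keygen()
    oracle = lambda m: enc(k, m)        # chosen-plaintext oracle
    m0, m1 = choose(oracle)             # adversary picks two messages
    b = random.randrange(2)
    c = enc(k, (m0, m1)[b])             # challenge ciphertext
    return int(guess(oracle, c) == b)

def estimate_advantage(keygen, enc, choose, guess, trials=100_000):
    """Empirical |Pr[Exp = 1] - 1/2|; negligible for an IND-CPA scheme."""
    wins = sum(ind_cpa_experiment(keygen, enc, choose, guess)
               for _ in range(trials))
    return abs(wins / trials - 0.5)
```

Section 3.7 exhibits a choose/guess pair against the scheme of [28] whose advantage is nonnegligible.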

3. Cryptographic Construction

Focusing on the security goals presented in the system model, we are now ready to construct our cryptographic building blocks. In this part, we first explore the homomorphism of the conjugate search problem to underpin the functionality of training over homogeneous (i.e., encrypted by the same key) ciphertexts. Then, we present a key switching technology that can convert a ciphertext encrypted by one key into one decryptable by another.

The conjugate search problem is a special form of the group factorization problem (GFP) [31] and is defined as follows.

3.1. Conjugate Search Problem (CSP) [31]

Given $u, v$ over a nonabelian algebraic structure $G$, it is intractable to solve for $z \in G$ such that $v = z u z^{-1}$.

B. Evgeni [32] proved that the CSP is postquantum secure over the general linear group $GL_n(\mathbb{R})$ (where $\mathbb{R}$ denotes the real number field) if $n > 4$. Hence, to assure system security, we should code the message as a matrix of degree larger than 4.

To protect the privacy of data providers without affecting the accuracy of training, we resort to homomorphic encryption, which is capable of actualizing the forward-/backward-propagation processes covertly. Thereafter, we devise a way to make CSP-based encryption semantically secure and homomorphic. It is worth noting that the conjugate search problem is resistant to quantum attacks, which dispels the privacy concern of data providers even if the trainer is exceptionally well equipped.

A typical homomorphic encryption algorithm can be denoted as a tetrad $(\mathrm{KeyGen}, \mathrm{Enc}, \mathrm{Dec}, \mathrm{Eval})$, standing for the functions of key generation, encryption, decryption, and evaluation, respectively.

For any datum $m$ over the message space, we first code it as an upper triangular matrix $M$ as follows.

3.2. Encoding

Convert the message $m$ into three pairs of random numbers satisfying three prescribed relations that involve a constant random number of the system. Thus, we can construct the following matrices:

Combining the above matrices, the message is finally coded as a block upper triangular matrix, where 0 represents the all-zero matrix and the remaining unspecified blocks stand for random matrices uniformly sampled from the corresponding matrix space.

For clarity, we denote the space of coded messages as $\mathcal{M}^*$. It is interesting that $\mathcal{M}^*$ naturally constitutes a multiplicative cyclic group (excluding the elements whose determinants are zero) and that the coding is homomorphic. Furthermore, it is well known that all square matrices of the same dimension compose a ring. Though $\mathcal{M}^*$ and its elements are commutative for multiplication, there is an overwhelming probability that a matrix uniformly sampled from the full matrix ring is noncommutative with a coded message $M$. Thereupon, a CSP-based fully homomorphic encryption algorithm can be actualized as below.

3.3. Key Generation

$\mathrm{KeyGen}$: uniformly sample an invertible matrix $K$ from the full matrix ring, which can also be represented as a combination of nine random block matrices, namely,

The probability that $K$ is commutative with elements in $\mathcal{M}^*$ should be negligible. Then, the algorithm takes $K$ as the symmetric key.

3.4. Encryption

$\mathrm{Enc}$: output
$$C = K M K^{-1}$$
as the ciphertext of message $m$ (coded as a matrix $M$).

3.5. Decryption

$\mathrm{Dec}$: compute $K^{-1} C K$ to obtain
$$M = K^{-1} C K = K^{-1} \left(K M K^{-1}\right) K.$$

Then, decode $M$ to recover the plaintext $m$.

3.6. Evaluation

$\mathrm{Eval}$: we describe the very basic operations underpinning formulae (2)–(4) in advance. Suppose that $C_1$ and $C_2$ are ciphertexts corresponding to $M_1$ and $M_2$ under the same key; the additive and multiplicative arithmetic can then be simply carried out as $C_1 + C_2$ and $C_1 C_2$. These two operations can be trivially assembled to realize the functions for backward propagation. However, since the exponential operation cannot be implemented directly via homomorphic addition and multiplication, some activation functions of forward propagation, such as the sigmoid, should be approximated in polynomial form. Thereby, we resort to a specific piecewise conversion [32–34] to replace
$$\sigma(x) = \frac{1}{1 + e^{-x}}.$$
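The two basic operations can be checked numerically. The sketch below uses random floating-point matrices and ignores the paper's message encoding; it only verifies that conjugation commutes with matrix addition and multiplication.

```python
# Toy check that C = K M K^{-1} is homomorphic for + and x.
import numpy as np

rng = np.random.default_rng(0)
n = 4
K = rng.standard_normal((n, n))          # secret key (invertible w.h.p.)
K_inv = np.linalg.inv(K)

def enc(M):
    return K @ M @ K_inv

def dec(C):
    return K_inv @ C @ K

M1, M2 = rng.standard_normal((n, n)), rng.standard_normal((n, n))
C1, C2 = enc(M1), enc(M2)

assert np.allclose(dec(C1 + C2), M1 + M2)   # homomorphic addition
assert np.allclose(dec(C1 @ C2), M1 @ M2)   # homomorphic multiplication
```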

Noting that the aforementioned conversion is expressed as a piecewise function, to homomorphically decide which subfunction should be carried out, we can encrypt the numbers $-1.5$ and $1.5$ and compare them with the encrypted input for branching.
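For orientation, the following is one plausible three-segment shape for such a conversion; the breakpoints $\pm 1.5$ come from the branching step above, while the middle-segment coefficients are an assumption chosen for continuity, not the exact polynomial of [32–34].

```python
# A hedged piecewise stand-in for the sigmoid (assumed coefficients).
def sigmoid_piecewise(x: float) -> float:
    if x <= -1.5:
        return 0.0              # saturate low
    if x >= 1.5:
        return 1.0              # saturate high
    return x / 3.0 + 0.5        # linear middle segment, continuous at +/-1.5

# Homomorphically, the two comparisons against the encrypted constants
# -1.5 and 1.5 select the branch; the middle segment costs only one
# ciphertext multiplication and one addition.
```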

To program a piecewise function, J. Li et al. [28] presented a homomorphic algorithm that covertly compares two ciphertexts by size. Though our scheme is similar to that of [28], we argue that their cryptosystem is not semantically secure, as the following analysis shows.

3.7. Security Analysis of [28]

In the scheme of [28], the comparison value is computed with the help of an auxiliary random matrix uniformly sampled by the evaluator. Exploiting this public comparison capability, the adversary carries out a chosen-plaintext attack: they submit two messages of different sizes, receive the challenge ciphertext, and homomorphically compare it against the encryption of one of the chosen messages.

Since the randomizing factor is guaranteed to be positive throughout Li's scheme [28], the sign of the comparison result faithfully reveals the order of the underlying plaintexts. Hence, the adversary can easily determine which of the two chosen messages was encrypted by checking that sign. That is to say, the adversary's advantage is nonnegligible, and IND-CPA security fails.

It seems that the conflict between piecewise function evaluation and IND-CPA security is infeasible to resolve. However, we can introduce a specific form of ciphertext, which can be used to encrypt a designated number and compare it with any other normal ciphertext. Our construction is given below.

The data provider randomly chooses a nonzero number and encrypts the designated value in a special form, which satisfies the comparison property required below.

To compare the specially encrypted value with a general ciphertext without decryption, the evaluator combines the two ciphertexts and thus achieves the comparison result.

3.8. Correctness

The correctness of the encryption and decryption algorithms is straightforward, so we only focus on the homomorphism of the evaluation.

Homomorphic addition: since
$$C_1 + C_2 = K M_1 K^{-1} + K M_2 K^{-1} = K \left(M_1 + M_2\right) K^{-1},$$
we can decrypt it as $M_1 + M_2$ because
$$K^{-1} \left(C_1 + C_2\right) K = M_1 + M_2.$$

The addition of the two underlying messages can then be decoded from $M_1 + M_2$.

Homomorphic multiplication: because
$$C_1 C_2 = K M_1 K^{-1} K M_2 K^{-1} = K M_1 M_2 K^{-1},$$
we can deduce that
$$K^{-1} \left(C_1 C_2\right) K = M_1 M_2.$$

Homomorphic comparison: on the premise that the compared number is encrypted in the special form above, it can be seen that the evaluator's result expands accordingly.

According to formula (21), we have the expanded comparison expression.

Recall that the sign-carrying term is positive when the first message exceeds the second, while it is negative in the opposite case. In terms of the condition imposed during encoding, we can reduce formula (30) to a single sign-determining product.

It is obvious that the signs of this product and of the difference between the two compared messages are exactly the same, which determines their relationship without decryption.

3.9. Security

Thanks to the hardness of the conjugate search problem, an adversary must solve $C = \widetilde{K} M \widetilde{K}^{-1}$ for $\widetilde{K}$ to recover the plaintext. As for the semantic security of our scheme, it can be seen that the sign-carrying term in formula (17) is not always positive, owing to the arbitrary relationship between the two random factors. Therefore, when an adversary executes a chosen-plaintext attack as mentioned before, their advantage is negligible. Noting that any normal ciphertext can only be compared with specifically encrypted messages without decryption, the data provider has full control over their privacy and permits exact comparisons only when necessary.

After each round of training, the neural network coefficients are concealed by the key of the data provider. When multiple data providers take part in the training process, those semifinished parameters should also be re-encrypted under the key of the subsequent data holder for homomorphic computation. Therefore, we devise a way to decrypt and re-encrypt the model coefficients without exposing them to data providers, in consideration of the trainer's property rights. Our key switching scheme is based on the hardness of the discrete logarithm problem (DLP).

3.10. Discrete Logarithm Problem

Given a cyclic group $G$, a generator $g$, and a random element $h \in G$, it is difficult to find the discrete logarithm $x$ such that $g^x = h$.

Accordingly, if an adversary has obtained a ciphertext of the form $M^e$, it is hard for them to recover $M$ because of the confidentiality of the exponent $e$ [35]. However, in light of the Lagrange theorem [36], we can exploit a trapdoor to reverse $M^e$ back to $M$.

3.11. Lagrange Theorem

Denote $H$ as a subgroup of a finite group $G$; then $|H|$ divides $|G|$, where $|H|$ and $|G|$ are the orders of the groups $H$ and $G$.

Since any $M \in G$ generates a cyclic subgroup $\langle M \rangle$, we can conclude that $M^{|G|} = 1_G$ in terms of the Lagrange theorem, where $1_G$ is the identity of the group $G$.
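This corollary is exactly what the trapdoor below relies on: if $ed \equiv 1 \pmod{|G|}$, then $(M^e)^d = M^{1 + k|G|} = M$. A one-line numeric check in the multiplicative group $\mathbb{Z}_p^*$ (order $p - 1$, small illustrative prime):

```python
# g^{|G|} = identity for every element of Z_p^* when p is prime.
p = 101
assert all(pow(g, p - 1, p) == 1 for g in range(1, p))
```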

Based on the aforementioned mathematical tools, we are now ready to construct our key switching scheme as a triad $(\mathrm{KeyGen}, \mathrm{Switch}_A, \mathrm{Switch}_B)$. Without loss of generality, we denote $(e, d)$, $K_A$, and $K_B$ as the secret keys belonging to the trainer and the two data providers $A$ and $B$, respectively. Then $\mathrm{KeyGen}$ is used to generate the encryption/decryption key pair $(e, d)$ for the trainer, while $\mathrm{Switch}_A$ is used to convert a ciphertext encrypted by $K_A$ into the blinded form $M^e$, and $\mathrm{Switch}_B$ is utilized to modify $M^e$ (still blinded under $e$) into a ciphertext whose corresponding key is $K_B$.

3.12. Key Generation

$\mathrm{KeyGen}$: as mentioned before, we denote the space of coded messages as $\mathcal{M}^*$. Suppose that the precision of matrix elements in $\mathcal{M}^*$ is $l$ bits, of which the integer part takes $l_1$ bits and the decimal part takes $l_2$ bits. We can multiply any coded plaintext by $2^{l_2}$ to lift it over a Galois domain. Accordingly, the message space is changed to a cyclic group $\mathbb{G}$. Moreover, each lifted message $M$ composes a subgroup satisfying $M^{|\mathbb{G}|} = 1_{\mathbb{G}}$. Thereby, we uniformly sample an odd number $e$ and compute $d$ such that $ed \equiv 1 \pmod{|\mathbb{G}|}$. Output $(e, d)$ as the key to the trainer.
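The lifting itself is plain fixed-point arithmetic, as the following sketch shows; $l_2 = 8$ is an illustrative choice, and the value is picked to be exactly representable in 8 fractional bits.

```python
# Fixed-point lifting: scale by 2^l2, work over integers, shift back later.
l2 = 8
x = 3.140625                    # exactly representable with 8 fractional bits
lifted = int(x * (1 << l2))     # lift: 804, an integer representative
restored = lifted / (1 << l2)   # conceptual right-shift by l2 bits
assert restored == x
```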

3.13. Switching to $M^e$

The trainer blinds the encrypted model parameters as $C_A^{\,e} = \left(K_A M K_A^{-1}\right)^e$ and sends the result to data provider $A$. On receiving $C_A^{\,e}$, $A$ computes $K_A^{-1} C_A^{\,e} K_A = M^e$ as their response.

3.14. Switching to $K_B$

On receiving $M^e$ from the trainer, the data provider $B$ computes their response as $K_B M^e K_B^{-1}$. Therefore, the trainer can reverse it back to a ciphertext purely encrypted under $K_B$ via $\left(K_B M^e K_B^{-1}\right)^d = K_B M K_B^{-1}$ and then right-shift its elements by $l_2$ bits.
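Putting the two switches together gives a three-step exchange that never exposes the plain parameters to any provider. The demo below runs the whole flow over $GL_2(\mathbb{F}_p)$ with a deliberately tiny prime; the paper's actual matrix degree, domain, and encoding differ, so every parameter here is an illustrative assumption.

```python
# End-to-end key switching sketch over GL_2(F_p), small parameters only.
import random

P = 1009                                  # tiny illustrative prime
GROUP_ORDER = (P**2 - 1) * (P**2 - P)     # |GL_2(F_p)|, standard formula

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) % P
             for j in range(2)] for i in range(2)]

def mat_pow(A, n):
    R = [[1, 0], [0, 1]]                  # square-and-multiply
    while n:
        if n & 1:
            R = mat_mul(R, A)
        A = mat_mul(A, A)
        n >>= 1
    return R

def mat_inv(A):
    det = (A[0][0] * A[1][1] - A[0][1] * A[1][0]) % P
    di = pow(det, -1, P)                  # adjugate formula for 2x2
    return [[A[1][1] * di % P, -A[0][1] * di % P],
            [-A[1][0] * di % P, A[0][0] * di % P]]

def conj(K, M):                           # CSP-style encryption K M K^{-1}
    return mat_mul(mat_mul(K, M), mat_inv(K))

def rand_invertible():
    while True:
        A = [[random.randrange(P) for _ in range(2)] for _ in range(2)]
        if (A[0][0] * A[1][1] - A[0][1] * A[1][0]) % P:
            return A

while True:                               # trainer's pair: e*d = 1 mod |G|
    e = random.randrange(3, GROUP_ORDER, 2)
    try:
        d = pow(e, -1, GROUP_ORDER)
        break
    except ValueError:                    # retry until gcd(e, |G|) = 1
        continue

K_A, K_B = rand_invertible(), rand_invertible()   # providers' keys
M = rand_invertible()                             # coded model parameters

C_A = conj(K_A, M)                      # parameters under provider A's key
blinded = mat_pow(C_A, e)               # trainer: K_A M^e K_A^{-1}
stripped = conj(mat_inv(K_A), blinded)  # A removes their key, sees only M^e
rewrapped = conj(K_B, stripped)         # B re-wraps: K_B M^e K_B^{-1}
C_B = mat_pow(rewrapped, d)             # trainer unblinds: K_B M K_B^{-1}

assert C_B == conj(K_B, M)              # now decryptable by K_B alone
print("key switching succeeded")
```

By Lagrange's theorem, $M^{ed} = M^{1 + k\,|GL_2(\mathbb{F}_p)|} = M$, which is exactly why the final exponentiation by $d$ unblinds correctly.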

3.15. Correctness

Since $C_A = K_A M K_A^{-1}$, we have
$$C_A^{\,e} = \left(K_A M K_A^{-1}\right) \left(K_A M K_A^{-1}\right) \cdots \left(K_A M K_A^{-1}\right) = K_A M^e K_A^{-1}.$$

Thus, $K_A^{-1} C_A^{\,e} K_A = M^e$.

Similarly, because $ed \equiv 1 \pmod{|\mathbb{G}|}$,
$$\left(K_B M^e K_B^{-1}\right)^d = K_B M^{ed} K_B^{-1} = K_B M^{1 + k|\mathbb{G}|} K_B^{-1},$$
where $k$ is the integer such that $ed = 1 + k|\mathbb{G}|$.

During the encoding process in $\mathcal{M}^*$, it is easy to choose the random numbers such that the lifted matrix $M$ is invertible. Considering that the lifted matrices are closed under multiplication over the Galois domain, the space of lifted messages constitutes a finite cyclic group $\mathbb{G}$. According to the Lagrange theorem, it can be seen that $M^{|\mathbb{G}|} = 1_{\mathbb{G}}$; thus $M^{ed} = M^{1 + k|\mathbb{G}|} = M$. By right-shifting $l_2$ bits on the elements of $K_B M K_B^{-1}$, we obtain the parameters encrypted purely under $K_B$ at the original precision.

3.16. Security

Note that, after receiving $M^e$, the trainer can trivially compute $\left(M^e\right)^d = M$ to recover the message. Nevertheless, since the model parameters are their own intellectual property, such an operation does not conflict with our security goal.

As for the data providers, they can only witness an exponential form of the plaintext (i.e., $M^e$). According to the hardness of DLP, no information about the message $M$ will be exposed.

4. Privacy-Preserving Machine Learning with Multiple Data Providers

To preserve privacy in machine learning, many cryptographic training and classification/regression methods have been proposed for the scenario of a single data provider. In most cases, however, data should be sourced from multiple providers to guarantee the generality of training. Therefore, we present a secure machine learning mechanism with the capacity for training as well as classification/regression, in consideration of both data and parameter privacy.

As for training, the cloud is supposed to obtain the model parameters with the help of labeled data. During the initialization phase, the trainer computes the pair $(e, d)$ for key switching, and each data provider generates their own symmetric key for homomorphic training.

Denote the encoded training data owned by the first provider as $D_1$ and the system parameters as $W$. The server primarily blinds the initialized system coefficients (which may contain private intellectual property) and delivers them to the first data provider, who executes the switching protocol and encrypts their training data under $K_1$ as the response. On the encrypted data and coefficients corresponding to the same key $K_1$, the cloud can thus achieve updated system parameters decryptable by $K_1$.

For clarity, we describe the above processes as shown in Table 1.

Note that key switching is a protocol that should be carried out jointly by the data provider and the cloud.

To make the updated coefficients homomorphically computable with data encrypted by the following providers, we can exploit the key switching scheme to re-encrypt them. Without loss of generality, the updated parameters under the key of provider $i$ will be represented as $W_i$. By means of $\mathrm{Switch}_A$ and $\mathrm{Switch}_B$, the cloud can obtain the re-encrypted coefficients with the help of successive providers. After receiving the encrypted data from the next provider, the cloud can compute over both ciphertexts, since they are now encrypted by the same key. In consideration of the final parameters, the cloud needs to execute $\mathrm{Switch}_A$ with the last provider and then raises the blinded result to the power $d$ to restore the plain parameters.
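The overall orchestration can be summarized by the toy loop below, which tracks only which key protects the parameters at each step; the cryptography is mocked, so this illustrates the protocol flow of Tables 1 and 2 rather than any security property.

```python
# Mocked multi-provider training loop (protocol flow only, no real crypto).
from dataclasses import dataclass

@dataclass
class Ciphertext:
    payload: float
    key: str                     # label of the key currently protecting payload

def train_round(params: Ciphertext, data: Ciphertext) -> Ciphertext:
    assert params.key == data.key, "homomorphic ops need one common key"
    return Ciphertext(params.payload + 0.1 * data.payload, params.key)

def switch_key(params: Ciphertext, new_key: str) -> Ciphertext:
    # Stands in for Switch_A (unwrap to M^e) followed by Switch_B (rewrap).
    return Ciphertext(params.payload, new_key)

providers = [("k1", 1.0), ("k2", 2.0), ("k3", 3.0)]   # (key label, mock data)
params = Ciphertext(0.0, "k1")                        # initial coefficients
for i, (key, data) in enumerate(providers):
    params = train_round(params, Ciphertext(data, key))
    if i + 1 < len(providers):
        params = switch_key(params, providers[i + 1][0])
# A final Switch_A plus unblinding by d would hand the plain result to the trainer.
print("final parameters recovered by trainer:", params.payload)
```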

The subsequent training and recovering processes are presented in Table 2.

The classification/regression process is straightforward: on encrypted data and system parameters under the same key, the cloud can homomorphically compute the encrypted output. By decrypting the received result, the user obtains the classification/regression outcome. This process can be found in Table 3.

5. Experiment Analysis

We drew support from the power load data of Chongqing Tongnan Electric Power Co., Ltd., dating from May 4 to May 10, 2015, to verify the effectiveness of our training method. A short-term electrical load prediction model is also tested by virtue of 96 historical data pieces sampled during 4 consecutive days. The original machine learning model is exactly the same as that of [29], which takes no account of privacy. Our experiment environment is shown in Table 4.

To simulate the scenario of multiparty machine learning, we divided the data into three parts and realized the training process corresponding to 3 different keys. To prove that our method does no harm to the accuracy of the trained network, we compared the prediction results directly achieved via the original model (without privacy preservation) with those of ours (privacy-preserving scheme), as shown in Figure 1. Figure 1 illustrates that the two results are completely consistent.

The experimental results are shown in Table 5: our scheme can perform encrypted training and prediction for multiple data providers in general machine learning. As for the efficiency of training and prediction, our scheme is 73578 and 12000 times slower, respectively, than its plain version. Nevertheless, since the server is always powerful in computational capacity and the data providers only have to carry out trivial matrix multiplications, our scheme is practical in cloud environments. Moreover, if some accuracy loss is tolerable, we can shorten the ciphertext to make it more efficient.

In terms of communication overheads, encrypted data for training or prediction are 18 times larger than the plain messages. In each iteration, the cloud should also exchange the ciphertexts of system parameters with two successive data providers, and these ciphertexts are likewise 18 times the size of the original coefficients. Considering that the expansion rate is modest and the number of system parameters is quite limited, the communication burden causes only slight performance degradation.

6. Conclusions

In this paper, we presented a privacy-preserving machine learning method that works over multiple data providers. Thanks to the hardness of the conjugate search problem, data can be homomorphically processed for training or classification/regression under the same key. It is worth mentioning that we resolved the intrinsic conflict between IND-CPA security and homomorphic comparison (without decryption) by specifically encoding the data that are allowed to be compared. To support training among multiple data providers, a key switching technology was also proposed based on the difficulty of the discrete logarithm problem and the Lagrange theorem, which evades the necessity of multikey homomorphic computation. Experiments illustrated that the accuracy of machine learning is not affected by the privacy capability of our scheme. The expansion rates of computation/communication complexity are small enough to make the scheme practical in cloud environments.

Data Availability

Our dataset comes from Chongqing Tongnan Electric Power Co., Ltd. (telephone: 023-44559308; official website: http://www.12398.gov.cn/html/information/753078881/753078881201200006.shtml).

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors would like to thank Darong Huang and Yang Liu for their comments and suggestions. This work was supported in part by the National Natural Science Foundation of P.R. China under Grants 61573076, 61703063, and 61903053, the Science and Technology Research Project of the Chongqing Municipal Education Commission of P.R. China under Grants KJZD-K201800701, KJ1705121, and KJ1705139, and the Program of Chongqing Innovation and Entrepreneurship for Returned Overseas Scholars of P.R. China under Grant cx2018110.

Supplementary Materials

data.txt: contains the data used for training and prediction in this research; it is taken from the power load data of Chongqing Tongnan Electric Power Co., Ltd., dating from May 4 to May 10, 2015. (Supplementary Materials)