Special Issue: Cryptographic Schemes and Protocols for Artificial Intelligence
Practical Privacy Preserving-Aided Disease Diagnosis with Multiclass SVM in an Outsourced Environment
With the rapid development of cloud computing and machine learning, training models on outsourced data and offering online-aided disease diagnosis from the cloud has great application prospects. However, training and diagnosis in an outsourced environment pose serious challenges to data privacy. Many privacy preserving machine learning schemes have been proposed and much progress has been made, but providing strong security while keeping the client load low remains difficult. In this paper, we propose a complete privacy preserving outsourced multiclass SVM training and aided disease diagnosis scheme. We first design efficient basic operation algorithms for encrypted data and use them to build an efficient and privacy preserving SVM model training protocol. We then propose a secure maximum finding algorithm and a secure comparison algorithm, and design an efficient online-aided disease diagnosis scheme based on the BFV cryptosystem and a blinding technique. Detailed security analysis shows that our scheme protects the privacy of each participant. The experimental results illustrate that our proposed scheme significantly reduces the computation overhead compared with previous similar works. Most of the operations of aided disease diagnosis are completed by the cloud servers, and the client only needs to perform a small number of encryption and decryption operations. The overall computation overhead is 0.175 s, and the efficiency of online-aided disease diagnosis is improved by 85.4%. At the same time, our proposed scheme provides multiclass diagnosis results, which can better assist doctors in their treatment.
1. Introduction
Machine learning (ML) uses computer systems to build mathematical models from sample data with statistical methods and to make predictions or decisions without being explicitly programmed. ML has shown significant advantages in disease diagnosis and brings increasing convenience to the prevention and treatment of diseases.
With the rapid development of cloud computing technology, cloud service providers (CSPs) offer powerful computation and huge storage space, and can provide data processing, model training, diagnosis services, deployment, and other intelligent solutions based on machine learning. In this context, local clients outsource their medical data and machine learning models to a CSP without having to build their own large-scale infrastructure and computing resources. The cloud can train a machine learning model and provide an aided disease diagnosis service using the outsourced medical data and models, which helps doctors improve diagnosis and treatment decisions and provides patients with an online disease diagnosis service. A typical cloud platform machine learning system architecture is shown in Figure 1.
However, the security and privacy of outsourced data face various threats, which makes people reluctant to use CSP services. These threats mainly concern the leakage of the outsourced data, the machine learning models of the model owners, the users’ requests, and the diagnosis results. The leakage of medical information may cause irreversible losses or escalate into a major incident. Therefore, preserving security and privacy in cloud-based model training and diagnosis has become a major challenge.
To address the abovementioned challenges, many scholars have proposed various schemes, such as a secure outsourced classification scheme based on the logistic regression model, an electronic medical disease risk prediction scheme based on the naive Bayes model, and other secure disease prediction schemes based on machine learning technology [3–5]. As a machine learning algorithm with high computational efficiency and good predictive accuracy, the support vector machine (SVM) has achieved high classification accuracy and efficiency in the medical field [6, 7]. However, the existing privacy preserving SVM schemes mainly implement secure prediction [8–11], and there are few privacy preserving SVM schemes for secure training. Most of the existing privacy preserving SVM schemes are designed for binary classification, which can only determine whether the patient has the disease but cannot distinguish among multiple classes of the disease. In addition, multiclass SVM requires more computation, which reduces efficiency.
To solve the abovementioned problems, we propose an efficient and privacy preserving online disease diagnosis scheme based on the SVM algorithm. Our scheme achieves multiclass SVM training on the encrypted outsourced data from multiple data owners and provides users with privacy preserving disease diagnosis. In summary, our contributions are as follows:
(1) Efficient and secure basic operation algorithms: Based on the Paillier cryptosystem, we design several basic operation algorithms to realize secure outsourced data storage and computation, including a secure aggregation algorithm and a secure multiplication algorithm, among others. These secure computation algorithms are the building blocks of our proposed training protocol.
(2) Complete machine learning process under privacy preservation: Targeting the general machine learning process and the goal of privacy preservation, we propose a privacy preserving outsourced multiclass SVM model training and online-aided disease diagnosis scheme. Unlike existing privacy preserving schemes that support only training or only diagnosis, our proposed scheme covers both and thus extends the functionality of privacy preserving machine learning systems.
(3) Efficient and secure online-aided disease diagnosis: Based on the BFV cryptosystem, we design a secure maximum finding algorithm and a secure comparison algorithm, and provide an efficient and privacy preserving aided disease diagnosis scheme. Experimental results illustrate that our proposed scheme significantly reduces the computation cost compared with existing similar schemes, which makes it suitable for practical scenarios where a large number of users request diagnosis at the same time.
(4) Low overhead for the local client: The local client only needs to perform encryption and decryption operations in our proposed scheme, which minimizes the storage and computation overhead of the local client and makes full use of the computation power of the cloud servers.
The remainder of this paper is organized as follows. In Section 2, we review related works. In Section 3, we review the Paillier cryptosystem, BFV cryptosystem, and SVM algorithm as preliminaries. In Section 4, we give a system overview. Then, we propose our scheme in Section 5. In Section 6, we analyze the security of our proposed scheme. In Section 7, we evaluate its performance. Finally, we conclude this paper in Section 8.
2. Related Work
In this section, we summarize the privacy preserving machine learning schemes in recent years.
With the arrival of the big data era, machine learning has been widely used in many fields. Among them, the application of machine learning to intelligent disease diagnosis has developed rapidly, and disease diagnosis schemes based on various machine learning classification algorithms have been proposed [14–17]. At the same time, the problem of privacy disclosure in the machine learning process is becoming more and more serious, so many scholars have studied privacy preserving machine learning.
Triastcyn and Faltings proposed Bayesian differential privacy, which considers the distribution of data and provides a more practical privacy guarantee. Laur et al. proposed a privacy preserving support vector machine scheme based on secure multiparty computation. For each training or testing phase, their scheme involves multiple parties holding encrypted data and secret shares obtained during training. Based on additive homomorphic encryption, Mandal and Gong designed a privacy preserving scheme that performs gradient descent across data owners and a cloud server, achieving secure linear and logistic regression model training. Shen et al. used blockchain technology to establish a secure and reliable data sharing platform among multiple data providers and constructed a privacy preserving support vector machine training scheme based on the Paillier cryptosystem. However, in their scheme, the data provider needs to interact with the cloud server to complete the computation, so the computation cost of the data provider is large. Liu et al. proposed a privacy preserving clinical decision support system using the naive Bayes (NB) classifier, in which the BGV homomorphic encryption system significantly improved the performance. A framework for securely and efficiently outsourcing decision tree inference has also been proposed. Tan et al. proposed a system for privacy-preserving machine learning that implements all operations on the GPU and thus makes full use of its computing power. Zheng et al. combined random permutation and arithmetic secret sharing through the compute-after-permutation technique and built a privacy-preserving machine learning framework. Li et al. proposed a verifiable privacy-preserving machine learning prediction scheme for edge-enhanced HCPSs, which outputs verifiable prediction results for users without privacy leakage. Ma et al. designed a lightweight privacy-preserving medical diagnosis mechanism on the edge called LPME.
Among these algorithms, SVM is a research hotspot and has been widely used in different data mining and machine learning schemes. Most of the existing privacy preserving SVM schemes are based on three main privacy preserving technologies: differential privacy (DP), secure multi-party computation (SMC), and homomorphic encryption (HE). DP can significantly improve computation and communication efficiency, but at the cost of model accuracy because random noise is added [28, 29]. Zhang et al. proposed a general differential privacy model fitting method based on the genetic algorithm, but it reduces the decision accuracy of the model. SMC alleviates the computational limitations but requires more interaction between participants, which leads to expensive communication overhead [31, 32]. Yu et al. first proposed a privacy preserving SVM classification method for vertically partitioned data. They use SMC technology to obtain the global model, so as to protect the local private data and hide the classification model. However, this method requires at least three parties to participate in the computation, which is complex and inefficient. HE can compute directly on encrypted data, but it also incurs high computation costs [34, 35]. Bajard et al. used HE technology to protect the decision model and medical data, but their scheme has a high computational load. Wang et al. proposed an efficient privacy preserving outsourced SVM scheme for Internet of medical things deployment, which protects training data privacy and guarantees the security of the trained SVM model. Therefore, it is necessary to design an efficient and secure SVM scheme for cloud online disease diagnosis services.
In this paper, we propose a new privacy preserving scheme for training and disease diagnosis of the multiclass SVM algorithm. We make a comparison analysis with the schemes in [38–40]. The experimental results demonstrate that our scheme has more practical application values.
3. Preliminaries
In this section, we describe the techniques that form the basis of our scheme, including the Paillier cryptosystem, BFV cryptosystem, and SVM algorithm.
3.1. Paillier Cryptosystem
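In the training phase, the data are encrypted by the Paillier cryptosystem. The Paillier cryptosystem is a public key cryptosystem with additive homomorphic operations. We introduce it as follows, using the standard notation of the scheme.
(i) Key generation: Set the security parameter $\kappa$. Choose two big primes $p$ and $q$ and compute $N = pq$; $\lambda = \mathrm{lcm}(p-1, q-1)$ is the Carmichael function of $N$. Choose a random number $g \in \mathbb{Z}_{N^2}^{*}$ and compute $\mu = \left(L(g^{\lambda} \bmod N^2)\right)^{-1} \bmod N$, where $L(x) = (x-1)/N$. The public key is $pk = (N, g)$. The private key is $sk = (\lambda, \mu)$.
(ii) Encryption: Given a message $m \in \mathbb{Z}_N$, $m$ is encrypted with $pk$. The ciphertext is expressed as $c = [\![m]\!] = g^{m} r^{N} \bmod N^2$, where $r \in \mathbb{Z}_N^{*}$ is a random number.
(iii) Decryption: According to the key generation stage and Carmichael’s theorem, $r^{\lambda N} \equiv 1 \pmod{N^2}$. So $c^{\lambda} \equiv g^{\lambda m} \pmod{N^2}$. Then, $m = L(c^{\lambda} \bmod N^2) \cdot \mu \bmod N$.
(iv) Homomorphic computation: Given two ciphertexts $[\![m_1]\!]$ and $[\![m_2]\!]$ under the same public key $pk$, the homomorphic computations are defined as $[\![m_1]\!] \cdot [\![m_2]\!] = [\![m_1 + m_2]\!]$ and $[\![m_1]\!]^{k} = [\![k \cdot m_1]\!]$.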
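The four operations above can be tried out with a minimal sketch. It assumes toy primes (far too small for real use) and the common generator choice g = N + 1; the function names are illustrative and are not taken from the paper.

```python
import math
import secrets

def L(x, n):
    """The function L(x) = (x - 1) / n used in key generation and decryption."""
    return (x - 1) // n

def keygen(p=1789, q=1861):
    """Toy key generation; the tiny default primes are an assumption for illustration."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)              # Carmichael function of n
    g = n + 1                                 # assumed generator choice
    mu = pow(L(pow(g, lam, n * n), n), -1, n)
    return (n, g), (lam, mu)

def encrypt(pk, m):
    n, g = pk
    r = secrets.randbelow(n - 1) + 1          # random r in [1, n - 1]
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return (pow(g, m % n, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(pk, sk, c):
    n, _ = pk
    lam, mu = sk
    return (L(pow(c, lam, n * n), n) * mu) % n

def add(pk, c1, c2):
    """Multiplying ciphertexts adds the plaintexts."""
    n, _ = pk
    return (c1 * c2) % (n * n)

def scalar_mul(pk, c, k):
    """Raising a ciphertext to the power k multiplies the plaintext by k."""
    n, _ = pk
    return pow(c, k % n, n * n)

pk, sk = keygen()
c1, c2 = encrypt(pk, 15), encrypt(pk, 27)
assert decrypt(pk, sk, add(pk, c1, c2)) == 42
assert decrypt(pk, sk, scalar_mul(pk, c1, 3)) == 45
```

Because encryption is randomized, two encryptions of the same message differ, yet both decrypt to the same value.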
3.2. BFV Cryptosystem
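In the prediction phase, the data are encrypted by the BFV cryptosystem. The BFV cryptosystem is a leveled-FHE public key cryptosystem based on RLWE, which supports an unlimited number of additive homomorphic operations and a limited number of multiplicative homomorphic operations. We describe it in the standard notation of the scheme.
(i) Key generation: Generate a polynomial $s \leftarrow R_2$. The private key is defined as $sk = s$. Then, generate a polynomial $a$ from the ciphertext polynomial space $R_q$. The polynomial $a$ is used to generate the public key. Define a noise polynomial $e \leftarrow \chi$, where $\chi$ expresses the Gaussian distribution. The public key is $pk = ([-(a \cdot s + e)]_q, a)$.
(ii) Encryption: The message is $m \in R_t$. Define $p_0 = pk[0]$, $p_1 = pk[1]$, $u \leftarrow R_2$, $e_1, e_2 \leftarrow \chi$, and $\Delta = \lfloor q/t \rfloor$. The ciphertext is computed as $ct = ([p_0 \cdot u + e_1 + \Delta \cdot m]_q, [p_1 \cdot u + e_2]_q)$.
(iii) Decryption: To decrypt the ciphertext $ct$, define $c_0 = ct[0]$ and $c_1 = ct[1]$. The message is computed as $m = \left[\left\lfloor \frac{t \cdot [c_0 + c_1 \cdot s]_q}{q} \right\rceil\right]_t$.
(iv) Homomorphic computation: The BFV cryptosystem supports ciphertext batch processing. Define two $n$-dimensional vectors $\mathbf{a} = (a_1, \ldots, a_n)$ and $\mathbf{b} = (b_1, \ldots, b_n)$ encrypted under the public key $pk$. The homomorphic computations act slot by slot: $\langle \mathbf{a} \rangle \oplus \langle \mathbf{b} \rangle = \langle (a_1 + b_1, \ldots, a_n + b_n) \rangle$ and $\langle \mathbf{a} \rangle \otimes \langle \mathbf{b} \rangle = \langle (a_1 b_1, \ldots, a_n b_n) \rangle$.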
3.3. SVM Algorithm
SVM is a classical supervised learning algorithm for binary classification problems. The SVM algorithm finds the best separating hyperplane. The classifier is a decision function $f(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^{T}\mathbf{x} + b)$, where $+1$ expresses the positive class and $-1$ expresses the negative class.
There are two training methods for the SVM model: one is based on the SMO algorithm and the other is based on the gradient descent algorithm. Because the steps of the SMO algorithm are more complex, they incur high computation costs on encrypted data. Therefore, we choose gradient descent to realize the privacy preserving SVM model training. In SVM training based on gradient descent, the objective function needs to be minimized. For each training sample, a condition on the output of the decision function determines whether the sample is classified correctly. When the classification is correct, the model parameters do not need to be updated. When the classification is incorrect, the model parameters need to be updated.
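The update rule described above can be illustrated in plaintext. The sketch below assumes the usual hinge-loss formulation, in which a sample counts as correctly classified when y(w·x + b) ≥ 1, and it omits any regularization term; the threshold, learning rate, and function names are assumptions rather than details taken from the paper.

```python
def train_linear_svm(samples, labels, lr=0.1, epochs=20):
    """Plaintext gradient descent for a linear SVM (hinge loss, no regularizer)."""
    dim = len(samples[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Condition violated: move the hyperplane toward this sample's side.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
            # Otherwise the sample is handled correctly and nothing is updated.
    return w, b

def predict(w, b, x):
    """Decision function: the sign of w.x + b."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

samples = [(2, 2), (3, 3), (-2, -2), (-3, -3)]
labels = [1, 1, -1, -1]
w, b = train_linear_svm(samples, labels)
predictions = [predict(w, b, x) for x in samples]
```

In the encrypted setting, the inner product, the scalar multiplication by the label, and the sign test in this loop are exactly the steps that must be replaced by secure building blocks.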
4. System Overview
In this section, we will introduce our system model, security goals, and threat model.
4.1. System Model
Our system model should achieve the privacy preserving training and online disease diagnosis process. Therefore, our system model is designed as shown in Figure 2.
There are six participants in our system model: the trusted authority (TA), medical centers (MCs), cloud storage server (CSS), cloud computation server (CCS), diagnosis service provider (DSP), and users.
(i) Trusted authority (TA): TA is the fully trusted party of the whole system and generates and distributes keys for the other participants. After initialization, TA stays offline.
(ii) Medical centers (MCs): Each MC has its own local medical data. To reduce the local storage cost, MCs outsource the medical data to CSS for storage.
(iii) Cloud storage server (CSS): CSS stores and manages the outsourced data and performs privacy preserving computation with its powerful computation resources.
(iv) Cloud computation server (CCS): CCS assists CSS in completing privacy preserving computation.
(v) Diagnosis service provider (DSP): DSP wants to train a machine learning model on the outsourced data from MCs and provide online-aided disease diagnosis for users. Due to its limited computation and communication ability, DSP outsources the training and diagnosis to CSS.
(vi) Users: Users are patients or doctors who have unlabeled samples and want to get diagnosis results. The users send encrypted diagnosis requests to CSS and obtain encrypted results, which they decrypt with their own private keys.
4.2. Security Goals
To meet the security requirements of outsourced training and diagnosis, our scheme achieves the following security goals.
(i) Medical data privacy: The outsourced data of MCs will not be leaked to other participants during the whole machine learning process.
(ii) Model privacy: Other participants cannot learn any useful information about the model of DSP.
(iii) User privacy: The diagnosis requests and results of users will not be acquired by other participants.
(iv) Intermediate results privacy: During the execution of the protocols, no participant can infer other participants’ sensitive information from the intermediate results.
In our scheme, the training and diagnosis processes are completed by CSS and CCS. All participants are semi-honest (or honest-but-curious). Specifically, they will honestly implement the secure computation protocols, but they will try to analyze the sensitive data and intermediate results to infer the useful information of other participants. Like the previous works, we assume that CSS and CCS will not collude. Because CSS and CCS belong to different commercial companies, they will not collude with each other for their own reputation.
4.3. Threat Model
In this paper, we define three attacks in our system model.
(i) Eavesdropping attack: An adversary can eavesdrop on and analyze data during transmission. The data transmission includes the outsourcing process and the interaction between participants during protocol execution.
(ii) Honest-but-curious attack: All participants execute the protocols honestly, but they try to infer useful information during the execution.
(iii) Client-collusion attack: In the training and diagnosis process, some clients may collude to analyze the useful information of other participants.
5. Proposed Scheme
In this section, we describe the proposed scheme in detail. Our scheme mainly includes system initialization, privacy preserving machine learning training, and online disease diagnosis.
In order to accurately describe our proposed scheme, we give the description of used notations in Table 1.
5.1. System Initialization
In the system initialization phase, TA generates system parameters and distributes the parameters for MCs, CSS, CCS, and DSP, respectively. TA sends the parameters through the secure communication channel. Then, TA will stay offline. We assume that there are MCs in our system. Because the Paillier cryptosystem and BFV cryptosystem can only encrypt integers, the floating point numbers and negative numbers should be converted into integers. Therefore, all participants should make data conversion before encrypting their sensitive information.
5.1.1. Generate System Parameters
(1) Generate a public-private key pair of the Paillier cryptosystem and a public-private key pair of the BFV cryptosystem for CCS. The BFV plaintext space is . The public keys are public and the private keys are sent to CCS.
(2) Generate a public-private key pair of the Paillier cryptosystem and a public-private key pair of the BFV cryptosystem for CSS. The BFV plaintext space is . The public keys are public and the private keys are sent to CSS.
(3) Generate a public-private key pair of the Paillier cryptosystem for DSP. The public key is public and the private key is sent to DSP.
(4) Generate a random integer . TA randomly splits to integers, satisfying and sends to . Then, generate two lists and . Each list has random integers, . Each element in and represents the ID of each MC. When sends authentication to CSS, will hide and with and , respectively. The is sent to . and are sent to CSS.
5.1.2. Data Conversion
In the machine learning application scenario, data and model parameters contain floating point numbers and negative numbers.
For a floating point number, we scale it by a factor determined by the precision of floating point numbers so that it becomes an integer. For negative numbers, we divide the plaintext space of the Paillier or BFV cryptosystem into two parts, which is possible because all variables and intermediate results in the process of training and prediction are much smaller than the size of the plaintext space. An integer in one part represents a positive integer and an integer in the other part represents a negative integer. When a negative integer is encrypted, it is first converted to its representative in the negative part. If a value is both a floating point number and a negative number, it is first converted into a negative integer.
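One common way to realize this conversion is sketched below. It assumes a decimal scaling factor 10^precision and that residues above half of the modulus stand for negative values; both are assumptions about details that the text leaves open, and the names `encode` and `decode` are hypothetical.

```python
def encode(value, precision, modulus):
    """Scale a float to an integer and map negatives into the upper half of the plaintext space."""
    scaled = round(value * 10 ** precision)
    return scaled % modulus                  # a negative x becomes modulus - |x|

def decode(encoded, precision, modulus):
    """Invert encode(): residues above modulus // 2 are read as negative integers."""
    if encoded > modulus // 2:
        encoded -= modulus
    return encoded / 10 ** precision

MODULUS = 2 ** 31 - 1                        # stand-in for the plaintext modulus
a = encode(2.5, 2, MODULUS)                  # 250
b = encode(-4.0, 2, MODULUS)                 # MODULUS - 400
total = (a + b) % MODULUS                    # addition in the plaintext space
```

Decoding `total` gives back -1.5, which shows that additions performed on encoded values remain meaningful as long as the true results stay well below half of the modulus.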
5.2. Privacy Preserving Machine Learning Training
The privacy preserving machine learning training process is completed by CSS and CCS. We assume that the amount of outsourced data is .
5.2.1. Local Data Outsourcing
To protect the privacy of MCs’ local data, MCs will encrypt the data before outsourcing. The outsourcing process of is as follows.(1) generates a random integer . Computing as public key and the private key is .(2)Computing .(3)For each plaintext data, such as , will make a data conversion as mentioned in Section 5.1. Then, compute to encrypt and outsource the encrypted data to CSS for storage.
5.2.2. Secure Basic Building Blocks for Training
To complete the privacy preserving outsourced training, we construct some algorithms as basic building blocks based on the Paillier cryptosystem: secure data aggregation (Block_1), secure multiplication algorithm (Block_2), secure inner product algorithm (Block_3), secure scalar multiplication of vector algorithm (Block_4), and secure symbol judgment algorithm (Block_5). The algorithms will be executed with CSS and CCS.
(1) Secure data aggregation algorithm (Block_1). CSS needs to aggregate MCs’ outsourced data before starting machine learning training. The algorithm works as follows and is described in Algorithm 1.(1)CSS sends a training request to , .(2)After receiving the training request, computes as authentication (The indicates that CSS is allowed to use the outsourced data of for training) and sends to CSS.(3)CSS obtains the of through the and computes . It should be noted that can be obtained only after all MCs have sent their authentication. Then, CSS computes and completes the aggregation.
(2). Secure multiplication algorithm (Block_2). Given two encrypted integers and , the algorithm needs to compute . The algorithm works as follows and is described in Algorithm 2.
(1) CSS generates two random integers and . Then, it computes by applying the additive homomorphism and sends the results to CCS.
(2) CCS generates a random integer . It decrypts by using . Then, it encrypts with to get . It computes and encrypts with , and sends and to CSS.
(3) CSS decrypts with and computes . Then, it computes by applying the additive homomorphism to obtain the result.
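The blinding idea behind a two-server multiplication can be illustrated with a simplified textbook variant. The sketch below is not the exact protocol of Algorithm 2: it assumes a single Paillier key pair whose private key is held only by CCS, toy primes, and the generator g = N + 1, and the names `enc`, `dec`, `ccs_multiply`, and `secure_multiply` are hypothetical.

```python
import math
import secrets

# Toy Paillier with g = N + 1 (tiny primes, for illustration only).
P, Q = 1789, 1861
N, N2 = P * Q, (P * Q) ** 2
LAM = math.lcm(P - 1, Q - 1)
MU = pow(LAM, -1, N)

def enc(m):
    r = secrets.randbelow(N - 1) + 1
    while math.gcd(r, N) != 1:
        r = secrets.randbelow(N - 1) + 1
    return (1 + (m % N) * N) * pow(r, N, N2) % N2

def dec(c):
    """Decryption; in this sketch only CCS is assumed to hold LAM and MU."""
    return (pow(c, LAM, N2) - 1) // N * MU % N

def ccs_multiply(blinded_x, blinded_y):
    """CCS side: decrypt the blinded operands, multiply them, and re-encrypt."""
    return enc(dec(blinded_x) * dec(blinded_y))

def secure_multiply(cx, cy):
    """CSS side: derive an encryption of x*y from encryptions of x and y."""
    r1, r2 = secrets.randbelow(N), secrets.randbelow(N)
    blinded_x = cx * enc(r1) % N2            # encrypts x + r1
    blinded_y = cy * enc(r2) % N2            # encrypts y + r2
    h = ccs_multiply(blinded_x, blinded_y)   # encrypts (x + r1)(y + r2)
    # Strip the blinding terms: x*y = h - r2*x - r1*y - r1*r2 (mod N).
    h = h * pow(cx, N - r2, N2) % N2
    h = h * pow(cy, N - r1, N2) % N2
    return h * enc(-r1 * r2) % N2

product = dec(secure_multiply(enc(12), enc(34)))
```

CCS only ever sees the blinded values x + r1 and y + r2, and CSS only ever handles ciphertexts, which is the property that the non-collusion assumption protects.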
(3). Secure inner product algorithm (Block_3). Given two encrypted vectors, the algorithm computes their encrypted inner product and is described in Algorithm 3.
(4). Secure scalar multiplication of vector algorithm (Block_4). Given an encrypted vector and an encrypted integer, the algorithm computes the encrypted product of the vector and the integer and is described in Algorithm 4.
(5). Secure symbol judgment algorithm (Block_5). Given an encrypted integer , the algorithm will compute the sign of . Let if else . The algorithm works as follows and is described in Algorithm 5.
(1) CSS chooses a random integer . Then, it computes by applying the additive homomorphism and sends to CCS.
(2) CCS decrypts . Let if else . Then, it sends to CSS.
(3) CSS decrypts and obtains the symbol .
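A simplified view of sign judgment with blinding is sketched below. It assumes that CSS blinds the value multiplicatively with a random positive integer, that negative values are encoded in the upper half of the plaintext space, and that CCS returns the sign in the clear; the actual Algorithm 5 returns it encrypted under a key of CSS. All names and the blinding bound are hypothetical.

```python
import math
import secrets

# Toy Paillier with g = N + 1 (tiny primes, for illustration only).
P, Q = 1789, 1861
N, N2 = P * Q, (P * Q) ** 2
LAM = math.lcm(P - 1, Q - 1)
MU = pow(LAM, -1, N)

def enc(m):
    r = secrets.randbelow(N - 1) + 1
    while math.gcd(r, N) != 1:
        r = secrets.randbelow(N - 1) + 1
    return (1 + (m % N) * N) * pow(r, N, N2) % N2

def dec(c):
    """Decryption; in this sketch only CCS is assumed to hold LAM and MU."""
    return (pow(c, LAM, N2) - 1) // N * MU % N

def ccs_sign(blinded):
    """CCS side: decrypt the blinded value and report only its sign."""
    value = dec(blinded)
    return 1 if value <= N // 2 else -1      # upper half encodes negatives

def secure_sign(cx, bound=1000):
    """CSS side: blind x with a random positive factor before asking CCS."""
    r = secrets.randbelow(bound) + 1         # r in [1, bound]
    blinded = pow(cx, r, N2)                 # encrypts r * x
    return ccs_sign(blinded)

signs = [secure_sign(enc(v)) for v in (37, -37, 0)]
```

The sketch is only valid while |x| times the blinding bound stays below half of the modulus; zero is reported as non-negative here.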
5.2.3. Privacy Preserving Outsourced Training with Multiclass SVM
In this section, we construct a privacy preserving outsourced training protocol to train a multiclass SVM model using the proposed building blocks. DSP outsources the training task to CSS and CSS completes the aggregation of outsourced data. Then, CSS and CCS complete the model training. After finishing the training, CSS transforms into . To achieve the transformation, we use the algorithm proposed in reference .
For multiclass SVM training, there are two methods: one-versus-rest (OvR) and one-versus-one (OvO). To improve efficiency and reduce the number of iterations, we choose the OvR method for training. We construct one binary SVM classifier for each class. The process is described in Algorithm 6.
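The one-versus-rest decomposition can be shown with a small plaintext example. The helper name `one_vs_rest_labels` is hypothetical, and the sketch covers only the relabeling step, not the encrypted training itself.

```python
def one_vs_rest_labels(labels):
    """Build one +1/-1 label vector per class for one-versus-rest training."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else -1 for y in labels] for c in classes}

# Four samples spread over three disease classes.
binary_tasks = one_vs_rest_labels([0, 2, 1, 2])
```

Each entry of `binary_tasks` is the label vector for one binary classifier, so a problem with k classes needs k training runs, whereas one-versus-one would need k(k-1)/2.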
5.3. Privacy Preserving Online-Aided Disease Diagnosis
The proposed diagnosis scheme consists of four steps: diagnosis outsourcing, secret diagnosis request generation, diagnosis value computation, and diagnosis result generation. The privacy preserving online-aided disease diagnosis is completed by CSS and CCS.
5.3.1. Diagnosis Outsourcing
To reduce the computation and communication overhead, DSP outsources the SVM model parameters to CSS and authorizes CSS to provide diagnosis service for users.
The SVM parameters of DSP are expressed as (There are classifiers),
For and the corresponding class result , DSP generates a -dimensional random integer vector and a random integer , . Then, DSP computes and to hide the parameters class results.
According to the combination of subtraction of random integer vectors, DSP constructs a combination table. The combination table has values, as shown in Table 2. The values in combination table are used to eliminate the blinding factors in subsequent computation.
DSP encrypts with , with , with and all values of combination table with . Then, DSP sends them as the outsourced parameters to CSS. After receiving the outsourced parameters, CSS decrypts and the combination table with . CSS computes as follows:
5.3.2. Secret Diagnosis Request Generation
For , the symptom is expressed as (The last 1 is added to facilitate the computation of vector inner product). The generates a -dimensional random integer vector and . Then, hides plaintext symptoms .
The encrypts symptom with and encrypts with . Let as the secret prediction request of .
Then, the sends to CSS.
5.3.3. Diagnosis Value Computation
In our proposed diagnosis scheme, it is a multiclassification problem, so it is necessary to compute the diagnosis value of each classification. After receiving the secret prediction request , CSS decrypts with . Then, it computes by the homomorphic operation of the BFV cryptosystem.
According to the decision function of the SVM algorithm, each diagnosis value is computed with one homomorphic multiplication and one homomorphic addition. Because the BFV encryption algorithm supports ciphertext packing, the operations can be batched, which significantly improves the computation efficiency. The process is described in Algorithm 7.
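In plaintext terms, the batched computation amounts to one slot-wise multiplication followed by a sum per classifier. The sketch below is a stand-in that ignores encryption and blinding; it assumes that the bias is packed as the last slot of each parameter vector, matching the trailing 1 appended to the symptom vector, and the name `diagnosis_values` is hypothetical.

```python
def diagnosis_values(classifiers, symptoms):
    """Compute w_j . x + b_j for every classifier j on one symptom vector."""
    x = list(symptoms) + [1]                 # trailing 1 absorbs the bias term
    values = []
    for w, b in classifiers:
        packed = list(w) + [b]               # parameters and bias in one vector
        products = [p * xi for p, xi in zip(packed, x)]   # slot-wise multiplication
        values.append(sum(products))         # sum over the slots
    return values

classifiers = [([1, 2], 3), ([0, -1], 5)]    # (weights, bias) per class
values = diagnosis_values(classifiers, [4, 5])
```

With packed ciphertexts, the slot-wise multiplication for all features is a single homomorphic operation, which is where the efficiency gain comes from.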
5.3.4. Diagnosis Result Generation
After computing the diagnosis values, CSS obtains encrypted diagnosis values and each value corresponds to a class result. Then, CSS needs to select the classification corresponding to the maximum value from the encrypted values as the diagnosis result.
Therefore, we design a secure maximum finding protocol and a secure comparison algorithm. CSS and CCS jointly execute the protocol.
(1). Secure maximum finding. CSS sets an initial maximum position . Then, CSS executes cycles and each cycle executes a secure comparison algorithm to continuously update the value.
After cycles, CSS obtains the final diagnosis result and converts into under the public key of . To achieve the transformation, we use the algorithm proposed in literature . Then, CSS sends to . The decrypts the encrypted result with . The process is described in Algorithm 8.
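The control flow of the maximum finding step is a linear scan driven by pairwise comparisons. The sketch below assumes one comparison per remaining candidate and takes the comparison as a callback, which in the real protocol would be the secure comparison run jointly by CSS and CCS; the names `find_max_position` and `is_greater` are hypothetical.

```python
def find_max_position(values, is_greater):
    """Return the index of the largest value using only pairwise comparisons."""
    position = 0                             # initial maximum position
    for i in range(1, len(values)):
        # Each cycle asks the comparison oracle about one candidate.
        if is_greater(values[i], values[position]):
            position = i
    return position

# Plaintext stand-in for the secure comparison.
best = find_max_position([3, 9, 4, 9], lambda a, b: a > b)
```

Because the scan only needs the outcome of each comparison, the values themselves can stay encrypted and blinded throughout; with a strict comparison, ties keep the earliest position.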
(2). Secure comparison (SC). For the i-th cycle, CSS computes . Then, according to and , computing . The corresponds to the value in the combination table and computing as follows:
At this time, has eliminated in .
CSS chooses equal random integers and . Computing . Then, CSS chooses different random integers, and computing . Summing all elements in to get and encrypting it with . CSS sends to CCS.
CCS decrypts with and with , . Then, summing each dimension, .
CCS removes from by computing . Let if , else .
CCS encrypts with and sends it to CSS. CSS decrypts it and if , updates the value of .
The process is described in Algorithm 9.