Abstract

Medical prediagnosis systems are now available online to give users quick, preliminary diagnostic information. The need for such systems is particularly evident in areas with too few health professionals. Because patient medical information is private and cloud diagnosis models are sensitive, the data, the models, and the communications must all be protected. Existing diagnosis systems can hardly provide satisfactory diagnostic accuracy while also ensuring comprehensive security and high efficiency. To solve these problems, we propose the Relief-minimum Wasserstein distance (Relief-MW) classification method and combine it with data encryption and BLS signatures to form a privacy-preserving, efficient online multiparty interactive medical prediagnosis scheme (OMPD). Theoretical analysis shows that OMPD provides high-precision prediagnosis services. Extensive experimental results demonstrate that OMPD not only improves diagnostic accuracy over the compared schemes but also reduces computational and communication overhead.

1. Introduction

With the rapid development of the mobile Internet, wearable devices, and the intelligent Internet of Things, online medical prediagnosis systems that can provide prediagnosis services and medical advice anytime and anywhere have received extensive research attention. Typically, an online medical prediagnosis system needs to provide high diagnostic accuracy along with strong privacy protection. To achieve these two goals at the same time, or at least to maintain an acceptable balance between them, multiple factors need to be considered. Firstly, a good classifier must be carefully selected from the existing classification algorithms to achieve high diagnostic accuracy, and it must be refined to capture the particular nature of the online diagnostic problem. The existing literature already provides candidate algorithms such as random forest [1], neural networks [2], and other methods [3-5], which have been widely used in medical diagnosis. However, these algorithms often suffer from either excessively long training and response times or unacceptably low diagnostic accuracy and privacy protection. Therefore, we need to reinvestigate the potential of existing classifiers and select one that can be modified to offer the best experimental performance in terms of computational overhead and privacy protection.

Secondly, to keep the medical data used by an online medical prediagnosis system protected from malicious users who may attack and profit from these data, we need to choose privacy protection schemes [6-15] that suit the security needs. There are three kinds of privacy protection schemes: anonymity protection, differential privacy protection, and homomorphic encryption. Anonymity protection (such as $k$-anonymity [9] and $l$-diversity [10]) simply erases users’ private information, which may decrease diagnostic accuracy because the deleted data cannot be used. Differential privacy protection [11-13] protects users’ privacy by adding noise to obfuscate the query response. However, this random noise can mask the critical information needed to boost diagnostic accuracy. Unlike the prior two schemes, homomorphic encryption [14, 15] can strictly protect privacy without destroying the original data. However, the encryption and decryption processes of homomorphic encryption usually incur a huge computational overhead. Therefore, for homomorphic encryption to work in online medical prediagnosis scenarios, considerable effort is still required to bring its computational burden down substantially without sacrificing much protection strength.

In addition, a multiparty interactive system needs to address communication security. This becomes particularly essential for online medical prediagnosis scenarios with vulnerable and insecure communication channels, where the messages transmitted between two parties can be eavesdropped on by malicious attackers. Hence, we need an efficient information security scheme to protect the communication of an online medical prediagnosis system. For this purpose, there are some existing methods [16-19], such as secure multiparty protocols, digital signatures, and other data encryption methods. In this paper, we choose a digital signature scheme for message authentication and combine it with public-key encryption. Encryption ensures that only the intended recipient can obtain the communication content through its private key, and the signature ensures that the content cannot be forged or tampered with undetected. The BLS short signature [16] is considered to be one of the most effective methods. This technology authenticates users’ identity information and prevents others from fraudulently using it.

Based on the above three points, it is difficult to combine a classification method with a privacy protection method while also saving computation and communication overhead to achieve high system efficiency. In this paper, we propose an online multiparty interactive medical prediagnosis service scheme with high efficiency, high precision, and privacy protection, called OMPD. It protects the private information of medical users, the large number of medical instances held by the hospital, and the diagnosis model in the cloud. Users can obtain online medical prediagnosis services even though the original data is not available in the cloud. The main contributions of this paper are as follows:
(1) OMPD can provide high-precision prediagnosis services. We introduce a data preprocessing method (simple data encryption and some conversion) based on the newly proposed Relief-MW classifier, and these data processing operations do not change the original classification accuracy of the classifier. To verify the classification accuracy of OMPD, we conducted accuracy analysis and experiments on two real data sets from the UCI machine learning repository (http://archive.ics.uci.edu/ml). Experimental results show that OMPD can provide high-precision services.
(2) OMPD is a three-party interactive system that can provide medical prediagnosis services with full-process protection. By preprocessing medical data and applying BLS signatures to communication information, OMPD guarantees the security of the private information of medical users, the instances in the hospital’s database, the diagnostic model in the cloud, and the interactions among the parties.

The structure of our work is as follows: Section 2 introduces some related works, and Section 3 introduces the Relief-MW classification method and BLS signatures. Section 4 introduces the entire detailed process of the OMPD. Section 5 carries out accuracy and safety analysis. The experimental evaluation is carried out in Section 6. Finally, we conclude in Section 7.

2. Related Work

In recent years, more and more people have paid attention to the efficiency and privacy of medical diagnosis services, and many solutions [20-25] have been proposed. For the privacy and security issues in online medical prediagnosis services, homomorphic encryption can protect the private information in medical data well and is widely used in various medical diagnosis schemes. For example, literature [22] constructed three classification protocols based on the Paillier cryptosystem to protect the security of data collected from medical users and service providers. Literature [23] developed an automatic diagnosis system for privacy protection, in which the remote server classifies the biomedical signal provided by the client without obtaining any information about the signal itself or the final result of the classification. Liu et al. [24] proposed a privacy-preserving patient-centered clinical decision support system based on additive homomorphism to help clinicians diagnose patients’ disease risks. Hua et al. [25] proposed an efficient and privacy-preserving medical diagnosis framework, which outsourced the accurate diagnosis model to a cloud server in an encrypted manner based on partial decryption and security comparison technology to achieve the advantages of two-way ciphertext quantification. Because all encryption operations in these schemes are based on homomorphic encryption, the huge computational overhead makes the systems extremely inefficient and therefore unsuitable for online medical service scenarios.

In addition, many online medical prediagnosis schemes based on various machine learning classification algorithms [26-33] adopt different privacy protection strategies. Wu et al. [27] designed a new efficient and privacy-preserving conditional oblivious transfer protocol. The literature [28] proposed a novel privacy-preserving biometric identification scheme, which improved efficiency by using the power of cloud computing. Zhang et al. [29] proposed a cloud-based privacy-preserving deep computation model to improve the efficiency of big data feature learning. These schemes mainly protect the privacy of users’ query vectors but do not protect the security of the communication or of the diagnostic models.

Zhang et al. [32] proposed a disease prediction system (called PPDP) that used random matrices to construct new medical data encryption, disease learning, and disease prediction. Although the use of a single-layer perceptron makes the disease prediction stage simple and efficient, the accuracy of the prediagnosis is not high enough. In the disease learning stage, constantly updating the weights until convergence consumes huge computation and communication overhead. At the same time, in this three-party interaction scenario, the users’ private information is transmitted directly without communication security. Zhu et al. [33] proposed an efficient and privacy-preserving medical primary diagnosis scheme based on kNN (called EPDK). With lightweight multiparty random masking and polynomial aggregation techniques, users can ensure the security of their sensitive information in online medical diagnosis. This is an interactive service in which only the user and the server participate. The diagnostic model of the server is based on the original medical data set, and medical data that is neither encrypted nor processed is vulnerable to attack or theft. In addition, the classification accuracy of EPDK is still not high enough. In short, few schemes provide comprehensive privacy protection together with high-precision and high-efficiency prediagnosis services.

3. Preliminaries

This section introduces the Relief-MW classifier used by OMPD and the BLS signature technology to protect communication security.

3.1. Relief-kMW Classifier

The Relief-MW classifier first uses the Relief algorithm [34] to obtain the feature weights and then calculates the Relief-Wasserstein distance between the query vector and each medical instance in the database. Each vector has $n$ features, and the distance is computed for every instance in the database. The specific steps are as follows:

3.1.1. Relief Feature Weight Distribution Algorithm

Before the Relief-MW classifier works, the weight of the feature is calculated according to the Relief feature weight distribution algorithm (as shown in Algorithm 1).

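Input: training data set $D$, number of sampling iterations $m$, feature set $F$,
which has $n$ features in total.
Output: feature weight vector $W$.
1: Reset all feature weights to 0: $W(A) = 0$ for $A = 1, \ldots, n$;
2: for $i = 1$ to $m$ do
3:  Randomly select a sample $R$ from $D$;
4:  Find the nearest neighbor sample $H$ in the same class as $R$,
and the nearest neighbor sample $M$ in a different class from $R$;
5:  for $A = 1$ to $n$ do
6:   $W(A) = W(A) - \mathrm{diff}(A, R, H)/m + \mathrm{diff}(A, R, M)/m$;
7:  end for
8: end for
9: return $W$.

The function $\mathrm{diff}(A, R_1, R_2)$ in Algorithm 1 represents the difference between the sample $R_1$ and the sample $R_2$ on the $A$-th feature. Following the standard Relief definition [34], its formula is as follows:

$$\mathrm{diff}(A, R_1, R_2) = \begin{cases} \dfrac{|R_1[A] - R_2[A]|}{\max(A) - \min(A)}, & \text{if feature } A \text{ is continuous}, \\ 0, & \text{if feature } A \text{ is discrete and } R_1[A] = R_2[A], \\ 1, & \text{if feature } A \text{ is discrete and } R_1[A] \neq R_2[A]. \end{cases}$$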
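The weight update in Algorithm 1 can be illustrated with a minimal Python sketch. It assumes numeric features, two classes, a range-normalised absolute difference for the per-feature difference, and an L1 sum of those differences for the nearest-neighbor search; the function names (`diff`, `nearest`, `relief_weights`) and the toy data are hypothetical and are not taken from the OMPD implementation.

```python
import random

def diff(a, r1, r2, ranges):
    """Range-normalised difference between two samples on feature a (numeric features only)."""
    return abs(r1[a] - r2[a]) / ranges[a] if ranges[a] else 0.0

def nearest(sample, candidates, ranges):
    """Candidate closest to `sample` under the summed per-feature diff (an assumed metric)."""
    n = len(sample)
    return min(candidates, key=lambda c: sum(diff(a, sample, c, ranges) for a in range(n)))

def relief_weights(X, y, m, seed=0):
    """Relief feature weights for a data set X (list of rows) with class labels y."""
    rng = random.Random(seed)
    n = len(X[0])
    ranges = [max(row[a] for row in X) - min(row[a] for row in X) for a in range(n)]
    w = [0.0] * n                                   # step 1: reset weights
    for _ in range(m):                              # step 2: m sampling rounds
        i = rng.randrange(len(X))                   # step 3: random sample
        r = X[i]
        hits = [X[j] for j in range(len(X)) if j != i and y[j] == y[i]]
        misses = [X[j] for j in range(len(X)) if y[j] != y[i]]
        h = nearest(r, hits, ranges)                # step 4: nearest same-class sample
        miss = nearest(r, misses, ranges)           #         nearest other-class sample
        for a in range(n):                          # steps 5-7: weight update
            w[a] += (diff(a, r, miss, ranges) - diff(a, r, h, ranges)) / m
    return w

# Toy data: feature 0 separates the two classes, feature 1 does not.
X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1], [0.9, 0.8], [1.0, 0.2], [0.8, 0.5]]
y = [0, 0, 0, 1, 1, 1]
weights = relief_weights(X, y, m=20)
print(weights)
```

On this toy data the discriminative feature receives a clearly positive weight while the uninformative one does not, which is the behaviour the classifier relies on when it weights features.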

3.1.2. Relief-MW Classification Method

After obtaining the feature weights, we perform the Relief-MW classification method, which is similar to kNN, except that kNN uses the Euclidean distance, whereas our method uses the Relief-Wasserstein distance. First, we calculate the Relief-Wasserstein distance between the query vector and each instance in the database, which is defined as follows:

Definition 1 (Relief-Wasserstein distance). Suppose and are two input samples, where is the total number of features, is the iterative value of the difference between and on each component, and is the weight value of the corresponding features obtained by the Algorithm 1. Then, we calculate the followings:

The Relief-Wasserstein distance between and is

Then, we select the $k$ closest instances. Finally, we take the majority class among these $k$ instances as the final classification result. In fact, there is no theoretically optimal choice of $k$; it depends on the data characteristics and classification requirements. Studies on the optimal value of $k$ [35] point out that the best choice for a given data set may also depend on many attributes of the data. They carried out selection experiments on $k$ for specific data sets under different applications and selected the best value for each application as far as possible.
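The selection and voting step can be sketched in Python as follows. The distance is passed in as a function; the `weighted_distance` used here is only a stand-in (a feature-weighted absolute difference) and is not the Relief-Wasserstein distance of Definition 1. All names and the toy data are hypothetical.

```python
from collections import Counter

def weighted_distance(u, v, w):
    """Stand-in distance: feature-weighted absolute difference (NOT the Relief-Wasserstein formula)."""
    return sum(wi * abs(ui - vi) for ui, vi, wi in zip(u, v, w))

def classify(query, instances, labels, w, k, distance=weighted_distance):
    """Return the majority label among the k instances closest to the query."""
    ranked = sorted(range(len(instances)), key=lambda i: distance(query, instances[i], w))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

instances = [[1.0, 5.0], [1.2, 0.5], [0.9, 9.0], [3.0, 5.1], [3.2, 4.9]]
labels = ["benign", "benign", "benign", "malignant", "malignant"]
query = [1.1, 5.0]

# Down-weighting the second (uninformative) feature changes which neighbors are selected.
with_weights = classify(query, instances, labels, w=[1.0, 0.05], k=3)
without_weights = classify(query, instances, labels, w=[1.0, 1.0], k=3)
print(with_weights, without_weights)
```

With the second feature down-weighted, the three nearest instances are all benign; with equal weights, two malignant instances enter the neighborhood and win the vote. This illustrates why feature weighting can change the classification outcome.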

3.2. BLS Signature

Suppose there is a large prime number $q$ and two cyclic groups $G$ and $G_T$, both of order $q$, and $g$ is a generator of $G$. Then, there is a mapping $e: G \times G \rightarrow G_T$ that, for all $u, v \in G$ and $a, b \in \mathbb{Z}_q$, has the following properties: bilinearity, $e(u^a, v^b) = e(u, v)^{ab}$; nondegeneracy, $e(g, g) \neq 1$; and computability, $e(u, v)$ can be computed efficiently.

Definition 2 (BLS signature). The bilinear parameter generator takes a security parameter as input and outputs a 5-tuple $(q, g, G, G_T, e)$, where $q$ is a prime number, $G$ and $G_T$ are two cyclic groups of order $q$, $g$ is the generator of $G$, and $e: G \times G \rightarrow G_T$ is a nondegenerate and efficiently computable bilinear mapping. The steps of BLS signature are as follows:

(1) Generate public key. The message sender randomly selects an integer $x \in \mathbb{Z}_q^*$ as the private key and then calculates the public key: $pk = g^x$.
(2) Create a signature. The sender creates a signature by performing the following operation on the message $m$, where $H$ is a hash function that maps messages to elements of $G$: $\sigma = H(m)^x$.
(3) Verify the signature. After receiving the signature, the receiver checks the following equation, and the message content is accepted only if the check succeeds: $e(\sigma, g) = e(H(m), pk)$.
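The reason the verification equation holds for an honestly generated signature follows directly from bilinearity. The short derivation below uses standard BLS notation, which is an assumption here: $x$ is the private key, $pk = g^{x}$ the public key, $H$ a hash function into the group, $\sigma = H(m)^{x}$ the signature, and $e$ the bilinear mapping.

```latex
\begin{align*}
e(\sigma, g) &= e\bigl(H(m)^{x}, g\bigr) && \text{since } \sigma = H(m)^{x} \\
             &= e\bigl(H(m), g\bigr)^{x} && \text{by bilinearity} \\
             &= e\bigl(H(m), g^{x}\bigr) && \text{by bilinearity} \\
             &= e\bigl(H(m), pk\bigr)    && \text{since } pk = g^{x}
\end{align*}
```

A party that does not know $x$ cannot produce $H(m)^{x}$ for a new message, which is why a successful check authenticates the sender.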

4. OMPD Scheme

Our OMPD includes three entities: the hospital, the cloud, and the users. And it consists of six stages: initialization, query generation, query processing, prediagnosis service, query result analysis, and result acquisition. Figure 1 shows the flow of the OMPD. For ease of expression, Table 1 gives the description of the notations used in the following sections.

4.1. Initialization

The hospital first generates a bilinear parameter . Then, the hospital uses a random number as private key (), calculates public key , and sets the parameters , , and , where . It needs to choose a secure asymmetric encryption algorithm and encrypted hash function , where . The hospital securely saves its private key as the master key and publishes system parameters .

Suppose there is a medical data set in the hospital database, these medical instances include kinds of diseases, and each disease corresponds to a data subset containing data. For each , the instance contains features. The hospital uses the Algorithm 1 to obtain the feature weight in each data subset . The preprocessing of medical data is shown in Algorithm 2. The hospital selects two large prime numbers and and sets , . Next, it chooses a large random number and . Each medical instance should successively perform vector feature weight distribution (to get the vector ), vector value iterative transformation (to get the vector ), forward expansion (to get the vector , ), and finally to get the processed medical vector , . The time complexity of Algorithm 2 is .

The hospital keeps the parameters secret and sends all preprocessed medical instances to the cloud. After obtaining the preprocessed medical instances, the cloud randomly selects part of the data as the test set to obtain the optimal value of the classifier for the diseases . Then, the cloud can have the Relief-MW classifier with the best value and many preprocessed medical instances received from the hospital.

Input: disease , medical vector (or ), feature weight
vector .
Output: the preprocessed vector , (or ).
1: Generate two large prime numbers and a random number (or ) and satisfy ,
and (or );
2: The feature weight is assigned to get ,
(or );
3: The vector value is iteratively transformed to get
,
(or );
4: The vector is expanded forward to get
,
(or );
5:for to do
6: Generate a random number (or ) and satisfy ;
7: if do
8:   (or );
9: else do
10:   (or );
11: end if
12:end for
13: return (or ).
4.2. Query Generation

The user generates query vector . Then, the user uses the hospital’s public key to encrypt the query vector to get , where is the disease that the user wants to query. Then, he (she) uses private key to create a signature and then sends to the hospital.

4.3. Query Processing

After receiving the from the user, the hospital first needs to confirm the , and then uses the following formula to verify the validity of the message:

If the equation is true, then the hospital performs Algorithm 2. The processing method is the same as that of medical instances. The hospital selects a large random number , and then performs vector features weight distribution (to get the vector ), vector value iterative transformation (to get the vector ), and two-dimensional forward expansion (to get the vector ), and obtains the preprocessed query vector after encryption.

The hospital calculates and keeps the parameters secret. Then, it uses private key to create a signature and sends to the cloud.

4.4. Prediagnostic Service

After receiving from the hospital, the cloud verifies the validity of the message by using the following formula:

If the above equation holds, the cloud performs the prediagnosis service algorithm. As shown in Algorithm 3, the cloud finds the data subset corresponding to the queried disease and calculates the Relief-Wasserstein distance between the preprocessed query vector and each preprocessed, encrypted medical instance in that subset. Then, the cloud selects the $k$ instances with the smallest distance and takes the majority class of these instances as the final prediagnosis result. The time complexity of Algorithm 3 is . The cloud keeps each , and computation rules secret. Then, it uses the hospital’s public key to encrypt the result and uses its own private key to create a signature. Finally, the cloud sends the encrypted result and the signature to the hospital.

Input: the preprocessed query vector , , , and .
Output: pre-diagnosis result .
1:;
2:for to do
3: set ;
4: for to do
5:  ;
6: end for
7:end for
8:Select the data with the smallest between the data and the query vector, and use most of the classification results in the
data as the pre-diagnostic result ;
9:return .
4.5. Query Result Analysis

After receiving from the cloud, the hospital verifies the validity of the message by using the following formula:

The prediagnosis result can be obtained if the equation holds. The hospital gives some advice for this result and calculates . Then, the hospital uses private key to create a signature and sends to the user.

4.6. Result Acquisition

After receiving from the hospital, the user verifies the validity of the message by using the following formula:

If the equation holds, the user can obtain the pre-diagnosis result and the advice .
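The six stages share one pattern: every receiver checks the signature before it processes a message. The following Python sketch shows only that message flow. It makes strong simplifying assumptions: an HMAC tag from the standard library stands in for the BLS signature (HMAC is symmetric and does not offer the public verifiability of BLS), public-key encryption is omitted, and the preprocessing and classification steps are placeholders. All names and keys are hypothetical.

```python
import hashlib
import hmac

def sign(key: bytes, payload: bytes) -> bytes:
    """Stand-in for BLS signing (assumption: HMAC-SHA256 tag, not a real BLS signature)."""
    return hmac.new(key, payload, hashlib.sha256).digest()

def verify(key: bytes, payload: bytes, tag: bytes) -> bool:
    """Stand-in for BLS verification."""
    return hmac.compare_digest(sign(key, payload), tag)

def receive(key: bytes, message: tuple) -> bytes:
    """Verify first, then hand the payload on; reject anything that fails the check."""
    payload, tag = message
    if not verify(key, payload, tag):
        raise ValueError("signature check failed; message dropped")
    return payload

USER_KEY, HOSPITAL_KEY, CLOUD_KEY = b"user-key", b"hospital-key", b"cloud-key"

def hospital_preprocess(query: bytes) -> bytes:
    return b"preprocessed:" + query        # placeholder for Algorithm 2

def cloud_prediagnose(query: bytes) -> bytes:
    return b"result-for:" + query          # placeholder for Algorithm 3

def run_query(query: bytes) -> bytes:
    msg = (query, sign(USER_KEY, query))                       # 4.2 user -> hospital
    q = hospital_preprocess(receive(USER_KEY, msg))            # 4.3 hospital verifies, preprocesses
    msg = (q, sign(HOSPITAL_KEY, q))                           #     hospital -> cloud
    r = cloud_prediagnose(receive(HOSPITAL_KEY, msg))          # 4.4 cloud verifies, classifies
    msg = (r, sign(CLOUD_KEY, r))                              #     cloud -> hospital
    out = receive(CLOUD_KEY, msg) + b"|advice"                 # 4.5 hospital verifies, adds advice
    msg = (out, sign(HOSPITAL_KEY, out))                       #     hospital -> user
    return receive(HOSPITAL_KEY, msg)                          # 4.6 user verifies, reads result

final = run_query(b"q")
tampered_ok = verify(USER_KEY, b"tampered", sign(USER_KEY, b"q"))
print(final, tampered_ok)
```

Note that the user and the cloud never exchange a message directly in this flow; the hospital relays in both directions, which matches the three-party structure described above.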

5. Accuracy and Security Analysis

This section analyzes the accuracy and security of OMPD. We verify that the data preprocessing of OMPD does not affect the original classification accuracy and can ensure privacy security.

5.1. Accuracy Analysis

If the Relief-Wasserstein distance that Algorithm 3 computes on the preprocessed data is approximately a fixed multiple of the Relief-Wasserstein distance between the original data, then the ordering of the distances is preserved. In that case, our scheme does not require decryption to obtain the original data and still provides accurate prediagnosis services.

Theorem 3. Assuming that represents the Relief-Wasserstein distance between and , represents the Relief-Wasserstein distance between the two original data, then is equal to .
The calculation process of the Relief-Wasserstein distance between the original data is as follows: represents the Relief-Wasserstein distance between and , and then the calculation process of is as follows: In summary, by Theorem 3, we can get that is equal to .

Theorem 4. Assuming that represents the Relief-Wasserstein distance between and , represents the Relief-Wasserstein distance between and , and then and are approximately in a fixed multiple relationship.
represents the Relief-Wasserstein distance between and , and then the calculation process of is as follows: represents the Relief-Wasserstein distance between and , and then the calculation of is as follows: From , we can get the following:

From , we can get that is negligible relative to the large prime number , and then

Finally, Theorem 4 holds. And we can get that and are approximately in a fixed multiple relationship. It can be obtained from Theorems 3 and 4 that the cloud can still obtain accurate pre-diagnosis results through Algorithm 3 without obtaining the original data.
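The practical consequence of Theorems 3 and 4 is that the nearest neighbors are the same before and after preprocessing. A small Python sketch can illustrate this, assuming the processed distance equals a fixed positive multiple of the original distance plus a perturbation that is negligible next to that multiple. The numbers, the `scale` value, and the `noise` list are hypothetical and are not derived from the OMPD parameters.

```python
def k_nearest(distances, k):
    """Indices of the k smallest distances, closest first."""
    return sorted(range(len(distances)), key=distances.__getitem__)[:k]

original = [0.42, 1.30, 0.07, 2.15, 0.88]     # hypothetical plaintext distances
scale = 104729.0                              # hypothetical fixed positive multiple
noise = [0.3, -0.2, 0.1, 0.25, -0.4]          # small perturbations, negligible next to scale

processed = [scale * d + e for d, e in zip(original, noise)]

neighbors_plain = k_nearest(original, 3)
neighbors_processed = k_nearest(processed, 3)
print(neighbors_plain, neighbors_processed)
```

The ranking is preserved as long as the perturbation is smaller than the scaled gaps between distinct distances; two original distances that are almost equal could still swap places.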

5.2. Security Analysis

The focus of security analysis is whether OMPD can protect the privacy of users’ medical data, the privacy of medical instances in hospitals, and the confidentiality of cloud diagnostic models.

The user’s query vector and the medical instance , in the hospital data subset are privacy-preserving. In the initialization stage, in order to prevent the privacy of medical instances from leaking, the hospital expands two dimensions with 0 for each medical instance after feature weight distribution and vector value iteration, which can prevent the cloud and illegal users from obtaining real medical instances. Each in the encryption calculation is randomly generated; so, is protected. When , , , and are unknown, the cloud cannot obtain the original medical data. And the random number generated is independent of each other at each time. Only the hospital knows the data processing rules, and the cloud cannot infer the original medical data. Therefore, the medical instance set is kept secret during the calculation process. Similarly, the query vector of the user is also kept secret.

The Relief-MW classifier is confidential. In the operating calculation phase, each medical instance , and query vector are preprocessed by the hospital before sending to the cloud. For two same query vectors, they should have the same Relief-Wasserstein distance between the same medical instances, but the random number and generated by each data processing are different and the Relief-Wasserstein distance calculated by the query vector is also different, which ensures that even the same user cannot obtain medical instance information after multiple queries. And each calculated by the Relief-MW classifier is confidential, and the cloud uses the hospital’s public key to encrypt the query result. The hospital uses the user’s public key to encrypt the query result and the advice; so, only the corresponding user can decrypt the query result. Moreover, the user and the cloud cannot communicate directly during the query. Although the cloud knows the final prediction result, it cannot obtain the corresponding user’s information. And the user cannot obtain the detailed information of the Relief-MW classifier in the cloud. Therefore, the Relief-MW classifier is confidential.

OMPD ensures communication security. Our scheme uses the BLS signature to protect the information of each interaction. The signature is proved to be secure under the computational Diffie-Hellman assumption in the random oracle model [36]. In addition, an illegal user cannot successfully submit a query request to the hospital because such a user does not hold a valid key. Signature authentication ensures that a message cannot be maliciously tampered with during transmission without detection. Even if someone maliciously intercepts the message, the effective information in it cannot be obtained without the decryption key.

In summary, our OMPD can provide privacy-preserving medical prediagnosis services.

6. Performance Evaluation

In this section, multiple experiments are conducted to verify the accuracy and efficiency of OMPD from multiple dimensions of accuracy, computation overhead, and communication overhead.

6.1. Experiment Configuration

We implement OMPD in the Python programming language and evaluate its computation and communication overhead. We carry out our experiments on a machine with an Intel(R) Xeon(R) Silver 4114 CPU at 2.20 GHz and 32 GB of memory. We choose two real data sets from the UCI machine learning repository, Wisconsin Breast Cancer (WBC) and Mammographic Mass (MM), to evaluate the accuracy of OMPD. In the comparative experiment, OMPD uses the Relief-MW classifier, while OMPD-MW uses the MW classifier; EPDK [33] is a two-party interactive prediagnosis scheme using a kNN classifier, and PPDP [32] is a three-party interactive prediagnosis scheme using a single-layer perceptron trained with encryption matrices as the classifier.

The WBC contains 683 instances, including 444 benign instances and 239 malignant instances. Each instance contains 9 attributes. The MM contains 830 instances, of which 427 instances are benign and 403 instances are malignant. Each instance contains 5 attributes.

6.2. Selection of the Optimal Value of $k$

For WBC and MM, we randomly selected 100 malignant instances and 100 benign instances as the test data set to evaluate the accuracy of OMPD, and the rest were used as the training data set. Then, we performed 100 runs to compare the average accuracy and computation time under different values of $k$. As shown in Table 2, when $k = 9$ for WBC and $k = 11$ for MM, the classification accuracy of OMPD is the highest and the fluctuation in computation time is the smallest. Therefore, in the following experiments, the values of $k$ for WBC and MM are 9 and 11, respectively.
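The selection procedure can be sketched in Python as a repeated random hold-out evaluation over candidate values of k. The sketch rests on several assumptions: synthetic one-dimensional data instead of WBC or MM, a plain absolute-difference distance instead of the Relief-Wasserstein distance, and an unstratified split instead of the 100 benign plus 100 malignant test instances used in the experiment. All names are hypothetical.

```python
import random
from collections import Counter

def knn_predict(query, train, k):
    """Majority label among the k training points nearest to the query (1-D absolute distance)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy_for_k(data, k, test_size, rounds, seed=0):
    """Average hold-out accuracy over several random train/test splits."""
    rng = random.Random(seed)          # same seed for every k, so all k see the same splits
    total = 0.0
    for _ in range(rounds):
        shuffled = data[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:test_size], shuffled[test_size:]
        hits = sum(knn_predict(x, train, k) == label for x, label in test)
        total += hits / test_size
    return total / rounds

def best_k(data, candidates, test_size, rounds):
    """Return the candidate k with the highest average accuracy, plus all scores."""
    scores = {k: accuracy_for_k(data, k, test_size, rounds) for k in candidates}
    return max(scores, key=scores.get), scores

# Synthetic two-class data (an assumption standing in for a real medical data set).
gen = random.Random(1)
data = [(gen.gauss(0.0, 1.0), "benign") for _ in range(60)]
data += [(gen.gauss(2.0, 1.0), "malignant") for _ in range(60)]

k_star, scores = best_k(data, [1, 3, 5, 7, 9, 11], test_size=20, rounds=30)
print(k_star, scores)
```

Odd candidate values avoid tied votes in a two-class problem. The value selected this way is specific to the data set, which is consistent with different values being chosen for WBC and MM.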

6.3. Evaluation of Accuracy

To verify the accuracy of our method, we conducted an accuracy comparison experiment, as shown in Table 3. The results show that the classification accuracy of OMPD is higher than that of the other three schemes. This means that OMPD retains a high classification accuracy after the data is preprocessed by feature weight distribution, iteration, expansion, and encryption, and that it can provide high-precision online medical prediagnosis services for medical users. Note that the accuracy of OMPD is higher than that of OMPD-MW. The reason is that the feature weights obtained with the Relief feature weight distribution algorithm increase the impact of features that contribute more to the classification and reduce the impact of features that contribute less. The experimental results thus verify that Relief-MW is more accurate than MW.

6.4. Evaluation of Computational Efficiency

In order to verify that OMPD can provide efficient online medical prediagnosis services for medical users, this section evaluates the efficiency of computation. During the system operation, when the user generates the query vector and sends it to the hospital, the hospital needs to preprocess the query vector through multiplication (division) and addition (subtraction) operation. After the cloud receives the query request, it needs to go through multiplication (division) and addition (subtraction) operation to calculate the diagnosis result. Let and denote the running time of addition and multiplication, respectively. Then, the overall computation complexity of OMPD is during system operation. As shown in Table 4, it can be seen that OMPD-MW only lacks a multiplication operation (the calculation of multiplying feature weights and feature values) than our OMPD. The computational complexity of PPDP in the operating phase of the system has nothing to do with the amount of data . Its computational complexity is concentrated in the system initialization stage, and a lot of calculations are required to continuously update the weights to converge. The multiplication (division) in OMPD is far less than that in EPDK, while is more than . Therefore, the computational complexity of OMPD is lower than that of EPDK, and the gap becomes more obvious with the increase of data.

However, Table 4 does not give an intuitive comparison of the actual computation time, so we conducted a comparison experiment on computational overhead. As shown in Figure 2, as the number of medical instances increases, the computation time of EPDK grows rapidly, that of OMPD and OMPD-MW grows slowly, and that of PPDP remains almost unchanged. This is because EPDK takes more time to perform operations such as multiplication (division). In addition, the large amount of medical data stored by EPDK providers is collected directly from various medical service sites without encryption or other privacy protection measures, so there is a serious risk of privacy leakage. As shown in Table 4, OMPD and OMPD-MW require significantly fewer calculations than EPDK. PPDP only needs to combine the query data with the weights to obtain the diagnosis result, so its computation time in the running phase remains unchanged. However, PPDP generates a large computation and communication overhead in the initialization phase, when the weights are updated until convergence. PPDP also ignores the protection of the communication process: users send sensitive information directly to the hospital, which makes it vulnerable to attacks and theft by malicious third parties.

6.5. Evaluation of Communication Overhead

The communication overhead comparison of the four schemes in the system operation phase is shown in Figure 3. Assume that the medical advice contains 100 words and that each word contains an average of 5 characters; with punctuation and spaces, one word occupies approximately 6 characters, and each character occupies 1 byte. The total communication overhead of OMPD is then about 0.735 KB (including 600 B of medical advice). As the number of medical instances increases, the total communication overhead of OMPD, OMPD-MW, and PPDP remains unchanged, while that of EPDK is much higher than that of the other three schemes. This is because these three schemes obtain the diagnosis results directly in the cloud, so only the results need to be sent during the communication process. The server in EPDK only calculates the two intermediate values of the Spearman distance between the query vector and the medical instances in the database. The server needs to send these intermediate values to the user, and the user finally calculates the diagnosis result. Therefore, the larger the number of medical instances, the greater the communication overhead of EPDK. In addition, EPDK users cannot get the corresponding medical advice in time.
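The size of the medical advice follows from the stated assumptions, as the short Python calculation below shows. The text does not say whether 1 KB means 1000 or 1024 bytes, so the remaining, non-advice part of the 0.735 KB total is shown under both readings; that split is an illustration and not a figure reported in the experiments.

```python
WORDS = 100
CHARS_PER_WORD = 6       # 5 letters plus punctuation and spaces, as assumed in the text
BYTES_PER_CHAR = 1

advice_bytes = WORDS * CHARS_PER_WORD * BYTES_PER_CHAR

TOTAL_KB = 0.735         # total OMPD communication overhead reported in the text

# The KB convention is not stated, so both readings are computed (assumption).
rest_if_1000 = TOTAL_KB * 1000 - advice_bytes
rest_if_1024 = TOTAL_KB * 1024 - advice_bytes

print(advice_bytes, round(rest_if_1000, 2), round(rest_if_1024, 2))
```

Under either reading, the advice text accounts for most of the traffic, and the remainder is a small, constant amount that does not depend on the number of medical instances.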

In summary, OMPD can balance accuracy and efficiency to achieve high-precision online medical prediagnosis with lower computational complexity and communication overhead. At the same time, it also provides timely professional medical advice.

7. Conclusions

In this article, we propose an efficient online multiparty interactive medical prediagnosis scheme with privacy protection, called OMPD, which protects privacy with low computation and communication overhead. The accuracy analysis shows that OMPD can obtain accurate classification results without decryption. The security analysis demonstrates its security strength and privacy protection capabilities, and its effectiveness is verified through comparative experiments. Future work will further improve the classification accuracy and operating efficiency of the scheme while ensuring the strength and security of privacy protection.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the Key Research and Development Program of Anhui, China under Grant No. 201904a05020071, the Program for Synergy Innovation in the Anhui Higher Education Institutions of China (Grant Nos. GXXT-2019-025 and GXXT-2020-012), in part by the Natural Science Research Project of Anhui Province (Grant No. 2108085MF218), in part by the University Natural Science Research Project of Anhui Province (No. KJ2020A0249), and in part by the Open Fund of Key Laboratory of Anhui Higher Education Institutes under Grant CS2020-006.