Abstract
Cloud computing is highly suitable for medical diagnosis in e-health services, where strong computing ability is required. However, despite the huge benefits of cloud computing, the medical diagnosis field is not yet ready to adopt it, because medical data are sensitive and using cloud computing raises serious concerns about privacy infringement. For instance, a compromised e-health cloud server might expose the medical dataset outsourced from multiple medical data owners or infringe on the privacy of a patient inquirer by leaking his/her symptom or diagnosis result. In this paper, we propose a medical diagnosis system that uses e-health cloud servers in a privacy-preserving manner when medical datasets are owned by multiple data owners. The proposed system is the first one that achieves the privacy of the medical dataset, symptoms, and diagnosis results and hides the data access pattern even from the e-health cloud servers performing computations on the data, while remaining robust against collusion of the entities. As a building block of the proposed diagnosis system, we design a novel privacy-preserving protocol for finding the k data with the highest similarity (PEFTK) to a given symptom. The protocol reduces the average running time by 35% compared to that of a previous work in the literature. Moreover, the result of the previous work is probabilistic, i.e., it can contain some error, while the result of our PEFTK is deterministic, i.e., it is correct without any error probability.
1. Introduction
Cloud computing, as an emerging computing paradigm, is revolutionizing the data processing methodology of many organizations because of its resource efficiency and reduction in management cost. As the costs of healthcare services rise, e-health is considered one of the promising fields that could benefit from using cloud computing [1, 2]. Among various health services, medical diagnosis is especially well suited for the e-health cloud, because diagnosis requires heavy computation and can be implemented on a pay-as-you-use model on the Internet.
Meanwhile, adopting cloud computing for medical diagnosis causes privacy issues because of the sensitive personal information contained in medical data. Specifically, if medical data owners such as hospitals outsource their medical diagnosis dataset to the e-health cloud in the clear, a compromised e-health cloud service provider might expose it. Similarly, if a patient inquirer exchanges his/her symptom and diagnosis result with the e-health cloud in the clear, the compromised e-health cloud service provider might infringe on his/her privacy by exposing them. Even if the medical data owners and the patient inquirer encrypt their data before sending them to the e-health cloud, the compromised e-health cloud service provider might still obtain additional information by observing data access patterns during processing.
The Health Insurance Portability and Accountability Act (HIPAA) regulates the privacy and security of individually identifiable health information to be guaranteed obligatorily [3]. The privacy and security regulations of HIPAA were improved in the Health Information Technology for Economic and Clinical Health (HITECH) Act [4]. Unfortunately, these acts do not suggest the technical methods for the privacy.
For medical diagnosis, case-based reasoning (CBR), which has been applied to medical diagnosis since the late 1980s [5], is a well-established problem-solving methodology. Given a problem (i.e., symptom), CBR provides its solution (i.e., diagnosis result) by referencing the cases whose problems are most similar to the given problem among the previous ones (i.e., medical diagnosis dataset) in a case library, where a case consists of a problem and its solution [6, 7]. One of the most important functionalities in CBR is to find the most similar cases in order to provide the solution to a given problem. For this purpose, many papers and systems [7–11] adopted k-nearest neighbor (kNN) classification. Specifically, kNN classification for a query selects the k data most similar to the query in a classified dataset and determines the class of the query as the majority class of the k selected data [12]. It is fairly simple and gives good results in practice.
In a real healthcare service environment, health records are owned by multiple data owners such as hospitals, which are unwilling to reveal the health records because of privacy or legal issues. If a single data owner collects the health records in order to outsource them to e-health cloud servers, this raises privacy concerns. Unfortunately, most of the previous works on computing kNN in a privacy-preserving manner assumed that there exists only one data owner rather than multiple data owners [13–19].
1.1. Contribution
The main theme of this paper is to design a privacy-preserving kNN classification, so-called PPkNN [15], with multiple data owners for medical diagnosis. For privacy, we protect the medical dataset outsourced by multiple dataset owners, the symptom of the patient inquirer, the data access patterns during computation, and the diagnosis result returned as the PPkNN result. For security, we provide robustness against collusion among cloud servers, collusion between any data owner and a cloud server, and collusion between the inquirer and a cloud server. There have been some results on PPkNN using cloud computing with multiple data owners. The authors of [20] proposed a privacy-preserving kernel density estimation instead of PPkNN and demonstrated that its accuracy is similar to that of PPkNN in many applications. They also introduced various realistic threats that can occur in the multiple data owner environment and discussed the privacy of PPkNN classification. However, their protocol does not consider the privacy of the kNN result and the data access pattern. The PPkNN of [21] provides the privacy of the dataset, input query, kNN result, and data access pattern, but it is vulnerable to collusion attacks: it assumes that there is neither collusion among cloud servers nor collusion between any data owner and a cloud server. We summarize the functionalities provided by the previous works and our PPkNN in Table 1.
As one of the building blocks of our PPkNN, we propose an improved method to find the k data with the highest similarity (PEFTK). It reduces the average running time by 35% compared to the previous work [22]. The number of rounds and the running time increase only slightly as the number of data or k increases. Moreover, the result of the previous work is probabilistic, i.e., it can contain some error, while the proposed PEFTK is deterministic, i.e., our result is correct without any error probability. Thus, our PEFTK is more suitable for medical diagnosis, which handles sensitive medical data. We stress that our work is meaningful as a privacy-preserving and efficient protocol to find the k data with the highest value (top-k data) using cloud computing.
As mentioned in [23], privacy-preserving cloud computing with multiple data owners and inquirers (a model they call stateful private multi-client computing) cannot be realized with a single cloud server using only cryptography, and adopting multiple distributed cloud servers can be an alternative. We thus realize our PPkNN using multiparty computation (MPC) based on secret sharing to compute the kNN result in a distributed manner without any trusted server.
In MPC based on secret sharing, data are shared among multiple cloud servers and each share reveals nothing about the original data, which can be reconstructed only when a sufficient number (i.e., more than the predefined threshold value) of shares are combined. Since our PPkNN is designed using MPC, it is robust to collusion attacks. In other words, it tolerates an adversary that compromises some of the e-health cloud servers. The allowed number of compromised cloud servers depends on the MPC protocol adopted. For instance, when the GMW protocol [24] is applied, our PPkNN can compute kNN results in a privacy-preserving manner even if an adversary compromises all e-health cloud servers except one.
The remaining part of this paper is organized as follows: in Section 2, we explain the MPC primitives, complexity, and kNN as preliminaries, and in Section 3 we outline the proposed PPkNN and the attack scenarios. In Section 4, we present the proposed PEFTK as the main contribution and then present the proposed PPkNN. In Section 5, we analyze the efficiency and discuss the security of PEFTK and PPkNN. In Section 6, we review the previous works related to PPkNN and privacy-preserving top-k protocols, and lastly we conclude this paper in Section 7.
2. Preliminaries
We explain MPC protocols based on Shamir’s secret sharing in Section 2.1 and Section 2.2 by which our proposed protocols are constructed (we implemented our PEFTK using the source code opened in the previous work [25] which is the MPC protocol based on Shamir’s secret sharing). However, since the proposed protocols can be constructed by not only MPC based on Shamir's secret sharing but also those based on other secret sharing, such as [24], we consider MPC applying to our proposed protocol as those based on secret sharing throughout this paper.
2.1. Multiparty Computation Based on Shamir’s Secret Sharing
MPC allows a set of parties (i.e., cloud servers) to jointly compute an agreed function on their inputs in a distributed fashion and to obtain the results of the function but nothing else. Each party receives shares generated from the input values of the function and computes the results using the shares. MPC tolerates an adversary that compromises at most t parties, because their t shares do not contain any information on the original data. In other words, since an adversary compromising at most t parties obtains no information on the original data, MPC allows the parties to carry out secure computation without a trusted third party.
MPC based on secret sharing proceeds in three phases: input sharing, computation, and output reconstruction. In the input sharing phase, a party or an external entity holding a secret s generates a random polynomial f_{s}(x) of degree at most t with f_{s}(0) = s, where t is the maximum number of corrupted parties, and sends the share [s]_{i} = f_{s}(α_{i}) to each party P_{i}, where α_{i} is a distinct nonzero element. In this paper, we denote the shares by [s] = ([s]_{1}, …, [s]_{n}), where n is the number of parties. In the computation phase, the parties carry out a protocol for each gate of the circuit realizing the function agreed on in advance and obtain the result in shared representation. Lastly, in the output reconstruction phase, the parties send their computed shares to the other parties and then reconstruct the final result from the received shares. Bitwise sharing shares a secret s bit by bit, i.e., the bitwise share is [s]_{B} = ([s_{l−1}], …, [s_{0}]), where s_{i} ∈ {0, 1} is the ith bit of s and l is the bit length of the secret s.
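The input sharing and output reconstruction phases can be illustrated with a minimal plaintext sketch of Shamir's secret sharing. The field modulus `PRIME`, the evaluation points α_{i} = i, and the function names `share` and `reconstruct` are assumptions made for this illustration; they are not taken from the implementation of [25].

```python
import random

# Public prime modulus of the field (an assumption for this sketch).
PRIME = 2**61 - 1

def share(secret, t, n):
    """Split `secret` into n shares; any t + 1 of them reconstruct it."""
    # f(x) = secret + a_1 x + ... + a_t x^t, so f(0) = secret.
    coeffs = [secret % PRIME] + [random.randrange(PRIME) for _ in range(t)]

    def f(x):
        acc = 0
        for c in reversed(coeffs):  # Horner evaluation modulo PRIME
            acc = (acc * x + c) % PRIME
        return acc

    # Party P_i receives the point (i, f(i)); the points 1..n are nonzero and distinct.
    return [(i, f(i)) for i in range(1, n + 1)]

def reconstruct(shares):
    """Recover f(0) by Lagrange interpolation over the prime field."""
    secret = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = (num * (-xj)) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

shares = share(42, t=2, n=5)
print(reconstruct(shares[:3]))  # any t + 1 = 3 shares suffice
```

With t = 2, any two shares are consistent with every possible secret, which is why up to t compromised servers learn nothing from their shares alone.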
2.2. Addition and Multiplication of Multiparty Computation
Since Shamir’s secret sharing has a linear property, addition in MPC is homomorphic. For the addition of [a] and [b], each party P_{i} locally adds up its own shares [a]_{i} and [b]_{i} without any communication. We denote the addition of MPC by [a] + [b] = [a + b]. Similarly, since multiplication by a public constant c is also homomorphic, each party P_{i} holding the share [a]_{i} locally multiplies it by the public constant c without any communication. We denote multiplication by a public constant c by c·[a] = [c·a].
Multiplication of two shares [a] and [b] requires one round of communication. Specifically, each party P_{i} locally multiplies its own shares [a]_{i} and [b]_{i}, i.e., h_{i} = [a]_{i}·[b]_{i}, generates shares of h_{i} by Shamir’s secret sharing, i.e., [h_{i}]_{1}, …, [h_{i}]_{n}, for i = 1, …, n (n is the number of parties), and sends the share [h_{i}]_{j} to each other party P_{j}. Lastly, each party P_{i} computes [ab]_{i} = Σ_{j=1}^{n} r_{j}·[h_{j}]_{i}, where (r_{1}, …, r_{n}) is the recombination vector, which is public information that all parties can compute. For more details, refer to [26]. The circuit randomization method [27] enables multiplication to be performed locally without any communication by using precomputed random shares [x], [y], and [z] where [z] = [xy].
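The operations above can be simulated in a single process as follows. This is a sketch under stated assumptions: all n parties are simulated locally, party P_{i} holds the evaluation at point i, n ≥ 2t + 1 so that the product polynomial of degree 2t can be interpolated, and the names `P`, `share`, `recombination_vector`, `add`, `mul_const`, and `mul` are hypothetical.

```python
import random

P = 2**61 - 1  # public prime modulus (an assumption for this sketch)

def share(s, t, n):
    """Shares of s: evaluations of a random degree-t polynomial at 1..n."""
    coeffs = [s % P] + [random.randrange(P) for _ in range(t)]
    return [sum(c * pow(i, e, P) for e, c in enumerate(coeffs)) % P
            for i in range(1, n + 1)]

def recombination_vector(n):
    """Public Lagrange coefficients r_i with f(0) = sum_i r_i * f(i)."""
    r = []
    for i in range(1, n + 1):
        num = den = 1
        for j in range(1, n + 1):
            if j != i:
                num = num * j % P
                den = den * (j - i) % P
        r.append(num * pow(den, -1, P) % P)
    return r

def reconstruct(shares):
    r = recombination_vector(len(shares))
    return sum(ri * si for ri, si in zip(r, shares)) % P

def add(a_sh, b_sh):
    """[a] + [b]: each party adds its two shares locally."""
    return [(x + y) % P for x, y in zip(a_sh, b_sh)]

def mul_const(c, a_sh):
    """c * [a]: each party scales its share locally."""
    return [(c * x) % P for x in a_sh]

def mul(a_sh, b_sh, t):
    """[a] * [b] with one resharing round."""
    n = len(a_sh)
    r = recombination_vector(n)
    # Step 1: local products h_i, points on a polynomial of degree 2t.
    h = [(x * y) % P for x, y in zip(a_sh, b_sh)]
    # Step 2: party i reshares h_i with degree t; row i holds what it sends out.
    resh = [share(h_i, t, n) for h_i in h]
    # Step 3: party j combines the subshares it received using r.
    return [sum(r[i] * resh[i][j] for i in range(n)) % P for j in range(n)]

a, b = share(7, 2, 5), share(6, 2, 5)
print(reconstruct(add(a, b)), reconstruct(mul_const(3, a)), reconstruct(mul(a, b, 2)))
```

Only `mul` involves communication (the resharing in step 2), which matches the observation that addition and multiplication by a public constant are free in terms of rounds.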
2.3. Comparison and Equality
Our proposed protocol uses comparison (lessThan) and equality MPC operations as well as basic addition and multiplication. In [25], the comparison operation requires 24l + 5 multiplications in 2l + 10 rounds, and the equality operation requires l + 1 multiplications in l rounds, where l is the bit length of the data. We implemented the proposed PEFTK using the library of [28], which implements the comparison and equality operations proposed in [25]. Their running time is optimized by reducing the number of multiplications, although their round complexity is linear in the length of the data; for more details, refer to [25]. Table 2 shows the notations for the MPC operations used in our protocol. The comparison and equality MPC operations have been formally proved in the previous works [29, 30]; therefore, we omit a formal proof in this paper.
2.4. Complexity
We evaluate the efficiency of a protocol in terms of both the number of rounds and the amount of communication. We measure the round complexity by the invocation count of a dominant operation performed in parallel and the communication complexity by the total number of invocations of the dominant operation to be carried out, as in [29, 30]. In other words, the round complexity denotes the time required to complete a protocol, and communication complexity denotes the amount of data sent and received in a protocol.
2.5. k-Nearest Neighbor
kNN classification [31], an instance-based learning algorithm, is one of the simplest and oldest nonparametric pattern classification techniques and yields competitive results. It selects the k data most similar to an unclassified input query (i.e., input symptom) in a classified dataset (i.e., medical dataset) and classifies the input query into the majority class (i.e., diagnosis result) of the selected k data. Its performance depends on the similarity computation. Many papers and medical diagnosis systems related to kNN adopted the Euclidean distance as a similarity measure [6].
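For reference, plain (non-private) kNN classification can be sketched in a few lines. The function names and the toy records are hypothetical; the squared Euclidean distance is used because it yields the same neighbor ordering as the Euclidean distance while avoiding a square root.

```python
from collections import Counter

def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_classify(dataset, query, k):
    """dataset: list of (feature_vector, label); returns the majority label
    among the k records closest to `query`."""
    nearest = sorted(dataset, key=lambda rec: squared_euclidean(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy records: (symptom vector, diagnosis result)
records = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((8, 8), "B"), ((9, 8), "B")]
print(knn_classify(records, (2, 2), k=3))
```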
3. Overview
We outline the proposed PPkNN in Section 3.1 and explain how to generate global dataset in shared representation from horizontally or vertically distributed datasets of multiple data owners for input of PPkNN or PEFTK in Section 3.2. Then, we explain attack scenarios in Section 3.3.
3.1. System Model
The proposed PPkNN consists of multiple medical data owners, e-health cloud servers, and a patient inquirer, as shown in Figure 1. Organizations such as hospitals holding medical diagnosis datasets can be medical data owners. For the medical diagnosis service, multiple medical data owners outsource their medical datasets to e-health cloud servers to utilize their huge computing resources and benefit from their reduced management cost. A patient inquirer wishing to have a medical examination sends his/her symptom to the e-health cloud servers. The e-health cloud servers carry out PPkNN classification as a part of the medical diagnosis and return the result to the patient inquirer. We assume that the entities are connected over secure and authenticated channels. This means that an adversary cannot eavesdrop on the communication between the entities.
We represent the ith medical data by a symptom and its diagnosis result, denoted by (d_{i}, c_{i}). We assume that the symptom consists of m details, denoted by the m-dimensional vector d_{i} = (d_{i,1}, …, d_{i,m}), and that the diagnosis result c_{i} is in [v] = {1, …, v}, where v is the number of possible diagnosis results, represented as a v-bit value. If the diagnosis result c_{i} is α ∈ [v], only the αth bit is 1 and the other bits are all 0.
We assume that the input symptom of a patient inquirer consists of m details, like the symptom of the medical data, and denote it by the m-dimensional vector q = (q_{1}, …, q_{m}). We also assume that the result sent from the e-health cloud servers is a vector of v scores, one per disease. We denote it by scr = (scr_{1}, …, scr_{v}), where scr_{i} is the score of the ith disease. The diagnosis result for the symptom of the patient inquirer is the disease with the highest score.
3.2. Generating an Input Dataset from Horizontally or Vertically Distributed Data
In this subsection, we explain how cloud servers privately generate global dataset from datasets distributed to multiple data owners for PPkNN or PEFTK. The data distribution approach is classified as horizontally distributed dataset and vertically distributed dataset [32]. In the horizontally distributed dataset, each data owner holds some records of global dataset which have the same set of attributes. In the vertically distributed dataset, each data owner holds data corresponding to some attributes of global dataset.
In order to carry out the proposed PPkNN or PEFTK on global dataset of multiple data owners, they carry out the input sharing phase by sending shares generated from their datasets to each cloud server as described in Section 2.1. For instance, in the horizontally distributed dataset, if a data owner A stores and a data owner B stores , the global dataset which cloud servers store after input sharing phase is for . In the vertically distributed dataset, if a data owner A stores and a data owner B stores , the global dataset is also . In order to generate one column dataset for PEFTK, cloud servers additionally carry out computation (e.g., computeSimilarity in this paper) and generate the global dataset as input dataset.
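The difference between the two partitionings can be made concrete with a small plaintext sketch. The records and the helper names `merge_horizontal` and `merge_vertical` are hypothetical, and in the actual protocol each owner would send secret shares of these values to the cloud servers rather than the values themselves.

```python
def merge_horizontal(owner_a, owner_b):
    """Horizontal partitioning: each owner holds complete records with the
    same attributes, so the global dataset is the union of the rows."""
    return owner_a + owner_b

def merge_vertical(owner_a, owner_b):
    """Vertical partitioning: each owner holds some attributes of every
    record, so rows are joined attribute-wise by record position."""
    if len(owner_a) != len(owner_b):
        raise ValueError("both owners must hold the same records")
    return [ra + rb for ra, rb in zip(owner_a, owner_b)]

# Horizontal: A holds records 1-2, B holds record 3 (two attributes each).
print(merge_horizontal([(1, 2), (3, 4)], [(5, 6)]))
# Vertical: A holds attribute 1, B holds attribute 2 of the same three records.
print(merge_vertical([(1,), (3,), (5,)], [(2,), (4,), (6,)]))
```

Both calls produce the same three two-attribute records, which mirrors the statement that the global dataset is identical in the two cases.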
3.3. Attack Scenarios
We consider a semi-honest adversary model, in which a compromised entity follows the specified protocol but tries to obtain additional information on the dataset of the data owners, the input query, intermediate results, and the kNN result during the protocol. Our PPkNN allows an adversary to compromise any entity, and we also consider the multi-data owner outsourced model defined in [20], in which an adversary can compromise several entities simultaneously and carry out a collusion attack. However, the authors of [20] showed that an adversary that compromises both data owners and the inquirer and performs a collusion attack can obtain additional information on the dataset of a data owner regardless of the protocol design or encryption scheme, even if the cloud servers store the dataset in encrypted form. Therefore, we exclude the attack that compromises both data owners and the inquirer and consider the remaining attacks. In other words, we consider the attacks in which an adversary compromises cloud servers and data owners, cloud servers and the inquirer, or each entity alone.
The attack scenarios in our PPkNN are as follows. A data owner tries to obtain information on the dataset of another data owner. An inquirer tries to obtain information on the dataset stored in the cloud servers by analyzing the input query and the kNN result exchanged with the cloud servers. Cloud servers try to obtain information on the internally stored dataset, the input query sent from an inquirer, intermediate results, and the kNN result. Furthermore, since the compromised cloud servers can also collude with data owners or the inquirer in the multi-data owner outsourced model [20] (we assume that an adversary can compromise at most t entities, including a data owner or the inquirer), they try to obtain information from their own randomized dataset by sending an input query to themselves via the compromised inquirer and observing the data access patterns during computation. With the information from this attack, they could obtain information on an input query sent by another inquirer.
Since our PPkNN is constructed with MPC, it allows an adversary to compromise some of the cloud servers. The proposed PPkNN can be realized by applying an MPC protocol based on secret sharing that is chosen according to the number of cloud servers and the expected number of compromised cloud servers among them. Even though we consider the semi-honest adversary model in our work, it is possible to make the protocols of the cloud servers secure against a malicious adversary by applying MPC secure against a malicious adversary to the proposed protocol.
3.3.1. Notations
For simplicity, [n] means the set {1, 2, …, n}. For a set , means .
4. Proposed Protocols
PPkNN first computes the similarities between the input query and each data in the dataset (computeSimilarity), converts the similarities into bitwise shared representation (BitDecomposition), and selects the k data with the highest similarities (PEFTK). Among the subprotocols, we focus on PEFTK, the most important one, which selects the top-k similarities, and present it in Section 4.1; the other subprotocols are taken from previous works. We construct the proposed PPkNN from the subprotocols in Section 4.2.
4.1. Privacy Preserving and Efficient Protocol to Find the Top-k Data (PEFTK)
The basic idea of PEFTK is to find the top-k data according to the arrangement of bitwise 1. Specifically, when the bits of two values are examined and compared from the most significant bit to the least significant bit, bitwise 1 appears earlier in the higher value than in the lower value. For example, when comparing the two 4-bit data 4 and 3 (0100 and 0011 in binary), since the second bit (from the most significant bit) of data 4 is 1 while the second bit of data 3 is 0, data 4 is higher. As another example, when comparing the two 4-bit data 6 and 5 (0110 and 0101 in binary), since the second bit of both data is 1 but the third bit of data 6 is 1 while the third bit of data 5 is 0, data 6 is higher.
PEFTK examines each bit of all data starting from the most significant bit (we call each such step a bitround). In each bitround, it counts the number of data whose current bit is 1, i.e., it adds up the current bits of all data, since each bit is 0 or 1. It then adds this count to the number of data in which bitwise 1 already appeared in a prior bit, i.e., the size of the result dataset from the prior bitround, and compares the sum with k. The detailed procedure is as follows:
(1) While examining each bit from the most significant bit to the least significant bit, PEFTK computes Cnt by adding the sum of the current bits of the data whose prior bits are all 0 and the number of data in which bitwise 1 appeared in a prior bit (i.e., the size of the result dataset from the prior bitround), and compares Cnt with k.
(2-1) Cnt > k: it carries out step (3).
(2-2) Cnt == k: it includes in the result dataset the data whose current bit is 1. Then, it outputs the result dataset and terminates.
(2-3) Cnt < k: it includes in the result dataset the data whose current bit is 1 and repeats step (1).
(3) It decides the candidate data, that is, the data whose current bit is 1 among the data whose prior bits are all 0.
(4) For the next bit of the candidate data, it computes Cnt by adding the sum of the current bits of the candidate data and the size of the result dataset from the prior bitround, and compares Cnt with k.
(5-1) Cnt == k: it includes in the result dataset the candidate data whose current bit is 1. Then, it outputs the result dataset and terminates.
(5-2) Cnt > k: it removes from the candidate data those whose current bit is 0 and carries out step (4).
(5-3) Cnt < k: it includes in the result dataset the candidate data whose current bit is 1, and then it carries out step (4).
Tables 3 and 4 show an example of PEFTK where the dataset is {16, 12, 11, 10, 9} and k = 3, and thus the result dataset is {16, 12, 11}. We define PEFTK as Algorithm 1. Recall that the bitwise share of a secret s is [s]_{B} = ([s_{l−1}], …, [s_{0}]), where s_{i} ∈ {0, 1} is the ith bit of s and l is the bit length of s.

PEFTK consists of part 1 (lines 2–13) and part 2 (lines 15–24). When it checks the jth bit in part 1, it computes Cnt by adding the number of data in which bitwise 1 appears between the (l − 1)th bit and the (j + 1)th bit and the number of data whose jth bit is 1 among the data in which bitwise 0 appears continually from the (l − 1)th bit to the (j + 1)th bit (line 3), and compares Cnt with k (lines 6 and 10). If Cnt is less than or equal to k, it includes in the result dataset the data whose jth bit is 1 (line 9), and if Cnt is larger than k, it proceeds to part 2 (line 7). In part 2, it finds the top-k data among the candidate data. It computes Cnt of the current bit in the same manner as in part 1 (line 16). If Cnt is not equal to k, it computes the result dataset (line 22) and the candidate dataset (line 23), respectively; otherwise, it computes and returns the result dataset (lines 18–19).
4.2. Privacy Preserving k-Nearest Neighbor (PPkNN)
We present the PPkNN protocol in Algorithm 2. There are a variety of similarity measures for the computeSimilarity protocol, and we consider the squared Euclidean distance [33] in this paper. The BitDecomposition protocol decomposes a shared secret [s] into a bitwise shared secret [s]_{B} = ([s_{l−1}], …, [s_{0}]), where s_{i} ∈ {0, 1} for i = 0, …, l − 1. For an efficient bit-decomposition protocol, refer to [34].
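The bitround logic can be traced on plaintext integers with the sketch below. It is not Algorithm 1 itself: it works on clear values rather than shares, and the function name `peftk_plain`, the mask variables, and the final tie-filling step are assumptions added for illustration. It also assumes that candidates moved into the result are no longer counted as candidates. On plaintext, part 1 and part 2 reduce to the same update rule; the split matters in the secure protocol, where part 1 compares a reconstructed Cnt and part 2 uses the comparison and equality operations on shares.

```python
def peftk_plain(data, k, l):
    """Return the indices of the k largest values among non-negative
    l-bit integers, scanning bits from the most significant one."""
    n = len(data)
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= len(data)")
    res = [0] * n    # res[i] = 1 once data[i] is known to be in the top-k
    cand = [1] * n   # data still undecided (all higher bits agree)
    for j in range(l - 1, -1, -1):              # one bitround per bit
        ones = [cand[i] & (data[i] >> j) & 1 for i in range(n)]
        cnt = sum(res) + sum(ones)              # Cnt of this bitround
        if cnt > k:
            cand = ones                         # keep only candidates with bit 1
        else:
            res = [r | o for r, o in zip(res, ones)]
            if cnt == k:
                break                           # exactly k data found
            cand = [c & (1 - o) for c, o in zip(cand, ones)]
    # Assumption of this sketch: if equal values (or zeros) leave fewer than
    # k results, fill up with the remaining, mutually tied candidates.
    missing = k - sum(res)
    for i in range(n):
        if missing == 0:
            break
        if cand[i] and not res[i]:
            res[i] = 1
            missing -= 1
    return [i for i in range(n) if res[i]]

data = [16, 12, 11, 10, 9]
print([data[i] for i in peftk_plain(data, k=3, l=5)])  # [16, 12, 11]
```

For the dataset {16, 12, 11, 10, 9} with k = 3, the trace is: Cnt = 1 at the most significant bit (16 joins the result), Cnt = 5 at the next bit (12, 11, 10, 9 become candidates), then Cnt = 2 (12 joins), Cnt = 4 (9 is dropped), and Cnt = 3 at the last bit (11 joins and the protocol stops).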

PPkNN computes the similarity between an input query and each data (line 1) and selects the k data with the highest similarity (line 3). If the data d_{i} is one of the top-k data, it holds that [ck_{i,j}] = [c_{i,j}] for j ∈ [v] since [Res_{i}] = [1], and otherwise [ck_{i,j}] = [0] for j ∈ [v] since [Res_{i}] = [0] (line 4). Thus, the value [scr_{j}] obtained by adding up [ck_{i,j}] over all i (line 5) is the score that aggregates the jth class over the k data most similar to the query q. The cloud servers send the result shares of scr to the inquirer, who then reconstructs and obtains scr = (scr_{1}, …, scr_{v}). The class with the highest value among scr_{1}, …, scr_{v} is the class (i.e., diagnosis result) for the input query (i.e., input symptom) as the result of kNN classification.
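The scoring step can be sketched on plaintext values as follows. The names `one_hot`, `class_scores`, and `diagnose` are hypothetical, the top-k indicator `res` is taken as given (it would come from PEFTK), and in the protocol each product would be one multiplication on shares rather than on clear values.

```python
def one_hot(alpha, v):
    """Class alpha in {1, ..., v} as a v-bit vector with only bit alpha set."""
    return [1 if j == alpha else 0 for j in range(1, v + 1)]

def class_scores(res, classes):
    """res[i] in {0, 1} marks the top-k data; classes[i] is the one-hot class
    of data i. Returns scr_j = sum_i res[i] * classes[i][j]."""
    v = len(classes[0])
    return [sum(r * c[j] for r, c in zip(res, classes)) for j in range(v)]

def diagnose(scores):
    """Class (1-based) with the highest score; ties go to the lowest index."""
    return max(range(len(scores)), key=scores.__getitem__) + 1

res = [1, 1, 1, 0, 0]                                  # top-3 indicator
classes = [one_hot(a, 3) for a in (2, 2, 1, 3, 3)]     # classes of the 5 data
scores = class_scores(res, classes)
print(scores, diagnose(scores))  # [1, 2, 0] 2
```

Because only the score vector is reconstructed by the inquirer, which individual records were selected stays hidden inside the shared indicator.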
5. Efficiency and Security
In this section, we discuss the efficiency and the security of the proposed protocols. Specifically, we analyze the empirical result of PEFTK implementation in Section 5.1 and measure the complexity of PEFTK and PPkNN in Section 5.2. We discuss the security of PEFTK in Section 5.3 and that of PPkNN in Section 5.4.
5.1. Empirical Results of PEFTK
We implemented the proposed PEFTK with the Java-based source code of [28], which was released with the previous work [25], and conducted experiments to confirm its performance. Specifically, we first ran the PEFTK implementation on five cloud servers to find the top 100 data among 1000 randomly generated data of 33 bits and then varied the number of data, the length of data, and k, where each experiment was conducted 30 times. Each cloud server ran on a separate machine with an Intel Core i7 2.4 GHz CPU, and intermediate results were communicated over a 100 Mb/s network.
Figures 2–4 show the distribution of the number of bitrounds and the average running time using a box-and-whisker plot and a line graph, respectively. In the box-and-whisker plot, the central mark and the edges of the box represent the median, the 8th (Q1), and the 23rd (Q3) of the number of bitrounds, respectively. The whiskers cover the range of values not considered outliers, where outliers are values larger than Q3 + 1.5(Q3 − Q1) or smaller than Q1 − 1.5(Q3 − Q1), as in [22].
As seen in PEFTK (Algorithm 1), the computation cost of part 2 (lines 15–24) contributes most to the complexity of PEFTK, and that of part 1 (lines 2–13) is relatively low. In other words, part 1 requires one round (one invocation) of multiplication in each bitround, while part 2 requires the expensive comparison and equality operations once each as well as 3 rounds (5 invocations) of multiplication in each bitround. According to the previous result [25] used to implement PEFTK, the comparison operation requires 76 rounds (797 invocations) of multiplication and the equality operation requires 34 rounds (34 invocations) of multiplication in the case of 33-bit data. Therefore, the execution of part 2 is the dominant factor in the complexity of PEFTK.
Table 5 shows that our PEFTK is more efficient than the previous work [22] in terms of both the average running time for one round and the total running time. This is because the previous work requires one more expensive comparison operation in each of its rounds. In addition, as the number of input parties increases, the number of rounds of the previous work [22] increases, since it runs a collision resolution phase to reduce global collisions, whereas the number of bitrounds and the running time of our PEFTK do not increase, since it outputs a deterministic result. Our experimental results show that the distribution of the number of bitrounds and the average running time of PEFTK increase only slightly even when the number of data, the length of data, and k increase, except for the running time with respect to the length of data. We observed that our PEFTK found the top-k data in between 9.7 and 11.1 bitrounds and took between 98.23 and 123.83 seconds for datasets generated at random. Moreover, the experimental results show a large variance because the data are random.
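The multiplication counts quoted above follow from the formulas of Section 2.3 with l = 33. The helper names below are hypothetical and only evaluate those formulas.

```python
def comparison_cost(l):
    """(multiplications, rounds) of the comparison operation of [25]
    for l-bit data: 24l + 5 multiplications in 2l + 10 rounds."""
    return 24 * l + 5, 2 * l + 10

def equality_multiplications(l):
    """Number of multiplications of the equality operation of [25]: l + 1."""
    return l + 1

print(comparison_cost(33))           # (797, 76)
print(equality_multiplications(33))  # 34
```

Compared with the 5 multiplication invocations that part 2 otherwise needs per bitround, the single comparison dominates the cost by more than two orders of magnitude.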
Figure 2 shows that the number of bitrounds and the average running time of PEFTK do not increase in proportion to the increasing number of data. As seen in PEFTK, the number of multiplication invocations is proportional to the number of data (n invocations in part 1 and 5n invocations in part 2, where n is the number of data), but they have little influence on running time since these multiplications can be carried out in parallel. Furthermore, since the expensive comparison and equality operations take Cnt and k unrelated to the number of data as input, the number of data does not have an influence on the number of bitrounds and the average running time of PEFTK.
Figure 3 shows that the number of bitrounds does not increase as the length of data increases, but the average running time does. The cause is not PEFTK itself but the comparison and equality operations in the library [25, 28] used to implement it: since their complexities are linear in the length of data, the average running time of the PEFTK implementation increases in proportion to the length of data. Therefore, if we implement PEFTK with a library in which the complexities of comparison and equality are constant [29, 30], the running time, like the number of bitrounds, will not increase with the length of data.
Figure 4 shows that the number of bitrounds and the average running time of PEFTK are unrelated to k. The round of the previous work [22] increases according to the increase of k, while PEFTK does not require additional operations for the high value of k. In other words, since the number of expensive comparison and equality operations does not increase according to the increase of k, the value of k does not have an influence on the number of bitrounds and the running time.
5.2. Complexity
As explained above, we evaluate the complexity of PEFTK by the execution count of part 2 (lines 15–24), since the complexity of part 2 contributes most to that of PEFTK. Table 6 shows the complexity of PEFTK in comparison to that of the previous work [22]. The previous work requires two rounds of comparison (n + 1 invocations) in each of its rounds, since it compares τ (the median of the data bound) to all n data and the number of larger data to k. Since the previous work requires one more round of comparison (n more invocations) in each of its rounds, our PEFTK is more efficient. In the experiment of PEFTK with random data, the execution count α of part 1 was mostly one or two. However, if most data are smaller than 10 bits in size, PEFTK can be even more efficient, since the execution count of part 1 increases and that of part 2 decreases.
The complexity of PPkNN consists of the executions of computeSimilarity (line 1), BitDecomposition (line 2), and PEFTK (line 3) and the multiplications of line 4. Since we use the squared Euclidean distance [33] to compute similarity, computeSimilarity requires one round of multiplication (nm invocations). BitDecomposition, which converts the similarity values into bitwise shared representation for PEFTK, is known as a comparatively expensive operation. However, in state-of-the-art research [34], the author constructed a very efficient bit-decomposition protocol using precomputed random values. It requires (3l − 2u) multiplications in (l/u + 1) rounds, where l is the length of data and u is the number of bits converted in one round. For more details, refer to [34]. Lastly, line 4 requires one round of multiplication (nv invocations). Consequently, since the round complexity, which determines the time to complete a protocol, is not proportional to the number of data, which is quite large in most cases, our PPkNN is relatively efficient.
5.3. Security of PEFTK
In part 1 of our PEFTK, the cloud servers reconstruct the number of the highest data (Cnt) in each bit-round for efficiency. In other words, until the number of the highest data is larger than k (part 1), the number of the highest data is leaked in each bit-round. However, this does not leak which data are the highest or what their exact values are. It reveals only that a bitwise 1 appears in the current bit of Cnt data among all data. This does not give the cloud servers an unreasonable amount of information on the input dataset.
As a variation of PEFTK, it is possible to find the top-k data without reconstructing Cnt in part 1. The variation requires one comparison operation (line 6 in Algorithm 1) and one equality operation (line 10 in Algorithm 1) in each bit-round. Even so, the previous work [22] requires n more comparisons in one round per bit-round (in total, nl comparisons in l rounds) than the variation, and comparison is the expensive operation in our proposed protocols; thus the variation is still more efficient than the previous work. Moreover, since the number of comparison and equality operations is independent of the number of data, the length of data, and k, the efficiency remains similar to that of Section 5.1 even if they increase.
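To make the nature of this leakage concrete, the following plaintext sketch scans the data from the most significant bit downward and records, for each bit position, how many of the current candidates have a 1 in that position. It is not Algorithm 1: it works on cleartext integers rather than shares, it records a count in every round rather than only in part 1, and the function name and tie handling are assumptions made for illustration.

```python
def top_k_by_bits(values, k, bit_length):
    """Select the indices of the k largest values by an MSB-first bit scan.

    Returns (sorted indices of the selected values, per-round counts).
    The per-round counts play the role of the revealed count in the text:
    they say how many candidates have a 1 in the current bit, not which ones.
    """
    candidates = list(range(len(values)))  # indices still tied with the threshold
    selected = []                          # indices already known to be in the top-k
    counts = []
    for bit in reversed(range(bit_length)):
        ones = [i for i in candidates if (values[i] >> bit) & 1]
        counts.append(len(ones))
        need = k - len(selected)
        if len(ones) >= need:
            # Enough candidates have a 1 here: those with a 0 cannot be in the top-k.
            candidates = ones
        else:
            # Every candidate with a 1 here is certainly in the top-k.
            selected.extend(ones)
            ones_set = set(ones)
            candidates = [i for i in candidates if i not in ones_set]
    # Remaining candidates all hold the same value; take as many as still needed.
    selected.extend(candidates[: k - len(selected)])
    return sorted(selected), counts


if __name__ == "__main__":
    indices, counts = top_k_by_bits([5, 12, 7, 3, 9], k=2, bit_length=4)
    print(indices, counts)
```

The loop always runs bit_length times, regardless of the number of values or of k, and an observer who sees only counts learns how many candidates carry a 1 in each bit position but not their identities or full values.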
5.4. Security of PPkNN
We show that the proposed PPkNN is secure against the threats mentioned in Section 3.3. Specifically, we show that our PPkNN provides the privacy of the dataset of data owners, the input query, the kNN result, and the data access pattern under three attack scenarios: compromising each entity alone, compromising cloud servers together with data owners, and compromising cloud servers together with the inquirer.
5.4.1. Privacy of Dataset
Since data owners send randomized shares of their dataset to each of the cloud servers in the input sharing phase, up to t compromised cloud servers cannot obtain any information on the original dataset from their shares, as explained in Section 2.1. Similarly, since t compromised cloud servers can obtain at most t shares of the intermediate results during MPC processing, they cannot obtain any information on the intermediate results. Since data owners do not interact with other data owners and do not receive any data from other entities, compromised data owners cannot obtain any information. Even if compromised cloud servers collude with a data owner or the inquirer, they obtain at most t shares of each dataset and thus cannot obtain any information on the dataset.
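The threshold argument can be illustrated with a minimal sketch of polynomial secret sharing. It assumes a Shamir-style scheme over a prime field in which any t + 1 shares reconstruct the secret while t shares do not determine it; whether Section 2.1 uses exactly this construction is an assumption, and the prime, function names, and parameters below are chosen only for illustration.

```python
import secrets

PRIME = 2**61 - 1  # illustrative field size (a Mersenne prime), not taken from the paper


def share(secret, t, num_servers):
    """Split `secret` into num_servers shares using a random degree-t polynomial."""
    coeffs = [secret % PRIME] + [secrets.randbelow(PRIME) for _ in range(t)]

    def poly(x):
        acc = 0
        for c in reversed(coeffs):  # Horner evaluation modulo PRIME
            acc = (acc * x + c) % PRIME
        return acc

    return [(i, poly(i)) for i in range(1, num_servers + 1)]


def reconstruct(shares):
    """Recover the secret from at least t + 1 shares by Lagrange interpolation at 0."""
    secret = 0
    for j, (xj, yj) in enumerate(shares):
        num, den = 1, 1
        for m, (xm, _) in enumerate(shares):
            if m != j:
                num = (num * (-xm)) % PRIME
                den = (den * (xj - xm)) % PRIME
        secret = (secret + yj * num * pow(den, -1, PRIME)) % PRIME
    return secret


if __name__ == "__main__":
    shares = share(1234, t=2, num_servers=5)
    print(reconstruct(shares[:3]))  # any 3 = t + 1 shares suffice
```

With t = 2, any three servers can jointly recover the value, whereas any two shares are consistent with every possible secret, which is the property the argument above relies on.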
5.4.2. Privacy of Input Query and kNN Result
Similar to data owners, an inquirer sends to each of the cloud servers a randomized share of the input query generated in the secret sharing phase and receives the kNN result in shared representation from each of the cloud servers. Note that the kNN result is reconstructed by the inquirer rather than by the cloud servers. Since the adversary can obtain at most t shares of the input query and of the kNN result, no information on either is leaked.
5.4.3. Privacy of Data Access Pattern
Compromised cloud servers can attempt to guess additional information by observing data access patterns even though the stored data are randomized. For example, when the compromised cloud servers collude with an inquirer, the compromised inquirer can send an input query to cloud servers and the compromised cloud servers can observe the data access patterns. However, since the cloud servers access all data to compute kNN result, the compromised cloud servers cannot guess the relation between the input query and the data access patterns.
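One common way to obtain a query-independent access pattern is to touch every record and multiply it by a 0/1 indicator instead of indexing the selected record directly. The plaintext sketch below shows the idea; in an MPC setting the records and indicators would be secret shared and the products would be MPC multiplications. The function name and data layout are hypothetical and are not taken from the protocol description.

```python
def oblivious_gather(records, indicators):
    """Return the record whose indicator is 1 while reading every record once.

    records: list of equal-length lists of integers.
    indicators: list of 0/1 flags, one per record, with exactly one flag set.
    The loop never branches on an indicator, so the sequence of reads is the
    same for every query.
    """
    width = len(records[0])
    result = [0] * width
    for record, flag in zip(records, indicators):
        for j in range(width):
            result[j] += flag * record[j]
    return result


if __name__ == "__main__":
    data = [[1, 2], [3, 4], [5, 6]]
    print(oblivious_gather(data, [0, 1, 0]))
```

Returning k records would repeat the same pass with k indicator vectors, so the reads still cover all data regardless of which records are selected.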
6. Related Work
In this section, we review existing works related to PPkNN and privacy preserving top-k protocols.
6.1. Privacy Preserving k-Nearest Neighbor Protocols
After Lindell and Pinkas first introduced privacy preserving data mining in [35], many researchers proposed PPkNN schemes. In [33], Shaneck et al. proposed the PPkNN algorithm over a horizontally distributed dataset, but it leaks some information. Qi et al. [36] resolved the information disclosure problem of [33] with a homomorphic encryption such as the Paillier cryptosystem, but their protocol also executes in a horizontally distributed data model. Further, Xiong et al. [37] proposed a PPkNN scheme which does not provide query privacy as its query is publicly known, and the protocol is also executed in a horizontally distributed data model.
In [13], Yao et al. relaxed the PPkNN requirement so that the protocol finds the partition containing the nearest neighbor for a query instead of the exact nearest neighbor. In their protocol, the data owner and the inquirer must be trusted because they share a secret key. In [14], Elmehdwi et al. proposed a scheme using the Paillier cryptosystem with a homomorphic property, which provides both data and query privacy and hides the access pattern. They then improved their work in [15] and formally proved the security of the scheme, which outputs the query class information over encrypted data. However, they did not consider the untrusting multiple data owner model. In [16], Zhu et al. proposed a PPkNN scheme in which a data owner does not expose the secret key to an inquirer but instead encrypts the query by interacting with the inquirer. Hence, the data owner must remain online for encryption. In [20], Li et al. considered a practical scenario in which the scheme provides privacy in a mutually untrusting multi-data owner outsourced model but did not consider hiding the data access pattern. In [17], Songhori et al. presented a method to generate a compact PPkNN using a garbled circuit and implemented it, but they did not consider multiple data owners. In [18], Zhu et al. proposed an efficient PPkNN scheme providing data privacy, key confidentiality, input query privacy, and query controllability using a combination of random matrix transformation, random permutation, additively homomorphic encryption, and dimension extension. However, the scheme does not consider the mutually untrusting multi-data owner outsourced model. The work of [19] provided privacy for the data owner and the inquirer by constructing an oblivious kd-tree and an oblivious bounded priority queue, but it does not consider multiple data owners. The work of [21] provided data privacy, input query privacy, PPkNN result privacy, and access pattern hiding and considered multiple data owners. However, it allows data owners to send only horizontally partitioned data rather than vertically partitioned data.
6.2. Privacy Preserving Top-k Protocols
In [38, 39], Vaidya and Clifton studied the problem of finding the top-k elements over vertically partitioned private data, using MPC to extend Fagin's algorithm [40]. In [41], Aggarwal et al. designed a protocol to find the kth smallest element over horizontally partitioned data using a binary search. Specifically, the protocol proposes the median of an expected range as a candidate element and counts the number of data smaller than the candidate element in every binary search round. When the count is more than k, the part of the range above the candidate element is removed from the expected range since the kth smallest element is smaller than the candidate element, and vice versa. This process is repeated until the count equals k. However, it is carried out over horizontally partitioned data. In [22], Burkhart et al. proposed the PPTKS protocol to find the top-k values over an aggregated key-value list, where the basic idea is the same binary search as in [41]. The difference from [41] is that PPTKS uses a hash function, and hence it is efficient for sparsely distributed data such as IP addresses. However, since PPTKS outputs a probabilistic result because of the hash function, it is unsuitable for ehealth applications that handle sensitive health information. In [42], Jonsson et al. proposed a privacy preserving sorting protocol with MPC in a sorting network and a privacy preserving top-k protocol using the sorting protocol, but the running time of their top-k protocol is longer than that of the proposed PEFTK.
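The range-halving idea behind the binary search of [41] can be sketched in plaintext as follows. This is a simplified single-party variant, not the protocol of [41]: it counts values less than or equal to the candidate so that duplicates are handled, and the function name and the inclusive integer range [low, high] are assumptions made for illustration.

```python
def kth_smallest(values, k, low, high):
    """Find the k-th smallest value (1-indexed) by binary search on the value range.

    Assumes all values are integers in [low, high] and 1 <= k <= len(values).
    Each iteration needs only one count over the data, which is the quantity
    the parties would compute jointly in a privacy preserving setting.
    """
    while low < high:
        mid = (low + high) // 2
        count = sum(1 for v in values if v <= mid)
        if count >= k:
            high = mid      # the k-th smallest is at most mid
        else:
            low = mid + 1   # the k-th smallest is above mid
    return low


if __name__ == "__main__":
    print(kth_smallest([7, 2, 9, 4, 4], k=3, low=0, high=15))
```

The number of iterations is logarithmic in the size of the value range, and every iteration compares the candidate with all of the data, which is the per-round comparison cost discussed for binary search based approaches.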
7. Conclusion
In this paper, we proposed a PPkNN protocol suitable for medical diagnosis using MPC based on secret sharing in a multiple medical data owner environment. The proposed PPkNN provides the privacy of the medical diagnosis dataset outsourced from multiple data owners, the symptom of a patient inquirer, and the diagnosis result returned as the kNN result, and it hides the data access pattern. As a building block of the proposed PPkNN, we proposed a protocol to find the k data with the highest similarity, which is more efficient than the previous work [22] since it reduces the number of expensive MPC comparison operations. Furthermore, as the number of data, the length of data, or k increases, the number of rounds of PEFTK does not increase. In contrast to the previous work [22], the proposed PEFTK returns deterministic results. We expect that researchers will apply MPC to construct privacy preserving and efficient protocols for data mining techniques other than kNN.
Data Availability
The data used to support the findings of this study are included within the article.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This research was supported by Samsung Electronics.