Research Article  Open Access
Geontae Noh, Ji Young Chun, Ik Rae Jeong, "Sharing Privacy Protected and Statistically Sound Clinical Research Data Using Outsourced Data Storage", Journal of Applied Mathematics, vol. 2014, Article ID 381361, 12 pages, 2014. https://doi.org/10.1155/2014/381361
Sharing Privacy Protected and Statistically Sound Clinical Research Data Using Outsourced Data Storage
Abstract
It is critical to scientific progress to share clinical research data stored in outsourced generally available cloud computing services. Researchers are able to obtain valuable information that they would not otherwise be able to access; however, privacy concerns arise when sharing clinical data in these outsourced publicly available data storage services. HIPAA requires researchers to deidentify private information when disclosing clinical data for research purposes and describes two available methods for doing so. Unfortunately, both techniques degrade statistical accuracy. Therefore, the need to protect privacy presents a significant problem for data sharing between hospitals and researchers. In this paper, we propose a controlled secure aggregation protocol to secure both privacy and accuracy when researchers outsource their clinical research data for sharing. Since clinical data must remain private beyond a patient’s lifetime, we take advantage of lattice-based homomorphic encryption to guarantee long-term security against quantum computing attacks. Using lattice-based homomorphic encryption, we design an aggregation protocol that aggregates outsourced ciphertexts under distinct public keys. It enables researchers to get aggregated results from outsourced ciphertexts of distinct researchers. To the best of our knowledge, our protocol is the first aggregation protocol that can aggregate ciphertexts encrypted with distinct public keys.
1. Introduction
Researchers can accelerate their learning curve if they are able to freely access clinical data from other studies. Such clinical data sharing in outsourced publicly available services is crucial to scientific progress in clinical research. The benefits of clinical data sharing using these services have been widely reported, including reduced research costs, reduced management costs, improvement of quality control, and reduced time in discovering diseases and dealing with them effectively. Through shared data, researchers access valuable information that they would not ordinarily obtain. In its policy statement on grants, the U.S. National Institutes of Health (NIH) supports data sharing by requiring investigators to include a plan for data sharing or explain why data sharing is not possible.
The problem with clinical data sharing in outsourced publicly available services for research is that researchers can inadvertently violate patient privacy. HIPAA (Health Insurance Portability and Accountability Act) offers protection of patients’ personal health information, but it is difficult not to invade patient privacy while sharing clinical data in outsourced publicly available data storage services [1]. Therefore, researchers would rather not make their data publicly available than run the risk of violating HIPAA.
To mitigate privacy concerns, HIPAA describes two ways to use and disclose clinical data for research purposes. Under the HIPAA Safe Harbor policy, clinical data should be deidentified so that patients are not individually identifiable. The HIPAA Safe Harbor policy stipulates that the data sharer should deidentify data by removing 18 specific data attributes, such as name, address, and all dates related to the individual patient, which may include birth date and date of death. (In addition, some researchers continue to assert that combinations of other data that are excluded from the HIPAA Safe Harbor policy could individually identify a specific person with nonnegligible probability, so they insist that there are more than 18 specific data attributes that should be included in the Safe Harbor policy [2–4].) Once identifying information has been removed, the deidentified data are no longer subject to Institutional Review Board (IRB) oversight. Alternatively, researchers may use anonymity techniques to deidentify patient information instead of removing all of the 18 or more data attributes that are required to be deidentified. To date, several anonymity techniques have been proposed, such as k-anonymity [5–7], l-diversity [8], and t-closeness [9].
It is useful to protect patient privacy with deidentification when sharing clinical data in outsourced publicly available data storage services, but doing so degrades statistical accuracy, since it becomes difficult to get precise statistical results. In cases where accurate statistical data on patients are critical, the anonymity techniques for deidentification are therefore not sufficient: researchers can make bad decisions based on poorly deidentified data. A privacy-preserving method that still yields accurate statistical data is needed.
In this work, we propose how to outsource clinical research data securely and how to control the outsourced data against potential breaches of privacy, while not compromising the accuracy of statistical results. For example, a malicious researcher could circumvent any encryption by asking for one piece of data on one patient; in this way, the researcher could ultimately obtain each patient’s private information. In this case, we propose a method that will foil such a malicious attempt.
The system environment we propose for hospitals, aggregator, and researchers is illustrated in Figure 1. In our system, each hospital outsources its own clinical data to cloud storage servers. The clinical data must be deidentified or encrypted to be stored publicly. We use a hybrid method to store the clinical data; that is, we deidentify the clinical data for approximate statistical data requests and encrypt numerical clinical data for accurate statistical data requests. Therefore, researchers can request both approximate and accurate statistical data. Researchers would obtain approximate statistical data directly from the cloud storage servers but cannot obtain accurate statistical data directly. When researchers would like to get accurate statistical data, they can get the data through the aggregator. The aggregator aggregates the requested data from the encrypted database stored in the cloud storage servers, and then asks each hospital to decrypt the aggregated data by consent. Hospitals can refuse the request of the aggregator, unless initial consents that have been obtained from patients allow the secondary research. Since there are ethical and practical issues associated with aggregating databases [10], hospitals should ensure that they are following “best practices” for their outsourced data, such as determining whether initial consents that have been obtained allow secondary research.
Figure 1: The proposed system environment for hospitals, aggregator, and researchers (panels (a) and (b)).
Since clinical data should remain private beyond a patient’s lifetime, cryptographic long-term security is essential [11] in the management of clinical data. Therefore, we use a lattice-based homomorphic encryption scheme to encrypt clinical data. Lattice-based cryptography is believed to be secure against quantum computing attacks and thus to offer long-term security. Cryptosystems based on RSA, ECC, and the DLP, which have gained the most attention so far, could be attacked with quantum computers [12]. Quantum computing at that scale is not yet possible, but may become so in our lifetime. Furthermore, lattice-based cryptographic algorithms are more efficient than others in computational overhead because they require only linear operations on matrices such as addition, multiplication, and inverse.
In 2009, Gentry proposed the first fully homomorphic encryption scheme using ideal lattices [13]. In 2010, Gentry et al. proposed a novel homomorphic encryption scheme (referred to as the GHV homomorphic encryption scheme hereafter) that supports one multiplicative and polynomially many additive operations on encrypted data [14]. As a building block, we use a variant of the GHV homomorphic encryption scheme that supports only additive operations. This makes it possible to aggregate ciphertexts that are encrypted under distinct public keys. Due to this property, the aggregator can aggregate the outsourced encrypted data from hospitals. Therefore, once hospitals outsource their clinical data, they do not need to encrypt the clinical data again for individual researchers. Each hospital only has to encrypt the clinical data once and then outsource the encrypted data.
Contributions. In this paper, we propose a controlled secure aggregation protocol in sharing clinical research data to balance the interests between hospitals and researchers. The main contributions of this paper are as follows.
(i) Researchers can get approximate statistical data from deidentified clinical data directly. Researchers can also obtain accurate and aggregated clinical data from the encrypted database through the aggregator by obtaining each hospital’s consent.
(ii) We take advantage of a lattice-based homomorphic encryption which is secure against quantum computing attacks. Therefore, our protocol resists quantum attacks and could remain secure in the long term.
(iii) The aggregator can aggregate encrypted clinical data which are encrypted with distinct public keys. Therefore, hospitals do not have to encrypt the clinical data again whenever researchers send requests.
To the best of our knowledge, our protocol is the first protocol which takes advantage of lattice-based homomorphic encryption in order to share outsourced clinical research data.
Organization. The remainder of this paper is organized as follows. Section 2 provides related works and background. Section 3 presents our controlled secure aggregation protocol. We present our secure clinical data aggregation system in Section 4 and analyze it in Section 5. We provide our conclusions in Section 6.
2. Related Works and Background
In this section, we present related works and background.
2.1. Data Aggregation Based on Homomorphic Encryption
In 2004, Hacıgümüş et al. proposed an aggregation protocol over encrypted relational databases [15]. They designed the aggregation protocol using the PH (Privacy Homomorphism), which supports additive and multiplicative operations. In the aggregation protocol, permitted users can get the accurate and aggregated data. However, Mykletun and Tsudik showed that the aggregation protocol using the PH is not secure against ciphertext-only attacks [16]. Since then, various aggregation protocols over encrypted data have been proposed in the literature [17–23]. Among those protocols, few have focused on the healthcare environment. In addition, most protocols considered the aggregation of a single provider’s data.
Molina et al. [22] designed the aggregation protocol, HICCUPS, using homomorphic encryption in the healthcare environment. In HICCUPS, clinical data of multiple providers can be aggregated as follows: caregivers who store clinical data on their own database are randomly chosen as the aggregator. When a researcher requests the aggregated result, the aggregator aggregates the encrypted clinical data from each caregiver and sends the aggregated result to the researcher.
Since HICCUPS is not based on the outsourcing system, caregivers have to provide clinical data whenever a researcher requests certain data. In addition, HICCUPS requires each caregiver to aggregate and encrypt clinical data with the researcher’s public key so that the aggregator can aggregate the encrypted clinical data. However, a malicious aggregator may want a researcher to get a misleading result by intentionally excluding the encrypted clinical data from certain caregivers. Even if the malicious aggregator fabricates the aggregated result on purpose, there is no way for a researcher to detect the malicious behavior of the aggregator in HICCUPS.
To resolve the above issues, we design the controlled secure aggregation protocol, which can aggregate outsourced ciphertexts under distinct public keys. Therefore, data providers (or hospitals) do not have to encrypt clinical data again once they have outsourced their clinical data. Our protocol also enables a researcher to detect the malicious behavior of the aggregator. If the malicious aggregator excludes the encrypted clinical data from certain data providers on purpose, a researcher can detect that. Since each data provider (or hospital) collaboratively makes the aggregated data decryptable by a researcher, if the aggregated data is generated maliciously then the researcher cannot get a plausible result. Instead, the researcher gets a random-looking result that cannot be mistaken for a meaningful one. Therefore, in our protocol, the researcher can be sure that the requested data are aggregated correctly.
2.2. Anonymity Techniques for Deidentification
Samarati and Sweeney introduced an anonymity technique called k-anonymity [5–7]. They considered a relational database that consists of unique identifiers, quasi-identifiers, and sensitive attributes. A unique identifier is any attribute that is able to identify only one private individual, such as a personal ID, an email address, or a cell phone number. A quasi-identifier is any set of attributes that can be joined with additional information to identify only one private individual, such as a zip code and a birthday. A sensitive attribute is any attribute that a data owner does not want to publish, such as healthcare data. In order to preserve privacy, all unique identifiers must be removed and all quasi-identifiers must be anonymized. In k-anonymity, each quasi-identifier is indistinguishable from at least k − 1 other quasi-identifiers. Tables 1 and 2 show an example of the original healthcare data and a 4-anonymous version of it.


However, k-anonymity is not secure against homogeneity attacks and background knowledge attacks [8]. For example, suppose that Alice knows that Bob is in his twenties and his zip code is 13032; then, Alice can infer from Table 2 that Bob must have a gastric ulcer.
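The following minimal sketch shows how a k-anonymity check and a homogeneity check could look in Python. The table contents, column names, and function names are hypothetical illustrations, not the paper's Table 2.

```python
from collections import defaultdict

def equivalence_classes(rows, quasi_ids):
    """Group rows by their (generalized) quasi-identifier values."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[q] for q in quasi_ids)].append(row)
    return classes

def is_k_anonymous(rows, quasi_ids, k):
    """True if every equivalence class contains at least k rows."""
    return all(len(g) >= k for g in equivalence_classes(rows, quasi_ids).values())

def homogeneous_classes(rows, quasi_ids, sensitive):
    """Classes whose rows all share one sensitive value (homogeneity attack)."""
    return [key for key, g in equivalence_classes(rows, quasi_ids).items()
            if len({r[sensitive] for r in g}) == 1]

# Hypothetical generalized table with two equivalence classes of four rows.
table = (
    [{"zip": "130**", "age": "20-29", "disease": "gastric ulcer"}] * 4
    + [{"zip": "148**", "age": "30-39", "disease": d}
       for d in ("flu", "cancer", "flu", "gastritis")]
)

quasi = ["zip", "age"]
print(is_k_anonymous(table, quasi, 4))                # 4-anonymous
print(homogeneous_classes(table, quasi, "disease"))   # the leaking class
```

In this toy table the first class is 4-anonymous yet reveals the diagnosis of everyone in it, which is the situation described for Bob above.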
To mitigate these attacks, Machanavajjhala et al. introduced a new anonymity technique, called l-diversity [8]. In l-diversity, all the equivalence classes that have the same quasi-identifiers must have l or more different sensitive attributes. Table 3 shows l-diverse healthcare data.
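As a rough illustration, the simplest reading of this definition (often called distinct l-diversity) can be checked as below. The rows and the function name are hypothetical, and the other variants defined in [8] are not covered by this sketch.

```python
from collections import defaultdict

def is_l_diverse(rows, quasi_ids, sensitive, l):
    """True if each equivalence class has at least l distinct sensitive values."""
    values = defaultdict(set)
    for row in rows:
        values[tuple(row[q] for q in quasi_ids)].add(row[sensitive])
    return all(len(v) >= l for v in values.values())

# Hypothetical table: every class holds three different diagnoses.
rows = (
    [{"zip": "130**", "age": "20-29", "disease": d}
     for d in ("gastric ulcer", "gastritis", "stomach cancer")]
    + [{"zip": "148**", "age": "30-39", "disease": d}
       for d in ("flu", "bronchitis", "pneumonia")]
)

print(is_l_diverse(rows, ["zip", "age"], "disease", 3))
print(is_l_diverse(rows, ["zip", "age"], "disease", 4))
```

Note that the first class passes the check with l = 3 even though all three values are stomach-related, so counting distinct values alone says nothing about how similar they are.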

Following this result, Li et al. showed that l-diversity is insufficient for anonymity [9]. In l-diversity, information can still leak if there is a significant difference between the distribution of sensitive attributes in an equivalence class and the distribution of all sensitive attributes. For example, if Alice knows Bob’s personal information such as his age and zip code, she will be able to infer from Table 3 that Bob has a stomach-related disease (e.g., gastric ulcer, gastritis, or stomach cancer).
To mitigate this problem, Li et al. introduced another anonymity technique, called t-closeness [9]. t-closeness requires the distribution of sensitive attributes of any equivalence class to be similar to that of all sensitive attributes.
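A small sketch of such a check follows. It assumes a categorical sensitive attribute with equal ground distance between values, in which case the Earth Mover's Distance used in [9] reduces to the variational distance computed here. The function names, tables, and thresholds are illustrative assumptions.

```python
from collections import Counter, defaultdict

def distribution(values):
    """Empirical distribution of a list of categorical values."""
    counts = Counter(values)
    return {v: c / len(values) for v, c in counts.items()}

def variational_distance(p, q):
    """Half the L1 distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def max_class_distance(rows, quasi_ids, sensitive):
    """Largest distance between any class distribution and the overall one."""
    overall = distribution([r[sensitive] for r in rows])
    classes = defaultdict(list)
    for r in rows:
        classes[tuple(r[q] for q in quasi_ids)].append(r[sensitive])
    return max(variational_distance(distribution(v), overall)
               for v in classes.values())

def is_t_close(rows, quasi_ids, sensitive, t):
    return max_class_distance(rows, quasi_ids, sensitive) <= t

def make(zip_code, diseases):
    return [{"zip": zip_code, "disease": d} for d in diseases]

# Hypothetical tables: one mirrors the overall distribution, one does not.
balanced = make("130**", ["flu", "cancer"]) + make("148**", ["flu", "cancer"])
skewed = make("130**", ["flu", "flu"]) + make("148**", ["cancer", "cancer"])

print(max_class_distance(balanced, ["zip"], "disease"))
print(max_class_distance(skewed, ["zip"], "disease"))
```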
2.3. GHV Homomorphic Encryption Scheme
The GHV homomorphic encryption scheme supports one multiplicative and polynomially many additive operations on encrypted data [14]. The security of the GHV homomorphic encryption scheme is based on the learning with errors (LWE) problem [24], a well-studied lattice hardness assumption.
Let be the security parameter, then other parameters are as follows:(i),(ii) is a positive integer by setting a prime number ,(iii), and(iv) is a Gaussian parameter.
Then the IND-CPA secure [25] GHV homomorphic encryption scheme GHV = {GHV.Key, GHV.Enc, GHV.Dec, GHV.Add, GHV.Mul} is as follows.
(i) GHV.Key: given , , and , output a public key and a secret key such that , is invertible, and the elements of are bounded by . (To generate two matrices and , the trapdoor sampling algorithm in [26] can be used. For further details, please refer to [14].)
(ii) GHV.Enc: given and a plaintext , choose a uniformly random matrix and a Gaussian error matrix . Then output a ciphertext .
(iii) GHV.Dec: given and a ciphertext , compute . Then output a plaintext .
In this algorithm,
(iv) GHV.Add: given ciphertexts, , …, , output That is, the output of GHV.Dec is .
(v) GHV.Mul: given two ciphertexts, and , output That is, the output of GHV.Dec is .
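For readers who want the formulas in one place, the following is a sketch of the GHV algorithms as they are usually stated in [14], with binary plaintext matrices. The symbol names ($A$, $T$, $S$, $X$, $B$, $C$, $E$, $k$) are assumptions chosen to follow that reference, since the inline notation is not available here, and the parameter constraints are omitted.

```latex
\begin{align*}
\textsf{GHV.Key}:\quad & A \in \mathbb{Z}_q^{m \times n},\; T \in \mathbb{Z}^{m \times m},\quad
  T A \equiv 0 \pmod{q},\; T \text{ invertible with small entries} \\
\textsf{GHV.Enc}(A, B):\quad & C = A S + 2X + B \bmod q,\quad
  S \in \mathbb{Z}_q^{n \times m} \text{ uniform},\; X \text{ a Gaussian error matrix} \\
\textsf{GHV.Dec}(T, C):\quad & E = T C T^{t} \bmod q,\quad
  B = T^{-1} E \,(T^{t})^{-1} \bmod 2 \\
\textsf{GHV.Add}:\quad & C = \sum_{i=1}^{k} C_i \bmod q
  \;\text{ decrypts to }\; \sum_{i=1}^{k} B_i \bmod 2 \\
\textsf{GHV.Mul}:\quad & C = C_1 C_2^{t} \bmod q
  \;\text{ decrypts to }\; B_1 B_2^{t} \bmod 2
\end{align*}
```

Decryption works because $TA \equiv 0 \pmod{q}$ removes the $AS$ term, leaving $T(2X + B)T^{t}$, which is recovered exactly as long as its entries stay below $q/2$ in absolute value.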
In this paper, we use, as a building block, a variant of the GHV homomorphic encryption scheme which supports only additive operations. We call this variant of the GHV homomorphic encryption scheme a homomorphic encryption scheme hereafter. We can replace , , , and of the GHV homomorphic encryption scheme with , , , and of the homomorphic encryption scheme without any loss of security. Then the IND-CPA secure [25] homomorphic encryption scheme = {.Key, .Enc, .Dec, .Add} is as follows.
(i) : given , , and , output a public key and a secret key such that , is invertible, and the elements of are bounded by .
(ii) : given and a plaintext , choose a uniformly random vector and a Gaussian error vector . Then output a ciphertext .
(iii) : given and a ciphertext , compute . Then output a plaintext .
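To make the idea of additive homomorphism concrete, here is a toy sketch in Python. It is not the variant described above: it swaps in a simple symmetric LWE-style scheme with a single shared secret, uniform small noise instead of Gaussian noise, and tiny parameters, so it has no trapdoor, no distinct public keys, and no security. The names `Q`, `P`, `N`, `keygen`, `encrypt`, `add`, and `decrypt` are assumptions made for this illustration.

```python
import random

Q = 2**31 - 1   # ciphertext modulus (toy size, not secure)
P = 10007       # plaintext modulus; plaintext sums must stay below P
N = 16          # secret dimension (toy size)

def keygen(rng):
    """Sample a uniformly random secret vector."""
    return [rng.randrange(Q) for _ in range(N)]

def encrypt(secret, message, rng):
    """Ciphertext (a, b) with b = <a, s> + P*e + message mod Q."""
    a = [rng.randrange(Q) for _ in range(N)]
    e = rng.randint(-3, 3)  # small noise standing in for a Gaussian error
    b = (sum(x * s for x, s in zip(a, secret)) + P * e + message) % Q
    return a, b

def add(c1, c2):
    """Component-wise addition of ciphertexts adds the plaintexts."""
    a = [(x + y) % Q for x, y in zip(c1[0], c2[0])]
    return a, (c1[1] + c2[1]) % Q

def decrypt(secret, ciphertext):
    """Remove <a, s>, centre the result around 0, then reduce modulo P."""
    a, b = ciphertext
    v = (b - sum(x * s for x, s in zip(a, secret))) % Q
    if v > Q // 2:
        v -= Q
    return v % P

rng = random.Random(0)
secret = keygen(rng)
ages = [25, 34, 47]  # hypothetical values
ciphertexts = [encrypt(secret, age, rng) for age in ages]
total = ciphertexts[0]
for ct in ciphertexts[1:]:
    total = add(total, ct)
print(decrypt(secret, total))  # sum of the ages
```

The sum decrypts correctly because the accumulated noise term `P*e` vanishes modulo `P` and stays far below `Q/2`.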
2.4. Ajtai’s One-Way Function
Ajtai constructed a one-way function whose security is based on some well-known approximation problems in lattices [27, 28].
Let be the security parameter, a positive integer, and a positive integer. For a uniformly random matrix and , Ajtai’s one-way function is as follows:
Note that Ajtai’s one-way function is regular [29]; that is, every output of is uniformly distributed over [30].
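As an illustration, Ajtai's function is commonly written as $f_A(x) = Ax \bmod q$. The sketch below assumes that form with binary input vectors; the toy parameter values and the names `ajtai`, `n`, `m`, and `q` are assumptions for this example and are far too small to be secure.

```python
import random

def ajtai(A, x, q):
    """Compute f_A(x) = A x mod q for an n-by-m matrix A and a length-m vector x."""
    return [sum(a * b for a, b in zip(row, x)) % q for row in A]

rng = random.Random(0)
n, m, q = 4, 64, 97  # toy sizes for illustration only
A = [[rng.randrange(q) for _ in range(m)] for _ in range(n)]
x = [rng.randrange(2) for _ in range(m)]  # binary input vector (an assumption)
y = ajtai(A, x, q)
print(len(y))  # the output has n entries, each in the range 0..q-1
```

Because the map is linear modulo `q`, the images of two inputs add up to the image of their sum, which is the property that lets terms of this shape be combined and cancelled.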
3. Controlled Secure Aggregation Protocol
In this section, we propose our controlled secure aggregation protocol (CSA protocol hereafter). Let be the security parameter. Then we choose other parameters which are used in our CSA protocol as follows:(i),(ii) is a positive integer by setting a prime number ,(iii), and(iv) is a Gaussian parameter.
Suppose that there are n users, , a receiver , and an aggregator . Each user outsources its own numerical data in encrypted form. We assume that the receiver wants to know an aggregated value , where and is the number of elements in . We also assume that the receiver has a public key and a secret key obtained by performing . Then the receiver can get by performing our CSA protocol.
Our CSA protocol consists of the following phases which are illustrated in Box 1: Key Generation, Encryption, Aggregation, reAggregation, and decAggregation. In the Key Generation phase, each user generates a public key pair and a secret key. In the Encryption phase, each user encrypts its numerical data with his/her public key pair. In the Aggregation phase, ciphertexts generated under distinct public key pairs are aggregated. That is, to get an aggregated value, the receiver allows the aggregator to know . Then an aggregator aggregates each ciphertext on in this phase. In the reAggregation phase, the user eliminates from a ciphertext and adds .Enc which is a ciphertext on under the receiver’s public key . In this phase, a ciphertext under the public key is converted into a ciphertext under the public key maintaining the same plaintext. For example, suppose that and , then As a result, a ciphertext which is decryptable by a user is converted into a ciphertext which is decryptable by the receiver maintaining the same plaintext . This phase is needed in the decAggregation phase to make an aggregated ciphertext decryptable by the receiver . In the decAggregation phase, each user in turn makes an aggregated ciphertext decryptable by the receiver . Through these phases, the receiver can get an aggregated value .

For example, we assume that users participating in our controlled secure aggregation protocol CSA and each user has its numerical data . Each user () outsources its numerical data with encrypted form , using algorithm. Suppose that the receiver wants to know an aggregated value . The receiver lets the aggregator know . Then runs the algorithm to get . gives , , and to , and and to . Then runs to get sends to ; then runs to get sends to ; then runs to get . That is, is the same as , since is a sufficiently short value [14].
In the decAggregation phase, any user can refuse to perform the algorithm, unless initial consents that have been obtained from patients allow the secondary research. Then the receiver cannot get the result. The receiver can get the result only if all users perform the algorithm. That means the receiver can get the aggregated value that he/she is seeking only by the unanimous consent of all users whose data are aggregated. That is the reason why we use the term “controlled” in the CSA protocol.
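The control flow of this consent step can be sketched with a toy analogue. The code below replaces the lattice ciphertexts with simple additive masks, omits the conversion to the receiver's key, and offers no security. It only illustrates that every user must take part before the sum becomes readable. All names (`User`, `strip_mask`, `aggregate`, `controlled_sum`) are hypothetical.

```python
import random

Q = 2**61 - 1  # modulus for the toy masks

class User:
    def __init__(self, data, rng, consents=True):
        self.mask = rng.randrange(Q)              # stands in for the user's secret key
        self.consents = consents
        self.ciphertext = (data + self.mask) % Q  # value outsourced to storage

    def strip_mask(self, aggregated):
        """Remove this user's mask from the aggregate, or refuse (None)."""
        if not self.consents:
            return None
        return (aggregated - self.mask) % Q

def aggregate(ciphertexts):
    """Aggregator: add up masked values without learning any of them."""
    return sum(ciphertexts) % Q

def controlled_sum(users, aggregated):
    """Pass the aggregate through every user in turn; stop on a refusal."""
    for user in users:
        aggregated = user.strip_mask(aggregated)
        if aggregated is None:
            return None
    return aggregated

rng = random.Random(1)
users = [User(d, rng) for d in (25, 34, 47)]  # hypothetical data

# All users consent: the receiver obtains the true sum.
total = controlled_sum(users, aggregate([u.ciphertext for u in users]))

# The aggregator silently drops the last ciphertext: the result is garbage.
tampered = controlled_sum(users, aggregate([u.ciphertext for u in users[:2]]))

# One user refuses: the receiver obtains nothing.
users[1].consents = False
refused = controlled_sum(users, aggregate([u.ciphertext for u in users]))

print(total, refused)
```

In the tampered run the leftover mask of the excluded user makes the output look random, which mirrors the detection argument given for a malicious aggregator.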
3.1. Security
We now analyze the security of our controlled secure aggregation protocol.
First, we show that our encryption is IND-CPA secure. Intuitively, the only difference between our encryption scheme and the homomorphic encryption scheme is how to generate a vector . In the homomorphic encryption scheme, the vector is chosen uniformly, but in our encryption scheme , it is generated by computing using a randomly chosen vector . Since every output of Ajtai’s one-way function is uniformly distributed over , a vector from our encryption scheme is uniformly distributed over . Therefore, the security of our encryption scheme is the same as that of the homomorphic encryption scheme.
Theorem 1. Our encryption scheme provides IND-CPA if the homomorphic encryption scheme provides IND-CPA and every output of Ajtai’s one-way function is uniformly distributed over .
Proof of Theorem 1. Formally, we show that if there exists an adversary breaking the IND-CPA security of our encryption scheme , there exists a challenger breaking the IND-CPA security of the homomorphic encryption scheme.
Let be an instance given to . chooses a uniformly random matrix and sends to . chooses and sends to . outputs and returns , where . sends to , and outputs . Then outputs .
In our controlled secure aggregation protocol CSA, ciphertexts generated under distinct public key pairs can be aggregated. To decrypt the aggregated ciphertext which is generated by ciphertexts of users , each user needs to eliminate from and add using the algorithm. Therefore, we should show that is secure.
Theorem 2. CSA.reAgg is an aggregation of secure ciphertexts if is an aggregation of secure ciphertexts including the ciphertext which is one of a pair of the ciphertexts .
Proof of Theorem 2. Let CSA.reAgg, , , , and , then Therefore, CSA.reAgg is the aggregation of the secure ciphertexts.
In the fifth step of the algorithm, .Enc is added to be secure against an adversary who can eavesdrop on our controlled secure aggregation protocol CSA. Assume that in the fifth step of CSA.reAgg, then any adversary who can eavesdrop on our controlled secure aggregation protocol CSA is able to get , , and . Then, can compute the following: Since is a sufficiently short value, is the same as [14]. Therefore, can decrypt without the secret key .
In the decAggregation phase, after all the users eliminate from , the result is the same form as a ciphertext generated under the public key . Therefore, the receiver can decrypt it.
4. Secure Clinical Data Aggregation System
In this section, we provide an overview of our system and how it works.
4.1. System Overview
The proposed system environment consists of hospitals, an aggregator, and researchers. In our system, each hospital outsources its clinical data to cloud storage servers. Hospitals use the following hybrid method to store data when outsourcing their clinical research data in cloud servers: they make anonymous data publicly available in the cloud servers using the anonymity techniques for deidentification described in Section 2.2. In addition, hospitals also store their encrypted numerical data together with the anonymous data for statistical accuracy.
Suppose that there are hospitals, , and that want to share their clinical data and have public and secret key pairs , and of our CSA protocol, respectively. Suppose that there is an aggregator and a researcher who has a public and secret key pair of the homomorphic encryption scheme. The original clinical data of hospitals are shown in Table 4. Each hospital outsources its clinical data to cloud storage servers. That is, stores deidentified nonsensitive data (such as zip code and age), raw sensitive data, and encrypted numerical data (such as age) using on cloud servers. Both anonymous and encrypted clinical data on cloud servers are shown in Table 5, where is an output of is an output of , and so on.
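The hybrid storage described above can be sketched as follows: quasi-identifiers are generalized for public release, while the exact numerical value is kept only in encrypted form. The generalization rules, field names, and the `fake_encrypt` placeholder are assumptions for illustration. In particular, the placeholder is not encryption and merely marks where the protocol's encryption algorithm would be called.

```python
def generalize_zip(zip_code, keep=3):
    """Keep the first `keep` characters and mask the rest with '*'."""
    return zip_code[:keep] + "*" * (len(zip_code) - keep)

def generalize_age(age, width=10):
    """Replace an exact age with a range of the given width."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def publish(record, encrypt):
    """Build the row a hospital would store on the cloud servers."""
    return {
        "zip": generalize_zip(record["zip"]),
        "age": generalize_age(record["age"]),
        "disease": record["disease"],        # sensitive value stored as-is
        "enc_age": encrypt(record["age"]),   # exact value, encrypted form only
    }

def fake_encrypt(value):
    """Placeholder only: stands in for the real encryption call."""
    return f"ENC({value})"

# Hypothetical patient record.
row = publish({"zip": "13053", "age": 28, "disease": "flu"}, fake_encrypt)
print(row)
```

A researcher looking for a rough estimate would read the `age` range directly, while an exact sum over `enc_age` would have to go through the aggregator.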


When the researcher wants to know a rough estimate of the age of the hospitals’ cancer patients, can directly get the estimate from the cloud servers. When wants to figure out the average age of the hospitals’ cancer patients, can ask the aggregator for an aggregated age. sums up the ages of cancer patients in each hospital, then totals the ages across hospitals. That is, performs homomorphic additions to ciphertexts under the same public key, such as .Add.Add, and . After performing homomorphic additions, runs to get an aggregated ciphertext . In order to allow to know the aggregated age, each hospital in turn gives its consent. , and in turn perform CSA.reAgg, CSA.reAgg and CSA.reAgg to get , , and , respectively. After the agreement procedure, can get the aggregated ciphertext under his/her public key. Then, can get the sum of the ages of the cancer patients, , by performing .Dec, and can compute the average age by dividing this sum by the number of cancer patients.
4.2. Attack Model
For designing a secure clinical data aggregation system, the following conditions should be considered.
(1) (Anonymity) Adversaries should not be able to identify exactly one private individual after looking at ciphertexts on cloud storage servers.
(2) (Confidentiality) Adversaries should not be able to reveal any information from the encrypted numerical data on cloud storage servers.
(3) (External security) Third parties (external adversaries) should not learn any information from the information flow.
(4) (Internal security) Hospitals and researchers (internal adversaries), except the researcher who sends a request, should not learn any information from the information flow.
4.3. Our System
Now, we propose our secure clinical data aggregation system (SCDA system hereafter). Let be the security parameter. Then we choose other parameters which are used in our SCDA system as follows:(i),(ii) is a positive integer by setting a prime number ,(iii), and(iv) is a Gaussian parameter.
Suppose that there are hospitals , researchers , and an aggregator . We assume that the th hospital has tuples and the relational database in the cloud servers has n_{C} numerical clinical data attributes.
The building blocks of our SCDA system are our controlled secure aggregation protocol CSA and the homomorphic encryption scheme . Our SCDA system consists of the following phases which are illustrated in Box 2: Preparation, Data Publication, Query, Aggregation, Consent, and Acquisition. In the Preparation phase, each hospital and each researcher generates a public key (pair) and a secret key. In the Data Publication phase, each hospital encrypts its numerical clinical data with its public key pair and makes anonymous data using anonymity techniques for deidentification. Then each hospital stores them in the cloud servers. In the Query phase, one of the researchers asks the aggregator for aggregated clinical data. In the Aggregation phase, ciphertexts generated by distinct hospitals are aggregated. In the Consent phase, each hospital goes through the procedure for consent. In the Acquisition phase, the researcher can get the aggregated clinical data.
