Research Article | Open Access
Secure Testing for Genetic Diseases on Encrypted Genomes with Homomorphic Encryption Scheme
The decline in genome sequencing costs has widened the population that can afford sequencing and has also raised concerns about genetic privacy. Kim et al. present a practical solution for secure searching of gene data on a semitrusted business cloud. However, their scheme contains three types of errors, and we make three improvements to correct them. First, they truncate the variation encodings of genes to 21 bits, which causes the losing of partial coefficient error (LPCE): more than 5% of the entries in the database cannot be queried in full. We decompose these large encodings into 44-bit components and process the components separately to avoid the LPCE error. Second, we abandon the hash function used in Kim's scheme, which may cause the hash collision error (HCE) with a probability of , and decompose the position encoding of gene into three parts with the basis to avoid the HCE error. Third, we analyze the relationship between the parameters and the coefficient combination error (CCE) and specify the condition that the parameters need to satisfy to avoid the CCE error. Experiments show that our scheme can search all entries, and the probability of searching error is reduced to less than .
Genes are fundamental to human health: human life activities and physiological phenomena are directly related to genes. Genome data can be used for a wide range of applications including healthcare, biomedical research, and forensics. Gene sequencing technology is the core of the Human Genome Project; it helps us to better understand the life activities of cells and organisms, and it is of great significance for the prevention and treatment of diseases such as cancer and genetic diseases.
Advances in high-throughput technologies have made it increasingly affordable to sequence the human genome in various settings, ranging from biomedical research to healthcare. Reported figures show that in 2000 the cost of sequencing a whole human genome was nearly $3 billion; by 2015 the cost of sequencing a single genome had fallen to less than $1,000, and sequencing only certain sites on the genome costs even less.
The decline in genome sequencing costs has widened the population that can afford gene sequencing and has also raised concerns about genetic privacy. Genetic data are widely used in healthcare, biomedical research, identification, and other fields, and they are highly privacy-sensitive. More and more businesses and individuals outsource the processing of genetic data to cloud services, but current commercial cloud servers do not fully guarantee the privacy and security of genetic data. This raises concerns about the privacy of sensitive information, since the data are stored in external, off-premise data centers. In the health sector in particular, sensitive patient records need to be kept confidential.
A number of technical solutions have been proposed to protect genome privacy. Existing studies can be categorized into two groups: (i) protecting the computation process in genome data analysis [5–7] and (ii) protecting the genome data before computation [8, 9] or research outcomes after computation.
Protecting genetic privacy, so that a user's genetic data cannot be compromised by unauthorized users or organizations, is an urgent problem. To mitigate the privacy risks inherent in storing and computing on sensitive data, cryptography offers a potential solution in the form of encryption: only the legitimate data owner can access the data, by decrypting it with their private decryption key.
However, because of limited personal computing power and patents on genetic diagnosis algorithms, the computation and analysis of genetic data sometimes has to be carried out in the cloud: the cloud server analyzes the user's genetic data and returns information relevant to the user's diagnosis and treatment. Traditional cryptographic schemes do not support this, because the data center cannot perform computation on the stored ciphertexts without the decryption key.
Homomorphic encryption allows computation on encrypted data without knowledge of the secret key: decrypting the resulting ciphertext gives the same value as applying the corresponding operations to the plaintexts. In 2009, Gentry proposed the first fully homomorphic encryption (FHE) scheme and described a blueprint for constructing FHE. Since then, many improvements to FHE based on Gentry's work have been proposed, such as [13–17].
Homomorphic encryption-based methods that support secure genome data computation have been studied. Cheon et al. studied how to calculate the edit distance of encrypted gene data homomorphically. Yasuda et al. described how to compute multiple Hamming distance values on encrypted data using the LNV scheme. Graepel et al. and Bos et al. applied HE to machine learning and described how to privately conduct predictive analysis based on an encrypted learned model. Lauter et al. gave a solution to privately compute the basic genomic algorithms used in genetic association studies.
To achieve safe genetic data analysis, iDASH (integrating Data for Analysis, Anonymization, and SHaring) National Center has released annual security challenges regarding genetic privacy protection since 2014. In 2016, the challenge of testing for genetic diseases on encrypted genomes (secure outsourcing) was published to calculate the probability of genetic diseases through matching a set of biomarkers to encrypted genomes stored in a commercial cloud service. The requirement is that the entire matching process (only consider the exact match for each variation) needs to be carried out using homomorphic encryption so that no trace is left behind during the computation.
In response to the challenge published by iDASH, Kim et al. gave a practical solution, referred to as [KSC17], which uses homomorphic encryption to encrypt the entire gene database as a polynomial over a ring, thus solving the challenge of testing for genetic diseases (secure outsourcing) to a certain extent.
The application scenario of this paper is shown in Figure 1. Three parties are involved: the user (a hospital or medical institution that holds the patient's gene data), the semitrusted commercial cloud service, and the data owner (the research institute that holds the genetic variation database). The purpose of the system is to determine whether a patient's gene data is present in the gene variation database.
System Initialization. The data owner encrypts the gene variation database and uploads the ciphertexts to the commercial cloud server. The user then interacts with the cloud server to complete the testing process.
Step 1: the user encrypts the patient's gene data and uploads the ciphertexts to the commercial cloud server.
Step 2: the cloud homomorphically searches for the user's gene data in the database and generates a ciphertext of the search result.
Step 3: the cloud sends the ciphertext to the user.
Step 4: the user decrypts the ciphertext and concludes whether the patient's gene data is present in the gene variation database.
The source code of our implementation is available on GitHub: https://github.com/lonyliu/genetest.
Our Contributions. The contributions of this paper focus on optimizing the design and improving the correctness of the scheme. By analyzing [KSC17] and its related code, we found three types of query errors in [KSC17], called losing of partial coefficient error (LPCE), hash collision error (HCE), and coefficient combination error (CCE), and made the following improvements.
(1) The gene data is encoded with a prefix code, so that more entries in the gene database can be detected with fewer bits than in [KSC17].
(2) Correcting the LPCE error: [KSC17] truncates the variation encodings of genes to 21 bits, which loses part of the coefficient; as a result, more than 5% of the entries in the database cannot be queried in full. In this paper, we decompose the encodings of gene variation into 44-bit components, then optimize, encrypt, and query the components separately. As a result, all the entries in the database can be queried effectively.
(3) Correcting the HCE error: [KSC17] uses a hash function in a way that may cause HCE error with a probability of . In this paper, we abandon the hash function, at the cost of enlarging the database ciphertext by half, thus avoiding hash collisions.
(4) Correcting the CCE error: [KSC17] cannot distinguish different groups of gene data, which may return incorrect results with nonnegligible probability. In this paper, we analyze the relationship between the core parameter (the bit size of the encoding for gene variation) and the CCE error and specify the condition that the parameter needs to satisfy so that the probability of CCE errors is negligible.
2. Practical Homomorphic Encryptions
This section describes the homomorphic encryption schemes which are used in our genetic privacy protection. First, some symbols and parameters are described below.
For the security parameter , let integer define the th cyclotomic polynomial , Throughout this paper, we assume that the integer is a power of two so that and . Both of our homomorphic encryption schemes operate in the polynomial ring . denotes the reduction modulo into the interval of the integer or integer polynomial (coefficient-wise). Set the plaintext space to for some fixed and the ciphertext space to for an integer . Let denote a noise distribution over the ring . Notation denotes that is chosen from the distribution , and denote that is randomly chosen from the distribution .
2.1. The RLWE Scheme
First basic homomorphic encryption scheme is based on the hardness of Ring Learning with Errors (RLWE) assumption, which is proposed by Lyubashevsky, Peikert, and Regev. The RLWE assumption is divided into decisional RLWE assumption and computational RLWE assumption. The decisional RLWE assumption implies the infeasible solution to distinguish the following two distributions: pairs where and pairs where and . The computational RLWE assumption is that it is hard to find the key from many samples .
The RLWE scheme is described as follows:(i) RLWE.ParamsGen: given the security parameter , choose an integer which is a power of 2, a ciphertext modulus , a plaintext modulus with , and discrete Gaussian distribution . Output params .(ii) RLWE.KeyGen(params): for input parameters , let and choose a random sparse . Generate an RLWE instance for . Set the secret key and the public key .(iii) RLWE.Enc: for the input plaintext , choose a small polynomial and two Gaussian polynomials , and output the ciphertext :(iv) RLWE.Dec: given the ciphertext , output the plaintext :(v) RLWE.Add: given three ciphertexts , , with the same secret key sk, output the ciphertext .
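To make the shape of these algorithms concrete, the following is a minimal toy sketch of a generic RLWE public-key scheme with additive homomorphism. It is not the exact scheme of this paper: the formulas above did not survive in this text, so the sketch assumes a common textbook form in which the message is scaled by q/t, and it replaces the discrete Gaussian by coefficients drawn from {-1, 0, 1}. The parameters `N`, `Q`, `T` and all function names are illustrative assumptions and are far too small to be secure.

```python
import random

N = 16            # ring dimension (power of two); toy size, NOT secure
Q = 1 << 20       # ciphertext modulus q (assumed value)
T = 16            # plaintext modulus t (assumed value)
DELTA = Q // T    # scaling factor q/t applied to the message

def polymul(a, b, mod):
    """Multiply in Z_mod[X]/(X^N + 1), i.e. a negacyclic convolution."""
    out = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = i + j
            if k < N:
                out[k] = (out[k] + ai * bj) % mod
            else:  # X^N = -1, so wrapped terms change sign
                out[k - N] = (out[k - N] - ai * bj) % mod
    return out

def polyadd(a, b, mod):
    return [(x + y) % mod for x, y in zip(a, b)]

def small(rng):
    """Small polynomial with coefficients in {-1, 0, 1} (Gaussian stand-in)."""
    return [rng.choice((-1, 0, 1)) for _ in range(N)]

def keygen(rng):
    s = small(rng)
    a = [rng.randrange(Q) for _ in range(N)]
    e = small(rng)
    # b = -(a*s) + e, so that b + a*s = e is small
    b = polyadd([(-x) % Q for x in polymul(a, s, Q)], e, Q)
    return s, (b, a)

def encrypt(pk, m, rng):
    b, a = pk
    v, e0, e1 = small(rng), small(rng), small(rng)
    scaled = [(DELTA * (x % T)) % Q for x in m]
    c0 = polyadd(polyadd(polymul(v, b, Q), e0, Q), scaled, Q)
    c1 = polyadd(polymul(v, a, Q), e1, Q)
    return c0, c1

def decrypt(sk, ct):
    c0, c1 = ct
    noisy = polyadd(c0, polymul(c1, sk, Q), Q)  # DELTA*m + small noise (mod Q)
    return [((T * x + Q // 2) // Q) % T for x in noisy]  # round to nearest

def add(ct1, ct2):
    """Homomorphic addition: component-wise sum of the two ciphertexts."""
    return polyadd(ct1[0], ct2[0], Q), polyadd(ct1[1], ct2[1], Q)
```

With these toy parameters the accumulated noise stays far below q/(2t), so decryption of a fresh ciphertext, or of the sum of two ciphertexts, recovers the message exactly.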
Conversion and Modulus Switching techniques have been introduced in [KSC17]. Conversion technique can change an RLWE ciphertext of into an LWE encryption of its constant term . Modulus Switching technique reduces the ciphertext modulus down to while preserving the message, thus reducing the size of ciphertext.
2.2. The Ring-GSW Scheme
In 2013, Gentry et al. proposed an LWE-based homomorphic encryption scheme , which uses the approximate eigenvector method to express ciphertext as a matrix, so that the addition and multiplication of ciphertext no longer cause dimension expansion. In this paper, we use its RLWE version introduced by Ducas and Micciancio , and its encryption algorithm is given below:(i) RGSW.ParamsGen: given the same parameters and secret key as in the RLWE scheme, set the decomposition base and exponent satisfying . Given a small matrix for identity matrix .(ii) RGSW.Enc: given the plaintext , choose a matrix uniformly, and ; output the ciphertext :
And the ciphertext satisfies . Let denote the decomposition with the base , so can be regarded as an approximate eigenvalue of with the eigenvector .
Reference  defines a hybrid multiplication between an RLWE ciphertext and an RGSW ciphertext .
Thus the ciphertext is a RLWE encryption of .
3. Encoding and Encryption of Gene Data
Recall the task proposed by iDASH: secure matching of biomarkers against encrypted genetic data. In this section, we describe how to encode and encrypt the genomic data.
3.1. Genetic Data
The gene data is stored in a semitrusted business cloud in VCF format. The database VCF file contains multiple genotype information entries, each of which consists of chrome (chr), position (pos), locus (loc), reference (ref), alternate (alt), and type. An example of the database is shown in Table 1. Chrome represents the chromosome where the gene is located, and it ranges over 1 to 22, X, and Y. Position represents the base position of the gene variation in the chromosome, and locus indicates the location of the gene. Reference, alternate, and type describe the base change of the variation: reference is the base information before the mutation occurs; alternate is the base information after the mutation; type indicates the type of the mutation, which is one of single base variation (SNP), multibase mutation (SUB), insertion variation (INS), or deletion variation (DEL).
In fact, the gene mutation can be located by chr and pos information only, and the information of base change can be obtained by comparing the ref base and the alt base. In order to improve the efficiency of the program, we only match the chr and pos information between the patient and the cloud, and then we get the corresponding ref and alt information of base variation at the same location in the database. Finally the user compares the base change information from the cloud and his base change information to get the final match result.
3.2. Encoding and Encryption of Genetic Data
In this section, we describe how to encode the genomic data so that they can be applied to homomorphic encryption scheme. Let denote the position information of the th entry in the gene database, the variation information of the th entry in the gene database, and the integer encodings of reference genome and alternate genome, respectively.
For the coding of the gene position information, define a mapping from (chr, pos) to :
In the following we describe how the base variation information is encoded in [KSC17]. Firstly, they represent the common SNPs by two binary numbers as and encode them according to their order. Then they pad a 1 to the left of the bit string so as to distinguish the -string and the empty string. For instance, the base A will be encoded as , and string CG will be encoded as . denotes the maximal number of reference (or alternate) alleles to be compared between the query genome and genomes in the target database; thus the length of the base string is . In [KSC17] the encoding of base variation information is expressed as
Our Contribution. The value of in [KSC17] is set to 2, 5, or 10, but insertions or deletions of more than 10 bases do occur in practice. For example, the second entry in Table 1 for the column of "alt" genome is GGAGGTTTCAGT GAGCT. If the patient's alt genome (query gene information) is GGAGGTTTCA, the server will conclude that the patient is more likely to suffer from a genopathy. At the same time, through statistical analysis of the genetic database we found that the numbers of ref bases and alt bases are usually not symmetrical, and that the number of bases after concatenating the ref and alt genome mostly does not exceed 20. Therefore, a prefix code is used to encode the genome data:
Firstly, the SNPs are encoded as
Then, a string “111” is added to concatenate the ref and alt genome; this can help us to separate the encoding of ref and alt genome correctly. Finally, pad with 1 to the left of the bit string so as to distinguish the -strings. Here is the formula of getting :
Let denote the bit size of , and set so as to expand the number of gene entries which can be correctly matched. If the length of is less than 44 bits, then pad bit 0 at the left of the bit string to ensure that the length of the encoding is 44 bits. For the case of bits, is divided by 44 bits; the details are in Section 3.3.3.
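The encoding steps described above can be sketched as follows. The actual codeword table of the paper is not reproduced in this text, so the table `CODE` below is a hypothetical prefix code chosen only so that the separator "111" can never be confused with a base; the function names are likewise assumptions. The sketch concatenates ref and alt with the separator, prepends the guard bit 1, and pads with zeros on the left to 44 bits.

```python
# Hypothetical prefix code (assumption): no codeword is a prefix of another,
# and none of them is a prefix of the separator "111".
CODE = {"A": "00", "T": "01", "G": "10", "C": "110"}
SEP = "111"      # separator between the ref and alt encodings (from the text)
ELL = 44         # target bit size of one encoding (from the text)

def encode_variation(ref, alt):
    """Return the integer encoding: guard bit 1, ref, separator, alt."""
    bits = "1" + "".join(CODE[b] for b in ref) + SEP + "".join(CODE[b] for b in alt)
    return int(bits, 2)

def to_fixed_width(alpha, width=ELL):
    """Left-pad the encoding with zeros to `width` bits."""
    if alpha.bit_length() > width:
        raise ValueError("encoding longer than %d bits; it must be split" % width)
    return format(alpha, "0%db" % width)

def decode_variation(alpha):
    """Invert encode_variation by reading codewords from left to right."""
    if alpha <= 0:
        raise ValueError("encoding must contain the guard bit")
    bits = bin(alpha)[3:]            # drop '0b' and the leading guard bit
    inverse = {v: k for k, v in CODE.items()}
    parts, current, buf = [], [], ""
    for bit in bits:
        buf += bit
        if buf == SEP:               # end of the ref part
            parts.append("".join(current))
            current, buf = [], ""
        elif buf in inverse:         # a complete base codeword
            current.append(inverse[buf])
            buf = ""
    parts.append("".join(current))
    if buf or len(parts) != 2:
        raise ValueError("malformed encoding")
    return parts[0], parts[1]
```

Under this assumed table, a ref of G with the 17-base alt string from the example above fits in 42 bits, so it is representable in a single 44-bit encoding without truncation.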
The HE scheme in this paper is carried out on a polynomial ring, so it is necessary to express the integer pairs as polynomial , where
Since the from VCF files have bits size about 32, set , . And then the data owner (research institute) encrypts the polynomial with the RLWE public-key encryption scheme as described above.
The query genes are also encoded as a pair of integers . However, the hospital or medical institution only needs to encrypt the monomial with the RGSW symmetric encryption scheme.
3.3. The Optimization of the Encoded Data
Since the from VCF files have bits size of about 32, set , . While taking into account the safety and efficiency of HE schemes, a dimension is considered appropriate.
[KSC17] makes use of SHA-3 to transform into a pair of two nonnegative integers and , and both of them have the bit size of 11 bits; then define ring , , and mapping , and transform polynomial into groups of lower-dimension polynomials , , whereand polynomials and satisfy , for one .
A corresponding mapping is defined as the specific mapping from a term in a polynomial to terms in polynomials .
We found that there are three types of errors in [KSC17], named hash collision error (HCE), coefficient combination error (CCE), and losing of partial coefficient error (LPCE). In the following we will describe these errors and our solutions.
3.3.1. Hash Collision Error
KSC17 made use of SHA-3 to transform 33-bit-size into a pair of two 11-bit-size integers and in order to improve the efficiency of the scheme. The hash function maps 33 bits of information to 22 bits, which may cause the collisions with a probability of approximately. This collision will result in a searching error. Take 10,000 entries in the database as an example, and suppose that the user queries the position of , where . The probability of at least one hash collision existing between the query and the database, with the same and , is ; that is, the user might get a wrong result with a probability more than . What is more, this error cannot be avoided by repeating the algorithm.
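The order of magnitude of this collision probability can be checked numerically. The sketch below assumes an ideal hash whose 22-bit outputs are independent and uniform, and a query that is not itself in the database; the function name is an assumption, and the result is this sketch's own calculation rather than a figure quoted from the paper.

```python
import math

def collision_probability(num_entries, hash_bits):
    """Probability that a query's hash equals at least one of `num_entries`
    independent, uniformly distributed hash values of `hash_bits` bits."""
    p_single = 2.0 ** -hash_bits
    # 1 - (1 - p)^n, evaluated with log1p/expm1 for numerical stability
    return -math.expm1(num_entries * math.log1p(-p_single))

p = collision_probability(10_000, 22)
print(p, math.log2(p))
```

Under these assumptions the probability is a little under 0.24%, which corresponds to roughly 2^(-8.7).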
Our Contribution. For the HCE error, we abandon the method of hash function, and decompose the index with the basis , so that can be represented as ; that is, . Then, we extend the mapping to mapping where is the number of polynomial groups, , , , . And extend the corresponding mapping to mapping
As a result, we can effectively avoid the collision caused by the compression of the index and solve the HCE problem.
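The decomposition can be illustrated with ordinary base conversion. The basis did not survive in this text; the sketch assumes it is 2^11, which is what splitting a 33-bit index into three equal parts suggests, and it assumes the first part is the least significant digit. Because base conversion is a bijection, distinct indices always give distinct triples, which is why no collision can occur at this step.

```python
BASE_BITS = 11            # assumption: 33-bit index, three parts of 11 bits
BASE = 1 << BASE_BITS

def decompose_index(d, parts=3):
    """Write d in base 2^11 as (d1, d2, d3), least significant digit first."""
    if not 0 <= d < BASE ** parts:
        raise ValueError("index does not fit in %d base-2^11 digits" % parts)
    digits = []
    for _ in range(parts):
        d, r = divmod(d, BASE)
        digits.append(r)
    return tuple(digits)

def recompose_index(digits):
    """Inverse of decompose_index."""
    return sum(x * BASE ** i for i, x in enumerate(digits))
```

For example, `decompose_index(5 + 7 * 2048 + 9 * 2048**2)` returns `(5, 7, 9)`, and `recompose_index` maps the triple back to the original index.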
3.3.2. Coefficient Combination Error
In this section, we describe how the CCE error arises. For , the CCE error exists because [KSC17] cannot distinguish whether two coefficients and , picked, respectively, from , belong to the same mapping . This error may cause an entry that is not in the database to be judged as being in the database.
There is a way to determine whether an integer pair is in the database. Firstly the integer pair is employed to represent the polynomial ; then transform it into through mapping . If there exists a certain group (for one ) whose corresponding entries , satisfy , then the integer pair is judged in the database; otherwise the integer pair is not judged in the database.
We give a brief description in Figure 3.
The first line in Figure 3 represents the polynomial DB() with large dimension, and the second and third lines represent the polynomials and with small dimension. All nonzero coefficients in the polynomials are labeled with short lines. If a patient wants to check whether the query entry exists in the database with , [KSC17] will conclude that this entry exists in the database, since the coefficients of and satisfy . The patient would then be misdiagnosed as sick.
Our Contribution. The CCE error means that the query entry does not exist in the database, but the scheme reports that the entry is in the database. If the is decomposed into , , and , this error will happen only if the sum of the coefficients of the corresponding terms , , for one is exactly equal to . Since the sum of coefficients is uniformly distributed, the probability of for a given is , for one group . The mapping will generate group polynomials, and the probability that at least one group has a collision with the given query entry is . Through a similar analysis, we find that the probability of CCE error in [KSC17] also satisfies this formula. When [KSC17] gives the parameters , , , this probability will be as high as .
There are two factors we need to consider for the parameter . Firstly, the size of coefficients needs to be multiple of 11 bits. Secondly, the size of coefficients needs to be large enough to decrease the probability of CCE error, and the scheme’s efficiency should be taken into account. Consequently we set to reach the requirement for security and efficiency, and decrease the probability of CCE error to .
3.3.3. Losing of Partial Coefficient Error
In this section, we describe how the LPCE error arises. [KSC17] sets bits, while we found that some entries in the database have encodings longer than 21 bits, and the longest one needs 272 bits to be represented. For those entries, [KSC17] truncates the encoding to its leftmost 21 bits and discards the remaining bits. This truncation causes the LPCE problem. In Section 3.2 we extend to 44 bits, but this is still not enough to match every entry correctly.
Our Contribution. To solve this problem, we decompose these long encodings into 44-bit components, then optimize, encrypt, and query the components separately. Suppose that the given entry with large coefficient is . First, is decomposed by 44 bits to get a set of smaller components , where , . Second, we construct and output multiple new entries with smaller coefficients . Finally, these entries are mapped by and optimized separately. As a result, we can represent and query all entries in the database effectively, which solves the LPCE problem.
4. Secure Searching of Gene Data
4.1. Optimized Encoding Algorithm of Gene Data
4.1.1. Coefficient Optimization
In order to solve LPCE problem and get an efficient scheme, it is necessary to optimize the coefficients of the polynomial . Set as the minimum power of 44 bits, which is larger than the bound of . We found that the length of the encoding in the database is no more than 272 bits; thus set , which is minimum multiple of 44 bits and larger than 272 bits. We use a general method to decompose into with the basis , where , . In this paper, we set , and thus . Algorithm 1 presents the coefficient optimization algorithm.
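The core of the coefficient decomposition can be sketched as splitting one large integer into 44-bit digits. This is only an illustration and not a transcription of Algorithm 1. The total width of 308 bits used below is this sketch's own arithmetic (7 × 44 is the smallest multiple of 44 above 272), the function names are assumptions, and the way the paper attaches each component to a new database entry is not modeled.

```python
CHUNK_BITS = 44   # component size in bits (from the text)

def decompose_coefficient(alpha, total_bits=308):
    """Split alpha into total_bits / 44 components of 44 bits each,
    least significant component first. total_bits=308 is an assumption."""
    if total_bits % CHUNK_BITS != 0:
        raise ValueError("total_bits must be a multiple of 44")
    if alpha < 0 or alpha.bit_length() > total_bits:
        raise ValueError("coefficient does not fit in total_bits")
    mask = (1 << CHUNK_BITS) - 1
    num_chunks = total_bits // CHUNK_BITS
    return [(alpha >> (CHUNK_BITS * j)) & mask for j in range(num_chunks)]

def recompose_coefficient(chunks):
    """Inverse of decompose_coefficient."""
    return sum(c << (CHUNK_BITS * j) for j, c in enumerate(chunks))
```

No bit of the original encoding is dropped: every component is below 2^44, and recomposing the components gives back the original coefficient, in contrast to truncation to a fixed number of leading bits.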
4.1.2. Dimension Optimization
Since the encoded integers from VCF files have bits size of about 32, while taking into account the safety and efficiency for implementation of HE schemes, a dimension is considered appropriate. After decomposing the index into , , and , where , we appoint that if the , , and have been decomposed for the previous index, then rebuild a set of polynomials and reassign their corresponding coefficients. Here we set the total groups of , , to which means . Dimension optimization of encoding algorithm is shown as Algorithm 2.
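The grouping idea can be simulated on plaintexts. The sketch below is not Algorithm 2; it rests on several labeled assumptions: each polynomial has 2^11 slots, the coefficient of an entry is stored as three random additive shares modulo 2^44 (one per polynomial), and a new group of three polynomials is opened whenever any of the three slots of an entry is already occupied in every existing group. All names are hypothetical.

```python
import random

N = 1 << 11        # slots per polynomial (assumed dimension 2^11)
ELL = 44           # coefficient size in bits (from the text)
MOD = 1 << ELL

def split_index(d):
    """Base-2^11 digits of a 33-bit index, least significant first (assumed order)."""
    if not 0 <= d < N ** 3:
        raise ValueError("index out of range")
    return d % N, (d // N) % N, d // (N * N)

def build_groups(entries, rng):
    """entries: iterable of (d, alpha). Each group is a triple of sparse
    polynomials, stored as dicts mapping slot -> coefficient share."""
    groups = []
    for d, alpha in entries:
        slots = split_index(d)
        for group in groups:
            if all(s not in poly for s, poly in zip(slots, group)):
                break                      # all three slots are free here
        else:
            group = ({}, {}, {})           # open a new group
            groups.append(group)
        r1, r2 = rng.randrange(MOD), rng.randrange(MOD)
        shares = (r1, r2, (alpha - r1 - r2) % MOD)   # shares sum to alpha
        for s, poly, share in zip(slots, group, shares):
            poly[s] = share
    return groups

def lookup(groups, d, alpha):
    """True if, in some group, the three selected coefficients sum to alpha."""
    slots = split_index(d)
    return any(
        sum(poly.get(s, 0) for s, poly in zip(slots, group)) % MOD == alpha % MOD
        for group in groups
    )
```

In this model a query that mixes slots belonging to different entries selects unrelated random shares, so their sum equals the queried coefficient only with probability about 2^(-44) per group. This is the plaintext analogue of the CCE event discussed in Section 3.3.2.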
4.2. Secure Searching Algorithm of Gene Data
This section gives the general framework and complete process of secure searching algorithm, showing the process of our secure testing with details in Figure 4.
4.2.1. Database Encryption
The data owner (research institute) encodes the genomic information as , , and encrypts the polynomials as . The process is shown in Algorithm 3. Then the research institute submits the ciphertexts to the commercial cloud service (server).
4.2.2. Query Encryption
The user (hospital or medical institution) encodes the query as , , and , where . Then the user sends the ciphertexts , , to the server (commercial cloud service):
4.2.3. Evaluation Phase
The server computes the hybrid multiplications , , and between the ciphertext of genetic database and the query. Let . The server converts it into an LWE ciphertext and performs Modulus Switching operations. Then return the resulting ciphertext to the user.
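What the evaluation achieves can be seen at the plaintext level. The sketch below assumes that the query monomial is X^(-d), so that multiplying the database polynomial by it moves the coefficient at position d to the constant term; the exponent convention and the function name are assumptions, since the formulas are not reproduced in this text. In the actual scheme this product is computed under encryption by the hybrid RLWE and RGSW multiplication, which the sketch does not model.

```python
def mul_by_inverse_monomial(poly, d):
    """Return poly * X^(-d) in Z[X]/(X^n + 1), where n = len(poly), 0 <= d < n.
    Since X^n = -1, a term X^(j-d) with j < d equals -X^(j-d+n)."""
    n = len(poly)
    if not 0 <= d < n:
        raise ValueError("d must satisfy 0 <= d < n")
    out = [0] * n
    for j, c in enumerate(poly):
        k = j - d
        if k >= 0:
            out[k] = c
        else:
            out[k + n] = -c    # wrapped term picks up a sign change
    return out

def constant_term_after_query(poly, d):
    """Constant term of poly * X^(-d); equals the coefficient of X^d in poly."""
    return mul_by_inverse_monomial(poly, d)[0]
```

For a database polynomial with coefficient 42 at X^3, the query d = 3 yields a constant term of 42, while a query for a position that holds no entry yields 0. Extracting only the constant term is what the conversion to an LWE ciphertext provides on the encrypted side.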