Abstract

Cloud computing, initially adopted to cope with the explosive growth of data, is now required to take both data privacy and users’ query functionality into consideration. Searchable symmetric encryption (SSE) is a popular solution that supports efficient attribute queries over encrypted datasets in the cloud. In particular, some SSE schemes focus on the substring query, which handles the case in which the user remembers only a substring of the queried attribute. However, all of them consider substring queries on a single attribute only and cannot be used to achieve compound substring queries on multiple attributes. This paper addresses this issue by proposing an efficient and privacy-preserving SSE scheme that supports compound substring queries. Specifically, we first employ the position heap technique to design a novel tree-based index that supports substring queries on a single attribute, and we employ pseudorandom function (PRF) and fully homomorphic encryption (FHE) techniques to protect its privacy. Then, based on the homomorphism of FHE, we design a filter algorithm that calculates the intersection of search results for different attributes, which can be used to support compound substring queries on multiple attributes. A detailed security analysis shows that our proposed scheme is privacy-preserving. In addition, extensive performance evaluations are conducted, and the results demonstrate the efficiency of our proposed scheme.

1. Introduction

The rapid development of information technology has driven an explosive growth of data. To relieve local storage and computing pressure, an increasing number of individuals and organizations store and process their databases in the cloud [1, 2]. However, since the cloud server may not be fully trusted, databases containing sensitive information (e.g., electronic health records) have to be encrypted before being outsourced to the cloud. Although encryption preserves database privacy, it also hides critical information, so the cloud server cannot readily support some of the users’ query functionality over the encrypted database, e.g., the attribute query, which returns the collection of records containing a specific queried attribute.

To deal with the above challenge, the concept of searchable symmetric encryption (SSE) [3] was introduced, which enables the cloud server to search over encrypted records efficiently. Over the past years, in order to improve the query efficiency of SSE, a series of secure index techniques have been designed to match attributes to the corresponding records, such as the inverted index [4–7] and the tree-based index [8]. Since these indexes are built from exact attributes, the corresponding SSE schemes can support only the exact attribute query; i.e., the queried attribute must be exactly the same as the attribute stored in the cloud.

Recently, to handle the situation in which a user remembers only a substring of an attribute rather than the exact attribute, some studies [9–11] designed SSE schemes that support substring queries. However, they considered only a substring query on a single attribute, which cannot be used to achieve a compound substring query on multiple attributes, i.e., a query whose matching records satisfy multiple substring queries on multiple attributes at the same time. Considering an example of the compound query that a database includes records and each record in it has attributes , a user can send a compound substring query on two attributes: select from DB where , to query records whose attributes and contain substring and , respectively, where and belong to . A straightforward solution to support the compound query is for users to query the substrings separately and then calculate the intersection of the results. Unfortunately, this solution is inefficient because it imposes high communication overhead and computational cost on the data user.

To address the above problems, in this paper, we propose a privacy-preserving SSE scheme, which can efficiently support compound substring queries on multiple attributes. Specifically, the main contributions of this paper are threefold:
(i) First, based on the position heap technique, we design a tree-based index to support substring queries on a single attribute. This tree-based index can support two types of substring patterns: and , where , , and represent queried substrings and represents any string of any length. In addition, we employ pseudorandom functions and fully homomorphic encryption techniques to encrypt this tree-based index, which preserves the privacy of records.
(ii) Second, based on the homomorphism of fully homomorphic encryption, we design an algorithm (see Section 4.2.3) to calculate the intersection of search results for different attributes and therefore achieve the compound substring query on multiple attributes.
(iii) Finally, we analyze the security of our proposed scheme and conduct extensive experiments to evaluate its performance. The results show that our proposed scheme is efficient in terms of computational cost and storage overhead.

The remainder of the paper is organized as follows. We formalize the system model, security model, and design goals in Section 2. Then, we introduce some preliminaries including the position heap technique [12] and the security notion of substring-of-attribute query in Section 3. After that, we present our proposed scheme in Section 4, followed by security analyses and performance evaluation in Section 5 and Section 6, respectively. Some related works are discussed in Section 7. Finally, we draw our conclusions in Section 8.

2. Models and Design Goals

In this section, we formalize the system model and the security model and identify our design goals.

2.1. System Model

In our system model, we consider two entities, namely, a data user and a cloud server, as shown in Figure 1.
(i) Data user: the data user has a database DB with records and attributes . Each record in includes a unique identifier id_i and a set of string-type attributes . Due to the limited storage space and computational capability, the data user intends to outsource the database DB and its index, i.e., , to the cloud server. Later, the data user submits a compound substring query token to the cloud server to retrieve a set of records matching , where is a substring query for attribute and is a compound formula consisting of conjunctive expressions (i.e., ) and disjunctive expressions (i.e., ) on . For example, means matching records’ attribute contains substring for at the same time.
(ii) Cloud server: the cloud server is considered to be powerful in storage space and computational capability. The duties of the cloud server include the following: (a) efficiently store database DB and index and (b) process compound substring query token and return a set of matching records to the data user.

2.2. Security Model

In our security model, the data user is considered trusted, while the cloud server is assumed to be honest-but-curious, which means that the cloud server will (i) honestly execute the query processing and return the query results without tampering with them and (ii) curiously infer as much sensitive information as possible from the available data. The sensitive information could include the database DB, the index , and the compound substring query token . The formal simulation-based definition for this security model is described in Section 3.4.

2.3. Design Goals

In this work, our design goal is to achieve an efficient and privacy-preserving SSE scheme supporting compound substring queries for the database. In particular, the following two requirements should be achieved.
(i) Privacy Preservation. In the proposed scheme, all the data obtained by the cloud server, i.e., , should be privacy-preserving during the outsourcing and query phases. Formally, the proposed scheme needs to satisfy security Definition 1 in Section 3.4.
(ii) Efficiency. In order to achieve the above privacy requirement, additional computational costs will inevitably be incurred. Therefore, in this work, we also aim to reduce the query time to be linear with the length of the query token plus the size of matching results.

3. Preliminary

In this section, we give some preliminaries including position heap [12], symmetric key encryption scheme, fully homomorphic encryption, and the security definition of our proposed scheme, which will serve as the basis of our proposed scheme.

3.1. The (Original) Position Heap Technique

Intuitively speaking, the (original) position heap is a trie built from all the suffixes of and can be used to achieve efficient substring search for . To construct the position heap from a string , a set of suffixes are chosen and inserted to the , which is initialized as a root node. To do this, for each suffix , its longest prefix that is already represented by a path in is found and a new leaf child is added to the last node of this path. The new leaf child is labeled with and its edge is labeled with (see Figure 2). Compared to other data structures to achieve substring search, such as suffix tree [9] and suffix array [13], the position heap [12] can achieve high efficiency in both storage and query time.

In the following, we formally describe the and algorithms of the position heap, which will be used to build a position heap and search on it. Note that we consider each node in the position heap stores two types of data: and , which represent the label of the node’s edge and the label of the node, respectively.

3.1.1. Algorithm

Given a string , the (i.e., Algorithm 1) visits the from the right to left and inserts each position to the position heap . In particular, for each position , the algorithm first finds the longest path from the root node of , where its path label is a prefix of (lines 4–7). Assume that the last node of this longest path is . Then, the algorithm appends a new leaf child to the , where and (lines 8–11). Figure 2 depicts an example of building such a position heap for a string . When inserting position , this algorithm first finds the longest path (shown in bold) in and appends a new leaf child to the last node of this path, where and .

(1)initialize a root node as the position heap , where and ;
(2)for each in do
(3);
(4)for each in do
(5)  find the child of , where ;
(6)  if does exist then
(7)   
(8)  else
(9)   insert a new child node for the ;
(10)   ;
(11)   break;
(12)  end if
(13)end for
(14)end for
(15)return ;
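
The construction described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: positions are assumed to be 1-indexed, and the names `Node`, `edge`, `pos`, `build_position_heap`, and `path_labels` are hypothetical.

```python
class Node:
    """A position-heap node: `edge` labels the incoming edge, `pos` labels the node."""

    def __init__(self, edge=None, pos=None):
        self.edge = edge
        self.pos = pos
        self.children = {}  # edge character -> child Node


def build_position_heap(s):
    """Insert positions n, n-1, ..., 1 of s (1-indexed), i.e. from right to left."""
    root = Node()
    n = len(s)
    for i in range(n, 0, -1):
        u = root
        for j in range(i, n + 1):
            c = s[j - 1]
            v = u.children.get(c)
            if v is not None:
                u = v  # keep following the longest existing path
            else:
                u.children[c] = Node(edge=c, pos=i)  # new leaf for position i
                break
    return root


def path_labels(node, prefix=""):
    """Yield (path label, position) for every non-root node, in sorted edge order."""
    for c, child in sorted(node.children.items()):
        yield prefix + c, child.pos
        yield from path_labels(child, prefix + c)


heap = build_position_heap("abaa")
labels = list(path_labels(heap))
print(labels)  # [('a', 4), ('aa', 3), ('ab', 1), ('b', 2)]
```

Each node's path label is a prefix of the suffix starting at the node's position, which is the property the search procedure relies on.
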
3.1.2. Algorithm

Given a substring and a position heap , the (i.e., Algorithm 2) is supposed to find all the positions in that are occurrences of . The time complexity of this algorithm is , where is the length of the queried substring and is the number of matching occurrences. The details are as follows:
(i) The algorithm first finds the longest path from the root node of , where its path label denoted by is a prefix of . We refer to this longest path as a search path in the rest of the article. Then, the algorithm lets be the set of positions stored in the intermediate nodes along the search path and be the set of positions stored in the descendants of the last node of the search path (lines 3–18). In particular, if , the position stored in the last node of the search path is included in . Otherwise, it is included in .
(ii) After completing the previous step, elements in must be the matching positions, and elements in may or may not be the matching positions. Next, the algorithm reviews each position in the string to filter out unmatching positions and removes them from the . Finally, this algorithm returns (lines 19–23).

(1)initial empty sets and ;
(2)let be the root node of the ;
(3)for each in do
(4) find the child of , where ;
(5)if does exist then
(6)  ifthen
(7)   ;
(8)   for each descendant of do
(9)    ;
(10)   end for
(11)  else
(12)   ;
(13)  end if
(14)  ;
(15)else
(16)  break;
(17)end if
(18)end for
(19)for each in do
(20)if is not equal to then
(21)  ;
(22)end if
(23)end for
(24)return ;

Take an example with Figure 2. Given a substring , the algorithm first finds the search path labeled with . In this way, and are equal to and . Then, this algorithm reviews the string and makes sure is not an occurrence of . Therefore, position 9 is removed from , and is an empty set now. Finally, this algorithm returns all the positions in .
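
The two-stage search (collect candidates along the search path, take the subtree of the last node as certain matches, then verify the candidates against the string) can be sketched as below. This is a hedged illustration under the same assumptions as before: 1-indexed positions and hypothetical names (`build`, `search`, `candidates`, `certain`); the string "abaa" is a toy input, not the one in Figure 2.

```python
class Node:
    def __init__(self, pos=None):
        self.pos = pos
        self.children = {}  # edge character -> child Node


def build(s):
    """Position heap of s, inserting suffixes from right to left."""
    root = Node()
    for i in range(len(s), 0, -1):
        u = root
        for c in s[i - 1:]:
            if c in u.children:
                u = u.children[c]
            else:
                u.children[c] = Node(i)
                break
    return root


def descendants(node):
    """Yield the positions stored in all descendants of node."""
    for child in node.children.values():
        yield child.pos
        yield from descendants(child)


def search(heap, s, p):
    """Return the sorted 1-indexed positions at which p occurs in s."""
    candidates, certain = set(), set()
    u = heap
    for j, c in enumerate(p, start=1):
        v = u.children.get(c)
        if v is None:
            break
        if j == len(p):
            # Whole pattern consumed: this node and its subtree must match.
            certain.add(v.pos)
            certain.update(descendants(v))
        else:
            # Intermediate node: may or may not match, verify later.
            candidates.add(v.pos)
        u = v
    verified = {i for i in candidates if s[i - 1:i - 1 + len(p)] == p}
    return sorted(verified | certain)


s = "abaa"
heap = build(s)
print(search(heap, s, "aa"))  # [3]
print(search(heap, s, "a"))   # [1, 3, 4]
```

For the query "aa", position 4 is a candidate on the search path but fails verification, so only position 3 is returned.
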

3.2. Symmetric Key Encryption Scheme

A symmetric key encryption scheme (SKE) consists of the following three polynomial-time algorithms .
(i): it takes a security parameter as input and outputs a secret key .
(ii): it takes a key and a message as inputs and then outputs a ciphertext .
(iii): it takes a key and a ciphertext as inputs and then outputs .

3.2.1. Correctness

For any message in plaintext space, it holds that .

3.2.2. Security

In this paper, we assume that the SKE is indistinguishable under a chosen-plaintext attack (IND-CPA) [14], which guarantees that the ciphertext does not leak any information about the plaintext even if an adversary can query an encryption oracle. We note that common private-key encryption schemes such as AES in counter mode satisfy this definition.

3.3. (Leveled) Fully Homomorphic Encryption

A leveled fully homomorphic encryption (FHE) scheme [15] consists of four polynomial-time algorithms . The details are described as follows:
(i): it takes a security parameter and a maximum multiplicative depth as inputs and outputs a public key and a secret key . We assume that a public key specifies both the plaintext space and the ciphertext space .
(ii): given the public key and a plaintext , it outputs a ciphertext . For simplicity, we omit the randomness used for encryption.
(iii): given the secret key and a ciphertext , it outputs a message .
(iv): it takes the public key , a function , and a set of ciphertexts as inputs and outputs a ciphertext .

3.3.1. Correctness

For any and any function which can be evaluated by a circuit with depth at most , if , , and , then it holds that .

3.3.2. Security

In this work, we assume that the FHE scheme is indistinguishable under a chosen-plaintext attack (IND-CPA), as described in [15].

3.3.3. Homomorphic Operations

In general, an FHE scheme can directly support homomorphic bitwise addition and multiplication . Other advanced homomorphic operations can be realized by arithmetic circuits based on and . In this paper, we consider three types of advanced homomorphic operations: bitwise AND, bitwise OR, and integer equality. Specifically, if the FHE ciphertexts of two -bit integers and are and , these arithmetic circuits are defined as follows:
(i) Bitwise AND : , where for .
(ii) Bitwise OR : , where for .
(iii) Integer equality: . The output of is in the case of and otherwise, where and are FHE ciphertexts of 1-bit message 1 and 0, respectively.
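
To make the three circuits concrete, the sketch below evaluates one standard way of expressing them with only addition and multiplication modulo 2, on plaintext bits. It is an assumption-laden illustration: no encryption is involved, `xor` and `mul` merely stand in for the homomorphic addition and multiplication, and all function names are hypothetical.

```python
def xor(a, b):
    """Stands in for homomorphic addition of two bits (addition mod 2)."""
    return (a + b) % 2


def mul(a, b):
    """Stands in for homomorphic multiplication of two bits."""
    return (a * b) % 2


def bitwise_and(x, y):
    # a AND b = a * b
    return [mul(a, b) for a, b in zip(x, y)]


def bitwise_or(x, y):
    # a OR b = a + b + a * b (mod 2)
    return [xor(xor(a, b), mul(a, b)) for a, b in zip(x, y)]


def integer_equal(x, y):
    # Product over all bit positions of (1 + a + b) mod 2:
    # each factor is 1 exactly when the two bits agree.
    result = 1
    for a, b in zip(x, y):
        result = mul(result, xor(xor(a, b), 1))
    return result


def to_bits(value, width):
    """Little-endian bit list of a non-negative integer."""
    return [(value >> k) & 1 for k in range(width)]


def from_bits(bits):
    return sum(b << k for k, b in enumerate(bits))


print(integer_equal(to_bits(5, 4), to_bits(5, 4)))  # 1
print(integer_equal(to_bits(5, 4), to_bits(6, 4)))  # 0
```

The equality product is written here as a sequential loop for readability; multiplying the factors in a balanced tree would reduce the multiplicative depth to roughly the base-2 logarithm of the bit length.
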

3.4. Security Definition of Our Proposed Scheme

In this section, we follow the security definition in [5] to formalize the simulation-based security definition of our proposed scheme by using the following two experiments: and . In the former, the adversary , who represents the cloud server, executes the proposed scheme with a challenger that represents the data user. In the latter, also executes the proposed scheme with a simulator that simulates the output of the challenger through the leakage of the proposed scheme. The leakage is parameterized by a leakage function collection , which describes the information leaked to the adversary in the data outsourcing phase and query phase, respectively. If any polynomial-time adversary cannot distinguish the output information between the challenger and the simulator , then we can say there is no other information leaked to the adversary , i.e., the cloud server, except the information that can be inferred from the . More formally,
(i): given a database DB chosen by the adversary , the challenger outputs encrypted index by following the data outsourcing phase of the proposed scheme. Then, can adaptively send a polynomial number of compound substring query tokens to the , which outputs corresponding encrypted compound substring query tokens. Eventually, returns a bit as the output of this experiment.
(ii): given the leakage function , the simulator outputs simulated encrypted index and simulated encrypted database . Then, for each query token, the adversary sends its leakage function to the simulator , which generates the corresponding simulated encrypted compound substring query token. Eventually, returns a bit as the output of this experiment.

Definition 1. Our proposed scheme is -secure against adaptive attacks (i.e., -adaptively secure) if, for any probabilistic polynomial-time adversary , there exists an efficient simulator such that .

4. Our Proposed Scheme

In this section, we will present our SSE scheme. Before delving into the details, we first introduce our basic index structure, which is the basic building block of our proposed scheme.

4.1. Basic Index Structure

In order to process efficient compound substring queries, we build an index for each attribute in database DB. The index is a modified position heap and can support two types of substring patterns: and . Next, we introduce algorithm and , which are used to build index and search on it.

4.1.1. Algorithm

Given an attribute column of database DB, the algorithm outputs an index as follows. It first transforms to a string , where denotes a character that does not appear in . In the rest of this paper, we call the attribute string. Then, it follows algorithm to insert all the positions in , except positions of character , into a position heap . Finally, this algorithm builds an index from by replacing its nodes’ position data (i.e., ) with the corresponding identifiers (i.e., ).

Figure 3 gives an example of building an index for the first attribute column of database DB.

4.1.2. Algorithm

Given a substring and an index , the algorithm follows the algorithm to search and outputs a set of identifiers.
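
A plaintext sketch of the index build and search for one attribute column is given below. Several details are assumptions made only for illustration: the separator is taken to be "#", it is also appended after the last attribute, and each node keeps its position next to the record identifier so that the candidate check of Algorithm 2 can still be run on the attribute string. How the encrypted scheme handles that check is not modeled here, and all names are hypothetical.

```python
SEP = "#"  # assumed separator character; must not occur in any attribute


class Node:
    def __init__(self, pos=None, rid=None):
        self.pos = pos  # position in the attribute string (kept for verification)
        self.rid = rid  # identifier of the record owning that position
        self.children = {}


def build_index(column):
    """column: list of (record identifier, attribute) pairs for one attribute."""
    text, owner = "", []
    for rid, attr in column:
        text += attr + SEP
        owner += [rid] * len(attr) + [None]
    root = Node()
    for i in range(len(text), 0, -1):  # right to left, 1-indexed
        if text[i - 1] == SEP:
            continue  # separator positions are not inserted
        u = root
        for c in text[i - 1:]:
            if c in u.children:
                u = u.children[c]
            else:
                u.children[c] = Node(pos=i, rid=owner[i - 1])
                break
    return root, text


def subtree(node):
    for child in node.children.values():
        yield child
        yield from subtree(child)


def search_index(index, p):
    """Return the identifiers of records whose attribute contains p (p has no SEP)."""
    root, text = index
    candidates, certain = [], []
    u = root
    for j, c in enumerate(p, start=1):
        v = u.children.get(c)
        if v is None:
            break
        if j == len(p):
            certain.append(v)
            certain.extend(subtree(v))
        else:
            candidates.append(v)
        u = v
    hits = certain + [v for v in candidates if text.startswith(p, v.pos - 1)]
    return {v.rid for v in hits}


column = [(1, "drama"), (2, "dark comedy"), (3, "romance")]
index = build_index(column)
print(sorted(search_index(index, "ma")))  # [1, 3]
```

Because the separator never appears in a query, a match can never span two adjacent attributes.
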

4.2. Description of Our Proposed Scheme

In this subsection, we will describe our proposed scheme, which mainly consists of three phases: (i) system initialization; (ii) data outsourcing; (iii) compound substring query. To make the description simple, we first introduce a basic scheme which only supports substring query with pattern, and then extend it to support substring query with pattern.

4.2.1. System Initialization

Given a security parameter , the data user first initializes a pseudorandom function (PRF) , an IND-CPA secure SKE , and an IND-CPA secure FHE . Then, the data user generates keys , , and .

4.2.2. Data Outsourcing

Assume that the data user has a database DB with records , where each record includes string-type attributes . Then, the data user generates a secure index and an encrypted database in the following steps:
Step 1: the data user first uses the algorithm to build for each attribute and then encrypts it as follows:
(i) For each node (except the root), the data user encrypts its to .
(ii) For each node (except the root), the data user concatenates all the edge labels, i.e., , along the path from the root to this node, and calculates the PRF output of the concatenation through pseudorandom function .
Consider the example in Figure 4, which is encrypted from the index in Figure 3(d).
Step 2: the data user encrypts each record through and sends these encrypted records to the cloud server with an encrypted index .
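
The path-label protection in Step 1 can be illustrated with the sketch below. It rests on assumptions that are not taken from the paper: HMAC-SHA256 is used as a stand-in PRF, the node payloads are opaque placeholder strings instead of FHE ciphertexts, and a query token is assumed to consist of the PRF outputs of the successive prefixes of the queried substring. All names are hypothetical.

```python
import hashlib
import hmac


def prf(key, label):
    """HMAC-SHA256 as a stand-in PRF (an assumption of this sketch)."""
    return hmac.new(key, label.encode("utf-8"), hashlib.sha256).digest()


def protect_index(paths, key):
    """paths: dict mapping a node's path label to its payload.

    Returns a dict keyed by the PRF output of the path label, so the
    stored index no longer contains the edge characters themselves.
    """
    return {prf(key, label): payload for label, payload in paths.items()}


def make_token(key, substring):
    """Assumed token format: one PRF output per prefix of the substring."""
    return [prf(key, substring[:j]) for j in range(1, len(substring) + 1)]


def follow(protected, token):
    """Server side: walk the search path by matching successive PRF outputs."""
    found = []
    for tag in token:
        if tag not in protected:
            break
        found.append(protected[tag])
    return found


key = b"demo key for illustration only"
# Path labels of a small position heap; payloads are placeholders.
paths = {"a": "ct4", "aa": "ct3", "ab": "ct1", "b": "ct2"}
protected = protect_index(paths, key)
print(follow(protected, make_token(key, "ab")))  # ['ct4', 'ct1']
```

Since every node is keyed by the PRF output of its full path label, the server can follow a search path without learning the characters on its edges, and a token built under a different key matches nothing.
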

4.2.3. Compound Substring Query

Given a set of substrings and a compound formula on , where consists of conjunctive expressions (i.e., ) and disjunctive expressions (i.e., ), the data user launches a compound substring query with the cloud server as follows:
Step 1: for each substring with , the data user calculates and sends a compound substring query token to the cloud server.
Step 2: for each , the cloud server performs algorithm to search over and outputs a set .
Step 3: the cloud server generates a function according to the compound formula . Specifically, all the and in are replaced by bitwise AND and bitwise OR , respectively. For example, if compound formula , then , where for .
Step 4: the cloud server performs Algorithm 3 to filter matching elements in according to function . Specifically, assuming that is the minimum-size element in , for each encrypted identifier in , this algorithm traverses all encrypted identifiers in to calculate an encrypted flag , which is if matches the compound formula and otherwise. Then, the cloud server inserts all the to an empty set and returns it to the data user. Figure 5 depicts an example of this step where the compound formula . In this example, is the only identifier that matches the formula . Note that, since an algorithm consumes multiplicative depth and an AND/OR operation consumes 1 multiplicative depth, this step requires at most multiplicative depth, where is the average attribute length of database and is the number of records in the database.
Step 5: for each , the data user calculates . If , then the data user calculates and requests a corresponding encrypted record from the cloud server.

(1)let be the minimum-size element in .
(2)let be an empty set.
(3)for each do
(4)
(5)for each do
(6)  
(7)  for each do
(8)   
(9)   
(10)  end for
(11)end for
(12)
(13) insert to ;
(14)end for
(15)return
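
The filtering logic of Step 4 can be mimicked on plaintext bits as in the sketch below. This is only an illustration of the control flow: identifiers are plain bit lists rather than FHE ciphertexts, the equality and OR operations are the mod-2 arithmetic circuits, the compound formula is passed as an ordinary Python function over the per-attribute flags, and all names are hypothetical.

```python
def eq_bits(x, y):
    """1 if the two bit lists are equal, else 0 (product of (1 + a + b) mod 2)."""
    result = 1
    for a, b in zip(x, y):
        result *= (1 + a + b) % 2
    return result


def or_bit(a, b):
    return (a + b + a * b) % 2


def to_bits(value, width=8):
    return [(value >> k) & 1 for k in range(width)]


def from_bits(bits):
    return sum(b << k for k, b in enumerate(bits))


def filter_results(result_sets, formula):
    """result_sets: one list of identifiers (bit lists) per queried attribute.

    For every identifier in the smallest result set, compute one flag per
    attribute (is the identifier present in that attribute's result set?)
    and combine the flags with the compound formula.
    """
    smallest = min(result_sets, key=len)
    output = []
    for ident in smallest:
        flags = []
        for candidates in result_sets:
            flag = 0
            for other in candidates:
                flag = or_bit(flag, eq_bits(ident, other))
            flags.append(flag)
        output.append((formula(flags), ident))
    return output


r1 = [to_bits(v) for v in (1, 2, 3)]
r2 = [to_bits(v) for v in (2, 5)]
pairs = filter_results([r1, r2], lambda f: f[0] * f[1])  # conjunction of two flags
# The data user would decrypt the flags and keep identifiers whose flag is 1.
matching = [from_bits(ident) for flag, ident in pairs if flag == 1]
print(matching)  # [2]
```

In the real scheme every flag stays encrypted on the server, so the server returns all pairs and only the data user learns which identifiers match.
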
4.2.4. Query with Substring

We extend our scheme to support substring queries with pattern. The extension is simple and requires only small changes to the outsourcing phase and the compound substring query phase. Specifically, when the data user builds index for the attribute column , the attribute string is replaced by . In other words, each attribute in is copied once and concatenated to its replica with character , and every two adjacent attributes in are concatenated with character , where and denote two separate characters that do not appear in . In this way, the substring query with pattern can be transferred to pattern where . Meanwhile, this method doubles the storage cost of index since the attribute string is twice as long as before.
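
One way to read this construction, stated here as an assumption because the pattern symbols are not legible in this copy, is that a query asking for attributes that begin with one substring and end with another becomes a single substring query that straddles the character joining an attribute to its copy. The sketch below checks that reading on plain strings; the separator "#" and all names are hypothetical.

```python
JOIN = "#"  # assumed character joining an attribute to its copy


def doubled(attribute):
    """An attribute followed by the joining character and its own copy."""
    return attribute + JOIN + attribute


def rewrite_query(prefix, suffix):
    """Assumed rewriting: 'begins with prefix and ends with suffix'
    becomes the single substring suffix + JOIN + prefix."""
    return suffix + JOIN + prefix


def matches(attribute, prefix, suffix):
    return rewrite_query(prefix, suffix) in doubled(attribute)


print(matches("romance", "rom", "ce"))  # True
print(matches("romance", "man", "ce"))  # False
```

The joining character occurs exactly once in the doubled attribute, so the rewritten substring can only match with the requested suffix ending the first copy and the requested prefix starting the second.
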

5. Security Analysis

In this section, we prove the security of our proposed scheme based on the security definition described in Section 3.4.

5.1. Leakage Function Collection

We first define the leakage function collection of our proposed scheme.
(i) Outsourcing Phase: given the index and the encrypted database DB, the leakage consists of the following information:
(a): the number of records in DB
(b): the size of record
(c): the number of attributes in DB
(d): the number of nodes in
(e): the structural dependencies between nodes in
(ii) Query Phase: given the index and a compound substring query token , the leakage consists of the following information:
(a): search paths in corresponding to the query token
(b): compound formula in

5.2. Security Proof

Now, we prove the security of our proposed scheme based on the leakage function collection . Intuitively, we first define a simulator based on the leakage function collection and then analyze the indistinguishability between the output of the in the ideal world and the challenger (i.e., the data user) in the real world. Finally, we conclude that our proposed scheme does not reveal any information beyond the leakage function collection to the server. The details are as follows.

Theorem 1. Let be a pseudorandom function (PRF), let be an IND-CPA secure symmetric key encryption scheme (SKE), and let be an IND-CPA secure fully homomorphic encryption (FHE). Then, our proposed scheme is -adaptively secure.

Proof. Based on the leakage function collection , we can build a simulator as follows:
(i) Data outsourcing: given the leakage function , the simulator is supposed to generate a simulated index and a simulated encrypted database . To build , the simulator first generates empty nodes and constructs these nodes to a tree (i.e., ) based on , which means that has the same tree structure as . Then, for each node in , the simulator randomly chooses and . Since the outputs of and are pseudorandom, the adversary cannot distinguish between and . To build , the simulator chooses a random value for each record and lets . Since is an IND-CPA secure SKE, the adversary cannot distinguish between and .
(ii) Compound substring query: given the leakage function for a compound substring query token , the simulator is supposed to generate a simulated encrypted compound substring query token . Note that, at this moment, the simulator has not only but also and from the data outsourcing phase. Therefore, for each , the simulator can follow the corresponding in to find its search path and output all the stored in the nodes along the search path as . Since is a pseudorandom function, the adversary cannot distinguish between and .
In summary, as the adversary cannot distinguish between the outputs from the simulator in the ideal world and the challenger in the real world, we can conclude that our proposed scheme is -adaptively secure.

6. Performance Evaluation

In this section, we evaluate the performance of our proposed scheme from both theoretical and experimental perspectives. To the best of our knowledge, we are the first to discuss the compound substring query. Although some existing works [9–11] focus on the substring query, they cannot directly support the compound substring query. Therefore, a fair comparison is difficult, and we evaluate only our proposed scheme in this section.

6.1. Theoretical Analysis

First, we theoretically analyze the query computational cost and storage overhead of our proposed scheme.

For the query computational cost, we analyze the server and user separately. Specifically, the query computational cost of the server comes from two steps: search over index to get a set of collections and filter matching elements in . The former consumes at most integer comparison operations and the latter consumes at most FHE multiplication operations, where and are the minimum-size and maximum-size elements in . Meanwhile, the query computational cost of the user also comes from two steps: generate query token and decrypt the returned FHE ciphertexts, which consume hash operations and FHE decryption operations, respectively. Comparing the above operations, we can see that FHE multiplication operations on the server dominate the query computational cost.

For the storage overhead, most of the storage overhead in our proposed scheme comes from the encrypted identifiers in , which consumes FHE ciphertexts, where is the number of nodes in and is the decomposition parameter in FHE (see next subsection).

6.2. Experimental Analysis

Then, we experimentally analyze our proposed scheme. Specifically, we implement our proposed scheme in C++ (our code is open source [16]) and conduct experiments on a 64-bit machine with an Intel(R) Core(TM) i5-4300M CPU at 2.6 GHz and 4 GB RAM, running Ubuntu 18.04. Note that we implement the data user and cloud server on the same machine, which means there is no network delay between them. The underlying database in our experiment was extracted from Kaggle [17]. It contains 31,087 movies and 153,584 movie tags. The average length of all these movie tags is about 10. We treat each movie as a record with corresponding tags as its attributes.

6.2.1. FHE Implementation

In order to reduce the storage overhead of FHE, we use the SIMD technique [18] in experiments. Before describing our packing method, we first review the underlying structure of FHE. Specifically, the plaintext space of FHE is for a positive integer where is the th cyclotomic polynomial. When the plaintext modulus is prime and not divisible by , the decomposes into irreducible factors of degree modulo . This induces an isomorphism, via the Chinese Remainder Theorem, between the algebra of the plaintext space and the product of finite fields for ,

With this decomposition, the plaintext of compatible FHE schemes can be regarded as a length vector , where for . Addition and multiplication on ciphertext correspond to component-wise addition and multiplication over the for .

Now we give our packing method. In our proposed scheme, each index for includes a set of identifiers , where each identifier in it can be seen as a bit-array . In the experiments, we pack each to FHE ciphertexts, which means each for and is encoded to an element in and therefore an FHE ciphertext contains bits in .

6.2.2. HElib Parameters

We utilize the HElib library [19] to implement the FHE described above. For the parameters, we choose and , which lead to , , and . Meanwhile, the recent version of the HElib library (commit c74ffab in [19]) uses a concept of ciphertext capacity instead of depth. The ciphertext capacity of a ciphertext is defined in [20] as , where is the current modulus, and is the current noise bound. In practice, an algorithm costs about 60-bit capacity and an AND/OR operation costs about 25-bit capacity when and . Since in our experiments does not exceed 25000, does not exceed 5, and the average attribute length , we set the initial ciphertext capacity to , which leads to 84-bit security.

In the following, we evaluate the computational cost and storage overhead of our proposed scheme in terms of two phases: data outsourcing and compound substring query.

6.2.3. Data Outsourcing

First, we consider the computational cost and storage overhead of the data outsourcing phase, which mainly comes from the building runtime and storage overhead of index . Figures 6 and 7 plot the building runtime and storage overhead of the data outsourcing versus the number of records and the number of attributes . From these figures, we can see that both computational cost and storage overhead increase linearly with and .

6.2.4. Compound Substring Query

Next, we consider the computational cost of the compound substring query phase. As mentioned in the last subsection, most of the query computational cost is from FHE multiplication operations on the server, where and are the minimum-size and maximum-size elements in . As shown in Figures 8 and 9, the query time is indeed linear with and . Meanwhile, it is not affected by the number of records .

7. Related Works

A searchable encryption scheme can be realized with optimal security via powerful cryptographic tools, such as fully homomorphic encryption (FHE) [21, 22] and Oblivious Random Access Memory (ORAM) [23, 24]. However, these tools are extraordinarily impractical. Another set of works utilizes property-preserving encryption (PPE) [25–28] to achieve searchable encryption, which encrypts messages in a way that inevitably leaks certain properties of the underlying message. To balance leakage and efficiency, many studies focus on searchable symmetric encryption (SSE). Song et al. [3] first used symmetric encryption to facilitate attribute queries over the encrypted data. Then, Curtmola et al. [5] gave a formal definition of SSE and proposed an efficient SSE scheme. Later, Kamara et al. [29] proposed the first dynamic SSE scheme, which uses a deletion array and a homomorphically encrypted pointer technique to securely update files. Unfortunately, due to the use of fully homomorphic encryption, the update efficiency is very low. In a more recent paper [7], Cash et al. described a simple dynamic inverted index based on [5], which utilizes the data unlinkability of the hash table to achieve secure insertion. Meanwhile, to prevent file-injection attacks [30], many works [31–34] focused on forward security, which ensures that newly updated attributes cannot be related to previously queried results.

Nevertheless, the above works can support only the exact attribute query. If the queried attribute does not match a preset attribute, the query will fail. Fortunately, the fuzzy query can deal with this problem as it can tolerate minor typos and formatting inconsistencies. Li et al. [35] first proposed a fuzzy query scheme, which used an edit distance with a wildcard-based technique to construct fuzzy attribute sets. For instance, the set of with 1 edit distance is . Then, Kuzu et al. [36] used locality-sensitive hashing (LSH) and a Bloom filter to construct a similarity query scheme. Since an honest-but-curious server may only return a fraction of the results, Wang et al. [37] proposed a verifiable fuzzy query scheme that not only supports fuzzy query service but also provides proof to verify whether the server returns all the queried results. However, these fuzzy query schemes support only single fuzzy attribute queries and address the problems of minor typos and formatting inconsistency, so they cannot be directly used to achieve substring queries.

In [9], Chase and Shen designed an SSE scheme based on the suffix tree to support substring queries. Although this scheme can be used to implement the substring query and allows for substring query in time, its storage cost has a big constant factor. The reason is that the suffix tree only stores position data in leaf nodes and does not utilize the space of inner nodes effectively. As a result, the number of nodes in the suffix tree can be up to , where is the size of the dataset. In order to reduce the storage cost as much as possible, Leontiadis and Li [13] leveraged the Burrows-Wheeler Transform (BWT) to build an auxiliary data structure called a suffix array, which can achieve storage cost with a lower constant factor. However, its query time is relatively large. Later, Mainardi et al. [11] optimized the query algorithm in [13] to achieve at the cost of higher index space, i.e., , where is the number of distinct characters in the dictionary. In addition to the suffix tree and suffix array, there are some other auxiliary data structures that can be used to support substring queries. In 2018, Hahn et al. [38] designed an index based on k-grams. When a user needs to perform a substring query, the cloud performs a conjunctive keyword query for all the k-grams of the queried substring. In the same year, Moataz et al. [10] proposed a new substring query scheme based on the idea of letter orthogonalization, which allows testing of string membership by performing an efficient inner product. Although the above schemes can support substring queries, they can only solve the substring query problem for a single attribute and cannot be used to achieve compound substring queries efficiently.

8. Conclusion

In this paper, we have proposed a novel efficient and privacy-preserving compound substring query scheme. Specifically, based on the position heap technique, we first designed a tree-based index to support a substring query on a single attribute and then applied PRF and FHE techniques to protect its privacy. In addition, based on the homomorphism of fully homomorphic encryption, we designed an algorithm to support compound substring queries on multiple attributes. Detailed security analysis and performance evaluation show that our proposed scheme is indeed privacy-preserving and efficient. In our future work, we will consider extending the proposed scheme to support wildcard queries.

Data Availability

The data are available at the following link: https://www.kaggle.com/rounakbanik/the-movies-dataset.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported in part by NSERC Discovery Grants (RGPIN 04009), the National Natural Science Foundation of China (U1709217), and NSFC Grant (61871331).