Abstract

Multi-keyword conjunctive query is the most common query type in searchable encryption (SE) and gives practical search capability. This paper focuses on constructing a conjunctive query scheme with high privacy and accurate query results. For efficiency, this paper introduces a novel counter vector (CV) data structure in place of the inverted index and builds the CV index database over three-tuple keywords. We use two kinds of cryptographic primitives to encrypt the CV index database. One is symmetric encryption, chosen for efficiency, which gives two schemes, CVX and ICV. Experiments show that both CVX and ICV are much more efficient than the existing schemes. The other is BGN homomorphic encryption, chosen for privacy, which gives the scheme HCV. HCV substantially reduces result pattern (RP) leakage and even achieves “ideal” leakage for 3-keyword queries. Experiments show that the HCV scheme achieves high privacy at a modest storage cost.

1. Introduction

With the wide application of cloud computing and the commercialization of 5G, more and more people tend to outsource their data. Some remote storage systems [1, 2] help clients with limited resources to manage large amounts of data at a very low cost. Although outsourcing brings great convenience and higher efficiency, the security and privacy issues cannot be ignored. For example, the social networking site Facebook may have leaked the data of millions of users to the political firm Cambridge Analytica [3]. Encryption is a simple solution to protect data security, but it prevents the data from being searched. Searchable encryption (SE) addresses this issue by providing a way to search without decryption. Xiaodong et al. proposed the first SE scheme in 2000 [4], which retrieves a keyword over encrypted data. Later, a series of practical SE schemes [5–9] were proposed.

The schemes mentioned above support only single-keyword queries. However, a practical system needs to find the documents containing a set of keywords, which is called a conjunctive query. A naive method for conjunctive queries is to perform a single-keyword query for each keyword one by one and then filter the desired documents. Nevertheless, if the resultant document set for a keyword is very large, this method has low efficiency. Besides, this method causes significant information leakage, as it reveals the resultant document set for each queried keyword. A passive attacker in [10, 11] can leverage the common leakages in SSE schemes to reveal the user’s query. Recently, Zhang et al. [12] proposed an even more powerful attack, in which the attacker can adaptively inject new documents. With this power, the attacker can recover the content of the user’s query by learning which added documents match it [3].

Some conjunctive searchable symmetric encryption (SSE) schemes have been proposed to balance security and efficiency. Golle et al. first proposed conjunctive equality queries [13]. Each conjunctive query builds a set of tokens that can be used to identify matching documents in the database. Their methods leak only the set of matching documents. However, the workload of the server is heavy, the communication complexity between server and client is high, and the scalability of this solution is limited [14]. Cash et al. [14] proposed the first conjunctive query scheme with sublinear search complexity, that is, “Oblivious Cross-Tags” (OXT). Before that, all other solutions worked only in time linear in the database’s size. Nevertheless, the OXT protocol leaks some “partial” information to the server about the queries themselves and the database contents. Result pattern (RP) leakage is one of the information leakages mentioned in OXT, and the adversary can use it to steal information. The file-injection attacks [12] have exploited RP leakage to reveal all queried keywords with accuracy [15].

Kamara and Moataz [16] proposed highly efficient SSE schemes with worst-case sublinear search and optimal communication complexity. They used set operations (union, intersection, and complement) for efficiency, and their methods support conjunctive, disjunctive, and Boolean queries. However, set operations inevitably cause information leakage. For the sake of efficiency and functionality, Kamara’s method does not prevent RP information leakage, and it leaks more than many other solutions. Nevertheless, Kamara’s method of building the inverted index is a useful reference for our work. Lai et al. [15] analyzed the result pattern (RP) leakage and proposed “Hidden Cross-Tags” (HXT). The HXT protocol eliminates the keyword-pair result pattern (KPRP) leakage present in the OXT protocol, leaving only the minimal and significantly smaller whole result pattern (WRP) leakage. Thus, the HXT protocol offers higher security than the OXT protocol. Although the HXT protocol is efficient and practical, it cannot return the full query results because it uses a Bloom filter to build the index database.

Yin et al. [17] proposed an efficient and privacy-preserving multi-keyword conjunctive query scheme over the cloud. They used a binary tree structure and homomorphic encryption to achieve high query efficiency and small privacy leakage. This method supports the multi-keyword conjunctive query and protects its privacy. However, the scale of this scheme is limited by the tree-based index. Their experimental results are based on eight keywords, which is not enough in most settings.

The existing schemes mentioned above cannot simultaneously avoid unnecessary information leakage and return precise query results. We make progress on these issues and give an affirmative answer by constructing a practical SSE scheme. Like OXT and HXT, our construction focuses on conjunctive queries, since such queries are the most common in many practical settings. In our construction, we assume that the server is honest but curious and that there is only one reader and one writer.

1.1. Our Contribution

We make progress on the multi-keyword conjunctive query and construct a method with high privacy and accurate query results. The main contributions of this paper can be summarized as follows:
(1) First, we focus on the multi-keyword conjunctive query setting since it is a common search method for most people. Unlike index databases based on the inverted index, we propose a novel data structure, the counter vector (CV), to construct the index database, which is the basis of our proposed solutions. Using CV, we build a map from three-tuple keywords to the file-path collection containing all three keywords, and we design an algorithm to build the CV database as quickly as possible. Compared to Kamara and Moataz’s multi-map [16], CV achieves a higher search efficiency at the cost of storage efficiency. Although the number of three-tuple keywords is much larger than the number of keyword pairs, the actual growth of storage is limited because the original single-word inverted index database is sparse. Experiments show that when the keywords’ weight is less than 110 and the number of keywords is not more than 512, the CV database would add about storage to the keyword-pair database, and the search efficiency improved dozens of times.
(2) Second, we propose three schemes, CVX, ICV, and HCV, using two kinds of cryptographic primitives and the CV data structure. CVX is the basic scheme and has a much higher search efficiency than Kamara’s IEX. However, CVX’s search efficiency is greatly affected by the weight of keywords (i.e., the number of documents containing a keyword). For this reason, we propose the improved ICV, which is more efficient for heavy-weight keywords. HCV uses the BGN homomorphic encryption algorithm to reduce the RP leakage [18]. We prove that HCV achieves ideal leakage for a 3-keyword query. All of these schemes are practical: they are easy to implement and can run on a PC.
(3) Third, we analyze our schemes’ security and evaluate the performance of all three schemes. The experiments show that both CVX and ICV have much higher search efficiency than Kamara’s IEX. When the keywords’ weight is larger, ICV has a significant advantage over CVX. For HCV, we analyze its privacy, showing that it leaks less than Cash’s OXT and returns accurate query results, unlike the probabilistic results of Lai’s HXT.

We compared the performance of some schemes, and Table 1 gives the details.

1.2. Related Work

Xiaodong et al. [4] gave the first SSE scheme, whose search complexity is linear in the size of the database. Later, Goh [8] introduced a search index for each file and made the search cost proportional to the number of files. Curtmola et al. [7] presented the inverted index to achieve sublinear search complexity. This scheme defined two formal security models and gave a formal security definition. To support expressive queries, Golle et al. first proposed conjunctive equality queries [13]. In each conjunctive query, a set of tokens is built to identify matching documents in the database. Their methods leak little information; however, the performance is poor and the scalability of this solution is limited.

To support more scalable and expressive queries, Cash et al. [14] proposed the Oblivious Cross-Tags (OXT) protocol with worst-case sublinear search complexity. The OXT protocol divides the conjunctive search process into an s-term and x-terms. The s-term is used for a regular single-keyword search, and the x-terms are used to acquire document identifiers containing multiple keywords. However, the OXT protocol cannot avoid the “keyword-pair result pattern” (KPRP) leakage, which can be exploited in recent attacks [10, 12]. Since then, a line of extensions [19–21] has been made for OXT. However, such schemes trade off performance, security, and functionality. To improve the efficiency of OXT, Kamara and Moataz [16] introduced two schemes, IEX-2Lev and IEX-ZMF, both of which achieve worst-case sublinear search complexity for Boolean queries. Lai et al. [15] proposed the “Hidden Cross-Tags” (HXT) protocol, which achieves conjunctive query and reduces the leakage of OXT. The HXT protocol uses the “Cross-Tags Set” (XSet) data structure and a lightweight Hidden Vector Encryption (HVE) to encrypt it, and thereby avoids the KPRP leakage. However, HXT’s XSet is a Bloom filter, and thus it cannot give the precise query result.

Very recently, some conjunctive SSE schemes with extended functions have been proposed. Wang et al. [22] pointed out that the scheme proposed in [23] is not correct and proposed a new SE scheme that adopts a special additive homomorphic encryption scheme to achieve the multiplicative homomorphic property efficiently. Furthermore, they enhanced the security on the user side. Ma et al. presented a practical SSE protocol, called “Practical Hidden Cross-Tags” (PHXT), that supports conjunctive queries without KPRP leakage [24]. Using subset membership check (SMC), the PHXT protocol maintains the same storage size as OXT while preserving the same privacy and functionality as HXT. Fan et al. [25] proposed a verifiable conjunctive keyword search scheme based on cuckoo filter (VCKSCF), which significantly reduces verification and storage overhead. Gan et al. also focused on verifiable conjunctive SSE [26] and presented an efficient verifiable SSE (VSSE) scheme for conjunctive queries with sublinear search overhead. VSSE is built on the OXT protocol, completes the verification through Symmetric Hidden Vector Encryption (SHVE), and greatly reduces the computation load in the verification process. For IoT applications, Zhang et al. [27] proposed a lightweight and efficient attribute-based encryption scheme for data sharing and searching (namely, LSABE). Their scheme can significantly reduce the computing cost of IoT devices while providing multiple-keyword search for data users.

This paper is organized as follows. We give the preliminaries in Section 2. In Section 3, we depict the details of the CVX SSE scheme, containing the proof of correctness and evaluation of the efficiency and security. Section 4 depicts the ICV scheme. In Section 5, we introduce the HCV scheme. The experiment results are shown in Section 6. We conclude our method in Section 7.

2. Preliminary

We first depict the notations and definitions. Then, we list part of them in Table 2.

2.1. Notations

We denote all binary strings with length as and all finite binary strings as . Let , and is the corresponding power set. An element sampled from a distribution is denoted as . represents that the element is output by algorithm . For a tuple of elements, its th element can be denoted as or . Given an element , let denote the index of in . For a set , we use to represent its cardinality. For a string , means its bit length and means its th bit. Given strings and , refers to their concatenation.

Multi-map (MM) is an abstract data type. Typically, it can be instantiated by an inverted index. MM with capacity is a collection of label/tuple pairs . Getting the tuple associated with label can be denoted as . Similarly, associating the tuple to label can be denoted as .
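As a plaintext illustration of the multi-map abstraction, the following minimal sketch stores label/tuple pairs in a dictionary and instantiates it as an inverted index. The names `MultiMap`, `put`, `get`, and `inverted_index` are illustrative assumptions and are not taken from the paper; no encryption is involved.

```python
from collections import defaultdict


class MultiMap:
    """Plaintext multi-map: each label is associated with a tuple of values."""

    def __init__(self):
        self._store = {}

    def put(self, label, values):
        # Associate the tuple `values` with `label` (overwrites any old tuple).
        self._store[label] = tuple(values)

    def get(self, label):
        # Return the tuple associated with `label`, or an empty tuple.
        return self._store.get(label, ())

    def __len__(self):
        return len(self._store)


def inverted_index(docs):
    """Build a keyword -> document-identifier multi-map.

    `docs` maps a document identifier to an iterable of its keywords.
    """
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        for keyword in set(docs[doc_id]):
            postings[keyword].append(doc_id)
    mm = MultiMap()
    for keyword, ids in postings.items():
        mm.put(keyword, ids)
    return mm


docs = {1: ["a", "b"], 2: ["a"], 3: ["b", "c"]}
mm = inverted_index(docs)
```

Here `mm.get("a")` returns the identifiers of the documents containing keyword `a`, which corresponds to the lookup operation described above.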

The symbol denotes a document collection. Each document contains a number of keywords from the universe . The th keyword in can be denoted as , and the document identifier can be denoted as . Each multi-map can be regarded as a database and denoted as . The document collection containing keyword can be written as , and the set of keywords in that co-occur with can be written as .

Informally, for a private-key encryption scheme, if its ciphertexts do not reveal any partial information about the plaintext even to an adversary that can adaptively query an encryption oracle, we say it is secure against chosen-plaintext attacks (CPAs). Similarly, if its ciphertexts are computationally indistinguishable from random even to an adversary that can adaptively query an encryption oracle, we say it is random-ciphertext-secure against chosen-plaintext attacks (RCPAs) [28].

2.2. Result Pattern Hiding Searchable Encryption

Lai et al. [15] proposed result pattern (RP) hiding searchable encryption to resist RP leakage proposed by Cash et al. [14]. RP leakage is the leaked information obtained by the server during query. In [15], Lai et al. analyzed the RP leakage and gave three forms: single-keyword result pattern (SP) leakage, keyword-pair result pattern (KPRP) leakage, and multiple keyword cross-query intersection result pattern (IP) leakage.

The KPRP leakage is a “nonideal” leakage and can be eliminated. Consider an n-keyword conjunction query ; during this process, the server gets the set of documents containing every pair of query keywords of form , and it can acquire the final query result, which is the set of documents matching all query keywords.

Only the final query result, which is called whole result pattern (WRP) leakage, cannot be avoided during this process. In addition, other leaks are intermediate links in the query process and can be reduced in some ways.
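The difference between KPRP and WRP leakage can be illustrated on a plaintext toy index. The sketch below assumes, as in OXT, that the first query keyword acts as the s-term and that the leaked pairs have the form (first keyword, other keyword); the function names and the toy index are hypothetical.

```python
def kprp_leakage(index, query):
    """Per-pair intersections learned by the server (assumed pair form: (w1, wi))."""
    s_term, x_terms = query[0], query[1:]
    return {(s_term, x): index[s_term] & index[x] for x in x_terms}


def wrp_leakage(index, query):
    """Only the final result: documents matching all query keywords."""
    result = set(index[query[0]])
    for keyword in query[1:]:
        result &= index[keyword]
    return result


# Toy inverted index: keyword -> set of document identifiers.
index = {"w1": {1, 2, 3}, "w2": {2, 3, 4}, "w3": {3, 5}}
query = ["w1", "w2", "w3"]

pairs = kprp_leakage(index, query)
final = wrp_leakage(index, query)
```

In this toy case the pair leakage exposes document 2 as matching both `w1` and `w2`, although it is not part of the final result; a scheme with only WRP leakage would reveal document 3 alone.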

2.3. Bilinear Groups and Homomorphic Encryption
2.3.1. Bilinear Groups of Composite Order

Given a security parameter $k$, generate a tuple $(N, g, G, G_T, e)$, where $N = pq$ and $p$, $q$ are two $k$-bit prime numbers. $G$ and $G_T$ are two finite cyclic multiplicative groups of composite order $N$, $g \in G$ is a generator, and $e: G \times G \rightarrow G_T$ is a bilinear map with the following properties:
(i) Bilinearity. $e(x^a, y^b) = e(x, y)^{ab}$ for any $x, y \in G$ and $a, b \in \mathbb{Z}_N$.
(ii) Nondegeneracy. If $g$ is a generator of $G$, then $e(g, g)$ is a generator of $G_T$ with order $N$.
(iii) Computability. There exists an efficient algorithm to compute $e(x, y) \in G_T$ for all $x, y \in G$.

2.3.2. BGN Homomorphic Encryption

The Boneh–Goh–Nissim (BGN) [29] homomorphic encryption includes three algorithms: key generation, encryption, and decryption. We give the detailed description as follows.
(i) Key generation: Given a security parameter $k$, generate a tuple $(N, g, G, G_T, e)$ as described in Section 2.3.1. Set $h = g^q$; then, $h$ is a random generator of the subgroup of $G$ of order $p$. Compute the key pair, containing the private key $sk = p$ and the public key $pk = (N, G, G_T, e, g, h)$.
(ii) Encryption: let $m$ denote the message to be encrypted, choose a random number $r \in \mathbb{Z}_N$, and compute the ciphertext $c = g^m h^r \in G$.
(iii) Decryption: Given the ciphertext $c = g^m h^r$, compute $c^p = (g^m h^r)^p = (g^p)^m$. Set $\hat{g} = g^p$ and compute the discrete log of $c^p$ base $\hat{g}$ according to Pollard’s lambda method (see [30] (p.128) and [17]).
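The algebra of BGN encryption and decryption can be checked with a toy sketch. The code below is an assumption-laden stand-in, not a faithful or secure implementation: it replaces the bilinear group by the order-$N$ subgroup of the integers modulo a small prime `P`, uses tiny primes `p = 11` and `q = 13`, omits the pairing entirely, and finds the discrete log by brute force instead of Pollard’s lambda method. All function and variable names are illustrative.

```python
import random


def is_prime(n):
    """Trial-division primality test, adequate for toy sizes only."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def keygen(p=11, q=13):
    """Return (public key, private key) for a toy composite-order group."""
    N = p * q
    k = 1
    while not is_prime(N * k + 1):  # find a prime P with N | P - 1
        k += 1
    P = N * k + 1
    for a in range(2, P):
        g = pow(a, k, P)  # g lies in the subgroup of order dividing N
        if g != 1 and pow(g, p, P) != 1 and pow(g, q, P) != 1:
            break  # g has order exactly N
    h = pow(g, q, P)  # h generates the subgroup of order p
    return (P, N, g, h), p


def encrypt(pk, m, rng=random):
    """c = g^m * h^r with a fresh random r."""
    P, N, g, h = pk
    r = rng.randrange(N)
    return pow(g, m, P) * pow(h, r, P) % P


def decrypt(pk, sk, c, max_m):
    """Recover m from c^p = (g^p)^m by brute-force discrete log, 0 <= m <= max_m."""
    P, N, g, h = pk
    g_hat = pow(g, sk, P)
    target = pow(c, sk, P)
    acc = 1
    for m in range(max_m + 1):
        if acc == target:
            return m
        acc = acc * g_hat % P
    raise ValueError("message outside the searched range")


pk, sk = keygen()
c1, c2 = encrypt(pk, 3), encrypt(pk, 4)
c_sum = c1 * c2 % pk[0]  # multiplying ciphertexts adds the plaintexts
```

With these toy parameters the message space is limited to values below `q`, so `decrypt` is called with `max_m = 12`. Multiplying two ciphertexts yields an encryption of the sum of the plaintexts, which is the additive homomorphism that BGN provides in $G$.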

3. The Basic Scheme

In this section, we will introduce the basic multi-keyword conjunctive query scheme proposed in this paper. First of all, we give the details of a counter vector data structure.

3.1. Counter Vector and Counter Vector Database
3.1.1. Counter Vector

A counter vector with length is an array of integers. Given a three-tuple, suppose contains seven files , and we can get counter vector as in Table 1. A collection of counter vectors composes the CV database.

Given a counter vector , we can easily query the 2-conjunctive keywords and 3-conjunctive keywords as follows:
(1) Initialize a result set .
(2) For ,
(a) For 2-conjunctive queries, if , then we append to .
(b) For 3-conjunctive queries, if , then we append to .

Our constructions are mainly based on the counter vector. First, we build a database (DB) containing CV and MM. Then, we encrypt them separately to get EDB = (EMM, ECV). During the search, we first decrypt to obtain the counter vectors and then query them to get the result.
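The counter-vector lookup can be sketched in a few lines. The thresholds are not recoverable from the text above, so the encoding here is an assumption: position i of the vector refers to the i-th document of the first keyword’s posting list, and the counter is 0 if the document contains only the first keyword, 1 if it also contains the second, and 2 if it contains all three. The function name and the sample data are hypothetical.

```python
def query_counter_vector(cv, doc_ids, arity):
    """Return the documents matching a 2- or 3-keyword conjunction.

    Assumed encoding: cv[i] counts how many of the additional keywords
    (second, then third) the i-th document contains cumulatively.
    """
    if arity not in (2, 3):
        raise ValueError("arity must be 2 or 3")
    threshold = 1 if arity == 2 else 2
    return [doc for doc, count in zip(doc_ids, cv) if count >= threshold]


# Seven files of the first keyword and a sample counter vector.
doc_ids = ["f1", "f2", "f3", "f4", "f5", "f6", "f7"]
cv = [0, 1, 2, 0, 2, 1, 0]

two_conj = query_counter_vector(cv, doc_ids, 2)
three_conj = query_counter_vector(cv, doc_ids, 3)
```

Under this assumed encoding a single pass over one vector answers both query types, which is the reason the lookup does not need any set intersection at search time.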

3.1.2. Counter Vector Database

In the proposed mechanism, we need to construct the CV database. The details are shown in Algorithm 1.
Step 1: sort the original inverted index according to the length of DB ().
Step 2: for each keyword pair , initialize an integer vector with length , and compute .
Step 3: for each three-tuple, compute and get the counter vectors as in Table 1.

Once all three-tuples have been processed, we obtain the CV database.

Here, we give a definition of symbol , which will be used in the following algorithm. For two counter vectors and , where , , define as follows.

If , then ; else, .

Input: inverted index
Output: CV database
(1)sort the original inverted index according to ;
(2)for do
(3) compute ;
(4) if then
(5)  initialize the of length with 0;
(6)  while do ;
(7) end
(8)end
(9)for , do
(10) if and then
(11)  compute ;
(12)  if initialize the CV with length , let ; while do ;
(13) end
(14)end
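The construction of the CV database can be sketched on plaintext data as follows. Several details are assumptions because the symbols of Algorithm 1 are not recoverable here: keywords are ordered by posting-list length, each vector is indexed by the posting list of the first (least frequent) keyword, a counter is 1 for a pair match and 2 for a three-keyword match, and tuples without any three-keyword match are skipped to exploit sparsity. The function name is hypothetical.

```python
def build_cv_database(index):
    """Build counter vectors for keyword three-tuples.

    `index` maps a keyword to its sorted list of document identifiers.
    Returns a dict mapping (w1, w2, w3) to a list of counters, one per
    document of w1.
    """
    # Step 1: sort keywords by posting-list length (ties broken by name).
    keywords = sorted(index, key=lambda w: (len(index[w]), w))
    cv_db = {}
    for i, w1 in enumerate(keywords):
        docs = index[w1]
        for j in range(i + 1, len(keywords)):
            w2 = keywords[j]
            docs_w2 = set(index[w2])
            # Step 2: pair vector, 1 where the document also contains w2.
            pair_cv = [1 if d in docs_w2 else 0 for d in docs]
            if not any(pair_cv):
                continue  # no document contains both keywords
            for w3 in keywords[j + 1:]:
                docs_w3 = set(index[w3])
                # Step 3: raise the counter to 2 where w3 also matches.
                cv = [c + 1 if (c == 1 and d in docs_w3) else c
                      for c, d in zip(pair_cv, docs)]
                if 2 in cv:
                    cv_db[(w1, w2, w3)] = cv
    return cv_db


index = {"a": [1, 2, 3], "b": [2, 3, 4, 5], "c": [3, 5, 6, 7, 8]}
cv_db = build_cv_database(index)
```

In this example only document 3 contains all three keywords, so the single stored vector for `("a", "b", "c")` marks document 2 as a pair match and document 3 as a full match.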
3.2. Description of CVX Scheme

The CVX mechanism is the basic scheme we propose. It consists of three modules: Setup, Token, and Query. Setup generates an index database DB and encrypts it to EDB. Token outputs a token from the key and the keyword set. Query returns the query result given the token and EDB. These modules use several algorithms:
(i) A CV encryption scheme , which is adaptively secure.
(ii) A black-box multi-map encryption scheme .
(iii) A private-key encryption scheme SKE = (Gen, Enc, Dec), which is RCPA-secure.
(iv) A pseudo-random function F.
(v) A function GetTag , which is used to get tags given the counter vector and document database.

The CV structure is a multi-map data structure whose contents differ from those of the MM, so CV and MM can be encrypted with the same algorithm, although they are processed differently during Query. We choose a black-box MM encryption scheme as in Kamara and Moataz [16].

The scheme is described as follows.
(i) Setup. In the Setup module, an index database DB is generated and encrypted to EDB. Given the security parameter , compute the keys for encryption. Then, given the keyword/docID pairs, generate an inverted index database MM and a counter vector database CV using Algorithm 1. MM maps each keyword to encrypted identifiers in DB (). CV maps each three-tuple to a vector of integers. Finally, encrypt the MM and CV using algorithms and , respectively, to obtain the encrypted multi-maps (EMM) and encrypted counter vectors (ECV). The output of Setup consists mainly of the encrypted structures EDB = (EMM, ECV) and their keys.
(ii) Token. Given the key and a set of queried keywords , the Token algorithm outputs the token TK = (gtk, ltk). is the global token used to query the MM. is the local token used to query the CV. contains subtokens. Take for example, is calculated from , and contains two subtokens (stk) computed from and , respectively.
(iii) Query. The output of Query is obtained in the following steps. First, input the EMM and global token to get the encrypted multi-maps mm = DB . Second, retrieve the encrypted counter vectors from ECV using the local token . Third, decrypt them using the existing encryption schemes and separately. Subsequently, for each , get the set of according to the counter, compute the intersection , and output the set .

We detail the CV-based conjunctive SSE scheme CVX = (Setup, Token, Query) in Algorithm 2.

(1)Setup
Input: k, inverted index
Output: K, EDB
(2)a. initialize a multi-map MM, for all , pad the according to ;
(3)b. initialize a counter-vectorCV, generate the CV database using Algorithm 1;
(4)c. compute ;
(5)d. compute ;
(6)set and ;
(7)return .
(8):
Input: a subset
Output: token tk
(9)compute gtk = H;
(10)For
(11) select a three-tuple ;
(12) compute ;
(13);
(14)set ;
(15)return ;
(16):
(17)parse as ;
(18)parse as , parse as ;
(19)a. compute ;
(20)b. for all , compute ;
(21)c. for , compute
(22)
3.3. Correctness and Efficiency

In this subsection, we prove the correctness and analyze the efficiency of our scheme.

3.3.1. Correctness

To show the correctness of CVX, we consider the operations of the counter vectors. For a common conjunctive query , the output of CVX.Query (EDB, tk) is

We want to get for conjunctive queries, where

We know that

3.3.2. Efficiency

The Query complexity of CVX is . The size of tokens is . The communication complexity is optimal because the search result has no redundancy. The storage complexity is

is a constant relating to . is the number of keyword sets which contain pairs . is the number of keyword sets which contain three-tuple. is the storage complexity of , which is a black-box multi-map encryption scheme used in this paper.

3.4. Security Analysis

CVX is adaptively secure under controlled disclosure [6]. We mainly consider its leakage functions, which include the Setup leakage and the Query leakage. The details of the CVX leakage profile are described below. Its Setup leakage is

where and are the Setup leakages of the multi-map encryption schemes and counter vector encryption schemes, respectively. The Query leakage is

where for all ,

and is a random function from to .

With adaptive security defined as in Kamara and Moataz [16] (Definition 4.2), we state the following theorem.

Theorem 1. CVX is -secure on the condition that and are adaptively -secure, SKE is RCPA-secure, and F is pseudo-random.

Proof. Suppose and are the simulators guaranteed to exist by the adaptive semantic security of and . To simulate EDB, the simulator for CVX takes the Setup leakage as input:

computes , , and outputs EDB = (ECV, EMM).
To simulate a token, it takes the Query leakage as input:

and then a token can be acquired. For all , we set .
The can be simulated as

For all , stk is simulated as

For all probabilistic polynomial-time adversaries , the probabilities that Real(k) outputs 1 and that Ideal(k) outputs 1 are negligibly close. That is, if and are adaptively secure, SKE is RCPA-secure, and is pseudo-random, then the simulated EDB and are indistinguishable from the real EDB and . In particular, the leaked random tag is indistinguishable from the encrypted identifier in the Real(k) experiment.

4. Improved CV Scheme (ICV)

4.1. Description of ICV

First, we set to be the weight of keyword . As the average weight of all keywords grows, CVX becomes less efficient. For this reason, we improve CVX and propose the ICV scheme, which is more efficient when the average weight is large.

Similar to CVX, ICV has three modules (Setup, Token, and Query) and five algorithms: , , the private-key encryption scheme SKE, the pseudo-random function F, and the GetTag function.

The biggest difference between CVX and ICV lies in the construction of DB. In CVX, the MM is the inverted index, while ICV uses a two-dimensional inverted index to construct the MM. That is, for each keyword pair , compute the DB , pad it to the value of MM , and then we get a two-dimensional MM, i.e., each label corresponds to the set .

A keyword pair can therefore be queried directly using the MM. For a single-keyword query, we only need to let , that is, to acquire . For a -conjunctive query, where , we can parse it into three-tuples, query each three-tuple to get a set , and then output the result .
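The parsing of a longer conjunction into three-tuples can be sketched as follows. The exact partition is not recoverable from the text, so this sketch assumes that the first keyword is shared by every tuple, that the remaining keywords are consumed two at a time, and that a missing slot is filled by repeating a keyword. The function names are hypothetical, and the lookup works on a plaintext inverted index rather than on the encrypted structures.

```python
def split_into_triples(keywords):
    """Split (w1, ..., wq) into three-tuples that all start with w1."""
    if not keywords:
        raise ValueError("at least one keyword is required")
    w1, rest = keywords[0], list(keywords[1:])
    if not rest:
        return [(w1, w1, w1)]  # single-keyword query (assumed padding)
    if len(rest) % 2:
        rest.append(rest[-1])  # repeat the last keyword to fill the tuple
    return [(w1, rest[i], rest[i + 1]) for i in range(0, len(rest), 2)]


def conjunctive_query(index, keywords):
    """Intersect the per-tuple results to answer the full conjunction."""
    result = None
    for triple in split_into_triples(keywords):
        matches = set.intersection(*(set(index[w]) for w in triple))
        result = matches if result is None else result & matches
    return result


index = {"a": [1, 2, 3, 4], "b": [2, 3, 4], "c": [3, 4, 5], "d": [4, 6]}
triples = split_into_triples(["a", "b", "c", "d"])
answer = conjunctive_query(index, ["a", "b", "c", "d"])
```

Each tuple contributes one partial result set, and the final answer is their intersection, mirroring the last step of the description above.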

The scheme is described as follows.
(i) Setup. In the Setup process, an index database is generated and encrypted. First, given the security parameter , we compute the keys. Then, we calculate the two-dimensional index database MM using the keyword/docID pairs. For every keyword pair , MM maps it to encrypted identifiers in DB . Next, given a new keyword , compute a binary counter vector with length ; if , set the to “1”; otherwise, set it to “0.” At this point, CV maps each three-tuple to a vector of binary integers with length . Finally, encrypt the MM and CV using algorithms and , respectively, to obtain EMM and ECV. The output of Setup consists mainly of the encrypted structures EDB = (EMM, ECV) and their keys.
(ii) Token. Given the key and a set of queried keywords , suppose , and the Token algorithm generates the token . is the global token obtained from and is used to query the MM. contains subtokens, . Each is calculated from , .
(iii) Query. The results of Query are obtained through the following steps. First, input the EMM and global token to get MM = DB . Then, parse the , for each , query ECV to get the encrypted counter vectors , and decrypt them to using , separately. Finally, for each , run GetTag to get the set . Once all the subtokens have been processed, compute the intersection and output the set S.
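The ICV index layout can be sketched on plaintext data, leaving out all encryption and token handling. The sketch assumes that every ordered keyword pair with a nonempty intersection is stored in the MM, that the binary vector of a three-tuple is aligned with the pair’s document list, and that a single-keyword query is expressed by repeating the keyword. The function names are hypothetical.

```python
def build_icv_index(index):
    """Build a two-dimensional MM and binary counter vectors.

    `index` maps a keyword to its sorted list of document identifiers.
    Returns (mm, cv): mm maps (w1, w2) to the documents containing both,
    cv maps (w1, w2, w3) to one bit per document of mm[(w1, w2)].
    """
    sets = {w: set(ids) for w, ids in index.items()}
    mm, cv = {}, {}
    for w1 in index:
        for w2 in index:
            common = [d for d in index[w1] if d in sets[w2]]
            if not common:
                continue  # sparse: pairs without common documents are skipped
            mm[(w1, w2)] = common
            for w3 in index:
                bits = [1 if d in sets[w3] else 0 for d in common]
                if any(bits):
                    cv[(w1, w2, w3)] = bits
    return mm, cv


def icv_query(mm, cv, triple):
    """Answer a three-keyword conjunction from the pair list and its bit vector."""
    w1, w2, _ = triple
    docs = mm.get((w1, w2), [])
    bits = cv.get(triple, [0] * len(docs))
    return [d for d, b in zip(docs, bits) if b]


index = {"a": [1, 2, 3], "b": [2, 3, 4], "c": [3, 5]}
mm, cv = build_icv_index(index)
```

Because the pair intersection is precomputed, the vector of a heavy-weight keyword is only as long as the pair’s result list, which is the intuition behind ICV’s advantage when keyword weights are large.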

The details of the ICV SSE scheme ICV = (Setup, Token, Query) are given in Algorithm 3.

(1)Setup
Input: k, inverted index
Output: K, EDB
(2)a. initialize a multi-map MM,
(3) for all ,
(4)  compute ;
(5)b. initialize a binary integer vector cv with length ,
(6) for all ,
(7)  compute ;
(8)  for each ,
(9)   ;
(10)c. compute ;
(11) compute ;
(12) set and ;
(13) return .
(14):
Input: a subset
Output: token tk
(15)compute global token
(16)compute local token :
(17) for , do
(18);
(19)output
(20)return tk;
(21):
(22)parse as ;
(23)parse as , parse as ;
(24)(a)compute ;
(25)(b)for all , compute
(26);
(27)(c)compute ,
(28)
(29)output ;
4.2. Correctness and Efficiency

We now analyze the correctness and efficiency.

4.2.1. Correctness

To show the correctness of ICV, we consider the operations of the counter vectors. For a common conjunctive query , we can get the following equation, which gives the correctness of the ICV.

4.2.2. Efficiency

The Query complexity of ICV is . The sizes of tokens are . The communication complexity is optimal because the search result has no redundancy. The storage complexity is

where is a constant relating to and is the storage complexity of , which is a black-box multi-map encryption scheme used in this paper.

4.3. Security Analysis

Because CVX and ICV use the same data structures, MM and CV, and the same security algorithms, their security is equivalent and can be proved in the same way.

5. Hidden Result Pattern CV SSE (HCV)

In this section, we propose our new hidden result pattern conjunctive keyword query scheme. This scheme employs Yin et al.’s scheme [17] to decrease the result pattern leakage. Unlike Lai’s HXT scheme, we do not use the Bloom filter to construct index database, so our HCV can achieve accurate query and even ideal leakage when q = 3. The details of HCV are given as follows.

5.1. Description of HCV

The HCV mechanism consists of three modules: Setup, Token, and Query. In the Setup module, an index database DB is generated and encrypted to EDB. Token outputs a token for searching from the key and the keyword set. Query returns the query result given the token and EDB. These modules use several algorithms:
(i) A hidden vector CV encryption scheme , which uses the BGN homomorphic encryption technique to protect its privacy.
(ii) A black-box multi-map encryption scheme .
(iii) A private-key encryption scheme SKE = (Gen, Enc, Dec), which is RCPA-secure.
(iv) A pseudo-random function .

The scheme is described as follows.

5.1.1. Setup

In the Setup process, an index database is generated and encrypted. First, it generates an MM and a CV database using the same method as . Then, it encrypts the MM and CV, respectively.

MM is encrypted using the existing black-box encryption scheme , outputting EMM. CV is encrypted using based on BGN homomorphic encryption technique. All the output of Setup mainly contains the encrypted structures EDB = (EMM, ECV) and their keys.

is depicted as follows.
(a) Initialize: Compute a public key and a private key , according to the given security parameter . Further, a one-way hash function is initialized by the data user, and three random numbers , , and are chosen from . The tuple and keyword dictionary are kept secret, and the key is sent to the cloud server.
(b) Encrypt CV: for each three-tuple , we can get a cv , , and the user encrypts it as

where are three random numbers owned by the data user. Then, each counter vector can be encrypted to , and all of the encrypted counter vectors constitute the ECV.
(c) Encrypt files: encrypt each file using the existing SKE encryption scheme, and then send the encrypted files to the cloud server, together with the EMM and ECV.

5.1.2. Token

It is the same as CVX.

5.1.3. Query

The conjunctive query between client and server proceeds as follows.
(a) The client sends the token to the server. The server parses it into subtokens and, for each i, queries the ECV. If one of the queries returns , the server returns to the client and ends the search. Otherwise, it sends a new query request command to the client.
(b) Once it receives the query request, the client initializes an integer vector V of length n, sets each element of V to a fixed value, which can be chosen from , chooses n random numbers , and encrypts the vector V as , where

The client sends the subtokens and to the server.
(c) After receiving the subtokens and , the cloud server performs the search operation. For each subtoken , the server initializes a vector , computes , and then for each , computes and matches it with the jth element of . If , it sets ; otherwise, it sets . Then, the server gets each using the same method and computes the intersection of .
(d) Finally, the server computes and outputs as the search result.

The details of the HCV SSE scheme HCV = (Setup, Token, Query) are given in Algorithm 4.

(1)Setup
Input: Security Parameter k, Inverted Index
Output: K, EDB
(2)Generate key according to ;
(3)Get the EMM as in ;
(4)Compute CV database using Algorithm 1;
(5)Encrypt CV as equation (15) to acquire ECV;
(6)set ,
(7)return .
(8)Token
(9)The same as CVX.
(10)Query;
Input: tk, EDB
Output: result
(11)initialize a tag vector S;
(12)initialize a bool vector ;
(13)initialize an integer vector ;
(14)(a) Server parses , and parses ;
(15)for do
(16) if ;
(17)  return;
(18)Server sends Query Request command to Client.
(19)(b) for do
(20) set a vector and encrypt it to as equation (16), send and to Server.
(21)(c) Server does as follows:
(22)set each element of to integer 1; for do
(23) set ,
(24) for do
(25)  if then
(26)   ;
(27)  else
(28)   ;
(29)Finally, the Server computes ;
(30)return
5.2. Correctness and Efficiency

In this subsection, we prove the correctness and analyze the efficiency.

5.2.1. Correctness

The correctness of HCV can be proved using the following equation. When it holds, the check operations in are valid, and the result is correct.

5.2.2. Efficiency

The search complexity of HCV is . The sizes of tokens are . The communication complexity is optimal because the search result has no redundancy. CVX and HCV have the same CV data structure before encryption. However, the encryption algorithm of CVX keeps the length of the data unchanged, whereas HCV extends it by a factor of , so the storage complexity of HCV is times that of CVX, where is determined by HCV’s security parameter k.

5.3. Security Analysis

In this subsection, we analyze the privacy and security of HCV. We focus on two aspects: one is the RP leakage, and the other is the privacy of the outsourced data and the conjunctive queries.

5.3.1. RP Leakage Comparison

We consider two leakage components, KPRP and WRP, and illustrate the difference with an example. Consider a database containing six documents labelled by , each of which contains some keywords. The details are listed in Table 3.

Consider the conjunctive query . Suppose the keyword is the least frequent. For the three queried keywords, the inverted indexes are listed below: , , and .

Firstly, we compute the RP leakage component in Cash’s OXT:

As shown in Table 4, the RP leakage reveals 4 entries of the inverted index.

Lai’s HXT protocol eliminates the “partial query” (KPRP) leakage, which is a part of RP leakage. It only has the leakage of whole result pattern (WRP). Actually, in HXT, . The HXT protocol reveals the exact result of the query, that is, only . However, the HXT protocol leaks two entries and as shown in Table 4.

Our scheme HCV reveals the exact result , which is ideal leakage. Besides, HCV reveals nothing about the entry. The server only gets the encrypted keyword-tuples. The leakage comparison is given in Table 5.
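
The difference between the two leakage components can be made concrete with a small Python sketch. The inverted index below is hypothetical and is not the data of Table 3. The sketch assumes the usual reading of the two terms: KPRP is the set of pairwise intersections between the least frequent queried keyword and each other queried keyword, and WRP is the intersection over all queried keywords. The function names are illustrative only.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical inverted index (keyword -> document ids); NOT the data of Table 3.
INDEX: Dict[str, Set[int]] = {
    "w1": {1, 2, 3},
    "w2": {1, 2, 4, 5},
    "w3": {2, 3, 5, 6},
}


def least_frequent(index: Dict[str, Set[int]], query: List[str]) -> str:
    """Return the queried keyword that matches the fewest documents."""
    return min(query, key=lambda w: len(index[w]))


def kprp(index: Dict[str, Set[int]], query: List[str]) -> Dict[Tuple[str, str], Set[int]]:
    """Pairwise intersections of the least frequent keyword with every other keyword."""
    s = least_frequent(index, query)
    return {(s, w): index[s] & index[w] for w in query if w != s}


def wrp(index: Dict[str, Set[int]], query: List[str]) -> Set[int]:
    """Documents matching all queried keywords, i.e. the exact query result."""
    return set.intersection(*(index[w] for w in query))


QUERY = ["w1", "w2", "w3"]
print("KPRP:", kprp(INDEX, QUERY))
print("WRP: ", wrp(INDEX, QUERY))
```

In this toy database the pairwise intersections are {1, 2} and {2, 3}, while the whole result is {2}. A scheme that leaks KPRP therefore also exposes documents 1 and 3, which are not part of the query result.
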

5.3.2. HCV Privacy-Preserving

(1) Outsourcing Data Is Privacy-Preserving. In the HCV scheme, the outsourced data include a file collection and an index database, both of which are encrypted. The encrypted files are secure because the server performs no operation on them. The encrypted index includes two data structures, MM and CV. For MM, we use a black-box MM encryption scheme, whose security is not discussed here. The CV structure, as described in Section 5.1, contains n encrypted elements represented as , where . Theorem 2 shows that the cloud server can get nothing related to .

Theorem 2. If is a secure one-way hash function, the cloud server can get nothing related to from the encrypted CV database.

Proof. We consider two cases.
On the one hand, for each encrypted element , to get the value of , the server must either invert the function H or exhaustively search the value of . However, H is a one-way function and cannot be inverted. Meanwhile, is a very large integer (i.e., 1024 bits), so exhaustive search is currently infeasible. If the cloud server simply guesses the value, it obtains the correct result with probability only 1/3 for each element, as .
On the other hand, for different encrypted elements, e.g., and , if , the server cannot know whether and have the same value, as H is a secure one-way hash function.
Therefore, the cloud server gets nothing about from the encrypted tag array [17].

(2) Conjunctive Query Is Privacy-Preserving. The conjunctive query contains three steps: query request, query processing, and query response.

In the query request, the query vector is encrypted to be , where for each , we have . The encryption guarantees is secure. Meanwhile, as each query uses a random value , the encryption is nondeterministic. That is, for two queries, the honest-but-curious server cannot determine whether they have the same queried keywords.
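
The effect of the per-query randomness can be illustrated with a toy Python sketch. It is not BGN and offers no security: it works with integers modulo a prime instead of the composite-order pairing group that BGN requires, and the constants `P`, `G`, and `H` are arbitrary illustrative choices. It only mimics the ciphertext form g^m h^r, so that the same plaintext encrypted with different randomness yields different ciphertexts, while ciphertexts can still be combined additively.

```python
import secrets
from typing import List, Optional

# Toy parameters (assumptions for illustration only, NOT secure and NOT BGN).
P = 2**127 - 1  # a Mersenne prime used as a toy modulus
G = 3           # stand-in for the generator g
H = 5           # stand-in for the blinding element h


def encrypt(m: int, r: Optional[int] = None) -> int:
    """Return G^m * H^r mod P, drawing fresh randomness r when none is given."""
    if r is None:
        r = secrets.randbelow(P - 1)
    return (pow(G, m, P) * pow(H, r, P)) % P


def encrypt_vector(values: List[int]) -> List[int]:
    """Encrypt each element of a query vector with independent randomness."""
    return [encrypt(v) for v in values]


def add(c1: int, c2: int) -> int:
    """Multiplying ciphertexts adds the plaintexts (and the randomness)."""
    return (c1 * c2) % P
```

For example, `encrypt(1, 5)` and `encrypt(1, 7)` encrypt the same value but differ, and `add(encrypt(2, 5), encrypt(3, 7))` equals `encrypt(5, 12)`.
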

For the query processing, the server receives the query vector which is encrypted, calculates , and matches the result with the ith element in the encrypted tag array. In this process, the server can only get information about the access pattern and search pattern, which our scheme allows.

For the query response, the search result is returned to the client by the cloud server. The search result consists of some encrypted files, and the encryption algorithm can be nondeterministic, e.g., AES-CBC. Thus, the adversary cannot correlate two queries even if they use the same queried keywords [17].

6. Experiment Evaluation

All experiments were run on an Intel(R) Core(TM) i5-8400 CPU @ 2.80 GHz processor with 8 GB RAM, running a commodity Windows 10 system. We implemented the CVX and ICV schemes in Python and the HCV scheme in C++ for efficiency. To show the practicability of our solution, the data used in the experiments are from the NSF e-mail dataset. The whole set contains 30799 keywords and 49078 files, from which we sampled subsets for the experiments.

We use the common hash algorithm HMAC-SHA1, and the encryption algorithm for CVX and ICV is AES-CBC. Both CVX and ICV consist of the Setup, Token, and Query algorithms. Setup and Token are run locally by the client, while Query is run by the server. We assume that the client has sufficient computing resources, so only the efficiency of Query is considered here. In the HCV scheme, we use the BGN homomorphic encryption algorithm and mainly evaluate the Setup performance. We generate the BGN parameters through the type A1 pairing in the PBC library, based on the curve $y^2 = x^3 + x$ (the group order N is a 1024-bit number).

Firstly, we test the performance of CVX and ICV and compare them with Kamara’s IEX. To perform a search, Kamara’s scheme decrypts and then computes the intersection of sets, whereas CVX decrypts and then computes the intersection of vectors. We take as an example in our experiments. For each keyword , we let denote the weight of . Search efficiency is closely related to the keywords’ weight: the larger the weight, the less efficient the query operation.

For simplicity, we use the mean value of all keywords’ total weight as the metric to test the Query efficiency. In the low-weight case, we compare the Query efficiency of CVX and ICV with that of Kamara’s scheme. The experiment shows that the Query efficiency of CVX and ICV is not significantly different, while both are dramatically better than Kamara’s IEX. The experimental results are shown in Figure 1. The abscissa is the mean value of keyword weight, and the ordinate is the Query time in seconds. The number of sampled keywords ranges from 756 to 779.

We then found that CVX becomes less efficient as the mean weight grows, which motivated ICV. When the weight is larger, the advantage of ICV is obvious: compared with CVX, ICV is more suitable for larger mean weight values. Figure 2 shows the experimental results. The abscissa is the mean value of keyword weight, and the ordinate is the Query time in seconds. The number of sampled keywords ranges from 203 to 537.

Secondly, we test the performance of HCV, which encrypts the index database with BGN homomorphic encryption, a technique with high computational complexity. Therefore, we focus on the Setup experiment, which consists of generating the CV database and encrypting it using the BGN algorithm. We mainly tested the time and storage costs of Setup, both of which are closely related to the number of keywords and the weight of the keywords. Our experiments are based on two samples of the NSF dataset: one contains about 512 keywords, and the other contains about 256 keywords. The experiments mainly depict how the keyword number and weight affect the Setup efficiency. When the number of keywords is about 512, we set the mean weight as 23/51/83/112. When the number of sampled keywords is about 256, we set the mean weight as 53/76/95/116. The experimental results are shown in Figure 3. Figure 3(a) describes the time cost of the HCV Setup process. The horizontal axis is the mean weight, and the vertical axis is the time of the Setup process. If we fix the number of keywords, the storage cost depends on the number of counter vectors, which is mainly correlated to the keywords’ weight. Figure 3(b) describes the relationship between the number of counter vectors and the keywords’ weight, which reflects the storage cost. The horizontal axis is the mean weight, and the vertical axis is the number of reserved counter vectors.

7. Conclusion

This paper proposed a novel CV data structure, which is used in our proposed CVX, ICV, and HCV conjunctive query SE schemes. In CVX, the search efficiency is greatly improved compared with Kamara’s IEX. However, when the weight of the keywords is larger, the search efficiency of CVX decreases dramatically, so we improved CVX and proposed ICV, which is more efficient but leaks more information than CVX. In HCV, we use BGN homomorphic encryption to encrypt the CV data structure in order to reduce the RP information leakage. Security analysis shows that our scheme is secure, and performance evaluation validates its efficiency. However, the homomorphic encryption used in HCV incurs heavy computation and storage costs, so HCV can only be used for small datasets. In future work, we will study more scalable schemes and consider more security properties.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under grant nos. 62072466 and U1811462.