Abstract
Searchable symmetric encryption that supports dynamic multi-keyword ranked search (SSE-DMKRS) has been intensively studied during recent years. Such a scheme allows data users to dynamically update documents and retrieve the most wanted documents efficiently. Previous schemes suffer from high computational costs, since their time and space complexities are linear in the size of the dictionary generated from the dataset. In this paper, by utilizing a shallow neural network model called “Word2vec” together with a balanced binary tree structure, we propose a highly efficient SSE-DMKRS scheme. The “Word2vec” tool can effectively convert the documents and queries into a group of vectors whose dimensions are much smaller than the size of the dictionary. As a result, we can significantly reduce the related space and time costs. Moreover, with the use of the tree-based index, our scheme can achieve sublinear search time and support dynamic operations such as insertion and deletion. Both theoretical and experimental analyses demonstrate that our scheme surpasses previous schemes of the same kind in efficiency, giving it wide application prospects in the real world.
1. Introduction
Nowadays, with the development of network and virtualization technology, cloud computing has developed rapidly. Through cloud services, enterprises and individuals can obtain better computing and storage services at a lower cost. Since cloud servers are not entirely trusted, utilizing cloud services while maintaining data privacy is an essential concern. A straightforward way to address this issue is to encrypt the data before outsourcing it to the cloud servers. However, this approach fails to meet the requirement of data retrieval, since traditional encryption scrambles the original data and makes it inconvenient to utilize. In this scenario, users have to download all the ciphertext data and decrypt them locally, which brings huge transmission, storage, and computation overhead and is therefore not applicable in the cloud environment.
Searchable encryption (SE) can support keyword search without decrypting the data, and thus it is very suitable for achieving keyword search over ciphertext. In an SE scheme, data owners and authorized users share a secret key. Data owners can encrypt the sensitive data and upload them to the cloud server. If data users want to search the encrypted data, they can generate an encrypted trapdoor by using the query and the secret key. When the cloud server receives the trapdoor, it tests the trapdoor against the encrypted data without decrypting them and returns the data related to the query to the users. The first searchable symmetric encryption (SSE) scheme was proposed by Song et al. [1]. This scheme can not only encrypt data but also provide a search mechanism over the encrypted data. With the improvement of the security and efficiency of SSE schemes, SSE has attracted wide attention from the community. During recent years, researchers have focused on how to construct solutions with complex query functions, such as multi-keyword search [2–7], similarity search [8, 9], and ranked search [10–15]. In particular, ranked search schemes can sort the query results according to the degree of relevance between the documents and the query and only return the most related (top-k) documents. Thus, ranked schemes can significantly reduce the computation and storage costs.
The earliest ranked search schemes were proposed in [10, 11], but they only support single-keyword search. The first SSE scheme supporting multi-keyword ranked search was given by Cao et al. [12]. The score evaluation method used in their scheme is the inner product between the query and document vectors. In their scheme, since each document has its own vector representation, the search time is linear in the number of documents in the dataset, which leads to a very high overhead in a big data environment. Then, Sun et al. [13] gave a similar scheme with better-than-linear search efficiency by using a tree-based index [16, 17]. They adopt the technique of term frequency-inverse document frequency (TF-IDF) to evaluate the score between the index and queries. To further improve the search efficiency and support dynamic updates, Xia et al. proposed an efficient SSE scheme supporting dynamic multi-keyword ranked search [14]. In their scheme, they construct a tree-based structure and propose a parallel search algorithm to accelerate the search process. Moreover, they also provide a dynamic update method to cope with the deletion and insertion of documents flexibly. Recently, by utilizing the Bloom filter [18], Guo et al. constructed an efficient SSE scheme supporting dynamic multi-keyword ranked search [15] to further improve the efficiency of keyword search and index construction. Owing to the Bloom filter, the internal nodes in the index tree do not need to be encrypted, and the dimension of the vectors in the internal nodes is also reduced. As a result, this scheme can achieve better performance than the previous similar schemes.
Another kind of SE is called searchable public key encryption (SPE), which is established on the public key system. In SSE, the key for encrypting data is the same as the key for generating search trapdoors. By contrast, in SPE, the public key for encrypting data is open to the public, while the secret key for generating search trapdoors is only given to the authorized data receivers. The very first SPE scheme supporting keyword search was introduced by Boneh et al., and it is the so-called public key encryption with keyword search (PEKS) [19]. However, their work only supports single-keyword search. In order to support more expressive queries, many SPE schemes [20–23] were proposed to realize advanced search, for example, conjunctive, disjunctive, and Boolean keyword search. By using a special hidden structure, Xu et al. proposed two SPE schemes supporting single-keyword search [24, 25] whose search performance is very close to that of a practical SSE scheme. By converting an attribute-based encryption scheme, Han et al. proposed an SPE scheme which can control a user’s search permission according to an access control policy [26]. After this, Kai et al. proposed an SPE scheme achieving both Boolean keyword search and fine-grained search permission [27]. Sepehri et al. proposed a scalable proxy-based protocol for privacy-preserving queries, which allows authorized users to perform queries over data encrypted with different keys [28]. Later, by utilizing an ElGamal elliptic curve encryption system, Sepehri et al. gave a similar scheme with better efficiency [29]. In order to improve search accuracy, Zhang et al. proposed an SPE scheme supporting semantic keyword search by adopting a method called “Word2vec” [30]. For the sake of brevity, we summarize some SPE and SSE schemes in Table 1, which describes the differences between our scheme and previous schemes.
1.1. Motivation
The previous ranked search schemes in the symmetric key setting are secure and somewhat efficient. However, the index building, trapdoor generation, and search times are all related to the size of the dictionary generated from the dataset, which is not suitable for the big data environment. According to the statistical information given in [20], the vocabulary size in a dataset is commonly on the order of 10^{6}. Therefore, it is necessary to construct a more efficient ranked search scheme. Motivated by this, in this paper, we aim to construct a novel SSE scheme supporting dynamic multi-keyword ranked search (SSE-DMKRS) with high efficiency.
1.2. Contributions
The main contributions are summarized as follows:
(1) Based on the “Word2vec” [31] technique, we propose a novel method that converts the documents and queries into vector representations. The dimension of the vector representation obtained by our method is nearly 10% of that in the previous SSE-DMKRS schemes [14, 15].
(2) We propose an efficient index building algorithm that creates a balanced binary tree to index all the documents. The obtained index tree achieves sublinear search time and supports dynamic update operations.
(3) By applying the secure k-nearest neighbour (KNN) scheme [32] to encrypt the index tree and the query, we propose an efficient SSE-DMKRS scheme.
In addition, we implement our scheme on a widely used data collection. The experimental results show that our scheme dramatically reduces the time cost of index building, trapdoor generation, keyword search, and update without losing too much accuracy; e.g., the time cost of index building in our scheme is nearly 10% of that in the previous schemes. Meanwhile, the storage cost of the encrypted index is also greatly reduced; e.g., the storage cost of the index in our scheme is nearly one percent of that in the previous schemes. In conclusion, compared with the previous SSE-DMKRS schemes [14, 15], our scheme is very suitable for the mobile cloud environment, in which the client device has limited computation and storage resources.
1.3. Organization
This paper is organized as follows. In Section 2, we give a formal definition of the system model and threat model of our scheme and also introduce the tools we adopt, namely “Word2vec” and the vector space model. In Section 3, we present the construction of the search index tree and the SSE-DMKRS scheme. In addition, a detailed security analysis and the update operations of our scheme are also given. Theoretical and experimental analyses are given in Section 4. Section 5 concludes the paper.
2. Preliminaries
In this section, we first give the framework of the system model and introduce the threat model adopted in our scheme. Then, we introduce some tools adopted in our scheme, including a famous term representation method in the field of natural language processing, namely “Word2vec,” and the vector space model. Finally, we present the design goals of our scheme. In addition, the main notations used in this paper are summarized in Table 2.
2.1. System Model
The system model contains three different roles: data owner, data user, and cloud server. The data owner outsources a group of documents F = {f_{1}, f_{2}, …, f_{n}} to the cloud in ciphertext form C = {c_{1}, c_{2}, …, c_{n}}. Moreover, the data owner also generates an encrypted searchable index for keywords search operation. For each query of an arbitrary keyword set Q, the data user computes a search trapdoor T_{Q} of the query Q and sends it to the cloud server. Upon receiving T_{Q} from the data user, the cloud server searches against the encrypted index and returns the candidate encrypted documents. After this, the data user decrypts the candidate documents and obtains the plaintext.
As illustrated in Figure 1, the architecture of the system model is formally described as follows:
(1) Data Owner (DO). DO holds a group of documents F = {f_{1}, f_{2}, …, f_{n}} and generates a secure searchable index I from F and an encrypted document collection C for F. Then, DO uploads I and C to the cloud server and distributes the secret key to the authorized data users. Furthermore, DO needs to update the index and documents stored in the cloud server.
(2) Data User (DU). An authorized DU can issue keyword queries over the encrypted data by utilizing a trapdoor, which is generated with the secret key fetched from DO. Moreover, DU can decrypt the encrypted documents by utilizing the secret key.
(3) Cloud Server (CS). CS stores the encrypted index I and documents C from DO. When CS receives the trapdoor for a query Q from DU, CS executes the keyword query over the index and returns the top-k most relevant encrypted documents associated with the query Q. Upon receiving update information from DO, CS also performs the update operation over the encrypted data. In addition, we assume that CS is “honest-but-curious,” as in many searchable encryption schemes [12, 14, 15]. This means that CS honestly and correctly executes the algorithms in our scheme; however, CS curiously infers and analyses the received data to obtain extra private information.
2.2. Threat Model
Throughout the paper, we mainly utilize the two threat models proposed by Cao et al. [12]:
(1) Known Ciphertext Model. CS only knows the information of the encrypted index, the ciphertext, and the trapdoor. That is to say, CS can only execute ciphertext-only attacks in this model.
(2) Known Background Model. CS knows more information than in the known ciphertext model, such as statistical information inferred from the documents. By taking advantage of such statistical information, e.g., term frequency (TF) and inverse document frequency (IDF), CS can conduct statistical attacks to verify whether certain keywords are in the query [33].
2.3. Design Goals
As mentioned before, we aim to build a secure and efficient SSE-DMKRS scheme. The design goals of our scheme are described as follows:
(1) Efficiency. The scheme aims to achieve sublinear search efficiency, with time and space costs of index building and trapdoor generation much lower than those of current schemes.
(2) Privacy Preserving. Similar to previous schemes [12, 14, 15], our scheme needs to prevent CS from learning extra private information inferred from the documents, the secure index, and the queries. More precisely, the privacy requirements are listed as follows:
● Index and Trapdoor Privacy. The plaintext information concealed in the index and the trapdoor cannot be leaked to CS. This information involves the keywords and the corresponding vector representation of each keyword.
● Trapdoor Unlinkability. CS cannot determine whether two trapdoors are built from the same query.
● Keyword Privacy. CS cannot identify whether a specific keyword is in the trapdoor or the index by analysing the search results and the statistical information of the documents.
(3) Dynamic. The scheme can efficiently support dynamic operations such as document insertion and deletion. Note that the efficiency of the update operations in our scheme is better than that of the previous SSE-DMKRS schemes.
2.4. Word2Vec
The “Word2vec” model is a shallow, two-layer neural network, which is used to convert words into a group of vector representations [31]. Under this model, each word in the document set is mapped to a vector, which can be used to calculate the similarity between words. For instance, Figure 2 shows that, by training on a simple corpus, the three words “dog,” “fox,” and “orange” are mapped to three vector representations, respectively. By utilizing these vectors, the similarity among the three words can be calculated. We find that the similarity between “dog” and “fox” is higher than that between “dog” and “orange,” since “dog” and “fox” are both animals. Thus, we can utilize “Word2vec” to convert the keywords in a corpus into a group of vector representations and then apply these vectors to perform ranked search.
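As a small illustration of this idea, the following sketch compares word similarities with the cosine measure. The four-dimensional embeddings are hypothetical stand-ins for trained “Word2vec” vectors (real models typically use 100-300 dimensions):

```python
import math

# Hypothetical 4-dimensional embeddings standing in for trained
# "Word2vec" vectors; the values are illustrative only.
embeddings = {
    "dog":    [0.82, 0.41, -0.12, 0.35],
    "fox":    [0.78, 0.36, -0.05, 0.41],
    "orange": [-0.30, 0.15, 0.88, -0.22],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

sim_dog_fox = cosine(embeddings["dog"], embeddings["fox"])
sim_dog_orange = cosine(embeddings["dog"], embeddings["orange"])
# Semantically close words ("dog", "fox") score higher than
# unrelated ones ("dog", "orange").
assert sim_dog_fox > sim_dog_orange
```

In practice the embeddings would come from a trained model rather than a hand-written table; only the similarity computation itself is shown here.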
2.5. Vector Space Model and Keyword Conversion
The vector space model is a very popular method in the field of information retrieval, usually used along with the TF-IDF rule to realize top-k search, where TF is the term frequency and IDF is the inverse document frequency [34]. By utilizing the TF-IDF rule, the documents and queries can be represented as a group of vectors. These vectors can be adopted in top-k search over the ciphertext [12, 14, 15]. However, the dimension of these vectors is linear in the number of words in the dataset, which is inefficient if the dataset has a lot of words. To address this issue, we apply “Word2vec” to present a novel keyword conversion method, which is described as follows:
(1) By applying “Word2vec” to a corpus, we create a dictionary in which each keyword is associated with a vector representation.
(2) For the keyword set W_{i} = {w_{i,1}, w_{i,2}, …, w_{i,s}} of the document f_{i}, we obtain the vectors v_{i,1}, v_{i,2}, …, v_{i,s} by looking up the dictionary, where v_{i,j} is the vector representation for w_{i,j}, 1 ≤ j ≤ s. After this, we set V_{i} = v_{i,1} + v_{i,2} + ⋯ + v_{i,s} as the vector representation of W_{i}.
(3) For the query keyword set Q = {q_{1}, q_{2}, …, q_{t}}, we utilize the dictionary to obtain the vectors u_{1}, u_{2}, …, u_{t}. Then, we set V_{Q} = u_{1} + u_{2} + ⋯ + u_{t} as the vector representation of Q.
Note that the dimensions of the vectors for W_{i} and Q are very small, e.g., 200, which is significantly smaller than the number of words in the dataset. Thus, the proposed method is better than the previous method based on the TF-IDF rule. In addition, together with the vector space model mentioned above, we use the cosine measure to evaluate the relevance between the document and the query. The relevance evaluation function is defined in the next section.
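The keyword conversion step can be sketched as follows. The aggregation by coordinate-wise summation and the three-dimensional dictionary entries are illustrative assumptions for this sketch, not the paper's exact parameters:

```python
# Sketch of the keyword-conversion step. The per-keyword vectors are
# assumed to be aggregated by coordinate-wise summation; the
# dictionary values are hypothetical low-dimensional embeddings.
dictionary = {
    "cloud":   [0.4, -0.1, 0.3],
    "search":  [0.2, 0.5, -0.2],
    "privacy": [-0.3, 0.2, 0.6],
}

def to_vector(keywords, dictionary):
    """Map a keyword set to a single d-dimensional vector by summing
    the embedding of each keyword found in the dictionary."""
    d = len(next(iter(dictionary.values())))
    vec = [0.0] * d
    for w in keywords:
        if w in dictionary:
            for i, x in enumerate(dictionary[w]):
                vec[i] += x
    return vec

doc_vec = to_vector({"cloud", "privacy"}, dictionary)     # document side
query_vec = to_vector({"search", "privacy"}, dictionary)  # query side
```

Note that the resulting dimension equals the embedding dimension d, independent of the dictionary size, which is the source of the efficiency gain discussed above.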
3. Proposed Scheme
In this section, we first give the algorithm for building the index tree and the search algorithm over this tree. Then, we give the concrete construction of our scheme and its dynamic update operations. Finally, we give a detailed analysis of the security of our scheme.
3.1. Search Index Balanced Binary Tree
In this section, we adopt a balanced binary tree to create the search index, which will be used in our main scheme. Inspired by the construction process in [14], the tree building and the search process for our scheme are described as follows.
3.1.1. Tree Building Process
Formally, the data structure of a tree node u is defined as u = ⟨ID, V_{u}^{max}, V_{u}^{min}, P_{l}, P_{r}, FID⟩, where ID is the identity of the node u, V_{u}^{max} and V_{u}^{min} are the vector representations of the node u, P_{l} and P_{r} are pointers to u’s left and right children, respectively, and FID stores the identity of a document if u is a leaf node. Note that, compared with the previous index trees [12, 14, 15], a node in our tree has two vectors, while it has only one vector in the previous trees. The main reason is that a node vector in our tree can contain negative numbers, while a node vector in the previous trees contains only positive numbers. For clarity, we give a simple example with illustrative values. Let V_{A} = (0.6, 0.75, −0.05) and V_{B} = (0, 0.75, 0.55) be the vectors of two leaf nodes A and B, respectively. For the previous index trees, the vector of the parent node C of these two leaf nodes is V_{C} = (0.6, 0.75, 0.55), in which the value of each dimension is the larger value of V_{A} and V_{B}. For a query vector V_{Q} = (−1, 1, −1), the scores of the nodes A, B, and C are 0.2, 0.2, and −0.4, respectively. It is very important to note that the score of the parent node is less than the scores of its children, which causes these two leaf nodes to be ignored in the tree search process even when they should be considered.
In our index tree, let the dimensions of V_{u}^{max} and V_{u}^{min} both be d. The methods for constructing V_{u}^{max} and V_{u}^{min} are denoted by M_{1} and M_{2}, respectively, and given as follows:
(1) M_{1}: if the node u is a leaf node corresponding to a file f, we create a vector V_{f} for f by adopting the keyword conversion method mentioned in Section 2.5. Then, we set V_{u}^{max} = V_{f} and V_{u}^{min} = V_{f}.
(2) M_{2}: if the node u is an internal node, V_{u}^{max} and V_{u}^{min} are based on its children’s vectors. Let V_{lc}^{max} and V_{lc}^{min} be the two vectors of u’s left child, and let V_{rc}^{max} and V_{rc}^{min} be the two vectors of u’s right child.
Suppose that Min () and Max () are the minimum and maximum functions, respectively; V_{u}^{max} is built as follows:
V_{u}^{max}[i] = Max (V_{lc}^{max}[i], V_{rc}^{max}[i]), 1 ≤ i ≤ d. (1)
And V_{u}^{min} is built as follows:
V_{u}^{min}[i] = Min (V_{lc}^{min}[i], V_{rc}^{min}[i]), 1 ≤ i ≤ d. (2)
That is, V_{u}^{max} is built by taking the larger value of V_{lc}^{max} and V_{rc}^{max} in each dimension, and V_{u}^{min} is created by taking the smaller value of V_{lc}^{min} and V_{rc}^{min} in each dimension.
An illustration of the above methods is given in Figure 3. In Figure 3, let the node u be a leaf node, and let W be the keyword set of the file that u stores. By using the keyword conversion method, W is converted into a vector V; then we set V_{u}^{max} = V_{u}^{min} = V. If the node u is an internal node, its vectors V_{u}^{max} and V_{u}^{min} are obtained by taking the coordinate-wise maximum and minimum of its children’s vectors V_{lc}^{max}, V_{lc}^{min}, V_{rc}^{max}, and V_{rc}^{min}.
Based on the methods M_{1} and M_{2}, and inspired by the tree building algorithm introduced in [14], our tree building algorithm is given in Algorithm 1. An example of the proposed index tree is given in Example 1 and Figure 4. In Algorithm 1, we use the function GenID () to generate a unique identity ID for each node and apply GenFID () to generate a unique file ID for each leaf node. CurrentNodeSet contains the group of nodes that have no parent node yet and need to be processed, and |CurrentNodeSet| is the number of nodes in CurrentNodeSet. If |CurrentNodeSet| is even, we assume that |CurrentNodeSet| = 2h; otherwise, we assume that |CurrentNodeSet| = 2h + 1, where h is a positive integer. TempNodeSet is a set containing the newly generated nodes. Moreover, for each node u, if u is a leaf node, we use method M_{1} to generate V_{u}^{max} and V_{u}^{min}; otherwise, V_{u}^{max} and V_{u}^{min} are created by using M_{2}.
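The tree building process can be sketched in plain Python as follows. The class and function names (Node, build_index_tree) are illustrative, and the handling of an odd number of parent-less nodes is simplified here (the leftover node is simply carried up unchanged), which may differ from the paper's Algorithm 1:

```python
# Sketch of the index-tree construction: leaves copy the document
# vector into both V_max and V_min (method M1), and internal nodes
# take the coordinate-wise max/min of their children (method M2).
import itertools

_ids = itertools.count()  # stands in for GenID()

class Node:
    def __init__(self, v_max, v_min, left=None, right=None, fid=None):
        self.id = next(_ids)
        self.v_max = v_max          # upper vector of the node
        self.v_min = v_min          # lower vector of the node
        self.left, self.right = left, right
        self.fid = fid              # file identity for leaf nodes

def make_leaf(fid, doc_vec):
    # M1: a leaf stores the document vector in both positions.
    return Node(list(doc_vec), list(doc_vec), fid=fid)

def make_internal(lc, rc):
    # M2: coordinate-wise maximum / minimum over the two children.
    v_max = [max(a, b) for a, b in zip(lc.v_max, rc.v_max)]
    v_min = [min(a, b) for a, b in zip(lc.v_min, rc.v_min)]
    return Node(v_max, v_min, left=lc, right=rc)

def build_index_tree(doc_vectors):
    """doc_vectors: dict mapping file id -> document vector."""
    current = [make_leaf(fid, v) for fid, v in doc_vectors.items()]
    while len(current) > 1:
        nxt = []
        # Pair neighbouring parent-less nodes; an odd node is carried up.
        for i in range(0, len(current) - 1, 2):
            nxt.append(make_internal(current[i], current[i + 1]))
        if len(current) % 2 == 1:
            nxt.append(current[-1])
        current = nxt
    return current[0]

root = build_index_tree({
    "f1": [0.2, 0.4], "f2": [0.5, -0.1], "f3": [-0.3, 0.6],
})
```

Building level by level in this bottom-up fashion yields a balanced tree of height O(log n) over n documents.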

3.1.2. Search Process
For a query vector V_{Q} of a query Q, we split V_{Q} into two vectors V_{Q}^{pos} and V_{Q}^{neg}. For each dimension i (1 ≤ i ≤ d), if V_{Q}[i] < 0, we set V_{Q}^{neg}[i] = V_{Q}[i] and V_{Q}^{pos}[i] = 0; otherwise, we set V_{Q}^{pos}[i] = V_{Q}[i] and V_{Q}^{neg}[i] = 0. Obviously, V_{Q}^{neg} holds all the negative parts of V_{Q}, while V_{Q}^{pos} holds the positive parts. For clarity, we denote this splitting method for the query Q by M_{3}. An illustration of this method is given in Figure 5. For example, if the query vector is V_{Q} = (0.3, −0.5, 0.2), then V_{Q}^{pos} = (0.3, 0, 0.2) and V_{Q}^{neg} = (0, −0.5, 0).
For a query Q and a node u, the score is calculated as
Score (u, Q) = V_{u}^{max} · V_{Q}^{pos} + V_{u}^{min} · V_{Q}^{neg}. (3)
We can utilize this equation to evaluate which documents are the most relevant to the query. Moreover, we can verify that the score of a parent node is no less than its children’s scores: the entries of V_{Q}^{pos} are nonnegative and the parent’s V^{max} dominates its children’s coordinate-wise, while the entries of V_{Q}^{neg} are nonpositive and the parent’s V^{min} is dominated by its children’s coordinate-wise. This property can significantly reduce the number of nodes checked in the search process.
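The splitting method M_{3}, the score of equation (3), and the parent-dominance property can be checked with a short sketch (all vector values are illustrative):

```python
def split_query(v_q):
    """M3: split the query vector into its positive and negative parts."""
    v_pos = [x if x > 0 else 0.0 for x in v_q]
    v_neg = [x if x < 0 else 0.0 for x in v_q]
    return v_pos, v_neg

def score(v_max, v_min, v_pos, v_neg):
    """Score(u, Q) = V_max . V_pos + V_min . V_neg  (equation (3))."""
    return (sum(a * b for a, b in zip(v_max, v_pos))
            + sum(a * b for a, b in zip(v_min, v_neg)))

v_pos, v_neg = split_query([0.3, -0.5, 0.2])
# Parent dominance: for a parent built with M2, its score upper-bounds
# the scores of both children, so low-scoring subtrees can be pruned.
child_a = ([0.4, 0.1, 0.6], [0.4, 0.1, 0.6])   # leaf: V_max == V_min
child_b = ([0.1, 0.7, 0.2], [0.1, 0.7, 0.2])
parent_max = [max(a, b) for a, b in zip(child_a[0], child_b[0])]
parent_min = [min(a, b) for a, b in zip(child_a[1], child_b[1])]
s_a = score(*child_a, v_pos, v_neg)
s_b = score(*child_b, v_pos, v_neg)
s_p = score(parent_max, parent_min, v_pos, v_neg)
assert s_p >= max(s_a, s_b)
```

The assertion holds for any choice of vectors, since each term of the parent's score dominates the corresponding term of either child's score.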
The search process is given in Algorithm 2. In Algorithm 2, we use RList to store the top-k files that have the k largest relevance scores to the query. RList is initialized to an empty list and is updated whenever a more relevant file is found. The k-th score is defined as the smallest relevance score in the current RList and is initialized to a very small integer. By using the k-th score, we can accelerate the search process by ignoring paths with low scores. An illustration of the search process is given in Example 1 and Figure 4, where F = {f_{1}, f_{2}, …, f_{6}} and the vector dimension d is 3.
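A minimal sketch of the pruned top-k traversal follows. It keeps RList as a min-heap so that the k-th score is always at the heap root; the node layout and names are illustrative, not the paper's exact Algorithm 2:

```python
# Pruned top-k search: descend into the higher-scoring child first and
# skip any subtree whose root scores below the current k-th score.
import heapq

def node(v_max, v_min, left=None, right=None, fid=None):
    return {"v_max": v_max, "v_min": v_min,
            "left": left, "right": right, "fid": fid}

def score(n, v_pos, v_neg):
    return (sum(a * b for a, b in zip(n["v_max"], v_pos))
            + sum(a * b for a, b in zip(n["v_min"], v_neg)))

def search(n, v_pos, v_neg, k, rlist):
    """rlist is a min-heap of (score, fid) holding the current top-k."""
    if n is None:
        return
    s = score(n, v_pos, v_neg)
    if len(rlist) == k and s <= rlist[0][0]:
        return                      # prune: subtree cannot beat the k-th score
    if n["fid"] is not None:        # leaf node
        if len(rlist) < k:
            heapq.heappush(rlist, (s, n["fid"]))
        elif s > rlist[0][0]:
            heapq.heapreplace(rlist, (s, n["fid"]))
        return
    kids = sorted([n["left"], n["right"]],
                  key=lambda c: score(c, v_pos, v_neg), reverse=True)
    for c in kids:
        search(c, v_pos, v_neg, k, rlist)

# Tiny example: two leaves under a root built with M1/M2.
leaf1 = node([0.9, 0.1], [0.9, 0.1], fid="f1")
leaf2 = node([0.1, 0.8], [0.1, 0.8], fid="f2")
root = node([0.9, 0.8], [0.1, 0.1], left=leaf1, right=leaf2)
rlist = []
search(root, [1.0, 0.0], [0.0, 0.0], k=1, rlist=rlist)
top = [fid for _, fid in sorted(rlist, reverse=True)]
```

Descending into the higher-scoring child first makes the k-th score rise quickly, which in turn prunes more of the remaining tree.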

3.1.3. Example 1
An example of an index tree and a search process on this tree is illustrated in Figure 4. In Figure 4, we show an index tree for F = {f_{1}, f_{2}, …, f_{6}} in which the dimension of the vectors of each node is 3. For each node u in the tree, the upper vector and the lower vector correspond to V_{u}^{max} and V_{u}^{min}, respectively. In the tree building process, we first generate the leaf nodes from F and then create the internal nodes based on these leaf nodes.
Moreover, Figure 4 also gives an illustration of the search process. In Figure 4, we take a query vector V_{Q} and split it into V_{Q}^{pos} and V_{Q}^{neg}. We suppose that the top-3 files will be returned to the data user. According to Algorithm 2, the search process begins with the root node r and calculates the scores between the query Q and the two child nodes r_{11} and r_{12} of r by using equation (3).
Because the score between r_{11} and Q is higher than that between r_{12} and Q, Algorithm 2 traverses the subtree with r_{11} as the root node and computes the scores between the query Q and the two child nodes of r_{11}. Since the score between r_{21} and Q is higher than that between r_{22} and Q, Algorithm 2 traverses the subtree with r_{21} as the root node and adds the leaf nodes f_{1}, f_{2} to RList. After this, the subtree with r_{22} as the root node is traversed, and the leaf nodes f_{3} and f_{4} are reached. Since the number of files in RList is less than 3, f_{3} is added to RList directly. For the file f_{4}, since the number of files in RList now equals 3, Algorithm 2 compares the score between f_{4} and Q to the minimum score in RList. Because the score between f_{4} and Q is smaller than the minimum score in RList, f_{4} is not added to RList. At this point, the subtree with r_{11} as the root node has been traversed, so Algorithm 2 turns to the subtree with r_{12} as the root node. As the score between r_{12} and Q is smaller than the minimum score in RList, which means that the scores of all descendant nodes of r_{12} are smaller than the minimum score in RList (this property is described in Section 3.1.2), f_{5} and f_{6} will not be checked. Therefore, Algorithm 2 outputs RList = {f_{1}, f_{2}, f_{3}}.
3.2. Construction of SSEDMKRS
In this section, by combining the secure KNN algorithm [32] with the index tree building algorithm, we propose a concrete SSE-DMKRS scheme. The SSE-DMKRS scheme consists of five algorithms. The algorithms KeyGen, DictionaryBuild, and IndexBuild are executed by the data owners, while the algorithms TrapdoorGen and Search are performed by the data users and the cloud server, respectively:
(i) KeyGen (1^{λ}): given a security parameter λ, this algorithm first randomly chooses four invertible matrices M_{1}, M_{2}, M_{3}, M_{4} ∈ R^{d×d}, where d is the dimension of V_{u}^{max} and V_{u}^{min}. Then, it randomly generates a d-bit vector S. Finally, it outputs the secret key sk = {S, M_{1}, M_{2}, M_{3}, M_{4}}.
(ii) DictionaryBuild (F): given the document set F = {f_{1}, f_{2}, …, f_{n}}, the algorithm runs “Word2vec” to generate the dictionary D of F. In the dictionary D, each keyword is associated with a vector representation. Besides, each keyword also corresponds to a set of semantically related keywords.
(iii) IndexBuild (sk, F, D): given the document set F and the dictionary D for F, the algorithm first creates the index tree T by using the algorithm BuildIndexTree (F, D) (Algorithm 1). Then, for each node u in the tree T, the algorithm generates two random vector pairs {V_{u}^{max′}, V_{u}^{max″}} and {V_{u}^{min′}, V_{u}^{min″}} for the vectors V_{u}^{max} and V_{u}^{min}, respectively. More precisely, if S[i] = 0, it sets V_{u}^{max′}[i] = V_{u}^{max″}[i] = V_{u}^{max}[i] and V_{u}^{min′}[i] = V_{u}^{min″}[i] = V_{u}^{min}[i]; if S[i] = 1, the four values are set randomly under the constraints V_{u}^{max′}[i] + V_{u}^{max″}[i] = V_{u}^{max}[i] and V_{u}^{min′}[i] + V_{u}^{min″}[i] = V_{u}^{min}[i].
Finally, for each node u, it computes I_{u} = {M_{1}^{T}V_{u}^{max′}, M_{2}^{T}V_{u}^{max″}, M_{3}^{T}V_{u}^{min′}, M_{4}^{T}V_{u}^{min″}}. By replacing the plaintext vectors V_{u}^{max} and V_{u}^{min} with the encrypted index I_{u}, an encrypted index tree I_{T} is created.
(iv) TrapdoorGen (sk, Q): given a query keyword set Q, the algorithm first extends Q to a new semantic keyword set Q′. The process is as follows:
(a) It generates a new keyword set Q′, which is initialized to an empty set.
(b) Note that each keyword in the dictionary is associated with a group of keywords semantically related to it. For each keyword q in Q, the algorithm randomly chooses k′ semantic keywords based on the dictionary and inserts these keywords into Q′, where k′ is chosen dynamically.
Then, based on Q′, the TrapdoorGen algorithm generates a pair of vectors V_{Q}^{pos} and V_{Q}^{neg} by adopting the method M_{3}. After this, it generates two random vector pairs {V_{Q}^{pos′}, V_{Q}^{pos″}} and {V_{Q}^{neg′}, V_{Q}^{neg″}} for the vectors V_{Q}^{pos} and V_{Q}^{neg}, respectively. This process is similar to the splitting process in the IndexBuild algorithm, except that the roles of S[i] = 0 and S[i] = 1 are exchanged: if S[i] = 1, it sets V_{Q}^{pos′}[i] = V_{Q}^{pos″}[i] = V_{Q}^{pos}[i] and V_{Q}^{neg′}[i] = V_{Q}^{neg″}[i] = V_{Q}^{neg}[i]; if S[i] = 0, the values are set randomly under the constraints V_{Q}^{pos′}[i] + V_{Q}^{pos″}[i] = V_{Q}^{pos}[i] and V_{Q}^{neg′}[i] + V_{Q}^{neg″}[i] = V_{Q}^{neg}[i].
Finally, this algorithm outputs T_{Q} = {M_{1}^{−1}V_{Q}^{pos′}, M_{2}^{−1}V_{Q}^{pos″}, M_{3}^{−1}V_{Q}^{neg′}, M_{4}^{−1}V_{Q}^{neg″}} as the trapdoor for Q.
(v) Search (sk, T_{Q}, I_{T}): for each node u in I_{T}, the algorithm computes
I_{u} · T_{Q} = V_{u}^{max′} · V_{Q}^{pos′} + V_{u}^{max″} · V_{Q}^{pos″} + V_{u}^{min′} · V_{Q}^{neg′} + V_{u}^{min″} · V_{Q}^{neg″} = V_{u}^{max} · V_{Q}^{pos} + V_{u}^{min} · V_{Q}^{neg}.
According to equation (3), the relevance score calculated from the encrypted vector I_{u} and the trapdoor T_{Q} equals the value of Score (u, Q). By using this property, the algorithm can utilize the SearchIndexTree algorithm (Algorithm 2) to perform ranked search.
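The score-preserving property of the secure KNN encryption can be checked numerically. In the sketch below, the four invertible matrices are taken to be diagonal so that their transposes and inverses are trivial to apply in pure Python; the real scheme uses random dense invertible matrices, and all vector values are illustrative:

```python
# Pure-Python sketch of the secure kNN encryption step. Diagonal
# matrices stand in for the random dense invertible matrices of the
# actual scheme (they keep the algebra identical but are NOT secure).
import random

random.seed(7)
d = 4
S = [0, 1, 1, 0]                          # random d-bit split indicator
diags = [[random.uniform(1, 2) for _ in range(d)] for _ in range(4)]

def mat_T_vec(diag, v):   # M^T v  (diagonal: element-wise product)
    return [m * x for m, x in zip(diag, v)]

def mat_inv_vec(diag, v): # M^{-1} v
    return [x / m for m, x in zip(diag, v)]

def split_index(v):       # S[i]=0: copies; S[i]=1: random values summing to v[i]
    v1, v2 = [], []
    for s, x in zip(S, v):
        r = random.random()
        v1.append(x if s == 0 else r)
        v2.append(x if s == 0 else x - r)
    return v1, v2

def split_trapdoor(v):    # opposite rule on the trapdoor side
    v1, v2 = [], []
    for s, x in zip(S, v):
        r = random.random()
        v1.append(x if s == 1 else r)
        v2.append(x if s == 1 else x - r)
    return v1, v2

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

v_max, v_min = [0.5, -0.2, 0.7, 0.1], [0.3, -0.4, 0.2, 0.0]
v_pos, v_neg = [0.6, 0.0, 0.4, 0.2], [0.0, -0.5, 0.0, 0.0]

a1, a2 = split_index(v_max); b1, b2 = split_index(v_min)
I_u = [mat_T_vec(diags[0], a1), mat_T_vec(diags[1], a2),
       mat_T_vec(diags[2], b1), mat_T_vec(diags[3], b2)]

p1, p2 = split_trapdoor(v_pos); q1, q2 = split_trapdoor(v_neg)
T_Q = [mat_inv_vec(diags[0], p1), mat_inv_vec(diags[1], p2),
       mat_inv_vec(diags[2], q1), mat_inv_vec(diags[3], q2)]

encrypted_score = sum(dot(i, t) for i, t in zip(I_u, T_Q))
plain_score = dot(v_max, v_pos) + dot(v_min, v_neg)
assert abs(encrypted_score - plain_score) < 1e-9
```

The final assertion reflects the identity I_{u} · T_{Q} = V_{u}^{max} · V_{Q}^{pos} + V_{u}^{min} · V_{Q}^{neg} of equation (3): the matrices cancel pairwise, and the opposite splitting rules on the two sides make the random shares cancel dimension by dimension.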
3.3. Dynamic Update Operations
Besides the search operation, the proposed scheme also supports dynamic operations, e.g., document insertion and deletion, satisfying the requirements of real-world applications. Because the proposed scheme is built over a balanced binary tree, the update operations are realized by modifying the nodes in the tree. Inspired by the update method introduced in [14, 15], the update algorithms are presented as follows:
(i) UpdateInfoGen (sk, T_{s}, f_{i}, Utype): this algorithm is executed by the data owners and generates the update information {I_{s}, c_{i}} for the cloud server, where T_{s} is a set containing all the update nodes, I_{s} is an encrypted form of T_{s}, f_{i} is the target document, c_{i} is an encrypted form of f_{i}, and Utype is the update type. In order to reduce the communication cost, the data owners store the unencrypted index tree on their own devices. For Utype ∈ {Ins, Del}, the algorithm works as follows:
(a) If Utype = “Del,” the algorithm deletes the document f_{i} from the tree. The algorithm first finds the leaf node associated with the document f_{i} and deletes it. In addition, the internal nodes associated with this leaf node are also added to T_{s}. Specifically, if the deletion operation would break the balance of the index tree, the algorithm can set the target leaf node as a fake node instead of removing it. After this, the algorithm encrypts T_{s} to generate I_{s}. Finally, the algorithm sends I_{s} to the cloud server and sets c_{i} as null.
(b) If Utype = “Ins,” the algorithm inserts the document f_{i} into the tree. The algorithm first creates a leaf node for f_{i} according to the method M_{1} introduced in Section 3.1 and inserts this leaf node into T_{s}. Then, based on the method M_{2}, the algorithm updates the vectors of the internal nodes on the path from the root to the new leaf node and inserts these internal nodes into T_{s}.
Here, the algorithm prefers to replace a fake leaf node with the new leaf node rather than insert a new leaf node. Finally, the algorithm encrypts T_{s} and f_{i} to generate I_{s} and c_{i}, respectively, and sends them to the cloud server.
(ii) Update (I_{T}, C, I_{s}, c_{i}, Utype): this algorithm is executed by the cloud server to update the index tree I_{T} with the encrypted node set I_{s}. After this, if Utype = “Del,” the algorithm removes c_{i} from C; otherwise, the algorithm inserts c_{i} into C.
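The index-side bookkeeping of an insertion can be sketched as follows: after a fake leaf is replaced by the new leaf, the V^{max}/V^{min} vectors of every ancestor on the path to the root are refreshed with method M_{2}. The node layout is illustrative:

```python
# Sketch of the path update after inserting a document: replace a fake
# placeholder leaf, then re-apply M2 upward along the root path.
class Node:
    def __init__(self, v_max, v_min, parent=None):
        self.v_max, self.v_min = v_max, v_min
        self.parent = parent
        self.left = self.right = None

def refresh_ancestors(leaf):
    """Walk from the new leaf up to the root, re-applying M2."""
    n = leaf.parent
    while n is not None:
        n.v_max = [max(a, b) for a, b in zip(n.left.v_max, n.right.v_max)]
        n.v_min = [min(a, b) for a, b in zip(n.left.v_min, n.right.v_min)]
        n = n.parent

# Replace a fake leaf under the root with a real one and update upward.
root = Node([0.5, 0.5], [0.1, 0.1])
old = Node([0.5, 0.5], [0.1, 0.1], parent=root)   # existing real leaf
fake = Node([0.0, 0.0], [0.0, 0.0], parent=root)  # fake placeholder leaf
root.left, root.right = old, fake
new = Node([0.9, -0.2], [0.9, -0.2], parent=root) # inserted document (M1)
root.right = new                                  # fake leaf replaced
refresh_ancestors(new)
```

Only the O(log n) nodes on the root path change, which is why the update information T_{s} stays small.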
Note that, after a period of insertion and deletion operations, the number of keywords in the dictionary may change. Because the dimensions of the index and trapdoor vectors in the previous schemes are linear in the number of keywords in the dictionary, these schemes have to rebuild the search index tree. By contrast, our scheme is not affected by this problem. In the proposed scheme, the dimensions of the vectors in the index and trapdoor are determined by the “Word2vec” tool and set by the users. For example, if we set the dimension of the word vectors to 200, the dimension of each keyword’s vector is 200, and thus the dimensions of V_{u}^{max}, V_{u}^{min}, V_{Q}^{pos}, and V_{Q}^{neg} are all 200. According to the above analysis, our scheme is more suitable for update operations than the previous schemes.
3.4. Security Analysis
In this section, we analyse the security of the proposed SSE-DMKRS scheme according to the privacy requirements introduced in Section 2.3:
(1) Index and Trapdoor Privacy. In the proposed scheme, each node u in the index tree and the query Q in the trapdoor are encrypted by using the secure KNN algorithm introduced in [32]. Thus, the attackers cannot obtain the original vectors in the tree nodes and the query, which means that the index and trapdoor privacy are well protected.
(2) Trapdoor Unlinkability. In the trapdoor generation phase, the query vector is split randomly. Moreover, the same keyword set Q can be extended to multiple different semantic keyword sets Q′. So, the same query Q will be encrypted into different trapdoors, which means that the goal of trapdoor unlinkability is achieved.
(3) Keyword Privacy. Since the index and the trapdoor are protected by the secure KNN algorithm, the adversary cannot infer the plaintext information from the index and the trapdoor under the known ciphertext model. Considering that the known background model is common in real-world applications, we also analyse the security of the proposed scheme under this model. In the TrapdoorGen algorithm, the original query keyword set Q is extended to a new set Q′. Specifically, for each keyword q in Q, the algorithm randomly chooses a number k′, then chooses k′ semantic keywords related to q by utilizing the dictionary, and inserts these keywords into Q′. Suppose that each keyword is associated with N semantic keywords in the dictionary; then each keyword can generate 2^{N} different keyword sets, since each semantic keyword can be chosen or not. For example, if a keyword q is associated with three semantic keywords {q_{1}, q_{2}, q_{3}}, then q can generate the 2^{3} keyword sets {q}, {q, q_{1}}, {q, q_{2}}, {q, q_{3}}, {q, q_{1}, q_{2}}, {q, q_{1}, q_{3}}, {q, q_{2}, q_{3}}, and {q, q_{1}, q_{2}, q_{3}}.
Since the query Q usually contains more than one keyword, Q generates far more than 2^{N} different semantic keyword sets. In this way, the final similarity score is obfuscated by the random semantic keyword sets. As shown by the analysis in [14, 15], our scheme protects keyword privacy under the known background model.
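The enumeration of the 2^{N} extended sets for a single keyword can be sketched as follows (illustrative only; the function name is our own):

```python
from itertools import combinations

def semantic_keyword_sets(q, related):
    """All 2^N extended keyword sets for a query keyword q, where `related`
    holds the N semantic keywords associated with q in the dictionary."""
    sets_ = []
    for r in range(len(related) + 1):          # choose 0, 1, ..., N of them
        for combo in combinations(related, r):
            sets_.append({q} | set(combo))     # q itself is always kept
    return sets_

sets_ = semantic_keyword_sets("q", ["q1", "q2", "q3"])
print(len(sets_))  # 2^3 = 8 possible extended sets
```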
4. Performance Analysis
In this section, we evaluate the proposed SSEDMKRS scheme both theoretically and experimentally. A detailed experiment demonstrates that our scheme can efficiently perform dynamic ranked keyword search over encrypted data. The experiment runs on an Intel® Core™ i7 CPU at 2.90 GHz with 16 GB of memory and uses a real-world email corpus, the Enron email dataset [35]. We analyse the performance of our scheme in two respects: (1) the efficiency of the proposed scheme, including index building, trapdoor generation, search, and update; (2) the relationship between search precision and privacy level. Moreover, to show the advantages of our scheme, we compare it with two closely related previous schemes. For simplicity, we denote the schemes introduced in [14, 15] by X15 and G19, respectively.
4.1. Efficiency
4.1.1. Index Building
The process of index building mainly consists of two steps: (1) creating an unencrypted index tree using Algorithm 1; (2) encrypting each node in the tree with the secure KNN scheme. In the tree building step, Algorithm 1 generates O(n) nodes based on the document set F. Because each node holds two vectors whose dimensions are both d, the vector splitting process takes O(d) time and the matrix multiplication operations take O(d^{2}) time in the encryption step. Combining these two steps, the whole time complexity of index building is O(nd^{2}), so the time cost of index building mainly depends on the number of documents in F and the dimension of each node's vector.
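The encryption step can be sketched with the secure kNN technique of [32] (a hedged numpy sketch following the standard description of that algorithm; the names S, M1, M2 and all concrete values are illustrative): vectors are split by a secret bit vector and rotated by secret matrices, yet inner products, i.e., relevance scores, are preserved.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
S = rng.integers(0, 2, d)                # secret bit vector
M1 = rng.standard_normal((d, d))         # secret invertible matrices
M2 = rng.standard_normal((d, d))         # (random matrices: invertible w.p. 1)

def enc_index(p):
    """Split p according to S, then multiply by M1^T, M2^T: O(d^2) per node."""
    p1, p2 = p.copy(), p.copy()
    r = rng.standard_normal(d)
    mask = (S == 0)                       # split randomly where S[j] = 0
    p1[mask], p2[mask] = r[mask], p[mask] - r[mask]
    return M1.T @ p1, M2.T @ p2

def enc_query(q):
    """Split q the opposite way, then multiply by the inverse matrices."""
    q1, q2 = q.copy(), q.copy()
    r = rng.standard_normal(d)
    mask = (S == 1)                       # split randomly where S[j] = 1
    q1[mask], q2[mask] = r[mask], q[mask] - r[mask]
    return np.linalg.inv(M1) @ q1, np.linalg.inv(M2) @ q2

p, q = rng.standard_normal(d), rng.standard_normal(d)
i1, i2 = enc_index(p)
t1, t2 = enc_query(q)
score = i1 @ t1 + i2 @ t2                 # equals p . q (the relevance score)
print(np.isclose(score, p @ q))           # True
```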
Since the dimension of each node's vector in X15 and G19 is linear in the number of keywords in the dictionary (m), the time costs of index building in X15 and G19 are both O(nm^{2}). Because d is much smaller than m, the time cost of index building in our scheme is much less than that in X15 and G19. In addition, in G19 the internal nodes are constructed with a Bloom filter, so the dimension of each internal node's vector is linear in the filter size b. Since b is usually smaller than m, the index building time of G19 is less than that of X15.
Figure 6(a) shows that the time cost of index building in our scheme is much less than that in X15 and G19. More precisely, when n = 1000, m = 20000, d = 1000, and b = 10000, the time consumption of index building in X15 and G19 is roughly 100 to 200 times that in our scheme. As m increases, the advantage of our scheme becomes even more significant.
Figure 6: Time costs of (a) index building, (b) trapdoor generation, (c) search, and (d) update.
In addition, because the index tree has O(n) nodes and each node holds two d-dimensional vectors, the space complexity of the index tree is O(nd). By contrast, the space complexities of the index trees in X15 and G19 are both O(nm). As Table 3 shows, even with n = 1000, m = 20000, d = 1000, and b = 10000, the storage cost of the index tree in our scheme is much less than that in X15 and G19.
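A back-of-envelope computation under the experiment's parameters illustrates the gap (our own storage assumptions, not exact figures from Table 3: 8-byte floats, about 2n nodes in a balanced binary tree, and two vectors per node):

```python
# Rough index-tree storage comparison: O(nd) for our scheme vs O(nm) for X15.
n, m, d = 1000, 20000, 1000
nodes = 2 * n - 1                  # n leaves plus n - 1 internal nodes
ours = nodes * 2 * d * 8           # two d-dim vectors per node, 8 bytes each
x15  = nodes * 2 * m * 8           # vector length linear in dictionary size m
print(ours / 2**20, x15 / 2**20)   # ~30 MB vs ~610 MB
```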
4.1.2. Trapdoor Generation
In our scheme, the query is converted into two vectors whose dimensions are both d, and trapdoor generation multiplies these two vectors by the matrices in the key. Thus, the time complexity of trapdoor generation in our scheme is O(d^{2}). By contrast, since the dimensions of the query vectors in X15 and G19 are both m, their trapdoor generation time complexities are both O(m^{2}). Hence, the time cost of trapdoor generation in our scheme is much less than that in X15 and G19. In particular, as Figure 6(b) shows, when n = 1000, m = 20000, and d = 1000, the time cost of trapdoor generation in our scheme is 1.5 ms, while that in G19 and X15 is 287 ms and 290 ms, respectively.
4.1.3. Search
In the search process, if the relevance score between an internal node u and the query Q is less than the minimum relevance score of the current top-k documents, the subtree rooted at u is not accessed. Thus, not all nodes in the tree are visited during a search. Suppose that θ leaf nodes contain at least one keyword of the query Q. Since the height of the tree is O(log n) and a relevance score calculation takes O(d) time, the time complexity of the search process is O(θd log n). In X15, a relevance score calculation takes O(m) time, so the search complexity is O(θm log n). In G19, each internal node contains a Bloom filter of size b and each leaf node holds a vector of size m, so the search complexity is O(θ(b log n + m)). As Figure 6(c) shows, when n = 1000, m = 20000, d = 1000, and b = 10000, the search time of our scheme is 36 ms, while that of G19 and X15 is 135 ms and 214 ms, respectively.
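The pruned search can be sketched as follows (an illustrative toy, not the paper's exact algorithm: each internal node stores the element-wise maximum of its children's vectors, which upper-bounds the scores below it when the query weights are nonnegative, as with TF-IDF weights):

```python
import heapq

class Node:
    def __init__(self, vec, left=None, right=None, doc_id=None):
        self.vec, self.left, self.right, self.doc_id = vec, left, right, doc_id

def build(leaves):
    """Bottom-up balanced tree; a parent's vector is the element-wise max
    of its children's vectors, so it bounds every leaf score beneath it."""
    level = [Node(v, doc_id=i) for i, v in enumerate(leaves)]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            a, b = level[i], level[i + 1]
            nxt.append(Node([max(x, y) for x, y in zip(a.vec, b.vec)], a, b))
        if len(level) % 2:
            nxt.append(level[-1])       # odd node carried up unchanged
        level = nxt
    return level[0]

def search(root, q, k):
    """Depth-first search with pruning; returns top-k (score, doc_id) pairs."""
    best = []  # min-heap holding the current top-k scores

    def score(v):
        return sum(x * y for x, y in zip(v, q))

    def visit(u):
        if u is None:
            return
        s = score(u.vec)
        if len(best) == k and s <= best[0][0]:
            return                      # prune: subtree cannot improve top-k
        if u.doc_id is not None:        # leaf node: a real document
            heapq.heappush(best, (s, u.doc_id))
            if len(best) > k:
                heapq.heappop(best)
        else:
            visit(u.left); visit(u.right)

    visit(root)
    return sorted(best, reverse=True)

docs = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5], [0.9, 0.1]]
print(search(build(docs), [1.0, 0.0], k=2))  # [(0.9, 3), (0.8, 1)]
```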
4.1.4. Update
When the data owners insert or delete a document, they not only insert or delete a leaf node but also update O(log n) internal nodes. Since the encryption time per node is O(d^{2}), the time complexity of an update operation is O(d^{2} log n). In X15, the encryption time per node is O(m^{2}), so an update takes O(m^{2} log n). In G19, the internal nodes are Bloom filters that are not encrypted, so the cost of updating the internal nodes is negligible and the update complexity is O(m^{2}), since only the leaf node is encrypted. As Figure 6(d) shows, when n = 1000, m = 20000, d = 1000, and b = 10000, the time cost of updating one document in our scheme is 16 ms, while that in X15 and G19 is 1020 ms and 107 ms, respectively.
4.2. Precision and Privacy
The search precision of our scheme is affected by the group of semantic keywords related to the original index and query keywords. We measure our scheme with the metric "precision" defined in [12]: precision = k′/k, where k′ is the number of real top-k documents among the retrieved k documents.
In addition, the semantic keywords in the index and query keyword sets disturb the relevance score calculation during search, which makes it harder for adversaries to identify keywords in the index and trapdoor from statistical information about the dataset. To measure the degree of this disturbance, we use the "rank privacy" metric introduced in [12]: rank privacy = (Σ_{i=1}^{k} |r_{i} − r′_{i}|)/k^{2}, where r_{i} is the rank of document i in the retrieved top-k list and r′_{i} is document i's rank in the real ranked results.
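The two metrics can be computed as follows (a toy example; the ranked lists and rank values are made up for illustration):

```python
def precision(retrieved, real_topk):
    """Fraction of the retrieved k documents that are real top-k documents."""
    return len(set(retrieved) & set(real_topk)) / len(retrieved)

def rank_privacy(retrieved, real_rank):
    """sum(|r_i - r_i'|) / k^2 over the retrieved top-k list."""
    k = len(retrieved)
    return sum(abs((i + 1) - real_rank[d]) for i, d in enumerate(retrieved)) / k**2

real = ["d1", "d2", "d3", "d4", "d5"]            # true top-5 ranking
got  = ["d2", "d1", "d3", "d7", "d4"]            # retrieved top-5 list
real_rank = {"d1": 1, "d2": 2, "d3": 3, "d4": 4, "d5": 5, "d7": 7}

print(precision(got, real))          # 4/5 = 0.8
print(rank_privacy(got, real_rank))  # (1+1+0+3+1)/25 = 0.24
```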
We compare our scheme with X15 and G19 in terms of precision and rank privacy. Note that an important parameter in those two schemes is a standard deviation σ, which adjusts the relevance scores of the dummy keywords. In the comparison, we set σ = 0.05, a value commonly used in the previous schemes. Besides, in our scheme, we set the number of semantic keywords for each keyword in the dictionary to 100 and the dimension of each node's vector to 1000 (d = 1000). The comparison under these settings is illustrated in Figure 7.
Figure 7: Comparison of (a) precision and (b) rank privacy among the three schemes.
As Figure 7 shows, as k grows from 10 to 50, the precision of our scheme decreases slightly from 59% to 55%, and the rank privacy increases slightly from 26% to 28%. For X15 and G19, the precision likewise decreases and the rank privacy increases as k grows; this trend holds for all three schemes. Because the vector representations of the index tree and the query in our scheme are highly compressed, some statistical information in the index and the query is lost. Thus, the precision of our scheme is lower than that of X15 and G19, while its rank privacy is correspondingly higher.
4.3. Impact of the Dimension of Vector Representation
The dimension of the vector representation (d), which we set in "Word2vec", is an important parameter of our scheme. We now discuss its impact. Figure 8 shows the impact of d on the efficiency of our scheme: the time costs of index building, trapdoor generation, search, and update all increase as d grows. Figure 9 illustrates the impact of d on precision and rank privacy: as d increases from 200 to 1000, the precision of our scheme increases slightly, while the rank privacy gradually decreases. These observations are consistent with our earlier theoretical analysis. Hence, in the proposed scheme, data users can balance efficiency and accuracy by adjusting the parameter d to satisfy the requirements of different applications.
Figure 8: Impact of d on the efficiency of our scheme.
Figure 9: Impact of d on the precision and rank privacy of our scheme.
4.4. Discussion
From the experimental results, when n = 1000, m = 20000, d = 200, and b = 10000, the index building time is 3 s, the generation time of a single trapdoor is 1.5 ms, and the search time is 36 ms, all of which are much better than those of the previous schemes X15 and G19. This efficiency demonstrates that our scheme is well suited to practical applications, especially mobile cloud settings in which clients have limited computation and storage resources.
The experimental results also show that the precision of our scheme is lower than that of the previous two schemes, while its rank privacy is correspondingly higher. In addition, thanks to the "Word2vec" method, the vector representations used in our scheme capture the semantic information of the documents and queries. Based on these facts, we argue that the proposed scheme is suitable for applications requiring similarity and semantic search, such as mobile recommendation systems, mobile search engines, and online shopping systems.
5. Conclusions
In this paper, by applying "Word2vec" to construct the vector representations of documents and queries and adopting a balanced binary tree to index the documents, we proposed a searchable symmetric encryption scheme supporting dynamic multikeyword ranked search. Compared with previous schemes, ours tremendously reduces the time costs of index building, trapdoor generation, search, and update, and it also significantly reduces the storage cost of the secure index. Since the precision of our scheme can be further improved, in future work we will construct a more accurate scheme based on recent information retrieval techniques.
Data Availability
The data used to support the findings of this study are available from the following website: http://www.cs.cmu.edu/~enron/.
Conflicts of Interest
The authors declare that they have no conflicts of interest regarding the publication of this paper.
Acknowledgments
The authors gratefully acknowledge the support of the National Natural Science Foundation of China under Grants nos. 61402393 and 61601396 and the Nanhu Scholars Program for Young Scholars of XYNU.