Abstract
Searchable symmetric encryption that supports dynamic multi-keyword ranked search (SSE-DMKRS) has been intensively studied during recent years. Such a scheme allows data users to dynamically update documents and retrieve the most wanted documents efficiently. Previous schemes suffer from high computational costs, since their time and space complexities are linear in the size of the dictionary generated from the dataset. In this paper, by utilizing a shallow neural network model called “Word2vec” together with a balanced binary tree structure, we propose a highly efficient SSE-DMKRS scheme. The “Word2vec” tool can effectively convert the documents and queries into a group of vectors whose dimensions are much smaller than the size of the dictionary. As a result, we can significantly reduce the related space and time costs. Moreover, with the use of the tree-based index, our scheme can achieve sublinear search time and support dynamic operations such as insertion and deletion. Both theoretical and experimental analyses demonstrate that our scheme surpasses previous schemes of the same kind in efficiency, giving it wide application prospects in the real world.
1. Introduction
Nowadays, with the development of network and virtualization technology, cloud computing has developed rapidly. Through cloud services, enterprises and individuals can obtain better computing and storage services at a lower cost. Since cloud servers are not entirely trusted, utilizing cloud services while maintaining data privacy is an essential concern. A straightforward way to address this issue is to encrypt the data before outsourcing it to the cloud servers. However, this approach fails to meet the requirement of data retrieval, since traditional encryption scrambles the original data and makes it inconvenient to utilize. In this scenario, users have to download all the ciphertext data and decrypt them locally, which brings huge transmission, storage, and computation overhead and is therefore not applicable in the cloud environment.
Searchable encryption (SE) can support keyword search without decrypting the data, and thus it is very suitable for achieving keyword search over ciphertext. In an SE scheme, data owners and authorized users share a secret key. Data owners can encrypt the sensitive data and upload them to the cloud server. If data users want to search the encrypted data, they can generate an encrypted trapdoor by using the query and the secret key. When the cloud server receives the trapdoor, it tests the trapdoor against the encrypted data without decrypting them and returns the data related to the query to the users. The first searchable symmetric encryption (SSE) scheme was proposed by Song et al. [1]. This scheme can not only encrypt data but also provide a search mechanism over the encrypted data. With the improvement of the security and efficiency of SSE schemes, SSE has attracted wide attention from the community. During recent years, researchers have focused on how to construct solutions with complex query functions, such as multi-keyword search [2–7], similarity search [8, 9], and ranked search [10–15]. In particular, ranked search schemes can sort the query results according to the degree of relevance between the documents and the query and only return the most related (top-k) documents. Thus, ranked schemes can significantly reduce the computation and storage costs.
The earliest ranked search schemes were proposed in [10, 11], but they only support single-keyword search. The first SSE scheme supporting multi-keyword ranked search was given by Cao et al. [12]. The score evaluation method used in their scheme is the inner product between the query and document vectors. In their scheme, since each document has its own vector representation, the search time is linear in the number of documents in the dataset, which leads to a very high overhead in a big data environment. Then, Sun et al. [13] gave a similar scheme with better-than-linear search efficiency by using a tree-based index [16, 17]. They adopt the technique of term frequency-inverse document frequency (TF-IDF) to evaluate the score between the index and queries. To further improve the search efficiency and support dynamic updates, Xia et al. proposed an efficient SSE scheme supporting dynamic multi-keyword ranked search [14]. In their scheme, they construct a tree-based structure and propose a parallel search algorithm to accelerate the search process. Moreover, they also provide a dynamic update method to cope with the deletion and insertion of documents flexibly. Recently, by utilizing the Bloom filter [18], Guo et al. constructed an efficient SSE scheme supporting dynamic multi-keyword ranked search [15] to further improve the efficiency of keyword search and index construction. Owing to the Bloom filter, the internal nodes in the index tree do not need to be encrypted, and the dimension of the vectors in the internal nodes is also reduced. As a result, this scheme can achieve better performance than the previous similar schemes.
Another kind of SE is called searchable public key encryption (SPE), which is established on the public key system. In SSE, the key for encrypting data is the same as the key for generating search trapdoors. By contrast, in SPE, the public key for encrypting data is open to the public, while the secret key for generating search trapdoors is only given to the authorized data receivers. The very first SPE scheme supporting keyword search was introduced by Boneh et al., and it is the so-called public key encryption with keyword search (PEKS) [19]. However, their work only supports single-keyword search. In order to support more expressive queries, many SPE schemes [20–23] were proposed to realize advanced search, for example, conjunctive, disjunctive, and Boolean keyword search. By using a special hidden structure, Xu et al. proposed two SPE schemes supporting single-keyword search [24, 25] whose search performance is very close to that of a practical SSE scheme. By converting an attribute-based encryption scheme, Han et al. proposed an SPE scheme which can control a user’s search permission according to an access control policy [26]. After this, Kai et al. proposed an SPE scheme achieving both Boolean keyword search and fine-grained search permission [27]. Sepehri et al. proposed a scalable proxy-based protocol for privacy-preserving queries, which allows authorized users to perform queries over data encrypted with different keys [28]. Later, by utilizing an ElGamal elliptic curve encryption system, Sepehri et al. gave a similar scheme with better efficiency [29]. In order to improve search accuracy, Zhang et al. proposed an SPE scheme supporting semantic keyword search by adopting a method called “Word2vec” [30]. For the sake of brevity, we summarize some SPE and SSE schemes in Table 1, which describes the differences between our scheme and previous schemes.
1.1. Motivation
The previous ranked search schemes in the symmetric key setting are secure and somewhat efficient. However, the index building, trapdoor generation, and search times are all related to the size of the dictionary generated from the dataset, which is not suitable for the big data environment. According to the statistical information given in [20], the vocabulary size in a dataset is commonly on the order of 10^{6}. Therefore, it is necessary to construct a more efficient ranked search scheme. Motivated by this, in this paper, we aim to construct a novel SSE scheme supporting dynamic multi-keyword ranked search (SSE-DMKRS) with high efficiency.
1.2. Contributions
The main contributions are summarized as follows:
(1) Based on the “Word2vec” [31] technique, we propose a novel method that converts the documents and queries into vector representations. The dimension of the vector representation obtained by our method is nearly 10% of that in the previous SSE-DMKRS schemes [14, 15].
(2) We propose an efficient index building algorithm that creates a balanced binary tree to index all the documents. The obtained index tree achieves sublinear search time and supports dynamic update operations.
(3) By applying the secure k-nearest neighbour (KNN) scheme [32] to encrypt the index tree and the query, we propose an efficient SSE-DMKRS scheme.
In addition, we implement our scheme on a widely used data collection. The experimental results show that our scheme dramatically reduces the time cost of index building, trapdoor generation, keyword search, and update without losing too much accuracy; e.g., the time cost of index building in our scheme is nearly 10% of that in the previous schemes. Meanwhile, the storage cost of the encrypted index is also greatly reduced; e.g., the storage cost of the index in our scheme is nearly one percent of that in the previous schemes. In conclusion, compared with the previous SSE-DMKRS schemes [14, 15], our scheme is very suitable for the mobile cloud environment, in which the client device has limited computation and storage resources.
1.3. Organization
This paper is organized as follows. In Section 2, we give a formal definition of the system model and threat model of our scheme and also introduce the tools we adopt, namely “Word2vec” and the vector space model. In Section 3, we present the construction of the search index tree and the SSE-DMKRS scheme. In addition, a detailed security analysis and the update operations of our scheme are also given. Theoretical and experimental analyses are given in Section 4. Section 5 concludes the paper.
2. Preliminaries
In this section, we first give the framework of the system model and introduce the threat model adopted in our scheme. Then, we introduce some tools adopted in our scheme, including a famous term representation method in the field of natural language processing, namely “Word2vec,” and the vector space model. Finally, we present the design goals of our scheme. In addition, the main notations used in this paper are summarized in Table 2.
2.1. System Model
The system model contains three different roles: data owner, data user, and cloud server. The data owner outsources a group of documents F = {f_{1}, f_{2}, …, f_{n}} to the cloud in ciphertext form C = {c_{1}, c_{2}, …, c_{n}}. Moreover, the data owner also generates an encrypted searchable index for keywords search operation. For each query of an arbitrary keyword set Q, the data user computes a search trapdoor T_{Q} of the query Q and sends it to the cloud server. Upon receiving T_{Q} from the data user, the cloud server searches against the encrypted index and returns the candidate encrypted documents. After this, the data user decrypts the candidate documents and obtains the plaintext.
As illustrated in Figure 1, the architecture of the system model is formally described as follows:
(1) Data Owner (DO). DO holds a group of documents F = {f_{1}, f_{2}, …, f_{n}} and generates a secure searchable index I from F and an encrypted document collection C for F. Then, DO uploads I and C to the cloud server and distributes the secret key to the authorized data users. Furthermore, DO needs to update the index and documents stored in the cloud server.
(2) Data User (DU). An authorized DU can issue keyword queries over the encrypted data by utilizing a trapdoor, which is generated with the secret key fetched from DO. Moreover, DU can decrypt the encrypted documents by utilizing the secret key.
(3) Cloud Server (CS). CS stores the encrypted index I and documents C from DO. When CS receives the trapdoor for a query Q from DU, CS executes the keyword query over the index and returns the top-k most relevant encrypted documents associated with the query Q. Upon receiving update information from DO, CS also performs the update operation over the encrypted data. In addition, we assume that CS is “honest-but-curious,” as in many searchable encryption schemes [12, 14, 15]. This means that CS honestly and correctly executes the algorithms in our scheme; however, CS curiously infers and analyses the received data to obtain extra private information.
2.2. Threat Model
Throughout the paper, we mainly utilize the two threat models proposed by Cao et al. [12]:
(1) Known Ciphertext Model. CS only knows the information of the encrypted index, the ciphertext, and the trapdoor. That is to say, CS can only execute ciphertext-only attacks in this model.
(2) Known Background Model. CS knows more information than in the known ciphertext model, such as statistical information inferred from the documents. By taking advantage of such statistical information, e.g., term frequency (TF) and inverse document frequency (IDF), CS can conduct statistical attacks to verify whether certain keywords are in the query [33].
2.3. Design Goals
As mentioned before, we aim to build a secure and efficient SSE-DMKRS scheme. The design goals of our scheme are described as follows:
(1) Efficiency. The scheme aims to achieve sublinear search efficiency, with time and space costs of index building and trapdoor generation much lower than those of current schemes.
(2) Privacy Preserving. Similar to previous schemes [12, 14, 15], our scheme needs to prevent CS from learning extra private information inferred from the documents, the secure index, and the queries. More precisely, the privacy requirements are listed as follows:
● Index and Trapdoor Privacy. The plaintext information concealed in the index and the trapdoor cannot be leaked to CS. This information involves the keywords and the corresponding vector representation of each keyword.
● Trapdoor Unlinkability. CS cannot determine whether two trapdoors are built from the same query.
● Keyword Privacy. CS cannot identify whether a specific keyword is in the trapdoor or the index by analysing the search results and the statistical information of the documents.
(3) Dynamic. The scheme can efficiently support dynamic operations such as document insertion and deletion. Note that the efficiency of the update operations in our scheme is better than that of the previous SSE-DMKRS schemes.
2.4. Word2Vec
The “Word2vec” model is a shallow, two-layer neural network, which is used to convert words into a group of vector representations [31]. Under this model, each word in the document set is mapped to a vector, which can be used to calculate the similarity between words. For instance, Figure 2 shows that, by training on a simple corpus, the three words “dog,” “fox,” and “orange” are mapped to three vector representations, respectively. By utilizing these vectors, the similarity among the three words can be calculated. We find that the similarity between “dog” and “fox” is higher than that between “dog” and “orange,” since “dog” and “fox” are both animals. Thus, we can utilize “Word2vec” to convert the keywords in a corpus into a group of vector representations and then apply these vectors to perform ranked search.
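As a small illustration of this idea, the following sketch compares word similarities with the cosine measure. The four-dimensional embeddings are hypothetical stand-ins for trained “Word2vec” vectors (real models typically use 100-300 dimensions):

```python
import math

# Hypothetical 4-dimensional embeddings standing in for trained
# "Word2vec" vectors; the values are illustrative only.
embeddings = {
    "dog":    [0.82, 0.41, -0.12, 0.35],
    "fox":    [0.78, 0.36, -0.05, 0.41],
    "orange": [-0.30, 0.15, 0.88, -0.22],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

sim_dog_fox = cosine(embeddings["dog"], embeddings["fox"])
sim_dog_orange = cosine(embeddings["dog"], embeddings["orange"])
# Semantically close words ("dog", "fox") score higher than
# unrelated ones ("dog", "orange").
assert sim_dog_fox > sim_dog_orange
```

In practice the embeddings would come from a trained model rather than a hand-written table; only the similarity computation itself is shown here.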
2.5. Vector Space Model and Keyword Conversion
The vector space model is a very popular method in the field of information retrieval, usually used along with the TF-IDF rule to realize top-k search, where TF is the term frequency and IDF is the inverse document frequency [34]. By utilizing the TF-IDF rule, the documents and queries can be represented as a group of vectors. These vectors can be adopted in top-k search over the ciphertext [12, 14, 15]. However, the dimension of these vectors is linear in the number of words in the dataset, which is inefficient if the dataset has a lot of words. To address this issue, we apply “Word2vec” to present a novel keyword conversion method, which is described as follows:
(1) By applying “Word2vec” to a corpus, we create a dictionary in which each keyword is associated with a vector representation.
(2) For the keyword set W_{i} = {w_{i,1}, w_{i,2}, …, w_{i,s}} of the document f_{i}, we obtain the vectors v_{i,1}, v_{i,2}, …, v_{i,s} by looking up the dictionary, where v_{i,j} is the vector representation for w_{i,j}, 1 ≤ j ≤ s. After this, we set V_{i} = v_{i,1} + v_{i,2} + ⋯ + v_{i,s} as the vector representation of W_{i}.
(3) For the query keyword set Q = {q_{1}, q_{2}, …, q_{t}}, we utilize the dictionary to obtain the vectors u_{1}, u_{2}, …, u_{t}. Then, we set V_{Q} = u_{1} + u_{2} + ⋯ + u_{t} as the vector representation of Q.
Note that the dimensions of the vectors for W_{i} and Q are very small, e.g., 200, which is significantly smaller than the number of words in the dataset. Thus, the proposed method is better than the previous method based on the TF-IDF rule. In addition, together with the vector space model mentioned above, we use the cosine measure to evaluate the relevance between the document and the query. The relevance evaluation function is defined in the next section.
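The keyword conversion step can be sketched as follows. The aggregation by coordinate-wise summation and the three-dimensional dictionary entries are illustrative assumptions for this sketch, not the paper's exact parameters:

```python
# Sketch of the keyword-conversion step. The per-keyword vectors are
# assumed to be aggregated by coordinate-wise summation; the
# dictionary values are hypothetical low-dimensional embeddings.
dictionary = {
    "cloud":   [0.4, -0.1, 0.3],
    "search":  [0.2, 0.5, -0.2],
    "privacy": [-0.3, 0.2, 0.6],
}

def to_vector(keywords, dictionary):
    """Map a keyword set to a single d-dimensional vector by summing
    the embedding of each keyword found in the dictionary."""
    d = len(next(iter(dictionary.values())))
    vec = [0.0] * d
    for w in keywords:
        if w in dictionary:
            for i, x in enumerate(dictionary[w]):
                vec[i] += x
    return vec

doc_vec = to_vector({"cloud", "privacy"}, dictionary)     # document side
query_vec = to_vector({"search", "privacy"}, dictionary)  # query side
```

Note that the resulting dimension equals the embedding dimension d, independent of the dictionary size, which is the source of the efficiency gain discussed above.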
3. Proposed Scheme
In this section, we first give the algorithm for building the index tree and the search algorithm over this tree. Then, we give the concrete construction of our scheme and its dynamic update operations. Finally, we give a detailed analysis of the security of our scheme.
3.1. Search Index Balanced Binary Tree
In this section, we adopt a balanced binary tree to create the search index, which will be used in our main scheme. Inspired by the construction process in [14], the tree building and the search process for our scheme are described as follows.
3.1.1. Tree Building Process
Formally, the data structure of a tree node u is defined as u = ⟨ID, V_{u}^{max}, V_{u}^{min}, P_{l}, P_{r}, FID⟩, where ID is the identity of the node u, V_{u}^{max} and V_{u}^{min} are the vector representations of the node u, P_{l} and P_{r} are pointers to u’s left and right children, respectively, and FID stores the identity of a document if u is a leaf node. Note that, compared with the previous index trees [12, 14, 15], a node in our tree has two vectors, while it has only one vector in the previous trees. The main reason is that a node vector in our tree can contain negative numbers, while a node vector in the previous trees contains only positive numbers. For clarity, we give a simple example with illustrative values. Let V_{A} = (0.6, 0.75, −0.05) and V_{B} = (0, 0.75, 0.55) be the vectors of two leaf nodes A and B, respectively. For the previous index trees, the vector of the parent node C of these two leaf nodes is V_{C} = (0.6, 0.75, 0.55), in which the value of each dimension is the larger value of V_{A} and V_{B}. For a query vector V_{Q} = (−1, 1, −1), the scores of the nodes A, B, and C are 0.2, 0.2, and −0.4, respectively. It is very important to note that the score of the parent node is less than the scores of its children, which causes these two leaf nodes to be ignored in the tree search process even when they should be considered.
In our index tree, let the dimensions of V_{u}^{max} and V_{u}^{min} both be d. The methods for constructing V_{u}^{max} and V_{u}^{min} are denoted by M_{1} and M_{2}, respectively, and given as follows:
(1) M_{1}: if the node u is a leaf node corresponding to a file f, we create a vector V_{f} for f by adopting the keyword conversion method mentioned in Section 2.5. Then, we set V_{u}^{max} = V_{f} and V_{u}^{min} = V_{f}.
(2) M_{2}: if the node u is an internal node, V_{u}^{max} and V_{u}^{min} are based on its children’s vectors. Let V_{lc}^{max} and V_{lc}^{min} be the two vectors of u’s left child, and let V_{rc}^{max} and V_{rc}^{min} be the two vectors of u’s right child.
Suppose that Min () and Max () are the minimum and maximum functions, respectively; V_{u}^{max} is built as follows:
V_{u}^{max}[i] = Max (V_{lc}^{max}[i], V_{rc}^{max}[i]), 1 ≤ i ≤ d. (1)
And V_{u}^{min} is built as follows:
V_{u}^{min}[i] = Min (V_{lc}^{min}[i], V_{rc}^{min}[i]), 1 ≤ i ≤ d. (2)
That is, V_{u}^{max} is built by taking the larger value of V_{lc}^{max} and V_{rc}^{max} in each dimension, and V_{u}^{min} is created by taking the smaller value of V_{lc}^{min} and V_{rc}^{min} in each dimension.
An illustration of the above methods is given in Figure 3. In Figure 3, let the node u be a leaf node, and let W be the keyword set of the file that u stores. By using the keyword conversion method, W is converted into a vector V; then we set V_{u}^{max} = V_{u}^{min} = V. If the node u is an internal node, its vectors V_{u}^{max} and V_{u}^{min} are obtained by taking the coordinate-wise maximum and minimum of its children’s vectors V_{lc}^{max}, V_{lc}^{min}, V_{rc}^{max}, and V_{rc}^{min}.
Based on the methods M_{1} and M_{2}, and inspired by the tree building algorithm introduced in [14], our tree building algorithm is given in Algorithm 1. An example of the proposed index tree is given in Example 1 and Figure 4. In Algorithm 1, we use the function GenID () to generate a unique identity ID for each node and apply GenFID () to generate a unique file ID for each leaf node. CurrentNodeSet contains the group of nodes that have no parent node yet and need to be processed, and |CurrentNodeSet| is the number of nodes in CurrentNodeSet. If |CurrentNodeSet| is even, we assume that |CurrentNodeSet| = 2h; otherwise, we assume that |CurrentNodeSet| = 2h + 1, where h is a positive integer. TempNodeSet is a set containing the newly generated nodes. Moreover, for each node u, if u is a leaf node, we use method M_{1} to generate V_{u}^{max} and V_{u}^{min}; otherwise, V_{u}^{max} and V_{u}^{min} are created by using M_{2}.
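The tree building process can be sketched in plain Python as follows. The class and function names (Node, build_index_tree) are illustrative, and the handling of an odd number of parent-less nodes is simplified here (the leftover node is simply carried up unchanged), which may differ from the paper's Algorithm 1:

```python
# Sketch of the index-tree construction: leaves copy the document
# vector into both V_max and V_min (method M1), and internal nodes
# take the coordinate-wise max/min of their children (method M2).
import itertools

_ids = itertools.count()  # stands in for GenID()

class Node:
    def __init__(self, v_max, v_min, left=None, right=None, fid=None):
        self.id = next(_ids)
        self.v_max = v_max          # upper vector of the node
        self.v_min = v_min          # lower vector of the node
        self.left, self.right = left, right
        self.fid = fid              # file identity for leaf nodes

def make_leaf(fid, doc_vec):
    # M1: a leaf stores the document vector in both positions.
    return Node(list(doc_vec), list(doc_vec), fid=fid)

def make_internal(lc, rc):
    # M2: coordinate-wise maximum / minimum over the two children.
    v_max = [max(a, b) for a, b in zip(lc.v_max, rc.v_max)]
    v_min = [min(a, b) for a, b in zip(lc.v_min, rc.v_min)]
    return Node(v_max, v_min, left=lc, right=rc)

def build_index_tree(doc_vectors):
    """doc_vectors: dict mapping file id -> document vector."""
    current = [make_leaf(fid, v) for fid, v in doc_vectors.items()]
    while len(current) > 1:
        nxt = []
        # Pair neighbouring parent-less nodes; an odd node is carried up.
        for i in range(0, len(current) - 1, 2):
            nxt.append(make_internal(current[i], current[i + 1]))
        if len(current) % 2 == 1:
            nxt.append(current[-1])
        current = nxt
    return current[0]

root = build_index_tree({
    "f1": [0.2, 0.4], "f2": [0.5, -0.1], "f3": [-0.3, 0.6],
})
```

Building level by level in this bottom-up fashion yields a balanced tree of height O(log n) over n documents.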

3.1.2. Search Process
For a query vector V_{Q} of a query Q, we split V_{Q} into two vectors V_{Q}^{pos} and V_{Q}^{neg}. For each dimension i (1 ≤ i ≤ d), if V_{Q}[i] < 0, we set V_{Q}^{neg}[i] = V_{Q}[i] and V_{Q}^{pos}[i] = 0; otherwise, we set V_{Q}^{pos}[i] = V_{Q}[i] and V_{Q}^{neg}[i] = 0. Obviously, V_{Q}^{neg} holds all the negative parts of V_{Q}, while V_{Q}^{pos} holds the positive parts. For clarity, we denote this splitting method for the query Q by M_{3}. An illustration of this method is given in Figure 5. For example, if the query vector is V_{Q} = (0.3, −0.5, 0.2), then V_{Q}^{pos} = (0.3, 0, 0.2) and V_{Q}^{neg} = (0, −0.5, 0).
For a query Q and a node u, the score is calculated as
Score (u, Q) = V_{u}^{max} · V_{Q}^{pos} + V_{u}^{min} · V_{Q}^{neg}. (3)
We can utilize this equation to evaluate which documents are the most relevant to the query. Moreover, we can verify that the score of a parent node is no less than its children’s scores: the entries of V_{Q}^{pos} are nonnegative and the parent’s V^{max} dominates its children’s coordinate-wise, while the entries of V_{Q}^{neg} are nonpositive and the parent’s V^{min} is dominated by its children’s coordinate-wise. This property can significantly reduce the number of nodes checked in the search process.
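The splitting method M_{3}, the score of equation (3), and the parent-dominance property can be checked with a short sketch (all vector values are illustrative):

```python
def split_query(v_q):
    """M3: split the query vector into its positive and negative parts."""
    v_pos = [x if x > 0 else 0.0 for x in v_q]
    v_neg = [x if x < 0 else 0.0 for x in v_q]
    return v_pos, v_neg

def score(v_max, v_min, v_pos, v_neg):
    """Score(u, Q) = V_max . V_pos + V_min . V_neg  (equation (3))."""
    return (sum(a * b for a, b in zip(v_max, v_pos))
            + sum(a * b for a, b in zip(v_min, v_neg)))

v_pos, v_neg = split_query([0.3, -0.5, 0.2])
# Parent dominance: for a parent built with M2, its score upper-bounds
# the scores of both children, so low-scoring subtrees can be pruned.
child_a = ([0.4, 0.1, 0.6], [0.4, 0.1, 0.6])   # leaf: V_max == V_min
child_b = ([0.1, 0.7, 0.2], [0.1, 0.7, 0.2])
parent_max = [max(a, b) for a, b in zip(child_a[0], child_b[0])]
parent_min = [min(a, b) for a, b in zip(child_a[1], child_b[1])]
s_a = score(*child_a, v_pos, v_neg)
s_b = score(*child_b, v_pos, v_neg)
s_p = score(parent_max, parent_min, v_pos, v_neg)
assert s_p >= max(s_a, s_b)
```

The assertion holds for any choice of vectors, since each term of the parent's score dominates the corresponding term of either child's score.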
The search process is given in Algorithm 2. In Algorithm 2, we use RList to store the top-k files that have the k largest relevance scores to the query. RList is initialized to an empty list and is updated whenever a more relevant file is found. The k-th score is defined as the smallest relevance score in the current RList and is initialized to a very small integer. By using the k-th score, we can accelerate the search process by ignoring paths with low scores. An illustration of the search process is given in Example 1 and Figure 4, where F = {f_{1}, f_{2}, …, f_{6}} and the vector dimension d is 3.
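A minimal sketch of the pruned top-k traversal follows. It keeps RList as a min-heap so that the k-th score is always at the heap root; the node layout and names are illustrative, not the paper's exact Algorithm 2:

```python
# Pruned top-k search: descend into the higher-scoring child first and
# skip any subtree whose root scores below the current k-th score.
import heapq

def node(v_max, v_min, left=None, right=None, fid=None):
    return {"v_max": v_max, "v_min": v_min,
            "left": left, "right": right, "fid": fid}

def score(n, v_pos, v_neg):
    return (sum(a * b for a, b in zip(n["v_max"], v_pos))
            + sum(a * b for a, b in zip(n["v_min"], v_neg)))

def search(n, v_pos, v_neg, k, rlist):
    """rlist is a min-heap of (score, fid) holding the current top-k."""
    if n is None:
        return
    s = score(n, v_pos, v_neg)
    if len(rlist) == k and s <= rlist[0][0]:
        return                      # prune: subtree cannot beat the k-th score
    if n["fid"] is not None:        # leaf node
        if len(rlist) < k:
            heapq.heappush(rlist, (s, n["fid"]))
        elif s > rlist[0][0]:
            heapq.heapreplace(rlist, (s, n["fid"]))
        return
    kids = sorted([n["left"], n["right"]],
                  key=lambda c: score(c, v_pos, v_neg), reverse=True)
    for c in kids:
        search(c, v_pos, v_neg, k, rlist)

# Tiny example: two leaves under a root built with M1/M2.
leaf1 = node([0.9, 0.1], [0.9, 0.1], fid="f1")
leaf2 = node([0.1, 0.8], [0.1, 0.8], fid="f2")
root = node([0.9, 0.8], [0.1, 0.1], left=leaf1, right=leaf2)
rlist = []
search(root, [1.0, 0.0], [0.0, 0.0], k=1, rlist=rlist)
top = [fid for _, fid in sorted(rlist, reverse=True)]
```

Descending into the higher-scoring child first makes the k-th score rise quickly, which in turn prunes more of the remaining tree.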

3.1.3. Example 1
An example of an index tree and a search process on this tree is illustrated in Figure 4. In Figure 4, we show an index tree for F = {f_{1}, f_{2}, …, f_{6}} in which the dimension of the vectors of each node is 3. For each node u in the tree, the upper vector and the lower vector correspond to V_{u}^{max} and V_{u}^{min}, respectively. In the tree building process, we first generate the leaf nodes from F and then create the internal nodes based on these leaf nodes.
Moreover, Figure 4 also gives an illustration of the search process. In Figure 4, we take a query vector V_{Q} and split it into V_{Q}^{pos} and V_{Q}^{neg}. We suppose that the top-3 files will be returned to the data user. According to Algorithm 2, the search process begins with the root node r and calculates the scores between the query Q and the two child nodes r_{11} and r_{12} of r by using equation (3).
Because the score between r_{11} and Q is higher than that between r_{12} and Q, Algorithm 2 traverses the subtree with r_{11} as the root node and computes the scores between the query Q and the two child nodes of r_{11}. Since the score between r_{21} and Q is higher than that between r_{22} and Q, Algorithm 2 traverses the subtree with r_{21} as the root node and adds the leaf nodes f_{1}, f_{2} to RList. After this, the subtree with r_{22} as the root node is traversed, and the leaf nodes f_{3} and f_{4} are reached. Since the number of files in RList is less than 3, f_{3} is added to RList directly. For the file f_{4}, since the number of files in RList now equals 3, Algorithm 2 compares the score between f_{4} and Q to the minimum score in RList. Because the score between f_{4} and Q is smaller than the minimum score in RList, f_{4} is not added to RList. At this point, the subtree with r_{11} as the root node has been traversed, so Algorithm 2 turns to the subtree with r_{12} as the root node. As the score between r_{12} and Q is smaller than the minimum score in RList, which means that the scores of all descendant nodes of r_{12} are smaller than the minimum score in RList (this property is described in Section 3.1.2), f_{5} and f_{6} will not be checked. Therefore, Algorithm 2 outputs RList = {f_{1}, f_{2}, f_{3}}.
3.2. Construction of SSEDMKRS
In this section, by combining the secure KNN algorithm [32] with the index tree building algorithm, we propose a concrete SSE-DMKRS scheme. The SSE-DMKRS scheme consists of five algorithms. The algorithms KeyGen, DictionaryBuild, and IndexBuild are executed by the data owners, while the algorithms TrapdoorGen and Search are performed by the data users and the cloud server, respectively:
(i) KeyGen (1^{λ}): given a security parameter λ, this algorithm first randomly chooses four invertible matrices M_{1}, M_{2}, M_{3}, M_{4} ∈ R^{d×d}, where d is the dimension of V_{u}^{max} and V_{u}^{min}. Then, it randomly generates a d-bit vector S. Finally, it outputs the secret key sk = {S, M_{1}, M_{2}, M_{3}, M_{4}}.
(ii) DictionaryBuild (F): given the document set F = {f_{1}, f_{2}, …, f_{n}}, the algorithm runs “Word2vec” to generate the dictionary D of F. In the dictionary D, each keyword is associated with a vector representation. Besides, each keyword also corresponds to a set of semantically related keywords.
(iii) IndexBuild (sk, F, D): given the document set F and the dictionary D for F, the algorithm first creates the index tree T by using the algorithm BuildIndexTree (F, D) (Algorithm 1). Then, for each node u in the tree T, the algorithm generates two random vector pairs {V_{u}^{max′}, V_{u}^{max″}} and {V_{u}^{min′}, V_{u}^{min″}} for the vectors V_{u}^{max} and V_{u}^{min}, respectively. More precisely, if S[i] = 0, it sets V_{u}^{max′}[i] = V_{u}^{max″}[i] = V_{u}^{max}[i] and V_{u}^{min′}[i] = V_{u}^{min″}[i] = V_{u}^{min}[i]; if S[i] = 1, the four values are set randomly under the constraints V_{u}^{max′}[i] + V_{u}^{max″}[i] = V_{u}^{max}[i] and V_{u}^{min′}[i] + V_{u}^{min″}[i] = V_{u}^{min}[i].
Finally, for each node u, it computes I_{u} = {M_{1}^{T}V_{u}^{max′}, M_{2}^{T}V_{u}^{max″}, M_{3}^{T}V_{u}^{min′}, M_{4}^{T}V_{u}^{min″}}. By replacing the plaintext vectors V_{u}^{max} and V_{u}^{min} with the encrypted index I_{u}, an encrypted index tree I_{T} is created.
(iv) TrapdoorGen (sk, Q): given a query keyword set Q, the algorithm first extends Q to a new semantic keyword set Q′. The process is as follows:
(a) It generates a new keyword set Q′, which is initialized to an empty set.
(b) Note that each keyword in the dictionary is associated with a group of keywords semantically related to it. For each keyword q in Q, the algorithm randomly chooses k′ semantic keywords based on the dictionary and inserts these keywords into Q′, where k′ is chosen dynamically.
Then, based on Q′, the TrapdoorGen algorithm generates a pair of vectors V_{Q}^{pos} and V_{Q}^{neg} by adopting the method M_{3}. After this, it generates two random vector pairs {V_{Q}^{pos′}, V_{Q}^{pos″}} and {V_{Q}^{neg′}, V_{Q}^{neg″}} for the vectors V_{Q}^{pos} and V_{Q}^{neg}, respectively. This process is similar to the splitting process in the IndexBuild algorithm, except that the roles of S[i] = 0 and S[i] = 1 are exchanged: if S[i] = 1, it sets V_{Q}^{pos′}[i] = V_{Q}^{pos″}[i] = V_{Q}^{pos}[i] and V_{Q}^{neg′}[i] = V_{Q}^{neg″}[i] = V_{Q}^{neg}[i]; if S[i] = 0, the values are set randomly under the constraints V_{Q}^{pos′}[i] + V_{Q}^{pos″}[i] = V_{Q}^{pos}[i] and V_{Q}^{neg′}[i] + V_{Q}^{neg″}[i] = V_{Q}^{neg}[i].
Finally, this algorithm outputs T_{Q} = {M_{1}^{−1}V_{Q}^{pos′}, M_{2}^{−1}V_{Q}^{pos″}, M_{3}^{−1}V_{Q}^{neg′}, M_{4}^{−1}V_{Q}^{neg″}} as the trapdoor for Q.
(v) Search (sk, T_{Q}, I_{T}): for each node u in I_{T}, the algorithm computes
I_{u} · T_{Q} = V_{u}^{max′} · V_{Q}^{pos′} + V_{u}^{max″} · V_{Q}^{pos″} + V_{u}^{min′} · V_{Q}^{neg′} + V_{u}^{min″} · V_{Q}^{neg″} = V_{u}^{max} · V_{Q}^{pos} + V_{u}^{min} · V_{Q}^{neg}.
According to equation (3), the relevance score calculated from the encrypted vector I_{u} and the trapdoor T_{Q} equals the value of Score (u, Q). By using this property, the algorithm can utilize the SearchIndexTree algorithm (Algorithm 2) to perform ranked search.
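The score-preserving property of the secure KNN encryption can be checked numerically. In the sketch below, the four invertible matrices are taken to be diagonal so that their transposes and inverses are trivial to apply in pure Python; the real scheme uses random dense invertible matrices, and all vector values are illustrative:

```python
# Pure-Python sketch of the secure kNN encryption step. Diagonal
# matrices stand in for the random dense invertible matrices of the
# actual scheme (they keep the algebra identical but are NOT secure).
import random

random.seed(7)
d = 4
S = [0, 1, 1, 0]                          # random d-bit split indicator
diags = [[random.uniform(1, 2) for _ in range(d)] for _ in range(4)]

def mat_T_vec(diag, v):   # M^T v  (diagonal: element-wise product)
    return [m * x for m, x in zip(diag, v)]

def mat_inv_vec(diag, v): # M^{-1} v
    return [x / m for m, x in zip(diag, v)]

def split_index(v):       # S[i]=0: copies; S[i]=1: random values summing to v[i]
    v1, v2 = [], []
    for s, x in zip(S, v):
        r = random.random()
        v1.append(x if s == 0 else r)
        v2.append(x if s == 0 else x - r)
    return v1, v2

def split_trapdoor(v):    # opposite rule on the trapdoor side
    v1, v2 = [], []
    for s, x in zip(S, v):
        r = random.random()
        v1.append(x if s == 1 else r)
        v2.append(x if s == 1 else x - r)
    return v1, v2

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

v_max, v_min = [0.5, -0.2, 0.7, 0.1], [0.3, -0.4, 0.2, 0.0]
v_pos, v_neg = [0.6, 0.0, 0.4, 0.2], [0.0, -0.5, 0.0, 0.0]

a1, a2 = split_index(v_max); b1, b2 = split_index(v_min)
I_u = [mat_T_vec(diags[0], a1), mat_T_vec(diags[1], a2),
       mat_T_vec(diags[2], b1), mat_T_vec(diags[3], b2)]

p1, p2 = split_trapdoor(v_pos); q1, q2 = split_trapdoor(v_neg)
T_Q = [mat_inv_vec(diags[0], p1), mat_inv_vec(diags[1], p2),
       mat_inv_vec(diags[2], q1), mat_inv_vec(diags[3], q2)]

encrypted_score = sum(dot(i, t) for i, t in zip(I_u, T_Q))
plain_score = dot(v_max, v_pos) + dot(v_min, v_neg)
assert abs(encrypted_score - plain_score) < 1e-9
```

The final assertion reflects the identity I_{u} · T_{Q} = V_{u}^{max} · V_{Q}^{pos} + V_{u}^{min} · V_{Q}^{neg} of equation (3): the matrices cancel pairwise, and the opposite splitting rules on the two sides make the random shares cancel dimension by dimension.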
3.3. Dynamic Update Operations
Besides the search operation, the proposed scheme also supports dynamic operations, e.g., document insertion and deletion, satisfying the requirements of real-world applications. Because the proposed scheme is built over a balanced binary tree, the update operations are realized by modifying the nodes in the tree. Inspired by the update method introduced in [14, 15], the update algorithms are presented as follows:
(i) UpdateInfoGen (sk, T_{s}, f_{i}, Utype): this algorithm is executed by the data owners and generates the update information {I_{s}, c_{i}} for the cloud server, where T_{s} is a set containing all the update nodes, I_{s} is an encrypted form of T_{s}, f_{i} is the target document, c_{i} is an encrypted form of f_{i}, and Utype is the update type. In order to reduce the communication cost, the data owners store the unencrypted index tree on their own devices. For Utype ∈ {Ins, Del}, the algorithm works as follows:
(a) If Utype = “Del,” the algorithm deletes the document f_{i} from the tree. The algorithm first finds the leaf node associated with the document f_{i} and deletes it. In addition, the internal nodes associated with this leaf node are also added to T_{s}. Specifically, if the deletion operation would break the balance of the index tree, the algorithm can set the target leaf node as a fake node instead of removing it. After this, the algorithm encrypts T_{s} to generate I_{s}. Finally, the algorithm sends I_{s} to the cloud server and sets c_{i} as null.
(b) If Utype = “Ins,” the algorithm inserts the document f_{i} into the tree. The algorithm first creates a leaf node for f_{i} according to the method M_{1} introduced in Section 3.1 and inserts this leaf node into T_{s}. Then, based on the method M_{2}, the algorithm updates the vectors of the internal nodes on the path from the root to the new leaf node and inserts these internal nodes into T_{s}.
Here, the algorithm prefers to replace a fake leaf node with the new leaf node rather than insert a new leaf node. Finally, the algorithm encrypts T_{s} and f_{i} to generate I_{s} and c_{i}, respectively, and sends them to the cloud server.
(ii) Update (I_{T}, C, I_{s}, c_{i}, Utype): this algorithm is executed by the cloud server to update the index tree I_{T} with the encrypted node set I_{s}. After this, if Utype = “Del,” the algorithm removes c_{i} from C; otherwise, the algorithm inserts c_{i} into C.
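The index-side bookkeeping of an insertion can be sketched as follows: after a fake leaf is replaced by the new leaf, the V^{max}/V^{min} vectors of every ancestor on the path to the root are refreshed with method M_{2}. The node layout is illustrative:

```python
# Sketch of the path update after inserting a document: replace a fake
# placeholder leaf, then re-apply M2 upward along the root path.
class Node:
    def __init__(self, v_max, v_min, parent=None):
        self.v_max, self.v_min = v_max, v_min
        self.parent = parent
        self.left = self.right = None

def refresh_ancestors(leaf):
    """Walk from the new leaf up to the root, re-applying M2."""
    n = leaf.parent
    while n is not None:
        n.v_max = [max(a, b) for a, b in zip(n.left.v_max, n.right.v_max)]
        n.v_min = [min(a, b) for a, b in zip(n.left.v_min, n.right.v_min)]
        n = n.parent

# Replace a fake leaf under the root with a real one and update upward.
root = Node([0.5, 0.5], [0.1, 0.1])
old = Node([0.5, 0.5], [0.1, 0.1], parent=root)   # existing real leaf
fake = Node([0.0, 0.0], [0.0, 0.0], parent=root)  # fake placeholder leaf
root.left, root.right = old, fake
new = Node([0.9, -0.2], [0.9, -0.2], parent=root) # inserted document (M1)
root.right = new                                  # fake leaf replaced
refresh_ancestors(new)
```

Only the O(log n) nodes on the root path change, which is why the update information T_{s} stays small.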
Note that, after a period of insertion and deletion operations, the number of keywords in the dictionary may change. Because the dimensions of the index and trapdoor vectors in the previous schemes are linear in the number of keywords in the dictionary, these schemes have to rebuild the search index tree. By contrast, our scheme is not affected by this problem. In the proposed scheme, the dimensions of the vectors in the index and trapdoor are determined by the “Word2vec” tool and set by the users. For example, if we set the dimension of the word vectors to 200, the dimension of each keyword’s vector is 200, and thus the dimensions of V_{u}^{max}, V_{u}^{min}, V_{Q}^{pos}, and V_{Q}^{neg} are all 200. According to the above analysis, our scheme is more suitable for update operations than the previous schemes.
3.4. Security Analysis
In this section, we analyse the security of the proposed SSE-DMKRS scheme according to the privacy requirements introduced in Section 2.3:
(1) Index and Trapdoor Privacy. In the proposed scheme, each node u in the index tree and the query Q in the trapdoor are encrypted by using the secure KNN algorithm introduced in [32]. Thus, the attackers cannot obtain the original vectors in the tree nodes and the query, which means that the index and trapdoor privacy are well protected.
(2) Trapdoor Unlinkability. In the trapdoor generation phase, the query vector is split randomly. Moreover, the same keyword set Q can be extended to multiple different semantic keyword sets Q′. So, the same query Q will be encrypted into different trapdoors, which means that the goal of trapdoor unlinkability is achieved.
(3) Keyword Privacy. Since the index and the trapdoor are protected by the secure KNN algorithm, the adversary cannot infer the plaintext information from the index and the trapdoor under the known ciphertext model. Considering that the known background model is common in real-world applications, we also analyse the security of the proposed scheme under this model. In the TrapdoorGen algorithm, the original query keyword set Q is extended to a new set Q′. Specifically, for each keyword q in Q, the algorithm randomly chooses a number k′, then chooses k′ semantic keywords related to q by utilizing the dictionary, and inserts these keywords into Q′. Suppose that each keyword is associated with N semantic keywords in the dictionary; then each keyword can generate 2^{N} different keyword sets, since each semantic keyword can be chosen or not. For example, if a keyword q is associated with three semantic keywords {q_{1}, q_{2}, q_{3}}, then q can generate the 2^{3} keyword sets {q}, {q, q_{1}}, {q, q_{2}}, {q, q_{3}}, {q, q_{1}, q_{2}}, {q, q_{1}, q_{3}}, {q, q_{2}, q_{3}}, and {q, q_{1}, q_{2}, q_{3}}.
Since the query Q usually contains more than one keyword, Q generates far more than 2^{N} different semantic keyword sets. In this way, the final similarity score is obfuscated by the random semantic keyword sets. As shown by the analysis in [14, 15], our scheme protects keyword privacy under the known background model.
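The enumeration of the 2^{N} extended sets for a single keyword can be sketched as follows (illustrative only; the function name is our own):

```python
from itertools import combinations

def semantic_keyword_sets(q, related):
    """All 2^N extended keyword sets for a query keyword q, where `related`
    holds the N semantic keywords associated with q in the dictionary."""
    sets_ = []
    for r in range(len(related) + 1):          # choose 0, 1, ..., N of them
        for combo in combinations(related, r):
            sets_.append({q} | set(combo))     # q itself is always kept
    return sets_

sets_ = semantic_keyword_sets("q", ["q1", "q2", "q3"])
print(len(sets_))  # 2^3 = 8 possible extended sets
```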
4. Performance Analysis
In this section, we evaluate the proposed SSEDMKRS scheme both theoretically and experimentally. A detailed experiment demonstrates that our scheme can efficiently perform dynamic ranked keyword search over encrypted data. The experiment runs on an Intel® Core™ i7 CPU at 2.90 GHz with 16 GB of memory and uses a real-world email corpus, the Enron email dataset [35]. We analyse the performance of our scheme in two respects: (1) the efficiency of the proposed scheme, including index building, trapdoor generation, search, and update; (2) the relationship between search precision and privacy level. Moreover, to show the advantages of our scheme, we compare it with two closely related previous schemes. For simplicity, we denote the schemes introduced in [14, 15] by X15 and G19, respectively.
4.1. Efficiency
4.1.1. Index Building
The process of index building mainly consists of two steps: (1) creating an unencrypted index tree using Algorithm 1; (2) encrypting each node in the tree with the secure KNN scheme. In the tree building step, Algorithm 1 generates O(n) nodes based on the document set F. Because each node holds two vectors whose dimensions are both d, the vector splitting process takes O(d) time and the matrix multiplication operations take O(d^{2}) time in the encryption step. Combining these two steps, the whole time complexity of index building is O(nd^{2}), so the time cost of index building mainly depends on the number of documents in F and the dimension of each node's vector.
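The encryption step can be sketched with the secure kNN technique of [32] (a hedged numpy sketch following the standard description of that algorithm; the names S, M1, M2 and all concrete values are illustrative): vectors are split by a secret bit vector and rotated by secret matrices, yet inner products, i.e., relevance scores, are preserved.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
S = rng.integers(0, 2, d)                # secret bit vector
M1 = rng.standard_normal((d, d))         # secret invertible matrices
M2 = rng.standard_normal((d, d))         # (random matrices: invertible w.p. 1)

def enc_index(p):
    """Split p according to S, then multiply by M1^T, M2^T: O(d^2) per node."""
    p1, p2 = p.copy(), p.copy()
    r = rng.standard_normal(d)
    mask = (S == 0)                       # split randomly where S[j] = 0
    p1[mask], p2[mask] = r[mask], p[mask] - r[mask]
    return M1.T @ p1, M2.T @ p2

def enc_query(q):
    """Split q the opposite way, then multiply by the inverse matrices."""
    q1, q2 = q.copy(), q.copy()
    r = rng.standard_normal(d)
    mask = (S == 1)                       # split randomly where S[j] = 1
    q1[mask], q2[mask] = r[mask], q[mask] - r[mask]
    return np.linalg.inv(M1) @ q1, np.linalg.inv(M2) @ q2

p, q = rng.standard_normal(d), rng.standard_normal(d)
i1, i2 = enc_index(p)
t1, t2 = enc_query(q)
score = i1 @ t1 + i2 @ t2                 # equals p . q (the relevance score)
print(np.isclose(score, p @ q))           # True
```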
Since the dimension of each node's vector in X15 and G19 is linear in the number of keywords in the dictionary (m), the time costs of index building in X15 and G19 are both O(nm^{2}). Because d is much smaller than m, the time cost of index building in our scheme is much less than that in X15 and G19. In addition, in G19 the internal nodes are constructed with a Bloom filter, so the dimension of each internal node's vector is linear in the filter size b. Since b is usually smaller than m, the index building time of G19 is less than that of X15.
Figure 6(a) shows that the time cost of index building in our scheme is much less than that in X15 and G19. More precisely, when n = 1000, m = 20000, d = 1000, and b = 10000, the time consumption of index building in X15 and G19 is roughly 100 to 200 times that in our scheme. As m increases, the advantage of our scheme becomes even more significant.
Figure 6: Time costs of (a) index building, (b) trapdoor generation, (c) search, and (d) update.
In addition, because the index tree has O(n) nodes and each node holds two d-dimensional vectors, the space complexity of the index tree is O(nd). By contrast, the space complexities of the index trees in X15 and G19 are both O(nm). As Table 3 shows, even with n = 1000, m = 20000, d = 1000, and b = 10000, the storage cost of the index tree in our scheme is much less than that in X15 and G19.
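A back-of-envelope computation under the experiment's parameters illustrates the gap (our own storage assumptions, not exact figures from Table 3: 8-byte floats, about 2n nodes in a balanced binary tree, and two vectors per node):

```python
# Rough index-tree storage comparison: O(nd) for our scheme vs O(nm) for X15.
n, m, d = 1000, 20000, 1000
nodes = 2 * n - 1                  # n leaves plus n - 1 internal nodes
ours = nodes * 2 * d * 8           # two d-dim vectors per node, 8 bytes each
x15  = nodes * 2 * m * 8           # vector length linear in dictionary size m
print(ours / 2**20, x15 / 2**20)   # ~30 MB vs ~610 MB
```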
4.1.2. Trapdoor Generation
In our scheme, the query is converted into two vectors whose dimensions are both d, and trapdoor generation multiplies these two vectors by the matrices in the key. Thus, the time complexity of trapdoor generation in our scheme is O(d^{2}). By contrast, since the dimensions of the query vectors in X15 and G19 are both m, their trapdoor generation time complexities are both O(m^{2}). Hence, the time cost of trapdoor generation in our scheme is much less than that in X15 and G19. In particular, as Figure 6(b) shows, when n = 1000, m = 20000, and d = 1000, the time cost of trapdoor generation in our scheme is 1.5 ms, while that in G19 and X15 is 287 ms and 290 ms, respectively.
4.1.3. Search
In the search process, if the relevance score between an internal node u and the query Q is less than the minimum relevance score of the current top-k documents, the subtree rooted at u is not accessed. Thus, not all nodes in the tree are visited during a search. Suppose that θ leaf nodes contain at least one keyword of the query Q. Since the height of the tree is O(log n) and a relevance score calculation takes O(d) time, the time complexity of the search process is O(θd log n). In X15, a relevance score calculation takes O(m) time, so the search complexity is O(θm log n). In G19, each internal node contains a Bloom filter of size b and each leaf node holds a vector of size m, so the search complexity is O(θ(b log n + m)). As Figure 6(c) shows, when n = 1000, m = 20000, d = 1000, and b = 10000, the search time of our scheme is 36 ms, while that of G19 and X15 is 135 ms and 214 ms, respectively.
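The pruned search can be sketched as follows (an illustrative toy, not the paper's exact algorithm: each internal node stores the element-wise maximum of its children's vectors, which upper-bounds the scores below it when the query weights are nonnegative, as with TF-IDF weights):

```python
import heapq

class Node:
    def __init__(self, vec, left=None, right=None, doc_id=None):
        self.vec, self.left, self.right, self.doc_id = vec, left, right, doc_id

def build(leaves):
    """Bottom-up balanced tree; a parent's vector is the element-wise max
    of its children's vectors, so it bounds every leaf score beneath it."""
    level = [Node(v, doc_id=i) for i, v in enumerate(leaves)]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            a, b = level[i], level[i + 1]
            nxt.append(Node([max(x, y) for x, y in zip(a.vec, b.vec)], a, b))
        if len(level) % 2:
            nxt.append(level[-1])       # odd node carried up unchanged
        level = nxt
    return level[0]

def search(root, q, k):
    """Depth-first search with pruning; returns top-k (score, doc_id) pairs."""
    best = []  # min-heap holding the current top-k scores

    def score(v):
        return sum(x * y for x, y in zip(v, q))

    def visit(u):
        if u is None:
            return
        s = score(u.vec)
        if len(best) == k and s <= best[0][0]:
            return                      # prune: subtree cannot improve top-k
        if u.doc_id is not None:        # leaf node: a real document
            heapq.heappush(best, (s, u.doc_id))
            if len(best) > k:
                heapq.heappop(best)
        else:
            visit(u.left); visit(u.right)

    visit(root)
    return sorted(best, reverse=True)

docs = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5], [0.9, 0.1]]
print(search(build(docs), [1.0, 0.0], k=2))  # [(0.9, 3), (0.8, 1)]
```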
4.1.4. Update
When the data owners insert or delete a document, they not only insert or delete a leaf node but also update O(log n) internal nodes. Since the encryption time per node is O(d^{2}), the time complexity of an update operation is O(d^{2} log n). In X15, the encryption time per node is O(m^{2}), so an update takes O(m^{2} log n). In G19, the internal nodes are Bloom filters that are not encrypted, so the cost of updating the internal nodes is negligible and the update complexity is O(m^{2}), since only the leaf node is encrypted. As Figure 6(d) shows, when n = 1000, m = 20000, d = 1000, and b = 10000, the time cost of updating one document in our scheme is 16 ms, while that in X15 and G19 is 1020 ms and 107 ms, respectively.
4.2. Precision and Privacy
The search precision of our scheme is affected by the group of semantic keywords related to the original index and query keywords. We measure our scheme with the metric "precision" defined in [12]: precision = k′/k, where k′ is the number of real top-k documents among the retrieved k documents.
In addition, the semantic keywords in the index and query keyword sets disturb the relevance score calculation during search, which makes it harder for adversaries to identify keywords in the index and trapdoor from statistical information about the dataset. To measure the degree of this disturbance, we use the "rank privacy" metric introduced in [12]: rank privacy = (Σ_{i=1}^{k} |r_{i} − r′_{i}|)/k^{2}, where r_{i} is the rank of document i in the retrieved top-k list and r′_{i} is document i's rank in the real ranked results.
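The two metrics can be computed as follows (a toy example; the ranked lists and rank values are made up for illustration):

```python
def precision(retrieved, real_topk):
    """Fraction of the retrieved k documents that are real top-k documents."""
    return len(set(retrieved) & set(real_topk)) / len(retrieved)

def rank_privacy(retrieved, real_rank):
    """sum(|r_i - r_i'|) / k^2 over the retrieved top-k list."""
    k = len(retrieved)
    return sum(abs((i + 1) - real_rank[d]) for i, d in enumerate(retrieved)) / k**2

real = ["d1", "d2", "d3", "d4", "d5"]            # true top-5 ranking
got  = ["d2", "d1", "d3", "d7", "d4"]            # retrieved top-5 list
real_rank = {"d1": 1, "d2": 2, "d3": 3, "d4": 4, "d5": 5, "d7": 7}

print(precision(got, real))          # 4/5 = 0.8
print(rank_privacy(got, real_rank))  # (1+1+0+3+1)/25 = 0.24
```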
We compare our scheme with X15 and G19 in terms of precision and rank privacy. Note that an important parameter in those two schemes is a standard deviation σ, which adjusts the relevance scores of the dummy keywords. In the comparison, we set σ = 0.05, a value commonly used in the previous schemes. Besides, in our scheme, we set the number of semantic keywords for each keyword in the dictionary to 100 and the dimension of each node's vector to 1000 (d = 1000). The comparison under these settings is illustrated in Figure 7.
Figure 7: Comparison of (a) precision and (b) rank privacy among the three schemes.
As Figure 7 shows, as k grows from 10 to 50, the precision of our scheme decreases slightly from 59% to 55%, and the rank privacy increases slightly from 26% to 28%. For X15 and G19, the precision likewise decreases and the rank privacy increases as k grows; this trend holds for all three schemes. Because the vector representations of the index tree and the query in our scheme are highly compressed, some statistical information in the index and the query is lost. Thus, the precision of our scheme is lower than that of X15 and G19, while its rank privacy is correspondingly higher.
4.3. Impact of the Dimension of Vector Representation
The dimension of the vector representation (d), which we set in "Word2vec", is an important parameter of our scheme. We now discuss its impact. Figure 8 shows the impact of d on the efficiency of our scheme: the time costs of index building, trapdoor generation, search, and update all increase as d grows. Figure 9 illustrates the impact of d on precision and rank privacy: as d increases from 200 to 1000, the precision of our scheme increases slightly, while the rank privacy gradually decreases. These observations are consistent with our earlier theoretical analysis. Hence, in the proposed scheme, data users can balance efficiency and accuracy by adjusting the parameter d to satisfy the requirements of different applications.
Figure 8: Impact of d on the efficiency of our scheme.
Figure 9: Impact of d on the precision and rank privacy of our scheme.
4.4. Discussion
From the experimental results, when n = 1000, m = 20000, d = 200, and b = 10000, the index building time is 3 s, the generation time of a single trapdoor is 1.5 ms, and the search time is 36 ms, all of which are much better than those of the previous schemes X15 and G19. This efficiency demonstrates that our scheme is well suited to practical applications, especially mobile cloud settings in which clients have limited computation and storage resources.
The experimental results also show that the precision of our scheme is lower than that of the previous two schemes, while its rank privacy is correspondingly higher. In addition, thanks to the "Word2vec" method, the vector representations used in our scheme capture the semantic information of the documents and queries. Based on these facts, we argue that the proposed scheme is suitable for applications requiring similarity and semantic search, such as mobile recommendation systems, mobile search engines, and online shopping systems.
5. Conclusions
In this paper, by applying "Word2vec" to construct the vector representations of documents and queries and adopting a balanced binary tree to index the documents, we proposed a searchable symmetric encryption scheme supporting dynamic multikeyword ranked search. Compared with previous schemes, ours tremendously reduces the time costs of index building, trapdoor generation, search, and update, and it also significantly reduces the storage cost of the secure index. Since the precision of our scheme can be further improved, in future work we will construct a more accurate scheme based on recent information retrieval techniques.
Data Availability
The data used to support the findings of this study are available from the following website: http://www.cs.cmu.edu/~enron/.
Conflicts of Interest
The authors declare that they have no conflicts of interest regarding the publication of this paper.
Acknowledgments
The authors gratefully acknowledge the support of the National Natural Science Foundation of China under Grants nos. 61402393 and 61601396 and the Nanhu Scholars Program for Young Scholars of XYNU.