Research Article  Open Access
A Multibranch Search Tree-Based Multi-Keyword Ranked Search Scheme over Encrypted Cloud Data
Abstract
Out of privacy concerns, cloud service users choose to encrypt their personal data before outsourcing it to the cloud. However, it is difficult to search encrypted cloud data efficiently, so designing an efficient and accurate search scheme over large-scale encrypted cloud data remains a challenge. In this paper, we integrate the bisecting k-means algorithm with a multibranch tree structure and propose the α-filtering tree search scheme based on bisecting k-means clusters. The index tree is built bottom-up, and a greedy depth-first algorithm filters out nonrelevant document clusters by calculating the relevance score between the filtering vector and the query vector. The α-filtering tree improves search efficiency without loss of search accuracy. Experiments on a real-world dataset demonstrate the effectiveness of our scheme.
1. Introduction
Cloud computing is a new model in enterprise IT that offers high-quality computation, storage, and application capacity. Cloud customers outsource their local data and computation to the cloud server to minimize the data maintenance cost. Thus, protecting users’ privacy while achieving efficient and precise data retrieval from the cloud server has become the focus of recent work.
The traditional way to protect data privacy is to encrypt the original data. However, encryption makes data utilization very challenging. The search schemes based on ciphertext [1–7] can guarantee data privacy, but their search algorithms have high time and space complexity, which makes them unsuitable for cloud data retrieval. To solve this problem, researchers proposed a series of searchable encryption schemes [8–13] based on cryptographic theory. These schemes either do not return high-accuracy retrieval results [8–10, 12] or incur a large time and space overhead [8, 11]. Therefore, it is necessary to design an efficient and usable search scheme.
In this paper, we propose an α-filtering tree search scheme based on bisecting k-means clusters, which achieves efficient multi-keyword ranked search over encrypted cloud data. We use the vector space model and the TF-IDF model to build the keyword dictionary and transform the documents and keywords into “points” in a multidimensional space that can be described by vectors, and then we use the secure inner product to encrypt the document vectors and query vectors. The relevance scores between the document vectors and query vectors are used to obtain the top-k most relevant documents.
Our paper’s main contributions are summarized as follows:
(i) We integrate the bisecting k-means algorithm and a multibranch tree structure, where the bisecting k-means algorithm is used to improve the cluster accuracy, and we propose an α-filtering tree search scheme based on the bisecting k-means clusters.
(ii) We propose a greedy depth-first algorithm to search the α-filtering tree, which improves the multi-keyword search efficiency. By adopting the secure inner product encryption scheme, we achieve privacy-preserving ranked search on the encrypted α-filtering index tree.
(iii) We perform experiments on a real-world dataset and compare our scheme with existing schemes in terms of retrieval efficiency and index storage usage. The results show that our scheme is superior in search efficiency and storage usage.
The rest of the paper is organized as follows: Section 2 introduces the related work, Section 3 introduces the main background knowledge, and Section 4 gives a brief introduction to our system model, threat model, and design goals. The constructions of the α-filtering tree and the search algorithm are presented in Section 5. Section 6 presents the secure search scheme and its security analysis, and Section 7 gives the experiment results and their analysis. Finally, the conclusion is given in Section 8.
2. Related Work
Searchable encryption schemes implement keyword searches over encrypted outsourced data, which allow users to store their personal data on the cloud server without privacy concerns. Recently, an increasing number of scholars conduct research in this area. We discuss the related work on the development of searchable encryption schemes’ performance and function.
2.1. Single-Keyword Searchable Encryption
Song et al. [14] first proposed a symmetric searchable encryption scheme. They encrypted each keyword in the document set separately and searched the entire dataset by sequential scanning, so the search time of the scheme is linear in the overall size of the document set. Goh [15] proposed a searchable encryption scheme based on the Bloom filter. The scheme is efficient in that its calculation overhead is not related to the number of documents in the dataset. However, because the Bloom filter can produce false positives, the cloud server may return documents that do not contain the search keywords. The scheme of Chang and Mitzenmacher [16] used two indexes. The first index stores and manages a dictionary premade by the user. The second requires two rounds of interaction between the user and the cloud server, which affects the user experience, but it can achieve the same search efficiency as Goh [15]. Curtmola et al. [17] proposed two search schemes, SSE-1 and SSE-2. SSE-1 is secure against chosen-keyword attacks (CKA1), and SSE-2 is secure against adaptive chosen-keyword attacks (CKA2). The search time cost of their schemes is proportional to the number of keywords retrieved. Boneh et al. [18] adopted a searchable encryption structure that allows anyone to store data using the public key, but their scheme requires a large amount of calculation.
2.2. Multi-Keyword Search Schemes
Multi-keyword searchable encryption allows the user to submit multiple search keywords to retrieve the most relevant documents. These schemes can be further classified into ranked search and traditional search. In traditional search, most schemes are either conjunctive keyword search, which returns all the documents containing the search keywords, or conjunctive-subset keyword search, which returns the documents containing a subset of the keywords. However, these traditional schemes do not support ranked search. Cao et al. [8] first achieved a privacy-preserving multi-keyword ranked search scheme. In their scheme, documents and search keywords are described by dictionary-scale vectors, and coordinate matching is used to rank the documents. Since the weights of different keywords in documents are not considered, the retrieval result obtained by the scheme lacks accuracy, and the search time of the scheme is linear in the scale of the dataset. Sun et al. [9] proposed a multi-keyword ranked scheme that uses the TF-IDF vector space model and cosine distance measurement to build an index tree structure. The experiment shows that their scheme is more efficient than linear search but lacks accuracy. Orencik et al. [10] adopted Locality-Sensitive Hashing to cluster similar documents, but their ranked search result is also not accurate. Xia et al. [11] adopted the vector space model and the KBB tree to build a dynamic multi-keyword ranked search scheme, which obtains the ranked result more precisely. However, as the scale of the document set increases, the space cost of the index tree becomes large and the pruning effect of the search algorithm is reduced, resulting in a decrease in search efficiency.
To enhance the usability and functionality of searchable encryption, many schemes that support fuzzy keyword search [19–23], conjunctive keyword search [3, 24–26], and similarity search [27–30] have also been presented. Dynamic schemes support updates on the dataset, which largely enhances usability. The first dynamic searchable encryption scheme was proposed by van Liesdonk et al. [31] and supports a limited number of updates. After their work, many dynamic searchable encryption schemes were proposed [32–35]. Verifiable schemes can check the integrity of search results when the cloud server is not honest, and verifiable search is studied in [26, 36, 37]. Some works [38, 39] extend searchable encryption to other data types such as multimedia data.
As noted above, the index tree of Xia et al. [11] loses both space efficiency and pruning power as the document set grows. We therefore propose a multibranch index tree to overcome this problem. By applying a clustering algorithm to the document set, we can further increase the search efficiency. Moreover, the multibranch tree also reduces the space cost of the index tree.
3. Notations and Background Knowledge
3.1. Notations
For simplicity, we define our mainly used symbols as follows:
(i) W: the keyword dictionary, which contains n keywords, denoted as W = {w_{1}, w_{2}, …, w_{n}}
(ii) D: the plaintext document set, which contains m documents, denoted as D = {d_{1}, d_{2}, …, d_{m}}
(iii) F_{V}: the plaintext document vector
(iv) the encrypted form of F_{V}
(v) I: the α-filtering index tree
(vi) the encrypted form of I
(vii) Q: the set of query keywords
(viii) V_{Q}: the query vector of Q
(ix) T_{Q}: the query trapdoor
(x) RL: the ranked search result list
(xi) μ: the threshold of the maximum number of documents in an atom cluster
(xii) α: the threshold of the maximum number of child nodes in a node of the α-filtering tree
3.2. Vector Space Model
Among many information retrieval models, the vector space model is the most popular method of relevance measurement, and we adopt the TF-IDF model for feature extraction. It is widely used in plaintext multi-keyword retrieval. TF (term frequency) is the number of occurrences of the keyword in the document f divided by the total number of words contained in the document f. IDF (inverse document frequency) is the number of documents divided by the number of documents containing the keyword.
The keyword dictionary is first generated by filtering the stop words from all the words contained in the document set D. Then, the document vector F_{V} and the query vector q are generated according to the keyword dictionary W. The dimension of F_{V} and q is equal to the scale of the keyword dictionary, and each dimension represents a corresponding keyword w_{i}. Each dimension of F_{V} holds the normalized TF value of the corresponding keyword, and each dimension of q holds its normalized IDF value. The TF value and the IDF value of the keyword w_{i} are calculated as follows:where and .
3.3. Relevance Measurement
The inner product operation is performed on two equal-length vectors, and the relevance between two vectors is quantified by the inner product score. The larger the score, the higher the relevance between the two vectors. The relevance score is calculated as follows:
$$\mathrm{Score}(F_V, V_Q) = F_V \cdot V_Q = \sum_{i=1}^{n} F_V[i] \, V_Q[i]. \quad (2)$$
We make the following remarks about equation (2):
(i) If F_{V} is a document vector and V_{Q} is a search vector, Score (F_{V}, V_{Q}) is the relevance score between the document and the search keywords.
(ii) If F_{V} is a filtering vector of an index tree node and V_{Q} is a search vector, Score (F_{V}, V_{Q}) is the relevance score between the upper bound vector of the documents stored in this node and the search keywords.
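The vector construction of Section 3.2 and the inner product score of equation (2) can be sketched together as follows. This is a minimal illustration that follows only the verbal definitions of TF and IDF given in the text and omits the normalization step; the function names and the toy documents are hypothetical.

```python
from collections import Counter

def tf_vector(doc_words, dictionary):
    """TF: occurrences of each keyword divided by the document's word count."""
    counts = Counter(doc_words)
    total = len(doc_words)
    return [counts[w] / total if total else 0.0 for w in dictionary]

def idf_vector(documents, dictionary):
    """IDF: number of documents divided by the number containing the keyword."""
    m = len(documents)
    doc_sets = [set(d) for d in documents]
    idf = []
    for w in dictionary:
        df = sum(1 for s in doc_sets if w in s)
        idf.append(m / df if df else 0.0)
    return idf

def query_vector(query_words, idf, dictionary):
    """Query vector: IDF value for each queried keyword, 0 elsewhere."""
    wanted = set(query_words)
    return [idf[i] if w in wanted else 0.0 for i, w in enumerate(dictionary)]

def score(fv, vq):
    """Relevance score: inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(fv, vq))

# Toy data (assumption: stop words already removed).
docs = [["cloud", "search", "cloud"],
        ["tree", "search"],
        ["cloud", "tree", "index", "tree"]]
dictionary = sorted({w for d in docs for w in d})
idf = idf_vector(docs, dictionary)
vq = query_vector(["cloud", "search"], idf, dictionary)
scores = [score(tf_vector(d, dictionary), vq) for d in docs]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Here `ranking` lists document positions from most to least relevant for the query.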
3.4. Bisecting k-Means Cluster
In data mining, the bisecting k-means algorithm is a cluster analysis algorithm. To bisect a cluster, 2 initial centroids are selected, each point is assigned to the nearest centroid in turn, and the points that are assigned to the same centroid form a cluster. The centroid of each cluster is then updated from the points assigned to it. Assignments and updates are repeated until the clusters no longer change, at which point the clustering is complete. We use the cosine distance to measure the distance from the point to the centroid, which is based on the cosine of the angle between the two vectors:
$$\cos(p, c) = \frac{p \cdot c}{\|p\| \, \|c\|},$$
where p is the point’s vector, c is the centroid’s vector, and ‖p‖ and ‖c‖ are the norms of p and c.
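A single bisecting step as described above might look like the following sketch. It treats the point with the larger cosine value as the nearer one and seeds the two centroids deterministically; both choices, like the function names, are assumptions made for illustration and are not taken from the paper's Algorithm 1.

```python
import math

def cosine(p, c):
    """Cosine of the angle between point p and centroid c (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(p, c))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in c))
    return dot / norm if norm else 0.0

def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(points) for col in zip(*points)]

def bisect(points, max_iter=100):
    """Split points into two clusters with 2-means under the cosine measure."""
    # Assumed seeding: the first point and the point least similar to it.
    c0 = points[0]
    c1 = min(points, key=lambda p: cosine(p, c0))
    centroids = [c0, c1]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            0 if cosine(p, centroids[0]) >= cosine(p, centroids[1]) else 1
            for p in points
        ]
        if new_assignment == assignment:
            break  # clusters no longer change
        assignment = new_assignment
        for k in (0, 1):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centroids[k] = centroid(members)
    left = [p for p, a in zip(points, assignment) if a == 0]
    right = [p for p, a in zip(points, assignment) if a == 1]
    return left, right  # right may be empty if all points are identical

points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
left, right = bisect(points)
```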
3.5. Secure Inner Product Operation
The special matrix encryption proposed in [8] achieves a privacy-preserving vector inner product. Assuming that p and q are two n-dimensional vectors, the user encrypts them to \tilde{p} and \tilde{q} by calculating \tilde{p} = M^{T} p and \tilde{q} = M^{-1} q, where M is a random n × n invertible matrix. Therefore, the inner product of the original vectors can be obtained from the inner product of their encrypted forms alone:
$$\tilde{p} \cdot \tilde{q} = (M^{T} p)^{T} (M^{-1} q) = p^{T} M M^{-1} q = p \cdot q.$$
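The identity above can be checked numerically. The sketch below assumes the common form of this construction, in which one vector is multiplied by the transpose of M and the other by its inverse; it uses exact rational arithmetic so that the two inner products compare equal. All helper names are hypothetical.

```python
from fractions import Fraction
import random

def transpose(m):
    return [list(row) for row in zip(*m)]

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def invert(m):
    """Gauss-Jordan inversion over exact fractions; ValueError if singular."""
    n = len(m)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("singular matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [x / pv for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def random_invertible(n, rng):
    """Draw random integer matrices until one is invertible."""
    while True:
        m = [[rng.randint(-9, 9) for _ in range(n)] for _ in range(n)]
        try:
            return m, invert(m)
        except ValueError:
            continue

rng = random.Random(7)
M, M_inv = random_invertible(3, rng)
p = [Fraction(1, 2), Fraction(0), Fraction(1, 4)]
q = [Fraction(2), Fraction(3), Fraction(4)]
p_enc = mat_vec(transpose(M), p)   # encrypted form of p
q_enc = mat_vec(M_inv, q)          # encrypted form of q
```

The server only ever sees `p_enc` and `q_enc`, yet `dot(p_enc, q_enc)` equals `dot(p, q)`.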
4. Model and Problem Formulation
4.1. System Model
In this paper, there are 3 entities in our system model: data owner, data user, and cloud server as shown in Figure 1. These three entities collaborate as follows.
The data owner has the local dataset D and wants to outsource it in a secure form to the cloud server while still providing the search service for users. In our scheme, the data owner first generates the searchable index tree I according to D. Then, it uses the secure key to encrypt both D and I into their encrypted forms. After that, it shares the secure key with the data user through access control and outsources the encrypted document set and the encrypted index tree to the cloud server.
The cloud server provides both the storage service and the search service. It stores the encrypted index tree and the encrypted document set. After it receives the search trapdoor from the data user, it performs the secure search on the encrypted index tree and returns the search result to the data user.
The data user is authorized to access the document set. It generates the search trapdoor from the search keywords Q through the proposed search scheme and sends the trapdoor as the search request to the cloud server. After receiving the search result, it uses the secure key to decrypt the encrypted documents and obtain the plaintext documents.
4.2. Threat Model
We adopt the same “honest-but-curious” threat model as the current work [8, 9, 11, 40, 42]. That is to say, the cloud server follows the user’s instructions honestly and precisely, but it could curiously analyze the received data to obtain additional information about the dataset. Two threat models were proposed by Cao et al. [8] and are adopted in our work as follows: Known Ciphertext Model: the cloud server can access the ciphertext dataset, the encrypted index tree, and the search trapdoor, and thus it can conduct a ciphertext-only attack. Known Background Model: in this stronger model, the cloud server has more dataset-related information than in the known ciphertext model. The cloud server can have statistical information about the relation between the search trapdoor and the search result. Then, it could infer or recognize some of the search keywords in the trapdoor from this additional information.
4.3. Design Goals
To ensure privacy, efficiency, and accuracy in the multi-keyword ranked search over encrypted cloud data, our system design should meet the following requirements: Search Efficiency: the proposed search scheme should be more efficient than other multi-keyword search schemes. Search Accuracy: the proposed search scheme should guarantee the accuracy of the search result. Privacy Preserving: the proposed scheme should ensure document privacy, index privacy, trapdoor privacy, trapdoor unlinkability, and keyword privacy in the search process.
5. Index and Search Algorithm
In this section, we discuss the index construction method and the search method based on the index tree, and we give the corresponding algorithms. We first construct a document atom cluster list by using the bisecting k-means algorithm. Then, based on the generated atom cluster list, we build the α-filtering tree and propose a corresponding greedy depth-first search algorithm for multi-keyword ranked search.
5.1. Atom Cluster List Generation Algorithm
Considering the document set D as the input raw cluster, we use the bisecting k-means algorithm to perform top-down bisecting clustering until every generated subcluster contains no more than μ documents (Algorithms 1 and 2), and thus a binary clustering tree is built as shown in Algorithm 2. Here, μ is the given threshold for clustering. Then, we traverse the leaf clusters in the generated binary clustering tree, and the atom cluster list L is constructed in Algorithm 3, which is used for building the α-filtering index tree.



Definition 1. Atom Cluster. The leaf clusters in the binary tree generated by Algorithm 1 are the atom clusters, where the number of documents in each atom cluster is no more than μ.
Theorem 1. Assuming that the list of the atom clusters generated by Algorithm 1 is L = {C_{1}, C_{2}, …, C_{t}}, we have the following properties:
(1) C_{1} ∪ C_{2} ∪ … ∪ C_{t} = D
(2) C_{i} ∩ C_{j} = ∅ for all i ≠ j
We illustrate the generation process of the atom cluster list L in Algorithms 1–3 by an example. We assume that the document set is D = {d_{1}, d_{2}, …, d_{15}} and μ = 3. The first round of bisecting clustering is performed on D, and two subclusters are generated as shown in Figure 2. With the same process, the subclusters in the second and third layers are each divided into two clusters in turn, and a subcluster stops being divided when the number of documents it contains is less than or equal to 3. Finally, a binary clustering tree is formed, where the leaf nodes are C_{1} = {d_{1}, d_{2}, d_{3}}, C_{2} = {d_{4}, d_{5}, d_{6}}, C_{3} = {d_{9}, d_{10}}, C_{4} = {d_{11}, d_{12}}, C_{5} = {d_{7}, d_{8}}, and C_{6} = {d_{13}, d_{14}, d_{15}}, as shown in Figure 2. Then, the algorithm visits the leaf nodes of the binary clustering tree by in-order traversal, and the atom cluster list L = {C_{1}, C_{2}, C_{3}, C_{4}, C_{5}, C_{6}} is generated.
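The top-down generation of the atom cluster list can be sketched as a short recursion. This is an illustration rather than a transcription of Algorithms 1–3: the bisecting step is passed in as a function, and the example uses a hypothetical `halve` splitter in place of bisecting k-means, so the resulting clusters differ from those in Figure 2.

```python
def atom_clusters(docs, mu, bisect):
    """Recursively bisect docs until every cluster holds at most mu documents.

    Returns the leaf clusters from left to right, i.e. the atom cluster list.
    """
    if mu < 1:
        raise ValueError("mu must be at least 1")
    if len(docs) <= mu:
        return [docs]
    left, right = bisect(docs)
    if not left or not right:
        # Assumed fallback: a degenerate split is replaced by halving
        # so that the recursion always terminates.
        mid = len(docs) // 2
        left, right = docs[:mid], docs[mid:]
    return atom_clusters(left, mu, bisect) + atom_clusters(right, mu, bisect)

def halve(docs):
    """Stand-in splitter for the example (not a clustering algorithm)."""
    mid = len(docs) // 2
    return docs[:mid], docs[mid:]

D = [f"d{i}" for i in range(1, 16)]
L = atom_clusters(D, 3, halve)
```

Whatever splitter is used, the clusters in `L` are pairwise disjoint, cover `D`, and each holds at most μ documents.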
5.2. α-Filtering Tree
Definition 2. Upper Bound Vector. For a set of n-dimensional vectors V = {v_{1}, v_{2}, …, v_{m}}, V’s upper bound vector is an n-dimensional vector V_{UB} = UpBound{v_{1}, v_{2}, …, v_{m}}, where V_{UB}[i] = max{v_{1}[i], v_{2}[i], …, v_{m}[i]}, i = 1, 2, …, n.
Definition 3. α-Filtering Tree. A node u in the α-filtering tree is a triple, which is denoted as u = ⟨FV, PL, DC⟩, where u·FV is an n-dimensional filtering vector, u·PL is a list of at most α child node pointers, and u·DC stores documents when u is a leaf node.
(1) If u is a leaf node, then u·PL = ∅, u·DC = {d_{1}, d_{2}, …, d_{x}} which corresponds to an atom cluster where x ≤ μ, and u·FV = UpBound{FV_{1}, FV_{2}, …, FV_{x}}, where FV_{i} is the document vector of d_{i}.
(2) If u is a nonleaf node, then u·DC = ∅, u·PL has at most α pointers, i.e., u·PL = {u·PL[1], u·PL[2], …, u·PL[g]}, where u·PL[i] points to the i^{th} child node and 1 ≤ i ≤ g ≤ α, and u·FV = UpBound{u·PL[1]·FV, u·PL[2]·FV, …, u·PL[g]·FV}, which is the upper bound vector of the filtering vectors of the child nodes in u·PL.
We give the construction procedures of the α-filtering tree in Algorithm 4.
Algorithm 4 builds the α-filtering tree from the atom cluster list. Tree nodes are created during each round of steps 8–21. The original atom cluster list is treated as the first child node list (CNL). In each round, up to α nodes are fetched from CNL at a time, and a parent node is created over these nodes and added to the parent node list (PNL). After all the nodes in CNL have been fetched, the parent node list (PNL) of this round is complete. If PNL has more than 1 node, we move all nodes in PNL to CNL and start the next round. Otherwise, the only node in PNL is the root of the generated index tree.
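The bottom-up construction just described can be sketched as follows. The `Node` class mirrors the triple of Definition 3, and the round structure follows the CNL/PNL description; this is a hedged illustration, not the paper's Algorithm 4, and the input format (clusters as lists of `(doc_id, vector)` pairs) is an assumption.

```python
class Node:
    """Tree node holding a filtering vector FV, child list PL, and documents DC."""
    def __init__(self, fv, children=None, docs=None):
        self.FV = fv
        self.PL = children or []
        self.DC = docs or []

def up_bound(vectors):
    """Upper bound vector: component-wise maximum of the given vectors."""
    return [max(col) for col in zip(*vectors)]

def build_tree(clusters, alpha):
    """Build the tree bottom-up from atom clusters of (doc_id, vector) pairs."""
    if alpha < 2 or not clusters:
        raise ValueError("need alpha >= 2 and at least one atom cluster")
    # Leaf level: one node per atom cluster (the first child node list).
    cnl = [Node(up_bound([v for _, v in c]), docs=list(c)) for c in clusters]
    while len(cnl) > 1:
        pnl = []
        for i in range(0, len(cnl), alpha):
            group = cnl[i:i + alpha]  # fetch up to alpha nodes at a time
            pnl.append(Node(up_bound([n.FV for n in group]), children=group))
        cnl = pnl  # the parents become the next round's children
    return cnl[0]

def height(node):
    return 1 + max((height(c) for c in node.PL), default=0)

# Six hypothetical atom clusters with two 3-dimensional documents each.
clusters = [
    [(f"d{2 * i + 1}", [0.1 * i, 0.2, 0.0]), (f"d{2 * i + 2}", [0.0, 0.1 * i, 0.3])]
    for i in range(6)
]
root = build_tree(clusters, alpha=3)
```

With six atom clusters and α = 3, the root has two children, each of which has three leaf children.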

Theorem 2. The height of an α-filtering tree with t leaf nodes is ⌈log_{α} t⌉ + 1.
Proof. We assume that the length of the atom cluster list L is t, that is, the number of leaf nodes of the α-filtering tree is t. According to Algorithm 4, after the 1^{st}, 2^{nd}, …, x^{th} rounds of processing, the number of currently generated parent nodes becomes ⌈t/α⌉, ⌈t/α^{2}⌉, …, ⌈t/α^{x}⌉. When the number of currently generated parent nodes is 1, the construction of the α-filtering tree is finished, so ⌈t/α^{x}⌉ = 1. Then, we deduce x = ⌈log_{α} t⌉. Since the height of the tree is increased by 1 for each round of merging and the initial height of the tree is 1, the height of the α-filtering tree with t leaf nodes is ⌈log_{α} t⌉ + 1.
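The round-counting argument of the proof suggests a height of ⌈log_α t⌉ + 1, and this can be checked by simulating the rounds directly. The sketch below is an illustration with hypothetical function names; it uses integer arithmetic only, to avoid floating-point error in the logarithm.

```python
def height_by_rounds(t, alpha):
    """Simulate the merging rounds: each round groups nodes alpha at a time."""
    height, nodes = 1, t
    while nodes > 1:
        nodes = -(-nodes // alpha)  # ceil(nodes / alpha)
        height += 1
    return height

def height_closed_form(t, alpha):
    """ceil(log_alpha(t)) + 1, computed as the smallest x with alpha**x >= t."""
    x, power = 0, 1
    while power < t:
        power *= alpha
        x += 1
    return x + 1

# Example: t = 6 leaves and alpha = 3 give 6 -> 2 -> 1 nodes, i.e. 2 rounds.
example = height_by_rounds(6, 3)
```

For t = 6 and α = 3, two rounds are needed, so the tree has three levels.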
5.3. Multi-Keyword Ranked Search Algorithm
By adopting the greedy depth-first search algorithm on the tree index, we can obtain the top-k documents efficiently by excluding in advance every subtree that certainly does not contain any search result documents. We give the greedy depth-first search algorithm in this section.
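Definition 4. For a query Q whose vector is V_{Q} and two nodes u and u’, if Score (V_{Q}, u·FV) ≥ Score (V_{Q}, u’·FV), then u has a relevance score with Q that is higher than or equal to that of u’.
Theorem 3. We assume that u = ⟨FV, PL, DC⟩ is a nonleaf node in the α-filtering tree and u·PL stores g child nodes, i.e., u·PL = {u·PL[1], u·PL[2], …, u·PL[g]} and g ≤ α. For a query Q, the relevance score of u with Q is higher than or equal to that of every child node u·PL[i], i = 1, 2, …, g.
Proof. It suffices to prove that Score (V_{Q}, u·FV) ≥ Score (V_{Q}, u·PL[i]·FV) for i = 1, 2, …, g.
Every element of the n-dimensional filtering vector u·FV is generated by the following equation:
$$u{\cdot}FV[j] = \max\{u{\cdot}PL[1]{\cdot}FV[j], \; u{\cdot}PL[2]{\cdot}FV[j], \; \ldots, \; u{\cdot}PL[g]{\cdot}FV[j]\}, \quad j = 1, 2, \ldots, n.$$
Since the elements of the query vector V_{Q} are nonnegative, each term V_{Q}[j] · u·FV[j] of the inner product is not less than the corresponding term of any child node. Thus, Score (V_{Q}, u·FV) is not less than the relevance score between any child node’s filtering vector and the query vector, which proves the theorem.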
During the search process, for a given Q, if the relevance score between the filtering vector of a subtree’s root node and the query vector is not higher than the threshold of the candidate result list, then none of the nodes in this subtree can contain a candidate, according to Theorem 3. Thus, we can directly ignore this subtree and the search efficiency is improved. This is the pruning criterion of the greedy depth-first search algorithm. Adopting this idea, we propose the greedy depth-first search algorithm shown in Algorithm 5.
In Figure 3, we construct a 3-filtering index tree example to further illustrate the multi-keyword ranked search algorithm. The index tree is built after the leaf nodes are generated from the atom cluster list, and the intermediate nodes are generated based on the leaf nodes. We assume that the query vector is V_{Q} = (0.5, 0.5, 0, 0) and that the top-3 ranked documents are requested. When the search starts, the algorithm first visits u_{11}, u_{21}, and u_{31} recursively along the left subtree and finds that u_{31} is a leaf node which has 3 documents. The algorithm puts all the documents into the result list RL = {d_{1}, d_{2}, d_{3}}, where the relevance scores are 0.3, 0.35, and 0.3, respectively. Then u_{32} is accessed, and the relevance score between its filtering vector and the query vector is 0.2, which is less than 0.3; therefore, RL remains unchanged. After that, u_{33} is accessed with the relevance score 0.35, so d_{9} and d_{10} are added to RL, replacing d_{1} and d_{3}. Finally, the algorithm checks the subtree rooted at u_{22} and finds that it does not need to be searched. The search algorithm is finished.
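The pruning criterion and the walk-through above can be sketched in code. This is not the paper's Algorithm 5: visiting children in descending score order, applying no pruning until RL holds k documents, and all document vectors below are assumptions, chosen so that the trace loosely resembles the Figure 3 example.

```python
import heapq

def score(a, b):
    return sum(x * y for x, y in zip(a, b))

def up_bound(vectors):
    return [max(col) for col in zip(*vectors)]

def leaf(name, docs):
    return {"name": name, "FV": up_bound([v for _, v in docs]),
            "children": [], "docs": docs}

def inner(name, children):
    return {"name": name, "FV": up_bound([c["FV"] for c in children]),
            "children": children, "docs": []}

def search(root, vq, k):
    """Greedy depth-first top-k search with upper-bound pruning."""
    rl = []        # min-heap of (score, doc_id); rl[0] holds the threshold
    visited = []   # names of nodes that were not pruned
    def visit(node):
        # Prune: the subtree's upper-bound score cannot beat the threshold.
        if len(rl) == k and score(node["FV"], vq) <= rl[0][0]:
            return
        visited.append(node["name"])
        if node["docs"]:
            for doc_id, vec in node["docs"]:
                s = score(vec, vq)
                if len(rl) < k:
                    heapq.heappush(rl, (s, doc_id))
                elif s > rl[0][0]:
                    heapq.heapreplace(rl, (s, doc_id))
        else:
            # Greedy order: most promising child first.
            for child in sorted(node["children"],
                                key=lambda c: score(c["FV"], vq), reverse=True):
                visit(child)
    visit(root)
    return sorted(rl, reverse=True), visited

u31 = leaf("u31", [("d1", [0.6, 0.0, 0.1, 0.0]), ("d2", [0.3, 0.4, 0.0, 0.0]),
                   ("d3", [0.2, 0.4, 0.0, 0.3])])
u32 = leaf("u32", [("d4", [0.1, 0.1, 0.9, 0.0]), ("d5", [0.2, 0.2, 0.0, 0.5]),
                   ("d6", [0.0, 0.2, 0.3, 0.3])])
u33 = leaf("u33", [("d9", [0.4, 0.3, 0.0, 0.0]), ("d10", [0.3, 0.4, 0.2, 0.0])])
u34 = leaf("u34", [("d7", [0.1, 0.2, 0.5, 0.0]), ("d8", [0.2, 0.1, 0.0, 0.6])])
u35 = leaf("u35", [("d13", [0.0, 0.1, 0.7, 0.1]), ("d14", [0.1, 0.0, 0.2, 0.8])])
u11 = inner("u11", [inner("u21", [u31, u32, u33]), inner("u22", [u34, u35])])

results, visited = search(u11, [0.5, 0.5, 0.0, 0.0], k=3)
```

In this run the leaf u32 and the whole subtree under u22 are pruned, so only two of the five leaves are scored.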

6. Effective and Secured Multi-Keyword Ranked Search Scheme
In this section, we construct the secure search scheme by using the secure kNN algorithm [41]. The data owner constructs the index tree with the document set and then uses the secure keys to encrypt the document sets and index tree, respectively. The data user submits search request to the cloud server by using query keywords. The cloud server performs search algorithm on the index tree and returns the search result documents.
6.1. Secure Search Scheme
6.1.1. GenKey ()
The data owner generates the secure key SK, which is used for encrypting the documents and the index tree. SK consists of a symmetric encryption key, a bit vector S, and two matrices M_{1} and M_{2}. The symmetric encryption key is used for document encryption; it is shared only with the data user and is protected from the cloud server. S is a bit vector for vector splitting; each dimension of S is randomly chosen to be 0 or 1, and the numbers of 0s and 1s should be nearly equal. M_{1} and M_{2} are both n × n-dimensional randomly generated invertible matrices.
6.1.2. BuildIndex (D, SK)
The data owner first performs the index tree construction algorithms discussed in Sections 5.1 and 5.2 to generate the plaintext index tree I on the documents in D. Then, the data owner encrypts the index tree into its encrypted form. Specifically, for each document vector and each node’s filtering vector, we use the bit vector S to split it into two vectors. For simplicity, we use V to represent one of these vectors, and the splitting procedures are as follows:
Then the data owner encrypts the split vectors with the secure matrices M_{1} and M_{2}. After that, the data owner encrypts the documents in each leaf node’s atom cluster with the symmetric encryption key, and the encrypted index tree is generated. Finally, the data owner outsources the encrypted index tree and the encrypted documents to the cloud server.
6.1.3. GenTrapdoor (Q, SK)
The data user generates the query vector V_{Q} according to the query keywords in Q. Then the secure key SK is adopted to generate the corresponding trapdoor T_{Q}. The generation of T_{Q} is similar to the encryption procedures of document vectors. First, V_{Q} is split into two vectors according to the following equation
Then, the data user encrypts the split vectors with the secure matrices to form the trapdoor T_{Q}. Finally, T_{Q} is submitted to the cloud server as the search command.
6.1.4. SearchIndex (, T_{Q}, k)
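The cloud server receives the trapdoor T_{Q} and performs the search algorithm on the secure index tree. Then, the cloud server returns the encrypted top-k document result list RL to the data user, who decrypts the encrypted documents, and the search process is finished. The special matrix encryption allows the inner product of two vectors to be obtained from the inner product of their encrypted forms alone. Writing V′ and V″ for the two parts of a split document or filtering vector V, V_{Q}′ and V_{Q}″ for the two parts of the split query vector V_{Q}, and taking the encrypted forms to be {M_{1}^{T} V′, M_{2}^{T} V″} and {M_{1}^{-1} V_{Q}′, M_{2}^{-1} V_{Q}″} as in the secure kNN construction, this is illustrated as follows:
$$(M_1^{T} V') \cdot (M_1^{-1} V_Q') + (M_2^{T} V'') \cdot (M_2^{-1} V_Q'') = V'^{T} M_1 M_1^{-1} V_Q' + V''^{T} M_2 M_2^{-1} V_Q'' = V' \cdot V_Q' + V'' \cdot V_Q'' = V \cdot V_Q.$$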
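The BuildIndex, GenTrapdoor, and SearchIndex steps can be tied together in a small sketch of secure kNN style splitting and encryption. The split convention used here (index vectors are randomly shared where S[i] = 1 and copied where S[i] = 0, with the reverse for query vectors) follows the usual form of this construction and is an assumption, as are all function names; exact rational arithmetic is used so that the scores compare equal.

```python
from fractions import Fraction
import random

def transpose(m):
    return [list(row) for row in zip(*m)]

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def invert(m):
    """Gauss-Jordan inversion over exact fractions; ValueError if singular."""
    n = len(m)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("singular matrix")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [x / pv for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def random_invertible(n, rng):
    while True:
        m = [[rng.randint(-9, 9) for _ in range(n)] for _ in range(n)]
        try:
            return m, invert(m)
        except ValueError:
            continue

def split(v, s, rng, split_on):
    """Where s[i] == split_on, share v[i] randomly so that v1[i] + v2[i] == v[i];
    elsewhere copy v[i] into both parts."""
    v1, v2 = [], []
    for x, bit in zip(v, s):
        if bit == split_on:
            r = Fraction(rng.randint(-100, 100), 10)
            v1.append(r)
            v2.append(x - r)
        else:
            v1.append(x)
            v2.append(x)
    return v1, v2

def encrypt_index_vector(v, s, m1, m2, rng):
    a, b = split(v, s, rng, split_on=1)
    return mat_vec(transpose(m1), a), mat_vec(transpose(m2), b)

def gen_trapdoor(vq, s, m1_inv, m2_inv, rng):
    a, b = split(vq, s, rng, split_on=0)
    return mat_vec(m1_inv, a), mat_vec(m2_inv, b)

def encrypted_score(enc_v, trapdoor):
    return dot(enc_v[0], trapdoor[0]) + dot(enc_v[1], trapdoor[1])

rng = random.Random(11)
S = [1, 0, 1, 0]
M1, M1_inv = random_invertible(4, rng)
M2, M2_inv = random_invertible(4, rng)
V = [Fraction(3, 10), Fraction(4, 10), Fraction(0), Fraction(1, 10)]
VQ = [Fraction(1, 2), Fraction(1, 2), Fraction(0), Fraction(0)]
enc_v = encrypt_index_vector(V, S, M1, M2, rng)
trapdoor = gen_trapdoor(VQ, S, M1_inv, M2_inv, rng)
```

The server computes `encrypted_score(enc_v, trapdoor)` without seeing `V` or `VQ`, and the value equals their plaintext inner product.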
To protect trapdoor unlinkability and keyword privacy under the known background model, we should prevent the server from calculating the exact value of the relevance score between T_{Q} and F_{V}, which can leak TF distribution information. Thus, we add some phantom terms [11] to the vectors generated in our scheme to disturb the relevance score calculation, at the cost of a decrease in search accuracy.
In the enhanced scheme, we generate (n + n’) × (n + n’)-dimensional secure matrices, and the document vectors are extended to n + n’ dimensions. Each extended element F_{V}[n + i], i = 1, 2, …, n’, is set to a random number β_{i}. Similarly, the query vector is extended to an (n + n’)-dimensional vector, and the extended elements are randomly set to 1 or 0. Thus, the relevance score between the query trapdoor and the document vector is equal to F_{V} · V_{Q} + Σβ_{i}, where the sum runs over the indices i with V_{Q}[n + i] = 1. The randomness of Σβ_{i} can ensure the privacy against the known background model.
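The effect of the phantom terms on the score can be sketched in plaintext, leaving the matrix encryption aside. The noise range `scale` and the function names are hypothetical; the paper refers to [11] for how the random values should be chosen.

```python
import random

def score(a, b):
    return sum(x * y for x, y in zip(a, b))

def extend_document(fv, n_prime, rng, scale=0.01):
    """Append n' random phantom values to a document vector."""
    return list(fv) + [rng.uniform(0, scale) for _ in range(n_prime)]

def extend_query(vq, n_prime, rng):
    """Append n' random bits to a query vector; a 1 selects a phantom value."""
    return list(vq) + [rng.randint(0, 1) for _ in range(n_prime)]

rng = random.Random(1)
n_prime = 5
fv = [0.3, 0.4, 0.0, 0.0]
vq = [0.5, 0.5, 0.0, 0.0]
fv_ext = extend_document(fv, n_prime, rng)
vq_ext = extend_query(vq, n_prime, rng)

# Sum of the phantom values selected by the query's extended bits.
noise = sum(beta for beta, bit in zip(fv_ext[len(fv):], vq_ext[len(vq):]) if bit == 1)
obfuscated = score(fv_ext, vq_ext)
```

The obfuscated score equals the true score plus the selected phantom values, so a larger `scale` hides more of the score but can reorder closely ranked documents.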
6.2. Security Analysis
In this paper, we construct the tree-based secure search scheme in the same way as [11, 42] to achieve searchable encryption, which means that the security of our scheme should be the same as that of [11, 42]. We give the proof briefly as follows:
(i) Document privacy: we use traditional symmetric encryption on the documents before outsourcing them to the cloud server. As long as the secure key is kept from the adversary, the document privacy is protected in our scheme.
(ii) Index and trapdoor privacy: the document vectors and query vectors store the TF and IDF values of the corresponding keywords and are encrypted with the secure matrices generated by secure kNN after being randomly split. The secure matrices are both randomly generated invertible matrices. The adversary cannot calculate the secure matrices from the encrypted vectors alone. Therefore, the index and trapdoor privacy is protected in our scheme.
(iii) Trapdoor unlinkability: the query trapdoor is randomly split by the split vector S for each search, so the trapdoors differ even for the same search request. Thus, the trapdoor unlinkability is guaranteed. However, the cloud server can still link the same search requests by inferring the access pattern and the ranked result of the searches. To solve this problem, we can expand the vectors used in our secure scheme by adding phantom dimensions to interfere with the relevance score. With phantom terms, the search results of the same request could be different. However, the search accuracy can decrease, and the balance between privacy and accuracy is discussed in [11].
(iv) Keyword privacy: the index and trapdoor privacy is protected in our scheme, which means that keyword privacy is also protected in the known ciphertext model. In the known background model, the relevance score between the documents and the query vector can leak TF information about the query keywords. If a search request has only one search keyword or one of the search keywords has a high TF value, the cloud server can easily infer this keyword from its statistical information about the TF distribution of keywords. Similarly, to solve this problem, we add phantom terms to obfuscate the relevance score between the query trapdoor and the document vector. That is to say, the TF-IDF value varies across search requests. Thus, the cloud server cannot link the keywords with their TF distribution, and the keyword privacy is enhanced.
7. Performance Analysis
We evaluate the performance of our α-filtering index tree scheme in this section and compare it with Xia et al.’s index tree scheme [11] and Zhu et al.’s HAC-tree [42] under different settings. We use a real-world dataset which has 120000 documents in total and implement our scheme using Java on Windows 10 with an Intel Core i5-6200U @ 2.30 GHz CPU. The default parameter setting is shown in Table 1. The parameters are the number of required documents k, the document threshold μ of each atom cluster, the number of search keywords, the maximum number of child nodes α, and the number of documents m. In the enhanced scheme, we add phantom terms to enhance the security of our scheme. Without the phantom terms, the search accuracy and efficiency of the two schemes are the same, so for simplicity we only evaluate the original scheme. The influence of phantom terms is discussed in [11].

7.1. Space Usage Evaluation
In this section, we analyze the space usage of the index trees of the different schemes. Since only the index tree space usage is discussed, the search parameters are not changed. The space usage of Xia’s scheme is the same as that of Zhu’s scheme because both are binary trees with the same number of nodes.
7.1.1. Space Usage versus μ
We change the document threshold μ in each cluster to compare the space usage of the three schemes. Figures 4(a) and 4(b) show the index tree space cost when the number of documents is 20000 and 120000, respectively.
Figure 4: Index tree space usage versus μ. (a) m = 20000. (b) m = 120000.
The result shows that as the scale of the document set increases, the space usage of the index tree increases significantly, because more tree nodes are added to the index tree to store more documents. The result also shows that a larger threshold μ reduces the number of nodes in the α-filtering tree and thus saves index tree space.
7.1.2. Space Usage versus α
We change α of the α-filtering tree to compare the space usage of the three schemes. Figures 5(a) and 5(b) show the index tree space cost when the number of documents is 20000 and 120000, respectively.
Figure 5: Index tree space usage versus α. (a) m = 20000. (b) m = 120000.
The result shows that an appropriate setting of α can largely reduce the space usage of the α-filtering tree. As α grows, more nodes share the same parent node, so fewer intermediate nodes are needed; when α is too large, the space usage tends to be stable.
7.2. Index Building Time Cost Evaluation
In this section, we evaluate the time cost of index building. We measure the time cost of the BuildIndex algorithm of our scheme, which is shown in Table 2, given m = 20000. The BuildIndex algorithm in our scheme takes hours, while in Xia’s scheme it takes seconds. It should be noted that the keyword extraction and TF-IDF calculation are the same in all three schemes, and the tree construction algorithm also has almost the same time cost because the basic structure of the tree is the same. The main difference in time cost is that our scheme runs a clustering algorithm to further improve the efficiency of the search algorithm. The clustering algorithm consumes a lot of time, which leads to a worse index building time. This could be improved by adopting more efficient clustering algorithms, such as a distributed clustering algorithm. The longer index building time is affordable because the index only needs to be built once, while it provides more efficient searches afterwards.

7.3. Search Time Cost Evaluation
In this section, we evaluate the search time cost of the different schemes. Each data point in the figures is measured over at least 10 runs.
(1) Time cost versus μ. Figure 6 indicates that our scheme outperforms the existing schemes in the search process. The α-filtering tree speeds up the search by reaching the leaf nodes faster, shortening the height of the tree, and giving each intermediate node access to more nodes. The k-means clustering gathers similar documents in the same leaf nodes, which helps fill the candidate result list with relevant documents. However, when μ increases, the time cost of our scheme also increases, because each leaf node holds more documents, which slows down the relevance calculation in the leaf node. The time cost of Xia’s and Zhu’s trees increases considerably as the document set grows, because both schemes use binary trees, whose height grows faster than that of the α-filtering tree in our scheme.
(2) Time cost versus α. Figure 7 indicates that the time cost of our scheme is lower than that of the existing schemes. As mentioned above, an appropriate setting of α can improve the performance of our scheme. However, when α is too large, the pruning function in an intermediate node requires more calculation, and the pruning effect can deteriorate because there are fewer subtrees to prune.
(3) Time cost versus the number of search keywords. Figure 8 shows that increasing the number of search keywords slows down the search process of the tree-based index schemes. Overall, however, our scheme outperforms the other schemes owing to the α-filtering tree.
(4) Time cost versus k. Figure 9 shows that, under different settings of k, the time cost of our scheme is lower than that of Xia’s and Zhu’s schemes. When k increases, the time cost of the tree-based schemes increases only slightly, because the pruning function in the tree index reduces the number of relevance calculations between documents and the query vector.
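The pruning behaviour discussed in these comparisons can be sketched on unencrypted vectors as follows. The sketch assumes that the filtering vector of a node is the element-wise maximum of the document vectors beneath it and that relevance is an inner product of non-negative vectors, so the score of a filtering vector is an upper bound for its whole subtree. The names `Node`, `build`, and `search` are hypothetical, and encryption is left out entirely.

```python
import heapq
from dataclasses import dataclass, field
from typing import List, Tuple


def dot(u, v):
    """Inner product used here as the relevance score."""
    return sum(a * b for a, b in zip(u, v))


@dataclass
class Node:
    children: List["Node"] = field(default_factory=list)
    docs: List[Tuple[str, List[float]]] = field(default_factory=list)  # leaves only
    filt: List[float] = field(default_factory=list)  # filtering vector


def build(node):
    """Fill in filtering vectors bottom-up as element-wise maxima (assumed form)."""
    vecs = [v for _, v in node.docs] + [build(c) for c in node.children]
    node.filt = [max(col) for col in zip(*vecs)]
    return node.filt


def search(root, query, k):
    """Greedy depth-first top-k search; returns (score, doc_id) pairs, best first."""
    if k <= 0:
        return []
    heap = []  # min-heap holding the current top-k candidates

    def visit(node):
        # Prune: no document below this node can beat the current k-th score.
        if len(heap) == k and dot(node.filt, query) <= heap[0][0]:
            return
        for doc_id, vec in node.docs:
            item = (dot(vec, query), doc_id)
            if len(heap) < k:
                heapq.heappush(heap, item)
            elif item > heap[0]:
                heapq.heapreplace(heap, item)
        # Greedy order: most promising child first, so the bound tightens early.
        for child in sorted(node.children,
                            key=lambda c: dot(c.filt, query), reverse=True):
            visit(child)

    visit(root)
    return sorted(heap, reverse=True)
```

In this sketch, once a leaf of closely related documents has filled the candidate list, sibling subtrees whose filtering vectors score no higher than the current k-th result are skipped without touching their documents.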
7.4. The Setting of α
The experiments show that different settings of α yield different improvements in our scheme, but it is hard to find an appropriate α for a tree with m nodes. The space usage of the α-filtering tree decreases as α increases. However, the search time cost can increase as α increases, and it is worst when α = m: the search algorithm then iterates over every node in the tree, and the filtering vector in the only nonleaf node cannot help to filter out the noncandidate nodes. The best α-filtering tree should balance width against depth, and it needs a depth of at least three for the filtering vectors to take effect. The B+ tree [43] is a multibranch tree widely adopted for storing indexes over large data, and in practice the degree of a B+ tree is usually set to the block size divided by the key size [43] when it indexes a dataset much larger than the one used in our experiments. These settings can help to define an appropriate α. The search time complexity of the α-filtering tree is O, which means that a tree with less depth has better search efficiency. However, the filtering vectors in a shorter tree filter out fewer nodes than those in a taller tree with more filtering vectors, which results in worse search efficiency. Thus, it is hard to define the best setting of α for different m, and this question needs further discussion.
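As a rough illustration of the width and depth trade-off, the sketch below computes the depth of an idealized tree for a given number of leaf nodes and a given α, together with a B+ tree style degree. The 4096-byte block, the 8-byte key, and the leaf counts are hypothetical values chosen for illustration, and the function names are not taken from the scheme.

```python
import math


def tree_depth(num_leaves: int, alpha: int) -> int:
    """Number of levels (leaf level included) when each parent has up to alpha children."""
    if num_leaves < 1 or alpha < 2:
        raise ValueError("need num_leaves >= 1 and alpha >= 2")
    depth, level = 1, num_leaves
    while level > 1:
        level = math.ceil(level / alpha)
        depth += 1
    return depth


def bplus_style_degree(block_size_bytes: int, key_size_bytes: int) -> int:
    """Degree heuristic from the B+ tree rule of thumb: block size over key size."""
    return max(2, block_size_bytes // key_size_bytes)


if __name__ == "__main__":
    degree = bplus_style_degree(4096, 8)  # hypothetical block and key sizes
    for leaves in (400, 10**6):           # hypothetical numbers of leaf nodes
        for alpha in (2, 20, degree):
            print(leaves, alpha, tree_depth(leaves, alpha))
```

With 400 leaf nodes, a degree of 512 collapses the tree to two levels, below the depth of three needed for the filtering vectors to work, whereas with a million leaf nodes the same degree still gives four levels. This matches the remark that the B+ tree rule suits datasets much larger than the experimental one.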
8. Conclusion
To address the efficiency problem of privacy-preserving multi-keyword ranked search, we propose an α-filtering tree index search scheme based on bisecting k-means clusters. The scheme exploits the characteristics of a multibranch tree, which greatly reduces the space complexity of the index tree. At the same time, clustering is used to store related documents close together in the index tree, which makes pruning on the index tree far more effective and thus improves the search efficiency. On the other hand, since the index tree nodes are stored in the form of clusters and bisecting k-means clustering requires a large amount of time, the ability to update the index tree could be limited. The experimental results on a real-world dataset show that, to a certain extent, our scheme can greatly improve the search efficiency of privacy-preserving multi-keyword ranked search while guaranteeing the accuracy of the search results.
Data Availability
The text data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This research was supported by the National Natural Science Foundation of China under the Grant nos. 61872197, 61972209, 61572263, 61672297, and 61872193; the Postdoctoral Science Foundation of China under the Grant no. 2019M651919; the Natural Science Foundation of Anhui Province under Grant no. 1608085MF127; the Natural Science Foundation of Anhui Province under Grant no. KJ2017A419; and the Natural Research Foundation of Nanjing University of Posts and Telecommunications under Grant no. NY217119.
References
[1] P. Golle, J. Staddon, and B. R. Waters, “Secure conjunctive keyword search over encrypted data,” in Proceedings of Applied Cryptography and Network Security, ACNS 2004, pp. 31–45, Yellow Mountain, China, June 2004.
[2] Y. H. Hwang and P. J. Lee, “Public key encryption with conjunctive keyword search and its extension to a multi-user system,” in Proceedings of Pairing-Based Cryptography, Pairing 2007, pp. 2–22, Tokyo, Japan, July 2007.
[3] L. Ballard, S. Kamara, and F. Monrose, “Achieving efficient conjunctive keyword searches over encrypted data,” in Proceedings of the Information and Communications Security, 7th International Conference, ICICS 2005, pp. 414–426, Beijing, China, December 2005.
[4] D. Boneh and B. Waters, “Conjunctive, subset, and range queries on encrypted data,” in Proceedings of 4th Theory of Cryptography Conference, TCC 2007, pp. 535–554, Amsterdam, The Netherlands, February 2007.
[5] B. Zhang and F. Zhang, “An efficient public key encryption with conjunctive-subset keywords search,” Journal of Network and Computer Applications, vol. 34, no. 1, pp. 262–267, 2011.
[6] J. Katz, A. Sahai, and B. Waters, “Predicate encryption supporting disjunctions, polynomial equations, and inner products,” in Proceedings of 27th Annual International Conference on the Theory and Applications of Cryptographic Techniques, EUROCRYPT 2008, pp. 146–162, Istanbul, Turkey, April 2008.
[7] E. Shen, E. Shi, and B. Waters, “Predicate privacy in encryption systems,” in Proceedings of Theory of Cryptography, 6th Theory of Cryptography Conference, TCC 2009, pp. 457–473, San Francisco, CA, USA, March 2009.
[8] N. Cao, C. Wang, M. Li, K. Ren, and W. Lou, “Privacy-preserving multi-keyword ranked search over encrypted cloud data,” in Proceedings of the 30th IEEE International Conference on Computer Communications, INFOCOM 2011, pp. 829–837, Shanghai, China, April 2011.
[9] W. Sun, B. Wang, N. Cao et al., “Privacy-preserving multi-keyword text search in the cloud supporting similarity-based ranking,” in Proceedings of the 8th ACM Symposium on Information, Computer and Communications Security, ASIA CCS ’13, pp. 71–82, Hangzhou, China, May 2013.
[10] C. Örencik, M. Kantarcioglu, and E. Savas, “A practical and secure multi-keyword search method over encrypted cloud data,” in Proceedings of the 2013 IEEE Sixth International Conference on Cloud Computing, CLOUD 2013, pp. 390–397, Santa Clara, CA, USA, July 2013.
[11] Z. Xia, X. Wang, X. Sun, and Q. Wang, “A secure and dynamic multi-keyword ranked search scheme over encrypted cloud data,” IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 2, pp. 340–352, 2016.
[12] C. Chen, X. Zhu, P. Shen et al., “An efficient privacy-preserving ranked keyword search method,” IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 4, pp. 951–963, 2016.
[13] J. Sun, S. Hu, X. Nie, and J. Walker, “Efficient ranked multi-keyword retrieval with privacy protection for multiple data owners in cloud computing,” IEEE Systems Journal, pp. 1–12, 2019.
[14] D. X. Song, D. A. Wagner, and A. Perrig, “Practical techniques for searches on encrypted data,” in Proceedings of the 2000 IEEE Symposium on Security and Privacy, S&P 2000, pp. 44–55, Berkeley, California, USA, May 2000.
[15] E. Goh, “Secure indexes,” IACR Cryptology ePrint Archive, 2003.
[16] Y. Chang and M. Mitzenmacher, “Privacy preserving keyword searches on remote encrypted data,” in Proceedings of the Applied Cryptography and Network Security, Third International Conference, ACNS 2005, pp. 442–455, New York, NY, USA, June 2005.
[17] R. Curtmola, J. Garay, S. Kamara, and R. Ostrovsky, “Searchable symmetric encryption: improved definitions and efficient constructions,” Journal of Computer Security, vol. 19, no. 5, pp. 895–934, 2011.
[18] D. Boneh, G. D. Crescenzo, R. Ostrovsky, and G. Persiano, “Public key encryption with keyword search,” in Proceedings of the International Conference on the Theory and Applications of Cryptographic Techniques, Advances in Cryptology - EUROCRYPT 2004, pp. 506–522, Interlaken, Switzerland, May 2004.
[19] B. Wang, S. Yu, W. Lou, and Y. T. Hou, “Privacy-preserving multi-keyword fuzzy search over encrypted data in the cloud,” in Proceedings of the 2014 IEEE Conference on Computer Communications, INFOCOM 2014, pp. 2112–2120, Toronto, Canada, May 2014.
[20] J. Li, Q. Wang, C. Wang, N. Cao, K. Ren, and W. Lou, “Fuzzy keyword search over encrypted data in cloud computing,” in Proceedings of the 29th IEEE International Conference on Computer Communications, INFOCOM 2010, pp. 441–445, San Diego, CA, USA, 2010.
[21] X. Zhu, Q. Liu, and G. Wang, “A novel verifiable and dynamic fuzzy keyword search scheme over encrypted data in cloud computing,” in Proceedings of the 2016 IEEE International Conference on Trust, Security, and Privacy in Computing and Communications, TrustCom 2016, pp. 845–851, Tianjin, China, August 2016.
[22] Z. Fu, X. Sun, S. Ji, and G. Xie, “Towards efficient content-aware search over encrypted outsourced data in cloud,” in Proceedings of the 35th Annual IEEE International Conference on Computer Communications, INFOCOM 2016, pp. 1–9, San Francisco, CA, USA, April 2016.
[23] X. Ge, J. Yu, C. Hu, H. Zhang, and R. Hao, “Enabling efficient verifiable fuzzy keyword search over encrypted data in cloud computing,” IEEE Access, vol. 6, pp. 45725–45739, 2018.
[24] H. T. Poon and A. Miri, “An efficient conjunctive keyword and phase search scheme for encrypted cloud storage systems,” in Proceedings of the 8th IEEE International Conference on Cloud Computing, CLOUD 2015, pp. 508–515, New York City, NY, USA, July 2015.
[25] L. Zhang, Y. Zhang, and H. Ma, “Privacy-preserving and dynamic multi-attribute conjunctive keyword search over encrypted cloud data,” IEEE Access, vol. 6, pp. 34214–34225, 2018.
[26] W. Sun, X. Liu, W. Lou, Y. T. Hou, and H. Li, “Catch you if you lie to me: efficient verifiable conjunctive keyword search over large dynamic encrypted cloud data,” in Proceedings of the 2015 IEEE Conference on Computer Communications, INFOCOM 2015, pp. 2110–2118, Kowloon, Hong Kong, May 2015.
[27] C.-M. Yu, C.-Y. Chen, and H.-C. Chao, “Privacy-preserving multi-keyword similarity search over outsourced cloud data,” IEEE Systems Journal, vol. 11, no. 2, pp. 385–394, 2017.
[28] C. Wang, K. Ren, S. Yu, and K. M. R. Urs, “Achieving usable and privacy-assured similarity search over outsourced cloud data,” in Proceedings of the IEEE Conference on Computer Communications, INFOCOM 2012, pp. 451–459, Orlando, FL, USA, March 2012.
[29] Z. Xia, Y. Zhu, X. Sun, and L. Chen, “Secure semantic expansion based search over encrypted cloud data supporting similarity ranking,” Journal of Cloud Computing, vol. 3, no. 1, p. 8, 2014.
[30] W. Sun, B. Wang, N. Cao et al., “Verifiable privacy-preserving multi-keyword text search in the cloud supporting similarity-based ranking,” IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 11, pp. 3025–3035, 2014.
[31] P. van Liesdonk, S. Sedghi, J. Doumen, P. H. Hartel, and W. Jonker, “Computationally efficient searchable symmetric encryption,” in Proceedings of the Secure Data Management, 7th VLDB Workshop, SDM 2010, pp. 87–100, September 2010.
[32] X. Ge, J. Yu, H. Zhang et al., “Towards achieving keyword search over dynamic encrypted cloud data with symmetric-key based verification,” IEEE Transactions on Dependable and Secure Computing, pp. 1–16, 2019.
[33] S. Kamara, C. Papamanthou, and T. Roeder, “Dynamic searchable symmetric encryption,” in Proceedings of the ACM Conference on Computer and Communications Security, CCS’12, pp. 965–976, Raleigh, NC, USA, October 2012.
[34] D. Cash, J. Jaeger, S. Jarecki, and G. Tsudik, “Dynamic searchable encryption in very-large databases: data structures and implementation,” in Proceedings of the 21st Annual Network and Distributed System Security Symposium, NDSS 2014, San Diego, CA, USA, February 2014.
[35] F. Hahn and F. Kerschbaum, “Searchable encryption with secure and efficient updates,” in Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, CCS’14, pp. 310–320, Scottsdale, AZ, USA, November 2014.
[36] L. Chen and Z. Chen, “Practical, dynamic and efficient integrity verification for symmetric searchable encryption,” in Proceedings of the Cryptology and Network Security - 18th International Conference, CANS 2019, pp. 163–183, Fuzhou, China, October 2019.
[37] X. Jiang, J. Yu, J. Yan, and R. Hao, “Enabling efficient and verifiable multi-keyword ranked search over encrypted cloud data,” Information Sciences, vol. 403-404, pp. 22–41, 2017.
[38] Q. Wang, M. He, M. Du, S. S. M. Chow, R. W. F. Lai, and Q. Zou, “Searchable encryption over feature-rich data,” IEEE Transactions on Dependable and Secure Computing, vol. 15, no. 3, pp. 496–510, 2018.
[39] S. Hu, L. Y. Zhang, Q. Wang, Z. Qin, and C. Wang, “Towards private and scalable cross-media retrieval,” IEEE Transactions on Dependable and Secure Computing, pp. 1–15, 2019.
[40] C. Wang, N. Cao, K. Ren, and W. Lou, “Enabling secure and efficient ranked keyword search over outsourced cloud data,” IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 8, pp. 1467–1479, 2012.
[41] W. K. Wong, D. W. Cheung, B. Kao, and N. Mamoulis, “Secure kNN computation on encrypted databases,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, pp. 139–152, Providence, RI, USA, June 2009.
[42] X. Zhu, H. Dai, X. Yi, G. Yang, and X. Li, “MUSE: an efficient and accurate verifiable privacy-preserving multi-keyword text search over encrypted cloud data,” Security and Communication Networks, vol. 2017, Article ID 1923476, 17 pages, 2017.
[43] Wikipedia contributors, “B+ tree - Wikipedia, the free encyclopedia,” October 2019, https://en.wikipedia.org/w/index.php?title=B%2B_tree&oldid=920193380.
Copyright
Copyright © 2020 Hua Dai et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.