Abstract

Out of privacy concerns, cloud service users often encrypt their personal data before outsourcing them to the cloud. However, efficient search over encrypted cloud data is hard to achieve, so designing an efficient and accurate search scheme over large-scale encrypted cloud data remains a challenge. In this paper, we integrate the bisecting k-means algorithm with a multibranch tree structure and propose an α-filtering tree search scheme based on bisecting k-means clusters. The index tree is built bottom-up, and a greedy depth-first algorithm filters out nonrelevant document clusters by calculating the relevance score between each node’s filtering vector and the query vector. The α-filtering tree improves search efficiency without loss of accuracy. Experiments on a real-world dataset demonstrate the effectiveness of our scheme.

1. Introduction

Cloud computing is a new IT paradigm that offers high-quality computation, storage, and application capacity. Cloud customers choose to outsource their local data and computation to the cloud server to minimize data maintenance costs. Thus, protecting users’ privacy while achieving efficient and precise data retrieval from the cloud server has become the focus of recent work.

The traditional way to protect data privacy is to encrypt the original data. However, this makes data utilization very challenging. Search schemes based on ciphertext [1–7] can guarantee data privacy, but their search algorithms have high time and space complexity, which makes them unsuitable for cloud data retrieval. To solve this problem, researchers have proposed a series of searchable encryption schemes [8–13] based on the theory of cryptography. These schemes either fail to deliver high-accuracy retrieval results [8–10, 12] or incur heavy time and space overhead [8, 11]. Therefore, it is necessary to design an efficient and usable search scheme.

In this paper, we propose an α-filtering tree search scheme based on bisecting k-means clusters, which achieves efficient multi-keyword ranked search over encrypted cloud data. We use the vector space model and the TF-IDF model to build the keyword dictionary and to transform documents and keywords into “points” in a multidimensional space described by vectors, and we then use the secure inner product to encrypt the document vectors and query vectors. The relevance scores between the document vectors and query vectors are used to obtain the top-k most relevant documents.

Our paper’s main contributions are summarized as follows:
(i) We integrate the bisecting k-means algorithm with a multibranch tree structure, where the bisecting k-means algorithm is used to improve clustering accuracy, and we propose an α-filtering tree search scheme based on the bisecting k-means clusters.
(ii) We propose a greedy depth-first algorithm to perform searches on the α-filtering tree, which improves multi-keyword search efficiency. By adopting the secure inner product encryption scheme, we achieve privacy-preserving ranked search on the encrypted α-filtering index tree.
(iii) We conduct experiments on a real-world dataset and compare our scheme with existing schemes in terms of retrieval efficiency and index storage usage. The results show that our scheme is superior in both search efficiency and storage usage.

The rest of the paper is organized as follows: Section 2 reviews the related work, Section 3 introduces the main background knowledge, and Section 4 presents our system model, threat model, and design goals. The constructions of the α-filtering tree and the search algorithm are given in Section 5, and Section 6 describes the secure search scheme and its security analysis. Section 7 reports the experimental results and their analysis. Finally, the conclusion is given in Section 8.

2. Related Work

Searchable encryption schemes implement keyword search over encrypted outsourced data, which allows users to store their personal data on the cloud server without privacy concerns. Recently, an increasing number of scholars have conducted research in this area. We discuss the related work on the development of searchable encryption schemes in terms of performance and functionality.

2.1. Single-Keyword Searchable Encryption

Song et al. [14] first proposed a symmetric encryption search scheme in which each keyword in the document set is encrypted separately and the entire dataset is searched by sequential scanning. Thus, the search time of the scheme is linear in the overall size of the document set. Goh [15] proposed a searchable encryption scheme based on Bloom filters, in which the calculation overhead per document is independent of the number of documents in the dataset. However, because Bloom filters admit false positives, the cloud server may return documents that do not contain the search keywords. The scheme of Chang and Mitzenmacher [16] uses two indexes: the first stores and manages a dictionary prebuilt by the user, and the second requires two rounds of interaction between the user and the cloud server, which affects the user experience but achieves the same search efficiency as Goh [15]. Curtmola et al. [17] proposed two novel search schemes, SSE-1 and SSE-2, where SSE-1 is secure against chosen-keyword attacks (CKA1) and SSE-2 against adaptive chosen-keyword attacks (CKA2). Their schemes’ search time is proportional to the number of keywords retrieved. Boneh et al. [18] proposed a searchable encryption construction that allows anyone to store data encrypted under a public key, but their scheme requires a large amount of computation.

2.2. Multi-Keyword Search Schemes

Multi-keyword searchable encryption allows the user to submit multiple search keywords to retrieve the most relevant documents. Such schemes can be further classified into ranked search and traditional search. In traditional search, most schemes perform either conjunctive keyword search, which returns all documents containing every search keyword, or conjunctive-subset keyword search, which returns documents containing a subset of the keywords. However, traditional search is not suitable for ranked retrieval. Cao et al. [8] first proposed a privacy-preserving multi-keyword ranked search scheme in which documents and search keywords are represented by dictionary-scale vectors and documents are ranked by coordinate matching. Since the weights of different keywords in documents are not considered, the retrieval results lack accuracy, and the search time is linear in the scale of the dataset. Sun et al. [9] proposed a multi-keyword ranked scheme that uses the TF-IDF vector space model and the cosine distance measure to build an index tree structure; their experiments show that the scheme is more efficient than linear search but lacks accuracy. Orencik et al. [10] adopted locality-sensitive hashing to cluster similar documents, but their ranked search results are also inaccurate. Xia et al. [11] adopted the vector space model and a KBB tree to build a dynamic multi-keyword ranked search scheme that obtains more precise ranked results. However, as the document set grows, the index tree’s space cost becomes large and the pruning effect of the search algorithm weakens, which decreases search efficiency.

To enhance the usability and functionality of searchable encryption, many schemes supporting fuzzy keyword search [19–23], conjunctive keyword search [3, 24–26], and similarity search [27–30] have also been presented. Dynamic schemes support updates on the dataset, which greatly enhances usability. The first dynamic searchable encryption scheme was proposed by van Liesdonk et al. [31] and supports only a limited number of updates; many further dynamic searchable encryption schemes followed [32–35]. Verifiable schemes can check the integrity of search results when the cloud server is not honest, and much research has been conducted to support verifiable search [26, 36, 37]. To extend searchable encryption to other data types such as multimedia data, further works have been proposed [38, 39].

Xia et al. [11] presented an efficient search index tree to obtain the search results. However, as the scale of the document set increases, the index tree’s space cost grows large and the pruning effect of the search algorithm is reduced, resulting in a decrease in search efficiency. Thus, we propose a multibranch index tree to overcome this problem. By applying a clustering algorithm to the document set, we further increase the search efficiency, and the multibranch tree also saves index space.

3. Notations and Background Knowledge

3.1. Notations

For simplicity, we define our main symbols as follows:
(i) W — the keyword dictionary containing n keywords, denoted as W = {w1, w2, …, wn}
(ii) D — the plaintext document set containing m documents, denoted as D = {d1, d2, …, dm}
(iii) FV — a plaintext document vector
(iv) F̃V — the encrypted form of FV
(v) I — the α-filtering index tree
(vi) Ĩ — the encrypted form of I
(vii) Q — the set of query keywords, Q = {q1, q2, …, qs}
(viii) VQ — the query vector of Q
(ix) TQ — the query trapdoor
(x) RL — the ranked search result list
(xi) μ — the threshold on the maximum number of documents in an atom cluster
(xii) α — the threshold on the maximum number of child nodes of a node in the α-filtering tree

3.2. Vector Space Model

Among the many information retrieval models, the vector space model is the most popular method of relevance measurement and is widely used in plaintext multi-keyword retrieval; we adopt the TF-IDF model for feature extraction. TF (term frequency) is the number of occurrences of a keyword in a document f divided by the total number of words contained in f. IDF (inverse document frequency) is the total number of documents divided by the number of documents containing the keyword.

The keyword dictionary is first generated by filtering the stop words from all the words contained in the document set D. Then, the document vector FV and the query vector q are generated according to the keyword dictionary W. The dimensions of FV and q equal the scale of the keyword dictionary, and each dimension i represents the corresponding keyword wi. The value of each dimension is the normalized TF value in FV and the normalized IDF value in q. The TF value and the IDF value of a keyword wi are calculated as follows:

TF(f, wi) = N(f, wi) / Σw∈f N(f, w),  IDF(wi) = m / m(wi),  (1)

where N(f, wi) is the number of occurrences of wi in document f and m(wi) is the number of documents containing wi.
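As a concrete illustration, the following minimal sketch (in Java, the language of our prototype; the class and method names are illustrative and not part of the scheme itself) computes the TF values of a document vector and the IDF value of a keyword over a toy dictionary, following the definitions above.

```java
import java.util.*;

// Illustrative sketch: builds a TF document vector and computes an IDF
// value over a small keyword dictionary. Names are hypothetical.
public class TfIdfSketch {
    // TF of keyword w in document f: occurrences of w divided by the
    // total number of words in f.
    static double[] tfVector(List<String> docWords, List<String> dictionary) {
        double[] fv = new double[dictionary.size()];
        for (int i = 0; i < dictionary.size(); i++) {
            String w = dictionary.get(i);
            long count = docWords.stream().filter(w::equals).count();
            fv[i] = (double) count / docWords.size();
        }
        return fv;
    }

    // IDF of keyword w: total number of documents divided by the number
    // of documents containing w (as defined in Section 3.2).
    static double idf(String w, List<List<String>> docs) {
        long containing = docs.stream().filter(d -> d.contains(w)).count();
        return containing == 0 ? 0.0 : (double) docs.size() / containing;
    }

    public static void main(String[] args) {
        List<String> dict = List.of("cloud", "search", "privacy");
        List<List<String>> docs = List.of(
            List.of("cloud", "search", "cloud"),
            List.of("privacy", "search"));
        System.out.println(Arrays.toString(tfVector(docs.get(0), dict)));
        System.out.println(idf("cloud", docs));
    }
}
```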

3.3. Relevance Measurement

The inner product is computed over two equal-length vectors, and the relevance between the two vectors is quantified by the inner product score: the larger the score, the higher the relevance. The relevance score is calculated as follows:

Score(FV, VQ) = FV · VQ = Σi=1..n FV[i] · VQ[i].  (2)

We make the following remarks about equation (2):
(i) If FV is a document vector and VQ is a search vector, Score(FV, VQ) is the relevance score between the document and the search keywords.
(ii) If FV is a filtering vector of an index tree node and VQ is a search vector, Score(FV, VQ) is the relevance score between the upper bound vector of the documents stored in this node and the search keywords.

3.4. Bisecting k-Means Cluster

In data mining, the bisecting k-means algorithm is a cluster analysis algorithm. Two initial centroids are selected, each point is assigned to the nearest centroid in turn, and the points assigned to the same centroid form a cluster. The centroid of each cluster is continually updated from the points assigned to it; assignment and update are repeated until the clusters no longer change, at which point the clustering is complete. We use the cosine distance to measure the distance from a point to a centroid, defined in the following equation:

Dist(p, c) = 1 − (p · c) / (‖p‖ ‖c‖),  (3)

where p is the point’s vector, c is the centroid’s vector, and ‖p‖ and ‖c‖ are the norms of p and c.
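A minimal Java sketch of this distance measure (helper names are illustrative, not the paper’s code):

```java
// Cosine distance used to assign points to centroids in the bisecting
// k-means step; smaller values mean the point is closer to the centroid.
public class CosineDistance {
    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    static double norm(double[] a) { return Math.sqrt(dot(a, a)); }

    // Cosine distance = 1 - cosine similarity.
    static double cosineDistance(double[] p, double[] c) {
        return 1.0 - dot(p, c) / (norm(p) * norm(c));
    }

    public static void main(String[] args) {
        System.out.println(cosineDistance(new double[]{1, 0}, new double[]{1, 1}));
    }
}
```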

3.5. Secure Inner Product Operation

The special matrix encryption proposed in [8] achieves a privacy-preserving vector inner product. Assuming that p and q are two n-dimensional vectors, the user encrypts them into p̃ = Mᵀp and q̃ = M⁻¹q, where M is a random invertible matrix. Therefore, we can obtain the inner product of the original vectors from the inner product of their encrypted forms alone:

p̃ · q̃ = (Mᵀp) · (M⁻¹q) = pᵀM M⁻¹q = p · q.  (4)
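The following short Java check illustrates equation (4) with a hard-coded 2 × 2 invertible matrix; this is a toy sketch, whereas a real deployment uses random high-dimensional matrices.

```java
// Sketch verifying that the matrix-based encryption preserves inner
// products: with pEnc = M^T p and qEnc = M^(-1) q, pEnc . qEnc = p . q.
// M and its inverse are hard-coded here for illustration.
public class SecureInnerProduct {
    static double[] multiply(double[][] m, double[] v) {
        double[] r = new double[m.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < v.length; j++) r[i] += m[i][j] * v[j];
        return r;
    }

    static double[][] transpose(double[][] m) {
        double[][] t = new double[m[0].length][m.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m[0].length; j++) t[j][i] = m[i][j];
        return t;
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    public static void main(String[] args) {
        double[][] M    = {{2, 1}, {1, 1}};   // invertible (det = 1)
        double[][] Minv = {{1, -1}, {-1, 2}}; // its inverse
        double[] p = {0.3, 0.7}, q = {0.5, 0.5};
        double[] pEnc = multiply(transpose(M), p);
        double[] qEnc = multiply(Minv, q);
        System.out.println(dot(p, q));       // 0.5
        System.out.println(dot(pEnc, qEnc)); // same value
    }
}
```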

4. Model and Problem Formulation

4.1. System Model

In this paper, there are 3 entities in our system model: data owner, data user, and cloud server as shown in Figure 1. These three entities collaborate as follows.

The data owner holds the local dataset D and wants to outsource it in secure form to the cloud server while still providing a search service for users. In our scheme, the data owner first generates the searchable index tree I from D. Then, it uses the secure key to encrypt D and I into their encrypted forms D̃ and Ĩ. After that, it shares the secure key with the data user through access control and outsources D̃ and Ĩ to the cloud server.

The cloud server provides both storage and search services. It stores the secured index tree Ĩ and the encrypted document set D̃. After receiving the search trapdoor TQ from the data user, it performs the secure search over Ĩ and returns the search results to the data user.

The data user is an authorized party who accesses the document set. It generates the search trapdoor TQ from the search keywords Q through the proposed search scheme and sends TQ as the search request to the cloud server. After receiving the search results, it uses the secure key to decrypt the encrypted documents and obtain the plaintext documents.

4.2. Threat Model

We adopt the same “honest-but-curious” threat model as current works [8, 9, 11, 40, 42]. That is, the cloud server follows the user’s instructions honestly and precisely, but it may curiously analyze the received data to obtain additional information about the dataset. Two threat models proposed by Cao et al. [8] are adopted in our work:
Known Ciphertext Model: the cloud server can access the ciphertext dataset, the encrypted index tree, and the search trapdoor; thus, it can conduct a ciphertext-only attack.
Known Background Model: in this stronger model, the cloud server has more dataset-related information than in the known ciphertext model. It can hold statistical information about the relation between search trapdoors and search results and may then infer or recognize some of the search keywords in a trapdoor from this additional information.

4.3. Design Goals

To ensure privacy, efficiency, and accuracy in multi-keyword ranked search over encrypted cloud data, our system design should meet the following requirements:
Search Efficiency: the proposed search scheme should be more efficient than other multi-keyword search schemes.
Search Accuracy: the proposed search scheme should guarantee the accuracy of the search results.
Privacy Preserving: the proposed scheme should ensure document privacy, index privacy, trapdoor privacy, trapdoor unlinkability, and keyword privacy in the search process.

5. Index and Search Algorithm

In this section, we discuss the index construction method and the search method based on the index tree and give the corresponding algorithms. We first construct a document atom cluster list using the bisecting k-means algorithm. Then, based on the generated atom cluster list, we build the α-filtering tree and propose a corresponding greedy depth-first search algorithm for multi-keyword ranked search.

5.1. Atom Cluster List Generation Algorithm

Considering the document set D as the input raw cluster, we use the bisecting k-means algorithm to perform top-down bisecting clustering until every generated subcluster contains no more than μ documents (Algorithms 1 and 2), and thus a binary clustering tree is built by Algorithm 2. Here, μ is the given clustering threshold. Then, we traverse the leaf clusters of the generated binary clustering tree, and the atom cluster list L is constructed by Algorithm 3; this list is used for building the α-filtering index tree.

Input:
 The document set, D;
 The threshold of the maximum number of documents in an atom cluster, μ;
Output:
 An atom cluster list, L;
(1)Create a root cluster node r which has all documents of D;
(2)GenBiSectingTree (r, μ);
(3)InOrder (r, μ, L);
(4)return L;
Input: A cluster node having a set of documents, c; μ;
Output: The root of the generated binary clustering tree;
(1)if |c| > μ then
(2) Perform the bisecting k-means algorithm on c, generating two bisected clusters cl and cr, where c = cl ∪ cr. Then, set c·lchild = cl and c·rchild = cr, where lchild and rchild are the left and right child cluster nodes of c.
(3)GenBiSectingTree (c·lchild, μ);
(4)GenBiSectingTree (c·rchild, μ);
(5)end if
Input: c, μ;
Output: L;
(1)if c is not empty then
(2)InOrder (c·lchild, μ, L);
(3)if |c| ≤ μ then
(4)  Append c to L;
(5)end if
(6)InOrder (c·rchild, μ, L);
(7)end if
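To make Algorithms 1–3 concrete, the sketch below (in Java; a simplified stand-in with illustrative names that uses a single nearest-centroid assignment pass instead of the full iterated bisecting k-means) recursively bisects a set of document vectors and collects the atom clusters in order.

```java
import java.util.*;

// Illustrative sketch of Algorithms 1-3: recursively bisect the document
// set until every leaf holds at most MU documents, then collect the
// leaves (atom clusters) in order. Documents are their vectors.
public class AtomClusterSketch {
    static final int MU = 3; // threshold per atom cluster

    static double dist(double[] p, double[] c) { // cosine distance
        double pc = 0, pp = 0, cc = 0;
        for (int i = 0; i < p.length; i++) { pc += p[i]*c[i]; pp += p[i]*p[i]; cc += c[i]*c[i]; }
        return 1.0 - pc / (Math.sqrt(pp) * Math.sqrt(cc) + 1e-12);
    }

    // One bisecting step: split a cluster in two by nearest-centroid
    // assignment (the full algorithm would iterate centroid updates).
    static List<List<double[]>> bisect(List<double[]> cluster) {
        double[] c1 = cluster.get(0), c2 = cluster.get(cluster.size() - 1);
        List<double[]> a = new ArrayList<>(), b = new ArrayList<>();
        for (double[] p : cluster)
            (dist(p, c1) <= dist(p, c2) ? a : b).add(p);
        if (a.isEmpty() || b.isEmpty()) { // degenerate split: force a cut
            a = cluster.subList(0, cluster.size() / 2);
            b = cluster.subList(cluster.size() / 2, cluster.size());
        }
        return List.of(a, b);
    }

    // In-order traversal of the implicit binary clustering tree.
    static void genAtomClusters(List<double[]> cluster, List<List<double[]>> out) {
        if (cluster.size() <= MU) { out.add(cluster); return; } // atom cluster
        List<List<double[]>> halves = bisect(cluster);
        genAtomClusters(halves.get(0), out);
        genAtomClusters(halves.get(1), out);
    }
}
```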

Definition 1. Atom Cluster. The leaf clusters in the binary tree generated by Algorithm 1 are the atom clusters, where the number of documents in each atom cluster is no more than μ.

Theorem 1. Assuming that the list of atom clusters generated by Algorithm 1 is L = {C1, C2, …, Ct}, we have the following properties:
(1) C1 ∪ C2 ∪ … ∪ Ct = D and Ci ∩ Cj = ∅ for any i ≠ j
(2) |Ci| ≤ μ for i = 1, 2, …, t

We illustrate the generation process of the atom cluster list L in Algorithms 1–3 by an example. We assume that the document set is D = {d1, d2, …, d15} and μ = 3. The first round of bisecting clustering is performed on D, and two subclusters are generated as shown in Figure 2. In the same way, the second and third layers’ subclusters are each divided into two clusters, and a subcluster stops clustering when it contains no more than 3 documents. Finally, a binary clustering tree is formed, whose leaf nodes are C1 = {d1, d2, d3}, C2 = {d4, d5, d6}, C3 = {d9, d10}, C4 = {d11, d12}, C5 = {d7, d8}, and C6 = {d13, d14, d15}, as shown in Figure 2. The algorithm then traverses the leaf nodes of the binary clustering tree in order, and the atom cluster list L = {C1, C2, C3, C4, C5, C6} is generated.

5.2. α-Filtering Tree

Definition 2. Upper Bound Vector. For an n-dimensional vector set V = {V1, V2, …, Vs}, V’s upper bound vector is an n-dimensional vector VUB = UpBound{V1, V2, …, Vs}, where VUB[i] = max{V1[i], V2[i], …, Vs[i]}, i = 1, 2, …, n.

Definition 3. α-Filtering Tree. A node u in the α-filtering tree is a triple, denoted as u = ⟨FV, PL, DC⟩, where u·FV is an n-dimensional filtering vector, u·PL is a child node pointer list holding at most α pointers, and u·DC stores documents when u is a leaf node.
(1) If u is a leaf node, then u·PL = ∅ and u·DC = {d1, d2, …, dx}, which corresponds to an atom cluster with x ≤ μ, and u·FV = UpBound{FV1, FV2, …, FVx}, where FVi is the document vector of di.
(2) If u is a non-leaf node, then u·DC = ∅ and u·PL holds g ≤ α pointers, i.e., u·PL = {u·PL[1], u·PL[2], …, u·PL[g]}, where u·PL[i] points to the ith child node, and u·FV = UpBound{u·PL[1]·FV, u·PL[2]·FV, …, u·PL[g]·FV}, which is the upper bound vector of the filtering vectors of the child nodes in u·PL.
We give the construction procedure of the α-filtering tree in Algorithm 4.
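Before walking through Algorithm 4, the node triple and the UpBound operation can be sketched as follows (in Java; the field names follow the paper, while the class itself is illustrative):

```java
import java.util.*;

// Sketch of the node triple <FV, PL, DC> from Definition 3 and the
// UpBound operation from Definition 2.
class FilterNode {
    double[] FV;                              // filtering vector
    List<FilterNode> PL = new ArrayList<>();  // child pointers, at most alpha
    List<double[]> DC = new ArrayList<>();    // document vectors, leaf only

    // Dimension-wise maximum of a set of equal-length vectors (Definition 2).
    static double[] upBound(List<double[]> vectors) {
        double[] ub = vectors.get(0).clone();
        for (double[] v : vectors)
            for (int i = 0; i < ub.length; i++) ub[i] = Math.max(ub[i], v[i]);
        return ub;
    }
}
```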
Algorithm 4 builds the α-filtering tree from the atom cluster list. Tree nodes are created during each round of steps 8–21. The original atom cluster list is treated as the first child node list (CNL). In each round, up to α nodes are fetched from CNL at a time, and a parent node is created for them and added to the parent node list (PNL). After all the nodes in CNL have been fetched, the PNL for this round is complete. If PNL contains more than one node, all nodes in PNL are moved to CNL and another round begins; otherwise, the only node in PNL is the root of the generated index tree.

Input: L; the threshold of the maximum number of child nodes, α
Output: the α-filtering tree, I
(1)for each Ci in L do
(2) Create a leaf node u for Ci. Assuming that Ci = {d1, d2, …, dx}, set u·PL = ∅, u·DC = Ci, and u·FV = UpBound{FV1, FV2, …, FVx}
(3) Add u into CNL; //CNL is a variable of the child node list and PNL is a variable of the newly generated parent node list
(4)end for
(5)if |CNL| = 1 then
(6)return the only node left in CNL which is the root of the generated α-filtering tree
(7)end if
(8)while CNL is not empty do
(9)while CNL is not empty do
(10)  if |CNL| ≥ α then
(11)   Fetch the first α nodes {u1, u2, …, uα} from CNL and create a parent node u, where u·PL[1] = u1, u·PL[2] = u2, …, u·PL[α] = uα, u·FV = UpBound{u1·FV, u2·FV, …, uα·FV}, and u·DC = ∅; then add u to PNL
(12)  else
(13)   Assuming that CNL = {u1, u2, …, ug} with g < α, fetch all the nodes from CNL and create a parent node u, where u·PL[1] = u1, u·PL[2] = u2, …, u·PL[g] = ug, u·FV = UpBound{u1·FV, u2·FV, …, ug·FV}, and u·DC = ∅; then add u to PNL
(14)  end if
(15)end while
(16)if |PNL| > 1 then
(17)  Move all the nodes from PNL to CNL and then set PNL empty
(18)else
(19)  break;
(20)end if
(21)end while
(22)return The only node left in PNL which is the root of the generated α-filtering tree
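A compact Java sketch of this bottom-up construction (with illustrative names; it assumes the leaf nodes have already been created from the atom clusters):

```java
import java.util.*;

// Sketch of Algorithm 4: build the alpha-filtering tree bottom-up by
// repeatedly grouping up to ALPHA sibling nodes under a new parent.
public class BuildTreeSketch {
    static final int ALPHA = 3;

    static class Node {
        double[] FV;                       // filtering vector (upper bound)
        List<Node> PL = new ArrayList<>(); // child pointers
        List<double[]> DC;                 // documents (leaf nodes only)
    }

    static double[] upBound(List<double[]> vs) {
        double[] ub = vs.get(0).clone();
        for (double[] v : vs)
            for (int i = 0; i < ub.length; i++) ub[i] = Math.max(ub[i], v[i]);
        return ub;
    }

    static Node build(List<Node> leaves) {
        List<Node> cnl = new ArrayList<>(leaves); // current child node list
        while (cnl.size() > 1) {
            List<Node> pnl = new ArrayList<>();   // parent node list
            for (int i = 0; i < cnl.size(); i += ALPHA) {
                Node parent = new Node();
                parent.PL = new ArrayList<>(
                    cnl.subList(i, Math.min(i + ALPHA, cnl.size())));
                List<double[]> childFVs = new ArrayList<>();
                for (Node c : parent.PL) childFVs.add(c.FV);
                parent.FV = upBound(childFVs); // upper bound of the children
                pnl.add(parent);
            }
            cnl = pnl; // the new parents are merged in the next round
        }
        return cnl.get(0); // root
    }
}
```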

Theorem 2. The height of an α-filtering tree with t leaf nodes is ⌈log_α t⌉ + 1.

Proof. We assume that the length of the atom cluster list L is t, that is, the α-filtering tree has t leaf nodes. According to Algorithm 4, after the 1st, 2nd, …, xth rounds of processing, the number of currently generated parent nodes becomes ⌈t/α⌉, ⌈t/α²⌉, …, ⌈t/α^x⌉. The construction of the α-filtering tree finishes when the number of currently generated parent nodes is 1, that is, ⌈t/α^x⌉ = 1, from which we deduce x = ⌈log_α t⌉. Since the height of the tree increases by 1 in each merging round and the initial height of the tree is 1, the height of the α-filtering tree with t leaf nodes is ⌈log_α t⌉ + 1. For example, with t = 100 atom clusters and α = 4, the height is ⌈log_4 100⌉ + 1 = 5.

5.3. Multi-Keyword Ranked Search Algorithm

By adopting the greedy depth-first search algorithm on the tree index, we can obtain the top-k documents efficiently by preexcluding the subtree that certainly does not contain any search result documents. We give the greedy depth-first search algorithm in this section.

Definition 4. For a query Q with query vector VQ and two nodes u and u′, if Score(VQ, u·FV) ≥ Score(VQ, u′·FV), then u has a higher or equal relevance score with Q than u′, which is denoted as u ⪰Q u′.

Theorem 3. We assume that u = ⟨FV, PL, DC⟩ is a non-leaf node in the α-filtering tree and that u·PL stores g child nodes, i.e., u·PL = {u·PL[1], u·PL[2], …, u·PL[g]}, g ≤ α. For a query Q, we have u ⪰Q u·PL[i] for every i = 1, 2, …, g.

Proof. To prove u ⪰Q u·PL[i], it suffices to prove Score(VQ, u·FV) ≥ Score(VQ, u·PL[i]·FV) for i = 1, 2, …, g.
Every element of the n-dimensional filtering vector u·FV is generated by the following equation:

u·FV[j] = max{u·PL[1]·FV[j], u·PL[2]·FV[j], …, u·PL[g]·FV[j]}, j = 1, 2, …, n.

Since every dimension of the query vector is nonnegative, Score(VQ, u·FV) is not less than the relevance score between any child node’s filtering vector and the query vector. Hence, u ⪰Q u·PL[i] for i = 1, 2, …, g.
During the search process, for a given Q, if the relevance score between a subtree root’s filtering vector and the query vector is not higher than the current threshold of the candidate result list, then by Theorem 3 no document under that subtree can be a candidate. Thus, we can ignore the subtree entirely, which improves search efficiency; this is the pruning criterion of the greedy depth-first search algorithm. Adopting this idea, we propose the greedy depth-first search algorithm shown in Algorithm 5.
In Figure 3, we construct a 3-filtering index tree example to further illustrate the multi-keyword ranked search algorithm. The index tree is built bottom-up: the leaf nodes are generated from the atom cluster list, and the intermediate nodes are generated from the leaf nodes. We assume that the query vector is VQ = (0.5, 0.5, 0, 0) and that the top-3 ranked documents are requested. When the search starts, the algorithm first visits u11, u21, and u31 recursively along the left subtree and finds that u31 is a leaf node holding 3 documents. The algorithm puts all of them into the result list RL = {d1, d2, d3}, with relevance scores 0.3, 0.35, and 0.3, respectively. Then, u32 is accessed; the relevance score between its filtering vector and the query vector is 0.2, which is less than 0.3, so RL remains unchanged. After that, u33 is accessed with relevance score 0.35, so d9 and d10 are added to RL, replacing d1 and d3. Finally, the algorithm searches the subtree rooted at u22, finds no need to search the remaining subtree, and the search finishes.

Input:
 The root node of an α-filtering tree, r;
 The query vector of Q, VQ;
 The number of requested documents, k;
 The minimum of the relevance scores between documents in RL and Q, λ;
 The list for storing top-k ranked documents, RL;
Output:
RL;
(1)u = r;
(2)if u is a leaf node then
(3) Add all the documents of u·DC to RL;
(4)if |RL| > k then
(5)  Set the threshold λ to the kth highest relevance score between the documents in RL and VQ;
(6)  Remove the documents from RL whose relevance scores with VQ are smaller than λ;
(7)end if
(8)else
(9)if Score (VQ, u·FV) > λ then
(10)  for each u’ in u·PL do
(11)   SearchIndex (u’, VQ, k, λ, RL);
(12)  end for
(13)end if
(14)end if
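A minimal Java sketch of Algorithm 5 (with illustrative names; it reuses the node layout from the build sketch and keeps RL as a score-sorted candidate list):

```java
import java.util.*;

// Sketch of Algorithm 5: greedy depth-first search with pruning by the
// filtering vector. RL holds the current top-k candidate document
// vectors; lambda is the minimum score among them (Theorem 3 pruning).
public class SearchSketch {
    static class Node {
        double[] FV;
        List<Node> PL = new ArrayList<>();
        List<double[]> DC = new ArrayList<>();
    }

    static double score(double[] a, double[] b) { // inner product relevance
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    // rl holds candidate document vectors sorted by descending score.
    static void searchIndex(Node u, double[] vq, int k, List<double[]> rl) {
        if (u.PL.isEmpty()) {                        // leaf node
            rl.addAll(u.DC);
            rl.sort((x, y) -> Double.compare(score(y, vq), score(x, vq)));
            while (rl.size() > k) rl.remove(rl.size() - 1); // keep top-k
        } else {
            double lambda = rl.size() < k ? Double.NEGATIVE_INFINITY
                                          : score(rl.get(rl.size() - 1), vq);
            if (score(u.FV, vq) > lambda)            // prune by Theorem 3
                for (Node child : u.PL) searchIndex(child, vq, k, rl);
        }
    }
}
```

Until RL holds k candidates, the threshold is effectively unbounded below, so the first leaves reached always fill the list; afterwards, only subtrees whose filtering-vector score exceeds the current minimum are explored.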

6. Effective and Secured Multi-Keyword Ranked Search Scheme

In this section, we construct the secure search scheme using the secure kNN algorithm [41]. The data owner builds the index tree over the document set and then uses the secure keys to encrypt the document set and the index tree, respectively. The data user submits a search request to the cloud server via query keywords. The cloud server performs the search algorithm on the index tree and returns the resulting documents.

6.1. Secure Search Scheme
6.1.1. GenKey ()

The data owner generates the secure key SK = {sk, S, M1, M2}, which is used for encrypting the documents and the index tree. Here, sk is the secure symmetric encryption key for document encryption; it is shared only with the data user and kept from the cloud server. S is an n-dimensional bit vector used for vector splitting; each dimension of S is randomly chosen to be 0 or 1, with the numbers of 0s and 1s nearly equal. M1 and M2 are both n × n randomly generated invertible matrices.
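A sketch of generating the splitting vector S (in Java; generation of the random invertible matrices M1 and M2 is omitted, and the helper names are illustrative):

```java
import java.util.*;
import java.security.SecureRandom;

// Illustrative sketch of the splitting vector S from GenKey: a random
// bit vector with nearly equal numbers of 0s and 1s.
public class GenKeySketch {
    static int[] genSplitVector(int n) {
        List<Integer> bits = new ArrayList<>();
        for (int i = 0; i < n; i++) bits.add(i < n / 2 ? 0 : 1); // near-equal counts
        Collections.shuffle(bits, new SecureRandom());           // random positions
        return bits.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(genSplitVector(8)));
    }
}
```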

6.1.2. BuildIndex (D, SK)

The data owner first runs the index tree construction algorithms of Sections 5.1 and 5.2 to generate the plaintext index tree I over the documents in D. Then, the data owner encrypts the index tree into its encrypted form Ĩ. Specifically, each document vector and each node’s filtering vector is split into two vectors using the bit vector S. For simplicity, we use V to represent one of these vectors, which is split into {V′, V″} as follows:

if S[i] = 1: V′[i] + V″[i] = V[i], with V′[i] set randomly;
if S[i] = 0: V′[i] = V″[i] = V[i].

Then, the data owner encrypts the split vectors into F̃V = {M1ᵀV′, M2ᵀV″}. After that, the data owner encrypts the documents in each leaf node’s atom cluster with the secure key sk, and the encrypted index tree Ĩ is generated. Finally, the data owner outsources Ĩ and the encrypted document set D̃ to the cloud server.
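The splitting step can be sketched as follows (in Java; a standard secure kNN split consistent with the rule above, with illustrative names; the complementary query rule is the one used in GenTrapdoor below):

```java
import java.util.Random;

// Sketch of the vector splitting in BuildIndex/GenTrapdoor: each vector
// is split into two parts under the bit vector S so that the split
// parts still yield the original inner product.
public class SplitSketch {
    static Random rnd = new Random();

    // Data vector: if S[i] == 1 split V[i] randomly so that
    // V1[i] + V2[i] = V[i]; if S[i] == 0 duplicate V[i] into both parts.
    static double[][] splitDataVector(double[] v, int[] s) {
        double[] v1 = new double[v.length], v2 = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            if (s[i] == 1) { v1[i] = rnd.nextDouble(); v2[i] = v[i] - v1[i]; }
            else           { v1[i] = v[i];             v2[i] = v[i]; }
        }
        return new double[][]{v1, v2};
    }

    // Query vector: the complementary rule, duplicating when S[i] == 1
    // and splitting randomly when S[i] == 0.
    static double[][] splitQueryVector(double[] q, int[] s) {
        double[] q1 = new double[q.length], q2 = new double[q.length];
        for (int i = 0; i < q.length; i++) {
            if (s[i] == 0) { q1[i] = rnd.nextDouble(); q2[i] = q[i] - q1[i]; }
            else           { q1[i] = q[i];             q2[i] = q[i]; }
        }
        return new double[][]{q1, q2};
    }
}
```

In each dimension, exactly one of the two vectors is split and the other duplicated, so V1·Q1 + V2·Q2 = V·Q holds regardless of the random split values.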

6.1.3. GenTrapdoor (Q, SK)

The data user generates the query vector VQ from the query keywords in Q. Then, the secure key SK is used to generate the corresponding trapdoor TQ. The generation of TQ is similar to the encryption of document vectors. First, VQ is split into two vectors {VQ′, VQ″} according to the rule complementary to the index splitting:

if S[i] = 1: VQ′[i] = VQ″[i] = VQ[i];
if S[i] = 0: VQ′[i] + VQ″[i] = VQ[i], with VQ′[i] set randomly.

Then, the data user encrypts the split vectors into the trapdoor TQ = {M1⁻¹VQ′, M2⁻¹VQ″}. Finally, TQ is submitted to the cloud server as the search request.

6.1.4. SearchIndex (Ĩ, TQ, k)

The cloud server receives the trapdoor TQ and performs the search algorithm on the secure index tree Ĩ. It then returns the encrypted top-k result list RL to the data user, who decrypts the documents, finishing the search process. The matrix encryption yields the inner product of two vectors from their encrypted forms alone, as illustrated below:

F̃V · TQ = (M1ᵀV′) · (M1⁻¹VQ′) + (M2ᵀV″) · (M2⁻¹VQ″) = V′ · VQ′ + V″ · VQ″ = V · VQ.
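An end-to-end toy check of this pipeline (in Java; the 2-dimensional vectors, split values, and invertible matrices are all hard-coded and illustrative):

```java
// Sketch: split a document vector and a query vector under S, encrypt
// with hard-coded invertible M1, M2, and check that the encrypted inner
// product equals the plaintext relevance score.
public class RoundTripSketch {
    static double[] mul(double[][] m, double[] v) {
        double[] r = new double[m.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < v.length; j++) r[i] += m[i][j] * v[j];
        return r;
    }
    static double[][] tr(double[][] m) {
        double[][] t = new double[m[0].length][m.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < m[0].length; j++) t[j][i] = m[i][j];
        return t;
    }
    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    public static void main(String[] args) {
        // S = {1, 0}: dim 0 splits the data vector, dim 1 splits the query.
        double[] fv = {0.4, 0.6}, vq = {0.5, 0.5};
        double[] fv1 = {0.1, 0.6}, fv2 = {0.3, 0.6}; // fv1+fv2 = fv on dim 0
        double[] vq1 = {0.5, 0.2}, vq2 = {0.5, 0.3}; // vq1+vq2 = vq on dim 1
        double[][] M1 = {{2, 1}, {1, 1}}, M1inv = {{1, -1}, {-1, 2}};
        double[][] M2 = {{1, 2}, {1, 3}}, M2inv = {{3, -2}, {-1, 1}};
        double[] i1 = mul(tr(M1), fv1), i2 = mul(tr(M2), fv2); // encrypted index
        double[] t1 = mul(M1inv, vq1),  t2 = mul(M2inv, vq2);  // trapdoor
        System.out.println(dot(fv, vq));               // 0.5
        System.out.println(dot(i1, t1) + dot(i2, t2)); // same value
    }
}
```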

To protect trapdoor unlinkability and keyword privacy under the known background model, we must prevent the server from computing the exact relevance score between TQ and FV, which can leak TF distribution information. Thus, we add phantom terms [11] to the vectors generated in our scheme to disturb the relevance score calculation, at the cost of some search accuracy.

In the enhanced scheme, we generate (n + n′) × (n + n′)-dimensional secure matrices, and the document vectors are extended to n + n′ dimensions. The extended elements FV[n + i] are set to random numbers βi. Similarly, the query vector is extended to an (n + n′)-dimensional vector whose extended elements are randomly set to 1 or 0. Thus, the relevance score between the query trapdoor and a document vector equals FV · VQ plus the sum of those βi with VQ[n + i] = 1. The randomness of the βi ensures privacy against the known background model.
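The effect of the phantom terms can be sketched as follows (in Java; the dimensions, β distribution, and names are illustrative assumptions, not the paper’s exact parameters):

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch of the phantom-term extension: the document vector
// gains n' random dimensions beta_i, and the extended query dimensions
// are set randomly to 0 or 1, so the observable score becomes FV.VQ plus
// a random subset-sum of the beta_i and varies across repeated queries.
public class PhantomTermSketch {
    public static void main(String[] args) {
        Random rnd = new Random();
        double[] fv = {0.2, 0.1, 0.4, 0.3};
        double[] vq = {0.5, 0.0, 0.5, 0.0};
        int n = fv.length, nPrime = 2;
        double[] fvExt = Arrays.copyOf(fv, n + nPrime);
        double[] vqExt = Arrays.copyOf(vq, n + nPrime);
        for (int i = 0; i < nPrime; i++) {
            fvExt[n + i] = rnd.nextGaussian() * 0.01; // phantom term beta_i
            vqExt[n + i] = rnd.nextBoolean() ? 1 : 0; // random 0/1 selection
        }
        double exact = 0, disturbed = 0;
        for (int i = 0; i < n; i++) exact += fv[i] * vq[i];
        for (int i = 0; i < fvExt.length; i++) disturbed += fvExt[i] * vqExt[i];
        System.out.println(exact + " vs " + disturbed); // slightly different scores
    }
}
```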

6.2. Security Analysis

In this paper, we construct a tree-based secure search scheme in the same way as [11, 42], so the security of our scheme matches that of [11, 42]. We briefly argue each property as follows:
(i) Document privacy: we apply traditional symmetric encryption to the documents before outsourcing them to the cloud server. As long as the secure key is kept from the adversary, document privacy is protected.
(ii) Index and trapdoor privacy: the document vectors and query vectors store the TF and IDF values of the corresponding keywords and, after being randomly split, are encrypted with the secure matrices generated by secure kNN. Both matrices are randomly generated invertible matrices, and the adversary cannot recover them from the encrypted vectors alone. Therefore, index and trapdoor privacy is protected.
(iii) Trapdoor unlinkability: the query vector is randomly split by the split vector S for each search, so the trapdoors differ even for identical search requests, which guarantees trapdoor unlinkability. However, the cloud server can still link identical search requests by observing the access pattern and the ranked results. To mitigate this, we can extend the vectors in our secure scheme with phantom dimensions to interfere with the relevance score, so the results for identical requests can differ. The search accuracy then decreases; the balance between privacy and accuracy is discussed in [11].
(iv) Keyword privacy: since index and trapdoor privacy is protected, keyword privacy holds in the known ciphertext model. In the known background model, the relevance scores between documents and the query vector can leak TF information about the query keywords: if a search request contains only one keyword, or one of its keywords has a high TF value, the cloud server can infer that keyword from its statistical knowledge of keyword TF distributions. Again, we add phantom terms to obfuscate the relevance score between the query trapdoor and the document vectors, so the TF-IDF value varies across search requests. The cloud server therefore cannot link keywords to their TF distributions, and keyword privacy is enhanced.

7. Performance Analysis

We evaluate the performance of our α-filtering index tree scheme in this section and compare it with Xia et al.’s index tree scheme [11] and Zhu et al.’s HAC-tree [42] under different settings. We use a real-world dataset of 120000 documents and implement our scheme in Java on Windows 10 with an Intel Core i5-6200U @ 2.30 GHz CPU; the default parameter settings are shown in Table 1. Here, k is the number of requested documents, μ the document threshold of an atom cluster, α the maximum number of child nodes, and m the number of documents; the number of search keywords is also varied. The enhanced scheme adds phantom terms to strengthen security; without phantom terms, the original and enhanced schemes have identical search accuracy and efficiency, so for simplicity we evaluate only the original scheme. The influence of phantom terms is discussed in [11].

7.1. Space Usage Evaluation

In this section, we analyze the space usage of the different schemes’ index trees. Since only the index tree space is discussed, the search parameters are left unchanged. The space usage of Xia’s scheme is the same as Zhu’s because both are binary trees with the same number of nodes.

7.1.1. Space Usage versus μ

We vary the document threshold μ of each cluster to compare the space usage of the three schemes. Figures 4(a) and 4(b) show the index tree space cost when the number of documents is 20000 and 120000, respectively.

The results show that as the document set grows, the space usage of the index tree increases significantly because more tree nodes are added to store more documents. They also show that a larger threshold μ saves index tree space, since it reduces the number of nodes in the α-filtering tree.

7.1.2. Space Usage versus α

We change α of the α-filtering tree to compare the space usage of three schemes. Figures 5(a) and 5(b) show the index tree space cost when the number of documents is 20000 and 120000, respectively.

The results show that an appropriate setting of α can largely reduce the space usage of the α-filtering tree. But when α becomes too large, the savings level off: more nodes share the same parent, and the space usage tends to be stable.

7.2. Index Building Time Cost Evaluation

In this section, we evaluate the time cost of index building. We measure the time cost of the BuildIndex algorithm of our scheme, shown in Table 2, given m = 20000. The BuildIndex algorithm in our scheme takes hours, while Xia’s takes seconds. Note that the keyword extraction and TF-IDF calculation are the same in all three schemes, and the tree construction itself has almost the same time cost because the basic tree structures are similar. The main difference is that our scheme runs a clustering algorithm to improve search efficiency, and clustering consumes considerable time, leading to a worse index building time. This could be improved by adopting more efficient clustering algorithms, such as distributed clustering. The longer index building time is affordable because it is a one-time cost that enables more efficient searches.

7.3. Search Time Cost Evaluation

In this section, we evaluate the search time cost of the different schemes. Each data point in the figures is averaged over at least 10 runs.
(1) Time cost versus μ. Figure 6 indicates that our scheme outperforms the existing schemes in the search process. The α-filtering tree accelerates the search by shortening the height of the tree, reaching the leaf nodes faster, and covering more nodes per intermediate node, while the bisecting k-means clustering gathers similar documents in the same leaf nodes, which fills the candidate result list quickly. However, as μ increases, the time cost of our scheme tends to grow, because a leaf node holds more documents and the relevance calculation within the leaf slows down. The time costs of Xia’s and Zhu’s schemes grow sharply with the scale of the document set because both are binary trees whose height increases faster than that of the α-filtering tree.
(2) Time cost versus α. Figure 7 indicates that the time cost of our scheme is lower than that of the existing schemes. As mentioned above, an appropriate setting of α improves performance, but when α is too large, the pruning test at an intermediate node requires more calculation while fewer subtrees remain to be pruned, so the pruning effect worsens.
(3) Time cost versus the number of search keywords. Figure 8 shows that more search keywords slow down the search process of tree-based index schemes, but overall, our scheme outperforms the others thanks to the α-filtering tree.
(4) Time cost versus k. Figure 9 shows that under different settings of k, the time cost of our scheme is lower than those of Xia and Zhu. As k increases, the time costs of the tree-based schemes increase only slightly, because the pruning in the tree index reduces the number of relevance calculations between documents and the query vector.

7.4. The Setting of α

The experiments show that different settings of α yield different improvements, but it is hard to find an appropriate α for a tree over m documents. The space usage of the α-filtering tree decreases as α increases. However, the search time can increase with α and is worst when α = m: the search algorithm then iterates over every node in the tree, and the filtering vector in the only non-leaf node cannot help filter noncandidate nodes. A good α-filtering tree should balance width against depth, and it should have a depth of at least three for the filtering vectors to take effect. The B+ tree [43] is a multibranch tree widely adopted for indexing large data, and in practice its degree is usually set to the block size divided by the key size [43] when it indexes datasets much larger than the one used in our experiments; such settings can help choose an appropriate α. The search time along a single root-to-leaf path of the α-filtering tree is proportional to the tree height ⌈log_α t⌉ + 1, so a shallower tree searches each path faster. However, a shorter tree has fewer filtering vectors and filters fewer nodes than a taller one, which can hurt search efficiency. Thus, the best setting of α for a given m is hard to pin down and needs further discussion.

8. Conclusion

To address the efficiency problem of privacy-preserving multi-keyword ranked search, we propose an α-filtering tree index search scheme based on bisecting k-means clusters. The scheme exploits the characteristics of a multibranch tree, which greatly reduces the spatial complexity of the index tree. At the same time, clustering stores related documents close together in the index tree, which greatly strengthens the pruning of the search algorithm and thus improves search efficiency. On the other hand, since the index tree nodes store documents as clusters and bisecting k-means clustering takes considerable time, the flexibility of the index tree for updates could be limited. The experimental results on a real-world dataset show that, to a great extent, our scheme improves the search efficiency of privacy-preserving multi-keyword ranked search while guaranteeing the accuracy of the search results.

Data Availability

The text data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported by the National Natural Science Foundation of China under the Grant nos. 61872197, 61972209, 61572263, 61672297, and 61872193; the Postdoctoral Science Foundation of China under the Grant no. 2019M651919; the Natural Science Foundation of Anhui Province under Grant no. 1608085MF127; the Natural Science Foundation of Anhui Province under Grant no. KJ2017A419; and the Natural Research Foundation of Nanjing University of Posts and Telecommunications under Grant no. NY217119.