Security and Communication Networks

Volume 2017, Article ID 1923476, 17 pages

https://doi.org/10.1155/2017/1923476

## MUSE: An Efficient and Accurate Verifiable Privacy-Preserving Multikeyword Text Search over Encrypted Cloud Data

^{1}College of Computer Science and Technology, Nanjing University of Posts and Telecommunications, Nanjing 200013, China

^{2}Jiangsu Key Laboratory of Big Data Security and Intelligent Processing, Nanjing 210013, China

^{3}School of Computer Science and IT, RMIT University, Melbourne, VIC 3001, Australia

Correspondence should be addressed to Dai Hua; daihua@njupt.edu.cn

Received 9 February 2017; Accepted 22 May 2017; Published 11 July 2017

Academic Editor: Xiangyang Luo

Copyright © 2017 Zhu Xiangyang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

With the development of cloud computing, outsourcing services to clouds has become a popular business model. However, because data storage and computing are completely outsourced to the cloud service provider, the sensitive data of data owners is exposed, which could lead to serious privacy disclosure. In addition, unexpected events, such as software bugs and hardware failures, could cause clouds to return incomplete or incorrect results. In this paper, we propose an efficient and accurate verifiable privacy-preserving multikeyword text search scheme over encrypted cloud data based on hierarchical agglomerative clustering, named MUSE. To improve the efficiency of text searching, we propose a novel index structure, the HAC-tree, which is based on a hierarchical agglomerative clustering method and tends to gather highly relevant documents into clusters. Based on the HAC-tree, a noncandidate pruning depth-first search algorithm is proposed, which filters out unqualified subtrees and thus accelerates the search process. The secure inner product algorithm is used to encrypt the HAC-tree index and the query vector. Meanwhile, a completeness verification algorithm is given to verify search results. Experimental results demonstrate that the proposed method outperforms the existing works, DMRS and MRSE-HCI, in efficiency and accuracy, respectively.

#### 1. Introduction

IT resources, such as computing and storage, are nowadays treated as on-demand services in clouds, a model summarized as “X as a Service.” To reduce the cost of data management and storage, data owners (DOs) who hold a large amount of data usually choose to outsource their data to clouds. However, DOs cannot directly control data placed on remote cloud servers, which raises concerns that the outsourced data may be illegally acquired or abused by the cloud service providers (CSPs), especially privacy-sensitive data such as medical records, government documents, and emails. Although many CSPs claim that their cloud services have several security countermeasures, such as access control, firewalls, and intrusion detection, doubts about the security and privacy of outsourced data are still the main obstacles to the wider development of cloud computing [1].

A general approach to protecting data privacy is to encrypt the data before outsourcing [2]. However, encryption makes data management and utilization significantly more difficult and costly. In the field of information retrieval (IR), existing multikeyword retrieval techniques mainly target plaintext data and cannot be directly applied to encrypted data. Downloading all encrypted data from the cloud and decrypting it locally is obviously unrealistic and wasteful. In addition, due to hardware or software failures, storage device failures, and so forth, the search results may contain corrupted or incorrect data. If users cannot verify the completeness and correctness of search results, upper-level decisions based on those results may be misled. Therefore, it is a challenge to design a searchable encryption scheme that supports verifiable privacy-preserving multikeyword text search over encrypted cloud data, which has recently become one of the hot issues in cloud computing [3–8].

To deal with the above problems, several encrypted data search methods [9–12] have been proposed that utilize various cryptographic techniques, such as homomorphic encryption and public-key cryptography. They are proved secure for text searching, but they usually need massive mathematical operations and incur high computational overhead. Hence, these methods are ill-suited to the cloud computing scenario, where the stored data is very large and online data processing is a basic requirement. Besides, relationships between documents, such as the category that describes how documents are classified, are not taken into account during the search process. Considering this type of relationship in the design of an encrypted document search scheme could improve search efficiency and accuracy. However, the category relationship is concealed by the blind encryption used in traditional methods. Therefore, it is desirable and helpful to maintain and utilize the category relationship to perform efficient and accurate text search over encrypted outsourced documents.

In this paper, we propose an efficient and accurate verifiable privacy-preserving multikeyword text search scheme over encrypted cloud data (MUSE), which is based on a hierarchical agglomerative clustering tree index (HAC-tree). We use the TF-IDF model and the vector space model to represent every document and the keywords of interest in each query as vectors, which means that each of them is considered a point in a high-dimensional space. The secure inner product computation, adapted from secure kNN [13], is used to measure the relevance score between two documents or between a document and a query. Based on hierarchical agglomerative clustering, we propose a novel index structure, the HAC-tree, which is constructed from the bottom leaf nodes up to the root node. First, documents are initialized as leaf nodes. Then the internal nodes are generated level by level, by clustering pairs of child nodes in order of their relevance scores, until the root node is reached. Each internal node carries a pruning vector, which is the maximum vector extracted from its child nodes. On the basis of the HAC-tree, we propose a noncandidate pruning depth-first search algorithm (NCP-DFS) that recursively searches for the documents with the top-*k* relevance scores. The pruning vectors of internal nodes are used to filter out noncandidate subtrees, which cannot contain any search results; thus the search space is narrowed and the search process is accelerated without reducing result accuracy. To verify the completeness and correctness of the search results, digests are generated for the documents stored in the leaf nodes of the HAC-tree. When search results are returned, a verification object (VO) is constructed for the result documents and returned to the data user (DU) along with the results. When the DU receives the returned data from the cloud server (CS), a VO reconstruction procedure is performed to verify the completeness and correctness of the search results.
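The bottom-up construction and pruned search outlined above can be illustrated with a minimal plaintext sketch. It is not the paper's algorithm: the greedy pairing rule, the averaged `center` vector, and the names `Node`, `build_tree`, and `top_k` are assumptions made for illustration, and the pruning vector is taken to be the element-wise maximum of the children's vectors. Because all vector elements are nonnegative, the inner product of a pruning vector with the query is an upper bound on the score of every document in that subtree, which is what makes pruning safe.

```python
import heapq
from itertools import combinations

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

class Node:
    def __init__(self, center, prune, doc_id=None, children=()):
        self.center = center      # vector used when pairing nodes (assumed: mean of children)
        self.prune = prune        # element-wise maximum over the subtree
        self.doc_id = doc_id      # set only for leaf nodes
        self.children = children

def build_tree(doc_vectors):
    """Pair the most relevant nodes level by level until one root remains."""
    level = [Node(v, v, doc_id=i) for i, v in enumerate(doc_vectors)]
    while len(level) > 1:
        pairs = sorted(combinations(range(len(level)), 2),
                       key=lambda ij: dot(level[ij[0]].center, level[ij[1]].center),
                       reverse=True)
        used, parents = set(), []
        for i, j in pairs:
            if i in used or j in used:
                continue
            a, b = level[i], level[j]
            center = [(x + y) / 2 for x, y in zip(a.center, b.center)]
            prune = [max(x, y) for x, y in zip(a.prune, b.prune)]
            parents.append(Node(center, prune, children=(a, b)))
            used.update((i, j))
        # An unpaired node (odd count) is promoted unchanged to the next level.
        parents.extend(level[i] for i in range(len(level)) if i not in used)
        level = parents
    return level[0]

def top_k(root, query, k):
    """Depth-first search that skips subtrees whose upper bound cannot beat the current k-th score."""
    if k <= 0:
        return []
    heap = []  # min-heap of (score, doc_id)

    def visit(node):
        if len(heap) == k and dot(node.prune, query) <= heap[0][0]:
            return  # noncandidate subtree: no document inside can enter the result
        if node.doc_id is not None:
            item = (dot(node.center, query), node.doc_id)
            if len(heap) < k:
                heapq.heappush(heap, item)
            else:
                heapq.heappushpop(heap, item)
            return
        # Visit the more promising child first so the threshold rises quickly.
        for child in sorted(node.children, key=lambda c: dot(c.prune, query), reverse=True):
            visit(child)

    visit(root)
    return sorted(heap, reverse=True)

docs = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.0, 0.1, 0.9], [0.1, 0.0, 0.9], [0.5, 0.5, 0.0]]
query = [1.0, 0.0, 0.0]
root = build_tree(docs)
result = top_k(root, query, 2)
```

In this toy run the search returns the same two documents as an exhaustive scan, which is the property the pruning bound is meant to preserve.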

Our contributions in this paper are summarized as follows:

(1) Following the basic idea of hierarchical agglomerative clustering, we propose a novel index structure, the HAC-tree, and the corresponding bottom-up construction algorithm. The HAC-tree is the lowest binary tree when the number of leaf nodes is fixed. Moreover, documents with higher relevance scores between them (which means that those documents may belong to similar categories) are always clustered together.

(2) On the basis of the HAC-tree, we propose a noncandidate pruning depth-first search algorithm (NCP-DFS) for multikeyword text search. It prunes the subtrees that surely do not contain any search results, thus improving the search efficiency.

(3) We define digests for documents and propose a result verification algorithm to check whether the search results are complete and correct. This algorithm can detect damaged or incomplete results caused by hardware or software faults.

(4) Based on the above methods, we propose two verifiable privacy-preserving multikeyword text search schemes over encrypted cloud data, named BMUSE and EMUSE, respectively. BMUSE is a basic scheme that can resist the known ciphertext threat, while EMUSE is an enhanced scheme that can resist the known background threat.

(5) We compare the proposed method with DMRS [7] and MRSE-HCI [8] on a real dataset. Our results show that (i) the proposed method outperforms DMRS in terms of efficiency without losing accuracy; and (ii) the proposed method consistently gives more accurate results than MRSE-HCI, while it may be less efficient than MRSE-HCI, depending on the choices of parameter values in MRSE-HCI.

The paper is organized as follows. Section 2 describes the related work. Section 3 gives the main notations and necessary preliminaries. Section 4 gives the model and problem statement. Section 5 presents the index structure HAC-tree and its construction algorithm, followed by the plaintext multikeyword search algorithm and the search result verification that build on it. Section 6 presents the basic and enhanced schemes of the secure multikeyword text search. In Section 7, we analyze the security of the schemes. Section 8 reports the experiments, which compare the proposed method with existing schemes in terms of search result accuracy and search efficiency. Section 9 concludes this paper.

#### 2. Related Work

Searchable encryption (SE) allows data owners to encrypt their documents before uploading them to the cloud server. In recent years, searchable encryption has drawn wide attention, for example, [14–25].

*(i) Single Keyword Searchable Encryption*. Song et al. [26] first propose symmetric searchable encryption based on a pseudo-random function and a symmetric encryption mechanism and prove its security rigorously. Goh [14] gives security definitions that formalize the security requirements of searchable symmetric encryption schemes. Subsequently, many improved and novel methods are proposed [27–29]. Boneh et al. [10] propose a public-key ciphertext search algorithm named PEKS based on bilinear mapping and IBE encryption. In this algorithm, data encrypted with the public key can be authenticated by the gateway and sent to the corresponding user, but the real content is not revealed. However, none of the above works ranks the results. The cloud server has to return all the results that meet the query request, resulting in unnecessary bandwidth and processing overhead. To return only the most relevant search results, Wang et al. [30] present keyword ranked search over encrypted data based on TF × IDF and Order-Preserving Symmetric Encryption (OPSE). However, the above works focus only on single keyword search and cannot be applied to the scenario of multikeyword search, which is the main concern of this paper.

*(ii) Multikeyword Searchable Encryption*. In the typical case of search over encrypted data in cloud computing, a single keyword cannot sufficiently express the data user's search intention. The cloud server will inevitably return an excessive number of matches, most of which will probably be irrelevant to the user. Multikeyword search allows data users to characterize their requests from multiple perspectives, so that the search results are the documents most relevant to the query. Bilinear pairing-based solutions are presented in [31–33]. The results of bilinear pairing-based solutions are free from the false positives and false negatives caused by hashing. However, the computation costs of pairing-based solutions are prohibitively high on both the server side and the user side. Pang et al. [34] propose a secure search scheme based on the vector space model. Because no security analysis of frequency information is given, it is unclear whether keyword privacy is disclosed. Besides, their experiments do not demonstrate practical search performance. Cao et al. [3] define and solve the challenging problem of privacy-preserving multikeyword ranked search over encrypted cloud data (MRSE). MRSE adopts the “coordinate matching” similarity measure to capture the relevance of documents to the search query and ignores frequency information, which leads to low result accuracy. Meanwhile, MRSE incurs a huge computational overhead. Sun et al. [4] use an MDB-tree as the index structure to improve the efficiency of MRSE, in a scheme named MTS. Each level of the MDB-tree represents a subvector instead of an attribute domain as in the database scenario, which leads to a decrease in accuracy. The index vector clustering further degrades the retrieval accuracy. Xia et al. [7] present a secure and dynamic multikeyword ranked search scheme (DMRS), which constructs a tree-based index structure to ensure accurate relevance score calculation between the encrypted document vector and the query vector. DMRS is significantly superior to MTS in accuracy. However, because it neglects the relationships between documents, this scheme still incurs a high calculation cost. Therefore, there is still much room for improvement in search efficiency. In this paper, we focus on how to utilize the relationships between documents to improve the search efficiency.

Obviously, if similar documents are grouped when the index is constructed, so that their access paths are as close as possible, the efficiency of multikeyword search can improve greatly. Chen et al. [8] use this idea to propose a multikeyword ranked search scheme over encrypted data based on a hierarchical clustering index (MRSE-HCI). This method uses *k*-means to cluster the documents based on a minimum relevance threshold and a maximum subcluster size. Searching in the most relevant subcluster can achieve a linear computational complexity against an exponential size increase of the document collection. Nevertheless, the significant improvement in efficiency comes at the expense of accuracy, so the results cannot fulfill user expectations well. In this paper, we not only improve the top-*k* search efficiency but also ensure the accuracy of search results. Although the search efficiency of our work may be lower than that of MRSE-HCI in some cases, depending on the choices of parameter values in MRSE-HCI, the search result accuracy is much higher than that of MRSE-HCI.

In addition, Sun et al. [35] use a Merkle hash tree and cryptographic signatures to build a verifiable MDB-tree. Chen et al. [8] design a minimum hash subtree as a verifiable structure. However, both works need to transmit large verification objects. Wan and Deng [36] propose an adapted homomorphic MAC technique and a random challenge technique with ordering for verifying top-*k* search results. However, the method requires a linear search of all documents and has poor efficiency. Hence, a proper mechanism is needed that substantially reduces the transmission cost of the verification object.

#### 3. Notations and Preliminaries

##### 3.1. Notations

For the sake of clarity, we first introduce the main notations used in this paper:

(i): a plaintext document collection,

(ii): an encrypted document collection,

(iii): a keyword dictionary including keywords,

(iv): a binary tree index generated from (each leaf node is associated with a document in )

(v): a searchable encrypted index which is generated from

(vi): a set of plaintext index,

(vii): the encrypted form of

(viii): a node of the index tree

(ix): a cluster whose items are documents represented by leaf nodes of the subtree with node as its root

(x): a vector of document

(xi): the encrypted form of

(xii): a query consisting of a set of the interested keywords,

(xiii): the vector of

(xiv): the encrypted form of

(xv): a list storing the search results

(xvi), : symmetric encryption and decryption functions where is a private key

##### 3.2. Preliminaries

*Vector Space Model*. Vector space model along with TF-IDF algorithm is very popular in the information retrieval area, which is also widely used in secure multikeyword search [3, 4, 7, 8, 16, 18]. We adopt the classic definitions of term frequency (TF) and inverse document frequency (IDF). The former refers to the number of times a given keyword or term exists in documents while the latter is calculated through dividing the total number of documents in the collection by the number of documents having the keyword. Under the vector space model, each document is denoted as a -dimensional vector . Any element in stores the normalized TF value of the keyword whose calculation formula is shown in (1). Similar to documents, the interested keywords are also denoted as a -dimensional vector, named the query vector , whose element stores the normalized IDF value of the keyword in . The calculation formula of is shown in (2). Obviously, the lengths of the document vector and query vector are equal to the capacity of the keywords dictionary and each element of them is nonnegative.where is the TF value of in , , is IDF value of in , , is the frequency of that appears in , and is the number of documents containing in .
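As a concrete illustration of the vector construction described above, the following sketch builds normalized TF vectors for documents and a normalized IDF vector for a query. Formulas (1) and (2) are not reproduced in this text, so the specific forms used here are assumptions borrowed from common practice: a raw TF of `1 + ln(frequency)`, a raw IDF of `ln(1 + N / N_w)`, and L2 normalization. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def l2_normalize(values):
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm if norm > 0 else 0.0 for v in values]

def document_vector(tokens, dictionary):
    """Normalized TF vector of one document over the keyword dictionary."""
    counts = Counter(tokens)
    # Assumed raw TF: 1 + ln(frequency) for keywords that occur, 0 otherwise.
    raw = [1 + math.log(counts[w]) if counts[w] > 0 else 0.0 for w in dictionary]
    return l2_normalize(raw)

def query_vector(keywords, dictionary, documents):
    """Normalized IDF vector of a query; elements of keywords not queried are 0."""
    n_docs = len(documents)
    raw = []
    for w in dictionary:
        n_w = sum(1 for doc in documents if w in doc)  # documents containing w
        if w in keywords and n_w > 0:
            raw.append(math.log(1 + n_docs / n_w))     # assumed raw IDF
        else:
            raw.append(0.0)
    return l2_normalize(raw)

def relevance(u, v):
    """Relevance score as the inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

dictionary = ["cloud", "privacy", "search"]
docs = [["cloud", "search", "cloud"], ["privacy", "search"], ["cloud", "privacy", "privacy"]]
doc_vectors = [document_vector(d, dictionary) for d in docs]
q = query_vector({"cloud"}, dictionary, docs)
scores = [relevance(v, q) for v in doc_vectors]
```

Both vectors have one element per dictionary keyword and only nonnegative elements, matching the properties stated above. In this toy corpus the document that mentions "cloud" twice scores highest for the query "cloud", and the document that never mentions it scores zero.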

*Relevance Score Measurement*. In this paper, we use the same measurement as in [8] to quantify the relevance score between a pair of documents and between a document and a query (which is represented by the keywords of interest to the DU). It is also used to quantify the relevance score between a pair of cluster centers and between a cluster center and a query. All of these relevance scores can be computed uniformly as the inner product of two vectors $u$ and $v$, which is shown in (3):

$$\mathrm{Score}(u, v) = u \cdot v = \sum_{i} u_i v_i. \quad (3)$$

*Secure Inner Product Operation*. To preserve privacy, we adopt the secure inner product operation proposed in [13]. The operation can calculate the inner product of two vectors without knowing their plaintext values. Its basic idea is as follows. Assume that $p$ and $q$ are two $d$-dimensional vectors and $M$ is a random $d \times d$ invertible matrix that is treated as a secret key. The encrypted forms of $p$ and $q$ are denoted as $\tilde{p}$ and $\tilde{q}$, respectively, where $\tilde{p} = M^{T} p$ and $\tilde{q} = M^{-1} q$. The inner product of $\tilde{p}$ and $\tilde{q}$ is calculated as (4), which indicates that $\tilde{p} \cdot \tilde{q} = p \cdot q$:

$$\tilde{p} \cdot \tilde{q} = (M^{T} p)^{T} (M^{-1} q) = p^{T} M M^{-1} q = p^{T} q = p \cdot q. \quad (4)$$

Note that we can get the inner product of two vectors without knowing the plaintext.
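The identity behind this operation can be checked numerically. The sketch below is only an illustration under stated assumptions: one vector is encrypted with the transpose of the key matrix and the other with its inverse, which is the usual form of the secure kNN technique; the helper names are hypothetical; and exact rational arithmetic from the standard `fractions` module is used so that the equality holds without rounding error. A small integer key matrix like this one is not a secure key.

```python
import random
from fractions import Fraction

def random_invertible_matrix(d, rng):
    """Product of unit lower- and upper-triangular integer matrices (determinant 1)."""
    lower = [[1 if i == j else (rng.randint(-3, 3) if j < i else 0) for j in range(d)]
             for i in range(d)]
    upper = [[1 if i == j else (rng.randint(-3, 3) if j > i else 0) for j in range(d)]
             for i in range(d)]
    return [[sum(Fraction(lower[i][k]) * upper[k][j] for k in range(d)) for j in range(d)]
            for i in range(d)]

def inverse(matrix):
    """Gauss-Jordan elimination with exact rational arithmetic."""
    d = len(matrix)
    aug = [list(row) + [Fraction(int(i == j)) for j in range(d)]
           for i, row in enumerate(matrix)]
    for col in range(d):
        pivot = next(r for r in range(col, d) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [x / pivot_value for x in aug[col]]
        for r in range(d):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[d:] for row in aug]

def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

def mat_vec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

rng = random.Random(7)
key = random_invertible_matrix(3, rng)           # secret key matrix
p = [Fraction(1, 2), Fraction(0), Fraction(3, 4)]
q = [Fraction(2, 3), Fraction(1, 3), Fraction(1, 5)]
enc_p = mat_vec(transpose(key), p)               # encrypted form of p
enc_q = mat_vec(inverse(key), q)                 # encrypted form of q
score_from_ciphertexts = dot(enc_p, enc_q)
```

The party that holds only `enc_p` and `enc_q` obtains the same score as `dot(p, q)` without seeing `p`, `q`, or the key.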

The vector space model, the inner product of two vectors, and the secure inner product operation are widely used in existing works [3, 4, 7, 8, 17, 18, 35]. In this paper, we use them to design secure multikeyword text search schemes over encrypted cloud data.

#### 4. Model and Problem Statement

##### 4.1. System Model

The system considered in this paper is the same as in [3–5, 7, 8, 16–18] and consists of three entities: the data owner (DO), the data user (DU), and the cloud server (CS). As shown in Figure 1, their collaboration is as follows.