Abstract

Searchable encryption enables users to securely store and search their documents on a remote, semi-trusted server, which makes it especially suitable for protecting sensitive data in the cloud. However, various settings (based on symmetric or asymmetric encryption) and functionalities (ranked keyword query, range query, phrase query, etc.) are often realized by different methods with different searchable structures that are generally incompatible with each other, which limits the scope of application and hinders functional extensions. We prove that an asymmetric searchable structure can be converted to a symmetric structure, and that functions can be modeled separately from the core searchable structure. Based on this observation, we propose a layered searchable encryption (LSE) scheme, which provides compatibility, flexibility, and security for various settings and functionalities. In this scheme, the outputs of the core searchable component, based on either the symmetric or the asymmetric setting, are converted to uniform mappings, which are then transmitted to loosely coupled functional components to further filter the results. In this way, every functional component directly supports both symmetric and asymmetric settings. Based on LSE, we propose two representative and novel constructions for ranked keyword query (previously available only in the symmetric setting) and range query (previously available only in the asymmetric setting).

1. Introduction

Cloud storage provides an elastic, highly available, easily accessible, and cheap repository where users can store and use their data, and such convenience attracts more and more people. In many cases, users require their sensitive data, such as business documents, to be secure against any adversary, including the cloud provider, and therefore all data must be encrypted before being sent to the server [1]. However, traditional encryption schemes (e.g., DES) provide no search functionality, so retrieving the desired documents by keywords, a basic function of any storage system, becomes impossible. The problem is that there is no way to know whether a keyword occurs in an encrypted document without decryption, and the server obviously should not have the decryption key.

Searchable encryption provides a solution to this problem. It enables users to encrypt their sensitive data and store it on a remote server while retaining the ability to search by keywords. To search, the user sends the server a secret token (a transformation of the queried keywords); the server then uses the token to search over the encrypted data and returns the matched documents. During this process, the server learns neither the queried keywords nor the document contents, and therefore privacy is guaranteed.

Many searchable encryption schemes have been proposed with various settings and functionalities. For symmetric searchable encryption schemes, the user encrypts, searches, and decrypts the documents using his/her private symmetric key. For asymmetric searchable encryption schemes, the data sender encrypts the documents using the user’s public key, and the user searches and decrypts the documents using the private key. Beyond the basic keyword matching, many functions are also added to either symmetric or asymmetric setting, such as range query, phrase query, and fuzzy keyword query.

However, these functions are often realized by different methods with different searchable structures that are generally incompatible with each other. For example, the asymmetric encryption scheme introduced in [2] realizes conjunctive, subset, and range queries, but it is unclear how to apply this method to the symmetric setting. Even within the same setting, such as the fuzzy query scheme introduced in [3] and the rank-ordered query scheme introduced in [4], it is difficult to combine two methods, since the functions are built on different indexing structures.

The layered searchable encryption (LSE) scheme aims to provide compatibility, flexibility, and security for various settings and functionalities. In this new framework, keywords are first transformed to tokens that are filtered by the core searchable component (symmetric or asymmetric setting); the tokens are then dynamically converted to uniform mappings, which are transmitted to stand-alone functional components (e.g., a ranked keyword query component, a fuzzy query component, etc.) to further filter the results. Since all functional components are independent of each other and the interfaces are common, the functions are compatible with each other and directly support both symmetric and asymmetric settings; adding or deleting a function is simple, since each function is loosely coupled with the core searchable component. Furthermore, LSE supports combined queries. For example, the query SELECT WHERE keywords = “cloud, storage, encryption” AND “security classification > 5” ORDERED BY “keyword:cloud” (to express the query, we adopt the SQL-like format used in databases) is a combination of three functional components: basic query, range query, and ranked keyword query (in this paper, we present a concrete construction for this example).

Furthermore, this framework is similar to the data stream processing architecture [5], where functional components can be treated as operator boxes and the whole scheme as a data-flow system in which all processes follow the popular boxes-and-arrows paradigm. Therefore, compared with previous searchable encryption schemes, LSE is more suitable for distributed and parallel computing environments.

In this paper, our contributions are the following. We propose a novel framework for designing searchable encryption schemes, called layered searchable encryption (LSE), which enables combined queries and provides compatibility, flexibility, and security for various settings and functionalities. The new framework consists of a core searchable component with a symmetric/asymmetric converter, multiple functional components, and a common interface with a new security model. We propose a concrete construction for LSE that can theoretically combine all functionalities proposed in recent years, and we prove the semantic security of its interface. As a complement to prior work, we formally define two new security models for ranked keyword query and range query, called semantic security against chosen ranked keyword attack (CRKA) and against chosen range attack (CRA), respectively, which provide integral security models for cryptographic analysis. Based on LSE, we propose two representative and novel constructions for a ranked keyword query component (previously available only in the symmetric setting) and a range query component (previously available only in the asymmetric setting) and prove them semantically secure under the new security models.

The rest of the paper is organized as follows. Section 2 presents the related work. Section 3 presents the notations and preliminaries. Section 4 presents the layered searchable encryption scheme and the concrete construction. Section 5 discusses how to realize various functionalities and presents the concrete constructions for ranked keyword query and range query. Section 6 concludes this paper.

2. Related Work
Searchable encryption schemes are designed to help users securely search over encrypted data by keywords. The first scheme was introduced in [6] by Song et al., and later on many index-based symmetric searchable encryption (SSE) schemes were proposed. Goh introduced the first secure index in [7] and also built a security model for searchable encryption called semantic security against adaptive chosen keyword attack (IND-CKA). In [8], Curtmola et al. introduced two constructions to realize symmetric searchable encryption: the first construction (named SSE-1) is nonadaptive and the second one (named SSE-2) is adaptive. A generalization of symmetric searchable encryption was introduced in [9], and a representative SSE system designed by Microsoft was introduced in [10]. Another type of searchable encryption, named asymmetric searchable encryption (ASE), is public-key based, which allows the user to search over data encrypted by some data senders using the user’s public key. The first scheme was introduced in [11] by Boneh et al. based on bilinear maps, and an improved definition was introduced in [12].

There are many functional extensions for searchable encryption schemes beyond basic precise keyword matching. For the symmetric setting, the authors in [4, 13, 14] introduced ranked keyword search schemes based on order-preserving encryption or a two-round protocol, which allow the server to return only the top-k relevant results to the user. In [15], Golle et al. introduced a scheme supporting conjunctive keyword search, which allows the user to search multiple keywords in a single query. In [3, 16, 17], the authors introduced fuzzy keyword search schemes based on the wildcard technique, which allow the user to submit only part of the precise keyword. Similar to, but distinct from, fuzzy keyword search, the authors in [18, 19] introduced similarity search schemes based on the wildcard technique, which allow the server to return the results similar to the queried keyword. In [20, 21], the authors introduced phrase query schemes based on a trusted client-side server or binary search, which allow the user to query a phrase instead of multiple independent keywords. For the asymmetric setting, the authors in [22, 23] introduced range query schemes. In addition, Boneh et al. also introduced conjunctive and subset queries in [22] based on bilinear maps.

Note that most of these techniques are not compatible with each other due to specific data structure and mathematical property. However, in the following sections, we will prove that functional structures and searchable structures could be separately constructed, and asymmetric structures could be converted to symmetric structures such that a compatible all-in-one scheme is possible.

3. Notations and Preliminaries

We write x ← X to denote sampling an element x uniformly at random from a set X and write y ← A(x) to denote the output y of an algorithm A on input x. We write a‖b to denote the concatenation of two strings a and b. We write |X| to denote the cardinality of X if X is a set and the bit length of X if X is a string. A function f is negligible if for every positive polynomial p(·) there exists an integer N such that for all k > N, f(k) < 1/p(k). We write poly(k) and negl(k) to denote polynomial and negligible functions in k, respectively.

We write Δ to denote a dictionary of words in lexicographic order. We assume that all words are of length polynomial in k. We write D to refer to a document that contains m words and write |D| to denote the size of the document in bytes. In some cases, we also write id(D) to denote the document identifier that uniquely identifies the document, such as a memory location. We write X to denote a component or a scheme and write X.func() to denote the corresponding function of the component or an algorithm of the scheme.

4. Layered Searchable Encryption Scheme

Layered searchable encryption scheme aims to combine symmetric and asymmetric searchable encryption schemes to provide a uniform model for functional extensions. Therefore, we first revisit the basic symmetric and asymmetric searchable encryption models and then build the layered searchable encryption model based on these two different settings. After that, we introduce the security model of the new framework, and finally we present the concrete construction.

4.1. Revisiting Searchable Encryption

We adopt the definition introduced by Curtmola et al. in [8] as a representative model for symmetric searchable encryption. In this setting, the user who searches for the documents is also the data sender who encrypts them. Therefore, some efficient searching techniques, such as a global index, can be used, and the searchable structure may be a single index file for all stored documents. For consistency with the other definitions, we make a small modification to the original definition and define the scheme as follows.

Definition 1 (symmetric searchable encryption). A symmetric searchable encryption (SSE) scheme is a collection of five polynomial-time algorithms SSE = (Gen, Enc, Token, Search, Dec) as follows.
(i) Gen(1^k) is a probabilistic algorithm that takes as input a security parameter k and outputs a secret key K. It is run by the user, and the key is kept secret.
(ii) Enc(K, D) is a probabilistic algorithm that takes as input a secret key K and a document collection D and outputs a searchable structure I and a sequence of encrypted documents C. It enables a user to query some keywords, and the server returns the matched documents. For instance, in an index-based symmetric searchable encryption scheme, I is the secure index. It is run by the user, and (I, C) is sent to the server.
(iii) Token(K, w) is a deterministic algorithm that takes as input a secret key K and a keyword w and outputs a search token T_w (also named trapdoor or capability). It is run by the user.
(iv) Search(C, I, T_w) is a deterministic algorithm that takes as input the encrypted documents C, the searchable structure I, and the search token T_w and outputs the matched documents (or their identifiers) C_w. It is run by the server, and C_w is sent to the user.
(v) Dec(K, C_i) is a deterministic algorithm that takes as input a secret key K and an encrypted document C_i and outputs the recovered plaintext D_i. It is run by the user.
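As a concrete but deliberately toy illustration of this interface, the sketch below instantiates the five algorithms of Definition 1 in Python, using an inverted index as the searchable structure I, HMAC-based deterministic tokens, and a hash-keystream cipher as a stand-in for document encryption. It is not the scheme of [8]; all names and design choices here are ours.

```python
import hmac, hashlib, secrets

def gen(k=32):
    """Gen: sample a secret key (toy stand-in for a real key hierarchy)."""
    return secrets.token_bytes(k)

def _prf(key, data):
    return hmac.new(key, data, hashlib.sha256).digest()

def _stream_xor(key, nonce, data):
    """Toy encryption: SHA-256 counter keystream XOR (illustration only)."""
    out = bytearray()
    for i in range(0, len(data), 32):
        block = hashlib.sha256(key + nonce + i.to_bytes(4, "big")).digest()
        out.extend(b ^ s for b, s in zip(data[i:i + 32], block))
    return bytes(out)

def enc(key, docs):
    """Enc: build an inverted index I keyed by deterministic word tokens,
    and encrypt each document. Returns (I, C)."""
    index, ciphers = {}, []
    for ident, text in enumerate(docs):
        nonce = secrets.token_bytes(16)
        ciphers.append((nonce, _stream_xor(key, nonce, text.encode())))
        for w in set(text.split()):
            index.setdefault(_prf(key, w.encode()), []).append(ident)
    return index, ciphers

def token(key, word):
    """Token: deterministic search token for a keyword."""
    return _prf(key, word.encode())

def search(index, tok):
    """Search: run by the server; returns matched document identifiers."""
    return index.get(tok, [])

def dec(key, cipher):
    """Dec: recover the plaintext from a (nonce, ciphertext) pair."""
    nonce, ct = cipher
    return _stream_xor(key, nonce, ct).decode()

K = gen()
I, C = enc(K, ["day by day", "night by night"])
hits = search(I, token(K, "night"))
print([dec(K, C[i]) for i in hits])   # -> ['night by night']
```

Note that the index leaks which documents share a keyword; as discussed in Section 4.3, hiding such logical links is the job of the core searchable component (e.g., the adaptive construction in [8]), not of this toy sketch.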

We adopt the definition introduced by Boneh et al. in [11] as a representative model for asymmetric searchable encryption. In this setting, the user generates the public key and the private key. The data sender encrypts the data using the public key, and the user searches and decrypts the data using the private key. The original definition contains only the searchable part; for consistency, we add two algorithms and define asymmetric searchable encryption as follows.

Definition 2 (asymmetric searchable encryption). An asymmetric searchable encryption (ASE) scheme is a collection of seven polynomial-time algorithms ASE = (Gen, PEKS, Enc, Token, Test, Search, Dec) as follows.
(i) Gen(1^k) is a probabilistic algorithm that takes as input a security parameter k and outputs a public/private key pair (PK, SK). It is run by the user, and only SK is kept secret.
(ii) PEKS(PK, w) is a probabilistic algorithm that takes as input a public key PK and a word w and outputs a searchable structure S_w. It is run by the data sender; S_w is attached to the encrypted message, and the combination is sent to the server.
(iii) Enc(PK, D) is a probabilistic algorithm that takes as input a public key PK and a document (message) D and outputs the ciphertext C. It is run by the data sender, and C (followed by multiple searchable structures) is sent to the server.
(iv) Token(SK, w) is a deterministic algorithm that takes as input a private key SK and a keyword w and outputs a search token T_w. It is run by the user.
(v) Test(PK, S_w, T_{w'}) is a deterministic algorithm that takes as input the public key PK, a searchable structure S_w = PEKS(PK, w), and a search token T_{w'} = Token(SK, w') and outputs 1 if w = w' and 0 otherwise. It is run by the server.
(vi) Search(PK, C, S, T_w) is a deterministic algorithm that takes as input the public key PK, the encrypted documents C, the corresponding searchable structure set S (each entry contains multiple searchable structures corresponding to the keywords of a document), and the search token T_w and outputs the matched documents C_w (the documents whose searchable structures satisfy Test). It is run by the server, and C_w is sent to the user.
(vii) Dec(SK, C) is a probabilistic algorithm that takes as input a private key SK and a ciphertext C and outputs the plaintext D. It is run by the user.

Unlike the symmetric setting, the definition of the asymmetric setting works on only a single document. For a document collection, this makes no difference, since the encryption algorithm can simply be executed for each document separately.

By comparing the definitions of the two settings, we find a common link between the queried keywords and the matched documents: the searchable structure, which is constructed using either the symmetric key or the public key. Note that the structure must be probabilistic in the asymmetric setting; otherwise the server could directly launch a chosen plaintext attack using the public key. Nevertheless, we show that in both the symmetric and asymmetric settings the searchable structures are run-time deterministic. To prove this property, we first introduce a lemma.

Lemma 3. In the asymmetric setting, if the token T_w generated using the private key is deterministic, then the searchable structure S_w encrypted using the public key is run-time deterministic whenever the algorithm ASE.Test outputs 1, even if the encryption is probabilistic.

Proof. Recall that the algorithm Token is deterministic and PEKS is probabilistic. However, for a single word w, there exists only a single token T_w that links to w. When Test(PK, S_{w'}, T_w) = 1, it implies that T_w matches S_{w'}, that is, w = w'. We replace w' with w; then the token corresponds to the structure S_w = PEKS(PK, w), which could have been generated by the data sender, who holds only the public key. In effect, the data sender has indirectly generated the token without having the private key. Therefore, when the output of the test is 1, both the token and the searchable structure map to the same word w, and this mapping is deterministic.

Based on the lemma above, we introduce a theorem that guides the construction of the converter in the layered searchable encryption scheme.

Theorem 4 (run-time invariance). In both the symmetric and asymmetric settings, if the search token is deterministic, then the searchable structure is run-time deterministic.

Proof. As proved in Lemma 3, the searchable structure is run-time deterministic in the asymmetric setting. In the symmetric setting, the searchable structure I is generated using the symmetric key by Enc, which is probabilistic, while the token T_w = Token(K, w) is deterministic. Similar to the asymmetric setting, when the deterministic algorithm Search is executed, the matched entries (which are probabilistic) map to T_w, and the mapping is deterministic (here, an entry is data encrypted under the symmetric key that contains information about the matched document, such as a node in the inverted index [8]). In other words, the searchable structure is run-time deterministic because of the deterministic mapping.

4.2. Scheme Definition

Our primary goal is to separate the functionalities from the searchable structures; therefore we construct the basic searchable structures and the various functions in different layers, as shown in Figure 1.
(i) Global layer: we name this layer “global” because all documents and all searchable structures are involved. In this layer, the basic searchable encryption scheme (symmetric or asymmetric) is executed, and a global index could be constructed to improve search efficiency. The server receives the search tokens (each token is related to a keyword), executes the search procedure, and outputs the matched documents. Furthermore, the server converts the tokens (symmetric or asymmetric) to the corresponding mappings (another type of secret token) with a uniform format and transfers the mappings, with the matched documents (or identifiers), to the local layer.
(ii) Local layer: we name this layer “local” because functional structures are constructed for each document independently. In this layer, each matched document is further filtered by all functions (e.g., the phrase query function), which execute separately. Only the documents that pass all filter tests are returned to the global layer and finally to the user.

For both layers, the framework consists of three kinds of components: the core symmetric and asymmetric searchable components, which provide basic keyword search; one or more functional components, which provide various functionalities; and a converter. The converter is an algorithm that provides a uniform interface for both symmetric and asymmetric settings and provides uniform inputs for all functions. We note that all components in the two layers execute the search algorithm on the server side, and no trusted third party is required. We now formally define the scheme.

Definition 5 (layered searchable encryption). A layered searchable encryption (LSE) scheme is a collection of five polynomial-time algorithms LSE = (Gen, Enc, Token, Search, Dec) as follows.
(i) Gen(1^k) is a probabilistic algorithm that takes as input a security parameter k and outputs either a symmetric encryption key K or an asymmetric encryption key pair (PK, SK). It is run by the user, and only the public key PK is not kept secret.
(ii) Enc(key, D) is a probabilistic algorithm that takes as input an encryption key (K for the symmetric setting or PK for the asymmetric setting) and a document collection D. It outputs the encrypted documents C, a single (index-based) global searchable structure I or a sequence of global searchable structures corresponding to the documents, and a sequence of local functional structures F corresponding to the encrypted documents. It is run by the data sender, and (C, I, F) are sent to the server.
(iii) Token(key, Q) is a deterministic algorithm that takes as input a secret key and a set of keywords Q with functional instructions and outputs the corresponding search tokens T with functional instructions. It is run by the user, and T is sent to the server.
(iv) Search(PK, C, I, F, T) is a deterministic algorithm that takes as input a public key PK (only for the asymmetric setting), the encrypted documents C, the global searchable structure I, the local functional structures F, and the search tokens T and outputs the matched documents C_Q. It is run by the server, and C_Q is sent to the user.
(v) Dec(key, C_i) is a deterministic algorithm that takes as input a secret key and an encrypted document C_i and outputs the plaintext D_i. It is run by the user.

Functional instructions are specified separately by the functionalities and are written as a single SQL-like query. For example, the query SELECT WHERE keywords = “cloud, storage, encryption” AND “security classification > 5” ORDERED BY “keyword:cloud” indicates finding the documents satisfying the following: the documents contain the keywords “cloud, storage, encryption”; the security classification of each document is greater than 5; the matched documents are sorted by relevance score according to the keyword “cloud”, and the top-k relevant documents are returned. Here the query is just a representation for any instruction that contains the keywords; similarly, the tokens are just a representation for all functional instructions.
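To make the combined-query format concrete, the toy parser below splits such a SQL-like query into per-component instructions (keyword list for the basic query, a predicate for the range query component, a sort keyword for the ranked keyword query component). The parsing rules and names here are our own illustration, not part of the scheme.

```python
import re

def parse_query(q):
    """Toy parser: extract per-component instructions from a SQL-like
    combined query of the form used in this paper."""
    instr = {}
    # keywords = "w1, w2, ..."  ->  basic query component
    m = re.search(r'keywords\s*=\s*"([^"]+)"', q)
    if m:
        instr["basic"] = [w.strip() for w in m.group(1).split(",")]
    # "field > n"  ->  range query component
    m = re.search(r'"([\w ]+?)\s*([<>]=?)\s*(\d+)"', q)
    if m:
        instr["range"] = (m.group(1), m.group(2), int(m.group(3)))
    # ORDERED BY "keyword:w"  ->  ranked keyword query component
    m = re.search(r'ORDERED BY\s+"keyword:(\w+)"', q)
    if m:
        instr["rank"] = m.group(1)
    return instr

q = ('SELECT WHERE keywords = "cloud, storage, encryption" '
     'AND "security classification > 5" ORDERED BY "keyword:cloud"')
print(parse_query(q))
# -> {'basic': ['cloud', 'storage', 'encryption'],
#     'range': ('security classification', '>', 5), 'rank': 'cloud'}
```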

A functional component (FC) is a module in LSE that provides a specific functionality. It generates a local functional structure for each encrypted document and provides a filtering service while searching. An FC is designed to be compatible with both symmetric and asymmetric settings; therefore, a conversion of the document as well as of the query is required. We formally define an FC as follows.

Definition 6 (functional component). A functional component (FC) is a collection of two polynomial-time algorithms FC = (Build, Filter) as follows.
(i) Build(D_i, M_i) is an algorithm that takes as input a document D_i and the corresponding conversion (mapping) M_i and outputs a functional structure F_i. It is run by the data sender, and F_i is appended to the encrypted document.
(ii) Filter(C, F, M_Q) is an algorithm that takes as input the encrypted documents C, the corresponding functional structure set F, and the converted search tokens M_Q and outputs a subset of the documents C' ⊆ C. It is run by the server.

4.3. Security Model

The security of LSE relies on the algorithms used by its components. For example, if the symmetric searchable encryption scheme introduced in [8] is used as the core searchable component, then the core searchable structure is guaranteed to be semantically secure against chosen keyword attack (CKA-secure). Similarly, the functional components have their individual security guarantees. Therefore, the whole LSE scheme does not have a single uniform security model; security models are built separately, and each component can be analyzed independently. However, we can divide the security models into three parts: searchable component security, interface security, and functional component security. Searchable component security is guaranteed by the underlying core searchable encryption scheme; therefore, we mainly discuss the other two security models.

The interface is common, and therefore the data that flow through the interface must be semantically secure. Informally speaking, the adversary must be unable to distinguish the inputs and outputs of each component from random strings. Semantic security against chosen plaintext attack (CPA) is essential for the interface; otherwise the security of some components would become correlated and the loose coupling property would be lost.

We first define the notion of plain trace, which is the direct information that can be captured from the data flowing through the interface.

Definition 7 (plain trace). Let D = (D_1, ..., D_n) be a document collection. Let N = (m_1, ..., m_n) (only for the asymmetric setting) be a keyword-counter set, where m_i is the number of keywords in D_i. Let the query history Q = (w_1, ..., w_q) be a sequence of queried keywords. Let the search pattern P be a q × q binary matrix such that for 1 ≤ i, j ≤ q, the entry in the ith row and jth column is 1 if w_i = w_j and 0 otherwise. The plain trace is PT(D, Q) = (|D_1|, ..., |D_n|, N, P).

Note that the plain trace is different from the notion of trace introduced in [8], which further captures the logical links. We will explain the reason after the definition of the security model. We now present the security model for the interface.

Definition 8 (interface security against chosen plaintext attack, interface-CPA-secure). Let LSE be the layered searchable encryption scheme, and let k be the security parameter. We consider the following probabilistic experiments, where A is an adversary and S is a simulator.
Real_A(k): the challenger runs Gen(1^k) to generate the key K (symmetric) or (PK, SK) (asymmetric). The adversary A generates a document collection D and a sequence of queries Q and receives the outputs of Enc and the search tokens produced by Token from the challenger. A generates a mapping as the input for the functional component. Finally, A returns a bit b that is output by the experiment.
Sim_{A,S}(k): given the plain trace, S generates the simulated ciphertexts and tokens and sends the results to A. A generates a mapping as the input for the functional component. Finally, A returns a bit b that is output by the experiment.

We say that the interface of LSE is semantically secure against chosen plaintext attack if for all PPT adversaries A there exists a PPT simulator S such that |Pr[Real_A(k) = 1] − Pr[Sim_{A,S}(k) = 1]| ≤ negl(k), where the probabilities are taken over the coins of Gen.

Note that the functional structure is not included here, since the functional component is loosely coupled with the core. Therefore, the security of each functional component is separate from the framework and should be defined and analyzed separately.

The security model of the interface does not consider the search algorithm or the number of queries (therefore, only a single query sequence is presented). The reason is that the other information about the queried keywords and the documents is protected by the components. For example, if some documents are returned for one token, then the adversary can immediately infer that these documents share a common keyword (even though the tokens and documents are indistinguishable from random at the interface). Such logical links can be hidden by generating multiple different tokens for one keyword (see the adaptive construction in [8]), and this protection is guaranteed in the core searchable component.

Therefore, semantic security of the interface does not guarantee that the whole scheme is secure against chosen keyword attack, or that each component is secure under some other security model. However, it provides the basic security guarantee for the whole scheme and independence for each component; we will show such independence in the construction of the functional component later.

4.4. Concrete Construction

We first present the basic idea of the search process and the converter; then we present the template for constructing a functional component. Finally, we present the constructions of LSE (symmetric and asymmetric) in detail and prove the security of the interface.

4.4.1. Basic Idea

As shown in Figure 2, the basic search process is as follows. The user transforms the queried keywords into tokens using the private key. The server receives the tokens and executes the search procedure over all encrypted documents C = (C_1, ..., C_n). Each C_i is linked to a global searchable structure (if a global index is used, then a single searchable structure serves all encrypted documents) and a local functional structure F_i; only the global searchable structure is used in this step. Then the tokens are converted to the uniform tokens (mappings), and both the mappings and the matched encrypted documents are transmitted to the functional components to further filter the results (e.g., a phrase query filter). Each component outputs a subset of its input documents, and all components work serially, since any document that does not pass the current filter is unnecessary for the next filter. Finally, the matched encrypted documents are returned to the user, who decrypts them to obtain the plaintexts.

In order to construct a functional component that supports both symmetric and asymmetric settings, a conversion is needed that transforms the plaintext into a kind of ciphertext that is independent of the setting. We call this independent ciphertext a one-to-one “mapping”, since each word in the plaintext has a deterministic token in the ciphertext. In addition, in order to provide a uniform format for the functional components, a hash function is used; we show the detailed construction in the next section. We now present the template for a functional component (FC) in Algorithm 1.

Build(D_i, M_i):
(1) input a document D_i and the mappings M_i of all words in D_i.
(2) specified according to the functionality.
(3) output a local functional structure F_i.
Filter(C, F, M_Q):
(1) input a set of encrypted documents C, the corresponding local functional
 structures F, and the mappings M_Q of the queried keywords Q.
(2) specified according to the functionality.
(3) output a subset of the documents C' ⊆ C.

We note that, in order to preserve the loose coupling property, no component-specific parameter is allowed; the uniform mappings of the words therefore become the ideal common parameter. Another advantage of the mapping is that the main information needed for any functionality is retained: the distinctness of each word and the order of all words in the document. From this information, the word frequency, rank, subset, and so forth can be inferred without the plaintext, which facilitates the design of the Token and Search algorithms.
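A small sketch of this observation (the per-word token values are hypothetical): hashing the per-position tokens yields uniform mappings from which word frequency and word order can be recovered without any plaintext.

```python
import hashlib
from collections import Counter

def mapping(tokens):
    """Uniform mapping: hash each setting-specific token to a fixed-size
    value; one entry per word position, so word order is preserved."""
    return [hashlib.md5(t).digest() for t in tokens]

# Hypothetical per-word tokens for the sentence "day by day, night by night".
toks = [b"T_day", b"T_by", b"T_day", b"T_night", b"T_by", b"T_night"]
M = mapping(toks)

freq = Counter(M)                       # word frequency, without plaintext
assert len(freq) == 3                   # three distinct words
assert set(freq.values()) == {2}        # each word occurs twice
assert M[0] == M[2] and M[3] == M[5]    # the repetition pattern survives
```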

4.4.2. Constructing Symmetric Part

For the symmetric setting, the deterministic mapping of a document can be computed with ease. Let the tokens T_1, T_2, and T_3 map to the words “day,” “by,” and “night,” respectively. Then the deterministic mapping of the sentence “day by day, night by night” could be written as
(T_1, T_2, T_1, T_3, T_2, T_3).

Both the Enc and Token algorithms can generate these mappings, and the main process is as follows: for each document D_i, scan all words and compute the corresponding tokens, which are then hashed to fixed-size mappings.

Suppose there are n documents D = (D_1, ..., D_n), corresponding ciphertexts C = (C_1, ..., C_n), functional components FC_1, ..., FC_l, and queried keywords Q = (w_1, ..., w_q). In addition, we define a hash function h: {0,1}* → {0,1}^ℓ, where ℓ is the length of the mapping according to the hash function. For example, if we use MD5 [24] as h, then ℓ is 128 bits. For clarity, we present the encryption scheme in Algorithm 2 and the search scheme in Algorithm 3, and finally we present the complete scheme in Algorithm 4.

Input: the encryption key K, the documents D = (D_1, ..., D_n).
Output:
(1) C: encrypted documents C = (C_1, ..., C_n).
(2) I: global searchable structure (index-based).
(3) F: local functional structures F = (F_1, ..., F_n).
Method:
(1) compute (I, C) ← SSE.Enc(K, D). Here C = (C_1, ..., C_n).
(2) for each document D_i and the corresponding C_i (1 ≤ i ≤ n) do
(3)  scan D_i for all words to form a word list (w_1, ..., w_m).
(4)  for each word w_j (1 ≤ j ≤ m) in D_i do
(5)   compute T_j ← SSE.Token(K, w_j).
(6)   compute M_j = h(T_j).
(7)  end for
(8)  let M_i = (M_1, ..., M_m).
(9)  for each functional component FC_t (1 ≤ t ≤ l) do
(10)   compute F_{i,t} ← FC_t.Build(D_i, M_i).
(11)  end for
(12)  append F_i = (F_{i,1}, ..., F_{i,l}) to C_i.
(13) end for
(14) let C = (C_1, ..., C_n) and F = (F_1, ..., F_n), and output (C, I, F).

Input:
(1) ⊥: the public key is not available here.
(2) C: encrypted documents C = (C1, ..., Cn).
(3) GS: global searchable structure (index-based).
(4) LS: local functional structures LS = (LS1, ..., LSn).
(5) TK: search tokens TK = (t1, ..., tk).
Output: matched documents C''.
Method:
(1) compute C' = SSE·Search(GS, TK). Here C' ⊆ C.
(2) for each token tj (1 <= j <= k), compute mj = H(tj).
(3) let M = (m1, ..., mk) and C'' = C'.
(4) for each functional component FCx (1 <= x <= l) do
(5)  let C'' = (Ci1, ..., Cis), then the corresponding local functional structures are (LSi1, ..., LSis)
   and the structures for FCx are LS'' = (LSi1,x, ..., LSis,x).
(6)  compute C'' = FCx·Filter(C'', LS'', M).
(7) end for

Gen(1^λ): compute K = SSE·Gen(1^λ), and output K.
Enc(K, D): described in Algorithm 2.
Token(K, W):
(1) for each keyword wj (1 <= j <= k) in W do
(2)  compute tj = SSE·Token(K, wj).
(3) end for
(4) output TK = (t1, ..., tk).
Search(C, GS, LS, TK): described in Algorithm 3.
Dec(K, Ci): compute Di = SSE·Dec(K, Ci), and output Di.

4.4.3. Constructing Asymmetric Part

For the asymmetric setting, the data sender does not have the private key; therefore the mapping will fail while searching, since any encryption using the public key is probabilistic (CPA security). For example, let [w] represent an encryption of a word w; then the same sentence will become

([day]1, [by]1, [day]2, [night]1, [by]2, [night]2)

(note that both [day]1 and [day]2 map to the word "day" yet are distinct ciphertexts).

Therefore, we delay the construction of such a mapping until after the construction of the searchable structure in algorithm Enc and use this searchable structure as an independent token for the corresponding word in algorithm Search. Recall that Sw = PEKS(pk, w); then the tokens which map to the words "day," "by," and "night" will be transformed to the corresponding structures S1, S2, and S3 when the test in the search algorithm outputs 1. Then we have

(H(S1), H(S2), H(S1), H(S3), H(S2), H(S3)).

In this way, the data sender could construct the deterministic mapping for the document and indirectly obtain the deterministic tokens using only the public key. Similar to the symmetric setting, the process is as follows. For each document Di, scan all words and compute the corresponding tokens according to the searchable structures, which are further hashed to the fixed-size mappings. While searching, the tokens are mapped to different searchable structures according to each document.

There are some differences from the symmetric counterpart, as shown in Figure 3. First, the searchable structures are appended to each encrypted document, so a global index is not available. Second, a public key is involved in the searchable structure. However, due to the conversion, the public key is unnecessary for the functional components.

Now we present the encryption scheme in Algorithm 5 and the search scheme in Algorithm 6 and finally present the complete scheme in Algorithm 7.

Input: the public key pk, the documents D = (D1, ..., Dn).
Output:
(1) C: encrypted documents C = (C1, ..., Cn).
(2) GS: global searchable structures GS = (GS1, ..., GSn).
(3) LS: local functional structures LS = (LS1, ..., LSn).
Method:
(1) for each document Di (1 <= i <= n) in D do
(2)  compute Ci = ASE·Enc(pk, Di).
(3)  scan Di for all words to form a word list (w1, w2, ..., wz).
(4)  extract the distinct keywords (w'1, ..., w'u) from (w1, ..., wz).
(5)  for each word w'j (1 <= j <= u) in the keyword list do
(6)   compute Sj = ASE·PEKS(pk, w'j).
(7)   compute m'j = H(Sj).
(8)  end for
(9)  let S1 map to m'1, ..., Su map to m'u, and let GSi = (S1, ..., Su).
(10)  for each word wj (1 <= j <= z) in Di do
(11)   find the m'x such that the corresponding word w'x = wj.
(12)   set mj = m'x.
(13)  end for
(14)  let Mi = (m1, ..., mz).
(15)  for each functional component FCy (1 <= y <= l) do
(16)   compute LSi,y = FCy·Build(Ci, Mi).
(17)  end for
(18)  append LSi = (LSi,1, ..., LSi,l) to LS.
(19) end for
(20) output C = (C1, ..., Cn), GS = (GS1, ..., GSn), LS = (LS1, ..., LSn).

Input:
(1) pk: the user's public key.
(2) C: encrypted documents.
(3) GS: global searchable structures GS = (GS1, ..., GSn).
(4) LS: local functional structures LS = (LS1, ..., LSn).
(5) TK: the search tokens TK = (t1, ..., tk).
Output: matched documents C''.
Method:
(1) compute C' = ASE·Search(pk, C, GS, TK). Let C' = (Ci1, ..., Cis).
(2) for each Cij in C' and the corresponding functional structure (1 <= j <= s) do
(3)  let (S1, ..., Su) denote the searchable encryptions of GSij,
   where u is the number of distinct keywords in Dij.
(4)  for each token tx (1 <= x <= k) in TK do
(5)   find the Sy where ASE·Test(pk, Sy, tx) = 1.
(6)   compute mx = H(Sy).
(7)  end for
(8)  let Mij = (m1, ..., mk).
(9) end for
(10) for each functional component FCx (1 <= x <= l) do
(11)  let C'' = C', then the corresponding local functional structures are (LSi1, ..., LSis)
   and the structures for FCx are LS'' = (LSi1,x, ..., LSis,x).
(12)  compute C'' = FCx·Filter(C'', LS'', (Mi1, ..., Mis)).
(13) end for

Gen(1^λ): compute (pk, sk) = ASE·Gen(1^λ), and output (pk, sk).
Enc(pk, D): described in Algorithm 5.
Token(sk, W):
(1) for each keyword wj (1 <= j <= k) in W do
(2)  compute tj = ASE·Token(sk, wj).
(3) end for
(4) output TK = (t1, ..., tk).
Search(pk, C, GS, LS, TK): described in Algorithm 6.
Dec(sk, Ci): compute Di = ASE·Dec(sk, Ci), and output Di.

We note that the process of "find the Sy" at line (5) in Algorithm 6 could be done simply by reusing the intermediate results from the algorithm ASE·Search at line (1).

4.4.4. Proof of Security

As we encapsulate the basic symmetric and asymmetric searchable encryptions in the global layer, the core is semantically secure against chosen keyword attack (CKA) [8, 9, 11]. All that remains is to prove that the interface is CPA secure; the other functionalities are analyzed independently.

Theorem 9. If the core symmetric or asymmetric component is semantically secure against chosen keyword attack (CKA-secure), then LSE is interface-CPA-secure.

Proof. We briefly prove this theorem since the proof is straightforward. We claim that no polynomial-size distinguisher could distinguish the mappings (m1, ..., mk) from equal-size random strings (r1, ..., rk). As proved in [8, 11], the CKA-security of the core component guarantees that the tokens (t1, ..., tk) are indistinguishable from random strings (r'1, ..., r'k). For the symmetric setting, mj is the hash of tj, which is indistinguishable from H(r'j). For the asymmetric setting, mj is the hash of the searchable structure Sj, which is indistinguishable from a random string r'j. Therefore, in both settings the hash value mj is indistinguishable from the hash value of a random string, and hence from rj.

5. Realizing Various Functionalities

In this section, we show how to realize various functionalities based on LSE. We first present an overview of searchable encryption schemes with various functionalities and then propose two representative constructions for ranked keyword query and range query. Finally, we briefly discuss methods for realizing the other functionalities.

5.1. Overview

As shown in Table 1, we present various functionalities for searchable encryption schemes: symmetric setting (Symm), asymmetric setting (Asym), ranked keyword query (Ranked keyword), range query (Range), phrase query (Phrase), fuzzy keyword query and wildcard query (Fuzzy keyword), similarity query (Similarity), and subset query (Subset). “Yes” means that the corresponding scheme directly supports such functionality. “Possible” means that the underlying data structure is compatible, and such functionality could be realized through minor modification of the original scheme. “—” means that realizing such functionality is quite challenging or the cost is relatively high.

5.2. Ranked Keyword Query

Ranked keyword query refers to a functionality where all matched documents are sorted according to some criteria, and only the top-relevant documents are returned to the user. The SQL query format is "ORDERED BY 'keyword'." In [14], the authors introduced the computation of relevance scores and proposed a comparison method over the encrypted scores based on order-preserving symmetric encryption (OPSE) [27]. By using the same cryptographic primitive, the functional structure could record the encrypted relevance scores and set up an index of (token, score) pairs in order to obtain each score in constant look-up time.

5.2.1. Preliminaries

Order-preserving encryption (OPE) aims to encrypt the data in such a way that comparisons over the ciphertexts are possible. For A, B ⊆ N with |A| <= |B|, a function f: A -> B is order-preserving if, for all i, j ∈ A, f(i) > f(j) if and only if i > j. We say an encryption scheme OPE = (Enc, Dec) is order-preserving if Enc(K, ·) is an order-preserving function. In [28], Agrawal et al. proposed a representative OPE scheme in which all numeric values are uniformly distributed. In [27], Boldyreva et al. introduced an order-preserving symmetric encryption scheme and proposed the security model. Improved definitions are introduced in [29]. Informally speaking, OPE is secure if oracle access to OPE.Enc is indistinguishable from access to a random order-preserving function (ROPF). The security model is described as pseudorandom order-preserving function against chosen ciphertext attack (POPF-CCA) [27].
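To make the order-preserving property concrete, here is a minimal toy table over a known finite domain. This is an illustration only, not the POPF-CCA-secure construction of [27]: ciphertexts are assigned by cumulative random gaps, which preserves order but offers no real security.

```python
import random

def toy_ope_table(domain, seed=42):
    # Assign each plaintext a ciphertext via cumulative random gaps, so
    # x < y implies Enc(x) < Enc(y) and vice versa.
    rng = random.Random(seed)
    table, c = {}, 0
    for x in sorted(set(domain)):
        c += rng.randint(1, 100)
        table[x] = c
    return table

enc = toy_ope_table(range(10))
```

Because comparisons carry over to ciphertexts, a server holding only the table values can rank or range-test them without the key.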

A sparse look-up table is often managed by the indirect addressing technique. Indirect addressing is also called an FKS dictionary [30], which is used in the symmetric searchable encryption scheme [8]. The addressing format is ⟨address, value⟩, where the address is a virtual address that could locate the value field. Given the address, the algorithm returns the associated value in constant look-up time, and returns ⊥ otherwise.
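A minimal sketch of such a table follows, with a plain dictionary standing in for the FKS construction and None standing in for the failure symbol ⊥:

```python
class SparseLookupTable:
    """Indirect addressing: each entry is an (address, value) pair;
    look-up takes constant expected time."""

    def __init__(self):
        self._entries = {}

    def put(self, address: bytes, value: bytes) -> None:
        self._entries[address] = value

    def get(self, address: bytes):
        # Return the associated value, or None (modeling ⊥) on a miss.
        return self._entries.get(address)
```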

5.2.2. Construction

We build a sparse look-up table that records the pairs (keyword, relevance score) with all data encrypted. When queried, the server searches the relevance scores over all documents and finds the top-k relevant documents. Note that, in order to securely use an OPE scheme to encrypt relevance scores, a preprocessing step is necessary.

We build an OPE table to preprocess all plaintexts and store the encrypted relevance scores as follows. Given a document collection D = (D1, ..., Dn), for each document Di, scan it for keywords. Compute the relevance score sj (based on word frequency) for each keyword wj in Di, and record a matrix for Di with the jth line recording (pj, sj), where pj is the position where the first wj occurs. For all documents, set up the OPE over the scores. For each score sj, the encryption is cj = OPE.Enc(K, sj). Transform the previous matrix into an OPE table with the jth line recording (pj, cj), where cj is the encryption of sj.

For a document of size |D|, there are at most |D|/2 keywords (note that each keyword is followed by a separator such as a blank). The look-up table is therefore padded to |D|/2 entries in order to achieve semantic security. Now we present the concrete construction of the ranked keyword query component in Algorithm 8.

Build(Ci, Mi):
(1) input a document Ci and the mapping Mi for its words.
(2) let the entries of Di in the OPE table be ((p1, c1), ..., (pu, cu)).
(3) for each (pj, cj), build an index entry ⟨m(pj), cj⟩, where m(pj) is the mapping at position pj.
(4) pad the remaining entries with random strings.
(5) output a local functional structure LSi.
Filter(C', LS', m):
(1) input ciphertexts C', the corresponding functional structures LS',
 and the mapping m of the queried keyword
 (single keyword).
(2) for all functional structures, look up cj = LS'j⟨m⟩,
  and select the top-k results C'' corresponding to the largest cj.
(3) output C''.
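The server-side selection in the Filter algorithm can be sketched as follows. Since OPE ciphertexts compare in the same order as the underlying scores, a standard top-k selection over the encrypted values suffices; doc_scores and k are illustrative names.

```python
import heapq

def top_k(doc_scores, k):
    # doc_scores: list of (doc_id, ope_encrypted_score) pairs retrieved from
    # the look-up tables. OPE ciphertexts order like their plaintexts, so the
    # server can rank them without decryption.
    return heapq.nlargest(k, doc_scores, key=lambda pair: pair[1])
```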

5.2.3. Proof of Security

Informally speaking, the functional component must guarantee the following: given two document collections D0 and D1 of equal size, the challenger flips a coin b and encrypts Db using LSE (the order of the ciphertexts is randomized); the adversary could query a keyword and receive the ordered document collection, but he could not distinguish which collection the challenger selected. By combining the security models defined in [8, 27], we formally define the notion of nonadaptive chosen ranked keyword attack (CRKA) as follows.

Definition 10 (semantic security against nonadaptive chosen ranked keyword attack, CRKA-secure). Let FC be the functional component for ranked keyword query. Let λ be the security parameter. We consider the following probabilistic experiments, where A is an adversary and S is a simulator.

Real_A(λ): the challenger runs Gen(1^λ) to generate the key K. The adversary A generates a document collection D = (D1, ..., Dn) (the size of each document is fixed) and receives the encrypted documents C = (C1, ..., Cn) and functional structures LS = (LS1, ..., LSn) with random order from the challenger. A is allowed to query a keyword w, where w ∈ Di for some i, and receives a mapping m from the challenger. Finally, A returns a bit b that is output by the experiment.

Sim_{A,S}(λ): given the number of documents n, the size |D| of each document, and the size |m| of the mapping, S generates C* = (C*1, ..., C*n), LS* = (LS*1, ..., LS*n), and m*, and then sends the results to A. Finally, A returns a bit b that is output by the experiment.

We say that the functional component is CRKA-secure if, for all PPT adversaries A, there exists a PPT simulator S such that

|Pr[Real_A(λ) = 1] − Pr[Sim_{A,S}(λ) = 1]| <= negl(λ),

where the probabilities are over the coins of Gen.

Theorem 11. If LSE is interface-CPA-secure and the underlying OPE is POPF-CCA secure, then the ranked keyword query component is CRKA-secure.

Proof. The simulator S generates C*, LS*, and m* as follows. As to C*, S generates n random strings of size |D|. As to LS*, let u = |D|/2; S generates n·u random strings c*, each of the size of an OPE ciphertext, together with an n × u matrix of random positions, and for each document generates an index of u entries ⟨address, c*⟩. As to m*, S randomly selects a string of size |m|.

We claim that no polynomial-size distinguisher could distinguish (C*, LS*, m*) from (C, LS, m). Since the encryption key is kept secret from the adversary, the interface-CPA-security directly guarantees that C* is indistinguishable from C. It also guarantees that m* is indistinguishable from m. Upon receiving (LS, m) or (LS*, m*), the adversary could invoke Filter(C, LS, m) or Filter(C*, LS*, m*) to obtain the ranked results. POPF-CCA security guarantees that the set (c1, ..., cu) is indistinguishable from (c*1, ..., c*u); that is, the adversary is unable to distinguish the result of OPE from the result of a random order-preserving function. Therefore, LS is indistinguishable from LS*.

5.3. Range Query

Range query refers to a functionality by which the server could test whether the submitted keyword (an integer) is within a range. The SQL query format is "WHERE 'field operator value'." For example, the user submits an integer a, and the server could return the documents whose corresponding searchable fields x satisfy x > a.

Although OPE could be applied here to support range query (similarly to ranked keyword query), we propose another solution to demonstrate how to apply the methods used in the asymmetric setting to LSE. In [2], the authors introduced a construction based on bilinear maps (asymmetric setting), which is not compatible with the symmetric setting. However, the idea of transforming the comparison into a predicate (e.g., testing whether P(x) = 1, where P is a predicate) could be used, and the functional structure could record all possible predicates and provide predicate tests using a bloom filter.

5.3.1. Preliminaries

A bloom filter [31] is a space-efficient probabilistic data structure that is used to test whether an element s is a member of a set S. The set S = (s1, ..., sn) is coded as an array B of b bits. Initially, all array bits are set to 0. The filter uses r independent hash functions h1, ..., hr, where each hi: {0,1}* -> [1, b]. For each element sj ∈ S, set the bits at positions h1(sj), ..., hr(sj) to 1. Note that a location could be set to 1 multiple times. To determine whether s ∈ S, just check whether the positions h1(s), ..., hr(s) in B are all 1. If any bit is 0, then s ∉ S. Otherwise, we say s ∈ S with high probability (the false-positive probability could be adjusted through the parameters b and r until acceptable).
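A minimal sketch of the structure just described follows. For simplicity the r positions are carved out of a single SHA-256 digest rather than r truly independent hash functions, which is a common shortcut and an assumption of this sketch.

```python
import hashlib

class BloomFilter:
    def __init__(self, b_bits=1024, r_hashes=8):
        self.b, self.r = b_bits, r_hashes
        self.bits = bytearray(b_bits)  # one byte per bit, for clarity

    def _positions(self, item: str):
        # Derive r positions from one digest (two bytes per position).
        d = hashlib.sha256(item.encode()).digest()
        return [int.from_bytes(d[2 * i:2 * i + 2], "big") % self.b
                for i in range(self.r)]

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item: str) -> bool:
        # All r positions set => member with high probability;
        # any position 0 => definitely not a member.
        return all(self.bits[p] for p in self._positions(item))
```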

In addition, we write id(D) to denote the identifier of a document D, such as the cryptographic hash of its pathname, and write id to denote id(D) for simplicity.

5.3.2. Construction

For range query, the document is labeled by some numbers. Here we only consider a single label x. Therefore, the aim of range query is to enable the user to submit a number a to search for the documents satisfying an SQL-like query such as "WHERE 'x > a'." We consider the five basic range query operators >, >=, =, <=, <. The other operators, such as ≠, could be naturally derived from the basic operators.

We consider the whole range to be a sequence of discrete numbers A = (a1, ..., at), where a1 < a2 < ... < at. Then we set five shared virtual documents D>, D>=, D=, D<=, and D<, one per operator, for all users' documents. A virtual document could be encrypted by LSE's core as a normal document. Therefore, for any keyword such as ">a" where a ∈ A, there always exists a mapping m(>a).

Based on the notion of virtual documents, a label x for a user's document satisfying x > a (or the other operators) could be represented as the keywords {">a" : a ∈ A, a < x}, and these keywords are stored in a bloom filter BF. Suppose the user queries a keyword ">a," where a ∈ A; then the query is transmitted to the bloom filter to test whether m(>a) ∈ BF.

For example (we only consider the operator ">" here for simplicity), suppose we have two documents D1, D2 labeled x1 = 5 and x2 = 8, respectively, over the range A = (1, ..., 10). Then the transformed sets are S1 = {>1, >2, >3, >4} and S2 = {>1, >2, ..., >7}. If the user submits ">7," then only S2 matches the query, which is the same result as direct comparison since 8 > 7 while 5 < 7, and then D2 is returned. Similarly, the query ">3" will match both documents, and D1, D2 are returned.
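The label-to-keyword-set transformation in this example can be sketched as follows (labels 5 and 8 over the domain 1..10, operator ">" only):

```python
def transform(label, domain):
    # A label x is expanded into the keyword set { ">a" : a in domain, a < x },
    # so the comparison "x > a" becomes a set-membership test.
    return {">%d" % a for a in domain if a < label}

s1 = transform(5, range(1, 11))   # document labeled 5
s2 = transform(8, range(1, 11))   # document labeled 8
```

In the actual scheme, these keyword sets are not stored in the clear; their mappings are inserted into a per-document bloom filter.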

Now we construct the secure version of the aforementioned scheme. Let A = (a1, ..., at) denote the domain of the label, and set up the bloom filter with r independent hash functions h1, ..., hr. The identifier id of a document is always bound to the document D or the ciphertext C. The concrete construction is presented in Algorithm 9.

Build(Ci, Mi):
(1) input a range document Dx, the set of keywords transformed from the label x
 under the five operators, and the corresponding mapping Mx = (m1, ..., mv).
(2) initialize a bloom filter BF with all bits set to 0.
(3) for (1 <= j <= v) do
(4)  compute the codewords h1(mj), ..., hr(mj).
(5)  insert the codewords into the bloom filter BF.
(6) end for
(7) output a local functional structure LSi = BF.
Filter(C', LS', m):
(1) input ciphertexts C', the corresponding functional structures LS',
 and the mapping m of the queried keyword (single keyword),
 where m is the mapping of ">a".
(2) for (1 <= i <= |C'|) do
(3)  compute the codewords h1(m), ..., hr(m).
(4)  if all these locations in the bloom filter BFi are 1, then add Ci to C''.
(5) end for
(6) output C''.

The size of the bloom filter could be dramatically reduced if the domain is bucketized [32]: for example, bucketizing the subrange (1, ..., 10) as tag 10 and the subrange (11, ..., 20) as tag 20. Then a query for ">13" could be mapped to the closest query ">10." In other words, the whole domain is divided into multiple subranges, and the queried range is transformed to the approximate range. An optimization of the idea of bucketizing the range is introduced in [33]. In this way, the number of items stored in the bloom filter becomes smaller. However, this induces inaccuracy in the query result.
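The bucketizing step can be sketched as follows, with bucket_size as an illustrative parameter:

```python
def bucketize(a, bucket_size=10):
    # Map a queried bound down to the nearest bucket boundary, shrinking the
    # keyword set stored in the bloom filter at the cost of query accuracy.
    return (a // bucket_size) * bucket_size
```

For instance, the query ">13" is approximated by ">10", so documents labeled 11, 12, or 13 may be wrongly included, which is exactly the inaccuracy mentioned above.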

5.3.3. Proof of Security

For simplicity and without loss of generality, we only consider the operator ">" here; the other operators are handled in the same way. Informally speaking, the functional component must guarantee that the adversary is unable to guess the queried range as well as the range in the ciphertext, and the basic game works as follows. Given two documents labeled with two numbers x0 and x1, respectively, the challenger flips a coin b and encrypts Dxb. The adversary is allowed to adaptively query keywords ">a," where each a satisfies a < min(x0, x1) or a >= max(x0, x1). Note that querying an a with min(x0, x1) <= a < max(x0, x1) is not allowed, since the document would be immediately distinguished (only the document with the larger label is matched and returned). We propose the notion of chosen range attack (CRA) and formally define the security model for semantic security as follows.

Definition 12 (semantic security against chosen range attack, CRA-secure). Let FC be the functional component for range query. Let λ be the security parameter. We consider the following probabilistic experiments, where A is an adversary and S is a simulator.

Real_A(λ): the challenger runs Gen(1^λ) to generate the key K. The adversary A generates a document D and the labeled number x and receives the encrypted document C and the functional structure LS. A is allowed to adaptively query keywords. For each query ">a," A receives a mapping m(>a) from the challenger. Finally, A returns a bit b that is output by the experiment.

Sim_{A,S}(λ): given the document size |D|, the cardinality t of the range A, and the size |m| of the mapping, S generates C*, LS*, and m*(>a), and then sends the results to A. Finally, A returns a bit b that is output by the experiment.

We say that the functional component is CRA-secure if, for all PPT adversaries A, there exists a PPT simulator S such that

|Pr[Real_A(λ) = 1] − Pr[Sim_{A,S}(λ) = 1]| <= negl(λ),

where the probabilities are over the coins of Gen.

Theorem 13. If LSE is interface-CPA-secure, then the range query component is CRA-secure.

Proof. The simulator S generates C*, LS*, and m* as follows. As to C*, S generates a random string of size |D|. As to LS*, S chooses a random label x* and generates distinct random strings m*1, ..., m*t; for each m*j corresponding to a domain value below x*, S computes the codewords h1(m*j), ..., hr(m*j) and inserts them into a bloom filter BF*, and lets LS* = BF*. As to the queried mappings, for each query ">a," S randomly selects a distinct string m*(>a) of size |m| that maps to ">a," so that repeated queries receive the same mapping.

We claim that no polynomial-size distinguisher could distinguish (C*, LS*, m*) from (C, LS, m). Since the encryption key is kept secret from the adversary, the interface-CPA-security directly guarantees that C* is indistinguishable from C. It also guarantees that each mapping m is indistinguishable from the random string m*, so the codewords of m are indistinguishable from the codewords of m*. Therefore, the locations set by m in the bloom filter BF are indistinguishable from the locations set by m* in BF*. Thus, LS is indistinguishable from LS*.

5.4. Other Functionalities

Due to space limitation, we only discuss the above two representative functional components. We briefly introduce how to realize some other functionalities based on LSE as follows.

Phrase Query. It refers to a query with consecutive and ordered multiple keywords. For example, searching with the phrase "operating system" requires not only that each keyword "operating" and "system" exist in each returned document, but also that the order, "operating" followed by "system," is satisfied. In [21], the authors introduced a solution based on the Nextword Index [34]. It allows the index to record the keyword positions for each document and enables the user to query consecutive keywords based on a binary search over all positions. However, this incurs nontrivial computation for each document. Based on LSE, this functionality could be realized using a bloom filter (as demonstrated in the range query scheme) recording biwords or longer word sequences based on Partial Phrase Indexes [35]. As a result, the scheme could achieve approximately constant computation complexity per document (note that the index in the global layer could already exclude a large number of results for multiple keywords).

Fuzzy Keyword Search. It refers to a functionality where the user submits a fragment of a keyword (or a keyword that does not exist in any document) and the server could search for the documents with all possible keywords that are close to the fragment. In [3], the authors introduced a wildcard-based construction that could handle fuzzy keyword search with arbitrary edit distance [36]. By using the same method, the functional structure could realize this functionality by recording and indexing the fuzzy set of all mappings instead of keywords.

Similarity Query. It refers to a functionality where the server could return to the user some documents containing keywords that are similar to the queried keyword. In both [18, 19], the authors realized this functionality based on fuzzy sets. Therefore, although different methods are used, the construction of the fundamental component is similar to that of the fuzzy keyword search scheme.

Subset Query. It refers to a functionality by which the server could test whether the queried message is a subset of the values in the searchable fields. For example, let S be a set that contains multiple e-mail addresses. If the user searches for some encrypted mails containing Alice's e-mail address e, then the server must have the ability to test whether e ∈ S without knowing any other information. A solution was also introduced in [2]. Similar to the range query scheme, this test could also be viewed as a predicate, and therefore the solution is the same.

5.5. Performance Analysis

The algorithms of the ranked keyword query component and the range query component are coded in the C++ programming language, and the test server is a Pentium Dual-Core E5300 PC with a 2.6 GHz CPU. Each document is fixed to 10 KB with random words chosen from a dictionary, and each query is also a random keyword (random number). For the bloom filter used in the range query component, the number of hash functions is set to 8. The time costs of the filter algorithms are shown in Figure 4.

Let n denote the number of documents. For ranked keyword query, the main operations are retrieving the relevance scores from the secure table managed by the indirect addressing technique (constant look-up time per document) and selecting the top-k scores. For range query, the main operation is computing 8 hash values per document (constant computation complexity). Note that the current document is passed over if any position in the bloom filter is 0; therefore, not all eight hash functions are executed every time. The figure demonstrates that, even on a single server, both algorithms are efficient. Note that, since the functional components are loosely coupled, they could be deployed to different servers. For example, two core components (Core), two ranked keyword query components (Rank), and three range query components (Range) could be executed as data-flow boxes as shown in Figure 5. Each box could be deployed to any server. The detailed methods are out of the scope of this paper and we will not discuss them further.

6. Conclusions

The layered searchable encryption scheme provides a new way of thinking about the relationships among the searchable structure, functionality, and security. It separates the functionalities from the core searchable structure without loss of security. The loose coupling property therefore provides compatibility between the symmetric and asymmetric settings, and it also provides flexibility for adding or deleting various functionalities. Furthermore, following the popular boxes-and-arrows paradigm, the loose coupling property makes the scheme well suited to distributed and parallel computing environments.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This work is supported by the Science and Technology Department of Sichuan Province (Grant no. 2012GZ0088 and no. 2012FZ0064).