Abstract

For both convenience and security, more and more users encrypt their sensitive data before outsourcing it to a third party such as a cloud storage service. However, searching for the desired documents becomes problematic, since it is costly to download and decrypt each candidate document to check whether it contains the desired content. An informative query-biased preview feature, as applied in modern search engines, could help users learn about the content without downloading the entire document. However, when the data are encrypted, securely extracting a keyword-in-context snippet from the data as a preview becomes a challenge. Based on a private information retrieval protocol and the core concept of searchable encryption, we propose a single-server, two-round solution to securely obtain a query-biased snippet over the encrypted data from the server. We achieve this novel result by making a document (plaintext) previewable under any cryptosystem and constructing a secure index to support dynamic computation of the best-matched snippet when queried by some keywords. For each document, the scheme has storage complexity and communication complexity, where is the document size and is the snippet length.

1. Introduction

Cloud storage provides an elastic, highly available, easily accessible, and cheap data repository to users who do not want to maintain their own storage or who simply value the convenience, and this way of storing data is becoming more and more popular. In many cases, especially when users store sensitive data such as business documents, security guarantees against the cloud provider are required, since internal staff may access the data maliciously. Directly encrypting the sensitive documents using traditional encryption techniques such as AES is not an ideal solution, since the user loses the ability to search effectively for the desired documents.

One solution for effectively searching over encrypted data is searchable encryption. It enables a user to securely outsource his private documents to a third party while maintaining the ability to search the documents by keywords. The scenario is simple: the user submits some encrypted keywords to the server, and the server performs the search and returns the encrypted documents that contain the queried keywords. However, current searchable encryption techniques either directly return the matched documents or return in the first round some limited information (guided mode) that is prestored in metadata, such as the name and a short static abstract of each matched document. The more documents are stored, the more matched results there will be, so finding the desired documents among them also becomes a problem. Moreover, bandwidth cost must be taken into consideration: returning a large number of matched documents is impractical.

Another solution for effectively searching for the desired data is content preview, which is the main topic of this paper. In a modern search engine, if a user searches for a web page by keywords, the search engine returns the name, URI, and a small query-biased snippet for each matched page. The snippet explains why the page is matched. The user can then make a final choice and selectively browse the needed pages without opening all matched links. The same approach could be used for searching encrypted documents, since the scenario is the same. It could also be combined with searchable encryption to improve the user experience.

However, obtaining a query-biased snippet from encrypted data is quite challenging. To get a query-biased snippet from a plaintext, a general search engine must scan each matched document dynamically, extract the snippets where the keywords occur, rank the results, and finally return the top-ranking snippet. When the data is encrypted, dynamic scanning becomes impossible. Precomputing a snippet file for preview is also impossible, because there is no way to know in advance what the queried keywords are, and building all static (keyword, snippet) pairs for each document costs too much storage space, even far more than the document itself. Thus, we consider dividing a document into many equal-size encrypted snippets and preconstructing an index to address each snippet. The index stores the keyword frequency in each snippet, which enables the server to dynamically calculate the best snippet for the user when queried by multiple keywords.

There are two major security problems. First, the snippet is a part of a document; therefore the encryption scheme used may affect the snippet retrieval. We use a pad-and-divide scheme to preprocess the document to make it compatible with any cryptosystem such as DES and RSA. Second, the information in the index is private, and no partial information about the document should be leaked to the server. Therefore, we encrypt the index based on the core method of searchable encryption. Since each keyword maps to an entry in the index, directly returning the related score entries without combining them would leak the number of queried keywords (which equals the number of returned entries) to an eavesdropper, and the bandwidth cost would grow with the number of requested keywords. A homomorphic encryption scheme could be adopted so that the server operates directly over the encrypted data and produces a single result while the ciphertext remains secure. However, homomorphic encryption schemes are often costly when dealing with a large amount of data. Observing that all the data are very small, we propose a novel lightweight substitute for homomorphic encryption to construct such a secure index.

In this paper, our contributions are the following. (1) To the best of our knowledge, we formalize the problem of securely retrieving a query-biased snippet over encrypted data for the first time. We generalize the notion of secure query-biased preview (SecQBP) and its security model. (2) We propose a lightweight solution for matrix data with a partially homomorphic property, named matrix additive coding (Matrix-AC), which can efficiently add two rows of small numbers while keeping the data encrypted. (3) Based on Matrix-AC and a private information retrieval protocol, we construct a secure additive ranking index (SecARI) that enables the server to efficiently compute the top-ranking snippet over encrypted data while no partial information about the document is leaked; we then propose the complete construction to realize SecQBP and prove that it is secure under our security model. (4) We propose a high-level solution to combine the preview scheme with searchable encryption, which greatly improves the user experience.

The rest of the paper is organized as follows. Section 2 presents the related work. Section 3 presents the notations and preliminaries. Section 4 presents our proposed additive coding scheme. In Section 5 we formally define the preview scheme and its security model and present the construction in detail. We present the application in searchable encryption and analyze the performance in Section 6. Section 7 concludes this paper.

2. Related Work

We categorize the related work into four topics, and each topic is summarized separately.

2.1. Query-Biased Snippet

A query-biased snippet is a piece of the content of a document that contains the queried keywords. Query-biased snippet generation schemes are widely used in modern search engines. The technique is also named dynamic summary or keyword-in-context (KWIC) snippet generation. The term was first used in [1]. Improvements were introduced in [2–6]. However, as far as we know, all query-biased schemes focus on dynamically retrieving snippets from the plaintext. If the document is encrypted, dynamic scanning becomes impossible. A static preview is a snippet that summarizes the content in advance and is always the same regardless of the query. It is generally composed of either a subset of the content or metadata associated with the document. A lightweight static preview scheme over encrypted data was introduced in [7]. For more details, please refer to [8] for a survey of recent preview schemes.

2.2. Searchable Encryption

Our proposed scheme and security model are based on searchable encryption technique. The basic goal of searchable encryption is to enable a user to privately search over encrypted data by keywords. The first scheme was introduced in [9]. Later on, many index-based symmetric searchable encryption schemes were proposed. The first secure index was introduced in [10], and the security model of adaptive chosen keyword attack (IND-CKA) was also introduced. Reference [11] introduced two constructions to realize symmetric searchable encryption: the first is SSE-1 which is nonadaptive and the second is SSE-2 which is adaptive. A generalization for symmetric searchable encryption was introduced in [12]. Another type of searchable encryption schemes is public-key based. The first scheme was introduced in [13], the improved definition was introduced in [14], and the strongest security model was introduced in [15].

There are many functional extensions for the basic searchable encryption schemes. Reference [16] introduced a scheme supporting conjunctive keyword search. References [17–19] introduced ranked keyword search over encrypted data. References [20–22] introduced fuzzy keyword search over encrypted data. Similar to fuzzy keyword search but different, [23, 24] introduced similarity search over encrypted data.

2.3. Homomorphic Encryption

Our proposed additive coding method is based on the core concept of homomorphic encryption. The classical homomorphic encryption schemes are based on group operations, such as the unpadded RSA in [25], the variant of ElGamal introduced in [26], Goldwasser and Micali’s bit homomorphic encryption scheme introduced in [27, 28], and Paillier’s encryption scheme introduced in [29]. Many improvements have been proposed based on these classical schemes. The schemes above are public-key based, and few symmetric homomorphic schemes have been proposed. A series of symmetric homomorphic schemes based on the one-time pad was introduced in [30]. Some ring-based homomorphic schemes have been proposed recently, which are also referred to as fully homomorphic encryption, such as the one in [31] that is based on ideal lattices and the one in [32] that does not require ideal lattices.

2.4. Private Information Retrieval

We encapsulate a private information retrieval (PIR) protocol and extend its use in our scheme. PIR schemes allow a user to privately retrieve the th bit of an -bit database. The notion was first introduced in [33] by Chor et al., and the notion of private block retrieval (PBR) was also introduced. Kushilevitz and Ostrovsky introduced a single-server and single-round computational PIR scheme in [34], which achieves communication complexity of for the basic scheme and could achieve with arbitrary small theoretically ( is achieved assuming security parameter is polylogarithmic in ). In [35], Cachin et al. introduced a single-database PIR scheme with polylogarithmic communication complexity for the first time, about as suggested. Gentry and Ramzan introduced a PBR scheme with communication cost in [36], where is a security parameter that depends on , which is nearly optimal.

3. Notations and Preliminaries

3.1. Basic Notations

We write to represent sampling element uniformly at random from a set and write to represent the output of an algorithm . We write to refer to the concatenation of two strings and . We write to represent its cardinality when is a set and write to represent its bit length if is a string. We write to represent bitwise exclusive OR (XOR) and “” to represent bitwise shift left for bits. We write to represent the greatest integer less than or equal to . We write to represent a bit string that contains either 0 or 1 (e.g., ). A function is negligible if for every positive polynomial there exists an integer such that for all , . We write and to denote polynomial and negligible functions in , respectively.

We write to represent a dictionary of words in lexicographic order. We assume that all words are of length polynomial in . We write to refer to a document that contains words. We write to represent the identifier of that uniquely identifies the document, such as a memory location. We write to refer to a snippet (50 characters in general) extracted from the document and write to represent the identifier of , such as the position in the document.

3.2. Cryptographic Primitives

A function is pseudorandom if it is computable in polynomial time in and for all polynomial size adversaries , it cannot be distinguished from random functions. If is bijective then it is a pseudorandom permutation. We write the abbreviation PRF for pseudorandom functions and PRP for pseudorandom permutations.

Let ES represent an encryption scheme. Let ES.Gen represent the key generation algorithm ( is the security parameter). Let ES.Enc$_K$ represent the encryption algorithm that encrypts data using key $K$, and let ES.Dec$_K$ represent the decryption algorithm that decrypts data to gain the plaintext . In our scheme, a lot of data will be encrypted using the same key; therefore the encryption scheme must be at least CPA (chosen plaintext attack) and CCA (chosen ciphertext attack) secure. For example, ECB (electronic codebook) mode in DES or RSA without OAEP (optimal asymmetric encryption padding) should not be used.

3.3. Homomorphism

Let denote the set of the plaintexts, let denote the set of the ciphertexts, let denote the operation between the plaintexts and the operation between the ciphertexts, and let “” denote “directly compute” without any intermediate decryption. An encryption scheme is said to be homomorphic if for any given encryption key , the encryption function or the decryption function satisfies
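A standard way to write the two properties is sketched below. The symbols are assumptions chosen for illustration: $\mathcal{M}$ and $\mathcal{C}$ for the plaintext and ciphertext sets, $\odot_{\mathcal{M}}$ and $\odot_{\mathcal{C}}$ for the two operations, $E_K$ and $D_K$ for encryption and decryption, and plain equality in place of the "directly compute" symbol.

```latex
\begin{align}
E_K(m_1 \odot_{\mathcal{M}} m_2) &= E_K(m_1) \odot_{\mathcal{C}} E_K(m_2),
  \qquad m_1, m_2 \in \mathcal{M}, \tag{1}\\
D_K(c_1 \odot_{\mathcal{C}} c_2) &= D_K(c_1) \odot_{\mathcal{M}} D_K(c_2),
  \qquad c_1, c_2 \in \mathcal{C}. \tag{2}
\end{align}
```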

Sometimes, property (2) is also referred to as homomorphic decryption. If the operation is defined over a group, we say it is a group homomorphism. If the operation is defined over a ring, we say it is a ring homomorphism, and the scheme is also referred to as fully homomorphic. If the operator is addition, we say the scheme is additively homomorphic, and if the operator is multiplication, we say it is multiplicatively homomorphic.

3.4. Private Block Retrieval Protocol

Let represent a database of blocks; all blocks have equal size . The user wants to privately retrieve the th block from the server; therefore he runs a private block retrieval protocol. At a high level, we define the single database and single round computational PBR as follows.

Definition 1 (computational PBR protocol). A computational PBR protocol scheme is a collection of four polynomial-time algorithms CPBR = (Setup, Query, Response, Decode) such that we have the following.

Setup is a probabilistic algorithm that takes as input the database and outputs a parameter set . It is run by the database owner, and is known to all users.

Query is a probabilistic algorithm that takes as input a block index and outputs a token . It is run by the user. is sent to the server.

Response is a deterministic algorithm that takes as input the requested token and outputs a result . It is run by the server. is sent to the user.

Decode is a deterministic algorithm that takes as input the response from the server and outputs the requested data block . It is run by the user.

In our preview scheme, we adopt the computational PBR scheme introduced in [36] as a primitive. In the setup algorithm, we set the database size as the maximal possible document size (e.g., 10 MB) and reuse the prime number set and the prime power set across all documents. The communication complexity is where is the document length and is the snippet length.

4. Secure Additive Coding

Before introducing the preview scheme, we first introduce a novel coding method called matrix additive coding (Matrix-AC) that enables the addition of two rows of a matrix in a homomorphic fashion. It is very fast, is suitable for dealing with small numbers (each integer is coded to a specific bit string), and is especially useful for computing statistical tables in encrypted form. Since all coded integers are correlated, it is not a homomorphic encryption scheme, which would encrypt each datum independently.

Matrix-AC is used in the preview scheme to construct the secure additive ranking index (SecARI). Because a large number of small numbers will be calculated in the preview scheme, using homomorphic encryption schemes is costly. Therefore, we use the Matrix-AC scheme as a substitute for a homomorphic encryption scheme to achieve optimal performance.

We note that, for all the schemes (including the preview scheme in the next section), we only consider the confidentiality of the data. Mechanisms for protecting data integrity are outside the scope of this paper.

4.1. Basic Idea

The basic idea of coding small integers with homomorphic property is simple: we consider an integer vector , where and . We define a “vernier” that has bits, and each integer is mapped to such vernier for bits in different position. A global cursor is autoincreased to process the mapping. To code a message, a random string as a one-time-pad key is used and XORed with the mapped data. The decoding is simple: just operate with the key and count the number of bit-1 to make reverse mapping.

For example (as shown in Figure 1), consider a vernier that has 8 bits, and we map three integers as a vector to three pairs (), (), and (). It is easy to see that the homomorphic property holds as , and which has exactly 5 bit-1. Thus, let the vector be , , , and let the keys be that each key is a random string; then the ciphertext vector will be , , . To perform addition for any two plaintexts , the server could directly compute the corresponding ciphertexts and the decryption key becomes . Using the new key, it is easy to decrypt the ciphertext and count the number of bit-1 to restore the result.
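The vernier example can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `encode_vector` and `decode` are hypothetical, the values 2, 3, and 1 are chosen only so that the first two sum to 5 as in the text, and the runs of 1-bits are placed from the low-order end of the vernier.

```python
import secrets

def encode_vector(values, width):
    """Code each small integer as a disjoint run of 1-bits in a `width`-bit
    vernier, then mask it with a fresh one-time-pad key (illustrative names)."""
    if any(v < 0 for v in values) or sum(values) > width:
        raise ValueError("the vernier would be used up")
    cursor = 0                       # global cursor, advanced after each value
    keys, ciphertexts = [], []
    for v in values:
        mapped = ((1 << v) - 1) << cursor   # v one-bits starting at the cursor
        cursor += v
        key = secrets.randbits(width)       # one-time pad for this value
        keys.append(key)
        ciphertexts.append(mapped ^ key)
    return keys, ciphertexts

def decode(ciphertext, key):
    """Remove the pad and count the 1-bits to recover the integer."""
    return bin(ciphertext ^ key).count("1")

keys, cts = encode_vector([2, 3, 1], width=8)
# The server XORs two ciphertexts; the user XORs the two matching keys.
summed = cts[0] ^ cts[1]
print(decode(summed, keys[0] ^ keys[1]))
```

Because the runs of 1-bits never overlap, XOR of the mapped values behaves like addition of the underlying integers, which is the property the text relies on.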

The problem of the basic scheme is that the vernier may be used up. That is why we set the restriction that a vernier is just used in a single vector. Another drawback is that, as the max increases, the length of the vernier also increases in linear . Thus, the targeted data must be small enough to save storage space. A good application is not for dealing with few such integers but for computing a large number of small data in parallel.

4.2. Coding a Matrix

We extend the basic idea to code the data matrix. Let represent a matrix with rows and columns, let represent the element in row and column , and represent the th row. Matrix-AC scheme is described in Algorithm 1. Note that there are cursors that control the mapping for each column.

Input random strings of length
:
(1) check the validity of the input, continue if and for all , , or output Ø otherwise
(2) initialize a ciphertext matrix , each element is set to ( bits), and initialize cursors and set each ( bits)
(3) for   do
(4)   for   do
(5)     for   do
(6)
(7)
(8)     end for
(9)   end for
(10)  encrypt the $i$th row as
(11) end for
(12) output
Decode :
(1) initialize a temporary data row with each element be bits, and set
(2) for each in , let the binary form of be , then
(3) output

Let us check the homomorphism for decoding: let represent the decryption algorithm, for arbitrary two ciphertext rows and , , where the decryption key for is .
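The row-level homomorphism can be checked with a small Python sketch of the matrix coding. It follows the prose description (one cursor per column, one random pad per row) rather than the exact listing, and every name in it (`matrix_ac_encode`, `matrix_ac_decode`, `xor_rows`, `bound`) is an assumption introduced for illustration. The validity check assumes that the sum of each column must not exceed the vernier length.

```python
import secrets

def matrix_ac_encode(matrix, bound):
    """Code a matrix of small non-negative integers. `bound` is the assumed
    vernier length in bits per cell. Returns (keys, ciphertext) or None."""
    n_cols = len(matrix[0])
    for j in range(n_cols):
        column = [row[j] for row in matrix]
        if any(v < 0 for v in column) or sum(column) > bound:
            return None                      # invalid input
    cursors = [0] * n_cols                   # one cursor per column
    keys, cipher = [], []
    for row in matrix:
        key_row = [secrets.randbits(bound) for _ in range(n_cols)]
        enc_row = []
        for j, value in enumerate(row):
            cell = 0
            for _ in range(value):           # set `value` fresh bits
                cell |= 1 << cursors[j]
                cursors[j] += 1
            enc_row.append(cell ^ key_row[j])  # one-time-pad the cell
        keys.append(key_row)
        cipher.append(enc_row)
    return keys, cipher

def xor_rows(row_a, row_b):
    """Cell-wise XOR; used on ciphertext rows and on key rows alike."""
    return [a ^ b for a, b in zip(row_a, row_b)]

def matrix_ac_decode(cipher_row, key_row):
    """Remove the pad and count 1-bits in every cell."""
    return [bin(c ^ k).count("1") for c, k in zip(cipher_row, key_row)]

keys, cipher = matrix_ac_encode([[2, 0, 1], [1, 1, 0], [0, 1, 2]], bound=4)
combined = xor_rows(cipher[0], cipher[2])    # done by the server
print(matrix_ac_decode(combined, xor_rows(keys[0], keys[2])))
```

The server only ever sees padded cells, while the holder of the row keys recovers the element-wise sum of the selected rows.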

A practical problem arises if the scheme is implemented directly. Real machines have no direct representation for, say, a 5-bit datum (there is an extended “bitset” class in C++, but it treats the bits as a set, all operations are performed over the set, and it is very slow). Data is stored in bytes, so 5-bit data is stored in one byte (8 bits) as a “character,” 12-bit data is stored in two bytes (16 bits) as a “short integer,” and 20-bit data is stored in four bytes (32 bits) as an “integer.” The XOR operation is therefore performed over bytes, and the data must be extended to such a standard length. However, since all data in Matrix-AC are in fact bit strings, the data in the same row can sometimes be “chained” together. For example, suppose each element has 5 bits and there are 6 elements in a row; the row can be chained into a 30-bit string and stored in a 32-bit integer. Another problem is that the “bit-counting” algorithm is realized indirectly, by “mod 2” operations or by setting masks to test whether the masked bit is 1. Therefore, performance would improve if dedicated hardware that deals directly with bits were used.
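The chaining idea can be illustrated as follows. This is a sketch under assumed parameters (six 5-bit cells per row, as in the example), with hypothetical helper names `chain` and `cell_popcounts`; it uses Python integers, so it shows the bit layout rather than the performance of a 32-bit machine word.

```python
CELL_BITS = 5                      # assumed cell width from the example
CELLS = 6                          # assumed number of cells per row
MASK = (1 << CELL_BITS) - 1

def chain(cells):
    """Pack the cells of one row into a single 30-bit integer."""
    word = 0
    for i, cell in enumerate(cells):
        if not 0 <= cell <= MASK:
            raise ValueError("cell does not fit in 5 bits")
        word |= cell << (i * CELL_BITS)
    return word

def cell_popcounts(word):
    """Mask out each cell in turn and count its 1-bits."""
    return [bin((word >> (i * CELL_BITS)) & MASK).count("1")
            for i in range(CELLS)]

row_a = chain([0b00011, 0b00000, 0b00001, 0b00000, 0b00111, 0b00000])
row_b = chain([0b01100, 0b00001, 0b00000, 0b00000, 0b01000, 0b10000])
# One XOR on the chained words replaces six cell-wise XORs.
print(cell_popcounts(row_a ^ row_b))
```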

4.3. Proof of Security

Intuitively, the scheme is secure if any two matrices (the numbers of elements are the same) prepared by the adversary are indistinguishable, which also implies that any two elements from the same matrix are indistinguishable. We define the security of Matrix-AC as follows.

Lemma 2. If are random strings, then Matrix-AC is CPA secure.

Proof (sketch). We briefly prove the scheme since the mechanism is simple. We describe a PPT simulator for all PPT adversaries . generates matrix with random strings of length . For any row in the original matrix , no matter how it maps, the last computation is a string XORed with a one-time-pad random string ; thus the result is indistinguishable from random. For any two rows and from the same matrix, each row is XORed with different random strings such that the results are indistinguishable from each other. For any two rows and from two matrices, as discussed previously, is indistinguishable from the random string .

5. Secure Query Biased Preview Scheme

The preview scheme contains two steps: (1) storage, in which the data owner prepares the previewable document and a searchable index; (2) retrieval, in which the user privately retrieves the snippet from the server.

The basic idea of constructing a query-biased previewable document is as follows: divide the document into snippets with equal size, extract keywords from each piece to form a keyword set which records the snippet information as (keyword, frequency) pairs, and build an index to address the snippets according to the distinct keywords. The index is a two-dimensional matrix of the form (keyword, snippet index), and the value is the keyword frequency in the corresponding snippet. An example is shown in Table 1. The keyword is represented by , and the snippet index is represented by .

Retrieving the best snippet for multiple keywords proceeds as follows. The user submits multiple keywords to the server. The server retrieves the corresponding rows of the index and adds the rows together. The result is a single entry that contains the information about the best-matched snippet. The user decrypts the entry, selects the snippet identifier (index number) with the highest score (for simplicity, the score equals the frequency), and privately retrieves the snippet from the server by running a PBR protocol. For the server to perform the “addition” operation over the encrypted data, a homomorphic encryption scheme could be used to encrypt the index. We adopt Matrix-AC as the encryption scheme instead of a standard homomorphic encryption scheme, as discussed previously.

We now introduce the definition and the security model of the preview scheme. Note that we assume the server is honest but curious. Additional methods could be added to make the solutions robust against malicious attacks; however, we restrict our discussion to the honest-but-curious setting. We also note that all documents are treated as text files, in the same way as a search engine treats them. For example, if a document is a web page, the style tags will be pruned.
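Ignoring encryption for a moment, the index and the row addition can be sketched on plaintext data. The helper names `build_index` and `best_snippet` and the toy keyword sets are assumptions made for illustration; in the scheme itself the rows are coded with Matrix-AC and the addition is carried out by the server on coded rows.

```python
from collections import Counter

def build_index(snippet_keywords):
    """snippet_keywords: one Counter of keyword frequencies per snippet.
    Returns {keyword: [frequency in snippet 0, snippet 1, ...]}."""
    vocabulary = sorted(set().union(*snippet_keywords))
    return {kw: [counts[kw] for counts in snippet_keywords] for kw in vocabulary}

def best_snippet(index, query):
    """Add the rows of the queried keywords and pick the top-scoring snippet."""
    n_snippets = len(next(iter(index.values())))
    total = [0] * n_snippets
    for keyword in query:
        row = index.get(keyword)
        if row is not None:
            total = [a + b for a, b in zip(total, row)]
    best = max(range(n_snippets), key=total.__getitem__)
    return best, total

snippets = [Counter(["cloud", "cloud", "storage"]),
            Counter(["storage", "key"]),
            Counter(["key", "cloud", "key"])]
index = build_index(snippets)
print(best_snippet(index, ["cloud", "key"]))
```

Ties are broken here in favor of the earliest snippet, which is a choice of this sketch and not something the text specifies.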

5.1. Scheme Definition

The secure query-biased preview (SecQBP) scheme contains two parties: a user and a remote server . encrypts his private document to , generates a secure additive ranking index (SecARI) , and then outsources them to . stores the document, performs the computation for the scores when queried by multiple keywords, and returns the result to . then selects the best snippet indexed by and privately retrieves it from .

Without loss of generality, we consider the construction for a single document. The scheme could be extended to a document collection with ease. Now we define the SecQBP scheme as follows.

Definition 3 (secure query-biased preview scheme). SecQBP scheme is a collection of six polynomial-time algorithms SecQBP = (Gen, Setup, Query, ComputeScore, DecScore, DecSnip) as follows.

Gen is a probabilistic algorithm that takes as input a security parameter and outputs the secret key collection . It is run by the user, and the keys are kept secret.

Setup$_K$ is a probabilistic algorithm that takes as input a document and outputs an encrypted document (using any cryptosystem) and an index . It is run by the user, and , are outsourced to the server.

Query$_K$ is a deterministic algorithm that takes as input the queried multiple keywords and outputs a secret query token . It is run by the user, and is sent to the server.

ComputeScore is a deterministic algorithm that takes as input the secret query and the index and outputs the result that contains the final score information about each snippet. It is run by the server.

DecScore$_K$ is a deterministic algorithm that takes as input the queried keywords , the document identifier , and the query result and outputs the snippet index number . It is run by the user.

DecSnip$_K$ is a deterministic algorithm that takes as input the ciphertext and outputs the recovered plaintext snippet . It is run by the user. Note that, if the user retrieves the entire encrypted document, he could decrypt the document by decrypting each snippet.

5.2. Security Model

Informally speaking, SecQBP must guarantee that, first, given the encrypted document and the index , the adversary cannot learn any partial information about the document; second, given a sequence of queries , the adversary cannot learn any partial information about the queried keywords and the matched snippet (including the index number and the content). We now present the security definition for adaptive adversaries.

Definition 4 (semantic security against adaptive chosen keyword attack, CKA2-security). Let = (SecQBP algorithm + SecQBP protocol) be the preview scheme. Let be the security parameter. One considers the following two probabilistic experiments, where is an adversary and is a simulator.

In the first (real) experiment, the challenger runs Gen to generate the key . generates a document and receives Setup$_K$ from the challenger. makes a polynomial number of adaptive queries (each set contains multiple keywords in ), and for each queried keyword set , receives a query token Query$_K$ from the challenger. Finally, returns a bit that is output by the experiment.

In the second (simulated) experiment, generates a document . Given only the size , generates and sends to . makes a polynomial number of adaptive queries (each set contains multiple keywords in ), and for each queried keyword set , receives a query token from . Finally, returns a bit that is output by the experiment.

We say that SecQBP is semantically secure against adaptive chosen keyword attack if, for all PPT adversaries , there exists a PPT simulator such that where the probabilities are over the coins of Gen and Setup (related to the underlying cryptosystem).

Note that, with or , could run ComputeScore to get the result , and any internal state is also captured by . could also send query according to the previous result.

5.3. Concrete Construction

Now we describe the concrete construction for SecQBP. We describe the constructions for some core components and then present the complete construction.

5.3.1. Encrypting a Document

We consider the problem of extracting keywords from a document. In general, a keyword is followed by a separator. Thus, a general snippet of 50 characters contains no more than 25 keywords. Another problem is that not all words are keywords, and such words do not need indexing, for instance, the words “a,” “the,” and “and.” Words of this kind can be found in most sentences, so they are useless as keys to index a file. They are called stop words and were first studied in [37]. The most classical stop-word list in wide use is a list of 425 words suggested in [38].

There is a further problem that the last word in a snippet may be cut off. In other words, the last word of a snippet may be too long to fit the remaining space, and it cannot be split into two words because neither part is a valid keyword. In a general search engine, such an overflowing word is omitted. However, in the scenario of precomputing snippets, omitting the word may lose a keyword: when the omitted keyword is queried, no matched snippet is returned, although the document actually contains a match. Thus, we add the full word to the keyword sets of both snippets that contain part of the keyword.

The basic idea for encrypting a document is dividing the document with equal size; therefore, a padding scheme is needed when the last piece of the document is not long enough. We modify the CBC plaintext padding scheme introduced in [39] to meet our goal. Let represent the length of the snippet; the snippet is treated as a sequence of bytes. If the last snippet is bytes, then pad the snippet with bytes with value . After decryption, the padding will be deleted to recover the original plaintext. For instance, suppose ; if the final snippet has 15 plaintext bytes, then pad the snippet as where there are 35 bytes that have the number 35. If the snippet is divisible by , here is 50, then add a new snippet with all bytes being 50:
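The padding rule can be sketched directly from the worked numbers in the text (snippet length 50, a 15-byte tail padded with 35 bytes of value 35, and a full extra snippet of 50s when the length is already a multiple of 50). The function names `pad_document` and `unpad_document` are hypothetical, and the sketch assumes a snippet length of at most 255 so that the pad length fits in one byte.

```python
def pad_document(data: bytes, snippet_len: int = 50) -> bytes:
    """Pad to a multiple of snippet_len; every pad byte stores the pad length."""
    if not 1 <= snippet_len <= 255:
        raise ValueError("pad length must fit in a single byte")
    pad_len = snippet_len - (len(data) % snippet_len)   # always 1..snippet_len
    return data + bytes([pad_len]) * pad_len

def unpad_document(padded: bytes, snippet_len: int = 50) -> bytes:
    """Strip the padding after decryption and validate it on the way."""
    if not padded or len(padded) % snippet_len != 0:
        raise ValueError("input is not a whole number of snippets")
    pad_len = padded[-1]
    if not 1 <= pad_len <= snippet_len:
        raise ValueError("invalid padding length")
    if padded[-pad_len:] != bytes([pad_len]) * pad_len:
        raise ValueError("invalid padding bytes")
    return padded[:-pad_len]

tail = pad_document(b"x" * 15)      # 15 plaintext bytes + 35 bytes of value 35
full = pad_document(b"x" * 100)     # two full snippets + one snippet of 50s
print(len(tail), tail[-1], len(full), full[-1])
```

Always adding at least one pad byte is what makes the padding unambiguous to remove, at the price of one extra snippet when the document length is already a multiple of the snippet length.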

Let represent a document, and is the encrypted form of . We introduce the scheme for encrypting a document, shown in Algorithm 2. In the algorithm, “a valid keyword” means the token is not a separator, not a stop word, and not a random-looking string. A word dictionary could be used to check its validity.

Input: a document , the encryption key
Output: the encrypted document , a keyword-frequency set collection
Method:
(1) padding according to snippet length as discussed
(2) treat as pieces
(3) for   do
(4)   create a keyword-frequency set for
(5)   scan for distinct valid keywords. For each keyword, count the keyword frequency, and add the vector (keyword, frequency) to as set element
(6)   encrypting the snippet as
(7) end for
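The keyword-extraction part of this procedure can be sketched as follows, with the encryption step deliberately left out. The names are hypothetical, the stop-word set contains only the three examples named in the text, a "valid keyword" is approximated by a run of ASCII letters, and the text is assumed to use one byte per character so that character offsets equal byte offsets. A word that straddles a snippet boundary is counted in both snippets, as described above.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "the", "and"}    # illustrative subset only

def keyword_frequency_sets(text: str, snippet_len: int = 50):
    """Return one Counter of (keyword, frequency) pairs per snippet."""
    # Padding always extends the text to the next multiple of snippet_len.
    n_snippets = len(text) // snippet_len + 1
    sets = [Counter() for _ in range(n_snippets)]
    for match in re.finditer(r"[A-Za-z]+", text):
        word = match.group().lower()
        if word in STOP_WORDS:
            continue
        first = match.start() // snippet_len        # snippet holding the first letter
        last = (match.end() - 1) // snippet_len     # snippet holding the last letter
        for s in range(first, last + 1):            # boundary words go to both
            sets[s][word] += 1
    return sets

text = "the cloud " * 4 + "storage " + "encrypted data"
for number, counts in enumerate(keyword_frequency_sets(text)):
    print(number, dict(counts))
```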

5.3.2. Constructing the Secure Index

The secure additive ranking index (SecARI) is the encrypted form of the snippet index, as shown in Table 1 (PAD denotes the padding with a random string), and each row is an encrypted entry. For security reasons, the number of entries of SecARI must be padded to a fixed amount that is independent of the actual number of keywords in the content; otherwise it leaks the number of distinct keywords in the document (which equals the number of rows). An example of a SecARI is shown in Table 2. In the table, is a pseudorandom permutation that randomizes the order of the keywords, and the value is the encrypted score.

Let us consider the secure amount of the entries. If the document is small, let a keyword occupy only one byte; then the maximum possible number of keywords is (as discussed, a valid keyword is at least bytes); thus, the number of entries must be set to (the fractional part is ignored). If the document is large, the maximum possible number of keywords equals the total number of words in the dictionary. Reference [40] made a detailed word statistical analysis based on 450 million words on Corpus of Contemporary American English (1990–2012). The statistics show that the total words used are about 60000. We set the dictionary used as and define the maximum keyword amount as . Thus, we define the number of entries as follows.

Definition 5 (number of entries). To guarantee security, the number of entries for a SecARI is

SecARI is in fact a sparse look-up table, and we use the indirect addressing method to manage it. The indirect addressing method, also called the FKS dictionary, was introduced in [41] and is also adopted in the symmetric searchable encryption scheme in [11]. It manages a sparse table of entries of the form (address, value). The address is a virtual address that locates the value field. Given an address, the algorithm returns the associated value in constant look-up time and returns ⊥ otherwise.
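
The (address, value) interface can be illustrated with a few lines of code. In this sketch Python's built-in dict stands in for the FKS dictionary: it gives expected rather than worst-case constant look-up time, so it only mirrors the interface and not the construction of [41]. The class name and the use of None as the "not found" result are assumptions made for the example.

```python
class SparseTable:
    """Sparse look-up table of (address, value) pairs.

    A plain dict is used as a stand-in for an FKS dictionary."""

    def __init__(self, pairs):
        # pairs: iterable of (address, value); addresses must be hashable.
        self._table = dict(pairs)

    def lookup(self, address):
        # Returns the stored value, or None when the address is absent.
        return self._table.get(address)

    def __len__(self):
        return len(self._table)
```

For example, `SparseTable([(b"\x01", [3, 0]), (b"\x02", [1, 1])]).lookup(b"\x03")` yields None because that address was never stored.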

In addition, we make use of a pseudorandom permutation to index an entry and a pseudorandom function to generate the one-time-pad keys for Matrix-AC: where is the keyword length and is the length of the document identifier. is the upper bound discussed in Matrix-AC and is the number of snippets that is calculated from the document size. The submitted keyword is encrypted by so that the server cannot learn which keyword the user queries.

Let be a data matrix managed by indirect addressing technique as discussed previously. Now we describe SecARI in Algorithm 3.

Input:
(1) : the document identifier
(2) : the keyword-frequency set collection
(3) : consists of the master key for row encryption
  and the permutation key for
Output: A secure additive ranking index
Method:
(1) scan , extract distinct keywords
(2) create an data matrix , the value of each data is
  the frequency of the keyword in the th snippet
(3) for each row , the encryption/decryption key is .
  Thus the keys for all rows form a vector
(4) create an matrix , each cell has length
(5) compute using Matrix-AC
(6) for all , set where is the th row of the encrypted matrix
(7) if , set remaining entries of to random values
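
The index construction can be sketched as below, under several loudly stated assumptions. Matrix-AC is defined earlier in the paper and is not reproduced here; the sketch assumes it behaves like additive one-time-pad masking modulo a fixed bound. HMAC-SHA256 plays the role of the pseudorandom function, and a truncated HMAC is used as a stand-in for the pseudorandom permutation (it is not a true permutation). The names MODULUS, prf, row_pads, address, and build_secari are hypothetical.

```python
import hashlib
import hmac
import secrets

MODULUS = 2 ** 32  # hypothetical bound for masked values (assumption)

def prf(key: bytes, message: bytes) -> bytes:
    """HMAC-SHA256 used as the pseudorandom function."""
    return hmac.new(key, message, hashlib.sha256).digest()

def row_pads(master_key, doc_id, keyword, n_snippets):
    """Derive one pad per snippet column for the row of this keyword."""
    seed = prf(master_key, doc_id + b"|" + keyword.encode())
    return [int.from_bytes(prf(seed, j.to_bytes(4, "big"))[:4], "big")
            for j in range(n_snippets)]

def address(perm_key, doc_id, keyword):
    """Virtual address of a row; truncated HMAC stands in for the PRP."""
    return prf(perm_key, doc_id + b"|" + keyword.encode())[:16]

def build_secari(doc_id, collection, master_key, perm_key, n_entries):
    """collection[i] maps keyword -> frequency in snippet i."""
    keywords = sorted({kw for freq in collection for kw in freq})
    if len(keywords) > n_entries:
        raise ValueError("more keywords than index entries")
    n = len(collection)
    index = {}
    for kw in keywords:
        pads = row_pads(master_key, doc_id, kw, n)
        # Mask each frequency with its pad (additive one-time pad).
        index[address(perm_key, doc_id, kw)] = [
            (freq.get(kw, 0) + pad) % MODULUS
            for freq, pad in zip(collection, pads)]
    # Fill the remaining entries with random rows so the table size
    # does not reveal the number of distinct keywords.
    while len(index) < n_entries:
        index[secrets.token_bytes(16)] = [
            secrets.randbelow(MODULUS) for _ in range(n)]
    return index
```

Because every row, real or padding, is a list of values that look uniformly random, the server sees a table of fixed size whose contents do not depend on the document.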

5.3.3. The Complete Scheme

A SecARI alone is not enough to hide the number of queried keywords. When the user submits a multikeyword query, every query should have the same length so that an eavesdropper cannot learn the number of keywords it contains. Let the maximum number of keywords allowed in a single query be ; the remaining space must be padded. The user and the server should establish a secure channel such as SSL to transport such messages; otherwise, the padding may be discovered by an eavesdropper. Since a keyword is small, the bandwidth wasted on padding is negligible.
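
A fixed-length query can be sketched as follows. The token length, the use of truncated HMAC-SHA256 in place of the pseudorandom permutation, the choice of random strings as padding, and the names keyword_token and build_query are all assumptions made for illustration; the text above only requires that every query be padded to the same length.

```python
import hashlib
import hmac
import secrets

TOKEN_LEN = 16  # bytes per token (hypothetical)

def keyword_token(perm_key: bytes, doc_id: bytes, keyword: str) -> bytes:
    """Token for one keyword; truncated HMAC stands in for the PRP."""
    message = doc_id + b"|" + keyword.encode()
    return hmac.new(perm_key, message, hashlib.sha256).digest()[:TOKEN_LEN]

def build_query(perm_key, doc_id, keywords, max_keywords):
    """Return a query of exactly max_keywords * TOKEN_LEN bytes."""
    if len(keywords) > max_keywords:
        raise ValueError("too many keywords in one query")
    tokens = [keyword_token(perm_key, doc_id, kw) for kw in keywords]
    # Random padding tokens hide how many real keywords were submitted.
    padding = [secrets.token_bytes(TOKEN_LEN)
               for _ in range(max_keywords - len(tokens))]
    return b"".join(tokens + padding)
```

With this layout a one-keyword query and a five-keyword query have identical sizes, so the length of the message reveals nothing about the number of keywords.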

We also determine the upper bound for Matrix-AC. As discussed, a general snippet contains at most 25 keywords; thus we set (stored as a standard integer).

Let be the pseudorandom function, and is the pseudorandom permutation as described previously. Now we describe the complete scheme in Algorithm 4, and describe the storage and retrieval protocol in Protocol 1. The retrieval protocol describes the retrieval of a query-biased snippet from document by submitting a multikeyword query .

Storage:
(1) the user U runs to generate the key
(2) runs to get the encrypted document and the index ,
  and sends to the server S
Query:
(1) U runs to get a token and sends it to S
(2) runs to produce the score result , and
  sends it to U along with document identifier
(3) runs to get the index number (best matched snippet)
(4) runs a CPBR protocol as discussed, generates a query token
 from and sends to S
(5) responds with from
(6) runs to get the encrypted snippet
(7) runs to get the plaintext

:
(1) sample index keys , generate document
  encryption key
(2) output
:
(1) invoke to get and the encrypted document
(2) invoke to get the secure index
(3) output
:
(1) for each keyword in , compute
(2) put into query and pad it to length
(3) output the query
:
(1) let represent the snippet amount, unpack to get the queried
  tokens , set a flag
(2) select a subset of where is in . If no element
  is in , then set
(3) create the result if , or else randomly
  select an index and set
(4) put into query and pad it to length
(5) output
:
(1) unpack and get
(2)  if  the flag   then
(3) randomly select an index
(4)  else
(5) according to , generate the decryption key
  for each matched keyword in
(6) compute the decryption key
(7) invoke using Matrix-AC
(8) choose a snippet number in with the highest score
(9) end if 
(10) output the snippet index
: output the plaintext snippet
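
The score computation on the server and the decoding on the user side can be sketched with a toy example. As before, this assumes that Matrix-AC acts like additive masking modulo a fixed bound, which is an assumption about a construction not reproduced in this section. For brevity the index is keyed by plain keywords instead of permuted addresses, and the names pads, mask_row, compute_score, unmask, and decode are hypothetical.

```python
import hashlib
import hmac
import secrets

MODULUS = 2 ** 32  # hypothetical bound for masked values (assumption)

def pads(key, keyword, n):
    """One PRF-derived pad per snippet column."""
    return [int.from_bytes(
                hmac.new(key, f"{keyword}|{j}".encode(),
                         hashlib.sha256).digest()[:4], "big")
            for j in range(n)]

def mask_row(key, keyword, freqs):
    """User side, at setup: mask the frequency row of one keyword."""
    return [(f + p) % MODULUS
            for f, p in zip(freqs, pads(key, keyword, len(freqs)))]

def compute_score(index, tokens):
    """Server side: add the masked rows of all matched tokens column-wise.
    The server never sees a plaintext frequency."""
    matched = [t for t in tokens if t in index]
    if not matched:
        return [], None
    rows = [index[t] for t in matched]
    return matched, [sum(col) % MODULUS for col in zip(*rows)]

def unmask(key, matched, result):
    """User side: subtract the pads of every matched keyword."""
    scores = list(result)
    for kw in matched:
        scores = [(s - p) % MODULUS
                  for s, p in zip(scores, pads(key, kw, len(result)))]
    return scores

def decode(key, matched, result, n_snippets):
    """Return the index of the best matched snippet."""
    if not matched:
        # No keyword matched: pick a random snippet index.
        return secrets.randbelow(n_snippets)
    scores = unmask(key, matched, result)
    return max(range(n_snippets), key=scores.__getitem__)
```

Because the masks are additive, the sum of masked rows equals the sum of frequencies plus the sum of pads, so a single subtraction of the combined pads recovers the per-snippet scores. The chosen index is what the user then feeds into the private retrieval step to fetch the encrypted snippet.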

Note that this describes the scenario for a single document. The protocol also works for a document collection; thus, the user can retrieve multiple snippets for multiple documents in the same round.

5.4. Proof of Security

The server stores the SecARI, performs homomorphic computation for a query, and returns to the user the score information as a single entry. We prove the security by introducing a theorem as follows.

Theorem 6. If is a pseudorandom function, if is a pseudorandom permutation, and if ES is CPA and CCA secure, then SecQBP is CKA2 secure.

Proof. We describe a polynomial-size simulator such that, for all polynomial-size adversaries , and are indistinguishable. Consider the simulator that, given the size of the document , generates the data as follows.
(1) (Simulating ) computes , . For , generates a string such that each is a distinct string of length chosen uniformly at random, and each is a string of length bits chosen uniformly at random. All strings form .
(2) (Simulating ) prepares a query list that stores the query history. The value in is of the form (, ). When queried by a keyword set , for each keyword in , first scans to see if there is a match. If not, randomly chooses a distinct which is not in and stores the pair (, ) into . gets according to and sets .
(3) (Simulating ) sets to a -bit string chosen uniformly at random. Note that is a global parameter known by the user and the server.
We claim that no polynomial-size distinguisher could distinguish the following pairs.
(1) ( and ) Recall that consists of values. Each value consists of either a string of the form () or a random string. In any case, with all but negligible probability, the PRP key is not included; therefore the pseudorandomness of guarantees that is indistinguishable from random. The PRF key is also not included; therefore the pseudorandomness of guarantees that the derived key for each data row is indistinguishable from random, and then the underlying Matrix-AC is CPA-secure, which means that is indistinguishable from random. contains random values. Therefore, as discussed, and are indistinguishable.
(2) ( and ) Recall that is the evaluation of the PRP . In any case, with all but negligible probability, the PRP key is not included; therefore the pseudorandomness of guarantees that all in are indistinguishable from random, and is a random string of the same length as .
(3) ( and ) Recall that is encrypted by a CPA and CCA secure encryption scheme. Since the encryption key is not known by the adversary, the security of the encryption scheme guarantees that and are indistinguishable.

6. Comparison, Application, and Performance Analysis

First, we compare the functionalities and performance of our work with previous works. Then, as a significant example, we discuss how to combine the preview scheme with symmetric searchable encryption to improve the user experience. We also discuss the performance of the preview scheme in the concrete application example.

6.1. Scheme Comparison

Let denote the snippet length and the document size; the comparisons of our work with other representative works are shown in Table 3.

The query-biased preview mode is widely used in general search engines, as introduced in [5]. In that scheme, the search engine dynamically scans the document line by line to find the top-ranking snippet. Therefore, the computation complexity is . In [7], Mithal and Tayebi proposed a static preview scheme over encrypted data based on a content mask technique. In the scheme, some segments of the plaintext are extracted in advance and masked with noise in such a way that the so-called “masked preview content” can be sent to the user as a preview when queried. The static scheme is fast and informative but does not explain why a document is matched by a query. Note that our scheme costs one extra round of communication since the score results have to be returned to the user in the first round.

6.2. Symmetric Searchable Encryption Extension

We review the generalized definition of symmetric searchable encryption (SSE) introduced in [12]. We assume that the searchable encryption scheme is in guided mode. In other words, the server will first return to the user the identifiers of the matched documents, and the user makes a final choice to select some document identifiers and sends them to the server to retrieve the selected ones.

Definition 7 (extended symmetric searchable encryption). In guided mode, a symmetric searchable encryption scheme is a collection of six polynomial-time algorithms SSE = (Gen, Enc, Token, Search, Retrieve, Dec) such that we have the following.
Gen is a probabilistic algorithm that takes as input a security parameter and outputs a secret key . It is run by the user, and the output key is kept secret by the user.
Enc is an algorithm that takes as input a secret key and a document collection and outputs a searchable structure and a sequence of encrypted documents . The searchable structure enables a user to query some keywords and have the server return the matched documents. For instance, in an index-based searchable symmetric encryption scheme, is the secure index. It is run by the user, and is sent to the storage server.
Token is a deterministic (possibly probabilistic) algorithm that takes as input a secret key and a set of keywords and outputs a search token (also named trapdoor or capability). It is run by the user.
Search (guided mode) is a deterministic algorithm that takes as input the query token and the searchable structure and outputs the matched document identifiers . It is run by the server, and the result is sent to the user. Note that, if not in guided mode, this algorithm returns the matched documents directly.
Retrieve is a deterministic algorithm that takes as input the encrypted documents and the selected document identifiers and outputs the selected documents corresponding to the identifiers. It is run by the server.
Dec is a deterministic algorithm that takes as input a secret key and the returned encrypted document and outputs the recovered plaintext . It is run by the user.

The preview scheme is applied in SSE as follows. The user runs SSE.Gen, SecQBP.Gen, SSE.Enc, and SecQBP.Setup in turn. The server stores the outsourced structure generated by SSE and the encrypted documents generated by the SecQBP scheme. To search for some documents, the user runs SSE.Token and SecQBP.Query and sends the outputs to the server. The server produces the identifiers of the matched documents, runs SecQBP.ComputeScore for the corresponding documents one by one, and returns the document identifiers and the score results together. The user decodes the scores, retrieves the preview snippets from the server, makes a choice, and sends the selected document identifiers to the server to retrieve the documents of interest.

6.3. Performance Analysis

We adopt SSE-2 introduced in [11] as an instance of an SSE scheme. Table 4 shows the time and storage complexity of SSE-2 alone and of SSE-2 combined with SecQBP in detail.

Let represent the encrypted document collection, so the total size is bytes. Other than the returned encrypted documents, the extra storage cost for SSE-2 is bytes; thus the storage cost is . The extra storage cost for SecQBP is for each document. By definition, the storage cost is . For SSE, the server searches the matched documents and decrypts the identifier list. For SecQBP, the server searches the indices for all matched documents, returns score results for all matched documents, and finally returns the snippets. Both have time complexity . The number of rounds for SSE is two (guided mode): first, the server returns the identifiers of the matched documents, and next it returns the selected documents. SecQBP adds an extra round for retrieving snippets from the snippet server. Moreover, for each matched document, the size of the messages for SSE is . For SecQBP it is , where is the document length, and is the snippet length, since the user will receive a score result of size and a snippet of size .

The detailed performance of SSE is analyzed in [42]; therefore we analyze only the performance of the SecQBP part. Document content varies widely in the real world. According to the observations in [40], the number of keywords in a document grows with the document size following a logarithmic model, and the worst case follows a linear model (each word in the document is a keyword, as in a dictionary). However, the security design of our scheme guarantees that the encrypted indices generated from any document are indistinguishable. Therefore, the computation on the server is independent of the model (i.e., the computations for all documents are the same). To approximate realistic data, we design a data generator that produces documents following the log model.

To demonstrate the optimization for the server, we compare our suggested Matrix-AC scheme with the simplest and, as far as we know, the fastest symmetric homomorphic encryption scheme [30], denoted by SHE, and with the well-known Paillier homomorphic cryptosystem [29]. We consider 100 users submitting queries simultaneously. Each query contains 5 keywords, and the score computation is over 100 matched documents (SSE generates the identifiers of the matched documents). The size of each document increases from 50 KB to 1 MB (all stored documents have the same size), and the computation cost is measured in milliseconds.

The algorithms are implemented in C++, and the server is a Pentium Dual-Core E5300 PC with a 2.6 GHz CPU. The result is shown in Figure 2. It demonstrates the following.
(1) The scheme is secure. The figure shows a linear computation cost, which means the computation is independent of the document content. In other words, the server does not see any difference among documents while performing the search.
(2) In a cloud environment, computation for 100 simultaneous users on a single server becomes a burden as the size of the document increases. In other words, the number of servers run as services is determined by the size of the stored documents and the number of accepted queries.
(3) The performance is improved when Matrix-AC is adopted in place of the homomorphic encryption schemes. From the data, Matrix-AC is about faster than using SHE or Paillier cryptosystem. We assume that the user does not modify the document frequently, and the main operation is just searching for some documents. Therefore, the performance improvement is significant since it could save about virtual machines in the cloud.

7. Conclusions

In this paper, we propose a generalized method of securely retrieving query-biased snippets over outsourced and encrypted data, which allows users to take a sneak preview of their encrypted data. The preview scheme has strong security and privacy guarantees with relatively low overhead, and it greatly improves the user experience.

Acknowledgments

Part of this work is supported by the Fundamental Research Funds for New Century Excellent Talents in Chinese Universities (Grant no. NCET-10-0298) and Ministry of Science and Technology of Sichuan province (no. 2012HH0003).