Abstract
With the explosive increase of document files, more and more data owners outsource their documents to the public cloud, which decreases the cost of local data management. However, information privacy leakage in the cloud is a great challenge that has been attracting more and more attention. In this article, we propose a secure and efficient document search scheme, named SES, based on both cloud and fog systems. All the documents are symmetrically encrypted before being outsourced to the cloud, and an index vector is constructed from the keywords of each document. Specifically, we integrate the position information of keywords into the TF-IDF model to generate document vectors, which are accurate and inherent summaries of the documents. In a query request, a data user provides a set of keywords, which are first extended by the Word2Vec tool and then mapped to a query vector. The extension process makes the provided keywords more comprehensive and accurate and hence improves document search accuracy. To achieve forward and backward security, both the document vectors and the query vectors are appended with an ingeniously constructed extra vector. The relevance score between a document and a query is defined as the inner product of the document vector and the query vector, and the most relevant documents are returned to the data users as the search results. To protect the contextual information stored in the document and query vectors, we encrypt the vectors with the secure kNN algorithm. To improve the search efficiency, a searchable index structure for the document set is constructed based on the Diffie–Hellman secret key negotiation algorithm. Analysis and simulation results illustrate that the proposed scheme performs well in terms of both security and search efficiency.
1. Introduction
With the rapid development of big data and cloud computing technologies, more and more data owners choose to outsource their data to the cloud in order to reduce the cost of local data maintenance and to share the data among a large set of data users in a more convenient manner. However, outsourcing data to the cloud server brings security risks to data privacy, because the cloud server is an “honest and curious” semi-trusted entity. In real life, the cloud server may attempt to mine the private content of documents and query requests, or leak private document information to unauthorized data users, which causes huge losses for the data owners. Therefore, the data need to be encrypted before being outsourced to the cloud server, which properly protects data privacy. However, an accompanying challenge is that the encrypted data uploaded to the cloud server cannot be searched in the plaintext manner, and traditional search schemes are no longer feasible. It is therefore necessary to design an effective ciphertext retrieval scheme suitable for the cloud computing environment.
Searchable encryption has been widely researched, and existing schemes are mainly divided into two categories: symmetric searchable encryption and asymmetric searchable encryption [1]. Early symmetric searchable encryption schemes only supported single-keyword search [2–6]. However, in practice, in order to describe the content of interest more accurately, users often input multiple keywords and hope that the returned results are the files most related to those keywords. To meet this demand, many schemes supporting multikeyword ranked retrieval have been proposed in recent years.
Cao et al. [7] first proposed a privacy-preserving multikeyword ciphertext retrieval result ranking scheme (MRSE) in the cloud computing environment to efficiently retrieve the ciphertexts of documents based on a set of keywords. They adopted the vector space model to calculate the inner product of document vectors and retrieval vectors, and ranked the document retrieval results based on the results of the inner product operation. After that, a set of schemes [8–10] were proposed in this research branch. Although they share a similar framework, they improve the efficiency, flexibility, and accuracy of encrypted document search. However, most existing schemes still need to be improved in the following aspects.
First, most existing schemes do not thoroughly consider the influence of position on the importance of keywords in a document. The same keyword in the title, index terms, conclusion, and other parts of a document is assumed to have the same importance. This is unreasonable considering that keywords in the title and index terms are more important than those in other parts. After all, authors carefully design the title of a document to emphasize its field and to distinguish the document from other documents in the same field. Therefore, the keywords in the title are more important, and we can infer that the position of a keyword is also an inherent property.
Second, the forward and backward security problem in the encrypted document retrieval field was proposed recently [11]. In a dynamic document collection, the cloud server can employ forward and backward attacks to analyze the contextual information of the encrypted documents, document vectors, and trapdoors. Specifically, the cloud server may employ outdated trapdoors to search newly inserted documents, or search deleted documents with new trapdoors. Considering that this threat is quite new, most existing schemes cannot properly defend against it. Apparently, it is imperative to improve the security of existing document search systems.
Third, in some existing schemes [7], the encrypted document vectors are randomly organized, and the search efficiency is very low considering that all the vectors need to be scanned in a query. To improve efficiency, the vectors are embedded into an index tree in some other schemes [12, 13]. However, as the document collection grows, the index tree grows monotonically and the document search time increases with it. If a set of documents can be pruned based on the query request before document search, the search efficiency can be further improved.
In this article, we propose a novel scheme to support secure and efficient document search over encrypted cloud file systems. First, we map the documents to a set of index vectors based on a revised TF-IDF model in which the positions of keywords are taken into consideration. Then, an index structure, named the search table, is designed for these document vectors to improve search efficiency. In the document retrieval phase, the data user provides a set of keywords, which are extended by the Word2Vec tool. Then, the extended keyword set is mapped to a query vector based on the keyword dictionary. To properly protect the contextual information in document vectors and query vectors, they are encrypted by the secure kNN algorithm [14] before being uploaded to the cloud server. Although the cloud server does not know the plaintexts of the vectors, it can accurately calculate the relevance scores between a query vector and all the document vectors. At last, the search results are returned to the data user, and the document search process is completed. Benefiting from the encrypted index structure designed on top of the Diffie–Hellman secret key negotiation algorithm, the search complexity is reduced to logarithmic complexity.
Our contributions can be mainly summarized as follows:
(i) We set different position weights for keywords in different positions and design a method to calculate the overall position weight score. Then, we calculate the correlation scores between keywords and documents by combining the domain-weighted scoring method and the TF-IDF model.
(ii) We extend the search keyword set provided by an authorized data user by applying the Word2Vec tool. The extended keyword set is more comprehensive, and its keywords express the semantic intention of the data user more accurately.
(iii) The document index vectors and query vectors are encrypted based on the secure k-nearest neighbor algorithm. Moreover, an extra vector is appended after these vectors. In this way, the document collection in our scheme can be dynamically updated with forward and backward security.
(iv) A searchable encrypted index table is designed by embedding the Diffie–Hellman key agreement scheme, which greatly shortens the search time even as the number of documents grows rapidly.
The rest of this article is organized as follows. Section 2 summarizes the related work in the field of encrypted document search. The system model, threat model, and design goals are discussed in Section 3. We present the proposed scheme SES in Section 4. The update problem of document set is discussed in Section 5. Section 6 analyzes the security of our scheme, and the efficiency is discussed in Section 7. At last, we conclude this article in Section 8.
2. Related Work
In the multikeyword ranked document search scheme proposed by Cao et al. [7], the document vectors are randomly organized, and the search complexity increases linearly with the number of documents. In a query process, all the relevance scores between the query vector and the document vectors need to be calculated and ranked to get the search results. Apparently, the retrieval efficiency needs to be improved.
Xia et al. [15] designed a novel index structure named the keyword balanced binary tree (KBB tree) to manage the encrypted document vectors. The most important property of the tree is that the relevance score between a node and a query vector is never less than that between its child nodes and the query. This property is used to dynamically prune the search paths, and the search efficiency is greatly improved.
A disadvantage of the KBB tree is its large size, because the vectors are chaotically distributed over different clusters: similar document vectors may belong to different clusters, and the vectors within a cluster may be very different from each other. Chen et al. [16] designed an index structure based on the hierarchical clustering algorithm, in which similar document vectors gradually aggregate into larger clusters until all the vectors belong to a single cluster. Although this index structure provides near-optimal results, it cannot support completely accurate document retrieval. In Ref. [17], the documents are encrypted in an order-preserving manner and then searched by multiple keywords; the simulation results show improved search accuracy.
Some other tree-based index structures have also been proposed in this field. In Ref. [18], a B+ tree is employed to support a dynamic document retrieval scheme. In Ref. [19], the document index is constructed based on a binary tree, which greatly improves the search efficiency while decreasing the storage cost. To support dynamic document sets, Sun et al. [12] proposed a multikeyword document search scheme based on the balanced binary tree. To further decrease the index cost and time consumption, Feng et al. [20] proposed a distributed document retrieval model based on the Boolean filter.
Besides retrieval accuracy and efficiency, the scalability of encrypted document search is also an important measure. In Refs. [21–26], the authors study algorithms for dynamic insertion, update, and deletion on the cloud server, and several search schemes are proposed to support multikeyword retrieval. However, in the search process, the data users must provide the exact query keywords, and only the documents containing those keywords are returned; the relevance between keywords is not considered. Apparently, this service is quite simple and cannot satisfy data users in real life. In fact, when data users are not familiar with a specific domain, it is extremely difficult for them to always provide a set of proper keywords, and in this case the above schemes fail. Therefore, when the provided keywords cannot fully reflect the intentions of the data users, the cloud server needs to return results based on the semantic meanings of the keywords.
To return proper search results even when the keywords provided by the data users are imprecise, the semantic information and relations of the keywords need to be taken into consideration [27–30]. Xia et al. [13, 28] employ the semantic meanings of query keywords in document retrieval: they first construct a semantic graph of keywords for the data users; in the query process, the query keywords are properly rectified into a new query, which is used to search the document collection. This scheme also supports multikeyword search. In Refs. [9, 31–34], multikeyword fuzzy search schemes are proposed, which return the best-fit results.
Although many encrypted document retrieval schemes have been proposed, most existing schemes ignore the impact of position on the importance of keywords: all the keywords in a document are treated equally no matter where they are located, and as a consequence, the search accuracy suffers. Closest to our scheme, Sun et al. [35] improve the scheme in Ref. [7] by assigning different weights to different keywords when constructing document vectors and query vectors; the relevance scores are calculated by cosine similarity, and the search accuracy increases. However, a disadvantage of these two schemes is that both the document vectors and the query vectors are static, so they do not support dynamic document collections.
Moreover, existing schemes do not consider forward and backward security and need to be improved in terms of security. Wei et al. [11] first proposed the definitions of forward security and backward security. They assume that the cloud server is untrusted and may furtively store historical query vectors and deleted documents. The cloud server can then search new documents with historical query vectors or retrieve deleted documents with new query vectors. Apparently, this may leak private information about the encrypted document collection and the queries.
3. Problem Statement
3.1. System Model
In order to meet the requirements of both efficiency and security in the encrypted document retrieval process, this article adopts the hybrid cloud-fog system model shown in Figure 1. The whole system comprises four entities: the data owner, the data user, the fog node, and the semi-trusted public cloud server. The main functions of these entities are presented in the following.

The data owner is responsible for collecting document files. For a collected document set, he first extracts keywords from the plaintext documents to form the keyword dictionary of the document set. He then sets the weights of the keywords at different positions of a document and calculates the correlation scores between the keywords and the documents based on the modified TF-IDF model. After that, each document is mapped to a document index vector based on the keyword dictionary; the entry corresponding to a keyword is set to the product of its position weight score and its correlation score. The ciphertext set and the searchable encrypted index are obtained by encrypting the documents and the index vectors, respectively. Finally, the data owner sends the keyword dictionary to the fog node; meanwhile, the ciphertexts of the documents and the search table composed of the encrypted indexes are uploaded to the cloud server.
An authorized data user can search for his interested documents with the help of the document retrieval system. Specifically, he constructs an enquiry request based on a set of interested keywords and sends it to the fog node. After a large amount of calculation on the fog node and the cloud node, the ciphertexts of the ranked top-k related encrypted documents are returned to the data user. At last, the plaintext target documents can be decrypted by the data user with the symmetric secret keys distributed by the data owner.
The fog node is an “honest and reliable” entity with a high security level and a small storage space [36–38]. This is reasonable considering that fog nodes are usually controlled by the data owner. The fog node receives and stores the keyword dictionary uploaded by the data owner. Given an enquiry request sent by the user, the fog node uses Word2Vec to extend its semantics by adding some related keywords to the query. Once the fog node obtains the semantically extended enquiry request, it checks which keywords of the dictionary appear in the extended request and thus obtains the search vector. To satisfy forward and backward security, the fog node expands the search vector by appending an extra vector. At last, the fog node encrypts the expanded search vector with the secure kNN algorithm to get the document retrieval trapdoor and sends it to the cloud server.
The cloud server is an “honest and curious” semi-trusted entity with a low security level and a large storage space. It stores the ciphertext document set and the search table made up of the encrypted indexes. The cloud server uses the received document retrieval trapdoor and the encrypted index to get a set of candidates. To accurately locate the search results, the cloud server calculates the inner products of the trapdoor and the candidate vectors. At last, the cloud server sorts the candidate ciphertext documents according to the inner products and returns the ciphertexts of the most relevant documents to the data user.
3.2. Threat Model
The public cloud server is generally regarded as an “honest and curious” semi-trusted entity. In general, it honestly executes the instructions of the data owner, fog node, and data users, and it does not disclose the private information of documents to the public. However, we assume that it is “curious” enough to analyze and mine users’ retrieval requests and obtain private information from them. Note that the data users are not willing to disclose their search requests, because this may leak information about their behaviors. The cloud server also tries to explore the contextual information stored in the ciphertexts of documents, encrypted index vectors, and trapdoors; this contextual information may leak the contents of the documents and query requests. Based on the information available to the public cloud server in the system, this article mainly considers the following three threat models.

(1) Known ciphertext threat model. The public cloud server can obtain the ciphertext documents submitted by the data owner, but it cannot obtain any effective plaintext information about the documents. In this case, the public cloud server can only mount ciphertext-only attacks.

(2) Known background threat model. The public cloud server can statistically analyze the hidden information in the users’ retrieval records according to their retrieval requests and can mine other useful information, such as document retrieval preferences, associated records, and retrieval results. Therefore, the public cloud server can carry out statistical analysis attacks.

(3) Forward and backward threat model. In this model, the curious cloud server may store all the deleted document indexes and outdated trapdoors and use them to analyze private information about newly generated documents and trapdoors. Therefore, the proposed scheme should be forward and backward secure. Specifically, the cloud server cannot retrieve the deleted documents with newly generated trapdoors, which we call backward security; meanwhile, outdated trapdoors cannot be employed to search the newly inserted documents, which is forward security.
3.3. Design Goals
(i) Search efficiency. In order to cope with the explosive growth of the number of documents in the big data era, we have to design a novel encrypted index structure to improve the retrieval efficiency of documents.
(ii) Query accuracy. Query accuracy is a basic requirement of the data users, who always want to access their interested content as accurately as possible. The proposed scheme should support query expansion before submission, so as to overcome the limitation of improper keywords provided by the data users.
(iii) Privacy preservation. Our goal is to prevent the cloud server from obtaining additional sensitive information about the document sets, indexes, and search requests. Specifically, symmetric cryptography is used to protect document privacy, and the secure kNN algorithm is used to encrypt the index and search requests, which prevents the adversary from stealing useful information.
(iv) Forward and backward security. The document collection is dynamic: documents can be deleted from or inserted into the collection. The proposed scheme should be forward and backward secure; specifically, employing an outdated trapdoor to search newly inserted documents, or searching deleted documents with new trapdoors, must be infeasible.
4. Proposed Scheme
In this section, we define the framework of the secure and efficient document search (SES) scheme. First, we introduce the notations used in this article and the way document vectors are constructed. Then, we introduce the Word2Vec tool used to extend the query. At last, we present the details of the whole scheme.
4.1. Notations
(i) The plaintext document collection: a set of n document files.
(ii) The encrypted document collection. Each document is encrypted with a symmetric encryption algorithm under its own secret key; the keys for different documents are different and independent of each other.
(iii) The keyword dictionary: the set of m keywords selected from the document collection by the data owner.
(iv) The document index set, in which one index vector is constructed for each document.
(v) The extended document index vector, generated by the dimension extension of the index vector; this extension matches the extension process of the search vector.
(vi) The searchable encrypted index generated from the extended index vector.
(vii) The enquiry request: a subset of dictionary keywords provided by a data user.
(viii) The extended keyword set, formed by the semantic extension of the enquiry request. Note that the extra keywords are also selected from the dictionary, so the extended set remains a subset of the dictionary.
(ix) The search vector constructed from the extended keyword set; each entry is either 0 or 1.
(x) The extended search vector, generated by the dimension extension of the search vector; this extension is employed to guarantee the forward and backward security of the system.
(xi) The trapdoor: the encrypted form of the extended search vector. The trapdoor can be used by the cloud server to search the encrypted documents; however, the plaintext search vector cannot be recovered without the secret key.
(xii) A pool of numbers, initialized according to the maximum number of documents in the outsourced document set. The numbers act as identifiers of documents and are also used in the agreement negotiation between the fog node and the cloud node.
(xiii) An indicator set, initialized to the empty set and used in the construction of the extended index vectors; we discuss it in Section 4.4.
(xiv) The ciphertexts of the top-k related documents, which are returned to the data user by the cloud server.

4.2. Construction of Document Index Vectors Based on the Enhanced TF-IDF Model
In this article, we construct the document vectors based on both the keyword weight and the position weight, which are defined as follows.
(1) Keyword Weight. In the field of information retrieval, the TF-IDF model [39] is widely used to calculate the correlation between keywords and documents. The TF-IDF model consists of two parts: keyword frequency and inverse document frequency. To express the correlation score $S_{i,j}$ between a given keyword $w_i$ and a document $f_j$, we use the following formula:

$$S_{i,j} = \frac{n_{i,j}}{N_j} \times \log \frac{n}{n_{w_i}},$$

where $n_{i,j}$ represents the number of occurrences of keyword $w_i$ in document $f_j$, $N_j$ is the total number of keywords in document $f_j$, $n$ is the total number of documents in the document set, and $n_{w_i}$ indicates the number of documents in the document set that contain the keyword $w_i$.
(2) Position Weight. Existing schemes treat the keywords in different positions of a document equally, so each keyword in a document has the same effect when retrieving interesting documents based on relevance scores. In fact, keywords play different roles when they appear in the title, abstract, and main body of the same document. Therefore, to reflect the effect of keyword positions on the importance degrees, this article sets different position weights for the keywords in three different regions of a document: the title, the abstract, and the main body.
The position weight of the title is set to $\alpha_1$, the position weight of the abstract is set to $\alpha_2$, and the position weight of the main body is set to $\alpha_3$. Moreover, the weights $\alpha_1$, $\alpha_2$, and $\alpha_3$ meet the following conditions: $\alpha_1 > \alpha_2 > \alpha_3 > 0$ and $\alpha_1 + \alpha_2 + \alpha_3 = 1$.
The same keyword may appear in all three regions of the document, so it is necessary to comprehensively measure its weight score over the three positions. In this article, indicator variables $b_1, b_2, b_3 \in \{0, 1\}$ denote whether a keyword appears in the title, abstract, and main body, respectively: if it appears in a region, the corresponding indicator is 1; otherwise, it is 0. The total position weight score $P_{i,j}$ of keyword $w_i$ in document $f_j$ can be calculated by the following formula:

$$P_{i,j} = \alpha_1 b_1 + \alpha_2 b_2 + \alpha_3 b_3.$$
As an example, if a keyword appears in the title and the main body of document $f_j$, the position weight score of the keyword in $f_j$ is $\alpha_1 + \alpha_3$.
Based on the keyword weight and the position weight, the data owner generates a document index vector $V_j$ for each document $f_j$. The entry of $V_j$ corresponding to keyword $w_i$ is calculated as follows:

$$V_j[i] = P_{i,j} \times S_{i,j}.$$
It can be observed from the equation that if the corresponding keyword appears in the document, $V_j[i]$ is a positive number; otherwise, $V_j[i] = 0$.
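To make the construction concrete, the following minimal Python sketch combines the two weights. The function names and data layout are our own, and the concrete values of the position weights are assumptions; the text above only fixes their ordering and normalization.

```python
import math

# Illustrative position weights for title, abstract, and main body
# (assumed values satisfying a1 > a2 > a3 and a1 + a2 + a3 = 1).
ALPHA = {"title": 0.5, "abstract": 0.3, "body": 0.2}

def tfidf(count, doc_len, n_docs, docs_with_kw):
    """Correlation score S_ij: term frequency times inverse document frequency."""
    return (count / doc_len) * math.log(n_docs / docs_with_kw)

def position_score(regions):
    """P_ij: sum of the weights of the regions where the keyword appears."""
    return sum(ALPHA[r] for r in regions)

def document_vector(doc, dictionary, n_docs, df):
    """Plaintext index vector: entry = P_ij * S_ij, and 0 for absent keywords."""
    vec = []
    for kw in dictionary:
        count = doc["counts"].get(kw, 0)
        if count == 0:
            vec.append(0.0)
            continue
        s = tfidf(count, doc["length"], n_docs, df[kw])
        p = position_score(doc["regions"][kw])  # e.g. {"title", "body"}
        vec.append(p * s)
    return vec

# Toy usage: a 10-keyword document where "cloud" appears 3 times,
# in the title and the body, out of a 100-document corpus.
doc = {"length": 10, "counts": {"cloud": 3}, "regions": {"cloud": {"title", "body"}}}
print(document_vector(doc, ["cloud", "fog"], n_docs=100, df={"cloud": 5}))
```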
4.3. Query Extension Based on Word2Vec
In real life, data users may want to query documents based on synonyms, or they may provide improper keywords if they are not familiar enough with a specific field [9, 28]. Most existing schemes do not take this into consideration and do not support synonym-based queries. Moreover, if data users cannot provide a set of accurate keywords, they cannot obtain proper search results. In this article, we solve this problem by query extension based on the Word2Vec [1, 5] tool.
Word2Vec is a neural-network-based tool for computing word vectors, designed by Google in 2013. It can be efficiently trained on corpora of millions of words to obtain word vectors, which can then be used to calculate similarities between words: the smaller the distance between two word vectors, the higher the similarity between the corresponding words. In this article, the data owner uses Word2Vec as the training tool with the Skip-gram model [40, 41] to obtain a word vector for each word.
In the query process, the data user first provides a set of keywords and sends them to the fog node. Then, the fog node selects several similar keywords based on the vectors generated by the Word2Vec model. In this article, the number of extended keywords is the same as the number of keywords provided by the data user, as shown in Figure 2. This is reasonable considering that the search accuracy is highest in this case; the corresponding simulation results are presented in Section 7.
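As one possible realization, the extension can be implemented with the gensim library's Word2Vec in skip-gram mode (sg=1); the library choice, the hyperparameters, and the extend_query helper below are our assumptions, not the paper's code.

```python
from gensim.models import Word2Vec

# Toy tokenized corpus; in the scheme, the data owner trains on the
# plaintext document set before outsourcing.
corpus = [
    ["secure", "document", "search", "cloud"],
    ["encrypted", "cloud", "storage", "retrieval"],
    ["keyword", "search", "retrieval", "index"],
]
model = Word2Vec(corpus, vector_size=50, sg=1, min_count=1, window=3)

def extend_query(keywords, model, dictionary):
    """For each user keyword, add its most similar in-dictionary neighbour,
    so the extended set has twice as many keywords as the original."""
    extended = list(keywords)
    for kw in keywords:
        for similar, _score in model.wv.most_similar(kw, topn=10):
            if similar in dictionary and similar not in extended:
                extended.append(similar)
                break
    return extended

print(extend_query(["search"], model, {"retrieval", "index", "cloud"}))
```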

4.4. The Framework of Privacy-Preserving Document Retrieval
In this section, we present the overall document retrieval framework, which mainly employs the functionalities of the fog node and the cloud server and contains six parts: Setup, GenIndex, Enc, GenTrapdoor, Search, and Dec.

(1) Setup. In the initialization phase, the data owner generates and publishes a multiplicative cyclic group of large prime order. Then, he generates the secret key set $SK = \{S, M_1, M_2\}$, including (1) one randomly generated bit vector $S$ and (2) two invertible matrices $M_1$ and $M_2$. Moreover, the data owner generates independent symmetric secret keys to encrypt the document files.

(2) GenIndex. The data owner first calculates the correlation score between each keyword and each document according to the TF-IDF method. Then, he uses the position weight scoring method to calculate the weight score over the three positions of the keyword in the document. At last, he generates a document index vector for every document as discussed in Section 4.2, where an entry is positive if the corresponding keyword appears in the document and 0 otherwise.
Then, the data owner extends each document index vector with an extension vector generated by Algorithm 1. Each extension vector contains exactly two nonzero entries, whose positions and values are drawn from the identifier pool, while all the other entries are set to 0. The nonzero positions of different extension vectors are always different from each other, because once a position is selected, it is deleted from the pool.
Algorithm 1: Generation of the extension vectors.
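The pseudocode of Algorithm 1 did not survive extraction here; the following hedged sketch reproduces only its observable behavior as described above: two positions are drawn without replacement from the shared pool, so different documents never share extension positions. All names are our own, and the rule for the values placed at the chosen positions is assumed.

```python
import random

def gen_extension_vector(pool_A, indicator_B, ext_dim, mark):
    """Best-effort sketch of Algorithm 1: draw two positions without
    replacement from the pool A, record them in the indicator set B, and
    set the entries at those positions to `mark` (all others stay 0).
    The concrete value of `mark` is an assumption."""
    p1 = pool_A.pop(random.randrange(len(pool_A)))
    p2 = pool_A.pop(random.randrange(len(pool_A)))
    indicator_B.update({p1, p2})
    ext = [0.0] * ext_dim
    ext[p1] = mark
    ext[p2] = mark
    return ext

pool = list(range(8))  # identifier pool A for at most four documents
B = set()              # indicator set, initially empty
e1 = gen_extension_vector(pool, B, 8, 1.0)
e2 = gen_extension_vector(pool, B, 8, 1.0)  # always uses fresh positions
```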
To protect the privacy of the document vectors, we need to encrypt them. Specifically, the data owner first splits each extended document vector $\tilde{V}_j$ into two random vectors $\{V_a, V_b\}$ according to the secret bit vector $S$. For each dimension $i$, if $S[i] = 0$, then $V_a[i]$ and $V_b[i]$ are set to two random values whose sum equals $\tilde{V}_j[i]$; on the contrary, if $S[i] = 1$, then we set $V_a[i] = V_b[i] = \tilde{V}_j[i]$. At last, the searchable encrypted index is calculated as $I_j = \{M_1^T V_a, M_2^T V_b\}$ by matrix multiplication with the two key matrices.
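This split-and-encrypt step is the secure kNN algorithm [14]. Below is a minimal NumPy sketch under the conventions just stated (index vectors are split on the 0-bits of S); the function names are ours.

```python
import numpy as np

rng = np.random.default_rng()

def keygen(dim):
    """Secret key SK = {S, M1, M2}: a random bit vector and two random
    invertible matrices (random real matrices are invertible with
    probability 1; production code should still verify this)."""
    S = rng.integers(0, 2, size=dim)
    M1 = rng.standard_normal((dim, dim))
    M2 = rng.standard_normal((dim, dim))
    return S, M1, M2

def encrypt_index(v, S, M1, M2):
    """Split the extended document vector with S (random shares on the
    0-bits, identical copies on the 1-bits), then multiply by the
    transposed key matrices: I = {M1^T Va, M2^T Vb}."""
    va, vb = v.astype(float).copy(), v.astype(float).copy()
    for i in range(len(v)):
        if S[i] == 0:
            r = rng.standard_normal()
            va[i], vb[i] = v[i] - r, r
    return M1.T @ va, M2.T @ vb
```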
In this article, we employ a search table structure to organize the encrypted document index vectors, which improves the search efficiency. The data owner first randomly selects a set of numbers as the identifiers of the keywords. After this, the data owner puts the documents containing a given keyword together in one row of the search table; the corresponding encrypted index set is generated in a similar way. Finally, the encrypted search table is created and outsourced to the cloud server, as exemplified in Table 1.

(3) Enc. After constructing the document index vectors, the documents can be encrypted and sent to the cloud server. The data owner uses a secure symmetric encryption algorithm (e.g., AES) to encrypt the plaintext files in the document collection one by one with different keys and, at last, outsources the ciphertexts to the cloud server.

(4) GenTrapdoor. In the preparation phase of the system, the data owner also sends the keyword identifiers, each bound to a random number, to the fog node. In the query phase, when the fog node receives a query request from a data user, it first uses Word2Vec to extend the semantics of the query keywords. According to the extended request and the keyword dictionary, the fog node constructs the search vector as shown in Figure 2: if a dictionary keyword appears in the extended request, the corresponding entry is set to 1; otherwise, it is set to 0. Based on the TF-IDF model, the relevance score between the query and a document can then be calculated as the inner product of the document index vector and the search vector. To hide the exact keywords being queried, a negotiation process between the fog node and the cloud server is designed: for each queried keyword, the fog node blinds the keyword identifier with its bound random number and the key material negotiated with the cloud server, and sends the blinded identifiers to the cloud server. To defend against the forward and backward threat, the fog node also extends the search vector with an extra extension vector. Then, the extended search vector $\tilde{Q}$ is split into two random vectors $\{Q_a, Q_b\}$ based on the secret bit vector $S$. Different from the split process of the index vectors, if $S[i] = 0$, we set $Q_a[i] = Q_b[i] = \tilde{Q}[i]$; on the contrary, if $S[i] = 1$, then $Q_a[i]$ and $Q_b[i]$ are set to two random values whose sum equals $\tilde{Q}[i]$. After that, the trapdoor is generated as $T = \{M_1^{-1} Q_a, M_2^{-1} Q_b\}$ and sent to the cloud server.

(5) Search. In the document search process, the cloud server and the fog node first negotiate a secret key based on the Diffie–Hellman key agreement algorithm. Assume a multiplicative cyclic group with a generator $g$ of prime order $p$. The cloud server chooses $a$ at random, computes $g^a$, and sends it to the fog node; the fog node chooses $b$ at random, computes $g^b$, and sends it to the cloud server. The cloud server computes $(g^b)^a = g^{ab}$, and the fog node computes $(g^a)^b = g^{ab}$, so the two parties share the same key. The security of the Diffie–Hellman key agreement is based on the difficulty of computing discrete logarithms.
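The negotiation step is textbook Diffie–Hellman. The toy Python sketch below illustrates it; the group parameters are for illustration only (a real deployment would use a standardized group), and the variable names are ours.

```python
import secrets

# Toy Diffie-Hellman exchange between the cloud server and the fog node.
p = 18446744073709551557          # the prime 2**64 - 59 (illustrative only)
g = 2

a = secrets.randbelow(p - 2) + 1  # cloud server's secret exponent
b = secrets.randbelow(p - 2) + 1  # fog node's secret exponent

A = pow(g, a, p)                  # cloud -> fog: g^a
B = pow(g, b, p)                  # fog -> cloud: g^b

k_cloud = pow(B, a, p)            # cloud computes (g^b)^a
k_fog = pow(A, b, p)              # fog computes (g^a)^b
assert k_cloud == k_fog           # both hold the shared key g^(ab) mod p
```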
Upon receiving the trapdoor and the blinded keyword identifiers from the fog node, the cloud server finds the rows of the search table whose identifiers satisfy the matching equation under the negotiated key. All the encrypted indexes in a matched row become candidates for the top-k relevant documents. Considering that the index sets returned for different identifiers may intersect with each other, the redundant indexes are deleted and the returned index sets are merged based on Algorithm 2.
Algorithm 2: Merging of the candidate index sets.
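The pseudocode of Algorithm 2 also did not survive extraction; the sketch below captures only the de-duplicating merge described above, with our own names.

```python
def merge_candidates(matched_rows):
    """Rows matched for different keyword identifiers may contain the same
    document, so the candidate index sets are merged with duplicates
    removed before scoring."""
    seen, merged = set(), []
    for row in matched_rows:            # each row: list of (doc_id, enc_index)
        for doc_id, enc_index in row:
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append((doc_id, enc_index))
    return merged
```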
Subsequently, the cloud server obtains the candidate encrypted indexes. Then, it computes the relevance score between the trapdoor $T$ and a candidate index $I_j$ as follows:

$$\mathrm{Score}(I_j, T) = (M_1^T V_a) \cdot (M_1^{-1} Q_a) + (M_2^T V_b) \cdot (M_2^{-1} Q_b) = V_a \cdot Q_a + V_b \cdot Q_b = \tilde{V}_j \cdot \tilde{Q}.$$
It can be observed that although the cloud server knows neither the plaintext keywords nor the plaintext document index vectors and trapdoors, it can always accurately calculate the relevance scores between a query and all the encrypted documents. After sorting all the relevance scores, the ciphertexts of the top-k documents are returned to the data user by the cloud server.

(6) Dec. Once the set of encrypted documents is received, the data user utilizes the secret keys provided by the data owner to decrypt the ciphertext documents and finally obtains the interested plaintext documents, which completes the retrieval process.
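Continuing the secure kNN sketch from above, the fragment below encrypts a query with the complementary split and verifies that the cloud-side score equals the plaintext inner product, as the score equation claims (it reuses keygen, encrypt_index, and rng from the earlier sketch).

```python
def encrypt_query(q, S, M1, M2):
    """Complementary split for the query (random shares on the 1-bits),
    encrypted with the inverse matrices: T = {M1^-1 Qa, M2^-1 Qb}."""
    qa, qb = q.astype(float).copy(), q.astype(float).copy()
    for i in range(len(q)):
        if S[i] == 1:
            r = rng.standard_normal()
            qa[i], qb[i] = q[i] - r, r
    return np.linalg.inv(M1) @ qa, np.linalg.inv(M2) @ qb

# The cloud scores ciphertexts only, yet recovers the plaintext inner
# product: (M1^T Va).(M1^-1 Qa) + (M2^T Vb).(M2^-1 Qb) = V.Q.
V = np.array([0.0, 0.4, 0.0, 0.7])   # toy extended document vector
Q = np.array([0.0, 1.0, 1.0, 0.0])   # toy extended query vector
S, M1, M2 = keygen(4)
Ia, Ib = encrypt_index(V, S, M1, M2)
Ta, Tb = encrypt_query(Q, S, M1, M2)
assert abs((Ia @ Ta + Ib @ Tb) - V @ Q) < 1e-6
```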
5. Dynamic Update of Document Set
After the insertion or deletion of a document, we need to synchronously update the encrypted index table stored on the cloud server. The dynamic update process of our scheme comprises three phases: generation of the update information, the update operation on the cloud server, and the update operation on the fog node. We design three algorithms corresponding to the three phases, discussed in the following.

(1) GenUpdateInfo. This algorithm generates the update information to be sent to the cloud server and the fog node, respectively. In order to reduce the communication overhead, we assume that the data owner stores a copy of the unencrypted index table for all the documents. The update type denotes either an insertion or a deletion of a document, and the algorithm outputs the updated table. The specific process is as follows, where the relevant identifier set denotes the keyword identifiers included in the document.

(i) For a deletion, the data owner deletes the document from the document collection and finds all the keyword identifiers included in it. Then, the data owner deletes the encrypted indexes corresponding to these keyword identifiers from the encrypted index table, generates a new encrypted index table, and uploads it to the cloud server. Meanwhile, the data owner adds the document's extension information to the indicator set, which is sent to the fog node.

(ii) For an insertion, the data owner first adds the new document to the document set and finds all the keyword identifiers included in it. Then, the data owner adds the corresponding encrypted indexes to the encrypted index table and generates a new encrypted index table. The data owner encrypts the document with a fresh secret key and uploads the ciphertext together with the new table to the cloud server. Subsequently, the data owner generates an extension vector for the document and deletes the elements that determine its positions from the identifier pool. Finally, the data owner sends the updated indicator information to the fog node.

(2) Update (cloud server). In this algorithm, the cloud server updates the search table and the encrypted document set based on the information received from the data owner (see the sketch below). Specifically, the cloud server always first replaces the outdated search table with the new one; this step is irrelevant to the update type. When updating the encrypted document set, the cloud server checks the update type: for an insertion, it inserts the encrypted document into the collection, and for a deletion, it deletes the encrypted document from the collection.

(3) Update (fog node). The update process on the fog node is quite straightforward: the fog node replaces the old indicator information with the updated one, which is thereafter employed to generate new trapdoors. In this way, our scheme can defend against the forward and backward attack. The above deletion and insertion processes show that the dynamic update of our scheme is easy, and hence the scheme has good scalability. We analyze the security of the scheme in the next section.
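A minimal sketch of the cloud-side Update step follows; the data structures (a dict keyed by keyword identifier for the search table and a dict for the ciphertexts) and all names are our assumptions.

```python
def update_cloud(op, doc_id, enc_doc, new_table, table, enc_docs):
    """Cloud-side Update: the refreshed search table always replaces the
    outdated one regardless of the operation type; the ciphertext set is
    then changed according to the operation."""
    table.clear()
    table.update(new_table)              # replace the whole search table
    if op == "insert":
        enc_docs[doc_id] = enc_doc       # add the new ciphertext
    elif op == "delete":
        enc_docs.pop(doc_id, None)       # remove the deleted ciphertext

# Example: delete document 7, then insert document 9.
table, enc_docs = {}, {7: b"old ciphertext"}
update_cloud("delete", 7, None, {"w1": []}, table, enc_docs)
update_cloud("insert", 9, b"new ciphertext", {"w1": [9]}, table, enc_docs)
```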
6. Security Analysis
As discussed in Section 3.3, we mainly consider the security of the documents and the privacy of the document vectors and trapdoors. Moreover, we also consider the forward and backward security of our scheme. These aspects are presented in the following.
6.1. Document Privacy
In this article, the mature symmetric encryption algorithm AES is used to encrypt the document set, and the generated ciphertext document set is uploaded to the cloud server. Note that the secret keys for different documents are different and independent of each other. Even if the cloud server successfully obtains a subset of the secret keys, it cannot infer the other keys from the ciphertexts. Therefore, the proposed scheme effectively ensures the security of the document contents and defends against the known ciphertext threat model.
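As one concrete possibility, per-document encryption with independent keys can be realized with AES-GCM from the Python cryptography package; the paper only prescribes a secure symmetric cipher such as AES, so the mode and key handling here are our assumptions.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_documents(docs):
    """Encrypt each plaintext file under its own independent AES key."""
    keys, ciphertexts = {}, {}
    for doc_id, plaintext in docs.items():
        key = AESGCM.generate_key(bit_length=256)   # fresh key per file
        nonce = os.urandom(12)
        ct = AESGCM(key).encrypt(nonce, plaintext, None)
        keys[doc_id] = key                 # distributed to data users
        ciphertexts[doc_id] = nonce + ct   # outsourced to the cloud
    return keys, ciphertexts

keys, cts = encrypt_documents({1: b"doc one", 2: b"doc two"})
```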
6.2. Privacy of Document Vectors and Trapdoors
In terms of document vectors and trapdoors, the data owner randomly generates a bit string $S$ and two reversible matrices $M_1$ and $M_2$. After the extension, the extended index vectors and the extended search vectors are first split based on $S$. Then, the encrypted indexes and the search request trapdoors are generated based on $M_1$ and $M_2$. At last, the encrypted vectors rather than the plaintext vectors are uploaded to the cloud server. Because the key matrix space is very large, the probability that the untrusted cloud server correctly forges the key matrices and cracks the encrypted index and the trapdoor is negligible. A seemingly more efficient attack is to construct equation groups based on background information and solve for the two matrices; however, as proved in Ref. [14], the adversary cannot construct enough equations to determine all the unknown variables in the matrices. Therefore, the security of the information contained in the document indexes and search vectors is effectively guaranteed. In conclusion, since the untrusted cloud server can only obtain the ciphertext document set, the encrypted indexes, and the search request trapdoors, it cannot obtain any useful plaintext information. Therefore, our scheme is secure under the known ciphertext model.
To further prevent the untrusted cloud server from mining and disclosing private document information based on known background knowledge, that is, based on the internal relationship between document indexes and retrieval trapdoors, we also mislead the adversary by dynamically changing the appearance of the trapdoors. The extended index vectors and the extended search vectors are randomly split using the split indicator $S$ in the key, and random numbers are introduced into the splits to ensure that there is no fixed correlation between multiple document indexes and search vectors. In other words, even if a user submits the same query multiple times, the trapdoors received by the untrusted cloud server are different, which effectively resists statistical analysis attacks. Therefore, our proposed scheme is also secure under the known background knowledge threat model.
6.3. Forward and Backward Security
As discussed in Section 1, forward and backward security is of great importance, although most existing encrypted document retrieval schemes do not consider this threat. Our scheme can properly defend against this attack, and we demonstrate this in the following.
In real life, the cloud server is not trusted: it may save trapdoors that have already been used or documents that have been deleted. If so, the untrusted cloud server could use outdated query vectors to retrieve newly inserted documents, or use new queries to retrieve deleted documents. In our scheme, when the enquiry request is valid, the inner product of the extension parts of the document vector and the query vector cancels out, and the relevance score is exact; however, when the enquiry request is illegal, the extension parts do not cancel, and the correct relevance score is confused by a random factor.
Without loss of generality, we consider a simple document collection with two files and assume that the maximal number of outsourced documents is four. In the initial phase, the identifier pool and the indicator set are initialized as described in Section 4.1.
The data owner extends the document vectors of the two files based on Algorithm 1, which draws the extension positions and the associated numbers from the identifier pool; the corresponding indicator information is delivered to the fog node. At last, the two extended document vectors are split and encrypted as described in Section 4.4, and the resulting encrypted index vectors are sent to the cloud server.
In a query process, the fog node also extends the search vector based on the current indicator information, so that its extension entries cancel those of the live document vectors. The extended search vector is then split and encrypted in the same way, and the resulting trapdoor is sent to the cloud server.
Now, we consider the dynamics of the document collection and analyze the forward and backward security of our scheme. When the data owner inserts a new document into the collection, its vector is extended with fresh positions and numbers produced by Algorithm 1, and the corresponding indicator update is sent to the fog node. The extended document vector is then split and encrypted by the data owner, and the new encrypted index is sent to the cloud server.
The dishonest cloud server may conserve some old trapdoors, but they cannot be used to search the newly inserted document: when the relevance score between an old trapdoor and the newly inserted document is calculated, the extension entries of the new document meet no matching entries in the old trapdoor, so the score always contains an error term. Considering that this term depends on a randomly selected number, the cloud server cannot infer the accurate score, and our scheme is secure.
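The exact extension arithmetic is specified by Algorithm 1; the toy fragment below only illustrates the effect just described, with entirely hypothetical extension values: a fresh query extension cancels the document's extension part to zero, while a stale trapdoor leaves a random error in the score.

```python
import random

# Toy illustration only (not the paper's exact construction).
r = random.uniform(1.0, 100.0)         # secret randomness of a new document
doc_ext = [r, -r]                      # extension entries of the new document
fresh_ext = [1.0, 1.0]                 # query extension built from current state
stale_ext = [random.random(), random.random()]  # outdated trapdoor extension

exact = sum(d * q for d, q in zip(doc_ext, fresh_ext))   # always 0.0
noisy = sum(d * q for d, q in zip(doc_ext, stale_ext))   # unpredictable offset
print(exact, noisy)
```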
We further assume that the data owner deletes a document; the identifier pool and the indicator information then need to be updated accordingly, and the data owner sends the updated indicator information to the fog node.
For a new search vector, the fog node extends it based on the updated indicator information. After splitting and encryption, the fog node sends the trapdoor to the cloud server. The dishonest cloud server may not actually delete the document and may want to calculate the relevance between the new trapdoor and the deleted document. However, the relevance score cannot be accurately calculated, because the extension entries of the deleted document no longer cancel with those of the new trapdoor, and there is always a nonnegligible error.
It can be observed from the above process that our scheme can defend against forward and backward attacks and properly protect the private information of the documents. Therefore, the scheme meets the design goals we put forward.
7. Performance Evaluation
In this section, we first analyze the time complexity of the designed SES scheme, which shows its efficiency. Then, we implement the proposed scheme in Python on a computer with a 2.9 GHz Intel Core processor, the Windows 10 operating system, and 16 GB of memory. We compare the performance of the proposed scheme with that of the MRSE, PRSE, and MRSE-HCI schemes. In the experiments, we randomly select 5000 files from the Request for Comments (RFC) dataset [42] and select in total 2839 keywords from the document files. In the following, we evaluate the performance of our scheme in terms of trapdoor construction efficiency, document retrieval accuracy, and document retrieval efficiency. Each simulation is repeated 10 times, and the averaged results are presented and analyzed.
7.1. Time Complexity Analysis
First, as shown in Table 2, we analyze the time complexity of the SES scheme proposed in this article and of the MRSE_II_TF scheme [7] proposed by Cao et al. Both schemes adopt the secure kNN architecture and the TF-IDF idea. In Table 2, one symbol denotes the average document length, and the remaining symbols denote the time consumption of AES encryption and decryption, of the Word2Vec expansion algorithm, and of an exponentiation in the multiplicative group, respectively. The time complexity analysis of the algorithms is as follows.
In the algorithms Setup, GenIndex, Enc, and Dec, the proposed SES scheme has a time complexity similar to that of the MRSE scheme. It is worth noting that the dimension-extension parameter in the MRSE scheme can be set at will; when it matches the extension dimension of our scheme, the time complexities of the two schemes are exactly the same.
In the GenTrapdoor algorithm, the trapdoor generation costs of the two schemes are roughly similar (and exactly the same for matching parameter settings). Therefore, the time complexity of the SES scheme in the GenTrapdoor step can be considered no less than that of MRSE. However, what matters to the data user is the interval between sending a retrieval request and receiving the retrieval results, that is, the total time of the two steps GenTrapdoor and Search, which are performed by the fog node and the cloud, respectively. In the Search algorithm, note that each query contains few keywords and the number of relevant documents is much smaller than n, so the Search step of SES consumes less time than that of MRSE. Therefore, when we consider GenTrapdoor and Search together, the total time complexity of SES is lower, which means that retrieval with the SES scheme is more efficient than with MRSE.
To sum up, the time complexity analysis of the SES and MRSE schemes shows that our scheme has higher efficiency and a shorter retrieval waiting time, thanks to our use of the Diffie–Hellman key agreement algorithm to optimize the retrieval algorithm. Next, we verify this analysis with simulation experiments.
7.2. Trapdoor Construction Efficiency
In our scheme, the keywords in query requests are first extended based on the Word2Vec model. According to the analysis in Section 7.1, the time consumption of constructing the encrypted query vectors in our scheme is slightly enlarged by the extension process. As shown in Figure 3, the total time consumption comprises two parts, that is, the time of the extension process and that of the encryption process.

It can be observed from Figure 3 that the time consumption increases monotonically with the number of extended keywords. This is reasonable considering that extra time is consumed to extend the keyword set; meanwhile, the time consumption of encryption stays stable, which can be explained by the fact that the length of the vectors is approximately constant. Compared with the total time consumption, the extra extension time is very small and can be ignored. In the extreme case, that is, when the keywords are extended five times, the extension takes only 3.989 ms, about 8% of the total time consumption.
7.3. Document Retrieval Accuracy
In this article, we measure document retrieval accuracy as the proportion of related documents among all returned results. In each query, the top-20 related documents are returned, and the search accuracy under different extension parameters is presented in Figure 4. As the extension parameter increases, the search accuracy first increases and then decreases. When the extension parameter is set to 2, the search accuracy is the highest, about 91.5%, as shown in Figure 4. This is reasonable: when the number of extended keywords is small, the extended keywords provide extra information for document search; however, if the number of extended keywords is too large, random noise is added to the search process and may mislead the search results, so the accuracy decreases. In the following experiments, we extend the keywords by a factor of 2.

We further compare the search accuracy of our scheme with that of MRSE. It can be observed from Figure 5 that the search accuracy of our scheme is always higher than 90% as the number of documents increases, whereas the search accuracy of MRSE gradually decreases from nearly 90% to 80%. In conclusion, the keyword extension process greatly improves the search accuracy in encrypted cloud file systems.

7.4. Document Retrieval Efficiency
In this section, we compare our scheme with MRSE, MRSE-HCI, and PRSE in terms of document retrieval efficiency. As shown in Figure 6, the search time of all four schemes increases with the number of documents in the collection. The search time of MRSE increases in an approximately linear manner, which is reasonable considering that all the document vectors need to be scanned to find the search results. MRSE-HCI and PRSE perform better; however, the proposed scheme performs best. This can be explained by the fact that a novel index structure, the search table, is employed to improve the retrieval efficiency.

Besides the number of documents, the number of keywords also affects the average search time. We further compare the total time consumption of trapdoor construction plus document retrieval of our scheme with that of the MRSE, PRSE, and MRSE-HCI schemes. The number of total keywords is set to 2839, and the keywords are extended by a factor of 2, for which the search accuracy is the highest. As shown in Figure 7, the waiting time for the data user increases for all four schemes with the number of keywords in the query. However, our scheme and MRSE-HCI have the lowest retrieval delays among the four schemes. In particular, when the number of keywords in the query is less than 20, our scheme has the best document retrieval efficiency; this range matches actual retrieval habits, because users rarely enter more keywords than that.

8. Conclusion and Future Work
In this article, we propose a secure and efficient encrypted document retrieval scheme based on both fog and cloud computing. A fog node acts as a bridge between the cloud server and the data users; the workloads of the data owner and data users decrease, and meanwhile the security level of the system improves. We modify the TF-IDF model by taking the position information of keywords into consideration, so that it reflects the importance degrees of the keywords more accurately and the document vectors become more reasonable. To return proper search results even for imperfect query requests, the keywords in query requests are first extended based on Word2Vec before constructing query vectors. We employ the secure kNN algorithm to encrypt the document vectors and query vectors; the security of the vectors thus greatly increases while their searchability is maintained. Forward and backward security is also considered and properly protected in our scheme: an extra vector is appended after the document vector to prevent the cloud server from querying deleted documents with a new trapdoor or new documents with an outdated trapdoor. At last, an index structure is designed to improve the search efficiency. The security of our scheme is carefully analyzed from several aspects. Analysis and simulation results illustrate that our scheme can provide secure and efficient document retrieval services over encrypted cloud file systems.
In our future work, we would like to apply neural network architectures in the retrieval process, which may make the ranking order more accurate. How to further improve the retrieval efficiency is also a subject for future work.
Data Availability
The figures and tables used to support the findings of this study are included in the article.
Conflicts of Interest
The authors declare no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grant numbers: 62001055 and 62102017, and the Fundamental Research Funds for the Central Universities (YWF-22-L-1273).