Abstract

Data deduplication is an effective way to reduce storage occupation and bandwidth consumption in the cloud. Because data are outsourced, users’ privacy and accessibility are the primary security concerns for any deduplication mechanism. However, the functionality of redundancy removal and the indistinguishability of deduplication labels are naturally incompatible, which brings about many threats to data security. Besides, access control over shared copies may infringe on users’ attributes and incur cumbersome query overheads. To balance the usability and the confidentiality of deduplication labels and to securely realize an elaborate access structure, a novel data deduplication scheme is proposed in this paper. Briefly, we rely on learning with errors (LWE) to ensure that the deduplication labels are distinguishable only during the duplication check process. Instead of authority matching, the proof of ownership (PoW) is implemented under the paradigm of the inner product. Since the deduplication label is lightweight and the inner product is easy to compute, our scheme is more efficient in terms of computation and storage. Security analysis also indicates that the deduplication labels are distinguishable only for duplication check and that the probability of falsifying a valid ownership is negligible.

1. Introduction

As a flourishing service mode, cloud computing adopts load balancing, distributed computing, and other technologies to conveniently provide computation and storage functions for remote users, thus saving local resources and improving work efficiency. However, if users outsource their data to the cloud without restraint, the resulting mass of duplicated data becomes a serious problem. As reported in [1], almost half of cloud storage is wasted because of data redundancy. Consequently, the budget for managing duplicate data rises to as much as eight times that of source data maintenance [2, 3]. With the explosive growth of data, the tremendous storage requirements and exorbitant administrative expenses have put enormous pressure on cloud service providers. Therefore, how to store and manage data economically and efficiently has become a serious challenge for these enterprises.

To cut down the costs caused by redundant data, deduplication technology has been widely adopted by cloud service providers [4]. In such a technology, duplication check and proof of ownership are the two key problems. Until now, the problem of how to balance the conflict between comparability and confidentiality for secure duplication check has remained unsolved [5]. Meanwhile, how to efficiently validate access authority and how to achieve complex access structures also urgently need to be addressed, considering that the mechanism of query matching is cumbersome and that downloading certificates may be abused to launch various attacks.

As a research hotspot, the efficiency and security of data deduplication have attracted much attention. In the published literature, Li et al. [6] suggested carrying out deduplication by directly comparing the fingerprint of the outsourced file with those of the uploaded ones. However, this method is deficient because the communication and comparison of those fingerprints are inefficient and the contents of the data are exposed. To reduce the traffic of deduplication labels and conceal the data, Puzio et al. [7] used a hash function to map identical plaintexts to identical values, which serve as the labels for duplication check. Although this method achieves transmission efficiency and storage saving, it is vulnerable to dictionary attacks since the hash values are exposed.

In order to ensure the confidentiality of deduplication labels, Chen et al. [8] utilized message-locked encryption (MLE) to encrypt the hash values of data. However, the traditional MLE scheme is not semantically secure and is vulnerable to quantum attacks [9].

Fortunately, cryptographers have in recent years been devoted to designing secure, efficient, and effective cryptosystems that resist quantum attacks. In 2005, Regev [10] proposed a novel paradigm as an underpinning of cryptography, namely, learning with errors. He proved that solving it is at least as hard as worst-case lattice problems such as the shortest vector problem (SVP), and thus, it is believed to resist attacks based on quantum computing. Besides, it supports homomorphic and linear computation. Therefore, we exploit it in our scheme to ensure the functionality, efficiency, and security of deduplication labels.

As for the proof of ownership, the best solutions so far are all based on the Merkle hash tree (MHT) [11, 12]. In detail, the cloud and the user independently hold an MHT computed from the outsourced data, so the user can upload the same MHT to the cloud for comparison. The disadvantages of such a scheme are not only high storage and communication overheads but also low computation efficiency. Therefore, Chen et al. [13] improved it by having the cloud randomly select some leaf nodes of the MHT to challenge the user. The user must trace the path from the root to these leaves as a reply to prove that he possesses the same tree. Although this method does not require transmitting the whole MHT for comparison, it requires both the user and the cloud to construct and store a complete MHT for each file. Moreover, the challenge-response mode implies a long delay.

In order to improve the performance of PoW, the advantages of the inner product predicate have gradually attracted researchers’ attention [14-16]. Roughly speaking, the user is granted permission to access the corresponding file only if the inner product equals 0. The most significant merit of this method is that it uses computation instead of comparison to perform ownership proof efficiently. Therefore, we adopt it in our scheme to balance the conflict between the variety of access structures and the security of users’ privacy.

Aiming at checking replication over semantically secure deduplication labels and achieving fine-grained access control, this paper proposes a novel cloud data deduplication scheme by exploiting LWE (learning with errors) together with the inner product predicate. Our contributions are summarized as follows:
(i) Though designed for the purpose of deduplication, the deduplication labels are indistinguishable to any process except duplication check. This property is achieved by virtue of semantically secure and homomorphic LWE, which is also resistant to quantum attacks.
(ii) The proof of ownership is carried out by an inner product, which is computationally efficient. Besides, we bind the accessibility of users to their attributes, which enables an elaborate access structure and ownership transfer.
(iii) For each file, only one lightweight downloading certificate needs to be stored by the cloud, while the clients only compute and upload the corresponding proof on demand. That is to say, both storage and bandwidth are economical for cross-user access.

The rest of this paper is organized as follows. In Section 2, some formal definitions related to LWE and the inner product predicate are given. Sections 3 to 6 depict our deduplication scheme, including the detailed way for duplication check, file upload, ownership proof, and downloading right transfer. The correctness of our scheme is formally validated in Section 7, followed by security and performance analyses in Sections 8 and 9. Section 10 concludes the paper.

2. Preliminaries

For better understanding of our scheme, the concepts related to learning with errors and inner product predicate [2, 17] will be introduced in advance.

Definition 1. (Integer lattice). An integer lattice is the integer linear combination of vectors over , expressed as

Definition 2. (LWE hardness assumption). On parameters and a discrete Gaussian distribution , where for , we select a noise from and uniformly sample a vector together with a matrix . Based on the value of two versions of LWE hardness can be defined as follows:
(a) LWE-Search hardness: Given multiple pairs of on constant and , searching for the value of is difficult.
(b) LWE-Determination hardness: For uniformly sampled , the tuples of and are computationally indistinguishable. It means that it is difficult to tell whether the second term of those tuples is randomly chosen or computed from formula (3).
In fact, the LWE-search hardness is equivalent to the problem of finding a short enough vector in a lattice (GapSVP), and the LWE-determination hardness can be reduced to the problem of finding shortest linearly independent vectors (SIVP) of a lattice in the worst case. Therefore, the LWE assumption can be used to guarantee the one-way property for encryption with semantic security.
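As a minimal sketch of how an LWE sample is formed, the following Python fragment computes b = A·s + e (mod q) in the standard textbook form. The parameter values, the function names, and the use of a truncated rounded Gaussian in place of a true discrete Gaussian are all assumptions made for illustration; they are not taken from the paper.

```python
import random

def centered(x, q):
    """Map a residue mod q to its representative in (-q/2, q/2]."""
    return x - q if x > q // 2 else x

def lwe_sample(A, s, q, sigma, bound, rng):
    """Return b = A*s + e (mod q) with small noise e (textbook LWE form)."""
    b = []
    for row in A:
        # Assumption: a rounded Gaussian clipped to [-bound, bound]
        # stands in for the discrete Gaussian distribution.
        e = max(-bound, min(bound, round(rng.gauss(0.0, sigma))))
        b.append((sum(a * x for a, x in zip(row, s)) + e) % q)
    return b

def residual(A, s, b, q):
    """What remains of b after removing A*s: only the noise, if s is right."""
    return [centered((bi - sum(a * x for a, x in zip(row, s))) % q, q)
            for row, bi in zip(A, b)]

# Toy parameters (hypothetical, far too small for real security).
rng = random.Random(0)
q, n, m, sigma, bound = 3329, 8, 16, 2.0, 8
A = [[rng.randrange(q) for _ in range(n)] for _ in range(m)]
s = [rng.randrange(q) for _ in range(n)]
b = lwe_sample(A, s, q, sigma, bound, rng)
noise = residual(A, s, b, q)
```

Whoever knows s recovers only the small noise vector, while without s the vector b is assumed to look uniform; this asymmetry is what the search and determination variants above formalize.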

Definition 3. (Inner product predicate). The inner product predicate is defined on the Cartesian product such that
From the perspective of functional encryption (FE), I can be deemed as the space of ciphertexts and is composed of secret keys. Once a correct key is known, we are able to learn the output of function .
To construct an attribute-based access control policy, the access structure is coded as a vector , so that the access authority can be verified with respect to the consistency of the authorization certificate .
To avoid confusion, the symbols used in this paper are listed in advance in Table 1.
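The predicate can be illustrated with a short Python sketch. It assumes the common convention that the predicate outputs 1 exactly when the inner product of the two vectors is 0 modulo q; the function name, the modulus, and the example vectors are hypothetical.

```python
def inner_product_predicate(x, y, q):
    """Return 1 if <x, y> = 0 (mod q), else 0."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return 1 if sum(a * b for a, b in zip(x, y)) % q == 0 else 0

# Hypothetical example over Z_7: the policy vector and a matching key vector.
policy = [1, 2, 3]
good_key = [1, 1, 6]   # 1*1 + 2*1 + 3*6 = 21 = 0 (mod 7)
bad_key = [1, 1, 1]    # 1 + 2 + 3 = 6 != 0 (mod 7)

granted = inner_product_predicate(policy, good_key, 7)
denied = inner_product_predicate(policy, bad_key, 7)
```

Verification costs one inner product, which is why the text describes the approach as replacing comparison with computation.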

3. Duplication Check Based on LWE

To prevent dictionary attacks caused by the exposure of deduplication labels, we intend to make them indistinguishable except during the process of duplication check. Therefore, LWE is adopted to randomize the hash value of the file, which ensures the indistinguishability of deduplication labels and resists attacks based on quantum computation. In addition, we exploit the inner product predicate to control the accessibility of clients, which is flexible enough for functions such as cross-user sharing and ownership transfer. The overall idea of our scheme is illustrated in Figure 1.

4. File Upload

A user denoted as A, who possesses a file and expects to upload it, does not initially know whether it already exists in the cloud. To avoid unnecessary storage and bandwidth, he is supposed to check whether a copy is already held by the server.

Drawing support from any strong-collision resistant hash function the user figures out the hash value of file as and codes it as a vector of elements. On fixed public matrix and a pseudorandom sequence generator (PRSG), he produces a vector and exploits LWE to obtain

Herein, stands for the last row of , where is considered as a dimensional vector and is randomly chosen from .
At this point, the user is ready to take the dimensional vector as a deduplication label and upload it to the cloud. Since the subsequent actions he should take depend directly on the result of the duplication check, we discuss the situations for the original uploader and the repeated uploader separately; they are denoted as A and B for clarity.

4.1. The Process of Original Uploading

We defer the description of the duplication check to the case of repeated uploading and suppose here that user A has been informed that file does not yet exist in the cloud. For further deduplication, he should secretly upload the deduplication certificate to the cloud. To ensure the confidentiality of his file, its hash value can be taken as a symmetric key to hide the plaintext as

Then, the cloud preserves the uploaded ciphertext for storage and the deduplication certificate for duplication check. To retrieve the file later, user A ought to upload a downloading certificate as well, as follows.

Assume that the attributes of user A correspond to a secret vector , which can also be regarded as a polynomial

It is worth mentioning that the user is aware of the elements of only if he holds those attributes. To realize a functional encryption that reflects the access structure in a covert manner, he uniformly samples two vectors and . Similarly, the vector can also be expressed as a polynomial , which is equivalent to a cyclic matrix with respect to the isomorphism between polynomials and cyclic matrices.
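The correspondence between polynomials and cyclic matrices can be checked numerically. The sketch below assumes the usual circulant convention, in which entry (i, j) of the matrix is the coefficient with index (i − j) mod n, so that multiplying the matrix by a coefficient vector equals polynomial multiplication modulo x^n − 1. The function names and test values are hypothetical, and the paper's own indexing may differ.

```python
def circulant(c):
    """Cyclic (circulant) matrix whose entry (i, j) is c[(i - j) mod n]."""
    n = len(c)
    return [[c[(i - j) % n] for j in range(n)] for i in range(n)]

def mat_vec(M, v, q):
    """Matrix-vector product modulo q."""
    return [sum(a * b for a, b in zip(row, v)) % q for row in M]

def poly_mul_cyclic(c, v, q):
    """Coefficients of c(x) * v(x) mod (x^n - 1), reduced mod q."""
    n = len(c)
    out = [0] * n
    for i, ci in enumerate(c):
        for j, vj in enumerate(v):
            out[(i + j) % n] = (out[(i + j) % n] + ci * vj) % q
    return out

def from_last_row(last):
    """Rebuild the whole cyclic matrix from its last row by right rotations."""
    rows, row = [], list(last)
    for _ in range(len(last)):
        row = row[-1:] + row[:-1]   # rotate right by one position
        rows.append(row)
    return rows

c, v, q = [1, 2, 3, 4], [5, 6, 7, 8], 17
C = circulant(c)
via_matrix = mat_vec(C, v, q)
via_poly = poly_mul_cyclic(c, v, q)
rebuilt = from_last_row(C[-1])
```

Because every row is a rotation of the previous one, a single row determines the whole matrix, which is consistent with the scheme storing only the last row.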

In order to construct the correct downloading certificate, he computes and figures out for by

After that, the user uploads as the downloading certificate and submits to the cloud for further expansions on the access structure.

At the end, the user preserves the hash value , the essential elements of matrix , and the returned link to the outsourced file. The ciphertext is held by the cloud server, together with , , and for duplication check, access expansion, and ownership proof.
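The idea of using the file's hash value as the symmetric key during the original upload can be sketched as follows. The paper does not name a concrete cipher, so this example substitutes a toy SHA-256 counter-mode keystream built only from the Python standard library; the function names are hypothetical and the construction is for illustration, not for production use.

```python
import hashlib

def file_key(data: bytes) -> bytes:
    """Derive the symmetric key from the file content (its hash value)."""
    return hashlib.sha256(data).digest()

def keystream(key: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key || counter (illustrative assumption)."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])

def xor_cipher(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt by XOR with the keystream (the operation is symmetric)."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

file_a = b"quarterly load report"
file_b = b"quarterly load report"      # user B holds the same file

cipher_a = xor_cipher(file_key(file_a), file_a)
# User B derives the same key from his own copy and can decrypt A's ciphertext.
recovered_by_b = xor_cipher(file_key(file_b), cipher_a)
```

Since the key depends only on the content, every owner of the same file derives the same key without any key distribution step, which is the property the later ownership proof for user B relies on.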

4.2. The Process of Repeated Uploading

As mentioned before, once a deduplication label is figured out, any user should first hand it over to the cloud for duplication check. Assuming that the deduplication certificate of an existing file is , the cloud can inspect its consistency with another deduplication label as follows:

When user B expects to upload his file , he submits its deduplication label to the cloud and keeps the hash value private.

Based on an outsourced deduplication certificate , the cloud computes within a lifted interval which is as follows:

It can be seen that, if the two files are identical, only will remain in formula (14). Therefore, when the result satisfies the cloud can ensure the duplication of file with negligible false positive.
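The principle behind this check, that only small noise remains when the two files are identical, can be sketched in a deliberately simplified form. The following Python fragment is not the paper's exact construction: it assumes that the secret vector is derived from the file hash by a seeded generator, that the label is A·s + e (mod q) with fresh noise, and that the certificate held by the cloud is s itself. All names and parameters are hypothetical.

```python
import hashlib
import random

Q, N, M, BOUND = 12289, 8, 16, 4   # toy parameters (assumptions)

def prg_vector(file_bytes):
    """Deterministic vector derived from the file hash (stand-in for the PRSG)."""
    seed = hashlib.sha256(file_bytes).digest()
    rng = random.Random(seed)
    return [rng.randrange(Q) for _ in range(N)]

def make_label(A, file_bytes, rng):
    """Label = A*s + e (mod Q) with fresh bounded noise for every upload."""
    s = prg_vector(file_bytes)
    return [(sum(a * x for a, x in zip(row, s)) + rng.randint(-BOUND, BOUND)) % Q
            for row in A]

def is_duplicate(A, certificate, label):
    """Cloud-side check: after removing A*certificate, only small noise may remain."""
    for row, b in zip(A, label):
        r = (b - sum(a * x for a, x in zip(row, certificate))) % Q
        if min(r, Q - r) > BOUND:
            return False
    return True

public_rng = random.Random(1)
A = [[public_rng.randrange(Q) for _ in range(N)] for _ in range(M)]

certificate = prg_vector(b"report.pdf contents")          # stored at first upload
same = make_label(A, b"report.pdf contents", random.Random(2))
other = make_label(A, b"a different file", random.Random(3))
```

In this simplified picture, labels of the same file differ from upload to upload because of the fresh noise, yet the party holding the certificate can still recognise them.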

To validate his accessibility, user B should also figure out the downloading right of the corresponding file. However, it is more reasonable to use existing download rights held by the cloud server for the purpose of storage saving. Based on this, user B can use the following subprotocol to obtain the download right of the duplicate file, and the cloud will simply send the link back to him for further retrieval.

4.3. The Subprotocol for Access Expansion

Denote the secret corresponding to the attributes of repeated uploader B as . To bind the access structure to his own attributes, he should also figure out a cyclic matrix , which can be used to compute his proof of ownership as follows:

Though the downloading certificate cannot be exposed, in order to prevent unauthorized access, the cloud can provide user B with the values of and to help him calculate the correct cyclic matrix . Thus, the downloading right can be computed by user B as in Algorithm 1.

User B
Input:
(1)Samples , where is irreversible and coefficients belong to
(2)Computes , ,
for all
, , ,
(3)Computes ,
(4)Samples , ; and computes
(5)For all
(6)
Output:

5. Proof of Ownership

Once a legal user has obtained his downloading right, he is authorized to retrieve the corresponding file from the cloud. To improve the efficiency of ownership proof, access authorization is executed in a computational way.

After uploading, the legal user A is provided with the last row of the cyclic matrix. Therefore, he only needs to form the cyclic matrix and combine it with his attribute vector to figure out the downloading right. Based on the resulting vector, the cloud can easily verify his accessibility by functional encryption. The complete PoW process is given in Algorithm 2.

User A
Input:
(1)Computes , ,
,
(2)Computes , ;
(3)For all
,
,
(4)Sends to Cloud.
Cloud
Input:
(5)Computes
(6)If
Output: ; Otherwise
Output: NULL.
(7)Sends to User A

After obtaining the ciphertext , user A can decrypt the file by computing because he is aware of the secret key .

In fact, the ownership proof process for user B is similar to that of user A. User B can also decrypt the file because the plaintexts and are equal and therefore share the same hash value, so he is able to obtain the corresponding file via .

6. Downloading Right Transfer

Note that, without the secret vectors corresponding to the attributes of legal users, other users are incapable of computing the downloading right even if the last row of the cyclic matrix is known. With the access control subprotocol, any legal user could directly transfer the resultant downloading right to other users to avoid redundant operations such as peer-to-peer transmission. However, this may lead to abuse of the downloading right and violate the confidentiality of the user’s attributes. In practice, legal users tend to transfer the downloading right of their file to others who share part of their attributes. Therefore, we designed a protocol by which any legal user can update the downloading right and transfer it to a group of users with the same set of attributes. In this way, the owner does not have to download the file from the cloud and only needs to transfer the downloading right to other users to complete file sharing, which effectively reduces the consumption of communication bandwidth.

Definition 4. (Common attributes vector). Suppose that the file owner A can be identified by attributes vector , and all users in the same group have common attributes denoted by . Then, the common attributes vector can be defined as a partial ordering relation that if and , otherwise.
Specifically, the process by which user A constructs the common attribute vector is detailed in Figures 2 and 3.
As shown in Figures 2 and 3, user A retains the secret attributes shared by the whole group and sets the attributes that differ within the user group to 0. Finally, he outputs a common attribute vector .
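The construction described above, keeping the shared attributes and zeroing the rest, can be sketched in a few lines of Python. Representing the group's common attributes as a set of vector positions is an assumption made for this illustration, as are the function name and the sample values.

```python
def common_attribute_vector(owner_attributes, shared_positions):
    """Keep the owner's entries at shared positions and set all others to 0."""
    return [value if index in shared_positions else 0
            for index, value in enumerate(owner_attributes)]

# Hypothetical example: the group shares the attributes at positions 0 and 2.
owner = [17, 4, 9, 23]
group_vector = common_attribute_vector(owner, {0, 2})
```

Here `group_vector` evaluates to `[17, 0, 9, 0]`, so a group member needs only the shared entries to work with the updated downloading right.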

6.1. Proof of Ownership

User A performs the following steps to realize the PoW and retrieve and . If the downloading right is valid, the inner product results in 0, meaning that user A is authorized to retrieve the file. Therefore, the cloud server returns and to him. Similarly, the values of and can be used to update the downloading right for a group of users. Specifically, the process of PoW is shown in Algorithm 2, which is the same for any valid user even if the updated downloading right is used.

6.2. Update the Downloading Right

To share the file with a group, user A can carry out the downloading right update as follows. The process by which user A calculates the downloading right for a group of users is shown in Algorithm 3.

User A
Input:
(1)Samples , where is irreversible with coefficients belonging to
(2)Computes
for all
; ; ;
(3)Computes
(4)Computes
(5)Samples , , and Computes
(6)For all
Output:

6.3. Sharing the Downloading Right

After the previous two stages, user A can share the vector and the secret key with all users who hold the same attribute set. In this way, a group of users is provided with the downloading right, which is valid if the common attributes vector is known.

7. Correctness Proof

The previous section is mainly composed of three parts, namely, the file uploading, the proof of ownership, and the downloading right transfer. To verify the correctness of our design, this section intends to prove that file duplication can be effectively eliminated and only authenticated users can access the file.

Firstly, the correctness for the deduplication label is given by Theorem 1.

Theorem 1 (Correctness of deduplication label). Suppose that the cloud holds a deduplication certificate corresponding to file . When user B uploads the deduplication label before outsourcing the same file , the cloud can perform deduplication correctly with negligible false positive.

Proof. The deduplication certificate is stored on the cloud, where . After user B uploads the deduplication label of the same file to the cloud for , the cloud executes the following calculation on each deduplication certificate. Once is met, the inner product can be carried out as follows:
Since is a deterministic algorithm, when , . Meanwhile, because the matrix is common, it is obvious that . Thus, we can easily see that from equation (17). Because the inner product of is definite, the inner product of is also guaranteed, meaning that a duplicate is always detected.

Theorem 2 (Correctness of download right). Suppose that the cloud possesses a downloading certificate corresponding to file , then any legal user can correctly pass the procedure of PoW in terms of his downloading right.

Proof. User A, who uploads the original file to the cloud, rotates to the right by one position to get and uses it to reconstruct the cyclic matrix . Then, user A calculates the download right where are the attributes of user A. After that, user A sends the download right to the cloud. Finally, the cloud calculates the inner product of
Because the last element of the download certificate is , the result of can be rewritten as . Therefore, the inner product of is zero. For the repeated uploader B, the first two steps are the same.
Then, he also obtains the download right and sends it to the cloud. Moreover, the inner product of is calculated as follows:
In a word, all legal users who hold the download right corresponding to file can pass the PoW.

8. Security Analysis

This part proves that the deduplication label is indistinguishable except during the duplication check process and that the downloading right is resistant to forgery. To begin with, the security of the deduplication label is given in Theorem 3.

Theorem 3 (Security of deduplication label). For legitimate users, whether they upload the same or different files for deduplication, the deduplication labels are distinguishable only to the duplication check process.

Proof. The following analysis is divided into two cases, with respect to the deduplication labels corresponding to the same file and to different files.
Case 1. Suppose user A and user B possess the same file. They have the same hash value, and their deduplication labels are
According to the deterministic algorithm , we can see . Moreover, for the common matrix , it is obvious that . However, and are randomly sampled from and , respectively. The probability that the deduplication labels are identical is , which is negligible. Therefore, we claim that the result is almost impossible, which means that and satisfy semantic security.
Case 2. Suppose user A and user B possess different files. That is to say, they have different file hash values , and the deduplication labels are
Similarly, since , the probability that the deduplication labels are the same is , which is indistinguishable from the distribution of Case 1.
Therefore, since the deduplication labels of the same file are different and the distribution in Case 1 is indistinguishable from that in Case 2, the deduplication labels are semantically secure. In summary, the deduplication labels corresponding to the same file and to different files are indistinguishable.

Theorem 4 (Security proof of downloading right). None of the users can forge a valid downloading right which can deceive access control.

Specifically, the security analysis of the downloading right can be guided by Lemmas 1 and 2.

Lemma 1. After the original uploader A outsourced the file to the cloud, the entire download certificate is known only by the cloud.

Proof. According to the inner product predicate, user A's downloading right makes the inner product output 0.
However, the download certificate is calculated by user A, who samples and sets the last element of the download certificate to be .
Then, when user A uploads for the first time, the cloud obtains the complete download certificate corresponding to A’s secret attributes. If an illegal user now tries to falsify the download certificate to cheat the PoW system, his advantage is which is negligible.

Lemma 2. Repeated file uploaders do not know the remaining elements of the download certificate except for .

Proof. Take a repeated file uploader B as an example. He uses to update the last three elements of the download right into .
In detail,
Since the value of is known, the result of can be calculated. However, because the rank of formula (25) is equal to 1 and , formula (25) contains two unknown variables. Thus, there are infinitely many solutions for and . Therefore, when user B calculates the downloading right, he does not know the remaining elements of the download certificate except for .
Considering that formula (25) has infinitely many solutions, the security of the downloading right can be effectively protected, namely, of user B. Thus, the confidentiality of legal users’ attributes is also guaranteed. If an illegal user attempts to forge the remaining elements of to get the new download certificate , his advantage is just which is negligible. Therefore, our scheme does not expose the remaining elements of the download certificate .
In terms of Lemmas 1 and 2, it can be seen that no user can forge a valid downloading right, since the complete download certificate and the attributes vector are never exposed.

9. Performance Analysis

In this section, the performance of our scheme is analysed in comparison with other mainstream technologies. The notation can be found in Table 1. The comparison of functional properties, namely the necessity of a third party, the deduplication level, the participants, and the necessity of key fusion, can be found in Table 2.

Compared with the schemes from [2, 3, 9], our scheme does not require any third party, which effectively avoids extra trust relationships and saves considerable computation and communication resources. Moreover, our scheme executes file deduplication amongst multiple users, implying that it is more flexible and more adaptive to various cloud environments. From the perspective of key fusion, unlike the schemes in [2, 3, 8, 9], no key fusion process is needed in our scheme, so that it can be applied even if user resources are limited.

Then, we compare the computation overheads for deduplication taken by the client, third-party, and cloud in the above schemes. The details are given in Table 3.

Compared with the cost on the client side in scheme [3], that of our scheme is , where a pseudorandom number sequence is generated instead of convergence keys. In fact, this means that our scheme is more efficient, since can be generated iteratively from small numbers, not to mention that our scheme is free of any third party. Moreover, the hash value of the file can be secretly used as the encryption key in this paper. Therefore, there is no need for multiple users to reconstruct the convergence key, which further outperforms the scheme of [3] by avoiding the cost of key distribution and fusion.

Compared with the schemes in [8, 9], our method does not need to construct a Bloom filter or an attribute binary tree on the client side, so its computational cost is slightly lower. In addition, since our scheme does not involve any third party, the computational cost of the TTP can be neglected. As for the overhead on the cloud side, our scheme does not have to initialize any ownership data structure, in contrast to the schemes in [8, 9]. Therefore, the calculation is reduced to , since it is related not to the file size but only to the length of the hash value.

Now, we compare the computational overhead for PoW on the client, third-party, and cloud sides, respectively. The results are shown in Table 4.

It can be seen from Table 4 that users have to preserve and search the Bloom filter or attribute binary tree to accomplish PoW in [2, 3, 8, 9]. So, there is an additional cost or on the client side. However, our scheme does not require this process, so the calculation cost is only , where the second term is just additions. On the cloud side, the cost of our scheme, which is , also outperforms that of [2, 3, 8, 9]. The reason is similar: the calculation cost on the cloud side depends not on the file size but only on the number of attributes.

Finally, taking the file of 256 bits as an example, we compare the communication overhead for deduplication and PoW amongst the same set of schemes. The details are shown in Figure 4.

According to Figure 4, our scheme has an obvious advantage in communication overhead compared with the other schemes. Our solution can effectively reduce bandwidth usage as well as time delay. Moreover, since all duplication check and ownership proof processes are independent, our scheme is capable of parallel processing, which makes it better suited to batch implementation.

10. Conclusions

This paper proposed a novel deduplication scheme based on LWE and FE to balance the conflict between the accessibility and the indistinguishability of data. Focusing on the purpose of duplication check, LWE is exploited to construct deduplication labels that are distinguishable only if their deduplication certificates are known. To realize more efficient and flexible access control, the inner product predicate is used so that data can be retrieved only if both the user’s downloading right and attributes vector are possessed. Thanks to the separation of the downloading right from the user’s attributes, the downloading right can be recalculated for repeated uploading and authorization transfer without changing the corresponding deduplication label or download certificate in the cloud. Correctness and security analyses proved that deduplication can be accomplished only by the duplication check process with negligible false positive and that it is almost impossible for any adversary to fabricate a legal downloading right. Compared with other mainstream technologies, our scheme is more applicable to multiuser environments and is free from any trusted third party. Since both duplication check and ownership proof are realized by inner products, our method has advantages in computation and communication performance, not to mention its capacity for batch processing due to parallelism.

Data Availability

The data set was obtained from Chongqing Tongnan Electric Power Co., Ltd (telephone: 023-44559308; official website: http://www.12398.gov.cn/html/information/753078881/753078881201200006.shtml).

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors would like to thank Bo Mi and Fengtian Kuang for their comments and suggestions. This work was supported in part by the National Natural Science Foundation of P.R. China (Grant nos. 61573076, 61703063, and 61903053); the Science and Technology Research Project of the Chongqing Municipal Education Commission of P.R. China (Grant nos. KJZD-K201800701, KJ1705121, and KJ1705139); and the Program of Chongqing Innovation and Entrepreneurship for Returned Overseas Scholars of P.R. China (Grant no. cx2018110).