Abstract

Remote data integrity checking with failure recovery has become critical in distributed cloud systems. Because they lower the repair bandwidth while preserving fault tolerance, regenerating coding and network coding (NC) have received much attention in the coding-based storage field. Recently, an outsourced auditing scheme named NC-Audit was proposed for regenerating-coding-based distributed storage. The scheme claimed to achieve lightweight, privacy-preserving remote data verification for these networked distributed systems. However, our algebraic analysis shows that NC-Audit can be easily broken because of a defect in its design. That is, an adversarial cloud server can forge illegal blocks that cheat the auditor with high probability when the coding field is large. From the perspective of algebraic security, we propose a remote data integrity checking scheme, RNC-Audit, which hides part of the critical information from the server without compromising system performance. Our evaluation shows that the proposed scheme has significantly lower overhead than the state-of-the-art schemes for distributed remote data auditing.

1. Introduction

Distributed cloud storage provides an on-demand data outsourcing service and has become a popular research topic due to its elasticity and low maintenance cost. However, some storage nodes in such a system may be untrustworthy, which makes ensuring data integrity and reliability increasingly difficult for data owners. Regenerating coding, an alternative technology designed for these networked storage systems, can reduce the data repair bandwidth, provided that the integrity of the outsourced data is guaranteed. Therefore, enabling remote data integrity verification is fundamental and crucial for regenerating-coding-based cloud storage systems [1–3].

Many solutions that realize outsourced integrity checking without a local copy have been presented under several system and security models, such as provable data possession (PDP) [2–6] and proof of retrievability (PoR) [7–10]. PDP approaches typically rely on homomorphic authentication schemes to ensure possession of files on untrusted storage, while PoR approaches combine spot-checking and error-correcting codes to ensure both possession and retrievability of files on archive service systems. To enable integrity verification, both solutions require a cloud server to return a proof for particular hosted data blocks specified by the auditor (or the user itself). If the returned proof fails the auditor’s verification, the auditor concludes that the data hosted on that server are corrupted. PDP and PoR are the earliest solutions for data integrity auditing in a single-cloud environment, in which only one copy of user data is stored in the cloud. Considering that files are usually striped and redundantly stored across distributed systems (i.e., multiple servers or multiple clouds), the works in [7, 9–12] explore integrity verification suitable for such distributed settings with different redundancy schemes, such as replication, erasure codes, and regenerating codes. In this paper, we focus on the remote data integrity verification problem in regenerating-code-based distributed storage systems.

Most traditional PoR approaches cannot support popular cloud storage applications because they rely on random permutations for auditing. Although some PDP schemes incorporate error-correcting codes by preprocessing the outsourced data, decoupling error correction in this way is inefficient. Therefore, the state-of-the-art approaches only partially solve cloud data integrity checking, and they still face usability challenges in practical scenarios due to their low efficiency for regenerating-coding-based distributed cloud storage [13–15]. Usability requires that a cloud data integrity scheme be practically secure, support real-time applications, and run fast. Most of the existing solutions either involve large-scale encoding computation over the stored data or need to generate a large amount of authenticated parity data. When these solutions are applied to regenerating-coding-based systems [5, 8, 9], the distributed storage performance degrades dramatically and falls far short of being usable.

Only a few works have been devoted to usable regenerating-coding-based applications. To achieve a lightweight implementation cost, Le et al. [11] proposed a symmetric-key-based privacy-preserving auditing scheme called NC-Audit, which offers relatively efficient performance. The authors claimed that it realizes privacy-preserving remote data auditing along with failure repair for cloud storage nodes. However, as illustrated in this paper, the scheme has a fatal security weakness that breaks the integrity checking protocol. Recently, Lakshmi and Deepthi [10] proposed a homomorphic encryption scheme based on channel coding for regenerating-coding-based storage systems, which realizes verifiable computation and error correction with a very small amount of bandwidth. However, large-scale matrix multiplication operations are involved in the audit and error correction processes, which brings heavy online computation overhead.

In parallel, another line of work on maintaining remote system security is presented in [16–18], which targets malware detection or tamper resistance in storage networks. However, these works focus only on software or hardware security, which is orthogonal to the data security considered in this paper. Unlike these active attack detection mechanisms, we focus only on remote data integrity verification solutions in distributed cloud storage.

The contributions of this paper are threefold, as listed below.
(1) We point out that the scheme NC-Audit for regenerating-coding-based storage is actually insecure because it cannot satisfy fundamental auditing security: an adversarial storage node can successfully forge an illegal response that passes the auditing verification even if the storage node has deleted the user’s whole file.
(2) We further propose an improved algebraic security-based remote data auditing scheme named RNC-Audit (Revised NC-Audit) to fill the security gap between coding reliability and the usability of integrity checking. The methodology behind this work is to scramble some of the key parameters to strengthen the security.
(3) The proposed scheme is practically secure and does not rely on any computation-heavy arithmetic calculations. It supports existing distributed cloud storage applications and works over computation-efficient finite fields.

The rest of the paper is organized as follows. In Section 2, we formulate the system model and the threat model. In Section 3, we describe the scheme NC-Audit execution between the user and a single storage node. In Section 4, we elaborate security analysis on NC-Audit. In Section 5, we put forward an improved algebraic security-based scheme RNC-Audit. In Section 6, we explain the correctness of RNC-Audit. In Section 7, we evaluate the communication and computational efficiency. In Section 8, we conclude the paper.

2. System and Threat Model

2.1. System Model

Similar to [11], we consider a cloud storage service involving a user, a third-party auditor (TPA), and some regenerating-coding-based storage nodes which make up the cloud storage provider (CSP). The user uploads his data to the storage nodes and relies on the TPA to check the integrity of the outsourced data at each node. In particular, the user does not want the TPA to learn the content of his data. The auditing system model is shown in Figure 1.

Before data uploading, the user encodes the file using a regenerating code and uploads the encoded data to N storage nodes in a distributed way. The detailed procedure is as follows.
(1) Divide the file into a sequence of initial message vectors, i.e., , where , and is a finite field of size .
(2) Expand the vector to where , and is an -dimensional unit vector whose -th element is 1.
(3) Compute for , where is randomly chosen in and is termed as ’s encoding coefficient and is composed of the last elements of .
(4) For , upload the digitally signed vectors , to the storage node .
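The encoding procedure above can be illustrated with a minimal sketch of random linear coding over a small prime field. The field size `P`, the vector length, and the function names `split_file`, `augment`, and `encode` are assumptions chosen for illustration; they are not the notation or parameters of the original scheme, and no digital signature step is shown.

```python
import random

P = 257  # assumed small prime field size, for illustration only


def split_file(data: bytes, n: int) -> list[list[int]]:
    """Split data into message vectors of n field symbols each (zero-padded)."""
    padded = data + bytes(-len(data) % n)
    return [list(padded[i:i + n]) for i in range(0, len(padded), n)]


def augment(vectors: list[list[int]]) -> list[list[int]]:
    """Append the i-th unit vector to the i-th message vector."""
    m = len(vectors)
    return [v + [1 if j == i else 0 for j in range(m)]
            for i, v in enumerate(vectors)]


def encode(augmented: list[list[int]], rng: random.Random):
    """Return (coefficients, coded vector): a random linear combination."""
    coeffs = [rng.randrange(P) for _ in augmented]
    width = len(augmented[0])
    coded = [sum(c * v[j] for c, v in zip(coeffs, augmented)) % P
             for j in range(width)]
    return coeffs, coded


rng = random.Random(0)
vectors = split_file(b"hello regenerating codes", 8)  # 24 bytes -> 3 vectors
aug = augment(vectors)
coeffs, coded = encode(aug, rng)
# Because of the appended unit vectors, the last m symbols of a coded
# vector are exactly its encoding coefficients.
recovered_coeffs = coded[-len(vectors):]
```

Appending unit vectors is what lets any receiver read the encoding coefficients directly from the tail of a coded vector, which matches the statement in step (3) that the coefficient vector is composed of the last elements of the coded vector.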

When a regenerating code is adopted in a distributed storage system with storage nodes (each node stores blocks), the data file distributed across this system can be restored by accessing data blocks from any of the healthy nodes. When a server node fails, the data stored in the failed server can be reconstructed by retrieving data blocks from any healthy servers, and therefore, the repair bandwidth is . In the example given in Figure 2, .

The symbolic representations described above will continue to be used in the following text. We refer the interested reader to the literature about regenerating code construction [1912].

2.2. Threat Model

We consider semitrusted storage nodes that are faithful and do not deviate from the auditing protocol. However, they may deliberately delete rarely accessed user data to reduce storage costs; they may also try to hide data corruption caused by either internal or external factors to maintain their reputation. For clarity, we restrict our discussion to the interaction between a single storage node and the TPA.

We assume that the TPA, who is in charge of remote data integrity checking (i.e., data auditing), is independent and reliable. The TPA has no incentive to collude with the storage nodes but has a strong desire to extract or leak the user’s secret keys. This is a general assumption when relying on a TPA for data auditing to reduce the user’s burden [5, 6, 8].

As is standard for cryptographic protocols, both the TPA and the storage node are fully aware of the protocols used.

2.3. Auditing Model

Generally, a remote data auditing scheme for cloud storage includes a three-stage process, i.e., initialization, outsourcing upload, and data auditing. The initialization stage generates the system parameters and the protocol security keys. In outsourcing upload, a user generates authentication tags for all the outsourced blocks and outsources the blocks to the storage nodes. In data auditing, the cloud server computes a response to the TPA's challenge (i.e., a linear combination of some given blocks and the corresponding authentication tag), and the TPA verifies whether the tag is a valid tag of the combined block.
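The three stages can be sketched with a toy linearly homomorphic tag, where the tag of a block is its inner product with a secret key vector. This is only an illustration of the challenge-response pattern under stated assumptions: the prime `P`, the inner-product tag, and the names `tag`, `respond`, and `verify` are hypothetical, and the sketch omits the PRFs, masking, and privacy protection used by NC-Audit and RNC-Audit.

```python
import random

P = 2_147_483_647  # assumed prime modulus (2^31 - 1), for illustration


def inner(a, b):
    """Inner product of two vectors modulo P."""
    return sum(x * y for x, y in zip(a, b)) % P


def tag(key, block):
    """Outsourcing upload: the user's authentication tag for one block."""
    return inner(key, block)


def respond(blocks, tags, challenge):
    """Data auditing, server side: combine the challenged blocks and tags
    with the same coefficients."""
    width = len(blocks[0])
    agg_block = [sum(c * blocks[i][j] for i, c in challenge) % P
                 for j in range(width)]
    agg_tag = sum(c * tags[i] for i, c in challenge) % P
    return agg_block, agg_tag


def verify(key, agg_block, agg_tag):
    """Data auditing, auditor side: check the tag of the combined block."""
    return inner(key, agg_block) == agg_tag


rng = random.Random(1)
key = [rng.randrange(1, P) for _ in range(4)]            # initialization
blocks = [[rng.randrange(P) for _ in range(4)] for _ in range(5)]
tags = [tag(key, b) for b in blocks]                     # outsourcing upload
challenge = [(0, rng.randrange(1, P)), (3, rng.randrange(1, P))]
agg_block, agg_tag = respond(blocks, tags, challenge)    # data auditing
honest_ok = verify(key, agg_block, agg_tag)

corrupted = agg_block.copy()
corrupted[0] = (corrupted[0] + 1) % P
corrupted_ok = verify(key, corrupted, agg_tag)
```

Verification works because the tag is linear in the block, so the tag of a linear combination equals the same linear combination of the tags.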

3. Description of NC-Audit

For clarity, this section focuses on the discussion of NC-Audit execution between the user and a single CSP storage node.

In NC-Audit, every initial message block consists of n characters in , while the last two characters are padded randomly. The scheme introduces three pseudo-random functions (PRFs): where indicates the positive integer set, denotes the PRF key set, and is the file identifier set. NC-Audit consists of three phases as follows.
(1) Initialization:
(a) Setting security parameters and PRFs, then the user shares a unique key and with the storage node and the TPA, respectively, where is used for encryption at the storage node and is for verification at the TPA.
(b) Inputting and , the user and TPA compute the vectors and with , where is constructed by the first elements of and for .
(c) Inputting , the user generates the vector with and computes for , and then sends to . Here the -th element of the vector is determined by .
(2) Outsourcing upload: For , the user works as follows.
(a) Generate as in Section 2.1 and compute as the authentication tag of the vector .
(b) Retain the secret keys and the encoding coefficient vector , send to TPA, and upload the vector to . In fact, the user can delete the whole data of the file denoted by id.
(3) Data auditing:
(a) The TPA selects an outsourced vector index set , chooses randomly, and then sends the challenge message to .
(b) computes the aggregated vector: and randomly chooses to perform the following operations:
Step 1: computing for and a mask vector
Step 2: computing
Step 3: sending the response message to TPA.
(c) The TPA computes according to , extracts , and then verifies if the equation holds.

If so, passes the auditing in that time and outputs 1; otherwise, it outputs 0.

4. Security Analysis on NC-Audit

The authors in [11] showed that NC-Audit achieves ciphertext indistinguishability under chosen plaintext attacks. However, this is not enough to guarantee the security of the auditing mechanism. As claimed, the vector (or ) can only be shared between the user and the TPA, which means that once is leaked to an adversarial storage node, the adversarial storage node is able to forge at least one illegal vector to pass the auditing verification when is large. The following analysis shows how an adversary can easily deduce the private vector .

According to equation (3), we can construct a system of linear equations about vector (with unknowns) as follows:

Since , we can see that is clearly a dimensional matrix in . Meanwhile, every element of can be considered random in because it is generated by a PRF, so the probability that the rank of the matrix is approaches 1 when is large enough. This means that the adversary can solve for the vector with high probability.
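The claim that a random matrix is full rank with probability close to 1 over a large field can be checked numerically. The sketch below uses the standard result that a uniformly random n × n matrix over a field of size q is invertible with probability (1 − 1/q)(1 − 1/q²)⋯(1 − 1/qⁿ). Treating the matrix as square and its entries as uniformly random is an assumption made here for illustration; the function name is hypothetical.

```python
def full_rank_probability(q: int, n: int) -> float:
    """Probability that a uniformly random n x n matrix over a field
    of size q is invertible: prod_{i=1..n} (1 - q^-i)."""
    prob = 1.0
    for i in range(1, n + 1):
        prob *= 1.0 - q ** (-i)
    return prob


# Compare a tiny field with larger ones for the same matrix size.
p_small = full_rank_probability(2, 8)
p_byte = full_rank_probability(256, 8)
p_large = full_rank_probability(2 ** 16, 8)
```

The product is dominated by its first factor, 1 − 1/q, so the success probability of the adversary grows toward 1 as the field size increases, independently of the matrix dimension.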

According to equation (8), we then have

As stated above, the adversary can solve the vector with a high probability when is large enough. Moreover, he can indeed know the public vectors . Therefore, the adversary can easily deduce at most vectors satisfying

Since the vector is fixed during each encryption in NC-Audit, it holds that

According to equations (10) and (11), we have which means still holds regarding the forged vector . That is, the adversary can elaborately disguise an illegal plaintext vector with the specific vector and the tag , such that the forged response message can succeed in passing the auditing verification.
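The essence of the attack, namely that knowing the secret vector lets the server substitute a different block under the same tag, can be illustrated with a toy inner-product tag. This is a simplified sketch under stated assumptions, not the NC-Audit equations: the prime `P`, the tag definition, and the function `forge` are hypothetical. The forged block differs from the original by a vector orthogonal to the key, so the tag is unchanged.

```python
import random

P = 2_147_483_647  # assumed prime modulus (2^31 - 1), for illustration


def inner(a, b):
    """Inner product of two vectors modulo P."""
    return sum(x * y for x, y in zip(a, b)) % P


def forge(key, block, rng):
    """Given a leaked key, return a different block with the same tag.

    Builds delta with <key, delta> = 0 by choosing all but the last entry
    freely and solving for the last one (requires key[-1] != 0).
    """
    delta = [rng.randrange(1, P) for _ in range(len(key) - 1)]
    partial = inner(key[:-1], delta)
    delta.append((-partial * pow(key[-1], -1, P)) % P)
    return [(b + d) % P for b, d in zip(block, delta)]


rng = random.Random(7)
key = [rng.randrange(1, P) for _ in range(6)]   # secret, assumed leaked
block = [rng.randrange(P) for _ in range(6)]
original_tag = inner(key, block)
forged = forge(key, block, rng)
forged_tag = inner(key, forged)
```

In this toy setting the verifier cannot distinguish `forged` from `block`, which is why the secrecy of the key vector is the whole basis of the auditing security.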

5. An Improved Remote Data Integrity Checking Scheme for Distributed Storage

According to the algebraic analysis in Section 4, we can conclude that the security of NC-Audit depends on the security of the secret vector , and vice versa. Inspired by the notion of algebraic security in [20], a feasible method of realizing secure integrity checking is to prevent an adversary from deducing the vector . For example, if the adversary can obtain fewer than values out of , there is no way for him to launch the attack described in Section 4.

We therefore present an algebraic security-based auditing scheme called RNC-Audit, in which a special randomization is performed to protect some of the critical security parameters. This makes the auditing protocol satisfy the algebraic security criterion, i.e., the adversary cannot solve the linear system of equations constructed from his known information. To this end, without loss of generality, we assume that the values and in equation (9) have been randomized to prevent the adversary’s security analysis in the following section.

Besides the PRFs and used in NC-Audit, RNC-Audit also introduces another PRF: where is the set of identifiers of auditing tasks. Since the system parameters of the protocol are distinct from those of NC-Audit, RNC-Audit performs a different auditing computation.

The detailed execution of RNC-Audit between the user and a single CSP storage node is highlighted as follows.
(1) Initialization:
(a) Setting security parameters and PRFs, then the user shares a unique key and with the storage node and the TPA, respectively, where is used for encryption at the storage node and is for verification at the TPA.
(b) Both the user and the TPA compute the vectors and as in NC-Audit.
(c) The user generates the vector as in NC-Audit, then computes and sends the parameters to (rather than the former parameters used in NC-Audit), and simultaneously sends to TPA, where in which and are both selected randomly in .
(2) Outsourcing upload: For , the user works as follows.
(a) Generate as in Section 2.1 and compute as the authentication tag of the vector .
(b) Retain his secret key and , send the vector to TPA, and upload the vector to . Usually, the user can delete the whole data of the file denoted by id.
(3) Data auditing:
(a) The TPA generates and sends the challenge message .
(b) computes the aggregate vector , i.e., and then performs the following operations:
Step 1: generating , and computing a mask vector where is the identifier to label the current audit task.
Step 2: computing
Step 3: sending the response message to TPA, where
(c) TPA computes according to , extracts from resp, and computes and then verifies if the following equation holds.

If so, passes the auditing in that time and outputs 1; otherwise, it outputs 0.

6. Correctness of RNC-Audit

If the stored file is intact, the correctness of RNC-Audit follows from the derivation below.

Let , where .where is a zero vector of size .

7. Performance Analysis

7.1. Security Guarantee

Similar to the analysis in Theorems 2 and 4 in [11], it can be easily proven that RNC-Audit can provide data possession proof and privacy-preserving guarantee.

In particular, RNC-Audit overcomes the security weakness of NC-Audit described in Section 4. The user is able to protect the values against both the honest-but-curious server and the TPA, so that the adversary can obtain at most a linear system with equations with respect to unknowns as follows:

It is easy to see that when is sufficiently large, the adversary cannot solve for the vector except by brute-force guessing. This resists the adversarial analysis in Section 4 and thus guarantees the auditing security of RNC-Audit. Note that the proposed scheme therefore realizes algebraic security as defined in [20].
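Why an underdetermined system leaves the adversary guessing can be seen by counting solutions in a tiny example. The sketch below enumerates all candidates for a system with 2 equations and 4 unknowns over a field of size 5; the matrix `A`, the right-hand side `b`, and the field size are made-up values for illustration, not parameters of RNC-Audit.

```python
from itertools import product


def count_solutions(A, b, q):
    """Count vectors x over Z_q (q prime) with A x = b, by exhaustive search."""
    unknowns = len(A[0])
    count = 0
    for x in product(range(q), repeat=unknowns):
        if all(sum(a * xi for a, xi in zip(row, x)) % q == rhs
               for row, rhs in zip(A, b)):
            count += 1
    return count


# 2 independent equations, 4 unknowns, field size 5 (illustrative values).
A = [[1, 2, 3, 4],
     [0, 1, 1, 2]]
b = [3, 1]
n_solutions = count_solutions(A, b, 5)  # 5 ** (4 - 2) candidates remain
```

With r independent equations in u unknowns over a field of size q, exactly q^(u − r) candidates are consistent with what the adversary sees, so each hidden value multiplies the guessing space by q.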

7.2. Computation and Communication

Under the same security level, the scheme NC-Audit and the scheme in [10] were once considered to be efficient in computation performance among the current data auditing schemes for regenerating-coding-based storage. Therefore, the following text will compare the performance of RNC-Audit with that of the schemes in [10] and NC-Audit. The systematic performance features of the three schemes are shown in Table 1.

The scheme in [10] is specifically constructed for online user auditing and differs in system model from the existing schemes because no TPA is needed. This design simplifies the protocol and eliminates the security risks from the TPA, but as a side effect the user must always be online and bears a high processing burden. Furthermore, the scheme involves large-scale error correction encoding and channel decoding operations, which makes its computation overhead higher than that of RNC-Audit and NC-Audit.

Although NC-Audit is also excellent in computation cost, it cannot guarantee the user’s data privacy. In contrast, RNC-Audit can ensure data privacy and achieve implementation performance similar to that of NC-Audit. Specifically, RNC-Audit performs only two more multiplications in compared to NC-Audit during each audit. In addition, it does not increase the storage overhead of any entity in the system while the protocol runs.

For each audit round, the main communication overhead is the response transmission from the storage node to the TPA (or the user), which is dominated by the size of the (encrypted) data block. NC-Audit and RNC-Audit have equivalent communication efficiency. By comparison, the scheme in [10] has lower communication overhead because of its simplified auditing method.

7.3. Implementation

We implemented the schemes in C to compare the online computation performance of RNC-Audit with that of two typical schemes, i.e., NC-Audit and the scheme in [10]. The experimental result reported for each of the three schemes is the average of 1000 runs on a computer with an Intel Core i7-8650U CPU (1.9 GHz) and 16 GB of RAM.

For a fair comparison, we set (i.e., 4 kB block), . The experiment ignores addition operations and focuses on the online processing time of multiplications over with the help of a lookup table. Table 2 shows the average computation performance of the TPA and one CSP server (both have the same configuration) during one audit, when and 300. represents the average online computing time of entity when . The result shows that RNC-Audit outperforms the other two schemes in execution efficiency. Since the scheme in [10] additionally performs complex decoding operations, its user-side computation overhead is relatively large, but its server-side overhead is small. For all three schemes under comparison, the computation time increases with .
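Table-based field multiplication of the kind mentioned above can be sketched as follows. The field used in the experiment is not recoverable from this text, so the sketch assumes GF(2^8) with the reduction polynomial 0x11B and the generator 0x03 (the field used by AES) purely as an example; the names `build_tables` and `gf_mul` are hypothetical, and Python is used here for brevity rather than the C of the original implementation.

```python
def build_tables():
    """Build exp/log tables for GF(2^8) with polynomial 0x11B, generator 0x03."""
    exp = [0] * 510
    log = [0] * 256
    x = 1
    for i in range(255):
        exp[i] = x
        log[x] = i
        doubled = x << 1          # multiply by 0x02
        if doubled & 0x100:
            doubled ^= 0x11B      # reduce modulo the field polynomial
        x = doubled ^ x           # multiply by 0x03 = 0x02 + 0x01
    for i in range(255, 510):
        exp[i] = exp[i - 255]     # duplicate so log sums need no modulo
    return exp, log


EXP, LOG = build_tables()


def gf_mul(a: int, b: int) -> int:
    """Multiply two field elements with two table lookups and one addition."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]


example = gf_mul(0x57, 0x83)  # worked example from the AES specification
```

Doubling the length of the exponent table avoids a modulo-255 reduction on every multiplication, which is a common way to keep the per-multiplication cost to a few memory accesses.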

Taking the above discussion into account, RNC-Audit provides proof of retrievability and privacy-preserving auditing without compromising data security, while remaining efficient enough for real-time applications, which makes it usable for regenerating-coding-based distributed storage.

8. Conclusion

Privacy-preserving data auditing is one of the key issues in distributed cloud storage applications. This paper first points out, based on our algebraic analysis, that there is a fatal security flaw in the scheme NC-Audit [11]. Inspired by the notion of algebraic security, we then present an improved PoR-based scheme called RNC-Audit, which not only resists the algebraic analysis but also maintains competitive implementation efficiency in coding-based storage systems. Our analysis and evaluation results demonstrate that RNC-Audit is more efficient and usable than the state-of-the-art schemes in practical resource-constrained scenarios.

The ongoing work targets the application extensions of our research methodology from regenerating-coding-based storage to privacy-preserving forwarding [21], security-as-service frameworks [22], mobile crowdsensing [23, 24], authorized deduplication [25, 26], and the adoption of a stronger threat model involving malware detection [16, 18, 27].

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by the National Natural Science Foundation of China (nos. 61702105 and 61872088), Natural Science Foundation of Fujian Province (no. 2019J01276), Natural Science Foundation of Shaanxi Province (no. 2021JQ-196), China Postdoctoral Science Foundation (no. 2019M663629), Xi’an Science and Technology Project (no. 2020KJWL02), and China National Study Abroad Fund Grant.