Abstract

Distributed file storage aims to support reliable access to data held on distributed nodes. Application scenarios include data centers, peer-to-peer (P2P) storage systems, and storage in wireless networks. In these applications, however, data blocks are inevitably replaced or become inaccessible when nodes fail, so data integrity and credibility cannot be guaranteed. To overcome this challenge, we explore blockchain as a means of protecting distributed data. Through analysis and evaluation, we demonstrate that blockchain supports data integrity and credibility for distributed file storage, and we discuss the application of blockchain technology to distributed file storage.

1. Introduction

Distributed file storage stores data across the network on distributed storage nodes. A large number of applications have emerged that involve large data centers and peer-to-peer storage systems [15]. All these applications use nodes across the Internet to store files in a distributed way. To obtain reliable storage in networks such as wireless sensor networks (WSNs), additional data recovery may be required [4, 6], especially in disaster environments [5, 7].

In these applications, data reliability inevitably demands data redundancy. Replication is the simplest form of redundancy and is commonly used in distributed file storage systems. Among the various methods for achieving redundancy, erasure coding offers better storage efficiency. Generally, a file of size S is segmented into k parts, each of size S/k; these parts are then encoded into n coded parts using an (n, k) maximum distance separable (MDS) code; finally, the coded parts are stored on n nodes. In this way, the original file can be recovered from any set of k linearly independent coded parts. As a result, an optimal trade-off between reliability and redundancy can be achieved. Several studies use erasure codes to reduce data redundancy [6-8].
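To make the erasure-coding idea concrete, the following is a minimal sketch rather than the coding scheme used in this paper. It assumes the smallest possible case, two data parts plus one XOR parity part, which is a simple MDS code: the file survives the loss of any one of the three stored parts. The function names `encode` and `decode` are hypothetical.

```python
from typing import List, Optional


def encode(data: bytes) -> List[bytes]:
    """Split data into two equal parts and append one XOR parity part."""
    if len(data) % 2:
        data += b"\x00"  # toy padding; a real scheme would record the true length
    half = len(data) // 2
    a, b = data[:half], data[half:]
    parity = bytes(x ^ y for x, y in zip(a, b))
    return [a, b, parity]


def decode(parts: List[Optional[bytes]]) -> bytes:
    """Rebuild the data from any two of the three coded parts (None = lost)."""
    a, b, parity = parts
    if a is None:
        a = bytes(x ^ y for x, y in zip(b, parity))
    elif b is None:
        b = bytes(x ^ y for x, y in zip(a, parity))
    return a + b
```

With replication, tolerating one lost node for a file of size S costs 2S of storage; this parity sketch tolerates the same loss with 1.5S, which is the storage-efficiency advantage referred to above.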

At present, both academia and industry are paying attention to distributed ledger and blockchain technology and its considerable potential for managing complicated systems. A distributed ledger is composed of a number of blocks [9] chained backward through a linked list of hash pointers, and the data blocks store valid, sequential transactions involving digital assets (see Figure 1).
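The hash-pointer structure can be illustrated with a short sketch. This is a simplified model under stated assumptions, not the ledger format of any particular blockchain: blocks are plain dictionaries, SHA-256 over a canonical JSON encoding serves as the hash pointer, and the helper names `block_hash`, `append_block`, and `is_valid` are hypothetical.

```python
import hashlib
import json
from typing import Dict, List


def block_hash(block: Dict) -> str:
    """SHA-256 over a canonical JSON encoding of the block."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_block(chain: List[Dict], transactions: List[str]) -> None:
    """Append a block whose prev_hash points back to the current last block."""
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"index": len(chain), "prev_hash": prev, "transactions": transactions})


def is_valid(chain: List[Dict]) -> bool:
    """Check that every block still points at the hash of its predecessor."""
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )
```

Changing the transactions of any block that already has a successor changes its hash, so the successor's `prev_hash` no longer matches and `is_valid` returns `False`.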

Nevertheless, when any data block is attacked, the source file cannot be recovered, since all segmented pieces are linearly independent. To address this issue, the scheme in [8] reduces computational complexity by probabilistically verifying blocks of data. Liang et al. [10] put forward a recommendation scheme to obtain data credibility. Other work shows that in-network verification with a homomorphic signature can provide data credibility [11, 12], applies a polynomial-time algorithm to protect networks from malicious attacks [13], and designs a scheme that resists pollution attacks by using a polynomial hash function [14].

In this paper, to address the data integrity and credibility issues, we adopt a blockchain-based approach to distributed file storage. The main contributions are as follows: first, we use a distributed hash table to enhance the credibility of stored files; second, we propose a blockchain-based framework for a distributed file storage system to achieve integrity and security; finally, we discuss load balance, throughput, and the risk of attacks in detail.

2.1. Distributed File Storage System

Nowadays, there are various types of distributed storage systems, such as cloud storage systems and peer-to-peer (P2P) storage systems. In all these systems, such as Amazon S3, data can be stored, archived, and backed up on distributed nodes. Users can access their stored files anytime and anywhere, which is an outstanding advantage of distributed storage systems. Much research has focused on the design and construction of distributed file systems. Napster [15], Kazaa [16], and Gnutella [17] implemented distributed file systems and made this an exciting and popular research area. Bit-torrent [18] is one of the most popular and successful peer-to-peer distributed file systems and currently has more than 100 million online users. It is deployed at large scale, with millions of users logging in and out every day. The storage resources and clients of a distributed file system are scattered across the network. In these systems, users act as both creators and consumers of data; therefore, incentives need to be provided through a secure and efficient approach.

Various recent studies have explored and evaluated distributed file storage systems [2, 6, 19-22]. Several works [2, 7, 8, 22-26] proposed and evaluated redundancy management strategies. Among these, replication and erasure codes are compared in terms of the bandwidth-reliability trade-off in [2, 7, 8]. The authors of [7] argue that erasure codes achieve better bandwidth performance than replication, and [2] confirms this conclusion on a distributed file storage system. Based on a novel data clustering optimization model, Liang et al. put forward an intrusion detection algorithm for industrial networks [27]. A hybrid strategy is proposed in [8] to address the repair problem: the node storing the replica can generate new pieces and deliver them to new users, so only a single new segment has to be transferred. Nevertheless, maintaining an extra replica decreases bandwidth performance, because new segments cannot be produced when the replica is polluted or lost. Meanwhile, static policies for data block replication are supported in [28]. These schemes must be configured manually and primarily target archival purposes [29]. MixApart [30] and Rhea [31] study data retrieval using disk caching: [30] schedules tasks using remote data and local caching, while [31] uses a static analysis strategy to improve storage performance. Liang et al. [32] put forward an efficient protocol for identity authentication in the IoT environment.

2.2. Blockchain

Blockchain is a timestamped chain of blocks that is jointly maintained by all on-chain nodes. Every block acts as a container that aggregates on-chain transactions, and blocks are chained together by cryptography. That is, each block is linked to the chain by its hash value and signed with a private key. Once a new block is created, it is appended to the chain. In this way, the blockchain provides stable data storage in which any deletion or update of processed transactions is impracticable [33]. We make full use of this characteristic in the proposed work: all transactions are reliable without a third-party authority, and stored data are protected from repudiation. Moreover, user identity and authenticity are guaranteed by cryptography and digital signatures, so any illegal read or write is refused by the blockchain.

Bitcoin, regarded as the first practical blockchain, is a public distributed ledger that has played an essential role in promoting blockchain. Smart contracts [34] emerged afterward; a smart contract is an autonomous program that is deployed on the blockchain network and executes transactions automatically. In practice, smart contracts act as triggers [35]. For example, under a smart contract, a service will not hold funds unless all tasks in the contract have been finished. Ethereum builds on this idea and makes the smart contract a top-level feature. Nowadays, blockchain has become a promising topic in both industry and academia, and combining blockchain with a distributed file system is an exciting and promising solution in which the blockchain provides incentives and security for distributed files. Popular distributed file systems of this kind include IPDFS [36], Storc [37], Swarmer [38], and PPIO [32]. IPDFS is a peer-to-peer distributed file system used to store and access files, applications, websites, and data; Storc is a peer-to-peer decentralized cloud storage platform that allows users to share data without any third-party data provider; Swarmer, based on Ether, is a distributed storage platform and content distribution service; and PPIO is a programmable distributed storage network that lets users store and retrieve data on the web anywhere and anytime.

With the introduction of the blockchain, three distributed file systems use File-coin [39], Ether [40], and Meta-disk [41], respectively, as their incentive mechanisms. For the industrial blockchain network environment, Liang et al. proposed a secure data storage and recovery strategy [3], while Zhang et al. used the blockchain to improve 5G performance [42].

3. Problem Statement

Among the issues of integrity, trust, control, and credibility, this paper focuses on integrity and credibility for distributed file storage. Various systems and platforms for distributed file storage collect all kinds of data. Notably, this incurs a severe privacy problem, since most users have no knowledge of these actions, much less control over them. To solve this problem, we assume in this work that all provided services obey the smart contracts, especially the assigned protocols. On this basis, the proposed work addresses the following issues:

Data Credibility. Our research focuses on data credibility for distributed file storage: we should guarantee that authorized users control all of their personal data. Meanwhile, the systems and platforms treat services as guests with corresponding permissions.

Data Integrity. All data should be verified to guarantee the integrity of stored data. The full data trace is transparent to each authorized user, and any illegal modification is impractical on the platform.

Access Control. Every user should be granted access permissions on logging in to the system or platform. These permissions define which resources the user can use. Within these permissions, users can change the access range of their stored data. Meanwhile, all participating users must store their data access control policies on the blockchain. Thus, illegal access is hardly possible.

4. Our Solution

4.1. Distributed System

In this paper, we design a decentralized system. The system comprises three main parts: nodes, users, and services, as shown in Figure 2. Users store and use their distributed files, and all operations on distributed data are supported by the services; the nodes store users' distributed files, encrypted with their private keys. To simplify user authentication, we produce a message digest for each stored file on the chain. The blockchain is critical in the proposed system, since it accepts only two kinds of transactions: the former is used for access control, while the latter is used for user data storage or data retrieval. Both types of operations are exposed through the SDK (Software Development Kit), and users invoke them through the services.

To describe the proposed system in detail, we assume the following application service. A user who intends to store files on the proposed system first installs the system application and signs up. The system generates an identity for the user, and this identity is submitted to the blockchain together with the user's permissions. The user's files are segmented into pieces and encrypted with the shared keys. All segments are then stored on the nodes in a distributed way. Meanwhile, the encrypted files are submitted to the chain in a data transaction, and the blockchain maintains a special pointer produced from the hash of the segmented files.

When users issue data query requests, they use a data transaction together with the aforementioned unique pointer. Once the blockchain receives the request, it checks the users' identity by verifying their digital signature. Only users who pass verification can carry out operations within their authorized permissions. They can view their file data and modify the corresponding permissions, and all these operations are recorded on the blockchain. In this implementation, we use a DHT (distributed hash table) as an off-blockchain key-value store that interacts with the blockchain through an interface. The DHT is maintained by the nodes on the chain, so general operations such as read and write go through the DHT, which gives users' files high availability.
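The interplay between the off-chain key-value store and the on-chain pointer can be sketched as follows. This is an illustration under stated assumptions, not the implementation evaluated in this paper: a Python dictionary stands in for the distributed hash table, a plain list stands in for the ledger, segments are assumed to be encrypted already, and the names `OffChainStore` and `store_file` are hypothetical.

```python
import hashlib
from typing import Dict, List


class OffChainStore:
    """Stand-in for the DHT: a key-value map addressed by content hash."""

    def __init__(self) -> None:
        self._table: Dict[str, bytes] = {}

    def put(self, segment: bytes) -> str:
        """Store a segment and return its pointer (the SHA-256 of its content)."""
        pointer = hashlib.sha256(segment).hexdigest()
        self._table[pointer] = segment
        return pointer

    def get(self, pointer: str) -> bytes:
        """Fetch a segment and verify it against the pointer kept on chain."""
        segment = self._table[pointer]
        if hashlib.sha256(segment).hexdigest() != pointer:
            raise ValueError("segment does not match its on-chain pointer")
        return segment


def store_file(store: OffChainStore, ledger: List[str], segments: List[bytes]) -> List[str]:
    """Put every segment off chain and record only the pointers on the ledger."""
    pointers = [store.put(segment) for segment in segments]
    ledger.extend(pointers)
    return pointers
```

Because the pointer is derived from the segment's content, a node that returns a modified segment is detected on retrieval, while the ledger itself only ever holds fixed-size hashes.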

4.2. Building Blocks

In this subsection, we describe in detail how the blocks are built. The process of building blocks follows Bitcoin [43].

(1) Identities. To identify users, we use a pseudo-identity scheme. Each user on the chain can produce pseudo-identities from their public keys, and the practical requirements determine the number of pseudo-identities, which can considerably improve the user's privacy. In this work, we explore compound identities, which originate from existing theory. As there may be more than two participators in a transaction, some participators hold the compound identity, while the remaining ones have no permission to use it. Protocol 1 describes how we implement this operation, assuming there are only one owner and one guest. To guarantee credibility, we use asymmetric key pairs to authorize the user's identity, whereas to encrypt and decrypt the user's distributed files we use a symmetric key, which improves the efficiency of encryption and decryption. In this way, all data are secure for each user. The compound identity is defined from these keys, and the whole identity is formulated as a 5-tuple.

(2) Policy. We define the policy as the group of permissions that a data owner o grants a guest g. For example, if the owner deploys an application that calls for access to the owner's location or contacts, the policy consists of exactly these permissions. All types of data are stored in this way. In case a service deviates from the protocol or mislabels the data, safeguards against this should preferentially be built into the SDK. Moreover, with this method, every user can quickly verify the legality of a service, since any change is fully visible.

(3) Auxiliary Functions. A parsing function deserializes the message delivered in a general transaction, which contains the arguments. The checking function described in Protocol 2 validates whether the initiator has the relevant permissions, which guarantees the validity of each operation.

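Protocol 1: Compound identity description.
Procedure CompoundIdentity(o, g)
  o and g establish a secure channel
  o executes:
    generate a signing key pair for o
    generate an encryption key shared between o and g
    o shares the encryption key and its public signing key with g
  g executes:
    generate a signing key pair for g
    g shares its public signing key with o
  // Both o and g now hold the shared encryption key and both public signing keys
  Return the two public signing keys and the shared encryption key
end Procedure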

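Protocol 2: Verify permission.
Procedure ChkPolicy()
  s ← 0
  apolicy = H()
  if L[apolicy] ≠ ∅ then
    read the owner, the guest, and the policy stored at L[apolicy]
    if the initiator is the owner or
    (the initiator is the guest and the requested permission is in the policy) then
      s ← 1
    end if
  end if
  return s
end Procedure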

To guarantee the validity of each operation, we create a checking-policy function to verify permissions, which simplifies access verification compared with existing approaches [22, 35, 39].
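A permission check of this kind can be sketched in a few lines. The sketch below is an assumption-laden illustration, not the authors' protocol: the ledger is modeled as a dictionary, the policy address is assumed to be the SHA-256 hash of the owner and guest keys, keys are plain strings, and the names `PolicyEntry`, `policy_address`, and `chk_policy` are hypothetical.

```python
import hashlib
from typing import Dict, FrozenSet, NamedTuple


class PolicyEntry(NamedTuple):
    owner_key: str                 # public signing key of the data owner
    guest_key: str                 # public signing key of the guest (service)
    permissions: FrozenSet[str]    # permissions the owner granted to the guest


def policy_address(owner_key: str, guest_key: str) -> str:
    """Assumed addressing scheme: hash of the two public keys."""
    return hashlib.sha256((owner_key + "|" + guest_key).encode("utf-8")).hexdigest()


def chk_policy(ledger: Dict[str, PolicyEntry], address: str,
               caller_key: str, permission: str) -> int:
    """Return 1 if the caller may perform the operation, otherwise 0."""
    s = 0
    entry = ledger.get(address)
    if entry is not None:
        is_owner = caller_key == entry.owner_key
        is_allowed_guest = caller_key == entry.guest_key and permission in entry.permissions
        if is_owner or is_allowed_guest:
            s = 1
    return s
```

The owner always passes the check, whereas the guest passes only for permissions that appear in the stored policy, so revoking a permission on the ledger takes effect on the next call.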

4.3. Smart Secure Contracts

In this section, we explore a blockchain-based framework for a distributed file storage system to achieve security. As depicted in Figure 3, the framework is composed of two parts:
(1) A private variate in the contract, which contains each user's operation data and denotes the user's private data on which the computation over distributed data is carried out. Consider an open auction: only the winner's final bid reaches the seller, and all other bids are refused. This variate thus guarantees the security of users' distributed data.
(2) A public variate, which contains none of the user's private data and denotes users' public data. We also define the cryptographic protocol that is used during transactions on the chain.

Security guarantees. The security guarantees mainly include the following aspects:
(i) Chain-to-Chain Privacy. Chain-to-chain privacy means that a user's distributed files or data are protected against any users not included on the blockchain, unless the legal users intend to disclose their information to others. In our proposed protocols, all users exchange data and depend on the blockchain to guarantee fairness. That is, all users transmit their encrypted files or data to the chain, and all transactions are based on zero-knowledge authorization.
(ii) Security. As chain-to-chain privacy shields users' data from the public chain, all users' data are entirely independent of each other. Meanwhile, asymmetric encryption guarantees authenticity and confidentiality. We take a public auction as an example to describe the security of the scheme; Algorithm 1 shows the process. In this example, the auction transaction indicates who wins the bid and how much the winner should pay. Meanwhile, a variate that depends on the deposits is used to prevent winners from abandoning their bids.

Algorithm 1: Public auction contract.
Declare Member(Seller, /* m parties */)
Declare Timeouts(/* timeouts */)
Declare Function contract auction(In &input, Out &output)
  Set win = −1
  Set btprice = −1
  Set secdprice = −1
  for each j < m do
    if input.pat[j].value > btprice then
      Let secdprice = btprice
      Let btprice = input.pat[j].value
      Let win = j
    else if input.pat[j].value > secdprice then
      Let secdprice = input.pat[j].value
  // The winner pays the second price to the seller
  // The others are refunded
  Let output.seller.value = secdprice
  Let output.pat[win].value = btprice − secdprice
  Let output.win = win
  for each j < m do
    if j ≠ win then
      Let output.pat[j].value = input.pat[j].value

An auction has the requirements specified above, especially in terms of security and confidentiality, and these are met by using cryptocurrency, as in some existing systems [44, 45]. The program depicted in the algorithm declares three timeout parameters, P1, P2, and P3. P1: the contract stops receiving bids after P1. P2: each bidder must reveal its price within time P2; otherwise, its input bid is regarded as 0, so that the auction transaction can continue. P3: if the auction manager abandons the auction, bidders may withdraw their bids once time P3 elapses.

One variate plays an essential role in the auction transaction: it not only checks the time but also manages the timeouts. The system invokes the corresponding function only if the operation completes within P3; otherwise, it invokes the manager's TimeOut function.
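The settlement logic of the auction contract can be made runnable as a plain function. This is a sketch of the second-price rule only, assuming at least two bids and ignoring deposits, timeouts, and all cryptographic machinery; the function name `second_price_auction` and its return layout are hypothetical.

```python
from typing import Dict, List, Tuple


def second_price_auction(bids: List[int]) -> Tuple[int, int, Dict[int, int]]:
    """Settle a sealed-bid auction in which the winner pays the second-highest bid.

    Returns (winner index, amount paid to the seller, amount returned to each party).
    """
    if len(bids) < 2:
        raise ValueError("this sketch assumes at least two bids")
    winner, best, second = -1, -1, -1
    for j, value in enumerate(bids):
        if value > best:
            second, best, winner = best, value, j
        elif value > second:
            second = value
    # Losing bidders get their full bid back; the winner gets back the
    # difference between its own bid and the second price.
    returned = {j: value for j, value in enumerate(bids) if j != winner}
    returned[winner] = best - second
    return winner, second, returned
```

For bids of 30, 50, and 40, the second bidder wins, the seller receives 40, and the winner is returned 10, while the other two bidders recover their full bids.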

5. Theoretical Analysis

5.1. Credibility

In a blockchain system, all nodes are assumed to be untrustworthy; that is, every node must be verified. Moreover, a node's computing resources determine its credibility level [43]. For example, the resources of a node determine how much weight its vote carries, which indicates whether the node is vulnerable when energy consumption is high or transaction latency is high.

In this paper, to formulate the trust value of all nodes, we compute the trust value for each data block of a node, with the step size defined by β.

In this formulation, on-chain nodes are regarded as having higher weight and higher computational efficiency. For this reason, these nodes are able to resist fraud attacks.

5.2. Risk of Attack

In a blockchain, there is a risk of fraud, but mounting it is difficult, because the risk materializes only when an attacker controls about 51% of the network. The risk nevertheless exists, although it is hardly possible for such a high percentage of nodes to fail. The current public blockchain structure is vulnerable in some particular scenarios, for example, software updates and changes to blockchain entries. This is because, while transactions are being processed, any new participator can learn the decisions of the network. Under the network's majority rule, whoever controls 51% of the ongoing transactions could perform any operation on the chain.

A 51% attack on the chain may sharply increase vulnerability, as a fork attack may occur. This is because two or more networks then share the resources of a single network, leading to a quick decrease in computational power. As a result, the cost of launching an attack on the network is lower, which leads to growing risks for the network.

The probability that an attack chain can catch up with the honest chain from z blocks behind is shown in the following equation:

$$q_z = \begin{cases} 1, & p \le q, \\ (q/p)^z, & p > q, \end{cases}$$

where p represents the probability that honest miners find the next block, q is the probability that attackers find the next block, and q_z is the probability that attackers change the trading content of the current blocks.
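As a numerical illustration, the catch-up probability can be computed with the gambler's-ruin result from the Bitcoin white paper, which is assumed here to be the formulation intended above. In this sketch, q is the attackers' probability of finding the next block, p = 1 − q is the honest miners' probability, and z is the number of blocks the attackers are behind; the function name `catch_up_probability` is hypothetical.

```python
def catch_up_probability(q: float, z: int) -> float:
    """Probability that attackers ever catch up from z blocks behind.

    q is the attackers' chance of finding the next block; honest miners
    find it with probability p = 1 - q.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be a probability")
    if z < 0:
        raise ValueError("z must be non-negative")
    p = 1.0 - q
    if q >= p:
        return 1.0  # with a majority of the power, catching up is certain
    return (q / p) ** z
```

Under this model the probability falls exponentially with z as long as the attackers hold less than half of the computing power, and it jumps to 1 once they reach a majority, which is the 51% threshold discussed in this subsection.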

5.3. Load Balance

In our research, load balancing is an important index. We aim to distribute all requests efficiently across the nodes, which can greatly improve load balancing. In this paper, every node sustains as many online I/O connections as possible; we use the variate NumAct[sj] to represent the number of active connections of storage media sj. These variables are stored on the nodes within the corresponding transactions. Generally, the more connections to a storage medium, the lower the proportional throughput of the node. That is, to achieve better load balancing, the participating nodes should have fewer connections. For ease of description, we introduce a load-balancing function, which approaches its maximum when all storing nodes have the least count of active connections and approaches its upper bound when the very least number of connections occurs. This is a notable property for each storage medium. As a result, we can optimize the function and obtain the following formulation:
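The intuition that requests should go to the storage media with the fewest active connections can be illustrated with a short sketch. It is not the optimized function of this subsection, only a least-connections dispatcher under stated assumptions: the dictionary `num_act` plays the role of NumAct[sj], ties are broken by node name, and the function names `pick_node` and `dispatch` are hypothetical.

```python
from typing import Dict, List


def pick_node(num_act: Dict[str, int]) -> str:
    """Choose the storage node with the fewest active connections."""
    if not num_act:
        raise ValueError("no storage nodes available")
    # Sorting the names first makes tie-breaking deterministic.
    return min(sorted(num_act), key=lambda node: num_act[node])


def dispatch(num_act: Dict[str, int], requests: int) -> List[str]:
    """Assign requests one by one, updating the connection counts in place."""
    chosen = []
    for _ in range(requests):
        node = pick_node(num_act)
        num_act[node] += 1
        chosen.append(node)
    return chosen
```

Repeated dispatching drives the connection counts toward each other, which corresponds to the balanced state in which every node keeps its number of active connections as low as possible.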

5.4. Throughput

For distributed file storage systems, we aim to achieve the best throughput performance. We store data in tiers so that we can make full use of the fast storage characteristics of each tier, which helps optimize throughput. When a request arrives, the system checks the read and write throughput of the storing node nodei with a quick I/O test; the read and write throughput are denoted ReadTh[nodei] and WriteTh[nodei], respectively. We then calculate the average values and store them on the node.

To approach the maximum throughput, every operation on distributed file storage should use the optimal write or read throughput. To obtain a common scale, we take each node's throughput in proportion to the peak value as the final value. Moreover, to scale the throughput values down, we apply the logarithm function to them.
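One way to combine the peak-relative value with logarithmic scaling is sketched below. The exact function used in this paper is not reproduced here; the sketch assumes a score of log(1 + t) / log(1 + peak) for a node with average throughput t, and the name `throughput_scores` is hypothetical.

```python
import math
from typing import Dict


def throughput_scores(avg_throughput: Dict[str, float]) -> Dict[str, float]:
    """Log-scaled throughput of each node relative to the fastest node.

    avg_throughput maps a node name to its averaged throughput in MB/s,
    for example the stored average of repeated quick I/O tests.
    """
    peak = max(avg_throughput.values())
    if peak <= 0:
        raise ValueError("peak throughput must be positive")
    return {
        node: math.log(1.0 + t) / math.log(1.0 + peak)
        for node, t in avg_throughput.items()
    }
```

With this choice the fastest node scores exactly 1 and slower nodes score between 0 and 1, and the logarithm compresses large gaps so that a single very fast tier does not dominate the placement decision.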

Similarly, we formulate the throughput function as:

For the storage nodes, computing this function gives the throughput of a storage node when a specified number of nodes with the optimal throughput are on the chain. When many nodes have optimal throughput, the function approaches its upper bound. Therefore, we can optimize the function and formulate it as:

5.5. Theoretical Results

In Table 1, we assume that a number of participators intend to calculate a one-bit outcome and send it to all participators. We compare the approaches in [46, 47] with blockchain and conclude that blockchain is the most useful for distributed file storage. Public storage in a blockchain-based system was first demonstrated in [48]. Fairness is hardly achievable in general models for multiparticipator transactions, as discussed in [49, 50]. Based on the script language, some works construct abstractions for protocols, for example, “Declare-or-refuse” [46] or “multiple locks” [47].

5.6. Blockchain Application Systems

Table 2 shows four blockchain-based application systems, which we compare in terms of blockchain form, protocol, cryptocurrency, and intelligent contracts. Super-ledger [51], an open-source system based on blockchain, was developed to improve the efficiency of distributed file storage. It is written in a high-level language, supports any application on the chain, and also supports distributed components and membership management.

The multiple chain [52] system aims to create private keys for users and to deploy the blockchain. It extends its core API with further APIs for managing all transactions, assets, and resources. The system offers good operability for users interacting with the network: users can directly use command-line tools, and distributed clients can carry out transactions with the network through JSON, notably from Ruby, Node.js, and Cij. This makes the system very convenient to operate.

ETH [53] is currently very popular for distributed file storage, especially because of its support for smart contracts on the blockchain. The platform can run on both physical computers and virtual machines, and it can be programmed in a general-purpose programming language. Therefore, it is an attractive platform for users of distributed file storage.

LTC [54], which is listed in Table 2, is a public-chain-based technology for distributed file storage systems. Its distinctive features include fast transactions and high efficiency for file storage. As all of its transactions are executed in memory, it needs very few nodes to participate in computations, even though its transactions are encrypted and signed in either a symmetric or an asymmetric manner.

From Table 2, we can conclude that most systems offer few smart contracts, which might cause risks for blockchain applications. In this system, when a user deploys a blockchain, there is a trade-off between cryptocurrency and blockchain. Moreover, this system is capable of supporting all blockchain-based applications. Users can assemble their own infrastructure, as on popular cloud platforms such as Amazon and Google.

6. Experiment and Evaluation

6.1. Experimental Results

The evaluation is executed on a Windows 10 machine equipped with an Intel(R) Core(TM) i7-7700M CPU @ 3.60 GHz and 16 GB of RAM. The transaction nodes are deployed in a virtual machine running Ubuntu 16.04.

Processing time overhead refers to the time consumed by the transaction nodes to verify data blocks. The experimental results are shown in Figure 4. At first, the processing time is about 5.5 ms. As more data blocks are generated, the processing time overhead increases; in particular, it rises sharply when the number of blocks grows from 20 to 30. After that, the time overhead increases smoothly. When the number of blocks reaches 60, the processing time overhead is about 71 ms.

6.2. Evaluation

To support our proposed scheme, we carried out evaluations with the widely adopted DFSIO benchmark [4], which focuses on measuring network throughput for users' general operations, such as read and write. This benchmark is itself based on a distributed approach.

The principal evaluation methodology is as follows: we focus on the data storing policy and the optimization goals, especially data write and read throughput. Moreover, we compare the proposed scheme with HDFS [4] and the rule-based strategy.

Figure 5 shows the comparison result: we carried out about 20 evaluation runs and obtained the average write throughput of every node. The blockchain-based method achieves the highest throughput, about 138 MB/s, mainly because it makes full use of its advantages, such as the optimal design of the storage tiers. However, the curve descends smoothly when the storage space (mainly memory) is mostly consumed, which is a universal phenomenon in distributed file storage systems. HDFS performs the worst, with an average throughput of only about 88 MB/s, because it ignores storage metrics.

Figure 6 shows the results for READ throughput. The HDFS strategy varies smoothly around 99 MB/s, and HDFS with the SSD policy exhibits a similar trend, while our proposed scheme achieves the best performance. The reasons are twofold. First, the proposed scheme distributes all requests equally across all nodes. Second, the proposed scheme uses more HDD storage media; accordingly, it can write more data blocks than the other policies. Both the original HDFS and HDFS with the SSD policy perform worse for reading data, and the original HDFS yields the worst read performance.

7. Conclusion and Future Work

Distributed file storage systems are susceptible to malicious use and fraud attacks, and users sometimes do not have full control over their data. In this paper, we explore blockchain for distributed file storage, so that users no longer require a third party and retain close supervision of their own data. Our analysis and evaluations show that the proposed scheme significantly improves data integrity and credibility for distributed file storage. Besides, based on blockchain, decisions on distributed file storage become easier and more reasonable. Finally, we discussed the latest related systems in detail and demonstrated the advantages of the proposed work in distributed file storage.

As future work, considering the network latency of the blockchain-based system, we will investigate the time tolerance of blockchain-based distributed file systems and focus on combining network coding with blockchain to explore optimal network performance.

Data Availability

The test data, simulation data, and the proposed method used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflict of interest.

Acknowledgments

This work is supported by the Science and Technology Projects of Hunan Province of China under Grant No. 2017SK1040, the Science Research Project of Education Department of Hunan Province under Grant No. 19B180, the Natural Science Foundation of Hunan Province under Grant No. 327 2018JJ2107, the Scientific Research Project of Guangzhou College of Technology and Business under Grant No. KA202031, the National Natural Science Foundation of China under Grant No. 61702180, and the Natural Science Foundation of Hunan Province under Grant No. 2019JJ50167.