Abstract

In the era of big data, data has become an extremely valuable resource. With the development of the Internet of Things, everything is interconnected, and the sharing and circulation of data are particularly important. This paper studies data sharing based on blockchain and big data technology and proposes a data-sharing system built on the HDFS file system. It strengthens the security of data sharing with blockchain technology and proposes a security key for industrial data. The experiments show that the data-sharing system is robust and that choosing an appropriate threshold and total number of attribute authorities, such as (3, 5), (5, 10), or (6, 10), can balance computational overhead and security.

1. Introduction

With the advent of the era of information globalization, the Internet of Things, as a new type of technology, has assumed a critical role in everyday life. It is not only a technological advancement but also the start of a new revolution in social life. As Internet of Things technology continues to advance, new fields have gained vitality, which helps improve the economy and supports the long-term, stable development of society. To achieve comprehensive development, the Internet of Things must fully respect the laws of the market and combine government supervision with market regulation. Promoting stable social development requires grasping market trends while pursuing technological innovation, achieving balanced and coordinated development across multiple fields, and steering economic growth in the right direction.

With the continuous development of technologies, such as cloud computing, big data, and the Internet of Things, the importance of identity management and data sharing to online services is self-evident. In the face of increasing security threats, how to establish a stable trust system for multiple parties involved in identity management and data sharing in a distributed environment, to resist malicious attacks and protect user privacy, has become an urgent problem to be solved. Therefore, it is necessary to study the data sharing and circulation of the industrial chain.

Building on previous work on data sharing mechanisms, this paper makes two contributions. (1) It proposes a data sharing system based on the HDFS file system, which is the core technology of the big data sharing mechanism. Drawing on the wireless sensor network technology of the Internet of Things, it upgrades the HDFS file system in terms of efficiency and transmission volume to meet the needs of massive data transmission. (2) It focuses on the security of data circulation in the sharing system. Based on blockchain technology, the permissions on shared data and the encryption, decryption, and other steps of transmission are described in detail to secure the data.

Data sharing has long been a research focus. Jin et al. argue that most existing ABE solutions suffer from high computational overhead and weak data security, which seriously hinders customized services on resource-constrained mobile devices. They address this problem by proposing a new attribute-based data sharing scheme, and their performance analysis shows that the scheme is secure and efficient [1]. Doel et al. studied medical imaging data. Their experimental results suggest that GIFT-Cloud simplifies the transfer of image data from clinics to research institutions, facilitates the development and validation of medical research software, and supports sharing results with clinical partners [2]. Changkun et al. propose a quality-aware data-sharing marketplace in which users who collect sensing data can sell it to others who request the data but do not want to sense it themselves. Their simulation results show that P2P data sharing can greatly improve social welfare, especially in models with high transmission costs and low transaction prices [3]. Xue et al. observe that clinical data are difficult to verify, store, and synchronize, which makes clinical data sharing among institutions hard. They proposed a decentralized, highly secure, collectively maintained, and tamper-resistant blockchain-based medical data sharing model [4]. Another study analyzed medical data released by the British Ministry of Health, examined the data sharing mechanism, and conducted extensive experiments on data security issues [5]. Liu et al. proposed “Verifiable Search Encryption with Aggregated Keys,” in which a data owner who wants to selectively share the search and verification rights of a document set only needs to distribute a single aggregated key to other users [6]. Patranabis argues that online data sharing is one of the most important requirements of any organization today, because it increases productivity and efficiency. The advent of cloud computing pushes the limits of sharing across geographies and enables numerous users to contribute and collaboratively share data [7]. Xue et al. proposed a novel fog-to-cloud architecture for data sharing in VCC. Their scheme is a cryptography-based mechanism that enables fine-grained access control. In their design, the heavy computational burden is securely outsourced to fog and cloud servers with confidentiality and privacy protection. By predicting vehicle mobility and pre-pushing data to a specific fog server, the scheme further reduces response delay without consuming more fog server resources. It also provides a verifiable audit of fog server reports, assuming there is no collusion between different providers of cloud and fog servers. The scheme is proven to defend against existing adversaries and emerging security threats, and experimental tests show significant overhead savings and reduced response latency for edge devices [8]. However, although research on data sharing is broad, each study is largely confined to its own field, with little diffusion across fields and no corresponding standards.

3. Blockchain and Big Data

3.1. Blockchain Technology

In modern society, with the rapid development of science and technology, networks have spread all over the world and penetrated every field. The Internet of Things is a key part of information technology and represents the current stage of the information age. As the name suggests, it is a network in which people and things are nodes connected to one another. The name also carries two deeper points. (1) The network is the core. The Internet of Things is built on the Internet and can only play its role through it; without the Internet, there would be no Internet of Things. (2) Its extension is wide. The connected objects can be things or people: things can be connected to things, people to people, and even two completely different objects to each other. The Internet of Things realizes its functions through communication and perception technologies such as intelligent perception and pervasive computing. Because of its great influence and wide coverage, it is called the third revolution in the information industry, after the computer and the Internet. Fundamentally, the Internet of Things is an application or a business that further extends the Internet. Given the pace of the Internet market, progress in building the Internet of Things requires putting innovation first, and innovation must keep the customer in mind: the core of the Internet of Things is the user experience.
The International Telecommunication Union (ITU) is an international organization that supervises and studies the development of the Internet. In a report on Internet development, the ITU gave a standard definition of the Internet of Things: a network that, through information sensing devices and under agreed protocols, connects things to the Internet to exchange information and enable further management. Common information sensing devices include two-dimensional code readers and infrared sensors. In practice, the Internet of Things is a network closely tied to people’s lives: any item in the world can be connected to other items or people through it, forming a huge network that is easier to manage and identify. In 2011, the Internet of Things industry reached an unprecedented scale, with a transaction volume of 260 billion yuan. Each of the five levels of the Internet of Things accounts for a share of this industrial scale, and the perception layer and the transmission layer have grown especially quickly.

3.1.1. Key Technologies

(i) Sensing technology: sensors play a vital role in today’s society and are an indispensable part of modern computing. Sensors are generally used to receive signals, and they make it possible to turn otherwise abstract quantities into concrete signals that can be perceived

(ii) RFID tags: RFID is essentially a sensor technology that integrates radio-frequency technology and embedded technology, and it has brought breakthroughs and innovations to the field of logistics

(iii) Embedded technology: this technology organically combines computer hardware and software and is in common use today. Although embedded technology has existed for only a few decades, it is quite mature, is used in many intelligent terminal products, and delivers a good customer experience. For example, portable music players and aerospace satellite systems both use embedded systems. Almost unnoticed, embedded technology has gradually changed people’s living habits and has greatly promoted industrial production and other fields. To use a vivid analogy, if the Internet of Things is compared to the human body, the embedded system is the brain: it processes and analyzes the information perceived by the senses and directs the body’s activities

3.1.2. Blockchain Technology

Since the birth of Bitcoin [9] in 2009, digital cryptocurrencies have had a huge impact on traditional finance. Blockchain is a distributed ledger technology. Consensus nodes link blocks in chronological order into a chained data structure and use cryptography to ensure that the chain is unforgeable and intact. Since anyone can hold a copy of the distributed ledger, transactions do not depend on a third party. The transaction process is transparent, traceable, and tamper-resistant, which establishes a robust trust system in a distributed, trustless environment.
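The chained structure described above can be sketched in a few lines. This is a minimal illustration rather than the data format of any particular blockchain: the field names (`index`, `prev_hash`, `transactions`), the use of SHA-256 over a JSON encoding, and the helper names are all assumptions made for the example, and consensus is left out entirely.

```python
import hashlib
import json


def block_hash(block):
    """Hash a block's contents deterministically with SHA-256."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_block(chain, transactions):
    """Append a block that commits to the hash of its predecessor."""
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"index": len(chain), "prev_hash": prev, "transactions": transactions})


def is_valid(chain):
    """Check that every block still points at the hash of the block before it."""
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


chain = []
append_block(chain, ["genesis"])
append_block(chain, ["A pays B 5"])
append_block(chain, ["B pays C 2"])
print(is_valid(chain))                       # True
chain[1]["transactions"] = ["A pays B 500"]  # tamper with history
print(is_valid(chain))                       # False
```

Because each block stores the hash of the previous one, changing an old block invalidates every later link, which is the property the text refers to as tamper resistance.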

The most commonly used blockchains are public chains and consortium chains. A public chain is a fully decentralized blockchain in which mutually unknown nodes verify transactions and receive incentives based on consensus, establishing a fully decentralized trust mechanism. A consortium chain is generally a blockchain jointly managed by several institutions. The confirmation time and transactions per second (TPS) of a consortium chain differ considerably from those of a public chain.

Smart contracts are deployed on the blockchain. They do not rely on any central authority, execute code automatically, and are flexible and programmable. Entities create and invoke smart contracts by sending transactions, and the contracts are controlled by code, so the designer cannot unilaterally tamper with the rules.

3.1.3. Data Storage of Wireless Sensor Network

As a subdivision of the Internet of Things, wireless sensor network technology integrates multiple technologies such as communication technology [10], sensor technology, information processing technology, embedded technology, and distributed technology. In the area covered by the sensor network, it can perceive data, that is, it can collect the relevant data of the monitored object in real time. And it can use its own computing power to process the data to a certain extent and finally send the data to the user through the network.

The wireless sensor network is an important module of Internet of Things technology and can store the data it collects. Its structure is shown in Figure 1. The network layer is responsible for transmitting the sensor data to the data management layer, and the data management layer is responsible for preprocessing the data. After the processing is completed, the data management layer sends the data to the application layer. The main design goal of the wireless sensor network data architecture is to provide transparent data services for upper-layer applications, so that users do not need to consider the underlying network design and original data types.

An important feature of a wireless sensor network is that it is data-centric: its data collection focuses on the values of specific observation indicators in the sensing area. Data collection in a wireless sensor network needs to be stable and have low redundancy. Commonly used collection methods include offline data collection, node preprocessing, and model-based data collection.

Compared with the traditional Internet data management system, the wireless sensor network data system design has two notable features. First, because the data in the network is generated by many sensor nodes distributed in different locations, the wireless sensor network data is distributed. Second, the data of wireless sensor network is dynamic. The data in the form of data flow generated in the sensor network is constantly changing in the process of processing, transmission, and delivery to the application layer user. The storage location and organization form are constantly changing, so the data management in the wireless sensor network has obvious dynamic characteristics.

3.2. Big Data Technology

The rapid development of the Internet has led to a sharp increase in the number of users and the volume of data in Internet applications. The massive data processing needs of users and enterprises can no longer be met simply by increasing the storage capacity of single-point devices. Meanwhile, the big data ecosystem built on distributed file systems to meet the requirements of massive data storage and computing has gradually matured. The Hadoop Distributed File System (HDFS), implemented following Google’s paper, has become the most widely used open source distributed storage architecture. MapReduce, an open source distributed computing framework built on this storage architecture, has reached its second version. As big data processing technologies have matured, the demand for data mining [11] has become increasingly apparent. Machine learning algorithms have also been applied to the big data ecosystem, and a series of open source machine learning [12, 13] frameworks for it, represented by Mahout and MLlib, have emerged.

With the increasing real-time requirements of enterprises and users for massive data processing, some emerging big data computing frameworks such as Spark have also attracted much attention in recent years and are increasingly used in the production field. The above open source technologies and frameworks have been widely used in the Internet field, but there are still many areas for improvement and adaptation for wireless sensor networks. This section introduces the popular open source frameworks and related machine learning algorithms and open source libraries in the aforementioned big data ecosystem.

HDFS was originally designed to run on inexpensive commodity hardware while keeping the time cost of data access low. The default size of each block in HDFS is 64 MB, and HDFS replicates the blocks of a file (three copies by default) on different hosts. Physically, the distributed file system is composed of multiple hardware nodes, which are divided into two categories according to their functions: name nodes and data nodes.
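As a quick worked example of these defaults, the sketch below counts how many blocks and replicas a file produces. The function name and return shape are illustrative assumptions; the 64 MB block size and replication factor of 3 are the values stated above, and real clusters may be configured differently.

```python
import math

BLOCK_SIZE_MB = 64   # default block size stated above
REPLICATION = 3      # default number of copies stated above


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (number of blocks, number of block replicas, raw storage in MB)."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = blocks * replication
    # The last block only occupies the bytes it actually holds, so raw
    # storage is the file size times the replication factor.
    raw_storage_mb = file_size_mb * replication
    return blocks, replicas, raw_storage_mb


print(hdfs_footprint(1000))  # (16, 48, 3000)
```

A 1000 MB file is therefore split into 16 blocks, stored as 48 block replicas, and consumes roughly three times its size in raw disk space.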

Figure 2 shows the basic architecture of HDFS.

The main job of the name node is to maintain the namespace of the distributed file system, which it stores in two main structures, FsImage and EditLog. FsImage maintains the file system tree and the metadata of all files and folders in it, while EditLog records all file operations, such as creation, renaming, and deletion. The EditLog keeps growing during operation. When the EditLog file is large, name node startup becomes very slow. During startup, HDFS is in safe mode and does not accept external write operations, which affects users. To solve this problem, the HDFS architecture uses a secondary name node (SecondaryNameNode) component, which periodically merges the EditLog into the FsImage so that the EditLog stays small. When the name node fails, it can be restored from the FsImage and EditLog data held by the SecondaryNameNode on the backup server.
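The relationship between FsImage and EditLog can be illustrated with a toy model. This is only a sketch of the idea of replaying a log onto a snapshot: the dictionary-based namespace, the tuple-based log entries, and the function names are assumptions for the example and do not reflect Hadoop's actual on-disk formats or APIs.

```python
def apply_edit(namespace, op):
    """Apply one logged operation to an in-memory namespace (path -> metadata)."""
    kind = op[0]
    if kind == "create":
        _, path, meta = op
        namespace[path] = meta
    elif kind == "rename":
        _, src, dst = op
        namespace[dst] = namespace.pop(src)
    elif kind == "delete":
        _, path = op
        namespace.pop(path, None)
    else:
        raise ValueError(f"unknown operation: {kind}")


def checkpoint(fsimage, editlog):
    """Merge the edit log into a new image and return it with an empty log."""
    new_image = dict(fsimage)
    for op in editlog:
        apply_edit(new_image, op)
    return new_image, []


fsimage = {"/data/a.csv": {"size_mb": 10}}
editlog = [
    ("create", "/data/b.csv", {"size_mb": 20}),
    ("rename", "/data/a.csv", "/data/a_old.csv"),
    ("delete", "/data/b.csv"),
]
fsimage, editlog = checkpoint(fsimage, editlog)
print(fsimage)  # {'/data/a_old.csv': {'size_mb': 10}}
print(editlog)  # []
```

After a checkpoint the log is empty, so a restart only has to load the merged image, which is why a long EditLog slows startup and a periodic merge avoids it.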

For fault tolerance and availability, the HDFS distributed file system stores data redundantly by backing up files. Usually, a data block is backed up as multiple copies that are distributed across other data nodes. Replication stores data blocks in two different data nodes. This multicopy storage method has the advantages listed in Table 1.

As shown in Figure 3, data is at the bottom layer of the DIKW model. It comes directly from facts: it is an intuitive reflection of what happened in reality and carries no other meaning. Information is valuable data that can be used directly after processing. Knowledge is valuable information that reveals the laws behind the data and forms theoretical guidance. Wisdom is the result of applying knowledge and can propose solutions to research problems.

4. Blockchain-Based Multiauthority Attribute-Based Encryption Data Sharing Scheme

Outsourced data contains a large amount of critical, highly private information. It must therefore never be stored in plaintext by third parties, and only data owners and authorized users may access it. For data privacy, even outsourced data stored as ciphertext must not leak personal information about the data owner; that is, the privacy of the access policy must be ensured. This paper proposes a blockchain-based multiauthority revocable CP-ABE scheme (MA-RABE) that ensures policy privacy and achieves efficient user revocation and predecryption by cloud servers. In addition, user registration and revocation by the AAs and the distribution of keys are publicly recorded on the blockchain as transactions.

Therefore, this scheme can further avoid the problem of centralized power in CP-ABE, effectively solve the trust crisis among the participants in data sharing, and improve the collaboration of participating parties [14].

4.1. System Architecture

As shown in Figure 4, the system model includes four entities: the data owner (DO), the data user (DU), the set of attribute authorities (AAs), and the cloud service provider (CSP). Because of the trust problem among multiple parties in a distributed environment, the blockchain is used as the underlying architecture of the scheme. The blockchain hosts the system contract of this scheme (hereinafter, the contract). The contract provides the AAs with the user registration (UserRegister), user revocation (UserRevoke), and user rejoin (UserRejoin) functions, as well as an interface for calling the PSSS protocol to negotiate and generate keys. The contract also contains a user binary tree and two lists that can store N tuples [15, 16].
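The idea of several authorities jointly holding a key through a threshold protocol can be illustrated with a small secret sharing sketch. The text does not specify how PSSS is constructed, so the code below uses plain Shamir secret sharing as a stand-in: a secret is split into n shares such that any k of them reconstruct it. The prime, the function names, and the use of Python's `random` module (which is not cryptographically secure) are assumptions for illustration only.

```python
import random

PRIME = 2**127 - 1  # a Mersenne prime; the field size is an illustrative choice


def split_secret(secret, k, n, rng=random):
    """Split `secret` into n shares so that any k of them reconstruct it."""
    # Random polynomial of degree k-1 whose constant term is the secret.
    # A real deployment would draw coefficients from a secure source.
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation modulo PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def reconstruct(shares):
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


shares = split_secret(secret=123456789, k=3, n=5)
print(reconstruct(shares[:3]) == 123456789)  # True: any 3 shares suffice
print(reconstruct(shares[2:]) == 123456789)  # True
```

Reconstruction interpolates over k shares, so its cost grows with the threshold, while fewer than k shares reveal nothing about the secret in this scheme.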

(i) DO: the DO owns a large amount of data, encrypts the data under an LSSS access policy that it formulates, and outsources the ciphertext together with the access policy to the CSP

(ii) DU: the DU registers its own attributes with the AAs. To access the DO’s data, the DU initiates a predecryption request to the CSP and obtains the symmetric key by exponentiating the returned predecryption result

(iii) AAs: the AAs are responsible for distributing the user’s attribute confusion key (AttributeConfusionKEY, ACK). They generate the node key (NodeKEY, NK) of the DU through a global user binary tree and generate the update key (UpdateKEY, UK), also known as the key update message, according to the KUNode algorithm. To prevent power from being centralized in a single AA, these operations are performed jointly by all AAs through PSSS

(iv) CSP: the CSP is responsible for storing the ciphertext uploaded by the DO and providing a proxy decryption service to all DUs. Note that this paper assumes that CSPs are semitrusted, i.e., honest but curious. Therefore, the CSP executes the protocol honestly but may try to learn and leak the private policy information in the ciphertext [17].
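The role of the user binary tree and the KUNode algorithm in revocation can be sketched as follows. The text names the algorithm without defining it, so this example follows the formulation commonly used in revocable encryption schemes: users sit at the leaves, and the algorithm returns the smallest set of nodes that covers every non-revoked leaf and no revoked one. The heap-style node numbering and the function names are assumptions for illustration.

```python
def ku_nodes(num_leaves, revoked_leaves):
    """Return the minimal node set covering all non-revoked leaves.

    The tree is a complete binary tree stored heap-style: the root is node 1,
    node v has children 2v and 2v + 1, and leaves are nodes
    num_leaves .. 2 * num_leaves - 1 (num_leaves must be a power of two).
    """
    # X: every node on a path from a revoked leaf up to the root.
    on_revoked_path = set()
    for leaf in revoked_leaves:
        node = leaf
        while node >= 1:
            on_revoked_path.add(node)
            node //= 2
    if not on_revoked_path:
        return {1}  # nobody revoked: the root alone covers every user
    # Y: children of nodes in X that are not themselves in X.
    cover = set()
    for node in on_revoked_path:
        if node < num_leaves:  # internal node
            for child in (2 * node, 2 * node + 1):
                if child not in on_revoked_path:
                    cover.add(child)
    return cover


def path_to_root(leaf):
    """Nodes from a leaf up to the root, inclusive."""
    path = []
    while leaf >= 1:
        path.append(leaf)
        leaf //= 2
    return path


cover = ku_nodes(num_leaves=4, revoked_leaves={5})
print(sorted(cover))                       # [3, 4]
print(bool(cover & set(path_to_root(6))))  # True: user 6 can still update
print(bool(cover & set(path_to_root(5))))  # False: user 5 is revoked
```

A user holds node keys for the nodes on its path to the root, and the update key is issued only for the cover set, so a user can update its key exactly when its path intersects the cover.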

4.2. System Construction Scheme

The data encryption process for data sharing in the system is shown in Figure 5.

The following is a detailed description of the scheme construction.

Choose a random security parameter and the maximum number of users as input, and generate a bilinear map as Equation (1):

The collision-resistant hash function is shown in Equation (2):

Calculated locally and submitted to the contract as in Equation (3), Equation (4), and Equation (5):

The global parameter is Equation (6):

User registration: each DU generates its own private key, as shown in Equation (7):

To verify that the DU owns an attribute, confirmation is required, as shown in Equation (8):

If the verification is successful, the obfuscated key token of the attribute is calculated, as shown in Equation (9):

DU decrypts with its own private key to obtain attribute obfuscation key .

The contract receives nodes to jointly generate the user’s node public key segmentation, and the calculation is as shown in Equation (10):

For each attribute, the calculation formula is as Equation (11) and Equation (12):

The user’s node key is shown in Equation (13) and Equation (14):

The contract receives the division of the key update message generated by nodes together, and the calculation is as shown in Equation (15) and Equation (16):

The key update message is shown in Equation (17):

To relieve the computational pressure of bilinear operations, predecryption can run on the CSP without revealing private information, and the predecrypted ciphertext CT is obtained as shown in Equation (18):

The user takes the received CT and performs an exponentiation with their own private key to obtain the symmetric key KEY, as shown in Equation (19):

Finally, the DO updates C0 and uploads the updated C0 to the CSP, as shown in Equation (20):

The abbreviations used in the above formulas are listed in Table 2 for easy reading.

4.3. System Security Architecture

The system security structure considers a blockchain-based crowd-sensing (BCS) system, which consists of a BCS service platform, blockchain consensus nodes, and data sharing nodes. Its main goal is to carry out data sharing tasks that yield safe, reliable, and high-quality shared data. In the BCS system, both the BCS service platform and the data sharing nodes are blockchain nodes; that is, they are registered with the certification center and obtain their identity information as blockchain nodes [18, 19]. As miners in the blockchain, the consensus nodes verify the identity information and transaction information of the task participants, and the smart contracts deployed on the blockchain and stored by the consensus nodes automate the data sharing process [20].

The complete data sharing process is shown in Figure 6, which includes the following six steps.

Registration: the data sharing node sends its own identity information and requests to register as a data sharing node. After receiving the registration transaction, the consensus node will execute the registration contract (RC) and verify the identity of the data sharing node. If the identity is valid, the RC will output a configuration file containing node identity information and node reputation value.

Release task: when the BCS system needs to collect data of a specific target, it will release the data sharing task through the BCS service platform. The task information includes target data to be collected, time requirements for completing the task, task budget, and data quality requirements.

Report coefficient of variation: according to the task information released by the BCS service platform, data sharing nodes that intend to participate in the task report their coefficients of variation.

Publishing the payment strategy: after the blockchain network obtains the registration information and coefficient of variation of the node, the Payment Policy Design Contract (PPDC) deployed by the BCS service platform will be triggered to execute, and the payment strategy will be output.

Upload data: the data sharing node decides on a data collection strategy and uploads data according to the payment strategy. Because data on the blockchain cannot be deleted, on-chain storage space is saved by storing only the hash value on-chain (that is, the collected data is hashed and the hash value is uploaded in the form of a transaction) and storing the plaintext off-chain (that is, the original data is uploaded as ordinary information).

Reputation Update and Reward Release: the uploaded hash data triggers the Reputation Update and Reward Release Contract (RURDC), through which outlier detection and true value estimation are performed on the data, and the data quality of the node is calculated. RURDC will output reward issuance transactions and reputation update transactions. Note that during the data sharing process, consensus nodes will periodically perform consensus, package transaction information into new blocks, and add them to the blockchain.
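Two of the steps above lend themselves to a short illustration: reporting a coefficient of variation and committing uploaded data by its hash. The sketch below is hedged accordingly. The text does not say how nodes compute their coefficient of variation or which hash function is used, so the sample-based definition (standard deviation divided by mean), SHA-256, and all function names are assumptions for the example.

```python
import hashlib
import statistics


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean (dimensionless spread)."""
    return statistics.stdev(samples) / statistics.mean(samples)


def commit(data: bytes) -> str:
    """Digest that would be stored on-chain in place of the raw data."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, on_chain_digest: str) -> bool:
    """Check off-chain data against the digest recorded on-chain."""
    return commit(data) == on_chain_digest


readings = [20.0, 22.0, 18.0, 20.0]
print(round(coefficient_of_variation(readings), 4))  # 0.0816

raw = b"sensor-17,2024-01-01T00:00:00Z,20.0"
digest = commit(raw)                # goes into a transaction
print(verify(raw, digest))          # True
print(verify(raw + b"0", digest))   # False: off-chain copy was altered
```

Only the fixed-size digest needs to live on the chain, while anyone holding the off-chain plaintext can check that it matches what was committed.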

4.4. System Security Analysis

To illustrate the advantage of the PSO-based system utility optimization algorithm, it is compared with system utility optimization algorithms based on the genetic algorithm (GA) and simulated annealing (SA). To compare the convergence and time complexity of the three algorithms fairly, the population size of both the PSO and GA algorithms is set to 50, and the number of iterations of the SA algorithm at each temperature is also set to 50. Table 3 shows the time complexity of the different algorithms. The time complexity of the PSO and SA algorithms is almost equal, while that of the GA algorithm is about twice as high. Figure 7 depicts the convergence of system utility under the different algorithms, where each value is the average of 50 runs. As the number of iterations increases, the PSO algorithm reaches the largest convergence value, its convergence speed lies between those of the GA and SA algorithms, and it is the most stable during convergence. Therefore, considering both time complexity and convergence, the PSO algorithm performs best.
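For readers unfamiliar with PSO, a minimal sketch of the basic algorithm is given here. It is not the system utility optimizer evaluated above: the objective is a toy one-dimensional function, and the inertia and acceleration coefficients, the bounds, and the function names are assumptions chosen for illustration. Only the population size of 50 is taken from the text.

```python
import random


def pso_maximize(utility, low, high, particles=50, iterations=100,
                 inertia=0.7, c_personal=1.5, c_global=1.5, seed=0):
    """Maximize a one-dimensional `utility` on [low, high] with basic PSO."""
    rng = random.Random(seed)
    pos = [rng.uniform(low, high) for _ in range(particles)]
    vel = [0.0] * particles
    best_pos = pos[:]                          # personal best positions
    best_val = [utility(x) for x in pos]       # personal best values
    g = max(range(particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g], best_val[g]    # global best so far
    for _ in range(iterations):
        for i in range(particles):
            r1, r2 = rng.random(), rng.random()
            vel[i] = (inertia * vel[i]
                      + c_personal * r1 * (best_pos[i] - pos[i])
                      + c_global * r2 * (g_pos - pos[i]))
            pos[i] = min(high, max(low, pos[i] + vel[i]))  # stay in bounds
            val = utility(pos[i])
            if val > best_val[i]:
                best_pos[i], best_val[i] = pos[i], val
                if val > g_val:
                    g_pos, g_val = pos[i], val
    return g_pos, g_val


# Toy utility with a single peak at x = 3 (not the paper's objective).
x_best, u_best = pso_maximize(lambda x: -(x - 3.0) ** 2, low=-10.0, high=10.0)
print(abs(x_best - 3.0) < 0.1)
```

Each particle is pulled toward its own best position and the swarm's best position, which is what gives PSO its fast and comparatively stable convergence on smooth objectives.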

Figure 8 depicts the number of nodes uploading qualified data, that is, the theoretical and actual values of the system utility, over 10 consecutive data sharing tasks starting from the initial time. The theoretical value is the maximum obtained by solving the optimization problem, while the actual value is the number of nodes whose data quality is qualified after the system performs outlier detection and true value estimation on real data. As shown in Figure 8, as the number of data sharing tasks increases, the theoretical value of the system utility decreases while the actual value rises, until the two converge. At the initial time, every node’s reputation is 1. Because too many nodes meet the quality requirements and the system’s cost would exceed the budget, the system prioritizes nodes with a lower coefficient of variation. Malicious nodes and many faulty nodes are therefore selected, so the actual value of the system utility is low, while the theoretical value is the maximum allowed by the current budget. In the second task, the reputation values of the faulty and malicious nodes that uploaded data in the first task are updated to smaller values, so more normal nodes with a relatively large coefficient of variation are selected, and the actual value increases. Since the number of nodes that meet the requirements is still large and the budget is still insufficient, the theoretical value remains the maximum under the current budget. In later tasks, the reputation values of malicious nodes and of some faulty nodes with a relatively small coefficient of variation become too small for them to be selected. With a limited budget, and because the remaining nodes have a relatively large coefficient of variation, the number of nodes that can be selected, that is, the theoretical value, decreases, while the actual value increases.
Eventually, the reputation values of the abnormal nodes are small enough that the system no longer selects them, so the theoretical and actual values of the system utility tend to the same value. Figure 8 illustrates that the proposed reputation update strategy makes the actual utility of the system increase over successive tasks. In addition, the gap between the theoretical value and the actual value gradually narrows, which makes the determination of payment parameters more reasonable.

4.5. System Effectiveness Analysis

Since the scheme adopts (k, n) PSSS to distribute user attribute keys and manage users, it can resist a single point of failure, and its robustness is affected by the threshold k and the total number of AAs n. Suppose an attacker prevents some AAs in MA-RABE from running normally, so that fewer than k AAs run normally. Assuming that each AA fails independently with the same probability, the probability of system downtime follows a binomial distribution. Figures 9(a) and 9(b) show line graphs of the probability of system downtime as a function of the downtime probability of a single AA when n is 5 and 10, respectively.

Suppose instead that an attacker tries to control the whole system by controlling some AAs in MA-RABE, which requires controlling at least k AAs, where k is the threshold. Assuming that each AA is controlled independently with the same probability, the probability that the system is controlled also follows a binomial distribution.

Figures 10(a) and 10(b) show line graphs of the probability of the system being controlled as a function of the probability of a single AA being controlled when the total number of AAs n is 5 and 10, respectively.

The analysis of Figures 9 and 10 shows that the system of this scheme is more robust than a system based on a single AA. When the threshold k is close to the total number of AAs n, the probability of downtime increases and the probability of being controlled decreases; conversely, when k is much smaller than n, the probability of downtime decreases and the probability of being controlled increases. On the other hand, when k is larger, the computational cost of reconstructing the secret increases. Therefore, choosing appropriate k and n, such as (3, 5), (5, 10), and (6, 10), balances computational overhead and security.
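The trade-off discussed in this section can be reproduced numerically. The sketch below writes k for the threshold and n for the total number of AAs, and assumes that AAs fail or are compromised independently with the same probability; the symbol names, the function names, and the sample probability of 0.1 are assumptions for illustration, not values taken from the experiments.

```python
from math import comb


def prob_at_least(m, n, p):
    """P[X >= m] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(m, n + 1))


def downtime_probability(k, n, p_fail):
    """System is down when fewer than k AAs run, i.e. at least n-k+1 fail."""
    return prob_at_least(n - k + 1, n, p_fail)


def controlled_probability(k, n, p_ctrl):
    """System is controlled when at least k AAs are controlled."""
    return prob_at_least(k, n, p_ctrl)


for k, n in [(1, 1), (3, 5), (5, 10), (6, 10)]:
    print((k, n),
          round(downtime_probability(k, n, 0.1), 6),
          round(controlled_probability(k, n, 0.1), 6))
```

With a per-AA probability of 0.1, the (3, 5) setting gives 0.00856 for both downtime and control, compared with 0.1 for a single AA. Raising k for a fixed n lowers the control probability but raises the downtime probability, which matches the qualitative behavior described above.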

5. Conclusions

Data-sharing systems are a product of a new era and play an ever-increasing role. The experiments show that the data sharing system designed in this paper performs well and is robust. This paper first briefly explains the background of data sharing, then reviews related research and motivates the present work. It then explains and analyzes blockchain and big data technology in detail. Finally, it designs a data sharing system based on these technologies. Some problems remain in the design. For example, the treatment of blockchain technology is not yet deep enough, which makes it difficult to read and delete data in the system. Future research should therefore study blockchain-related technologies in greater depth.

Data Availability

The dataset used to support the findings of this study is available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This study was supported by the Science and Technology Project Support of China Southern Power Grid Corporation Project Nos. 037800kk5220005 and gdkjxm20201754.