Abstract

Since Industry 4.0 was put forward in 2013, industrial processes around the world have been moving rapidly towards the age of intelligent manufacturing. Industry 4.0 is known as the fourth industrial revolution, dominated by intelligent manufacturing, which has changed the production mode of global manufacturing and triggered far-reaching industrial changes. However, when intelligent machines communicate with each other under Industry 4.0, a large amount of data under distributed control is generated. Infographics are mainly a visual design of Industry 4.0 data. Therefore, this paper mainly studies distributed data optimization processing for Industry 4.0. Considering that data leakage is one of the biggest challenges faced by data storage systems, this paper proposes a data storage method that considers both the efficiency and the security of data access. The concept of security distance not only guarantees data security but also takes into account the different emphasis that different user groups place on data security. This paper also proposes a data access node selection algorithm that minimizes data access time while ensuring data security. Simulations on the Internet2 topology and a random topology, run with Matlab and the OMNeT++ simulation platform, show that the proposed algorithm can select the optimal data storage nodes under the security distance constraint and that its data access time is lower than that of the baseline data storage algorithms.

1. Introduction

Industry 4.0 was proposed and applied at Hannover Messe in 2013 and is mainly aimed at the future manufacturing industry [1]. Following the three previous industrial revolutions, it integrates network technology and digital technology and represents the fourth industrial revolution, which has attracted wide attention in the global industrial field [2, 3]. At present, Industry 4.0 not only takes intelligent development as its primary target but also extensively applies advanced measures such as information technology, information interaction, and process reengineering. On the basis of meeting the personalized and differentiated needs of different consumers, flexible production is performed to achieve maximum decision optimization [4–6].

Today is the era of big data. In the era of Industry 4.0, the manufacturing industry will be built on an interactive platform based on the Internet and information technology. Industrial big data will become the core driving force of intelligent manufacturing [7, 8]. The main thinking direction of Industry 4.0 is to predict demand and production through data analysis and then use data to integrate the industry chain and value chain, so as to create greater value [9]. The production-related data are called the master data of the enterprise, which includes a series of product-related data such as design, process, modeling, test, maintenance, product structure, component configuration, and change records. These data are recorded, transmitted, and processed to enable the product to achieve life cycle management and further satisfy customers’ personalized product needs [10, 11].

The huge amount of data in Industry 4.0 makes people suffer from information overload. In recent years, data analysis, data processing, and data presentation have become research hotspots, among which presenting infographics to users is a key link that can enhance the readability and attractiveness of data and increase its acceptance and dissemination [12]. Infographics are an excellent way to present data and information concisely and clearly [13]. In the era of data and information explosion, higher requirements are put forward for the design of infographics, and how to present richer data content from multiple perspectives clearly and concisely has become a problem. In essence, infographic processing operates on a large amount of industrial data. The volume and diversity of industrial data make distributed systems the best choice for data storage and management. Currently, distributed data storage systems are divided into peer-to-peer (P2P) storage technology and cloud storage systems represented by cloud computing [14]. The advent of the big data era makes research on distributed storage systems of great significance. For mass data storage, distributed data storage surpasses traditional centralized storage technology with its good scalability, robustness, and high efficiency.

Distributed data storage uses a large number of low-cost PC servers that are widely distributed in different geographical areas and connected to each other to store massive data [15]. This storage method can greatly save storage costs, but the availability of nodes is low. Meanwhile, the expansion of data storage greatly increases the probability of system failure. Based on cloud computing, cloud storage technology can combine different devices and different types of data to work together through application software, distributed file system, cluster technology, and network technology. However, storage nodes in different locations have different storage capabilities and link bandwidths, making it difficult to improve data access speed [16, 17]. In terms of data access time, graph partitioning is widely used at present [18]. This method has sufficient mathematical theory as support, but graph partitioning does not consider the node performance and link performance comprehensively, so it cannot solve the actual problem. How to reduce data access time while ensuring certain data security is the key point of distributed infographic design for industry 4.0.

To meet security requirements and support distributed infographic design for Industry 4.0, the concept of a K-distance topological subgraph is proposed in this paper; that is, in an undirected graph, if there is a subgraph in which the distance between any two nodes is greater than K, then this subgraph is called a K-distance topological subgraph of the original graph. Based on this definition, this paper places data on a K-distance topological subgraph of the original topology so as to meet the security requirements. Moreover, to minimize data access time, this paper proposes a node selection algorithm based on node priority. The nodes are arranged in ascending order of data access time, and the data storage nodes are then selected in turn under the security distance constraint to form the optimal K-distance topological subgraph. Finally, the data are placed on this K-distance topological subgraph.

Accordingly, the main contributions of this paper are summarized as follows:
(i)The concept of K-distance topological subgraph is proposed
(ii)A low complexity data placement algorithm is proposed
(iii)By comparing the effectiveness of the proposed algorithm on different network scales, the superiority of the proposed algorithm is proved

The rest of this paper is organized as follows. Section 2 reviews related work. In Section 3, we study the distributed data storage algorithms. The simulation results are presented in Section 4 and Section 5 concludes this paper.

2. Related Work

2.1. Data Analysis for Industry 4.0

Since Industry 4.0 was put forward in 2013, the industrial process around the world has been moving rapidly towards the age of intelligent manufacturing. The development of data perception technology further helps to collect massive industrial data, and the innovation of industrial informatization is an opportunity. However, industrial data are large in scale, high in dimension, variable in structure, and complex in content, so analyzing them is a severe challenge. Diez et al. [19] conducted a comprehensive survey of the latest developments in data fusion and machine learning for industrial forecasting, focusing on identifying research trends, opportunities, and unexplored challenges. Peres et al. [20] proposed an intelligent data analysis and real-time monitoring framework, which provided the basis for realizing scalable and flexible data analysis and real-time monitoring systems for the manufacturing environment. Raptis et al. [21] investigated the latest literature on the application of data management in a networked industrial environment and identified several open research challenges. Costa et al. [22] aimed to find the relationships or associations between emerging technologies in Industry 4.0 and applied data mining technology in a new bibliometric method to help identify association networks. Villalobos et al. [23] proposed a three-level hierarchical architecture for Industry 4.0 data storage in a cloud environment, which helped to manage and reduce costs. Jiang et al. [24] proposed a big-data-based analysis framework that uses Hadoop and other technologies to analyze and extract the network behavior of cellular networks in Industry 4.0 applications. Soltysik et al. [25] determined the trends and keywords for promoting the use of open data in Industry 4.0. Li et al. [26] proposed a system framework based on the concept of Industry 4.0, including the fault analysis and treatment process of machine center predictive maintenance.

2.2. Study for Distributed Data Storage

The large-scale use of the Internet has radically changed the data storage mode. With the increasing popularity of data sharing, local file systems cannot meet the needs of data sharing. More and more data are stored in distributed structures through the network. The distributed storage technology for file sharing emerges as the times require. Through the distributed data storage technology, people can easily and quickly exchange data and work together. Wu et al. [27] proposed a robust and auditable distributed data storage scheme to support safe and reliable edge storage in edge computing and ensure the reliability and integrity of data in the distributed edge storage servers. Cangir et al. [28] preliminarily classified the blockchain-based distributed storage technology. Shi et al. [29] proposed a data placement algorithm based on fault-domain, which provided a new idea for the design of the distributed storage system. Yao et al. [30] introduced a remote image design of a dual node storage cluster, which could protect data in case of system failure. Liao et al. [31] considered a more practical data center network with fat-tree topology and used deep learning technology K-means to help store data blocks, so as to improve the read-write delay of data center networks. Jin et al. [32] introduced how to use distributed database HBase maintained by Apache to manage power data.

3. Distributed Data Storage Algorithm

3.1. K-Distance Topological Subgraph

Due to the limitation of security distance in data storage, it is necessary to find a list of storage node sets that meet the requirement of security distance before data chunks are placed [33]. To find such node sets, the concept of a K-distance topological subgraph is proposed in this paper.

Let $G = (V, E)$ represent the network topology of a distributed storage system, where $G$ is an undirected connected simple graph, $V$ represents the set of storage nodes, and $E$ represents the set of links between the nodes. If there is a node set $V' \subseteq V$ such that for all $v_i, v_j \in V'$ with $i \neq j$ we have $d(v_i, v_j) > K$, where $d(v_i, v_j)$ represents the shortest hop number between two nodes, then $V'$ is called the K-distance topological subgraph of graph $G$.
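As a hedged illustration of this definition, the following Python sketch checks whether a candidate node set satisfies the security distance. It assumes the topology is given as an adjacency list and measures distance in hops with breadth-first search. The function names and the six-node path graph are hypothetical and do not come from the paper, and the comparison is strict (greater than K), following the definition stated in the Introduction.

```python
from collections import deque
from itertools import combinations


def hop_distances(adj, source):
    """Breadth-first search: shortest hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def is_k_distance_set(adj, nodes, k):
    """True if every pair of distinct nodes in `nodes` is more than k hops apart."""
    dist_from = {u: hop_distances(adj, u) for u in nodes}
    return all(
        dist_from[u].get(v, float("inf")) > k for u, v in combinations(nodes, 2)
    )


# Hypothetical 6-node path graph 1-2-3-4-5-6 (not one of the paper's figures).
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5]}
print(is_k_distance_set(path, [1, 4], 2))  # True: nodes 1 and 4 are 3 hops apart
print(is_k_distance_set(path, [1, 3], 2))  # False: nodes 1 and 3 are only 2 hops apart
```

If the intended constraint is "at least K" rather than "greater than K", only the comparison operator in `is_k_distance_set` would change.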

Given the above, the K-distance topology subgraph of graph $G$ is the set of nodes meeting the security distance limitation [34, 35]. Based on this, we propose a K-distance topology subgraph generation algorithm. The pseudo-code of Algorithm 1 is as follows.

(i)Input: graph $G = (V, E)$ and security distance K
(ii)Output: Nodes set K-dis-min-graph
(1)  Select any node
(2)  Connect nodes with distance less than K
(3)  L1: for i = 1
(4)   for j = 1
(5)   if
(6)    K-dis-min-graph
(7)    delete
(8)    
(9)    continue L1
(10)   end-if
(11)   end-for
(12)   end-for
(13)  return K-dis-min-graph

According to Algorithm 1, given an undirected graph $G$, select a node arbitrarily at the beginning, then find a node whose distance from this node is K, and then continue to find a node whose distance from that node is K. Repeat this step until the graph is traversed. The resulting set K-dis-min-graph is a K-distance topological subgraph.

According to the description of Algorithm 1, it is easy to see that the K-distance topology subgraph of graph $G$ is not unique, as shown in Figure 1. For a 10-vertex topology graph $G$, different choices of initial and intermediate nodes yield different K-distance topology subgraphs. Figure 1(b) is the schematic diagram of a 2-distance topology subgraph with the node set {2, 4, 6, 9}. Figure 1(c) is also a 2-distance topological subgraph of graph $G$, with the node set {1, 3, 5, 7, 8, 10}.
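The non-uniqueness can be reproduced with a small greedy sketch in Python. This is only one possible reading of the generation procedure, not the authors' implementation: it scans nodes in breadth-first order from a chosen start node and keeps a node whenever it is more than K hops from every node already kept. The ten-node ring, the function names, and the strict comparison are assumptions made for illustration, and the ring is not the topology of Figure 1.

```python
from collections import deque


def hop_distances(adj, source):
    """Breadth-first search: shortest hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def greedy_k_distance_set(adj, k, start):
    """Grow a node set greedily, visiting candidates in BFS order from `start`."""
    selected = []
    tables = []  # one hop-distance table per selected node
    for cand in hop_distances(adj, start):  # dict keys are in BFS order, start first
        if all(t.get(cand, float("inf")) > k for t in tables):
            selected.append(cand)
            tables.append(hop_distances(adj, cand))
    return selected


# Hypothetical ring of 10 nodes: node i is linked to i-1 and i+1 (mod 10).
ring = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
from_node_0 = greedy_k_distance_set(ring, 1, start=0)
from_node_1 = greedy_k_distance_set(ring, 1, start=1)
print(sorted(from_node_0))  # [0, 2, 4, 6, 8]
print(sorted(from_node_1))  # [1, 3, 5, 7, 9]
```

Both results satisfy the same distance constraint on the same graph, yet they share no node, which is the behavior the paragraph above describes.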

3.2. Storage Node Selection Algorithm

In this paper, the data placement problem under a given security requirement is transformed into another problem: given the security distance K, find the K-distance topology subgraph of the network topology that minimizes the data access time. Because the K-distance topology subgraph of an undirected graph is not unique, this paper proposes an algorithm based on node priority, which arranges the nodes in order of unit data access speed. If two nodes have the same access speed, they are arranged according to the node’s self-protection capability (SPC). When selecting storage nodes, the node with the highest priority should be selected as far as possible to ensure a high data access speed [36, 37]. Considering that the complexity of finding the K-distance topological subgraph is $O(n^2)$, a node selection algorithm is proposed in this paper, which minimizes the data access time and reduces the complexity of the algorithm while satisfying the security distance K.
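The priority ordering can be sketched in Python as a two-level sort. The `StorageNode` type, its field names, and the sample values are hypothetical. The sketch also assumes that a higher SPC wins a tie, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageNode:
    name: str
    access_speed: float  # unit data access speed toward the access point (assumed units)
    spc: float  # aggregate self-protection capability (assumed to be a single number)


def rank_by_priority(nodes):
    """Highest priority first: faster access speed, then higher SPC on ties (assumption)."""
    return sorted(nodes, key=lambda n: (-n.access_speed, -n.spc))


# Hypothetical candidates; the values are made up for illustration only.
candidates = [
    StorageNode("a", access_speed=5.0, spc=1.0),
    StorageNode("b", access_speed=8.0, spc=0.5),
    StorageNode("c", access_speed=5.0, spc=3.0),
]
ranked_names = [n.name for n in rank_by_priority(candidates)]
print(ranked_names)  # ['b', 'c', 'a']
```

Node `b` comes first because it is fastest, and `c` precedes `a` because the two have equal speed and `c` has the higher SPC.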

SPC is the aggregate value of intrusion detection system capability value, anti-virus capability value, firewall capability value, and authentication mechanism capability value [38]. This paper assumes that all data center nodes have the above four security measures. Assuming that the data access point is a node in the undirected graph $G$, the unit data access speed from each node to the data access point is defined for all nodes in the graph. The pseudo-code of the data storage node selection algorithm (Algorithm 2) is as follows.

(i)Input: $G = (V, E)$, K, Link bandwidth matrix, data access point
(ii)Output: Optimal nodes set (Opt_nodes set)
(1) for i = 1 to |V|
(2)  compute the unit data access speed from node i to the data access point
(3) end-for
(4) Rank the nodes according to step 2 from largest to smallest, and the ranked set is UDAS_D
(5) Opt_nodes set ← {UDAS_D1}
(6) delete UDAS_D1 from UDAS_D
(7) for i = 1 to |UDAS_D|
(8)  dis = Dijkstra(A, UDAS_Di)
(9)  if dis > K
(10)   Opt_nodes set ← Opt_nodes set ∪ {UDAS_Di}
(11)   delete UDAS_Di from UDAS_D
(12)  end-if
(13) end-for
(14) return Opt_nodes set
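A minimal Python sketch of this selection procedure follows, under several stated assumptions rather than as the authors' implementation. The unit data access speeds are passed in precomputed, because their formula is not reproduced here. Breadth-first search replaces Dijkstra, since the security distance is counted in hops. Each candidate is checked against every node already selected, and the comparison is strict. All names and sample values are hypothetical.

```python
from collections import deque


def hop_distances(adj, source):
    """Breadth-first search: shortest hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def select_storage_nodes(adj, k, speed, spc=None):
    """Take nodes in priority order, keeping those more than k hops from all earlier picks."""
    spc = spc or {}
    # Rank by unit data access speed (largest first); break ties by SPC (assumed: higher first).
    ranked = sorted(adj, key=lambda v: (-speed[v], -spc.get(v, 0.0)))
    chosen, tables = [], []
    for v in ranked:
        if all(t.get(v, float("inf")) > k for t in tables):
            chosen.append(v)
            tables.append(hop_distances(adj, v))  # one BFS per selected node
    return chosen


# Hypothetical 6-node path graph and made-up speeds, for illustration only.
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 6], 6: [5]}
speed = {1: 10.0, 2: 9.0, 3: 8.0, 4: 7.0, 5: 6.0, 6: 5.0}
print(select_storage_nodes(path, 2, speed))  # [1, 4]
```

Node 1 is taken first as the fastest node, nodes 2 and 3 are rejected for being within 2 hops of it, node 4 is accepted at 3 hops, and nodes 5 and 6 are then too close to node 4.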

4. Simulation and Analysis

4.1. Simulation Environment

In this paper, the OMNeT++ [39] simulation platform and Matlab R2020a were used to verify the effectiveness of the proposed storage node selection algorithm. Two types of network topology are considered: a random topology and the Internet2 network connections [40], as shown in Figure 2. The random topology can measure the performance of data storage algorithms in various scenarios, and the Internet2 topology can measure their performance in real scenarios. In Figure 2, the number on each line is the weight of the connection. In Figure 3, the larger the weight of a connection is, the thicker the connection line is. Table 1 gives the specific parameters, and Table 2 shows the settings of the simulation environment.

We compare the data access time of Algorithm 1 and Algorithm 2 proposed in this paper with that of CDPVDA [41], ACO-DPDGW [42], and UnifyDR [43].
(i)Cloud model-based Data Placement Algorithm with Virtual Data Agent (CDPVDA)
(ii)Ant colony optimization-based data placement of data-intensive geospatial workflow (ACO-DPDGW)
(iii)UnifyDR: A generic framework for unifying data and replica placement

4.2. Simulation Results

4.2.1. Random Topology

We first compare the data access time of the algorithms in random topological networks with different data volumes and numbers of network nodes, as shown in Figure 4. Figure 4 shows that as the data volume increases, the data access time of the proposed algorithm remains the smallest, about 50% shorter than that of the baselines, and grows slowly. This is because the proposed algorithm selects nodes with good link conditions to minimize the data access time. Figure 5 shows that as the number of nodes increases, the data access time of the proposed algorithm is still the smallest, about 60%–70% lower than that of the baselines. This means that the proposed algorithm can select the best-performing nodes to store data under the security distance limit, thus minimizing the data access time.

4.2.2. Internet2 Topology

As can be seen from Figure 6, as the volume of data in the Internet2 topology increases, the data access time of all algorithms increases, but the data access time of the algorithm proposed in this paper is still the smallest, about 50% lower than that of the baselines. Since the bandwidth in the Internet2 topology is 1 GBps, UnifyDR and the proposed algorithm select the nearest nodes when the data volume is small, so their data access times are the same. However, as the data amount increases and some of the nearest nodes run out of storage, the proposed algorithm is better than the baselines at finding suboptimal nodes, so its advantage grows with the data volume. As indicated in Figure 7, the data access time of the proposed algorithm is still about 50% lower than that of the baselines as the security distance increases.

5. Conclusions

For the large amount of data generated by Industry 4.0, this paper proposes a data storage method that considers both the efficiency and the security of data storage. To meet the needs of different user groups in terms of data security and user experience, the concept of security distance is proposed, which allows different users’ requirements for data security to be accommodated in the same data storage method. For the efficiency of data storage, a storage node selection algorithm is proposed to minimize data access time while ensuring a given level of data security, thus improving the user experience. Finally, simulation results show that, compared with existing data storage algorithms, the proposed algorithm can reduce data access time while ensuring a given level of data security.

In a distributed data storage system, a cloud storage system has a long distance between storage nodes and is generally distributed all over the world. However, a structured P2P network is highly volatile, which makes it difficult to ensure user experience in the networking strategy of the distributed data storage system. At present, the trend of distributed data storage system research is security, reliability, speed, and low energy consumption. Many existing works only optimize some of the above four conditions but do not achieve comprehensive optimization. Therefore, it is necessary to design distributed data storage methods and consistent maintenance policies that can meet the above requirements in the follow-up work to improve user experience.

Data Availability

All data used to support the findings of the study are included within this paper.

Conflicts of Interest

The author declares that there are no conflicts of interest in this paper.

Acknowledgments

This work was supported by Anhui Social Science Innovation and Development Research Project in 2021 (No. 2021cx136) and Anhui Quality Engineering Project in 2020 (No. 2020xfxm56).