Abstract

Erasure coding has been widely deployed in today’s data centers because it significantly reduces extra storage costs while providing high storage reliability. However, erasure coding introduces additional network traffic and computational overhead during the data update process. How to improve update efficiency and mitigate system imbalance during erasure-coding updates remains a challenging problem. Most existing update schemes for erasure codes focus only on the single-stripe update scenario and ignore the heterogeneity of node and network status, so they cannot sufficiently address the low update efficiency and load imbalance caused by multistripe concurrent updates. To solve this problem, this paper proposes a Load-Aware Multistripe concurrent Update (LAMU) scheme for erasure-coded storage systems. Notably, LAMU introduces the Software-Defined Networking (SDN) mechanism to measure node loads and network status in real time. It selects nonduplicated nodes with better performance in terms of CPU utilization, remaining memory, and I/O load as the computing nodes for multiple update stripes. Then, a multiattribute decision-making method is used to schedule the network traffic generated during the update process. This mechanism improves the transmission efficiency of update traffic and allows LAMU to adapt to multistripe concurrent update scenarios in heterogeneous network environments. Finally, we design a prototype system for multistripe concurrent updates. Extensive experimental results show that LAMU improves update efficiency and provides better system load-balancing performance.

1. Introduction

The scale of distributed storage systems is rapidly expanding to deal with the proliferation of the global datasphere. Meanwhile, node failures and data loss caused by various threats are increasing, such as system crashes, natural disasters, hacker attacks, and power outages [1–3]. To avoid the irreversible losses caused by these threats and to improve the reliability of storage systems, a redundancy mechanism is indispensable in data centers. The two most typical redundancy mechanisms are replication and erasure coding. Replication copies each chunk of the original data to other storage devices to improve system redundancy. However, it incurs considerable extra storage costs, especially given today’s explosive data growth. As an alternative, erasure coding can provide better storage efficiency via encoding computations while meeting the same degree of fault tolerance as replication [4]. Specifically, erasure coding divides the original data into several data chunks, and these data chunks are then encoded into a few redundant chunks (also called parity chunks). The data chunks and parity chunks together form an erasure-coding stripe. When a data failure occurs, as long as the number of failed chunks does not exceed the recovery threshold, the lost chunks can be recovered from the surviving chunks. Since erasure coding significantly reduces extra storage costs while providing high storage reliability, it has been widely deployed in today’s data centers, such as Facebook [5], Azure [6], and Google GFS [7].

However, while providing high reliability with less extra storage cost, erasure coding introduces more network traffic and computation overhead during the data update process. When a data chunk is updated, all the parity chunks in the same stripe should be updated simultaneously to maintain the consistency of the stripe, which increases the disk I/O load and the update time. In addition, analyses of various real traces show that over 90% of writes in storage systems are data updates [8–10], indicating that data updates are prevalent. Moreover, if a data failure occurs during the update process, the system cannot correctly recover the failed data. Therefore, the update efficiency of erasure coding affects not only the performance but also the reliability of the distributed storage system.

There are two major challenging factors impacting the erasure-coding update. Challenge 1 is the heterogeneity of storage node and network status. For example, storage nodes purchased in different periods during the expansion of a storage system have different performance [11, 12]. Meanwhile, these storage nodes may also be processing various tasks in real time, such as MapReduce [13], system heartbeats, and data migration [14], making the status of network links dynamic and heterogeneous. In this case, the computational load and traffic caused by the update may significantly impact system performance and reduce update efficiency. Challenge 2 is the multistripe concurrent update. Due to the potential correlation between the data of each stripe [15, 16], the update of one erasure-coding stripe results in the concurrent update of multiple correlated stripes [15], which amplifies the node load and the update time. Therefore, how to improve the update efficiency of erasure-coded storage while guaranteeing system load balance remains a critical problem. However, existing update schemes ignore node and network heterogeneity and focus only on the single-stripe update scenario, so they cannot sufficiently deal with the declining update efficiency and system load imbalance caused by multistripe concurrent updates.

This paper proposes a Load-Aware Multistripe concurrent Update (LAMU) scheme. As we will explain in Section 3, LAMU adopts a centralized update architecture in which the data update is divided into three phases: data-delta convergence, parity-delta computation, and parity-delta divergence. The centralized update architecture mitigates system overhead by avoiding separate connections between each data node and each parity node. Firstly, we introduce Software-Defined Networking (SDN) to measure and collect node load information (such as CPU utilization, residual memory, disk I/O load, and node access bandwidth) and network status (such as network topology, link residual bandwidth, and link transmission delay) in real time. Secondly, based on the node load information, we select nonrepetitive computing nodes with lower loads for each update stripe. Finally, the TOPSIS method is used to tailor the best paths for data-delta convergence and parity-delta divergence for each update stripe. The decision factors are assigned different weights under different network load scenarios so that LAMU can suit various environments.

The main contributions of this paper can be summarized as follows:

(1) Aiming at the problem that existing research cannot sufficiently handle the efficiency decline of multistripe concurrent updates, this paper first establishes an optimization model of multistripe updates with multiple QoS constraints in a heterogeneous environment. Update efficiency can be improved by minimizing the cumulative weighted update delay of multistripe updates and balancing link utilization. To the best of our knowledge, this is the first attempt to improve the efficiency of multistripe concurrent updates under multiple QoS constraints.

(2) This paper introduces SDN to perceive the node load status and network status of the erasure-coded storage system in real time and proposes the Load-Aware Multistripe concurrent Update (LAMU) scheme. LAMU selects nonrepetitive computing nodes with better capacity for each update stripe. Then, the TOPSIS method is used to schedule the traffic generated during the update process to improve the efficiency of multistripe updates. As far as we know, this is the first work that simultaneously considers the heterogeneity of nodes and network status during the erasure-coding update process.

(3) We design a prototype system for multistripe concurrent updates based on Containernet [17] to verify the effectiveness of LAMU. Extensive experimental results show that LAMU improves erasure-coding update efficiency and maintains better system load balancing.

The rest of this paper is organized as follows: Section 2 presents the background and related work of the erasure-coding update. Section 3 describes the multistripe update problem in the erasure-coded system and provides the optimization model. Section 4 introduces the details of our LAMU scheme. We conduct extensive experiments to evaluate LAMU in Section 5. Section 6 concludes this paper.

2. Background and Related Work

2.1. Basics of Erasure Coding

In this paper, we concentrate on a well-known family of erasure codes, the Reed-Solomon (RS) codes [18], which are widely used in today’s commercial data centers [7]. To be precise, the system configures an RS code with two parameters k and m and denotes the code by RS(k, m). In RS(k, m) codes, the original data are divided into k data chunks d_1, d_2, ..., d_k, and these data chunks are encoded into m parity chunks p_1, p_2, ..., p_m through the linear operation of equation (1). These k + m chunks, distributed across different nodes of the storage system, form an erasure-coding stripe.

where γ_ij is the conversion coefficient from data chunk d_i to parity chunk p_j. According to the linear characteristics of equation (1), as long as the number of surviving chunks in the stripe is no smaller than k, any k chunks can reconstruct the whole stripe.

Figure 1 depicts the process of encoding, updating, and decoding of RS(6, 3). First, the system divides the original data into 6 data chunks, and the data chunks are encoded by equation (1) to obtain 3 parity chunks; these 9 chunks form an erasure-coding stripe. When a data chunk is updated, the 3 parity chunks are synchronously updated as well. In decoding, through the linear operation of equation (1), the whole stripe can be reconstructed from any 6 surviving chunks.

As we can see from Figure 1, in the data update process of erasure coding, when a data chunk is updated, all the parity chunks in the same stripe must also be updated simultaneously to maintain the consistency of the stripe. Based on whether complete data chunks need to be transmitted, there are two classes of update framework: the full-stripe update and the delta-based update. In the full-stripe update, a data chunk d_i (where 1 ≤ i ≤ k) is updated to d_i', and then equation (1) is used to recalculate every parity chunk p_j (where 1 ≤ j ≤ m), which requires transmitting whole chunks and consumes significantly more network resources. In the delta-based update, each parity chunk p_j can be updated to p_j' by equations (2a) and (2b):

To elaborate further, when data chunks d_i are updated to d_i', each data node sends its data-delta Δd_i to the computing node, which calculates the parity-delta Δp_j based on equation (2a) and then distributes Δp_j to the parity nodes to complete the whole update process. Clearly, the delta-based update saves network resources and improves update efficiency compared with the full-stripe update.
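The linearity that makes the delta-based update work can be illustrated with a short sketch. The code below uses arithmetic over a small prime field as a stand-in for the GF(2^8) arithmetic that real RS implementations use; the coefficients and chunk values are illustrative assumptions, not taken from the paper.

```python
Q = 257  # small prime field standing in for GF(2^8) in real RS codes

def encode_parity(data, coeffs):
    """Full encoding: each parity chunk is a linear combination of data chunks."""
    return [sum(g * d for g, d in zip(row, data)) % Q for row in coeffs]

def delta_update(parity, coeffs, i, old, new):
    """Delta-based update: only the data-delta (new - old) reaches each parity."""
    delta = (new - old) % Q
    return [(p + row[i] * delta) % Q for p, row in zip(parity, coeffs)]

coeffs = [[1, 1, 1], [1, 2, 3]]   # hypothetical conversion coefficients
data = [10, 20, 30]
parity = encode_parity(data, coeffs)

# Update d_1 from 20 to 25 via the delta path ...
parity_delta = delta_update(parity, coeffs, 1, 20, 25)
# ... and verify it matches re-encoding the full stripe.
data[1] = 25
assert parity_delta == encode_parity(data, coeffs)
```

The assertion at the end checks the key property: applying only the weighted data-delta to each parity chunk yields the same result as a full re-encoding, which is why the delta-based update needs far less network traffic.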

2.2. Related Work

As mentioned above, data updates are prevalent in storage systems and have a significant impact on the performance of distributed storage systems. Therefore, various update schemes have been proposed in recent years to improve erasure-coding update efficiency. T-Update [19] builds a minimum update tree using the Prim [20] algorithm to deal with the single-node update problem, but it neglects the network status when constructing the update topology, which is prone to cause network congestion when the system load is high. TA-Update [21] adds a rollback-based failure-handling method on top of T-Update, making the update process more adaptive. To cope with the multiple-node update problem in erasure coding, PUM-P [22] first proposed a centralized update architecture that collects the data-deltas at a middle node close to the data nodes and distributes the parity-deltas along randomly chosen route paths. Although PUM-P reduces the number of connections between data nodes and parity nodes, it ignores the heterogeneity of nodes when selecting the computing node and the link status when scheduling update traffic. To improve the data transmission efficiency of multiple-node updates, ACOUS [23] constructs an update tree that considers the link delays provided by the commercial cloud service provider, which reduces the multiple-node update time. However, ACOUS also neglects node heterogeneity when selecting the computing node, and it is difficult to obtain the delay parameters from a service provider in common storage clusters.

The work mentioned above improves update efficiency by optimizing the data update process. Shen et al. [15] proposed CASO, which instead organizes data chunks with high correlation into the same stripes to reduce update traffic. Specifically, CASO mines the correlation of different stripes from real storage system workload traces [16] and then organizes highly correlated data into the same stripe to reduce the number of concurrent update stripes and improve update efficiency. However, CASO can only mitigate the correlation between stripes; it cannot entirely eliminate it. Consequently, multistripe concurrent updates are still frequently triggered by associated stripes, especially in storage systems where stripes are organized without consideration of data correlations. Therefore, improving multistripe concurrent update efficiency while maintaining system load balance remains a very challenging task.

In summary, most of the existing update schemes of erasure codes only focus on the single stripe update scenario and ignore the heterogeneity of the node and network status, which cannot sufficiently deal with the problems of low update efficiency and load imbalance caused by the multistripe concurrent update. To solve these problems, this paper introduces SDN and multiattribute decision-making methods and proposes the Load-Aware Multistripe concurrent Update (LAMU) scheme in heterogeneous erasure-coded storage systems.

3. Model and Formulation of the Multistripe Concurrent Update Problem

In this section, we first state the multistripe concurrent update problem in the erasure-coded system. Our motivation is to find the best computing node, convergence path, and divergence path for each stripe. These computing nodes and route paths are combined to form an update forest. Then, we give the optimization model of multistripe updates with multiple QoS constraints in the heterogeneous environment.

3.1. Problem Statement

Figure 2 shows a simple distributed erasure-coded storage system consisting of several racks, in which each rack contains multiple storage nodes and each node stores many chunks from diverse erasure-coding stripes. As Figure 2 shows, the 4 data chunks and 4 parity chunks of a stripe are distributed across different racks and nodes in the system.

We use a centralized update architecture similar to PUM-P [22], which reduces the number of connections between the data and parity nodes by introducing middle computing nodes. Take the update process of the 4 data chunks of the stripe described in Figure 2 as an example: firstly, the stripe updates its data chunks and converges the data-deltas to the computing node selected by the controller. Secondly, the computing node calculates the parity-deltas by equation (2a). Finally, the computing node distributes the parity-deltas to the corresponding parity nodes.

The detailed mathematical model of multistripe concurrent updates with multiple QoS constraints in a heterogeneous environment is introduced in Section 3.2. The network topology of an erasure-coded storage system can be modeled by a graph G = (V, E), in which V represents the set of switches and E denotes the set of links between adjacent switches. For easy reference, the notations used in this section are shown in Table 1.

3.2. Problem Formulation
3.2.1. Cumulative Weighted Update Delay for Multistripe Update

The first objective function aims to minimize the cumulative update delay of all update stripes, as defined in equation (3a). Specifically, the update delay of each stripe is defined in equation (3b) and is composed of (a) the data-delta convergence delay, (b) the parity-delta computing delay, and (c) the parity-delta divergence delay.

(1) The Data-Delta Convergence Delay. The data-delta convergence delay is formulated as follows: where the path set denotes all possible convergence paths from the updated data nodes to the computing node of the stripe. Each element of this set is composed of multiple point-to-point paths from the data nodes to the computing node, and its associated delay is the convergence delay of the stripe when that element is selected as the convergence path. The constraint in formula (4b) ensures that the total bandwidth requirement of all convergence traffic through a path does not exceed its bottleneck bandwidth. Constraint (4c) ensures that only one convergence path is assigned to each stripe. In constraint (4d), the decision variable is binary: it is 1 if the stripe selects the given element as its convergence path and 0 otherwise.

(2) The Parity-Delta Computing Delay. This section adopts the definition of node computing capacity in erasure-coded systems proposed by Fenglin et al. [24]. It uses a sequence of factors that affect the processing ability of a node, such as CPU utilization, remaining memory, and disk I/O, together with corresponding weight factors. Therefore, the computing capacity of each node in the erasure-coding update can be expressed as

Assuming the update volume of each stripe is given, the parity-delta computing delay is formulated as follows: where the coefficient indicates the capacity conversion coefficient.

(3) The Parity-Delta Divergence Delay.

where the path set denotes all possible divergence paths from the computing node to the parity nodes of the stripe. Each element of this set is composed of multiple point-to-point paths from the computing node to the parity nodes, and its associated delay is the divergence delay of the stripe when that element is selected as the divergence path. The constraint in formula (6b) ensures that the total bandwidth requirement of all divergence traffic through a path does not exceed its bottleneck bandwidth. Constraint (6c) ensures that only one divergence path is assigned to each stripe. In constraint (6d), the decision variable is binary: it is 1 if the stripe selects the given element as its divergence path and 0 otherwise.

(4) The Proposed Objective Function. According to formulas (3)–(6), the objective function of the cumulative weighted delay of the multistripe update can be represented as

3.2.2. Network Load-Balancing Performance for Multistripe Update

While improving the update efficiency, the load balance of the network is also critical. The objective function of minimizing the maximum link bandwidth utilization is defined as follows:

The constraint (8b) ensures that the bandwidth used on a link cannot exceed the link capacity; the decision variable is binary and indicates the link selection of the update traffic.

3.3. The Proposed Multiobjective Optimal Model of Multistripe Update

Our goal is to minimize both the cumulative weighted delay of multistripe updates and the maximum link bandwidth utilization. However, it is difficult to achieve the minimum values of both objectives at the same time. The overall objective function of this paper is therefore defined as follows:
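As a hedged illustration only (the concrete form of equation (9) appears in the original paper and is not reproduced here), one standard way to combine two conflicting objectives of this kind is a weighted-sum scalarization over normalized terms, with a tunable trade-off weight \(\alpha\):

```latex
\min \; F \;=\; \alpha \,\frac{F_1}{F_1^{\max}} \;+\; (1-\alpha)\,\frac{F_2}{F_2^{\max}},
\qquad \alpha \in [0,1]
```

Here \(F_1\) stands for the cumulative weighted update delay, \(F_2\) for the maximum link bandwidth utilization, and \(F_1^{\max}\), \(F_2^{\max}\) are normalization constants; whether equation (9) takes exactly this form is an assumption of this sketch.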

4. SDN-Based Load-Aware Multistripe Concurrent Update Scheme

To solve objective function (9), we propose the LAMU scheme. Figure 3 presents the system architecture of LAMU, which includes four main modules: the Node Monitor (NodeM) module, the Network Monitor (NetM) module, the Compute Node Selection (CNS) module, and the Path Selection (PS) module. The process by which LAMU seeks the best computing node and transmission paths for multistripe erasure-coding data updates is briefly described as follows: firstly, the NodeM and NetM modules update the real-time node load and network information. Then, the CNS module selects the computing nodes for the update stripes according to the network and node status. Lastly, LAMU employs the PS module to find an appropriate convergence path between the data nodes and the computing node and an appropriate divergence path between the computing node and the parity nodes. The combination of the computing node, convergence path, and divergence path forms an update tree, and multiple update trees constitute an update forest.

4.1. The NodeM Module and NetM Module

Software-Defined Networking (SDN) can significantly simplify network configuration and alleviate measurement overhead compared with traditional network architectures. For example, SDN can provide a flexible and efficient monitoring strategy through its centralized control plane. In the LAMU scheme, the NodeM and NetM modules interact with switches through the OpenFlow protocol of SDN to discover the global network topology. They update the load information of the storage nodes and the network link status in real time to provide a knowledge plane for LAMU. The node load information recorded by the NodeM module is as follows. First, the set of CPU utilizations and the set of residual memory capacities of the storage nodes: both the basic functions of computing nodes and the calculation of the parity-delta require CPU and memory resources. Second, the set of I/O loads of the storage nodes: the I/O load represents the reading and writing performance of a storage node, and since computing nodes receive and forward data involving disk reads and writes, considering the I/O load yields a more accurate node selection weighting factor. Third, the set of access bandwidths of the storage nodes: in the multistripe concurrent update scenario, the computing node, as the convergence point of data-deltas and the divergence point of parity-deltas, has a relatively large demand for access bandwidth, so a larger access bandwidth means less possibility of congestion and thus improves the overall update efficiency.

The sets of CPU utilization, residual memory capacity, and I/O load can be obtained by periodically requesting status information from the storage nodes. The set of node access bandwidths can be calculated using the SDN-based network measurement method from our previous work [25].

The NetM module follows the OpenFlow protocol of SDN to obtain the global network topology and update the real-time network information. The network information obtained by the NetM module is as follows. First, the set of point-to-point paths between all nodes of each stripe, which can be obtained by the Dijkstra [26] algorithm; this path set is calculated during LAMU initialization and can be accessed directly in subsequent path computations without repeated calculation, which reduces the cost of the algorithm. Second, the set of residual bandwidths of each path and the set of transmission delays of each path, both of which can be obtained using the SDN-based network measurement methods from our previous work [27]. Third, the number of hops from the start node to the end node, which can be calculated from the length of each path.
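The per-pair path precomputation that NetM performs at initialization can be sketched with a textbook Dijkstra routine; the switch names, topology, and link weights below are hypothetical, not taken from the paper.

```python
import heapq

def dijkstra(graph, src):
    """Shortest-path distances and predecessor map from src over a weighted digraph."""
    dist = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    return dist, prev

def path_to(prev, src, dst):
    """Reconstruct the point-to-point path that NetM would cache for later reuse."""
    node, path = dst, [dst]
    while node != src:
        node = prev[node]
        path.append(node)
    return path[::-1]

# Hypothetical switch topology: node -> list of (neighbor, link weight)
graph = {"s1": [("s2", 1), ("s3", 4)], "s2": [("s3", 1)], "s3": []}
dist, prev = dijkstra(graph, "s1")
assert dist["s3"] == 2
assert path_to(prev, "s1", "s3") == ["s1", "s2", "s3"]
```

Computing these paths once at initialization and caching them, as the text describes, avoids rerunning the search for every update request.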

We use NodeM and NetM to obtain the storage node loads and the global network status information mentioned above in (10)–(17), which provides data support for the subsequent computing node selection and convergence and divergence path selection.

4.2. The CNS Module

As shown in Figure 3, LAMU selects nonduplicated computing nodes with better performance for multiple stripes through the CNS module. Firstly, when computing nodes are assigned to multiple update stripes, it is necessary to prevent numerous stripes from selecting the same computing node; otherwise, the efficiency of parity-delta computing will be reduced and network congestion will occur. Secondly, according to Section 3.2, the parity-delta computing efficiency is positively correlated with the computing capacity of nodes, so the load status of heterogeneous nodes should be considered when selecting the computing node. Specifically, the CNS module uses the node load information obtained by the NodeM module to select computing nodes with better capacity via equation (20). Then, it deletes the selected nodes from the candidate computing node set to prevent concurrent update stripes from selecting a duplicate computing node. The entire process of the CNS module is as follows.

4.2.1. Normalizing the Load Attributes

To eliminate the dimensional differences among the node load factors, a min–max normalization method is used. Equation (18) is applied to the node CPU utilization and disk I/O load factors, for which smaller values indicate better performance. Equation (19) is applied to the node residual memory and node access bandwidth factors, for which larger values indicate better performance.

Then, the normalized decision factor vector can be obtained, where the index represents the sequence number of the candidate computing node.

4.2.2. Calculating the Capacity of the Node

The capacity of each candidate node can be calculated using the following equation: where the weight vector contains the weighted coefficients for the node CPU utilization, residual memory, disk I/O load, and node access bandwidth. The result represents the weighted summation of the normalized factors of the candidate node; a node with a larger value is a better computing node.

4.2.3. Selecting the Computing Node

To prevent severe network congestion and excessive node load, we need to avoid multiple update stripes selecting the same computing nodes. The entire process of the computing node selection is summarized in Algorithm 1.

1. Inputs:
 Candidate computing node set
 Concurrent update stripe set
 CPU utilization, remaining memory, I/O load, and access bandwidth of each node
 Output: best computing nodes for the concurrent update stripes
2. For each stripe in the concurrent update stripe set do
3.  For each node in the candidate computing node set do
4.   Obtain the load parameters of the node
5.   Normalize the load parameters according to (18) and (19)
6.   Calculate the capacity of the node according to (20)
7.  End for
8.  Set the node with the largest capacity as the computing node for the stripe
9.  Delete the selected node from the candidate set to ensure that the computing nodes selected by multiple stripes are not duplicated
10. End for
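The normalization and greedy, non-duplicated selection described above can be sketched as follows. The metric names, weight values, and node loads are hypothetical, and corner cases such as ties or an exhausted candidate set are not handled.

```python
def min_max(values, smaller_is_better):
    """Min-max normalization; cost-type factors are inverted so larger is always better."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [(hi - v) / span if smaller_is_better else (v - lo) / span
            for v in values]

def select_computing_nodes(nodes, loads, weights, num_stripes):
    """Greedy, non-duplicated computing node selection in the spirit of Algorithm 1.
    loads: per-node lists for cpu, mem, io, bw (hypothetical metric names)."""
    cpu = min_max(loads["cpu"], smaller_is_better=True)   # utilization: lower is better
    io = min_max(loads["io"], smaller_is_better=True)     # I/O load: lower is better
    mem = min_max(loads["mem"], smaller_is_better=False)  # residual memory: higher is better
    bw = min_max(loads["bw"], smaller_is_better=False)    # access bandwidth: higher is better
    w_cpu, w_mem, w_io, w_bw = weights
    capacity = {n: w_cpu * c + w_mem * m + w_io * i + w_bw * b
                for n, c, m, i, b in zip(nodes, cpu, mem, io, bw)}
    chosen = []
    for _ in range(num_stripes):
        best = max(capacity, key=capacity.get)
        chosen.append(best)
        del capacity[best]  # avoid duplicated computing nodes across stripes
    return chosen

nodes = ["n1", "n2", "n3"]
loads = {"cpu": [0.9, 0.2, 0.5], "mem": [2, 8, 4],
         "io": [0.8, 0.1, 0.4], "bw": [50, 100, 70]}
print(select_computing_nodes(nodes, loads, (0.25, 0.25, 0.25, 0.25), 2))
# → ['n2', 'n3']
```

Deleting each selected node from the capacity map is what realizes line 9 of Algorithm 1: no two concurrent stripes can end up with the same computing node.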
4.3. The Path Selection (PS) Module

As described in Figure 3, when processing the multistripe concurrent update request, after LAMU selects the computing node with the CNS module, the system uses the PS module to schedule the update traffic, which includes the convergence traffic between the data nodes and computing node and the divergence traffic between the computing node and parity nodes. Specifically, LAMU uses the real-time network status and the multiattribute decision-making method based on TOPSIS to schedule the update traffic. In order to improve the update efficiency and maintain better system load balancing, we adjust the weight of decision factors under different network loads. The entire process of the PS module is as follows:

Step 1. Obtain candidate path.

The PS module first filters the existing path set according to the network bandwidth requirement of the update traffic and then obtains the candidate path set. From this candidate set, a decision-making matrix is built for finding the best path for each point-to-point update flow; each column of the matrix represents a candidate path, and the entries in each column denote that path’s residual bandwidth, transmission delay, and network hops, respectively. These network attributes are obtained by the NetM module.

Step 2. Construct and normalize the decision-making matrix.

To eliminate the influence of dimensions among the network attributes, the min–max normalization method is used, as shown in equations (18) and (19) in Section 4.2. Equation (18) is applied to the path delay and network hop attributes, for which smaller values indicate better performance. Equation (19) is applied to the residual bandwidth, for which larger values indicate better performance. The normalized decision-making matrix is then described as follows: where the column index corresponds to the sequence number of the candidate path of the update traffic. The weight vector contains the weighted coefficients of the residual bandwidth, path delay, and network hops, respectively; the values of the weight coefficients are usually determined through experiments [28] and will be introduced in Section 5. The weighted decision matrix can then be obtained using the following equation:

Step 3. Construct the weighted decision matrix.

Step 4. Obtain the positive and negative ideal solutions.

where the positive ideal solution is composed of the maximum value of each decision factor and the negative ideal solution is composed of the minimum value of each decision factor. For each element of the candidate path set, the distances to the positive and negative ideal solutions are then computed.

Step 5. Calculate the distance from the candidate path to the positive and negative ideal solutions.

Step 6. Calculate the relative closeness between each candidate path and the optimal candidate path.

When the relative closeness is larger, the path is more suitable for the update traffic.

The entire process of the PS module is summarized in Algorithm 2.

1. Inputs:
 Candidate path set; path residual bandwidth set; path delay set; path hop set; source node and destination node of the convergence or divergence flow; bandwidth requirement of the update flow; vector of weighted coefficients for the residual bandwidth, end-to-end delay, and network hops
 Output: the best path from the update traffic source to the update traffic destination
2. For each path in the path set do
3.  If the path is from the source node to the destination node and its residual bandwidth meets the bandwidth requirement then
4.   Add the path to the candidate path set
5.  End if
6. End for
7. Build the decision matrix based on the candidate path set according to equation (22)
8. Normalize the decision matrix according to equation (23)
9. Construct the weighted decision matrix according to equation (25)
10. Calculate the positive and negative ideal solutions of the weighted matrix according to (26)
11. Calculate the Euclidean distance from each candidate path to the positive and negative ideal solutions according to (27)
12. Calculate the relative closeness between each candidate path and the best candidate path according to (28)
13. Return the candidate path with the largest relative closeness as the route path
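The TOPSIS ranking in Algorithm 2 (normalization, weighting, ideal solutions, Euclidean distances, relative closeness) can be sketched as follows; the candidate paths, attribute values, and weights are hypothetical.

```python
import math

def topsis(paths, matrix, weights, smaller_is_better):
    """Rank candidate paths by TOPSIS relative closeness and return the best one.
    matrix: one row per attribute (bandwidth, delay, hops), one column per path."""
    # Normalize each attribute row so larger is always better, then apply its weight.
    weighted = []
    for row, w, cost in zip(matrix, weights, smaller_is_better):
        lo, hi = min(row), max(row)
        span = (hi - lo) or 1.0
        norm = [(hi - v) / span if cost else (v - lo) / span for v in row]
        weighted.append([w * v for v in norm])
    cols = list(zip(*weighted))               # one tuple of attributes per path
    pos = [max(attr) for attr in weighted]    # positive ideal solution
    neg = [min(attr) for attr in weighted]    # negative ideal solution
    best, best_cc = None, -1.0
    for name, col in zip(paths, cols):
        d_pos = math.dist(col, pos)
        d_neg = math.dist(col, neg)
        cc = d_neg / (d_pos + d_neg)          # relative closeness to the ideal
        if cc > best_cc:
            best, best_cc = name, cc
    return best

# Hypothetical candidates: residual bandwidth (Mbps), delay (ms), hops
paths = ["p1", "p2", "p3"]
matrix = [[80, 120, 60],   # residual bandwidth: larger is better
          [5, 8, 3],       # delay: smaller is better
          [4, 6, 2]]       # hops: smaller is better
print(topsis(paths, matrix, (0.5, 0.3, 0.2), (False, True, True)))
# → p2
```

With bandwidth weighted most heavily, as LAMU does under high load, the high-bandwidth path wins even though it has the worst delay and hop count; shifting weight toward delay and hops would favor a different path.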

5. Implementation and Evaluation

5.1. Experiment Environment

The performance of the proposed erasure-coding update scheme is evaluated in this section. We implement the prototype of LAMU on Containernet [17], a fork of the well-known Mininet [29] network emulator. Different from Mininet, Containernet uses Docker [30] containers as hosts in emulated network topologies, which allows it to better simulate distributed storage systems. Ryu [31] is used as the SDN controller supporting the OpenFlow protocol. The entire experimental environment is deployed on an Ubuntu 18.04 system on a Sugon A840r-G server, which has AMD processors and 128 GB of memory. In terms of the experimental topology, we use Containernet 3.1.0 to simulate the fat-tree topology [32]. As shown in Figure 4, the bandwidth capacity of each link in the fat tree is set to 200 Mbps because the simulation experiment assumes limited resources. Storage nodes in the fat-tree topology are heterogeneous; when selecting the computing node for each update stripe, we assign weights to the access bandwidth, CPU utilization, residual memory, and I/O load.

To evaluate the performance of LAMU in a more realistic environment, we use the real distributed storage system background traffic pattern, measured in our previous work [33], to reproduce realistic network conditions, as shown in Table 2. According to [33], the speed of the heartbeat traffic is set to 1 Mbps to reduce the packet loss rate in the experimental environment; all the background traffic is maintained for a long time to ensure that it exists throughout the whole update process. To further evaluate the efficiency of our LAMU method under different network loads, three kinds of traffic load scenarios are set in the evaluation, as follows:

(i) Low-load (LL) scenario: 10 heartbeat flows, 10 user data flows, and 10 migration flows
(ii) Middle-load (ML) scenario: 20 heartbeat flows, 20 user data flows, and 20 migration flows
(iii) High-load (HL) scenario: 30 heartbeat flows, 30 user data flows, and 30 migration flows

The values of the weight coefficient set are usually determined through experiments [28]. As the system load increases, the network bandwidth resources become more limited; therefore, we increase the bandwidth weight as the load grows. We set the weights mentioned in equation (24) to , , and for the LL, ML, and HL scenarios, respectively.
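The per-scenario weighting can be sketched as a weighted-sum score over normalized node attributes. The weight values and attribute names below are illustrative placeholders, since the actual coefficients of equation (24) are tuned experimentally per load scenario; the nonduplicated node requirement is approximated here by a greedy pass over the ranked nodes.

```python
def score(node, weights):
    """Weighted-sum score: higher bandwidth and free memory are better,
    higher CPU utilization and I/O load are worse. All attributes are
    assumed to be normalized to [0, 1]."""
    return (weights["bw"] * node["bw_norm"]
            + weights["mem"] * node["mem_norm"]
            - weights["cpu"] * node["cpu_util"]
            - weights["io"] * node["io_load"])

def pick_computing_nodes(nodes, stripes, weights):
    """Assign a distinct best-scoring node to each update stripe (greedy),
    so no computing node is duplicated across stripes."""
    ranked = sorted(nodes, key=lambda n: score(n, weights), reverse=True)
    return {s: ranked[i]["name"] for i, s in enumerate(stripes)}
```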

In the evaluation, we compare LAMU with PUM-P [22] and DelaySelect. PUM-P also improves update efficiency by introducing a computing node, but it ignores the heterogeneity of computing nodes and of the network status: all nodes and routing paths have an equal probability of being selected. DelaySelect is extended from [23]; it also adopts a centralized update framework and improves update efficiency by selecting the path with the least delay as the routing path for update traffic. The comparison metrics are the average update time, the standard deviation of link bandwidth utilization, and the maximum link bandwidth utilization of the system.

We focus on the update performance of the different update schemes under various system load scenarios. In terms of experimental parameters, we vary the parameters that may affect update performance: the number of update data nodes, the number of parity nodes, the size of the data-delta, and the number of update stripes. The ranges of these parameters are listed in Table 3. To obtain convincing results, each experiment was run 10 times, and the average value was taken as the result.
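The averaging over repetitions can be captured by a trivial harness; `run_once` below is a hypothetical callable standing in for one complete update experiment that returns its measured update time.

```python
def average_update_time(run_once, trials=10):
    """Run one experiment `trials` times and return the mean update time,
    as done for every data point in the evaluation."""
    return sum(run_once() for _ in range(trials)) / trials
```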

5.2. Update Efficiency
5.2.1. Average Update Time with Varying Numbers of Parity Nodes in Different Load Scenarios

This subsection presents extensive comparisons of the average update time of the three update schemes under different experimental parameters and load scenarios. Figure 5 shows that the average update time increases with the number of parity nodes: as the number of parity nodes grows, more parity-deltas need to be transmitted, which increases the average update time. As the load becomes higher, the update times of the different schemes begin to diverge. As we can see, in the high-load (HL) scenario, LAMU reduces the average update time by 17.9% and 43.1% compared with DelaySelect and PUM-P, respectively.

5.2.2. Average Update Time with Varying Numbers of Update Data Nodes in Different Load Scenarios

Figure 6 shows that the average update time remains generally stable as the number of update data nodes increases in all load scenarios. This is because, with the total update data volume held constant, increasing the number of update data nodes reduces the average data-delta sent by each data node, so the extra time spent connecting to more data nodes is offset. As the load becomes higher, the update times of the different schemes begin to show more significant differences. In the low-load (LL) scenario, the three update schemes achieve comparable average update times. In the middle-load (ML) scenario, LAMU starts to show better update efficiency. In the HL scenario, LAMU reduces the average update time by 18.8% and 49.5% compared with DelaySelect and PUM-P, respectively.

5.2.3. Average Update Time with Varying Sizes of Update Data Volume in Different Load Scenarios

Figure 7 illustrates how the average update time increases along with the update data volume in different load scenarios. The three update schemes achieve comparable average update times in the LL scenario. In the ML scenario, LAMU starts to show better update efficiency. Compared with DelaySelect and PUM-P, LAMU reduces the average update time by 12.1% and 26.5% under the ML scenario, respectively, and 19.7% and 43.4% under the HL scenario, respectively.

5.2.4. Average Update Time with Varying Update Stripes in Different Load Scenarios

Figure 8 illustrates how the average update time varies with the number of concurrent update stripes. As the number of concurrent update stripes increases, the average update time of LAMU grows only slightly in all three scenarios, which shows that LAMU handles multistripe concurrent updates more efficiently. In contrast, the update times of DelaySelect and PUM-P increase significantly with the number of update stripes. Specifically, compared with DelaySelect and PUM-P, LAMU reduces the average update time by 10.4% and 28.8% under the ML scenario, respectively, and by 16.3% and 43.4% under the HL scenario, respectively.

5.3. Network Load-Balancing Performance
5.3.1. Standard Deviation of Link Bandwidth Utilization with Varying Update Stripes in Different Load Scenarios

To verify the load-balancing performance of the three update schemes, we evaluate the standard deviation of link bandwidth utilization, as presented in Figure 9. The lower the standard deviation, the more balanced the link loads. PUM-P has the largest standard deviation in all three scenarios because it does not consider the network status when scheduling update traffic, which easily leads to load imbalance. We can also notice that in the HL scenario, the standard deviation of PUM-P is slightly lower than in the ML scenario; the reason is that, as the load increases, PUM-P saturates more and more links, which decreases the standard deviation. DelaySelect has a lower standard deviation than PUM-P because it uses the link delay to schedule update traffic, which yields better load balancing. LAMU has the lowest standard deviation in all three scenarios because it comprehensively considers link bandwidth, delay, and path hops when scheduling update traffic; it thus achieves better load balancing and avoids the network congestion caused by several links reaching full load.

5.3.2. Network Maximum Link Bandwidth Utilization with Varying Update Stripes in Different Load Scenarios

The maximum link bandwidth utilization represents the utilization of the most congested link in the system; the larger it is, the more unbalanced the system. As shown in Figure 10, fully loaded links already appear in PUM-P in the ML scenario and in DelaySelect in the HL scenario, which means that some links in the system are highly congested. LAMU has the lowest maximum link bandwidth utilization in all three scenarios, meaning that it achieves better load balancing and avoids the network congestion caused by links reaching full load.
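The two load-balancing metrics of this section can be computed directly from per-link bandwidth samples. The sketch below assumes the 200 Mbps link capacity of the experimental setup and uses the population standard deviation; whether the paper uses the population or sample form is not stated, so that choice is an assumption.

```python
import statistics

def link_balance_metrics(used_mbps, capacity_mbps=200.0):
    """Return (standard deviation of link utilization,
    maximum link utilization) for a list of per-link bandwidth samples."""
    util = [u / capacity_mbps for u in used_mbps]
    return statistics.pstdev(util), max(util)
```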

6. Conclusions

Erasure coding has become an indispensable redundancy mechanism in today's large-scale distributed storage systems. However, data updates in erasure coding introduce additional computational load and network traffic, which reduces update efficiency and affects system load balancing. Most existing erasure-coding update schemes ignore the heterogeneity of node and network status as well as the multistripe concurrent updates caused by data correlation. To solve this problem, this paper establishes an optimization model of multistripe updates with multiple QoS constraints in a heterogeneous environment and then proposes LAMU, a load-aware multistripe concurrent update scheme. LAMU first introduces SDN to measure the node loads and network status in real time; the obtained node and network information is then used to select nonduplicated computing nodes with better capacity for the multiple update stripes. Finally, a multiattribute decision-making method is used to schedule the network traffic among data nodes, computing nodes, and parity nodes. Extensive experimental results show that LAMU reduces the average update time while providing better load-balancing performance.

Moreover, we will consider implementing LAMU in a real erasure-coded storage system in the future. Another direction for future work is to use reinforcement learning to adjust the decision parameter weights when scheduling update traffic and to trade off the number and locations of the computing nodes for better results.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research work was supported by the National Natural Science Foundation of China (Nos. 61861013 and 62161006), the Science and Technology Major Project of Guangxi (No. AA18118031), and the Innovation Project of Guangxi Graduate Education (No. YCSW2022271).