Abstract

Data placement considerably affects the I/O performance of distributed storage systems such as HDFS. An ideal placement algorithm should keep the I/O load evenly distributed among the different storage nodes. Most of the existing placement algorithms with an I/O load balance guarantee depend on data popularity information to make placement decisions. However, the popularity information is typically not available in the data placement phase, and it usually varies during the data lifecycle. In this paper, we propose a new placement algorithm called Balanced Distribution for Each Age Group (BEAG), which makes data placement decisions in the absence of popularity information. The algorithm maintains multiple counters for each storage node, with each counter representing the amount of data belonging to a certain age group. It ensures that the data in each age group are equally scattered among the different storage nodes. As the popularity variance of the data belonging to the same age group is considerably smaller than that of the entire dataset, BEAG significantly improves the I/O load balance. Experimental results show that compared to other popularity-independent algorithms, BEAG decreases the I/O load standard deviation by 11.6% to 30.4%.

1. Introduction

In the big data era, distributed storage systems have attracted considerable attention [13]. In a distributed storage system, the data placement algorithm considerably affects the overall I/O performance. An ideal data placement algorithm should generate not only a balanced usage of disk space but also a balanced distribution of the I/O load among the different storage nodes.

Balanced usage of disk space is easy to achieve, but guaranteeing a balanced distribution of the I/O load is extremely difficult. The I/O load caused by one data file is equal to the product of its size and popularity. Data popularity typically follows a skewed distribution such as Zipf. Hot files are frequently accessed, while cold ones are rarely accessed. Therefore, storage nodes assigned more hot files are prone to be overloaded, while the other nodes may be idle, causing the entire system to be underutilized [4, 5]. Even worse, the popularity of each data file, measured by the average number of requests accessing the file per unit time, is typically not available in the data placement phase and changes dynamically during the entire data lifecycle.

To achieve the I/O load balance, many data placement algorithms for distributed storage systems have been proposed. According to whether the popularity information is taken as the necessary prerequisite for making placement decisions, the existing data placement algorithms can be divided into two categories: popularity-dependent algorithms and popularity-independent algorithms.

Popularity-dependent algorithms take data popularity information as the necessary prerequisite for making placement decisions. They mainly include two steps. First, the I/O load caused by each data file is calculated as the product of the data file size and its popularity. Second, an optimization algorithm is used to minimize the variance of the I/O load assigned to each storage node.

A popularity-independent algorithm makes data placement decisions without using any popularity information. A typical popularity-independent algorithm is the pseudorandom algorithm based on hash computations. It takes the file identifier as the input of a hash function and takes the output of the function as the destination storage node. Such a placement algorithm is extremely easy to implement in engineering applications, and a carefully designed hash function ensures the balanced usage of disk space. However, its drawback is the poor level of I/O load balance, as the distribution of data popularity is highly skewed. Moreover, even if the placement result were accidentally desirable within a certain time period, it may be unsatisfactory in another time period because the data popularity changes dynamically.

The contributions of this paper include the following. Firstly, it presents the idea of making data placement decisions depending on the data creation time instead of the data popularity. It takes advantage of the regularity observed in many applications that the popularity variance of the data belonging to the same age group is considerably smaller than that of the entire dataset. Secondly, it proposes a practical algorithm realizing the abovementioned idea. The algorithm maintains multiple counters for each storage node, with each counter representing the amount of data created within a certain time period. It ensures that the data created within each time period are equally scattered among the different storage nodes. Thirdly, extensive experiments have been conducted showing that the proposed algorithm achieves an 11.6% to 30.4% reduction of the standard deviation of the I/O load.

The rest of this paper is organized as follows. Section 2 summarizes the related work. In Section 3, we elaborate the BEAG algorithm. Section 4 presents the experimental results, and Section 5 concludes this paper.

2. Related Work

I/O load balance is an important means of improving the I/O performance of distributed storage systems [6]. The level of I/O load balance is considerably affected by the data placement algorithm. Designing a data placement algorithm with an ideal I/O load balance guarantee is very challenging, as data popularity follows an extremely skewed distribution.

Many data placement algorithms have been proposed to achieve the I/O load balance. According to whether the data popularity information is taken as the necessary prerequisite for making placement decisions, the existing data placement algorithms can be divided into two categories: popularity-dependent algorithms and popularity-independent algorithms.

The popularity-dependent algorithms can be further divided into two subclasses: static popularity-dependent algorithms and dynamic popularity-dependent algorithms. Static popularity-dependent algorithms assume that the popularity of each data file is known in advance and will never change in the future. Dynamic popularity-dependent algorithms assume that there is a popularity monitor that periodically collects the popularity information of each data file. The algorithms dynamically make data migration decisions according to the information provided by the popularity monitor, thereby improving the level of I/O load balance. However, their main drawback is the high overhead. On the one hand, figuring out the optimal data migration plan requires complex computations. On the other hand, executing the data migration plan causes additional I/O load that competes for the scarce I/O bandwidth.

SP [7] is a static popularity-dependent algorithm. First, the average I/O load to assign to each storage node is calculated according to the size and the popularity of each data file. Second, the data files are sorted in the descending order of their sizes. Third, the data files are assigned to the storage nodes in the abovementioned order. Data files are not assigned to a new storage node until the I/O load assigned to the current storage node reaches the average value calculated in the first step.

CDRM [8] is a dynamic popularity-dependent algorithm. It continuously monitors the variation of the data popularity and updates the blocking probability of each storage node. On the basis of the blocking probability, it dynamically replicates data on the idle storage nodes to improve the level of I/O load balance.

Kari et al. [9] also proposed a dynamic popularity-dependent algorithm that migrates data from one node to another for load balance or in the event of a system failure or upgrade. Its main contribution is the maximization of the number of simultaneous transfers so as to minimize the data migration time. Further, Sha et al. [10] proposed a dynamic popularity-dependent algorithm for MapReduce applications.

DRS [11] is a dynamic popularity-dependent algorithm based on a Markov model. It constructs a transition probability matrix based on the number of file accesses within a period and calculates the stationary probability distribution of the system. It utilizes the results to distinguish different data types, adds extra replicas for hot data, and cleans up these extra replicas when the data cool down.

Kinesis [12] is a popularity-independent algorithm. It adopts a hash-based replica placement strategy. It devises k independent hash functions and thus generates k candidate storage nodes for each data file. Among these candidate nodes, only r (r < k) nodes are finally selected to store a replica of the file.

Xie and Chen [13] also proposed a hash-based popularity-independent algorithm. It offers an elastic distributed storage system with power proportionality. When the system load decreases, some of the storage nodes will be powered off to reduce energy consumption.

Wang et al. [14] proposed a placement strategy dealing with the data placement problem among multiple data centers.

3. BEAG Placement Algorithm

3.1. Main Idea

Data popularity directly reflects the pressure on the I/O load. However, its value is typically not available in the data placement phase. Therefore, we need another variable whose value is easily obtainable and can reflect data popularity, perhaps in an indirect way. The age of the data is a good alternative. It represents how long ago the data was created. On one hand, the age of the data is extremely easy to obtain. To calculate the age, the only required information is the creation time of the data, which is already recorded in existing storage systems. On the other hand, the age of the data is statistically correlated with the data popularity. The popularity variance of the data belonging to the same age group is considerably smaller than that of the entire dataset.

The main idea of the BEAG algorithm can be stated as follows. The algorithm calculates the age of each data file according to its creation time. All of the data are then classified into multiple groups according to their age. The algorithm ensures not only that all of the data are equally scattered among the different storage nodes but also that the data in each age group are equally scattered, preventing a situation in which some nodes store more new data while other nodes store more old data. As the popularity variance of each age group is considerably smaller than that of the entire dataset, the algorithm can obtain a better I/O load balance than the pseudorandom algorithms.

3.2. Elaboration of the Algorithm

To equally distribute the data in each age group among the different storage nodes, the algorithm should maintain an array of counters for each storage node. Each counter represents the amount of data allocated to the storage node within a certain age group. The algorithm keeps the counter arrays for different storage nodes approximately the same all the time.

The BEAG placement algorithm mainly contains three subalgorithms. The first one is called the initialization subalgorithm. It is responsible for the initialization of the counter arrays and the placement of the initial set of data files. The second one is called the in-progress subalgorithm. It keeps running all the time after the completion of the initialization subalgorithm and is responsible for handling all the possible events that can change the data placement. Such events include the creation of a new file, the deletion of an existing file, the joining of a new node, and the exit of an existing node. The last one is called the self-refreshing subalgorithm. It also keeps running after the completion of the initialization subalgorithm, but it does not change the data placement results; it only updates the values of the counter arrays. With the passage of time, the age group that a data file belongs to changes, so the counter arrays must be updated in time.

We focus on distributed storage systems that use the architecture adopted by HDFS [15], that is, the master-slave architecture. There are two types of nodes in the system: a name node and multiple data nodes. The name node, also called the metadata node, is responsible for making the data placement decisions and maintaining the mapping of data files and data nodes. A data node, also called a storage node, is responsible for storing data and handling access requests. The proposed placement algorithm, BEAG, runs on the name node.

3.2.1. Initialization Subalgorithm

The initialization subalgorithm works in two stages. In the first stage, it initializes the counter arrays. In the second stage, it determines the placement of the initial set of data files.

To initialize the counter array for each storage node, the initialization subalgorithm needs to determine the number of counters contained in each counter array as well as the age group that each counter corresponds to. Let m denote the number of counters contained in each counter array; then, the data files are divided into m age groups. To describe the age groups, we only require m - 1 positive numbers as the division points. Let T = {t_1, t_2, ..., t_{m-1}} denote the set of division points. The first age group contains the data whose age is not greater than t_1. The last age group contains the data older than t_{m-1}. The ith (1 < i < m) age group contains the data whose age is between t_{i-1} and t_i. We call the value t_i - t_{i-1} the age span of the ith age group. Note that the age span of the first age group is equal to t_1 and that of the last age group is positive infinity. Moreover, we take “day” rather than “year” as the measurement unit for the age of the data. As data older than t_{m-1} are not differentiated from the point of view of the age group, we call t_{m-1} the maximum differentiable age.

To determine the division points, we choose an exponential function, that is, t_i = 2^i (i = 1, 2, ..., m - 1). The first age group only contains the data files created no more than two days back, the second age group contains the data files created more than two but not more than four days back, the third age group contains the data files created more than four but not more than eight days back, and so on.

The above method for age group division has two advantages. First, only a small m is required to obtain the ideal maximum differentiable age. For example, m = 20 means that the maximum differentiable age is 2^19 = 524,288 days, that is, more than 1,436 years. For realistic storage systems, the data lifecycle is only tens or hundreds of years long. Therefore, irrespective of the type of storage system, we can always divide the data into 20 age groups. Second, the popularity variance is very small for each age group. Although the age span increases exponentially with the group index, the popularity variance does not increase: the popularities of old data usually fall in the range between zero and a small positive number, so their variance is not very large.
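As a concrete illustration of the division scheme, the following Python snippet (a sketch of our own, with hypothetical function names, not part of the original algorithm listings) computes the division points t_i = 2^i and maps a file age, measured in days, to its age-group index:

    def division_points(m):
        """Return the m - 1 division points t_i = 2**i (in days)."""
        return [2 ** i for i in range(1, m)]

    def age_group(age_days, points):
        """Return the 1-based index of the age group a file of the given age falls into."""
        for i, t in enumerate(points, start=1):
            if age_days <= t:
                return i
        return len(points) + 1  # older than the maximum differentiable age

    # Example: with m = 20 groups, a 100-day-old file falls into group 7 (64 < 100 <= 128).
    points = division_points(20)
    assert age_group(100, points) == 7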

After determining the number of counters in each counter array and the age group that each counter corresponds to, the initialization subalgorithm assigns zero to every counter of all the storage nodes. The next step is to solve the placement of the initial set of data files. For each data file, the algorithm calculates its age, that is, the difference between the current time and the file’s creation time. Note that the initial files can have been created at different times, as they may have been stored earlier in other devices or systems. According to the age of the file, the algorithm determines which age group it belongs to. The files are then grouped into subsets on the basis of age, and each subset is distributed independently. For each subset, each storage node maintains a counter to record the amount of data assigned to it. Files in the subset are distributed one by one: each time, the next file is assigned to the storage node with the minimum counter, and that counter is then increased by the size of the file. Eventually, the files in each subset (or age group) are approximately equally distributed among the different storage nodes.

The initialization subalgorithm can be described as Algorithm 1. The symbol N represents the number of storage nodes in the system, and m represents the number of counters in each counter array. As discussed above, 20 is an acceptable value for m. C is an N × m matrix, in which the element C[j][i] represents the value of the ith counter for the jth storage node. T represents the set of division points, F represents the initial set of data files, and S represents the set of storage nodes, with N = |S|.

Data: data file set F; storage node set S.
Result: mapping of the data file set F to the storage node set S; counter arrays C.
Initialize each counter C[j][i] with 0, each division point t_i with 2^i, and N with |S|
while there is an unhandled file f in F do
    Query the file's creation time and calculate the file's age a
    g ← m
    for i ← 1 to m - 1 do
        if a ≤ t_i then g ← i; break
    end
    Add f to the subset F_g of the gth age group
end
for i ← 1 to m do
    while there is an unhandled file f in F_i do
        j* ← 1
        for j ← 2 to N do
            if C[j][i] < C[j*][i] then j* ← j
        end
        Place the file f on the storage node j*
        C[j*][i] ← C[j*][i] + size(f)
    end
end
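To make the counter bookkeeping concrete, the following Python sketch mirrors the initialization subalgorithm under our own naming assumptions (beag_initialize, a placement dictionary, and a nodes-by-groups counter matrix); it reuses the hypothetical division_points and age_group helpers from the earlier sketch and is an illustration, not the authors' implementation:

    def beag_initialize(files, num_nodes, m=20):
        """files: list of (file_id, age_days, size) tuples.

        Returns (placement, counters), where placement maps file_id -> node index
        and counters[j][i] is the amount of data of age group i + 1 on node j."""
        points = division_points(m)                  # t_i = 2**i, as in the earlier sketch
        counters = [[0] * m for _ in range(num_nodes)]
        placement = {}

        # Group the initial files by age, then distribute each group independently.
        groups = [[] for _ in range(m)]
        for file_id, age_days, size in files:
            groups[age_group(age_days, points) - 1].append((file_id, size))

        for i, group in enumerate(groups):
            for file_id, size in group:
                # Greedy choice: the node currently holding the least data of this age group.
                j = min(range(num_nodes), key=lambda node: counters[node][i])
                placement[file_id] = j
                counters[j][i] += size
        return placement, counters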
3.2.2. In-Progress Subalgorithm

After the initialization, the storage system may exist for a long time, such as tens of years. During its lifetime, new data placement decisions are required under the following four circumstances. Firstly, a new file is created in the system. Secondly, an existing data file is deleted from the system. Thirdly, a new node joins the system. Finally, an existing node leaves the system. The in-progress subalgorithm handles these four types of events. It contains four reactors, each handling one type of event. Whenever an event takes place, the corresponding reactor is activated.

The file-creation reactor (Algorithm 2) is used to handle the event of the creation of a new file. The age of a new file is equal to zero, which falls in the range of the first age group. Firstly, the reactor checks the first counter in each counter array to find out the minimum one. Secondly, the reactor places the file into the storage node with the minimum first counter. Finally, the counter is increased by the size of the file.

Data: newly created file f
Result: destination storage node for f
Initialize j* with 1
for j ← 2 to N do
    if C[j][1] < C[j*][1] then j* ← j
end
Place the file f on the storage node j*
C[j*][1] ← C[j*][1] + size(f)
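A minimal Python sketch of the file-creation reactor under the same assumed data structures (a newly created file always enters the first age group, index 0 in zero-based Python):

    def on_file_created(file_id, size, placement, counters):
        """Place a newly created file; its age is zero, so it belongs to the first age group."""
        j = min(range(len(counters)), key=lambda node: counters[node][0])
        placement[file_id] = j
        counters[j][0] += size
        return j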

The file-deletion reactor (Algorithm 3) is used to handle the event of the deletion of a file. Firstly, the reactor queries the creation time of the file and calculates its age. Secondly, the reactor checks which age group the file belongs to and which storage node the file is placed on. Thirdly, the reactor deletes the file from the storage node and decreases the corresponding counter for the storage node by the size of the file.

Data: file f to delete
Result: renewal of the counter arrays C
Query the file's creation time and calculate the file's age a
g ← m
for i ← 1 to m - 1 do
    if a ≤ t_i then g ← i; break
end
Query the storage node j holding f
Delete f from the storage node j
C[j][g] ← C[j][g] - size(f)
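The file-deletion reactor only rolls back the bookkeeping; a sketch under the same assumptions, reusing the hypothetical age_group helper:

    def on_file_deleted(file_id, age_days, size, placement, counters, points):
        """Remove a file and decrease the counter of its age group on its node."""
        i = age_group(age_days, points) - 1   # age group of the deleted file
        j = placement.pop(file_id)            # node currently holding the file
        counters[j][i] -= size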

The node-joining reactor (Algorithm 4) is used to handle the event of a new node joining the storage system. The system has to assign the new node an I/O load equal to that assigned to the other nodes. For each age group, the reactor first calculates the total amount of the corresponding counters, the average amount per node before the new node joins, and the average amount per node after it joins. The difference between the two averages is the amount of data required to migrate from each existing node to the new node. Next, the reactor migrates the required amount of data belonging to the age group from each existing node to the new node. The corresponding counter of each existing node is decreased, while that of the new node is increased, by the amount of the migrated data.

Data: new storage node, numbered N + 1
Result: data migration; renewal of the counter arrays C
for i ← 1 to m do
    total ← 0
    for j ← 1 to N do
        total ← total + C[j][i]
    end
    d ← total/N - total/(N + 1)
    for j ← 1 to N do
        Select a subset D of files on the jth storage node belonging to the ith age group, with the total size of the files in D approximately equal to d
        Migrate the files in D from the jth storage node to the new storage node
        C[j][i] ← C[j][i] - size(D); C[N + 1][i] ← C[N + 1][i] + size(D)
    end
end
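The per-group migration quota is the drop in the per-node average when the node count grows from N to N + 1. A sketch under the same assumptions follows; pick_files is a hypothetical callback that selects roughly the requested amount of group-i data on a node, since file selection lies outside the counter bookkeeping:

    def on_node_joined(counters, pick_files):
        """Add one storage node and rebalance each age group toward it."""
        n, m = len(counters), len(counters[0])
        counters.append([0] * m)                    # counters of the new node
        for i in range(m):
            total = sum(counters[j][i] for j in range(n))
            quota = total / n - total / (n + 1)     # amount to move from each old node
            for j in range(n):
                moved = pick_files(j, i, quota)     # actual size of the migrated files
                counters[j][i] -= moved
                counters[n][i] += moved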

The node-leaving reactor (Algorithm 5) is used to handle the exit of a node. Without any loss of generality, suppose that the node numbered N leaves the system. For each age group, the reactor first calculates the average amount of data required to migrate to each remaining node and divides the files of the age group on the leaving node into N - 1 approximately equal subsets. Then, the reactor migrates one subset to each remaining node and increases the corresponding counter by the amount of the migrated data.

Data: leaving storage node, numbered N
Result: data migration; renewal of the counter arrays C
for i ← 1 to m do
    Divide the files belonging to the ith age group on the Nth node into N - 1 subsets, with the total size of the files in each subset approximately equal to C[N][i]/(N - 1)
    for j ← 1 to N - 1 do
        Migrate the jth subset from the Nth node to the jth node
        C[j][i] ← C[j][i] + size of the migrated subset
    end
end
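Similarly, a sketch of the node-leaving reactor under the same assumptions; move_subset is a hypothetical callback that migrates roughly the requested amount of group-i data from the leaving node to a destination node:

    def on_node_left(leaving, counters, move_subset):
        """Spread the data of the leaving node evenly over the remaining nodes."""
        survivors = [j for j in range(len(counters)) if j != leaving]
        m = len(counters[0])
        for i in range(m):
            share = counters[leaving][i] / len(survivors)
            for dst in survivors:
                moved = move_subset(i, leaving, dst, share)   # actual size migrated
                counters[dst][i] += moved
        counters[leaving] = [0] * m   # the leaving node no longer holds any data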
3.2.3. Self-Refreshing Subalgorithm

With the passage of time, the age of each file increases and the age group that it belongs to changes. Therefore, the counter arrays need to be updated in time. Otherwise, the in-progress algorithm will make incorrect decisions.

As we take “day” as the unit for the data age, the age of each file has to be updated every day, and thus, the self-refreshing subalgorithm is triggered once a day. Suppose that the files are sorted in the descending order of the creation time, thus the ascending order of the data age. Note that the files are required to be ordered only once: with the passage of time, all the files become older together, and their order remains the same. Suppose that b_i (1 ≤ i ≤ m) represents the largest order number among the files belonging to the ith age group. We know that the age of the b_i-th file is not greater than t_i and that of the (b_i + 1)-th file is greater than t_i. Once the self-refreshing subalgorithm is triggered, the most important task is to update the value of each b_i. Once the value of each b_i is determined, the value of each counter can be deduced, as the files belonging to the ith age group are exactly those whose global order falls between b_{i-1} + 1 and b_i.

The self-refreshing subalgorithm can be described as Algorithm 6. The symbol F represents the set of files stored in the storage system.

Data: counter arrays C and each file's age
Result: renewal of the counter arrays C
Increase the age of each file by one day
Set each element C[j][i] to 0
k ← 1; b_0 ← 0
for i ← 1 to m - 1 do
    while k ≤ |F| and the age of the kth file is not larger than t_i do
        k ← k + 1
    end
    b_i ← k - 1
end
b_m ← |F|
for i ← 1 to m do
    for l ← b_{i-1} + 1 to b_i do
        Query the number j of the node that holds the lth file
        C[j][i] ← C[j][i] + size of the lth file
    end
end
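A Python sketch of the daily refresh under the same assumptions, with the files kept sorted by ascending age so that each age-group boundary only moves forward while the counters are rebuilt:

    def refresh_counters(files_by_age, placement, counters, points):
        """files_by_age: list of (file_id, age_days, size) sorted by ascending age.

        Rebuilds counters[j][i] after every file's age has been advanced by one day."""
        n, m = len(counters), len(counters[0])
        for j in range(n):
            for i in range(m):
                counters[j][i] = 0
        k = 0
        for i in range(m):
            limit = points[i] if i < m - 1 else float("inf")   # the last group is unbounded
            while k < len(files_by_age) and files_by_age[k][1] <= limit:
                file_id, _, size = files_by_age[k]
                counters[placement[file_id]][i] += size
                k += 1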
3.3. Complexity Analysis

First, the time complexity of the initialization subalgorithm is O(|F| · (m + N)), where |F| represents the number of files and N represents the number of storage nodes. Since m, that is, the number of age groups, is typically not greater than 20, the time complexity of the initialization subalgorithm can be written as O(|F| · N). Second, the time complexity of the file-creation reactor is O(N), and that of the file-deletion reactor is O(m). Third, both the time complexity of the node-joining reactor and that of the node-leaving reactor are O(m · N) in terms of counter updates, excluding the cost of the data migration itself. Finally, the time complexity of the self-refreshing subalgorithm is O(|F|). Therefore, BEAG is a lightweight solution to the data placement problem.

4. Evaluation

We implemented the BEAG data placement algorithm and compared it with both the pseudorandom algorithm and Kinesis [12]. We chose the standard deviation of the I/O load assigned to each storage node as the criterion for the performance evaluation. Let N denote the number of storage nodes and L_i (1 ≤ i ≤ N) denote the I/O load assigned to the ith storage node. The standard deviation of the I/O load assigned to each storage node is denoted by σ and calculated as follows: σ = sqrt( (1/N) Σ_{i=1}^{N} (L_i - μ)² ). In this expression, μ = (1/N) Σ_{i=1}^{N} L_i. The I/O load assigned to the ith storage node is equal to the sum of the I/O load generated by each data file placed onto the storage node, and the I/O load generated by each data file is equal to the product of its size and popularity. Note that all three algorithms, that is, BEAG, Kinesis, and the pseudorandom algorithm, make placement decisions without using any popularity information; the popularity is only used to evaluate the performance afterwards. Here, a smaller σ means a higher level of I/O load balance.
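The balance metric itself is just the population standard deviation of the per-node loads; a small sketch assuming the per-node loads have already been computed:

    import math

    def load_std(loads):
        """Standard deviation of the per-node I/O load; smaller means better balance."""
        mu = sum(loads) / len(loads)
        return math.sqrt(sum((x - mu) ** 2 for x in loads) / len(loads))

    # Example: three nodes with loads of 10, 20 and 30 KB/s give a deviation of about 8.16 KB/s.
    print(load_std([10, 20, 30]))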

The method of performance evaluation described above requires convincing I/O workload generators. We used two types of I/O workload generators. One is called Reproducer, which reproduces the file creation and access process on the basis of the metadata collected from a real-world application (i.e., the blog system of ScienceNet.cn). The other is called Medisyn [16] designed by Hewlett Packard Labs, which relies on a simulation model to describe the process of file creation and access in video-on-demand applications.

The evaluation mainly includes four steps. First, we use the I/O workload generator to obtain each file’s age, size, and popularity. Second, we employ each of the three algorithms to map the files onto the storage nodes. Third, we calculate the I/O load assigned to each storage node by summing up the I/O load generated by each file placed on it. Finally, we evaluate the performance through the standard deviation of the I/O load assigned to each storage node.

4.1. Evaluation on Real-World Datasets

The I/O workload generator Reproducer generates a number of tuples in the form of ⟨t, p, s⟩. In the tuple, t represents the file’s creation time, p denotes the file’s current popularity, and s represents the size of the file. The file’s creation time is recorded in the system and can be directly obtained. The file’s current popularity is expressed as the number of access requests on the observed day. Once again, note that p is only used for the performance evaluation, and all three algorithms make placement decisions without using any popularity information. In the experiment, we developed a web crawler to collect the publication time and download counts of 4,301 articles from the blog system of ScienceNet.cn. By analyzing their publication time, we found that the articles’ ages fell in the range between one day and 2,865 days. Therefore, we divided the files into 12 age groups, with the ith group containing the files ranging from 2^{i-1} to 2^i days in age.

Firstly, we analyzed the data popularity distribution. The aim was to compare the popularity variance of the data in the same age group and that of the entire dataset.

Secondly, we compared the proposed algorithm with both Kinesis and the pseudorandom algorithm. We adopted σ, that is, the standard deviation of the I/O load assigned to each storage node, as the metric for the I/O load balance.

4.1.1. Popularity Distribution Analysis

We used the standard deviation to describe the variance of the data popularity. For the data files as a whole, the standard deviation of the data popularity was calculated as σ_F = sqrt( (1/|F|) Σ_{f ∈ F} (p_f - p̄)² ). Here, p_f represents the current popularity of the file f, p̄ represents the average current popularity of the files in F, and |F| represents the number of files contained in the file set. For the data files divided into multiple age groups, we first computed the standard deviation of the data popularity in each age group. Then, we obtained the expectation of the standard deviation of the data popularity by using the following formula: E(σ) = Σ_{i=1}^{m} (|F_i| / |F|) · σ_i. Here, σ_i represents the standard deviation of the data popularity in the ith age group and |F_i| represents the number of files contained in that age group.
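Both measures are easy to compute; a sketch assuming each file is given as an (age-group index, popularity) pair, with hypothetical function names:

    from collections import defaultdict

    def popularity_std(pops):
        """Population standard deviation of a list of popularity values."""
        mu = sum(pops) / len(pops)
        return (sum((p - mu) ** 2 for p in pops) / len(pops)) ** 0.5

    def expected_group_std(files):
        """files: list of (group_index, popularity). Returns E(sigma), the per-group
        standard deviation weighted by the share of files in each group."""
        groups = defaultdict(list)
        for g, p in files:
            groups[g].append(p)
        total = len(files)
        return sum(len(pops) / total * popularity_std(pops) for pops in groups.values())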

Through the analysis of the data generated by the I/O workload generator Reproducer, we found that σ_F, that is, the standard deviation of the popularity for the files as a whole, was equal to 60.6, and that E(σ), that is, the expectation of the standard deviation of the popularity for the files divided into multiple age groups, was equal to 10.3. The above results validated our assumption that the popularity variance of the data in the same age group is considerably smaller than that of the entire file set.

For a further comparison of the popularity diversity in the entire file set and that in each age group, we computed the ratio of the maximal popularity to the minimal popularity. A larger ratio means a higher level of diversity. The results are listed in Table 1. In the table, F_i represents the file subset composed of the files belonging to the ith age group, and F represents the entire file set. The table shows that the diversity of the data popularity in each age group is considerably smaller than that in the entire file set.

4.1.2. Performance Comparison

In the experiment, we assumed a storage system composed of 10 storage nodes. We used the pseudorandom algorithm, Kinesis, and BEAG, respectively, to place the files onto the storage nodes. The comparison results of the three algorithms are shown in Table 2. In the table, the maximum load denotes the amount of the I/O load on the most loaded storage node, while the minimum load denotes the amount of the I/O load on the least loaded storage node. The unit of the I/O load is KB/s.

The BEAG algorithm yielded the best I/O load balance, with the standard deviation of I/O load assigned to each storage node decreasing by 30.4% and 22.7%, respectively, compared with the pseudorandom algorithm and Kinesis.

4.2. Evaluation on Synthetic Datasets

The I/O workload generator Medisyn can also generate a number of tuples in the form of ⟨t, p, s⟩. It does not replay traces from any real system; instead, it relies on a stochastic model to simulate the process of file creation and access. It has been demonstrated that the model correctly describes the I/O workload characteristics of video-on-demand applications. In this model, the file creation process obeys a Poisson distribution, the popularity among different files obeys a Zipf distribution, the popularity evolution process of each file obeys a log-normal distribution, and the size of each file also obeys a Zipf distribution. The Poisson distribution is described by its rate parameter, the Zipf distribution by its skew parameter, and the log-normal distribution by two parameters. Moreover, we need two additional parameters: the total number of files and the total I/O load.
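Medisyn itself is a full workload generator; the snippet below is only a much-simplified sketch of the kind of model described above (Poisson file arrivals, Zipf-like popularity and sizes), using numpy and hypothetical parameter names, and is not the Medisyn code or its exact parameterization:

    import numpy as np

    def synth_tuples(n_files, arrival_rate, zipf_pop, zipf_size, seed=0):
        """Generate n_files tuples (creation_day, popularity, size) from a toy model."""
        rng = np.random.default_rng(seed)
        # Poisson arrival process: exponential inter-arrival gaps with the given rate (files/day).
        creation_days = np.cumsum(rng.exponential(1.0 / arrival_rate, n_files))
        # Zipf-like popularity and size; numpy's zipf sampler requires a parameter > 1.
        popularity = rng.zipf(1.0 + zipf_pop, n_files)
        size = rng.zipf(1.0 + zipf_size, n_files)
        return list(zip(creation_days, popularity, size))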

4.2.1. Popularity Distribution Analysis

With the Poisson and Zipf parameters set to 0.5 and 0.8, the total number of files set to 45,000, the total I/O load set to 180,000,000, and the two log-normal parameters both set to 3, we generated 45,000 tuples of the form ⟨t, p, s⟩. By analyzing the data age, we found that the ages fell in the range of 0 to 2,499 days. Therefore, we divided the files into 12 age groups. For the entire dataset, the standard deviation of the data popularity was calculated as σ_F = sqrt( (1/|F|) Σ_{f ∈ F} (p_f - p̄)² ). For the data files divided into multiple age groups, we first computed the standard deviation of the data popularity in each age group. Then, we obtained the expectation of the standard deviation of the data popularity through the formula E(σ) = Σ_{i=1}^{m} (|F_i| / |F|) · σ_i.

Through the analysis of the synthetic data, we found that σ_F, that is, the standard deviation of the popularity for the files as a whole, was equal to 27.0, and that E(σ), that is, the expectation of the standard deviation of the popularity for the files divided into multiple age groups, was equal to 4.6. These results also validated our assumption that the popularity variance of the data in the same age group is considerably smaller than that of the entire file set.

Once again, for a further comparison of the popularity diversity in the entire file set and that in each age group, we computed the ratio of the maximal popularity to the minimal popularity; the results are listed in Table 3. In the table, F_i represents the file subset composed of the files belonging to the ith age group, and F represents the entire file set. The table shows that the diversity of the data popularity in each age group was also considerably smaller than that in the entire file set.

4.2.2. Performance Comparison

In the experiment, we assumed a storage system composed of 100 storage nodes. We used the pseudorandom algorithm, Kinesis, and BEAG, respectively, to place the 45,000 files onto the storage nodes. The comparison results of the three algorithms are shown in Table 4. The unit of the I/O load is KB/s.

The BEAG algorithm again yielded the highest level of I/O load balance, with the standard deviation of the I/O load assigned to each storage node decreasing by at least 11.6% compared with the pseudorandom algorithm and Kinesis.

5. Conclusion

Most traditional data placement algorithms for distributed storage systems depend on data popularity information for making placement decisions to realize the I/O load balance. However, data popularity is usually unknown in the data placement phase and changes dynamically during the data lifecycle. We proposed a new data placement algorithm that does not use any popularity information. The algorithm makes use of the correlation between a file’s creation time and its popularity. It ensures that the data created in each time period are evenly scattered among the different storage nodes. Compared to other popularity-independent algorithms, the proposed algorithm guarantees a higher level of load balance, with the I/O load standard deviation decreasing by 11.6% to 30.4%. However, the algorithm only considers homogeneous storage systems. In the future, we plan to extend it to heterogeneous environments.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (Grant No. 61702408), Natural Science Basic Research Plan in Shaanxi Province of China (Program No. 2017JQ6053), the National Natural Science Foundation of China (Grant No. 61472316), and the Innovation Group for Interdisciplinary Computing Technologies in Xi’an University of Science and Technology.