Abstract

This paper addresses the fair distribution of several files of different sizes over several storage supports. Given a set of storage supports and a set of files, we seek a method that performs an appropriate backup, one that guarantees a fair distribution of the big data (files). Fairness is measured on the used spaces of the storage supports. The problem is to find a fair method that stores all files on the available storage supports, where each file is characterized by its size. We propose fairness methods that seek to minimize the gap between the used spaces of all storage supports. Several algorithms are developed to solve the proposed problem, and an experimental study shows the performance of these algorithms.

1. Introduction

The manner in which the storage process is carried out is a primordial issue. Indeed, when facing big data and few storage supports, it is important to find a method that stores all given files on these storage supports while guaranteeing an equitable distribution of the files; this equitable distribution can also be called file size balancing. File size balancing relies on scheduling algorithms that guarantee a minimum gap between the used spaces of the storage supports. In the unbalanced case, some storage supports have high used space while, at the same time, others have low used space. To avoid such cases, appropriate scheduling algorithms can be applied. In this paper, we propose balancing algorithms that obtain an equitable distribution of files over storage supports; as far as we know, this problem has never been studied in the literature. Some research works related to the balancing process can be cited. Singh et al. [1] proposed a dynamic load balancing algorithm for strongly connected servers, which takes into account the servers' parallel-processing capability and their request-queuing capacity in order to classify overloaded and least-loaded servers; once a server becomes overloaded, its load is migrated to the least-loaded one. In [2], Hung et al. introduced an enhancement of the max-min scheduling algorithm that decreases the completion time of clients' requests; this algorithm uses supervised machine learning to cluster virtual machines by utilization percentage and requests by size, and the virtual machine with the least utilization percentage is assigned to the cluster of largest requests. Another work introduced a new strategy that increases the utilization of virtual machines in an efficient way.
This strategy is a heuristic based on a load balancing algorithm applied to infrastructure-as-a-service clouds [3]. Besides, Ragmani et al. [4] devised another load balancing strategy to increase cloud performance. In contrast, another work integrated a load balancing algorithm into resource scheduling to provide a higher quality of cloud service [5]. In [6], Hung et al. proposed a load balancing algorithm, called the max-min and max algorithm, that computes the average completion time of every task on all nodes; the task with the maximum average completion time is then dispatched to the unassigned node whose completion time is minimal and less than that task's maximum average completion time. The work in [7] maintains information about the efficiency of every virtual machine in an allocation table residing in a data center, increasing the allocation count of an efficient virtual machine when it is allocated to a request and decreasing that count after the request completes. Other works apply balancing algorithms to solve real-life problems. Indeed, Jemmali et al. [8, 9] treated the problem of gas turbine aircraft engines, while Jemmali [10] focused on the equitable distribution of project revenues; in the latter work, several approximate solutions were proposed. Several other works treated balancing in different applications. Hasan et al. [11] applied a balancing algorithm to small-cell networks to adjust the handover parameters of overloaded cells with adjacent cells, while in [12] a balancing algorithm was applied to the voltage loads of capacitors in a modular multilevel converter.

Xu et al. [13] introduced a technique that rewrites data blocks and defragments image backups, and they also proposed a restoration technique for image backups that caches these data blocks. In [14], Xu et al. proposed a method based on enhanced k-means clustering that finds and selects images of duplicated segments, which can then be loaded into memory.

Jain and You [15] and Koseki and Ogawa [16] proposed a method for load balancing a set of nodes within a cluster storage system. This method identifies a source node and a target node based on a load threshold value as well as the proximity between the source and target nodes, and it chooses the data objects to move from the source node to the target node so that the load threshold on the target node is not exceeded after the move.

Besides, Gulati et al. [17] introduced a software system that handles the placement of virtual machines and automatically implements load balancing between several devices by migrating data between them, without requiring support from the storage arrays.

In [18], Hu et al. introduced a load balancing strategy for resources that uses a genetic algorithm together with the system's historical data and current state. This strategy selects the best load balancing and alleviates or avoids dynamic migration.

Aerts et al. [19] proposed load balancing models, based on retrieval time as well as on a block basis, and showed the corresponding problems to be NP-hard.

In this paper, we focus on distributed load balancing algorithms, because centralized algorithms limit future scalability and make the system less fault tolerant. Moreover, our algorithms deal with a batch of files that need to be stored in temporary storage, and the load balancer is triggered according to the system's planned backup time. Therefore, the number of files is not an issue.

This paper is organized as follows. Section 2 presents the studied problem and gives some details about it. Section 3 presents the six proposed algorithms, Section 4 gives an illustrative case study, and the experimental results are presented in Section 5.

2. Problem Definition

The problem studied in this paper is to propose a fairness method that guarantees the fair distribution of several files over storage supports. The problem can be stated as follows. Let F = {1, ..., n} be the set of given files that must be stored on a fixed number of storage supports. The number of files is denoted by n and the number of storage supports by m. The set of storage supports is denoted by S = {S_1, ..., S_m}. Each file i, with 1 ≤ i ≤ n, is characterized by its size s_i. When a file is stored on a support, that support's cumulative file size increases by the file's size. The total used space of storage support j once all files are stored is denoted by U_j, with 1 ≤ j ≤ m. The minimum (maximum) used space after the termination of the backup procedure is denoted by U_min (U_max). Example 1 illustrates the studied problem.

Example 1. Let n = 7 and m = 2. Table 1 gives the size s_i of each file i.
We seek to store the seven files on the two given storage supports. Applying an algorithmic rule, the result is given in Figure 1.
The results of the schedule shown in Figure 1 are as follows. Files {2, 6, 3, 7} are stored on storage support 1 and files {4, 5, 1} on storage support 2. Under this schedule, the used space of storage support 1 is 33, while storage support 2 has a used space of 24. The gap between the two supports is 33 − 24 = 9. Reducing this gap is the primordial objective of this research work; thus, we must search for a schedule whose gap is less than 9.

3. Approximate Solutions

Our objective in this paper is to minimize the gap between storage supports. To do so, we must first define the gap in the general case. The gap can be calculated in different ways. We propose an indicator that calculates it as follows: for each storage support, we subtract the minimum of all used spaces from the used space of that support. Considering the m storage supports, the total capacity gap (G) is therefore given by the following equation:

G = Σ_{j=1}^{m} (U_j − U_min). (1)

Our objective is to minimize G as given in equation (1).
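As a minimal illustration, the total capacity gap of equation (1) can be computed directly from the list of used spaces; the short function below is our own sketch (the name total_gap does not come from the paper):

```python
def total_gap(used_spaces):
    """Total capacity gap G of equation (1): for each storage support,
    subtract the minimum used space from that support's used space."""
    u_min = min(used_spaces)
    return sum(u - u_min for u in used_spaces)

# Example 1: used spaces 33 and 24 give a gap of 9.
print(total_gap([33, 24]))  # → 9
```

With two supports, G reduces to the difference between the larger and the smaller used space, which matches the value computed in Example 1.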

The studied problem can be denoted using the standard three-field scheduling notation of [20].

Proposition 1. The studied problem is NP-hard.

Proof. The studied problem is NP-hard because the problem shown to be NP-hard in [21] reduces to it.
To achieve the goal of this work, we propose several algorithms that give approximate solutions.
The proposed algorithms in this paper are based on three methods. The first method uses dispatching rules: the nonincreasing sizes order algorithm (NIS) and the nondecreasing sizes order algorithm (NDS). The second method is based on a swapping approach: the swapping nonincreasing-decreasing sizes order algorithm (SNI) and the swapping nondecreasing-increasing sizes order algorithm (SND). The third method is more elaborate and mixes the largest files with the smallest ones: the k-swapping nonincreasing-decreasing sizes order algorithm (k-SNI) and the k-swapping nondecreasing-increasing sizes order algorithm (k-SND).

3.1. Nonincreasing Sizes Order Algorithm (NIS)

This algorithm first orders all files in nonincreasing order of their sizes. Then, at each step, the unstored file with the greatest size is placed on the storage support with the minimum used space.
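A compact Python sketch of this rule (function and variable names are ours): sort the files by nonincreasing size, then repeatedly place the next file on the least-loaded support.

```python
def nis(sizes, m):
    """Nonincreasing sizes order: the largest remaining file goes to the
    storage support with minimum used space."""
    used = [0] * m                       # used space per storage support
    assignment = [[] for _ in range(m)]  # file indices per support
    for i in sorted(range(len(sizes)), key=lambda i: -sizes[i]):
        j = used.index(min(used))        # least-loaded support
        used[j] += sizes[i]
        assignment[j].append(i)
    return used, assignment

used, _ = nis([7, 9, 8, 5], 2)
print(sorted(used))  # → [14, 15]
```

For this small instance the rule yields used spaces 14 and 15, i.e., a gap of 1.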

3.2. Nondecreasing Sizes Order Algorithm (NDS)

This algorithm first orders all files in nondecreasing order of their sizes. Then, at each step, the unstored file with the smallest size is placed on the storage support with the minimum used space.

3.3. Swapping Nonincreasing-Decreasing Sizes Order Algorithm (SNI)

Instead of applying just one order (nonincreasing or nondecreasing), we alternate between them one by one. For the first selection, we pick the file with the largest size; for the second selection, we take the file with the smallest size; and so on until all files are stored. This algorithm thus swaps between the two previous orderings. The function that returns the index of the largest unstored file (following NIS) is denoted Largest(), and the function that returns the index of the smallest unstored file (following NDS) is denoted Smallest(). The procedure Store() places the selected file on the most available storage support, i.e., the support with the minimum used space. The SNI algorithm is given in Algorithm 1.

(1) Set k ← 1
(2) while k ≤ n do
(3) if k is odd then
(4)   i ← Largest()
(5) else
(6)   i ← Smallest()
(7) end if
(8) Store(i)
(9) Mark file i as stored
(10) k ← k + 1
(11) end while
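Algorithm 1 can be sketched in Python as follows; Largest() and Smallest() are realized here by two pointers into a size-sorted order, and all names are our own choices:

```python
def sni(sizes, m):
    """Swapping nonincreasing-decreasing order: alternately store the
    largest and the smallest unstored file on the least-loaded support."""
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])  # ascending
    lo, hi = 0, len(order) - 1
    used = [0] * m
    take_largest = True  # odd steps pick the largest remaining file
    while lo <= hi:
        if take_largest:
            i, hi = order[hi], hi - 1  # Largest()
        else:
            i, lo = order[lo], lo + 1  # Smallest()
        take_largest = not take_largest
        j = used.index(min(used))      # Store(i): least-loaded support
        used[j] += sizes[i]
    return used

print(sorted(sni([10, 1, 9, 2], 2)))  # → [10, 12]
```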
3.4. Swapping Nondecreasing-Increasing Sizes Order Algorithm (SND)

This algorithm is based on the same idea as SNI. The difference is that, instead of beginning with the largest file, we begin with the smallest one; the second file selected is the largest one, and so on.

The SND algorithm is given in Algorithm 2.

(1) Set k ← 1
(2) while k ≤ n do
(3) if k is odd then
(4)   i ← Smallest()
(5) else
(6)   i ← Largest()
(7) end if
(8) Store(i)
(9) Mark file i as stored
(10) k ← k + 1
(11) end while
3.5. k-Swapping Nonincreasing-Decreasing Sizes Order Algorithm (k-SNI)

This algorithm is based on the following idea. Instead of swapping file by file as in SNI, we select the largest files and the smallest files in groups. The question is how the algorithm behaves if we swap by groups of two files, three files, or, in general, k files.

If we choose 2-swapping, we select the two files with the largest sizes and store them; then we select the two files with the smallest sizes, and so on, two by two.

The 2-swapping algorithm, i.e., the case k = 2, is given in Algorithm 3.

(1) Set r ← 1
(2) while r ≤ n do
(3) for c ← 1 to 2 do
(4)   i ← Largest()
(5)   Store(i)
(6)   Mark file i as stored
(7)   r ← r + 1
(8)   if r > n then
(9)    Break
(10)  end if
(11) end for
(12) if r ≤ n then
(13)  for c ← 1 to 2 do
(14)   i ← Smallest()
(15)   Store(i)
(16)   Mark file i as stored
(17)   r ← r + 1
(18)   if r > n then
(19)    Break
(20)   end if
(21)  end for
(22) end if
(23) end while

Algorithm 3 gives the solution only for 2-swapping. We can generalize it by searching for the solution for any k ≥ 1. The algorithm for a predetermined k is given in Algorithm 4.

(1) Set r ← 1
(2) while r ≤ n do
(3) for c ← 1 to k do
(4)   i ← Largest()
(5)   Store(i)
(6)   Mark file i as stored
(7)   r ← r + 1
(8)   if r > n then
(9)    Break
(10)  end if
(11) end for
(12) if r ≤ n then
(13)  for c ← 1 to k do
(14)   i ← Smallest()
(15)   Store(i)
(16)   Mark file i as stored
(17)   r ← r + 1
(18)   if r > n then
(19)    Break
(20)   end if
(21)  end for
(22) end if
(23) end while
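Under the same assumptions as before (two pointers standing in for Largest() and Smallest()), a Python sketch of Algorithm 4 for an arbitrary swap size k could look like this:

```python
def k_swap(sizes, m, k):
    """k-swapping nonincreasing-decreasing order: store k largest files,
    then k smallest files, alternating until all files are stored."""
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])  # ascending
    lo, hi = 0, len(order) - 1
    used = [0] * m
    from_largest = True
    while lo <= hi:
        for _ in range(k):             # one block of k files
            if lo > hi:
                break
            if from_largest:
                i, hi = order[hi], hi - 1  # Largest()
            else:
                i, lo = order[lo], lo + 1  # Smallest()
            j = used.index(min(used))      # Store(i)
            used[j] += sizes[i]
        from_largest = not from_largest
    return used

# With k at least n, every file is taken from the largest side,
# so the result coincides with the plain nonincreasing rule.
print(sorted(k_swap([7, 9, 8, 5], 2, 4)))  # → [14, 15]
```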

The generalized procedure iterates Algorithm 4 for several values of k and then selects the best solution. This generalization is given as Algorithm 5.

(1) for k ← 1 to K do
(2)  G_k ← gap returned by Algorithm 4 with swap size k
(3) end for
(4) G* ← min over 1 ≤ k ≤ K of G_k
(5) Return G*
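Algorithm 5 is then a simple loop over candidate swap sizes that keeps the smallest total gap. The self-contained sketch below reuses the k-swapping routine described above; all names are our own:

```python
def total_gap(used):
    """Equation (1): sum of (U_j - U_min) over all supports."""
    u_min = min(used)
    return sum(u - u_min for u in used)

def k_swap(sizes, m, k):
    """Algorithm 4: k largest files, then k smallest, alternating."""
    order = sorted(range(len(sizes)), key=lambda i: sizes[i])
    lo, hi = 0, len(order) - 1
    used = [0] * m
    from_largest = True
    while lo <= hi:
        for _ in range(k):
            if lo > hi:
                break
            if from_largest:
                i, hi = order[hi], hi - 1
            else:
                i, lo = order[lo], lo + 1
            j = used.index(min(used))
            used[j] += sizes[i]
        from_largest = not from_largest
    return used

def best_over_k(sizes, m, k_max):
    """Algorithm 5: iterate Algorithm 4 for k = 1..k_max, keep the best gap."""
    return min(total_gap(k_swap(sizes, m, k)) for k in range(1, k_max + 1))

print(best_over_k([7, 9, 8, 5, 3, 2], 2, 3))  # → 0
```

On this small instance the swap size k = 1 already balances both supports at 17, so the best gap over k is 0.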
3.6. k-Swapping Nondecreasing-Increasing Sizes Order Algorithm (k-SND)

This algorithm follows the same idea as the previous one. The difference is that, instead of starting with the largest files, we start with the smallest files.

In Algorithm 4, we replace Largest() by Smallest() in instruction 4 and Smallest() by Largest() in instruction 14; this modification yields the nondecreasing-increasing swapping procedure. Calling this modified procedure in instruction 2 of Algorithm 5 gives the new algorithm k-SND.

4. Case Study

In this case study, we compare NIS with all the other heuristics except NDS, because NIS is better than NDS in 100% of the cases. We thus show instances for which our proposed algorithms are better than NIS.

4.1. Comparison of NIS and SNI

Consider an instance with 10 files to be assigned to 2 storage supports. The sizes of the 10 files are given in Table 2.

This case study covers the first execution of the NIS algorithm; the used space is initially zero for both storage 1 and storage 2.

The first step of the algorithm orders the files in nonincreasing order of their sizes.

The steps of the NIS algorithm are as follows. The largest file is placed on the storage support with the minimum used space; storage 1 is selected. The second largest file is then placed on storage 2, which has the minimum used space at this point. After that, the third file is also assigned to storage 2 because it has the minimum used space at this point (storage 1: 280, storage 2: 268). The used space of storage 2 then becomes 526, and so on.

Therefore, the schedule given by applying NIS is as follows: the total size assigned to storage 1 is 939, while the total size assigned to storage 2 is 988. Therefore, the gap between the storages is 988 − 939 = 49.

In contrast, applying the SNI algorithm gives the following schedule: the total size assigned to storage 1 is 973, while the total size assigned to storage 2 is 954. Therefore, the gap between the storages is 973 − 954 = 19.

Observe that 19 is less than 49: comparing the results of NIS and SNI, the difference is 30. Thus, SNI gives the minimum gap.

4.2. Comparison of NIS and SND

Consider an instance with 10 files to be assigned to 2 storage supports. The sizes of the 10 files are given in Table 3.

The schedule given by applying NIS is as follows: the total size assigned to storage 1 is 269, while the total size assigned to storage 2 is 251. Therefore, the gap between the storages is 269 − 251 = 18.

In contrast, applying the SND algorithm gives the following schedule: the total size assigned to storage 1 is 260, and the total size assigned to storage 2 is 260. Therefore, the gap between the storages is 260 − 260 = 0.

Observe that 0 is less than 18: comparing the results of NIS and SND, the difference is 18. Thus, SND gives the minimum, and indeed optimal, gap because G = 0.

4.3. Comparison of NIS and k-SNI

Consider an instance with 10 files to be assigned to 3 storage supports. The sizes of the 10 files are given in Table 4.

The schedule given by applying NIS is as follows: the total size assigned to storage 1 is 132, the total size assigned to storage 2 is 135, and the total size assigned to storage 3 is 120. The gap between the storages follows from equation (1).

In contrast, applying the k-SNI algorithm gives the following schedule: the total size assigned to storage 1 is 131, the total size assigned to storage 2 is 123, and the total size assigned to storage 3 is 131. The gap between the storages again follows from equation (1).

The gap obtained by k-SNI is better than that of NIS: comparing the two algorithms, the difference is 9. Thus, k-SNI gives the minimum gap.

4.4. Comparison of NIS and k-SND

Consider an instance with 10 files to be assigned to 2 storage supports. The sizes of the 10 files are given in Table 5.

The schedule given by applying NIS is as follows: the total size assigned to storage 1 is 4961, while the total size assigned to storage 2 is 4969. Therefore, the gap between the storages is 4969 − 4961 = 8.

In contrast, applying the k-SND algorithm gives the following schedule: the total size assigned to storage 1 is 4965, and the total size assigned to storage 2 is 4965. Therefore, the gap between the storages is 4965 − 4965 = 0.

Observe that 0 is less than 8: comparing the results of NIS and k-SND, the difference is 8. Thus, k-SND gives the minimum, and indeed optimal, gap because G = 0.

Inspired by this case study, we propose to apply our algorithms in the cloud computing domain by adding a new component, called a scheduler, to the cloud computing architecture. This component is responsible for applying the proposed algorithms and returning the best schedule.

5. Experimental Results

In this section, we propose different classes of instances to compare the performance of the proposed algorithms. The main comparison is between the developed algorithms and the NIS rule, which corresponds to the classical longest-processing-time dispatching rule used in the literature and which has several applications in industry.

The proposed algorithms were coded and executed in Microsoft Visual C++ (version 2013). The computer used to run all programs has the following characteristics: (i) processor: Intel® Core™ i5-3337U CPU @ 1.8 GHz; (ii) operating system: Windows 10.

The classes used to discuss the results obtained by the developed algorithms are inspired by the classes proposed in [21].

The file sizes are generated from two kinds of distributions, and sizes are expressed in MB. The studied classes are as follows: (i) Class A; (ii) Class B; (iii) Class C; (iv) Class D; (v) Class E. For each class, s_i is drawn from either the uniform or the normal distribution with class-specific parameters.

Here, U denotes the uniform distribution and N the normal distribution. The total number of generated instances depends on the choice of n, m, and the class. The pair (n, m) can take different values; the chosen (n, m) values are given in Table 6.
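A sketch of how such instances can be generated is shown below. The bounds and parameters used here are placeholders of our own choosing, since the class-specific values of the paper's classes A-E are not reproduced in this text:

```python
import random

def generate_instance(n, dist, lo, hi, mu=None, sigma=None):
    """Generate n file sizes from a uniform U[lo, hi] or a normal
    N[mu, sigma] distribution. Parameters are illustrative placeholders;
    each class A-E would fix its own values."""
    if dist == "uniform":
        return [random.randint(lo, hi) for _ in range(n)]
    sizes = []
    while len(sizes) < n:
        s = round(random.gauss(mu, sigma))
        if lo <= s <= hi:  # keep normal draws within plausible bounds
            sizes.append(s)
    return sizes

random.seed(0)
instance = generate_instance(10, "uniform", 1, 100)
print(len(instance))  # → 10
```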

From Table 6, we can deduce that there are 5300 instances in total. To show the performance of the developed algorithms compared with the classical one, several indicators can be used. We propose the following indicators for this paper: (i) Best is the best value returned after running all heuristics; (ii) A is the value returned by the discussed heuristic; (iii) Gap is the relative gap between A and Best, with Gap = 0 when A = Best; (iv) Time is the average running time in seconds, where the symbol "−" means that the time is less than 0.001 s; (v) Perc is the percentage among the total instances (5300) for which A = Best.

Several statistics can be presented in this work. We start with the overall results in Table 7, which shows, for each heuristic, the percentage of instances for which the studied heuristic equals the best one, together with the corresponding average running time. The table identifies the best-performing algorithm among all algorithms as well as the algorithm that consumes the most running time compared with the others.

Table 8 presents the behavior of Gap and Time when the number of files changes. For all algorithms, Gap increases as the number of files increases. The worst value, equal to 0.99, and the best gap, equal to 0.09, are each reached at particular values of n (Table 9).

Table 10 presents the behavior of Gap and Time when the class changes. The worst gap, 0.98, is obtained for class A; on the contrary, the best gap in the same class equals 0.02. The table also shows that the best execution times are less than 0.001 s for the algorithms NIS, NDS, SNI, and SND. However, the algorithms k-SNI and k-SND have the worst execution times, which are larger than 0.1 s, and the highest execution time, 0.125 s, is observed on class C.

For one of the algorithms, we observe that the gap is 0.31 and 0.25 for classes D and E, respectively; these values are higher than the gap values of classes A, B, and C, which are less than 0.1.

6. Conclusion

In this paper, we focused on the resolution of the NP-hard problem of assigning several files to different storage supports. We developed six algorithms to solve this problem; they are essentially based on dispatching rules with variant methods. These methods are categorized into the nonincreasing (nondecreasing) order rules and the mixed methods that use both orders. In addition, we proposed the k-swapping methods, which store the first k files using the nonincreasing rule, then the next k files using the nondecreasing rule, and so on until all files are stored. In our experiments, the value of k was varied up to a predetermined bound, and the best result was kept. The experimental results show that the best of the proposed algorithms outperforms the classical rule from the literature. The proposed algorithms can be enhanced to develop new, better-performing algorithms.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors would like to thank the Deanship of Scientific Research at Majmaah University for supporting this work under Project no. RGP-2019-13.