Abstract
The slack-based algorithms are popular bin-focused heuristics for the bin packing problem (BPP). Existing methods select slacks only by predetermined policies, ignoring the dynamic exploration of the global data structure, which leads to underutilization of the information in the data space. In this paper, we propose a novel slack-based flexible bin packing framework called the reinforced bin packing framework (RBF) for the one-dimensional BPP. RBF considers the RL-system, the instance-eigenvalue mapping process, and the reinforced-MBS strategy simultaneously. In our work, the slack is generated with a reinforcement learning strategy, in which performance-driven rewards capture the intuition of learning the current state of the container space, the action is the choice of the packing container, and the state is the remaining capacity after packing. During the construction of the slack, an instance-eigenvalue mapping process is designed and utilized to generate a representative and classified validation set. Furthermore, the provision of the slack coefficient is integrated into the MBS-based packing process. Experimental results show that, in comparison with the fit algorithms, MBS, and MBS’, RBF achieves state-of-the-art performance on the BINDATA and SCH_WAE datasets. In particular, it outperforms its baselines MBS and MBS’, increasing the number of optimal solutions by an average of 189.05% and 27.41%, respectively.
1. Introduction
As a classical discrete combinatorial optimization problem [1, 2], the bin packing problem (BPP) [3, 4] aims to minimize the number of bins used to pack items, and it is NP-hard [5, 6]. In the past few decades, four main approaches have been extensively studied to resolve the BPP: exact approaches [7–9], approximation algorithms [4, 10], heuristic algorithms, and metaheuristic algorithms [11, 12]. The exact algorithms typically exploit lower-bound information for pruning, which makes them suitable for small-scale instances. When the scale of datasets increases, the BPP becomes challenging for the approximation algorithms. The implementation of the metaheuristic algorithms is difficult due to the rigorous requirements on parameter adjustment and calculation complexity [13]. In contrast, the heuristic algorithm is a popular bin packing method due to its efficiency in solving NP-hard problems.
As a typical heuristic algorithm, minimum bin slack (MBS) is particularly useful for problems where an optimal solution requires most of the bins, if not all, to be exactly filled [14]. It is also useful for solving problems where the sum of the requirements of the items is less than or equal to twice the bin capacity. In MBS, the selection of the packing sequence of the items is based on a predetermined strategy, which ignores the sampling deviation between the data of the items to be packed and cannot explore the global data space.
Therefore, the MBS algorithm may quickly fall into locally optimal solutions, ignoring the exploration of the global item space in the training process. In the stage of iterative training, the deviation of the locally optimal solution accumulates continuously, and the globally optimal solution space shifts steadily. This may result in a significant difference between the algorithm’s packing result and the optimal solution, which may lead to a failure to achieve the desired performance [14].
To solve the problems of MBS described above, we propose a reinforced bin packing framework, dubbed RBF, to resolve the BPP, where a reinforcement learning (RL) method, i.e., the Q-learning algorithm, is exploited to select a high-quality slack for the packing process. RBF treats Q-learning as a detector of prior spatial information in the data. To judiciously select data samples as representatives of the datasets, it explores the intrinsic spatial distribution of sample bins by interacting with the environment and estimating the optimal slack of the global bins. The learned slacks are finally exploited in the improved MBS algorithm to pack items.
The proposed RBF can be distinguished from previous work by the following characteristics:
(1) A reinforcement learning algorithm is exploited to generate the slack automatically, which is further integrated into the MBS algorithm. With high-quality slacks generated automatically, rather than by manual design or empirical speculation, our method prevents the bin packing process from falling into a locally optimal solution, which is a quite challenging problem, especially for large-scale datasets.
(2) The instance-eigenvalue mapping function is introduced to efficiently select a representative and classified validation set from the input instances based on their similarity. This enables RBF to reduce the learning cost while generating a dynamic slack during the packing process.
The rest of this paper is organized as follows. Related work is presented in Section 2. The formulation of the BPP is depicted in Section 3. In Section 4, we briefly overview the design of the RBF and then detail its key components, such as the RL-system, the reinforced-MBS strategy, and the instance mapping process. Experimental results and theoretical analyses are presented in Section 5. Finally, conclusions are drawn in Section 6.
2. Related Work
Existing methods that address the BPP can be roughly classified into four major categories: the exact approaches [15–20], the approximation algorithms [4, 21, 22], the heuristic algorithms [14, 23–27], and the metaheuristic algorithms [11–13, 28–30]. In recent years, RL-based methods [31–35] have also been proposed to resolve the BPP.
2.1. Exact Approaches
The exact approaches establish a mathematical model and obtain the optimal solution of the problem by solving the model through optimization algorithms. CPLEX [36] solved the problem with mixed integer programming. Polyakovskiy and M’Hallah [15] characterized the properties of the two-dimensional non-oriented BPP with due dates, which packed a set of rectangular items, and experimentally proved that a tight lower bound enhanced an existing bound on maximum lateness for 24.07% of the benchmark instances. Since the quality of the solution depends on whether the model is reasonable, such methods are only applicable to small-scale instances.
Subsequent improvements focused on reconsidering the constraints in novel ways. Chitsaz et al. [18] proposed an algorithm to separate the subtour elimination constraints of fractional solutions to solve production, inventory, and inbound transportation decision problems. The inequalities and separation procedures were used in a branch-and-cut algorithm. A similar idea was proposed in Mara’s work [20], where an exact algorithm was proposed based on the classic constraint method. The method addressed N single-objective problems by using reduction with test sets instead of an optimizer. Besides, one classic method belonging to this group was the arc-flow formulation method [9], which represented all the patterns in a very compact graph based on an arc-flow formulation with side constraints and could be solved exactly by general-purpose mixed integer programming solvers. Generally, when the scale of the problem becomes larger, the phenomenon of “combinatorial explosion” leads to heavy computational overhead in the optimization process. It is therefore difficult to apply exact algorithms to large-scale combinatorial optimization problems.
2.2. Approximation Algorithms
Approximation algorithms are popular because their time complexity is polynomial, although they do not guarantee finding the optimal solution. Typical approximation algorithms include greedy algorithms and local search. Based on the observation and arrangement of Earth observation satellites, the authors in [21] proposed an index-based multi-objective local search to solve multi-objective optimization problems. Kang and Park [37] considered the variable-sized bin packing problem and described two greedy algorithms. The objective was to minimize the total cost of used bins when the unit cost per bin did not increase as the bin size increased. Moreover, the survey [38] presented an overview of approximation algorithms for the classical BPP and pointed out that although approximation algorithms are universal, there is always a gap between their solutions and the optimal solution under polynomial time complexity.
2.3. Heuristic Algorithms
Heuristic algorithms are based on intuitive and empirical design. Several new heuristics for solving the one-dimensional bin packing problem are presented in [39]. Coffman and Garey [10] reviewed various heuristic algorithms, such as NF (Next Fit), FF (First Fit), BF (Best Fit), and WF (Worst Fit) [23]. These are typical online packing algorithms [40, 41] and are called fit algorithms. Their corresponding offline packing algorithms are NFD [24], FFD, BFD, and WFD [23], which differ significantly from online packing algorithms in that offline algorithms rely on overall information for sorting. The fit algorithms, for example, FF, WF, and BF, give priority for further packing to the bins that have already been packed with items, and a new bin is activated only when there is no suitable non-empty bin for the current item. The strategy adopted by the fit algorithms ensures that each arriving item can always find a bin to accommodate it. However, it cannot guarantee that the item is the target item for the optimum solution under the current situation. To address this issue, Gupta and Ho [14] proposed MBS, which mainly centers on bins and tries its best to find the collection of items that fills each bin. One problem with this method is that its sequence selection strategy often falls into a local region of the input space, which makes accurate estimation of the slack hard. Thus, it may result in a locally optimal solution. To solve the above problems, some methods have been proposed. Fleszar and Hindi [42] found that one effective hybrid method integrated perturbation MBS’ and a good set of lower bounds into variable neighbourhood search (VNS), so as to improve its ability within reasonably short processing times. However, due to the complexity and uncertainty of combinatorial optimization problems, heuristic algorithms that rely on empirical criteria are not always reliable.
2.4. Metaheuristic Algorithms
Metaheuristic algorithms are widely used to find optimal solutions for the BPP. Early typical representatives include genetic algorithms [28] and simulated annealing algorithms [29]. The former is a promising tool for the BPP, and one significant improvement is mainly used: grouping genetic algorithms (GGAs). Dokeroglu and Cosar [43] proposed a set of robust and scalable hybrid parallel algorithms. In GGA-CGT (grouping genetic algorithm with controlled gene transmission) [44], the transmission of the best genes in the chromosomes was promoted while keeping the balance between selection pressure and population diversity. Kucukyilmaz and Kiziloz proposed the island-parallel GGA (IPGGA) in [45]. It realized the choice of communication topology, determined the migration and assimilation strategies, adjusted the migration rate, and exploited diversification technologies. Crainic et al. [46] proposed a two-level tabu search for the three-dimensional BPP by reducing the size of the solution space. Kumar and Raza [47] incorporated the concept of Pareto optimality for the BPP with multiple constraints and then proposed a family of solutions along the trade-off surface. However, due to the lack of particle diversity in the later stages of genetic algorithms as well as PSO algorithms, premature convergence always occurs [28].
2.5. RLBased Methods
Machine learning has been extensively studied to resolve the NP-hard BPP in recent years. Ruben Solozabal’s model tackled the BPP with RL. It trained multi-stacked long short-term memory cells to build a recurrent neural network agent, which could embed information from the environment. The performance of the model was just comparable to the FF algorithm once the neural network overhead was introduced. Inspired by Pointer Networks [48], deep learning technology was successfully applied to learn and optimize the placing order of items [32], solve the classic TSP [33], and tackle the 3D BPP. These methods utilized RL to ensure the solution would not converge to a local optimum, but they attempted to exploit neural networks [49] in the RL to solve the BPP, which increased the computational cost and time complexity. Heuristic algorithms rely on empirical criteria to apply predetermined strategies and ignore the dynamic exploration of the global data space in the BPP. RL-based methods can intelligently mine data information from the environmental space through trial and error. This suggests that they can help existing heuristic algorithms to fully explore the effective information in the sample space, which inspired our method.
3. Formulation of the BPP
The classic one-dimensional BPP is formalized as follows. It is assumed that there are n items to be packed into bins with equal capacity c. The general objective is to find a packing that arranges all items with the minimum number of bins, of which the formal mathematical description can be defined as:

\min \sum_{i=1}^{n} y_i
Therein, y_i \in \{0, 1\} indicates whether the i-th bin is used or not. A value of 1 indicates that the bin is used, and a value of 0 indicates that it is not used. Note that once bin i is used, the total load of the items placed in it cannot exceed the capacity c. Thus, we have

\sum_{j=1}^{n} w_j x_{ij} \le c\, y_i, \quad i = 1, \ldots, n,

where w_j means the load of the j-th item and x_{ij} \in \{0, 1\} is an indicator of whether the j-th item is packed into the i-th bin or not. Especially, x_{ij} = 1 if the j-th item is placed into the i-th container; otherwise, x_{ij} = 0. Furthermore, an equally fundamental constraint is that each item is placed into exactly one bin:

\sum_{i=1}^{n} x_{ij} = 1, \quad j = 1, \ldots, n.
The detailed explanation of parameters for the formalization is defined in Table 1.
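As a concrete check on the formulation above, the objective and both constraints can be verified directly for any candidate packing. The following Python sketch is our own illustrative representation (the list-of-bins encoding and function names are not from the paper):

```python
def is_valid_packing(weights, bins, capacity):
    """Check the two BPP constraints: every item is placed in exactly one
    bin, and the total load of each bin does not exceed the capacity c.
    `bins` is a list of bins, each a list of item indices."""
    packed = [i for b in bins for i in b]
    if sorted(packed) != list(range(len(weights))):
        return False  # some item is missing or placed twice
    return all(sum(weights[i] for i in b) <= capacity for b in bins)

def used_bins(bins):
    """Objective value: the number of bins actually used (non-empty)."""
    return sum(1 for b in bins if b)
```

For example, packing items of weights 4, 3, and 3 into bins of capacity 6 as {item 0} and {items 1, 2} is feasible and uses two bins, while {items 0, 1} would overflow.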
4. Design of RBF
In this section, the design of the proposed RBF framework is presented. First, the overview of the RBF is outlined, and then, the details of its key components, such as the RLsystem, the reinforcedMBS strategy, and the instance mapping process, are presented.
4.1. Overview
The classical MBS algorithm follows two steps:
(1) Utilize the lexicographic search optimization procedure [14] to find the item set that should be allocated to the current bin.
(2) Utilize Step 1 to traverse all items to be packed; the minimum bin slack of a bin is c − Σ_j w_j, where w_j is the load of the j-th packed item.
The steps above mean that, in the classical MBS algorithm, the slack is utilized to escape the local-optimum trap randomly, while the exact distribution of the sample space is ignored. To resolve the instability of the random slack, a new bin packing framework, RBF, is presented, where the slack is learnable and adjusted according to the samples’ structure.
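The classical MBS idea sketched in the two steps above can be illustrated as follows. This sketch replaces the paper's pruned lexicographic search with a plain depth-first subset search, so it is an illustration of the bin-centered principle, not the exact procedure of [14]:

```python
def mbs_pack(weights, capacity):
    """Illustrative MBS-style packing: repeatedly search the remaining items
    for the subset whose total load leaves the smallest slack in one bin,
    commit that subset, and repeat until all items are packed."""
    remaining = sorted(range(len(weights)), key=lambda i: -weights[i])
    bins = []
    while remaining:
        best, best_slack = [], capacity

        def search(idx, chosen, load):
            nonlocal best, best_slack
            slack = capacity - load  # minimum bin slack c - sum of loads
            if slack < best_slack:
                best, best_slack = chosen[:], slack
            for j in range(idx, len(remaining)):
                if load + weights[remaining[j]] <= capacity:
                    chosen.append(remaining[j])
                    search(j + 1, chosen, load + weights[remaining[j]])
                    chosen.pop()

        search(0, [], 0)
        bins.append(best)
        remaining = [i for i in remaining if i not in best]
    return bins
```

With weights (5, 4, 3, 2, 1) and capacity 7, the search first fills a bin exactly with 5 + 2, then 4 + 3, leaving item 1 alone in a third bin.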
The framework of RBF is illustrated in Figure 1. It consists of an RL-system, a reinforced-MBS strategy, and an instance-eigenvalue mapping process, defined as follows:
(1) RL-system: the RL-system is used to generate a suitable slack by a reinforcement learning strategy, where the best action selection strategy is controlled by the Q-agent.
(2) Reinforced-MBS strategy: with the provision of the slack coefficient from the RL-system, the reinforced-MBS strategy is exploited to carry out the packing process.
(3) Instance-eigenvalue mapping: instead of using the whole dataset directly, the instance-eigenvalue mapping is utilized to generate a representative and classified validation set for the RL-system based on the similarity of the input instances.
The main idea of RBF is to utilize the RL-system to learn the slack according to the spatial variation of the sample dataset; the slack can then adapt to the distribution of bins and the remaining items in the data space during the iterative packing process. With the instance-eigenvalue mapping, a representative and classified validation set of the input instances is generated. The validation set is further integrated into the RL-system, where an adaptive slack is generated by the Q-agent. The coefficient of the slack is finally applied in the reinforced-MBS strategy for the packing process.
4.2. InstanceEigenvalue Mapping
To reduce the amount of calculation for the slack, representative items are selected for the Q-agent, which can then learn the data space without traversing all instances. Here, an instance classification method, called instance-eigenvalue mapping, is proposed. For a given instance, it maps the average value of the items in the instance, together with the minimum and maximum values of those items, to a single scalar called the instance eigenvalue.
According to the value of the instance eigenvalue, all instances are reordered, and the dataset is divided into different subsets. The last instance of each subset is taken to form a validation set. Then, the validation set is utilized to iteratively learn the slack. Therefore, at each time step in RBF, instead of using all instances, the Q-agent utilizes the validation set to reduce the repetitive work of the system.
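The mapping-and-selection procedure above can be sketched as follows. The exact combination of the average, minimum, and maximum item values used in the paper is not reproduced here; `instance_eigenvalue` below uses a hypothetical sum of the three statistics purely for illustration:

```python
def instance_eigenvalue(items):
    """Hypothetical stand-in for the paper's mapping: combines the average,
    minimum, and maximum item weights of one instance into a single scalar.
    The paper's actual combination may differ."""
    avg = sum(items) / len(items)
    return avg + min(items) + max(items)  # illustrative combination only

def build_validation_set(instances, k):
    """Reorder instances by eigenvalue, split them into k equal subsets,
    and take the last instance of each subset as a representative."""
    ordered = sorted(instances, key=instance_eigenvalue)
    size = len(ordered) // k
    return [ordered[(i + 1) * size - 1] for i in range(k)]
```

This keeps only k representatives, so the Q-agent learns the slack on a small, spread-out sample of the data space instead of the whole dataset.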
4.3. RLSystem
The validation set is integrated into the RL-system with a Q-learning algorithm [50], where the Q-agent is utilized to learn an appropriate strategy and then improve the MBS strategy by selecting a high-quality slack.
The process of the RL-system can be described as a Markov decision process (MDP), represented as a tuple (S, A, P, R, γ). In the decision-making process of the MDP, S is the state set, A is the action set, P is the transition probability between states, R is the return value after taking a certain action to reach the next state, and γ is the discount factor. To be adaptive to the packing circumstances, for example, the current distribution of containers and the remaining items, we propose a slack learning algorithm; the detailed process is shown in Algorithm 1, where the parameters are illustrated in Table 2. By observing the current state of the environment, the Q-agent selects the action that maximizes the value of the reward function according to the observed state. With the Q-agent continually interacting with the environment, we explore a suitable data selection strategy for the slack coefficient. The algorithm returns both a reward and a new state to the Q-agent in each packing iteration, in which the change of states depends on the state transition probability P.

The agent receives the performance-driven reward r_t, and then the sum of discounted rewards at time step t is represented as G_t:

G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}
Therein, γ ∈ [0, 1] defines the weight of future rewards in the discounted sum. The closer γ is to 0, the more the agent is incentivized to consider short-term benefits; the closer γ is to 1, the more it is incentivized to consider long-term benefits.
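The discounted return above is a standard quantity and can be computed directly; this short sketch also illustrates the effect of γ described in the text:

```python
def discounted_return(rewards, gamma):
    """Sum of discounted rewards G_t = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
    A gamma near 0 weights short-term reward; a gamma near 1 weights
    long-term reward."""
    return sum(r * gamma ** k for k, r in enumerate(rewards))
```

With rewards (1, 1, 1) and γ = 0.5, the return is 1 + 0.5 + 0.25 = 1.75; with γ = 0, only the immediate reward counts.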
The goal of the Q-agent at each time step is to select an action that maximizes future discounted rewards by finding an optimal policy π*. Here, π* is the strategy of taking the optimal action at state s, while π is the strategy of taking action a at state s. Under the policy π, the state-action value function Q^π(s, a) is defined as an expectation: when the agent takes a_t at s_t, Q^π is represented as

Q^π(s, a) = E_π[G_t | s_t = s, a_t = a],

where E is the expectation operator.
The maximum state-action value function over all policies is

Q*(s, a) = max_π Q^π(s, a).
The update rule of the Q value is

Q(s_t, a_t) ← Q(s_t, a_t) + α (r_t + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)),

where α is the learning rate of the RL agent.
At each time step t, the Q-agent observes the current state s_t and selects the action a_t from a discrete set of behaviors A = {1, …, n}, where n is equal to the number of items to be packed. At the beginning, the action is randomly initialized; that is to say, the action corresponding to a random number between 1 and n is selected. Then, the RL-system selects the action that maximizes the Q value at each time step t:

a_t = arg max_{a ∈ A} Q(s_t, a).
The agent uses an ε-greedy learning strategy [51] to choose actions. It selects the action with the optimal value in the Q table with probability 1 − ε and selects randomly with probability ε.
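The tabular Q-learning update and the ε-greedy selection described above are standard and can be sketched as follows (the state and action encodings here are generic placeholders, not the paper's exact representation):

```python
import random
from collections import defaultdict

def q_learning_step(Q, state, action, reward, next_state, actions, alpha, gamma):
    """One tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

def epsilon_greedy(Q, state, actions, epsilon):
    """Select a random action with probability epsilon, otherwise the action
    with the highest Q value in the table."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```

A `defaultdict(float)` initializes all unseen Q entries to 0, so the first update of a state-action pair moves it by alpha times the observed reward.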
The state is represented as the remaining space capacity of the bin after each round of packing. At each time step t, the remaining items are preferably packed into as few bins as possible. When the bin is exactly full, the agent is given a positive reward. If the bin overflows, the agent is punished severely, telling it that such a state is not allowed.
The slack s is defined in terms of a constant, the immediate reward r achieved by the Q-agent, and an initial value used in the first iteration. By returning the reward value r, the slack is adjusted in each packing round accordingly; the slack varies within a range as the reward value r changes. Ultimately, the environment for the new round is updated with the bin capacity reduced by the slack s.
The Q-agent captures this intuition through performance-driven rewards. At each time step t, the agent’s reward r_t is computed from the number of bins that are exactly filled and the number of bins filled within the slack space, both weighted by a positive coefficient; the number of overflowing bins, weighted by a punishment coefficient less than 0; and a constant that regulates the value of the entire reward function.
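A hedged sketch of such a performance-driven reward is given below. The linear form and the coefficient names (`omega`, `phi`, `c`) are our assumptions based on the description above; the paper's exact functional form may differ:

```python
def packing_reward(n_exact, n_slack_filled, n_overflow, omega=1.0, phi=-5.0, c=0.0):
    """Hypothetical performance-driven reward: bins filled exactly or within
    the slack space earn a positive weight omega, overflowing bins are
    punished by phi (< 0), and c is a constant regulating the scale."""
    return omega * (n_exact + n_slack_filled) + phi * n_overflow + c
```

Under these illustrative defaults, two exactly filled bins plus one slack-filled bin yield a reward of 3.0, while a single overflowing bin yields -5.0.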
4.4. ReinforcedMBS Algorithm
By introducing the slack learned by the RL agent into MBS, we propose the reinforced-MBS algorithm. Therein, the effective bin capacity is defined as the original capacity reduced by the slack, whose coefficient is the slack parameter learned by the agent applying RL. It is calculated by minimizing the number of used bins on the validation set. Then, the coefficient of the slack is passed into our reinforced-MBS algorithm. In detail, the idea of the reinforced-MBS algorithm is shown in Algorithm 2.

In Algorithm 2, the improved dictionary search procedure is utilized to find the set of items that should be assigned to the bin during the iterative process. The improved dictionary search procedure is shown in Algorithm 3.
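The effect of the learned slack on the search can be sketched as follows. This is our own simplified rendering of the reinforced-MBS idea, not Algorithms 2 and 3: a candidate item set is accepted for a bin as soon as its residual space falls within the learned slack, instead of insisting on an exact fill:

```python
def reinforced_mbs_pack(weights, capacity, slack):
    """MBS-style packing with a learned slack: the subset search for a bin
    stops early once the residual gap is at most `slack`, i.e. once
    load >= capacity - slack (illustrative sketch only)."""
    remaining = sorted(range(len(weights)), key=lambda i: -weights[i])
    bins = []
    while remaining:
        best, best_gap = [], capacity

        def search(idx, chosen, load):
            nonlocal best, best_gap
            gap = capacity - load
            if gap < best_gap:
                best, best_gap = chosen[:], gap
            if gap <= slack:  # within the learned slack: good enough, stop
                return True
            for j in range(idx, len(remaining)):
                i = remaining[j]
                if load + weights[i] <= capacity:
                    chosen.append(i)
                    if search(j + 1, chosen, load + weights[i]):
                        return True
                    chosen.pop()
            return False

        search(0, [], 0)
        bins.append(best)
        remaining = [i for i in remaining if i not in best]
    return bins
```

A small slack keeps the search close to exact fills, while a larger slack trades fill quality for shorter search, which is why the value learned by the Q-agent matters.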

5. Experimental Evaluation
In this section, experiments are carried out to verify the effectiveness and robustness of the proposed RBF. First, the experimental evaluation indexes are introduced and the datasets used in the experiments are detailed. Afterwards, the experimental results are presented and analyzed. Finally, the robustness and stability of our method are discussed.
5.1. Evaluation Indexes and Datasets
5.1.1. Competition Ratio
The competition ratio [52, 53], ρ, is defined as

ρ = SOL(I) / OPT(I),

where SOL(I) represents the number of bins used by the concrete algorithm and OPT(I) is the number of bins in the optimal solution for the packing instance I. A competition ratio equal to 1 means that the algorithm has found the optimal solution.
Generally, OPT(I) has a lower limit, as shown in formula (14), where ⌈·⌉ is the ceiling function. Due to the limitation of the bin packing conditions, the number of bins used in each bin packing iteration cannot be less than the ratio of the total load of the items to the capacity of a single bin:

OPT(I) ≥ ⌈ Σ_{j=1}^{n} w_j / c ⌉.
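The lower bound and the competition ratio above translate directly into code (function names here are our own):

```python
import math

def opt_lower_bound(weights, capacity):
    """Lower bound on OPT(I): the ceiling of total load over bin capacity."""
    return math.ceil(sum(weights) / capacity)

def competition_ratio(sol_bins, opt_bins):
    """rho = SOL(I) / OPT(I); a value of 1 means the optimum was found."""
    return sol_bins / opt_bins
```

For items of total load 15 and capacity 7, no packing can use fewer than ⌈15/7⌉ = 3 bins.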
5.1.2. FSOL
For a dataset, FSOL represents the number of instances for which the algorithm achieves a feasible optimal solution, in other words, the number of instances whose ρ is 1. For a specific algorithm alg and dataset data, it is specialized as FSOL_alg(data).
5.1.3. Realization Rate
Realization rate (RT) is defined as formula (15), where INS is the number of instances in the packing dataset:

RT = FSOL / INS × 100%.
5.1.4. Gap
Gap refers to the deviation between the number of bins used by the algorithm and the optimal number for the packing. The relative Gap is exploited to evaluate the performance of the algorithms, which is calculated as

Gap = (SOL − OPT) / OPT.
The BINDATA [54] and SCH_WAE [55] datasets are used in the experiments for evaluation. Therein, the BINDATA dataset includes three subsets: Bin1data, Bin2data, and Bin3data. The details of the datasets, such as the number of instances, the weights of items, the capacity of bins, and the number of items in the instances, are shown in Table 3.
5.2. Experimental Results and Analysis
The performance of RBF is compared with that of the classical fit algorithms, the MBS algorithm, and the MBS’ algorithm on the BINDATA and SCH_WAE datasets shown in Table 3. For each instance of each dataset, the number of items in each category is the same. The experimental results reported in this paper are the average of ten runs under each hyperparameter setting.
5.2.1. Results on BINDATA
Table 4 lists the results of FSOL, RT, and ρ of the algorithms involved in the comparison on BINDATA, while Table 5 lists the values of Gap. In comparison with the classical heuristic algorithms, such as NFD, FFD, WFD, AWFD, BFD, MBS, and MBS’, RBF obtains the maximum FSOL on BINDATA, while its ρ and Gap are minimum. Furthermore, the improvement of FSOL over a baseline, represented as IMP and defined in formula (17), is further calculated, where IMP_MBS and IMP_MBS’ denote the improvements over MBS and MBS’, respectively. Especially, for Bin1data and Bin2data, IMP_MBS is 165.08% and 179.2%, respectively, while IMP_MBS’ is 5.53% and 41.3%, respectively. For the dataset Bin3data, RBF is the only one that can obtain 2 optimal solutions in the total 10 cases, while the others obtain zero optimal solutions:

IMP = (FSOL_RBF − FSOL_baseline) / FSOL_baseline × 100%.
5.2.2. Results on SCH_WAE
The results of the compared algorithms on SCH_WAE are listed in Table 6, and the detailed results of RBF on SCH_WAE, that is, SOL, OPT, and the time cost (Runtime) on each instance, are shown in Tables 7 and 8. In the setting of Table 7, the number of items is 100 and the container capacity is 1000. Table 8 shows the statistical results when the number of items is 120. Meanwhile, the Gap of each algorithm is shown in the last column of Table 5. It is shown that, in comparison with the other algorithms on SCH_WAE, RBF achieves the minimum ρ and Gap and the maximum FSOL and RT. Especially, for SCH_WAE, IMP_MBS is 472%, while IMP_MBS’ is 346.88%.
5.2.3. Cumulative Results
For all the instances in BINDATA and SCH_WAE, the cumulative packing results of the compared algorithms, such as the cumulative FSOL, the average RT, the average ρ, and the average Gap, are shown in Table 9. It is shown that RBF obtained 1162 cumulative FSOL and its RT was 82.41%, which greatly overwhelmed the other compared algorithms. Especially, according to formula (17), the cumulative improvements of RBF over MBS and MBS’, correspondingly IMP_MBS and IMP_MBS’, are 189.05% and 27.41%, respectively. Figure 2 graphically shows the statistical FSOL of the compared algorithms on each dataset. It is noted that, in the radar chart in Figure 2, the quantitative curve of RBF for FSOL is the outermost. This means that RBF achieved the largest number of optimal solutions in the test cases. Also, RBF achieved the minimum ρ and Gap. The results indicate that, in comparison with the typical heuristic algorithms, RBF has stronger global optimization performance. Overall, the proposed RBF ranks first in all metrics according to the results of all compared algorithms listed in Tables 4–6. In detail, the improvement in FSOL on the uncomplicated dataset, Bin1data, is limited among all compared methods; RBF achieves the best results but wins with few advantages. However, RBF works much better than the other methods on the difficult datasets, Bin2data, Bin3data, and SCH_WAE. Besides, RBF achieves huge advantages in FSOL, winning against almost all other methods on the used datasets.
5.3. Robustness and Stability
The construction of the validation set is a key procedure of RBF. This experiment is carried out to verify the validity of the eigenvalue mapping function on Bin1data. Since 10 instances of Bin1data are selected by the eigenvalue mapping function to form the validation set, different selection policies are applied for comparison in the packing process. The first policy selects the first 10 instances of Bin1data, the second policy selects the last 10 instances, and the third policy selects 10 random instances of Bin1data to form the validation set. The packing results with the different selection policies are depicted in Table 10. It can be seen that the value of the slack learned by the Q-agent differs under different selection policies. Especially, with the selection policy of the eigenvalue mapping function, RBF achieved the maximum FSOL and RT, while its ρ and Gap were the minimum. The results verify the validity of the eigenvalue mapping function, which helps RBF achieve better performance.
6. Conclusion and Future Work
In this paper, we propose the reinforced bin packing framework (RBF) to tackle the one-dimensional BPP. The proposed RBF consists of three main components: the RL-system, the instance-eigenvalue mapping process, and the reinforced-MBS strategy. The RL-system is designed to construct a slack selection policy automatically, with the Q-agent selecting a high-quality slack for the heuristic algorithm integrated in RBF. The instance-eigenvalue mapping process is utilized to generate a representative and classified validation set based on the similarity of the input instances, which greatly reduces the computational overhead and improves the generalization performance of the model. Finally, with the provision of the slack coefficient from the RL-system, the reinforced-MBS strategy is exploited to carry out the packing process. We evaluate our models on BPP tasks, where RBF exhibits excellent packing ability, and the experimental results validate its superior performance compared with state-of-the-art proposals on the BINDATA and SCH_WAE datasets. Compared with its baseline methods, MBS and MBS’, the average number of optimal solutions achieved by RBF increases by 189.05% and 27.41%, respectively.
For future work, we plan to investigate slack selection policies and new mechanisms to learn them automatically. We also foresee the extension of our method to more complex multiagent reinforcement learning frameworks, where the use of new aspects of the multiagent communication environment is crucial to boost the packing performance.
Data Availability
The datasets used in this paper contain onedimensional bin packing datasets, such as BINDATA and SCH_WAE datasets, which can be found in http://people.brunel.ac.uk/∼mastjjb/jeb/orlib/binpackinfo.html.
Conflicts of Interest
The authors declare that they have no conflicts of interest.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (no. 61472139) and the Shanghai 2020 Action Plan of Technological Innovation (no. 20dz1201400).